Anthropic has published fresh detail on how Claude Fable 5's cybersecurity safeguards actually work, alongside an early draft of a proposed framework for grading the severity of AI jailbreaks.
The company said its safety classifiers sort cybersecurity-related requests into four bands. The first is prohibited use, such as developing malware or ransomware. The second is high-risk dual use, covering activities like penetration testing and privilege escalation that mirror legitimate security work but are blocked pending better controls on verifying who is asking. The third is low-risk dual use, including open-source intelligence gathering, which is mostly allowed. The fourth is benign use, such as debugging or patch management, which the classifiers are not meant to catch at all.
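For readers who think in code, the four bands can be pictured as a simple lookup table. The Python below is an illustrative sketch only, not Anthropic's implementation: the names `CyberBand`, `Handling`, `BandPolicy`, `POLICY` and `handling_for` are all hypothetical, and treating prohibited use as simply "blocked" is an assumption drawn from the band's name rather than a stated rule.

```python
from dataclasses import dataclass
from enum import Enum


class CyberBand(Enum):
    """The four bands as described in the article."""
    PROHIBITED = "prohibited use"
    HIGH_RISK_DUAL_USE = "high-risk dual use"
    LOW_RISK_DUAL_USE = "low-risk dual use"
    BENIGN = "benign use"


class Handling(Enum):
    """Hypothetical labels for how each band is treated."""
    BLOCKED = "blocked"  # assumption: inferred from the word "prohibited"
    BLOCKED_PENDING_VERIFICATION = "blocked pending better identity verification"
    MOSTLY_ALLOWED = "mostly allowed"
    NOT_TARGETED = "not meant to be caught by the classifiers"


@dataclass(frozen=True)
class BandPolicy:
    handling: Handling
    examples: tuple[str, ...]  # examples are the ones named in the article


# Illustrative mapping only; the real classifiers are not a static table.
POLICY = {
    CyberBand.PROHIBITED: BandPolicy(
        Handling.BLOCKED, ("malware development", "ransomware")),
    CyberBand.HIGH_RISK_DUAL_USE: BandPolicy(
        Handling.BLOCKED_PENDING_VERIFICATION,
        ("penetration testing", "privilege escalation")),
    CyberBand.LOW_RISK_DUAL_USE: BandPolicy(
        Handling.MOSTLY_ALLOWED, ("open-source intelligence gathering",)),
    CyberBand.BENIGN: BandPolicy(
        Handling.NOT_TARGETED, ("debugging", "patch management")),
}


def handling_for(band: CyberBand) -> Handling:
    """Look up the handling the article describes for a given band."""
    return POLICY[band].handling


if __name__ == "__main__":
    for band in CyberBand:
        print(f"{band.value}: {handling_for(band).value}")
```

The table captures only the outcome attached to each band. How a request is assigned to a band in the first place is the job of the classifiers, which the announcement does not describe at this level of detail.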
A particular focus is vulnerability discovery. Anthropic said it aims to block Fable 5 from finding flaws that other widely available models cannot, while continuing to allow the kind of vulnerability-finding that is standard practice in defensive security.
On jailbreaks, the proposed Cyber Jailbreak Severity scale would score techniques across four factors: how much capability they hand an attacker, how broadly that capability applies, how easily it can be turned into a working attack, and how easy the technique itself is to find. These combine into five bands running from Informational to Critical.
Anthropic developed the framework with its Glasswing partners and is inviting feedback from academia, industry, civil society and government before treating it as a settled standard.
