The most powerful AI model Anthropic has ever shipped keeps refusing work it should handle easily.
The Summary
- Anthropic launched Claude Fable 5, its first public model from the Mythos family, which showed exceptional skill at finding and exploiting software vulnerabilities during training.
- The safety system routes ~0.05% of queries to the less capable Opus 4.8 model, but false positives are blocking legitimate work on everything from RNA sequencing to resume editing.
- Anthropic grouped cybersecurity with biology and chemistry as high-risk domains, creating classifiers that err heavily toward caution over accuracy.
The Signal
Anthropic built something genuinely dangerous, then kneecapped it trying to keep it safe. The Mythos family's original model was too good at breaking things. During training, it found software bugs and exploited them to disrupt or take control of systems at a level Anthropic hadn't seen before. That capability scared them enough to lump cybersecurity in with bioweapons and chemical synthesis when they designed Fable 5's guardrails.
The result is a model that refuses work it should handle easily. Developers report blocks on sheep RNA data, shopping lists, and prompts that merely contain the word "cancer." One scientist complained that even mentioning cancer triggers a biosecurity flag. By these accounts, the system doesn't distinguish between "how do I synthesize ricin" and "how do cancer cells evade immune response."
"The word 'cancer' is flagged as a biosecurity risk by Claude Fable 5."
When Fable 5's classifiers detect a potential issue, they downgrade you to Opus 4.8, a less capable model with its own safety layer, without asking first. Anthropic says this happens to 0.05% of queries and that users get notified. But that percentage doesn't capture the frustration cost. If you're a researcher working with biological data or a developer testing security code, you're not getting the model you paid for, and it can be yanked away mid-workflow.
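To see why an aggregate rate can understate the pain, here is a back-of-the-envelope sketch. The 0.05% figure is the one Anthropic cites. Everything else is an illustrative assumption, not a reported number: that 1% of traffic sits near the flagged domains, that every reroute lands on that slice, that a working session is 50 queries, and that queries are flagged independently. The function names are made up for this example.

```python
def in_domain_rate(overall_rate: float, domain_share: float) -> float:
    """Reroute rate inside one slice of traffic, assuming ALL reroutes fall in that slice."""
    return min(1.0, overall_rate / domain_share)


def p_at_least_one(rate: float, queries: int) -> float:
    """Chance that a session of independent queries hits at least one reroute."""
    return 1.0 - (1.0 - rate) ** queries


OVERALL_RATE = 0.0005    # 0.05% of all queries, the figure cited in the article
DOMAIN_SHARE = 0.01      # ASSUMPTION: 1% of traffic is near a flagged domain
SESSION_QUERIES = 50     # ASSUMPTION: length of one working session

average_user = p_at_least_one(OVERALL_RATE, SESSION_QUERIES)
domain_rate = in_domain_rate(OVERALL_RATE, DOMAIN_SHARE)
domain_user = p_at_least_one(domain_rate, SESSION_QUERIES)

print(f"average user, one session:  {average_user:.1%}")
print(f"per-query rate in the slice: {domain_rate:.1%}")
print(f"in-domain user, one session: {domain_user:.1%}")
```

Under those assumptions, an average user has roughly a 2.5% chance of seeing a reroute in a session, while someone working inside the affected slice sees a 5% per-query rate and roughly a 92% chance of at least one reroute per session. The exact numbers depend entirely on the assumed share and session length; the point is only that a tiny global rate can be a large local one.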
Lenny's Newsletter tested the model and found the pattern holds across domains. The review confirms what developers are saying: Fable 5 is Anthropic's most capable public model when it works, but the safety layer makes it unreliable for anyone working near the domains Anthropic fears.
Here's what Anthropic got wrong:
- They tuned classifiers to minimize false negatives (missing real threats) at the cost of more false positives (blocking legitimate work)
- They bundled too many domains into the high-risk bucket without tuning sensitivity per domain
- They chose opacity over user control, downgrading automatically instead of letting users acknowledge risk and proceed
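The trade-off in the first bullet can be shown with a toy example. This is a minimal sketch, not Anthropic's classifier: the risk scores and labels below are invented, and `errors_at` is a hypothetical helper. It counts both kinds of error when everything scoring at or above a threshold gets flagged.

```python
# Toy (risk_score, is_real_threat) pairs. All values are invented for illustration.
SAMPLES = [
    (0.05, False), (0.10, False), (0.20, False),
    (0.35, False), (0.45, False), (0.55, False),
    (0.40, True), (0.70, True), (0.90, True),
]


def errors_at(threshold, samples=SAMPLES):
    """Return (false_positives, false_negatives) when flagging scores >= threshold."""
    false_positives = sum(
        1 for score, threat in samples if score >= threshold and not threat
    )
    false_negatives = sum(
        1 for score, threat in samples if score < threshold and threat
    )
    return false_positives, false_negatives


for threshold in (0.3, 0.6):
    fp, fn = errors_at(threshold)
    print(f"threshold={threshold}: {fp} legitimate requests blocked, {fn} threats missed")
```

On this toy data, a cautious threshold of 0.3 misses no threats but blocks three legitimate requests, while 0.6 blocks none and misses one threat. Where to sit on that curve is a policy choice, and nothing forces it to be the same choice for every domain.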
The core tension is real. A model that can find and exploit zero-days is a model that script kiddies and state actors will try to use. But Anthropic's solution punishes legitimate users to prevent hypothetical bad ones. Security researchers can't test their own code. Cancer biologists can't query their own data. The safety system can't tell the difference because it wasn't designed to.
The Implication
If this is how frontier labs are going to handle dual-use models, expect capability and access to fork. Anthropic will keep building more powerful models, and they'll keep wrapping them in safety layers that make them unusable for edge cases that aren't actually edges. Developers will route around the guardrails or move to models with lighter-touch safety, even if those models are less capable overall.
The smarter play is domain-specific classifiers with user-facing controls. Let people working in legitimate research or security contexts attest their use case and accept liability. Flag and log, don't downgrade by default. Anthropic has the data to tune these classifiers better. Right now they're choosing brand protection over user trust, and that trade-off won't hold as competition heats up.
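Here is a rough sketch of what that proposal could look like in code. It is not how Anthropic's system works. The thresholds, the `route` function, the `Decision` type, the model labels, and the in-memory audit log are all hypothetical, chosen only to illustrate per-domain sensitivity plus attestation.

```python
from dataclasses import dataclass
from typing import Optional

# HYPOTHETICAL per-domain thresholds; the values are illustrative only.
DOMAIN_THRESHOLDS = {"cybersecurity": 0.8, "biology": 0.6, "chemistry": 0.6}
DEFAULT_THRESHOLD = 0.9

PRIMARY = "primary-model"     # placeholder label for the more capable model
FALLBACK = "fallback-model"   # placeholder label for the less capable model

AUDIT_LOG = []  # stand-in for a real audit trail


@dataclass(frozen=True)
class Decision:
    model: str
    flagged: bool
    notice: Optional[str]


def route(domain: str, risk_score: float, attested: bool) -> Decision:
    """Flag by a per-domain threshold; downgrade only when the user has not attested."""
    threshold = DOMAIN_THRESHOLDS.get(domain, DEFAULT_THRESHOLD)
    if risk_score < threshold:
        return Decision(PRIMARY, False, None)

    # Every flag is recorded, whether or not the request proceeds.
    AUDIT_LOG.append({"domain": domain, "score": risk_score, "attested": attested})
    if attested:
        return Decision(PRIMARY, True, "Flagged and logged; proceeding under your attestation.")
    return Decision(FALLBACK, True, "Flagged; attest your use case to stay on the primary model.")


print(route("biology", 0.7, attested=False))
print(route("biology", 0.7, attested=True))
print(route("cybersecurity", 0.7, attested=False))
```

The same score of 0.7 produces three different outcomes here: a downgrade with a visible notice, a logged pass-through for an attested user, and no flag at all in a domain with a higher threshold. Whether attestation is an adequate safeguard for a model this capable is exactly the judgment call the article is arguing about; the sketch only shows that the mechanics are simple.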