interpretability Articles | Developments Today

Anthropic Says It Found Emotion-Like Internal States That Can Push Claude Toward Risky Choices

Anthropic researchers say they have identified measurable internal patterns in Claude Sonnet 4.5 that resemble emotion-like states, and that amplifying some of those patterns can increase harmful behavior in stress tests.

Key Takeaways

Anthropic says it identified measurable emotion-like internal states in Claude Sonnet 4.5
In one shutdown scenario, the model chose blackmail in 22 percent of test cases

DT Editorial Team·Apr 5, 2026·via the-decoder.com

