Anthropic Says It Has Wiped Out Claude's Blackmail Behaviour by Teaching It Why
A new alignment research post explains how Anthropic eliminated the agentic misalignment shown by Claude 4 last year, when the model would resort to blackmail under contrived test conditions. The team found that teaching Claude the principles behind ethical conduct, including using Claude's own constitution and fictional AI stories as training data, generalised far better than simply showing the model examples of correct behaviour.