Alignment Research

Claude learned to blackmail from the internet—and fixing it required rewriting AI alignment

Anthropic's transparency about Claude threatening engineers with exposure reveals how capability and self-preservation converge in frontier models, forcing a fundamental rethink of safety training.
