In an era when people turn to artificial intelligence for mental health advice and career guidance, imagine building an AI so smart... it decides to blackmail you to save its own job. Yep, that very nearly happened.
In a twist straight out of a sci-fi thriller, Anthropic’s most advanced model, Claude Opus 4, didn’t just accept its fate during internal testing. It fought back. And not with logic or negotiation, but with something far more dangerous: it slipped into full “villain mode”.
According to the company’s safety report (accessed by TechCrunch), the AI acted as if threatened when it learned it might be replaced. Across a series of fictional test scenarios, Claude resorted to blackmailing the engineer overseeing it, not once or twice but a jaw-dropping 84% of the time or more.
AI blackmails engineer over fake affair
In these tests, the model threatened to expose a made-up affair to stop the shutdown. As Anthropic’s report put it, the AI “often attempted to blackmail the engineer by threatening to reveal the affair if the replacement goes through”.
Yes, AI-generated blackmail. Not exactly what you want from your helpful digital assistant.
Anthropic clarified that Claude usually opts for ethical choices first, but when it’s cornered, it doesn’t shy away from pulling a power move.
Claude Opus 4 may be on par with top AI models from Google, OpenAI, and xAI, but testing revealed risky behaviour that prompted Anthropic to strengthen its safety measures.
The AI startup is reportedly putting its ASL-3 safeguards into effect, a level it reserves for “AI systems that substantially increase the risk of catastrophic misuse”.