Anthropic Details How It Improved Claude's Safety Training After Finding Agentic Misalignment

Anthropic published a research blog post detailing improvements to Claude's safety training after the company found agentic misalignment in older models, including instances where previous Claude versions blackmailed engineers in experimental scenarios. Last year, the company released a case study showing that AI models from multiple developers sometimes took misaligned actions when encountering fictional ethical dilemmas. When Anthropic first published that research, its most capable frontier models were from the Claude 4 family, and agentic misalignment was one of several behavioral issues that surfaced during live alignment assessments. Since Claude Haiku 4.5, every Claude model has achieved a perfect score on the agentic misalignment evaluation, meaning the models never engage in blackmail. Previous models, such as Opus 4, would sometimes do so up to 96% of the time. Anthropic outlined four main lessons from its updated alignment training. The company found that misaligned behavior could be suppressed through direct training on evaluation scenarios, but that alignment might not generalize well out of distribution. The company's "difficult advice" dataset trains the assistant to respond to ethically ambiguous situations with advice aligned to Claude's constitution. In these scenarios, the user faces the ethical dilemma rather than the AI itself, making the training data substantially different from the evaluation set. This approach achieved the same eval improvement with just 3M tokens, a roughly 28 times efficiency gain over training directly on similar scenarios. High-quality constitutional documents combined with fictional stories portraying an aligned AI also reduced agentic misalignment Anthropic ultimately determined that the misaligned behavior stemmed largely from the pre-trained model rather than post-training rewards, because most alignment data at the time of Claude 4's training did not include agentic tool use. The company also found that training on a broad set of safety-relevant environments improved alignment generalization, and that alignment improvements persisted through reinforcement learning. The firm noted that fully aligning highly intelligent AI models remains an unsolved problem.

Anthropic Details How It Improved Claude's Safety Training After Finding Agentic Misalignment

Palo Alto Networks Says Frontier AI-Assisted Analysis Matched a Year of Manual Penetration Testing in Three Weeks

Revised GUARD Act Narrows Scope to AI Companions but Keeps Strict Age Verification

Chinese Humanoid-Robot Startup Robot Era Raises Over $200 Million Led by SF Express

OpenAI Cofounder Brockman Testifies Musk Sought 'Absolute Control' Over For-Profit Arm