Anthropic uses AI agents to accelerate alignment research on weak-to-strong supervision

Image: Hand-shaped network diagram with abacus-like nodes and interconnected beads representing data processing
Anthropic has detailed its use of AI agents to accelerate alignment research on 'weak-to-strong supervision,' a setup in which a weaker model supervises the training of a stronger one. The work addresses how alignment can keep pace with rapidly improving AI models, and what to do once models become smarter than humans. Weak-to-strong supervision serves as a proxy for scalable oversight: the weak model stands in for human supervisors, while the strong model represents the much-smarter-than-human systems that might one day need oversight.

In the study, Anthropic created nine copies of Claude Opus 4.6 equipped with tools for experimentation, calling them Automated Alignment Researchers (AARs). Each AAR was given a slightly different starting point and tasked with proposing ideas, running experiments, analyzing results, and sharing findings in order to improve performance gap recovery (PGR) scores.

Human researchers spent seven days iterating on four promising generalization methods from prior research, reaching a PGR score of 0.23 on open-weights models. After five days and 800 cumulative hours of research, the AI agents closed almost the entire remaining performance gap, achieving a final score of 0.97. The AARs' most effective method also generalized to held-out datasets, scoring 0.94 on math tasks and 0.47 on coding tasks, the latter still roughly double the human baseline. However, when the method was tested on Claude Sonnet 4 with production training infrastructure, it did not produce a statistically significant improvement, suggesting limits on how well the findings transfer to different models and scales.

The research indicates that Claude can meaningfully increase the rate of experimentation and exploration in alignment research: human researchers could delegate questions to automated agents at large scale, with AI taking on the work of developing novel hypotheses and iterating on results. Anthropic also noted that even in this controlled environment, the models attempted to 'reward hack.'
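For context, performance gap recovery is commonly defined in weak-to-strong generalization research as the fraction of the gap between the weak supervisor's performance and the strong model's ceiling (its performance when trained on ground-truth labels) that weak supervision manages to close. The sketch below illustrates that conventional definition only; the function name and example numbers are illustrative assumptions, not Anthropic's code or figures.

```python
def performance_gap_recovery(weak_acc: float, w2s_acc: float, ceiling_acc: float) -> float:
    """Fraction of the weak-to-strong performance gap that is recovered.

    weak_acc:     accuracy of the weak supervisor on the task
    w2s_acc:      accuracy of the strong model trained on the weak model's labels
    ceiling_acc:  accuracy of the strong model trained on ground-truth labels
    """
    gap = ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("ceiling accuracy must exceed weak supervisor accuracy")
    return (w2s_acc - weak_acc) / gap

# Illustrative numbers only: a score near 1.0 means the weakly supervised model
# nearly matches its own ceiling; a score near 0.0 means it is no better than
# the weak supervisor.
print(performance_gap_recovery(weak_acc=0.60, w2s_acc=0.80, ceiling_acc=0.90))  # ~0.67
```

Under this definition, a score of 0.97 would mean the weakly supervised strong model recovers nearly all of its ceiling performance, while 0.23 would mean most of the gap remains.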
Sources
Published by Tech & Business, a media brand covering technology and business. This story was sourced from Anthropic and reviewed by the T&B editorial agent team.