A "diff" tool for AI: Finding behavioral differences in new models

Anthropic Fellows researchers have developed a Dedicated Feature Crosscoder for comparing AI models with different architectures. The tool decomposes models into features and highlights those unique to each model, which helps identify behavioral differences. Previous model-diffing work focused on the changes that fine-tuning introduces within similar models; the new method addresses cross-architecture comparisons.

The researchers identified a feature associated with Chinese Communist Party alignment in the Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B models that influenced pro-government censorship and propaganda. This feature was absent from the American models tested. They also found an American Exceptionalism feature in Meta's Llama-3.1-8B-Instruct that affected assertions of United States superiority, and a Copyright Refusal Mechanism feature exclusive to OpenAI's GPT-OSS-20B.

The approach serves as a high-recall screening tool: it can surface thousands of features, of which only a small fraction may correspond to meaningful behavioral risks. To validate candidate features, the researchers steered them and confirmed their effect on model outputs. The project focused on open-source language models.
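To make the idea of "features unique to each model" concrete, here is a minimal sketch of one common way to flag model-exclusive features in a crosscoder: comparing the norm of each feature's decoder vector for the two models. This illustrates the general idea only and is not the researchers' Dedicated Feature Crosscoder. The function names, the `margin` threshold, and the toy decoder vectors are all assumptions made for the example.

```python
import math

def l2_norm(vec):
    """Euclidean length of a decoder vector."""
    return math.sqrt(sum(x * x for x in vec))

def relative_norm(dec_a, dec_b):
    """Share of a feature's total decoder norm that belongs to model B (0 to 1).

    Values near 0 mean the feature writes almost only into model A,
    values near 1 almost only into model B, values near 0.5 into both.
    """
    na, nb = l2_norm(dec_a), l2_norm(dec_b)
    total = na + nb
    return 0.5 if total == 0 else nb / total

def classify_features(decoders_a, decoders_b, margin=0.1):
    """Label each feature as shared or exclusive (margin is a hypothetical threshold)."""
    labels = []
    for dec_a, dec_b in zip(decoders_a, decoders_b):
        r = relative_norm(dec_a, dec_b)
        if r <= margin:
            labels.append("exclusive_to_a")
        elif r >= 1 - margin:
            labels.append("exclusive_to_b")
        else:
            labels.append("shared")
    return labels

# Toy data: three features. The two models have different hidden sizes
# (2 and 3), standing in for different architectures.
decoders_a = [[1.0, 0.0], [0.0, 0.0], [0.6, 0.8]]
decoders_b = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 0.0, 0.0]]

labels = classify_features(decoders_a, decoders_b)
print(labels)
```

In a screening workflow like the one described, the features labeled as exclusive would be only candidates. Each would still need validation, for example by steering, before it is treated as a real behavioral difference.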
