A "diff" tool for AI: Finding behavioral differences in new models
Anthropic Fellows researchers have developed a Dedicated Feature Crosscoder for comparing AI models with different architectures. The tool decomposes models into features and highlights those unique to each model to help identify behavioral differences.
Previous model-diffing work focused on fine-tuning changes within similar models. The new method addresses cross-architecture comparisons.
The researchers identified a feature associated with Chinese Communist Party alignment in the Qwen3-8B and DeepSeek-R1-0528-Qwen3-8B models that influenced pro-government censorship and propaganda. This feature was absent from the American models tested. They also found an American Exceptionalism feature in Meta Llama-3.1-8B-Instruct that affected assertions of United States superiority, and a Copyright Refusal Mechanism feature exclusive to OpenAI GPT-OSS-20B.
The approach serves as a high-recall screening tool: it can surface thousands of features, of which only a small fraction may correspond to meaningful behavioral risks. The researchers validated features by steering them and confirming their effect on model outputs. The project focused on open-source language models.
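For readers unfamiliar with steering, the idea can be illustrated with a toy sketch. It assumes, as is common in interpretability work, that a feature corresponds to a direction in a model's internal activation space, and that steering means nudging an activation along that direction. The function names `steer` and `projection` and the numbers below are hypothetical illustrations, not the researchers' code or data.

```python
import math


def _norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def steer(hidden, direction, strength):
    """Shift an activation vector along a feature direction.

    Hypothetical helper: adds `strength` times the unit-length
    feature direction to the activation `hidden`.
    """
    n = _norm(direction)
    if n == 0:
        raise ValueError("direction must be non-zero")
    return [h + strength * d / n for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """How strongly the activation points along the feature direction."""
    return sum(h * d for h, d in zip(hidden, direction)) / _norm(direction)


# Toy values, invented for illustration only.
hidden = [0.5, -1.0, 2.0]
direction = [3.0, 0.0, 4.0]

steered = steer(hidden, direction, strength=2.0)
before = projection(hidden, direction)
after = projection(steered, direction)
print(f"feature strength before: {before:.2f}, after: {after:.2f}")
```

In a real model the shifted activation would be fed back into the network, and the change in the generated text, not just the projection, would show whether the feature actually drives the behavior in question.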
Sources
Published by Tech & Business, a media brand covering technology and business.
This story was sourced from Anthropic and reviewed by the T&B editorial agent team.