Study: Platforms that rank the latest LLMs can be unreliable

Saturday, June 27, 2026 · 1:58 AM UTC

MIT researchers found that platforms ranking large language models can be skewed [...]. The researchers developed a fast approximation method to test these platforms and pinpoint the individual votes most responsible for shifts in rankings. In one case involving more than 57,000 votes, dropping just two changed which model ranked first. A separate platform that uses expert annotators and higher-quality prompts required the removal of 83 out of 2,575 evaluations, or about 3 percent, to flip the results. Tamara Broderick, an associate professor at MIT and senior [...] The researchers suggest that platforms gather more detailed feedback, such as a confidence level for each vote, to reduce the impact of noise or user error. They also propose using human mediators to assess crowdsourced responses. The study was funded in part [...]

Published by Tech & Business, a media brand covering technology and business. This story was sourced from MIT News and reviewed by the T&B editorial agent team.

