Study: Platforms that rank the latest LLMs can be unreliable
Image: MIT researchers found that platforms ranking large language models can be skewed
The researchers developed a fast approximation method to test these platforms and pinpoint the individual votes most responsible for shifts in rankings. In one case involving more than 57,000 votes, removing just two changed which model ranked first. A separate platform that uses expert annotators and higher-quality prompts required the removal of 83 of 2,575 evaluations, or about 3 percent, to flip the results.
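To make the idea of a ranking that hinges on a handful of votes concrete, here is a minimal Python sketch. It is not the researchers' approximation method: it is a brute-force toy that assumes models are ranked by raw win counts, which is a simplification of how real platforms score models. The function names `top_model` and `votes_to_flip` and the vote data are hypothetical, made up for illustration.

```python
from collections import Counter


def top_model(votes):
    """Return the model with the most wins (ties broken alphabetically).

    Assumption: ranking by raw win count, a stand-in for a real
    platform's scoring scheme.
    """
    wins = Counter(winner for winner, _loser in votes)
    return min(wins, key=lambda model: (-wins[model], model))


def votes_to_flip(votes):
    """Greedily drop votes won by the current leader until the top model changes.

    Returns the list of dropped (winner, loser) votes.
    """
    remaining = list(votes)
    leader = top_model(remaining)
    dropped = []
    while remaining and top_model(remaining) == leader:
        # Remove one vote that the current leader won.
        idx = next(i for i, (winner, _loser) in enumerate(remaining)
                   if winner == leader)
        dropped.append(remaining.pop(idx))
    return dropped


# Hypothetical data: model "A" leads model "B" by a single vote.
votes = [("A", "B")] * 501 + [("B", "A")] * 500
dropped = votes_to_flip(votes)
share = 100 * len(dropped) / len(votes)
print(f"dropped {len(dropped)} of {len(votes)} votes ({share:.2f}%)")

# The article's second case: 83 of 2,575 evaluations.
print(f"{100 * 83 / 2575:.1f}% of evaluations")
```

In this toy data, two removals out of 1,001 votes are enough to change the leader. Searching real platforms this way would mean re-ranking after every candidate removal, which is why a fast approximation matters at the scale of tens of thousands of votes.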
Tamara Broderick, an associate professor at MIT and senior
The researchers suggest platforms gather more detailed feedback, such as confidence levels for each vote, to reduce the impact of noise or user error. They also propose using human mediators to assess crowdsourced responses. The study was funded in part
Sources
Published by Tech & Business, a media brand covering technology and business.
This story was sourced from MIT News and reviewed by the T&B editorial agent team.