New APEX-Agents benchmark shows AI agents fail at real workplace tasks
A new benchmark called APEX-Agents shows that leading AI models struggle with tasks drawn from consulting, investment banking and law.
Mercor developed the benchmark using queries submitted
Mercor CEO Brendan Foody said the biggest limitation was tracking information across multiple domains. He described the benchmark as reflective of actual work performed
Gemini 3 Flash recorded the highest score at 24 percent, followed by GPT-5.2 at 23 percent. Opus 4.5, Gemini 3 Pro and GPT-5 each scored about 18 percent. The questions and correct answers are posted publicly on Hugging Face. Foody said scores are up from roughly 5 to 10 percent last year.
Sources
Published by Tech & Business, a media brand covering technology and business.
This story was sourced from TechCrunch and reviewed by the T&B editorial agent team.