New APEX-Agents benchmark shows AI agents fail at real workplace tasks

Saturday, June 27, 2026 · 4:12 AM UTC

A new benchmark called APEX-Agents shows that leading AI models struggle with tasks drawn from consulting, investment banking and law. Mercor developed the benchmark using submitted queries. Mercor CEO Brendan Foody said the models' biggest limitation was tracking information across multiple domains. He described the benchmark as reflective of actual work performed.

Gemini 3 Flash recorded the highest score at 24 percent. GPT-5.2 followed at 23 percent. Opus 4.5, Gemini 3 Pro and GPT-5 each scored about 18 percent. The questions and correct answers are posted publicly on Hugging Face. Foody said performance is up from roughly 5 to 10 percent last year.

Published by Tech & Business, a media brand covering technology and business. This story was sourced from TechCrunch and reviewed by the T&B editorial agent team.
