# New APEX-Agents benchmark shows AI agents fail at real workplace tasks

_Saturday, June 27, 2026 at 12:12 AM EDT · AI · Latest · Tier 2 — Notable_

![New APEX-Agents benchmark shows AI agents fail at real workplace tasks — Primary](https://techcrunch.com/wp-content/uploads/2026/01/GettyImages-2247697590.jpg?resize=1200,686)

A new benchmark called APEX-Agents shows that leading AI models struggle with tasks drawn from consulting, investment banking and law. 

Mercor developed the benchmark using queries submitted by professionals on its expert marketplace. The models worked inside full professional environments modeled on tools such as Slack and Google Drive. Even the highest-scoring systems answered correctly on the first attempt less than a quarter of the time. In most cases they returned a wrong answer or no answer at all.

Mercor CEO Brendan Foody said the models' biggest limitation was tracking information across multiple domains. He described the benchmark as reflective of actual work performed by professionals in those fields. The test differs from OpenAI's GDPval benchmark by focusing on sustained tasks in a narrow set of high-value professions rather than on general knowledge.

Gemini 3 Flash recorded the highest score at 24 percent. GPT-5.2 followed at 23 percent. Opus 4.5, Gemini 3 Pro and GPT-5 each scored about 18 percent. The questions and correct answers are posted publicly on Hugging Face. Foody said performance has improved over the past year, up from roughly 5 to 10 percent correct last year.
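
For readers unfamiliar with one-shot scoring, the percentages above can be read as the share of tasks a model gets right on its first attempt. The sketch below is an illustration only: the function name `one_shot_accuracy` and the toy grades are hypothetical, and it assumes that a wrong answer and a missing answer both count as failures. The article does not describe Mercor's actual grading procedure.

```python
from typing import Optional, Sequence


def one_shot_accuracy(first_attempts: Sequence[Optional[bool]]) -> float:
    """Return the fraction of tasks whose first attempt was graded correct.

    Each entry is True (correct), False (wrong answer), or None (no answer).
    Assumption: wrong answers and missing answers both count as failures.
    """
    if not first_attempts:
        raise ValueError("need at least one graded task")
    correct = sum(1 for outcome in first_attempts if outcome is True)
    return correct / len(first_attempts)


# Hypothetical toy grades for eight tasks; these are NOT benchmark data.
toy_grades = [True, False, None, False, True, False, None, False]
toy_score = one_shot_accuracy(toy_grades)
print(f"one-shot accuracy: {toy_score:.0%}")
```

Under this reading, a score of 24 percent means roughly one task in four was answered correctly on the first try.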

## Sources

- [TechCrunch](https://techcrunch.com/2026/01/22/are-ai-agents-ready-for-the-workplace-a-new-benchmark-raises-doubts/)

---
Canonical: https://techandbusiness.org/newswire/dwShKCC5FBZlnWiQ1TCclJ
Retrieved: 2026-06-27T08:40:03.239Z
Publisher: Tech & Business (techandbusiness.org)