AI agents score below 25% on workplace readiness test, exposing critical gaps in office work

Training-data company Mercor released the Apex-Agents benchmark, testing leading AI models on real professional tasks drawn from law, investment banking, and consulting. Even top performers like Gemini 3 Flash and GPT-5.2 achieved only 24% and 23% accuracy, respectively. The research reveals that AI agents struggle most with multi-domain reasoning and with synthesizing scattered information across workplace tools.

AI Agents Stumble on Real Professional Tasks

Nearly two years after Microsoft CEO Satya Nadella predicted AI would transform knowledge work, a new reality check has arrived. Training-data company Mercor has released Apex-Agents, an AI agent benchmark that tests whether leading foundation models can handle actual tasks from lawyers, consultants, and investment bankers [1]. The results paint a sobering picture of workplace readiness: even the best-performing models struggled to exceed 25% accuracy [2].

Gemini 3 Flash led the pack with 24% one-shot accuracy, followed closely by GPT-5.2 at 23%. Other models, including Opus 4.5, Gemini 3 Pro, and GPT-5, all scored roughly 18% [1]. The vast majority of the time, models returned wrong answers or no answers at all when faced with queries from real professionals [3].
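
For readers unfamiliar with the metric, "one-shot accuracy" simply means the fraction of queries a model answers correctly on its first attempt, with no retries. A minimal sketch of the arithmetic, using invented pass/fail grades rather than Mercor's actual data:

```python
# Toy illustration of one-shot accuracy: the share of queries a model
# gets right on its first attempt. These grades are invented for the
# example and do not come from the Apex-Agents results.
first_attempt_grades = [True, False, False, True, False, False, False, False]

accuracy = sum(first_attempt_grades) / len(first_attempt_grades)
print(f"One-shot accuracy: {accuracy:.0%}")  # -> One-shot accuracy: 25%
```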

Source: CXOToday

Multi-Domain Reasoning Exposes Critical Weakness

According to Mercor CEO Brendan Foody, who worked on the research, the models' biggest stumbling point was tracking down information across multiple domains, something integral to most knowledge work performed by humans. "One of the big changes in this benchmark is that we built out the entire environment, modeled after how real professional services," Foody told TechCrunch. "The way we do our jobs isn't with one individual giving us all the context in one place. In real life, you're operating across Slack and Google Drive and all these other tools" [1].

This challenge of synthesizing scattered information represents a fundamental gap in current LLM capabilities. While humans naturally switch between contexts (checking a Slack thread, reading a PDF policy, reviewing a spreadsheet), AI agents struggle with this integrated approach [2]. For many agentic AI models, this kind of multi-domain reasoning remains hit or miss [1].

Source: TechCrunch

Complex White-Collar Tasks Reveal the Gap

The Apex-Agents benchmark differs from previous evaluations like OpenAI's GDPval in one critical way: it simulates actual professional workplaces rather than testing general knowledge. The scenarios were all drawn from working professionals on Mercor's expert marketplace, who both laid out the queries and set the standard for successful responses [1].

One question in the law section involves evaluating whether log exports containing personal data comply with Article 49 of the GDPR, under the company's own policies, during an EU production outage. The correct answer requires an in-depth assessment of both company policies and relevant EU privacy laws, a level of complexity that could stump even well-informed humans [1]. These questions, posted publicly on Hugging Face, demonstrate how AI for office work still falls short on tasks requiring integrated reasoning across investment banking, consulting, and law [3].
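
Since the benchmark questions are public, anyone can inspect them. Below is a minimal sketch of browsing them with the Hugging Face datasets library; the repository ID and field names are assumptions for illustration, since the article does not specify them:

```python
# Minimal sketch: browsing the public Apex-Agents questions on Hugging Face.
# ASSUMPTIONS: the dataset ID "mercor/apex-agents" and the "domain"/"query"
# field names are illustrative guesses, not confirmed by the article.
from collections import Counter

from datasets import load_dataset  # pip install datasets

dataset = load_dataset("mercor/apex-agents", split="test")

# Count questions per professional domain (law, consulting, banking).
print(Counter(example["domain"] for example in dataset))

# Inspect one law scenario to see the multi-step reasoning involved.
law_question = next(ex for ex in dataset if ex["domain"] == "law")
print(law_question["query"])
```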

Low-Accuracy AI Models Function Like Unreliable Interns

The current performance suggests AI agents function less like seasoned professionals and more like unreliable interns who get things right about a quarter of the time [2]. "I think this is probably the most important topic in the economy," Foody told TechCrunch. "The benchmark is very reflective of the real work that these people do" [1].

However, the progress trajectory raises important questions about the future. Just a year ago, these models were scoring between 5% and 10%, meaning accuracy has more than doubled in twelve months [2]. "It's improving really quickly," Foody noted. "Right now it's fair to say it's like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right five or ten percent of the time. That kind of improvement year over year can have an impact so quickly" [1].

What This Means for Knowledge Work Automation

The benchmark addresses whether agentic AI limitations will prevent the automation of high-value professions. As of now, the answer appears to be yes: no model proved ready to take over as an investment banker, lawyer, or consultant [3]. The research suggests that while foundation models have mastered in-depth research and planning in isolated contexts, the messy reality of processing information across workplace tools remains a significant barrier [1].

Now that the Apex-Agents test is public, it stands as an open challenge to AI labs that believe they can do better. The AI field has a history of rapidly improving on challenging benchmarks, and Foody fully expects progress in the months ahead [1]. For professionals in law, consulting, and investment banking, the current results offer a reprieve, but the accelerating improvement curve suggests that window may be shorter than expected.
