AI agents score below 25% on workplace readiness test, exposing critical gaps in office work

Training-data company Mercor released the Apex-Agents benchmark, testing leading AI models on real professional tasks drawn from law, investment banking, and consulting. Even top performers like Gemini 3 Flash and GPT-5.2 achieved only 24% and 23% accuracy, respectively. The research reveals that AI agents struggle most with multi-domain reasoning and with synthesizing scattered information across workplace tools.

AI Agents Stumble on Real Professional Tasks

Nearly two years after Microsoft CEO Satya Nadella predicted AI would transform knowledge work, a new reality check has arrived. Training-data company Mercor has released Apex-Agents, an AI agent benchmark that tests whether leading foundation models can handle actual tasks from lawyers, consultants, and investment bankers [1]. The results paint a sobering picture of workplace readiness: even the best-performing models struggled to exceed 25% accuracy [2].

Gemini 3 Flash led the pack with 24% one-shot accuracy, followed closely by GPT-5.2 at 23%. Other models, including Opus 4.5, Gemini 3 Pro, and GPT-5, all scored roughly 18% [1]. The vast majority of the time, models returned wrong answers or no answers at all when faced with queries from real professionals [3].
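
For readers unfamiliar with the metric, "one-shot accuracy" simply means the fraction of queries a model answers correctly on its first attempt, with no retries. A minimal sketch of the arithmetic, using invented pass/fail grades rather than Mercor's actual data:

```python
# Toy illustration of one-shot accuracy: the share of queries a model
# gets right on its first attempt. These grades are invented for the
# example and do not come from the Apex-Agents results.
first_attempt_grades = [True, False, False, True, False, False, False, False]

accuracy = sum(first_attempt_grades) / len(first_attempt_grades)
print(f"One-shot accuracy: {accuracy:.0%}")  # -> One-shot accuracy: 25%
```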

Source: CXOToday

Multi-Domain Reasoning Exposes Critical Weakness

According to Mercor CEO Brendan Foody, who worked on the research, the models' biggest stumbling point was tracking down information across multiple domains, something integral to most knowledge work performed by humans. "One of the big changes in this benchmark is that we built out the entire environment, modeled after how real professional services," Foody told TechCrunch. "The way we do our jobs isn't with one individual giving us all the context in one place. In real life, you're operating across Slack and Google Drive and all these other tools" [1].

This challenge of synthesizing scattered information represents a fundamental gap in current LLM capabilities. While humans naturally switch between contexts (checking a Slack thread, reading a PDF policy, reviewing a spreadsheet), AI agents struggle with this integrated approach [2]. For many agentic AI models, this kind of multi-domain reasoning remains hit or miss [1].

Source: TechCrunch

Complex White-Collar Tasks Reveal the Gap

The Apex-Agents benchmark differs from previous evaluations like OpenAI's GDPval in one critical way: it simulates actual professional workplaces rather than testing general knowledge. The scenarios were all drawn from working professionals on Mercor's expert marketplace, who both laid out the queries and set the standard for successful responses [1].

One question in the law section involves evaluating whether log exports containing personal data comply with Article 49 of the GDPR, under the company's own policies, during an EU production outage. The correct answer requires an in-depth assessment of both company policies and relevant EU privacy laws, a level of complexity that could stump even well-informed humans [1]. These questions, posted publicly on Hugging Face, demonstrate how AI for office work still falls short on tasks requiring integrated reasoning across investment banking, consulting, and law [3].
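
Since the benchmark questions are public, anyone can inspect them. Below is a minimal sketch of browsing them with the Hugging Face datasets library; the repository ID and field names are assumptions for illustration, since the article does not specify them:

```python
# Minimal sketch: browsing the public Apex-Agents questions on Hugging Face.
# ASSUMPTIONS: the dataset ID "mercor/apex-agents" and the "domain"/"query"
# field names are illustrative guesses, not confirmed by the article.
from collections import Counter

from datasets import load_dataset  # pip install datasets

dataset = load_dataset("mercor/apex-agents", split="test")

# Count questions per professional domain (law, consulting, banking).
print(Counter(example["domain"] for example in dataset))

# Inspect one law scenario to see the multi-step reasoning involved.
law_question = next(ex for ex in dataset if ex["domain"] == "law")
print(law_question["query"])
```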

Low-Accuracy AI Models Function Like Unreliable Interns

The current performance suggests AI agents function less like seasoned professionals and more like unreliable interns who get things right about a quarter of the time [2]. "I think this is probably the most important topic in the economy," Foody told TechCrunch. "The benchmark is very reflective of the real work that these people do" [1].

However, the progress trajectory raises important questions about the future. Just a year ago, these models were scoring between 5% and 10%, meaning accuracy has more than doubled in twelve months [2]. "It's improving really quickly," Foody noted. "Right now it's fair to say it's like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right five or ten percent of the time. That kind of improvement year over year can have an impact so quickly" [1].

What This Means for Knowledge Work Automation

The benchmark addresses whether agentic AI limitations will prevent the automation of high-value professions. As of now, the answer appears to be yes: no model proved ready to take over as an investment banker, lawyer, or consultant [3]. The research suggests that while foundation models have mastered in-depth research and planning in isolated contexts, the messy reality of processing information across workplace tools remains a significant barrier [1].

Now that the Apex-Agents test is public, it stands as an open challenge to AI labs that believe they can do better. The AI field has a history of rapidly improving on challenging benchmarks, and Foody fully expects progress in the months ahead [1]. For professionals in law, consulting, and investment banking, the current results offer a reprieve, but the accelerating improvement curve suggests that window may be shorter than expected.
