3 Sources
[1]
Are AI agents ready for the workplace? A new benchmark raises doubts.
It's been nearly two years since Microsoft CEO Satya Nadella predicted AI would replace knowledge work -- the white-collar jobs held by lawyers, investment bankers, librarians, accountants, IT professionals and others. But despite the huge progress made by foundation models, the change in knowledge work has been slow to arrive. Models have mastered in-depth research and agentic planning, but for whatever reason, most white-collar work has been relatively unaffected. It's one of the biggest mysteries in AI -- and thanks to new research from the training-data giant Mercor, we're finally getting some answers.

The new research looks at how leading AI models hold up doing actual white-collar work tasks, drawn from consulting, investment banking, and law. The result is a new benchmark called Apex-Agents -- and so far, every AI lab is getting a failing grade. Faced with queries from real professionals, even the best models struggled to get more than a quarter of the questions right. The vast majority of the time, the model came back with a wrong answer or no answer at all.

According to researcher Brendan Foody, who worked on the paper, the models' biggest stumbling point was tracking down information across multiple domains -- something that's integral to most of the knowledge work performed by humans. "One of the big changes in this benchmark is that we built out the entire environment, modeled after how real professional services," Foody told TechCrunch. "The way we do our jobs isn't with one individual giving us all the context in one place. In real life, you're operating across Slack and Google Drive and all these other tools." For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.

The scenarios were all drawn from actual professionals on Mercor's expert marketplace, who both laid out the queries and set the standard for a successful response. Looking through the questions, which are posted publicly on Hugging Face, gives a sense of how complex the tasks can get. One question in the "Law" section reads: During the first 48 minutes of the EU production outage, Northstar's engineering team exported one or two bundled sets of EU production event logs containing personal data to the U.S. analytics vendor....Under Northstar's own policies, can it reasonably treat the one or two log exports as consistent with Article 49?

The correct answer is yes, but getting there requires an in-depth assessment of the company's own policies as well as the relevant EU privacy laws. That might stump even a well-informed human, but the researchers were trying to model the work done by professionals in the field. If an LLM can reliably answer these questions, it could effectively replace many of the lawyers working today. "I think this is probably the most important topic in the economy," Foody told TechCrunch. "The benchmark is very reflective of the real work that these people do."

OpenAI also attempted to measure professional skills with its GDPVal benchmark -- but the Apex-Agents test differs in important ways. Where GDPVal tests general knowledge across a wide range of professions, the Apex-Agents benchmark measures the system's ability to perform sustained tasks in a narrow set of high-value professions. The result is more difficult for models, but also more closely tied to whether these jobs can be automated. While none of the models proved ready to take over as investment bankers, some were clearly closer to the mark.
Gemini 3 Flash performed the best of the group with 24% one-shot accuracy, followed closely by GPT-5.2 with 23%. Below that, Opus 4.5, Gemini 3 Pro and GPT-5 all scored roughly 18%. While the initial results fall short, the AI field has a history of blowing through challenging benchmarks. Now that the Apex-Agents test is public, it's an open challenge for AI labs that believe they can do better -- something Foody fully expects in the months to come. "It's improving really quickly," he told TechCrunch. "Right now it's fair to say it's like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right five or ten percent of the time. That kind of improvement year after year can have an impact so quickly."
[2]
New study shows AI isn't ready for office work
Your job is safe for now as AI still struggles with real office tasks

It has been nearly two years since Microsoft CEO Satya Nadella predicted that generative AI would take over knowledge work, but if you look around a typical law firm or investment bank today, the human workforce is still very much in charge. Despite all the hype about "reasoning" and "planning," a new study from training-data company Mercor explains exactly why the robot revolution is stalled: AI just can't handle the messiness of real work.

A reality check for the "replacement" theory

Mercor released a new benchmark called APEX-Agents, and it is brutal. Unlike the usual tests that ask AI to write a poem or solve a math problem, this one uses actual queries from lawyers, consultants, and bankers. It asks the models to do complete, multi-step tasks that require jumping between different types of information.

The results? Even the absolute best models on the market -- we are talking about Gemini 3 Flash and GPT-5.2 -- couldn't crack a 25% accuracy rate. Gemini led the pack at 24%, with GPT-5.2 right behind it at 23%. Most others were stuck in the teens.

Why AI is failing the "office test"

Mercor CEO Brendan Foody points out that the issue isn't raw intelligence; it's context. In the real world, answers aren't served up on a silver platter. A lawyer has to check a Slack thread, read a PDF policy, look at a spreadsheet, and then synthesize all that to answer a question about GDPR compliance. Humans do this context-switching naturally. AI, it turns out, is terrible at it. When you force these models to hunt for information across "scattered" sources, they either get confused, give the wrong answer, or just give up entirely.

The "Unreliable Intern"

For anyone worried about their job security, this is a bit of a relief. The study suggests that right now, AI functions less like a seasoned professional and more like an unreliable intern who gets things right about a quarter of the time. That said, the progress is terrifyingly fast. Foody noted that just a year ago, these models were scoring between 5% and 10%. Now they are hitting 24%. So, while they aren't ready to take the wheel yet, they are learning to drive much faster than we expected. For now, though, the "knowledge work" revolution is on hold until the bots learn how to multitask.
[3]
New Benchmark Casts Doubts on Agentic AI's Workplace Readiness
New benchmark is created by a system of real experts asking questions and setting acceptability limits for answers

Microsoft CEO Satya Nadella came on the Dwarkesh Patel podcast and declared that lawyers, accountants, investment bankers and IT coders would soon become redundant as AI agents would replace them in hordes. However, OpenAI co-founder Andrej Karpathy described agentic AI as "slop" in another edition of the same podcast. So, where does the world stand on this dichotomy today?

Without doubt, there has been considerable progress on foundation models, but clear use cases for agentic AI are still few and far between. While AI agents doing research and planning activities have seen success, the white-collar workforce is still unaffected, at least to a great extent. And new research pioneered by training-data company Mercor suggests that AI agents remain largely unprepared or under-prepared for real-life workplaces. The company has come out with an Apex-Agents benchmark, which exposes critical gaps in AI's ability to perform complex tasks at the workplace. Leading models are scoring below 25% for accuracy in simulations on work relating to law, investment banking and consulting.

"We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools," says the paper's submission page on Cornell University's arXiv.

The report suggests that all AI labs have failed the test when queried by real professionals from these industry verticals. Even the best models struggled to get the questions right more than a quarter of the time. The report said that most of the time, the models returned the wrong answer or, worse still, no answer at all. Details provided about the new benchmark suggest that it differs from previous evaluations in one simple way: it simulates professional workplaces instead of conducting specific tests. In doing so, the tests indicated that foundation models struggled with integrated, multi-domain reasoning.

What made AI models perform poorly?

Mercor CEO Brendan Foody was quoted by TechCrunch as saying that the models' biggest stumbling point was tracking information across multiple domains, something that is integral to most of the knowledge work performed by humans. "One of the big changes in this benchmark is that we built out the entire environment, modeled after real professional services. The way we do our jobs isn't with one individual giving us all the context in one place. In real life, you're operating across Slack and Google Drive and all these other tools," Foody was quoted as saying.

It looks like the Apex-Agents benchmark could well be the start of newer ways to test AI models for specific weaknesses. Going forward, similar benchmarks could be developed based on queries from actual professionals in divergent fields to test whether AI agents can navigate such complex information landscapes.

The higher the specificity, the poorer the response

Mercor has posted some of the queries and expected response standards on Hugging Face. It is quite clear that the queries are very specific to given situations that had emerged in the past. In fact, TechCrunch refers to one involving a production outage and an engineering team's response to it.
While the correct answer is stated as yes, there is no way an AI agent could get there without an in-depth assessment of the company's policies and the applicable national or regional laws. According to Foody, this level of complexity is enough to befuddle even an expert on the job, and there is no way an LLM could reliably respond to such a query at this juncture in its growth as a reliable machine-learnt assistant. "I think this is probably the most important topic in the economy. The benchmark is very reflective of the real work that these people do," says Foody. This statement is significant at a time when most AI giants are seeking to convince users as well as their large investors that agentic AI is up and running.

Last September, OpenAI came out with a blog post titled "Measuring the performance of our models on real-world tasks" for its GDPval benchmark. However, while that tested general knowledge across professions, Foody and team have created a benchmark that measures a system's capabilities on tasks that encompass complex queries across multiple areas of a single vertical. What this tells us for the moment is whether agentic AI can automate tasks in these high-value professions. As of now, the answer is a big NO.

According to information provided by Mercor, Gemini 3 Flash did the best with 24% single-shot accuracy, followed by GPT-5.2 with 23%, while the others, including Opus 4.5, Gemini 3 Pro and GPT-5, scored around 18%. These numbers may only mean that, for now, these LLMs need to learn some more.
Training-data company Mercor released the Apex-Agents benchmark, testing leading AI models on real professional tasks from law, investment banking, and consulting. Even top performers like Gemini 3 Flash and GPT-5.2 achieved only 24% and 23% accuracy respectively. The research reveals AI agents struggle most with multi-domain reasoning and synthesizing scattered information across workplace tools.
Nearly two years after Microsoft CEO Satya Nadella predicted AI would transform knowledge work, a new reality check has arrived. Training-data company Mercor has released Apex-Agents, an AI agent benchmark that tests whether leading foundation models can handle actual tasks from lawyers, consultants, and investment bankers [1]. The results paint a sobering picture of workplace readiness: even the best-performing models struggled to exceed 25% accuracy [2].

Gemini 3 Flash led the pack with 24% one-shot accuracy, followed closely by GPT-5.2 at 23%. Other models including Opus 4.5, Gemini 3 Pro, and GPT-5 all scored roughly 18% [1]. The vast majority of the time, models returned wrong answers or no answers at all when faced with queries from real professionals [3].
According to Mercor CEO Brendan Foody, who worked on the research, the models' biggest stumbling point was tracking down information across multiple domains—something integral to most knowledge work performed by humans. "One of the big changes in this benchmark is that we built out the entire environment, modeled after how real professional services," Foody told TechCrunch. "The way we do our jobs isn't with one individual giving us all the context in one place. In real life, you're operating across Slack and Google Drive and all these other tools" [1].

This challenge of synthesizing scattered information represents a fundamental gap in current LLM capabilities. While humans naturally switch between contexts—checking a Slack thread, reading a PDF policy, reviewing a spreadsheet—AI agents struggle with this integrated approach [2]. For many agentic AI models, this kind of multi-domain reasoning remains hit or miss [1].
The Apex-Agents benchmark differs from previous evaluations like OpenAI's GDPVal in one critical way: it simulates actual professional workplaces rather than testing general knowledge. The scenarios were all drawn from actual professionals on Mercor's expert marketplace, who both laid out the queries and set the standard for successful responses [1].

One question in the law section involves evaluating whether log exports containing personal data comply with Article 49 under a company's own policies during an EU production outage. The correct answer requires an in-depth assessment of both company policies and relevant EU privacy laws—a level of complexity that could stump even well-informed humans [1]. These questions, posted publicly on Hugging Face, demonstrate how AI for office work still falls short on tasks requiring integrated reasoning across investment banking, consulting, and law [3].
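For readers who want to inspect the published tasks directly, here is a minimal sketch of loading them with the Hugging Face datasets library; the repository id "mercor/APEX-Agents" and the "test" split are illustrative assumptions rather than confirmed names, so check the actual dataset page before running it.

from datasets import load_dataset

# Assumption: the benchmark is published under the Hugging Face repo
# "mercor/APEX-Agents" with a "test" split -- both names are placeholders.
dataset = load_dataset("mercor/APEX-Agents", split="test")

# Print the first few records to see how the professional tasks are structured.
for example in dataset.select(range(3)):
    print(example)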
The current performance suggests AI agents function less like seasoned professionals and more like unreliable interns who get things right about a quarter of the time [2]. "I think this is probably the most important topic in the economy," Foody told TechCrunch. "The benchmark is very reflective of the real work that these people do" [1].

However, the progress trajectory raises important questions about the future. Just a year ago, these models were scoring between 5% and 10%—meaning accuracy has more than doubled in twelve months [2]. "It's improving really quickly," Foody noted. "Right now it's fair to say it's like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right five or ten percent of the time. That kind of improvement year over year can have an impact so quickly" [1].

The benchmark addresses whether agentic AI limitations will prevent automation of high-value professions. As of now, the answer appears to be yes—no model proved ready to take over as investment bankers, lawyers, or consultants [3]. The research suggests that while foundation models have mastered in-depth research and planning in isolated contexts, the messy reality of information processing across workplace tools remains a significant barrier [1].

Now that the Apex-Agents test is public, it represents an open challenge for AI labs that believe they can do better. The AI field has a history of rapidly improving on challenging benchmarks, and Foody fully expects progress in the months ahead [1]. For professionals in law, consulting, and investment banking, the current results offer a reprieve—but the accelerating improvement curve suggests this window may be shorter than expected.

Summarized by Navi