2 Sources
[1]
Microsoft researchers find AI models and agents can't handle long-running tasks
Companies exploring automated workflows would be well advised to keep their AI agents on a short leash. Microsoft researchers have found that even the priciest frontier models introduce errors in long workflows, the very thing for which AI software has been pitched.

Anthropic, for example, says, "Claude Cowork handles tasks autonomously. Give it a goal and Claude works on your computer, local files, and applications to return a finished deliverable." Redmond promotes similar usage, touting Microsoft 365 Copilot's ability to "Tackle complex, multistep research across your work data and the web."

The Windows maker's scientists aren't so sure about that. Philippe Laban, Tobias Schnabel, and Jennifer Neville from Microsoft Research set out to study what happens when large language models (LLMs) are asked to complete multistep tasks. They recently published their findings in a preprint paper with a spoiler title: "LLMs Corrupt Your Documents When You Delegate."

To test how LLMs handle long-running knowledge work tasks, the researchers devised a benchmark called DELEGATE-52. It simulates multistep workflows across 52 professional domains, such as writing code, crystallography, and music notation. It is a more taxing test than sorting a spreadsheet, a task that should be table stakes for any aspiring workflow agent.

In the accounting domain, for example, the challenge involves a seed document that represents the accounting ledger of Hack Club, a nonprofit organization. The model is asked to split the seed document into separate category-based files and then to merge these chronologically back into a single file.

"Our findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing on average 25 percent of document content over 20 delegated interactions, and an average degradation across all models of 50 percent," the authors report.
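The accounting task described above can be sketched in a few lines of Python. The ledger below is an invented stand-in (the actual Hack Club seed document isn't reproduced in the article); the point is that a faithful split-then-merge round trip is lossless, which is exactly the guarantee the tested models fail to provide.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature ledger standing in for the benchmark's seed document.
LEDGER = """date,category,description,amount
2025-01-03,donations,GitHub sponsorship,500.00
2025-01-05,expenses,Server hosting,-120.00
2025-02-10,donations,Individual gift,250.00
2025-02-11,expenses,Stickers,-80.00
"""

def split_by_category(ledger_csv):
    """First half of the task: split one ledger into per-category row lists."""
    rows = csv.DictReader(io.StringIO(ledger_csv))
    by_cat = defaultdict(list)
    for row in rows:
        by_cat[row["category"]].append(row)
    return by_cat

def merge_chronologically(by_cat):
    """Second half: merge the category files back into one ledger, by date."""
    return sorted((r for rows in by_cat.values() for r in rows),
                  key=lambda r: r["date"])

parts = split_by_category(LEDGER)
merged = merge_chronologically(parts)
# A lossless round trip recovers every original row, in date order.
assert [r["description"] for r in merged] == [
    "GitHub sponsorship", "Server hosting", "Individual gift", "Stickers"]
```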
The authors found that LLMs did better on programming tasks and worse on natural language tasks. To be considered "ready" for a given work domain, the researchers set the bar at a benchmark score of 98 percent or higher after 20 interactions. Only one domain qualified: Python programming. In every other domain, the authors found LLMs fell short of "ready."

"A per-domain breakdown of end-of-simulation scores reveals that models are not ready for delegated workflows in the vast majority of domains, with models severely corrupting documents (at least -20 percent degradation) in 80 percent of our simulated conditions," the authors state. The study found that "catastrophic corruption," meaning a benchmark score of 80 percent or less, occurred in more than 80 percent of model/domain combinations. The best performing model, Google Gemini 3.1 Pro, was ready for only 11 of 52 domains.

In weaker models, degradation took the form of content deletion; in frontier models, it took the form of content corruption. And when errors occurred, they tended to happen all at once, resulting in the loss of 10 to 30 points in a single round-trip interaction, rather than accumulating over the entire test run. "The stronger models (Gemini 3.1 Pro, Claude 4.6, GPT 5.4) aren't avoiding small errors better, they delay critical failures to later rounds and experience them in fewer interactions," the researchers observe in their paper.

The Microsoft authors went on to test how agents - LLMs given access to file reading, writing, and code execution through a basic harness - handle the DELEGATE-52 benchmark. Tools in this instance didn't help. "The four tested models perform worse when operated agentically with tools than without, incurring an average additional degradation of 6 percent by the end of simulation," the authors observe, in reference to GPT-5.4, 5.2, 5.1, and 4.1.
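The two thresholds quoted above (98 percent or higher counts as "ready," 80 percent or lower as "catastrophic corruption") amount to a simple bucketing rule. This sketch is illustrative; the function name and the middle "degraded" bucket are assumptions, not the paper's terminology.

```python
def classify_domain(final_score):
    """Bucket an end-of-simulation benchmark score (0-100) using the
    thresholds reported in the article: >= 98 is "ready", <= 80 is
    "catastrophic corruption", anything in between is merely degraded."""
    if final_score >= 98.0:
        return "ready"
    if final_score <= 80.0:
        return "catastrophic corruption"
    return "degraded"

# Overall scores quoted later in this piece, after 20 interactions:
assert classify_domain(80.9) == "degraded"                  # Gemini 3.1 Pro
assert classify_domain(73.1) == "catastrophic corruption"   # Claude 4.6 Opus
assert classify_domain(10.0) == "catastrophic corruption"   # GPT 5 Nano
```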
Given that task delegation is the whole point of an AI agent - if you wanted to do it yourself, you wouldn't have tried to automate the task - this casts a bit of a shadow on the AI hype train. An intern who corrupted a quarter of a document over a long workflow would be shown the door. Yet companies are showing AI the money: according to Deloitte, organizations are spending an average of 36 percent of their digital budgets on AI automation.

That might make sense if arming LLMs with the tools to function as full-blown agents meant less document degradation. But that's not the case. The authors found "using a basic agentic harness does not improve the performance of LLMs" on the DELEGATE-52 test, and that LLM performance after two interactions doesn't reflect how models perform after 20, which they argue underscores the need for long-horizon evaluation.

"Current LLMs are ready for delegated workflows in some domains such as Python coding, but not in other less common domains," the authors conclude. "In general, users still need to closely monitor LLM systems as they operate and complete tasks on their behalf."

Yet they also note that LLMs have been getting better, pointing to the performance of OpenAI's GPT model family, which has seen its benchmark performance increase over 16 months from 14.7 percent to 71.5 percent. ®
[2]
'Current LLMs introduce substantial errors when editing work documents': Microsoft scientists find most AI models struggle with long-running tasks -- so maybe don't trust them completely just yet
* Microsoft researchers determine that current LLMs aren't good at long-running tasks
* More interactions and less structure significantly reduce benchmark performance
* "Python is the only domain where most models are ready"

New research from a trio of Microsoft workers has uncovered a fundamental issue that could be blocking effective agentic AI, namely that most AI models can't actually reliably handle long-running workflows. To quantify their findings, the researchers introduced a new DELEGATE-52 benchmark to provide metrics across 52 sectors, including coding, accounting, science and more. Ultimately, the paper concluded current LLMs "introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."

AI isn't that good at long-running tasks, yet

The study covers some of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even they "corrupt an average of 25% of document content by the end of long workflows," with lesser models even more likely to get things wrong.

The DELEGATE-52 benchmark uses real documents of around 15K tokens in length and introduces 5-10 complex editing tasks per domain in a "round-trip relay simulation" that asks the AI to perform a transformation and then reverse it. This allows the researchers to measure how effectively each model reconstructs the documents back to their original forms.

Highly structured and programmatic areas were where the models performed best, with the Microsoft researchers concluding that "Python is the only domain where most models are ready." Conversely, natural language workflows, creative areas and semi-structured documents saw models struggle. The paper also finds that the longer the token length, the more likely an AI model is to struggle. Where frontier models differed was not in their ability to eliminate errors, but in their ability to delay them.
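The round-trip relay idea can be illustrated with a toy retention score: compare the reconstructed document against the original and report how much survives. The `difflib` similarity ratio used here is only a stand-in; the article does not specify the benchmark's actual scoring function.

```python
import difflib

def retention_score(original, reconstructed):
    """Illustrative stand-in for the benchmark's scoring: similarity (as a
    percentage) between the seed document and the document that comes back
    after a transform-and-reverse round trip. 100.0 means nothing was lost."""
    matcher = difflib.SequenceMatcher(None, original, reconstructed)
    return 100.0 * matcher.ratio()

original = "Q1 revenue rose 12%.\nQ2 revenue fell 3%.\nQ3 revenue rose 8%.\n"
perfect = original
corrupted = "Q1 revenue rose 12%.\nQ3 revenue rose 8%.\n"  # silently dropped Q2

assert retention_score(original, perfect) == 100.0
assert retention_score(original, corrupted) < 100.0  # corruption is detected
```

Sparse-but-severe failures of the kind the paper describes show up as a score that stays near 100 for many rounds and then drops sharply in a single interaction.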
Some of the other models tested by Microsoft's researchers included a number of GPT-5 and GPT-4 generations, Claude options, Gemini models and one each from Mistral, xAI and Moonshot, totalling 19 different models from six families. Gemini 3.1 Pro took first place with a DELEGATE-52 benchmark score of 80.9% after 20 interactions; Claude 4.6 Opus (73.1%) and GPT-5.4 (71.5%) round out the top three, and GPT 5 Nano (10.0%) falls into last place.

In short, the paper concludes that today's AI models are not reliable enough to be trusted for long-running, autonomous workflows, highlighting key areas where model developers must focus in the future and offering up yet another benchmark to determine model capability.

Via The Register
Microsoft researchers tested 19 AI models across 52 professional domains and found even the most advanced LLMs introduce substantial errors in multi-step workflows. Frontier models like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost an average of 25% of document content over 20 interactions, while average degradation across all models reached 50%. Only Python programming met the readiness threshold.
Microsoft researchers have uncovered a critical weakness in AI models that challenges the promise of autonomous automation. In a preprint paper titled "LLMs Corrupt Your Documents When You Delegate," scientists Philippe Laban, Tobias Schnabel, and Jennifer Neville reveal that even the most advanced LLMs introduce substantial errors when handling long, multi-step tasks [1]. The findings directly contradict marketing claims from companies like Anthropic and Microsoft itself, which tout their AI agents as capable of handling complex, autonomous work.
The research team developed DELEGATE-52, a benchmark that simulates multistep workflows across 52 professional domains including coding, accounting, crystallography, and music notation [2]. Testing documents averaged around 15K tokens in length, with each domain requiring 5-10 complex editing tasks through a "round-trip relay simulation" that asks models to perform transformations and then reverse them [2].

The results paint a sobering picture for delegated workflows. Frontier models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost an average of 25% of document content over 20 delegated interactions, while the average degradation across all 19 tested models reached 50% [1]. The Microsoft researchers set a readiness threshold at 98% accuracy or higher after 20 interactions. Only one domain qualified: Python programming [1].
Catastrophic corruption, defined as a benchmark score of 80% or less, occurred in more than 80% of model/domain combinations [1]. Even the best performing model, Gemini 3.1 Pro, achieved readiness in only 11 of 52 domains, with a DELEGATE-52 benchmark score of 80.9% after 20 interactions [2]. Claude 4.6 Opus scored 73.1% and GPT 5.4 reached 71.5%, while GPT 5 Nano fell to last place at just 10.0% [2].
The research delivers another blow to the agentic AI narrative. When LLMs were equipped with file reading, writing, and code execution tools through a basic harness, performance actually declined. The four tested GPT models (versions 5.4, 5.2, 5.1, and 4.1) performed worse when operated as autonomous AI agents with tools than without, incurring an average additional degradation of 6% by the end of simulation [1].

The pattern of failure differs between model tiers. Weaker models primarily deleted content, while frontier models corrupted it [1]. Errors didn't accumulate gradually but struck suddenly, with document content loss of 10 to 30 points occurring in a single round-trip interaction [1]. "The stronger models aren't avoiding small errors better, they delay critical failures to later rounds and experience them in fewer interactions," the researchers noted [1].
These findings arrive at a critical moment. According to Deloitte, organizations currently spend an average of 36% of their digital budgets on AI automation [1]. The research suggests much of this investment may be premature for long-running tasks beyond highly structured domains like Python programming.

The study found LLMs performed better on programming tasks and worse on natural language tasks [1]. Natural language workflows, creative areas, and semi-structured documents proved particularly challenging [2]. The researchers also discovered that longer token lengths correlated with increased likelihood of model failure [2].

The Microsoft researchers concluded that "current LLMs are ready for delegated workflows in some domains such as Python coding, but not in other less common domains" and emphasized that "users still need to closely monitor LLM systems as they operate and complete tasks on their behalf" [1]. The paper highlights that LLM performance after two interactions doesn't reflect how models perform after 20, underscoring the need for long-horizon evaluation [1]. Companies exploring automated workflows should keep their AI agents on a short leash and watch closely for signs of document corruption in extended professional workflows.

Summarized by Navi