2 Sources
[1]
Microsoft researchers find AI models and agents can't handle long-running tasks
Companies exploring automated workflows would be well advised to keep their AI agents on a short leash. Microsoft researchers have found that even the priciest frontier models introduce errors in long workflows, the very thing for which AI software has been pitched.

Anthropic, for example, says, "Claude Cowork handles tasks autonomously. Give it a goal and Claude works on your computer, local files, and applications to return a finished deliverable." Redmond promotes similar usage, touting Microsoft 365 Copilot's ability to "Tackle complex, multistep research across your work data and the web."

The Windows maker's scientists aren't so sure about that. Philippe Laban, Tobias Schnabel, and Jennifer Neville from Microsoft Research set out to study what happens when large language models (LLMs) are asked to complete multistep tasks. They recently published their findings in a preprint paper with a spoiler title: "LLMs Corrupt Your Documents When You Delegate."

To test how LLMs handle long-running knowledge work tasks, the researchers devised a benchmark called DELEGATE-52. It simulates multistep workflows across 52 professional domains, such as writing code, crystallography, and music notation. It is a more taxing test than sorting a spreadsheet, a task that should be table stakes for any aspiring workflow agent.

In the accounting domain, for example, the challenge involves a seed document that represents the accounting ledger of Hack Club, a nonprofit organization. The model is asked to split the seed document into separate category-based files and then to merge these chronologically back into a single file.

"Our findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing on average 25 percent of document content over 20 delegated interactions, and an average degradation across all models of 50 percent," the authors report.
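The accounting task described above can be sketched in a few lines of Python. The ledger below is an invented stand-in (the actual Hack Club seed document isn't reproduced in the article); the point is that a faithful split-then-merge round trip is lossless, which is exactly the guarantee the tested models fail to provide.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature ledger standing in for the benchmark's seed document.
LEDGER = """date,category,description,amount
2025-01-03,donations,GitHub sponsorship,500.00
2025-01-05,expenses,Server hosting,-120.00
2025-02-10,donations,Individual gift,250.00
2025-02-11,expenses,Stickers,-80.00
"""

def split_by_category(ledger_csv):
    """First half of the task: split one ledger into per-category row lists."""
    rows = csv.DictReader(io.StringIO(ledger_csv))
    by_cat = defaultdict(list)
    for row in rows:
        by_cat[row["category"]].append(row)
    return by_cat

def merge_chronologically(by_cat):
    """Second half: merge the category files back into one ledger, by date."""
    return sorted((r for rows in by_cat.values() for r in rows),
                  key=lambda r: r["date"])

parts = split_by_category(LEDGER)
merged = merge_chronologically(parts)
# A lossless round trip recovers every original row, in date order.
assert [r["description"] for r in merged] == [
    "GitHub sponsorship", "Server hosting", "Individual gift", "Stickers"]
```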
The authors found that LLMs did better on programming tasks and worse on natural language tasks. To be considered "ready" for a given work domain, the researchers set the bar at a benchmark score of 98 percent or higher after 20 interactions. Only one domain qualified: Python programming. In every other domain, the authors found LLMs fell short of "ready."

"A per-domain breakdown of end-of-simulation scores reveals that models are not ready for delegated workflows in the vast majority of domains, with models severely corrupting documents (at least -20 percent degradation) in 80 percent of our simulated conditions," the authors state. The study found that "catastrophic corruption," meaning a benchmark score of 80 percent or less, occurred in more than 80 percent of model/domain combinations. The best performing model, Google Gemini 3.1 Pro, was ready for only 11 of 52 domains.

In weaker models, degradation took the form of content deletion; in frontier models, it took the form of content corruption. And when errors occurred, they tended to happen all at once, resulting in the loss of 10 to 30 points in a single round-trip interaction, rather than accumulating over the entire test run. "The stronger models (Gemini 3.1 Pro, Claude 4.6, GPT 5.4) aren't avoiding small errors better, they delay critical failures to later rounds and experience them in fewer interactions," the researchers observe in their paper.

The Microsoft authors went on to test how agents - LLMs given access to file reading, writing, and code execution through a basic harness - handle the DELEGATE-52 benchmark. Tools in this instance didn't help. "The four tested models perform worse when operated agentically with tools than without, incurring an average additional degradation of 6 percent by the end of simulation," the authors observe, in reference to GPT-5.4, 5.2, 5.1, and 4.1.
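The two thresholds quoted above (98 percent or higher counts as "ready," 80 percent or lower as "catastrophic corruption") amount to a simple bucketing rule. This sketch is illustrative; the function name and the middle "degraded" bucket are assumptions, not the paper's terminology.

```python
def classify_domain(final_score):
    """Bucket an end-of-simulation benchmark score (0-100) using the
    thresholds reported in the article: >= 98 is "ready", <= 80 is
    "catastrophic corruption", anything in between is merely degraded."""
    if final_score >= 98.0:
        return "ready"
    if final_score <= 80.0:
        return "catastrophic corruption"
    return "degraded"

# Overall scores quoted later in this piece, after 20 interactions:
assert classify_domain(80.9) == "degraded"                  # Gemini 3.1 Pro
assert classify_domain(73.1) == "catastrophic corruption"   # Claude 4.6 Opus
assert classify_domain(10.0) == "catastrophic corruption"   # GPT 5 Nano
```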
Given that task delegation is the whole point of an AI agent - if you wanted to do it yourself, you wouldn't have tried to automate the task - this casts a bit of a shadow on the AI hype train. An intern who corrupted a quarter of a document over a long workflow would be shown the door. Yet companies are showing AI the money: according to Deloitte, organizations are spending an average of 36 percent of their digital budgets on AI automation.

That might make sense if arming LLMs with the tools to function as full-blown agents meant less document degradation. But that's not the case. The authors found "using a basic agentic harness does not improve the performance of LLMs" on the DELEGATE-52 test, and that LLM performance after two interactions doesn't reflect how models perform after 20, which they argue underscores the need for long-horizon evaluation.

"Current LLMs are ready for delegated workflows in some domains such as Python coding, but not in other less common domains," the authors conclude. "In general, users still need to closely monitor LLM systems as they operate and complete tasks on their behalf."

Yet they also note that LLMs have been getting better, pointing to the performance of OpenAI's GPT model family, which has seen its benchmark performance increase over 16 months from 14.7 percent to 71.5 percent. ®
[2]
'Current LLMs introduce substantial errors when editing work documents': Microsoft scientists find most AI models struggle with long-running tasks -- so maybe don't trust them completely just yet
* Microsoft researchers determine that current LLMs aren't good at long-running tasks
* More interactions and less structure significantly reduce benchmark performance
* "Python is the only domain where most models are ready"

New research from a trio of Microsoft workers has uncovered a fundamental issue that could be blocking effective agentic AI, namely that most AI models can't actually reliably handle long-running workflows. To quantify their findings, the researchers introduced a new DELEGATE-52 benchmark to provide metrics across 52 sectors, including coding, accounting, science and more. Ultimately, the paper concluded current LLMs "introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."

AI isn't that good at long-running tasks, yet

The study covers some of the latest AI models, including Gemini 3.1 Pro, Claude 4.6 Opus and GPT-5.4. It found that even they "corrupt an average of 25% of document content by the end of long workflows," with lesser models even more likely to get things wrong.

The DELEGATE-52 benchmark uses real documents of around 15K tokens in length and introduces 5-10 complex editing tasks per domain in a "round-trip relay simulation" that asks the AI to perform a transformation and then reverse it. This allows the researchers to measure how effectively each model reconstructs the documents back to their original forms.

Highly structured and programmatic areas were where the models performed best, with the Microsoft researchers concluding that "Python is the only domain where most models are ready." Conversely, natural language workflows, creative areas and semi-structured documents saw models struggle. The paper also finds that the longer the token length, the more likely an AI model is to struggle. Where frontier models differed was not in their ability to eliminate errors, but in their ability to delay them.
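The round-trip relay idea can be illustrated with a toy retention score: compare the reconstructed document against the original and report how much survives. The `difflib` similarity ratio used here is only a stand-in; the article does not specify the benchmark's actual scoring function.

```python
import difflib

def retention_score(original, reconstructed):
    """Illustrative stand-in for the benchmark's scoring: similarity (as a
    percentage) between the seed document and the document that comes back
    after a transform-and-reverse round trip. 100.0 means nothing was lost."""
    matcher = difflib.SequenceMatcher(None, original, reconstructed)
    return 100.0 * matcher.ratio()

original = "Q1 revenue rose 12%.\nQ2 revenue fell 3%.\nQ3 revenue rose 8%.\n"
perfect = original
corrupted = "Q1 revenue rose 12%.\nQ3 revenue rose 8%.\n"  # silently dropped Q2

assert retention_score(original, perfect) == 100.0
assert retention_score(original, corrupted) < 100.0  # corruption is detected
```

Sparse-but-severe failures of the kind the paper describes show up as a score that stays near 100 for many rounds and then drops sharply in a single interaction.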
Some of the other models tested by Microsoft's researchers included a number of GPT-5 and GPT-4 generations, Claude options, Gemini models and one each from Mistral, xAI and Moonshot, totalling 19 different models from six families. Gemini 3.1 Pro took first place with a DELEGATE-52 benchmark score of 80.9% after 20 interactions; Claude 4.6 Opus (73.1%) and GPT-5.4 (71.5%) round out the top three, and GPT 5 Nano (10.0%) falls into last place.

In short, the paper concludes that today's AI models are not reliable enough to be trusted for long-running, autonomous workflows, highlighting key areas where model developers must focus in the future and offering up yet another benchmark to determine model capability.

Via The Register
Microsoft researchers tested 19 AI models across 52 professional domains and found even the most advanced LLMs introduce substantial errors in multi-step workflows. Frontier models like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost an average of 25% of document content over 20 interactions, while average degradation across all models reached 50%. Only Python programming met the readiness threshold.
Microsoft researchers have uncovered a critical weakness in AI models that challenges the promise of autonomous automation. In a preprint paper titled "LLMs Corrupt Your Documents When You Delegate," scientists Philippe Laban, Tobias Schnabel, and Jennifer Neville reveal that even the most advanced LLMs introduce substantial errors when handling long, multi-step tasks [1]. The findings directly contradict marketing claims from companies like Anthropic and Microsoft itself, which tout their AI agents as capable of handling complex, autonomous work.
The research team developed DELEGATE-52, a benchmark that simulates multistep workflows across 52 professional domains including coding, accounting, crystallography, and music notation [2]. Testing documents averaged around 15K tokens in length, with each domain requiring 5-10 complex editing tasks through a "round-trip relay simulation" that asks models to perform transformations and then reverse them [2].

The results paint a sobering picture for delegated workflows. Frontier models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost an average of 25% of document content over 20 delegated interactions, while the average degradation across all 19 tested models reached 50% [1]. The Microsoft researchers set a readiness threshold at 98% accuracy or higher after 20 interactions. Only one domain qualified: Python programming [1].
Catastrophic corruption, defined as a benchmark score of 80% or less, occurred in more than 80% of model/domain combinations [1]. Even the best performing model, Gemini 3.1 Pro, achieved readiness in only 11 of 52 domains, with a DELEGATE-52 benchmark score of 80.9% after 20 interactions [2]. Claude 4.6 Opus scored 73.1% and GPT 5.4 reached 71.5%, while GPT 5 Nano fell to last place at just 10.0% [2].
The research delivers another blow to the agentic AI narrative. When LLMs were equipped with file reading, writing, and code execution tools through a basic harness, performance actually declined. The four tested GPT models (versions 5.4, 5.2, 5.1, and 4.1) performed worse when operated as autonomous AI agents with tools than without, incurring an average additional degradation of 6% by the end of simulation [1].

The pattern of failure differs between model tiers. Weaker models primarily deleted content, while frontier models corrupted it [1]. Errors didn't accumulate gradually but struck suddenly, with document content loss of 10 to 30 points occurring in a single round-trip interaction [1]. "The stronger models aren't avoiding small errors better, they delay critical failures to later rounds and experience them in fewer interactions," the researchers noted [1].
These findings arrive at a critical moment. According to Deloitte, organizations currently spend an average of 36% of their digital budgets on AI automation [1]. The research suggests much of this investment may be premature for long-running tasks beyond highly structured domains like Python programming.

The study found LLMs performed better on programming tasks and worse on natural language tasks [1]. Natural language workflows, creative areas, and semi-structured documents proved particularly challenging [2]. The researchers also discovered that longer token lengths correlated with increased likelihood of model failure [2].

The Microsoft researchers concluded that "current LLMs are ready for delegated workflows in some domains such as Python coding, but not in other less common domains" and emphasized that "users still need to closely monitor LLM systems as they operate and complete tasks on their behalf" [1]. The paper highlights that LLM performance after two interactions doesn't reflect how models perform after 20, underscoring the need for long-horizon evaluation [1]. Companies exploring automated workflows should keep their AI agents on a short leash and watch closely for signs of document corruption in extended professional workflows.

Summarized by Navi