AI models corrupt 25% of documents in long-running tasks, Microsoft researchers reveal

Reviewed by Nidhi Govil


Microsoft researchers tested 19 AI models across 52 professional domains and found even the most advanced LLMs introduce substantial errors in multi-step workflows. Frontier models like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost an average of 25% of document content over 20 interactions, with weaker models degrading by 50%. Only Python programming met the readiness threshold.

AI Models Struggle With Document Integrity in Professional Workflows

Microsoft researchers have uncovered a critical weakness in AI models that challenges the promise of autonomous automation. In a preprint paper titled "LLMs Corrupt Your Documents When You Delegate," scientists Philippe Laban, Tobias Schnabel, and Jennifer Neville reveal that even the most advanced LLMs introduce substantial errors when handling long, multi-step tasks [1]. The findings directly contradict marketing claims from companies like Anthropic and Microsoft itself, which tout their AI agents as capable of handling complex, autonomous work.

Source: TechRadar

The research team developed DELEGATE-52, a benchmark that simulates multi-step workflows across 52 professional domains including coding, accounting, crystallography, and music notation [2]. Testing documents averaged around 15K tokens in length, with each domain requiring 5-10 complex editing tasks through a "round-trip relay simulation" that asks models to perform transformations and then reverse them [2].
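The paper's harness isn't public here, but the round-trip relay idea can be sketched in a few lines: apply an instruction, apply its inverse, and measure how much of the original document survives. In this sketch, `call_model` is a hypothetical stand-in for any LLM API (shown as an identity function so the relay logic itself runs), and similarity is approximated with a plain character-level diff ratio rather than the paper's scoring method.

```python
# Minimal sketch of a "round-trip relay" check: perform a transformation,
# then reverse it, and score how much of the original document remains.
from difflib import SequenceMatcher

def call_model(instruction: str, document: str) -> str:
    # Placeholder: a real harness would send the instruction and document
    # to an LLM and return the edited document. Echoing the input lets the
    # relay logic be exercised without an API key.
    return document

def round_trip_relay(document: str, tasks: list[tuple[str, str]]) -> float:
    """Apply each (transform, reverse) instruction pair in sequence and
    return the similarity of the final document to the original (0..1)."""
    current = document
    for transform, reverse in tasks:
        current = call_model(transform, current)
        current = call_model(reverse, current)
    return SequenceMatcher(None, document, current).ratio()

tasks = [("Convert all dates to ISO 8601", "Restore the original date format")]
score = round_trip_relay("Report dated 3 March 2026.", tasks)
print(round(score, 3))  # identity "model" preserves everything -> 1.0
```

A real run would swap `call_model` for an API call and repeat the relay 20 times, which is exactly where the paper observes the degradation.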

Frontier Models Lose 25% of Document Content

The results paint a sobering picture for delegated workflows. Frontier models including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost an average of 25% of document content over 20 delegated interactions, while the average degradation across all 19 tested models reached 50% [1]. The Microsoft researchers set a readiness threshold at 98% accuracy or higher after 20 interactions. Only one domain qualified: Python programming [1].

Catastrophic corruption (defined as a benchmark score of 80% or less) occurred in more than 80% of model/domain combinations [1]. Even the best-performing model, Gemini 3.1 Pro, achieved readiness in only 11 of 52 domains, with a DELEGATE-52 benchmark score of 80.9% after 20 interactions [2]. Claude 4.6 Opus scored 73.1% and GPT 5.4 reached 71.5%, while GPT 5 Nano fell to last place at just 10.0% [2].
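The two cutoffs the article reports, readiness at 98% or higher after 20 interactions and catastrophic corruption at 80% or lower, carve scores into three bands. A minimal sketch (function name and the "degraded" middle label are mine, not the paper's):

```python
# Classify a DELEGATE-52-style score after 20 interactions using the
# article's two thresholds: >= 98.0 is "ready", <= 80.0 is catastrophic.
def classify(score_after_20: float) -> str:
    if score_after_20 >= 98.0:
        return "ready"
    if score_after_20 <= 80.0:
        return "catastrophic corruption"
    return "degraded"

for model, score in [("Gemini 3.1 Pro", 80.9),
                     ("Claude 4.6 Opus", 73.1),
                     ("GPT 5.4", 71.5),
                     ("GPT 5 Nano", 10.0)]:
    print(f"{model}: {classify(score)}")
```

By these cutoffs, only Gemini 3.1 Pro's overall score escapes the catastrophic band, and even it falls well short of readiness.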

AI Agents Perform Worse Than Models Alone

The research delivers another blow to the agentic AI narrative. When LLMs were equipped with file-reading, file-writing, and code-execution tools through a basic harness, performance actually declined. The four tested GPT models (versions 5.4, 5.2, 5.1, and 4.1) performed worse when operated as autonomous AI agents with tools than without, incurring an average additional degradation of 6% by the end of the simulation [1].

The pattern of failure differs between model tiers. Weaker models primarily deleted content, while frontier models corrupted it [1]. Errors didn't accumulate gradually but struck suddenly, with document content losses of 10 to 30 points occurring in a single round-trip interaction [1]. "The stronger models aren't avoiding small errors better, they delay critical failures to later rounds and experience them in fewer interactions," the researchers noted [1].

Implications for Enterprise AI Adoption

These findings arrive at a critical moment. According to Deloitte, organizations currently spend an average of 36% of their digital budgets on AI automation [1]. The research suggests much of this investment may be premature for long-running tasks beyond highly structured domains like Python programming.

The study found LLMs performed better on programming tasks and worse on natural language tasks [1]. Natural-language workflows, creative areas, and semi-structured documents proved particularly challenging [2]. The researchers also discovered that longer documents, measured in tokens, correlated with an increased likelihood of model failure [2].

The Microsoft researchers concluded that "current LLMs are ready for delegated workflows in some domains such as Python coding, but not in other less common domains" and emphasized that "users still need to closely monitor LLM systems as they operate and complete tasks on their behalf" [1]. The paper highlights that LLM performance after two interactions doesn't reflect how models perform after 20, underscoring the need for long-horizon evaluation [1]. Companies exploring automated workflows should keep their AI agents on a short leash and watch closely for signs of document corruption in extended professional workflows.
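One way to hold that short leash in practice is to snapshot the document before each delegated edit and reject any edit that suddenly destroys content, since the paper reports losses of 10-30 points landing in a single round trip. The sketch below is illustrative, not from the paper: the retention metric is a simple character-level diff ratio and the 90% threshold is an assumed value a team would tune.

```python
# Hedged sketch of a document-integrity guard for delegated edits:
# accept an edit only if enough of the trusted snapshot survives.
from difflib import SequenceMatcher

def retention(before: str, after: str) -> float:
    """Rough percentage of `before` still recoverable in `after`."""
    return 100.0 * SequenceMatcher(None, before, after).ratio()

def guarded_edit(snapshot: str, edited: str, min_retention: float = 90.0):
    """Return (document, accepted). On a sudden content drop, roll back
    to the snapshot and flag the edit for human review."""
    if retention(snapshot, edited) < min_retention:
        return snapshot, False
    return edited, True

doc = "Q1 revenue was $4.2M. Q2 revenue was $5.1M. Q3 revenue was $4.8M."
doc, ok = guarded_edit(doc, "Q1 revenue was $4.2M.")  # model dropped two sentences
print(ok)  # -> False: the edit is rejected and the snapshot is kept
```

A guard like this cannot catch subtle corruption (the frontier-model failure mode), so it complements rather than replaces human review.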
