Databricks OfficeQA benchmark exposes AI agents struggling with real enterprise document work

Databricks launched OfficeQA, a new AI benchmark testing agents on real enterprise document tasks using U.S. Treasury Bulletins. The results reveal a sobering reality: even the best AI agents from OpenAI and Anthropic achieve less than 45% accuracy on document-heavy work that mirrors actual business needs, exposing a critical disconnect between academic tests and enterprise requirements.

Databricks Challenges Academic AI Benchmarks With Real-World Test

Databricks has released OfficeQA, an AI benchmark designed to evaluate whether AI agents can handle the complex document-heavy enterprise tasks that dominate actual business workflows [1]. The data and AI platform company found that existing benchmarks like Humanity's Last Exam, ARC-AGI-2, and GDPval focus on abstract capabilities such as PhD-level exams and mathematical problems, but fail to reflect the economically valuable work enterprises need AI to perform [1].

Source: AIM

"If we focus our research efforts on getting better at [existing benchmarks], then we're probably not solving the right problems to make Databricks a better platform," explained Erich Elsen, principal research scientist at Databricks

1

. The gap between AI capabilities and enterprise needs prompted the company to develop a benchmark that would actually improve their platform's ability to solve customer problems.

Frontier AI Models Struggle With Enterprise Document Tasks

The results expose a sobering reality about AI agents' performance on real-world work. On the OfficeQA benchmark, Anthropic's Claude Opus 4.5 Agent solved only 37.4% of questions while OpenAI's GPT-5.1 Agent achieved 43.1% accuracy on the full dataset [2]. Performance dropped even further on OfficeQA-Hard, a subset of 113 challenging examples, where Claude Opus 4.5 scored 21.1% and GPT-5.1 reached only 24.8% [2].

Without access to the document corpus, frontier AI models answered approximately 2% of questions correctly [2]. Even when provided with PDFs, accuracy remained below 45%, revealing fundamental limitations in how LLMs handle information retrieval and analytical reasoning tasks that require unforgiving accuracy [2].

Source: VentureBeat

U.S. Treasury Bulletins Mirror Document Complexity

To create an evaluation framework that approximates enterprise document complexity, Databricks built OfficeQA using U.S. Treasury Bulletins published from 1939 onwards [1]. The corpus spans roughly 89,000 pages across eight decades, containing prose, complex tables, charts, and figures describing federal financial operations [1]. Each bulletin runs 100 to 200 pages, and those published before 1996 are scanned images of physical documents, adding realistic challenges like hierarchical table structures and tabular data with nested headers [1].

The benchmark contains 246 questions requiring agents to handle messy, real-world challenges that mirror what enterprises face daily [1]. Tasks range from simple value lookups to multi-step analysis requiring statistical calculations and cross-year comparisons [1]. One example asks agents to run a linear regression to predict the Department of Agriculture's 1999 outlays using data from 1990-1998, while another requires counting local maxima on a line plot from the September 1990 Treasury Bulletin [2].
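
For a sense of what the multi-step tasks involve, here is a minimal Python sketch of that kind of regression-based extrapolation. The outlay figures are made-up placeholders, not values from the Treasury Bulletins, and ordinary least squares on a yearly trend is only one plausible reading of the task.

```python
# Minimal sketch of an OfficeQA-style regression task: fit a linear trend to
# 1990-1998 outlays and extrapolate to 1999. The outlay values below are
# hypothetical placeholders, not figures from the Treasury Bulletins.
import numpy as np

years = np.arange(1990, 1999)                       # 1990 through 1998
outlays = np.array([46.0, 52.1, 56.4, 63.1, 60.8,   # hypothetical $ billions
                    56.7, 54.3, 52.5, 53.9])

slope, intercept = np.polyfit(years, outlays, 1)    # ordinary least squares, degree 1
predicted_1999 = slope * 1999 + intercept
print(f"Predicted 1999 outlays: {predicted_1999:.1f} (hypothetical data)")
```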

Parsing Quality Drives Significant Performance Gains

Databricks filtered out questions that could be answered using parametric knowledge or web search alone, ensuring the benchmark requires actual document-grounded retrieval [1]. Every question includes validated ground truth answers, typically numbers or dates, enabling automated evaluation without human judging [1]. This design supports reinforcement learning approaches that require verifiable rewards.
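
Because the ground truth answers are mostly numbers or dates, grading can reduce to an exact or near-exact comparison. A minimal sketch of such a check, assuming ISO-formatted dates and a small relative tolerance for numbers; these normalization rules are illustrative guesses, not OfficeQA's actual grader.

```python
# Minimal sketch of automated grading against a numeric or date ground truth,
# in the spirit of the benchmark's "verifiable rewards" design. Tolerance and
# normalization rules here are assumptions, not OfficeQA's real grader.
from datetime import date

def grade(predicted: str, truth: str, rel_tol: float = 1e-3) -> bool:
    # Try dates first (ISO format assumed for illustration).
    try:
        return date.fromisoformat(predicted.strip()) == date.fromisoformat(truth.strip())
    except ValueError:
        pass
    # Fall back to numbers, stripping common formatting like "$" and ",".
    def to_num(s: str) -> float:
        return float(s.replace("$", "").replace(",", "").strip())
    try:
        p, t = to_num(predicted), to_num(truth)
    except ValueError:
        return predicted.strip().lower() == truth.strip().lower()
    return abs(p - t) <= rel_tol * max(abs(t), 1.0)

assert grade("1,234.5", "1234.5")          # numeric answers match after cleanup
assert not grade("1945-06-30", "1946-06-30")  # dates must match exactly
```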

When tested with raw PDFs, even frontier models struggled significantly. However, preprocessing the corpus with Databricks' ai_parse_document system produced dramatic improvements, with GPT-5.1 showing a 32.4-point accuracy jump [2]. The Mosaic Research team at Databricks describes the benchmark as testing tasks where "being off by one on a product or invoice number can have catastrophic downstream results" [2]. Human evaluators needed an average of 50 minutes per question, with most time spent locating data buried across decades of publications [2]. One visual reasoning task, counting local maxima on a 1990 Treasury plot, was not solved by any AI agent, highlighting persistent challenges in complex document-heavy enterprise tasks [2].
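
The pattern behind that gain is to convert scanned pages into structured text once, up front, rather than asking the agent to read raw PDFs at question time. A rough sketch of that preprocessing step follows; parse_bulletin is a stand-in for a layout-aware parser such as ai_parse_document, and its name, signature, and output format are assumptions made for illustration.

```python
# Rough sketch of the "parse once, then retrieve" pattern suggested by the
# parsing results. `parse_bulletin` is a placeholder for a layout-aware parser
# such as Databricks' ai_parse_document; its interface is assumed, not real.
from pathlib import Path

def parse_bulletin(pdf_path: Path) -> str:
    """Placeholder parser: a real implementation would OCR scanned pages and
    emit structured text that preserves nested table headers and captions."""
    return f"# {pdf_path.stem}\n\n(parsed bulletin text would go here)"

def build_corpus(pdf_dir: Path) -> dict[str, str]:
    # Pre-parse every bulletin so agents search clean structured text instead
    # of re-reading tens of thousands of scanned pages on every question.
    return {p.stem: parse_bulletin(p) for p in sorted(pdf_dir.glob("*.pdf"))}
```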
