Databricks OfficeQA benchmark exposes AI agents struggling with real enterprise document work


Databricks launched OfficeQA, a new AI benchmark testing agents on real enterprise document tasks using U.S. Treasury Bulletins. The results reveal a sobering reality: even the best AI agents from OpenAI and Anthropic achieve less than 45% accuracy on document-heavy work that mirrors actual business needs, exposing a critical disconnect between academic tests and enterprise requirements.

Databricks Challenges Academic AI Benchmarks With Real-World Test

Databricks has released OfficeQA, an AI benchmark designed to evaluate whether AI agents can handle the complex document-heavy enterprise tasks that dominate actual business workflows [1]. The data and AI platform company found that existing benchmarks like Humanity's Last Exam, ARC-AGI-2, and GDPval focus on abstract capabilities such as PhD-level exams and mathematical problems, but fail to reflect the economically valuable work enterprises need AI to perform [1].

Source: AIM

"If we focus our research efforts on getting better at [existing benchmarks], then we're probably not solving the right problems to make Databricks a better platform," explained Erich Elsen, principal research scientist at Databricks

1

. The gap between AI capabilities and enterprise needs prompted the company to develop a benchmark that would actually improve their platform's ability to solve customer problems.

Frontier AI Models Struggle With Enterprise Document Tasks

The results expose a sobering reality about AI agents' performance on real-world work. On the OfficeQA benchmark, Anthropic's Claude Opus 4.5 Agent solved only 37.4% of questions while OpenAI's GPT-5.1 Agent achieved 43.1% accuracy on the full dataset [2]. Performance dropped even further on OfficeQA-Hard, a subset of 113 challenging examples, where Claude Opus 4.5 scored 21.1% and GPT-5.1 reached only 24.8% [2].

Without access to the document corpus, frontier AI models answered approximately 2% of questions correctly [2]. Even when provided with PDFs, accuracy remained below 45%, revealing fundamental limitations in how LLMs handle information retrieval and analytical reasoning tasks that require unforgiving accuracy [2].

Source: VentureBeat

U.S. Treasury Bulletins Mirror Document Complexity

To create an evaluation framework that approximates enterprise document complexity, Databricks built OfficeQA using U.S. Treasury Bulletins published from 1939 onwards [1]. The corpus spans roughly 89,000 pages across eight decades, containing prose, complex tables, charts, and figures describing federal financial operations [1]. Each bulletin runs 100 to 200 pages, and bulletins published before 1996 are scanned from physical documents, adding realistic challenges like hierarchical table structures and tabular data with nested headers [1].

The benchmark contains 246 questions requiring agents to handle messy, real-world challenges that mirror what enterprises face daily [1]. Tasks range from simple value lookups to multi-step analysis requiring statistical calculations and cross-year comparisons [1]. One example asks agents to run a linear regression to predict the Department of Agriculture's 1999 outlays using data from 1990-1998, while another requires counting local maxima on a line plot from the September 1990 Treasury Bulletin [2].
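To make the regression task concrete, here is a minimal sketch of the calculation an agent would need to perform once it has extracted the yearly figures from the bulletins. The outlay values below are hypothetical placeholders, not data from the Treasury Bulletins.

```python
import numpy as np

# Hypothetical placeholder outlays (in billions of dollars) for fiscal years
# 1990-1998 -- the real values would have to be retrieved from the bulletins.
years = np.arange(1990, 1999)
outlays = np.array([46.0, 52.5, 56.4, 63.1, 60.8, 56.6, 54.3, 52.6, 53.9])

# Fit a least-squares line, outlay = slope * year + intercept,
# then extrapolate one year forward to 1999.
slope, intercept = np.polyfit(years, outlays, deg=1)
prediction_1999 = slope * 1999 + intercept

print(f"Predicted 1999 outlays: {prediction_1999:.1f} billion")
```

The hard part for an agent is not the arithmetic but locating the nine correct values across separate annual bulletins before any of this can run.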

Parsing Quality Drives Significant Performance Gains

Databricks filtered out questions that could be answered using parametric knowledge or web search alone, ensuring the benchmark requires actual document-grounded retrieval [1]. Every question includes a validated ground-truth answer, typically a number or a date, enabling automated evaluation without human judging [1]. This design supports reinforcement learning approaches that require verifiable rewards.
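Because each answer is a short, checkable value, grading can be fully automated. The sketch below illustrates the general idea of a verifiable, exact-match reward; it is not Databricks' actual evaluation code, and the normalization rules are assumptions made for illustration.

```python
import re

def normalize(answer: str) -> str:
    """Strip formatting ($, commas, whitespace, case) so that
    '$1,234.50' and '1234.5' compare equal."""
    cleaned = re.sub(r"[$,\s]", "", answer).lower()
    # Canonicalize plain numbers so '42.0' matches '42'.
    try:
        return repr(float(cleaned))
    except ValueError:
        return cleaned

def reward(predicted: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 only on an exact (normalized) match."""
    return 1.0 if normalize(predicted) == normalize(ground_truth) else 0.0

assert reward("$1,234.50", "1234.5") == 1.0
assert reward("1993", "1994") == 0.0
```

A binary signal like this is exactly the kind of unambiguous reward that reinforcement learning pipelines can optimize against without a human in the loop.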

When tested with raw PDFs, even frontier models struggled significantly. However, preprocessing the corpus with Databricks' ai_parse_document system produced dramatic improvements, with GPT-5.1 showing a 32.4-point accuracy jump [2]. The Mosaic Research team at Databricks describes the benchmark as testing tasks where "being off by one on a product or invoice number can have catastrophic downstream results" [2].
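For readers curious what that preprocessing step looks like in practice, the sketch below shows roughly how a PDF corpus can be batch-parsed with ai_parse_document from a Databricks notebook. The volume path and table name are placeholders, and the exact signature and options should be checked against Databricks' current documentation.

```python
# Rough sketch of batch-parsing a PDF corpus with Databricks' ai_parse_document.
# Assumes a Databricks environment where `spark` is available; the volume path
# below is a placeholder, not the actual location of the Treasury Bulletins.
parsed = spark.sql("""
    SELECT
      path,
      ai_parse_document(content) AS parsed_doc
    FROM READ_FILES(
      '/Volumes/main/default/treasury_bulletins/',
      format => 'binaryFile'
    )
""")

# Persist the structured output so downstream retrieval works over clean text
# and tables instead of raw scans.
parsed.write.mode("overwrite").saveAsTable("main.default.treasury_bulletins_parsed")
```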

Human evaluators needed an average of 50 minutes per question, with most time spent locating data buried across decades of publications [2]. One visual reasoning task, counting local maxima on a 1990 Treasury plot, was not solved by any AI agent, highlighting persistent challenges in complex document-heavy enterprise tasks [2].
