Databricks OfficeQA benchmark exposes AI agents struggling with real enterprise document work

Databricks launched OfficeQA, a new AI benchmark testing agents on real enterprise document tasks using U.S. Treasury Bulletins. The results reveal a sobering reality: even the best AI agents from OpenAI and Anthropic achieve less than 45% accuracy on document-heavy work that mirrors actual business needs, exposing a critical disconnect between academic tests and enterprise requirements.

Databricks Challenges Academic AI Benchmarks With Real-World Test

Databricks has released OfficeQA, an AI benchmark designed to evaluate whether AI agents can handle the complex document-heavy enterprise tasks that dominate actual business workflows [1]. The data and AI platform company found that existing benchmarks like Humanity's Last Exam, ARC-AGI-2, and GDPval focus on abstract capabilities such as PhD-level exams and mathematical problems, but fail to reflect the economically valuable work enterprises need AI to perform [1].

Source: AIM

"If we focus our research efforts on getting better at [existing benchmarks], then we're probably not solving the right problems to make Databricks a better platform," explained Erich Elsen, principal research scientist at Databricks

1

. The gap between AI capabilities and enterprise needs prompted the company to develop a benchmark that would actually improve their platform's ability to solve customer problems.

Frontier AI Models Struggle With Enterprise Document Tasks

The results expose a sobering reality about AI agents' performance on real-world work. On the OfficeQA benchmark, Anthropic's Claude Opus 4.5 Agent solved only 37.4% of questions while OpenAI's GPT-5.1 Agent achieved 43.1% accuracy on the full dataset [2]. Performance dropped even further on OfficeQA-Hard, a subset of 113 challenging examples, where Claude Opus 4.5 scored 21.1% and GPT-5.1 reached only 24.8% [2].

Without access to the document corpus, frontier AI models answered approximately 2% of questions correctly [2]. Even when provided with PDFs, accuracy remained below 45%, revealing fundamental limitations in how LLMs handle information retrieval and analytical reasoning tasks that require unforgiving accuracy [2].

Source: VentureBeat

U.S. Treasury Bulletins Mirror Document Complexity

To create an evaluation framework that approximates enterprise document complexity, Databricks built OfficeQA using U.S. Treasury Bulletins published from 1939 onwards [1]. The corpus spans roughly 89,000 pages across eight decades, containing prose, complex tables, charts, and figures describing federal financial operations [1]. Each bulletin runs 100 to 200 pages, and those published before 1996 are scanned images of physical documents, adding realistic challenges like hierarchical table structures and tabular data with nested headers [1].

The benchmark contains 246 questions requiring agents to handle messy, real-world challenges that mirror what enterprises face daily [1]. Tasks range from simple value lookups to multi-step analysis requiring statistical calculations and cross-year comparisons [1]. One example asks agents to run a linear regression to predict the Department of Agriculture's 1999 outlays using data from 1990-1998, while another requires counting local maxima on a line plot from the September 1990 Treasury Bulletin [2].
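
For a sense of what the multi-step tasks involve, here is a minimal Python sketch of that kind of regression-based extrapolation. The outlay figures are made-up placeholders, not values from the Treasury Bulletins, and ordinary least squares on a yearly trend is only one plausible reading of the task.

```python
# Minimal sketch of an OfficeQA-style regression task: fit a linear trend to
# 1990-1998 outlays and extrapolate to 1999. The outlay values below are
# hypothetical placeholders, not figures from the Treasury Bulletins.
import numpy as np

years = np.arange(1990, 1999)                       # 1990 through 1998
outlays = np.array([46.0, 52.1, 56.4, 63.1, 60.8,   # hypothetical $ billions
                    56.7, 54.3, 52.5, 53.9])

slope, intercept = np.polyfit(years, outlays, 1)    # ordinary least squares, degree 1
predicted_1999 = slope * 1999 + intercept
print(f"Predicted 1999 outlays: {predicted_1999:.1f} (hypothetical data)")
```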

Parsing Quality Drives Significant Performance Gains

Databricks filtered out questions that could be answered using parametric knowledge or web search alone, ensuring the benchmark requires actual document-grounded retrieval [1]. Every question includes validated ground truth answers, typically numbers or dates, enabling automated evaluation without human judging [1]. This design supports reinforcement learning approaches that require verifiable rewards.
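
Because the ground truth answers are mostly numbers or dates, grading can reduce to an exact or near-exact comparison. A minimal sketch of such a check, assuming ISO-formatted dates and a small relative tolerance for numbers; these normalization rules are illustrative guesses, not OfficeQA's actual grader.

```python
# Minimal sketch of automated grading against a numeric or date ground truth,
# in the spirit of the benchmark's "verifiable rewards" design. Tolerance and
# normalization rules here are assumptions, not OfficeQA's real grader.
from datetime import date

def grade(predicted: str, truth: str, rel_tol: float = 1e-3) -> bool:
    # Try dates first (ISO format assumed for illustration).
    try:
        return date.fromisoformat(predicted.strip()) == date.fromisoformat(truth.strip())
    except ValueError:
        pass
    # Fall back to numbers, stripping common formatting like "$" and ",".
    def to_num(s: str) -> float:
        return float(s.replace("$", "").replace(",", "").strip())
    try:
        p, t = to_num(predicted), to_num(truth)
    except ValueError:
        return predicted.strip().lower() == truth.strip().lower()
    return abs(p - t) <= rel_tol * max(abs(t), 1.0)

assert grade("1,234.5", "1234.5")          # numeric answers match after cleanup
assert not grade("1945-06-30", "1946-06-30")  # dates must match exactly
```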

When tested with raw PDFs, even frontier models struggled significantly. However, preprocessing the corpus with Databricks' ai_parse_document system produced dramatic improvements, with GPT-5.1 showing a 32.4-point accuracy jump [2]. The Mosaic Research team at Databricks describes the benchmark as testing tasks where "being off by one on a product or invoice number can have catastrophic downstream results" [2]. Human evaluators needed an average of 50 minutes per question, with most time spent locating data buried across decades of publications [2]. One visual reasoning task, counting local maxima on a 1990 Treasury plot, was not solved by any AI agent, highlighting persistent challenges in complex document-heavy enterprise tasks [2].
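
The pattern behind that gain is to convert scanned pages into structured text once, up front, rather than asking the agent to read raw PDFs at question time. A rough sketch of that preprocessing step follows; parse_bulletin is a stand-in for a layout-aware parser such as ai_parse_document, and its name, signature, and output format are assumptions made for illustration.

```python
# Rough sketch of the "parse once, then retrieve" pattern suggested by the
# parsing results. `parse_bulletin` is a placeholder for a layout-aware parser
# such as Databricks' ai_parse_document; its interface is assumed, not real.
from pathlib import Path

def parse_bulletin(pdf_path: Path) -> str:
    """Placeholder parser: a real implementation would OCR scanned pages and
    emit structured text that preserves nested table headers and captions."""
    return f"# {pdf_path.stem}\n\n(parsed bulletin text would go here)"

def build_corpus(pdf_dir: Path) -> dict[str, str]:
    # Pre-parse every bulletin so agents search clean structured text instead
    # of re-reading tens of thousands of scanned pages on every question.
    return {p.stem: parse_bulletin(p) for p in sorted(pdf_dir.glob("*.pdf"))}
```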
