2 Sources
[1]
Databricks releases enterprise-focused OfficeQA AI benchmark after finding academic tests miss real-world document tasks
There is no shortage of AI benchmarks in the market today, with popular options like Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others. AI agents excel at solving abstract math problems and passing the PhD-level exams most benchmarks are based on, but Databricks has a question for the enterprise: Can they actually handle the document-heavy work most enterprises need them to do?

The answer, according to new research from the data and AI platform company, is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks that mirror real enterprise workloads, exposing a critical gap between academic benchmarks and business reality.

"If we focus our research efforts on getting better at [existing benchmarks], then we're probably not solving the right problems to make Databricks a better platform," Erich Elsen, principal research scientist at Databricks, explained to VentureBeat. "So that's why we were looking around. How do we create a benchmark that, if we get better at it, we're actually getting better at solving the problems that our customers have?"

The result is OfficeQA, a benchmark designed to test AI agents on grounded reasoning: Answering questions based on complex proprietary datasets containing unstructured documents and tabular data. Unlike existing benchmarks that focus on abstract capabilities, OfficeQA serves as a proxy for the economically valuable tasks enterprises actually perform.

Why academic benchmarks miss the enterprise mark

Popular AI benchmarks have numerous shortcomings from an enterprise perspective, according to Elsen. HLE features questions requiring PhD-level expertise across diverse fields. ARC-AGI evaluates abstract reasoning through visual manipulation of colored grids. Both push the frontiers of AI capabilities, but neither reflects daily enterprise work. Even GDPval, which was specifically created to evaluate economically useful tasks, misses the target.

"We come from a pretty heavy science or engineering background, and sometimes we create evals that reflect that," Elsen said. "So they're either extremely math-heavy, which is a great, useful task, but advancing the frontiers of human mathematics is not what customers are trying to do with Databricks."

While AI is commonly used for customer support and coding apps, Databricks' customer base has a broader set of requirements. Elsen noted that answering questions about documents, or corpora of documents, is a common enterprise task. These questions require parsing complex tables with nested headers, retrieving information across dozens or hundreds of documents and performing calculations where a single-digit error can cascade into incorrect business decisions.

Building a benchmark that mirrors enterprise document complexity

To create a meaningful test of grounded reasoning capabilities, Databricks needed a dataset that approximates the messy reality of proprietary enterprise document corpora while remaining freely available for research. The team landed on U.S. Treasury Bulletins, published monthly for five decades beginning in 1939 and quarterly thereafter.

The Treasury Bulletins check every box for enterprise document complexity. Each bulletin runs 100 to 200 pages and consists of prose, complex tables, charts and figures describing Treasury operations: Where federal money came from, where it went and how it financed government operations. The corpus spans approximately 89,000 pages across eight decades.
Until 1996, the bulletins were scans of physical documents; afterwards, they were digitally produced PDFs. USAFacts, an organization whose mission is "to make government data easier to access and understand," partnered with Databricks to develop the benchmark, identifying the Treasury Bulletins as an ideal corpus and ensuring questions reflected realistic use cases.

The 246 questions require agents to handle messy, real-world document challenges: Scanned images, hierarchical table structures, temporal data spanning multiple reports and the need for external knowledge like inflation adjustments. Questions range from simple value lookups to multi-step analysis requiring statistical calculations and cross-year comparisons.

To ensure the benchmark requires actual document-grounded retrieval, Databricks filtered out questions that LLMs could answer using parametric knowledge or web search alone. This removed simpler questions and some surprisingly complex ones where models leveraged historical financial records memorized during pre-training.

Every question has a validated ground truth answer (typically a number, sometimes dates or small lists), enabling automated evaluation without human judging. This design choice matters: It allows reinforcement learning (RL) approaches that require verifiable rewards, similar to how models train on coding problems.

Current performance exposes fundamental gaps

Databricks tested Claude Opus 4.5 Agent (using Claude's SDK) and GPT-5.1 Agent (using OpenAI's File Search API). The results should give pause to any enterprise betting heavily on current agent capabilities. When provided with raw PDF documents:

* Claude Opus 4.5 Agent (with default thinking=high) achieved 37.4% accuracy.
* GPT-5.1 Agent (with reasoning_effort=high) achieved 43.5% accuracy.

However, performance improved noticeably when the agents were given pre-parsed versions of the pages produced by Databricks' ai_parse_document, indicating that the poor raw-PDF performance stems from LLM APIs struggling with parsing rather than reasoning. Even with parsed documents, the experiments show room for improvement. When provided with documents parsed using Databricks' ai_parse_document:

* Claude Opus 4.5 Agent achieved 67.8% accuracy (a +30.4 percentage point improvement).
* GPT-5.1 Agent achieved 52.8% accuracy (a +9.3 percentage point improvement).

Three findings that matter for enterprise deployments

The testing identified critical insights for practitioners:

Parsing remains the fundamental blocker: Complex tables with nested headers, merged cells and unusual formatting frequently produce misaligned values. Even when given exact oracle pages, agents struggled primarily due to parsing errors, although performance roughly doubled with pre-parsed documents.

Document versioning creates ambiguity: Financial and regulatory documents get revised and reissued, meaning multiple valid answers exist depending on the publication date. Agents often stop searching once they find a plausible answer, missing more authoritative sources.

Visual reasoning is a gap: About 3% of questions require chart or graph interpretation, where current agents consistently fail. For enterprises where data visualizations communicate critical insights, this represents a meaningful capability limitation.

How enterprises can use OfficeQA

The benchmark's design enables specific improvement paths beyond simple scoring. "Since you're able to look at the right answer, it's easy to tell if the error is coming from parsing," Elsen explained.
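Because every answer is a verifiable value rather than free-form prose, that check can be fully automated. Below is a minimal sketch of what such a grader might look like in Python; the function names, tolerance and numeric-extraction heuristic are illustrative assumptions, not Databricks' actual harness.

```python
import re


def extract_number(answer: str) -> float | None:
    """Pull the first numeric value out of a free-text answer.

    Handles formatting like '$1,234.5 million' by stripping commas first.
    (Illustrative heuristic only, not the OfficeQA grader.)
    """
    match = re.search(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return float(match.group()) if match else None


def grade(agent_answer: str, ground_truth: float, rel_tol: float = 1e-3) -> bool:
    """Binary, automatically verifiable score: True if the agent's number
    matches the validated ground truth within a small relative tolerance."""
    value = extract_number(agent_answer)
    if value is None:
        return False
    if ground_truth == 0:
        return value == 0
    return abs(value - ground_truth) / abs(ground_truth) <= rel_tol


# Example: the same binary signal could double as a verifiable RL reward.
print(grade("Total expenditures were roughly $9,055 million.", 9055.0))  # True
```

Because the check is deterministic, the same kind of function can score a parsing-pipeline run or serve as a binary reward in an RL loop, which is the property the benchmark's design is meant to enable.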
This automated evaluation enables rapid iteration on parsing pipelines. The verified ground truth answers also enable RL training similar to coding benchmarks, since there's no human judgment required.

Elsen said the benchmark provides "a really strong feedback signal" for developers working on search solutions. However, he cautioned against treating it as training data. "At least in my imagination, the goal of releasing this is more as an eval and not as a source of raw training data," he said. "If you tune too specifically into this environment, then it's not clear how generalizable your agent results would be."

What this means for enterprise AI deployments

For enterprises currently deploying or planning document-heavy AI agent systems, OfficeQA provides a sobering reality check. Even the latest frontier models achieve only 43% accuracy on unprocessed PDFs and fall short of 70% accuracy even with optimal document parsing. Performance on the hardest questions plateaus at 40%, indicating substantial room for improvement.

Three immediate implications:

Evaluate your document complexity: If your documents resemble the complexity profile of the Treasury Bulletins (scanned images, nested table structures, cross-document references), expect accuracy well below vendor marketing claims. Test on your actual documents before production deployment.

Plan for the parsing bottleneck: The test results indicate that parsing remains a fundamental blocker. Budget time and resources for custom parsing solutions rather than assuming off-the-shelf OCR will suffice.

Plan for hard-question failure modes: Even with optimal parsing, agents plateau at 40% on complex multi-step questions. For mission-critical document workflows that require multi-document analysis, statistical calculations or visual reasoning, current agent capabilities may not be ready without significant human oversight.

For enterprises looking to lead in AI-powered document intelligence, this benchmark provides a concrete evaluation framework and identifies specific capability gaps that need solving.
[2]
Databricks Benchmark Tests AI on Enterprise Tasks That Demand 'Unforgiving Accuracy' | AIM
On the benchmark, Anthropic's Claude Opus 4.5 Agent solved 37.4% of questions, whereas OpenAI's GPT-5.1 Agent scored 43.1% on the full data set.

Databricks has introduced OfficeQA, a new benchmark designed to assess whether AI agents can handle the grounded, document-heavy reasoning that dominates real enterprise work. Databricks argues that existing stress tests such as GDPval, ARC-AGI-2 and Humanity's Last Exam do not reflect "the kinds of tasks that are important to our customers." OfficeQA is meant to fill that gap by evaluating how well AI systems retrieve, parse and reason over sprawling, messy, real-world corpora.

The benchmark is built from the US Treasury Bulletins spanning more than eight decades, a corpus of roughly 89,000 pages of scanned tables, charts and narrative updates about federal finances. The Mosaic Research team at Databricks describes it as a proxy for "economically valuable tasks performed by Databricks' enterprise customers," where accuracy is unforgiving and even "being off by one on a product or invoice number can have catastrophic downstream results."

OfficeQA contains 246 questions across easy and hard tiers, each requiring information retrieval across multiple documents and grounded analytical reasoning. Example questions include retrieving the total U.S. national defense expenditures for the 1940 calendar year, running a linear regression to predict the Department of Agriculture's 1999 outlays using data from 1990-1998, or interpreting visuals such as counting the number of local maxima on a line plot from the September 1990 Treasury Bulletin. Human evaluators needed an average of 50 minutes per question, most of it spent locating data buried across decades of publications. Databricks filtered out any item that could be answered with an LLM's memorized knowledge or through a simple web search, ensuring that "questions require document-grounded retrieval."

Databricks tested several frontier agents, including a GPT-5.1 agent using OpenAI's File Search and Retrieval API and a Claude Opus 4.5 agent built with Anthropic's SDK. Performance was weak when models were asked to work directly from PDFs. Without access to the corpus, models answered about 2% of questions correctly. When given only PDFs, accuracy rose but remained below 45%: Claude Opus 4.5 Agent solved 37.4% whereas GPT-5.1 Agent scored 43.1% on the full data set. On OfficeQA-Hard, a subset of 113 hard examples, Claude Opus 4.5 Agent scored 21.1% and GPT-5.1 Agent scored 24.8%. "Despite frontier models performing well on Olympiad-style questions, we find they still struggle on these economically important tasks," said Databricks.

Even when provided with access to the exact document slices containing the answer, raw PDF interpretation caused large errors. Significant gains emerged only after preprocessing the corpus with Databricks' own parsing system. "When these same pages are preprocessed using Databricks ai_parse_document, performance jumps significantly," the researchers write, noting a +32.4-point jump for GPT-5.1. One visual task -- counting local maxima on a 1990 Treasury plot -- was not solved by any AI agent.
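One of the example questions above, predicting the Department of Agriculture's 1999 outlays from its 1990-1998 figures with a linear regression, reduces to a small least-squares fit once the nine values have been retrieved. Below is a minimal sketch in Python with made-up placeholder outlays; the real figures live in the Treasury Bulletin tables and are not reproduced here.

```python
import numpy as np

# Placeholder outlays (billions of dollars) for 1990-1998.
# These are invented values for illustration; the real figures must be
# retrieved from the Treasury Bulletin tables.
years = np.arange(1990, 1999)
outlays = np.array([46.0, 52.5, 56.4, 63.1, 60.8, 56.7, 54.3, 52.6, 53.9])

# Ordinary least-squares line: outlay = slope * year + intercept
slope, intercept = np.polyfit(years, outlays, deg=1)

# Extrapolate one year forward to get the 1999 prediction.
prediction_1999 = slope * 1999 + intercept
print(f"Predicted 1999 outlays: {prediction_1999:.1f} (placeholder data)")
```

As the 50-minutes-per-question human timings suggest, the hard part of such a task is not the fit itself but locating the correct nine values across decades of scanned tables.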
Databricks launched OfficeQA, a new AI benchmark testing agents on real enterprise document tasks using U.S. Treasury Bulletins. The results reveal a sobering reality: even the best AI agents from OpenAI and Anthropic achieve less than 45% accuracy on document-heavy work that mirrors actual business needs, exposing a critical disconnect between academic tests and enterprise requirements.
Databricks has released OfficeQA, an AI benchmark designed to evaluate whether AI agents can handle the complex document-heavy enterprise tasks that dominate actual business workflows [1]. The data and AI platform company found that existing benchmarks like Humanity's Last Exam, ARC-AGI-2, and GDPval focus on abstract capabilities such as PhD-level exams and mathematical problems, but fail to reflect the economically valuable work enterprises need AI to perform [1].
"If we focus our research efforts on getting better at [existing benchmarks], then we're probably not solving the right problems to make Databricks a better platform," explained Erich Elsen, principal research scientist at Databricks
1
. The gap between AI capabilities and enterprise needs prompted the company to develop a benchmark that would actually improve their platform's ability to solve customer problems.The results expose a sobering reality about AI agents' performance on real-world work. On the OfficeQA benchmark, Anthropic's Claude Opus 4.5 Agent solved only 37.4% of questions while OpenAI's GPT-5.1 Agent achieved 43.1% accuracy on the full dataset
2
. Performance dropped even further on OfficeQA-Hard, a subset of 113 challenging examples, where Claude Opus 4.5 scored 21.1% and GPT-5.1 reached only 24.8%2
.Without access to the document corpus, frontier AI models answered approximately 2% of questions correctly
2
. Even when provided with PDFs, accuracy remained below 45%, revealing fundamental limitations in how LLMs handle information retrieval and analytical reasoning tasks that require unforgiving accuracy2
.
To create an evaluation framework that approximates enterprise document complexity, Databricks built OfficeQA using U.S. Treasury Bulletins published from 1939 onwards [1]. The corpus spans roughly 89,000 pages across eight decades, containing prose, complex tables, charts, and figures describing federal financial operations [1]. Each bulletin runs 100 to 200 pages, and bulletins published before 1996 are scans of physical documents, adding realistic challenges like scanned images, hierarchical table structures and tabular data with nested headers [1].

The benchmark contains 246 questions requiring agents to handle messy, real-world challenges that mirror what enterprises face daily [1]. Tasks range from simple value lookups to multi-step analysis requiring statistical calculations and cross-year comparisons [1]. One example asks agents to run a linear regression to predict the Department of Agriculture's 1999 outlays using data from 1990-1998, while another requires counting local maxima on a line plot from the September 1990 Treasury Bulletin [2].
Databricks filtered out questions that could be answered using parametric knowledge or web search alone, ensuring the benchmark requires actual document-grounded retrieval [1]. Every question includes validated ground truth answers, typically numbers or dates, enabling automated evaluation without human judging [1]. This design supports reinforcement learning approaches that require verifiable rewards.
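The filtering step just described could be prototyped as a closed-book probe: ask a model each candidate question with no documents attached and discard anything it already answers correctly. The sketch below is an illustrative assumption, not Databricks' pipeline; `ask_model` is a hypothetical helper for a closed-book model call, and `grade` is a verifiable checker along the lines of the one sketched earlier.

```python
def closed_book_filter(candidates, ask_model, grade):
    """Keep only questions a model cannot answer without the corpus.

    candidates: list of {"question": str, "ground_truth": float} dicts.
    ask_model:  callable returning the model's free-text answer when the
                question is asked with no documents attached (hypothetical).
    grade:      verifiable checker comparing an answer to the ground truth.
    All names here are illustrative assumptions, not the OfficeQA tooling.
    """
    kept = []
    for item in candidates:
        closed_book_answer = ask_model(item["question"])
        if grade(closed_book_answer, item["ground_truth"]):
            # Answerable from parametric knowledge alone, so drop it.
            continue
        kept.append(item)
    return kept
```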
When tested with raw PDFs, even frontier models struggled significantly. However, preprocessing the corpus with Databricks' ai_parse_document system produced dramatic improvements, with GPT-5.1 showing a 32.4-point accuracy jump [2]. The Mosaic Research team at Databricks describes the benchmark as testing tasks where "being off by one on a product or invoice number can have catastrophic downstream results" [2]. Human evaluators needed an average of 50 minutes per question, with most time spent locating data buried across decades of publications [2]. One visual reasoning task, counting local maxima on a 1990 Treasury plot, was not solved by any AI agent, highlighting persistent challenges in complex document-heavy enterprise tasks [2].
Summarized by Navi