DocLang: Linux Foundation's AI Document Format

DocLang Emerges as Solution to Enterprise AI Document Challenges

The LF AI & Data Foundation under the Linux Foundation has established a working group to advance DocLang, an AI document format specifically engineered to make documents readable by AI systems rather than humans [1]. Founded by IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis, the initiative addresses what the coalition describes as a foundational problem in enterprise AI: existing document formats like PDF, Markdown, HTML, and LaTeX were built for human consumption and prove ill-suited for AI document parsing [1].

The working group seeks to create an open, universal, AI-native document format designed to improve how enterprises prepare, exchange, and govern document data for AI systems [2]. "DocLang is designed to solve one of the foundational problems in enterprise AI: documents were built for humans, not machines," said Maxime Vermeir, VP of AI Strategy at ABBYY [1].

Technical Architecture Built for LLM Tokenizers

DocLang distinguishes itself through optimization specifically for LLM tokenizers, employing markup that maps between DocLang elements and LLM tokens on a 1-to-1 basis [1]. The specification relies on a limited XML vocabulary that aligns with LLM tokenizers to produce optimized prompts while maintaining lossless conversion that preserves valuable information [1]. The format supports common graphical elements including tables, formulas, charts, and multimodal content.

The initiative builds on Docling, an open source toolkit for AI document parsing that IBM developed in late 2024 and that is similar to Microsoft's MarkItDown. DocLang expands that foundation by establishing a standard for exchanging structured output across different systems [1]. According to the specification authors, existing formats lose semantic information, structural relationships, or geometric context when AI models convert them into tokens, creating structural ambiguity that forces models into guesswork.

Dramatic Cost Reductions and Performance Gains

The potential to reduce token costs represents a significant value proposition for enterprises. According to AI Cost Check, having an AI model conduct an OCR scan on a PDF requires approximately 1,200 input tokens and 150 output tokens as a baseline [1]. While inconsequential for a single document, this overhead becomes critical at scale, particularly with expensive frontier models processing long, complicated documents.
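To make the scaling argument concrete, here is a minimal Python sketch that multiplies the per-document baseline quoted above (about 1,200 input and 150 output tokens) by a document count. The function name scan_cost_usd and the prices of $3 and $15 per million tokens are hypothetical placeholders chosen for illustration; they are not figures from AI Cost Check or from any vendor's price list.

```python
# Per-document baseline quoted in the article (approximate figures).
BASELINE_INPUT_TOKENS = 1_200
BASELINE_OUTPUT_TOKENS = 150


def scan_cost_usd(documents, input_price_per_million, output_price_per_million,
                  input_tokens=BASELINE_INPUT_TOKENS,
                  output_tokens=BASELINE_OUTPUT_TOKENS):
    """Estimate the cost in USD of OCR-scanning `documents` PDFs with an LLM."""
    per_document = (input_tokens * input_price_per_million
                    + output_tokens * output_price_per_million) / 1_000_000
    return documents * per_document


if __name__ == "__main__":
    # Hypothetical prices (USD per million tokens); substitute real ones.
    INPUT_PRICE, OUTPUT_PRICE = 3.0, 15.0
    for count in (1, 10_000, 10_000_000):
        cost = scan_cost_usd(count, INPUT_PRICE, OUTPUT_PRICE)
        print(f"{count:>12,} documents: ${cost:,.2f}")
```

Under these assumed prices a single scan costs well under a cent, while ten million scans cost tens of thousands of dollars, which is the sense in which a small per-document overhead matters at scale.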

ABBYY's initial benchmarks demonstrate 4x to more than 30x lower cost depending on the model evaluated [1]. The DocLang Interactive Benchmark illustrates these savings: IBM's 2025 annual report as a PDF requires 8,421 input tokens and 512 output tokens, while the DocLang version needs only 5,310 input tokens and 498 output tokens [1]. The DocLang version also delivers lower latency (2.7 seconds versus 4.2 seconds) and better quality, with the AI missing one subsection and mangling a table merger in the PDF version.
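The benchmark figures quoted above can be turned into relative reductions with a few lines of Python. This is only arithmetic on the numbers reported in the article; the helper name reduction and the dictionary keys are illustrative choices, not part of any DocLang tooling.

```python
def reduction(before, after):
    """Fractional reduction going from `before` to `after` (0.25 means 25% lower)."""
    return 1 - after / before


# Figures reported for IBM's 2025 annual report in the DocLang Interactive Benchmark.
pdf = {"input_tokens": 8421, "output_tokens": 512, "latency_s": 4.2}
doclang = {"input_tokens": 5310, "output_tokens": 498, "latency_s": 2.7}

if __name__ == "__main__":
    for metric in pdf:
        print(f"{metric}: {reduction(pdf[metric], doclang[metric]):.1%} lower with DocLang")
```

For this one document the sketch reports roughly 37% fewer input tokens, about 3% fewer output tokens, and roughly 36% lower latency. These are per-document token and latency ratios only; the 4x to 30x figure refers to cost in ABBYY's benchmarks across models and is not derived here.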

Addressing Document Processing Bottlenecks

"PDFs were designed for rendering, not understanding," explained Jon Knisley, AI Value and Enablement Lead at ABBYY1

. Every time a PDF enters an AI pipeline, structure, meaning and layout get lost, bottlenecking model accuracy based on document quality rather than model quality. Teams currently compensate by building custom parsers at every integration point, resulting in brittle, one-off work and new engineering sprints for every document type1

Ambiguous structure drives up hallucination risk and burns tokens deciphering layout instead of extracting meaning, creating measurable token waste [1]. With DocLang, customers can expect better accuracy, lower costs, fewer tokens consumed, faster performance and more consistent outputs, though exact savings depend on use case and document complexity.

Governance and Future Adoption Trajectory

Knisley cited additional governance advantages, noting that document provenance data and metadata often get stripped when documents move between systems. DocLang keeps that information attached, addressing compliance and accountability concerns [1]. The standard is open and free to build on, with the working group actively inviting more technology providers and enterprises to join [1].

"It's still early, and we won't overstate adoption," Knisley acknowledged, though he noted the early response has been encouraging1

. As enterprises increasingly deploy AI systems at scale, the ability to efficiently process documents without sacrificing accuracy or inflating costs will likely determine which organizations can sustain long-term AI implementations. The development raises questions about how document processing workflows will evolve and whether AI-optimized formats will become standard across industries.

References

[1] A modest proposal: Reformat everything to make documents more palatable to AI

[2] DocLang aims to make documents readable by AI, not humans
