2 Sources
[1]
A modest proposal: Reformat everything to make documents more palatable to AI
Websites are being redesigned for consumption by AI models, and now a coalition wants to extend the trend to digital documents. The LF AI & Data Foundation, under the Linux Foundation, has formed a working group to steer the development of DocLang, an AI-friendly document format that aims to help enterprises feed their files to AI systems. The DocLang group, founded by IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis, contends that existing formats like PDF, Markdown, HTML, and LaTeX are ill-suited for AI document parsing. In late 2024, IBM developed an open source toolkit called Docling to facilitate AI document parsing, not unlike Microsoft's MarkItDown or the Marker project. Docling provides a way to convert various file formats into structured AI-ready data. DocLang expands upon that foundation with a standard for exchanging structured output across different systems. "DocLang is designed to solve one of the foundational problems in enterprise AI: documents were built for humans, not machines," said Maxime Vermeir, VP of AI Strategy at AI automation biz ABBYY in a statement. "By introducing a minimal, standardized, and AI-native representation of document structure, layout, meaning and governance, DocLang creates a far more deterministic foundation for modern AI systems." The new DocLang format is necessary, the spec authors argue, because existing formats were designed for rendering and lose semantic information, structural relationships, or geometric context when AI models turn them into tokens. The specification explains that Markdown lacks sufficient scope, that HTML is excessively verbose, and that LaTeX allows too much ambiguity. Essentially, DocLang is optimized for LLM tokenizers through markup that maps between DocLang elements and LLM tokens on a 1-to-1 basis. The spec relies on a limited XML vocabulary that aligns with LLM tokenizers to produce optimized prompts. It is lossless, so the AI conversion doesn't do away with valuable info. 
It's designed to support common graphical elements like tables, formulas, charts, and multimodal content. And it's an open standard. DocLang could also help keep costs under control. According to AI Cost Check, having an AI model conduct an OCR scan on a PDF requires about 1,200 input tokens and 150 output tokens as a baseline. That's inconsequential to corporate AI customers on a one-off basis but demands attention at scale. And because AI models have highly variable token costs, companies may find they are spending more than they anticipated to have their AI system ingest PDFs, particularly if the documents are long and complicated or an expensive frontier model is used. "PDFs were designed for rendering, not understanding," said Jon Knisley, AI Value and Enablement Lead at ABBYY, in an email to The Register. "Every time a PDF enters an AI pipeline, structure, meaning and layout get lost, so the model's accuracy ends up bottlenecked by document quality rather than model quality. Teams compensate by building custom parsers at every integration point, which results in brittle, one-off work, and a new engineering sprint for every new document type." According to Knisley, that has measurable cost. "Ambiguous structure forces the model into guesswork, which drives up hallucination risk and burns tokens deciphering layout instead of extracting meaning," he explained. "With DocLang, customers can expect better accuracy, lower costs, fewer tokens consumed, faster performance and more consistent outputs. The exact savings depend on the use case and document complexity, but our initial benchmarks show 4x to more than 30x lower cost depending on the model evaluated." Knisley also cited governance advantages, noting that document provenance data and metadata can get stripped when documents get moved. DocLang, he said, keeps that information attached.
ABBYY, which offers AI document processing, has created the DocLang Interactive Benchmark to illustrate the potential token savings of feeding DocLang documents to AI models. A PDF of IBM's 2025 annual report, for example, results in 8,421 input tokens and 512 output tokens while a DocLang version requires only 5,310 input tokens and 498 output tokens. What's more, the DocLang version results in lower latency (2.7s vs 4.2s) and delivers better quality (the AI missed one subsection and mangled a table merger in the PDF). "It's still early, and we won't overstate adoption," said Knisley. "The standard is open and free to build on, and the group is actively inviting more technology providers and enterprises to join. The early response has been encouraging, and we're optimistic about where it goes from here." ®
[2]
DocLang aims to make documents readable by AI, not humans
Development of the AI-native DocLang document format raises questions about its impact on human workers, as well as on governance and accountability. AIs struggle to understand documents designed for humans; the DocLang working group seeks to flip that imbalance with its specification for machine-readable business documents "built from the ground up for LLM tokenizers." The working group, founded by IBM, Nvidia, and Red Hat and hosted by the Linux Foundation's LF AI & Data project, aims to create an open, universal, AI-native document format designed to improve how enterprises prepare, exchange, and govern document data for AI systems. ABBYY and HumanSignal will also be involved in its development, and other contributors are welcome.
The Linux Foundation has formed a working group to develop DocLang, an AI-native document format that aims to solve enterprise AI's document processing challenges. Founded by IBM, NVIDIA, Red Hat, ABBYY, and others, the group points to initial ABBYY benchmarks showing 4x to more than 30x lower cost than traditional PDF processing, depending on the model, along with better accuracy and speed.

The LF AI & Data Foundation, under the Linux Foundation, has established a working group to advance DocLang, an AI document format specifically engineered to make documents readable by AI systems rather than humans [1]. Founded by IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis, the initiative addresses what the coalition describes as a foundational problem in enterprise AI: existing document formats like PDF, Markdown, HTML, and LaTeX were built for human consumption and prove ill-suited for AI document parsing [1].

The working group seeks to create an open, universal, AI-native document format designed to improve how enterprises prepare, exchange, and govern document data for AI systems [2]. "DocLang is designed to solve one of the foundational problems in enterprise AI: documents were built for humans, not machines," said Maxime Vermeir, VP of AI Strategy at ABBYY [1].

DocLang distinguishes itself through optimization specifically for LLM tokenizers, employing markup that maps between DocLang elements and LLM tokens on a 1-to-1 basis [1]. The specification relies on a limited XML vocabulary that aligns with LLM tokenizers to produce optimized prompts while maintaining lossless conversion that preserves valuable information [1]. The format supports common graphical elements including tables, formulas, charts, and multimodal content.

The initiative builds upon IBM's late 2024 development of Docling, an open source toolkit for AI document parsing similar to Microsoft's MarkItDown. DocLang expands that foundation by establishing a standard for exchanging structured output across different systems [1]. According to the specification authors, existing formats lose semantic information, structural relationships, or geometric context when AI models convert them into tokens, creating structural ambiguity that forces models into guesswork.

The potential to reduce token costs represents a significant value proposition for enterprises. According to AI Cost Check, having an AI model conduct an OCR scan on a PDF requires approximately 1,200 input tokens and 150 output tokens as a baseline [1]. While inconsequential on a one-off basis, this demands attention at scale, particularly when expensive frontier models process long, complicated documents.

ABBYY's initial benchmarks show 4x to more than 30x lower cost, depending on the model evaluated [1]. The DocLang Interactive Benchmark illustrates the savings: IBM's 2025 annual report as a PDF requires 8,421 input tokens and 512 output tokens, while the DocLang version needs only 5,310 input tokens and 498 output tokens [1]. The DocLang version also delivers lower latency (2.7 seconds versus 4.2 seconds) and better quality: with the PDF version, the AI missed one subsection and mangled a table merger.
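
To make the arithmetic behind these benchmark figures concrete, the following Python sketch compares the two runs quoted for IBM's 2025 annual report. The `Run` class, the `cost_usd` helper, and the per-million-token prices are assumptions introduced purely for illustration; none of them come from the benchmark, and real prices vary widely by model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Token counts and latency for one benchmark run (hypothetical container)."""
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def cost_usd(run: Run, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one run at the given per-million-token prices."""
    return (run.input_tokens * usd_per_m_input
            + run.output_tokens * usd_per_m_output) / 1_000_000


# Figures quoted in the article for IBM's 2025 annual report.
pdf = Run(input_tokens=8_421, output_tokens=512, latency_s=4.2)
doclang = Run(input_tokens=5_310, output_tokens=498, latency_s=2.7)

# ASSUMPTION: illustrative prices only, not taken from the source.
USD_PER_M_INPUT = 3.00
USD_PER_M_OUTPUT = 15.00

# Ratios above 1.0 mean the PDF run used more of the resource.
input_ratio = pdf.input_tokens / doclang.input_tokens
total_ratio = pdf.total_tokens / doclang.total_tokens
latency_ratio = pdf.latency_s / doclang.latency_s
cost_ratio = (cost_usd(pdf, USD_PER_M_INPUT, USD_PER_M_OUTPUT)
              / cost_usd(doclang, USD_PER_M_INPUT, USD_PER_M_OUTPUT))

print(f"input tokens:  {input_ratio:.2f}x")
print(f"total tokens:  {total_ratio:.2f}x")
print(f"latency:       {latency_ratio:.2f}x")
print(f"cost (assumed prices): {cost_ratio:.2f}x")
```

For this single document the input-token ratio works out to roughly 1.6x. The source does not explain how this example relates to the 4x to more than 30x cost range, which is reported per model, so the two figures should not be read as measuring the same thing.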
"PDFs were designed for rendering, not understanding," explained Jon Knisley, AI Value and Enablement Lead at ABBYY
1
. Every time a PDF enters an AI pipeline, structure, meaning and layout get lost, bottlenecking model accuracy based on document quality rather than model quality. Teams currently compensate by building custom parsers at every integration point, resulting in brittle, one-off work and new engineering sprints for every document type1
.Ambiguous structure drives up hallucination risk and burns tokens deciphering layout instead of extracting meaning, creating measurable token waste
1
. With DocLang, customers can expect better accuracy, lower costs, fewer tokens consumed, faster performance and more consistent outputs, though exact savings depend on use case and document complexity.Knisley cited additional governance advantages, noting that document provenance data and metadata often get stripped when documents move between systems. DocLang keeps that information attached, addressing compliance and accountability concerns
1
. The standard is open and free to build on, with the working group actively inviting more technology providers and enterprises to join1
2
."It's still early, and we won't overstate adoption," Knisley acknowledged, though he noted the early response has been encouraging
1
. As enterprises increasingly deploy AI systems at scale, the ability to efficiently process documents without sacrificing accuracy or inflating costs will likely determine which organizations can sustain long-term AI implementations. The development raises questions about how document processing workflows will evolve and whether AI-optimized formats will become standard across industries.Summarized by
Navi