5 Sources
[1]
New Deepseek model drastically reduces resource usage by converting text and documents into images -- 'vision-text compression' uses up to 20 times fewer tokens
Could help cut costs and improve the efficiency of the latest AI models.

Chinese developers of Deepseek AI have released a new model that leverages its multi-modal capabilities to handle complex documents and large blocks of text more efficiently by converting them into images first, as per SCMP. The model's vision encoder takes large quantities of text and converts them into images, which, when accessed later, require between seven and 20 times fewer tokens while maintaining an impressive level of accuracy.

Deepseek is the Chinese-developed AI that shocked the world in early 2025, showcasing capabilities similar to those of OpenAI's ChatGPT or Google's Gemini, despite requiring far less money and data to develop. The creators have continued to work on making the AI more efficient since, and with the latest release, known as DeepSeek-OCR (optical character recognition), the AI can deliver an impressive understanding of large quantities of textual data without the usual token overhead. "Through DeepSeek-OCR, we demonstrated that vision-text compression can achieve significant token reduction - seven to 20 times - for different historical context stages, offering a promising direction" to handle long-context calculations, the developers said.

The new model is made up of two components: the DeepEncoder and DeepSeek3B-MoE-A570M, which acts as the decoder. The encoder can take large quantities of text data and convert it into high-resolution images, while the decoder is particularly adept at taking those high-resolution images and understanding the textual content within them, requiring fewer tokens to do so than if the text were fed into the AI wholesale. It manages this by dissecting each task into separate sub-networks and using specific experts to target each subset of the data. This works really well for handling tabulated data, graphs, and other visual representations of information. This could be of particular use in finance, science, or medicine, the developers suggest.

In benchmarking, the developers claim that when reducing the number of tokens by less than a factor of 10, DeepSeek-OCR can maintain a 97% accuracy rating in decoding the information. If the compression ratio is increased to 20 times, the accuracy falls to 60%. That's less desirable and shows there are diminishing returns on this technology, but if a near-100% accuracy rate could be achieved with even a 1-2x compression rate, that could still make a huge difference in the cost of running many of the latest AI models. It's also being pitched as a way of developing training data for future models, although introducing errors at that point, even in the form of a few percent off base, seems like a bad idea.

If you want to play around with the model yourself, it's available via the online developer platforms Hugging Face and GitHub.
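As a rough illustration of what those figures imply in practice, the short Python sketch below turns the article's reported ratios (roughly seven to 20 times fewer tokens, with about 97% accuracy below 10x compression and about 60% at 20x) into approximate vision-token counts. The document length and the helper function are illustrative assumptions, not part of DeepSeek's release.

# Illustrative arithmetic only: what the reported compression ratios would
# mean in vision-token terms for a hypothetical 8,000-token document.
def vision_tokens_needed(text_tokens: int, compression_ratio: float) -> int:
    return max(1, round(text_tokens / compression_ratio))

doc_tokens = 8_000  # assumed document length, not from the article
for ratio, reported_accuracy in [(7, "~97%"), (10, "~97%"), (20, "~60%")]:
    needed = vision_tokens_needed(doc_tokens, ratio)
    print(f"{doc_tokens} text tokens at {ratio}x -> {needed} vision tokens "
          f"(reported decoding accuracy {reported_accuracy})")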
[2]
DeepSeek drops open-source model that compresses text 10x through images, defying conventions
DeepSeek, the Chinese artificial intelligence research company that has repeatedly challenged assumptions about AI development costs, has released a new model that fundamentally reimagines how large language models process information -- and the implications extend far beyond its modest branding as an optical character recognition tool.

The company's DeepSeek-OCR model, released Monday with full open-source code and weights, achieves what researchers describe as a paradigm inversion: compressing text through visual representation up to 10 times more efficiently than traditional text tokens. The finding challenges a core assumption in AI development and could pave the way for language models with dramatically expanded context windows, potentially reaching tens of millions of tokens.

"We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping," the research team wrote in their technical paper. "Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%."

The implications have resonated across the AI research community. Andrej Karpathy, co-founder of OpenAI and former director of AI at Tesla, said in a post that the work raises fundamental questions about how AI systems should process information. "Maybe it makes more sense that all inputs to LLMs should only ever be images," Karpathy wrote. "Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in."

How DeepSeek achieved 10x compression by treating text as images

While DeepSeek marketed the release as an OCR model -- a technology for converting images of text into digital characters -- the research paper reveals more ambitious goals. The model demonstrates that visual representations can serve as a superior compression medium for textual information, inverting the conventional hierarchy where text tokens were considered more efficient than vision tokens.

"Traditionally, vision LLM tokens almost seemed like an afterthought or 'bolt on' to the LLM paradigm," wrote Jeffrey Emanuel, an AI researcher, in a detailed analysis of the paper. "And 10k words of English would take up far more space in a multimodal LLM when expressed as intelligible pixels than when expressed as tokens... But that gets inverted now from the ideas in this paper."

The model's architecture consists of two primary components: DeepEncoder, a novel 380-million-parameter vision encoder, and a 3-billion-parameter mixture-of-experts language decoder with 570 million activated parameters. DeepEncoder combines Meta's Segment Anything Model (SAM) for local visual perception with OpenAI's CLIP model for global visual understanding, connected through a 16x compression module.
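To make that encoder design concrete, here is a minimal PyTorch sketch of the same idea: a local patch stage, a convolutional module that cuts the token count 16x, and a global attention stage. The real SAM and CLIP backbones are replaced with stand-in layers, and every dimension and layer choice below is an assumption for illustration, not DeepSeek's published code.

# Toy sketch of the two-stage encoder idea: local stage -> 16x token
# compressor -> global stage. Stand-in layers only; not DeepSeek's code.
import torch
import torch.nn as nn

class TokenCompressor16x(nn.Module):
    """Two stride-2 convolutions: 4x fewer tokens per axis = 16x fewer overall."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
    def forward(self, x):          # x: (B, dim, H, W) feature map
        return self.net(x)         # -> (B, dim, H/4, W/4)

class ToyDeepEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.local_stage = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for the SAM-style patch stage
        self.compressor = TokenCompressor16x(dim)
        self.global_stage = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)  # stand-in for the CLIP-style stage

    def forward(self, image):                      # image: (B, 3, 1024, 1024)
        feats = self.local_stage(image)            # (B, dim, 64, 64) -> 4096 patch "tokens"
        feats = self.compressor(feats)             # (B, dim, 16, 16) -> 256 vision tokens
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 256, dim)
        return self.global_stage(tokens)

vision_tokens = ToyDeepEncoder()(torch.randn(1, 3, 1024, 1024))
print(vision_tokens.shape)  # torch.Size([1, 256, 256]): 256 vision tokens for a 1024x1024 page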
To validate their compression claims, DeepSeek researchers tested the model on the Fox benchmark, a dataset of diverse document layouts. The results were striking: using just 100 vision tokens, the model achieved 97.3% accuracy on documents containing 700-800 text tokens -- representing an effective compression ratio of 7.5x. Even at compression ratios approaching 20x, accuracy remained around 60%.

The practical impact: Processing 200,000 pages per day on a single GPU

The efficiency gains translate directly to production capabilities. According to the company, a single Nvidia A100-40G GPU can process more than 200,000 pages per day using DeepSeek-OCR. Scaling to a cluster of 20 servers with eight GPUs each, throughput reaches 33 million pages daily -- sufficient to rapidly construct training datasets for other AI models.

On OmniDocBench, a comprehensive document parsing benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 (which uses 256 tokens per page) while using only 100 vision tokens. More dramatically, it surpassed MinerU2.0 -- which requires more than 6,000 tokens per page on average -- while using fewer than 800 vision tokens.

DeepSeek designed the model to support five distinct resolution modes, each optimized for different compression ratios and use cases. The "Tiny" mode operates at 512×512 resolution with just 64 vision tokens, while "Gundam" mode combines multiple resolutions dynamically for complex documents. "Gundam mode consists of n×640×640 tiles (local views) and a 1024×1024 global view," the researchers wrote.

Why this breakthrough could unlock 10 million token context windows

The compression breakthrough has immediate implications for one of the most pressing challenges in AI development: expanding the context windows that determine how much information language models can actively consider. Current state-of-the-art models typically handle context windows measured in hundreds of thousands of tokens. DeepSeek's approach suggests a path to windows ten times larger.

"The potential of getting a frontier LLM with a 10 or 20 million token context window is pretty exciting," Emanuel wrote. "You could basically cram all of a company's key internal documents into a prompt preamble and cache this with OpenAI and then just add your specific query or prompt on top of that and not have to deal with search tools and still have it be fast and cost-effective."

The researchers explicitly frame their work in terms of context compression for language models. "Through DeepSeek-OCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20×) for different historical context stages, offering a promising direction for addressing long-context challenges in large language models," they wrote.

The paper includes a speculative but intriguing diagram illustrating how the approach could implement memory decay mechanisms similar to human cognition. Older conversation rounds could be progressively downsampled to lower resolutions, consuming fewer tokens while maintaining key information -- a form of computational forgetting that mirrors biological memory.
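That memory-decay idea is easy to sketch in code. The toy function below assigns older conversation rounds a smaller vision-token budget, standing in for re-rendering them at lower resolution; only the 64-token "Tiny" figure comes from the reporting above, and the other budgets and age thresholds are made-up illustrations of the paper's speculative diagram.

# Speculative illustration of "computational forgetting": older rounds get
# a smaller vision-token budget, as if re-rendered at a lower resolution.
def vision_token_budget(rounds_ago: int) -> int:
    if rounds_ago <= 1:
        return 400   # assumed budget for the most recent turns
    if rounds_ago <= 4:
        return 256   # assumed mid-tier budget
    if rounds_ago <= 8:
        return 100   # assumed low-tier budget
    return 64        # "Tiny" mode figure reported above (512x512, 64 tokens)

ages = [0, 1, 3, 6, 12, 30]
budgets = [vision_token_budget(a) for a in ages]
print(budgets, "total vision tokens for this history:", sum(budgets))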
"Input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful," Karpathy noted. The implications resonate with human cognitive science. Emanuel drew a parallel to Hans Bethe, the renowned physicist who memorized vast amounts of reference data: "Having vast amounts of task-specific knowledge in your working memory is extremely useful. This seems like a very clever and additive approach to potentially expanding that memory bank by 10x or more." The model's training: 30 million PDF pages across 100 languages The model's capabilities rest on an extensive training regimen using diverse data sources. DeepSeek collected 30 million PDF pages covering approximately 100 languages, with Chinese and English accounting for 25 million pages. The training data spans nine document types -- academic papers, financial reports, textbooks, newspapers, handwritten notes, and others. Beyond document OCR, the training incorporated what the researchers call "OCR 2.0" data: 10 million synthetic charts, 5 million chemical formulas, and 1 million geometric figures. The model also received 20% general vision data for tasks like image captioning and object detection, plus 10% text-only data to maintain language capabilities. The training process employed pipeline parallelism across 160 Nvidia A100-40G GPUs (20 nodes with 8 GPUs each), with the vision encoder divided between two pipeline stages and the language model split across two others. "For multimodal data, the training speed is 70B tokens/day," the researchers reported. Open source release accelerates research and raises competitive questions True to DeepSeek's pattern of open development, the company released the complete model weights, training code, and inference scripts on GitHub and Hugging Face. The GitHub repository gained over 4,000 stars within 24 hours of release, according to Dataconomy. The breakthrough raises questions about whether other AI labs have developed similar techniques but kept them proprietary. Emanuel speculated that Google's Gemini models, which feature large context windows and strong OCR performance, might employ comparable approaches. "For all we know, Google could have already figured out something like this, which could explain why Gemini has such a huge context size and is so good and fast at OCR tasks," Emanuel wrote. Google's Gemini 2.5 Pro offers a 1-million-token context window, with plans to expand to 2 million, though the company has not publicly detailed the technical approaches enabling this capability. OpenAI's GPT-5 supports 400,000 tokens, while Anthropic's Claude 4.5 offers 200,000 tokens, with a 1-million-token window available in beta for eligible organizations. The unanswered question: Can AI reason over compressed visual tokens? While the compression results are impressive, researchers acknowledge important open questions. "It's not clear how exactly this interacts with the other downstream cognitive functioning of an LLM," Emanuel noted. "Can the model reason as intelligently over those compressed visual tokens as it can using regular text tokens? Does it make the model less articulate by forcing it into a more vision-oriented modality?" The DeepSeek paper focuses primarily on the compression-decompression capability, measured through OCR accuracy, rather than downstream reasoning performance. This leaves open whether language models could reason effectively over large contexts represented primarily as compressed visual tokens. 
Open source release accelerates research and raises competitive questions

True to DeepSeek's pattern of open development, the company released the complete model weights, training code, and inference scripts on GitHub and Hugging Face. The GitHub repository gained over 4,000 stars within 24 hours of release, according to Dataconomy.

The breakthrough raises questions about whether other AI labs have developed similar techniques but kept them proprietary. Emanuel speculated that Google's Gemini models, which feature large context windows and strong OCR performance, might employ comparable approaches. "For all we know, Google could have already figured out something like this, which could explain why Gemini has such a huge context size and is so good and fast at OCR tasks," Emanuel wrote.

Google's Gemini 2.5 Pro offers a 1-million-token context window, with plans to expand to 2 million, though the company has not publicly detailed the technical approaches enabling this capability. OpenAI's GPT-5 supports 400,000 tokens, while Anthropic's Claude 4.5 offers 200,000 tokens, with a 1-million-token window available in beta for eligible organizations.

The unanswered question: Can AI reason over compressed visual tokens?

While the compression results are impressive, researchers acknowledge important open questions. "It's not clear how exactly this interacts with the other downstream cognitive functioning of an LLM," Emanuel noted. "Can the model reason as intelligently over those compressed visual tokens as it can using regular text tokens? Does it make the model less articulate by forcing it into a more vision-oriented modality?"

The DeepSeek paper focuses primarily on the compression-decompression capability, measured through OCR accuracy, rather than downstream reasoning performance. This leaves open whether language models could reason effectively over large contexts represented primarily as compressed visual tokens.

The researchers acknowledge their work represents "an initial exploration into the boundaries of vision-text compression." They note that "OCR alone is insufficient to fully validate true context optical compression" and plan future work including "digital-optical text interleaved pretraining, needle-in-a-haystack testing, and other evaluations."

DeepSeek has established a pattern of achieving competitive results with dramatically lower computational resources than Western AI labs. The company's earlier DeepSeek-V3 model reportedly cost just $5.6 million to train -- though this figure represents only the final training run and excludes R&D and infrastructure costs -- compared to hundreds of millions for comparable models from OpenAI and Anthropic. Industry analysts have questioned the $5.6 million figure, with some estimates placing the company's total infrastructure and operational costs closer to $1.3 billion, though still lower than American competitors' spending.

The bigger picture: Should language models process text as images?

DeepSeek-OCR poses a fundamental question for AI development: should language models process text as text, or as images of text? The research demonstrates that, at least for compression purposes, visual representation offers significant advantages. Whether this translates to effective reasoning over vast contexts remains to be determined.

"From another perspective, optical contexts compression still offers substantial room for research and improvement, representing a promising new direction," the researchers concluded in their paper.

For the AI industry, the work adds another dimension to the race for longer context windows -- a competition that has intensified as language models are applied to increasingly complex tasks requiring vast amounts of information. The open-source release ensures the technique will be widely explored, tested, and potentially integrated into future AI systems.

As Karpathy framed the deeper implication: "OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa." In other words, the path forward for AI might not run through better tokenizers -- it might bypass text tokens altogether.
[3]
DeepSeek's New OCR Model Can Process Over 2 Lakh Pages Daily on a Single GPU | AIM
The technology introduces a vision-based approach to context compression, converting text into compact visual tokens. DeepSeek AI has announced DeepSeek-OCR, a new optical character recognition (OCR) system designed to improve how large language models handle long text contexts through optical 2D mapping. The technology introduces a vision-based approach to context compression, converting text into compact visual tokens. DeepSeek claimed that it achieves over 96% OCR precision when compressing text at a 9x to 10x ratio, and about 60% accuracy, even at 20x compression. DeepSeek-OCR comprises two key components, DeepEncoder and DeepSeek3B-MoE-A570M, working together to balance accuracy and efficiency. DeepEncoder reduces vision tokens before processing, preventing GPU overload even with high-resolution inputs. On the OmniDocBench benchmark, the system outperformed existing OCR models such as GOT-OCR2.0 and MinerU2.0, using fewer vision tokens while maintaining higher efficiency. DeepSeek reported that the model processes over 2,00,000 pages per day on a single NVIDIA A100 GPU and scales up to 33 million pages daily using 20 nodes. The company said this scalability makes DeepSeek-OCR suitable for large-scale document digitisation and AI training data generation. It also supports multiple resolutions and document types, including charts, chemical formulas, and multilingual text. DeepSeek added that its approach represents a new paradigm in language model efficiency by using visual modalities for compression. The system's design allows smaller language models to decode visual representations effectively, indicating potential applications in memory optimisation and long-context processing. Both the code and model weights for DeepSeek-OCR are available as an open-source model on GitHub. The company said it aims to support broader research into combining vision and language for more efficient AI systems. DeepSeek said the paradigm "opens new possibilities for rethinking how vision and language modalities can be synergistically combined to enhance computational efficiency in large-scale text processing and agent systems." The release follows DeepSeek's recent V3.2-Exp model, which reportedly achieves major efficiency gains in training and inference, furthering its push toward cheaper long-context processing for LLMs.
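A quick back-of-the-envelope check of those throughput claims; the single-GPU rate and cluster size are taken from the article, and the multiplication is mine:

# Sanity-check the scaling figures quoted above.
pages_per_gpu_per_day = 200_000
nodes, gpus_per_node = 20, 8
cluster_rate = pages_per_gpu_per_day * nodes * gpus_per_node
print(f"{cluster_rate:,} pages/day across the cluster")  # 32,000,000 -- consistent with the ~33 million reported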
[4]
DeepSeek-OCR: New open-source AI model goes viral on GitHub
DeepSeek-OCR's power lies in its ability to compress information. According to its creators, the model can take a 1,000-word article and compress it into just 100 visual tokens.

A new open-source model named DeepSeek-OCR has been released, disrupting the traditional paradigm of large models. The model, which was open-sourced yesterday afternoon, has seen a meteoric rise in the AI community, gaining over 4,000 stars on GitHub overnight. The core focus of DeepSeek-OCR is a novel visual approach to handling text, which promises to solve one of the biggest challenges in AI: long-context efficiency.

The new DeepSeek-OCR model is not just another text-reading tool. Its power lies in its ability to compress information. According to its creators, the model can take a 1,000-word article and compress it into just 100 visual tokens. This represents a staggering tenfold compression ratio with 97% accuracy. This efficiency is remarkable; a single NVIDIA A100 GPU can process 200,000 pages of data per day using the DeepSeek-OCR method. This new processing approach could signal a significant shift in the input methods used for large models.

The rapid traction of DeepSeek-OCR was amplified by high-profile endorsements. Andrej Karpathy, the co-founder of OpenAI and former Director of AI at Tesla, shared his excitement about the paper. He called DeepSeek-OCR a "good OCR model" and highlighted its more "interesting part": the concept of a computer vision AI "masquerading as a natural language person." Karpathy believes this visual-first method is a superior input for large language models. He proposed that LLMs should use images as their primary input, and even when processing plain text, they should render it into an image first. In his view, this would lead to much higher information compression and a more generalized information flow.

Karpathy also emphasized that the DeepSeek-OCR approach could solve issues with traditional "word segmenters," or tokenizers. He argued that word segmenters are "ugly and standalone," introduce Unicode and byte encoding issues, and can even increase security risks. He views OCR as just one of many visual-text tasks, suggesting that text-to-text tasks could be converted to visual-text tasks, but not the other way around. This sentiment was echoed by Xie Saining, an assistant professor at New York University, who agreed with Karpathy's views on integrating computer vision and natural language processing.

The DeepSeek-OCR model is available as an open-source project on GitHub and Hugging Face. The model, which has 3 billion parameters, is available for download and use with the Hugging Face library. The creators have provided code examples for inference on NVIDIA GPUs, and the repository also includes guidance for PDF processing and model acceleration using vLLM.
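Since the weights are published on Hugging Face, a typical attempt to try the model might look like the sketch below. The repository id, the custom infer() entry point exposed through trust_remote_code, and its argument names are assumptions based on the project's published examples; treat the repository README as the authoritative reference.

# Hedged sketch: loading the open-source release with the transformers library.
# Only AutoModel/AutoTokenizer usage is standard; the repo id, prompt format,
# and the custom infer() call are assumptions to verify against the README.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

result = model.infer(  # custom method shipped with the model code (assumed signature)
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",
    image_file="page.png",
    output_path="ocr_out/",
)
print(result)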
[5]
DeepSeek-OCR Could Change How AI Reads Text From Images
The model turns text into pixels to improve its context memory DeepSeek, on Monday, released a new open-source artificial intelligence (AI) model that changes how these machines analyse and process plain text. Dubbed DeepSeek-OCR, it uses 2D mapping to convert text into pixels to compress long context into a digestible size. The AI startup claims that large language models (LLMs) are more efficient in processing pixels over text, and the compression allows them to capture more relevant information to generate the response. Additionally, the new approach is also said to generate more accurate results compared to traditional methods. DeepSeek-OCR Introduces Novel Technique to Process Text Based on optical character recognition (OCR) technology, the latest DeepSeek AI model uses a new method to process information. It first converts plain text into images, and then analyses the content to generate responses. The promise is that by reading the text in an image, it also compresses and stores massive chunks of a document in a way that makes it easier for a model to remember and reason with the information. At its core, the model introduces "Context Optical Compression," an approach of turning long pages of text into images, then letting the model convert those images into a highly condensed "vision token" representation, which is much smaller in size than the usual text-token representation. To highlight the conversion, the makers say that a 1,000-word article could be processed with just 100 vision tokens. How the model works is also interesting. First, a document image is captured. Then, a vision encoder, which is a custom module made by the researchers, analyses the image and breaks the information into smaller patches. It is then compressed into a smaller number of vision tokens. Then, a decoder takes these vision tokens and reconstructs the textual meaning. Because the AI model is working with far fewer tokens, the downstream language model (or reasoning module) has less memory burden and can handle longer content or bigger documents. Andrej Karpathy, Co-Founder of OpenAI and former Director of AI at Tesla, praised DeepSeek-OCR for its novel implementation of vision tokens. He said that the approach could lead to higher efficiency and has the potential for bidirectional attention. He also said that this method could lead to the elimination of the tokeniser, which would make models more efficient. For those who want to try out the DeepSeek-OCR, the model is currently being hosted on GitHub, where it has received more than 6,700 likes in just 24 hours. The model is available with the permissive MIT licence for both academic and commercial use cases.
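The "text first becomes pixels" step described above is easy to picture with a few lines of Python. The snippet below uses Pillow to rasterise a string into a page image, the kind of input a vision encoder could then compress; DeepSeek's actual rendering pipeline is not described in the article, so this is purely an illustrative stand-in.

# Illustrative stand-in for the text-to-image step: rasterise plain text
# into a page image that a vision encoder could then compress.
from PIL import Image, ImageDraw

def render_text_to_image(text: str, width: int = 1024, chars_per_line: int = 90) -> Image.Image:
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    img = Image.new("RGB", (width, 20 * len(lines) + 20), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((10, 10 + 20 * row), line, fill="black")  # default Pillow font
    return img

page = render_text_to_image("DeepSeek-OCR reads text from pixels. " * 40)
page.save("rendered_page.png")  # the encoder sees this image, not the raw string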
DeepSeek's new open-source AI model, DeepSeek-OCR, introduces a groundbreaking approach to text processing by converting text into images. This method achieves up to 20x compression while maintaining high accuracy, potentially revolutionizing AI language models' efficiency and capabilities.
Chinese AI company DeepSeek has unveiled a groundbreaking open-source model called DeepSeek-OCR, which challenges conventional approaches to text processing in large language models (LLMs). The model's innovative technique converts text into images, achieving significant compression while maintaining high accuracy [1][2].

Source: NDTV Gadgets 360

DeepSeek-OCR's core innovation lies in its ability to compress textual information through visual representation. The model can achieve a compression ratio of up to 20 times, with a 97% accuracy rate at 10x compression [1]. This approach inverts the traditional hierarchy in which text tokens were considered more efficient than vision tokens [2].

DeepSeek-OCR comprises two main components: DeepEncoder, a vision encoder that compresses document images into a small number of vision tokens, and DeepSeek3B-MoE-A570M, a mixture-of-experts decoder that reconstructs the text from those tokens [2][3]. The model outperforms existing OCR systems on benchmarks like OmniDocBench while using fewer vision tokens [2][3].

DeepSeek-OCR's efficiency translates to impressive real-world performance. A single Nvidia A100-40G GPU can process more than 200,000 pages per day, scaling up to 33 million pages daily with a cluster of 20 servers [2][3]. This efficiency makes it suitable for large-scale document digitization and AI training data generation [3].

The compression breakthrough could potentially unlock 10-million-token context windows for language models, a significant leap from current state-of-the-art models that typically handle context windows measured in hundreds of thousands of tokens [2].

The AI community has responded enthusiastically to DeepSeek-OCR. Andrej Karpathy, co-founder of OpenAI and former director of AI at Tesla, suggested that this approach could fundamentally change how AI systems process information [4][5].

DeepSeek has made both the code and model weights for DeepSeek-OCR available as an open-source project on GitHub and Hugging Face [3][5]. This release aims to support broader research into combining vision and language for more efficient AI systems, potentially leading to a paradigm shift in how language models process and understand information [3][4].

Summarized by Navi