5 Sources
[1]
DeepSeek permanently reduces the price of its flagship V4 model by 75 percent - Engadget
The lower prices could be aimed at undercutting the competition. DeepSeek is leaning hard into being the "cost-effective" choice for AI agents. According to its website, the Chinese startup is dropping the price for its latest flagship model, DeepSeek V4 Pro, to a fourth of its original price. This latest price update makes permanent the 75 percent discount promotion that was previously supposed to end on May 31, 2026. As seen on the website's pricing page, the DeepSeek V4 Pro prices now range from $0.003625 to $0.87 per one million tokens, compared to the previous range of $0.0145 to $3.48 for every million tokens. The company's decision to permanently reduce the price comes a month after it released its V4 models, Pro and Flash, which it claimed would welcome the "era of cost-effective 1M context length." DeepSeek's deep discounts should prove to be a major cost saving for enterprise accounts or power users who go through millions of tokens in a day. The major price drop also presents a more affordable alternative to other popular AI models, like OpenAI's GPT-5 or the recently released Gemini 3.5 Flash from Google. DeepSeek's price undercutting might provoke competitors, like Anthropic, who previously accused the Chinese company of "distillation attacks" that improperly learn from Claude's more capable AI models.
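As a quick check on the figures quoted above, the following minimal sketch recomputes the discount from the old and new price ranges. The variable and function names are illustrative only; the four prices are the ones reported in the article.

```python
# Prices in USD per one million tokens, as quoted in the article above.
OLD_RANGE = (0.0145, 3.48)      # previous low and high ends of the range
NEW_RANGE = (0.003625, 0.87)    # new permanent low and high ends


def discount_fraction(old: float, new: float) -> float:
    """Return the fraction by which `new` is lower than `old`."""
    return (old - new) / old


# One discount per end of the range (low end, high end).
discounts = [discount_fraction(o, n) for o, n in zip(OLD_RANGE, NEW_RANGE)]

for old, new, d in zip(OLD_RANGE, NEW_RANGE, discounts):
    print(f"${old} -> ${new}: {d:.0%} lower")
```

Both ends of the range work out to the same 75 percent reduction, which matches the "fourth of its original price" description.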
[2]
DeepSeek made its 75% discount permanent. The AI price war just escalated.
DeepSeek permanently cut V4 Pro prices by 75%, to $0.87 per million output tokens. It undercuts GPT-5, Gemini, and Claude. DeepSeek has made permanent the 75% price discount on its flagship V4 Pro model. The promotion was originally scheduled to expire on 31 May. The Chinese AI startup's pricing now ranges from $0.003625 to $0.87 per million tokens, down from $0.0145 to $3.48. The price points are striking in context. OpenAI's GPT-5 charges $2.50 per million input tokens and $10 per million output tokens. Anthropic's Claude Opus 4.7 is priced at $5 input and $25 output. Google's Gemini 3.5 Flash, its cost-optimised model, charges $0.15 input and $0.60 output per million tokens. DeepSeek V4 Pro's new permanent pricing sits below all of them. The gap is widest against the frontier reasoning models that enterprise customers rely on for demanding workloads. The decision to lock in the discount one month after launching the V4 models suggests DeepSeek is prioritising market share over per-unit revenue. The company described V4 as welcoming the "era of cost-effective 1M context length." It is positioning its models as the default for applications that process large documents, codebases, or conversational histories where token costs compound fast. For enterprise accounts consuming millions of tokens daily, the savings are material. Salesforce projects $300 million in Anthropic token spending this year. At DeepSeek's new pricing, an equivalent volume would cost a fraction of that figure. The question for enterprise buyers is whether DeepSeek's model quality, reliability, and compliance posture justify the switch. The price advantage may be offset by the geopolitical and technical risks of routing sensitive workloads through a Chinese AI provider. That calculus varies by industry and by the sensitivity of the data involved. The competitive dynamics are complicated by Anthropic's public accusation that DeepSeek has engaged in "distillation attacks." 
The allegation is that DeepSeek improperly trained on Claude's responses to improve its own models. DeepSeek has not publicly addressed the accusation in detail. If substantiated, it would mean that some of DeepSeek's capability advantage was built on Anthropic's research investment. The price differential would then reflect intellectual property arbitrage rather than engineering efficiency. The accusation remains unresolved. Anthropic's annualised revenue surged from $9 billion to $30 billion between the end of 2025 and early April 2026. That growth was driven largely by enterprise adoption of Claude Code. DeepSeek's pricing pressure threatens the revenue-per-token economics that support Anthropic's valuation trajectory. If enterprise customers begin routing lower-complexity tasks to DeepSeek while reserving Claude for high-stakes reasoning, Anthropic's token volume could hold while revenue per token declines. The broader AI pricing landscape has been moving toward commoditisation throughout 2026. Google has repeatedly cut Gemini prices to compete with open-weight models. OpenAI's pivot toward consumer platform features, including personal finance tools and advertising, reflects a recognition that API token revenue alone may not sustain its $852 billion valuation. DeepSeek's permanent price cut accelerates a trend that was already compressing margins across the industry. The era of high-margin AI tokens may be ending faster than anyone expected. DeepSeek V4 Pro supports a one-million-token context window at the new pricing. That makes it competitive for document analysis, legal review, and codebase comprehension. These are the long-context applications where input cost is the binding constraint on adoption. The combination of frontier-adjacent capability and radically lower pricing creates a genuine dilemma for CTOs. The cheapest option is also the one with the most geopolitical complexity. 
It has the least transparency about training data provenance and an unresolved IP accusation from one of its most capable competitors. DeepSeek's strategy appears to be that price will win. Enough volume will flow to the cheapest capable model regardless of origin. The geopolitical concerns that constrain adoption in government and regulated industries will not prevent adoption in the broader market. Whether that bet is correct depends on whether Western AI companies can close the price gap before DeepSeek closes the capability gap. The alternative is that the market bifurcates into a Western tier and a Chinese tier with fundamentally different economics. DeepSeek just made sure the gap between them got wider.
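To make the comparison in this article concrete, here is a small sketch that computes how many times more expensive each quoted model is than DeepSeek V4 Pro at list price. The competitor prices and DeepSeek's $0.87 output rate are the ones quoted in this article; the $0.435 DeepSeek input rate is not stated here and is taken from source [3], so treat it as an assumption. The dictionary and function names are hypothetical.

```python
# USD per million tokens as (input, output).
# Assumption: DeepSeek's $0.435 input rate comes from source [3];
# this article only quotes the $0.87 output rate.
PRICES = {
    "DeepSeek V4 Pro": (0.435, 0.87),
    "GPT-5": (2.50, 10.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "Gemini 3.5 Flash": (0.15, 0.60),
}


def ratio_vs_deepseek(model: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) of `model` relative to DeepSeek V4 Pro."""
    base_in, base_out = PRICES["DeepSeek V4 Pro"]
    price_in, price_out = PRICES[model]
    return price_in / base_in, price_out / base_out


for name in PRICES:
    if name == "DeepSeek V4 Pro":
        continue
    r_in, r_out = ratio_vs_deepseek(name)
    print(f"{name}: {r_in:.2f}x input, {r_out:.2f}x output")
```

On these list prices the ratios for GPT-5 and Claude Opus 4.7 are well above 1, while the ratios for Gemini 3.5 Flash come out below 1. The "below all of them" claim therefore seems to depend on DeepSeek's cached-input rate rather than its standard input and output rates.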
[3]
DeepSeek V4's permanent price cut upends enterprise AI
DeepSeek's announcement over the weekend that it has made its 75% price cut permanent on its flagship V4 Pro model is a disruptive assault on the capital-heavy business models of Silicon Valley's frontier labs. The reduction on DeepSeek V4 Pro directly undercuts comparable Western models used as workhorses for enterprise production. It is 7x cheaper on inputs and 17x cheaper on outputs than Anthropic's Claude Sonnet or OpenAI's GPT 5.5-Med, while the lightweight DeepSeek V4 Flash undercuts entry-tier alternatives like Claude Haiku by 10x to 25x. The price cuts are enabled by a series of hardware-software innovations, especially around cache, that make DeepSeek's models radically more efficient to run. When hosted natively in China, DeepSeek's cache-read pricing is a whopping 87x cheaper than Western clouds -- a deflationary floor so aggressive that handset giant Xiaomi just moved to match the exact pricing tier for its newly deployed MiMo architecture. DeepSeek V4 Pro's performance is ranked almost on par with Western frontier models, hitting 80.6% on coding-agent tasks via the SWE-bench Verified leaderboard and an elite reasoning score of 87.5 on the advanced MMLU-Pro technical index. Both V4 Pro and V4 Flash -- a hyper-optimized speedy version for developers -- are open-weight and issued under a permissive MIT license. This gives enterprises complete flexibility over deployment. This dual-model strategy allows technical teams to route their heaviest, multi-step autonomous agent workloads to the lightning-fast Flash model, while reserving the heavy Pro model for deep reasoning tasks, drastically lowering costs at a time when budget concerns have grown considerably. This also comes at a time when the closed Western labs, in particular OpenAI and Anthropic, face an intense return-on-investment scrutiny for their multi-billion dollar general-purpose hardware infrastructure investments. 
This deflationary collapse will not affect all Silicon Valley labs equally, signaling a permanent bifurcation of the enterprise AI market. While a premium, deterministic tier will endure for mission-critical engineering workflows, the high-volume background agentic layer is being completely commoditized by open weights. Ultimately, it creates a much more dangerous exposure for OpenAI -- whose revenue mix relies heavily on general-purpose commodity API streams -- than for software-insulated peers like Anthropic.

The token cost crisis

Uber says it burned through its entire 2026 budget for Claude Code and Cursor in just the first four months of the year; its COO said that the cost related to high token usage by some of its engineers was getting "harder to justify" without better products to show for it. Airbnb's Brian Chesky said last year that while the company uses OpenAI's latest models, they don't rely on them heavily in production -- favoring faster, cheaper alternatives like Alibaba's Qwen. And in the latest episode of VentureBeat's podcast Beyond the Pilot, Pinterest CTO Matt Madrigal confirmed that the company went all-in on an open-source AI strategy, post-training Alibaba's open Qwen model on the company's proprietary "taste graph" to drive Pinterest's assistant -- achieving frontier-like quality at a 90% reduction in costs. DeepSeek's subsequent price drop makes the possibility of such cost differences even greater.

Geopolitical headwinds and compliance defenses

Widespread enterprise adoption of Chinese models faces massive geopolitical headwinds in the West. For highly regulated U.S. giants in finance, healthcare, and defense, getting comfortable with DeepSeek will take time. 
Even though an open-weights architecture under an MIT license allows a company to self-host the model locally and prevent active data exfiltration to foreign servers, corporate compliance boards remain deeply paranoid over software supply chain risks, potential hidden backdoors, and the legal threat of sudden federal sanctions. Smaller, more nimble software teams, on the other hand, face far less bureaucratic gridlock. Free from multi-month security review cycles, these fast-moving organizations view the immediate 75% infrastructure savings as a massive competitive edge worth deploying right now.

The OpenRouter clearinghouse: mapping global token traffic

Take the token usage metrics on OpenRouter, a leading public proxy for what models are the most popular among developers. OpenRouter gives developers an easy way to compare and deploy models, and while its data is by no means a full proxy for real model popularity, it confirms this structural migration is already taking place within company data pipelines.

DeepSeek's V4 Flash model has captured the No. 1 position on the OpenRouter leaderboard over the past week, surging 48% in token usage. Its advanced counterpart, V4 Pro, sits at No. 6. DeepSeek's top three models processed nearly 6 trillion tokens on OpenRouter over the past week, giving it a huge lead over other competitors. For example, OpenAI's premium model, GPT-5.5, has slipped down to No. 15 at 470B tokens.

It's not clear exactly how much of the world's token traffic is on OpenRouter. Conservative estimates put it at about 3%. It does not show the massive amounts of tokens being served by the APIs offered directly to developers by companies like Anthropic, OpenAI and Google. But recent estimates suggest OpenRouter processes between 15 and 40% of each of OpenAI's and Google's token usage, and growing, making it a significant indicator of relative trends regardless of the exact percentage it represents. 
While skeptics often dismiss aggregator traffic as an indie developer signal rather than a reflection of Fortune 500 IT spend, the corporate pipeline reality is shifting. An infrastructure analysis by a leading venture capital firm, Andreessen Horowitz, revealed that enterprise production environments deploy a median of 14 different models simultaneously to price-route workloads and avoid single-vendor lock-in. This structural architecture shift is why OpenRouter recently secured a massive $113 million Series B funding round backed directly by the big enterprise data and software vendors that serve corporate America -- including ServiceNow Ventures, Snowflake Ventures, Databricks Ventures, Nvidia's NVentures, and Google's CapitalG. Stripe also cited OpenRouter's enterprise customers in its decision to partner closely with the company. That's why DeepSeek's surge on this leaderboard is so eye-opening. DeepSeek itself offers an API directly to developers, and so it too delivers more token traffic than what OpenRouter lets on.

Beyond chatbots: the rise of multi-step autonomous agents

The DeepSeek spike on OpenRouter indicates a deeper structural shift in how automated software architectures consume machine intelligence. Technical teams are moving beyond using trivial, single-turn chatbots, and starting to deploy more sophisticated autonomous agents that persist for hours at a time -- recursively looping through codebases and data lakes. Their huge number of tool calls, and continuous rereading of long context histories, means AI token consumption expands exponentially. Running these recursive loops on closed, premium Western APIs quickly creates unsustainable infrastructure costs. While corporate tech teams spent last year experimenting freely with early, single-turn prototypes without worrying about budgets, the onset of token-prolific autonomous agents has triggered an enterprise line-item crisis. 
VentureBeat's Q1 2026 research, which surveyed enterprise users at organizations with over 100 employees (n=65, in the U.S. software, finance and healthcare industries), confirms the shift: "Cost per token or licensing model" jumped from 25.4% in January to 36.7% in March, trailing only raw performance as the primary selection criterion for enterprise buyers.

DeepSeek target-optimized its weights for this specific trend of agentic high-token use. It has locked in on a standard input cost of $0.435 per million tokens and a standard output rate of $0.87 per million tokens, alongside a rock-bottom prefix-cached read cost of $0.003625 per million. It's this third cost item -- for cache -- which is arguably the most significant.

"If you measure how all of these agents now are using tokens, 80 to 90% of the tokens are cache-read tokens," said Val Bercovici, Chief AI Officer at WEKA, a company that provides fast storage for much of this cache. "Which means that [that price] is almost by far the most important price, making the others irrelevant -- nearly a rounding error. So what DeepSeek did is not just say we're going to be 5% cheaper, 10% cheaper, 20% cheaper. They're like 87x cheaper on that cache-read price with DeepSeek V4 Pro. So that's really set the industry on notice."

The infrastructure coup: Decoupling HBM from Context

DeepSeek's core innovations are around hardware-software alignment. This is where we get a little technical. Western frontier labs like OpenAI have prioritized performance at all costs, investing billions into uncompressed "dense" neural architectures. DeepSeek, by contrast, has systematically sought to extract maximum intelligence from lower-grade hardware, given that it has lacked access to Nvidia's GPUs. 
By pioneering deep software optimizations as early as its V2 architectures in 2024, the lab engineered a series of four interconnected hardware-software alignment breakthroughs that decoupled a model's operational context from expensive computing overhead:

Breakthrough 1: Sequence Dimension Compression via CSA and HCA

The transformer architecture that most LLMs use is bottlenecked by something called the Key-Value (KV) cache. As an agent executes long, multi-step sessions, historical context keys clog the high-bandwidth memory (HBM) on the GPU, causing severe latency spikes and an expensive infrastructure tax. DeepSeek resolved this structural bottleneck by introducing a hybrid attention mechanism -- documented in the DeepSeek V4 Architecture Paper -- that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to cut overall KV-cache usage by a massive 90% across its 1-million-token context window. While traditional models try to keep a unique memory log for every individual word, DeepSeek compresses the rows of its memory cache. CSA acts as a local filter, condensing small windows of text into concise, indexable blocks so the model doesn't sweat the fine-grained details. HCA acts as an aggressive global index, crushing massive spans of text deep within a session's history into high-density summaries. By interleaving these layers, DeepSeek shrinks millions of memory rows down to a fraction of their size.

Breakthrough 2: Native memory offloading via Multi-head Latent Attention (MLA)

Using something called Multi-head Latent Attention (MLA), DeepSeek strips the active memory footprint of its context history down to a fraction of standard models. It achieves this by running a physical division of labor between hardware chips. While traditional models force expensive GPUs to hold a session's entire history, DeepSeek's architecture keeps only the tiny, highly compressed search index tags (the Keys) on the GPU. 
Meanwhile, it offloads the heavy data payloads (the Values) entirely into cheaper system memory and local storage tiers. Once the GPU handles the high-speed matching to find relevant data, it calls the values from storage only on an as-needed basis.

DeepSeek's architecture is so different that the inference engines that load an AI model's weights into GPU memory, in order to be ready for prompting, are being stretched. The three most popular engines -- Nvidia TensorRT-LLM, the UC Berkeley-born SGLang, and the very popular vLLM -- "are all being stretched to keep up with being able to offer it, which is not normal," explains WEKA's Bercovici. "Every other open model has had some similarity to other open models. This one from DeepSeek is just built different."

DeepSeek's software engineering means its massive 1.6-trillion parameter model requires an astonishingly tiny 5.48 GB of HBM to hold a 1-million-token context loop in production, according to calculations by an analyst using hardware modeling benchmarks. For comparison, smaller models utilizing standard Western architectures consume up to 89 GB of HBM under the exact same context load.

DeepSeek's extreme compression of the KV cache down to 5.48 GB of HBM is also a calculated geopolitical strategy to bypass U.S. export bans on top-tier Nvidia GPUs. By reducing the need for HBM and Nvidia's CUDA ecosystem, DeepSeek's software design allows frontier AI to run efficiently on domestic, lower-cost, and unsanctioned Chinese storage tiers like NAND flash, commodity SSDs, and LPDDR memory (produced by domestic giants like YMTC and CXMT).

Breakthrough 3: Ultra-Low Footprint Inference via FP4 Quantization-Aware Training (QAT)

To keep compute costs low over massive context windows, DeepSeek moved away from the old approach of scanning bulky, uncompressed numbers every time the model searches its memory. 
Instead, as detailed in the DeepSeek V4 Technical Report, the architecture runs an advanced form of data compression directly on the active pathways it uses to find information during training. This compression slashes memory demands to deliver a 2x hardware speedup, yet it maintains a near-flawless 99.7% accuracy in how the system targets and indexes specific data blocks. This engineering win allows enterprise workflows to process massive, multi-step agent tasks smoothly while keeping an exceptional 83.5% retrieval accuracy on extreme, million-token "needle-in-a-haystack" benchmarks -- eliminating performance lags without draining expensive GPU power.

Breakthrough 4: Ultra-scale training stability via manifold-constrained hyper-connections (mHC)

Training a 1.6-trillion parameter model carries instability risk: signals passing through too many data pathways can cascade out of control and crash the run. DeepSeek resolved this with a framework called Manifold-Constrained Hyper-Connections (mHC), which uses a balancing routine to force the model's internal data tables to always sum to one -- a mathematical safety valve that lets complex data move through deep networks without runaway spikes.

The infrastructure pivot: rebuilding corporate plumbing

DeepSeek's significant architectural cache efficiency alters the underlying unit economics for the cloud platforms hosting these models. On developer aggregators like OpenRouter, where third-party providers routinely offer advanced endpoints at a loss to capture developer mindshare, this hardware-software decoupling alters the balance sheet. DeepSeek's extremely low cost likely still leaves the company a profit, at least when it comes to serving the model in China, Bercovici said. This transformation in provider-side unit economics is mirrored on the buy-side, which shows a structural change happening across enterprise IT budgets. 
VentureBeat's Q1 2026 AI Infrastructure and Compute tracker survey -- which tracks enterprise technology buyers at organizations with over 100 employees (n=53 in January, n=39 in February) across software, financial services, healthcare, and manufacturing sectors -- revealed that enterprise adoption of custom, self-managed inference stacks utilizing open-source frameworks like Triton, vLLM, Ray, and Kubernetes surged from 11.3% to 17.9%. Because these software layers allow corporate engineering teams to deploy open-weights architectures natively across their own clusters, they act as an operational escape hatch from closed cloud ecosystems. This software shift is paired with an aggressive hardware migration: enterprise workloads moving to specialized, inference-first AI clouds like CoreWeave, Lambda, and Crusoe grew from 30.2% to 35.9% in the latest survey window. These infrastructure metrics indicate that corporate technology leaders are no longer just prototyping with open alternatives; they are actively laying down the physical plumbing required to host architectures like DeepSeek V4 independently, increasingly pricing away the premium markup of Western API gatekeepers.

The strategic split for Western labs

This baseline cost reduction could soon fracture the competitive field in Silicon Valley, by rewriting the expectations for labs attempting to yield a return on massive infrastructure investments. For now, though, the Silicon Valley music is unlikely to stop anytime soon. Anthropic remains on an extraordinary enterprise trajectory, driven by widespread adoption of Claude Code and its codebase-aware terminal execution. For enterprise engineering teams, paying a premium for Anthropic's deterministic accuracy makes perfect sense for core production software development. 
Yet even an elite frontier lab scaling at this pace must watch DeepSeek with caution: an open-weights architecture under an MIT license offering near-frontier utility at a 75% cost reduction places downward pricing pressure on the high-volume operational layers of any multi-agent system. The primary structural margin squeeze may land more squarely on OpenAI, despite its aggressive pivot toward a multi-cloud footprint. To support its staggering consumer and API token volumes, OpenAI fundamentally altered its historic seven-year exclusive alliance with Microsoft, unbundling its distribution so it can serve models across Azure, Oracle, AWS, and Google Cloud. Yet this multi-cloud strategy, while providing raw capacity at scale, leaves the company intensely exposed to infrastructure commodity pressure. Unlike Anthropic, which has successfully insulated its margins by embedding its models into premium, high-utility software environments like Claude Code, a massive portion of OpenAI's enterprise revenue relies on high-volume, general-purpose API token streams. To be fair, Western labs have already begun quietly retreating from this territory -- aggressively launching deep batch API discounts, prompt caching features, and lightweight entry models to stem the bleed. Yet this tactical retreat only reinforces the structural crisis: Silicon Valley is actively conceding the high-volume commodity layer because they know they cannot defend its margins. When those exact same automated background workflows can be handled natively by highly intelligent open weights like DeepSeek V4, defending a premium price point for raw cloud text completion ceases to be a defensible strategy. More significantly, unlike OpenAI or Anthropic, DeepSeek has much less interest in urgently building consumer wrappers or locking developers into subscription frameworks. Instead, DeepSeek is positioned for a longer-term ecosystem play. 
Supported by a massive state-backed funding round led by China's "Big Fund" -- which has pushed the startup's targeted valuation into the $10 billion to $45 billion range -- the lab's more likely objective is to prove the viability of a self-sufficient, independent Chinese AI hardware stack that could one day be worth up to $10 trillion.

The operational division between Western labs and models like DeepSeek V4 Pro is already showing up. Financial company Ramp benchmarked automated cybersecurity agent swarms, and showed that while DeepSeek V4 Pro completely flatlines on the most complex security logic, it achieves a flawless 100% detection rate on high-volume baseline tasks like cloud configuration triage -- significantly outperforming OpenAI's GPT-5.5 (44%). For an enterprise CISO, the strategy is clear: You offload the high-volume token burn of routine background noise to cheap open weights, and reserve premium frontier models strictly for the high-level reasoning required to catch the most sophisticated flaws.

The enterprise verdict

For IT operations directors and data pipeline managers, the choice to migrate to an open architecture like DeepSeek V4-Pro is a smart governance decision. The open model gives companies total architecture control, allowing them to host it on-premise or via any specialized cloud layer they choose. Crucially, it provides enterprise infrastructure leads with a strategic operational fallback that closed vendors can't match: the power to download raw model weights and execute them privately for zero marginal token cost if public cloud pricing or API access conditions change. The assumption that closed frontier labs hold a permanent monopoly on useful enterprise reasoning has collapsed. While engineering directors will continue to pay a premium to protect specialized, deterministic workflows, the financial foundation of the frontier lab model has fundamentally shifted. 
By diverting the immense, day-to-day token volume of recursive background agents onto highly optimized, open-source clusters, enterprise teams are starving proprietary clouds of their highest-margin fuel. Silicon Valley's multi-billion dollar token moat didn't just narrow -- it was completely drained from the bottom up.
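Returning to Breakthrough 4 above: the article describes mHC as using "a balancing routine" that forces internal tables to sum to one. A standard routine with that property is Sinkhorn-Knopp normalization, which alternately rescales the rows and columns of a positive matrix until every row and every column sums to roughly one. The sketch below illustrates that routine in plain Python. That mHC uses exactly this iteration is an assumption on our part, and the function name and iteration count are illustrative.

```python
def sinkhorn_normalize(matrix, iterations=50):
    """Alternately rescale rows and columns of a positive square matrix.

    Illustrative sketch only: assumes every entry is strictly positive,
    so no row or column sum is ever zero.
    """
    m = [row[:] for row in matrix]  # work on a copy, leave the input untouched
    for _ in range(iterations):
        # Row step: divide each row by its own sum.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: divide each entry by the sum of its column.
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / col_sums[j] for j, v in enumerate(row)] for row in m]
    return m


balanced = sinkhorn_normalize([[1.0, 2.0], [3.0, 4.0]])
row_sums = [sum(row) for row in balanced]
col_sums = [sum(col) for col in zip(*balanced)]
print(row_sums, col_sums)
```

Because each row and column of the balanced matrix sums to about one, mixing signals with it neither amplifies nor shrinks their total, which is the "safety valve" behavior the article describes.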
[4]
DeepSeek, Xiaomi Just Made Frontier AI 99% Cheaper. American Labs Went the Other Way - Decrypt
OpenAI's GPT-5.5 doubled output prices to $30 per million tokens at launch, and Anthropic's Claude Opus 4.7 shipped with an updated tokenizer that can inflate actual costs by up to 35%. DeepSeek made the 75% discount on DeepSeek V4-Pro, which was set to expire, permanent earlier this week. And now fellow Chinese AI lab Xiaomi slashed MiMo-V2.5 API prices by up to 99% for cached inputs. Two of the most capable AI models on the market just got aggressively cheaper, while American labs moved in the opposite direction.

Quick explainer for the non-developers in the room: When you use ChatGPT or Claude in a browser, you're paying a flat subscription -- or nothing. When a company builds a product on top of an AI model, they pay per token, where a token is roughly three-quarters of a word. Every message sent, every reply generated, every document processed: all of it adds up at a rate measured in millions of tokens. An API is the raw pipe that makes this possible, letting an app, an agent, a website, etc. use the model in its own environment. So token pricing determines whether an AI-powered product is economically viable or a money pit.

Token plans are a subscription wrapper on top of that. You buy credits upfront; the model eats through them. Xiaomi's billing upgrade gives users 5 to 8 times more tokens at the same price. The Max plan at $100 now gets you 82 billion tokens, up from 1.6 billion. For context, 82 billion tokens is more than 60 billion words.

Why the cuts are real, not marketing

Fuli Luo, head of Xiaomi's MiMo team and a former core DeepSeek developer who co-built DeepSeek-V2, published a technical explanation on X. The biggest savings come from a smarter way of storing and reusing information the AI has already processed. Instead of repeatedly doing the same work, Xiaomi's system can remember much more data at once -- about five times more than before. 
That means the AI needs far less computing power, cutting storage and processing costs by around 80%. "Operating at these newly reduced API prices, our production inference engine is running at near full capacity, and we can still essentially break even," Luo wrote. "If more architectures that save compute and KV [Key-Value] cache emerge, along with better inference Infra to drive down API costs, this will form an excellent virtuous cycle in the industry."

DeepSeek's architecture lands in the same place differently. V4 uses two interleaved attention types -- one compressing every four tokens for selective attention, another collapsing every 128 tokens for global context at minimal compute. At one million tokens of context, V4-Pro's KV cache is 10% the size of its predecessor's, and single-token inference runs at 27% of the previous compute cost. The result is a model 98% cheaper than GPT-5.5 Pro with competitive performance.

Silicon Valley's bet

Claude Opus 4.7 costs $5 per million input tokens and $25 per million output tokens. Anthropic kept the rate card flat but shipped it with a new tokenizer that can produce up to 35% more tokens for the same input text. So the price didn't go up. Your bill still might. GPT-5.5, released in late April, just doubled its predecessor's output price to $30 per million tokens. Gemini 2.5 Pro sits at $1.25 input and $10 output -- cheap by American standards.

DeepSeek V4-Pro is a 1.6 trillion parameter model that gives you the knowledge base of a massive model at a fraction of the compute cost. It now permanently runs at $0.435 input and $0.87 output per million tokens. That's a model that scored 80.6% on SWE-bench Verified against Claude Opus 4.6's 80.8% -- a benchmark measuring real GitHub issue resolution, not cherry-picked demos. The pricing gap between models with essentially the same coding score: 34x on output. MiMo-V2.5-Pro matches that same $0.435/$0.87 per million tokens after the new cuts. Cache hits drop to $0.0036. 
For context, that's cheaper per token than most people pay per character in an SMS.

DeepSeek and Xiaomi aren't alone

These cuts landed in a market where Chinese models were already much cheaper before any of this. MiniMax M2.7, which trades punches with Claude Opus on coding benchmarks per Artificial Analysis, costs $0.30 input and $1.20 output per million tokens -- about 5% of Opus 4.7's output rate. Kimi K2.5 from Moonshot AI, with 76.8% on SWE-bench Verified, runs $0.60 input and $2.50 output. GLM-5.1 from Z.AI beat Claude Opus 4.6 on a key coding benchmark earlier this quarter. Four Chinese frontier models shipped in a 12-day window in early May, all under one-third of Opus 4.7's per-token cost.

[Chart: how Chinese models stack up against the three most popular American AI providers (Anthropic, OpenAI, and Meta) in terms of price-to-quality ratio.]

The Q2 2026 gap between Chinese and American frontier models sits at 15-30x, depending on which models you compare -- and that's the baseline, before any cache discounts. What this week's cuts do is collapse that gap further for the specific workloads that actually run in production: agent pipelines with stable system prompts, document processors, retrieval tools, things that hit cache constantly. At $0.003625 per million cached input tokens, DeepSeek V4-Pro's cost for repeated context is functionally a rounding error.
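To see why the cached-input rate matters so much for workloads that "hit cache constantly," here is a minimal cost sketch using the three DeepSeek V4-Pro rates quoted above. The workload size (1 billion input tokens, 20 million output tokens) and the 90% cache-hit rate are hypothetical, chosen to sit within the 80 to 90% cache-read share cited in source [3]. The function and constant names are illustrative.

```python
# DeepSeek V4-Pro rates in USD per million tokens, as quoted above.
INPUT_PRICE = 0.435          # uncached input
OUTPUT_PRICE = 0.87          # output
CACHE_READ_PRICE = 0.003625  # cached input


def workload_cost(input_tokens, output_tokens, cache_hit_rate):
    """Return cost in USD; cache_hit_rate is the share of input tokens read from cache."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    total = (
        cached * CACHE_READ_PRICE
        + uncached * INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    )
    return total / 1_000_000


# Hypothetical agent workload: 1B input tokens, 20M output tokens.
no_cache = workload_cost(1_000_000_000, 20_000_000, 0.0)
with_cache = workload_cost(1_000_000_000, 20_000_000, 0.9)
print(f"no cache: ${no_cache:.2f}, 90% cache hits: ${with_cache:.2f}")
```

Under these assumptions the bill falls from about $452 to about $64, and most of what remains is the uncached input and the output, not the cached reads.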
[5]
China's DeepSeek to make permanent 75% price cut on flagship V4‑Pro AI model - The Economic Times
Chinese artificial intelligence startup DeepSeek will make permanent a 75% price cut on its flagship V4‑Pro artificial intelligence model, keeping prices at a quarter of their original level, the company said in a statement on Saturday. DeepSeek did not disclose whether the permanent price cut was due to increased supply of Huawei's Ascend 950 chips, which it used to maximize V4's performance. The company cut V4‑Pro API costs to between 0.025 and 6 yuan per million tokens (about $0.0035 to $0.83) depending on usage type, from 0.1 to 24 yuan previously, the statement said. A "token" is a unit of text processed by the AI model. Huawei's AI chip sales have benefited from U.S. export controls that prevent Nvidia from selling its most advanced semiconductors in China, although separate curbs on chipmaking equipment exports have limited Huawei's ability to scale up Ascend production. When DeepSeek launched V4 last month, it said the Pro version would cost up to 12 times more than the less powerful Flash version due to "constraints in high-end compute capacity," limiting availability. It also said Pro pricing was expected to fall sharply once Huawei Ascend 950 supernodes are launched in large quantities in the second half of the year.
Chinese AI startup DeepSeek has permanently slashed prices on its flagship V4 Pro model by 75%, bringing costs down to $0.87 per million output tokens. The aggressive move undercuts OpenAI's GPT-5 and Anthropic's Claude by up to 34x, forcing a reckoning in the enterprise AI market as token costs become the binding constraint on adoption.
DeepSeek has made permanent the 75% discount on its flagship V4 Pro model, a promotion originally scheduled to expire on May 31, 2026 [1]. The Chinese AI startup now offers pricing ranging from $0.003625 to $0.87 per million tokens, down from the previous range of $0.0145 to $3.48 [2]. This permanent price cut arrives just one month after DeepSeek released its V4 models, which the company claimed would usher in the "era of cost-effective 1M context length" [1].
The decision to lock in the discount signals that DeepSeek is prioritizing market share over per-unit revenue, positioning itself as the default choice for applications processing large documents, codebases, or conversational histories where token costs compound quickly [2]. For enterprise accounts consuming millions of tokens daily, the savings are material. Salesforce projects $300 million in Anthropic token spending this year, and at DeepSeek's new pricing, an equivalent volume would cost a fraction of that figure [2].

The pricing gap between DeepSeek and Western competitors is striking. OpenAI's GPT-5 charges $2.50 per million input tokens and $10 per million output tokens, while Anthropic's Claude Opus 4.7 is priced at $5 input and $25 output [2]. Google's Gemini 3.5 Flash, its cost-optimized model, charges $0.15 input and $0.60 output per million tokens [2]. DeepSeek V4 Pro's top rate of $0.87 per million tokens sits far below the GPT-5 and Claude Opus 4.7 output prices, though not below Gemini 3.5 Flash's $0.60, and the gap on outputs reaches 34x against models with essentially the same coding performance [2][4].

This API price reduction directly undercuts comparable Western models used as workhorses for enterprise production. DeepSeek is 7x cheaper on inputs and 17x cheaper on outputs than Claude Sonnet or OpenAI's GPT 5.5-Med [3]. The lightweight DeepSeek V4 Flash undercuts entry-tier alternatives like Claude Haiku by 10x to 25x [3]. Meanwhile, OpenAI's GPT-5.5 doubled output prices to $30 per million tokens at launch, and Anthropic's Claude Opus 4.7 shipped with an updated tokenizer that can produce up to 35% more tokens for the same text, raising bills even though the rate card stayed flat [4].

The price cuts are enabled by hardware-software innovations, especially around cache, that make DeepSeek's models radically more efficient to run [3]. When hosted natively in China, DeepSeek's cache-read pricing is 87x cheaper than Western clouds, a deflationary floor so aggressive that handset giant Xiaomi just moved to match the exact pricing tier for its newly deployed MiMo architecture [3].
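The output-price multiples quoted above can be re-derived from the listed rates. The sketch below is illustrative only: the prices are the ones quoted in the text, while the function names and the way tokenizer inflation is modelled (a flat rate applied to up to 35% more tokens for the same text) are assumptions made for this example.

```python
# Output prices in USD per million tokens, as quoted in the text.
OUTPUT_PRICE = {
    "DeepSeek V4 Pro": 0.87,
    "Gemini 3.5 Flash": 0.60,
    "GPT-5": 10.00,
    "Claude Opus 4.7": 25.00,
    "GPT-5.5": 30.00,
}


def output_ratio(model: str, baseline: str = "DeepSeek V4 Pro") -> float:
    """How many times more expensive `model` is than `baseline` on output tokens."""
    return OUTPUT_PRICE[model] / OUTPUT_PRICE[baseline]


def effective_price(nominal: float, token_inflation: float) -> float:
    """Cost of the same text when a new tokenizer emits `token_inflation` more tokens.

    Assumption: the per-token rate is unchanged, so cost scales with token count.
    """
    return nominal * (1.0 + token_inflation)


if __name__ == "__main__":
    for model in OUTPUT_PRICE:
        print(f"{model}: {output_ratio(model):.2f}x DeepSeek V4 Pro output price")
    worst_case = effective_price(OUTPUT_PRICE["Claude Opus 4.7"], 0.35)
    print(f"Opus 4.7 worst-case effective output price: ${worst_case:.2f}")
```

On these figures the 34x multiple corresponds to GPT-5.5's $30 output rate (30 / 0.87 is roughly 34.5), Claude Opus 4.7 comes out near 28.7x, and Gemini 3.5 Flash lands below 1x because its listed output price is lower than $0.87.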
DeepSeek V4 uses two interleaved attention types: one compresses every four tokens for selective attention, and the other collapses every 128 tokens for global context at minimal compute. At one million tokens of context, V4 Pro's KV cache is 10% the size of its predecessor's, and single-token inference runs at 27% of the previous compute cost [4]. DeepSeek did not disclose whether the permanent price cut was due to increased supply of Huawei Ascend 950 chips, which it used to maximize V4's performance [5].

Companies are already feeling the pressure of high API costs. Uber burned through its entire 2026 budget for Claude Code and Cursor in just the first four months of the year, with its COO saying the cost related to high token usage was getting "harder to justify" without better products to show for it [3]. Pinterest CTO Matt Madrigal confirmed the company went all-in on an open-source AI strategy, post-training Alibaba's Qwen model to achieve frontier-like quality at a 90% reduction in costs [3].

DeepSeek V4 Flash has captured the No. 1 position on the OpenRouter leaderboard over the past week, surging 48% in token usage. Its advanced counterpart, V4 Pro, sits at No. 6. DeepSeek's top three models processed nearly 6 trillion tokens on OpenRouter over the past week [3]. This structural migration confirms that developers are actively routing workloads to the cheapest capable models.

The question for enterprise buyers is whether DeepSeek's model quality, reliability, and compliance posture justify the switch. The price advantage may be offset by geopolitical considerations of routing sensitive workloads through a Chinese AI provider [2]. For highly regulated U.S. giants in finance, healthcare, and defense, getting comfortable with DeepSeek will take time, despite an open-weights architecture under an MIT license that allows companies to self-host the model locally [3].
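The routing trade-off described above can be sketched as a simple blended-price calculation. This is a rough illustration, not a description of any company's actual setup: the two output prices are the ones quoted in this article ($25 for Claude Opus 4.7, $0.87 for DeepSeek V4 Pro), while the function names and the 70% routing share are hypothetical.

```python
def blended_price(premium_price: float, budget_price: float, budget_share: float) -> float:
    """Average USD per million output tokens when `budget_share` of tokens
    is routed to the cheaper model and the rest stays on the premium one."""
    if not 0.0 <= budget_share <= 1.0:
        raise ValueError("budget_share must be between 0 and 1")
    return budget_share * budget_price + (1.0 - budget_share) * premium_price


def savings(premium_price: float, budget_price: float, budget_share: float) -> float:
    """Fractional reduction in spend versus sending everything to the premium model."""
    return 1.0 - blended_price(premium_price, budget_price, budget_share) / premium_price


if __name__ == "__main__":
    opus_output = 25.00      # USD per million output tokens, as quoted in the article
    deepseek_output = 0.87   # USD per million output tokens, as quoted in the article
    share = 0.70             # hypothetical share of tokens sent to the cheaper model
    print(f"blended: ${blended_price(opus_output, deepseek_output, share):.3f} per million")
    print(f"savings: {savings(opus_output, deepseek_output, share):.1%}")
```

Because the cheaper rate is so small, the savings track the routed share almost one for one: moving 70% of output tokens cuts spend by roughly 68% in this example. The sketch ignores input tokens, quality differences, and compliance costs, which are exactly the factors the paragraph above says buyers must weigh.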
The competitive dynamics are complicated by Anthropic's public accusation that DeepSeek has engaged in "distillation attacks," improperly training on Claude's responses to improve its own models [1][2]. If substantiated, the accusation would mean some of DeepSeek's capability advantage was built on Anthropic's research investment, and the price differential would reflect intellectual property arbitrage rather than engineering efficiency [2].

Anthropic's annualized revenue surged from $9 billion to $30 billion between the end of 2025 and early April 2026, driven largely by enterprise adoption of Claude Code [2]. DeepSeek's pricing pressure threatens the revenue-per-token economics that support Anthropic's valuation trajectory. If enterprise customers begin routing lower-complexity tasks to DeepSeek while reserving Claude for high-stakes reasoning, Anthropic's token volume could hold while revenue per token declines [2].

The broader AI pricing landscape has been moving toward commoditization throughout 2026. Google has repeatedly cut Gemini prices to compete with open-weight models, and OpenAI's pivot toward consumer platform features reflects recognition that API token revenue alone may not sustain its $852 billion valuation [2]. DeepSeek's strategy appears to be that price will win, with enough volume flowing to the cheapest capable model regardless of origin. Whether that bet is correct depends on whether Western AI companies can close the price gap before DeepSeek closes the capability gap, or whether the market bifurcates into a Western tier and a Chinese tier with fundamentally different economics [2].

Summarized by Navi