2 Sources
[1]
US Government Says China's Best AI Models Lag Behind. Experts Aren't So Sure - Decrypt
Stanford's 2026 AI Index found the U.S.-China performance gap on public leaderboards had collapsed to 2.7%. A U.S. government institute published its verdict on China's most powerful AI: eight months behind, and the more time passes, the wider the gap gets. The internet read the methodology and started asking questions.

CAISI, the Center for AI Standards and Innovation, a unit inside NIST, released its evaluation of DeepSeek V4 Pro on May 1. The conclusion: DeepSeek's open-weight flagship "lags behind the frontier by about 8 months." CAISI also calls it the most capable Chinese AI model it has evaluated to date.

CAISI doesn't average benchmark scores like most evaluators do. Instead, it applies Item Response Theory, a statistical method from standardized testing, to estimate each model's latent capability by tracking which problems it solves and which it doesn't, across nine benchmarks in five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and math. The IRT-estimated Elo scores: GPT-5.5 at 1,260 points, Anthropic's Claude Opus 4.6 at 999, and DeepSeek V4 Pro around 800 (±28), close to GPT-5.4 mini at 749. In CAISI's system, DeepSeek sits closer to the old generation of GPT mini than to Opus.

The points system scores models the way standardized tests score students: not by raw percentage correct, but by weighting which problems they solve and which they miss, producing an estimate that only means something relative to other models in the same evaluation. More points means a more capable model, with the top model's score serving as the reference point for everything below it.

CAISI's results cannot be reproduced, because two of the nine benchmarks are non-public, and those two benchmarks are where the gap is widest. For example, GPT-5.5 scored 71% on CTF-Archive-Diamond, one of CAISI's cybersecurity tests, while DeepSeek registered around 32%.

On public benchmarks, the picture shifts. GPQA-Diamond, PhD-level science reasoning scored as percentage correct, placed DeepSeek at 90%, one point behind Opus 4.6's 91%. Math olympiad benchmarks (OTIS-AIME-2025, PUMaC 2024, SMT 2025) put DeepSeek at 97%, 96%, and 96%. On SWE-Bench Verified, real GitHub bug fixes scored as percentage resolved, DeepSeek scored 74% to GPT-5.5's 81%. DeepSeek's own technical report claims V4 Pro matches Opus 4.6 and GPT-5.4.

For the cost comparison, CAISI filtered out any U.S. model that performed significantly worse or cost significantly more per token than DeepSeek. Only one model cleared the bar: GPT-5.4 mini. That's the entire U.S. frontier, filtered to a single entry. DeepSeek came out cheaper on 5 of 7 benchmarks, even beating OpenAI's smallest and least capable model.

Criticizing CAISI's methodology doesn't fully vindicate DeepSeek. An AI developer writing under the pseudonym Ex0bit pushed back directly: "There's no 'gap', and no one's 8 months behind. We've been trolled on every closed U.S drop and flexed on with open weights." The Artificial Analysis Intelligence Index v4.0, a rating system tracking frontier model intelligence across 10 evaluations, shows OpenAI near 60 points and DeepSeek in the low 50s as of May 2026, compressed far tighter than a year ago. Based on standardized benchmarks, that methodology shows the gap actually getting smaller.

When DeepSeek first emerged in January 2025, the question was whether China had already caught up, and U.S. labs scrambled to respond. Stanford's 2026 AI Index, released April 13, reports the Arena leaderboard gap between Claude Opus 4.6 and China's Dola-Seed-2.0 Preview has shrunk to only 2.7%. CAISI plans to release a fuller IRT methodology write-up in the near future.
[2]
US Government report claims China 8 months behind: Is the AI lead real or a benchmark illusion?
The Center for AI Standards and Innovation (CAISI), an American government body, has published a report that places China's best AI model eight months behind the major American frontier models. Sounds convenient, right? I've been looking at this evaluation of DeepSeek V4 Pro, and the more I look, the shakier the eight-month headline appears. I am not saying the report is wrong, but when one of the competitors is doing the grading, a question is worth asking: who's grading the test, and who designed the questions?

CAISI ran DeepSeek V4 Pro through benchmarks covering the major sectors: cybersecurity, software engineering, abstract reasoning, and mathematics. Using a statistical method borrowed from psychometric testing, it placed the model at roughly the level of GPT-5, which launched about eight months before today's American frontier. Hence the claim of eight months behind. The methodology is described in detail, confidence intervals are displayed, and CAISI pre-committed to its benchmark suite before seeing the results. That is more than most evaluators care to do.

However, the benchmarks most damaging to DeepSeek's performance (PortBench, CTF-Archive-Diamond, and ARC-AGI-2 semi-private) are all internally developed by CAISI or private datasets, where independent verification is impossible. You cannot verify an experiment that you cannot see.

According to DeepSeek, V4 Pro is rated on par with Opus 4.6 and GPT-5.4, models that were released only two months ago, not eight. In addition, Artificial Analysis, an organization that provides AI capability evaluations independent of geopolitical interests, states that the gap between the US and China remains steady, not increasing. When a US government organization develops proprietary tests, runs a Chinese model through them, and declares that China is falling behind, there is no way to verify the claims. The figures may be accurate. But without verification it is not a scientific result; it's a credentialed opinion.

In CAISI's cost comparison, DeepSeek V4 Pro comes in cheaper than GPT-5.4 mini on five out of seven tests, sometimes by more than 50%. Cursor, one of the more popular AI coding assistants, built its own model on a Chinese open-weight base precisely because it was cheaper than OpenAI and Anthropic. Capability benchmarks measure only one characteristic of a model. Cost per useful task dictates scalability. By that standard, the gap is far closer than eight months.

Is the lead real then? Yes and no. The difference on ARC-AGI-2, GPT-5.5 at 79% and DeepSeek at 46%, cannot be disregarded, but "eight months behind" is an exact-sounding figure drawn from one competitor's internal comparisons of another. The US likely has a real but contested capability lead. China is winning on economics. Calling it a race assumes both sides are optimising for the same thing when they might not be.
A US government report claims China's best AI model trails American frontier models by eight months. But the methodology raises questions—two of nine benchmarks are non-public, and independent evaluators tell a different story. Stanford's AI Index shows the US-China performance gap has collapsed to just 2.7% on public leaderboards, while China wins decisively on cost-effectiveness.
The Center for AI Standards and Innovation (CAISI), a unit inside NIST, published its evaluation of DeepSeek V4 Pro on May 1, declaring that China's most capable AI model "lags behind the frontier by about 8 months."1
The US government AI report uses Item Response Theory—a statistical method borrowed from standardized testing—rather than averaging benchmark scores like most evaluators do. CAISI tested DeepSeek across nine benchmarks covering cybersecurity, software engineering, natural sciences, abstract reasoning, and math. The IRT-estimated Elo scores placed GPT-5.5 at 1,260 points, Claude Opus 4.6 at 999, and DeepSeek V4 Pro around 800 points, positioning it closer to GPT-5.4 mini at 749.1
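That weighting scheme is easier to see in a toy fit than in prose. The sketch below uses a 2-parameter logistic IRT model, the textbook version of the technique the article describes; the items, difficulties, discriminations, and pass/fail patterns are all invented for illustration, and CAISI's actual item pool and Elo rescaling are not public. What it shows: two models with identical raw scores can receive very different latent-ability estimates depending on which problems they solve.

```python
# Minimal 2PL IRT sketch, illustrating the idea behind an IRT-based
# evaluation. This is NOT CAISI's pipeline: every number below is invented.
#
# Under the 2-parameter logistic model, a model with latent ability `theta`
# solves an item with difficulty `b` and discrimination `a` with probability
# sigma(a * (theta - b)). Ability is the theta that maximizes the likelihood
# of the observed pass/fail pattern, so *which* items are solved matters,
# not just how many.

import math

def p_solve(theta: float, b: float, a: float) -> float:
    """2PL probability of solving an item of difficulty b, discrimination a."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(outcomes, items, iters=100):
    """Newton-Raphson MLE of ability from 0/1 outcomes on (b, a) items."""
    theta = 0.0
    for _ in range(iters):
        probs = [p_solve(theta, b, a) for b, a in items]
        grad = sum(a * (y - p) for (_, a), y, p in zip(items, outcomes, probs))
        info = sum(a * a * p * (1 - p) for (_, a), p in zip(items, probs))
        # Clamp the step so Newton stays stable far from the optimum.
        step = max(-1.0, min(1.0, grad / info))
        theta += step
        if abs(step) < 1e-9:
            break
    return theta

# Four hypothetical items: two easy with low discrimination, two hard
# with high discrimination, given as (difficulty b, discrimination a).
items = [(-1.0, 0.5), (-1.0, 0.5), (1.0, 2.0), (1.0, 2.0)]

# Both models score 2/4, but model B solves the hard items.
model_a = [1, 1, 0, 0]
model_b = [0, 0, 1, 1]

for name, outcomes in [("A", model_a), ("B", model_b)]:
    theta = estimate_ability(outcomes, items)
    print(f"model {name}: raw score {sum(outcomes)}/4, ability {theta:+.2f}")
# Same raw score, very different ability estimates, which is why IRT
# rankings can diverge from simple benchmark-average leaderboards.
```

This is also why such scores only mean something relative to the other models fitted in the same evaluation: the ability scale is anchored by the shared item parameters, not by any absolute yardstick.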

The methodology raises significant questions about independent verification. Two of the nine benchmarks CAISI used are non-public, making it impossible to reproduce the results.1
On these private datasets, the AI performance gap appears widest—GPT-5.5 scored 71% on CTF-Archive-Diamond, one of CAISI's cybersecurity tests, while DeepSeek registered around 32%.1
When a US government organization develops proprietary tests and declares Chinese AI is falling behind based on benchmarks others cannot verify, it raises questions about scientific rigor.2
On public benchmarks, the picture shifts dramatically. GPQA-Diamond—PhD-level science reasoning—placed DeepSeek at 90%, just one point behind Claude Opus 4.6's 91%.1
Math olympiad benchmarks put DeepSeek at 97%, 96%, and 96%. On SWE-Bench Verified, which tests real GitHub bug fixes in software engineering, DeepSeek scored 74% compared to GPT-5.5's 81%. DeepSeek's own technical report claims V4 Pro matches Opus 4.6 and GPT-5.4—AI frontier models released just two months ago, not eight.1
The Artificial Analysis Intelligence Index v4.0 shows OpenAI near 60 points and DeepSeek in the low 50s as of May 2026, compressed far tighter than a year ago.1
Stanford's 2026 AI Index, released April 13, reports the Arena leaderboard gap between Claude Opus 4.6 and China's Dola-Seed-2.0 Preview has shrunk to just 2.7%.1
When DeepSeek first emerged in January 2025, the question was whether Chinese AI had already caught up, prompting US labs to scramble in response. Based on standardized benchmarks tracked by independent evaluators, the AI capability lead is narrowing, not widening as CAISI suggests.1
Artificial Analysis, an organization providing evaluations independent of geopolitical interests, states the gap between US and China remains steady.2
In its token cost comparison, CAISI filtered out any US model that performed significantly worse or cost significantly more per token than DeepSeek. Only one model cleared the bar: GPT-5.4 mini from OpenAI.1
DeepSeek came out cheaper on five of seven benchmarks, even beating OpenAI's smallest model—sometimes by more than 50%.2
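The filtering step itself is mechanically simple, whatever thresholds CAISI actually used. Below is a hypothetical sketch of such a screen: candidates survive only if they are neither significantly weaker nor significantly pricier than the reference model. The capability scores echo the Elo figures quoted above, but the prices, model labels, and tolerance thresholds are assumptions invented for illustration, not CAISI's published criteria.

```python
# Rough sketch of the kind of screen the article attributes to CAISI's
# cost comparison: drop any model that is significantly worse than the
# reference on capability OR significantly pricier per token. Prices,
# labels, and thresholds below are hypothetical.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    score: float          # aggregate capability score (higher is better)
    usd_per_mtok: float   # blended price per million tokens (lower is better)

reference = Candidate("open-weight reference", 800.0, 0.50)

us_models = [
    Candidate("us-frontier-large", 1260.0, 6.00),  # capable but pricey
    Candidate("us-frontier-mid",   999.0,  4.00),  # capable but pricey
    Candidate("us-mini",           749.0,  0.60),  # close on both axes
    Candidate("us-legacy",         640.0,  0.40),  # cheap but much weaker
]

# Tolerances standing in for "significantly": pure assumptions.
MAX_SCORE_DEFICIT = 75.0   # points below the reference we tolerate
MAX_PRICE_RATIO = 1.5      # at most 1.5x the reference price per token

survivors = [
    m for m in us_models
    if m.score >= reference.score - MAX_SCORE_DEFICIT
    and m.usd_per_mtok <= reference.usd_per_mtok * MAX_PRICE_RATIO
]

print("models comparable on both capability and price:")
for m in survivors:
    print(f"  {m.name}: score {m.score:.0f}, ${m.usd_per_mtok:.2f}/Mtok")
# Under thresholds like these, only the mini-class model survives,
# mirroring the article's one-entry comparison set.
```

The interesting editorial question is not the code but the thresholds: "significantly" is doing all the work, and it is not publicly defined.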
Cursor, a popular AI coding assistant, built its own model on a Chinese open-weight model precisely because it was cheaper than OpenAI and Anthropic.2
Cost per useful task dictates scalability in real-world deployments, and by that standard, the gap is far closer than eight months. The US likely maintains a real but contested AI capability lead on certain tasks, particularly on ARC-AGI-2 tests where GPT-5.5 scored 79% compared to DeepSeek's 46%.2
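"Cost per useful task" reduces to one line of arithmetic: the price of an attempt divided by the probability the attempt succeeds. The sketch below uses the SWE-Bench Verified resolve rates quoted above (74% vs. 81%) but invented per-attempt prices, since per-task pricing varies with usage and is not given in either source.

```python
# Cost per useful task = price per attempt / probability of success.
# Success rates are the SWE-Bench Verified figures quoted in the article;
# the per-attempt prices are invented for illustration.

def cost_per_solved(price_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successfully resolved task."""
    return price_per_attempt / success_rate

cheap_open_weight = cost_per_solved(0.10, 0.74)  # 74% resolve rate
pricier_frontier = cost_per_solved(0.45, 0.81)   # 81% resolve rate

print(f"open-weight model: ${cheap_open_weight:.2f} per resolved task")
print(f"frontier model:    ${pricier_frontier:.2f} per resolved task")
# A model can trail on raw accuracy and still win by a wide margin on
# expected cost per completed task, which is the economics point here.
```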
However, calling it a race assumes both sides optimize for the same goals. China appears to prioritize cost-effectiveness and accessibility through open-weight models, while US labs focus on absolute capability on proprietary benchmarks. CAISI plans to release a fuller IRT methodology write-up in the near future, which may address some transparency concerns.1
For developers and enterprises choosing between AI frontier models, the decision increasingly hinges on whether they value marginal performance gains on specialized tasks or significant cost savings at near-competitive capability levels. Watch how leaderboards from Stanford and independent evaluators track this gap over the next six months—their public, reproducible benchmarks may prove more reliable indicators than government assessments built on private datasets.
Summarized by Navi