2 Sources
[1]
US Government Says China's Best AI Models Lag Behind. Experts Aren't So Sure - Decrypt
Stanford's 2026 AI Index found the U.S.-China performance gap on public leaderboards had collapsed to 2.7%. A U.S. government institute published its verdict on China's most powerful AI: eight months behind, and the more time passes, the wider the gap gets. The internet read the methodology and started asking questions.

CAISI, the Center for AI Standards and Innovation, a unit inside NIST, released its evaluation of DeepSeek V4 Pro on May 1. The conclusion: DeepSeek's open-weight flagship "lags behind the frontier by about 8 months." CAISI also calls it the most capable Chinese AI model it has evaluated to date.

CAISI doesn't average benchmark scores like most evaluators do. Instead, it applies Item Response Theory, a statistical method from standardized testing, to estimate each model's latent capability by tracking which problems it solves and which it doesn't, across nine benchmarks in five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and math. The IRT-estimated Elo scores: GPT-5.5 at 1,260 points, Anthropic's Claude Opus 4.6 at 999, and DeepSeek V4 Pro around 800 (±28), close to GPT-5.4 mini at 749. In CAISI's system, DeepSeek sits closer to the old generation of GPT mini than to Opus.

The points system scores models the way standardized tests score students: not by raw percentage correct, but by weighting which problems they solve and which they miss, producing an estimate that only means something relative to other models in the same evaluation. More points means a more capable model, with the top model's score serving as the reference point for everything below it.

CAISI's results cannot be reproduced, because two of the nine benchmarks are non-public, and those two benchmarks are where the gap is widest. For example, GPT-5.5 scored 71% on CTF-Archive-Diamond, one of CAISI's cybersecurity tests, while DeepSeek registered around 32%.

On public benchmarks, the picture shifts. GPQA-Diamond, PhD-level science reasoning scored as percentage correct, placed DeepSeek at 90%, one point behind Opus 4.6's 91%. Math olympiad benchmarks (OTIS-AIME-2025, PUMaC 2024, SMT 2025) put DeepSeek at 97%, 96%, and 96%. On SWE-Bench Verified, real GitHub bug fixes scored as percentage resolved, DeepSeek scored 74% to GPT-5.5's 81%. DeepSeek's own technical report claims V4 Pro matches Opus 4.6 and GPT-5.4.

For the cost comparison, CAISI filtered out any U.S. model that performed significantly worse or cost significantly more per token than DeepSeek. Only one model cleared the bar: GPT-5.4 mini. That's the entire U.S. frontier, filtered to a single entry. DeepSeek came out cheaper on 5 of 7 benchmarks, even beating OpenAI's smallest and least capable model.

Criticizing CAISI's methodology doesn't fully vindicate DeepSeek. An AI developer writing under the pseudonym Ex0bit pushed back directly: "There's no 'gap', and no one's 8 months behind. We've been trolled on every closed U.S drop and flexed on with open weights." The Artificial Analysis Intelligence Index v4.0, a rating system tracking frontier model intelligence across 10 evaluations, shows OpenAI near 60 points and DeepSeek in the low 50s as of May 2026, compressed far tighter than a year ago. Based on standardized benchmarks, that methodology shows the gap actually getting smaller.

When DeepSeek first emerged in January 2025, the question was whether China had already caught up, and U.S. labs scrambled to respond. Stanford's 2026 AI Index, released April 13, reports the Arena leaderboard gap between Claude Opus 4.6 and China's Dola-Seed-2.0 Preview has shrunk to only 2.7%. CAISI plans to release a fuller IRT methodology write-up in the near future.
[2]
US Government report claims China 8 months behind: Is the AI lead real or a benchmark illusion?
The Center for AI Standards and Innovation (CAISI), an American government body, has published a report that places China's best AI model eight months behind the major American frontier models. Sounds convenient, right? I've been looking at this evaluation of DeepSeek V4 Pro, and the more I look, the shakier the eight-month headline appears. I am not saying the report is wrong, but when one of the competitors is doing the grading, a question is worth asking: who's grading the test, and who designed the questions?

CAISI ran DeepSeek V4 Pro through benchmarks covering the major sectors: cybersecurity, software engineering, abstract reasoning, and mathematics. Using a statistical method borrowed from psychometric testing, it placed the model at roughly the level of GPT-5, which launched about eight months before today's American frontier. Hence the claim of eight months behind. The methodology is described in detail, confidence intervals are displayed, and CAISI pre-committed to its benchmark suite before seeing the results. That is more than most evaluators care to do.

However, the benchmarks most damaging to DeepSeek's performance (PortBench, CTF-Archive-Diamond, and ARC-AGI-2 semi-private) are all internally developed by CAISI or private datasets, where independent verification is impossible. You cannot verify an experiment that you cannot see.

According to DeepSeek, V4 Pro is rated on par with Opus 4.6 and GPT-5.4, models that were released only two months ago, not eight. In addition, Artificial Analysis, an organization that provides AI capability evaluations independent of geopolitical interests, states that the gap between the US and China remains steady, not increasing. When a US government organization develops proprietary tests, runs a Chinese model through them, and declares that China is falling behind, there is no way to verify the claims. The figures may be accurate. But without verification it is not a scientific result; it's a credentialed opinion.

In CAISI's cost comparison, DeepSeek V4 Pro comes in cheaper than GPT-5.4 mini on five out of seven tests, sometimes by more than 50%. Cursor, one of the more popular AI coding assistants, built its own model on a Chinese open-weight base precisely because it was cheaper than OpenAI and Anthropic. Capability benchmarks measure only one characteristic of a model. Cost per useful task dictates scalability. By that standard, the gap is far closer than eight months.

Is the lead real then? Yes and no. The difference on ARC-AGI-2, GPT-5.5 at 79% and DeepSeek at 46%, cannot be disregarded, but "eight months behind" is an exact-sounding figure drawn from one competitor's internal comparisons of another. The US likely has a real but contested capability lead. China is winning on economics. Calling it a race assumes both sides are optimising for the same thing when they might not be.
A US government report claims China's best AI model trails American frontier models by eight months. But the methodology raises questions—two of nine benchmarks are non-public, and independent evaluators tell a different story. Stanford's AI Index shows the US-China performance gap has collapsed to just 2.7% on public leaderboards, while China wins decisively on cost-effectiveness.
The Center for AI Standards and Innovation (CAISI), a unit inside NIST, published its evaluation of DeepSeek V4 Pro on May 1, declaring that China's most capable AI model "lags behind the frontier by about 8 months."1
The US government AI report uses Item Response Theory—a statistical method borrowed from standardized testing—rather than averaging benchmark scores like most evaluators do. CAISI tested DeepSeek across nine benchmarks covering cybersecurity, software engineering, natural sciences, abstract reasoning, and math. The IRT-estimated Elo scores placed GPT-5.5 at 1,260 points, Claude Opus 4.6 at 999, and DeepSeek V4 Pro around 800 points, positioning it closer to GPT-5.4 mini at 749.1
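That weighting scheme is easier to see in a toy fit than in prose. The sketch below uses a 2-parameter logistic IRT model, the textbook version of the technique the article describes; the items, difficulties, discriminations, and pass/fail patterns are all invented for illustration, and CAISI's actual item pool and Elo rescaling are not public. What it shows: two models with identical raw scores can receive very different latent-ability estimates depending on which problems they solve.

```python
# Minimal 2PL IRT sketch, illustrating the idea behind an IRT-based
# evaluation. This is NOT CAISI's pipeline: every number below is invented.
#
# Under the 2-parameter logistic model, a model with latent ability `theta`
# solves an item with difficulty `b` and discrimination `a` with probability
# sigma(a * (theta - b)). Ability is the theta that maximizes the likelihood
# of the observed pass/fail pattern, so *which* items are solved matters,
# not just how many.

import math

def p_solve(theta: float, b: float, a: float) -> float:
    """2PL probability of solving an item of difficulty b, discrimination a."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(outcomes, items, iters=100):
    """Newton-Raphson MLE of ability from 0/1 outcomes on (b, a) items."""
    theta = 0.0
    for _ in range(iters):
        probs = [p_solve(theta, b, a) for b, a in items]
        grad = sum(a * (y - p) for (_, a), y, p in zip(items, outcomes, probs))
        info = sum(a * a * p * (1 - p) for (_, a), p in zip(items, probs))
        # Clamp the step so Newton stays stable far from the optimum.
        step = max(-1.0, min(1.0, grad / info))
        theta += step
        if abs(step) < 1e-9:
            break
    return theta

# Four hypothetical items: two easy with low discrimination, two hard
# with high discrimination, given as (difficulty b, discrimination a).
items = [(-1.0, 0.5), (-1.0, 0.5), (1.0, 2.0), (1.0, 2.0)]

# Both models score 2/4, but model B solves the hard items.
model_a = [1, 1, 0, 0]
model_b = [0, 0, 1, 1]

for name, outcomes in [("A", model_a), ("B", model_b)]:
    theta = estimate_ability(outcomes, items)
    print(f"model {name}: raw score {sum(outcomes)}/4, ability {theta:+.2f}")
# Same raw score, very different ability estimates, which is why IRT
# rankings can diverge from simple benchmark-average leaderboards.
```

This is also why such scores only mean something relative to the other models fitted in the same evaluation: the ability scale is anchored by the shared item parameters, not by any absolute yardstick.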

The methodology raises significant questions about independent verification. Two of the nine benchmarks CAISI used are non-public, making it impossible to reproduce the results.1
On these private datasets, the AI performance gap appears widest—GPT-5.5 scored 71% on CTF-Archive-Diamond, one of CAISI's cybersecurity tests, while DeepSeek registered around 32%.1
When a US government organization develops proprietary tests and declares Chinese AI is falling behind based on benchmarks others cannot verify, it raises questions about scientific rigor.2
On public benchmarks, the picture shifts dramatically. GPQA-Diamond—PhD-level science reasoning—placed DeepSeek at 90%, just one point behind Claude Opus 4.6's 91%.1
Math olympiad benchmarks put DeepSeek at 97%, 96%, and 96%. On SWE-Bench Verified, which tests real GitHub bug fixes in software engineering, DeepSeek scored 74% compared to GPT-5.5's 81%. DeepSeek's own technical report claims V4 Pro matches Opus 4.6 and GPT-5.4—AI frontier models released just two months ago, not eight.1
The Artificial Analysis Intelligence Index v4.0 shows OpenAI near 60 points and DeepSeek in the low 50s as of May 2026, compressed far tighter than a year ago.1
Stanford's 2026 AI Index, released April 13, reports the Arena leaderboard gap between Claude Opus 4.6 and China's Dola-Seed-2.0 Preview has shrunk to just 2.7%.1
When DeepSeek first emerged in January 2025, the question was whether Chinese AI had already caught up, prompting US labs to scramble in response. Based on standardized benchmarks tracked by independent evaluators, the AI capability lead is narrowing, not widening as CAISI suggests.1
Artificial Analysis, an organization providing evaluations independent of geopolitical interests, states the gap between US and China remains steady.2
In its token cost comparison, CAISI filtered out any US model that performed significantly worse or cost significantly more per token than DeepSeek. Only one model cleared the bar: GPT-5.4 mini from OpenAI.1
DeepSeek came out cheaper on five of seven benchmarks, even beating OpenAI's smallest model—sometimes by more than 50%.2
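The filtering step itself is mechanically simple, whatever thresholds CAISI actually used. Below is a hypothetical sketch of such a screen: candidates survive only if they are neither significantly weaker nor significantly pricier than the reference model. The capability scores echo the Elo figures quoted above, but the prices, model labels, and tolerance thresholds are assumptions invented for illustration, not CAISI's published criteria.

```python
# Rough sketch of the kind of screen the article attributes to CAISI's
# cost comparison: drop any model that is significantly worse than the
# reference on capability OR significantly pricier per token. Prices,
# labels, and thresholds below are hypothetical.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    score: float          # aggregate capability score (higher is better)
    usd_per_mtok: float   # blended price per million tokens (lower is better)

reference = Candidate("open-weight reference", 800.0, 0.50)

us_models = [
    Candidate("us-frontier-large", 1260.0, 6.00),  # capable but pricey
    Candidate("us-frontier-mid",   999.0,  4.00),  # capable but pricey
    Candidate("us-mini",           749.0,  0.60),  # close on both axes
    Candidate("us-legacy",         640.0,  0.40),  # cheap but much weaker
]

# Tolerances standing in for "significantly": pure assumptions.
MAX_SCORE_DEFICIT = 75.0   # points below the reference we tolerate
MAX_PRICE_RATIO = 1.5      # at most 1.5x the reference price per token

survivors = [
    m for m in us_models
    if m.score >= reference.score - MAX_SCORE_DEFICIT
    and m.usd_per_mtok <= reference.usd_per_mtok * MAX_PRICE_RATIO
]

print("models comparable on both capability and price:")
for m in survivors:
    print(f"  {m.name}: score {m.score:.0f}, ${m.usd_per_mtok:.2f}/Mtok")
# Under thresholds like these, only the mini-class model survives,
# mirroring the article's one-entry comparison set.
```

The interesting editorial question is not the code but the thresholds: "significantly" is doing all the work, and it is not publicly defined.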
Cursor, a popular AI coding assistant, built its own model on a Chinese open-weight model precisely because it was cheaper than OpenAI and Anthropic.2
Cost per useful task dictates scalability in real-world deployments, and by that standard, the gap is far closer than eight months. The US likely maintains a real but contested AI capability lead on certain tasks, particularly on ARC-AGI-2 tests where GPT-5.5 scored 79% compared to DeepSeek's 46%.2
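"Cost per useful task" reduces to one line of arithmetic: the price of an attempt divided by the probability the attempt succeeds. The sketch below uses the SWE-Bench Verified resolve rates quoted above (74% vs. 81%) but invented per-attempt prices, since per-task pricing varies with usage and is not given in either source.

```python
# Cost per useful task = price per attempt / probability of success.
# Success rates are the SWE-Bench Verified figures quoted in the article;
# the per-attempt prices are invented for illustration.

def cost_per_solved(price_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successfully resolved task."""
    return price_per_attempt / success_rate

cheap_open_weight = cost_per_solved(0.10, 0.74)  # 74% resolve rate
pricier_frontier = cost_per_solved(0.45, 0.81)   # 81% resolve rate

print(f"open-weight model: ${cheap_open_weight:.2f} per resolved task")
print(f"frontier model:    ${pricier_frontier:.2f} per resolved task")
# A model can trail on raw accuracy and still win by a wide margin on
# expected cost per completed task, which is the economics point here.
```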
However, calling it a race assumes both sides optimize for the same goals. China appears to prioritize cost-effectiveness and accessibility through open-weight models, while US labs focus on absolute capability on proprietary benchmarks. CAISI plans to release a fuller IRT methodology write-up in the near future, which may address some transparency concerns.1
For developers and enterprises choosing between AI frontier models, the decision increasingly hinges on whether they value marginal performance gains on specialized tasks or significant cost savings at near-competitive capability levels. Watch how leaderboards from Stanford and independent evaluators track this gap over the next six months—their public, reproducible benchmarks may prove more reliable indicators than government assessments built on private datasets.
Summarized by Navi