US Government Claims China AI Lags 8 Months Behind, But Experts Question the Benchmarks

Reviewed by Nidhi Govil

A US government report claims China's best AI model trails American frontier models by eight months. But the methodology raises questions—two of nine benchmarks are non-public, and independent evaluators tell a different story. Stanford's AI Index shows the US-China performance gap has collapsed to just 2.7% on public leaderboards, while China wins decisively on cost-effectiveness.

US Government AI Report Claims Eight-Month Gap

The Center for AI Standards and Innovation (CAISI), a unit inside NIST, published its evaluation of DeepSeek V4 Pro on May 1, declaring that China's most capable AI model "lags behind the frontier by about 8 months." [1]

The US government AI report uses Item Response Theory—a statistical method borrowed from standardized testing—rather than averaging AI benchmarks like most evaluators. CAISI tested DeepSeek across nine benchmarks covering cybersecurity, software engineering, natural sciences, abstract reasoning, and math. The IRT-estimated Elo scores placed GPT-5.5 at 1,260 points, Claude Opus 4.6 at 999, and DeepSeek V4 Pro around 800 points, positioning it closer to GPT-5.4 mini at 749. [1]
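CAISI has not published its IRT code, so the following is only a minimal sketch of how a one-parameter (Rasch) IRT fit can rank models on a shared scale. The pass/fail matrix, model names, learning rate, and the 1,000-point Elo anchor are all invented for illustration; this is not CAISI's actual method or data.

```python
import math

# Invented pass/fail outcomes: rows = models, columns = benchmark items.
results = {
    "model_a": [1, 1, 1, 1, 0, 1],
    "model_b": [1, 0, 1, 0, 1, 0],
    "model_c": [0, 1, 0, 0, 0, 1],
}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(results, iters=2000, lr=0.05):
    """Gradient-ascent fit of P(model i solves item j) = sigmoid(theta_i - b_j)."""
    models = list(results)
    n_items = len(next(iter(results.values())))
    theta = {m: 0.0 for m in models}   # latent ability per model
    b = [0.0] * n_items                # latent difficulty per item
    for _ in range(iters):
        grad_t = {m: 0.0 for m in models}
        grad_b = [0.0] * n_items
        for m in models:
            for j, y in enumerate(results[m]):
                resid = y - sigmoid(theta[m] - b[j])
                grad_t[m] += resid     # d log-likelihood / d theta_i
                grad_b[j] -= resid     # d log-likelihood / d b_j
        for m in models:
            theta[m] += lr * grad_t[m]
        for j in range(n_items):
            b[j] += lr * grad_b[j]
        mean_b = sum(b) / n_items      # pin the scale's location
        b = [x - mean_b for x in b]
    return theta, b

theta, _ = fit_rasch(results)
# Abilities are in log-odds units; 400/ln(10) ~= 173.7 converts to Elo points.
elo = {m: round(1000 + 173.7 * t) for m, t in theta.items()}
```

The appeal of the latent-trait approach is that abilities remain comparable even when models were not run on identical item sets, which a plain average of scores cannot guarantee.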

Source: Digit

AI Model Evaluation Methodologies Under Scrutiny

The methodology raises significant questions about independent verification. Two of the nine benchmarks CAISI used are non-public, making it impossible to reproduce the results. [1]

On these private datasets, the AI performance gap appears widest—GPT-5.5 scored 71% on CTF-Archive-Diamond, one of CAISI's cybersecurity tests, while DeepSeek registered around 32%. [1]

When a US government organization develops proprietary tests and declares China AI is falling behind based on benchmarks others cannot verify, it raises questions about scientific rigor. [2]

Public Benchmarks Tell a Different Story About US vs China AI

On public benchmarks, the picture shifts dramatically. GPQA-Diamond—PhD-level science reasoning—placed DeepSeek at 90%, just one point behind Claude Opus 4.6's 91%. [1]

Math olympiad benchmarks put DeepSeek at 97%, 96%, and 96%. On SWE-Bench Verified, which tests real GitHub bug fixes in software engineering, DeepSeek scored 74% compared to GPT-5.5's 81%. DeepSeek's own technical report claims V4 Pro matches Opus 4.6 and GPT-5.4—AI frontier models released just two months ago, not eight. [1]

The Artificial Analysis Intelligence Index v4.0 shows OpenAI near 60 points and DeepSeek in the low 50s as of May 2026, compressed far tighter than a year ago. [1]
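Composite indexes like this typically aggregate by averaging normalized benchmark scores rather than fitting a latent-ability model. A toy sketch of that approach, with invented scores and equal weights (not Artificial Analysis' actual data, weights, or methodology):

```python
def composite_index(scores, weights=None):
    """Weighted average of per-benchmark scores (all on a 0-100 scale)."""
    out = {}
    for model, bench in scores.items():
        w = weights or {name: 1.0 for name in bench}
        out[model] = sum(bench[n] * w[n] for n in bench) / sum(w.values())
    return out

# Invented scores, loosely echoing the public-benchmark figures above.
scores = {
    "us_frontier": {"science": 91, "coding": 81, "math": 97},
    "deepseek":    {"science": 90, "coding": 74, "math": 97},
}
index = composite_index(scores)
gap = index["us_frontier"] - index["deepseek"]
```

With a simple average, one large gap on a single benchmark gets diluted by parity elsewhere—one reason averaging and IRT can tell such different stories from the same raw scores.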

Stanford AI Index Reveals Narrowing AI Performance Gap

Stanford's 2026 AI Index, released April 13, reports the Arena leaderboard gap between Claude Opus 4.6 and China's Dola-Seed-2.0 Preview has shrunk to just 2.7%. [1]

When DeepSeek first emerged in January 2025, the question was whether China's AI had already caught up, and US labs scrambled in response. Based on standardized benchmarks tracked by independent evaluators, the AI capability lead is narrowing, not widening as CAISI suggests. [1]

Artificial Analysis, an organization providing evaluations independent of geopolitical interests, states the gap between US and Chinese models remains steady. [2]

AI Cost-Effectiveness: China's Decisive Advantage

In its token cost comparison, CAISI filtered out any US model that performed significantly worse or cost significantly more per token than DeepSeek. Only one model cleared the bar: OpenAI's GPT-5.4 mini. [1]
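That screening step amounts to a simple two-condition filter: keep a model only if it is neither much weaker nor much pricier than the baseline. The model names, scores, prices, and thresholds below are invented, since CAISI's exact cutoffs are not public:

```python
# Invented example data: (average benchmark score, $ per 1M tokens).
models = {
    "deepseek":    (78.0, 0.50),
    "us_mini":     (76.5, 0.60),
    "us_frontier": (84.0, 9.00),
    "us_mid":      (71.0, 2.00),
}

def caisi_style_screen(models, baseline="deepseek",
                       perf_slack=3.0, cost_factor=2.0):
    """Keep models that neither perform much worse nor cost much more
    per token than the baseline. Thresholds here are assumptions."""
    base_perf, base_cost = models[baseline]
    return [name for name, (perf, cost) in models.items()
            if name != baseline
            and perf >= base_perf - perf_slack
            and cost <= base_cost * cost_factor]
```

Under these made-up numbers, only the small, cheap model survives—the big frontier model fails on price, the mid-tier one on performance—mirroring the single-survivor outcome CAISI reported.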

DeepSeek came out cheaper on five of seven benchmarks, even beating OpenAI's smallest model—sometimes by more than 50%. [2]

Cursor, a popular AI coding assistant, built its own model on a Chinese open-weight model precisely because it was cheaper than OpenAI and Anthropic. [2]

Cost per useful task dictates scalability in real-world deployments, and by that standard, the gap is far closer than eight months.
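One way to make "cost per useful task" concrete: the expected cost of one successful completion is the price of an attempt divided by the pass rate. The prices and token budget below are invented; only the 74% and 81% pass rates echo the SWE-Bench Verified scores cited above:

```python
def cost_per_solved_task(price_per_mtok, tokens_per_task, pass_rate):
    """Expected dollars to get one successful completion:
    (price per attempt) / P(attempt succeeds)."""
    price_per_attempt = price_per_mtok * tokens_per_task / 1e6
    return price_per_attempt / pass_rate

# Hypothetical $/1M-token prices and a hypothetical 20k-token task.
cheap  = cost_per_solved_task(price_per_mtok=0.50,
                              tokens_per_task=20_000, pass_rate=0.74)
strong = cost_per_solved_task(price_per_mtok=9.00,
                              tokens_per_task=20_000, pass_rate=0.81)
# Even with a lower pass rate, the far cheaper model wins on this metric.
```

A model priced an order of magnitude lower can absorb a large accuracy deficit before its cost per solved task catches up, which is why per-token price dominates this comparison.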

What This Means for the AI Race Ahead

The US likely maintains a real but contested AI capability lead on certain tasks, particularly on ARC-AGI-2 tests, where GPT-5.5 scored 79% compared to DeepSeek's 46%. [2]

However, calling it a race assumes both sides optimize for the same goals. China appears to prioritize cost-effectiveness and accessibility through open-weight models, while US labs focus on absolute capability on proprietary benchmarks. CAISI plans to release a fuller IRT methodology write-up in the near future, which may address some transparency concerns. [1]

For developers and enterprises choosing between AI frontier models, the decision increasingly hinges on whether they value marginal performance gains on specialized tasks or significant cost savings at near-competitive capability levels. Watch how leaderboards from Stanford and independent evaluators track this gap over the next six months—their public, reproducible benchmarks may prove more reliable indicators than government assessments built on private datasets.

Source: Decrypt

© 2026 TheOutpost.AI All rights reserved