AI Hallucinations Contaminate NeurIPS Papers as 100 Fabricated Citations Slip Past Peer Review


GPTZero scanned 4,841 papers from NeurIPS, one of AI's most prestigious conferences, and found 100 hallucinated citations across 51 accepted papers. The discovery highlights how AI-generated references are infiltrating scientific papers despite rigorous peer review, raising concerns about research integrity as submission volumes have more than doubled since 2020.

AI Hallucinations Surface at Prestigious AI Conference

GPTZero, an AI detection startup, has uncovered a troubling pattern at the heart of AI research itself. After scanning all 4,841 papers accepted by the Conference on Neural Information Processing Systems (NeurIPS) in December, the company identified 100 hallucinated citations across 51 scientific papers that slipped past multiple peer reviewers [1]. These fabricated citations included nonexistent authors, made-up paper titles, fake journals, and URLs leading nowhere [4]. The findings expose how AI-generated references are contaminating academic publishing at one of the world's most selective AI research venues, where acceptance rates hover around 24.52% [3].

Source: Fortune

NeurIPS prides itself on rigorous scholarly work, making the discovery particularly ironic. Edward Tian, cofounder and CEO of GPTZero, told Fortune this represents "the first documented cases of hallucinated citations entering the official record of the top machine learning conference" [4]. The detection follows GPTZero's earlier discovery of 50 hallucinated citations in papers under review for ICLR, another major AI conference [2].

How Large Language Models Generate Fake References

The problem stems from researchers using large language models (LLMs) to handle citation tasks. These AI systems can sound confident while inventing details they never verified. In some cases, an LLM blended elements from multiple real papers, creating believable-sounding titles and author lists [4]. Other instances showed subtle changes: author initials expanded into guessed first names, coauthors dropped, or titles paraphrased [4]. Some citations plainly listed "John Smith" and "Jane Doe" as authors [4].

Source: Earth.com

Prediction-driven writing rewards plausibility, so LLM-generated content can appear credible while containing fundamental errors [3]. Earlier studies found that 55% of AI-generated references from older ChatGPT models were fabricated, though newer versions reduced this to 18% [3]. Around half the papers with hallucinated citations showed signs of extensive AI use [4].

Submission Tsunami Strains Peer Review Process

The scale of the problem reflects broader pressures on academic publishing. Between 2020 and 2025, submissions to NeurIPS more than doubled, from 9,467 to 21,575 papers [2]. This tsunami of submissions has strained the peer review process to the breaking point, forcing organizers to recruit ever-larger numbers of peer reviewers [2]. When reviewers juggle research, teaching, and tight deadlines, reference lists become easy to skim [3].

NeurIPS instructed reviewers to flag AI hallucinations, yet the errors survived [4]. GPTZero senior machine-learning engineer Nazar Shmatko and colleagues argue that generative AI tools have fueled "a tsunami of AI slop" that creates issues of oversight, expertise alignment, and even fraud [2].

Source: TechCrunch

No one can fault peer reviewers given the sheer volume involved, but the findings raise questions about research integrity when verification fails [1].

Citations as Currency and Career Metrics

Fabricated citations carry consequences beyond simple errors. In AI research, citations function as career currency: metrics that demonstrate how influential a researcher's work is among peers [1]. Citation metrics often sit alongside recommendation letters during hiring decisions, signaling attention that translates into funding, jobs, and collaboration invitations [3]. When AI makes them up, it waters down their value [1].

The NeurIPS board emphasized that "even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated" [1]. While this protects valid findings, it leaves readers with extra verification work when tracking evidence [3].

Rising Error Rates Signal Broader Quality Concerns

The citation problem coincides with increasing substantive errors in scientific papers. A December 2025 preprint from researchers at Together AI, NEC Labs America, Rutgers University, and Stanford University examined AI papers from ICLR, NeurIPS, and TMLR [2]. They found the average number of mistakes per paper at NeurIPS rose 55.3%, from 3.8 errors in 2021 to 5.9 in 2025 [2]. Beyond citation issues, these mistakes include incorrect formulas, miscalculations, and errant figures [2].

Scholarly output reached 5.7 million articles in 2024, up from 3.9 million five years earlier, according to the International Association of Scientific, Technical & Medical Publishers [2]. Adam Marcus, co-founder of Retraction Watch, noted that "publishers have made themselves vulnerable to these assaults by adopting a business model that has prioritized volume over quality" [2].

AI Detection Tools Enter the Verification Arms Race

GPTZero argues that its Hallucination Check software should become part of publishers' AI-detection arsenal [2]. Unlike text-based AI detection, which is prone to false positives, hallucination detection verifies facts by searching academic databases and the open web to confirm whether cited papers exist [4]. The company claims accuracy above 99%, with every flagged citation reviewed by human experts [4]. ICLR has hired GPTZero to check future submissions during peer review [4].
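GPTZero has not published the internals of Hallucination Check, but the core idea of existence checking is straightforward to sketch. The minimal Python example below asks the public Crossref index whether a cited title resolves to a real paper; the `citation_exists` helper and its 0.9 similarity threshold are illustrative assumptions, and a production checker would also query other indexes (arXiv, Semantic Scholar) and verify authors, venues, and URLs rather than titles alone.

```python
import requests
from difflib import SequenceMatcher

CROSSREF_API = "https://api.crossref.org/works"  # public search endpoint, no key needed


def citation_exists(title: str, threshold: float = 0.9) -> bool:
    """Return True if a closely matching title is indexed in Crossref.

    Illustrative only: real checkers consult several databases and the
    open web, and verify authors and URLs as well as titles.
    """
    resp = requests.get(
        CROSSREF_API,
        params={"query.bibliographic": title, "rows": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for candidate in item.get("title", []):
            # Fuzzy matching tolerates casing and punctuation drift, while a
            # paraphrased or invented title should fall below the threshold.
            if SequenceMatcher(None, title.lower(), candidate.lower()).ratio() >= threshold:
                return True
    return False


if __name__ == "__main__":
    print(citation_exists("Deep learning"))  # real Nature 2015 review: expect True
    print(citation_exists("Adversarial Muffin Networks for Breakfast Forecasting"))  # expect False
```

A title an LLM stitched together from several real papers tends to fall below the similarity threshold even when individual words match, which is part of why existence checking is less prone to false positives than style-based AI detection.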

Yet countermeasures exist. Tools like Claude Code's "Humanizer" claim to remove signs of AI-generated writing, making detection harder [2]. This creates an arms race in which detectors may struggle to keep up [2].

What This Means for AI Accuracy and Scholarly Work

The discovery raises a pointed question: if leading AI experts with reputations at stake cannot ensure AI accuracy in their own work, what does that mean for wider adoption [1]? The legal community has flagged more than 800 errant citations attributed to AI models in court filings, often with consequences for attorneys and judges [2]. Academic rigor demands the same fact-checking standards, yet publishing practices have not adapted to the reality of LLM-generated content [2].

Reform proposals include letting authors rate review quality and giving peer reviewers formal credit for effort, creating feedback loops that discourage rushed work [3]. Reference managers that pull details from databases can reduce typing errors and maintain consistency [3]. When AI systems help draft text, verifying each referenced title adds minutes but spares readers from chasing dead ends [3].
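The article's sources don't name a specific tool for this, so as one illustration: the short sketch below fetches a canonical BibTeX entry for a DOI using standard DOI content negotiation (the doi.org resolver returns BibTeX when sent an `Accept: application/x-bibtex` header), so the reference list comes from the registrar's metadata rather than a model's memory. The `bibtex_from_doi` helper name and the example DOI are illustrative choices.

```python
import requests


def bibtex_from_doi(doi: str) -> str:
    """Fetch the canonical BibTeX entry for a DOI via content negotiation."""
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},  # ask the resolver for BibTeX
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    # Example DOI: LeCun, Bengio & Hinton's 2015 "Deep learning" review in Nature
    print(bibtex_from_doi("10.1038/nature14539"))
```

Because the entry is typed by the registrar's own records, fabricated authors and paraphrased titles never enter the reference list in the first place.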

As data-integrity concerns mount, the question becomes whether academic publishing can maintain trust while navigating the flood of AI-assisted research and surging submissions.
