AI Hallucinations Contaminate NeurIPS Papers as 100 Fabricated Citations Slip Past Peer Review


GPTZero scanned 4,841 papers from NeurIPS, one of AI's most prestigious conferences, and found 100 hallucinated citations across 51 accepted papers. The discovery highlights how AI-generated references are infiltrating scientific papers despite rigorous peer review, raising concerns about research integrity as submission volumes have more than doubled since 2020.

AI Hallucinations Surface at Prestigious AI Conference

GPTZero, an AI detection startup, has uncovered a troubling pattern at the heart of AI research itself. After scanning all 4,841 papers accepted by the Conference on Neural Information Processing Systems (NeurIPS) in December, the company identified 100 hallucinated citations across 51 scientific papers that slipped past multiple peer reviewers [1]. These fabricated citations included nonexistent authors, made-up paper titles, fake journals, and URLs leading nowhere [4]. The findings expose how AI-generated references are contaminating academic publishing at one of the world's most selective AI research venues, where acceptance rates hover around 24.52% [3].

Source: Fortune

NeurIPS prides itself on rigorous scholarly work, making the discovery particularly ironic. Edward Tian, cofounder and CEO of GPTZero, told Fortune this represents "the first documented cases of hallucinated citations entering the official record of the top machine learning conference" [4]. The detection follows GPTZero's earlier discovery of 50 hallucinated citations in papers under review for ICLR, another major AI conference [2].

How Large Language Models Generate Fake References

The problem stems from researchers using large language models (LLMs) to handle citation tasks. These AI systems can sound confident while inventing details they never verified. In some cases, an LLM blended elements from multiple real papers, creating believable-sounding titles and author lists [4]. Other instances showed subtle changes: author initials expanded into guessed first names, coauthors dropped, or titles paraphrased [4]. Some citations plainly listed "John Smith" and "Jane Doe" as authors [4].

Source: Earth.com

Prediction-driven writing rewards plausibility, so LLM-generated content can appear credible while containing fundamental errors [3]. Earlier studies found that 55% of AI-generated references from older ChatGPT models were fabricated, though newer versions reduced this to 18% [3]. Around half the papers with hallucinated citations showed signs of extensive AI use [4].

Submission Tsunami Strains Peer Review Process

The scale of the problem reflects broader pressures on academic publishing. Between 2020 and 2025, submissions to NeurIPS more than doubled, from 9,467 to 21,575 papers [2]. This tsunami of submissions has strained the peer review process to the breaking point, forcing organizers to recruit ever-larger numbers of peer reviewers [2]. When reviewers juggle research, teaching, and tight deadlines, reference lists become easy to skim [3].

NeurIPS instructed reviewers to flag AI hallucinations, yet the errors survived [4]. GPTZero senior machine-learning engineer Nazar Shmatko and colleagues argue that generative AI tools have fueled "a tsunami of AI slop" that creates issues of oversight, expertise alignment, and even fraud [2].

Source: TechCrunch

No one can fault peer reviewers given the sheer volume involved, but the findings raise questions about research integrity when verification fails [1].

Citations as Currency and Career Metrics

Fabricated citations carry consequences beyond simple errors. In AI research, citations function as career currency: metrics that demonstrate how influential a researcher's work is among peers [1]. Citation metrics often sit alongside recommendation letters during hiring decisions, signaling attention that translates into funding, jobs, and collaboration invitations [3]. When AI makes them up, it waters down their value [1].

The NeurIPS board emphasized that "even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated" [1]. While this protects valid findings, it leaves readers with extra verification work when tracking evidence [3].

Rising Error Rates Signal Broader Quality Concerns

The citation problem coincides with increasing substantive errors in scientific papers. A December 2025 preprint from researchers at Together AI, NEC Labs America, Rutgers University, and Stanford University examined AI papers from ICLR, NeurIPS, and TMLR [2]. They found the average number of mistakes per paper at NeurIPS rose 55.3%, from 3.8 errors in 2021 to 5.9 in 2025 [2]. Beyond citation issues, these mistakes include incorrect formulas, miscalculations, and errant figures [2].

Scholarly output reached 5.7 million articles in 2024, up from 3.9 million five years earlier, according to the International Association of Scientific, Technical & Medical Publishers [2]. Adam Marcus, co-founder of Retraction Watch, noted that "publishers have made themselves vulnerable to these assaults by adopting a business model that has prioritized volume over quality" [2].

AI Detection Tools Enter the Verification Arms Race

GPTZero argues that its Hallucination Check software should become part of publishers' AI-detection arsenal [2]. Unlike text-based AI detection, which is prone to false positives, hallucination detection verifies facts by searching academic databases and the open web to confirm whether cited papers exist [4]. The company claims accuracy above 99%, with every flagged citation reviewed by human experts [4]. ICLR has hired GPTZero to check future submissions during peer review [4].
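GPTZero has not published the internals of Hallucination Check, but the core idea of existence checking is straightforward to sketch. The minimal Python example below asks the public Crossref index whether a cited title resolves to a real paper; the `citation_exists` helper and its 0.9 similarity threshold are illustrative assumptions, and a production checker would also query other indexes (arXiv, Semantic Scholar) and verify authors, venues, and URLs rather than titles alone.

```python
import requests
from difflib import SequenceMatcher

CROSSREF_API = "https://api.crossref.org/works"  # public search endpoint, no key needed


def citation_exists(title: str, threshold: float = 0.9) -> bool:
    """Return True if a closely matching title is indexed in Crossref.

    Illustrative only: real checkers consult several databases and the
    open web, and verify authors and URLs as well as titles.
    """
    resp = requests.get(
        CROSSREF_API,
        params={"query.bibliographic": title, "rows": 5},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json()["message"]["items"]:
        for candidate in item.get("title", []):
            # Fuzzy matching tolerates casing and punctuation drift, while a
            # paraphrased or invented title should fall below the threshold.
            if SequenceMatcher(None, title.lower(), candidate.lower()).ratio() >= threshold:
                return True
    return False


if __name__ == "__main__":
    print(citation_exists("Deep learning"))  # real Nature 2015 review: expect True
    print(citation_exists("Adversarial Muffin Networks for Breakfast Forecasting"))  # expect False
```

A title an LLM stitched together from several real papers tends to fall below the similarity threshold even when individual words match, which is part of why existence checking is less prone to false positives than style-based AI detection.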

Yet countermeasures exist. Tools like Claude Code's "Humanizer" claim to remove signs of AI-generated writing, making detection harder [2]. This creates an arms race in which detectors may struggle to keep up [2].

What This Means for AI Accuracy and Scholarly Work

The discovery raises a pointed question: if leading AI experts with reputations at stake cannot ensure AI accuracy in their own work, what does that mean for wider adoption [1]? The legal community has flagged more than 800 errant citations attributed to AI models in court filings, often with consequences for attorneys and judges [2]. Academic rigor demands the same fact-checking standards, yet publishing practices have not adapted to the reality of LLM-generated content [2].

Reform proposals include letting authors rate review quality and giving peer reviewers formal credit for effort, creating feedback loops that discourage rushed work [3]. Reference managers that pull details from databases can reduce typing errors and maintain consistency [3]. When AI systems help draft text, verifying each referenced title adds minutes but spares readers from chasing dead ends [3].
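The article's sources don't name a specific tool for this, so as one illustration: the short sketch below fetches a canonical BibTeX entry for a DOI using standard DOI content negotiation (the doi.org resolver returns BibTeX when sent an `Accept: application/x-bibtex` header), so the reference list comes from the registrar's metadata rather than a model's memory. The `bibtex_from_doi` helper name and the example DOI are illustrative choices.

```python
import requests


def bibtex_from_doi(doi: str) -> str:
    """Fetch the canonical BibTeX entry for a DOI via content negotiation."""
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},  # ask the resolver for BibTeX
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    # Example DOI: LeCun, Bengio & Hinton's 2015 "Deep learning" review in Nature
    print(bibtex_from_doi("10.1038/nature14539"))
```

Because the entry is typed by the registrar's own records, fabricated authors and paraphrased titles never enter the reference list in the first place.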

As data-integrity concerns mount, the question becomes whether academic publishing can maintain trust while navigating the flood of AI-assisted research and surging submissions.
