4 Sources
[1]
Low-quality papers are surging by exploiting public data sets and AI
Last year, Matt Spick began to notice oddly similar papers flooding in for peer review at Scientific Reports, where he is an associate editor. He smelled a rat. The papers all drew on a publicly available U.S. data set: the National Health and Nutrition Examination Survey (NHANES), which through health exams, blood tests, and interviews has collected dietary information and other health-related measurements from more than 130,000 people. "I was getting so many nearly identical papers -- one a day, sometimes even two a day," says Spick, a statistician at the University of Surrey. What he was seeing at his one journal is part of a larger problem, Spick has discovered. In recent years, there has been a drastic surge in poor-quality papers using NHANES, possibly spearheaded by illicit moneymaking enterprises known as paper mills and facilitated by the use of artificial intelligence (AI)-generated text, he and colleagues reported in PLOS Biology last week. The finding suggests large public health data sets are ripe for exploitation, they say. Such free data sources allow almost anyone to take a known research method and swap in new variables to create fresh "findings" in a kind of "research Mad Libs," says Reese Richardson, a metascientist at Northwestern University who was not involved with the work. Other researchers have found similar "explosions" in a range of topics, he says, including various kinds of genetic studies as well as analyses of bibliometrics or gender disparities in different scientific disciplines. The NHANES papers Spick was receiving all followed the same formula: They chose a health condition, an environmental or physiological factor that could be associated with it, and a population group -- perhaps looking at the link between vitamin D levels and depression in men over age 65, or poor dental health and diabetes in women between the ages of 18 and 45. "It felt like every possible combination was being worked through by someone," Spick says. 
To get a better understanding of how prevalent these studies are, he and his team searched two major databases of scientific papers, PubMed and Scopus, for studies using NHANES data that looked at single associations. They found 341 of these papers published in 147 journals, including Scientific Reports, BMC Public Health, and BMJ Open. Between 2014 and 2021, an average of four such papers were published per year -- but a rapid increase kicked off in 2022, with 190 papers published in 2024 up to October, when the researchers did their search. The rise far outstripped the growth in health studies using large data sets generally, the authors report, suggesting some additional factor underlying the swell of NHANES studies. The timing points to the widespread availability of AI chatbots such as ChatGPT that can generate readable text from simple prompts and uploaded information. They may have been used to rephrase the same basic NHANES findings endlessly to avoid plagiarism detection, says Jennifer Byrne, a molecular biologist at the University of Sydney who peer reviewed the PLOS Biology paper. It's not possible to conclude with certainty that paper mills -- commercial entities that sell authorship on fraudulent or low-quality papers -- produced the papers, she says, but the "timing and scale of the increase make you think there has to be some kind of coordination behind this." Many of the more recent NHANES studies selectively analyzed portions of its data set without a clear rationale -- for example, authors limited their analysis to certain years, or certain ages of people in the survey. That suggests the authors were on the hunt for statistically significant results to generate easy publications, Spick says. But fishing for results in such a huge data set is bound to come up with many false positive findings. 
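The point about fishing can be illustrated with a small simulation. This is a minimal sketch, not the study's own analysis: it assumes that no real association exists anywhere, so every p-value is a uniform random draw, and the size of the search space (20 conditions, 50 factors, 10 subgroups) is a hypothetical figure chosen only for illustration.

```python
import random


def count_false_positives(n_tests: int, alpha: float = 0.05, seed: int = 0) -> int:
    """Count 'significant' results when no real effect exists.

    Under the null hypothesis a p-value is uniformly distributed on [0, 1],
    so each test is simulated as a single uniform draw.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


# Hypothetical search space (an assumption, not a figure from the study):
# 20 conditions x 50 factors x 10 population subgroups.
n_tests = 20 * 50 * 10
hits = count_false_positives(n_tests)
print(f"{n_tests} null tests: {hits} pass p < 0.05 "
      f"(about {n_tests * 0.05:.0f} expected by chance)")
```

Roughly 5% of the tests clear the conventional threshold by chance alone, which is why a large enough pool of combinations can supply publishable-looking results even when there is nothing to find.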
When the team took a closer look at the 28 NHANES studies that had explored depression, they found that only 13 of the results survived a statistical adjustment that corrects for the risk of finding false positives. Spick and his team think their analysis may drastically underestimate the problem. Their search only looked for NHANES studies that fit the formula Spick had been seeing, but a broader search finds that papers using the data set increased from 4926 in 2023 to 7876 in 2024. And other big health data sets -- such as the Global Burden of Disease study -- may also be vulnerable, Spick says. These data sets make it easy for researchers to interact with their information using coding languages such as Python or R, but this also makes them easy to exploit: His team was easily able to write code that could pull all the data from NHANES and "chug through the combinations" of diseases and health variables. The "industrialization" of low-quality research overwhelms the literature with useless findings, Spick says. "Honestly, I got really hopping mad about it." These papers reflect broad problems in both scientific publishing and how research is rewarded, Richardson says. "All of the publishers named in the article accepted fees, likely on the order of $1000 each, to publish this junk," he notes. (Open-access journals, including PLOS Biology, generally charge author fees to make papers freely available.) And researchers are incentivized to publish more papers, rather than higher quality papers, in order to advance in their careers, Richardson adds. The problem, he warns, "will only get worse unless we radically restructure incentives around scientific publication."
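The article does not say which statistical adjustment the team applied to the depression studies. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure, a widely used false discovery rate method; the p-values are invented for the example and are not taken from the papers.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True if it survives FDR control at level q.

    Benjamini-Hochberg: sort the p-values, find the largest rank k such that
    p_(k) <= (k / m) * q, and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            survives[idx] = True
    return survives


# Invented p-values for eight hypothetical single-factor associations.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
naive = sum(p < 0.05 for p in p_values)
corrected = sum(benjamini_hochberg(p_values))
print(f"significant at p < 0.05: {naive}; after FDR correction: {corrected}")
```

In this made-up example five results look significant on their own but only two survive the correction, the same pattern, in miniature, as the 28 depression findings shrinking to 13.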
[2]
AI paper mills are swamping science with garbage studies
Research flags rise in one-dimensional health research fueled by large language models
A report from a British university warns that scientific knowledge itself is under threat from a flood of low-quality AI-generated research papers. The research team from the University of Surrey notes an "explosion of formulaic research articles," including inappropriate study designs and false discoveries, based on data cribbed from the US National Health and Nutrition Examination Survey (NHANES) nationwide health database. The study, published in PLOS Biology, an open-access journal from the nonprofit publisher PLOS, found that many post-2021 papers used "a superficial and oversimplified approach to analysis." These often focused on a single variable while ignoring more realistic, multi-factor explanations of links between health conditions and potential causes, and some cherry-picked narrow data subsets without justification. "We've seen a surge in papers that look scientific but don't hold up under scrutiny - this is 'science fiction' using national health datasets to masquerade as science fact," states Matt Spick, a lecturer in health and biomedical data analytics at the University of Surrey, and one of the authors of the report. "The use of these easily accessible datasets via APIs, combined with large language models, is overwhelming some journals and peer reviewers, reducing their ability to assess more meaningful research - and ultimately weakening the quality of science overall," he added. The report notes that AI-ready datasets, such as NHANES, can open up new opportunities for data-driven research, but also lead to the risk of potential data exploitation by what it calls "paper mills" - entities that churn out questionable scientific papers, often for paying clients seeking confirmation of an existing belief. 
Surrey Uni's work involved a systematic literature search going back ten years to retrieve potentially formulaic papers covering NHANES data, and analyzing these for telltale statistical approaches or study design. The team identified and retrieved 341 reports published across a number of different journals. It found that over the last three years, there has been a rapid rise in the number of publications analyzing single-factor associations between predictors (independent variables) and various health conditions using the NHANES dataset. An average of four papers per year were published between 2014 and 2021, increasing to 33, 82, and 190 in 2022, 2023, and the first ten months of 2024, respectively. Also noted is a change in the origins of the published research. From 2014 to 2020, just two out of 25 manuscripts had a primary author affiliation in China. Between 2021 and 2024, this rose to 292 out of 316 manuscripts. The report says this jump in single-factor associative research means there is a corresponding increase in the risk of misleading findings being introduced to the wider body of scientific literature. For example, it says that some well-known multifactorial health issues are analyzed as single-factor studies, citing depression, cardiovascular disease, and cognitive function - all recognized as multifactorial - being investigated using simplistic, single-factor approaches in some of the papers reviewed. To combat this, the team sets out a number of suggestions, including that editors and reviewers at scientific journals should regard single-factor analysis of conditions known to be complex and multifactorial as a "red flag" for potentially problematic research. Providers of datasets should also take steps including API keys and application numbers to prevent data dredging, an approach already used by the UK Biobank, the report says. Publications referencing such data should be made to include an auditable account number as a condition of access. 
Another suggestion is that full dataset analysis should be made mandatory, unless using data subsets can be justified. "We're not trying to block access to data or stop people using AI in their research - we're asking for some common sense checks," said Tulsi Suchak, a post-graduate researcher at the University of Surrey and lead author of the study. "This includes things like being open about how data is used, making sure reviewers with the right expertise are involved, and flagging when a study only looks at one piece of the puzzle." This isn't the first time the issue has come to light. Last year, US publishing house Wiley discontinued 19 scientific journals overseen by its Hindawi subsidiary that were publishing reports churned out by AI paper mills. It is also part of a wider problem of AI-generated content appearing online and in web searches that can be difficult to distinguish from reality. Dubbed "AI slop," this includes fake pictures and entire video sequences of celebrities and world leaders, but also fake historical photographs and AI-generated portraits of historical figures appearing in search results as if they were genuine.
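The scale of the shift is easier to judge as ratios. The short sketch below only restates the figures reported above (the 2014-2021 baseline is an average, and the 2024 count covers roughly the first ten months); the variable names are illustrative.

```python
# Figures as reported in the articles above.
BASELINE_PER_YEAR = 4                              # average for 2014-2021
yearly_counts = {2022: 33, 2023: 82, 2024: 190}    # 2024: first ten months only

fold_increase = {year: n / BASELINE_PER_YEAR for year, n in yearly_counts.items()}
for year, fold in fold_increase.items():
    print(f"{year}: {yearly_counts[year]} papers, {fold:.2f}x the 2014-2021 average")

# Manuscripts whose primary author affiliation was in China.
early_china, early_total = 2, 25       # 2014-2020
late_china, late_total = 292, 316      # 2021-2024
early_share = early_china / early_total
late_share = late_china / late_total
print(f"2014-2020: {early_share:.1%} of manuscripts; 2021-2024: {late_share:.1%}")

# The two periods together account for all 341 papers in the sample.
total_papers = early_total + late_total
print(f"total papers: {total_papers}")
```

Because the 2024 figure stops in October, the full-year multiple would be at least as large as the one computed here.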
[3]
AI tools may be weakening the quality of published research, study warns
Artificial intelligence could be affecting the scientific rigor of new research, according to a study from the University of Surrey. The research team has called for a range of measures to reduce the flood of "low-quality" and "science fiction" papers, including stronger peer review processes and the use of statistical reviewers for complex datasets. In a study published in PLOS Biology, researchers reviewed papers published between 2014 and 2024 that proposed an association between a predictor and a health condition using an American government dataset called the National Health and Nutrition Examination Survey (NHANES). NHANES is a large, publicly available dataset used by researchers around the world to study links between health conditions, lifestyle and clinical outcomes. The team found that between 2014 and 2021, an average of just four NHANES association-based studies were published each year -- but this rose to 33 in 2022, 82 in 2023, and 190 in the first ten months of 2024. Dr. Matt Spick, co-author of the study from the University of Surrey, said, "While AI has the clear potential to help the scientific community make breakthroughs that benefit society, our study has found that it is also part of a perfect storm that could be damaging the foundations of scientific rigor. We've seen a surge in papers that look scientific but don't hold up under scrutiny -- this is 'science fiction' using national health datasets to masquerade as science fact. The use of these easily accessible datasets via APIs, combined with large language models, is overwhelming some journals and peer reviewers, reducing their ability to assess more meaningful research -- and ultimately weakening the quality of science overall." The study found that many post-2021 papers used a superficial and oversimplified approach to analysis -- often focusing on single variables while ignoring more realistic, multi-factor explanations of the links between health conditions and potential causes. 
Some papers cherry-picked narrow data subsets without justification, raising concerns about poor research practice, including data dredging or changing research questions after seeing the results. Tulsi Suchak, post-graduate researcher at the University of Surrey and lead author of the study, added, "We're not trying to block access to data or stop people using AI in their research -- we're asking for some common-sense checks. This includes things like being open about how data is used, making sure reviewers with the right expertise are involved, and flagging when a study only looks at one piece of the puzzle. These changes don't need to be complex, but they could help journals spot low-quality work earlier and protect the integrity of scientific publishing." To help tackle the issue, the team has laid out a number of practical steps for journals, researchers and data providers. They recommend that researchers use the full datasets available to them unless there's a clear and well-explained reason to do otherwise, and that they are transparent about which parts of the data were used, over what time periods, and for which groups. For journals, the authors suggest strengthening peer review by involving reviewers with statistical expertise and making greater use of early desk rejection to reduce the number of formulaic or low-value papers entering the system. Finally, they propose that data providers assign unique application numbers or IDs to track how open datasets are used -- a system already in place for some UK health data platforms. Anietie E. Aliu, co-author of the study and post-graduate student at the University of Surrey, said, "We believe that in the AI era, scientific publishing needs better guardrails. Our suggestions are simple things that could help stop weak or misleading studies from slipping through, without blocking the benefits of AI and open data. These tools are here to stay, so we need to act now to protect trust in research."
[4]
AI research tools might be creating more problems than they solve
A new study has uncovered an alarming rise in formulaic research papers derived from the National Health and Nutrition Examination Survey (NHANES), suggesting that artificial intelligence tools are being misused to mass-produce statistically weak and potentially misleading scientific literature. The authors point to a surge in single-factor analyses that disregard multifactorial complexity, exploit open data selectively, and bypass robust statistical corrections. Between 2014 and 2021, just four such papers were published each year. But in 2024 alone, up to October 9, the tally had ballooned to 190. This exponential growth, paired with a shift in publication origins and a reliance on automation, indicates that AI-assisted pipelines may be accelerating low-quality manuscript production. At the heart of the problem is the misuse of NHANES, a respected and AI-ready U.S. government dataset originally developed to evaluate public health trends across the population. NHANES provides an exceptionally rich dataset, combining clinical, behavioral, and laboratory data across thousands of variables. It is accessible through APIs and has standardized Python and R libraries, allowing researchers to extract and analyze the data efficiently. This makes it a valuable tool for both public health researchers and AI developers. But this very convenience also creates a vulnerability: it allows researchers to generate results quickly, and with minimal oversight, leading to an explosion of formulaic research. The new study analyzed 341 NHANES-based papers published between 2014 and 2024 that relied on single-variable correlations. These papers, on average, appeared in moderate-impact journals (average impact factor of 3.6), and often focused on conditions like depression, diabetes, or cardiovascular disease. 
Instead of exploring the multifactorial nature of these conditions, the studies typically drew statistical significance from a single independent variable, bypassing false discovery correction and frequently relying on unexplained data subsetting. One major concern is that multifactorial health conditions -- such as mental health disorders, chronic inflammation, or cardiovascular disease -- were analyzed using methods more suited for simple binary relationships. In effect, these studies presented findings that stripped away nuance and ignored the reality that health outcomes are rarely driven by a single factor. Depression was used as a case study, with 28 individual papers claiming associations between the condition and various independent variables. However, only 13 of these associations remained statistically significant after applying False Discovery Rate (FDR) correction. Without proper correction, these publications risk introducing a high volume of Type I errors into the scientific literature. In some instances, researchers appeared to recycle variables as both predictors and outcomes across papers, further muddying the waters. Another issue uncovered by the authors was the use of unjustified data subsets. Although NHANES provides a broad timeline of health data dating back to 1999, many researchers chose narrow windows of analysis without disclosing their rationale. For example, some studies used only the 2003 to 2018 window to analyze diabetes and inflammation, despite broader data availability. The practice hints at data dredging or HARKing (hypothesizing after the results are known), a methodologically flawed approach that undermines reproducibility and transparency. The median study analyzed just four years of NHANES data, despite the database offering over two decades of information. 
This selective sampling enables authors to increase the likelihood of achieving significant results without accounting for the full dataset's complexity, making it easier to produce and publish manuscripts in high volume. The findings pose a serious challenge to the integrity of scientific literature. Single-variable studies that fail to consider complex interdependencies are more likely to be misleading. When repeated at scale, such research floods the academic ecosystem with papers that meet publication thresholds but offer little new insight. This is compounded by weak peer review and the growing pressure on researchers to publish frequently and rapidly. The authors warn that these practices, if left unchecked, could shift the balance in some subfields where manufactured papers outnumber legitimate ones. The use of AI to accelerate manuscript generation only amplifies this risk. As generative models become more accessible, they enable rapid conversion of statistical outputs into full-length manuscripts, reducing the time and expertise required to publish scientific articles. Recommendations for stakeholders: To mitigate the risks of AI-enabled data dredging and mass-produced research, the authors propose several concrete steps:
A study reveals a dramatic increase in formulaic, AI-generated research papers exploiting public health datasets, raising concerns about the integrity of scientific literature and the misuse of AI in academic publishing.
A recent study published in PLOS Biology has uncovered a concerning trend in scientific publishing: a dramatic increase in low-quality research papers that exploit public health datasets and potentially misuse AI tools [1]. The research, by Tulsi Suchak, Matt Spick, and colleagues at the University of Surrey, identified a surge in formulaic papers using data from the National Health and Nutrition Examination Survey (NHANES), a comprehensive U.S. health dataset [2].
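The study revealed a stark increase in NHANES-based papers focusing on single-factor associations: an average of four per year between 2014 and 2021, then 33 in 2022, 82 in 2023, and 190 in the first ten months of 2024 [2].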
This exponential growth far outpaces the general increase in health studies using large datasets, suggesting additional factors at play.
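The researchers identified several red flags in these studies: single-factor analyses of conditions known to be multifactorial, such as depression and cardiovascular disease; narrow data subsets chosen without a stated rationale; and a lack of correction for false discoveries [2][4].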
The timing of this surge coincides with the widespread availability of AI language models like ChatGPT. These tools may be facilitating the rapid generation of readable text from simple prompts and data inputs [1]. The researchers suspect that "paper mills," commercial entities producing fraudulent or low-quality papers, may be behind this coordinated increase in publications.
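This flood of low-quality papers poses several threats to scientific integrity: misleading findings enter the wider literature, journals and peer reviewers are overwhelmed, and in some subfields manufactured papers could come to outnumber legitimate ones [2][4].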
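The study authors propose several measures to address this issue: treating single-factor analysis of complex conditions as a red flag, involving reviewers with statistical expertise and making greater use of early desk rejection, requiring full-dataset analysis unless a subset is justified, and having data providers issue API keys or application numbers so that dataset use can be tracked [2][3].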
This trend reflects larger issues in scientific publishing and research incentives. The pressure to publish frequently often outweighs the emphasis on quality, creating an environment ripe for exploitation by AI tools and paper mills [1]. As AI continues to advance, the scientific community must adapt to ensure the integrity and quality of published research in the face of these new challenges.
Summarized by
Navi