2 Sources
[1]
Open-source AI tool beats giant LLMs in literature reviews -- and gets citations right
Researchers have published the recipe for an artificial-intelligence model that reviews the scientific literature better than some major large language models (LLMs) are able to, and gets the citations correct as often as human experts do. OpenScholar -- which combines a language model with a database of 45 million open-access articles -- links the information it sources directly back to the literature, to stop the system from making up or 'hallucinating' citations. Several commercial AI-based literature-review tools that use similar techniques already exist, but few have been released as open source, says Akari Asai, an AI researcher at Carnegie Mellon University in Pittsburgh, Pennsylvania, and a co-author of the work, published in Nature on 4 February. Being open source means that researchers can not only try OpenScholar for free in an online demonstration, but also deploy it on their own machine and use the method in the paper to boost the literature-review skills of any LLM, says Asai. In the 14 months since OpenScholar was first published in the arXiv repository, AI firms such as OpenAI have used similar methods to tack 'deep research' tools onto their commercial LLMs, which has greatly improved their accuracy. But as a small and efficient system, running OpenScholar costs a fraction of the price of using OpenAI's GPT-5 with deep research, co-author Hannaneh Hajishirzi, a computer scientist at the University of Washington in Seattle, tells the Nature podcast. However, the authors acknowledge that OpenScholar has limitations. For example, it doesn't always retrieve the most representative or relevant papers for a query, and it is limited by the scope of its database. But if researchers are able to access the tool for free, "it can become one of the most popular apps for scientific searches," says Mushtaq Bilal, a researcher at Silvi, a Copenhagen-based firm that has its own AI-based literature-review tool.
LLMs can write fluently, but they often struggle with citations. This is because they learn by building links between words in their training data, which include sources outside science, and then generate text on the basis of probable associations that are not always correct or up to date. This is a feature of LLMs, not a bug, and it is proving to be a problem when people use LLMs in research. For example, at least 51 papers accepted to the high-profile machine learning NeurIPS conference in December 2025 contained non-existent or inaccurate citations, according to an analysis using the GPTZero tool. OpenScholar is a way to force an LLM to answer queries using a specific data store, says Hajishirzi. When a user asks a question, a retrieval system finds related scientific articles within the repository, then ranks them by relevance and generates a response that is based only on the most useful. The LLM -- trained on examples of questions and answers -- then refines the answer to improve it. "We designed an efficient pipeline where the model generates an answer once, but then keeps improving if needed," says Asai. Although the data store used in the paper contains scientific articles up to October 2024, Asai says that the demo version of OpenScholar can tap into the academic search engine Semantic Scholar, from the Allen Institute for Artificial Intelligence in Seattle, to access current papers. And because responses are built around real, specific papers, the system very rarely fabricates citations. It can, however, still cite a paper that doesn't support a claim very well, in the same way that humans can, says Asai. To test OpenScholar, the team compared its ability to answer realistic queries with that of other AI tools and with answers written by experienced, PhD-level humans, in computer science, physics, neuroscience and biomedicine. They found that experts preferred the model's responses over human-written ones in most cases.
OpenScholar also outperformed GPT-4o, as well as rival tools such as PaperQA2, from FutureHouse in San Francisco, California, on machine-judged evaluations of citation and factual accuracy. A limitation of the tool is that it cannot ensure that it accesses only scientifically rigorous articles, says Bilal. It also cannot access paywalled papers, which means it is much less useful in disciplines, such as engineering or social sciences, in which open-access preprints are not the norm, he says. "Lack of access to copyrighted, licensed, or paywalled research is one of the biggest bottlenecks in improving these AI tools," he says. In the future, the team would like to develop a more flexible system that allows users to tap into papers they hold subscriptions for and that they have downloaded locally, says Asai.
[2]
Open-source AI program can answer science questions better than humans
Developed by and for academics, OpenScholar aims to improve searches of the ballooning scientific literature Scientists have a new tool for keeping on top of the exponentially growing body of research papers, which broke 4 million in 2024: an artificial intelligence (AI) program designed specifically to analyze the scientific literature. Dubbed OpenScholar and developed by academic researchers rather than any of the leading AI companies, the AI answers questions about diverse research subjects more accurately than several widely used, general-purpose chatbots -- and in many cases better than human experts, a new study says. The study, published today in Nature, first appeared as a preprint in November 2024, and the authors acknowledge that new versions of other large language models (LLMs) that power AIs such as ChatGPT have narrowed the gap with OpenScholar by now, or even surpassed it. But other researchers praise OpenScholar's creators at the Allen Institute for AI (Ai2) and five universities for making its code and underlying data free to access, unlike widely used commercial chatbots. "Certainly these [proprietary] systems have gotten better, but they're not peer reviewed," says Min-Yen Kan of the National University of Singapore, who studies information technologies and scholarly communication. "It's very important to put this type of [open-source] research out there because it is replicable." Given a question like "What are ways to cool the center-of-mass motion of levitated nanoparticles?" OpenScholar responds by checking a database of 45 million open-access papers optimized for searches about science in subjects that include biomedicine, computer science, and physics. Unlike earlier LLMs, which typically provided answers drawn from only one paper at a time, it examines content from multiple relevant papers.
In addition, OpenScholar's answers run several hundred words longer than those produced by other models, helping it capture more nuance useful to scientists. OpenScholar also critiques and iteratively improves each response before finalizing it. That move reduced hallucinated references, a notorious feature of the LLMs that power many chatbots, its creators report in the study. The research team assessed the quality of the answers produced by OpenScholar using a benchmarking program that draws on guidance drafted by human subject matter experts. It found OpenScholar answered 51% of computer science questions correctly compared with 45% for GPT-4o, an advanced LLM created by the OpenAI organization that was available in 2024 when the study was completed. OpenScholar also had a higher score than Meta's popular LLM known as Llama, whose code can be accessed by researchers under certain restrictions. Human evaluators for various topics -- 12 Ph.D. students and postdoctoral researchers -- preferred OpenScholar's responses to those of human experts in 51% of cases, a figure that rose to 70% when the LLM was combined with GPT-4o. Jevin West, a data scientist at the University of Washington who was not involved in the study, suggests caution about interpreting that finding. "We have a hard time figuring out how to define 'better' because there's such variance across individuals within a discipline about what is the best citation to support an argument," he says. "That's where continued work will be needed." What's more, LLMs are designed to produce persuasive answers, even if substance is lacking, he notes. "We can become a bit hypnotized by their summarization abilities." About 30,000 scientists have used a demonstration version of OpenScholar since its debut, and most of them work in disciplines outside computer science, says lead author Akari Asai, a computer scientist at Ai2. 
"Many of them say it's useful to quickly understand or to quickly identify big papers," she says. "Some of them are expert in a domain, but they wanted to see if there are any papers they missed." The paper acknowledges, though, that the absence of paywalled content in the database searched by OpenScholar may limit the strength of its answers. Scientists who use tools like OpenScholar face risks, Kan says. Like anyone else using AI to obtain information, they must decide for themselves how much to trust the answers. "If you're using these tools to [substitute] for the primary sources, that can be dangerous because there could be nuances that are lost," Kan says. That might be more acceptable in a fast-moving field such as AI, where such tools could help make sense of an exploding literature, than in a field such as psychiatry where patients' health is at stake. Another risk is "deskilling," says Katherine Collins, a cognitive science postdoctoral researcher at the Massachusetts Institute of Technology. "I do worry that scaling up these kinds of systems could encourage younger scientists to not deeply read the literature, which can help spawn new ideas and make new connections," says Collins, who co-authored a commentary about AI benchmarks last week in Nature. "People could lose, or not learn, that skill in a world where it's so easy to get summaries of papers." Those questions will become more urgent as the technology improves. In November 2025, members of the OpenScholar team posted a preprint describing a more advanced LLM, dubbed DR Tulu-8B, which generates comprehensive reports in response to in-depth questions on a variety of subjects, from sources across the internet. It performs as well as or better than OpenScholar, human experts, and the latest versions of several other leading LLMs, the developers say. Although it is not designed exclusively for scientists, the team thinks researchers may be quick to adopt it.
Academic researchers unveiled OpenScholar, an open-source AI tool that outperformed major LLMs like GPT-4o in scientific literature reviews. The system combines a language model with 45 million open-access articles to deliver accurate citations without hallucinations. Over 30,000 scientists have already tested the free tool since its debut.
Academic researchers have released OpenScholar, an open-source AI tool designed specifically for literature reviews that delivers more accurate results than prominent large language models (LLMs), including GPT-4o [1]. Published in Nature on 4 February, the system was developed by teams at the Allen Institute for Artificial Intelligence, Carnegie Mellon University, and the University of Washington, among others [1][2]. The AI tool combines a language model with a database of 45 million open-access articles, linking information directly to sources to prevent the citation hallucinations that plague conventional chatbots [1].
Source: Nature
In benchmark tests, OpenScholar answered 51% of computer science questions correctly, compared with 45% for GPT-4o, demonstrating that it can answer science questions more reliably than widely used commercial systems [2]. The system also outperformed Meta's Llama and PaperQA2 on evaluations measuring citation and factual accuracy [1]. Human evaluators (12 PhD students and postdoctoral researchers across computer science, physics, neuroscience, and biomedicine) preferred OpenScholar's responses over those written by human experts in 51% of cases, a figure that climbed to 70% when the system was combined with GPT-4o [2].
Unlike traditional LLMs, which generate text based on probable word associations learned from diverse training data, OpenScholar forces responses to draw exclusively from its scientific database [1]. When users submit queries, a retrieval system locates related articles within the repository, ranks them by relevance, and generates responses based only on the most useful papers. The system then critiques and iteratively improves each answer before finalizing it, significantly reducing citation hallucinations [2].
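The retrieve, rank, generate, and refine loop described above can be sketched in a few lines of Python. This is a minimal toy illustration, not OpenScholar's actual implementation: all function names are hypothetical, the keyword-overlap scoring stands in for a trained neural retriever, and the string-building steps stand in for the LLM.

```python
# Toy sketch of a retrieval-grounded answering pipeline:
# retrieve candidate papers, rank by relevance, generate an answer
# grounded only in the top-ranked sources, then iteratively refine it.
# All names and the scoring heuristic here are illustrative assumptions.

def retrieve(query, repository):
    """Return papers sharing at least one keyword with the query."""
    terms = set(query.lower().split())
    return [p for p in repository if terms & set(p["abstract"].lower().split())]

def rank(query, papers, top_k=2):
    """Order papers by keyword overlap with the query; keep the best top_k."""
    terms = set(query.lower().split())
    scored = sorted(papers,
                    key=lambda p: len(terms & set(p["abstract"].lower().split())),
                    reverse=True)
    return scored[:top_k]

def generate(query, papers):
    """Draft an answer citing only the retrieved papers (stand-in for the LLM)."""
    citations = ", ".join(p["id"] for p in papers)
    return f"Answer to '{query}' based on: [{citations}]"

def refine(answer, papers, max_rounds=3):
    """Iteratively improve the draft until a simple quality check passes."""
    for _ in range(max_rounds):
        if all(p["id"] in answer for p in papers):  # toy 'good enough' check
            break
        missing = [p["id"] for p in papers if p["id"] not in answer]
        answer += " Also see: " + ", ".join(missing)
    return answer

repository = [
    {"id": "paper-1", "abstract": "cooling levitated nanoparticles with lasers"},
    {"id": "paper-2", "abstract": "protein folding prediction"},
    {"id": "paper-3", "abstract": "feedback cooling of nanoparticles"},
]

query = "cooling levitated nanoparticles"
candidates = retrieve(query, repository)
top = rank(query, candidates)
answer = refine(generate(query, top), top)
print(answer)  # the answer cites only papers retrieved from the repository
```

Because the final answer is assembled only from documents actually present in the repository, a pipeline shaped like this cannot cite a paper that does not exist, which is the key property the OpenScholar team exploits to suppress hallucinated citations.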
"We designed an efficient pipeline where the model generates an answer once, but then keeps improving if needed," says Akari Asai, an AI researcher at Carnegie Mellon University and a co-author of the work [1]. This approach addresses a persistent problem: at least 51 papers accepted to the NeurIPS conference in December 2025 contained non-existent or inaccurate citations, according to an analysis using the GPTZero tool [1].
The demo version can tap into the academic search engine Semantic Scholar, from the Allen Institute for Artificial Intelligence, to access papers published after the October 2024 cutoff of the original database [1]. OpenScholar's responses also run several hundred words longer than those of other models, capturing more of the nuance useful to scientists [2].
The decision to make OpenScholar fully open source distinguishes it from commercial alternatives. Researchers can try the tool for free in an online demonstration, deploy it on their own machines, and use the published method to boost the literature-review skills of any LLM [1]. "It's very important to put this type of research out there because it is replicable," says Min-Yen Kan of the National University of Singapore [2].
Running OpenScholar costs a fraction of the price of using OpenAI's GPT-5 with deep research, according to Hannaneh Hajishirzi, a computer scientist at the University of Washington [1]. Although OpenAI and other firms have added similar "deep research" capabilities to their commercial LLMs in the 14 months since OpenScholar first appeared on arXiv, the open-source tool remains significantly more affordable [1]. About 30,000 scientists have used the demonstration version since its debut, most of them working outside computer science [2].
Despite its strengths, OpenScholar faces constraints that affect its utility across disciplines. The system cannot access paywalled content, limiting its effectiveness in fields such as engineering and the social sciences, where open-access preprints are uncommon [1]. Nor can it guarantee that it draws only on scientifically rigorous articles, according to Mushtaq Bilal, a researcher at the Copenhagen-based firm Silvi [1]. The tool also doesn't always retrieve the most representative papers for a query and is constrained by the scope of its database [1].
Experts identify additional risks. Jevin West, a data scientist at the University of Washington, notes that LLMs are designed to produce persuasive answers even when substance is lacking. "We can become a bit hypnotized by their summarization abilities," he cautions [2]. Katherine Collins, a cognitive-science researcher at MIT, warns about deskilling: "I do worry that scaling up these kinds of systems could encourage younger scientists to not deeply read the literature, which can help spawn new ideas and make new connections" [2].
The research team plans to develop a more flexible system that lets users tap into papers from their own subscriptions and locally downloaded files [1]. If researchers maintain free access, "it can become one of the most popular apps for scientific searches," Bilal predicts [1]. As scientific publications continue growing, exceeding 4 million in 2024, tools that help researchers navigate this expanding literature while maintaining accuracy will become increasingly critical [2].
Summarized by Navi
06 Mar 2025•Science and Research
