3 Sources
[1]
Teams of AI agents boost speed of research
Artificial intelligence is poised to take on a more-active role in the laboratory: two new systems, described today in Nature, use teams of AI agents to develop hypotheses, propose experiments and analyse data. Each system still relies on human input at various stages, but they boast timelines that can be remarkably shorter than when the process is left to human minds and hands alone. When the systems were asked to identify existing drugs that might be repurposed for different conditions, they arrived at plausible answers in a matter of hours. "It almost seems like an agentic, in silico implementation of the thought process in a scientist's head," says Vivek Natarajan, a researcher at Google DeepMind in Mountain View, California, who helped to develop one of the systems. "The goal is to give scientists superpowers." In one experiment, Natarajan and his colleagues used Google's Co-Scientist to look for approved drugs that could be repurposed to treat a form of blood cancer called acute myeloid leukaemia. The system identified a list of candidate drugs, from which human researchers selected five for further study. Three of these showed promise in preliminary studies on cells grown in the lab. FutureHouse, a non-profit AI research lab in San Francisco, California, developed the second system, called Robin, and instructed it to find drugs to treat an eye condition called dry age-related macular degeneration. Robin began by consulting AI agents trained to conduct literature reviews and used their reports to select lab experiments to test a variety of candidate drugs. Humans carried out those experiments and fed the data back to Robin, which then supplied them to an AI agent specialized in analysing data. Using this procedure, Robin suggested a list of molecular targets for treating dry age-related macular degeneration and identified a drug called ripasudil, which is used to treat the eye condition glaucoma, as a candidate treatment. 
The system suggested assays to confirm ripasudil's activity in the lab and then proposed follow-up experiments. None of the drugs identified by the AI scientists have been fully evaluated, and many drug candidates that pass initial assays in lab-grown cells go on to fail more-stringent assays. But the examples show that these AI systems can arrive at plausible hypotheses, says Karandeep Singh, who oversees AI initiatives and strategy for University of California San Diego Health. How well the AI assistants perform in day-to-day science in other contexts remains to be seen, he adds. "You don't know how it works in reality until it's been made available to a broad set of people," he says. Natarajan says that about 100 scientists outside Google DeepMind now have access to Co-Scientist and are testing its capabilities in a variety of settings. In one experiment described in the paper, Natarajan and his colleagues asked Co-Scientist to develop a hypothesis that could explain why a particular suite of genes conferring antimicrobial resistance could be found in multiple bacterial species. The system took only days to arrive at the same conclusion as that of a group of researchers who had been studying the phenomenon and had not yet published the results. Although AI agents can perform some tasks much more quickly than humans can, researchers could lose time and money if AI leads them down dead ends. Both Robin and Co-Scientist are based on large language models, a form of AI that is prone to occasionally producing false but plausible-sounding answers. These 'hallucinations' will probably always be a concern when dealing with this form of AI, says Ola Spjuth, who studies the use of AI for drug discovery at Uppsala University in Sweden. But cutting-edge AI models hallucinate less than their predecessors do, he notes, and a researcher using an AI co-scientist system can audit its decision-making process to learn why it made the choice that it did. 
Furthermore, Robin and Co-Scientist include steps in which AI agents debate hypotheses or compare results between themselves, potentially weeding out some hallucinations and faulty reasoning. Even so, human oversight is key, Spjuth says. "We cannot just delegate important decisions right now to LLMs and AI agents," he says. "We need to supervise these methods." The role for humans in research could be shifting in other ways as well. Companies are progressing rapidly towards sophisticated robots that can carry out some lab work. And today, a team of Google researchers report, in Nature, an agentic AI system called Empirical Research Assistance that aims to write high-quality software, and has been deployed in fields as diverse as cosmology and neuroscience. The extent to which AI tools can take over hypothesis generation and data interpretation might depend on the nature of the research, says Samuel Rodriques, chief executive and co-founder of FutureHouse. In terms of drug discovery, "there's a huge way to go" before AI can design a new drug all the way from the initial target through to clinical testing, he says. "But I think it's possible."
[2]
Two AI-based science assistants succeed with drug-retargeting tasks
On Tuesday, Nature released two papers describing AI systems intended to help scientists develop and test hypotheses. One, Google's Co-Scientist, is designed as what Google terms "scientist in the loop," meaning researchers regularly apply their judgement to direct the system. The second, from a nonprofit called FutureHouse, goes a step further, having trained a system that can evaluate biological data coming from some specific classes of experiments. While Google says its system will also work for physics, both groups exclusively present biological data, and largely straightforward hypotheses -- this drug will work for that. So, this is not an attempt to replace either scientists or the scientific process. Instead, it's meant to help with the things that current AIs are best at: chewing through massive amounts of information that humans would struggle to come to grips with.
What's this good for?
There are some distinctions between the two systems, but both of them are what is termed agentic; they operate in the background by calling out to separate tools. (Microsoft has taken a similar approach with its science assistant as well; OpenAI seems to be an exception in that it simply tuned an LLM for biology.) And, while there are differences between them that we'll highlight, they are both focused on the same general issue: the utter profusion of scientific information. With the ease of online publishing, the number of journals has exploded, and with them the number of papers. It has gotten tough for any researcher to stay on top of their field, and finding potentially relevant material in other fields is a real challenge. If you're focused on eye development, for example, one of the signaling systems used there may also be involved in the kidney, and it can be easy to miss what people are discovering about it there. 
As the people at FutureHouse put this issue, "By focusing on 'combinatorial synthesis' (identifying non-obvious connections between disparate fields), Robin effectively targets 'low-hanging fruit' that human experts may overlook due to the compartmentalization of scientific knowledge." This is a task that's well suited to AI, which can chew through the peer-reviewed literature in the background while researchers do other things. This isn't really a question of whether an AI could do something better or worse than a human; it's more a question of whether any human would end up doing these sorts of searches at all. By finding enough connections among disparate research, these tools can make suggestions -- hypotheses, really -- about the biology. This can include things like what processes underlie biological behaviors, and what pathways and networks regulate those processes. And, in the cases explored here, it included suggesting known drugs that might target some of these pathways in diseased cells: acute myeloid leukemia in Google's case, and a form of macular degeneration for FutureHouse.
Co-Scientist
As you might imagine, Google's system is based on the company's Gemini large language model. That helps the system interpret a statement of research goals provided by human scientists and start a literature search to find relevant information and form hypotheses. Those hypotheses are then evaluated relative to each other in a "tournament," the results of which are evaluated by a Reflection agent. An Evolution agent can then make improvements to any surviving ideas, which can be sent back through the process. Key criteria considered throughout this process include plausibility, novelty, testability, and safety. The Reflection tool also has access to external search tools, as access to the scientific literature "prevented the hallucination of seemingly novel but implausible hypotheses," the company wrote. As the paper puts it, scientists were kept in the loop at all times. 
In the search for potential drugs targeting leukemia, the suggestions made by the system were prioritized based on a review by a panel of experts, who had access to the literature Co-Scientist used to formulate its suggestions. The results are what you'd expect from cancer therapies. Some of the drugs identified were effective, but only against subsets of a panel of myeloid leukemia cells. That's not unusual, given that there are multiple routes to unchecked growth, so drugs that block the route followed by one cell type may not be effective in cells that took a different route. Google also mentioned that the system could do more general hypothesizing that doesn't involve drugs, using an example of the spread of virulence genes in bacteria. But the details of that work were fairly sparse. The system is also set up so that it's model agnostic, allowing it to be switched over to better-performing models as AI systems evolve. But Google also warns that "Co-Scientist also inherits the intrinsic limitations of its underlying models, including imperfect factuality and the potential for hallucinations."
And Robin
FutureHouse's system has some similarities but a couple of critical differences that go beyond naming all the agentic tools after birds. The main system, Robin, has access to specialized literature search tools. One, Crow, produces a concise summary of papers, while Falcon gives a deep overview of the information contained in the paper. The paper describing the system provides a clear sense of the advantages here: "Robin analyses 551 papers in 30 minutes compared to an estimated time of 540 hours for a human." Taking those summaries, Robin then formed a series of hypotheses about disease mechanisms for macular degeneration and used these tools to provide a detailed report on the evidence for each mechanism. An LLM judge then made pairwise comparisons among the hypotheses, which resulted in relative rankings -- a bit like Google's tournament system. 
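The articles say only that Robin's LLM judge made pairwise comparisons that produced relative rankings. One simple way to turn such comparisons into an order is to compare every pair once and count wins. The minimal Python sketch below does that. The `judge` callable stands in for the LLM judge. The toy evidence scores, the function names and the win-count scoring are all hypothetical and for illustration only; they are not FutureHouse's implementation.

```python
"""Sketch: turning pairwise judgments into a ranking by counting wins.

Assumption: `judge` is a hypothetical stand-in for an LLM judge; here it
simply compares fixed toy evidence scores, purely for illustration.
"""
from itertools import combinations
from typing import Callable


def rank_by_pairwise_wins(
    hypotheses: list[str],
    judge: Callable[[str, str], str],
) -> list[tuple[str, int]]:
    """Compare every pair of hypotheses once and rank them by number of wins."""
    wins = {h: 0 for h in hypotheses}
    for a, b in combinations(hypotheses, 2):
        winner = judge(a, b)  # the judge returns whichever of a or b it prefers
        if winner not in (a, b):
            raise ValueError(f"judge returned {winner!r}, expected {a!r} or {b!r}")
        wins[winner] += 1
    # Sort by wins, highest first; Python's sort is stable, so ties keep input order.
    return sorted(wins.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical evidence scores standing in for an LLM judge's reasoning.
    evidence = {"mechanism A": 0.4, "mechanism B": 0.9, "mechanism C": 0.6}

    def toy_judge(a: str, b: str) -> str:
        return a if evidence[a] >= evidence[b] else b

    for name, n_wins in rank_by_pairwise_wins(list(evidence), toy_judge):
        print(name, n_wins)
```

A full round-robin like this needs n(n-1)/2 judge calls for n hypotheses. Elo-style ratings instead update after each individual match, so they do not require every pair to be compared.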
In a similar manner, the system was re-deployed to suggest cell lines and culture conditions that could provide a model of macular degeneration, and it prepared reports on 30 candidate drugs. "These reports contained both justification for why each drug is suitable for mitigating the disease mechanism represented in the in vitro model and potential limitations the drug may pose," according to the FutureHouse team. Again, these reports were evaluated by human experts to determine which tests to go ahead with. Robin also suggested assays to test the drugs, which humans evaluated (in most cases, it appears they used variants of the suggested ones). The key difference with Robin is that it includes a tool, Finch, that can automate the evaluation of data from some standard biological screening assays, like flow cytometry and RNA-seq. So, as long as your tests involve one of the assays that Finch can handle, there's an additional step that can be performed by the system. As above, Robin came up with a novel hypothesis: Increasing the ability of retinal cells to pick up debris outside the cells could provide some protection against the disease. And it identified a drug that seemed to provide just that sort of boost in the experiments it proposed. As Google found, having tools designed specifically to interface with the scientific literature mattered. Swapping out Crow for OpenAI's o4-mini took the rate of hallucinated references from zero percent all the way up to 45 percent. FutureHouse also took a look at the performance of OpenAI's research-focused tool and found that, in all cases where it suggested drugs that Robin hadn't come up with, those drugs failed to have an effect on these cells.
Where does this leave us?
For starters, it's important to note that these successes come in one of the easier parts of drug development (not that any part of it can really be said to be easy). 
The AIs weren't being asked to design entirely new molecules, and most drugs fail during the animal and clinical trials phase, rather than during testing in cell culture. That's not to say repurposing existing drugs is nothing -- we already have safety profiles and agency approvals for these molecules, and many are off-patent and therefore cheap. But we're not at the point where AIs are solving hard problems. This sort of hypothesis -- this mechanism underlies that disease, and the drug over there can target it -- is also one of the more concrete forms of hypothesis in biology. In my career as a scientist, I had to come up with hypotheses that were meant to address things like "mice with this mutation have a whole lot of defects in very different tissues; is there a single mechanism underlying them?" Or, "What's going on at the border of this gene's expression that is changing how cells respond to this signaling molecule?" It's not clear how these systems could handle these more open-ended scientific problems. That said, the problem of literature overload is a real one in many fields, and systems meant to address it can potentially help us avoid a situation where all the information we needed was sitting around for a decade, but nobody put it together. Given we're still working through AI's growing pains, however, I'm also happy that there are at least two independently developed systems tackling this problem so that we can potentially run both and compare the results. Nature, 2026. DOI: 10.1038/s41586-026-10652-y, 10.1038/s41586-026-10644-y.
[3]
New 'AI scientists' are improving - but reveal their fundamental limits
Many of the most exciting discoveries in science involve highly specialised knowledge and making connections between far-flung facts. Scientists must combine deep analysis with broad reasoning strategies. As in many information-rich tasks, researchers are looking to artificial intelligence (AI) systems to speed up their work. AI tools may be able to support key steps such as generating ideas, reviewing existing work and analysing data. The latest systems use large language models (LLMs) to allow scientists to interact naturally and directly with the vast body of knowledge captured in words in the scientific literature. But as two new systems described in papers just published in Nature show, when it comes to science, language alone can only go so far.
What AI is doing to science
A number of organisations, such as Sakana AI, are trying to automate the entire scientific process. To date, these efforts have largely focused on computer science, where "experiments" mainly involve designing and writing code. However, the Agents4Science conference organised at Stanford last October showcased a broader range of AI-generated papers. They covered topics from mechanical engineering and protein design to a system called BadScientist, which deliberately produced "convincing but unsound" research. I have previously raised concerns about the impacts of AI scientists on the scientific ecosystem. Recent work validates these concerns, showing increased quantity but lower quality of both papers and peer reviews, identifying fabricated references in published works, finding fabricated and misleading images, and more.
What scientists are doing with AI
AI systems clearly can't be trusted to conduct the full process of science on their own. But how about using AI to help scientists get more done more quickly? This is the intent of the two new systems described in Nature: Robin, made by the non-profit FutureHouse, and Co-Scientist, from Google DeepMind. 
Both systems aim to accelerate scientific discovery, working in collaboration with a scientist. Both are also "multi-agent" AI systems, meaning they are built as a collection of specialised agents each targeting specific steps of the scientific discovery process, coordinated by a "supervisor" agent. The agents that comprise Co-Scientist aim to mirror abstract cognitive tasks, such as a "reflection agent" that acts as a critical scientific peer reviewer assessing the quality of a hypothesis. "Ranking agents" debate research hypotheses in "tournaments", using multiple interacting LLMs to simulate a discussion about the relative merits of two hypotheses. Robin's agents, on the other hand, are more tuned to specific tasks relevant to drug repurposing, aiming to identify new drugs for a given disease. One agent focuses on selecting experimental tests, while another analyses complex biomedical data.
How do the results stack up?
Co-Scientist can assess the quality of its generated proposals using a method called the Elo rating, which is best known for ranking chess players. Co-Scientist's self-ratings of the novelty and impact of its outputs align quite well with the preferences of human experts and judgements by other LLM systems. In a drug repurposing experiment, Co-Scientist selected 30 drug candidates as promising treatments for a kind of cancer called acute myeloid leukemia. Expert (human) oncologists refined the list, and five drugs were tested in the lab. Of these, three showed some positive results and one seemed to show particular promise. Other experiments showed the potential of Co-Scientist to explore combinations of multiple drugs. Notably, the predictions of Co-Scientist were not compared with the plethora of targeted computational and machine learning methods for drug repurposing that have been developed over decades of computational biology research. This means we don't know whether the new general-purpose tool outperforms more specific AI approaches. 
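The Elo rating mentioned above adjusts two scores after each head-to-head comparison. It moves more points when the lower-rated entrant wins. Below is a minimal Python sketch of the standard Elo update applied to pairwise hypothesis "matches". The K-factor of 32, the starting rating of 1200, the function names and the match results are illustrative assumptions. The source does not describe how Co-Scientist schedules its matches or sets these parameters.

```python
"""Minimal Elo rating sketch for pairwise hypothesis comparisons.

Assumptions (not from the Co-Scientist paper): K = 32, initial rating 1200,
and a hypothetical list of (winner, loser) outcomes supplied by some judge.
"""

K_FACTOR = 32          # assumed maximum rating change per match
INITIAL_RATING = 1200  # assumed starting rating for every hypothesis


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the Elo model (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(ratings: dict[str, float], winner: str, loser: str) -> None:
    """Apply one Elo update in place after `winner` beats `loser`."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = K_FACTOR * (1.0 - exp_win)  # bigger gain for an unexpected win
    ratings[winner] += delta
    ratings[loser] -= delta  # the loser gives up exactly what the winner gains


def run_tournament(hypotheses: list[str], matches: list[tuple[str, str]]) -> dict[str, float]:
    """Rate hypotheses from a sequence of (winner, loser) match results."""
    ratings = {h: float(INITIAL_RATING) for h in hypotheses}
    for winner, loser in matches:
        update(ratings, winner, loser)
    return ratings


if __name__ == "__main__":
    # Hypothetical hypotheses and judged outcomes, for illustration only.
    hyps = ["H1", "H2", "H3"]
    results = [("H1", "H2"), ("H1", "H3"), ("H3", "H2")]
    final = run_tournament(hyps, results)
    for name, score in sorted(final.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: {score:.1f}")
```

Each update adds to the winner exactly what it subtracts from the loser, so the total rating in the pool stays constant. The ratings are therefore meaningful only relative to one another.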
Both systems stop short of validating their hypotheses directly, which would involve real physical experiments. Both also rely heavily on human input to define the key scientific question, sense-check predictions, and prioritise predictions for further investigation. Co-Scientist focuses primarily on generating hypotheses through elaborate reasoning agents, leaving validation and interpretation to subsequent steps. Robin also uses an agent to analyse data produced from real-world experiments. Robin was used to propose 30 drug candidates for a condition called dry age-related macular degeneration. The top five were selected for testing. Robin also made proposals for the experiments, with several suggestions overridden by the human scientists. Through several rounds of brainstorming and analysis, two drugs were identified as promising. Testing of Robin's individual agents showed those that dug through earlier research were better at the task than general-purpose LLMs. The analytical agent did less well on questions about statistics and bioinformatics, and relied heavily on human-supplied prompts.
The limits of language alone
AI can help scientists to navigate the vast amount of documented knowledge humans have acquired over the millennia. Use of computation to find patterns in large datasets, to integrate dispersed information, and to drive new discoveries from existing literature has already contributed to scientific progress for decades. New models such as Robin and Co-Scientist represent a shift towards working directly in the realm of the language of science, rather than the realm of raw data. This allows more natural collaborations between scientist and machine, through language-based "discussions". However, more natural doesn't necessarily mean more effective. Language-based communication can be imprecise and ambiguous, where science must be specific. Models that combine the best of these worlds are on the horizon. 
These aim to link structured quantitative data to the concepts and relationships that describe the core facts beneath it. Such models ground scientific reasoning in the structure of knowledge. They allow scientific evidence ranging from genomic sequences and protein structures to cellular imaging to be connected. Words are how science is communicated. AI tools that facilitate making sense of the information that is hidden in all of those words are surely valuable. But the complexity of the natural world means that AI (co-) scientists will only be truly effective when they can go beyond connecting words together, to modelling the full complexity of the systems those words describe.
Two new AI systems published in Nature demonstrate how teams of AI agents can compress months of scientific research into hours. Google's Co-Scientist and FutureHouse's Robin identified promising drug candidates for cancer and eye disease, though human supervision remains essential to prevent AI hallucinations and validate results.
Artificial intelligence is taking on a more active role in laboratories as two new systems use teams of AI agents to develop hypotheses, propose experiments, and analyse data with remarkable speed. Google's Co-Scientist and FutureHouse's Robin, both described in Nature, represent a shift in how AI-based science assistants can accelerate scientific research by compressing timelines from months to mere hours [1].
When tasked with identifying existing drugs for repurposing, both systems arrived at plausible answers in hours rather than the extended periods typically required for such research. "It almost seems like an agentic, in silico implementation of the thought process in a scientist's head," says Vivek Natarajan, a researcher at Google DeepMind who helped develop Co-Scientist. "The goal is to give scientists superpowers" [1].
Google's Co-Scientist is built on the company's Gemini model and operates as what developers term "scientist in the loop," keeping researchers involved at critical decision points [2]. The system interprets research goals provided by human scientists, conducts literature searches, and forms hypotheses that are evaluated in a "tournament" format where different AI agents debate their merits.
In one experiment focused on drug discovery, Natarajan and colleagues used Co-Scientist to look for approved drugs that could treat acute myeloid leukemia. The system identified a list of candidate drugs, from which human researchers selected five for further study. Three of these showed promise in preliminary studies on cells grown in the lab [1]. Some of the drugs were effective against only subsets of the leukemia cells tested, which is not unusual given the multiple routes to unchecked growth, but the system could rapidly identify candidates for testing [2].
About 100 scientists outside Google DeepMind now have access to Co-Scientist and are testing its capabilities across various settings. In another experiment, the system took only days to arrive at a hypothesis explaining why a particular suite of antimicrobial resistance genes appears across multiple bacterial species. That hypothesis matched the still-unpublished conclusion of a research group that had been studying the phenomenon [1].
FutureHouse, a non-profit AI research lab in San Francisco, developed Robin with agents more tuned to specific tasks relevant to drug retargeting [3]. The system was instructed to find treatments for dry age-related macular degeneration, an eye condition affecting millions.
Robin began by consulting AI agents trained to conduct literature reviews and used their reports to select lab experiments testing various candidate drugs. Humans carried out those experiments and fed the data back to Robin, which supplied them to an AI agent specialized in analysing data [1]. Through this iterative process of hypothesis generation and experimental validation, Robin suggested molecular targets for treating the condition and identified ripasudil, a drug used to treat glaucoma, as a candidate treatment. The system then proposed assays to confirm ripasudil's activity and suggested follow-up experiments.
Both Robin and Co-Scientist are multi-agent AI systems, meaning they comprise collections of specialized agents targeting specific steps of scientific discovery, coordinated by a supervisor agent [3]. Both are aimed at the profusion of scientific information that makes it difficult for researchers to stay current with their field, let alone discover relevant material in other disciplines.
However, these systems inherit fundamental limitations. Both are based on large language models prone to producing AI hallucinations: false but plausible-sounding answers that remain a persistent concern [1]. Cutting-edge AI models hallucinate less than their predecessors, and both systems include steps where AI agents debate hypotheses to weed out faulty reasoning. Even so, human supervision remains essential. "We cannot just delegate important decisions right now to LLMs and AI agents," says Ola Spjuth, who studies AI use for drug discovery at Uppsala University. "We need to supervise these methods" [1].
None of the drugs identified have been fully evaluated, and many candidates that pass initial assays in lab-grown cells fail more stringent tests. Still, the examples show that these systems can arrive at plausible hypotheses, says Karandeep Singh, who oversees AI initiatives for University of California San Diego Health [1].
How well these AI-based science assistants perform in day-to-day research across different contexts remains to be seen. "You don't know how it works in reality until it's been made available to a broad set of people," Singh observes [1]. Researchers could lose time and money if AI leads them down dead ends, which makes the ability to audit AI decision-making processes crucial.
Samuel Rodriques, chief executive and co-founder of FutureHouse, suggests that the extent to which AI tools can take over hypothesis generation and data interpretation may depend on the nature of the research. In drug discovery specifically, "there's a huge way to go" before AI can design a new drug all the way from the initial target through to clinical testing [1]. The systems aim to support scientists who must combine deep analysis with broad reasoning strategies. However, they stop short of validating hypotheses directly and rely heavily on human input to define key questions, sense-check predictions, and prioritize candidates for investigation [3].
Summarized by Navi