



5 Sources
[1]

AI-Powered Evo-2 Model Generates DNA, Advances Genome Research
A new artificial intelligence model has been introduced, marking a significant advancement in biological research. Developed using a dataset of 128,000 genomes covering various life forms, this AI can generate entire chromosomes and small genomes from scratch. Researchers claim it has the potential to interpret non-coding gene variants associated with diseases, making it a powerful tool in genetic research. This development is expected to enhance genome engineering by facilitating a deeper understanding of DNA sequences and their functions. According to a study published by the Arc Institute, the AI model, named Evo-2, has been developed in collaboration with Stanford University and NVIDIA. The model, which has been made available through web interfaces, provides researchers with the ability to generate and analyse DNA sequences. Patrick Hsu, bioengineer at the Arc Institute and the University of California, Berkeley, stated during a press briefing that Evo-2 is intended to serve as a platform that scientists can modify to suit their research needs. Unlike previous AI models that focused primarily on protein sequences, Evo-2 has been trained on genome data, encompassing both coding and non-coding sequences. This extensive training set includes genomes from humans, animals, plants, bacteria, and archaea, covering 9.3 trillion DNA letters. The complexity of eukaryotic genomes, which contain interspersed coding and non-coding regions, has been incorporated into Evo-2's framework to enhance its ability to predict gene activity. Anshul Kundaje, computational genomicist at Stanford University, stated to Nature that independent testing would be required to fully assess Evo-2's capabilities. Preliminary results suggest that it performs at a high level when predicting the effects of mutations in genes such as BRCA1, which is linked to breast cancer. 
The model was also used to analyse the genome of the woolly mammoth, further demonstrating its ability to interpret complex genetic structures. The AI has been tested in designing new DNA sequences, including CRISPR gene editors, as well as bacterial and viral genomes. Earlier versions of the model produced incomplete genomes, but Evo-2 has shown improvements by generating more biologically plausible sequences. Brian Hie, computational biologist at Stanford University and Arc Institute, mentioned that while progress has been made, further refinements are necessary before these sequences can be fully functional in living cells. Researchers anticipate that Evo-2 will aid in designing regulatory DNA sequences that control gene expression. Experiments are already underway to test its predictions on chromatin accessibility, which influences cell identity in multicellular organisms. Yunha Wang, computational biologist and CEO of Tatta Bio, suggested that Evo-2's ability to learn from bacterial and archaeal genomes could assist in designing novel human proteins. Scientists involved in the project aim to push beyond protein design towards comprehensive genome engineering. With ongoing refinements and laboratory validations, Evo-2 may contribute to advancements in synthetic biology and precision medicine. The model's role in understanding genetic regulation and designing functional DNA sequences is expected to grow as more researchers adopt and refine its capabilities.
[2]

Biggest-ever AI biology model writes DNA on demand
Scientists today released what they say is the biggest-ever artificial-intelligence model for biology. The model -- which was trained on 150,000 genomes spanning the tree of life, from humans to single-celled bacteria and archaea -- can write whole chromosomes and small genomes from scratch. It can also make sense of existing DNA, including hard-to-interpret 'non-coding' gene variants that are linked to disease. Evo-2, co-developed by researchers at the Arc Institute in Palo Alto, California, and chip maker NVIDIA, is available to scientists through web interfaces or they can download its freely available software code, data and other parameters needed to replicate the model. The developers see Evo-2 as a platform that others can adapt to their own uses. "We're really looking forward to how scientists and engineers build this 'app store' for biology," Patrick Hsu, a bioengineer at the Arc Institute and the University of California, Berkeley, said at a press briefing announcing Evo-2's launch. Other scientists are impressed with what they've read about the model -- which is described in a paper posted to the Arc Institute website and submitted to the bioRxiv preprint server. But they say they will need to kick the tyres before coming to firm conclusions. "We'll have to see how it holds up in independent benchmarks after the preprint is out," says Anshul Kundaje, a computational genomicist at Stanford University in Palo Alto. So far, he is impressed by the engineering that underpins the model. In the past few years, researchers have developed increasingly powerful protein language models such as the ESM-3 model developed by former Meta employees that, after training on millions of protein sequences, have been used to help predict protein structures and design totally new proteins including gene editors and fluorescent molecules. 
Unlike these models, Evo-2 was trained on genome data that contains both 'coding sequences' -- which carry instructions for making proteins -- and non-coding DNA that includes sequences that can control when, where and how genes are active. The first version of Evo released last year was trained on the genomes of 80,000 bacteria and archaea -- simple organisms called prokaryotes -- as well as their viruses and other sequences. The latest model is based on 128,000 genomes, including those of humans and other animals, plants and other eukaryotic organisms. These genomes encompass a total of 9.3 trillion DNA letters. Based on the computing power needed to devour this data and other features, the Evo-2 is the biggest biological AI model yet released, says Hsu. Compared with prokaryotes, eukaryotic genomes tend to be longer and more complex: genes are made of interspersed segments of coding and non-coding regions, and non-coding 'regulatory DNA' can be far away from the genes they control. To handle this complexity, Evo-2 was built so that it can learn patterns in sequences of DNA as far away as 1 million base pairs. To demonstrate its ability to make sense of complex genomes, Hsu and his colleagues used Evo-2 to predict the effects of previously studied mutations in a gene implicated in breast cancer called BRCA1. It did nearly as well as the best bio-AI models at determining whether changes to coding regions would cause diseases, said Hsu. "It's state of the art for non-coding mutations." In the future, the model could help to identify these hard-to-interpret changes in patient genomes. The researchers also tested the model's ability to decipher other features of complex genomes -- including that of the woolly mammoth. "Evo-2 represents a significant step in learning DNA regulatory grammar," says Christina Theodoris, a computational biologist at the Gladstone Institutes in San Francisco, California. 
Kundaje says Evo-2 seems good at finding coding sequences -- and nearby non-coding DNA. But it's not yet clear whether the model has learned about distant non-coding sequences that regulate gene activity. One appeal of genome models like Evo-2 is that they can generate new DNA sequences corresponding to not just proteins, but also non-coding sequences that work with them. Hsu and his colleagues used Evo-1 to create new CRISPR gene editors, which include a DNA-cutting enzyme and RNA molecules that direct that protein to a target site. These were shown to work in lab experiments. They also attempted to design bacterial and viral genomes, but these lacked many features of bona fide genomes. "We likened this to a blurry picture of the genomes," Brian Hie, a computational biologist at Arc Institute and Stanford, said at the briefing. With Evo-2, these images became less blurry. The researchers used the model to create genomes inspired by those of Mycoplasma genitalium -- a bacterium that was the first cellular organism to have its genome fully synthesized -- human mitochondria, and a 330,000 DNA letter-long yeast chromosome. These looked more realistic than the genomes Evo-1 generated - which lacked plausible proteins in some cases - but "there's still room for improvement," Hie said. Without further improvements, he doubts the genomes would work if put into cells. Because it's trained on DNA from across the tree of life, Evo-2 could be adept at applying what it's learned from bacterial and archaeal genomes to coming up with new human proteins, says Yunha Wang, a computational biologist and chief executive of Tatta Bio, a non-profit company in New York City that's developing genome models. The researchers hope to further validate Evo-2 with lab experiments. 
For instance, they designed sequences that alter the accessibility of folded-up DNA called chromatin -- a feature that influences the identity of cells in multicellular organisms -- and are collaborating with another lab to test these designs in mouse embryonic stem cells. Protein language models and other AI tools for protein design have ushered in a bio-design revolution. Hie and his colleagues -- who eventually want to model entire cells with AI -- hope that genome models like Evo-2 can move the needle further. "We want to push the field beyond protein design to genome design."
[3]

NVIDIA and Arc Institute Unveil an AI Model to Predict DNA, RNA & Proteins
The model has been trained on nearly 9 trillion nucleotides, the building blocks of DNA and RNA. California-based nonprofit Arc Institute and Stanford University, in collaboration with NVIDIA, unveiled Evo 2 on Wednesday as the largest publicly available AI model for genomic data. Evo 2 can predict and design the genetic code -- DNA, RNA, and proteins -- of all domains of life. The model has been trained on nearly 9 trillion nucleotides, the building blocks of DNA and RNA. "We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity," the researchers said in the official paper. "Deploying a model like Evo 2 is like sending a powerful new telescope out to the farthest reaches of the universe," said Dave Burke, Arc's chief technology officer. "We know there's immense opportunity for exploration, but we don't yet know what we're going to discover." NVIDIA said the model can be used for biomolecular research applications, including predicting protein structures, identifying novel molecules for healthcare and industrial use, and evaluating how gene mutations affect function. "Evo 2 represents a major milestone for generative genomics," said Patrick Hsu, Arc Institute cofounder and core investigator, and an assistant professor of bioengineering at the University of California, Berkeley. "By advancing our understanding of these fundamental building blocks of life, we can pursue solutions in healthcare and environmental science that are unimaginable today." The model is available as an NVIDIA NIM microservice, allowing users to generate biological sequences with customisable settings. Researchers can also fine-tune Evo 2 on proprietary datasets through the open-source NVIDIA BioNeMo Framework. 
"Designing new biology has traditionally been a laborious, unpredictable and artisanal process," said Brian Hie, assistant professor of chemical engineering at Stanford University and Arc Institute innovation investigator. "With Evo 2, we make biological design of complex systems more accessible to researchers, enabling the creation of new and beneficial advances in a fraction of the time it would previously have taken." Arc Institute, founded in 2021 with $650 million in funding, supports long-term scientific research by providing multiyear funding and dedicated lab space. Scientists at the institute focus on disease areas, including cancer, immune dysfunction, and neurodegeneration. NVIDIA contributed computing resources by providing access to 2,000 NVIDIA H100 GPUs via NVIDIA DGX Cloud on AWS. The AI platform includes NVIDIA BioNeMo software, featuring optimised microservices and BioNeMo Blueprints. NVIDIA researchers also collaborated on AI scaling and optimisation. Evo 2 processes genetic sequences up to 1 million tokens in length, enabling a broader analysis of the genome. This capability allows scientists to explore relationships between genetic sequences and cell function, gene expression, and disease. "A single human gene contains thousands of nucleotides -- so for an AI model to analyse how such complex biological systems work, it needs to process the largest possible portion of a genetic sequence at once," said Hsu. In healthcare and drug discovery, Evo 2 could help researchers identify gene variants linked to specific diseases and design molecules that precisely target them. In a separate study by Stanford and Arc Institute, researchers found that Evo 2 could predict with 90% accuracy whether previously unrecognised mutations in BRCA1, a gene associated with breast cancer, would affect gene function. 
In agriculture, the model could support food security efforts by improving understanding of plant biology, leading to the development of climate-resilient or nutrient-dense crops. Evo 2 could also be used to engineer biofuels or proteins that break down plastic or oil.
[4]

Generative AI tool marks a milestone in biology and accelerates the future of life sciences
Imagine being able to speed up evolution - hypothetically - to learn which genes might have a harmful or beneficial effect on human health. Imagine, further, being able to rapidly generate new genetic sequences that could help cure disease or solve environmental challenges. Now, scientists have developed a generative AI tool that can predict the form and function of proteins coded in the DNA of all domains of life, identify molecules that could be useful for bioengineering and medicine, and allow labs to run dozens of other standard experiments with a virtual query - in minutes or hours instead of years (or millennia). The open-source, all-access tool, known as Evo 2, was developed by a multi-institutional team co-led by Stanford's Brian Hie, an assistant professor of chemical engineering and a faculty fellow in Stanford Data Science. Evo 2 was trained on a dataset that includes all known living species, including humans, plants, bacteria, amoebas, and even a few extinct species. Stanford Report talked to Hie about Evo 2's advanced capabilities, why the scientific world is so eager to get its hands on this new tool, and how Evo 2 could reshape the biological sciences.
Can you give us the lay version of how Evo 2 works?
All life is encoded in DNA using just four chemicals, known as nucleotides. These complex molecules are abbreviated using the letters A, C, G, and T. The human genome, at 3 billion nucleotides long, is just a string of these four letters. Now, if you imagine DNA as the characters in a book that is 3 billion letters long, the individual genes are the words. They are spelled differently. Some have more letters than others. And they have different purposes and meanings - that is, they have different functions. With AI, we can search for patterns in all that code and use it to predict what the next nucleotide in the sequence is likely to be. In this way, Evo 2 is able to generate - to write - new genetic code that has never existed before. With Evo 2, you can enter a sequence of up to 1 million nucleotides. The million-nucleotide window in biology is important, as it allows us to explore long-distance interactions between two or more genes that may not be physically close to one another on the DNA molecule. The longer context window could allow us to spot connections between these long-distance collaborators that we wouldn't even know about with a shorter window.
How is Evo 2 different from Evo 1 - which came out just last year - and how did you advance the technology so quickly?
Honestly, Evo 1 was more effective than we thought it would be. Evo 1 was trained on only 113,000 or so genomes of simpler life forms like bacteria and amoebas, known as the prokaryotes. Evo 2, on the other hand, also includes the known genomes of 15,000 or so plants and animals - the eukaryotes - which includes humans. Our dataset has now expanded from about 300 billion nucleotides to almost 9 trillion with Evo 2. In terms of safety, we have left out the genomes of viruses to prevent Evo 2 from being used to create new or more dangerous diseases. It's like a representative snapshot of all species on Earth. Because it has the potential to improve tasks related to human disease, we felt like we needed to share Evo 2 quickly.
How is Evo 2 like ChatGPT?
In a natural language processor, like ChatGPT, you can prompt it with some text, and it will autocomplete the sentence based on patterns from previously written words. Evo 2 does this with DNA. If you want to design a new gene, you prompt the model with the beginning of a gene sequence of base pairs, and Evo 2 will autocomplete the gene. Sometimes that completion will look exactly like a gene found in nature, but other times the model will make some improvements or write the gene in a different way than has ever happened in evolutionary history. In the real world, these mutations happen by chance. With Evo 2, we can be more direct and steer toward mutations that have useful functions. Evo 2 also includes machine learning models that will tell you if the sequence exists in nature and predict how this new sequence will function in real life. Then we go into the lab and synthesize the DNA and insert it into a living cell to test it using a gene editing technology like CRISPR. Essentially, Evo 2 is speeding up evolution, providing promising new genetic paths for us to explore.
How do you hope other scientists will use Evo 2?
We hope that Evo 2 will someday have clinical significance. It is really good at discovery. Evo 2 could help predict which mutations lead to pathogenicity and disease. Everyone has random mutations in their DNA and, mostly, they're harmless. But on rare occasions, they'll cause cancer or other disease. The model is actually very good at distinguishing which mutations are just random, harmless variations and which cause disease. The last area we are hopeful about is using Evo 2 for designing new genetic sequences with specific functions of interest. Another relevant next step is integrating these models with models of systems biology that would help us learn about interactions between two or more genes to cause disease.
Can you talk about the collaboration needed to make something like Evo 2 happen?
Something of this scale cannot be done by a single person. The three major institutions involved are Stanford, NVIDIA - which makes the AI computer chips and software to run it - and the Arc Institute, a biomedical research nonprofit that is itself a collaboration among Stanford, the University of California, Berkeley, and the University of California, San Francisco. In terms of personnel, we had three subteams. First, the machine learning team focused on training the model and making sure that the computers ran efficiently. Then, once you train a model, you need to know it actually works as intended. So there's a team of biologists - computational, molecular, systems, prokaryotic, eukaryotic biologists - to make sure the information we are getting back is valuable and usable. And, last, we have an experimental biology team that synthesizes the new DNA, puts it into cells, and tests the cells to make sure what we've created works in real life. It's all very hard work, and I'm very grateful to everyone on the team for their help.
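The "predict the next nucleotide, then autocomplete" idea described in the interview can be illustrated with a deliberately tiny stand-in. The sketch below is not Evo 2 and says nothing about its architecture: it assumes a simple k-mer count model, and every name in it (`train_kmer_model`, `next_nucleotide_probs`, `autocomplete`) is hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_kmer_model(sequences, k=3):
    """Count which nucleotide follows each k-letter context (toy model, not Evo 2)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def next_nucleotide_probs(counts, context, k=3, pseudocount=1.0):
    """Estimate P(next letter | last k letters) with additive smoothing."""
    seen = counts.get(context[-k:], Counter())
    total = sum(seen.values()) + pseudocount * len(ALPHABET)
    return {n: (seen[n] + pseudocount) / total for n in ALPHABET}


def autocomplete(counts, prompt, length, k=3, seed=0):
    """Extend a prompt one sampled nucleotide at a time."""
    rng = random.Random(seed)
    seq = prompt
    for _ in range(length):
        probs = next_nucleotide_probs(counts, seq, k)
        weights = [probs[n] for n in ALPHABET]
        seq += rng.choices(list(ALPHABET), weights=weights)[0]
    return seq


if __name__ == "__main__":
    # Made-up training sequences; a real model learns from trillions of letters.
    model = train_kmer_model(["ACGTACGTACGTACGT", "ACGTTTGACGTACGA"], k=3)
    print(next_nucleotide_probs(model, "ACG"))
    print(autocomplete(model, "ACG", length=20))
```

A count table like this only sees the previous three letters, which is exactly the limitation the million-nucleotide context window is meant to overcome.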
[5]

Massive Foundation Model for Biomolecular Sciences Now Available via NVIDIA BioNeMo
Evo 2, a powerful new AI model built using NVIDIA DGX Cloud on Amazon Web Services (AWS), provides insights into DNA, RNA and proteins across diverse species. Scientists everywhere can now access Evo 2, a powerful new foundation model that understands the genetic code for all domains of life. Unveiled today as the largest publicly available AI model for genomic data, it was built on the NVIDIA DGX Cloud platform in a collaboration led by nonprofit biomedical research organization Arc Institute and Stanford University. Evo 2 is available to global developers on the NVIDIA BioNeMo platform, including as an NVIDIA NIM microservice for easy, secure AI deployment. Trained on an enormous dataset of nearly 9 trillion nucleotides -- the building blocks of DNA and RNA -- Evo 2 can be applied to biomolecular research applications including predicting the form and function of proteins based on their genetic sequence, identifying novel molecules for healthcare and industrial applications, and evaluating how gene mutations affect their function. "Evo 2 represents a major milestone for generative genomics," said Patrick Hsu, Arc Institute cofounder and core investigator, and an assistant professor of bioengineering at the University of California, Berkeley. "By advancing our understanding of these fundamental building blocks of life, we can pursue solutions in healthcare and environmental science that are unimaginable today." The NVIDIA NIM microservice for Evo 2 enables users to generate a variety of biological sequences, with settings to adjust model parameters. Developers interested in fine-tuning Evo 2 on their proprietary datasets can download the model through the open-source NVIDIA BioNeMo Framework, a collection of accelerated computing tools for biomolecular research. 
"Designing new biology has traditionally been a laborious, unpredictable and artisanal process," said Brian Hie, assistant professor of chemical engineering at Stanford University, the Dieter Schwarz Foundation Stanford Data Science Faculty Fellow and an Arc Institute innovation investigator. "With Evo 2, we make biological design of complex systems more accessible to researchers, enabling the creation of new and beneficial advances in a fraction of the time it would previously have taken."
Enabling Complex Scientific Research
Established in 2021 with $650 million from its founding donors, Arc Institute empowers researchers to tackle long-term scientific challenges by providing scientists with multiyear funding -- letting scientists focus on innovative research instead of grant writing. Its core investigators receive state-of-the-art lab space and funding for eight-year, renewable terms that can be held concurrently with faculty appointments with one of the institute's university partners, which include Stanford University, the University of California, Berkeley, and the University of California, San Francisco. By combining this unique research environment with accelerated computing expertise and resources from NVIDIA, Arc Institute's researchers can pursue more complex projects, analyze larger datasets and more quickly achieve results. Its scientists are focused on disease areas including cancer, immune dysfunction and neurodegeneration. NVIDIA accelerated the Evo 2 project by giving scientists access to 2,000 NVIDIA H100 GPUs via NVIDIA DGX Cloud on AWS. DGX Cloud provides short-term access to large compute clusters, giving researchers the flexibility to innovate. The fully managed AI platform includes NVIDIA BioNeMo, which features optimized software in the form of NVIDIA NIM microservices and NVIDIA BioNeMo Blueprints. NVIDIA researchers and engineers also collaborated closely on AI scaling and optimization.
Applications Across Biomolecular Sciences
Evo 2 can provide insights into DNA, RNA and proteins. Trained on a wide array of species across domains of life -- including plants, animals and bacteria -- the model can be applied to scientific fields such as healthcare, agricultural biotechnology and materials science. Evo 2 uses a novel model architecture that can process lengthy sequences of genetic information, up to 1 million tokens. This widened view into the genome could unlock scientists' understanding of the connection between distant parts of an organism's genetic code and the mechanics of cell function, gene expression and disease. "A single human gene contains thousands of nucleotides -- so for an AI model to analyze how such complex biological systems work, it needs to process the largest possible portion of a genetic sequence at once," said Hsu. In healthcare and drug discovery, Evo 2 could help researchers understand which gene variants are tied to a specific disease -- and design novel molecules that precisely target those areas to treat the disease. For example, researchers from Stanford and the Arc Institute found that in tests with BRCA1, a gene associated with breast cancer, Evo 2 could predict with 90% accuracy whether previously unrecognized mutations would affect gene function. In agriculture, the model could help tackle global food shortages by providing insights into plant biology and helping scientists develop varieties of crops that are more climate-resilient or more nutrient-dense. And in other scientific fields, Evo 2 could be applied to design biofuels or engineer proteins that break down oil or plastic. "Deploying a model like Evo 2 is like sending a powerful new telescope out to the farthest reaches of the universe," said Dave Burke, Arc's chief technology officer. "We know there's immense opportunity for exploration, but we don't yet know what we're going to discover."
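The article reports that Evo 2 can predict whether a BRCA1 mutation affects gene function but does not describe how such a score is computed. One generic way to use any sequence model for this is to ask whether the mutated sequence looks less likely than the reference. The sketch below assumes that approach and a toy k-mer count model; it is not Evo 2's procedure, and `train_counts`, `log_likelihood` and `variant_score` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_counts(sequences, k=3):
    """Count next-letter frequencies for each k-letter context (toy model)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def log_likelihood(counts, seq, k=3, pseudocount=1.0):
    """Sum of log P(letter | previous k letters), with additive smoothing."""
    total = 0.0
    for i in range(k, len(seq)):
        seen = counts.get(seq[i - k:i], Counter())
        denom = sum(seen.values()) + pseudocount * len(ALPHABET)
        total += math.log((seen[seq[i]] + pseudocount) / denom)
    return total


def variant_score(counts, reference, position, alt, k=3):
    """Log-likelihood of the mutated sequence minus that of the reference.

    Negative values mean the model finds the variant less plausible.
    """
    variant = reference[:position] + alt + reference[position + 1:]
    return log_likelihood(counts, variant, k) - log_likelihood(counts, reference, k)


if __name__ == "__main__":
    # Made-up data: a repetitive "genome" and a single-letter substitution.
    model = train_counts(["ACGT" * 4] * 5, k=3)
    print(variant_score(model, "ACGTACGTACGT", position=5, alt="A"))
```

A raw score like this is only a ranking signal; turning it into a call of "affects function" or "harmless" would require a threshold or classifier validated against labelled variants.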
Scientists unveil Evo-2, a groundbreaking AI model trained on 128,000 genomes, capable of generating entire chromosomes and small genomes. This advancement promises to transform genetic research and genome engineering.

Scientists from the Arc Institute, Stanford University, and NVIDIA have unveiled Evo-2, an artificial intelligence model that marks a significant advancement in biological research. Trained on a dataset of 128,000 genomes spanning various life forms, it can generate entire chromosomes and small genomes from scratch [1][2].

Evo-2's training set encompasses 9.3 trillion DNA letters from humans, animals, plants, bacteria, and archaea [1]. Unlike previous AI models that focused primarily on protein sequences, Evo-2 has been trained on genome data, including both coding and non-coding sequences [2]. This extensive training allows the model to handle the complexity of eukaryotic genomes, which contain interspersed coding and non-coding regions [2].

The model can process genetic sequences up to 1 million tokens in length, enabling a broader analysis of the genome [3]. This capability allows scientists to explore relationships between genetic sequences and cell function, gene expression, and disease [3].

Evo-2 has demonstrated impressive abilities in several areas. It predicted the effects of previously studied mutations in BRCA1, a gene implicated in breast cancer, and its developers describe it as state of the art for non-coding mutations [2]. It was used to decipher features of complex genomes, including that of the woolly mammoth [2]. It generated sequences inspired by the Mycoplasma genitalium genome, human mitochondria, and a 330,000-letter yeast chromosome [2]. It was also used to design sequences that alter chromatin accessibility [2].

Researchers anticipate that Evo-2 will have far-reaching implications across multiple scientific domains. In healthcare and drug discovery, it could help identify gene variants linked to specific diseases and design molecules that target them [3][5]. In agriculture, it could improve understanding of plant biology and support the development of climate-resilient or nutrient-dense crops [3][5]. In other fields, it could be used to engineer biofuels or proteins that break down plastic or oil [3][5]. It may also contribute to advances in synthetic biology and precision medicine [1].

The Evo-2 model has been made available to scientists through web interfaces, and its software code, data, and parameters are freely accessible [2]. This open-source approach aims to accelerate the exploration and design of biological complexity [3].

Evo-2 was built using NVIDIA DGX Cloud on Amazon Web Services (AWS), utilizing 2,000 NVIDIA H100 GPUs [5]. The project involved collaboration between multiple institutions, including Stanford University, NVIDIA, and the Arc Institute [4].

While Evo-2 represents a significant milestone in generative genomics, researchers emphasize the need for further validation and refinement. Experiments are underway to test its predictions on chromatin accessibility and other complex genetic structures [2]. As more scientists adopt and build upon Evo-2's capabilities, it is expected to play an increasingly important role in advancing our understanding of genomics and accelerating discoveries in the life sciences [4][5].

Summarized by Navi