Curated by THEOUTPOST
On Mon, 31 Mar, 4:03 PM UTC
5 Sources
[1]
The AI revolution comes to protein sequencing
Artificial intelligence (AI) has already revolutionized the study of how proteins fold up into their 3D shapes, an achievement honored by last year's Nobel Prize in Chemistry. Now, AI is transforming protein sequencing -- identifying proteins from the sequence of amino acids that make them up. AI is often faster than conventional methods. It also enables researchers to sequence proteins they have never seen before, a common challenge in medical diagnostics, environmental studies, and archaeology. In the latest advance, European researchers reported this week in Nature Machine Intelligence that an AI known as InstaNovo can identify pathogenic proteins in wounds and unknown proteins produced by the brew of microbes in seawater samples. InstaNovo isn't alone. Over the past 4 years, researchers have unveiled more than two dozen protein sequencing AIs. "It seems clear that this is where the field is going to go," says William Noble, a proteomics AI developer at the University of Washington. Researchers in other areas are eager to apply the tools. Evolutionary biologists, for example, are using them to identify ancient proteins that could reveal insights into the differences between modern humans and our extinct relatives. "It's already helpful," says Enrico Cappellini, a paleoproteomics expert at the University of Copenhagen. "And it's just going to get better and better." The world of proteins is far more complex than that of their genetic blueprints, DNA and RNA. The human genome, for example, contains roughly 20,000 genes, but those genes can give rise to 10 million different proteins, because of changes that can occur as DNA is copied into RNA or as RNA is translated into proteins, which themselves can be appended with myriad chemical modifications. Biologists traditionally identify proteins by breaking them up into short fragments called peptides, each made up of between five and 20 amino acids.
Scientists then weigh those fragments in a mass spectrometer, match the weights to those of known peptides in one of dozens of databases to determine their identity, and then piece together the fragments into the full molecule. But there are problems with this approach. For starters, up to 70% of peptides found by mass spectrometry aren't in any databases. "Traditional proteomics is a bit like a Google search. If it's not there, you will not find it," says Timothy Patrick Jenkins, a proteomics expert at the Technical University of Denmark. And as the databases of peptides continue to grow, it's taking ever more computer time to spot hits. The new AI sequencers don't bother searching for matches among known peptides. Instead, they calculate the weights of all the potential peptide fragments that could result from chemical modifications to a peptide of a given length. If the AI comes up with fragments that match ones from the actual sample, it tries to assemble them into full-length proteins. To increase their accuracy, the protein sequencing AIs are trained on millions of known peptides and how they assemble into known proteins. This allows the AIs to learn the most common ways amino acid chains combine. The approach, Jenkins says, is similar to the way large language models (LLMs) such as ChatGPT train on vast bodies of text to learn the rules of syntax. Just as an LLM learns that "the boy bounces a ball" is more likely to be a valid sentence than "bounces a ball the boy," the proteomics algorithms learn a kind of protein syntax, which provides the most likely sequence for a given set of peptides. In 2021, Noble and his colleagues unveiled Casanovo, the first protein sequencing AI to use a deep neural network like the one that powers ChatGPT. In a 2024 paper, Noble's team reported that the AI proved adept at identifying novel sequences of peptides that weren't in the training data.
Additional experiments showed that Casanovo excelled at identifying the cell-surface peptides that the immune system targets when it attacks cancer, as well as unknown proteins in seawater samples. Now, Jenkins and his colleagues have built on these results with InstaNovo. It, too, uses a deep learning neural network. But unlike previous AI protein sequencing models, it adds a strategy called diffusion, an approach that has supercharged AI imagemaking models such as DALL-E, and protein structure models such as RoseTTAFold or AlphaFold. Diffusion models initially add random noise to the input data and then remove it to see how the procedure sharpens the output. Based on the outcome, they then apply noise removal more broadly to further sharpen the result. In their paper, Jenkins and his colleagues report that in a head-to-head test with Casanovo, InstaNovo, coupled with a refinement called InstaNovo+, identified 42% more peptides in a labmade brew of proteins from nine organisms. When the team applied its AI to real-world proteomics challenges, it found, among other results, that it identified 1225 peptides unique to the blood protein albumin in infected leg wounds, 10 times more than conventional database searches did. Of those, 254 were new peptides not in the databases. The researchers also mapped other peptides to 52 bacterial proteins. These and other results show that InstaNovo "can analyze complex samples and come up with answers," says Catrine Soiberg, who heads R&D for Atlas Antibodies, a firm that helps researchers map proteins throughout tissues. Noble, who got an early look at InstaNovo and has already put it through its paces, calls it "a real advance." Others are running with it as well. Matthew Collins, a proteomics researcher at the University of Cambridge, has recently been testing several AI protein sequencing tools to analyze archaeological samples.
In most cases the proteins in the samples have undergone extensive chemical changes after eons underground or came from extinct plants and animals, so they are unlikely to be represented in conventional protein and peptide databases. The models, Collins says, "are particularly good for messy environments [where] you don't know what's there." Already the AI tools have enabled his team to spot signatures of rabbit proteins in Neanderthal sites and fish muscle proteins in ancient Brazilian pots. "[The models] are so useful, we have switched all our research to work with them," Collins says. "In my mind it's a step change."
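To make the "Google search" limitation described above concrete, here is a minimal Python sketch of matching a measured peptide weight against a list of known peptides. It is an illustration only, not any tool's actual method: real database searches compare whole fragment spectra rather than a single weight, and the two-entry `toy_database`, the function names, and the 0.01 Da tolerance are assumptions made for the example. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water molecule is added to the summed residues


def peptide_mass(sequence):
    """Return the neutral monoisotopic mass of a peptide given as one-letter codes."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def database_search(observed_mass, database, tolerance=0.01):
    """Return every database peptide whose mass is within `tolerance` Da.

    Simplification (assumption): real searches score fragment spectra,
    not just the intact peptide mass.
    """
    return [p for p in database if abs(peptide_mass(p) - observed_mass) <= tolerance]


# Hypothetical two-entry database, for illustration only.
toy_database = ["PEPTIDE", "SAMPLER"]

known = database_search(peptide_mass("PEPTIDE"), toy_database)
novel = database_search(peptide_mass("PEPTIDEK"), toy_database)

print(known)  # the peptide is catalogued, so it is found
print(novel)  # the peptide is not catalogued, so nothing is returned
```

The second lookup comes back empty even though the peptide is perfectly real, which is the gap that de novo tools aim to fill.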
[2]
AI is helping scientists decode previously inscrutable proteins
The tools could help uncover better cancer treatments, illuminate rare diseases and more
Generative artificial intelligence has entered a new frontier of fundamental biology: helping scientists to better understand proteins, the workhorses of living cells. Scientists have developed two new AI tools to decipher proteins often missed by existing detection methods, researchers report March 31 in Nature Machine Intelligence. Uncovering these unknown proteins in all types of biological samples could be key to creating better cancer treatments, improving doctors' understanding of diseases, and discovering mechanisms behind unexplained animal abilities. If DNA represents an organism's master plan, then proteins are the final build, encapsulating what cells actually make and do. Deviations from the DNA blueprint for making proteins are common: Proteins might undergo alterations or cuts post-production, and there are many instances where something goes awry in the pipeline, leading to proteins that differ from the initial genetic schematic. These unexpected, "hidden" proteins have been historically difficult for scientists to identify and analyze. That's where the machine learning tools come in. The AI models, called InstaNovo and InstaNovo+, are a step toward "the holy grail" of protein research: to unravel the genetic identity of previously unstudied proteins en masse, says Benjamin Neely, a chemist and protein scientist at the National Institute of Standards and Technology in Gaithersburg, Md. With continued advances and testing, these tools or similar ones are "going to be powerful. It's going to let me see things that I can't normally see," says Neely, who was not involved in the study. Many non-model organisms haven't been well studied, and their proteins are poorly cataloged. As a hypothetical, Neely suggests the new tools could be used to find the obscure kidney proteins that allow stingrays to move between brackish water and the ocean.
AI has already transformed how researchers predict protein folding with a tool called AlphaFold. And machine learning-powered protein design earned a Nobel Prize in 2024. Filling long-standing gaps in protein sequencing is poised to be the next AI leap in the field, Neely suggests. InstaNovo (IN) is structured similarly to OpenAI's GPT-4 transformer model and trained to translate the peaks and valleys of a protein's "fingerprint," plotted through mass spectrometry, into a string of likely amino acids. These amino acid sequences can then be used to reconstruct and identify the hidden protein. InstaNovo+ (IN+) is a diffusion model that works more like an AI image generator and is primed to take the same initial information and progressively remove noise to produce a clear protein picture. IN and IN+ are not the first attempts to apply machine learning to protein sequencing. But the new study demonstrates how far the technology has come in recent years -- edging ever closer to real-world utility, largely thanks to expanding protein analysis databases like Proteome Tools, which can be used to train AI models. These were the data used to develop and train IN and IN+, but the models' analyses extend beyond the proteins in existing databases. They can suggest possible protein segments that haven't yet been cataloged. Both tools individually show promise across a spate of tests compared with results from a previously released AI transformer protein decoder called Casanovo and from the database search method most commonly used to ID unknown proteins. In straightforward protein sequencing tests, the models don't outperform database search, yet they seem to excel in more complicated trials. One especially challenging task is sequencing human immune proteins, which are uniquely tough to analyze with standard methods because of their small size and amino acid composition.
The researchers report that IN finds about three times as many candidate protein segments as classic database searching, going from about 10,000 identified peptides to more than 35,000. And IN+ finds about six times more. Used together, the models' combined performance offers an even larger boost. Based on the thorough validation presented in the study, Amanda Smythers, who specializes in protein analysis, says she'd be eager to try the tools. A chemist at Dana-Farber Cancer Institute in Boston, Smythers imagines using the AI models to answer questions like why pancreatic cancer commonly triggers rapid muscle wasting and fatigue. Proteins made by cancer cells or disruption of normal protein function in noncancer cells could be at fault. "It's a really important piece of biology that we don't understand yet," Smythers says. Bringing obscure protein sequences to the surface (whether they're from cancer cells or stingray kidneys) could enable the possibility of neutralizing harmful ones or harnessing beneficial ones to treat disease. Still, the new models have limitations. The possibility of false positives, which the study authors estimate at around 5 percent, means the AI outputs require extra verification, says coauthor Konstantinos Kalogeropoulos, a computational bioengineer at the Technical University of Denmark in Lyngby. And how to best evaluate these AI tools remains an open question, notes William Noble, a developer of Casanovo and a computer scientist and proteomics researcher at the University of Washington in Seattle. Finally, AI sequencing is not a replacement for database searching, Smythers says. It's a supplement. "There's never one single tool that's good for every job," she says. "However, it's tools like this that really help us keep progressing the field further."
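What it means to turn the peaks of a spectrum into "a string of likely amino acids" can be illustrated with the classic ladder-reading idea, sketched below in Python. This is not InstaNovo's learned model: it assumes an idealized, noise-free ladder of prefix-fragment peaks, and the function names and the 0.01 Da tolerance are hypothetical. Isoleucine is left out of the table because it weighs exactly the same as leucine.

```python
# Monoisotopic residue masses in daltons; I is omitted because it is isobaric with L.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293, "D": 115.02694,
    "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276  # singly charged prefix fragments carry one extra proton


def ideal_ladder(sequence):
    """Return the idealized prefix-fragment peaks for a peptide (an assumption:
    real spectra are noisy, incomplete, and contain other ion types)."""
    peaks, total = [], PROTON
    for aa in sequence:
        total += RESIDUE_MASS[aa]
        peaks.append(total)
    return peaks


def read_ladder(peaks, tolerance=0.01):
    """Read a sequence from the gaps between neighbouring peaks.

    Each gap is looked up in the residue table; a gap that matches no single
    residue is written as '?'.
    """
    letters, previous = [], PROTON
    for peak in sorted(peaks):
        gap = peak - previous
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tolerance]
        letters.append(matches[0] if matches else "?")
        previous = peak
    return "".join(letters)


complete = ideal_ladder("SAMPLER")
missing_one = complete[:2] + complete[3:]  # drop the third peak

print(read_ladder(complete))
print(read_ladder(missing_one))
```

With every peak present the ladder reads back cleanly; with one peak missing, the gap spans two residues and the simple lookup can only mark it as unknown. Learned models are trained to cope with exactly this kind of incomplete evidence.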
[3]
AI takes step towards cracking biology's toughest problem - ...
A new AI system, dubbed InstaNovo, could revolutionise protein sequencing just as AlphaFold transformed protein structure prediction, its developers claim. While DNA sequencing is routine, determining protein sequences remains one of biology's toughest challenges, says corresponding author Timothy Jenkins of the Technical University of Denmark. InstaNovo aims to change that by directly reading protein sequences from raw experimental data, unlocking vast areas of previously inaccessible biology. In proteomics - the study of proteins in biological systems - de novo peptide sequencing is used to figure out a protein's base amino acid sequence using tandem mass spectrometry. This technique fragments peptide ions and analyses their mass-to-charge ratios at multiple stages, allowing researchers to infer the original sequence. 'There are many techniques for studying proteins, but none match the throughput and comprehensiveness of mass spectrometry,' says Kostas Kalogeropoulos, also at the Technical University of Denmark. 'We analyse proteins - or their smaller fragments, peptides - by measuring their mass.' Unlike database-dependent approaches, which compare unknown peptides to known sequences, de novo sequencing reconstructs them from scratch, requiring no prior information. 'De novo sequencing has long been underappreciated, yet it holds immense potential for many biological applications,' says Jenkins. However, issues with accuracy and high computational costs have hindered its widespread adoption. 'Traditionally, de novo peptide sequencing relies on algorithms that functioned similarly to manually reconstructing a sequence,' explains Kalogeropoulos. 'These methods require all necessary information to be present, otherwise they would fail.' 
'InstaNovo and similar tools circumvent this major limitation by direct "de novo" interpreting peptide sequences from peptide fragmentation spectra,' comments Francis Impens at the VIB research institute and the University of Ghent, who was not involved in the study. 'That means we can now identify proteins from genomically unsequenced species or from very complex samples, [such as] microbiome samples, for which the species composition is unknown.' InstaNovo is a transformer-based AI, a neural network originally used in language processing that learns context and meaning by tracking relationships in sequential data - like the correct sequence of words in a sentence. When applied to de novo peptide sequencing, InstaNovo analyses peaks or signals from mass spectrometry data and processes them through multiple steps using transformer decoder layers, which are like smart filters that piece together the most likely sequence of amino acids from the fragmented data. To choose the most accurate sequence, InstaNovo uses a 'knapsack beam search', a strategy that efficiently tests different possible sequences, keeps the best ones and then refines them. This works similarly to how a human would double-check and fine-tune their guesses when manually sequencing proteins. 'InstaNovo ... directly predicts the sequence from the spectrum, eliminating the need for database lookups,' says first author Kevin Eloff. 'This is possible because our models have learned the underlying patterns of the sequences we are measuring and can translate a spectrum directly into the corresponding peptide sequences.' As a proof of concept, InstaNovo was used to analyse peptides in fluid from patients' wounds, identifying at least three pathogens, which were confirmed by standard techniques. 'While the presence of pathogens in these wounds was not unexpected, we were surprised by how easily we could detect them,' says Kalogeropoulos. 
'This finding could have significant implications for how we diagnose and treat chronic wound patients.' The team is now exploring InstaNovo's potential to map all the proteins in a patient's cell. It could also identify mutated cancer proteins and proteins with unknown roles. Impens finds it exciting that InstaNovo can extend database search results beyond known sequences, surpassing previous de novo sequencing tools. However, it still needs to be evaluated and trained on larger datasets. 'The model still needs to be fine-tuned for post-translational modifications [which affect protein function and are critical for biological activity] and data from different types of mass spectrometers,' he adds. As with any new technology, he expects challenges in integration and real-world application. Jenkins and Kalogeropoulos agree but believe that interdisciplinary collaborations will demonstrate the model's benefits. 'We cannot say that de novo peptide sequencing is fully solved yet,' says Eloff, 'but we hope to train on a lot more data and make improvements wherever we can ... making state-of-the-art models available to anyone'.
[4]
New AI models possible game-changers within protein science and healthcare
In the wake of broadly available AI tools, most technical and natural sciences fields are advancing rapidly. This is particularly true in biotechnology, where AI models power breakthroughs in drug discovery, precision medicine, gene editing, food security, and many other research areas. One sub-field is proteomics -- the study of proteins on a large scale -- where vast amounts of protein data are gathered in databases against which a sample can be compared. These databases enable scientists to discern which proteins -- and, thereby, microorganisms -- are present in a sample. They allow a doctor to diagnose diseases, monitor the effectiveness of a treatment, or identify pathogens present in a patient's sample. Although these tools are very useful and effective, there are limits to what they can do, says Timothy Patrick Jenkins, an Associate Professor at DTU Bioengineering and corresponding author: "First off, no database includes everything, so you need to know which databases are relevant to your particular needs. Then deep searches are very time-consuming and demand a lot of computer power. And, finally, it's nearly impossible to identify proteins that haven't been registered yet." For this reason, some groups have worked on so-called 'de novo sequencing algorithms' that improve accuracy and lower computational costs with increasing database size. Still, according to Jenkins and colleagues from DTU, Delft University in the Netherlands and the British AI company InstaDeep, their performance remained "underwhelming."
Exceeding state-of-the-art
In a new paper in Nature Machine Intelligence, they propose two novel AI models to assist researchers, medical practitioners, and commercial entities in finding exactly the necessary information in the vast amounts of data. These are called InstaNovo and InstaNovo+ and are available to researchers through the InstaDeep website (see fact box). "Seen together, our models exceed state-of-the-art and are significantly more precise than currently available tools.
Furthermore, as we show in the paper, our models are not specific to a particular research area. Instead, these tools could propel significant advances in all fields involving proteomics," says Kevin Michael Eloff, a research engineer at InstaDeep and co-first author of the paper. To assess the usefulness of their models, the researchers have trained and tested them on several specific tasks within major areas of interest. One investigation was performed on wound fluid from venous leg ulcer patients. Since venous leg ulcers are notoriously difficult to treat and often become chronic, knowing which microorganisms like bacteria are present is crucial to treatment. The models could map ten times as many sequences as a database search, among them E. coli and Pseudomonas aeruginosa -- the latter being a multidrug-resistant bacterium. Another use case was conducted on small pieces of protein, called peptides, displayed on the surface of cells. These help the immune system recognize infections and diseases such as cancer. The InstaNovo models identified thousands of new peptides that were not found using traditional methods. In personalised cancer treatments empowering the immune system -- immunotherapy for short -- these peptides are all potential attack points. "In combination, our tests of the model on complex cases, where, for example, unknown proteins are present, or where we have no prior knowledge of the organisms involved, show that they are suitable to improve our understanding significantly. That this bodes well for biomedicine is a given, since it can directly improve identification of our microbiome, as well as improve our efforts within personalised medicine and cancer immunology," says Konstantinos Kalogeropoulos, co-first author and Assistant Professor at DTU Bioengineering. 
The paper provides six additional cases that demonstrate how these models improve therapeutic sequencing, discover novel peptides, detect unreported organisms, and significantly enhance proteomics searches. The implications of their results extend far beyond the medical sciences, says Timothy Patrick Jenkins: "Looking at it from a purely technical, scientific perspective, it is also true that with these tools, we can improve our understanding of the biological world as a whole, not only in terms of healthcare but also in industry and academia. Within every field using proteomics -- be it plant science, veterinary science, industrial biotech, environmental monitoring, or archaeology -- we can gain insights into protein landscapes that have been inaccessible until now." InstaNovo is a transformer-based model designed for de novo peptide sequencing. Developed in collaboration between InstaDeep and the Department of Biotechnology and Biomedicine at the Technical University of Denmark (DTU), it translates fragment ion peaks from mass spectrometry data into peptide sequences with unprecedented precision. Unlike traditional methods that rely on pre-existing databases, InstaNovo identifies peptides that have never been documented before -- expanding the landscape of proteomic discovery. A key innovation of the InstaNovo models is InstaNovo+, a diffusion-based iterative refinement model that enhances sequence accuracy by mimicking how researchers manually refine peptide predictions. InstaNovo+ begins with an initial sequence -- either derived from InstaNovo or generated at random -- and improves it, step by step. When paired with InstaNovo, InstaNovo+ significantly reduces false discovery rates (FDR) and improves sequence accuracy, not just by refining predictions, but by exploring a broader range of potential peptide sequences. 
Unlike autoregressive models such as InstaNovo and others, which predict peptide sequences one amino acid at a time, InstaNovo+ processes entire sequences holistically, enabling greater accuracy and higher detection rates. Together, InstaNovo and InstaNovo+ enhance de novo peptide sequencing, striking a balance between precision and exploration to accelerate biological discovery.
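The idea of starting from a rough sequence and improving it step by step can be sketched as a simple refinement loop in Python. This is emphatically not a diffusion model and not InstaNovo+ itself: a learned denoiser is replaced here by plain hill climbing against an idealized list of prefix-fragment masses, and every name, the reduced alphabet, and the 0.01 Da tolerance are assumptions for illustration. What it does share with the description above is that each step reconsiders every position of the whole sequence rather than writing it out one amino acid at a time.

```python
# Monoisotopic residue masses in daltons (reduced alphabet, for illustration only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "L": 113.08406, "E": 129.04259, "M": 131.04049, "R": 156.10111,
}


def prefix_masses(sequence):
    """Cumulative residue masses of every prefix of the sequence."""
    total, out = 0.0, []
    for aa in sequence:
        total += RESIDUE_MASS[aa]
        out.append(total)
    return out


def score(sequence, peaks, tolerance=0.01):
    """Count how many prefix masses of `sequence` coincide with an observed peak."""
    return sum(
        any(abs(mass - peak) <= tolerance for peak in peaks)
        for mass in prefix_masses(sequence)
    )


def refine(initial, peaks, max_steps=20):
    """Repeatedly apply the single substitution, anywhere in the sequence,
    that most improves agreement with the peaks; stop when nothing helps."""
    current, best = initial, score(initial, peaks)
    for _ in range(max_steps):
        proposals = [
            current[:i] + aa + current[i + 1:]
            for i in range(len(current))
            for aa in RESIDUE_MASS
            if aa != current[i]
        ]
        top = max(proposals, key=lambda s: score(s, peaks))
        top_score = score(top, peaks)
        if top_score <= best:
            break  # no substitution improves the match any further
        current, best = top, top_score
    return current


observed = prefix_masses("SAMPLER")    # idealized, noise-free evidence
refined = refine("SGMPVER", observed)  # initial guess with two wrong residues
print(refined)
```

Starting from a guess with two wrong residues, the loop corrects one error per step and then stops once no further substitution improves the match.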
[5]
New AI models enhance protein data analysis for medical research
Researchers have developed new AI models that can vastly improve accuracy and discovery within protein science. The models could assist the medical sciences in overcoming present challenges within personalized medicine, drug discovery, and diagnostics. In the wake of the widespread availability of AI tools, most fields in the technical and natural sciences are advancing rapidly. This is particularly true in biotechnology, where AI models power breakthroughs in drug discovery, precision medicine, gene editing, food security, and many other research areas. One sub-field is proteomics -- the study of proteins on a large scale -- where vast amounts of protein data are gathered in databases against which a sample can be compared. These databases enable scientists to discern which proteins -- and, thereby, microorganisms -- are present in a sample. They allow a doctor to diagnose diseases, monitor the effectiveness of a treatment, or identify pathogens present in a patient's sample. Although these tools are useful and effective, there are limits to what they can do, says Timothy Patrick Jenkins, an Associate Professor at DTU Bioengineering and corresponding author: "First off, no database includes everything, so you need to know which databases are relevant to your particular needs. Then deep searches are very time-consuming and demand a lot of computer power. And, finally, it's nearly impossible to identify proteins that haven't been registered yet." For this reason, some groups have worked on so-called "de novo sequencing algorithms" that improve accuracy and lower computational costs with increasing database size. Still, according to Jenkins and colleagues from DTU, Delft University in the Netherlands and the British AI company InstaDeep, their performance remained "underwhelming." 
Exceeding state-of-the-art
In a new paper in Nature Machine Intelligence, they propose two novel AI models to assist researchers, medical practitioners, and commercial entities in finding exactly the necessary information in the vast amounts of data. These are called InstaNovo and InstaNovo+ and are available to researchers through the InstaDeep website. "Seen together, our models exceed state-of-the-art and are significantly more precise than currently available tools. Furthermore, as we show in the paper, our models are not specific to a particular research area. Instead, these tools could propel significant advances in all fields involving proteomics," says Kevin Michael Eloff, a research engineer at InstaDeep and co-first author of the paper. To assess the usefulness of their models, the researchers have trained and tested them on several specific tasks within major areas of interest. One investigation was conducted on wound fluid from patients with venous leg ulcers. Since venous leg ulcers are notoriously difficult to treat and often become chronic, knowing which microorganisms, such as bacteria, are present is crucial to treatment. The models could map 10 times as many sequences as a database search, including those of E. coli and Pseudomonas aeruginosa -- the latter being a multidrug-resistant bacterium. Another use case was conducted on small pieces of protein, called peptides, displayed on the surface of cells. These help the immune system recognize infections and diseases such as cancer. The InstaNovo models identified thousands of new peptides that were not found using traditional methods. In personalized cancer treatments, empowering the immune system -- also known as immunotherapy -- these peptides are all potential targets for attack.
"In combination, our tests of the model on complex cases, where, for example, unknown proteins are present, or where we have no prior knowledge of the organisms involved, show that they are suitable to improve our understanding significantly. That this bodes well for biomedicine is a given, since it can directly improve identification of our microbiome, as well as improve our efforts within personalized medicine and cancer immunology," says Konstantinos Kalogeropoulos, co-first author and Assistant Professor at DTU Bioengineering. The paper provides six additional cases that demonstrate how these models improve therapeutic sequencing, discover novel peptides, detect unreported organisms, and significantly enhance proteomics searches. The implications of their results extend far beyond the medical sciences, says Timothy Patrick Jenkins: "Looking at it from a purely technical, scientific perspective, it is also true that, with these tools, we can improve our understanding of the biological world as a whole, not only in terms of health care, but also in industry and academia. "Within every field using proteomics -- be it plant science, veterinary science, industrial biotech, environmental monitoring, or archaeology -- we can gain insights into protein landscapes that have been inaccessible until now."
New AI models, InstaNovo and InstaNovo+, are transforming protein sequencing, offering improved accuracy and the ability to identify previously unknown proteins. This breakthrough has significant implications for medical research, drug discovery, and various scientific fields.
Artificial intelligence (AI) is revolutionizing the field of protein sequencing, building on its recent success in protein folding prediction. Researchers have developed new AI models, InstaNovo and InstaNovo+, that are transforming how scientists identify and analyze proteins, potentially leading to breakthroughs in medical research, drug discovery, and various scientific disciplines [1][2].
Protein sequencing has long been a complex and challenging task in biology. Unlike DNA sequencing, which is now routine, determining protein sequences remains one of biology's toughest problems [3]. Traditional methods rely on breaking proteins into smaller fragments called peptides, weighing them in a mass spectrometer, and matching the weights to known peptides in databases. However, this approach has limitations, including the inability to identify proteins not present in existing databases [1].
InstaNovo, developed by researchers at the Technical University of Denmark and other institutions, is a transformer-based AI model similar to OpenAI's GPT-4 [2]. It translates mass spectrometry data directly into likely amino acid sequences without relying on database searches. InstaNovo+ is a diffusion model that further refines the results, similar to AI image generation tools [1][4].
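These models offer several advantages over traditional methods. They can propose peptides that are missing from existing databases, and in tests on human immune proteins InstaNovo found about three times as many candidate protein segments as a database search, while InstaNovo+ found about six times as many [2]. In wound fluid from venous leg ulcer patients, the models mapped about 10 times as many sequences as a database search [4][5].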
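The potential applications of these AI models span multiple scientific fields, including identifying pathogens in chronic wounds, finding immunotherapy targets on the surface of cancer cells, analyzing proteins from seawater microbes, and identifying ancient proteins in archaeological samples [1][4].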
The advent of AI-powered protein sequencing tools like InstaNovo represents a significant leap forward in proteomics. These models can potentially unlock vast areas of previously inaccessible biology, similar to how AlphaFold transformed protein structure prediction [3][5].
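While the new AI models show great promise, there are still challenges to overcome. The study authors estimate a false-positive rate of around 5 percent, so the AI outputs require extra verification, and how best to evaluate such tools remains an open question [2]. The models also still need to be fine-tuned for post-translational modifications and for data from different types of mass spectrometers [3].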
As research continues, it's expected that these tools will become increasingly powerful and widely adopted across various scientific disciplines [4][5].
© 2025 TheOutpost.AI All rights reserved