Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Scientists at the University of California, San Francisco (UCSF) have created an AI system capable of generating artificial enzymes from scratch.
Their system, dubbed ProGen, uses next-token prediction to assemble amino acid sequences into artificial proteins. When tested, some of the resultant enzymes worked as well as those found in nature, even when their artificially generated amino acid sequences diverged significantly from any known natural protein.
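To make the idea of next-token prediction concrete, here is a minimal sketch in Python. It is not ProGen: it swaps in a toy bigram model that only counts which amino acid letter tends to follow which, and then samples a new sequence one residue at a time. The training strings, the `train_bigram` and `generate` names, and the start and end markers are all made up for illustration.

```python
import random
from collections import defaultdict

# Hypothetical start/end markers; not part of any real protein alphabet.
START, END = "^", "$"

def train_bigram(sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = [START] + list(seq) + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, rng, max_len=50):
    """Sample one sequence, choosing each next residue from the counts."""
    out = []
    prev = START
    while len(out) < max_len:
        choices = counts[prev]
        tokens = list(choices)
        weights = [choices[t] for t in tokens]
        nxt = rng.choices(tokens, weights=weights, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

# Made-up toy strings, not real lysozyme sequences.
TOY_SEQUENCES = ["MKALIV", "MKVLAV", "MRALIG"]
model = train_bigram(TOY_SEQUENCES)
sample = generate(model, random.Random(0))
print(sample)
```

A real protein language model conditions each prediction on the whole preceding sequence rather than on a single previous residue, but the generation loop has the same shape: predict, sample, append, repeat.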
Scientists said the new technology could become more powerful than directed evolution, the Nobel Prize-winning protein design technique. They expect it to energize the 50-year-old field of protein engineering by speeding the development of new proteins for uses ranging from therapeutics to degrading plastic.
James Fraser, Ph.D., professor of bioengineering and therapeutic sciences at the UCSF School of Pharmacy, said, “The artificial designs perform much better than designs that were inspired by the evolutionary process. The language model is learning aspects of evolution, but it’s different than the normal evolutionary process.”
“We now can tune the generation of these properties for specific effects. For example, an enzyme that is incredibly thermostable, likes acidic environments, or won’t interact with other proteins.”
To build the model, the researchers fed it the amino acid sequences of 280 million unique proteins of all kinds and gave it a few weeks to process the data. They then fine-tuned it on 56,000 sequences from five different lysozyme families, along with some background knowledge about these particular proteins.
The model quickly generated a million sequences. From these, the study team chose 100 to test, based on how closely they resembled the sequences of natural proteins and on how naturalistic the AI proteins’ underlying amino acid “grammar” and “semantics” were.
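As a rough illustration of narrowing a large pool of generated sequences down to a shortlist, the sketch below scores each candidate with a crude "naturalness" proxy (the fraction of its three-letter fragments that also occur in a set of natural sequences) and keeps the top scorers. This is an assumption made for illustration, not the study's actual selection procedure; the function names and the toy sequences are hypothetical.

```python
import heapq

def kmers(seq, k=3):
    """Return the set of overlapping length-k fragments in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def naturalness(candidate, natural_kmers, k=3):
    """Fraction of the candidate's k-mers that also occur in natural sequences."""
    cand = kmers(candidate, k)
    if not cand:
        return 0.0
    return len(cand & natural_kmers) / len(cand)

def select_top(candidates, natural_seqs, n, k=3):
    """Keep the n candidates with the highest naturalness score."""
    natural_kmers = set()
    for s in natural_seqs:
        natural_kmers |= kmers(s, k)
    return heapq.nlargest(
        n, candidates, key=lambda c: naturalness(c, natural_kmers, k)
    )

# Made-up toy strings standing in for natural and generated sequences.
NATURAL = ["MKALIVG", "MKVLAVG"]
CANDIDATES = ["MKALAVG", "WWWWWWW", "MKVLIVG", "GGGGMKA"]
chosen = select_top(CANDIDATES, NATURAL, n=2)
print(chosen)
```

In practice a shortlist like this would more plausibly rely on alignment-based similarity and the language model's own likelihood scores; the k-mer overlap here only stands in for "looks like a natural protein".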
Out of this initial batch of 100 proteins, which Tierra Biosciences evaluated in vitro, the team made five artificial proteins to test in cells and compared their activity with that of hen egg white lysozyme (HEWL), an enzyme found in the whites of chicken eggs. Similar lysozymes occur in human tears, saliva, and milk, where they act as antimicrobial defenses against bacteria and fungi.
Two of the artificial enzymes could degrade bacterial cell walls with activity comparable to HEWL, even though their sequences were only about 18% identical to each other.
Just one mutation in a natural protein can make it stop working. Yet in a subsequent round of screening, the scientists discovered that the AI-generated enzymes showed activity even when as little as 31.4% of their sequence resembled any known natural protein.
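Figures such as 31.4% are sequence-identity percentages. As a hedged sketch of how such a number can be computed, the Python function below takes two sequences that have already been aligned (with "-" marking gaps) and reports the share of compared positions that carry the same amino acid. The exact definition used in the study may differ, for example in how gaps are counted; the example strings are invented.

```python
def percent_identity(a, b):
    """Percent of non-gap aligned positions where both sequences match.

    Assumes `a` and `b` are already aligned and use "-" for gaps.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Invented, pre-aligned toy sequences.
designed = "MKT-LLVAG"
natural = "MRTALIV-G"
print(round(percent_identity(designed, natural), 1))
```

To find how closely a designed protein resembles "any known natural protein", one would align it against every sequence in a reference database and take the highest identity found.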
The AI could even learn how the enzymes should be shaped by studying the raw sequence data. Measured with X-ray crystallography, the atomic structures of the artificial proteins looked just as they should, although the sequences were like nothing seen before.
Nikhil Naik, Ph.D., Director of AI Research at Salesforce Research and the senior author of the paper, said, “When you train sequence-based models with lots of data, they are really powerful in learning structure and rules. They learn what words can co-occur, and also compositionality.”
Ali Madani, Ph.D., founder of Profluent Bio, a former research scientist at Salesforce Research, and the paper’s first author, said, “Given the limitless possibilities, it’s remarkable that the model can so easily generate working enzymes.”
“The capability to generate functional proteins from scratch out-of-the-box demonstrates we are entering into a new era of protein design. This is a versatile new tool available to protein engineers, and we’re looking forward to seeing the therapeutic applications.”
- Ali Madani, Ben Krause, Eric R. Greene, Subu Subramanian, Benjamin P. Mohr, James M. Holton, Jose Luis Olmos, Caiming Xiong, Zachary Z. Sun, Richard Socher, James S. Fraser, Nikhil Naik. Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 2023; DOI: 10.1038/s41587-022-01618-2