Predicting a molecule's properties using language learning

The AI algorithm predicts molecular properties from very little data, speeding up drug discovery.


The laborious, trial-and-error process typically used to discover new materials and drugs can take decades and cost millions of dollars. To narrow the search, scientists often use machine learning to predict chemical properties and choose which molecules to synthesize and test in the lab. Researchers from MIT and the MIT-IBM Watson AI Lab have now created a unified framework that performs both molecular property prediction and molecule generation significantly faster than standard deep-learning techniques.

To learn to predict a molecule's biological or mechanical properties, a machine-learning model must typically be trained on millions of labeled molecular structures. But such large, labeled training datasets are rarely available, which often limits the effectiveness of machine-learning approaches.

In contrast, the MIT researchers' technique can accurately predict molecular properties from only minimal data. Their system learns the underlying rules that govern how building blocks combine to form valid molecules. Because these rules capture the commonalities among molecular structures, they enable the system to generate new molecules and predict their properties efficiently. The approach outperformed other machine-learning techniques on both small and large datasets.

Lead author Minghao Guo, a graduate student in electrical engineering and computer science (EECS), said, “Our goal with this project is to use some data-driven methods to speed up the discovery of new molecules, so you can train a model to predict without all of these cost-heavy experiments.”

The machine-learning system developed by the MIT team automatically learns the “language” of molecules, a molecular grammar, using only a small, domain-specific dataset, and then uses that grammar to construct viable molecules and predict their properties. The system learns the production rules of this molecular language through reinforcement learning, a trial-and-error process in which the model is rewarded for behavior that moves it closer to a goal.
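
As a rough illustration only, the toy sketch below (not the authors' code) shows how grammar production rules can expand an abstract symbol into a concrete, SMILES-like molecule string, and how a stand-in reward might score each sampled molecule in a trial-and-error loop. The rule set, the string representation, and the reward are all simplified assumptions; the actual system learns graph-based production rules.

```python
import random

# Toy "molecular grammar": each production rule rewrites a nonterminal symbol
# into a sequence of symbols. The rules, the SMILES-like strings, and the
# reward below are illustrative assumptions, not the authors' implementation.
RULES = {
    "MOL": [["FRAG"], ["FRAG", "FRAG"]],                # a molecule is 1-2 fragments
    "FRAG": [["C"], ["C(O)"], ["C(N)"], ["c1ccccc1"]],  # simple building blocks
}

def generate(symbol, rng):
    """Expand a symbol by recursively applying randomly chosen rules."""
    if symbol not in RULES:              # terminal symbol: emit it unchanged
        return symbol
    production = rng.choice(RULES[symbol])
    return "".join(generate(s, rng) for s in production)

def reward(molecule):
    """Stand-in reward: favor molecules containing oxygen. A real reward
    would reflect chemical validity and desirable predicted properties."""
    return 1.0 if "O" in molecule else 0.0

# Trial-and-error sampling in the spirit of reinforcement learning: draw
# molecules from the grammar and score them. A real learner would use these
# rewards to update the probabilities of the production rules.
rng = random.Random(0)
for _ in range(5):
    mol = generate("MOL", rng)
    print(f"{mol:16} reward={reward(mol)}")
```

In the actual framework, the learned rules also serve as a compact representation of the molecules themselves, which is what makes downstream property prediction data-efficient.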

Machine-learning models deliver the best results when trained on millions of molecules with properties similar to the ones researchers hope to discover, but in practice such domain-specific datasets are usually tiny. To cope, researchers typically pretrain models on massive datasets of general-purpose compounds before applying them to a much smaller, targeted dataset. Because these models pick up little domain-specific knowledge, however, they often perform poorly.
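
For contrast, here is a minimal sketch of that conventional pretrain-then-fine-tune workflow, using a generic scikit-learn network and random stand-in data in place of real molecular fingerprints and property labels; the model, dataset sizes, and features are all hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Stand-in data: random "fingerprints" and property labels. Sizes are
# hypothetical, chosen only to contrast a broad dataset with a tiny one.
rng = np.random.default_rng(0)
X_broad, y_broad = rng.normal(size=(50_000, 64)), rng.normal(size=50_000)  # broad compounds
X_domain, y_domain = rng.normal(size=(100, 64)), rng.normal(size=100)      # tiny domain set

# Costly pretraining on the broad dataset, then fine-tuning on the small,
# domain-specific one; warm_start=True keeps the pretrained weights.
model = MLPRegressor(hidden_layer_sizes=(128,), max_iter=20, warm_start=True)
model.fit(X_broad, y_broad)      # pretraining stage
model.fit(X_domain, y_domain)    # fine-tuning stage

print("R^2 on the domain-specific set:", model.score(X_domain, y_domain))
```

The grammar-based approach described in this article sidesteps the pretraining stage entirely.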

In tests, the researchers' new system simultaneously generated viable molecules and polymers and predicted their properties more accurately than several well-known machine-learning methods, even when the domain-specific datasets contained only a small number of samples. And unlike several of those other systems, the new approach requires no costly pretraining stage.

The method proved particularly good at predicting the physical properties of polymers, such as the glass transition temperature, the temperature at which a material shifts from a hard, glassy state to a softer, rubbery one. Obtaining this information by hand is often expensive because the experiments must be run at extremely high temperatures and pressures. To push their technique further, the researchers cut one training set by more than half, to just 94 samples; even so, their model produced results comparable to those of methods trained on the full dataset.

Guo said, “Once we have this grammar as a representation for all the different molecules, we can use it to boost the process of property prediction.”

He added, “This grammar-based representation is very powerful. And because the grammar itself is a very general representation, it can be deployed to different kinds of graph-form data. We are trying to identify other applications beyond chemistry or material science.” 

The researchers hope to extend their molecular grammar to incorporate the 3D shape of molecules and polymers, which is essential for understanding how polymer chains interact. They are also developing an interface that will show a user the learned grammar production rules and solicit feedback to correct any erroneous rules, increasing the system's accuracy.

The MIT-IBM Watson AI Lab and its member company, Evonik, funded this research.

Journal Reference:

  1. Grammar-Induced Geometry for Data-Efficient Molecular Property Prediction.
