Predicting molecular properties quickly and accurately is important for advancing scientific discovery and for its growing applications in areas such as pharmaceuticals and materials science. Because the experiments and simulations needed to explore candidate molecules are time-consuming and costly, scientists have investigated machine learning (ML) methods to assist research in computational chemistry.
However, most ML models can learn only from known, labelled data, which makes it difficult to accurately predict the properties of novel compounds.
In industries such as drug discovery, there are millions of molecules to screen as potential drug candidates. A prediction error of as little as 1% can result in the misidentification of more than ten thousand molecules. Improving the accuracy of ML models trained on limited labelled data could therefore play a key role in developing new treatments for disease.
While the availability of labelled data is limited, the volume of available yet unlabelled data is growing rapidly. A team of researchers at Carnegie Mellon University's College of Engineering examined whether ML models built from large volumes of unlabelled molecules could outperform other models at property prediction.
The researchers' work culminated in a self-supervised learning framework they call Molecular Contrastive Learning of Representations with Graph Neural Networks (MolCLR). The findings of the study are published in Nature Machine Intelligence.
The framework significantly boosts ML performance by leveraging data from nearly 10 million unlabelled molecules.
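To give a sense of how contrastive learning works in general, the idea is to create two augmented "views" of each unlabelled example and train a model so that embeddings of the two views of the same example are similar, while embeddings of different examples are pushed apart. The sketch below is a minimal, generic NT-Xent contrastive loss in numpy; it is an illustration of the broad technique, not the authors' MolCLR implementation, and all function and variable names are hypothetical.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """Generic NT-Xent contrastive loss (illustrative sketch).

    z1, z2: (n, d) arrays of embeddings for two augmented views of the
    same n unlabelled items; row i of z1 and row i of z2 are a positive pair.
    Returns the mean loss over all 2n views.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2n, d) stacked views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize rows
    sim = z @ z.T / temperature                        # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = z1.shape[0]
    # Each view's positive partner: row i pairs with row i+n, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Cross-entropy: positive similarity vs. all other (negative) pairs.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Trained with a loss like this, a model learns useful representations from unlabelled molecules alone; a small labelled set can then fine-tune it for a specific property-prediction task.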
To explain the difference between labelled and unlabelled data simply, one of the research scientists suggested imagining two sets of images of cats and dogs: in one set, each animal is labelled with the name of its species; in the other, the images carry no labels at all.