NVIDIA BioNeMo is a framework for training and deploying large biomolecular language models at a supercomputing scale — helping scientists better understand the disease and find therapies for patients. The large language model (LLM) framework will support chemistry, protein, DNA, and RNA data formats. It’s part of the NVIDIA Clara Discovery collection of frameworks, applications, and AI models for drug discovery.
Just as AI is learning to understand human languages with LLMs, it’s also learning the languages of biology and chemistry. By making it easier to train massive neural networks on biomolecular data, NVIDIA BioNeMo helps researchers discover new patterns and insights in biological sequences — insights that researchers can connect to biological properties or functions, and even human health conditions.
NVIDIA BioNeMo provides a framework for scientists to train large-scale language models using bigger datasets, resulting in better-performing neural networks. The framework will be available in early access on NVIDIA NGC, a hub for GPU-optimized software.
In addition to the language model framework, NVIDIA BioNeMo has a cloud API service that will support a growing list of pre-trained AI models.
BioNeMo Framework Supports Bigger Models, Better Predictions
Scientists using natural language processing models for biological data today often train relatively small neural networks that require custom preprocessing. By adopting BioNeMo, they can scale up to LLMs with billions of parameters that capture information about molecular structure, protein solubility, and more.
BioNeMo is an extension of the NVIDIA NeMo Megatron framework for GPU-accelerated training of large-scale, self-supervised language models. It’s domain-specific, designed to support molecular data represented in the SMILES notation for chemical structures, and in FASTA sequence strings for amino acids and nucleic acids.
“The framework allows researchers across the healthcare and life sciences industry to take advantage of their rapidly growing biological and chemical datasets,” said Mohammed AlQuraishi, founding member of the OpenFold Consortium and assistant professor at Columbia University’s Department of Systems Biology. “This makes it easier to discover and design therapeutics that precisely target the molecular signature of a disease.”
BioNeMo Service Features LLMs for Chemistry and Biology
For developers looking to quickly get started with LLMs for digital biology and chemistry applications, the NVIDIA BioNeMo LLM service will include four pre-trained language models. These are optimized for inference and will be available under early access through a cloud API running on NVIDIA DGX Foundry.
- ESM-1: This protein LLM, originally published by Meta AI Labs, processes amino acid sequences to generate representations that can be used to predict a wide variety of protein properties and functions. It also improves scientists’ ability to understand protein structure.
- OpenFold: The public-private consortium creating state-of-the-art protein modeling tools will make its open-source AI pipeline accessible through the BioNeMo service.
- MegaMolBART: Trained on 1.4 billion molecules, this generative chemistry model can be used for reaction prediction, molecular optimization, and de novo molecular generation.
- ProtT5: The model, developed in a collaboration led by the Technical University of Munich’s RostLab and including NVIDIA, extends the capabilities of protein LLMs like ESM-1b to sequence generation.
In the future, researchers using the BioNeMo LLM service will be able to customize the LLM models for higher accuracy on their applications in a few hours — with fine-tuning and new techniques such as p-tuning, a training method that requires a dataset with just a few hundred examples instead of millions.