Expanding Natural Enzyme Diversity Using Generative AI: Cost-Effective Approaches with Progen2 on AWS
Collaborators:
- Audra Devoto, Owen Janson, Christopher Brown (Metagenomi)
- Adam Perry (Tennex)
Overview of Generative AI in Enzyme Development
Implementing Progen2 on AWS Inferentia
Scaling Inference with AWS Batch
Cost Comparisons: Progen2 vs. Traditional Approaches
Scaling Generation to Millions of Proteins
Conclusion
About the Authors
Revolutionizing Enzyme Diversity with Generative AI: Insights from Metagenomi and Tennex
This post was written in collaboration with Audra Devoto, Owen Janson, and Christopher Brown of Metagenomi, along with Adam Perry of Tennex.
Introduction
In the search for new biotechnology solutions, the need for diverse and efficient enzymes has never been more acute. At Metagenomi, we believe that augmenting the extensive natural diversity of high-value enzymes through generative AI, specifically protein language models (pLMs), is key to unlocking new therapeutic potential. This approach lets us generate orders of magnitude more predicted examples within targeted enzyme classes, providing a pathway to discover variants with enhanced stability, specificity, and efficacy.
The Power of Generative AI
Generative AI lets researchers expand the natural enzyme diversity available for therapeutic applications. Starting from a comprehensive database of known enzymes, we use pLMs to generate large numbers of enzyme variants and then filter them through multi-model workflows that predict their characteristics. This not only streamlines enzyme engineering but also opens up avenues for developing potentially curative therapeutics using CRISPR gene editing enzymes from our proprietary database, MGXdb.
However, generating these enzymes at scale can be expensive, particularly as model complexity and the number of variants increase. To address this, we developed methods that significantly reduce cost while increasing throughput in enzyme generation.
Cost-Effective High-Throughput Workflows with Progen2 on AWS Inferentia
Introducing Progen2
To make high-throughput protein design cost-effective, we implemented the Progen2 autoregressive transformer model on AWS Inferentia, the accelerator that powers Amazon EC2 Inf2 instances. For our workloads, Inf2 instances were not only more cost-efficient but also more readily available as Spot Instances. Our first implementation, developed through trial and error, ran Progen2 on NVIDIA L40S GPUs and served as the foundation for this larger-scale effort.
Migrating to AWS Inferentia required a tracing and bucketing technique to get good performance from Progen2. Although this approach introduces changes that could affect model accuracy, it allowed us to significantly reduce the inference time and cost of our enzyme generation workflows.
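The bucketing idea can be illustrated with a minimal sketch. Tracing compiles a model for fixed input shapes, so a common way to serve variable-length prompts is to compile a small set of "bucket" lengths and pad each input up to the smallest bucket that fits. The bucket sizes, the padding token id, and the function names below are assumptions for illustration only; this post does not give the values used in the actual Progen2 port.

```python
from bisect import bisect_left

# Hypothetical bucket lengths (must be sorted ascending); not the values used in the real port.
BUCKETS = [64, 128, 256, 512, 1024]
PAD_ID = 0  # assumed padding token id


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket length that can hold `length` tokens."""
    i = bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(token_ids, buckets=BUCKETS, pad_id=PAD_ID):
    """Right-pad token_ids to its bucket length; return (padded_ids, attention_mask)."""
    bucket = pick_bucket(len(token_ids), buckets)
    n_pad = bucket - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask
```

In a setup like this, each bucket length would correspond to one traced model artifact, and the padded input would be dispatched to the artifact matching its bucket. Padding is one place where traced output can diverge from the native implementation, which is why an accuracy check is worthwhile.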
Testing Model Accuracy
To verify accuracy when running Progen2 on EC2 Inf2 instances, we compared output generated by the new implementation against the native implementation on NVIDIA GPUs. We generated 1,000 protein sequences for each of 10 prompts sourced from UniProtKB with each implementation, then assessed the perplexity and sequence integrity of the results.
The outcomes revealed that the tracing and bucketing implementation maintained similar sequence characteristics compared to the native approach, thereby assuring us of its reliability for further applications.
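As a rough illustration of the perplexity side of such a comparison, the sketch below computes per-sequence perplexity from token log-probabilities and summarizes the difference between two backends. The function names and the relative-difference summary are assumptions made for this example, not the exact evaluation used in our tests.

```python
import math
from statistics import mean


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability) over one sequence's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def compare_backends(native_logprobs, traced_logprobs):
    """Summarize mean perplexity for each backend and their relative difference.

    Each argument is a list of sequences, where a sequence is a list of
    per-token log-probabilities (a hypothetical input format).
    """
    native = mean(perplexity(lp) for lp in native_logprobs)
    traced = mean(perplexity(lp) for lp in traced_logprobs)
    return {
        "native": native,
        "traced": traced,
        "rel_diff": abs(traced - native) / native,
    }
```

A small relative difference across all prompts would be consistent with the traced model behaving like the native one, although it would not replace checks on sequence length and composition.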
Scaling Inference with AWS Batch
To expand our protein generation capacity, we turned to AWS Batch, which scales computational tasks efficiently. By running batch jobs on EC2 Inf2 Spot Instances, we achieved cost savings of up to 56% compared to our previous implementation.
The architecture supports the orchestration of numerous batch jobs that simultaneously handle diverse computational tasks, such as downloading models and processing generated sequences. This robust environment allows us to efficiently conduct protein generation, track outcomes using well-structured pipelines, and easily manage the infrastructure.
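One way to fan a large generation run out across many workers is an AWS Batch array job. The sketch below only builds the request for the Batch SubmitJob API; the job name, queue and job definition names, and the environment variable names are hypothetical, and this post does not state whether array jobs were used in the actual pipeline.

```python
import math


def build_array_job_request(total_sequences, sequences_per_job,
                            job_queue, job_definition, prompt_uri):
    """Build keyword arguments for AWS Batch SubmitJob as an array job."""
    n_children = math.ceil(total_sequences / sequences_per_job)
    # AWS Batch array jobs accept a size between 2 and 10,000.
    if not 2 <= n_children <= 10_000:
        raise ValueError("array size must be between 2 and 10,000")
    return {
        "jobName": "progen2-generate",  # hypothetical name
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": n_children},
        "containerOverrides": {
            "environment": [
                # Hypothetical variables read by the container entrypoint.
                {"name": "PROMPT_URI", "value": prompt_uri},
                {"name": "SEQUENCES_PER_JOB", "value": str(sequences_per_job)},
            ]
        },
    }
```

With credentials configured, the request could be submitted using `boto3.client("batch").submit_job(**request)`. Each child job receives its index in the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable, which can be used to derive a distinct random seed or output path.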
Cost Comparisons and Savings
Our primary goal is to make protein sequence generation economical while maximizing the diversity of enzyme classes we can cover. In our recent projects, generating 10,000 sequences with Progen2 on EC2 Inf2 Spot Instances cost up to 56% less than our previous implementation. Keeping cost low while maintaining high throughput is crucial for biotechnology start-ups that need to scale.
Further savings are possible by running jobs at half precision, which in our tests produced equivalent results in sequence generation.
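The arithmetic behind a comparison like this is simple enough to sketch. The two helpers below compute the cost of a run from throughput and hourly price, and the fractional savings between two configurations. All numbers in the usage note are placeholders, not measured prices or throughputs.

```python
def cost_per_run(n_sequences, sequences_per_hour, hourly_price_usd):
    """Cost in USD to generate n_sequences on one instance at a given throughput."""
    if sequences_per_hour <= 0:
        raise ValueError("sequences_per_hour must be positive")
    hours = n_sequences / sequences_per_hour
    return hours * hourly_price_usd


def savings_fraction(baseline_cost, new_cost):
    """Fractional savings of new_cost relative to baseline_cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 1.0 - new_cost / baseline_cost
```

For example, with placeholder inputs, `cost_per_run(10_000, 5_000, 1.0)` is 2.0 USD, and `savings_fraction(4.0, 2.0)` is 0.5. Because Spot prices vary over time and by Region, a comparison like this is best recomputed with current prices and measured throughput.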
Pushing the Boundaries: Generating Millions of Proteins
To test our optimized workflows, we fine-tuned models on enzymes sourced from Metagenomi’s database and ran large-scale generation trials. Using our AWS AI pipeline, we generated over 1 million enzyme sequences while varying parameters such as the sampling method and the generation temperature.
The latter phases involved validating generated sequences with hybrid techniques incorporating both AI and traditional approaches, ensuring that our outputs were both innovative and valid.
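To make the sampling parameters mentioned above concrete, here is a minimal, dependency-free sketch of temperature scaling followed by nucleus (top-p) sampling for a single next-token step. It is an illustration of the general technique, not the sampling code in Progen2 or in our pipeline, and the function name and defaults are assumptions.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index with temperature scaling and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # random.choices normalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Passing `rng=random.Random(seed)` makes a run reproducible. Lower temperatures and smaller `top_p` values tend to give more conservative sequences, while higher values increase diversity at the risk of less plausible proteins.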
Conclusion
In summary, we have outlined practical methods that reduce the cost of large-scale protein design projects by up to 56% using Amazon EC2 Inf2 instances. This has allowed Metagenomi to explore the frontier of enzyme diversity and generate millions of novel enzymes across high-value protein classes.
With AWS Inferentia, we aim to foster innovation in protein generation and make advanced biotechnology applications more accessible and economically viable. To learn more about EC2 Inf2 instances and implement your own workflows, check out the AWS Neuron documentation.
About the Authors
Audra Devoto: Data Scientist with expertise in metagenomics and large genomics datasets on AWS.
Owen Janson: Bioinformatics Engineer focused on cloud infrastructure for genomic analysis.
Adam Perry: Co-Founder of Tennex, specializing in AWS cloud architecture for biotech startups.
Christopher Brown, PhD: Head of Discovery at Metagenomi, an expert in enzyme systems for gene editing.
Jamal Arif: Senior Solutions Architect at AWS, focusing on AI and cloud-native architectures.
Pavel Novichkov, PhD: Senior Solutions Architect at AWS, specializing in genomics and life sciences.
Explore the future of biotechnology with us as we leverage generative AI to pioneer the next generation of enzyme diversity!