Tackling the Challenges of Precision Medicine: How Sonrai and AWS Leverage MLOps for Early Disease Detection
In the fast-evolving landscape of precision medicine, the quest to develop effective diagnostic tests faces a critical challenge: the notorious curse of dimensionality. Researchers often find themselves with datasets brimming with thousands of potential biomarkers but limited to just a few hundred patient samples. This imbalance can dictate the success or failure of groundbreaking discoveries, particularly in the realm of early disease detection.
The Complexity of Handling Omic Data
Modern bioinformatics has embraced a multi-omic approach, combining genomics, lipidomics, proteomics, and metabolomics to create early detection tests. However, researchers frequently confront datasets where the number of features vastly exceeds the number of samples. As additional modalities are introduced, the number of possible feature combinations grows combinatorially, complicating experiment tracking. Maintaining source control and high code quality is equally vital: machine learning operations (MLOps) processes must be established to ensure the integrity of the workflow, yet they are often neglected in the initial stages of discovery.
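To see why experiment tracking becomes impractical by hand, consider how quickly the search space grows. The short sketch below counts the candidate experiments for a hypothetical setup of four modalities and three model families (both lists are illustrative, not the actual study design):

```python
from itertools import combinations

# Hypothetical modality and model names, for illustration only.
modalities = ["genomics", "lipidomics", "proteomics", "metabolomics"]
model_families = ["logistic_regression", "random_forest", "gradient_boosting"]

# Every non-empty subset of modalities is a candidate input combination.
subsets = [
    combo
    for r in range(1, len(modalities) + 1)
    for combo in combinations(modalities, r)
]

# Pairing each subset with each model family multiplies the experiment count,
# before any hyperparameter search is even considered.
experiments = len(subsets) * len(model_families)

print(len(subsets))   # 15 non-empty subsets of 4 modalities
print(experiments)    # 45 experiments before hyperparameter tuning
```

With hyperparameter grids layered on top, the count easily reaches hundreds of runs, which is exactly the regime where automated tracking pays off.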
In this post, we delve into how Sonrai, a pioneering life sciences AI company, joined forces with AWS to establish a robust MLOps framework using Amazon SageMaker AI, effectively addressing the challenges of precision medicine while ensuring the traceability and reproducibility needed in regulated environments.
Overview of MLOps
MLOps unites machine learning, DevOps, and data engineering practices to deploy and maintain ML systems in production reliably and efficiently. Implementing MLOps from the beginning accelerates experimentation cycles and guarantees confident and traceable model deployment—an essential requirement for healthcare technology firms bound by rigorous governance and validation standards.
Sonrai’s Data Challenge
Sonrai partnered with a biotechnology company focused on developing biomarker tests for an underserved cancer type. The project utilized a rich dataset across multiple omic modalities—proteomics, metabolomics, and lipidomics—to identify optimal combinations of features for an early detection biomarker with high sensitivity and specificity.
The customer faced several critical challenges:
- An overwhelming 8,916 potential biomarkers were available from only a few hundred patient samples, creating an extreme feature-to-sample ratio that demanded sophisticated feature selection techniques to avoid overfitting.
- The need to explore hundreds of combinations of modalities and modeling approaches rendered manual experiment tracking impractical.
- Complete traceability from raw data through every modeling decision to final deployment was crucial for regulatory submissions.
Solution Overview
To navigate these MLOps challenges, Sonrai designed a comprehensive solution using Amazon SageMaker AI, a fully managed service that streamlines the building, training, and deployment of ML models at scale. This solution enhances data management security, provides flexible development environments, supports robust experiment tracking, and facilitates streamlined model deployment with complete traceability.
MLOps Workflow
The end-to-end MLOps workflow follows these steps:
- Customers upload sample data to a secure data repository in Amazon S3.
- ML engineers work in JupyterLab and Code Editor within Amazon SageMaker Studio, directly linked to a secure source control system.
- Data processing is handled through pipelines that read from the data repository and write results back to Amazon S3.
- Experimentation results are logged in MLflow within Amazon SageMaker Studio.
- Generated reports are stored in Amazon S3 for stakeholder access.
- Validated models are promoted to the SageMaker Model Registry.
- Final models are deployed for inference or further validation.
This architecture ensures complete traceability; each registered model can be traced back through every modeling decision to the original data and code versions.
Secure Data Management with Amazon S3
The backbone of Sonrai’s solution is its secure data management system using Amazon S3. Sonrai configured S3 buckets with tiered access controls to safeguard sensitive patient data, separating sample clinical data from processed data and model outputs. This security measure allows for flexible analysis sharing while ensuring raw data remains protected—a necessity in the regulated life sciences industry.
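One way to express this kind of tiered separation is a prefix-scoped bucket policy. The sketch below is a minimal illustration of the pattern, not Sonrai's actual configuration: the bucket name, account ID, and role names are all hypothetical. Raw clinical data is writable and readable only by a restricted ingest role, while analysts can read only processed outputs:

```python
import json

# Hypothetical ARNs; substitute your own bucket, account, and roles.
RAW_PREFIX = "arn:aws:s3:::example-biomarker-data/raw/*"
PROCESSED_PREFIX = "arn:aws:s3:::example-biomarker-data/processed/*"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RawDataIngestOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/IngestRole"},
            # Only the ingest role may touch raw clinical data.
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": RAW_PREFIX,
        },
        {
            "Sid": "AnalystsReadProcessed",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/AnalystRole"},
            # Analysts get read-only access to derived artifacts.
            "Action": ["s3:GetObject"],
            "Resource": PROCESSED_PREFIX,
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Keeping raw and processed data under separate prefixes (or separate buckets) makes these access boundaries easy to audit, which matters in regulated reviews.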
Utilizing SageMaker AI
From the project’s outset, Sonrai used both JupyterLab and Code Editor within the SageMaker AI environment. This setup, integrated with the client’s Git repository, established solid version control and review workflows, enhancing collaboration. The range of ML-optimized compute instances available through SageMaker AI simplified resource provisioning for the extensive modeling runs needed to handle large omic datasets efficiently.
Third-party tools, such as Quarto, were employed within the SageMaker compute environments for generating stakeholder-ready reports, encapsulating results in an interactive format that facilitated timely discussions.
Comprehensive Experiment Tracking with MLflow
The managed MLflow capability within SageMaker AI provides seamless experiment tracking. Every experiment is documented automatically, creating a single source of truth for the modeling process. This visibility into performance metrics, hyperparameters, and custom artifacts, such as ROC curves and confusion matrices, equips Sonrai’s team with insights to refine their modeling strategies continually.
Robust MLOps Pipelines
Sonrai’s modeling pipelines are built as reproducible, version-controlled workflows, processing raw data through several stages to derive final models. Each execution is meticulously logged in MLflow, documenting code commits, input data versions, hyperparameters, and performance metrics, creating an auditable trail essential for regulatory reviews.
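The essence of such an auditable trail is that every run is pinned to the exact code, data, and parameters that produced it. The sketch below is a minimal stand-in for the lineage record an MLflow run captures (the commit SHA, data version, and metric values are hypothetical):

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict

@dataclass
class RunRecord:
    """Minimal stand-in for the lineage logged per pipeline run."""
    code_commit: str          # git SHA of the pipeline code
    data_version: str         # version ID of the input dataset in S3
    hyperparameters: dict
    metrics: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Stable hash of everything that determines the run, so two runs
        # with identical inputs can be detected and deduplicated. Metrics
        # are outputs, so they are deliberately excluded.
        payload = json.dumps(
            {"commit": self.code_commit, "data": self.data_version,
             "params": self.hyperparameters},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Hypothetical values, for illustration.
run = RunRecord(
    code_commit="a1b2c3d",
    data_version="s3-version-0042",
    hyperparameters={"n_features": 40, "model": "logistic_regression"},
)
run.metrics = {"sensitivity": 0.94, "specificity": 0.89}
print(run.fingerprint(), asdict(run))
```

Because the fingerprint depends only on inputs, an auditor can confirm that a registered model's metrics came from a specific code commit and data version, which is the property regulatory reviews care about.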
A crucial stage of the pipeline is Recursive Feature Elimination (RFE), which iteratively identifies the most significant features while tracking model performance at each step, validating feature selection decisions and supporting regulatory scrutiny.
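The elimination loop itself is simple to sketch. In real RFE the full model is refit every round and its coefficients serve as feature importances; in the illustrative version below, a univariate correlation score stands in for that importance so the example stays self-contained (all data is synthetic):

```python
import random

def feature_score(xs, ys):
    """Absolute Pearson correlation between one feature column and labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (vx * vy)) if vx and vy else 0.0

def recursive_feature_elimination(X, y, keep):
    """Drop the weakest feature each round until `keep` features remain.

    Real RFE refits the model every round and ranks by its coefficients;
    here a univariate score stands in for that importance.
    """
    remaining = list(range(len(X[0])))
    while len(remaining) > keep:
        scores = {j: feature_score([row[j] for row in X], y)
                  for j in remaining}
        weakest = min(remaining, key=lambda j: scores[j])
        remaining.remove(weakest)
    return remaining

# Synthetic data: feature 0 tracks the label, features 1-4 are pure noise.
random.seed(0)
y = [i % 2 for i in range(50)]
X = [[yi + random.gauss(0, 0.1)] + [random.gauss(0, 1) for _ in range(4)]
     for yi in y]
print(recursive_feature_elimination(X, y, keep=1))  # [0]
```

Logging the surviving feature set and its score after each round is what turns this loop into the kind of documented selection decision a reviewer can retrace.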
Effective Model Deployment
Sonrai employs a dual strategy, using MLflow and the SageMaker Model Registry to manage model artifacts throughout their lifecycle. Candidate models are evaluated against predefined clinical performance thresholds before promotion, ensuring that only models meeting stringent criteria are approved for deployment.
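A promotion gate of this kind reduces to a simple predicate: a candidate is registered only if every tracked metric clears its bar. The thresholds below are illustrative, not the clinical criteria used in the actual study:

```python
# Hypothetical acceptance thresholds for promotion to the registry.
THRESHOLDS = {"sensitivity": 0.90, "specificity": 0.85, "auc_roc": 0.90}

def approve_for_registry(metrics: dict) -> bool:
    """Promote only if every tracked metric meets or exceeds its threshold."""
    return all(metrics.get(name, 0.0) >= bar for name, bar in THRESHOLDS.items())

# A candidate matching the reported operating point passes the gate...
print(approve_for_registry(
    {"sensitivity": 0.94, "specificity": 0.89, "auc_roc": 0.93}))  # True

# ...while a model trading specificity for sensitivity is rejected.
print(approve_for_registry(
    {"sensitivity": 0.99, "specificity": 0.70, "auc_roc": 0.95}))  # False
```

Encoding the gate in code, rather than in a review meeting, means the approval decision itself is versioned and reproducible alongside the models it governs.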
Results and Model Performance
ML-optimized compute instances on SageMaker AI enabled rapid model iteration, executing the entire pipeline from raw data to final models in under 10 minutes. This efficiency allowed for daily updates and immediate validation of hypotheses during customer discussions.
The modeling efforts produced 15 individual models, with the top-performing model combining proteomic and metabolomic features, achieving 94% sensitivity, 89% specificity, and an AUC-ROC of 0.93. The winning model, now registered in the SageMaker Model Registry, underwent further validation by the client’s clinical team.
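For readers less familiar with these diagnostic metrics, both follow directly from the confusion matrix: sensitivity is the fraction of true positives among all diseased samples, and specificity is the fraction of true negatives among all healthy ones. The counts below are hypothetical, chosen only to reproduce the reported operating point:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative confusion-matrix counts (not the study's actual cohort).
sens, spec = sensitivity_specificity(tp=94, fn=6, tn=89, fp=11)
print(round(sens, 2), round(spec, 2))  # 0.94 0.89
```

High sensitivity matters most for an early detection test, since a false negative means a missed cancer, while specificity controls how many healthy patients are sent for unnecessary follow-up.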
Conclusion
Sonrai’s collaboration with AWS has led to the development of an MLOps solution that accelerates precision medicine trials using SageMaker AI. This framework addresses the complexities of biomarker discovery, effectively managing vast feature sets while ensuring regulatory compliance through stringent traceability and reproducibility.
Key results include:
- Effectively modeling and tracking 8,916 biomarkers.
- Conducting hundreds of experiments with complete lineage.
- Achieving a 50% reduction in time spent curating data for biomarker reports.
Building upon this foundation, Sonrai aims to enhance its MLOps capabilities, automating retraining pipelines and extending architecture to support federated learning across multiple clinical sites.
With these advancements, organizations can harness Amazon SageMaker to develop their own MLOps pipelines, accelerating the journey toward impactful healthcare solutions.
About the Authors
Matthew Lee is the Director of AI & Medical Imaging at Sonrai, with a wealth of experience in developing impactful AI solutions across various stages from proof of concept to deployment.
Jonah Craig is a Startup Solutions Architect at AWS, focusing on AI/ML solutions and actively engaging with startups to help them realize their technological visions.
Siamak Nariman is a Senior Product Manager at AWS, dedicated to enhancing AI/ML technology and governance to boost organizational efficiency and productivity.
If you’re interested in embarking on your own MLOps journey, we invite you to explore our introductory Amazon SageMaker MLOps workshop to get started.