Accelerating Drug Discovery: Streamlining the Market to Molecule Process with Generative AI
Transforming Genomic Research through Enhanced Natural Language Processing
Optimizing Text-to-SQL for Genomics Data Access
Generative AI Approaches to Text-to-SQL: Advancements in Query Generation Strategies
Conclusion: Democratizing Access to Omics Data through AI Solutions
Transforming Biopharma with the Market to Molecule (M2M) Value Stream
The journey from lab to patient can take over 10 years and cost upwards of $2 billion for biopharma companies. With an alarming failure rate exceeding 90%, the process to deliver effective new drugs is resource-intensive and filled with complexities. Here, we explore how a streamlined approach to the Market to Molecule (M2M) value stream can expedite drug delivery and enhance patient outcomes.
Understanding the M2M Value Stream
The M2M value stream process encompasses everything from pre-clinical research to clinical trials and ultimately to market launch. Nine out of ten biopharma companies leverage AWS, recognizing the need for streamlined processes to mitigate risks and enhance efficiency.
Pharmaceutical companies are pivoting towards genetic validation, scrutinizing the human genome to correlate gene variants with specific diseases. Such breakthroughs can potentially enhance the success rate of drug development by targeting the root causes of illnesses.
The Role of Research in Drug Development
Research, particularly the Basic Research sub-process, is vital in linking gene variants to diseases and defining target molecules. This phase determines the trajectory of development and is crucial in reducing the time and cost associated with getting new drugs to patients.
Innovating with AI and Big Data
Our customer has embarked on a transformative journey to correlate genes with diseases using a dataset of over 2 million sequenced exomes. However, navigating this massive treasure trove of data using traditional online genome browsers can be tedious and inefficient.
The conventional approach involves a cumbersome search process with multiple layers, filters, and repeated queries. To modernize this workflow, we propose a transition from a standard UI experience to a conversational AI assistant that enhances user interactions in the clinical research environment.
The Promise of Generative AI
Generative AI represents a pivotal innovation in this transformation. Partnering with our customer, we’ve developed a custom AI solution that allows scientists to interact with genomic data through natural language queries. By enabling researchers to ask questions freely, we aim to streamline their exploration of gene variants and their potential correlations with diseases.
This approach serves to not only save time for researchers but also improves the likelihood of breakthrough discoveries in drug development.
A New Paradigm with Text-to-SQL
In this blog, we’ll delve into the text-to-SQL pipeline that employs generative AI models and Amazon Bedrock to convert natural language into SQL queries for a genomics database. We will walk through creating an AI assistant web interface using AWS Amplify, discuss prompt engineering strategies, and provide step-by-step instructions to deploy your service.
What is Text-to-SQL?
Text-to-SQL is an NLP task wherein natural language text is automatically converted into SQL queries. This process contrasts significantly with the flexibility and ambiguity of human language, requiring an understanding of structured database formats.
Before Large Language Models (LLMs) revolutionized this field, queries needed extensive preprocessing. Now, LLMs have shown remarkable improvements, generating valid SQL queries from natural language input, though they are not without their limitations.
Strategies for Enhanced Accuracy in Text-to-SQL
Achieving high accuracy in text-to-SQL involves two primary strategies:
- 
Prompt Engineering: This technique involves structuring prompts with annotations to guide the model, providing more control over SQL output. 
- Fine-Tuning: Pre-trained models can be adjusted with specific examples tailored to target tasks, improving performance but requiring extensive labeled data.
Our focus here is on prompt engineering due to its efficiency and simplicity of implementation, making it a favored choice for AWS customers.
Experimenting with Prompt Techniques
We’ve explored various approaches, including chain-of-thought and tree-of-thought techniques, to optimize the reasoning and SQL generation process. Here’s a snapshot of how these prompt strategies work:
- 
Chain-of-Thought: This method breaks complex questions into smaller reasoning steps, guiding the LLM to articulate its thought process leading to clear outputs. 
- Tree-of-Thought: Building on chain-of-thought, this technique generates a structured problem-solving approach, enhancing reasoning through branching paths for different sub-questions.
Solution Architecture Overview
The architecture for our solution has several components, outlined below:
- Natural language submission via a web application connected through AWS Amplify and AppSync.
- Amazon API Gateway relays the request to AWS Lambda, where text-to-SQL is implemented.
- The Lambda function processes the question and interacts with Amazon Bedrock to generate SQL.
- Once the SQL is formulated, queries run against Amazon Athena, fetching genomic data.
- The data retrieval updates the user session, with automatic error handling enabling a seamless experience.
Generative AI Techniques for Text-to-SQL
We analyzed various prompt-engineering strategies, including:
- LLM SQL agents
- RAG (Retrieval-Augmented Generation)
- Detailed prompt descriptions of relevant tables
- Chain-of-Thought and Tree-of-Thought techniques
- Dynamic Few-Shot prompting
While some approaches yielded subpar results, techniques utilizing structured prompts and iterative reasoning demonstrated high accuracy, significantly improving the generation of syntactically correct SQL.
Conclusion
This exploration demonstrated an innovative approach to democratizing access to omics data through a text-to-SQL solution. By leveraging HealthOmics and Amazon Bedrock, we can unlock genomic insights that were previously confined to data experts, thereby accelerating the drug discovery process.
The provided code and deployment instructions are accessible via our GitHub repository, guiding you through setting up your own text-to-SQL project.
We extend our gratitude to team members Thomaz Silva and Saeed Elnaj for their invaluable contributions to this initiative.
About the Authors
Ganesh Raam Ramadurai is a Senior Technical Program Manager at AWS, focusing on innovative AI solutions.
Jeff Harman is a Senior Prototyping Architect at AWS, known for his deep expertise in cloud technologies.
Kosal Sen is a Design Technologist at AWS, bridging technology with user-centric design.
Through our combined expertise, we aim to reshape biopharma’s future—ensuring that transformative drugs reach patients more efficiently than ever before.