Evaluation of LLM Performance in Cancer Diagnosis and Prognosis
Overview of Dataset and Methodology
To evaluate the performance of LLMs in cancer type identification, AJCC stage determination, and prognosis assessment, we begin by curating a custom dataset, the specifics of which are outlined in Section 4. The dataset is structured into three question-and-answer tasks, each addressing a specific aspect: cancer type identification, stage classification, and prognosis assessment. In each task, the question combines a query related to the respective topic (cancer type, stage, or prognosis) with the corresponding pathology report, and the answer represents the desired outcome. Given the verbosity of chat-based LLMs, we constrain their outputs to strictly adhere to a JSON object format, structured as key-value pairs. This approach facilitates answer extraction using regular expressions for evaluation purposes while ensuring the outputs are well-suited for subsequent data mining and information extraction tasks.
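As a minimal sketch of this pattern (the instruction wording and the key name `cancer_type` are illustrative assumptions, not the study's exact prompt or schema), a JSON-constrained answer can be requested and then recovered with a regular expression:

```python
import json
import re

# Hypothetical formatting instruction appended to each query; the study's
# actual wording and key names may differ.
FORMAT_INSTRUCTION = (
    'Answer strictly as a JSON object of the form '
    '{"cancer_type": "<answer>"} with no additional text.'
)

def extract_answer(raw_output: str, key: str = "cancer_type") -> str | None:
    """Pull the first JSON object out of a verbose chat response."""
    match = re.search(r"\{.*?\}", raw_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0)).get(key)
    except json.JSONDecodeError:
        return None

print(extract_answer('Sure! Here it is: {"cancer_type": "lung adenocarcinoma"}'))
# -> lung adenocarcinoma
```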
Model Performance and Evaluation Strategy
For each task, we evaluate the performance of six models: two OpenAI models (GPT-4o and GPT-4o-mini), two Mistral models (Mistral-Medium and Mistral-Large), and two Llama 3 models (the 8B and 70B variants). Each experiment is conducted five times per model to ensure robust estimates and to compute reliable confidence intervals. We perform instruction tuning by generating synthetic question-answer pairs, which, when combined with pathology reports, form a diverse and comprehensive training dataset. The Path-GPT-4o-mini-FT model is fine-tuned using OpenAI’s proprietary platform, while Path-llama3.1-8B undergoes fine-tuning via Low-Rank Adaptation (LoRA), specifically optimizing key attention components. More details on data curation and model training are provided in Section 4.
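As a sketch of how such a LoRA setup might look with the Hugging Face `peft` library (the rank, scaling factor, and target modules below are assumptions for illustration; the exact settings are described in Section 4):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed hyperparameters; the study only states that key attention
# components were optimized.
lora_config = LoraConfig(
    r=16,                  # low-rank dimension of the adapter matrices
    lora_alpha=32,         # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because only the low-rank adapters are updated, the base model's weights stay frozen, which keeps memory and compute requirements far below those of full fine-tuning.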
Evaluating LLMs in Cancer Diagnosis: Performance Insights
In recent years, the integration of Large Language Models (LLMs) in medical contexts has garnered significant attention, particularly for tasks such as cancer type identification, AJCC stage determination, and prognosis assessment. This article presents a detailed evaluation of various LLMs on a custom dataset designed to measure their performance in these critical areas.
Dataset Overview
Our study begins with the curation of a specialized dataset comprising pathology reports structured into three distinct tasks. Each task pairs a query about cancer type, stage, or prognosis with the respective pathology report, which serves as the context for the model’s response. To make the outputs machine-readable and directly comparable, we constrain them to a precise JSON format, which allows for seamless information extraction and evaluation.
Evaluated Models
The performance of six models was assessed:
- OpenAI Models: GPT-4o and GPT-4o-mini
- Mistral Models: Mistral-Medium and Mistral-Large
- Llama 3 Models: Llama3-8B and Llama3-70B
Each model underwent five experimental runs, enabling robust comparisons and the calculation of confidence intervals. We applied instruction tuning through synthetic question-answer pairs, enhancing the dataset’s diversity for training purposes.
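For instance, a 95% confidence interval over five runs can be computed with a t-distribution (the accuracies below are placeholders, not results from the study):

```python
import statistics
from scipy import stats

# Placeholder accuracies from five runs of one model on one task.
accuracies = [0.96, 0.97, 0.95, 0.96, 0.98]

mean = statistics.mean(accuracies)
sem = statistics.stdev(accuracies) / len(accuracies) ** 0.5
# Two-sided 95% interval with n - 1 = 4 degrees of freedom.
margin = stats.t.ppf(0.975, df=len(accuracies) - 1) * sem
print(f"accuracy = {mean:.3f} +/- {margin:.3f}")
```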
Task 1: Cancer Type Identification
Methodology & Results
Using a test set of 952 pathology reports across 32 cancer types, all models were evaluated under standardized conditions. Most models displayed impressive performance, with mean accuracies above 96%; the two outliers were Llama3-8B (64% accuracy) and Mistral-Medium (90%).
The top performer, the instruction-tuned Path-GPT-4o-mini-FT, achieved an outstanding accuracy of 99%, showcasing the benefits of targeted instruction tuning.
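A minimal evaluation loop under these standardized conditions might look like the following (the prompt wording is an assumption, and the call uses the standard OpenAI chat completions API):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_cancer_type(report: str) -> str:
    """Query one model for the cancer type in a constrained JSON format."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic decoding for reproducible evaluation
        messages=[{
            "role": "user",
            "content": (
                "Identify the cancer type in the pathology report below. "
                'Respond only as {"cancer_type": "<answer>"}.\n\n' + report
            ),
        }],
    )
    return response.choices[0].message.content
```

Per-run accuracy is then simply the fraction of reports whose extracted answer matches the gold label.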
Insights & Challenges
Cancer type identification proved to be the most straightforward of the tasks. The main challenge arose from Llama3-8B’s smaller parameter count, which hindered its instruction-following capabilities. In contrast, the instruction-tuned version of Llama3.1 showed significant improvement, emphasizing the value of fine-tuning.
Analysis of errors revealed common pitfalls across models, such as misclassifying anatomically similar cancers, which highlights the need for better differentiation strategies in future datasets.
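One simple way to surface such pitfalls is to count the most frequent (true, predicted) confusion pairs; the labels below are illustrative, not taken from the study's error tables:

```python
from collections import Counter

# Illustrative (true, predicted) label pairs; in the study these would
# come from the 952-report test set.
pairs = [
    ("colon adenocarcinoma", "rectal adenocarcinoma"),
    ("colon adenocarcinoma", "rectal adenocarcinoma"),
    ("stomach adenocarcinoma", "esophageal carcinoma"),
]

confusions = Counter((t, p) for t, p in pairs if t != p)
for (true_label, pred_label), count in confusions.most_common(5):
    print(f"{true_label} -> {pred_label}: {count}")
```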
Task 2: AJCC Stage Identification
Methodology & Findings
The same set of models was tested on 594 pathology reports to determine AJCC stages (I–IV). For this task, each query incorporated a self-generated chain-of-thought prompt to improve reasoning. Performance, however, was comparatively lower than in cancer type identification.
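A self-generated chain-of-thought query can be as simple as asking the model to reason before committing to a stage (the wording below is a hypothetical sketch, not the study's exact prompt):

```python
def build_staging_query(report: str) -> str:
    """Prepend a self-generated chain-of-thought instruction to the report."""
    return (
        "Read the pathology report below. First, reason step by step about "
        "tumor size, lymph node involvement, and metastasis. Then give your "
        'final answer as a JSON object: {"stage": "<I, II, III, or IV>"}.'
        "\n\nReport:\n" + report
    )
```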
The standard models achieved a mean accuracy around 76%, while the instruction-tuned models fared better: Path-GPT-4o-mini-FT scored 87%, demonstrating the value of specialized training even without an explicit reasoning framework.
Comparative Analysis
During AJCC stage determination, models tended to default to either stage I or IV, likely reflecting the class distribution of the data they were trained on. The discrepancy between cancer type identification and AJCC stage accuracy reflects the added complexity of deductive reasoning required for staging tasks.
Task 3: Prognosis Assessment
Overview & Performance Metrics
Prognosis assessment represented a binary classification challenge: predicting, from the pathology report, whether a patient would survive beyond the average survival time for their cancer type. Spanning 593 cases across various cancer types, this task proved the most difficult, as it required not only deductive reasoning but also prediction of future outcomes.
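Labels for such a task can be derived by comparing each patient's survival to the average survival of their cancer type (a sketch under assumed record fields; the numbers are placeholders):

```python
from statistics import mean

# Hypothetical records: (cancer_type, survival_months).
cases = [
    ("lung adenocarcinoma", 30.0),
    ("lung adenocarcinoma", 10.0),
    ("glioblastoma", 8.0),
]

# Average survival per cancer type.
by_type: dict[str, list[float]] = {}
for cancer_type, months in cases:
    by_type.setdefault(cancer_type, []).append(months)
mean_survival = {t: mean(v) for t, v in by_type.items()}

# Binary label: did the patient outlive the type-specific average?
labels = [months > mean_survival[t] for t, months in cases]
print(labels)  # [True, False, False]
```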
The instruction-tuned Path-GPT-4o-mini-FT excelled here, attaining the highest accuracy and F1 score, illustrating the effectiveness of instruction tuning for this complex task.
Insights
Despite progress, prognosis remains a challenging area for LLMs, underscoring their limitations in the statistical reasoning and nuanced interpretation required for accurate outcome predictions. Incorporating disease-specific survival times into the task context significantly improved performance.
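Concretely, this can mean stating the reference survival time directly in the query, as in the hypothetical sketch below (the prompt wording and JSON key are assumptions, and the survival figure would come from disease-specific statistics):

```python
def build_prognosis_query(report: str, cancer_type: str, mean_months: float) -> str:
    """Embed the disease-specific reference survival time in the prompt."""
    return (
        f"The average survival time for {cancer_type} is {mean_months:.0f} months. "
        "Based on the pathology report below, will this patient survive longer "
        'than that? Answer as {"survives_beyond_mean": true | false}.'
        "\n\nReport:\n" + report
    )
```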
External Validation
To further evaluate the clinical applicability of our instruction-tuned models, we tested them on a set of 60 pathology reports from Weill Cornell Medicine, achieving 89% accuracy in cancer type identification and 70% in AJCC stage classification. This validation indicates strong generalization to real-world data and lays the groundwork for future improvements tailored to specific clinical environments.
Conclusion
Evaluating LLMs for cancer diagnosis reveals promising advances while highlighting inherent challenges. Instruction tuning emerges as a vital strategy for optimizing model performance, particularly in nuanced medical applications. As research progresses, addressing the complexities of prognosis assessment and AJCC staging will pave the way for deploying LLMs in clinical settings, ultimately enhancing patient outcomes through more accurate and insightful analyses.
Integrating LLMs into cancer diagnostics holds transformative potential, provided we continue refining methodologies and models for precision and reliability in clinical contexts. The collaboration of AI with medical expertise stands to redefine cancer care, a prospect that is not only possible but increasingly essential for the future of healthcare.