Comparative Analysis of Chatbot Performance in Medical Queries Related to Tuberculosis
Table of Contents
- Table 1: Mean Scores of Responses to Medical Questions based on NLAT-AI Criteria
- Figure 1: Heatmap of Mean Scores across Chatbots
- Table 2: Chatbot Scores on Diagnostic Domain Indices
- Table 3: Chatbot Scores on Treatment Domain Indices
- Table 4: Chatbot Scores in Prevention and Control
- Table 5: Chatbot Scores in Disease Management
- Table 6: Evaluation of Chatbots Based on DISCERN-AI Criteria
Summary of Findings
This section summarizes the performance of ChatGPT, Copilot, and Gemini across various medical inquiry categories, offering insights into their strengths and areas for improvement.
Evaluating Chatbot Performance in Medical Inquiries: A Deep Dive
In the realm of digital health, artificial intelligence (AI) chatbots play an increasingly pivotal role in answering medical queries. Recent evaluations have focused on three prominent models, ChatGPT, Copilot, and Gemini, specifically in the context of tuberculosis management. The findings offer valuable insights into their performance across several medical categories, assessed against the NLAT-AI criteria.
Overview of the Findings
Performance Summary
Table 1 presents the mean scores for each chatbot across four main categories: Diagnostic, Treatment, Prevention & Control, and Disease Management. Figure 1 displays a heatmap that visually represents each model’s average performance.
- Diagnostic: All three chatbots achieved a score of 4.0, indicating uniformly solid diagnostic competency.
- Treatment: ChatGPT and Copilot scored 4.0, while Gemini lagged slightly behind at 3.8, pointing to potential weaknesses in some of Gemini’s treatment-related indices.
- Prevention & Control: Gemini outperformed the others with a score of 4.4, while Copilot scored the lowest at 3.6.
- Disease Management: ChatGPT and Gemini each earned a solid 4.0, contrasting with Copilot’s lower score of 3.6, suggesting it may benefit from enhancements in this area.
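For readers who want to see how a Table 1- or Figure 1-style summary can be produced from per-index ratings, the following Python sketch averages domain indices into category means and renders them as a heatmap. It is illustrative only: the numeric values, the choice of pandas/seaborn, and the grouping of index names are assumptions for demonstration, not the study’s actual data or analysis code.

```python
# Minimal sketch (not the study's analysis code): aggregate per-index scores
# into a category mean and plot a small heatmap, similar in spirit to Figure 1.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical per-index scores on a 1-5 scale for a single category.
# The index names are drawn from criteria mentioned in the text, but their
# assignment to one category here is an assumption; values are placeholders.
index_scores = pd.DataFrame(
    {
        "ChatGPT": [4, 4, 4, 4, 4],
        "Copilot": [4, 4, 4, 4, 4],
        "Gemini":  [4, 4, 3, 4, 4],
    },
    index=["Accuracy", "Appropriateness", "Effectiveness", "Safety", "Actionability"],
)

# Category mean = average of that category's index scores for each chatbot.
category_means = index_scores.mean(axis=0).to_frame(name="Example category").T
print(category_means.round(1))

# Heatmap of category means across chatbots (one row per category).
sns.heatmap(category_means, annot=True, vmin=1, vmax=5, cmap="viridis")
plt.title("Mean scores across chatbots (illustrative)")
plt.tight_layout()
plt.show()
```

In a full reproduction, one such row of category means would be computed for each of the four domains and stacked into a single matrix before plotting.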
In-Depth Analysis of Chatbot Categories
Diagnostics Performance (Table 2)
Table 2 highlights diagnostic capabilities concerning tuberculosis. The chatbots generally scored well, with most indices at 4. However, Gemini’s scores of 3 in Appropriateness and Effectiveness indicate room for improvement, especially when addressing specific diagnostic queries.
Treatment Evaluation (Table 3)
When assessing treatment-related inquiries, ChatGPT achieved a score of 5 in Accuracy, outperforming both Copilot and Gemini. Copilot, however, surpassed Gemini on the Appropriateness index, suggesting that while ChatGPT leads in factual precision, Copilot’s guidance may be better matched to the clinical question than Gemini’s.
Prevention & Control Metrics (Table 4)
Gemini’s superior performance in the Prevention & Control domain is underscored by perfect scores (5 out of 5) in Safety and Actionability metrics. These findings suggest Gemini’s robustness in preventative health measures and actionable advice—a vital aspect in controlling outbreaks.
Disease Management Insights (Table 5)
Table 5 illustrates a comparable performance among all three chatbots in Disease Management. While scores were generally high, Copilot’s lower scores in Accuracy and Effectiveness indicate it might need further fine-tuning in providing comprehensive patient care strategies.
Evaluating Tuberculosis Responses with DISCERN-AI (Table 6)
Table 6 provides a broader look at how well each chatbot handles tuberculosis inquiries, assessed against the DISCERN-AI criteria:
- Information Relevance: ChatGPT took the lead, providing more relevant responses than its counterparts.
- Citing Sources: Copilot and Gemini included partial citations, outperforming ChatGPT, which did not reference any sources; this highlights transparency differences among the models.
- Date of Information Production: None of the chatbots provided production dates, an oversight that hinders assessment of information reliability and timeliness.
- Balance and Impartiality: All three chatbots performed similarly, maintaining neutrality in their responses.
- Additional Sources and Uncertainty Indication: All models struggled to provide additional resources or to acknowledge uncertainty in their responses, indicating a shared opportunity for improvement.
Conclusion
The comparative evaluation of ChatGPT, Copilot, and Gemini underscores the potential and limitations of AI chatbots in the medical field. While all three demonstrated strong diagnostic capabilities, substantial variations across treatment, prevention, and disease management highlight the need for ongoing improvements.
Future developments should focus on:
- Enhancing Domain-Specific Knowledge: Particularly for Gemini in treatment-related responses and for Copilot in prevention and disease management.
- Improving Source Transparency: Including comprehensive citations and references.
- Incorporating Timeliness: Assigning production dates to enhance reliability.
As AI chatbots continue to evolve in the medical landscape, these findings not only guide improvements in chatbot performance but also help ensure that they achieve their ultimate goal: providing accurate, relevant, and timely medical information to users.