Dataset and Preprocessing
This section describes the dataset used in our research: the ethical considerations in data collection, the content of the dataset, and the preprocessing steps taken to ensure data quality and relevance for the analysis.
1. Research Context and Ethical Approval
Our study was conducted at the Database Laboratory of Universidade Federal de Minas Gerais. The research project received approval from the Ethics Committee of the Federal University of Minas Gerais (Registration CAAE 80632524.4.0000.5149) in accordance with Brazilian Resolution CNS 466/12.
2. Dataset Composition
We collected WhatsApp messages from dialogues between patients and the healthcare team from October 27, 2021, to January 8, 2024, provided by Ana Health’s cloud-based contact center. The dataset comprises 207,040 messages written in Portuguese and is available upon request from the authors.
2.1 Company Overview
Ana Health is a digital primary care service facilitating access to a multidisciplinary healthcare team, offering comprehensive support to patients through various communication channels.
3. Data Anonymization
All identifiable information such as names, emails, and phone numbers was removed prior to analysis, and messages were anonymized to maintain data privacy.
4. Preprocessing Steps
Preprocessing involved the removal of special characters, filtering out template messages, and cleaning the dataset, resulting in 202,326 messages and 1,863 dialogues.
5. Message Quality Analysis
To ensure the quality of the messages for summarization, we analyzed three dimensions:
- Size: Assessing the character, word, and sentence counts to classify messages based on length.
- Readability: Employing the Flesch-Kincaid grade level to evaluate the complexity of texts.
- Correctness: Calculating the proportion of words in the messages matched to a predefined Portuguese dictionary.
6. Summarization Process
We utilized LLMs (Qwen 2 and LLaMA 3) for summarizing dialogues, ensuring that summaries captured pivotal dialogue aspects while minimizing reliance on external data.
7. Summary Evaluation
Human evaluators from Ana Health’s healthcare team assessed the generated summaries, focusing on metrics such as coverage, relevance, redundancy, and veracity to ensure practical utility for healthcare professionals.
8. Data Privacy Compliance
Our study adhered strictly to ethical standards concerning data privacy, with all processing carried out locally and in compliance with legal norms and institutional guidelines. Measures for data anonymization and confidentiality were rigorously implemented throughout the research process.
Understanding Dataset and Preprocessing in Healthcare Dialogue Research
At the Database Laboratory of Universidade Federal de Minas Gerais, our research investigates the use of language models to summarize dialogues between patients and the healthcare team, specifically those conducted over WhatsApp at Ana Health. The messages span October 27, 2021, to January 8, 2024, and the project was approved by the Ethics Committee of the Federal University of Minas Gerais.
The Dataset
Our dataset comprises 207,040 messages from conversations between patients and the healthcare team, collected through Ana Health’s cloud-based contact center and written primarily in Portuguese. The messages are available upon request from the authors.
As part of our ethical guidelines, all identifiable patient information was anonymized by Ana Health, converting names, emails, phone numbers, and URLs into unique codes to protect privacy. We obtained consent for the use of this dataset from the legal guardians involved.
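The replacement of identifiers with unique codes can be illustrated with a minimal sketch. The procedure Ana Health actually used is not described here, so the regular expressions, the code format (`<EMAIL_1>` and so on), and the `Pseudonymizer` class below are all assumptions made for illustration. Names are deliberately left out, since detecting them reliably needs more than pattern matching.

```python
import re
from itertools import count

# Illustrative patterns only; they are assumptions, not the rules used in the study.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "URL": re.compile(r"https?://\S+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


class Pseudonymizer:
    """Map each distinct identifier to a stable placeholder code."""

    def __init__(self):
        self._codes = {}          # (kind, original value) -> placeholder
        self._counter = count(1)  # shared running number for all kinds

    def _code_for(self, kind, value):
        key = (kind, value)
        if key not in self._codes:
            self._codes[key] = f"<{kind}_{next(self._counter)}>"
        return self._codes[key]

    def apply(self, text):
        # Emails first, then URLs, then phone numbers.
        for kind, pattern in PATTERNS.items():
            text = pattern.sub(
                lambda m, k=kind: self._code_for(k, m.group(0)), text
            )
        return text


p = Pseudonymizer()
masked = p.apply(
    "Fale com ana@example.com ou +55 31 99999-0000, veja https://example.com/x"
)
```

Because the mapping is kept inside the object, the same address or number receives the same code every time it appears, which preserves references across a dialogue without exposing the original value.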
Dataset Composition
The dataset includes dialogues encompassing a range of interactions and emotional tones, highlighting the dynamics between patients and healthcare professionals. After preprocessing, which involved removing special characters, filtering out template messages, and cleansing the text, we distilled it to 202,326 messages across 1,863 dialogues.
Preprocessing Steps
Preprocessing is critical in refining raw data to enhance the quality and efficacy of analyses. We undertook several key steps:
- Text Clean-Up: Special characters such as tabs and newlines were removed, and duplicate white spaces eliminated.
- Template Filtering: We eliminated standardized responses from Ana Health to sharpen focus on genuine interactions.
These steps leave a dataset of genuine patient and healthcare team messages, ready for quality analysis and summarization.
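The two steps above can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: the template list is a hypothetical placeholder, and tabs and newlines are replaced by a single space (rather than deleted outright) so that adjacent words do not merge.

```python
import re

# Hypothetical template text; the real list of Ana Health templates is not given.
TEMPLATES = {
    "olá! como podemos ajudar?",
}


def clean_text(text):
    """Turn tabs/newlines into spaces and collapse repeated whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def is_template(text, templates=TEMPLATES):
    """Exact match against known templates, ignoring case and spacing."""
    return clean_text(text).lower() in templates


def preprocess(messages):
    """Clean every message, then drop empty ones and template messages."""
    cleaned = (clean_text(m) for m in messages)
    return [m for m in cleaned if m and not is_template(m)]


sample = ["Bom\tdia,\n\ntudo  bem?", "Olá! Como podemos ajudar?", "   "]
result = preprocess(sample)
```

Exact matching is the simplest possible filter; templates that embed variable fields such as a patient name would need pattern-based matching instead.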
Assessing Message Quality
Because summary quality depends on the quality of the input text, we assessed the messages along three dimensions:
1. Size
We analyzed the total number of characters, words, and sentences in each message. Good-quality messages are neither excessively short nor excessively long. Messages exceeding the average length by more than ten words were classified as long, while those falling more than five words below the average were classified as short.
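A minimal sketch of this length classification is shown below. It assumes whitespace-separated words and reads the rule as "long above the mean plus ten words, short below the mean minus five words". The function name and the `regular` label for everything in between are illustrative choices, not terms from the study.

```python
from statistics import mean


def classify_by_length(messages, long_margin=10, short_margin=5):
    """Label each message 'short', 'regular', or 'long' relative to the mean word count."""
    counts = [len(m.split()) for m in messages]
    avg = mean(counts)  # raises StatisticsError if messages is empty
    labels = []
    for n in counts:
        if n > avg + long_margin:
            labels.append("long")
        elif n < avg - short_margin:
            labels.append("short")
        else:
            labels.append("regular")
    return labels


ten_words = " ".join(["w"] * 10)
thirty_words = " ".join(["w"] * 30)
# Word counts 1, 10, 10, 10, 30 give a mean of 12.2.
labels = classify_by_length(["a", ten_words, ten_words, ten_words, thirty_words])
```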
2. Readability
Using the Flesch-Kincaid Grade Level, a measure of text complexity, we compared the readability of messages written by patients with those written by healthcare professionals. Although the metric was originally designed for English texts, adaptations for Portuguese exist, which supports its use in our analysis.
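For reference, the sketch below computes the grade level with the standard English coefficients. The constants of the Portuguese adaptation used in the study are not stated here, so they are not reproduced, and the word, sentence, and syllable counts are assumed to be computed elsewhere.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level using the original English coefficients."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("text must contain at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# 100 words in 5 sentences with 150 syllables: 0.39*20 + 11.8*1.5 - 15.59
grade = flesch_kincaid_grade(100, 5, 150)
```

Higher values indicate text that requires more years of schooling to read comfortably, so longer sentences and longer words both raise the score.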
3. Correctness
To gauge message correctness, we calculated the proportion of words in each message that appear in a predefined Portuguese dictionary, using the br.ispell package.
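The proportion itself is simple to compute once a word list is available. In the sketch below a tiny hand-written set stands in for the br.ispell dictionary (loading that dictionary is not shown), and the tokenization rule (runs of letters, lowercased) is an assumption.

```python
import re

# Runs of Unicode letters, so accented Portuguese words stay whole.
WORD_RE = re.compile(r"[^\W\d_]+")


def correctness(message, dictionary):
    """Fraction of words in the message that are found in the dictionary."""
    words = [w.lower() for w in WORD_RE.findall(message)]
    if not words:
        return 0.0  # convention chosen here for messages without words
    return sum(w in dictionary for w in words) / len(words)


# Toy dictionary for illustration only.
toy_dictionary = {"não", "estou", "bem"}
score = correctness("Não estou bemm!", toy_dictionary)
```

Here two of the three words are found, so the misspelled "bemm" lowers the score to two thirds.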
Summarization Strategy
The primary goal of our study is to evaluate how well two large language models (LLMs), LLaMA 3 and Qwen 2, can generate coherent, informative summaries of dialogues between patients and the healthcare team. We use the last 5,000 tokens of each dialogue, on the assumption that recent messages carry greater significance in the context of ongoing patient care.
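Keeping only the tail of a dialogue can be sketched as below. Whitespace splitting is used purely as a stand-in: in practice the token count would come from each model's own tokenizer, which can be passed in through the hypothetical `tokenize` parameter.

```python
def last_n_tokens(messages, max_tokens=5000, tokenize=str.split):
    """Return the final max_tokens tokens of a dialogue given as ordered messages."""
    if max_tokens <= 0:
        return []
    tokens = []
    for message in messages:  # messages are assumed to be in chronological order
        tokens.extend(tokenize(message))
    return tokens[-max_tokens:]


tail = last_n_tokens(["a b c", "d e"], max_tokens=3)
```

Truncating from the start of the dialogue, rather than the end, keeps the most recent exchanges intact, at the cost of possibly cutting an older message in the middle.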
Selection of LLMs
Our choice of LLMs hinges on several considerations:
- Open-Source Availability: Both models may be deployed locally, ensuring data privacy and compliance with ethical standards for healthcare analytics.
- Empirical Performance: Both LLaMA 3 and Qwen 2 deliver state-of-the-art results across various natural language tasks.
- Model Size and Efficiency: We used the 8-billion-parameter version of LLaMA 3 and the 7-billion-parameter version of Qwen 2 to balance performance with computational efficiency, which makes both models practical to run on standard hardware.
Evaluation of Summaries
The LLM-generated summaries were assessed in an A/B evaluation involving 24 healthcare professionals (physicians and psychologists), who rated the summaries on four dimensions: coverage, relevance, redundancy, and veracity.
Evaluation Methodology
Participants evaluated summaries based on their clarity and relevance. To limit potential bias, each participant had no prior interaction with the dialogues behind the summaries they assessed. Each summary was rated on a 5-point Likert scale, which allowed us to quantify its perceived quality.
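One way to quantify Likert ratings is to average them per model and dimension, as in the sketch below. The record layout, the model labels, and the scores are invented for illustration and are not results from the study; the statistical analysis actually applied is not described here.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("coverage", "relevance", "redundancy", "veracity")


def mean_ratings(ratings):
    """Average 1-5 Likert scores for each (model, dimension) pair."""
    grouped = defaultdict(list)
    for model, dimension, score in ratings:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        grouped[(model, dimension)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


# Made-up ratings, for illustration only.
example = [
    ("model_a", "coverage", 4),
    ("model_a", "coverage", 5),
    ("model_b", "coverage", 3),
]
averages = mean_ratings(example)
```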
Data Privacy Considerations
Given the sensitive nature of healthcare communications, our study adhered to stringent ethical and legal standards. All data processing occurred locally without transmission to external servers. Moreover, anonymity protocols were rigorously enforced, with all research team members signing Non-Disclosure Agreements. The goal was to ensure data integrity and maintain patient confidentiality throughout our research processes.
Conclusion
Our work at the interface of healthcare and AI underscores the importance of robust datasets and effective preprocessing in digital health. Careful selection and evaluation of technology, especially in a context as sensitive as healthcare, reflects a commitment to both technological progress and ethical research practice. By applying LLMs to dialogue summarization, we aim to improve how healthcare providers engage with patients and, ultimately, care delivery and patient outcomes.