News and Publications
Comparing representations of long clinical texts for the task of patient note-identification
Alsaidi et al., CL4Health@NAACL 2025, May 2025. https://hal.science/hal-05000565v2
This research tackles a technical but important problem in healthcare: making sure that anonymized medical notes (notes stripped of names and other personal details) are correctly matched to the right patient. This matters especially in large hospitals and research databases, where a patient may accumulate many notes written by different doctors over time.
Matching notes accurately helps with:
- Finding duplicate records (when the same patient is accidentally entered more than once)
- Comparing patients with similar conditions or treatments
- Building a complete picture of a patient’s medical history
To solve this problem, the researchers tested several advanced language models that can read and represent long medical texts, including BERT-based models and Transformer networks designed to handle lengthy documents.
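To give a concrete sense of what this looks like in practice, here is a minimal sketch of turning a single note into a numerical vector with a generic BERT-style encoder from the Hugging Face transformers library. The checkpoint name, the example note, and the averaging step are illustrative assumptions, not the exact setup used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative only: a generic checkpoint, not the model(s) evaluated in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

note = "Patient admitted with chest pain; ECG and troponin ordered."
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = encoder(**inputs)

# Averaging the token embeddings gives one fixed-size vector per note.
note_embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(note_embedding.shape)  # torch.Size([768])
```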
They also tested different ways of combining information from words and sentences to create a full “profile” of a patient. One method, called mean_max pooling, worked especially well at capturing important details from the notes.
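As a rough illustration of the idea (not the paper's exact implementation), mean_max pooling can be sketched as concatenating the element-wise mean and the element-wise maximum of a stack of embeddings. The function name, shapes, and example values below are hypothetical.

```python
import numpy as np

def mean_max_pooling(embeddings: np.ndarray) -> np.ndarray:
    """Concatenate the element-wise mean and max of a stack of embeddings.

    Hypothetical sketch: `embeddings` has shape (num_items, dim), where an
    item could be a token, a sentence, or a whole note; the result has
    shape (2 * dim,) and serves as a single aggregated representation.
    """
    mean_vec = embeddings.mean(axis=0)  # average signal across items
    max_vec = embeddings.max(axis=0)    # strongest activation per dimension
    return np.concatenate([mean_vec, max_vec])

# Example: a patient with 3 notes, each represented by a 768-dim vector
note_vectors = np.random.rand(3, 768)
patient_profile = mean_max_pooling(note_vectors)
print(patient_profile.shape)  # (1536,)
```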
The study showed that these modern AI tools can do a better job than older methods, especially when dealing with long and complex notes. The researchers tested their approach on two large sets of real medical data — one from a public research database and one from a major hospital in France — and got strong results in both cases.
In short, this work helps improve how computers understand and organize patient information, which could lead to better care, more accurate research, and safer medical records.
In this paper, we address the challenge of patient-note identification, which involves accurately matching an anonymized clinical note to its corresponding patient, represented by a set of related notes. This task has broad applications, including duplicate record detection and patient similarity analysis, which require robust patient-level representations. We explore various embedding methods, including Hierarchical Attention Networks (HAN), three-level Hierarchical Transformer Networks (HTN), Longformer, and advanced BERT-based models, focusing on their ability to process medium-to-long clinical texts effectively. Additionally, we evaluate different pooling strategies (mean, max, and mean_max) for aggregating word-level embeddings into patient-level representations, and we examine the impact of sliding windows on model performance. Our results indicate that BERT-based embeddings outperform traditional and hierarchical models, particularly in processing lengthy clinical notes and capturing nuanced patient representations. Among the pooling strategies, mean_max pooling consistently yields the best results, highlighting its ability to capture critical features from clinical notes. Furthermore, the reproduction of our results on both the MIMIC dataset and the Necker hospital data warehouse illustrates the generalizability of these approaches to real-world applications, emphasizing the importance of both embedding methods and aggregation strategies in optimizing patient-note identification and enhancing patient-level modeling.
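For readers unfamiliar with the sliding-window technique mentioned in the abstract, the sketch below shows one common way to split a long token sequence into overlapping windows so that an encoder with a fixed input length can cover the whole note. The window size and stride are arbitrary example values, not the settings reported in the paper.

```python
from typing import List

def sliding_windows(tokens: List[str], window_size: int = 512, stride: int = 256) -> List[List[str]]:
    """Split a long token sequence into overlapping fixed-size windows.

    Hypothetical sketch: the window size and stride are example values only.
    Each window can be encoded separately, and the per-window embeddings can
    then be pooled (e.g. with mean_max) into a single note representation.
    """
    if len(tokens) <= window_size:
        return [tokens]
    windows = []
    for start in range(0, len(tokens) - window_size + 1, stride):
        windows.append(tokens[start:start + window_size])
    # Keep the tail of the note if the last full window did not reach the end
    if start + window_size < len(tokens):
        windows.append(tokens[-window_size:])
    return windows

# Example: a 1200-token note becomes four overlapping 512-token windows
note_tokens = [f"tok{i}" for i in range(1200)]
print([len(w) for w in sliding_windows(note_tokens)])  # [512, 512, 512, 512]
```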