A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts - 01/12/25

Abstract |
Background |
Harmonizing diverse healthcare datasets is challenging because naming conventions are inconsistent across sources. Manual harmonization is time- and resource-intensive, which limits scalability for multi-cohort Alzheimer's disease research. Large language models, and text-embedding models in particular, offer a promising solution, but their rapid development necessitates continuous, domain-specific benchmarking, especially since established general-purpose benchmarks lack clinical data harmonization use cases.
Objectives |
To evaluate how different text-embedding models perform for the harmonization of clinical variables.
Design and setting |
We created a novel benchmark to assess how well embeddings from different language models can be used to harmonize cohort study metadata with an in-house Common Data Model that includes cohort-to-cohort mappings for a wide range of Alzheimer’s disease cohorts. We evaluated five state-of-the-art text-embedding models on seven datasets in the context of Alzheimer’s disease.
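As an illustration of the general idea, embedding-based harmonization can be sketched as a nearest-neighbor lookup: each cohort variable is mapped to the Common Data Model entry whose embedding is most similar, and the predicted mappings are scored against curated ones. The sketch below is a minimal example under stated assumptions, not the ADHTEB implementation: the variable names, the three-dimensional toy vectors (stand-ins for real model embeddings), and the choice of cosine similarity with top-1 accuracy are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_matches(source, target):
    """Map each source variable to the target variable with the most similar embedding."""
    return {
        name: max(target, key=lambda t: cosine(vec, target[t]))
        for name, vec in source.items()
    }


def top1_accuracy(predicted, gold):
    """Fraction of gold mappings that the top-1 prediction reproduces."""
    hits = sum(1 for name, match in gold.items() if predicted.get(name) == match)
    return hits / len(gold)


# Hypothetical toy embeddings; a real run would obtain these from a text-embedding model.
cohort_vars = {
    "AGE_AT_VISIT": [0.9, 0.1, 0.0],
    "MMSE_TOTAL": [0.1, 0.8, 0.2],
}
cdm_vars = {
    "Age": [1.0, 0.0, 0.0],
    "Mini-Mental State Examination score": [0.0, 1.0, 0.1],
    "Hippocampal volume": [0.0, 0.1, 1.0],
}
# Hypothetical curated mapping used as ground truth.
gold = {
    "AGE_AT_VISIT": "Age",
    "MMSE_TOTAL": "Mini-Mental State Examination score",
}

predicted = top1_matches(cohort_vars, cdm_vars)
accuracy = top1_accuracy(predicted, gold)
print(predicted, accuracy)
```

Comparing models then amounts to repeating this scoring with each model's embeddings over the same curated mappings.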
Participants |
No patient data were utilized for any of the analyses, as the evaluation was based on semantic harmonization of cohort metadata only.
Measurements |
The analyses included text descriptions of variables from four modalities: clinical, lifestyle, demographic, and imaging.
Results |
Our benchmark favored different models than general-purpose benchmarks do. This suggests that models fine-tuned for generic tasks may not translate well to real-world data harmonization, particularly in Alzheimer’s disease. We propose guidelines for formatting metadata to facilitate manual or model-assisted data harmonization. We introduce an open-source library (ADHTEB) and an interactive leaderboard (adhteb.scai.fraunhofer.de) to aid future model benchmarking.
Conclusions |
Our findings highlight the importance of domain-specific benchmarks for clinical data harmonization in the field of Alzheimer’s disease and motivate standards for naming conventions that may support semi-automated mapping applications in the future.
The full text of this article is available as a PDF.
Keywords: Harmonization, Alzheimer’s disease, Text-embeddings, Large language models
Vol 13 - N° 1
Article 100420 - January 2026
