DNA methylation-based machine learning models for classification of oral cancer and potentially malignant lesions: A proof-of-concept study - 12/10/25
, Kannan Sridharan b
, Mohammed Abdulla AlMuharraqi c 
Abstract |
Background |
Accurate classification of oral squamous cell carcinoma (OSCC) and oral potentially malignant lesions (OPLs) is challenging due to histopathological variability and limited predictive biomarkers. DNA methylation offers a promising molecular signature, but its utility for tissue classification remains underexplored.
Methods |
We harmonized publicly available DNA methylation datasets (GSE97784 and GSE204943; n = 142) and selected the top 100 most variable CpG sites (variance 0.074–0.117) for analysis. Eight supervised machine learning (ML) models—logistic regression, random forest (RF), support vector machine (SVM), extreme gradient boosting (XGBoost), k-nearest neighbors (kNN), Naive Bayes, gradient boosting machine (GBM), and neural network (NN)—were trained using 10-fold cross-validation. Principal component analysis was performed to assess data dimensionality.
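The feature-selection and cross-validation steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the CpG identifiers and beta values are invented toy data, the function names `top_variable_sites` and `k_fold_indices` are hypothetical, and the use of sample variance and a shuffled, unstratified split are assumptions, since the abstract does not state either.

```python
import random
from statistics import variance


def top_variable_sites(beta, k):
    """Return the k CpG IDs with the highest variance across samples.

    beta maps a CpG ID to its list of beta values, one per sample.
    Sample variance is an assumption; the abstract does not say which
    estimator was used.
    """
    ranked = sorted(beta, key=lambda cpg: variance(beta[cpg]), reverse=True)
    return ranked[:k]


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# Toy data (hypothetical): three CpG sites measured in four samples.
beta = {
    "cg_a": [0.10, 0.90, 0.15, 0.85],
    "cg_b": [0.50, 0.52, 0.49, 0.51],
    "cg_c": [0.20, 0.60, 0.30, 0.70],
}
selected = top_variable_sites(beta, k=2)

# n = 142 and 10 folds are taken from the Methods text.
splits = list(k_fold_indices(n_samples=142, k=10))
```

In the study's setting, `k` would be 100 and each of the eight models would be fitted on the training indices and scored on the held-out fold.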
Results |
High-variance CpG sites were predominantly located within gene bodies and clustered on chromosomes 1, 2, and 6. PCA revealed complex, high-dimensional methylation patterns, with 55 components required to capture 90 % of the variance. Overall, RF achieved the highest accuracy (78 %) and AUC-ROC (0.84), followed by GBM (76 %) and XGBoost. Tumor and normal tissues were classified with relatively high sensitivity and specificity, whereas OPLs were difficult to detect, with low sensitivity (<50 %) across all models. GBM performed best for normal tissue detection, and Naive Bayes achieved a slightly higher tumor F1-score, but RF offered the most balanced performance across classes.
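To make the per-class figures concrete, the sketch below computes one-vs-rest sensitivity, specificity, and F1-score from a three-class confusion matrix. The counts are invented for illustration and are not the study's results; `per_class_metrics` is a hypothetical helper name. The example shows how overall accuracy can look reasonable while sensitivity for a minority class such as OPL stays low.

```python
def per_class_metrics(confusion, labels):
    """One-vs-rest metrics per class.

    confusion[i][j] is the number of samples whose true class is
    labels[i] and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                 # true class i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp  # predicted i, truly another class
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        metrics[label] = {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "f1": f1,
        }
    return metrics


labels = ["normal", "OPL", "tumor"]

# Hypothetical counts (rows = true class, columns = predicted class).
confusion = [
    [30, 2, 3],
    [5, 6, 9],
    [3, 3, 39],
]

metrics = per_class_metrics(confusion, labels)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / sum(sum(row) for row in confusion)
```

With these toy counts, most OPL samples are assigned to the normal or tumor class, so OPL sensitivity is well below that of the other two classes even though most samples overall are classified correctly.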
Conclusions |
Ensemble ML models, particularly RF and GBM, demonstrate proof-of-concept potential for DNA methylation-based classification of oral tissues. While tumor and normal classification is robust, OPL detection remains challenging, highlighting the need for larger, balanced datasets and complementary biomarkers to improve early detection and clinical utility.
Keywords: Carcinoma, Squamous Cell; Mouth Neoplasms; Precancerous Conditions; DNA Methylation; Machine Learning; Random Forest Algorithm; Epigenomics.
Vol 127 - N° 2
Article 102594 - March 2026
