Evaluation of A Hybrid Approach Combining Random Forest and XGBoost for Type 2 Diabetes Prediction Using the PIMA Dataset

Mishel Bravo Almendras; Maria Laura Nuñez Jaillita; Juan Abiel Iriarte Colque; Eynar Calle Viles; Oscar Contreras Carrasco; Germán Rico Ramallo

doi:10.1109/la-cci66231.2025.11270183

dc.contributor.author	Mishel Bravo Almendras
dc.contributor.author	Maria Laura Nuñez Jaillita
dc.contributor.author	Juan Abiel Iriarte Colque
dc.contributor.author	Eynar Calle Viles
dc.contributor.author	Oscar Contreras Carrasco
dc.contributor.author	Germán Rico Ramallo
dc.coverage.spatial	Bolivia
dc.date.accessioned	2026-03-22T19:52:23Z
dc.date.available	2026-03-22T19:52:23Z
dc.date.issued	2025
dc.description.abstract	This study evaluates the performance of three machine learning models in predicting type 2 diabetes, focusing on their accuracy, sensitivity, and generalization capacity. The methodological process was structured into three phases: data preprocessing, model development, and performance evaluation. The widely validated PIMA dataset was used, which includes relevant clinical variables such as body mass index, blood glucose levels, and blood pressure. During preprocessing, the SMOTE algorithm was applied to address class imbalance, significantly improving the detection of positive cases. Three predictive approaches were compared: Random Forest, XGBoost, and a hybrid model based on soft voting between the two. Random Forest achieved the best performance, with an accuracy of 97.0%, an F1-score of 97.1%, and an AUC-ROC of 99.86%, followed by XGBoost with an accuracy of 96.4%. The hybrid model matched XGBoost in nearly all metrics (accuracy of 96.4%, recall of 98.0%, and F1-score of 96.4%) while demonstrating greater stability and robustness; its AUC-ROC was 99.7%, compared with 99.3% for XGBoost. These results surpass those reported in previous research, reinforcing the clinical utility of these algorithms for early diagnostic tasks. As part of the project, an interactive interface was developed for healthcare professionals, allowing them to enter medical data and obtain immediate predictions of diabetes risk. This tool, delivered as functional software to the Tuscapujio Hospital Center in Sacaba, Cochabamba, Bolivia, represents a practical and accessible solution to support clinical decision-making in real-world contexts. The study highlights the importance of rigorous preprocessing and proper handling of class imbalance to improve the performance of predictive models in medical applications.
dc.identifier.doi	10.1109/la-cci66231.2025.11270183
dc.identifier.uri	https://doi.org/10.1109/la-cci66231.2025.11270183
dc.identifier.uri	https://andeanlibrary.org/handle/123456789/78627
dc.source	Universidad Privada del Valle
dc.subject	Random forest
dc.subject	Computer science
dc.subject	Preprocessor
dc.subject	Artificial intelligence
dc.subject	Machine learning
dc.subject	Data mining
dc.subject	Robustness (evolution)
dc.subject	Stability (learning theory)
dc.subject	Matching (statistics)
dc.subject	Generalization
dc.title	Evaluation of A Hybrid Approach Combining Random Forest and XGBoost for Type 2 Diabetes Prediction Using the PIMA Dataset
dc.type	article
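
The abstract describes the hybrid model as soft voting between Random Forest and XGBoost. As a rough illustration of that idea only, the sketch below averages the positive-class probabilities of two classifiers and thresholds the result. The function name `soft_vote`, the equal weights, the 0.5 threshold, and the probability values are all assumptions made for this example; they are not taken from the paper, and the authors' actual implementation may differ.

```python
from typing import List, Sequence, Tuple


def soft_vote(
    prob_a: Sequence[float],
    prob_b: Sequence[float],
    weights: Tuple[float, float] = (0.5, 0.5),  # assumed equal weighting
    threshold: float = 0.5,                     # assumed decision threshold
) -> Tuple[List[float], List[int]]:
    """Combine two models' positive-class probabilities by weighted averaging.

    Returns the averaged probabilities and the resulting 0/1 labels.
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("both models must score the same number of samples")
    w_a, w_b = weights
    total = w_a + w_b
    averaged = [(w_a * pa + w_b * pb) / total for pa, pb in zip(prob_a, prob_b)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical probabilities for three patients (illustrative, not from the study).
rf_probs = [0.9, 0.4, 0.2]   # stand-in for Random Forest outputs
xgb_probs = [0.7, 0.7, 0.1]  # stand-in for XGBoost outputs

averaged, labels = soft_vote(rf_probs, xgb_probs)
print(labels)
```

For the second hypothetical patient, the first model alone would predict the negative class (0.4 is below the threshold) while the second predicts positive; the averaged probability decides the combined label. This is the sense in which soft voting uses each model's confidence rather than only its hard vote.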

Collections

Artículo Científico Publicado
