Evaluation of A Hybrid Approach Combining Random Forest and XGBoost for Type 2 Diabetes Prediction Using the PIMA Dataset

dc.contributor.author: Mishel Bravo Almendras
dc.contributor.author: Maria Laura Nuñez Jaillita
dc.contributor.author: Juan Abiel Iriarte Colque
dc.contributor.author: Eynar Calle Viles
dc.contributor.author: Oscar Contreras Carrasco
dc.contributor.author: Germán Rico Ramallo
dc.coverage.spatial: Bolivia
dc.date.accessioned: 2026-03-22T19:52:23Z
dc.date.available: 2026-03-22T19:52:23Z
dc.date.issued: 2025
dc.description.abstract: This study evaluates the performance of three machine learning models in predicting type 2 diabetes, focusing on their accuracy, sensitivity, and generalization capacity. The methodological process was structured into three phases: data preprocessing, model development, and performance evaluation. The widely validated PIMA dataset was used, which includes relevant clinical variables such as body mass index, blood glucose levels, and blood pressure. During preprocessing, the SMOTE algorithm was applied to address class imbalance, significantly improving the detection of positive cases. Three predictive approaches were compared: Random Forest, XGBoost, and a hybrid model based on soft voting between the two. Random Forest achieved the best performance, with an accuracy of 97.0%, an F1-score of 97.1%, and an AUC-ROC of 99.86%. XGBoost followed with an accuracy of 96.4%, a recall of 98.0%, an F1-score of 96.4%, and an AUC-ROC of 99.3%. The hybrid model matched XGBoost in nearly all of these metrics, reached an AUC-ROC of 99.7%, and demonstrated greater stability and robustness in its results. These results surpass those reported in previous research, reinforcing the clinical utility of these algorithms for early diagnostic tasks. As part of the project, an interactive interface was developed for healthcare professionals, allowing them to enter medical data and obtain immediate predictions of diabetes risk. This tool, delivered as functional software to the Tuscapujio Hospital Center in Sacaba, Cochabamba, Bolivia, represents a practical and accessible solution to support clinical decision-making in real-world contexts. The study highlights the importance of rigorous preprocessing and proper handling of class imbalance to improve the performance of predictive models in medical applications.
dc.identifier.doi: 10.1109/la-cci66231.2025.11270183
dc.identifier.uri: https://doi.org/10.1109/la-cci66231.2025.11270183
dc.identifier.uri: https://andeanlibrary.org/handle/123456789/78627
dc.source: Universidad Privada del Valle
dc.subject: Random forest
dc.subject: Computer science
dc.subject: Preprocessor
dc.subject: Artificial intelligence
dc.subject: Machine learning
dc.subject: Data mining
dc.subject: Robustness (evolution)
dc.subject: Stability (learning theory)
dc.subject: Matching (statistics)
dc.subject: Generalization
dc.title: Evaluation of A Hybrid Approach Combining Random Forest and XGBoost for Type 2 Diabetes Prediction Using the PIMA Dataset
dc.type: article
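
The abstract describes the hybrid model as soft voting between Random Forest and XGBoost. The record does not include the authors' code, so the following is only a minimal sketch of what soft voting means in general: each model's predicted probability for the positive class is averaged, and the average is thresholded. The function names `soft_vote` and `predict`, the equal weights, the 0.5 threshold, and the probability values are all illustrative assumptions, not details taken from the study.

```python
from typing import List, Optional, Sequence


def soft_vote(
    prob_sets: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted average of per-model positive-class probabilities.

    prob_sets[m][i] is model m's probability that sample i is positive.
    Equal weights are an assumption; the study's weighting is not stated.
    """
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    n_samples = len(prob_sets[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_sets)) / total
        for i in range(n_samples)
    ]


def predict(avg_probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn averaged probabilities into 0/1 labels (threshold is assumed)."""
    return [1 if p >= threshold else 0 for p in avg_probs]


# Made-up probabilities for three patients, standing in for the outputs
# of a trained Random Forest and a trained XGBoost model.
rf_probs = [0.9, 0.4, 0.2]
xgb_probs = [0.7, 0.7, 0.1]

avg = soft_vote([rf_probs, xgb_probs])
labels = predict(avg)
```

In this toy input the second patient is where the two models disagree: the first model alone would fall below the threshold, while the averaged probability lands above it. With trained models, scikit-learn's `VotingClassifier` with `voting="soft"` performs the same averaging over the estimators' `predict_proba` outputs.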
