Predicting Type 2 Diabetes Risk and Identifying its Risk Factors through a Machine Learning Model

Ghaith Maqboul
Abstract
Diabetes, and type 2 diabetes in particular, is one of the most prevalent chronic conditions worldwide; it affects millions of people and imposes a considerable global economic burden. Our goal was to develop predictive models that identify universal risk factors for type 2 diabetes, in order to advance early diagnosis and intervention strategies and reduce medical costs. The data come from the 2015 Behavioral Risk Factor Surveillance System (BRFSS), a survey conducted by the U.S. Centers for Disease Control and Prevention (CDC). The dataset started with 441,456 participants and 330 features; after preprocessing and cleaning, it was narrowed down to 42,340 participants and 18 features, including 10,348 type 2 diabetes cases. For this binary classification task, features were selected to support an analysis that can inform public health strategies. We trained seven machine learning models (AdaBoost, Neural Network, Logistic Regression, Decision Tree, K Nearest Neighbors, Naive Bayes, and Random Forest) and used the Random Forest classifier to examine the risk factors associated with type 2 diabetes through feature importance. All models achieved an AUC between 74.7% and 79.2%. AdaBoost achieved the highest test accuracy (78.2%), with a sensitivity of 33.5% and a specificity of 92.7%. Neural Network and Logistic Regression also performed well. K Nearest Neighbors favored specificity (92.8%), Naive Bayes showed notable sensitivity (57.8%), and Random Forest had the highest sensitivity (72.9%). The Random Forest classifier was used to evaluate the importance of features associated with type 2 diabetes, identifying the top five contributors: Age (14.4%), Income (12.1%), MentalHealth (8.3%), Education (8.2%), and PhysicalHealth (7.7%).
Among the 7 models, Neural Network, AdaBoost, and Logistic Regression converge, with accuracy of 77.4%-78%, sensitivity of 32%-34.6%, specificity of 91.2%-92.7%, and a closely aligned AUC of 78.7%-79.2%. Notably, Random Forest excels in sensitivity at 72.9% despite a lower accuracy of 71.7%. It is central to the feature importance analysis and is preferred for initial type 2 diabetes screening because of its balanced overall results.
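
The reported figures can be related to one another with a short calculation. Accuracy is the prevalence-weighted average of sensitivity and specificity, so a model can reach high accuracy with low sensitivity when most participants do not have diabetes. The sketch below is illustrative only and is not the study's code: the function names are hypothetical, and it assumes the test set has the same share of type 2 diabetes cases as the full cleaned dataset (10,348 of 42,340), which the abstract does not state.

```python
# Counts taken from the abstract; everything else is an illustrative assumption.
TOTAL_PARTICIPANTS = 42_340
DIABETES_CASES = 10_348


def accuracy_from_rates(sensitivity, specificity, prevalence):
    """Accuracy as the prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


def specificity_from_accuracy(accuracy, sensitivity, prevalence):
    """Solve the same identity for specificity."""
    return (accuracy - sensitivity * prevalence) / (1.0 - prevalence)


# Assumption: test-set prevalence equals the prevalence in the cleaned dataset.
prevalence = DIABETES_CASES / TOTAL_PARTICIPANTS

# AdaBoost: sensitivity 33.5%, specificity 92.7% (reported).
adaboost_accuracy = accuracy_from_rates(0.335, 0.927, prevalence)

# Random Forest: accuracy 71.7%, sensitivity 72.9% (reported); specificity implied.
rf_specificity = specificity_from_accuracy(0.717, 0.729, prevalence)

# Accuracy of a trivial classifier that always predicts "no diabetes".
majority_baseline = 1.0 - prevalence

print(f"prevalence:                  {prevalence:.3f}")
print(f"AdaBoost implied accuracy:   {adaboost_accuracy:.3f}")
print(f"Random Forest implied spec.: {rf_specificity:.3f}")
print(f"majority-class baseline:     {majority_baseline:.3f}")
```

Under that assumption, the AdaBoost sensitivity and specificity are consistent with the reported 78.2% test accuracy, and the Random Forest figures imply a specificity of roughly 71%. This is a derived estimate rather than a result reported by the study. The majority-class baseline also shows why accuracy alone is a weak criterion for a screening model on this dataset.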