DOI: 10.1055/a-2797-4380
Early Risk Factor Prediction in Chronic Kidney Disease Diagnosis Using Feature Selection and Machine Learning Algorithms
Authors
Abstract
Background
Chronic kidney disease (CKD) is a long-term condition in which kidney function deteriorates progressively over time. Unlike damage to some other organs, this loss of kidney function cannot be reversed. Moreover, the disease is usually asymptomatic in its early stages, which makes early identification with conventional clinical approaches difficult. Early and accurate detection of risk factors is therefore a challenging step in CKD diagnosis.
Objectives
This research aimed to identify CKD risk factors earlier and more effectively using established feature selection techniques, with the goal of improving patient care. It also aimed to improve the predictive diagnosis of CKD using different supervised and ensemble machine learning classifiers.
Methods
A CKD-focused dataset of 1,032 patient records and 14 features was used. Risk factors of CKD were identified with two feature selection procedures: feature importance (from a tree-based model) combined with a sequential feature selector (SFS), and the ReliefF algorithm. Based on the rankings from both techniques, the top 10 features were identified. Using those features, random forest, support vector machine, Naïve Bayes, decision tree, logistic regression, gradient boosting, K-nearest neighbors, and a voting ensemble classifier were trained with stratified 5-fold cross-validation and grid search. Their performance in classifying individuals as having or not having CKD was then assessed using accuracy, F1 score, precision, recall, training loss, test loss, bias, and AUC.
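As an illustration of the ReliefF ranking step only, the following is a minimal sketch and not the authors' implementation. It assumes two classes, purely numeric features, and that every sample is visited once. The function name `relieff_scores`, the parameter `n_neighbors`, the feature names, and the synthetic values are all hypothetical and are not taken from the study's dataset.

```python
import random


def relieff_scores(X, y, n_neighbors=3):
    """Simplified ReliefF for two classes and numeric features.

    X is a list of rows, y a list of 0/1 labels. A higher score means
    the feature differs more between classes than within a class.
    """
    n_samples, n_features = len(X), len(X[0])

    # Per-feature ranges scale every difference into [0, 1].
    ranges = []
    for f in range(n_features):
        column = [row[f] for row in X]
        ranges.append((max(column) - min(column)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / ranges[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(n_features))

    weights = [0.0] * n_features
    for i in range(n_samples):
        # Rank all other samples by their distance to sample i.
        others = sorted(
            (j for j in range(n_samples) if j != i),
            key=lambda j: distance(X[i], X[j]),
        )
        hits = [j for j in others if y[j] == y[i]][:n_neighbors]
        misses = [j for j in others if y[j] != y[i]][:n_neighbors]
        for f in range(n_features):
            # Penalise features that differ among same-class neighbours,
            # reward features that differ among other-class neighbours.
            if hits:
                hit_diff = sum(diff(f, X[i], X[j]) for j in hits)
                weights[f] -= hit_diff / (len(hits) * n_samples)
            if misses:
                miss_diff = sum(diff(f, X[i], X[j]) for j in misses)
                weights[f] += miss_diff / (len(misses) * n_samples)
    return weights


# Synthetic data (assumption): one informative feature, two noise features.
rng = random.Random(42)
feature_names = ["hemoglobin", "noise_a", "noise_b"]
X, y = [], []
for label in (0, 1):
    for _ in range(40):
        informative = rng.gauss(14.0 if label == 0 else 10.0, 1.0)
        X.append([informative, rng.random(), rng.gauss(0.0, 1.0)])
        y.append(label)

scores = relieff_scores(X, y)
ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
# The study kept the top 10 of 14 features; this toy example keeps 2 of 3.
top_k = [feature_names[f] for f in ranked[:2]]
print(list(zip(feature_names, scores)))
print(top_k)
```

In this sketch the informative feature should receive the highest score, because it differs strongly between the nearest other-class neighbors and little between the nearest same-class neighbors. A multi-class ReliefF would additionally weight the misses of each class by its prior probability.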
Results
Each feature selection algorithm selected its top 10 data-driven features. In the rankings from both procedures, hemoglobin was identified as the most significant risk factor among these features. With both feature selection techniques, the classifiers achieved accuracies of 86 to 98%, AUC values from above 0.96 to 1.00, and bias values of 0.003 to 0.103. Using feature importance with SFS, all classifiers showed a good trade-off between false positives and false negatives, with precision, recall, and F1 score ranging from 92 to 98%, 90 to 99%, and 93 to 98%, respectively. With both feature selection techniques, gradient boosting outperformed all other algorithms in terms of accuracy, precision, AUC, recall, F1 score, specificity, and bias.
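For readers less familiar with the reported measures, the sketch below shows how accuracy, precision, recall, specificity, F1 score, and AUC follow from predictions for a binary CKD / no-CKD label. It is an illustration under stated assumptions: the function names and the toy labels and scores are hypothetical, the numbers are not the study's results, and the "bias" measure is omitted because the abstract does not define it.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Derive the usual threshold-based measures from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


def roc_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy predictions (assumption): 50 positive and 50 negative cases.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 45 + [0] * 5 + [1] * 3 + [0] * 47
metrics = classification_metrics(y_true, y_pred)
print(metrics)

# Toy scores (assumption) for the ranking-based AUC.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Precision is lowered by false positives and recall by false negatives, so reporting both alongside F1 and specificity is what allows the trade-off between the two error types to be judged.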
Conclusion
In the proposed methodology, the feature selection algorithms effectively identified the prominent features based on their importance, and the pipeline performed well in identifying individuals at risk of developing CKD. Several classifiers proved effective for CKD prediction with the selected features, achieving high accuracy, F1 score, precision, recall, AUC, and specificity together with low bias. These results suggest that combining the eight machine learning models with the two feature selection approaches can detect people at risk of CKD earlier, and identify risk factors more accurately, than conventional methods. This holds promise for improving clinical decision-making and, ultimately, for ensuring treatment for patients.
Keywords
chronic kidney disease - early risk prediction - feature selection - machine learning algorithms
Publication History
Received: 30 June 2025
Accepted: 23 January 2026
Article published online: 13 February 2026
© 2026. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany