Open Access
CC BY 4.0 · Methods Inf Med
DOI: 10.1055/a-2797-4380
Original Article

Early Risk Factor Prediction in Chronic Kidney Disease Diagnosis Using Feature Selection and Machine Learning Algorithms

Authors

  • Chowdhury Nazia Enam Prima

    1   Data Science Research Center, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
  • Martti Juhola

    1   Data Science Research Center, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland

Abstract

Background

Chronic kidney disease (CKD) is a long-term kidney illness in which kidney function deteriorates progressively over time. Unlike damage to some other organs, this loss of kidney function cannot be recovered or reversed. Moreover, asymptomatic renal disease is highly prevalent in the early stages, which makes early identification with conventional clinical approaches difficult. Early and accurate detection of risk factors is therefore a very challenging step in CKD diagnosis.

Objectives

This study aimed to identify CKD risk factors earlier and more effectively using established feature selection techniques, in order to enhance patient care. It also aimed to improve the predictive diagnosis of CKD using different supervised and ensemble machine learning classifiers.

Methods

A CKD-focused dataset consisting of 1,032 patient records and 14 features was used. The study focused on identifying the risk factors of CKD with two feature selection processes: feature importance (for a tree-based model) combined with a sequential feature selector, and the ReliefF algorithm. The top 10 features were identified from the ranking produced by each technique. Using those features, eight classifiers (random forest, support vector machine, Naïve Bayes, decision tree, logistic regression, gradient boosting, K-nearest neighbors, and an ensemble voting classifier) were trained with stratified 5-fold cross-validation and grid search. Their performance in classifying individuals as having or not having CKD was then assessed using accuracy, F1 score, precision, recall, training loss, test loss, bias, and AUC.
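The selection and training steps described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic (generated to match only the reported shape of 1,032 records and 14 features), the way feature importance and the sequential feature selector are combined is assumed, the hyperparameter grid is hypothetical, and ReliefF is omitted because scikit-learn does not provide it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the reported shape; NOT the study's CKD data.
X, y = make_classification(
    n_samples=1032, n_features=14, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified 5-fold splitter, reused for selection and tuning.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Rank all features by impurity-based importance from a tree ensemble.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
ranking = np.argsort(forest.feature_importances_)[::-1]  # best feature first

# Backward sequential selection down to 10 features (assumed setup:
# a single decision tree as the wrapped tree-based model).
sfs = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=10,
    direction="backward",
    cv=cv,
)
sfs.fit(X_train, y_train)
selected = np.flatnonzero(sfs.get_support())  # column indices that were kept

# Grid search over a small, hypothetical grid for gradient boosting.
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_train[:, selected], y_train)

# Held-out accuracy of the refitted best model.
test_accuracy = search.score(X_test[:, selected], y_test)
print("importance ranking:", ranking.tolist())
print("selected features:", selected.tolist())
print("best parameters:", search.best_params_)
print("test accuracy:", round(test_accuracy, 3))
```

Fitting the selector on the training split only keeps the held-out records from influencing which features are chosen. The other classifiers named above could be tuned the same way by swapping the estimator and its grid.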

Results

Each feature selection algorithm selected a data-driven set of the top 10 features. In the rankings from both procedures, hemoglobin was the most significant risk factor among these features. With both feature selection techniques, the classifiers achieved accuracy of 86 to 98%, AUC values from above 0.96 up to 1.00, and bias values of 0.003 to 0.103. Using feature importance with the sequential feature selector (SFS), all the classifiers showed a very good trade-off between false positives and false negatives, with precision, recall, and F1 score ranging from 92 to 98%, 90 to 99%, and 93 to 98%, respectively. With both feature selection techniques, gradient boosting outperformed all other algorithms in terms of accuracy, precision, AUC, recall, F1 score, specificity, and bias.
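For readers less familiar with the reported measures, the following sketch shows how accuracy, precision, recall, F1 score, specificity, and AUC relate to a confusion matrix. The labels and scores are made-up toy values, not results from the study. Bias and the training and test losses are left out because the abstract does not define how they were computed.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy values for illustration only: 1 = CKD present, 0 = CKD absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1, 0.85, 0.05]

# Threshold the predicted probabilities at 0.5 to get class labels.
y_pred = [int(score >= 0.5) for score in y_score]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (tp + tn) / all
    "precision": precision_score(y_true, y_pred),  # tp / (tp + fp)
    "recall": recall_score(y_true, y_pred),        # tp / (tp + fn)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of the two
    "specificity": tn / (tn + fp),                 # true negative rate
    "auc": roc_auc_score(y_true, y_score),         # uses scores, not labels
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these toy values there are 4 true positives, 4 true negatives, 1 false positive, and 1 false negative, so every threshold-based measure equals 0.80, while the AUC is 0.96 because 24 of the 25 positive-negative pairs are ranked correctly.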

Conclusion

In the proposed methodology, the feature selection algorithms effectively identified the prominent features based on their importance, and the pipeline performed well in diagnosing individuals at risk of developing CKD. Some of the classifiers proved effective in CKD prediction using the selected features, achieving higher accuracy, F1 score, precision, recall, AUC, and specificity, along with lower bias. It can therefore be inferred that this methodology, which combines eight machine learning models with two efficient feature selection approaches, can detect people at risk of this nephrological condition earlier and identify increased risk factors more accurately than conventional methods. This holds great promise for enhancing healthcare decision-making and eventually ensuring treatment for patients.



Publication History

Received: 30 June 2025

Accepted: 23 January 2026

Article published online: 13 February 2026

© 2026. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany