DOI: 10.1160/ME0317
Using T3, an Improved Decision Tree Classifier, for Mining Stroke-related Medical Data
Publication History
Publication Date:
22 January 2018 (online)
Summary
Objectives: Medical data are a valuable resource from which novel and potentially useful knowledge can be discovered by using data mining. Data mining can assist and support medical decision making and enhance clinical management and investigative research. The objective of this work is to propose a method for building accurate descriptive and predictive models based on classification of past medical data. We also aim to compare this method with other well-established data mining methods and identify strengths and weaknesses.
Method: We propose T3, a decision tree classifier which builds predictive models based on known classes, allowing for a certain amount of misclassification error in training in order to achieve better descriptive and predictive accuracy. We then experiment with a real medical data set on stroke, and various subsets of it, in order to identify strengths and weaknesses. We also compare its performance with that of a very successful and well-established decision tree classifier.
Results: T3 demonstrated impressive performance when predicting unseen cases of stroke, with a classification error as low as 0.4%, whereas the state-of-the-art decision tree classifier produced a classification error of 33.6%.
Conclusions: This paper presents and evaluates T3, a classification algorithm that builds decision trees of depth at most three, and results in high accuracy whilst keeping the tree size reasonably small. T3 demonstrates strong descriptive and predictive power without compromising simplicity and clarity. We evaluate T3 based on real stroke register data and compare it with C4.5, a well-known classification algorithm, showing that T3 produces significantly more accurate and readable classifiers.
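As a rough illustration of the idea summarised above, the following Python sketch grows a tree whose depth is capped at three and stops splitting a node once its training error is at or below a tolerance. It is not the published T3 algorithm. The split criterion (fewest misclassified training cases), the names `build_tree`, `max_error` and `classification_error`, and the toy blood pressure and age records are all assumptions made for illustration.

```python
from collections import Counter


def majority(labels):
    """Return the most common label and how many labels disagree with it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, len(labels) - count


def build_tree(rows, labels, max_depth=3, max_error=0.05):
    """Grow a tree of depth <= max_depth (illustrative, not the published T3).

    A node becomes a leaf once the share of training cases it misclassifies
    is at or below max_error, a hypothetical tolerance parameter.
    """
    label, misses = majority(labels)
    if max_depth == 0 or misses / len(labels) <= max_error:
        return {"leaf": label}
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
            right = [i for i, row in enumerate(rows) if row[feature] > threshold]
            if not left or not right:
                continue
            # Training cases misclassified if both children were made leaves.
            missed = sum(majority([labels[i] for i in side])[1]
                         for side in (left, right))
            if best is None or missed < best[0]:
                best = (missed, feature, threshold, left, right)
    if best is None:  # no threshold separates the rows
        return {"leaf": label}
    _, feature, threshold, left, right = best
    children = [
        build_tree([rows[i] for i in side], [labels[i] for i in side],
                   max_depth - 1, max_error)
        for side in (left, right)
    ]
    return {"feature": feature, "threshold": threshold,
            "left": children[0], "right": children[1]}


def predict(tree, row):
    """Follow the splits from the root until a leaf is reached."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]


def depth(tree):
    """Number of splits on the longest root-to-leaf path."""
    if "leaf" in tree:
        return 0
    return 1 + max(depth(tree["left"]), depth(tree["right"]))


def classification_error(tree, rows, labels):
    """Fraction of cases whose predicted label differs from the true label."""
    wrong = sum(predict(tree, row) != label for row, label in zip(rows, labels))
    return wrong / len(labels)


# Invented toy records: (systolic blood pressure, age). Not the stroke register.
train_rows = [(120, 45), (130, 50), (125, 70), (160, 65),
              (170, 72), (165, 48), (150, 80), (118, 60)]
train_labels = ["control", "control", "control", "stroke",
                "stroke", "stroke", "stroke", "control"]
test_rows = [(122, 55), (168, 66), (140, 75), (128, 85)]
test_labels = ["control", "stroke", "stroke", "stroke"]

tree = build_tree(train_rows, train_labels, max_depth=3, max_error=0.05)
print("depth:", depth(tree))
print("training error:", classification_error(tree, train_rows, train_labels))
print("error on unseen cases:", classification_error(tree, test_rows, test_labels))
```

The error on unseen cases is measured on records held out from training, which is the sense in which classification error is reported above. Raising `max_error` yields smaller trees at the cost of more training mistakes.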