Methods Inf Med 2001; 40(01): 32-38
DOI: 10.1055/s-0038-1634461
Original Article
Schattauer GmbH

Effects of Case Removal in Prognostic Models

L. Ohno-Machado (1), S. Vinterbo (2)
1   Brigham and Women’s Hospital and Health Sciences and Technology Division, Harvard Medical School and Massachusetts Institute of Technology, Boston, USA
2   Knowledge Systems Group, Dept. Computer and Information Sciences, Norwegian University of Science and Technology, Trondheim, Norway

Publication History

Publication Date:
08 February 2018 (online)

Abstract

Constructing and updating prognostic models that learn from training cases is a time-consuming task. The more compact, and yet informative, the training sets are, the faster one can build and properly evaluate such models. We have compared different regression diagnostic methods for selection and removal of training cases in prognostic models. Univariate determinations were performed using classical regression diagnostic statistics. Multivariate determinations were performed using (1) a sequential “backward” selection of cases, and (2) a non-sequential genetic algorithm. The genetic algorithm produced final models that kept few cases and retained predictive capability. A genetic algorithm approach to case selection may be better suited for guiding removal of cases in training sets than a univariate or a sequential multivariate approach, possibly because of its ability to detect sets of cases that are influential en bloc but may not be sufficiently influential when considered in isolation.
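
To make the non-sequential idea concrete, the following is a minimal sketch of a genetic algorithm that searches over subsets of training cases. It is not the authors' implementation: the one-variable logistic model, the fitness function (holdout accuracy minus a small penalty per retained case), the `penalty` value, the population settings, and the synthetic data are all assumptions chosen only for illustration. Each candidate is a 0/1 mask over the training cases, so whole groups of cases can be dropped together rather than one at a time.

```python
import math
import random

def fit_logistic(xs, ys, steps=100, lr=0.5):
    """Fit p(y=1|x) = sigmoid(w*x + b) by plain gradient descent (illustrative)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            z = max(-30.0, min(30.0, w * x + b))  # clamp to avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def fitness(mask, train, holdout, penalty=0.002):
    """Assumed fitness: holdout accuracy minus a penalty per retained case."""
    kept = [case for case, keep in zip(train, mask) if keep]
    if len({y for _, y in kept}) < 2:
        return -1.0  # a usable model needs cases from both classes
    w, b = fit_logistic([x for x, _ in kept], [y for _, y in kept])
    correct = sum((w * x + b > 0) == (y == 1) for x, y in holdout)
    return correct / len(holdout) - penalty * len(kept)

def select_cases(train, holdout, pop_size=20, generations=15, seed=0):
    """Evolve 0/1 masks over training cases; the top half survives each round."""
    rng = random.Random(seed)
    n = len(train)
    # Seed the population with the "keep everything" mask plus random masks.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]

    def score(mask):
        return fitness(mask, train, holdout)

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)      # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n)           # single bit-flip mutation
            child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=score)

def make_data(n, rng):
    """Synthetic one-variable cases (x, y); purely hypothetical data."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * y, 1.0), y))
    return data

rng = random.Random(1)
train, holdout = make_data(40, rng), make_data(40, rng)
mask = select_cases(train, holdout)
print(sum(mask), "of", len(train), "cases kept")
```

Because the best masks are carried over unchanged between generations and the initial population contains the full training set, the returned mask never scores worse than keeping every case under this assumed fitness. A sequential backward procedure would instead remove one case per step, which is where en bloc effects can be missed.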

 