Methods Inf Med 2005; 44(03): 438-443
DOI: 10.1055/s-0038-1633990
Original Article
Schattauer GmbH

Molecular Diagnosis

Classification, Model Selection and Performance Evaluation
F. Markowetz, R. Spang
Computational Diagnostics Group, MPI for Molecular Genetics, Berlin, Germany
Publication Date: 06 February 2018 (online)

Summary

Objectives: We discuss supervised classification techniques applied to medical diagnosis based on gene expression profiles. Our focus is on strategies for adaptive model selection that avoid overfitting in high-dimensional feature spaces.

Methods: We introduce likelihood-based methods, classification trees, support vector machines and regularized binary regression. For regularization by dimension reduction, we describe three feature selection strategies: feature filtering, feature shrinkage and wrapper approaches. In small-sample situations, efficient methods of data re-use are needed to assess the predictive power of a model. We discuss two issues in using cross-validation: the difference between in-loop and out-of-loop feature selection, and estimating model parameters in nested-loop cross-validation; an illustrative sketch follows.
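To make the in-loop versus out-of-loop distinction concrete, the following minimal sketch contrasts the two on synthetic data. It uses Python with scikit-learn purely for illustration and is not the authors' implementation; the sample size, the number of simulated genes and the choice of 20 selected features are arbitrary assumptions.

```python
# Minimal sketch (illustrative only): out-of-loop vs. in-loop feature selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Small-sample, high-dimensional data, as typical for expression profiles:
# 50 samples, 5000 simulated "genes", only a handful of them informative.
X, y = make_classification(n_samples=50, n_features=5000,
                           n_informative=10, random_state=0)

# Out-of-loop (biased): genes are selected once on the FULL data set,
# so the test folds have already influenced which features are used.
X_reduced = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(LinearSVC(), X_reduced, y, cv=5)

# In-loop (unbiased): selection is refit inside every training fold
# by making it part of the cross-validated pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LinearSVC())
unbiased = cross_val_score(pipe, X, y, cv=5)

print("out-of-loop accuracy:", biased.mean())    # optimistically high
print("in-loop accuracy:    ", unbiased.mean())  # honest estimate
```

Because the out-of-loop variant lets the test samples influence which genes are selected, its cross-validated accuracy is optimistically biased; the in-loop variant repeats the selection inside every training fold and yields an honest estimate.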

Results: Gene selection does not reduce the dimensionality of the model. Tuning parameters enable adaptive model selection. Feature selection bias is a common pitfall in performance evaluation. Model selection and performance evaluation can be combined by nested-loop cross-validation.
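A nested-loop cross-validation combining model selection and performance evaluation can be sketched as follows. Again, this is an illustrative Python/scikit-learn stand-in on synthetic data, not the authors' procedure; the grids for the number of selected genes and the SVM cost parameter C are arbitrary assumptions.

```python
# Minimal sketch (illustrative only): nested-loop cross-validation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for a small expression data set.
X, y = make_classification(n_samples=60, n_features=2000,
                           n_informative=10, random_state=0)

# Pipeline: gene selection followed by a linear SVM; both the number of
# selected genes k and the SVM cost parameter C are tuning parameters.
pipe = Pipeline([("select", SelectKBest(f_classif)),
                 ("svm", SVC(kernel="linear"))])
param_grid = {"select__k": [10, 50, 200],
              "svm__C": [0.01, 0.1, 1, 10]}

# Inner loop: adaptive model selection by grid search over the tuning
# parameters, assessed by 3-fold cross-validation within each training fold.
inner = GridSearchCV(pipe, param_grid, cv=3)

# Outer loop: performance evaluation of the whole procedure
# (feature selection and parameter tuning included) by 5-fold CV.
scores = cross_val_score(inner, X, y, cv=5)
print("nested-CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```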

Conclusions: Classification of microarrays is prone to overfitting. A rigorous and unbiased assessment of the predictive power of the model is a must.

 