Penalized Binary Regression for Gene Expression Profiling

Michael G. Schimek

doi:10.1055/s-0038-1633894

Methods of Information in Medicine

Methods Inf Med 2004; 43(05): 439-444
Original Article

Schattauer GmbH

Medical University of Graz, Institute for Medical Informatics, Statistics and Documentation, Graz, Austria

Summary

Objectives: A typical bioinformatics task in microarray analysis is the classification of biological samples into two alternative categories. A procedure is needed which, based on the expression levels measured, allows us to compute the probability that a new sample belongs to a certain class.

Methods: For the purpose of classification, the statistical approach of binary regression is considered. High dimensionality combined with small sample sizes makes this a challenging task. Standard logit or probit regression fails because of ill-conditioning and poor predictive performance. The concepts of frequentist and Bayesian penalization for binary regression are introduced, and a Bayesian interpretation of the penalized log-likelihood is given. Finally, the role of cross-validation for regularization and feature selection is discussed.
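
To make the idea of frequentist penalization concrete, the following is a minimal sketch, not the procedure used in the article. It assumes a quadratic (ridge) penalty on the logit coefficients, fits the penalized log-likelihood by Newton-Raphson with step halving, and picks the penalty weight by k-fold cross-validated deviance. The function names, the penalty grid, and the synthetic "expression" data are all hypothetical. Under this assumption the penalized estimate coincides with the posterior mode for a zero-mean Gaussian prior with variance 1/lam on each coefficient, which is one way to read the penalty in Bayesian terms.

```python
import numpy as np

def sigmoid(eta):
    # Numerically stable logistic function.
    return np.exp(-np.logaddexp(0.0, -eta))

def pen_loglik(beta, Z, y, lam):
    # Bernoulli log-likelihood minus a ridge penalty (intercept unpenalized).
    eta = Z @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta)) - 0.5 * lam * np.sum(beta[1:] ** 2)

def fit_ridge_logit(X, y, lam, n_iter=100, tol=1e-8):
    # Newton-Raphson with step halving on the penalized log-likelihood.
    n, p = X.shape
    Z = np.hstack([np.ones((n, 1)), X])       # prepend intercept column
    pen = np.full(p + 1, float(lam))
    pen[0] = 0.0                              # assumption: intercept is not shrunk
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        mu = sigmoid(Z @ beta)
        w = mu * (1.0 - mu)
        grad = Z.T @ (y - mu) - pen * beta
        hess = (Z * w[:, None]).T @ Z + np.diag(pen)
        step = np.linalg.solve(hess, grad)
        t, old = 1.0, pen_loglik(beta, Z, y, lam)
        while pen_loglik(beta + t * step, Z, y, lam) < old and t > 1e-8:
            t *= 0.5                          # halve the step until the objective improves
        beta = beta + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return beta

def predict_proba(beta, X):
    # Probability that each row of X belongs to class 1.
    return sigmoid(beta[0] + X @ beta[1:])

def cv_deviance(X, y, lam, k=5, seed=0):
    # Mean out-of-fold binomial deviance for a given penalty weight.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    dev = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        beta = fit_ridge_logit(X[train_idx], y[train_idx], lam)
        pr = np.clip(predict_proba(beta, X[test_idx]), 1e-12, 1.0 - 1e-12)
        yt = y[test_idx]
        dev += -2.0 * np.sum(yt * np.log(pr) + (1.0 - yt) * np.log(1.0 - pr))
    return dev / len(y)

# Synthetic stand-in for microarray data: few samples, many "genes".
rng = np.random.default_rng(1)
n, p = 40, 200
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:5] = 1.5                           # only five informative features
y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)

lambdas = [0.1, 1.0, 10.0, 100.0]             # hypothetical penalty grid
scores = {lam: cv_deviance(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
beta_hat = fit_ridge_logit(X, y, best_lam)
print("selected penalty:", best_lam)
print("class-1 probabilities for three samples:", predict_proba(beta_hat, X[:3]))
```

With more features than samples the unpenalized fit is not unique and the classes are typically perfectly separable, which is why the penalty is what makes the Newton system solvable here. Choosing the penalty by out-of-fold deviance rather than training fit is one common option; misclassification rate would be another.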

Results: Penalization makes classical binary regression a suitable tool for microarray analysis. We illustrate penalized logit and Bayesian probit regression on a well-known data set and compare the results with each other and with published results from decision trees.

Conclusions: The frequentist and the Bayesian penalization concepts work equally well on the example data; however, some method-specific differences can be made out. Moreover, the Bayesian approach yields a quantification (posterior probabilities) of the bias due to the constraining assumptions.

Keywords

Bayes - bioinformatics - classification - cross-validation - logit regression - penalization - prediction - probit regression
