Methods Inf Med 2006; 45(02): 139-145
DOI: 10.1055/s-0038-1634057
Original Article
Schattauer GmbH

Reproducible Statistical Analysis in Microarray Profiling Studies

U. Mansmann (1), M. Ruschhaupt (1, 2), W. Huber (3)

1   IBE, University of Munich, Muenchen, Germany
2   Division of Molecular Genome Analysis, German Cancer Research Center, INF 580, Heidelberg, Germany
3   EBI, EMBL, Cambridge, UK

Publication History

Publication Date:
06 February 2018 (online)

Summary

Objectives: Microarrays are a recent biotechnology that offers the hope of improved cancer classification. A number of publications have presented clinically promising results by combining this new kind of biological data with specifically designed algorithmic approaches. However, reproducing published results in this domain is harder than it may seem.

Methods: This paper presents examples, discusses problems hidden in the published analyses, and demonstrates a strategy for improving the situation that is based on the vignette technology available from the R and Bioconductor projects.

Results: The compendium is discussed as a tool for achieving reproducible calculations and for offering an extensible computational framework. A compendium is a document that bundles primary data, processing methods (computational code), derived data, and statistical output with textual documentation and conclusions. It is interactive in the sense that it allows the user to modify processing options, plug in new data, or insert further algorithms and visualizations.

Conclusions: Due to the complexity of the algorithms, the size of the data sets, and the limitations of the printed page, it is usually not possible to report all the minutiae of the data processing and statistical computations. A compendium allows a complete critical assessment of a complex analysis.
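
The compendium idea described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation, which builds on R, Bioconductor, and vignettes; Python is used here only for illustration. The data values, the group names, the `log2` option, and the function `run_compendium` are all hypothetical. The sketch bundles primary data with a processing step, derived data, a statistical output, and a generated line of report text, and it recomputes everything when an option changes.

```python
import hashlib
import json
import math
import statistics

# Hypothetical primary data: expression values of one gene in two patient groups.
PRIMARY_DATA = {
    "good_prognosis": [2.1, 1.9, 2.4, 2.0],
    "poor_prognosis": [3.0, 3.4, 2.9, 3.3],
}


def run_compendium(data, options):
    """Recompute derived data, statistic, and report text from data and options."""
    # Processing method: an optional log2 transform, switched by the options.
    if options["log2"]:
        derived = {g: [math.log2(v) for v in vals] for g, vals in data.items()}
    else:
        derived = {g: list(vals) for g, vals in data.items()}

    # Statistical output: difference of the group means.
    diff = (statistics.mean(derived["poor_prognosis"])
            - statistics.mean(derived["good_prognosis"]))

    # Fingerprint of the exact inputs, so a reader can tell which data and
    # options produced a given report.
    payload = json.dumps({"data": data, "options": options}, sort_keys=True)
    fingerprint = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

    # Textual documentation generated from the computed values, not typed by hand.
    report = f"Mean difference (poor - good): {diff:.3f} [inputs {fingerprint}]"
    return {
        "derived": derived,
        "statistic": diff,
        "fingerprint": fingerprint,
        "report": report,
    }


if __name__ == "__main__":
    # Re-running with a modified option regenerates every downstream result.
    print(run_compendium(PRIMARY_DATA, {"log2": False})["report"])
    print(run_compendium(PRIMARY_DATA, {"log2": True})["report"])
```

Because the report line is derived from the data and options on every run, changing an option or plugging in new data cannot leave stale numbers in the text. This is the property that the vignette approach provides at the scale of a full analysis.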
