Methods Inf Med 2001; 40(05): 403-409
DOI: 10.1055/s-0038-1634200
Original Article
Schattauer GmbH

Prediction in Medicine by Integrating Regression Trees into Regression Analysis with Optimal Scaling

E. Dusseldorp, J. J. Meulman
Data Theory Group, Department of Education, Leiden University, The Netherlands
Publication Date: 08 February 2018 (online)

Summary

Objectives: A new data-analysis strategy is proposed to solve two problems: selecting interaction terms in linear regression, and statistically testing the significance of regression trees.

Methods: The proposed strategy combines two data mining techniques: regression trees and regression analysis with optimal scaling (CATREG). The method traces small regression trees using the bootstrap and integrates the results as interaction variables (called “trunk variables”) into CATREG.
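The idea of a trunk variable can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two-split tree, the `min_leaf` setting, the number of bootstrap samples, the node coding, and the synthetic data are all assumptions made for illustration. The sketch grows a small regression tree on each bootstrap sample, counts which pair of split variables recurs, and codes the terminal nodes of one such tree as a categorical variable. Entering that variable as a nominal predictor in CATREG is not shown.

```python
import random
from collections import Counter

def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y, rows, min_leaf=5):
    """Best single split on the given rows as (cost, variable, threshold), or None."""
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in rows if X[i][j] <= t]
            right = [y[i] for i in rows if X[i][j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    return best

def small_tree(X, y, rows):
    """Two-split tree: a root split, then the better split of one child.

    Returns (root_var, root_t, side, second_var, second_t), or None.
    """
    root = best_split(X, y, rows)
    if root is None:
        return None
    _, j, t = root
    children = {
        "left": [i for i in rows if X[i][j] <= t],
        "right": [i for i in rows if X[i][j] > t],
    }
    best = None
    for side, sub in children.items():
        split = best_split(X, y, sub)
        if split is None:
            continue
        gain = sse([y[i] for i in sub]) - split[0]
        if best is None or gain > best[0]:
            best = (gain, side, split[1], split[2])
    if best is None:
        return None
    return (j, t, best[1], best[2], best[3])

def trunk_variable(tree, X):
    """Code each observation by its terminal node (0, 1, or 2); assumed coding."""
    j, t, side, j2, t2 = tree
    codes = []
    for row in X:
        branch = "left" if row[j] <= t else "right"
        if branch != side:
            codes.append(0)
        else:
            codes.append(1 if row[j2] <= t2 else 2)
    return codes

def bootstrap_split_pairs(X, y, n_boot=20, seed=0):
    """Count how often each pair of split variables appears across bootstrap trees."""
    rng = random.Random(seed)
    n = len(y)
    counts = Counter()
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        tree = small_tree(X, y, rows)
        if tree is not None:
            counts[tuple(sorted((tree[0], tree[3])))] += 1
    return counts

# Synthetic data (not from the study): x0 and x1 interact, x2 is noise.
rng = random.Random(1)
X = [[rng.random() for _ in range(3)] for _ in range(100)]
y = [3.0 * (r[0] > 0.5 and r[1] > 0.5) + rng.gauss(0, 0.1) for r in X]

counts = bootstrap_split_pairs(X, y)
codes = trunk_variable(small_tree(X, y, list(range(len(y)))), X)
```

Here `counts` tallies the recurring split-variable pairs and `codes` holds one candidate trunk variable with three categories, one per terminal node.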

Results: In an application to data from cardiac patients, the CATREG model including the trunk variables shows a relative increase of 19% in variance accounted for (16% in cross-validated variance) compared to the model excluding these variables.
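A relative increase in variance accounted for is read here as the gain in R² divided by the R² of the model without trunk variables. The short Python sketch below shows only that arithmetic. The function name and the two R² values are hypothetical and are not taken from the study.

```python
def relative_increase(r2_without, r2_with):
    """Relative gain in variance accounted for, as a fraction of the baseline R^2."""
    if r2_without <= 0:
        raise ValueError("baseline R^2 must be positive")
    return (r2_with - r2_without) / r2_without

# Hypothetical R^2 values chosen only to illustrate the arithmetic.
gain = relative_increase(0.40, 0.476)
```

With these made-up inputs the relative gain is about 0.19, that is, 19%, even though the absolute gain in R² is only 0.076.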

Conclusions: This study indicates that trunk variables can be useful to model interaction effects in prediction problems.

 