Key words breast cancer - SNPs - triple-negative - subtype prediction - prediction model - variable selection
Introduction
Knowledge about targeted therapies for breast cancer has improved immensely over the last two decades. These therapies have mainly been developed for – although they are not restricted to – intrinsic molecular subtypes: triple-negative breast cancer (TNBC), hormone receptor-positive breast cancer, and HER2-positive breast cancer.
As TNBC lacks hormone receptors and HER2 receptors, treatment for triple-negative breast cancer is primarily restricted to conventional chemotherapy. At the molecular level, however, TNBC is a heterogeneous disease that has different histological and molecular features. Recently, studies of TNBC have been extending the inclusion criteria and now include additional molecular markers in the selection criteria, opening up scope for targeted therapies in this subtype of breast cancer.
One example is a study of the targeted antibody–drug conjugate glembatumumab vedotin [1]. The study requires not only that the tumor be triple-negative, but also that it show overexpression of glycoprotein nonmetastatic melanoma protein B (gpNMB). Other examples are poly-(ADP-ribose) polymerase (PARP) inhibitor studies in patients with BRCA1/2 mutations, in which the tumor initially had to be triple-negative for the patient to be tested for a BRCA1/2 mutation [2], [3]. The requirement for triple negativity was later changed to HER2 negativity.
The screening phases for studies of this type are often extended, since the process of determining the molecular subtype and carrying out additional biomarker assessment is time-consuming. This can often be a challenge to the patience of both patients and physicians. Parameters capable of predicting the molecular subtype before it becomes available via pathology might be helpful for treatment planning and for optimizing the timing and cost of screening phases for clinical trials. Biomarker assessment could be carried out at an early stage in the work-up for patients with an increased likelihood of the specific molecular subtype.
Clinical and epidemiological risk factors, such as reproductive factors and body mass index (BMI), are associated with the molecular subtype of the tumor. They appear to have an effect on the risk of developing hormone receptor-positive tumors [4 ], [5 ], [6 ]. In a case–case analysis, our group previously reported that age and BMI are the most important parameters associated with molecular subtypes [7 ]. High mammographic density was also associated with hormone receptor-negative tumors [8 ], [9 ].
Since rapid and low-cost genotyping is becoming increasingly widely available [10 ], single nucleotide polymorphisms (SNPs) might be useful as predictors for molecular subtypes. Genetic factors have been shown to increase the risk for specific breast cancer subtypes. For example, it is known that patients with BRCA1 mutations are mainly diagnosed with triple-negative breast cancer, and mutation rates in this population are over 10% [11 ]. In addition, approximately 100 validated SNPs for breast cancer risk are known [12 ], [13 ], [14 ]. Some of these SNPs have been specifically linked to a risk for hormone receptor-positive, hormone receptor-negative, or triple-negative breast cancer [15 ], [16 ], [17 ], [18 ], [19 ], [20 ], [21 ].
It was hypothesized that a combination of multiple breast cancer risk SNPs in addition to clinical predictors of molecular subtypes may improve the prediction of molecular subtypes. Specifically, predicting TNBC – a breast cancer subtype in which the patients affected have many unmet medical needs – would be helpful. The aim of this study was therefore to identify breast cancer risk SNPs capable of predicting TNBC in addition to clinical predictors in women with invasive breast cancer. The prediction performance of various methods of selecting SNPs was compared.
Methods
Patients
The patients selected for this retrospectively designed cross-sectional observational study are included in the Bavarian Breast Cancer Cases and Cohorts (BBCC) study. The BBCC has been ongoing since 2002 and includes consecutively recruited patients with invasive breast cancer at the University Breast Center for Franconia. The study was designed to identify and validate genetic and nongenetic risk factors, and it has been involved in several validation studies for SNPs [13 ], [14 ], [19 ], [20 ], [21 ], [22 ], [23 ], [24 ], [25 ], [26 ], [27 ], [28 ], [29 ], [30 ], [31 ], [32 ], [33 ]. For the present study, all women who were recruited into the BBCC from 2002 to 2010 were selected. Among them, patients were excluded for the following reasons: no participation in any genetic BBCC research projects; insufficient remaining DNA available due to participation in previous research projects; and no data on hormone receptor status or HER2 status available from the central pathology department at the breast cancer center. After SNPs had been selected for analysis (see below), patients with incomplete genetic information were also excluded. All of the patients provided written informed consent, and the Ethics Committee of the Medical Faculty at Friedrich Alexander University of Erlangen–Nuremberg approved the study.
Data collection
All treatment-related patient data and tumor characteristics were documented as part of the certification processes required by the German Cancer Society (Deutsche Krebsgesellschaft) and by the German Society for Breast Diseases (Deutsche Gesellschaft für Senologie) [34]. The data are recorded prospectively in a database and audited annually as part of the breast cancer center certification process. Epidemiological data and risk factors for breast cancer were obtained using a structured questionnaire, which was completed by the patients, reviewed together with trained study personnel, and supplemented if necessary.
SNP selection
A total of 102 SNPs were selected for genotyping. Of these, 98 are validated breast cancer risk SNPs. Most of these breast cancer risk SNPs have been confirmed in large international validation studies, mainly by the Breast Cancer Association Consortium (BCAC). The BCAC initially published a validation of a few SNPs [35] and then, after increasing the sample size and analyzing more SNPs, published a series of papers as a result of the Collaborative Oncological Gene–environment Study (COGS; www.cogseu.org and www.nature.com/icogs/) [13], [14]. Four SNPs that were shown to have an influence on the prognosis in breast cancer patients were also selected [36], [37], [38], [39]. A complete list of the SNPs, including references, is provided in Supplementary Table S1 [13], [14], [18], [19], [20], [25], [29], [30], [32], [33], [35], [37], [38], [39], [67], [79], [80], [81], [82], [83], [84], [85], [86], [87], [88], [89].
DNA extraction, genotyping, and quality control
Whole-blood samples were collected in citrate-phosphate-dextrose-adenine (CPDA) tubes (Sarstedt AG, Nümbrecht, Germany) from patients who had consented to participate in the biomarker substudy. Germline DNA was extracted using the automated magnetic bead-based chemagic MSM I technique (PerkinElmer chemagen, Baesweiler, Germany) in accordance with the manufacturer's instructions. Genotyping was done at the Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, using MassARRAY iPLEX Gold (Sequenom, San Diego, California, USA). SNPs were excluded if MALDI spectra were unreliable, based on raw genotype data. Exact tests for Hardy–Weinberg equilibrium (HWE) were performed, and SNPs with an unexpectedly small p value, assessed using a quantile–quantile plot, were also excluded.
Pathology assessment
All of the histopathological information used in the analysis was directly documented from the original pathology reports, which were reviewed by two investigators. Estrogen receptor status, progesterone receptor status, and HER2 status were assessed as follows. A monoclonal mouse antibody against estrogen receptor-alpha (clone 1D5; 1:200 dilution, DAKO, Denmark) and a monoclonal mouse antibody against the progesterone receptor (clone pgR636; 1:200 dilution, DAKO, Denmark) were used to stain the pretreatment core biopsies. The percentage of positively stained cells was included in the pathology reports. The tumors were considered to be positive for the estrogen and progesterone receptors if 10% or more of the cells showed positive staining, in accordance with recommendations applying at the time when the study was conducted [40], [41], [42], [43]. A polyclonal antibody against HER2 (1:200 dilution, DAKO, Denmark) was used, and HER2 status was stated in the pathology reports as negative, 0, 1+, 2+, or 3+ in accordance with the guidelines published by Sauter et al. [44]. Tumors with a score of 0 or 1+ were regarded as HER2-negative, and those with a score of 3+ were regarded as HER2-positive. Tumors with 2+ staining were tested for gene copy numbers of HER2 by chromogenic in situ hybridization. Using a kit with two probes of different colors (ZytoDot, 2C SPEC HER2/CEN17, ZytoVision Ltd., Bremerhaven, Germany), the gene copy numbers of HER2 and of the centromeres of the corresponding chromosome 17 were determined. A HER2/CEN17 ratio of ≥ 2.2 was considered as amplification of HER2. Scoring was carried out in a standardized way by a group of dedicated pathologists in routine surgical pathology. A tumor was regarded as being triple-negative if the estrogen receptor (ER) status was negative, progesterone receptor (PR) status was negative, and HER2 status was negative.
In the present study, “triple-negative” refers to one subgroup of molecular subtypes of breast cancer, although comprehensive gene expression profiling was not performed.
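The receptor rules described above can be written down as a small decision function. The following Python sketch is only an illustration of those thresholds; the function and argument names are hypothetical, and treating a report value of "negative" like a score of 0 or 1+ is an assumption.

```python
from typing import Optional


def receptor_positive(percent_stained_cells: float) -> bool:
    """ER/PR rule from the text: positive if 10% or more of the cells stain."""
    return percent_stained_cells >= 10.0


def her2_positive(ihc_score: str, her2_cen17_ratio: Optional[float] = None) -> bool:
    """HER2 rule from the text: 0/1+ negative, 3+ positive, 2+ decided by in situ hybridization."""
    if ihc_score in ("negative", "0", "1+"):  # "negative" handled like 0/1+ (assumption)
        return False
    if ihc_score == "3+":
        return True
    if ihc_score == "2+":
        if her2_cen17_ratio is None:
            raise ValueError("2+ staining needs a HER2/CEN17 ratio")
        return her2_cen17_ratio >= 2.2  # amplification threshold stated in the text
    raise ValueError(f"unknown HER2 score: {ihc_score!r}")


def is_triple_negative(er_percent: float, pr_percent: float,
                       ihc_score: str, her2_cen17_ratio: Optional[float] = None) -> bool:
    """Triple-negative: ER negative, PR negative, and HER2 negative."""
    return (not receptor_positive(er_percent)
            and not receptor_positive(pr_percent)
            and not her2_positive(ihc_score, her2_cen17_ratio))
```

For example, `is_triple_negative(5, 0, "2+", 1.4)` is true under these rules, whereas the same tumor with a HER2/CEN17 ratio of 2.5 would be HER2-positive and therefore not triple-negative.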
Statistical methods
To investigate the predictive value of each single SNP relative to the occurrence of a TNBC in addition to clinical predictors, a multiple logistic regression model was fitted for each SNP with TNBC status (yes versus no) as the outcome, and with the specific SNP (ordinal; 0, 1, or 2 minor alleles) and the clinical predictors age at diagnosis (continuous) and BMI (continuous) as predictors [7]. Patients with missing genetic data or missing outcome data were excluded. Missing clinical predictors were imputed, as done in [45]. Continuous predictors were used as natural cubic spline functions to describe nonlinear effects [46]. The degrees of freedom (between 1 and 3) of each predictor were calculated as done in [45]. The odds ratio (OR) per minor allele with confidence interval was calculated using the logistic regression model. For each SNP, a likelihood ratio test comparing the clinical-genetic logistic regression model with a clinical logistic regression model containing only the clinical predictors was performed. The p values (one per SNP) were corrected for multiple testing using the Bonferroni–Holm method.
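The Bonferroni–Holm correction mentioned above can be illustrated with a short Python sketch. It is a generic implementation of the step-down procedure, not the code used in the study, and the function name `holm_adjust` is hypothetical.

```python
def holm_adjust(p_values):
    """Return Bonferroni-Holm adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p values must not decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

With one p value per SNP as input, most adjusted values are capped at 1.00, which is the pattern seen when only a few SNPs show a clear association.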
The primary study aim was to identify a set of SNPs that together would improve the prediction of TNBCs in addition to clinical predictors (age, BMI). Identifying relevant SNPs among the relatively large number of candidate SNPs was a challenging process, which can be summed up as follows. The complete dataset was randomly divided into two parts: one training set with about two-thirds of the patients, and one validation set with about one-third of the patients. Various SNP selection methods and regression techniques, respectively, were applied to the training data to obtain regression models with selected SNPs and clinical predictors. The models were compared among themselves with regard to their prediction errors on validation data.
All but one of the regression techniques considered comprise a bundle of candidate models characterized by a tuning parameter λ. The optimal λ has to be determined before a specific prediction model representing the regression technique can be fitted to predict TNBC. After the degrees of freedom of the continuous clinical predictors had been determined again by using training data, the following regression techniques were applied to the training data.
Univariate selection. For each SNP, a logistic regression model with the clinical predictors and the specific SNP was compared with a logistic regression model with clinical predictors alone, using a likelihood ratio test. The SNPs were ordered according to increasing p values for these likelihood ratio tests. The λ top-ranked SNPs were selected and included in a logistic regression model that also contained the clinical predictors. Here λ, ranging from 0 to 30, is a tuning parameter representing the number of selected SNPs [47]. When a specific model was applied to the validation data afterwards, generalized shrinkage after coefficient estimation toward the clinical regression model was used to improve predictions [48]. The shrinkage factor was obtained from the maximal genetic model with 30 SNPs.
Stepwise selection as described in [49 ]. All of the SNPs were ordered as above. The top 30 ranked SNPs were preselected, in order to keep the number of SNPs to be analyzed easy to handle. One hundred bootstrap samples of the same size as the original dataset were drawn with replacement. On each bootstrap sample, a logistic regression model with the clinical predictors and the preselected SNPs was set up. A backward stepwise variable selection procedure that kept all the clinical predictors was carried out to obtain the best model in accordance with the Akaike information criterion. The retained variables from each bootstrap sample were recorded, and a final variable selection was made. The most frequently selected SNPs (> 70%) and – to address correlation among SNPs – representatives of highly frequent SNP pairs (> 90%) were chosen. Again, generalized shrinkage was incorporated when the final model was applied.
The least absolute shrinkage and selection operator (lasso) is a regression technique in which the regression coefficients are shrunk towards zero during estimation [50]. The amount of shrinkage is controlled by a tuning parameter λ. Depending on the value of λ, a number of coefficients reach exactly zero, which means that lasso also leads to variable selection. In the present study, a regression model was set up with the clinical predictors and all SNPs. The coefficients of the SNPs, but not the coefficients of the clinical predictors, were shrunk by variation of λ. A regression model with maximal shrinkage that has all coefficients of the SNPs equal to 0 corresponds to the clinical logistic regression model. In contrast to the usual regression models, lasso can deal with large numbers of predictors.
Component-wise gradient boosting fits a regression model iteratively [51 ], [52 ]. It starts with an empty model without any predictors. In each iteration, the best-performing predictor is added to the model with a small step size, or its coefficient is updated if it was included before. More relevant predictors are included earlier than less relevant ones. The number of boosting iterations, λ, is a tuning parameter that controls both the variable selection properties of the algorithm and the implied shrinkage of the coefficients. The incorporation of clinical predictors is less straightforward than for lasso. A logistic regression model with clinical predictors is fitted. This fit is taken as the offset for the boosting procedure described above with SNPs as predictors [53 ].
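To make the boosting idea concrete, the following Python sketch shows a generic component-wise gradient boosting loop for a binary outcome, with the clinical model entering as a fixed offset. It is a simplified illustration under stated assumptions (least-squares base learners without intercept, logistic loss, fixed step size of 0.1) and does not reproduce the implementation cited above; all names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def componentwise_boost(X, y, offset, n_iter, step=0.1):
    """Boost SNP coefficients on top of a fixed clinical offset.

    X: rows of minor-allele counts (one column per SNP), y: 0/1 outcome,
    offset: clinical linear predictor per patient, n_iter: tuning parameter.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    f = list(offset)  # current linear predictor, starts at the clinical fit
    for _ in range(n_iter):
        # Negative gradient of the logistic loss: observed minus predicted.
        resid = [y[i] - sigmoid(f[i]) for i in range(n)]
        best_j, best_beta, best_sse = None, 0.0, float("inf")
        for j in range(p):
            sxx = sum(X[i][j] ** 2 for i in range(n))
            if sxx == 0.0:
                continue
            beta = sum(X[i][j] * resid[i] for i in range(n)) / sxx
            sse = sum((resid[i] - beta * X[i][j]) ** 2 for i in range(n))
            if sse < best_sse:
                best_j, best_beta, best_sse = j, beta, sse
        if best_j is None:
            break
        # Only the best-fitting SNP is updated, and only by a small step.
        coef[best_j] += step * best_beta
        for i in range(n):
            f[i] += step * best_beta * X[i][best_j]
    return coef
```

SNPs that are never chosen keep a coefficient of exactly zero, which is how stopping after a limited number of iterations leads to variable selection as well as shrinkage.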
The optimal λ for each method was found by 10-fold cross-validation on the training dataset. For a given value of λ, the prediction model was estimated on nine folds and then applied on the tenth fold. The mean squared error (MSE) was taken as the evaluation measure. The MSE is a summary measure of the differences between the observed TNBC status (either 0 for “no” or 1 for “yes”) of patients in the tenth fold, which was not used for model building, and the expected probability obtained from the model (between 0 and 1) for these patients having a TNBC. This procedure was done 10 times, leaving one fold out at a time, and the average MSE was calculated. The λ value with the smallest average MSE was regarded as the optimal λ. The whole training set was finally used to fit a regression model with the optimal λ.
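The tuning procedure can be sketched in a few lines of Python. The sketch assumes a hypothetical callback `fit_predict(train_idx, test_idx, lam)` that fits one of the regression techniques on the training folds and returns predicted probabilities for the held-out fold; fold assignment is round-robin here for simplicity, whereas a random assignment would be used in practice.

```python
def mean_squared_error(y_true, p_pred):
    """Average squared difference between 0/1 outcome and predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def k_fold_indices(n, k):
    """Assign patient indices 0..n-1 to k folds in round-robin order."""
    return [list(range(fold, n, k)) for fold in range(k)]


def select_lambda(y, lambdas, fit_predict, k=10):
    """Return the tuning value with the smallest average held-out MSE."""
    folds = k_fold_indices(len(y), k)
    best_lam, best_mse = None, float("inf")
    for lam in lambdas:
        fold_mses = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            preds = fit_predict(train_idx, test_idx, lam)
            fold_mses.append(mean_squared_error([y[i] for i in test_idx], preds))
        avg = sum(fold_mses) / len(fold_mses)
        if avg < best_mse:
            best_lam, best_mse = lam, avg
    return best_lam, best_mse
```

A model that predicts the same probability p for everyone has an MSE of roughly p(1 − p) when p equals the outcome frequency, which gives a feel for the scale of this measure.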
The procedures described above resulted in four clinical-genetic regression models for predicting TNBC. In addition, two benchmark models – a logistic null model without any predictors and a clinical logistic regression model with clinical predictors but without any SNPs – were fitted on the training data. A useful clinical model should perform better than the null model, whereas a useful prediction model with clinical and genetic predictors should perform better than the clinical model without further predictors. These six models were evaluated on the validation dataset to measure their performance in new patients. Again, the MSE was taken as a performance criterion.
To obtain further insight into the accuracy of the prediction, the performance improvement of the four clinical-genetic models in comparison with the clinical model was assessed on validation data using the continuous net reclassification improvement (NRI). Roughly speaking, the continuous NRI is the proportion of patients with TNBC or without TNBC, respectively, who are correctly given a higher or lower predicted probability of TNBC by the clinical-genetic regression model than by the clinical model, corrected by wrongly assigned lower or higher probabilities [54 ].
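As an illustration of this definition, the continuous NRI can be computed from the predicted probabilities of two models as follows. This is a minimal Python sketch with hypothetical names; predictions that are identical in both models count as neither up nor down.

```python
def continuous_nri(y_true, p_old, p_new):
    """Return (NRI, share moved up among events, share moved down among non-events)."""
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for y, old, new in zip(y_true, p_old, p_new):
        if y == 1:  # patient with TNBC: a higher probability is correct
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:       # patient without TNBC: a lower probability is correct
            n_n += 1
            up_n += new > old
            down_n += new < old
    nri = (up_e - down_e) / n_e + (down_n - up_n) / n_n
    return nri, up_e / n_e, down_n / n_n
```

If no probabilities are tied, the NRI reduces to (2 × share up − 1) + (2 × share down − 1). With the boosting values in Table 3 (55.4% and 53.3%), this gives about 17.4%, close to the reported 17.3%.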
In clinical practice, a prediction model for TNBC might support treatment decision-making based on a threshold for the predicted probability of TNBC that classifies a patient as a “high-risk” patient or “low-risk” patient. The ability to distinguish between patients with and without TNBC was measured on validation data using the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). The AUC is an estimate of the probability that, given two patients, one with TNBC and the other without TNBC, the prediction model will assign the higher predicted probability of TNBC to the patient with TNBC.
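The pairwise interpretation of the AUC can be made explicit with a small Python sketch. It compares every patient with TNBC against every patient without TNBC and counts ties as one half; the function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc(y_true, p_pred):
    """Share of (TNBC, non-TNBC) pairs in which the TNBC patient gets the higher probability."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(pos) * len(neg))
```

A model that assigns the same probability to every patient produces only ties and therefore an AUC of 0.5.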
To overcome the drawbacks of only splitting the data into training and validation sets once, the dataset was divided several times into training and validation sets and the procedure was repeated as described above each time [47 ], [55 ]. More precisely, 3-fold cross-validation with 100 repetitions was done. For each regression technique for predicting TNBC, the average value of the 300 MSEs of the corresponding regression models was taken as a final evaluation criterion, and the average AUC and average NRI were used as further criteria. The regression technique with the smallest average MSE is regarded as the best method (the “winner” method) for predicting TNBC.
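The repeated splitting can be sketched as a generator of training and validation index sets. This Python example is an illustration only; the function name and the seed are hypothetical, and the actual random splits used in the study are not known.

```python
import random


def repeated_kfold_splits(n, k=3, repetitions=100, seed=0):
    """Yield (training, validation) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repetitions):
        idx = list(range(n))
        rng.shuffle(idx)  # new random partition in every repetition
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            validation = folds[f]
            training = [i for g in range(k) if g != f for i in folds[g]]
            yield training, validation
```

With k = 3 and 100 repetitions, this yields 300 splits, each using about two-thirds of the patients for training and about one-third for validation.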
The best prediction method was applied to the whole dataset to obtain the final prediction model for TNBC. This was done by repeating all of the model-building steps, this time not on the training data, but on the complete dataset. That is, cubic spline functions and the tuning parameter λ were determined as described above and a corresponding regression model was fitted on the complete dataset. A TNBC prediction score on a scale from 0 to 100, representing the probability of a TNBC, was derived from the final prediction model by taking the inverse logit of the linear combination of predictor values and regression coefficients. The performance of the final model on the complete dataset in terms of discrimination and calibration was measured using the AUC and the Hosmer–Lemeshow statistic (scatterplot and χ² test) comparing predicted and observed TNBC events, as done recently in [56]. A large p value indicates satisfactory calibration.
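The calibration check can be illustrated with a sketch of the Hosmer–Lemeshow statistic. The Python code below groups patients by predicted probability and compares observed with predicted events per group; it is a generic textbook version with hypothetical names. It returns only the test statistic, since the p value additionally requires a χ² distribution (conventionally with the number of groups minus 2 degrees of freedom), which is not part of the Python standard library.

```python
def hosmer_lemeshow_statistic(y_true, p_pred, groups=10):
    """Sum of (observed - expected)^2 / (expected * (1 - expected / size)) over risk groups."""
    order = sorted(range(len(y_true)), key=lambda i: p_pred[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        if not members:
            continue
        observed = sum(y_true[i] for i in members)  # observed events in the group
        expected = sum(p_pred[i] for i in members)  # sum of predicted probabilities
        denom = expected * (1.0 - expected / len(members))
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat
```

A small statistic, and hence a large p value, means that observed and predicted event counts agree well across the risk groups.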
All of the tests were two-sided, and a p value < 0.05 was considered statistically significant. Calculations were carried out using the R system for statistical computing (version 3.0.1; R Core Team, Vienna, Austria, 2013).
Results
Patients and SNPs
A total of 2234 patients were recruited into the BBCC during the specified period. A subset of 1868 patients took part in genetic BBCC research projects. Of these, sufficient DNA was available from 1743 patients. A further 472 patients with incomplete hormone receptor and HER2 status information were excluded, resulting in 1271 remaining patients. Twenty-seven out of 102 SNPs were excluded after genotype quality control: 24 SNPs because of unreliable MALDI spectra and three SNPs because of departure from HWE (Supplementary Table S1 ). Due to missing values, the following SNPs were excluded: rs1550623 (17.0% missing values out of 1271), rs3903072 (9.9%), rs2380205 (9.6%), rs17817449 (7.4%), rs2236007 (7.2%), rs3803662 (5.0%), rs9790517 (5.0%), and rs2046210 (5.0%). All patients had age information, and missing BMI values (4.2%) were imputed. The final sample size was 1027 patients, after 244 patients with incomplete genetic information had been excluded. Patient characteristics are shown in [Table 1 ].
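Table 1 Patient characteristics.

| Characteristic | Unit or category | Mean or count | SD or % |
|---|---|---|---|
| Age at diagnosis | Years | 57.2 | 11.7 |
| BMI | kg/m² | 26.2 | 4.8 |
| ER | Negative | 231 | 22.5 |
| ER | Positive | 796 | 77.5 |
| PR | Negative | 288 | 28.0 |
| PR | Positive | 739 | 72.0 |
| HER2 | Negative | 877 | 85.4 |
| HER2 | Positive | 150 | 14.6 |
| Triple-negative | No | 893 | 87.0 |
| Triple-negative | Yes | 134 | 13.0 |

BMI, body mass index; ER, estrogen receptor; PR, progesterone receptor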
Univariate SNP and TNBC association
The clinical predictors age at diagnosis and BMI, used as adjustment variables, fitted best as cubic spline functions with 2 and 1 degrees of freedom, respectively – i.e., age was used nonlinearly and BMI was used linearly. The 20 SNPs with the smallest p values in the univariate analyses are shown in [Table 2]. rs10069690 (TERT, CLPTM1L) was the only significant SNP (corrected p = 0.02) after correction of p values for multiple testing. The corrected p values for rs2981579 (FGFR2), rs7726159 (TERT), rs2588809 (RAD51B), and rs78540526 (CCND1) were 0.18, 0.36, 0.81, and 0.93, respectively; the other corrected p values were 1.00.
Table 2 Univariate associations with triple-negative breast cancer (TNBC) for the 20 SNPs with the lowest p values.

| SNP | Chromosome | Nearest genes | MAF | OR (95% CI)¹ | p value² |
|---|---|---|---|---|---|
| rs10069690 | 5 | TERT, CLPTM1L | 0.249 | 1.66 (1.27, 2.18) | < 0.001 |
| rs2981579 | 10 | FGFR2 | 0.484 | 0.66 (0.51, 0.87) | < 0.01 |
| rs7726159 | 5 | TERT | 0.358 | 1.46 (1.12, 1.91) | < 0.01 |
| rs2588809 | 14 | RAD51B | 0.174 | 0.62 (0.42, 0.92) | 0.02 |
| rs78540526 | 11 | CCND1 | 0.104 | 0.55 (0.32, 0.92) | 0.02 |
| rs11820646 | 11 | – | 0.389 | 1.38 (1.06, 1.80) | 0.02 |
| rs2981582 | 10 | FGFR2 | 0.451 | 0.73 (0.55, 0.95) | 0.02 |
| rs3760982 | 19 | KCNN4, ZNF283 | 0.485 | 0.77 (0.59, 1.01) | 0.06 |
| rs2363956 | 19 | MERIT40 | 0.488 | 0.78 (0.60, 1.01) | 0.06 |
| rs1436904 | 18 | CHST9 | 0.383 | 1.27 (0.98, 1.65) | 0.07 |
| rs6001930 | 22 | MKL1 | 0.127 | 0.68 (0.43, 1.06) | 0.09 |
| rs12422552 | 12 | ATF7IP, GRIN2B | 0.295 | 0.77 (0.57, 1.04) | 0.09 |
| rs8170 | 19 | MERIT40 | 0.191 | 1.31 (0.97, 1.78) | 0.08 |
| rs941764 | 14 | CCDC88C | 0.354 | 0.79 (0.59, 1.04) | 0.09 |
| rs11075995 | 16 | FTO | 0.264 | 1.29 (0.96, 1.72) | 0.09 |
| rs12710696 | 2 | – | 0.357 | 1.26 (0.96, 1.66) | 0.09 |
| rs11365234 | 7 | AKAP9 | 0.392 | 1.24 (0.96, 1.61) | 0.10 |
| rs2823093 | 21 | NRIP1 | 0.275 | 1.26 (0.96, 1.67) | 0.10 |
| rs4666275 | 2 | ALK | 0.060 | 1.50 (0.92, 2.46) | 0.11 |
| rs75915166 | 11 | CCND1 | 0.082 | 0.67 (0.40, 1.14) | 0.14 |

SNP, single nucleotide polymorphism; MAF, minor allele frequency
¹ Odds ratio (OR) per minor allele, adjusted for age and body mass index, with 95% confidence interval (CI) and corresponding p value, obtained from the multiple logistic regression model.
² Uncorrected p values. The corrected p values for the top five SNPs were 0.02, 0.18, 0.36, 0.81, and 0.93. All other corrected p values were 1.00.
Clinical-genetic TNBC prediction
Boosting turned out to be the most accurate prediction method, and had a slightly smaller cross-validated prediction error MSE than the lasso ([Table 3 ]). Lasso and boosting performed better than the clinical prediction model without genetic predictors, whereas univariate selection performed similarly and stepwise selection performed less well. These results were confirmed by AUC statistics: Boosting was also superior with regard to distinguishing between TNBC patients and non-TNBC patients. Lasso and univariate selection performed better than the clinical model, and stepwise selection less well.
Table 3 Prediction of triple-negative tumor¹.

| Model | MSE | Reclassification (%): NRI | Reclassification (%): correctly up | Reclassification (%): correctly down | AUC | Selected SNPs |
|---|---|---|---|---|---|---|
| Null² | 0.1137 (0.0109) | – | – | – | 0.500 (0.000) | – |
| Clinical³ | 0.1098 (0.0104) | – | – | – | 0.618 (0.036) | – |
| Univariate selection⁴ | 0.1098 (0.0107) | 9.0 (12.2) | 29.9 (25.2) | 35.3 (29.1) | 0.620 (0.038) | 2.2 (2.9) |
| Stepwise selection⁴ | 0.1108 (0.0108) | 13.8 (13.9) | 46.0 (7.5) | 60.6 (5.4) | 0.614 (0.037) | 8.1 (2.5) |
| Lasso⁴ | 0.1096 (0.0103) | 12.5 (16.7) | 49.1 (19.9) | 57.1 (16.9) | 0.622 (0.039) | 9.1 (7.5) |
| Boosting⁴ | 0.1095 (0.0103) | 17.3 (13.8) | 55.4 (9.4) | 53.3 (8.5) | 0.625 (0.037) | 8.2 (7.2) |

AUC, area under the curve; MSE, mean squared error; NRI, net reclassification improvement; SNP, single nucleotide polymorphism
¹ Summary statistics (mean and standard deviation) for MSE, NRI, and AUC, obtained from (logistic) regression models, as well as the number of selected SNPs are shown. All measures were obtained by 3-fold cross-validation with 100 repetitions.
² Logistic regression model without any predictors.
³ Logistic regression model with clinical predictors (age and body mass index), but without any genetic predictors.
⁴ Regression model with clinical predictors and selected SNPs.
Boosting correctly increased the predicted probabilities of TNBC for the majority of patients with a TNBC (“correct reclassification upwards” in [Table 3]) and correctly decreased the predicted probabilities of TNBC for the majority of patients without TNBC (“correct reclassification downwards” in [Table 3]). Lasso achieved correct increases for about half of the TNBC patients and correct decreases for the majority of the non-TNBC patients. Univariate selection correctly increased and decreased predicted probabilities only for a minority of patients. With regard to correct reclassifications, stepwise selection performed much better than univariate selection. In total, the reclassification improvement of the boosting model was superior to that of all other methods (“NRI” in [Table 3]).
The average number of selected SNPs across the 300 training samples was similar for boosting, lasso, and stepwise selection and smaller for univariate selection. The number of selected SNPs varied considerably for lasso and boosting and only slightly for stepwise selection and univariate selection ([Table 3]).
During cross-validation, univariate tests were performed on each training set and SNPs were ordered according to their p values. The most frequent SNP on top was rs10069690, ranking first 158 times (52.7%). The next most frequent SNPs on top were rs2981579 (17.7%), rs78540526 (5.7%), rs2588809 (4.3%), and rs7726159 (4.3%). In total, 24 SNPs were ranked first at least once.
A boosting prediction model, the “winner” in the method comparison, was fitted on the complete dataset. Four SNPs were selected: rs10069690 (TERT, CLPTM1L), rs2981579 (FGFR2), rs2588809 (RAD51B), and rs78540526 (CCND1). All of these belonged to the top five SNPs in the univariate analysis. Age was the strongest predictor, stronger than any genetic predictors. The predicted probability for TNBC as a continuous function of age is shown in [Fig. 1]. The likelihood of TNBC decreases with increasing age up to about 60 years and remains constantly low thereafter. All regression coefficients are shown in [Table 4]. The coefficients of the predictor age were approximated using a cubic polynomial, as cubic spline functions are difficult to use. Apart from age, positive coefficients are associated with an increased likelihood of TNBC. An ideal “genetically high-risk patient” can thus be defined as a patient with two minor rs10069690 alleles and two common alleles at each of the other SNPs, while an ideal “genetically low-risk patient” is a patient with two common rs10069690 alleles and two minor alleles at each of the other SNPs. The footnote in [Table 4] states how the predicted probability of TNBC can be calculated using the predictor values given.
Table 4 The final clinical-genetic prediction model for triple-negative breast cancer¹.

| Predictor | Unit | Coefficient |
|---|---|---|
| Intercept | | 3.5589 |
| Age at diagnosis | Year | −0.1624 |
| | Year² | 0.0009372 |
| | Year³ | 0.000001951 |
| BMI | Per kg/m² | 0.005691 |
| rs10069690 | Minor allele | 0.19926 |
| rs2981579 | Minor allele | −0.09108 |
| rs2588809 | Minor allele | −0.02625 |
| rs78540526 | Minor allele | −0.03166 |

¹ For example, the predicted probability for a 50-year-old patient with a body mass index of 26 and 1, 2, 1, 0 minor alleles of rs10069690, rs2981579, rs2588809, and rs78540526, respectively, is exp(z)/(1 + exp(z)) at z = 3.5589 + 50 × (−0.1624) + 50² × 0.0009372 + 50³ × 0.000001951 + 26 × 0.005691 + 1 × 0.19926 + 2 × (−0.09108) + 1 × (−0.02625) + 0 × (−0.03166) = −1.8354. That is, 13.8%.
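The worked example in the footnote to Table 4 can be reproduced with a few lines of Python. The coefficients are copied from Table 4; the dictionary keys and function names are hypothetical, and the sketch is meant as an illustration of the arithmetic rather than as a validated risk calculator.

```python
import math

# Coefficients as listed in Table 4 (age enters as a cubic polynomial).
COEF = {
    "intercept": 3.5589,
    "age": -0.1624,
    "age2": 0.0009372,
    "age3": 0.000001951,
    "bmi": 0.005691,
    "rs10069690": 0.19926,
    "rs2981579": -0.09108,
    "rs2588809": -0.02625,
    "rs78540526": -0.03166,
}
SNPS = ("rs10069690", "rs2981579", "rs2588809", "rs78540526")


def tnbc_linear_predictor(age, bmi, minor_alleles):
    """Linear combination z; minor_alleles maps each SNP to 0, 1, or 2."""
    z = (COEF["intercept"] + COEF["age"] * age + COEF["age2"] * age ** 2
         + COEF["age3"] * age ** 3 + COEF["bmi"] * bmi)
    for snp in SNPS:
        z += COEF[snp] * minor_alleles[snp]
    return z


def tnbc_probability(age, bmi, minor_alleles):
    """Inverse logit of z: predicted probability of TNBC between 0 and 1."""
    z = tnbc_linear_predictor(age, bmi, minor_alleles)
    return math.exp(z) / (1.0 + math.exp(z))


example = {"rs10069690": 1, "rs2981579": 2, "rs2588809": 1, "rs78540526": 0}
z = tnbc_linear_predictor(50, 26, example)  # about -1.8354
p = tnbc_probability(50, 26, example)       # about 0.138, i.e. 13.8%
```

Multiplying the probability by 100 gives the TNBC prediction score on the scale from 0 to 100 described in the statistical methods.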
Fig. 1 The predicted probability for triple-negative breast cancer (TNBC) as a continuous function of age at diagnosis. The curves were generated using the boosting model fitted on the complete dataset. The black curve predicts the TNBC risk of a genetically “average” woman with a median body mass index. The blue and the orange curves show the predicted risk for patients with genetically maximally increased and maximally decreased risks.
The boosting model was well calibrated. The difference between actual and predicted events was quite low ([Fig. 2 ]; p = 0.73, Hosmer–Lemeshow test). The apparent AUC – i.e., the AUC on the complete dataset – was 0.668, which is 0.043 units larger than the cross-validated AUC value. This indicates that the prediction model was slightly overfitted. For comparison, the apparent AUC of the clinical model was 0.632 – i.e., 0.014 units larger than its cross-validated value.
Fig. 2 The observed and predicted frequencies of triple-negative breast cancer (TNBC). The patients were sorted according to the predicted probability for TNBC using the boosting prediction model and grouped into ten categories based on percentiles. The number of actually observed TNBCs (“observed events”) in each category and the sum of predicted probabilities for TNBC (“predicted events”) in each category are shown. Points below the gray line indicate when the model is overestimating the likelihood of TNBC; points above the gray line indicate when the model is underestimating the likelihood. A perfect prediction model would show all of the points on the gray line.
To demonstrate a possible future application of the final prediction model, various cut-off points for the TNBC risk between 0 and 100% were defined – e.g., 12%. Patients were classified as “low-risk” if the prediction model assigned a TNBC risk below 12%. Otherwise they were classified as “high-risk.” The sensitivity (i.e., the proportion of patients classified as “high-risk” among true TNBC patients) and the specificity (i.e., the proportion of patients classified as “low-risk” among true non-TNBC patients) are presented in [Table 5 ], and compared with the clinical model. The sensitivities were almost equal for cut-off points up to 12%. Thereafter, the sensitivities of the boosting model were larger. For instance, if a physician decides to screen patients with a risk of TNBC of more than 15% for biomarkers that are important for TNBC patients, without yet knowing their receptor status, then 43% of all TNBCs will be detected with the boosting model, in comparison with 38% with the clinical model. The rate of false-positive classifications would be 24%, two percentage points more than when using the clinical prediction model. The ROC curves for all possible cut-off points are shown in [Fig. 3 ].
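The classification described above can be sketched as follows. The Python function is a minimal illustration with hypothetical names; following the text, a patient is treated as “high-risk” when the predicted probability is not below the cut-off point.

```python
def sensitivity_specificity(y_true, p_pred, cutoff):
    """Return (sensitivity, specificity) for a given cut-off on the predicted TNBC risk."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, p_pred):
        high_risk = p >= cutoff  # below the cut-off means "low-risk"
        if y == 1:
            tp += high_risk
            fn += not high_risk
        else:
            fp += high_risk
            tn += not high_risk
    return tp / (tp + fn), tn / (tn + fp)
```

The false-positive rate quoted in the text is 1 minus the specificity, so a specificity of 0.76 corresponds to 24% of non-TNBC patients being classified as “high-risk”.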
Table 5 Sensitivity and specificity for the clinical prediction model and the clinical-genetic boosting prediction model¹.

| Cut-off point for predicting triple-negative tumor (%)² | Frequency above cut-off point (%)³ | Sensitivity, clinical model | Sensitivity, clinical-genetic model | Specificity, clinical model | Specificity, clinical-genetic model |
|---|---|---|---|---|---|
| 10 | 56.3 | 0.68 | 0.69 | 0.47 | 0.44 |
| 12 | 39.0 | 0.53 | 0.57 | 0.65 | 0.61 |
| 15 | 23.8 | 0.38 | 0.43 | 0.78 | 0.76 |
| 20 | 13.0 | 0.26 | 0.29 | 0.89 | 0.87 |
| 25 | 7.3 | 0.17 | 0.20 | 0.95 | 0.93 |

¹ All measurements were obtained by 3-fold cross-validation with 100 repetitions.
² Patients were classified into a “high-risk” group if the prediction model assigned a triple-negative tumor probability above the cut-off point. Sensitivity (between 0 and 1) is defined as the proportion of “high-risk” patients among TNBC patients. Specificity (between 0 and 1) is defined as the proportion of “low-risk” patients among non-TNBC patients.
³ The proportion of patients classified as “high-risk” in the total study population, using the clinical-genetic prediction model.
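The definitions of sensitivity and specificity used in Table 5 can be written out as a short sketch. The function name `sensitivity_specificity` and the toy data are assumptions made for illustration. Patients with a predicted probability at or above the cut-off are treated as “high-risk” here, following the rule in the text that patients below the cut-off are “low-risk.”

```python
def sensitivity_specificity(y_true, y_prob, cutoff):
    """Return (sensitivity, specificity) for a given risk cut-off.

    y_true: 1 for TNBC, 0 for non-TNBC; y_prob: predicted TNBC probability.
    A patient is "high-risk" if the predicted probability is >= cutoff.
    """
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= cutoff)
    sensitivity = tp / (tp + fn)   # high-risk among true TNBC patients
    specificity = tn / (tn + fp)   # low-risk among true non-TNBC patients
    return sensitivity, specificity


# Hypothetical toy data, evaluated at a 15% cut-off.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_prob = [0.30, 0.16, 0.08, 0.20, 0.10, 0.05, 0.12]
sens, spec = sensitivity_specificity(y_true, y_prob, cutoff=0.15)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} "
      f"false-positive rate={1 - spec:.2f}")
```

The false-positive rate quoted in the text is one minus the specificity: at the 15% cut-off, a specificity of 0.76 for the clinical-genetic model corresponds to a false-positive rate of 24%.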
Fig. 3 Cross-validated receiver operating characteristic (ROC) curve for the clinical and clinical-genetic boosting prediction models.
Discussion
The study shows that prediction of TNBC can be improved if breast cancer risk SNPs are added to a prediction rule based on age at diagnosis and BMI. Age at diagnosis turned out to be the strongest predictor, stronger than any genetic influencing factors.
The final prediction model included four SNPs from the genes RAD51B, TERT, CCND1, and FGFR2. Only one of these was statistically significant in the univariate SNP and TNBC association tests, but all of them were among the five SNPs with the lowest p values. Although the selection procedure did not consider any external biological information, there might be biological reasons why these SNPs taken together improve prediction.
rs10069690 (TERT) has been described as being associated with estrogen receptor-negative and triple-negative breast cancer, serous ovarian cancer, breast and ovarian cancer risk in BRCA1 mutation carriers, as well as prostate cancer – implying that there are similar pathways of pathogenesis in these different types of cancer [13 ], [15 ], [30 ], [33 ], [57 ]. Fine mapping analyses of this region revealed a role in telomere stability [30 ], [57 ]. rs2981579 (FGFR2) has been clearly described as an SNP that specifically increases the risk for hormone receptor-positive breast cancer [21 ], [58 ], [59 ]. Its role in hormone receptor signaling has been linked to FOXA1.
rs2588809 (RAD51B) is associated with triple-negative breast cancer [13 ], [15 ]. RAD51B, RAD51C, and RAD51D are RAD51 paralogues that form complexes with one another [60 ], [61 ] and have a function in homologous recombination. SNPs in RAD51B are associated with breast cancer in men [62 ], prostate cancer risk [63 ], and an increased risk of breast and ovarian cancer in BRCA1 mutation carriers [64 ]. In vitro experiments have shown that a reduction in RAD51B by silencing RNA increases chemosensitivity and reduces the efficacy of homologous recombination in breast cancer cells, with differences depending on subtype [65 ].
rs78540526 (CCND1) is located in a gene region that maps to a putative enhancer of CCND1. It is clearly associated only with hormone receptor-positive breast cancer risk [25 ], [66 ], [67 ] and is therefore a reasonable marker for predicting hormone-receptor negativity and triple negativity. CCND1 expression levels have been shown to differ between haplotypes in this enhancer region [25 ]. This is of special interest, as CCND1 expression and/or amplification have been under discussion as a biomarker for the efficacy of CDK4/6 inhibitors [68 ].
In genetic prediction studies, it can be expected that the ranking of the SNPs will differ, and the set of SNPs selected for prediction will also differ, if the experiment is repeated on a different group of patients with the same clinical characteristics. This also holds if analyses are performed on subsets of patients within one study [69 ]. In the present study, for instance, the top-ranked SNP in the complete dataset was not ranked top in about 50% of all subsamples, and the sets of selected SNPs varied strongly. Correlations among SNPs, and SNPs with weak individual associations with the outcome but stronger power as a group, may encourage fluctuation in SNP selection. To obtain stable, reliable results that are independent of a randomly chosen patient subset, all decisions (e.g., the choice of tuning parameter for model specification and comparison of model performances) were based on repeated sampling.
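The instability of SNP rankings under resampling can be checked with a simple sketch like the one below. It is an illustration under stated assumptions, not the procedure used in the study: the function names, the subsample fraction of two-thirds, and the ranking score (absolute difference in mean genotype between TNBC and non-TNBC patients, used as a stand-in for a univariate association test) are all hypothetical.

```python
import random


def top_snp(genotypes, y, idx):
    """Return the name of the SNP with the largest absolute difference in
    mean genotype between cases and non-cases among the patients in idx."""
    cases = [i for i in idx if y[i] == 1]
    others = [i for i in idx if y[i] == 0]

    def score(values):
        if not cases or not others:
            return 0.0  # no contrast possible without both groups
        mean_cases = sum(values[i] for i in cases) / len(cases)
        mean_others = sum(values[i] for i in others) / len(others)
        return abs(mean_cases - mean_others)

    return max(genotypes, key=lambda name: score(genotypes[name]))


def top_rank_stability(genotypes, y, n_subsamples=200, fraction=2 / 3, seed=1):
    """Return the top-ranked SNP in the complete dataset and the share of
    random subsamples in which the same SNP is ranked top."""
    n = len(y)
    rng = random.Random(seed)
    reference = top_snp(genotypes, y, range(n))
    size = int(round(fraction * n))
    hits = 0
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), size)   # subsample without replacement
        hits += top_snp(genotypes, y, idx) == reference
    return reference, hits / n_subsamples
```

Here `genotypes` is assumed to be a dictionary mapping SNP names to lists of genotype codes, one entry per patient. A returned share of about 0.5 would correspond to the situation described above, in which the top-ranked SNP of the complete dataset is not ranked top in about half of the subsamples.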
Double cross-validation was carried out, with an inner loop to specify the prediction model and an outer loop to compute model performance measures, in order to ensure that all model-building steps were performed completely independently of the validation step [55 ], [70 ]. That is, all reported measures were based on data that were not used for model building. Otherwise, the measures would have been overoptimistic. Schild et al. [71 ] provide an example of double (cross-)validation being applied in a gynecological study.
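The separation between the inner and the outer loop can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the prediction model is a toy model (the event rate per level of one binary predictor, shrunk toward the overall rate by a tuning parameter `lam`), not boosting or lasso, and the repetition of the whole procedure is omitted. Only the structure, in which the tuning parameter is chosen without access to the held-out patients, reflects the principle described above.

```python
import random


def fit(x, y, lam):
    """Toy model: event rate per level of a binary predictor, shrunk toward
    the overall rate by lam (0 = no shrinkage, 1 = overall rate only)."""
    overall = sum(y) / len(y)
    rates = {}
    for level in (0, 1):
        ys = [yi for xi, yi in zip(x, y) if xi == level]
        level_rate = sum(ys) / len(ys) if ys else overall
        rates[level] = (1 - lam) * level_rate + lam * overall
    return rates


def mse(rates, x, y):
    return sum((yi - rates[xi]) ** 2 for xi, yi in zip(x, y)) / len(y)


def make_folds(indices, k, rng):
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def held_out_mse(x, y, train, test, lam):
    rates = fit([x[i] for i in train], [y[i] for i in train], lam)
    return mse(rates, [x[i] for i in test], [y[i] for i in test])


def double_cv(x, y, lams=(0.0, 0.5, 1.0), k_outer=3, k_inner=3, seed=1):
    rng = random.Random(seed)
    all_idx = list(range(len(y)))
    outer_errors = []
    for test in make_folds(all_idx, k_outer, rng):
        test_set = set(test)
        train = [i for i in all_idx if i not in test_set]
        inner_folds = make_folds(train, k_inner, rng)

        def inner_error(lam):
            # Inner loop: uses training patients only.
            errors = []
            for val in inner_folds:
                val_set = set(val)
                inner_train = [i for i in train if i not in val_set]
                errors.append(held_out_mse(x, y, inner_train, val, lam))
            return sum(errors) / len(errors)

        best_lam = min(lams, key=inner_error)
        # Outer loop: performance on patients not used for model building.
        outer_errors.append(held_out_mse(x, y, train, test, best_lam))
    return sum(outer_errors) / len(outer_errors)
```

The value returned by `double_cv` is the mean squared error averaged over the outer folds. Because the tuning parameter is selected separately within each outer training set, the held-out patients never influence model specification.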
The SNP selection process was carried out following a prespecified plan. Univariate selection is a simple method that does not take correlations among SNPs into account. It is known to perform less well in general than more sophisticated methods such as lasso and boosting [47 ], a result that was confirmed in this study. Lasso and boosting performed similarly, although the model fitting was rather different. However, the two methods share the common feature that variable selection is a continuous process that leads to “weakly” selected SNPs in addition to strong predictors. The finding in the present study that boosting had a slightly better prediction accuracy is consistent with a recently published methodological study comparing boosting and lasso on simulated datasets [72 ]. Bootstrap-based stepwise selection, a method that our group has previously applied successfully to nongenetic data (e.g., [45 ], [73 ], [74 ]), performed less well than lasso and boosting. This might be because the parameters for variable selection were kept fixed, in contrast to the varying tuning parameters of the other methods. Since repetitive stepwise selection is itself relatively elaborate, it would have been computationally demanding if the number of selection processes had been further increased.
The added value provided by breast cancer SNPs to a clinical prediction model was assessed using the overall performance measures MSE and AUC. The advantage of such overall measures is that prediction models can easily be compared. The disadvantage is that they may be insensitive to improvements in model performance when new predictors are added to a model that already includes important predictors [75 ], [76 ]. For example, in [77 ], the addition of a significant biomarker score to a set of standard risk factors increased the AUC only from 0.76 to 0.77, an increase that is similar to that in the present study. Because of this, alternative methods of quantifying the improvement, such as the net reclassification improvement (NRI), have been developed [78 ].
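For readers unfamiliar with the two overall measures, the following sketch shows how they can be computed from observed outcomes and predicted probabilities. The function names and the toy data are assumptions. The AUC is computed here by directly comparing all pairs of events and non-events, which is equivalent to the area under the ROC curve but practical only for small datasets.

```python
def mse(y_true, y_prob):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """Probability that a randomly chosen event receives a higher predicted
    probability than a randomly chosen non-event (ties count one half)."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


# Hypothetical toy data: 1 = TNBC, 0 = non-TNBC.
y_true = [1, 1, 0, 0, 0]
y_prob = [0.4, 0.1, 0.3, 0.1, 0.05]
print(f"MSE={mse(y_true, y_prob):.4f} AUC={auc(y_true, y_prob):.2f}")
```

A lower MSE and a higher AUC indicate better predictions. An AUC of 0.5 corresponds to a model that ranks patients no better than chance.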
In the future, germline genetic testing of SNPs from blood could be carried out in clinical routine work on the same day and at reasonable cost [10 ], particularly if only a few SNPs are involved, which can then be genotyped using polymerase chain reaction. This would mean that the data would be available long before the processing of tissue, which has to be embedded, cut, and examined by a pathologist along with the relevant molecular tests. Using this genetic information, screening for specific TNBC studies with elaborate biomarker assessment could be initiated at an early time point for patients with an increased likelihood of TNBC, particularly when biomarker assessment for all patients would be too expensive and waiting for pathology results would delay biomarker assessment and the patient's entry into a study.
The present study also aimed to demonstrate ways of managing the abundance of data available in the era of “big data” and easy access to a variety of data, in order to make it feasible to use large data volumes for clinical purposes. It can be anticipated that it will also become possible to add the analysis of other markers, such as circulating tumor DNA, in order to increase the accuracy of molecular subgroup prediction. However, that will be a task for future research.
This study has some limitations. First of all, it needs to be borne in mind that the study was conducted in a population consisting only of breast cancer patients. It did not serve to identify SNPs capable of predicting the risk for triple-negative breast cancer in healthy women – e.g., using a case–control study design. As the study was intended to differentiate between triple-negative patients and non–triple-negative ones, it might have been more useful to examine SNPs differentiating between molecular subtypes rather than SNPs for breast cancer risk. Another limitation is the sample size: with just over 1000 patients, it was rather small, and the findings will require validation in independent populations.
In conclusion, the ability to predict triple-negative tumors can be improved for breast cancer patients if breast cancer risk SNPs are added to a prediction rule based on age at diagnosis and BMI. This finding could be used for prescreening purposes in complicated molecular therapy studies for triple-negative breast cancer. The advanced statistical procedures used in this study follow a prespecified, systematic plan and are described with sufficient generality to be easily adaptable for other research purposes.
Acknowledgement
The authors are grateful to Michael Robertson for professional medical editing services.