Confounding Effects in ‘‘A Six-Gene Signature Predicting Breast Cancer Lung Metastasis’’

The majority of breast cancer deaths result from metastases rather than from direct effects of the primary tumor itself. Recently, Landemaine and colleagues described a six-gene signature purported to predict lung metastasis risk. They analyzed gene expression in 23 metastases from breast cancer patients (5 lung, 18 non-lung) identifying a 21-gene signature. Expression of 16 of these was analyzed in primary breast tumors from 72 patients with known outcome, and six were selected that were predictive of lung metastases: DSC2, TFCP2L1, UGT8, ITGB8, ANP32E , and FERMT1 . Despite the value of such a signature, our analysis indicates that this analysis ignored potentially important confounding factors and that their signature is instead a surrogate for molecular subtype. [Cancer Res 2009;69(18):7480–5]


Introduction
Breast tumors are heterogeneous, and different subtypes have greater or lesser propensity to metastasize to particular organ sites. Basal-like and luminal B subtypes have the greatest likelihood to metastasize to lung (40% and 36.7%, respectively; ref. 1). In contrast, ERBB2 + , luminal A, and normal-like tumors preferentially metastasize to other sites, with rare recurrence in lung (1). Given this difference in metastatic profile, we examined Landemaine's six-gene signature using their training, validation, and other data sets and found it to be more predictive of molecular subtype than of metastasis site independent of subtype.

Materials and Methods
Gene expression data sets. Published data sets were downloaded from the Gene Expression Omnibus 4 or ArrayExpress 5 database; accession numbers are given in Supplementary Table S1. Data were normalized using robust multiarray averaging (2) using the Bioconductor package affy (R, version 2.1, Bioconductor version 1.8). The EMC-344 data set, available only as MAS5 processed data, was quantile normalized and converted to log 2 expression values. Expression profiles were visualized using average linkage hierarchical clustering using the Bioconductor package made4 (3) using Euclidean distance or 1 À Pearson correlation coefficient distance, as appropriate.
Assigning molecular subtype. Breast cancers can be subdivided into clinically relevant, prognostic molecular subtypes based on expression of characteristic genes. Basal-like breast cancer is typically negative for expression of ESR1/PGR/ERBB2; ERBB2 + tumors have amplification of the ERBB2 gene; and the ESR1 + luminal subtype can be subdivived by grade: luminal A (low grade) and luminal B (high grade). To assign subtypes in the various data sets analyzed, we examined expression of 12 Affymetrix probe sets (Supplementary Table S2) representing estrogen receptor (ER) genes (ESR1, PGR , and GATA3), ERBB2 genes (ERBB2 and GRB7), and the grade signature described by Ma and colleagues (4). Using data from these probe sets, we performed hierarchical clustering analysis and assigned subtype based on overall expression patterns; our assignments were consistent with reported clinical assignments where available. As validation, assignments were compared with those based on the intrinsic breast cancer quantitative reverse transcription-PCR (RT-PCR) signature (5).
Mapping of the six-gene signature to other microarray platforms. The Landemaine signature consists of six probe sets (204751_x_at, 227642_at, 228956_at, 211488_s_at, 208103_s_at, and 60474_at) that were reported to correspond to DSC2, TFCP2L1, UGT8, ITGB8, ANP32E, and FERMT1, respectively (6). The data sets used in our analysis were obtained on two different GeneChip platforms: U133Plus2 (6, 7) and U133A (refs. 8-11; see Supplementary Table S1). Only four of the six probe sets were present on U133A (the ''intersection set''); the missing probe sets are 227642_at (TFCPL1) and 228956_at (UGTB8). It should be noted that neither of these two probe sets maps to the coding region of their respective gene sequences in EnsEMBL (release 50). Consequently, we expanded our analysis to include all probe sets mapping to Landemaine's six genes. Fourteen and 10 probe sets map to the six genes on U133Plus2 and U133A arrays, respectively (Supplementary Table S3; the ''all-mapped set''). In our analysis, we used Landemaine's six probe sets, the intersection set, and the all-mapped set. Affymetrix arrays often contain multiple probe sets for individual genes, including probe sets with partial matches to other genes and probe sets for alternate splice forms for individual genes; these can sometimes give conflicting results although not always. The decision to use multiple array-based probe sets was motivated by our desire to both replicate Landemaine's analysis and take a more inclusive approach to Landemaine's signature, particularly when comparing across array designs.
Statistical analyses. All analyses were done using the R statistical language (release 2.7.1) and Bioconductor (release 2.2). Associations between clinical and biological covariates and expression of the six-gene signature were analyzed using two methods. First, we applied Goeman's globaltest (Bioconductor package globaltest version 4.10.0), which is based on an empirical Bayesian generalized linear model where the regression coefficients between expression data and clinical outcome are the random variables estimated using a goodness-of-fit test (12). Second, we used an ANCOVA approach (Bioconductor package GlobalAncova version 3.6.0) to test for the association between expression values and clinical covariates (13,14). GlobalAncova compares linear models via the extra sum of squares principle and thus tests whether the expectation of expression levels differs between clinical covariates for a given group of genes. There was a high level of consensus between these two approaches. R scripts for both analyses are included in Supplementary Materials.
Comparison to published gene expression signatures. Gene symbols for each of the lung metastasis six-gene signature were searched against GenSigDB, a collection of more than 250 breast cancer gene signatures that we manually curated from published studies. 6

Results
High-grade basal-like and luminal B tumors have a propensity to metastasize to lung (1). However, studies identifying gene expression signatures predictive of lung metastases have focused on heterogeneous patient groups without considering subtypespecific effects (6,8). Landemaine's six-gene signature, the focus of our analysis, is claimed to predict lung metastasis risk independent of other factors. Our analysis, using their original training data set (6), their validation data (8,9), and three additional breast cancer data sets with defined subtypes (7,10,11), finds Landemaine's six genes to be more discriminative in identifying triple-negative basal-like tumors rather than in identifying the site of distant metastasis independent of molecular subtype.
Hierarchical clustering analysis of probe sets that categorize breast cancer molecular subtypes (Supplementary Table S2) was used to assign subtypes in Landemaine's 23 metastasis samples ( Fig. 1; Table 1A). Eight were negative for ESR1, PGR , and ERBB2 gene expression and expressed genes associated with high grade, typical of basal-like breast tumors. Five metastasis samples expressed high levels of ERBB2 and GRB7 and were classified as ERBB2 + . Of the 10 positive for ESR1 expression, 5 were identified as high-grade (luminal B) and 5 as low-grade (luminal A) tumors (Fig. 1A). All of the lung metastasis (n = 5) samples were classified as basal-like molecular subtype (n = 8) and this relationship was significant (Pearson m 2 test, P < 0.01), indicating a potentially confounding covariate within this data set. This is supported by further hierarchical clustering analysis using Landemaine's six-gene signature, which shows a clear separation between basal-like and non-basal-like tumors ( Fig. 1B; globaltest, P < 0.0001; GlobalAncova, P < 0.001).
Unfortunately, Landemaine's sample annotation lacks information necessary to confirm our subtype assignment: No information is available on the total number of patients profiled (it is possible that there were fewer than 23 patients, some with metastases to multiple sites), relevant clinical and histopathologic data, or gene expression profiles from the primary tumors. Consequently, we validated our observation of confounding effects by analyzing additional published data sets, including those used by Landemaine for confirmation.
We first examined expression profiles of primary breast cancers from patients at Memorial Sloan-Kettering Cancer Center (''MSK'' data set) for which metastasis status was known (8); this data set was used by Landemaine for validation. MSK contained profiles from 98 tumors, 82 of which had 3-year follow-up annotation including information on metastasis. These 82 samples were assigned to molecular subtypes: basal-like (n = 25), ERBB2 + (n = 18), luminal A (n = 10), and luminal B (n = 29; Supplementary Figs. S1 and S2). Six of the nine tumors with lung metastases had a basal-like profile (Table 1B), and the association between molecular subtype and metastasis site was significant (m 2 test, P < 0.05).
We then applied global gene set analysis to test whether the six-gene signature is predictive of metastasis site or subtype (Supplementary Table S4). Globaltest (12) and GlobalAncova (13,14) analyses reported a significant association between expression of the six-gene signature and molecular subtype (intersection or all-mapped probe sets, P < 0.0001; n = 82). Whereas there was a marginally significant association with metastasis site (intersection probe set: globaltest, P = 0.05; GlobalAncova, P < 0.05; all-mapped probe set: globaltest, P = 0.1; GlobalAncova, P < 0.05; n = 82), this was no longer significant when adjusted for molecular subtype (globaltest or GlobalAncova, P > 0.05). By contrast, the association between expression of the six genes and subtype remained highly significant even when corrected for metastasis status (globaltest or GlobalAncova, P < 0.0001; n = 82). When only the basal-like breast samples were considered, Landemaine's six genes were not able to predict propensity to metastasize to lung or non-lung sites using either the intersection (globaltest or GlobalAncova, P > 0.05; n = 26) or all-mapped probe sets (globaltest or GlobalAncova, P > 0.05; n = 26).
Returning to the full set of 82 MSK samples, we examined the contribution of each of Landemaine's six genes to the association with subtype. The model with four intersection probe sets was influenced most strongly by expression of DSC2 (probe set 204751_x_at) and ANP32E (probe sets 208103_s_at), which were expressed in the basal-like samples. When the 10 all-mapped probe sets are considered, the same genes, DSC2 (probe set 204751_x_at) and ANP32E (probe sets 208103_s_at and 221505_at), most strongly influenced the association ( Fig. 2; Supplementary  Fig. S3). The association between expression of Landemaine's six genes and metastasis site was not evaluated in luminal B tumors in MSK due to insufficient sample size (Table 1B; n = 1). These analyses indicate that Landemaine's six-gene signature is predictive of subtype, but not metastasis site, in the MSK data set.
Landemaine also validated their signature on a larger cohort (n = 344) of early-stage patients (EMC-344, 9), some of whom had lung metastases as the first site of relapse (n = 31) or among cumulative sites of distant relapse (n = 42; see Supplementary  Tables S5 and S6 for summaries). We found that although the intersection probe sets (but not the all-mapped probe sets) were significantly associated with first (globaltest, P < 0.001; GlobalAncova, P < 0.05) or all lung metastasis events (globaltest or GlobalAncova, P < 0.01), this effect remains only marginally significant when corrected for subtype (globaltest or GlobalAncova; intersection probe sets, P < 0.05; Supplementary Table S7; Supplementary Figs. S4 and S5). By contrast, subtype is highly significant (globaltest or GlobalAncova; intersection or all-mapped probe sets, P < 0.0001; Supplementary Fig. S6 and Table S7) and remains so even when corrected for first or all lung metastases (globaltest or GlobalAncova; intersection or all-mapped probe sets, P < 0.0001; Supplementary Fig. S5). When we examined specific breast cancer subtypes in the EMC-344 data set, there was no association between first or all lung metastasis events and expression of the Landemaine signature (intersection or allmapped probe sets) in basal-like tumors (n = 84; P > 0.05). There is a weak association between expression of the intersection probe sets, but not of the all-mapped probe sets, and first recurrence to lung in luminal B breast cancer (globaltest or GlobalAncova, P < 0.05; n = 95). However, there is no association between expression of these probe sets (intersection or all-mapped) and lung metastases in luminal B breast cancer (n = 95). Therefore, assessment of the Landemaine signature indicates that although strongly associated with subtype, there is little evidence to support it as a predictor of lung metastases in two of their validation data sets.
We then tested the association between expression of the six genes and subtype in three additional publicly available breast cancer data sets (7,10,11). In each case, molecular subtypes were provided with the sample annotation. In a study of primary breast tumors, Farmer and colleagues (10) defined three tumor subtypes: an ER-positive luminal group (n = 27) and two ER-negative subtypes, basal-like (n = 16) and an androgen receptor-positive group, which they called molecular apocrine (n = 6). The molecular apocrine tumors expressed ERBB2 and shared features with the ERBB2 + subtype. Among Landemaine's six genes (using either 4  intersection probe sets or 10 all-mapping probe sets), we observed a significant association between expression of these and subtype using either globaltest or GlobalAncova analysis (P < 0.0001; Supplementary Fig. S7). In both, ANP32E and DSC2 most influenced the model and were both significantly up-regulated in basal-like tumors relative to the other subtypes.
Basal-like tumors cluster with and are phenotypically similar to BRCA1-deficient breast tumors. Both are ER negative, display high levels of chromosome abnormalities, and have poorer prognosis compared with other subtypes. In a study of BRCA1 breast cancer and sporadic basal-like breast cancer gene expression using U133Plus2 arrays, Richardson and colleagues (7) provided both BRCA1 status and subtype information for their samples: sporadic basal-like (n = 18), BRCA1-deficient (n = 2), non-basal-like (n = 20), and normal breast (n = 7). Both GlobalAncova and globaltest identified a significant association between expression of Landemaine's six genes and the basal-like phenotype (P < 0.0001; Supplementary Fig. S8). DSC2, ANP32E, and ITGB8 had greatest influence on the test statistic in both analyses.
In an analysis of 51 breast cancer cell lines, Neve and colleagues (11) divided basal-like cell lines into two groups, basal B and basal A, based on morphology and patterns of gene expression. The  basal B were less differentiated, displayed a greater mesenchymallike appearance, and were also more invasive in Boyden chamber assays than were basal A or luminal cells (11). We examined the expression of Landemane's six predictive genes in these cell lines to determine their expression profiles in highly invasive basal B and other cell line subtypes. Global gene set analysis found these genes to be associated with expression in both basal A and basal B, but not the luminal, breast cancer cell lines (P < 0.0001; Supplementary  Fig. S9). In both globaltest and GlobalAncova analyses, expression of DSC2 and FERMT1 was associated with basal A and basal B cell lines, respectively. However, although ANP32E influenced both models, it was associated with basal A by GlobalAncova and with basal B by globaltest. Therefore, although the six-gene signature is associated with expression in basal-like cell lines, it is not specific for basal A or the more aggressive basal B subtype. These additional analyses further support the association between expression of Landemaine's six genes and basal-like breast cancer. It is interesting to note that expression of DSC2 and ANP32E strongly influences the association between the six-gene signature and subtype, specifically basal-like, in each analysis. Because of this strong association, we investigated whether Landemaine's genes had been previously reported to be important in basal-like breast cancer. We checked the six genes against GenSigDB, a manually curated database of more than 250 published microarray gene expression signatures in breast cancer. 6 We found that five of the six genes had been previously identified in breast cancer gene signatures ( Whereas roles for DSC2 (a desmocollin) or ANP32E (a member of the leucine-rich acidic nuclear phosphoprotein 32 family) are not well established in breast cancer, these two genes significantly associated with subtype in our analyses and have been previously associated with ER-negative, basal-like breast cancer in multiple studies ( Table 2). DSC2 is listed in five microarray breast cancer gene expression signatures, including those that define ''intrinsic'' molecular subtypes (5,15). It is a member of the 53-gene set optimized for real-time quantitative RT-PCR subtyping of breast cancer (5). ANP32E was identified in six predictive gene expression signatures. ANP32E was found to be differentially expressed in triple-negative medullary basal-like breast cancer (16) and is a member of the wound response signature (17), which has been found to be predictive of poor prognosis. Whereas it has been reported to regulate the activity of the tumor suppressor protein phosphatase PP2A in cerebellar synatogenesis (18), we know of no    study that has reported such a role for it in breast cancer. Expression of both DSC2 and ANP32E is also associated with p53 mutation (19), which is predictive of poor patient outcome and is most frequently observed in patients with basal-like or ERBB2 + breast cancer. The role of ANP32E and DSC2 in basal-like breast cancer warrants further investigation. In analyzing the NKI and EMC data sets, Landemaine used survival to validate their lung metastasis signature. Basal-like breast cancer, the wound healing signature, p53 mutation, and ERnegative breast cancer are all associated with poor survival. Given these strong associations, it is not unexpected that Landemaine found these genes to be predictive of patient survival.

Discussion
Our analysis of Landemaine's signature indicates that it is confounded by subtype and that it instead predicts basal-like subtype rather than lung metastasis. The true test of the power of Landemaine's six-gene signature to predict lung metastasis would be to show its ability to differentiate between basal-like breast cancer with and without lung metastases, or to selectively identify lung metastases arising in patients with luminal B (highgrade, ER + ) breast cancer, but this is not something that can be investigated with the available data. More generally, our work suggests that more comprehensive sample annotation of gene expression data is necessary and that greater care must be taken in analyzing gene expression signatures to account for confounding effects.

Disclosure of Potential Conflicts of Interest
No potential conflicts of interest were disclosed.