Oncogene (2008) 27, 6607–6622
© 2008 Macmillan Publishers Limited. All rights reserved 0950-9232/08 $32.00
www.nature.com/onc

ONCOGENOMICS

Prediction of future metastasis and molecular characterization of head and neck squamous-cell carcinoma based on transcriptome and genome analysis by microarrays

DS Rickman1,4, R Millon2, A De Reynies1, E Thomas1, C Wasylyk3, D Muller2, J Abecassis2 and B Wasylyk3

1Ligue Nationale Contre le Cancer, Paris, France; 2Centre Régional de Lutte Contre le Cancer Paul Strauss, 3 rue Porte de l’Hôpital, Strasbourg, France and 3IGBMC CNRS/INSERM/ULP, Illkirch, France

Propensity for subsequent distant metastasis in head and neck squamous-cell carcinoma (HNSCC) was analysed using 186 primary tumours from patients initially treated by surgery that developed (M) or did not develop (NM) metastases as the first recurrent event. Transcriptome (Affymetrix HGU133_Plus2, QRT–PCR) and array-comparative genomic hybridization data were collected. Non-supervised hierarchical clustering based on Affymetrix data distinguished tumours differing in pathological differentiation, and identified associated functional changes. Propensity for metastasis was not associated with these subgroups. Using QRT–PCR data we identified a four-gene model (PSMD10, HSD17B12, FLOT2 and KRT17) that predicts M/NM status with 77% success in a separate 79-sample validation group of HNSCC samples. This prediction is independent of clinical criteria (age, lymph node status, stage, differentiation and localization). The most significantly altered transcripts in M versus NM were significantly associated with metastasis-related functions, including adhesion, mobility and cell survival. Several genomic modifications were significantly associated with M/NM status (most notably gains at 4q11–22 and Xq12–28; losses at 11q14–24 and 17q11) and partly linked to transcription modifications. This work yields a basis for the development of prognostic molecular signatures, markers and therapeutic targets for HNSCC metastasis.
Oncogene (2008) 27, 6607–6622; doi:10.1038/onc.2008.251; published online 4 August 2008

Keywords: HNSCC; distant metastasis; prognosis; intrinsic groups; differentiation

Correspondence: Dr B Wasylyk, IGBMC, CNRS/INSERM/ULP, 1 rue Laurent Fries, BP 10142, Illkirch 67404, France.
E-mail: [email protected]
4Current address: Department of Pathology and Laboratory Medicine, Weill Cornell Medical Center, 1300 York Avenue, Room C-458D, New York, NY 10021, USA.
Received 6 February 2008; revised 21 May 2008; accepted 27 June 2008; published online 4 August 2008

Introduction

There are various aetiological factors for head and neck squamous-cell carcinoma (HNSCC), including alcohol consumption with smoking and human papillomavirus (HPV) infection, for our patient population (Applebaum et al., 2007; Hashibe et al., 2007). Survival is still poor (5-year survival 30–50%), mainly due to relapse, metastasis or second cancer (Forastiere et al., 2001; Le Tourneau et al., 2005). The anatomic location and TNM staging guide treatment selection, but patients with similar tumour characteristics differ in their clinical outcome. Our aim is to identify, in primary tumours, molecular signatures that predict the subsequent development of distant metastases in patients treated with complete surgical resection and adjuvant therapy. Furthermore, we want to study the still poorly understood biological processes that predispose human tumours to the development of metastases, and to define targets for therapy.

Our previous study, using large-scale validated differential display, identified 820 transcripts that were differentially expressed between tumours and normal tissue, about 10% of which differed between tumours that did (M) or did not (NM) develop metastasis as the first recurrence event (Carles et al., 2006). A subsequent study, with 28 tumours and Affymetrix U95A arrays (Affymetrix, Santa Clara, CA, USA) detected 164 transcripts whose levels differed significantly between M and NM tumours, but we did not find changes that could be used to predict M/NM status in independent samples (Cromer et al., 2004). Signatures of poor prognosis and metastasis have been identified in other larger scale studies (review: Nguyen and Massague, 2007). We now report a larger study, using more samples (186) and RNA variables (Affymetrix HG-U133 plus 2.0 GeneChips), and we have included an analysis of genomic changes (IntegraChip 4.4K bacterial artificial chromosome (BAC) comparative genomic hybridisation (CGH) arrays, array-CGH (aCGH)). Using unsupervised analysis, we defined intrinsic groups that correspond to pathological differentiation. Using QRT–PCR-validated RNA levels and a training group, we have defined a four-gene model that predicts future distant metastasis with 77% success in an independent validation group. We have identified genomic and transcriptomic changes that are significantly different in primary tumours with dissimilar M/NM status. These findings are useful for the development of prospective signatures of metastasis, and for the understanding of biological processes that predispose to the development of metastases in patients with HNSCC.

Results

Classification of tumours
The global differences between the transcriptomes of the HNSCC tumours may result from metastatic propensity, or other clinical or biological features. To detect these ‘intrinsic’ features, we performed unsupervised hierarchical classification as already described (Boyault et al., 2007). We studied extensively the robustness of the topologies (series of dendrograms) obtained under different conditions: three distinct agglomerative clustering methods, various thresholds for variance based unsupervised gene selection, resampling and addition of Gaussian noise to the data. We found that the sample partitions that yielded between two and four sample groups were very reproducible (Supplementary Table 1) and identified a consensus partition of four groups (Supplementary Figure 1). For illustrative purposes, we selected a representative sample dendrogram that was most related to the consensus partition (Figure 1). Using Fisher’s exact tests we found that these four major clusters (C1–C4) were not significantly associated with stage, localization, metastasis-free survival (Figure 1), or any of the other characteristics of the tumours analysed (data not shown). However, C1–C3 were significantly associated with degree of differentiation, corresponding to well (C1), moderately (C2) and poorly (C3) differentiated tumours. C4 was not associated with any of the characteristics analysed. These results suggest that pathological differentiation has the biggest effect on the overall differences in transcription of the tumours.

To get a better understanding of the molecular determinants of the intrinsic groups, we analysed the 449-gene list (548 probe sets) that was used to generate the four clusters in the representative dendrogram shown in Figure 1. Unsupervised cluster analysis was used to segregate the genes into six gene groups (a–f; Figure 1). Using hypergeometric tests we determined if there was an enrichment of particular biological pathways and functions in the six gene groups (see Supplementary Material and methods). This allowed us to characterize the gene clusters a–f by the most significantly associated gene ontology (GO; Ashburner et al., 2000) terms and pathways (Supplementary Table 2), which included cell differentiation, extracellular matrix organization, tissue development, adhesion and immune response (Figure 1). We then characterized each sample group (C1–C4), according to the genes specifically up/downregulated within this group and their associated GO terms and pathways. For example, poorly differentiated tumours (C3) are characterized by upregulation in gene cluster a, which is significantly associated with cell motility. Sample group C4 is associated with upregulation of genes encoding for proteins related to muscle structure and function (approximately half of the 59 significant GO terms that are enriched in the gene cluster f, Figure 1; see the GO analysis Supplementary Table 2 sheet 2, and the genes in cluster f in Supplementary Table 2 sheet 1). This is a clue to the origin of cluster C4.

To further analyse the biological functions associated with pathologically defined differentiation (as opposed to the intrinsic groups defined by unsupervised classification) we performed a supervised analysis between the tumours classified according to pathological differentiation (Table 1; Supplementary Table 3). Using 1-way ANOVA (P < 0.01) and pairwise Wilcoxon tests (P < 0.001) we established three lists of genes (a total of 835 genes) that were significantly differentially expressed between at least two of the three clinically defined groups (Supplementary Table 4). As expected, a number of the genes (50) involved in the unsupervised clustering (Figure 1) were among the 835 significant genes obtained from the supervised analysis. QRT–PCR validation of seven genes confirmed that there is a gradient of expression from poorly to well-differentiated tumours (Figure 2; in general there was a very high correlation between the Affymetrix and QRT–PCR, data not shown). These genes were chosen as representative genes that are differentially expressed in tumours according to their pathologically defined differentiation status, and due to the potential importance of the encoded proteins for cell differentiation (see Supplementary Table 5, column H, for a brief description). Further analysis of the genes that are differentially expressed between the tumours could provide insights into the molecular events that are related to pathological differentiation of HNSCC.

Four-gene model predicting future metastasis in HNSCC that is independent of clinical variables
Given that metastasis was not associated with any of the intrinsic HNSCC subgroups, we set out to identify a molecular signature that is predictive of developing future metastasis irrespective of differentiation status and/or other clinical variables, and that could be employed in a clinical environment using quantitative PCR. Of the initial 186 patients, 142 (50 M and 92 NM) were used for transcriptome, genome and bioinformatics analysis as they had a minimal follow-up period of 36 months (Table 1; Supplementary Table 3; Supplementary Material and methods). Based on Cox univariate and multivariate analyses of RNA (Affymetrix) and DNA (aCGH) variables from 81 patients, we selected 31 RNA variables for further analysis (Supplementary Table 5; Supplementary Material and methods). The DNA variables did not add predictive power. Subsequently, RNAs from 134 tumours were quantified by QRT–PCR (73 ‘Affymetrix’ + 61 additional samples;

Oncogene HNSCC, distant metastasis signatures and classifiers DS Rickman et al 6609

Figure 1 Identification of intrinsic groups in head and neck squamous-cell carcinoma (HNSCC) (a) and the molecular features of the most varying probe sets (b). (a) A representative dendrogram of the samples is shown. The samples can be divided into four distinct groups, C1–C4. The little squares below the dendrogram indicate the status of the individual patients in terms of the clinical parameters: tumour stage (1–4, UICC TNM system), localization (1: lip, tongue and base of tongue; 2: oral cavity; 3: oropharynx; 4: hypopharynx), differentiation (1: well, red; 2: moderately, orange; 3: poor, red), event free survival (EFSevent) refers to M/NM status (black: M, white: NM). Fisher’s exact test P-values shown are for the indicated clinical parameters relative to all four groups (in red on the left of the small boxes) and the individual groups (Table). The clinical parameters with the lowest P-values are shown in parentheses in the Table. (b) Gene dendrogram (left) using the 2.5% (548) most varying probe sets and heat map of gene expression in the samples. The gene dendrogram was partitioned (dark black line) arbitrarily to define six gene-cluster groups (a–f, dark blue to red rectangles). Enriched gene ontology (GO) categories (black, Supplementary Table 2 sheet 2) and corresponding genes (blue) of clusters a–f are shown on the right. The genes in the clusters a–f are listed in Supplementary Table 2 sheet 1, and those shown on the right are selected examples.

Table 2). From Cox analysis using a training group (55 samples, S1′), we identified a four-gene model (PSMD10, HSD17B12, FLOT2 and KRT17; Table 3). We analysed its performance in an independent validation group that was not used for selection (79 samples, S2′). It predicted M/NM class membership with an overall success rate of 77% (74% sensitivity (M predicted as M) and 78% specificity (NM predicted as NM)). Using Cox proportional hazard univariate analysis, we showed that the four-gene model was highly associated with the development of metastasis (hazard ratio (HR) of 6.5; 95% confidence interval (CI) 2.4–18.1; P = 0.0003; Table 4). Moreover, the four-gene model performed better than other prognostic factors, including lymph node status, pathological grade and localization. Using multivariate analysis, we showed that the four-gene model retained significant predictive value when adjusted for all the other clinicopathological variables together (Table 4) or individually (Supplementary Table 6). Pathological stage and localization also retain their predictive values when adjusted for all the other variables, whether the four-gene model was included or not (Table 4). Kaplan–Meier curves (Figure 3; Supplementary Figure 2) were used to visualize event-free survival (metastasis) with time of the groups selected by different criteria. The four-gene model successfully selected groups with differences in event-free survival with: the validation group (A), the whole population (B), patients that were N0/N1 or N2+ (C), and patients with pathological stage <IV or =IV (D). The four-gene model performed favourably in comparison with the other clinical criteria considered individually (Supplementary Figure 2A), including node presence (B), stage (C), localization (D), differentiation (E) or age (F). These results show that the four-gene model outperforms and is independent from other clinical variables in predicting the development of metastasis.
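The Kaplan–Meier curves referred to in this section can be illustrated with a minimal sketch of the product-limit estimator. The function name `kaplan_meier` and the toy follow-up data are assumptions introduced here for illustration only; they are not the study's code or data, and the original curves were presumably produced with a statistics package.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(time, survival)] at each time where at least one event occurred.

    times  : follow-up in months for each patient
    events : 1 if metastasis was observed at that time, 0 if censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    exit_counts = Counter(times)  # events and censorings both leave the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving this time point.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve

# Hypothetical toy data (months, event flag); not taken from the study.
times = [6, 10, 10, 14, 20, 36, 40]
events = [1, 1, 0, 1, 0, 0, 0]
curve = kaplan_meier(times, events)
```

With these toy values the event-free fraction drops at months 6, 10 and 14 only; censored patients reduce the number at risk without lowering the curve, which is why censoring marks are drawn separately on such plots.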

Table 1 Overview of clinical and histopathologic characteristics, and sample groups

Parameters | Entered into study: M+NM (n = 186) | Selected for statistical analysis: M+NM (n = 142) | NM (n = 92) | M (n = 50)

Patient characteristics
Sex (numbers)
Male | 165 | 129 | 82 | 47
Female | 21 | 13 | 10 | 3
Age (years), median (range) | 58.1 (35–82) | | 58.7 (35–79) | 56.6 (38–73)
Adjuvant therapy
Radiotherapy | 124 | 106 | 66 | 40
Chemoradiotherapy | 15 | 11 | 1 | 10
None | 27 | 25 | 25 | 0
Clinical follow-up (months), median (range)
Clinical follow-up | 43.0 (5–159) | | 81.2 (36–159) | 27.2 (5–121)
Disease-free survival | 40.5 (1–159) | | 80.5 (36–159) | 17 (1–52)
Global survival | 50.5 (7–159) | | 86.6 (37–159) | 29.4 (7–121)
Patient status (numbers)
Alive | 83 | 72 | 69 | 3
Dead | 103 | 70 | 23 | 47

Tumour characteristics
Localization
Tongue and base of tongue | 43 | 31 | 21 | 10
Oral cavity | 35 | 28 | 25 | 3
Oropharynx | 32 | 23 | 14 | 9
Hypopharynx | 74 | 58 | 30 | 28
Others | 2 | 2 | 2 | 0
Histological size
T1 | 21 | 15 | 11 | 4
T2 | 85 | 68 | 44 | 24
T3 | 61 | 43 | 25 | 18
T4 | 18 | 16 | 12 | 4
na | 1 | | |
Pathological lymph node status
N0 | 50 | 41 | 34 | 7
N+ | 132 | 97 | 54 | 43
na | 4 | 4 | 4 | 0
Pathological stage
I: T1N0M0 | 4 | 3 | 3 | 0
II: T2N0M0 | 26 | 21 | 16 | 5
III: T1-2N1M0 or T3N0M0 | 45 | 35 | 29 | 6
IV: T4N0-1 or T1-4N2-3 | 107 | 79 | 40 | 39
na | 4 | 4 | 4 | 0
Differentiation
Well | 46 | 38 | 29 | 9
Moderately | 88 | 72 | 47 | 25
Poorly | 48 | 31 | 15 | 16
Undifferentiated | 4 | 1 | 1 | 0
HPV status
HPV− | 141 | | |
HPV+ | 19 | | |
na | 26 | | |

Measures
Affymetrix | 98 | 81 | 40 | 41
aCGH | 94 | 74 | 36 | 38
QRT–PCR | 182 | 134 | 87 | 47
Affy+aCGH | 92 | 74 | 36 | 38
Affy+aCGH+QRT–PCR | 91 | 73 | 36 | 37

Analysis
S1 training (array data) | | 40 | 20 | 20
S2 training (array data) | | 20 | 10 | 10
S3 validation (array data) | | 21 | 10 | 11
S1′ training (QRT–PCR data) | | 55 | 28 | 27
S2′ validation (QRT–PCR data) | | 79 | 59 | 20

Abbreviations: aCGH, array-comparative genomic hybridization; HPV, human papillomavirus; na, not annotated. Overviews are given for the patients selected for the CIT study from the CPS tumour collection and the subsequent subgroups used for data generation (measures) with transcriptome (Affymetrix) and genome (CGH) arrays, and bioinformatics analysis (analysis). The term ‘M+NM’ refers to the total number of patients (metastatic (M) plus the non-metastatic (NM)) for the indicated characteristic. The groups (S) used for the analysis of the array (S1–S3) and QRT–PCR (S1′, S2′) data are described in Material and methods. The values are the number of patients or time (age (years) and clinical follow up (months)).


Figure 2 RNA expression by seven genes relative to pathological differentiation. The box plots indicate the distribution of RNA levels measured by QRT–PCR in 134 tumour samples grouped by level of differentiation. The seven genes yielded significant P-values (P < 0.01) based on Affymetrix data testing the different tumour differentiation classes. Colours correspond to the sample groups (green: well differentiated (n = 40); orange: moderately differentiated (n = 68); red: poorly differentiated (n = 29)). The y axis gives the ΔCt values relative to the control genes (R18S and RPLP0). The bottom table shows the fold difference (FC) between the geometric mean values of the different classes and the associated Wilcoxon P-value.
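As a rough illustration of the fold difference (FC) described in this caption, the sketch below takes the ratio of the geometric means of two groups. It assumes the measurements have already been converted to positive, linear relative expression values; that conversion, the function name `fold_difference` and the example numbers are assumptions, not details given in the source.

```python
from statistics import geometric_mean

def fold_difference(group_a, group_b):
    """Ratio of the geometric mean of group_a to that of group_b.

    Both groups must contain strictly positive values (linear scale).
    """
    return geometric_mean(group_a) / geometric_mean(group_b)

# Hypothetical relative expression levels (arbitrary units); not study data.
well_differentiated = [8.0, 4.0, 2.0]    # geometric mean is 4
poorly_differentiated = [1.0, 2.0, 0.5]  # geometric mean is 1
fc = fold_difference(well_differentiated, poorly_differentiated)
```

A ratio of geometric means is equivalent to a difference of mean log-expression values, which is why it suits data that are analysed on a log scale.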

Table 2 Distribution of samples in training and validation groups employed to define top gene combinations based on Affymetrix probe set, aCGH and QRT–PCR data

Measures | Affymetrix+aCGH (n = 81): S1+S2 | Affymetrix+aCGH (n = 81): S3 | QRT–PCR (n = 134): S1′ | QRT–PCR (n = 134): S2′
Affymetrix+aCGH | 5 | 3 | |
Affymetrix+aCGH+QRT–PCR | 55 | 18 | 55 | 18
QRT–PCR | | | 0 | 61

Abbreviation: aCGH, array-comparative genomic hybridisation. Variables were initially measured with transcriptome (Affymetrix) and genome (CGH) arrays. RNA was subsequently quantitated on some of the original and new samples by QRT–PCR. The analysis of the array (S1–S3) and QRT–PCR (S1′, S2′) data is described in Material and methods.

Transcriptomic description of metastatic propensity
In order to explore the molecular mechanisms behind the development of distant metastasis in HNSCC, we identified 614 genes with significantly altered expression levels in M relative to NM using Cox univariate tests on the Affymetrix data from 81 samples (41 M versus 40 NM, P < 0.01; Supplementary Table 7). We selected and validated the differential expression of 22 genes using QRT–PCR analysis of the 134 samples (Figure 4). The 22 genes were selected as representative genes that would validate the microarray analysis, and also by subjective criteria such as encoding proteins that could be important for metastasis (apoptosis, cell-cycle regulation, cell interactions, oncogenesis, poor prognosis and signalling pathways involved in metastasis; see Discussion). We analysed the 614 genes for association with particular biological pathways and functions (Supplementary Material and Methods). The 614 M/NM genes were found to encode, in particular, proteins associated with major functional categories and pathways that could be important for metastasis (RNA processing, development/differentiation, transcription regulation, cell signalling, adhesion/motility, proliferation and metabolism; Table 5; Supplementary Table 7). These factors might confer propensity to metastasize to primary tumours.

Alterations associated with metastatic propensity and correlation with the transcriptome
HNSCC is known to be associated with extensive chromosomal alterations. We searched for chromosomal alterations associated with the propensity to metastasize using aCGH with 74 HNSCCs. As expected, we found numerous chromosome alterations in the overall population (Supplementary Figure 3) that have been observed previously (Baudis and Cleary, 2001; Gollin, 2001; Jarvinen et al., 2006; gains: 3q, 5p, 7q, 8q, 11q13; losses: 3p, 8p, 9p, 11q14–24, 13q, 18q, 21q). We also detected a number of additional alterations (gains at 1q, 12p and Xq and losses at 4p, 7q, 11q, 17p, 19p and Xp; Supplementary Table 8). Importantly, we identified by Fisher’s exact tests and Cox univariate analysis, genomic alterations related to M/NM status, in

Table 3 List of the 19 genes that were used to construct the four-gene model (the four genes of the model are marked with *)

Gene | Log-rank P-value (S1′) | Log-rank P-value (S1′+S2′) | FC | Entrez gene | RefSeq ID | Chromosome location
ZNF6 | 0.00496 | 0.00019 | 2.00 | 7552 | NM_021998 | chrXq21.1
GPRASP2 | 0.00020 | 0.00023 | 1.64 | 114928 | NM_001004051 /// NM_138437 | chrXq22.1
BHLHB9 | 0.00526 | 0.03000 | 1.32 | 80823 | NM_030639 | chrXq22.1
PSMD10* | 0.00203 | 0.00205 | 1.23 | 5716 | NM_002814 /// NM_170750 | chrXq22.3
KIAA1729 | 0.00038 | 0.10000 | 1.23 | 85460 | NM_053042 | chr4p16.1
DHX35 | 0.03000 | 0.01000 | 1.23 | 60625 | NM_021931 | chr20q11.23
HSD17B12* | 0.01000 | 0.01000 | 1.17 | 51144 | NM_016142 | chr11p11.2
ARMCX5 | 0.00005 | 0.11000 | 1.16 | 64860 | NM_022838 | chrXq22.1
YIPF6 | 0.00310 | 0.25000 | 1.08 | 286451 | NM_173834 | chrXq13.1
FLJ31795 | 0.03000 | 0.39000 | 1.05 | 124808 | NM_144609 | chr17q21.31
ZNF77 | 0.00257 | 0.62000 | 1.04 | 58492 | NM_021217 | chr19p13.3
C6orf107 | 0.01000 | 0.64000 | 1.04 | 54887 | NM_017754 | chr6p21.31
SYBL1 | 0.02000 | 0.59000 | 1.02 | 6845 | NM_005638 | chrXq28
GLT28D1 | 0.01000 | 0.45000 | 0.94 | 55849 | NM_018466 | chrXq23
FLOT2* | 0.00805 | 0.18000 | 0.90 | 2319 | NM_004475 | chr17q11.2
GUK1 | 0.01000 | 0.00277 | 0.86 | 2987 | NM_000858 | chr1q42.13
ATP2B4 | 0.00500 | 0.00161 | 0.81 | 493 | NM_001001396 /// NM_001684 | chr1q32.1
LAMB3 | 0.01000 | 0.03000 | 0.74 | 3914 | NM_000228 /// NM_001017402 | chr1q32.2
KRT17* | 0.03000 | 4.61E-06 | 0.47 | 3872 | NM_000422 | chr17q21.2

Abbreviation: FC, fold change. The table gives the gene names, the Cox log-rank P-values for the S1′ (training) and S1′+S2′ (total) populations, the fold change (FC) based on M/NM for the total population, Entrez gene and RefSeq identifiers and chromosomal locations.

Table 4 Cox proportional hazard univariate and multivariate analysis of the validation group (S2′) for future metastasis (M)

Variables | Univariate HR (95% CI) | Univariate P-value$ | Cox multivariate* without ‘four-gene model’ HR (95% CI) | P-value | Cox multivariate* with ‘four-gene model’ HR (95% CI) | P-value
Age at time of surgery, <58 vs >58 years | 2.39 (0.95–5.9) | 0.062 | 2.711 (1.06–6.90) | 0.037 | | >0.1
Lymph node status, N+ vs N0 | 12.55 (1.69–92.9) | 0.013 | | >0.1 | | >0.1
Pathological stage, 4 vs 1–3 | 8.03 (1.85–34.50) | 0.0054 | 6.91 (1.59–29.91) | 0.010 | 6.3 (1.45–27.33) | 0.014
Differentiation, poor (3–4) vs well/moderate (1–2) | 2.22 (0.80–6.12) | 0.1262 | | >0.1 | | >0.1
Tumour localization, hypopharynx vs other localisations | 5.52 (1.99–15.2) | 0.0011 | 4.9 (1.77–13.81) | 0.0024 | 3.8 (1.36–10.62) | 0.0112
Four-gene predictor, high risk vs low risk | 6.53 (2.36–18.10) | 0.0003 | Not included | | 5.29 (1.85–14.79) | 0.0016

Multivariate analysis of the four-gene model and other clinical variables together and in variable pairs. Shown are the hazard ratios (HR) for future metastasis (M) with the associated confidence interval (HR (lower 95th) and HR (upper 95th)), and the P-value derived from a Wald test for each variable in the model. *All variables associated with univariate P-value <0.05 were included in the multivariate model using the stepwise method (variables entered sequentially, checked and removed if the P-value is >0.1, i.e., non significant). $P-value derived from a Wald test for each variable in the model.
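The relation between a hazard ratio, its 95% confidence interval and the Wald P-value in Table 4 can be checked with a short calculation. The sketch below assumes the interval is symmetric on the log scale (the usual construction for a Cox model coefficient) and uses a normal approximation; the function name `wald_from_hr_ci` is introduced here for illustration and is not part of the original analysis.

```python
import math

def wald_from_hr_ci(hr, lower, upper, z_crit=1.959964):
    """Recover the Wald z statistic and two-sided P-value from an HR and its 95% CI.

    Assumes the CI was built as exp(log(HR) +/- z_crit * SE).
    """
    log_hr = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = log_hr / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return z, p

# Univariate values reported in Table 4 for the four-gene predictor.
z, p = wald_from_hr_ci(6.53, 2.36, 18.10)
```

Under these assumptions the univariate entry for the four-gene predictor gives a z statistic of about 3.6 and a two-sided P-value close to the 0.0003 reported in the table.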

particular gains at 4q11–22 and Xq21–28, and losses at 11q14–24 and 17p11–12, q11 in M tumours (Figures 5a and b, see arrows; Supplementary Table 8).

As genomic alterations could be responsible for the altered expression of genes, we analysed the level of gene expression in M versus NM along the chromosomes using windows of 5 × 10^6 base pairs and steps of 2 × 10^6 base pairs. Similar to initial results obtained from gene set enrichment analysis (Subramanian et al., 2005; Supplementary Figure 4), we found a number of chromosome-positional zones of deregulated genes associated with metastatic propensity (Figure 5c). About half of the M/NM gene transcripts (425 out of 729 probe sets) identified using Cox univariate analysis were mapped to chromosome zones of deregulated genes (Supplementary Table 7). These genes, which are both dysregulated and located in altered genomic regions, classify M and NM with a 76.8% success rate (data not shown), which is not better than the four-gene model (77% success). A number of these regions also have DNA copy number alterations associated with metastatic propensity (for example, 4q21;q22, 11q23.1 and Xq22.1; Figure 5; Supplementary Tables 7 and 8). Of interest, one of the genes in the four-gene model, PSMD10, is located in a zone of genes overexpressed in M at Xq22 (Figure 6a, black line; individual genes are indicated with red triangles). This zone corresponds to a chromosomal region that also has a gain in DNA copy


Figure 3 Kaplan–Meier curves of event free survival (future distant metastasis) of the groups selected with the four-gene model, starting with (a) the validation group (n = 79, S2′, group 1 = at risk), (b) the whole population (n = 134, S1′ + S2′), (c) subgroups of the whole population formed with N0/N1 and N2 and (d) subgroups of the whole population formed with pathological stages less than and equal to IV. Censored events are marked under the curves on the graphs. The χ2 (Chisq) and log-rank P-values are shown. The tables below the figures give, at 10-month intervals, the numbers of patients remaining at risk in the groups (a, 0 = predicted as NM, 1 = predicted as M; b, 0 = predicted as NM, 1 = predicted as M; c, 0 = N0/N1 predicted as NM, 1 = N2+ predicted as NM, 2 = N0/N1 predicted as M, 3 = N2+ predicted as M; d, 0 = pSTAGE < IV predicted as NM, 1 = pSTAGE = IV predicted as NM, 3 = pSTAGE < IV predicted as M, 4 = pSTAGE = IV predicted as M).

number associated with M (Figure 6a, solid blue line; 6b shows in detail the aCGH data, with statistically significant regions in yellow). Overexpression is not observed in M samples without genomic gain (dotted black line), indicating that increased expression is linked to gene copy number. We confirmed by QRT–PCR that genes in this region are overexpressed in M samples (Figure 6a, see gene names). We also found zones of deregulated genes that are not associated with changes in DNA copy number associated with metastasis (examples: 1q, 12p, 18p,q and 19p; Figure 5; Supplementary Tables 7 and 8) suggesting other mechanisms of metastasis-related gene regulation.

Discussion

One aim of this study was to develop a predictive signature for patients who, after complete surgical resection as a first treatment and adjuvant therapy, will develop metastases as the first recurrence in a 36-month follow-up. HPV positive samples were eliminated from this study because we found that they had distinct changes at the RNA and DNA levels, and thus deserved to be treated separately (recently confirmed by Ragin et al., 2006; Slebos et al., 2006; Schlecht et al., 2007). Unsupervised analysis of transcripts that varied most between the samples led to the identification of intrinsic groups related to histological differentiation. There was no correlation between the intrinsic groups and M/NM status. The genes that define the different intrinsic groups, as well as the genes that are differentially expressed between different pathological-differentiation groups, are predominantly involved in potentially relevant functions. The poorly differentiated tumours (C1 and C2) overexpress genes involved in embryonic development, cell adhesion, differentiation, motility and extracellular matrix, whereas well-differentiated tumours (C3) overexpress genes involved in metabolism, epithelial cell differentiation and anti-apoptosis. The


Figure 4 QRT–PCR validation of genes associated with metastasis. Transcripts (22) selected from the transcriptome array analysis (Affymetrix; Cox P-value <0.05) were quantitated by QRT–PCR analysis with 134 samples of M (n = 46) and NM (n = 88). Top: box plots (NM blue, M red) representing the distributions of the log2 value of the ΔCt values after power transformation. Bottom: the fold difference (FC) between the geometric mean values from M divided by NM samples and the P-value calculated from the Cox univariate tests for each of the genes, using the entire population of 134 samples analysed by QRT–PCR.

fourth intrinsic group (C4) was not significantly associated with any clinical characteristic, but was weakly associated with localization to the oral cavity. C4 expresses genes involved in muscle differentiation. It remains to be seen whether C4 will be identified in analyses of additional samples, and to determine its origin.

We report a four-gene predictor based on QRT–PCR, which is, to our knowledge, the first predictor of distant metastasis in HNSCC. It is a significant predictor of metastasis, even when combined with the other criteria used for prognosis (age, stage, differentiation and localization) and is independent of HNSCC intrinsic subgroups, suggesting that it could contribute to clinical decision making. However, it would not replace them, because other current clinical criteria (age, stage, lymph node involvement and localization) remain significant when the four-gene model is used for prediction. Further studies will be required before the four-gene model could be ready for clinical decision making (Ludwig and Weinstein, 2005; Glas et al., 2006; Kaklamani and Gradishar, 2006; Ioannidis, 2007). Translating molecular profiles to the clinic generally requires assay development, demonstration and validation of predictive performance, provision of independent information beyond classical predictors, accumulation of evidence, demonstration of clinical efficacy, and then integration into clinical practice (see Ioannidis, 2007). We have developed the assay, demonstrated its predictive performance and validated its performance on an independent group of patients. A new prospective study on a larger population is now ongoing, which consists of studying about 200 new patients and following their clinical outcome over the next three years. Further validation using patients recruited in other hospitals, through collaborations at the national and international levels, is planned. Globally, a number of molecular profiles are being standardized, particularly for breast cancer and for some haematological malignancies. Many others are in the exploratory phase (Ioannidis, 2007).

This study extends and confirms our previous work; 17 of the original 133 ‘descriptors’ of M/NM status were also identified in this study, 16/17 with concordant relative expression (Cromer et al., 2004; Supplementary Table 9). There was little overlap with genes identified in other reports on HNSCC and some key studies on other tumours (Supplementary Table 9), as might be expected from the distinct nature and aims of this study. In our previous work, we were unable to identify genes that could be used to predict whether the patients would develop distant metastasis following treatment. A related study reached the same conclusion (Braakhuis et al., 2006). The previous unsuccessful studies were of smaller scale, which could account for the inability to find predictor gene models. We increased both the number of transcripts and patients analysed by approximately fourfold, and added an analysis of 4.4K BAC probes on CGH arrays. Furthermore, we used RNA levels quantitated by QRT–PCR in the selection procedure. The transcriptome variables alone contributed to the four-gene model. There are genome variables that are significantly different between M and NM

Oncogene HNSCC, distant metastasis signatures and classifiers DS Rickman et al 6615 Table 5 Functional categories and pathways implicated by the genes in relation to metastasis Go-id Genes Genes P-value Term Gene symbol (total) (in list)

(a) 349 Genes upregulated
RNA/DNA processing
GO:0006306 | 22 | 4 | 0.00149 | DNA methylation | CTCF, DNMT3A, DNAJC18, ATRX
GO:0006259 | 560 | 26 | 0.00066 | DNA metabolic process | CTCF, BRD8, DNMT3A, DTYMK, DNAJC18, EPC2, NAP1L5, MSH5, MUTYH, NONO, PHF21A, PHB, ATRX, PPARBP, AHI1, RAD17, RFC1, SMARCA1, TFAM, TP53BP1, SUMO1, WHSC1, HIST1H2BG, SETDB2, MYST1, HIRIP3
GO:0051276 | 275 | 14 | 0.00505 | Chromosome organization and biogenesis | BRD8, DKC1, EPC2, NAP1L5, PHF21A, PHB, ATRX, RFC1, SMARCA1, WHSC1, HIST1H2BG, SETDB2, MYST1, HIRIP3
GO:0016071 | 253 | 20 | 1.98E-06 | mRNA metabolic process | DHX8, DCP1B, ERN1, PPWD1, HNRPH1, NONO, DNAJB11, PNN, PAPD1, THOC2, RBM25, DHX35, SFRS1, SFRS3, SFRS6, UPF3B, SNRPN, PABPN1, SFRS9, EFTUD2
GO:0008380 | 197 | 17 | 3.59E-06 | RNA splicing | MPHOSPH10, DHX8, PPWD1, HNRPH1, NONO, CROP, PNN, THOC2, RBM25, DHX35, SFRS1, SFRS3, SFRS6, SNRPN, SFRS9, EFTUD2, PPIG
GO:0006396 | 384 | 24 | 1.16E-05 | RNA processing | MPHOSPH10, DDX17, DHX8, DKC1, ERN1, PPWD1, HNRPH1, NONO, NOP5/NOP58, CROP, PNN, PAPD1, THOC2, RBM25, DHX35, SFRS1, SFRS3, SFRS6, MRPL44, SNRPN, PABPN1, SFRS9, EFTUD2, PPIG
GO:0006397 | 217 | 16 | 4.86E-05 | mRNA processing | DHX8, ERN1, PPWD1, HNRPH1, NONO, PNN, PAPD1, THOC2, RBM25, DHX35, SFRS1, SFRS3, SFRS6, PABPN1, SFRS9, EFTUD2
GO:0050658 | 49 | 7 | 0.00013 | RNA transport | NUDT4, HRB, NUP107, THOC2, UPF3B, RAE1, CKAP5
GO:0006406 | 18 | 3 | 0.00775 | mRNA export from nucleus | HRB, NUP107, RAE1

Development/differentiation
GO:0045597 | 28 | 5 | 0.00041 | Positive regulation of cell differentiation | FOXO3A, VHL, ACVR1B, BOC, ACVR2A
GO:0030154 | 628 | 27 | 0.00161 | Cell differentiation | ABI2, EDAR, CREB1, CSF3, ETV4, FOXO3A, FRZB, HRB, MGP, ATBF1, NEO1, NFATC1, PPARBP, BEX1, RPS21, SMARCA1, VHL, CDK5RAP3, FXR1, APOLD1, MYST1, EIF2B5, ACVR1B, BOC, ACVR2A, FARP2, FGF19
GO:0048468 | 239 | 13 | 0.00392 | Cell development | ABI2, CREB1, ETV4, FOXO3A, HRB, ATBF1, NEO1, NFATC1, SMARCA1, CDK5RAP3, EIF2B5, FARP2, FGF19

Transcriptional regulation
GO:0045449 | 1600 | 79 | 9.74E-11 | Regulation of transcription | ZBTB33, ZNF263, CTCF, ZMYND11, ZNF271, BRD8, SNF8, CREB1, ZFP90, DDIT3, ZNF92, ZNF384, E2F6, TIGD1, ZNF449, ERN1, ETV4, ETV5, CAND2, FOXO3A, PASK, MGA, EPC2, BRPF3, ZNF621, ZNF789, MECP2, MYCN, ATBF1, NEO1, NFATC1, NONO, PHF20L1, RNF12, PHF20, PHF21A, ESF1, TAF9B, NLK, PHB, PNN, ATRX, PPARBP, PNRC2, FOXJ2, PSMD10, SALL4, JARID1A, RFC1, SALL2, SFRS6, RBM15, SMARCA1, TFAM, TP53BP1, SUMO1, VHL, WHSC1, ZFP161, ZNF711, ZNF24, ZNF26, ZNF133, ZNF193, MTERF, TCEAL4, ZNF606, MED28, ZNF435, MYST1, ZNF397, CGGBP1, TCEAL8, ACVR1B, TIGD7, CBFA2T2, ZNF764, MGC16385, SUPT7L
GO:0006390 | 2 | 2 | 0.00053 | Transcription from mitochondrial promoter | TFAM, MTERF
GO:0030111 | 18 | 4 | 0.00067 | Regulation of Wnt receptor signalling pathway | FRZB, LRP6, NLK, SENP2

(b) 265 Genes downregulated
Development/differentiation
GO:0009888 | 194 | 22 | 2.80E-10 | Tissue development | COL17A1, FLOT2, GJB5, SFN, IVL, KRT5, KRT14, KRT16, KRT17, LAMA3, LAMC2, ZBTB7A, PTGS2, GRHL3, SECTM1, BMP1, SPRR1A, SPRR1B, TGFB1, TUFT1, HES7, KEAP1

Table 5 Continued
GO-id | Genes (total) | Genes (in list) | P-value | Term | Gene symbol

GO:0030216 | 29 | 6 | 3.17E-05 | Keratinocyte differentiation | SFN, ANXA1, IVL, PTGS2, SPRR1A, SPRR1B
GO:0030154 | 628 | 24 | 5.69E-03 | Cell differentiation | NDRG1, BAIAP2, SEMA4B, VAMP5, TIRAP, POLM, SFN, ANXA1, IFI16, IVL, PPP2R1B, PTGS2, PTPRZ1, S100A6, MAPK12, BMP1, SPI1, SPRR1A, SPRR1B, UBC, PAX8, CCM2, AP3D1, KEAP1
GO:0000902 | 314 | 17 | 0.0006 | Cell morphogenesis | VAV3, BAIAP2, CAP1, KCTD11, KRT5, KRT14, PML, PALMD, PPP2R1B, PTPRZ1, RDX, S100A6, SHC1, TGFB1, UBC, CAMK2D, CAPG

Cell signaling
GO:0007243 | 307 | 21 | 4.13E-06 | Protein kinase cascade | CAMKK2, TIRAP, ADORA2B, DAPK3, MKNK2, RHOA, RHOC, MYD88, PPP2R1B, MAPK11, BCL3, SECTM1, SHC1, THOP1, CASP1, CCM2, MKNK1, RPS6KA4, DOK2, MAPKAPK2, STK17A
GO:0007264 | 330 | 21 | 1.25E-05 | Small GTPase mediated signal transduction | VAV3, BAIAP2, RAB32, CENTG3, RHOV, PSD4, HRAS, RAB7B, ARF1, RHOA, RHOC, MYO9B, PLD2, RHOF, RAB25, RALGDS, BCR, RRAD, RAB11A, IQGAP1, DOK2
GO:0007166 | 764 | 35 | 2.99E-05 | Cell-surface receptor-linked signal transduction | VAV3, BAIAP2, CAP1, RGS14, GIPC1, ADCY7, TIRAP, ADORA2B, PTK2B, FCER1G, GNA15, MKNK2, ANXA1, HRAS, ITGB4, GPR153, MYD88, P2RY2, PLD2, PLP2, EDG8, CSNK1G1, PPP2R1B, PTGIR, PXN, SHC1, DST, BSG, TGFB1, RASSF5, CCM2, ADAM15, DOK2, OSMR, CELSR1
GO:0019932 | 126 | 8 | 0.0063 | Second-messenger-mediated signalling | CAP1, CAMKK2, ADCY7, ADORA2B, GNA15, P2RY2, PPP2R1B, PTGIR
GO:0007229 | 50 | 5 | 0.0045 | Integrin-mediated signalling pathway | VAV3, ITGB4, DST, CCM2, ADAM15
GO:0006955 | 424 | 19 | 0.0025 | Immune response | IL18BP, CTSC, TIRAP, F3, FCER1G, GBP1, GBP2, POLM, IFI16, IL18, MYD88, MAPK11, PVRL1, SECTM1, TAP2, TGFB1, C1QBP, TNFSF9, CD58
Adhesion/motility
GO:0007155 | 449 | 24 | 5.34E-05 | Cell adhesion | EVA1, VAV3, PKP3, COL17A1, DSC3, DSG3, PTK2B, FLOT2, IL18, ITGB4, JUP, LAMA3, LAMC2, PKP1, PPP2R1B, PVRL1, PXN, PERP, DST, LY6D, ADAM15, CD44, CELSR1, CD58
GO:0006928 | 258 | 16 | 1.80E-04 | Cell motility | ARPC5, ARPC3, ACTR3, VAV3, CAP1, ANXA1, LAMA3, PTGS2, PXN, S100A2, TGFB1, TPM4, TXN, UBC, VASP, CAPZA1
GO:0030054 | 126 | 12 | 2.00E-05 | Cell junction | PKP3, COL17A1, DSC3, DSG3, GJB2, GJB3, GJB5, JUP, PKP1, PVRL1, DST, TJP2
GO:0008064 | 34 | 7 | 0.0000 | Regulation of actin polymerization and/or depolymerization | ARPC5, ARPC3, EPB49, PFN1, RDX, CAPG, CAPZA1
Proliferation
GO:0008219 | 590 | 25 | 1.24E-03 | Cell death | CDKN1A, DAPK3, GADD45A, DNM2, PTK2B, NALP1, FOSL2, TNFAIP8, ANXA1, IFI16, IL18, PML, ADAMTSL4, PPP2R1B, PRF1, BCL2L1, PERP, TGFB1, YWHAZ, CASP1, RASSF5, TNFSF9, SGPL1, STK17A, EI24
GO:0043065 | 180 | 12 | 5.97E-04 | Positive regulation of apoptosis | CDKN1A, DAPK3, DNM2, NALP1, IFI16, IL18, PML, ADAMTSL4, PPP2R1B, PERP, STK17A, EI24
GO:0045786 | 166 | 11 | 1.05E-03 | Negative regulation of progression through cell cycle | CDKN1A, GAS2L1, SESN3, KCTD11, GADD45A, PML, MAPK12, DST, STK11, TGFB1, RASSF5
GO:0051726 | 447 | 19 | 0.0044 | Regulation of cell cycle | CDK9, CDKN1A, GAS2L1, SESN3, KCTD11, GADD45A, SFN, HRAS, FZR1, PML, PPP2R1B, BCL3, S100A6, MAPK12, SHC1, DST, STK11, TGFB1, RASSF5
GO:0007049 | 697 | 25 | 0.0104 | Cell cycle | CDK9, CDKN1A, GAS2L1, RGS14, SESN3, KCTD11, GADD45A, DNM2, SFN, ANXA1, HRAS, FZR1, PML, PPP2R1B, BCL3, S100A6, MAPK12, SHC1, DST, STK11, TGFB1, UBC, CAMK2B, CAMK2D, RASSF5
Metabolism
GO:0006096 | 37 | 5 | 1.17E-03 | Glycolysis | HK1, RHOC, OGDH, PGAM1, PKM2

The table shows the gene ontology (GO) biological pathways with their GO term identifiers (GO-id), the total number of genes encompassed by each GO term, the number of those genes in the lists of upregulated (a) and downregulated (b) genes in relation to metastasis (M vs NM), the significance levels of the hypergeometric tests (P-value), the GO terms and the symbols of the deregulated genes in the lists.
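As an illustration of the hypergeometric test named in this legend, the following Python sketch computes an upper-tail probability for one row of Table 5. It is a minimal sketch under stated assumptions, not the code used in the study: the function name is hypothetical, and the size of the annotated gene universe is not given in the text, so the value used here is a placeholder and the printed probability is not expected to reproduce the tabulated P-value.

```python
from math import comb


def hypergeom_upper_tail(universe, term_total, list_size, term_in_list):
    """Return P(X >= term_in_list) for a hypergeometric draw.

    universe     -- number of annotated genes considered (assumed value)
    term_total   -- genes annotated to the GO term ("Genes (total)")
    list_size    -- size of the deregulated gene list (349 or 265 in Table 5)
    term_in_list -- list genes annotated to the term ("Genes (in list)")
    """
    largest = min(term_total, list_size)
    # Count the ways of drawing at least `term_in_list` term genes.
    favourable = sum(
        comb(term_total, k) * comb(universe - term_total, list_size - k)
        for k in range(term_in_list, largest + 1)
    )
    return favourable / comb(universe, list_size)


# DNA methylation row of Table 5: 22 genes in the term, 4 of them among the
# 349 upregulated genes. The universe size is a made-up placeholder.
ASSUMED_UNIVERSE = 15000
p_value = hypergeom_upper_tail(ASSUMED_UNIVERSE, 22, 349, 4)
print(f"P(X >= 4) = {p_value:.5f}")
```

The upper tail is used because enrichment asks whether a term contributes at least as many genes to the list as observed.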


Figure 5 Pan-genomic views of genomic gains and losses and the localizations of genes altered in expression in M and NM samples. (a and b) Frequency of gains (y>0) and losses (y<0) in M (a) and NM (b). n = number of samples. Colour codes (Cox univariate): yellow, 0.01<P<0.05; orange, 0.005<P<0.01; red, P<0.005. (c) Frequency of 10 Mb genomic regions enriched in 'M genes' (P<0.001, Fisher's exact test; red: over-; green: underexpressed in M).

patients, but none were selected during model building. The genomic analysis was less exhaustive than the transcriptome analysis, because the resolution of the BAC arrays is lower (55K transcripts versus 4.4K genomic regions). In addition, the RNA and DNA variables may be partially redundant, because we identified zones of altered transcription in genomic regions that are also altered. In fact, about 5% of the top M-associated genes are clustered in amplified genomic regions also associated with M, suggesting that these events are linked and redundant. Transcripts may be more informative than genomic alterations for other reasons. In particular, many transcripts that are clustered in genomic zones and are associated with M (about 50% of the most significant genes) map to genomic regions where the alterations are not associated with M, or are unaltered. There are many mechanisms that could account for the zonal effects on transcription of neighbouring genes, including coregulation of related clusters of genes, and epigenetic modifications that affect large regions of the genome (Esteller, 2007).

The four-gene model efficiently selected M tumours on an independent validation group of 79 samples. Previous studies of HNSCC have identified genes associated with lymph node metastasis, which is also an independent predictor of distant metastasis (Chung et al., 2004; Roepman et al., 2005). Our predictor has a significant HR in N+ patients (5.2 (1.9–17.8), P = 0.0015), showing that it is an independent marker. There are few genes in common between this report and other studies with potentially related aims (lymph node metastasis; Roepman et al., 2005; Zhou et al., 2006) or with paradigm work (Ramaswamy et al., 2003; Glas et al., 2006). In fact, there is also a lack of similarity between the other studies (Supplementary Table 9). This difference in the genes used for prediction does not detract from the validity of the models, whose purpose is to predict events in independent samples (Dupuy and Simon, 2007).

The predictor genes in the four-gene model (PSMD10, HSD17B12, FLOT2 and KRT17) code for proteins with highly relevant biological functions. PSMD10 is an oncoprotein, regulates pRb and p53, and is part of the regulatory 19S proteasome particle (Dawson et al., 2006). HSD17B12 may play a role in breast cancer progression (Song et al., 2006), and could be a target for endocrine-disrupting cancer treatment from its role in the conversion of E1 to E2 (Sanderson, 2006). FLOT2 is a cell-surface protein that regulates key functions for metastasis, including GPCR signalling, actin cytoskeleton structure, invasion, cell–matrix adhesion and spreading (Babuke and Tikkanen, 2007). KRT17 is an unusual cytokeratin. Mutations in KRT17 lead to Jackson–Lawler type pachyonychia congenita and steatocystoma multiplex, and KRT17-null mice develop alopecia. It is rapidly induced by wounding of stratified epithelia, and regulates cell growth and size (van de Rijn et al., 2002; Kaklamani and Gradishar, 2006). KRT17 may be a marker of certain epithelial stem cells (Gu and Coulombe, 2007).

The genes comprising the four-gene model were selected by model building with bioinformatics, which can generate many signatures of similar performance, and small variations in the model-building group can produce a number of models with similar performances (Roepman et al., 2006). Apparently, the genes that we


Figure 6 Transcriptome and genome alterations in M/NM samples on the X chromosome. (a) Transcriptionally deregulated zones (black lines) are aligned with genomic gains and losses (blue lines). The y axis is the log2-ratio of pro-metastasis versus metastasis-free patients. The solid black line represents the ratio M/NM of the median expression values obtained from all probe sets in the 21K list that are mapped within an 8 megabase genomic region, from M samples having the gain shown in grey (n = 9) and NM samples without the gain (n = 34). The dotted black line represents the median of probe sets for pro-metastasis samples with no gain (n = 29) versus the median of metastasis-free samples having no gain (n = 34). The blue solid line is the smoothed log2-ratio of the CGH data from pro-metastasis samples having the gain (n = 9) versus that of the metastasis-free samples having no gain (n = 34). Cox univariate significant genes are labelled with pink and green triangles, which correspond to over- and underexpressed in pro-metastasis versus metastasis-free samples, respectively. Genes used for M/NM prediction (Table 3) and validated using QRT–PCR (Supplementary Table 2) are labelled with gene symbols. (b) The two panels correspond to the aligned regions of chromosome X shown in Figures 3a and b, respectively. Frequency of gains (y>0) and losses (y<0) in M and NM. n = number of samples. Colour codes (Cox univariate): yellow, 0.01<P<0.05; orange, 0.005<P<0.01; red, P<0.005.
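The black lines in panel (a) are log2-ratios of median expression within genomic regions. The Python sketch below shows one way such a value could be computed. It is an illustration only: the function name, the input layout (one linear-scale M value and one NM value per probe set) and the tiling into non-overlapping 8 Mb bins are assumptions, since the legend does not specify how the regions are positioned.

```python
from math import log2
from statistics import median


def windowed_log2_ratio(probes, window=8_000_000):
    """Return {window_start: log2(median M / median NM)} per genomic window.

    probes -- iterable of (position_bp, m_expression, nm_expression), with
              expression on a linear scale (assumed input layout)
    window -- window width in bases; bins do not overlap (assumption)
    """
    bins = {}
    for position, m_value, nm_value in probes:
        start = (position // window) * window
        m_values, nm_values = bins.setdefault(start, ([], []))
        m_values.append(m_value)
        nm_values.append(nm_value)
    return {
        start: log2(median(m_values) / median(nm_values))
        for start, (m_values, nm_values) in sorted(bins.items())
    }


# Toy probe sets (invented positions and intensities).
TOY_PROBES = [
    (1_000_000, 200.0, 100.0),
    (3_000_000, 400.0, 100.0),
    (5_000_000, 800.0, 100.0),
    (9_000_000, 50.0, 100.0),
    (12_000_000, 50.0, 100.0),
]
print(windowed_log2_ratio(TOY_PROBES))  # {0: 2.0, 8000000: -1.0}
```

A positive value marks a window where median expression is higher in M than in NM, and a negative value the reverse.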

have selected for the four-gene model could have key functions related to metastatic propensity, suggesting they are related to the underlying biological mechanisms and could be developed as targets for therapy. Genes that have been identified in related studies could be components of the same pathways, or could belong to other pathways. Further analysis will be required to investigate these possibilities.

Additional biologically important functions could be represented in the more extensive list of 'descriptor' genes, whose expression differs between the M and NM tumours. This list was found to be enriched in genes that encode proteins with expected relevant functions, that in M tumours could lead to increased RNA/DNA processing, transcription and Wnt signalling, and decreased differentiation, adhesion and proliferation (see Supplementary Table 7). Amongst the 22 QRT–PCR-validated descriptor genes (Figure 4), we found a downregulation in M tumours of genes that encode proteins involved in apoptosis (CASP1, DAPK3, IL18, PPP2R1B), negative regulation of the cell cycle (DST) and cell interactions (COL17A1), and upregulation of genes encoding proteins involved in oncogenesis and poor prognosis (MYCN) and Wnt signalling (LRP6). Interestingly, the Wnt pathway has great potential for cancer therapeutic design (Barker and Clevers, 2006). Recent studies have shown that metastasis is a distinct function that is not necessarily linked to the classical properties of oncogenes and tumour suppressor genes (cell division, apoptosis, and so on). This is an emerging field (Nguyen and Massague, 2007) and the HNSCC M/NM 'descriptor' is presumably enriched in metastasis genes. The roles of these genes need to be investigated in model systems, including the use of HNSCC cell lines in cell culture and mice, assaying for cell movement, invasion, growth in suspension and metastasis.

An intriguing finding is that several genes associated with M (BEX1, BEX2, ZNF6, NGFRAP1L1, GPRASP2) are clustered on chromosome Xq21–22, whose chromosome gain is also associated with M (Figure 6; Supplementary Table 7), indicating that they are important for metastasis. BEX1 regulates nerve growth-factor signalling, neuronal differentiation, and cell-cycle progression (Vilar et al., 2006). The BEX1 gene is epigenetically silenced by promoter hypermethylation in malignant gliomas. It is involved in sensitivity to chemotherapy-induced apoptosis and in tumorigenesis (Foltz et al., 2006). BEX2 interacts with LMO2 and regulates transcription (Han et al., 2005). ZNF6 could be a Zn finger . NGFRAP1L1 has no known function, but is possibly involved in apoptosis. GPRASP2 interacts with G-protein coupled receptors and may have a role in signalling (Simonin et al., 2004). This is the first description of this alteration, and merits further investigation, for example in relationship to X-chromosome inactivation in females.

In addition to the genomic alterations associated with M/NM status, we observed many others that were not significantly associated with future metastasis. Some have been described before (Baudis and Cleary, 2001; Gollin, 2001; Jarvinen et al., 2006), others have not been reported. The genomic events not associated with metastasis might be driven by additional biological processes involved in the development of tumours, or may be indirect consequences of genomic instability. Functional studies in model systems will be required to investigate these possibilities.

Our results constitute a rich foundation for future work. The intrinsic groups were apparently selected on the basis of a molecular description of differentiation. This molecular signature may be useful in refining classification of pathological differentiation compared to clinical–histological criteria. Molecular classification could increase the diagnostic value of differentiation in the clinic. Another group of tumours, identifiable by HPV infection and other molecular criteria (manuscript in preparation), may have different prognosis and may benefit from different treatment modalities, such as HPV therapeutic vaccines currently under development. The genome and transcriptome changes may indicate potentially interesting metastatic functions, and the Xq21–22 region may be of particular interest for further investigation. Most interestingly, we have found a 'predictor' (the four-gene model) that can be used to identify individuals whose primary HNSCC exhibit high metastatic potential. This molecular signature may accurately predict clinical outcome and may help to improve therapeutic management of patients with HNSCC in the future.

Materials and methods

Unsupervised classification, gene set enrichment analysis, statistics
See Supplementary Materials and methods.

Patients and samples
Tumour samples came from the Biological Collection of the Centre Paul Strauss. Patients were operated on for primary HNSCC between 1988 and 2003. Tumour samples were collected at the time of surgery, with the patient's informed consent. A fragment was taken near the advancing edge of the primary tumour (avoiding its necrotic centre), immediately frozen in liquid nitrogen and stored at −80 °C. The rest of the tumour was fixed in 6% buffered formaldehyde and embedded in paraffin for histopathological analysis. The UICC TNM system (Sobin and Fleming, 1997) was used for tumour-node-metastasis staging. Histological examination of sections adjacent to each tumour fragment showed that 60–80% were tumour cells.

A total of 186 samples were included in transcriptome and genome array analysis. The criteria for inclusion were: tumour localization (oral cavity, tongue, oropharynx and hypopharynx), no clinically evident distant metastases by conventional clinical and diagnostic radiological examinations (computed tomography), surgical resection was the first treatment, and at least 3 mm of the surgical margins were histologically tumour free. The patients were treated post surgery with adjuvant radiotherapy (RX, the majority); several also had combined chemotherapy (the minority). HPV was detected in 19 patients, and will be reported elsewhere. 142 patients were grouped according to whether, during the clinical follow-up of at least 36 months, they developed distant metastases as the first recurrence (M) or did not (NM). Patients without RX therapy were included only if they did not develop any recurrence (that is, NM patients), and the few cases with combined radiotherapy and chemotherapy only if they developed metastases (that is, M). For tumour characteristics, treatment, clinical follow up and sample group distribution, see Table 1 and Supplementary Table 3.

'Pathological differentiation' is the status of each tumour established by microscopic assessment by a pathologist. HNSCC is classically categorized into three degrees of differentiation, according to the amount of 'keratinization': well differentiated >75%, moderately differentiated 25–75%, poorly differentiated <25%.

Data
See EBI ArrayExpress (E-TABM-302; www.ebi.ac.uk/arrayexpress).

Gene expression and array-comparative genome hybridization
Except when indicated, transcriptome and aCGH analyses used either an assortment of R system software (v1.9.0) packages and Bioconductor (V1.1.1), or original R code.

Nucleic acid preparation. Total RNA was extracted using the RNeasy kit (Qiagen, Courtaboeuf, France) with DNase I treatment. DNA was extracted with phenol-chloroform using standard procedures. Their integrity was verified on an Agilent 2100 Bioanalyser (Agilent Technologies, Palo Alto, CA, USA).

Microarray analyses. Three micrograms of total RNA (10 µg cRNA per hybridization) from 98 tumour samples was amplified, labelled following the manufacturer's one-cycle target labelling protocol (http://www.affymetrix.com), and hybridized to HG-U133 plus 2.0 Affymetrix GeneChip arrays (Affymetrix; GeneChip Fluidics Station 400). The chips were scanned with the Affymetrix GeneChip Scanner 3000 and images analysed using GCOS 1.4. Raw feature data were normalized and log2 intensity expression summary values for each probe set were calculated using robust multi-array
average (RMA, package affy V1.4.32; Irizarry et al., 2003). Probe sets corresponding to control genes or having a '_x_' annotation were masked, yielding a total of 50 406 probe sets available for further analyses. We then calculated the 80th percentile intensity value for each of the 50 406 probe sets and eliminated those with a value less than or equal to 15 intensity units (non-log and after RMA normalization). This yielded 21 906 probe sets for further analysis.

QRT–PCR. A total of 182 HNSCC samples were analysed for 59 genes, of which 31 were for metastasis prediction, 22 for metastasis characterization and 7 for differentiation (Supplementary Table 5). Three micrograms of total RNA was reverse transcribed using the High Capacity Archive kit and random hexamers (Applied Biosystems, Courtaboeuf, France). cDNA quality was assessed using R18S quantification by real-time PCR (coefficient of variation 7% for the entire series). One microliter of cDNA (2 ng of reverse transcribed RNA) was analysed in duplicate using TaqMan Low Density Arrays and the ABI PRISM 7900HT System (Applied Biosystems). The variability of duplicates was less than 5% in each measure; therefore we used the mean of duplicate measures to estimate the level of gene expression. Gene expression was normalized to the average Ct values of two internal controls, R18S and RPLP0 (Vandesompele et al., 2002; GeNorm software).

aCGH. A total of 94 DNA samples, from amongst the 98 samples studied with Affymetrix arrays, were analysed with IntegraChip (IntegraGen, Evry, France) aCGH microarrays containing 4434 BACs. The control was mixed blood DNA from 20 healthy females used at the same concentration. Smoothed, normalized log2-ratio values were partitioned into three groups: gain, no change or loss (GNL). Recurrent or consensus variables were defined as variables that occurred in more than 2 of the 94 samples (see Supplementary Materials and Methods for more details).

Prediction analysis. Initial selection was based on univariate and multivariate Cox analyses (survival R package v2.26) of Affymetrix gene chip variables using 81 samples, divided into three groups: training group S1 (20 M and 20 NM samples), training group S2 (10 M and 10 NM samples) and a validation group S3 (11 M and 10 NM samples). As an initial approach we combined the expression data obtained using Affymetrix gene chips with the array CGH datasets (normalized log-ratios and GNL status, see below). We performed Cox univariate tests using the S1 group of samples and the three different data sets (21K Affymetrix data set, the 4K aCGH log2-ratio values and the 2.2K GNL aCGH values). Using a P-value cut-off of 0.025 we obtained 566, 107 and 92 variables, respectively. A binning approach was applied by independently clustering each resulting S1 data set (average linkage, 1-Pearson correlation coefficient (r)), cutting the variable dendrogram (r = 0.65) and selecting the variable from each cluster that yielded the lowest Cox univariate log-rank P-value. A combined data set was constructed of the selected variables (81 total variables: 61 Affymetrix probe sets, 14 aCGH log2-ratios and 6 aCGH GNL variables). To build the multi-gene (1–4 genes) predictors, Cox multivariate models were constructed using two bottom-up methods: (1) with or (2) without possibility to go backward (that is, drop a gene selected at a previous step). Each of the variables was used to build a Cox model; this model was used to predict class membership for S1 based on the resulting score from each sample (using zero as a threshold, a threshold that was used in all groups). The model was applied to the S2 group of samples. The highest average success rate from the two sets was used to choose the top variable. All combinations of the other variables in the combined data set and the chosen variable (that is, two-variable combination) were then used to construct a two-gene predictor. The two-variable predictor yielding the highest average success rate (for S1 and S2) was chosen and so on until the best four-variable predictor was identified. One- to four-gene models with a success rate lower than 80% were eliminated. The six top models (top model for 2–4 genes (n = 3 models) and for each of the two-step methods) were then applied to the S3 group. This approach yielded two top predictors (total of five genes), one for each step method. To determine the top predictor using the QRT–PCR data of the entire sample population, we included 26 additional genes with the 5 genes mentioned above (a total of 31 genes). The 26 genes were chosen because they were amongst the top 30 genes (ranked using the univariate log-rank P-value) that were characterized (had HUGO gene symbols) and could be analysed with the transcript-matched ABI assay-on-demand. Of the 81 samples studied using Affymetrix arrays, 8 yielded poor quality QRT–PCR 18S control data. Of the remaining 73 samples, the S1 and S2 groups were combined to form the S1′ training group (n = 55 samples) and were used for gene selection (Cox univariate test with log-rank P<0.05) and model construction using the QRT–PCR data. Eighteen S3 samples (that had not been used for selection) were combined with sixty-one additional samples not studied using Affymetrix arrays to form the S2′ validation group (n = 79 samples; Table 2).

Analysis of expression profiles along the chromosome. Chromosome positions of the Affymetrix U133 plus 2.0 probe sets were obtained from Affymetrix (NetAffx analysis centre, http://www.affymetrix.com/analysis/index.affx) and ENSEMBL (http://www.ensembl.org/index.html). Using a window of 5 × 10⁶ bases and 2 × 10⁶-base steps along the chromosome we calculated a local s.d. based on the distribution of the expression values obtained for the non-metastasis samples for each window along the chromosome. We then calculated the P-value of local enrichment (Fisher's exact test) of probe sets that, in the M samples, were two s.d. above or below the median expression value for that window of the non-metastasis samples. Windows containing fewer than five probe sets were not considered. A total of 1523 genomic regions were assessed. We also employed GSEA (Subramanian et al., 2005) using cytoband localization to define gene sets to assess enrichment associated with metastasis.

Supervised analyses. Cox univariate and Wilcoxon non-parametric tests (R survival v2.26 package, and GeneSpring GX 7.3 (Agilent Technologies), respectively) were used to define the differentially expressed gene lists with a significance level of each univariate test of P<0.01, unless otherwise indicated. We have calculated a local FDR estimate for our test results using the R package q-value from B Efron and R Tibshirani. For gene ontology and biological pathways analyses we used a hypergeometric test to measure the association between a gene (feature) list and a biological pathway or a gene ontology term (see Supplementary Materials and Methods for more details).

Acknowledgements

We are extremely thankful for the important contributions made by Jaqueline Godet (CIT project scientific manager), the CIT platforms (Affymetrix IGBMC: Philippe Kastner, Christelle Thibault, Doulaye Dembélé; BAC arrays IGBMC: Stan du Manoir, Christelle Arnold; QRT–PCR RNA analysis St Louis: Jessica Zucmann-Rossi), Christine Macabre (technical help), Fabien Petel (data submission to EBI), the IGBMC core facilities and Guy Bronner (providing tumour samples). Financial support: Ligues Nationale et Régionales (Bas et Haut Rhin) contre le Cancer, CNRS, INSERM.
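The bottom-up model building described under Prediction analysis (the variant without a backward step) can be sketched in Python as follows. This is an illustrative sketch, not the authors' code, and it rests on several assumptions: fitting of the Cox models is replaced by a caller-supplied function that returns the average success rate for a candidate gene set, the gene names and toy weights are invented, and labelling scores above zero as M is an assumed convention (the text specifies only that zero was the threshold).

```python
def classify(scores, threshold=0.0):
    """Label a sample 'M' if its model score exceeds the threshold, else 'NM'.

    The direction (higher score means M) is an assumption of this sketch.
    """
    return ["M" if score > threshold else "NM" for score in scores]


def success_rate(predicted, observed):
    """Fraction of samples whose predicted class matches the observed class."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def forward_select(variables, average_success, max_size=4):
    """Greedy bottom-up selection without a backward step.

    average_success -- callable taking a tuple of variables and returning the
                       average success rate over the training groups (a
                       stand-in for fitting and scoring a Cox model)
    Returns a list of (selected_variables, success_rate), one entry per size.
    """
    chosen, history = [], []
    remaining = list(variables)
    while remaining and len(chosen) < max_size:
        best = max(remaining, key=lambda v: average_success(tuple(chosen + [v])))
        chosen.append(best)
        remaining.remove(best)
        history.append((tuple(chosen), average_success(tuple(chosen))))
    return history


# Hypothetical stand-in for model fitting: each invented gene adds a fixed gain.
TOY_GAIN = {"g1": 0.30, "g2": 0.25, "g3": 0.10, "g4": 0.05, "g5": 0.01}


def toy_average_success(genes):
    return 0.5 + sum(TOY_GAIN[g] for g in genes) / 2


history = forward_select(TOY_GAIN, toy_average_success)
# Discard models with a success rate below 80%, as in the text.
kept = [model for model in history if model[1] >= 0.80]
for genes, rate in kept:
    print(genes, round(rate, 3))
```

Passing the scoring step in as a function keeps the selection loop independent of the survival library used. The variant with a backward step would additionally test dropping each previously selected variable after every addition.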

References

Applebaum KM, Furniss CS, Zeka A, Posner MR, Smith JF, Bryan J et al. (2007). Lack of association of alcohol and tobacco with HPV16-associated head and neck cancer. J Natl Cancer Inst 99: 1801–1810.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
Babuke T, Tikkanen R. (2007). Dissecting the molecular function of reggie/flotillin proteins. Eur J Cell Biol 86: 525–532.
Barker N, Clevers H. (2006). Mining the Wnt pathway for cancer therapeutics. Nat Rev Drug Discovery 5: 997–1014.
Baudis M, Cleary ML. (2001). Progenetix.net: an online repository for molecular cytogenetic aberration data. Bioinformatics 17: 1228–1229.
Boyault S, Rickman DS, de Reynies A, Balabaud C, Rebouissou S, Jeannot E et al. (2007). Transcriptome classification of HCC is related to gene alterations and to new therapeutic targets. Hepatology 45: 42–52.
Braakhuis BJ, Senft A, de Bree R, de Vries J, Ylstra B, Cloos J et al. (2006). Expression profiling and prediction of distant metastases in head and neck squamous cell carcinoma. J Clin Pathol 59: 1254–1260.
Carles A, Millon R, Cromer A, Ganguli G, Lemaire F, Young J et al. (2006). Head and neck squamous cell carcinoma transcriptome analysis by comprehensive validated differential display. Oncogene 25: 1821–1831.
Chung CH, Parker JS, Karaca G, Wu J, Funkhouser WK, Moore D et al. (2004). Molecular classification of head and neck squamous cell carcinomas using patterns of gene expression. Cancer Cell 5: 489–500.
Cromer A, Carles A, Millon R, Ganguli G, Chalmel F, Lemaire F et al. (2004). Identification of genes associated with tumorigenesis and metastatic potential of hypopharyngeal cancer by microarray analysis. Oncogene 23: 2484–2498.
Dawson S, Higashitsuji H, Wilkinson AJ, Fujita J, Mayer RJ. (2006). Gankyrin: a new oncoprotein and regulator of pRb and p53. Trends Cell Biol 16: 229–233.
Dupuy A, Simon RM. (2007). Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst 99: 147–157.
Esteller M. (2007). Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 8: 286–298.
Foltz G, Ryu GY, Yoon JG, Nelson T, Fahey J, Frakes A et al. (2006). Genome-wide analysis of epigenetic silencing identifies BEX1 and BEX2 as candidate tumor suppressor genes in malignant glioma. Cancer Res 66: 6665–6674.
Forastiere A, Koch W, Trotti A, Sidransky D. (2001). Head and neck cancer. N Engl J Med 345: 1890–1900.
Glas AM, Floore A, Delahaye LJ, Witteveen AT, Pover RC, Bakx N et al. (2006). Converting a breast cancer microarray signature into a high-throughput diagnostic test. BMC Genomics 7: 278.
Gollin SM. (2001). Chromosomal alterations in squamous cell carcinomas of the head and neck: window to the biology of disease. Head Neck 23: 238–253.
Gu LH, Coulombe PA. (2007). Keratin function in skin epithelia: a broadening palette with surprising shades. Curr Opin Cell Biol 19: 13–23.
Han C, Liu H, Liu J, Yin K, Xie Y, Shen X et al. (2005). Human Bex2 interacts with LMO2 and regulates the transcriptional activity of a novel DNA-binding complex. Nucleic Acids Res 33: 6555–6565.
Hashibe M, Brennan P, Benhamou S, Castellsague X, Chen C, Curado MP et al. (2007). Alcohol drinking in never users of tobacco, cigarette smoking in never drinkers, and the risk of head and neck cancer: pooled analysis in the International Head and Neck Cancer Epidemiology Consortium. J Natl Cancer Inst 99: 777–789.
Ioannidis JP. (2007). Is molecular profiling ready for use in clinical decision making? Oncologist 12: 301–311.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U et al. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4: 249–264.
Jarvinen AK, Autio R, Haapa-Paananen S, Wolf M, Saarela M, Grenman R et al. (2006). Identification of target genes in laryngeal squamous cell carcinoma by high-resolution copy number and gene expression microarray analyses. Oncogene 25: 6997–7008.
Kaklamani VG, Gradishar WJ. (2006). Gene expression in breast cancer. Curr Treat Options Oncol 7: 123–128.
Le Tourneau C, Velten M, Jung GM, Bronner G, Flesch H, Borel C. (2005). Prognostic indicators for survival in head and neck squamous cell carcinomas: analysis of a series of 621 cases. Head Neck 27: 801–808.
Ludwig JA, Weinstein JN. (2005). Biomarkers in cancer staging, prognosis and treatment selection. Nat Rev Cancer 5: 845–856.
Nguyen DX, Massague J. (2007). Genetic determinants of cancer metastasis. Nat Rev Genet 8: 341–352.
Ragin CC, Taioli E, Weissfeld JL, White JS, Rossie KM, Modugno F et al. (2006). 11q13 amplification status and human papillomavirus in relation to p16 expression defines two distinct etiologies of head and neck tumours. Br J Cancer 95: 1432–1438.
Ramaswamy S, Ross KN, Lander ES, Golub TR. (2003). A molecular signature of metastasis in primary solid tumors. Nat Genet 33: 49–54.
Roepman P, Kemmeren P, Wessels LF, Slootweg PJ, Holstege FC. (2006). Multiple robust signatures for detecting lymph node metastasis in head and neck cancer. Cancer Res 66: 2361–2366.
Roepman P, Wessels LF, Kettelarij N, Kemmeren P, Miles AJ, Lijnzaad P et al. (2005). An expression profile for diagnosis of lymph node metastases from primary head and neck squamous cell carcinomas. Nat Genet 37: 182–186.
Sanderson JT. (2006). The steroid hormone biosynthesis pathway as a target for endocrine-disrupting chemicals. Toxicol Sci 94: 3–21.
Schlecht NF, Burk RD, Adrien L, Dunne A, Kawachi N, Sarta C et al. (2007). Gene expression profiles in HPV-infected head and neck cancer. J Pathol 213: 283–293.
Simonin F, Karcher P, Boeuf JJ, Matifas A, Kieffer BL. (2004). Identification of a novel family of G protein-coupled receptor associated sorting proteins. J Neurochem 89: 766–775.
Slebos RJ, Yi Y, Ely K, Carter J, Evjen A, Zhang X et al. (2006). Gene expression differences associated with human papillomavirus status in head and neck squamous cell carcinoma. Clin Cancer Res 12: 701–709.
Sobin LH, Fleming ID. (1997). TNM Classification of Malignant Tumors, fifth edition (1997). Union Internationale Contre le Cancer and the American Joint Committee on Cancer. Cancer 80: 1803–1804.
Song D, Liu G, Luu-The V, Zhao D, Wang L, Zhang H et al. (2006). Expression of aromatase and 17beta-hydroxysteroid dehydrogenase types 1, 7 and 12 in breast cancer. An immunocytochemical study. J Steroid Biochem Mol Biol 101: 136–144.

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102: 15545–15550.

van de Rijn M, Perou CM, Tibshirani R, Haas P, Kallioniemi O, Kononen J et al. (2002). Expression of cytokeratins 17 and 5 identifies a group of breast carcinomas with poor clinical outcome. Am J Pathol 161: 1991–1996.

Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A et al. (2002). Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol 3: RESEARCH0034.

Vilar M, Murillo-Carretero M, Mira H, Magnusson K, Besset V, Ibanez CF. (2006). Bex1, a novel interactor of the p75 neurotrophin receptor, links neurotrophin signaling to the cell cycle. EMBO J 25: 1219–1230.

Zhou X, Temam S, Oh M, Pungpravat N, Huang BL, Mao L et al. (2006). Global expression-based classification of lymph node metastasis and extracapsular spread of oral tongue squamous cell carcinoma. Neoplasia 8: 925–932.

Supplementary information accompanies the paper on the Oncogene website (http://www.nature.com/onc)
