Oncogene (2005) 24, 7105–7113 © 2005 Nature Publishing Group All rights reserved 0950-9232/05 $30.00 www.nature.com/onc

Two subclasses of lung squamous cell carcinoma with different expression profiles and prognosis identified by hierarchical clustering and non-negative matrix factorization

Kentaro Inamura1,6, Takeshi Fujiwara2,6, Yujin Hoshida2, Takayuki Isagawa2, Michael H Jones3, Carl Virtanen3, Miyuki Shimane2, Yukitoshi Satoh4, Sakae Okumura4, Ken Nakagawa4, Eiju Tsuchiya5, Shumpei Ishikawa2, Hiroyuki Aburatani2, Hitoshi Nomura2 and Yuichi Ishikawa*,1

1Department of Pathology, The Cancer Institute, Japanese Foundation for Cancer Research, 3-10-6 Ariake, Koto-ku, Tokyo 135-8550, Japan; 2Laboratory for Systems Biology and Medicine, Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan; 3Chugai Pharmaceutical Company, Tokyo, Japan; 4Department of Chest Surgery, Cancer Institute Hospital, Japanese Foundation for Cancer Research, Tokyo, Japan; 5Kanagawa Cancer Center, Yokohama, Kanagawa, Japan

Current clinical and histopathological criteria used to define lung squamous cell carcinomas (SCCs) are insufficient to predict clinical outcome. To make a clinically useful classification by expression profiling, we used a 40 386 element cDNA microarray to analyse 48 SCC, nine adenocarcinoma, and 30 normal lung samples. Initial analysis by hierarchical clustering (HC) allowed division of SCCs into two distinct subclasses. An additional independent round of HC induced a similar partition, and consensus clustering with the non-negative matrix factorization approach indicated the robustness of this classification. Kaplan–Meier analysis with the log-rank test pointed to a nonsignificant difference in survival (P = 0.071), but the likelihood of survival to 6 years was significantly different between the two groups (40.5 vs 81.8%, P = 0.014, Z-test). Biological process categories characteristic for each subclass were identified statistically, and upregulation of cell-proliferation-related genes was evident in the subclass with poor prognosis. In the subclass with better survival, genes involved in differentiated intracellular functions, such as the MAPKKK cascade, ceramide metabolism, or regulation of transcription, were upregulated. This work represents an important step toward the identification of clinically useful classification for lung SCC.
Oncogene (2005) 24, 7105–7113. doi:10.1038/sj.onc.1208858; published online 27 June 2005

Keywords: lung; squamous cell carcinoma; cDNA microarray; hierarchical clustering; non-negative matrix factorization

*Correspondence: Y Ishikawa; E-mail: [email protected]
6These authors contributed equally to this work
Received 9 February 2005; revised 13 May 2005; accepted 19 May 2005; published online 27 June 2005

Introduction

Lung cancer is the leading cause of cancer death in men and women worldwide and continues to rise in frequency. Lung carcinomas are generally classified as either small-cell lung carcinoma (SCLC) or non-SCLC (NSCLC) type. Within these groups further distinctions are made, with NSCLC classified as adenocarcinoma (AC), squamous cell carcinoma (SCC), and large-cell carcinoma. After AC, SCC is the most frequent cancer histology, accounting for approximately 30% of all lung cancers. Its development is the most strongly related to inhaled carcinogens, as in cigarette smoke. For ACs, subclassification by differentiation grade (Garber et al., 2001) or histological pattern (Miyoshi et al., 2003) is useful to predict clinical outcome. For SCCs, differentiation grade is used for pathological subclassification, but it correlates poorly with prognosis. Although SCCs demonstrate some histological variation, such as with the basaloid variant, this does not allow good prediction of clinical outcome. The present system used to subclassify SCC is thus insufficient. In this study, we attempted to make a clinically useful classification based on gene expression profiling.

Gene expression profiling provides a direct method of surveying gene transcription and has increased our understanding of several cancers including lung carcinomas (Bhattacharjee et al., 2001; Garber et al., 2001; Nacht et al., 2001; Miura et al., 2002; Virtanen et al., 2002; Borczuk et al., 2003; Amatschek et al., 2004; Jones et al., 2004; Kettunen et al., 2004; Tomida et al., 2004). Here, we used microarray-based expression profiling techniques to subclassify a series of lung SCCs. Our initial analysis using hierarchical clustering (HC) allowed division into two distinct groups (SCC-A and SCC-B). However, HC has the disadvantages that it depends largely on the genes selected in the filtering process, is highly sensitive to the metric used to assess similarity, and typically requires subjective evaluation to define clusters. To ensure that the initial classification of SCC samples was accurate and robust, we therefore performed an additional round of HC with quite different gene filtering and consensus clustering with the non-negative matrix factorization (NMF) approach (Lee and Seung, 1999; Brunet et al., 2004). NMF, a novel dimensionality reduction technique, has been recently proposed to be useful for clustering expression data (Brunet et al., 2004). NMF appears to be more accurate and robust to the choice of input genes than HC. Furthermore, NMF can be combined with a quantitative evaluation of the robustness of the number of clusters. To our knowledge, this is the first report in which the NMF approach was adopted to confirm the robustness of a classification indicated by HC in gene expression profiling.

Results

First, we wished to identify genes whose expression patterns are characteristic of SCC. A statistical comparison (Welch’s ANOVA, Bonferroni correction, P = 0.05) was therefore used to identify genes differentially expressed between SCC, AC, and normal lung. A set of 4689 genes was selected (see Materials and methods). Two-way HC of data for 48 SCCs, nine ACs, and 30 normal lung samples against the gene set was performed using a Pearson correlation around 0 (Figure 1a). In the dendrogram, two major clusters were defined. The first contained all tumor samples and was further divided into three groups containing 25 SCCs, 23 SCCs, and all nine ACs. The second major cluster contained all normal lung samples.

Groups of genes whose expression patterns were characteristic of SCC were identified. A group of 1331 genes, annotated as ‘Up in SCC’ in Figure 1a, was commonly upregulated across all SCCs, while another group of 1458, annotated as ‘Down in SCC’, was commonly downregulated in all SCCs. The top 15 genes are listed in Table 1. The top upregulated genes in SCCs included several genes encoding epithelial cell-specific markers. Keratin 5 (KRT5) and KRT14 are specifically expressed in the basal layer of the epidermis and have been reported to be overexpressed in lung SCCs (Borczuk et al., 2003; Amatschek et al., 2004; Kettunen et al., 2004). Mutations in these genes are associated with the skin-blistering disorder, epidermolysis bullosa simplex. Besides KRT5 and KRT14, the 1331 upregulated genes included many keratin gene family members (KRT6B, 12, 13, 16, 25D, KRTHA1). Desmocollin 3 (DSC3) and desmoglein 3 (DSG3) are desmosomal cadherin family members engaged in cell adhesion. These two genes are expressed in the basal and suprabasal layers of epidermis and have been reported to be overexpressed in lung SCCs (Amatschek et al., 2004; Kettunen et al., 2004). The top downregulated gene was caveolin-1 (CAV1), whose expression has been reported to correlate with poor prognosis of lung SCCs (Yoo et al., 2003). The constant and marked downregulation of CAV1 might result from massive expression of CAV1 in the normal lung samples. In a recent report, CAV1 was constantly downregulated in lung cancer samples by cDNA array, but CAV1 expression was detected immunohistochemically in half of the samples (Wikman et al., 2004). TTF-1 and SP-A, which are useful markers for type 2 pneumocyte differentiation, were included in the 1458 downregulated genes.

Figure 1 (a) Hierarchical clustering for 4689 genes against 48 SCCs (red tree branches), nine ACs (light blue tree branches), and 30 normal lung samples (yellow tree branches). Columns represent samples and rows represent genes. Red, green, and black indicate high, low, or intermediate relative expression, respectively. Samples were essentially divided into a tumor cluster and a normal one. The tumor cluster was further divided into 25 SCCs, 23 SCCs, and all nine ACs. Characteristic groups of genes, annotated as ‘Up in SCC’, ‘Down in SCC’, and ‘Different in SCC’, were identified. (b) Hierarchical clustering of all SCC samples against 432 genes (‘Different in SCC’) identified in Figure 1a. Segregation of SCCs into two clear groups, annotated as SCC-A (red tree branches) and SCC-B (yellow tree branches), is indicated

Using the EASE analysis (see Materials and methods), we identified biological process categories that showed a disproportionately high number of coregulated genes (significant over-representation in those categories). The Biological Process categories over-represented by up- and downregulated

Table 1 The top 15 up- and downregulated genes in SCCs as calculated using the ANOVA statistic

Accession | Symbol | Description | P-value

Top 15 upregulated genes in SCCs (‘Up in SCC’)
AA160507 | KRT5 | Keratin 5 (epidermolysis bullosa simplex, Dowling–Meara/Kobner/Weber–Cockayne types) | 1.35E-15
AI356363 | SFN | Stratifin | 5.00E-13
R17667 | SLC2A1 | Solute carrier family 2 (facilitated glucose transporter), member 1 | 7.27E-11
R28481 | DSC3 | Desmocollin 3 | 4.26E-10
AA936183 | TPX2 | TPX2, microtubule-associated protein homolog (Xenopus laevis) | 8.50E-10
AI017224 | SELI | Selenoprotein I | 1.19E-09
AA702666 | TCBA1 | T-cell lymphoma breakpoint-associated target 1 | 1.38E-09
W80635 | ALS2CR19 | Amyotrophic lateral sclerosis 2 (juvenile) region, candidate 19 | 1.52E-09
N70010 | CDCA5 | Cell division cycle associated 5 | 1.77E-09
H44051 | KRT14 | Keratin 14 (epidermolysis bullosa simplex, Dowling–Meara, Koebner) | 1.93E-09
W68630 | DSG3 | Desmoglein 3 (pemphigus vulgaris antigen) | 2.06E-09
AA504348 | TOP2A | Topoisomerase (DNA) II alpha 170 kDa | 2.26E-09
AI208433 | PSMA8 | Proteasome (prosome, macropain) subunit, alpha type, 8 | 2.82E-09
AA058663 | PVRL1 | Poliovirus receptor-related 1 (herpesvirus entry mediator C; nectin) | 3.15E-09
AA421311 | OSGEP | O-sialoglycoprotein endopeptidase | 3.18E-09

Top 15 downregulated genes in SCCs (‘Down in SCC’)
AA487560 | CAV1 | Caveolin 1, caveolae protein, 22 kDa | 3.11E-19
AA878093 | SRCRB4D | Scavenger receptor cysteine rich domain containing, group B (4 domains) | 6.06E-16
AA706788 | PGM5 | Phosphoglucomutase 5 | 8.69E-15
AA489331 | ADARB1 | Adenosine deaminase, RNA-specific, B1 (RED1 homolog rat) | 4.06E-14
AI000450 | RASL12 | RAS-like, family 12 | 5.68E-14
AA700164 | PHACTR2 | Phosphatase and actin regulator 2 | 2.45E-13
W74536 | AGER | Advanced glycosylation end product-specific receptor | 3.97E-13
AI362029 | NR2F1 | Nuclear receptor subfamily 2, group F, member 1 | 5.45E-13
AA485883 | VWF | Von Willebrand factor | 5.67E-13
AA488418 | PALM2-AKAP2 | PALM2-AKAP2 protein | 5.94E-13
R32440 | EPAS1 | Endothelial PAS domain protein 1 | 6.54E-13
N70193 | ADAMTSL3 | ADAMTS-like 3 | 1.43E-12
AI401684 | SIAT7E | Sialyltransferase 7 ((alpha-N-acetylneuraminyl-2,3-beta-galactosyl-1,3)-N-acetyl galactosaminide alpha-2,6-sialyltransferase) E | 1.49E-12
AA634308 | ABCA8 | ATP-binding cassette, subfamily A (ABC1), member 8 | 2.99E-12
AA917316 | ADAMTS8 | A disintegrin-like and metalloprotease (reprolysin type) with thrombospondin type 1 motif, 8 | 6.81E-12

P-values were calculated using Welch’s ANOVA (Bonferroni correction, P = 0.05) between SCC, AC, and normal lung samples. ESTs are not included

genes in SCCs are shown in Table 2. Cell-proliferation-related categories were mainly over-represented by upregulated genes in SCCs (e.g. transcription factor DP1). In the over-represented metabolism category, there were some genes probably involved in metabolism of tobacco carcinogens by their oxidoreductase activity (e.g. AKR1B10, C1, C3, ALDH3A1, GPX2, peroxiredoxin). Aldoketo reductase family 1, member C (AKR1C) is implicated in the conversion of polycyclic aromatic hydrocarbons into active carcinogens in lung (Palackal et al., 2002). Overexpression of AKR1B10 appears characteristic of smokers’ non-small-cell lung carcinomas (Fukumoto et al., 2005). The over-represented ectoderm development category consisted mainly of genes involved in epidermal development (e.g. KRT5, 6B, 13, SPRR1B, KRTHA1, ). Tumor protein p63 (TP73L), an oncogene amplified in SCC that is involved in epidermal morphogenesis (Mills et al., 1999; Yang et al., 1999; Hibi et al., 2000), was also included. Many of the categories over-represented by downregulated genes in SCCs were associated with communication between cells and their surroundings. These included death-associated protein kinase 1 (DAPK1), an inducer of apoptosis through extracellular signaling (Deiss et al., 1995). Aberrant promoter hypermethylation of DAPK is early and frequent in lung tumors induced by smoking (Esteller et al., 2001; Pulling et al., 2004).

Besides the genes that were regulated uniformly across all SCCs, a group of 432 whose expression differed between the two SCC clusters was identified and annotated as ‘Different in SCC’ in Figure 1a. Two-way HC of all SCC samples against these 432 genes was performed using Euclidean distances for the sample tree and Pearson correlations for the gene tree (Figure 1b). This resulted in the segregation of SCCs into two clear groups of 34 (SCC-A) and 14 (SCC-B). To verify the accuracy and robustness of this classification, we additionally performed an independent round of HC and adopted the NMF approach.

For the additional independent round of HC, a group of 3891 genes that were up- or downregulated in SCC samples was selected (see Materials and methods). Two-way HC of all SCC samples was performed using a Pearson correlation around 0 (Figure 2). In the dendrogram, SCC-B samples formed a distinct independent cluster, except for one sample. Considering the susceptibility of HC to prior gene selection, a mostly independent cluster of SCC-B samples might indicate a certain consistency for the initial classification of SCCs (SCC-A and SCC-B).
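The two similarity measures named above can be illustrated with a minimal sketch. It assumes that ‘Pearson correlation around 0’ denotes the uncentered correlation, in which the mean of each profile is taken to be 0 (reasonable for median-normalized log ratios); the exact GeneSpring definition is not given in the text, and the function names are hypothetical.

```python
import math

def pearson_around_zero(x, y):
    """Uncentered Pearson correlation: profile means are assumed to be 0."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def correlation_distance(x, y):
    """Dissimilarity derived from the correlation (0 = identical direction)."""
    return 1.0 - pearson_around_zero(x, y)

def euclidean_distance(x, y):
    """Straight-line distance, as used here for the sample tree in Figure 1b."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two profiles with the same shape but different amplitude:
profile_a = [1.0, 2.0, 3.0]
profile_b = [2.0, 4.0, 6.0]
similarity = pearson_around_zero(profile_a, profile_b)
distance = euclidean_distance(profile_a, profile_b)
```

The example shows why the choice of metric matters for HC: the two profiles are perfectly correlated, yet their Euclidean distance is far from zero.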

Table 2 Biological process categories over-represented by up- and downregulated genes in SCCs

Category | Total (539/7174; 7.5%) | EASE score

Top 10 over-represented categories for ‘Up in SCC’ 1331 genes
Cell cycle | (87/503; 17.3%) | 6.68E-14
Mitotic cell cycle | (50/240; 20.8%) | 4.93E-11
Cell proliferation | (102/728; 14.0%) | 2.05E-10
DNA metabolism | (58/343; 16.9%) | 5.26E-09
Nuclear division | (26/115; 22.6%) | 9.05E-07
Metabolism | (360/4114; 8.8%) | 2.32E-06
Chromatin assembly/disassembly | (16/60; 26.7%) | 2.46E-05
DNA replication | (20/96; 20.8%) | 7.35E-05
Protein metabolism | (154/1551; 9.9%) | 6.25E-05
Ectoderm development | (11/38; 28.9%) | 0.000358

Top 10 over-represented categories for ‘Down in SCC’ 1458 genes
Cell communication | (208/1943; 10.7%) | 3.95E-08
Signal transduction | (162/1515; 10.7%) | 3.57E-06
Antigen processing, exogenous antigen via MHC class II | (7/12; 58.3%) | 0.000131
Cell surface receptor-linked signal transduction | (67/584; 11.5%) | 0.000959
Cell adhesion | (48/408; 11.8%) | 0.00351
Protein amino-acid phosphorylation | (42/372; 11.3%) | 0.0131
Actin filament-based process | (11/61; 18.0%) | 0.018
Morphogenesis | (64/629; 10.2%) | 0.0187
Intracellular signaling cascade | (61/599; 10.2%) | 0.0217
Regulation of biological process | (40/366; 10.9%) | 0.0246

The top 10 over-represented categories are shown. Numerous other similar categories are not included to reduce redundancy. Significant functional categories are those with a higher ratio of identified genes to all genes on the array for associations with that category, relative to the ratio of total identified genes to all genes for associations with all categories. Association numbers approximate but are not exactly equal to gene numbers in a category. The ratio of associations for that category and the percentage given by that ratio are shown in the second column. The analogous ratios for total identified genes are shown in the heading (Total). EASE Score, modified Fisher’s exact test P-value

Figure 2 Additional independent round of HC of all SCC samples against 3891 genes up- or downregulated in SCCs. Tree branches are colored for SCC-A (red) and SCC-B (yellow)

For the NMF approach (see Materials and methods), 3344 genes were selected as a set of input genes for all 48 SCC samples. Reordered consensus matrices at k = 2–7 and cophenetic correlation coefficients corresponding to the HC of consensus matrices are shown in Figure 3. Clear block diagonal patterns attest to the robustness of the classes for k = 2, and the cophenetic correlation, which was highest at k = 2, quantitatively indicated the two-centroid clustering to be the most robust. In the k = 2 model, all SCC samples were distinctly divided into two classes corresponding to SCC-A and SCC-B samples.

HC typically requires subjective evaluation to define clusters. Depending on the viewpoint, Figure 1b appears to indicate a three-class partition of the SCC samples with a subpartitioning of SCC-A. In the k = 3 NMF partition, however, one of the three classes corresponded roughly to SCC-B, but the remaining two classes did not agree with the subpartitioning of SCC-A from HC, indicating the vulnerability of the three-class partition (Figure 3c). In brief, by the NMF approach, validity of the two-class partition for SCC samples was indicated and the classification was the same as that initially defined (SCC-A and SCC-B).

By the additional independent round of HC and the NMF approach, we were able to confirm the accuracy and robustness of our SCC classification (SCC-A and SCC-B). Since clinical samples were grossly dissected, expression profiling might have been influenced by the varying amounts of normal tissue included. Therefore, we reviewed the hematoxylin–eosin histology of SCC samples. In normal tissue adjacent to neoplastic cells, there were no distinct characteristics, such as prominent fibrosis, necrosis, or inflammation. It is thus unlikely that our SCC classification depends on non-neoplastic components.

To identify genes differentially expressed between SCC-A and SCC-B, we performed a statistical comparison (Welch’s t-test, Bonferroni correction, P = 0.05). This yielded a set of 1590 genes, comprising 91 with expression significantly higher in SCC-A and 1499 genes with expression significantly higher in SCC-B. In the latter, 316 of the ‘Different in SCC’ 432 genes were included. The top 15 genes are listed in Table 3. V-akt murine thymoma viral oncogene homolog 2 (AKT2) was selectively upregulated in SCC-B (P = 4.83E-09, t-test). AKT2 can contribute to tumor cell progression by mediating phosphoinositide 3-OH kinase-dependent effects on adhesion, motility, invasion, and metastasis (Arboleda et al., 2003). A recent study showed AKT to promote epithelial mesenchymal transitions in squamous carcinoma cell lines (Grille et al., 2003). Using EASE analysis, we identified biological process categories that showed significant over-representation. The top 10 over-represented categories for the 91 and the 1499 genes are shown in Table 4. Cell-proliferation-related categories were mainly over-represented by the 91 genes


Figure 3 (a) Reordered consensus matrices averaging 50 connectivity matrices computed at k = 2–7 for all SCC samples with 3344 genes. Samples were hierarchically clustered, colored from 0 (blue, samples are never in the same cluster) to 1 (red, samples are always in the same cluster). (b) Cophenetic correlation coefficients for the hierarchically clustered matrices. (c) Illustration of a hierarchy in the NMF classes. The NMF class assignments for k = 2–7 are shown color-coded. The dendrogram is identical to that of Figure 1b
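The consensus matrices in Figure 3a can be illustrated with a minimal sketch. It assumes each clustering run is given as a list of class labels, one per sample; the function names and the three toy runs are hypothetical and are not taken from the code used in the study.

```python
def connectivity_matrix(labels):
    """Entry (i, j) is 1 if samples i and j fall in the same cluster, else 0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]

def consensus_matrix(runs):
    """Average the connectivity matrices of several clustering runs."""
    n = len(runs[0])
    total = [[0.0] * n for _ in range(n)]
    for labels in runs:
        conn = connectivity_matrix(labels)
        for i in range(n):
            for j in range(n):
                total[i][j] += conn[i][j]
    return [[value / len(runs) for value in row] for row in total]

# Three toy runs on four samples. Label names may swap between runs
# (run 2 versus run 1) without affecting the connectivity matrix.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
consensus = consensus_matrix(runs)
```

In this toy case samples 0 and 1 always cluster together (entry 1), samples 0 and 3 never do (entry 0), and samples 2 and 3 co-cluster in two of three runs. Entries close to 0 or 1 throughout correspond to the clear block diagonal pattern described for k = 2.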

Table 3 The top genes associated with SCC-A and SCC-B as calculated using the t-test statistic

Accession | Symbol | Description | P-value

Top 15 genes selectively upregulated in SCC-A
R43328 | KIAA0974 | KIAA0974 protein | 5.15E-06
N67639 | CS | Citrate synthase | 1.77E-05
H05112 | SMNDC1 | Survival motor neuron domain containing 1 | 9.25E-05
H10788 | CIT | Citron (rho-interacting, serine/threonine kinase 21) | 9.32E-05
AA884837 | C10orf42 | Chromosome 10 open reading frame 42 | 1.24E-04
AA457671 | P4HA1 | Procollagen-proline, 2-oxoglutarate 4-dioxygenase (proline 4-hydroxylase), alpha polypeptide I | 2.91E-04
AA682438 | DHFR | Dihydrofolate reductase | 4.66E-04
AA427401 | FLJ11730 | Sarcoma antigen NY-SAR-91 | 6.23E-04
T83665 | FLJ36874 | Hypothetical protein FLJ36874 | 9.56E-04
R07115 | PH-4 | Hypoxia-inducible factor prolyl 4-hydroxylase | 0.001102
AA488332 | PA2G4 | Proliferation-associated 2G4, 38 kDa | 0.001268
AA426039 | KIAA0391 | KIAA0391 protein | 0.001347
AA455448 | CD47 | CD47 antigen (Rh-related antigen, integrin-associated signal transducer) | 0.001358
N69466 | CPSF2 | Cleavage and polyadenylation specific factor 2, 100 kDa | 0.001384
AI005540 | KIAA0261 | KIAA0261 protein | 0.002167

Top 15 genes selectively upregulated in SCC-B
H63241 | USP24 | Ubiquitin-specific protease 24 | 2.25E-10
R91573 | SOS1 | Son of sevenless homolog 1 (Drosophila) | 3.91E-10
AI049504 | BAZ2B | Bromodomain adjacent to zinc-finger domain, 2B | 4.69E-10
H04202 | KIAA0635 | Centrosomal protein 4 | 5.01E-10
AI050031 | PUM2 | Vacuolar protein sorting 35 (yeast) | 6.18E-10
AI061229 | CENTB2 | Centaurin, beta 2 | 7.37E-10
AI127075 | ARF4 | ADP-ribosylation factor 4 | 7.52E-10
H90688 | MLL3 | B melanoma antigen family, member 4 | 7.61E-10
AI299294 | PPP1R8 | Protein phosphatase 1, regulatory (inhibitor) subunit 8 | 9.89E-10
AA464550 | SPCS1 | Signal peptidase complex subunit 1 homolog (S. cerevisiae) | 1.13E-09
H91476 | DECR1 | 2,4-dienoyl CoA reductase 1, mitochondrial | 1.84E-09
AI189495 | MGC4093 | Hypothetical protein MGC4093 | 1.90E-09
N69574 | GPHN | Gephyrin | 3.25E-09
H04826 | ASH1L | Ash1 (absent, small, or homeotic)-like (Drosophila) | 3.66E-09
AA680099 | AKT2 | V-akt murine thymoma viral oncogene homolog 2 | 4.83E-09

P-values were calculated using Welch’s t-test (Bonferroni correction, P = 0.05) between SCC-A and SCC-B samples. ESTs are not included
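The per-gene comparison named in the footnote can be sketched as follows. This is a minimal illustration of Welch’s t statistic with the Welch–Satterthwaite degrees of freedom and of a Bonferroni-adjusted significance level; the function names, the toy expression values, and the number of tests are hypothetical (the text does not state how many tests entered the correction).

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two groups may have unequal variances.
    """
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def bonferroni_threshold(alpha, n_tests):
    """Per-gene significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests

# Toy log expression ratios for one gene in two groups (hypothetical values).
group_a = [1.0, 2.0, 3.0, 4.0]
group_b = [2.0, 3.0, 4.0, 5.0]
t_stat, dof = welch_t(group_a, group_b)
threshold = bonferroni_threshold(0.05, 1000)  # 1000 tests is an assumed count
```

Converting the statistic to a P-value needs the t distribution, which the Python standard library does not provide; a statistics package would be used for that step, and a gene would be retained only if its P-value fell below the adjusted threshold.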

highly expressed in SCC-A. These included BUB3, a mitotic checkpoint regulator (Taylor et al., 1998), and ZW10 interactor. For the 1499 genes highly expressed in SCC-B, the over-represented categories were mainly related to differentiated intracellular functions such as the MAPKKK cascade, ceramide metabolism, or regulation of transcription. For example, RASGRP3, involved in the MAPKKK cascade (Rebhun et al., 2000), or UGCG, a ceramide glucosyltransferase (Ichikawa et al., 1996), were included. Positive regulation of apoptosis was also an over-represented category and included the deleted in colorectal carcinoma gene (DCC) (Mehlen et al., 1998) and death-associated protein kinase 3 (DAPK3) (Kawai et al., 2003).

We generated Kaplan–Meier survival curves for patients with tumors in the SCC-A and SCC-B groups (Figure 4) and analysis with the log-rank test indicated a tendency for different survival (P = 0.071). The likelihood of survival for 6 years was significantly higher in SCC-B (81.8%) than in SCC-A (40.5%) (P = 0.014, Z-test). For patients with stage I tumors, a similar prognostic difference was recognized. However, the difference was not statistically significant (P = 0.38, log-rank test), because the sample size was too small (SCC-A, n = 19; SCC-B, n = 8). The likelihood of survival for 6 years was 41.7% in SCC-A and 77.4% in SCC-B (P = 0.17, Z-test). Aside from the comparison between SCC-A and SCC-B, we compared prognosis between the NMF partition classes for k = 3–7, and between the two partition classes of SCC-A from HC; there were no significant differences. Between SCC-A and SCC-B, there were no significant correlations for any other clinical and pathological parameters including tumor stage and histological differentiation grade (Table 5).

Discussion

In this study, we performed an analysis of gene expression profiles for 48 lung SCCs. First, we sought to identify genes whose expression patterns are characteristic of SCC. Groups of genes up- or downregulated in SCC were identified and statistically assessed at the gene function level by using EASE analysis. In upregulated genes, cell-proliferation-related functions were marked and many genes associated with oncogenesis, for example, transcription factor DP-1, an important regulator of the cell cycle (Helin et al., 1993), were included. Upregulation of many genes encoding enzymes with oxidoreductase activity reflects the strong causal relationship between tobacco carcinogens and SCC. Not surprisingly, many of the genes associated with epidermal development were also found upregulated in SCC. The top upregulated genes included several genes encoding epithelial cell-specific markers including cytokeratins. Regarding downregulated genes,

Table 4 Biological process categories over-represented by genes selectively upregulated in SCC-A and SCC-B

Category | Total (539/7174; 7.5%) | EASE score

Top 10 over-represented categories for 91 genes selectively upregulated in SCC-A
Cell proliferation | (10/728; 1.4%) | 0.00996
Mitotic cell cycle | (5/240; 2.1%) | 0.034
Cell cycle | (7/503; 1.4%) | 0.0418
Cellular physiological process | (20/2609; 0.77%) | 0.0439
Cell growth and/or maintenance | (18/2261; 0.80%) | 0.046
Cytokinesis | (3/79; 3.8%) | 0.0622
Cellular process | (26/3877; 0.67%) | 0.0669
mRNA processing | (3/111; 2.7%) | 0.111
Nuclear division | (3/115; 2.6%) | 0.118
Macromolecule biosynthesis | (6/525; 1.1%) | 0.13

Top 10 over-represented categories for 1499 genes selectively upregulated in SCC-B
MAPKKK cascade | (5/39; 12.8%) | 0.00958
Ceramide metabolism | (3/14; 21.4%) | 0.0356
Protein folding | (6/92; 6.5%) | 0.0485
Positive regulation of apoptosis | (6/92; 6.5%) | 0.0485
Regulation of transcription | (33/1120; 2.9%) | 0.0549
Intracellular protein transport | (11/282; 3.9%) | 0.084
Metabolism | (98/4114; 2.4%) | 0.105
Protein kinase cascade | (6/119; 5.0%) | 0.115
Sphingolipid metabolism | (3/39; 10.3%) | 0.129
Biosynthesis | (19/632; 3.0%) | 0.136

The top 10 over-represented categories are shown. Numerous other similar categories are not included to reduce redundancy. Significant functional categories are those with a higher ratio of identified genes to all genes on the array for associations with that category, relative to the ratio of total identified genes to all genes for associations with all categories. Association numbers approximate but are not exactly equal to gene numbers in a category. The ratio of associations for that category and the percentage given by that ratio are shown in the second column. The analogous ratios for total identified genes are shown in the heading (Total). EASE Score, modified Fisher’s exact test P-value

Table 5 Clinical and pathological parameters of patients and their tumors

Characteristics | SCC-A: No. of samples | SCC-A: % | SCC-B: No. of samples | SCC-B: % | P-value

Age (years) | 34 | | 14 | | 0.7391
Mean | 69.35 | | 68.5 | |

Sex | 34 | | 14 | | 0.8103
Male | 30 | 88% | 12 | 86% |
Female | 4 | 12% | 2 | 14% |

Smoking index | 34 | | 14 | | 0.2644
SI<400 | 2 | 6% | 1 | 7% |
400≤SI<1000 | 11 | 32% | 3 | 21% |
1000≤SI | 21 | 62% | 10 | 71% |

Tumor depth | 34 | | 14 | | 0.4463
Peripheral | 21 | 62% | 6 | 43% |
Mid-zonal | 7 | 21% | 5 | 36% |
Central | 6 | 18% | 3 | 21% |

Tumor size | 34 | | 14 | | 0.8194
<30 mm | 8 | 24% | 2 | 14% |
≥30 mm | 26 | 76% | 12 | 86% |

p-Stage | 34 | | 14 | | 0.8103
I | 19 | 56% | 8 | 57% |
≥II | 15 | 44% | 6 | 43% |

Differentiation | 34 | | 14 | | 0.1068
Well | 1 | 3% | 3 | 21% |
Moderate | 25 | 74% | 8 | 57% |
Poor | 8 | 24% | 3 | 21% |

Anthracosis | 31 | | 14 | | 0.3853
− | 1 | 3% | 2 | 14% |
+ | 8 | 26% | 3 | 21% |
++ | 22 | 71% | 9 | 64% |

Emphysema | 31 | | 14 | | 0.2862
− | 13 | 42% | 9 | 64% |
+ | 18 | 58% | 5 | 36% |

Smoking index is defined as the product of the number of cigarettes per day and the duration (years); p-Stage, pathological stage according to the classification of the Union Internationale Contre le Cancer (UICC); differentiation, tumor histological differentiation grade; anthracosis, anthracosis in the surrounding lung tissue; emphysema, emphysema in the surrounding lung tissue. P-values for age and tumor size were obtained by Student’s t-test; a P-value for Smoking index was obtained by Welch’s t-test; P-values for tumor depth, differentiation, and anthracosis were obtained by χ2-test; P-values for sex, p-stage, and emphysema were obtained by Yates χ2-test
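As a worked check of the Yates χ2-test named in the footnote, the sketch below applies the classical continuity-corrected formula for a 2 × 2 table to the sex distribution in Table 5 (SCC-A: 30 male, 4 female; SCC-B: 12 male, 2 female). The function name is hypothetical, and it is an assumption that this is the form of the correction the authors used; some implementations limit the correction when the observed and expected counts are very close, which would give a different value here.

```python
import math

def yates_chi2_2x2(a, b, c, d):
    """Yates-corrected chi-square test for the 2 x 2 table [[a, b], [c, d]].

    Returns (chi2, p). With 1 degree of freedom the upper-tail
    probability of chi2 equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Rows: SCC-A, SCC-B; columns: male, female (counts from Table 5).
chi2, p_value = yates_chi2_2x2(30, 4, 12, 2)
print(round(p_value, 4))
```

Under these assumptions the P-value rounds to 0.8103, which agrees with the value reported for sex in Table 5.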

Figure 4 Kaplan–Meier survival curves for the 48 SCC patients (SCC-A vs SCC-B)

examples involved in communication between cells and their surroundings were marked. These genes contribute to a characteristic profile of SCC and should serve as useful markers and potential therapeutic targets.

Segregation of SCC samples into two groups (SCC-A and SCC-B) was here indicated by the initial analysis and its robustness was confirmed by additional analyses, including the NMF approach. Survival analysis indicated SCC-B to be an SCC subclass with a good prognosis relative to SCC-A. Many genes differentially upregulated in SCC-A have functions associated with cell proliferation or the cell cycle, including BUB3 and ZW10 interactor, which are involved in maintaining the mitotic spindle checkpoint. This contrasts with the report that increased expression of BUB3 and ZW10 is associated with a better prognosis in lung AC cases (Miura et al., 2002), and suggests pathogenetic differences between SCC and AC. Regarding highly expressed genes in SCC-B, differentiated intracellular functions

were marked in contrast to the ‘proliferation signature’ of SCC-A. However, upregulation of AKT2 was identified uniquely in SCC-B. This may indicate an SCC-B-specific mechanism of oncogenesis and suggests that therapeutic inhibition of AKT2 might be a possible strategy for SCC-B control.

It is established that the SCC differentiation grade does not correlate well with prognosis, possibly due to its definition, which depends largely upon the amount of keratinization rather than on parameters related to an aggressive nature. Therefore, it is not surprising that our classification of SCC-A and SCC-B demonstrated no significant correlation with the differentiation grade. Furthermore, except for prognosis, our classification had no other significant correlation with clinicopathological parameters. It follows that by expression profiling, we may be able to predict differential prognosis for tumors that are pathologically and clinically indistinguishable.

The need for a new system of classification in SCC is obvious. Whereas prognosis in AC can be quite well predicted on the basis of classical factors such as differentiation grade or the presence of particular histopathological features, notably a micropapillary pattern, similar robust correlations do not exist for SCC. Our present analysis of gene expression patterns, however, allowed division of SCC into two distinct groups with a prognostic difference. This study represents an important step toward the identification of a clinically useful classification for lung SCC. Further research will be required to confirm the classification identified here and dissect out its clinical relevance.

Materials and methods

Clinical samples

All clinical samples were collected with ethical committee approval and informed consent from patients undergoing surgery at the Cancer Institute Hospital, Tokyo, Japan, between May 1996 and December 1999. All samples were grossly dissected and snap-frozen in liquid nitrogen within 20 min of removal. Initial diagnosis of each sample from frozen sections was later confirmed by detailed analysis of paraffin-embedded sections. We included 30 samples of normal lung from unrelated individuals. Since a simple comparison of gene expression between SCC and normal lung would identify many genes that are commonly regulated across all lung cancers (Virtanen et al., 2002), we also generated expression profiles for a set of nine well- or moderately differentiated ACs and incorporated these into our analysis. Differentiated ACs were selected because some poorly differentiated ACs can develop transcription profiles that resemble SCCs (Virtanen et al., 2002). Reference RNA was a mixture of normal lung and lung cancer cell lines as described previously (Virtanen et al., 2002).

Array design

Each microarray contained 40 386 elements, 39 936 of these being derived from IMAGE cDNA clones purchased from Research Genetics (Huntsville, AL, USA). This was supplemented with 384 proprietary clones and 48 control elements. According to our latest estimate, the array represents 19 877 unique UNIGENE clusters accounting for 10 799 named genes, 2894 other unnamed genes (for example, clones described as hypothetical proteins), and 6184 ESTs. Clones were purchased as sequence verified, but we estimate an error rate of at least 10%. Genes mentioned by name in this paper were resequenced.

Microarray experiments

cRNA was synthesized from total RNA using AmpliScribe T7 (Epicentre, Madison, WI, USA) and cDNAs generated with 2 mg aliquots were aminoallyl labeled with Cy5 (sample) or Cy3 (reference), hybridized overnight at 42°C in a buffer containing 50% formamide, 5× SSC, 0.1% SDS, 0.25 mg/ml human cot1 DNA and 0.125 mg/ml poly-dA and washed to a final stringency of 1× SSC at 42°C. After washing, slides were immediately scanned on an Axon GenePix 4000B and quantified using Axon GenePix Pro (Axon, Union City, CA, USA).

Microarray data analysis

Data were analysed using GeneSpring (Silicon Genetics, Redwood City, CA, USA) and Matlab (Mathworks, Natick, MA, USA). All data were subjected to intensity-dependent (LOWESS) normalization. Log-transformed data were used for HC and for the choice of input genes before the NMF approach. For NMF analysis, unlogged data were used because of the non-negativity requirement.

For the initial statistical analysis, we selected genes differentially expressed between the SCC, AC, and normal tissue groups. After each gene was normalized to its median expression value across all 87 samples, Welch’s ANOVA with Bonferroni correction (P = 0.05) was performed. This yielded a set of 4689 genes.

For the additional independent round of HC, we selected genes up- or downregulated in SCC samples. For this filtering, normal and SCC samples were separated into two groups. For each group, the expression of each gene was normalized to the median value across all samples. To remove genes that normally vary between individuals and between different sections of lung, we first selected genes that are expressed stably in normal lung. For the 30 normal samples, we selected genes for which we had data in at least 25 of the series and for which the coefficient of variation of log expression ratios was less than 0.6. This yielded a set of 38 924 genes. For each gene, the average log expression ratio of all normal samples (Nave) was calculated. These genes were then filtered across the SCC samples, passing any gene for which the log expression ratio was more than twice the Nave for any six of the 48 SCC samples or less than half for any six. This process resulted in a set of 3891 genes.

For the NMF approach, we first chose the input genes. For the 48 SCC samples, we selected genes in which more than 38 of the 48 had an absorption measurement of more than 100 in the tumor signal and all 48 samples had more than 100 in the reference channel. This yielded a set of 19 315 genes. From these, we selected genes for which the standard deviation of log expression ratios was more than 0.4. This filtering process resulted in a set of 3344 genes.

NMF analysis was performed in Matlab using code for the NMF divergence-reducing equations, as well as for model selection and reordering of the consensus matrices, provided on a website (Brunet et al., 2004; http://www.broad.mit.edu/cancer/). For rank k = 2–7, consensus matrices were obtained by taking the average of over 50 connectivity matrices. Each consensus matrix was reordered by HC using distances derived from consensus clustering matrix entries.
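The choice of input genes for NMF described above (a signal threshold in both channels, then a variability cutoff) can be illustrated with a minimal NumPy sketch. It is not the authors' GeneSpring/Matlab procedure: the function name `select_nmf_input_genes`, its parameter names, the use of base-2 logarithms, the sample standard deviation (`ddof=1`), and the toy data are all assumptions made for illustration.

```python
import numpy as np

def select_nmf_input_genes(tumor, reference, min_signal=100.0,
                           min_tumor_samples=39, sd_cutoff=0.4):
    """Return a boolean mask over genes (rows) that pass both filters.

    tumor, reference: arrays of shape (genes, samples) with raw intensities.
    Filter 1: tumor signal > min_signal in at least min_tumor_samples samples
              ("more than 38 of the 48") and reference > min_signal in all.
    Filter 2: standard deviation of log ratios > sd_cutoff.
    The log base (2) is an assumption; the text does not state it.
    """
    tumor = np.asarray(tumor, dtype=float)
    reference = np.asarray(reference, dtype=float)

    tumor_ok = (tumor > min_signal).sum(axis=1) >= min_tumor_samples
    ref_ok = (reference > min_signal).all(axis=1)
    measured = tumor_ok & ref_ok

    keep = np.zeros(tumor.shape[0], dtype=bool)
    idx = np.flatnonzero(measured)
    # Clip tumor signal at 1 so that log ratios are defined for weak spots.
    log_ratio = np.log2(np.clip(tumor[idx], 1.0, None) / reference[idx])
    keep[idx] = log_ratio.std(axis=1, ddof=1) > sd_cutoff
    return keep

# Toy data: 4 genes x 48 samples (hypothetical values).
tumor = np.tile([200.0, 800.0], (4, 24))   # alternating low/high signal
reference = np.full((4, 48), 400.0)
tumor[1, :] = 400.0      # gene 1: no variation across samples
reference[2, 0] = 50.0   # gene 2: one weak reference spot
tumor[3, :10] = 50.0     # gene 3: only 38 samples above threshold

print(select_nmf_input_genes(tumor, reference).tolist())
# [True, False, False, False]
```

Only the first toy gene passes: it is measured reliably in both channels and varies across samples, which is the kind of gene the filter is meant to retain.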

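The consensus clustering step of the NMF analysis in Microarray data analysis can also be sketched in a few lines. This is a simplified NumPy illustration of the general technique (multiplicative divergence updates as in Lee and Seung, 1999, and averaged connectivity matrices as in Brunet et al., 2004); it is not the Matlab code used in this study. The function names, iteration count, number of runs, and toy data are assumptions.

```python
import numpy as np

def nmf_divergence(V, k, rng, n_iter=200, eps=1e-9):
    """Factor non-negative V (genes x samples) as W @ H using
    multiplicative updates that reduce the divergence D(V || WH)."""
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def connectivity(H):
    """1 where two samples share the same dominant metagene, else 0."""
    labels = H.argmax(axis=0)
    return (labels[:, None] == labels[None, :]).astype(float)

def consensus_matrix(V, k, n_runs=20, n_iter=200, seed=0):
    """Average the connectivity matrices of n_runs random restarts."""
    rng = np.random.default_rng(seed)
    total = np.zeros((V.shape[1], V.shape[1]))
    for _ in range(n_runs):
        _, H = nmf_divergence(V, k, rng, n_iter=n_iter)
        total += connectivity(H)
    return total / n_runs

# Toy example: 40 genes x 10 samples with two sample groups (hypothetical).
rng = np.random.default_rng(0)
V = rng.random((40, 10)) * 0.1 + 0.01
V[:20, :5] += 5.0
V[20:, 5:] += 5.0
C = consensus_matrix(V, k=2, n_runs=10, seed=1)
print(np.round(C, 2))
```

Entries of the consensus matrix near 1 mark sample pairs that are assigned to the same cluster in almost every run, and entries near 0 mark pairs that are almost never clustered together. Model selection across ranks and the reordering of the matrix by HC are not shown here.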
Biological process categorization by gene ontology

We used a new software tool, the Expression Analysis Systematic Explorer (EASE; Hosack et al., 2003; http://david.niaid.nih.gov/david/ease.htm), to assign identified genes to ‘GO: Biological Process’ categories of the Gene Ontology Consortium (Ashburner et al., 2000; http://www.geneontology.org) and for statistical testing (EASE Score, modified Fisher’s exact test) for significant coregulation (over-representation) of identified genes within each biological process category.

Analysis of clinical and pathological parameters

Cumulative survival rates were calculated by means of the Kaplan–Meier method and compared by the log-rank test and Z-test. We analysed statistical correlations for other clinical and pathological parameters using the Student’s t-test, Welch’s t-test, χ2-test, or Yates’ χ2-test, as appropriate.

Acknowledgements

We thank Mr Shogo Yamamoto for technical advice and suggestions; Mr Atsushi Kobayashi, Ms Mio Kato, Ms Kazuko Yokokawa, Mr Motoyoshi Iwakoshi, Ms Miyuki Kogure, and Ms Tomoyo Kakita for their technical assistance; and Ms Chisato Kakuta for secretarial work. Parts of this study were supported financially by Grants-in-Aid from the Ministry of Education, Culture, Sports, Science and Technology, and by grants from the Ministry of Health, Labour and Welfare, the Smoking Research Foundation, and the Vehicle Racing Commemorative Foundation.

References

Amatschek S, Koenig U, Auer H, Steinlein P, Pacher M, Gruenfelder A, Dekan G, Vogl S, Kubista E, Heider KH, Stratowa C, Schreiber M and Sommergruber W. (2004). Cancer Res., 64, 844–856.
Arboleda MJ, Lyons JF, Kabbinavar FF, Bray MR, Snow BE, Ayala R, Danino M, Karlan BY and Slamon DJ. (2003). Cancer Res., 63, 196–206.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM and Sherlock G. (2000). Nat. Genet., 25, 25–29.
Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ and Meyerson M. (2001). Proc. Natl. Acad. Sci. USA, 98, 13790–13795.
Borczuk AC, Gorenstein L, Walter KL, Assaad AA, Wang L and Powell CA. (2003). Am. J. Pathol., 163, 1949–1960.
Brunet JP, Tamayo P, Golub TR and Mesirov JP. (2004). Proc. Natl. Acad. Sci. USA, 101, 4164–4169.
Deiss LP, Feinstein E, Berissi H, Cohen O and Kimchi A. (1995). Genes Dev., 9, 15–30.
Esteller M, Corn PG, Baylin SB and Herman JG. (2001). Cancer Res., 61, 3225–3229.
Fukumoto S, Yamauchi N, Moriguchi H, Hippo Y, Watanabe A, Shibahara J, Taniguchi H, Ishikawa S, Ito H, Yamamoto S, Iwanari H, Hironaka M, Ishikawa Y, Niki T, Sohara Y, Kodama T, Nishimura M, Fukayama M, Dosaka-Akita H and Aburatani H. (2005). Clin. Cancer Res., 11, 1776–1785.
Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, van de Rijn M, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D and Petersen I. (2001). Proc. Natl. Acad. Sci. USA, 98, 13784–13789.
Grille SJ, Bellacosa A, Upson J, Klein-Szanto AJ, van Roy F, Lee-Kwon W, Donowitz M, Tsichlis PN and Larue L. (2003). Cancer Res., 63, 2172–2178.
Helin K, Wu CL, Fattaey AR, Lees JA, Dynlacht BD, Ngwu C and Harlow E. (1993). Genes Dev., 7, 1850–1861.
Hibi K, Trink B, Patturajan M, Westra WH, Caballero OL, Hill DE, Ratovitski EA, Jen J and Sidransky D. (2000). Proc. Natl. Acad. Sci. USA, 97, 5462–5467.
Hosack DA, Dennis G, Sherman BT, Lane HC and Lempicki RA. (2003). Genome Biol., 4, R70.
Ichikawa S, Sakiyama H, Suzuki G, Hidari KI and Hirabayashi Y. (1996). Proc. Natl. Acad. Sci. USA, 93, 4638–4643.
Jones MH, Virtanen C, Honjoh D, Miyoshi T, Satoh Y, Okumura S, Nakagawa K, Nomura H and Ishikawa Y. (2004). Lancet, 363, 775–781.
Kawai T, Akira S and Reed JC. (2003). Mol. Cell. Biol., 23, 6174–6186.
Kettunen E, Anttila S, Seppanen JK, Karjalainen A, Edgren H, Lindstrom I, Salovaara R, Nissen AM, Salo J, Mattson K, Hollmen J, Knuutila S and Wikman H. (2004). Cancer Genet. Cytogenet., 149, 98–106.
Lee DD and Seung HS. (1999). Nature, 401, 788–791.
Mehlen P, Rabizadeh S, Snipas SJ, Assa-Munt N, Salvesen GS and Bredesen DE. (1998). Nature, 395, 801–804.
Mills AA, Zheng B, Wang XJ, Vogel H, Roop DR and Bradley A. (1999). Nature, 398, 708–713.
Miura K, Bowman ED, Simon R, Peng AC, Robles AI, Jones RT, Katagiri T, He P, Mizukami H, Charboneau L, Kikuchi T, Liotta LA, Nakamura Y and Harris CC. (2002). Cancer Res., 62, 3244–3250.
Miyoshi T, Satoh Y, Okumura S, Nakagawa K, Shirakusa T, Tsuchiya E and Ishikawa Y. (2003). Am. J. Surg. Pathol., 27, 101–109.
Nacht M, Dracheva T, Gao Y, Fujii T, Chen Y, Player A, Akmaev V, Cook B, Dufault M, Zhang M, Zhang W, Guo M, Curran J, Han S, Sidransky D, Buetow K, Madden SL and Jen J. (2001). Proc. Natl. Acad. Sci. USA, 98, 15203–15208.
Palackal NT, Lee SH, Harvey RG, Blair IA and Penning TM. (2002). J. Biol. Chem., 277, 24799–24808.
Pulling LC, Vuillemenot BR, Hutt JA, Devereux TR and Belinsky SA. (2004). Cancer Res., 64, 3844–3848.
Rebhun JF, Castro AF and Quilliam LA. (2000). J. Biol. Chem., 275, 34901–34908.
Taylor SS, Ha E and McKeon F. (1998). J. Cell Biol., 142, 1–11.
Tomida S, Koshikawa K, Yatabe Y, Harano T, Ogura N, Mitsudomi T, Some M, Yanagisawa K, Takahashi T, Osada H and Takahashi T. (2004). Oncogene, 23, 5360–5370.
Virtanen C, Ishikawa Y, Honjoh D, Kimura M, Shimane M, Miyoshi T, Nomura H and Jones MH. (2002). Proc. Natl. Acad. Sci. USA, 99, 12357–12362.
Wikman H, Seppanen JK, Sarhadi VK, Kettunen E, Salmenkivi K, Kuosma E, Vainio-Siukola K, Nagy B, Karjalainen A, Sioris T, Salo J, Hollmen J, Knuutila S and Anttila S. (2004). J. Pathol., 203, 584–593.
Yang A, Schweitzer R, Sun D, Kaghad M, Walker N, Bronson RT, Tabin C, Sharpe A, Caput D, Crum C and McKeon F. (1999). Nature, 398, 714–718.
Yoo SH, Park YS, Kim HR, Sung SW, Kim JH, Shim YS, Lee SD, Choi YL, Kim MK and Chung DH. (2003). Lung Cancer, 42, 195–202.
