Oncogene (2006) 25, 7324–7332
© 2006 Nature Publishing Group. All rights reserved 0950-9232/06 $30.00
www.nature.com/onc

ONCOGENOMICS

DNA copy number amplification profiling of human neoplasms

S Myllykangas1, J Himberg2, T Böhling1, B Nagy1,3, J Hollmén2 and S Knuutila1

1Department of Pathology, Haartman Institute and HUSLAB, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland; 2Laboratory of Computer and Information Science, Helsinki University of Technology, Espoo, Finland and 3 1st Department of Obstetrics and Gynecology, Semmelweis University, Budapest, Hungary

DNA copy number amplifications activate oncogenes and are hallmarks of nearly all advanced tumors. Amplified genes represent attractive targets for therapy, diagnostics and prognostics. To investigate DNA amplifications in different neoplasms, we performed a bibliomics survey using 838 published chromosomal comparative genomic hybridization studies and collected amplification data at chromosome band resolution from more than 4500 cases. Amplification profiles were determined for 73 distinct neoplasms. Neoplasms were clustered according to the amplification profiles, and frequently amplified chromosomal loci (amplification hot spots) were identified using computational modeling. To investigate the site specificity and mechanisms of amplifications, colocalization of amplification hot spots, cancer genes, fragile sites, virus integration sites and gene size cohorts was tested in a statistical framework. Amplification-based clustering demonstrated that cancers with similar etiology, cell-of-origin or topographical location have a tendency to obtain convergent amplification profiles. The identified amplification hot spots were colocalized with the known fragile sites, cancer genes and virus integration sites, but global statistical significance could not be ascertained. Large genes were significantly overrepresented on the fragile sites and the reported amplification hot spots. These findings indicate that amplifications are selected in the cancer tissue environment according to the qualitative traits and localization of cancer genes.
Oncogene (2006) 25, 7324–7332. doi:10.1038/sj.onc.1209717; published online 5 June 2006

Keywords: cancer; gene amplification; fragile site; bioinformatics; data mining; molecular pathology

Correspondence: S Myllykangas, Department of Pathology, Haartman Institute, University of Helsinki, PO Box 21, FI-00014 Helsinki, Finland.
E-mail: samuel.myllykangas@helsinki.fi
Received 1 February 2006; revised 13 April 2006; accepted 20 April 2006; published online 5 June 2006

Introduction

Gene amplifications cause an increase in the gene copy number and subsequently elevate the expression of the amplified genes. The expression of amplified genes has been shown to increase significantly in breast and prostate cell lines and in primary tumors, but the number of overexpressed genes in the amplified chromosomal areas varies with cancer tissue type (Pollack et al., 1999, 2002; Hyman et al., 2002). Amplification-activated human oncogenes include AKT2 in ovarian cancer, ERBB2 in breast and ovarian tumors, MYCL1 in small-cell lung cancer, MYCN in neuroblastoma, REL in Hodgkin’s lymphoma, EGFR in glioma and non-small-cell lung cancer and MYC in various cancers (Futreal et al., 2004). Amplifications are frequently observed in advanced cancers, which have lost the p53-mediated maintenance of genomic integrity (Livingstone et al., 1992; Yin et al., 1992). Amplifications of various drug target genes have been observed to confer drug resistance in several in vitro experiments as well as in clinical studies (Wahl et al., 1979; Johnston et al., 1983; Goker et al., 1995; Shannon, 2002).

DNA copy number amplifications arise as multiplication of intra-chromosomal regions of 0.5–10 Mb in length. In contrast to the definition of amplification, DNA copy number increase in larger chromosomal areas or intact chromosomes, owing to translocations or aneuploidy, is defined as a gain (Lengauer et al., 1998). Amplifications produce excess gene copies in homogeneously staining regions (HSRs) and extrachromosomal acentric DNA fragments (double minutes and episomes) (Albertson et al., 2003). HSRs manifest as a ladder-like structure of inverted repeats within chromosomes (Schwab, 1998). Extrachromosomal amplicons appear as double minutes, which can be seen using conventional cytogenetics (Hahn, 1993), and episomes (~250 bp in length), which are only detectable using molecular biology methods (Maurer et al., 1987; Graux et al., 2004). HSRs, double minutes and episomes may contain fused genetic material from different chromosomal loci (Guan et al., 1994; Graux et al., 2004). Models for pathways that result in gene amplification have been reviewed in detail by Myllykangas and Knuutila (2006). Amplification pathway models predict that two independent DNA double-stranded breaks at the margins of an amplified region are required to occur to enable the amplification of the intra-chromosomal DNA segment. The genome is perturbed by specific labile domains, such as fragile sites, virus integration sites and large genes, which are more sensitive to DNA damage than other loci. Fragile sites are chromosomal regions in which chromosomal breaks can be induced using replication-stalling chemicals. Based on their prevalence in the population, the fragile sites (119) are classified as common or rare. Eighty-eight common fragile sites are found in all individuals and 31 rare fragile sites in less than 5% of the population. Susceptibility to damage varies between the fragile sites, as 75% of the DNA damage lesions have been located to 22 common fragile sites (out of the 139 chromosomal loci with identified lesions) (Glover et al., 1984). Common fragile sites, in particular, are colocalized with amplifications (Hellman et al., 2002). WWOX (1.1 Mb) and FHIT (0.8 Mb) are particularly large genes and are located in two of the most damage-prone fragile sites (16q23.1 and 3p14.2, respectively). WWOX and FHIT are frequently involved in DNA breaks. Besides fragile site expression, retroviral DNA insertion at specific genomic sites is believed to induce DNA breaks.

Results

DNA copy number amplification profiles of human neoplasms

DNA copy number amplifications in human neoplasms were extracted from a publicly available data collection. A data matrix containing binary chromosome sub-band-specific information of DNA amplifications was generated (downloadable at http://www.helsinki.fi/cmg/cgh_data.html). The number of examined loci was 393 (list of all analysed chromosomal loci in Appendix 1 in the Supplementary information). The assembled data matrix contained 4590 cases with 5740 coherent, amplified chromosomal regions. In all, 1 803 870 chromosome bands were analysed and 26 527 of them were labeled amplified. Appendix 2 in the Supplementary information presents an overview of the studied neoplasms (n = 73) and DNA amplifications. The majority of the neoplasms were malignant tumors or hematologic malignancies, but non-malignant or pre-malignant lesions were also included in this study. All 73 neoplasm categories included cases with amplifications. Amplification profiling was carried out by calculating chromosome sub-band-specific amplification frequencies. Amplification profiles of all studied neoplasms are presented in Figure 1a, which shows amplification frequencies at chromosome band resolution with frequency values of individual neoplasms zoomed to the maximum. Neoplasms in Figure 1a are sorted according to the hierarchical clustering. Four main neoplasm clusters (Figure 1a) were identified according to the clustering analysis. Detailed amplification profiles were generated for 73 neoplasms (see Appendix 3 in the Supplementary information).

Amplification hot spots represent frequently amplified genomic domains

DNA copy number amplifications were found to be preferentially localized in the genome (Figure 1a and Appendix 3 in the Supplementary information). Therefore, amplification hot spots, the genomic loci that are prone to amplify in human neoplasms, were identified using computational modeling, specifically independent component analysis (ICA). Chromosome bands with values >0.5 were included in the genome-wide ICA factors representing the amplification hot spots. Values between 0.5 and 1 within the amplification hot spots are shown using gray to black scaling. Figure 1b presents the top 30 identified amplification hot spots (10% of all identified independent factors) in the order of decreasing stability index, which was evaluated by bootstrapping the ICA factors. Besides factor 19, which embodies the whole chromosome 21, the longest coherent amplification hot spots encompassed single chromosome arms. Note that three factors are composed of two disjoint regions (factors 9, 12 and 27). The disjoint hot spots spanned across the centromeres of chromosomes 8, 17 and 5, respectively. Amplification hot spots did not cross chromosome boundaries, although this would have been conceivable in the continuous vector presentation of amplifications. The amplification hot spots are also presented in Table 1, which lists the fragile sites and cancer genes that colocalize with the amplification hot spots. The amplification hot spots covered 178 chromosome bands (45% of the genome).

Amplification profile-based clustering reveals relations between human neoplasms

DNA amplification profiling (Figure 1a) showed that amplifications were not randomly distributed in the human genome, but specific amplifications were observed in distinct neoplasms. Hierarchical clustering was performed to investigate whether there were amplification-based similarities between neoplasms. Figure 2 shows the correlation matrix and the dendrogram of the neoplasms. Four main neoplasm clusters are marked in the correlation matrix.

Associations between amplification hot spots, cancer genes, fragile sites, virus integration sites and gene size cohorts

The identification of amplification hot spots (Figure 1b) suggested that some structural genomic properties or labile genomic features might explain the site specificity of the amplifications. Localization of cancer genes, virus integration sites and fragile sites relative to amplification hot spots was evaluated to explain the mechanisms behind the amplification hot spot findings. Only one of the hot spot areas showed significant overrepresentation of cancer genes (8p23-q12; 8q23; difference of means 0.0227, P = 0.0070). The same analysis performed with regard to the virus integration sites yielded four significant findings of hot spot areas with overrepresentation, namely 2q13–q36 (difference 0.0404, P<0.0001), 8q24.1–q24.3 (difference 0.0407, P = 0.0020), 9q11–q34 (difference 0.1264, P<0.0001) and 18q11.2–q23 (difference 0.1975, P<0.0001). National Center for Biotechnology Information (NCBI)-derived fragile sites were not significantly overrepresented in any of the reported


Figure 1 DNA amplification profiling of human neoplasms and amplification hot spots in the human genome. (a) Overview of amplification profiles of all analysed neoplasms. Amplification incidences were calculated for each chromosome band and zoomed to the maximum value in each neoplasm. Hierarchical clustering was applied to sort the neoplasms. Four main clusters are marked (clusters 1–4). (b) Amplification hot spots were identified using ICA. The cutoff value for chromosome bands, which were included in the amplification hot spots, was set at 0.5. Hot spot temperature (values between 0.5 and 1) is presented using gray to black scaling. Stability indexes of the ICA factors (amplification hot spots) were assessed using bootstrapping analysis. Amplification hot spots were sorted by the stability indexes.
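The profile calculation described in the caption can be illustrated with a minimal Python sketch. It assumes that each case is a neoplasm label plus a binary vector of band-level amplification flags; the function names, labels and toy data are hypothetical and are not taken from the published data matrix or from the authors' code.

```python
from collections import defaultdict


def amplification_profiles(cases):
    """Return per-neoplasm amplification frequencies for each band.

    cases: list of (neoplasm_label, binary_band_vector) pairs (assumed layout).
    """
    sums = {}
    counts = defaultdict(int)
    for label, bands in cases:
        if label not in sums:
            sums[label] = [0] * len(bands)
        for i, flag in enumerate(bands):
            sums[label][i] += flag
        counts[label] += 1
    # Frequency = amplified cases / all cases of that neoplasm, per band.
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def scale_to_max(profile):
    """Scale one profile so that its highest frequency becomes 1 ("zoomed")."""
    peak = max(profile)
    return [v / peak if peak > 0 else 0.0 for v in profile]


# Hypothetical toy data: 2 neoplasms, 4 chromosome bands.
toy_cases = [
    ("neoplasm_A", [1, 0, 0, 1]),
    ("neoplasm_A", [1, 0, 0, 0]),
    ("neoplasm_B", [0, 1, 0, 0]),
    ("neoplasm_B", [0, 1, 1, 0]),
    ("neoplasm_B", [0, 0, 1, 0]),
    ("neoplasm_B", [0, 1, 0, 0]),
]
profiles = amplification_profiles(toy_cases)
scaled = {label: scale_to_max(p) for label, p in profiles.items()}

# Size of the full matrix reported in the text: 4590 cases x 393 loci.
n_entries = 4590 * 393
```

In the full data set the same operation would run over 4590 cases and 393 loci, which matches the 1 803 870 analysed chromosome bands reported in the Results.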

amplification hot spots. In a global statistical setting, the fragile sites, which had been extracted from the NCBI database or from a publication (Glover et al., 1984), were not significantly over- or underrepresented in the reported amplification hot spot areas when compared with non-amplified chromosomal regions. Similarly, when analysed in global scale, neither cancer genes nor virus integration sites were significantly represented on the reported amplification hot spots.

Associations between amplification hot spots, prevalence weighted fragile sites and gene size cohorts were tested to analyse whether the gene size affects chromosomal fragility or has influence on gene amplifications. Differences in proportions of genes located on amplification hot spots and non-amplified chromosomal regions as well as in fragile and non-fragile sites were measured for each gene size cohort, and statistical testing was applied to assess the significance of the findings. Gene size cohorts 0–100 kb (difference of means 0.234 and P-value <0.0000001), 50–150 kb (difference of means 0.206 and P-value = 0.002), 250–350 kb (difference of means 0.204 and P-value <0.0000001) and 300–400 kb (difference of means 0.146 and P-value = 0.001) were found to be associated with amplification hot spots. When the gene size cohorts were measured against the prevalence weighted fragile

Table 1 Amplification hot spots of the human genome with colocalized fragile sites and cancer genes

Genomic location | Amplification hot spot | Fragile sites | Cancer genes
1q11–q44 | 17 | FRA1J, FRA1F, FRA1G, FRA1K, FRA1H, FRA1I | PAX7, PDE4DIP, BCL9, IRTA1, MUC1, AF1Q, ARNT, SDHC, PRCC, HSPCA, NTRK1, HRPT2, TPM3, FCGR2B, PBX1, PMX1, ABL2, TPR, FH
1q21–q24 | 24 | FRA1F | BCL9, IRTA1, MUC1, AF1Q, ARNT, SDHC, PRCC, HSPCA, NTRK1, HRPT2, TPM3, FCGR2B, PBX1, PMX1
2p25–p23 | 4 | FRA2C | MSH2, ALK, MYCN
2q13–q36 | 25 | FRA2B, FRA2F, FRA2K, FRA2G, FRA2H, FRA2I | TTL, ERCC3, HOXD11, HOXD13, CHN1, PMS1, ATIC, PAX3, FEV
3q26.1–q26.3 | 10 | - | EVI1, RPL22, PIK3CA
3q26.3–q29 | 7 | FRA3C | EVI1, RPL22, PIK3CA, BCL6, EIF4A2, LPP, TFRC
5p15.3–p15.1 | 26 | - | -
5p15.3–p15.1; 5p12–q35 | 27 | FRA5B, FRA5D, FRA5F, FRA5C, FRA5G | LIFR, APC, AF5q31, FACL6, GRAF, PDGFRB, RANBP17, NPM1, NSD1, TLX3, FLT4
6p25–p11.1 | 15 | FRA6C, FRA6A, FRA6B | HSPCB, CCND3, TFEB, HMGA1, SFRS3, PIM1, HIST1H4I, FANCE, DEK, IRF4
7p13–p11.1 | 20 | FRA7A, FRA7D | ZNFN1A1, EGFR
7p22–p13 | 29 | FRA7D, FRA7C, FRA7B | ETV1, ZNFN1A1, EGFR, JAZF1, HOXA11, HOXA13, HOXA9, PMS2
8p23–q12; 8q23 | 9 | - | RUNXBP2, FGFR1, WHSC1L1, WRN, PCM1, TCEA1, PLAG1, COX6C
8q11.1–q24.3 | 16 | FRA8F, FRA8B, FRA8A, FRA8C, FRA8E, FRA8D | TCEA1, PLAG1, NCOA2, NBS1, CBFA2T1, COX6C, EXT1, MYC, RECQL4
8q24.1–q24.3 | 14 | FRA8C, FRA8E, FRA8D | EXT1, MYC, RECQL4
9q11–q34 | 8 | FRA9A, FRA9C, FRA9F, FRA9D, FRA9B, FRA9E | SYK, NR4A3, FANCC, PTCH, XPA, FNBP1, TAL2, CEP1, SET, TSC1, NUP214, ABL1
11q12–q25 | 23 | FRA11A, FRA11H, FRA11F, FRA11B, FRA11G | HEAB, NUMA1, CCND1, MEN1, PICALM, ATM, BIRC3, MAML2, DDX10, PAFAH1B2, MLL, SDHD, POU2AF1, ZNF145, CBL, DDX6, PCSK7, ARHGEF12, FLI1
11q13 | 11 | FRA11A, FRA11H | NUMA1, CCND1, MEN1
12p13–p11.1 | 2 | - | KRAS2, CCND2, ETV6, ZNF384, ELKS
12q13–q21 | 21 | FRA12A, FRA12B | ATF1, DDIT3, HOXC13, HOXC11, CDK4, C12orf9, HMGA2
12q21–q24.3 | 30 | FRA12B, FRA12D, FRA12C, FRA12E | BTG1, NACA, BCL7A, PTPN11, TCF1
15q21–q26 | 3 | FRA15A | TCF12, PML, NTRK3, BLM
17p13–p11.1 | 6 | FRA17A | BHD, HCMOGT-1, MAP2K4, RAB5EP, TP53, PER1
17q11.1–q21; 17q25 | 12 | - | TAF15, SUZ12, LASP1, CLTC, NF1, BRCA1, ETV4, MLLT6, ERBB2, ASPSCR1, ALO17
17q22–q25 | 5 | FRA17B | HLF, BCL5, MSI2, PRKAR1A, ASPSCR1, ALO17
18q11.2–q23 | 18 | FRA18A, FRA18B, FRA18C | SS18, MALT1, MADH4, FVT1, BCL2
20p13–p11.1 | 28 | FRA20A, FRA20B | -
21p13–q22 | 19 | - | OLIG2, ERG, RUNX1
22q11.1–q13 | 22 | FRA22B, FRA22A | PNUTL1, CLTCL1, BCR, EWSR1, NF2, PDGFB, ZNF278, MN1, MKL1, EP300, MYH9
Xp22.3–p11.1 | 1 | FRAXB | TFE3, SSX4, GATA1, SSX1, SSX2, WAS
Xq12–q28 | 13 | FRAXC, FRAXD, FRAXA, FRAXF, FRAXE | NONO, MLLT7, SEPT6, GPC3, MTCP1

Amplification hot spots are sorted by chromosomal location. A hyphen marks an empty cell. The census of human cancer genes was adopted from a publication by Futreal et al. (2004) and fragile sites were extracted from the NCBI database.

Figure 2 Hierarchical clustering of human neoplasms. Neoplasms were sorted using pairwise hierarchical clustering. In the correlation matrix, neoplasms (marked 1–73) appear in the same order on the x and y axis, and black lines separate the clusters. Four main clusters are marked (clusters 1–4). Linkage distances of paired neoplasms were used to join the best matching neoplasm pairs to form the dendrogram tree.
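The clustering summarized in the caption (a pairwise correlation matrix followed by average-linkage joining of the best-matching groups) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the use of 1 minus the Pearson correlation as the linkage distance, the function names and the four toy profiles are all hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length profiles (non-constant input)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (group_a, group_b, distance)."""
    names = list(profiles)
    # Assumption: distance between two neoplasms is 1 - correlation.
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}

    def group_distance(g1, g2):
        # Average linkage: mean distance over all cross-group pairs.
        return sum(dist[(a, b)] for a in g1 for b in g2) / (len(g1) * len(g2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        distance, i, j = min(
            (group_distance(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters) if i < j
        )
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical amplification frequency profiles over four bands.
toy_profiles = {
    "A": [0.9, 0.1, 0.0, 0.0],
    "B": [0.8, 0.2, 0.0, 0.1],
    "C": [0.0, 0.1, 0.9, 0.7],
    "D": [0.1, 0.0, 0.8, 0.9],
}
merges = average_linkage(toy_profiles)
```

With these toy profiles the two similar pairs are joined first and the two resulting groups are joined last, at a much larger distance, which is the pattern a dendrogram would display as two main branches.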

sites, the 200–300, 250–350, and 900 kb–1 Mb gene size samplings were found to be associated with the fragile sites (differences of means 0.000021, 0.000022 and 0.000103 and P-values 0.006, <0.0000001 and 0.002, respectively). The low number of genes in the test sets prevented both gene size assessments in cohorts >1.5 Mb.

Discussion

Cytogenetic and molecular genetic markers have long been applied in cancer diagnostics. In recent years, it has become evident that molecular classification of tumors provides essential knowledge about the mechanisms of carcinogenesis and guides the development of targeted therapies and clinical practices. Although some leukemias have already been classified using molecular biology parameters, such classifications are generally lacking. Amplification profiles can be regarded as the molecular signature of the history of the neoplasm, because they are distributed across the entire genome in a fairly resolute manner. Moreover, amplifications are mechanistically associated with cancer progression and common in late-stage cancers of various organs. In this context, amplification profile-based molecular classification of neoplasms is well grounded. Naturally, amplification profiling is only one means to classify neoplasms and similar classification can be based on any other measurable molecular variable.

DNA copy number amplification data was collected in a bibliomics survey. Data mining by means of statistical modeling was applied to elicit knowledge from DNA copy number amplifications in human cancer. Amplification frequency profiles were identified for neoplasm-type-specific sample groups (Figure 1a and Appendix 3 in the Supplementary information). The method applied in the amplification profiling analysis assumes that a single measurement of significance only applies to that specific chromosome band, ignoring the uncertainty resulting from multiple testing. Conservative multiple testing correction was not applied as the method was not used in hypothesis testing but in the context of data mining and knowledge discovery. In addition, the DNA amplification profiling tests were not independent (required by the correction algorithms) as the genomic map positions are interconnected. Thus, uncertainty of the amplification frequency profiling analyses is mainly reflected in the sample size of the

respective classes (see Appendix 2 in the Supplementary information). Hierarchical clustering was used to identify neoplasm clusters based on amplification profiles (Figure 2) and ICA was applied to identify amplification hot spots (Figure 1b) that were subsequently probed with parameters related to carcinogenesis using hypothesis testing.

Neoplasm cluster 1 consists of miscellaneous cancers (Figure 2). Sarcomas are the main group of malignancies in cluster 1, but adenocarcinomas (breast and prostate) and also other types of neoplasms are present. No underlying unambiguously common denominators could be designated. It is worth mentioning that both breast and prostate adenocarcinomas have the same cell-of-origin (undifferentiated suprabasal stem cells) (Sell, 2004). Cluster 2 comprises two tight clusters of squamous cell carcinomas (head and neck, cervical, anal and oral carcinoma) and carcinomas of columnar epithelium (fallopian tube carcinoma, non-small-cell lung cancer and endometrial carcinoma). Noteworthy is that non-small-cell lung cancer also contains pulmonary squamous carcinoma (almost half of the cases), which strengthens the association of cluster 2 with squamous carcinomas. According to the dendrogram and correlation matrix analyses, clusters 1 and 2 are also related with each other.

The third neoplasm cluster consists of assorted sarcomas and carcinomas with preference to childhood solid tumors. Besides Ewing’s sarcoma and Wilms’ tumor, hepatoblastoma and renal cell carcinoma are tumors that usually affect young patients. A somewhat diverging clustering (from other neoplasms) of these adolescent cancers suggests that they also differ from adult cancers on the basis of the amplifications. Wilms’ tumor, Ewing’s sarcoma and hepatocellular carcinoma were tightly related within this cluster. Although Wilms’ tumor and Ewing’s sarcoma were the second most related pair of cancers in this research setting, they have not been considered to be closely related. Wilms’ tumor usually arises in children and originates from residual embryonic stem cells (Sell, 2004). The cell-of-origin of Ewing’s sarcoma is unknown but it has been postulated to originate from bone marrow stem cells (Torchia et al., 2003). It has been debated whether hepatocellular carcinoma is of bone marrow stem cell origin (Sell, 2004). In this context, Wilms’ tumor, Ewing’s sarcoma and hepatocellular carcinoma share similar pluripotent stem cell origin, which might explain the close amplification profile-based kinship. Histologically, Wilms’ tumor and Ewing’s sarcoma are small round cell tumors. In cluster 3, malignant fibrous histiocytoma (MFH), liposarcoma and malignant mesenchymoma of bone grouped together. This was expected, because they resemble each other histologically. Interestingly, melanomas of the skin and eye were totally separated in this evaluation, even though they are thought to be related.

The fourth main neoplasm cluster contains adenocarcinomas of the gastrointestinal tract. Gastric and colorectal adenocarcinomas were closely related, which might be explained by the fact that both of these cancers originate from functionally anchored stem cells of the crypts (Sell, 2004). All gastrointestinal adenocarcinomas clustered together, with the exception of pancreatic adenocarcinoma. Such gender-specific adenocarcinomas as breast cancer and prostate cancer were clearly outside this main adenocarcinoma cluster. In general, adenocarcinomas were scattered, with the exception of gastrointestinal adenocarcinomas, which suggests that adenocarcinomas are a highly heterogeneous group with concordant subclustering based on gender, topography, progenitor cell or surrounding tissue function. The gastrointestinal tract cluster was more related to clusters 1 and 2 than to cluster 3, which overrepresented the childhood solid tumors.

Instead of clear clusters, the rest of the neoplasms were observed as highly related cancer pairs and triplets or single outliers. Small-cell lung cancer and neuroendocrine lung cancers are regarded to be related and they clustered together also in this setting. Pituitary adenoma (not neoplastic), adrenocortical carcinoma and thyroid carcinoma clustered together, but neuroendocrine lung carcinoma was clearly separated from the rest of the endocrine tumors. Germline tumors were found to be connected. Retinoblastoma and Merkel cell carcinoma were clearly related and differed considerably from the other studied neoplasms. These two tumors have not been reported to be related previously.

ICA was used to identify amplification hot spots, the frequently amplified chromosome bands. Similar factor-based methods are applicable as well as other methods, for example, the Hidden Markov Model. Amplification hot spots were congruent with the amplification profiles (Figure 1). For example, 11q13 amplification was observed in the majority of the cancers. 8q24.1–q24.3 peaked in the amplification profiles in addition to the other chromosome 8 amplifications, and also 1q, 3q, 5p, 12p, 12q and 17q amplifications stood out in the amplification profile presentation (Figure 1). The amplification hot spots were related to known amplified genes. The reported amplification hot spots harbor such amplification-activated oncogenes as EGFR, MYCN, MYC and ERBB2 (Futreal et al., 2004) (Table 1). The RUNX1 (alias AML1) gene amplification (the gene at 21q22.3 maps to amplification hot spot 19, which covers chromosome 21 in Table 1) is typical in a variety of hematologic malignancies (Roumier et al., 2003). 11q23-qter is typically amplified in AML but rarely in other tumors (Zatkova et al., 2004). The BCL2 (18q21.3) amplification is typical in B-cell non-Hodgkin’s lymphoma (Monni et al., 1999). 9q11–q34 amplification was a somewhat unsuspected finding among the top 10% hot spots, because the ABL1 oncogene (located in 9q34) is usually activated by translocations in hematologic malignancies (CML and AML) and not by amplification. Recently, ABL1 was shown to be amplified in T-cell acute lymphoblastic leukemia (Bernasconi et al., 2005). 20p13–p11.1 amplification was peculiar among the hot spots as no known cancer genes are located in this region (Table 1). Likewise, the 5p15.3–p15.1 amplification hot spot was divergent from the other hot spots, because the region is not known to harbor any fragile sites or cancer genes (Table 1). Amplifications at 13q, 19 and 20q would be

expected to appear among the amplification hot spots. These amplifications did not break into the top 10%, but they were seen in the individual cancer profiles. For example, 20q amplifications were frequent in gastric cancer and colorectal cancer, and 19q (the AKT2 gene is located at 19q13.1–q13.2) contained amplifications in the majority of neoplasms (Figure 1). The 2p13 locus of the REL gene is also missing from the amplification hot spot list. REL is frequently amplified in some lymphomas (Futreal et al., 2004). Amplifications in 3q, 5, 6p, 8q, 11q13, 15q, 17q, 21, 22 and X are typical of cancers in general. In addition to these general amplifications, cancer-specific amplifications emerged in the global analysis of amplification hot spots. 2p25–p23 amplification (containing the MYCN gene) is specific to neuroblastoma (Schwab, 2004), 12p amplifications frequently appear in embryonal tumors (teratomas) and liposarcomas (Looijenga and Oosterhuis, 1999; Rieker et al., 2002), 17p amplifications are typical of some sarcomas (osteosarcoma, MFH and leiomyosarcoma) (El-Rifai et al., 1998; Weng et al., 2003; Atiye et al., 2005) and 1q21–q24 amplifications are frequent in osteosarcomas (Ozaki et al., 2002). Our findings suggest that some amplification hot spots can be used as clinical markers for cancer diagnostics and confirm the general conception that DNA amplifications are characteristic of malignant tumors, whereas benign tumors show no amplifications. Our findings clearly indicate that DNA amplifications may be applied in the differential diagnosis of small round cell tumors: 2p amplification (hot spot 4) in neuroblastoma, 17p (hot spot 6) in osteosarcoma, 18q (hot spot 18) in lymphoma, and 1q (hot spots 17 and 24), 12 (hot spots 2, 21 and 30) and 8 (hot spots 9, 14 and 16) in Ewing’s sarcoma. There are more than 10 malignant small round cell tumors, affecting mainly children and adolescents, that differ in prognosis and treatment and are difficult to differentiate.

Colocalization of gene size cohorts and amplification hot spots or fragile sites was studied using statistical modeling to assess the associations between gene size and DNA amplifications and chromosomal fragility. Gene sizes between 0 and 150 kb (92.32% of all genes) and 250 and 400 kb (2.5% of all genes) were found to be overrepresented in the amplification hot spot loci when compared with a random reference. Genes between 200 and 350 kb (2% of all genes) and 900 kb and 1 Mb (0.09% of all genes) were overrepresented in the prevalence weighted fragile sites. These gene size cohorts represent a minority in the human gene population, suggesting that genes in the fragile chromosomal regions are generally quite large. Interestingly, the genes that were between 900 kb and 1 Mb in length were associated with fragile sites, supporting the hypothesis that extra large genes are breakage prone, even though neither of the known fragile genes (WWOX or FHIT) was included in this gene set. Noteworthy is that the genes, which were 250–350 kb in length, showed the best association scores in both fragile site and amplification hot spot assessments. These genes represent 1.5% of the human genome and are lengthwise among the top 3.8%. These findings suggest that large genes might be involved in the amplification process and genomic fragility.

Associations between labile genomic features and specific amplification hot spots were statistically analysed. Only one amplification hot spot factor (8p23-q12; 8q23) (amplification hot spot no. 9, Figure 1b and Table 1) was enriched with cancer genes when compared to a random reference. This finding may be explained by overrepresentation of cancer genes in this chromosomal location. Virus integration sites were overrepresented in four amplification hot spots, at 2q13–q36, 8q24.1–q24.3, 9q11–q34 and 18q11.2–q23. None of the amplification hot spots were enriched with fragile sites and most of the amplification hot spot findings cannot be explained by cancer gene accumulation or virus integration site colocalization. Evaluation of general associations revealed no statistical overrepresentation of fragile sites, virus integration sites or cancer genes on the reported amplification hot spots when compared with non-amplified chromosomal regions. Even when the expression intensity of the common fragile site (measured as a lesion count) (Glover et al., 1984) was taken into account, no global association was observed.

Although no statistical associations were observed in the global setting, most of the amplification hot spots contain many cancer genes and fragile sites (Table 1), suggesting that these sites might be selected to amplify because of the qualitative properties of the cancer genes and facilitation conducted by the fragile site expression. Furthermore, the resolution of the microscopic data used in this study is inadequate to enable thorough investigation of the relationship between genomic fragility and amplifications, because amplification hot spots span 45% of the human genome and fragile sites are located in 30% of the chromosome bands. Thus, further studies using molecular biology resolution are needed to address the colocalization of amplification breakpoint regions and fragile DNA sequence features. Amplification breakpoint regions need to be mapped using genome-wide array comparative genomic hybridization (aCGH) and cloning is essential in determining fragile sequences (Buttel et al., 2004). A cross-cancer database of aCGH results linked with labile genomic features is instrumental for systems biology oriented data mining and research of amplification mechanisms.

This study was restricted to amplifications because identification of DNA copy number gains and losses by conventional CGH is relatively unreliable. Nonetheless, it is well accepted that DNA copy number aberrations owing to deletions and low-level gains are also of great importance in cancer pathology. The profiling analysis was restricted to group averages, even though quantitative data for a single case might also be important in deciphering tumor-maintaining cancer gene networks. It is also recognized that the cancer cell genome is variable and tumor tissue is not homogeneous with regard to tissue architecture or function. DNA copy number losses as well as minor gains, single case experiments and cancer genome plasticity are elements that can be studied more explicitly using the proposed aCGH approach.

In conclusion, the amplification profiling analyses showed that cancers with similar cell-of-origin, histology or topographical location obtained convergent amplification profiles, suggesting that amplifications are selected in the cancer tissue environment according to the qualitative traits and localization of cancer genes. The amplification patterns seem to be too complex for very simple mechanistic explanations, although some associations of colocalization between labile genomic features and DNA amplifications were observed. The amplification profiling results demonstrate the independent value of DNA copy number amplifications in cancer pathology. The identified amplification hot spots may have clinical value in differential diagnostics of some tumors. These findings encourage the development of molecular classification based on amplification profiling to supplement the conventional, histology-based classification of tumors. Subtyping of tumors based on amplification signatures would be advantageous, because characterization based on molecular properties is more relevant in the clinical setting and enables development of approaches directed at biologically active targets in cancer. Knowledge and comparison of molecular pathology of tumors are also essential in designing innovative clinical practices, such as multi-cancer therapies, and flexible diagnostic and prognostic applications.

Materials and methods

DNA copy number amplification data
The CGH data was obtained from a publicly available data collection (http://www.helsinki.fi/cmg/cgh_data.html) containing 23 284 cases, which have been collected by inspecting 838 original research articles of chromosomal CGH studies published in peer-reviewed journals between 1992 and 2002 (Knuutila et al., 2000). Amplifications were determined from the CGH data collection. A cutoff value of 1.5 in the CGH measurement ratios or definitions given by the authors were applied when selecting the amplified chromosome bands. Amplification data for each case in the data collection was presented as a vector of zeros and ones (not amplified or amplified chromosome bands). Chromosome sub-band resolution was used to map the DNA copy number amplifications to the genome (ISCN, 1995). Vectors of amplification observations in chromosome sub-bands were assembled into a data matrix.

Amplification profiling of human neoplasms
Each case in the amplification data matrix was classified according to the World Health Organization (WHO) guidelines. The cases were then grouped according to the neoplasm classifications. An amplification profile for each neoplasm (n = 73) was defined by calculating chromosome band-specific and chromosome arm-specific frequencies of amplification observations. For reference, similar amplification profiles were calculated for all neoplasms excluding the neoplasm that was studied. To define characteristic amplifications of each neoplasm, differences between the amplification profiles of studied neoplasms and the reference were analysed using a hypothesis test. Significance of the difference between the studied neoplasm and the reference profiles was analysed using a permutation test (1000 permutations) (Good, 2000). The significance threshold was set at P = 0.01.

Clustering of human neoplasms based on amplification profiles
Hierarchical clustering was applied to evaluate the amplification-based similarity between neoplasms. The average values of neoplasm-specific groups were used in the clustering. A pairwise correlation matrix was calculated based on the neoplasm group-specific amplification profiles, and a dendrogram was built using average linkage distance between the neoplasms. Four coherent clusters were identified based on the dendrogram and the correlation matrix.

Identification of amplification hot spots
Amplification hot spots were identified computationally using chromosome band resolution and case-specific data without neoplasm information. The ICA method was applied to identify independent factors in the data set. The ICA factor model for identifying candidate amplification hot spots assumes that an observed pattern of amplifications, vector x, is formed as a sum of basis vectors (in matrix A) triggered by independent latent variables in vector s, that is, x = As. One basis vector represents a hot spot. In this model, a hot spot need not be contiguous with respect to the spatial location on bands. As the amplification data is binary, we used the threshold method described previously (Himberg and Hyvärinen, 2001). Himberg and Hyvärinen (2001) show that the linear ICA can be successfully used for 'sparse' data even if A, x and s are binary, and the mixing model contains a post-mixture threshold function U, that is, x = U(As). The amplification data is sparse (less than 5% ones among zeros). To reduce noise, the dimensionality was lowered using the principal component analysis (Hyvärinen et al., 2001). We selected 60 components with the largest eigenvalues. Then, FastICA was run 30 times with different initial conditions and the data were bootstrapped (Himberg et al., 2004). The contrast function was based on skewness. The mixing matrix A was then scaled and thresholds were set (Himberg and Hyvärinen, 2001). The components were ordered according to the stability index defined by Himberg et al. (2004), and 30 basis vectors corresponding to the most stable components were selected.

Statistical evaluation of colocalization of amplification hot spots, fragile sites, virus integration sites, cancer genes and gene size cohorts
Elemental factors contributing to the induction of amplifications were tested in a global statistical setting to investigate the underlying mechanisms. Amplification hot spots were extracted as explained above. Fragile sites and virus integration sites were adopted from the NCBI Locus Link (National Center for Biotechnology Information, Bethesda, MD, USA) and lesion prevalence in the common fragile sites was derived from a previous publication (Glover et al., 1984). A census of human cancer genes was published by Futreal et al. (2004) and an updated list of cancer genes was retrieved from the Cancer Genome Project web site (http://www.sanger.ac.uk/genetics/CGP/). When fragile sites, virus integration sites and cancer genes were matched with amplifications, their genomic locations were reported using coarser chromosome sub-band resolution.

Enrichment of cancer genes, fragile sites (NCBI derived) and virus integration sites on specific amplification hot spots was measured by counting the factor-specific observations and comparing the results against a random reference. In addition, the distribution of fragile sites (NCBI annotated as well as prevalence weighted), cancer genes and virus integration sites between amplification hot spots and non-amplified chromosome regions was measured. The significance of the differences in the observations was defined using a hypothesis test in a non-parametric setting. P-values for the differences in observations were assigned according to the empirical distribution generated using a permutation test.

In order to test whether gene size contributes to site specificity of chromosomal fragility or amplifications, the distribution of gene size cohorts between fragile sites (NCBI and prevalence weighted) and non-fragile regions as well as amplification hot spots and non-amplified chromosomal regions was measured. The human genes were collected using the Ensembl services (Birney et al., 2004) and sorted by length. Genomic size-based gene cohorts were extracted by selecting windows of 100 kb in length with an overlap of 50 kb, that is, 0–100, 50–150, 100–200 kb, etc. Significance of the differences in means was evaluated using a non-parametric permutation test similar to the method described above.

Acknowledgements

This research was supported by grants from the Finnish Academy (SYSBIO Research Program), the Helsinki University Central Hospital Research Funds and the Sigrid Jusélius Foundation. The University of Helsinki, the Helsinki University of Technology and the HUSLAB provided the research facilities and the equipment used in this study.
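The gene size cohort windows and the permutation test described in Materials and methods can be illustrated with a minimal Python sketch. It is not the authors' code. The function names (`size_windows`, `cohort_indicator`, `permutation_p_value`), the half-open window boundaries, the two-sided statistic, the add-one correction and the fixed random seed are all assumptions made for illustration; the text specifies only 100 kb windows with a 50 kb overlap and a non-parametric permutation test on differences in means.

```python
import random
from statistics import mean


def size_windows(max_kb, width_kb=100, step_kb=50):
    """Overlapping gene-size windows in kb: (0, 100), (50, 150), (100, 200), ..."""
    windows = []
    start = 0
    while start + width_kb <= max_kb:
        windows.append((start, start + width_kb))
        start += step_kb
    return windows


def cohort_indicator(gene_lengths_kb, window):
    """Mark genes whose length falls in [start, end); boundaries are an assumption."""
    start, end = window
    return [1 if start <= g < end else 0 for g in gene_lengths_kb]


def permutation_p_value(group_a, group_b, n_perm=1000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction (an assumption) keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical gene lengths (kb) inside and outside hot spots.
    in_hot_spots = [260, 310, 275, 40, 330, 290]
    elsewhere = [20, 35, 80, 15, 300, 60, 45, 10]
    window = (250, 350)
    p = permutation_p_value(
        cohort_indicator(in_hot_spots, window),
        cohort_indicator(elsewhere, window),
    )
    print(window, p)
```

In this sketch, each group is reduced to a 0/1 indicator of cohort membership, so the difference in means is a difference in cohort proportions between, for example, hot spot and non-amplified regions. The default of 1000 permutations mirrors the number stated for the neoplasm profile comparison; the text does not state the number used for the gene size analysis.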

References

Albertson DG, Collins C, McCormick F, Gray JW. (2003). Nat Genet 34: 369–376.
Atiye J, Wolf M, Kaur S, Monni O, Bohling T, Kivioja A et al. (2005). Genes Chromosomes Cancer 42: 158–163.
Bernasconi P, Calatroni S, Giardini I, Inzoli A, Castagnola C, Cavigliano PM et al. (2005). Cancer Genet Cytogenet 162: 146–150.
Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L et al. (2004). Genome Res 14: 925–928.
Buttel I, Fechter A, Schwab M. (2004). Ann NY Acad Sci 1028: 14–27.
El-Rifai W, Sarlomo-Rikala M, Knuutila S, Miettinen M. (1998). Am J Pathol 153: 985–990.
Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R et al. (2004). Nat Rev Cancer 4: 177–183.
Glover TW, Berger C, Coyle J, Echo B. (1984). Hum Genet 67: 136–142.
Goker E, Waltham M, Kheradpour A, Trippett T, Mazumdar M, Elisseyeff Y et al. (1995). Blood 86: 677–684.
Good P. (2000). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag: Berlin.
Graux C, Cools J, Melotte C, Quentmeier H, Ferrando A, Levine R et al. (2004). Nat Genet 36: 1084–1089.
Guan XY, Meltzer PS, Dalton WS, Trent JM. (1994). Nat Genet 8: 155–161.
Hahn PJ. (1993). Bioessays 15: 477–484.
Hellman A, Zlotorynski E, Scherer SW, Cheung J, Vincent JB, Smith DI et al. (2002). Cancer Cell 1: 89–97.
Himberg J, Hyvärinen A. (2001). In: Lee T-W, Jung T-P, Makeig S, Sejnowski TJ (eds). Third International Workshop on Independent Component Analysis and Blind Signal Separation (ICA2001). Institute for Neural Computation, University of California San Diego: San Diego, CA, USA, pp 552–556.
Himberg J, Hyvärinen A, Esposito F. (2004). Neuroimage 22: 1214–1222.
Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E et al. (2002). Cancer Res 62: 6240–6245.
Hyvärinen A, Oja E, Karhunen J. (2001). Independent Component Analysis. John Wiley & Sons: New York.
ISCN (1995). An International System for Human Cytogenetic Nomenclature. S Karger: Basel.
Johnston RN, Beverley SM, Schimke RT. (1983). Proc Natl Acad Sci USA 80: 3711–3715.
Knuutila S, Autio K, Aalto Y. (2000). Am J Pathol 157: 689.
Lengauer C, Kinzler KW, Vogelstein B. (1998). Nature 396: 643–649.
Livingstone LR, White A, Sprouse J, Livanos E, Jacks T, Tlsty TD. (1992). Cell 70: 923–935.
Looijenga LH, Oosterhuis JW. (1999). Rev Reprod 4: 90–100.
Maurer BJ, Lai E, Hamkalo BA, Hood L, Attardi G. (1987). Nature 327: 434–437.
Monni O, Franssila K, Joensuu H, Knuutila S. (1999). Leukemia Lymphoma 34: 45–52.
Myllykangas S, Knuutila S. (2006). Cancer Lett 232: 79–89.
Ozaki T, Schaefer KL, Wai D, Buerger H, Flege S, Lindner N et al. (2002). Int J Cancer 102: 355–365.
Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF et al. (1999). Nat Genet 23: 41–46.
Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE et al. (2002). Proc Natl Acad Sci USA 99: 12963–12968.
Rieker RJ, Joos S, Bartsch C, Willeke F, Schwarzbach M, Otano-Joos M et al. (2002). Int J Cancer 99: 68–73.
Roumier C, Fenaux P, Lafage M, Imbert M, Eclache V, Preudhomme C. (2003). Leukemia 17: 9–16.
Schwab M. (1998). Bioessays 20: 473–479.
Schwab M. (2004). Cancer Lett 204: 179–187.
Sell S. (2004). Crit Rev Oncol Hematol 51: 1–28.
Shannon KM. (2002). Cancer Cell 2: 99–102.
Torchia EC, Jaishankar S, Baker SJ. (2003). Cancer Res 63: 3464–3468.
Wahl GM, Padgett RA, Stark GR. (1979). J Biol Chem 254: 8679–8689.
Weng WH, Ahlen J, Lui WO, Brosjo O, Pang ST, Von Rosen A et al. (2003). Br J Cancer 89: 720–726.
Yin Y, Tainsky MA, Bischoff FZ, Strong LC, Wahl GM. (1992). Cell 70: 937–948.
Zatkova A, Ullmann R, Rouillard JM, Lamb BJ, Kuick R, Hanash SM et al. (2004). Genes Chromosomes Cancer 39: 263–276.

Supplementary Information accompanies the paper on the Oncogene website (http://www.nature.com/onc).
