DEVELOPMENT AND APPLICATION OF AN INTEGRATIVE GENOMICS APPROACH TO LUNG CANCER

by

RAJAGOPAL CHARI

B.Sc., University of British Columbia, 2001
B.Sc., University of British Columbia, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES (Pathology and Laboratory Medicine)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

June 2010

© Rajagopal Chari, 2010

Abstract

Lung cancer has the highest mortality rate amongst all diagnosed malignancies with adenocarcinoma (AC) being the most commonly diagnosed subtype of this disease in North America. The dismal survival statistics of lung cancer patients are largely due to the detection of the disease at an advanced stage and to a lesser extent, the limited efficacy of current front line treatments.

Genomic approaches, namely expression analysis, have provided tremendous insight into lung cancer. While many gene expression changes have been identified, most changes are likely reactive to changes which have a primary role in cancer development. Moreover, one feature which can discern primary from reactive changes is the presence of concordant DNA level alteration.

Many well-known genes involved in cancer, such as TP53 and CDKN2A, have been shown to be affected by multiple mechanisms of alteration, such as somatic mutation in, or loss of, DNA sequence. For a given gene, one tumor may be affected by one mechanism while another tumor may be affected by a different mechanism. Although this level of multi-dimensional analysis has been performed for specific genes, such analysis has not been done at the genome-wide level.

This thesis highlights the development and application of a multi-dimensional genetic and epigenetic approach to identify frequently aberrant genes and pathways in lung AC. I present, first, the design and implementation of the system for integrative genomic multi-dimensional analysis of cancer genomes, epigenomes and transcriptomes (SIGMA2). Next, analyzing a multi-dimensional dataset generated from ten lung AC specimens with non-malignant controls, I identified novel genes and pathways that would have been missed if a non-integrative approach were used. Finally, examining genes involved with EGFR signaling, I identified a gene, signal regulatory protein alpha (SIRPA), which had not been previously shown to be associated with lung cancer.

Taken together, these findings demonstrate the power of a multi-dimensional approach to identify important genes and pathways in lung cancer. Moreover, identifying key genes using a multi-dimensional approach on a small sample set suggests that the need for large datasets may be circumvented by using a more comprehensive approach on a smaller set of samples.


Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
Co-Authorship Statement
Chapter 1: Introduction
1.1 Lung cancer
1.2 Genomic profiling of lung cancer
1.2.1 Gene expression analysis
1.2.2 DNA copy number analysis
1.2.3 Loss of heterozygosity (LOH) and allelic imbalance
1.3 Somatic mutations in lung cancer
1.4 Epigenetic alterations in lung cancer
1.4.1 DNA methylation
1.5 Current level of integrative analysis
1.6 Need for an integrative approach to study lung cancer
1.7 Bioinformatic tools for genomic analysis
1.8 Thesis theme
1.9 Objectives and hypothesis
1.10 Specific aims and outline of thesis
1.11 Description of high throughput data in this thesis
1.12 Other relevant contributions not included as chapters in this thesis
1.12.1 Development of tools for genomic analysis
1.12.2 Baseline gene expression in non-malignant lung tissue
1.12.3 Differential gene expression analysis in lung cancer
1.12.4 Integration of gene dosage and gene expression in lung cancer
1.13 References


Chapter 2: SIGMA2: A system for the integrative genomic multi-dimensional analysis of cancer genomes, epigenomes, and transcriptomes
2.1 Introduction
2.2 Implementation
2.3 Results and discussion
2.3.1 Look and feel of SIGMA2
2.3.2 Description of application scope and functionality
2.3.3 Approach to integration between array platforms and assays
2.3.4 Format requirements of input data
2.3.5 Description of user interface
2.3.6 Analysis of data from a single assay type
2.3.7 Analysis of data from multiple assays in a given 'omics dimension
2.3.8 Combinatorial analysis of multiple 'omics dimensions - gene dosage and gene expression
2.3.9 Group comparison analysis - single 'omics dimension
2.3.10 Group comparison analysis - integrating multiple 'omics dimensions
2.3.11 Multi-dimensional analysis of a breast cancer genome
2.3.12 Exporting data and results
2.4 Conclusions
2.5 Availability and requirements
2.6 References
Chapter 3: An integrative multi-dimensional genetic and epigenetic strategy to identify aberrant genes and pathways in cancer
3.1 Background
3.2 Methods
3.2.1 Data generation and acquisition
3.2.2 Data processing and normalization
3.2.3 Strategy for integrative analysis
3.2.4 Multiple concerted disruption (MCD) analysis
3.2.5 Simulated data analysis
3.2.6 Pathway enrichment analysis
3.2.6 Survival and differential gene expression analysis in publicly available datasets
3.3 Results and discussion
3.3.1 Analysis of individual genomic dimensions


3.3.2 Multi-dimensional analysis (MDA) reveals a higher proportion of intra-sample deregulated gene expression can be explained when more dimensions are analyzed
3.3.3 MDA reveals genes are disrupted at higher frequencies when examining multiple dimensions as compared to any single dimension alone
3.3.4 MDA identifies significantly enriched cancer related pathways
3.3.5 MDA of the Neuregulin signaling pathway reveals a complex pattern of deregulation
3.3.6 Genes exhibiting multiple concerted disruption (MCD) - biological and clinical significance
3.3.7 Association of genes exhibiting MCD and triple negative breast cancers (TNBC)
3.4 Conclusions
3.5 References
Chapter 4: Uniparental disomy is a prevalent genetic mechanism of oncogene disruption in lung adenocarcinoma
4.1 Introduction
4.2 Methods
4.2.1 Genome wide profiling of clinical lung adenocarcinoma specimens
4.2.2 Determination of regions of uniparental disomy (UPD) in clinical lung tumors
4.2.3 Determining frequent regions of UPD, gain and loss
4.2.4 Determination of UPD in cancer cell lines
4.2.5 Expression analysis of genes in focal regions of UPD
4.3 Results
4.3.1 Detection of UPD using allele specific copy number analysis
4.3.2 UPD is prevalent and non-random in the lung cancer genome with comparable frequencies to gain and loss
4.3.3 Overlap of major oncogenes and tumor suppressor genes in regions of gain, loss, and UPD
4.3.4 UPD is prevalent at oncogenes across multiple cancer types
4.3.5 Identification of novel candidate oncogenes using focal regions of UPD
4.4 Discussion
4.5 Conclusion
4.6 References
Chapter 5: Integrating the multiple dimensions of genomic and epigenomic landscapes of cancer
5.1 Introduction


5.2 Genomic alterations
5.2.1 Chromosomal aberrations
5.2.2 Gene dosage, allelic imbalance, mutational status
5.2.3 Genomic landscape: Gains, losses and uniparental disomy
5.3 Epigenomic alterations
5.3.1 The cancer methylome
5.3.2 Integration of cancer genomic and epigenomic events
5.4 Relating genetic and epigenetic events to changes in the transcriptome through integrative analysis
5.4.1 Multiple mechanisms of gene disruption
5.4.2 Multiple mechanisms of disrupting non-coding RNA levels
5.4.3 Multi-dimensional integration of genome, epigenome, and transcriptome
5.4.4 Disruption of multiple components in biological pathways
5.4.5 Identification of a novel gene involved with EGFR signaling deregulated in adenocarcinoma
5.4.6 Prevalence of SIRPA deregulation and association with clinical characteristics
5.5 Tracking clonal expansion in spatial dimensions
5.6 Evaluating the biological significance of integrative genomics findings
5.5 References
Chapter 6: Conclusions
6.1 Summary
6.1.1 Development of the integrative genetic and epigenetic approach
6.1.2 Identification of a prevalent genetic alteration in lung adenocarcinoma
6.1.3 Application of the integrative approach to lung adenocarcinoma specimens
6.2 Conclusions
6.3 Future directions
6.4 References
APPENDIX I: List of publications
APPENDIX II: Description of cell lines
APPENDIX III: Sources of data
APPENDIX IV: MCD strategy and Kaplan-Meier analysis of TUSC3
APPENDIX V: Kaplan-Meier and Oncomine expression analysis of frequent MCD genes
APPENDIX VI: Summary of Kaplan-Meier survival analysis
APPENDIX VII: Copy of UBC Research Ethics Board certificate of approval


List of Tables

Table 2.1. Features required for integrative analysis
Table 2.2. Summary of input, analysis, output for each dimension
Table 4.1. Regions of the genome exhibiting frequent UPD
Table 4.2. List of major oncogenes and tumor suppressor genes assessed
Table 4.3. Overlap of oncogenes in frequent regions of genomic alteration
Table 4.4. Overlap of tumor suppressor genes in frequent regions of genomic alteration
Table 4.5. Cell lines and oncogene loci with homozygous mutation
Table 4.6. Summary of homozygous mutation analysis in cancer cell lines
Table 4.7. RefSeq genes in focal regions of UPD
Table 4.8. Genes overexpressed in focal regions of UPD
Table 5.1. List of software for integrative analysis
Table 5.2. List of genomic resources and databases
Table 5.3. Genes interacting with SIRPA as identified by network analysis


List of Figures

Figure 1.1. Multiple mechanisms of alteration leading to same downstream consequences
Figure 2.1. Main structural components of SIGMA2
Figure 2.2. Data structure hierarchy
Figure 2.3. Algorithm for integrating between different array platforms
Figure 2.4. SIGMA2 interface
Figure 2.5. Consensus calling and heterogeneous array analysis
Figure 2.6. Integrative genetic analysis of HCC2218
Figure 2.7. Two-group two dimensional comparison of 37 NSCLC and 16 SCLC cancer cell lines
Figure 2.8. Multi-dimensional perspective of chromosome 17 of the HCC2218 breast cancer cell line
Figure 3.1. Genomic profiles of breast cancer cell lines
Figure 3.2. Quantitative and qualitative benefits of integrative analyses
Figure 3.3. Determination and application of a disruption frequency threshold
Figure 3.4. Impact of multi-dimensional analysis on low frequency events
Figure 3.5. Pathway analysis of the 1162 genes identified by multi-dimensional analysis
Figure 3.6. Complex deregulation of the Neuregulin/ERBB2 signaling pathway
Figure 3.7. Deregulation of PTEN occurs differently between samples
Figure 3.8. Multiple concerted disruption (MCD) analysis and its application to triple negative breast cancer
Figure 4.1. Detection of UPD using allele specific copy number
Figure 4.2. Comparison of frequent regions of gain, loss and UPD in the lung adenocarcinoma genome
Figure 4.3. Venn diagram illustrating the amount of the genome covered by gain, loss, and UPD
Figure 4.4. Genomic profile of an individual lung adenocarcinoma sample
Figure 4.5. Examination of UPD events at the KRAS and RB1 loci
Figure 4.6. Relationship of homozygous mutation at oncogenes and genomic alteration
Figure 4.7. Identification of E2F3 in a focal region of UPD
Figure 5.1. Advances in cancer genomic landscape post Y2K
Figure 5.2. SNP array analysis to identify areas of altered copy number and allelic composition in a clinical lung cancer specimen


Figure 5.3. Overlay of chromosomal regions of gain, loss and UPD (copy number neutral LOH) inherent to the T47D breast cancer cell line
Figure 5.4. Integration of copy number, allelic status, DNA methylation, and gene expression for a single lung adenocarcinoma sample
Figure 5.5. Integration of copy number, allelic status, DNA methylation, and gene expression for a single lung adenocarcinoma sample
Figure 5.6. Identification of multiple disrupted components in a biological pathway
Figure 5.7. Multi-dimensional analysis of the epidermal growth factor receptor signaling pathway
Figure 5.8. Prevalence of SIRPA underexpression and its relationship with PTPN6 and smoking status
Figure 5.9. Kaplan-Meier analysis of SIRPA in four independent microarray datasets
Figure 5.10. Automated detection of selected clonal populations of cells within a cancer biopsy tissue section


List of Abbreviations

Abbreviation Definition

AC Adenocarcinoma

ASCN Allele specific copy number

BRAF v-raf murine sarcoma viral oncogene homolog B1

CDKN2A Cyclin-dependent kinase inhibitor 2A

CGH Comparative Genomic Hybridization

CNV Copy number variation

DNA Deoxyribonucleic Acid

EGFR Epidermal Growth Factor Receptor

FISH Fluorescence in-situ hybridization

GWAS Genome wide association studies

KRAS v-Ki-ras2 Kirsten rat sarcoma viral oncogene

LOH Loss of Heterozygosity

MASI Mutant allele specific imbalance

MCD Multiple Concerted Disruption

MDA Multi-Dimensional Analysis

MUC1 Mucin 1

NSCLC Non-small cell lung cancer

PCR Polymerase Chain Reaction

qPCR Quantitative PCR


RB1 Retinoblastoma 1

RNA Ribonucleic Acid

RRM2 Ribonucleotide Reductase Subunit M2

SIGMA System for integrative genomic microarray analysis

SIGMA2 System for integrative genomic multi-dimensional analysis

SIRPA Signal Regulatory Protein Alpha

SKY Spectral karyotyping

SNP Single nucleotide polymorphism

TUSC3 Tumor suppressor candidate 3

UPD Uniparental Disomy


Acknowledgements

I would like to acknowledge the contributions of many of my colleagues in the Wan Lam Lab who contributed to this work, especially the co-authors of each of the manuscript chapters presented herein. Detailed acknowledgements from the published version of Chapter 2 are listed below:

Chapter 2: We thank William W. Lockwood and Timon P.H. Buys for useful discussion and critical reading of the manuscript, Ashleen Shadeo for providing data for breast cancer samples, and Anna Chu, Byron Cline, Devon Macey, Andrew Thomson, Lan Wei, Reginald Sacdalan, Tiffany Chao, and Laura Aslan for help with software development.

I would also like to acknowledge generous scholarship support from the Canadian Institutes of Health Research and Michael Smith Foundation for Health Research.

The research presented in this thesis was funded by the following granting agencies: Genome Canada/Genome British Columbia, Canadian Cancer Society Research Institute (CCS20485), Canadian Institutes of Health Research (MOP 86731, MOP 77903), National Institutes of Health (R01 DE15965-01), National Cancer Institute Early Detection Research Network (5U01 CA84971-10), Canary Foundation, and Canadian Breast Cancer Research Alliance.


Dedication

To my family.


Co-Authorship Statement

Chapters 2 to 5 were co-authored as manuscripts for publication. The following author lists apply for each chapter:

Chapter 2: Chari R, Coe BP, Wedseltoft C, Benetti M, Wilson IM, Vucic EA, MacAulay C, Ng RT, Lam WL. (2008) SIGMA2: a system for the integrative genomic multi-dimensional analysis of cancer genomes, epigenomes, and transcriptomes. BMC Bioinformatics, 9(1):422, 1-12.

Contribution: I am the first author of this manuscript. I designed and developed the software and wrote the manuscript. The co-authors of this manuscript were either undergraduate students whom I mentored on this project or fellow graduate students who tested the software and provided important user feedback.

Chapter 3: Chari R, Coe BP, Vucic EA, Lockwood WW, Lam WL. (2010) An integrative multi-dimensional genetic and epigenetic strategy to identify aberrant genes and pathways in cancer. BMC Systems Biology, 4(1):67, 1-14.

Contribution: I am the first author of this manuscript. I acquired most of the data through generating genomic profiles and downloaded the rest of the data from public resources. I conceived the analysis for the manuscript and wrote the manuscript.

Chapter 4: Chari R, Lockwood WW, Soh J, Coe BP, Tam K, MacAulay C, Minna JD, Lam S, Gazdar AF, Lam WL. (2010) Uniparental disomy is a prevalent mechanism of genetic alteration in lung adenocarcinoma.

Contribution: I am the first author of this manuscript. I generated all of the data and performed all of the analyses for this manuscript and my co-authors provided useful information through comments and other supporting data.


Chapter 5: Chari R, Thu KL, Wilson IM, Lockwood WW, Lonergan KM, Coe BP, Malloff CA, Gazdar AF, Lam S, Garnis C, MacAulay CE, Alvarez CE, Lam WL. (2010) Integrating the multiple dimensions of genomic and epigenomic landscapes of cancer. Cancer and Metastasis Reviews, 29(1):73-93.

Contribution: I am the first author of this manuscript. I orchestrated the study, performed all analyses and wrote the manuscript with the help of my supervisor. Other co-authors provided useful information, data, or comments.


Chapter 1: Introduction

1.1 Lung cancer

Lung cancer has the highest mortality rate amongst all diagnosed malignancies [1]. For 2009, an estimated 24,000 individuals in Canada were expected to be diagnosed with lung cancer, with approximately 21,000 individuals succumbing to this disease (Canadian Cancer Statistics 2009, www.cancer.ca). Lung cancer is classified into two main types: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC). Within NSCLC, the two major histological subtypes are adenocarcinoma (AC) and squamous cell carcinoma (SqCC), with large cell carcinoma (LCC) being the third most common histological subtype. AC accounts for the highest percentage of all lung cancer cases, representing almost half of all NSCLCs diagnosed.

The primary etiological factor associated with lung cancer is tobacco smoke exposure. While the majority of lung cancer patients have a history of heavy smoke exposure, there is an increasing percentage of lung cancer patients (25%) in whom primary smoke exposure is not the associated cause of the disease [2]. Moreover, while all histological subtypes are associated with smoke exposure, SCLC and SqCC show the strongest associations [3]. In addition, amongst never smokers, the majority of cases are of the adenocarcinoma subtype [2].

Across the spectrum of all NSCLC patients, independent of stage, only 15% will achieve five-year survival, with the median survival time of lung cancer patients being less than one year. Stratification by stage reveals that individuals diagnosed early (stage IA) have a far higher rate of five-year survival than those diagnosed late (stage IV) (50% vs. 2%) [4]. Given that the overall survival independent of stage is closer to that of stage IV than stage IA, it is clear that the poor survival statistics are largely due to the late diagnosis of this disease and, to a lesser extent, the low response rate observed with conventional chemotherapies [5].

While overall therapeutic strategies have provided limited benefit in prolonging patient survival, there has been moderate success in the application of targeted therapeutics. Specifically, pharmacological agents against the epidermal growth factor receptor (EGFR) tyrosine kinase have shown selective efficacy in a subset of lung AC patients [6-12]. Hence, in addition to improving early detection strategies, another main focus of lung cancer research is the identification of novel therapeutic targets. One approach to identifying such targets is the application of genomic tools to clinical lung cancer specimens.

1.2 Genomic profiling of lung cancer

1.2.1 Gene expression analysis

One of the first applications of high throughput genome technologies was the assessment of messenger RNA (mRNA) levels [13, 14]. While the first landmark cancer-related studies were done in breast and hematological malignancies [15-17], substantial findings were also made in the analysis of lung cancer. Specifically, lung cancer gene expression studies have identified genes differentially expressed in tumors, genes associated with angiogenic potential, genes associated with chemoresistance, expression signatures defining subclasses of lung cancer, expression signatures associated with patient prognosis, and expression signatures from normal bronchial epithelium samples to detect lung cancer [18-34]. In addition, much work has been done to understand baseline gene expression in non-malignant lung tissue as well as its changes with respect to heavy smoke exposure [35-38]. These studies are as important as studies involving lung cancer samples, as they provide a reference level of gene expression against which to decipher the dysregulated gene expression in tumors.

However, a given analysis of differential expression typically yields hundreds, if not thousands, of genes which show aberrant expression in tumors when compared to non-malignant tissue. Moreover, it is likely that a proportion of these aberrantly expressed genes are not integral or causal to tumor development, as many gene expression changes are reactive to changes in expression of other genes. In addition, using gene expression alone, one cannot discern which changes are causal and which are reactive. One approach to assigning causality to gene expression changes is to identify alterations at the DNA level, such as somatic mutation, changes in gene dosage (DNA copy number), or epigenetic changes such as aberrant DNA methylation or histone modification, which can explain the observed differential expression.

1.2.2 DNA copy number analysis

Alterations in gene dosage, whereby segments of DNA in the genome are either replicated or lost, have been shown to be important in lung cancer [39-41]. Typically, these gains and losses of DNA are detected through the comparison of a genome from a tumor sample with a genome that is normal or non-malignant. It is thought that these increases or decreases in the amount of a specific gene sequence could allow for increased or decreased expression of that gene.

Technological advances have allowed for the high throughput assessment of DNA copy number changes in the cancer genome, namely through microarray comparative genomic hybridization (CGH) [42, 43]. Briefly, this technology capitalizes on differential fluorescence labelling, where DNA from the tumor sample and DNA from the normal sample, each labelled with a different fluorescent dye, are hybridized together on the same chip and differences in fluorescent intensities are measured. Array CGH profiling of both lung cancer cell lines and tumors has identified areas of the genome which are frequently gained or lost [44-52]. Specifically, these areas of copy number alteration have targeted known oncogenes such as MYC, EGFR, MDM2, and TERT, and tumor suppressor genes such as CDKN2A, TP53 and RB1.

However, these alterations typically do not occur in 100% of lung tumors (e.g. the EGFR locus is gained/amplified in 10-20% of cases). In addition, amplification and deletion events typically encompass multiple genes and, more often than not, only a subset of those genes will show a downstream consequence at the gene expression level. Hence, integration of gene dosage with gene expression analysis would be useful to discern the target gene(s) of a given copy number alteration.
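As an illustration of the two-colour comparison described above, the following minimal sketch converts tumor and normal fluorescence intensities into a log2 ratio and assigns a gain, loss, or neutral call. The thresholds, probe names, and intensities are hypothetical values chosen for illustration; they are not the parameters used in this thesis, and real analyses typically also normalize and segment the data before calling.

```python
import math

# Hypothetical log2 ratio thresholds; real cut-offs depend on the platform,
# tumor purity, and noise, and are not taken from this thesis.
GAIN_THRESHOLD = 0.3
LOSS_THRESHOLD = -0.3


def log2_ratio(tumor_intensity: float, normal_intensity: float) -> float:
    """Log2 ratio of tumor to normal fluorescence intensity for one array element."""
    return math.log2(tumor_intensity / normal_intensity)


def call_copy_number(ratio: float) -> str:
    """Classify a log2 ratio as 'gain', 'loss', or 'neutral'."""
    if ratio >= GAIN_THRESHOLD:
        return "gain"
    if ratio <= LOSS_THRESHOLD:
        return "loss"
    return "neutral"


# Made-up (tumor, normal) intensities for three array elements.
probes = {
    "probe_A": (2000.0, 1000.0),
    "probe_B": (1000.0, 1000.0),
    "probe_C": (500.0, 1000.0),
}

calls = {
    name: call_copy_number(log2_ratio(tumor, normal))
    for name, (tumor, normal) in probes.items()
}
```

A ratio near zero indicates equal tumor and normal signal, so only elements whose ratio clears a threshold are reported as altered.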

1.2.3 Loss of heterozygosity (LOH) and allelic imbalance

Loss of heterozygosity (LOH) is a common genetic event in cancer [53]. In the normal cell, each autosomal chromosome has two copies, with one copy (or allele) originating from each parent. In the tumor, a specific segment from one of the copies of the chromosome is lost, resulting in loss of heterozygosity.

Frequent regions of LOH have also been identified in the lung cancer genome [54-58]. Initial studies used microsatellite markers placed throughout the genome, and thus the resolution of these changes was limited; the application of SNP arrays subsequently refined these areas into specific chromosome arms [45, 46]. In addition to advances in SNP array technology, analysis approaches were also developed that increased the detection sensitivity for regions of LOH / allelic imbalance [59-63]. Although most areas with altered gene dosage will also be detected as LOH (in the case of copy number loss) or allelic imbalance (in the case of copy number gain), there are also areas in the genome which exhibit LOH but no change in copy number, termed copy neutral LOH or uniparental disomy (UPD). However, the role of UPD in lung cancer is not well understood.
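The distinction between copy number loss, copy number gain, and copy neutral LOH can be made concrete with a small sketch. It assumes that, for each genomic segment, the total copy number and the number of copies of the less abundant parental allele are already known; the `Segment` type, the `classify` function, and the example segments are hypothetical and serve only to illustrate the definitions above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A genomic segment with allele-specific copy number (hypothetical structure)."""
    label: str
    total_copies: int         # total copies in the tumor; 2 in a normal diploid genome
    minor_allele_copies: int  # copies of the less abundant parental allele


def classify(segment: Segment) -> str:
    """Label a segment using the definitions given in the text."""
    if segment.total_copies < 2:
        return "loss"
    if segment.total_copies > 2:
        return "gain"
    # Two copies in total: heterozygous unless one parental allele is absent.
    if segment.minor_allele_copies == 0:
        return "copy neutral LOH (UPD)"
    return "normal"


# Made-up example segments.
segments = [
    Segment("segment_1", total_copies=1, minor_allele_copies=0),
    Segment("segment_2", total_copies=3, minor_allele_copies=1),
    Segment("segment_3", total_copies=2, minor_allele_copies=0),
    Segment("segment_4", total_copies=2, minor_allele_copies=1),
]

labels = [classify(s) for s in segments]
```

The third segment is the case a copy-number-only analysis would miss: the total dosage is unchanged, yet both copies derive from the same parent.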

1.3 Somatic mutations in lung cancer

Somatic mutations have also been shown to be important in cancer development. In addition, mutational analysis is used for screening purposes in high risk populations (e.g. BRCA1/2 and hereditary breast cancer) as well as a criterion for receiving targeted chemotherapy (e.g. EGFR mutation and EGFR inhibitors). Many studies have been undertaken to identify mutations in genes involved with important cellular processes pertinent to the cancer phenotype, such as DNA repair and cellular proliferation, and have successfully identified key genes in a number of different cancer types. As a general rule, oncogenes typically harbour activating mutations, while tumor suppressor genes often harbour inactivating mutations.

In lung adenocarcinoma, the most well-known mutated genes are EGFR, KRAS, LKB1 (or STK11), TP53 and CDKN2A [30, 54, 64, 65], with mutations in some genes, such as EGFR and KRAS, showing preferential patterns based on smoking history. A recent study assessing other well-known oncogenes and tumor suppressor genes showed that a number of other genes are also mutated in lung adenocarcinoma [64]. However, due to technological and material limitations at the time, many of these studies assessed only small numbers of genes, and genome wide screening for somatic mutations was unfeasible. While high throughput sequencing technologies to assess sequence mutation on a genome scale have become available, challenges associated with cost and data analysis preclude their routine use.

1.4 Epigenetic alterations in lung cancer

1.4.1 DNA methylation

Another DNA level mechanism which can affect gene expression is the methylation of DNA at gene promoters. DNA methylation is a reversible chemical modification which has been shown to have a prominent role in the silencing of tumor suppressor genes. Specifically, this modification targets cytosines, whereby a methyl group (CH3) is added to the carbon 5 position of cytosine.

It is thought that in cancer, the majority of the genome loses its methylation but small areas in gene promoters, known as CpG islands, gain methylation [66-69]. Generally, it is thought that the acquired methylation targets tumor suppressor genes, while the areas of lost methylation facilitate the activation of repetitive areas of the genome, which can lead to increased genomic instability. In addition, aberrant DNA methylation of critical genes has been utilized for early detection purposes as well as a target for therapeutic intervention [70-72], emphasizing its key role in cancer.

In lung cancer, a number of specific genes such as CDKN2A (or p16), RASSF1A, and MGMT have been shown to harbour increased promoter methylation [73]. While many of these methylation events were discovered using single locus assays, recent advances have allowed for the high throughput analysis of thousands of genes in a single experiment [74-79]. As such, applications of these high throughput approaches in lung cancer are likely to identify novel methylated genes.

Similar to array CGH analysis, though many methylated genes are likely to be identified, it will be important to validate if these alterations affect downstream gene expression.

1.5 Current level of integrative analysis

At the time this thesis started, there were a small number of whole genome integrative studies, which primarily focused on the integration of gene dosage and gene expression. In fact, the majority of integrative analysis was done at the single locus level, such as the examination of gene dosage and expression of the HER2 (ERBB2) oncogene in breast cancer [80]. Moreover, there were a limited number of gene dosage or gene expression studies in lung cancer.

However, from recent studies involving multiple cancer types, including lung cancer, it has been shown that anywhere between 20% and 60% of genes in regions of copy number change also exhibit a concerted change in gene expression [52, 81-84]. Conversely, when the proportion of differential expression associated with gene dosage alteration was examined, it was found that only 11% of the observed differential expression could be attributed to high level DNA copy number change [83]. Thus, it is clear that gene dosage alterations are responsible for only a part of the overall dysregulated gene expression and that other mechanisms are likely involved.
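The two proportions discussed above answer different questions, and a small sketch may help keep them apart. Given per-gene copy number and expression calls for a tumor, the first asks what fraction of copy-number-altered genes show a concordant expression change, while the second asks what fraction of differentially expressed genes can be attributed to a copy number change. The gene names, calls, and function names below are made up for illustration and do not reproduce any dataset in this thesis.

```python
# Hypothetical per-gene calls for one tumor (illustrative only).
copy_calls = {
    "G1": "gain", "G2": "gain", "G3": "loss", "G4": "neutral",
    "G5": "neutral", "G6": "gain", "G7": "neutral",
}
expr_calls = {
    "G1": "over", "G2": "unchanged", "G3": "under", "G4": "over",
    "G5": "under", "G6": "unchanged", "G7": "over",
}

# A change is concordant when dosage and expression move in the same direction.
CONCORDANT_PAIRS = {("gain", "over"), ("loss", "under")}


def concordant_genes(copy: dict, expr: dict) -> list:
    """Genes whose expression change matches the direction of their dosage change."""
    return [g for g in copy if (copy[g], expr[g]) in CONCORDANT_PAIRS]


def fraction_of_altered_with_expression_change(copy: dict, expr: dict) -> float:
    """Of the genes with altered dosage, the fraction with a concordant expression change."""
    altered = [g for g, call in copy.items() if call != "neutral"]
    return len(concordant_genes(copy, expr)) / len(altered)


def fraction_of_deregulated_explained(copy: dict, expr: dict) -> float:
    """Of the differentially expressed genes, the fraction explained by dosage."""
    deregulated = [g for g, call in expr.items() if call != "unchanged"]
    return len(concordant_genes(copy, expr)) / len(deregulated)


altered_fraction = fraction_of_altered_with_expression_change(copy_calls, expr_calls)
explained_fraction = fraction_of_deregulated_explained(copy_calls, expr_calls)
```

In this toy example, half of the altered genes change expression concordantly, but only two of the five deregulated genes are explained by dosage, which mirrors the asymmetry noted in the text.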

1.6 Need for an integrative approach to study lung cancer

As discussed earlier, a gene such as CDKN2A has been shown to be inactivated by both gene dosage loss and increased promoter methylation. Thus, when examining a large number of tumors, it is very likely that a given gene may be affected by one mechanism in one tumor (e.g. gene dosage increase) and another mechanism in a different tumor (e.g. DNA hypomethylation), with both leading to the same net effect (Figure 1.1). In addition, if each specific event (e.g. gene dosage increase) occurs at a low frequency but, cumulatively, the deregulation occurs at a high frequency, then examining only gene dosage or only DNA methylation would preclude the identification of such potentially important genes. Hence, an integrative, multi-dimensional genetic and epigenetic approach is needed to identify novel genes which would have escaped previous, single dimensional analyses.
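The argument about low-frequency events accumulating across mechanisms can be sketched with a toy calculation. For a single hypothetical gene, the sets below record which of ten tumors carry a gene dosage increase and which carry promoter hypomethylation; the sample identifiers and counts are invented for illustration only.

```python
# Ten hypothetical tumor samples.
samples = [f"S{i}" for i in range(1, 11)]

# Samples in which one hypothetical gene is affected by each mechanism.
dosage_gain = {"S1", "S5"}
hypomethylation = {"S2", "S5", "S7"}


def frequency(affected: set, cohort: list) -> float:
    """Fraction of the cohort in which the gene is affected."""
    return len(affected) / len(cohort)


gain_frequency = frequency(dosage_gain, samples)
hypomethylation_frequency = frequency(hypomethylation, samples)

# A sample counts as disrupted if either mechanism is present, so the combined
# set is the union; a sample affected by both (S5) is counted once.
combined_frequency = frequency(dosage_gain | hypomethylation, samples)
```

Neither mechanism alone exceeds 30% in this example, yet the gene is disrupted in 40% of samples when both dimensions are considered, which is the kind of gene a single dimensional analysis with a frequency cut-off could overlook.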

1.7 Bioinformatic tools for genomic analysis

While many software packages exist for the analysis of high throughput gene expression data [85-88], at the start of my thesis project, software packages for the visualization and analysis of DNA copy number data were very limited [89-95]. A summary of array CGH analysis methodologies and software packages is provided in a review [96]. Three of the key challenges at the time were (i) the increase in data generated from a single experiment, (ii) the effective visualization of this data for easy interpretation, and (iii) the microarray platform dependence of the majority of software packages.

With respect to the increase in data generation, the first generation of microarrays used for array CGH typically comprised two to three thousand data points. As such, software for both visualization and analysis was developed to effectively handle this level of data complexity.

For example, since array CGH data in fact represent discrete levels of copy number throughout the genome, one of the data analysis steps required is segmentation, which effectively smooths the data based on genomic position. The first version of DNAcopy [97], one of the first algorithms to segment array CGH data, needed a significant amount of time to execute when applied to arrays with 100,000 data points or more, and a new version of the program was eventually developed a few years later [98]. Similarly, in terms of visualization, most programs displayed array CGH data in an ordinal manner whereby the relative genomic position was on the x-axis and the log ratio of the data point was drawn on the y-axis. While this type of visualization can provide a quick genome summary of a single sample, it is difficult to readily link the data to information such as protein coding genes.
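To make the idea of segmentation concrete, the following is a minimal sketch of a greedy recursive splitting scheme on toy log ratios. It is not the circular binary segmentation used by DNAcopy and not code from this thesis; the function names, the `min_gain` threshold, and the data are all assumptions chosen for illustration.

```python
from statistics import mean


def best_split(values):
    """Return (index, gain) for the split that most reduces squared error."""
    total_mean = mean(values)
    total_sse = sum((v - total_mean) ** 2 for v in values)
    best_index, best_gain = None, 0.0
    for i in range(1, len(values)):  # split into values[:i] and values[i:]
        left, right = values[:i], values[i:]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        gain = total_sse - sse
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


def segment(values, min_gain=0.5, offset=0):
    """Recursively segment a non-empty list of log ratios ordered by position.

    Returns a list of (start, end, segment_mean) tuples, with end exclusive.
    min_gain is an arbitrary illustrative threshold, not a tuned value.
    """
    if len(values) >= 2:
        index, gain = best_split(values)
        if index is not None and gain >= min_gain:
            return (segment(values[:index], min_gain, offset)
                    + segment(values[index:], min_gain, offset + index))
    return [(offset, offset + len(values), mean(values))]


# Toy log ratios (hypothetical): a gained region flanked by unchanged regions.
log_ratios = [0.0, 0.1, -0.1, 0.0, 0.8, 0.9, 0.7, 0.8, 0.0, -0.1]
segments = segment(log_ratios)
for start, end, level in segments:
    print(start, end, round(level, 2))
```

Each probe is then represented by the mean of its segment, which is the smoothing by genomic position described above. The exhaustive search in `best_split` is quadratic in the number of data points, which hints at why execution time became a concern as arrays grew beyond 100,000 data points.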

Finally, software developed by microarray manufacturers such as Affymetrix, Agilent or Nimblegen was specifically tailored to handle data from their respective microarray platforms. Thus, data emanating from different microarray platforms, but derived from samples with common characteristics, could not be analyzed in a concerted manner, resulting in under-utilization of the increasingly available array CGH data in the public domain. Most importantly, no tools existed to integrate multiple dimensions of data such as global gene dosage and gene expression, let alone integration with DNA methylation. Hence, given these challenges, the development of such bioinformatic tools was needed.

1.8 Thesis theme

The theme of this thesis is the development and utilization of an integrative genetic and epigenetic approach to identify novel aberrant genes and pathways that may be involved in the tumorigenesis of lung adenocarcinoma. This will be achieved by employing genome wide genetic and epigenetic profiling experiments of lung adenocarcinoma samples and the subsequent integration of this data using novel bioinformatics tools and approaches.

1.9 Objectives and hypothesis

The objective of this work is to demonstrate the importance of employing an integrative approach to understand genetic and epigenetic alterations and their consequences for gene expression. The hypothesis can be broken down into three parts:

(A) Genes/pathways which are important to tumorigenesis are disrupted by multiple mechanisms in lung cancer.

(B) By using an integrative approach, looking at the global genetic and epigenetic regulation of gene expression, changes at the DNA level which have downstream effects at the gene expression level will be identified.

(C) This approach will lead to the identification of more genes that are disrupted than previously anticipated and these genes will be enriched in key pathways and functions important to lung tumorigenesis.

1.10 Specific aims and outline of thesis

This thesis consists of four manuscripts assembled in a non-chronological order to best address the objectives and hypothesis of this thesis.

Aim 1: Development of a platform for multi 'omics data integration and analysis

Chapter 2 discusses the development of an integrative analysis software package, the system for integrative genomic multi-dimensional analysis of cancer genomes, epigenomes, and transcriptomes (SIGMA2). The development of this application was necessary before undertaking the analysis of the vast amount of data generated by the high throughput, genome-wide technologies used.

As discussed earlier in this chapter, there were very few bioinformatic tools available for the analysis of array CGH data, let alone for integrative analysis of gene dosage and gene expression. Prior to the development of SIGMA2, I developed the precursor version of this software, SIGMA [95]. SIGMA provided the basic framework in terms of the user interfaces, database communication, data structures and "look and feel" that would be utilized in SIGMA2.

Moreover, one of the key challenges when SIGMA was developed was the effective visualization and analysis of large datasets generated by newer, high density array CGH platforms. At the time, the majority of data were generated on platforms comprising 3000 measurements per sample, but newer technologies were being developed which generated over 500,000 data points per sample, representing a more than 100-fold increase in information obtained from each experiment [99]. Hence, the base software architecture used in SIGMA2 was already capable of handling large amounts of data.

Aim 2: Demonstration of an integrative approach using model systems

Chapter 3 discusses the demonstration of an integrative, multi-dimensional approach on tumor cell line model systems. Using a set of breast cancer cell lines, I examine the gene dosage, allelic composition, DNA methylation, and gene expression profiles in an integrative manner to delineate which genes and pathways would be missed or less significant if such an approach were not used. This demonstrative study was needed to show the key advantages and benefits of an integrative approach. While cell lines are artificial systems and may have acquired alterations that are beneficial to growth in vitro, it was important to use a sample source for which material limitations did not exist. For each of the genetic or epigenetic profiling studies, sufficient amounts of DNA and RNA are needed, and the more assays that are performed on a given sample, the more material is required. Moreover, when whole tumor samples are microdissected to ensure high tumor cell purity, this inherently reduces the amount of usable sample material.

As such, it is important that the quantitative and qualitative benefits of utilizing an integrative approach are sufficient to warrant using clinical samples. At the time this study was initiated, SNP array and array CGH profiles were available for breast cancer cell lines and thus, only DNA methylation and gene expression profiles needed to be generated to complete this set. Although data from lung cancer cell lines would have been optimal, the purpose of this study was to demonstrate the effectiveness of the integrative approach, and the source of the data therefore has limited relevance to this aim.

Aim 3: Characterization of DNA level alterations in lung adenocarcinoma

A number of studies have been done to identify gene dosage alterations in lung cancer and in lung adenocarcinoma specifically. These studies were done on a number of different array platforms, with one of the latest studies done using Affymetrix SNP arrays. One of the benefits of Affymetrix SNP arrays is the ability to simultaneously detect changes in gene dosage as well as allelic imbalance. Allelic imbalance, though ideally determined using a patient-matched non-malignant sample as a control, has also been determined using a pool of unmatched non-malignant samples. While the ability to detect imbalance using unmatched control samples is important when matched control samples are not available, this approach may falsely score regions as imbalanced when in fact they are not, and vice versa. In addition, samples in these different studies were typically not microdissected and thus, tumor cell purity in the samples would be variable. Thus, genetic alterations would be difficult to detect in those samples with low tumor cell content. Moreover, there has been a recent drastic increase in resolution with the newest SNP arrays, with the ability to measure over 4 times as many SNPs and over 8 times as many spots for gene dosage. Chapter 4 discusses the application of a new SNP array technology to microdissected lung adenocarcinoma specimens with the goal of identifying genetic alterations at the highest resolution currently available.
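The value of a patient-matched control can be illustrated with a deliberately simplified, genotype-based sketch. Real SNP array analysis works on probe intensities and is considerably more involved; the function name `call_loh`, the genotype encoding, and the toy genotypes below are assumptions made for illustration and do not represent the method used in Chapter 4.

```python
def call_loh(normal_genotypes, tumor_genotypes):
    """Score each SNP using the matched non-malignant genotype as reference.

    Genotypes are encoded as "AA", "AB" or "BB". Only SNPs that are
    heterozygous in the matched normal sample are informative.
    """
    calls = []
    for normal, tumor in zip(normal_genotypes, tumor_genotypes):
        if normal != "AB":
            # Germline homozygous: tumor homozygosity here is not an alteration.
            calls.append("uninformative")
        elif tumor in ("AA", "BB"):
            calls.append("LOH")
        else:
            calls.append("retained")
    return calls


# Toy genotypes (hypothetical) for five SNPs from one patient.
normal = ["AB", "AA", "AB", "BB", "AB"]
tumor = ["AA", "AA", "AB", "BB", "BB"]
loh_calls = call_loh(normal, tumor)
print(loh_calls)
# ['LOH', 'uninformative', 'retained', 'uninformative', 'LOH']
```

Without the matched normal genotypes, the second and fourth SNPs would be indistinguishable from true loss events by tumor genotype alone, which is one way regions can be falsely scored when unmatched controls are used.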

Aim 4: Application of an integrative approach to clinical lung adenocarcinoma specimens

With the approach and necessary tools developed and now demonstrated to be beneficial using a model dataset, Chapter 5 discusses the application of the integrative approach to lung adenocarcinoma specimens. While the published chapter provides an overview of cancer genome and epigenome landscapes, sections 5.4.3 to 5.4.4 present some of the quantitative and qualitative benefits of integrative analysis specific to the analysis of a lung adenocarcinoma dataset. In addition, sections 5.4.5 to 5.4.6 discuss key findings in terms of genes and pathways that were identified from this integrative analysis.

1.11 Description of high throughput data in this thesis

Throughout this thesis, a number of platform technologies were utilized to generate high throughput, genome wide data. Below is a summary of all platforms used in each chapter.

In Chapter 3, for nine breast cancer cell lines and one control cell line (MCF10A), the following profiles were generated: Affymetrix SNP 500K for the analysis of allelic status; whole genome tiling path array CGH for the analysis of gene dosage; Illumina Infinium HumanMethylation27 for DNA methylation analysis; and Affymetrix U133 Plus 2.0 for the analysis of gene expression.

In Chapter 4, for the 46 tumors and matched non-malignant tissue as well as the cancer cell lines, the Affymetrix SNP 6.0 platform was utilized to measure total copy number and allelic imbalance. For a subset of tumors, gene expression profiles were generated using a custom Affymetrix platform designed by Rosetta Inpharmatics.

In Chapter 5, for the ten tumors and matched non-malignant tissue samples, the following profiles were generated: Affymetrix SNP 6.0 for the analysis of allelic status and gene dosage; Illumina Infinium HumanMethylation27 for DNA methylation analysis; and Affymetrix HuEx 1.0 ST array for the analysis of gene expression. Quantitative RT-PCR was performed using the Applied Biosystems TaqMan gene expression assay.

1.12 Other relevant contributions not included as chapters in this thesis

In this thesis, I have chosen to include a small portion of my overall work in order to achieve a coherent theme. However, in this section, I have outlined specific contributions, which I either led or participated in as second author, that I deem relevant to the theme of lung cancer and genomics.

1.12.1 Development of tools for genomic analysis

As mentioned earlier, the precursor version of SIGMA2 was SIGMA [95]. This tool was built as an interactive database of cancer cell line array CGH profiles and provided a means for effective visualization of high density array CGH data as well as sharing of data. Another problem that arose for high density array CGH data was the limited availability of efficient analysis algorithms to delineate gains and losses. Most algorithms that were developed for array CGH analysis were developed for arrays with 2000 to 3000 data points, and their execution times did not scale up efficiently when the arrays were generating 100,000 to 1,000,000 data points. To address this problem, I contributed to the development of a segmentation and calling algorithm named FACADE [100].

1.12.2 Baseline gene expression in non-malignant lung tissue

Though gene expression studies of malignant samples are important, it is also critical to define which genes are expressed in non-malignant samples, as these are used as the reference to determine aberrant gene expression. There were two studies I was involved with which addressed this question. First, we examined gene expression of non-malignant, smoke damaged bronchial epithelium using serial analysis of gene expression (SAGE) [37]. We found that there were specific genes that showed high tissue specificity to the bronchial epithelium with limited representation in other tissues, and that there were differences between bronchial epithelial samples and lung parenchyma, the tissue adjacent to tumors that is typically used as a non-malignant control and that comprises a mixture of different cell types.

In the study described above, bronchial epithelium samples from current and former smokers were grouped together. Hence, the next logical question was to assess the effect of active smoking on the bronchial epithelium. In the second study, a group of never smoker samples was added to the groups of current and former smokers and the gene expression profiles of the three groups were compared. We first identified a set of genes which were differentially expressed in response to smoke exposure and found a subset of genes whose expression changes were reversible upon smoking cessation and another subset whose changes were irreversible upon smoking cessation [35]. Those genes which were irreversibly altered after heavy smoke exposure may have implications in affecting future risk of developing lung cancer. Moreover, these findings also suggest that when trying to identify cancer-specific changes when matched control samples are not available, clinical characteristics such as smoking status should be taken into consideration when comparisons are made.

1.12.3 Differential gene expression analysis in lung cancer

With the non-malignant, baseline gene expression defined, differential gene expression in early stages of lung cancer and locally invasive squamous cell carcinoma was then assessed [101]. It was found that genes associated with epidermal development were increased in expression and genes associated with mucociliary function were decreased in expression in carcinoma-in-situ as well as in precancerous stages. Finally, genes associated with tissue re-modelling were altered in expression in locally invasive cancer and also showed altered expression in carcinoma-in-situ, suggesting this function is affected early in cancer development.

The Wnt pathway has been shown to be aberrant in many cancer types. At the time the study began, there were two branches of the pathway that were known to exist, canonical and non-canonical, whose activation resulted in different downstream consequences. While the canonical branch was the primary focus of most researchers studying this pathway, we sought to assess the role of the non-canonical branch in lung squamous cell carcinoma using semi-quantitative and quantitative PCR of genes which were a part of the non-canonical branch [102]. From this study, it was found that (i) these non-canonical genes were expressed in the normal lung and (ii) some of these non-canonical genes were differentially expressed in tumors.

An important consideration in the analysis of differential gene expression in cancer is the use of suitable reference genes for data normalization. This consideration is critical to both quantitative PCR experiments and microarray experiments, where relative quantifications are typically used. To address this, using SAGE, where quantification of expression is absolute, genes whose expression was constant across both malignant and non-malignant samples were identified [103]. Those genes demonstrated better constancy than some genes which are typically used as controls for gene expression analysis.
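One simple way to quantify constancy of expression across samples is the coefficient of variation (standard deviation divided by mean) of absolute counts. The sketch below is illustrative only: the text above does not state which metric was used in [103], so the choice of the coefficient of variation is an assumption, as are the function names, gene labels, and tag counts.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean (mean must be > 0)."""
    return pstdev(counts) / mean(counts)


def rank_by_constancy(expression):
    """Return gene names ordered from most to least constant expression."""
    return sorted(
        expression,
        key=lambda gene: coefficient_of_variation(expression[gene]),
    )


# Toy absolute tag counts (hypothetical) across four samples,
# mixing malignant and non-malignant tissue.
expression = {
    "gene_A": [100, 102, 98, 101],
    "gene_B": [50, 150, 20, 90],
    "gene_C": [10, 12, 9, 11],
}
ranked = rank_by_constancy(expression)
print(ranked)  # ['gene_A', 'gene_C', 'gene_B']
```

Under this metric, a candidate reference gene would be one near the top of the ranking; a commonly used control gene that ranks poorly, like `gene_B` here, would be a weaker choice for normalization.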

1.12.4 Integration of gene dosage and gene expression in lung cancer

The first level of genomic integration that needed to be accomplished was the integration of gene dosage and gene expression. In one study using cancer cell lines, hot spots of DNA amplification were identified throughout the genome. When specifically examining lung cancer cell lines and subsequently coupling this with gene expression data, it was found that 50% of genes in these frequently amplified regions showed a correlation between gene dosage and gene expression [52]. Moreover, it was also observed that different components of the EGFR signaling pathway were amplified in different cell lines, illustrating that for a given pathway, one can underestimate the frequency of pathway disruption when only well-known genes in the pathway are assessed.
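The kind of per-gene comparison described above can be sketched as follows. This is a minimal illustration on invented values, not the analysis performed in [52]: the use of Pearson correlation, the 0.5 threshold, and the names `pearson` and `concordant_fraction` are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


def concordant_fraction(dosage, expression, threshold=0.5):
    """Fraction of genes whose dosage and expression correlate above threshold."""
    concordant = [
        gene for gene in dosage
        if pearson(dosage[gene], expression[gene]) >= threshold
    ]
    return len(concordant) / len(dosage)


# Toy copy number and expression values (hypothetical) across five cell lines.
dosage = {
    "gene_1": [2, 2, 4, 6, 8],
    "gene_2": [2, 3, 5, 6, 2],
}
expression = {
    "gene_1": [1.0, 1.1, 2.0, 3.1, 4.2],
    "gene_2": [5.0, 4.9, 5.1, 5.0, 5.0],
}
fraction = concordant_fraction(dosage, expression)
print(fraction)  # 0.5
```

In this toy case only `gene_1` tracks its copy number, so half of the genes in the hypothetical amplified region would be scored as concordant.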

In a second study involving clinical lung tumors, a genomic region which was preferentially amplified in squamous cell carcinomas as compared to adenocarcinomas was identified. Further integration with gene expression data allowed for the identification of the target gene, BRF2, in this amplified region [104]. Moreover, gene dosage and protein expression level assessment of carcinoma-in-situ (CIS) samples for BRF2 showed that amplification and overexpression were present, suggesting that this event is occurring at an early stage of tumorigenesis.

[Figure 1.1 image. Panels (a) to (c) compare normal and tumor chromosomes; panels (d) and (e) compare normal and tumor DNA sequence. Panel labels: Copy Number Loss, Allelic Imbalance, Uniparental disomy (UPD), Loss of heterozygosity (LOH), DNA Hypermethylation, Somatic mutation (premature stop, truncated transcript).]

Figure 1.1. Multiple mechanisms of alteration leading to the same downstream consequences. (a) Illustration of copy number loss. Loss of a particular chromosomal region in tumors. (b) Illustration of allelic imbalance. While both alleles are present, there is a preferential increase of one of the alleles. (c) Illustration of uniparental disomy. While overall the total number of DNA copies is normal, one part of an allele is lost and replaced by a part from the other allele. (d) Promoter hypermethylation in tumors which results in suppression of gene transcription. (e) Somatic mutation in the tumor which can lead to the transcription of a truncated (possibly non-functional) transcript. Mechanisms shown in (a), (c), (d), and (e) can lead to the same net downstream effect: loss of gene and protein expression. For (a), (b), and (c), though whole chromosomes are shown, these events can vary in scale from a focal region of change to a whole chromosome arm. The green arrow represents the transcription start site.

1.13 References

1. Jemal A, Siegel R, Ward E, Hao Y, Xu J, Thun MJ: Cancer statistics, 2009. CA Cancer J Clin 2009, 59(4):225-249.

2. Sun S, Schiller JH, Gazdar AF: Lung cancer in never smokers--a different disease. Nat Rev Cancer 2007, 7(10):778-790.

3. Khuder SA: Effect of cigarette smoking on major histological types of lung cancer: a meta-analysis. Lung Cancer 2001, 31(2-3):139-148.

4. Detterbeck FC, Boffa DJ, Tanoue LT: The new lung cancer staging system. Chest 2009, 136(1):260-271.

5. Herbst RS, Lynch TJ, Sandler AB: Beyond doublet chemotherapy for advanced non-small-cell lung cancer: combination of targeted agents with first-line chemotherapy. Clin Lung Cancer 2009, 10(1):20-27.

6. Kim KS, Jeong JY, Kim YC, Na KJ, Kim YH, Ahn SJ, Baek SM, Park CS, Park CM, Kim YI et al: Predictors of the response to gefitinib in refractory non-small cell lung cancer. Clin Cancer Res 2005, 11(6):2244-2251.

7. Kim TE, Murren JR: Erlotinib OSI/Roche/Genentech. Curr Opin Investig Drugs 2002, 3(9):1385-1395.

8. Miller VA, Kris MG, Shah N, Patel J, Azzoli C, Gomez J, Krug LM, Pao W, Rizvi N, Pizzo B et al: Bronchioloalveolar pathologic subtype and smoking history predict sensitivity to gefitinib in advanced non-small-cell lung cancer. J Clin Oncol 2004, 22(6):1103-1109.

9. Mitsudomi T, Kosaka T, Endoh H, Horio Y, Hida T, Mori S, Hatooka S, Shinoda M, Takahashi T, Yatabe Y: Mutations of the epidermal growth factor receptor gene predict prolonged survival after gefitinib treatment in patients with non-small-cell lung cancer with postoperative recurrence. J Clin Oncol 2005, 23(11):2513-2520.

10. Pao W, Miller V, Zakowski M, Doherty J, Politi K, Sarkaria I, Singh B, Heelan R, Rusch V, Fulton L et al: EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci U S A 2004, 101(36):13306-13311.

11. Sirotnak FM, Zakowski MF, Miller VA, Scher HI, Kris MG: Efficacy of cytotoxic agents against human tumor xenografts is markedly enhanced by coadministration of ZD1839 (Iressa), an inhibitor of EGFR tyrosine kinase. Clin Cancer Res 2000, 6(12):4885-4892.

12. Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N, Boggon TJ et al: EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 2004, 304(5676):1497-1500.

13. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467-470.

14. Schena M, Shalon D, Heller R, Chai A, Brown PO, Davis RW: Parallel analysis: microarray-based expression monitoring of 1000 genes. Proc Natl Acad Sci U S A 1996, 93(20):10614-10619.

15. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531-537.

16. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA et al: Molecular portraits of human breast tumours. Nature 2000, 406(6797):747-752.

17. Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS et al: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A 2001, 98(19):10869-10874.

18. Fukumoto S, Yamauchi N, Moriguchi H, Hippo Y, Watanabe A, Shibahara J, Taniguchi H, Ishikawa S, Ito H, Yamamoto S et al: Overexpression of the aldo-keto reductase family protein AKR1B10 is highly correlated with smokers' non-small cell lung carcinomas. Clin Cancer Res 2005, 11(5):1776-1785.

19. Heighway J, Knapp T, Boyce L, Brennand S, Field JK, Betticher DC, Ratschiller D, Gugger M, Donovan M, Lasek A et al: Expression profiling of primary non-small cell lung cancer for target identification. Oncogene 2002, 21(50):7749-7763.

20. Hu J, Bianchi F, Ferguson M, Cesario A, Margaritora S, Granone P, Goldstraw P, Tetlow M, Ratcliffe C, Nicholson AG et al: Gene expression signature for angiogenic and nonangiogenic non-small-cell lung cancer. Oncogene 2005, 24(7):1212-1219.

21. Larsen JE, Pavey SJ, Passmore LH, Bowman R, Clarke BE, Hayward NK, Fong KM: Expression profiling defines a recurrence signature in lung squamous cell carcinoma. Carcinogenesis 2007, 28(3):760-766.

22. Larsen JE, Pavey SJ, Passmore LH, Bowman RV, Hayward NK, Fong KM: Gene expression signature predicts recurrence in lung adenocarcinoma. Clin Cancer Res 2007, 13(10):2946-2954.

23. Lau SK, Boutros PC, Pintilie M, Blackhall FH, Zhu CQ, Strumpf D, Johnston MR, Darling G, Keshavjee S, Waddell TK et al: Three-gene prognostic classifier for early-stage non small-cell lung cancer. J Clin Oncol 2007, 25(35):5562-5569.

24. Oshita F, Ikehara M, Sekiyama A, Hamanaka N, Saito H, Yamada K, Noda K, Kameda Y, Miyagi Y: Genomic-wide cDNA microarray screening to correlate gene expression profile with chemoresistance in patients with advanced lung cancer. J Exp Ther Oncol 2004, 4(2):155-160.

25. Potti A, Mukherjee S, Petersen R, Dressman HK, Bild A, Koontz J, Kratzke R, Watson MA, Kelley M, Ginsburg GS et al: A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N Engl J Med 2006, 355(6):570-580.

26. Raponi M, Zhang Y, Yu J, Chen G, Lee G, Taylor JM, Macdonald J, Thomas D, Moskaluk C, Wang Y et al: Gene expression signatures for predicting prognosis of squamous cell and adenocarcinomas of the lung. Cancer Res 2006, 66(15):7466-7472.

27. Remmelink M, Mijatovic T, Gustin A, Mathieu A, Rombaut K, Kiss R, Salmon I, Decaestecker C: Identification by means of cDNA microarray analyses of gene expression modifications in squamous non-small cell lung cancers as compared to normal bronchial epithelial tissue. Int J Oncol 2005, 26(1):247-258.

28. Singhal S, Amin KM, Kruklitis R, DeLong P, Friscia ME, Litzky LA, Putt ME, Kaiser LR, Albelda SM: Alterations in cell cycle genes in early stage lung adenocarcinoma identified by expression profiling. Cancer Biol Ther 2003, 2(3):291-298.

29. Spira A, Beane JE, Shah V, Steiling K, Liu G, Schembri F, Gilman S, Dumas YM, Calner P, Sebastiani P et al: Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer. Nat Med 2007, 13(3):361-366.

30. Sun Z, Wigle DA, Yang P: Non-overlapping and non-cell-type-specific gene expression signatures predict lung cancer survival. J Clin Oncol 2008, 26(6):877-883.

31. Wang T, Hopkins D, Schmidt C, Silva S, Houghton R, Takita H, Repasky E, Reed SG: Identification of genes differentially over-expressed in lung squamous cell carcinoma using combination of cDNA subtraction and microarray analysis. Oncogene 2000, 19(12):1519-1528.

32. Wikman H, Seppanen JK, Sarhadi VK, Kettunen E, Salmenkivi K, Kuosma E, Vainio-Siukola K, Nagy B, Karjalainen A, Sioris T et al: Caveolins as tumour markers in lung cancer detected by combined use of cDNA and tissue microarrays. J Pathol 2004, 203(1):584-593.

33. Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, van de Rijn M, Rosen GD, Perou CM, Whyte RI et al: Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci U S A 2001, 98(24):13784-13789.

34. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE et al: Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 2008, 14(8):822-827.

35. Chari R, Lonergan KM, Ng RT, MacAulay C, Lam WL, Lam S: Effect of active smoking on the human bronchial epithelium transcriptome. BMC Genomics 2007, 8:297.

36. Spira A, Beane J, Shah V, Liu G, Schembri F, Yang X, Palma J, Brody JS: Effects of cigarette smoke on the human airway epithelial cell transcriptome. Proc Natl Acad Sci U S A 2004, 101(27):10143-10148.

37. Lonergan KM, Chari R, Deleeuw RJ, Shadeo A, Chi B, Tsao MS, Jones S, Marra M, Ling V, Ng R et al: Identification of novel lung genes in bronchial epithelium by serial analysis of gene expression. Am J Respir Cell Mol Biol 2006, 35(6):651-661.

38. Beane J, Sebastiani P, Liu G, Brody JS, Lenburg ME, Spira A: Reversible and permanent effects of tobacco smoke exposure on airway epithelial gene expression. Genome Biol 2007, 8(9):R201.

39. Balsara BR, Testa JR: Chromosomal imbalances in human lung cancer. Oncogene 2002, 21(45):6877-6883.

40. Sato M, Shames DS, Gazdar AF, Minna JD: A translational view of the molecular pathogenesis of lung cancer. J Thorac Oncol 2007, 2(4):327-343.

41. Thomas RK, Weir B, Meyerson M: Genomic approaches to lung cancer. Clin Cancer Res 2006, 12(14 Pt 2):4384s-4391s.

42. Albertson DG, Collins C, McCormick F, Gray JW: Chromosome aberrations in solid tumors. Nat Genet 2003, 34(4):369-376.

43. Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet 2006, 14(2):139-148.

44. Garnis C, Lockwood WW, Vucic E, Ge Y, Girard L, Minna JD, Gazdar AF, Lam S, MacAulay C, Lam WL: High resolution analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int J Cancer 2006, 118(6):1556-1564.

45. Janne PA, Li C, Zhao X, Girard L, Chen TH, Minna J, Christiani DC, Johnson BE, Meyerson M: High-resolution single-nucleotide polymorphism array and clustering analysis of loss of heterozygosity in human lung cancer cell lines. Oncogene 2004, 23(15):2716-2726.

46. Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C et al: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res 2004, 64(9):3060-3071.

47. Zhao X, Weir BA, LaFramboise T, Lin M, Beroukhim R, Garraway L, Beheshti J, Lee JC, Naoki K, Richards WG et al: Homozygous deletions and chromosome amplifications in human lung carcinomas revealed by single nucleotide polymorphism array analysis. Cancer Res 2005, 65(13):5561-5570.

48. Weir BA, Woo MS, Getz G, Perner S, Ding L, Beroukhim R, Lin WM, Province MA, Kraja A, Johnson LA et al: Characterizing the cancer genome in lung adenocarcinoma. Nature 2007, 450(7171):893-898.

49. Chitale D, Gong Y, Taylor BS, Broderick S, Brennan C, Somwar R, Golas B, Wang L, Motoi N, Szoke J et al: An integrated genomic analysis of lung cancer reveals loss of DUSP4 in EGFR-mutant tumors. Oncogene 2009, 28(31):2773-2783.

50. Kendall J, Liu Q, Bakleh A, Krasnitz A, Nguyen KC, Lakshmi B, Gerald WL, Powers S, Mu D: Oncogenic cooperation and coamplification of developmental transcription factor genes in lung cancer. Proc Natl Acad Sci U S A 2007, 104(42):16663-16668.

51. Tonon G, Wong KK, Maulik G, Brennan C, Feng B, Zhang Y, Khatry DB, Protopopov A, You MJ, Aguirre AJ et al: High-resolution genomic profiles of human lung cancer. Proc Natl Acad Sci U S A 2005, 102(27):9625-9630.

52. Lockwood WW, Chari R, Coe BP, Girard L, Macaulay C, Lam S, Gazdar AF, Minna JD, Lam WL: DNA amplification is a ubiquitous mechanism of oncogene activation in lung and other cancers. Oncogene 2008, 27(33):4615-4624.

53. Cavenee WK: Loss of heterozygosity in stages of malignancy. Clin Chem 1989, 35(7 Suppl):B48-52.

54. Bepler G, Garcia-Blanco MA: Three tumor-suppressor regions on chromosome 11p identified by high-resolution deletion mapping in human non-small-cell lung cancer. Proc Natl Acad Sci U S A 1994, 91(12):5513-5517.

55. Fong KM, Zimmerman PV, Smith PJ: Microsatellite instability and other molecular abnormalities in non-small cell lung cancer. Cancer Res 1995, 55(1):28-30.

56. Merlo A, Gabrielson E, Askin F, Sidransky D: Frequent loss of chromosome 9 in human primary non-small cell lung cancer. Cancer Res 1994, 54(3):640-642.

57. Merlo A, Gabrielson E, Mabry M, Vollmer R, Baylin SB, Sidransky D: Homozygous deletion on chromosome 9p and loss of heterozygosity on 9q, 6p, and 6q in primary human small cell lung cancer. Cancer Res 1994, 54(9):2322-2326.

58. Sundaresan V, Heppell-Parton A, Coleman N, Miozzo M, Sozzi G, Ball R, Cary N, Hasleton P, Fowler W, Rabbitts P: Somatic genetic changes in lung cancer and precancerous lesions. Ann Oncol 1995, 6 Suppl 1:27-31; discussion 31-22.

59. Huang J, Wei W, Chen J, Zhang J, Liu G, Di X, Mei R, Ishikawa S, Aburatani H, Jones KW et al: CARAT: a novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics 2006, 7:83.

60. Yamamoto G, Nannya Y, Kato M, Sanada M, Levine RL, Kawamata N, Hangaishi A, Kurokawa M, Chiba S, Gilliland DG et al: Highly sensitive method for genomewide detection of allelic composition in nonpaired, primary tumor specimens by use of affymetrix single-nucleotide-polymorphism genotyping microarrays. Am J Hum Genet 2007, 81(1):114-126.

61. LaFramboise T, Weir BA, Zhao X, Beroukhim R, Li C, Harrington D, Sellers WR, Meyerson M: Allele-specific amplification in cancer revealed by SNP array analysis. PLoS Comput Biol 2005, 1(6):e65.

62. Li C, Beroukhim R, Weir BA, Winckler W, Garraway LA, Sellers WR, Meyerson M: Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008, 9:204.

63. Lin M, Wei LJ, Sellers WR, Lieberfarb M, Wong WH, Li C: dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics 2004, 20(8):1233-1240.

64. Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB et al: Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008, 455(7216):1069-1075.

65. Suda K, Tomizawa K, Mitsudomi T: Biological and clinical significance of KRAS mutations in lung cancer: an oncogenic driver that contrasts with EGFR mutation. Cancer Metastasis Rev 2010.

66. Feinberg AP: Phenotypic plasticity and the epigenetics of human disease. Nature 2007, 447(7143):433-440.

67. Feinberg AP, Gehrke CW, Kuo KC, Ehrlich M: Reduced genomic 5-methylcytosine content in human colonic neoplasia. Cancer Res 1988, 48(5):1159-1161.

68. Feinberg AP, Tycko B: The history of cancer epigenetics. Nat Rev Cancer 2004, 4(2):143-153.

69. Lo PK, Sukumar S: Epigenomics and breast cancer. Pharmacogenomics 2008, 9(12):1879-1902.

70. Decitabine: 2'-deoxy-5-azacytidine, Aza dC, DAC, dezocitidine, NSC 127716. Drugs R D 2003, 4(3):179-184.

71. Shivapurkar N, Gazdar AF: DNA Methylation Based Biomarkers in Non-Invasive Cancer Screening. Curr Mol Med.

72. Anglim PP, Alonzo TA, Laird-Offringa IA: DNA methylation-based biomarkers for early detection of non-small cell lung cancer: an update. Mol Cancer 2008, 7:81.

73. Heller G, Zielinski CC, Zochbauer-Muller S: Lung cancer: From single-gene methylation to methylome profiling. Cancer Metastasis Rev 2010.

74. Bibikova M, Lin Z, Zhou L, Chudin E, Garcia EW, Wu B, Doucet D, Thomas NJ, Wang Y, Vollmer E et al: High-throughput DNA methylation profiling using universal bead arrays. Genome Res 2006, 16(3):383-393.

75. Weber M, Davies JJ, Wittig D, Oakeley EJ, Haase M, Lam WL, Schubeler D: Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat Genet 2005, 37(8):853-862.

76. Shames DS, Girard L, Gao B, Sato M, Lewis CM, Shivapurkar N, Jiang A, Perou CM, Kim YH, Pollack JR et al: A genome-wide screen for promoter methylation in lung cancer identifies novel methylation markers for multiple malignancies. PLoS Med 2006, 3(12):e486.

77. Thu KL, Pikor LA, Kennett JY, Alvarez CE, Lam WL: Methylation analysis by DNA immunoprecipitation. J Cell Physiol 2009, 222(3):522-531.

78. Khulan B, Thompson RF, Ye K, Fazzari MJ, Suzuki M, Stasiek E, Figueroa ME, Glass JL, Chen Q, Montagna C et al: Comparative isoschizomer profiling of cytosine methylation: the HELP assay. Genome Res 2006, 16(8):1046-1055.

79. Omura N, Li CP, Li A, Hong SM, Walter K, Jimeno A, Hidalgo M, Goggins M: Genome-wide profiling of methylated promoters in pancreatic adenocarcinoma. Cancer Biol Ther 2008, 7(7):1146-1156.

80. Slamon DJ, Godolphin W, Jones LA, Holt JA, Wong SG, Keith DE, Levin WJ, Stuart SG, Udove J, Ullrich A et al: Studies of the HER-2/neu proto-oncogene in human breast and ovarian cancer. Science 1989, 244(4905):707-712.

81. Coe BP, Chari R, Lockwood WW, Lam WL: Evolving strategies for global gene expression analysis of cancer. J Cell Physiol 2008, 217(3):590-597.

82. Heidenblad M, Lindgren D, Veltman JA, Jonson T, Mahlamaki EH, Gorunova L, van Kessel AG, Schoenmakers EF, Hoglund M: Microarray analyses reveal strong influence of DNA copy number alterations on the transcriptional patterns in pancreatic cancer: implications for the interpretation of genomic amplifications. Oncogene 2005, 24(10):1794-1801.

83. Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E, Ringner M, Sauter G, Monni O, Elkahloun A et al: Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Res 2002, 62(21):6240-6245.

84. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A 2002, 99(20):12963-12968.

85. Brazma A, Robinson A, Cameron G, Ashburner M: One-stop shop for microarray data. Nature 2000, 403(6771):699-700.

86. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98(9):5116-5121.

87. Rajagopalan D: A comparison of statistical methods for analysis of high density oligonucleotide array data. Bioinformatics 2003, 19(12):1469-1476.

88. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249-264.

89. Myers CL, Chen X, Troyanskaya OG: Visualization-based discovery and analysis of genomic aberrations in microarray data. BMC Bioinformatics 2005, 6:146.

22 90. Chen W, Erdogan F, Ropers HH, Lenzner S, Ullmann R: CGHPRO -- a comprehensive data analysis tool for array CGH. BMC Bioinformatics 2005, 6:85. 91. Kim SY, Nam SW, Lee SH, Park WS, Yoo NJ, Lee JY, Chung YJ: ArrayCyGHt: a web application for analysis and visualization of array-CGH data. Bioinformatics 2005, 21(10):2554-2555. 92. Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH- Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 2003, 19(13):1714- 1715. 93. Lingjaerde OC, Baumbusch LO, Liestol K, Glad IK, Borresen-Dale AL: CGH-Explorer: a program for analysis of array-CGH data. Bioinformatics 2005, 21(6):821-822. 94. Chi B, DeLeeuw RJ, Coe BP, MacAulay C, Lam WL: SeeGH--a software tool for visualization of whole genome array comparative genomic hybridization data. BMC Bioinformatics 2004, 5:13. 95. Chari R, Lockwood WW, Coe BP, Chu A, Macey D, Thomson A, Davies JJ, MacAulay C, Lam WL: SIGMA: a system for integrative genomic microarray analysis of cancer genomes. BMC Genomics 2006, 7:324. 96. Chari R, Lockwood WW, Lam WL: Computational methods for the analysis of array comparative genomic hybridization. Cancer Inform 2007, 2:48-58. 97. Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004, 5(4):557-572. 98. Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 2007, 23(6):657-663. 99. Coe BP, Ylstra B, Carvalho B, Meijer GA, Macaulay C, Lam WL: Resolving the resolution of array CGH. Genomics 2007, 89(5):647-653. 100. Coe BP, Chari R, MacAulay C, Lam WL: FACADE: A fast and sensitive algorithm for the segmentation and calling of high resolution array CGH data. Nucleic Acids Res 2010, Revision. 101. 
Lonergan KM, Chari R, Coe BP, Wilson IM, Tsao MS, Ng RT, MacAulay C, Lam S, Lam WL: Transcriptome profiles of carcinoma-in-situ and invasive non-small cell lung cancer as revealed by SAGE. PLoS One 2010, Accepted. 102. Lee EHL, Chari R, Lam A, Ng RT, Yee J, English J, Evans KG, MacAulay C, Lam S, Lam WL: Disruption of the non-canonical WNT pathway in lung squamous cell carcinoma. Clinical Medicine: Oncology 2008, 2:169-179. 103. Chari R, Lonergan KM, Pikor LA, Coe BP, Zhu CQ, Chan THW, MacAulay C, Tsao MS, Lam S, Ng RT et al: A sequence-based approach to identify reference genes for gene expression analysis. BMC Medical Genomics 2010, Submitted. 104. Lockwood WW, Chari R, Coe BP, Thu KL, Garnis C, Malloff CA, Campbell J, Williams AC, Hwang D, Zhu CQ et al: BRF2 – A Novel Lineage Specific Oncogene in Lung Squamous Cell Carcinoma. PLoS Med 2010, Revisions.

Chapter 2: SIGMA2: A system for the integrative genomic multi-dimensional analysis of cancer genomes, epigenomes, and transcriptomes¹

¹ A version of this chapter has been published. Chari R, Coe BP, Wedseltoft C, Benetti M, Wilson IM, Vucic EA, MacAulay C, Ng RT, Lam WL. (2008) SIGMA2: A system for the integrative genomic multi-dimensional analysis of cancer genomes, epigenomes, and transcriptomes. BMC Bioinformatics 9:422. doi:10.1186/1471-2105-9-422. Please see the published version of this chapter for all supplementary materials.

2.1 Introduction

Multiple mechanisms of gene disruption have been shown to be important in the development of cancer. Genetic alterations (mutations, changes in gene dosage, allele imbalance) and epigenetic alterations (changes in DNA methylation and histone modification states) are responsible for changing the expression of genes. High throughput approaches have afforded the ability to interrogate the genomic, epigenomic and gene expression (transcriptomic) profiles at unprecedented resolution [1-6]. However, a gene can be disrupted by one mechanism or by a combination of mechanisms; therefore, investigation in a single 'omics dimension (genomics, epigenomics, or transcriptomics) alone cannot detect all disrupted genes in a given tumor.

Moreover, individual tumors may disrupt a given gene by different mechanisms while achieving the same net effect on phenotype. Hence, a multi-dimensional approach is required to identify the causal events at the DNA level and understand their downstream consequences.

The current state of software for global profile comparison typically focuses on analyzing and displaying data from a single dimension, for example CGH Fusion (infoQuant Ltd, London, UK) for DNA copy number profile analysis and GeneSpring (Agilent Technologies, Santa Clara, CA, USA) for gene expression profile analysis. Software for integrative analysis has been restricted to working with datasets derived from a limited combination of technology platforms (Table 2.1) [7-10]. Though different software can analyze data generated from different platforms, the ability to perform meta-analysis using data from multiple microarray platforms is limited to a small number of software packages. Consequently, integrative analysis of cancer genomes typically involves no more than two types of data, most commonly gene dosage and gene expression data [11-16], and has only recently been expanded to include allelic information [17]. Software to perform multi-dimensional analysis is therefore greatly in demand.

Here, we present SIGMA2, a novel software package which allows users to integrate data from the various 'omics disciplines such as genomics, epigenomics and transcriptomics. Multi-dimensional datasets can be simultaneously compared, analyzed and visualized with respect to individual dimensions, allowing combinatorial integration of the different assays belonging to the different 'omics. The identification of genes altered at multiple levels such as copy number, LOH, DNA methylation and the detection of consequential changes in gene expression can be concertedly performed, establishing SIGMA2 as a tool to facilitate the high throughput systems biology analysis of cancer. SIGMA2 is freely available for academic and research use from our website, http://www.flintbox.com/technology.asp?Page=3716.

2.2 Implementation

SIGMA2 is implemented in Java and requires version 1.6 or later of the Java runtime environment. In addition, the statistical package R and the database application MySQL are required. The Java interface communicates with MySQL using a JDBC connector and with R using the JRI package by JGR (Figure 2.1). MySQL is used for data storage and querying, while R is used for segmentation and statistical analysis. All genomic coordinate information was obtained from the University of California Santa Cruz (UCSC) genome databases [18].

2.3 Results and discussion

2.3.1 Look and feel of SIGMA2

The novel multi-dimensional 'omics data analysis software SIGMA2 is built on the framework of a facile visualization tool called SIGMA, which can display alignment of genomic data from a built-in static database [7]. The arsenal of functionalities introduced in SIGMA2 is shown in Table 2.1.

2.3.2 Description of application scope and functionality

SIGMA2 is built to handle a variety of analysis techniques typically used in the high-throughput study of cancer, allowing the combinatorial integration of multiple 'omics disciplines. The hierarchy that underlies the program, which groups data into genome, epigenome, and transcriptome, is shown in Figure 2.2a; the overall functionality map is given in Figure 2.2b and summarized in Table 2.2. Within each 'omics dimension, data sets may be imported representing any of the major types of biological measurements being assayed, for example, (i) DNA copy number and LOH assays within the genomic bundle, (ii) DNA methylation and histone modification status within the epigenomic bundle, and (iii) gene expression profiles and microRNA expression assays within the transcriptomic bundle. Each assay may branch into data sources from a multitude of technology platforms.

2.3.3 Approach to integration between array platforms and assays

SIGMA2 treats all data in the context of genome position based on the relevant human genome build using the UCSC genome assemblies. An interval-based approach is used to sample across different array platforms and assays, and data from each interval are merged together. Briefly, this is done by querying data at fixed genomic intervals for each platform and subsequently taking an average of the measurements within each interval. The algorithm is listed in Figure 2.3.
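As an illustration of the interval-based approach described above, the following Python sketch bins probe measurements into fixed genomic intervals and aligns several platforms on those intervals. It is not the SIGMA2 implementation (which uses Java, MySQL and R); the function names, the (chromosome, position, value) tuple layout and the platform labels are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def bin_by_interval(probes, interval=10_000):
    """Average probe values that fall in the same fixed-width interval.

    probes: iterable of (chromosome, position, value) tuples
            (a hypothetical layout chosen for this sketch).
    Returns {(chromosome, interval_index): mean value}.
    """
    buckets = defaultdict(list)
    for chrom, pos, value in probes:
        buckets[(chrom, pos // interval)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def merge_platforms(platforms, interval=10_000):
    """Align several platforms on shared intervals.

    platforms: {platform name: list of probe tuples}.
    Intervals with no data for a platform are reported as None.
    """
    binned = {name: bin_by_interval(p, interval) for name, p in platforms.items()}
    all_keys = sorted(set().union(*(b.keys() for b in binned.values())))
    return {key: {name: binned[name].get(key) for name in binned}
            for key in all_keys}


# Two toy platforms covering the first 20 kb of chromosome 1.
example = {
    "bac": [("chr1", 1_500, 0.2), ("chr1", 8_000, 0.4)],
    "oligo": [("chr1", 12_000, -0.5)],
}
merged = merge_platforms(example)
```

Here the two BAC measurements in the first 10 kb interval are averaged, and the oligo platform has no value for that interval, mirroring the "blank is returned" rule noted for Figure 2.3.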

2.3.4 Format requirements of input data

Standard tab-delimited text files are used for the input of data for all of the assay types. For genomic data, specifically array CGH, normalization is recommended using external algorithms such as CGH-Norm and MANOR [19, 20]. Segmentation analysis can be performed within SIGMA2, but results from external analysis can also be imported and used in the consensus calling feature. The algorithms which can be called within SIGMA2 currently include DNACopy and GLAD [21, 22]. Multiple sample batch importing is available to facilitate efficient loading of datasets. To utilize this, the user must create an information file which describes each sample in the dataset. Formatting requirements of the information file are specified in the manual.

Alternatively, for Affymetrix SNP array analysis, data should also be pre-processed and normalized using the appropriate software, such as CNAG, before importing into SIGMA2 [23]. Genotyping calls should be made prior to importing, using the "AA", "AB" and "BB" convention. If the genotype call does not exist, "NC" must be specified. For epigenomic data, data from affinity-based approaches (MeDIP [6] and ChIP [24]) should contain a value representing the level of enrichment and the genomic coordinates for each spot. Similarly, for bisulphite-based approaches [25], a percentage of converted CpGs should be provided along with the genomic coordinates for each spot. Finally, for transcriptome data, gene expression data from Affymetrix experiments can be directly imported and processed as CEL files, which are normalized using the MAS 5.0 algorithm implemented in the "affy" package of R. For any assay type, custom data can be imported whereby the user provides a map of the platform based on the given genome build, and the unique identifier for the map must be used for the data generated from those experiments.

2.3.5 Description of user interface

The main user interface in SIGMA2 utilizes a tabbed window-pane which allows the user to open multiple visualizations simultaneously (Figure 2.4). The left part of the window manages the analyses and projects which belong to the current user and button shortcuts for the main functionality are spread along the top of the window. Using an example of an array CGH profile from the Agilent 244K platform, we demonstrate the step-wise interrogation of a region of interest [26]. Briefly, using the highlighting toolbar button, the user can select a region of interest and subsequently, by clicking the right mouse button, the user can search for annotated genes within the specified genomic coordinates.

2.3.6 Analysis of data from a single assay type

The first, and most basic, level of analysis is from a single assay type. For array CGH, multiple options for segmentation algorithms are available within the program and results from externally run segmentation can be imported as well. However, each segmentation algorithm has its advantages and disadvantages depending on the type of data used and the quality of data at hand. A unique feature of SIGMA2 is the ability to take a consensus of multiple algorithms using "And" or "Or" logic between algorithms. Moreover, a level of consensus can be specified (Figure 2.5a). For example, if an experiment is analyzed using five approaches, the user can select areas of gain and loss which were detected by one algorithm, at least three algorithms, all five algorithms, etc. For LOH, status is determined by a basic analysis of the number of consecutive markers that exhibit LOH. Data from affinity-based approaches for DNA methylation and histone modification states, or bead-based percentages of CpG island methylation, are analyzed by either direct thresholding or z-transform thresholding. For any of the different assay types, when examining across a number of samples, a frequency of alteration can be calculated and plotted.
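The consensus idea can be sketched in a few lines of Python. This is a simplified illustration rather than the SIGMA2 code: it assumes each algorithm's output has already been reduced to one call per genomic interval, encoded as +1 (gain), -1 (loss) or 0 (no change), and the function name and encoding are hypothetical.

```python
def consensus_calls(calls_by_algorithm, min_agree):
    """Per-interval consensus across segmentation algorithms.

    calls_by_algorithm: list of equal-length lists, one per algorithm,
        holding +1 (gain), -1 (loss) or 0 (no change) for each interval
        (an assumed encoding for this sketch).
    min_agree: number of algorithms that must report the same alteration.
        min_agree=1 behaves like "Or"; min_agree=len(calls_by_algorithm)
        behaves like "And".
    """
    n_intervals = len(calls_by_algorithm[0])
    consensus = []
    for i in range(n_intervals):
        column = [calls[i] for calls in calls_by_algorithm]
        gains = column.count(1)
        losses = column.count(-1)
        if gains >= min_agree and gains > losses:
            consensus.append(1)
        elif losses >= min_agree and losses > gains:
            consensus.append(-1)
        else:
            # Too few algorithms agree, or gains and losses are tied.
            consensus.append(0)
    return consensus


# Three algorithms, four intervals.
calls = [
    [1, 1, 0, -1],
    [1, 0, 0, -1],
    [1, 1, 0, 0],
]
strict = consensus_calls(calls, min_agree=3)   # "And"
relaxed = consensus_calls(calls, min_agree=2)  # at least two algorithms
```

Treating a tie between gain and loss calls as "no change" is a design choice of this sketch; the source does not state how such conflicts are resolved.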

For data from different array platforms that assay the same biological measurement, the algorithm for integration is used to derive common data. This feature is most applicable to DNA copy number data due to the number of array CGH platforms. It allows for better utilization of publicly available data, thus increasing sample size for statistical analysis. Similar to the multiple sample analysis of data on the same platform, a frequency of altered states can be generated and plotted. Figure 2.5b shows the concerted analysis of samples profiled on the Affymetrix 500K SNP array, Agilent 244K CGH array and the whole genome tiling path BAC array.

2.3.7 Analysis of data from multiple assays in a given 'omics dimension

Within a given 'omics dimension, multiple assay types can be analyzed in combination. For example, it is useful to investigate copy number and LOH and the interplay between DNA methylation and different states of histone modification. Typically, in regions of copy number loss, LOH is also observed. However, LOH can also occur in regions which are copy number neutral, indicating a change in allelic status which is not interpretable by one dimension alone. Here, for a sample with both copy number and LOH information, we show a region of copy number loss associated with LOH (Figure 2.6). In terms of epigenetics, DNA methylation and states of histone methylation and acetylation are known to be biologically relevant. With high throughput technologies available to assay these dimensions, this type of analysis will become more prevalent.
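To make the copy number and LOH combination concrete, the sketch below calls LOH from runs of consecutive informative markers and then labels a region by combining that call with a copy number state. It uses the "AA", "AB", "BB" and "NC" genotype convention described in Section 2.3.4; the minimum run length of three markers, the function names and the -1/0/+1 copy number encoding are assumptions for illustration, not SIGMA2 defaults.

```python
def loh_runs(normal_calls, tumor_calls, min_run=3):
    """Find runs of consecutive informative markers that show LOH.

    A marker is informative when it is "AB" in the normal sample and has a
    call in the tumor; uninformative markers are skipped and do not break
    a run. Returns (first_index, last_index) pairs for runs of at least
    min_run markers (min_run=3 is an arbitrary choice for this sketch).
    """
    runs, current = [], []
    for i, (normal, tumor) in enumerate(zip(normal_calls, tumor_calls)):
        if normal != "AB" or tumor == "NC":
            continue  # uninformative marker
        if tumor in ("AA", "BB"):
            current.append(i)  # heterozygous in normal, homozygous in tumor
        else:
            if len(current) >= min_run:
                runs.append((current[0], current[-1]))
            current = []  # retained heterozygosity ends the run
    if len(current) >= min_run:
        runs.append((current[0], current[-1]))
    return runs


def classify_region(copy_state, has_loh):
    """Combine a copy number call (-1, 0, +1) with LOH status."""
    if not has_loh:
        return "no LOH"
    if copy_state < 0:
        return "loss with LOH"
    if copy_state == 0:
        return "copy-neutral LOH"
    return "gain with LOH"


normal = ["AB", "AA", "AB", "AB", "NC", "AB", "AB", "BB"]
tumor = ["AA", "AA", "BB", "AA", "AA", "AB", "BB", "BB"]
runs = loh_runs(normal, tumor)
```

A region labelled "copy-neutral LOH" corresponds to the case discussed above in which allelic status changes without a change in copy number.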

2.3.8 Combinatorial analysis of multiple 'omics dimensions - gene dosage and gene expression

The most common analysis of multiple 'omics dimensions is the influence of the genome on the transcriptome. A number of software packages have started to incorporate approaches to examining gene dosage and gene expression [8, 9, 27]. In SIGMA2, there are multiple functionalities which allow the user to link DNA copy number to gene expression. For a single group of samples, with matching DNA copy number and gene expression profiles, the user can determine associations through two main options: a) using a correlation-based approach, correlating the log ratios with the normalized gene expression intensities and b) using a statistical-based approach comparing the expression in samples with copy number changes against those without copy number change utilizing the Mann Whitney U test, analogous to approaches taken in previous studies [27]. Spearman, Kendall or Pearson correlation coefficients can be calculated for option a). Similarly, this functionality is also available for correlating epigenetic profiles and gene expression.
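Both options can be illustrated with a short, dependency-free Python sketch. In SIGMA2 the statistics are computed in R; the code below is only an approximation of the idea. It computes a Spearman coefficient for option a) and a Mann-Whitney U statistic with a normal-approximation p-value for option b). The function names are hypothetical, and the p-value omits tie and continuity corrections, so an exact test would be preferable for small sample sizes.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient (undefined for constant input)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Option a): rank correlation of log ratios with expression."""
    return pearson(ranks(x), ranks(y))


def mann_whitney_u(altered, unaltered):
    """Option b): U statistic for `altered` and a two-sided p-value from
    the normal approximation (no tie or continuity correction)."""
    n1, n2 = len(altered), len(unaltered)
    r = ranks(list(altered) + list(unaltered))
    u = sum(r[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = math.erfc(abs(u - mu) / (sigma * math.sqrt(2)))
    return u, p


# Toy data: copy number log ratios and expression for one gene.
rho = spearman([0.1, 0.5, 0.9, 1.3], [5.0, 6.2, 7.1, 9.0])
u_stat, p_value = mann_whitney_u([8, 9, 10], [1, 2, 3])
```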

In addition to single group analysis, two-dimensional genome/transcriptome analysis can be applied to two-group comparison analysis. For example, if patterns of copy number alterations are compared between two groups and a particular region is more frequently gained in one group than another, the expression data can subsequently be compared between the groups of samples to determine if there is an association between gene dosage and gene expression. That is, we would expect the group with more frequent copy number gain to have higher expression than the other group. Notably, this functionality does not require both copy number and expression data to exist for the same sample, but allows the user to select an independent dataset for expression data comparisons (Figure 2.7).

2.3.9 Group comparison analysis - single ‘omics dimension

Finally, for two groups of samples, the user can compare the distribution of changes between the groups to determine if the patterns are statistically different using Fisher's exact test. For DNA copy number, it is the distribution of gains and losses; for DNA methylation or histone modification states, the proportion of samples in each group that meet the threshold of enrichment (low or high); and for LOH, the proportion of samples in each group with LOH for a region.
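For a single region, the two-group comparison reduces to a 2x2 table (altered versus unaltered samples in each group). The following Python sketch computes a two-sided Fisher's exact p-value directly from the hypergeometric distribution. It illustrates the test itself and is not taken from SIGMA2; the function name and the toy counts are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # The small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# Toy example: a region gained in 8 of 10 samples in one group
# and in 1 of 6 samples in the other.
p_region = fisher_exact_two_sided(8, 2, 1, 5)
```

In a genome-wide comparison this test would be repeated for every region, so some form of multiple-testing correction would normally be applied to the resulting p-values; the source does not specify which, if any, SIGMA2 uses.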

2.3.10 Group comparison analysis - integrating multiple 'omics dimensions

This type of analysis can be performed with a single sample or multiple samples, thus allowing combinatorial (“and”) analysis for large datasets. In addition, the user can also identify "or" events, where a change in any of the dimensions can be flagged. This is more important in multi-sample datasets as one dimension may not capture complex alterations of a particular region.
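The "and" and "or" logic across dimensions maps naturally onto set intersection and union. The sketch below is a minimal illustration under assumed inputs: each sample is represented as a dictionary from dimension name to the set of genes called altered in that dimension. The dimension names, gene sets and function names are hypothetical.

```python
def multi_dimensional_events(altered_by_dimension, mode="and"):
    """Genes altered in every dimension ("and") or in at least one ("or").

    altered_by_dimension: {dimension name: set of altered gene symbols}.
    """
    gene_sets = list(altered_by_dimension.values())
    if not gene_sets:
        return set()
    if mode == "and":
        return set.intersection(*gene_sets)
    if mode == "or":
        return set.union(*gene_sets)
    raise ValueError("mode must be 'and' or 'or'")


def alteration_frequency(samples, mode="or"):
    """Fraction of samples in which each gene is flagged under `mode`."""
    counts = {}
    for dimensions in samples:
        for gene in multi_dimensional_events(dimensions, mode):
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: n / len(samples) for gene, n in counts.items()}


# Illustrative gene sets only; these are not results from the thesis.
sample_1 = {
    "copy_number": {"ERBB2", "MYC"},
    "methylation": {"ERBB2"},
    "loh": {"ERBB2", "TP53"},
}
sample_2 = {"copy_number": set(), "methylation": {"MYC"}, "loh": set()}

and_genes = multi_dimensional_events(sample_1, "and")
or_frequency = alteration_frequency([sample_1, sample_2], "or")
```

In the toy data, MYC is altered by copy number in one sample and by methylation in the other, so it is only recognized as recurrently altered under "or" logic, which is the situation the paragraph above describes for multi-sample datasets.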

2.3.11 Multi-dimensional analysis of a breast cancer genome

Using the breast cancer cell line HCC2218, we show the integration of genomic, epigenomic, and transcriptomic data. Interestingly, when we examine the ERBB2 gene on chromosome 17, we show concurrent amplification, LOH, loss of methylation and a drastic increase in gene expression (Figure 2.8). ERBB2 has been shown to be an important gene in breast cancer development and therapeutic intervention. This demonstrates the value of integrating multiple dimensions to understand complex alteration patterns in disease samples where multiple causes can lead to a single effect.

2.3.12 Exporting data and results

High resolution images can be exported for all types of visualizations in SIGMA2. Histogram plots of gene expression, heatmaps with clustering of gene expression, karyogram plots and frequency histogram plots are the main types of visualization available. The frequency histogram data used to generate the plots can also be exported. Integrated plots with data plotted serially or overlaid are also available for analysis involving multiple genomic and epigenomic dimensions. Genes which are obtained from the conjunctive (And) and disjunctive (Or) multi-dimensional analysis can be exported with their status. Results of statistical analysis such as Fisher's exact comparisons and U-test comparisons of gene expression can be exported against annotated gene lists based on user-specified human genome builds. Currently, April 2003 (hg15), May 2004 (hg17) and March 2006 (hg18) are the available genome builds [18]. As new builds are released, support for those builds will be available. Finally, data from multi-platform integration can be exported based on base pair position for additional external analysis if necessary.

2.4 Conclusions

With the increase in high-throughput data covering multiple dimensions of the genome, epigenome and transcriptome, the approaches and tools to analyze this data must advance accordingly to handle, analyze and interpret this data in an integrated manner. SIGMA2 meets these requirements and provides the framework for the incorporation of data from future approaches and technologies. Specifically, with the movement from array to sequence-based technologies, the ability to assimilate sequence data with the various 'omics data sets will become a future requirement of software packages.

2.5 Availability and requirements

Project name: SIGMA2

Operating system(s): Windows XP or Vista

Other requirements: Java SE V.1.6+, R Project V.2.5+

License: Free for academic and research use; commercial users please contact

Figure 2.1

[Schematic of SIGMA2. Java: user interface, visualization. R: segmentation, statistical analysis. MySQL: data storage, querying. Connectors: JGR / JRI (Java and R), JDBC (Java and MySQL), RMySQL (R and MySQL). Links to external resources (biological databases): PubMed, OMIM, NCBI Gene, UCSC Genome Browser, GEO Profiles, Database of Genomic Variants.]

Figure 2.1. Main structural components of SIGMA2. Data and genome mapping information is stored in the MySQL database. Segmentation analysis using DNACopy and GLAD and statistical analysis is performed using R, with results stored in the database. Java was used to program the application, specifically for the user interface and the different types of visualization. Base-pair positions and gene annotations are linked to other biological databases to facilitate further interrogation by the user.

Figure 2.2

a. Data hierarchy (combinatorial integration)

Omics: Genome; Epigenome; Transcriptome
Assay: DNA copy number; allelic imbalance (LOH); DNA methylation; histone modification; gene and microRNA expression
Platform: BAC array CGH; oligo array CGH; SNP arrays; microsatellite markers; MeDIP-array CGH; bisulphite-based methods; ChIP-on-chip; SAGE; microarrays

b. Functionality map

Samples | Single platform / single assay | Single 'omics (multiple assays) | Combinatorial integration (multiple 'omics)
Single sample | A, B, C, Q, R, S | D, E, H | F, G, O, P
Multiple samples (one group) | A, B, C, L, Q, R, S | D, E, H | F, G, I, J, K, O, P
Multiple samples (two groups) | A, B, C, L, M, Q, R, S | D, E, H | F, G, I, J, K, N, O, P

A: Segmentation analysis for array CGH to identify regions of gain and loss
B: Moving average thresholding for affinity-based approaches (MeDIP for DNA methylation, ChIP-on-chip for histone modification states)
C: Regions of loss of heterozygosity (LOH)
D: Regions of copy number change and LOH
E: Regions of copy number neutrality and LOH (e.g. UPD)
F: Regions of copy number AND methylation alteration ("two" hit)
G: Regions of copy number OR methylation alteration (compensatory change with same net effect)
H: Epigenetic interplay between DNA methylation and various modification states of histones
I: Correlation of copy number and gene expression (dataset with matched copy number and expression profiles)
J: Statistical comparison of samples with copy number change versus without copy number change (dataset with matched copy number and expression profiles) using Mann Whitney U-test
K: Correlation of DNA methylation and gene expression (dataset with matched DNA methylation and expression profiles)
L: Identify recurrent changes (copy number alterations, common enrichment patterns [MeDIP, ChIP], regions of LOH)
M: Statistical comparison of patterns of recurrent changes between two groups using Fisher's exact test
N: Two-dimensional two-group comparisons (statistical comparison of expression profiles of genes in regions of difference identified by Fisher's exact comparison)
O: Identify "And" events between three or more DNA-based dimensions (copy number, LOH, DNA methylation, histone modification states)
P: Identify "Or" events between three or more DNA-based dimensions (copy number, LOH, DNA methylation, histone modification states)
Q: Cancer gene discovery
R: Lists of genes for systems/function/pathway analysis
S: Linking to public biological databases (PubMed, NCBI Gene, OMIM, NCBI GEO Profiles, UCSC Genome Browser, Database of Genomic Variants)

Figure 2.2. Data structure hierarchy. (a) Data hierarchy describing the relationship between platforms, assays and 'omics disciplines. (b) Functionality map of SIGMA2. List of the various functions, and the output from each function, that can be performed given the number of samples or sample groups and dimensions. Multiple sample analyses (single group and two group) are microarray platform independent. Functions listed in boxes are in addition to those listed in the box preceding the arrows.

Figure 2.3

    numSamples <- number of samples
    for chr <- 1:24
        k <- 10000
        chrEnd <- length of chromosome
        intervals <- ceiling(chrEnd / k)
        data <- array[intervals, numSamples]
        currentInterval <- 0
        for pos <- 0; pos < chrEnd; pos += k
            for sampleNum <- 1:numSamples
                data[currentInterval, sampleNum] <- data from sample for interval pos to pos+k*
            end
            currentInterval <- currentInterval + 1
        end
    end

*If multiple data points exist in the interval, an average is used. If no data exists, blank is returned. If it is array CGH data that is segmented, data is assumed to exist for any genomic position.

Figure 2.3. Algorithm for integrating between different array platforms. Data for every platform is matched to genomic position. Subsequently, an interval-based approach is used to systematically query data for each interval. In this figure, the interval, k, is 10 kb in size. By converting everything to genomic position, sample sets of the same disease type but on different array platforms can be aggregated, affording the user additional statistical power.

Figure 2.4

[Screenshot of the SIGMA2 user interface with panels a-e; annotation: "Search for genes, link to databases".]

Figure 2.4. SIGMA2 interface. Description of the SIGMA2 user interface using a single sample visualization as an example. (a) Customizable toolbar with shortcut buttons, (b) Project/Analysis tree to track work within and between sessions, (c) Main display area using tab-based navigation, (d) Information console and (e) Genome features tracks. Here, a copy number change is displayed in the context of CpG islands (red), microRNAs (orange) and regions annotated in the database of genomic variants (blue).

Figure 2.5

[Panels a and b; platform labels: Affymetrix SNP 500K, Agilent 244K CGH, BCCA WGTP 32K.]

Figure 2.5. Consensus calling and heterogeneous array analysis. (a) Consensus calling using multiple algorithms. Multiple algorithms (and different parameters) can be selected to analyze a given array CGH sample and this can be defined for each array platform independently, as each platform may exhibit different noise and ratio response characteristics. (b) Heterogeneous array analysis using data from multiple array CGH platforms. Samples from the Agilent 244K, Affymetrix SNP 500K and whole genome BAC array were segmented to define areas of gain and loss. Subsequently, the results were aggregated into a frequency histogram plot showing the common areas of gain and loss across the three samples.

Figure 2.6

[Panels: Copy Number (HCC2218) and LOH (HCC2218, HCC2218BL).]

Figure 2.6. Integrative genetic analysis of HCC2218. Parallel visualization and analysis of the copy number and genotype profiles of the breast cancer cell line HCC2218. The genotype profile of the matching normal blood lymphoblast line (HCC2218BL) is also provided to define regions of LOH. The DNA copy number profile was generated on the BCCA whole genome tiling path BAC array and genotype profiles are from the Affymetrix SNP 10K array [Zhao, 2004]. This region of chromosome arm 3q has a defined segmental copy number loss and the boundary of the change is evident from the LOH profile. In the genotype profile, the horizontal blue lines indicate a SNP transition from heterozygous in normal to homozygous in the tumor, indicating LOH.

Figure 2.7

[Panels a-c; sample groups labelled NSCLC and SCLC.]

Figure 2.7. Two-group, two-dimensional comparison of 37 NSCLC and 16 SCLC cancer cell lines. First, segmentation analysis is performed to delineate gains and losses in each sample. Next, a statistical comparison of the distribution of gains and losses between the two groups is done using Fisher's exact test. (a) Using the interactive search, one of the regions of difference identified is on chromosome arm 7p, shown with an NSCLC and an SCLC sample aligned next to each other. The NSCLC sample has a clear segmental gain of that region, which the SCLC sample lacks. The right-most graph is a frequency plot summary of the two sample sets (NSCLC and SCLC). NSCLC is color-coded in red and SCLC in green, and the overlap appears in yellow. The frequency of chromosome arm 7p gain is higher in the red group. (b) A heatmap representing 15 NSCLC and 15 SCLC gene expression profiles for the specific genes in the region highlighted in yellow. (c) When examining gene expression data for EGFR, a gene in this region, we can see that expression is drastically higher in NSCLC than in SCLC, as predicted by the higher frequency of gain of that region in NSCLC. Gene expression data are represented as log2 of the normalized intensities.

Figure 2.8

[Tracks: DNA Copy Number, Allelic imbalance (LOH), DNA Methylation. Expression panel: HCC2218, Luminal, Myoepithelial.]

Figure 2.8. Multi-dimensional perspective of chromosome 17 of the HCC2218 breast cancer cell line. Copy number, LOH, and DNA methylation profiling identifies an amplification of ERBB2 coinciding with allelic imbalance and loss of methylation. When examining the gene expression, the expression of ERBB2 in HCC2218 is significantly higher than in a panel of normal luminal and myoepithelial cell lines [28].

Table 2.1. Features required for integrative analysis

Features required for integrative analysis | 2 *CGH VAMP SIGMA SIGMA Analytics ISA-CGH ISA-CGH MD-SeeGH Nexus CGH CGH Fusion
Built-in segmentation for array CGH | ✓ ✓ ✓ ✓ ✓ ✓ ✓
Consensus calling using multiple segmentation algorithms | ✓
Array platform-independent combined CGH analysis | ✓ ✓ ✓
Custom microarray data handling | ✓ ✓ ✓ ✓ ✓ ✓ ✓
Basic copy number and expression integration | ✓ ✓ ✓ ✓
Alignment and analysis of genetic and epigenetic data | ✓ ✓ ✓
Multi-dimensional visualization of genetic, epigenetic and gene expression data | ✓
Two group statistical comparison | ✓ ✓ ✓ ✓
Two group combinatorial gene dosage and gene expression comparison | ✓
Linking to external biological databases | ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Linking to external gene expression (GEOProfiles) | ✓
Context-based visualization of genome features | ✓ ✓ ✓ ✓ ✓
Conversion of data between different genome assemblies | ✓ ✓ ✓ ✓
Free for academic/research use | ✓ ✓ ✓ ✓ ✓ ✓

Table 2.2. Summary of input, analysis, and output for each dimension

'Omics classification | Assay(s) measured | Input | Functionality*** | Output
Genomics | Copy number | Array CGH | Segmentation; Direct thresholding; Moving average-based thresholding; Z-transformation of moving average; Whole genome visualization; Frequency histograms | Regions of gain and loss; Gene lists for further analysis; High-resolution karyogram images
Genomics | LOH | SNPs* | LOH based on consecutive altered markers | Regions of LOH
Genomics | LOH | Microsatellite markers | Same as above | Same as above
Genomics | Copy number, LOH | | Identify regions of uniparental disomy (UPD): LOH with no copy number change |
Epigenomics | DNA methylation | MeDIP + array CGH | Direct thresholding; Moving average-based thresholding; Z-transformation of moving average; Visualization against genome position | Regions of enrichment and lack of methylation; Gene lists for further analysis
Epigenomics | DNA methylation | Bisulphite-based | Thresholding of proportion of methylated CpG's |
Epigenomics | Histone modification states | ChIP-on-chip | Direct thresholding; Moving average-based thresholding; Z-transformation of moving average | Regions of enrichment and lack of enrichment; Gene lists for further analysis
Epigenomics | DNA methylation, Histone modification states | | Epigenetic interplay | Regions of mutually exclusive change between chromatin state and DNA methylation states
Transcriptomics | Gene expression** | Microarrays | Heatmap visualization, clustering; Histograms; Statistical comparisons | Expression of genes of interest based on DNA analysis
Transcriptomics | Gene expression** | SAGE | Heatmap visualization, clustering; Histograms; Statistical comparisons | Expression of genes of interest based on DNA analysis
Genomics, Transcriptomics | Copy number, Gene expression | | Correlation analysis of copy number and expression; Statistical comparison of expression in regions of copy number difference (two group analysis) | Genes whose expression is strongly regulated by copy number; p-values for associations; p-values for group comparison
Genomics, Epigenomics | Copy number, DNA methylation | | Identify regions of concerted change in BOTH copy number and methylation ("two-hit"); Identify regions with change in copy number OR DNA methylation |
Genomics, Epigenomics | LOH, DNA methylation | | Identify allele-specific methylation events | Regions of allele specific aberrant methylation
Genomics, Epigenomics, Transcriptomics | Copy number, LOH, DNA methylation, Histone modification, Gene Expression | | Identify co-ordinate genetic, epigenetic and gene expression changes | Genes altered at multiple levels

2.6 References

1. Garnis C, Buys TP, Lam WL: Genetic alteration and gene expression modulation during cancer progression. Mol Cancer 2004, 3:9.
2. Ishkanian AS, Malloff CA, Watson SK, DeLeeuw RJ, Chi B, Coe BP, Snijders A, Albertson DG, Pinkel D, Marra MA et al: A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 2004, 36(3):299-303.
3. Khulan B, Thompson RF, Ye K, Fazzari MJ, Suzuki M, Stasiek E, Figueroa ME, Glass JL, Chen Q, Montagna C et al: Comparative isoschizomer profiling of cytosine methylation: the HELP assay. Genome Res 2006, 16(8):1046-1055.
4. Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet 2006, 14(2):139-148.
5. Rauch T, Li H, Wu X, Pfeifer GP: MIRA-assisted microarray analysis, a new technology for the determination of DNA methylation patterns, identifies frequent methylation of homeodomain-containing genes in lung cancer cells. Cancer Res 2006, 66(16):7939-7947.
6. Weber M, Davies JJ, Wittig D, Oakeley EJ, Haase M, Lam WL, Schubeler D: Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat Genet 2005, 37(8):853-862.
7. Chari R, Lockwood WW, Coe BP, Chu A, Macey D, Thomson A, Davies JJ, MacAulay C, Lam WL: SIGMA: a system for integrative genomic microarray analysis of cancer genomes. BMC Genomics 2006, 7:324.
8. Conde L, Montaner D, Burguet-Castell J, Tarraga J, Medina I, Al-Shahrour F, Dopazo J: ISACGH: a web-based environment for the analysis of Array CGH and gene expression which includes functional profiling. Nucleic Acids Res 2007, 35(Web Server issue):W81-85.
9. La Rosa P, Viara E, Hupe P, Pierron G, Liva S, Neuvial P, Brito I, Lair S, Servant N, Robine N et al: VAMP: visualization and analysis of array-CGH, transcriptome and other molecular profiles. Bioinformatics 2006, 22(17):2066-2073.
10. Chi B, deLeeuw RJ, Coe BP, Ng RT, MacAulay C, Lam WL: MD-SeeGH: a platform for integrative analysis of multi-dimensional genomic data. BMC Bioinformatics 2008, 9:243.
11. Carrasco DR, Tonon G, Huang Y, Zhang Y, Sinha R, Feng B, Stewart JP, Zhan F, Khatry D, Protopopova M et al: High-resolution genomic profiles define distinct clinico-pathogenetic subgroups of multiple myeloma patients. Cancer Cell 2006, 9(4):313-325.
12. Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo WL, Lapuk A, Neve RM, Qian Z, Ryder T et al: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 2006, 10(6):529-541.
13. Coe BP, Lockwood WW, Girard L, Chari R, Macaulay C, Lam S, Gazdar AF, Minna JD, Lam WL: Differential disruption of cell cycle pathways in small cell and non-small cell lung cancer. Br J Cancer 2006, 94(12):1927-1935.
14. Lockwood WW, Chari R, Coe BP, Girard L, Macaulay C, Lam S, Gazdar AF, Minna JD, Lam WL: DNA amplification is a ubiquitous mechanism of oncogene activation in lung and other cancers. Oncogene 2008.
15. Neve RM, Chin K, Fridlyand J, Yeh J, Baehner FL, Fevr T, Clark L, Bayani N, Coppe JP, Tong F et al: A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell 2006, 10(6):515-527.
16. Stransky N, Vallot C, Reyal F, Bernard-Pierrot I, de Medina SG, Segraves R, de Rycke Y, Elvin P, Cassidy A, Spraggon C et al: Regional copy number-independent deregulation of transcription in cancer. Nat Genet 2006, 38(12):1386-1396.
17. Sanders MA, Verhaak RG, Geertsma-Kleinekoort WM, Abbas S, Horsman S, van der Spek PJ, Lowenberg B, Valk PJ: SNPExpress: integrated visualization of genome-wide genotypes, copy numbers and gene expression levels. BMC Genomics 2008, 9:41.
18. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F et al: The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res 2008, 36(Database issue):D773-779.
19. Khojasteh M, Lam WL, Ward RK, MacAulay C: A stepwise framework for the normalization of array CGH data. BMC Bioinformatics 2005, 6:274.
20. Neuvial P, Hupe P, Brito I, Liva S, Manie E, Brennetot C, Radvanyi F, Aurias A, Barillot E: Spatial normalization of array-CGH data. BMC Bioinformatics 2006, 7:264.
21. Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 2004, 20(18):3413-3422.
22. Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 2007, 23(6):657-663.
23. Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC et al: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 2005, 65(14):6071-6079.
24. Ballestar E, Paz MF, Valle L, Wei S, Fraga MF, Espada J, Cigudosa JC, Huang TH, Esteller M: Methyl-CpG binding proteins identify novel sites of epigenetic inactivation in human cancer. EMBO J 2003, 22(23):6335-6345.
25. Bibikova M, Lin Z, Zhou L, Chudin E, Garcia EW, Wu B, Doucet D, Thomas NJ, Wang Y, Vollmer E et al: High-throughput DNA methylation profiling using universal bead arrays. Genome Res 2006, 16(3):383-393.
26. Coe BP, Ylstra B, Carvalho B, Meijer GA, Macaulay C, Lam WL: Resolving the resolution of array CGH. Genomics 2007, 89(5):647-653.
27. van Wieringen WN, Belien JA, Vosse SJ, Achame EM, Ylstra B: ACE-it: a tool for genome-wide integration of gene dosage and RNA expression data. Bioinformatics 2006, 22(15):1919-1920.
28. Grigoriadis A, Mackay A, Reis-Filho JS, Steele D, Iseli C, Stevenson BJ, Jongeneel CV, Valgeirsson H, Fenwick K, Iravani M et al: Establishment of the epithelial-specific transcriptome of normal and malignant human breast cells based on MPSS and array expression data. Breast Cancer Res 2006, 8(5):R56.

Chapter 3: An integrative multi-dimensional genetic and epigenetic strategy to identify aberrant genes and pathways in cancer²

² A version of this chapter has been published. Chari R, Coe BP, Vucic EA, Lockwood WW, Lam WL. (2010) An integrative multi-dimensional genetic and epigenetic strategy to identify aberrant genes and pathways in cancer. BMC Systems Biology, 4(1):67, 1-14. Please see the published version of this chapter for all supplementary materials.

3.1 Background

Genomic analyses have substantially improved our knowledge of cancer. Gene expression profiling, for example, is utilized to delineate subtypes of breast cancer, and has facilitated the derivation of predictive and prognostic signatures [1-5]. However, not all of the gene expression changes observed are causal to cancer development, and global gene expression analysis alone cannot distinguish between causal and reactive changes. Corresponding alteration at the DNA level is regarded as evidence of causality; for example, gene deletion or gene silencing by methylation. Hence, examining genetic and epigenetic events in conjunction with the changes in gene expression pattern should improve the identification of causal changes that lead to disease phenotype.

Analysis of gene copy number alone has correlated breast cancer genome features with poor prognosis based on the degree of genomic instability observed [6]. In terms of gene discovery, specific genomic regions containing important loci have been shown to be frequently gained or lost [7-11]. Integrative analyses of gene dosage and gene expression in breast cancer have revealed specific genes which are deregulated at the gene expression level as a result of changes in DNA copy number. From a global perspective, studies have shown a broad range in concordance between DNA amplification and overexpression of genes. This variability is attributable to the sensitivity of the methods used in detecting gene copy number and gene expression changes as well as the number of genes examined [12-15]. Conversely, when examining gene overexpression, it was found that only 10.5% of the overexpression could be attributable to gene amplification [14]. It is certain that altered gene expression can not only be attributed to disruption of regulatory/signaling cascades and downstream effects, but also to a multitude of causal genetic and epigenetic aberrations.

We reason that by examining multiple genomic dimensions simultaneously, with a dimension representing a genome wide assay measuring DNA level alterations such as gene copy number or DNA methylation, we are likely to achieve the following: (i) explain a greater fraction of the observed gene expression deregulation as compared with explaining expression deregulation using only a single dimension, (ii) improve the discovery of critical oncogenes and tumor suppressor genes (TSGs) by focusing on those genes altered simultaneously at multiple genomic dimensions, and (iii) begin to understand the complex mechanisms of dysregulation of oncogenic pathways. In this study, we demonstrate the power of an integrative genomics approach by performing multi-dimensional analyses (MDA) of the genome, epigenome, and transcriptome of breast cancer cell lines. We demonstrate the need for integrative analysis of multiple genomic dimensions by showing the co-operative contribution of DNA level mechanisms to explaining differential gene expression. Using a strategy to identify genes exhibiting congruent alteration in copy number, DNA methylation, and allelic (loss of heterozygosity, LOH) status, which we term multiple concerted disruption (MCD) analysis, we find genes representing key nodes in pathways as well as genes which exhibit prognostic significance. In examining the neuregulin pathway, we observe variability among samples in the mechanism of dysregulation of this commonly altered breast cancer pathway, highlighting the importance of multi-dimensional correlative analysis of a given pathway in individual tumor samples, in addition to the conventional approach of identifying loci simply based on frequency of disruption in a cohort. Finally, examining the subset of triple negative breast cancer (TNBC) cell lines, we show that COL1A1, a downstream target of FGFR2, a recently implicated oncogene in TNBC, is frequently affected by MCD even though FGFR2 itself is rarely affected.

Notably, this is the first such in-depth genomic, epigenomic, and transcriptomic analysis of breast cancer.

3.2 Methods

3.2.1 Data generation and acquisition

Commonly used breast cancer (HCC38, HCC1008, HCC1143, HCC1395, HCC1599, HCC1937, HCC2218, BT474, MCF-7) and non-cancer (MCF10A) cell lines were selected for analyses (Additional File 1 or Appendix II). Copy number profiles were obtained from the SIGMA database [11, 16]. These profiles were generated using a whole genome tiling path microarray CGH platform [17, 18]. Expression profiles for BT474 and MCF-7 were obtained from the NCI Cancer Biomedical Informatics Grid (caBIG, https://cabig.nci.nih.gov), MCF10A profile from GEO (GSM254525), and the rest were generated using Affymetrix U133 Plus 2.0 platform at the McGill University and Genome Quebec Innovation Centre. Affymetrix 500K SNP array data were obtained from caBIG. DNA methylation profiles were generated using the Illumina Infinium methylation platform at the Genomics Lab, Wellcome Trust Centre for Human Genetics. A summary of the sources of all the data used is provided in Additional File 2 or Appendix III. Gene expression and methylation data generated were deposited in NCBI GEO (GSE17768 and GSE17769).

3.2.2 Data processing and normalization

Array CGH data were normalized using a stepwise normalization framework [19]. In addition, data were filtered based on a stringent standard deviation cut-off of 0.075 between replicate spots, with clones exceeding this cut-off excluded from further analysis. To identify regions of gain and loss, smoothing and segmentation analysis was performed using aCGH-Smooth [20] as previously described [21]. Copy number status for clones removed by this filter was inferred from neighboring clones within a 1 Mb window.
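The replicate filter and neighbor-based inference described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function names are hypothetical, the sample standard deviation is assumed, and the rule of borrowing the call of the nearest retained clone within 1 Mb is an assumption, since the text does not specify how neighboring clones were combined.

```python
from statistics import stdev

SD_CUTOFF = 0.075      # replicate-spot cut-off quoted in the text
WINDOW_BP = 1_000_000  # 1 Mb window quoted in the text


def passes_replicate_filter(replicate_log_ratios):
    """Keep a clone only if its replicate spots agree (sample SD <= cut-off)."""
    return stdev(replicate_log_ratios) <= SD_CUTOFF


def infer_status(position, called_clones):
    """Assumed rule: borrow the call of the nearest retained clone within the
    window. called_clones holds (position, status) pairs from one chromosome.
    Returns None when no retained clone lies within the window."""
    nearby = [(abs(pos - position), status)
              for pos, status in called_clones
              if abs(pos - position) <= WINDOW_BP]
    return min(nearby)[1] if nearby else None
```

A majority vote over all clones in the window would be an equally plausible reading of the text.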

Affymetrix SNP array data were normalized and genotyped using the "oligo" package in R, specifically using the crlmm algorithm for genotyping [22]. Genotype calls whose confidences were less than 0.95 were termed "No Call" (NC). Subsequently, genotype profiles were analyzed using dChip [23] and LOH was determined using a panel of 60 normal genotypes from the HapMap dataset [24] as provided by dChip, as matching blood lymphoblast profiles were not available. LOH ("L"), Retention ("R"), and No Call ("N") status was determined for every marker in each sample. Analysis parameters used were as specified in the dChip manual.

Raw gene expression profiles from all ten cell lines were normalized using the "rma" package in R (Additional File 3). Gene expression data were further filtered using the Affymetrix MAS 5.0 Call values ("P", "M", and "A"). Since differential expression was assessed by comparing one cancer line to one normal line, a probe was retained for analysis unless its call was "Absent" in both lines.

Methylation data were normalized and processed using Illumina BeadStudio software (http://www.illumina.com/software/genomestudio_software.ilmn, Illumina, Inc., San Diego, CA, USA). Beta-values and confidence p-values were retained for further analysis. Beta-values with associated confidence p-values > 0.05 were excluded. Data from all genomic dimensions were mapped to the hg18 (March 2006) genome assembly.
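The two filtering rules above are simple enough to state as code. The sketch below is illustrative only; the function names and the dictionary layout (probe identifier mapped to value) are assumptions, not part of the original analysis.

```python
def keep_probe(cancer_call, normal_call):
    """Retain an expression probe unless both MAS 5.0 calls are Absent ("A")."""
    return not (cancer_call == "A" and normal_call == "A")


def filter_betas(betas, detection_p, max_p=0.05):
    """Drop beta-values whose confidence p-value exceeds max_p.

    betas and detection_p both map probe identifier -> value.
    """
    return {probe: beta for probe, beta in betas.items()
            if detection_p[probe] <= max_p}
```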

3.2.3 Strategy for integrative analysis

Copy number and LOH profiles were mapped to genes using the mapping of the Affymetrix U133 Plus 2.0 platform as well as the UCSC Genome Browser [25]. Methylation data were linked to the other three types of data using either the RefSeq gene symbol as specified by the Illumina mapping file (Illumina), or the RefSeq accession number. Differential expression was determined by subtracting the expression value in the non-malignant line MCF10A from the value in each cancer line. Since the obtained gene expression values after RMA normalization were represented in log2 space, a gene was considered differentially expressed if the absolute difference between the cancer line and MCF10A was greater than 1, which corresponded to a two-fold expression difference. DNA methylation status was determined by subtracting beta-values, with hypermethylation defined as a positive difference between tumor and normal (≥ 0.25) and hypomethylation defined as a negative difference between tumor and normal (≤ -0.25). Briefly, a beta value for a given CpG site ranges from 0 to 1 and represents the ratio of the methylated signal over the total signal (methylated plus unmethylated signal). These thresholds are comparable to those used in previous studies using an earlier Illumina methylation platform [26].
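As a worked illustration of these thresholds, the sketch below classifies one gene in one cancer line relative to the non-malignant reference. The function names and status labels are hypothetical; the cut-offs (a log2 difference of 1 and a beta-value difference of 0.25) and the beta-value definition are those given in the text.

```python
def expression_status(cancer_log2, normal_log2, min_diff=1.0):
    """Classify a gene from log2 expression values (cancer minus normal)."""
    diff = cancer_log2 - normal_log2
    if diff > min_diff:
        return "over"
    if diff < -min_diff:
        return "under"
    return "unchanged"


def beta_value(methylated, unmethylated):
    """Methylated signal over total signal, as defined in the text."""
    return methylated / (methylated + unmethylated)


def methylation_status(tumor_beta, normal_beta, min_diff=0.25):
    """Classify a CpG site from the beta-value difference (tumor minus normal)."""
    diff = tumor_beta - normal_beta
    if diff >= min_diff:
        return "hyper"
    if diff <= -min_diff:
        return "hypo"
    return "unchanged"
```

For instance, log2 values of 9.5 in a cancer line and 8.0 in MCF10A differ by 1.5, a fold change of 2^1.5 (about 2.8), so the gene would be called overexpressed.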

Using this mapping strategy, 12,910 unique genes were mapped across platforms corresponding to 24,708 of the ~27,000 Illumina Infinium probes and to 27,053 probes of the Affymetrix U133 Plus 2.0 platform. Visualization of multi-dimensional data was performed using the SIGMA2 software [27].

To determine the genetic events that caused (or could explain) gene expression status, we first identified a set of overexpressed and underexpressed genes for each cell line sample relative to MCF10A based on differential expression criteria mentioned above. Each cancer sample may have a different number of differentially expressed genes. Second, for each differentially expressed gene in each sample, we examined the copy number status, methylation status, and allelic status. A differential expression was considered "explained" when the observed expression change matched the expected change at the DNA level. If a gene was overexpressed, the causal copy number status would be a gain, DNA methylation status would be hypomethylation, or allelic status would be allelic imbalance. Conversely, if a gene was underexpressed, the causal copy number status would be a loss, DNA methylation status would be hypermethylation, or allelic status would be LOH. From this point forward, when a change in allele status with overexpression is discussed, it will be denoted as allelic imbalance (AI). Conversely, for underexpression, a change in allele status will be denoted as loss of heterozygosity (LOH). While changes in methylation or changes in gene dosage leading to differential expression are more commonly discussed, previous studies have shown that changes in allele status without change in copy number (copy neutral AI or LOH) can also lead to differential gene expression due to preferential allelic expression [28-30].
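The matching rule for an "explained" expression change can be written as a small lookup. The sketch below is illustrative; the status labels and the function name are hypothetical, while the pairing of expression direction with expected DNA level change follows the paragraph above.

```python
# Expected DNA-level state for each direction of expression change.
EXPECTED = {
    "over":  {"copy_number": "gain", "methylation": "hypo",  "allelic": "AI"},
    "under": {"copy_number": "loss", "methylation": "hyper", "allelic": "LOH"},
}


def explaining_dimensions(expression, dna_status):
    """Return the DNA dimensions whose state matches the expression change.

    dna_status maps dimension name -> observed state in one sample.
    An empty list means the change is not explained at the DNA level.
    """
    expected = EXPECTED.get(expression, {})
    return [dim for dim, state in expected.items()
            if dna_status.get(dim) == state]
```

A gene is then "explained" in a sample whenever the returned list is non-empty.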

3.2.4 Multiple concerted disruption (MCD) analysis

To identify likely key nodes in pathways and functions, we hypothesize that, in addition to being altered frequently (by one mechanism or multiple mechanisms), these genes also exhibit multiple concerted disruption (MCD) in a given sample. That is, they show a congruent change in gene copy number (gain or loss) accompanied by allelic imbalance and change in DNA methylation (hypomethylation or hypermethylation), resulting in a change in gene expression (over or underexpression). Moreover, MCD events would be used as a screening approach similar to gene amplifications (multi-copy increases) or homozygous deletions, whereby the expectation is that these events would occur at a lower frequency than disruptions through one mechanism alone and observation of these events would signify the importance of the genes in question.

In this study, the MCD strategy can be broken down into four sequential steps. First, using a pre-defined frequency threshold, we identify a set of the most frequently differentially expressed genes. Second, we identify the most frequently differentially expressed genes from step 1 whose expression change is frequently associated with concerted change in at least one DNA dimension (either DNA copy number, DNA methylation or allelic status) within the same sample. Next, we further refine this subset of genes from step 2 by selecting those having concerted change in all dimensions in the same sample which we term as MCD. Finally, we introduce an additional level of stringency by requiring a minimum frequency of MCD in the given cohort. At the end of the process, we identify a small subset of genes which exhibit disruption through multiple mechanisms and show consequential change in gene expression.
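The four steps above can be sketched as a single filter over per-gene, per-sample status calls. This is a minimal illustration under stated assumptions: the data layout, status labels, function names, and the three threshold parameters are hypothetical, and steps 3 and 4 are folded into one minimum count of MCD samples.

```python
EXPECTED = {
    "over":  {"copy_number": "gain", "methylation": "hypo",  "allelic": "AI"},
    "under": {"copy_number": "loss", "methylation": "hyper", "allelic": "LOH"},
}
N_DIMS = 3  # copy number, DNA methylation, allelic status


def n_concordant(sample):
    """Number of DNA dimensions whose state matches the expression change."""
    expected = EXPECTED.get(sample["expression"], {})
    return sum(sample.get(dim) == state for dim, state in expected.items())


def mcd_genes(cohort, min_de, min_explained, min_mcd):
    """cohort maps gene -> list of per-sample status dictionaries."""
    selected = []
    for gene, samples in cohort.items():
        # Step 1: samples in which the gene is differentially expressed.
        de = sum(s["expression"] in EXPECTED for s in samples)
        # Step 2: samples with concerted change in at least one dimension.
        explained = sum(n_concordant(s) >= 1 for s in samples)
        # Steps 3 and 4: samples with concerted change in all dimensions.
        mcd = sum(n_concordant(s) == N_DIMS for s in samples)
        if de >= min_de and explained >= min_explained and mcd >= min_mcd:
            selected.append(gene)
    return selected
```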

3.2.5 Simulated data analysis

Using the status of DNA alteration and expression for every gene in every sample, data within each sample were shuffled and randomized ten times to create ten simulated datasets. Each dataset was analyzed for overall disruption frequency and MCD and all results were then aggregated to determine the frequency distribution of different thresholds observed in the randomized data analysis.
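One way to generate such simulated datasets is sketched below. The text does not specify the shuffling scheme, so the following is an assumption: within each sample, the status calls of each dimension are permuted independently across genes, which preserves the per-sample number of alterations while breaking the link between dimensions. The function names, data layout, and seed are hypothetical.

```python
import random


def shuffle_sample(sample, rng):
    """sample maps dimension -> list of per-gene statuses (shared gene order).

    Each dimension is permuted independently across genes (assumed scheme).
    """
    shuffled = {}
    for dim, statuses in sample.items():
        permuted = list(statuses)  # copy so the observed data are untouched
        rng.shuffle(permuted)
        shuffled[dim] = permuted
    return shuffled


def simulate(dataset, n_sim=10, seed=0):
    """dataset maps sample name -> sample; returns n_sim shuffled datasets."""
    rng = random.Random(seed)
    return [{name: shuffle_sample(sample, rng) for name, sample in dataset.items()}
            for _ in range(n_sim)]
```

Each simulated dataset can then be passed through the same disruption frequency and MCD analysis as the observed data.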

3.2.6 Pathway enrichment analysis

For pathway analysis, Ingenuity Pathway Analysis software was used (Ingenuity Systems, CA, USA). Specifically, the core and comparison analyses were used, with focus on canonical signaling pathways. Briefly, for a given function or pathway, statistical significance of pathway enrichment is calculated using a right-tailed Fisher's exact test based on the number of genes annotated, number of genes represented in the input dataset, and the total number of genes being assessed in the experiment. A pathway was deemed significant if the p-value of enrichment was ≤ 0.05 (adjusted for multiple comparisons using a Benjamini-Hochberg correction).
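For readers unfamiliar with the test, a right-tailed Fisher's exact test on a 2x2 enrichment table is equivalent to the upper tail of a hypergeometric distribution. The sketch below computes that tail and a Benjamini-Hochberg adjustment using only the Python standard library. It illustrates the statistics and is not the Ingenuity implementation; the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_p(hits, list_size, pathway_size, universe):
    """P(X >= hits) when list_size genes are drawn from a universe in which
    pathway_size genes are annotated to the pathway."""
    upper = min(list_size, pathway_size)
    tail = sum(comb(pathway_size, k) * comb(universe - pathway_size, list_size - k)
               for k in range(hits, upper + 1))
    return tail / comb(universe, list_size)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A pathway would then be called significant when its adjusted p-value is ≤ 0.05.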

3.2.7 Survival and differential gene expression analysis in publicly available datasets

For survival analysis, Kaplan-Meier analysis was performed using the statistical toolbox in Matlab (Mathworks). For each gene, the expression data were sorted from lowest to highest expression across the sample set and survival times were compared between the top 1/3 and bottom 1/3 of the samples. Two publicly available gene expression microarray datasets with survival data were utilized for this analysis [4, 31]. For the Sorlie et al dataset, individuals whose cause of death was not breast cancer were excluded from the analysis and missing data due to quality control issues were filled using the knn method in the “impute” package in Bioconductor [32]. Of the 23 genes selected by our MCD analysis (see Results), 17 were represented in either dataset. Survival distributions were compared using a log rank test and two-tailed p-values unadjusted for multiple comparisons were reported.

Subsequently, these 17 genes were further evaluated for differential expression in publicly available expression datasets of clinical breast cancer samples using the Oncomine database [33].
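The tertile comparison can be illustrated with a short standard-library sketch. The analysis in this chapter was run in Matlab; the Python below is a stand-in that shows the top-third versus bottom-third split and a two-group log rank test with a chi-square approximation (1 degree of freedom). The function names are hypothetical, and events are coded 1 for death from disease and 0 for censored.

```python
from math import erfc, sqrt


def tertile_groups(expression):
    """Return (bottom third, top third) sample indices sorted by expression."""
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    k = len(order) // 3
    return order[:k], order[len(order) - k:]


def log_rank_p(times_a, events_a, times_b, events_b):
    """Two-tailed log rank p-value for two groups."""
    subjects = ([(t, e, 0) for t, e in zip(times_a, events_a)]
                + [(t, e, 1) for t, e in zip(times_b, events_b)])
    observed = expected = variance = 0.0
    for t in sorted({s[0] for s in subjects if s[1]}):
        at_risk = [s for s in subjects if s[0] >= t]
        n = len(at_risk)
        n_a = sum(1 for s in at_risk if s[2] == 0)
        d = sum(1 for s in at_risk if s[0] == t and s[1])
        d_a = sum(1 for s in at_risk if s[0] == t and s[1] and s[2] == 0)
        observed += d_a
        expected += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed - expected) ** 2 / variance
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 d.f.
```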

3.3 Results and discussion

3.3.1 Analysis of individual genomic dimensions

When examining each genomic dimension alone, we see that many of the common features identified are consistent with the current knowledge of breast cancer genomes, for example, previously reported chromosomal regions of frequent copy number gain, segmental loss and loss of heterozygosity (LOH) / allelic imbalance (AI) (Figure 3.1a) [6, 8, 11, 12, 34]. While many regions of frequent LOH/AI do overlap with regions of copy number change, others are in regions of neutral copy number. Key genes implicated in breast cancer reside in these specific regions and are altered as expected (Figure 3.1b).

3.3.2 Multi-dimensional analysis (MDA) reveals a higher proportion of intra-sample deregulated gene expression can be explained when more dimensions are analyzed

The impact of integrative, multi-dimensional analysis on gene discovery is observed at two levels: (i) within an individual sample as well as (ii) across a set of samples. Within a given sample, we see that by sequentially examining more genomic dimensions at the DNA level, i.e. gene dosage, allelic status, and DNA methylation, we can explain a higher proportion of the differential gene expression changes observed. Interestingly, although this proportion may vary between samples, it always increases with every additional dimension examined (Figure 3.2a). For example, in HCC1395, a single genomic dimension alone can explain as much as 64.4% of overexpression but when using all three DNA based dimensions, whereby gene overexpression can be explained by disruption at the DNA level in at least one dimension, as much as 75.7% of aberrant overexpression can be explained. Similarly, in HCC1937, an increase from 56.9% to 74.7% explainable underexpression is observed when moving from one to three genomic dimensions respectively. Conversely, in HCC2218, only 44% of overexpression and 36% of underexpression can be explained when using all three DNA dimensions. This suggests that the majority of differential expression in sample HCC2218 is most likely a result of complex gene-gene trans-regulation and consequently, highlights the individual differences between samples.
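The sequential calculation behind these proportions can be sketched as follows. The function name and input layout are hypothetical: each differentially expressed gene in one sample is represented by the set of DNA dimensions that explain it.

```python
def cumulative_explained(genes, dimension_order):
    """Fraction of genes explained after adding each dimension in turn.

    genes is a list of sets; each set holds the DNA dimensions that explain
    one differentially expressed gene (empty set if none does).
    """
    fractions = []
    allowed = set()
    for dim in dimension_order:
        allowed.add(dim)
        explained = sum(1 for dims in genes if dims & allowed)
        fractions.append(explained / len(genes))
    return fractions
```

Because a gene counts as explained when any allowed dimension matches, the fraction computed this way can never decrease as dimensions are added; how much it rises depends on how many genes are explained only by the newly added dimension.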

3.3.3 MDA reveals genes are disrupted at higher frequencies when examining multiple dimensions as compared to any single dimension alone

When considering a sample set as a whole, we see that analysis of multiple genomic dimensions leads to the discovery of more disrupted genes than would be detected using a single dimension of analysis alone. For each identified gene, we gain insight into how multiple mechanisms are complementary in gene disruption (Figure 3.2b). For example, the tumor suppressor gene caspase 1 (CASP1) has been thought to be deactivated through DNA hypermethylation in multiple cancer types [35, 36]. The gene is underexpressed in all nine cases examined in this study. In a subset of these cases, the observed underexpression can be attributed to copy number loss. Interestingly, in the remaining cases, DNA hypermethylation and copy neutral LOH are observed. Similarly, in another example, GNAS is differentially expressed in all nine cases, with a subset of cases showing concerted copy number change while the remaining cases reveal concerted change in DNA methylation. Notably, our conclusion is supported by recent studies of glioblastoma that also showed higher than expected disruption frequencies of specific genes when multiple genomic dimensions were analyzed [37, 38]. These examples illustrate how deregulated genes can be detected in more cases when multiple, but complementary, approaches are used.

Until very recently, multi-dimensional genomic analysis typically represented the parallel examination of gene dosage and gene expression. To demonstrate the power of examining multiple dimensions, we examine the frequency of gene expression deregulation explained by congruent alteration at the DNA level. Briefly, for each gene, a sample is determined to have a DNA explained gene expression change if any of the following criteria are met: gene overexpression should be accompanied by either (i) copy number gain, (ii) copy neutral allelic imbalance, or (iii) hypomethylation, and gene underexpression should be accompanied by either (i) copy number loss, (ii) copy neutral LOH, or (iii) hypermethylation.

To determine an appropriate frequency of disruption threshold, ten random, simulated datasets were generated and a distribution plot was generated for all of the observed frequencies from 0/9 to 9/9 across all simulations (Figure 3.3a). The proportion of observed frequencies ≥ 5/9 was 0.086 but for ≥ 6/9, the proportion was 0.020. Thus, since 6/9 was the first threshold at which this proportion was ≤ 0.05, 6/9 was used for further analysis. Using this threshold, we found that 437 differentially expressed genes have a corresponding change in gene dosage. Scaling this approach to examining the whole genome at multiple dimensions, we anticipate identifying more disrupted genes. When we added the remaining dimensions to account for differential expression, at the same frequency cut-off, we identified the mechanism of disruption for 1162 deregulated genes (Figure 3.3b, Additional File 4).
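The threshold selection described above amounts to finding the smallest frequency whose upper-tail proportion in the simulated data falls at or below 0.05. A minimal sketch, with a hypothetical function name and input layout, is:

```python
def choose_threshold(frequency_counts, alpha=0.05):
    """frequency_counts[k] = number of simulated genes disrupted in k samples.

    Return the smallest k whose upper-tail proportion is <= alpha,
    or None if no such k exists.
    """
    total = sum(frequency_counts)
    for k in range(len(frequency_counts)):
        tail = sum(frequency_counts[k:]) / total
        if tail <= alpha:
            return k
    return None


# Hypothetical counts for frequencies 0/9 to 9/9, chosen only so that the
# tails match the two proportions quoted in the text (0.086 and 0.020).
example_counts = [200, 300, 250, 114, 50, 66, 15, 4, 1, 0]
```

With these illustrative counts the tail proportion is 0.086 at 5/9 and 0.020 at 6/9, so the function selects 6, matching the threshold used in the text.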

The impact of multi-dimensional integrative analysis on cancer gene discovery is the enhanced detection of genes which are disrupted by multiple mechanisms but at lower frequencies for individual mechanisms. Collectively, the detection of gene dosage, allelic conversion and change in methylation status enables the identification of such genes as frequently disrupted.

Using the list of 1162 genes, the distributions of alteration frequencies for each genomic dimension or combination of dimensions were assessed (Figure 3.4a). Examining the median frequencies in each box plot, there is a sequential increase in the median as more dimensions are examined. This point can be further validated using specific genes. For example, the CD70 and ENG genes are underexpressed in the majority of samples. Using copy number analysis alone, the observed frequency of disruption (loss and underexpression) is 44% and 22% respectively. If we then examine the methylation status, in the remaining cases not explained by DNA copy number, we observe an additional 33% of cases exhibiting hypermethylation and underexpression for ENG (red) and 22% for CD70 (blue). Finally, when we also examine allelic status, we observe an additional 22% of cases with copy neutral LOH and gene underexpression for CD70 and 11% for ENG. In total, by using all three dimensions, the cumulative frequency of disruption is 88% for CD70 and 77% for ENG (Figure 3.4b). This example demonstrates the utility of a multi-dimensional approach to elucidate events which would escape conventional single dimensional analysis.

3.3.4 MDA identifies significantly enriched cancer related pathways

Using the set of 1162 genes identified by MDA (Additional File 4) and the corresponding lists of genes identified from each of the simulated datasets, pathway analyses were performed with Ingenuity Pathway Analysis. In the pathway analysis of the MDA genes, focusing only on canonical signaling pathways, 53 pathways were significantly enriched at a Benjamini-Hochberg corrected p-value of 0.05 (Additional File 5). In contrast, using the gene lists from the 10 simulated datasets, nine of the 10 pathway analyses yielded no significantly enriched pathways at the same threshold, and the remaining analysis yielded one significant pathway. Similar results were obtained using the publicly available GATHER database [39] (Additional File 6). Specific pathways involved in breast cancer, ovarian cancer, and prostate cancer were amongst those identified as most significant (Figure 3.5). Consequently, these results suggest that the genes identified using MDA have a high degree of biological relevance.
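For readers unfamiliar with the multiple testing correction used here, the Benjamini-Hochberg adjustment can be sketched as follows. This is a generic illustration of the procedure rather than the code used by Ingenuity Pathway Analysis; the pathway names and raw p-values are hypothetical.

```python
from typing import Dict


def benjamini_hochberg(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Return Benjamini-Hochberg adjusted p-values keyed by pathway name."""
    m = len(pvalues)
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    adjusted: Dict[str, float] = {}
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values are monotone.
    for rank in range(m, 0, -1):
        name, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[name] = running_min
    return adjusted


# Hypothetical raw enrichment p-values for four pathways.
raw = {"pathway_A": 0.001, "pathway_B": 0.02, "pathway_C": 0.04, "pathway_D": 0.5}
adjusted = benjamini_hochberg(raw)
significant = sorted(name for name, q in adjusted.items() if q <= 0.05)
print(significant)
```

In this made-up example one pathway has a raw p-value below 0.05 but no longer passes after correction, which is the behaviour that keeps random gene lists, such as those from the simulated datasets, from producing many spurious pathways.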

3.3.5 MDA of the Neuregulin signaling pathway reveals a complex pattern of deregulation

Among the 53 pathways which were statistically over-represented in our list of 1162 genes is the neuregulin pathway. This pathway contains the well known breast cancer oncogene ERBB2 as well as other genes known to be affected in breast and other cancers [40-43]. Examining the components of this pathway, we observe that some genes are commonly altered across our sample set while others are infrequently altered, that multiple patterns of genomic alteration are involved, and that some genes behave oppositely in different samples (Figure 3.6).

While genes such as HRAS (down), BAD (down), HSP90AB1 (up), SOS2 (up) and RPS6KB1 (up) generally exhibit consistent differential expression with concerted change at the DNA level across our sample set, genes such as GRB7, PTEN, and MAP2K1 exhibit both overexpression and underexpression, with concerted DNA change, in different samples. For example, if we examine PTEN, we observe copy number loss, LOH, DNA hypermethylation and consequent underexpression in HCC1395, while HCC1008 contains copy number gain with DNA hypomethylation and consequent overexpression (Figure 3.7). The impact of such a difference on downstream targets was recently shown in a breast cancer study where AKT and mTOR phosphorylation were higher in cases with low PTEN expression compared to those with high PTEN expression [44]. As this pathway illustrates, although average features across a sample set are important, differences between samples within the same pathway may also play an important role and thus may have consequences for the biology of the tumor.

3.3.6 Genes exhibiting multiple concerted disruption (MCD) - biological and clinical significance

We have demonstrated that we can identify more disrupted genes in a given sample when considering any mechanism of disruption. On the other hand, genes which exhibit multiple concerted disruptions (MCD) across all DNA dimensions (i.e. overexpression of a gene due to increased gene dosage, which led to allelic imbalance, and DNA hypomethylation at the same locus relieving regulation) are likely to have strong biological significance. Likewise, underexpression due to reduced gene copy number, resulting in LOH, and complementary DNA hypermethylation, leading to gene silencing, may also be significant. By employing multiple dimensions of interrogation, genes exhibiting MCD are captured.
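The MCD definition above amounts to a simple per-sample rule, sketched below. The string encoding of the calls, the class name and the function names are assumptions made for illustration only. The two PTEN examples follow the description of HCC1395 and HCC1008 given in Section 3.3.5, where no allelic imbalance is reported for HCC1008.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GeneCalls:
    """Discrete calls for one gene in one sample (hypothetical encoding)."""
    expression: str   # "over", "under" or "none"
    copy_number: str  # "gain", "loss" or "none"
    allelic: str      # "ai_loh" or "none"
    methylation: str  # "hypo", "hyper" or "none"


def has_mcd(calls: GeneCalls) -> bool:
    """True when all three DNA dimensions agree with the expression change."""
    if calls.allelic != "ai_loh":
        return False
    if calls.expression == "over":
        return calls.copy_number == "gain" and calls.methylation == "hypo"
    if calls.expression == "under":
        return calls.copy_number == "loss" and calls.methylation == "hyper"
    return False


def mcd_frequency(per_sample: Iterable[GeneCalls]) -> int:
    """Number of samples in which the gene exhibits MCD."""
    return sum(1 for calls in per_sample if has_mcd(calls))


# PTEN as described in the text: loss, LOH, hypermethylation, underexpression.
pten_hcc1395 = GeneCalls("under", "loss", "ai_loh", "hyper")
# PTEN as described in the text: gain and hypomethylation with overexpression.
pten_hcc1008 = GeneCalls("over", "gain", "none", "hypo")

print(has_mcd(pten_hcc1395), has_mcd(pten_hcc1008))
print(mcd_frequency([pten_hcc1395, pten_hcc1008]))
```

Requiring every dimension to agree makes MCD a much stricter criterion than the "any mechanism" rule used for MDA, which is why it is useful for prioritizing genes rather than for maximizing the number detected.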

To determine what frequency of MCD should be deemed significant, we performed a similar analysis of the 10 simulated datasets described earlier and assessed the proportion of events at each frequency of MCD from 0/9 to 1/9 (Figure 3.8a). By random chance, a gene exhibiting MCD in 1/9 samples would occur 0.3% of the time. Thus, using this threshold of at least one MCD event, 974 genes were identified (Additional File 7). Interestingly, the overlap of the MDA list (1162 genes) with the MCD list (974 genes) yielded 375 genes.
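The way a frequency threshold can be read off the simulated data is sketched below. This is an assumption-laden illustration: the function names, the 5% cut-off and the gene counts (9970 and 30) are hypothetical, chosen only so that the proportions match the 99.7% and 0.3% reported for the simulations.

```python
from collections import Counter
from typing import List, Optional, Sequence


def frequency_proportions(frequencies: Sequence[int], n_samples: int) -> List[float]:
    """Proportion of simulated genes observed at each frequency 0..n_samples."""
    if not frequencies:
        raise ValueError("at least one simulated gene is required")
    counts = Counter(frequencies)
    total = len(frequencies)
    return [counts.get(k, 0) / total for k in range(n_samples + 1)]


def minimal_threshold(proportions: Sequence[float], alpha: float) -> Optional[int]:
    """Smallest frequency k such that P(frequency >= k) <= alpha under the null."""
    tail = 0.0
    best: Optional[int] = None
    for k in range(len(proportions) - 1, -1, -1):
        tail += proportions[k]
        if tail > alpha:
            break
        best = k
    return best


# Hypothetical simulated MCD frequencies: 99.7% of genes at 0/9, 0.3% at 1/9.
simulated = [0] * 9970 + [1] * 30
proportions = frequency_proportions(simulated, n_samples=9)
threshold = minimal_threshold(proportions, alpha=0.05)
print(proportions[:2], threshold)
```

With these proportions the tail probability at one or more MCD events already falls below the assumed cut-off, which corresponds to the threshold of at least one MCD event used in the text.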

The MCD strategy sequentially refines the roster of target genes with the intent of identifying critical genes for tumorigenesis (Additional File 8 or Appendix IV). Genes which exhibit multiple mechanisms of deregulation may, for example, represent important nodes in pathways, such as hub proteins [45], whereby disruption of the gene has an effect on multiple downstream targets, or genes with biological and/or clinical relevance. Thus, although these genes may not be affected at a high frequency across the sample set, their disruption at multiple levels in individual samples would signify importance in tumorigenesis. As shown earlier, 375 genes were identified by both MDA and MCD. If we further employ a criterion of frequent MCD, whereby this event occurs in 4/9 of cases (signifying high recurrence), we detect 23 genes (Additional File 8 or Appendix IV). Among the 23 genes identified are TUSC3 (8p22), ELK3 (12q23), and CCNA1 (13q12.3-q13).

TUSC3 resides at 8p22, a locus frequently deleted across multiple epithelial cancers [46-49]. ELK3 is an ETS domain transcription factor which, in mice, acts as a transcriptional inhibitor in the absence of RAS, but is a transcriptional activator in the presence of RAS [50]. Recently, ELK3 was shown to be underexpressed in a panel of breast cancer lines as well as clinical breast tumor specimens [51]. CCNA1 was shown to be hypermethylated in multiple cancer types, including breast cancer [52].

To validate the relevance of the 23 MCD genes in clinical breast cancer samples, we evaluated gene expression levels associated with survival and examined multiple publicly available microarray datasets using the Oncomine database [33]. Of these 23 genes, 17 were represented in either the van de Vijver et al or Sorlie et al datasets. Interestingly, eight of these genes demonstrated a statistically significant association with patient survival in at least one of the two independent datasets (Additional File 9 or Appendix V, Additional File 10 or Appendix VI) [4, 31]. Moreover, when comparing the percentage of survival-associated genes (8/17, 47.1%) in the MCD gene list with what was expected without pre-selection (27.1%), the increased percentage was statistically significant based on the binomial test (p = 0.041).

To further evaluate the clinical significance of these genes, we utilized the Oncomine database (Additional File 9 or Appendix V). One caveat of the Oncomine analysis is that it may not detect all low levels of differential expression. TUSC3 is shown as an example of one of the genes whose expression correlates with survival (Additional File 8 or Appendix IV, see Methods). Notably, in ovarian cancer, TUSC3, in conjunction with EFA6R, also correlated with poor survival [53]. The observations that TUSC3 is altered frequently by multiple mechanisms at the DNA and RNA level and shows a strong association with patient survival highlight the use of MCD in systematically identifying biologically, and potentially clinically, relevant genes.

3.3.7 Association of genes exhibiting MCD and triple negative breast cancers (TNBC)

In this study, the majority of samples used (5/9) were of the triple negative subtype of breast cancer, a subtype which is estrogen receptor (ER) negative, progesterone receptor (PR) negative, and HER2 negative and represents between 10% and 20% of all diagnosed breast malignancies [54-57]. Genomic analyses of triple negative breast cancers (TNBCs) have been previously performed [58-61] and revealed a heterogeneous and complex view of this breast cancer subtype. A recent study, however, implicated fibroblast growth factor receptor 2 (FGFR2) as a novel therapeutic target amplified in TNBCs [57]. Interestingly, from a meta-analysis of array CGH data, this gene was found to be amplified in 4% of TNBC cases [57]. Thus, we assessed the status of FGFR2 and its downstream targets in our multi-dimensional dataset.

While FGFR2 is not amplified in any of the five TNBC cell lines, all of the five cell lines showed overexpression of FGFR2 with one of the cell lines exhibiting a low level gain of a region encompassing FGFR2 (HCC1937). From this analysis, within the sample set of TNBC cell lines, though FGFR2 is overexpressed, it was not frequently associated with DNA level alterations.

However, examining downstream targets of FGFR2 revealed a striking finding. Using the knowledge database of Ingenuity Pathway Analysis, one of the downstream components affected at the expression level, which was also on both the MDA (Additional File 4) and MCD (Additional File 7) lists, was COL1A1. Remarkably, of the five TNBC cell lines, four exhibited DNA alteration associated overexpression of COL1A1 with two lines exhibiting MCD at COL1A1 and two other lines having DNA copy number associated overexpression. The remaining line exhibited DNA copy number associated overexpression of FGFR2 (Figure 3.8b). Hence, every TNBC line was affected at either FGFR2 or COL1A1 at both the DNA and RNA levels.

Interestingly, COL1A1 has been shown to be both prognostic and predictive in multiple cancer types, including breast cancer [3, 5, 62, 63].

3.4 Conclusions

In conclusion, we have demonstrated that a multi-dimensional genomic approach is superior to analysis of one or two genomic dimensions alone. Each additional genomic dimension surveyed increases the amount of aberrant gene expression that can be explained within individual samples. As a by-product, when examining across a sample set, multi-dimensional genomic analysis can identify relevant genes that may be overlooked due to low frequencies of disruption by the individual mechanisms. The increased frequency of gene disruption detected, due to the consideration of multiple mechanisms of disruption, could potentially reduce the sample size of study cohort needed for gene discovery.

Secondly, while the increased detection of genes disrupted using multi-dimensional analysis is useful for achieving a more comprehensive identification of deregulated pathways and gene networks, it also presents a challenge in prioritizing which genes are likely key nodes or hubs in the affected pathways and networks. Hence, one way to prioritize is to identify genes with evidence of multiple concerted disruption. The Knudson two-hit hypothesis suggests that tumor suppressor genes require two allelic hits to disrupt gene function. Bi-allelic alteration, such as homozygous deletion, or concerted genetic and epigenetic changes, are well documented causal mechanisms of gene disruption. Likewise, hypomethylation and increased gene dosage are known mechanisms for gene overexpression. The bi-allelic disruption phenomenon (leading to loss or gain of function) provides a means to identify causative genes; hence, parallel analysis of the genome and epigenome in the same tumor is of great benefit. In this study, we have developed a stepwise gene selection strategy to identify multiple concerted disruptions using an integrative genomics approach.

In this study, three DNA dimensions, for which affordable high throughput assays are currently available, were examined. However, we envision that new techniques for analysis of additional aspects such as histone modification states and gene mutation status will reveal mechanisms that would explain even more gene expression changes within individual samples. The identification of a number of key cancer-related genes and pathways using a relatively small sample size suggests that the need for large sample sizes to identify relevant genes and pathways may be circumvented by our comprehensive approach. Consequently, this concept can be extended to current technologies such as high throughput sequencing, where it may prove more prudent to perform the analysis in multiple dimensions in a smaller number of samples rather than in one dimension in many more samples at a comparable cost. Finally, the observation that the same gene in a given pathway can be deregulated in a completely different manner between samples highlights one of the shortcomings of group-based analysis and the eventual need to move to systems analysis of tumors as individual entities.

[Figure 3.1: (a) frequency (0 to 1) of copy number gain, copy number loss, LOH and copy number neutral LOH plotted against genomic position (Gbp), with ESR1, BRCA2, TP53, ERBB2 and BRCA1 marked; (b) copy number and LOH status of the same five genes in each of the nine cell lines.]

Figure 3.1. Genomic profiles of breast cancer cell lines. (a) Whole genome frequency analysis of copy number gain (red), copy number loss (green), loss of heterozygosity/allelic imbalance (AI) (top blue) and copy number neutral LOH/AI (bottom blue). Vertical lines through all four graphs represent the genomic location of key breast cancer genes, using the hg18 build of the human genome map. (b) Illustration of copy number and LOH/AI status for ESR1, BRCA1, BRCA2, ERBB2 and TP53 in each of the samples. Each of these DNA events is evident in all of these genes.

Figure 3.2. Quantitative and qualitative benefits of integrative analyses. (a) Heatmap and bar plot illustration of the additive benefit of multi-dimensional DNA analysis for the explanation of consequential differential gene expression. Within a sample, when sequentially adding a DNA dimension of analysis, an increasing percentage of observed differential gene expression can be explained. For each dimension or combination of dimensions, in the bar plot, the median value is used (grey bars). Heatmaps display the percentage of differential expression explained by DNA mechanisms, with values near to 100 either dark red (overexpression) or green (underexpression) and values closer to 0 in white. (b) Two specific genes GNAS and CASP1 are given as examples to show multiple and complementary mechanisms of gene disruption, illustrating the importance of multi-dimensional analysis (MDA).

[Figure 3.2: (a) heatmaps and bar plots of the proportion of differential expression explained in each cell line (values below); (b) per-sample status of GNAS and CASP1 for gene expression, DNA copy number, DNA methylation and allelic status.]

Proportion of overexpression explained (Figure 3.2a):

Dimension            HCC38    HCC1008  HCC1143  HCC1395  HCC1599  HCC1937  HCC2218  BT474    MCF7
Hypo                 0.197    0.266    0.176    0.134    0.203    0.171    0.144    0.194    0.180
AI                   0.319    0.325    0.325    0.337    0.215    0.421    0.132    0.271    0.122
CNG                  0.708    0.401    0.372    0.644    0.440    0.464    0.321    0.458    0.500
CNG Or Hypo Or AI    0.821    0.686    0.655    0.757    0.612    0.743    0.435    0.679    0.629

Proportion of underexpression explained (Figure 3.2a):

Dimension            HCC38    HCC1008  HCC1143  HCC1395  HCC1599  HCC1937  HCC2218  BT474    MCF7
Hyper                0.103    0.062    0.126    0.236    0.145    0.161    0.183    0.172    0.166
LOH                  0.425    0.512    0.516    0.523    0.316    0.569    0.197    0.348    0.226
CNL                  0.367    0.584    0.573    0.499    0.408    0.473    0.203    0.549    0.419
CNL Or Hyper Or LOH  0.522    0.705    0.790    0.702    0.562    0.747    0.363    0.721    0.558

[Figure 3.3: (a) proportion of genes in random simulations at each disruption frequency (0 to 9); (b) number of genes at the 6/9 cut-off for copy number (CN), methylation (Meth), AI/LOH, and their combination, in experimental and simulated data.]

Figure 3.3. Determination and application of a disruption frequency threshold. (a) Results of the analyses of ten simulated datasets. Aggregating the results of the simulated analyses, the proportion of random simulations at each observed frequency threshold is shown. From these analyses, approximately 2% of the simulations were ≥ 6/9. (b) Using a frequency cut-off of 6/9, the number of genes disrupted at that frequency using a single DNA dimension or a combination of DNA dimensions is shown. With a single dimension alone, we can maximally identify 437 genes which are differentially expressed and exhibit a concerted change at the DNA level in a minimum of 6/9 samples. However, using all three dimensions, we find that 1162 genes are in fact differentially expressed and contain at least one concerted change in one of the DNA dimensions. This represents over a two-fold increase in the number of genes identified.

[Figure 3.4: (a) box plots of disruption frequency (0 to 9) for each single DNA dimension and each combination of dimensions; (b) cumulative frequency (0 to 9) of CD70 and ENG disruption as copy number, DNA methylation and LOH are added, with the frequency threshold marked.]

Figure 3.4. Impact of multi-dimensional analysis on low frequency events. (a) Box plot analysis of the frequency distribution of single and multi-dimensional analyses (MDA) of the 1162 genes differentially expressed with a concerted change in one of the DNA dimensions. The area in red represents the number of genes (of the 1162) that would be missed if only a single DNA dimension was examined, while the area in blue represents the genes that would be detected. Examining the median values for the three right-most boxes, we see that even using the box with the highest median (copy number), we would not be able to detect about 50% of the 1162 genes. (b) Two specific examples highlighting the importance of multi-dimensional genomic analysis. Using single dimensional analyses (green shade) alone, disruption of CD70 (blue line graph) and ENG (red line graph) occurs at low frequencies (44% and 33% respectively). However, when examining two (red shade) or three genomic dimensions (blue shade), the disruption of these genes occurs at very high frequencies, 88% and 77% respectively. The frequency threshold of 6/9 is denoted with a black dotted line.

[Figure 3.5: bar plot of -log(p-value), with the significance threshold marked, for multi-dimensional analysis and simulated datasets across ten pathways: p53 Signaling; Aryl Hydrocarbon Receptor Signaling; PI3K/AKT Signaling; Neuregulin Signaling; Molecular Mechanisms of Cancer; Breast Cancer Regulation by Stathmin1; Ovarian Cancer Signaling; Prostate Cancer Signaling; Cell Cycle: G1/S Checkpoint Regulation; Cell Cycle: G2/M DNA Damage Checkpoint Regulation.]

Figure 3.5. Pathway analysis of the 1162 genes identified by multi-dimensional analysis. Ingenuity Pathway Analysis of the 1162 genes identified by MDA as well as genes meeting the same frequency criteria (6/9) from the analysis of the ten simulated datasets. In total, using the list of 1162 MDA genes, 53 canonical signaling pathways were identified as significant after multiple testing correction using a Benjamini-Hochberg correction (Additional File 5). In contrast, using the same statistical criteria, nine of the 10 simulated datasets yielded no significant pathways with one of the datasets yielding one pathway. In this figure, ten of the most well known, cancer-related pathways are shown. The yellow threshold line represents a Benjamini-Hochberg corrected p-value of 0.05 with bars above that line deemed significant.

[Figure 3.6: diagram of the Neuregulin/ERBB2 signaling pathway with, beside each gene, per-sample (S1 to S9) status tracks for gene expression (GE: over, under), DNA copy number (CN: gain, loss), allelic status (L: LOH) and DNA methylation (M: hypo, hyper).]

Figure 3.6. Complex deregulation of the Neuregulin/ERBB2 signaling pathway. Each gene is color-coded red and green to represent over and underexpression respectively. Genes colored both represent genes which are over and underexpressed in different samples. Beside each gene is the status for gene expression, copy number, LOH/AI and DNA methylation, with the alterations in each dimension colored as per the legend. DNA alterations are only shown when a change in gene expression is observed. It should be noted that LOH can be derived from multiple mechanisms. In this study, we do not distinguish between these mechanisms. Likewise, methylation changes may affect one or both alleles. In this study, we do not distinguish the status of the alleles individually. Genes denoted with * have one sample exhibiting multiple concerted disruption (MCD). Samples are coded as follows: S1 = HCC38, S2 = HCC1008, S3 = HCC1143, S4 = HCC1395, S5 = HCC1599, S6 = HCC1937, S7 = HCC2218, S8 = BT474, and S9 = MCF7.

[Figure 3.7: PTEN DNA methylation (beta value), gene expression (log2 intensity), copy number and allelic status in HCC1008 and in HCC1395, each compared with MCF10A.]

Figure 3.7. Deregulation of PTEN occurs differently between samples. In HCC1008 (left), PTEN is overexpressed with an associated gain in copy number and hypomethylation. Conversely, in HCC1395 (right), PTEN is underexpressed, with an associated loss in copy number, LOH, and DNA hypermethylation. This illustrates how each tumor may behave differently from another.

Figure 3.8. Multiple concerted disruption (MCD) analysis and its application to triple negative breast cancer. (a) Analysis of ten simulated datasets to determine the proportion of random simulations at each observed frequency of MCD. Notably, 99.7% of random simulations had an MCD frequency of 0/9, with the remaining 0.3% at 1/9. Moreover, no simulations showed a frequency ≥ 2/9. Thus, the observation of an MCD event suggests the event is likely non-random. (b) Using the knowledge database of Ingenuity Pathway Analysis, upstream and downstream components of FGFR2 were selected to assess their role in the subset of triple negative breast cancer (TNBC) cell lines. Only components which were shown to have a direct or indirect expression level relationship were selected. Of the seven components identified (four upstream and three downstream of FGFR2), one upstream component (FGF2) and one downstream component (COL1A1) were present in both the MDA list (Additional File 4) and MCD list (Additional File 7). FGF2, colored in green, is shown to be frequently underexpressed while COL1A1, colored in red, is frequently overexpressed. Interestingly, examining FGFR2 and COL1A1, while FGFR2 overexpression is not frequently associated with DNA level alteration, COL1A1 is frequently affected at the DNA level. Moreover, in the five TNBC cell lines examined, four have DNA level alteration of COL1A1 and the remaining line has DNA level alteration of FGFR2.

[Figure 3.8: (a) proportion of random simulations (0 to 1) at each MCD frequency (0 to 9); (b) FGFR2 with upstream and downstream components including TP63, TGFB1, FGF2, RUNX2, IGF2 and COL1A1, and per-sample status of FGFR2 and COL1A1 for gene expression, DNA copy number, DNA methylation and allelic status in HCC38, HCC1008, HCC1143, HCC1599 and HCC1937; * marks samples with MCD.]

3.5 References

1. Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, Gutierrez MC, Elledge R, Mohsin S, Osborne CK, Chamness GC, Allred DC et al: Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer. Lancet 2003, 362(9381):362-369.
2. Coe BP, Chari R, Lockwood WW, Lam WL: Evolving strategies for global gene expression analysis of cancer. J Cell Physiol 2008.
3. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA et al: Molecular portraits of human breast tumours. Nature 2000, 406(6797):747-752.
4. Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS et al: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A 2001, 98(19):10869-10874.
5. van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530-536.
6. Fridlyand J, Snijders AM, Ylstra B, Li H, Olshen A, Segraves R, Dairkee S, Tokuyasu T, Ljung BM, Jain AN et al: Breast tumor copy number aberration phenotypes and genomic instability. BMC Cancer 2006, 6:96.
7. Albertson DG, Ylstra B, Segraves R, Collins C, Dairkee SH, Kowbel D, Kuo WL, Gray JW, Pinkel D: Quantitative mapping of amplicon structure by array CGH identifies CYP24 as a candidate oncogene. Nat Genet 2000, 25(2):144-146.
8. Chin SF, Wang Y, Thorne NP, Teschendorff AE, Pinder SE, Vias M, Naderi A, Roberts I, Barbosa-Morais NL, Garcia MJ et al: Using array-comparative genomic hybridization to define molecular portraits of primary breast cancers. Oncogene 2007, 26(13):1959-1970.
9. Jain AN, Chin K, Borresen-Dale AL, Erikstein BK, Eynstein Lonning P, Kaaresen R, Gray JW: Quantitative analysis of chromosomal CGH in human breast tumors associates copy number abnormalities with p53 status and patient survival. Proc Natl Acad Sci U S A 2001, 98(14):7952-7957.
10. Naylor TL, Greshock J, Wang Y, Colligon T, Yu QC, Clemmer V, Zaks TZ, Weber BL: High resolution genomic analysis of sporadic breast cancer using array-based comparative genomic hybridization. Breast Cancer Res 2005, 7(6):R1186-1198.
11. Shadeo A, Lam WL: Comprehensive copy number profiles of breast cancer cell model genomes. Breast Cancer Res 2006, 8(1):R9.
12. Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo WL, Lapuk A, Neve RM, Qian Z, Ryder T et al: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 2006, 10(6):529-541.
13. Chin SF, Teschendorff AE, Marioni JC, Wang Y, Barbosa-Morais NL, Thorne NP, Costa JL, Pinder SE, van de Wiel MA, Green AR et al: High-resolution aCGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer. Genome Biol 2007, 8(10):R215.
14. Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E, Ringner M, Sauter G, Monni O, Elkahloun A et al: Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Res 2002, 62(21):6240-6245.
15. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A 2002, 99(20):12963-12968.

16. Chari R, Lockwood WW, Coe BP, Chu A, Macey D, Thomson A, Davies JJ, MacAulay C, Lam WL: SIGMA: a system for integrative genomic microarray analysis of cancer genomes. BMC Genomics 2006, 7:324.
17. Ishkanian AS, Malloff CA, Watson SK, DeLeeuw RJ, Chi B, Coe BP, Snijders A, Albertson DG, Pinkel D, Marra MA et al: A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 2004, 36(3):299-303.
18. Lockwood WW, Coe BP, Williams AC, MacAulay C, Lam WL: Whole genome tiling path array CGH analysis of segmental copy number alterations in cervical cancer cell lines. Int J Cancer 2007, 120(2):436-443.
19. Khojasteh M, Lam WL, Ward RK, MacAulay C: A stepwise framework for the normalization of array CGH data. BMC Bioinformatics 2005, 6:274.
20. Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B: Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics 2004, 20(18):3636-3637.
21. Coe BP, Lockwood WW, Girard L, Chari R, Macaulay C, Lam S, Gazdar AF, Minna JD, Lam WL: Differential disruption of cell cycle pathways in small cell and non-small cell lung cancer. Br J Cancer 2006, 94(12):1927-1935.
22. Carvalho B, Bengtsson H, Speed TP, Irizarry RA: Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics 2007, 8(2):485-499.
23. Lin M, Wei LJ, Sellers WR, Lieberfarb M, Wong WH, Li C: dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics 2004, 20(8):1233-1240.
24. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W et al: Global variation in copy number in the human genome. Nature 2006, 444(7118):444-454.
25. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F et al: The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res 2008, 36(Database issue):D773-779.
26. Bibikova M, Lin Z, Zhou L, Chudin E, Garcia EW, Wu B, Doucet D, Thomas NJ, Wang Y, Vollmer E et al: High-throughput DNA methylation profiling using universal bead arrays. Genome Res 2006, 16(3):383-393.
27. Chari R, Coe BP, Wedseltoft C, Benetti M, Wilson IM, Vucic EA, MacAulay C, Ng RT, Lam WL: SIGMA2: a system for the integrative genomic multi-dimensional analysis of cancer genomes, epigenomes, and transcriptomes. BMC Bioinformatics 2008, 9:422.
28. Soh J, Okumura N, Lockwood WW, Yamamoto H, Shigematsu H, Zhang W, Chari R, Shames DS, Tang X, MacAulay C et al: Oncogene mutations, copy number gains and mutant allele specific imbalance (MASI) frequently occur together in tumor cells. PLoS One 2009, 4(10):e7464.
29. Tuna M, Knuutila S, Mills GB: Uniparental disomy in cancer. Trends Mol Med 2009, 15(3):120-128.
30. Yan H, Yuan W, Velculescu VE, Vogelstein B, Kinzler KW: Allelic variation in human gene expression. Science 2002, 297(5584):1143.
31. van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ et al: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 2002, 347(25):1999-2009.
32. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17(6):520-525.
33. Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB, Barrette TR, Anstet MJ, Kincead-Beal C, Kulkarni P et al: Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 2007, 9(2):166-180.
34. Johnson N, Speirs V, Curtin NJ, Hall AG: A comparative study of genome-wide SNP, CGH microarray and protein expression analysis to explore genotypic and phenotypic mechanisms of acquired antiestrogen resistance in breast cancer. Breast Cancer Res Treat 2008, 111(1):55-63.
35. Jee CD, Lee HS, Bae SI, Yang HK, Lee YM, Rho MS, Kim WH: Loss of caspase-1 gene expression in human gastric carcinomas and cell lines. Int J Oncol 2005, 26(5):1265-1271.
36. Ueki T, Takeuchi T, Nishimatsu H, Kajiwara T, Moriyama N, Narita Y, Kawabe K, Ueki K, Kitamura T: Silencing of the caspase-1 gene occurs in murine and human renal cancer cells and causes solid tumor growth in vivo. Int J Cancer 2001, 91(5):673-679.
37. McLendon R, Friedman A, Bigner D, Van Meir EG, Brat DJ, Mastrogianakis M, Olson JJ, Mikkelsen T, Lehman N, Aldape K et al: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008.
38. Parsons DW, Jones S, Zhang X, Lin JC, Leary RJ, Angenendt P, Mankoo P, Carter H, Siu IM, Gallia GL et al: An integrated genomic analysis of human glioblastoma multiforme. Science 2008, 321(5897):1807-1812.
39. Chang JT, Nevins JR: GATHER: a systems approach to interpreting genomic signatures. Bioinformatics 2006, 22(23):2926-2933.
40. Bachman KE, Argani P, Samuels Y, Silliman N, Ptak J, Szabo S, Konishi H, Karakas B, Blair BG, Lin C et al: The PIK3CA gene is mutated with high frequency in human breast cancers. Cancer Biol Ther 2004, 3(8):772-775.
41. Slamon DJ, Godolphin W, Jones LA, Holt JA, Wong SG, Keith DE, Levin WJ, Stuart SG, Udove J, Ullrich A et al: Studies of the HER-2/neu proto-oncogene in human breast and ovarian cancer. Science 1989, 244(4905):707-712.
42. Stein D, Wu J, Fuqua SA, Roonprapunt C, Yajnik V, D'Eustachio P, Moskow JJ, Buchberg AM, Osborne CK, Margolis B: The SH2 domain protein GRB-7 is co-amplified, overexpressed and in a tight complex with HER2 in breast cancer. Embo J 1994, 13(6):1331-1340.
43. Lockwood WW, Chari R, Coe BP, Girard L, Macaulay C, Lam S, Gazdar AF, Minna JD, Lam WL: DNA amplification is a ubiquitous mechanism of oncogene activation in lung and other cancers. Oncogene 2008, 27(33):4615-4624.
44. Stemke-Hale K, Gonzalez-Angulo AM, Lluch A, Neve RM, Kuo WL, Davies M, Carey M, Hu Z, Guan Y, Sahin A et al: An integrative genomic and proteomic analysis of PIK3CA, PTEN, and AKT mutations in breast cancer. Cancer Res 2008, 68(15):6084-6091.
45. Wang E, Lenferink A, O'Connor-McCourt M: Cancer systems biology: exploring cancer-associated genes on cellular networks. Cell Mol Life Sci 2007, 64(14):1752-1762.
46. Bova GS, Carter BS, Bussemakers MJ, Emi M, Fujiwara Y, Kyprianou N, Jacobs SC, Robinson JC, Epstein JI, Walsh PC et al: Homozygous deletion and frequent allelic loss of chromosome 8p22 loci in human prostate cancer. Cancer Res 1993, 53(17):3869-3873.
47. Chinen K, Isomura M, Izawa K, Fujiwara Y, Ohata H, Iwamasa T, Nakamura Y: Isolation of 45 exon-like fragments from 8p22-->p21.3, a region that is commonly deleted in hepatocellular, colorectal, and non-small cell lung carcinomas. Cytogenet Cell Genet 1996, 75(2-3):190-196.
48. Cooke SL, Pole JC, Chin SF, Ellis IO, Caldas C, Edwards PA: High-resolution array CGH clarifies events occurring on 8p in carcinogenesis. BMC Cancer 2008, 8(1):288.
49. Yaremko ML, Recant WM, Westbrook CA: Loss of heterozygosity from the short arm of chromosome 8 is an early event in breast cancers. Genes Cancer 1995, 13(3):186-191.

77 50. Giovane A, Pintzas A, Maira SM, Sobieszczuk P, Wasylyk B: Net, a new ets transcription factor that is activated by Ras. Genes Dev 1994, 8(13):1502-1513. 51. He J, Pan Y, Hu J, Albarracin C, Wu Y, Dai JL: Profile of Ets gene expression in human breast carcinoma. Cancer Biol Ther 2007, 6(1):76-82. 52. Shames DS, Girard L, Gao B, Sato M, Lewis CM, Shivapurkar N, Jiang A, Perou CM, Kim YH, Pollack JR et al: A genome-wide screen for promoter methylation in lung cancer identifies novel methylation markers for multiple malignancies. PLoS Med 2006, 3(12):e486. 53. Pils D, Horak P, Gleiss A, Sax C, Fabjani G, Moebus VJ, Zielinski C, Reinthaller A, Zeillinger R, Krainer M: Five genes from chromosomal band 8p22 are significantly down-regulated in ovarian carcinoma: N33 and EFA6R have a potential impact on overall survival. Cancer 2005, 104(11):2417-2429. 54. Cheang MC, Voduc D, Bajdik C, Leung S, McKinney S, Chia SK, Perou CM, Nielsen TO: Basal-like breast cancer defined by five biomarkers has superior prognostic value than triple-negative phenotype. Clin Cancer Res 2008, 14(5):1368-1376. 55. Gluz O, Liedtke C, Gottschalk N, Pusztai L, Nitz U, Harbeck N: Triple-negative breast cancer--current status and future directions. Ann Oncol 2009, 20(12):1913-1927. 56. Rakha EA, El-Sayed ME, Green AR, Lee AH, Robertson JF, Ellis IO: Prognostic markers in triple-negative breast cancer. Cancer 2007, 109(1):25-32. 57. Turner N, Lambros MB, Horlings HM, Pearson A, Sharpe R, Natrajan R, Geyer FC, van Kouwenhove M, Kreike B, Mackay A et al: Integrative molecular profiling of triple negative breast cancers identifies amplicon drivers and potential therapeutic targets. Oncogene 2010. 58. Andre F, Job B, Dessen P, Tordai A, Michiels S, Liedtke C, Richon C, Yan K, Wang B, Vassal G et al: Molecular characterization of breast cancer with high-resolution oligonucleotide comparative genomic hybridization array. Clin Cancer Res 2009, 15(2):441-451. 59. 
Bertucci F, Finetti P, Cervera N, Esterni B, Hermitte F, Viens P, Birnbaum D: How basal are triple-negative breast cancers? Int J Cancer 2008, 123(1):236-240. 60. Han W, Jung EM, Cho J, Lee JW, Hwang KT, Yang SJ, Kang JJ, Bae JY, Jeon YK, Park IA et al: DNA copy number alterations and expression of relevant genes in triple- negative breast cancer. Genes Chromosomes Cancer 2008, 47(6):490-499. 61. Kreike B, van Kouwenhove M, Horlings H, Weigelt B, Peterse H, Bartelink H, van de Vijver MJ: Gene expression profiling and histopathological characterization of triple-negative/basal-like breast carcinomas. Breast Cancer Res 2007, 9(5):R65. 62. Ramaswamy S, Ross KN, Lander ES, Golub TR: A molecular signature of metastasis in primary solid tumors. Nat Genet 2003, 33(1):49-54. 63. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J et al: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671-679. 64. Richardson AL, Wang ZC, De Nicolo A, Lu X, Brown M, Miron A, Liao X, Iglehart JD, Livingston DM, Ganesan S: X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell 2006, 9(2):121-132. 65. Radvanyi L, Singh-Sandhu D, Gallichan S, Lovitt C, Pedyczak A, Mallo G, Gish K, Kwok K, Hanna W, Zubovits J et al: The gene associated with trichorhinophalangeal syndrome in humans is overexpressed in breast cancer. Proc Natl Acad Sci U S A 2005, 102(31):11005-11010. 66. Finak G, Bertos N, Pepin F, Sadekova S, Souleimanova M, Zhao H, Chen H, Omeroglu G, Meterissian S, Omeroglu A et al: Stromal gene expression predicts clinical outcome in breast cancer. Nat Med 2008, 14(5):518-527. 67. Karnoub AE, Dash AB, Vo AP, Sullivan A, Brooks MW, Bell GW, Richardson AL, Polyak K, Tubo R, Weinberg RA: Mesenchymal stem cells within tumour stroma promote breast cancer metastasis. Nature 2007, 449(7162):557-563. 
Chapter 4: Uniparental disomy is a prevalent genetic mechanism of oncogene disruption in lung adenocarcinoma³

³ A version of this chapter will be submitted for publication with the following author list: Chari R, Lockwood WW, Soh J, Coe BP, Tam K, MacAulay CE, Minna JD, Lam S, Gazdar AF, Lam WL. (2010) Uniparental disomy is a prevalent genetic mechanism of oncogene disruption in lung adenocarcinoma.

4.1 Introduction

Genetic alterations play a significant role in a variety of malignancies [1, 2]. Typically, these alterations have been described as either changes in gene dosage (DNA copy number) or somatic mutations: total copy number gain or activating mutation of oncogenes, and total copy number loss or inactivating mutation of tumor suppressor genes. Loss of heterozygosity is also a common alteration, whereby one allele is lost, often resulting in a loss of total copy number. However, there are instances in which one allele is lost but the remaining allele is duplicated, resulting in no net change in copy number; this is termed copy neutral loss of heterozygosity or somatic uniparental disomy (UPD).

Although somatic UPD had been shown previously in malignancies such as retinoblastoma [3], recent studies have shown an increased prominence of this alteration [4]. This has largely been a result of advances in the technology used to detect somatic UPD and in the methodologies used to define it [5, 6]. Moreover, frequent regions of somatic UPD have been identified in many different cancer types such as colorectal cancer [7, 8], lymphoma [9, 10], myelodysplastic syndrome (MDS) [11-13], basal cell carcinoma [14], hepatoblastoma [15], and ovarian cancer [16]. In addition, while the target genes of some of these regions are tumor suppressors such as RB1 and TP53, where the gene is likely mutated, oncogenes have also been identified as targets. For example, mutation with somatic UPD has been observed at loci such as JAK2 [6, 17], CBL [12, 18], and FLT3 [19] in hematological malignancies. However, such associations have been limited in epithelial malignancies.

Recently, we illustrated the concept of mutant allele specific imbalance (MASI) in lung cancer [20]. It was found that a highly activated state for EGFR and KRAS is achieved through either copy number amplification of the mutated allele, for EGFR, or UPD of the mutated allele, for KRAS. Given the observed frequency of UPD at KRAS, we sought to assess the impact and prevalence of UPD in the lung adenocarcinoma genome. Strikingly, we found that the amount of the genome frequently affected by UPD was comparable to that affected by copy number gain and loss. When examining major oncogenes and tumor suppressor genes, while most oncogenes were associated with frequent areas of gain, we found a subset of both known and novel oncogenes that were frequently affected by UPD. Finally, examining oncogenes with homozygous mutation in multiple cancer types, we observed frequent UPD at these genes, suggesting that this mechanism of oncogene activation is prevalent across multiple cancer types.

4.2 Methods

4.2.1 Genome wide profiling of clinical lung adenocarcinoma specimens

Forty-six lung adenocarcinoma cases were obtained from Vancouver General Hospital with ethics approval. Cases were reviewed by a pathologist and tumors were microdissected to ensure maximal tumor cell content (≥ 70%). Five hundred nanograms of genomic DNA extracted from each tumor and its adjacent non-malignant tissue were prepared and hybridized to the Affymetrix Genome-Wide Human SNP 6.0 array platform as per the manufacturer's instructions.

CEL files, the raw data files generated, were then processed using the Affymetrix Genotyping Console version 3.0.2 to generate .chp files using the Birdseed v2 genotyping algorithm.

4.2.2 Determination of regions of uniparental disomy (UPD) in clinical lung tumors

CEL files and .chp files were imported into Partek Genomics Suite (PGS) using the software's recommended default settings. First, to determine total copy number, paired copy number intensities were calculated for each sample using the intensity in the tumor vs. its matched non-malignant sample. Paired copy number intensities were then analyzed using the Genomic Segmentation method in PGS with all parameters run at default except for the number of markers, which was set to 50. Subsequently, allele specific copy number (ASCN) analysis was used to determine regions of allelic imbalance. A region was deemed imbalanced if the imbalance proportion was ≥ 0.15 (as recommended by PGS). Finally, a region was called UPD if the region was imbalanced and no change in total copy number was present.
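The calling rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Partek Genomics Suite implementation: the `Segment` class, its field names, and the `copy_state` labels are hypothetical, and only the 0.15 imbalance threshold and the requirement of no change in total copy number come from the text.

```python
from dataclasses import dataclass

# Threshold quoted in the text (recommended by PGS).
IMBALANCE_THRESHOLD = 0.15


@dataclass
class Segment:
    """One genomic segment; all field names are hypothetical."""
    chrom: str
    start: int
    end: int
    copy_state: str   # assumed labels: "gain", "loss" or "neutral"
    imbalance: float  # imbalance proportion from the allele specific analysis


def call_upd(segment, threshold=IMBALANCE_THRESHOLD):
    """Call UPD when a segment is allelically imbalanced but copy neutral."""
    return segment.copy_state == "neutral" and segment.imbalance >= threshold
```

Under this sketch, a copy neutral segment with an imbalance proportion of 0.30 would be called UPD, whereas the same imbalance in a segment with copy number gain would not.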

4.2.3 Determining frequent regions of UPD, gain and loss

To determine frequent regions of gain, loss, and UPD, the frequency of each alteration was determined for each SNP probe on the autosomes. A frequency threshold of 40% was used. To smooth out regions of UPD (and of gain and loss as well), a three-step process was performed. First, adjacent probes with frequencies greater than the threshold were merged together. Second, to account for dips in frequency that split one region into two, the two regions were merged if the dip was less than 1 Mb in size. Finally, smoothed regions of UPD, gain, or loss which were less than 100 probes in size were removed.
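The three-step smoothing process can be sketched as follows for the probes of a single chromosome. This is an illustrative reading of the description, not the code used in this work: the function name and arguments are hypothetical, a dip is measured as the distance between the two flanking above-threshold probes, and the probe count of a merged region is assumed to include the probes inside the dip.

```python
def frequent_regions(positions, freqs, threshold=0.40,
                     max_dip_bp=1_000_000, min_probes=100):
    """Return (start_bp, end_bp, n_probes) for smoothed frequent regions.

    positions: probe coordinates on one chromosome, sorted ascending.
    freqs: alteration frequency (0 to 1) at each probe, same order.
    """
    # Step 1: merge adjacent probes whose frequency exceeds the threshold.
    runs = []
    run_start = None
    for i, freq in enumerate(freqs):
        if freq > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            runs.append([run_start, i - 1])
            run_start = None
    if run_start is not None:
        runs.append([run_start, len(freqs) - 1])

    # Step 2: merge neighbouring runs separated by a dip shorter than 1 Mb.
    merged = []
    for first, last in runs:
        if merged and positions[first] - positions[merged[-1][1]] < max_dip_bp:
            merged[-1][1] = last
        else:
            merged.append([first, last])

    # Step 3: remove smoothed regions spanning fewer than 100 probes.
    regions = []
    for first, last in merged:
        n_probes = last - first + 1
        if n_probes >= min_probes:
            regions.append((positions[first], positions[last], n_probes))
    return regions
```

The function would be applied once per chromosome and once per alteration type (gain, loss, UPD), with the per-probe frequencies computed across all tumors.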

4.2.4 Determination of UPD in cancer cell lines

Raw SNP 6.0 data (.CEL files) from cancer cell lines were obtained from the Wellcome Trust Sanger CGP database. CEL files were then genotyped as described above to generate CHP files. To define a total copy number and allele specific copy number reference, SNP 6.0 data, generated from 72 CEPH HapMap samples, were obtained from Affymetrix and were also genotyped. Unpaired copy number and allele specific copy number analyses were performed, as described above, to determine regions of allelic imbalance without a change in total copy number using Partek Genomics Suite.

4.2.5 Expression analysis of genes in focal regions of UPD

For 16 of the 46 tumor/non-malignant tissue pairs, gene expression profiles were generated on a custom Affymetrix chip. The 32 samples were normalized using the RMA algorithm [21] in the Bioconductor software suite in R [22]. To determine overexpression in a given sample pair for a given gene, since expression values are in log2 space, expression values in non-malignant samples were subtracted from expression values in the tumor. A two-fold expression change was deemed significant for this analysis.
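Because the RMA values are in log2 space, a two-fold change corresponds to a tumor minus non-malignant difference of 1. The sketch below illustrates this call and the resulting frequency of overexpression across pairs. The function names are hypothetical, and treating a difference of exactly 1 as overexpressed is an assumption, since the text does not state whether the threshold is inclusive.

```python
import math


def is_overexpressed(tumor_log2, normal_log2, fold=2.0):
    """True if the tumor is at least `fold` higher than its matched control.

    Inputs are log2 expression values, so the fold change is a difference.
    The inclusive comparison (>=) is an assumption.
    """
    return (tumor_log2 - normal_log2) >= math.log2(fold)


def overexpression_frequency(pairs, fold=2.0):
    """Fraction of (tumor_log2, normal_log2) pairs called overexpressed."""
    calls = [is_overexpressed(tumor, normal, fold) for tumor, normal in pairs]
    return sum(calls) / len(calls)
```

For example, a gene called overexpressed in 10 of 16 pairs has a frequency of 0.625 under this definition.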

4.3 Results

4.3.1 Detection of UPD using allele specific copy number analysis

To determine regions of UPD, we first determined regions of allelic imbalance using an allele specific copy number based approach. This approach has been shown to identify more regions of imbalance than previous call-based approaches [6, 12]. In the first example, where no UPD is present, we observe a chromosome exhibiting no change in total copy number as compared to its matched control and also no imbalance between the alleles, which would be represented by a shift between the blue and red data points (Figure 4.1a). However, in the next two samples, we do observe large shifts between the blue and red data points. Specifically, one example illustrates a region of UPD together with a region of gain on chromosome arm 12q (Figure 4.1b) and another example illustrates a whole chromosome UPD event on chromosome 14 (Figure 4.1c). The blue data points in the UPD regions are not completely at zero but slightly above it, owing to cells in the sample that do not carry the UPD alteration.

4.3.2 UPD is prevalent and non-random in the lung cancer genome with comparable frequencies to gain and loss

Having shown that UPD can be detected, and having identified UPD at the KRAS oncogene in a previous study, we then assessed the prevalence of UPD in the genome. Using a 40% frequency threshold, we determined the regions of the genome affected by UPD at this frequency. In total, 153 regions were identified (Table 4.1). Moreover, when examining areas of frequent gain and loss (at the same frequency threshold), we observed that the amount of the genome affected by frequent UPD is comparable to that affected by frequent gain and loss (Figure 4.2). While there was some overlap between the regions of loss and UPD, there was very little overlap between gain and UPD, even though we would expect some level of overlap by random chance. Using megabases of the genome as a metric, we observed approximately 650 Mb affected by gain, 500 Mb by loss and 400 Mb by UPD, with 7 Mb of overlap between gain and UPD and 58 Mb of overlap between loss and UPD (Figure 4.3). Strikingly, the three alterations together cover over 49% of the genome. It should also be noted that the comparable levels of gain, loss and UPD observed at the frequency level are also seen when examining samples on an individual basis (Figure 4.4).
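The totals quoted above can be checked against the individual values shown in the Venn diagram of Figure 4.3. The short calculation below is only a worked illustration: the assignment of each number to a region of the diagram is inferred from the text, and the genome size of 3,000 Mb is an assumed round figure, as the denominator used for the percentage is not stated.

```python
# Values in Mb as shown in Figure 4.3 (region assignment inferred from the text).
gain_only, loss_only, upd_only = 642, 441, 335
gain_and_upd, loss_and_upd = 7, 58

total_gain = gain_only + gain_and_upd               # about 650 Mb
total_loss = loss_only + loss_and_upd               # about 500 Mb
total_upd = upd_only + gain_and_upd + loss_and_upd  # 400 Mb

# Genome covered by at least one of the three alterations.
union_mb = gain_only + loss_only + upd_only + gain_and_upd + loss_and_upd

ASSUMED_GENOME_MB = 3000  # assumption, not taken from the thesis
fraction_altered = union_mb / ASSUMED_GENOME_MB

print(total_gain, total_loss, total_upd, union_mb, round(fraction_altered, 3))
```

With this assumed genome size, the combined 1,483 Mb corresponds to roughly 49% of the genome, in line with the figure quoted in the text.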

4.3.3 Overlap of major oncogenes and tumor suppressor genes in regions of gain, loss, and UPD

We then assessed how major oncogenes and tumor suppressor genes were associated with the three types of genetic alteration. Using a list of 112 genes derived from a number of sources [23, 24] (Table 4.2), we found 52 of these genes to overlap with at least one region of frequent gain, loss, or UPD. Major oncogenes such as EGFR, MYC, AKT1, MDM2, and ERBB2 are frequently affected by copy number gain, as has been shown previously [25-29] (Table 4.3). Similarly, major tumor suppressor genes such as FHIT, RARB and CDKN2A are affected by frequent copy number loss (Table 4.4). Interestingly, while tumor suppressor genes such as BRCA2 and RB1 are, as expected, affected by UPD, a subset of seven oncogenes was also affected by UPD. These included KRAS, as shown previously, PIK3CA, BCL6 and FLT3. Moreover, examining KRAS (Figure 4.5a) and RB1 (Figure 4.5b) specifically, we see that the UPD events differ in size between samples.

4.3.4 UPD is prevalent at oncogenes across multiple cancer types

Having observed frequent UPD at oncogenes in lung cancer, we sought to assess the prevalence of UPD at oncogenes across multiple cancer types. For this analysis, we utilized SNP 6.0 array data for over 700 cancer cell lines from the Wellcome Trust Sanger database for which somatic mutation data were also available. In total, 67 instances of homozygous mutation at 13 oncogene loci were assessed (Table 4.5). It was found that while copy number gain was the most prevalent genetic alteration, a substantial proportion of samples exhibited UPD (Figure 4.6a, Table 4.6). The overall trend is consistent with what is observed at KRAS and BRAF, the two genes with the most samples harbouring homozygous mutation (Figure 4.6a). Examples of UPD at KRAS in NCI-H2030 and at BRAF in A427 are illustrated in Figure 4.6b.

For this analysis, cancer cell lines were utilized because they represent a more homogeneous population of cells. In contrast, clinical tumors, even after microdissection, may still contain small amounts of contaminating normal cells. As such, determining whether a mutation is homozygous in clinical lung tumors is challenging. Using the available KRAS mutation data, we assessed the frequency of gain, loss and UPD in KRAS mutant tumors and observed a distribution pattern similar to that seen in the cell lines (Figure 4.6c).

4.3.5 Identification of novel candidate oncogenes using focal regions of UPD

Selecting the more focal regions of UPD within the set of 153 regions, we identified 35 regions that contained three or fewer RefSeq annotated genes. In total, 64 RefSeq genes were identified across the 35 regions (Table 4.7), and amongst these genes was E2F3 (Figure 4.7a). Examining paired gene expression for a subset of the 46 tumor/normal pairs, it was found that 10/16 pairs showed overexpression of E2F3 (Figure 4.7b). E2F3 has previously been shown to be overexpressed in lung cancer and to have a role in other cancer types [30, 31].
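The selection of focal regions can be sketched as a simple interval containment test. This is a hypothetical illustration rather than the procedure used here: genes are counted only when completely encompassed by a region (as in the caption of Figure 4.7), regions with no genes are excluded by assumption, and the region names, gene names and coordinates below are toy values, not real annotations.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def genes_within(region, genes):
    """Names of genes lying completely inside the region."""
    return [g.name for g in genes
            if g.chrom == region.chrom
            and g.start >= region.start and g.end <= region.end]


def focal_regions(regions, genes, max_genes=3, min_genes=1):
    """Map region name to contained genes for regions with few genes.

    min_genes=1 (skipping gene-free regions) is an assumption.
    """
    focal = {}
    for region in regions:
        contained = genes_within(region, genes)
        if min_genes <= len(contained) <= max_genes:
            focal[region.name] = contained
    return focal


# Toy data: invented coordinates and placeholder gene names.
regions = [Interval("region_1", "6", 1_000, 9_000),
           Interval("region_2", "6", 20_000, 90_000)]
genes = [Interval("GENE_A", "6", 2_000, 3_000),
         Interval("GENE_B", "6", 4_000, 5_000),
         Interval("GENE_C", "6", 8_500, 9_500),   # straddles the region boundary
         Interval("GENE_D", "6", 21_000, 22_000),
         Interval("GENE_E", "6", 30_000, 31_000),
         Interval("GENE_F", "6", 40_000, 41_000),
         Interval("GENE_G", "6", 50_000, 51_000)]

print(focal_regions(regions, genes))
```

In this toy example only `region_1` qualifies: it fully contains two genes, while `region_2` contains four and is therefore not considered focal.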

4.4 Discussion

We have shown the unexpectedly wide prevalence of UPD in the lung adenocarcinoma genome and have also observed a large number of both known and novel oncogenes harbored in these regions of frequent UPD. While there have been previous studies utilizing SNP arrays on lung adenocarcinoma tumors [26, 30], there are a number of reasons why these frequent regions were likely missed. First, the tumors utilized in this study were microdissected to ensure that a high proportion of tumor cells (≥ 70% was required) was analyzed. This is important as previous studies have shown the impact of tissue heterogeneity on the ability to detect alterations [32, 33]. Second, for every tumor used, matched non-malignant tissue was obtained, profiled and used as the control. While it has been shown that unmatched references can be used to detect UPD, the resultant UPD calls may not always be correct. Finally, the progression from call-based approaches to allele specific copy number-based approaches can also increase the detection of UPD [6, 12]. Taken together, these improvements could explain the observed results.

While it is interesting to observe these frequent regions of UPD in the lung adenocarcinoma genome, the larger implications of these findings may not be readily apparent. In the case of somatically mutated oncogenes or tumor suppressor genes, the role of UPD is clear: UPD selects the mutated allele, resulting in a homozygous mutation state. We have previously shown that mutant allele specific imbalance (MASI), through either allele specific amplification or UPD, is associated with a poorer prognosis [20]. To assess the prevalence of UPD at homozygously mutated oncogene sites, we analyzed cancer cell lines encompassing multiple cancer types for UPD at mutated oncogenes. While the most frequent genomic alteration observed is copy number gain, frequent UPD also occurs. The distribution of alterations observed across all genes is consistent with that of the most frequently mutated oncogenes, KRAS and BRAF. The result of these UPD events is preferential expression of the mutated allele.

It should also be noted that, given the amount of frequent UPD detected, there are likely regions selected for reasons other than somatic mutation. For example, as in the case of imprinted regions, there could be preferential selection of an unmethylated or methylated allele which, in turn, could regulate downstream gene expression. Previous studies have assessed the relationship between regions of UPD and DNA methylation patterns in cancer [8, 34, 35]. Alternatively, in addition to preferential selection based on methylation, it has been shown that for a given gene, transcription may involve only one of the alleles [36-39]; thus, selection may be based on transcriptional efficiency. Hence, it is important that the genetic data on UPD be integrated with methylation and gene expression data to refine these regions of UPD, many of which contain many genes, to a small number of candidate oncogenes and tumor suppressor genes.

Though many of the regions of UPD identified were large and encompassed a number of genes, approximately one fifth of the regions contained three or fewer genes. Focusing on such focal regions is one approach for narrowing down candidate gene targets. Using gene expression data on a subset of the cases profiled for UPD, we assessed the expression of the 64 genes encompassed in the 35 focal regions. Of the 64 genes, 57 were represented on the gene expression microarray platform used. Fifteen of the 57 genes were overexpressed in at least 25% of the samples (4/16) (Table 4.8). In addition to E2F3, other genes within the set of 15 have interesting biological functions. For example, GPR39 has been shown to activate EGFR signaling as well as protect cells from apoptosis [40, 41]; SLC7A11 has been shown to have a role in drug resistance [42] and was assessed as a therapeutic target for small cell lung cancer [43]; PDGFD has been implicated in many different cancer types [44]; and PRDM8, a histone methyltransferase, is a member of the PRDM transcription factor family, members of which have been implicated as proto-oncogenes [45].

4.5 Conclusion

In summary, we have shown an unexpectedly high prevalence of UPD in the lung adenocarcinoma genome, with the amount of the genome affected being comparable to that affected by copy number gain and loss. While a number of known oncogenes were shown to be in regions of frequent UPD, potentially novel lung oncogenes have also been shown to be affected by UPD, with a consequent downstream change in gene expression. Further studies are needed to elucidate their roles in lung adenocarcinoma.

Figure 4.1. Detection of UPD using allele specific copy number. Total copy number (top) and allele specific copy number (bottom) plots. In the allele specific copy number plot, the red data points represent the level of the major allele and the blue data points represent the level of the minor allele. The total copy number plot represents the sum of the allele specific copy numbers. (a) Sample with neutral copy number and no allelic imbalance. The total copy number is neutral and, when examining the allele specific copy number, no imbalance between the alleles is evident. (b) Sample with regions of copy number gain and UPD (in orange) on chromosome 12q. (c) Sample with whole chromosome UPD of chromosome 14.

[Figure 4.1: panels (a) Chromosome 12, (b) Chromosome 12q, (c) Chromosome 14. Each panel shows a total copy number plot (top) and an allele specific copy number plot (bottom); y-axis: # of copies.]

Figure 4.2. Comparison of frequent regions of gain, loss and UPD in the lung adenocarcinoma genome. Frequent regions of gain (red), loss (green) and UPD (blue) in the lung adenocarcinoma genome. Only regions which were altered in at least 40% of the samples, by either gain, loss, or UPD, are shown. Frequent regions of gain (such as 5p, 7p, 8q, 17q and 20q) and loss (such as 3p, 8p, 9p, 13q), which have previously been shown, are detected. The fourth column, composite ("C"), represents areas of overlap between gain and UPD (red) and loss and UPD (green).

[Figure 4.2: ideograms of chromosomes 1 to 22, each with four columns: G (Gain), L (Loss), U (UPD), C (Composite).]

[Figure 4.3: Venn diagram of Gain, Loss and UPD. Gain only: 642; Loss only: 441; UPD only: 335; Gain and UPD: 7; Loss and UPD: 58.]

Figure 4.3. Venn diagram illustrating the amount of the genome covered by frequent gain, loss, and UPD. Numbers provided are in megabases (Mb) of genome sequence.

Figure 4.4. Genomic profile of an individual lung adenocarcinoma sample. Regions of gain (red), loss (green), and UPD (blue) are shown in this single lung adenocarcinoma profile. Comparable amounts of the genome are affected by all three of these alterations.

[Figure 4.4: genome-wide profile across chromosomes 1 to 22; legend: Gain, Loss, UPD.]

[Figure 4.5: (a) chr12 (p12.1-p11.21), KRAS locus, with UPD regions shown for samples 85060201, 85070205, 85070159, 85060358, 85050147, 85060186, 85050235, 85070021, 85070093, 85060276, 85050241, 85040031, 85060354, 85060256, 85050172, 85040001, 85050140, 85060342, 85060206, 85060311, 85060251, 85050207, 85060098. (b) chr13 (q13.3-q21.1), RB1 locus, with UPD regions shown for samples 85070205, 85060186, 85060098, 85060342, 85060251, 85050177, 85070081, 85050011, 85060358, 85060256, 85050147, 85040001, 85060221, 85070085, 85060216, 85060068, 85050172, 85070061, 85060311, 85070093, 85040031.]

Figure 4.5. Examination of UPD events at the KRAS and RB1 loci. KRAS is shown in (a) and RB1 in (b). The region of UPD encompassing these loci varies in size between samples, with some samples showing larger regions of UPD than others. The existence of events of different sizes is likely a result of different underlying mechanisms of UPD.

[Figure 4.6: (a) distribution of Gain, Loss, UPD and Neutral for All Genes (n=67), KRAS (n=33) and BRAF (n=11). (b) Allele specific copy number (# of alleles) and total copy number (# of copies) plots for A427 (Chromosome 7, BRAF) and NCI-H2030 (Chromosome 12, KRAS). (c) Percent of cases with Gain, UPD, Loss and Neutral for KRAS (n=21).]

Figure 4.6. Relationship of homozygous mutation at oncogenes and genomic alteration. Using somatic mutation data from the Wellcome Trust Sanger COSMIC database and SNP 6.0 data available for over 700 cancer cell lines from their database, the prevalence of UPD was assessed. Specifically, only those cell lines with homozygous mutation at an oncogene were analyzed. (a) In total, 67 instances of homozygous mutation at oncogene loci were identified. While a large fraction of cases exhibited copy number increase (51%), the second most prominent alteration is UPD (34%). Of the 12 different genes assessed, KRAS and BRAF are the most frequently homozygously mutated oncogenes, and those two genes show frequency distribution patterns of genomic alteration similar to the whole set. (b) An example of UPD at BRAF in A427 and KRAS in NCI-H2030, where both BRAF and KRAS are homozygously mutated. (c) With available mutation data on KRAS from the 46 lung tumor/matched non-malignant tissue pairs, a similar analysis was performed and it was found that the patterns of genomic alteration were similar to what was observed in cancer cell lines.

Figure 4.7. Identification of E2F3 in a focal region of UPD. (a) One of the focal regions identified was located on chromosomal region 6p22.3. There were only three RefSeq annotated genes that were completely encompassed within this region: E2F3, ID4, and MBOAT1. The UCSC Genome Browser (genome build hg18) was used to identify genes and visualize the region [46]. (b) Analyzing gene expression amongst a subset of the tumors profiled on the SNP array, it was found that E2F3 was the most frequently overexpressed amongst the three genes assessed, with a frequency of overexpression of 62.5%.

[Figure 4.7: (a) ideogram of chromosome arm 6p (bands 6p25.3 to 6p11.1). (b) Log2 fold change (y-axis, -0.5 to 2.5) for samples 1 to 16 (x-axis).]

Table 4.1. Regions of the genome exhibiting frequent UPD

Chr BPStart BPEnd # of markers
1 57240523 57708781 261
1 88627690 88937185 107
1 213006484 213325225 153
2 33915294 34575063 336
2 102028604 102246459 130
2 103342214 104703078 369
2 107288361 107909534 151
2 113475801 113597836 110
2 123104889 123771933 262
2 129285355 130086326 353
2 132852444 133093232 131
2 134976541 137102451 519
2 138356995 142103903 1239
2 148597921 149606831 175
2 150732270 153458836 918
2 154569594 157096948 576
2 158283600 163854187 1560
2 165114877 166441984 353
2 167549158 172376141 1670
2 173385819 175183422 649
2 178506144 178971533 144
2 182593441 183843529 379
2 185976166 192057947 1482
2 195998062 198251401 604
2 214938398 217672315 1092
2 222098215 223638933 514
2 224915228 225683208 295
2 234086493 235425331 673
3 38947562 40467146 489
3 75597086 77391013 520
3 120734475 121565472 248
3 126442524 128090913 466
3 131310768 131908449 194
3 133156739 140816203 2231
3 141841646 145572883 1095
3 148635117 153815363 1592
3 154850963 162426559 2109
3 163593089 164340331 178
3 171033453 175715353 1551
3 179650656 194058084 4585
4 56867092 57264514 111
4 59586524 62756073 878
4 68014200 73981908 1418
4 75048874 79252113 1473
4 81069345 81632642 125
4 83519602 86772381 1036
4 95266916 96342489 299
4 99713479 100262528 179
4 102215835 109269168 1774
4 110431176 111696150 384
4 113392931 114074009 165
4 119368629 120609480 324
4 122057241 123147706 383
4 128685983 130719464 455
4 138792901 139993006 454
5 52835375 56296363 1274
5 59963217 62083765 600
5 64017546 65834248 562
5 70702961 72963048 707
5 74029627 81781885 2453
5 86329552 88138619 328
5 90373713 90752434 102
5 95109222 96467147 461
5 98108008 98876920 190
5 106982444 108758672 584
5 113412673 114060668 291
5 115177955 116152174 521
5 118498618 118830120 101
5 123995175 124348209 112
5 130208827 131907815 390
5 139359170 140952015 262
5 145140525 146950721 652
5 148164516 149400891 515
5 153600774 154479059 292
5 156203870 159492184 1245
5 162483304 163524481 416
5 165711472 166223866 189
5 180068185 180629495 176
6 8651004 9629747 293
6 18605671 20757531 903
6 71651506 72037387 152
6 73271789 76714973 996
6 82824129 84992961 566
6 87778426 91268388 1147
6 97800818 101044085 940
6 105272060 114467061 2939
6 116466834 119757979 966
6 121042746 123214382 558
6 125234328 126352419 406
6 130012107 130399246 197
6 131442876 133188489 644
6 134265257 144908460 3331
6 147524934 170759956 9496
7 109870908 110924988 272
7 119981096 121751585 455
7 122940488 123963613 280
7 125810873 126573092 258
8 82600012 84910522 562
8 87475660 91661039 1094
8 109473113 113912000 1026
9 32370194 32717136 127
10 86487543 87543389 442
10 97401670 97992222 155
10 99819620 100845692 389
11 7088070 8626013 675
11 9745400 16571856 2963
11 17785506 20708448 1441
11 22527476 27641093 1997
11 31290899 36427391 2111
11 37624038 46083889 3066
11 78488179 78811363 210
11 81391754 83571743 737
11 85133890 86695809 613
11 92858003 94403193 618
11 99344667 102033351 1020
11 103294671 104342367 382
11 106728968 107654049 275
11 111090113 111676300 118
11 114516077 115632014 460
11 121222174 122326244 431
11 127365170 128934004 622
11 131240209 132411274 555
12 4901875 5859088 616
12 7251496 15972576 2966
12 19068978 20077939 414
12 21583225 28079788 2489
12 29252288 31395946 1005
12 36144018 38218718 356
12 43846459 44709526 206
12 46714417 47193357 116
12 49541041 50471244 235
12 53439918 53858695 150
12 74639231 75745944 244
12 92376456 96256397 1605
12 97428291 100976394 1075
12 102585961 120121516 6328
12 124477764 127353782 1525
12 128730700 129400678 383
13 17943628 32731810 6266
13 35198318 36512766 543
13 39604817 41933751 808
13 43687180 52149041 2627
13 80666670 81857944 333
13 97080915 99917507 1076
14 26463057 27074470 206
14 39744666 40655458 316
15 23337531 24061503 396
17 1646832 4192054 695
17 5443385 6771106 606
17 8395366 9333459 369
17 10552154 14949757 1968
20 8028477 9014932 524
20 53205763 53952115 387

Table 4.2. List of major oncogenes and tumor suppressor genes assessed

Gene     Chr   Gene     Chr   Gene     Chr
ABL1     9     EVI1     3     NF2      22
ABL2     1     FBXW7    4     NKX2-1   14
AKT1     14    FEV      2     NOTCH1   9
AKT2     19    FGFR1    8     NRAS     1
ALK      2     FGFR2    10    NTRK1    1
APC      5     FGFR3    4     NTRK3    15
ATM      11    FH       1     PDGFB    22
BCL2     18    FHIT     3     PDGFRA   4
BCL3     19    FLT3     13    PDGFRB   5
BCL6     3     FOXO1A   13    PHOX2B   4
BMPR1A   10    FOXO3A   6     PIK3CA   3
BRAF     7     FOXP1    3     PIK3R1   5
BRCA1    17    GNAS     20    PIM1     6
BRCA2    13    GSTP1    11    PRKAR1A  17
BUB1B    15    HRAS     11    PTCH     9
CAV1     7     HRPT2    1     PTEN     10
CBL      11    ITK      5     PTPN11   12
CCND1    11    JAK2     9     RARB     3
CCND2    12    JAK3     19    RASSF1A  3
CCND3    6     KIT      4     RB1      13
CD44     11    KRAS     12    REL      2
CDH1     16    LCK      1     RET      10
CDH11    16    MAF      16    RUNX1    21
CDH13    16    MAFB     20    SEMA3B   3
CDK4     12    MAML2    11    SMO      7
CDK6     7     MAP2K4   17    STK11    19
CDKN2A   9     MDM2     12    SUFU     10
CEBPA    19    MEN1     11    SYK      9
CHEK2    22    MET      7     TCF1     12
CRK      17    MLH1     3     TIMP3    22
CTNNB1   3     MLL      11    TP53     17
CYLD     16    MPL      1     TSC1     9
DAPK1    9     MSH2     2     TSC2     16
EGFR     7     MSH6     2     TSHR     14
ERBB2    17    MYC      8     VHL      3
ERCC2    19    MYCL1    1     WT1      11
ERG      21    MYCN     2
ETV6     12    NF1      17

Table 4.3. Overlap of oncogenes in frequent regions of genomic alteration

Gene Symbol  Location        Gain/Loss/UPD
ABL1         9q34.1          X
ABL2         1q24-q25        X
AKT1         14q32.32        X
AKT2         19q13.1-q13.2   X
BCL6         3q27            X
CCND1        11q13           X
CCND3        6p21            X
CD44         11p13           X
CDK4         12q14           X
CEBPA        19q13.11        X
CRK          17p13.3         X
EGFR         7p12.3-p12.1    X
ERBB2        17q21.1         X
ETV6         12p13           X
FEV          2q36            X
FGFR3        4p16.3          X
FLT3         13q12           X
GNAS         20q13.2         X
HRAS         11p15.5         X
ITK          5q31-q32        X
KRAS         12p12.1         X
LCK          1p35-p34.3      X
MAFB         20q11.2-q13.1   X
MDM2         12q15           X
MEN1         11q13           X
MPL          1p34            X
MYC          8q24.12-q24.13  X
MYCL1        1p34.3          X
NOTCH1       9q34.3          X
NTRK1        1q21-q22        X
PDGFB        22q12.3-q13.1   X
PDGFRB       5q31-q32        X
PIK3CA       3q26.3          X
PIM1         6p21.2          X
PRKAR1A      17q23-q24       X
SMO          7q31-q32        X

Table 4.4. Overlap of tumor suppressor genes in frequent regions of genomic alteration

Gene Symbol  Location    Gain/Loss/UPD
BRCA1        17q21       X
BRCA2        13q12       X X
CDH1         16q22.1     X
CDKN2A       9p21        X
CYLD         16q12-q13   X
FH           1q42.1      X
FHIT         3p14.2      X
GSTP1                    X
MAP2K4       17p11.2     X X
NF1          17q12       X
PTPN11       12q24.1     X
RARB         3p24.2      X
RB1          13q14       X X
TSC1         9q34        X
TSC2         16p13.3     X
WT1          11p13       X

Table 4.5. Cell lines and oncogene loci with homozygous mutation

Sample       Primary Tissue                      Gene
EFM-19       breast                              PIK3CA
NCI-ADR-RES  breast                              ERBB2
OCUB-M       breast                              PIK3CA
AM-38        central nervous system              BRAF
OMC-1        cervix                              PIK3CA
HEC-1        endometrium                         KRAS
ECC4         gastrointestinal tract              KRAS
BE-13        haematopoietic and lymphoid tissue  NOTCH1
HEL          haematopoietic and lymphoid tissue  JAK2
OPM-2        haematopoietic and lymphoid tissue  FGFR3
LS-174T      large intestine                     CTNNB1
LS-411N      large intestine                     BRAF
RCM-1        large intestine                     KRAS
SK-CO-1      large intestine                     KRAS
SNU-C2B      large intestine                     KRAS
SW1463       large intestine                     KRAS
SW403        large intestine                     KRAS
SW620        large intestine                     KRAS
A427         lung                                CTNNB1
A549         lung                                KRAS
COLO-668     lung                                KRAS
COR-L23      lung                                KRAS
COR-L23      lung                                RUNX1
IA-LM        lung                                KRAS
LCLC-97TM1   lung                                KRAS
LU-65        lung                                KRAS
NCI-H1092    lung                                CTNNB1
NCI-H1155    lung                                KRAS
NCI-H1395    lung                                BRAF
NCI-H1793    lung                                KRAS
NCI-H2030    lung                                KRAS
NCI-H2122    lung                                KRAS
NCI-H2291    lung                                KRAS
NCI-H2347    lung                                NRAS
NCI-H460     lung                                KRAS
NCI-H727     lung                                KRAS
PC-14        lung                                EGFR
SHP-77       lung                                KRAS
SW1573       lung                                KRAS
KYSE-450     oesophagus                          NOTCH1
OVCAR-5      ovary                               KRAS
AsPC-1       pancreas                            KRAS
CAPAN-1      pancreas                            KRAS
HuP-T4       pancreas                            KRAS
MIA-PaCa-2   pancreas                            KRAS
PANC-08-13   pancreas                            KRAS
SW1990       pancreas                            KRAS
YAPC         pancreas                            KRAS
A375         skin                                BRAF
COLO-679     skin                                BRAF
CP66-MEL     skin                                NRAS
GAK          skin                                NRAS
HT-144       skin                                BRAF
MEL-HO       skin                                BRAF
MEL-JUSO     skin                                HRAS
SH-4         skin                                BRAF
SK-MEL-2     skin                                NRAS
SK-MEL-28    skin                                BRAF
SK-MEL-28    skin                                EGFR
UACC-62      skin                                BRAF
RD           soft tissue                         NRAS
BCPAP        thyroid                             BRAF
CAL-62       thyroid                             KRAS
BB49-HNC     upper aerodigestive tract           HRAS
639-V        urinary tract                       PIK3CA
T-24         urinary tract                       HRAS
UM-UC-3      urinary tract                       KRAS

Table 4.6. Summary of homozygous mutation analysis in cancer cell lines

Gene     # of Hz mutations  # UPD  # Gain  # Loss  # Neutral
KRAS     33                 10     18      4       1
BRAF     11                 4      6       1       0
NRAS     5                  1      3       1       0
PIK3CA   4                  1      2       1       0
CTNNB1   3                  2      0       1       0
HRAS     3                  2      1       0       0
EGFR     2                  1      1       0       0
NOTCH1   2                  0      1       1       0
FGFR3    1                  0      1       0       0
JAK2     1                  0      1       0       0
ERBB2    1                  1      0       0       0
RUNX1    1                  1      0       0       0
Total    67                 23     34      9       1

Table 4.7. RefSeq genes in focal regions of UPD

Gene Symbol  Chr   Gene Symbol  Chr
DAB1         1     TNFAIP8      5
IL1R1        2     ZNF608       5
IL1RL2       2     E2F3         6
GPR39        2     ID4          6
EPC2         2     MBOAT1       6
KIF5C        2     B3GAT2       6
MBD5         2     C6orf191     6
OSBPL6       2     IMMP2L       7
RBM45        2     LRRN3        7
CUL3         2     GRM8         7
DOCK10       2     ODZ4         11
FAM124B      2     CASP4        11
FRG2C        3     DDI1         11
ZNF717       3     PDGFD        11
COL29A1      3     CADM1        11
COL6A6       3     HNT          11
LPHN3        4     OPCML        11
ANTXR2       4     ANO2         12
FGF5         4     KCNA5        12
PRDM8        4     NTF3         12
ADH5         4     AEBP2        12
EIF4E        4     PLEKHA5      12
METAP1       4     ALG10B       12
SLC7A11      4     CPNE8        12
ARRDC3       5     KIF21A       12
CHD1         5     TMEM132B     12
RGMB         5     FZD10        12
FBXL17       5     ZNF10        12
FER          5     ZNF140       12
PJA2         5     ZNF268       12
KCNN2        5     ATP10A       15
DMXL1        5     PLCB1        20

Table 4.8. Genes overexpressed in focal regions of UPD

Probe ID               Gene Symbol  Frequency of Overexpression
merck-NM_001508_a_at   GPR39        13
merck-AJ270693_at      PLEKHA5      12
merck-NM_014331_at     SLC7A11      10
merck-NM_001949_at     E2F3         10
merck-C17174_at        CUL3         9
merck-AA651853_at      ARRDC3       7
merck-AF336376_a_at    PDGFD        6
merck-CR624190_a_at    KIF21A       6
merck-NM_016522_at     HNT          5
merck-NM_003854_at     IL1RL2       5
merck-NM_000845_at     GRM8         5
merck-AY358331_s_at    HNT          5
merck-CR625009_at      ZNF140       5
merck-X52332_a_at      ZNF10        5
merck-AK127693_s_at    PLCB1        4
merck-NM_020226_at     PRDM8        4


Chapter 5: Integrating the multiple dimensions of genomic and epigenomic landscapes of cancer⁴

⁴ Sections 5.1 to 5.3, 5.4.1 to 5.4.4, 5.5 and 5.6 of this chapter have been published: Chari R, Thu KL, Wilson IM, Lockwood WW, Lonergan KM, Coe BP, Malloff CA, Gazdar AF, Lam S, Garnis C, MacAulay CE, Alvarez CE, Lam WL. (2010) Integrating the multiple dimensions of genomic and epigenomic landscapes of cancer. Cancer and Metastasis Reviews, 29(1):73-93. doi: 10.1007/s10555-010-9199-2. Sections 5.4.5 to 5.4.6 were not published previously.

5.1 Introduction

In the past decade, advancements in genome profiling technologies have greatly improved our ability to understand the landscape of cancer genomes. From the emergence of array-based comparative genomic hybridization (CGH) and spectral karyotyping (SKY) to the current state of next generation sequencing (NGS), the resolution at which the genome can be described has improved more than a million-fold [1-6]. Likewise, the recent development of integrative platforms that relate multiple dimensions of DNA features (such as copy number, allelic status, sequence mutations, and DNA methylation) to gene expression patterns has dramatically improved our ability to identify causal genetic events and decipher their downstream consequences in the context of gene networks and biological functions [7, 8] (Table 5.1).

Landmark events in cancer genomics, from the launch of the Cancer Genome Anatomy Project at the beginning of the decade to the recent publication of complete cancer genome sequences, are highlighted in Figure 5.1 [3-6, 8-43].

Multiple levels of genetic and epigenetic disruption are instrumental to cancer development, whereby specific genes may be altered by a variety of mechanisms. For example, the tumor suppressor CDKN2A can be inactivated through copy number loss, DNA hypermethylation, or sequence mutation. These mechanisms of disruption can occur in a tumor-specific manner, with different tumors affected by different mechanisms, or may occur concurrently in the same tumor (i.e. a two-hit scenario). In the former situation, if the frequency of alteration of a given gene or pathway is low when examined by any one mechanism or dimension, the gene or pathway is likely to be overlooked by the analysis.

However, when multiple dimensions of disruption are considered together, alteration of the gene in question may be detected at a high frequency, even though it occurs at a low frequency by any single mechanism. This illustrates the need for, and the benefit of, integrative analytical approaches. In this chapter, we discuss the impact of multi-dimensional genomic analyses on our view of the cancer genome landscape, and the contribution of such new knowledge to our understanding of cancer progression and metastasis.
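The arithmetic behind this argument can be sketched in a few lines of Python. The sample identifiers, the per-dimension calls and the function name `alteration_frequencies` are hypothetical and are not taken from the datasets analyzed in this thesis; the sketch assumes that each dimension has already been reduced to a set of altered samples for the gene of interest.

```python
from typing import Dict, Set


def alteration_frequencies(calls: Dict[str, Set[str]], n_samples: int) -> Dict[str, float]:
    """Return per-dimension and combined alteration frequencies for one gene.

    `calls` maps a dimension name to the set of sample IDs in which the gene
    is altered in that dimension (hypothetical input format).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Frequency of alteration within each individual dimension
    freqs = {dim: len(samples) / n_samples for dim, samples in calls.items()}
    # A sample counts once if the gene is altered in at least one dimension
    altered_any = set().union(*calls.values()) if calls else set()
    freqs["any_dimension"] = len(altered_any) / n_samples
    return freqs


# Invented calls for a single tumor suppressor gene across 10 tumors
calls = {
    "copy_loss": {"T1", "T2"},
    "hypermethylation": {"T2", "T3", "T4"},
    "mutation": {"T5", "T6"},
}
result = alteration_frequencies(calls, n_samples=10)
```

In this invented example no single dimension exceeds 30%, yet the gene is disrupted in 60% of tumors when the three dimensions are considered together.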

5.2 Genomic alterations

5.2.1 Chromosomal aberrations

Chromosomal aberrations and rearrangements, such as translocations and gains/losses of whole chromosome arms or portions of them, are detected through direct examination using molecular cytogenetic techniques such as G-banding, SKY, fluorescence in situ hybridization (FISH) and CGH [2, 44-48]. The manifestation of such alterations is generally attributed to mitotic errors, where centrosomal aberrations and telomere dysfunction play key causative roles [49-53].

Aberrations such as gains and losses have been further refined using technologies such as microarray CGH (see below). While translocations are primarily associated with different types of leukemia and lymphoma, recent genomic studies have also identified them in epithelial tumors such as prostate and lung cancer [54-61]. A compilation of cumulative cytogenetic data from three main sources (the NCI/NCBI SKY/M-FISH & CGH Database, the NCI Mitelman Database of Chromosome Aberrations in Cancer, and NCI Recurrent Aberrations in Cancer) is now integrated into NCBI's system as Cancer Chromosomes [62] (Table 5.2).

5.2.2 Gene dosage, allelic imbalance, mutational status

Gene dosage. Genomic DNA copy number alterations are a prominent mechanism of gene disruption that contributes to tumor development [63]. Segmental amplification may lead to an increase in gene and protein expression of oncogenes, while deletions may lead to haploinsufficiency or the loss of expression of tumor suppressor genes. Since the development of microarray-based CGH in the mid 1990s, advances in the technology have dramatically increased genome coverage and target density, improving both the resolution and sensitivity of detection of copy number alterations [64, 65]. The first genome-wide array CGH analysis utilized cDNA microarrays originally designed for gene expression profiling [66]. Since these first experiments, whole genome tiling path arrays with tens of thousands of bacterial artificial chromosome (BAC) clones, oligonucleotide (25-80 bp probes) and single nucleotide polymorphism (SNP) arrays with over one million DNA elements, and the essential bioinformatics tools for visualization and analysis of high density array CGH data have been developed (Figure 5.1) [7, 33, 67-71]. These innovations have enabled increasingly precise mapping of the boundaries and magnitude of genetic alterations throughout the genome in a single experiment, greatly increasing our understanding of the cancer genome landscape in the context of DNA copy number [33, 72-76]. While early attempts were made utilizing sequence-based approaches [77-80], recent studies have begun to illustrate the improvement in detection resolution afforded by advances in high throughput sequencing technologies [6, 11, 13, 14]. The popularity of genome sequencing will depend on further cost reduction in data generation and major advancements in analysis [81].

Copy number variation. The discovery of a vast abundance of germ line segmental DNA copy number variation (CNV) in the normal human population has not only provided a baseline for interpretation of cancer genome data, but also highlighted the need for comparison against paired normal tissue [18, 19, 31, 32, 82-89]. Moreover, it has been shown that many of the reported CNVs overlap with loci involved with sensory perception and, more importantly, disease susceptibility. While the role of CNV in cancer is not well understood, a recent study showed that these regions are more susceptible to genomic rearrangement and may initiate subsequent alterations during tumorigenesis [90]. Furthermore, CNV at 1q21.1 was recently shown to be associated with neuroblastoma, implicating NBPF23, a new member of the Neuroblastoma Breakpoint Family, in tumorigenesis [91]. A database of all known CNVs is available at http://projects.tcag.ca/variation [31]. In addition, as copy number profiles of cancer genomes accumulate, hotspots for amplification and deletion are becoming evident, and signature alterations associated with specific diseases and cancer histologic subtypes are emerging [92-96]. The manifestation of “oncogene addiction” through lineage specific DNA amplification is a case in point [38, 39, 97-100].

Allelic status. Single nucleotide polymorphism (SNP) arrays are best known for their application in genome wide association studies (GWAS), where the correlation of haplotype with phenotype implicates disease susceptibility [101, 102]. SNP array platforms have shown tremendous advances in resolution, with the number of SNPs that can be simultaneously measured having increased 1000-fold since initial development. Currently, for example, the Affymetrix SNP 6.0 array platform measures 1.8 million elements, representing 906,600 SNP elements and more than 946,000 CNV elements. Likewise, on the Illumina HumanOmni1 platform, over 1,000,000 sites (representing a mixture of SNP and CNV elements) can be simultaneously assessed. In addition to their application in GWAS, SNP arrays can also be used to detect somatic alterations and, when applied in this context, allow for the simultaneous detection of copy number alteration and allelic imbalance in tumor genomes. In the example in Figure 5.2, when the SNP array profile of a lung cancer genome is compared against that of its paired non-cancerous lung tissue, it is possible to distinguish not only regions of allelic balanced copy neutrality (Figure 5.2a) from regions of allelic imbalance (Figure 5.2b, 5.2c), but also regions of allelic imbalance due to segmental DNA copy number alteration (Figure 5.2b) from those without change in total copy number (Figure 5.2c).
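The situations distinguished in Figure 5.2 can be expressed as a simple rule on allele-specific copy numbers. The sketch below is illustrative only: it assumes that major and minor allele copy numbers have already been estimated for each segment, the function name and the labels it returns are hypothetical, and it ignores normal cell contamination and measurement noise, which a real SNP array analysis must model.

```python
def classify_segment(major_cn: int, minor_cn: int, normal_total: int = 2) -> str:
    """Label a segment by total copy number and allelic balance.

    major_cn / minor_cn are the copy numbers of the more and less abundant
    allele in the tumor; normal_total is the expected total in normal tissue.
    """
    if minor_cn < 0 or major_cn < minor_cn:
        raise ValueError("expected major_cn >= minor_cn >= 0")
    total = major_cn + minor_cn
    balanced = major_cn == minor_cn
    if total == normal_total:
        # No change in total copy number: either normal or copy-neutral imbalance
        return "balanced copy neutral" if balanced else "copy-neutral allelic imbalance"
    direction = "gain" if total > normal_total else "loss"
    balance = "balanced" if balanced else "allelic imbalance"
    return f"{balance} with copy number {direction}"


# Invented segments as (major, minor) allele copy numbers
labels = [classify_segment(major, minor) for major, minor in [(1, 1), (2, 1), (1, 0), (2, 0)]]
```

Under these assumptions, a segment with one copy of each allele corresponds to the situation in Figure 5.2a, segments with a changed total copy number to Figure 5.2b, and a segment with two copies of one allele and none of the other to Figure 5.2c.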

Mutational profiling and whole genome sequencing. In cancer, oncogenes are thought to harbor mutations which lead to increased protein expression or constitutive protein activation, while tumor suppressor genes are thought to harbor mutations which are inactivating, either through total loss of protein expression or expression of mutant, non-functional protein. In addition, activating and inactivating mutations can be accompanied by changes in gene dosage or allele status (see below). Traditionally, mutation screening has been focused on specific oncogene and tumor suppressor loci. With the availability of newer and cheaper sequencing technologies [103], recent studies have expanded from single gene analyses to genome-wide screens [6, 11, 13, 14, 104]. For example, in studies using small cell lung cancer and melanoma cell lines, tens of thousands of somatic mutations were identified in each cell line, with a proportion of these mutations being attributed to cigarette smoke (G to T substitutions) and UV exposure (C to T substitutions), respectively [4, 5]. It will be interesting to see if other cancers have such mutation signatures. Another observation made in both studies was an uneven distribution of mutations, which suggests that DNA sequence integrity is largely maintained by transcription-associated DNA repair. While these and future studies will uncover a vast number of mutations, the contribution of those mutations to tumorigenesis will need to be determined [105, 106].
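Tallying substitution classes is a first step toward recognizing signatures such as these. The following is a minimal sketch, assuming somatic calls are available as (reference, alternate) base pairs; the input list is invented, and the convention of reporting each change from the pyrimidine strand (so that G to T is counted together with C to A) is an assumption made here for illustration rather than the procedure used in the cited studies.

```python
from collections import Counter
from typing import Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_spectrum(mutations: Iterable[Tuple[str, str]]) -> Counter:
    """Count single-base substitutions, merging strand-equivalent classes."""
    counts: Counter = Counter()
    for ref, alt in mutations:
        ref, alt = ref.upper(), alt.upper()
        if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
            raise ValueError(f"not a single-base substitution: {ref}>{alt}")
        if ref in ("A", "G"):
            # Purine reference: report the equivalent change on the other strand
            ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        counts[f"{ref}>{alt}"] += 1
    return counts


# Invented somatic calls as (reference base, alternate base)
somatic_calls = [("G", "T"), ("C", "A"), ("C", "T"), ("G", "A"), ("A", "G")]
spectrum = substitution_spectrum(somatic_calls)
```

With this convention, the G to T changes attributed to cigarette smoke are reported under the `C>A` label and the C to T changes attributed to UV exposure under `C>T`.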

5.2.3 Genomic landscape: Gains, losses and uniparental disomy

Individually, the study of genomic dimensions has yielded a global description of cancer genomes in terms of gene dosage, allelic status and somatic mutation. Collectively, however, the integration of these three dimensions has brought two concepts to the forefront: allele specific copy number alterations and uniparental disomy (UPD) (Figure 5.2). Typically, the relationship between somatic mutation and allele specific copy number alteration has been associated with tumor suppressor genes (e.g. RB1 and TP53), whereby mutation is combined with loss to achieve bi-allelic inactivation [107, 108]. However, recent studies have shown preferential amplification of alleles encoding mutated oncogenes as well [109-114]. In non-small cell lung cancer, mutant allele specific imbalance (MASI) is frequently present in EGFR- and KRAS-mutant tumor cells, and is associated with increased mutant allele transcription and gene activity [114].

UPD is the presence of two copies of a chromosome segment from one parent, and the absence of that DNA from the other parent. Somatic UPD, also known as copy neutral LOH (CNLOH), results in loss of heterozygosity (tumor versus normal) without a change in total DNA copy number [115-117]. UPD is observed at tumor suppressor gene loci, whereby upon loss of the wild type allele, the mutated allele is duplicated, resulting in a diploid state with homozygous mutation of the target gene [118]. Interestingly, UPD events are also detected at mutated oncogenes [114, 119-121]. Until recently, due to limitations in the resolution of genomic array platforms, the prevalence of this event was widely underestimated and underappreciated.

Recent studies have shown that UPD events are frequently observed in tumor genomes, with most of the findings reported from hematological malignancies [122-131]. Our genome wide analysis of segmental gain, loss and UPD in the T47D breast cancer cell line showed that a significant portion of the genome exhibits UPD, rivaling the proportion of the genome affected by segmental gain and loss, and highlighting the potential of UPD as a prominent mechanism of gene disruption in epithelial cancer (Figure 5.3). Interestingly, PIK3CA and TP53 mutations in T47D are noted in the Catalogue of Somatic Mutations in Cancer [132]. Integrative analysis at these loci detected copy number increase at PIK3CA and copy number loss at TP53, illustrating the MASI concept described above (Figure 5.3).

Somatic UPD also exists at genes without mutation. The potential significance of this somatic event is not readily apparent, but it raises the intriguing possibility of allelic conversion of epigenetic status [117, 122, 133].

5.3 Epigenomic alterations

5.3.1 The cancer methylome

Abnormal DNA methylation patterns occur in cancer, whereby focal hypermethylation at many CpG islands is evident in a background of global DNA hypomethylation [134-137]. Broad hypomethylation may lead to genomic instability, while hypermethylation of CpG islands silences transcription of specific genes [136, 138-140]. Non-random methylation of multiple CpG islands observed in colon cancer led to the discovery of the CpG island methylator phenotype (CIMP), which is causally linked to microsatellite instability via silencing of the mismatch repair gene MLH1 [141-143].

The determination of DNA methylation status relies on the ability to discriminate between methylated and unmethylated cytosines. This is achieved by exploiting methylation-sensitive/insensitive isoschizomer restriction-enzyme pairs [144-150], chemical conversion of unmethylated cytosine to uracil [151-156], and the affinity for methylated DNA of specially developed antibodies and methylated-DNA binding proteins [24, 157-163]. Several computational methods have been developed for deriving approximations of actual methylation levels from the relative levels generated by most microarray and locus specific sequencing assays [147, 162, 164, 165]. However, it is important to note that CpG targets represented on microarrays may or may not be the only elements controlling gene expression. Recently, it was shown that, in the human colon cancer methylome, sequences up to 2 kb away from CpG islands, termed CpG shores, exhibited more methylation than CpG islands and had greater influence on gene expression than CpG islands [166]. Furthermore, while excess promoter methylation is typically associated with transcriptional repression, the loss of required methylation within gene bodies, proximal to promoters, can have the same effect [167]. DNA methylation of epigenetic neighborhoods in the megabase size range has also been reported [168]. Validation of methylation-mediated control of gene-specific expression, and evaluation of biological significance, can be achieved via pharmacologic manipulation of DNA methylation, for example by 5-azacytidine treatment, to relieve methylation silencing and invoke re-expression [20, 169].
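For the chemical conversion approach, a per-site methylation estimate follows directly from sequencing reads: unmethylated cytosines converted to uracil are read as thymine after amplification, whereas methylated cytosines are still read as cytosine. The sketch below assumes that the bases observed at a single CpG cytosine have been collected into a string; the function name and input format are hypothetical, and this is not one of the published methods cited above.

```python
def methylation_level(bases_at_cytosine: str) -> float:
    """Estimate the methylated fraction at one cytosine from bisulfite reads.

    'C' calls are taken as methylated (protected from conversion) and 'T'
    calls as unmethylated (converted); any other base call is ignored.
    """
    calls = bases_at_cytosine.upper()
    methylated = calls.count("C")
    unmethylated = calls.count("T")
    informative = methylated + unmethylated
    if informative == 0:
        raise ValueError("no informative reads at this position")
    return methylated / informative


# Invented pileup: three reads retain C, one reads T, one is an uninformative N
level = methylation_level("CCCTN")
```

The estimate for the invented pileup is 0.75; in practice, incomplete conversion and sequencing error would also have to be accounted for.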

The first single-base-resolution maps of the human methylome have recently been generated by sequencing of bisulfite converted DNA from human embryonic stem cells and fetal fibroblasts [12, 170]. This landmark study will greatly advance the analysis of DNA methylation by providing whole genome reference maps of methylation in these specific cells. However, it is well known that DNA methylation is tissue specific and that it changes throughout development; thus, methylome maps for all tissues at various stages of development may be necessary to provide adequate maps of 'normal' methylation patterns for use in deciphering aberrant methylation patterns characteristic of tumors [171-176]. In recognition of this, the Human Epigenome Project was launched in 2004 to map the methylomes of all major human tissues [177].

5.3.2 Integration of cancer genomic and epigenomic events

DNA methylation and genomic instability. Cancer-specific aberrant DNA methylation is associated with reduced genomic stability and subsequent copy number alterations, including preferential loss of certain imprinted alleles (LOI) [178-184]. Mechanistically, this instability may be related to the susceptibility of hypomethylated DNA to undergo inappropriate recombination events [185]. Another mechanism known to negatively impact genomic integrity in lung cancer is the relaxation of transposable element control that is mediated by DNA methylation [186-190].

DNA hypomethylation and DNA amplification. Preliminary evidence of specific demethylation of somatic segmental amplifications (or amplicons) has been put forth in lung cancer, perhaps representing a novel mechanism of aberrant oncogene activation [189, 191]. Further studies using large-scale sequencing of bisulfite treated DNA will help to clarify this phenomenon [12]. Hypomethylation has also been implicated in the formation of specific copy number alterations in glioblastoma multiforme [192]. One potentially interesting application for DNA methylation profiling of cancer amplicons such as these is in the discrimination between "driver" and "passenger" genes within the amplified sequence. It may be that DNA methylation within the promoters or gene bodies of these genes is responsible for the lack of uniform overexpression of genes residing within amplicons.

DNA hypermethylation and copy number loss. The relationship between DNA hypermethylation and allelic loss is well documented. Tumor suppressor genes (TSGs) are frequently found in regions of common LOH, and these same TSGs are frequently found to be hypermethylated, perhaps best exemplified by the FHIT gene on chromosome 3p [193]. Although it is unclear whether loss or hypermethylation occurs first, both are known to be very early events in tumorigenesis, preceding any histologic alterations [194-196]. With the advent of high resolution genome-wide technologies, it has become possible to comprehensively search for genes that are inactivated by both mechanisms simultaneously [197].

Histone modification states. While DNA methylation and gene dosage profiling technologies have become accessible, technologies for global assays of other key epigenetic marks, including histone modifications, are not widely available. One of the main challenges in conducting high quality genome-wide chromatin immunoprecipitation experiments on microarray (ChIP-chip) or sequencing (ChIP-seq) platforms is the requirement for high quality DNA from pure cell populations, which essentially means growing cells in culture. It is thus difficult to analyze these dimensions in clinical specimens. However, much has been learned from studies of the relationship between different histone modification states and transcriptional activation or repression in model systems. Examples utilizing ChIP-chip include: cell or context specific histone modification patterns related to cell or context specific gene expression; histone 3 lysine 27 (H3K27) trimethylation patterns associated with prostate, lung and breast cancers; and H3K9 and H3K79 modification patterns in leukemia [198-204]. Examples utilizing ChIP-seq include the analysis of the growth inhibition program of the androgen receptor and the chromatin interaction network of the estrogen receptor [205, 206].

5.4 Relating genetic and epigenetic events to changes in the transcriptome through integrative analysis

Aberrations in individual genetic or epigenetic dimensions are prominent across various cancer types, culminating in changes to the transcriptome. However, for a given gene, most of the events documented previously, such as copy number amplification, homozygous deletion, somatic mutation, or DNA hypermethylation, do not occur in 100% of tumors for a given cancer type. Moreover, it has been observed that the same gene may be activated or inactivated by different mechanisms. Since most of the studies described above analyzed single DNA dimensions, it is likely that many genes would be overlooked due to a low frequency of alteration in a single dimension; the same gene may be detected at a high frequency when multiple dimensions are considered. Thus, analysis of more dimensions may reveal higher frequency gene-specific disruption with corresponding transcriptome aberrations for particular cancer types, as would be expected for genes causative to cancer development.

5.4.1 Multiple mechanisms of gene disruption

Expression profiling studies have been instrumental in detecting genes dysregulated in cancer [207-209]. However, aberrant expression of some genes may simply reflect incidental genome instability or secondary dysregulation. Global gene expression profiling alone may not distinguish causal events from bystander changes. One of the first studies to relate gene expression changes to gene dosage status on a global scale was a parallel analysis of DNA and mRNA [66, 210]. The same cDNA microarray platform was used to investigate the impact of DNA copy number alterations on the expression of over 6,500 genes. This study determined that 62% of genes located within regions of DNA amplification showed elevated expression in breast cancer. Subsequent studies in other cancer types revealed a broad range in the correlation between increased gene dosage and expression levels for protein coding genes (19% to 62%) [92, 207, 210-213]. Studies integrating gene dosage and gene expression have identified cancer subtype-specific pathway activation and signatures associated with clinical outcome [96, 214-217]. In addition, when examining known disease-relevant pathways, it has been shown that even though individual components of a pathway are disrupted at a low frequency, collectively, these alterations can result in frequent disruption of a given pathway [16, 92]. Similarly, alterations in DNA methylation or histone modification status can also affect gene expression and have subsequent pathway level consequences (see above).

5.4.2 Multiple mechanisms of disrupting non-coding RNA levels

Segmental DNA copy number alterations also affect the expression of non-coding RNAs (ncRNA) [218-222]. MicroRNAs (miRNA) have been shown to have a significant role in cancer development, with specific miRNAs implicated in a number of different cancer types [26, 223-225]. Specific miRNA expression signatures are associated with critical steps in tumor initiation and development including cell hyperproliferation, angiogenesis, tumor formation and metastasis [226]. High throughput analysis of miRNAs has been of interest, and microarrays have been developed to assess essentially all annotated miRNAs. To date, >700 miRNAs have been annotated in the genome (http://mirdb.org/miRDB/statistics.html, [227]), with more likely to be discovered. As an example of the effect of copy number on miRNA levels, we recently demonstrated that a deletion on chromosome 5q leads to the reduced expression of two miRNAs that are abundant in hematopoietic stem/progenitor cells. This study revealed haploinsufficiency and reduced expression of miR-145 and miR-146a as mediators of a subtype of myelodysplastic syndrome [221]. Although genomic loss and underexpression implicate a tumor-suppressive role for these specific miRNAs, others undergo activating genomic alterations and elevated expression and hence are thought to be oncogenic [228, 229].

Just as copy number alterations can alter miRNA activity, epigenetic alterations have also been shown to affect miRNA expression [230-232]. Aberrant methylation of miRNAs has been reported in a variety of cancer types, and the disruption of epigenetically-mediated miRNA control has been shown to have oncogenic effects due to downstream gene deregulation [233]. For example, abnormal DNA methylation of miRNAs has been associated with tumor metastasis, leading to the appreciation of a group of metastasis-related miRNAs [229].

5.4.3 Multi-dimensional integration of genome, epigenome, and transcriptome

Large scale initiatives. Since multiple genomic/epigenomic mechanisms can influence gene expression and lead to disruption of a given function, an integrative multi-dimensional analysis is necessary for a more comprehensive understanding of the cancer phenotype (Figure 5.4). Specific programs and initiatives such as The Cancer Genome Atlas (TCGA) project and the cancer Biomedical Informatics Grid (caBIG) enable parallel and multi-dimensional analysis of cancer genomes [8, 16] (Table 5.2). Recently, studies in glioblastoma and osteosarcoma have shown that integrative genomic and epigenomic approaches can indeed reveal the specific genetic pathways involved in different cancers [16, 234].

Gene disruption by multiple mechanisms. One of the two key reasons for using an integrative approach is the ability to detect critical genes that are disrupted by multiple mechanisms across a sample set, but are disrupted at a low frequency by any one mechanism. These genes would have been overlooked in previous, single dimensional studies. The second key advantage of integrative approaches is the ability to identify genes that are simultaneously disrupted by multiple mechanisms (two hits) in a single sample. Using a dataset comprising DNA copy number, allelic status, DNA methylation, and gene expression profiles from ten lung adenocarcinomas and matched non-malignant tissue controls, we illustrate these benefits below.

If gene expression changes are a consequence of alterations at the DNA level, then a higher proportion of the observed expression changes can be directly attributed to a defined causal event when multiple types of DNA alterations are examined (Figure 5.5a). While some samples have over 70% of their expression changes associated with DNA level changes (Sample 7, Sample 8), other samples have only 30% (Sample 5, Sample 9). Additionally, as a consequence of associating more gene expression changes with DNA level changes within a sample, more disrupted genes are detected and, in turn, more disrupted pathways are identified across a sample set (Figure 5.5b, 5.5c). In fact, in our example, nearly five times as many genes (~1100 compared to ~200) are detected as disrupted in at least 50% of the samples when we account for multiple mechanisms of disruption rather than one mechanism alone (Figure 5.5b). This result illustrates that, without an integrative approach, many potentially important genes would be dismissed because they are disrupted by low frequency events when a single DNA dimension is analyzed. This also holds true at the pathway level when the identified genes are grouped based on their biological function (Figure 5.5d). For example, the Hepatic Fibrosis/Hepatic Stellate Cell Activation pathway and the RAR Activation pathway, which are identified when all DNA dimensions are considered, would not be detected as significantly altered when using individual DNA dimensions alone.
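The frequency argument above can be made concrete with a minimal sketch. It assumes that, for each gene and each DNA dimension, the samples carrying an alteration concordant with the expression change have already been called; the gene names, dimension labels and sample identifiers below are hypothetical toy values, not data from this study. A sample is counted once in the combined frequency regardless of how many mechanisms affect the gene.

```python
from typing import Dict, List, Set

# dimension name -> IDs of samples altered in that dimension (toy input)
GeneCalls = Dict[str, Set[str]]


def disruption_frequency(calls: GeneCalls, n_samples: int) -> Dict[str, float]:
    """Return the disruption frequency per dimension and combined ("all")."""
    freqs = {dim: len(samples) / n_samples for dim, samples in calls.items()}
    # Union across dimensions: a sample counts once, whatever the mechanism.
    combined = set().union(*calls.values())
    freqs["all"] = len(combined) / n_samples
    return freqs


def frequently_disrupted(dataset: Dict[str, GeneCalls], n_samples: int,
                         threshold: float = 0.5,
                         dimension: str = "all") -> List[str]:
    """List genes whose frequency in the chosen dimension meets the threshold."""
    return sorted(
        gene for gene, calls in dataset.items()
        if disruption_frequency(calls, n_samples)[dimension] >= threshold
    )


# Hypothetical toy calls for three genes in a ten-sample cohort.
toy = {
    "GENE_A": {"copy_number": {"S1", "S2"},
               "methylation": {"S3", "S4"},
               "cnnloh": {"S5"}},
    "GENE_B": {"copy_number": {"S1", "S2", "S3", "S4", "S5", "S6"},
               "methylation": set(),
               "cnnloh": set()},
    "GENE_C": {"copy_number": {"S1"},
               "methylation": {"S1"},
               "cnnloh": set()},
}

if __name__ == "__main__":
    print(frequently_disrupted(toy, 10, dimension="copy_number"))
    print(frequently_disrupted(toy, 10, dimension="all"))
```

In this toy input, GENE_A never exceeds 20% in any single dimension yet reaches the 50% threshold once the dimensions are combined, which is the situation described in the text.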

Implications on sample size requirements. In the example above, we illustrate that a significant number of genes and pathways exhibit a low frequency of disruption when single dimensions are examined (and thus would be overlooked) but exhibit a high frequency of disruption when multiple dimensions are considered (Figure 5.5). Notably, these findings imply that integrative multi-dimensional analysis of individual samples may directly impact the cohort sample size required for gene discovery on the basis of frequency of disruption (Figure 5.5e). A reduction in sample size requirements means that one can extend this approach to situations involving rare specimens, where accrual of hundreds of samples in a reasonable timeframe is not possible. Moreover, reduced sample sizes are particularly applicable to familial cancers or to isolated populations at increased risk for specific cancers.
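One way to illustrate the sample size argument is with a simple binomial model. The sketch below is not part of the analysis in this chapter: it assumes that samples are independent, that a gene has a fixed true disruption frequency p, and that "discovery" means observing the gene as disrupted in at least k samples. The criterion k = 3 and the target probability of 0.8 are hypothetical choices made only for illustration.

```python
from math import comb
from typing import Optional


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_cohort_size(p: float, k: int = 3, power: float = 0.8,
                    max_n: int = 10000) -> Optional[int]:
    """Smallest cohort in which a gene disrupted at true frequency p is
    seen in at least k samples with probability >= power (None if not
    reached within max_n samples)."""
    for n in range(k, max_n + 1):
        if prob_at_least(k, n, p) >= power:
            return n
    return None


if __name__ == "__main__":
    # Compare a gene seen at 10% in one dimension with the same gene
    # seen at 50% once all dimensions are combined (hypothetical values).
    for freq in (0.1, 0.5):
        print(freq, min_cohort_size(freq))
```

Under these assumptions, the required cohort shrinks substantially as the observed frequency rises, which is the intuition behind applying the integrative approach to rare specimens.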

Bi-allelic gene disruption. Two-hit bi-allelic inactivation of genes and high level gene amplifications are typically considered to be causal mechanisms that inflict gene expression changes. When examining multiple DNA dimensions, concerted bi-allelic disruption of a gene in the same sample can be readily identified; copy number loss with hypermethylation resulting in underexpression, or copy number gain with hypomethylation and overexpression are examples.

Indeed, we do identify genes harboring concerted disruptions using the same lung adenocarcinoma dataset mentioned above. The MUC1 locus exhibits concurrent copy number increase with hypomethylation and overexpression (Figure 5.4). MUC1 has previously been shown to be important in lung and breast cancers and is currently a target for therapeutic intervention [235-237]. Collectively, we have demonstrated how an integrative, multi-dimensional approach can be utilized for cancer gene and pathway discovery.
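The two concerted patterns named above can be expressed as a simple rule over per-sample calls. This is a minimal sketch that assumes each dimension has already been reduced to a categorical call for the gene in one sample; the category strings and the returned labels are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class GeneCall:
    """Categorical calls for one gene in one sample (hypothetical encoding)."""
    copy_number: str  # "gain", "loss" or "neutral"
    methylation: str  # "hypo", "hyper" or "none"
    expression: str   # "over", "under" or "none"


def concerted_disruption(call: GeneCall) -> str:
    """Classify a call as concerted inactivation, concerted activation or neither."""
    if (call.copy_number == "loss" and call.methylation == "hyper"
            and call.expression == "under"):
        # Copy number loss plus hypermethylation with underexpression.
        return "concerted inactivation"
    if (call.copy_number == "gain" and call.methylation == "hypo"
            and call.expression == "over"):
        # Copy number gain plus hypomethylation with overexpression.
        return "concerted activation"
    return "none"


if __name__ == "__main__":
    # A call shaped like the MUC1 pattern described in the text.
    print(concerted_disruption(GeneCall("gain", "hypo", "over")))
```

A call with copy number gain and hypomethylation but no expression change returns "none", reflecting the requirement that the DNA level events be accompanied by a matching change in expression.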

5.4.4 Disruption of multiple components in biological pathways

We described above how an integrative, multi-dimensional approach improves the detection of disrupted genes, especially those affected by multiple low frequency mechanisms. This concept can be extended to identify biological pathways in which multiple pathway components are disrupted at low frequencies (see above; Figure 5.5d). The EGFR signaling pathway is a well documented dysregulated component of lung cancer. Using the same multi-dimensional profiling dataset as in Figure 5.5 above, seven genes were detected with gene dosage alteration at a frequency ≥30%. However, when we considered alterations in gene dosage, allelic status, DNA methylation and somatic mutation (the latter for KRAS and EGFR only) collectively, 18 genes in the pathway were identified as altered at ≥30% frequency (Figure 5.6). The detection of the additional 11 genes illustrates the benefit of employing an integrative approach and extends the sample size reduction argument to the pathway level.

5.4.5 Identification of a novel gene involved with EGFR signaling deregulated in adenocarcinoma

In the section above, I have shown that more of the well known pathway components are found to be frequently altered when multiple DNA dimensions are examined than when a single DNA dimension, such as DNA copy number, is examined alone. When this analysis was expanded to include more genes based on literature evidence, I found that the most frequently disrupted gene was signal-regulatory protein alpha (SIRPA) (Figure 5.7).

SIRPA has been shown to be down-regulated when EGFR is activated in glioblastoma and up-regulated when EGFR is suppressed [238, 239]. SIRPA has also been shown to be a tumor suppressor gene in multiple cancer types including liver and breast cancer [240, 241]. Moreover, in the resting lung, SIRPA is thought to modulate the inflammatory response through SHP-1 (also known as PTPN6) and, eventually, NFKB [242]. While most studies have documented the association of SIRPA with SHP-2 (also known as PTPN11) [243-245], few studies have shown the association of SIRPA with SHP-1.

To discern the association of SIRPA with SHP-1 and SHP-2, mutual information network analysis was utilized [246, 247]. Briefly, our gene expression dataset and a publicly available dataset [248], both Affymetrix exon array datasets, were normalized separately using the aroma.affymetrix package (Bengtsson et al 2008 Berkeley). Subsequently, each dataset was analyzed using the "minet" package in the Bioconductor suite in R [247, 249]. In each analysis, a score was calculated between each gene and every other gene. The top 5% of gene-gene interactions (based on the score) were identified in each dataset, and only those interactions that were in the top 5% of both analyses were retained. Finally, gene-gene interactions involving SIRPA were extracted, resulting in a total of 310 genes found to highly correlate with SIRPA expression (Table 5.3). Within this list of genes, PTPN6 was present and PTPN11 was not, suggesting that SIRPA signaling likely involves PTPN6 rather than PTPN11 in lung adenocarcinoma.
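The filtering logic of this procedure (score all gene pairs, keep the top fraction in each dataset, intersect, then extract partners of one gene) can be sketched as follows. This is an illustration only and does not reproduce the minet analysis: it swaps in a plain empirical mutual information estimate on equal-width bins, and the function names, bin count and toy expression values are all assumptions introduced here.

```python
from collections import Counter
from itertools import combinations
from math import log2


def discretize(values, bins=3):
    """Equal-width binning of one gene's expression values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against constant vectors
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y):
    """Empirical mutual information (bits) between two discrete vectors."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def top_edges(expr, fraction=0.05, bins=3):
    """Return the highest-scoring fraction of gene pairs (at least one)."""
    binned = {g: discretize(v, bins) for g, v in expr.items()}
    scored = sorted(
        ((mutual_information(binned[a], binned[b]), a, b)
         for a, b in combinations(sorted(binned), 2)),
        reverse=True,
    )
    keep = max(1, int(len(scored) * fraction))
    return {frozenset((a, b)) for _, a, b in scored[:keep]}


def shared_partners(target, expr_1, expr_2, fraction=0.05):
    """Genes paired with `target` in the top edges of both datasets."""
    shared = top_edges(expr_1, fraction) & top_edges(expr_2, fraction)
    return sorted(g for edge in shared if target in edge
                  for g in edge if g != target)


# Hypothetical toy datasets: gene -> expression across six samples.
dataset_1 = {"TARGET": [1, 2, 3, 4, 5, 6], "PARTNER": [2, 4, 6, 8, 10, 12],
             "G3": [5, 1, 4, 2, 6, 3], "G4": [3, 6, 1, 5, 2, 4]}
dataset_2 = {"TARGET": [10, 20, 30, 40, 50, 60], "PARTNER": [6, 5, 4, 3, 2, 1],
             "G3": [2, 6, 1, 5, 3, 4], "G4": [4, 1, 6, 2, 5, 3]}

if __name__ == "__main__":
    print(shared_partners("TARGET", dataset_1, dataset_2))
```

Requiring an interaction to rank highly in two independently normalized datasets is a conservative design choice: an edge driven by noise or batch structure in one dataset is unlikely to recur in the other.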

5.4.6 Prevalence of SIRPA deregulation and association with clinical characteristics

Given that the sample set we examined comprised only 10 samples, we next assessed expression in a larger panel of samples to validate the frequency of underexpression observed in the initial set. Using 59 lung adenocarcinoma and matched non-malignant sample pairs, the prevalence of SIRPA underexpression was assessed and the correlation of SIRPA and PTPN6 was re-evaluated. It was found that 47/59 pairs exhibited at least a 1.5-fold reduction of SIRPA in tumors as compared to matched non-malignant tissue, representing ~80% of tumors assessed (Figure 5.8a). In addition, SIRPA and PTPN6 expression were strongly correlated, with a Pearson correlation coefficient of 0.907 (Figure 5.8b).
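The two quantities reported above, the fraction of pairs with at least a 1.5-fold reduction and the Pearson correlation between fold changes, can be computed as in the following minimal sketch. It assumes paired, already normalized expression values for tumor and matched non-malignant tissue; the function names are hypothetical and the values in the usage lines are toy numbers, not measurements from this study.

```python
from math import log2, sqrt


def log2_fold_changes(tumor, normal):
    """Per-pair log2(tumor / normal) for matched samples."""
    return [log2(t / n) for t, n in zip(tumor, normal)]


def fraction_underexpressed(log2_fc, fold=1.5):
    """Fraction of pairs with at least a `fold` reduction in tumor."""
    cutoff = -log2(fold)
    return sum(fc <= cutoff for fc in log2_fc) / len(log2_fc)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Toy paired values (hypothetical).
    gene_1 = log2_fold_changes([1.0, 2.0, 4.0, 8.0], [4.0, 4.0, 4.0, 4.0])
    gene_2 = log2_fold_changes([2.0, 3.0, 9.0, 20.0], [8.0, 8.0, 8.0, 8.0])
    print(fraction_underexpressed(gene_1))
    print(pearson(gene_1, gene_2))
```

Working on the log2 scale makes a 1.5-fold reduction and a 1.5-fold increase symmetric about zero, which is why the threshold lines in Figure 5.8a sit at equal distances above and below the axis.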

It should also be noted that a small number of samples exhibited overexpression of SIRPA. This finding was somewhat unexpected given the high prevalence of underexpression observed in the initial dataset. However, the initial ten tumors were from individuals who were former smokers, whereas the set of 59 tumors comprised 23 current smokers, 21 former smokers and 15 never smokers. When the differential gene expression was stratified by smoking status, overexpression was not observed in any of the current or former smokers and was observed only in a subset of never smokers (Figure 5.8c).

Finally, using publicly available microarray datasets with patient survival information [250-252], Kaplan-Meier analysis was performed on each dataset based on SIRPA expression levels. The association was deemed significant if the p-value was ≤ 0.05 based on a Mantel-Cox (log-rank) test. Two of the five datasets showed a statistically significant association between SIRPA expression levels and overall patient survival, with an additional two datasets close to significance with p-values ≤ 0.18 (Figure 5.9).
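The survival comparison can be sketched in a few functions: split samples into the lowest and highest thirds by expression (as described in the caption of Figure 5.9), estimate Kaplan-Meier curves, and compare the two groups with a two-group log-rank test. This is a minimal illustration under stated assumptions, not the code used for the analysis; the function names are hypothetical, events are encoded as True for death and False for censoring, and the p-value uses the chi-square distribution with one degree of freedom.

```python
from math import erfc, sqrt


def split_by_expression(expression, times, events):
    """Return (times, events) for the bottom and top thirds by expression."""
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    k = len(order) // 3

    def pick(idx):
        return [times[i] for i in idx], [events[i] for i in idx]

    return pick(order[:k]), pick(order[-k:])


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    at_risk, surv, curve = len(times), 1.0, []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Everyone with time t (event or censored) leaves the risk set.
        at_risk -= sum(1 for ti in times if ti == t)
    return curve


def log_rank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank (Mantel-Cox) p-value, chi-square with 1 df."""
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times_a + times_b,
                                            events_a + events_b) if e})
    for t in event_times:
        n_a = sum(ti >= t for ti in times_a)
        n_b = sum(ti >= t for ti in times_b)
        d_a = sum(1 for ti, e in zip(times_a, events_a) if ti == t and e)
        d_b = sum(1 for ti, e in zip(times_b, events_b) if ti == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed += d_a
        expected += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed - expected) ** 2 / variance
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df


if __name__ == "__main__":
    # Toy follow-up times in months (hypothetical), all events observed.
    short = ([1, 2, 3, 4, 5], [True] * 5)
    long_ = ([10, 11, 12, 13, 14], [True] * 5)
    print(log_rank_p(*short, *long_))
```

Comparing only the extreme thirds discards the middle of the distribution, trading sample size for a cleaner contrast between low and high expressers.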

5.5 Tracking clonal expansion in spatial dimensions

Delineating the clonal relationship between multiple tumors in the same patient is relevant not only to clinical management of disease but also to the understanding of metastasis. Multiple tumors in the same patient may not necessarily share an identical genomic profile. The similarities and differences in genomic landscape between tumors are quantifiable and therefore can be used for delineating relatedness. Whole genome comparison based on array CGH profiles is a new tool for distinguishing metastatic from primary synchronous carcinomas. A multitude of genomic features, for example the boundaries of segmental deletions, are used to delineate the presence and the sequence of events in clonal evolution [253-261].

Furthermore, signature genetic alterations can be used to track clonality in a cell population, putting genetic events in the context of tumor tissue architecture. By assessing the appearance of pre-selected markers in individual nuclei on a tissue section by FISH, the clustering and the expansion of clonally related cells can be delineated by analyzing the marker patterns of neighboring cells (Figure 5.10).

5.6 Evaluating the biological significance of integrative genomics findings

The utilization of an integrative genomic, epigenomic and transcriptomic approach will undoubtedly improve our ability to identify gene disruptions and their effects on gene expression. The next challenge is to develop high throughput approaches for obtaining functional and phenotypic evidence of the biological relevance of such gene disruptions, for example, functional genomic screens by RNAi, proteomic profiling and metabolite profiling. Forced expression of genes and RNAi knockdown of gene expression are commonly used methods for assessing growth and invasion phenotypes in cell models.

Genome wide RNAi screens, comprised of large libraries of short hairpin RNA sequences redundantly targeting thousands of genes, have been used to identify genes essential to tumorigenesis, including tumor suppressor genes as well as genes that cooperate with oncogenic mutations in several malignancies [22, 28, 29, 262-270]. Animal models are also instrumental to functional validation of genes singly or in combination, but this topic is beyond the scope of this chapter. Cross referencing genomic findings with proteomic profiles will determine the functional consequences, yielding information on expression levels, post-translational modification, and protein-protein interactions [271-275]. As recent studies have highlighted the importance of the metabolome in cancer, the genomic landscape can also be integrated with metabolome profiles to determine the role of genetic and epigenetic alterations in cellular physiology relevant to cancer development [276-278].

The progress made in the development of technologies and approaches to analyze the genome, epigenome, and transcriptome has allowed for a much improved understanding of cancer landscapes. With the increased application of sequence based approaches to analyze genetic and epigenetic dimensions, and the additional complexity of the proteome and metabolome to follow, an unprecedented definition of the cancer cell can be achieved. The next key challenge will be the synthesis of this information to better understand fundamental cancer processes such as progression, metastasis and drug resistance.

[Figure 5.1 timeline, entries and year labels in the order shown:]
DNA nanoballs sequencing technology [3]
Breast, lung & skin cancer genomes sequenced [4-6,11]
Human methylomes sequenced [12]
2009
Acute myeloid leukemia genome sequenced [13,14]
International Cancer Genome Consortium initiated
1000 genomes project launched [15]
Integrative study of glioblastoma [16]
Genome RNAi database established [9,10]
2007
2nd generation human haplotype map with >3M SNPs [17]
Next generation, massively parallel sequencing technologies
Copy number variation maps [18,19]
5-Azacytidine re-expression of methylated cancer genes [20]
Exome sequencing mutation detection [21]
The RNAi Consortium (TRC) [22]
2006
Bead Arrays for bisulfite DNA methylation [23]
NIH Cancer Genome Atlas (CGA) initiated
Methylome map by MeDNA immunoprecipitation [24]
2005
First human genome haplotype map [25]
MicroRNA expression profiles classify cancers [26]
Catalogue of somatic mutations in cancer (COSMIC) [27]
Large scale RNAi-based screens [28,29]
Cancer Gene Census published [30]
Large scale copy number variation in humans [31,32]
2004
Whole genome tiling path CGH microarrays [33]
Tiling path analysis of human transcribed sequences [34]
Cancer Biomedical Informatics Grid (caBIG) launched [8]
The Ensembl genome database project [35]
The human genome browser at UCSC [36]
BeadArray genotyping platforms [37]
2002
Concept of oncogene addiction [38,39]
First human genome sequences [40,41]
CGAP launched [42,43]

Figure 5.1. Advances in cancer genomic landscape post Y2K. The timing of each event is estimated from its time of publication.

[Figure 5.2 panels: (a) Neutral, (b) Gain, (c) UPD. Upper track: total copy number (# of copies); lower track: allele specific copy number (# of copies).]

Figure 5.2. SNP array analysis to identify areas of altered copy number and allelic composition in a clinical lung cancer specimen. Shown here are (a) a region that is copy neutral with no observed allelic imbalance and regions containing a (b) segmental gain and (c) UPD. Examining the allele specific copy number plot, the gain (in b) is likely a single copy change and the UPD event (in c) is signified by the shift in allele levels while maintaining total copy number neutral status.

[Figure 5.3: ideogram of chromosomes 1 to 22 with regions of gain, loss and UPD overlaid; the PIK3CA and TP53 loci are marked.]

Figure 5.3. Overlay of chromosomal regions of gain, loss and UPD (copy number neutral LOH) inherent to the T47D breast cancer cell line. The chromosomal loci for PIK3CA and TP53 (modified by activating and inactivating mutations, respectively, in this cell line) are indicated. The majority of the genome is affected by at least one of the three genomic alterations. Raw SNP 6.0 array data were obtained from the Sanger database with mutation status obtained from the COSMIC database [132]. Copy number and allelic status changes were determined using Partek Genomics Suite, with 72 individuals from the HapMap collection used as reference genomes. Data were visualized using the SIGMA2 software [7].

Figure 5.4. Integration of copy number, allelic status, DNA methylation, and gene expression for a single lung adenocarcinoma sample. (a) Copy number and (b) allele status analyses revealed a high level allele-specific DNA amplification (highlighted in yellow, image generated with Partek Genomics Suite); (c) individual CpG loci within this region were assessed for differential methylation between tumor and non-malignant tissue. Hypomethylation at the indicated CpG locus, which corresponds to the MUC1 gene, is observed (visualized with Genesis). (d) Expression analysis revealed four-fold overexpression of the MUC1 transcript when a tumor sample was compared to matched, adjacent non-malignant tissue. Copy number and allele status profiling was performed using the Affymetrix SNP 6.0 array; DNA methylation profiling using the Illumina Infinium HM27 platform; and gene expression using the Affymetrix Human Exon 1.0 ST array.

[Figure 5.4 panels: (a) total copy number, showing amplification (# of copies); (b) allele specific copy number (# of copies); (c) DNA hypomethylation at MUC1, normal vs. tumor; (d) overexpression, relative expression in normal vs. tumor.]

Figure 5.5. Enhanced analysis of the cancer phenotype using an integrative and multi-dimensional approach. (a) On average, a higher proportion of differential gene expression can be associated with genomic alterations when examining multiple DNA dimensions relative to single dimensions. (b) Using a fixed frequency threshold of 50%, more genes are revealed to be frequently disrupted when multiple mechanisms of genomic alteration (e.g. altered copy number, DNA methylation, or copy number neutral LOH) are considered (~200 genes versus more than 1000 genes). (c) Pathway analyses performed using gene lists derived from a multi-dimensional approach identify a greater number of aberrant pathways than those performed using a uni-dimensional approach. (d) Functional pathways identified using the integrated gene list are of relatively high significance; the top 10 such pathways are shown. This suggests that the additional identified genes associate with specific pathways rather than with random functions. The four bars represent, from left to right: all dimensions, copy number, DNA methylation, and UPD. Ingenuity Pathway Analysis was used for analyses in (c) and (d). (e) Example of two genes that are missed when a single DNA dimension is studied, but captured when multiple DNA dimensions are examined. Both ribonucleotide reductase M2 (RRM2) [279, 280] and retinoic acid receptor responder (tazarotene induced) 2 (RARRES2) [281, 282] are known to be deregulated in multiple cancer types.

[Figure 5.5 panels: (a) proportion of differentially expressed genes associated with copy number, DNA methylation, copy number neutral LOH (CNNLOH) and all dimensions, for Samples 1 to 10 and their average; (b) # of genes identified by copy number, DNA methylation, CNNLOH and all dimensions; (c) # of significant pathways by copy number, DNA methylation, CNNLOH and all dimensions; (d) -log(p-value) for Acute Phase Response Signaling, Oncostatin M Signaling, IL-8 Signaling, RAR Activation, Reelin Signaling in Neurons, CXCR4 Signaling, Hepatic Fibrosis / Hepatic Stellate Cell Activation, Macropinocytosis, Complement System and Leukocyte Extravasation Signaling; (e) frequency of disruption (%) of RARRES2 and RRM2 by copy number, DNA methylation, CNNLOH and all dimensions.]

[Figure 5.6: EGFR signaling pathway diagram. Nodes: EGF, TGFA, EGFR, ERBB2, SHC1, MUC1, GRB2, PLCG1, PIK3R1, KRAS, SOS2, RRAS, RASSF5, RASSF1, MST1, ITPR1, PDK1, CCND1, AKT1, AKT2, RAF1, PRKCA, MAP2K1, CASP9, FOXO3, BAD, DUSP4, MAPK1 and MYC, with the second messengers IP3, PIP2, PIP3, DAG and Ca2+. Outcomes: apoptosis, proliferation, cell cycle and differentiation.]

Figure 5.6. Identification of multiple disrupted components in a biological pathway. Integrative analysis identifies more genes affected in the EGFR signaling pathway than a single dimensional analysis alone. In this example, multi-dimensional profiling data were generated from ten lung adenocarcinomas and their paired non-cancerous lung tissue. Analysis of DNA copy number (gene dosage) alterations that affected expression identified 7 genes (in green) that are disrupted at ≥ 30% frequency. However, when alterations in copy number, DNA methylation, sequence mutation and/or copy neutral LOH were considered, 17 genes disrupted at ≥ 30% frequency were identified to be associated with a change in expression, with an additional gene, KRAS, harboring frequent mutation. The 11 additional genes are indicated in red. Genes in gray are not significant in this dataset as they did not meet the frequency criteria.

[Figure 5.7: EGFR signaling pathway diagram in which each gene is annotated with the samples (S1 to S10) showing an alteration in each dimension. Legend: GE, gene expression (over/under); CN, DNA copy number (gain/loss); L, allelic status (LOH); M, DNA methylation (hypo/hyper).]

Figure 5.7. Multi-dimensional analysis of the epidermal growth factor receptor signaling pathway. Integrative analysis identifies more genes affected in the EGFR signaling pathway than a single dimensional analysis alone. In this example, multi-dimensional profiling data were generated from ten lung adenocarcinomas and their paired non-cancerous lung tissue. Analysis of DNA copy number (gene dosage) alterations that affected expression identified 7 genes (in green) that are disrupted at ≥ 30% frequency. However, when alterations in copy number, DNA methylation, sequence mutation and/or copy neutral LOH were considered, 17 genes disrupted at ≥ 30% frequency were identified to be associated with a change in expression, with an additional gene, KRAS, harboring frequent mutation. The 11 additional genes are indicated in red. Genes in gray are not significant in this dataset as they did not meet the frequency criteria. Genome profiles were generated using the Affymetrix SNP 6.0 platform, DNA methylation data were generated using the Illumina Infinium HM27 platform and gene expression profiles were generated using the Affymetrix Exon Array.

[Figure 5.8 panels: (a) log2 fold change (T vs. N) for SIRPA and PTPN6, with threshold lines; (b) PTPN6 vs. SIRPA fold change, r = 0.907426; (c) % of underexpressing cases and % of overexpressing cases for CS, FS and NS.]

Figure 5.8. Prevalence of SIRPA underexpression and its relationship with PTPN6 and smoking status. (a) Analysis of SIRPA and PTPN6 expression in 59 lung adenocarcinoma tumor/non-malignant pairs using quantitative PCR. Plotted are the log2 fold changes of each tumor versus its matched non-malignant sample. PCR data were normalized with Beta-Actin. All samples were done in triplicate. Threshold lines denote a 1.5-fold change. (b) Pairwise comparison of SIRPA and PTPN6 fold changes in the 59 sample pairs. Spearman correlation coefficient was calculated. (c) Stratification of qPCR results based on smoking status. While the majority of current smokers (CS, n=22) and former smokers (FS, n=22) show underexpression, a subset of never smokers (NS, n=15) exhibit overexpression.

[Figure 5.9 panels: Duke, H. Lee Moffitt, MSKCC and Michigan. Each panel plots survival ratio against time for the LowExpression and HighExpression groups. Reported p-values: 0.009 for both Duke and H. Lee Moffitt; 0.150 and 0.180 for the MSKCC and Michigan panels.]

Figure 5.9. Kaplan-Meier analysis of SIRPA in four independent microarray datasets. Using publicly available gene expression microarray data, Kaplan-Meier analysis was performed to assess the association of SIRPA expression levels and overall patient survival. Briefly, for each dataset, samples were sorted based on ascending SIRPA expression, and the survival distributions of the top 1/3 and bottom 1/3 of samples by SIRPA expression were compared. In total, five datasets were tested, with two of the datasets (Duke, H. Lee Moffitt) showing a statistically significant association. In an additional two datasets (MSKCC, Michigan), the p-values were close to statistical significance. All expression data were normalized using RMA. P-values were calculated using a Mantel-Cox log rank test.


Figure 5.10. Automated detection of selected clonal populations of cells within a cancer biopsy tissue section. All nuclei (~150,000 in this example) are detected and FISH probe signal counts are enumerated for each nucleus. The FISH signal pattern of each cell is compared against that of its neighbors in order to define spatial association (or neighborhood). A mathematical model is then applied to determine clonal cell relationships. (a) Mapping cancer cells on a tissue section. A gain or loss of any one of three FISH markers indicates a cancer cell. This image shows the density of cancer cells (so defined) in neighborhoods as a color overlay. Red indicates a high fraction of cancer cells, yellow indicates a medium fraction and blue indicates a low fraction or none (see scale bar). Most of the section is highlighted except for the surrounding normal stromal infiltrates. (b) Mapping clonal cells. The same image data were analyzed for concurrent gains of all three markers. The two clusters of cells, magnified within the white boxes, are cells harboring gain of all three markers.
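The per-nucleus rules stated in the Figure 5.10 caption can be sketched as below. This is a minimal illustration only: it assumes two signals per probe in a normal nucleus and a fixed-radius neighborhood, neither of which is specified in the caption, and it does not attempt the mathematical model used to infer clonal relationships. The record fields `x`, `y` and `counts` are hypothetical.

```python
import math

NORMAL_COPIES = 2  # assumption: two signals per probe in a normal diploid nucleus


def is_cancer_cell(counts):
    """A gain or loss of any one of the three FISH markers flags a cancer cell."""
    return any(c != NORMAL_COPIES for c in counts)


def has_triple_gain(counts):
    """Concurrent gain of all three markers, as mapped in panel (b)."""
    return all(c > NORMAL_COPIES for c in counts)


def cancer_fraction(nuclei, center, radius):
    """Fraction of nuclei within `radius` of `center` that are cancer cells."""
    cx, cy = center
    nearby = [n for n in nuclei
              if math.hypot(n["x"] - cx, n["y"] - cy) <= radius]
    if not nearby:
        return 0.0
    return sum(is_cancer_cell(n["counts"]) for n in nearby) / len(nearby)
```

The fraction returned by `cancer_fraction` is the kind of value a color overlay such as the one in panel (a) could encode. The linear scan shown here is for clarity; with roughly 150,000 nuclei a spatial index would be the more practical choice.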

Table 5.1. List of software for integrative analysis

Software | Source: Commercial (C) or Academic (A) | Genome / Epigenome / Transcriptome / Integrative | Citation | Website (http://www.)
Agilent Genomic Workbench 5.0 | C | X X X X | N/A | chem.agilent.com/en-us/products/instruments/dnamicroarrays/dnaanalyticssoftware/pages/default.aspx
SIGMA2 | A | X X X X | [7] | flintbox.com/technology.asp?page=3716
Integrative Genomics Viewer | A | X X X | N/A | broadinstitute.org/igv/
Nexus Copy Number | C | X X X | N/A | biodiscovery.com/index/nexus
CGH Fusion | C | X X | N/A | infoquant.com/index/cghfusion
ISA-CGH | A | X X X | [283] | isacgh.bioinfo.cipf.es
VAMP | A | X X X X | [284] | bioinfo-out.curie.fr/projects/vamp/
Partek Genomics Suite | C | X X X X | N/A | partek.com/partekgs

Table 5.2. List of genomic resources and databases

Name | Description | Citation | Website (http://www.)
ArrayExpress Gene Expression Atlas | Gene expression analysis of public datasets | [285] | ebi.ac.uk/gxa
BioDrugScreen | Protein/Small molecule interaction database | [286] | biodrugscreen.org
Catalogue of Somatic Mutations in Cancer (COSMIC) | Listing of somatic mutations in cancer | [132] | sanger.ac.uk/cosmic
Cancer Gene Expression Database (CGED) | Gene expression analysis of cancer | [287] | cged.hgc.jp
Database of Differentially Expressed Proteins in human Cancers (dbDEPC) | Differentially expressed proteins in cancer | [288] | dbdepc.biosino.org/index
Database of Genomic Variants | Reported normal copy number variations | [31] | projects.tcag.ca/variation
European Bioinformatics Institute (EBI) | Integrated database of multiple biological resources | [289] | ebi.ac.uk
GeneCards | Integrated database of multiple biological resources | [290] | .org
GenomeRNAi | RNAi experiment results | [10] | rnai2.dkfz.de/GenomeRNAi
Human DNA Methylome | Whole genome methylation sequences of multiple individuals | [12] | neomorph.salk.edu/human_methylome
Human Histone Modification Database (HHMD) | Histone modification database | [291] | bioinfo.hrbmu.edu.cn/hhmd
microRNA.org | Annotated microRNAs and their targets | [292] | microRNA.org
miR2Disease | Deregulated microRNAs in cancer | [293] | miR2Disease.org
miRDB | Annotated microRNAs and their targets | [227] | mirdb.org
miRGen | Annotated microRNAs and their targets | [294] | diana.cslab.ece.ntua.gr/mirgen
National Center for Biotechnology Information (NCBI) | Integrated database of multiple biological resources | [295] | ncbi.nlm.nih.gov
NCBI Cancer Chromosomes | Curated cytogenetic alterations in cancer | [295] | ncbi.nlm.nih.gov/sites/entrez?db=cancerchromosomes
NCBI GEO Profiles | Gene expression analysis of public datasets | [296] | ncbi.nlm.nih.gov/sites/entrez?db=geo
Oncomine | Gene expression analysis of public datasets | [297] | oncomine.org
PROGENETIX | Copy number aberrations in cancer by CGH | [298] | progenetix.net
PRoteomics IDentifications Database (PRIDE) | Mass spectrometry results | [299] | ebi.ac.uk/pride
Sanger CGP LOH And Copy Number Analysis | Copy number and LOH profiles of cancer cell lines | - | sanger.ac.uk/cgi-bin/genetics/CGP/cghviewer/CghHome.cgi
siRecords | RNAi experiment results | [300] | siRecords.umn.edu/siRecords
System for Integrative Genomic Microarray Analysis (SIGMA) | Array CGH profiles of cancer cell lines | [301] | sigma.bccrc.ca
The Cancer Genome Anatomy Project (CGAP) | Gene expression analysis of cancer | [43] | cgap.nci.nih.gov/
The Cancer Genome Atlas (TCGA) | Multi-dimensional description of cancer genomes | [16] | cancergenome.nih.gov/dataportal/data/about/
UCSC Genome Browser | Integrated database of multiple biological resources | [302] | genome.ucsc.edu/cgi-bin/hgNear

Table 5.3. Genes interacting with SIRPA as identified by network analysis

Gene | Gene | Gene | Gene | Gene | Gene | Gene | Gene
ABCG1 | C5AR1 | DOK2 | GPR65 | LILRA1 | NTNG1 | RECK | STK10
ABI3BP | C7orf44 | DPEP2 | GPR85 | LILRA5 | NTRK3 | RHOG | STK33
ACOT1 | CANT1 | DSE | GPX3 | LILRB1 | NUP62CL | RHOJ | STX11
ACP5 | CCND2 | EMP3 | GSPT2 | LILRB2 | OGN | RNASEK | SULF1
ACSL4 | CD14 | EMR1 | GTDC1 | LILRB5 | OLFML1 | RTN1 | TACSTD1
ACVRL1 | CD163 | ETS1 | GYPC | LOC440295 | OR1J1 | RUNX1T1 | TARP
ACY1 | CD300C | EVI2B | HCK | LOXL2 | PARVB | SAMHD1 | TARS2
ADAMTSL4 | CD300LF | FAAH2 | HERPUD1 | LPAR1 | PCDH15 | SELPLG | TCEAL2
ADARB1 | CD33 | FAM107A | HIST2H4A | LPIN1 | PDCD1LG2 | SH2B3 | TCF21
ADC | CD34 | FAM65A | HSD11B1 | LPXN | PDE3B | SIGLEC7 | TDRKH
ADCY4 | CD4 | FBLN5 | HSPB7 | LRCH2 | PHEX | SIP1 | TFE3
ADCY7 | CD53 | FBXL17 | HVCN1 | LRRC25 | PHKA1 | SIRPB1 | TGFBR3
ADPRH | CD86 | FBXL2 | IFI30 | LRRC33 | PIK3AP1 | SLA | TLN1
ADRA1A | CD93 | FCER1G | IGSF10 | LRRC37A | PIK3R5 | SLC15A3 | TLR4
ADRBK1 | CD97 | FCGR1A | IGSF2 | LRRC8C | PILRA | SLC16A2 | TLR8
AGER | CDH1 | FCGR3A | IL17RA | LST1 | PLEK | SLC22A25 | TM6SF1
AKR7A3 | CDKL3 | FERMT3 | IL8RA | LTBP2 | PLEKHA8 | SLC25A10 | TMED3
AKT1 | CFD | FGD2 | IRF6 | MAF | PLEKHO2 | SLC25A29 | TMEM183B
ALS2CR12 | CFP | FGF2 | IRF8 | MAN1C1 | PMP22 | SLC2A9 | TMEM184A
ANGPTL1 | CLEC4E | FGFR4 | ITGA5 | MAP3K3 | PNPLA6 | SLC31A2 | TMEM47
ANKRD36 | CLN3 | FGL2 | ITGAL | MARCH1 | PPAP2B | SLC7A11 | TMTC1
AOC3 | CMKLR1 | FGR | ITPR3 | MARCO | PPM1F | SLC7A7 | TNFRSF1B
ARHGAP30 | COASY | FHL1 | ITPRIP | MCOLN1 | PPP1R14B | SLC8A1 | TPK1
ARRB2 | COG1 | FIBIN | JAM2 | MFNG | PRCP | SLCO2B1 | TPRG1
ATP6V1B2 | COMMD3 | FIGF | JUNB | MMP19 | PREX1 | SLFN13 | TRPV2
BAALC | CPA3 | FLI1 | KCNJ5 | MORC2 | PROM1 | SMARCA2 | TSPAN18
BHLHB3 | CPVL | FLJ22662 | KCNK1 | MRAS | PRUNE2 | SMYD3 | TTC13
BRCC3 | CSF1R | FPR1 | KCTD12 | MRC2 | PTGDS | SNN | TTC30B
BTK | CSF2RB | FRMD4A | KIF26B | MS4A15 | PTGER4 | SORBS1 | TYROBP
BVES | CSGALNACT1 | FXYD6 | KLF13 | MSRB3 | PTGIS | SORD | TYW1
BZW2 | CYP4Z1 | GFRA2 | KLF4 | MTMR10 | PTPN6 | SPARCL1 | USP48
C10orf54 | CYTH4 | GIMAP1 | KMO | MYCT1 | PTPRG | SPI1 | VAMP7
C10orf72 | CYYR1 | GIMAP4 | LAIR1 | MYD88 | PVRL4 | SPN | VASH1
C14orf49 | DAB2 | GIMAP5 | LAMB3 | NCF1 | QKI | SPOCK2 | VAT1
C1orf38 | DNAH7 | GIMAP6 | LAMC2 | NCKAP1L | QSER1 | SPON1 | VSIG4
C1QA | DNAJB4 | GIMAP8 | LAPTM5 | NEK4 | RAD54B | SRGN | VWF
C1QB | DNASE1L3 | GLIPR2 | LAT2 | NEXN | RASSF2 | SRPX | WDR60
C1QC | DOCK11 | GMFG | LCP1 | NLRC4 | RBM17 | STAP2 |
C1RL | DOCK8 | GNAI2 | LDB2 | NTAN1 | RBM35A | STAT5A |

5.5 References

1. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y et al: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 1998, 20(2):207-211.
2. Schrock E, du Manoir S, Veldman T, Schoell B, Wienberg J, Ferguson-Smith MA, Ning Y, Ledbetter DH, Bar-Am I, Soenksen D et al: Multicolor spectral karyotyping of human chromosomes. Science 1996, 273(5274):494-497.
3. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G et al: Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science 2009.
4. Pleasance ED, Cheetham RK, Stephens PJ, McBride DJ, Humphray SJ, Greenman CD, Varela I, Lin ML, Ordonez GR, Bignell GR et al: A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 2009.
5. Pleasance ED, Stephens PJ, O'Meara S, McBride DJ, Meynert A, Jones D, Lin ML, Beare D, Lau KW, Greenman C et al: A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature 2009.
6. Stephens PJ, McBride DJ, Lin ML, Varela I, Pleasance ED, Simpson JT, Stebbings LA, Leroy C, Edkins S, Mudie LJ et al: Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 2009, 462(7276):1005-1010.
7. Chari R, Coe BP, Wedseltoft C, Benetti M, Wilson IM, Vucic EA, MacAulay C, Ng RT, Lam WL: SIGMA2: a system for the integrative genomic multi-dimensional analysis of cancer genomes, epigenomes, and transcriptomes. BMC Bioinformatics 2008, 9:422.
8. von Eschenbach AC, Buetow K: Cancer Informatics Vision: caBIG. Cancer Inform 2007, 2:22-24.
9. Horn T, Arziman Z, Berger J, Boutros M: GenomeRNAi: a database for cell-based RNAi phenotypes. Nucleic Acids Res 2007, 35(Database issue):D492-497.
10. Gilsdorf M, Horn T, Arziman Z, Pelz O, Kiner E, Boutros M: GenomeRNAi: a database for cell-based RNAi phenotypes. 2009 update. Nucleic Acids Res 2010, 38(Database issue):D448-452.
11. Shah SP, Morin RD, Khattra J, Prentice L, Pugh T, Burleigh A, Delaney A, Gelmon K, Guliany R, Senz J et al: Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature 2009, 461(7265):809-813.
12. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM et al: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009.
13. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, Dunford-Shore BH, McGrath S, Hickenbotham M et al: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008, 456(7218):66-72.
14. Mardis ER, Ding L, Dooling DJ, Larson DE, McLellan MD, Chen K, Koboldt DC, Fulton RS, Delehaunty KD, McGrath SD et al: Recurring mutations found by sequencing an acute myeloid leukemia genome. N Engl J Med 2009, 361(11):1058-1066.
15. Wise J: Consortium hopes to sequence genome of 1000 volunteers. BMJ 2008, 336(7638):237.
16. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061-1068.
17. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM et al: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007, 449(7164):851-861.

18. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W et al: Global variation in copy number in the human genome. Nature 2006, 444(7118):444-454.
19. Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay C, Ng RT, Brown CJ, Eichler EE et al: A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet 2007, 80(1):91-104.
20. Shames DS, Girard L, Gao B, Sato M, Lewis CM, Shivapurkar N, Jiang A, Perou CM, Kim YH, Pollack JR et al: A genome-wide screen for promoter methylation in lung cancer identifies novel methylation markers for multiple malignancies. PLoS Med 2006, 3(12):e486.
21. Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N et al: The consensus coding sequences of human breast and colorectal cancers. Science 2006, 314(5797):268-274.
22. Root DE, Hacohen N, Hahn WC, Lander ES, Sabatini DM: Genome-scale loss-of-function screening with a lentiviral RNAi library. Nat Methods 2006, 3(9):715-719.
23. Bibikova M, Lin Z, Zhou L, Chudin E, Garcia EW, Wu B, Doucet D, Thomas NJ, Wang Y, Vollmer E et al: High-throughput DNA methylation profiling using universal bead arrays. Genome Res 2006, 16(3):383-393.
24. Weber M, Davies JJ, Wittig D, Oakeley EJ, Haase M, Lam WL, Schubeler D: Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat Genet 2005, 37(8):853-862.
25. A haplotype map of the human genome. Nature 2005, 437(7063):1299-1320.
26. Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA et al: MicroRNA expression profiles classify human cancers. Nature 2005, 435(7043):834-838.
27. Bamford S, Dawson E, Forbes S, Clements J, Pettett R, Dogan A, Flanagan A, Teague J, Futreal PA, Stratton MR et al: The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Br J Cancer 2004, 91(2):355-358.
28. Paddison PJ, Silva JM, Conklin DS, Schlabach M, Li M, Aruleba S, Balija V, O'Shaughnessy A, Gnoj L, Scobie K et al: A resource for large-scale RNA-interference-based screens in mammals. Nature 2004, 428(6981):427-431.
29. Schlabach MR, Luo J, Solimini NL, Hu G, Xu Q, Li MZ, Zhao Z, Smogorzewska A, Sowa ME, Ang XL et al: Cancer proliferation gene discovery through functional genomics. Science 2008, 319(5863):620-624.
30. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177-183.
31. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet 2004, 36(9):949-951.
32. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M et al: Large-scale copy number polymorphism in the human genome. Science 2004, 305(5683):525-528.
33. Ishkanian AS, Malloff CA, Watson SK, DeLeeuw RJ, Chi B, Coe BP, Snijders A, Albertson DG, Pinkel D, Marra MA et al: A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 2004, 36(3):299-303.
34. Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S et al: Global identification of human transcribed sequences with genome tiling arrays. Science 2004, 306(5705):2242-2246.
35. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T et al: The Ensembl genome database project. Nucleic Acids Res 2002, 30(1):38-41.
36. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res 2002, 12(6):996-1006.
37. Oliphant A, Barker DL, Stuelpnagel JR, Chee MS: BeadArray technology: enabling an accurate, cost-effective approach to high-throughput genotyping. Biotechniques 2002, Suppl:56-58, 60-51.
38. Weinstein IB: Cancer. Addiction to oncogenes--the Achilles heal of cancer. Science 2002, 297(5578):63-64.
39. Weinstein IB, Joe A: Oncogene addiction. Cancer Res 2008, 68(9):3077-3080; discussion 3080.
40. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al: Initial sequencing and analysis of the human genome. Nature 2001, 409(6822):860-921.
41. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA et al: The sequence of the human genome. Science 2001, 291(5507):1304-1351.
42. Riggins GJ, Strausberg RL: Genome and genetic resources from the Cancer Genome Anatomy Project. Hum Mol Genet 2001, 10(7):663-667.
43. Strausberg RL, Buetow KH, Emmert-Buck MR, Klausner RD: The cancer genome anatomy project: building an annotated gene index. Trends Genet 2000, 16(3):103-106.
44. Bayani JM, Squire JA: Applications of SKY in cancer cytogenetics. Cancer Invest 2002, 20(3):373-386.
45. Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D: Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 1992, 258(5083):818-821.
46. Garnis C, Buys TP, Lam WL: Genetic alteration and gene expression modulation during cancer progression. Mol Cancer 2004, 3:9.
47. Gebhart E: Genomic imbalances in human leukemia and lymphoma detected by comparative genomic hybridization (Review). Int J Oncol 2005, 27(3):593-606.
48. Gebhart E, Liehr T: Patterns of genomic imbalances in human solid tumors (Review). Int J Oncol 2000, 16(2):383-399.
49. Cahill DP, Lengauer C, Yu J, Riggins GJ, Willson JK, Markowitz SD, Kinzler KW, Vogelstein B: Mutations of mitotic checkpoint genes in human cancers. Nature 1998, 392(6673):300-303.
50. Fukasawa K: Centrosome amplification, chromosome instability and cancer development. Cancer Lett 2005, 230(1):6-19.
51. Lingle WL, Lukasiewicz K, Salisbury JL: Deregulation of the centrosome cycle and the origin of chromosomal instability in cancer. Adv Exp Med Biol 2005, 570:393-421.
52. Chin K, de Solorzano CO, Knowles D, Jones A, Chou W, Rodriguez EG, Kuo WL, Ljung BM, Chew K, Myambo K et al: In situ analyses of genome instability in breast cancer. Nat Genet 2004, 36(9):984-988.
53. O'Hagan RC, Chang S, Maser RS, Mohan R, Artandi SE, Chin L, DePinho RA: Telomere dysfunction provokes regional amplification and deletion in cancer genomes. Cancer Cell 2002, 2(2):149-155.
54. Green AR: Transcription factors, translocations and haematological malignancies. Blood Rev 1992, 6(2):118-124.
55. Rowley JD: Chromosomal translocations: revisited yet again. Blood 2008, 112(6):2183-2189.
56. Watson SK, deLeeuw RJ, Horsman DE, Squire JA, Lam WL: Cytogenetically balanced translocations are associated with focal copy number alterations. Hum Genet 2007, 120(6):795-805.
57. Brenner JC, Chinnaiyan AM: Translocations in epithelial cancers. Biochim Biophys Acta 2009, 1796(2):201-215.

58. Mani RS, Tomlins SA, Callahan K, Ghosh A, Nyati MK, Varambally S, Palanisamy N, Chinnaiyan AM: Induced chromosomal proximity and gene fusions in prostate cancer. Science 2009, 326(5957):1230.
59. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R et al: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005, 310(5748):644-648.
60. Dang TP, Gazdar AF, Virmani AK, Sepetavec T, Hande KR, Minna JD, Roberts JR, Carbone DP: Chromosome 19 translocation, overexpression of Notch3, and human lung cancer. J Natl Cancer Inst 2000, 92(16):1355-1357.
61. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H et al: Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 2007, 448(7153):561-566.
62. Knutsen T, Gobu V, Knaus R, Padilla-Nash H, Augustus M, Strausberg RL, Kirsch IR, Sirotkin K, Ried T: The interactive online SKY/M-FISH & CGH database and the Entrez cancer chromosomes search database: linkage of chromosomal aberrations with the genome sequence. Genes Chromosomes Cancer 2005, 44(1):52-64.
63. Albertson DG, Collins C, McCormick F, Gray JW: Chromosome aberrations in solid tumors. Nat Genet 2003, 34(4):369-376.
64. Coe BP, Ylstra B, Carvalho B, Meijer GA, Macaulay C, Lam WL: Resolving the resolution of array CGH. Genomics 2007, 89(5):647-653.
65. Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet 2006, 14(2):139-148.
66. Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF, Jeffrey SS, Botstein D, Brown PO: Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet 1999, 23(1):41-46.
67. Almagro-Garcia J, Manske M, Carret C, Campino S, Auburn S, Macinnis BL, Maslen G, Pain A, Newbold CI, Kwiatkowski DP et al: SnoopCGH: software for visualizing comparative genomic hybridization data. Bioinformatics 2009, 25(20):2732-2733.
68. Chari R, Lockwood WW, Lam WL: Computational methods for the analysis of array comparative genomic hybridization. Cancer Inform 2007, 2:48-58.
69. Chi B, DeLeeuw RJ, Coe BP, MacAulay C, Lam WL: SeeGH--a software tool for visualization of whole genome array comparative genomic hybridization data. BMC Bioinformatics 2004, 5:13.
70. Chi B, deLeeuw RJ, Coe BP, Ng RT, MacAulay C, Lam WL: MD-SeeGH: a platform for integrative analysis of multi-dimensional genomic data. BMC Bioinformatics 2008, 9:243.
71. Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 2007, 23(6):657-663.
72. Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, Grigorova M, Jones KW, Wei W, Stratton MR et al: High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res 2004, 14(2):287-295.
73. Iacobucci I, Storlazzi CT, Cilloni D, Lonetti A, Ottaviani E, Soverini S, Astolfi A, Chiaretti S, Vitale A, Messa F et al: Identification and molecular characterization of recurrent genomic deletions on 7p12 in the IKZF1 gene in a large cohort of BCR-ABL1-positive acute lymphoblastic leukemia patients: on behalf of Gruppo Italiano Malattie Ematologiche dell'Adulto Acute Leukemia Working Party (GIMEMA AL WP). Blood 2009, 114(10):2159-2167.
74. Niini T, Lopez-Guerrero JA, Ninomiya S, Guled M, Hattinger CM, Michelacci F, Bohling T, Llombart-Bosch A, Picci P, Serra M et al: Frequent deletion of CDKN2A and recurrent coamplification of KIT, PDGFRA, and KDR in fibrosarcoma of bone-An array comparative genomic hybridization study. Genes Chromosomes Cancer 2010, 49(2):132-143.
75. Selzer RR, Richmond TA, Pofahl NJ, Green RD, Eis PS, Nair P, Brothman AR, Stallings RL: Analysis of chromosome breakpoints in neuroblastoma at sub-kilobase resolution using fine-tiling oligonucleotide array CGH. Genes Chromosomes Cancer 2005, 44(3):305-319.
76. Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C et al: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res 2004, 64(9):3060-3071.
77. Wang TL, Maierhofer C, Speicher MR, Lengauer C, Vogelstein B, Kinzler KW, Velculescu VE: Digital karyotyping. Proc Natl Acad Sci U S A 2002, 99(25):16156-16161.
78. Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D et al: Fine-scale structural variation of the human genome. Nat Genet 2005, 37(7):727-732.
79. Volik S, Raphael BJ, Huang G, Stratton MR, Bignel G, Murnane J, Brebner JH, Bajsarowicz K, Paris PL, Tao Q et al: Decoding the fine-scale structure of a breast cancer genome and transcriptome. Genome Res 2006, 16(3):394-404.
80. Volik S, Zhao S, Chin K, Brebner JH, Herndon DR, Tao Q, Kowbel D, Huang G, Lapuk A, Kuo WL et al: End-sequence profiling: sequence-based analysis of aberrant genomes. Proc Natl Acad Sci U S A 2003, 100(13):7696-7701.
81. McPherson JD: Next-generation gap. Nat Methods 2009, 6(11 Suppl):S2-5.
82. Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O et al: Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet 2009, 41(10):1061-1067.
83. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK: A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 2006, 38(1):75-81.
84. Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P et al: Origins and functional impact of copy number variation in the human genome. Nature 2009.
85. Fiegler H, Redon R, Andrews D, Scott C, Andrews R, Carder C, Clark R, Dovey O, Ellis P, Feuk L et al: Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Res 2006, 16(12):1566-1574.
86. Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, Szpiech ZA, Degnan JH, Wang K, Guerreiro R et al: Genotype, haplotype and copy-number variation in worldwide human populations. Nature 2008, 451(7181):998-1003.
87. Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F et al: Mapping and sequencing of structural variation from eight human genomes. Nature 2008, 453(7191):56-64.
88. McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH, de Bakker PI, Maller JB, Kirby A et al: Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008, 40(10):1166-1174.
89. Shaikh TH, Gai X, Perin JC, Glessner JT, Xie H, Murphy K, O'Hara R, Casalunovo T, Conlin LK, D'Arcy M et al: High-resolution mapping and analysis of copy number variations in the human genome: a data resource for clinical and research applications. Genome Res 2009, 19(9):1682-1690.
90. Hastings PJ, Ira G, Lupski JR: A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genet 2009, 5(1):e1000327.
91. Diskin SJ, Hou C, Glessner JT, Attiyeh EF, Laudenslager M, Bosse K, Cole K, Mosse YP, Wood A, Lynch JE et al: Copy number variation at 1q21.1 associated with neuroblastoma. Nature 2009, 459(7249):987-991.

92. Lockwood WW, Chari R, Coe BP, Girard L, Macaulay C, Lam S, Gazdar AF, Minna JD, Lam WL: DNA amplification is a ubiquitous mechanism of oncogene activation in lung and other cancers. Oncogene 2008, 27(33):4615-4624.
93. Myllykangas S, Himberg J, Bohling T, Nagy B, Hollmen J, Knuutila S: DNA copy number amplification profiling of human neoplasms. Oncogene 2006, 25(55):7324-7332.
94. Teschendorff AE, Caldas C: The breast cancer somatic 'muta-ome': tackling the complexity. Breast Cancer Res 2009, 11(2):301.
95. Chin SF, Teschendorff AE, Marioni JC, Wang Y, Barbosa-Morais NL, Thorne NP, Costa JL, Pinder SE, van de Wiel MA, Green AR et al: High-resolution aCGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer. Genome Biol 2007, 8(10):R215.
96. Coe BP, Lockwood WW, Girard L, Chari R, Macaulay C, Lam S, Gazdar AF, Minna JD, Lam WL: Differential disruption of cell cycle pathways in small cell and non-small cell lung cancer. Br J Cancer 2006, 94(12):1927-1935.
97. Bass AJ, Watanabe H, Mermel CH, Yu S, Perner S, Verhaak RG, Kim SY, Wardwell L, Tamayo P, Gat-Viks I et al: SOX2 is an amplified lineage-survival oncogene in lung and esophageal squamous cell carcinomas. Nat Genet 2009, 41(11):1238-1242.
98. Garraway LA, Widlund HR, Rubin MA, Getz G, Berger AJ, Ramaswamy S, Beroukhim R, Milner DA, Granter SR, Du J et al: Integrative genomic analyses identify MITF as a lineage survival oncogene amplified in malignant melanoma. Nature 2005, 436(7047):117-122.
99. Weir BA, Woo MS, Getz G, Perner S, Ding L, Beroukhim R, Lin WM, Province MA, Kraja A, Johnson LA et al: Characterizing the cancer genome in lung adenocarcinoma. Nature 2007, 450(7171):893-898.
100. Kwei KA, Kim YH, Girard L, Kao J, Pacyna-Gengelbach M, Salari K, Lee J, Choi YL, Sato M, Wang P et al: Genomic profiling identifies TITF1 as a lineage-specific oncogene amplified in lung cancer. Oncogene 2008, 27(25):3635-3640.
101. Plomin R, Haworth CM, Davis OS: Common disorders are quantitative traits. Nat Rev Genet 2009, 10(12):872-878.
102. Savas S, Liu G: Genetic variations as cancer prognostic markers: review and update. Hum Mutat 2009, 30(10):1369-1377.
103. Ansorge WJ: Next-generation DNA sequencing techniques. N Biotechnol 2009, 25(4):195-203.
104. Shah SP, Kobel M, Senz J, Morin RD, Clarke BA, Wiegand KC, Leung G, Zayed A, Mehl E, Kalloger SE et al: Mutation of FOXL2 in granulosa-cell tumors of the ovary. N Engl J Med 2009, 360(26):2719-2729.
105. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, Davies H, Teague J, Butler A, Stevens C et al: Patterns of somatic mutation in human cancer genomes. Nature 2007, 446(7132):153-158.
106. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature 2009, 458(7239):719-724.
107. Cavenee WK, Hansen MF, Nordenskjold M, Kock E, Maumenee I, Squire JA, Phillips RA, Gallie BL: Genetic origin of mutations predisposing to retinoblastoma. Science 1985, 228(4698):501-503.
108. Knudson AG, Jr.: Mutation and cancer: statistical study of retinoblastoma. Proc Natl Acad Sci U S A 1971, 68(4):820-823.
109. Benz CC, Fedele V, Xu F, Ylstra B, Ginzinger D, Yu M, Moore D, Hall RK, Wolff DJ, Disis ML et al: Altered promoter usage characterizes monoallelic transcription arising with ERBB2 amplification in human breast cancers. Genes Chromosomes Cancer 2006, 45(11):983-994.
110. LaFramboise T, Weir BA, Zhao X, Beroukhim R, Li C, Harrington D, Sellers WR, Meyerson M: Allele-specific amplification in cancer revealed by SNP array analysis. PLoS Comput Biol 2005, 1(6):e65.
111. Melcher R, Al-Taie O, Kudlich T, Hartmann E, Maisch S, Steinlein C, Schmid M, Rosenwald A, Menzel T, Scheppach W et al: SNP-Array genotyping and spectral karyotyping reveal uniparental disomy as early mutational event in MSS- and MSI-colorectal cancer cell lines. Cytogenet Genome Res 2007, 118(2-4):214-221.
112. Nomura M, Shigematsu H, Li L, Suzuki M, Takahashi T, Estess P, Siegelman M, Feng Z, Kato H, Marchetti A et al: Polymorphisms, mutations, and amplification of the EGFR gene in non-small cell lung cancers. PLoS Med 2007, 4(4):e125.
113. Sholl LM, Yeap BY, Iafrate AJ, Holmes-Tisch AJ, Chou YP, Wu MT, Goan YG, Su L, Benedettini E, Yu J et al: Lung adenocarcinoma with EGFR amplification has distinct clinicopathologic and molecular features in never-smokers. Cancer Res 2009, 69(21):8341-8348.
114. Soh J, Okumura N, Lockwood WW, Yamamoto H, Shigematsu H, Zhang W, Chari R, Shames DS, Tang X, MacAulay C et al: Oncogene mutations, copy number gains and mutant allele specific imbalance (MASI) frequently occur together in tumor cells. PLoS One 2009, 4(10):e7464.
115. Bacolod MD, Schemmann GS, Giardina SF, Paty P, Notterman DA, Barany F: Emerging paradigms in cancer genetics: some important findings from high-density single nucleotide polymorphism array studies. Cancer Res 2009, 69(3):723-727.
116. Robinson WP: Mechanisms leading to uniparental disomy and their clinical consequences. Bioessays 2000, 22(5):452-459.
117. Tuna M, Knuutila S, Mills GB: Uniparental disomy in cancer. Trends Mol Med 2009, 15(3):120-128.
118. Zhu X, Dunn JM, Goddard AD, Squire JA, Becker A, Phillips RA, Gallie BL: Mechanisms of loss of heterozygosity in retinoblastoma. Cytogenet Cell Genet 1992, 59(4):248-252.
119. Gondek LP, Dunbar AJ, Szpurka H, McDevitt MA, Maciejewski JP: SNP array karyotyping allows for the detection of uniparental disomy and cryptic chromosomal abnormalities in MDS/MPD-U and MPD. PLoS One 2007, 2(11):e1225.
120. Tiu RV, Gondek LP, O'Keefe CL, Huh J, Sekeres MA, Elson P, McDevitt MA, Wang XF, Levis MJ, Karp JE et al: New lesions detected by single nucleotide polymorphism array-based chromosomal analysis have important clinical impact in acute myeloid leukemia. J Clin Oncol 2009, 27(31):5219-5226.
121. Yamamoto G, Nannya Y, Kato M, Sanada M, Levine RL, Kawamata N, Hangaishi A, Kurokawa M, Chiba S, Gilliland DG et al: Highly sensitive method for genomewide detection of allelic composition in nonpaired, primary tumor specimens by use of affymetrix single-nucleotide-polymorphism genotyping microarrays. Am J Hum Genet 2007, 81(1):114-126.
122. Darbary HK, Dutt SS, Sait SJ, Nowak NJ, Heinaman RE, Stoler DL, Anderson GR: Uniparentalism in sporadic colorectal cancer is independent of imprint status, and coordinate for chromosomes 14 and 18. Cancer Genet Cytogenet 2009, 189(2):77-86.
123. Grand FH, Hidalgo-Curtis CE, Ernst T, Zoi K, Zoi C, McGuire C, Kreil S, Jones A, Score J, Metzgeroth G et al: Frequent CBL mutations associated with 11q acquired uniparental disomy in myeloproliferative neoplasms. Blood 2009, 113(24):6182-6192.
124. Gupta M, Raghavan M, Gale RE, Chelala C, Allen C, Molloy G, Chaplin T, Linch DC, Cazier JB, Young BD: Novel regions of acquired uniparental disomy discovered in acute myeloid leukemia. Genes Chromosomes Cancer 2008, 47(9):729-739.
125. Kawamata N, Ogawa S, Gueller S, Ross SH, Huynh T, Chen J, Chang A, Nabavi-Nouis S, Megrabian N, Siebert R et al: Identified hidden genomic changes in mantle cell lymphoma using high-resolution single nucleotide polymorphism genomic array. Exp Hematol 2009, 37(8):937-946.

126. Makishima H, Cazzolli H, Szpurka H, Dunbar A, Tiu R, Huh J, Muramatsu H, O'Keefe C, Hsi E, Paquette RL et al: Mutations of e3 ubiquitin ligase cbl family members constitute a novel common pathogenic lesion in myeloid malignancies. J Clin Oncol 2009, 27(36):6109-6116.
127. Walter MJ, Payton JE, Ries RE, Shannon WD, Deshmukh H, Zhao Y, Baty J, Heath S, Westervelt P, Watson MA et al: Acquired copy number alterations in adult acute myeloid leukemia genomes. Proc Natl Acad Sci U S A 2009, 106(31):12950-12955.
128. Yin D, Ogawa S, Kawamata N, Tunici P, Finocchiaro G, Eoli M, Ruckert C, Huynh T, Liu G, Kato M et al: High-resolution genomic copy number profiling of glioblastoma multiforme by single nucleotide polymorphism DNA microarray. Mol Cancer Res 2009, 7(5):665-677.
129. Purdie KJ, Lambert SR, Teh MT, Chaplin T, Molloy G, Raghavan M, Kelsell DP, Leigh IM, Harwood CA, Proby CM et al: Allelic imbalances and microdeletions affecting the PTPRD gene in cutaneous squamous cell carcinomas detected using single nucleotide polymorphism microarray analysis. Genes Chromosomes Cancer 2007, 46(7):661-669.
130. Akagi T, Ito T, Kato M, Jin Z, Cheng Y, Kan T, Yamamoto G, Olaru A, Kawamata N, Boult J et al: Chromosomal abnormalities and novel disease-related regions in progression from Barrett's esophagus to esophageal adenocarcinoma. Int J Cancer 2009, 125(10):2349-2359.
131. Andersen CL, Wiuf C, Kruhoffer M, Korsgaard M, Laurberg S, Orntoft TF: Frequent occurrence of uniparental disomy in colorectal cancer. Carcinogenesis 2007, 28(1):38-48.
132. Forbes SA, Tang G, Bindal N, Bamford S, Dawson E, Cole C, Kok CY, Jia M, Ewing R, Menzies A et al: COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res 2010, 38(Database issue):D652-657.
133. Kerkel K, Spadola A, Yuan E, Kosek J, Jiang L, Hod E, Li K, Murty VV, Schupf N, Vilain E et al: Genomic surveys by methylation-sensitive SNP analysis identify sequence-dependent allele-specific DNA methylation. Nat Genet 2008, 40(7):904-908.
134. Jones PA, Baylin SB: The epigenomics of cancer. Cell 2007, 128(4):683-692.
135. Esteller M: Epigenetics in cancer. N Engl J Med 2008, 358(11):1148-1159.
136. Feinberg AP: Phenotypic plasticity and the epigenetics of human disease. Nature 2007, 447(7143):433-440.
137. Vucic EA, Brown CJ, Lam WL: Epigenetics of cancer progression. Pharmacogenomics 2008, 9(2):215-234.
138. Feinberg AP, Gehrke CW, Kuo KC, Ehrlich M: Reduced genomic 5-methylcytosine content in human colonic neoplasia. Cancer Res 1988, 48(5):1159-1161.
139. Feinberg AP, Tycko B: The history of cancer epigenetics. Nat Rev Cancer 2004, 4(2):143-153.
140. Lo PK, Sukumar S: Epigenomics and breast cancer. Pharmacogenomics 2008, 9(12):1879-1902.
141. Toyota M, Ahuja N, Ohe-Toyota M, Herman JG, Baylin SB, Issa JP: CpG island methylator phenotype in colorectal cancer. Proc Natl Acad Sci U S A 1999, 96(15):8681-8686.
142. Issa JP: CpG island methylator phenotype in cancer. Nat Rev Cancer 2004, 4(12):988-993.
143. Tanemura A, Terando AM, Sim MS, van Hoesel AQ, de Maat MF, Morton DL, Hoon DS: CpG island methylator phenotype predicts progression of malignant melanoma. Clin Cancer Res 2009, 15(5):1801-1807.
144. Dai Z, Lakshmanan RR, Zhu WG, Smiraglia DJ, Rush LJ, Fruhwald MC, Brena RM, Li B, Wright FA, Ross P et al: Global methylation profiling of lung cancer identifies novel methylated genes. Neoplasia 2001, 3(4):314-323.
145. Takai D, Yagi Y, Wakazono K, Ohishi N, Morita Y, Sugimura T, Ushijima T: Silencing of HTR1B and reduced expression of EDN1 in human lung cancers, revealed by methylation-sensitive representational difference analysis. Oncogene 2001, 20(51):7505-7513.
146. Hu M, Yao J, Cai L, Bachman KE, van den Brule F, Velculescu V, Polyak K: Distinct epigenetic changes in the stromal cells of breast cancers. Nat Genet 2005, 37(8):899-905.
147. Irizarry RA, Ladd-Acosta C, Carvalho B, Wu H, Brandenburg SA, Jeddeloh JA, Wen B, Feinberg AP: Comprehensive high-throughput arrays for relative methylation (CHARM). Genome Res 2008, 18(5):780-790.
148. Yan PS, Chen CM, Shi H, Rahmatpanah F, Wei SH, Caldwell CW, Huang TH: Dissecting complex epigenetic alterations in breast cancer using CpG island microarrays. Cancer Res 2001, 61(23):8375-8380.
149. Yamamoto F, Yamamoto M: A DNA microarray-based methylation-sensitive (MS)-AFLP hybridization method for genetic and epigenetic analyses. Mol Genet Genomics 2004, 271(6):678-686.
150. Omura N, Li CP, Li A, Hong SM, Walter K, Jimeno A, Hidalgo M, Goggins M: Genome-wide profiling of methylated promoters in pancreatic adenocarcinoma. Cancer Biol Ther 2008, 7(7):1146-1156.
151. Trinh BN, Long TI, Laird PW: DNA methylation analysis by MethyLight technology. Methods 2001, 25(4):456-462.
152. Fan JB, Gunderson KL, Bibikova M, Yeakley JM, Chen J, Wickham Garcia E, Lebruska LL, Laurent M, Shen R, Barker D: Illumina universal bead arrays. Methods Enzymol 2006, 410:57-73.
153. Houshdaran S, Cortessis VK, Siegmund K, Yang A, Laird PW, Sokol RZ: Widespread epigenetic abnormalities suggest a broad DNA methylation erasure defect in abnormal human sperm. PLoS One 2007, 2(12):e1289.
154. Houseman EA, Christensen BC, Karagas MR, Wrensch MR, Nelson HH, Wiemels JL, Zheng S, Wiencke JK, Kelsey KT, Marsit CJ: Copy number variation has little impact on bead-array-based measures of DNA methylation. Bioinformatics 2009, 25(16):1999-2005.
155. Breton CV, Byun HM, Wenten M, Pan F, Yang A, Gilliland FD: Prenatal tobacco smoke exposure affects global and gene-specific DNA methylation. Am J Respir Crit Care Med 2009, 180(5):462-467.
156. 
Taylor KH, Pena-Hernandez KE, Davis JW, Arthur GL, Duff DJ, Shi H, Rahmatpanah FB, Sjahputera O, Caldwell CW: Large-scale CpG methylation analysis identifies novel candidate genes and reveals methylation hotspots in acute lymphoblastic leukemia. Cancer Res 2007, 67(6):2617-2625. 157. Weber M, Hellmann I, Stadler MB, Ramos L, Paabo S, Rebhan M, Schubeler D: Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet 2007, 39(4):457-466. 158. Rauch T, Pfeifer GP: Methylated-CpG island recovery assay: a new technique for the rapid detection of methylated-CpG islands in cancer. Lab Invest 2005, 85(9):1172-1180. 159. Jacinto FV, Ballestar E, Ropero S, Esteller M: Discovery of epigenetically silenced genes by methylated DNA immunoprecipitation in colon cancer cells. Cancer Res 2007, 67(24):11481-11486. 160. Ballestar E, Paz MF, Valle L, Wei S, Fraga MF, Espada J, Cigudosa JC, Huang TH, Esteller M: Methyl-CpG binding proteins identify novel sites of epigenetic inactivation in human cancer. EMBO J 2003, 22(23):6335-6345. 161. Serre D, Lee BH, Ting AH: MBD-isolated Genome Sequencing provides a high- throughput and comprehensive survey of DNA methylation in the human genome. Nucleic Acids Res 2009.

152 162. Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, Kulesha E, Graf S, Johnson N, Herrero J, Tomazou EM et al: A Bayesian deconvolution strategy for immunoprecipitation- based DNA methylome analysis. Nat Biotechnol 2008, 26(7):779-785. 163. Thu KL, Pikor LA, Kennett JY, Alvarez CE, Lam WL: Methylation analysis by DNA immunoprecipitation. J Cell Physiol 2009, 222(3):522-531. 164. Pelizzola M, Koga Y, Urban AE, Krauthammer M, Weissman S, Halaban R, Molinaro AM: MEDME: an experimental and analytical methodology for the estimation of DNA methylation levels based on microarray derived MeDIP-enrichment. Genome Res 2008, 18(10):1652-1659. 165. Yamashita S, Hosoya K, Gyobu K, Takeshima H, Ushijima T: Development of a Novel Output Value for Quantitative Assessment in Methylated DNA Immunoprecipitation-CpG Island Microarray Analysis. DNA Res 2009. 166. Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M et al: The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat Genet 2009, 41(2):178-186. 167. Lorincz MC, Dickerson DR, Schmitt M, Groudine M: Intragenic DNA methylation alters chromatin structure and elongation efficiency in mammalian cells. Nat Struct Mol Biol 2004, 11(11):1068-1075. 168. Frigola J, Song J, Stirzaker C, Hinshelwood RA, Peinado MA, Clark SJ: Epigenetic remodeling in colorectal cancer results in coordinate gene suppression across an entire chromosome band. Nat Genet 2006, 38(5):540-549. 169. Zhong S, Fields CR, Su N, Pan YX, Robertson KD: Pharmacologic inhibition of epigenetic modifications, coupled with gene expression profiling, reveals novel targets of aberrant DNA methylation and histone deacetylation in lung cancer. Oncogene 2007, 26(18):2621-2634. 170. Lister R, Ecker JR: Finding the fifth base: genome-wide sequencing of cytosine methylation. Genome Res 2009, 19(6):959-966. 171. 
Byun HM, Siegmund KD, Pan F, Weisenberger DJ, Kanel G, Laird PW, Yang AS: Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009, 18(24):4808-4817. 172. Fraga MF, Ballestar E, Paz MF, Ropero S, Setien F, Ballestar ML, Heine-Suner D, Cigudosa JC, Urioste M, Benitez J et al: Epigenetic differences arise during the lifetime of monozygotic twins. Proc Natl Acad Sci U S A 2005, 102(30):10604-10609. 173. Deng J, Shoemaker R, Xie B, Gore A, LeProust EM, Antosiewicz-Bourget J, Egli D, Maherali N, Park IH, Yu J et al: Targeted bisulfite sequencing reveals changes in DNA methylation associated with nuclear reprogramming. Nat Biotechnol 2009, 27(4):353-360. 174. Costello JF, Fruhwald MC, Smiraglia DJ, Rush LJ, Robertson GP, Gao X, Wright FA, Feramisco JD, Peltomaki P, Lang JC et al: Aberrant CpG-island methylation has non- random and tumour-type-specific patterns. Nat Genet 2000, 24(2):132-138. 175. Gama-Sosa MA, Midgett RM, Slagel VA, Githens S, Kuo KC, Gehrke CW, Ehrlich M: Tissue-specific differences in DNA methylation in various mammals. Biochim Biophys Acta 1983, 740(2):212-219. 176. Richardson B: Impact of aging on DNA methylation. Ageing Res Rev 2003, 2(3):245- 261. 177. Eckhardt F, Beck S, Gut IG, Berlin K: Future potential of the Human Epigenome Project. Expert Rev Mol Diagn 2004, 4(5):609-618. 178. Kohda M, Hoshiya H, Katoh M, Tanaka I, Masuda R, Takemura T, Fujiwara M, Oshimura M: Frequent loss of imprinting of IGF2 and MEST in lung adenocarcinoma. Mol Carcinog 2001, 31(4):184-191.

153 179. Kondo M, Suzuki H, Ueda R, Osada H, Takagi K, Takahashi T: Frequent loss of imprinting of the H19 gene is often associated with its overexpression in human lung cancers. Oncogene 1995, 10(6):1193-1198. 180. Rainier S, Johnson LA, Dobry CJ, Ping AJ, Grundy PE, Feinberg AP: Relaxation of imprinted genes in human cancer. Nature 1993, 362(6422):747-749. 181. Pal N, Wadey RB, Buckle B, Yeomans E, Pritchard J, Cowell JK: Preferential loss of maternal alleles in sporadic Wilms' tumour. Oncogene 1990, 5(11):1665-1668. 182. Schroeder WT, Chao LY, Dao DD, Strong LC, Pathak S, Riccardi V, Lewis WH, Saunders GF: Nonrandom loss of maternal alleles in Wilms tumors. Am J Hum Genet 1987, 40(5):413-420. 183. Scrable H, Cavenee W, Ghavimi F, Lovell M, Morgan K, Sapienza C: A model for embryonal rhabdomyosarcoma tumorigenesis that involves genome imprinting. Proc Natl Acad Sci U S A 1989, 86(19):7480-7484. 184. Gaudet F, Hodgson JG, Eden A, Jackson-Grusby L, Dausman J, Gray JW, Leonhardt H, Jaenisch R: Induction of tumors in mice by genomic hypomethylation. Science 2003, 300(5618):489-492. 185. Rizwana R, Hahn PJ: CpG methylation reduces genomic instability. J Cell Sci 1999, 112 ( Pt 24):4513-4519. 186. Daskalos A, Nikolaidis G, Xinarianos G, Savvari P, Cassidy A, Zakopoulou R, Kotsinas A, Gorgoulis V, Field JK, Liloglou T: Hypomethylation of retrotransposable elements correlates with genomic instability in non-small cell lung cancer. Int J Cancer 2009, 124(1):81-87. 187. Walsh CP, Chaillet JR, Bestor TH: Transcription of IAP endogenous retroviruses is constrained by cytosine methylation. Nat Genet 1998, 20(2):116-117. 188. Chalitchagorn K, Shuangshoti S, Hourpai N, Kongruttanachok N, Tangkijvanich P, Thong-ngam D, Voravud N, Sriuranpong V, Mutirangura A: Distinctive pattern of LINE- 1 methylation level in normal tissues and the association with carcinogenesis. Oncogene 2004, 23(54):8841-8846. 189. 
Rauch TA, Zhong X, Wu X, Wang M, Kernstine KH, Wang Z, Riggs AD, Pfeifer GP: High-resolution mapping of DNA hypermethylation and hypomethylation in lung cancer. Proc Natl Acad Sci U S A 2008, 105(1):252-257. 190. Groudine M, Eisenman R, Weintraub H: Chromatin structure of endogenous retroviral genes and activation by an inhibitor of DNA methylation. Nature 1981, 292(5821):311-317. 191. Wilson IM, Davies JJ, Weber M, Brown CJ, Alvarez CE, MacAulay C, Schubeler D, Lam WL: Epigenomics: mapping the methylome. Cell Cycle 2006, 5(2):155-158. 192. Cadieux B, Ching TT, VandenBerg SR, Costello JF: Genome-wide hypomethylation in human glioblastomas associated with specific copy number alteration, methylenetetrahydrofolate reductase allele status, and increased proliferation. Cancer Res 2006, 66(17):8469-8476. 193. Zabarovsky ER, Lerman MI, Minna JD: Tumor suppressor genes on chromosome 3p involved in the pathogenesis of lung and other cancers. Oncogene 2002, 21(45):6915-6935. 194. Belinsky SA, Palmisano WA, Gilliland FD, Crooks LA, Divine KK, Winters SA, Grimes MJ, Harms HJ, Tellez CS, Smith TM et al: Aberrant promoter methylation in bronchial epithelium and sputum from current and former smokers. Cancer Res 2002, 62(8):2370-2377. 195. Palmisano WA, Divine KK, Saccomanno G, Gilliland FD, Baylin SB, Herman JG, Belinsky SA: Predicting lung cancer by detecting aberrant promoter methylation in sputum. Cancer Res 2000, 60(21):5954-5958. 196. Belinsky SA: Gene-promoter hypermethylation as a biomarker in lung cancer. Nat Rev Cancer 2004, 4(9):707-717.

154 197. Tessema M, Willink R, Do K, Yu YY, Yu W, Machida EO, Brock M, Van Neste L, Stidley CA, Baylin SB et al: Promoter methylation of genes in and around the candidate lung cancer susceptibility locus 6q23-25. Cancer Res 2008, 68(6):1707-1714. 198. Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, Ye Z, Lee LK, Stuart RK, Ching CW et al: Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 2009, 459(7243):108-112. 199. Komashko VM, Acevedo LG, Squazzo SL, Iyengar SS, Rabinovich A, O'Geen H, Green R, Farnham PJ: Using ChIP-chip technology to reveal common principles of transcriptional repression in normal and cancer cells. Genome Res 2008, 18(4):521-532. 200. Ke XS, Qu Y, Rostad K, Li WC, Lin B, Halvorsen OJ, Haukaas SA, Jonassen I, Petersen K, Goldfinger N et al: Genome-wide profiling of histone h3 lysine 4 and lysine 27 trimethylation reveals an epigenetic signature in prostate carcinogenesis. PLoS One 2009, 4(3):e4687. 201. Kondo Y, Shen L, Cheng AS, Ahmed S, Boumber Y, Charo C, Yamochi T, Urano T, Furukawa K, Kwabi-Addo B et al: Gene silencing in cancer by histone H3 lysine 27 trimethylation independent of promoter DNA methylation. Nat Genet 2008, 40(6):741-750. 202. Yu J, Rhodes DR, Tomlins SA, Cao X, Chen G, Mehra R, Wang X, Ghosh D, Shah RB, Varambally S et al: A polycomb repression signature in metastatic prostate cancer predicts cancer outcome. Cancer Res 2007, 67(22):10657-10663. 203. Wu J, Wang SH, Potter D, Liu JC, Smith LT, Wu YZ, Huang TH, Plass C: Diverse histone modifications on histone 3 lysine 9 and their relation to DNA methylation in specifying gene silencing. BMC Genomics 2007, 8:131. 204. Krivtsov AV, Feng Z, Lemieux ME, Faber J, Vempati S, Sinha AU, Xia X, Jesneck J, Bracken AP, Silverman LB et al: H3K79 methylation profiles define murine and human MLL-AF4 leukemias. Cancer Cell 2008, 14(5):355-368. 205. 
Lin B, Wang J, Hong X, Yan X, Hwang D, Cho JH, Yi D, Utleg AG, Fang X, Schones DE et al: Integrated expression profiling and ChIP-seq analyses of the growth inhibition response program of the androgen receptor. PLoS One 2009, 4(8):e6589. 206. Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, Orlov YL, Velkov S, Ho A, Mei PH et al: An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 2009, 462(7269):58-64. 207. Coe BP, Chari R, Lockwood WW, Lam WL: Evolving strategies for global gene expression analysis of cancer. J Cell Physiol 2008, 217(3):590-597. 208. Liang P, Pardee AB: Analysing differential gene expression in cancer. Nat Rev Cancer 2003, 3(11):869-876. 209. Nevins JR, Potti A: Mining gene expression profiles: expression signatures as cancer phenotypes. Nat Rev Genet 2007, 8(8):601-609. 210. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A 2002, 99(20):12963-12968. 211. Heidenblad M, Lindgren D, Veltman JA, Jonson T, Mahlamaki EH, Gorunova L, van Kessel AG, Schoenmakers EF, Hoglund M: Microarray analyses reveal strong influence of DNA copy number alterations on the transcriptional patterns in pancreatic cancer: implications for the interpretation of genomic amplifications. Oncogene 2005, 24(10):1794-1801. 212. Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E, Ringner M, Sauter G, Monni O, Elkahloun A et al: Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Res 2002, 62(21):6240-6245. 213. Wolf M, Mousses S, Hautaniemi S, Karhu R, Huusko P, Allinen M, Elkahloun A, Monni O, Chen Y, Kallioniemi A et al: High-resolution analysis of gene copy number

155 alterations in human prostate cancer using CGH on cDNA microarrays: impact of copy number on gene expression. Neoplasia 2004, 6(3):240-247. 214. Adelaide J, Finetti P, Bekhouche I, Repellini L, Geneix J, Sircoulomb F, Charafe-Jauffret E, Cervera N, Desplans J, Parzy D et al: Integrated profiling of basal and luminal breast cancers. Cancer Res 2007, 67(24):11565-11575. 215. Broet P, Camilleri-Broet S, Zhang S, Alifano M, Bangarusamy D, Battistella M, Wu Y, Tuefferd M, Regnard JF, Lim E et al: Prediction of clinical outcome in multiple lung cancer cohorts by integrative genomics: implications for chemotherapy selection. Cancer Res 2009, 69(3):1055-1062. 216. Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo WL, Lapuk A, Neve RM, Qian Z, Ryder T et al: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 2006, 10(6):529-541. 217. Natrajan R, Weigelt B, Mackay A, Geyer FC, Grigoriadis A, Tan DS, Jones C, Lord CJ, Vatcheva R, Rodriguez-Pinilla SM et al: An integrative genomic and transcriptomic analysis reveals molecular pathways and networks regulated by copy number aberrations in basal-like, HER2 and luminal cancers. Breast Cancer Res Treat 2009. 218. Deng S, Calin GA, Croce CM, Coukos G, Zhang L: Mechanisms of microRNA deregulation in human cancer. Cell Cycle 2008, 7(17):2643-2646. 219. Kuo KT, Guan B, Feng Y, Mao TL, Chen X, Jinawath N, Wang Y, Kurman RJ, Shih Ie M, Wang TL: Analysis of DNA copy number alterations in ovarian serous tumors identifies new molecular genetic changes in low-grade and high-grade carcinomas. Cancer Res 2009, 69(9):4036-4042. 220. Lionetti M, Agnelli L, Mosca L, Fabris S, Andronache A, Todoerti K, Ronchetti D, Deliliers GL, Neri A: Integrative high-resolution microarray analysis of human myeloma cell lines reveals deregulated miRNA expression associated with allelic imbalances and gene expression profiles. Genes Chromosomes Cancer 2009, 48(6):521-531. 221. 
Starczynowski DT, Kuchenbauer F, Argiropoulos B, Sung S, Morin R, Muranyi A, Hirst M, Hogge D, Marra M, Wells RA et al: Identification of miR-145 and miR-146a as mediators of the 5q- syndrome phenotype. Nat Med 2009. 222. Zhang L, Volinia S, Bonome T, Calin GA, Greshock J, Yang N, Liu CG, Giannakakis A, Alexiou P, Hasegawa K et al: Genomic and epigenetic alterations deregulate microRNA expression in human epithelial ovarian cancer. Proc Natl Acad Sci U S A 2008, 105(19):7004-7009. 223. Calin GA, Croce CM: MicroRNA signatures in human cancers. Nat Rev Cancer 2006, 6(11):857-866. 224. Nicoloso MS, Spizzo R, Shimizu M, Rossi S, Calin GA: MicroRNAs--the micro steering wheel of tumour metastases. Nat Rev Cancer 2009, 9(4):293-302. 225. Wolf NG, Farver C, Abdul-Karim FW, Schwartz S: Analysis of microsatellite instability and X-inactivation in ovarian borderline tumors lacking numerical abnormalities by comparative genomic hybridization. Cancer Genet Cytogenet 2003, 145(2):133-138. 226. Olson P, Lu J, Zhang H, Shai A, Chun MG, Wang Y, Libutti SK, Nakakura EK, Golub TR, Hanahan D: MicroRNA dynamics in the stages of tumorigenesis correlate with hallmark capabilities of cancer. Genes Dev 2009, 23(18):2152-2165. 227. Wang X: miRDB: a microRNA target prediction and functional annotation database with a wiki interface. RNA 2008, 14(6):1012-1017. 228. Garzon R, Calin GA, Croce CM: MicroRNAs in Cancer. Annu Rev Med 2009, 60:167- 179. 229. Lujambio A, Esteller M: How epigenetics can explain human metastasis: a new role for microRNAs. Cell Cycle 2009, 8(3):377-382. 230. Iorio MV, Visone R, Di Leva G, Donati V, Petrocca F, Casalini P, Taccioli C, Volinia S, Liu CG, Alder H et al: MicroRNA signatures in human ovarian cancer. Cancer Res 2007, 67(18):8699-8707. 156 231. Lujambio A, Esteller M: CpG island hypermethylation of tumor suppressor microRNAs in human cancer. Cell Cycle 2007, 6(12):1455-1459. 232. 
Lujambio A, Ropero S, Ballestar E, Fraga MF, Cerrato C, Setien F, Casado S, Suarez- Gauthier A, Sanchez-Cespedes M, Git A et al: Genetic unmasking of an epigenetically silenced microRNA in human cancer cells. Cancer Res 2007, 67(4):1424-1429. 233. Guil S, Esteller M: DNA methylomes, histone codes and miRNAs: tying it all together. Int J Biochem Cell Biol 2009, 41(1):87-95. 234. Sadikovic B, Yoshimoto M, Chilton-MacNeill S, Thorner P, Squire JA, Zielenska M: Identification of interactive networks of gene expression associated with osteosarcoma oncogenesis by integrated molecular profiling. Hum Mol Genet 2009, 18(11):1962-1975. 235. Joshi MD, Ahmad R, Yin L, Raina D, Rajabi H, Bubley G, Kharbanda S, Kufe D: MUC1 oncoprotein is a druggable target in human prostate cancer cells. Mol Cancer Ther 2009, 8(11):3056-3065. 236. Khodarev NN, Pitroda SP, Beckett MA, MacDermed DM, Huang L, Kufe DW, Weichselbaum RR: MUC1-induced transcriptional programs associated with tumorigenesis predict outcome in breast and lung cancer. Cancer Res 2009, 69(7):2833-2837. 237. Senapati S, Das S, Batra SK: Mucin-interacting proteins: from function to therapeutics. Trends Biochem Sci 2009. 238. Wu CJ, Chen Z, Ullrich A, Greene MI, O'Rourke DM: Inhibition of EGFR-mediated phosphoinositide-3-OH kinase (PI3-K) signaling and glioblastoma phenotype by signal-regulatory proteins (SIRPs). Oncogene 2000, 19(35):3999-4010. 239. Kapoor GS, Kapitonov D, O'Rourke DM: Transcriptional regulation of signal regulatory protein alpha1 inhibitory receptors by epidermal growth factor receptor signaling. Cancer Res 2004, 64(18):6444-6452. 240. Yamasaki Y, Ito S, Tsunoda N, Kokuryo T, Hara K, Senga T, Kannagi R, Yamamoto T, Oda K, Nagino M et al: SIRPalpha1 and SIRPalpha2: their role as tumor suppressors in breast carcinoma cells. Biochem Biophys Res Commun 2007, 361(1):7-13. 241. Qin JM, Wan XW, Zeng JZ, Wu MC: Effect of Sirpalpha1 on the expression of nuclear factor-kappa B in hepatocellular carcinoma. 
Hepatobiliary Pancreat Dis Int 2007, 6(3):276-283. 242. Gardai SJ, Xiao YQ, Dickinson M, Nick JA, Voelker DR, Greene KE, Henson PM: By binding SIRPalpha or calreticulin/CD91, lung collectins act as dual function surveillance molecules to suppress or enhance inflammation. Cell 2003, 115(1):13- 23. 243. Takada T, Matozaki T, Takeda H, Fukunaga K, Noguchi T, Fujioka Y, Okazaki I, Tsuda M, Yamao T, Ochi F et al: Roles of the complex formation of SHPS-1 with SHP-2 in insulin-stimulated mitogen-activated protein kinase activation. J Biol Chem 1998, 273(15):9234-9242. 244. Motegi S, Okazawa H, Ohnishi H, Sato R, Kaneko Y, Kobayashi H, Tomizawa K, Ito T, Honma N, Buhring HJ et al: Role of the CD47-SHPS-1 system in regulation of cell migration. EMBO J 2003, 22(11):2634-2644. 245. Kharitonenkov A, Chen Z, Sures I, Wang H, Schilling J, Ullrich A: A family of proteins that inhibit signalling through tyrosine kinase receptors. Nature 1997, 386(6621):181-186. 246. Meyer PE, Kontos K, Lafitte F, Bontempi G: Information-theoretic inference of large transcriptional regulatory networks. EURASIP J Bioinform Syst Biol 2007:79879. 247. Meyer PE, Lafitte F, Bontempi G: minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics 2008, 9:461.

157 248. Xi L, Feber A, Gupta V, Wu M, Bergemann AD, Landreneau RJ, Litle VR, Pennathur A, Luketich JD, Godfrey TE: Whole genome exon arrays identify differential expression of alternatively spliced, cancer-related genes in lung cancer. Nucleic Acids Res 2008, 36(20):6535-6547. 249. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J et al: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 250. Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A et al: Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006, 439(7074):353-357. 251. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE et al: Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 2008, 14(8):822-827. 252. Chitale D, Gong Y, Taylor BS, Broderick S, Brennan C, Somwar R, Golas B, Wang L, Motoi N, Szoke J et al: An integrated genomic analysis of lung cancer reveals loss of DUSP4 in EGFR-mutant tumors. Oncogene 2009, 28(31):2773-2783. 253. Buys TP, Aviel-Ronen S, Waddell TK, Lam WL, Tsao MS: Defining genomic alteration boundaries for a combined small cell and non-small cell lung carcinoma. J Thorac Oncol 2009, 4(2):227-239. 254. Brommesson S, Jonsson G, Strand C, Grabau D, Malmstrom P, Ringner M, Ferno M, Hedenfalk I: Tiling array-CGH for the assessment of genomic similarities among synchronous unilateral and bilateral invasive breast cancer tumor pairs. BMC Clin Pathol 2008, 8:6. 255. Kawanishi H, Takahashi T, Ito M, Matsui Y, Watanabe J, Ito N, Kamoto T, Kadowaki T, Tsujimoto G, Imoto I et al: Genetic analysis of multifocal superficial urothelial cancers by array-based comparative genomic hybridisation. Br J Cancer 2007, 97(2):260-266. 256. 
Mhawech-Fauceglia P, Rai H, Nowak N, Cheney RT, Rodabaugh K, Lele S, Odunsi K: The use of array-based comparative genomic hybridization (a-CGH) to distinguish metastatic from primary synchronous carcinomas of the ovary and the uterus. Histopathology 2008, 53(4):490-495. 257. Nakano H, Soda H, Nakamura Y, Uchida K, Takasu M, Nakatomi K, Izumikawa K, Hayashi T, Nagayasu T, Tsukamoto K et al: Different epidermal growth factor receptor gene mutations in a patient with 2 synchronous lung cancers. Clin Lung Cancer 2007, 8(9):562-564. 258. Ryoo BY, Na, II, Yang SH, Koh JS, Kim CH, Lee JC: Synchronous multiple primary lung cancers with different response to gefitinib. Lung Cancer 2006, 53(2):245-248. 259. Speel EJ, van de Wouw AJ, Claessen SM, Haesevoets A, Hopman AH, van der Wurff AA, Osieka R, Buettner R, Hillen HF, Ramaekers FC: Molecular evidence for a clonal relationship between multiple lesions in patients with unknown primary adenocarcinoma. Int J Cancer 2008, 123(6):1292-1300. 260. Wa CV, DeVries S, Chen YY, Waldman FM, Hwang ES: Clinical application of array- based comparative genomic hybridization to define the relationship between multiple synchronous tumors. Mod Pathol 2005, 18(4):591-597. 261. Agelopoulos K, Tidow N, Korsching E, Voss R, Hinrichs B, Brandt B, Boecker W, Buerger H: Molecular cytogenetic investigations of synchronous bilateral breast cancer. J Clin Pathol 2003, 56(9):660-665. 262. Whitehurst AW, Bodemann BO, Cardenas J, Ferguson D, Girard L, Peyton M, Minna JD, Michnoff C, Hao W, Roth MG et al: Synthetic lethal screen identification of chemosensitizer loci in cancer cells. Nature 2007, 446(7137):815-819. 263. Barbie DA, Tamayo P, Boehm JS, Kim SY, Moody SE, Dunn IF, Schinzel AC, Sandy P, Meylan E, Scholl C et al: Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature 2009, 462(7269):108-112. 158 264. 
Berns K, Hijmans EM, Mullenders J, Brummelkamp TR, Velds A, Heimerikx M, Kerkhoven RM, Madiredjo M, Nijkamp W, Weigelt B et al: A large-scale RNAi screen in human cells identifies new components of the p53 pathway. Nature 2004, 428(6981):431-437. 265. Gobeil S, Zhu X, Doillon CJ, Green MR: A genome-wide shRNA screen identifies GAS1 as a novel melanoma metastasis suppressor gene. Genes Dev 2008, 22(21):2932-2940. 266. Luo B, Cheung HW, Subramanian A, Sharifnia T, Okamoto M, Yang X, Hinkle G, Boehm JS, Beroukhim R, Weir BA et al: Highly parallel identification of essential genes in cancer cells. Proc Natl Acad Sci U S A 2008, 105(51):20380-20385. 267. Luo J, Emanuele MJ, Li D, Creighton CJ, Schlabach MR, Westbrook TF, Wong KK, Elledge SJ: A genome-wide RNAi screen identifies multiple synthetic lethal interactions with the Ras oncogene. Cell 2009, 137(5):835-848. 268. Moffat J, Grueneberg DA, Yang X, Kim SY, Kloepfer AM, Hinkle G, Piqani B, Eisenhaure TM, Luo B, Grenier JK et al: A lentiviral RNAi library for human and mouse genes applied to an arrayed viral high-content screen. Cell 2006, 124(6):1283-1298. 269. Scholl C, Frohling S, Dunn IF, Schinzel AC, Barbie DA, Kim SY, Silver SJ, Tamayo P, Wadlow RC, Ramaswamy S et al: Synthetic lethal interaction between oncogenic KRAS dependency and STK33 suppression in human cancer cells. Cell 2009, 137(5):821-834. 270. Silva JM, Marran K, Parker JS, Silva J, Golding M, Schlabach MR, Elledge SJ, Hannon GJ, Chang K: Profiling essential genes in human mammary cells by multiplex RNAi screening. Science 2008, 319(5863):617-620. 271. Apweiler R, Aslanidis C, Deufel T, Gerstner A, Hansen J, Hochstrasser D, Kellner R, Kubicek M, Lottspeich F, Maser E et al: Approaching clinical proteomics: current state and future fields of application in cellular proteomics. Cytometry A 2009, 75(10):816-832. 272. 
Apweiler R, Aslanidis C, Deufel T, Gerstner A, Hansen J, Hochstrasser D, Kellner R, Kubicek M, Lottspeich F, Maser E et al: Approaching clinical proteomics: current state and future fields of application in fluid proteomics. Clin Chem Lab Med 2009, 47(6):724-744. 273. Peng XQ, Wang F, Geng X, Zhang WM: Current advances in tumor proteomics and candidate biomarkers for hepatic cancer. Expert Rev Proteomics 2009, 6(5):551-561. 274. Tainsky MA: Genomic and proteomic biomarkers for cancer: a multitude of opportunities. Biochim Biophys Acta 2009, 1796(2):176-193. 275. Zamo A, Cecconi D: Proteomic analysis of lymphoid and haematopoietic neoplasms: There's more than biomarker discovery. J Proteomics 2009. 276. Griffin JL, Kauppinen RA: A metabolomics perspective of human brain tumours. FEBS J 2007, 274(5):1132-1139. 277. Spratlin JL, Serkova NJ, Eckhardt SG: Clinical applications of metabolomics in oncology: a review. Clin Cancer Res 2009, 15(2):431-440. 278. Sreekumar A, Poisson LM, Rajendiran TM, Khan AP, Cao Q, Yu J, Laxman B, Mehra R, Lonigro RJ, Li Y et al: Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature 2009, 457(7231):910-914. 279. Adamovic T, Trosso F, Roshani L, Andersson L, Petersen G, Rajaei S, Helou K, Levan G: Oncogene amplification in the proximal part of chromosome 6 in rat endometrial adenocarcinoma as revealed by combined BAC/PAC FISH, chromosome painting, zoo-FISH, and allelotyping. Genes Chromosomes Cancer 2005, 44(2):139-153. 280. Ferrandina G, Mey V, Nannizzi S, Ricciardi S, Petrillo M, Ferlini C, Danesi R, Scambia G, Del Tacca M: Expression of nucleoside transporters, deoxycitidine kinase, ribonucleotide reductase regulatory subunits, and gemcitabine catabolic enzymes in primary ovarian cancer. Cancer Chemother Pharmacol 2009.


Chapter 6: Conclusions

6.1 Summary

Lung adenocarcinoma is the most commonly diagnosed form of lung cancer today, with a large percentage of patients exhibiting poor overall survival and prognosis. Genomic analysis has provided much insight into this disease with the identification of specific differentially expressed genes, somatically mutated genes, hyper- and hypomethylated genes, and genes which are amplified or deleted at the DNA copy number level. While tools and platforms for whole genome analysis of gene dosage and gene expression are widely accessible and have improved in resolution, only recently have technologies to assess DNA methylation in a high throughput manner become available. Hence, the logical next step with these vast amounts of data is to integrate the information from these different assays in a parallel manner to gain a better understanding of the biology of lung adenocarcinoma.

6.1.1 Development of the integrative genetic and epigenetic approach

In chapter 2, the development of the SIGMA2 software package is discussed [1]. At the time the package was developed, there were no analysis tools for integrative genetic and epigenetic analysis, nor even tools to integrate gene dosage and gene expression, the two well established high throughput platforms. Moreover, for array CGH data alone, only a limited number of tools existed. Hence, SIGMA [2] was developed first, as a precursor to SIGMA2, and was used as its framework.

In chapter 3, I demonstrated that when this integrative approach is applied to model systems, we learn much more from multiple dimensions than from any single dimension alone. Specifically, I showed that more of the dysregulated gene expression can be associated with alterations at the DNA level, with some cell lines having as much as 80% of the observed gene expression changes associated with DNA level changes. In addition, I illustrated two key concepts: (i) the “Or” (multiple alternate hit mechanisms) concept, where across a sample set and using a fixed frequency of disruption, we can identify nearly three times as many genes when we account for multiple types of disruption as opposed to only a single type of alteration, and (ii) the “And” (coupled multiple hits) concept, where we identify genes which are targeted by multiple mechanisms in the same sample and show that these genes can have significant biological and/or clinical relevance.
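The “Or” and “And” concepts can be made concrete with a minimal sketch. The code below is an illustration only and is not the SIGMA2 implementation: the gene names, sample names, dimension labels, function names, and the 50% frequency threshold are all hypothetical, and alteration calls are assumed to be already available as sets of genes per sample and dimension.

```python
from typing import Dict, Iterable, Set

# sample -> dimension -> set of altered genes (all values are made-up toy data)
Calls = Dict[str, Dict[str, Set[str]]]


def or_frequent_genes(calls: Calls, dimensions: Iterable[str],
                      min_fraction: float) -> Set[str]:
    """'Or' concept: genes altered by ANY listed dimension in at least
    min_fraction of the samples."""
    dims = list(dimensions)
    counts: Dict[str, int] = {}
    for per_dim in calls.values():
        hit: Set[str] = set()
        for dim in dims:
            hit |= per_dim.get(dim, set())
        for gene in hit:
            counts[gene] = counts.get(gene, 0) + 1
    n_samples = len(calls)
    return {g for g, c in counts.items() if c / n_samples >= min_fraction}


def and_coupled_genes(calls: Calls, dimensions: Iterable[str],
                      min_hits: int = 2) -> Dict[str, Set[str]]:
    """'And' concept: per sample, genes altered by at least min_hits
    different dimensions within that same sample."""
    dims = list(dimensions)
    coupled: Dict[str, Set[str]] = {}
    for sample, per_dim in calls.items():
        hits: Dict[str, int] = {}
        for dim in dims:
            for gene in per_dim.get(dim, set()):
                hits[gene] = hits.get(gene, 0) + 1
        coupled[sample] = {g for g, k in hits.items() if k >= min_hits}
    return coupled


# Hypothetical calls for four samples and three DNA-level dimensions.
TOY_CALLS: Calls = {
    "S1": {"copy_number": {"GENE_A", "GENE_B"}, "methylation": set(), "loh": set()},
    "S2": {"copy_number": {"GENE_B"}, "methylation": {"GENE_A"}, "loh": set()},
    "S3": {"copy_number": {"GENE_B"}, "methylation": set(), "loh": {"GENE_A"}},
    "S4": {"copy_number": {"GENE_C"}, "methylation": {"GENE_C"}, "loh": set()},
}
DIMENSIONS = ["copy_number", "methylation", "loh"]

single_dimension = or_frequent_genes(TOY_CALLS, ["copy_number"], 0.5)
multi_dimension = or_frequent_genes(TOY_CALLS, DIMENSIONS, 0.5)
coupled_hits = and_coupled_genes(TOY_CALLS, DIMENSIONS)
```

In this toy data, GENE_A is altered in only one of four samples in any single dimension, so it is missed at the 50% threshold by each single-dimension analysis, but it passes the threshold once all three dimensions are pooled. GENE_C is the only gene hit by two mechanisms within the same sample.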

6.1.2 Identification of a prevalent genetic alteration in lung adenocarcinoma

Genetic and epigenetic alterations have been shown to be prominent in lung adenocarcinoma. Among genetic alterations, the majority of those documented involve changes in gene dosage, somatic mutation, and loss of heterozygosity (LOH) or allelic imbalance (whereby an allele or a portion of an allele is lost or gained in the tumor). Most of the time, allelic imbalance is captured as a decrease or increase in gene dosage. However, there are also cases where allelic imbalance exists but there is no net change in gene dosage, termed copy neutral LOH or somatic uniparental disomy (UPD). Chapter 4 discusses the unexpected prevalence of UPD in the lung adenocarcinoma genome.
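The distinction between allelic imbalance with a dosage change and copy neutral LOH can be sketched as a simple classification rule. This is an idealized illustration, not the algorithm used in chapter 4: it assumes that integer allele-specific copy numbers (a "major" and a "minor" allele count) are already available for a segment in an otherwise diploid genome, and the function name and labels are hypothetical. Real SNP array data are noisy and affected by normal cell contamination.

```python
def classify_segment(major: int, minor: int) -> str:
    """Classify a segment from idealized allele-specific copy numbers.

    major and minor are the copy numbers of the more and less abundant
    allele; a normal diploid segment has major == minor == 1.
    """
    if minor < 0 or major < minor:
        raise ValueError("expected major >= minor >= 0")
    total = major + minor
    if total == 2:
        # No net change in gene dosage: either both alleles are retained,
        # or one allele is lost and the other duplicated (UPD).
        return "copy-neutral LOH (UPD)" if minor == 0 else "balanced"
    # Any other total is visible as a change in gene dosage.
    return "loss" if total < 2 else "gain"
```

For example, a segment with two copies of one allele and none of the other has the same total dosage as a normal segment, so it is invisible to a dosage-only assay but is flagged here as UPD.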

Though previous studies used SNP arrays on lung adenocarcinoma tumors, the prevalence of UPD was likely underestimated for a number of reasons. These include the use of heterogeneous samples with high normal cell contamination due to lack of microdissection, the lower resolution of alterations identifiable by previous SNP arrays, the use of non patient-matched controls as reference, and reliance on call-based algorithms rather than algorithms which use allele specific copy number [3].

In addition to the prevalence of UPD, the other key finding from this chapter is the presence of frequent UPD at both known and novel oncogenes. While UPD has previously been shown to affect tumor suppressor genes such as RB1 [4], UPD at oncogene loci has not been reported as often in solid tumors. Moreover, in previous studies of hematological malignancies such as leukemias, lymphomas, or myelodysplastic syndromes, the observed UPD at oncogenes was typically accompanied by an acquired homozygous mutation at the locus [3, 5-8]. In our data, though UPD was also observed at mutated KRAS, as shown previously [9], there was also frequent UPD in cases where KRAS was not mutated. This finding suggests that in cases in which UPD occurs without somatic mutation, the UPD event may in fact be used as a mechanism for preferential allele selection. Specifically, this could be a preference for the methylated allele for tumor suppressor genes or the unmethylated allele for oncogenes [10, 11]. Alternatively, the preference could be for a more transcriptionally active (or inactive) allele, as it has been shown that for specific genes, the two alleles may differ in rates of transcription [12-16]. Thus, integration of genetic data with epigenetic and gene expression data would help decipher the target of these frequently observed UPD events.

6.1.3 Application of the integrative approach to lung adenocarcinoma specimens

Thus far, I have shown that the integrative genetic and epigenetic approach is beneficial in identifying important genes, both those which would have been missed if single assays alone were used and those which have concurrent alteration at multiple levels. Chapter 5 discusses how, upon application of this approach to lung adenocarcinoma specimens, this trend also holds true in clinical samples. Specifically, I show that novel canonical signaling pathways are significantly enriched when multiple DNA-based dimensions are used but are missed (not statistically significant) when a single DNA-based dimension is used. In addition, when we examined the well-documented EGFR signaling pathway, a pathway known to be involved in lung adenocarcinoma, the most frequently disrupted gene was signal-regulatory protein alpha (SIRPA).

SIRPA has been shown to be a direct downstream component of EGFR, and its expression has been shown to be suppressed by EGFR activation [17, 18]. In the resting lung, SIRPA has been postulated to control the inflammatory response through SHP-1 and eventually, NFKB [19]. While there are likely multiple components between SIRPA and NFKB, we wanted to assess expression of components directly associated with SIRPA. In addition, since this gene was identified in a small set of samples, we wanted to see if this prevalence of disruption was maintained in an additional, larger set of tumors. Hence, we evaluated expression of SHP-1 and SIRPA in a panel of approximately 60 lung adenocarcinoma tumors and found (i) a high prevalence of underexpression of SIRPA and (ii) a strong correlation between SIRPA and SHP-1 expression levels. It is interesting to observe this strong relationship between SIRPA and SHP-1, as most cancer studies have focused on SIRPA’s relationship with SHP-2 instead of SHP-1.
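As an illustration of how the two summary statistics above might be computed, the sketch below calculates the fraction of tumors underexpressing a gene and a correlation coefficient between two expression vectors. It is a sketch under stated assumptions: the choice of Pearson correlation, the use of log2 tumor/normal ratios, the cutoff of -1 (a two-fold decrease), the function names, and the expression values are all hypothetical and are not taken from the analysis in chapter 5.

```python
import math
from typing import Sequence


def underexpression_prevalence(log2_ratios: Sequence[float],
                               cutoff: float = -1.0) -> float:
    """Fraction of tumors whose log2(tumor/normal) ratio is at or below cutoff."""
    if not log2_ratios:
        raise ValueError("need at least one tumor")
    return sum(r <= cutoff for r in log2_ratios) / len(log2_ratios)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Made-up log2(tumor/normal) ratios for five hypothetical tumors.
sirpa = [-2.1, -1.4, -0.2, -1.8, 0.3]
shp1 = [-1.9, -1.1, 0.1, -1.5, 0.4]

prevalence = underexpression_prevalence(sirpa)
r = pearson(sirpa, shp1)
```

A rank-based coefficient such as Spearman's would be a reasonable alternative if the expression values are not approximately linear in their relationship.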

6.2 Conclusions

I have demonstrated the power of an integrative genetic and epigenetic approach to decipher resultant gene expression changes in lung adenocarcinoma. The development of an application such as SIGMA2 was integral, as it represented one of the first academic/research applications with the ability to integrate multiple dimensions of data. To date, a few other applications with similar functionality have been developed, but most of these have come from commercial entities. Moreover, the software is not yet outdated and, because of the way it was built, can be extended to handle newer high throughput platforms, including sequence-based platforms.

In terms of what we learn from both the demonstration dataset (Chapter 3) and the clinical tumor dataset (Chapter 5), genes are detected as disrupted at a much higher frequency when multiple dimensions are examined than when single dimensions are examined alone. In particular, a gene may be disrupted in any single dimension at a low frequency, yet when multiple dimensions are accounted for, its frequency of disruption is in fact high. In Figure 5.5, I illustrate how well known lung cancer genes such as RRM2 are altered at both the genetic and epigenetic level and how more pathways are deemed significant when multiple dimensions are analyzed. The latter finding is likely a result of the fact that within a given pathway, not only can different genes be affected in different samples by one mechanism (e.g. DNA copy number amplification), but they can also be affected by different, but complementary, mechanisms (e.g. DNA methylation). These findings validate part A of the hypothesis. In addition, when aligning genetic and epigenetic profiles, I show that a high proportion of the observed differential expression can be attributed to genetic and epigenetic changes, validating part B of the hypothesis (genetic and epigenetic changes resulting in aberrant gene expression). Finally, when examining the EGFR signaling pathway, we observe that a number of key genes are frequently altered, while other genes are not altered as often. The most frequently affected gene, SIRPA, exhibited both genetic and epigenetic alteration, and in some cases these occurred concurrently within the same sample. Moreover, the analyses of both chapter 3 (breast cancer) and chapter 5 (lung cancer) found that over three times as many genes are identified as frequently aberrant when using multiple dimensions as compared to any single dimension. These findings validate part C of the hypothesis, whereby a gene within a commonly deregulated lung cancer pathway was identified using the integrative genetic and epigenetic approach.

This finding has potential implications for the sample sizes necessary to discover important alterations, as one can examine a small number of samples in more detail rather than a large sample set using a single assay. Specifically, this applies to genes that are disrupted at a low frequency in one dimension but at a high frequency across multiple dimensions, since large sample sets would be needed to be confident of the low single dimension alteration frequency. Such considerations matter most in situations where large sample sets are not feasible due to the rarity and preciousness of samples.
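The sample size argument can be illustrated with a short worked calculation. The sketch below rests on simplifying assumptions that are not claims from this thesis: alterations in different dimensions are treated as statistically independent, samples are treated as independent draws, and the per-dimension frequency of 10%, the panel of 10 samples, and the requirement of at least 3 altered samples are hypothetical numbers chosen for illustration.

```python
from math import comb
from typing import Sequence


def combined_frequency(per_dimension: Sequence[float]) -> float:
    """Frequency of disruption by at least one dimension,
    ASSUMING the dimensions are altered independently."""
    p_none = 1.0
    for p in per_dimension:
        if not 0.0 <= p <= 1.0:
            raise ValueError("frequencies must lie in [0, 1]")
        p_none *= 1.0 - p
    return 1.0 - p_none


def prob_at_least(n_samples: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n_samples, p): the chance that a gene
    with true alteration frequency p is seen in k or more samples."""
    return sum(comb(n_samples, i) * p ** i * (1.0 - p) ** (n_samples - i)
               for i in range(k, n_samples + 1))


# Hypothetical gene altered in 10% of tumors in each of three dimensions.
single = 0.10
multi = combined_frequency([single, single, single])

# Chance of observing the gene altered in >= 3 of 10 samples.
p_detect_single = prob_at_least(10, 3, single)
p_detect_multi = prob_at_least(10, 3, multi)
```

Under these assumptions, a 10% frequency in each of three dimensions combines to roughly 27%, and the chance of seeing the gene altered in at least 3 of 10 samples is correspondingly higher than for any single dimension. If alterations in different dimensions tend to co-occur in the same tumors, the combined frequency would be lower than this independence estimate.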

In terms of the multi-dimensional perspective on lung adenocarcinoma, in addition to finding additional genes and pathways that are disrupted when we look at multiple dimensions, when examining known genes and pathways we see a complex pattern of deregulation, with some components altered more frequently than others. This added complexity highlights how each tumor differs from the others, and suggests that the rational approach to identifying therapeutic targets will operate at the pathway level as opposed to the gene level. Within these pathways, key nodes or “choke points” will likely serve as the best targets for therapeutic intervention.

6.3 Future directions

There are three key future directions which should be pursued at this point: (i) further evaluation of SIRPA as a novel tumor suppressor gene in lung adenocarcinoma, (ii) evaluation of the novel signaling pathways implicated through multi-dimensional analysis, and (iii) incorporation of data from other dimensions not evaluated in this thesis.

In terms of SIRPA, the first experiments that need to be done are to assess whether the deregulated mRNA expression of SIRPA is also observed at the protein level. One such approach would be immunohistochemistry on a tissue microarray comprising hundreds of samples with well annotated clinical information. Subsequently, the frequency of underexpression, the correlation with overall patient survival, and the overexpression associated with a subset of tumors from patients with a never-smoking history could be validated. In addition, depending on what clinical information is available, one could correlate expression with other parameters that were not available to me from the public gene expression microarray datasets to uncover other interesting clinical associations.

Secondly, with the amount of literature suggesting that SIRPA could be a tumor suppressor gene in adenocarcinoma, the next set of experiments would be designed to test the tumor suppressor role of SIRPA. This would require silencing the gene in normal cells, using RNAi based techniques for example, and assessing tumorigenic phenotypes such as anchorage independent growth, reduction of apoptosis, and increase in proliferation. In addition, parallel gene introduction experiments would need to be done in cancer cell lines which exhibit little or no expression of SIRPA, and the level of suppression of the above listed tumor phenotypes would then be assessed.

One of the most statistically significant canonical signaling pathways identified by Ingenuity Pathway Analysis is the Hepatic Fibrosis/Hepatic Stellate Cell Activation pathway. While the existence and role of stellate cells have been well documented in the liver and pancreas, there have been a limited number of reports of stellate cells in the lung [20]. From what is known in the liver and pancreas, stellate cells are involved in tissue fibrosis and inflammation in chronic diseases such as pancreatitis and hepatitis [21-25]. In pancreatic tumors, activated stellate cells promote an increase in connective tissue surrounding the tumors (termed the desmoplastic process) and have been shown to be proliferative in the presence of tumor secreted factors [25]. In addition, stellate cells also have implications in drug resistance [26]. In the lung, it is plausible to envision a role of stellate cells in diseases such as chronic obstructive pulmonary disease (COPD) where tissue fibrosis and inflammation are prominent [27]. One of the challenges to testing this function in vitro is that it would be important to recapitulate the tumor microenvironment. Hence, this function would have to be tested in vivo using inducible mouse models where expression of secreted factors associated with stellate cell activation, which were identified from our analysis, can be assessed. Phenotypes such as cellular proliferation, apoptosis, and drug resistance could then be assayed and compared between pre- and post-induction of these secreted factors.

Finally, although multiple DNA dimensions were analyzed in this thesis, recent advances in technology have made other dimensions available for incorporation. For example, genome sequencing technologies allow for the detection of novel somatic mutations in a high throughput manner. While performing this at the whole genome level is financially and computationally challenging, the effort can be focused on the "exome" (DNA from gene coding exons only) using sequence capture based techniques [28, 29]. MicroRNAs have also been shown to be important in lung cancer, with specific microRNAs shown to be differentially expressed [30-36]. MicroRNAs can affect downstream protein expression through a number of different mechanisms [37-40]. Integration of microRNA and sequence mutation data with the previously described genetic and epigenetic data would further increase our understanding of the biology of lung adenocarcinoma.

6.4 References

1. Chari R, Coe BP, Wedseltoft C, Benetti M, Wilson IM, Vucic EA, MacAulay C, Ng RT, Lam WL: SIGMA2: a system for the integrative genomic multi-dimensional analysis of cancer genomes, epigenomes, and transcriptomes. BMC Bioinformatics 2008, 9:422.
2. Chari R, Lockwood WW, Coe BP, Chu A, Macey D, Thomson A, Davies JJ, MacAulay C, Lam WL: SIGMA: a system for integrative genomic microarray analysis of cancer genomes. BMC Genomics 2006, 7:324.
3. Sanada M, Suzuki T, Shih LY, Otsu M, Kato M, Yamazaki S, Tamura A, Honda H, Sakata-Yanagimoto M, Kumano K et al: Gain-of-function of mutated C-CBL tumour suppressor in myeloid neoplasms. Nature 2009, 460(7257):904-908.
4. Zhu X, Dunn JM, Goddard AD, Squire JA, Becker A, Phillips RA, Gallie BL: Mechanisms of loss of heterozygosity in retinoblastoma. Cytogenet Cell Genet 1992, 59(4):248-252.
5. Grand FH, Hidalgo-Curtis CE, Ernst T, Zoi K, Zoi C, McGuire C, Kreil S, Jones A, Score J, Metzgeroth G et al: Frequent CBL mutations associated with 11q acquired uniparental disomy in myeloproliferative neoplasms. Blood 2009, 113(24):6182-6192.
6. Kralovics R, Guan Y, Prchal JT: Acquired uniparental disomy of chromosome 9p is a frequent stem cell defect in polycythemia vera. Exp Hematol 2002, 30(3):229-236.
7. Tiu RV, Gondek LP, O'Keefe CL, Huh J, Sekeres MA, Elson P, McDevitt MA, Wang XF, Levis MJ, Karp JE et al: New lesions detected by single nucleotide polymorphism array-based chromosomal analysis have important clinical impact in acute myeloid leukemia. J Clin Oncol 2009, 27(31):5219-5226.
8. Yamamoto G, Nannya Y, Kato M, Sanada M, Levine RL, Kawamata N, Hangaishi A, Kurokawa M, Chiba S, Gilliland DG et al: Highly sensitive method for genomewide detection of allelic composition in nonpaired, primary tumor specimens by use of affymetrix single-nucleotide-polymorphism genotyping microarrays. Am J Hum Genet 2007, 81(1):114-126.
9. Soh J, Okumura N, Lockwood WW, Yamamoto H, Shigematsu H, Zhang W, Chari R, Shames DS, Tang X, MacAulay C et al: Oncogene mutations, copy number gains and mutant allele specific imbalance (MASI) frequently occur together in tumor cells. PLoS One 2009, 4(10):e7464.
10. Darbary HK, Dutt SS, Sait SJ, Nowak NJ, Heinaman RE, Stoler DL, Anderson GR: Uniparentalism in sporadic colorectal cancer is independent of imprint status, and coordinate for chromosomes 14 and 18. Cancer Genet Cytogenet 2009, 189(2):77-86.
11. Tuna M, Knuutila S, Mills GB: Uniparental disomy in cancer. Trends Mol Med 2009, 15(3):120-128.
12. Bjornsson HT, Albert TJ, Ladd-Acosta CM, Green RD, Rongione MA, Middle CM, Irizarry RA, Broman KW, Feinberg AP: SNP-specific array-based allele-specific expression analysis. Genome Res 2008, 18(5):771-779.
13. Gimelbrant A, Hutchinson JN, Thompson BR, Chess A: Widespread monoallelic expression on human autosomes. Science 2007, 318(5853):1136-1140.
14. Milani L, Lundmark A, Nordlund J, Kiialainen A, Flaegstad T, Jonmundsson G, Kanerva J, Schmiegelow K, Gunderson KL, Lonnerholm G et al: Allele-specific gene expression patterns in primary leukemic cells reveal regulation of gene expression by CpG site methylation. Genome Res 2009, 19(1):1-11.
15. Palacios R, Gazave E, Goni J, Piedrafita G, Fernando O, Navarro A, Villoslada P: Allele-specific gene expression is widespread across the genome and biological processes. PLoS One 2009, 4(1):e4150.
16. Zhang K, Li JB, Gao Y, Egli D, Xie B, Deng J, Li Z, Lee JH, Aach J, Leproust EM et al: Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human. Nat Methods 2009, 6(8):613-618.
17. Kapoor GS, Kapitonov D, O'Rourke DM: Transcriptional regulation of signal regulatory protein alpha1 inhibitory receptors by epidermal growth factor receptor signaling. Cancer Res 2004, 64(18):6444-6452.
18. Wu CJ, Chen Z, Ullrich A, Greene MI, O'Rourke DM: Inhibition of EGFR-mediated phosphoinositide-3-OH kinase (PI3-K) signaling and glioblastoma phenotype by signal-regulatory proteins (SIRPs). Oncogene 2000, 19(35):3999-4010.
19. Gardai SJ, Xiao YQ, Dickinson M, Nick JA, Voelker DR, Greene KE, Henson PM: By binding SIRPalpha or calreticulin/CD91, lung collectins act as dual function surveillance molecules to suppress or enhance inflammation. Cell 2003, 115(1):13-23.
20. Keane MP, Strieter RM, Belperio JA: Mechanisms and mediators of pulmonary fibrosis. Crit Rev Immunol 2005, 25(6):429-463.
21. Geerts A: History, heterogeneity, developmental biology, and functions of quiescent hepatic stellate cells. Semin Liver Dis 2001, 21(3):311-335.
22. Hautekeete ML, Geerts A: The hepatic stellate (Ito) cell: its role in human liver disease. Virchows Arch 1997, 430(3):195-207.
23. Masamune A, Shimosegawa T: Signal transduction in pancreatic stellate cells. J Gastroenterol 2009, 44(4):249-260.
24. Masamune A, Watanabe T, Kikuta K, Shimosegawa T: Roles of pancreatic stellate cells in pancreatic inflammation and fibrosis. Clin Gastroenterol Hepatol 2009, 7(11 Suppl):S48-54.
25. Omary MB, Lugea A, Lowe AW, Pandol SJ: The pancreatic stellate cell: a star on the rise in pancreatic diseases. J Clin Invest 2007, 117(1):50-59.
26. Mahadevan D, Von Hoff DD: Tumor-stroma interactions in pancreatic ductal adenocarcinoma. Mol Cancer Ther 2007, 6(4):1186-1197.
27. Chung KF, Adcock IM: Multifaceted mechanisms in COPD: inflammation, immunity, and tissue repair and destruction. Eur Respir J 2008, 31(6):1334-1356.
28. Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA et al: Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 2010, 42(1):30-35.
29. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE et al: Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009, 461(7261):272-276.
30. Du L, Pertsemlidis A: microRNAs and lung cancer: tumors and 22-mers. Cancer Metastasis Rev 2010.
31. Garofalo M, Di Leva G, Romano G, Nuovo G, Suh SS, Ngankeu A, Taccioli C, Pichiorri F, Alder H, Secchiero P et al: miR-221&222 regulate TRAIL resistance and enhance tumorigenicity through PTEN and TIMP3 downregulation. Cancer Cell 2009, 16(6):498-509.
32. Johnson SM, Grosshans H, Shingara J, Byrom M, Jarvis R, Cheng A, Labourier E, Reinert KL, Brown D, Slack FJ: RAS is regulated by the let-7 microRNA family. Cell 2005, 120(5):635-647.
33. Kumar MS, Erkeland SJ, Pester RE, Chen CY, Ebert MS, Sharp PA, Jacks T: Suppression of non-small cell lung tumor development by the let-7 microRNA family. Proc Natl Acad Sci U S A 2008, 105(10):3903-3908.
34. Talotta F, Cimmino A, Matarazzo MR, Casalino L, De Vita G, D'Esposito M, Di Lauro R, Verde P: An autoregulatory loop mediated by miR-21 and PDCD4 controls the AP-1 activity in RAS transformation. Oncogene 2009, 28(1):73-84.
35. Weiss GJ, Bemis LT, Nakajima E, Sugita M, Birks DK, Robinson WA, Varella-Garcia M, Bunn PA, Jr., Haney J, Helfrich BA et al: EGFR regulation by microRNA in lung cancer: correlation with clinical response and survival to gefitinib and EGFR expression in cell lines. Ann Oncol 2008, 19(6):1053-1059.
36. Xiao C, Srinivasan L, Calado DP, Patterson HC, Zhang B, Wang J, Henderson JM, Kutok JL, Rajewsky K: Lymphoproliferative disease and autoimmunity in mice with increased miR-17-92 expression in lymphocytes. Nat Immunol 2008, 9(4):405-414.
37. Lee RC, Ambros V: An extensive class of small RNAs in Caenorhabditis elegans. Science 2001, 294(5543):862-864.
38. Mattick JS, Makunin IV: Small regulatory RNAs in mammals. Hum Mol Genet 2005, 14 Spec No 1:R121-132.
39. McManus MT: MicroRNAs and cancer. Semin Cancer Biol 2003, 13(4):253-258.
40. Vasudevan S, Tong Y, Steitz JA: Switching from repression to activation: microRNAs can up-regulate translation. Science 2007, 318(5858):1931-1934.

APPENDIX I: List of publications

This appendix details all of the publications that I was a part of that were either published, accepted, currently in submission, or prepared for submission. In total, 29 publications are listed below with four of them represented as chapters and an additional nine discussed in section 1.11. The remaining 16 publications are listed below with a brief description accompanying each publication.

Publications included as thesis chapters

1. Chari R, Coe BP, Wedseltoft C, Benetti M, Wilson IM, Vucic E, MacAulay C, Ng RT, Lam WL. (2008) SIGMA2: A system for the integrative genomic multi-dimensional analysis of cancer genomes, epigenomes, and transcriptomes. BMC Bioinformatics, 9(1):422, 1-12.

This publication is included in the thesis as chapter 2.

2. Chari R, Coe BP, Vucic EA, Lockwood WW, Lam WL. (2010) An integrative multi-dimensional genetic and epigenetic strategy to identify aberrant genes and pathways in cancer. BMC Systems Biology. Submitted.

This manuscript submitted for publication is included in the thesis as chapter 3.

3. Chari R, Lockwood WW, Coe BP, Soh J, MacAulay C, Lam S, Gazdar AF, Lam WL. (2010) UPD is a frequent mechanism of gene disruption in lung adenocarcinoma.

This manuscript in preparation is included in the thesis as chapter 4.

4. Chari R, Thu KL, Wilson IM, Lockwood WW, Lonergan KM, Coe BP, Malloff CA, Gazdar AF, Lam S, Garnis C, MacAulay CE, Alvarez CE, Lam WL. (2010) Integrating the multiple dimensions of genomic and epigenomic landscapes of cancer. Cancer and Metastasis Reviews, 29(1):73-93.

This publication is included in the thesis as chapter 5.

Publications discussed in section 1.11 (9 listed)

5. Chari R, Lockwood WW, Coe BP, Chu A, Macey D, Thomson A, Davies JJ, MacAulay C, Lam WL. (2006) SIGMA: A system for the integrative genomic microarray analysis of cancer genomes. BMC Genomics, 7(1):324, 1-11.

This publication is described in section 1.11.1.

6. Coe BP, Chari R, MacAulay C, Lam WL. (2010) FACADE: A fast and sensitive algorithm for the segmentation and calling of high resolution array CGH data. Nucleic Acids Research. Submitted.

This publication is described in section 1.11.1.

7. Lonergan KM, Chari R, deLeeuw RJ, Shadeo A, Chi B, Tsao M, Jones S, Marra M, Ng R, MacAulay C, Lam S, Lam WL. (2006) Identification of novel lung genes in bronchial epithelium by serial analysis of gene expression. American Journal of Respiratory Cell and Molecular Biology, 35(6):651-61.

This publication is described in section 1.11.2.

8. Chari R, Lonergan KM, Ng RT, MacAulay C, Lam S, Lam WL. (2007) Effect of active smoking on the bronchial epithelial transcriptome. BMC Genomics, 8(1):297, 1-13.

This publication is described in section 1.11.2.

9. Lee EHL*, Chari R*, Lam A, Ng RT, Yee J, English J, Evans KG, MacAulay C, Lam S, Lam WL. (2008) Disruption of the non-canonical Wnt pathway in lung squamous cell carcinoma. Clinical Medicine: Oncology, 2:169-179. *These authors contributed equally

This publication is described in section 1.11.3.

10. Lonergan KM, Chari R, Coe BP, Wilson IM, Tsao M-S, Ng RT, MacAulay C, Lam S, Lam WL. (2010) Transcriptome profiles of carcinoma-in-situ and invasive non-small cell lung cancer as revealed by SAGE. PLoS One, 5(2):e9162, 1-22.

This publication is described in section 1.11.3.

11. Chari R, Lonergan KM, Pikor LA, Coe BP, Zhu CQ, Chan THW, MacAulay C, Tsao M-S, Lam S, Ng RT, Lam WL. (2010) A sequence-based approach to identify reference genes for gene expression analysis. BMC Medical Genomics. Submitted.

This publication is described in section 1.11.3.

12. Lockwood WW, Chari R, Coe BP, Girard L, MacAulay C, Lam S, Gazdar AF, Minna JD, Lam WL. (2008) DNA amplification is a ubiquitous mechanism of oncogene activation in lung and other cancers. Oncogene, 27(33):4615-4624.

This publication is described in section 1.11.4.

13. Lockwood WW, Chari R, Coe BP, Thu KL, Garnis C, Campbell J, Williams AC, Hwang D, Zhu CQ, Yee J, English J, Tsao M-S, Gazdar AF, MacAulay C, Minna JD, Lam S, Lam WL. (2010) BRF2 is a lineage specific oncogene amplified early in squamous cell lung cancer development. PLoS Medicine. Submitted.

This publication is described in section 1.11.4.

Other publications not discussed in this thesis (16 listed)

Array comparative genomic hybridization and its application to multiple cancer types

14. Garnis C, Chari R, Buys TP, Zhang L, Ng RT, Rosin MP, Lam WL. (2009) Genomic imbalances in precancerous tissues signal oral cancer risk. Molecular Cancer, 8:50, 1-7.

176 The development of oral cancer is thought to occur through the progression of histopathological stages, from different stages of dyplasia (mild, moderate, and severe) to carcinoma in situ to invasive disease. Similar to many cancer types, early detection of this disease is critical for good prognosis. As such, it is important to be able to determine which cases will and will not progress at early stages of the disease such as mild dysplasia. This manuscript describes the use of array CGH as a tool to predict progression in genomes of mild dysplasia patients and it was shown that the level of genomic alteration had high concordance with disease progression.

15. Coe BP, Lockwood WW, Chari R, Lam WL. (2009) Comparative genomic hybridization on

BAC arrays. Methods in Molecular Biology, 556:7-19.

This publication is a chapter in the Methods in Molecular Biology textbook and describes the process of developing, using and analyzing data from bacterial artificial chromosome CGH arrays.

16. deLeeuw RJ, Zettl A, Klinker E, Haralambieva E, Trottier M, Chari R, Ge Y, Gascoyne RD, Chott A, Muller-Hermelink HK, Lam WL. (2007) Whole genome analysis and HLA haplotyping of enteropathy-type T-cell lymphoma reveals two distinct lymphoma subtypes. Gastroenterology, 132(5):1902-11.

Enteropathy-type T-cell lymphoma (ETL) is an aggressive non-Hodgkin lymphoma, and the genetic alterations underlying this disease were not well understood. In this publication, array CGH was applied to samples from patients with ETL. Based on the genetic alterations and HLA genotyping, it was found that two distinct subtypes of this disease exist, which was contrary to the clinical classification used at the time.

17. Buys TPH, Wilson IM, Coe BP, Lee EHL, Kennett JY, Lockwood WW, Tsui IFL, Shadeo A, Chari R, Garnis C, Lam WL. (2006) “Detailed Comparisons of Cancer Genomes” in Comparative Genomics: Fundamental and Applied Perspectives (Brown JR, ed.), CRC Press / Taylor & Francis, LLC, Boca Raton, FLA, pp. 245-259.

This chapter details the technologies used for cancer genome comparisons as well as the different types of comparisons currently undertaken in research, such as the comparison of cancer subtypes, clonal versus multiple primary tumors, cancer susceptibility and drug sensitivity.

18. Lockwood WW*, Chari R*, Chi B, Lam WL. (2006) Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. European Journal of Human Genetics, 14(2):139-48.

This publication is a review of literature describing the advances in array CGH technology and its application to many genetic diseases, including cancer.

19. Buys TPH, Wilson IM, Coe BP, Lockwood WW, Davies JJ, Chari R, DeLeeuw RJ, Shadeo A, MacAulay C, Lam WL. (2005) “Key Features of BAC Array Production and Usage” in DNA Microarrays (Methods Express Series) (Schena M, ed), Scion Publishing, Ltd., Bloxham, pp. 115-145.

This chapter describes the production and use of bacterial artificial chromosome microarray-based CGH and the analysis of the data generated by this platform. In addition, protocols for array CGH experiments are also provided.

Gene expression based studies

20. Coe BP, Chari R, Lockwood WW, Lam WL. (2008) Evolving strategies for global gene expression analysis of cancer. Journal of Cellular Physiology, 217(3):590-597.

This publication is a review of literature describing the advancement in technology to analyze gene expression in cancer and the movement of the field towards integrative genomics.

21. Shadeo A, Chari R, Lonergan KM, Pusic A, Miller D, Ehlen T, Van Niekerk D, Matisic J, Richards-Kortum R, Follen M, Guillaud M, Lam WL, MacAulay C. (2008) Up regulation in gene expression of chromatin remodelling factors in cervical intraepithelial neoplasia. BMC Genomics, 9(1):64, 1-14.

Cervical cancer is a major problem in developing countries. Similar to oral cancer, it is thought to go through a progression of histopathological stages, and thus identifying markers at stages where intervention is possible is crucial to the prognosis of patients with this disease. In this publication, a comparison of normal cervical tissue with cervical intraepithelial neoplasia (CIN) was performed to identify genes upregulated in CIN. It was found that genes involved in chromatin remodelling were upregulated in CIN.

22. Shadeo A, Chari R, Vatcher G, Campbell J, Lonergan KM, Matisic J, van Niekerk D, Ehlen T, Miller D, Follen M, Lam WL, MacAulay C. (2007) Comprehensive serial analysis of gene expression of the cervical transcriptome. BMC Genomics, 8(1):142, 1-11.

This publication describes the transcriptome of normal cervix tissue using serial analysis of gene expression.

Integrative analysis of multiple DNA and RNA dimensions

23. Wilson IM, Vucic EA, Chari R, Zhang Y-A, Starczynowski DT, Lonergan KM, Enfield KSS, Buys TPH, Yee J, Laird-Offringa I, Karsan A, Liu P, You M, Anderson M, MacAulay C, Lam S, Gazdar AF, Lam WL. (2010) EYA4 is a non-small cell lung cancer tumor suppressor located in the susceptibility locus on chromosome 6q.

Chromosome arm 6q has been shown to harbor a region associated with lung cancer susceptibility based on the analysis of familial lung cancer datasets. Moreover, this specific region is also frequently lost in sporadic, non-familial lung cancers. Hence, many studies have been undertaken to identify the gene(s) in this region which may be critical to lung tumorigenesis. In this manuscript, we detail the use of a genetic and epigenetic approach to identify key genes in this region which are frequently deregulated by concerted genetic and epigenetic alteration. This led to the identification of the gene EYA4, which we further demonstrate to have tumor suppressive activity.

24. Soh J, Okumura N, Lockwood WW, Yamamoto H, Shigematsu H, Zhang W, Chari R, Shames D, Tang X, MacAulay C, Varella-Garcia M, Vooder T, Wistuba II, Lam S, Brekken R, Toyooka S, Minna JD, Lam WL, Gazdar AF. (2009) Oncogene mutations, copy number gains and mutant allele specific imbalance (MASI) frequently occur together in tumor cells. PLoS One, 4(10):e7464, 1-13.

Somatic mutation of both oncogenes and tumor suppressor genes has been shown to be important in cancer. While tumor suppressor genes are typically recessive and require both alleles to harbor mutation, activating mutations of oncogenes generally require only one of the alleles to be mutated. However, it has been shown that for specific activating mutations, multiple mutated copies can exist. In this study, using some of the most commonly mutated genes in multiple cancer types, the prevalence of this phenomenon was assessed in a set of cell lines and tumors representing lung, pancreatic and colorectal cancers. It was found that for the EGFR locus, mutation of the gene is accompanied by copy number increase with preferential gain of the mutated copy, whereas for KRAS, the event observed is acquired uniparental disomy, where the wild type copy is lost and the mutant copy is duplicated.

25. Campbell JM, Lockwood WW, Buys TP, Chari R, Coe BP, Lam S, Lam WL. (2008) Integrative genomic and gene expression analysis of chromosome 7 identified novel oncogene loci in non-small cell lung cancer. Genome, 51(12):1032-1039.

Genomic alteration of chromosome 7 is a frequent event in non-small cell lung cancer. While the most commonly known oncogenes on this chromosome include EGFR, MET and BRAF, there are likely other candidate genes which may have a role in lung tumorigenesis. In this manuscript, utilizing an integrative genetic and gene expression approach, novel oncogene loci are identified.

26. Buys TPH, Chari R, Lee E, Zhang M, MacAulay C, Lam S, Lam WL, Ling V. (2007) Genetic changes in the evolution of multidrug resistance for cultured human ovarian cancer cells. Genes, Chromosomes and Cancer, 46(12):1069-79.

Drug resistance is a common problem for cancer patients treated with chemotherapeutics. One mechanism of resistance is the multi-drug resistance phenotype, which is often associated with the activity of ATP-binding cassette (ABC) transporters. In this study, an ovarian cancer cell line was exposed to increasing concentrations of vincristine to derive drug resistant derivatives, and the genetic and gene expression profiles of these derivatives were compared with those of the original cancer cell line. It was found that while the initial resistant derivatives (lines exposed to lower concentrations of drug) harbored copy number and gene expression increases of ABCC1 and ABCC6, later resistant derivatives (lines exposed to higher concentrations of drug) did not have the increase in ABCC1 and ABCC6 but had an increase of ABCB, suggesting that the drug resistance phenotype may be a dynamic process.

27. Coe BP, Lockwood WW, Girard L, Chari R, Minna JD, MacAulay C, Lam S, Gazdar AF, Lam WL. (2006) Differential regulation of cell cycle pathways in small cell and non-small cell lung cancer. British Journal of Cancer, (12):1927-35.

Small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC) are the two major cell types of lung cancer. While they can be distinguished pathologically, the molecular basis of these two cancer types is not well understood. In this study, a whole genome integrative genetic and gene expression comparison of NSCLC and SCLC was performed and differential regulation of cell cycle pathways was identified. Specifically, NSCLC is primarily deregulated at the receptor level while SCLC is primarily deregulated at the nuclear transcription factor level.

Software, analysis approaches, and databases

28. Tsui IFL, Chari R, Buys TPH, Lam WL. (2007) Public databases and software for the pathway analysis of cancer genomes. Cancer Informatics, 3:389-407.

This manuscript describes the currently available computational resources for the analysis of pathways in cancer. Specifically, it covers the use of these resources to analyze results from high throughput studies examining genetic, epigenetic or gene expression alterations.

29. Chari R*, Lockwood WW*, Lam WL. (2006) Computational methods for the analysis of array comparative genomic hybridization. Cancer Informatics, (2):48-58.

This manuscript describes the most commonly used analysis strategies for array CGH data and compares and contrasts these approaches. In addition, the specific features of currently available software suites are also compared.

APPENDIX II: Description of cell lines

Sample ER Status PR Status HER2 Status TP53 Mutation Status**

HCC38 - - +

HCC1008 - - N/A

HCC1143 - - +

HCC1395 + - +

HCC1599 - - +

HCC1937 - - + (heterozygous mutation)

HCC2218 - + + -

BT474 + - + +

MCF7 + + +

MCF10A N/A N/A N/A N/A

** mutation status obtained from the Sanger Cancer Cell Line Project (http://www.sanger.ac.uk/genetics/CGP/CellLines/)

APPENDIX III: Sources of data

Sample | DNA Copy Number - Array CGH | Allelic Status - Affymetrix SNP 500K | DNA Methylation - Illumina Infinium | Gene expression - Affymetrix U133 Plus 2.0 (NCBI GEO Accession number provided)
HCC38 | GSE21540 | https://cabig.nci.nih.gov | new data for this publication | new data for this publication | (GSE17768) (GSE21347) (GSE17769)
HCC1008 | GSE21540 | https://cabig.nci.nih.gov | new data for this publication | new data for this publication | (GSE17768) (GSE21347) (GSE17769)
HCC1143 | GSE21540 | https://cabig.nci.nih.gov | new data for this publication | new data for this publication | (GSE17768) (GSE21347) (GSE17769)
HCC1395 | GSE21540 | https://cabig.nci.nih.gov | new data for this publication | new data for this publication | (GSE17768) (GSE21347) (GSE17769)
HCC1599 | GSE21540 | https://cabig.nci.nih.gov | new data for this publication | new data for this publication | (GSE17768) (GSE21347) (GSE17769)
HCC1937 | GSE21540 | https://cabig.nci.nih.gov | new data for this publication | new data for this publication | (GSE17768) (GSE21347) (GSE17769)
HCC2218 | GSE21540 | https://cabig.nci.nih.gov | new data for this publication | new data for this publication | (GSE17768) (GSE21347) (GSE17769)
BT474 | GSE21540 | https://cabig.nci.nih.gov | new data for this publication | https://cabig.nci.nih.gov | (GSE17768) (GSE21347) (GSE17769)
MCF7 | GSE21540 | https://cabig.nci.nih.gov | new data for this publication | https://cabig.nci.nih.gov | (GSE17768) (GSE21347) (GSE17769)
MCF10A | N/A | N/A | new data for this publication | GSM254525 | (GSE17768) (GSE17769)

APPENDIX IV: MCD strategy and Kaplan-Meier analysis of TUSC3

APPENDIX V: Kaplan-Meier and Oncomine expression analysis of frequent MCD genes

Symbol | (+) | (-) | Total | Survival Associated* | **Status in Tumors (p-value)
SH3TC1 | 0 | 6 | 6 | No | -
CCNA1 | 0 | 5 | 5 | Yes | -
COL7A1 | 0 | 5 | 5 | No | -
KCTD4 | 0 | 5 | 5 | N/A | Not tested
LMCD1 | 5 | 0 | 5 | Yes | O3(6.3E-4), O5(3.1E-10)
LYAR | 0 | 5 | 5 | Yes | U5(3.8E-4), O3(2.6E-4)
MTMR9 | 0 | 5 | 5 | No | U3(1.8E-7)
SYT8 | 0 | 5 | 5 | N/A | Not tested
TUSC3 | 0 | 5 | 5 | Yes | U5(7.4E-5)
ASAM | 0 | 4 | 4 | N/A | Not tested
B3GALNT1 | 4 | 0 | 4 | No | O3(1.1E-7)
COL17A1 | 0 | 4 | 4 | No | U1(6.8E-5), U3(1.4E-8)
ELK3 | 0 | 4 | 4 | Yes | -
FGFR1 | 0 | 4 | 4 | Yes | U1(2.6E-8), U3(4.2E-6), U4(1.3E-7)
KRT17 | 0 | 4 | 4 | No | U1(2.3E-11), U2(1.1E-7), U3(3.9E-7)
LCP1 | 0 | 4 | 4 | Yes | -
OSBPL5 | 0 | 4 | 4 | N/A | Not tested
PSD3 | 0 | 4 | 4 | Yes | -
SFXN3 | 0 | 4 | 4 | N/A | Not tested
SH3BGRL3 | 0 | 4 | 4 | No | O2(2.8E-4)
SNRPN | 0 | 4 | 4 | No | U3(5.4E-10), U5(2.7E-5)
TNFRSF10D | 0 | 4 | 4 | No | O5(5.2E-4), U3(9.7E-4)
TNS4 | 0 | 4 | 4 | N/A | Not tested

*Survival associated if gene expression was significantly associated with survival in at least one of the two datasets tested (based on p < 0.05 using the log rank test). **U=underexpressed between tumor and normal, O=overexpressed between tumor and normal in the particular dataset; the numbers 1-5 indicate the reports from which the data originated, 1=[1], 2=[2], 3=[3], 4=[4], 5=[5], 6=[6]; "-" indicates gene was either not represented or not statistically differentially expressed based on group-wise analysis. The (+) and (-) columns give the number of samples in our dataset which met the following criteria: (+) represents two-fold overexpression, copy number gain, hypomethylation and allelic imbalance in the same sample; (-) represents two-fold underexpression, copy number loss, hypermethylation and LOH in the same sample.
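The (+) and (-) counts described in the footnote can be illustrated with a minimal sketch. It assumes that a sample counts only when all four dimensions agree in that sample, which is one reading of the footnote wording. The dictionary keys, the call labels and the example calls are hypothetical and are not taken from the thesis data.

```python
from typing import Dict, List, Tuple

# Calls that make up a concerted (+) or (-) event, following the footnote wording.
# The keys and labels are hypothetical names chosen for this sketch.
PLUS_CALLS = {"expression": "over", "copy_number": "gain",
              "methylation": "hypo", "allelic": "imbalance"}
MINUS_CALLS = {"expression": "under", "copy_number": "loss",
               "methylation": "hyper", "allelic": "LOH"}


def is_concerted(sample: Dict[str, str], pattern: Dict[str, str]) -> bool:
    """True if every dimension of the sample matches the given pattern."""
    return all(sample.get(dim) == call for dim, call in pattern.items())


def count_concerted(samples: List[Dict[str, str]]) -> Tuple[int, int, int]:
    """Return the (+) count, the (-) count and their total for one gene."""
    plus = sum(is_concerted(s, PLUS_CALLS) for s in samples)
    minus = sum(is_concerted(s, MINUS_CALLS) for s in samples)
    return plus, minus, plus + minus


# Made-up calls for a single gene across three samples (assumption, not real data).
calls_for_gene = [
    dict(MINUS_CALLS),                             # all four dimensions agree: counts as (-)
    dict(MINUS_CALLS),                             # counts as (-)
    {**MINUS_CALLS, "methylation": "unchanged"},   # one dimension differs: not counted
]

print(count_concerted(calls_for_gene))  # (0, 2, 2)
```

Under this reading, the Total column is simply the sum of the (+) and (-) counts, which matches every row of the table above.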

REFERENCES

1. Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS et al: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A 2001, 98(19):10869-10874.
2. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA et al: Molecular portraits of human breast tumours. Nature 2000, 406(6797):747-752.
3. Richardson AL, Wang ZC, De Nicolo A, Lu X, Brown M, Miron A, Liao X, Iglehart JD, Livingston DM, Ganesan S: X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell 2006, 9(2):121-132.
4. Radvanyi L, Singh-Sandhu D, Gallichan S, Lovitt C, Pedyczak A, Mallo G, Gish K, Kwok K, Hanna W, Zubovits J et al: The gene associated with trichorhinophalangeal syndrome in humans is overexpressed in breast cancer. Proc Natl Acad Sci U S A 2005, 102(31):11005-11010.
5. Finak G, Bertos N, Pepin F, Sadekova S, Souleimanova M, Zhao H, Chen H, Omeroglu G, Meterissian S, Omeroglu A et al: Stromal gene expression predicts clinical outcome in breast cancer. Nat Med 2008, 14(5):518-527.
6. Karnoub AE, Dash AB, Vo AP, Sullivan A, Brooks MW, Bell GW, Richardson AL, Polyak K, Tubo R, Weinberg RA: Mesenchymal stem cells within tumour stroma promote breast cancer metastasis. Nature 2007, 449(7162):557-563.

APPENDIX VI: Summary of Kaplan-Meier survival analysis

GeneSymbol | Alternative Names | van de Vijver - P-value | Sorlie - P-value
SH3TC1 | FLJ20356 | Fail | N/A
CCNA1 | | 0.01484628 | N/A
COL7A1 | | Fail | N/A
KCTD4 | | N/A | N/A
LMCD1 | | Fail | 0.00261366
LYAR | FLJ20425 | 0.00551113 | Fail
MTMR9 | DKFZP434K171 | Fail | N/A
SYT8 | DKFZp434K0322 | N/A | N/A
TUSC3 | N33 | 0.01696356 | Fail
ASAM | | N/A | N/A
B3GALNT1 | B3GALT3 | Fail | N/A
COL17A1 | | Fail | Fail
ELK3 | | 0.04816902 | N/A
FGFR1 | | Fail | 0.0147898
KRT17 | | Fail | Fail
LCP1 | | 0.01132949 | 0.04024164
OSBPL5 | | N/A | N/A
PSD3 | DKFZp761K1423 | 0.00205916 | N/A
SFXN3 | | N/A | N/A
SH3BGRL3 | | N/A | Fail
SNRPN | | Fail | Fail
TNFRSF10D | | Fail | N/A
TNS4 | | N/A | N/A

Fail = p-value > 0.05; N/A = not represented on array platform
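The "Survival Associated" column of Appendix V appears to follow from these two p-value columns by the rule stated in its footnote (significant at p < 0.05 in at least one of the two datasets). The sketch below shows that rule; the function name and the handling of "Fail" and "N/A" cells as plain strings are assumptions made for illustration, not the analysis code used in the thesis.

```python
ALPHA = 0.05  # significance threshold stated in the Appendix V footnote


def survival_associated(*cells: str) -> str:
    """Combine per-dataset cells ("Fail", "N/A" or a p-value string) into Yes / No / N/A."""
    # "N/A" means the gene was not represented on that array platform.
    tested = [c for c in cells if c != "N/A"]
    if not tested:
        return "N/A"  # not testable in either dataset
    for cell in tested:
        # "Fail" means tested with p > 0.05; any other cell is a numeric p-value.
        if cell != "Fail" and float(cell) < ALPHA:
            return "Yes"
    return "No"


# Cells copied from the table above: (van de Vijver, Sorlie)
print(survival_associated("Fail", "0.00261366"))  # LMCD1 -> Yes
print(survival_associated("Fail", "N/A"))         # SH3TC1 -> No
print(survival_associated("N/A", "N/A"))          # KCTD4 -> N/A
```

Applied row by row, this rule reproduces the Yes / No / N/A entries listed for the same genes in Appendix V.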

APPENDIX VII: Copy of UBC Research Ethics Board certificate of approval


University of British Columbia - British Columbia Cancer Agency Research Ethics Board (UBC BCCA REB)

UBC BCCA Research Ethics Board, Fairmont Medical Building (6th Floor), 614 - 750 West Broadway, Vancouver, BC V5Z 1H5

Tel: (604) 877-6284 Fax: (604) 708-2132 E-mail: [email protected]

Website: http://www.bccancer.bc.ca > Research Ethics

RISe: http://rise.ubc.ca

Certificate of Expedited Approval: Annual Renewal

PRINCIPAL INVESTIGATOR: Wan Lam

INSTITUTION / DEPARTMENT: BCCA/BCCA/Cancer Genetics & Development (BCCA)

REB NUMBER: H08-01392

INSTITUTION(S) WHERE RESEARCH WILL BE CARRIED OUT: Institution: BC Cancer Agency; Site: Vancouver BCCA

Other locations where the research will be conducted: N/A

PRINCIPAL INVESTIGATOR FOR EACH ADDITIONAL PARTICIPATING BCCA CENTRE: Vancouver: Wan Lam Vancouver Island: N/A Fraser Valley: N/A Southern Interior: N/A Abbotsford Centre: N/A

SPONSORING AGENCIES AND COORDINATING GROUPS: Canadian Institutes of Health Research (CIHR)

PROJECT TITLE: Development of a multi-spectral platform for integrated analysis of clinical and research samples.

PAA#: H08-01392-A003

APPROVAL DATE: August 4, 2009

EXPIRY DATE OF THIS APPROVAL: August 4, 2010

CERTIFICATION:

1. The membership of the UBC BCCA REB complies with the membership requirements for research ethics boards defined in Division 5 of the Food and Drug Regulations of Canada.

2. The UBC BCCA REB carries out its functions in a manner fully consistent with Good Clinical Practices.

3. The UBC BCCA REB has reviewed and approved the research project named on this Certificate of Approval including any associated consent form and taken the action noted above. This research project is to be conducted by the provincial investigator named above. This review and the associated minutes of the UBC BCCA REB have been documented electronically and in writing.

The UBC BCCA Research Ethics Board has reviewed the documentation for the above named project. The research study as presented in documentation, was found to be acceptable on ethical grounds for research involving human subjects and was approved for renewal by the UBC BCCA REB.

UBC BCCA Ethics Board Approval of the above has been verified by one of the following: Dr. George Browman, Chair Dr. Lynne Nakashima, Second Vice-Chair

If you have any questions, please call:

Bonnie Shields, Manager, BCCA Research Ethics Board: 604-877-6284 or e-mail: [email protected]

Dr. George Browman, Chair: 604-877-6284 or e-mail: [email protected]

Dr. Lynne Nakashima, Second Vice-Chair: 604-707-5989 or e-mail: [email protected]
