Thesis for the degree Doctor of Philosophy

By Tal Shay

From DNA Copy Number to Expression: Local Aberrations, Trisomies and Monosomies

Regular Format

Advisor: Eytan Domany

September 2008

Submitted to the Scientific Council of the Weizmann Institute of Science, Rehovot, Israel

Summary

The goal of my research was to study the effect of changes in DNA copy number on gene expression. A change in DNA copy number can be local, encompassing a few genes, or on the level of an entire chromosome, such as trisomy and monosomy. The main dataset I studied, in the framework of a collaboration, was of Glioblastoma. In addition, I used public datasets of other cancers and of Down’s Syndrome.

The molecular basis of expression changes in Glioblastoma

Glioblastoma is the most common and most aggressive brain tumor in adults. In collaboration with Prof. Hegi (CHUV, Switzerland), we analyzed a rich Glioblastoma dataset that included clinical data, DNA copy number (measured by array comparative genomic hybridization), and gene expression profiles. We studied the correspondence between DNA copy number and gene expression at the level of chromosomal arms and of local genomic aberrations. We identified known amplifications and over-expression of oncogenes, and known deletions and under-expression of tumor suppressor genes. This information was used to map alterations in regulatory pathways that are disrupted in Glioblastoma, and to characterize groups of tumors that have no measurable alteration in these pathways.

Identifying biologically significant local aberrations

Many types of tumors lose and gain chromosomes and shorter genetic regions. It is reasonable to assume that a genetic region that is altered in many tumors, or whose copy number change is large, is of clinical importance, and not a by-product of genetic instability. We developed a new method that defines and ranks local genomic copy number aberrations by formalizing these insights. The method computes a score for each aberration from the percentage of patients carrying it, its length and its amplitude. The significance of the score is determined from a distribution generated by permutations. This method detects genomic locations whose copy number is significantly altered, and generates a genetic profile of local aberrations for each tumor. This profile, combined with the status of the chromosomal arms (gain/loss), forms a succinct genomic signature for each tumor. Unsupervised clustering of the tumors in the space spanned by the signatures allows the discovery of new cancer subtypes. We applied this method to three types of brain tumors: Glioblastoma, Medulloblastoma and Neuroblastoma, and discovered a new subtype of Medulloblastoma, characterized by many chromosomal aberrations.

Understanding the effect of trisomy and monosomy on gene expression

Trisomy and monosomy of a chromosome are expected to affect the expression of the genes located on that chromosome. Analysis of data from several cancers showed that not all the genes on the chromosome whose copy number changed are affected by the change in copy number. The affected genes show a wide range of change in expression level, with varying penetrance. In particular, (1) the effect of trisomy is more conserved among individuals than the effect of monosomy, and (2) there is a positive correlation between the expression level of a gene when two copies are present and the magnitude of the change in its expression when only one copy or three copies are present.

Abstract

The goal of my PhD research was to study the effect of DNA copy number changes on gene expression. DNA copy number aberrations may be local, encompassing several genes, or on the level of an entire chromosome, such as trisomy and monosomy. The main dataset I studied was of Glioblastoma, obtained in the framework of a collaboration, but I also worked with public datasets of other cancers and of Down’s Syndrome.

The molecular basis of expression changes in Glioblastoma

Glioblastoma is the most common and aggressive type of primary brain tumor in adults. In collaboration with Prof. Hegi (CHUV, Switzerland), we analyzed a rich Glioblastoma dataset including clinical information, DNA copy number (array CGH) and expression profiles. We explored the correlation between DNA copy number and gene expression at the level of chromosomal arms and local genomic aberrations. We detected known amplification and over-expression of oncogenes, as well as deletion and down-regulation of tumor suppressor genes. We exploited that information to map alterations of pathways that are known to be disrupted in Glioblastoma, and tried to characterize samples that have no known alteration in any of the studied pathways.

Identifying local DNA aberrations of biological significance

Many types of tumors exhibit chromosomal losses or gains and local amplifications and deletions. A region that is aberrant in many tumors, or whose copy number change is stronger, is more likely to be clinically relevant, and not just a by-product of genetic instability. We developed a novel method that defines and prioritizes aberrations by formalizing these intuitions. The method scores each aberration by the fraction of patients harboring it, its length and its amplitude, and assesses the significance of the score by comparing it to a null distribution obtained by permutations. This approach detects genetic locations that are significantly aberrant, generating a ‘genomic aberration profile’ for each sample.
The ‘genomic aberration profile’ is then combined with chromosomal arm status (gain/loss) to define a succinct genomic signature for each tumor. Unsupervised clustering of the samples based on these genomic signatures can reveal novel tumor subtypes. This approach was applied to datasets from three types of brain tumors: Glioblastoma, Medulloblastoma and Neuroblastoma, and identified a new subtype in Medulloblastoma, characterized by many chromosomal aberrations.

Elucidating the transcriptional effect of monosomy and trisomy

Trisomy and monosomy are expected to impact the expression of genes that are located on the affected chromosome. Analysis of several cancer datasets revealed that not all the genes on the aberrant chromosome are affected by the change of copy number. Affected genes exhibit a wide range of expression changes with varying penetrance. Specifically, (1) the effect of trisomy is much more conserved among individuals than the effect of monosomy, and (2) the expression level of a gene in the diploid is significantly correlated with the level of change between the diploid and the trisomy or monosomy.

1 INTRODUCTION
1.1 High throughput measurement of DNA copy number
1.2 High throughput measurement of gene expression
1.3 Gene expression analysis
1.4 Cancer
1.5 Aneuploidy in cancer
1.6 Relationships between DNA copy number and gene expression
1.7 Thesis structure
1.7.1 The molecular basis of expression changes in Glioblastoma
1.7.2 Development of a method for identifying local DNA aberrations of biological significance
1.7.3 Characterization and understanding of the transcriptional effects of monosomy and trisomy
1.7.4 Deciphering transcriptional responses in temporal processes

2 THE MOLECULAR BASIS OF EXPRESSION CHANGES IN GLIOBLASTOMA
2.1 Background
2.1.1 Glioblastoma Multiforme
2.1.2 The TP53 pathway
2.1.3 The RB pathway
2.1.4 The EGFR pathway
2.1.5 Pathways interactions
2.1.6 Genetic abnormalities and expression patterns as predictors of prognosis
2.1.7 Clinical trial
2.2 Data
2.3 Data analysis
2.3.1 Two-way clustering analysis
2.3.2 Survival analysis
2.3.3 Chromosomal status
2.3.4 Correlation between expression and copy number
2.3.5 Detecting chromosomal instabilities
2.3.6 Effect of the aberrations on expression
2.3.7 Amplifications
2.3.8 Deletions
2.3.9 Pathway analysis
2.4 Conclusion

3 IDENTIFYING LOCAL DNA ABERRATIONS OF BIOLOGICAL SIGNIFICANCE
3.1 Abstract
3.2 Background
3.2.1 Cancer is characterized by DNA copy number aberrations
3.2.2 Array CGH as a tool to measure DNA copy number aberrations
3.2.3 Existing methods for analyzing array CGH data
3.3 Results
3.3.1 Algorithm
3.3.2 Parameters space
3.3.3 Applications
3.4 Discussion
3.5 Conclusion
3.6 Methods
3.6.1 Datasets
3.6.2 Aberrations’ annotation
3.6.3 Recognizing possible inaccurate genomic locations
3.6.4 FDR

4 ELUCIDATING THE TRANSCRIPTIONAL EFFECT OF MONOSOMY AND TRISOMY
4.1 Aim
4.2 Introduction
4.2.1 Aneuploidy in cancer
4.2.2 Meiotic aneuploidy
4.2.3 Mechanisms causing aneuploidy
4.2.4 Aneuploidy and gene expression
4.2.5 Factors that may affect the response of gene expression to copy number
4.3 Methods
4.3.1 Data
4.3.2 Genes’ information sources
4.3.3 Stepwise linear regression
4.4 Results
4.4.1 Whole chromosome gain and loss are evident in expression
4.4.2 The effect of trisomy on expression is similar in different patients
4.4.3 The effect of monosomy on expression is different in different patients
4.4.4 Understanding trisomy and monosomy expression signatures
4.5 Discussion

5 DISCUSSION

6 LIST OF PUBLICATIONS

7 REFERENCES

1 Introduction

Chromatin is the complex of DNA and proteins in which the genetic material is packaged inside the cells of organisms with nuclei. Chromatin structure is dynamic and exerts profound control over gene expression and other fundamental cellular processes. Changes in its structure can be passed on to the next generation, independent of the DNA sequence itself [1]. Chromatin consists of repeating subunits called nucleosomes, each comprising 147 bp of DNA wrapped in 1.7 superhelical turns around a histone octamer. Nucleosomes are further compacted into a 30-nm diameter fiber, which, in turn, is further compacted into structures not yet fully understood [2 and references therein]. Changes in chromatin structure are essential for the access of RNA polymerase to initiate transcription. Thus, the structure of chromatin in the cell affects the transcriptome of the cell. Many other factors affect gene expression in the cell, including transcription factor concentrations and activity levels, RNA binding proteins, and more.

In cancer cells and in several other diseases, DNA copy number deviates from normal. These changes also affect the transcription of genes that reside in the altered region, as well as of other genes. Recent technological advances enable measurement of DNA copy number at high resolution, and of the expression level of every human gene. High throughput experiments create an information overload. It is doubtful whether investigators producing high throughput data are able to extract all the information buried in it. It was already stated five years ago that data analysis, not data production, is becoming the bottleneck in gene expression research [3]. We utilize datasets of DNA copy number and gene expression in order to study the relationships between them, at the chromosomal level and at the local level.
1.1 High throughput measurement of DNA copy number

Comparative genomic hybridization (CGH) is a molecular cytogenetic technique that allows the entire genome to be scanned, in a single step, for copy-number aberrations in chromosomal material. In standard CGH procedures, genomic DNAs isolated from test and reference samples are labeled with red and green fluorescent dyes, respectively. The labeled DNAs are subjected to competitive hybridization to normal metaphase chromosomes. The ratios of red and green fluorescent signals in paired samples are measured along the longitudinal axis of each chromosome. Chromosomal regions involved in deletion or amplification in test DNA appear green or red,
respectively, but chromosomal regions that are equally represented in test and reference DNAs appear yellow [4].

Array Comparative Genomic Hybridization (aCGH) [5, 6] is a procedure that provides genome-wide DNA copy number measurement along genomes of mammalian complexity. The test and reference samples are competitively hybridized to an array with genomic targets. If the control is diploid, a higher signal of the test sample is indicative of amplification, and a higher control signal indicates deletion. Single-copy decreases and increases from diploid are reliably detected [5]. Several types of genomic targets can be printed on the array. For example, Bacterial Artificial Chromosomes (BACs) are fairly widely used: these markers have a typical length of 150 kb, and about 2,000-8,000 BACs are used to provide coverage of the full genome. In addition, cDNA probes are also used [7], as well as oligonucleotides [8, 9]. Today there are aCGH platforms with very high resolution; for example, the NimbleGen aCGH has 385,000 probes on a single glass slide (www.nimblegen.com).

The reference used for CGH is a normal human tissue. It is commonly assumed that cells in normal tissues are diploid. However, large-scale copy number polymorphisms (about 100 kb and greater) contribute substantially to genomic variation between normal humans [10].

SNP chips that were originally developed for genotyping can also be used to measure DNA copy number. SNP chips do not require a control sample, and have a very high resolution of 500,000 features (www.affymetrix.com).

1.2 High throughput measurement of gene expression

Today it is possible to simultaneously measure the expression level of each known human gene in a given sample, by microarrays or by sequencing. Microarray technology has become a standard method for gene expression profiling. There are two main types of arrays: spotted arrays [11] and oligo arrays [12].
The spotted arrays, also known as two-channel cDNA arrays, are usually printed in house and are not available commercially. Each spot on the array is printed from a cloned long DNA sequence. As in aCGH, the mRNAs of two samples are labeled in different colors and hybridized to the array. The resulting data are the ratios between the expression levels of each gene in the two samples. In oligo arrays, the probe is a short DNA sequence that is printed directly on the slide. Typically, each sequence is printed in many copies within a predefined region, termed a feature, and each gene is represented
by several different features, or probes. Only one sample is hybridized, and the result is an absolute estimate of the expression level. Today, all known human genes can be profiled using a single array. New Affymetrix arrays can even measure the expression level of each human exon.

After hybridization the arrays are washed and scanned. Image analysis programs produce a number per spot or feature. In oligo arrays, all features corresponding to a single gene must be combined. Then, all the arrays of the experiment should be normalized together. There are several standard summarization and normalization methods for Affymetrix chips, including MAS5 [13], Robust Multiarray Average (RMA) [14], and DNA-Chip Analyzer (dChip) [15].

Sequencing technologies are also used to measure gene expression. In Serial Analysis of Gene Expression (SAGE) [16], unique sequence tags (9-10 bp in length) are isolated from individual mRNAs and concatenated serially into long DNA molecules that are then sequenced. New, ultra-high-throughput sequencing techniques enable sequencing of all mRNA in the cell [17]. In this work, we mainly studied Affymetrix chips, normalized by MAS5 or RMA.

1.3 Gene expression analysis

There is a wide range of methods for microarray data analysis, both supervised and unsupervised. Supervised methods search for features that differ between predefined groups, such as normal and tumor samples, different tumor types, or different treatments, or for features that are related to given variables, such as phenotypes of migration, proliferation or patients’ survival. Those methods range from simple fold-change to more statistically meaningful differential expression methods, such as the t-test, Significance Analysis of Microarrays (SAM) [18] and Analysis of Variance (ANOVA), as well as correlation, linear regression and survival analysis. When applied to microarrays, those methods produce a p-value for each gene, raising the multiple comparisons problem.
Thus, those methods must be followed by a multiple comparisons correction, such as the Bonferroni correction or False Discovery Rate (FDR) [19] estimation. In addition, it is often useful to reduce the number of comparisons, either by filtering the genes, and/or by clustering the genes (see below) and applying the statistical methods to clusters instead of single genes. Unsupervised methods are used for exploratory analysis, and in order to find structures within the data that cannot be predicted in advance, such as new subgroups.
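
The two multiple comparisons corrections mentioned in this section can be illustrated with a minimal sketch. It assumes that the FDR estimation follows the Benjamini-Hochberg step-up procedure, which the text does not name explicitly; the function names and the example p-values are hypothetical and serve only as an illustration.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: this is the FDR procedure meant in the text.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from the smallest to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values, e.g. from a t-test per probe set.
gene_p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
print(bonferroni(gene_p_values))
print(benjamini_hochberg(gene_p_values))
```

Genes whose adjusted value falls below the chosen level (for example 0.05) would be reported. Bonferroni controls the chance of any false positive and is therefore stricter than the FDR approach, which controls the expected fraction of false positives among the reported genes.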

Unsupervised methods that are applied to gene expression data include singular value decomposition [20], clustering [21, 22, for review see 23], two-way clustering [24, 25], and sorting [26].

1.4 Cancer

Hanahan and Weinberg [27] list six hallmarks of cancer: (1) self-sufficiency in growth signals, (2) insensitivity to growth-inhibitory (antigrowth) signals, (3) evasion of programmed cell death (apoptosis), (4) limitless replicative potential, (5) sustained angiogenesis, and (6) tissue invasion and metastasis. Each of these hallmarks can be achieved in various ways, and in any order. The acquisition of these six different changes is challenging, as mutations are very rare. However, this process may be facilitated by genome instability, e.g. by deletion and mutation of TP53, causing silencing of the p53 DNA damage signaling pathway. Genome instability is not considered a hallmark, but an enabling change, allowing evolving populations of premalignant cells to reach those six biological end points [27].

The hypothesis of tumor evolution, formulated by Nowell [28], states that a tumor is initiated from a single, previously normal cell. An induced change in this cell confers a selective growth advantage over adjacent normal cells. Thus, this cell replicates more, and its descendants carry the same induced change. From time to time, as a result of genetic instability in the expanding tumor population, mutant cells are produced. Most of them are eliminated, but occasionally one has an additional selective advantage with respect to the original tumor cells as well as normal cells, and this mutant becomes the precursor of a new predominant subpopulation. Over time, there is sequential selection, by an evolutionary process, of cells which are increasingly abnormal. Because this sequence is not completely random, certain similarities are acquired by different tumors as they progress [28], analogous to convergent evolution.
Advanced tumors of the same type are very different from each other karyotypically and biologically. Hence, each patient's cancer requires individual, specific therapy, and even this may be thwarted by the emergence of a variant resistant to the treatment. Understanding and controlling the evolutionary process in tumors may be essential for clinical treatment [28].

1.5 Aneuploidy in cancer

Normal human cells contain 46 chromosomes: two of each autosomal chromosome, and two sex chromosomes, either XX or XY. However, most cancers contain cells
that not only possess an abnormal number of chromosomes (often between 60 and 90) but that also differ from each other in the number of chromosomes they contain. Furthermore, these chromosomes commonly have structural aberrations that are vanishingly rare in normal cells: inversions, deletions, duplications, and translocations. These numerical and structural abnormalities define aneuploidy [29 and references therein].

Genes from all bands of the human chromosomes are involved in some commonly occurring tumor-associated aberrations [30]. Each solid tumor type displays one of several characteristic combinations of chromosomal gains and losses. There is considerable overlap between the imbalance profiles of the different tumor types, and typically there are more losses than gains [31]. In several cancers, local DNA copy number aberrations are predictive of outcome [32-34] or of treatment response [35-37]. Activation of oncogenes can result from chromosomal translocations and from gene amplifications. Inactivation of tumor suppressor genes arises from several mechanisms, including deletions or insertions of various sizes [38]. Analysis and interpretation of local aberrations that contribute to cancer development are hindered by the fact that cancer cells lose and gain whole chromosomes, which may be the cause of the cancer or a by-product of it [cf 39, 40]. While many cancers display karyotypic changes, oncogenic transformation can occur with no chromosomal instability, both in vitro [41] and in vivo [42].

1.6 Relationships between DNA copy number and gene expression

Comparisons of expression data with aCGH data have shown that aneuploidy is reflected, to some extent, in the expression data [43].
DNA copy number changes affect gene expression in many conditions, including colon cancer [44, 45], breast cancer [46, 47], prostate cancer [48], acute myeloid leukemia [49], acute lymphoblastic leukemia [50], Glioblastoma cell lines [51], Down’s Syndrome and chromosome 13 trisomy [52]. This holds not only for full chromosomes, but also for local amplifications and deletions [47], and derivative chromosomes (abnormal chromosomes consisting of segments from two or more chromosomes joined together as the result of a translocation, insertion, or other rearrangement) [51]. Torres et al. [53] engineered haploid yeast strains to contain two copies of specific chromosomes (disomes), generating disomic strains for 13 of the 16 yeast chromosomes. An approximate doubling of gene expression was observed along the
entire length of the disomic chromosomes, indicating that most if not all genes are expressed proportionally to the number of DNA copies in the cell. Two groups of genes were up-regulated in many different aneuploid strains: genes involved in environmental stress response, and genes involved in ribosome biogenesis.

Because of the many chromosomal aberrations usually found in cancer cells, it is difficult, if not impossible, to identify the consequences of specific trisomies independently of other coexisting genomic imbalances, gene mutations, or epigenetic alterations. Upender et al. [54] developed a model system that allows direct correlation of acquired chromosome copy number alterations with transcriptional activity in genetically identical cells, by introducing three different chromosomes into karyotypically diploid, mismatch repair-deficient colorectal cancer cells and into immortalized normal breast epithelial cells. This enables an assessment of the consequences of specific aneuploidies on global gene expression levels relative to the diploid parental cells. The average transcriptional activity of the trisomic chromosomes was significantly increased. Trisomy of chromosome 3 was rather quickly eliminated from the cells, as opposed to trisomies of chromosomes 7 and 13, indicating a selective disadvantage. In addition, expression levels of multiple genes throughout the genome were dysregulated.

1.7 Thesis structure

The goal of my PhD was to develop and apply computational algorithms that decipher molecular networks in normal and pathological human systems from heterogeneous sources of genomic data, including DNA sequences, mRNA and protein profiles. In cancer, I analyzed a very rich dataset of molecular profiles, including array CGH, mRNA and protein profiles, from 80 Glioblastoma Multiforme patients. This project is described in Chapter 2, and it led me to two related but more general research accomplishments.
First, I developed a method to identify biologically significant local DNA aberrations in cancers, and applied this method to Medulloblastoma and Neuroblastoma (Chapter 3). Second, I characterized and partially explained the transcriptional effects of monosomy and trisomy, both in cancer and in Down’s Syndrome (Chapter 4). In parallel, I developed and applied computational approaches to decipher the molecular networks underlying transcriptional changes in temporal processes in specific cell types. In particular, my analysis helped to discover the regulation of
CCR7 in dendritic cell maturation, and the role of feedback circuits in orchestrating the cellular response to growth factor stimulus in different cell types. This aspect is briefly discussed in Section 1.7.4. Overall, my work addressed the critical challenge of analyzing heterogeneous genomic data in such a way that the combination adds new information. This led to important insights into diverse biological processes, from development to cancer.

1.7.1 The molecular basis of expression changes in Glioblastoma

We analyzed a rich Glioblastoma Multiforme dataset including clinical information and molecular profiles at several levels: DNA copy number, mRNA profiles, and protein concentration. Our first goal was to associate clinical characteristics, in particular survival and drug response, with molecular profiles. We discovered that the expression level of a cluster of genes that includes several HOX genes is a predictor of poor survival. This implies that a "glioma stem cell" or "self-renewal" phenotype is responsible for the treatment resistance of Glioblastoma. This finding was published in the Journal of Clinical Oncology [55]. Next, we exploited the multi-level molecular profiles in order to identify the pathway components that were disrupted in each relevant pathway in each tumor. We also applied our novel method (Chapter 3) for detection of DNA aberrations, and characterized the relationships between DNA copy number and expression at the level of chromosomal arms and at the level of local aberrations. For each local aberration, we mapped the primary transcriptional effect on genes located on the aberration, and the secondary, genome-wide transcriptional effects. This project is described in Chapter 2.

1.7.2 Development of a method for identifying local DNA aberrations of biological significance

We developed a novel method that defines and prioritizes chromosomal aberrations. The method defines V, the “volume” associated with an aberration, as the product of three factors: (a) the fraction of patients with the aberration, (b) the aberration’s length and (c) its amplitude. The algorithm compares the values of V derived from the real
data to a null distribution obtained by permutations, and yields the statistical significance (p-value) of the measured value of V. This approach detects genetic locations that are significantly aberrant, generating a ‘genomic aberration profile’ for each sample. Unsupervised clustering of the samples based on these genomic signatures can reveal novel tumor subtypes. This method was applied to datasets from three types of brain tumors – the Glioblastoma dataset (above), and publicly available aCGH datasets of Medulloblastoma and Neuroblastoma. This analysis identified a potential new subtype in Medulloblastoma, characterized by many chromosomal aberrations. A manuscript that contains these findings is currently under review, and appears as Chapter 3.
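
The volume score and its permutation-based significance can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of Chapter 3: the calling threshold, the measurement of length in probes, the restriction to amplifications, and the null model (shuffling probe positions independently within each sample) are all assumptions introduced here, and the function names and toy data are hypothetical.

```python
import random


def volume(log_ratios, start, end, threshold=0.3):
    """V = (fraction of samples with the aberration) * (length) * (amplitude).

    log_ratios: one list of probe log-ratios per sample, all of equal length.
    The candidate region is the half-open probe interval [start, end).
    Only gains are scored; the threshold is an assumed calling cutoff.
    """
    region_means = [sum(s[start:end]) / (end - start) for s in log_ratios]
    carriers = [m for m in region_means if m > threshold]
    if not carriers:
        return 0.0
    fraction = len(carriers) / len(log_ratios)
    length = end - start                        # measured in probes here
    amplitude = sum(carriers) / len(carriers)   # mean log-ratio among carriers
    return fraction * length * amplitude


def permutation_p_value(log_ratios, start, end, n_perm=200, seed=0):
    """Empirical p-value of V against a null built by permutations.

    Assumed null model: probe values are shuffled independently per sample.
    """
    rng = random.Random(seed)
    observed = volume(log_ratios, start, end)
    exceed = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(s, len(s)) for s in log_ratios]
        if volume(shuffled, start, end) >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy data: 4 samples x 20 probes, three samples gain probes 5..9.
gained = [0.0] * 5 + [1.0] * 5 + [0.0] * 10
samples = [list(gained), list(gained), list(gained), [0.0] * 20]
v, p = permutation_p_value(samples, 5, 10)
print(v, p)
```

In the toy data the region is carried by three of four samples, spans five probes and has amplitude 1, so V = 0.75 x 5 x 1 = 3.75. Because V is a product, a short high-amplitude aberration seen in many patients and a long low-amplitude one seen in a few can receive comparable scores, which is the intuition the method formalizes.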

1.7.3 Characterization and understanding of the transcriptional effects of monosomy and trisomy

Changes in chromosome copy number (e.g. trisomy or monosomy) are expected to impact the expression of genes that are located on the affected chromosome. As previously published, in samples with trisomy of chromosome N, genes that are located on chromosome N are generally up-regulated. Symmetrically, genes that are located on chromosome N are generally down-regulated in samples with monosomy of chromosome N. To further explore this generalization, I examined the expression level of each gene in each sample with trisomy (in both cancer and Down’s Syndrome datasets) or with monosomy (in cancer samples), and discovered that the picture is more complex. Not all the genes on the aberrant chromosome are affected by copy number. Genes that are affected exhibit a wide range of change in expression level, and are affected differently in different samples with the same aberration. The effect of trisomy is much more conserved among individuals than the effect of monosomy.

Searching for factors explaining the effect of copy number change on expression, we found that the expression level of a gene in the diploid is significantly correlated with the level of change between the diploid and the trisomy or monosomy. This finding was confirmed in several datasets, including several cancer types and Down’s Syndrome. We checked various characteristics of the genes that may help to explain the differential response in gene expression to the same change in copy number. Those characteristics include genes’ length, regulatory region type, functional annotations and more. This project is described in Chapter 4.
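
Two quantities from this section can be made concrete with a short sketch: the expression change expected if expression were strictly proportional to copy number, and the correlation between diploid expression and the observed change. The expression values below are invented for illustration, and measuring the "level of change" as a difference of mean expression levels is an assumption, not necessarily the definition used in Chapter 4.

```python
import math


def expected_log2_change(copies, baseline=2):
    """log2 fold change expected under strict proportionality to copy number."""
    return math.log2(copies / baseline)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Naive dosage expectation: about +0.585 for trisomy, exactly -1 for monosomy.
print(expected_log2_change(3), expected_log2_change(1))

# Hypothetical mean expression per gene on the aberrant chromosome.
diploid_mean = [50.0, 100.0, 200.0, 400.0]
trisomy_mean = [60.0, 140.0, 290.0, 600.0]

# Assumed definition of "change": difference of mean expression levels.
change = [t - d for t, d in zip(trisomy_mean, diploid_mean)]
print(pearson(diploid_mean, change))
```

Under strict proportionality every gene on a trisomic chromosome would show the same log2 change, so a spread of per-gene changes, or a dependence of the change on the diploid level, indicates that the response to copy number is gene-specific.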

1.7.4 Deciphering transcriptional responses in temporal processes

Cellular responses, including transcriptional changes, unfold over time. While many clinically oriented studies, such as those of cancer tissues, can only examine the beginning and end points of complex temporal processes, cell systems allow us to follow such processes in greater detail. In particular, in two collaborative studies, we examined how new transcriptional programs are activated and progress in response to environmental signals. In each case, we deciphered specific transcriptional circuits that are critical for establishing a robust response.

1.7.4.1 Dendritic cells

Dendritic cells mature in response to bacterial lipopolysaccharides, as part of the primary immune response. TGFβ attenuates much of the maturation process. In collaboration with Prof. Yoram Groner’s group (Weizmann), we analyzed the global gene expression pattern during dendritic cells' maturation with and without TGFβ. Two inflammatory cytokines were affected: interleukin-12 was down-regulated by TGFβ, whereas interleukin-18, known to potentiate the interleukin-12 effect, was up-regulated. Expression of the peroxisome proliferator-activated receptor gamma (PPARγ) increased in response to TGFβ, concomitantly with reduced expression of chemokine receptor 7 (CCR7). This finding supports the possibility that TGFβ-mediated inhibition of CCR7 expression proceeds through induction of PPARγ, which in turn transcriptionally represses CCR7 expression. This project was published in Genes and Immunity [56].

1.7.4.2 Deciphering the cellular response to growth factor stimulus

We also studied the effect of growth factor stimulation on several cell lines (in collaboration with Yossef Yarden’s lab, Weizmann). Analyzing time series expression data, we characterized gene groups that define the common response of the different cell lines, as well as unique responses. Those unique downstream effects explained why different cell lines responded differently to the same stimulus, and why the same cell line responded differently to different stimuli. We also discovered negative feedback loops that were responsible for the correct interpretation of the signal. Some of the genes involved show predictive value for survival in certain
cancers, probably due to disruption of normal feedback mechanisms. This project was published in Nature Genetics [57], and a gene discovered during the analysis gave rise to another project, published in Nature Cell Biology [58].

2 The molecular basis of expression changes in Glioblastoma

In an ongoing collaboration with Prof. Hegi (CHUV, Lausanne, Switzerland), we analyzed a rich Glioblastoma Multiforme (GBM) dataset including clinical information and molecular profiles at several levels: DNA copy number (array CGH), mRNA profiles (Affymetrix), and protein concentration (tissue micro-arrays). Our first goal was to associate clinical characteristics, in particular survival and drug response, with molecular profiles. We characterized the relationships between DNA copy number and expression at the level of chromosomal arms and at the level of local aberrations. The method developed to detect DNA aberrations (Chapter 3) was applied to this dataset, discovering known and new aberrant regions. For each local aberration, we mapped out the primary transcriptional effect on the expression levels of genes located on the aberration, and the secondary genome-wide transcriptional effects. Next, we exploited the multi-level molecular profiles in order to identify the pathway components that were disrupted in each relevant pathway in each tumor.

2.1 Background

2.1.1 Glioblastoma Multiforme

Glioblastoma Multiforme (GBM) accounts for 19% of all primary brain tumors and is the most common malignant primary brain tumor in adults, with 3.1 new cases per 100,000 population per year in the US (www.cbtrus.org). The incidence of GBM increases with age, and rates are highest in 75- to 84-year-olds. GBMs are 1.7 times more common in males, and twice as common among whites as among blacks. At the population level, the majority of patients with primary GBMs (68%) had a clinical history of less than 3 months. The mean period from first symptoms to histological diagnosis was 6.3 months [59]. Relative survival estimates for GBM are low: median survival is less than one year [60 and references therein], and fewer than 4% of patients survived five years after diagnosis. Survival estimates are better for the few patients who are diagnosed under age 20 (http://www.cbtrus.org/reports//2007-2008/2007report.pdf). Old age is a significant predictive factor for poor survival of GBM patients [61]. Recurrence occurred in 99% of patients throughout the follow-up. Reoperation is not effective [62].

Histopathological features of GBM include poorly differentiated astrocytic cells with increased cell density, varied nuclear appearance, vascular proliferation and regions of necrosis, often surrounded by densely packed tumor nuclei. GBM can be divided into two subtypes based on clinical characteristics: primary and secondary GBM. Primary GBMs arise de novo without evidence of a less malignant lesion, in patients with a mean age of 55-62 years. Patients typically display a short clinical history (<3 months in the majority of cases) and large tumors. Secondary GBMs evolve progressively in younger patients (mean age 40-45 years) from a previous lower-grade astrocytoma over a period of 1 to 10 years [59, 63]. The mean time to progression from anaplastic glioma to GBM was ~2 years, and that from low-grade glioma to GBM was ~5 years [59]. Only 5% of GBMs are secondary, with histopathological evidence of a precursor low-grade or anaplastic astrocytoma [61].

Genetic studies of GBMs indicate that distinct genetic pathways are involved in the initiation and progression of these neoplasms [63]. Primary GBMs typically show homozygous deletions of the CDKN2A (p16/INK4A), CDKN2B (p15/INK4B) and ARF (p14) loci on 9p, or amplification of CDK4, often together with MDM2, on 12q [64]. Loss of heterozygosity (LOH) of 10q is found in 69% of cases, and EGFR amplification occurs in around 34% of cases [61]. Secondary GBMs have typically lost both wild-type TP53 alleles, by deletion of one and mutation of the other (requiring two genetic events) [65]. Methylation of the RB1 gene is found in 43% of secondary GBMs [66]. Secondary GBMs have also lost 10q in about half of the cases, but not necessarily 10p [67], and they have a very low incidence of mutations of the single retained PTEN gene (4% vs. 18% in primary) [67, 68]. Secondary GBMs show promoter methylation of CDKN2A and ARF [69] but very rarely have homozygous deletions of this locus (4% vs. 38% in primary) [70]. They have amplifications of EGFR in less than 10% of cases [71]. Common to both primary and secondary GBMs is thus disruption of both the TP53 and RB1 pathways and LOH on 10q [64].

2.1.2 The TP53 pathway

The TP53 tumor suppressor plays a critical role in the prevention of human cancer. The TP53 protein functions as a homotetramer complex and transactivates numerous genes of diverse function, including genes involved in apoptosis and inhibitors of the RB1 pathway (e.g., CDKN1A). In the absence of cellular stress, the TP53 protein is maintained at low steady-state levels and exerts very little, if any, effect on cell fate. In response to various types of stress, the TP53 protein level increases, and TP53 either triggers increased DNA repair to deal with the damage or induces senescence or apoptosis. Senescence and apoptosis eliminate the affected cell from the replicative pool, thereby preventing its expansion into a large population of malignant progeny [72]. The diversity of cancer-related signals that trigger a protective TP53 response probably accounts for its being such a central tumor suppressor, and explains why its inactivation is so frequently selected for in almost all types of cancer [72].

MDM2, located on 12q13-15, is an important negative regulator of TP53 and is itself transcriptionally activated by TP53 [72]. The ARF protein binds MDM2, blocking its ability to negatively regulate TP53 [72]. Over-expression of MDM2 leads to enhanced degradation of TP53, thereby inhibiting TP53-mediated growth regulation, and titrates out the ability of ARF to control MDM2 activity [64]. Thus, amplification and over-expression of MDM2 may disable TP53 function. MDM4 is located on 1q32 and codes for an MDM2-related protein that can also bind to TP53 and inhibit TP53-mediated transcriptional transactivation.

2.1.2.1 TP53 pathway alterations in GBM

GBMs can escape TP53-dependent growth control through several aberrations: (1) TP53 mutation, (2) MDM2 amplification, (3) MDM4 amplification, or (4) ARF deletion. Mutations of the TP53 gene are found in 30–40% of GBMs [73]. TP53 mutations are rare in primary GBMs (11%) but frequent in secondary GBMs (67%) [71]. The MDM2 region is amplified in approximately 8% of GBMs, and MDM2 is always over-expressed when amplified. About 4% of GBMs have MDM4 amplification and over-expression, and carry neither mutations in conserved regions of the TP53 gene nor amplification of the MDM2 gene [74]. The ARF locus is deleted in 30–40% of primary GBMs [73].

2.1.3 The RB pathway

The tumor suppressor RB1 (13q14) is a key regulator of the G1/S checkpoint. In quiescent cells, RB1 is not phosphorylated and sequesters E2F transcription factors, thus preventing passage through the G1/S checkpoint. When the cell replicates, cyclin D expression is up-regulated and cyclin D binds to CDK4 or CDK6. Together, they form an active kinase complex that phosphorylates RB1. The phosphorylated RB1 releases the E2F transcription factors, resulting in the transcriptional activation of the genes required for cell cycle progression, such as cyclin E (as well as ARF, as described above). The formation of the CDK4/cyclin D complex is negatively regulated by CDKN2A and CDKN2B, which are CDK inhibitors that specifically bind to CDK4 and CDK6, competing with cyclin D and thereby blocking its binding. Inactivation of RB1, CDKN2A and CDKN2B, or over-expression of cyclin D, results in inappropriate cellular levels of free E2F transcription factors and progression to S phase [64, 75]. An excess of CDK4 or CDK6 proteins can titrate out normal cellular levels of CDKN2A and CDKN2B, overcoming regulation by these CDK inhibitors and leaving free CDK4 and CDK6 to bind to cyclin D [64].

2.1.3.1 RB pathway alterations in GBM

Mutations or homozygous deletion of RB1 are found in 14% of GBMs [73]. RB1 promoter hypermethylation, which is associated with transcriptional silencing of the gene, is found in 43% of secondary GBMs and 14% of primary GBMs [66]. CDK4 is amplified in 13% of GBMs and is invariably over-expressed when amplified [73]. CDKN2A and CDKN2B are deleted in 30–40% of primary GBMs [73]. Alterations of CDKN2A, RB1 and CDK4 are inversely correlated, suggesting that alteration of any one of them is sufficient to bypass the G1/S checkpoint [76, 77].

2.1.4 The EGFR pathway

Many growth signals are mediated by growth factors whose signals are transmitted into cells by transmembrane growth factor receptors with intrinsic tyrosine kinase activity. In GBMs, there are two such receptors: epidermal growth factor receptor (EGFR) and platelet-derived growth factor receptor alpha (PDGFRA). The growth stimulation signals in one subtype of GBMs might be promoted by the amplified and over-expressed EGFR, and in another subtype by PDGFRA [78]. In the absence of a ligand, EGFR exists in a conformation that suppresses kinase activity and restrains formation of receptor dimers. When the ligands EGF or TGF-α bind to EGFR proteins, the receptors dimerize and autophosphorylate several tyrosine residues in their cytoplasmic domains. This provides specific docking sites for cytoplasmic proteins containing SRC homology 2 and phosphotyrosine-binding domains. These proteins bind to specific phosphotyrosine residues and initiate intracellular signaling, mainly via two signaling cascades: the RAS/RAF/MAP kinase pathway and the PI3K (phosphatidylinositol 3-kinase) pathway [79-81].

2.1.4.1 RAS/RAF/MAP kinase pathway

The complex formed by the adaptor proteins GRB2 and SOS binds directly, or through association with the adaptor molecule SHC, to specific docking sites on EGFR. This interaction leads to a conformational modification of SOS, which is then able to recruit RAS-GDP, resulting in RAS activation (RAS-GTP). RAS-GTP activates RAF-1 that, through intermediate steps, phosphorylates the extracellular signal-regulated kinases MAPK1 and MAPK2. Activated MAPKs are imported into the nucleus, where they phosphorylate specific transcription factors involved in cell proliferation [81].

2.1.4.2 PI3K pathway

PI3K binds to active EGFR and phosphorylates PIP2 (phosphatidylinositol-4,5-bisphosphate) to PIP3 (phosphatidylinositol-3,4,5-trisphosphate). PIP3 recruits proteins with Pleckstrin homology domains, which bind to PIP3 on the inner side of the membrane. These proteins include AKT, PDK1 and PDK2. AKT phosphorylates many proteins, including components of the apoptotic pathway, making them more resistant to activation. This results in cell proliferation and increased cell survival by blocking apoptosis. PTEN (phosphatase and tensin homolog) dephosphorylates PIP3 to PIP2, thus inhibiting the PIP3 signal [79-81].

2.1.4.3 EGFR pathway alterations in GBM

Amplification of the EGFR gene (7p11-12) occurs in 35% of GBMs. The amplicon is frequently large (at least 1.5 Mb) and occurs as double minutes (extrachromosomal genetic material) [82, 83] that may carry other genes, such as GBAS (GBM amplified sequence) [84] and ECOP (EGFR co-amplified and over-expressed protein) [85]. When amplified, EGFR is always over-expressed, but it is also over-expressed in a further approximately 15% of GBMs in the absence of amplification [64]. Mutations in EGFR are found in about half of the tumors with amplified EGFR, and only if EGFR is amplified, suggesting that amplification precedes mutation [86]. The most common mutation is called EGFRvIII. It results in a transcript that is aberrantly spliced, yet remains in frame, and codes for a mutated receptor protein that has lost 267 amino acids of its extracellular domain and does not bind ligand. EGFRvIII expression has numerous additional effects, including the up-regulation of extracellular matrix components, metalloproteases, and a serine protease [87], the down-regulation of the CDK inhibitor CDKN1B (p27), elevated CDK2-cyclin A activity, RB1 protein hyperphosphorylation, activated PI3-K and elevated phosphorylated AKT levels [88]. EGFRvIII is found more frequently in tumors with intact PTEN [89].

Over 90% of GBMs lose alleles from 10q. The regions consistently lost include the PTEN tumor suppressor gene at 10q23-24. PTEN is mutated in 30-45% of primary GBMs [64 and references therein] but rarely in secondary GBMs [68]. PDGFRA is amplified and over-expressed in 8-15% of GBMs [90, 91].

2.1.5 Pathway interactions

Each pathway is disrupted by a single genetic abnormality in the vast majority of individual tumors, i.e., the genetic abnormalities affecting the individual pathways are mutually exclusive [64]. The TP53 pathway and the RB1 pathway can be disrupted simultaneously by either of two genetic events: amplification of the region on 12q13-15 encompassing the CDK4 and MDM2 genes, or homozygous deletion of the region on 9p encompassing the CDKN2A, CDKN2B and ARF genes. As each of these events affects a component of the TP53 pathway and a component (or two) of the RB pathway, they are mutually exclusive. As a result, in the TP53 pathway, ARF deletion and MDM2 amplification are mutually exclusive alterations, and in the RB1 pathway, CDKN2A and CDKN2B deletion and CDK4 amplification are mutually exclusive alterations.

The first event is amplification of the region on 12q13-15 encompassing the CDK4 and MDM2 genes [64]. This region is amplified in approximately 13% of GBMs [73]. CDK4 is almost always included in the amplicon and is invariably over-expressed when amplified. MDM2 is not always included (only in approximately 8% of GBMs), but when included it is always over-expressed [73].

The second event is the homozygous deletion of the region on 9p21 encompassing the CDKN2A, CDKN2B and ARF genes, which requires two genetic events (deletion of both alleles). The ARF and CDKN2A products are transcribed from individual promoters and first exons (exon 1a and exon 1b) but share the same exons 2 and 3, using different open reading frames for translation, which results in two distinct proteins [64]. All three genes are considered tumor suppressor genes. Cells with oncogenic mutations often activate this locus, and it is homozygously deleted in many cancers [92, 93]. Specifically, this locus is deleted in 30–40% of primary GBMs [73].

Over-expression of EGFR and mutations of the TP53 tumor suppressor gene are mutually exclusive events defining two different genetic pathways in the evolution of GBM [71]. Amplification and/or over-expression of EGFR or of PDGFRA are also mutually exclusive in GBM [91]. GBMs with EGFR gene amplification almost always have deletions on chromosome 10. As there are cases of GBM with loss of chromosome 10 but without EGFR gene amplification, the loss of chromosome 10 probably precedes EGFR gene amplification in GBM tumorigenesis [94]. A significant correlation between loss of heterozygosity on chromosome 17p and high expression levels of PDGFR is observed in low-grade astrocytomas and GBMs [78].

2.1.6 Genetic abnormalities and expression patterns as predictors of prognosis

As genetic and other molecular data have accumulated on GBMs and the other astrocytic tumors, attempts have been made to identify more reliable predictors of prognosis and response to therapy than those provided by classical histopathology. The development of high-throughput expression profiling raised hopes for an educated choice of therapy. Studies of GBMs and prognosis have addressed clinical, histopathological, proteomic and genetic factors, either singly or in combination, often with different end-points. As yet, most findings remain controversial or inconclusive [64]. The ability to differentiate response to therapy and outcome by means of molecular diagnostics would have great clinical impact, since it would allow identification of subgroups of patients who are most likely to benefit from currently available therapies. Further, molecular profiling may define new tumor subclasses and allow identification of novel molecular targets for future therapeutic approaches.

2.1.7 Clinical trial

Temozolomide (TMZ) is a chemotherapy drug with activity against brain tumors. It is rapidly absorbed orally and crosses the blood-brain barrier. TMZ adds methyl groups to the N7 and O6 atoms of guanine and the N3 atom of adenine [95]. The cytotoxicity of TMZ appears to be related to the failure of the DNA mismatch repair system to find a complementary base for the O6-methylated guanine. This failure results in nicks in the DNA. These nicks accumulate, inhibit initiation of replication and block the cell cycle at the G2-M boundary, causing cell death. In addition, there is evidence that TMZ reduces the metastatic potential of tumor cells [96 and references therein].

The DNA-repair protein O6-alkylguanine-DNA-alkyltransferase (AGT) is encoded by the gene MGMT. AGT removes the methyl group from O6 of guanine and therefore has an important role in maintaining normal cell physiology and genomic stability. However, during TMZ treatment AGT reverses the activity of TMZ, and is thus a TMZ antagonist. All types of human tumors express AGT, but the AGT expression level varies among individual tumors of the same type. High levels of expression have been noted in colon cancer, melanoma, pancreatic carcinoma, lung cancer and gliomas [97 and references therein]. Loss of MGMT is associated with increased carcinogenic risk and increased sensitivity to methylating agents. MGMT-promoter methylation shuts off MGMT expression in tumors and increases responsiveness to chemotherapy [98].

In the experiment analyzed, patients were treated with radiotherapy and TMZ in the experimental wing and with radiotherapy alone in the control wing. MGMT methylation was found to be an independent predictive factor of the response to TMZ treatment. This provides the necessary scientific support for individually tailored therapy for GBM treatment [99].

2.2 Data

All data were collected by the group of M. Hegi and R. Stupp, Lausanne, Switzerland. Frozen GBM biopsies were collected from patients enrolled in a randomized phase III trial (EORTC 26981/22981) testing radiotherapy alone (control wing) or in combination with concomitant and adjuvant TMZ treatment (treatment wing). In addition, four samples were taken from brain tissues of patients with no brain tumor: during epilepsy surgery (1076, 1780), lobotomy during surgery (1553), and during an operation for hemorrhage (148). For each tumor, there is a full clinical record that includes clinical data, treatment and survival data (Table 2-1).
Methylation status of the MGMT promoter was also tested for most samples. Expression profiles of 83 tumors were measured on Affymetrix U133_P2 chips. Data were normalized by RMA. Array CGH was measured for 73 tumors on the UCSC 2464 markers platform. For 67 samples, both expression and aCGH data exist. Seven of the aCGH profiles were discarded due to a high level of noise. Thus, there are 60 tumors for which both expression profiles and aCGH data are available.

2.3 Data analysis

We wanted to associate clinical characteristics, in particular survival and drug response, with DNA copy number changes and gene expression. We characterized the relationships between DNA copy number and expression at the level of chromosomal arms and at the level of local aberrations. We also identified the pathway components that were disrupted in each relevant pathway in each tumor.

2.3.1 Two-way clustering analysis

Two-way clustering of the expression data was performed by Super Paramagnetic Clustering (SPC) [22]. The expression matrix reordered by clustering is shown in Figure 2-1.

2.3.1.1 Clustering of genes

Clusters of genes related to function were found: neuronal content, neurogenesis, CNS development, spermatogenesis, proliferation, immune response/inflammatory response, apoptosis, STAT pathway, cellular metabolism, hypoxia, transport and metabolism, HOX genes, and EGFR-related genes. In addition, clusters related to genomic location were found. For example, a group of genes located on chromosome 12q formed a cluster.

2.3.1.2 Clustering of samples

No clear subgroups of tumors were identified. The four non-tumor samples were clearly separated from the rest of the samples. The duplicates (sample 428 was hybridized to two chips) were more similar to each other than to any other sample. Three tumor samples (63, 57R and 212R) display a distinct expression profile, as can be seen in the PCA of the 1000 most variable probesets (Figure 2-2). Applying Coupled Two-Way Clustering (CTWC) [21] revealed a cluster of genes that separates males from females. Not surprisingly, this cluster contains probesets for genes that reside on the sex chromosomes. A sample that was annotated as a tumor removed from a female displayed expression of probesets on chromosome Y. Though it is possible for a tumor to lose chromosomes and alter the expression of genes, a tumor from a female is not likely to display expression of genes from the Y chromosome. Indeed, it turned out that the patient had been misidentified in the clinical data file.

2.3.2 Survival analysis

Survival analysis on the GBM dataset is complicated by several factors. First, as TMZ was shown to be an effective treatment [100], the patients need to be stratified according to the wing they were assigned to, treatment or control. This is further complicated by a salvage treatment with TMZ, given to the control wing patients who survived long enough. In addition, MGMT methylation status was shown to be a predictive factor for response to TMZ treatment [99, 101], and thus the patients need to be stratified by this factor as well, at least in the treatment wing. Furthermore, age was also shown to be a predictive factor for survival in GBM [e.g. 102, 103]. Thus, the statistical power to show that a gene or genomic location is a predictive factor in this dataset is low. We used Kaplan-Meier analysis accompanied by the log-rank test to search for predictive factors, but soon realized that other factors should also be taken into account. All aCGH markers were interrogated for association with survival using the Cox proportional hazards method, adjusted for age (>50 years) and MGMT methylation status. No marker passed an FDR of 20%. We also tried a model for the treatment wing only, but again no marker passed an FDR of 20%. The survival analysis for the expression data was performed by E. Migliavacca, and it is reported in [55].
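As an illustration of the Kaplan-Meier estimate referred to above, the following is a minimal sketch in plain Python. It is not the code used in this work; the function name `kaplan_meier` and the toy follow-up times are hypothetical. Stratification (by wing, MGMT status or age) would amount to calling the function separately on each stratum.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  -- follow-up time of each patient (e.g. in months)
    events -- 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)      # patients still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every patient whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy example (made-up data): five patients, two of them censored.
toy_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Censored patients reduce the number at risk without producing a step in the curve, which is why the salvage treatment and short follow-up mentioned above directly limit the information available per stratum.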

2.3.3 Chromosomal status

Visualization of the aCGH data enables estimating gain and loss of chromosomes (Figure 2-3). The whole of chromosome 7 is amplified in many samples (Figure 2-4). Chromosome 10 has low copy number in many samples (Figure 2-5). For one sample it is clear that while the entire p arm has low copy number, most of the q arm does not. It can also be seen that chromosome 9 is amplified in some samples, and chromosomes 19 and 20 are amplified in several samples (Figure 2-6). There are other, less frequent, chromosome-size events.

2.3.4 Correlation between expression and copy number

At the resolution of the aCGH and expression data of the GBM dataset, the correlation between DNA copy number and expression is low, but significant. The full data are shown in Figure 2-7A and 2-7B. Following [45], and using smoothed expression and aCGH values along chromosomes (see Figure 2-7C for the curves obtained for sample 1), the correlation was calculated for each sample. The mean Pearson correlation coefficient is 0.47, with a standard deviation of 0.11 (see the histogram in Figure 2-7D).

At the resolution of chromosomal arms, DNA copy number strongly affects expression; compare the expression data in Figure 2-8A with the aCGH copy number data in Figure 2-8B. Figure 2-8C displays the values for a single sample. For each sample, the median aCGH value and the median gene expression value were calculated for each chromosomal arm. Human cells have 23 pairs of chromosomes: 22 pairs of autosomes and one pair of sex chromosomes. There are five acrocentric chromosomes (chromosomes with a very short p arm) in the human genome: 13, 14, 15, 21 and 22. There are not enough genes or markers on their p arms. So, there are 17 autosomes with p and q arms (17*2), 5 autosomes with a q arm only, and the X and Y chromosomes with two arms each (2*2), resulting in 43 (17*2 + 5 + 2*2) different chromosomal arms. For this analysis we used only the 39 autosomal arms. The correlation between these two sets of 39 numbers, one for each chromosomal arm (displayed in Figure 2-8A and B), was measured. The mean Pearson correlation coefficient over all 67 samples is 0.72 and the standard deviation is 0.16 (see the histogram in Figure 2-8D). The increase in correlation with the decrease in resolution is expected, as all other factors affecting transcript levels average out.
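The arm-level calculation described above can be sketched as follows for a single sample. This is a minimal illustration under stated assumptions, not the code used in this work: the function names, the arm labels (such as "7p") and the toy values are hypothetical, and real input would be the normalized aCGH and expression values with an arm annotation per marker and per gene.

```python
from math import sqrt
from statistics import median

ACROCENTRIC = {13, 14, 15, 21, 22}  # p arm too short to carry enough markers

# The 39 autosomal arms: 17 autosomes with p and q arms, 5 with a q arm only.
AUTOSOMAL_ARMS = [
    f"{chrom}{arm}"
    for chrom in range(1, 23)
    for arm in "pq"
    if not (arm == "p" and chrom in ACROCENTRIC)
]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def arm_medians(values, arms):
    """Median of the values that fall on each chromosomal arm."""
    grouped = {}
    for value, arm in zip(values, arms):
        grouped.setdefault(arm, []).append(value)
    return {arm: median(vals) for arm, vals in grouped.items()}


def arm_level_correlation(cgh, cgh_arms, expr, expr_arms):
    """Correlate per-arm median copy number with per-arm median expression."""
    cgh_med = arm_medians(cgh, cgh_arms)
    expr_med = arm_medians(expr, expr_arms)
    shared = sorted(set(cgh_med) & set(expr_med))
    return pearson([cgh_med[a] for a in shared], [expr_med[a] for a in shared])
```

Taking the median per arm before correlating is what averages out the gene-specific factors, which is consistent with the higher correlation observed at arm resolution.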

2.3.5 Detecting chromosomal instabilities

Studying the stable gene clusters found (Chapter 2.3.1), we found several clusters that contained mainly genes that reside on the same chromosome or chromosomal arm, for example a cluster with genes that reside on the q arm of chromosome 12. These genes have high expression in a few patients and low expression in most patients. All probesets for genes that reside in that region were tested, and a large group of probesets that display a similar pattern of expression was found (Figure 2-9). The expression matrix allows for mapping the borders of the putative amplicon in each patient. This region is known to be amplified in GBM [104], and contains two known oncogenes, MDM2 and CDK4. This is a compound amplicon, as the region separating the two oncogenes is typically not amplified. The reason for the non-contiguous amplification was recently discovered to be the presence of a new tumor suppressor gene in GBM, WIF1, which is located between MDM2 and CDK4 (W.L. Lambiv et al., manuscript in preparation).

These findings motivated us to develop a robust general method for a rigorous search of significant DNA aberrations, described in Chapter 3. Applying this method to the aCGH data, 24 amplifications and 48 deletions were found (Table 2-2 and Table 2-3, respectively). A significant amplicon was found on chromosome 7, in the region of EGFR (Figure 2-10A). This is also a known amplicon [82], which exists as double minutes (short separate DNA segments) in some cases [82, 83]. It can be seen in Figure 2-10B that all samples with the amplification have a relatively high level of EGFR expression, but there are many other samples that have a high level of EGFR expression without any apparent over-expression of neighboring genes on the chromosome. This may indicate that amplification is only one of several mechanisms that may cause EGFR over-expression.

2.3.6 Effect of the aberrations on expression

Gene expression is affected by many factors. In normal cells, where there are two copies of each autosomal chromosome, DNA copy number is constant and does not affect gene expression. In GBM, as in many other types of cancer, the DNA content of the cells is very different from that of normal cells. Some chromosomes (most frequently 7) have more than two copies, and some chromosomes (most frequently 10) have only one copy. In addition, chromosome parts are amplified and lost, and for smaller chromosomal regions very strong amplifications are frequent, either within the chromosome, seen as homogeneously staining regions, or extrachromosomal, as double minutes. Deletions of one or two copies of parts of a chromosome are also found.

Each of these changes may be reflected in gene expression in two ways, to which we will refer as primary and secondary. Figure 2-10A displays the EGFR amplicon, Figure 2-10B its primary effect, and Figure 2-10C its secondary effect. The effect is called primary when the expression of a gene in the amplified region of the DNA is up-regulated, or the expression of a gene in a deleted region is down-regulated. The effect is called secondary when the expression of a gene outside the amplified or deleted region of the DNA is affected by the aberration. An example of a secondary effect is that when a region encoding a transcription factor is deleted, target genes that are positively regulated by this transcription factor are down-regulated. It cannot be said that a primary effect is the driving force for an aberration, but if a certain primary effect is seen in all tumors that carry a certain chromosomal aberration, it is likely that this primary effect confers a selective advantage that allows the fixation of this aberration in the population of tumor cells. Thus, in general, oncogenes are expected to be found on amplifications, and tumor suppressor genes on deletions.
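The distinction between primary and secondary effects can be written down as a small rule. The sketch below is only an illustration of the definitions given above: the function name, the dictionary layout and the coordinates are hypothetical, and the gene passed in is assumed to have already been found differentially expressed in tumors carrying the aberration.

```python
def classify_effect(gene, aberration, direction):
    """Label the effect of one aberration on one differentially expressed gene.

    gene       -- dict with 'chrom', 'start', 'end'
    aberration -- dict with 'chrom', 'start', 'end' and
                  'kind' ('amplification' or 'deletion')
    direction  -- 'up' or 'down': expression change in tumors carrying the aberration
    Returns 'primary', 'secondary', or None when the gene lies inside the
    aberration but changes against the copy number (covered by neither definition).
    """
    inside = (
        gene["chrom"] == aberration["chrom"]
        and gene["start"] <= aberration["end"]
        and gene["end"] >= aberration["start"]
    )
    if not inside:
        return "secondary"
    expected = "up" if aberration["kind"] == "amplification" else "down"
    return "primary" if direction == expected else None


# Hypothetical coordinates, for illustration only.
amplicon = {"chrom": "7", "start": 54_000_000, "end": 56_000_000, "kind": "amplification"}
gene_inside = {"chrom": "7", "start": 55_000_000, "end": 55_200_000}
gene_elsewhere = {"chrom": "12", "start": 1_000_000, "end": 1_050_000}
```

With these toy inputs, `classify_effect(gene_inside, amplicon, "up")` gives a primary effect and `classify_effect(gene_elsewhere, amplicon, "down")` a secondary one.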

2.3.7 Amplifications

In the aCGH data of 60 tumors, 34 markers were found to be significantly amplified at a Benjamini-Hochberg (BH) FDR [19] of 10%. These markers are grouped into 24 seemingly separate amplifications, as neighboring markers probably represent the same amplification. A list of the amplifications and their primary and secondary effects can be found in Table 2-2. Four amplifications include oncogenes that are known to be amplified in GBM (CDK4, MDM4, MDM2, PDGFRA and EGFR). One amplification contains only one marker, whose annotation is probably wrong (see Chapter 3.6.3 for the method of identifying wrong marker annotation). Two amplifications are in regions that were previously identified as normal copy number variations. That is, the copy number in this chromosomal region varies between normal individuals, and is thus likely to vary between tumor tissues as well.
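For reference, the Benjamini-Hochberg step-up procedure used to control the FDR can be sketched as below. This is a generic textbook implementation in plain Python, not the code used in this work; the function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return the indices of the hypotheses rejected at the given FDR.

    Sort the m p-values, find the largest rank k such that
    p_(k) <= k * fdr / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])


# Toy example (made-up p-values): at an FDR of 10% the first three pass.
passed = benjamini_hochberg([0.01, 0.04, 0.03, 0.5], fdr=0.10)
```

Because the procedure is step-up, a marker whose p-value misses its own rank threshold is still rejected when a higher-ranked p-value passes.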

2.3.8 Deletions

In the aCGH data of 60 tumors, 64 markers were found to be significantly deleted at a BH FDR [19] of 10%. These markers are grouped into 48 seemingly separate deletions, as neighboring markers probably represent the same deletion. A list of the deletions and their primary and secondary effects can be found in Table 2-3. One deletion includes a region with three tumor suppressor genes, CDKN2A, CDKN2B and ARF, known to be deleted in GBM. Eight deletions are markers whose annotation is probably wrong (see Chapter 3.6.3 for the method of identifying wrong marker annotation), and six are normal copy number variations.
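The grouping of neighboring significant markers into aberrations can be sketched as follows. The adjacency rule used here (markers on the same chromosome whose positions in the ordered marker list differ by at most `max_gap`) is an assumption made for illustration; the actual criterion is part of the method described in Chapter 3, and the function name and toy marker indices are hypothetical.

```python
def group_markers(significant, max_gap=1):
    """Group significant markers into putative aberrations.

    significant -- iterable of (chromosome, marker_index) pairs, where
                   marker_index is the marker's position in the ordered list
                   of markers on that chromosome
    max_gap     -- largest index difference still treated as 'neighboring'
                   (assumed rule, see lead-in)
    Returns a list of groups, each a list of (chromosome, marker_index).
    """
    groups = []
    for chrom, idx in sorted(significant):
        last = groups[-1][-1] if groups else None
        if last is not None and last[0] == chrom and idx - last[1] <= max_gap:
            groups[-1].append((chrom, idx))   # extends the current aberration
        else:
            groups.append([(chrom, idx)])     # starts a new aberration
    return groups


# Toy example: six significant markers collapse into three aberrations.
toy_groups = group_markers(
    [("7", 10), ("7", 11), ("7", 20), ("12", 5), ("12", 6), ("12", 7)]
)
```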

2.3.9 Pathway analysis

Each pathway can be altered at several points. We wanted to characterize each tumor by the points at which various pathways were altered.

2.3.9.1 The TP53 pathway

As detailed in Chapter 2.1.2, the TP53 pathway can be altered at several points: TP53 itself, by deletion or mutation, or by amplification of MDM2 or MDM4, or by deletion of CDKN1A or ARF. Figure 2-11A displays the markers closest to the pathway's related genes. As previously published [74], MDM2 and MDM4 amplifications are mutually exclusive, and ARF deletion and MDM2 amplification are also mutually exclusive. All aberrations are evident in the expression (Figure 2-11B). However, there are 12 samples that have none of those alterations (Figure 2-11, all panels, right of the black line). In Figure 2-11C, clinical information for all samples is displayed. The MIB score of the samples with no known alterations is significantly higher than that of the samples with known TP53 alterations, with FDR of 10%. There is only one sample with known TP53 mutation status in the “no known alterations” group, and it is WT. All samples with known “mutant” TP53 status (marked green) have the deletion on chromosome 9.

We searched for genes or markers whose expression differentiates these two groups of samples. Only one differentiating marker passed the t-test with FDR of 10% (RP11-218N24, Figure 2-12A). This marker was not identified in our previous screen for deletions. We assume that the markers that are known to be aberrant did not pass the t-test because the test assumptions (normal distribution and equal variance) do not hold. Indeed, when a t-test with unequal variance was applied, more markers passed, including the chromosome 9 locus and the EGFR region (Figure 2-12B). On the other hand, there are several genes that pass the t-test (Figure 2-13), including the expected CDKN2A and CDKN2B, which are on the same locus as ARF. There is also one gene, TNFRSF10B (KILLER/DR5) from chromosome 8, which is a known TP53 target and is in the general region of the marker RP11-218N24. In addition, there are several genes that are co-expressed, mostly down-regulated in the “no known alterations” group. Among them are two known TP53 targets, TGFA and TNFRSF10B, and also MEG3, a non-coding RNA suspected to be a tumor suppressor gene [105, 106]. None of these genes was assigned to a cluster in [55]. However, three of the genes up-regulated in the “no known alterations” group (MCM2, RFC4 and HMMR) were assigned to G26 and G29. G29 was annotated as a proliferation cluster.

The different expression levels of various genes between the two groups may be interpreted in two ways. Either the TP53 pathway is not affected in the “no known alterations” group, in which case we would expect to see differential expression of TP53 targets; or there are other alterations in this group of samples that affect the pathway. It seems that tumors in the “no known alterations” group proliferate more, as indicated by the MIB index and the genes in the proliferation cluster. This group probably harbors other alterations, possibly silencing of the TSG TGFA, TNFRSF10B or MEG3. However, the two groups are not different in their survival (Figure 2-14).
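The difference between the equal-variance t-test and the unequal-variance (Welch) t-test mentioned in this section lies in how the standard error and the degrees of freedom are computed. The sketch below shows both statistics in plain Python; it is a generic illustration rather than the code used in this work, the function names are hypothetical, and converting a statistic into a p-value would additionally require the t distribution, which is not included here.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample t statistic assuming equal variances (pooled estimate)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Two-sample t statistic without the equal-variance assumption.

    Degrees of freedom follow the Welch-Satterthwaite approximation.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

When one group is small and much more variable than the other, as is the case for a strong aberration carried by only some of the samples, the pooled variance can misrepresent the standard error, which may be why more markers passed under the unequal-variance test.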

2.3.9.2 The RB pathway

As detailed in Chapter 2.1.4, the RB1 pathway can be altered at several points: RB1 gene deletion, mutation or promoter hypermethylation; CDK4 amplification; and CDKN2A and CDKN2B deletion. Figure 2-15A displays the markers closest to RB1, CDK2, CDK4, CDK6, CDKN2A and CDKN2B. The deletion of CDKN2A and the amplification of CDK4 are mutually exclusive, as previously reported [76, 77]. CDK4 amplification and CDKN2A and CDKN2B deletion are evident in the expression (Figure 2-15B). However, RB1 deletion is less clear in the expression. In addition, it seems that CDK2 expression is higher in samples that do not have CDK4 amplification or the chromosome 9 locus deletion. Three samples have none of those alterations (Figure 2-15, all panels, right of the black line). None of these three samples has alterations in the TP53 pathway either. Figure 2-15C displays clinical information for all samples. None of the clinical parameters separates the groups.

Since only three samples have no known RB1 alterations, searching for other alterations in such a small group is very challenging. We searched for genes or markers whose expression differentiates these two groups of samples. At an FDR of 10%, many markers differed between the groups (Figure 2-16), including the expected chromosome 7 EGFR region, the chromosome 9 deletion, the chromosome 10 deletion and chromosome 13 deletions.

Several genes pass the t-test (Figure 2-17). Among the up-regulated genes are many immune system genes, which are up-regulated in two of the three samples in the “no known alterations” group, and NAG, which is up-regulated in the third sample. Although the NAG marker is not amplified, NAG, MYCN and MYCNOS, all located in the same chromosomal region known to be amplified in Neuroblastoma [107], are up-regulated in this sample. Indeed, MYCN was shown to override the G1/S checkpoint in Neuroblastoma [108]. The two groups do not differ in survival (Figure 2-18).

2.3.9.3 The EGFR pathway

As described in Chapter 2.1.5, the EGFR pathway can be altered by EGFR amplification or over-expression, PTEN deletion, silencing or mutation, CDKN1B, CDK2 amplification and PDGFRA amplification. The relevant genomic regions are shown in Figure 2-19A. Only four samples have no detected alteration in this pathway. As previously reported [94], EGFR amplification occurs in most samples and is almost always accompanied by PTEN deletion. However, contrary to a previous report [91], EGFR amplification and PDGFRA amplification are not mutually exclusive. The expression of PDGFRA is up-regulated when it is amplified. EGFR expression is low in the samples with no known alterations, and PTEN expression is high in those samples (Figure 2-19B). The clinical data (Figure 2-19C) show that PDGFRA amplification is associated with younger age, which is new, to the best of our knowledge.

The four samples that have no detected EGFR pathway alterations are all male. Three of them have no detected TP53 pathway alterations, and two of them have no detected RB pathway alterations. Aberrant markers (Figure 2-20) identified using the t-test include the EGFR region, and a region on chromosome 4 that is deleted in two of the four samples. The t-test with unequal variances finds many more markers, but all those markers are aberrant in the “known alterations” group. The differentiating genes (Figure 2-21) are not informative. There is no survival difference between the groups (Figure 2-22).

2.4 Conclusion

We have been able to confirm known results regarding the molecular characteristics of GBM. We did not find any new relationship to survival. Some new DNA copy number aberrations were found, but they have not yet been confirmed. Pathway analysis demonstrated which components of the TP53, RB1 and EGFR pathways are altered in each tumor. For the tumors with no known TP53 alterations, we showed indicators of induced proliferation, and suggested two candidate tumor suppressor genes in those samples. One of the three samples with no known RB1 alterations had MYCN amplification, which may be a novel way to bypass the G1/S checkpoint in GBM. The other two samples have no known alteration in any of the pathways, and have high expression of immune system genes. Thus, we suspect that those samples were actually taken from a region outside the tumor. We also showed that patients whose samples carry PDGFRA amplification tend to be younger, possibly indicating a secondary-GBM-like progression.

Figures’ legends

Figure 2-1 – Expression matrix of the GBM dataset after clustering. The 1000 probesets with the highest standard deviation across samples are shown. Each row represents a probeset. Probesets are standardized to mean zero and standard deviation 1, and are sorted by clustering. Colors represent the expression level after standardization. Each column represents a normal tissue or a tumor. Columns are sorted by clustering. Two way clustering was performed by SPC [22].

Figure 2-2 - PCA of the 1000 most variable probesets. The non tumor samples are separated. Three tumors are also separated in another direction, two of which are recurring GBMs.

Figure 2-3 – Array CGH data for the GBM dataset. Each row is a marker. Markers are ordered by chromosomes, and within each chromosome, by their genomic position on the chromosome. Each column is a sample. Samples are ordered by standard deviation. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample, and are truncated to the range (-1,1). White represents no reading. Chromosomes 7, 10, 13, and 19 are marked by black rectangles on the bar on the left.

Figure 2-4 – Zoom in on the aCGH data of chromosomes 6, 7 and 8. Chromosome 7 markers are marked by black rectangle on the bar on the left. Each row is a marker. Markers are ordered by chromosomes, and within each chromosome, by their genomic position on the chromosome. Each column is a sample. Samples are sorted by chromosome 7 median value. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample. Values are truncated to the range (-1,1). White represents no reading. Chromosome 7 markers have higher copy number than chromosome 6 or 8 markers, in many samples to the right. The EGFR region that is amplified is marked by an asterisk.

Figure 2-5 - Zoom in on the aCGH data of chromosomes 9, 10 and 11. Chromosome 10 markers are marked by black rectangle on the bar on the left. Each row is a marker. Markers are ordered by chromosomes, and within each chromosome, by their genomic position on the chromosome. Each column is a sample. Samples are ordered by chromosome 10 median. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample. Values are truncated to the range (-1,1). White represents no reading. Chromosome 10 markers have lower copy number than chromosome 9 or 11 markers in the samples on the left.

Figure 2-6 – Zoom in on the aCGH data of chromosomes 18, 19 and 20. Chromosome 19 markers are marked by black rectangle on the bar on the left. Each row is a marker. Markers are ordered by chromosomes, and within each chromosome, by their genomic position on the chromosome. Each column is a sample. Samples are sorted by chromosome 19 median. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample. Values are truncated to the range (-1,1). White represents no reading. Chromosome 19 markers have higher copy number than chromosome 18 markers in the samples on the right. Some samples have both chromosome 19 and 20 amplified.

Figure 2-7 – Overall expression and aCGH in 67 samples in fine resolution. (A) The expression relative to non tumor samples. Values were smoothed and interpolated. Every row is a probeset, and probesets are sorted by their genomic order. (B) aCGH data. Every row is a marker, and markers are sorted by their genomic order. In both A and B, the samples (columns) are sorted according to standard deviation in B. (C) For a single sample, the expression and aCGH values are shown. For this sample, the correlation coefficient is 0.65. (D) Distribution of the correlation coefficients between expression and aCGH for all samples.

Figure 2-8 – Overall expression and aCGH in 67 samples at chromosomal arm resolution. (A) The expression relative to non tumor samples. The median of all probesets on each chromosome arm is given. Only chromosome arms for which there are probesets and aCGH markers are shown. The samples are in the same order as in Figure 2-7. (B) The aCGH data. The median of all markers on the chromosome arm is given. Only chromosome arms for which there are probesets and aCGH markers are shown. (C) For a single sample (same sample as in Figure 2-7C), the expression and aCGH values are shown for all chromosomal arms. For this sample, the correlation coefficient between aCGH and expression is 0.91. (D) Distribution of correlation coefficients between expression and aCGH at the arm resolution for all samples.

Figure 2-9 – Expression matrix of an amplified region on chromosome 12q. Genes are ordered by their order on the chromosome. A few non amplified genes between kub3 and lrig3 were removed for clarity, as indicated by the black line. Samples are sorted by SPIN [26]. Samples likely to have amplification were sorted on the left side.

Figure 2-10 – EGFR amplicon and its transcriptional effect. (A) The aCGH data for the markers in the amplicon region. Markers are ordered by their order on the chromosome. (B) Primary effect – probesets in the amplicon region that are up-regulated in samples carrying the amplicon. (C) Secondary effect – all probesets that are up-regulated or down-regulated in samples carrying the amplicon. The black line separates amplicon carriers (left) from non carriers.

Figure 2-11 – Known alterations in the TP53 pathway. In all panels, each column is a sample. Samples are sorted by (1) having downregulation of CDKN2A and CDKN2B, (2) MDM4 up-regulation, (3) MDM2 up-regulation and (4) no known alterations, separated by a black line. (A) DNA copy number of the markers related to the TP53 pathway. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample, truncated to (-1, 1). White represents no reading. (B) Normalized gene expression of genes related to the TP53 pathway. (C) Clinical data of the samples.

Figure 2-12 – Markers separating between samples with known TP53 alterations and the rest of the samples using (A) t-test or (B) t-test with unequal variances, with FDR of 10%. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample, truncated to (-1, 1). Samples (columns) are ordered as in Figure 2-11.

Figure 2-13 – Normalized expression of the probesets separating between samples with known TP53 alterations and the rest of the samples using t-test with FDR of 10%. Samples (columns) are ordered as in Figure 2-11.

Figure 2-14 – Kaplan Meier plot comparing the survival of samples with known TP53 alterations and samples with no known TP53 alterations.

Figure 2-15 – Known alterations in the RB1 pathway. In all panels, each column is a sample. Samples are sorted by (1) having downregulation of CDKN2A and CDKN2B, (2) CDK4 up-regulation, (3) RB1 region deletion and (4) no known alterations, separated by a black line. (A) DNA copy number of the markers related to the RB1 pathway. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample, truncated to (-1, 1). (B) Normalized gene expression of genes related to the RB1 pathway, truncated to (-3,3). (C) Clinical data of the samples.

Figure 2-16 – Markers separating between samples with known RB1 alterations and the rest of the samples using t-test with FDR of 10%. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample, truncated to (-1, 1). Samples (columns) are ordered as in Figure 2-15.

Figure 2-17 – Normalized expression of the probesets separating between samples with known RB1 alterations and the rest of the samples using t-test with unequal variances with FDR of 10%. Samples (columns) are ordered as in Figure 2-15.

Figure 2-18 – Kaplan Meier plot comparing the survival of samples with known RB1 alterations and samples with no known RB1 alterations.

Figure 2-19 – Known alterations in the EGFR pathway. In all panels, each column is a sample. Samples are sorted by (1) having PDGFRA amplification, (2) EGFR amplification, (3) PTEN deletion and (4) no known alterations, separated by a black line. (A) DNA copy number of the markers related to the EGFR pathway. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample, truncated to (-1, 1). (B) Normalized gene expression of genes related to the EGFR pathway. (C) Clinical data of the samples.

Figure 2-20 – Markers separating between samples with known EGFR alterations and the rest of the samples using (A) t-test or (B) t-test with unequal variances, with FDR of 10%. Colors represent copy number ratio between the tumor samples tested and a normal diploid reference sample, truncated to (-1, 1). Samples (columns) are ordered as in Figure 2-19.

Figure 2-21 – Normalized expression of the probesets separating between samples with known EGFR alterations and the rest of the samples using t-test with FDR of 10%. Samples (columns) are ordered as in Figure 2-19.

Figure 2-22 – Kaplan Meier plot comparing the survival of samples with known EGFR alterations and samples with no known EGFR alterations.

3 Identifying local DNA aberrations of biological significance

3.1 Abstract

Background

Many types of tumors exhibit characteristic chromosomal losses or gains, as well as local amplifications and deletions. Within any given tumor type, sample specific amplifications and deletions are also observed. Typically, a region that is aberrant in more tumors, or whose copy number change is stronger, would be considered a more promising candidate to be biologically relevant to cancer. We sought an intuitive method to define such aberrations and prioritize them. Chromosomal arm status (gain/loss) can be combined with the occurrence of these relevant aberrations to create a succinct fingerprint of the tumor genome.

Results

We define V, the “volume” associated with an aberration, as the product of three factors: (a) the fraction of patients with the aberration, (b) the aberration’s length and (c) its amplitude. Our algorithm compares the values of V derived from the real data to a null distribution obtained by permutations, and yields the statistical significance (p-value) of the measured value of V. We detected genetic locations that were significantly aberrant. Those genomic locations, together with chromosomal arm status, are used to visualize the tumors and reveal new tumor subtypes. This visualization also highlights events that are co-occurring or mutually exclusive. We demonstrate the method for three different public array CGH datasets of Medulloblastoma and Neuroblastoma. We identify a potential new subtype in Medulloblastoma, which is analogous to Neuroblastoma type 1.

Conclusions

Our method detects chromosomal regions that are known to be altered in the tested cancer types, and suggests new genomic locations to be tested. Combining significant aberrations with chromosomal arm status defines a succinct genome fingerprint. In this reduced dimensional space, it is easier to reveal new subtypes of tumors.

3.2 Background

3.2.1 Cancer is characterized by DNA copy number aberrations Genes from all bands of the human chromosomes are involved in some commonly occurring tumor associated aberrations [30]. Each solid tumor type displays one of several characteristic combinations of chromosomal gains and losses. There is considerable overlap between the imbalance profiles of the different tumor types, and typically there are more losses than gains [31]. It has been shown that in several cancers local DNA copy number aberrations are predictive of outcome [32-34] or of treatment response [35-37]. Oncogene activations can result from chromosomal translocations and from gene amplifications. Tumor-suppressor genes’ inactivation arises from several mechanisms, including deletions or insertions of various sizes [38]. Analysis and interpretation of local aberrations that contribute to cancer development are hindered by the fact that in cancer cells there is loss and gain of whole chromosomes, that may be the cause of the cancer or a by-product of it [cf 39, 40]. While many cancers display karyotypic changes, oncogenic transformation can occur with no chromosomal instability, both in-vitro [41] and in-vivo [42].

3.2.2 Array CGH as a tool to measure DNA copy number aberrations

Array Comparative Genomic Hybridization (aCGH) [5, 6] is a procedure that provides genome-wide DNA copy number measurement along genomes of mammalian complexity. A control sample and a test sample are competitively hybridized to an array with genomic targets. If the control is diploid, a higher signal of the test sample is indicative of amplification, and a higher control signal indicates deletion. Single-copy decreases and increases from diploid are reliably detected [5]. Several types of genomic targets can be printed on the array. For example, Bacterial Artificial Chromosomes (BACs) are fairly widely used: these markers have a typical length of 150 kb, and about 2000-8000 BACs are used to provide coverage of the full human genome. In addition, cDNA probes [7] and oligonucleotides [8, 9] are also used.

3.2.3 Existing methods for analyzing array CGH data

Most methods for analysis of aCGH data focus on assigning copy number or status (gain, normal, loss) to every genomic location in single samples [43, 109-113]. Several such methods were compared [112], with the conclusion that most algorithms do well in detecting the existence and the width of aberrations for large changes and high signal-to-noise ratio. None of the algorithms, however, reliably detected aberrations with small width and low signal-to-noise ratio. Most studies recognize those aberrations that pass a certain threshold of frequency of appearance or amplitude. In nearly all studies, the selection criteria were either not specified, or set in an arbitrary way [43, 109, 114-116].

Considerable effort has been devoted to identifying significant and meaningful aberrations, using data from multiple samples simultaneously. Hidden Markov Models, often used to define single sample status, were extended to multiple samples [117]. Rouveirol et al. [118] defined recurrent minimal genomic alterations, and incorporated external constraints, such as a range of frequencies of occurrence and a range of signal magnitudes, to filter the observed alterations. Snijders et al. [119] used aCGH to define minimal common amplified regions and then expression analysis to identify candidate driver genes in amplicons. Diskin et al. [120] presented a method for testing the significance of aberrations across multiple samples. Their input is a list of aberrations in each sample. They calculate a frequency statistic and a footprint statistic out of permutations of the locations in each chromosomal arm. Guttman et al. extended this method to scan a range of thresholds for defining aberrations, selecting multiple aberrations in each threshold [121]. Lipson et al. [122] tried to identify optimal intervals over the aCGH data.
Methods in a similar spirit were developed for analysis of SNP data, which is informative for genotyping as well as copy number [123, 124].

Intuitively, an aberration is more likely to have biological significance if it happens in many samples, and if it is strong. A longer aberration is less likely to be attributable to measurement error. Thus, the three parameters used to score each marker are the number (or fraction) of carriers (patients), the length of the aberration and its amplitude. We refer to the product of these three factors as the volume V of the marker, and use it as our statistic to assess the validity of each aberration. The method compares the real data to randomized data obtained by permutations of the real data, under the null assumption that the genomic locations are independent. Once we obtain the distribution of V in our randomized data, we can evaluate the statistical significance of the actual value of V, measured for each marker. We detect significantly aberrant genetic locations and associate them with a p-value. We demonstrate the method for three different public aCGH datasets from two different childhood neoplasms associated with the nervous system, on three different BAC array platforms: Medulloblastoma (GSE8634) and Neuroblastoma (GSE5784 [125] and GSE7230 [126]).

3.3 Results

3.3.1 Algorithm Our method uses aCGH data to create a concise genomic description of each sample, including chromosomal status and appearance of significant local copy number aberrations. This concise description can be used to find an informative order or sub- classification of the samples. The algorithm includes two steps – assigning chromosomal status and detecting significant local copy number aberrations.

3.3.1.1 Input

The algorithm’s input is the raw log2 aCGH data, and the markers’ status. The raw log2 ratio data of chromosome 2p, taken from GSE7230, is presented in Figure 3-1A. Markers’ status is the assignment per marker per sample: loss (-1), normal (0) or gain (1). The status was set by GLAD [127] (see supplementary note 1). Markers that were not correlated with their adjacent markers, but highly correlated with markers at another genomic location, were removed (see Methods, Chapter ‘Recognizing possible inaccurate genomic locations’). We constructed an amplifications matrix A, which has binary valued elements: $A_{ms} = 1$ if the aCGH marker m was assigned a gain value on sample s, and $A_{ms} = 0$ otherwise (the amplification matrix of chromosome 2p based on the GSE7230 data is shown in Figure 3-1B). A deletions matrix D is defined similarly: $D_{ms} = 1$ if the aCGH marker m has a loss assignment on sample s, and $D_{ms} = 0$ otherwise. Markers’ status is equal to A - D.
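As a minimal sketch of this input step (not the original implementation), the status matrix can be split into the two binary matrices as follows. The function name and the toy status values are hypothetical; rows are markers and columns are samples.

```python
def split_status(status):
    """Split a markers x samples status matrix (-1, 0, 1) into binary A and D."""
    A = [[1 if v == 1 else 0 for v in row] for row in status]   # gains
    D = [[1 if v == -1 else 0 for v in row] for row in status]  # losses
    return A, D


# Hypothetical status calls for 3 markers (rows) in 3 samples (columns).
status = [
    [0, 1, -1],
    [1, 1, 0],
    [0, -1, -1],
]
A, D = split_status(status)

# Sanity check of the identity stated in the text: status = A - D.
recombined = [[a - d for a, d in zip(ra, rd)] for ra, rd in zip(A, D)]
print(recombined == status)
```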

3.3.1.2 Chromosome status

We define an entire chromosome arm as gained in a sample when more than 50% of the markers on that arm have a status of ‘gain’ in that sample. Similarly, an entire chromosome arm is defined as lost in a sample when more than 50% of its markers have a status of ‘loss’. For graphical representation of chromosomal status, the median log2 ratio of all markers on each chromosomal arm in each sample is used.
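The arm-level rule can be sketched as below. This is an illustration under assumptions, not the original code: the function names are hypothetical, each call receives the calls of one arm in one sample, and missing readings are assumed to have been dropped beforehand.

```python
from statistics import median


def arm_status(marker_status):
    """Return 1 (gain), -1 (loss) or 0 for one arm in one sample.

    marker_status holds the per-marker calls (-1, 0, 1) on that arm.
    An arm is called only when strictly more than 50% of markers agree.
    """
    n = len(marker_status)
    gains = sum(1 for v in marker_status if v == 1)
    losses = sum(1 for v in marker_status if v == -1)
    if gains > n / 2:
        return 1
    if losses > n / 2:
        return -1
    return 0


def arm_display_value(log2_ratios):
    """Median log2 ratio of the arm, the value used for plotting."""
    return median(log2_ratios)


print(arm_status([1, 1, 1, 0, -1]))          # majority of markers gained
print(arm_status([-1, -1, 0, 0]))            # exactly 50% lost: not called
print(arm_display_value([0.4, 0.5, 0.1, 0.6, -0.2]))
```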

3.3.1.3 ‘Volume’ statistic

Our goal is to find markers whose aberration happens significantly more frequently than expected by chance, taking into account the known tendency of cancer cells to gain and lose DNA sequences. Three factors are relevant for assessing the significance of an aberration:

Width W - The number of carriers. The more tumors have an aberration, the more likely this aberration is to give selective advantage to the cell that carries it.

Height H - The amplitude of the aberration. Typically, a duplication event creates only one extra copy of the sequence. Thus, having multiple copies may indicate that having this amplification gives a selective advantage. This is more relevant for amplifications, as deletions can be only at two levels: hemizygous or homozygous deletion. In addition, the amplitude of the aberration measured in a certain tumor is affected by the fraction of subclones in the tumor tissue tested in which it is present. If the fraction is higher, the amplitude is higher. The amplitude is hard to compare among samples, as the range of values varies depending on the percent of diploid cells in the tumor sample.

Length L - The length of the aberration (number of neighboring markers included) is also important, but its contribution to the volume statistic defined below should be limited. The reason is that the aim of our analysis is to look for specific genes that "drive" the aberration, and long events, that affect the copy number of a large number of genes, are not informative. Therefore, entire chromosomal arm gains and losses are removed, and all the markers on this chromosome arm are given NaN value (arithmetic representation for Not-a-Number) for this sample. The removed chromosomal arms in each dataset appear in Supplementary Table 3-1, and their graphical representation in Supplementary Figure 3-1.
If the statistic that characterizes the aberration increases linearly with the length, the presence of a few samples with very long aberrations can have a very strong effect on the results of the calculation. This can be avoided by setting an upper limit on L, denoted by K, and choosing K ~ 5-10 markers (in the actual implementation, we scan different values of K and combine the results). If the length of the aberration exceeds K markers, the value of the Length parameter is set to L = K. It should be noted that the number of markers does not necessarily reflect linearly the aberration length on the chromosome, as the distances between the markers are not uniform along the genome.

Our method takes all three factors (width, height and length) into consideration in order to calculate the statistic termed ‘volume’ for each marker. The detailed volume calculation is as follows. For each dataset there are two binary matrices, the amplification matrix A and the deletion matrix D, defined in the ‘Input’ Chapter above. For samples in which an entire chromosome arm is gained (see the ‘Chromosome status’ Chapter above for the definition), the corresponding entries of A are replaced by NaN, and for samples in which an entire chromosome arm is lost, the corresponding entries of the deletion matrix D are replaced by NaN.

Figure 3-1 displays the volume calculation for chromosomal arm 2p in GSE7230 (Neuroblastoma). The height matrix H is the raw log2 ratio: $H_{ms}$ (Figure 3-1A) is the measured aCGH log2 ratio value of marker m in sample s. A (Figure 3-1B) is the amplification matrix, where $A_{ms} = 1$ if the status of marker m on sample s is gain. In the length matrix L (Figure 3-1C), each element (m, s) containing the digit 1 in A is replaced by $L_{ms}$, the length of the sequence of ones on sample s to which marker m belongs (length dimension). If $L_{ms} > K$, we set $L_{ms} = K$, to avoid overweighting long aberrations. In Figure 3-1 we used K = 5. If $A_{ms} = 0$, then $L_{ms} = 0$ as well. In the X matrix shown in Figure 3-1D, each element (m, s) containing the digit 1 in A is replaced by a real number $X_{ms} = H_{ms} A_{ms} L_{ms}$ ($A_{ms}$ is redundant here, as $L_{ms} = A_{ms} L_{ms}$, and is included for clarity). Finally, all the numbers in row m are summed, representing the contribution of the width variable to our statistic $V_m$ (Equation 1). $V_m$ represents the ‘volume’ of marker m (Figure 3-1E).

Equation 1: $V_m = \sum_s X_{ms}$

This value is divided by the number of samples with non-NaN entries for this marker. The volume statistic is calculated separately for each value of K, K = 1:10. Six markers are significantly amplified (marked by a red line in Figure 3-1E; see the next Chapter for details on setting the p-value per marker, and the FDR Chapter in Methods for how the False Discovery Rate (FDR) is controlled). The raw aCGH data of these six markers are shown in Figure 3-1F.
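The volume calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the original implementation: function names and the toy matrices are hypothetical, `None` stands in for the NaN entries of removed arms, H is assumed to have no missing readings, the input covers a single chromosome arm (so runs never cross arm boundaries), and only amplifications are shown.

```python
def capped_run_lengths(column, K):
    """Length matrix column for one sample.

    column holds the A entries (1, 0 or None) of consecutive markers.
    Each 1 is replaced by the length of its run of ones, capped at K;
    0 stays 0 and None (removed arm) stays None.
    """
    n = len(column)
    out = [0 if v == 0 else None for v in column]
    i = 0
    while i < n:
        if column[i] == 1:
            j = i
            while j < n and column[j] == 1:
                j += 1
            run = min(j - i, K)
            for k in range(i, j):
                out[k] = run
            i = j
        else:
            i += 1
    return out


def volume(H, A, K):
    """V_m: sum over samples of H_ms * L_ms, divided by the number of
    samples with a non-missing entry for marker m."""
    n_markers, n_samples = len(A), len(A[0])
    L_cols = [capped_run_lengths([A[m][s] for m in range(n_markers)], K)
              for s in range(n_samples)]
    V = []
    for m in range(n_markers):
        # L_ms is already 0 wherever A_ms is 0, so H * L equals H * A * L.
        terms = [H[m][s] * L_cols[s][m] for s in range(n_samples)
                 if L_cols[s][m] is not None]
        V.append(sum(terms) / len(terms) if terms else None)
    return V


# Toy arm with 4 markers and 3 samples; the third sample gained the
# whole arm, so its entries in A were removed.
A = [[0, 1, None],
     [1, 1, None],
     [1, 1, None],
     [0, 0, None]]
H = [[0.0, 0.5, 0.9],
     [1.0, 0.6, 0.9],
     [1.2, 0.4, 0.9],
     [0.1, 0.0, 0.9]]
V = volume(H, A, K=2)
print(V)
```

Scanning K = 1:10, as done in the text, would amount to calling `volume` once per value of K.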

3.3.1.4 Associating p-values to the volume statistic of each marker

Due to our lack of knowledge about the null distribution, in order to assign a p-value for the volume statistic of each marker m, $V_m$, a permutation of the original data is applied to approximate the distribution of the data under the null assumption of independence between the aCGH values and the genomic locations in different samples. In order to preserve the length distribution in each sample, the X matrix (that already includes the height and length contribution to the volume statistic) is permuted, and not the H or L matrices. The entries of each column of the matrix X are permuted, and then the values in each row are summed. This randomization preserves the number of aberrant markers in each sample, their intensity, and the contributions of the lengths of the aberrations, while removing any location data. The randomization is repeated N (N=100) times (see ‘Number of permutations’ Chapter below for discussion of robustness in N). For each of the N randomized X matrices we calculate $V_i$ for every marker, obtaining for our n markers N*n values $V_i$. The distribution of these N*n numbers is used to calculate the p-value associated with every measured value of V, simply by counting the frequency of values in the null distribution that are higher or equal to the measured value. The null distribution is estimated separately for each value of K, K = 1:10. The FDR procedure [19] was used to control the False Discovery Rates. See FDR Chapter in Methods for details.
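The permutation scheme can be sketched as follows. This is an illustration, not the original code: the function name, the seed and the toy X matrix are hypothetical, X is assumed to contain no missing entries, and plain row sums are used for both the observed and the permuted volumes. The FDR step is not shown.

```python
import bisect
import random


def permutation_pvalues(X, n_perm=100, seed=0):
    """Empirical p-value per marker from column-wise permutations of X.

    X is a markers x samples matrix. Each permutation shuffles the
    entries within every column (sample) independently and recomputes
    the row sums; all n_perm * n_markers sums form one pooled null.
    """
    rng = random.Random(seed)
    n_markers, n_samples = len(X), len(X[0])
    observed = [sum(row) for row in X]
    columns = [[X[m][s] for m in range(n_markers)] for s in range(n_samples)]

    null = []
    for _ in range(n_perm):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        null.extend(sum(col[m] for col in shuffled) for m in range(n_markers))
    null.sort()

    pvals = []
    for v in observed:
        # Number of null values greater than or equal to the observed one.
        n_ge = len(null) - bisect.bisect_left(null, v)
        pvals.append(n_ge / len(null))
    return pvals


# Toy example: 10 markers, 5 samples; marker 2 is aberrant in every sample.
X = [[1.0 if m == 2 else 0.0 for _ in range(5)] for m in range(10)]
pvals = permutation_pvalues(X, n_perm=100)
print(pvals[2], pvals[0])
```

Because the p-value is a simple frequency, its smallest non-zero value is 1/(N*n), which is why the number of permutations is discussed separately below.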

3.3.1.5 Definition of an aberrant region

After significantly aberrant markers are identified, adjacent markers, as well as markers separated by a single non-aberrant marker, are combined into a single aberration. The aberration region is defined as the region between the non-aberrant markers that border the aberration. Each aberration was annotated as to whether it is included in a normal copy number variation. In addition, genes residing within each aberration, and specifically cancer related genes, were listed (see Methods, Chapter ‘Aberrations’ annotation’).
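The merging rule can be sketched as follows. The function names and marker positions are hypothetical, markers are assumed to be listed in genomic order along one chromosome arm, and the handling of an aberration that touches the end of the arm (falling back to the aberrant marker's own position) is an assumption of this sketch.

```python
def merge_aberrant_markers(significant, max_gap=1):
    """Group significant markers into aberrations.

    significant is a list of booleans in genomic order. Markers separated
    by at most max_gap non-aberrant markers are joined. Returns a list of
    (first_index, last_index) pairs.
    """
    regions = []
    start = last = None
    for i, flag in enumerate(significant):
        if not flag:
            continue
        if start is None:
            start = last = i
        elif i - last - 1 <= max_gap:
            last = i
        else:
            regions.append((start, last))
            start = last = i
    if start is not None:
        regions.append((start, last))
    return regions


def region_bounds(region, positions):
    """Genomic interval between the flanking non-aberrant markers."""
    first, last = region
    left = positions[first - 1] if first > 0 else positions[first]
    right = positions[last + 1] if last + 1 < len(positions) else positions[last]
    return left, right


significant = [False, True, True, False, True, False, False, True, False]
positions = [100, 200, 300, 400, 500, 600, 700, 800, 900]  # hypothetical
regions = merge_aberrant_markers(significant)
print(regions)
print([region_bounds(r, positions) for r in regions])
```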

3.3.2 Parameters space

3.3.2.1 Maximal aberration length In order to avoid an overrepresentation of long aberrations, two measures were taken. First, for each chromosomal arm, in samples in which an arm status was ‘gain’ or ‘loss’, all marker values on this arm were replaced by NaNs. In addition, the maximal contribution of an aberration length to the volume was set to K. This K is an arbitrary value, representing preference to aberrations that are longer than one marker, but avoiding dominance of the signal by a few very long aberrations, to the level of ignoring short aberrations. As Supplementary Figure 3-2 shows, as the parameter K increases over the range 1-10, the number of significantly aberrant markers decreases monotonically, and the cumulative number of detected markers reaches a plateau. Therefore, we repeated all the analyses for K = 1:10.

3.3.2.2 Number of permutations

As we use the frequency of each volume in all permutations to assess the p-value, the more permutations there are, the more accurate the result is, as a frequency of zero will always be counted as significant. The number of permutations N thus may, in principle, affect the number of markers found significant. However, the actual distribution converges fast. Though the p-value of a given volume may vary slightly with increasing N, it reaches a plateau before N=100. When N is increased from 100 to 200, the change in the p-value for a given volume (corresponding to FDR of 0.1 or 0.01 for N=100) is smaller than $10^{-4}$. Thus, we chose to work with N = 100.

3.3.3 Applications The method was applied to three datasets. Table 3-1 displays the number of aberrant markers and aberrations detected in each dataset. Significantly deleted markers appear in Supplementary Table 3, and deletions in Supplementary Table 5. Significantly amplified markers appear in Supplementary Table 4, and amplifications in Supplementary Table 6.

3.3.3.1 Medulloblastoma

When applied to the Medulloblastoma dataset analyzed here (GSE8634), our method finds all the known chromosomal aberrations of this cancer, and several possibly new ones as well. Figure 3-2 displays the chromosome status map of the Medulloblastoma dataset, and the significant aberrations. As described for GSE2139 [128], in which a subset of the samples was analyzed, isochromosome 17 (i(17q): loss of 17p, replaced by an exact copy of 17q) is the most frequent aberration. We identified five different subgroups, marked on the bar below Figure 3-2. Subgroup 1 has many chromosomal aberrations, but not isochromosome 17. Subgroups 2 and 3 carry isochromosome 17, which is the most frequent aberration in Medulloblastoma [128]. On the basis of our analysis, we propose that the tumors displaying this aberration can be further separated into a group with many chromosomal events (marked 2), similar to Neuroblastoma type 1, and a group with no other pronounced chromosomal events (marked 3). Several events of gain of chromosome 7 in group 2 are accompanied by loss of chromosome 8, resulting in chromosomes 7 and 8 being negatively correlated. A subgroup of tumors with loss of chromosome 6 (marked 4) does not have isochromosome 17, as also described in other Medulloblastoma datasets [129, 130]. The last group (marked 5) has few or no chromosomal events. Two tumors of that group have gain of chromosome 7, and three samples have loss of chromosome 22, but those numbers are too small to consider them as separate subtypes.

Our method identified 10 amplified regions (comprising 13 amplified markers) and 99 deleted regions (comprising 137 deleted markers). Figure 3-2B displays selected aberrations. MYCN and CDK6 amplifications were identified. MYCN region amplification appears only in groups 1-3. Amplification of the CDK6 region appears mostly in groups 1 and 2. NPM1 (Nucleophosmin, B23) was deleted in a few samples. NPM1 has been recognized as a partner gene for various chromosomal translocations in hematological malignancies. NPM1 was associated with centrosome duplication and the regulation of p53, and might have a role as a tumor suppressor [cf 131].
This dataset (GSE8634) has not yet been published, but dataset GSE2139 that includes a subset of the samples [128] was analyzed for local aberrations. The publication [128] included a list of amplifications and deletions. We searched for markers that were included in amplifications or deletions identified by [128] and by our method. Three of the amplifications reported by [128] included markers that were identified as significantly amplified by our method – MYCN, CDK6 and marker RP11-382A18. Marker RP11-382A18 is annotated to MYC region on chromosome 8 by the platform of GSE2139 (GPL1432), but to chromosome 10 by the platform of GSE8634 (GPL5685). We will assume that it is indeed located in the MYC region, and it is not amplified together with MYCN. Four of the reported deletions [128]

included markers that were identified as significantly deleted by our method, annotated in [128] to carry CHRD, UTF1, PRDM2 and HDAC4.

3.3.3.2 Neuroblastoma Figure 3-3 displays the chromosome status map of both Neuroblastoma datasets, as well as the aberrations common to the two Neuroblastoma datasets tested. Three distinct clinicogenetic subgroups have been described in Neuroblastoma [132, 133]. The first group (marked 1 on the bar below the Figure 3-3 subplots) exhibits predominantly whole-chromosome aberrations (typical gains of chromosomes 6, 7, and 17, and losses of chromosomes 3, 4, 11, and 14). The other two groups (marked 2 and 3) are characterized by structural chromosome aberrations, such as partial 17q gain. Group 2 has MYCN amplification and 1p deletion. Group 3 is characterized by 11q deletion and, to a lesser extent, 3p deletion. This classification explains most of the associations found between chromosomal arms. In GSE5784 there are 15 amplifications (28 markers amplified) and 115 deletions (245 markers deleted). In GSE7230 there are 18 amplifications (30 markers amplified) and 49 deletions (87 markers deleted). Three amplifications and 14 deletions are common to both Neuroblastoma datasets (GSE5784, GSE7230) (Table 3-2, Figure 3-3 C and D). The first amplified region, which was separated into two regions in GSE7230, is on chromosome 2 and corresponds to the MYCN region. MYCN amplifications were identified mostly in group 2. The other amplification is of the defensin cluster on chromosome 8. In addition to being amplified in several samples, this region is deleted in other samples, in accordance with its being a known frequent normal copy number variation [134]. Eight of the common deletions correspond to the 1pter deletion, which was fragmented into eight deletions in GSE7230. Another common deletion is in the region of BRCA1, a known tumor suppressor gene. In GSE5784, several known tumor suppressor genes were deleted: APC, CDKN2A, RB1 and TGFBR1.
Also, two regions with known oncogenes were amplified in this dataset: a region on chromosome 11 that includes CCND1, FGF19, FGF3 and FGF4, and a region on chromosome 12 with ETV6. For GSE5784, no aberration list was given in the original publication [125] for comparison. In GSE7230, the ALK region on chromosome 2 was amplified. ALK was previously identified as having a role in Neuroblastoma [135]. The fumarate hydratase (FH)

region was deleted in GSE7230. FH was shown to be a tumor suppressor gene in several cancers [136]. For GSE7230 [126], aberrations are reported at the cytoband level. Only two of the 24 reported amplified regions overlap with amplifications identified by our method: MYCN and a region on chromosome 16. Eleven of the 22 deleted regions reported in [126] overlap with deletions identified by our method, including the 1pter deletion and the MLH1 region.

3.4 Discussion We have introduced a simple, intuitive method to recognize significant local amplifications and deletions in aCGH data. The input is the raw data and its categorization into gain, normal and loss values for each marker in each sample (defined in our implementation by GLAD [127]). Then, for each marker, its level of change, frequency of change and length of change are combined to create a volume statistic. The significance of this statistic is assessed using a random distribution based on a permutation of all the data. After aberrant markers are detected, they are combined into continuous aberrations that are annotated for normal copy number variations and then associated with cancer related genes. Our guiding principle was to keep the method simple. We wanted to incorporate as few assumptions and as few arbitrary parameters as possible into the method. Implementation of the method necessitates setting three parameters: the number of randomizations, N; the maximal aberration length contribution for the statistic calculation, K; and the FDR level. The number of permutations N affects the computation time. As the distribution of the volume statistic under permutations converges fast, increasing N above 100 will not change the results. The value of K, the maximal aberration length contribution for the statistic used, does affect the identity of the aberrations detected as significant. Thus, we scanned K = 1:10 and combined the results.
We showed that increasing K above 10 had very little effect on the aberrations detected. The results on all datasets tested show that the method is sensitive enough to detect aberrations consisting of a single marker. Also, as the volume is calculated for each marker, the markers in the minimal common region of the aberration will typically be assigned the highest volume. This increases the likelihood that these markers are classified as significantly aberrant. The chosen FDR level naturally affects the results, but setting the level of acceptable false discovery rate, the multiple comparisons equivalent of the confidence level, is always

left to the researcher to decide. However, the minimal volume required for an aberration to be detected as significant at each level of FDR can be estimated for each value of K, and the FDR level can be adjusted accordingly. There is no reason to assume that the number of carriers, length and amplitude of an aberration are equally important in determining its significance, as they are treated here in calculating the 'volume' statistic. But they are all biologically relevant parameters and, lacking an educated weighting system for them, equal weighting is the simplest choice. The relative weight of each parameter can easily be changed within this framework. In fact, we vary the relative weight of the length parameter when varying K. We also tested the case where the height parameter is ignored, but this causes the loss of detection of relatively rare strong amplifications (e.g. the CDK6 amplicon in Medulloblastoma). The accuracy of status assignment (gain/loss/normal) may affect the results. If thresholds are too restrictive, aberrant markers may not be recognized as such. If thresholds are too permissive, many markers will be considered aberrant, which may hamper the ability of the method to identify weak or rare aberrations. There are several methods for status assignment available today [111, 127, 137-142], and users may select the method most appropriate for their data. Normal copy number variation complicates the detection of significant disease related aberrations. Discarding all aberrations that contain any known variation would remove most of the aberrations, including clinically recognized ones. In addition, the normal copy number variation database contains variations that were identified in patients with various medical conditions that may affect copy number. Thus, only variations identified in a normal population on a similar platform [143] were used for annotation.
In addition, every marker that was both significantly deleted and significantly amplified was recorded as a suspected normal copy number variation. Indeed, many significantly aberrant locations are annotated as frequent normal copy number variations. When there are enough normal and tumor samples from the same population, it may be interesting to test how significantly the frequencies of high or low copy numbers of certain normal copy number variations differ between the normal and tumor populations; such a difference may indicate a possible predisposition to cancer in carriers of those variants. Another problem of most aCGH platforms is inaccurate marker annotation. In clustering the markers on the basis of their aberration profile for each dataset, up to

5% clustered with markers annotated to other chromosomes (data not shown). This is an underestimate of the number of wrongly annotated markers, as not all chromosomes create a stable cluster of their associated markers. This problem can be addressed in several ways. The simplest one is to discard all single marker aberrations. This, however, may result in losing valuable information. Thus, we removed markers that had low correlation with the chromosome to which they were assigned and high correlation with another chromosome. Still, many of the significantly aberrant markers are not correlated with their adjacent markers, and are still suspected to be located elsewhere in the genome. Our method essentially detects recurrent genomic alterations. Those alterations are not necessarily minimal, as in [118], but it is easy to detect the focus of the aberration by the maximal volume, there is no need for arbitrary constraints, and significance is given, unlike in Snijders et al. [119]. Our method is similar in spirit to that of Diskin et al. [120], but we do not perform binning into fixed-width locations, which may introduce artefacts [120]. The volume statistic we use is similar to the frequency statistic used in [120] for K = 1, i.e. when the aberration's length is not taken into account. Another difference is that we compare each marker to the whole genome, and not to a certain chromosomal arm, thus applying an equal 'significance' threshold to all aberrations. We incorporate the amplitude of the aberration into the statistic itself, rather than testing a range of thresholds as in Guttman et al. [121]. To enable comparing each marker to the whole genome, and not to a certain chromosomal arm, all long events must be removed. In most cases, removing all chromosomal arms on which more than half of the markers are aberrant is enough.
However, in certain cases (11p in Medulloblastoma and Neuroblastoma, 1p in Neuroblastoma) we noted long events of less than half an arm length that were not removed. When these events are on the same genomic location, they may cause identification of many markers in this region as aberrant, always in the same samples. This may be correct, but is not the goal of this analysis, since our primary aim was to find local aberrations. Thus, in such cases, long chromosomal events can be noted and removed prior to the analysis, or after the analysis. Removing these aberrations, that may be interesting in themselves, may allow the detection of more local aberrations. When comparing the aberrations identified by our method to the aberrations identified by other methods, we see all the oncogenes that are known to be amplified in the corresponding cancers, but our method misses some aberrations identified in previous

publications and finds new ones. This is a natural consequence of the parameters we defined and the removal of whole arm events. One of the main differences is that our method rarely identifies as significant a region that is aberrant in one sample only. Also, a region that is amplified on an amplified chromosome background, or deleted on a deleted chromosome background, without many separate appearances on a normal copy number background, cannot be identified, as chromosome level events are removed. This is in agreement with our goal of detecting local events. However, this can be overcome by running the method for each chromosome or chromosome arm separately, while keeping in mind the reduction of the sample size, and that p-values cannot be compared between different chromosomes or chromosomal arms. We applied our method to three public datasets of childhood neoplasms associated with the nervous system: one of Medulloblastoma (GSE8634) and two of Neuroblastoma (GSE5784, GSE7230). In Medulloblastoma, we find five distinct subgroups. Two subgroups carry isochromosome 17: one with many other chromosomal events (2), and one with few chromosomal events (3). There is also a group with many chromosomal aberrations but without isochromosome 17 (1), a group with loss of chromosome 6 (4), and a group with few aberrations (5). MYCN amplification appears only in the first three groups, and CDK6 amplification appears mostly in the first two groups. MYC amplification appears only when there is no MYCN amplification, and only in the first two groups, strengthening our newly suggested partition of the isochromosome 17 type into two subtypes, the first of which is equivalent to Neuroblastoma type 1. In Neuroblastoma, we identified the three known subgroups, and the MYCN amplification known to be associated with one of them.
Comparing two types of childhood neoplasms associated with the nervous system, it is interesting to note the role of chromosome 17, and its interrelations with MYCN amplifications. Chromosome 17 amplification has two forms – isochromosome in Medulloblastoma, and gain of the q-arm or the whole chromosome in Neuroblastoma. MYCN amplification appears mostly with isochromosome 17 in Medulloblastoma, but only with 17q amplification in Neuroblastoma – rarely with whole chromosome gain. It was recently shown that MYCN-directed centrosome amplification, leading to increased tumorigenesis, requires MDM2-mediated suppression of p53 activity in Neuroblastoma cells [144]. Since p53 is located on chromosome 17p, it can be

suggested that suppression of p53 is difficult when there are more than two copies of 17p, and thus there is no selective advantage in MYCN amplification in tumors carrying more than two copies of the full chromosome 17 (Neuroblastoma type 1). Similarly, MYCN amplification is more advantageous if there is deletion of 17p, carrying p53.

3.5 Conclusion Our method allows for a fast and biologically motivated detection of aberrant chromosomal regions, and associates them with chromosomal arm level events to characterize subtypes of cancer. We have demonstrated the ability of the method to detect known and new DNA amplifications and deletions in two types of childhood neoplasms associated with the nervous system. In addition to the known chromosomal aberrations and known subgroups, our method identified a new subgroup in Medulloblastoma.

3.6 Methods

3.6.1 Datasets All aCGH datasets used for analysis were downloaded from GEO (see Table 3-1). Log2 ratios were used as they appear in GEO. Markers were ordered by their genomic location according to the annotation of the corresponding platforms. Loss, normal or gain status was assigned to each marker in each sample by GLAD [127], using the parameters given in the GLAD manual (Supplementary Note 1).

3.6.2 Aberrations' annotation Normal copy number variations were downloaded from http://projects.tcag.ca/variation/ for the human genome versions hg17 and hg18. There is no data for hg16. Aberration annotation includes variations identified by aCGH in 270 normal individuals [143]. The list of genes in each aberration was created based on genomic locations from the knownGene table of the matching UCSC genome version, with gene symbols taken from the kgXref table. The gene list of each aberration was scanned for cancer related genes ([145], October 30, 2007 version). The lists of deletions and amplifications and their associated genes and cancer related genes appear in Supplementary Tables 5 and 6, respectively.

3.6.3 Recognizing possible inaccurate genomic locations Our working hypothesis is that at least 90% of the markers are annotated to their correct chromosomal locations. In order to identify markers that we suspect to be mistakenly annotated, we use the correlation of the marker's signal intensity with that of neighboring markers. If the signal of a marker is correlated with that of its neighbors, it is not likely to be inaccurately annotated. Thus, we calculated for each marker m its Pearson correlation coefficients, c(m-1,m) and c(m,m+1), with its two neighboring markers. We define a threshold T for each dataset such that 20% of the correlations between adjacent markers are lower than T (if fewer than 10% of the markers are inaccurately located, at most 20% of the adjacent pairs include an inaccurately located marker, since each marker takes part in two pairs). For most markers m, both c(m-1,m) and c(m,m+1) exceed T. A low correlation (less than T) with one of the two neighbors may be due to the start or end of a chromosome arm, to an aberration or variation border, or to a mistaken location annotation of the neighbor. If, however, the correlation of a marker with both its neighbors is below the threshold, the marker is likely to be either on an isolated aberration (copy number change) or inaccurately located. We flagged these markers as suspected of being assigned to wrong locations. For each suspected marker we applied the procedure described in Supplementary Note 2 to check whether it can be confidently assigned to another genomic location, based on a very high correlation with the aCGH values of several markers in the other genomic location. If so, it was removed (see Supplementary Table 2 for lists of removed markers, together with their putative correct chromosomal arm); otherwise it was left in the analysis. A potentially inaccurate location was identified for 17 to 144 markers per dataset, which constitute 0.7-3.5% of the markers (see Table 3-1).
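The flagging step described above can be illustrated with a short Python sketch. It is not the code used for the thesis: the function names, the way the 20% threshold is read off the sorted correlations, and the toy data are assumptions made for illustration, and the markers are assumed to lie on a single chromosomal arm in genomic order, with no missing values. The reassignment procedure of Supplementary Note 2 is not reproduced.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def flag_suspect_markers(data, fraction=0.2):
    """data[m] holds the log2 ratios of marker m across samples, markers in genomic order.

    Returns (T, suspects): T is chosen so that roughly `fraction` of the
    adjacent-marker correlations are below it, and suspects lists the markers
    whose correlation with BOTH neighbors is below T.
    """
    adjacent = [pearson(data[m], data[m + 1]) for m in range(len(data) - 1)]
    ranked = sorted(adjacent)
    T = ranked[int(fraction * len(ranked))]  # assumed percentile convention
    suspects = [
        m
        for m in range(1, len(data) - 1)  # end markers have only one neighbor
        if adjacent[m - 1] < T and adjacent[m] < T
    ]
    return T, suspects


# Toy example (hypothetical values): ten markers follow the same profile across
# five samples, and marker 5 carries an unrelated profile.
base = [0.1, 0.5, -0.3, 0.8, -0.6]
data = [[v * (1 + 0.1 * m) for v in base] for m in range(11)]
data[5] = [0.5, -0.2, 0.4, -0.1, 0.3]
T, suspects = flag_suspect_markers(data)
```

In this toy example only marker 5 is flagged, because it is the only marker whose correlations with both neighbors fall below T.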
We noticed for GSE8634 that many aberrations were highly correlated with each other and with gender. Some of the samples were probably hybridized to control samples of the opposite sex. The 28 markers whose two-sided t-test p-value between the genders passed an FDR of 1% were therefore removed, and the analysis was repeated. We assume that those markers are actually located on the sex chromosomes, but as no data are included for markers on the sex chromosomes, we used the gender annotation.

3.6.4 FDR Whenever many comparisons are done in parallel, the issue of adjusting the p-value must be dealt with. Here, we performed the Benjamini-Hochberg version of FDR

[19]. This procedure was applied on permutation p-values in Reiner et al. [146] and was shown there to control the FDR, based on simulated data. The FDR procedure was applied on the p-values of all markers from all chromosomes. The designated rate of false discoveries q will naturally affect the number of markers identified as significant, and should be set according to the dataset and the resources allocated to check significant aberrations. When q is set higher, the list of markers that are found significant is longer, but the percentage of false positives also increases. In the analysis presented in this article, FDR of 5% was used. The output of the algorithm is a table with ten columns, one for each value of K (K=1:10), and a row for each marker. FDR was performed on the whole table, and only rows in which at least one column was found significant were defined as aberrations.
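As an illustration of how the FDR step might be applied to the table of p-values, the following Python sketch implements the standard Benjamini-Hochberg step-up procedure over all entries of a markers-by-K table and keeps the markers with at least one significant column. The function names and the toy p-values are hypothetical, and the sketch is not the implementation used in this work.

```python
def benjamini_hochberg(p_values, q):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected at FDR level q.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Largest rank k (1-based) whose p-value satisfies p_(k) <= k * q / n.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            cutoff_rank = rank
    rejected = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


def significant_markers(p_table, q=0.05):
    """p_table[m][k] is the p-value of marker m for the k-th value of K.

    FDR is applied to the whole table at once; a marker is reported
    if at least one of its columns is significant.
    """
    n_cols = len(p_table[0])
    flat = [p for row in p_table for p in row]
    flags = benjamini_hochberg(flat, q)
    return [
        any(flags[m * n_cols:(m + 1) * n_cols])
        for m in range(len(p_table))
    ]


# Toy table (hypothetical values): three markers, two values of K.
example = significant_markers([[0.001, 0.2], [0.03, 0.04], [0.5, 0.9]], q=0.05)
```

With six p-values and q = 0.05, only 0.001 is below its step-up threshold (0.05/6), so only the first marker is reported in this example.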

Supplementary data is on the attached CD.

Figures' legends

Figure 3-1 - Calculation of the "volume" statistic for chromosomal arm 2p amplifications in GSE7230 (Neuroblastoma)

(A) The height matrix H (raw data) of 2p, where each element (m, s) on 2p is the log2 ratio of aCGH marker m in sample s. Each row corresponds to a marker, and each column corresponds to a sample. Values are truncated to [-1,1]. White represents no reading. (B) The amplifications matrix A, where each element (m, s) on chromosome 2p that is amplified in sample s is marked by 1 (red), otherwise 0 (green). (C) The length matrix L of 2p, where each element (m, s) on chromosome 2p for which A(m, s) = 1 is replaced by the length of the sequence of 1s to which it belongs in sample s. Maximal represented length is K=5. Non amplified markers are white. (D) X, the matrix created by multiplying the corresponding elements of H, A and L. Non amplified markers are white. (E) Averaging the rows of X gives the volume statistic. The red line is the value of the volume statistic above which a marker is significantly amplified (corresponding to FDR of 0.05). (F) The markers of the only region on chromosomal arm 2p that passes this threshold: the MYCN region. Values are truncated to [-1,1].
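The calculation in panels A to E can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation behind the figure: matrices are plain lists of rows (markers) by columns (samples), there are no missing readings, the status matrix A is given (for example from GLAD), and the function names and toy values are hypothetical.

```python
def run_lengths(column, k_max):
    """For a 0/1 list along the markers of one sample, return for every position
    the length of the run of 1s it belongs to, capped at k_max (0 outside runs)."""
    n = len(column)
    lengths = [0] * n
    i = 0
    while i < n:
        if column[i] == 1:
            j = i
            while j < n and column[j] == 1:
                j += 1
            capped = min(j - i, k_max)
            for t in range(i, j):
                lengths[t] = capped
            i = j
        else:
            i += 1
    return lengths


def volume_statistic(H, A, k_max):
    """H: log2 ratios, A: 0/1 amplification calls, both markers x samples.

    Builds the length matrix L sample by sample, forms X = H * A * L element-wise,
    and averages each row of X over the samples.
    """
    n_markers = len(H)
    n_samples = len(H[0])
    L = [[0] * n_samples for _ in range(n_markers)]
    for s in range(n_samples):
        column = [A[m][s] for m in range(n_markers)]
        for m, length in enumerate(run_lengths(column, k_max)):
            L[m][s] = length
    return [
        sum(H[m][s] * A[m][s] * L[m][s] for s in range(n_samples)) / n_samples
        for m in range(n_markers)
    ]


# Toy arm (hypothetical values): 4 markers, 3 samples; markers 1 and 2 are
# amplified together in the first two samples.
H = [[0.1, 0.0, 0.2], [0.8, 0.9, 0.1], [0.7, 1.0, 0.0], [0.0, 0.1, 0.0]]
A = [[0, 0, 0], [1, 1, 0], [1, 1, 0], [0, 0, 0]]
volumes_k5 = volume_statistic(H, A, k_max=5)
volumes_k1 = volume_statistic(H, A, k_max=1)
```

With K = 1 the length term is always 1 for amplified markers, so the statistic reduces to the mean amplitude over carriers; larger K gives more weight to markers inside longer amplified runs.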

Figure 3-2 - Chromosomal status and aberrations in Medulloblastoma

(A) Chromosomal status of dataset GSE8634. Each row corresponds to a chromosomal arm. Values are color coded according to the mean log2 ratio of the markers on each chromosomal arm. (B) Discussed aberrations in Medulloblastoma dataset GSE8634. Each column corresponds to a sample. Each row corresponds to an aberration discussed in the text, and the label indicates the gene associated with it. Values are color coded according to the mean log2 ratio of the markers on each aberration. In both subfigures, samples are manually ordered according to known and new clinicogenetic subgroups, as the bar below shows, and values are truncated to the range [-1, 1], rising from blue to red.

Figure 3-3 - Chromosomal status and aberrations common to both Neuroblastoma datasets

Chromosomal status of datasets GSE5784 (A) and GSE7230 (B), and the aberrations common to both Neuroblastoma datasets, shown for the patients of GSE5784 (C) and GSE7230 (D). Each column corresponds to a sample. Samples are manually ordered according to known and new clinicogenetic subgroups, as the bar below shows. In A and B, each row corresponds to a chromosomal arm. Values are color coded according to the mean log2 ratio of the markers on each chromosomal arm. In C and D, each row corresponds to a common aberration, and the label indicates the chromosome on which the aberration resides. Values are color coded according to the mean log2 ratio of the markers on each aberration. In all subfigures, values are truncated to the range [-1, 1], rising from blue to red.

4 Elucidating the transcriptional effect of monosomy and trisomy

4.1 Aim We checked whether changing the number of copies of a given chromosome in a given context (tissue, cancer type) has a characteristic signature in expression and, if so, which factors affect the size of the change in expression for each gene.

4.2 Introduction

4.2.1 Aneuploidy in cancer Normal human cells contain 46 chromosomes. However, most cancers contain cells that possess an abnormal number of chromosomes and also differ from each other in the number of chromosomes they contain. Furthermore, these chromosomes commonly have structural aberrations that are vanishingly rare in normal cells: inversions, deletions, duplications, and translocations. These numerical and structural abnormalities define aneuploidy [29].

4.2.2 Meiotic aneuploidy At least 5% of all clinically recognized pregnancies are trisomic or monosomic. Most of these terminate before birth, making aneuploidy the leading known cause of miscarriage. Only trisomies of the autosomal chromosomes 21 (Down's Syndrome), 18 and 13 are compatible with live birth, and of these only Down's Syndrome patients survive more than a year. Thus, aneuploidy is the leading cause of congenital birth defects and mental retardation [147, 148]. Full autosomal trisomies, resulting from meiotic errors, are different from cancer related aneuploidy, in that they (1) have a homogeneous karyotype in all the cells of the body and (2) represent a steady state. In cancer, every aneuploid cell competes with the surrounding cells, which carry other karyotypes. Thus, the genomic profile is just a snapshot of a single cell (when FISH or SKY are used), or the result of averaging a mixture of karyotypes of many cells from many subclones of the tumor (for aCGH or SNP chips). This mixture may change in time, as the fitter karyotypes take over, or as stress (e.g. chemotherapy) changes the environment. Out of the 289 human autosomal chromosomal bands, eight are not involved in any duplication, and 31 are not involved in any deletion. This may indicate possible

triplolethality and haplolethality of those genomic regions’ malformations [149]. To the best of our knowledge, there was no consistent test of those regions in cancer.

4.2.3 Mechanisms causing aneuploidy Aneuploidy can arise in various ways during cell division, including via (1) improper attachment of chromosomes to the mitotic spindle, (2) failed cytokinesis, and (3) abnormal numbers of mitotic spindle poles [150 and references therein]. Viruses have been shown to promote aneuploidy via these mechanisms [150 and references therein]. Aneuploid cells can also result from genetically unstable polyploid cells (cells having more than two full sets of homologous chromosomes) that act as intermediates [151].

4.2.4 Aneuploidy and gene expression A gain of a chromosome is expected to cause up-regulation of the genes residing on that chromosome as a primary effect. Some of the genes, e.g. transcription factors, will affect the transcription of genes on other chromosomes, resulting in a secondary effect that may appear as down-regulation or up-regulation of target genes. Symmetrically, a loss of a chromosome is expected to cause down-regulation of the genes residing on that chromosome as a primary effect, and the secondary effect is similar to gain of chromosome. The literature discussing the effect of DNA copy number changes on gene expression is reviewed in Chapter 1.6.

4.2.5 Factors that may affect the response of gene expression to copy number An extensive literature search suggested several factors that may affect the change in gene expression in response to a change in DNA copy number. This section describes those factors, and the logic of their potential effect and its direction.

4.2.5.1 Disomy expression level We hypothesized that genes that are silenced in the disomic chromosome are likely to remain silent when the chromosome is duplicated or deleted.

4.2.5.2 Monoallelic genes Most sequences in the genome are expressed from both alleles. A small class of genes is transcribed preferentially from a single allele in each cell, in a process termed monoallelic expression. There are three types of monoallelic expression: (1) allelic exclusion, the expression of a single antigen receptor on the mature B and T cell surface, (2) X-chromosome inactivation in female cells, and (3) parental imprinting [152]. In all three systems, the inequality of the two alleles seems to be achieved mainly by differential DNA methylation, asynchronous DNA replication, differential chromatin modifications, unequal nuclear localization, and non-coding RNA [152]. Genomic imprinting is a form of non-Mendelian inheritance in animals, restricted to mammals, where the imprinted genes are expressed uniquely from one allele. The expressed allele, either paternal or maternal, is constant for each imprinted gene, unless a genetic or epigenetic alteration has occurred [153]. The earliest and most abundantly observed alteration in human cancers is Loss Of Imprinting (LOI), due to epigenetic mis-regulation. LOI refers to loss of monoallelic gene regulation, normally conferred by parent-of-origin-specific DNA methylation of regulatory regions. LOI can include activation of the normally silent copy of a growth-promoting gene, or silencing of the normally active copy of a growth-inhibitory gene [153]. Monoallelic expression achieved by (at least) differential DNA methylation and differential chromatin modifications should be conserved in chromosome gain and loss. We assume that the effect of copy number on a monoallelic gene depends on which allele's copy number is changed. If the copy number of the silenced allele changes, the expression level will not change. If the copy number of the expressed allele changes, the expression level will change substantially.
In case of deletion, the gene will be silenced, and in case of duplication, the gene expression level will double. Thus, we would expect to see a bimodal distribution of the change in expression in all samples, where one peak is around no change, and the other peak is of changed samples.
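The expectation described above can be made concrete with a small Python sketch. It assumes, for illustration only, that expression is proportional to the number of transcriptionally active copies; the text states the monoallelic cases, whereas the proportional response of biallelic genes is an added simplifying assumption, and the function and scenario names are hypothetical.

```python
def expected_fold_change(active_before, active_after):
    """Expected expression ratio (changed / disomic), assuming expression is
    proportional to the number of transcriptionally active copies."""
    if active_before <= 0:
        raise ValueError("the disomic state must have at least one active copy")
    return active_after / active_before


# (scenario, active copies in disomy, active copies after the change)
scenarios = [
    ("biallelic gene, trisomy", 2, 3),
    ("biallelic gene, monosomy", 2, 1),
    ("monoallelic gene, expressed allele duplicated", 1, 2),
    ("monoallelic gene, silenced allele duplicated", 1, 1),
    ("monoallelic gene, expressed allele deleted", 1, 0),
    ("monoallelic gene, silenced allele deleted", 1, 1),
]
fold_changes = {
    name: expected_fold_change(before, after)
    for name, before, after in scenarios
}
```

Across samples, a monoallelic gene would then fall into one of two modes, depending on which allele was gained or lost: no change (ratio 1), or a strong change (ratio 2 for duplication, 0 for deletion).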

4.2.5.3 Haploinsufficiency Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene and that single copy does not produce enough gene product to bring about a wild-type condition, leading to an abnormal or diseased state. It is likely that haploinsufficient genes’ expression will be affected by copy number.

4.2.5.4 Duplicability Genome sequencing includes cloning of fragments of DNA into E. coli. A small fraction of the organism's genome fails to clone in E. coli, resulting in sequence gaps. A comparison of the locations of gaps in the clone coverage of genome projects of several organisms revealed that homologous locations in various bacteria fall within gaps. These gaps represent genes that cannot be retained by E. coli and constitute lethal transfers. A possible explanation for these barriers to horizontal gene transfer is toxicity of the bacterial gene to E. coli. However, as some genes of E. coli itself cannot be cloned, that is not the only explanation. The un-clonable genes are enriched for complex-forming genes and for genes that are universally found as a single copy (never duplicated in any sequenced bacteria), termed singletons [154]. Nearly all universally single-copy genes that could not be transferred from five or more genomes encode ribosomal proteins [154]. Along the same lines, an analysis of E. coli phylogenetic data has shown that gene duplicability and transferability are significantly correlated [155]. Low duplicability and transferability of a gene are likely to stem from a toxic effect of an increased dosage of its product in the cell, due either to an imbalance in the stoichiometry of protein complex constituents, or to other disruptions of the homeostasis of the cell. Thus, we would expect the expression of singletons to be affected by a change in their copy number.

4.2.5.5 Auto-regulatory loops There are transcription factors that were shown to auto-regulate their transcript levels. If it is a positive auto-regulation, we would expect those factors to be affected to a great level by change in copy number. However, if it is a negative auto-regulation, we would expect those transcription factors to be unaffected by change in their copy number. We assembled a list of auto-regulatory transcription factors (Table 4-1). It should be noted that in many cases, auto-regulation is based on binding to promoters, with no experimental validation. Even if validated, auto-regulation in a certain context and organism does not necessarily imply auto-regulation in another context. Still, the idea of detecting positive and negative auto-regulation from the effect of copy number on expression is tempting, and thus we tested its applicability.

In addition to known auto-regulatory genes, we checked if in general, transcription factors are affected differently than the other genes.

4.2.5.6 Normal tissues expression We hypothesized that some genes may have very tight regulation on the transcript level, where others will have less strict regulation. Genes with tight regulation are less likely to be affected by change in copy number. We looked for a dataset that is general enough, and chose Su et al. [156], who measured the expression level of all genes in many normal human tissues. We extracted two measures from this dataset – the number of tissues a gene is expressed in, and its standard deviation.

4.2.5.7 GO annotation Genes of certain functional classes may be under different types of regulation, resulting in different effect of DNA copy number change on gene expression. Thus, we tested for different effect on different GO annotations [157].

4.2.5.8 Syndrome related

Genes are related to various syndromes because a mutation, translocation, deletion or amplification of the gene affects the phenotype. For genes whose syndrome-inducing change is a deletion or an amplification, we would expect expression to be affected by copy number changes, and thus syndrome related genes are more likely to be affected by copy number.

4.2.5.9 House keeping genes

House keeping genes are expressed in all tissues and cells to maintain basic cellular functions. Their expression level is often considered constant, with very few fluctuations. Thus, many laboratories use house keeping genes as internal controls, assuming that their expression does not change. If their expression does not change, it means that it is under no regulation, and thus we would expect it to change with copy number. However, numerous studies have shown that the expression of house keeping genes varies in certain situations [158, 159].

4.3 Methods

4.3.1 Data

This analysis requires both the copy number of a chromosome and the gene expression. We used the GBM dataset described in Chapter 2, for which copy number information can be extracted from the aCGH data. In addition, we used public expression datasets that are annotated for the copy number of a given chromosome: GSE6477, Multiple Myeloma dataset with monosomy 13 [160]; GSE9762, Down’s Syndrome fibroblasts; GSE5390, Down’s Syndrome brain [161]; GSE1789, Down’s Syndrome heart [162]; GSE1397, Down’s Syndrome fetal cerebrum, cerebellum, heart and astrocytes, and trisomy 13 cerebrum [163]. See Table 4-2 for details of the datasets. In datasets that were normalized by MAS5, values lower than 1 were replaced by 1, and the data were log2 transformed. Datasets that were RMA normalized were used as is. Probesets that are located on the tested chromosome and have a gene symbol according to the Affymetrix annotation (na24) were selected. Each gene symbol was represented by the probeset with the highest mean expression across all samples.
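The preprocessing of MAS5-normalized data and the choice of one probeset per gene symbol can be sketched as follows. This is an illustration under stated assumptions, not the original code: the dictionary-based data layout and the function names are hypothetical, and the mean is assumed to be taken over the values passed in (the text does not say whether it was computed before or after the log2 transform).

```python
import math
from statistics import mean

def log2_floor(values, floor=1.0):
    """Replace values below the floor by the floor, then take log2."""
    return [math.log2(max(v, floor)) for v in values]

def best_probeset_per_gene(expr, symbol_of):
    """Pick, for each gene symbol, the probeset with the highest mean expression.

    expr: dict probeset id -> list of expression values (one per sample).
    symbol_of: dict probeset id -> gene symbol (missing or empty = unannotated).
    """
    best = {}  # gene symbol -> (probeset id, mean expression)
    for probeset, values in expr.items():
        symbol = symbol_of.get(probeset)
        if not symbol:
            continue  # probesets without a gene symbol are dropped
        m = mean(values)
        if symbol not in best or m > best[symbol][1]:
            best[symbol] = (probeset, m)
    return {symbol: probeset for symbol, (probeset, _) in best.items()}

# Toy example (made-up probeset ids and values).
raw = {"ps_a": [0.5, 4.0], "ps_b": [8.0, 16.0], "ps_c": [2.0, 2.0]}
logged = {p: log2_floor(v) for p, v in raw.items()}
print(best_probeset_per_gene(logged, {"ps_a": "G1", "ps_b": "G1", "ps_c": "G2"}))
```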

4.3.2 Genes’ information sources

Gene family size was estimated by the number of Ensembl paralogs [164] or the number of HomoloGene paralogs [165], whichever was larger. Paralogs were retrieved from GeneALaCart (www..org, Version 2.38.1, 25 May, 2008). Values were assigned based on gene symbol.

Genes were classified as disease related if either the OMIM disorder field [166] or the AKS disorder field (TimeLogic, Inc., Carlsbad, CA, USA), as retrieved from GeneALaCart (www.genecards.org, Version 2.38.1, 25 May, 2008), was not empty. Values were assigned based on gene symbol.

We downloaded the knownGene table from the UCSC genome browser, March 2006. Strand information is knownGene.strand. Transcript length was defined as knownGene.txEnd - knownGene.txStart. Coding region length was defined as knownGene.cdsEnd - knownGene.cdsStart. Length of the 5’ UTR was defined as knownGene.cdsStart - knownGene.txStart, and of the 3’ UTR as knownGene.txEnd - knownGene.cdsEnd. Total exon length was calculated as the sum of knownGene.exonEnds - knownGene.exonStarts over all exons. Total intron length was calculated as the difference between the coding region length and the total exon length. The number of exons is the number of entries in knownGene.exonStarts. The kgXref table was used to map this information to gene symbols. Values were assigned based on gene symbol.

Normal tissues expression was taken from Su et al. [156]. The number of tissues in which a gene is expressed was based on the MAS5 detection calls. The standard deviation of expression was calculated across all tissues. This dataset was measured on U133A, so each probeset received its own value.

GO annotations were downloaded from the Affymetrix site (na24). The list of house keeping genes [167] was downloaded from http://www.biomedcentral.com/content/supplementary/1471-2164-9-172-s2.pdf. Values were assigned based on gene symbol. Lists of human imprinted genes were downloaded from http://igc.otago.ac.nz/ [168] and http://www.geneimprint.com/site/genes-by-species. Fifty genes were marked as imprinted in human. A list of haploinsufficient genes was taken from [169]. CpG island locations were taken from the UCSC Genome browser, and were tested for presence within the 1000 bp upstream of the transcription start site.
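The length definitions given for the knownGene table translate into a few subtractions. The sketch below follows the definitions exactly as stated in the text; the function name, argument layout and returned keys are hypothetical. Note that UCSC coordinates are genomic, so the 5’ and 3’ UTR formulas as written correspond to genes on the plus strand.

```python
def gene_structure(tx_start, tx_end, cds_start, cds_end, exon_starts, exon_ends):
    """Compute the structural characteristics of one knownGene record.

    All coordinates are genomic positions as stored in knownGene;
    exon_starts and exon_ends are parallel lists, one entry per exon.
    """
    exon_total = sum(end - start for start, end in zip(exon_starts, exon_ends))
    coding_length = cds_end - cds_start
    return {
        "transcript_length": tx_end - tx_start,
        "coding_length": coding_length,
        "utr5_length": cds_start - tx_start,  # as defined in the text
        "utr3_length": tx_end - cds_end,      # as defined in the text
        "exon_total": exon_total,
        # Follows the definition in the text: coding region length minus exon length.
        "intron_total": coding_length - exon_total,
        "exon_count": len(exon_starts),
    }

# Toy record with two exons (made-up coordinates).
print(gene_structure(100, 1000, 200, 900, [100, 600], [300, 1000]))
```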

4.3.3 Stepwise linear regression

The multiple linear regression models were built with the Matlab stepwisefit function, with parameter values penter = 0.05 and premove = 0.10. We tried to model the change in gene expression using various characteristics of the genes. For many of the characteristics considered, values were not known for many of the genes. Thus, we did not incorporate into the model characteristics that had values for less than 30% of the genes. We also did not include categorical variables that were highly unbalanced (less than 10% of the genes had one of the values).

Our response variable is the change in expression, which can be represented by several measures. We first considered the fold change between the median expression level in the trisomy or monosomy and the median expression level in the disomy. However, this became problematic once we noticed that the median expression level in the disomy, which itself enters the fold change, is a strong predictor. Thus, we turned to the t-test, the relevant statistical test for comparing the distributions of expression in the trisomy or monosomy and in the disomy. We could have used the p-value of the t-test, but p-values that are very close to zero may correspond to very different levels of expression change. Thus, we chose to use the t-statistic as the response variable. Though not perfect, it represents the quantity that we try to estimate. Similar results were found when using the fold change or the p-value as the response variable. We built a model for each chromosome separately.

4.4 Results

4.4.1 Whole chromosome gain and loss are evident in expression

We used the GBM dataset described above to characterize the expression signatures of trisomy and monosomy. We selected chromosomes 13 and 19, which show whole chromosome loss and gain, respectively, with no frequent local aberrations. We did not choose chromosomes 7 and 10, as they are affected in almost all samples, either as a whole chromosome or locally (EGFR amplification on chromosome 7 and PTEN deletion on chromosome 10). Array CGH data were used to identify tumors with gain or loss of chromosomes. Only tumors for which the chromosome status was clear were included. Figure 4-1A displays the expression matrix for chromosome 19. Out of 799 genes on chromosome 19, 334 genes (42%) pass a t-test at an FDR of 10%, and are differentially expressed between tumors that have a gain of chromosome 19 and tumors that do not. Of these 334 genes, 295 are up-regulated in the tumors with gain of chromosome 19. Figure 4-1B displays the expression matrix for chromosome 13. Out of 275 genes on chromosome 13, 159 genes (58%) pass a t-test at an FDR of 10%, and are differentially expressed between tumors that have a loss of chromosome 13 and tumors that do not. Of these 159 genes, 157 are down-regulated in the tumors with loss of chromosome 13. For the corresponding numbers in all other datasets, see Table 4-2.
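The text does not state which FDR procedure was applied to the t-test p-values. A common choice is the Benjamini-Hochberg step-up procedure [19], sketched below as an assumption rather than a description of the original analysis; the function name is hypothetical and the per-gene p-values are taken as given.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return a list of booleans, True where a gene passes at the given FDR.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is at most k * fdr / m, and accept the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * fdr / m:
            k_max = rank
    passed = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            passed[i] = True
    return passed

# Toy example with five genes (made-up p-values).
flags = benjamini_hochberg([0.001, 0.5, 0.03, 0.04, 0.9], fdr=0.10)
print(sum(flags), "of", len(flags), "genes pass")
```

The fraction of passing genes on a chromosome is then the count of True values divided by the number of genes tested.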

4.4.2 The effect of trisomy on expression is similar in different patients

First, we checked that the transcriptional effect of the trisomy is indeed conserved between different individuals; that is, that in two GBM tumors from two different patients with trisomy 19, the same genes on chromosome 19 are up-regulated, and by the same factor. To show this, Figure 4-2 presents scatter plots, one for each sample with trisomy 19. Each point represents a gene on chromosome 19. The vertical coordinate is the log2 ratio of the gene’s expression level (relative to diploid samples), and the horizontal coordinate is the median of the same ratio over all the trisomic samples. The correlation is high, indicating that the signature of chromosome 19 trisomy is well defined. Similar results were obtained for the other trisomies we tested, in cancer and in Down’s Syndrome.

4.4.3 The effect of monosomy on expression is different in different patients

Figure 4-3 shows scatter plots, one for each sample with monosomy 13. Each point represents a gene on chromosome 13. The vertical coordinate is the log2 ratio of the gene’s expression level (relative to diploid samples), and the horizontal coordinate is the median of the same ratio over all the monosomic samples. There is a correlation, but it is much lower than for the trisomy. The same holds for the other monosomies we tested in cancer.
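The per-sample comparison behind these scatter plots can be sketched as follows. This is a minimal illustration under assumptions: the diploid baseline of each gene is taken to be its median over the diploid samples (the text says only "relative to diploid samples"), expression values are already on a log2 scale, and the function names and data layout are hypothetical.

```python
from math import sqrt
from statistics import mean, median

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def signature_conservation(aneuploid, diploid):
    """Correlate each aneuploid sample with the median signature.

    aneuploid, diploid: dict gene -> list of log2 expression values,
    one value per sample. Returns one correlation per aneuploid sample.
    """
    genes = sorted(aneuploid)
    # log2 ratio of every aneuploid sample relative to the diploid median
    ratios = {g: [v - median(diploid[g]) for v in aneuploid[g]] for g in genes}
    # median log2 ratio of each gene over all aneuploid samples
    median_ratio = [median(ratios[g]) for g in genes]
    n_samples = len(ratios[genes[0]])
    return [pearson([ratios[g][j] for g in genes], median_ratio)
            for j in range(n_samples)]

# Toy example: three genes, two diploid and three aneuploid samples (made-up values).
diploid = {"g1": [5, 5], "g2": [3, 3], "g3": [4, 4]}
aneuploid = {"g1": [6, 6, 7], "g2": [3, 3, 4], "g3": [6, 6, 4]}
print(signature_conservation(aneuploid, diploid))
```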

4.4.4 Understanding trisomy and monosomy expression signatures

After establishing the signature of a trisomy in gene expression, we tried to find factors that would explain the variability of the change in gene expression in response to gain or loss of a copy of the chromosome. We tested many factors that may correlate with this response. For binary factors, we used a t-test to check whether the factor affects the expression change. For numeric factors, we used the Pearson correlation. For some of the factors we had values for most genes, and for others for only a few genes. We also constructed a multiple regression model using only factors that have values for most genes. Of the binary variables, only those whose less frequent value occurred in more than 10% of the genes were used. The result for each factor appears in Table 4-3, and the results for the multiple regression models in Table 4-4. In general, a multiple regression model was able to explain 30-40% of the variation. Median expression level in disomy and the number of tissues a gene is detected in were included in 7 of the 12 models. The standard deviation of a gene in normal tissues was included in 6 of the 12 models; in 3 of them the number of tissues a gene is detected in was also included, indicating that the two factors carry non-redundant information. All other factors did not explain more than 10% of the variability. The results for each factor are described below.
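The variable filters and the test used for binary factors can be sketched as follows. The thresholds (30% coverage, 10% minority frequency) are those given in the text; everything else is an assumption made for illustration: the function names, the use of None for unknown values, and the pooled-variance (Student) form of the two-sample t statistic, since the text does not say which variant of the t-test was used.

```python
from math import sqrt
from statistics import mean, variance

def has_enough_values(values, min_fraction=0.30):
    """Keep a characteristic only if it is known for enough genes (None = unknown)."""
    known = sum(v is not None for v in values)
    return known / len(values) >= min_fraction

def is_balanced(flags, min_minority=0.10):
    """Keep a binary variable only if its rarer value is frequent enough."""
    fraction_true = sum(1 for f in flags if f) / len(flags)
    return min(fraction_true, 1 - fraction_true) > min_minority

def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))

# Toy example: expression change of genes with and without a binary factor.
with_factor = [2.0, 4.0, 6.0]
without_factor = [1.0, 2.0, 3.0]
print(t_statistic(with_factor, without_factor))
```

The same t_statistic function, applied per gene to the aneuploid and disomic samples, gives the response variable described in the Methods.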

4.4.4.1 Disomy expression level

In all but one of the datasets, the disomy expression level was correlated with the change in expression. The correlation is positive for trisomy and negative for monosomy. To check whether the correlation results from silenced genes that remain silent or holds for all expression levels, the genes located on the aberrant chromosome were divided into groups according to their level of expression in tumors that have two copies of that chromosome. For each group, the distribution of the log2 ratio of the trisomy to the disomy is shown (Figure 4-4A). The higher the basal level of expression, defined as the median expression level in samples with two copies, the larger the change in expression that results from an additional copy of the chromosome. Notably, even though the monosomy signature is less clear, the same is true for monosomy: the level of down-regulation is correlated with the basal expression level (Figure 4-4B).
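The grouping by basal expression level can be sketched as follows. The figure legend states that the genes were split into five groups; whether the groups were of equal size or of equal expression range is not stated, so equal-size groups are an assumption here, as are the function names and the data layout.

```python
from statistics import median

def split_into_groups(basal, n_groups=5):
    """Split genes into equal-size groups ordered from low to high basal expression.

    basal: dict gene -> median expression level in samples with two copies.
    """
    ranked = sorted(basal, key=basal.get)
    size = len(ranked)
    return [ranked[i * size // n_groups:(i + 1) * size // n_groups]
            for i in range(n_groups)]

def group_medians(groups, log_ratio):
    """Median log2 ratio (aneuploid to disomic) of the genes in each group."""
    return [median(log_ratio[g] for g in genes) for genes in groups]

# Toy example: ten genes whose change grows with basal expression (made-up values).
basal = {f"g{i}": i for i in range(10)}
log_ratio = {f"g{i}": i for i in range(10)}
groups = split_into_groups(basal)
print(group_medians(groups, log_ratio))
```

An increasing sequence of group medians for a trisomy, or a decreasing one for a monosomy, corresponds to the trend described in the text.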

4.4.4.2 Monoallelic genes, Haploinsufficiency, autoregulation

In the datasets whose aberrant chromosome carried five or more monoallelic, haploinsufficient or autoregulatory genes, the change in expression of those genes was not significantly different from that of other genes. It should be noted that the numbers of these genes are very small and the data are probably incomplete.

4.4.4.3 Duplicability

Family size was correlated with the change in expression. The correlation is negative for trisomy and positive for monosomy, meaning that the bigger the family, the lower the change in expression. Alternatively, this may be because singletons are more likely to be affected by the change, which is why they are resistant to duplication. To distinguish between these possibilities, we checked whether family size is more informative as a binary variable or as a numerical variable. Neither form was consistently better: the binary form was more informative in some datasets and less informative in others.

4.4.4.4 Genes physiology

The physiological characteristics of the gene (transcript length, coding region length, length of the 5’ and 3’ untranslated regions, total exon length, total intron length, exon count and strand) were not correlated with the change in expression.

4.4.4.5 Expression in normal tissues

The number of normal tissues in which a gene was detected was significantly correlated with the change in expression in 10 of the 12 aberrant chromosomes tested. The correlation was positive in trisomy and negative in monosomy. That means that genes that are expressed in more tissues are more likely to be affected by the change in their copy number.

The standard deviation of a gene in normal tissues is negatively correlated with the number of normal tissues in which the gene was detected (Pearson correlation coefficient = -0.3048, p = 0), but may still carry additional information. It was significantly correlated with the change in expression in 6 of the 12 aberrant chromosomes tested. As expected, the correlation was negative in trisomy and positive in monosomy. That means that genes with a higher standard deviation in normal tissues are less likely to be affected by the change in their copy number. A higher standard deviation is expected in tissue-specific genes, which have a low expression level in most tissues and a high expression level in one or a few tissues. This indicates a tighter regulation, in accordance with our hypothesis that genes with tight regulation are less likely to be affected by a change in copy number.

4.4.4.6 GO annotation

There was no clear effect of GO annotation. The exception was chromosome 19 in GBM, where genes annotated for transcription and related terms were significantly less affected by copy number than other genes.

4.4.4.7 House keeping genes

In two of the five datasets whose aberrant chromosome carried five or more house keeping genes, these genes were less affected by the change in their copy number than other genes.

4.4.4.8 Syndrome related genes

In five of the 12 datasets, syndrome related genes were more affected by the change in their copy number than other genes.

4.5 Discussion

Changes in chromosome number affect gene expression in many conditions, including colon cancer [44, 45], breast cancer [46, 47], prostate cancer [48], acute myeloid leukemia [49], acute lymphoblastic leukemia [50], Glioblastoma cell lines [50], Down’s Syndrome and chromosome 13 trisomy [51]. Hertzberg et al. [50] even showed that gene expression can be used to predict the chromosomal status of a tumor.

We confirm that trisomy causes up-regulation of the genes located on the extra chromosome, and that monosomy causes down-regulation of the genes located on the deleted chromosome. To the best of our knowledge, previous publications that studied the relationship between DNA copy number and expression data, including all the above publications, only checked the correlation between the level of change in DNA copy number and the change in expression, or separated the genes on the aberrant chromosomes into genes that are affected and genes that are not. We set out to answer the questions of what causes each gene to be affected, and to what level.

As a prerequisite, we needed to show that for a trisomy of a certain chromosome, the same genes are affected to the same level in all samples with that trisomy. The effect is not perfectly conserved, but there is some level of conservation: about 30% of the genes fall in the same decile of change in all samples. The level of conservation is in general lower in monosomies. Nevertheless, the conservation of the trisomy and monosomy signatures makes the effect a well-defined phenomenon, so determining the cause of the observed variation in the size of the induced change is valid and relevant.

We tested several factors for correlation with the change in expression in response to the change in copy number. The disomic expression level and the gene family size correlated most strongly with the level of change. Transcription factors and syndrome related genes change more. House keeping genes change less. Together, those factors explain around one third of the variance in the change. Thus, there are additional, as yet unknown, factors affecting this phenomenon. For several of the tested factors, e.g. autoregulated genes and monoallelic genes, there were very few genes on each chromosome, and thus those factors may be relevant even though they did not reach statistical significance.

Tumor karyotypes are very heterogeneous, and it may well be that the region of the tumor used for copy number profiling differs in karyotype from the region used for expression profiling. In addition, karyotype changes are often correlated (or anti-correlated), e.g. loss of chromosome 10 co-occurs with gain of chromosome 7 in glioblastoma. This makes it difficult to isolate the effect of the aberrant chromosome from other effects, as opposed to Down’s syndrome, where all cells have the same karyotype, with only one aberration. A possible solution to the heterogeneity problem is offered by experiments based on inserting chromosomes into cell lines, as was done in colon cancer cell lines [54] and mouse cell lines [170].

As we are currently unable to explain a large part of the variation in gene expression in general, it is not surprising that it is possible to explain only about 30% of the variation in the response of the expression to changes in DNA copy number.

Figures’ legends

Figure 4-1 - The effect of number of copies of a chromosome on the expression of genes on that chromosome. (A) Chromosome 19 in GBM. (B) Chromosome 13 in GBM. Each row is a probeset on the chromosome. Each column is a sample. The colored bar below the expression matrix indicates the number of copies, as seen in the aCGH data.

Figure 4-2 – The conservation of the effect an additional copy of chromosome 19 has on the expression of the genes located on chromosome 19. Each of the six samples with three copies of chromosome 19 is represented in a different subplot. The log2 ratio (relative to diploid) of the expression of each gene is plotted against the median of the log2 ratio of the expression of this gene in all samples with three copies. The red line is the main diagonal, representing a perfect conservation.

Figure 4-3 – The conservation of the effect that loss of a copy of chromosome 13 has on the expression of the genes located on chromosome 13. Each of the nine samples with one copy of chromosome 13 is represented in a different subplot. The log2 ratio (relative to diploid) of the expression of each gene is plotted against the median of the log2 ratio of the expression of this gene in all samples with one copy. The red line is the main diagonal, representing a perfect conservation.

Figure 4-4 – The effect of the disomic gene expression level on the change in expression following gain/loss of a chromosome. (A) Chromosome 19 genes were split into five groups, based on their expression level in samples with two copies of chromosome 19. The distribution of the log2 ratio of the trisomy to the disomy is shown for each group. (B) Chromosome 13 genes were split into five groups, based on their expression level in samples with two copies of chromosome 13. The distribution of the log2 ratio of the monosomy to the disomy is shown for each group. The thick red line marks zero, meaning no change.

5 Discussion

During my PhD I analyzed several types of high-throughput data, including gene expression microarrays, array CGH, SNP chips and ChIP-on-chip measurements, to which I applied a wide range of analysis techniques. These include supervised methods (statistical tests, False Discovery Rate, Cox regression, Kaplan-Meier analysis, etc.) as well as unsupervised methods such as clustering, bi-clustering and sorting.

The data I analyzed were acquired in the framework of long-term collaborations between our group and several biological and clinical groups, including the groups of Yoram Groner [56], Yosef Yarden [57, 58] and Benjamin Geiger [171] from Weizmann, and Giovanni Blandino from Regina Elena Cancer Institute, Rome, Italy [172].

In collaboration with Prof. M. Hegi's group at the CHUV, Lausanne, Switzerland, we studied the ability of gene-expression profiles [55] and aCGH to predict the survival of GBM patients. In addition, we mapped the alterations of the three main pathways affected in GBM (TP53, RB1 and EGFR) and showed the altered component in each sample. We detected several new components that may be responsible for altering pathways in samples with none of the known alterations, such as MYCN up-regulation in the RB1 pathway, and TNFRSF10B down-regulation in the TP53 pathway.

The GBM research motivated us to develop a new method to detect significant local DNA aberrations in aCGH data of multiple tumors. We applied this method to several types of brain tumors. We detected known and new aberrations, and characterized a new subgroup in Medulloblastoma [173]. In GBM, having matched expression data for the same samples allowed us to characterize the transcriptional effect of each aberration, both on the genes residing in the aberrant genomic region and genome-wide.

We also studied the effect of trisomy and monosomy on gene expression in GBM tumors, and in tumors in general. We found that the trisomy signature is more similar among trisomies in different tumors of the same type than the monosomy signature is among monosomies. We tested several factors that may affect the level of change in gene expression in response to a change in copy number. About one third of the variance in the level of change was explained by the disomy expression level, gene family size, transcription factor status, house keeping status and syndrome relatedness.

Working on these different projects gave me an understanding of the challenges in combined analysis of high-throughput biological data of several types. Methods for measuring DNA copy number, gene expression and protein levels have developed greatly since I started my PhD, but the methods for analyzing those data are lagging behind, especially when considering methods for analyzing two types of data in a way that extracts new information from the combination. I think that my research is a small step in that direction, utilizing DNA copy number and gene expression data in order to characterize the advantage of amplifying oncogenes and deleting tumor suppressor genes, as well as to provide some insights into transcriptional changes caused by monosomies and trisomies.

6 List of publications

1. Shay T, Reiner A, Lambiv WL, Hegi ME, Domany E: Combining chromosomal arm status and significantly aberrant genomic locations reveals new cancer subtypes. Submitted.
2. Amit I, Citri A, Shay T, Lu Y, Katz M, Zhang F, Tarcic G, Siwak D, Lahad J, Jacob-Hirsch J et al: A module of negative feedback regulators defines growth factor signaling. Nat Genet 2007, 39(4):503-512.
3. Katz M, Amit I, Citri A, Shay T, Carvalho S, Lavi S, Milanezi F, Lyass L, Amariglio N, Jacob-Hirsch J et al: A reciprocal tensin-3-cten switch mediates EGF-driven mammary cell migration. Nat Cell Biol 2007, 9(8):961-969.
4. Fainaru O, Shay T, Hantisteanu S, Goldenberg D, Domany E, Groner Y: TGFbeta-dependent gene expression profile during maturation of dendritic cells. Genes Immun 2007, 8(3):239-244.
5. Nadav L, Shay T, Naparstek E, Domany E, Katz B, Geiger B: Fibronectin-mediated adhesion regulates gene expression profiles in variant sub-populations of malignant plasma cells. Submitted.
6. Fontemaggi G, Dell’Orso S, Trisciuoglio D, Shay T, Sacchi A, Melucci E, Terrenato I, Mottolese M, Domany E, Del Bufalo D et al: Gain of function mutant p53 proteins promote neoangiogenesis through ID-4 transcriptional activation. In preparation.
7. Murat A, Migliavacca E, Gorlia T, Lambiv WL, Shay T, Hamou MF, de Tribolet N, Regli L, Wick W, Kouwenhoven MC et al: Stem cell-related "self-renewal" signature and high epidermal growth factor receptor expression associated with resistance to concomitant chemoradiotherapy in glioblastoma. J Clin Oncol 2008, 26(18):3015-3024.

7 References

1. Felsenfeld G, Groudine M: Controlling the double helix. Nature 2003, 421(6921):448-453. 2. Studitsky VM, Walter W, Kireeva M, Kashlev M, Felsenfeld G: Chromatin remodeling by RNA polymerases. Trends in Biochemical Sciences 2004, 29(3):127-135. 3. Detours V, Dumont JE, Bersini H, Maenhaut C: Integration and cross- validation of high-throughput gene expression data: comparing heterogeneous data sets. FEBS Letters 2003, 546(1):98-102. 4. Inazawa J, Inoue J, Imoto I: Comparative genomic hybridization (CGH)- arrays pave the way for identification of novel cancer-related genes. Cancer Sci 2004, 95(7):559-563. 5. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo W-L, Chen C, Zhai Y et al: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 1998, 20(2):207-211. 6. Solinas-Toldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Dohner H, Cremer T, Lichter P: Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer 1997, 20(4):399-407. 7. Pollack JR, Perou CM, Alizadeh AA, Eisen MB, Pergamenschikov A, Williams CF, Jeffrey SS, Botstein D, Brown PO: Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat Genet 1999, 23(1):41-46. 8. Barrett MT, Scheffer A, Ben-Dor A, Sampas N, Lipson D, Kincaid R, Tsang P, Curry B, Baird K, Meltzer PS et al: Comparative genomic hybridization using oligonucleotide microarrays and total genomic DNA. Proc Natl Acad Sci U S A 2004, 101(51):17765-17770. 9. Lucito R, Healy J, Alexander J, Reiner A, Esposito D, Chi M, Rodgers L, Brady A, Sebat J, Troge J et al: Representational Oligonucleotide Microarray Analysis: A High-Resolution Method to Detect Genome Copy Number Variation 10.1101/gr.1349003. Genome Res 2003, 13(10):2291-2305. 10. 
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M et al: Large-scale copy number polymorphism in the human genome. Science 2004, 305(5683):525-528. 11. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467-470. 12. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Norton H et al: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotech 1996, 14(13):1675-1680. 13. Hubbell E, Liu W-M, Mei R: Robust estimators for expression analysis. Bioinformatics 2002, 18(12):1585-1592.

14. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostat 2003, 4(2):249-264. 15. Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(1):31- 36. 16. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science 1995, 270(5235):484-487. 17. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008:gr.079558.079108. 18. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001, 98(9):5116-5121. 19. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological) 1995, 57(1):289-300. 20. Alter O, Brown PO, Botstein D: Singular value decomposition for genome- wide expression data processing and modeling. Proceedings of the National Academy of Sciences of the United States of America 2000, 97(18):10101- 10106. 21. Getz G, Levine E, Domany E, Zhang MQ: Super-paramagnetic clustering of yeast gene expression profiles. Physica A 2000, 279(1-4):457-464. 22. Blatt M, Wiseman S, Domany E: Superparamagnetic clustering of data. Physical Review Letters 1996, 76(18):3251-3254. 23. Belacel N, Wang Q, Cuperlovic-Culf M: Clustering Methods for Microarray Gene Expression Data. OMICS: A Journal of Integrative Biology 2006, 10(4):507-531. 24. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. 
Proceedings of the National Academy of Sciences of the United States of America 1999, 96(12):6745-6750. 25. Getz G, Levine E, Domany E: Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences of the United States of America 2000, 97(22):12079-12084. 26. Tsafrir D, Tsafrir I, Ein-Dor L, Zuk O, Notterman DA, Domany E: Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices. Bioinformatics 2005, 21(10):2301-2308. 27. Hanahan D, Weinberg RA: The Hallmarks of Cancer. Cell 2000, 100(1):57- 70. 28. Nowell PC: The clonal evolution of tumor cell populations. Science 1976, 194(4260):23-28. 29. Rajagopalan H, Lengauer C: Aneuploidy and cancer. Nature 2004, 432(7015):338-341. 30. Mitelman F, Mertens F, Johansson B: A breakpoint map of recurrent chromosomal rearrangements in human neoplasia. Nat Genet 1997, 15(4s):417-474.

31. Mertens F, Johansson B, Hoglund M, Mitelman F: Chromosomal Imbalance Maps of Malignant Solid Tumors: A Cytogenetic Survey of 3185 Neoplasms. Cancer Res 1997, 57(13):2765-2780.
32. Seeger R, Brodeur G, Sather H, Dalton A, Siegel S, Wong K, Hammond D: Association of multiple copies of the N-myc oncogene with rapid progression of neuroblastomas. N Engl J Med 1985, 313(18):1111-1116.
33. Kyomoto R, Kumazawa H, Toda Y, Sakaida N, Okamura A, Iwanaga M, Shintaku M, Yamashita T, Hiai H, Fukumoto M: Cyclin-D1-gene amplification is a more potent prognostic factor than its protein over-expression in human head-and-neck squamous-cell carcinoma. Int J Cancer 1997, 74(6):576-581.
34. Slamon DJ, Clark GM, Wong SG, Levin WJ, Ullrich A, McGuire WL: Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science 1987, 235(4785):177-182.
35. Ishiguro R, Fujii M, Yamashita T, Tashiro M, Tomita T, Ogawa K, Kameyama K: CCND1 amplification predicts sensitivity to chemotherapy and chemoradiotherapy in head and neck squamous cell carcinoma. Anticancer Res 2003, 23(6D):5213-5220.
36. Slamon DJ, Leyland-Jones B, Shak S, Fuchs H, Paton V, Bajamonde A, Fleming T, Eiermann W, Wolter J, Pegram M et al: Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. N Engl J Med 2001, 344(11):783-792.
37. Palmberg C, Koivisto P, Kakkola L, Tammela TL, Kallioniemi OP, Visakorpi T: Androgen receptor gene amplification at primary progression predicts response to combined androgen blockade as second line therapy for advanced prostate cancer. J Urol 2000, 164(6):1992-1995.
38. Vogelstein B, Kinzler KW: Cancer genes and the pathways they control. Nat Med 2004, 10(8):789-799.
39. Duesberg P, Li R, Fabarius A, Hehlmann R: Aneuploidy and cancer: from correlation to causation. Contrib Microbiol 2006, 13:16-44.
40. Marx J: Debate Surges Over the Origins of Genomic Defects in Cancer. Science 2002, 297(5581):544-546.
41. Zimonjic D, Brooks MW, Popescu N, Weinberg RA, Hahn WC: Derivation of Human Tumor Cells in Vitro without Widespread Genomic Instability. Cancer Res 2001, 61(24):8838-8844.
42. Lamlum H, Papadopoulou A, Ilyas M, Rowan A, Gillet C, Hanby A, Talbot I, Bodmer W, Tomlinson I: APC mutations are sufficient for the growth of early colorectal adenomas. Proc Natl Acad Sci U S A 2000, 97(5):2225-2228.
43. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A 2002, 99(20):12963-12968.
44. Platzer P, Upender MB, Wilson K, Willis J, Lutterbaugh J, Nosrati A, Willson JKV, Mack D, Ried T, Markowitz S: Silence of Chromosomal Amplifications in Colon Cancer. Cancer Res 2002, 62(4):1134-1138.
45. Tsafrir D, Bacolod M, Selvanayagam Z, Tsafrir I, Shia J, Zeng Z, Liu H, Krier C, Stengel RF, Barany F et al: Relationship of gene expression and chromosomal abnormalities in colorectal cancer. Cancer Res 2006, 66(4):2129-2137.
46. Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E, Ringner M, Sauter G, Monni O, Elkahloun A et al: Impact of DNA Amplification on Gene Expression Patterns in Breast Cancer. Cancer Res 2002, 62(21):6240-6245.
47. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A 2002, 99(20):12963-12968.
48. Phillips JL, Hayward SW, Wang Y, Vasselli J, Pavlovich C, Padilla-Nash H, Pezullo JR, Ghadimi BM, Grossfeld GD, Rivera A et al: The Consequences of Chromosomal Aneuploidy on Gene Expression Profiles in a Cell Line Model for Prostate Carcinogenesis. Cancer Res 2001, 61(22):8143-8149.
49. Virtaneva K, Wright FA, Tanner SM, Yuan B, Lemon WJ, Caligiuri MA, Bloomfield CD, de la Chapelle A, Krahe R: Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. Proc Natl Acad Sci U S A 2001, 98(3):1124-1129.
50. Hertzberg L, Betts DR, Raimondi SC, Schafer BW, Notterman DA, Domany E, Izraeli S: Prediction of chromosomal aneuploidy from gene expression data. Genes Chromosomes Cancer 2007, 46(1):75-86.
51. Gao C, Furge K, Koeman J, Dykema K, Su Y, Cutler ML, Werts A, Haak P, Vande Woude GF: Chromosome instability, chromosome transcriptome, and clonal evolution of tumor cell populations. Proc Natl Acad Sci U S A 2007, 104(21):8995-9000.
52. FitzPatrick DR, Ramsay J, McGill NI, Shade M, Carothers AD, Hastie ND: Transcriptome analysis of human autosomal trisomy. Hum Mol Genet 2002, 11(26):3249-3256.
53. Torres EM, Sokolsky T, Tucker CM, Chan LY, Boselli M, Dunham MJ, Amon A: Effects of aneuploidy on cellular physiology and cell division in haploid yeast. Science 2007, 317(5840):916-924.
54. Upender MB, Habermann JK, McShane LM, Korn EL, Barrett JC, Difilippantonio MJ, Ried T: Chromosome transfer induced aneuploidy results in complex dysregulation of the cellular transcriptome in immortalized and cancer cells. Cancer Res 2004, 64(19):6941-6949.
55. Murat A, Migliavacca E, Gorlia T, Lambiv WL, Shay T, Hamou MF, de Tribolet N, Regli L, Wick W, Kouwenhoven MC et al: Stem cell-related "self-renewal" signature and high epidermal growth factor receptor expression associated with resistance to concomitant chemoradiotherapy in glioblastoma. J Clin Oncol 2008, 26(18):3015-3024.
56. Fainaru O, Shay T, Hantisteanu S, Goldenberg D, Domany E, Groner Y: TGFbeta-dependent gene expression profile during maturation of dendritic cells. Genes Immun 2007, 8(3):239-244.
57. Amit I, Citri A, Shay T, Lu Y, Katz M, Zhang F, Tarcic G, Siwak D, Lahad J, Jacob-Hirsch J et al: A module of negative feedback regulators defines growth factor signaling. Nat Genet 2007, 39(4):503-512.
58. Katz M, Amit I, Citri A, Shay T, Carvalho S, Lavi S, Milanezi F, Lyass L, Amariglio N, Jacob-Hirsch J et al: A reciprocal tensin-3-cten switch mediates EGF-driven mammary cell migration. Nat Cell Biol 2007, 9(8):961-969.
59. Ohgaki H, Kleihues P: Genetic Pathways to Primary and Secondary Glioblastoma. Am J Pathol 2007, 170(5):1445-1453.
60. Rao RD, Uhm JH, Krishnan S, James CD: Genetic and signaling pathway alterations in glioblastoma: relevance to novel targeted therapies. Front Biosci 2003, 8:e270-280.
61. Ohgaki H, Dessen P, Jourde B, Horstmann S, Nishikawa T, Di Patre PL, Burkhard C, Schuler D, Probst-Hensch NM, Maiorka PC et al: Genetic pathways to glioblastoma: a population-based study. Cancer Res 2004, 64(19):6892-6899.
62. Filippini G, Falcone C, Boiardi A, Broggi G, Bruzzone MG, Caldiroli D, Farina R, Farinotti M, Fariselli L, Finocchiaro G et al: Prognostic factors for survival in 676 consecutive patients with newly diagnosed primary glioblastoma. Neuro-oncol 2008, 10(1):79-87.
63. Kleihues P, Louis DN, Scheithauer BW, Rorke LB, Reifenberger G, Burger PC, Cavenee WK: The WHO classification of tumors of the nervous system. J Neuropathol Exp Neurol 2002, 61(3):215-225; discussion 226-219.
64. Ichimura K, Ohgaki H, Kleihues P, Collins VP: Molecular pathogenesis of astrocytic tumours. J Neurooncol 2004, 70(2):137-160.
65. Watanabe K, Sato K, Biernat W, Tachibana O, von Ammon K, Ogata N, Yonekawa Y, Kleihues P, Ohgaki H: Incidence and timing of p53 mutations during astrocytoma progression in patients with multiple biopsies. Clin Cancer Res 1997, 3(4):523-530.
66. Nakamura M, Yonekawa Y, Kleihues P, Ohgaki H: Promoter hypermethylation of the RB1 gene in glioblastomas. Lab Invest 2001, 81(1):77-82.
67. Fujisawa H, Reis RM, Nakamura M, Colella S, Yonekawa Y, Kleihues P, Ohgaki H: Loss of heterozygosity on chromosome 10 is more extensive in primary (de novo) than in secondary glioblastomas. Lab Invest 2000, 80(1):65-72.
68. Tohma Y, Gratas C, Biernat W, Peraud A, Fukuda M, Yonekawa Y, Kleihues P, Ohgaki H: PTEN (MMAC1) mutations are frequent in primary glioblastomas (de novo) but not in secondary glioblastomas. J Neuropathol Exp Neurol 1998, 57(7):684-689.
69. Nakamura M, Watanabe T, Klangby U, Asker C, Wiman K, Yonekawa Y, Kleihues P, Ohgaki H: p14ARF deletion and methylation in genetic pathways to glioblastomas. Brain Pathol 2001, 11(2):159-168.
70. Biernat W, Tohma Y, Yonekawa Y, Kleihues P, Ohgaki H: Alterations of cell cycle regulatory genes in primary (de novo) and secondary glioblastomas. Acta Neuropathol 1997, 94(4):303-309.
71. Watanabe K, Tachibana O, Sata K, Yonekawa Y, Kleihues P, Ohgaki H: Overexpression of the EGF receptor and p53 mutations are mutually exclusive in the evolution of primary and secondary glioblastomas. Brain Pathol 1996, 6(3):217-223; discussion 223-214.
72. Oren M: Decision making by p53: life, death and cancer. Cell Death Differ, 10(4):431-442.
73. Ichimura K, Bolin MB, Goike HM, Schmidt EE, Moshref A, Collins VP: Deregulation of the p14ARF/MDM2/p53 Pathway Is a Prerequisite for Human Astrocytic Gliomas with G1-S Transition Control Gene Abnormalities. Cancer Res 2000, 60(2):417-424.
74. Riemenschneider MJ, Buschges R, Wolter M, Reifenberger J, Bostrom J, Kraus JA, Schlegel U, Reifenberger G: Amplification and overexpression of the MDM4 (MDMX) gene from 1q32 in a subset of malignant gliomas without TP53 mutation or MDM2 amplification. Cancer Res 1999, 59(24):6091-6096.
75. Smith JS, Jenkins RB: Genetic alterations in adult diffuse glioma: occurrence, significance, and prognostic implications. Front Biosci 2000, 5:D213-231.
76. He J, Olson JJ, James CD: Lack of p16INK4 or Retinoblastoma Protein (pRb), or Amplification-associated Overexpression of cdk4 Is Observed in Distinct Subsets of Malignant Glial Tumors and Cell Lines. Cancer Res 1995, 55(21):4833-4836.
77. Ueki K, Ono Y, Henson JW, Efird JT, von Deimling A, Louis DN: CDKN2/p16 or RB alterations occur in the majority of glioblastomas and are inversely correlated. Cancer Res 1996, 56(1):150-153.
78. Hermanson M, Funa K, Koopmann J, Maintz D, Waha A, Westermark B, Heldin C-H, Wiestler OD, Louis DN, von Deimling A et al: Association of Loss of Heterozygosity on Chromosome 17p with High Platelet-derived Growth Factor alpha Receptor Expression in Human Malignant Gliomas. Cancer Res 1996, 56(1):164-171.
79. Kita D, Yonekawa Y, Weller M, Ohgaki H: PIK3CA alterations in primary (de novo) and secondary glioblastomas. Acta Neuropathol 2007, 113(3):295-302.
80. McLendon RE, Turner K, Perkinson K, Rich J: Second messenger systems in human gliomas. Arch Pathol Lab Med 2007, 131(10):1585-1590.
81. Scaltriti M, Baselga J: The Epidermal Growth Factor Receptor Pathway: A Model for Targeted Therapy. Clin Cancer Res 2006, 12(18):5268-5272.
82. Liu L, Ichimura K, Pettersson EH, Goike HM, Collins VP: The complexity of the 7p12 amplicon in human astrocytic gliomas: detailed mapping of 246 tumors. J Neuropathol Exp Neurol 2000, 59(12):1087-1093.
83. Vogt N, Lefevre SH, Apiou F, Dutrillaux AM, Cor A, Leuraud P, Poupon MF, Dutrillaux B, Debatisse M, Malfoy B: Molecular structure of double-minute chromosomes bearing amplified copies of the epidermal growth factor receptor gene in gliomas. Proc Natl Acad Sci U S A 2004, 101(31):11368-11373.
84. Wang X-Y, Smith DI, Liu W, James CD: GBAS, a Novel Gene Encoding a Protein with Tyrosine Phosphorylation Sites and a Transmembrane Domain, Is Co-amplified with EGFR. Genomics 1998, 49(3):448-451.
85. Park S, James CD: ECop (EGFR-coamplified and overexpressed protein), a novel protein, regulates NF-kappaB transcriptional activity and associated apoptotic response in an IkappaBalpha-dependent manner. Oncogene 2005, 24(15):2495-2502.
86. Frederick L, Wang XY, Eley G, James CD: Diversity and frequency of epidermal growth factor receptor mutations in human glioblastomas. Cancer Res 2000, 60(5):1383-1387.
87. Lal A, Glazer CA, Martinson HM, Friedman HS, Archer GE, Sampson JH, Riggins GJ: Mutant epidermal growth factor receptor up-regulates molecular effectors of tumor invasion. Cancer Res 2002, 62(12):3335-3339.


88. Narita Y, Nagane M, Mishima K, Huang HJS, Furnari FB, Cavenee WK: Mutant Epidermal Growth Factor Receptor Signaling Down-Regulates p27 through Activation of the Phosphatidylinositol 3-Kinase/Akt Pathway in Glioblastomas. Cancer Res 2002, 62(22):6764-6769.
89. Choe G, Horvath S, Cloughesy TF, Crosby K, Seligson D, Palotie A, Inge L, Smith BL, Sawyers CL, Mischel PS: Analysis of the Phosphatidylinositol 3'-Kinase Signaling Pathway in Glioblastoma Patients in Vivo. Cancer Res 2003, 63(11):2742-2746.
90. Holtkamp N, Ziegenhagen N, Malzer E, Hartmann C, Giese A, von Deimling A: Characterization of the amplicon on chromosomal segment 4q12 in glioblastoma multiforme. Neuro-oncol 2007, 9(3):291-297.
91. Fleming TP, Saxena A, Clark WC, Robertson JT, Oldfield EH, Aaronson SA, Ali IU: Amplification and/or Overexpression of Platelet-derived Growth Factor Receptors and Epidermal Growth Factor Receptor in Human Glial Tumors. Cancer Res 1992, 52(16):4550-4553.
92. Kim WY, Sharpless NE: The regulation of INK4/ARF in cancer and aging. Cell 2006, 127(2):265-275.
93. Gil J, Peters G: Regulation of the INK4b-ARF-INK4a tumour suppressor locus: all for one or one for all. Nat Rev Mol Cell Biol 2006, 7(9):667-677.
94. von Deimling A, Louis DN, von Ammon K, Petersen I, Hoell T, Chung RY, Martuza RL, Schoenfeld DA, Yasargil MG, Wiestler OD et al: Association of epidermal growth factor receptor gene amplification with loss of chromosome 10 in human glioblastoma multiforme. J Neurosurg 1992, 77(2):295-301.
95. Gaya A, Rees J, Greenstein A, Stebbing J: The use of temozolomide in recurrent malignant gliomas. Cancer Treat Rev 2002, 28(2):115-120.
96. Friedman HS, Kerby T, Calvert H: Temozolomide and Treatment of Malignant Glioma. Clin Cancer Res 2000, 6(7):2585-2597.
97. Gerson SL: Clinical relevance of MGMT in the treatment of cancer. J Clin Oncol 2002, 20(9):2388-2399.
98. Gerson SL: MGMT: its role in cancer aetiology and cancer therapeutics. Nat Rev Cancer 2004, 4(4):296-307.
99. Hegi ME, Diserens AC, Godard S, Dietrich PY, Regli L, Ostermann S, Otten P, Van Melle G, de Tribolet N, Stupp R: Clinical trial substantiates the predictive value of O-6-methylguanine-DNA methyltransferase promoter methylation in glioblastoma patients treated with temozolomide. Clin Cancer Res 2004, 10(6):1871-1874.
100. Stupp R, Mason WP, van den Bent MJ, Weller M, Fisher B, Taphoorn MJ, Belanger K, Brandes AA, Marosi C, Bogdahn U et al: Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N Engl J Med 2005, 352(10):987-996.
101. Hegi ME, Diserens AC, Gorlia T, Hamou MF, de Tribolet N, Weller M, Kros JM, Hainfellner JA, Mason W, Mariani L et al: MGMT gene silencing and benefit from temozolomide in glioblastoma. N Engl J Med 2005, 352(10):997-1003.
102. Burger PC, Green SB: Patient age, histologic features, and length of survival in patients with glioblastoma multiforme. Cancer 1987, 59(9):1617-1625.


103. Stark AM, Nabavi A, Mehdorn HM, Blomer U: Glioblastoma multiforme - report of 267 cases treated at a single institution. Surg Neurol 2005, 63(2):162-169; discussion 169.
104. Su WT, Alaminos M, Mora J, Cheung NK, La Quaglia MP, Gerald WL: Positional gene expression analysis identifies 12q overexpression and amplification in a subset of neuroblastomas. Cancer Genet Cytogenet 2004, 154(2):131-137.
105. Zhang X, Zhou Y, Mehta KR, Danila DC, Scolavino S, Johnson SR, Klibanski A: A Pituitary-Derived MEG3 Isoform Functions as a Growth Suppressor in Tumor Cells. J Clin Endocrinol Metab 2003, 88(11):5119-5126.
106. Zhou Y, Zhong Y, Wang Y, Zhang X, Batista DL, Gejman R, Ansell PJ, Zhao J, Weng C, Klibanski A: Activation of p53 by MEG3 Non-coding RNA. J Biol Chem 2007, 282(34):24731-24742.
107. Scott D, Elsden J, Pearson A, Lunec J: Genes co-amplified with MYCN in neuroblastoma: silent passengers or co-determinants of phenotype? Cancer Letters 2003, 197(1-2):81-86.
108. Bell E, Lunec J, Tweddle DA: Cell cycle regulation targets of MYCN identified by gene expression microarrays. Cell Cycle 2007, 6(10):1249-1256.
109. Hodgson G, Hager JH, Volik S, Hariono S, Wernick M, Moore D, Albertson DG, Pinkel D, Collins C, Hanahan D et al: Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nat Genet 2001, 29(4):459-464.
110. Diaz-Uriarte R, Rueda OM: ADaCGH: A parallelized web-based application and R package for the analysis of aCGH data. PLoS ONE 2007, 2(1):e737.
111. Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostat 2004, 5(4):557-572.
112. Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005, 21(19):3763-3770. doi:10.1093/bioinformatics/bti611.
113. Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 2007, 23(6):657-663. doi:10.1093/bioinformatics/btl646.
114. Ferreira BI, Alonso J, Carrillo J, Acquadro F, Largo C, Suela J, Teixeira MR, Cerveira N, Molares A, Gomez-Lopez G et al: Array CGH and gene-expression profiling reveals distinct genomic instability patterns associated with DNA repair and cell-cycle checkpoint pathways in Ewing's sarcoma. Oncogene 2007.
115. Lo KC, Rossi MR, Burkhardt T, Pomeroy SL, Cowell JK: Overlay analysis of the oligonucleotide array gene expression profiles and copy number abnormalities as determined by array comparative genomic hybridization in medulloblastomas. Genes Chromosomes Cancer 2007, 46(1):53-66.
116. Lassmann S, Weis R, Makowiec F, Roth J, Danciu M, Hopt U, Werner M: Array CGH identifies distinct DNA copy number profiles of oncogenes and tumor suppressor genes in chromosomal- and microsatellite-unstable sporadic colorectal carcinomas. J Mol Med 2007, 85(3):289-300.


117. Shah SP, Lam WL, Ng RT, Murphy KP: Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 2007, 23(13):i450-458. doi:10.1093/bioinformatics/btm221.
118. Rouveirol C, Stransky N, Hupe P, Rosa PL, Viara E, Barillot E, Radvanyi F: Computation of recurrent minimal genomic alterations from array-CGH data. Bioinformatics 2006, 22(7):849-856.
119. Snijders AM, Schmidt BL, Fridlyand J, Dekker N, Pinkel D, Jordan RC, Albertson DG: Rare amplicons implicate frequent deregulation of cell fate specification pathways in oral squamous cell carcinoma. Oncogene 2005, 24(26):4232-4242.
120. Diskin SJ, Eck T, Greshock J, Mosse YP, Naylor T, Stoeckert CJ, Jr., Weber BL, Maris JM, Grant GR: STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Res 2006, 16(9):1149-1158.
121. Guttman M, Mies C, Dudycz-Sulicz K, Diskin SJ, Baldwin DA, Stoeckert CJ, Jr., Grant GR: Assessing the significance of conserved genomic aberrations using high resolution genomic microarrays. PLoS Genet 2007, 3(8):e143.
122. Lipson D, Aumann Y, Ben-Dor A, Linial N, Yakhini Z: Efficient calculation of interval scores for DNA copy number data analysis. J Comput Biol 2006, 13(2):215-228.
123. Beroukhim R, Getz G, Nghiemphu L, Barretina J, Hsueh T, Linhart D, Vivanco I, Lee JC, Huang JH, Alexander S et al: Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci U S A 2007, 104(50):20007-20012.
124. Weir BA, Woo MS, Getz G, Perner S, Ding L, Beroukhim R, Lin WM, Province MA, Kraja A, Johnson LA et al: Characterizing the cancer genome in lung adenocarcinoma. Nature 2007, 450(7171):893-898.
125. Tomioka N, Oba S, Ohira M, Misra A, Fridlyand J, Ishii S, Nakamura Y, Isogai E, Hirata T, Yoshida Y et al: Novel risk stratification of patients with neuroblastoma by genomic signature, which is independent of molecular signature. Oncogene 2008, 27(4):441-449.
126. Mosse YP, Diskin SJ, Wasserman N, Rinaldi K, Attiyeh EF, Cole K, Jagannathan J, Bhambhani K, Winter C, Maris JM: Neuroblastomas have distinct genomic DNA profiles that predict clinical phenotype and regional gene expression. Genes Chromosomes Cancer 2007, 46(10):936-949.
127. Hupe P, Stransky N, Thiery J-P, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 2004, 20(18):3413-3422.
128. Mendrzyk F, Radlwimmer B, Joos S, Kokocinski F, Benner A, Stange DE, Neben K, Fiegler H, Carter NP, Reifenberger G et al: Genomic and Protein Expression Profiling Identifies CDK6 As Novel Independent Prognostic Marker in Medulloblastoma. J Clin Oncol 2005, 23(34):8853-8862. doi:10.1200/JCO.2005.02.8589.
129. Thompson MC, Fuller C, Hogg TL, Dalton J, Finkelstein D, Lau CC, Chintagumpala M, Adesina A, Ashley DM, Kellie SJ et al: Genomics Identifies Medulloblastoma Subgroups That Are Enriched for Specific Genetic Alterations. J Clin Oncol 2006, 24(12):1924-1931. doi:10.1200/JCO.2005.04.4974.


130. Clifford SC, Lusher ME, Lindsey JC, Langdon JA, Gilbertson RJ, Straughton D, Ellison DW: Wnt/Wingless pathway activation and chromosome 6 loss characterize a distinct molecular sub-group of medulloblastomas associated with a favorable prognosis. Cell Cycle 2006, 5(22):2666-2670.
131. Naoe T, Suzuki T, Kiyoi H, Urano T: Nucleophosmin: a versatile molecule associated with hematological malignancies. Cancer Sci 2006, 97(10):963-969.
132. Vandesompele J, Baudis M, De Preter K, Van Roy N, Ambros P, Bown N, Brinkschmidt C, Christiansen H, Combaret V, Lastowska M et al: Unequivocal Delineation of Clinicogenetic Subgroups and Development of a New Model for Improved Outcome Prediction in Neuroblastoma. J Clin Oncol 2005, 23(10):2280-2299.
133. Brodeur GM: Neuroblastoma: biological insights into a clinical enigma. Nat Rev Cancer 2003, 3(3):203-216.
134. Hollox EJ, Armour JAL, Barber JCK: Extensive Normal Copy Number Variation of a beta-Defensin Antimicrobial-Gene Cluster. Am J Hum Genet 2003, 73(3):591-600.
135. Osajima-Hakomori Y, Miyake I, Ohira M, Nakagawara A, Nakagawa A, Sakai R: Biological role of anaplastic lymphoma kinase in neuroblastoma. Am J Pathol 2005, 167(1):213-222.
136. Germline mutations in FH predispose to dominantly inherited uterine fibroids, skin leiomyomata and papillary renal cell cancer. Nat Genet 2002, 30(4):406-410.
137. Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 2005, 6(1):27.
138. Eilers PHC, de Menezes RX: Quantile smoothing of array CGH data. Bioinformatics 2005, 21(7):1146-1153.
139. Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostat 2005, 6(2):211-226.
140. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 2004, 90(1):132-153.
141. Myers CL, Dunham MJ, Kung SY, Troyanskaya OG: Accurate detection of aneuploidies in array CGH and gene expression microarray data. Bioinformatics 2004, 20(18):3533-3543.
142. Lingjaerde OC, Baumbusch LO, Liestol K, Glad IK, Borresen-Dale A-L: CGH-Explorer: a program for analysis of array-CGH data. Bioinformatics 2005, 21(6):821-822.
143. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W et al: Global variation in copy number in the human genome. Nature 2006, 444(7118):444-454.
144. Slack AD, Chen Z, Ludwig AD, Hicks J, Shohet JM: MYCN-Directed Centrosome Amplification Requires MDM2-Mediated Suppression of p53 Activity in Neuroblastoma Cells. Cancer Res 2007, 67(6):2448-2455.
145. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4(3):177-183.
146. Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003, 19(3):368-375. doi:10.1093/bioinformatics/btf877.
147. Hassold T, Hunt P: To err (meiotically) is human: the genesis of human aneuploidy. Nat Rev Genet 2001, 2(4):280-291.
148. Hassold T, Hall H, Hunt P: The origin of human aneuploidy: where we have been, where we are going. Hum Mol Genet 2007, 16(R2):R203-208.
149. Brewer C, Holloway S, Zawalnyski P, Schinzel A, FitzPatrick D: A Chromosomal Duplication Map of Malformations: Regions of Suspected Haplo- and Triplolethality--and Tolerance of Segmental Aneuploidy--in Humans. Am J Hum Genet 1999, 64(6):1702-1708.
150. Chi YH, Jeang KT: Aneuploidy and cancer. J Cell Biochem 2007, 102(3):531-538.
151. Ganem NJ, Storchova Z, Pellman D: Tetraploidy, aneuploidy and cancer. Current Opinion in Genetics & Development 2007, 17(2):157-162.
152. Goldmit M, Bergman Y: Monoallelic gene expression: a repertoire of recurrent themes. Immunol Rev 2004, 200:197-214.
153. Jelinic P, Shaw P: Loss of imprinting and cancer. J Pathol 2007, 211(3):261-268.
154. Sorek R, Zhu Y, Creevey CJ, Francino MP, Bork P, Rubin EM: Genome-Wide Experimental Determination of Barriers to Horizontal Gene Transfer. Science 2007, 318(5855):1449-1452.
155. Wellner A, Lurie MN, Gophna U: Complexity, connectivity, and duplicability as barriers to lateral gene transfer. Genome Biol 2007, 8(8):R156.
156. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 2004, 101(16):6062-6067.
157. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25(1):25-29.
158. Thellin O, Zorzi W, Lakaye B, De Borman B, Coumans B, Hennen G, Grisar T, Igout A, Heinen E: Housekeeping genes as internal standards: use and limits. Journal of Biotechnology 1999, 75(2-3):291-295.
159. Lee PD, Sladek R, Greenwood CM, Hudson TJ: Control genes and variability: absence of ubiquitous reference transcripts in diverse mammalian expression studies. Genome Res 2002, 12(2):292-297.
160. Chng WJ, Kumar S, Vanwier S, Ahmann G, Price-Troska T, Henderson K, Chung TH, Kim S, Mulligan G, Bryant B et al: Molecular dissection of hyperdiploid multiple myeloma by gene expression profiling. Cancer Res 2007, 67(7):2982-2989.
161. Lockstone HE, Harris LW, Swatton JE, Wayland MT, Holland AJ, Bahn S: Gene expression profiling in the adult Down syndrome brain. Genomics 2007, 90(6):647-660.
162. Conti A, Fabbrini F, D'Agostino P, Negri R, Greco D, Genesio R, D'Armiento M, Olla C, Paladini D, Zannini M et al: Altered expression of mitochondrial and extracellular matrix genes in the heart of human fetuses with chromosome 21 trisomy. BMC Genomics 2007, 8:268.
163. Mao R, Wang X, Spitznagel EL, Jr., Frelin LP, Ting JC, Ding H, Kim JW, Ruczinski I, Downey TJ, Pevsner J: Primary and secondary transcriptional effects in the developing human Down syndrome brain and heart. Genome Biol 2005, 6(13):R107.


164. Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T et al: Ensembl 2007. Nucl Acids Res 2006:gkl996.
165. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S et al: Database resources of the National Center for Biotechnology Information. Nucl Acids Res 2008, 36(suppl_1):D13-21.
166. McKusick VA: Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 2007, 80(4):588-604.
167. Zhu J, He F, Song S, Wang J, Yu J: How many human genes can be defined as housekeeping with current expression data? BMC Genomics 2008, 9(1):172.
168. Morison IM, Paton CJ, Cleverley SD: The imprinted gene and parent-of-origin effect database. Nucl Acids Res 2001, 29(1):275-276.
169. Dang VT, Kassahn KS, Marcos AE, Ragan MA: Identification of human haploinsufficient genes and their genomic proximity to segmental duplications. Eur J Hum Genet 2008.
170. Williams BR, Prabhu VR, Hunter KE, Glazier CM, Whittaker CA, Housman DE, Amon A: Aneuploidy Affects Proliferation and Spontaneous Immortalization in Mammalian Cells. Science 2008, 322(5902):703-709.
171. Nadav L, Shay T, Naparstek E, Domany E, Katz B, Geiger B: Fibronectin-mediated adhesion regulates gene expression profiles in variant sub-populations of malignant plasma cells. Submitted.
172. Fontemaggi G, Dell’Orso S, Trisciuoglio D, Shay T, Sacchi A, Melucci E, Terrenato I, Mottolese M, Domany E, Del Bufalo D et al: Gain of function mutant p53 proteins promote neoangiogenesis through ID-4 transcriptional activation. In preparation.
173. Shay T, Reiner A, Lambiv WL, Hegi ME, Domany E: Combining chromosomal arm status and significantly aberrant genomic locations reveals new cancer subtypes. Submitted.


Table 2-1 - Clinical data

| Field | Description | Values |
|---|---|---|
| EORTC 1 Pilot 2 | Experiment | 1 - EORTC; 2 - Pilot |
| Institution Number | | |
| Patient id | | |
| Tumour ID | Usually the same as Patient ID | R for recurrence |
| TMZ salvage | Was TMZ given at salvage | 0 - no; 1 - yes |
| Date reoperation | Date of second operation | |
| MIB% | Mitotic index - percentage of dividing cells | |
| MIB_H&L | Mitotic index classification | High, low |
| MGMT | Whether MGMT methylation was tested | 0 - no experiment performed; 1 - experiment done |
| MGMT_ok | Whether MGMT methylation was successfully tested | 0 - experiment failed; 1 - experiment successfully completed |
| Meth | MGMT methylation status | 0 - MGMT non-methylated; 1 - MGMT methylated |
| meth_Rec | MGMT methylation status on recurrence | |
| Diagn | | Glioblastoma; Glioblastoma with Oligo Component |
| Date of Randomization | | |
| Randomized treatment | | 1 - RTX alone; 2 - RTX+TMZ |
| Survival Status | | 0 - death; 1 - alive; LFU (lost to follow up) |
| Progression Status | | 0 - no progression; 1 - progression |
| PFS status | Progression free survival | 0 - no PFS event; 1 - PFS event |
| Date of last contact or death | | |
| Date of progression or death | | |
| Date of birth | | |
| Biopsy | | 0 - no; 1 - yes |
| Debulking surgery | | 0 - no; 1 - yes |
| Extent of debulking surgery | Extent to which the tumor could be removed, macroscopically. Partial resection means that tumor had to be left behind. Complete resection means that the surgeon had the impression of having removed everything. | 1 - partial resection; 2 - total resection |
| Gender | | 1 - male; 2 - female |
| Performance status | | |
| Mmse | Mini Mental Status Examination at randomization | 0-30 |
| Corticosteroid status at randomization | | 0 - no; 1 - yes |
| Tumor Lobe | | 1 - frontal; 2 - temporal; 3 - parietal; 4 - occipital; 5 - brain stem; 6 - basal ganglia; 8 - other |

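Table 2-2 - List of amplifications in GBM

| Index | first bac ordinal | chr | first bac | location (kb) | bacs # | carriers # | primary effect | cause |
|---|---|---|---|---|---|---|---|---|
| 1 | 117 | 1 | CTD-2145H2 | 200929 | 3 | 16 | 3 | mdm4 |
| 2 | 170 | 2 | RP11-111P18 | 19821 | 1 | 21 | 1 | nag/mycn/rhob(?) |
| 3 | 496 | 4 | RP11-98G22 | 54697 | 3 | 11 | 7 | pdgfra |
| 4 | 764 | 6 | RP11-119C22 | 33938 | 1 | 11 | 0 | |
| 5 | 767 | 6 | RP11-110I16 | 36668 | 1 | 7 | 0 | |
| 6 | 773 | 6 | RP11-105P7 | 40383 | 2 | 17 | 0 | |
| 7 | 825 | 6 | RP11-43B19 | 161004 | 1 | 10 | 2 | |
| 8 | 905 | 7 | RP11-14K11 | 54904 | 5 | 56 | 12 | EGFR |
| 9 | 1031 | 8 | RP11-287P18 | 7243 | 2 | 36 | 0 | variation |
| 10 | 1274 | 9 | RP11-40A7 | 131502 | 1 | 12 | 0 | |
| 11 | 1438 | 11 | RP11-98J9 | 13834 | 1 | 40 | 0 | located on chr 7 |
| 12 | 1662 | 12 | CTB-136O14 | 67504 | 2 | 8 | 12 | mdm2 |
| 13 | 1724 | 13 | RP11-17I11 | 42730 | 1 | 6 | 7 | |
| 14 | 1792 | 14 | RP11-52O23 | 47330 | 1 | 6 | 0 | |
| 15 | 1823 | 14 | RP11-92H20 | 74466 | 1 | 10 | 0 | |
| 16 | 1850 | 14 | RP11-47P23 | 106285 | 1 | 9 | 0 | |
| 17 | 1867 | 15 | RP11-2D17 | 40497 | 1 | 11 | 0 | |
| 18 | 1890 | 15 | RP11-50N10 | 62265 | 1 | 7 | 0 | |
| 19 | 1900 | 15 | RP11-64K10 | 68694 | 1 | 13 | 0 | |
| 20 | 1911 | 15 | CTD-2110H9 | 81683 | 1 | 16 | 1 | |
| 21 | 2095 | 18 | RP11-106J7 | 3356 | 1 | 18 | 0 | |
| 22 | 2110 | 18 | RP11-151D11 | 13132 | 1 | 13 | 0 | |
| 23 | 2128 | 18 | RP11-160B24 | 51927 | 1 | 12 | 0 | variation |
| 24 | 2340 | 22 | RP11-61L22 | 44064 | 1 | 13 | 0 | |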

Table 2-3 - List of deletions in GBM

| index | first bac ordinal | chromosome | first bac | location (kb) | bacs # | carriers number | cause |
|---|---|---|---|---|---|---|---|
| 1 | 7 | 1 | RP11-51B4 | 6294 | 2 | 18 | ; |
| 2 | 191 | 2 | RP11-77G15 | 34274 | 1 | 10 | |
| 3 | 198 | 2 | RP11-173C1 | 39130 | 1 | 33 | located on chr 10 |
| 4 | 221 | 2 | RP11-547F18 | 67124 | 1 | 5 | |
| 5 | 339 | 2 | CTB-172I13 | 244000 | 1 | 13 | |
| 6 | 393 | 3 | RP11-18A21 | 78769 | 1 | 15 | |
| 7 | 418 | 3 | RP11-72E23 | 146616 | 1 | 8 | |
| 8 | 446 | 4 | RP11-69L7 | 830 | 1 | 13 | |
| 9 | 453 | 4 | RP11-75N18 | 7430 | 3 | 46 | located on chr 10; |
| 10 | 472 | 4 | RP11-53F9 | 30576 | 2 | 17 | ; |
| 11 | 590 | 4 | RP11-6L19 | 167373 | 1 | 13 | variation |
| 12 | 643 | 5 | RP11-88L18 | 17551 | 1 | 14 | variation |
| 13 | 655 | 5 | RP11-190C5 | 57605 | 1 | 6 | |
| 14 | 658 | 5 | CTD-2009C6 | 64072 | 1 | 7 | located on chr 6 |
| 15 | 668 | 5 | RP11-1E10 | 78923 | 2 | 12 | ; |
| 16 | 677 | 5 | RP11-252I13 | 107046 | 1 | 9 | |
| 17 | 691 | 5 | RP11-45L19 | 128259 | 1 | 5 | |
| 18 | 707 | 5 | CTD-2202A14 | 149725 | 1 | 18 | located on chr 14 |
| 19 | 793 | 6 | RP11-32O2 | 85670 | 1 | 12 | |
| 20 | 796 | 6 | RP11-113K7 | 92075 | 1 | 8 | |
| 21 | 825 | 6 | RP11-43B19 | 161004 | 1 | 28 | |
| 22 | 851 | 7 | RP11-128J9 | 9849 | 1 | 37 | located on chr 10 |
| 23 | 927 | 7 | GS1-17A10 | 77815 | 1 | 7 | |
| 24 | 1116 | 8 | RP11-90B7 | 79591 | 1 | 8 | |
| 25 | 1119 | 8 | RP11-236J18 | 80642 | 1 | 8 | located on chr 24 |
| 26 | 1164 | 8 | RP11-17M8 | 137846 | 1 | 14 | variation |
| 27 | 1186 | 9 | RP11-109M15 | 16233 | 9 | 43 | CDKN2A;CDKN2B |
| 28 | 1414 | 10 | RP11-35C24 | 134300 | 1 | 48 | |
| 29 | 1422 | 11 | RP11-209M9 | 4180 | 1 | 17 | variation |
| 30 | 1514 | 11 | CTB-12F4 | 63792 | 1 | 9 | |
| 31 | 1518 | 11 | RP11-9K14 | 65107 | 1 | 19 | |
| 32 | 1525 | 11 | CTD-2080I19 | 68464 | 1 | 14 | |
| 33 | 1529 | 11 | RP1-162F2 | 69178 | 1 | 19 | |
| 34 | 1711 | 13 | RP11-204D11 | 24331 | 1 | 39 | |
| 35 | 1731 | 13 | RP1-269F22 | 50143 | 1 | 23 | |
| 36 | 1777 | 14 | RP11-256C2 | 23657 | 5 | 23 | ;;; |
| 37 | 1789 | 14 | RP11-23D3 | 39835 | 1 | 12 | variation |
| 38 | 1804 | 14 | RP11-26H17 | 55876 | 1 | 13 | located on chr 9 |
| 39 | 1861 | 15 | RP11-194H7 | 32428 | 1 | 20 | variation |
| 40 | 1944 | 16 | RP11-150K5 | 29825 | 1 | 7 | |
| 41 | 1989 | 16 | RP11-60B6 | 83894 | 2 | 7 | ; |
| 42 | 1993 | 16 | PAC191P24 | 91000 | 1 | 5 | |
| 43 | 2098 | 18 | RP11-102E12 | 4536 | 1 | 10 | |
| 44 | 2180 | 19 | RP11-17I20 | 53858 | 1 | 7 | |
| 45 | 2184 | 19 | RP11-87L13 | 60725 | 1 | 8 | |
| 46 | 2206 | 20 | CTD-2015K11 | 8811 | 1 | 9 | |
| 47 | 2291 | 21 | RP11-49J9 | 21068 | 2 | 11 | located on chr 4 |
| 48 | 2335 | 22 | RMC22P003 | 37945 | 1 | 13 | |

Table 3-1 - Array CGH datasets analyzed

| dataset | condition | samples # | markers # | markers amplified | amplifications | markers deleted | deletions | amp&del | mislocated |
|---|---|---|---|---|---|---|---|---|---|
| GSE8634 | Medulloblastoma | 80 | 6295 | 13 | 10 | 137 | 99 | 4 | 126 |
| GSE5784 | Neuroblastoma | 236 | 2457 | 28 | 15 | 245 | 115 | 4 | 17 |
| GSE7230 | Neuroblastoma | 82 | 4073 | 30 | 18 | 87 | 49 | 0 | 144 |

The datasets are identified by their Gene Expression Omnibus (GEO - www.ncbi.nlm.nih.gov/geo) series ID. Amp&del is the number of markers found to be both significantly deleted and significantly amplified. Mislocated is the number of markers removed prior to analysis because of a probable wrong annotation.

Table 3-2 - Aberrations common to both Neuroblastoma datasets

| chromosome | GSE5784 start marker | GSE5784 end marker | GSE7230 start marker | GSE7230 end marker | interesting genes |
|---|---|---|---|---|---|
| amplifications | | | | | |
| 2 | H10_K5 | H10_M34 | CTD-2603D17 | CTD-2603D17 | |
| 2 | H10_K5 | H10_M34 | RP11-775D5 | RP11-149C19 | MYCN;NAG; |
| 8 | H9_L19 | H9_I19 | RP11-499J9 | RP11-499J9 | defensins |
| deletions | | | | | |
| 1 | H11_N30 | H11_C10 | RP11-82D16 | RP11-780N18 | TP73; |
| 1 | H11_N30 | H11_C10 | RP11-327P18 | RP11-327P18 | |
| 1 | H11_N30 | H11_C10 | RP11-150L14 | RP11-707I5 | |
| 1 | H11_N30 | H11_C10 | RP11-728G12 | RP11-728G12 | |
| 1 | H11_N30 | H11_C10 | RP11-155L18 | RP11-155L18 | |
| 1 | H11_N30 | H11_C10 | RP11-598N19 | RP11-598N19 | |
| 1 | H11_N30 | H11_C10 | RP11-335G20 | RP11-335G20 | |
| 1 | H11_N30 | H11_C10 | RP11-219O7 | RP11-219O7 | |
| 4 | H9_C33 | H9_A3 | RP11-358C18 | RP11-358C18 | |
| 6 | H9_J12 | H9_J12 | CTD-2356O12 | CTD-2356O12 | |
| 7 | H11_M23 | H11_M22 | RP11-32H11 | RP11-32H11 | |
| 11 | H11_A33 | H11_O18 | RP11-367J12 | RP11-367J12 | |
| 17 | H11_J19 | H11_J19 | CTD-2321N2 | CTD-2321N2 | |
| 17 | H10_A32 | H10_A32 | CTD-2321N2 | CTD-2321N2 | |

Table 4-1 – Auto-regulatory transcription factors

| Gene | Direction | Organism and tissue | Reference |
|---|---|---|---|
| FOXA2 | | human hepatocytes | Odom et al, 2006 |
| HNF4A | | human hepatocytes | Odom et al, 2006 |
| HNF1A | | human hepatocytes | Odom et al, 2006 |
| CREB1 | | human hepatocytes | Odom et al, 2006 |
| PTF1A | positive | mouse pancreatic acinar cells | Masui et al, 2008 |
| PER1 | negative | human circadian clock | Hastings, 2008 |
| PER2 | negative | human circadian clock | Hastings, 2008 |
| PER3 | negative | human circadian clock | Hastings, 2008 |
| PER4 | negative | human circadian clock | Hastings, 2008 |
| CRY1 | negative | human circadian clock | Hastings, 2008 |
| CRY2 | negative | human circadian clock | Hastings, 2008 |
| CEBPA | | mouse | Legraverend et al, 1993 |
| HOXA4 | | murine embryogenesis | Packer et al, 1998 |
| PAX4 | | human pancreas | Smith et al, 2000 |
| MYC | negative | human | Penn et al, 1990 |
| MYCN | negative | human | Sivak et al, 1997 |
| SNAI1 | negative | | Peiro et al, 2006 |
| MYOG | positive | muscle | Edmondson et al, 1992 |
| MYOD1 | positive | muscle | Thayer et al, 1989 |
| HOXA13 | | human fibroblasts | Rinn et al, 2008 |
| FOS | negative | mouse fibroblasts | Sassone-Corsi et al, 1988 |
| EGR1 EGR4 | negative | mammalian cell lines | Zipfel, 1997 |
| RUNX2 | negative | | Drissi, 2000 |
| FGF2 | positive | | Weich et al, 1991 |

Table 4-2 – Datasets analyzed

| Condition | GEO | chr | Aberration type | Normalization | Aberrant | Disomy |
|---|---|---|---|---|---|---|
| Glioblastoma | | 19 | trisomy | rma | 6 | 23 |
| Glioblastoma | | 20 | trisomy | rma | 7 | 43 |
| Down’s syndrome fibroblasts | GSE9762 | 21 | trisomy | mas5 | 5 | 5 |
| Down’s syndrome brain | GSE5390 | 21 | trisomy | rma | 7 | 8 |
| Down’s syndrome heart | GSE1789 | 21 | trisomy | mas5 | 10 | 5 |
| Down’s syndrome fetal astrocytes | GSE1397 | 21 | trisomy | mas5 | 2 | 2 |
| Down’s syndrome fetal cerebellum | GSE1397 | 21 | trisomy | mas5 | 3 | 3 |
| Down’s syndrome fetal cerebrum | GSE1397 | 21 | trisomy | mas5 | 4 | 7 |
| Down’s syndrome fetal heart | GSE1397 | 21 | trisomy | mas5 | 2 | 2 |
| Trisomy 13 cerebrum | GSE1397 | 13 | trisomy | mas5 | 3 | 7 |
| Glioblastoma | | 13 | monosomy | rma | 14 | 35 |
| Multiple Myeloma | GSE6477 | 13 | monosomy | mas5 | 61 | 101 |
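
The continuation of Table 4-2 counts, per dataset, the genes that pass a t-test at FDR = 10%. As a minimal sketch of how such a threshold can be applied, the Python below runs the Benjamini-Hochberg step-up procedure on a list of per-gene p-values. The choice of Benjamini-Hochberg is an assumption (the table does not name the FDR procedure), and the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return a list of booleans: True where the test is rejected at FDR level q.

    Assumption: the Benjamini-Hochberg step-up rule; the table itself
    only states "FDR = 10%".
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five genes on the aberrant chromosome.
example_p = [0.001, 0.04, 0.03, 0.5, 0.012]
selected = benjamini_hochberg(example_p, q=0.10)
```

Counting the `True` entries separately for genes with higher and lower expression in the aberrant samples would give columns analogous to "High in aberrant" and "Low in aberrant".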

Table 4-2 – Datasets analyzed, continued

| Condition | GEO | chr | genes | t-test (FDR = 10%) | High in aberrant | Low in aberrant |
|---|---|---|---|---|---|---|
| Glioblastoma | | 19 | 799 | 334 | 295 | 39 |
| Glioblastoma | | 20 | 376 | 174 | 169 | 5 |
| Down’s syndrome fibroblasts | GSE9762 | 21 | 135 | 48 | 48 | 0 |
| Down’s syndrome brain | GSE5390 | 21 | 87 | 74 | 66 | 8 |
| Down’s syndrome heart | GSE1789 | 21 | 141 | 43 | 43 | 0 |
| Down’s syndrome fetal astrocytes | GSE1397 | 21 | 136 | 0 | 0 | 0 |
| Down’s syndrome fetal cerebellum | GSE1397 | 21 | 133 | 0 | 0 | 0 |
| Down’s syndrome fetal cerebrum | GSE1397 | 21 | 142 | 12 | 10 | 2 |
| Down’s syndrome fetal heart | GSE1397 | 21 | 131 | 0 | 0 | 0 |
| Trisomy 13 cerebrum | GSE1397 | 13 | 139 | 0 | 0 | 0 |
| Glioblastoma | | 13 | 275 | 159 | 2 | 157 |
| Multiple Myeloma | GSE6477 | 13 | 239 | 113 | 5 | 108 |

Table 4-3 – Simple regression models

| variable | GBM chr19 trisomy | GBM chr20 trisomy | Down’s syndrome fibroblasts chr21 trisomy | Down’s syndrome brain chr21 trisomy | Down’s syndrome heart chr21 trisomy | Down’s syndrome fetal astrocyte chr21 trisomy |
|---|---|---|---|---|---|---|
| disomy | 1.61E-32 | 1.73E-57 | 2.33E-09 | 0.004019 | 6.45E-08 | 0.038776 |
| txLength | 0.002465 | 0.03049 | 0.04155 | 0.992269 | 0.653196 | 0.117156 |
| cdsLength | 0.429798 | 0.072642 | 0.026995 | 0.800367 | 0.881627 | 0.128139 |
| UTR5 | 6.97E-05 | 0.112389 | 0.579458 | 0.817242 | 0.214177 | 0.553021 |
| UTR3 | 0.00724 | 0.330403 | 0.370666 | 0.208042 | 0.447937 | 0.987899 |
| exonsTotalLength | 9.04E-07 | 1.15E-05 | 0.846934 | 0.655957 | 0.366306 | 0.100884 |
| intronsTotalLength | 0.713315 | 0.088179 | 0.026404 | 0.790911 | 0.863781 | 0.136032 |
| strand | 0.622369 | 0.389576 | 0.268883 | 0.543122 | 0.035544 | 0.049716 |
| exonCount | 0.34252 | 0.003271 | 0.893585 | 0.433459 | 0.018137 | 0.02648 |
| alternative_forms | 1.37E-05 | 2.50E-08 | 0.706983 | 0.045232 | 0.000272 | 0.840864 |
| UCSC_cpg | 6.65E-14 | 5.16E-08 | 0.000368 | 0.003886 | 0.13248 | 0.651563 |
| normal_tissue_detected | 3.27E-27 | 2.12E-16 | 1.39E-05 | 1.41E-07 | 6.72E-05 | 0.843477 |
| normal_tissue_std | 4.92E-10 | 9.51E-05 | 0.001283 | 0.693732 | 0.449076 | 0.327443 |
| imprinted | 0.738142 | 0.704254 | NaN | NaN | NaN | NaN |
| haploinsufficient | 0.77797 | 0.010618 | NaN | NaN | NaN | NaN |
| house_keeping | 0.000502 | 0.008754 | NaN | NaN | NaN | NaN |
| syndrome_related | 7.70E-07 | 0.382242 | 0.005365 | 0.081587 | 0.119627 | 0.4922 |
| family | 9.21E-06 | 0.000614 | 0.006579 | 0.124697 | 0.247567 | 0.971046 |
| GO_B_transcription | 0 | 0.269281 | 0.12715 | 0.995242 | 0.667368 | 0.185387 |
| GO_B_regulation of transcription, DNA-dependent | 0 | 0.257754 | 0.202723 | 0.784832 | 0.517829 | 0.242474 |
| GO_B_protein amino acid phosphorylation | 0.087337 | 0.443081 | 0.535771 | 0.825017 | 0.14055 | 0.045244 |
| GO_B_ubiquitin cycle | 0.003073 | 0.49308 | 0.1858 | 0.229973 | 0.094547 | 0.019439 |
| GO_B_transport | 0.484301 | 0.003966 | 0.618291 | 0.124468 | 0.15812 | 0.380949 |
| GO_B_ion transport | 0.002076 | 0.446519 | 0.843241 | 0.037045 | 0.084536 | 0.466573 |
| GO_B_apoptosis | 0.83902 | 0.377785 | 0.065906 | 0.322947 | 0.676464 | 0.687298 |
| GO_B_immune response | 5.51E-11 | 0.294766 | 0.125416 | 0.609224 | 0.104093 | 0.743867 |
| GO_B_cell cycle | 0.054923 | 0.840722 | 0.236702 | 0.949334 | 0.328657 | 0.384127 |
| GO_B_cell adhesion | 0.000111 | 0.12764 | 0.023125 | 0.62129 | 0.274128 | 0.615304 |
| GO_B_signal transduction | 9.34E-08 | 0.011831 | 0.054729 | 0.305695 | 0.870472 | 0.617914 |
| GO_B_G-protein coupled receptor protein signaling pathway | 2.32E-05 | 0.025551 | NaN | NaN | NaN | NaN |
| GO_B_multicellular organismal development | 0.579205 | 0.543595 | 0.17905 | 0.537822 | 0.578123 | 0.823174 |
| GO_B_metabolic process | 0.663421 | 0.011607 | 0.909246 | 0.715866 | 0.083588 | 0.668542 |
| GO_B_protein transport | 0.083164 | 3.85E-05 | NaN | NaN | NaN | NaN |
| GO_B_regulation of transcription | 0.862675 | 0.828123 | 0.759291 | 0.937937 | 0.954391 | 0.635981 |
| GO_M_nucleotide binding | 0.227035 | 0.127965 | 0.965114 | 0.763828 | 0.603366 | 0.352263 |
| GO_M_nucleic acid binding | 0 | 0.127101 | 0.233435 | 0.657702 | 0.83493 | 0.322278 |
| GO_M_DNA binding | 0 | 0.300549 | 0.254507 | 0.485903 | 0.651283 | 0.350924 |
| GO_M_transcription factor activity | 0.006123 | 0.476717 | 0.137903 | 0.862179 | 0.837347 | 0.327312 |
| GO_M_RNA binding | 5.43E-06 | 0.283026 | 0.989183 | 0.637649 | 0.502646 | 0.965119 |
| GO_M_catalytic activity | 0.580226 | 0.324241 | 0.407219 | 0.847381 | 0.605931 | 0.968896 |
| GO_M_protein kinase activity | 0.200427 | 0.335636 | 0.990773 | 0.335829 | 0.311406 | 0.026759 |
| GO_M_protein serine/threonine kinase activity | 0.439466 | 0.131691 | 0.990773 | 0.335829 | 0.311406 | 0.026759 |
| GO_M_signal transducer activity | 4.16E-05 | 0.274707 | 0.599443 | 0.250285 | 0.459892 | 0.019239 |
| GO_M_receptor activity | 2.65E-14 | 0.024758 | 0.08851 | 0.711977 | 0.158519 | 0.417599 |
| GO_M_binding | 0.563313 | 0.014924 | 0.372266 | 0.661961 | 0.372913 | 0.951643 |
| GO_M_calcium ion binding | 0.011817 | 0.461206 | 0.423149 | 0.683151 | 0.102897 | 0.708124 |
| GO_M_protein binding | 0.057804 | 4.39E-07 | 0.301754 | 0.715359 | 0.626954 | 0.050279 |
| GO_M_ATP binding | 0.638342 | 0.03852 | 0.604184 | 0.904261 | 0.333582 | 0.295562 |
| GO_M_peptidase activity | 0.001432 | 0.881023 | 0.28799 | 0.756432 | 0.239939 | 0.000706 |
| GO_M_zinc ion binding | 0 | 0.131535 | 0.725727 | 0.126864 | 0.008021 | 0.989397 |
| GO_M_kinase activity | 0.437632 | 0.346383 | 0.743236 | 0.436419 | 0.167397 | 0.059252 |
| GO_M_oxidoreductase activity | 0.262179 | 0.790758 | 0.000143 | 0.187761 | 0.571696 | 0.381974 |
| GO_M_transferase activity | 0.550666 | 0.016405 | 0.651613 | 0.730483 | 0.9144 | 0.277398 |
| GO_M_hydrolase activity | 0.238574 | 0.050673 | 0.242478 | 0.792185 | 0.381973 | 0.003319 |
| GO_M_sequence-specific DNA binding | 0.941898 | 0.353543 | 0.322124 | 0.907704 | 0.357703 | 0.821631 |
| GO_M_metal ion binding | 0 | 0.281279 | 0.710707 | 0.129204 | 0.026291 | 0.841618 |
| GO_C_extracellular region | 1.67E-08 | 0.002814 | 0.379576 | 0.313551 | 0.023623 | 0.812138 |
| GO_C_intracellular | 0 | 0.000127 | 0.197519 | 0.027156 | 0.0482 | 0.986051 |
| GO_C_membrane fraction | 0.032086 | 0.961374 | 0.200382 | 0.747855 | 0.499697 | 0.62701 |
| GO_C_nucleus | 0 | 0.039114 | 0.69236 | 0.096007 | 0.864639 | 0.029802 |
| GO_C_cytoplasm | 0.090953 | 0.001532 | 0.493895 | 0.534157 | 0.76767 | 0.652113 |
| GO_C_mitochondrion | 0.036294 | 0.005826 | 0.00507 | 0.609725 | 0.732257 | 0.863566 |
| GO_C_endoplasmic reticulum | 0.361167 | 0.010845 | 0.625712 | 0.892009 | 0.380819 | 0.548436 |
| GO_C_cytoskeleton | 0.95259 | 0.498881 | NaN | NaN | NaN | NaN |
| GO_C_plasma membrane | 9.86E-06 | 0.349023 | 0.05096 | 0.451156 | 0.984702 | 0.813726 |
| GO_C_integral to plasma membrane | 0 | 0.320229 | 0.343853 | 0.463114 | 0.250341 | 0.890696 |
| GO_C_membrane | 0 | 0.230189 | 0.006909 | 0.235573 | 0.822166 | 0.22966 |
| GO_C_integral to membrane | 0 | 0.362495 | 0.005804 | 0.61477 | 0.397981 | 0.239535 |

Red – p-value in the range [0, 0.01]
Yellow – p-value in the range [0.01, 0.05]

Table 4-3 - Simple regression models, continued

| variable | Down’s syndrome fetal cerebellum chr21 trisomy | fetal cerebrum chr13 trisomy | Down’s syndrome fetal cerebrum chr21 trisomy | Down’s syndrome fetal heart chr21 trisomy | GBM chr13 monosomy | MM chr13 monosomy |
|---|---|---|---|---|---|---|
| disomy | 0.059263 | 0.029212 | 1.52E-09 | 0.022165 | 1.35E-17 | 6.76E-14 |
| txLength | 0.229086 | 0.688661 | 0.873723 | 0.623704 | 0.57939 | 0.063549 |
| cdsLength | 0.347861 | 0.813043 | 0.846261 | 0.700517 | 0.677306 | 0.070848 |
| UTR5 | 0.362411 | 0.567339 | 0.970677 | 0.951943 | 0.309767 | 0.497431 |
| UTR3 | 0.797228 | 0.126071 | 0.986326 | 0.616431 | 0.095623 | 0.335989 |
| exonsTotalLength | 0.260498 | 0.025303 | 0.229043 | 0.557713 | 0.233047 | 0.508068 |
| intronsTotalLength | 0.359071 | 0.836795 | 0.822757 | 0.688373 | 0.663735 | 0.068183 |
| strand | 0.101985 | 0.824797 | 0.129224 | 0.603365 | 0.837597 | 0.851126 |
| exonCount | 0.30306 | 0.842052 | 0.000491 | 0.542955 | 0.08181 | 0.420116 |
| alternative_forms | 0.544758 | 0.801758 | 0.007322 | 0.907513 | 1.55E-05 | 0.050656 |
| UCSC_cpg | 0.148134 | 0.018648 | 0.028957 | 0.932706 | 1.34E-05 | 0.171185 |
| normal_tissue_detected | 0.000452 | 0.000594 | 2.66E-08 | 0.850792 | 4.18E-05 | 5.29E-13 |
| normal_tissue_std | 0.156709 | 0.000846 | 0.210696 | 0.904659 | 3.30E-05 | 0.000219 |
| imprinted | NaN | NaN | NaN | NaN | NaN | NaN |
| haploinsufficient | NaN | NaN | NaN | NaN | 0.747522 | 0.924901 |
| house_keeping | NaN | 0.551853 | NaN | NaN | 0.205234 | 0.181062 |
| syndrome_related | 0.781253 | 0.178086 | 0.048401 | 0.483483 | 0.001094 | 5.44E-06 |
| family | 0.53843 | 0.171643 | 0.057394 | 0.005194 | 0.001735 | 0.015529 |
| GO_B_transcription | 0.750189 | 0.312465 | 0.277345 | 0.32303 | 0.410552 | 0.02151 |
| GO_B_regulation of transcription, DNA-dependent | 0.780197 | 0.230125 | 0.190204 | 0.321185 | 0.662077 | 0.04461 |
| GO_B_protein amino acid phosphorylation | 0.510954 | 0.825745 | 0.249907 | 0.770819 | 0.86663 | 0.737453 |
| GO_B_ubiquitin cycle | 0.844768 | 0.436037 | 0.307297 | 0.929981 | 0.000188 | 0.001068 |
| GO_B_transport | 0.628758 | 0.242799 | 0.999239 | 0.755588 | 0.221701 | 0.431276 |
| GO_B_ion transport | 0.461845 | 0.130214 | 0.905041 | 0.798356 | 0.551392 | 0.148159 |
| GO_B_apoptosis | 0.491281 | 0.823488 | 0.997514 | 0.966847 | 0.08132 | 0.465873 |
| GO_B_immune response | 0.984744 | 0.094335 | 0.266515 | 0.237897 | 0.339169 | 0.232935 |
| GO_B_cell cycle | 0.016664 | 0.98118 | 0.117954 | 0.077241 | 0.325858 | 0.889497 |
| GO_B_cell adhesion | 0.948991 | 0.070044 | 0.87717 | 0.488292 | 0.414801 | 0.043684 |
| GO_B_signal transduction | 0.465978 | 0.963998 | 0.530421 | 0.967604 | 0.164686 | 0.639206 |
| GO_B_G-protein coupled receptor protein signaling pathway | NaN | 0.951682 | NaN | NaN | 0.091814 | 0.181211 |
| GO_B_multicellular organismal development | 0.498833 | 0.17375 | 0.382903 | 0.84422 | 0.0604 | 0.013481 |
| GO_B_metabolic process | 0.555779 | 0.773726 | 0.445115 | 0.88109 | 0.387163 | 0.171443 |
| GO_B_protein transport | NaN | 0.949193 | NaN | NaN | 0.00263 | 0.427305 |
| GO_B_regulation of transcription | 0.927483 | 0.326439 | 0.649872 | 0.097951 | 0.018628 | 0.026351 |
| GO_M_nucleotide binding | 0.158604 | 0.542703 | 0.731812 | 0.983368 | 0.862242 | 0.706371 |
| GO_M_nucleic acid binding | 0.937605 | 0.332538 | 0.810057 | 0.523071 | 0.414974 | 0.405861 |
| GO_M_DNA binding | 0.8323 | 0.262847 | 0.099725 | 0.744991 | 0.805976 | 0.645606 |
| GO_M_transcription factor activity | 0.87333 | 0.315686 | 0.119654 | 0.351601 | 0.11122 | 0.69576 |
| GO_M_RNA binding | 0.317722 | 0.456724 | 0.635617 | 0.817456 | 0.005722 | 0.137107 |
| GO_M_catalytic activity | 0.813461 | 0.770888 | 0.263389 | 0.981103 | 0.120205 | 0.953106 |
| GO_M_protein kinase activity | 0.601734 | 0.825745 | 0.409611 | 0.964369 | 0.86663 | 0.737453 |
| GO_M_protein serine/threonine kinase activity | 0.601734 | 0.948604 | 0.409611 | 0.964369 | 0.970711 | 0.490369 |
| GO_M_signal transducer activity | 0.921704 | 0.867543 | 0.652581 | 0.414132 | 0.464066 | 0.137049 |
| GO_M_receptor activity | 0.451074 | 0.799642 | 0.791441 | 0.208993 | 0.090513 | 0.212221 |
| GO_M_binding | 0.641402 | 0.293569 | 0.110582 | 0.540718 | 0.036294 | 0.593146 |
| GO_M_calcium ion binding | 0.203125 | 0.382584 | 0.6841 | 0.245383 | 0.211381 | 0.385307 |
| GO_M_protein binding | 0.491835 | 0.725638 | 0.939019 | 0.658976 | 0.167789 | 0.461277 |
| GO_M_ATP binding | 0.267285 | 0.677781 | 0.953233 | 0.525894 | 0.148811 | 0.744787 |
| GO_M_peptidase activity | 0.789596 | 0.795529 | 0.378895 | 0.246166 | 0.554812 | 0.940575 |
| GO_M_zinc ion binding | 0.031263 | 0.987322 | 0.044767 | 0.944043 | 0.068235 | 0.068965 |
| GO_M_kinase activity | 0.882691 | 0.868157 | 0.959795 | 0.920735 | 0.959051 | 0.746979 |
| GO_M_oxidoreductase activity | 0.001354 | 0.457964 | 0.053363 | 0.77904 | 0.023507 | 0.140214 |
| GO_M_transferase activity | 0.38969 | 0.785147 | 0.573574 | 0.148499 | 0.696301 | 0.80255 |
| GO_M_hydrolase activity | 0.832152 | 0.833051 | 0.668029 | 0.119837 | 0.000642 | 0.91958 |
| GO_M_sequence-specific DNA binding | 0.701033 | 0.386312 | 0.26645 | 0.998492 | 0.053329 | 0.808077 |
| GO_M_metal ion binding | 0.033773 | 0.754182 | 0.011546 | 0.985693 | 0.269644 | 0.606091 |
| GO_C_extracellular region | 0.628481 | 0.961419 | 0.858993 | 0.994399 | 0.016591 | 0.01945 |
| GO_C_intracellular | 0.192373 | 0.854063 | 0.006579 | 0.65619 | 0.282099 | 0.509773 |
| GO_C_membrane fraction | 0.193272 | 0.729598 | 0.405562 | 0.321968 | 0.074718 | 0.592687 |
| GO_C_nucleus | 0.194926 | 0.683685 | 0.839222 | 0.157058 | 0.778282 | 0.500248 |
| GO_C_cytoplasm | 0.051849 | 0.801176 | 0.90475 | 0.449948 | 0.001044 | 0.038118 |
| GO_C_mitochondrion | 0.244019 | 0.175054 | 0.010702 | 0.690851 | 0.000452 | 0.577821 |
| GO_C_endoplasmic reticulum | 0.850517 | 0.252385 | 0.584888 | 0.454381 | 0.291558 | 0.718522 |
| GO_C_cytoskeleton | NaN | 0.415506 | NaN | NaN | 0.687785 | 0.162742 |
| GO_C_plasma membrane | 0.759 | 0.128352 | 0.269357 | 0.301998 | 0.12955 | 0.005075 |
| GO_C_integral to plasma membrane | 0.746683 | 0.256936 | 0.627488 | 0.770654 | 0.142939 | 0.02979 |
| GO_C_membrane | 0.918286 | 0.115081 | 0.186662 | 0.648188 | 0.163049 | 0.001522 |
| GO_C_integral to membrane | 0.803109 | 0.121601 | 0.071898 | 0.82685 | 0.082097 | 0.003999 |

Red – p-value in the range [0, 0.01]
Yellow – p-value in the range [0.01, 0.05]

Table 4-4 – Multiple regression models

| Condition | chr | type | R^2 | F-statistic | p-value | variables in model (correlation) |
|---|---|---|---|---|---|---|
| Glioblastoma | 19 | trisomy | 0.273352 | 31.43212 | 0 | disomy:0.098034; UTR5:0.012182; exonsTotalLength:0.018518; exonCount:0.00069792; UCSC_cpg:0.040272; normal_tissue_detected:0.13992; normal_tissue_std:0.048752; syndrome_related:0.01772; family:0.014291; |
| Glioblastoma | 20 | trisomy | 0.37821 | 46.075739 | 0 | disomy:0.35206; exonsTotalLength:0.036167; normal_tissue_detected:0.18829; family:0.01974; |
| Down’s syndrome fibroblasts | 21 | trisomy | 0.258291 | 24.02846 | 0 | disomy:0.22423; normal_tissue_std:0.071128; |
| Down’s syndrome brain | 21 | trisomy | 0.216594 | 19.076933 | 0 | UCSC_cpg:0.057607; normal_tissue_detected:0.17897; |
| Down’s syndrome heart | 21 | trisomy | 0.254004 | 15.549065 | 0 | disomy:0.18778; alternative_forms:0.090026; normal_tissue_std:0.0040698; |
| Down’s syndrome fetal astrocytes | 21 | trisomy | 0.03493 | 5.031 | 0.03 | exonCount:0.03493; |
| Down’s syndrome fetal cerebellum | 21 | trisomy | 0.117547 | 9.191123 | 0 | normal_tissue_detected:0.083834; normal_tissue_std:0.014174; |
| Down’s syndrome fetal cerebrum | 21 | trisomy | 0.062945 | 7.42265 | 0 | normal_tissue_detected:0.048642; normal_tissue_std:0.046001; |
| Down’s syndrome fetal heart | 21 | trisomy | 0.292485 | 18.878504 | 0 | disomy:0.22884; exonCount:0.083977; normal_tissue_detected:0.19765; |
| Trisomy 13 cerebrum | 13 | trisomy | 0.055028 | 8.094297 | 0.01 | family:0.054074; |
| Glioblastoma | 13 | monosomy | 0.253223 | 20.119198 | 0 | disomy:0.17648; normal_tissue_std:0.084794; syndrome_related:0.027994; |
| Multiple Myeloma | 13 | monosomy | 0.315775 | 33.84385 | 0 | disomy:0.21129; normal_tissue_detected:0.19769; syndrome_related:0.083709; |
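
Tables 4-3 and 4-4 summarize regression models by p-value, R^2 and F-statistic. The Python sketch below shows how R^2 and the F-statistic arise in the single-predictor case. It is an illustration only: it assumes ordinary least squares with one gene feature (for example the disomic expression level) as predictor and the per-gene change in expression as response, which the tables do not spell out, and the function name and example values are hypothetical.

```python
def simple_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared, f_statistic).
    Assumption: one predictor per model, as in a simple regression.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered sums of squares and cross-products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    # F-statistic with (1, n - 2) degrees of freedom.
    if r_squared >= 1:
        f_statistic = float("inf")
    else:
        f_statistic = r_squared / (1 - r_squared) * (n - 2)
    return slope, intercept, r_squared, f_statistic


# Hypothetical feature values (x) and expression changes (y) for four genes.
slope, intercept, r_squared, f_statistic = simple_regression(
    [1, 2, 3, 4], [1, 3, 2, 4]
)
```

The p-value of such a model is the upper tail of the F distribution at `f_statistic` with (1, n - 2) degrees of freedom; it is not computed here because the standard library has no F distribution (`scipy.stats.f.sf` would be one option).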

Figure 2-1 - GBM dataset clustering
[Figure labels: color scale from low to high normalized expression; axes: genes (ordered by clustering), samples (ordered by clustering); annotation: normal]

Figure 2-2 - PCA of the 1000 most variable probesets in glioblastoma

Figure 2-3 – Glioblastoma aCGH data

[Figure labels: color scale amplification / no change / deletion; chromosomes 7, 10, 13 and 19 marked; axis: samples]

Figure 2-4 - Glioblastoma aCGH data of chromosomes 6, 7 and 8.
[Figure labels: color scale amplification / no change / deletion; chromosome 7 and EGFR marked; axis: samples]

Figure 2-5 - Glioblastoma aCGH data of chromosomes 9, 10 and 11
[Figure labels: color scale amplification / no change / deletion; chromosome 10 marked; axis: samples]

Figure 2-6 – Glioblastoma aCGH data of chromosomes 18, 19 and 20
[Figure labels: color scale amplification / no change / deletion; chromosome 19 marked; axis: samples]

Figure 2-7 – Glioblastoma expression and aCGH in 67 samples in fine resolution

[Figure panels A-D. Recoverable labels: markers and probesets against samples (color scale -0.5 to 0.5); expression and CGH, log2(Tumor/Normal) against genomic location; samples # against correlation coefficient (-0.2 to 1)]

Figure 2-8 - Glioblastoma expression and aCGH in 67 samples at chromosomal arm level
[Figure panels A-D. Recoverable labels: chromosomal arm against samples (color scales -0.2 to 0.2 and -0.5 to 0.5); expression and aCGH, log2(Tumor/Normal) against chromosomal arm; samples # against correlation coefficient (-0.2 to 1)]

Figure 2-9 - Expression matrix of an amplified region on chromosome 12q

[Figure label: axis: samples]

Figure 2-10 - EGFR amplicon and its transcriptional effect
[Figure panels A-C]

Figure 2-11 - Known alterations in the TP53 pathway
[Figure panels A-C; legend: known p53 alteration, unknown p53 alteration]

Figure 2-12 - Markers separating samples with known TP53 alterations from the rest of the samples
[Figure panels A-B; axis: samples (same order); legend: known p53 alteration, unknown p53 alteration]

Figure 2-13 - Probesets separating samples with known TP53 alterations from the rest of the samples
[Figure labels: axis: samples (same order); legend: known p53 alteration, unknown p53 alteration]

Figure 2-14 - Survival of samples with known TP53 alterations and samples with no known TP53 alterations

Figure 2-15 - Known alterations in the RB1 pathway.
[Figure panels A-C; legend: known RB alteration, unknown RB alteration]

Figure 2-16 – Markers separating samples with known RB1 alterations from the rest of the samples
[Figure legend: known RB alteration, unknown RB alteration]

Figure 2-17 - Probesets separating samples with known RB1 alterations from the rest of the samples
[Figure legend: known RB alteration, unknown RB alteration]

Figure 2-18 - Survival of samples with known RB1 alterations and samples with no known RB1 alterations

Figure 2-19 - Known alterations in the EGFR pathway
[Figure panels A-C; legend: known EGFR alteration, unknown EGFR alteration]

Figure 2-20 – Markers separating samples with known EGFR alterations from the rest of the samples
[Figure panels A-B; legend: known EGFR alteration, unknown EGFR alteration]

Figure 2-21 - Probesets separating samples with known EGFR alterations from the rest of the samples
[Figure legend: known EGFR alteration, unknown EGFR alteration]

Figure 2-22 - Survival of samples with known EGFR alterations and samples with no known EGFR alterations

Figure 3-1 - Calculation of the “volume” statistic for chromosomal arm 2p amplifications in GSE7230 (Neuroblastoma)

Figure 3-2 - Chromosomal status and aberrations in Medulloblastoma

Figure 3-3 - Chromosomal status and aberrations common to both Neuroblastoma datasets

Figure 4-1 - The effect of number of copies of a chromosome on the expression of genes on that chromosome.

Figure 4-2 - The conservation of the effect an additional copy of chromosome 19 has on the expression of the genes located on chromosome 19.

Figure 4-3 - The conservation of the effect that loss of a copy of chromosome 13 has on the expression of the genes located on chromosome 13.

Figure 4-4 - The effect of the disomic gene expression level on the change in expression following gain/loss of a chromosome.
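
Figure 3-1 refers to a “volume” statistic for amplifications on a chromosomal arm, and the abstract describes scoring an aberration by the fraction of carriers, its length and its amplitude, with significance assessed by permutations. The Python sketch below is a rough illustration of that idea only, not the thesis's definition: the exact form of the statistic, the threshold, the permutation scheme (shuffling markers within each sample) and all names are assumptions.

```python
import random


def volume(data, start, end, threshold=0.0):
    """Sum of log2-ratio excess above threshold over all samples and
    markers start..end-1 (an assumed form of the "volume" score).

    data: list of samples, each a list of log2 ratios ordered along the arm.
    """
    return sum(
        max(value - threshold, 0.0)
        for sample in data
        for value in sample[start:end]
    )


def permutation_p_value(data, start, end, threshold=0.0, n_perm=1000, seed=0):
    """Estimate how often a volume at least as large as the observed one
    arises when marker positions are shuffled within each sample."""
    rng = random.Random(seed)
    observed = volume(data, start, end, threshold)
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for sample in data:
            permuted = list(sample)
            rng.shuffle(permuted)
            shuffled.append(permuted)
        if volume(shuffled, start, end, threshold) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical arm: three samples, eight markers, markers 2-3 amplified.
arm = [[0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0] for _ in range(3)]
observed = volume(arm, 2, 4)
p_value = permutation_p_value(arm, 2, 4, n_perm=200, seed=1)
```

In this form the score grows with the number of carriers, the length of the region and the amplitude, so a short, strong, recurrent amplification and a long, weak one can both score highly. Deletions could be handled the same way after flipping the sign of the log2 ratios.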