Identification of cancer-specific transcripts:

With emphasis on the hunt for fusion genes in colorectal cancer

Marthe Eken

Thesis for the master’s degree in Molecular Bioscience at Department of Molecular Bioscience (IMBV), Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO December 2008


Acknowledgements

This work was carried out in the project Group of Genome Biology, at the Department of Cancer Prevention, Rikshospitalet-Radiumhospitalet Medical Center, from March 2007 to December 2008.

First of all, I would like to thank my supervisor, Rolf I. Skotheim, for his great support throughout the project, for always taking the time to answer my many questions, and for his everlasting patience with my forgotten italics. I also wish to thank my co-supervisor and head of the department, Ragnhild A. Lothe, for giving me the opportunity to be part of such an excellent group.

I greatly appreciate the other members of the department for making this a great place to work, especially Guro for being my lab-oracle, Anita for providing microarray data, Zere for helping me with the cloning, and Hilde for our many discussions.

I am grateful to my mother for all her great advice through the years and for always being there for me, and to my father for telling me I could do anything.

Finally, special thanks go to my fiancé, Joachim, for his overwhelming patience and support, for always believing in me, and for making me believe in myself.

Oslo, December 2008

Marthe Eken

Table of contents

TABLE OF CONTENTS
SUMMARY
ABBREVIATIONS
GENE SYMBOLS

1. INTRODUCTION
  1.1 THE MOLECULAR PHENOTYPE OF CANCER
  1.2 TRANSCRIPT VARIATION IN CANCER
    1.2.1 Alternative splicing
    1.2.2 Alternative core promoter usage
  1.3 FUSION GENES IN CANCER
    1.3.1 Chromosomal rearrangements
    1.3.2 Creation of fusion genes from chromosomal rearrangements
    1.3.3 Discovery and detection of fusion genes
    1.3.4 Fusion gene targeting therapy
  1.4 COLORECTAL CANCER
    1.4.1 Developmental pathways
    1.4.2 Dukes' staging, current treatment and outcome
  1.5 OTHER CANCER DISEASES STUDIED
    1.5.1 Malignant peripheral nerve sheath tumour
    1.5.2 Leukaemia
    1.5.3 Testicular germ cell tumour

2. AIMS

3. MATERIALS AND METHODS
  3.1 MATERIALS
    3.1.1 Cancer cell lines – colon cancer, testicular cancer, and leukaemia
    3.1.2 Tissue samples – malignant peripheral nerve sheath tumours and colorectal carcinomas
  3.2 PUBLICLY AVAILABLE DATABASES
  3.3 ESTABLISHMENT OF METHODOLOGICAL PROTOCOLS IN EXISTING PROJECTS
    3.3.1 Analysis of BIRC5 transcript variants in MPNST samples
    3.3.2 Validation of fusion gene microarray data
    3.3.3 Analysis of putative fusion genes in a testicular cancer cell line
  3.4 DESIGN OF A NOVEL STRATEGY FOR IDENTIFICATION OF FUSION GENES
    3.4.1 Outlier expression profiles
    3.4.2 Known and putative 3’ fusion gene partners
    3.4.3 Exon microarray analysis
  3.5 EXPERIMENTAL ASSAYS
    3.5.1 Polymerase chain reaction
    3.5.2 Rapid Amplification of cDNA Ends
    3.5.3 Detection of PCR-products
    3.5.4 Cloning
    3.5.5 DNA Sequencing

4. RESULTS
  4.1 BIRC5 TRANSCRIPT VARIANTS
  4.2 FUSION GENE MICROARRAY
  4.3 PUTATIVE FUSION GENES IN TGCT
  4.4 NOVEL STRATEGY FOR IDENTIFICATION OF FUSION GENES

5. DISCUSSION
  5.1 SPLICE VARIANTS OF BIRC5 IN CANCER
  5.2 VALIDATION OF A NOVEL MICROARRAY-BASED TOOL
  5.3 PUTATIVE FUSION GENES IN TGCT
  5.4 HUNT FOR FUSION GENES
    5.4.1 Methodological considerations
    5.4.2 Novel exons and transcripts

6. FUTURE PERSPECTIVES

7. REFERENCE LIST

APPENDIX I – PRIMER INFORMATION
APPENDIX II – RESULTS
APPENDIX III – KNOWN FUSION GENES IN CANCER
APPENDIX IV – ABSTRACTS OF MANUSCRIPTS


Summary

Cancer is a genetic and epigenetic disease where accumulation of alterations in the genome and epigenome of a cell eventually leads to malignancy. Detection and diagnosis of cancer are nevertheless not trivial issues, and furthermore, drugs intended to kill the cancer cells may also exert their effects on normal cells. This may be due to the lack of specificity of the drugs toward their intended targets, or a lack of cancer-specificity of the drug targets themselves. Events like alternative splicing and chromosomal rearrangements can create cancer-specific transcripts, which themselves or their derived products can be considered for use as diagnostic biomarkers or therapeutic targets.

During the establishment of the experimental assays we used samples from three cancer types to investigate suspected transcript variants in more detail. First, in malignant peripheral nerve sheath tumours, three transcript variants of BIRC5 were found. Secondly, two out of three known fusion genes in leukaemia were validated as positive controls in a novel microarray-based tool. Finally, although none could be validated, three putative fusion genes were investigated in a testicular cancer cell line. The contribution within two of these subprojects has also resulted in co-authorships for two scientific manuscripts (Appendix IV).

The main aim of this master project was to identify cancer-specific transcripts, with particular focus on colorectal cancer. In a hunt for fusion genes in solid tumours, a novel strategy was established, and three starting points for candidate gene selection were set: 1) genes with outlier expression profiles in colorectal cancer; 2) already known and putative downstream fusion gene partners; and 3) the ETS gene family, proven to be involved in fusion genes in prostate cancer. For all candidate genes, the expression levels across all exons were investigated, and genes with overexpression selectively from 3’-exons were further analysed for novel upstream sequences.


In total, the exon expression levels for 508 genes were investigated. Fifteen of these genes had deviating exon expression profiles indicating qualitative changes in the transcript structure and were therefore further investigated in the laboratory. Novel transcript variants were identified in all of the genes. These included potentially new promoters, novel exons within intron sequences, and intron retentions; however, no fusion genes were found.

In conclusion, our novel strategy for identification of transcript variants in colorectal cancer proved successful, and the novel transcripts will be further investigated in our laboratory to elucidate their prevalence and cancer-specificity, and subsequently their clinical applicability in cancer diagnostics, prognostics, and therapeutics.


Abbreviations

3’ss: 3’ splice site
5’ss: 5’ splice site
5’-UTR: 5’ untranslated region
ALL: Acute lymphocytic leukaemia
AML: Acute myelogenous leukaemia
APS: Ammonium persulphate
cDNA: Complementary DNA
CIMP: CpG island methylator phenotype
CIN: Chromosomal instability
CLL: Chronic lymphocytic leukaemia
CML: Chronic myelogenous leukaemia
COPA: Cancer outlier profile analysis
Ct: Cycle threshold
DNA: Deoxyribonucleic acid
ddNTP: Dideoxyribonucleotide
dNTP: Deoxyribonucleotide
DSB: Double strand break
EDTA: Ethylenediaminetetraacetic acid
ESE: Exonic splicing enhancer
EST: Expressed sequence tag
FAP: Familial adenomatous polyposis
FISH: Fluorescence in situ hybridization
GSP: Gene-specific primer
GTI: Gene tissue outlier index
HNPCC: Hereditary non-polyposis colorectal cancer
hnRNP: Heterogeneous nuclear ribonucleoprotein
HR: Homologous recombination
IQR: Inter quartile range
LB: Lysogeny broth
MMLV RT: Moloney murine leukemia virus reverse transcriptase
MMR: Mismatch repair
MPNST: Malignant peripheral nerve sheath tumour
mRNA: Messenger RNA
MSS: Microsatellite stability
MSI: Microsatellite instability
NF1: Neurofibromatosis type 1
NGSP: Nested gene-specific primer
NHEJ: Non-homologous end joining
PAGE: Polyacrylamide gel electrophoresis
PCR: Polymerase chain reaction
RACE: Rapid amplification of cDNA ends
RNA: Ribonucleic acid
RT-PCR: Reverse transcriptase polymerase chain reaction
snRNA: Small nuclear ribonucleic acid
snRNP: Small nuclear ribonucleoprotein
S.O.C.: Super optimal broth with catabolite repression medium
SR protein: Serine/arginine-rich protein
TAE: Tris-acetate EDTA buffer
TGCT: Testicular germ cell tumour
TSS: Transcription start site
UP: Universal primer
UPM: Universal primer mix


Gene symbols¹

ABL1: C-abl oncogene 1, receptor tyrosine kinase
APC: Adenomatous polyposis coli
BCR: Breakpoint cluster region
BIRC5: Baculoviral IAP repeat-containing 5 (survivin)
C4BPB: Complement component 4 binding protein, beta
CDKN2A: Cyclin-dependent kinase inhibitor 2A (melanoma, p16, inhibits CDK4)
CST1: Cystatin SN
ETV6: Ets variant gene 6 (TEL oncogene)
ERG: V-ets erythroblastosis virus E26 oncogene homolog (avian)
FZD10: Frizzled homolog 10 (Drosophila)
GJB6: Gap junction protein, beta 6, 30kDa
GPR177: G-protein coupled receptor 177
HOXB13: Homeobox B13
HOXC11: Homeobox C11
MIER1: Mesoderm induction early response 1 homolog (Xenopus laevis)
MTHFD2L: Methylenetetrahydrofolate dehydrogenase (NADP+ dependent) 2-like
NKAIN2: Na+/K+ transporting ATPase interacting 2
NPW: Neuropeptide W
PBX1: Pre-B-cell leukaemia homeobox 1
PRRX1: Paired related homeobox 1
PRRX2: Paired related homeobox 2
RAD51L1: RAD51-like 1 (Saccharomyces cerevisiae)
RUNX1: Runt-related transcription factor 1
SERPINB7: Serpin peptidase inhibitor, clade B (ovalbumin), member 7
TCF3: Transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47)
TFPT: TCF3 (E2A) fusion partner (in childhood leukaemia)
TFR2: Transferrin receptor 2
TMPRSS2: Transmembrane protease, serine 2
TP53: Tumour protein p53
USP11: Ubiquitin specific peptidase 11
VNN1: Vanin 1
WIF1: WNT inhibitory factor 1
ZDHHC20: Zinc finger, DHHC-type containing 20

1 Symbols approved by the nomenclature committee (www.genenames.org). Approved gene names are used throughout the thesis.


1. Introduction

1.1 The molecular phenotype of cancer

In every multicellular organism, organ, and tissue there exists a finely tuned balance (homeostasis) between cell proliferation and apoptotic cell death. Disruption of this balance, either by increased proliferation or decreased rates of apoptosis, can lead to development of cancer.

Tumours are generally considered to develop through clonal expansion from a single cell, resulting in monoclonal tumours [1]. However, it should be noted that there is evidence for the existence of polyclonal tumours as well [2]. During the clonal expansion process, normal cells are stepwise turned into cancer cells by accumulation of genetic and/or epigenetic aberrations which give the cells growth advantages. These advantages are the results of proto-oncogene activations, inactivation of tumour suppressor genes, and alterations of DNA repair genes [3]. The subsequent phenotypic changes can be classified into six "hallmarks of cancer" [4].

The development of cancer is therefore a multistep process with each step reflecting genetic or epigenetic alterations driving the progressive transformation of human cells. This multistep view of cancer development correlates well with the fact that most cancer types have an increased incidence with age, as it takes time to acquire a sufficient number of alterations for malignant transformation.

The phenotypic changes during tumourigenesis are consequences of dominant gain-of-function mutations of proto-oncogenes and recessive loss-of-function mutations of tumour suppressor genes. Proto-oncogenes encode proteins which control cell proliferation, apoptosis, or both, and can be activated by structural alterations resulting from gene fusion [5], by activating mutations at the nucleotide level, or by overexpression caused by e.g. juxtaposition to enhancer elements [6] or genomic amplification. Tumour suppressor genes, on the other hand, encode proteins whose normal function is to inhibit tumourigenesis, and loss of expression or mutations leading to non-functional products may therefore promote tumourigenesis. Tumour suppressor genes usually need to be inactivated in both alleles for tumourigenic effects to take place, although haploinsufficient tumour suppressors have been reported [7,8]. The loss of expression can be due to, for example, gene deletions or epigenetic inactivation. Even though the genome in most cancer cells is hypomethylated at the global level, it shows hypermethylation within specific CpG islands² compared to genomes of normal cells [9]. The majority of CpG islands are located within regulatory elements of genes, and methylation of such regions is associated with transcriptional silencing [10]. Such inappropriate methylation can contribute to tumourigenesis by silencing tumour suppressors and DNA repair genes.

1.2 Transcript variation in cancer

Perhaps the most remarkable observation from sequencing the genomes of several organisms is the fact that the number of protein-coding genes does not correlate with an organism's overall cellular complexity. In other words, increased complexity does not imply a higher number of genes. For example, mammalian species and the plant Arabidopsis thaliana have similar numbers of protein-coding genes, so from where does the complexity in mammalian species derive?

One mechanism contributing to the cellular complexity is alternative splicing of primary transcripts (pre-mRNAs), giving rise to multiple mRNA transcript variants and subsequently multiple protein isoforms per gene. Alterations of this normal process of alternative splicing are common in cancer cells and result in the production of mRNAs not existing in healthy cells or in the modification of tissue-specific ratios between normal mRNA types. One explanation for these differences is the fundamental difference in expression patterns of known splicing-regulatory genes in cancerous as compared to normal tissues [11]. Individual cancer-specific variants may or may not be functionally important for the cells, but nevertheless, and due to the presence of sequences only present in malignant cells, they have the potential to function as therapeutic targets or as biomarkers for cancer diagnostics and prognostics. This great potential makes discovery and characterisation of novel transcript variants an interesting path towards a better understanding and management of cancer.

2 CpG islands are regions of the genome with significantly higher content of CpG dinucleotides than average.

1.2.1 Alternative splicing

Alternative splicing is the process by which exons of pre-mRNAs can be spliced in different arrangements to produce structurally and functionally distinct mRNAs and protein variants [12]. The average human gene spans 28,000 nucleotides of genome sequence and the average processed transcript consists of 8.8 exons of approximately 120 nucleotides each [13]. The large number of exons in a gene enables the splicing machinery to create a large number of different transcript variants from a single gene, and thereby contributes to the cellular complexity of the organism. Alternative splicing has, in fact, been estimated to occur in 86 % of all human genes [14].

As depicted in Figure 1, different alternative splicing mechanisms are known. Exons which are either skipped or included in the final mRNA and are flanked by intron sequences on both sides are called cassette exons (Figure 1A). Another mechanism of alternative splicing is the use of different 5' and 3' splice sites (Figure 1, parts B and C, respectively), where the amount of sequence included from a particular exon varies between different transcripts. If a splice site is missed by the splicing machinery, an intron can be retained in the final mRNA and contribute to the coding sequence (Figure 1D). Also, some exons are mutually exclusive (Figure 1E). This means that in the final processed mRNA, one out of two exons is always present, but never both. Finally, mRNAs can differ in their 5' and 3'-ends by using alternative promoters or poly-adenylation sites (Figure 1, parts F and G, respectively) [12].
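As a rough illustration of the cassette-exon case (Figure 1A), the comparison of two transcript variants can be sketched in a few lines of Python. This is a minimal sketch and not a method used in this work: the function name `find_cassette_exons`, the representation of exons as (start, end) genomic coordinates, and the example coordinates are assumptions made for the example, and only single skipped exons flanked by shared exons are detected.

```python
def find_cassette_exons(variant_with, variant_without):
    """Return exons of variant_with that are skipped in variant_without.

    Both variants are lists of (start, end) exon coordinates in 5'->3' order.
    An exon is reported when it is absent from variant_without while its two
    neighbouring exons are present there and directly adjacent to each other.
    """
    position = {exon: i for i, exon in enumerate(variant_without)}
    skipped = []
    # First and last exons are excluded: a cassette exon is flanked on both sides.
    for i in range(1, len(variant_with) - 1):
        exon = variant_with[i]
        left, right = variant_with[i - 1], variant_with[i + 1]
        if exon in position:
            continue
        if (left in position and right in position
                and position[right] == position[left] + 1):
            skipped.append(exon)
    return skipped


# Hypothetical coordinates for two variants of one gene.
full_length = [(1, 100), (201, 300), (401, 500), (601, 700)]
exon2_skipped = [(1, 100), (401, 500), (601, 700)]
print(find_cassette_exons(full_length, exon2_skipped))
```

The other event types in Figure 1 (alternative splice sites, intron retention, mutually exclusive exons) would need comparisons of exon boundaries rather than of whole exons and are deliberately left out of this sketch.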


According to Nagasaki et al., cassette exons and intron retention are the two most common types of alternative splicing in human genes [15].

Figure 1. Common sources of qualitative transcript variation (mostly alternative pre-mRNA splicing). (A) Cassette alternative exon. (B) Alternative 5' splice sites. (C) Alternative 3' splice sites. (D) Intron retention. (E) Mutually exclusive alternative exons. (F) Alternative promoter and first exon. (G) Alternative polyadenylation site and terminal exon. [Modified from ref. 12].

Alternative splicing is regulated by three main mechanisms: the splicing machinery and its assembly, interactions between splicing factors and cis-elements, and the rate of transcription elongation. In the following, each of these mechanisms will be discussed in more detail.

Splicing is performed by a set of splicing factors assembled into the spliceosome complex. The spliceosome consists of five small nuclear RNA (snRNA) molecules and approximately 150 proteins [16]. The five snRNAs (U1, U2, U4, U5, and U6) assemble with proteins to form small nuclear ribonucleoprotein complexes (snRNPs). The snRNPs, together with the rest of the spliceosome, recognise introns by four short sequences: the exon-intron junctions at both ends of the intron (5' and 3' splice sites), the branch site³, and the pyrimidine tract located between the 3' splice site (3’ss) and the branch site [18]. The assembly of the spliceosome starts with the recognition of the 5’ splice site (5’ss) by U1. Next, the branch site and the 3’ss are bound by U2, and together U1 and U2 constitute the prespliceosome, or complex A. The tri-snRNP U4/U6/U5 then joins the complex to form complex B. This gives rise to conformational changes which result in the loss of U1 and U4 from the complex, and the active spliceosome is created [18].

Several intronic and exonic cis-elements important for correct splice site identification have been described. Binding of different protein factors to these elements can stimulate (enhancers) or repress (silencers) splicing. Exonic splicing enhancers (ESEs) serve as binding sites for specific serine/arginine-rich (SR) proteins. SR proteins bound to ESEs can promote exon definition⁴ by directly recruiting the splicing machinery and/or by antagonising the action of nearby silencer elements [20]. Most described silencers are intronic elements and they seem to work by interacting with negative regulators which often belong to the heterogeneous nuclear ribonucleoprotein (hnRNP) family [21]. These proteins can act as silencers in different ways [22,23]. The decision to include an exon reflects the intrinsic strength of the flanking splice sites and the combinatorial effects of positive and negative elements.

The transcriptional elongation rate affects alternative splicing by regulating which splice sites are available to the spliceosome at any given time. That is, under high elongation rates two neighbouring splice sites on the pre-mRNA will be available to the spliceosome at the same time, whereas only one will be available if the elongation slows down or pauses between them. An alternative exon will normally have a somewhat weaker splice site than a constitutive exon. If both a weak and a strong splice site are available to the spliceosome at the same time, the stronger one will be favoured. If, on the other hand, only the weak site is available when the elongation pauses, splicing can occur at this position. When the elongation subsequently proceeds, the stronger splice site can be used as well [24].

3 The branch site is a splicing signal located upstream of the 3' end of the intron [17].

4 Exon definition is the recognition of a particular pre-mRNA segment as an exon by the spliceosome [19].

Alterations of the normal splicing pattern can cause or contribute to diseases, such as cancer, in different ways because relatively small changes in the cis-elements or trans-acting factors can cause severe changes in the spliced product. For instance, if a base-substitution mutation abrogates an ESE and causes exon skipping, the mutant protein, instead of having at most a single amino acid difference from the wild type, will carry a large internal deletion. Mutation-induced exon skipping or inclusion may also result in premature termination codons which in turn can cause nonsense-mediated mRNA decay [25].

In addition, different splice variants from the same gene may have completely different activities, because whole functional domains may be added or deleted from the protein-coding sequence. An example of such alterations is seen in the anti-apoptotic gene BIRC5. This gene is highly upregulated in various cancers [26] and alternative splicing of its pre-mRNA produces four different mRNAs, which encode four different protein isoforms. One isoform has pro-apoptotic properties and acts like a naturally occurring antagonist of the anti-apoptotic functions of the other isoforms [27].

1.2.2 Alternative core promoter usage

Before discussing alternative core promoter usage it is important to keep in mind the differences between a transcription start site (TSS) and a core promoter. A gene's TSS is the first nucleotide to be transcribed into a particular RNA. The core promoter, on the other hand, is the genomic region that surrounds a TSS. It is defined as the segment of DNA required to recruit the transcription initiation complex and initiate transcription, given the appropriate external signals [28]. Alternative TSSs are often used within a core promoter [29].

Use of alternative core promoters enables diversification of transcriptional regulation within a single gene and thereby plays a significant role in the control of gene expression in various cell lineages, tissue types and developmental stages. The use of different core promoters can lead to two types of protein products, depending on the location of the translational start site relative to the used promoter. If the translational start site exists within the first exon, mRNA isoforms that encode distinct proteins will be produced. On the other hand, if the alternative first exon is non-coding, the alternative transcripts will have heterogeneous 5' untranslated regions (5’-UTR), which commonly implies different RNA stability, but the encoded proteins are identical. The molecular mechanisms behind the selective use of multiple promoters are not well known, but the use of diverse core promoter structures, variable concentrations of cis-regulatory elements and regional epigenetic mechanisms are thought to be important factors [reviewed in ref. 30].

Several oncogenes and tumour suppressor genes have multiple promoters and the aberrant use of one promoter over another in some of these genes is directly linked to cancerous cell growth [31,32].

1.3 Fusion genes in cancer

1.3.1 Chromosomal rearrangements

The first described consistent chromosomal abnormality, the Philadelphia chromosome, was reported to be associated with chronic myelogenous leukaemia (CML) in 1960 [33], and in 1973 this chromosome was characterised as a reciprocal translocation involving chromosomes 9 and 22 [34]. During the subsequent 35 years, a large number of consistent chromosomal aberrations in cancer have been discovered and today over 600 are known.⁵

Chromosomal aberrations may be caused by translocations, insertions, inversions, deletions, or duplications (Figure 2). A chromosomal translocation is a genomic aberration involving the rejoining of a broken chromosome fragment to another chromosome (Figure 2A). The initial event for such a translocation is the formation of a DNA double-strand break (DSB). This can be induced either by physiological situations, such as during the development of the immune system, or by exogenous DNA damaging agents. Inappropriate repair of such a DSB can lead to chromosomal translocations [35].

DSBs are usually repaired by one of two major DNA repair pathways: homologous recombination (HR) or non-homologous end-joining (NHEJ). HR is generally an error-free pathway where a DSB is accurately repaired by using the undamaged sister chromatid as a template for repair of a broken chromatid. In the human genome, however, the presence of highly repetitive sequences, e.g. Alu sequences, can lead to ectopic recombination, resulting in DNA rearrangements such as translocations. NHEJ is the repair of DSBs by straightforward religation of ends without the requirement for a template [35].

Also, in lymphocytes, chromosomal translocations are often a result of mistakes in the V(D)J or immunoglobulin class switch recombination [36]. The consequence of this may be the joining of coding sequences of a normally silent proto-oncogene locus with regulatory elements of the highly active immunoglobulin locus, thus causing an ectopic overexpression of the downstream oncogene.

Chromosomal insertions and inversions are depicted in Figure 2, parts B and C, respectively. A chromosomal insertion refers to the insertion of genomic sequence into a chromosome [exemplified in ref. 37]. Chromosomal inversion, on the other hand, is a rearrangement in which a segment of a chromosome is reversed end-to-end. An inversion is called paracentric if it does not include the centromere, and pericentric if it does include the centromere [exemplified in ref. 38].

5 According to the Mitelman Database of Chromosomal Aberrations in Cancer (http://cgap.nci.nih.gov/Chromosomes/Mitelman)

Chromosomal loss due to deletions can involve a region of a chromosome (Figure 2D) or an entire chromosome. Regional deletions can occur in three forms with different consequences. Firstly, regional loss in one chromosome can be repaired by using its counterpart as template. This may result in loss of heterozygosity, but no quantitative loss of genomic material. Secondly, a chromosome region can be lost without the subsequent repair process (hemizygous deletion), and finally, both copies of a locus can be lost due to a homozygous deletion [39].

Chromosomal duplication is duplication of a region of DNA and can occur as an error in HR, a retrotransposition event, or duplication of an entire chromosome. If the duplication process is repeated the result is more copies of the DNA segment, a process called chromosomal amplification.


Figure 2. Types of chromosomal rearrangements. (A) Translocation, (B) insertion, (C) inversion, (D) deletion, (E) duplication/amplification. Amplification is created by multiple events of the depicted duplication.

1.3.2 Creation of fusion genes from chromosomal rearrangements

Quantitative and qualitative fusion genes (Figure 3) can be the results of all types of chromosomal rearrangements described in section 1.3.1. The quantitative type is caused by promoter swapping where the regulatory elements of a strongly expressed gene become aberrantly juxtaposed to a proto-oncogene (Figure 3A). The breakpoints of the rearrangement are located upstream from the coding region of the partner genes, resulting in an oncogenic fusion gene with promoter region and sometimes also non-coding exons from the upstream partner [40]. Qualitative fusion genes, on the other hand, arise when the DNA breaks within the coding sequence of two different genes, A and B, and the gene fragments are joined in erroneous combinations (Figure 3B). Introns are generally longer than exons, and thus rearrangements usually occur within an intron [41]. Furthermore, intronic breakpoints imply undisrupted exons on both sides, and thus a reduced likelihood of the downstream partner being out of frame. The qualitative fusion gene, A-B, will consist of parts of the coding region in gene A and parts of the coding region in gene B juxtaposed to one another, often generating a fusion protein which is functionally distinct from the products of both original partner genes. Transcription of the resulting A-B fusion gene will be under the control of the promoter region in gene A [40].
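The reading-frame argument can be made concrete with a small sketch. Assuming both breakpoints lie in introns, so that only whole exons are joined, the downstream partner stays in frame when the number of coding nucleotides retained from gene A and the number of coding nucleotides of gene B that normally precede the junction leave the same remainder on division by three. The function name, the exon lengths, and the 0-based exon indices below are hypothetical and serve only as illustration.

```python
def junction_in_frame(coding_exons_a, coding_exons_b, last_exon_a, first_exon_b):
    """Check whether an A-B fusion keeps gene B in its normal reading frame.

    coding_exons_a, coding_exons_b: coding lengths (nt) of each exon, 5'->3',
        counted from the first nucleotide of the start codon.
    last_exon_a: index of the last exon of gene A kept in the fusion.
    first_exon_b: index of the first exon of gene B kept in the fusion.
    """
    kept_from_a = sum(coding_exons_a[: last_exon_a + 1])
    lost_from_b = sum(coding_exons_b[:first_exon_b])
    # B is read in its own frame if A contributes the same codon phase
    # as the part of B that was replaced.
    return (kept_from_a - lost_from_b) % 3 == 0


# Hypothetical exon structures for genes A and B.
gene_a = [100, 50, 75]
gene_b = [90, 60, 120]
print(junction_in_frame(gene_a, gene_b, last_exon_a=1, first_exon_b=1))
print(junction_in_frame(gene_a, gene_b, last_exon_a=0, first_exon_b=1))
```

With these made-up lengths, joining A after its second exon to B from its second exon preserves the frame (150 and 90 coding nucleotides, both divisible by three), whereas joining A after its first exon does not. A breakpoint inside an exon would add an arbitrary offset to either sum, which is the situation the intronic breakpoints avoid.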


Figure 3. Oncogenic fusion genes. (A) Quantitative fusion gene, or promoter swapping. The 3' partner, gene B, is placed downstream of the control region of the 5' gene, A. The chimeric transcript produced contains whole or parts of the 5'-UTR from gene A and an intact coding region from gene B and will therefore give rise to a normal, but differently regulated gene product from B. (B) Qualitative fusion gene. The breakpoints are in the coding region of two different genes, A and B, and the gene fragments are joined erroneously. Only one of the two resulting fusion genes is depicted.

Chromosomal translocations are usually reciprocal and therefore lead to the generation of two chimeric genes. Despite this, detection of both genes is uncommon, because usually only one of the resulting fusion genes is controlled by a strongly active promoter [42].

Many of the genes that are recurrently interrupted by chromosomal rearrangements in cancer are considered proto-oncogenes, and the rearrangements thus generate fusion oncogenes [40,43]. Some of these fusion oncogenes show a strict specificity for tumour type. This specificity is in many cases explained by the tissue-specific expression of the upstream fusion partner, which contributes the transcription regulatory sequences. In other cases, the cancer type specificity may be explained by the biological activity of the fusion oncogenes. The rearrangements happen randomly in different cell types, but they only cause tumour development in one of them due to the resulting fusion gene's ability to influence the phenotype in this cell type [40]. Some fusion genes, on the other hand, can arise in many different tumour types. These genes may activate basic transformation pathways that function in multiple cell types and cell lineages [44,45].


Chromosomal translocations resulting in the formation of fusion genes are the most prevalent form of recurrent genetic alterations known in cancers [46]. To date, 75 % of all known gene fusions are found in haematological disorders, including malignant lymphomas, although these disorders account for less than 10 % of all human cancers [47]. The absence of gene fusions in common epithelial tumours has been attributed to the technical difficulties associated with their cytogenetic analysis. However, bypassing these limitations, as discussed in the next section, has recently led to the discovery of several recurrent gene fusions in some of these cancer types as well [48,49].

1.3.3 Discovery and detection of fusion genes

For the discovery of fusion genes in haematological diseases, karyotyping and fluorescence in situ hybridisation (FISH) have been most commonly used in the initial screening. These methods lack sensitivity to detect anything less than gross genomic rearrangements and provide limited resolution of breakpoint positions. To bypass these limitations, bioinformatics and high-throughput sequencing have recently been used to discover previously unknown fusion genes [50,51].

Genes with an outlier expression profile within a series of cancer samples, i.e. strong overexpression in a subset of samples, have previously been indicated to be altered due to structural genomic changes in the same samples. These structural changes may be chromosomal translocations, insertions, inversions, deletions, or genomic amplifications, which all may lead to high expression of oncogenic fusion transcripts. Three years ago, gene expression outlier analysis of a large database of gene expression data⁶ was for the first time used to identify fusion genes [50]. In this study, an algorithm called cancer outlier profile analysis (COPA) was used, and ERG was identified as a gene with strong outlier profile in prostate cancers. Subsequent laboratory analyses of ERG-containing transcripts identified TMPRSS2-ERG as a common fusion gene in prostate cancers.

6 www.oncomine.org

Newly developed strategies for high-throughput sequencing make it possible to carry out genome-wide screening for chromosomal rearrangements in the hunt for new fusion genes. High resolution and rapid analysis make this an attractive strategy, and several groups have already used it to identify the comprehensive set of rearrangements present in particular human cancer cell lines [51,52].

Today, detection of known fusion genes for diagnostic purposes is usually done by combinations of karyotyping, FISH, and reverse transcription polymerase chain reaction (RT-PCR). Karyotyping requires the availability of fresh, vital cells for short-term culturing to obtain metaphase chromosomes, and the success rate of this approach may be particularly low for solid tumours. FISH with locus-specific probes and RT-PCR, on the other hand, are precise and highly specific methods, but are dependent on prior knowledge of the suspected diagnosis and have no screening ability. Thus, an effective screening tool for simultaneous analysis of all fusion genes is needed.

1.3.4 Fusion gene targeting therapy

Fusion transcripts are virtually always unique to cancer cells and therefore constitute attractive drug targets and diagnostic markers. The prime example of the importance of fusion genes in cancer treatment is BCR-ABL1. This fusion gene is the main result of the Philadelphia chromosome and has proven to be an ideal diagnostic marker [33,53], and its encoded fusion protein is the molecular target for the cancer drug Gleevec [54,55].

In addition to being unique for malignant cells, there is overwhelming evidence that the formation of fusion genes represents an important and early step in carcinogenesis, thus making them even more interesting as both diagnostic markers and therapeutic targets. First, as mentioned in section 1.3.2, they are usually closely correlated to specific tumour phenotypes. Secondly, in experimental animal models, gene fusion constructs generally give rise to neoplastic disorders of the same kind as those seen in sporadic human cancers carrying the same gene fusion [56,57]. Finally, silencing fusion transcripts in vitro leads to the reversal of tumourigenicity, decreased proliferation, and/or differentiation [58,59].

1.4 Colorectal cancer

Colorectal cancer includes cancerous growth in the colon and rectum and is one of the most common cancer forms world-wide with approximately one million new cases each year in the Western world [60]. In Norway, 3500 new cases are diagnosed each year, making colorectal cancer the second most common cancer type in both men (after prostate cancer) and women (after breast cancer). Colorectal cancer is the second most deadly cancer in Norway (after lung cancer), with over 1100 deaths in 2004, and the five-year survival rates for men and women are 56 and 58 percent, respectively [61].

Colorectal cancer develops either sporadically, or as part of a hereditary cancer syndrome. Hereditary syndromes involve germline mutations in specific genes that greatly increase the lifetime risk of developing colorectal cancer as compared to the general population. The hereditary cases account for < 5 % of all colorectal cancers, but many lessons are learned from the molecular studies of these rare syndromes and many of the same genes are implicated in the somatic development of sporadic colorectal cancer. Two hereditary syndromes are particularly well studied: familial adenomatous polyposis (FAP) and hereditary non-polyposis colorectal cancer (HNPCC).

FAP is an autosomal dominant disorder accounting for less than 1 % of all colorectal cancer cases. The most compelling feature is the onset and progression of hundreds or thousands of small adenomatous polyps throughout the colon. These polyps typically develop during the second decade of life and the development of cancer is inevitable unless the colon is surgically removed [62]. FAP is caused by germline mutations in the tumour suppressor gene adenomatous polyposis coli (APC) [63].


Patients with HNPCC, or Lynch syndrome, have monoallelic germline mutations in one of the DNA mismatch repair (MMR) genes. A defective MMR system gives a high propensity for mutations within genes harbouring mono-, di-, or trinucleotide repeats, and malignant transformation is facilitated by the accumulation of mutations in cancer-critical genes [64].

1.4.1 Developmental pathways

The adenoma-carcinoma sequence⁷ (Figure 4) is believed to underlie the development of colorectal cancer in most patients and two distinct pathways have been identified: the chromosomal instability (CIN) and the microsatellite instability (MSI) pathways [66,67]. CIN (also referred to as the microsatellite stability pathway) is the most common genetic pathway, accounting for approximately 85 % of colorectal cancers, and is characterized by allelic losses and chromosomal amplifications [67,68]. The MSI pathway accounts for about 15 % of sporadic colorectal cancers. These cancers are associated with frameshift mutations and base-pair substitutions within tandemly repeated nucleotide sequences known as microsatellites [69-71]. This type of genetic destabilization is caused by loss of DNA mismatch repair functions [64]. Tumours with MSI are usually located in the proximal colon and the patients have improved survival rates as compared to those with CIN tumours [72,73].

7 This adenoma carcinoma sequence describes the stepwise progression from normal through dysplastic epithelium to carcinoma associated with the accumulation of multiple clonally selected genetic alterations [65].


Figure 4. The adenoma carcinoma sequence. Molecular alterations associated with the MSI pathway are represented in green, whereas alterations associated with the CIN pathway are shown in blue. Orange colour indicates shared events. Modified based on the model originally proposed in ref. [65] and integrated with molecular interactions in ref. [74].

In 1999, Toyota et al. proposed a third pathway for development of colorectal cancer where a subset of cancers exhibit widespread DNA methylation in promoter sequences [75]. They called it CpG island methylator phenotype (CIMP) and in recent years CIMP has been established as a common feature of human neoplasia [76]. Although CIMP colorectal tumours overlap with both the CIN and MSI phenotypes, they have a distinct clinical, pathological and molecular profile, such as association with proximal tumour location, female gender, poor differentiation, MSI, and high BRAF- and low TP53-mutation rates [77-80].

1.4.2 Dukes' staging, current treatment and outcome

Colorectal cancers can be classified as Dukes' stages A, B, C, or D (or alternatively as clinical stages 1, 2, 3, and 4) according to the classification system proposed by Cuthbert E. Dukes in 1932 [81]. Dukes’ A tumours are confined to the intestinal mucosa and submucosa, whereas Dukes’ B tumours have invaded these layers and penetrated into the muscle layers, but not metastasised outside of the bowel. Dukes’ C tumours have spread to local lymph nodes, and Dukes’ D tumours have distant metastasis (Figure 5). The survival rate is strongly associated with the Dukes' stage at the time of diagnosis. More than 90 % of patients with a localized tumour (Dukes’ A and B) survive more than five years after initial diagnosis. This number is reduced to less than 10 % for patients with distant metastases (Dukes’ D) [61].


Figure 5. Dukes' staging in colorectal cancer. In Dukes’ A colorectal cancer the tumour is confined to the inner lining of the colon (mucosa and submucosa). In Dukes’ B colorectal cancer, the tumour has invaded into the muscle layers of the intestinal wall. Dukes’ C colorectal cancer has spread to local lymph nodes and in Dukes’ D colorectal cancer, metastases are found in distant organs.

The standard treatment for colorectal cancer patients in Norway today is surgery in combination with chemotherapy. Chemotherapy is generally given as adjuvant treatment to patients with Dukes' C tumours and, depending on clinical evaluation, to patients with Dukes’ D tumours. Different treatment regimens are available, but most commonly 5-fluorouracil/leucovorin (calcium folinate) is used in combination with other drugs (in particular oxaliplatin).⁸ 5-fluorouracil is an analogue of uracil and is intracellularly converted to several active metabolites which in turn disrupt DNA and RNA synthesis and thereby cause cell death [82]. This drug will affect both cancer and normal cells, but is more effective on the cancer cells because they divide more rapidly, and thereby need to synthesise more nucleic acids.

8 From the webpage of the Norwegian Gastro Intestinal Cancer Group: http://ngicg.no/wp/


1.5 Other cancer diseases studied

1.5.1 Malignant peripheral nerve sheath tumour

Malignant peripheral nerve sheath tumour (MPNST) is a rare malignancy which arises from Schwann cells in the nerve sheath which wraps around the axon. This cancer type presents sporadically or in individuals with the hereditary disease neurofibromatosis type 1 (NF1). NF1 is an autosomal dominant tumour syndrome caused by mutation in the tumour suppressor gene NF1. Individuals with this disease have a 5-15 % risk of developing MPNST and account for approximately 50 % of all MPNST cases [83]. The characteristic features of NF1 are café au lait patches, neurofibromas, axillary freckling, iris Lisch nodules, and distinctive bony dysplasia. The prognosis for patients with MPNST is poor, with a ten-year survival rate of 22 % [84]. Apart from surgical excision of the tumour, no consensus therapy exists for MPNST patients.

The NF1 gene maps to 17q11.2 and encodes the neurofibromin protein [85,86]. Neurofibromin is a GTPase-activating protein which inhibits p21-RAS, and inactivation of NF1 in tumours leads to increased RAS signalling and increased cell proliferation [87]. Complete inactivation of NF1 has been found in benign neurofibromas, demonstrating that this is not sufficient for malignant transformation [88]. CDKN2A has also been implicated in the transformation process [89], whereas the role of TP53 is more controversial [90,91].

MPNSTs are classified as soft tissue sarcomas although they are tumours of neuroectodermal origin. An unclassified “sarcoma” is considered to be an MPNST if the tumour arises within or from a peripheral nerve, it arises from a pre-existing benign or other malignant nerve sheath tumour, the tumour has the morphology of an MPNST and arises in a patient with NF1, or if it exhibits histological, immunohistochemical, or ultrastructural features that suggest Schwann cell differentiation [92].


1.5.2 Leukaemia

Leukaemia is a broad term which can be divided into four major groups: acute lymphocytic leukaemia (ALL), acute myelogenous leukaemia (AML), chronic lymphocytic leukaemia (CLL), and chronic myelogenous leukaemia (CML). Altogether there are over 500 new cases in Norway each year and leukaemia is the most common cancer type affecting children [61].

Leukaemias are characterised by chromosome aberrations selectively associated with biologically distinct subtypes of the disease (exemplified by CML; see below). In addition, the chromosomal aberrations present may be predictive of the outcome (exemplified by MLL rearrangements; see below).

CML is characterised by a t(9;22)(q34;q11) chromosomal translocation [34] resulting in the Philadelphia chromosome. This translocation is present in 95 % of all CML patients and the remaining 5 % have complex or variant translocations involving additional chromosomes resulting in the same fusion of the BCR gene to the ABL1 gene [93]. The expression of the BCR-ABL1 fusion gene has been shown to be both sufficient and necessary for the transformed phenotype of CML cells [94].

Translocation of the MLL gene at 11q23 with a number of partner genes may be found in both childhood ALL and AML. The same cytogenetic abnormality then predicts a different outcome depending on the disease phenotype: MLL rearrangements in infants with ALL predict a poor prognosis, whereas they predict an intermediate prognosis in childhood AML [95].

1.5.3 Testicular germ cell tumour

Testicular germ cell tumours (TGCTs) comprise approximately 98 % of all testicular cancers and are the most common cancer type among Norwegian men aged 15 to 40, with approximately 200 new cases each year [61].


Three epidemiologically, clinically, and histologically diverse groups of TGCT can be defined; teratomas and yolk sac tumours of newborns and infants (type 1), seminomas and nonseminomas of young adults (type 2), and spermatocytic seminomas of elderly men (type 3) [96].

Type 2 TGCT, seminomas and nonseminomas of young adults, is by far the most common. These are believed to develop from primordial germ cells into premalignant and non-invasive intratubular germ cell neoplasia (IGCN) during foetal life [97,98]. After puberty, IGCN can develop into invasive cancer, which is histologically classified into either seminomas, nonseminomas, or a combination of the two. Seminoma cells are undifferentiated and resemble the IGCN cells. Nonseminomas are more heterogeneous, and in addition to the undifferentiated and pluripotent embryonal carcinomas, they may contain more differentiated histological subtypes called choriocarcinomas, yolk sac tumours, and teratomas [99].

Type 2 TGCTs are genetically characterised by excess genetic material of the short arm of chromosome 12 [100], either due to the presence of isochromosome 12p [101] or amplifications of 12p sequence [102]. In addition, a number of other recurrent genetic changes have been described [103], but as for colorectal cancer and MPNST, no recurrent fusion gene has been described for TGCT.


2. Aims

The overall aim of this thesis was to identify novel and cancer-specific transcript variants in solid tumours.

The thesis consists of two parts with individual objectives:

1) To establish the necessary experimental tools and apply these in validation of transcript variants in the following subprojects:

• Investigate transcript variants of BIRC5 in MPNST.

• Analyse known fusion genes in leukaemia cell lines in conjunction with the development of a novel fusion gene microarray.

• Analyse putative fusion genes in a TGCT cell line.

2) To investigate the presence of cancer-specific transcripts in colorectal cancer using relevant in-vitro models and primary tumours. This investigation included the following sub-objectives:

• Identify genes with strong outlier expression profiles in colorectal cancer and establish their expression in particular cell lines.

• Investigate the expression level of individual exons in known and putative fusion genes and in genes with strong outlier expression profile.

• Investigate the 5'-end of mRNA from individual genes in cell lines and/or tumour samples with aberrant exon expression in the 3'-end of the gene.


3. Materials and methods

3.1 Materials

3.1.1 Cancer cell lines – colon cancer, testicular cancer, and leukaemia

The master project involved analyses of 5 testicular cancer cell lines (NTERA2, TERA1, TERA2, NCCIT, and 2102Ep), 4 leukaemia cell lines (RCH-ACV, REH, TOM-1, and 697), and 20 colon carcinoma cell lines (HCT15, HT29, Lovo, LS174T, SW48, CO115, SW480, Colo320, HCT116, ALA, TC71, RKO, TC7, LS1034, IS1, IS2, IS3, V9P, EB, and FRI). The colon carcinoma cell lines were chosen to represent the different subtypes of colorectal cancer [104], e.g. MSS and MSI. RNA was isolated by Trizol (Invitrogen, Carlsbad, California, USA).

3.1.2 Tissue samples – malignant peripheral nerve sheath tumours and colorectal carcinomas

Malignant peripheral nerve sheath tumours

Tumour RNA from MPNST samples from 15 Norwegian and Swedish patients was included in the present study. All samples had either been examined by a group of specialist sarcoma pathologists (Swedish samples) or were re-examined by national reference sarcoma pathologists for confirmation of the MPNST diagnosis (Norwegian samples). RNA was isolated by Trizol.

Colorectal cancer

Ten primary colorectal carcinoma samples from a prospective clinical series were included in this study. The series was collected at seven hospitals in the South-Eastern part of Norway between 1987 and 1989 [105]. RNA was isolated by the All prep DNA/RNA mini kit (Qiagen Co., Valencia, California, USA).

Ten normal colorectal samples from cancer patients were also included. These were collected at Aker University Hospital in a period between 2005 and 2007. RNA was isolated by the All prep DNA/RNA mini kit (Qiagen) and the Ribopure™ kit (Applied Biosystems/Ambion, Foster City, California, USA).

3.2 Publicly available databases

Throughout this master project, sequence information about genes and their different transcripts has been investigated using the Ensembl genome browser⁹ and all described sequences in this thesis are in compliance with release 50, published July 2008. Sequence specificities, on the other hand, have been assessed by BLAST (Basic Local Alignment Search Tool) searches. These searches were carried out in the human genomic plus transcript database, by use of the nucleotide blast program and the megablast algorithm.¹⁰

3.3 Establishment of methodological protocols in existing projects

3.3.1 Analysis of BIRC5 transcript variants in MPNST samples

In MPNST samples a 2 Mb region at 17q25 is commonly gained [106]. Several genes are localized in this region and can therefore represent good candidates for MPNST tumourigenesis. One of these genes is BIRC5 and analysis of its transcript variants in MPNST samples was included at the start of the master project to establish methodological protocols.

9 www.ensembl.org

10 http://blast.ncbi.nlm.nih.gov


Four different transcript variants are known for BIRC5 [107] and one set of primers was designed to amplify all of them by annealing in the second and last exon (see Appendix I for more details about the primers). With these primers the different transcript variants produce PCR products of different sizes and can thereby easily be distinguished from one another by gel electrophoresis (Figure 6).

RT-PCR was performed as explained in section 3.5.1 for 15 MPNST samples. The products were then eluted from agarose gel and sequenced as described in sections 3.5.3 and 3.5.5.

Figure 6. Schematic representation of BIRC5 transcript variants. BIRC5 has four different transcript variants called survivin, survivin-3B, survivin-2B and survivin-Δex3. Survivin consists of four exons whereas survivin-3B has an additional exon (3B) between exons three and four. Survivin-2B has an additional exon (2B) between exons two and three, and survivin-Δex3 lacks exon three and has an extension of the reading frame into the 5’ untranslated region (UTR). Primers are shown with black arrows. [Modified from ref. 107]

3.3.2 Validation of fusion gene microarray data

A microarray-based approach for simultaneous analysis of all known fusion gene variants was designed prior to this master project. This microarray consists of chimeric oligos targeting all possible combinations of exon junctions between the 5' and 3' partners of 275 known fusion genes. That is, every probe starts with a sequence from the end of one exon in the 5’ fusion partner and ends with a sequence from the start of an exon in the 3’ fusion partner. In addition, it included intragenic oligos for measurement of longitudinal profiles along each of the fusion gene partners for altogether 115 genes (Figure 7).
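The construction of chimeric oligos described above can be sketched as follows. This is a minimal illustration only: the function name, the toy exon sequences, and the number of bases taken from each side of the junction are assumptions, since the actual oligo lengths are not stated here.

```python
from itertools import product


def chimeric_probes(five_prime_exons, three_prime_exons, half_length=18):
    """Build one junction probe per exon-exon combination.

    Each probe starts with the last `half_length` bases of an exon in
    the 5' fusion partner and ends with the first `half_length` bases
    of an exon in the 3' fusion partner. Exons are numbered from 1 and
    the result maps (5' exon number, 3' exon number) to the probe.
    The half_length default is a hypothetical value.
    """
    probes = {}
    numbered_5p = enumerate(five_prime_exons, start=1)
    numbered_3p = list(enumerate(three_prime_exons, start=1))
    for (i, upstream), (j, downstream) in product(numbered_5p, numbered_3p):
        probes[(i, j)] = upstream[-half_length:] + downstream[:half_length]
    return probes


# Toy sequences (hypothetical): gene A has two exons, gene B has three.
gene_a = ["AAAACCCC", "GGGGTTTT"]
gene_b = ["ACGTACGT", "TTTTAAAA", "CCCCGGGG"]
probes = chimeric_probes(gene_a, gene_b, half_length=4)
print(len(probes), probes[(2, 1)])
```

For a 5' partner with m exons and a 3' partner with n exons this yields m × n junction probes, which is why the number of chimeric oligos grows quickly with the number of fusion genes covered.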


Figure 7. Design of fusion gene microarray. A fusion gene microarray was designed with chimeric oligos (grey lines) targeting every possible fusion point in a putative fusion gene (only the chimeric oligos from exon two in gene A to the four exons in gene B are depicted). The microarray also included intragenic oligos (blue and green lines) measuring the expression of each exon in the putative fusion gene partners. I) The longitudinal exon expression profile from gene A. II) Overview of the different chimeric oligos, with the oligo targeting the fusion point shown in red. III) The longitudinal exon expression profile from gene B.

A set of seven samples with known presence of one fusion gene each was used as positive controls in the fusion gene microarray. This included four prostate cancer samples positive for the TMPRSS2-ERG fusion gene and three leukaemia cell lines, RCH-ACV, REH, and TOM-1, known to carry the TCF3-PBX1, ETV6-RUNX1, and BCR-ABL1 fusion genes, respectively. Validation of the actual fusion junctions in the prostate samples was done by a collaborating group (headed by Manuel Teixeira, Portuguese Oncology Institute, Porto, Portugal), whereas for the leukaemia cell lines the validation was included as part of the master project. This validation was performed by RT-PCR and DNA sequencing.

For TCF3-PBX1 the probe with 5’ sequence from the end of TCF3 exon 15 and 3’ sequence from the start of PBX1 exon three showed the highest relative intensity of all the chimeric exon-exon probes between the two genes. A forward and a reverse primer were therefore designed to anneal in exon 15 of TCF3 and exon three of PBX1, respectively, yielding a chimeric fusion product of 218 base pairs (see Appendix I for primer sequences). For ETV6-RUNX1 the probe with the highest relative intensity showed the breakpoint to be between ETV6 exon five and RUNX1 exon two. The forward and reverse primers were therefore designed to anneal in exon five of ETV6 and exon two of RUNX1, respectively, yielding a chimeric fusion product of 204 base pairs. From the TOM-1 cell line the expected BCR-ABL1 fusion transcript could not be seen from the fusion gene microarray. To verify the presence of this fusion gene in the cell line, the normal breakpoints in BCR-ABL1 were found in the literature to be between BCR exon one or 13 and ABL1 exon three [108]. Forward primers were therefore designed to anneal in BCR exon one and 13, and reverse primers in ABL1 exon three. RT-PCR was performed as explained in section 3.5.1. The PCR products were separated by gel electrophoresis in a 2 % agarose gel (see section 3.5.3) and sequenced in both directions using the same primers as for RT-PCR (see section 3.5.5).

3.3.3 Analysis of putative fusion genes in a testicular cancer cell line

In 2004 Hahn et al. [109] described a procedure for identifying fusion gene transcripts. They used publicly available mRNA and expressed sequence tag (EST) databases and identified 237 potential fusion genes. Among these, 60 were already known, thus validating the approach. The remaining gene pairs were considered novel putative fusion genes. In the original study, laboratory validation was performed for only one of these putative fusion genes. Three of the putative fusion genes, ZDHHC20-MTHFD2L, MIER1-GPR177, and Hs.446400-USP11, were all identified from sequences from the testicular cancer cell line NTERA2. Laboratory investigation of these genes was included in the master project.

ZDHHC20-MTHFD2L was found in three entries in the mRNA and EST databases (AK023167, AX794778, and AU131312). One of these was retrieved from NTERA2 (AU131312). Sequences from these entries consist of the sequence from exon one in ZDHHC20 directly followed by the sequence from exon six in MTHFD2L. A forward and a reverse primer were therefore designed to anneal in exon one of ZDHHC20 and exon seven in MTHFD2L, respectively. MIER1-GPR177 was found in one mRNA database entry (AK074990). The sequence in this entry consists of exons one to nine of MIER1 (ENST00000371011; Ensembl) followed by exons two to twelve of GPR177 (ENST00000262348; Ensembl). As for the first putative fusion gene, forward and reverse primers were designed to anneal in exon nine of MIER1 and exon two of GPR177, respectively. One mRNA database entry (AK092258) gives the sequence of the third putative fusion gene. The 5’ putative fusion partner of USP11 is an unknown gene and is here given the UniGene-ID as a name: Hs.446400. The entry sequence consists of Hs.446400 followed by exons six to 21 in USP11 (ENST00000218348; Ensembl) and primers were designed as for the genes above in Hs.446400 and in exon six of USP11. In addition to the three primer pairs designed to amplify the breakpoints between the fusion partners, primers for intragenic controls were also made. That is, a reverse primer in the ZDHHC20 sequence, a forward primer in the MTHFD2L sequence etc. See Appendix I for sequence, annealing temperature and GC-content of the primers used.

RT-PCR was performed as explained in section 3.5.1 for all three genes in the NTERA2 cell line. For each gene both the primer pair targeting the fusion breakpoint and the pair for intragenic control were used. The products were separated by gel electrophoresis on a 2 % agarose gel as described in section 3.5.3.

3.4 Design of a novel strategy for identification of fusion genes

The strategy for identification of fusion genes in colorectal cancer is summarized in Figure 8.

3.4.1 Outlier expression profiles

To highlight genes that are expressed at an increased level in a group of samples, the Gene Tissue outlier Index (GTI) was designed by collaborators (Mpindi et al., unpublished). GTI is an absolute measure based on a cut-off point and different cut-offs can be used. The formula is as follows:

$$\mathrm{GTI} = \left(\frac{P}{N}\right) \cdot \left(\frac{A - B}{A}\right)$$

where P is the number of samples with expression values above cut-off, N is the total number of samples in the group, B is the cut-off and A is the average expression of samples above cut-off. Total GTI (GTIT) is calculated by subtracting GTI for the normal group (GTIN) from GTI for the group of interest, e.g. cancer (GTIC), and multiplying by a constant c, which in our case was set to a thousand:

$$\mathrm{GTI}_T = (\mathrm{GTI}_C - \mathrm{GTI}_N) \cdot c$$

A positive GTIT means that the cancer group has a stronger outlier expression profile than the normal group taking into account the number of samples for a particular gene and tissue. Negative values indicate a stronger outlier expression profile in the normal group.
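As a worked illustration of the two formulas, the sketch below computes GTI for one group and the total GTI for a cancer and a normal group. The function names and the example values are hypothetical, and returning zero when no sample exceeds the cut-off is an assumption, since the formula itself is undefined in that case.

```python
def gti(values, cutoff):
    """GTI = (P / N) * ((A - B) / A) for one group of samples.

    P: number of samples above the cut-off, N: number of samples,
    B: the cut-off, A: mean expression of the samples above the cut-off.
    Returns 0.0 when no sample lies above the cut-off (an assumption).
    Expression values are assumed to be positive.
    """
    above = [v for v in values if v > cutoff]
    if not values or not above:
        return 0.0
    p, n = len(above), len(values)
    a = sum(above) / p
    return (p / n) * ((a - cutoff) / a)


def gti_total(cancer, normal, cutoff, c=1000):
    """Total GTI: (GTI_C - GTI_N) * c, with c = 1000 as in the text."""
    return (gti(cancer, cutoff) - gti(normal, cutoff)) * c


# Hypothetical expression values for one gene.
cancer = [10, 10, 10, 40, 60]   # two of five samples above the cut-off
normal = [10, 10, 10, 10, 10]   # no sample above the cut-off
print(gti(cancer, cutoff=20), gti_total(cancer, normal, cutoff=20))
```

In this example P = 2, N = 5, A = 50 and B = 20, so GTI for the cancer group is (2/5) · (30/50) = 0.24, and with a GTI of zero in the normal group the total GTI becomes 240.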

In the present study, a database with gene expression values and the GTI calculations was provided to us by a collaborating group (Professor Olli Kallioniemi, VTT Technical Research Centre of Finland). The database, In Silico Transcriptomics, contains Affymetrix gene expression data from 15,000 human samples. The gene expression values have a common normalisation across all samples, which enables analyses across the whole sample set. Data from a subset of this database was recently published [110]. GTIT was calculated three times (GTIT1, GTIT2 and GTIT3) with three different cut-offs for 17,000 genes in colorectal tissues. For GTIT1 the interquartile range (IQR) plus the 75th percentile (p75) expression value (IQR + p75) was used as cut-off, whereas the cut-offs for GTIT2 and GTIT3 were the 95th and 90th percentiles (p95 and p90), respectively. Tissue specific GTI (GTIS1, GTIS2 and GTIS3) were also calculated by subtracting GTI for all normal tissue, except colorectal, from GTI for all colorectal samples (cancer and normal). The 17,000 genes were ranked based on each of the six individual GTI calculations. Furthermore, for each gene a combined GTI (GTICom) was made from the sum of the lowest GTIT and the lowest GTIS.
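The three cut-offs can be illustrated with a short sketch. Which samples the percentiles were computed over is not stated here, so the sketch simply takes one list of expression values for a gene; the function names, the dictionary keys, and the interpolation rule used for the percentiles are assumptions.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q between 0 and 100) of a list."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def gti_cutoffs(values):
    """Return the three cut-offs described in the text for one gene.

    iqr_plus_p75: interquartile range plus the 75th percentile
    p95, p90:     the 95th and 90th percentiles
    """
    p25 = percentile(values, 25)
    p75 = percentile(values, 75)
    return {
        "iqr_plus_p75": (p75 - p25) + p75,
        "p95": percentile(values, 95),
        "p90": percentile(values, 90),
    }


# Hypothetical expression values 1, 2, ..., 101 for one gene.
cutoffs = gti_cutoffs(list(range(1, 102)))
print(cutoffs)
```

For skewed data the IQR + p75 cut-off can lie above all percentile-based cut-offs, so the three settings differ in how extreme a sample must be to count as an outlier.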

For this project the 50 top-ranked genes in each of the seven GTI calculations were included for further analysis. We extended this candidate list to include genes with ranks up to 200 if they were classified as either transcription factor or tyrosine kinase, because these are the two most common functions among fusion gene partners [47]. Expression data for all these genes in six colorectal cancer cell lines (HT29, HCT15, SW48, SW480, RKO, and LS1034) were investigated. Genes where one or more of the cell lines showed different expression from the rest were further investigated using Affymetrix exon microarray data from the same cell lines (see section 3.4.3).
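The selection rule can be sketched as below. The sketch applies the extension to rank 200 within every ranking, which is an assumption (Figure 8 describes the extension for the combined list); the function name, argument names and class labels are likewise hypothetical.

```python
def select_candidates(rankings, gene_class, top_n=50, extended_n=200,
                      priority=("transcription factor", "tyrosine kinase")):
    """Collect candidate genes from several ranked gene lists.

    rankings:   dict mapping a calculation name to a list of genes,
                best-ranked gene first
    gene_class: dict mapping a gene to its functional class
    Every gene ranked within top_n is kept; genes ranked between top_n
    and extended_n are kept only if their class is in `priority`.
    """
    selected = set()
    for ranked_genes in rankings.values():
        selected.update(ranked_genes[:top_n])
        for gene in ranked_genes[top_n:extended_n]:
            if gene_class.get(gene) in priority:
                selected.add(gene)
    return selected


# Toy example with shortened limits (hypothetical gene names).
rankings = {"GTIT1": ["g1", "g2", "g3", "g4"], "GTIS1": ["g2", "g5", "g1", "g6"]}
gene_class = {"g3": "tyrosine kinase", "g6": "transcription factor"}
candidates = select_candidates(rankings, gene_class, top_n=1, extended_n=3)
print(sorted(candidates))
```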

Additionally, four genes (HOXB13, CST1, FZD10, and WIF1) were included for investigation in the laboratory (see section 3.5) based on the outlier expression profile alone. Real-time reverse transcription PCR (see section 3.5.1) was done for each of the four genes in each of the 20 colorectal carcinoma cell lines listed in section 3.1. For all four genes, one or two cell lines showed higher expression than the rest. These cell lines were investigated further by 5'-RACE, cloning and sequencing (section 3.5).

3.4.2 Known and putative 3’ fusion gene partners

A literature survey was performed and all known fusion genes collected in a database. This database was established before the master project, but further curation and expansion of it was included as part of the project (Appendix III). All 3' fusion partners from this database were included for further analysis of exon expression levels (see section 3.4.3).

As mentioned in section 3.3.3, Hahn et al. [109] described a procedure for identifying fusion gene transcripts in 2004. All putative 3' fusion partners from this procedure and 5' putative fusion partners from colorectal carcinoma cell lines were included for further analysis (see section 3.4.3).

Because of their newly proven importance in prostate cancer [reviewed in ref. 48] and the fact that the same fusion genes can be found in different cancer types [111], all 28 members of the ETS gene family were included for analysis of exon expression levels (see section 3.4.3).

3.4.3 Exon microarray analysis

The GeneChip® Human Exon 1.0 ST Array (Affymetrix, Santa Clara, CA, USA) provides genome-wide detection of RNA expression at both gene and exon levels. The microarray has approximately 5.4 million probes grouped into 1.4 million probesets examining more than a million known and predicted exons [112]. The probes are distributed in the different exons along the entire transcript length, and for a gene with ten exons, there are roughly 40 probes matching its sequence. With probes in different exons along the transcript it is possible to monitor the level of expression for each exon compared with the others in the gene and thereby detect different transcript variants created after events such as alternative splicing and alternative promoter usage or poly-adenylation sites.

Ten normal colonic tissue samples, ten colorectal cancer tissue samples and six colorectal cancer cell lines (HT29, HCT15, SW48, SW480, RKO, and LS1034) were analysed. Raw data were imported into the XRAY software (version 2.81; Biotique Systems Inc., Reno, Nevada, USA) where quantile normalisation and calculation of probeset expression values were performed and summarized. Only “core” probesets (RefSeq and full-length GenBank mRNAs) were analysed and the expression score for a probeset was defined to be the median of its probe expression scores. For each probeset the log2-ratio of the expression level in test samples to that observed in control samples was calculated.
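The two summary steps described above, the median over probes and the log2-ratio between test and control, can be written out as a small sketch. It is not the XRAY implementation; the function names and the intensity values are hypothetical, and the inputs are assumed to be positive intensities on a linear scale.

```python
from math import log2
from statistics import median


def probeset_score(probe_scores):
    """Expression score of a probeset: the median of its probe scores."""
    return median(probe_scores)


def log2_ratio(test_level, control_level):
    """log2 of the expression level in test relative to control."""
    return log2(test_level / control_level)


# Hypothetical probe intensities for one probeset (one exon).
control = probeset_score([100, 120, 80, 400])   # median is 110
test = probeset_score([430, 450, 300, 900])     # median is 440
print(control, test, log2_ratio(test, control))
```

Using the median makes the probeset score robust against a single aberrant probe, such as the value 400 in the control example, and a log2-ratio of 2 corresponds to a four-fold higher expression in the test sample.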


Exon microarray data were investigated from genes resulting from all the three different input strategies (outlier expression profiles, known and putative fusion genes, and ETS family members). The longitudinal exon expression profile along the entire transcript length of each gene was visualized by an in-house created Visual Basic script, and evaluated manually by looking for profiles where individual samples were overexpressed only in the 3' part of the transcript compared to the rest of the samples (examples in Figure 16 and Figure 20; Results). Genes with this type of profile were investigated further in the laboratory with 5’-RACE, cloning and sequencing (see section 3.5).
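The profiles in this study were evaluated manually. Purely as an illustration of the pattern that was looked for, the sketch below scans one sample's exon profile for the position where the 3' part is most elevated relative to the 5' part. The function name, the minimum segment length and the example values are hypothetical, and this scan was not part of the analysis described here.

```python
from statistics import mean


def best_three_prime_split(profile, min_exons=2):
    """Find the exon position with the largest 3' versus 5' difference.

    profile: per-exon values for one sample (for example log2-ratios
             against the other samples), ordered from 5' to 3'
    Returns (k, difference), where k is the zero-based index of the
    first exon in the 3' segment, or None if the gene has too few
    exons to form two segments of at least min_exons each.
    """
    best = None
    for k in range(min_exons, len(profile) - min_exons + 1):
        difference = mean(profile[k:]) - mean(profile[:k])
        if best is None or difference > best[1]:
            best = (k, difference)
    return best


# Hypothetical profile: exons five to eight are strongly elevated.
profile = [0.1, -0.1, 0.0, 0.0, 3.0, 3.2, 2.8, 3.0]
print(best_three_prime_split(profile))
```

A large positive difference at some position is compatible with a fusion breakpoint or an alternative promoter upstream of that exon, which is why such genes would still need laboratory follow-up.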


Figure 8. Flowchart for identification of fusion genes. Outlier expression profiles were calculated for 17,000 genes in three different ways using three different cut-offs. The 50 genes with highest score in each of the calculations were chosen for further analysis. Tissue specificity profiles were calculated in three different ways using the same three cut-offs and the 50 genes on top of each list were analysed further. The six calculations were also combined and the 50 genes on top were analysed further. Genes with transcription factor or tyrosine kinase activity among the top 200 in this combined list were also included. The genes were analysed using expression data from six colorectal cancer cell lines, and genes which were differentially expressed in one or more cell lines, compared to the rest, were analysed using exon microarray data. All known and putative fusion genes and all members of the ETS gene family were also analysed in this way. Genes where one or more samples had overexpression in only the 3'-end of the transcript were analysed further; first by real-time RT-PCR, then by 5'-RACE followed by nested RACE. RACE products of different lengths were separated by agarose gel electrophoresis and eluted individually. Eluted RACE products were cloned by use of TOP10 competent cells, and plasmid DNA was extracted from overnight cultures by Miniprep. The plasmid DNA was then sequenced.

3.5 Experimental assays

In the following, background information to the individual experimental protocols is indicated by green font.

3.5.1 Polymerase chain reaction

In 1970, Kleppe and colleagues proposed that repeating a repair replication reaction over and over again could amplify a DNA fragment of interest [113]. However, it was Mullis who defined the polymerase chain reaction (PCR) as we know it today [114].

The PCR reaction consists of three steps: denaturing, annealing and extension. High temperature (90-95°C) leads to separation of the two DNA strands in the denaturing step. In the second step, cooling of the reaction (50-65°C) allows oligonucleotide primers to anneal to the single-stranded DNA, and in the extension step, raising the temperature (72°C) allows the DNA polymerase to extend the primers by adding deoxynucleotides along the target DNA.

Reverse transcription PCR

Reverse transcription PCR (RT-PCR) is used when the starting material is a ribonucleic acid (RNA). During this process the RNA is first reverse transcribed into its complementary DNA (cDNA) using an RNA-dependent DNA polymerase (reverse transcriptase), and then the cDNA is amplified using a normal PCR reaction.

For the first-strand cDNA synthesis, reverse transcription was performed using the High-Capacity cDNA Archive Kit (Applied Biosystems, Foster City, CA, USA). The reaction mix consisted of 1 x RT Buffer, 1 x dNTP mix, 1 x random primers, 250 U Multiscribe reverse transcriptase, and 5 µg RNA template. The reaction volume was adjusted to 100 µl with Milli-Q water. The reaction mix was incubated on an MJ Mini Gradient Thermal cycler (BIO-RAD, Hercules, California, USA), first at 25°C for 10 min for the primers to anneal; cDNA was then synthesized at 37°C for 2 h, and finally the reaction was terminated at 85°C for 5 sec. All the RNA was assumed to be converted to cDNA during this incubation, giving a yield of 5 µg of cDNA.

The PCR reaction was performed using HotStar Taq DNA polymerase Kit (Qiagen).

The reaction mix included 1 x PCR buffer, 1.5 mM MgCl2, 0.2 mM of each of the four dNTPs, 4 pmol of each gene-specific primer, 1 unit HotStar polymerase, 50 ng template (cDNA), and Milli-Q water to a total volume of 25 µl. The cDNA was amplified in a Robocycler Gradient 96 (Stratagene, La Jolla, California, USA). The cycling conditions were as follows: first denaturing for 15 min at 95°C, then 27-33 cycles of denaturing for 30 sec at 95°C, annealing for 75 sec at 58-63°C, and elongation for 15 sec at 72°C. In the end a further extension period of 7 min at 72°C was added. The number of cycles was determined by the reaction efficiency. All oligonucleotide primers were designed using the Primer3 software¹¹ with default settings. The specificity of the primer sequences was assessed by BLAST searches,¹² and the primers were checked for hairpin secondary structures and/or primer-dimer formation using the NetPrimer software.¹³ All primers were purchased from MedProbe (Oslo, Norway). For primer sequences see Appendix I.

Real-time RT-PCR

In the term real-time RT-PCR, real-time refers to the constant monitoring principle of the technique, which enables measurement of the relative number of copies present, or newly generated, after each cycle [115]. There are a number of real-time RT-PCR detection chemistries available [116], but in this thesis only hydrolysing probes were used. A hydrolysing probe is sequence-specific and labelled with a reporter dye on the 5’-end and a quencher dye on the 3’-end. The probe binds to the target sequence and is degraded by the nuclease activity of the DNA polymerase during the extension step of the PCR reaction. This degradation separates the reporter from the quencher dye, which results in increased emission of fluorescence [117]. In other words, the more PCR product generated in a particular cycle, the greater the fluorescence emission.

11 http://frodo.wi.mit.edu/

12 http://blast.ncbi.nlm.nih.gov

13 www.premierbiosoft.com

The first PCR cycle at which the fluorescence intensity is greater than the background fluorescence is called the cycle threshold (Ct). Consequently, the greater the quantity of target DNA in the starting material, the earlier this threshold is reached. In this thesis, absolute quantitation was used. In this method, a standard curve is produced from serially diluted standards of known concentrations. The curve gives a linear relationship between the Ct and the logarithm of the initial amount of cDNA [118].
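A standard curve of this kind can be sketched as a straight-line fit of Ct against the base-10 logarithm of the starting quantity, which is then inverted to estimate the quantity in an unknown sample. The function names, the dilution series and the Ct values below are hypothetical and idealised; they are not data from this thesis.

```python
import math

def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(cts) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the starting quantity."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical ten-fold dilution series (ng cDNA) with idealised Ct values
standards = [100, 10, 1, 0.1]
standard_cts = [20.0, 23.32, 26.64, 29.96]
slope, intercept = fit_standard_curve(standards, standard_cts)
unknown = quantity_from_ct(25.0, slope, intercept)  # roughly 3 ng in this example
```

A slope close to -3.32 cycles per ten-fold dilution, as in the idealised values used here, corresponds to a doubling of product in every cycle.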

The cDNA synthesis was performed using the same kit as described in the previous section, and the quantitative PCR was carried out in 96-well plates. A 10 µl reaction volume consisted of 1 x TaqMan Fast Universal PCR Mastermix (dNTPs, enzyme, and buffer) (Applied Biosystems), 1 x TaqMan assay (Applied Biosystems), and 10 ng cDNA. To amplify the cDNA, reactions were incubated for 20 sec at 95°C, followed by 40 cycles of 1 sec at 95°C and 20 sec at 60°C.

All samples were run in triplicate and the median expression was used. To correct for sample-to-sample variation, all results were normalized against two control genes, GUSB and ACTB.
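One way to combine the triplicate median with normalization against the two control genes is sketched below. The text does not state how the two control genes were combined, so dividing by the mean of their medians is an assumption of this sketch, and the function name and quantities are hypothetical.

```python
from statistics import mean, median

def normalised_expression(target_reps, gusb_reps, actb_reps):
    """Median of the triplicates for each gene; the target median is divided by
    the mean of the two control-gene medians (assumed combination rule)."""
    reference = mean([median(gusb_reps), median(actb_reps)])
    return median(target_reps) / reference

# Hypothetical quantities read off a standard curve (arbitrary units)
value = normalised_expression([4.1, 3.9, 4.0], [2.1, 2.0, 1.9], [6.0, 6.2, 5.8])
```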

3.5.2 Rapid Amplification of cDNA Ends

The complete 5’- and 3’-ends of cDNA can be amplified by PCR, using a technique variously called rapid amplification of cDNA ends (RACE) [119], one-sided PCR [120] and anchored PCR [121]. The technique uses PCR to amplify partial cDNAs that represent the region between the 5’- or 3’-end and a single point in an mRNA transcript. The main requirement is that a short stretch of sequence in the mRNA of interest is known. A gene-specific primer (GSP), oriented in the direction of either the 5’- or 3’-end, is designed to anneal in the already known sequence. Extension of the cDNA from the end and back to the known region is achieved by using a primer annealing to the pre-existing poly(A) region (3’-RACE) or to an appended homopolymer tail or linker (5’-RACE) [122].

5’-RACE

In this thesis 5’-RACE was performed using the SMART RACE cDNA Amplification kit (Clontech, Mountain View, California, USA). See Figure 9 for an illustration of the 5’-RACE reaction, which is explained in the following. The first-strand synthesis is primed with an oligo-(dT) primer and performed by a Moloney murine leukemia virus reverse transcriptase (MMLV RT) which, upon reaching the 5’-end of the mRNA template, adds 3-5 residues (predominantly cytosines) to the 3’-end of the first-strand cDNA. A SMART II A oligo in the reaction mix contains a terminal stretch of G-residues which anneals to this cDNA tail. MMLV RT switches template from the mRNA to the SMART oligo and generates a complete cDNA copy of the mRNA with the additional SMART sequence at the end. MMLV RT’s terminal transferase activity is most efficient when the enzyme has reached the end of the RNA template, and the SMART sequence is therefore typically added only to complete first-strand cDNAs.

The 5’-end of the cDNA can then be amplified using a universal primer (UP) which anneals in the SMART sequence and a primer specific for the gene of interest. The GSP must be between 23 and 25 nucleotides long, have a GC-content between 50 and 70 percent, and an annealing temperature above 70°C.
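The length and GC-content criteria for a GSP can be checked with a short sketch like the one below. The function names and the candidate primer are hypothetical. The temperature criterion is deliberately left out, because it depends on which melting-temperature model is used, and the text does not specify one.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a primer sequence."""
    seq = seq.upper()
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)

def gsp_passes_basic_checks(seq):
    """Check only the length (23-25 nt) and GC-content (50-70 %) criteria."""
    return 23 <= len(seq) <= 25 and 50.0 <= gc_percent(seq) <= 70.0

candidate = "GCCAGGTCCTGACGATGCAGTCCA"  # hypothetical 24-nt primer, 62.5 % GC
ok = gsp_passes_basic_checks(candidate)
```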

On occasion, a reverse transcription reaction can be non-specifically primed and result in a cDNA containing the SMART sequence at both ends. To reduce the likelihood of such aberrant products, a mixture of long and short UPs (with excess of the short UP) is used. The long UP contains inverted repeat elements. During PCR of a cDNA with the SMART sequence at both ends, the long UP will anneal at both ends and the inverted repeats anneal to each other, making a panhandle-like structure. This blocks amplification of such aberrant products because the short UPs are unable to anneal.

Figure 9. 5'-RACE reaction. The illustration is thoroughly explained in the main text. Colours: Dark blue, RNA; light blue, cDNA; green, SMART oligo; orange, primers.

Generation of 5’-RACE-ready cDNA was performed using the SMART RACE cDNA amplification kit (Clontech) and PrimeScript reverse transcriptase (Takara Bio Inc., Otsu, Shiga, Japan). One µg total RNA was combined with 2.4 µM oligo-(dT) primer, 2.4 µM SMART II A oligo, and sterile water to a total volume of 5 µl. The reaction mix was first incubated at 70°C for 2 min to allow the primers to anneal and then on ice for 2 min before adding 1 x first-strand buffer, 2 mM dithiothreitol (DTT), 1 mM dNTP, and 200 U PrimeScript reverse transcriptase to a total volume of 10 µl. Elongation of the cDNA at 42°C for 90 min followed. The first-strand reaction was then diluted in 100 µl Tricine-EDTA buffer and the reaction was stopped by incubation at 72°C for 7 min.

RACE reactions were performed using the SMART RACE cDNA amplification kit and the Advantage 2 PCR kit (Clontech). 1 x Advantage 2 PCR buffer, 0.2 mM dNTP mix, 1 x Advantage 2 PCR polymerase mix, 2.5 µl RACE-ready cDNA, 1 x Universal primer mix (UPM), 0.2 µM GSP, and PCR-grade water were combined to a final volume of 50 µl. The cycling conditions were as described in Table 1.

Nested RACE was then performed by combining the same reagents as for RACE, but this time with 5 µl diluted RACE product as template and nested primers. The nested RACE was run for 25 cycles of 30 sec at 94°C, 30 sec at 68°C, and 3 min at 72°C.

Table 1. Cycling conditions for 5'-RACE.

Cycles       Temperature   Time
5 cycles     94°C          30 sec
             72°C          3 min
5 cycles     94°C          30 sec
             70°C          30 sec
             72°C          3 min
25 cycles    94°C          30 sec
             68°C          30 sec
             72°C          3 min
             72°C          7 min
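The cycling conditions in Table 1 can also be written down as data, which makes it easy to count the amplification cycles and the total programmed hold time. This is only an illustrative sketch: the variable and function names are hypothetical, and the time spent ramping between temperatures is not included.

```python
# Each stage: (number of repeats, [(temperature in °C, hold time in seconds), ...])
RACE_PROGRAM = [
    (5,  [(94, 30), (72, 180)]),
    (5,  [(94, 30), (70, 30), (72, 180)]),
    (25, [(94, 30), (68, 30), (72, 180)]),
    (1,  [(72, 420)]),  # final 7 min step at 72°C
]

def total_cycles(program):
    """Number of amplification cycles, i.e. repeats of stages that include denaturation."""
    return sum(n for n, steps in program if any(temp >= 90 for temp, _ in steps))

def total_hold_seconds(program):
    """Sum of all programmed hold times, excluding ramping between temperatures."""
    return sum(n * sum(seconds for _, seconds in steps) for n, steps in program)

cycles = total_cycles(RACE_PROGRAM)            # 35 cycles
minutes = total_hold_seconds(RACE_PROGRAM) / 60  # 144.5 min
```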

3.5.3 Detection of PCR-products

Fragment analysis

Fragment analysis of PCR products was performed in a 96-well optical reaction plate by combining a fluorescent ladder, formamide and 0.5 µl PCR product. The DNA was denatured for 5 min at 95°C before being cooled at 4°C for 30 sec. The plate was sealed with a 3100 Genetic Analyzer Plate Septa (Applied Biosystems), placed in a 96-well Plate Base and inserted into a fully automated AB 3730 DNA analyser (Applied Biosystems). Generation and analysis of electropherograms are described in more detail in section 3.5.5.

Gel electrophoresis

PCR products to be cloned were separated using agarose gel electrophoresis. The gels were made with 2 % agarose for separation of PCR products smaller than 1000 base pairs, whereas gels with 1 % agarose were used for larger products. The agarose (8 g for 2 % and 4 g for 1 % gels) was heated in 400 ml 1 x TAE, and 4 drops of ethidium bromide (GeneChoice, Frederick, USA) were added to enable visualization of DNA by UV light. Twenty-five µl of the PCR product was loaded together with 5 µl gel loading buffer, and the electrophoresis was performed at 200 V for 20-25 min. PCR products were thereafter cut out of the gel and eluted with the MinElute Gel Extraction Kit (Qiagen).

PCR products not destined for cloning were analysed using polyacrylamide gel electrophoresis (PAGE). A gel was made by combining 7.5 ml 7.5 % acrylamide, 5 µl TEMED (BIO-RAD), and 50 µl 20 % ammonium persulphate (APS). Five µl of the PCR product was loaded together with 3 µl gel loading buffer, and the electrophoresis was performed at 200 V for 20-25 min. To stain the DNA, the gel was submerged in an ethidium bromide-containing buffer for approximately 2 min.

3.5.4 Cloning

The purpose of cloning is to enable characterisation of individual molecules of a particular gene or transcript. The DNA molecule is ligated into a cloning vector and introduced into a host organism, of which the most commonly used is Escherichia coli. The vectors are small in size, have their own origin of replication and are usually present in many copies in each host organism cell. Selection markers, such as antibiotic resistance genes, enable selection for plasmid-containing host cell clones.

Cloning and transformation were performed using the TOPO TA Cloning Kit (Invitrogen). This kit takes advantage of topoisomerase I and the fact that it can bind to DNA and cleave the phosphodiester backbone after 5’-CCCTT-3’ [123]. The energy from the broken bond is conserved by formation of a covalent bond between the cleaved strand and the topoisomerase I. Before cloning, the vector is cut into linear form, with single 3’ thymidine (T) overhangs. Taq polymerase has a non-template-dependent terminal transferase activity, which adds a single deoxyadenosine (A) to the 3’-ends of PCR products. By reversing the cleavage reaction, the PCR product with its A-overhang is readily incorporated into the T-overhang-containing vector and the topoisomerase is released.

The vector contains the lethal ccdB gene fused to the LacZα gene. Ligation of the PCR product disrupts expression of the ccdB-LacZα gene and allows only positive recombinants to grow [124]. A gene for ampicillin resistance in the vector ensures that only transformed cells will grow in the presence of this antibiotic compound.

Four µl PCR product eluted from an agarose gel (as described in section 3.5.3) was mixed with 1 µl salt solution and 1 µl TOPO vector before incubation at room temperature for 30 min. The cloning reaction was then transferred to ice. Two µl of the reaction was transferred to a vial of One Shot TOP10 E. coli and incubated on ice for 5-30 min. The cells were given a heat shock for 30 sec at 42°C and immediately transferred back to ice. 250 µl of room temperature S.O.C. medium was added and the cells were incubated horizontally at 37°C and 200 rpm for 1 h. After the incubation, 50 µl and 75 µl of the transformation mix were spread on pre-warmed selective LB plates containing 100 µg/ml ampicillin. The plates were incubated overnight at 37°C.

Individual colonies were picked from selective plates and used to inoculate individual cultures consisting of 5 ml LB-medium and 10 µl ampicillin. The cultures were incubated at 37°C and 250 rpm overnight. Bacterial cells were then harvested by centrifugation and plasmid DNA was purified using the QIAprep Spin Miniprep kit (Qiagen).

3.5.5 DNA Sequencing

Two methods for DNA sequencing were proposed in 1977, by Maxam et al. [125] and Sanger et al. [126]. The Sanger method relies on the discovery made by Atkinson in 1969 that attachment of a dideoxynucleotide (ddNTP) to a growing chain of deoxyribonucleic acids blocks further synthesis [127]. Dideoxynucleotides lack the 3’-hydroxyl group necessary for addition of new deoxynucleotides (dNTP), and incorporation therefore leads to termination at that specific point.

The Sanger method was modified and automated by Smith et al. in 1986 [128] and is the basis for the sequencing protocol most commonly used today. By thermal cycling of the DNA template together with an oligonucleotide primer, a polymerase, dNTPs, and ddNTPs labelled with different fluorophores, the new DNA strand is elongated until a ddNTP is incorporated. The length of the new strand will depend on when during the elongation the ddNTP was incorporated. This leads to a reaction mix containing DNA strands of all possible lengths within the specified template, and after separating these by, e.g., capillary electrophoresis, the sequence can be read from the electropherograms.
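The principle of reading a sequence from chain-terminated fragments can be illustrated with a toy sketch: every fragment carries the label of its terminal ddNTP, and sorting the fragments by length (as the electrophoresis does) gives the bases in order. The function names and the example strand are hypothetical, and real base calling from electropherogram peaks is considerably more involved.

```python
import random

def terminated_fragments(new_strand):
    """All chain-terminated products of a strand: one per position,
    represented as (fragment length, label of the terminal ddNTP)."""
    return [(i + 1, base) for i, base in enumerate(new_strand)]

def read_sequence(fragments):
    """Sort the fragments by length and read off the terminal labels."""
    return "".join(base for _, base in sorted(fragments))

fragments = terminated_fragments("GATTACA")
random.shuffle(fragments)  # the fragments are mixed in the reaction tube
sequence = read_sequence(fragments)
```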

Reaction mix

The sequencing reaction was performed in a 96-well Optical Reaction Plate and consisted of purified template DNA (either PCR product eluted from agarose gel or plasmid DNA from Miniprep purification), primer (forward or reverse), BigDye Terminator v3.1 or v1.1 premix (Applied Biosystems), BigDye Sequencing buffer (Applied Biosystems) and Milli-Q water to a total volume of 10 µl. First, the reaction mixes were incubated at 96°C for 2 min, followed by 25 thermal cycles of 15 sec at 96°C, 5 sec at 50°C, and 4 min at 60°C. The thermal cycling was performed on an MJ Research Cycler (BIO-RAD).


The BigDye Terminator v3.1 premix was used when the fragments to be sequenced were longer than 500 base pairs, and the v1.1 premix for shorter fragments. The premix contains dNTPs and ddNTPs. The different ddNTPs are modified with fluorescent labels which emit light at specific wavelengths when exposed to a laser beam. This makes it possible to distinguish the different bases.

Product purification

After the sequencing reaction, unincorporated dye terminators, salts and other charged molecules must be removed. This was done using the BigDye Xterminator Purification Kit (Applied Biosystems). Forty-five µl of SAM™ solution and 10 µl of Xterminator™ were added to the sequencing reaction after completion of thermal cycling. The reaction mixes were then vortexed for 30 min and finally centrifuged briefly.

The SAM solution enhances the performance of the Xterminator solution and stabilises the post-purification reactions. The Xterminator, on the other hand, scavenges unincorporated dye terminators and free salts.

Capillary analysis

The 96-well Optical Reaction Plate was sealed with a 3100 Genetic Analyzer Plate Septa (Applied Biosystems), placed in a 96-well Plate Base, and inserted into a fully automated AB 3730 DNA analyser (Applied Biosystems). Inside the analyser the 48-capillary array is filled with POP7 polymer (Applied Biosystems). The samples are then loaded and separated according to size as they migrate through the polymer-filled capillaries. As the fluorescently labelled DNA fragments reach the detection window, a laser beam excites the dye molecules and causes them to fluoresce. The Data Collection software reads and interprets the fluorescence data before displaying them as an electropherogram. The samples were analysed using the software Sequencing Analysis 5.2 (Applied Biosystems), and all electropherograms were read both manually and automatically.

51 Results

4. Results

4.1 BIRC5 transcript variants

Three of the known transcript variants of BIRC5 (survivin, survivin-2B and survivin-Δex3) were detected by RT-PCR in the MPNST samples (see Figure 6, page 33, for an illustration of the gene and its transcript variants). Survivin was found in 14 out of 15 samples, survivin-Δex3 in 13 out of 15, and survivin-2B in only four out of 15 samples. Survivin produced a PCR product of 271 base pairs, survivin-2B a product of 340 base pairs and survivin-Δex3 a product of 153 base pairs. The results for four of the samples are shown in Figure 10A, and as an example, a part of the sequence for survivin-Δex3 is shown in part B.

Figure 10. Results from analysis of BIRC5. (A) The results for amplification of BIRC5 in four MPNST samples separated on a 2 % agarose gel. All four samples (lanes 1-4) produced three bands at approximately 350, 270, and 150 base pairs. (B) A representation of the results after sequencing of the different bands on the agarose gel. The sequence jumps from exon 2 to exon 4 and is therefore from the transcript variant survivin-Δex3. Abbreviations: M, 100 base pair size marker; N, negative control.

4.2 Fusion gene microarray

RT-PCR of TCF3-PBX1 and ETV6-RUNX1 gave products with the expected lengths. Sequencing of these products revealed the breakpoint between TCF3 and PBX1 to be downstream of exon 15 in TCF3 and upstream of exon three in PBX1. The breakpoint is shown by an electropherogram in Figure 11A. Similarly, as shown in Figure 11B, sequencing of the ETV6-RUNX1 RT-PCR product revealed the breakpoint to be 3' of exon five in ETV6 and 5' of exon two in RUNX1. The expected BCR-ABL1 fusion transcript could not be seen from the RT-PCR.

Figure 11. Validation of fusion gene microarray. (A) Sequencing of the RT-PCR product for TCF3-PBX1. The sequence under the blue bar is the 3’-end of TCF3 exon 15, whereas the sequence under the green bar is the 5’-start of PBX1 exon three. (B) Sequencing of the RT-PCR product of ETV6-RUNX1. The sequence under the blue bar is the 3’-end of ETV6 exon five and the sequence under the green bar is the 5’-start of RUNX1 exon two.

4.3 Putative fusion genes in TGCT

From the various putative fusion genes in TGCT, RT-PCR products were separated by gel electrophoresis (Figure 12). Only the intragenic controls produced bands on the agarose gel and none of the three putative fusion genes were detected.


Figure 12. PCR results from the analysis of putative fusion genes in TGCT. (A) Results from RT-PCR of ZDHHC20-MTHFD2L. 1: RT-PCR with forward primer in ZDHHC20 and reverse primer in MTHFD2L. 2: Intragenic control, ZDHHC20. 3: Intragenic control, MTHFD2L. (B) Results from RT-PCR of MIER1-GPR177. 1: RT-PCR with forward primer in MIER1 and reverse primer in GPR177. 2: Intragenic control, MIER1. 3: Intragenic control, GPR177. (C) Results from RT-PCR of Hs.446400-USP11. 1: RT-PCR with forward primer in Hs.446400 and reverse primer in USP11. 2: Intragenic control, Hs.446400. 3: Intragenic control, USP11. Abbreviations: M, 100 base pair size marker; N, negative control.

4.4 Novel strategy for identification of fusion genes

Altogether, the outlier expression analysis yielded 305 unique genes. Among these, 131 had expression profiles in which one or more of the six cell lines were more than 1.5-fold overexpressed compared to the rest, and were therefore included in the next step. Here, 508 genes (131 outliers, 349 known and putative fusion genes and 28 ETS family members) were investigated with the exon microarray (Figure 8, page 41). Eleven genes (RAD51L1, NKAIN2, VNN1, C4BPB, HOXC11, TFR2, SERPINB7, TFPT, GJB6, PRRX1, and PRRX2) had a longitudinal profile along the exons where one or two of the samples deviated from the rest only in the 3'-end. Five of these genes (TFR2, SERPINB7, C4BPB, VNN1, and GJB6) had outlier expression profiles in colorectal tissues, and the other six genes (PRRX1, PRRX2, NKAIN2, HOXC11, TFPT, and RAD51L1) are known fusion gene partners. None of the ETS family members and none of the putative fusion genes exhibited the desired profile. In addition, four genes were investigated based on their outlier expression profiles (WIF1, HOXB13, CST1, and FZD10). For each of the 15 genes, 5’-RACE and nested RACE were performed (see Figure 13 for representative results).
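The 1.5-fold criterion used to select the 131 genes can be sketched as below. How "the rest" was summarised is not stated in the text, so comparing each cell line with the median of the other five is an assumption of this sketch; the function name and expression values are hypothetical.

```python
from statistics import median

def overexpressed_lines(expression, fold=1.5):
    """Cell lines whose expression exceeds `fold` times the median of the other lines.
    Summarising the other lines by their median is an assumption."""
    hits = []
    for name, value in expression.items():
        rest = [v for other, v in expression.items() if other != name]
        if value > fold * median(rest):
            hits.append(name)
    return hits

# Hypothetical linear-scale expression values for one gene
gene = {"HT29": 100, "HCT15": 110, "SW48": 400, "SW480": 95, "RKO": 105, "LS1034": 90}
hits = overexpressed_lines(gene)  # only SW48 passes in this example
```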


Products were separated with gel electrophoresis, cut and eluted from the gel, cloned, and sequenced. No fusion genes were found from analysis of these genes, but novel transcript variants were found in all of the 15 genes.

Figure 13. Representative nested RACE results from analysis of PRRX2, RAD51L1, and VNN1. Lanes one, two, and three show the results from nested RACE for PRRX2, RAD51L1, and VNN1, respectively. Abbreviations: M1, 500 base pair size marker; N1, negative control for PRRX2; N2, negative control for RAD51L1; N3, negative control for VNN1; M2, 100 base pair size marker.

The exon expression profile of RAD51L1 in the SW48 cell line deviated from the other cell lines by having higher expression from exon seven and throughout the gene (Figure 14A). Five transcript variants with a total of 14 exons are known for RAD51L1, but sequencing of the 5'-RACE products from SW48 revealed six novel transcript variants which all included novel exons located inside intron number seven (Figure 14B). The novel exons are spliced together in different ways to create the different transcripts. See Appendix II for details about each transcript and the different exons. The nucleotide sequences of the novel transcripts were evaluated by use of the Translate tool for translation of nucleotide sequences into protein sequences.¹⁴ This revealed that the transcripts B and F contain open reading frames (i.e., a start codon which is not followed by an immediate in-frame stop codon) of 66 amino acids, and these are thus potentially protein-coding.
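The kind of open reading frame search described above can be sketched as follows. This is not the Translate tool itself: the sketch only scans the three forward frames, counts codons from an ATG up to the first in-frame stop codon (or the end of the sequence), and uses a hypothetical example sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf_codons(seq):
    """Longest run of codons starting at an ATG and ending before the first
    in-frame stop codon, searched in the three forward frames. Runs that reach
    the end of the sequence without a stop codon are counted as well."""
    seq = seq.upper()
    best = 0
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        length = 0
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                break
            length += 1
        best = max(best, length)
    return best

example = "CCATGGCTGAAACCTAAGG"  # hypothetical sequence: ATG GCT GAA ACC TAA
orf_length = longest_orf_codons(example)  # 4 codons (amino acids) before the stop
```

Counting runs that reach the end of the sequence matters for 5'-RACE products, where the reading frame may continue downstream of the primer location.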

14 http://www.expasy.ch/tools/dna.html


Figure 14. Novel transcript variants of RAD51L1 in a colorectal cancer cell line. (A) Expression levels of the different probesets (often corresponding to the different exons) in RAD51L1 as seen from exon microarray data. Expression levels from the different cell lines are indicated by different colours and the thick blue, red, and green lines represent the average for the six cell lines, ten colorectal carcinoma samples, and ten normal samples, respectively. The cell line SW48 deviates from the rest of the cell lines by showing stronger expression signals in the 3’-portion of the gene. (B) An overview of the different transcript variants. The black ruler on top indicates number of base pairs from the start of exon one. All exons are marked with a blue or green vertical line. Blue colour indicates known exons, whereas green colour represents novel exons sequenced from SW48. The number of clones found with the same sequence is indicated in brackets after the name of the transcript. The start of every exon is in agreement with the number of base pairs from the start of exon one, but the exon width on the illustration is exaggerated for improved visualisation. Exons are numbered according to their location in the genomic sequence and exact positions of every exon can be found in Appendix II. Location of the nested gene-specific primer (NGSP) is shown by a black arrow. Five different transcripts are known according to Ensembl for RAD51L1. These transcripts have a total of 14 exons.


The same type of exon expression profile was found for NKAIN2 in both a cell line (LS1034) and a primary tumour (C1033III). These profiles show a higher expression of exons eight, nine, and ten in LS1034 (Figure 15A) and C1033III (Figure 15B) compared to the other cell lines and primary tumours.

Figure 15. NKAIN2. (A) Expression levels of the different exons in NKAIN2 for six cell lines. LS1034 has higher expression of exons eight to ten than the other cell lines. (B) Expression levels of the different exons in NKAIN2 for ten colorectal carcinomas. C1033III has higher expression of exons eight to ten than the other carcinomas. (C) An overview of the different transcript variants. Three different transcript variants are known for NKAIN2 according to Ensembl. Eight new transcripts were found by sequencing of the 5'-RACE products from LS1034 and C1033III and constitute a total of four new exons in introns four, eight, and nine. See legend of Figure 14 for more detailed explanations.

Three transcripts are known for NKAIN2, all of which are transcribed from the same promoter (Figure 15C). Sequencing of the 5'-RACE products from both LS1034 and C1033III revealed the presence of eight novel transcripts including four novel exons, here denoted α, β, γ, and δ. Exon α is used as first exon in transcripts A, D, E, and G, whereas exon γ is the first exon in transcript B. Exons β and δ, on the other hand, are located downstream of exons eight and nine, respectively. In the different transcripts, transcription is initiated at exon α, four, γ, nine, or ten. The Translate tool reveals transcripts A, G, D, F, and E as potentially protein-coding, with open reading frames of up to 173 amino acids, whereas transcripts C, B, and H probably are not.

The exon expression profile for VNN1 in the cell line HT29 deviated from that of the other cell lines by higher expression of exons six and seven (Figure 16A). Sequencing of the 5'-RACE products from VNN1 revealed three transcript variants in HT29 (Figure 16B). One transcript variant with seven exons is known for VNN1, but exons one to five in this transcript were never detected in HT29. Instead, two new exons, α and β, located inside intron number five are present. Transcript A consists of exon α followed by exon β and exon six. The Translate tool indicates that the transcript might encode a protein of 83 amino acids. Transcript B is quite similar to A, but with exon β 35 base pairs longer. This results in a frameshift in the subsequent exon relative to transcript A, introducing a stop codon, and B is therefore most likely non-coding. In transcript C a short exon α is directly followed by exon six. The Translate tool revealed no open reading frame from this sequence.


Figure 16. VNN1. (A) Expression levels of the different exons in VNN1 for six cell lines. HT29 deviates from the other cell lines by higher expression of exons six and seven. (B) An overview of the different transcript variants. One transcript with seven exons is known for VNN1. Three new transcript variants were found by sequencing of the 5'-RACE products from HT29 and include two new exons inside intron number five. See legend of Figure 14 for more detailed explanations.

The exon expression profile for C4BPB in C1034III deviated from the other primary tumours by higher expression from the middle of the second exon and throughout the gene (Figure 17A). Five different transcripts, transcribed from two different promoters, are known for C4BPB (Figure 17B). Three different transcripts were found by sequencing of the 5’-RACE products from C1034III, all of which seem to be transcribed from the two known promoters (Figure 17B). Transcript A consists of the reference exon one and an enlarged exon two with additional sequences 5’ to the reference exon. Transcript B starts in exon two, in accordance with both ENST00000243611 and ENST00000367076. Transcript C is similar to ENST00000367078, but with a larger first exon. Since the gene-specific primer is located relatively close to the 5’-end of the gene, we do not have enough information on whether the two new transcripts, A and C, are protein-coding.

Figure 17. C4BPB. (A) Expression levels of the different exons in C4BPB for ten colorectal carcinoma samples. C1034III deviates from the rest in exons two to eight. (B) An overview of the different transcript variants. Five transcripts with a total of seven exons are known for C4BPB. Three new transcript variants were found by sequencing of the 5'-RACE products. See legend of Figure 14 for more detailed explanations.

The exon expression profile for HOXC11 in the primary tumour C1402III deviates from the profile of the other tumours with higher expression from the end of exon one and throughout the gene (Figure 18A). One transcript with two exons is known for HOXC11 (Figure 18B). Sequencing of the 5'-RACE products revealed two novel transcripts in C1402III (Figure 18B). These transcripts consist of a novel exon, here denoted α, of variable length, spliced to exon two in the known transcript. The Translate tool indicates that transcript A, with the large exon α, exhibits an open reading frame encoding up to 119 amino acids with multiple possible initiation codons. The C-terminal end of the putative peptide generated from transcript A is identical to the C-terminal end of the peptide generated from ENST00000243082. Transcript B has a short exon α and only a quite short open reading frame encoding 38 amino acids, identical to the last part of the open reading frame in transcript A.

Figure 18. HOXC11. (A) Expression levels of the different exons in HOXC11 for ten colorectal carcinoma samples. One sample, C1402III, deviates from the rest in the end of exon one and all of exon two. (B) An overview of the different transcript variants. One transcript with two exons is known for HOXC11. Two new transcript variants were found by sequencing of the 5'-end of the cDNA. See legend of Figure 14 for more detailed explanations.

Two cell lines, RKO and SW48, had similar exon expression profiles for TFR2. These profiles deviated from those seen in the other cell lines by higher expression from exon eight and throughout the gene (Figure 19A). One transcript with 18 exons is known for TFR2, and sequencing of the 5’-RACE products from RKO and SW48 revealed ten novel transcripts (Figure 19B). Exons one, two, and three were never present in these transcripts; instead, all transcripts were initiated from exon four, six, or seven. The transcripts differ with regard to the amount of intron sequence included around the known exons. The Translate tool indicates an open reading frame encoding 46 amino acids in transcripts A, E, F, and H, and an open reading frame encoding 160 amino acids in transcript D. For all five of these transcripts, no stop codon is encoded and the open reading frame continues into the exon(s) downstream of the primer location. No open reading frames were found for transcripts B, C, G, I, and J.

Figure 19. TFR2. (A) Expression levels for the different exons in TFR2 for six cell lines. Two cell lines, SW48 and RKO, deviate from the rest in exons eight to eighteen. (B) An overview of the different transcript variants. One transcript with eighteen exons is known for TFR2. Ten new transcript variants were found by sequencing of the 5'-end of the cDNA. See legend of Figure 14 for more detailed explanations.

The exon expression profile of SERPINB7 in the LS1034 cell line deviated from the other cell lines in exons five to nine (Figure 20A). Two transcript variants are known for SERPINB7 with a total of nine exons, where the first two are non-coding (Figure 20B). Sequencing of the 5’-RACE products revealed three variants in LS1034. Transcript B exhibits a novel first exon located inside intron number two. The Translate tool indicates that the transcript variant encodes the same protein as the two known transcripts, but has a different 5’-UTR. Transcript A is identical to ENST00000398019. Transcript C only includes exons four to six and the Translate tool reveals that no open reading frame is encoded by the transcript.

Figure 20. SERPINB7. (A) Expression levels of the different exons in SERPINB7 for six cell lines. One cell line, LS1034, deviates from the rest in exons five to nine. (B) An overview of the different transcript variants. Two transcripts with a total of nine exons are known for SERPINB7. Three transcript variants were found by sequencing of the 5'-RACE products in LS1034. See legend of Figure 14 for more detailed explanations.

The exon expression profile for TFPT in SW48 shows higher expression in exons four, five, six, and seven compared to the other cell lines (Figure 21A). Four transcripts, transcribed from three different promoters and with a total of seven exons, are known for TFPT (Figure 21B). Sequencing of the 5’-RACE products revealed the presence of two transcripts in SW48 (Figure 21B). Transcript A is transcribed from exon three and the Translate tool indicates that no open reading frame is encoded by the transcript. Transcript B, on the other hand, is similar to one of the known transcripts (ENST00000301757), but with a larger first exon.

Figure 21. TFPT. (A) Expression levels of the different exons in TFPT for six cell lines. One cell line, SW48, deviates from the rest in exons four to seven. (B) An overview of the different transcript variants. Four different transcripts with seven exons are known for TFPT. Two transcript variants were found by sequencing of the 5'-RACE products from SW48. See legend of Figure 14 for more detailed explanations.

The exon expression profile for GJB6 in HT29 deviated from the other cell lines by having higher expression in exons five and six (Figure 22A).

Figure 22. GJB6. (A) Expression levels of the different exons in GJB6 for six cell lines. One cell line, HT29, deviates from the others by higher expression of exons five and six. (B) An overview of the different transcript variants. Four different transcripts with a total of six exons are known for GJB6. Six transcript variants were found by sequencing of the 5'-RACE products from HT29. See legend of Figure 14 for more detailed explanations.

Four transcripts with a total of six exons are known for GJB6. Sequencing of the 5’-RACE products revealed the presence of six transcript variants in HT29 (Figure 22B). Transcript A only includes the last exon, and does not encode an open reading frame. Transcripts B and C are identical to two of the known protein-coding variants (ENST00000400066 and ENST00000400065, respectively). Transcript D presents the same exon composition as ENST00000400066, but the sequence of exon five is 21 basepairs longer at its 5’-end, which introduces seven new amino acids upstream of the coding region. Transcripts E and F are initiated in exons two and five, respectively, and the Translate tool indicates that they encode an intact protein, but have a different 5’-UTR.

The exon expression profile for PRRX1 revealed higher expression of exons two to five in SW48 as compared to the other cell lines (Figure 23A). Two transcripts with a total of five exons are known for PRRX1, and sequencing of the 5'-RACE products from SW48 revealed nine transcript variants with a total of five novel exons localised in the 3'-end of intron one (Figure 23B). Exon one is not present in any of the transcripts, and instead, transcription is initiated at exons α, γ, and δ. The novel exons are spliced together in multiple ways to create the nine different transcript structures identified. The Translate tool indicates the presence of open reading frames in transcripts A and B which might encode up to 83 amino acids. No stop codons were found in these frames, indicating the presence of more coding exon(s) 3’ of the primer location. None of the other transcripts seem to contain open reading frames.

Figure 23. PRRX1. (A) Expression levels of the different exons in PRRX1 for six cell lines. One cell line, SW48, deviates from the others by higher expression of exons two to five. (B) Overview of the different transcript variants. Two different transcripts with a total of five exons are known for PRRX1. Eight transcript variants were found by sequencing of the 5'-RACE products from SW48. See legend of Figure 14 for more detailed explanations.

The exon expression profile for PRRX2 in the primary tumour sample C1033III deviates from the other samples by having higher expression in the last exon of the gene (Figure 24A). One transcript, consisting of four exons, is known for PRRX2, and sequencing of the 5'-RACE products revealed two novel transcript variants, A and B (Figure 24B). Transcript A includes parts of exon three spliced to exon four, whereas transcript B only consists of exon four. Eleven clones exhibited transcript A, and transcription was initiated at the exact same location for all clones (Appendix II). The Translate tool indicates that none of the transcripts are protein-coding.

Figure 24. PRRX2. (A) Expression levels of the different exons in PRRX2 for ten colorectal carcinoma samples. One sample, C1033III, deviates from the others by higher expression of exon number four. (B) An overview of the different transcript variants. One transcript with four exons is known for PRRX2 and two transcript variants were found by sequencing of the 5'-RACE products from C1033III. See legend of Figure 14 for more detailed explanations.

The four genes which were included based on their outlier expression profile alone (WIF1, HOXB13, CST1, and FZD10) were not found to be part of any fusion genes, but novel transcript variants were present. For these genes, real-time RT-PCR was performed prior to amplification of the 5’-end. For each gene, the cell line(s) with particularly high expression, as compared to the other cell lines, were investigated further.

Real-time RT-PCR of WIF1 showed expression in the Colo320 and V9P cell lines (Figure 25A). Expression was much higher in Colo320 than in V9P, and Colo320 was therefore chosen for further analysis. One transcript with ten exons is previously known for WIF1, but sequencing of 5'-RACE products from Colo320 showed five novel transcript variants (Figure 25B). Transcripts A, C, D, and E are similar to the known variant, but with transcription start in exons three, four, six, and seven, respectively. Transcript B exhibits a novel first exon, α, located in the 3’-end of intron two. The Translate tool indicates that transcripts A, B, and C may encode a truncated protein. Transcript B might encode a larger part of the protein than transcripts A and C do, whereas no open reading frames are encoded by transcripts D and E.

Figure 25. WIF1. (A) Relative expression of WIF1 in 20 colorectal cancer cell lines. The expression of WIF1 is clearly much higher in Colo320 compared to the other cell lines. (B) An overview of the different transcript variants. One transcript with ten exons is known for WIF1. Five transcript variants were found by sequencing of the 5’-RACE products from Colo320. See legend of Figure 14 for more detailed explanations.

Real-time RT-PCR targeting HOXB13 revealed higher expression in two cell lines, V9P and FRI, as compared to the others (Figure 26A). One transcript consisting of two exons is known for HOXB13. Sequencing of the 5’-RACE products in V9P and FRI revealed the presence of three transcript variants (Figure 26B). Transcript A is identical to the known transcript, whereas transcripts B and C are novel variants. Transcript B consists of the 3’-end of intron one directly followed by exon number two. Transcript C, on the other hand, was only found in one of the cell lines, FRI, and includes a novel first exon located over 10,000 basepairs upstream of the start of exon one, spliced to exon two. The Translate tool indicates that neither transcript B nor transcript C is protein-coding.

Figure 26. HOXB13. (A) Relative expression of HOXB13 in 20 colorectal carcinoma cell lines. V9P and FRI show higher expression of HOXB13 than the other cell lines. (B) An overview of the different transcript variants. One transcript with two exons is known for HOXB13. Three transcript variants were found by sequencing of the 5'-RACE products from V9P and FRI. See legend of Figure 14 for more detailed explanations.

Real-time RT-PCR targeting CST1 revealed higher expression in the LS174T cell line as compared to the others (Figure 27A). Two transcript variants, with a total of four exons, are known for CST1 (Figure 27B). Sequencing of the 5’-RACE products in LS174T revealed the presence of three transcript variants (Figure 27B). Transcript A is identical to one of the known transcripts (ENST00000304749), transcript B consists of the last two exons of the known variants, whereas transcript C only consists of the last exon. The Translate tool reveals that no open reading frames are encoded by transcripts B and C.

Figure 27. CST1. (A) Relative expression of CST1 in 20 colorectal carcinoma cell lines. The expression level of CST1 in LS174T is higher than in the other cell lines. (B) An overview of the different transcript variants. Two transcripts with a total of four exons are known for CST1. Three transcript variants were found by sequencing of the 5'-RACE products from LS174T. See legend of Figure 14 for more detailed explanations.

Real-time RT-PCR on FZD10 revealed higher expression in the V9P cell line as compared to the other cell lines (Figure 28A). One transcript variant, with only one exon, is known for FZD10 and sequencing of the 5’-RACE products from V9P showed the presence of one transcript. This transcript, A, consists of one exon which is truncated in the 5’-end as compared to the known exon. The Translate tool indicates that it might encode a truncated protein.

Figure 28. FZD10. (A) Relative expression of FZD10 in 20 colorectal carcinoma cell lines. The expression in V9P is higher than the expression in the other cell lines. (B) An overview of the different transcript variants. One transcript with one exon is known for FZD10. One transcript variant was found by sequencing of the 5'-RACE products from V9P and it consisted of parts of the coding region in the known variant. See legend of Figure 14 for more detailed explanations.

5. Discussion

The overall focus of the thesis has been the study of oncogenic transcripts resulting from fusion genes or alternative splicing. The thesis is divided into two main parts: the establishment of methodological protocols in existing projects and the hunt for fusion genes in colorectal cancer. During the first part of the thesis, experience with methods such as cDNA synthesis, primer design, RT-PCR, and DNA sequencing was gained through analysis of BIRC5 transcript variants, known fusion genes in leukaemia cell lines, and putative fusion genes in a TGCT cell line. In addition to generating interesting results within these projects, the methods constituted a foundation of experience necessary for the establishment of protocols needed in the second part of the thesis. Here, two experimental methods (RACE and cloning) were established for use in our laboratory and applied in the fusion gene hunt. A novel strategy for fusion gene identification was developed. The catch was zero fusion genes, but fifteen genes with novel transcript variants, which all in all is considered a fine bag.

5.1 Splice variants of BIRC5 in cancer

In the process of establishing molecular techniques, the known transcript variants of BIRC5 were analysed and three different variants were detected in 15 MPNST samples. BIRC5 is a gene known to be upregulated in different cancers, and was also recently identified as upregulated in MPNST [106]. The expression of the gene is associated with unfavourable clinicopathological parameters, such as poor prognosis with progressive disease and shorter patient survival [26].

The analysis of the different transcript variants of BIRC5 is important because different variants have been shown to be associated with different outcomes. For instance, in non-small-cell lung cancer, high levels of survivin-2B are associated with patients who are alive without disease relapse, whereas high levels of survivin-Δex3 are associated with patients who have died due to disease relapse [129]. This is in agreement with the findings that survivin-2B and survivin-Δex3 play opposite roles in tumour progression and/or tumourigenesis [reviewed in ref. 130]. Therefore the presence of survivin-Δex3 in the MPNST samples may indicate a poor prognosis, whereas the effect of the transcript variant may be counteracted by survivin-2B in the same samples.

In recent years, additional transcript variants of the BIRC5 gene have also been found [131]. These variants are not yet well studied, but they may add even more complexity to the function of the gene.

5.2 Validation of a novel microarray-based tool

Known fusion genes in leukaemia were used as positive controls in a pilot study using a novel microarray-based strategy developed to screen for all known fusion transcripts in a given sample (Figure 7). Validation of the particular fusion breakpoints indicated by the fusion gene microarray was included in this master project, as well as the curation and weekly update of a database covering known fusion genes across all cancer types (Appendix III).

The expected BCR-ABL1 fusion transcript could be detected neither on the fusion gene microarray nor by RT-PCR. Due to this unexpected result, fluorescence in situ hybridisation (FISH) analysis targeting the fusion gene was performed and the cell line was karyotyped (by the collaborating group of Sverre Heim, Department of Clinical Genetics, Norwegian Radium Hospital). These analyses were also negative for the fusion gene, and the karyotype as a whole was not in agreement with the published karyotype of the TOM-1 cell line [132]. The cell line was therefore discarded as a positive control.

The microarray enables an objective and automated genome-wide analysis of all known and predicted fusion genes, as well as precise mapping information on the fusion breakpoint. This is possible because fusion breakpoints map mainly to intronic sequences [41] and pre-mRNA processing therefore gives rise to transcripts consisting of whole exonic building blocks. Thus, independently of the exact intra-intron breakpoint locations, detection of all possible exon-exon junctions between two fusion gene partners provides specific detection of most fusion events.
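The principle of covering all possible exon-exon junctions between two fusion partners can be illustrated with a minimal sketch. It is not the design of the actual fusion gene microarray: the function name `junction_probes`, the `flank` length, and the toy exon sequences are all assumptions made for illustration only.

```python
from itertools import product

def junction_probes(upstream_exons, downstream_exons, flank=18):
    """Build one candidate probe per possible exon-exon junction.

    Each probe joins the last `flank` bases of an exon from the upstream
    (5') partner to the first `flank` bases of an exon from the downstream
    (3') partner. Keys are (upstream exon number, downstream exon number).
    The flank length is an assumed value, not a documented array parameter.
    """
    probes = {}
    numbered_up = enumerate(upstream_exons, start=1)
    numbered_down = enumerate(downstream_exons, start=1)
    for (i, up), (j, down) in product(numbered_up, numbered_down):
        probes[(i, j)] = up[-flank:] + down[:flank]
    return probes

# Hypothetical toy exons (not real gene sequences).
up = ["ATGGCCAAGT", "GGTTCCAAGG"]
down = ["CCTTAAGGCC", "TTGACCATGA", "GGGCCCAAAT"]
probes = junction_probes(up, down, flank=4)
```

With two upstream and three downstream exons, the sketch yields six junction sequences, one for every exon pair, regardless of where in the intervening introns the genomic breakpoint lies.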

The microarray contained measurements of chimeric transcript junctions together with exon-wise measurements of individual fusion partners, and to our knowledge this is the first time such a combination has been used to identify fusion breakpoints. Earlier publications on fusion gene measurements by oligo microarrays have restricted their use to either a few pre-defined fusion genes [133-136] or to the exclusive use of intragenic oligos [137].

The fusion gene microarray can be used irrespective of tumour type and has great potential in cancer diagnostics, where all known and suspected fusion events can be searched for simultaneously. In addition, the microarray can be used for detection of putative fusion genes as well as discovery of already known fusion genes in new cancer types.

5.3 Putative fusion genes in TGCT

In 2004, Hahn et al. reported a cancer type-independent bioinformatics screen for fusion genes by use of human sequence databases [109]. Three putative fusion genes were identified based on sequences from the TGCT cell line NTERA2. To reduce the number of false positives, the authors only considered intronic events as putative fusion genes. Surprisingly, and despite these precautions, none of the three putative fusion genes in NTERA2 were detected with RT-PCR from an NTERA2 subline grown in our laboratory. For all three putative fusion genes, primers were designed to amplify both the putative fusion gene and the individual partner genes involved. Because these intragenic controls were positive for all the putative fusion gene partners, it is safe to conclude that the assays were working, but that none of the fusion genes are present in the NTERA2 subline grown in our laboratory.

Most cancer cell lines are genetically unstable, and sublines grown in laboratories around the world are prone to acquire new genetic traits. Such new traits will only be present in cells descending from the single cell which acquired the change. The putative fusion genes found from chimeric cDNA sequences from the NTERA2 cell line by Hahn et al. may be the result of such new traits (in the form of structural chromosome aberrations or trans-splicing events; see footnote 15). These traits may be present in one or several sublines of NTERA2, but not in the one available in our laboratory.

5.4 Hunt for fusion genes

5.4.1 Methodological considerations

Three starting points were used for the candidate gene selection in the hunt for fusion genes: genes with outlier expression profiles, known and putative 3’ fusion gene partners, and members of the ETS gene family. A fusion gene usually leads to overexpression of the downstream fusion partner, and a given fusion gene is usually present in only a subset of cancer samples. The formation of a fusion gene therefore leads to overexpression of the downstream partner gene in only some of the samples, giving rise to an outlier expression profile. Previously, COPA has been used to calculate outlier profiles in the search for novel fusion genes [50]. In this thesis, GTI was used. For a given gene and a given cancer type, GTI calculates the outlier expression profile on the basis of both the fraction of samples with expression above a certain threshold and the average expression change within these samples. Known and putative 3’ fusion gene partners and ETS gene family members were included because of their known susceptibility to rearrangement and because the same fusion genes (and in particular the same fusion gene partners) can be present in different cancer types [48,111].
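The two ingredients of an outlier expression profile described above, the fraction of samples above a threshold and the average expression change within those samples, can be sketched as follows. This is not the published GTI formula: the function `outlier_score`, the fold-change threshold, the choice to multiply the two quantities, and the example values are all assumptions for illustration.

```python
from statistics import mean

def outlier_score(expression, reference, fold_threshold=2.0):
    """Summarise how outlier-like a gene's expression is across samples.

    expression     -- expression values for one gene, one per cancer sample
    reference      -- baseline level (e.g. from normal tissue), linear scale
    fold_threshold -- assumed cut-off defining an overexpressing sample

    Returns (fraction of samples above threshold,
             mean fold change within those samples,
             product of the two as an illustrative combined score).
    """
    above = [x for x in expression if x >= fold_threshold * reference]
    if not above:
        return 0.0, 0.0, 0.0
    fraction = len(above) / len(expression)
    mean_fold = mean(x / reference for x in above)
    return fraction, mean_fold, fraction * mean_fold

# Hypothetical values: one of five samples strongly overexpresses the gene.
fraction, mean_fold, score = outlier_score([1.0, 1.2, 0.9, 8.0, 1.1], reference=1.0)
```

A gene that is strongly overexpressed in a small subset of samples has a low fraction but a high mean fold change, which is the pattern expected for the downstream partner of a fusion gene.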

15 The joining of exons from separate pre-mRNAs [138].

Analysis of the longitudinal exon expression profile was an important step in the process of enriching for the most likely candidate fusion partners and also for selecting a more reasonable number of genes to investigate in the laboratory. If every exon in a gene is under the control of the same promoter, we would expect the exon expression levels to be similar throughout the gene. If, on the other hand, a gene is the downstream partner in a fusion gene, the exons downstream of the fusion point will be under the control of the promoter region in the upstream fusion partner. The 5’-portion of the original gene is therefore regulated by one promoter and the 3’-portion by another, leading to different expression of the two parts. This may give rise to longitudinal exon expression profiles like the ones seen in Figure 14A to Figure 24A, where exons in the 3'-end of a gene have higher expression than the 5'-exons in certain samples as compared to others. The same type of longitudinal exon expression profile may be the result of a strong alternative promoter localised downstream of the reference promoter.
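The reasoning about longitudinal exon expression profiles can be made concrete with a small sketch that looks for the exon at which one sample starts to deviate from the others. It is an illustration only, not the procedure used in this thesis: the function `best_split`, the use of the median of the other samples as baseline, and the log2-scale example values are assumptions.

```python
from statistics import mean, median

def best_split(sample, others):
    """Find the exon where a sample's 3' exons rise above its 5' exons.

    sample -- per-exon expression values (assumed log2 scale) for one sample
    others -- list of per-exon value lists for the comparison samples

    For each possible split point, the sample's deviation from the per-exon
    median of the other samples is averaged on either side, and the split
    with the largest 3' minus 5' difference is returned as
    (1-based number of the first exon in the 3' block, size of the step).
    """
    baseline = [median(column) for column in zip(*others)]
    delta = [s - b for s, b in zip(sample, baseline)]
    best = None
    for k in range(1, len(delta)):
        step = mean(delta[k:]) - mean(delta[:k])
        if best is None or step > best[1]:
            best = (k + 1, step)
    return best

# Hypothetical six-exon gene: the sample deviates from exon three onwards.
others = [
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    [5.2, 4.8, 5.0, 5.1, 4.9, 5.0],
    [4.9, 5.1, 5.0, 5.0, 5.0, 5.1],
]
sample = [5.0, 5.1, 8.0, 8.2, 7.9, 8.1]
first_exon, step = best_split(sample, others)
```

Such a step in the profile is compatible with a fusion breakpoint upstream of the indicated exon, but equally with a strong alternative promoter, so it can only nominate candidates for laboratory follow-up.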

The master project was initiated before we had exon microarray data available for colorectal cancer samples. Thus, for some genes, the first step in the laboratory was real-time RT-PCR, used to validate the outlier-style expression profile and to identify samples with strong expression. The primers used for this procedure hybridise close to the 3’-end of the genes and, if possible, cross an intron to avoid amplification of genomic DNA. For each of the genes, 5’-RACE primers were designed to anneal in the same region as the real-time primers to ensure detection within an expressed exon. In addition, this primer design strategy ensures that the 5’-RACE products include large parts of the gene and therefore increases the likelihood of finding any rearrangements present.

To investigate the transcript structure upstream of the altered exon expression, 5'-RACE was used. One debate concerning RACE methods is whether the very beginning of the transcript is reached. For the SMART RACE kit used in the present master project, it has been reported that 70-90% of the products correspond to the actual 5'-end of the mRNA [139]. The majority of transcripts found in this project may therefore be considered to include the 5'-end of the mRNA. This is also supported by findings shown in Appendix II, where different clones for the same transcript start at the exact same base, indicating that this is the first base to be transcribed into mRNA. An example can be seen in Appendix II for PRRX2 transcript A, where all eleven clones started with the same nucleotide.

Multiple transcripts are found for the majority of genes [14]. The gene-specific primers used in the RACE setup anneal to a particular exon. By use of the exon microarray expression profiles, gene-specific primers could be designed to anneal in exons indicated to be highly expressed, and therefore most likely also included in a potential fusion transcript.

Early in the project, it became evident that two steps in the process from mRNA to sequenced 5’-cDNA ends were essential for success. Firstly, since the RACE method only applies one gene-specific primer, it is necessary to perform nested RACE with a nested gene-specific primer to ensure gene-specific RACE products. Secondly, it is necessary to separate nested RACE products on an agarose gel, followed by elution of individual bands, prior to cloning. Omitting this step will favour cloning of short products. Some of the adenosine overhangs produced by the PCR reaction, and necessary for cloning into the TOPO vectors, are lost during the gel elution step, thus making the cloning reaction less effective. Accordingly, the amount of transformation mix had to be increased to ensure sufficient growth of transformed bacteria.

5.4.2 Novel exons and transcripts

Among the 15 genes investigated in the laboratory, nine were initially included due to outlier expression profiles in tissues from colorectal cancer (TFR2, SERPINB7, C4BPB, VNN1, GJB6, WIF1, HOXB13, CST1, and FZD10) and six due to their known participation as fusion gene partners in other cancer types (RAD51L1, NKAIN2, HOXC11, TFPT, PRRX1, and PRRX2) [140-145]. Of the nine genes with outlier expression profiles, four genes (WIF1, HOXB13, CST1, and FZD10) were investigated based on this profile alone, whereas five (TFR2, SERPINB7, C4BPB, VNN1, GJB6) also had exon expression profiles deviating in the 3’-end of the transcripts. In total, laboratory investigations of the 15 genes led to the discovery of 57 novel transcript variants, including 22 novel exons and 34 putative novel promoters in colorectal cancer. In the following, each gene and its transcript variants will be discussed in more detail.

Large discrepancies are seen between different human genome databases with regard to, for instance, what is considered a transcript variant and the nomenclature of exons and transcripts. Therefore, throughout the thesis one genomic database, Ensembl, has been used to assess the different transcripts and exons known for a given gene. Ensembl, which is curated by the European Bioinformatics Institute, is considered a comprehensive, well-annotated and stable database, where annotated genes and transcripts are based on mRNA and protein sequences deposited in public databases by the scientific community.

For RAD51L1, the transcription start sites of the herein identified novel transcript variants indicate the presence of three novel promoters, at exons denoted α, β, and γ. The exon expression profile for RAD51L1 (Figure 14, page 56) shows higher expression of the last exons in the investigated cell line as compared to the others and therefore indicates that one or more of the alternative promoters are more active than the reference promoters. The investigated cell line, SW48, also has higher expression of exon two compared to the other cell lines. This cannot be explained by the transcripts described in this thesis because exons one to seven are not present in any of them. The high expression of exon two might be explained by transcripts which do not contain exon eight, and which are therefore not detected with the RACE primed from this exon.

For NKAIN2, the novel exon α is used as first exon in four of the sequenced transcripts and indicates the presence of a novel promoter. Promoters might also be present at exons four, γ, nine and ten, as these are the first exons in the other four transcripts. The exon expression profiles of the cell line and tumour sample investigated deviate most strikingly from the other cell lines and tumour samples in exons eight, nine, and ten. In addition, they both also have the highest expression in exon five, as compared to samples of the same kind, which is in line with the presence of this exon in five transcripts.

The exon expression profile for VNN1 in HT29 was quite striking (Figure 16, page 59), with higher expression of exons six and seven as compared to the average expression of the ten tumour samples, which are somewhat upregulated compared to normal samples and cell lines. Three transcript variants with two novel exons were found. For all novel transcript variants of VNN1, expression starts in the novel exon α, indicating the presence of a novel promoter. To account for the high expression of exons six and seven, the promoter used to generate these transcripts must be more active than the normal promoter in VNN1.

The enlarged exon two seen in transcript A of C4BPB might constitute a longer 5’-UTR and thereby affect its stability and/or regulation of translation. Transcript C might be the same as ENST00000367078. The first exon is bigger in transcript C, but this might be due to use of different TSSs and thus, the promoter is not necessarily a novel one.

Both of the novel transcripts seen for HOXC11 consist of a version of exon α, spliced to exon two in the reference transcript. This indicates the presence of a novel promoter at exon α. The possible protein encoded by transcript A might be a truncated version of the known protein product of ENST00000243082 or a novel protein with an identical C-terminal end.

The novel transcript D seen in TFR2 consists of exons four to eight and was only found in the RKO cell line. The exon expression profiles for the two investigated cell lines deviate most from the other cell lines in exons eight to ten, but the presence of exon four in transcript D is in concordance with the peak seen at this position in the exon expression profile for RKO. The drop in expression seen for exon five for all cell lines might be due to a non-functioning probeset. All transcripts are initiated from either exon four, six, or seven, indicating the presence of novel promoters in these regions.

Two novel transcripts and one known transcript were found for SERPINB7 (Figure 20, page 64). The first exon seen in transcript B is likely non-coding, so that the transcript has a different 5’-UTR than the known isoforms of the gene. This might affect the stability of the transcript and the regulation of its translation.

The exon expression profile for TFPT in SW48 shows high expression of exon one, but lower expression of exons two and three. Exon two is not present in the two transcripts seen in SW48, which might explain the drop in the expression profile at this exon. Exon three, on the other hand, is present in both transcripts. The drop in expression at this location is seen, to varying degrees, for all the cell lines and may be due to a probeset not working properly. The enlarged first exon in transcript B might be due to alternative TSS use as compared to the known transcript, and does not necessarily indicate the presence of a novel promoter.

The entire coding region of GJB6 is located in exon six. The enlarged fifth exon seen in transcript D alters the 5’-UTR and might therefore affect the stability and/or regulation of translation. Transcripts E and F differ from the reference transcripts and indicate the presence of new promoters in front of exons two and five, respectively. The potential proteins encoded by these transcripts are identical, but the transcripts have different 5’-UTRs compared to the known transcripts and might therefore be regulated differently. None of the transcripts sequenced from the HT29 cell line includes exon three, thus explaining the drop seen at this position in the exon expression profile.

In the novel transcript variants seen for PRRX1, transcription is initiated at exons α, γ, and δ, indicating the presence of three novel promoters. The exon expression profile for the investigated cell line shows continuous high expression of PRRX1 in exons three, four, and five. This indicates the presence of all these exons in the full-length transcripts and is in concordance with the lack of stop codons upstream of the primer location in transcripts A and B. To account for the elevated expression of exons two to five, one or more of the novel promoters found in the investigated cell line must be more active than the normal promoter for PRRX1.

Eleven clones containing transcript A of PRRX2 were sequenced, all of which were of the exact same length because transcription was initiated at the exact same nucleotide. This indicates that the far 5’-end of the transcripts was reached using 5’-RACE and therefore also supports the findings of a wider repertoire of promoters for the other genes investigated in this thesis.

For WIF1, the novel transcript variants identified herein indicate the presence of novel promoters at exons three, α, four, six and seven. The activation of one or more of these promoters might explain the high expression of the gene as seen from real-time RT-PCR.

For HOXB13, transcripts B and C indicate the presence of two novel promoters. Strong activation of these or the normal promoter may account for the high expression seen in the investigated cell lines.

Three transcripts were found for CST1 in the LS174T cell line. Transcripts B and C indicate the presence of novel promoters located at exons three and four, and strong activation of these or the normal promoter in exon two might account for the high expression in the cell line investigated.

The indicated FZD10 transcript was sequenced from 20 different clones and is shorter than the reference transcript. The TSSs were found in three main locations (Table-A-II-15, Appendix II) in the single exon constituting the transcript, indicating the presence of three novel promoters. Activation of one or more of these can account for the high expression seen in the cell line investigated.

The Translate tool, which translates nucleotide sequences into peptide sequences of potential proteins, has been used to evaluate whether or not the different transcripts have the potential to be protein-coding. The transcripts referred to as non-coding have been of two types: either with many stop codons dispersed throughout the nucleotide sequence in all three reading frames, or with no start codon. The latter type was found in transcripts from TFR2, SERPINB7, TFPT, GJB6, WIF1, and CST1. The nucleotide sequences of these transcripts typically contained an open reading frame, but did not include a start codon for this frame. Nevertheless, these transcripts may equally well represent sequences where the 5’-end of the cDNA has not been reached.
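The two categories of apparently non-coding transcripts described above can be illustrated with a minimal Python sketch. It is not the Translate tool used in this thesis. It assumes the standard genetic code, considers only the three forward reading frames, and uses a hypothetical length threshold (`min_len`) to decide when a stop-free stretch counts as an open reading frame. The function names and category labels are likewise illustrative assumptions.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame):
    """Translate a nucleotide sequence in forward frame 0, 1 or 2 ('*' = stop)."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_open_stretch(protein):
    """Return the longest stop-free stretch and whether it contains a methionine."""
    best = max(protein.split("*"), key=len)
    return best, "M" in best


def classify(seq, min_len=30):
    """Assign a transcript sequence to one of three illustrative categories.

    min_len is a hypothetical threshold (in codons), not a value from the thesis.
    """
    results = []
    for frame in range(3):
        stretch, has_start = longest_open_stretch(translate(seq, frame))
        results.append((len(stretch), has_start))
    # Prefer the longest stretch; on ties, prefer one containing a start codon.
    length, has_start = max(results)
    if length < min_len:
        return "stop codons in all frames"
    if not has_start:
        return "open frame without start codon"
    return "potentially coding"
```

A sequence in the second category may simply be truncated at its 5’-end, so a classification of this kind cannot by itself distinguish a truly non-coding transcript from an incomplete cDNA.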

True non-coding transcripts may also be functionally relevant to the cells. Over the past few years, several long non-coding RNAs have been discovered. Many of these RNAs control the activity of protein-coding genes and do so in a variety of ways without necessarily being dependent on the exact sequence of the RNA [146]. For example, as seen from the DHFR gene [147], a non-coding RNA generated from one promoter in a gene can regulate the transcription of protein-coding transcripts generated from another promoter within the same gene.

Nonsense-mediated mRNA decay is a posttranscriptional process which selectively recognises and degrades mRNAs with truncated open reading frames [25]. The novel transcripts detected in this master project are clearly not degraded, as their corresponding genes were included in the study based on high mRNA levels. This is yet another indication that they may have functional implications for the cells.

The transcripts described in this thesis display 34 potentially novel promoters. This includes both transcripts potentially encoding the reference proteins but containing different 5’-UTRs (as seen for GJB6, transcripts E and F) and transcripts potentially encoding novel proteins (as seen for RAD51L1, transcripts B and F). Heterogeneous 5’-UTRs can affect the stability and translation efficiency of the mRNAs and thereby affect the amount of protein present in a cell, whereas protein isoforms of the same gene may have different functions. The potential proteins encoded by transcripts identified in this thesis may therefore introduce effects to a cancer cell which are different from those of the proteins encoded by the reference transcripts.

As seen from Appendix II, the exact TSSs for the same type of transcripts within different clones differ by some nucleotides. This is in accordance with the findings that most human promoters lack one distinct TSS, but instead consist of a series of closely located TSSs spread over around 50 to 100 basepairs [29]. For some transcripts, the TSSs seen in Appendix II are separated by more than 100 basepairs, and may therefore indicate the presence of more than one core promoter.
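The reasoning above, that nearby TSSs belong to one core promoter while TSSs separated by more than 100 base pairs may indicate separate core promoters, can be sketched as a simple gap-based grouping. This is an illustrative assumption about how such a grouping could be done, not the procedure used in this thesis. The function name, the `max_gap` parameter and the example coordinates are hypothetical.

```python
def cluster_tss(positions, max_gap=100):
    """Group TSS coordinates so that neighbours within max_gap bp share a cluster.

    Each returned cluster is a candidate core promoter. A gap larger than
    max_gap between consecutive TSSs starts a new cluster.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Hypothetical TSS coordinates (bp) observed in different clones of one transcript.
example = [105, 100, 112, 400, 410, 900]
print(cluster_tss(example))
```

Because the grouping compares consecutive TSSs only, a long chain of closely spaced start sites is kept together even if its total span exceeds 100 base pairs. Limiting the total span of a cluster instead would be an equally defensible choice.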

In summary, the exon expression levels for 508 genes were investigated. Fifteen of the genes had deviating exon expression profiles indicating qualitative changes in the transcript structure and were therefore investigated in the laboratory. No new fusion gene was found, but 57 novel transcript variants including 22 novel exons and 34 putative promoters were identified from colorectal cancer cell lines and tissue samples. Thus, in conclusion, we consider our strategy for identification of novel transcript variants in colorectal cancer successful. The novel transcripts will be further investigated in our laboratory to elucidate their prevalence and clinical relevance in colorectal cancer, as well as their cancer-specificity.

6. Future perspectives

For the numerous transcripts discovered in this master project, further analyses are warranted to elucidate their prevalence in colorectal cancer, their potential cancer-specificity, as well as their clinical relevance.

The prevalence of the novel transcripts and exons will be investigated by analyses on a larger cohort of cell lines and tumour samples. Real-time RT-PCR assays discriminating between transcripts carrying the novel exons and the reference transcripts will be designed. This will both elucidate the prevalence of the novel exons and provide information on the relative amount of the novel exons within samples expressing them. These analyses will include investigations of whether the presence of the novel transcripts is related to any clinical parameters such as patient survival and Dukes' stage, and molecular phenotypes such as CIN/MSI and mutation status of known colorectal cancer-critical genes. We already have the relevant biobank material ready for such analyses, with both the clinical and molecular data stored in databases.

We will also seek to functionally validate the novel transcript variants. Here, knock-down of the individual novel and reference transcripts by siRNA- or shRNA-based RNAi will be carried out.

The fusion gene microarray, validated in this thesis, will be used to investigate the presence of known and putative fusion genes in cancer samples, as well as to search for the presence of known fusion genes in new cancer types.

7. Reference list

1. Nowell PC: The clonal evolution of tumor cell populations. Science 1976, 194: 23-28.

2. Heim S, Teixeira MR, Dietrich CU, Pandis N: Cytogenetic polyclonality in tumors of the breast. Cancer Genet Cytogenet 1997, 95: 16-19.

3. Vogelstein B, Kinzler KW: Cancer genes and the pathways they control. Nat Med 2004, 10: 789-799.

4. Hanahan D, Weinberg RA: The Hallmarks of Cancer. Cell 2000, 100: 57-70.

5. Konopka JB, Watanabe SM, Singer JW, Collins SJ, Witte ON: Cell lines and clinical isolates derived from Ph1-positive chronic myelogenous leukemia patients express c-abl proteins with a common structural alteration. Proc Natl Acad Sci U S A 1985, 82: 1810-1814.

6. Tsujimoto Y, Gorham J, Cossman J, Jaffe E, Croce CM: The t(14;18) chromosome translocations involved in B-cell neoplasms result from mistakes in VDJ joining. Science 1985, 229: 1390-1393.

7. Gruber SB, Ellis NA, Scott KK, Almog R, Kolachana P, Bonner JD et al.: BLM heterozygosity and the risk of colorectal cancer. Science 2002, 297: 2013.

8. Goss KH, Risinger MA, Kordich JJ, Sanz MM, Straughen JE, Slovek LE et al.: Enhanced tumor formation in mice heterozygous for Blm mutation. Science 2002, 297: 2051-2053.

9. Ehrlich M: DNA methylation in cancer: too much, but also too little. Oncogene 2002, 21: 5400-5413.

10. Esteller M, Corn PG, Baylin SB, Herman JG: A gene hypermethylation profile of human cancer. Cancer Res 2001, 61: 3225-3229.

11. Skotheim RI, Nees M: Alternative splicing in cancer: Noise, functional, or systematic? The International Journal of Biochemistry & Cell Biology 2007, In Press, Corrected Proof.

12. Blencowe BJ: Alternative splicing: new insights from global analyses. Cell 2006, 126: 37-47.

13. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860-921.

14. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C et al.: Alternative isoform regulation in human tissue transcriptomes. Nature 2008, 456: 470-476.

15. Nagasaki H, Arita M, Nishizawa T, Suwa M, Gotoh O: Automated classification of alternative splicing and transcriptional initiation and construction of visual database of classified patterns. Bioinformatics 2006, 22: 1211-1216.

16. Zhou Z, Licklider LJ, Gygi SP, Reed R: Comprehensive proteomic analysis of the human spliceosome. Nature 2002, 419: 182-185.

17. Ast G: How did alternative splicing evolve? Nat Rev Genet 2004, 5: 773-782.

18. Brow DA: Allosteric cascade of spliceosome activation. Annu Rev Genet 2002, 36: 333-360.

19. Robberson BL, Cote GJ, Berget SM: Exon definition may facilitate splice site selection in RNAs with multiple exons. Mol Cell Biol 1990, 10: 84-94.

20. Graveley BR: Sorting out the complexity of SR protein functions. RNA 2000, 6: 1197-1211.

21. Wagner EJ, Garcia-Blanco MA: Polypyrimidine tract binding protein antagonizes exon definition. Mol Cell Biol 2001, 21: 3281-3288.

22. Blanchette M, Chabot B: Modulation of exon skipping by high-affinity hnRNP A1-binding sites and by intron elements that repress splice site utilization. EMBO J 1999, 18: 1939-1952.

23. Zhu J, Mayeda A, Krainer AR: Exon identity established through differential antagonism between exonic splicing silencer-bound hnRNP A1 and enhancer-bound SR proteins. Mol Cell 2001, 8: 1351-1361.

24. Kornblihtt AR: Chromatin, transcript elongation and alternative splicing. Nat Struct Mol Biol 2006, 13: 5-7.

25. Cartegni L, Chew SL, Krainer AR: Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat Rev Genet 2002, 3: 285-298.

26. Li F: Survivin study: what is the next wave? J Cell Physiol 2003, 197: 8-29.

27. Zhu N, Gu L, Findley HW, Li F, Zhou M: An alternatively spliced survivin variant is positively regulated by p53 and sensitizes leukemia cells to chemotherapy. Oncogene 2004, 23: 7545-7551.

28. Sandelin A, Carninci P, Lenhard B, Ponjavic J, Hayashizaki Y, Hume DA: Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet 2007, 8: 424-436.

29. Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J et al.: Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 2006, 38: 626-635.

30. Davuluri RV, Suzuki Y, Sugano S, Plass C, Huang TH: The functional consequences of alternative promoter use in mammalian genomes. Trends Genet 2008, 24: 167-177.

31. Agarwal VR, Bulun SE, Leitch M, Rohrich R, Simpson ER: Use of alternative promoters to express the aromatase cytochrome P450 (CYP19) gene in breast adipose tissues of cancer-free and breast cancer patients. J Clin Endocrinol Metab 1996, 81: 3843-3849.

32. Nakanishi T, Bailey-Dell KJ, Hassel BA, Shiozawa K, Sullivan DM, Turner J et al.: Novel 5' untranslated region variants of BCRP mRNA are differentially expressed in drug-selected cancer cells and in normal human tissues: implications for drug resistance, tissue-specific expression, and alternative promoter usage. Cancer Res 2006, 66: 5007-5011.

33. Nowell PC, Hungerford DA: Chromosome studies on normal and leukemic human leukocytes. J Natl Cancer Inst 1960, 25: 85-109.

34. Rowley JD: Letter: A new consistent chromosomal abnormality in chronic myelogenous leukaemia identified by quinacrine fluorescence and Giemsa staining. Nature 1973, 243: 290-293.

35. Agarwal S, Tafel AA, Kanaar R: DNA double-strand break repair and chromosome translocations. DNA Repair (Amst) 2006, 5: 1075-1081.

36. Kuppers R, Klein U, Hansmann ML, Rajewsky K: Cellular origin of human B-cell lymphomas. N Engl J Med 1999, 341: 1520-1529.

37. Maire G, Brown CW, Bayani J, Pereira C, Gravel DH, Bell JC et al.: Complex rearrangement of chromosomes 19, 21, and 22 in Ewing sarcoma involving a novel reciprocal inversion-insertion mechanism of EWS-ERG fusion gene formation: a case analysis and literature review. Cancer Genet Cytogenet 2008, 181: 81-92.

38. Welborn J, Jenks H, Taplett J, Walling P: Inversion of chromosome 12 and lineage promiscuity in hematologic malignancies. Cancer Genet Cytogenet 2004, 148: 91-103.

39. Dong JT: Chromosomal deletions and tumor suppressor genes in prostate cancer. Cancer Metastasis Rev 2001, 20: 173-193.

40. Åman P: Fusion genes in solid tumors. Semin Cancer Biol 1999, 9: 303-318.

41. Novo FJ, de M, I, Vizmanos JL: TICdb: a collection of gene-mapped translocation breakpoints in cancer. BMC Genomics 2007, 8: 33.

42. Gasparini P, Sozzi G, Pierotti MA: The role of chromosomal alterations in human cancer development. J Cell Biochem 2007, 102: 320-331.

43. Åman P: Fusion oncogenes in tumor development. Semin Cancer Biol 2005, 15: 236-243.

44. Tognon C, Knezevich SR, Huntsman D, Roskelley CD, Melnyk N, Mathers JA et al.: Expression of the ETV6-NTRK3 gene fusion as a primary event in human secretory breast carcinoma. Cancer Cell 2002, 2: 367-376.

45. Enlund F, Behboudi A, Andren Y, Oberg C, Lendahl U, Mark J et al.: Altered Notch signaling resulting from expression of a WAMTP1-MAML2 gene fusion in mucoepidermoid carcinomas and benign Warthin's tumors. Exp Cell Res 2004, 292: 21-28.

46. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R et al.: A census of human cancer genes. Nat Rev Cancer 2004, 4: 177-183.

47. Mitelman F, Johansson B, Mertens F: Fusion genes and rearranged genes as a linear function of chromosome aberrations in cancer. Nat Genet 2004, 36: 331-334.

48. Kumar-Sinha C, Tomlins SA, Chinnaiyan AM: Recurrent gene fusions in prostate cancer. Nat Rev Cancer 2008, 8: 497-511.

49. Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S et al.: Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 2007, 448: 561-566.

50. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW et al.: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005, 310: 644-648.

51. Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T et al.: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 2008, 40: 722-729.

52. Ruan Y, Ooi HS, Choo SW, Chiu KP, Zhao XD, Srinivasan KG et al.: Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using Paired-End diTags (PETs). Genome Res 2007, 17: 828-838.

53. Nowell PC, Hungerford DA: Chromosome studies in human leukemia. II. Chronic granulocytic leukemia. J Natl Cancer Inst 1961, 27: 1013-1035.

54. Druker BJ, Sawyers CL, Kantarjian H, Resta DJ, Reese SF, Ford JM et al.: Activity of a specific inhibitor of the BCR-ABL tyrosine kinase in the blast crisis of chronic myeloid leukemia and acute lymphoblastic leukemia with the Philadelphia chromosome. N Engl J Med 2001, 344: 1038-1042.

55. Druker BJ, Talpaz M, Resta DJ, Peng B, Buchdunger E, Ford JM et al.: Efficacy and safety of a specific inhibitor of the BCR-ABL tyrosine kinase in chronic myeloid leukemia. N Engl J Med 2001, 344: 1031-1037.

56. Adams JM, Harris AW, Pinkert CA, Corcoran LM, Alexander WS, Cory S et al.: The c-myc oncogene driven by immunoglobulin enhancers induces lymphoid malignancy in transgenic mice. Nature 1985, 318: 533-538.

57. Heisterkamp N, Jenster G, ten HJ, Zovich D, Pattengale PK, Groffen J: Acute leukaemia in bcr/abl transgenic mice. Nature 1990, 344: 251-253.

58. Rodriguez-Garcia A, Sanchez-Martin M, Perez-Losada J, Perez-Mancera PA, Sagrera-Aparisi A, Gutierrez-Cianca N et al.: Selective destruction of tumor cells through specific inhibition of products resulting from chromosomal translocations. Curr Cancer Drug Targets 2001, 1: 109-119.

59. Thomas M, Greil J, Heidenreich O: Targeting leukemic fusion proteins with small interfering RNAs: recent advances and therapeutic potentials. Acta Pharmacol Sin 2006, 27: 273-281.

60. Parkin DM, Bray F, Ferlay J, Pisani P: Global cancer statistics, 2002. CA Cancer J Clin 2005, 55: 74-108.

61. Cancer in Norway 2006 - Cancer incidence, mortality, survival, and prevalence in Norway. Oslo: Cancer Registry of Norway; 2007.

62. Rustgi AK: The genetics of hereditary colon cancer. Genes Dev 2007, 21: 2525-2538.

63. Groden J, Thliveris A, Samowitz W, Carlson M, Gelbert L, Albertsen H et al.: Identification and characterization of the familial adenomatous polyposis coli gene. Cell 1991, 66: 589-600.

64. Fishel R, Lescoe MK, Rao MR, Copeland NG, Jenkins NA, Garber J et al.: The human mutator gene homolog MSH2 and its association with hereditary nonpolyposis colon cancer. Cell 1993, 75: 1027-1038.

65. Fearon ER, Vogelstein B: A genetic model for colorectal tumorigenesis. Cell 1990, 61: 759-767.

66. Gervaz P, Bucher P, Morel P: Two colons-two cancers: paradigm shift and clinical implications. J Surg Oncol 2004, 88: 261-266.

67. Haydon AM, Jass JR: Emerging pathways in colorectal-cancer development. Lancet Oncol 2002, 3: 83-88.

68. Hermsen M, Postma C, Baak J, Weiss M, Rapallo A, Sciutto A et al.: Colorectal adenoma to carcinoma progression follows multiple pathways of chromosomal instability. Gastroenterology 2002, 123: 1109-1119.

69. Thibodeau SN, Bren G, Schaid D: Microsatellite instability in cancer of the proximal colon. Science 1993, 260: 816-819.

70. Aaltonen LA, Peltomaki P, Leach FS, Sistonen P, Pylkkanen L, Mecklin JP et al.: Clues to the pathogenesis of familial colorectal cancer. Science 1993, 260: 812-816.

71. Ionov Y, Peinado MA, Malkhosyan S, Shibata D, Perucho M: Ubiquitous somatic mutations in simple repeated sequences reveal a new mechanism for colonic carcinogenesis. Nature 1993, 363: 558-561.

72. Lothe RA, Peltomaki P, Meling GI, Aaltonen LA, Nystrom-Lahti M, Pylkkanen L et al.: Genomic instability in colorectal cancer: relationship to clinicopathological variables and family history. Cancer Res 1993, 53: 5849-5852.

73. Popat S, Hubner R, Houlston RS: Systematic review of microsatellite instability and colorectal cancer prognosis. J Clin Oncol 2005, 23: 609-618.

74. Arends JW: Molecular interactions in the Vogelstein model of colorectal carcinoma. J Pathol 2000, 190: 412-416.

75. Toyota M, Ahuja N, Ohe-Toyota M, Herman JG, Baylin SB, Issa JP: CpG island methylator phenotype in colorectal cancer. Proc Natl Acad Sci U S A 1999, 96: 8681-8686.

76. Issa JP: CpG island methylator phenotype in cancer. Nat Rev Cancer 2004, 4: 988-993.

77. van RM, Grieu F, Elsaleh H, Joseph D, Iacopetta B: Characterisation of colorectal cancers showing hypermethylation at multiple CpG islands. Gut 2002, 51: 797-802.

78. Weisenberger DJ, Siegmund KD, Campan M, Young J, Long TI, Faasse MA et al.: CpG island methylator phenotype underlies sporadic microsatellite instability and is tightly associated with BRAF mutation in colorectal cancer. Nat Genet 2006, 38: 787-793.

79. Kambara T, Simms LA, Whitehall VL, Spring KJ, Wynter CV, Walsh MD et al.: BRAF mutation is associated with DNA methylation in serrated polyps and cancers of the colorectum. Gut 2004, 53: 1137-1144.

80. Hawkins N, Norrie M, Cheong K, Mokany E, Ku SL, Meagher A et al.: CpG island methylation in sporadic colorectal cancers and its relationship to microsatellite instability. Gastroenterology 2002, 122: 1376-1387.

81. Cuthbert E Dukes: The classification of cancer of the rectum. The Journal of Pathology and Bacteriology 1932, 35: 323-332.

82. Longley DB, Harkin DP, Johnston PG: 5-fluorouracil: mechanisms of action and clinical strategies. Nat Rev Cancer 2003, 3: 330-338.

83. Brooks DG: The neurofibromatoses: hereditary predisposition to multiple peripheral nerve tumors. Neurosurg Clin N Am 2004, 15: 145-155.

84. Ducatman BS, Scheithauer BW, Piepgras DG, Reiman HM, Ilstrup DM: Malignant peripheral nerve sheath tumors. A clinicopathologic study of 120 cases. Cancer 1986, 57: 2006-2021.

85. Wallace MR, Marchuk DA, Andersen LB, Letcher R, Odeh HM, Saulino AM et al.: Type 1 neurofibromatosis gene: identification of a large transcript disrupted in three NF1 patients. Science 1990, 249: 181-186.

86. Cawthon RM, Weiss R, Xu GF, Viskochil D, Culver M, Stevens J et al.: A major segment of the neurofibromatosis type 1 gene: cDNA sequence, genomic structure, and point mutations. Cell 1990, 62: 193-201.

87. Zhang YY, Vik TA, Ryder JW, Srour EF, Jacks T, Shannon K et al.: Nf1 regulates hematopoietic progenitor cell growth and ras signaling in response to multiple cytokines. J Exp Med 1998, 187: 1893-1902.

88. Serra E, Puig S, Otero D, Gaona A, Kruyer H, Ars E et al.: Confirmation of a double-hit model for the NF1 gene in benign neurofibromas. Am J Hum Genet 1997, 61: 512-519.

89. Nielsen GP, Stemmer-Rachamimov AO, Ino Y, Moller MB, Rosenberg AE, Louis DN: Malignant transformation of neurofibromas in neurofibromatosis 1 is associated with CDKN2A/p16 inactivation. Am J Pathol 1999, 155: 1879-1884.

90. Legius E, Dierick H, Wu R, Hall BK, Marynen P, Cassiman JJ et al.: TP53 mutations are frequent in malignant NF1 tumors. Genes Chromosomes Cancer 1994, 10: 250-255.

91. Lothe RA, Smith-Sørensen B, Hektoen M, Stenwig AE, Mandahl N, Sæter G et al.: Biallelic inactivation of TP53 rarely contributes to the development of malignant peripheral nerve sheath tumors. Genes Chromosomes Cancer 2001, 30: 202-206.

92. Woodruff JM: Pathology of the major peripheral nerve sheath neoplasms. Monogr Pathol 1996, 38: 129-161.

93. Sawyers CL: Chronic myeloid leukemia. N Engl J Med 1999, 340: 1330-1340.

94. Daley GQ, Van Etten RA, Baltimore D: Induction of chronic myelogenous leukemia in mice by the P210bcr/abl gene of the Philadelphia chromosome. Science 1990, 247: 824-830.

95. Chowdhury T, Brady HJ: Insights from clinical studies into the role of the MLL gene in infant and childhood leukemia. Blood Cells Mol Dis 2008, 40: 192-199.

96. Looijenga LH, Oosterhuis JW: Pathogenesis of testicular germ cell tumours. Rev Reprod 1999, 4: 90-100.

97. Skakkebæk NE, Berthelsen JG, Giwercman A, Müller J: Carcinoma-in-situ of the testis: possible origin from gonocytes and precursor of all types of germ cell tumours except spermatocytoma. Int J Androl 1987, 10: 19-28.

98. Rajpert-De ME, Bartkova J, Samson M, Hoei-Hansen CE, Frydelund-Larsen L, Bartek J et al.: The emerging phenotype of the testicular carcinoma in situ germ cell. APMIS 2003, 111: 267-278.

99. Mostofi FK, Sesterhenn IA: World Health Organization International Histological Classification of Tumours: Histological typing of testis tumours, 2nd edn. Berlin: Springer-Verlag; 1998.

100. Kraggerud SM, Skotheim RI, Szymanska J, Eknæs M, Fosså SD, Stenwig AE et al.: Genome profiles of familial/bilateral and sporadic testicular germ cell tumors. Genes Chromosomes Cancer 2002, 34: 168-174.

101. Atkin NB, Baker MC: Specific chromosome change, i(12p), in testicular tumours? Lancet 1982, 2: 1349.

102. Rodriguez E, Houldsworth J, Reuter VE, Meltzer P, Zhang J, Trent JM et al.: Molecular cytogenetic analysis of i(12p)-negative human male germ cell tumors. Genes Chromosomes Cancer 1993, 8: 230-236.

103. Skotheim RI, Lothe RA: The testicular germ cell tumour genome. APMIS 2003, 111: 136-150.

104. Kleivi K, Teixeira MR, Eknæs M, Diep CB, Jakobsen KS, Hamelin R et al.: Genome signatures of colon carcinoma cell lines. Cancer Genet Cytogenet 2004, 155: 119-131.

105. Meling GI, Lothe RA, Børresen AL, Hauge S, Graue C, Clausen OP et al.: Genetic alterations within the retinoblastoma locus in colorectal carcinomas. Relation to DNA ploidy pattern studied by flow cytometric analysis. Br J Cancer 1991, 64: 475-480.

106. Storlazzi CT, Brekke HR, Mandahl N, Brosjo O, Smeland S, Lothe RA et al.: Identification of a novel amplicon at distal 17q containing the BIRC5/SURVIVIN gene in malignant peripheral nerve sheath tumours. J Pathol 2006, 209: 492-500.

107. Badran A, Yoshida A, Ishikawa K, Goi T, Yamaguchi A, Ueda T et al.: Identification of a novel splice variant of the human anti-apoptopsis gene survivin. Biochem Biophys Res Commun 2004, 314: 902-907.

108. Burmeister T, Maurer J, Aivado M, Elmaagacli AH, Grunebach F, Held KR et al.: Quality assurance in RT-PCR-based BCR/ABL diagnostics - results of an interlaboratory test and a standardization approach. Leukemia 2000, 14: 1850-1856.

109. Hahn Y, Bera TK, Gehlhaus K, Kirsch IR, Pastan IH, Lee B: Finding fusion genes resulting from chromosome rearrangement by analyzing the expressed sequence databases. Proc Natl Acad Sci U S A 2004, 101: 13257-13261.

110. Kilpinen S, Autio R, Ojala K, Iljin K, Bucher E, Sara H et al.: Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues. Genome Biol 2008, 19: R139.

111. Lannon CL, Sorensen PH: ETV6-NTRK3: a chimeric protein tyrosine kinase with transformation activity in multiple cell lineages. Semin Cancer Biol 2005, 15: 215-223.

112. Gardina PJ, Clark TA, Shimada B, Staples MK, Yang Q, Veitch J et al.: Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics 2006, 7: 325.

113. Kleppe K, Ohtsuka E, Kleppe R, Molineux I, Khorana HG: Studies on polynucleotides. XCVI. Repair replications of short synthetic DNA's as catalyzed by DNA polymerases. J Mol Biol 1971, 56: 341-361.

114. Garcia JG, Ma SF: Polymerase chain reaction: a landmark in the history of gene technology. Crit Care Med 2005, 33: S429-S432.

115. Skrzypski M: Quantitative reverse transcriptase real-time polymerase chain reaction (qRT-PCR) in translational oncology: lung cancer perspective. Lung Cancer 2008, 59: 147-154.

116. Wong ML, Medrano JF: Real-time PCR for mRNA quantitation. Biotechniques 2005, 39: 75-85.

117. Gibson UE, Heid CA, Williams PM: A novel method for real time quantitative RT-PCR. Genome Res 1996, 6: 995-1001.

118. Heid CA, Stevens J, Livak KJ, Williams PM: Real time quantitative PCR. Genome Res 1996, 6: 986-994.

119. Frohman MA, Dush MK, Martin GR: Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc Natl Acad Sci U S A 1988, 85: 8998-9002.

120. Ohara O, Dorit RL, Gilbert W: One-sided polymerase chain reaction: the amplification of cDNA. Proc Natl Acad Sci U S A 1989, 86: 5673-5677.

121. Loh EY, Elliott JF, Cwirla S, Lanier LL, Davis MM: Polymerase chain reaction with single-sided specificity: analysis of T cell receptor delta chain. Science 1989, 243: 217-220.

122. Scotto-Lavino E, Du G, Frohman MA: 5' end cDNA amplification using classic RACE. Nat Protoc 2006, 1: 2555-2562.

123. Shuman S: Recombination mediated by vaccinia virus DNA topoisomerase I in Escherichia coli is sequence specific. Proc Natl Acad Sci U S A 1991, 88: 10104-10108.

124. Bernard P, Gabant P, Bahassi EM, Couturier M: Positive-selection vectors using the F plasmid ccdB killer gene. Gene 1994, 148: 71-74.

125. Maxam AM, Gilbert W: A new method for sequencing DNA. Proc Natl Acad Sci U S A 1977, 74: 560-564.

126. Sanger F, Nicklen S, Coulson AR: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 1977, 74: 5463-5467.

127. Atkinson MR, Deutscher MP, Kornberg A, Russell AF, Moffatt JG: Enzymatic synthesis of deoxyribonucleic acid. XXXIV. Termination of chain growth by a 2',3'-dideoxyribonucleotide. Biochemistry 1969, 8: 4897-4904.

128. Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR et al.: Fluorescence detection in automated DNA sequence analysis. Nature 1986, 321: 674-679.

129. Ling X, Yang J, Tan D, Ramnath N, Younis T, Bundy BN et al.: Differential expression of survivin-2B and survivin-DeltaEx3 is inversely associated with disease relapse and patient survival in non-small-cell lung cancer (NSCLC). Lung Cancer 2005, 49: 353-361.

130. Li F: Role of survivin and its splice variants in tumorigenesis. Br J Cancer 2005, 92: 212-216.

131. Li F, Ling X: Survivin study: an update of "what is the next wave"? J Cell Physiol 2006, 208: 476-486.

132. Okabe M, Matsushima S, Morioka M, Kobayashi M, Abe S, Sakurada K et al.: Establishment and characterization of a cell line, TOM-1, derived from a patient with Philadelphia chromosome-positive acute lymphocytic leukemia. Blood 1987, 69: 990-998.

133. Nasedkina T, Domer P, Zharinov V, Hoberg J, Lysov Y, Mirzabekov A: Identification of chromosomal translocations in leukemias by hybridization with oligonucleotide microarrays. Haematologica 2002, 87: 363-372.

134. Nasedkina TV, Zharinov VS, Isaeva EA, Mityaeva ON, Yurasov RN, Surzhikov SA et al.: Clinical screening of gene rearrangements in childhood leukemia by using a multiplex polymerase chain reaction-microarray approach. Clin Cancer Res 2003, 9: 5620-5629.

135. Shi RZ, Morrissey JM, Rowley JD: Screening and quantification of multiple chromosome translocations in human leukemia. Clin Chem 2003, 49: 1066-1073.

136. Lu Q, Nunez E, Lin C, Christensen K, Downs T, Carson DA et al.: A sensitive array-based assay for identifying multiple TMPRSS2:ERG fusion gene variants. Nucleic Acids Res 2008, 36: e130.

137. Jhavar S, Reid A, Clark J, Kote-Jarai Z, Christmas T, Thompson A et al.: Detection of TMPRSS2-ERG Translocations in Human Prostate Cancer by Expression Profiling Using GeneChip Human Exon 1.0 ST Arrays. J Mol Diagn 2008, 10: 50-57.

138. Horiuchi T, Aigaki T: Alternative trans-splicing: a novel mode of pre-mRNA processing. Biol Cell 2006, 98: 135-140.

139. Sasaki N, Nagaoka S, Itoh M, Izawa M, Konno H, Carninci P et al.: Characterization of gene expression in mouse blastocyst using single-pass sequencing of 3995 clones. Genomics 1998, 49: 167-179.

140. Schoenmakers EF, Huysmans C, Van d, V: Allelic knockout of novel splice variants of human recombination repair gene RAD51B in t(12;14) uterine leiomyomas. Cancer Res 1999, 59: 19-23.

141. Tagawa H, Miura I, Suzuki R, Suzuki H, Hosokawa Y, Seto M: Molecular cytogenetic analysis of the breakpoint region at 6q21-22 in T-cell lymphoma/leukemia cell lines. Genes Chromosomes Cancer 2002, 34: 175-185.

142. Taketani T, Taki T, Shibuya N, Kikuchi A, Hanada R, Hayashi Y: Novel NUP98-HOXC11 fusion gene resulted from a chromosomal break within exon 1 of HOXC11 in acute myeloid leukemia with t(11;12)(p15;q13). Cancer Res 2002, 62: 4571-4574.

143. Brambillasca F, Mosna G, Colombo M, Rivolta A, Caslini C, Minuzzo M et al.: Identification of a novel molecular partner of the E2A gene in childhood leukemia. Leukemia 1999, 13: 369-375.

144. Kobzev YN, Martinez-Climent J, Lee S, Chen J, Rowley JD: Analysis of translocations that involve the NUP98 gene in patients with 11p15 chromosomal rearrangements. Genes Chromosomes Cancer 2004, 41: 339-352.

145. Gervais C, Mauvieux L, Perrusson N, Helias C, Struski S, Leymarie V et al.: A new translocation t(9;11)(q34;p15) fuses NUP98 to a novel homeobox partner gene, PRRX2, in a therapy-related acute myeloid leukemia. Leukemia 2005, 19: 145-148.

146. Petherick A: Genetics: The production line. Nature 2008, 454: 1042-1045.

147. Martianov I, Ramadass A, Serra BA, Chow N, Akoulitchev A: Repression of the human dihydrofolate reductase gene by a non-coding interfering transcript. Nature 2007, 445: 666-670.


Appendix I – Primer information


| Gene | Name | Type | Length | Sequence | Tm (°C) | GC (%) |
|---|---|---|---|---|---|---|
| ABL1 | ABL1_ex2_rev | Reverse | 20 | ACCCTGAGGCTCAAAGTCAG | 59.5 | 55 |
| ABL1 | ABL1_ex3_rev | Reverse | 23 | TTCCCCATTGTGATTATAGCCTA | 64.0 | 39 |
| BCR | BCR_ex1_forw | Forward | 20 | CAACAGTCCTTCGACAGCAG | 59.6 | 55 |
| BCR | BCR_ex13_forw | Forward | 21 | CAGATGCTGACCAACTCGTGT | 64.0 | 52 |
| BIRC5 | BIRC5-6'FAM-R | Reverse | 20 | TCTCCGCAGTTTCCTCAAAT | 59.8 | 45 |
| BIRC5 | BIRC5-EX2-L-F | Forward | 19 | GAGGCTGGCTTCATCCACT | 60.4 | 58 |
| BIRC5 | BIRC5-EX1-F | Forward | 20 | AGAACTGGCCCTTCTTGGAG | 60.8 | 55 |
| BIRC5 | BIRC5-EX2-K-F | Forward | 20 | GCCCAGTGTTTCTTCTGCTT | 59.5 | 50 |
| BIRC5 | BIRC5_ex4_Rev | Reverse | 20 | TCTCCGCAGTTTCCTCAAAT | 59.8 | 45 |
| C4BPB | C4BPB_ex1_F | Forward | 26 | CCTTGCTGGGAAGCCCTAACTCTGGA | 71.7 | 58 |
| C4BPB | C4BPB_ex2_R | Reverse | 25 | ACGCAACCATAAGACAGCACGCACA | 70.6 | 52 |
| C4BPB | C4BPB_ex2_nest_R | Reverse | 25 | GGCTGGAATTCACCCAGCTCAGACA | 70.5 | 56 |
| CST1 | CST1_ex1_F | Forward | 25 | TGCGGGTACTAAGAGCCAGGCAACA | 70.9 | 56 |
| CST1 | CST1_ex3_R | Reverse | 24 | CGAATGGCCTGGCACAGATCCCTA | 71.0 | 58 |
| CST1 | CST1_ex3_nest_R | Reverse | 27 | TGACACCTGGATTTCACCAGGGACCTT | 71.7 | 52 |
| ETV6 | ETV6_ex5_forw | Forward | 20 | CACTCCGTGGATTTCAAACA | 59.5 | 45 |
| FZD10 | FZD10_ex1_F | Forward | 25 | TTTATGCTGCTGGTGGTGGGGATCA | 71.3 | 52 |
| FZD10 | FZD10_ex1_R | Reverse | 25 | CCGTGGTGAGTTTTCTGGGGATGCT | 71.3 | 56 |
| FZD10 | FZD10_ex1_nest_R | Reverse | 25 | GCCGCCAGGATCTTCCAGTAATCCA | 71.3 | 56 |
| GJB6 | GJB6_ex3_F | Forward | 25 | TTCGGATAGAGGGGTCGCTGTGGTG | 72.1 | 60 |
| GJB6 | GJB6_ex3_R | Reverse | 25 | GCAGCATGCAAATCACAGACGCAGA | 71.2 | 52 |
| GJB6 | GJB6_ex3_nest_R | Reverse | 25 | AACAAGGTTGGGGCAGGGGTCAATC | 72.0 | 56 |
| GPR177 | GPR177_ex2_R | Reverse | 20 | GGAGGGGAATGTGAACAGAA | 57.0 | 50 |
| GPR177 | GPR177_ex1_F | Forward | 20 | TCTGCTCGTGTTCCAAATCA | 57.1 | 45 |
| HOXB13 | HOXB13_ex1_F | Forward | 25 | CAGCCAGATGTGTTGCCAGGGAGAA | 71.2 | 56 |
| HOXB13 | HOXB13_ex2_R | Reverse | 25 | CTTGCGCCTCTTGTCCTTGGTGATG | 70.9 | 56 |
| HOXB13 | HOXB13_ex2_R_alt2 | Reverse | 28 | TAAGGGGTAGCGCTGTTCTTCACCTTGG | 72.5 | 54 |
| HOXC11 | HOXC11_ex1_F | Forward | 25 | ACAAATCCCAGCTCGTCCGGTTCAG | 71.4 | 56 |
| HOXC11 | HOXC11_ex2_R | Reverse | 25 | CCCTGGCCACAGTCCAGTTTTCCAC | 71.6 | 60 |
| HOXC11 | HOXC11_ex2_nest_R | Reverse | 25 | CCGGTCTGCAGGTTACAGCAGAGGA | 70.6 | 60 |
| Hs.446400 | Hs.446400_F | Forward | 20 | CAGAGCTGCATCCTTATGGT | 55.1 | 50 |
| Hs.446400 | Hs.446400_R | Reverse | 20 | AGCTGCAAGTTGTTGTTCCA | 56.5 | 45 |
| MIER1 | MIER1_ex9_F | Forward | 22 | CCATCAGAAGACTGGAAAAAGG | 58.3 | 45 |
| MIER1 | MIER1_ex10_R | Reverse | 22 | TGCTTCTACACCCTTCTCATCA | 57.5 | 45 |
| MTHFD2L | MTHFD2L_ex5_F | Forward | 20 | GACCCAAGAGTCAGCGGTAT | 56.5 | 55 |
| MTHFD2L | MTHFD2L_ex7_R | Reverse | 20 | GATCTTCCAGCCACAACCAC | 57.4 | 55 |
| NKAIN2 | NKAIN2_ex8_F | Forward | 27 | TGGCTATCAAGGGCCTCAGAAGACATC | 70.0 | 52 |
| NKAIN2 | NKAIN2_ex10_R | Reverse | 25 | CAGGAAATCCAAGATGGGCGTGTCC | 71.5 | 56 |
| NKAIN2 | NKAIN2_ex10_nest_R | Reverse | 25 | CAAGTGGAATTGGTGTGTGCGTGCT | 70.0 | 52 |
| PBX1 | PBX1_ex3_rev | Reverse | 21 | TGCTCCACTGAGTTGTCTGAA | 59.1 | 48 |
| PBX1 | PBX1_ex5_rev | Reverse | 20 | GGGTTGCTGAGATGGGAATA | 59.9 | 50 |
| PRRX1 | PRRX1_ex _F | Forward | 25 | TAGACCTGGAGGAAGCCGGGGACAT | 71.9 | 60 |
| PRRX1 | PRRX1_ex _R | Reverse | 25 | TAATCGGTGGGTCTCGGAGCAGGAC | 71.3 | 60 |
| PRRX1 | PRRX1_ex _nest_R | Reverse | 25 | GTGTCCGCTCAAAGACACGCTCCAA | 71.4 | 56 |
| PRRX1 | PRRX1_int1_ex3_R | Reverse | 25 | CCCAGCTTTGGTGGCACTTCTGTGA | 71.3 | 56 |
| PRRX1 | PRRX1_int1_ex4_R | Reverse | 28 | TCAGGGAAAACGTGAAACTCCTCTTGTC | 69.2 | 46 |
| PRRX2 | PRRX2_ex3_F | Forward | 25 | GCCCACCGCCCTGAGTCCAGATTAT | 72.2 | 60 |
| PRRX2 | PRRX2_ex4_R | Reverse | 25 | AGGTCCTTGGCAGGCTCTTCCACCT | 71.4 | 60 |
| PRRX2 | PRRX2_ex4_nest_R | Reverse | 25 | CAAGGGTTGTGGGCTGCAGTCTCTG | 71.0 | 60 |
| RAD51L1 | RAD51L1_ex5_F | Forward | 27 | CCCACCAACATGGGAGGATTAGAAGGA | 71.2 | 52 |
| RAD51L1 | RAD51L1_ex8_R | Reverse | 25 | AGCTGGAGACACCAGGTCTGCCTGA | 70.3 | 60 |
| RAD51L1 | RAD51L1_ex8_nest_R | Reverse | 25 | CTGAGAAGCCAGGGCTCCACTCAGA | 70.0 | 60 |
| RUNX1 | RUNX1_ex2_rev | Reverse | 20 | CGTGGACGTCTCTAGAAGGA | 58.0 | 55 |
| SERPINB7 | SERPINB7_ex3_F | Forward | 25 | TTGGGCGCTCAAGATGACTCCCTCT | 71.2 | 56 |
| SERPINB7 | SERPINB7_ex5_R | Reverse | 25 | GTCAACTCGCTCCACTTTGGCATCG | 70.9 | 56 |
| SERPINB7 | SERPINB7_ex6_R | Reverse | 26 | GAAGGCTGATTGCCACTTGCCTTTGA | 71.5 | 50 |
| TCF3 | TCF3_ex15_forw | Forward | 19 | CACCCTCCCTGACCTGTCT | 60.0 | 63 |
| TCF3 | TCF3_ex17_forw | Forward | 20 | GTGACATCAACGAGGCCTTT | 60.1 | 50 |
| TFPT | TFPT_ex3_F | Forward | 25 | CACATCCTGGAGAGCGAGCTGGAGA | 70.8 | 60 |
| TFPT | TFPT_ex4_R | Reverse | 25 | TCCTGCTGCAGCCTCCGAGTTATCC | 71.7 | 60 |
| TFPT | TFPT_ex4_nest_R | Reverse | 24 | CCTGTTCAGGACCCGCTCGTTCAC | 70.8 | 63 |
| TFR2 | TFR2_ex7_F | Forward | 25 | TCAGGACTTCGGGGCTCAAGGAGTG | 71.6 | 60 |
| TFR2 | TFR2_ex8_R | Reverse | 25 | GCTGGGAAGGCCTGATGATGCAACT | 71.5 | 56 |
| TFR2 | TFR2_ex8_nest_R | Reverse | 25 | TGTAGGGGTCTCCAGTTCCCAGGTG | 69.4 | 60 |
| Universal | NUP | Forward | 23 | AAGCAGTGGTATCAACGCAGAGT | 60.8 | 48 |
| Universal | UPM - Long | Forward | 45 | CTAATACGACTCACTATAGGGCAAGCAGTGGTATCAACGCAGAGT | 80.4 | 47 |
| Universal | UPM - Short | Forward | 22 | CTAATACGACTCACTATAGGGC | 51.3 | 45 |
| USP11 | USP11_eks6_R | Reverse | 18 | GCCTGGCTGACCCTTGAA | 58.8 | 61 |
| USP11 | USP11_ex5_F | Forward | 18 | GAGCGGTTTCTGGTGGAG | 55.7 | 61 |
| VNN1 | VNN1_ex5_F | Forward | 25 | TGCACACTGTGGAAGGGCGCTATTA | 69.7 | 52 |
| VNN1 | VNN1_ex7_R | Reverse | 27 | GGCTTCAGACTAAACAAGCGTCCGTCA | 70.8 | 52 |
| VNN1 | VNN1_ex6_nest_R | Reverse | 25 | CTGGGTTCCGAAAGTGCCACTGAGG | 71.8 | 60 |
| WIF1 | WIF1_ex9_F | Forward | 26 | GAACCTGCCATGAACCCAACAAATGC | 71.4 | 50 |
| WIF1 | WIF1_ex10_R | Reverse | 25 | GCCGCTCCTCGGCCTTTTTAAGTGA | 72.5 | 56 |
| WIF1 | WIF1_ex9_nest_R | Reverse | 25 | ATGGCAGGTTCCATGTGCACCACAG | 71.7 | 56 |
| ZDHHC20 | ZDHHC20_ex1_F | Forward | 18 | CTGGAGCGTCCGAGTCAC | 56.3 | 67 |
| ZDHHC20 | ZDHHC20_ex2_R | Reverse | 22 | CAACGGTCTTTCCATTTTCTTC | 58.5 | 41 |
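The Length and GC (%) columns of the primer table can be recomputed directly from the Sequence column. The following is a minimal sketch of such a check, not the procedure used in the thesis; the function name `gc_percent` and the rounding to a whole percent are assumptions made for illustration. The Tm column is not recomputed, because the appendix does not state which melting temperature formula was used.

```python
def gc_percent(sequence: str) -> int:
    """Return the GC content of a primer as a whole-number percentage.

    Hypothetical helper: rounding to the nearest integer is an assumption
    chosen to match the precision of the GC (%) column.
    """
    seq = sequence.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    gc_count = seq.count("G") + seq.count("C")
    return round(100 * gc_count / len(seq))


# Two primers copied from the table above.
abl1_ex2_rev = "ACCCTGAGGCTCAAAGTCAG"
abl1_ex3_rev = "TTCCCCATTGTGATTATAGCCTA"

for name, seq in [("ABL1_ex2_rev", abl1_ex2_rev), ("ABL1_ex3_rev", abl1_ex3_rev)]:
    print(name, len(seq), gc_percent(seq))
```

For these two ABL1 primers the sketch gives lengths of 20 and 23 and GC contents of 55 and 39, in agreement with the table.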


Appendix II – Results

Abbreviations:

T = Primary tumour sample

C = Cell line

N = Number of times sequenced


Table-A-II- 1. Exon positions from RAD51L1. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000182185; transcribed from plus strand; start position: 67,356,262 bp from chromosome 14 p-telomere; Ensembl release 50).

Exon 1 Exon 2 Exon 3 Exon 4 Sequence identifier NC StartEndStartEndStartEndStartEnd ENST00000342389 1 62 3,751 3,836 5,673 5,786 15,289 15,405 ENST00000344360 3,753 3,836 5,673 5,786 15,289 15,405 ENST00000390683 3,753 3,836 5,673 5,786 15,289 15,405 ENST00000402498 1 62 3,751 3,836 5,673 5,786 15,289 15,405 ENST00000403044 1 62 3,751 3,836 5,673 5,786 15,289 15,405 RAD51L1 A 2 SW48 RAD51L1 B 1 SW48 RAD51L1 C 1 SW48 RAD51L1 D 1 SW48 RAD51L1 E 1 SW48 RAD51L1 F 1 SW48 Exon 5Exon 6 Exon 7 Exon α Sequence identifier N C StartEndStartEndStartEndStartEnd ENST00000342389 45,212 45,348 66,078 66,197 67,230 67,413 ENST00000344360 45,212 45,348 66,078 66,197 67,230 67,413 ENST00000390683 45,212 45,348 66,078 66,197 67,230 67,413 ENST00000402498 45,212 45,348 66,078 66,197 67,230 67,413 ENST00000403044 45,212 45,348 66,078 66,197 67,230 67,413 RAD51L1 A 2 SW48 RAD51L1 B 1 SW48 RAD51L1 C 1 SW48 170,719 170,815 RAD51L1 D 1 SW48 170,771 170,815 RAD51L1 E 1 SW48 170,771 170,815 RAD51L1 F 1 SW48 Exon β Exon γ Exon δ Exon ε Sequence identifier N C StartEndStartEndStartEndStartEnd ENST00000342389 ENST00000344360 ENST00000390683 ENST00000402498 ENST00000403044 RAD51L1 A 2 SW48 180,425 180,522 RAD51L1 B 1 SW48 180,466 180,522 296,028 296,321 RAD51L1 C 1 SW48 RAD51L1 D 1 SW48 RAD51L1 E 1 SW48 296,028 296,321 RAD51L1 F 1 SW48 190,417 190,440 269,418 269,498 296,028 296,321 Exon ζ Exon η Exon 8 Sequence identifier N C StartEndStartEndStartEndStart positions ENST00000342389 472,093 472,189 ENST00000344360 472,093 472,189 ENST00000390683 472,093 472,189 ENST00000402498 472,093 472,189 ENST00000403044 472,093 472,189 RAD51L1 A 2 SW48 472,093 472,146 180,425, 180,459 RAD51L1 B 1 SW48 472,093 472,146 180,466, 180,473 RAD51L1 C 1 SW48 328,645 328,748 353,296 353,411 472,093 472,146 170,719 RAD51L1 D 1 SW48 472,093 472,146 170,771 RAD51L1 E 1 SW48 472,093 472,146 170,771 RAD51L1 F 1 SW48 472,093 472,146 190,417 102 Appendix II
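The tables in this appendix report positions relative to the start of exon 1, while each caption gives the chromosomal position of that start and the transcribed strand. A minimal sketch of converting a relative position into a chromosomal coordinate is shown below. It assumes that relative position 1 coincides with the start position in the caption, and that positions on minus-strand genes count downwards from that start; neither convention is stated explicitly in the captions. The function names are hypothetical, and upstream (negative) positions are rejected because the tables do not say whether a position 0 exists.

```python
def parse_position(text: str) -> int:
    """Parse a table value such as '3,751' into an integer."""
    return int(text.replace(",", ""))


def to_genomic(gene_start: int, relative: int, strand: str) -> int:
    """Convert a position relative to exon 1 into a chromosomal coordinate.

    Assumptions (not confirmed by the source): relative position 1 equals
    gene_start, and minus-strand positions decrease along the chromosome.
    """
    if relative < 1:
        raise ValueError("upstream (non-positive) positions are not handled in this sketch")
    offset = relative - 1
    if strand == "+":
        return gene_start + offset
    if strand == "-":
        return gene_start - offset
    raise ValueError("strand must be '+' or '-'")


# RAD51L1 (plus strand): start of exon 2 in ENST00000342389.
rad51l1_start = parse_position("67,356,262")
print(to_genomic(rad51l1_start, parse_position("3,751"), "+"))

# VNN1 (minus strand): start of exon 2 in ENST00000367928.
vnn1_start = parse_position("133,076,881")
print(to_genomic(vnn1_start, parse_position("2,211"), "-"))
```

Under these assumptions, the RAD51L1 exon 2 start maps to 67,360,012 and the VNN1 exon 2 start maps to 133,074,671.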

Table-A-II- 2. Exon positions from NKAIN2. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000188580; transcribed from plus strand; start position: 124,166,985 bp from chromosome 6 p-telomere; Ensembl release 50).

Exon 1 Exon 2 Exon 3 Exon 4 Exon α Sequence identifier N C/T Start End Start End Start End Start End Start End ENST00000355094 4 114 167,076 167,103 317,766 317,801 478,866 479,003 ENST00000368416 1 114 478,866 479,003 ENST00000368417 1 114 478,866 479,003 C1033III and NKAIN2 A 5 LS1034 544,294 544,565 NKAIN2 B 5 LS1034 NKAIN2 C 5 C1033III C1033III and NKAIN2 D 2 LS1034 544,294 544,565 NKAIN2 E 2 LS1034 544,294 544,565 NKAIN2 F 1 C1033 478,866 479,003 NKAIN2 G 1 C1033III 544,294 544,565 NKAIN2 H 1 LS1034 Exon 5 Exon 6 Exon 7 Exon 8 Exon β Sequence identifier NC/T Start End Start End Start End Start End Start End ENST00000355094 551,128 551,208 686,247 686,253 854,053 854,247 987,200 987,260 ENST00000368416 551,128 551,208 854,047 854,819 ENST00000368417 551,128 551,208 854,047 854,247 987,200 987,260 C1033III and NKAIN2 A 5 LS1034 551,128 551,208 854,047 854,247 987,200 987,260 NKAIN2 B 5 LS1034 NKAIN2 C 5 C1033III C1033III and NKAIN2 D 2 LS1034 551,128 551,208 854,047 854,247 987,200 987,260 NKAIN2 E 2 LS1034 551,128 551,208 854,047 854,247 987,200 987,260 990,705 990,747 NKAIN2 F 1 C1033 551,128 551,208 854,047 854,247 987,200 987,260 NKAIN2 G 1 C1033III 551,128 551,208 854,047 854,247 NKAIN2 H 1 LS1034 Exon γ Exon 9Exon δ Exon 10 Sequence identifier N C/T Start End Start End Start End Start End Start position ENST00000355094 1,014,248 1,014,329 1,019,081 1,021,477 ENST00000368416 ENST00000368417 1,014,248 1,014,329 1,019,081 1,021,518 544,470, 544,470, C1033III and 544,406, 544,470, NKAIN2 A 5 LS1034 1,014,248 1,014,329 1,019,081 1,019,288 544,294 1,010,112 for all five NKAIN2 B 5 LS1034 1,010,112 1,010,192 1,014,248 1,014,329 1,019,081 1,019,288 clones

1,019,091, 1,019,226, 1,019,226, 1,019,226, NKAIN2 C 5 C1033III 1,019,081 1,019,288 1,019,081 NKAIN2 D 2 LS1034 1,014,248 1,014,329 1,014,947 1015035 1,019,081 1,019,288 544,470, 544,455 NKAIN2 E 2 LS1034 1,014,248 1,014,329 1,014,947 1015035 1,019,081 1,019,288 clones NKAIN2 F 1 C1033 1,014,248 1,014,329 1,014,947 1015035 1,019,081 1,019,288 478,941 NKAIN2 G 1 C1033III 1,014,291 1,014,329 1,019,081 1,019,288 544,470 NKAIN2 H 1 LS1034 1,014,248 1,014,329 1,014,947 1015035 1,019,081 1,019,288 1,014,248


Table-A-II- 3. Exon positions from VNN1. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000112299; transcribed from minus strand; start position: 133,076,881 bp from chromosome 6 p-telomere; Ensembl release 50).

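| Sequence identifier | N | C | Exon 1 start | Exon 1 end | Exon 2 start | Exon 2 end | Exon 3 start | Exon 3 end | Exon 4 start | Exon 4 end |
|---|---|---|---|---|---|---|---|---|---|---|
| ENST00000367928 | | | 1 | 224 | 2,211 | 2,342 | 19,868 | 20,061 | 20,735 | 21,027 |
| VNN1 A | 4 | HT29 | | | | | | | | |
| VNN1 B | 1 | HT29 | | | | | | | | |
| VNN1 C | 1 | HT29 | | | | | | | | |

| Sequence identifier | N | C | Exon 5 start | Exon 5 end | Exon α start | Exon α end | Exon β start | Exon β end | Exon 6 start | Exon 6 end | Start positions |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ENST00000367928 | | | 21,466 | 21,828 | | | | | 29,545 | 29,716 | |
| VNN1 A | 4 | HT29 | | | 26,645 | 27,450 | 28,645 | 28,796 | 29,545 | 29,659 | 26,645, 26,670, 26,662, 26,676 |
| VNN1 B | 1 | HT29 | | | 26,645 | 27,450 | 28,610 | 28,796 | 29,545 | 29,659 | 26,675 |
| VNN1 C | 1 | HT29 | | | 26,680 | 26,788 | | | 29,545 | 29,659 | 26,680 |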

Table-A-II- 4. Exon positions from C4BPB. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000123843; transcribed from plus strand; start position: 205,328,835 bp from chromosome 1 p-telomere; Ensembl release 50).

Exon 1 Exon 2 Start Sequence identifier N T Start End Start End positions ENST00000243611 372 723 ENST00000367076 372 723 ENST00000367078 1 80 615 723 ENST00000391923 416* *723 ENST00000391924 1 80 615 723 C4BPB A 3 C1034III 1 80 232 641 1, -13, -11 C4BPB B 1 C1034III 372 641 372 C4BPB C 1 C1034III -53 80 615 641 -53 *ENST00000391923 lacks base pairs 496-614

Table-A-II- 5. Exon positions from HOXC11. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000123388; transcribed from plus strand; start position: 52,653,177 bp from chromosome 12 p-telomere; Ensembl release 50).

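| Sequence identifier | N | T | Exon 1 start | Exon 1 end | Exon α start | Exon α end | Exon 2 start | Exon 2 end | Start positions |
|---|---|---|---|---|---|---|---|---|---|
| ENST00000243082 | | | 1 | 798 | | | 2,055 | 3,292 | |
| HOXC11 A | 4 | C1402III | | | 1,244 | 1,398 | 2,055 | 2,300 | 1,254, 1,281, 1,244, 1,244 |
| HOXC11 B | 2 | C1402III | | | 1,254 | 1,300 | 2,055 | 2,300 | 1,254, 1,254 |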


Table-A-II- 6. Exon positions from TFR2. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000106327; transcribed from minus strand; start position: 100,077,109 bp from chromosome 7 p-telomere; Ensembl release 50).

Exon 1 Exon 2 Exon 3 Exon 4 Sequence identifier NC Start End Start End Start End Start End ENST00000223051 1 74 322 575 678 865 7,995 8,135 SW48 and TFR2 A 6 RKO TFR2 B 3 SW48 SW48 and TFR2 C 3 RKO TFR2 D 2 RKO 7,938 8,135 TFR2 E 2 RKO TFR2 F 1 RKO TFR2 G 1 SW48 TFR2 H 1 RKO TFR2 I 1 RKO TFR2 J 1 SW48 Exon 5 Exon 6 Exon 7 Exon 8 Sequence identifier N CStart End Start End Start End Start End Start positions ENST00000223051 8,211 8,322 8,428 8,550 9,353 9,469 9,606 9,745 SW48 and 8,541, 8,541, 8,549, TFR2 A 6 RKO 8,428 8,772 9,353 9,469 9,606 9,633 8,542, 8,536, 8,546 TFR2 B 3 SW48 8,428 8,550 9,353 9,605 9,606 9,633 8,498, 8,517, 8,517 SW48 and TFR2 C 3 RKO 8428* *8772 9353** **9605 9,606 9,633 8,526, 8,549, 8,546 TFR2 D 2 RKO 8,211 8,322 8,428 8,550 9,353 9,469 9,606 9,633 7,938, 7,938 TFR2 E 2 RKO 9,353 9,469 9,606 9,633 9,360, 9,395 TFR2 F 1 RKO 8,404 8,550 9,353 9,469 9,606 9,633 8,404 TFR2 G 1 SW48 8,428 8,550 9,606 9,633 8,502 TFR2 H 1 RKO 8,428 8,550 9,353 9,469 9,606 9,633 8,502 TFR2 I 1 RKO 8,428 8,550 9,571 9,605 9,606 9,633 8,486 TFR2 J 1 SW48 9,353 9,605 9,606 9,633 9,395 * One clone lacks base pairs 8551-8714 ** One clone lacks base pairs 9470-9570

Table-A-II- 7. Exon positions from SERPINB7. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000166396; transcribed from plus strand; start position: 59,571,257 bp from chromosome 18 p-telomere; Ensembl release 50).

Exon 1 Exon 2Exon α Exon 3 Sequence identifier N C Start End Start End Start End Start End ENST00000336429 1 78 29,313 29,498 ENST00000398019 22,336 22,674 29,313 29,498 SERPINB7 A 6 LS1034 22,336 22,674 29,313 29,498 SERPINB7 B 4 LS1034 24,736 24,783 29,313 29,498 SERPINB7 C 1 LS1034 Exon 4 Exon 5 Exon 6 Sequence identifier N CStart End Start End Start End Start positions ENST00000336429 39,351 39,401 40,119 40,238 43,224 43,341 ENST00000398019 39,351 39,401 40,119 40,238 43,224 43,341 22,382, 22,336, 22,339, 22,339, SERPINB7 A 6 LS1034 39,351 39,401 40,119 40,238 43,224 43,277 22,388, 22,495 24,739,24,739, SERPINB7 B 4 LS1034 39,351 39,401 40,119 40,238 43,224 43,277 24,736, 24,739 SERPINB7 C 1 LS1034 39,351 39,401 40,119 40,238 43,224 43,277 39,395


Table-A-II- 8. Exon positions from TFPT. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000105619; transcribed from minus strand; start position: 59,310,867 bp from chromosome 19 p-telomere; Ensembl release 50).

Exon 1 Exon 2 Exon 3 Exon 4 Sequence identifier NC Start End Start End Start End Start End Start positions ENST00000339150 19* *429 976 1,234 5,552 5,622 ENST00000391757 388 429 976 1,234 5,552 5,622 ENST00000391758 602 636 976 1,234 5,552 5,622 ENST00000391759 1 429 976 1,234 5,552 5,622 1,117, 1,121, 1,118, TFPT A 6 SW48 976 1,234 5,552 5,575 1,117, 1,114, 1,114 TFPT B 4 SW48 331 429 976 1,234 5,552 5,575 331, 331, 331, 355 *ENST00000339150 lacks base pairs 163-268

Table-A-II- 9. Exon positions from GJB6. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000121742; transcribed from minus strand; start position: 19,704,456 bp from chromosome 13 p-telomere; Ensembl release 50).

Exon 1 Exon 2 Exon 3 Sequence identifierN C Start End Start End Start End ENST00000241124 1,337 1,452 ENST00000356192 1 124 813 936 ENST00000400065 15 124 ENST00000400066 17 124 813 936 GJB6 A 7 HT29 GJB6 B 4 HT29 1 124 813 936 GJB6 C 2 HT29 1 124 GJB6 D 2 HT29 1 124 813 936 GJB6 E 1 HT29 813 936 GJB6 F 1 HT29 Exon 4 Exon 5 Exon 6 Sequence identifier N C Start End Start End Start End Start positions ENST00000241124 2,569 2,738 8,823 10,355 ENST00000356192 1,511 1,620 2,569 2,738 8,823 10,355 ENST00000400065 2,569 2,738 8,823 10,347 ENST00000400066 2,569 2,738 8,823 10,347 8,917, 8,917, 8,917, 9,122, 8,917, 8,916, GJB6 A 7 HT29 8,823 9,371 9,137 GJB6 B 4 HT29 2,569 2,738 8,823 9,371 103, 112, 103, 110 GJB6 C 2 HT29 2,569 2,738 8,823 9,371 103, 103 GJB6 D 2 HT29 2,548 2,738 8,823 9,371 98, 98 GJB6 E 1 HT29 2,569 2,738 8,823 9,371 861 GJB6 F 1 HT29 2,524 2,738 8,823 9,371 2,524


Table-A-II- 10. Exon positions from PRRX1. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000116132; transcribed from plus strand; start position: 168,899,937 bp from chromosome 1 p-telomere; Ensembl release 50).

Exon 1 Exon α Exon β Exon γ Sequence identifier N C Start End Start End Start End Start End ENST00000239461 1 288 ENST00000367760 1 288 PRRX1 A 7 SW48 PRRX1 B 2 SW48 50,433 50,663 51,315 51,367 PRRX1 C 2 SW48 50,433 50,663 PRRX1 D 1 SW48 50,433 50,663 PRRX1 E 1 SW48 50,433 50,663 PRRX1 F 1 SW48 50,433 50,663 51,315 51,367 53,778 53,840 PRRX1 G 1 SW48 53,387 53,840 PRRX1 H 1 SW48 53,387** 53,840** PRRX1 I 1 SW48 Exon δ Exon ε Exon 2 Sequence identifier N C Start End Start End Start End Start positions ENST00000239461 55,555 55,731 ENST00000367760 55,555 55,731 54,492, 54,079, 54,627, 54,536, 54,495, 54,491, PRRX1 A 7 SW48 53,969* * * 54,761* 55,555 55,663 54,356 PRRX1 B 2 SW48 53,969 54,104 54,658 54,761 55,555 55,663 50,433, 50,433 PRRX1 C 2 SW48 53,969* * * 54,761* 55,555 55,663 50,523, 50,507 PRRX1 D 1 SW48 53,969 54,104 55,555 55,663 50,606 PRRX1 E 1 SW48 55,555 55,663 50,494 PRRX1 F 1 SW48 53,969 54,104 55,555 55,663 50,543 PRRX1 G 1 SW48 55,555 55,663 53,450 PRRX1 H 1 SW48 53,969 54,104 54,658 54,761 55,555 55,663 53,387 PRRX1 I 1 SW48 53,969 54,104 55,555 55,663 54,037 *Exon δ in sequence PRRX1 B and PRRX1 C is a retention of the intron between exons δ and ε. ** The exon lacks bases 53,625-53,778

Table-A-II- 11. Exon positions from PRRX2. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000167157; transcribed from plus strand; start position: 131,467,741 bp from chromosome 9 p-telomere; Ensembl release 50).

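| Sequence identifier | N | T | Exon 1 start | Exon 1 end | Exon 2 start | Exon 2 end | Exon 3 start | Exon 3 end | Exon 4 start | Exon 4 end | Start positions |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ENST00000372469 | | | 1 | 486 | 53,591 | 53,779 | 54,956 | 55,135 | 56,577 | 57,031 | |
| PRRX2 A | 11 | C1033III | | | | | 55,074 | 55,135 | 56,577 | 56,922 | 55,074 for all 11 clones |
| PRRX2 B | 1 | C1033III | | | | | | | 56,689 | 56,922 | 56,689 |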


Table-A-II- 12. Exon positions from WIF1. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000156076; transcribed from minus strand; start position: 63,801,383 bp from chromosome 12 p-telomere; Ensembl release 50).

Exon 1 Exon 2Exon α Exon 3 Sequence identifierN C Start End Start End Start End Start End ENST00000286574 1 293 781 920 43,483 43,591 WIF1 A 3 Colo320 43,483 43,591 WIF1 B 2 Colo320 40,886 40,979 43,483 43,591 WIF1 C 1 Colo320 WIF1 D 1 Colo320 WIF1 E 1 Colo320 Exon 4 Exon 5 Exon 6 Exon 7 Sequence identifier N C Start End Start End Start End Start End ENST00000286574 52,433 52,573 53,547 53,642 54,601 54,696 58,761 58,856 WIF1 A 3 Colo320 52,433 52,573 53,547 53,642 54,601 54,696 58,761 58,856 WIF1 B 2 Colo320 52,433 52,573 53,547 53,642 54,601 54,696 58,761 58,856 WIF1 C 1 Colo320 52,433 52,573 53,547 53,642 54,601 54,696 58,761 58,856 WIF1 D 1 Colo320 54,601 54,696 58,761 58,856 WIF1 E 1 Colo320 58,761 58,856 Exon 8 Exon 9 Sequence identifier N C Start End Start End Start positions ENST00000286574 65,211 65,306 66,124 66,219 43,566, 43,582, WIF1 A 3 Colo320 65,211 65,306 66,124 66,164 43,489 WIF1 B 2 Colo320 65,211 65,306 66,124 66,164 40,902, 40,886 WIF1 C 1 Colo320 65,211 65,306 66,124 66,164 52,519 WIF1 D 1 Colo320 65,211 65,306 66,124 66,164 54,690 WIF1 E 1 Colo320 65,211 65,264 66,124 66,164 58,808

Table-A-II- 13. Exon positions from HOXB13. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000159184; transcribed from minus strand; start position: 44,161,110 bp from chromosome 17 p-telomere; Ensembl release 50).

| Sequence identifier | N | C | Exon α start | Exon α end | Exon 1 start | Exon 1 end | Exon 2 start | Exon 2 end | Start positions |
|---|---|---|---|---|---|---|---|---|---|
| ENST00000290295 | | | | | 1 | 757 | 1,707 | 3,979 | |
| HOXB13 A | 15 | V9P, FRI | | | 1* | 757* | 1,707 | 1,849 | 89, 34, 23, 84, 90, 70, 90, 1, 90, 90, 584, 554, 600, 569, 553 |
| HOXB13 B | 6 | V9P, FRI | | | | | 1,468 | 1,849 | 1,468, 1,513, 1,468, 1,468, 1,468, 1,468 |
| HOXB13 C | 1 | FRI | -10,520 | -10,327 | | | 1,707 | 1,849 | -10,520 |

\* One clone lacks base pairs 100-232 in exon 1

Table-A-II- 14. Exon positions from CST1. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000170373; transcribed from minus strand; start position: 23,679,905 bp from chromosome 20 p-telomere; Ensembl release 50).

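| Sequence identifier | N | C | Exon 1 start | Exon 1 end | Exon 2 start | Exon 2 end | Exon 3 start | Exon 3 end | Exon 4 start | Exon 4 end | Start positions |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ENST00000304749 | | | | | 332 | 630 | 2,140 | 2,253 | 3,370 | 3,715 | |
| ENST00000398402 | | | 1 | 112 | 371 | 630 | 2,140 | 2,253 | 3,370 | 3,538 | |
| CST1 A | 11 | LS174T | | | 332 | 630 | 2,140 | 2,253 | 3,370 | 3,443 | 525, 336, 332, 332, 332, 354, 332, 617, 585, 332, 525 |
| CST1 B | 3 | LS174T | | | | | 2,140 | 2,253 | 3,370 | 3,443 | 2,146, 2,232, 2,232 |
| CST1 C | 1 | LS174T | | | | | | | 3,370 | 3,443 | 3,380 |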

Table-A-II- 15. Exon positions from FZD10. Start/end sequence positions are indicated relative to start of exon 1 (ENSG00000111432; transcribed from plus strand; start position: 129,212,985 bp from chromosome 12 p-telomere; Ensembl release 50).

| Sequence identifier | N | C | Exon 1 start | Exon 1 end | Start positions |
|---|---|---|---|---|---|
| ENST00000229030 | | | 1 | 3,253 | |
| FZD10 A | 20 | V9P | 1 | 1,886 | 1412 for 8 clones, 1124 for 4 clones, 1609 for 4 clones, 1615 for 2 clones, 1787, 1630 |


Appendix III – Known fusion genes in cancer


Upstream (5') fusion partner Downstream (3') fusion partner Gene Chromosom Gene Chromosom ENSG ENSG symbol e symbol e ABL1 ENSG00000097007 9 SFPQ ENSG00000116560 1 AKAP9 ENSG00000127914 7 BRAF ENSG00000157764 7 ARHGAP20 ENSG00000137727 11 BRWD3 ENSG00000165288 X ASPSCR1 ENSG00000169696 17 TFE3 ENSG00000068323 X ATIC ENSG00000138363 2 ALK ENSG00000171094 2 BCAS4 ENSG00000124243 20 BCAS3 ENSG00000141376 17 BCL11B ENSG00000127152 14 NKX2-5 ENSG00000183072 5 BCL11B ENSG00000127152 14 TLX3 ENSG00000164438 5 BCL11B ENSG00000127152 14 TRDC ENSG00000211829 14 BCL3 ENSG00000069399 19 MYC ENSG00000136997 8 BCL7A ENSG00000110987 12 MYC ENSG00000136997 8 BCR ENSG00000186716 22 ABL1 ENSG00000097007 9 BCR ENSG00000186716 22 FGFR1 ENSG00000077782 8 BCR ENSG00000186716 22 JAK2 ENSG00000096968 9 BCR ENSG00000186716 22 PDGFRA ENSG00000134853 4 BCR ENSG00000186716 22 SET ENSG00000119335 9 BIRC3 ENSG00000023445 11 MALT1 ENSG00000172175 18 BRD4 ENSG00000141867 19 NUT ENSG00000184507 15 BRWD3 ENSG00000165288 X ARHGAP20 ENSG00000137727 11 BTG1 ENSG00000133639 12 MYC ENSG00000136997 8 C15orf21 ENSG00000179362 15 ETV1 ENSG00000006468 7 C3orf27 ENSG00000198685 3 ETV1 ENSG00000006468 7 CANT1 ENSG00000171302 17 ETV1 ENSG00000006468 7 CANT1 ENSG00000171302 17 ETV4 ENSG00000175832 17 CARS ENSG00000110619 11 ALK ENSG00000171094 2 CBFB ENSG00000067955 16 MYH11 ENSG00000133392 16 CCDC6 ENSG00000108091 10 PDGFRB ENSG00000113721 5 CCDC6 ENSG00000108091 10 RET ENSG00000165731 10 CCDC88C ENSG00000015133 14 PDGFRA ENSG00000134853 4 CCDC88C ENSG00000015133 14 PDGFRB ENSG00000113721 5 CCND1 ENSG00000110092 11 FSTL3 ENSG00000070404 19 CDH11 ENSG00000140937 16 USP6 ENSG00000129204 17 CDK5RAP2 ENSG00000136861 9 PDGFRA ENSG00000134853 4 CDK6 ENSG00000105810 7 EVI1 ENSG00000085276 3 CDK6 ENSG00000105810 7 MLL ENSG00000118058 11 CDK6 ENSG00000105810 7 MLLT10 ENSG00000078403 10 CDK6 ENSG00000105810 7 TLX3 ENSG00000164438 5 CDKN2A ENSG00000147889 9 CDKN2A ENSG00000147889 9 CEP110 ENSG00000119397 9 FGFR1 
ENSG00000077782 8 CHCHD7 ENSG00000170791 8 PLAG1 ENSG00000181690 8 CHIC2 ENSG00000109220 4 ETV6 ENSG00000139083 12 CIITA ENSG00000179583 16 BCL6 ENSG00000113916 3 CLTC ENSG00000141367 17 ALK ENSG00000171094 2 CLTC ENSG00000141367 17 TFE3 ENSG00000068323 X CLTCL1 ENSG00000070371 22 ALK ENSG00000171094 2 CNBP ENSG00000169714 3 USP6 ENSG00000129204 17 COL1A1 ENSG00000108821 17 PDGFB ENSG00000100311 22 COL1A1 ENSG00000108821 17 USP6 ENSG00000129204 17

111 Appendix III

COL1A2 ENSG00000164692 7 PLAG1 ENSG00000181690 8 COL6A3 ENSG00000163359 2 CSF1 ENSG00000184371 1 CPSF6 ENSG00000111605 12 FGFR1 ENSG00000077782 8 CREB3L2 ENSG00000182158 7 PPARG ENSG00000132170 3 CRTC1 ENSG00000105662 19 MAML2 ENSG00000184384 11 CTNNB1 ENSG00000168036 3 PLAG1 ENSG00000181690 8 DDX5 ENSG00000108654 17 ETV4 ENSG00000175832 17 EIF4A2 ENSG00000156976 3 BCL6 ENSG00000113916 3 EML1 ENSG00000066629 14 ABL1 ENSG00000097007 9 EML4 ENSG00000143924 2 ALK ENSG00000171094 2 EPC1 ENSG00000120616 10 PHF1 ENSG00000112511 6 ERC1 ENSG00000082805 12 RET ENSG00000165731 10 EST14 ETV1 ENSG00000006468 7 ETV6 ENSG00000139083 12 ABL1 ENSG00000097007 9 ETV6 ENSG00000139083 12 ABL2 ENSG00000143322 1 ETV6 ENSG00000139083 12 ACSL6 ENSG00000164398 5 ETV6 ENSG00000139083 12 ARNT ENSG00000143437 1 ETV6 ENSG00000139083 12 BAZ2A ENSG00000076108 12 ETV6 ENSG00000139083 12 CDX2 ENSG00000165556 13 ETV6 ENSG00000139083 12 EVI1 ENSG00000085276 3 ETV6 ENSG00000139083 12 FGFR3 ENSG00000068078 4 ETV6 ENSG00000139083 12 FLT3 ENSG00000122025 13 ETV6 ENSG00000139083 12 FRK ENSG00000111816 6 ETV6 ENSG00000139083 12 GOT1 ENSG00000120053 10 ETV6 ENSG00000139083 12 JAK2 ENSG00000096968 9 ETV6 ENSG00000139083 12 MDS1 ENSG00000206115 3 ETV6 ENSG00000139083 12 MDS2 ENSG00000197880 1 ETV6 ENSG00000139083 12 MN1 ENSG00000169184 22 ETV6 ENSG00000139083 12 MNX1 ENSG00000130675 7 ETV6 ENSG00000139083 12 NCOA2 ENSG00000140396 8 ETV6 ENSG00000139083 12 NTRK3 ENSG00000140538 15 ETV6 ENSG00000139083 12 PDGFRA ENSG00000134853 4 ETV6 ENSG00000139083 12 PDGFRB ENSG00000113721 5 ETV6 ENSG00000139083 12 PER1 ENSG00000179094 17 ETV6 ENSG00000139083 12 PTPRR ENSG00000153233 12 ETV6 ENSG00000139083 12 RUNX1 ENSG00000159216 21 ETV6 ENSG00000139083 12 SYK ENSG00000165025 9 ETV6 ENSG00000139083 12 TCBA1 ENSG00000188580 6 ETV6 ENSG00000139083 12 TTL ENSG00000114999 2 EWSR1 ENSG00000182944 22 ATF1 ENSG00000123268 12 EWSR1 ENSG00000182944 22 CREB1 ENSG00000118260 2 EWSR1 ENSG00000182944 22 DDIT3 ENSG00000175197 12 EWSR1 
ENSG00000182944 22 ERG ENSG00000157554 21 EWSR1 ENSG00000182944 22 ETV1 ENSG00000006468 7 EWSR1 ENSG00000182944 22 ETV4 ENSG00000175832 17 EWSR1 ENSG00000182944 22 FEV ENSG00000163497 2 EWSR1 ENSG00000182944 22 FLI1 ENSG00000151702 11 EWSR1 ENSG00000182944 22 MARS ENSG00000166986 12 EWSR1 ENSG00000182944 22 NR4A3 ENSG00000119508 9 EWSR1 ENSG00000182944 22 PBX1 ENSG00000185630 1 EWSR1 ENSG00000182944 22 POU5F1 ENSG00000204531 6 EWSR1 ENSG00000182944 22 TEC ENSG00000135605 4 EWSR1 ENSG00000182944 22 WT1 ENSG00000184937 11 EWSR1 ENSG00000182944 22 ZNF278 ENSG00000100105 22

112 Appendix III

EWSR1 ENSG00000182944 22 ZNF384 ENSG00000126746 12 FGFR1OP ENSG00000112486 6 FGFR1 ENSG00000077782 8 FGFR1OP2 ENSG00000111790 12 FGFR1 ENSG00000077782 8 FHIT ENSG00000189283 3 HMGA2 ENSG00000149948 12 FHIT ENSG00000189283 3 RNF139 ENSG00000170881 8 FIP1L1 ENSG00000145216 4 PDGFRA ENSG00000134853 4 FLJ35294 ETV1 ENSG00000006468 7 FLJ35294 ETV4 ENSG00000175832 17 FLT3 ENSG00000122025 13 ETV6 ENSG00000139083 12 FOXP1 ENSG00000114861 3 ETV1 ENSG00000006468 7 FSTL3 ENSG00000070404 19 BCL6 ENSG00000113916 3 FUS ENSG00000089280 16 ATF1 ENSG00000123268 12 FUS ENSG00000089280 16 CREB3L1 ENSG00000157613 11 FUS ENSG00000089280 16 CREB3L2 ENSG00000182158 7 FUS ENSG00000089280 16 DDIT3 ENSG00000175197 12 FUS ENSG00000089280 16 ERG ENSG00000157554 21 FUS ENSG00000089280 16 MARS ENSG00000166986 12 GAPDH ENSG00000111640 12 BCL6 ENSG00000113916 3 GOLGA5 ENSG00000066455 14 RET ENSG00000165731 10 GOPC ENSG00000047932 6 ROS1 ENSG00000047936 6 HAS2 ENSG00000170961 8 PLAG1 ENSG00000181690 8 HERVK17 ETV1 ENSG00000006468 7 HIP1 ENSG00000127946 7 PDGFRB ENSG00000113721 5 HIST1H4I ENSG00000198339 6 BCL6 ENSG00000113916 3 HMGA1 ENSG00000137309 6 LAMA4 ENSG00000112769 6 HMGA2 ENSG00000149948 12 CCNB1IP1 ENSG00000100814 14 HMGA2 ENSG00000149948 12 COX6C ENSG00000164919 8 HMGA2 ENSG00000149948 12 CXCR7 ENSG00000144476 2 HMGA2 ENSG00000149948 12 EBF1 ENSG00000164330 5 HMGA2 ENSG00000149948 12 FHIT ENSG00000189283 3 HMGA2 ENSG00000149948 12 LHFP ENSG00000183722 13 HMGA2 ENSG00000149948 12 LPP ENSG00000145012 3 HMGA2 ENSG00000149948 12 NFIB ENSG00000147862 9 HMGA2 ENSG00000149948 12 RAD51L1 ENSG00000182185 14 HNRNPA2B1 ENSG00000122566 7 ETV1 ENSG00000006468 7 HP ENSG00000197711 16 MRPS10 ENSG00000048544 6 HSP90AA1 ENSG00000080824 14 BCL6 ENSG00000113916 3 HSP90AB1 ENSG00000096384 6 BCL6 ENSG00000113916 3 IDS ENSG00000010404 X CXorf40A ENSG00000197021 X IKZF1 ENSG00000185811 7 BCL6 ENSG00000113916 3 IL1RAPL1 ENSG00000169306 X DMD ENSG00000198947 X IL2 ENSG00000109471 4 TNFRSF17 ENSG00000048462 16 
IL21R ENSG00000103522 16 BCL6 ENSG00000113916 3 ITK ENSG00000113263 5 SYK ENSG00000165025 9 JAZF1 ENSG00000153814 7 PHF1 ENSG00000112511 6 JAZF1 ENSG00000153814 7 SUZ12 ENSG00000178691 17 KIAA1618 ENSG00000180843 17 ALK ENSG00000171094 2 KIF5B ENSG00000170759 10 PDGFRA ENSG00000134853 4 KLK2 ENSG00000167751 19 ETV4 ENSG00000175832 17 KTN1 ENSG00000126777 14 RET ENSG00000165731 10 LCP1 ENSG00000136167 13 BCL6 ENSG00000113916 3 LIFR ENSG00000113594 5 PLAG1 ENSG00000181690 8 LOC204010 ENSG00000205171 12 EBF1 ENSG00000164330 5 MALAT1 TFEB ENSG00000112561 6

113 Appendix III

MEF2D ENSG00000116604 1 DAZAP1 ENSG00000071626 19 MIPOL1 ENSG00000151338 14 ETV1 ENSG00000006468 7 MLL ENSG00000118058 11 ABI1 ENSG00000136754 10 MLL ENSG00000118058 11 ACACA ENSG00000132142 17 MLL ENSG00000118058 11 AFF1 ENSG00000172493 4 MLL ENSG00000118058 11 AFF3 ENSG00000144218 2 MLL ENSG00000118058 11 AFF4 ENSG00000072364 5 MLL ENSG00000118058 11 ARHGAP26 ENSG00000145819 5 MLL ENSG00000118058 11 ARHGEF12 ENSG00000196914 11 MLL ENSG00000118058 11 CASC5 ENSG00000137812 15 MLL ENSG00000118058 11 CBL ENSG00000110395 11 MLL ENSG00000118058 11 CCDC94 ENSG00000105248 19 MLL ENSG00000118058 11 CELSR3 ENSG00000008300 3 MLL ENSG00000118058 11 CENPK ENSG00000123219 5 MLL ENSG00000118058 11 CIP29 ENSG00000205323 12 MLL ENSG00000118058 11 CLP1 ENSG00000172409 11 MLL ENSG00000118058 11 CREBBP ENSG00000005339 16 MLL ENSG00000118058 11 CXXC6 ENSG00000138336 10 MLL ENSG00000118058 11 DAB2IP ENSG00000136848 9 MLL ENSG00000118058 11 EEFSEC ENSG00000132394 3 MLL ENSG00000118058 11 ELL ENSG00000105656 19 MLL ENSG00000118058 11 EP300 ENSG00000100393 22 MLL ENSG00000118058 11 EPS15 ENSG00000085832 1 MLL ENSG00000118058 11 FNBP1 ENSG00000187239 9 MLL ENSG00000118058 11 FOXO3A ENSG00000118689 6 MLL ENSG00000118058 11 FOXO4 ENSG00000184481 X MLL ENSG00000118058 11 FRYL ENSG00000075539 4 MLL ENSG00000118058 11 GAS7 ENSG00000007237 17 MLL ENSG00000118058 11 GMPS ENSG00000163655 3 MLL ENSG00000118058 11 GPHN ENSG00000171723 14 MLL ENSG00000118058 11 LASP1 ENSG00000002834 17 MLL ENSG00000118058 11 LPP ENSG00000145012 3 MLL ENSG00000118058 11 MAML2 ENSG00000184384 11 MLL ENSG00000118058 11 MAPRE1 ENSG00000101367 20 MLL ENSG00000118058 11 MLL ENSG00000118058 11 MLL ENSG00000118058 11 MLLT1 ENSG00000130382 19 MLL ENSG00000118058 11 MLLT10 ENSG00000078403 10 MLL ENSG00000118058 11 MLLT11 ENSG00000143443 1 MLL ENSG00000118058 11 MLLT3 ENSG00000171843 9 MLL ENSG00000118058 11 MLLT4 ENSG00000130396 6 MLL ENSG00000118058 11 MLLT6 ENSG00000108292 17 MLL ENSG00000118058 11 MYO1F ENSG00000142347 19 
MLL ENSG00000118058 11 NCKIPSD ENSG00000213672 3 MLL ENSG00000118058 11 PICALM ENSG00000073921 11 MLL ENSG00000118058 11 RARA ENSG00000131759 17 MLL ENSG00000118058 11 RUNX1T1 ENSG00000079102 8 MLL ENSG00000118058 11 SEPT11 ENSG00000138758 4 MLL ENSG00000118058 11 SEPT2 ENSG00000168385 2 MLL ENSG00000118058 11 SEPT5 ENSG00000184702 22 MLL ENSG00000118058 11 SEPT6 ENSG00000125354 X MLL ENSG00000118058 11 SEPT9 ENSG00000184640 17 MLL ENSG00000118058 11 SH3GL1 ENSG00000141985 19 MLL ENSG00000118058 11 SMAP1 ENSG00000112305 6 MLL ENSG00000118058 11 SORBS2 ENSG00000154556 4

114 Appendix III

MLL ENSG00000118058 11 TIRAP ENSG00000150455 11
MLL ENSG00000118058 11 ZFYVE19 ENSG00000166140 15
MLLT10 ENSG00000078403 10 CLP1 ENSG00000172409 11
MNX1 ENSG00000130675 7 MYB ENSG00000118513 6
MSI2 ENSG00000153944 17 HOXA9 ENSG00000078399 7
MSN ENSG00000147065 X ALK ENSG00000171094 2
MYC ENSG00000136997 8 BCL7A ENSG00000110987 12
MYC ENSG00000136997 8 BTG1 ENSG00000133639 12
MYH9 ENSG00000100345 22 ALK ENSG00000171094 2
MYO18A ENSG00000196535 17 FGFR1 ENSG00000077782 8
MYST3 ENSG00000083168 8 ASXL2 ENSG00000143970 2
MYST3 ENSG00000083168 8 CREBBP ENSG00000005339 16
MYST3 ENSG00000083168 8 EP300 ENSG00000100393 22
MYST3 ENSG00000083168 8 NCOA2 ENSG00000140396 8
MYST3 ENSG00000083168 8 NCOA3 ENSG00000124151 20
MYST4 ENSG00000156650 10 CREBBP ENSG00000005339 16
NACA ENSG00000196531 12 BCL6 ENSG00000113916 3
NCOA4 ENSG00000138293 10 RET ENSG00000165731 10
NDE1 ENSG00000072864 16 PDGFRB ENSG00000113721 5
NFKB2 ENSG00000077150 10 TBXAS1 ENSG00000059377 7
NIN ENSG00000100503 14 PDGFRB ENSG00000113721 5
NOL1 ENSG00000111641 12 TCF3 ENSG00000071564 19
NONO ENSG00000147140 X TFE3 ENSG00000068323 X
NPM1 ENSG00000181163 5 ALK ENSG00000171094 2
NPM1 ENSG00000181163 5 MLF1 ENSG00000178053 3
NPM1 ENSG00000181163 5 RARA ENSG00000131759 17
NUMA1 ENSG00000137497 11 RARA ENSG00000131759 17
NUP214 ENSG00000126883 9 ABL1 ENSG00000097007 9
NUP214 ENSG00000126883 9 DEK ENSG00000124795 6
NUP214 ENSG00000126883 9 SET ENSG00000119335 9
NUP98 ENSG00000110713 11 ADD3 ENSG00000148700 10
NUP98 ENSG00000110713 11 CCDC28A ENSG00000024862 6
NUP98 ENSG00000110713 11 DDX10 ENSG00000178105 11
NUP98 ENSG00000110713 11 HHEX ENSG00000152804 10
NUP98 ENSG00000110713 11 HOXA11 ENSG00000005073 7
NUP98 ENSG00000110713 11 HOXA13 ENSG00000106031 7
NUP98 ENSG00000110713 11 HOXA9 ENSG00000078399 7
NUP98 ENSG00000110713 11 HOXC11 ENSG00000123388 12
NUP98 ENSG00000110713 11 HOXC13 ENSG00000123364 12
NUP98 ENSG00000110713 11 HOXD11 ENSG00000128713 2
NUP98 ENSG00000110713 11 HOXD13 ENSG00000128714 2
NUP98 ENSG00000110713 11 IQCG ENSG00000114473 3
NUP98 ENSG00000110713 11 JARID1A ENSG00000073614 12
NUP98 ENSG00000110713 11 NSD1 ENSG00000165671 5
NUP98 ENSG00000110713 11 PHF23 ENSG00000040633 17
NUP98 ENSG00000110713 11 PRRX1 ENSG00000116132 1
NUP98 ENSG00000110713 11 PRRX2 ENSG00000167157 9
NUP98 ENSG00000110713 11 PSIP1 ENSG00000164985 9
NUP98 ENSG00000110713 11 RAP1GDS1 ENSG00000138698 4
NUP98 ENSG00000110713 11 SET ENSG00000119335 9
NUP98 ENSG00000110713 11 TOP1 ENSG00000198900 20
NUP98 ENSG00000110713 11 TOP2B ENSG00000077097 3
NUP98 ENSG00000110713 11 WHSC1L1 ENSG00000147548 8
NUT ENSG00000184507 15 BRD4 ENSG00000141867 19

OMD ENSG00000127083 9 USP6 ENSG00000129204 17
PAX3 ENSG00000135903 2 FOXO1 ENSG00000150907 13
PAX3 ENSG00000135903 2 FOXO4 ENSG00000184481 X
PAX3 ENSG00000135903 2 NCOA1 ENSG00000084676 2
PAX5 ENSG00000196092 9 ETV6 ENSG00000139083 12
PAX5 ENSG00000196092 9 PML ENSG00000140464 15
PAX7 ENSG00000009709 1 FOXO1 ENSG00000150907 13
PAX8 ENSG00000125618 2 PPARG ENSG00000132170 3
PCM1 ENSG00000078674 8 JAK2 ENSG00000096968 9
PCM1 ENSG00000078674 8 RET ENSG00000165731 10
PDE4DIP ENSG00000178104 1 PDGFRB ENSG00000113721 5
PICALM ENSG00000073921 11 MLLT10 ENSG00000078403 10
PIM1 ENSG00000137193 6 BCL6 ENSG00000113916 3
PML ENSG00000140464 15 RARA ENSG00000131759 17
POU2AF1 ENSG00000110777 11 BCL6 ENSG00000113916 3
PRCC ENSG00000143294 1 TFE3 ENSG00000068323 X
PRDM16 ENSG00000142611 1 EVI1 ENSG00000085276 3
PRKAR1A ENSG00000108946 17 RET ENSG00000165731 10
PRKG2 ENSG00000138669 4 PDGFRB ENSG00000113721 5
RABEP1 ENSG00000029725 17 PDGFRB ENSG00000113721 5
RANBP2 ENSG00000153201 2 ALK ENSG00000171094 2
RBM15 ENSG00000162775 1 MKL1 ENSG00000196588 22
RET ENSG00000165731 10 NTRK1 ENSG00000198400 1
RHOH ENSG00000168421 4 BCL6 ENSG00000113916 3
RLF ENSG00000117000 1 MYCL1 ENSG00000116990 1
RNF139 ENSG00000170881 8 FHIT ENSG00000189283 3
RPN1 ENSG00000163902 3 EVI1 ENSG00000085276 3
RUNX1 ENSG00000159216 21 CBFA2T3 ENSG00000129993 16
RUNX1 ENSG00000159216 21 CPNE8 ENSG00000139117 12
RUNX1 ENSG00000159216 21 EVI1 ENSG00000085276 3
RUNX1 ENSG00000159216 21 MDS1 ENSG00000206115 3
RUNX1 ENSG00000159216 21 NLRP2 ENSG00000022556 19
RUNX1 ENSG00000159216 21 PRDM16 ENSG00000142611 1
RUNX1 ENSG00000159216 21 PRDX4 ENSG00000123131 X
RUNX1 ENSG00000159216 21 RPL22 ENSG00000116251 1
RUNX1 ENSG00000159216 21 RUNX1T1 ENSG00000079102 8
RUNX1 ENSG00000159216 21 SH3D19 ENSG00000109686 4
RUNX1 ENSG00000159216 21 USP25 ENSG00000155313 21
RUNX1 ENSG00000159216 21 USP42 ENSG00000106346 7
RUNX1 ENSG00000159216 21 YTHDF2 ENSG00000198492 1
RUNX1 ENSG00000159216 21 ZFPM2 ENSG00000169946 8
RUNX1 ENSG00000159216 21 ZNF687 ENSG00000143373 1
SEC31A ENSG00000138674 4 ALK ENSG00000171094 2
SENP6 ENSG00000112701 6 TCBA1 ENSG00000188580 6
SFPQ ENSG00000116560 1 ABL1 ENSG00000097007 9
SFPQ ENSG00000116560 1 TFE3 ENSG00000068323 X
SFRS3 ENSG00000112081 6 BCL6 ENSG00000113916 3
SLC45A3 ENSG00000158715 1 ERG ENSG00000157554 21
SLC45A3 ENSG00000158715 1 ETV1 ENSG00000006468 7
SLC45A3 ENSG00000158715 1 ETV5 ENSG00000171656 3
SPECC1 ENSG00000128487 17 PDGFRB ENSG00000113721 5
SPTBN1 ENSG00000115306 2 PDGFRB ENSG00000113721 5
SS18 ENSG00000141380 18 SSX1 ENSG00000126752 X
SS18 ENSG00000141380 18 SSX2 ENSG00000187754 X

SS18 ENSG00000141380 18 SSX4 ENSG00000204645 X
SS18L1 ENSG00000184402 20 SSX1 ENSG00000126752 X
STAT5B ENSG00000173757 17 RARA ENSG00000131759 17
STRN ENSG00000115808 2 PDGFRA ENSG00000134853 4
TAF15 ENSG00000172660 17 CHN1 ENSG00000128656 2
TAF15 ENSG00000172660 17 NR4A3 ENSG00000119508 9
TAF15 ENSG00000172660 17 TEC ENSG00000135605 4
TAF15 ENSG00000172660 17 ZNF384 ENSG00000126746 12
TAL1 ENSG00000162367 1 STIL ENSG00000123473 1
TCBA1 ENSG00000188580 6 ETV6 ENSG00000139083 12
TCEA1 ENSG00000187735 8 PLAG1 ENSG00000181690 8
TCF12 ENSG00000140262 15 NR4A3 ENSG00000119508 9
TCF12 ENSG00000140262 15 TEC ENSG00000135605 4
TCF3 ENSG00000071564 19 HLF ENSG00000108924 17
TCF3 ENSG00000071564 19 PBX1 ENSG00000185630 1
TCF3 ENSG00000071564 19 TFPT ENSG00000105619 19
TCF3 ENSG00000071564 19 TFPT ENSG00000105619 19
TCF3 ENSG00000071564 19 ZNF384 ENSG00000126746 12
TFE3 ENSG00000068323 X TFEB ENSG00000112561 6
TFG ENSG00000114354 3 ALK ENSG00000171094 2
TFG ENSG00000114354 3 NR4A3 ENSG00000119508 9
TFG ENSG00000114354 3 NTRK1 ENSG00000198400 1
TFRC ENSG00000072274 3 BCL6 ENSG00000113916 3
THRAP3 ENSG00000054118 1 USP6 ENSG00000129204 17
TMOD3 ENSG00000138594 15 MCF2 ENSG00000101977 X
TMPRSS2 ENSG00000184012 21 ERG ENSG00000157554 21
TMPRSS2 ENSG00000184012 21 ETV1 ENSG00000006468 7
TMPRSS2 ENSG00000184012 21 ETV4 ENSG00000175832 17
TMPRSS2 ENSG00000184012 21 ETV5 ENSG00000171656 3
TP53BP1 ENSG00000067369 15 PDGFRB ENSG00000113721 5
TPM3 ENSG00000143549 1 ALK ENSG00000171094 2
TPM3 ENSG00000143549 1 NTRK1 ENSG00000198400 1
TPM3 ENSG00000143549 1 PDGFRB ENSG00000113721 5
TPM3 ENSG00000143549 1 TPR ENSG00000047410 1
TPM4 ENSG00000167460 19 ALK ENSG00000171094 2
TPR ENSG00000047410 1 MET ENSG00000105976 7
TPR ENSG00000047410 1 NTRK1 ENSG00000198400 1
TRIM24 ENSG00000122779 7 FGFR1 ENSG00000077782 8
TRIM24 ENSG00000122779 7 RARA ENSG00000131759 17
TRIM24 ENSG00000122779 7 RET ENSG00000165731 10
TRIM33 ENSG00000197323 1 RET ENSG00000165731 10
TRIP11 ENSG00000100815 14 PDGFRB ENSG00000113721 5
TTL ENSG00000114999 2 ETV6 ENSG00000139083 12
ZBTB16 ENSG00000109906 11 RARA ENSG00000131759 17
ZMIZ1 ENSG00000108175 10 ABL1 ENSG00000097007 9
ZMYM2 ENSG00000121741 13 FGFR1 ENSG00000077782 8
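
For readers who want to work with the list of fusion gene pairs programmatically, the following is a minimal sketch. It assumes that every entry consists of six whitespace-separated fields (gene symbol, Ensembl gene ID and chromosome for each of the two partners), which is how the entries above appear to be laid out. The names `FusionPair` and `parse_fusion_pairs` are hypothetical and introduced only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class FusionPair(NamedTuple):
    """One entry of the fusion gene list (field names are illustrative)."""
    gene_a: str
    ensembl_a: str
    chrom_a: str
    gene_b: str
    ensembl_b: str
    chrom_b: str


def parse_fusion_pairs(text: str) -> list[FusionPair]:
    """Group a whitespace-separated token stream into six-field entries.

    Works whether the entries are on one line each or run together,
    because only the token order is used.
    """
    tokens = text.split()
    if len(tokens) % 6 != 0:
        raise ValueError("expected a multiple of six fields")
    return [FusionPair(*tokens[i:i + 6]) for i in range(0, len(tokens), 6)]


# Three entries copied from the list above.
sample = (
    "NUT ENSG00000184507 15 BRD4 ENSG00000141867 19 "
    "PML ENSG00000140464 15 RARA ENSG00000131759 17 "
    "NPM1 ENSG00000181163 5 RARA ENSG00000131759 17"
)
pairs = parse_fusion_pairs(sample)

# Count how often each gene symbol occurs as a fusion partner.
partner_counts = Counter(g for p in pairs for g in (p.gene_a, p.gene_b))
```

Counting partners in this way makes promiscuous genes in the list, such as NUP98 or RUNX1, easy to spot.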

Appendix IV – Abstracts of manuscripts

A universal assay for detection of oncogenic fusion transcripts by oligo microarray analysis

Rolf I. Skotheim1,2,*, Gard O. S. Thomassen1,2,3, Marthe Eken1,2,4, Guro E. Lind1,2, Francesca Micci5, Franclim R. Ribeiro1,2,6, Nuno Cerveira6, Manuel R. Teixeira2,6, Sverre Heim5,7, Torbjørn Rognes3,8, and Ragnhild A. Lothe1,2,4

1Department of Cancer Prevention, Institute for Cancer Research and 5Department of Cancer Genetics, The Norwegian Radium Hospital, Rikshospitalet University Hospital, Oslo, Norway. 2Centre for Cancer Biomedicine, 4Department of Molecular Biosciences, 7Medical Faculty, and 8Department of Informatics, University of Oslo, Oslo, Norway. 3Centre for Molecular Biology and Neuroscience, Institute of Medical Microbiology, Rikshospitalet University Hospital, Oslo, Norway. 6Department of Genetics, Portuguese Oncology Institute, Porto, Portugal.

Abstract

The ability to detect neoplasia-specific fusion genes is important not only in cancer research, but also increasingly in clinical settings, to ensure that the correct diagnosis is made and the optimal treatment is chosen. However, the available methodologies for detecting such fusions all have distinct shortcomings. Here, we describe a novel oligonucleotide microarray strategy whereby one can screen for all known oncogenic fusion transcripts in a single experiment. To accomplish this, we combine measurements of chimeric transcript junctions with exon-wise measurements of individual fusion partners. To demonstrate the usefulness of the approach, we designed a DNA microarray containing 68,861 oligonucleotide probes, including oligos covering all combinations of chimeric exon-exon junctions from 275 pairs of fusion genes, as well as sets of oligos internal to all the exons of the fusion partners. Using this array, proof of principle was demonstrated by detection of known fusion genes (such as TCF3:PBX1, ETV6:RUNX1, and TMPRSS2:ERG) from all six positive controls consisting of leukemia cell lines and prostate cancers. This new method challenges currently used diagnostic and research tools for the detection of fusion genes in neoplastic diseases.
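
The idea of covering all combinations of chimeric exon-exon junctions can be illustrated with a short sketch. It is not the probe design used for the array: the function name `junction_probes`, the `flank` length and the toy exon sequences are assumptions made for illustration only, and the abstract does not state probe lengths. The sketch assumes that a junction probe joins the last bases of an exon from one partner to the first bases of an exon from the other.

```python
from itertools import product


def junction_probes(upstream_exons, downstream_exons, flank=15):
    """Return one junction sequence per upstream/downstream exon combination.

    Each sequence joins the last `flank` bases of an upstream exon to the
    first `flank` bases of a downstream exon. Exons shorter than `flank`
    contribute their full sequence.
    """
    probes = {}
    for (i, up), (j, down) in product(
        enumerate(upstream_exons, start=1),
        enumerate(downstream_exons, start=1),
    ):
        probes[(i, j)] = up[-flank:] + down[:flank]
    return probes


# Toy sequences, not real exons.
upstream = ["AAAACCCC", "GGGGTTTT"]
downstream = ["ACGTACGT", "TTTTAAAA", "CCCCGGGG"]
probes = junction_probes(upstream, downstream, flank=4)
```

With 2 upstream and 3 downstream exons the sketch yields 2 × 3 = 6 junction sequences, which shows why the number of junction probes grows with the product of the exon counts of the two fusion partners.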

Genomic aberrations associated with poor survival for patients with malignant peripheral nerve sheath tumors

Helge R. Brekke1,2, Franclim R. Ribeiro1,3, Trude H. Ågesen1,2, Marthe Eken1,2, Guro E. Lind1,2, Mette Eknæs1,2, Kirsten S. Hall4, Bodil Bjerkehagen5, Eva van den Berg6, Sigbjørn Smeland4, Manuel Teixeira3, Nils Mandahl8, Rolf I. Skotheim1,2, Fredrik Mertens7 and Ragnhild A. Lothe1,2,8

1Department of Cancer Prevention, Institute for Cancer Research, The Norwegian Radium Hospital, Rikshospitalet University Hospital, Montebello, Oslo, Norway. 2Center for Cancer Biomedicine, University of Oslo, Oslo, Norway. 3Department of Genetics, Portuguese Oncology Institute, Porto, Portugal. 4Division of Cancer Medicine and Radiotherapy, Department of Oncology, Norwegian Radium Hospital, Rikshospitalet University Hospital, Montebello, Oslo, Norway. 5Division of Pathology, The Norwegian Radium Hospital, Rikshospitalet University Hospital, Montebello, Oslo, Norway. 6Clinical Genetics, University Hospital of Groningen, the Netherlands. 7Department of Clinical Genetics, University Hospital, Lund, Sweden. 8Department of Molecular Biosciences, University of Oslo, Norway.

Abstract

Purpose. Malignant peripheral nerve sheath tumors (MPNSTs) are rare neoplasias often associated with a poor clinical outcome. Due to the limited number of reported cases, it is unclear which genetic aberrations contribute to the initiation, progression and clinical aggressiveness of these tumors, and whether MPNST pathogenesis is similar in patients with and without neurofibromatosis type 1 (NF1).

Patients and methods. Fresh-frozen samples from 48 MPNSTs, 9 cutaneous neurofibromas and 1 plexiform neurofibroma were collected from 51 patients with (n=31) and without (n=20) NF1 history. Chromosomal and array-based comparative genomic hybridization were performed to assess DNA copy number changes. To better evaluate candidate target genes, we integrated DNA copy number changes with genome-wide expression data for a subset of 20 samples.

Results. Forty-four MPNSTs (92%) displayed DNA copy number changes. A small deletion at 9p was identified as the sole alteration in a plexiform neurofibroma. Most tumors presented complex profiles, and recurrent gains at 17q (69%), 8q (65%) and 7p (56%), and losses at 9p (46%), 11q (46%) and 17p (42%) were observed. No significant differences were found in the genomic profiles of sporadic versus NF1-associated MPNSTs. Several genomic changes showed prognostic significance independently of clinical variables or patient group. In particular, patients whose tumors displayed concurrent losses at chromosomal regions 10q and Xq had a significantly worse prognosis (P=0.0005). Several genes whose expression was affected by DNA copy number aberrations at these regions are highlighted.

Conclusions. The copy number profiles of sporadic and NF1-associated MPNSTs indicate a similar pathogenetic origin. Whereas the complexity of the findings prevents us from defining the genetic events leading to this carcinogenic process, the simultaneous occurrence of specific genetic aberrations was strongly associated with poor survival independently of patient group and standard clinical variables.