Retrotransposition in colorectal cancer

Tatiana Cajuso Pons, M.Sc.

Applied Tumor Genomics Research Program Department of Medical and Clinical Genetics Faculty of Medicine University of Helsinki Finland

DOCTORAL DISSERTATION

To be presented for public examination with the permission of the Faculty of Medicine of the University of Helsinki, in Lecture Hall 2, Haartman Institute, on the 27th of November, 2020 at 15 o’clock.

Helsinki 2020

Supervised by Academy Professor Lauri A. Aaltonen, M.D., Ph.D. Applied Tumor Genomics Research Program Department of Medical and Clinical Genetics Faculty of Medicine University of Helsinki, Finland

Docent Outi Kilpivaara, Ph.D. Applied Tumor Genomics Research Program Department of Medical and Clinical Genetics Faculty of Medicine University of Helsinki, Finland

Kimmo Palin, Ph.D. Applied Tumor Genomics Research Program Department of Medical and Clinical Genetics Faculty of Medicine University of Helsinki, Finland

Reviewed by Adjunct professor Csilla Sipeky, Ph.D. Institute of Biomedicine Cancer Research Laboratories Western Cancer Centre FICAN West University of Turku, Finland

Adjunct professor Kaisa Huhtinen, Ph.D. Institute of Biomedicine Cancer Research Laboratories Western Cancer Centre FICAN West University of Turku, Finland

Official Opponent Kathleen H. Burns, M.D., Ph.D. Chair, Dana-Farber Cancer Institute, Department of Oncologic Pathology Professor of Pathology, Harvard Medical School Boston, MA USA

Doctoral Programme in Biomedicine ISBN 978-951-51-6701-9 (paperback) ISBN 978-951-51-6702-6 (PDF) Unigrafia Oy Helsinki 2020 The Faculty of Medicine uses the Urkund system (plagiarism recognition) to examine all doctoral dissertations.

1

“If you know you are on the right track, if you have this inner knowledge, then nobody can turn you off... no matter what they say.”

- Barbara McClintock

To my family,

2

TABLE OF CONTENTS

LIST OF ORIGINAL PUBLICATIONS 5 ABBREVIATIONS 6 1. List of 7 ABSTRACT 9 REVIEW OF THE LITERATURE 10 1. Cancer 10 1.1. Genomic instability 11 1.2. Oncogenes 13 1.3. Tumor suppressor genes 13 1.4. Epigenetics 14 2. Colorectal cancer 15 2.1. Adenoma-to- carcinoma sequence 15 2.2. Molecular subtypes 16 3. Transposable elements 17 3.1. DNA transposons 18 3.2. Retrotransposons 19 3.2.1. LTR retrotransposons 19 3.2.2. Non-LTR retrotransposons 19 3.2.2.1. LINE -1 19 3.2.2.2. SINE 20 Alu 20 SVA 20 3.2.3. LINE-1 cycle 20 4. Retrotransposition in 22 5. Retrotransposition in cancer 23 AIMS OF THE STUDY 27 MATERIALS AND METHODS 28 1. Sample collection (Study I - Study III) 28 2. DNA extraction (Study I - Study III) 28 3. RNA extraction (Study III) 28 4. Whole genome sequencing (Study I - Study III) 29 4.1. Structural variation calls (Study I - Study III) 29 4.2. Retrotransposon insertion calls (Study III) 29 4.2.1. Detection of solo retrotransposon insertions 29 4.2.2. Detection of LINE -1 transductions 29

3

5. Nanopore sequencing (Study II) 30 6. SNP arrays (Study I and Study III) 30 7. Methylation assay (Study III) 30 8. RNA sequencing (Study III) 31 9. Long-distance inverse-PCR (LDI-PCR) (Study II) 31 10. Sanger validation (Study I and Study II) 31 RESULTS 33 1. Identification of frequent insertions from a LINE-1 in TTC28 33 1.2. Germline insertions in the Finnish population 34 2. Detection of clonal and subclonal transductions from the LINE-1 in TTC28 35 2.1. Analysis of the insertion sequence 37 2.2. Comparison of Nanopore and WGS local assembly 39 3. Characterization of somatic retrotransposition in 202 colorectal tumors 39 3.1. Insertions in -coding genes 40 3.1.1. Insertions in common fragile site genes 41 3.1.2. Insertions in known cancer genes 41 3.2. Hot reference LINE-1s 43 3.3. Retrotransposition and molecular characteristics 43 3.4. Retrotransposition and disease-specific survival 44 DISCUSSION 46 1. LINE-1 in TTC28 46 1.1. Accurate interpretation of WGS data 46 1.2. Detection of subclonal insertions 47 1.3. Elucidation of insertion characteristics 47 2. Genomic characterization of retrotransposition 48 2.1. Recurrent insertions in protein -coding genes 48 2.2. Retrotransposition as a source of fragility 49 2.3. Retrotransposition as early tumor initiating events 49 3. Retrotransposition and molecular and clinical characteristics 50 3.1. Retrotransposition correlates with CIMP and AI 50 3.2. Retrotransposition correlates with poor CRC survival 51 CONCLUSIONS AND FUTURE PROSPECTS 52 ACKNOWLEDGEMENTS 55 REFERENCES 57

4

LIST OF ORIGINAL PUBLICATIONS

This thesis is based on the following publications which are referred to in the text as Study I, Study II, and Study III.

I. Pitkänen, E., Cajuso T., Katainen, R., Kaasinen, E., Välimäki, N., Palin, K., Taipale, J., Aaltonen, L.A. & Kilpivaara, O. Frequent L1 retrotranspositions originating from TTC28 in colorectal cancer. Oncotarget. 5, 853-859 (2014).

II. Pradhan, B.*, Cajuso, T.*, Katainen, R., Sulo, P., Tanskanen, T., Kilpivaara, O., Pitkänen, E., Aaltonen, L.A., Kauppi, L. & Palin, K. Detection of subclonal L1 transductions in colorectal cancer by long- distance inverse-PCR and Nanopore sequencing. Scientific Reports. 7, 14521 (2017).

III. Cajuso, T., Sulo, P., Tanskanen, T., Katainen, R., Taira, A., Hänninen, U.A., Kondelin, J., Forsström, L., Välimäki, N., Aavikko, M., Kaasinen, E., Ristimäki, A., Koskensalo, S., Lepistö, A., Renkonen-Sinisalo, L., Seppälä, T., Kuopio, T., Böhm, J., Mecklin, J.P., Kilpivaara, O., Pitkänen, E., Palin, K. & Aaltonen, L.A. Retrotransposon insertions can initiate colorectal cancer and are associated with poor survival. Nature Communications. 10, 4022 (2019).

* Equal contribution

The publications are reproduced under a CC BY license.

5

ABBREVIATIONS

AI allelic imbalance bp BWA Burrows-Wheeler Aligner cDNA complementary deoxyribonucleic acid CIMP CpG island methylator phenotype CIN chromosomal instability CRC colorectal cancer DNA deoxyribonucleic acid EN endonuclease HERV endogenous retrovirus Indels small insertions and deletions LDI-PCR long-distance inverse PCR LINE-1 long interspersed element-1 LINE-1HS human specific long interspersed element-1 LTR long terminal repeat MAPK mitogen-activated protein kinases miRNA microRNA MMR mismatch repair mRNA messenger RNA MSI microsatellite instability MSS microsatellite stable ORF open reading frame PCAWG Pan-Cancer Analysis of Whole Genomes PCR polymerase chain reaction RNA ribonucleic acid RNP ribonucleoprotein RT reverse transcriptase SINE short interspersed element SNP single nucleotide polymorphism SNV single nucleotide variant SVA SINE-VNTR-alu TE transposable element TPRT target primed reverse transcription TraFiC Transposon Finder in Cancer TSD target site duplication

6

TSG tumor suppressor TSM target site modification UTR untranslated region VNTR variable number tandem repeat WGS whole genome sequencing

1. List of genes

AFF3 AF4/FMR2 family member 3 ALK ALK receptor tyrosine kinase APC APC regulator of WNT signaling pathway ARAP2 ArfGAP with RhoGAP domain, ankyrin repeat and PH domain 2 ARHGEF4 Rho guanine nucleotide exchange factor 4 BRAF B-Raf proto-oncogene, serine/threonine kinase CDKN1B cyclin dependent kinase inhibitor 1B COL25A1 collagen type XXV alpha 1 chain DLG2 discs large MAGUK scaffold protein 2 ERBB4 erb-b2 receptor tyrosine kinase 4 F8 coagulation factor VIII FAT4 FAT atypical cadherin 4 FHIT fragile histidine triad diadenosine triphosphatase FLI1 Fli-1 proto-oncogene, ETS transcription factor GABRA4 gamma-aminobutyric acid type A receptor subunit alpha4 GRIN2A glutamate ionotropic receptor NMDA type subunit 2A IKZF1 IKAROS family zinc finger 1 KRAS KRAS proto-oncogene LRP1B LDL receptor related protein 1B MECOM MDS1 and EVI1 complex locus MLH1 mutL homolog 1 MYC MYC proto-oncogene, bHLH transcription factor NOVA1 NOVA alternative splicing regulator 1 NRG1 neuregulin 1 POLE DNA Polymerase Epsilon Catalytic Subunit PREX2 phosphatidylinositol-3,4,5-trisphosphate dependent Rac exchange factor2 PTEN phosphatase and tensin homolog PTPRD protein tyrosine phosphatase receptor type D PTPRT protein tyrosine phosphatase receptor type T

7

RUNX1T1 RUNX1 partner transcriptional co-repressor 1 SGIP1 SH3GL interacting endocytic adaptor 1 ST18 ST18 C2H2C-type zinc finger transcription factor TP53 tumor protein p53 TTC28 tetratricopeptide repeat domain 28 ZNF251 zinc finger protein 251

8

ABSTRACT

Colorectal cancer (CRC) is the third most common cancer type and a major cause for cancer deaths. Retrotransposons are transposable elements (TE) able to mobilize and insert via an RNA intermediate. Although germline retrotransposition can contribute to genetic diversity, uncontrolled retrotransposition can lead to genomic instability. Retrotransposons are normally repressed by promoter methylation however, they become highly active in many cancers, such as CRC. Since retrotransposons are difficult to detect, their role in tumorigenesis has remained considerably unexplored.

The main goal of this thesis project was to further understand the impact of retrotransposition in colorectal tumorigenesis utilizing whole genome sequencing (WGS) and Nanopore sequencing. In study I, we detected an active reference long interspersed element-1 (LINE-1) located in the first intron of TTC28. This active LINE-1 led to the most frequent somatic structural rearrangement in our dataset, with a total of 83 somatic retrotranspositions in 92 CRCs. In study II, we applied long-distance inverse-PCR (LDI-PCR) with Nanopore sequencing to detect retrotranspositions arising from the active LINE-1 identified in study I. We identified 25 subclonal insertions in addition to 14 insertions previously detected by WGS, indicating active retrotransposition during the tumorigenic process. In study III, we characterized somatic retrotransposon insertions in 202 colorectal tumor whole genomes. Among recurrent insertions in fragile sites and cancer genes, we identified two insertions in exon 16 of APC, suggesting that retrotransposon insertions can contribute to tumor initiation. Furthermore, the number of somatic insertions correlated with the CpG island methylator phenotype (CIMP), the genomic fraction of allelic imbalance (AI), and poor CRC survival.

These findings suggest that the clinical impact of retrotransposition in CRC might be more important than previously acknowledged, although several questions remain unanswered. Future work should shed light on the timing and novel mechanisms by which retrotransposition could influence tumorigenesis. The better understanding of the role of both somatic and germline retrotransposition could provide tools for patient stratification and cancer prevention.

9

REVIEW OF THE LITERATURE

1. Cancer

Cancer can be defined as a group of heterogeneous diseases characterized by genetic changes that lead to uncontrolled cell proliferation. Genetic changes, or mutations, occur frequently in human cells. However, intrinsic mutational processes and environmental or lifestyle exposures can increase mutation rate (Stratton et al. 2009; Tomasetti and Vogelstein 2015). Although DNA repair can fix such changes, some are not repaired, and can be incorporated in the DNA of subsequent cell generations.

Germline mutations are genetic changes that occur in the reproductive cells and are transmitted to all cells of the offspring. On the other hand, somatic mutations are genetic changes that occur after conception, they are present in a subset of cells, and therefore, cannot be inherited.

The tumorigenic process starts with a mutation in a normal cell that confers a selective growth advantage. Subsequently, additional somatic mutations occur enabling uncontrolled cell proliferation leading to tumor progression (Nowell 1976). Tumors undergo a purification process, similar to natural selection, where changes that provide a growth advantage are selected further in a process known as clonal expansion (Nowell 1976). Clonal expansion is characterized by the outgrowth of cells with a growth advantage over other cells. Mutations that confer a growth advantage are called driver mutations, whereas mutations that do not confer any growth advantage are known as passenger mutations (Figure 1).

10

Figure 1. Evolution from a normal cell to a cancer cell over a lifetime. A normal cell (fertilized egg) starts accumulating passenger and driver mutations over time, eventually leading to cancer transformation. Intrinsical mutational processes and environmental and lifestyle exposures, such as DNA repair defects and tobacco smoke, contribute to the acquisition of more mutations. Figure adapted from Stratton et al. (Stratton et al. 2009).

In 2011, Hanahan and Weinberg defined cancer hallmarks as the abilities shared by the majority of cancer cells that are acquired throughout the tumorigenic process and allow cancer transformation. Cancer hallmarks include; sustained proliferation signal, evasion of growth suppressors, resistance to cell death, replicative immortality, angiogenesis, and invasion and metastasis. Two emerging hallmarks were proposed, reprogramming of metabolism and evasion of the immune response. In addition, Hanahan and Weinberg defined the two enabling characteristics as the processes that facilitate cancers to acquire the above mentioned cancer hallmarks. The enabling characteristics include, tumor- promoting inflammation and genomic instability (Hanahan and Weinberg 2011) .

1.1. Genomic instability

Genomic instability is characterized by a high rate of genetic changes within the genome of a cell. Genetic changes range from small changes in the DNA sequence to larger changes involving . Small DNA changes include single nucleotide variants (SNVs) which can involve the deletion/addition/substitution of one base pair (bp), and insertion/deletion of a few base pairs (indels). Among DNA changes in protein-coding regions, changes can be synonymous when they do not result in an amino acid change, and nonsynonymous when they result in an amino acid change. Large genetic changes include aneuploidies and chromosomal rearrangements. Aneuploidies can involve the amplification or loss of one or more chromosomes. Chromosomal

11

rearrangements involve fragments of chromosomes that are in the wrong orientation (inversion) or in the wrong location (translocations and insertions). Other chromosomal rearrangements lead to loss or gain of fragments of chromosomes resulting in deletions and duplications or amplifications (Figure 2). Another type of chromosomal rearrangements are insertions arising from active retrotransposons (Figure 2). Retrotransposons are DNA sequences that become active in many cancers, and they can contribute to genomic instability by copying and inserting into different genomic locations.

Figure 2. Chromosomal rearrangements. Common types of intra and interchromosomal rearrangements. Intrachromosomal rearrangements are changes that involve one such as deletions, duplications, insertions and inversions. Interchromosomal rearrangements are changes that can involve two different chromosomes such as translocations and retrotranspositions.

As mentioned before, mutations in cancer genomes can be classified based on their impact on tumor development. Driver mutations are changes that have conferred growth advantage at any time during tumor development and therefore are selected further. Passenger mutations, on the other hand, are genetic changes that do not confer any selective advantage. The majority of SNVs, indels and chromosomal rearrangements are passengers of the tumorigenic process, and breakpoints of chromosomal rearrangements often fall in regions of the genome where there are no genes (gene desserts) or adjacent to fragile sites (Vogelstein et al. 2013). Fragile sites are genomic sites prone to DNA breakage under stress conditions, such as replicative stress (Glover et al. 1984; Sutherland

12

et al. 1998; Schwartz et al. 2006). On the other hand, genes with recurrent driver mutations are known as driver genes. Driver genes can be classified into oncogenes and tumor suppressor genes based on the effect on tumor development.

1.2. Oncogenes

Proto-oncogenes are genes that accumulate gain-of-function mutations that lead to oncogene transformation and promote tumor growth. Proto-oncogenes usually encode transcription factors, growth factors and their receptors, signal transducers, chromatin remodelers and apoptotic regulators that are involved in cell proliferation and apoptosis (Croce 2008).

The most common oncogenic changes include SNVs at specific hotspots, gene amplifications and translocations that result in their overexpression or activation (Vogelstein et al. 2013; Zehir et al. 2017). These changes are usually dominant, meaning that only one hit is enough to promote tumorigenesis (Vogelstein et al. 2013).

Examples of proto-oncogenes include KRAS, MYC and RET. KRAS activation mostly occurs by the accumulation of SNVs at specific hotspots, whereas MYC overexpression and activation results from gene amplifications and translocations (Croce 2008; Zehir et al. 2017). Although germline gain-of-function mutations in proto-oncogenes are rarely observed, germline mutations in RET oncogene have been reported to cause hereditary multiple endocrine neoplasia type 2 (Mulligan et al. 1993; Hofstra et al. 1994).

1.3. Tumor suppressor genes

Tumor suppressor genes (TSGs) promote tumorigenesis by the accumulation of loss-of-function mutations that lead to reduced activity or complete loss-of- function. Based on their function, TSGs can be classified as gatekeepers, caretakers or landscapers. Gatekeepers, such as APC, are involved in processes that limit cell growth or promote cell death preventing tumor initiation. Caretakers, such as MLH1, are involved in processes that maintain genome integrity such as DNA repair (Kinzler and Vogelstein 1997; Negrini et al. 2010). Landscapers, such as PTEN, are involved in processes that, when inactivated, lead to an abnormal tumor microenvironment (Giri and Ittmann 1999; Michor et al. 2004).

TSG mutations are often spread throughout the coding regions of the gene and they typically involve a combination of SNVs, indels, large deletions, promoter hypermethylation and mitotic recombination that results in loss of heterozygosity. These mutations are often recessive, meaning that two hits are required to promote tumor growth (Knudson 1971). However, in genes with haploinsufficiency

13

or a dominant negative effect, only one mutated allele is enough to promote tumorigenesis (Berger et al. 2011). In haploinsufficient genes, such as CDKN1B, the wildtype allele is not enough to maintain sufficient normal function (Le Toriellec et al. 2008) and in genes with dominant negative effect such as TP53 (Herskowitz 1987; Kraiss et al. 1988), the mutated protein interferes with the function of the normal protein.

Germline mutations in tumor suppressor genes, such as MLH1, increase cancer risk since only one additional somatic change is needed for biallelic inactivation (Peltomaki et al. 1993).

1.4. Epigenetics

In addition to genetic changes, cancer cells accumulate epigenetic modifications. Epigenetic modifications are changes that affect the chemical structure of DNA but do not alter the DNA sequence. They can involve DNA methylation, histone modifications, changes in nucleosome position and micro-ribonucleic acid (miRNA) expression changes (Sharma et al. 2010).

DNA methylation consists of the covalent addition of a methyl group, mostly at cytosine residues that are clustered in CpG islands. CpG islands are regions of the genome with high content of cytosine residues followed by guanine residues. DNA methylation is associated with the repression of gene transcription, inactivation of X-chromosome, repression of transposable elements, genomic imprinting, aging and carcinogenesis (Lyon 1961; Li et al. 1993; Beard et al. 1995; Yoder et al. 1997; Daura-Oller et al. 2009; Zemach et al. 2010; Gonzalo 2010; Wang and Lei 2018).

Histone modifications involve the addition of acetyl and methyl groups to histones, leading to chromatin changes, and therefore gene regulation (Sharma et al. 2010).

Nucleosomes are the main structural unit of DNA packaging. A nucleosome consists of a DNA fragment wrapped around eight histones. Changes in nucleosome position can alter DNA accessibility and interfere with transcription factor binding (Shivaswamy et al. 2008).

MiRNAs are short RNAs (~21-23 nucleotides) that target messenger RNAs (mRNAs) and lead to RNA degradation or block gene translation (Ambros 2004; Zeng and Cullen).

Epigenetic modifications can promote tumorigenesis by, for example, acting as the second hit in tumor suppressor gene inactivation.

14

2. Colorectal cancer

Colorectal cancer (CRC) is the third most common cancer in both women and men worldwide (https://www.who.int/news-room/fact-sheets/detail/cancer). Approximately 5% of CRC cases arise from known high-penetrance germline mutations (Rustgi 2007). However, approximately 20% of CRC cases occur in a family setting where a moderate risk of CRC is observed but with no clear pattern of genetic inheritance. These cases may arise from a combination of rare and low- penetrance mutations as well as environmental factors (Nagy et al. 2004; Rustgi 2007; Fletcher and Houlston 2010). On the other hand, the majority of CRC cases occur in a sporadic setting where a combination of genetic, environmental, lifestyle and clinical factors are likely to contribute to CRC development (Rawla et al. 2019). CRC is defined as the malignant growth of cells arising from the epithelium of the colon or rectum and have the ability to invade neighbouring tissues. The majority of sporadic colorectal cancers develop via the adenoma-to-carcinoma sequence.

2.1. Adenoma-to-carcinoma sequence

Adenomas are gastrointestinal lesions characterized by excessive cell proliferation. However, a fraction of adenomas can progress to malignant tumors by the sequential acquisition of key somatic changes and successive waves of clonal expansion (Fearon and Vogelstein 1990). Fearon and Vogelstein were the first to propose a model for CRC, in which they associated each mutation to a tumorigenic step. The most commonly mutated driver genes in CRC include APC, KRAS, and TP53.

APC is a tumor suppressor gene mutated in ~67% of colorectal adenocarcinomas (Ellrott et al. 2018; Gao et al. 2018; Hoadley et al. 2018; Liu et al. 2018; Sanchez- Vega et al. 2018; Taylor et al. 2018; Bhandari et al. 2019) and in ~77% of metastatic CRCs (Cerami et al. 2012; Gao et al. 2013; Yaeger et al. 2018). Most mutations in APC occur in the mutation cluster region and result in a truncated protein product (Mori et al. 1992; Polakis 1995; Fearon 2011). APC acts as an antagonist of the Wnt signalling pathway and its inactivation promotes cell fate determination, proliferation and migration. Mutations in APC have been reported in both adenomas and carcinomas, indicating that APC inactivation is implicated in the adenoma formation from normal epithelia (Figure 3) (Fearon 2011).

KRAS is an oncogene mutated in ~37% of colorectal adenocarcinomas (Ellrott et al. 2018; Gao et al. 2018; Hoadley et al. 2018; Liu et al. 2018; Sanchez-Vega et al. 2018; Taylor et al. 2018; Bhandari et al. 2019) and in ~45% of metastatic CRCs (Cerami et al. 2012; Gao et al. 2013; Yaeger et al. 2018). Most mutations in KRAS occur in hotspot regions located in codons 12, 13, 61, 117, and 146. KRAS mutation activates the RAS/MAPK pathway and promotes cell proliferation and

15

growth. KRAS mutations are associated with adenoma growth and tumors with both APC and KRAS mutations are more likely to progress to malignant lesions (Figure 3) (Fearon 2011).

TP53 is a tumor suppressor gene mutated in ~53% of colorectal adenocarcinomas (Ellrott et al. 2018; Gao et al. 2018; Hoadley et al. 2018; Liu et al. 2018; Sanchez- Vega et al. 2018; Taylor et al. 2018; Bhandari et al. 2019) and in ~74% of metastatic CRCs (Cerami et al. 2012; Gao et al. 2013; Yaeger et al. 2018). Mutations in TP53 are spread throughout the coding region. TP53 plays an im- portant role in the regulation and progression of the cell cycle upon the detection of sustained DNA damage. Biallelic inactivation of TP53 is associated with the transition from benign lesions to malignant lesions (Figure 3) (Fearon 2011).

Figure 3. Progression from normal epithelia to carcinoma in CRC. The most common mutated genes and their associated tumorigenic step are shown. Figure adapted from Vogelstein et al. (Vogelstein et al. 2013).

2.2. Molecular subtypes

CRCs can arise via two different molecular pathways, the chromosomal instability (CIN) pathway and the microsatellite instability (MSI) pathway.

Eighty-five percent of sporadic CRCs follow the CIN pathway (Boland et al. 2010). The CIN pathway is characterized by an increased rate of chromosomal rearrangements. The majority of CIN tumors follow the adenoma-to-carcinoma sequence proposed by Fearon and Vogelstein (Fearon and Vogelstein 1990). It has been proposed that CIN may arise from defective genes involved in several steps of cell division, however other genes may also contribute such as APC or TP53 (Fodde et al. 2001; Pino and Chung 2010).

On the other hand, 15% of sporadic CRCs arise from a defective mismatch repair machinery (MMR). The MMR deficient machinery leads to an increased number

16

of mutations, especially indels in repetitive areas such as microsatellites. Sporadic MSI CRCs are characterized by biallelic methylation of MLH1 promoter and BRAF mutations (Boland et al. 2010). MLH1 is a tumor suppressor gene involved in DNA mismatch repair. BRAF is an oncogene that activates the RAS/MAPK pathway promoting cell proliferation and growth (Fearon 2011). The majority of mutations in BRAF are V600E and they are usually mutually exclusive with KRAS mutations (Rajagopalan et al. 2002).

MSI tumors are associated with proximal tumor location, mucinous histology, poor differentiation, immune cell infiltration and lower risk of metastasis (Kim et al. 1994; Kakar et al. 2003; Malesci et al. 2007). In addition, seventy-five percent of MSI-positive sporadic CRCs are also characterized by a CpG island methylator phenotype (CIMP), which refers to genome-wide gene promoter hypermethylation (Toyota et al. 1999).

3. Transposable elements

Transposable elements (TEs) are DNA sequences able to mobilize within genomes by a mechanism known as transposition. They were discovered by Barbara McClintock while she was studying the color patterns of maize (Mcclintock 1956).

At present, they can be found in a wide range of living organisms. They com- prise almost half of the however, several elements are no longer transposition competent (Jurka 2000; Consortium and International Human Genome Sequencing Consortium 2001). Based on their mobilization intermedi- ate, they are classified into two classes; DNA transposons and Retrotransposons (Figure 4).

17

Figure 4. Transposable elements in the human genome. Classification of transposable elements based on their mobilization intermediate and sequence similarities. TE class and the example structure and the percentage of the human genome are shown (Consortium and International Human Genome Sequencing Consortium 2001). Figure modified from Kazazian et al. (Kazazian and Moran 2017).

3.1. DNA transposons

DNA transposons utilize DNA as an intermediate for mobilization. They comprise 3% of the human genome however, there is no current activity in humans (Consortium and International Human Genome Sequencing Consortium 2001). They can be grouped into three subclasses based on their mobilization mechanism; “Cut-and-paste”, “Rolling-circle” and “Self-replicating” (Nancy L Craig, Robert Craigie, Martin Gellert, Alan M Lambowitz 2002; Feschotte and Pritham 2007).

“Cut-and-paste” transposons are the largest subclass and it includes ten different subfamilies. They are all characterized by a transposase enzyme, which excises the double-strand DNA transposon sequence and inserts it elsewhere (Pierre Capy, Claude Bazin, Dominique Higuet, Thierry Langin 1996).

Both “rolling-circle” Helitrons and “self-replicating” Mavericks, mobilize via a single-strand DNA molecule using a replication-based mechanism. Helitrons utilize a helicase for mobilization, and Mavericks resemble DNA viruses and they likely use a B-type DNA polymerase for mobilization (Kapitonov and Jurka 2001; Kapitonov and Jurka 2006; Pritham et al. 2007).

18

3.2. Retrotransposons

Retrotransposons utilize an RNA intermediate for mobilization and are characterized by a “copy-and-paste” mechanism (Boeke et al. 1985). Retrotransposons can be autonomous or non-autonomous. Autonomous elements encode for all necessary for retrotransposition, whereas non- autonomous elements depend on the machinery of autonomous elements to retrotranspose. Retrotransposons are classified into two major classes depending on the presence of long terminal repeats (LTR) at both ends of the element (Nancy L Craig, Robert Craigie, Martin Gellert, Alan M Lambowitz 2002; Feschotte and Pritham 2007).

3.2.1. LTR retrotransposons

LTR retrotransposons are autonomous elements which contain direct repeat sequences flanking the element and internal promoters. Although there are different subclasses of LTR retrotransposons, only human endogenous retroviruses (HERVs) have been recently active in humans (Cordaux and Batzer 2009). They resemble retroviruses and their cycle is characterized by reverse transcription and integration into the genome in the nucleus (Garfinkel et al. 1985). HERVs account for 8% of the human genome however, there is no evidence of current activity (Figure 4) (Consortium and International Human Genome Sequencing Consortium 2001; Mills et al. 2007).

3.2.2. Non-LTR retrotransposons

Non-LTR retrotransposons do not contain LTR sequences, and they are characterized by a stretch of adenines at the end of the element (polyA tail). They can be classified based on their length into Long Interspersed Elements (LINE- 1s) and Short Interspersed Elements (SINEs) (Figure 4) (Nancy L Craig, Robert Craigie, Martin Gellert, Alan M Lambowitz 2002; Feschotte and Pritham 2007).

3.2.2.1. LINE-1

LINE-1s are autonomous elements and therefore, they are also responsible for the mobilization of SINEs (Dewannieux et al. 2003; Hancks et al. 2011; Raiz et al. 2012). There are ~500000 LINE-1 copies in the human genome however, only around 100 elements of the subfamily Homo sapiens-specific (LINE-1HS) remain intact and are still active (Consortium and International Human Genome Sequencing Consortium 2001; Beck et al. 2010; Huang et al. 2010; Iskow et al. 2010; Ewing and Kazazian 2010; Ewing and Kazazian 2011). Among full-length LINE-1s, retrotransposon activity can vary considerably and the most active are termed “hot” (Brouha et al. 2003; Scott et al. 2016).

19

The canonical length of an active LINE-1 is 6kb. It contains a 5’ untranslated region (UTR) which contains an RNA polymerase II promoter (Swergold 1990), two open reading frames (ORF1 and ORF2), and a 3’ UTR containing a RNA polymerase polyadenylation signal ending in a polyA tail (Figure 4) (Babushok and Kazazian 2007). ORF1p encodes an RNA binding protein and ORF2p encodes for an endonuclease (EN) and reverse transcriptase (RT) (Hohjoh and Singer 1997a; Hohjoh and Singer 1997b; Martin et al. 2003; Khazina et al. 2011).

3.2.2.2. SINE

SINEs are non-autonomous elements and therefore, utilize LINE-1 proteins to mobilize. Depending on their sequence similarities, they have been classified into Alu elements and SINE-VNTR-Alus (SVAs).

Alu

There are over one million Alu elements in the human genome and the typical length of a full-length Alu is 300 bp (Kriegs et al. 2007). Their sequence contains an RNA polymerase III promoter, two monomer sequences separated by an A- rich linker region, and a polyA tail (Figure 4) (Batzer and Deininger 2002). Since Alu elements do not contain a RNA polymerase III termination signal, they often retrotranspose 3’ flanking sequences downstream the element in a process termed 3’ transduction (Shaikh et al. 1997; Comeaux et al. 2009).

SVA

There are ~2700 SVA copies in the human genome and they are typically 2kb long. They contain an hexamer repeat region, an alu-like region, a variable number of tandem repeats, a HERV-K10_like region, and a polyadenylation signal ending with a polyA tail (Figure 4). Transcription of SVA elements often depends on nearby promoters, since they do not have their own promoter (Ostertag et al. 2003; Wang et al. 2005; Cordaux and Batzer 2009).

3.2.3. LINE-1 cycle

The life cycle of a human LINE-1 element starts with transcription by RNA polymerase II in the nucleus of the cell. Transcription starts at the LINE-1 internal promoter and ends at the 3’ polyA tail or, if the polyadenylation signal is weak, it can lead to 3’ transductions (Goodier et al. 2000; Pickeral et al. 2000).

After transcription, the LINE-1 RNA is exported to the cytoplasm and both ORFs are translated (Leibold et al. 1990). The LINE-1 RNA, ORF1p and ORF2p form a ribonucleoprotein (RNP) particle in a phenomenon known as cis-preference and enters the nucleus (Esnault et al. 2000; Wei et al. 2001).

20

In the nucleus, the integration occurs in a process termed Target Primed Reverse Transcription (TPRT) (Luan et al. 1993; Christensen et al. 2005). TPRT starts with the EN nicking the DNA, often at the consensus recognition site (5’ TTTT-AA 3’) (Feng et al. 1996; Hohjoh and Singer 1997b; Morrish et al. 2002). The EN cleavage exposes a free 3’ OH group which will be used by the RT to reverse transcribe the LINE-1 mRNA starting from the 3’ polyA tail (Doucet et al. 2015; Ahl et al. 2015). Finally, the complementary strand is synthesized, and the junctions are repaired by a mechanism not well understood, although host factors are likely to be involved. During the retrotransposition process, the source element remains intact and functional.

This process leaves various signatures of the retrotransposition process, including; LINE-1 sequence corresponding to the currently active LINE-1 families (Wei et al. 2001; Hancks and Kazazian 2016), frequent 5’ truncations and inversions of the LINE-1 sequence (Ostertag and Kazazian 2001), insertion site preference for the EN recognition site (5’ TTTT-AA 3’) (Feng et al. 1996), target site duplications (TSDs) of different lengths flanking the insertion (Jurka 1997), and a polyA tail reflecting the RNA precursor (Dombroski et al. 1991; Doucet et al. 2015). The LINE-1 retrotransposon machinery can be hijacked by Alus (Dewannieux et al. 2003), SVAs (Hancks et al. 2011; Raiz et al. 2012) and even other cellular mRNAs (Figure 5) (Esnault et al. 2000).

21

Figure 5. LINE.1 life cycle. The LINE-1 cycle starts with transcription by RNA polymerase II in the nucleus of the cell (1). The LINE-1 mRNA leaves the nucleus (2) for translation of both ORF1 and ORF2 (3) and formation of the ribonucleoprotein in the cytoplasm of the cell (4). The RNP complex re-enters the nucleus (5) where the LINE-1 cDNA integrates into a new genomic location (6).

4. Retrotransposition in humans

LINE-1 elements were first recognized in the human genome in the 1980s (Adams et al. 1980; Singer 1982; Hattori et al. 1986). However, evidence of their mobility was found later, when ORF2p was identified to have RT functions (Hattori et al. 1986) and full-length LINE-1 transcripts were detected in human teratocarcinoma cell lines (Skowronski and Singer 1985; Skowronski et al. 1988). The first evidence of insertional mutagenesis in humans was reported in 1988, when two de novo LINE-1 insertions in exon 14 of the F8 gene were reported as the cause of Haemophilia A (Kazazian et al. 1988). These findings indicated that in these two patients, a retrotransposon had been active in either the parental germ cells or very early during embryonic development. In fact, active retrotransposition occurs in both the germ cells and during early embryonic development, thus leading to polymorphic insertions and somatic mosaicism (Prak and Kazazian 2000; Bourc’his and Bestor 2004; Molaro et al. 2011).

22

In addition, somatic retrotransposon activity has also been detected during brain development, adult neurogenesis (Coufal et al. 2009; Bodea et al. 2018) and in several cancer types. However, in most healthy adult tissues retrotransposons become strongly repressed by epigenetic and post transcriptional mechanisms.

In the repression by epigenetic marks, Krüppel-associated box proteins play an important role in the establishment of de novo methylation during embryogenesis (Thayer et al. 1993; Hata and Sakaki 1997). Whereas DNA methyltransferase I and ubiquitin-like with PHD and RING finger domains I play an important role in maintaining LINE-1 promoter methylation after DNA replication (Hata and Sakaki 1997). In addition, PIWI-interacting RNAs, short interfering RNAs, and the Apolipoprotein B mRNA Editing Catalytic Polypeptide protein complex play an important role in the post transcriptional regulation of retrotransposons by recognizing and cleaving LINE-1 RNAs (Chambeyron et al. 2008).

5. Retrotransposition in cancer

The first example of somatic insertional mutagenesis from a retrotransposon was reported in 1992 (Miki et al. 1992). In this study, they found a LINE-1 insertion in exon 16 of APC in a CRC patient. The insertion was not present in the corresponding normal tissue, indicating that the insertion was a somatic event (Miki et al. 1992). In 2010, Iskow et al. reported frequent somatic retrotranspositions in lung tumors, a methylation signature associated with LINE- 1-permissive tumors, and suggested that retrotransposition may play a driving role in tumorigenesis (Iskow et al. 2010; Lee et al. 2012; Achanta et al. 2016).

Subsequently, several studies reported high retrotransposon activity in colorectal cancer (Lee et al. 2012; Solyom et al. 2012; Pitkänen et al. 2014; Cajuso et al.), esophageal squamous carcinomas and adenocarcinomas (Doucet-O’Hare et al. 2015; Doucet-O’Hare et al. 2016), gastric and small bowel tumors, hepatocellular carcinomas, and pancreatic ductal adenocarcinomas (Shukla et al. 2013; Rodić et al. 2015; Ewing et al. 2015). In addition to tumors from the gastrointestinal tract, frequent somatic retrotranspositions have been reported in head and neck (Helman et al. 2014; Tubio et al. 2014), prostate and ovarian cancers (Lee et al. 2012; Tubio et al. 2014; Tang et al. 2017). On the other hand, almost no insertions are detected in gliomas (Iskow et al. 2010; Lee et al. 2012; Achanta et al. 2016) or blood cancers (Lee et al. 2012).

In addition, retrotransposon insertions have been recently characterized in the Pan-Cancer Analysis of Whole Genomes (PCAWG) dataset which includes 2954 tumors across 38 tumor types (Rodriguez-Martin et al. 2020). In this dataset, the top five tumor types with the highest mean number of insertions were esophageal,

23

head and neck, colorectal, uterus, stomach, and ovarian cancers (Figure 6). On the other hand, almost no insertions were detected in thyroid, kidney, cervix, bone, myeloid or nervous system cancers (Rodriguez-Martin et al. 2020).

Figure 6. Retrotransposon insertions in PCAWG data. Number of insertions per tumor in the ten PCAWG cancers with the highest mean number of insertions. Insertion count is represented on the Y-axis, and tumor locations on the X-axis. Tumor locations are sorted in a descending order based on the mean number of insertions.

As shown by the studies cited above, retrotransposon activity varies substantially among different tumor types, but also between individuals. Although it is not well understood why some tumor types are more prone to retrotransposition than others, a correlation between insertion count and ORF1p expression has been reported (Rodić et al. 2014).

Recurrent insertions have been identified in several protein-coding genes which are frequently mutated in cancer (Lee et al. 2012; Solyom et al. 2012; Helman et al. 2014). Although the majority of insertions are intronic, intronic insertions can also disrupt gene expression (Lee et al. 2012; Helman et al. 2014). Retrotransposon insertions in the exons of driver cancer genes have also been identified, including two insertions in APC (Miki et al. 1992; Scott et al. 2016), one insertion in PTEN (Helman et al. 2014), and one oncogenic insertion in ST18 by blocking its repression enhancer (Shukla et al. 2013).

In contrast to germline insertions, most somatic insertions are 5’ truncated (Solyom et al. 2012; Burns 2017) and somatic insertions are biased towards cancer-specific hypomethylated areas (Lee et al. 2012). Somatic retrotransposon

24

insertions can already be found in early stages of tumor evolution and several insertions are present in different tumor sections, suggesting that some insertions are selected further (Doucet-O’Hare et al. 2015; Ewing et al. 2015). Although retrotransposition is an ongoing process, insertions are acquired in a discontinuous manner where different LINE-1s might be activated in different stages of the tumorigenic process (Tubio et al. 2014; Rodić et al. 2015; Pradhan et al. 2017) (Figure 7).

Figure 7. Fluctuating activity of different LINE-1s during the tumorigenic process. In the upper side, a model of the progression from normal epithelia to invasive colon cancer. In the lower side, a representation of the transient activity of different source LINE-1s throughout tumor evolution, each clone is color coded to match the group of clones they belong to in the upper figure. LINE-1 activity is depicted as lines that are color coded to match the active source element.

Several associations to clinical and molecular characteristics have been identified. For example, insertion count is associated with patient age (Solyom et al. 2012) and higher insertion count is found in tumors with higher rates of chromosomal rearrangements and other somatic mutations (Helman et al. 2014). ORF1p expression is associated with TP53 mutation in lung cancers (Rodić et al. 2014) and gastrointestinal tumors with high immune activity are associated with lower insertion count (Jung et al. 2018). LINE-1 promoter hypomethylation is associated with decreased survival in breast, ovarian, non-small cell lung and colorectal cancers, as well as esophageal squamous cell carcinoma (Pattamadilok et al. 2008; Ogino et al. 2008b; Saito et al. 2010; van Hoesel et al. 2012; Iwagami et al. 2013; Harada et al. 2015). In addition, LINE-1 hypomethylation is associated with the CpG island methylator phenotype in CRC (Ogino et al. 2008a).

25

All these studies suggest that while some retrotransposon insertions may play a driving role in tumorigenesis, others are likely passengers of the tumorigenic process. Furthermore, it is not clear whether the association between LINE-1 hypomethylation and clinical characteristics is a direct consequence of retrotransposon activation, or retrotransposon activation is a consequence of cancer-related global hypomethylation. Thus, the role of retrotransposon insertions in tumorigenesis needs further investigation.

26

AIMS OF THE STUDY

The aim of this thesis project was to provide a deeper understanding on the role of retrotransposition in colorectal cancer. The specific aims relevant to each study (I-III) are:

I. To identify frequent insertions from a LINE-1 in TTC28.

II. To detect transductions from the LINE-1 in TTC28 by LDI-PCR and Nanopore sequencing.

III. To characterize somatic retrotransposition and investigate associations with clinical characteristics in 202 colorectal tumors.

27

MATERIALS AND METHODS

1. Sample collection (Study I - Study III)

Fresh frozen tumor and normal tissue material (either adjacent normal colon or blood) was obtained from two large CRC sample collections (C and S series). The first sample collection involved nine Finnish hospitals between May 1994 and April 1998 (C-series) (Aaltonen et al. 1998; Salovaara et al. 2000). A second collection (S-series) started in 1998 from two out of the nine Finnish hospitals and it is currently ongoing (unpublished).

All tumors were screened for MSI as described previously (C-series) (Aaltonen et al. 1998; Salovaara et al. 2000) or by using the Bethesda microsatellite panel (S- series). Family history including first degree relatives, cancer diagnoses, cause of death and other clinical characteristics were collected from National population registries, parishes, the Finnish Cancer Registry, Statistics Finland and pathological reports.

A signed informed consent or authorization from the National Supervisory Authority for Welfare and Health was obtained for all participants (Dnro 421/04/044/06, Dnro 8048/06.01.03.01/2014, Dnro 358/32/300/05, Dnro 1476/06.01.03.01/2012). The study has been reviewed and approved by the Ethics Committee of the Hospital District of Helsinki and Uusimaa (Dnro 133/E8/03, 408/13/03/03/2009). Permission to use patient information was obtained from the National Institute for Health and Welfare (Dnro 53/07/2000, Dnro THL/1071/5.05.00/2011, Dnro THL/151/5.05.00/2017).

2. DNA extraction (Study I - Study III)

DNA was extracted from fresh frozen tumor tissue and the corresponding normal (either adjacent colon or blood) using DNA isolation standard methods (Lahiri and Nurnberger 1991). Only tumors with tumor percentage above 50 based on Hematoxylin and Eosin staining were selected for DNA extraction.

3. RNA extraction (Study III)

Total RNA was extracted from fresh frozen tumor tissue and the corresponding adjacent normal colon tissue using the RNeasy Mini Kit (Qiagen). Only tumor

28

samples with tumor percentage above 50 based on Hematoxylin and Eosin staining were used for RNA extraction.

4. Whole genome sequencing (Study I - Study III)

WGS was performed on the Illumina HiSeq 2000 to at least 40x median coverage and with 100 bp paired-end reads. Paired-end reads were aligned with Burrows- Wheeler Aligner (BWA) (v.0.6.2) (Li and Durbin 2009). The aligned paired-end reads were mapped against the 1000 Genomes Project Phase 2 reference assembly hs37d5 as described in previous studies (Katainen et al. 2015; Palin et al. 2018).

4.1. Structural variation calls (Study I - Study III)

To detect structural rearrangements (deletions, tandem duplications, inversions and translocations) from paired-end WGS data we utilized DELLY (Rausch et al. 2012). In study I, we utilized DELLY calls from 92 familial CRC cases (Pitkänen et al. 2014). In study II and III, we utilized a new set of DELLY calls from 213 CRC cases including the 92 familial cases (Katainen et al. 2015).

4.2. Retrotransposon insertion calls (Study III)

4.2.1. Detection of solo retrotransposon insertions

We utilized Transposon Finder in Cancer (TraFiC) (Tubio et al. 2014) to detect Alu, SINE, ERV and solo-LINE-1 insertions from 201 CRCs including the 92 familial cases. TraFiC was applied with default parameters with the exception of; a=1 (RepeatMasker accuracy), s=3 (minimum of 3 reads in a tumor cluster), and gm=3 (minimum of 3 reads in normal cluster) and including paired-end reads with equal mapping quality only if it was above 0 (where the anchor read was defined as the first end). Somatic filtering was applied when a tumor call was present in any normal (234 normal samples) with a 200 bp window.

4.2.2. Detection of LINE-1 transductions

To detect 3’ transductions from full-length reference LINE-1s we utilized DELLY (Rausch et al. 2012) (section 4.1). 3’ transductions are retrotransposon insertions including downstream sequence which allows the unique mapping of paired-end reads. We extracted DELLY calls with mapping quality above 37, at least 3 supporting reads, and longer than 1000 bp. We selected DELLY calls with one end located within 1000 bp from the 3’end of any of the reference LINE-1s. Tumor calls present in any of the pooled normal samples (± 200 bp) were filtered away.

29

5. Nanopore sequencing (Study II)

Nanopore sequencing is a long-read sequencing technique based on the current generated by a DNA molecule when it passes through a nanopore. Nanopore sequencing was utilized in conjunction with LDI-PCR to detect 3’ transductions and to elucidate insertion characteristics. Nanopore libraries were prepared according to manufacturer’s instructions (SQK-LSK108 and EXP-NBD103). MinION flow cell (FLO-MIN106) was run for 6 hours (MinION Mk1B). Basecalling was performed with ONT albacore Sequencing Pipeline Software (version 1.0.2) and basecalled reads were aligned to the 1000 Genomes Project Phase 2 reference assembly hs37d5 as described in previous studies (Katainen et al. 2015; Palin et al. 2018). To identify transductions from nanopore data we applied the LDI-PCR.py software. Briefly, LDI-PCR.py identifies reads that display the following hallmarks: at least one alignment to the LINE-1 in TTC28 (chr22:29060420-29067335), at least one alignment to a single target locus elsewhere in the genome (maximum distance 100 kb) and mapping quality above 20. Further filtering criteria were applied to define the candidate insertion calls, and the reads were polished to increase the accuracy of the consensus sequences.

6. SNP arrays (Study I and Study III)

To identify regions of AI, we utilized Single nucleotide polymorphism (SNP) chip data from 1699 normal and tumor pairs including the whole genome sequenced samples (Palin et al. 2018). DNA was genotyped with Infinium Omni2.5 (Illumina inc.) array. Regions with allelic imbalance were calculated utilizing BAFsegmentation with default parameters (Staaf et al. 2008).

7. Methylation assay (Study III)

To evaluate CIMP status of the WGSed tumor samples, we utilized the MS-MLPA assay (Nygren et al. 2005) and SALSA MLPA ME042 CIMP probemix (MRC- Holland, Amsterdam, The Netherlands). This assay allows the detection of methylation at CpG islands located at the promoter of eight different genes. MS- MLPA assay was performed according to manufacturer’s instructions (Nygren et al. 2005) and Coffalyser software (MRC-Holland, Amsterdam, The Netherlands) was used to detect the levels of methylation for each probe. If ≥25% of the probes of one gene were methylated, the gene was scored as methylated. CIMP status

30

of the tumors was defined as CIMP-high if 5-8 genes were scored as methylated, and CIMP-low was defined if 0-4 genes were scored as methylated.

8. RNA sequencing (Study III)

To investigate the effect of retrotransposon insertions on gene expression we utilized RNA sequencing (RNA-seq) data from 34 CRCs with WGS data. RNA sequencing was performed on the Illumina Hiseq 2000 with 49 and 75 bp paired-end reads (Ongen et al. 2014). RNA-seq data was further processed and quantified utilizing Kallisto software (Bray et al. 2016).

9. Long-distance inverse-PCR (LDI-PCR) (Study II)

LDI-PCR was utilized in conjunction with Nanopore sequencing to identify 3’ transductions arising for the L1 element located in TTC28. LDI-PCR comprises three different steps. First, digestion of genomic DNA by using a combination of different restriction enzymes. Second, DNA ligation of the resulting fragments, and third, amplification of ligated DNA fragments by circular PCR. DNA was digested with three restriction enzymes: SacI, PstI and NstI. Digested DNA was self-ligated utilizing T4 DNA ligase (ThermoFisher Scientific, Waltham, Massachusetts, US) to produce circular templates. Eight PCR replicates were used to ensure enough DNA concentration. PCR products were run on a 1% agarose gel and purified with NucleoSpin Gel and PCR Clean-up kit (Macherey-Nagel, Düren, Germany). PCR replicates that showed a consistent amplicon pattern were pooled together. Fragment distribution and molarity were evaluated from the 18 different reactions in TapeStation 2200 (Agilent Technologies, California, US). Subsequently, the LDI-PCR products were pooled into nine different barcodes in equal molarity for Nanopore library preparation.

10. Sanger validation (Study I and Study II)

Sanger sequencing was utilized for the validation of 3’ transductions. Primers were designed using Primer3Plus software (Untergasser et al. 2012). PCRs were conducted with AmpliTaqGoldVR (Applied Biosystems, Foster City, CA) or Expand Long Template PCR system (Roche, Basel, Switzerland). PCR purification was conducted with ExoSAP-IT PCR purification kit (USB, Cleveland, OH) or by using QIAquick Gel extraction kit (qiagen, Hilden, Germany). Purified reactions were sequenced with Big Dye Terminator v.3.1 kit (Applied Biosystems,

31

Foster City, CA) at the Institute for Molecular Medicine Finland. The sequence graphs were manually analyzed using FinchTV v.1.4.0 (Geospiza, Seattle, WA).

32

RESULTS

1. Identification of frequent insertions from a LINE-1 in TTC28

We utilized DELLY to characterize somatic structural rearrangements from 92 CRC whole genomes. We identified a recurrent event originating from the first intron of TTC28 in 19 out of 92 CRC cases. A very similar signal was reported in another study where it was identified as translocations likely to inactivate TTC28 (Network and The Cancer Genome Atlas Network 2012). After thorough visual inspection of the paired-end read data, we were able to identify additional rearrangements resulting in a total of 83 rearrangements in 52 cases (Figure 8).

Figure 8. Active LINE-1 in TTC28. Aberrant signal originating from TTC28 in 92 CRC familial cases. Circos plot showing the 83 rearrangements arising from the same LINE-1 in .

We observed that all breakpoints clustered downstream of a reference LINE-1 (LINE-1base id 129, dbRIP id 2000144). The close proximity of the breakpoints to a full-length LINE-1 suggested that these events were likely to be retrotranspositions instead of translocations.

33

Subsequently, we compared the mean read coverage in the 3’ end of the LINE-1 (chr22:29065455-29066124) to the mean coverage of chromosome 22 for each sample. We identified that the relative mean coverage was higher in samples with at least one event than samples with no detected events (t-test, p-value < 2.2x10- 16). These findings were compatible with an amplification of the 3’ end of the LINE- 1 that was recurrently involved in the rearrangements.

In addition, we validated by Sanger sequencing three somatic events located in the introns of SGIP1, NOVA1 and ARHGEF4. In each case, we observed two amplicons of different length. One shorter amplicon corresponding to the wild type allele and a longer amplicon likely to be the allele with the inserted sequence. The LINE-1 or flanking sequence was identified in all cases and a polyA tail at the 3’ end of the insertion was identified in SGIP1 and ARHGEF4.

Altogether these findings - a cluster of breakpoints downstream an active LINE-1, the amplification of the LINE-1 sequence compatible with the “copy-and-paste” mechanism, and the presence of a polyA tail- indicated that these events were LINE-1 retrotranspositions rather than translocations (Network and The Cancer Genome Atlas Network 2012). Furthermore, we found that insertions in genes showed intronic preference (χ2=54.2, df=1, p=1.81 e-13), and only one recurrent hit was identified in NOVA1.

1.2. Germline insertions in the Finnish population

In addition to the somatic events, two insertions in GABRA4 and RP11-136012 were found to be in the germline. We validated the insertions in GABRA4 by Sanger sequencing, and found the insertion to be present in 4.3% (4/92) of the corresponding normals of the WGSed CRCs, in 3.3% (3/90) of additional CRC normals, and in 10% (9/90) of anonymous blood donors.

Furthermore, patients with the germline insertion in GABRA4 shared significantly longer haplotypes (t-test, p-value =0.046) than patients without the insertion, however no difference was found in patients with the RP11-136012 insertions. These results suggest a shared founder haplotype in the patients with germline insertions in GABRA4, and an independent origin or a more ancient haplotype in patients with a germline insertion in RP11-136012.

34

2. Detection of clonal and subclonal transductions from the LINE-1 in TTC28

In this study we applied LDI-PCR to detect 3’ transductions, arising from the previously characterized LINE-1, from two samples with whole genome data. By using this method, we were able to identify 3’ transductions without knowing the target locations.

Primer pairs and restriction enzymes were selected based on the rationale that LINE-1 insertions are often 5’ truncated, and that the 3’ transduced sequence is usually shorter than 1000 bp (Tubio et al. 2014). SacI and PstI restriction enzymes were selected because one cut site was located within the ORF1p of the LINE-1 and the other cut site was within the sequence involved in 3’ transductions. NstI was selected because the restriction fragment included the full sequence of the LINE-1 as well as the 3’ transduction sequence, which allowed us to detect possible full-length insertions. Primer design was based on the location and the score of different polyadenylation signals downstream the LINE-1 (Figure 9) (Tabaska and Zhang 1999) .

After PCR of the circular templates, the PCR products were pooled into nine different barcodes in equal molarity for Nanopore library preparation and sequencing. After Nanopore sequencing, we applied a software (LDI-PCR.py) to identify amplicons with hallmarks of the 3’ transductions arising from the LINE-1 in TTC28. We identified 14 out of the 15 3’ transductions that were previously identified from WGS data. In addition, we detected 25 novel transductions that exhibit all the hallmarks of LDI-PCR but were not detectable by 40x WGS, even after manual inspection of the read data (Table 1). Almost all 25 insertions were supported by fewer nanopore reads than the ones detected by WGS (Wilcoxon rank-sum test, p-value= 2.43 x 10 -6) (Table 1) thus suggesting that the novel insertions were likely to be subclonal events.

Additionally, we were able to validate two insertions by conventional PCR and nanopore sequencing (Y:15633117 and 14:79638932), and six out of ten insertions by allele-specific PCR and sanger sequencing. The presence of both clonal and subclonal insertions indicates active retrotransposition during the tumorigenic process.

35

Figure 9. Schematic of the LDI-PCR method and rationale used to detect 3’ transductions. The reference full-length element is depicted in a blue rectangle, and the 3’ transduced region in a blue line. Polyadenylation signals are marked as red lollipops. Representation of a 3’ transduction in a new location is depicted in blue, and the novel location in brown. After DNA digestion and ligation, target circular templates are produced which will be amplified by LDI-PCR. L1: LINE-1.

36

2.1. Analysis of the insertion sequence

Long-read sequencing allowed the elucidation of the insertion sequence. A consensus of the inserted sequence was obtained for each insertion, and the complete sequence was recreated in 35 out of 39 insertions. From the 35 insertions, eight consensus sequences contained short alignment gaps (3-10 bp). These gaps were likely to be the result of mismatches from Nanopore sequencing. Target side modifications (TSMs) were observed in all the consensus sequences, from which 29 were duplications and six were deletions. The mean insertion length was 493 bp, and it ranged from 142 to 1124 bp. All the consensus sequences showed 5’ truncation and ~77% of them did not include the LINE-1 sequence, also defined as orphan transductions.

We also explored the use of 9 polyadenylation signals located within the unique sequence, and we observed that 90% of the insertions did not terminate in the canonical signal, but instead used two alternative polyadenylation signals (6th and 9th). We examined the strength of the signals utilizing polyadq and found that the preference for the 9th was consistent with the predictions. However, the 6th polyadenylation signal was predicted to be false by polyadq.

Furthermore, we observed inversion of the inserted sequence in 16 out of 35 insertions likely to involve twin priming. Twin priming is a phenomenon that involves an inversion of the inserted sequence due to an internal region of homology between the LINE-1 sequence and the target site. We were able to observe the exact point of the inversion in all the cases, and identified the region of microhomology in 10 out of 16 insertions.

37 Table 1. Clonal and subclonal insertions detected by LDI-PCR/Nanopore sequencing. The insertions are sorted by sample and insertions also detected by WGS are squared. TSM: target side modifications (dup: duplications, del: dele- tion); TP: twin priming; Ins.length (bp): insertion length in base pairs; * : internal duplication; **: sequence missing.

Target TSM Size of TSM TP Ins. length (bp) Reads 1:195769724-195769709 dup. 16 yes 29 3984 4:93280482-93280454 dup. 28 no 243 4658 4:155900401-155900387 dup. 14 yes 1015 1705 4:183051382-183051401 del. 18 no 142 10103 7:146783241-146783223 dup. 18 yes 536 19889 7:152661949-152661940 dup. 9 yes 455 20107 12:33708291-33708277 dup. 14 yes 377 16909 2:78612537-78612530 dup. 8 yes 665 5 3:99147126-99147111 dup. 16 yes 984 47 4:90987152-90987149 dup. 4 no 834 99 6:74978206-74978187 dup. 20 no 597 9 8:111856478-111856457 dup. 22 yes 857 6 14:99070525-99070523 dup. 3 no 354 55 16:26220799-26220798 dup. 2 no 142 182 16:5902138-5902158 del. 19 no 165 72 1:115147190-115147187 dup. 4 no 246 12753 2:182004540-182004515 dup. 26 yes 783* 20217 2:229159082-229159075 dup. 8 no 330 704 6:70787202-70787188 dup. 15 yes >223 17007 6:133527459-133527443 dup. 17 no 204 98 8:88681299-88681304 del. 4 no 416 1095 12:128116403-128116405 del. 1 no 391 15099 2:50947578-50947612 del. 33 NA >183 98 2:129889238-129889240 del. 1 no 176 193 4:44621421-44621515 del. 93 no 607 33 5:8665955-8665942 dup. 14 no 242 451 5:83347372-83347360 dup. 13 yes 340 78 5:119565858-119565843 dup. 16 yes 455 14 6:112763097-112763084 dup. 14 yes 671 38 7:152870668-152870685 del. 16 yes 306 457 8:114925191-114925178 dup. 14 no 1124 1020 8:107979180-107979169 dup. 12 yes 1055 28 10:101386670-101386662 dup. 9 NA >161 36 10:107557372-107557435** NA NA NA >120 86 12:33100097-33100084 dup. 14 yes 670 248 14:79638932-79638931 dup. 2 no 147 172 18:1233989-1233975 dup. 15 yes 735 117 X:108351909-108351907 dup. 3 no 456 119 Y:15633117-15633103 dup. 15 no 244 1084

2.2. Comparison of Nanopore and WGS local assembly

We utilized insertions detected by both methods (WGS and nanopore) in one of the samples (n=7), to compare long-read sequencing to short-read local assembly.

Local assembly was successfully accomplished in one contig in four out of seven cases with defined parameters. However, the remaining three insertions were assembled in two contigs and thus, the sequence in between was lost. The length of the lost sequence varied from 88-614 bp based on the genomic coordinates of both insertion junctions. However, we were not able to estimate the length of the lost sequence in one case involving twin priming where the point of inversion was within the lost sequence. All insertions with lost sequence were longer than 400 bp based on the LDI-PCR/Nanopore consensus sequence.

Both methods predicted insertion length accurately (deviance 0-7 bp) and, in 10 out of 14 insertions the breakpoints were predicted consistently. In cases with an inconsistent breakpoint, we observed that the inconsistent breakpoint was located after a stretch of Ns which corresponded to the location of the consensus polyA tail, which was likely to compromise the local assembly.

The comparison of local assembly of short-read WGS and the consensus sequences generated by LDI-PCR and Nanopore sequencing indicates that local assembly was of limited value in the reconstruction of the complete insertions and in the accurate identification of polyA tails.

3. Characterization of somatic retrotransposition in 202 colorectal tumors

To characterize somatic retrotranspositions we utilized TRaFiC and DELLY in 202 colorectal tumors including 12 MSI, 187 microsatellite stable (MSS) and 3 POLE mutant tumors. After somatic filtering, we were able to identify 5072 somatic insertions. The median number of insertions was 17, minimum number of insertions was 0 and maximum number of insertions was 197. Insertion density was higher in closed chromatin (1.78 insertions per Mbp) than in open chromatin (0.96 insertions per Mbp), and in late replicating regions (replication time >0.8, 3.06 insertions per Mbp) compared to early replicating regions (replication time < 0.2, 0.73 insertions per Mbp).

39

3.1. Insertions in protein-coding genes

Out of 5072 somatic insertions, 1680 insertions (33%) were in protein-coding genes and as expected, the majority of the insertions were located in the introns and only 72 insertions were found in the exons. We also observed higher insertion count in genes with lower expression and lower insertion count in genes with higher expression (Figure 10).

Figure 10. Insertions in protein-coding genes. Number of somatic insertions in protein- coding genes, and in 21 fragile sites. In the left panel, logarithm of the gene expression values (median TPM values in 34 tumors) is represented on the Y-axis, and number of insertions per gene classified in groups are represented on the X-axis. Right panel shows the insertion fraction over the fraction of allelic imbalance in 21 fragile sites genes.

In addition, we observed recurrent insertions (at least two insertions) in 333 protein-coding genes. Genes with the highest number of insertions were LRP1B (19 insertions), DLG2 (10 insertions), and PTPRD (9 insertions). Both LRP1B and DLG2 are common fragile site genes and recurrent hotspots for HPV viral integration. Although recurrent insertions in certain genes could be the result of tumor selection, no clear clusters of insertions in either of the genes were observed. The high number of insertions in these genes could still be the result of conferring a tumor growth advantage, but also due to other characteristics, such as low gene expression, gene length, sequence composition, replication timing and chromatin structure, or a combination of all. In addition, the top three genes with the highest recurrent insertion density were COL25A1, ARAP2, and ZNF251. All the insertions were located in the introns, no clear hotspots of insertions were apparent, and none of them were classified as known cancer genes.

40

We also evaluated whether insertions could affect the expression of the nearest genes by using RNA-seq data from a subset of 34 tumors (detailed explanation in Cajuso et al. (Cajuso et al.)). However, we could not find a significant global effect in the expression of the closest genes to the insertions detected in the 34 tumors. Therefore, based on these findings, it is difficult to conclude whether these insertions were driver events or passengers of the tumorigenic process.

3.1.1. Insertions in common fragile site genes

Among protein-coding genes with recurrent insertions, 12 were common fragile sites. Since fragile sites are prone to genomic instability, we investigated whether the fraction of copy number gains and losses - here defined as AI - and the frac- tion of insertions in fragile site genes correlated. Although no correlation was found, we observed two groups; fragile site genes with high insertion fraction dis- played low AI fraction and vice versa (Figure 10).

Since insertion count is higher in genes with lower expression, we decided to investigate whether gene expression could explain the negative correlation between insertion fraction and AI fraction. In fact, genes with high insertion fraction (insertion fraction / AI fraction >1) had lower expression than genes with high AI fraction (0 < insertion fraction / AI fraction < 1) (Exact Two-Sample Fisher-Pitman Permutation Test for log transformed gene expression, p=0.04).

Thus, our findings indicate that while AIs are a recurrent characteristic of fragile sites, less expressed fragile sites are more prone to retrotranspositions.

3.1.2. Insertions in known cancer genes

We identified recurrent insertions in 15 known cancer genes, although no clear bias towards TSGs or oncogenes was observed. All recurrent insertions in cancer genes were intronic except for one gene (Table 2).

41

Table 2. List of known cancer genes with two or more somatic insertions. TSG: tumor suppressor gene.

Gene ID Gene name Num. of ins) Cancer census role ENSG00000168702 LRP1B 19 TSG ENSG00000178568 ERBB4 7 Oncogene, TSG ENSG00000171094 ALK 5 Oncogene, fusion ENSG00000196090 PTPRT 3 TSG ENSG00000046889 PREX2 3 Oncogene ENSG00000185811 IKZF1 3 TSG, fusion ENSG00000183454 GRIN2A 2 TSG ENSG00000144218 AFF3 2 Oncogene, fusion ENSG00000157168 NRG1 2 TSG, fusion ENSG00000079102 RUNX1T1 2 Oncogene, TSG, fusion ENSG00000151702 FLI1 2 Oncogene, fusion ENSG00000134982 APC 2 TSG ENSG00000189283 FHIT 2 TSG, fusion ENSG00000085276 MECOM 2 Oncogene, fusion ENSG00000196159 FAT4 2 TSG

The only gene with recurrent exonic insertions was APC. Both insertions were in the exon 16 and consistent with the distribution of nonsynonymous changes. Both insertions were located in close proximity to the two insertions previously identified (Figure 11). Both tumors with APC insertions displayed copy number loss and loss of heterozygosity encompassing APC but no other sequence varia- tions were found. Therefore, our findings together with the two previous reports, suggest that LINE-1 insertions can serve as tumor initiating events in a small proportion of CRC cases (Miki et al. 1992; Scott et al. 2016).

42

Figure 11. Representation of the linear protein of APC. Distribution of nonsynony- mous changes in 187 MSS CRCs are represented as red lollipops. Retrotransposon in- sertions identified in this study are shown as dark blue lollipops and two insertions that were previously identified are shown as green lollipops. MCR: mutation cluster region.

3.2. Hot reference LINE-1s

We utilized the 3’ transduced region from 346 transductions to identify the reference source elements and investigate their activity. We first identified that, out of 315 reference full-length elements 56 were responsible for at least one transduction. Fifty percent of these elements (28/56) had been previously re- ported to be active in human cancers (Rodriguez-Martin et al. 2020) and, as we previously reported (Study I), the most active element was the LINE-1 located in TTC28, which accounted for 46% of all transductions.

Therefore, the majority of the detected 3’ transductions originated from a very few active reference LINE-1s.

3.3. Retrotransposition and molecular characteristics

To evaluate whether retrotransposon insertion count was associated with clinical and molecular characteristics, we applied a multiple linear regression model. We hypothesized that insertion count was associated with tumor location, TP53 mutation, MSI, the genomic fraction of AI and CIMP. We also included mean coverage, Dukes tumor stage, sex and age at diagnosis to adjust the model for possible confounders (Table 3). After adjusting the model, we identified that insertion count was significantly associated with CIMP and the genomic fraction of AI (Figure 12, Table 3). Both associations remained significant after adding BRAF mutation (V600E) in the model (Multiple linear regression model CIMP p- value=0.004, genomic fraction of AI p-value=0.004) and when the model was applied with only MSS samples (Multiple linear regression model CIMP p- value=0.001, genomic fraction of AI p-value= 0.006).

43

Figure 12. Log-transformed insertion count versus the genomic fraction of AI and CIMP. Scatter plots of log-transformed insertion count over the genomic fraction of AI left panel) and CIMP (right panel).

Table 3. Output of the multiple linear regression model of the log-transformed insertion count. CIMP-H: CpG island methylator phenotype high; AI: Allelic imbalance; MSI: Microsatellite instability; Mean cov.: Mean coverage; Age: age at diagnosis.

Coeff. Std. err. z p Signif. Intercept 0.408 0.647 0.63 5.29E-01 CIMP-H 0.607 0.169 3.6 3.22E-04 *** AI (/10% of reference) 0.0826 0.0284 2.91 3.64E-03 ** TP53 mutation −0.0684 0.134 −0.509 6.10E-01 MSI 0.15 0.285 0.527 5.98E-01 Mean cov.(/10 reads) 0.309 0.0862 3.59 3.33E-04 *** Age (/10 years) 0.0331 0.0603 0.549 5.83E-01 Male 0.057 0.123 0.464 6.42E-01 Dukes B −0.0578 0.168 −0.344 7.31E-01 Dukes C −0.0483 0.186 −0.259 7.96E-01 Dukes D 0.0049 0.211 0.0232 9.81E-01 Proximal location 0.206 0.144 1.43 1.52E-01

3.4. Retrotransposition and disease-specific survival

To investigate whether the number of retrotransposon insertions correlated with CRC survival, we applied a Proportional Hazards model. We adjusted the model with variables known to affect CRC survival, including Dukes tumor stage, sex,

44

MSI, age at diagnosis and BRAF mutation. Since insertion count was associated with CIMP and the genomic fraction of AI, we also included both variables in the model. We identified that insertion count was significantly associated with poor CRC survival independently of other prognostic factors (Figure 13, Table 4).

Figure 13. Insertion count is associated with poor CRC survival. Kaplan-Meyer curves from patients with 0-20 insertions and patients with more than 20 insertions.

Table 4. Output of the Cox proportional Hazards model. MSI: Microsatellite instability; CIMP-H: CpG island methylator phenotype high; Age: age at diagnosis; AI: Allelic imbalance.

Coeff. Std. err. z p HR [95% CI] Signif. Insertion count (/10) 0.108 0.0362 2.98 2.93E-03 1.11 [1.04, 1.20] ** MSI −0.258 0.642 −0.402 6.88E-01 0.773 [0.219, 2.72] CIMP-H 0.174 0.341 0.51 6.10E-01 1.19 [0.610, 2.32] BRAF mutation 0.79 0.447 1.77 7.68E-02 2.20 [0.918, 5.29] . Age [55, 75) years −0.147 0.408 −0.360 7.19E-01 0.863 [0.388, 1.92] Age ≥ 75 years 0.188 0.427 0.439 6.60E-01 1.21 [0.523, 2.78] Male 0.311 0.232 1.34 1.80E-01 1.37 [0.866, 2.15] Dukes B 0.452 0.449 1.01 3.13E-01 1.57 [0.652, 3.79] Dukes C 1.77 0.431 4.12 3.82E-05 5.89 [2.53, 13.7] *** Dukes D 2.78 0.454 6.12 9.07E-10 16.2 [6.64, 39.4] *** AI (/10% of reference) −0.0583 0.0539 −1.08 2.80E-01 0.943 [0.849, 1.05]

45

DISCUSSION

Cancer is a group of heterogeneous diseases that results from the accumulation of driver events in a multi-step manner. Over the last decade, next generation sequencing has allowed the characterization of cancer genomes, which has led to the identification of several driver events. Retrotransposons are repetitive sequences that become active in many cancers and lead to insertional mutagenesis. However, retrotransposons remain difficult to detect, mostly due to their repetitive nature, which has resulted in few comprehensive studies. Therefore, the impact of retrotransposons in tumorigenesis remains considerably unknown. In this thesis work, we aimed at investigating retrotransposition in colorectal tumorigenesis by utilizing whole genome and long-read sequencing.

In study I, we identified recurrent somatic retrotranspositions arising from a full- length LINE-1 located in the first intron of TTC28. In study II, we utilized LDI-PCR and Nanopore sequencing and detected 39 transductions, including 25 novel events, from the same LINE-1. Moreover, high coverage long-read sequencing allowed us to build the inserted sequence in 90% of the cases. In study III, we identified 5072 somatic insertions including 346 transductions in 202 colorectal tumors. We found recurrent insertions in fragile sites and cancer genes, and a recurrent hit in exon 16 of APC. In addition, high insertion count was associated with the CpG island methylator phenotype, the genomic fraction of allelic imbalance and CRC poor survival.

1. LINE-1 in TTC28

1.1. Accurate interpretation of WGS data

The LINE-1 identified to be active in study I is a reference full-length element which was previously identified to be active (Beck et al. 2010; Brouha et al. 2003). The same LINE-1 has been identified as the most active element in 290 samples from 12 different tumor types (Tubio et al. 2014), in 202 colorectal tumors (Cajuso et al. 2019), and in 2954 tumors across 38 tumor types (Rodriguez-Martin et al. 2020). In addition, it has been identified to be active in esophageal adenocarcinomas (Paterson et al. 2015) and endometriosis associated ovarian cancer (Xia et al. 2017). On the other hand, a previous study had identified re- current chromosomal rearrangements involving TTC28 and interpreted them as translocations (Network and The Cancer Genome Atlas Network 2012). They predicted that the translocations inactivated TTC28 and, since TTC28 is a target

46

of TP53 (Brady et al. 2011), its inactivation was suggested to enhance tumor progression. However, we and others (Pitkänen et al. 2014; Tubio et al. 2014; Paterson et al. 2015) have indicated that these events are more likely to be LINE-1 retrotransponsitions rather than inactivating translocations. The misinterpretation of LINE-1 transductions, in particular orphan transductions, could arise from the mobilization of the unique downstream sequence which can resemble a translo- cation event. However, an accurate interpretation is possible when hallmarks of retrotransposition such as the LINE-1 sequence, polyA tail and TSD are identi- fied at the target sites.

The correct identification of structural rearrangements is key for the accurate interpretation of cancer data. In this case, translocations were predicted to inactivate TTC28. However, we did not predict any effect on TTC28 due to retrotransposition, since the LINE-1 “copy-and-paste” mechanism leaves the source element intact.

1.2. Detection of subclonal insertions

We were able to identify 25 subclonal insertions, not detected by 40x WGS, by using LDI-PCR and Nanopore sequencing. These findings indicate that the number of insertions detected in study I, study III and probably several other studies, is underestimated since they probably missed subclonal insertions. This is expected because of low sequencing coverage, tumor heterogeneity and normal tissue contamination. The presence of clonal and subclonal insertions indicates that retrotransposition is an ongoing tumorigenic process, although somatic insertions are acquired discontinuously as previously reported (Rodić et al. 2015).

In study III, we investigated the activity of other reference LINE-1s in addition to the LINE-1 in TTC28. As mentioned above, this LINE-1 was the most active in our CRC dataset as well as in other tumor types (Tubio et al. 2014; Rodriguez-Martin et al. 2020). It is important to note that in these studies, the identification of active LINE-1s relied on the transduction rate and different LINE-1s are active in different stages of the tumorigenic process (Tubio et al. 2014). Therefore, additional LINE- 1s might be active in later stages of tumorigenesis or might have lower transduction rates and therefore, their activity might be underestimated. Additional studies utilizing high coverage long-read sequencing will likely shed light on the clonality and timing of retrotransposition, as well as, the identification of which LINE-1s are most active in different stages of the tumorigenic process.

1.3. Elucidation of insertion characteristics

The utilization of long-read sequencing allowed the resolution of both insertion junctions and the elucidation of insertion characteristics in study II. Seventy-seven

47

percent of the insertions were orphan transductions, where no LINE-1 sequence was identified. The frequency of orphan transductions is higher than the rate reported by Tubio et al. (Tubio et al. 2014), which is probably caused by higher sensitivity of the LDI-PCR and Nanopore sequencing approach. Target site duplications were found in 82% of the insertions which is consistent with the retrotransposon insertion mechanism as previously reported in CRC (Solyom et al. 2012) and in transgenic mouse models (Babushok 2005). Twin-priming was observed in 45% of the insertions. We suggested that the high rate of twin priming could be due to the presence of several stretches of Ts at the 3’ end of the LINE- 1, which could be used as a second primer during RT. However, a higher rate of twin-priming could also be due to a higher resolution of the complete sequence. Furthermore, we found that nanopore sequencing was better at reconstructing the consensus sequences than local assembly of short-read data. However, LDI-PCR failed to amplify one insertion, which could be due to the formation of a secondary structure between the insertion polyT tail and a nearby reference stretch of As.

2. Genomic characterization of retrotransposition

In study III, we characterized somatic retrotranspositions in 202 colorectal tumors. The majority of somatic insertions (99%) originated from LINE-1s, but we also detected a few insertions from Alus, SVAs, and HERVs. Since there is no evidence of HERV activity in humans, the detected somatic HERV insertions could be false positive calls or the result of microhomology-mediated break-induced repair (Hastings et al. 2009) as suggested previously (Lee et al. 2012). Regarding the genomic distribution of somatic insertions, we found higher insertion count in closed chromatin and in late replicating regions as previously reported (Helman et al. 2014; Tubio et al. 2014). However, higher insertion density was found in open chromatin after adjusting for L1 motif content and replication time in a recent report (Rodriguez-Martin et al. 2020). Among insertions in genes, higher insertion count was identified in genes with lower expression than in genes with higher expression, also consistent with recent studies (Helman et al. 2014; Rodriguez- Martin et al. 2020).

2.1. Recurrent insertions in protein-coding genes

We identified recurrent insertions in 333 protein-coding genes from which 12 were in fragile sites and 15 were in Cancer Census genes. The majority of these insertions are located in the introns, which is consistent with the intronic preference estimated in study I. Although insertions have been shown to affect gene expression (Lee et al. 2012; Helman et al. 2014), no effect was found in our study nor in Tubio et al. (Tubio et al. 2014). However, in our case this result could be due to the approach we utilized, since we could only assess the overall effect

48

of insertions on nearby genes due to a small sample size. Therefore, the impact of somatic retrotransposition on gene expression should be investigated further.

The highest number of insertions was detected in LRP1B. LRP1B is a common fragile site gene, a hotspot for HPV integration and a cancer census gene (Rajaram et al. 2013; Hu et al. 2015). Recurrent insertions in genes frequently mutated in cancer have been previously identified (Lee et al. 2012; Solyom et al. 2012; Helman et al. 2014). In fact, we identified 15 known cancer genes with recurrent insertions. Recurrent insertions in cancer genes can be the result of several characteristics such as sequence composition, chromatin state, replication timing, gene length and somatic selection. However, evidence of insertions inactivating TSGs or activating oncogenes remains scarce. Further studies investigating disruption of cancer genes by other means than insertion, such as LINE-1 mediated chromosomal rearrangements (Rodriguez-Martin et al. 2020), should provide additional evidence of how often retrotransposition inactivates tumor suppressor genes and activates oncogenes.

2.2. Retrotransposition as a source of fragility

We identified recurrent insertions in 12 out of 21 known common fragile site genes. We observed that less expressed fragile site genes were targeted by retrotranspositions rather than AIs. Helman et al. already observed an enrichment of insertions in large common fragile sites located in late replicating regions and closed chromatin (Helman et al. 2014). Thus, retrotransposition seems to contribute to genomic fragility in genes with lower expression which may be less prone to allelic imbalances.

2.3. Retrotransposition as early tumor initiating events

We identified 72 insertions in the exons of genes from which two were in APC. Insertions in the exon 16 of APC have been reported twice by two independent studies. In 1992, Miki et al. found the first LINE-1 insertion that could act as a tu- mor initiating event, which was later supported by the identification of the second insertion in 2016 (Miki et al. 1992; Scott et al. 2016). Our findings also indicate that LINE-1 insertions may serve as tumor initiating events, although in a very small proportion of CRCs. In these patients, the source LINE-1 that led to the in- sertions probably evaded somatic repression in the corresponding normal tissue as it was reported in 2016 (Scott et al. 2016).

49

3. Retrotransposition and molecular and clinical characteristics

The availability of clinical data allowed the investigation of the relation between somatic insertion count and molecular and clinical characteristics.

3.1. Retrotransposition correlates with CIMP and AI

We applied a multiple linear regression model to investigate whether insertion count was associated with tumor location, TP53 mutation, MSI, the genomic fraction of AI, CIMP, tumor stage, sex and age at diagnosis. After model fit assessment, we found that high insertion count was positively correlated with CIMP and the genomic fraction of AI. We did not find an association with age at diagnosis, although it had been reported (Solyom et al. 2012). This could be due to a larger sample size and the addition of more covariates in our study. We did not find a significant association with TP53 mutation either, despite its association with ORF1p expression (Rodić et al. 2014) and insertion count (Rodriguez-Martin et al. 2020). Again, this could be due to differences in the datasets or insertion count is associated with TP53 mutation in the absence of the AI covariate.

Furthermore, the association between insertion count and CIMP is slightly unexpected since LINE-1 promoter hypomethylation is a characteristic of active LINE-1s (Tubio et al. 2014). However, gene promoter hypermethylation and LINE- 1 promoter hypomethylation are not necessarily exclusive. In fact, LINE-1 promoter hypomethylation was inversely associated with CIMP and MSI in CRC (Ogino et al. 2008a). Despite an association with CIMP we did not find an association with MSI. This could be due to a small number of MSI samples, or because insertion count is associated with CIMP independently of MSI. In fact, the association with CIMP remained significant independently of BRAF mutation and, after applying the same model with only MSS samples. Therefore, further research is needed to further explore and validate these associations in additional datasets. In addition, long-read sequencing of the epigenome of tumors with high insertion count could elucidate whether high insertion count arises from only a few hot LINE-1s or from global LINE-1 promoter hypomethylation. Additionally, it could elucidate whether insertion count, LINE-1 hypomethylation or hot LINE-1 hypomethylation are associated, and in turn associated with CIMP.

50

3.2. Retrotransposition correlates with poor CRC survival

We applied a Proportional Hazard Cox model and identified that insertion count was associated with poor CRC survival, independently of known prognostic factors. These findings are in concordance with the association between LINE-1 hypomethylation and poorer prognosis in several cancers (Pattamadilok et al. 2008; Ogino et al. 2008b; Saito et al. 2010; van Hoesel et al. 2012; Iwagami et al. 2013; Harada et al. 2015). However, our study shows that retrotransposon activity- not only LINE-1 hypomethylation- is associated with poor survival in a multivariate model. Future studies should validate this association including covariates known to affect survival in a more extended dataset, but also in other tumor types. In addition, the association between insertion count and other covariates, such as tumor immune cell infiltration, should be explored.

51

CONCLUSIONS AND FUTURE PROSPECTS

Retrotransposition is a hallmark of several cancers with high activity reported in CRC. However, the somatic landscape of retrotransposition in CRC has remained considerably unexplored. The studies in this thesis work highlight the importance of accurate interpretation of whole genome data and provide evidence of active and ongoing retrotransposon activity in CRC. Furthermore, the association with CIMP and the genomic fraction of AI- both separate characteristics of the two distinct molecular pathways- suggests the presence of more molecular subtypes with respect to retrotransposon activity and CRC prognosis.

Although our findings indicate that retrotransposition may play a more important role in tumorigenesis, it remains unclear how often they affect gene expression and how often retrotransposition can indirectly target cancer genes. LINE-1- mediated rearrangements result from aberrant LINE-1 integrations that can lead to large deletions, duplications and translocations. These rearrangements could involve deletions and duplications of several megabases, thus involving numerous TSGs and oncogenes but also enhancer and repressor regions. The combination of whole genome and transcriptome long-read sequencing should elucidate how often retrotransposition affects gene expression and the role of LINE-1-mediated rearrangements in promoting tumorigenesis.

In addition, retrotransposon sequences often contain several transcription factor binding sites. These sites might be functional and therefore modify the regulatory cancer genome. However, the elucidation of the inserted sequences remains challenging by short-read sequencing. Long-read sequencing can sequence in one read the whole insertion and therefore, elucidate how often retrotransposition might bring regulatory elements to new locations.

Furthermore, several studies suggest that retrotransposons are already active in the precancerous lesions. However, comprehensive characterizations of somatic retrotransposition in precancerous lesions and the corresponding normal tissue are scarce. Future studies characterizing the somatic landscape of retrotransposition in tumors versus the corresponding normals could provide insights into predicting which cells are more likely to become cancerous. In addition, the catalog of germline insertions in human populations remains largely unexplored. However, the implementation of whole genome long-read sequencing in large populations will provide tools for the characterization of the complement of human polymorphic insertions, and for the identification of novel insertions that

52

are present in only a few individuals. Several of these insertions could be responsible for disease phenotypes or an increase in disease risk. Furthermore, a catalog of full-length insertions could contribute to understanding what makes some cancer genomes more prone to retrotransposition than others.

To conclude, this is an exciting time for the field of retrotransposition. With the development of molecular techniques, such as long-read sequencing and single cell sequencing, several questions will be resolved in the upcoming years. The better understanding of the effects and timing of retrotransposition, as well as what makes some genomes more permissive than others, can provide novel insights into patient stratification and cancer prevention.

53

“Go to the limits of your longing”: “Let everything happen to you: beauty and terror. / Just keep going. No feeling is final.”

- Rainer Maria Rilke

54

ACKNOWLEDGEMENTS

This work was carried out at the Department of Medical and Clinical Genetics and Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki during 2012-2020. The present and former directors are thanked for providing excellent research facilities and the Doctoral Programme in Biomedicine for high-quality education. The Ida Montin Foundation, The Cancer Society of Finland, The Maud Kuistila Memorial foundation and The Juhani Aho Foundation for Medical Research are also thanked for their financial support.

My deepest gratitude goes to my supervisor Lauri Aaltonen for his endless support and for giving me the opportunity to be a part of such an outstanding team. I admire Lauri’s scientific curiosity and I am grateful for his many scientific ideas and questions that have always encouraged me to learn more. I am also deeply thankful to Outi Kilpivaara and Kimmo Palin for their supervision. I thank Outi for challenging me to always think critically and for trusting me with my projects. I thank Kimmo P. for his patience, everyday advice and our many great scientific discussions. I am grateful to the three of you for your contin- uous help and moral support during all these years. I am also indebted to Päivi Sulo, Esa Pitkänen and Kimmo P. for accompanying me in the fascinating journey to unravel the role of retrotransposons in cancer.

Saara Ollila and Alan Schulman are thanked for their efforts on the thesis committee and their encouragement throughout my PhD. Kaisa Huhtinen and Csilla Spikey are thanked for reviewing my thesis and for their constructive and highly valuable comments.

I sincerely thank all collaborators without whom this work would not have been possible: the surgeons Heikki Järvinen, Jukka-Pekka Mecklin, Laura Renkonen-Sinisalo, Anna Lepistö, Selja Koskensalo and Toni Seppälä for providing excellent patient material; Ari Ristimäki and Jan Böhm for their expertise in pathology; and Lisa Kauppi and Barun Pradhan for their contribution to the Nanopore project.

My heartfelt appreciation goes to all members - former and present - of the Aaltonen group for providing such a supportive and inspiring atmosphere. I thank Auli Karhu for her support, sense of humor and our many chats during all these years. Netta Mäkinen, for being part of the organization of the CancerBio Summer School and for her company in AACR. Esa Pitkänen for his guidance, especially at the beginning of my PhD. Niko Välimäki, for his interest in my projects. Eevi Kaasinen, for being so helpful and fun to work with especially in the TTC28 project. Janne Ravantti and Heikki Metsola for maintaining our IT infrastructure and always being so helpful. Mervi Aavikko, for her enthusiasm, kindness and for recruiting the Science Slam Helsinki team - Julia, Linda VdB and Erkka - whom I thank for such a wonderful experience. I thank Mervi (again), Eevi, Miika, Riku, Iikki and Heikki for bringing the lab spirit to Menorca. Special thanks to Iikki, for the great company both inside and outside the lab; Riku for all his help in all my projects; and Miika for the many deep scientific discussions. Linda VdB, Sofie and Miika are also thanked for the many dinners, movies and parties together.

55

Sini, Marjo, Sirpa, Inga-Lill, Mairi, Iina and Alison are also thanked. You are of vital importance for every single project in the lab. I thank Sini for always being so supportive and fun; Sirpa, for being so kind and happy to help; and Alison, you have been an excellent work mate but - most importantly- a really good friend to me.

My warmest gratitude goes to all members of the CRC family, our office has been a continuous source of knowledge, fun and friendship. I dearly thank Alex for all the support and the many memorable moments together; Tomas, for his knowledge, interest and ability to explain me statistics; Ulrika, for our many years of companionship; Johanna, for all the sweet and funny moments; Jaana, for all the moments of support and festival Jaana; Aurora, for all the little pranks that have enlightened the atmosphere and Sam, for all the help, friendship and understanding. I dearly thank all of you for every moment we have shared together.

My friends and colleagues in Biomedicum are appreciated for their support and company throughout the years. I thank Diego Balboa, Julia Casado, Alejandra Cervera and Tiia Pelkonen as well as the many members of the Sulkava team for all the fun moments together. Also, I would not have survived all these years in Finland without my everyday normal friends: Edu, Maria, Sulo, Ceci, Altai, Alexey, Fia, Jan, Eeva, Peter, Heini and Erkka.

My love and gratitude go to my dear family for always encouraging me to pursue what I believe in. I am deeply grateful to my father, Vicente, for all the advice, for teaching me how to appreciate the small things in life and for always challenging me; to my mother, Carmen, for her never ending support, unconditional love and our many conversations about life; and to my brother, Esteban, for not only being my big brother but also my best friend. I also want to express my profound gratitude to Kimmo M., for being my partner in life, for his unconditional support and for making me laugh every single day. I dearly thank all of you for always being there for me.

And finally my sincere gratitude goes to all the patients and families who participated in this study. Without their support, our research would not be possible.

Helsinki, November 2020

56

REFERENCES

Aaltonen LA, Salovaara R, Kristo P, Canzian F, Hemminki A, Peltomäki P, Chadwick RB, Kääriäinen H, Eskelinen M, Järvinen H, Mecklin JP, de la Chapelle A (1998) Incidence of hereditary nonpolyposis colorectal cancer and the feasibility of molecular screening for the disease. N Engl J Med 338:1481–1487

Achanta P, Steranka JP, Tang Z, Rodić N, Sharma R, Yang WR, Ma S, Grivainis M, Huang CRL, Schneider AM, Gallia GL, Riggins GJ, Quinones-Hinojosa A, Fenyö D, Boeke JD, Burns KH (2016) Somatic retrotransposition is infrequent in glioblastomas. Mob DNA 7:22

Adams JW, Kaufman RE, Kretschmer PJ, Harrison M, Nienhuis AW (1980) A family of long reiterated DNA sequences, one copy of which is next to the human beta globin gene. Nucleic Acids Research 8:6113–6128

Ahl V, Keller H, Schmidt S, Weichenrieder O (2015) Retrotransposition and Crystal Structure of an Alu RNP in the Ribosome-Stalling Conformation. Mol Cell 60:715– 727

Ambros V (2004) The functions of animal microRNAs. Nature 431:350–355

Babushok DV (2005) L1 integration in a transgenic mouse model. Genome Research 16:240–250

Babushok DV, Kazazian HH Jr (2007) Progress in understanding the biology of the human mutagen LINE-1. Hum Mutat 28:527–539

Batzer MA, Deininger PL (2002) Alu repeats and human genomic diversity. Nature Reviews Genetics 3:370–379

Beard C, Li E, Jaenisch R (1995) Loss of methylation activates Xist in somatic but not in embryonic cells. Genes Dev 9:2325–2334

Beck CR, Collier P, Macfarlane C, Malig M, Kidd JM, Eichler EE, Badge RM, Moran JV (2010) LINE-1 Retrotransposition Activity in Human Genomes. Cell 141:1159–1170

Berger AH, Knudson AG, Pandolfi PP (2011) A continuum model for tumour suppression. Nature 476:163–169

Bhandari V, Hoey C, Liu LY, Lalonde E, Ray J, Livingstone J, Lesurf R, Shiah Y-J, Vujcic T, Huang X, Espiritu SMG, Heisler LE, Yousif F, Huang V, Yamaguchi TN, Yao CQ, Sabelnykova VY, Fraser M, Chua MLK, van der Kwast T, Liu SK, Boutros PC, Bristow RG (2019) Molecular landmarks of tumor hypoxia across cancer types. Nat Genet 51:308–318

Bodea GO, McKelvey EGZ, Faulkner GJ (2018) Retrotransposon-induced mosaicism in the neural genome. Open Biol 8: . https://doi.org/10.1098/rsob.180074

Boeke JD, Garfinkel DJ, Styles CA, Fink GR (1985) Ty elements transpose through an RNA intermediate. Cell 40:491–500

57

Boland CR, Richard Boland C, Goel A (2010) Microsatellite Instability in Colorectal Cancer. Gastroenterology 138:2073–2087.e3

Bourc’his D, Bestor TH (2004) Meiotic catastrophe and retrotransposon reactivation in male germ cells lacking Dnmt3L. Nature 431:96–99

Brady CA, Jiang D, Mello SS, Johnson TM, Jarvis LA, Kozak MM, Kenzelmann Broz D, Basak S, Park EJ, McLaughlin ME, Karnezis AN, Attardi LD (2011) Distinct p53 transcriptional programs dictate acute DNA-damage responses and tumor suppression. Cell 145:571–583

Bray NL, Pimentel H, Melsted P, Pachter L (2016) Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34:888

Brouha B, Schustak J, Badge RM, Lutz-Prigge S, Farley AH, Moran JV, Kazazian HH Jr (2003) Hot L1s account for the bulk of retrotransposition in the human population. Proc Natl Acad Sci U S A 100:5280–5285

Burns KH (2017) Transposable elements in cancer. Nature Reviews Cancer 17:415–424

Cajuso T, Sulo P, Tanskanen T, Katainen R, Taira A, Hänninen UA, Kondelin J, Forsström L, Välimäki N, Aavikko M, Kaasinen E, Ristimäki A, Koskensalo S, Lepistö A, Renkonen-Sinisalo L, Seppälä T, Kuopio T, Böhm J, Mecklin J-P, Kilpivaara O, Pitkänen E, Palin K, Aaltonen LA (2019) Retrotransposon insertions can initiate colorectal cancer and are associated with poor survival. Nat Comms 10:4022

Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, Jacobsen A, Byrne CJ, Heuer ML, Larsson E, Antipin Y, Reva B, Goldberg AP, Sander C, Schultz N (2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2:401–404

Chambeyron S, Popkova A, Payen-Groschêne G, Brun C, Laouini D, Pelisson A, Bucheton A (2008) piRNA-mediated nuclear accumulation of retrotransposon transcripts in the Drosophila female germline. Proc Natl Acad Sci U S A 105:14964– 14969

Christensen SM, Bibillo A, Eickbush TH (2005) Role of the Bombyx mori R2 element N- terminal domain in the target-primed reverse transcription (TPRT) reaction. Nucleic Acids Res 33:6461–6468

Comeaux MS, Roy-Engel AM, Hedges DJ, Deininger PL (2009) Diverse cis factors controlling Alu retrotransposition: what causes Alu elements to die? Genome Res 19:545–555

Consortium IHGS, International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921

Cordaux R, Batzer MA (2009) The impact of retrotransposons on human genome evolution. Nature Reviews Genetics 10:691–703

Coufal NG, Garcia-Perez JL, Peng GE, Yeo GW, Mu Y, Lovci MT, Morell M, O’Shea KS, Moran JV, Gage FH (2009) L1 retrotransposition in human neural progenitor cells. Nature 460:1127–1131

Croce CM (2008) Oncogenes and cancer. N Engl J Med 358:502–511

58

Daura-Oller E, Cabre M, Montero MA, Paternain JL, Romeu A (2009) Specific gene hypomethylation and cancer: New insights into coding region feature trends. Bioinformation 3:340–343

Dewannieux M, Esnault C, Heidmann T (2003) LINE-mediated retrotransposition of marked Alu sequences. Nature Genetics 35:41–48

Dombroski B, Mathias S, Nanthakumar E, Scott A, Kazazian H (1991) Isolation of an active human transposable element. Science 254:1805–1808

Doucet AJ, Wilusz JE, Miyoshi T, Liu Y, Moran JV (2015) A 3′ Poly(A) Tract Is Required for LINE-1 Retrotransposition. Molecular Cell 60:728–741

Doucet-O’Hare TT, Rodić N, Sharma R, Darbari I, Abril G, Choi JA, Young Ahn J, Cheng Y, Anders RA, Burns KH, Meltzer SJ, Kazazian HH Jr (2015) LINE-1 expression and retrotransposition in Barrett’s esophagus and esophageal carcinoma. Proc Natl Acad Sci U S A 112:E4894–900

Doucet-O’Hare TT, Sharma R, Rodić N, Anders RA, Burns KH, Kazazian HH (2016) Somatically Acquired LINE-1 Insertions in Normal Esophagus Undergo Clonal Expansion in Esophageal Squamous Cell Carcinoma. Human Mutation 37:942–954

Ellrott K, Bailey MH, Saksena G, Covington KR, Kandoth C, Stewart C, Hess J, Ma S, Chiotti KE, McLellan M, Sofia HJ, Hutter C, Getz G, Wheeler D, Ding L, MC3 Working Group, Cancer Genome Atlas Research Network (2018) Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Syst 6:271–281.e7

Esnault C, Maestre J, Heidmann T (2000) Human LINE retrotransposons generate processed pseudogenes. Nat Genet 24:363–367

Ewing AD, Gacita A, Wood LD, Ma F, Xing D, Kim M-S, Manda SS, Abril G, Pereira G, Makohon-Moore A, Looijenga LHJ, Gillis AJM, Hruban RH, Anders RA, Romans KE, Pandey A, Iacobuzio-Donahue CA, Vogelstein B, Kinzler KW, Kazazian HH Jr, Solyom S (2015) Widespread somatic L1 retrotransposition occurs early during gastrointestinal cancer evolution. Genome Res 25:1536–1545

Ewing AD, Kazazian HH Jr (2010) High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res 20:1262–1270

Ewing AD, Kazazian HH Jr (2011) Whole-genome resequencing allows detection of many rare LINE-1 insertion alleles in humans. Genome Res 21:985–990

Fearon ER (2011) Molecular genetics of colorectal cancer. Annu Rev Pathol 6:479–507

Fearon ER, Vogelstein B (1990) A genetic model for colorectal tumorigenesis. Cell 61:759–767

Feng Q, Moran JV, Kazazian HH, Boeke JD (1996) Human L1 Retrotransposon Encodes a Conserved Endonuclease Required for Retrotransposition. Cell 87:905–916

Feschotte C, Pritham EJ (2007) DNA Transposons and the Evolution of Eukaryotic Genomes. Annual Review of Genetics 41:331–368

59

Fletcher O, Houlston RS (2010) Architecture of inherited susceptibility to common cancer. Nature Reviews Cancer 10:353–361

Fodde R, Smits R, Clevers H (2001) APC, Signal transduction and genetic instability in colorectal cancer. Nature Reviews Cancer 1:55–67

Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, Sun Y, Jacobsen A, Sinha R, Larsson E, Cerami E, Sander C, Schultz N (2013) Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal 6:l1

Gao Q, Liang W-W, Foltz SM, Mutharasu G, Jayasinghe RG, Cao S, Liao W-W, Reynolds SM, Wyczalkowski MA, Yao L, Yu L, Sun SQ, Fusion Analysis Working Group, Cancer Genome Atlas Research Network, Chen K, Lazar AJ, Fields RC, Wendl MC, Van Tine BA, Vij R, Chen F, Nykter M, Shmulevich I, Ding L (2018) Driver Fusions and Their Implications in the Development and Treatment of Human Cancers. Cell Rep 23:227–238.e3

Garfinkel DJ, Boeke JD, Fink GR (1985) Ty element transposition: reverse transcriptase and virus-like particles. Cell 42:507–517

Giri D, Ittmann M (1999) Inactivation of the PTEN tumor suppressor gene is associated with increased angiogenesis in clinically localized prostate carcinoma. Human Pathology 30:419–424

Glover TW, Berger C, Coyle J, Echo B (1984) DNA polymerase alpha inhibition by aphidicolin induces gaps and breaks at common fragile sites in human chromosomes. Hum Genet 67:136–142

Gonzalo S (2010) Epigenetic alterations in aging. J Appl Physiol 109:586–597

Goodier JL, Ostertag EM, Kazazian HH Jr (2000) Transduction of 3’-flanking sequences is common in L1 retrotransposition. Hum Mol Genet 9:653–657

Hanahan D, Weinberg RA (2011) Hallmarks of cancer: the next generation. Cell 144:646–674

Hancks DC, Goodier JL, Mandal PK, Cheung LE, Kazazian HH Jr (2011) Retrotransposition of marked SVA elements by human L1s in cultured cells. Hum Mol Genet 20:3386–3400

Hancks DC, Kazazian HH Jr (2016) Roles for retrotransposon insertions in human disease. Mob DNA 7:9

Harada K, Baba Y, Ishimoto T, Chikamoto A, Kosumi K, Hayashi H, Nitta H, Hashimoto D, Beppu T, Baba H (2015) LINE-1 Methylation Level and Patient Prognosis in a Database of 208 Hepatocellular Carcinomas. Annals of Surgical Oncology 22:1280– 1287

Hastings PJ, Ira G, Lupski JR (2009) A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genet 5:e1000327

Hata K, Sakaki Y (1997) Identification of critical CpG sites for repression of L1 transcription by DNA methylation. Gene 189:227–234

60

Hattori M, Kuhara S, Takenaka O, Sakaki Y (1986) L1 family of repetitive DNA sequences in primates may be derived from a sequence encoding a reverse transcriptase-related protein. Nature 321:625–628

Helman E, Lawrence MS, Stewart C, Sougnez C, Getz G, Meyerson M (2014) Somatic retrotransposition in human cancer revealed by whole-genome and exome sequencing. Genome Res 24:1053–1063

Herskowitz I (1987) Functional inactivation of genes by dominant negative mutations. Nature 329:219–222

Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, Shen R, Taylor AM, Cherniack AD, Thorsson V, Akbani R, Bowlby R, Wong CK, Wiznerowicz M, Sanchez-Vega F, Robertson AG, Schneider BG, Lawrence MS, Noushmehr H, Malta TM, Cancer Genome Atlas Network, Stuart JM, Benz CC, Laird PW (2018) Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell 173:291–304.e6

Hofstra RMW, Landsvater RM, Ceccherini I, Stulp RP, Stelwagen T, Luo Y, Pasini B, Hoppener JWM, van Amstel HKP, Romeo G, Lips CJM, Charles H C (1994) A mutation in the RET proto-oncogene associated with multiple endocrine neoplasia type 2B and sporadic medullary thyroid carcinoma. Nature 367:375–376

Hohjoh H, Singer MF (1997a) Ribonuclease and high salt sensitivity of the ribonucleoprotein complex formed by the human LINE-1 retrotransposon. J Mol Biol 271:7–12

Hohjoh H, Singer MF (1997b) Sequence-specific single-strand RNA binding protein encoded by the human LINE-1 retrotransposon. EMBO J 16:6034–6043

Huang CRL, Schneider AM, Lu Y, Niranjan T, Shen P, Robinson MA, Steranka JP, Valle D, Civin CI, Wang T, Wheelan SJ, Ji H, Boeke JD, Burns KH (2010) Mobile interspersed repeats are major structural variants in the human genome. Cell 141:1171–1182

Hu Z, Zhu D, Wang W, Li W, Jia W, Zeng X, Ding W, Yu L, Wang X, Wang L, Shen H, Zhang C, Liu H, Liu X, Zhao Y, Fang X, Li S, Chen W, Tang T, Fu A, Wang Z, Chen G, Gao Q, Li S, Xi L, Wang C, Liao S, Ma X, Wu P, Li K, Wang S, Zhou J, Wang J, Xu X, Wang H, Ma D (2015) Genome-wide profiling of HPV integration in cervical cancer identifies clustered genomic hot spots and a potential microhomology- mediated integration mechanism. Nature Genetics 47:158–163

Iskow RC, McCabe MT, Mills RE, Torene S, Pittard WS, Neuwald AF, Van Meir EG, Vertino PM, Devine SE (2010) Natural mutagenesis of human genomes by endogenous retrotransposons. Cell 141:1253–1261

Iwagami S, Baba Y, Watanabe M, Shigaki H, Miyake K, Ishimoto T, Iwatsuki M, Sakamaki K, Ohashi Y, Baba H (2013) LINE-1 hypomethylation is associated with a poor prognosis among patients with curatively resected esophageal squamous cell carcinoma. Ann Surg 257:449–455

Jung H, Choi JK, Lee EA (2018) Immune signatures correlate with L1 retrotransposition in gastrointestinal cancers. Genome Res 28:1136–1146

Jurka J (2000) Repbase Update: a database and an electronic journal of repetitive

61

elements. Trends in Genetics 16:418–420

Jurka J (1997) Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proc Natl Acad Sci U S A 94:1872–1877

Kakar S, Burgart LJ, Thibodeau SN, Rabe KG, Petersen GM, Goldberg RM, Lindor NM (2003) Frequency of loss of hMLH1 expression in colorectal carcinoma increases with advancing age. Cancer 97:1421–1427

Kapitonov VV, Jurka J (2006) Self-synthesizing DNA transposons in eukaryotes. Proc Natl Acad Sci U S A 103:4540–4545

Kapitonov VV, Jurka J (2001) Rolling-circle transposons in eukaryotes. Proceedings of the National Academy of Sciences 98:8714–8719

Katainen R, Dave K, Pitkänen E, Palin K, Kivioja T, Välimäki N, Gylfe AE, Ristolainen H, Hänninen UA, Cajuso T, Kondelin J, Tanskanen T, Mecklin J-P, Järvinen H, Renkonen-Sinisalo L, Lepistö A, Kaasinen E, Kilpivaara O, Tuupanen S, Enge M, Taipale J, Aaltonen LA (2015) CTCF/cohesin-binding sites are frequently mutated in cancer. Nat Genet 47:818–821

Kazazian HH Jr, Moran JV (2017) Mobile DNA in Health and Disease. N Engl J Med 377:361–370

Kazazian HH Jr, Wong C, Youssoufian H, Scott AF, Phillips DG, Antonarakis SE (1988) Haemophilia A resulting from de novo insertion of L1 sequences represents a novel mechanism for mutation in man. Nature 332:164–166

Khazina E, Truffault V, Büttner R, Schmidt S, Coles M, Weichenrieder O (2011) Trimeric structure and flexibility of the L1ORF1 protein in human L1 retrotransposition. Nat Struct Mol Biol 18:1006–1014

Kim H, Jen J, Vogelstein B, Hamilton SR (1994) Clinical and pathological characteristics of sporadic colorectal carcinomas with DNA replication errors in microsatellite sequences. Am J Pathol 145:148–156

Kinzler KW, Vogelstein B (1997) Cancer-susceptibility genes. Gatekeepers and caretakers. Nature 386:761, 763

Knudson AG (1971) Mutation and Cancer: Statistical Study of Retinoblastoma. Proceedings of the National Academy of Sciences 68:820–823

Kraiss S, Quaiser A, Oren M, Montenarh M (1988) Oligomerization of oncoprotein p53. J Virol 62:4737–4744

Kriegs JO, Churakov G, Jurka J, Brosius J, Schmitz J (2007) Evolutionary history of 7SL RNA-derived SINEs in Supraprimates. Trends Genet 23:158–161

Lahiri DK, Nurnberger JI Jr (1991) A rapid non-enzymatic method for the preparation of HMW DNA from blood for RFLP studies. Nucleic Acids Res 19:5444

Lee E, Iskow R, Yang L, Gokcumen O, Haseley P, Luquette LJ 3rd, Lohr JG, Harris CC, Ding L, Wilson RK, Wheeler DA, Gibbs RA, Kucherlapati R, Lee C, Kharchenko PV, Park PJ, Cancer Genome Atlas Research Network (2012) Landscape of somatic retrotransposition in human cancers. Science 337:967–971

62

Leibold DM, Swergold GD, Singer MF, Thayer RE, Dombroski BA, Fanning TG (1990) Translation of LINE-1 DNA elements in vitro and in human cells. Proc Natl Acad Sci U S A 87:6990–6994

Le Toriellec E, Despouy G, Pierron G, Gaye N, Joiner M, Bellanger D, Vincent-Salomon A, Stern M-H (2008) Haploinsufficiency of CDKN1B contributes to leukemogenesis in T-cell prolymphocytic leukemia. Blood 111:2321–2328

Li E, Beard C, Jaenisch R (1993) Role for DNA methylation in genomic imprinting. Nature 366:362–365

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760

Liu J, Lichtenberg T, Hoadley KA, Poisson LM, Lazar AJ, Cherniack AD, Kovatich AJ, Benz CC, Levine DA, Lee AV, Omberg L, Wolf DM, Shriver CD, Thorsson V, Cancer Genome Atlas Research Network, Hu H (2018) An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell 173:400–416.e11

Luan DD, Korman MH, Jakubczak JL, Eickbush TH (1993) Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: a mechanism for non-LTR retrotransposition. Cell 72:595–605

Lyon MF (1961) Gene Action in the X-chromosome of the Mouse (Mus musculus L.). Nature 190:372–373

Malesci A, Laghi L, Bianchi P, Delconte G, Randolph A, Torri V, Carnaghi C, Doci R, Rosati R, Montorsi M, Roncalli M, Gennari L, Santoro A (2007) Reduced likelihood of metastases in patients with microsatellite-unstable colorectal cancer. Clin Cancer Res 13:3831–3839

Martin SL, Branciforte D, Keller D, Bain DL (2003) Trimeric structure for an essential protein in L1 retrotransposition. Proc Natl Acad Sci U S A 100:13815–13820

Mcclintock B (1956) Controlling elements and the gene. Cold Spring Harb Symp Quant Biol 21:197–216

Michor F, Iwasa Y, Rajagopalan H, Lengauer C, Nowak MA (2004) Linear model of colon cancer initiation. Cell Cycle 3:358–362

Miki Y, Nishisho I, Horii A, Miyoshi Y, Utsunomiya J, Kinzler KW, Vogelstein B, Nakamura Y (1992) Disruption of the APC gene by a retrotransposal insertion of L1 sequence in a colon cancer. Cancer Res 52:643–645

Mills RE, Andrew Bennett E, Iskow RC, Devine SE (2007) Which transposable elements are active in the human genome? Trends in Genetics 23:183–191

Molaro A, Hodges E, Fang F, Song Q, McCombie WR, Hannon GJ, Smith AD (2011) Sperm methylation profiles reveal features of epigenetic inheritance and evolution in primates. Cell 146:1029–1041

Mori Y, Nagse H, Ando H, Horii A, Ichii S, Nakatsuru S, Aoki T, Miki Y, Mori T, Nakamura Y (1992) Somatic mutations of the APC gene in colorectal tumors: mutation cluster region in the APC gene. Human Molecular Genetics 1:229–233

63

Morrish TA, Gilbert N, Myers JS, Vincent BJ, Stamato TD, Taccioli GE, Batzer MA, Moran JV (2002) DNA repair mediated by endonuclease-independent LINE-1 retrotransposition. Nat Genet 31:159–165

Mulligan LM, Kwok JBJ, Healey CS, Elsdon MJ, Eng C, Gardner E, Love DR, Mole SE, Moore JK, Papi L, Ponder MA, Telenius H, Tunnacliffe A, Ponder BAJ (1993) Germ- line mutations of the RET proto-oncogene in multiple endocrine neoplasia type 2A. Nature 363:458–460

Nagy R, Sweet K, Eng C (2004) Highly penetrant hereditary cancer syndromes. Oncogene 23:6445–6470

Nancy L Craig, Robert Craigie, Martin Gellert, Alan M Lambowitz (2002) Mobile DNA II. ASM Press

Negrini S, Gorgoulis VG, Halazonetis TD (2010) Genomic instability — an evolving hallmark of cancer. Nature Reviews Molecular Cell Biology 11:220–228

Network TCGA, The Cancer Genome Atlas Network (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487:330–337

Nowell P (1976) The clonal evolution of tumor cell populations. Science 194:23–28

Nygren AOH, Ameziane N, Duarte HMB, Vijzelaar RNCP, Waisfisz Q, Hess CJ, Schouten JP, Errami A (2005) Methylation-specific MLPA (MS-MLPA): simultaneous detection of CpG methylation and copy number changes of up to 40 sequences. Nucleic Acids Res 33:e128

Ogino S, Kawasaki T, Nosho K, Ohnishi M, Suemoto Y, Kirkner GJ, Fuchs CS (2008a) LINE-1 hypomethylation is inversely associated with microsatellite instability and CpG island methylator phenotype in colorectal cancer. International Journal of Cancer 122:2767–2773

Ogino S, Nosho K, Kirkner GJ, Kawasaki T, Chan AT, Schernhammer ES, Giovannucci EL, Fuchs CS (2008b) A cohort study of tumoral LINE-1 hypomethylation and prognosis in colon cancer. J Natl Cancer Inst 100:1734–1738

Ongen H, Andersen CL, Bramsen JB, Oster B, Rasmussen MH, Ferreira PG, Sandoval J, Vidal E, Whiffin N, Planchon A, Padioleau I, Bielser D, Romano L, Tomlinson I, Houlston RS, Esteller M, Orntoft TF, Dermitzakis ET (2014) Putative cis-regulatory drivers in colorectal cancer. Nature 512:87–90

Ostertag EM, Goodier JL, Zhang Y, Kazazian HH Jr (2003) SVA elements are nonautonomous retrotransposons that cause disease in humans. Am J Hum Genet 73:1444–1451

Ostertag EM, Kazazian HH Jr (2001) Twin priming: a proposed mechanism for the creation of inversions in L1 retrotransposition. Genome Res 11:2059–2065

Palin K, Pitkänen E, Turunen M, Sahu B, Pihlajamaa P, Kivioja T, Kaasinen E, Välimäki N, Hänninen UA, Cajuso T, Aavikko M, Tuupanen S, Kilpivaara O, van den Berg L, Kondelin J, Tanskanen T, Katainen R, Grau M, Rauanheimo H, Plaketti R-M, Taira A, Sulo P, Hartonen T, Dave K, Schmierer B, Botla S, Sokolova M, Vähärautio A, Gladysz K, Ongen H, Dermitzakis E, Bramsen JB, Ørntoft TF, Andersen CL, Ristimäki A, Lepistö A, Renkonen-Sinisalo L, Mecklin J-P, Taipale J, Aaltonen LA

64

(2018) Contribution of allelic imbalance to colorectal cancer. Nat Commun 9:3664

Paterson AL, Weaver JMJ, Eldridge MD, Tavaré S, Fitzgerald RC, Edwards PAW, OCCAMs Consortium (2015) Mobile element insertions are frequent in oesophageal adenocarcinomas and can mislead paired-end sequencing analysis. BMC Genomics 16:473

Pattamadilok J, Huapai N, Rattanatanyong P, Vasurattana A, Triratanachat S, Tresukosol D, Mutirangura A (2008) LINE-1 hypomethylation level as a potential prognostic factor for epithelial ovarian cancer. Int J Gynecol Cancer 18:711–717

Peltomaki P, Aaltonen L, Sistonen P, Pylkkanen L, Mecklin J, Jarvinen H, Green J, Jass, Weber J, Leach F, Et A (1993) Genetic mapping of a locus predisposing to human colorectal cancer. Science 260:810–812

Pickeral OK, Makałowski W, Boguski MS, Boeke JD (2000) Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res 10:411–415

Pierre Capy, Claude Bazin, Dominique Higuet, Thierry Langin (1996) The dynamics and Evolution of Transposable elements. Springer

Pino MS, Chung DC (2010) The Chromosomal Instability Pathway in Colon Cancer. Gastroenterology 138:2059–2072

Pitkänen E, Cajuso T, Katainen R, Kaasinen E, Välimäki N, Palin K, Taipale J, Aaltonen LA, Kilpivaara O (2014) Frequent L1 retrotranspositions originating from TTC28 in colorectal cancer. Oncotarget 5:853–859

Polakis P (1995) Mutations in the APC gene and their implications for protein structure and function. Curr Opin Genet Dev 5:66–71

Pradhan B, Cajuso T, Katainen R, Sulo P, Tanskanen T, Kilpivaara O, Pitkänen E, Aaltonen LA, Kauppi L, Palin K (2017) Detection of subclonal L1 transductions in colorectal cancer by long-distance inverse-PCR and Nanopore sequencing. Scientific Reports 7

Prak ET, Kazazian HH Jr (2000) Mobile elements and the human genome. Nat Rev Genet 1:134–144

Pritham EJ, Putliwala T, Feschotte C (2007) Mavericks, a novel class of giant transposable elements widespread in eukaryotes and related to DNA viruses. Gene 390:3–17

Raiz J, Damert A, Chira S, Held U, Klawitter S, Hamdorf M, Löwer J, Strätling WH, Löwer R, Schumann GG (2012) The non-autonomous retrotransposon SVA is trans- mobilized by the human LINE-1 protein machinery. Nucleic Acids Res 40:1666– 1683

Rajagopalan H, Bardelli A, Lengauer C, Kinzler KW, Vogelstein B, Velculescu VE (2002) Tumorigenesis: RAF/RAS oncogenes and mismatch-repair status. Nature 418:934

Rajaram M, Zhang J, Wang T, Li J, Kuscu C, Qi H, Kato M, Grubor V, Weil RJ, Helland A, Borrenson-Dale A-L, Cho KR, Levine DA, Houghton AN, Wolchok JD, Myeroff L, Markowitz SD, Lowe SW, Zhang M, Krasnitz A, Lucito R, Mu D, Powers RS (2013) Two Distinct Categories of Focal Deletions in Cancer Genomes. PLoS One

65

8:e66264

Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO (2012) DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28:i333–i339

Rawla P, Sunkara T, Barsouk A (2019) Epidemiology of colorectal cancer: incidence, mortality, survival, and risk factors. Gastroenterology Review 14:89–103

Rodić N, Sharma R, Sharma R, Zampella J, Dai L, Taylor MS, Hruban RH, Iacobuzio- Donahue CA, Maitra A, Torbenson MS, Goggins M, Shih I-M, Duffield AS, Montgomery EA, Gabrielson E, Netto GJ, Lotan TL, De Marzo AM, Westra W, Binder ZA, Orr BA, Gallia GL, Eberhart CG, Boeke JD, Harris CR, Burns KH (2014) Long interspersed element-1 protein expression is a hallmark of many human cancers. Am J Pathol 184:1280–1286

Rodić N, Steranka JP, Makohon-Moore A, Moyer A, Shen P, Sharma R, Kohutek ZA, Huang CR, Ahn D, Mita P, Taylor MS, Barker NJ, Hruban RH, Iacobuzio-Donahue CA, Boeke JD, Burns KH (2015) Retrotransposon insertions in the clonal evolution of pancreatic ductal adenocarcinoma. Nat Med 21:1060–1064

Rodriguez-Martin B, Alvarez EG, Baez-Ortega A, Zamora J, Supek F, Demeulemeester J, Santamarina M, Ju YS, Temes J, Garcia-Souto D, Detering H, Li Y, Rodriguez- Castro J, Dueso-Barroso A, Bruzos AL, Dentro SC, Blanco MG, Contino G, Ardeljan D, Tojo M, Roberts ND, Zumalave S, Edwards PAW, Weischenfeldt J, Puiggròs M, Chong Z, Chen K, Lee EA, Wala JA, Raine K, Butler A, Waszak SM, Navarro FCP, Schumacher SE, Monlong J, Maura F, Bolli N, Bourque G, Gerstein M, Park PJ, Wedge DC, Beroukhim R, Torrents D, Korbel JO, Martincorena I, Fitzgerald RC, Van Loo P, Kazazian HH, Burns KH, PCAWG Structural Variation Working Group, Campbell PJ, Tubio JMC, PCAWG Consortium (2020) Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition. Nat Genet. https://doi.org/10.1038/s41588-019-0562-0

Rustgi AK (2007) The genetics of hereditary colon cancer. Genes & Development 21:2525–2538

Saito K, Kawakami K, Matsumoto I, Oda M, Watanabe G, Minamoto T (2010) Long interspersed nuclear element 1 hypomethylation is a marker of poor prognosis in stage IA non-small cell lung cancer. Clin Cancer Res 16:2418–2426

Salovaara R, Loukola A, Kristo P, Kääriäinen H, Ahtola H, Eskelinen M, Härkönen N, Julkunen R, Kangas E, Ojala S, Tulikoura J, Valkamo E, Järvinen H, Mecklin JP, Aaltonen LA, de la Chapelle A (2000) Population-based molecular detection of hereditary nonpolyposis colorectal cancer. J Clin Oncol 18:2193–2200

Sanchez-Vega F, Mina M, Armenia J, Chatila WK, Luna A, La KC, Dimitriadoy S, Liu DL, Kantheti HS, Saghafinia S, Chakravarty D, Daian F, Gao Q, Bailey MH, Liang W-W, Foltz SM, Shmulevich I, Ding L, Heins Z, Ochoa A, Gross B, Gao J, Zhang H, Kundra R, Kandoth C, Bahceci I, Dervishi L, Dogrusoz U, Zhou W, Shen H, Laird PW, Way GP, Greene CS, Liang H, Xiao Y, Wang C, Iavarone A, Berger AH, Bivona TG, Lazar AJ, Hammer GD, Giordano T, Kwong LN, McArthur G, Huang C, Tward AD, Frederick MJ, McCormick F, Meyerson M, Cancer Genome Atlas Research Network, Van Allen EM, Cherniack AD, Ciriello G, Sander C, Schultz N (2018) Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell 173:321–337.e10

66

Schwartz M, Zlotorynski E, Kerem B (2006) The molecular basis of common and rare fragile sites. Cancer Lett 232:13–26

Scott EC, Gardner EJ, Masood A, Chuang NT, Vertino PM, Devine SE (2016) A hot L1 retrotransposon evades somatic repression and initiates human colorectal cancer. Genome Res 26:745–755

Shaikh TH, Roy AM, Kim J, Batzer MA, Deininger PL (1997) cDNAs derived from primary and small cytoplasmic Alu (scAlu) transcripts. J Mol Biol 271:222–234

Sharma S, Kelly TK, Jones PA (2010) Epigenetics in cancer. Carcinogenesis 31:27–36

Shivaswamy S, Bhinge A, Zhao Y, Jones S, Hirst M, Iyer V (2008) Dynamic Remodeling of Individual Nucleosomes Across a Eukaryotic Genome in Response to Transcriptional Perturbation. SciVee

Shukla R, Upton KR, Muñoz-Lopez M, Gerhardt DJ, Fisher ME, Nguyen T, Brennan PM, Baillie JK, Collino A, Ghisletti S, Sinha S, Iannelli F, Radaelli E, Dos Santos A, Rapoud D, Guettier C, Samuel D, Natoli G, Carninci P, Ciccarelli FD, Garcia-Perez JL, Faivre J, Faulkner GJ (2013) Endogenous retrotransposition activates oncogenic pathways in hepatocellular carcinoma. Cell 153:101–111

Singer MF (1982) SINEs and LINEs: Highly repeated short and long interspersed sequences in mammalian genomes. Cell 28:433–434

Skowronski J, Fanning TG, Singer MF (1988) Unit-length line-1 transcripts in human teratocarcinoma cells. Mol Cell Biol 8:1385–1397

Skowronski J, Singer MF (1985) Expression of a cytoplasmic LINE-1 transcript is regulated in a human teratocarcinoma cell line. Proceedings of the National Academy of Sciences 82:6050–6054

Solyom S, Ewing AD, Rahrmann EP, Doucet T, Nelson HH, Burns MB, Harris RS, Sigmon DF, Casella A, Erlanger B, Wheelan S, Upton KR, Shukla R, Faulkner GJ, Largaespada DA, Kazazian HH Jr (2012) Extensive somatic L1 retrotransposition in colorectal tumors. Genome Res 22:2328–2338

Staaf J, Lindgren D, Vallon-Christersson J, Isaksson A, Göransson H, Juliusson G, Rosenquist R, Höglund M, Borg A, Ringnér M (2008) Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol 9:R136

Stratton MR, Campbell PJ, Futreal PA (2009) The cancer genome. Nature 458:719–724

Sutherland GR, Baker E, Richards RI (1998) Fragile sites still breaking. Trends Genet 14:501–506

Swergold GD (1990) Identification, characterization, and cell specificity of a human LINE- 1 promoter. Molecular and Cellular Biology 10:6718–6729

Tabaska JE, Zhang MQ (1999) Detection of polyadenylation signals in human DNA sequences. Gene 231:77–86

Tang Z, Steranka JP, Ma S, Grivainis M, Rodić N, Huang CRL, Shih I-M, Wang T-L, Boeke JD, Fenyö D, Burns KH (2017) Human transposon insertion profiling:

67

Analysis, visualization and identification of somatic LINE-1 insertions in ovarian cancer. Proc Natl Acad Sci U S A 114:E733–E740

Taylor AM, Shih J, Ha G, Gao GF, Zhang X, Berger AC, Schumacher SE, Wang C, Hu H, Liu J, Lazar AJ, Cancer Genome Atlas Research Network, Cherniack AD, Beroukhim R, Meyerson M (2018) Genomic and Functional Approaches to Understanding Cancer Aneuploidy. Cancer Cell 33:676–689.e3

Thayer RE, Singer MF, Fanning TG (1993) Undermethylation of specific LINE-1 sequences in human cells producing a LINE-1 -encoded protein. Gene 133:273–277

Tomasetti C, Vogelstein B (2015) Cancer etiology. Variation in cancer risk among tissues can be explained by the number of stem cell divisions. Science 347:78–81

Toyota M, Ahuja N, Ohe-Toyota M, Herman JG, Baylin SB, Issa JP (1999) CpG island methylator phenotype in colorectal cancer. Proc Natl Acad Sci U S A 96:8681–8686

Tubio JMC, Li Y, Ju YS, Martincorena I, Cooke SL, Tojo M, Gundem G, Pipinikas CP, Zamora J, Raine K, Menzies A, Roman-Garcia P, Fullam A, Gerstung M, Shlien A, Tarpey PS, Papaemmanuil E, Knappskog S, Van Loo P, Ramakrishna M, Davies HR, Marshall J, Wedge DC, Teague JW, Butler AP, Nik-Zainal S, Alexandrov L, Behjati S, Yates LR, Bolli N, Mudie L, Hardy C, Martin S, McLaren S, O’Meara S, Anderson E, Maddison M, Gamble S, Foster C, Warren AY, Whitaker H, Brewer D, Eeles R, Cooper C, Neal D, Lynch AG, Visakorpi T, Isaacs WB, Veer LV, Caldas C, Desmedt C, Sotiriou C, Aparicio S, Foekens JA, Eyfjörd JE, Lakhani SR, Thomas G, Myklebost O, Span PN, Børresen-Dale A-L, Richardson AL, Van de Vijver M, Vincent-Salomon A, Van den Eynden GG, Flanagan AM, Futreal PA, Janes SM, Bova GS, Stratton MR, McDermott U, Campbell PJ, ICGC Breast Cancer Group, ICGC Bone Cancer Group, ICGC Prostate Cancer Group (2014) Mobile DNA in cancer. Extensive transduction of nonrepetitive DNA mediated by L1 retrotransposition in cancer genomes. Science 345:1251343

Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, Rozen SG (2012) Primer3--new capabilities and interfaces. Nucleic Acids Res 40:e115 van Hoesel AQ, van de Velde CJH, Kuppen PJK, Liefers GJ, Putter H, Sato Y, Elashoff DA, Turner RR, Shamonki JM, de Kruijf EM, van Nes JGH, Giuliano AE, Hoon DSB (2012) Hypomethylation of LINE-1 in primary tumor has poor prognosis in young breast cancer patients: a retrospective cohort study. Breast Cancer Res Treat 134:1103–1114

Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA Jr, Kinzler KW (2013) Cancer genome landscapes. Science 339:1546–1558

Wang H, Xing J, Grover D, Hedges DJ, Han K, Walker JA, Batzer MA (2005) SVA elements: a hominid-specific retroposon family. J Mol Biol 354:994–1007

Wang Y-P, Lei Q-Y (2018) Metabolic recoding of epigenetics in cancer. Cancer Commun 38:25

Wei W, Gilbert N, Ooi SL, Lawler JF, Ostertag EM, Kazazian HH, Boeke JD, Moran JV (2001) Human L1 retrotransposition: cis preference versus trans complementation. Mol Cell Biol 21:1429–1439

Xia Z, Cochrane DR, Anglesio MS, Wang YK, Nazeran T, Tessier-Cloutier B, McConechy

68

MK, Senz J, Lum A, Bashashati A, Shah SP, Huntsman DG (2017) LINE-1 retrotransposon-mediated DNA transductions in endometriosis associated ovarian cancers. Gynecol Oncol 147:642–647

Yaeger R, Chatila WK, Lipsyc MD, Hechtman JF, Cercek A, Sanchez-Vega F, Jayakumaran G, Middha S, Zehir A, Donoghue MTA, You D, Viale A, Kemeny N, Segal NH, Stadler ZK, Varghese AM, Kundra R, Gao J, Syed A, Hyman DM, Vakiani E, Rosen N, Taylor BS, Ladanyi M, Berger MF, Solit DB, Shia J, Saltz L, Schultz N (2018) Clinical Sequencing Defines the Genomic Landscape of Metastatic Colorectal Cancer. Cancer Cell 33:125–136.e3

Yoder JA, Walsh CP, Bestor TH (1997) Cytosine methylation and the ecology of intragenomic parasites. Trends Genet 13:335–340

Zehir A, Benayed R, Shah RH, Syed A, Middha S, Kim HR, Srinivasan P, Gao J, Chakravarty D, Devlin SM, Hellmann MD, Barron DA, Schram AM, Hameed M, Dogan S, Ross DS, Hechtman JF, DeLair DF, Yao J, Mandelker DL, Cheng DT, Chandramohan R, Mohanty AS, Ptashkin RN, Jayakumaran G, Prasad M, Syed MH, Rema AB, Liu ZY, Nafa K, Borsu L, Sadowska J, Casanova J, Bacares R, Kiecka IJ, Razumova A, Son JB, Stewart L, Baldi T, Mullaney KA, Al-Ahmadie H, Vakiani E, Abeshouse AA, Penson AV, Jonsson P, Camacho N, Chang MT, Won HH, Gross BE, Kundra R, Heins ZJ, Chen H-W, Phillips S, Zhang H, Wang J, Ochoa A, Wills J, Eubank M, Thomas SB, Gardos SM, Reales DN, Galle J, Durany R, Cambria R, Abida W, Cercek A, Feldman DR, Gounder MM, Hakimi AA, Harding JJ, Iyer G, Janjigian YY, Jordan EJ, Kelly CM, Lowery MA, Morris LGT, Omuro AM, Raj N, Razavi P, Shoushtari AN, Shukla N, Soumerai TE, Varghese AM, Yaeger R, Coleman J, Bochner B, Riely GJ, Saltz LB, Scher HI, Sabbatini PJ, Robson ME, Klimstra DS, Taylor BS, Baselga J, Schultz N, Hyman DM, Arcila ME, Solit DB, Ladanyi M, Berger MF (2017) Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat Med 23:703–713

Zemach A, McDaniel IE, Silva P, Zilberman D (2010) Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328:916–919

Zeng Y, Cullen BR The Biogenesis and Function of MicroRNAs. Gene Expression and Regulation 481–492

69