Regulatory units in evolution, development, and human disease

Regulationseinheiten in Evolution, Entwicklung und humaner Krankheit

Der Naturwissenschaftlichen Fakultät

der Friedrich-Alexander-Universität

Erlangen-Nürnberg

zur Erlangung des Doktorgrades

Dr.rer.nat.

vorgelegt von

Lifei Li Als Dissertation genehmigt von der Naturwissenschaftlichen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg

Tag der mündlichen Prüfung: 28.05.2020 Vorsitzender des Promotionsorgans: Prof. Dr. Georg Kreimer Gutachter: Prof. Dr. Leila Taher Prof. Dr. Christian Pilarsky Declaration of Authorship

I, Lifei Li, declare that this , titled “Regulatory units in evolution, development, and human disease”, and the work presented in it are my own. I confirm that I did not submit previously any part of this thesis for any degree or any other qualification at Friedrich-Alexander

University Erlangen-Nürnberg, nor at any other institution. I also declare that, where I have consulted the published work of others, this is always clearly attributed, and where I have quoted from the work of others, the source is always given. I also confirm that, where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

i Abstract

Misregulation of gene expression can result in broad types of diseases and abnormalities.

For elucidating the molecular mechanisms underlying evolution, development and disease, the

2% protein-coding regions in the human genome have been studied for decades, but the the 98% non-coding regions remain less understood. Many studies have revealed that evolutionary con- served non-coding elements (CNEs) act as cis-regulatory elements (CREs). Nevertheless, not all

CREs are evolutionary conserved and, hence, the identification of CREs has proved challenging.

Some CREs work cooperatively to regulate gene expression. Together with their target gene(s), these CREs form regulatory units. For instance, topologically associating domains (TADs) are large regulatory units in the genome, controlling and restricting the interactions between genes and CREs. Many studies have found the disruption of TADs can lead to gene expression mis- regulation.

This thesis aimed at characterizing regulatory units. In particular, we explored the mecha- nisms by which regulatory unit disruption can lead to changes in gene expression in evolution, development, and disease.

The first part of this thesis focused on the identification and characterization of small reg- ulatory units consisting of two CREs. We identified 5,500 pairs of adjacent CNEs (to which we further refer as “CNE-CNE pairs”) in the human genome with either expanded, conserved or contracted inter-CNE sequences compared to the common mammalian ancestor. Particu- larly, the CNEs in CNE-CNE pairs with conserved and mildly contracted inter-CNEs sequences were most likely to perform as active or poised enhancers; in addition, both CNEs of the pairs exhibited similar epigenetic profiles, suggesting that the CNE-CNE pairs tend to act as small regulatory units. Furthermore, transposon deletions and insertions were associated with the con- traction or expansion of inter-CNE sequences, indicating that transposon activity might disrupt the link between the two CNEs in the CNE-CNE pairs, leading to a loss of cis-regulatory func- tion. Our study identified novel regulatory units, and highlighted the existence of cooperative interaction between adjacent CREs in the regulatory units that are distance-sensitive and can be disrupted by transposon activity. Our findings contribute to understanding the mechanism by which selective forces act on CREs in the context of evolution and human genetic diseases.

ii The second part of this thesis investigated the mechanisms by which regulatory units can be disrupted, leading to gene expression misregulation in disease. To address this, we first constructed 1,467 consensus TADs in the “normal genome” and 1,622 consensus TADs in the “cancer genome”. To evaluate the prognosis value of knowing the location of structural variants such as copy number variants (CNVs) within the TADs in cancer patient survival outcome, we applied Cox regression analysis. In this manner, we identified 35 prognostic TADs; 54% of these

TADs did not contain any genes with a known association to cancer causality, indicating that a large fraction of the TADs have prognostic value independently of coding variants. Furthermore, 34% of the 35 prognostic TADs underwent strong structural perturbations in the cancer genome.

Hence, the prognostic value of at least a fraction of the 35 TADs appeared to be associated with the disruption of normal CRE interactions. Our study emphasized the importance of identifying perturbed regulatory units in monitoring cancer development and progression.

Understanding the molecular mechanisms regulating gene expression will help us to under- stand the formation and evolution of life and to find cures for diseases.

iii Zusammenfassung

Eine fehlerhafte Regulation der Genexpression kann zu einem breiten Spektrum an Krankheiten und Abnormitäten führen. Zur Aufklärung der molekularen Mechanismen, die Evolution, En- twicklung und Krankheit zugrunde liegen, werden die 2% protein-kodierenden Regionen im menschlichen Genom seit Jahrzehnten untersucht, die 98% nichtkodierenden Regionen sind bisher deutlich weniger gut verstanden. Viele Studien haben gezeigt, dass evolutionär kon- servierte nicht-kodierende Elemente (KNEs) als cis-regulatorische Elemente (CREs) fungieren. Jedoch sind nicht alle CREs evolutionär konserviert, weswegen sich die Identifizierung von

CREs als schwierig erwiesen hat. Einige CREs regulieren die Genexpression kooperativ. Zusam- men mit ihrem Zielgen oder ihren Zielgenen bilden diese CREs Regulationseinheiten. Zum

Beispiel sind topologisch assoziierende Domänen (TADs) große Regulationseinheiten im Genom, die die Interaktionen zwischen Genen und CREs kontrollieren und einschränken. Viele Studien haben bestätigt, dass die Störung der TAD-Struktur zu einer fehlerhaften Genexpression führen kann.

Diese Arbeit zielte auf die Charakterisierung von Regulationseinheiten ab. Insbesondere hat sie die Mechanismen erforscht, durch die eine Störung von Regulationseinheiten zu Verän- derungen der Genexpression führt, im Kontext von Evolution, Entwicklung und Krankheit.

Der erste Teil dieser Arbeit hat sich mit der Identifizierung und Charakterisierung kleiner

Regulationseinheiten, bestehend aus zwei CREs, befasst. Wir haben 5500 Paare benachbarter

KNEs im menschlichen Genom identifiziert (im Folgenden als ”KNE-KNE-Paare” bezeich- net), bei denen die Inter-KNE-Sequenzen im Vergleich zu dem gemeinsamen Vorfahren der Säugetiere entweder expandiert, konserviert, oder kontrahiert sind. Insbesondere die KNEs in KNE-KNE-Paaren mit konservierten oder leicht kontrahierten Inter-KNE-Sequenzen wiesen die größte Wahrscheinlichkeit auf, als aktive Enhancer oder ”poised” Enhancer (d.h., in Bere- itschaftsstellung) zu fungieren. Des Weiteren zeigten die zu einem Paar gehörenden KNEs ähn- liche epigenetische Profile, was darauf hindeutet, dass die KNE-KNE-Paare als kleine Regu- lationseinheiten fungieren. Darüber hinaus waren Transposon-Deletionen (respektive Insertio- nen) mit einer Kontraktion (respektive Expansion) der Inter-KNE-Sequenzen assoziiert. Dies deutet an, dass die Transposonaktivität möglicherweise den funktionellen Zusammenhang zwis-

iv chen den beiden KNEs in den KNE-KNE-Paaren stört und so zu einem Verlustihrer cis-regulatorischen

Funktion führt. Unsere Studie hat neue Regulationseinheiten identifiziert und die Existenz ko- operativer Interaktionen zwischen benachbarten CREs, die sich in distanzsensitiven und durch Transposonaktivität zerstörbaren Regulationseinheiten befinden, hervorgehoben. Unsere Ergeb- nisse tragen zum Verständnis des Mechanismus bei, durch den selektive Kräfte auf CREs im

Kontext von Evolution und genetisch bedingten Erkrankungen des Menschen wirken.

Der zweite Teil dieser Arbeit hat die Mechanismen untersucht, durch die Regulationsein- heiten gestört werden können, was zu einer Fehlregulation der Genexpression bei Krankheiten führt. Hierzu haben wir zunächst 1467 Konsensus-TADs im „normalen Genom“ und 1622

Konsensus-TADs im „Krebsgenom“ konstruiert. Um den prognostischen Wert der Kenntnis der

Position von Strukturvarianten wie Kopienzahlvarianten (KZVs) innerhalb der TADs bezüglich der Überlebensdauer von Krebspatienten zu ermitteln, haben wir eine Cox-Regressionsanalyse durchgeführt. Auf diese Weise haben wir 35 prognostische TADs identifiziert; 54% dieser TADs enthielten keine Gene mit einem bekannten Zusammenhang mit der Krebsursache, was darauf hinweist, dass ein großer Teil der TADs unabhängig von kodierenden Varianten einen prog- nostischen Wert hat. Darüber hinaus zeigten 34% der 35 prognostischen TADs starke struk- turelle Störungen im Krebsgenom. Daher schien der prognostische Wert von mindestens einem Bruchteil der 35 TADs mit der Störung normaler CRE-Interaktionen verbunden zu sein. Unsere

Studie zeigt, wie wichtig es ist, gestörte Regulationseinheiten zur Überwachung der Krebsen- twicklung und -progression zu identifizieren.

Das Verständnis der molekularen Mechanismen, die die Genexpression regulieren, wird uns helfen die Entstehung und Evolution des Lebens zu verstehen und Heilmittel für Krankheiten zu finden.

v Table of Contents

Declaration of Authorship i

Abstract v

Table of Contents vi

List of figures viii

List of Tables ix

1 Introduction 1

1.1 Gene Expression Regulation in Eukaryotes ...... 1

1.1.1 Transcription factors and cis-regulatory elements ...... 1

1.1.2 Epigenetic regulation of gene expression ...... 3 1.1.3 Topologically associating domains (TADs) ...... 4

1.2 Genomic variants and gene expression misregulation ...... 5

1.3 Identification and characterization of CREs in the human genome ...... 7

1.3.1 Comparative genomics ...... 7

1.3.2 Next-generation sequencing (NGS) ...... 8 1.4 Public data for multi-assay genomic investigation of gene regulation and disease 10

1.5 Research questions ...... 11

2 Results 14

2.1 Small regulatory units in evolution and development ...... 14

2.2 Disruption of large regulatory units in cancer ...... 19

vi 3 Discussion 25

4 Conclusion and Outlook 28

5 Author Contribution 30

6 Bibliography 32

vii List of Figures

2.1 Types of regulatory units in the genome...... 15

2.2 Definition of CNE-CNE pairs in genome ...... 17

2.3 Conceptual framework for investigating disruption of regulatory units in cancer. 20 2.4 Performance of TAD- and gene-based models for patient survival prediction. . . 23

viii List of Tables

2.1 Species used to identify mammalian conserved and deeply conserved CNE-CNE

pairs...... 16

2.2 Epigenetic profiles of clusters of self-organizing units ...... 18 2.3 Cancer types used to construct TAD- and gene-based LASSO Cox Regression

models...... 21

ix 1 . Introduction 1. Introduction

1.1 Gene Expression Regulation in Eukaryotes

Genes are the DNA sequences in the genome that code for proteins. The process by which a gene gives rise to a protein is called “gene expression” and comprises two major steps: tran- scription and translation. During transcription, the DNA that forms the gene is transcribed to messenger RNA (mRNA). Then, the genetic information in the mRNA is used to build a chain of amino acids in the process of translation. The two processes comprise three stages: initiation, elongation, and termination. In eukaryotic organisms, gene expression is primarily regulated at the level of transcription initiation [1]. Gene expression regulation ensures the appropriate level of expression in the correct cell type at the right time point during cell differentiation and organismal development, and gene expression misregulation can result in a broad spectrum of diseases [2]. Understanding the mechanisms of gene expression regulation is crucial for unrav- eling how different patterns of gene expression control cell differentiation, development, and even cause diseases.

1.1.1 Transcription factors and cis-regulatory elements

Only 2% of the human genome codes for proteins [3]. The remaining 98% of the genome is non-coding in nature. A non-negligible fraction of the non-coding genome plays a function in gene expression regulation and RNA processing [4]. In the past decades, different analysis techniques have been applied to annotate non-coding sequences. However, the comprehensive functional profiling of the non-coding genome still requires more time and effort.

In the process of gene transcription, mRNA synthesis is carried out by RNA polymerase enzymes. In particular, the synthesis of mRNA precursors is catalyzed by the RNA polymerase II. The process is regulated by multiple transcription factors (TFs) [5]: proteins that work alone or combine with other proteins in a complex to promote or prevent the recruitment of the RNA polymerase to a specific gene. TFs are regarded as key components in gene regulatory net- works and comprise a diverse family of proteins. The DNA sequences encoding TFs or other

1 1 . Introduction

DNA-binding proteins are called trans-regulatory elements (TREs). TFs contain specific DNA- binding domains that recognize and attach to specific DNA sequences, known as cis-regulatory elements (CREs). The short DNA sequences located within CREs, which physically interact with the TFs and bind to them, are called TF binding sites (TFBSs) and are typically less than

30 base pairs (bp) in length. TFBSs represent the sequence features of regulatory regions and the conserved sequence patterns of a collection of similar TFBSs are known as a TFBS motif. The accurate detection and characterization of TFBSs are essential contributions to understand the mechanisms of gene expression regulation. Some resources have been developed to annotate TFBS profiles in several species, such as JASPAR and GTRD [6, 7].

In the genome, CREs are brought into the proximity of their target genes through the for- mation of chromatin loops and may be located distantly from them [8, 9]. The binding of TFs to CREs can initiate, enhance or suppress transcription. According to their regulatory func- tions, CREs are commonly categorized into promoters, enhancers, silencers and insulators. In particular, promoters and enhancers are the best understood types.

Promoters are about 100-1000 bp long and are required to initiate transcription. Most pro- moters recognized by the RNA polymerase II are characterized by the presence of a 25 to 35 bp-long TATA box [10]. Promoters are located immediately upstream of the transcription start site (TSS) of their target genes, which makes it easy to find them. Enhancers vary from about

50 to 1500 bp in length [11]. Enhancers can increase the activity of the promoter of their target genes and, thereby, their transcription levels, under specific conditions. The location of the en- hancers relative to their target genes is highly variable: they can lie upstream, downstream or within introns of the genes that they regulate or be located far away [12]. Silencers are similar to enhancers, but involved in repression of transcription [13]. Finally, insulators are typically

500 to 3,000 bp-long sequences blocking promoter-enhancer interactions and serving as barriers against the spreading of the silencing effects of heterochromatin. The nuclear protein CCCTC- binding factor (CTCF) is an example of an insulator-binding protein. CTCF binds to insulators in the genome and mediates insulation through preventing the crosstalk between active and in- active genomic regions [14]. The functional cooperation between multiple regulatory elements

(CREs and TREs) maintains the accurate gene expression.

2 1 . Introduction

1.1.2 Epigenetic regulation of gene expression

In eukaryotes, the DNA is organized into chromatin, a complex in which the DNA is wrapped around histone proteins. According to its level of compaction, the chromatin can be classified as euchromatin or heterochromatin. While the euchromatin is a gene-rich, loosely compact structure associated with active gene transcription, the heterochromatin is a gene-poor, firmly compact structure transcriptionally silent [15, 16]. The chromatin structure and DNA ac- cessibility are regulated by covalent modifications of the histones and DNA methylation. These are known as epigenetic modifications [17]. By modifying the chromatin structure and DNA accessibility, epigenetic modifications can regulate gene expression.

The histones form octamers containing two copies of each of four histone proteins, mainly

H2A, H2B, H3 and H4. The N-terminal tail of the histones is subject to a number of post- translational modifications, referred to as “histone modifications”. Histone modifications com- prise several types, such as methylation, phosphorylation, acetylation, ubiquitylation, and sumoy- lation [18]. Histone modification patterns are thought to constitute a “histone code” that regu- lates the dynamics of chromatin organization and function. In particular, individual histone mod- ification can distinguish euchromatin from heterochromatin [19]. Thus, heterochromatin is en- riched in hypoacetylated histones and methylated lysine 9 of histone H3 (H3K9me), whereas eu- chromatin is characterized by acetylated histones H3/H4 and methylated lysine 4 of H3 (H3K4me) [20, 21, 22]. In addition, particular histone modifications are associated with different transcrip- tional states. For instance, lysine methylation is associated with repressed transcription, whereas lysine acetylation is related to active transcription [18, 23].

DNA methylation is the biological process involving the transfer of a methyl group onto the C5 position of the cytosine to form 5-methylcytosine within the context of the CpG dinucleotide

[24]. DNA methylation and histone H3K9 methylation can cooperatively form and maintain regions of heterochromatin. In particular, DNA methylation represses transcription by recruiting methyl-binding proteins and their associated chromatin remodeling factors or by preventing

TFs from binding to the DNA [25]. Normal DNA methylation patterns contribute to maintain genome stability and accurate gene expression. Global hypomethylation together with region- specific hypermethylation is frequently observed in tumor cells, resulting in gene expression

3 1 . Introduction misregulation [26].

Epigenetic mechanisms can maintain chromatin stability and regulate gene activity during organismal development, and their alterations can result in various human diseases [27]. Re- cent genome-wide technologies have enabled us to perform a systematic exploration of histone modifications and DNA methylation profiles, advancing the characterization of the epigenetic landscape of the genome.

1.1.3 Topologically associating domains (TADs)

The genomic DNA has been found to form three-dimensional (3D) structures. Most notably, the genome is organized into topologically associated domains (TADs), which are megabase- sized genomic regions harboring self-interacting chromatin loops [28]. The insulator protein CTCF plays an important role in the formation of TADs. Indeed, CTCF has been found to be en- riched at chromatin loops [29]; in turn, CTCF sites are enriched in TAD boundary regions (TBRs,

[30]). TADs contribute to gene regulation by restricting or facilitating promoter-enhancer inter- actions in a cell-specific manner [31].

Numerous studies have revealed that TADs can be further characterized into two compart- ments A and B, enriched with either active and open or inactive and closed chromatin, respec- tively [32, 33]. Compartments that are active in a specific cell type or condition can be inactive in other cell type or condition, and vice versa [34]. Although compartment switching was ini- tially thought to play a major role in the regulation of gene expression during cell differentiation

[35], compartment switching have been recently shown to contribute, but not determine cell type specific patterns of gene expression [36]. Chromatin compartmentalization is believed to be a key characteristics of high-order genome organization and function in animal development [37].

But recent evidence showed CTCF loss only displayed small effects on compartments whereas cohesin loss could obviously increase the degree of compartmentalization, suggesting compart- ments and TADs might be formed by distinct, probably antagonistic mechanism [38]. Currently, how exactly the dynamics of chromatin organization affects gene expression regulation remains unclear.

4 1 . Introduction

1.2 Genomic variants and gene expression misregulation

In the context of diseases and abnormalities, epigenetics, TADs, and dynamics of chromatin structure are shown to be disrupted by genomic variants, and in turn, cause gene expression misregulation. There are two major categories of genomic variants: (i) single nucleotide poly- morphisms (SNPs) and short insertions and deletions (indels); and (ii) structural variants (SVs), which are rearrangements of large DNA segments and may involve DNA copy number variants

(CNVs). The most common SVs comprise deletions, duplications, inversions and translocations and are regarded as one of the hallmarks of cancer [39]. Variants can be located everywhere in the genome and variants in both coding and non-coding sequences have been associated with phenotypic diversity in health and disease.

SNPs are substitutions of single nucleotides and occur throughout the genome with a fre- quency of one in approximately every 1,000 bp [40]. SNPs are associated with population di- versity in health [41], the susceptibility of an individual to many diseases [42], and individual response to medicinal drug treatment in diseases [43]. For example, the SNP rs3903072, located within a putative CRE of the tumor suppressor gene CTSW, is associated with low expression level of CTSW and lower survival probability in breast cancer patients [44]. Also, the SNP rs2853669 within the promoter of gene encoding the telomerase reverse transcriptase (TERT) can influence patient survival rates, recurrence risks, and prognosis of human diseases, e.g. liver cancer, breast cancer, ovarian cancer and melanoma [45, 46, 47].

While not as common as SNPs, indels still make up 15-20% of human polymorphisms, representing the second most common type of genetic variants in the human genome [48]. Indels comprise insertions or deletions measuring from 1 to 10kb in length. Many studies have revealed that non-coding indels can affect the regulation of cancer-related genes and are associated with lower patient survival [49, 50].

Most SVs encompass more than one gene or regulatory element; hence, SVs may have pro- found effects on gene regulation. In particular, CNVs drive pathogenesis by changing the num- ber of copies of genes or CREs [51]. An increasing number of studies indicate that non-coding CNVs are as important as coding CNVs in driving tumorigenesis or abnormalities [52, 53]. For instance, non-coding CNVs altering CREs are a major cause of congenital limb malformations

5 1 . Introduction

[54]. In particular, the duplication of a BMP2 enhancer is associated with human limb malfor- mation [55]. Also, duplications in the zone of polarizing activity regulatory sequence (ZRS) cause SHH misregulation, which can result in a number of limb abnormalities [56, 57]. In ad- dition, a duplication involving a distant potential CRE of IHH leads to the IHH misregulation during digit and skull development [58]. Further, there is evidence showing a robust association between a CNV located in the promoter of MAPKAPK2 and risk as well as prognosis for lung cancer, suggesting the potential of this CNV as a biomarker for susceptibility and prognosis for lung cancer patients [59]. Also CNVs covering a distal PAX5 enhancer can lead to gene misregulation in chronic lymphocytic leukemia [60]. All these observations suggest that CNVs can have pathogenic effects by altering CREs, something that has been under-appreciated for decades.

Moreover, CNVs can also rewrite promoter-enhancer interactions by disrupting TAD struc- ture, which, in turn, can cause abnormalities and diseases [61, 62]. In particular, the deletion of a TAD boundary may lead to the fusion of two adjacent TADs, bringing enhancers from neigh- boring TADs to activate oncogenes (“enhancer hijacking”, [63]). Also, CNVs can facilitate the formation of new boundaries, splitting one TAD into sub-domains and consequently abrogating interactions between enhancers and promoters [64]. For example, the disruption of the TAD en- compassing WNT6, IHH, EPHA4 and PAX3 loci can result in human limb malformations [61].

Fusion of the TAD compassing oncogenes TAL1 and its adjacent TAD leads to activate TAL1 of lymphoblastic leukemia [65]. Similarly, fusion of the TAD compassing oncogenes PDGFRA and its adjacent TAD cause activation of PDGFRA of gliomas patients [66]. The fragmentation of a TAD harboring the tumor suppressor gene TP53 leads to misregulation of a number of genes in prostate cancer cells [64]. Taken together, perturbation of TADs are closely associated with gene expression misregulation in cancer.

6 1 . Introduction

1.3 Identification and characterization of CREs in the human

genome

Due to their important role in evolution, development, and disease, it is essential to identify and characterize all CREs in the genome. However, the identification of CREs still remains a big challenge [67]. First, CREs are scattered across the 98% of the human genome not encoding protein, resulting in a large search space [68]. Also, as stated above, the locations of CREs are highly variable relative to their target genes. Thus, enhancers do not necessarily impact their closest promoters; instead, they can bypass adjacent genes to regulate genes located more distantly [8, 69]. Furthermore, CREs activity is highly dynamic and tissue-, cell-, time- or condition-specific [70].

1.3.1 Comparative genomics

Comparative genomics can be used to explore the similarities and differences between the genomes of two or more species, subspecies or individuals of the same species [71]. Several studies have examined the evolution of CREs using comparative genomics, revealing that many

CREs are under selective constraint and more evolutionary conserved than non-functional se- quences [72]. Moreover, many conserved non-coding elements (CNEs) have been shown to act as CREs. Thus, approximately 500 sequences in the human genome displaying 100% sequence identity over at least 200 bp with sequences in the mouse and rat genomes have been found to act as enhancers [73, 74]. Because they are perfectly conserved, these sequences are referred to “ultra-conserved elements (UCEs)”. Furthermore, approximately 1,400 highly conserved el- ements (HCEs), displaying at least 70% sequence identity over 100 bp between the human, pufferfish and fugu genomes exhibited enhancer activity in one or more tissue in functional assays [75].

The alignment of different genomes is the first step for identifying conserved elements. The

BLAST-Like alignment Tool (BLAT, [76]) is a fast alignment tool that indexes a genome by recording all non-overlapping k-mers and their positions in the genome. It is often used to search for sequences in the genome that are closely related to a query sequence. BLAT is op-

7 1 . Introduction timized to identify DNA sequences of length 25 bp or more and with at least 95% similarity.

To take advantage of the large number of genome sequences, approaches such as a phastCons

[77] rely on multiple alignments while considering the phylogeny of the species that are repre- sented. Specifically, phastCons uses a phylogenetic hidden Markov model (phylo-HMM) [77].

The phylo-HMM consists of two states: a “conserved” state and a “non-conserved” state. The phastCons tool uses maximum likelihood to estimate the parameters of the phylo-HMM based on the genomes of multiple species. The phylo-HMM can be then used to assign a score to each nucleotide in a genome reflecting the probability that the nucleotide is in a “conserved” state of the phylo-HMM. In this manner, phastCon can identify “conserved elements” as segments of the alignment that are likely to have been “generated” by the conserved state of the phylo-HMM. phastCons has been used to identify CNEs for different sets of species [78].

Because orthologous sequences usually share the same functions, their identification is cen- tral to comparative genomics. Reciprocal Best Hits (RBH) in the genomes of two species are a common working definition of orthology. Specifically, RBH are defined as two sequences, each in a different genome, which are each other’s best hits [79]. For each sequence in a genome, the sequence in the other genome that shares the best alignment score with it is defined as its

(single-directional) best hit. BLAT is often applied to quickly find sequence hits [80], in partic- ular between closely related genomes.

Hence, comparative genomics has established itself as a valuable approach to identify can- didate CREs in the genome [81].

1.3.2 Next-generation sequencing (NGS)

Another approach for identification and characterization of CREs is to apply next-generation sequencing (NGS) technologies, which have emerged as a turning point in uncovering the gene- regulatory landscape in the genome. Large international research consortia such as the Encyclo- pedia of DNA Elements (ENCODE) [82] and Roadmap Epigenomics Mapping [83] have been established with the purpose of identifying all functional elements in the human genome. They aim at constructing a comprehensive CRE database through high-throughput data analysis.

NGS technologies can produce multi-dimensional genome-wide data. Thus, mRNA se-

8 1 . Introduction quencing (mRNA-seq) is widely used to analyse transcriptomes for a wide range of studies.

Specially, mRNA-seq can be used to quantify gene expression levels or identify both known and novel transcript isoforms or gene fusions [84]. In addition, chromatin immunoprecipitation fol- lowed by next-generation sequencing (ChIP-seq) is applied to generate maps of TF binding and histone modifications across the entire genome. The latter mainly comprises the monomethy- lation of histone H3 at Lys4 (H3K4me1), enriched at enhancers; the trimethylation of H3K4

(H3K4me3) at promoters; the H3K27 acetylation (H3K27ac) at active enhancers and promoters; the H3K27 trimethylation (H3K27me3) at constitutive heterochromatin regions and the H3K9 trimethylation (H3K9me3) at facultative heterochromatin region [85, 86]. In particlar, different combinations of enriched histone modifications can annotate different cis-regulatory elements accurately. H3K4me1 and H3K27ac are enriched at active enhancer; H3K4me3 and H3K27ac are found at the active promoters; H3K4me1 and H3K27me3 or H3K9me3 are associated with poised enhancer. Integrating mRNA-seq and ChIP-seq analysis is crucial to decipher the gene regulatory landscape underlying development, evolution and disease.

Further, NGS technologies can be used for studying chromatin structure and accessibility.

Chromatin accessibility represents the degree to which nuclear proteins such as TFs, are able to physically contact the DNA [87]. Specifically, accessible genomic regions comprises 2-3% of total DNA sequence. Yet over 90% accessible genomic regions are captured bound by TFs

[87]. TF binding has been found to be determined by chromatin accessibility as well as sequence patterns. Therefore, chromatin accessibility has emerged as a powerful marker for detecting ac- tive CREs. A variety of methods enable us to identify whether the chromatin is open or closed, active or inactive. The techniques include Deoxyribonuclease I (DNase I)-hypersensitive site sequencing (DNase-seq), formaldehyde-assisted isolation of regulatory elements (FAIRE-seq), and assay for transposase-accessible chromatin (ATAC-seq). Finally, chromosome conforma- tion capture or “Hi-C” protocols, which are also based on high-throughput sequencing technolo- gies, can be applied to investigate chromatin-chromatin interactions, the 3D spatial organization of the genome, and the interactions between genes and distal CREs [31].

These technologies have facilitated the description of the genome-wide landscape of gene expression regulation.

9 1 . Introduction

1.4 Public data for multi-assay genomic investigation of gene

regulation and disease

With the rapid development of NGS technologies, large consortia have generated extensive multi-assay datasets. The development of bioinformatics approaches to analyze such public data resources has great potential to benefit biological and medical research. For example, ENCODE [82] is an ongoing international collaboration that generates and provides a variety of data and methods to identify CREs in the human genome. Used Techniques include RNA-seq, ChIP-seq, DNase-seq, DNA-methylation arrays, bisulfite sequencing, and

FAIRE-seq. Notably, ENCODE has systematically mapped TFBSs, candidate CREs, histone modifications and chromatin structure for numerous human tissues and cell lines.

In addition, large-scale databases have been generated to identify variants in the genomes of patients with various diseases. Particularly, in the last decades two large international consor- tia, the Cancer Genome Atlas (TCGA, [88]) and the International Cancer Genome Consortium

(ICGC, [89]), have produced and analyzed large-scale whole-genome sequencing (WGS) data for a variety of cancer types. They have also developed a variety of computational pipelines and tools to accurately detect cancer-related variants. These programs have made remarkable progress in detecting protein-coding variants that act as cancer drivers and in elucidating their causative effects. Specially, they found a set of protein-coding genes with known causal as- sociations with multiple cancers. These genes are a hallmark of cancer and were dubbed as pan-cancer genes [90].

Compared to protein-coding variants, there is less progress in understanding the pathogenic effects of non-coding variants. The Genome-Wide Association Studies (GWASs) have found that more than 80% of disease-associated variants fall in non-coding sequences of the genome [91]. These variants can cause gene expression misregulation via different mechanisms, includ- ing modifying CREs, non-coding genes with regulatory functions like long non-coding RNAs

(lncRNAs), or disrupting TAD structure [92]. Hence, non-coding variants should draw more attention and be better studied for uncovering their roles in gene expression misregulation in disease.

10 1 . Introduction

1.5 Research questions

Medicine comprises three critical components: prognosis, diagnosis and therapy. Identi- fying relevant genomic variants could permit the personalization of each of these components.

In other words, integrating clinical information and genomic variation is necessary for moving from one-size-fits-all treatments to personalized precision medicine. Notably, non-coding vari- ants are likely to have strong clinical implications. For instance, GWAS and sequencing studies have verified that many non-coding variants are involved in cancer [93]. Moreover, increasing evidence from clinical studies suggests that non-coding variants can be efficient biomarkers for drug-induced toxicity and drug response [94]. Many of such non-coding variants are presumably located in CREs [95, 96].

In general, a gene is rarely regulated by a single CRE; instead, genes are often controlled by multiple CREs that act cooperatively [97]. Some cooperative CREs may be scattered over distances of several hundred kb in the genome [98]. Together with their target gene(s), these CREs form regulatory units [99]. Some of these regulatory units are small and contain a few

CREs. For example, CREs are found to be clustered together in the genome, co-regulating gene expression [100]. Moreover, “super-enhancer” comprising clusters of enhancer play a key role in controlling gene expression in cell identity and diseases [101, 102]. Other regulatory units, such as TADs, are megabase-sized regions encompassing thousands of genes and CREs [28]. Regulatory units can be disrupted, leading to gene expression misregulation. For instance, a small deletion or can disrupt the interactions between individual enhancer of Wap gene by 99% [103]. Also, TADs have been found to be highly conserved across different tissues, cell types and species [104, 31]; because of this, it has been inferred that they are an inherent feature of chromatin organization in the genome. However, TADs are subject to alteration, especially in the context of cancer [105]. In particular, TADs boundaries can be disrupted by

SVs, and the perturbation of TADs could drive tumorigenesis by bringing together inappropriate

CREs or disrupting CRE interactions. Elucidating how the disruption of TADs influences cancer progression could provide new insights for personalized medicine. New computational methods are needed to gain more knowledge about how regulatory units are involved in gene expression regulation or misregulation. Some unanswered questions about

11 1 . Introduction regulatory units include:

i) how to identify regulatory units in the genome;

ii) how to characterize regulatory units;

iii) how CREs in regulatory units regulate gene expression cooperatively;

iv) what can influence or maintain the functions of regulatory units underlying evolution and

development;

v) what can disrupt regulatory units and how does the disruption of regulatory unit impacts

gene expression disease.

This thesis aims at characterizing regulatory units and understanding their roles in gene ex- pression regulation. Particularly, we explore the causes and consequences of regulatory unit disruption underlying adaptive evolution, development, and disease. Specifically, this thesis is separated into two parts:

1. The first part focuses on the identification and characterization of small regulatory units

consisting of two CREs. Using CNEs as proxies for CREs, we investigated which pairs of

CNEs (CNE-CNE pairs) in the human genome are likely to act as regulatory units. Sun at

al [106] reported that the distance between two adjacent CNEs shows conservation in ver- tebrates. Hence, we hypothesized that the degree of conservation of inter-CNE sequences

of CNE-CNE pairs might reflect the potential functions of CNEs.

To examine this, we clustered CNE-CNE pairs according to the degree of conservation

of their inter-CNE sequences and examined the functions of CNEs through exploring the

patterns of their epigenetic profiles. Further, we examined if transposon activity plays a roles in changing the conservation of inter-CNE sequences, and further influences func-

tions of the CNEs. This part contributes to identify and characterize regulatory units in

the human genome and the evolutionary mechanisms leading to their loss, which might

be associated with adaptive evolution and human genetic diseases.

12 1 . Introduction

2. The second part is about the identification and charaterization of large regulatory units

(TADs) that are commonly perturbed in cancer. In addition, we examined by which mech-

anism cancer-related CNVs can disrupt TADs. We provide a proof-of-principle to demon- strate the prognostic value of TADs in patient survival outcomes. This part contributes

to understand the relationship between non-coding variants and TAD disruption in can-

cer, and provides new insights for the development of novel diagnostic, therapeutic and

preventive strategies in cancer.

The identification and characterization of regulatory units is necessary to understand how changes in gene expression regulation contribute to evolution, development, and disease.

13 2 . Results 2. Results

The research presented here aimed at investigating the functional potential of the non-coding sequence with the aid of bioinformatics tools and genomic data generated by international con- sortia such as ENCODE and TCGA (see Figure 2.1). Specifically, the first part of the chapter is devoted to the identification and characterization of novel small regulatory units consisting of pairs of adjacent CNEs and the inter-CNE sequences in between [107]. We found that pairs of CNEs with mildly contracted or conserved inter-CNE distances often act as regulatory units consisting of two CREs with similar regulatory activities. This regulatory units can be disrupted by transposon activity. Interesting, we also found that transposons often give rise to redundant

CREs [108]. The second part of the chapter provides insights into how the disruption of large regulatory units such as TADs contributes to gene expression misregulation in cancer [109]. In the context of cancer, we also investigated how knocking-out the mutated KRAS in pancreatic cancer cell lines with a heterozygous KRAS mutation with the clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) system affected the expression of genes involved in cell growth, cell maturation, and cell death [110].

2.1 Small regulatory units in evolution and development

Comparative genomics has identified thousands of CNEs across multiple species that are of particular interest because of their potential role as CREs [111, 112]. In addition, the length of the sequences between two CNEs, in particular, HCEs and UCEs, has been shown to exhibit high conservation across vertebrates [113, 106]. Previous research also reported a correlation be- tween the transposon density of the inter-CNE sequence and the conservation of its length, with longer sequences displaying higher transposon densities [113, 106]. Based on these observa- tions, we hypothesized that the inter-CNE sequences might contain elements that are functional or structurally relevant which can be perturbed by transposon activities.

To test this, we first developed a new method to quantify degree of conservation of the distance in between two adjacent CNEs (inter-CNE distance) and analyzed the epigenetic profile of the CNEs. Then, we investigated if transposon activity influences the conservation of inter-

14 2 . Results

Figure 2.1: TADs are represented as red triangle. Boxes in yellow, green, and blue represent gene, CREs, and adjacent pairs of CREs that are evoluonary conserved (“funconal CNEs”), respecvely. The alteraon of large regulatory units, such as TADs, or small regulatory units, such as pairs of adjacent CREs, could modify regulatory interacons. For example, the fusion of two ad- jacent TADs could lead to the formaon of a interacons between a gene and a CRE, whereas the change of inter-CNE distance could lead to the loss of interacon between two CREs and, in turn, to the loss of regulatory funcon.

CNE distances and affects the functions of the CNEs [107].

We extracted 21,367 CNEs in the human genome from the PhastCons database [77]. Pairs of adjacent CNEs without any other conserved element in between were defined as CNE-CNE pairs. To assess the conservation of the genomic distance between the CNEs of the CNE-CNE pairs, we searched for the ortholog of each CNE in other 22 vertebrate genomes (see Table 2.1).

Specifically, we used BLATto identify RBHs for each of the CNEs in the human genome. A total of 5,657 CNE-CNE pairs were conserved in mammals and defined as mammalian conserved.

Among these CNE-CNE pairs, 1,956 were further conserved in at least two non-mammalian species and defined as deeply conserved (see figure 2.2A).

Species

Human Turtle chimp Lizard

Marmoset Frog

Mouse Zebrafish

Dog Fugu Armadillo Rat

15 2 . Results

Species

Elephant Rhesus

Opossum Microbat

Platypus Panda

Chicken Pig Zebra finch Dolphin

Rhinoceros

Table 2.1: Species used to idenfy mammalian conserved and deeply conserved CNE-CNE pairs.

We next inferred the inter-CNE distance of CNE-CNE pairs in the common mammalian ancestor based on the inter-CNE distance of CNE-CNE pairs in the human genome and their or- thologs in the 16 mammalian species. We introduced the “relative distance difference (nRDD)” to quantify the relative changes in the distance between adjacent CNEs of CNE-CNE pairs in the modern mammalian genomes (see Figure 2.2B), calculated as follow:

dh − dr nRDD = Gh Gr dh + dr Gh Gr 2 where dh and dr were the distances between the midpoints of two adjacent CNEs in the species of interest and the common mammalian ancestor, respectively. Gh and Gr were the genome sizes of the species of interest and the common mammalian ancestor, respectively. Therefore,

dh and dr were the genome size-normalized distances for the species of interest and the common Gh Gr mammalian ancestor. By computing genome size-normalized distances, we were able to identify changes in the inter-CNE distances that were independent from changes in the overall genome sizes.

The nRDD values can be used to assess changes in the inter-CNE distance. Thus, nRDD values close to 0 represented no changes in the distances relative to the common mammalian ancestor; negative and positive nRDD values indicated inter-CNE contraction and expansion in the species of interest compared to the common mammalian ancestor, respectively.

We further divided the human CNE-CNE pairs into ten equal-sized groups according to

16 2 . Results

Figure 2.2: Distances in between adjacent CNEs (inter-CNE distances) were highly conserved across different species. A) We idenfied a total 321,621 CNE-CNE pairs in the human genome; 5,657 were conserved in at least human and platypus (“mam- malian conserved’’, in blue) and 1,956 are further conserved in at least two non-mammalian vertebrates (“deeply conserved’’, in green). A total of 3,140 CNE-CNE pairs are conserved in at least human and two non-mammalian vertebrates (green and or- ange). B) Inter-CNE distances for human and mouse CNE-CNE pairs, compared to the common mammalian ancestor. The thick black lines represented genomes for human, mouse and common mammalian ancestor (reference). The genome sizes are indi- cated by Gh (human), Gm (mouse) and Gr (reference). The two gray squares refer to adjacent CNEs. dh, dm and dr represent the inter-CNE distance (in bp) for the species we describe above. The genome size-normalized distance for each species is the inter-CNE distance divided by the genome size. The genome size-normalized distance of species of interest (human, mouse) is compared to that of the common mammalian ancestor to idenfy the inter-CNE expansion and contracon. (Adapted from [107]). increasing nRDD values. Compared to the common mammalian ancestor, groups 1-3 comprised the CNE-CNE pairs with contracted inter-CNE distances, groups 4-7 referred to the CNE-CNE pairs with relatively conserved inter-CNE distances, and groups 8-10 contained the CNE-CNE pairs with expanded inter-CNE distances.

Interestingly, we observed that the density of transposon, including DNA, LTR, LINE and

SINE, gradually increased with nRDD value. Nevertheless, all groups except group 10 exhibited significantly lower transposon densities than expected from the genome, whereas the extremely expanded group 10 showed no significant difference in transposon densities compared to ex- pected by chance. These observations suggest that transposon activity might play an active role in inter-CNE expansion and contraction. In order to assess if CNE-CNE pairs with different changes of inter-CNE distances are as- sociated with any regulatory functions, we investigated the epigenetic profiles of the CNEs and the inter-CNE sequence of each CNE-CNE pair in each of the ten groups. The histone modifi- cations contained H3K4me1, H3K4me3, H3K27ac, H3K27me3 and H3K9me3 for each of 21 human tissues from ENCODE. For this purpose, we applied an unsupervised clustering method called Self Organized Map (SOM). Specifically, we clustered the CNE-CNE pairs into 12 self-

17 2 . Results organizing units according to their enrichment profiles for five histone modifications at:

1. the 300bp sequences centered at the left CNE and

2. right CNE of the CNE-CNE pair;

3. the entire inter-CNE sequence of the CNE-CNE pair; and

4. the 200bp region with the highest mean enrichment level within the inter-CNE sequence

for each histone modification.

We next clustered the self-organizing units according to their different epigenetic profiles: i) no or low cis-regulatory activity; ii) low enrichment level at the CNEs and inter-CNE sequences, but relatively high enrichment level at the latter; iii) high enrichment levels at both CNEs and at the inter-CNE sequence, in particular, for H3K27me3; and iv) high enrichment levels for

H3K4me1, H3K4me3 and H3K27ac across the entire sequence (see Table 2.2). While cluster iii) represented poised enhancers, cluster iv) comprised CNE-CNE pairs in which both CNEs are most likely act as active enhancers or promoters.

Clusters of units Fraction of CNE-CNE pairs Histone modification enrichment

i 56% -

ii 34% H3K4me1, H3K9me3, H3K27me3 iii 5% H3K27me3

iv 5% H3K4me1, H3K3me3, H3K27ac

Table 2.2: Epigenec profiles of clusters of self-organizing units

Depending on the tissue, 19% to 77% of all the CNE-CNE pairs in cluster iv exhibited higher enrichment levels for H3K27ac in both CNEs compared to random genomic sequences, suggesting that the CNEs of these pairs act as a small regulatory units.

We further performed an enrichment test to assess if any nRDD groups display enrichment in clusters of self-organizing units. Interestingly, we observed nRDD group 3 and 4-7 particularly enriched in the self-organizing units of cluster iii and iv, whereas nRDD group 1-2 and 8-10 enriched in the self-organizing units of cluster i and ii. These findings strengthen the hypothesis

18 2 . Results that different degrees of conservation of the inter-CNE distance are associated with different epigenetic profiles. In particular, the CNEs in the CNE-CNE pairs with mildly contracted or conserved inter-CNE sequences are most likely to act as poised and active enhancers, whereas CNEs of the CNE-CNE pairs with extreme contractions or expansions display no or very low cis-regulatory activities. We performed analogous analyses for the CNE-CNE pairs in the mouse genome, with findings similar to human.

In summary, this study identified novel regulatory units consisting of CNE-CNE pairs with mildly contracted or conserved inter-CNE distances, hinting at inter-CNE distances being rele- vant to the function of the CNEs. Moreover, the function of these units appears to be disrupted by transposon activity.

2.2 Disruption of large regulatory units in cancer

The disruption of regulatory units can lead to gene expression misregulation. Large regula- tory units such as TADs are subject to perturbations in the context of cancer [105]. In particular, non-coding CNVs can disrupt TAD structure and rewrite enhancer-promoter interactions, lead- ing to diseases and abnormalities. Therefore, prioritizing non-coding CNVs disrupting TADs and elucidating how the disruption of TADs influences gene expression in tumorigenesis might provide new insights for personalized medicine.

We first developed a new algorithm to construct a consensus TAD map representing the TADs that are prevalent across 24 different normal tissues/cell line. Thus, 1,467 consensus

TADs were identified in the human genome. We further refer to them simply as “TADs”. To detect the distribution of recurrent CNVs within the TADs, we utilized CNVs data from 10,435 patients of cancer types from TCGA Data Portals. We observed a weak correlation between the size of the TADs and the number of CNVs that they comprised (Spearman’s correlation coefficient = 0.32), suggesting that the distribution of CNVs is independent of the size of TADs.

Hence, TADs enriched for CNVs might be involved in cancer mechanisms.

To identify the TADs with unusual CNVs enrichment, we compared the number of patients with CNVs overlapping with each TAD to the expectation from the human genome according to

19 2 . Results

Figure 2.3: The conceptual framework contains key steps of the study. their sizes, and detected 79 (6%) TADs significantly enriched with CNVs. These TADs were as- sociated with multiple biological process, cellular components and molecular functions related to pathogenic mechanisms. In particular, they were associated with type I interferon response, natural killer cell activation and immune response, which are related to tumor immune surveil- lance [114]. Interestingly, distinct cancer types showed different subset of TADs enriched with CNVs, with a median 64 TADs significantly enriched with CNVs. Furthermore, the same 13

TADs were enriched for CNVs in the genomes of the patients of at least 15 different cancer types. These 13 TADs might be considered a pan-cancer mutational signature, but none showed any enrichment for pan-cancer genes, suggesting that other pathogenetic mechanisms might be involved in. To identify the TADs that are commonly disrupted in cancer and the non-coding CNVs that contribute to their disruption, we developed a novel framework [109]. Further, we applied

LASSO Cox Regression Model to assess the prognostic power of CNVs in TADs on predicting

20 2 . Results and explaining patient survival outcomes (see Figure 2.3).

To assess the prognostic value of the TADs enriched for CNVs in patient survival outcome, we trained a LASSO Cox regression model for each of 19 cancer types (see Table 2.3). The variables or features used to train the models comprised the age and sex of the patients as well as the presence or absence of CNVs in each of the TADs enriched with CNVs.

Cancer type Project id

LGG Brain Lower Grade Glioma

KIRP Kidney renal papillary cell carcinoma UCEC Uterine Corpus Endometrial Carcinoma

READ Rectum adenocarcinoma

GBM Glioblastoma multiforme

KIRC Kidney renal clear cell carcinoma

SARC Sarcoma BRCA Breast invasive carcinoma

OV Ovarian serous cystadenocarcinoma

ESCA Esophageal carcinoma

BLCA Bladder Urothelial Carcinoma CESE Cervical squamous cell carcinoma and endocervical adenocarcinoma

COAD Colon adenocarcinoma

HNSC Head and Neck squamous cell carcinoma

LUSC Lung squamous cell carcinoma

STAD Stomach adenocarcinoma LIHC Liver hepatocellular carcinoma

PAAD Pancreatic adenocarcinoma

LUAD Lung adenocarcinoma

Table 2.3: Cancer types used to construct TAD- and gene-based LASSO Cox Regression models.

For nine out of the 19 cancer types, we were able to obtain reliable models, with median c-index values from 0.55 to 0.8 (see Figure 2.4). For reference, these TAD-based models were

21 2 . Results compared to gene-based models, trained and tested in the same manner. Compared to the gene- based models, TAD-based models performed better in BRCA, OV, SARC and UCEC, worse in KIRP, KIRC and GBM; and showed no difference in LGG and READ. In total, the models identified 35 prognostic TADs. Out of these, 32 were exclusive to one cancer type.

Interestingly, 19 out of 35 prognostic TADs did not comprise any pan-cancer genes, indicat- ing that the prognostic power of the TADs might lie in non-coding variants. A remarkable case is

SARC (sarcoma). No good gene markers have been identified so far for SARC [115, 116, 117], and this is consistent with the poor performance of our gene-based model (see Figure 2.4). In contrast, the TAD-based model showed a reliable performance. None of the five prognostic

TADs for SARC comprised any pan-cancer genes. In total, 16 out of the 35 prognostic TADs comprised pan-cancer genes, but the prognostic value of such TADs was not necessarily related to the presence of pan-cancer genes. These results indicate that the presence of a CNV in a TAD can be as informative as the presence of a CNV in a pan-cancer gene. In order to have biological insights into the prognostic features of the TADs, we constructed a consensus TAD map for the cancer genome. The cancer TAD map comprised 1,622 consen- sus TADs. By comparing the TAD maps constructed for the normal and cancer genomes, we separated the TADs in the normal genome into three categories:

1. constitutive: 643 normal TADs showed a high degree of conservation when compared to their counterparts in the cancer genome;

2. perturbed: 505 normal TADs displayed remarkable differences to cancer TADs; and

3. ambiguous: the remaining 319 normal TADs exhibited intermediate changes or similari-

ties compared to cancer TADs.

A large fraction of the TADs displayed a relatively high degree of conservation, in agreement with [104, 31]. In addition, there were 505 TADs with substantial perturbations, which might be related to the presence of CNVs. Hence, we next investigated and compared the distribution of

CNVs in constitutive and perturbed TADs. We hypothesized that CNVs might disrupt the TAD boundaries. Hence, we extracted two specific subsets of perturbed TADs:

22 2 . Results

Figure 2.4: The boxplot shows the c-index (CI) computed on 1,000 random test sets of paents, for 19 cancer types. TAD- and gene-based models are shown in yellow and blue, respecvely. The red dash line represents CI = 0.55 and black dash line for CI = 0.5. The asterisk in red (“*”) indicates that TAD- and gene-based model is significant different with p-value < 0.05 (Wilcoxon rank-sum test). C-index = 1 indicates perfect predicon whereas c-index = 0.5 refer to random predicon. Nine of the 19 TAD- based models exhibited a reliable performance (CI > 0.55); out of those nine models, four were significantly beer than the corresponding gene-based models, three performed worse, and two were similar.

(i) a subset of 111 TADs that were fused with at least another TAD into a larger TAD in the

cancer genome; and

(ii) a subset of 120 TADs split into two or more TADs in the cancer genome.

Interestingly, both subsets displayed higher densities of CNVs than constitutive TADs. Specif- ically, (i) showed general a enrichment for CNVs, whereas (ii) displayed a stronger enrichment of CNVs towards the center of the TAD, at the location of the newly formed TAD boundary. These findings highlight the association between the presence of CNVs and the disruption of TAD boundaries, suggesting that CNVs can either lead to the elimination of existing bound- aries or induce the formation of new boundaries. Moreover, 34% of 35 prognostic TADs were perturbed, indicating that approximately one third of the prognostic TADs might gain predictive power directly from the perturbation of the three-dimensional structure. In summary, we systematically explored a TAD-based model for capturing the effects of non-

23 2 . Results coding variants in cancer mechanisms. We found that CNVs disrupting TADs can be used as as effective predictive markers for broad types of cancer. This study provides a novel framework for prioritizing non-coding variants for the development of personalized medicine.

To conclude, this thesis developed new computational frameworks to identify and charac- terize regulatory units in the human genome, as well as study how their disruption influences gene expression in human disease.

24 3 . Discussion 3. Discussion

To identify and characterize regulatory units in the human genome, we developed novel algorithms together with bioinformatics tools, and analyzed large-scale genomic data obtained from open resources ENCODE and TCGA. Specifically, we concentrated on identifying novel small regulatory units and investigating disruptions of large regulatory units in the context of evolution, development, and disease.

Our first study identified and characterized novel regulatory units in the human genome. There are many studies investigating the roles of non-coding genomic sequences in the human genome. CNEs have been shown to serve as CREs, and we have shown that the length of the ge- nomic sequences in between two adjacent CNEs, or inter-CNE sequences, exhibits high conser- vation across different species, but can be disrupted by transposon activity. These observations suggest that the inter-CNE sequences contain functional elements or are structurally relevant.

In contrast to identification of the changes of inter-CNE distance in the human genome relative to the other seven species [113, 106], we integrated a certain number of mammalian species to infer the inter-CNE distances of the extinct common mammalian ancestor, and used it as refer- ence to identify human-specific changes in inter-CNE sequences. In the search of the causes of inter-CNE contractions and expansions, we identified an association between transposon activ- ities and changes in inter-CNE sequences. Specifically, transposon deletion and insertion are associated with contraction and expansion of inter-CNE sequences, respectively. We wondered whether the CNE-CNE pairs and according inter-CNE sequences with varying conservation lev- els might exhibit any functional patterns; therefore, we applied a SOM-based approach to cluster CNE-CNE pairs according to the their epigenetic profiles. Interestingly, the CNEs in the CNE-

CNE pairs with conserved or mildly contracted inter-CNE sequences are most likely to be poised or active enhancers. In addition, the CNEs in a pair always exhibit similar epigenetic profiles, suggesting that the CNE-CNE pairs with conserved or mildly contracted inter-CNE distance act as small regulatory units in the genome. An interaction, or epistasis, seems to exist between the adjacent CNEs, which maintains both CNEs or makes them perform as enhancers. Com- plex traits or diseases are found to be affected and even determined by epistatic effects [118].

However, the genome-wide detection of epistasis is still challenging [119]. Studies focusing on

25 3 . Discussion epistasis are essential for elucidating the mechanisms of gene misregulation in the context of human disease. Our study identified and characterized regulatory units, consisting of pairs of potentially epistatic CREs. Further examination and validation of their functions is warranted. Our second study focused on uncovering the effects of disruption of regulatory units in dis- eases. TADs are highly conserved across species and cell types in the genome, but can be perturbed, i.e., exhibit structural alterations, in cancer. We hypothesized that the disruption of

TADs might be involved in gene expression misregulation in human disease. To distinguish perturbed TADs from constitutive or unaltered TADs in cancer, we first developed a novel al- gorithm to define consensus TADs to represent the cancer and normal genome across multiple tissues/cell lines. Next, we obtained 79 consensus TADs enriched with CNVs compared to ran- dom genomic sequences of the same size. The TADs enriched with CNVs did not show any enrichment of pan-cancer genes, suggesting that other cancer mechanisms might be involved.

After training a LASSO Cox regression model, presence of CNVs in TADs performed as well as the presence of CNVs in pan-cancer genes in terms of patient survival prediction. Moreover, the majority of prognostic TADs that display a predictive power are independent of mutated pan-cancer genes. These observations support that TAD-based model can capture either coding or non-coding pathogenic variants, thus potentially being an improvement over traditional gene- based model. Furthermore, TAD-based models may be useful for elucidating the mechanisms by which non-coding variants might contribute to tumorigenesis. For instance, approximately

34% of 35 prognostic TADs undergo structural variations, suggesting non-coding variants drive gene expression misregulation by disrupting TADs structure. That is to say, perturbed regulatory units can be considered as novel prognostic markers. Functional assays are necessary to validate our findings. Moreover, our study presents a resource for prioritizing non-coding variants for experimental validation.

The CRISPR/Cas9 system could be used to systematically perturb and characterize candidate

CREs [120, 121], enabling us to gain more insights into gene expression misregulation. In addition, the CRISPR/Cas9 system could be used for modifying TAD structures and examining the consequences of TAD disruption.

Taken together, this thesis highlights the importance of regulatory units in gene expression

26 3 . Discussion regulation. In particular, we identified small regulatory units, consisting of pairs of cooperative enhancers. Furthermore, our study strengthens the idea that the disruption of regulatory units can affect gene regulatory interactions, and in turn lead to disease. The list of regulatory units we identified need follow-up experimental investigation to examine and validate how they are involved in the regulatory mechanisms underlying development, evolution, and diseases.

27 4 . Conclusion and Outlook 4. Conclusion and Outlook

The 2% protein-coding regions in the human genome have been studied for decades, but the 98% non-coding regions remain less understood. In order to have a comprehensive annotation of the genome, numerous bioinformatics methods and tools have been developed and applied to the analysis of large–scale genomic data from open resources, such as TCGA and ENCODE.

Regulatory units, consisting of CREs and/or their target genes, play key roles in gene expression regulation. This thesis aimed at identifying and characterizing regulatory units and exploring the causes and consequences of regulatory unit disruption underlying evolution, development, and human disease.

The first part of this thesis identified and characterized novel regulatory units consisting of pairs of CNEs (“CNE-CNE pairs”). Specifically, both CNEs in CNE-CNE pairs with conserved and mildly contracted inter-CNE sequences were found to act as enhancers. Furthermore, the two CNEs of these pairs act cooperatively, as small regulatory units that could be disrupted by transposon activity. This study supports the existence of cooperative interactions between adja- cent CREs, improving our understanding of the mechanism for how selective forces act on CREs in the context of evolution and human genetic disease. The second part of this thesis investigated the mechanisms by which regulatory units can be disrupted, leading to gene expression misreg- ulation in cancer. We provide a proof-of-principle that disrupted TADs can capture pathogenic non-coding variants involved in cancer mechanisms. In particular, the CNVs disrupting TADs can be used as effective predictive markers for broad types of cancer, and a substantial fraction of prognostic features are linked to changes in TADs. This work provides a novel framework for prioritizing and characterizing cancer-related non-coding variants for the development of personalized medicine.

Gene expression in humans is complex and controlled by precise temporal and spatial reg- ulation. In particular, regulatory units comprising multiple interacting CREs act as platforms to recruit TFs. When the interaction between CREs is disrupted or changed in these regulatory units, gene expression is subject to change. Thus, the identification and characterization of reg- ulatory units is essential to understand how changes in gene expression regulation contribute to evolution, development, and human disease. The interaction between CREs can be additive, re-

28 4 . Conclusion and Outlook pressive, competitive, synergistic or hierarchical [122]. To explore these interactions, follow-up experiments, such as functional assays, are necessary. In order to improve the prediction and characterization of all CREs in the genome, a variety of computational approaches are required. Our study strengthens the importance of the interactions between the CREs for gene expression regulation. A systematic dissection and characterization of the interactions between the CREs might link different interaction types to different evolutionary and disease mechanisms. Thus, it would be interesting to know whether the conserved inter-CNE distances observed in pairs of CNEs are also a common feature of larger clustered CNEs. To evaluate how changes in the distances between the CREs influence the interactions between CREs and their activities, the

CRISPR/Cas9 system could be utilized together with ChIP-seq. Moreover, the integrated anal- ysis of transcriptomic and epigenomic data would also provide the opportunity to connect the activity of regulatory units to the expression level of their target gene(s), facilitating the interpre- tation for how cluster of CREs cooperatively regulate gene expression. Also, multi-dimensional genome-wide cancer data would enable us to identify changes in gene expression, CRE activity, and interactions between CREs and their target genes, which would, in turn, help us to uncover the mechanisms underlying cancer.

Eventually, the combination of computational methods with wet-lab experiments will allow us to decipher the mechanisms underlying gene expression regulation in the future.

29 5 . Author Contribution 5. Author Contribution

The research article “Pairs of Adjacent Conserved Noncoding Eements Separated by Con- served Genomic Distances Act as Cis-Regulatory Units”, published in Genome biology and evolution in 2018. This study was conceptualized by Leila Taher and me. The methodology was performed by me and Leila Taher. The formal analysis, investigation and visualization were performed by me, Leila taher, Nicolai KH Barth and Eva Hirth. The original draft paper was written by me. Review and editing were performed by me and Leila Taher. Leila Taher supervised the work.

The research article “Cancer Is Associated with Alteraions in the Three-Dimensional Or- ganization of the Genome”, was published in Cancers in 2019. This study was conceptualized by Leila Taher. The methodology was performed by me and Leila Taher. The formal analysis, investigation and visualization were performed by me, Leila taher and Nicolai KH Barth. The original draft paper was written by me. Review and editing were performed by me, Nicolai KH

Barth, Christian Pilarsky and Leila Taher. Leila Taher supervised the work. The research article “CRISPR/Cas9 mediated knock-out of Kras G12D mutated pancreatic cancer cell lines”, was published in International Journal of Molecular Sciences in 2019. This study was conceptualized by Christian Pilarsky and Eva Lentsch. The methodology, validation and investigation was performed by Eva Lentsch, me, Susanne Pfeffer, Arif B. Ekici and Leila

Taher. Original draft paper was written by Eva Lentsch. Review and editing were performed by Eva Lentsch, me, Susanne Pfeffer, Arif B. Ekici, Leila Taher, Christian Pilarsky and Robert

Grützmann. Robert Grützmann supervised the work.

The research article “Independent transposon exaptation is a widespread mechanism of re- dundant enhancer evolution in the mammalian genome”, published in Genome biology and evo- lution in 2020. This study was conceptualized by Leila Taher. The methodology was performed by Nicolai KH Barth and Leila Taher. The formal analysis, investigation and visualization were performed by Nicolai KH Barth, Leila taher and me. The original draft paper was written by

Nicolai KH Barth. Review and editing were performed by Nicolai KH Barth and Leila Taher.

30 5 . Author Contribution

Leila Taher supervised the work.

31 6. Bibliography

[1] Grace Gill. Transcriptional initiation: taking the initiative. Current Biology, 4(4):374– 376, 1994.

[2] Tong Ihn Lee and Richard A Young. Transcriptional regulation and its misregulation in disease. Cell, 152(6):1237–1251, 2013.

[3] ENCODE Project Consortium et al. An integrated encyclopedia of dna elements in the human genome. Nature, 489(7414):57, 2012.

[4] Megha Ghildiyal and Phillip D Zamore. Small silencing rnas: an expanding universe. Nature Reviews Genetics, 10(2):94–108, 2009.

[5] Silke Sperling. Transcriptional regulation at a glance. BMC bioinformatics, 8(6):S2, 2007.

[6] Albin Sandelin, Wynand Alkema, Pär Engström, Wyeth W Wasserman, and Boris Lenhard. Jaspar: an open-access database for eukaryotic transcription factor binding profiles. Nucleic acids research, 32(suppl_1):D91–D94, 2004.

[7] Ivan Yevshin, Ruslan Sharipov, Tagir Valeev, Alexander Kel, and Fedor Kolpakov. Gtrd: a database of transcription factor binding sites identified by chip-seq experiments. Nucleic acids research, page gkw951, 2016.

[8] Laura A Lettice, Simon JH Heaney, Lorna A Purdie, Li Li, Philippe de Beer, Ben A Oostra, Debbie Goode, Greg Elgar, Robert E Hill, and Esther de Graaff. A long-range shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Human molecular genetics, 12(14):1725–1735, 2003.

[9] Nathaniel D Heintzman and Bing Ren. Finding distal regulatory elements in the human genome. Current opinion in genetics & development, 19(6):541–549, 2009.

[10] Piero Carninci, Albin Sandelin, Boris Lenhard, Shintaro Katayama, Kazuro Shimokawa, Jasmina Ponjavic, Colin AM Semple, Martin S Taylor, Pär G Engström, Martin C Frith, et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nature genetics, 38(6):626–635, 2006.

[11] Elizabeth M Blackwood and James T Kadonaga. Going the distance: a current view of enhancer action. Science, 281(5373):60–63, 1998.

[12] Dirk A Kleinjan and Veronica van Heyningen. Long-range control of gene expression: emerging mechanisms and disruption in disease. The American Journal of Human Ge- netics, 76(1):8–32, 2005.

[13] Li M Li and David N Arnosti. Long-and short-range transcriptional repressors induce distinct chromatin states on repressed genes. Current Biology, 21(5):406–412, 2011.

[14] Somi Kim, Nam-Kyung Yu, and Bong-Kiun Kaang. Ctcf as a multifunctional pro- tein in genome regulation and gene expression. Experimental & molecular medicine, 47(6):e166–e166, 2015.

32 [15] Kathryn L Huisinga, Brent Brower-Toland, and Sarah CR Elgin. The contradictory def- initions of heterochromatin: transcription and silencing. Chromosoma, 115(2):110–122, 2006. [16] Hisashi Tamaru. Confining euchromatin/heterochromatin territory: jumonji crosses the line. Genes & development, 24(14):1465–1478, 2010. [17] ER Gibney and CM Nolan. Epigenetics and gene expression. Heredity, 105(1):4–13, 2010. [18] Andreas Lennartsson and Karl Ekwall. Histone modification patterns and epigenetic codes. Biochimica et biophysica acta (BBA)-general subjects, 1790(9):863–868, 2009. [19] Gerald P Holmquist and Terry Ashley. Chromosome organization and chromatin modifi- cation: influence on genome function and evolution. Cytogenetic and genome research, 114(2):96–125, 2006. [20] Michael Grunstein. Yeast heterochromatin: regulation of its assembly and inheritance by histones. Cell, 93(3):325–328, 1998. [21] Ken-ichi Noma, C David Allis, and Shiv IS Grewal. Transitions in distinct his- tone h3 methylation patterns at the heterochromatin domain boundaries. Science, 293(5532):1150–1155, 2001. [22] Jun-ichi Nakayama, Judd C Rice, Brian D Strahl, C David Allis, and Shiv IS Grewal. Role of histone h3 lysine 9 methylation in epigenetic control of heterochromatin assembly. Science, 292(5514):110–113, 2001. [23] Rosa Karlić, Ho-Ryun Chung, Julia Lasserre, Kristian Vlahoviček, and Martin Vingron. Histone modification levels are predictive for gene expression. Proceedings of the Na- tional Academy of Sciences, 107(7):2926–2931, 2010. [24] Lisa D Moore, Thuc Le, and Guoping Fan. Dna methylation and its basic function. Neu- ropsychopharmacology, 38(1):23–38, 2013. [25] Orlando J Miller, Wolfgang Schnedl, Julian Allen, and Bernard F Erlanger. 5- methylcytosine localised in mammalian constitutive heterochromatin. Nature, 251(5476):636–637, 1974. [26] Keith D Robertson and Peter A. Jones. Dna methylation: past, present and future direc- tions. Carcinogenesis, 21(3):461–467, 2000. [27] Azam Moosavi and Ali Motevalizadeh Ardekani. Role of epigenetics in biology and human diseases. Iranian biomedical journal, 20(5):246, 2016. [28] Jesse R Dixon, David U Gorkin, and Bing Ren. Chromatin domains: the unit of chromo- some organization. Molecular cell, 62(5):668–680, 2016. [29] Sjoerd Johannes Bastiaan Holwerda and Wouter de Laat. Ctcf: the protein, the binding partners, the binding sites and their chromatin loops. Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1620):20120369, 2013. [30] Michael H Nichols and Victor G Corces. A ctcf code for 3d genome architecture. Cell, 162(4):703–705, 2015.

33 [31] Jesse R Dixon, Siddarth Selvaraj, Feng Yue, Audrey Kim, Yan Li, Yin Shen, Ming Hu, Jun S Liu, and Bing Ren. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485(7398):376, 2012.

[32] Erez Lieberman-Aiden, Nynke L Van Berkum, Louise Williams, Maxim Imakaev, Tobias Ragoczy, Agnes Telling, Ido Amit, Bryan R Lajoie, Peter J Sabo, Michael O Dorschner, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. science, 326(5950):289–293, 2009.

[33] Suhas SP Rao, Miriam H Huntley, Neva C Durand, Elena K Stamenova, Ivan D Bochkov, James T Robinson, Adrian L Sanborn, Ido Machol, Arina D Omer, Eric S Lander, et al. A 3d map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 159(7):1665–1680, 2014.

[34] James Fraser, Iain Williamson, Wendy A Bickmore, and Josée Dostie. An overview of genome organization and how we got there: from fish to hi-c. Microbiol. Mol. Biol. Rev., 79(3):347–372, 2015.

[35] James Fraser, Carmelo Ferrai, Andrea M Chiariello, Markus Schueler, Tiago Rito, Gio- vanni Laudanno, Mariano Barbieri, Benjamin L Moore, Dorothee CA Kraemer, Stuart Aitken, et al. Hierarchical folding and reorganization of chromosomes are linked to tran- scriptional changes in cellular differentiation. Molecular systems biology, 11(12), 2015.

[36] Jesse R Dixon, Inkyung Jung, Siddarth Selvaraj, Yin Shen, Jessica E Antosiewicz- Bourget, Ah Young Lee, Zhen Ye, Audrey Kim, Nisha Rajagopal, Wei Xie, et al. Chromatin architecture reorganization during stem cell differentiation. Nature, 518(7539):331–336, 2015.

[37] Quentin Szabo, Frédéric Bantignies, and Giacomo Cavalli. Principles of genome folding into topologically associating domains. Science advances, 5(4):eaaw1668, 2019.

[38] Anders S Hansen, Claudia Cattoglio, Xavier Darzacq, and Robert Tjian. Recent evidence that tads and chromatin loops are dynamic structures. Nucleus, 9(1):20–32, 2018.

[39] P Andrew Futreal, Lachlan Coin, Mhairi Marshall, Thomas Down, Timothy Hubbard, Richard Wooster, Nazneen Rahman, and Michael R Stratton. A census of human cancer genes. Nature reviews cancer, 4(3):177, 2004.

[40] Anthony J Brookes. The essence of snps. Gene, 234(2):177–186, 1999.

[41] Barkur S Shastry. Snp alleles in human disease and evolution. Journal of human genetics, 47(11):561, 2002.

[42] Joel N Hirschhorn and Mark J Daly. Genome-wide association studies for common dis- eases and complex traits. Nature Reviews Genetics, 6(2):95, 2005.

[43] Zilfalil Bin Alwi. The use of snps in pharmacogenomics studies. The Malaysian journal of medical sciences: MJMS, 12(2):4, 2005.

[44] Yi Zhang, Mohith Manjunath, Jialu Yan, Brittany A Baur, Shilu Zhang, Sushmita Roy, and Jun S Song. The cancer-associated genetic variant rs3903072 modulates immune cells in the tumor microenvironment. Frontiers in Genetics, 10:754, 2019.

34 [45] Franklin W Huang, Eran Hodis, Mary Jue Xu, Gregory V Kryukov, Lynda Chin, and Levi A Garraway. Highly recurrent tert promoter in human melanoma. Science, 339(6122):957–959, 2013. [46] Stig E Bojesen, Karen A Pooley, Sharon E Johnatty, Jonathan Beesley, Kyriaki Michaili- dou, Jonathan P Tyrer, Stacey L Edwards, Hilda A Pickett, Howard C Shen, Chanel E Smart, et al. Multiple independent variants at the tert locus are associated with telomere length and risks of breast and ovarian cancer. Nature genetics, 45(4):371–384, 2013. [47] Eunkyong Ko, Hyun-Wook Seo, Eun Sun Jung, Baek-hui Kim, and Guhung Jung. The tert promoter snp rs2853669 decreases e2f1 transcription factor binding and increases mortality and recurrence risks in liver cancer. Oncotarget, 7(1):684, 2016. [48] Julienne M Mullaney, Ryan E Mills, W Stephen Pittard, and Scott E Devine. Small insertions and deletions (indels) in human genomes. Human molecular genetics, 19(R2):R131–R136, 2010. [49] Takahiro Nakagomi, Yosuke Hirotsu, Taichiro Goto, Daichi Shikata, Yujiro Yokoyama, Rumi Higuchi, Sotaro Otake, Kenji Amemiya, Toshio Oyama, Hitoshi Mochizuki, et al. Clinical implications of noncoding indels in the surfactant-encoding genes in lung cancer. Cancers, 11(4):552, 2019. [50] Kai Ye, Jiayin Wang, Reyka Jayasinghe, Eric-Wubbo Lameijer, Joshua F McMichael, Jie Ning, Michael D McLellan, Mingchao Xie, Song Cao, Venkata Yellapantula, et al. Sys- tematic discovery of complex insertions and deletions in human cancers. Nature medicine, 22(1):97, 2016. [51] Philip J Hastings, James R Lupski, Susan M Rosenberg, and Grzegorz Ira. Mechanisms of change in gene copy number. Nature Reviews Genetics, 10(8):551, 2009. [52] Gregory M Cooper, Bradley P Coe, Santhosh Girirajan, Jill A Rosenfeld, Tiffany H Vu, Carl Baker, Charles Williams, Heather Stalker, Rizwan Hamid, Vickie Hannig, et al. A copy number variation morbidity map of developmental delay. Nature genetics, 43(9):838, 2011. [53] Christian R Marshall, Abdul Noor, John B Vincent, Anath C Lionel, Lars Feuk, Jennifer Skaug, Mary Shago, Rainald Moessner, Dalila Pinto, Yan Ren, et al. Structural variation of chromosomes in autism spectrum disorder. The American Journal of Human Genetics, 82(2):477–488, 2008. [54] Ricarda Flöttmann, Bjørt K Kragesteen, Sinje Geuer, Magdalena Socha, Lila Allou, Anna Sowińska-Seidler, Laure Bosquillon De Jarcy, Johannes Wagner, Aleksander Jamsheer, Barbara Oehl-Jaschkowitz, et al. Noncoding copy-number variations are associated with congenital limb malformation. Genetics in Medicine, 20(6):599, 2018. [55] Katarina Dathe, Klaus W Kjaer, Anja Brehm, Peter Meinecke, Peter Nürnberg, Jordao C Neto, Decio Brunoni, Nils Tommerup, Claus E Ott, Eva Klopocki, et al. Duplications in- volving a conserved regulatory element downstream of bmp2 are associated with brachy- dactyly type a2. The American Journal of Human Genetics, 84(4):483–492, 2009. [56] Robert E Hill and Laura A Lettice. Alterations to the remote control of shh gene expres- sion cause congenital abnormalities. Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1620):20120357, 2013.

35 [57] Eve Anderson, Silvia Peluso, Laura A Lettice, and Robert E Hill. Human limb abnor- malities caused by disruption of hedgehog signaling. Trends in Genetics, 28(8):364–373, 2012.

[58] Eva Klopocki, Silke Lohan, Francesco Brancati, Randi Koll, Anja Brehm, Petra See- mann, Katarina Dathe, Sigmar Stricker, Jochen Hecht, Kristin Bosse, et al. Copy-number variations involving the ihh locus are associated with syndactyly and craniosynostosis. The American Journal of Human Genetics, 88(1):70–75, 2011.

[59] Bin Liu, Lei Yang, Binfang Huang, Mei Cheng, Hui Wang, Yinyan Li, Dongsheng Huang, Jian Zheng, Qingchu Li, Xin Zhang, et al. A functional copy-number variation in map- kapk2 predicts risk and prognosis of lung cancer. The American Journal of Human Ge- netics, 91(2):384–390, 2012.

[60] Xose S Puente, Silvia Beà, Rafael Valdés-Mas, Neus Villamor, Jesús Gutiérrez-Abril, José I Martín-Subero, Marta Munar, Carlota Rubio-Pérez, Pedro Jares, Marta Aymerich, et al. Non-coding recurrent mutations in chronic lymphocytic leukaemia. Nature, 526(7574):519, 2015.

[61] Darío G Lupiáñez, Katerina Kraft, Verena Heinrich, Peter Krawitz, Francesco Brancati, Eva Klopocki, Denise Horn, Hülya Kayserili, John M Opitz, Renata Laxova, et al. Dis- ruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell, 161(5):1012–1025, 2015.

[62] Anne-Laure Valton and Job Dekker. Tad disruption as oncogenic driver. Current opinion in genetics & development, 36:34–40, 2016.

[63] Joachim Weischenfeldt, Taronish Dubash, Alexandros P Drainas, Balca R Mardin, Yuanyuan Chen, Adrian M Stütz, Sebastian M Waszak, Graziella Bosco, Ann Rita Halvorsen, Benjamin Raeder, et al. Pan-cancer analysis of somatic copy-number alter- ations implicates irs4 and igf2 in enhancer hijacking. Nature genetics, 49(1):65–74, 2017.

[64] Phillippa C Taberlay, Joanna Achinger-Kawecka, Aaron TL Lun, Fabian A Buske, Ken- neth Sabir, Cathryn M Gould, Elena Zotenko, Saul A Bert, Katherine A Giles, Denis C Bauer, et al. Three-dimensional disorganization of the cancer genome occurs coincident with long-range genetic and epigenetic alterations. Genome research, 26(6):719–731, 2016.

[65] Denes Hnisz, Abraham S Weintraub, Daniel S Day, Anne-Laure Valton, Rasmus O Bak, Charles H Li, Johanna Goldmann, Bryan R Lajoie, Zi Peng Fan, Alla A Sigova, et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science, 351(6280):1454–1458, 2016.

[66] William A Flavahan, Yotam Drier, Brian B Liau, Shawn M Gillespie, Andrew S Vente- icher, Anat O Stemmer-Rachamimov, Mario L Suvà, and Bradley E Bernstein. Insula- tor dysfunction and oncogene activation in idh mutant gliomas. Nature, 529(7584):110, 2016.

[67] Yanling Peng and Yubo Zhang. Enhancer and super-enhancer: Positive regulators in gene transcription. Animal models and experimental medicine, 1(3):169–179, 2018.

36 [68] Len A Pennacchio, Wendy Bickmore, Ann Dean, Marcelo A Nobrega, and Gill Bejerano. Enhancers: five essential questions. Nature Reviews Genetics, 14(4):288, 2013.

[69] Tomoko Sagai, Masaki Hosoya, Youichi Mizushina, Masaru Tamura, and Toshihiko Shi- roishi. Elimination of a long-range cis-regulatory module causes complete loss of limb- specific shh expression and truncation of the mouse limb. Development, 132(4):797–803, 2005.

[70] Nathaniel D Heintzman, Gary C Hon, R David Hawkins, Pouya Kheradpour, Alexander Stark, Lindsey F Harp, Zhen Ye, Leonard K Lee, Rhona K Stuart, Christina W Ching, et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature, 459(7243):108–112, 2009.

[71] Liping Wei, Yueyi Liu, Inna Dubchak, John Shon, and John Park. Comparative genomics approaches to study organism similarities and differences. Journal of biomedical infor- matics, 35(2):142–150, 2002.

[72] Dario Boffelli, Marcelo A Nobrega, and Edward M Rubin. Comparative genomics at the vertebrate extremes. Nature Reviews Genetics, 5(6):456–465, 2004.

[73] Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W James Kent, John S Mattick, and David Haussler. Ultraconserved elements in the human genome. Science, 304(5675):1321–1325, 2004.

[74] Axel Visel, Shyam Prabhakar, Jennifer A Akiyama, Malak Shoukry, Keith D Lewis, Amy Holt, Ingrid Plajzer-Frick, Veena Afzal, Edward M Rubin, and Len A Pennacchio. Ultra- conservation identifies a small subset of extremely constrained developmental enhancers. Nature genetics, 40(2):158–160, 2008.

[75] Adam Woolfe, Martin Goodson, Debbie K Goode, Phil Snell, Gayle K McEwen, Tanya Vavouri, Sarah F Smith, Phil North, Heather Callaway, Krys Kelly, et al. Highly con- served non-coding sequences are associated with vertebrate development. PLoS Biol, 3(1):e7, 2004.

[76] W James Kent. Blat—the blast-like alignment tool. Genome research, 12(4):656–664, 2002.

[77] Adam Siepel, Gill Bejerano, Jakob S Pedersen, Angie S Hinrichs, Minmei Hou, Kate Rosenbloom, Hiram Clawson, John Spieth, LaDeana W Hillier, Stephen Richards, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome research, 15(8):1034–1050, 2005.

[78] Annabelle Haudry, Adrian E Platts, Emilio Vello, Douglas R Hoen, Mickael Leclercq, Robert J Williamson, Ewa Forczek, Zoé Joly-Lopez, Joshua G Steffen, Khaled M Haz- zouri, et al. An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nature genetics, 45(8):891–898, 2013.

[79] Natalie Ward and Gabriel Moreno-Hagelsieb. Quickly finding orthologs as reciprocal best hits with blat, last, and ublast: how much do we miss? PloS one, 9(7), 2014.

[80] Medha Bhagwat, Lynn Young, and Rex R Robison. Using blat to find sequence similarity in closely related genomes. Current protocols in bioinformatics, 37(1):10–8, 2012.

37 [81] Amol Prakash and Martin Tompa. Discovery of regulatory elements in vertebrates through comparative genomics. Nature biotechnology, 23(10):1249–1256, 2005.

[82] Jennifer Harrow, Adam Frankish, Jose M Gonzalez, Electra Tapanari, Mark Diekhans, Felix Kokocinski, Bronwen L Aken, Daniel Barrell, Amonida Zadissa, Stephen Searle, et al. Gencode: the reference human genome annotation for the encode project. Genome research, 22(9):1760–1774, 2012.

[83] Bradley E Bernstein, John A Stamatoyannopoulos, Joseph F Costello, Bing Ren, Alek- sandar Milosavljevic, Alexander Meissner, Manolis Kellis, Marco A Marra, Arthur L Beaudet, Joseph R Ecker, et al. The nih roadmap epigenomics mapping consortium. Nature biotechnology, 28(10):1045–1048, 2010.

[84] Kimberly R Kukurba and Stephen B Montgomery. Rna sequencing and analysis. Cold Spring Harbor Protocols, 2015(11):pdb–top084970, 2015.

[85] Gabriel E Zentner and Peter C Scacheri. The chromatin fingerprint of gene enhancer elements. Journal of Biological Chemistry, 287(37):30888–30896, 2012.

[86] Christa Buecker and Joanna Wysocka. Enhancers as information integration hubs in de- velopment: lessons from genomics. Trends in Genetics, 28(6):276–284, 2012.

[87] Sandy L Klemm, Zohar Shipony, and William J Greenleaf. Chromatin accessibility and the regulatory epigenome. Nature Reviews Genetics, 20(4):207–220, 2019.

[88] Cancer Genome Atlas Research Network et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455(7216):1061, 2008.

[89] International Cancer Genome Consortium et al. International network of cancer genome projects. Nature, 464(7291):993, 2010.

[90] Simon A Forbes, David Beare, Prasad Gunasekaran, Kenric Leung, Nidhi Bindal, Harry Boutselakis, Minjie Ding, Sally Bamford, Charlotte Cole, Sari Ward, et al. Cosmic: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic acids research, 43(D1):D805–D811, 2014.

[91] Stacey L Edwards, Jonathan Beesley, Juliet D French, and Alison M Dunning. Beyond gwass: illuminating the dark road from association to function. The American Journal of Human Genetics, 93(5):779–797, 2013.

[92] Xingli Guo, Lin Gao, Yu Wang, David KY Chiu, Tong Wang, and Yue Deng. Advances in long noncoding rnas: identification, structure prediction and function annotation. Brief- ings in functional genomics, 15(1):38–46, 2015.

[93] Ekta Khurana, Yao Fu, Dimple Chakravarty, Francesca Demichelis, Mark A Rubin, and Mark Gerstein. Role of non-coding sequence variants in cancer. Nature Reviews Genetics, 17(2):93, 2016.

[94] Brian S Gloss and Marcel E Dinger. Realizing the significance of noncoding functionality in clinical genomics. Experimental & molecular medicine, 50(8):1–8, 2018.

38 [95] John F Fullard, Claudia Giambartolomei, Mads E Hauberg, Ke Xu, Georgios Voloudakis, Zhiping Shao, Christopher Bare, Joel T Dudley, Manuel Mattheisen, Nikolaos K Robakis, et al. Open chromatin profiling of human postmortem brain infers functional roles for non-coding schizophrenia loci. Human molecular genetics, 26(10):1942–1951, 2017.

[96] Stanley Zhou, James R Hawley, Fraser Soares, Giacomo Grillo, Mona Teng, Seyed Ali Madani Tonekaboni, Junjie Tony Hua, Ken J Kron, Parisa Mazrooei, Musaddeque Ahmed, et al. Noncoding mutations target cis-regulatory elements of the foxa1 plexus in prostate cancer. Nature Communications, 11(1):1–13, 2020.

[97] Seyed Ali Madani Tonekaboni, Parisa Mazrooei, Victor Kofia, Benjamin Haibe-Kains, and Mathieu Lupien. Identifying clusters of cis-regulatory elements underpinning tad structures and lineage-specific regulatory networks. Genome research, 29(10):1733– 1743, 2019.

[98] Stefan Schoenfelder and Peter Fraser. Long-range enhancer–promoter contacts in gene expression control. Nature Reviews Genetics, page 1, 2019.

[99] Niall Dillon and Pierangela Sabbattini. Functional gene expression domains: defining the functional unit of eukaryotic gene regulation. Bioessays, 22(7):657–665, 2000.

[100] Gabriel Kreiman. Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes. Nucleic acids research, 32(9):2889–2900, 2004.

[101] Yin Chen, Bin Yao, Zhongze Zhu, Yongzhu Yi, Zhifang Zhang, Guifang Shen, et al. A constitutive super-enhancer: homologous region 3 of bombyx mori nucleopolyhe- drovirus. Biochemical and biophysical research communications, 318(4):1039–1044, 2004.

[102] Denes Hnisz, Brian J Abraham, Tong Ihn Lee, Ashley Lau, Violaine Saint-André, Alla A Sigova, Heather A Hoke, and Richard A Young. Super-enhancers in the control of cell identity and disease. Cell, 155(4):934–947, 2013.

[103] Ha Youn Shin, Michaela Willi, Kyung Hyun Yoo, Xianke Zeng, Chaochen Wang, Gil Metser, and Lothar Hennighausen. Hierarchy within the mammary stat5-driven wap super-enhancer. Nature genetics, 48(8):904, 2016.

[104] Nariman Battulin, Veniamin S Fishman, Alexander M Mazur, Mikhail Pomaznoy, Anna A Khabarova, Dmitry A Afonnikov, Egor B Prokhortchouk, and Oleg L Serov. Comparison of the three-dimensional organization of sperm and fibroblast genomes us- ing the hi-c approach. Genome biology, 16(1):77, 2015.

[105] Vera B Kaiser and Colin A Semple. When tads go bad: chromatin structure and nuclear organisation in human disease. F1000Research, 6, 2017.

[106] Hong Sun, Geir Skogerbø, and Runsheng Chen. Conserved distances between vertebrate highly conserved elements. Human molecular genetics, 15(19):2911–2922, 2006.

[107] Lifei Li, Nicolai KH Barth, Eva Hirth, and Leila Taher. Pairs of adjacent conserved noncoding elements separated by conserved genomic distances act as cis-regulatory units. Genome biology and evolution, 10(9):2535–2550, 2018.

39 [108] Nicolai KH Barth, Lifei Li, and Leila Taher. Independent transposon exaptation is a widespread mechanism of redundant enhancer evolution in the mammalian genome. Genome Biology and Evolution, 2020. [109] Lifei Li, Nicolai KH Barth, Christian Pilarsky, and Leila Taher. Cancer is associated with alterations in the three-dimensional organization of the genome. Cancers, 11(12):1886, 2019. [110] Eva Lentsch, Lifei Li, Susanne Pfeffer, Arif B Ekici, Leila Taher, Christian Pilarsky, and Robert Grützmann. Crispr/cas9-mediated knock-out of krasg12d mutated pancreatic cancer cell lines. International journal of molecular sciences, 20(22):5706, 2019. [111] Laurent Duret and Philipp Bucher. Searching for regulatory elements in human noncoding sequences. Current opinion in structural biology, 7(3):399–406, 1997. [112] Len A Pennacchio and Edward M Rubin. Genomic strategies to identify mammalian regulatory sequences. Nature reviews genetics, 2(2):100, 2001. [113] Hong Sun, Geir Skogerbø, Xiaohui Zheng, Wei Liu, and Yixue Li. Genomic regions with distinct genomic distance conservation in vertebrate genomes. BMC genomics, 10(1):133, 2009. [114] Lena Müller, Petra Aigner, and Dagmar Stoiber. Type i interferons and natural killer cell regulation in cancer. Frontiers in immunology, 8:304, 2017. [115] Hideaki Tsuyoshi and Yoshio Yoshida. Molecular biomarkers for uterine leiomyosarcoma and endometrial stromal sarcoma. Cancer science, 109(6):1743–1752, 2018. [116] Cancer Genome Atlas Research Network et al. Comprehensive and integrated genomic characterization of adult soft tissue sarcomas. Cell, 171(4):950, 2017. [117] Emanuela D’Angelo and Jaime Prat. Uterine sarcomas: a review. Gynecologic oncology, 116(1):131–139, 2010. [118] Adam G Jones, Reinhard Bürger, and Stevan J Arnold. Epistasis and natural selection shape the mutational architecture of complex traits. Nature communications, 5(1):1–10, 2014. [119] Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017. [120] Josh Tycko, Michael Wainberg, Georgi K Marinov, Oana Ursu, Gaelen T Hess, Brae- den K Ego, Amy Li, Alisa Truong, Alexandro E Trevino, Kaitlyn Spees, et al. Mitigation of off-target toxicity in crispr-cas9 screens for essential non-coding elements. Nature communications, 10(1):1–14, 2019. [121] Rui Lopes, Gozde Korkmaz, and Reuven Agami. Applying crispr–cas9 tools to iden- tify and characterize transcriptional enhancers. Nature Reviews Molecular Cell Biology, 17(9):597–604, 2016. [122] Hannah K Long, Sara L Prescott, and Joanna Wysocka. Ever-changing landscapes: tran- scriptional enhancers in development and evolution. Cell, 167(5):1170–1187, 2016.

40