WESTFÄLISCHE WILHELMS-UNIVERSITÄT MÜNSTER

MASTER THESIS

Transposable Elements Significantly Contributed to the Core Promoters in the Human Genome

Author: Marten KELLNER
First Examiner: Dr. Francesco CATANIA
Supervisor/Second Examiner: Prof. Dr. Wojciech MAKAŁOWSKI

A thesis submitted in fulfillment of the requirements for the degree of Master of Science

in the

Comparative Genomics Group Institute of Bioinformatics WWU Münster

August 20, 2019

Declaration of Academic Integrity

I, Marten KELLNER, declare that this thesis titled, “Transposable Elements Significantly Contributed to the Core Promoters in the Human Genome” and the work presented is solely my own work and that I have used no sources or aids other than the ones stated. All passages in my thesis for which other sources, including electronic media, have been used, be it direct quotes or content references, have been acknowledged as such and the sources cited.

Signed:

Date:

I agree to have my thesis checked in order to rule out potential similarities with other works and to have my thesis stored in a database for this purpose.

Signed:

Date:

“The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom.”

Isaac Asimov

WESTFÄLISCHE WILHELMS-UNIVERSITÄT MÜNSTER

Abstract

Faculty of Biology Institute of Bioinformatics WWU Münster

Master of Science

Transposable Elements Significantly Contributed to the Core Promoters in the Human Genome

by Marten KELLNER

Transposable elements (TEs) are major components of the human genome, constituting at least half of it. More than half a century ago, Barbara McClintock, and later Roy Britten and Eric Davidson, postulated that TEs might be major players in host gene regulation. In this study, the large amount of data produced by the ENCODE project was scanned for active transcription factor binding sites (TFBSs) located in TE-originated parts of polymerase II promoters. In total, more than 35,000 promoters in six different tissues were analyzed, and over 26,000 of them harbored TEs. Moreover, these TEs usually provide one or more TFBSs to the host promoters, with the result that more than 6% of the active TFBSs in promoters are located in TE-originated sequences. Rewiring of transcription circuits played a significant role in mammalian evolution and consequently increased the functional and morphological diversity of mammals. This large-scale analysis demonstrated that TEs contributed a large fraction of human TFBSs. Interestingly, these TFBSs usually act in a tissue-specific manner. Many potential TFBSs carried by LINE and LTR elements into promoter regions became inactive, whereas potential TFBSs carried by SINE elements became active in promoter regions. Furthermore, we have shown that TE-originated TFBSs often influence transcription both positively and negatively, making their net effect more neutral. Thus, our study clearly shows that TEs played a significant role in shaping expression patterns in mammals, and in humans in particular. Furthermore, since several TE families are still active in our genome, they continue to influence not only our genome architecture but also gene functioning in a broader sense.

Acknowledgements

First of all, I would like to express my deepest gratitude to my thesis advisor Prof. Dr. Wojciech Makalowski of the Institute of Bioinformatics in Münster. The door to Prof. Makalowski’s office was always open whenever I had a question about my research or writing. I am very grateful for his patience, his continued support and his invaluable advice.

I would also like to acknowledge Dr. Francesco Catania of the Institute of Evolutionary Biodiversity at the Westfälische Wilhelms-Universität in Münster for his great support.

In addition, I would like to thank my colleagues from the Institute of Bioinformatics, who always had an open ear for me and my questions. Thank you for supporting me with your inspiring discussions. They accepted me with open arms and created a friendly and supportive work environment.

Finally, I must express my very profound gratitude to my family and to my girlfriend for providing me with unfailing support and encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.

Contents

Declaration of Academic Integrity

Abstract

Acknowledgements

1 Introduction
    1.1 History of transposable elements
    1.2 Classification and duplication of transposons
    1.3 Transcription factors and methods to discover binding sites
    1.4 Transposons and their involvement with transcription factor binding sites

2 Materials and Methods
    2.1 Materials
        2.1.1 Programs
        2.1.2 Data Sources
    2.2 Analyses
        2.2.1 Data Preparation
        2.2.2 JASPAR Analysis of ENCODE Hits
        2.2.3 Genome Overview
        2.2.4 Tissue Comparison
        2.2.5 Analysis of Individual TFs
        2.2.6 Pathway Enrichment Analysis

3 Results
    3.1 Data selection
    3.2 TE distribution
    3.3 TFBSs located in TE-derived sequences
    3.4 Tissue specific TFBSs located in TE-derived sequences
        3.4.1 Influence of TFBSs on transcription in different genome regions
        3.4.2 Individual TFBSs
    3.5 Pathway analysis
        3.5.1 Pathway analysis of genes affected by TE-originated TFBSs
        3.5.2 Pathway analysis of promoters without any TEs

4 Discussion

A Supplementary - Figures

B Supplementary - Tables

Bibliography

List of Figures

1.1 Schematic representations of TF binding, ChIP-Seq process, and ChIP-Seq peak calling

2.1 Flowchart of the main analysis script

3.1 Violin-plot of the number of ENCODE entries for four TFs with the highest experiment amount
3.2 Density of TFBSs, TEs, and promoter regions on chromosome 18
3.3 Nucleotide Distribution of TE-families
3.4 Fraction of promoter area occupied by TE-originated and binding site associated sequences
3.5 Fraction of promoter area occupied by different families of TE-originated sequences
3.6 Distribution of TE derived sequences in pol II promoter regions
3.7 Distribution of TFBSs in TE derived sequences in pol II promoter regions
3.8 Distribution of TFBSs in different TE-families in promoter regions and out of promoter regions
3.9 FAMD of TFBSs of TE derived promoter sequences and TE sequences not in promoter
3.10 Pairwise comparison of TFBS’s uniqueness in TE derived promoter sequences
3.11 Pairwise comparison of TFBS’s uniqueness in promoter sequences not derived from TEs
3.12 Number of TFBSs with transcription related GO-Annotations in promoter without TEs, TEs not in promoter, and TEs in promoter regions
3.13 Fraction of shared TFBSs in TE derived promoter sequences of five tissues
3.14 Box-plots of fraction of active TFBSs against possible TFBSs in TEs in promoter and TEs not in promoter for five different TFs and TEs sub-families
3.15 WordCloud representation of pathway analysis from promoters with no TE derived sequences

A.1 Observed and expected numbers of TFBSs in TE-derived promoter sequences and TE sequences not in promoter
A.2 Graphical representation of the number of shared and unique TFBSs for TE derived promoter sequences in pairwise tissue comparison
A.3 Graphical representation of the number of shared and unique TFBSs for promoter regions without TEs in pairwise tissue comparison
A.4 Number of all possible motif positions for ENCODE entries from MEF2B and CTCF with FPR-threshold of α = 0.05 and 0.01

List of Tables

2.1 Number of available TFs experimentally analysed in different tissues

3.1 Percent of ENCODE entries with hits (α = 0.05) for TFBSs in range of ENCODE peak for different distances with 154 different TFs. For each ENCODE entry only one hit was counted, either only if no other hit was documented at this genome position (one per position) or without this restriction (multiple per position)
3.2 Transposon distribution in different genomic regions
3.3 Human genes whose promoters almost completely originated in TEs
3.4 Number of TFs with GO-Annotations influencing transcription with chromatin strength, histone binding, transcription, or all sets
3.5 Number of TEs from the sub-families L1, L2, Alu, MIR, and ERV1 with ENCODE or JASPAR hits for each analysed TF
3.6 Pathways overrepresented in different tissue comparisons of TFBSs

B.1 Gene categories used in the study
B.2 Pathways enriched in the gene set whose promoters harbor TE-derived sequences
B.3 Pathways enriched in the gene set whose promoters were devoid of TEs
B.4 GO-term annotations used to infer TFBS influence (increase or decrease) on transcription by influencing chromatin strength, histone, or transcription directly
B.5 List of transcription factors with available ENCODE data and their corresponding JASPAR motif matrices with consensus sequence used in this study

List of Abbreviations

DBD       DNA Binding Domain
ENCODE    ENCyclopedia Of DNA Elements
LINE      Long Interspersed Nuclear Element
LTR       Long Terminal Repeat
PWM       Position Weight Matrix
RC        Rolling Circle
SINE      Short Interspersed Nuclear Element
TE        Transposable Element
TFBS      Transcription Factor Binding Site
TSS       Transcription Start Site

Chapter 1

Introduction

Partial results of the presented work have been published in: Kellner, Marten, and Wojciech Makałowski. 2019. “Transposable Elements Significantly Contributed to the Core Promoters in the Human Genome.” Science China Life Sciences 62(4): 489–97. https://doi.org/10.1007/s11427-018-9449-0

1.1 History of transposable elements

Transposable elements (TEs) are major components of all eukaryotic genomes, and the estimated share of TE-derived sequence in the human genome has increased over time. In 2001 it was estimated that 45% of the human genome is derived from TE sequences (Lander et al., 2001); ten years later the estimate had risen to over half of the human genome, with most of the major TE families represented (Koning et al., 2011). The true proportion is probably even higher, since older transposon copies have mutated beyond recognition (Lander et al., 2001; Koning et al., 2011). With such a large contribution to the human genome sequence, it is not surprising that TEs have a significant influence on genome organisation and evolution (Kazazian and Moran, 1998; Makałowski, 2000; Feschotte, 2008; Cordaux and Batzer, 2009). TEs were discovered by Barbara McClintock in the mid-20th century, and since they appeared to influence phenotypic traits, she named them controlling elements (McClintock, 1950; McClintock, 1956). Unfortunately, her revolutionary discovery was not well received, and it took decades for its importance to be recognised (Malamy et al., 1972). Nevertheless, the discovery that a major fraction of eukaryotic genomes is repetitive in nature (Waring and Britten, 1966; Britten and Kohne, 1968) led to the hypothesis that repetitive sequences play an important role in the regulation of gene expression (Britten and Davidson, 1969). Since most of the repetitive sequences in eukaryotic genomes originated from TEs, Britten and Davidson’s hypothesis can be applied to TEs.
As the end of the previous century witnessed a shift in the view of TEs from selfish genomic parasites (Doolittle and Sapienza, 1980; Orgel and Crick, 1980; Hickey, 1982) to drivers of evolution and facilitators of genomic inventions (Davidson and Britten, 1979; Brosius, 1991; Makalowski, 1995), evidence started to accumulate supporting Britten and Davidson’s and McClintock’s ideas (Banville and Boie, 1989; Sverdlov, 1998; Hamdi et al., 2000).

1.2 Classification and duplication of transposons

Transposons are usually assigned to one of two classes: class I (RNA transposons or retrotransposons) or class II (DNA transposons) (Finnegan, 1989). The two classes differ in the intermediate they use to transpose within the genome and therefore in their mode of proliferation: class I elements use a copy-and-paste mechanism with RNA as an intermediate, whereas class II elements favour a cut-and-paste approach. The classes are further divided into nine orders, five in class I and four in class II, and many families (Wicker et al., 2007). When inserting into the genome, TEs show a preference for specific regions, which are either gene-rich or gene-poor depending on the TE (Chuong et al., 2017). This results in a non-random distribution in the genome (Korenberg and Rykowski, 1988). Although the image of transposons has changed over time from selfish parasites to drivers of evolution, they remain parasites that depend on the host cell machinery for transposition.

The transposition of a fully functional LINE element (about 6 kb), for example, which belongs to the non-LTR sub-class of class I, starts with polymerase II binding at the 5’-UTR of the element. The whole element is transcribed into RNA and exported into the cytoplasm. There the proteins encoded by the two open reading frames are translated and bind directly to the LINE RNA, forming a protein-RNA complex (Feng et al., 1996). This complex is imported into the nucleus, where the endonuclease of the complex nicks one DNA strand at the target sequence. The 3’-UTR of the LINE RNA can then anneal to the nicked sequence, and reverse transcription starts. The final steps of integration are still unknown, but premature termination of reverse transcription can lead to non-functional 5’-truncated copies (Viollet et al., 2014; Klein and O’Neill, 2018). At the end of the transposition, the number of LINE elements in the genome has increased by one.

SINE elements belong to the same sub-class as LINE elements, but they depend on LINE elements for their transposition (Okada et al., 1997; Dewannieux et al., 2003; Sinnett et al., 1992). This classifies SINEs as non-autonomous TEs, whereas LINEs are autonomous TEs (Hurst and Werren, 2001). For transcription, SINE elements use polymerase III, for which they contain an internal promoter. This is not surprising, since the elements evolved from tRNA, 7SL RNA, and 5S rRNA (Ullu and Tschudi, 1984; Labuda et al., 1991; Piskurek et al., 2003; Kapitonov and Jurka, 2003). With a length of 300 bp they are much shorter than LINEs and do not contain any open reading frames (Deininger, 2011). Instead of using their own proteins, they wait in the nucleus for LINE elements to finish their retrotransposition and then make use of the freed reverse transcriptase and endonuclease (Sinnett et al., 1992; Feng et al., 1996; Jurka, 1997). Certain SINE and LINE elements therefore terminate in the same 3’ sequence (Okada et al., 1997; Kajikawa and Okada, 2002).

The other retrotransposon sub-class, the long terminal repeat (LTR) elements, uses the copy-and-paste method as well. These elements mostly consist of two long repeated sequences, between 250 and 600 bp long, flanking a central coding region of 6 to 7 kb (Holton et al., 2001). The coding region contains gag, pol, and sometimes env-like genes (Boeke and Corces, 1989; Havecker et al., 2004). The LTR regions contain promoter and enhancer sequences, while the structural proteins of the virus-like particles are encoded by the gag gene (Matthews et al., 1997). Enzymes needed for reverse transcription and integration are provided by the pol gene (Matthews et al., 1997). The transposition is somewhat more complicated than in non-LTR elements. First, the element is transcribed into RNA and exported into the cytoplasm. There the translated virus-like particles of the transposon encapsulate the RNA like a viral envelope, inside which reverse transcription occurs (Lander et al., 2001). The first cDNA strand is synthesised with a tRNA molecule as primer. In parallel, the RNA template is degraded with the exception of one resistant 3’ RNA fragment, which functions as a new primer for a shorter sequence (Zhang et al., 2014). This short element jumps to the 3’ end of the first cDNA strand, and the second strand is fully synthesised. In a last step the cDNA is transported into the nucleus and integrated into the genome (Zhang et al., 2014).

Class II DNA transposons generally have a transposase-encoding region flanked by terminal inverted repeats. As already mentioned, class II transposons use a cut-and-paste approach, in which the whole element is excised from the genome and inserted into a new region (Muñoz-López and García-Pérez, 2010). This system has a major flaw: since the encoded transposase is produced in the cytoplasm, it cannot distinguish active from inactive elements when it returns to the nucleus (Lander et al., 2001). In this way inactive elements accumulate in the genome over time (Lander et al., 2001).

Although more than half of the human genome is derived from TEs (Koning et al., 2011), the overall activity of transposons has declined in the hominid lineage (Lander et al., 2001; Mills et al., 2007). It has been shown that in the last 6 million years mostly L1 (LINE), and Alu and SVA (SINE) elements were actively transposed (Mills et al., 2007; Lander et al., 2001; Wallace et al., 1991; Kazazian and Moran, 1998), with Alu and L1 contributing 95% of active TE insertions in the human genome (Sudmant et al., 2015).

1.3 Transcription factors and methods to discover binding sites

Transcription factors (TFs) are proteins which bind DNA in a sequence-specific manner and regulate transcription (Fig. 1.1 A). They can be divided into three groups. The first group comprises the pioneer TFs. These target DNA in nucleosomes and help to deplete them, which opens the chromatin for other proteins (Henikoff and Ramachandran, 2018; Magnani et al., 2011; Cirillo et al., 2002; Taylor et al., 1991; Adams and Workman, 1995; Iwafuchi-Doi et al., 2016). Pioneer TFs contain special domains, which enable some of them to modify histones by methylation or acetylation or to bind enzymes with similar functions (Frietze and Farnham, 2011; Zhu et al., 2013). The second group, settler TFs, can only bind DNA in open chromatin (Sherwood et al., 2014; Lee and Young, 2013), and the third group, migrant TFs, binds only to a subset of its available target sites (Slattery et al., 2014). Another possible way to group TFs is by their regulatory role in the control of initiation versus elongation (Fuda et al., 2009; Rahl and Young, 2014; Kwak and Lis, 2013). This classification is flawed, however, since it leaves a grey area of TFs that are active in both initiation and elongation.

The ability of TFs to bind DNA depends not only on chromatin structure and nucleotide sequence (Liu et al., 2006; Bai and Morozov, 2010; Jolma et al., 2013), but also on DNA methylation (Lazarovici et al., 2013) and on cooperative DNA binding of TFs (Lambert et al., 2018; Panne, 2008; Wasson and Hartemink, 2009). Some TFs even form dimers and need two binding sites close together (Lambert et al., 2018; Hu and Gallo, 2010). The influence of TFs on transcription (synergistic regulation) (Lambert et al., 2018) is due not only to their ability to affect the chromatin state, but also to the recruitment of co-factors and proteins of the RNA polymerase II machinery to promoter regions (Lelli et al., 2012; Frietze and Farnham, 2011) or to the obstruction of binding sites of other proteins (Ptashne, 2011; Cirillo, 1998). For effective regulation of transcription in any genome, transcription factors need to bind reliably in the right genomic region, e.g. promoter regions, regardless of the method they use to regulate transcription. Which region a TF binds is determined by two mechanisms. The first is shape readout, where the TF recognises sequence-typical DNA structures or deformities (Samee et al., 2017; Mathelier et al., 2016; Rohs et al., 2009). The other, more commonly analysed mechanism is base readout. Here the amino acids of the TF interact physically with accessible edges of base pairs of single- or double-stranded DNA to find their sequence-specific target (Slattery et al., 2014).

FIGURE 1.1: (A) Schematic of a typical TF binding and interacting with other cell elements, source: Lambert et al., 2018; (B) Schematic diagram of the ChIP-Seq process, source: Raha et al., 2010; (C) Representation of ChIP-Seq data using density profiles and following peak calling, source: Valouev et al., 2008

Before identifying where a TF binds DNA, the TFs themselves have to be discovered. In the past, mainly experimental methods were used, such as protein microarrays (Wong et al., 2016; Hall et al., 2004). With modern advancements the field shifted to a more computational approach, and TFs are now primarily discovered by homology searches for already identified, TF-class-characteristic DNA-binding domains (DBDs) (Vaquerizas et al., 2012; Gao et al., 2018). The methods for the discovery of binding sites have undergone a similar shift. In the early days, methods like DNase I footprinting and electrophoretic mobility shift assays were used to identify protein-DNA interactions in regions of interest, e.g. promoter regions (Garner and Revzin, 1981; Galas and Schmitz, 1978). The isolated DNA could also be sequenced to construct the specific binding sequences for the analysed TFs (Galas and Schmitz, 1978). Those first sequences were the foundation for a computational analysis method: position weight matrices (PWMs) (Slattery et al., 2014). First, an alignment matrix is built from the aligned sequences by counting the occurrences of each base at each position (Hertz and Stormo, 1999; Lambert et al., 2018). In the next step, this matrix is transformed into a PWM by taking, for each base at each position, the natural logarithm of its observed frequency divided by its a priori probability (Hertz and Stormo, 1999).

For a sequence analysis, the matrix is compared to a window of equal length in the sequence. The PWM values for the corresponding bases are summed to calculate the predicted relative affinity for this location (Lambert et al., 2018; Slattery et al., 2014). The highest-scoring hits indicate possible binding sites in the sequence. A clear disadvantage of PWMs is that only the base sequence is analysed and every other possible interaction is ignored (Slattery et al., 2014). This makes PWMs a good tool for in vitro analysis, but in vivo binding involves structures that are too complex and too many unknown parameters to predict the genes that are actually regulated with PWMs (Slattery et al., 2014; Cusanovich et al., 2014). To the present day, there is no binding site prediction model that takes all relevant levels of transcriptional regulation into account (Slattery et al., 2014). Existing TF motifs for different species are accessible in several databases, of which JASPAR (Sandelin et al., 2004) and TRANSFAC (Matys et al., 2006) are the best known. A recent study of human transcription factors put the number of human TFs available for research in the different databases at 1,639 (Lambert et al., 2018). The real number of TFs in the human genome is probably higher.
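The construction and use of a PWM described above can be illustrated with a minimal sketch. It assumes a uniform background, a pseudocount of one per base, and sequences that contain only A, C, G, and T; the function names and the toy binding sites are hypothetical and are not taken from the scripts used in this thesis.

```python
import math

BASES = "ACGT"

def build_pwm(aligned_sites, background=None, pseudocount=1.0):
    """Return one {base: log-odds weight} dict per motif position."""
    if background is None:
        background = {base: 0.25 for base in BASES}  # assumed uniform background
    pwm = []
    for pos in range(len(aligned_sites[0])):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + pseudocount * len(BASES)
        weights = {}
        for base in BASES:
            # Observed frequency (with pseudocount) over a priori probability.
            frequency = (column.count(base) + pseudocount) / total
            weights[base] = math.log(frequency / background[base])
        pwm.append(weights)
    return pwm

def scan(sequence, pwm):
    """Yield (start, score) for every window of motif length in the sequence."""
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        yield start, sum(pwm[i][base] for i, base in enumerate(window))

# Hypothetical aligned binding sites, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(sites)
best_start, best_score = max(scan("CCGTGACTCAGGA", pwm), key=lambda hit: hit[1])
```

In this toy case the highest score falls on the window that matches the consensus of the aligned sites. The pseudocount keeps unobserved bases from producing a logarithm of zero.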

Technological advancements have made new methods available with which in vivo binding sites of transcription factors can be identified experimentally. ChIP-Seq is one such method (Pepke et al., 2009; Johnson et al., 2007) and is widely used by the Encyclopedia of DNA Elements (ENCODE) consortium to localise active transcription factor binding sites (TFBSs) in different tissues and cell lines (Sloan et al., 2016). ChIP-Seq is the combination of chromatin immunoprecipitation and subsequent DNA sequencing. In preparation, DNA-binding proteins are fixed to the DNA through covalent bonds (cross-linking), most often with formaldehyde as the cross-linking agent (Tolinski, 2009; Raha et al., 2010; Park, 2009). The DNA is then sheared into small fragments by sonication, the DNA-protein complex of interest is immunoprecipitated with specific antibodies, and the sequence is determined after reversing the cross-links, e.g. through degradation of the proteins (see Fig. 1.1 B) (Raha et al., 2010; Park, 2009). To determine the binding positions from the large number of sequence reads, the ChIP-Seq reads are aligned to a reference sequence or genome, and peak calling is performed on the read densities (Fig. 1.1 C). After sonication, the DNA fragments extend to the right or left of the protein binding site, which is reflected in shifted peaks on the forward and reverse strands. The real binding site position therefore has to be estimated with computer algorithms (Zang et al., 2009; Valouev et al., 2008).

1.4 Transposons and their involvement with transcription factor binding sites

The first TEs were discovered as controlling elements due to their influence on the expression patterns of genes (McClintock, 1950; McClintock, 1956). It is therefore not surprising that TEs harbour TFBSs (Becker et al., 1993). After the human genome had been sequenced (Lander et al., 2001; Venter et al., 2001), computational analyses of the whole genome became possible. This led to the discovery of thousands of potential transcription factor binding sites located in TE-originated sequences (Jordan et al., 2003; Thornburg et al., 2006). More recently, these predictions were corroborated by a number of studies that focused on specific regulatory networks (Kunarso et al., 2010; Lynch et al., 2015; Lynch et al., 2011) or tissues (Chuong et al., 2013; Mariño-Ramírez et al., 2005; Ito et al., 2017). Binding sites for the transcription factor CTCF, for example, have been shown to be inserted into the core regulatory network of stem cells by transposons (Kunarso et al., 2010). In addition, TFBSs carried by TEs had a strong influence on primate evolution as a source of new regulatory sequences (Feschotte, 2008; Chuong et al., 2017) or as contributors to open chromatin regions (Jacques et al., 2013). The ability of TEs to maintain their promoter functions over vast evolutionary time-frames is possibly one of the reasons for their influence on regulatory evolution (Medstrand et al., 2005). There is even evidence that humans unknowingly used species-specific transpositions as a selection tool in plant and animal domestication (Sun et al., 2014; Lisch, 2013).

However, a global picture of the real TE contribution to regulatory networks is still missing. Most studies focus on a single species, tissue, or TF (Sundaram et al., 2014; Jacques et al., 2013). To fill this gap, the wealth of data produced by the ENCODE consortium (Sloan et al., 2016) was used to search for TE-originated active transcription factor binding sites in proximal promoters of genes transcribed by polymerase II in the human genome. Moreover, it was possible to tease out the data for several tissues and to compare the usage of TFBSs between them. Our study shows that TEs contributed a large number of TFBSs to the human genome. Interestingly, these TFBSs tend to be tissue-specific; our findings therefore support the idea that TEs significantly contribute to genome evolution and novelty.

Chapter 2

Materials and Methods

2.1 Materials

2.1.1 Programs

Most of the analyses were conducted using in-house scripts written in Python version 3.6.4 together with bedtools version 2.27.1. For pathway analysis, the R software environment version 3.4.4 with the package WebGestaltR version 0.1.1 was used. R was also used for statistical analysis.

2.1.2 Data Sources

Transcription factor binding site positions for 636 different transcription factors (TFs) were downloaded from the ENCODE project database [1] (Sloan et al., 2016) on May 29, 2018. The ENCODE data encompassed 1727 experiments covering a total of 126 different tissues and cell lines. A complete list of TFs and associated experiments is provided as supplementary material on the server of the Institute of Bioinformatics in Münster [2].

The annotation for the human genome was downloaded from the GENCODE project [3] (Harrow et al., 2012) on June 22, 2018. Release 28, which is based on the Genome Reference Consortium Human Build 38 patch release 12 (GRCh38.p12), was chosen since it was the newest release available.

The UCSC Genome Table Browser [4] provided annotated positions of transposable elements (TEs) for the human genome assembly GRCh38. They were extracted on May 30, 2018 by setting the group to Repeats and the track to RepeatMasker.

A hierarchy for TFs is available on TFClass [5] and was obtained on June 22, 2018. TFClass is a free database in which eukaryotic TFs are classified based on their DNA-binding domains (Wingender et al., 2018).

[1] https://www.encodeproject.org/
[2] http://www.bioinformatics.uni-muenster.de/share/TFBS_human-genome/
[3] https://www.gencodegenes.org/release/current.html
[4] https://www.genome.cse.ucsc.edu/cgi-bin/hgTables
[5] https://tfclass.bioinf.med.uni-goettingen.de/

Motifs for 579 of the 636 TFs were available and were downloaded from the online JASPAR database [6] (Khan et al., 2018) on October 31, 2018. When multiple motif matrices were available for one TF, the newest matrix was chosen. A full list of matrices and associated TFs used in this study is provided in Table B.5.

To obtain GO term accessions associated with transcription regulation, the Ensembl gene IDs of the 636 TFs were uploaded to the BioMart tool available on the Ensembl website [7] (Zerbino et al., 2018). The dataset was set to Human genes (GRCh38.p12), and the respective GO term accessions, Ensembl IDs, and GO term definitions were downloaded on December 20, 2018. Afterwards the GO annotations were manually categorised by their influence on chromatin strength, histone binding, and transcription. Furthermore, the annotations in those three groups were divided into positive and negative influence.

2.2 Analyses

2.2.1 Data Preparation

For this study, promoter regions were extracted based on the human genome annotation for GRCh38.p12. A promoter region was defined as the 1.5 kb sequence upstream of the transcription start site. Pseudogenes and gene entries without a polymerase II promoter region were excluded from the analyses, resulting in 35,007 usable promoter regions. A full list of included gene types is provided in Table B.1.
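The promoter definition can be sketched as follows. This is a minimal illustration that assumes 0-based, half-open (BED-style) coordinates and a TSS given as the position of the first transcribed base; the record layout, the function names, and the example gene types are hypothetical, and the gene types actually included are those listed in Table B.1.

```python
PROMOTER_LENGTH = 1500  # 1.5 kb upstream of the transcription start site

def promoter_interval(chrom, tss, strand, length=PROMOTER_LENGTH):
    """Return (chrom, start, end, strand) for the region upstream of a TSS.

    Coordinates are 0-based and half-open; the TSS base itself is excluded.
    """
    if strand == "+":
        # Upstream means lower coordinates; clip at the chromosome start.
        return chrom, max(0, tss - length), tss, strand
    if strand == "-":
        # Upstream means higher coordinates on the reverse strand.
        return chrom, tss + 1, tss + 1 + length, strand
    raise ValueError(f"unexpected strand: {strand!r}")

def select_promoters(genes, allowed_types):
    """Keep genes of an allowed type and return one promoter interval each."""
    return {
        gene["id"]: promoter_interval(gene["chrom"], gene["tss"], gene["strand"])
        for gene in genes
        if gene["type"] in allowed_types
    }

# Hypothetical gene records, for illustration only.
genes = [
    {"id": "geneA", "chrom": "chr1", "tss": 10000, "strand": "+", "type": "protein_coding"},
    {"id": "geneB", "chrom": "chr1", "tss": 20000, "strand": "-", "type": "protein_coding"},
    {"id": "geneC", "chrom": "chr2", "tss": 500, "strand": "+", "type": "processed_pseudogene"},
]
promoters = select_promoters(genes, allowed_types={"protein_coding"})
```

The sketch does not clip reverse-strand promoters at the chromosome end, which would require the chromosome lengths.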

The RepeatMasker file with the TE annotation of the human genome was filtered by removing low-complexity entries and simple repeats, leaving only entries belonging to Class I, Class II, or Unknown transposons. TEs classified as Unknown or only as Retroposon were then removed from the dataset because of their small representation. Moreover, Rolling Circles (RCs) were combined with the TE family DNA. In this work, only TEs that were completely contained within a promoter region were analysed.
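A minimal sketch of this filtering step is given below. The set of retained class labels is an assumption about how the classes appear in the RepeatMasker annotation, and the function names and example intervals are hypothetical. A comparable containment filter could presumably also be expressed with bedtools intersect and its -f option.

```python
# Assumed class labels that remain after filtering; RC is folded into DNA.
KEPT_CLASSES = {"LINE", "SINE", "LTR", "DNA", "RC"}

def te_family(rep_class):
    """Map a repeat class to the family used downstream; None means discard."""
    if rep_class not in KEPT_CLASSES:
        return None
    return "DNA" if rep_class == "RC" else rep_class

def fully_contained(te, promoter):
    """True if the TE lies entirely inside the promoter (half-open coordinates)."""
    return te[0] == promoter[0] and promoter[1] <= te[1] and te[2] <= promoter[2]

def tes_in_promoters(tes, promoters):
    """Return (chrom, start, end, family) for TEs completely inside a promoter."""
    kept = []
    for chrom, start, end, rep_class in tes:
        family = te_family(rep_class)
        if family is None:
            continue
        if any(fully_contained((chrom, start, end), p) for p in promoters):
            kept.append((chrom, start, end, family))
    return kept

# Hypothetical intervals, for illustration only.
promoters = [("chr1", 0, 1500)]
tes = [
    ("chr1", 100, 400, "SINE"),            # inside the promoter, kept
    ("chr1", 1400, 1700, "LINE"),          # overlaps the promoter edge, dropped
    ("chr1", 200, 300, "Simple_repeat"),   # not a TE class, dropped
    ("chr1", 500, 600, "RC"),              # kept and relabelled as DNA
]
kept = tes_in_promoters(tes, promoters)
```

The nested loop is quadratic and only meant to show the logic; for genome-scale data an interval tool such as bedtools is the more practical choice.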

[6] http://jaspar.genereg.net/
[7] https://www.ensembl.org/index.html

The ENCODE database provided binding site positions for 636 different TFs in 126 different tissues and cell lines, but for a large number of these tissues and cell lines binding site positions were available for only one TF. To increase coverage, the tissues and cell lines were arranged into 22 groups based on human anatomy (Tab. 2.1). Furthermore, binding sites were defined as the 20 bp around the ENCODE peak position, so all entries were cut to -10 and +10 bp around the defined peak. The binding site positions were then merged by TF and by the tissue in which the experiment was conducted.
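The windowing and merging can be sketched as shown below. The sketch assumes that each ENCODE entry provides a peak start and a summit offset relative to that start, as in narrowPeak-style files, and that "merged" means combining overlapping windows of the same TF and tissue group; both points are assumptions, and the cell-line-to-group mapping in the example is hypothetical.

```python
from collections import defaultdict

FLANK = 10  # bp on each side of the peak, giving a 20 bp binding site

def peak_window(chrom, peak_start, summit_offset, flank=FLANK):
    """Return the window of +/- flank bp around the peak summit."""
    summit = peak_start + summit_offset
    return chrom, max(0, summit - flank), summit + flank

def merge_intervals(intervals):
    """Merge overlapping or touching intervals on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

def merge_by_tf_and_tissue(records, tissue_group):
    """Group windows by (TF, tissue group) and merge within each group."""
    grouped = defaultdict(list)
    for tf, cell_line, chrom, peak_start, summit_offset in records:
        key = (tf, tissue_group[cell_line])
        grouped[key].append(peak_window(chrom, peak_start, summit_offset))
    return {key: merge_intervals(windows) for key, windows in grouped.items()}

# Hypothetical records and grouping, for illustration only.
tissue_group = {"K562": "Blood", "GM12878": "Blood", "HepG2": "Liver"}
records = [
    ("CTCF", "K562", "chr1", 1000, 50),
    ("CTCF", "GM12878", "chr1", 1040, 15),
    ("CTCF", "HepG2", "chr1", 1000, 50),
]
sites = merge_by_tf_and_tissue(records, tissue_group)
```

In the example the two blood-derived windows overlap and are merged into one 25 bp site, while the liver window stays at 20 bp.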

TABLE 2.1: Number of available TFs experimentally analysed in different tissues.

Tissue group               Number of TFs
Blood                      360
Liver                      212
Kidney                     204
Breast                     93
Stem cells                 50
Lung                       50
Reproductive organs        24
Bone marrow                18
Fibroblast                 15
Nerve cells                11
Digestive tract            10
Prostate gland             6
Blood transport            5
Muscle                     5
Pancreas                   5
Skin                       4
Lymph nodes or similar     3
Spleen                     3
Parathyroid                2
Adrenal gland              2
Fat                        2
Retina                     1

The TF hierarchy provided by TFClass contained a few TFs associated with more than one class. These TFs were removed from those classes and assigned to the unassigned class. All steps up to this point are shown in Figure 2.1.

Some JASPAR motifs exceeded the expected sequence length of 10 bp by a wide margin. A closer look showed that the edges of those motifs are often poorly supported. Edge positions were therefore cut if their information content, calculated using the Shannon entropy (Schmitt and Herzel, 1997), was lower than 0.4 bits. After cutting, the length of the CTCF motif matched previously reported CTCF binding site lengths (Bell and Felsenfeld, 2000). The mean motif length dropped from 12 bp to 10 bp.
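The edge trimming can be sketched as follows. This is an illustration under assumptions, not the original implementation: the information content of a column is taken as 2 bits minus its Shannon entropy, without small-sample correction, and the example motif is hypothetical.

```python
import math


def information_content(column):
    """Information content (bits) of one motif column given as base frequencies."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy  # 2 bits is the maximum for a four-letter alphabet


def trim_edges(columns, min_bits=0.4):
    """Drop leading and trailing columns whose information content is below min_bits."""
    start, end = 0, len(columns)
    while start < end and information_content(columns[start]) < min_bits:
        start += 1
    while end > start and information_content(columns[end - 1]) < min_bits:
        end -= 1
    return columns[start:end]


# Hypothetical motif columns (base frequencies per position)
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # 0 bits
weak = {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}         # below 0.4 bits
strong = {"A": 0.9, "C": 0.05, "G": 0.03, "T": 0.02}    # above 1 bit

motif = [uniform, weak, strong, uniform, strong, weak]
trimmed = trim_edges(motif)
```

Only the edges are cut: the uninformative column between the two well-supported positions stays in the trimmed motif.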

The list of GO term accessions was reduced by excluding all entries whose evidence was solely an author statement without citation of a published reference (NAS), an inference from other GO annotations without data (IC), or for which no data was given (ND). All three categories were considered not reliable enough for this study.

2.2.2 JASPAR Analysis of ENCODE Hits

To evaluate the assumption that the motif for a TFBS in the ENCODE data lies within 20 bp around the peak, a JASPAR motif search was conducted in Python using the Biopython module version 1.72. For this, the ENCODE files with uncut entries of the same TF were pooled and scanned with the corresponding JASPAR motif. Pseudocounts and background were set according to a GC content of 40%, and a score threshold corresponding to a false positive rate of 0.05 was calculated for each motif.
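The principle behind such a motif search can be shown with a library-free sketch. It does not reproduce the Biopython calls used in this work. The count matrix, the test sequence, and all function names are hypothetical, and the score distribution is enumerated exhaustively, which is only feasible for short motifs. Only the forward strand is scanned.

```python
import itertools
import math
from collections import defaultdict

# Background for a GC content of 40%: C and G share 0.4, A and T share 0.6.
BACKGROUND = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}


def log_odds_matrix(counts, background, pseudocount=1.0):
    """Turn per-position base counts into log2-odds scores against the background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + pseudocount
        matrix.append({
            base: math.log2(
                (column[base] + pseudocount * background[base]) / total / background[base]
            )
            for base in "ACGT"
        })
    return matrix


def score(matrix, kmer):
    """Log-odds score of one k-mer, rounded so that tied scores compare equal."""
    return round(sum(col[base] for col, base in zip(matrix, kmer)), 6)


def fpr_threshold(matrix, background, fpr=0.05):
    """Lowest score whose exceedance probability under the background stays <= fpr."""
    distribution = defaultdict(float)
    for kmer in itertools.product("ACGT", repeat=len(matrix)):
        distribution[score(matrix, kmer)] += math.prod(background[b] for b in kmer)
    tail, threshold = 0.0, math.inf
    for value in sorted(distribution, reverse=True):
        if tail + distribution[value] > fpr:
            break
        tail += distribution[value]
        threshold = value
    return threshold


def scan(matrix, sequence, threshold):
    """Return (start, score) for every forward-strand window scoring >= threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window_score = score(matrix, sequence[start:start + width])
        if window_score >= threshold:
            hits.append((start, window_score))
    return hits


# Hypothetical count matrix with the consensus TGCA (20 sequences)
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 20},
    {"A": 0, "C": 0, "G": 20, "T": 0},
    {"A": 0, "C": 20, "G": 0, "T": 0},
    {"A": 20, "C": 0, "G": 0, "T": 0},
]
matrix = log_odds_matrix(counts, BACKGROUND)
threshold = fpr_threshold(matrix, BACKGROUND, fpr=0.05)
hits = scan(matrix, "AATGCATTTGGA", threshold)
```

For a short motif, a false positive rate of 0.05 is a lenient criterion, which is one reason why the distance of a hit to the ENCODE peak matters as an additional filter.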

The motif search was evaluated from two perspectives. The first assumes the best possible circumstances: identical motif hits from two ENCODE entries result from different experiments finding the same position and are counted as such. In this case, more hits may lie close to the peak. The second assumes the worst possible circumstances: a motif hit is discarded as a chance finding because an identical motif position occurs in another entry, and the next possible hit is taken for the discarded entry. In both cases only one hit per ENCODE position was counted.

2.2.3 Genome Overview

For a visual representation of the distribution of promoter regions, TFBSs, and TEs in the human genome, a sliding window procedure was run over every chromosome with a window size of 500 bp and a step size of 50 bp. The same approach, with a window size of 50 bp and a step size of 5 bp, was used to assess the distribution of TFBSs and of TEs, both combined and split into the four families, within the promoter regions.
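A sliding window procedure of this kind can be sketched as below. The sketch assumes that the plotted quantity is the fraction of covered bases per window and that intervals are half-open; both are illustrative choices, and the per-base array is only practical for short regions.

```python
def sliding_window_density(intervals, length, window=500, step=50):
    """Return (window midpoint, covered fraction) for each window over a region.

    `intervals` are half-open (start, end) annotations on a region of
    `length` bp; overlapping annotations are counted once per base.
    """
    covered = [0] * length
    for start, end in intervals:
        for pos in range(max(0, start), min(length, end)):
            covered[pos] = 1

    # Prefix sums give the covered bases of any window in constant time.
    prefix = [0]
    for flag in covered:
        prefix.append(prefix[-1] + flag)

    result = []
    for win_start in range(0, length - window + 1, step):
        win_end = win_start + window
        fraction = (prefix[win_end] - prefix[win_start]) / window
        result.append((win_start + window // 2, fraction))
    return result


# Hypothetical annotations on a 1,000 bp region
density = sliding_window_density([(100, 350), (300, 400)], length=1000)
```

Placing each value at the window midpoint matches the plotting convention described for the promoter figures, where the first point represents the first full window.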

For the general distribution of TFBSs in the human genome, the tissue groups were discarded and all files for each TF were combined. Two entries that overlapped by at least 10 bp were regarded as the same TFBS, and their positions were merged. In the next step, those TFBSs were assigned to one of three groups: TEs in promoters, promoter regions excluding TEs, and TEs not in promoters. The threshold for assignment was an overlap of 10 bp. The TFs were classified using the TFClass hierarchy, and for the two groups containing TEs the TFBSs were further divided according to the four TE-families (Fig. 2.1).

[Flow chart. Legible box labels include: ENCODE files, TFBSs files, all_TFBSs file, Gencode Annotation, Promoter file, Annotation Repeatmasker, TE file, TF-hierarchy, TEs_in_promoter, promoter_wo_TEs.]

FIGURE 2.1: Flow chart of the main analysis script, from raw data through data preparation to binding site association. The yellow boxes represent generated files, the green and white boxes are computational operations, and the red ones indicate that the same operation was conducted with different groupings.

At the same time, several smaller aspects were investigated: the nucleotide distribution of the four TE-families inside and outside promoter regions, the distribution of TFBS classes across the TE-families, and the number of TEs per promoter. A Factor Analysis of Mixed Data (FAMD) was conducted using R version 3.4.4 with the packages FactoMineR version 1.4.1 and factoextra version 1.0.5. For this, the TFBS counts, grouped by TF class and assigned to the TE-families of TEs in promoters, were divided by the total number of TFBSs in promoter-associated TEs.

Using the GO annotations of the TF genes, the TFs were grouped according to their influence on translation, histone binding, or chromatin strength. Unless stated otherwise, these three groups are jointly referred to as translation influence. If the numbers of GO annotations with positive and with negative influence differed by one or less, the TF was classified as neutral. If the difference was greater than one, it was tested whether one side had at least twice as many annotations as the other; if this threshold was not reached, the TF was likewise classified as neutral. TFs without annotations were labelled as having no data and were excluded.
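The classification rule can be written down compactly. The following sketch follows the wording of this section literally; the function name and the returned labels are chosen for illustration. Note that under this literal reading a TF with a single positive and no negative annotation is classified as neutral.

```python
def classify_influence(n_positive, n_negative):
    """Classify a TF from its counts of positive and negative GO annotations."""
    if n_positive == 0 and n_negative == 0:
        return "no data"  # such TFs were excluded from the analysis
    if abs(n_positive - n_negative) <= 1:
        return "neutral"
    larger = max(n_positive, n_negative)
    smaller = min(n_positive, n_negative)
    # One side must have at least twice as many annotations as the other.
    if larger >= 2 * smaller:
        return "positive" if n_positive > n_negative else "negative"
    return "neutral"


# Hypothetical annotation counts: (positive, negative)
examples = {
    "TF_A": (5, 2),
    "TF_B": (2, 5),
    "TF_C": (3, 2),
    "TF_D": (5, 3),
    "TF_E": (0, 0),
}
labels = {tf: classify_influence(pos, neg) for tf, (pos, neg) in examples.items()}
```

The two-step rule keeps TFs with mixed evidence, such as five positive against three negative annotations, in the neutral class.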

2.2.4 Tissue Comparison

For the tissue comparison, 6 of the 22 tissue groups were chosen (blood, liver, kidney, breast, stem cells, lung). All six tissue groups had data for at least 50 TFs available. Since only one TF was shared by all six tissues, a pairwise comparison was conducted between the shared TFs of two tissues at a time to increase the number of comparable TFs. TFBSs of the same TF that overlapped by at least 10 bp between two tissues were considered identical positions. The analysis was conducted for TEs contained in promoters and for promoter regions excluding TEs. The ratios between identical and unique TFBSs for different tissues, and the differences between TE-containing promoters and promoters without TEs, were tested for significance using a Z-test (α = 0.05).
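The pairwise comparison can be sketched as follows. The interval data and counts are hypothetical, and the pooled two-proportion form of the Z-test is an assumption, since the exact variant is not specified here.

```python
import math


def overlap(a, b):
    """Length of the overlap between two half-open intervals (negative if disjoint)."""
    return min(a[1], b[1]) - max(a[0], b[0])


def shared_count(query, reference, min_overlap=10):
    """Return (number of query sites shared with the reference, total query sites)."""
    shared = sum(
        1 for q in query if any(overlap(q, r) >= min_overlap for r in reference)
    )
    return shared, len(query)


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled Z-test for the difference between two proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical binding sites of one TF in two tissues (same chromosome)
tissue_a = [(100, 120), (200, 220), (300, 320)]
tissue_b = [(105, 125), (215, 235), (500, 520), (600, 620)]

a_to_b = shared_count(tissue_a, tissue_b)  # share of tissue A sites found in B
b_to_a = shared_count(tissue_b, tissue_a)  # share of tissue B sites found in A

# Hypothetical shared/total counts for TE-derived and TE-free promoter sequence
z, p_value = two_proportion_z_test(300, 10_000, 700, 10_000)
```

Because each fraction is normalised by the number of sites in the query tissue, the comparison is not symmetric: the same shared sites yield different fractions in the two directions.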

2.2.5 Analysis of Individual TFs

Five of the six analysed tissue groups had ENCODE data for the same 12 TFs (CTCF, EP300, EZH2, GABPA, JUND, MAFK, POLR2A, POLR2AphosphoS5, RFX5, SIN3A, SP1, TAF1). These TFs were chosen, and the shared positions of their binding sites were determined in TE-derived promoter sequences and in promoter sequences without TEs.

To assess whether active TFBSs were transported into promoter regions by TEs or became active through later mutation, a JASPAR motif search of selected TFs and TEs was conducted and compared to the ENCODE hits. A Wilcoxon rank-sum test was used for the comparison.

2.2.6 Pathway Enrichment Analysis

Pathway analysis was conducted in the R software environment version 3.4.4 with the package WebGestaltR version 0.1.1 (Wang et al., 2017). The enrichment method was set to over-representation analysis (ORA) for Homo sapiens with KEGG as the enrichment database (Kanehisa et al., 2017). The provided genome was used as the reference set. The maximum threshold to exclude categories was increased to 500, and the number of categories visualised in the final report was set to 50. All other settings were left at their default values. This analysis was applied to the different tissues for TEs in promoter regions and for promoter regions without TEs.

Chapter 3

Results

3.1 Data selection

At present, over 55,000 genes are annotated in the human genome. From this set, 35,007 genes transcribed by polymerase II, i.e. protein-coding and lincRNA genes (for a detailed list of the gene categories used in the study see Appendix table B.1), were selected. The promoter region was defined as the 1,500 bp sequence upstream of the transcription start site (TSS) of each gene.

The data provided by the ENCODE consortium cover 126 tissues and cell lines and contain information on 636 transcription factors. As mentioned in Chapter 2, the tissues and cell lines were combined into 22 groups based on human anatomy (Tab. 2.1) to increase the coverage of available transcription factors for each tissue. Even after combining the tissues, not all organs were analysed for the same transcription factors, which led to large discrepancies in the number of transcription factors per organ. For blood, for example, data on 360 different transcription factors are available, but for retina only on one (CTCF, the CCCTC-binding factor). Most of the data represent specialised organs, so it is not surprising that they use unique sets of transcription factors. However, it is difficult to decide whether this represents a real biological phenomenon or is caused by methodological selection and reflects the research interests of the ENCODE groups. For this analysis, only organs with data on at least 50 transcription factors were selected, which left the six tissues blood, liver, kidney, breast, stem cells, and lung.

To assess whether the number of ENCODE experiments conducted for a transcription factor influences the number of discovered binding sites, the four TFs with the highest experiment counts were compared (CTCF: 178 experiments, POLR2A: 82 experiments, EP300: 54 experiments, POLR2AphosphoS5: 44 experiments). As seen in Figure 3.1, this concern is unfounded. CTCF, with the highest experiment count, and POLR2AphosphoS5, with the lowest, show similar distributions, whereas EP300, with the second-lowest experiment count, has the lowest bulk of binding site numbers. Note that the violin plot shows a smoothed density estimate and therefore also extends to values below zero; these are biologically meaningless and will not be addressed further.

[Violin plot. x-axis: Transcription factor (CTCF, EP300, POLR2A, POLR2AphosphoS5); y-axis: Number of binding sites per experiment.]

FIGURE 3.1: Violin plot of the number of ENCODE entries for the four TFs with the highest number of experiments (CTCF: 178 experiments, POLR2A: 82 experiments, EP300: 54 experiments, POLR2AphosphoS5: 44 experiments).

Besides the influence of the number of experiments, it was assessed how reliably the chosen 20 bp window around the reported ENCODE peak reflects the position of the binding site. For this, all raw ENCODE results were scanned with the corresponding JASPAR motif matrices for possible binding site positions. Of the 636 transcription factors with ENCODE data, the JASPAR database had motifs for 154. Since the peak already indicates where the binding site should be, matrix hits that passed the threshold calculated for the motif (α = 0.05) and lay close to the peak were considered real hits. A worst case and a best case were analysed. In the worst case, each ENCODE entry represents its own binding site, and if two entries share an identical motif, the next best hit is chosen for one of them. In the best case, identical motifs in two ENCODE entries were regarded as experimental confirmation of the same binding site. The 154 TFs account for 14,352,092 ENCODE entries. A JASPAR motif was accepted in 99.97% of those entries in the best case and in 82.92% in the worst case (Tab. 3.1). The high loss of ENCODE entries in the worst case is not surprising, since each position has a limited number of acceptable motif positions. If those are already assigned to other peaks, which is expected with multiple experiments in the same tissue, peaks that confirm positions are pushed out. Excluding the entries for which no motif was found, 71.88% of ENCODE entries had a motif hit within the proposed range of 20 bp in the worst-case analysis. In the best case, 92.04% of all entries with a motif hit lay within the range. Widening the range around the peak increased the share of accepted motif positions with diminishing gains: each further 5 bp step added about 6, 3, and 2 percentage points in the worst case and about 4, 2, and 0.7 percentage points in the best case. A range of 20 bp therefore seems to be a good trade-off between precision and accuracy for binding site positions in ENCODE data, all the more so in light of the higher percentage of entries with a motif in the best-case analysis.
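The gain per widening step can be recomputed from the first four rows of Table 3.1. The short sketch below only restates the table values; the variable and function names are chosen for illustration.

```python
# Percent of entries with a motif hit within the given range (Table 3.1)
ranges_bp = [10, 15, 20, 25]
one_per_position = [71.88, 77.38, 80.34, 82.24]       # worst case
multiple_per_position = [92.04, 96.54, 98.24, 98.97]  # best case


def gains(values):
    """Gain in percentage points between consecutive ranges."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]


worst_case_gains = gains(one_per_position)
best_case_gains = gains(multiple_per_position)
```

In both cases every additional 5 bp contributes less than the step before, which supports keeping the window narrow.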

TABLE 3.1: Percentage of ENCODE entries with a motif hit (α = 0.05) within a given distance of the ENCODE peak, for 154 different TFs. For each ENCODE entry only one hit was counted, either only if no other hit was documented at the same genome position (one-per-position) or without this restriction (multiple-per-position).

Range around ENCODE peak (in bp)    one-per-position    multiple-per-position
10                                  71.88               92.04
15                                  77.38               96.54
20                                  80.34               98.24
25                                  82.24               98.97
30                                  83.66               99.34
40                                  85.78               99.67
50                                  87.47               99.80
100                                 93.51               99.96
200                                 99.19               100.00
500                                 100.00              100.00
Number of peaks                     12,827,336          14,349,902

For all transcription factors, the accepted motif positions in the ENCODE entries show the same distribution, with a high motif count close to the ENCODE peak. Since the distance to the ENCODE peak was a selection criterion, which biases the inferred position, all possible binding site positions were also examined. The majority of TFs followed the expected distribution, but a small number did not. The highest number of possible motifs was anticipated where ENCODE predicted the binding site (the ENCODE peak), which in most cases is the middle position of the ENCODE entry. Transcription factors like CTCF follow this rule. Others, like MEF2B, have a high number of motif positions at the predicted position, but the true motif maxima lie to the left and right of it, forming an M-shaped distribution. The reported motif sequence for MEF2B is CTA(T/A)AAATAG. The first and last 3 to 4 nucleotides are complementary sequences, which indicates that the transcription factor binds specific DNA positions as a dimer, and the literature confirms that MEF2B forms a dimer prior to binding DNA (Molkentin et al., 1996). Interestingly, if the motifs are selected with a stricter FPR threshold (α = 0.01), the distribution takes the expected form. For this analysis, it was still treated as a single binding site at the peak.

3.2 TE distribution

Earlier studies have shown that transposons are not randomly distributed in the genome (Korenberg and Rykowski, 1988). Currently, 4.5 million TE fragments are annotated in the human genome (Tab. 3.2), and the density distribution of transposons along the chromosomes supports this notion. Transposons and TFBSs have a more even density across chromosomes than promoter regions (for an example see Fig. 3.2). Nevertheless, high-density regions for transposons exist. Those regions generally overlap with high-density regions of promoters, binding sites, or both, although they can also stand alone (Fig. 3.2, position 1.8 × 10^7). The overlapping regions indicate an interesting interaction between binding sites, promoters, and transposons.

TABLE 3.2: Transposon distribution in different genomic regions

              Genome                          Promoter regions              Non-promoter regions
TE family     Elements     Nucleotides        Elements    Nucleotides       Elements     Nucleotides
LINE          1,516,226    641,953,033        16,210      3,211,688         1,500,016    638,741,345
SINE          1,779,271    392,908,499        32,368      6,655,074         1,746,903    386,253,425
LTR           725,763      268,434,413        6,549       1,719,349         719,214      266,715,064
DNA           489,391      103,055,478        6,487       1,042,442         482,904      102,013,036
Retroposon    5,397        4,223,296          50          7,089             5,347        4,216,207
Unknown       5,531        737,222            55          6,043             5,476        731,179
Total         4,521,579    1,411,311,941      61,719      12,641,685        4,459,860    1,398,670,256
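The nucleotide counts in Table 3.2 translate directly into per-family shares. As a minimal sketch, the following lines compute the share of each of the four major TE families among the TE-derived nucleotides inside and outside promoter regions; restricting the totals to these four families is an assumption that matches the families shown in Figure 3.3.

```python
# TE-derived nucleotides per family, taken from Table 3.2
promoter_nt = {
    "LINE": 3_211_688, "SINE": 6_655_074, "LTR": 1_719_349, "DNA": 1_042_442,
}
non_promoter_nt = {
    "LINE": 638_741_345, "SINE": 386_253_425, "LTR": 266_715_064, "DNA": 102_013_036,
}


def shares(counts):
    """Percentage of each family among the summed nucleotides, rounded to 0.1."""
    total = sum(counts.values())
    return {family: round(100 * n / total, 1) for family, n in counts.items()}


promoter_shares = shares(promoter_nt)
non_promoter_shares = shares(non_promoter_nt)
```

SINE elements make up about half of the TE-derived promoter nucleotides but little more than a quarter outside promoters, while the LINE share is markedly lower inside promoters than outside.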

Core promoter regions are enriched in SINE elements, while LINE and LTR elements are less frequent. Only DNA elements occur at a similar frequency inside and outside promoters (Fig. 3.3 and Tab. 3.2). The difference is significant for each TE category (chi-square p-value less than 1 × 10^-15), which is in accordance with the literature (Thornburg et al., 2006). The transposons are not equally distributed among promoters (Thornburg et al., 2006). Some promoters are completely devoid of transposon sequences, while others are invaded by as many as 10 transposons (Fig. 3.6). Interestingly, promoter regions with two or three TE-derived sequences have the highest binding site count (Fig. 3.7).

[Density plot. x-axis: Position on the chromosome; y-axis: Density; series: Promoter, TE, TFBS.]

FIGURE 3.2: Density of TFBSs, TEs, and promoter regions on chromosome 18 calculated with the sliding window method (window size 500 bp and step size 50 bp). Entries with 0.2 bp per window were excluded.

Of the 35,007 analysed promoter regions, about 75% harbour at least one TE-derived sequence, which is similar to the fraction reported by Thornburg et al. (83%) (Thornburg et al., 2006). The difference is most likely due to the smaller promoter size analysed in this study (1,500 bp versus 2,000 bp in Thornburg et al.), since transposon density increases with distance from the TSS, as shown in Fig. 3.4 and Fig. 3.5 and stated in the literature (Mariño-Ramírez et al., 2005). Although transposon density generally decreases towards the TSS, LTR elements behave differently: their density jumps starting at -240 bp from the TSS. Since the binding site density in promoter regions runs opposite to the general transposon density (Fig. 3.4), a close association between binding sites and LTR transposons would be expected (see next Section).

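[Two concentric pie charts of the TE families LINE, SINE, LTR, and DNA. Inner circle (promoter regions): LINE 25.4%, SINE 52.7%, LTR 13.6%, DNA 8.3%. Outer circle (rest of the genome): LINE 45.8%, SINE 27.7%, LTR 19.1%, DNA 7.3%.]

FIGURE 3.3: Distribution of major types of TEs in the human genome. The promoter regions are depicted in the inner circle, while the TE distribution in the rest of the human genome is presented in the outer circle.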

[Line plot. x-axis: Distance to TSS (nt), from -1500 to 0; y-axis: Fraction of derived nucleotides in promoter; series: TE, TFBS.]

FIGURE 3.4: Fraction of promoter area occupied by TE-originated and binding site associated sequences. The promoters were analysed using a sliding window approach with a window size of 50 nt and a sliding step of 5 nt. Because the plotting point was set to the middle of the window, the first dot was placed at position -25 and represents window -1 to -50.

[Line plot. x-axis: Distance to TSS (nt), from -1500 to 0; y-axis: Fraction of TE derived nucleotides in promoter; series: DNA, LINE, LTR, SINE.]

FIGURE 3.5: Fraction of promoter area occupied by different families of TE-originated sequences. The promoters were analysed using a sliding window approach with a window size of 50 bp and a sliding step of 5 bp. Because the plotting point was set to the middle of the window, the first dot was placed at position -25 and represents window -1 to -50.

[Bar chart. x-axis: Number of TE fragments (0 to 10); y-axis: Promoter count. Bar labels in descending order: 8796, 8317, 7276, 5251, 3072, 1505, 578, 169, 34, 6, 3.]

FIGURE 3.6: Distribution of TE derived sequences in pol II promoter regions.

[Bar chart. x-axis: Number of TE fragments; y-axis: TFBS count. Bar labels in descending order: 56022, 54168, 40205, 38742, 22506, 10747, 2531, 1082, 217, 11.]

FIGURE 3.7: Distribution of TFBSs in TE derived sequences in pol II promoter regions.

Among the roughly 75% of promoter regions with TE-derived sequences, eight genes have promoters composed almost entirely (at least 92%) of transposon sequences (Tab. 3.3). A cross-reference with the human genome annotation showed that none of those genes has a cognate RefSeq record and only two of them appear to be "regular" protein-coding genes. Four are novel lincRNA genes, and the other two are located in an intron of the PRCD gene. Therefore, it cannot be ruled out that these are not real genes but annotation errors in the current human genome assembly.

TABLE 3.3: Human genes whose promoters almost completely originated in TEs.

Gene ID               Gene name         Fraction of TE-derived sequences    TE elements        Gene type
ENSG00000154415.7     PPP1R3A           0.95                                LTR                Protein coding
ENSG00000166228.8     PCBD1             0.92                                LINE               Protein coding
ENSG00000233480.1     AP000946.2        0.94                                LINE               LincRNA
ENSG00000257729.2     RP11-788H18.1     0.95                                Different types    LincRNA
ENSG00000258969.1     RP11-305B6.3      0.93                                LTR                LincRNA
ENSG00000267543.1     AC015802.3-201    1.00                                Different types    Sense intronic
ENSG00000272386.1     AC015802.5        1.00                                Different types    Sense intronic
ENSG00000285191.1     AC090679.2        0.95                                Different types    LincRNA

3.3 TFBSs located in TE-derived sequences

A previous study investigated the potential of TE-derived sequences to provide regulatory elements (Thornburg et al., 2006). It was a purely computational study, since no experimental tools or large-scale data were available at the time. Over the last decade, ENCODE has provided such data with its ChIP-seq experiments (Sloan et al., 2016). These data were analysed to determine the actual contribution of TE-derived sequences to the regulation of gene expression through the import of binding sites.

Data for 636 transcription factors from 1,722 ENCODE experiments were analysed. Those experiments reveal 3,173,045 active binding sites in promoter regions, of which 215,964 (6.8%) are located in TE-derived sequences. This shows a substantial contribution of transposons to the regulation of protein-coding gene expression and corroborates previous predictions (Jordan et al., 2003; Thornburg et al., 2006).

Active binding sites in TE-derived sequences within promoter regions generally follow the TE distribution observed in promoters. SINE elements harbour the most TFBSs, followed by LINE elements, while LTR and DNA elements are least affected (Fig. 3.8). The correlation between binding sites and LTR elements that was expected from their nucleotide density in promoter regions (Fig. 3.4 and Fig. 3.5) is not observed. The distribution of binding sites across the categories of transcription factors is similar in each TE-family, with Zinc-coordinating DNA-binding domains, Helix-turn-helix domains, and Basic domains dominating the landscape. This might merely reflect the ENCODE experimental design and its choice of transcription factors. In addition, a large number of binding sites are as yet unassigned, which points to another methodological problem: the incomplete classification of TFs.

TE sequences not located within promoter regions show the same distribution of binding sites across the categories of transcription factors (Fig. 3.8). Comparing the number of active TFBSs in the TE-families outside of promoters with the nucleotide frequency of those families, on the other hand, reveals clear deviations from the expected numbers. That LINE has the most binding sites and DNA the fewest is expected, but SINE and LTR have almost equal numbers of binding sites despite a nucleotide difference of 8% (Fig. 3.8). Together, those two families have more active binding sites than LINE (1.6 × 10^5 to 2.8 × 10^5) at a similar nucleotide share of 47% to 46%. Calculating the expected number of binding sites for the TE families from their nucleotide distribution shows that LINE has fewer and LTR more active binding sites than expected. This indicates that LINE gains TFBSs in the promoter after a jump and LTR loses some until the number matches its nucleotide share. However, the difference was statistically significant only for LINE (proportion Z-test, p < 0.0001). This is interesting, since the LTR result could be explained by the higher fraction of LTR elements close to the TSS, where a greater evolutionary pressure to preserve the existing binding site constellation could act. LINE, on the other hand, includes some of the only elements still active in the human genome, so it can be argued that active and inactive TEs have different numbers of active TFBSs, which is not reflected in our analysis.

[Bar charts in two panels, "TEs in promoter" and "TEs not in promoter". x-axis: TE-Families (LINE, SINE, LTR, DNA); y-axis: Number of TFBSs; colours: TF-Class (Basic domains; Helix-turn-helix domains; Zinc-coordinating DNA-binding domains; Other all-alpha-helical DNA-binding domains; alpha-Helices exposed by beta-structures; Immunoglobulin fold; beta-Hairpin exposed by an alpha/beta-scaffold; beta-Sheet binding to DNA; beta-Barrel DNA-binding domains; Yet undefined DNA-binding domains; unassigned).]

FIGURE 3.8: Distribution of TFBSs in different TE-families for transposons outside of and within promoter regions.

To gain further insight into how strongly TE families and TFBS classes influence binding site deposition in promoter and non-promoter regions, a Factor Analysis of Mixed Data (FAMD) was conducted. The results are shown in Figure 3.9, colour-coded once by transposon family (A) and once by transcription factor class (B). As seen before, binding sites group by transposon family (Fig. 3.9 A), whereas TF classes are mixed, with the exception of unassigned and Zinc-coordinating DNA-binding domains, which lie apart from the other points. As seen above, these two also have the highest counts among the TF classes in the transposon families. Dimension 1 explains 18% of the variance observed in the data and is influenced most strongly by the TF classes, which is why those classes can be separated from left to right. TE families, on the other hand, separate from top to bottom, since they have the strongest influence on dimension 2, which explains only 7% of the variance.

3.4 Tissue specific TFBSs located in TE-derived sequences

The ENCODE data also made it possible to investigate organ-specific regulation. For this, only organs with data for at least 50 TFs were selected, which left 6 of 22 organs (Tab. 2.1). As already seen, TE-derived sequences are a rich source of binding sites for transcription factors, which raises two questions: How specific are those TFBSs? Do TE-derived sequences provide binding sites that act globally or locally? To answer those questions, all detected TFBSs were analysed for tissue specificity and compared with the global landscape. The six tissues were analysed by taking all transcription factors with data in at least two tissues and comparing the ENCODE results in a pairwise manner. As a result, each tissue pair was analysed with its own set of transcription factors.

The ratio between the shared binding sites of a tissue pair and the number of sites unique to each organ was calculated to make all comparisons comparable. In every comparison, only a small fraction of the binding sites used in a given organ is also used in the organ it was compared with. For example, only 3% of the active binding sites located in TE-derived sequences in promoter regions of blood tissue are also used by breast tissue (Fig. 3.10). The fraction of shared binding sites between two organs varies from 3% in the blood-breast comparison up to 26% for kidney and lung. Importantly, the values calculated for a comparison, for example kidney and lung, are not reciprocal. Kidney shares 26% of its active binding sites with lung (kidney -> lung = 26%), but lung shares only 15% of its binding sites with kidney (lung -> kidney = 15%). This reflects the fact that the overall numbers of TFBSs for specific factors used in a given tissue differ. In this example, there are 27,150 TFBSs active in lung tissue but only 11,635 in kidney (Appendix Fig. A.4).

FIGURE 3.9: FAMD of TFBSs of TE-derived promoter sequences and TE sequences not in promoter regions with (A) TE families as colour groups and (B) TF classes as colour groups. Axes: Dim1 (18%) and Dim2 (7.1%). For each group a thicker point was added, which represents the mathematical centre of the group.

The small number of shared binding sites may not be a unique feature of TE-originated TFBSs but a reflection of a general trend in the gene expression pattern of human cells. To test which hypothesis is correct, the same analysis was conducted for promoter regions excluding TE-derived sequences. As for TE-derived sequences, the fraction of shared binding sites is below 50% in all comparisons. It ranges from 7% for blood-kidney to 38% for kidney-lung, and the values are all higher than those observed for the TE-derived sequences (Fig. 3.11), with the exception of the stem cell-kidney and lung-kidney comparisons, where the reverse is true. All differences are significant according to a Z-test (α = 0.05). This shows that transposons bring mostly tissue-specific transcription factor binding sites into promoter regions.
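For the comparison of shared fractions between TE-derived and TE-free promoter sequences, the thesis does not specify which variant of the Z-test was used. One plausible reading is a two-proportion Z-test with a pooled standard error. The sketch below illustrates that variant under this assumption; the function name and all counts are hypothetical.

```python
import math


def two_proportion_z_test(shared_1, total_1, shared_2, total_2):
    """Two-sided two-proportion Z-test with pooled standard error."""
    p1 = shared_1 / total_1
    p2 = shared_2 / total_2
    pooled = (shared_1 + shared_2) / (total_1 + total_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_1 + 1 / total_2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: shared sites out of all sites of the query tissue.
z, p_value = two_proportion_z_test(260, 1000, 380, 1000)
print(f"z = {z:.2f}, significant at alpha = 0.05: {p_value < 0.05}")
```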

[Heat map, panel "TEs in promoter". Rows: compared tissue; columns: reference tissue; colour scale from 0.05 to 0.30.]

Compared tissue   Blood   Breast   Kidney   Liver   Lung   Stem-cell
Blood               -      0.03     0.04    0.05    0.06     0.05
Breast            0.14      -       0.13    0.13    0.12     0.09
Kidney            0.09     0.18      -      0.08    0.26     0.17
Liver             0.10     0.11     0.06     -      0.09     0.06
Lung              0.18     0.15     0.15    0.15     -       0.08
Stem-cell         0.18     0.15     0.23    0.15    0.11      -

FIGURE 3.10: Pairwise comparison of TFBS uniqueness in TE-derived promoter sequences. Each cell lists the fraction of TFBSs of the query tissue (Y-axis) that is shared with the reference tissue (X-axis). The results of all TFs available for analysis were concatenated.

3.4.1 Influence of TFBSs on transcription in different genome regions

[Heat map, panel "Promoter without TEs". Rows: compared tissue; columns: reference tissue; colour scale from 0.08 to 0.28.]

Compared tissue   Blood   Breast   Kidney   Liver   Lung   Stem-cell
Blood               -      0.11     0.07    0.11    0.15     0.11
Breast            0.26      -       0.21    0.19    0.23     0.13
Kidney            0.13     0.27      -      0.16    0.38     0.26
Liver             0.17     0.17     0.08     -      0.15     0.12
Lung              0.26     0.24     0.16    0.20     -       0.13
Stem-cell         0.24     0.19     0.21    0.20    0.18      -

FIGURE 3.11: Pairwise comparison of TFBS uniqueness in promoter sequences not derived from TEs. Each cell lists the fraction of TFBSs of the query tissue (Y-axis) that is shared with the reference tissue (X-axis). The results of all TFs available for analysis were concatenated.

Another aspect analysed in this study is the distribution of binding sites associated with the promotion or the repression of transcription. To classify the binding sites, GO term annotations were first downloaded for the genes of the 636 different transcription factors. In total, 3,168 unique GO term annotations were acquired and manually screened for entries that could be associated with transcription, which left 105 entries for the analysis (full list of GO term annotations in Appendix Tab. B.4). Since the influence on transcription also encompasses the coiling state of the DNA (no transcription takes place when the DNA is not open), the annotations were divided into groups associated with chromatin strength, with histone binding, and with transcription. Within each group the GO term annotations were further divided into positive influence (promotion of transcription) and negative influence (repression of transcription). A transcription factor that had both positive and negative GO term annotations in one group was regarded as having a neutral influence if the numbers of positive and negative GO terms differed by no more than one or, where the difference was greater than one, if the ratio of positive to negative annotations was less than two. After the assignment of influences to the transcription factors, 418 TFs have an association directly linked to transcription, whereas only 65 are linked to histone modification and 16 to chromatin (Tab. 3.4). In the histone and chromatin groups, neutral TFs are heavily outnumbered by positive and negative TFs (3:62 and 1:15), whereas the numbers of negative and positive TFs are more even. In the transcription group the number of neutral TFs is higher than the number of positive or negative TFs, but the difference is not as pronounced as in the histone or chromatin groups. This uneven distribution of TFs has to be accounted for in the next analysis steps. For the sake of completeness, a fourth group was created in which the GO term annotations of the other three groups were pooled. The resulting TF distribution follows the one seen in the transcription group, with a few more neutral and a few fewer positive and negative TFs. This group contains 419 TFs, one more than the transcription group, indicating that all except one transcription factor of the histone and chromatin groups are also present in the transcription group.
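The rule for assigning a positive, negative, or neutral influence to a transcription factor in Section 3.4.1 can be written as a small decision function. This is a sketch under assumptions: the ratio is taken as the larger count divided by the smaller one, and TFs with annotations in only one direction are assigned that direction. Neither detail is stated explicitly in the text, and the function name is hypothetical.

```python
def classify_influence(n_positive, n_negative):
    """Classify a TF within one GO group from its annotation counts."""
    if n_positive == 0 and n_negative == 0:
        return None  # no annotation in this group
    if n_negative == 0:
        return "positive"
    if n_positive == 0:
        return "negative"
    larger = max(n_positive, n_negative)
    smaller = min(n_positive, n_negative)
    # Neutral: counts differ by at most one, or their ratio is below two.
    if larger - smaller <= 1 or larger / smaller < 2:
        return "neutral"
    return "positive" if n_positive > n_negative else "negative"


for counts in [(3, 0), (2, 1), (5, 3), (6, 2), (1, 4)]:
    print(counts, classify_influence(*counts))
```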

TABLE 3.4: Number of TFs with GO annotations influencing transcription through chromatin strength, histone binding, transcription, or all sets.

Form of influence      Influence on transcription
on transcription       negative   positive   neutral
Chromatin                     7          8         1
Histone                      35         27         3
Transcription               109        127       182
All                         107        122       190

The first run yielded the results expected from the number of transcription factors for each influence. In other words, the number of TFs correlated with the number of discovered TFBSs. To compensate for this effect, the values were adjusted by dividing the number of binding sites by the corresponding number of transcription factors for each group and set (resulting in R = 0.0115). Although the transcription group and the sum of all groups differ in the TF distribution for their sets of influence, this difference is not enough to change the distribution of binding sites in the different genome regions for those two groups (Fig. 3.12). In promoter regions without TE-derived sequences, TE-derived sequences in promoter regions, and transposons not located in promoter regions, the number of binding sites for the pooled GO term annotations and for the transcription-associated GO terms is almost equal, and independence of the sets from the two groups could not be rejected (chi-square p-value = 0.8301).
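The adjustment for the uneven TF distribution amounts to a per-cell division of binding-site counts by the TF counts of Table 3.4. A minimal sketch, in which the TF counts are taken from Table 3.4 while the raw TFBS counts and the function name are hypothetical placeholders:

```python
# Number of TFs per group and influence (Table 3.4).
TF_COUNTS = {
    "chromatin": {"negative": 7, "positive": 8, "neutral": 1},
    "histone": {"negative": 35, "positive": 27, "neutral": 3},
    "transcription": {"negative": 109, "positive": 127, "neutral": 182},
    "all": {"negative": 107, "positive": 122, "neutral": 190},
}


def adjust_counts(raw_counts, tf_counts=TF_COUNTS):
    """Divide each raw TFBS count by the number of TFs contributing to it."""
    return {
        group: {
            influence: count / tf_counts[group][influence]
            for influence, count in by_influence.items()
        }
        for group, by_influence in raw_counts.items()
    }


# Hypothetical raw TFBS counts for one genome region.
raw = {"chromatin": {"negative": 700, "positive": 1600, "neutral": 300}}
adjusted = adjust_counts(raw)
print(adjusted)
```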

Transposons in promoter regions and outside of promoter regions show a similar distribution of influence for all groups. Neutral binding sites are the most numerous, followed by negative and positive binding sites. An exception is chromatin, where transposons not in promoter regions have an equal number of positive and negative binding sites and TE-derived sequences in promoter regions have a higher number of negative sites (1.2 times as many). The chromatin and histone groups are also striking in their number of neutral binding sites compared to their positive and negative sites combined, hereafter referred to as biased binding sites. TE-derived sequences in promoter regions have 1.6 times more neutral than biased binding sites associated with histone modification, and for chromatin the number is 1.25 times higher. For transposons not in promoter regions, the chromatin ratio of neutral to biased binding sites is lower, at 0.9. The histone ratio also drops, to 1.25 times more neutral binding sites. Therefore, the relative number of neutral binding sites in transposons increases from TEs not in promoter regions to TEs in promoter regions for chromatin- and histone-affecting binding sites.
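The ratio of neutral to biased binding sites used in this comparison is the neutral count divided by the sum of positive and negative counts. A short sketch with hypothetical adjusted counts (the function name and the numbers are illustrative only and were chosen to reproduce ratios of 1.6 and 0.9):

```python
def neutral_to_biased_ratio(negative, positive, neutral):
    """Neutral sites relative to biased (positive plus negative) sites."""
    return neutral / (positive + negative)


# Hypothetical adjusted counts.
print(neutral_to_biased_ratio(negative=500, positive=500, neutral=1600))
print(neutral_to_biased_ratio(negative=500, positive=500, neutral=900))
```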

The distribution of transcription-affecting binding sites in transposons differs greatly from that in promoter regions free of transposon sequences. There, a higher number of binding sites have a positive influence on transcription through chromatin and histone modification, whereas the neutral ones have the lowest number.

[Figure 3.12: bar charts with the panels "TEs in promoter", "Promoter without TEs", and "TEs not in promoter". X-axis: form of influence on transcription (All, Histone, Chromatin, Transcription); y-axis: number of TFBSs (adjusted for number of available TFs); colours: negative, positive, and neutral influence on transcription.]

FIGURE 3.12: Number of TFBSs with transcription-related GO annotations in promoters without TEs, TEs not in promoters, and TEs in promoter regions, grouped by influence on chromatin strength, histone binding, general transcription, and all of those sets, and further separated into positive, negative, or neutral influence on transcription. The counts were adjusted by dividing each group by its number of TFs to eliminate sampling error.

Transcription, on the other hand, follows the same pattern as observed in the transposons. For all three regions, transcription has the same ratio of neutral to biased binding sites and the same ratio of negative to positive binding sites, which indicates that transcription-associated binding sites show no preference of influence for one region. In hindsight, the GO term annotation set for transcription may have been too broad and could be split further in future studies. The histone and chromatin groups, on the other hand, show that promoter regions without transposons have four times as many biased binding sites as neutral ones. This stands in opposition to transposon regions, where more neutral than biased binding sites were observed. This is especially true for TE-derived promoter sequences. These results indicate that, at least for binding sites affecting chromatin and histone modification, transposons preferentially transport sites with neutral influence into promoter regions.

3.4.2 Individual TFBSs

It was shown earlier that the number of binding sites shared between two tissues is quite small, especially for TE-derived promoter regions. However, the question arose whether some transcription factors stand out in their number of binding sites shared between more than two tissues. To answer this question, 12 transcription factors were chosen for which ENCODE data were available for five of the six tissues (kidney was excluded). Using the Python module bedtools, ENCODE positions shared between those five tissues were determined for TE-derived promoter regions and for promoter regions without TE sequences. To enable a comparison, the fraction of binding sites shared in all five tissues that lie in TE-derived sequences was calculated and is shown for every transcription factor in Figure 3.13. Most of the analysed transcription factors have less than 5% of their shared promoter-located binding sites in TE-derived sequences. Exceptions are CTCF with 6% and MAFK with 20% (TGCTGA(G/C)TCAGCA, basic leucine zipper factors (bZIP)). These two striking transcription factors were chosen for a pairwise analysis. POLR2A and RFX5 were chosen as a control group, and SP1 and TAF1 as an additional group since they show the smallest fractions.
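Determining positions shared by all five tissues is a repeated interval intersection. The sketch below is a pure-Python stand-in for that step and does not reproduce the bedtools calls used in the thesis. It assumes BED-style half-open intervals given as `(chromosome, start, end)` tuples; the function names and coordinates are hypothetical, and the quadratic loop is only suitable for small examples.

```python
from functools import reduce


def intersect(a, b):
    """Overlapping segments between two lists of (chrom, start, end) intervals."""
    overlaps = set()
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # half-open intervals must overlap by >= 1 bp
                overlaps.add((chrom_a, start, end))
    return sorted(overlaps)


def shared_in_all(peaks_by_tissue):
    """Segments covered by at least one peak in every tissue."""
    return reduce(intersect, peaks_by_tissue.values())


# Hypothetical peaks of one TF in three tissues.
peaks = {
    "blood": [("chr1", 100, 200), ("chr2", 50, 80)],
    "lung": [("chr1", 150, 250)],
    "liver": [("chr1", 120, 180), ("chr2", 60, 70)],
}
shared = shared_in_all(peaks)
print(shared)
```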

[Figure 3.13: bar chart. X-axis: transcription factor (SP1, CTCF, EZH2, JUND, MAFK, RFX5, TAF1, EP300, GABPA, SIN3A, POLR2A, POLR2AphosphoS5); y-axis: fraction of shared TFBSs in TE-derived promoter sequences (0.00 to 0.25).]

FIGURE 3.13: Fraction of shared TFBSs in TE-derived promoter sequences for TFs with data in five tissues (blood, lung, stem cell, breast, and liver).

The transcription factor MAFK has, in almost all pairwise comparisons, an equal fraction of binding sites in TE-derived promoter sequences and in TEs not located in promoter regions. The tissue blood in the blood-to-liver and blood-to-lung comparisons and the tissue stem cell in the stem-cell-to-lung comparison have a higher fraction of shared TFBSs in TEs in promoter regions than in TEs not in promoter regions, but the p-values for blood in blood to lung (p = 0.032) and for stem cell in stem cell to lung (p = 0.025) are the highest accepted in all MAFK comparisons. CTCF, on the other hand, has overall fewer shared TFBSs in TEs in promoter regions, with the exception of lung in all lung comparisons, where the fractions are equal to those of TEs not in promoter regions, and of blood in the blood-to-lung comparison, which has a higher fraction in promoter TEs. The control TF POLR2A has, in all comparisons, a lower fraction of shared binding sites in TE-derived promoter sequences than in TEs not in promoter regions. The other control, RFX5, has an equal fraction of shared TFBSs in TEs in and not in promoter regions, with the exception of liver in the liver-to-stem-cell comparison, where the fraction is greater in TE-derived promoter sequences. The two TFs with the lowest fraction of shared TFBSs in promoter-located TEs between five tissues show a similar pattern. TAF1 has significantly fewer shared TFBSs in TEs within promoter regions in all but one comparison (exception: breast in the breast-to-stem-cell comparison, p = 0.097), which makes it similar to POLR2A or, to a lesser degree, to CTCF. SP1 has an equal fraction of shared TFBSs in TEs in and not in promoter regions in 50% of all comparisons. The other half has a smaller fraction in TEs in promoter regions. Interestingly, both tissues of the same comparison belong to the same group, i.e. blood and liver in the blood-to-liver comparison have a smaller fraction of shared TFBSs in TE-derived promoter sequences, but blood and breast in the blood-to-breast comparison have an equal fraction. Overall, although MAFK has a higher percentage of shared TFBSs in TE-derived promoter regions between five tissues, this distinctiveness vanishes in pairwise tissue comparisons. Here MAFK and RFX5 showed similar results (comparisons showed mostly equal fractions), whereas CTCF, SP1, and TAF1 could be grouped together (lower fraction of shared TFBSs in TE-derived promoter sequences).

For five of those twelve transcription factors JASPAR motif matrices were available, and they were used to answer the question of whether transposons transport active binding sites into the promoter, where a portion becomes dormant, or whether they transport possible binding site positions into promoter regions, which later become active sites. The ENCODE data provided positions of active binding sites, and the possible binding site positions were estimated with a JASPAR motif search. The transposon sub-families with the highest number of active TFBSs were chosen as target sequences. The FPR thresholds were calculated with α = 0.05. For each TE sequence the fraction of ENCODE hits was calculated, using the sum of possible and active binding sites. TE sequences with no active binding sites or no possible binding sites were removed from the set. This could introduce a bias, since TEs with only possible positions are direct indicators of the direction of TFBS transport. The problem lies in their number: TE sequences without active sites are so numerous that the calculations simply return zero (see Tab. 3.5). It therefore appears more reasonable to concentrate on sequences with both possible and active binding sites.
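The per-TE fraction of ENCODE hits and the filtering step can be sketched as follows. This assumes that the fraction is the number of active (ENCODE) sites divided by the sum of active and possible (JASPAR) sites, which is one reading of the description above; the function name and the example counts are hypothetical.

```python
def encode_hit_fractions(hits_per_te):
    """Map TE id -> active / (active + possible), keeping only TEs with both kinds of hits."""
    fractions = {}
    for te_id, (active, possible) in hits_per_te.items():
        if active == 0 or possible == 0:
            continue  # dropped: lacks active or possible sites
        fractions[te_id] = active / (active + possible)
    return fractions


# Hypothetical (active, possible) counts per TE sequence.
hits = {
    "L1_a": (1, 9),
    "Alu_b": (0, 12),  # only possible sites, removed
    "MIR_c": (2, 0),   # only active sites, removed
    "ERV1_d": (1, 3),
}
fractions = encode_hit_fractions(hits)
print(fractions)
```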

In most of the comparisons the fraction of active binding sites is higher in TE-derived promoter sequences than in TEs not in promoter regions (Fig. 3.14). This is true for all L1 comparisons and most L2 comparisons. L2 for MAFK shows a distribution similar to L2 for JUND, but in this case the difference between TEs in promoters and TEs not in promoters is not significant (Mann–Whitney U test, p = 0.084). Interestingly, the same is true for ERV1. For every TF except MAFK, the fraction of active TFBSs is significantly higher in TE-derived promoter sequences. For MAFK, where the distribution again looks similar to that of JUND (Fig. 3.14), no significant difference between the two genome regions could be found (Mann–Whitney U test, p = 0.436). For the other two TE sub-families, significant differences depend on the transcription factor. MIR has significantly fewer active binding sites in TE-derived promoter regions for CTCF and MAFK (Mann–Whitney U test, CTCF: p = 2.691 × 10^−11, MAFK: p = 0.001), whereas the other three TFs exhibit no significant difference between the two genome regions. Interestingly, Alu elements have significantly more active binding sites in promoter-located TEs for the transcription factor JUND (Mann–Whitney U test, p = 0.0002), whereas CTCF and MAFK have significantly more active binding sites in TEs outside of promoters (Mann–Whitney U test, CTCF: p = 6.803 × 10^−12, MAFK: p = 9.017 × 10^−7). The other two TFs show no significant differences between TE locations.
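For readers unfamiliar with the Mann–Whitney U test used in these comparisons, the sketch below computes the U statistic by direct pairwise comparison and a two-sided p-value from the normal approximation. It is an illustration only: it applies neither tie nor continuity correction, so its p-values can differ slightly from those of a statistics package, and it is not the implementation used in the thesis. The sample values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value via normal approximation)."""
    # U counts pairs where x exceeds y; ties count one half.
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n_x, n_y = len(x), len(y)
    mean_u = n_x * n_y / 2
    sd_u = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (u_x - mean_u) / sd_u
    return u_x, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical fractions of ENCODE hits for two groups of TE sequences.
in_promoter = [0.10, 0.12, 0.15, 0.20]
not_in_promoter = [0.01, 0.02, 0.03, 0.05]
u, p = mann_whitney_u(in_promoter, not_in_promoter)
print(u, p < 0.05)
```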

Overall, the direction of binding site activation (insertion of an active TFBS that is then deactivated, or insertion of a possible TFBS that is then activated) depends on the TE element and the transcription factor. Nevertheless, some commonalities are observable. In general, the fraction of active binding sites is higher in TE-derived promoter sequences, especially for L1, L2, and ERV1 elements (LINE and LTR), indicating that possible binding sites are transported into promoter regions and later mutate into active sites. This result also corresponds with the TFBS distribution in TEs mentioned earlier (Fig. 3.8), where the number of TFBSs in LTR elements not in promoter regions was higher than expected from their nucleotide amount (Fig. 3.3), whereas the promoter regions followed the expected path. For the LINE elements, in contrast, the opposite is true: fewer active binding sites were observed in TEs not in promoter regions than expected from the nucleotide distribution of LINE elements. That the analysed SINE elements show either both directions for different TFs or no direction is also reflected in the TFBS distribution in TEs, since the numbers of binding sites for TEs in promoter regions and for TEs not in promoter regions are as expected from their nucleotide distribution.

[Figure 3.14: plot with one panel each for CTCF, JUND, MAFK, RFX5, and SP1, each with rows for L1, L2, Alu, MIR, and ERV1. X-axis: percentage of ENCODE hits (0.00 to 0.20); colours: TEs not in promoter, TEs in promoter.]

FIGURE 3.14: Fraction of active TFBSs against possible TFBSs in TEs in promoter and TEs not in promoter for five different TFs and TEs sub-families. Plotted are only values up to 0.22.

TABLE 3.5: Number of TEs from the sub-families L1, L2, Alu, MIR, and ERV1 with ENCODE or JASPAR hits for each analysed TF.

                      Transcription factor
Hits reported         RFX5        MAFK        JUND        CTCF        SP1         all
ENCODE and JASPAR     2,166       24,190      27,153      53,100      13,291      105,159
only JASPAR           3,312,595   3,295,667   3,307,629   3,263,310   3,319,753   3,356,117
only ENCODE           4           30          33          127         11          203

3.5 Pathway analysis

3.5.1 Pathway analysis of genes affected by TE-originated TFBSs

The next question was whether there are groups of genes that utilise TE-originated TFBSs more often than others. To answer this question, a pathway analysis was performed with the WebGestaltR package in R using six different data sets:

1. Active TFBSs derived from TEs only in promoters of tissue A but not tissue B

2. Active TFBSs derived from TEs only in promoters of tissue B but not tissue A

3. Active TFBSs derived from TEs in promoters of tissue B and tissue A

4. Active TFBSs only in non-TE-originated sequences of tissue A but not tissue B

5. Active TFBSs only in non-TE-originated sequences of tissue B but not tissue A

6. Active TFBSs in non-TE-originated sequences of tissue B and tissue A

In general, there are few enriched pathways for TE-originated TFBSs in promoter regions. For example, data set 3 yielded only two enriched pathways in a single tissue comparison (blood vs stem cells), and both pathways were related to human diseases: systemic lupus erythematosus and alcoholism. For promoters with binding sites used uniquely in some tissues, kidney and blood showed overrepresented pathways in the highest number of tissue comparisons (five for each of the tissues). The observed overrepresented pathways are often generic and not necessarily tissue specific, like ribosome, spliceosome, and RNA transport. Many disease-related pathways were also observed in this study (full list in Appendix Tab. B.2). It is still possible that the overrepresented pathways were not due to TE-originated TFBSs but part of the general overrepresentation in the compared tissues. Consequently, the overrepresented pathways from the analysis performed with binding sites located solely in non-TE promoter sequences (data sets 4 to 6) were subtracted from the previous results, which left only a few pathways. Again, blood has the highest number (13) of overrepresented pathways (Tab. 3.6). Liver ranks second with five overrepresented pathways, followed by breast and kidney with three each and lung with two. Finally, there are no overrepresented pathways in stem cells. Interestingly, the "ribosome" pathway is overrepresented in three different tissues (blood, lung, liver). Similarly, non-alcoholic fatty liver disease (NAFLD) is overrepresented in breast, kidney, and liver, while the oxidative phosphorylation pathway (the process for ATP generation) is overrepresented in blood, breast, and kidney. In conclusion, transposons provide significant novelty to specific gene regulation, especially in blood tissue, and these novel TFBSs affect some metabolic pathways more than others.
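The subtraction of pathways that are also enriched in the non-TE analysis (data sets 4 to 6) is a per-tissue set difference. A minimal sketch with hypothetical tissue labels, pathway lists, and function name:

```python
def subtract_background(te_pathways, background_pathways):
    """Keep, per tissue, only pathways absent from the non-TE (background) analysis."""
    result = {}
    for tissue, pathways in te_pathways.items():
        background = set(background_pathways.get(tissue, ()))
        result[tissue] = sorted(set(pathways) - background)
    return result


# Hypothetical enrichment results.
te_pathways = {
    "tissue_A": ["Ribosome", "RNA transport", "Spliceosome"],
    "tissue_B": ["Ribosome"],
}
background_pathways = {
    "tissue_A": ["Spliceosome"],
    "tissue_B": ["Ribosome"],
}
specific = subtract_background(te_pathways, background_pathways)
print(specific)
```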

[Figure 3.15: word cloud built from the words of the enriched pathway names.]

FIGURE 3.15: WordCloud representation of pathway analysis from promoters with no TE-derived sequences. Scaling was conducted using the square root of word count.

3.5.2 Pathway analysis of Promoter without any TEs

TABLE 3.6: Pathways overrepresented in different tissue comparisons of TFBSs.

Tissue      Pathway name                                                KEGG number
Blood       Aminoacyl-tRNA biosynthesis                                 hsa00970
            Basal transcription factors                                 hsa03022
            Citrate cycle (TCA cycle)                                   hsa00020
            DNA replication                                             hsa03030
            NF-kappa B signaling pathway                                hsa04064
            Nucleotide excision repair                                  hsa03420
            Oxidative phosphorylation                                   hsa00190
            Proteasome                                                  hsa03050
            Protein export                                              hsa03060
            Ribosome                                                    hsa03010
            RNA transport                                               hsa03013
            SNARE interactions in vesicular transport                   hsa04130
            Ubiquinone and other terpenoid-quinone biosynthesis         hsa00130
Breast      Metabolic pathways                                          hsa01100
            Non-alcoholic fatty liver disease (NAFLD)                   hsa04932
            Oxidative phosphorylation                                   hsa00190
Kidney      Nucleotide excision repair                                  hsa03420
            Non-alcoholic fatty liver disease (NAFLD)                   hsa04932
            Oxidative phosphorylation                                   hsa00190
Liver       Bacterial invasion of epithelial cells                      hsa05100
            Carbon metabolism                                           hsa01200
            Citrate cycle (TCA cycle)                                   hsa00020
            Non-alcoholic fatty liver disease (NAFLD)                   hsa04932
            Ribosome                                                    hsa03010
Lung        Epithelial cell signaling in Helicobacter pylori infection  hsa05120
            Ribosome                                                    hsa03010
Stem cell   No enrichment

As previously stated, 25% of the analysed promoters did not contain any TE-derived sequences. This raised the question of whether those promoters belong to genes with important functions, where any insertion in the promoter region would be detrimental for the individual. An approach similar to the one described previously was taken: the genes with promoter regions devoid of transposon sequences were extracted and submitted using the R package WebGestaltR. Promoter regions without any TEs were enriched in 51 pathways (full list in Appendix Tab. B.3). To present the results more clearly, a WordCloud was generated with the square root of the word count as scaling (Fig. 3.15). Four of those pathways are associated with the brain (synapses or axon), which fits the hypothesis that those genes are associated with important pathways, but there are 34 more pathways which can be assigned to two equally large groups. One contains signaling-related pathways and the other disease-related pathways. Almost all pathways of the second group are associated with different cancers (Fig. 3.15). The rest are, in decreasing order, diabetes, addiction, and infection. Nevertheless, the high number of enriched pathways associated either with signaling or with neurons lends credence to the idea that the 25% of TE-depleted promoter sequences belong to genes under strong evolutionary constraint.
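The square-root scaling used for the WordCloud in Figure 3.15 can be illustrated with a few lines of Python. The sketch assumes that pathway names are split on whitespace and that each word's weight is the square root of its count; the tokenisation, the function name, and the example pathway names are assumptions made for illustration.

```python
import math
from collections import Counter


def word_weights(pathway_names):
    """Weight each word by the square root of its occurrence count."""
    counts = Counter(word for name in pathway_names for word in name.split())
    return {word: math.sqrt(n) for word, n in counts.items()}


# Hypothetical subset of enriched pathway names.
names = [
    "Ras signaling pathway",
    "MAPK signaling pathway",
    "Rap1 signaling pathway",
    "cAMP signaling pathway",
    "Axon guidance",
]
weights = word_weights(names)
print(weights["signaling"], weights["Axon"])
```

Square-root scaling dampens very frequent words such as "signaling" or "pathway" so that rarer terms remain legible.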

Chapter 4

Discussion

Most human promoter regions contain TE-originated sequences, and thousands of TFBSs are located there. In the extreme cases (genes RP11-666A8.7 and RP11-666A8.12), the whole promoter region consists of TE-derived sequences. However, both genes are located in an intron of the PRCD gene and are part of the Havana manual annotation project [1]. Therefore, it is possible that they do not represent real genes but are merely artefacts of a spliced-out intron. At the other extreme, there are 8,796 genes whose promoters are depleted of TEs. One can indeed ask why some promoters are depleted of TEs. Is there anything special about these genes? This question was addressed by looking at the metabolic pathways associated with these genes. The resulting 51 pathways were enriched in signaling, regulatory, and synapse pathways, pointing to an evolutionary pressure on important regulatory functions. It has been shown that genes with essential functions, e.g. metabolism, have few TE insertions (Lagemaat et al., 2003). However, many of the enriched pathways were also associated with diseases such as cancer. The influence of TEs on diseases is well known (Mills et al., 2007; Lee and Young, 2013; Hancks and Kazazian, 2016), so it is surprising that genes with no TE insertions in their promoters are enriched in those pathways. A closer examination of which genes of those pathways are affected may shed light on this puzzle.

As mentioned above, typical human pol II promoters harbour a significant number of TEs. Although it has recently been suggested that TE exaptation into regulatory regions is rare (Simonti et al., 2017), it is hard to believe that such a significant fraction of the promoters would not harbour any regulatory elements, especially since early computational predictions suggested otherwise (Jordan et al., 2003; Thornburg et al., 2006). The analysis presented in this study demonstrated that hundreds of thousands of active TFBSs in the promoters are indeed located within TE-derived sequences. Of course, one could argue that 200,000 out of more than 3 million TFBSs is not an impressive number, but we dare to disagree.

Interestingly, when two tissues in which the same TFs were analysed are compared for active TFBSs, it is clear that most of these binding sites are unique to a given tissue. In other words, only a few of the TFBSs seem to be active in both tissues analysed, and usually only 20% of the active TFBSs in a given tissue are also active in another one. Trizzino et al. recently analysed TEs for their contribution to human “regulatory novelty” and came to the similar conclusion that “TEs can potentially contribute to the turnover of regulatory sequences in a tissue-specific fashion” (Trizzino et al., 2018). The contribution of TEs to tissue-specific novelty in regulation depends on the TF, as seen in the single-TF analysis. MAFK and RFX5, for example, show no difference between TE-derived promoter sequences and promoter regions, contrary to CTCF and POLR2A. This is especially interesting, since Kim et al. found stable CTCF binding across different cell lines (Kim et al., 2007). This can be explained by the fact that this study focused on specific regions (promoters), which are the centre of regulatory differences. It is worth mentioning that, for all 6 analysed TFs in 5 tissues, no comparison showed a higher amount of shared TFBSs in TE-derived promoter sequences than in promoters. TEs therefore either bring tissue-specific binding sites or have no influence on tissue specificity.

[1] https://www.ensembl.org/info/genome/genebuild/manual_havana.html

Another point of interest in this study was the direction in which TFBSs arise in promoters, in other words whether the TFBSs were already active in TEs and then transported into promoter regions, or whether TEs carry possible binding sites that become active after insertion into a promoter. The results have shown that the direction is TE dependent. For LINE and LTR elements the TFBSs were active in the TE, and after insertion into a promoter evolutionary pressure deactivated a portion of them. The SINE elements, on the other hand, jump with inactive binding sites which become active in the promoter. The difference between SINEs and the other elements lies in their structure. SINE elements have no protein-coding sequences and only an active promoter (see Chapter 1). The region after the promoter is therefore a blank sequence full of potential TFBSs. It stands to reason that, when the TE is inserted into a promoter, the regulatory apparatus is met with a reduction in TFBSs and, under evolutionary pressure, the possible sites become active.

The analysis of the directional influence of TE-originated TFBSs in promoter regions for multiple TFs seems to be a first in transposon research, since no paper on this topic was found. There are studies in which individual TFs and their influence on regulatory networks are investigated, but none takes a general approach as conducted in this study. The method itself has its flaws. First, the GO terms were grouped by hand, which can introduce a bias. Furthermore, most of the GO term annotations available for the ENCODE TFs are machine annotations that have not been reviewed by a human annotator. Lastly, changes of the terminology are frequent (Huntley et al., 2014), which could shift the results of the analysis. Still, the results obtained with this method are reasonable in light of the literature. First, the existence of TFs with a neutral influence on transcription is not an error. Although TFs are defined as proteins which bind DNA in a sequence-specific manner and regulate transcription, the regulation can be achieved indirectly by binding co-factors or other enzymes with opposite effects (Lee and Young, 2013; Frietze and Farnham, 2011; Rosenfeld et al., 2006; Schmitges et al., 2016; Li et al., 2007). One of the interesting results is that TEs harbour more neutral TFBSs for histone and chromatin modification than positive or negative ones. For chromatin in TE-derived promoter sequences the number is even 1.5 times higher. Since these are corrected values, this means that a small number of TFs contribute a significant number of binding sites related to chromatin modifications. This coincides with the fact that TEs "contributed nearly half of the open chromatin regions of the human genome" (Jacques et al., 2013) and that promoter regions in the human genome have low chromatin density (Thurman et al., 2012).

Rewiring of transcription circuits played a major role in mammalian evolution and consequently increased their functional and morphological diversity (King and Wilson, 1975; Mariño-Ramírez et al., 2005). In this large-scale analysis, it was demonstrated that TEs contributed a large fraction of human TFBSs. Interestingly, these TFBSs usually act in a tissue-specific manner. Thus, our study clearly showed that TEs played a significant role in shaping expression patterns in mammals and humans in particular. These findings further support a hypothesis put forward by one of us more than 20 years ago that “transposable elements are major evolutionary force in shaping eukaryotic genomes” (Makałowski, 2000). Furthermore, since several TE families are still active in our genome, they continue to influence not only our genome architecture but also gene functioning in a broader sense. Two open questions deserve further investigation: what are the evolutionary dynamics of the TE-originated TFBSs, and what is the polymorphism level of these TFBSs in different human populations? These will be the subject of our next study.

Appendix A

Supplementary - Figures

[Figure A.1: bar charts in two panels, "TEs not in promoter" and "TEs in promoter". x-axis: class of transposable elements (LTR, LINE, SINE, DNA); y-axis: number of TFBSs; legend: observed number of TFBSs, expected number of TFBSs.]

FIGURE A.1: Observed and expected numbers of TFBSs in TE-derived promoter sequences and in TE sequences not in promoters. Expected numbers were calculated by multiplying the number of TFBSs in the genomic region by the nucleotide fraction of a specific TE class in the same genomic region.
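The expected counts in Figure A.1 follow directly from the definition in the caption. The lines below are a minimal sketch of that calculation; the function name `expected_tfbs`, the variable names, and all numbers are hypothetical and chosen only for illustration, not taken from this study.

```python
def expected_tfbs(region_tfbs_total, te_class_bp, region_bp):
    """Expected TFBS count for one TE class in one genomic region.

    expected = (number of TFBSs in the region)
               * (fraction of the region's nucleotides covered by the TE class)
    """
    if region_bp <= 0:
        raise ValueError("region_bp must be positive")
    return region_tfbs_total * (te_class_bp / region_bp)


# Hypothetical example values (assumptions, not results from the thesis).
region_tfbs_total = 1_000_000   # all TFBSs found in the genomic region
region_bp = 50_000_000          # nucleotides in the genomic region
te_class_bp = {                 # nucleotides covered by each TE class
    "LTR": 2_500_000,
    "LINE": 5_000_000,
    "SINE": 10_000_000,
    "DNA": 1_000_000,
}

expected = {
    te_class: expected_tfbs(region_tfbs_total, bp, region_bp)
    for te_class, bp in te_class_bp.items()
}
```

Comparing such expected values with the observed counts per TE class shows whether a class carries more or fewer TFBSs than its share of nucleotides would suggest.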

FIGURE A.2: Number of shared and unique TFBSs for TE-derived promoter sequences in pairwise tissue comparisons, represented as Venn diagrams. Above each diagram, the number of TFs used (compared TFs) and the number of TFs with shared TFBSs between the two tissues (shared TFs) are given.

FIGURE A.3: Number of shared and unique TFBSs for promoter sequences without TEs in pairwise tissue comparisons, represented as Venn diagrams. Above each diagram, the number of TFs used (compared TFs) and the number of TFs with shared TFBSs between the two tissues (shared TFs) are given.
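The pairwise comparisons in Figures A.2 and A.3 can be thought of as set operations. The following is a minimal sketch under stated assumptions: each TFBS is represented as a (TF, chromosome, start, end) tuple, two sites count as shared only if these tuples are identical, and only TFs present in both tissues are compared. The matching rule actually used in the study may differ (for example, coordinate overlap), and the data below are invented.

```python
def compare_tissues(sites_a, sites_b):
    """Split the TFBSs of two tissues into unique and shared sets.

    Each site is a tuple (tf_name, chromosome, start, end).
    Assumption: only TFs with sites in both tissues are compared.
    """
    tfs_a = {site[0] for site in sites_a}
    tfs_b = {site[0] for site in sites_b}
    compared_tfs = tfs_a & tfs_b

    a = {site for site in sites_a if site[0] in compared_tfs}
    b = {site for site in sites_b if site[0] in compared_tfs}
    shared = a & b

    return {
        "compared_tfs": compared_tfs,                # TFs used in the comparison
        "shared_tfs": {site[0] for site in shared},  # TFs with at least one shared site
        "unique_a": a - b,
        "shared": shared,
        "unique_b": b - a,
    }


# Invented example data, for illustration only.
blood = {
    ("CTCF", "chr1", 100, 114),
    ("CTCF", "chr2", 500, 514),
    ("YY1", "chr1", 900, 912),
    ("SP1", "chr3", 40, 50),
}
liver = {
    ("CTCF", "chr1", 100, 114),
    ("YY1", "chr5", 70, 82),
    ("HNF1A", "chr3", 10, 23),
}

result = compare_tissues(blood, liver)
```

The three set sizes (`unique_a`, `shared`, `unique_b`) correspond to the three areas of one Venn diagram, and the sizes of `compared_tfs` and `shared_tfs` correspond to the two numbers given above each diagram.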

[Figure A.4: panels for MEF2B and CTCF; legible labels: "Distance to ENCODE peak (in bp)", "Number of possible TFBSs", "0.05".]

FIGURE A.4: All possible motif positions for ENCODE entries of MEF2B and CTCF with FPR thresholds of α = 0.05 and 0.01. Position 0 on the x-axis corresponds to the recorded ENCODE peak.

Appendix B

Supplementary - Tables

TABLE B.1: Gene categories used in the study.

Gene category                  Number of annotated genes
protein_coding                 19888
IG_C_gene                      14
IG_D_gene                      37
IG_J_gene                      18
IG_V_gene                      144
TR_C_gene                      6
TR_D_gene                      4
TR_J_gene                      79
TR_V_gene                      106
processed_transcript           556
3prime_overlapping_ncRNA       32
bidirectional_promoter_lncRNA  47
lincRNA                        7490
non_coding                     3
antisense                      5501
sense_intronic                 899
sense_overlapping              183
Total                          35,007

TABLE B.2: Pathways enriched in the gene set whose promoters harbor TE-derived sequences.

Tissue  Pathway name                                         KEGG number
Blood   Citrate cycle (TCA cycle)                            hsa00020
        Ubiquinone and other terpenoid-quinone biosynthesis  hsa00130
        Oxidative phosphorylation                            hsa00190
        N-Glycan biosynthesis                                hsa00510
        Other types of O-glycan biosynthesis                 hsa00514
        Aminoacyl-tRNA biosynthesis                          hsa00970
        Metabolic pathways                                   hsa01100
        Platinum drug resistance                             hsa01524
        Ribosome                                             hsa03010
        RNA transport                                        hsa03013
        Basal transcription factors                          hsa03022
        DNA replication                                      hsa03030
        Spliceosome                                          hsa03040
        Proteasome                                           hsa03050
        Protein export                                       hsa03060
        Nucleotide excision repair                           hsa03420
        Fanconi anemia pathway                               hsa03460
        NF-kappa B signaling pathway                         hsa04064
        Ubiquitin mediated proteolysis                       hsa04120
        SNARE interactions in vesicular transport            hsa04130
        Protein processing in endoplasmic reticulum          hsa04141
        Lysosome                                             hsa04142
        Endocytosis                                          hsa04144
        Apoptosis                                            hsa04210
        Fc gamma R-mediated phagocytosis                     hsa04666
        Neurotrophin signaling pathway                       hsa04722
        Insulin signaling pathway                            hsa04910
        Non-alcoholic fatty liver disease (NAFLD)            hsa04932
        Huntington’s disease                                 hsa05016
        Alcoholism                                           hsa05034
        Shigellosis                                          hsa05131
        Hepatitis B                                          hsa05161
        Herpes simplex infection                             hsa05168
        Epstein-Barr virus infection                         hsa05169
        Viral carcinogenesis                                 hsa05203
        Colorectal cancer                                    hsa05210
        Renal cell carcinoma                                 hsa05211
        Pancreatic cancer                                    hsa05212
        Central carbon metabolism in cancer                  hsa05230
Breast  Oxidative phosphorylation                            hsa00190
        Metabolic pathways                                   hsa01100
        Non-alcoholic fatty liver disease (NAFLD)            hsa04932
        Huntington’s disease                                 hsa05016
Kidney  Oxidative phosphorylation                            hsa00190
        Metabolic pathways                                   hsa01100
        Ribosome                                             hsa03010
        RNA transport                                        hsa03013
        Spliceosome                                          hsa03040
        Nucleotide excision repair                           hsa03420
        FoxO signaling pathway                               hsa04068
        Non-alcoholic fatty liver disease (NAFLD)            hsa04932
        Huntington’s disease                                 hsa05016
        Renal cell carcinoma                                 hsa05211
Liver   Citrate cycle (TCA cycle)                            hsa00020
        Pyruvate metabolism                                  hsa00620
        Metabolic pathways                                   hsa01100
        Carbon metabolism                                    hsa01200
        Ribosome                                             hsa03010
        Spliceosome                                          hsa03040
        Non-alcoholic fatty liver disease (NAFLD)            hsa04932
        Bacterial invasion of epithelial cells               hsa05100
Lung    Ribosome                                             hsa03010
        mTOR signaling pathway                               hsa04150
        Epithelial cell signaling in Helicobacter            hsa05120

TABLE B.3: Pathways enriched in the gene set whose promoters were devoid of TEs.

Geneset   Description
hsa05200  Pathways in cancer
hsa04550  Signaling pathways regulating pluripotency of stem cells
hsa04360  Axon guidance
hsa04151  PI3K-Akt signaling pathway
hsa05224  Breast cancer
hsa04022  cGMP-PKG signaling pathway
hsa04724  Glutamatergic synapse
hsa05202  Transcriptional misregulation in cancer
hsa04713  Circadian entrainment
hsa04211  Longevity regulating pathway
hsa04080  Neuroactive ligand-receptor interaction
hsa04725  Cholinergic synapse
hsa05217  Basal cell carcinoma
hsa04015  Rap1 signaling pathway
hsa04261  Adrenergic signaling in cardiomyocytes
hsa04390  Hippo signaling pathway
hsa04723  Retrograde endocannabinoid signaling
hsa04916  Melanogenesis
hsa04540  Gap junction
hsa04921  Oxytocin signaling pathway
hsa05205  Proteoglycans in cancer
hsa04213  Longevity regulating pathway - multiple species
hsa05218  Melanoma
hsa04310  Wnt signaling pathway
hsa04010  MAPK signaling pathway
hsa04917  Prolactin signaling pathway
hsa05213  Endometrial cancer
hsa04728  Dopaminergic synapse
hsa04933  AGE-RAGE signaling pathway in diabetic complications
hsa01521  EGFR tyrosine kinase inhibitor resistance
hsa05215  Prostate cancer
hsa04915  Estrogen signaling pathway
hsa04930  Type II diabetes mellitus
hsa04950  Maturity onset diabetes of the young
hsa04024  cAMP signaling pathway
hsa04970  Salivary secretion
hsa04014  Ras signaling pathway
hsa04020  Calcium signaling pathway
hsa04911  Insulin secretion
hsa05166  HTLV-I infection
hsa05414  Dilated cardiomyopathy
hsa04742  Taste transduction
hsa05032  Morphine addiction
hsa05034  Alcoholism
hsa05214  Glioma
hsa04012  ErbB signaling pathway
hsa05216  Thyroid cancer
hsa04923  Regulation of lipolysis in adipocytes
hsa05222  Small cell lung cancer
hsa04974  Protein digestion and absorption
hsa04350  TGF-beta signaling pathway

TABLE B.4: GO-term annotations used to infer the influence of a TFBS on transcription (increase or decrease), acting through chromatin modification, through histone modification, or on transcription directly.

Transcription influence                   GO-term accession  GO-term definition
Reduced through chromatin modification    GO:0000183         chromatin silencing at rDNA
                                          GO:0005677         chromatin silencing complex
                                          GO:0006342         chromatin silencing
                                          GO:0006344         maintenance of chromatin silencing
                                          GO:0031937         positive regulation of chromatin silencing
                                          GO:0031940         positive regulation of chromatin silencing at telomere
Reduced through histone modification      GO:0004407         histone deacetylase activity
                                          GO:0016575         histone deacetylation
                                          GO:0031065         positive regulation of histone deacetylation
                                          GO:0032452         histone demethylase activity
                                          GO:0032453         histone demethylase activity (H3-K4 specific)
                                          GO:0032454         histone demethylase activity (H3-K9 specific)
                                          GO:0034647         histone demethylase activity (H3-trimethyl-K4 specific)
                                          GO:0034648         histone demethylase activity (H3-dimethyl-K4 specific)
                                          GO:0035067         negative regulation of histone acetylation
                                          GO:0035575         histone demethylase activity (H4-K20 specific)
                                          GO:0051864         histone demethylase activity (H3-K36 specific)
                                          GO:0070932         histone H3 deacetylation
                                          GO:0070933         histone H4 deacetylation
                                          GO:0071558         histone demethylase activity (H3-K27 specific)
                                          GO:0090241         negative regulation of histone H4 acetylation
                                          GO:1901675         negative regulation of histone H3-K27 acetylation
Reduced                                   GO:0000122         negative regulation of transcription by RNA polymerase II
                                          GO:0001078         transcriptional repressor activity, RNA polymerase II proximal promoter sequence-specific DNA binding
                                          GO:0001103         RNA polymerase II repressing transcription factor binding
                                          GO:0001206         transcriptional repressor activity, RNA polymerase II distal enhancer sequence-specific binding
                                          GO:0001222         transcription corepressor binding
                                          GO:0001227         transcriptional repressor activity, RNA polymerase II transcription regulatory region sequence-specific DNA binding
                                          GO:0003714         transcription corepressor activity
                                          GO:0010629         negative regulation of gene expression
                                          GO:0010768         negative regulation of transcription from RNA polymerase II promoter in response to UV-induced DNA damage
                                          GO:0010944         negative regulation of transcription by competitive promoter binding
                                          GO:0016458         gene silencing
                                          GO:0017053         transcriptional repressor complex
                                          GO:0017055         negative regulation of RNA polymerase II transcriptional preinitiation complex assembly
                                          GO:0031047         gene silencing by RNA
                                          GO:0034244         negative regulation of transcription elongation from RNA polymerase II promoter
                                          GO:0045814         negative regulation of gene expression, epigenetic
                                          GO:0045892         negative regulation of transcription, DNA-templated
                                          GO:0061987         negative regulation of transcription from RNA polymerase II promoter by glucose
                                          GO:0071930         negative regulation of transcription involved in G1/S transition of mitotic cell cycle
                                          GO:0090571         RNA polymerase II transcription repressor complex
                                          GO:1903026         negative regulation of RNA polymerase II regulatory region sequence-specific DNA binding
                                          GO:1903758         negative regulation of transcription from RNA polymerase II promoter by histone modification
                                          GO:1990441         negative regulation of transcription from RNA polymerase II promoter in response to endoplasmic reticulum stress
                                          GO:2000637         positive regulation of gene silencing by miRNA
Increased through chromatin modification  GO:0006352         DNA-templated transcription, initiation
                                          GO:0035327         transcriptionally active chromatin
Increased through histone modification    GO:0004402         histone acetyltransferase activity
                                          GO:0010484         H3 histone acetyltransferase activity
                                          GO:0016573         histone acetylation
                                          GO:0031064         negative regulation of histone deacetylation
                                          GO:0035066         positive regulation of histone acetylation
                                          GO:0043966         histone H3 acetylation
                                          GO:0043967         histone H4 acetylation
                                          GO:0043969         histone H2B acetylation
                                          GO:0043981         histone H4-K5 acetylation
                                          GO:0043982         histone H4-K8 acetylation
                                          GO:0043983         histone H4-K12 acetylation
                                          GO:0043984         histone H4-K16 acetylation
                                          GO:0043995         histone acetyltransferase activity (H4-K5 specific)
                                          GO:0043996         histone acetyltransferase activity (H4-K8 specific)
                                          GO:0043997         histone acetyltransferase activity (H4-K12 specific)
                                          GO:0044154         histone H3-K14 acetylation
                                          GO:0046811         histone deacetylase inhibitor activity
                                          GO:0046972         histone acetyltransferase activity (H4-K16 specific)
                                          GO:0071442         positive regulation of histone H3-K14 acetylation
                                          GO:0090240         positive regulation of histone H4 acetylation
                                          GO:1901726         negative regulation of histone deacetylase activity
                                          GO:2000144         positive regulation of DNA-templated transcription, initiation
                                          GO:2000617         positive regulation of histone H3-K9 acetylation
                                          GO:2000620         positive regulation of histone H4-K16 acetylation
Increased                                 GO:0000432         positive regulation of transcription from RNA polymerase II promoter by glucose
                                          GO:0000435         positive regulation of transcription from RNA polymerase II promoter by galactose
                                          GO:0001077         transcriptional activator activity, RNA polymerase II proximal promoter sequence-specific DNA binding
                                          GO:0001102         RNA polymerase II activating transcription factor binding
                                          GO:0001205         transcriptional activator activity, RNA polymerase II distal enhancer sequence-specific DNA binding
                                          GO:0001223         transcription coactivator binding
                                          GO:0001225         RNA polymerase II transcription coactivator binding
                                          GO:0001228         transcriptional activator activity, RNA polymerase II transcription regulatory region sequence-specific DNA binding
                                          GO:0003257         positive regulation of transcription from RNA polymerase II promoter involved in myocardial precursor cell differentiation
                                          GO:0003713         transcription coactivator activity
                                          GO:0006352         DNA-templated transcription, initiation
                                          GO:0006367         transcription initiation from RNA polymerase II promoter
                                          GO:0006990         positive regulation of transcription from RNA polymerase II promoter involved in unfolded protein response
                                          GO:0007221         positive regulation of transcription of Notch receptor target
                                          GO:0010628         positive regulation of gene expression
                                          GO:0016251         RNA polymerase II general transcription initiation factor activity
                                          GO:0035948         positive regulation of gluconeogenesis by positive regulation of transcription from RNA polymerase II promoter
                                          GO:0036003         positive regulation of transcription from RNA polymerase II promoter in response to stress
                                          GO:0036091         positive regulation of transcription from RNA polymerase II promoter in response to oxidative stress
                                          GO:0042795         snRNA transcription by RNA polymerase II
                                          GO:0045815         positive regulation of gene expression, epigenetic
                                          GO:0045893         positive regulation of transcription, DNA-templated
                                          GO:0045899         positive regulation of RNA polymerase II transcriptional preinitiation complex assembly
                                          GO:0045944         positive regulation of transcription by RNA polymerase II
                                          GO:0051123         RNA polymerase II transcriptional preinitiation complex assembly
                                          GO:0060261         positive regulation of transcription initiation from RNA polymerase II promoter
                                          GO:0061395         positive regulation of transcription from RNA polymerase II promoter in response to arsenic-containing substance
                                          GO:0061408         positive regulation of transcription from RNA polymerase II promoter in response to heat stress
                                          GO:0061419         positive regulation of transcription from RNA polymerase II promoter in response to hypoxia
                                          GO:0097550         transcriptional preinitiation complex
                                          GO:1901522         positive regulation of transcription from RNA polymerase II promoter involved in cellular response to chemical stimulus
                                          GO:1902895         positive regulation of pri-miRNA transcription by RNA polymerase II
                                          GO:1990440         positive regulation of transcription from RNA polymerase II promoter in response to endoplasmic reticulum stress
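How the GO-term groups of Table B.4 might be turned into a direction of influence per TF can be sketched as follows. This is an illustration under assumptions rather than the implementation used in this study: only a few accessions from the table are included, a TF is called "neutral" when it carries terms from both directions (an assumed rule), and the function name `classify_influence` as well as the example annotations are hypothetical.

```python
# Small excerpt of Table B.4; the full table contains many more accessions.
REDUCING = {
    "GO:0000122",  # negative regulation of transcription by RNA polymerase II
    "GO:0003714",  # transcription corepressor activity
    "GO:0045892",  # negative regulation of transcription, DNA-templated
    "GO:0016575",  # histone deacetylation
    "GO:0006342",  # chromatin silencing
}
INCREASING = {
    "GO:0045944",  # positive regulation of transcription by RNA polymerase II
    "GO:0003713",  # transcription coactivator activity
    "GO:0045893",  # positive regulation of transcription, DNA-templated
    "GO:0016573",  # histone acetylation
    "GO:0035327",  # transcriptionally active chromatin
}


def classify_influence(go_terms):
    """Classify a TF by the GO accessions annotated to it.

    Assumed rule: terms from both directions -> "neutral";
    no term from either list -> "unknown".
    """
    terms = set(go_terms)
    increases = bool(terms & INCREASING)
    reduces = bool(terms & REDUCING)
    if increases and reduces:
        return "neutral"
    if increases:
        return "positive"
    if reduces:
        return "negative"
    return "unknown"


# Invented annotations for placeholder TF names (not real data).
annotations = {
    "TF_A": {"GO:0045944", "GO:0003713"},
    "TF_B": {"GO:0000122", "GO:0045893"},
    "TF_C": {"GO:0016575"},
}
influence = {tf: classify_influence(terms) for tf, terms in annotations.items()}
```

Keeping the accession sets as plain data makes the hand-made grouping easy to inspect and to revise when GO terminology changes.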

TABLE B.5: List of transcription factors with available ENCODE data and their corresponding JASPAR motif matrices, with the consensus sequences used in this study.

TF       JASPAR ID  Consensus sequence
ATF4     MA0833.1   ATGAYGCAAT
ATF7     MA0834.1   ATGACGTCAT
BCL6B    MA0731.1   TGCTTTCTAGGAATTCM
BHLHE40  MA0464.2   KCACGTGM
CEBPA    MA0102.3   ATTGCAYAAY
CEBPB    MA0466.2   TTRCGCAAY
CEBPG    MA0838.1   ATTRCGCAAY
CLOCK    MA0819.1   ACACGTGT
CREB1    MA0018.3   TGACGTCA
CREB3    MA0638.1   TGCCACGTCAY
CREB3L1  MA0839.1   TGCCACGTCANCA
CTCF     MA0139.1   CCASYAGRKGGCRS
CTCFL    MA1102.1   CNSCAGGGGGCR
CUX1     MA0754.1   TRATCRAT
E2F1     MA0024.3   TTGGCGCCAA
E2F4     MA0470.1   GGCGGGAA
E2F6     MA0471.1   RGGCGGGAR
E2F7     MA0758.1   TTTCCCGCCAAAA
E2F8     MA0865.1   TTTCCCGCCAAA
EBF1     MA0154.3   ANTCCCNWGGGA
EGR1     MA0162.3   ACGCCCACGCA
EGR2     MA0472.2   MCGCCCACGCA
ELF1     MA0473.2   ACCCGGAAGT
ELF3     MA0640.1   ACCCGGAAGTA
ELF4     MA0641.1   ACCCGGAAGT
ELK1     MA0028.2   ACCGGAAGT
ERF      MA0760.1   ACCGGAAGT
ESR1     MA0112.3   RGGTCACNRTGACCT
ETS1     MA0098.3   ACCGGAART
ETV4     MA0764.1   CCGGAAGT
ETV5     MA0765.1   CGGAWG
ETV6     MA0645.1   SCGGAAGT
FOS      MA0476.1   TGASTCAT
FOSL1    MA0477.1   RTGASTCAK
FOSL2    MA0478.1   RTGASTCAB
FOXA1    MA0148.3   TGTTTACWYW
FOXK2    MA1103.1   GTAAACA
FOXP1    MA0481.2   GTAAACA
FOXP2    MA0593.1   WGTAAACA
GATA2    MA0036.3   CTTATCT
GATA3    MA0037.3   WGATAA
GLI2     MA0734.1   GCGACCACMCT
GLIS1    MA0735.1   MGACCCCCCACGAWG
GLIS2    MA0736.1   GACCCCCCGCRANG
GMEB2    MA0862.1   TACGTAA
HINFP    MA0131.2   CNRCGTCCGC
HLF      MA0043.2   TTACRTAA
HMBOX1   MA0895.1   TAGTTA
HNF1A    MA0046.2   RTTAATNATTAAC
HNF4G    MA0484.1   RGNNCAAAGKYCA
HSF1     MA0486.2   TTCTAGAAYNTTC
IRF1     MA0050.2   STTTCACTTTCNNTT
IRF2     MA0051.1   SGAAAGYGAAASCNW
IRF3     MA1418.1   RRAANGGAAACCGAAAC
IRF4     MA1419.1   CGAAACCGAAACY
IRF5     MA1420.1   CCGAAACCGAAACY
IRF9     MA0653.1   AACGAAACCGAAACT
JUN      MA0488.1   ATGATGTCAT
JUNB     MA0490.1   RTGASTCA
JUND     MA0491.1   RTGASTCAT
KLF13    MA0657.1   TGMCACGCCCCTTTTT
KLF14    MA0740.1   RCCACGCCCMC
KLF16    MA0741.1   MCACGCCCCC
KLF4     MA0039.3   CACACCCT
KLF5     MA0599.1   CCCCRCCCH
KLF9     MA1107.1   CCACACCCA
LEF1     MA0768.1   AAGATCAAAGG
MAFF     MA0495.2   TGCTGAGTCAGCA
MAFG     MA0659.1   TGCTGASTCAGCA
MAFK     MA0496.2   TGCTGASTCAGCA
MAX      MA0058.3   CACGTG
MEF2A    MA0052.3   CTAWAAATAGM
MEF2B    MA0660.1   RCTAWAAATAGC
MEF2C    MA0497.1   CYAAAAATAG
MEF2D    MA0773.1   CTAWAAATAGM
MEIS2    MA0774.1   TGACAG
MGA      MA0801.1   AGGTGTGA
MITF     MA0620.2   GTCACGTGAC
MIXL1    MA0662.1   YAATTA
MLX      MA0663.1   RTCACGTGAT
MNT      MA0825.1   CACGTG
MXI1     MA1108.1   CCACGTG
MYB      MA0100.3   CAACTG
MYBL2    MA0777.1   RRCCGTTAAACNGYY
MYC      MA0147.3   CCACGTG
MZF1     MA0056.1   GGGGA
NEUROD1  MA1109.1   ACAGATGG
NFATC3   MA0625.1   TTTCCR
NFE2     MA0841.1   ATGACTCATS
NFIA     MA0670.1   TGCCAA
NFIC     MA0161.2   CTTGGCA
NFIL3    MA0025.1   TTAYGTAAY
NFYA     MA0060.3   CCAATCA
NFYB     MA0502.1   RRCCAATCAG
NR2C2    MA0504.1   RGGTCARAGGTCA
NR2F1    MA0017.2   ARAGGTCANNNG
NR2F2    MA1111.1   AAGGTCA
NR3C1    MA0113.3   RGWACAYNRTGTWCY
NR4A1    MA1112.1   AAAGGTCA
NRF1     MA0506.1   GCGCNTGCGCR
PAX5     MA0014.3   GCGTGACC
PBX2     MA1113.1   TGATTGACA
PBX3     MA1114.1   TGAGTGACAG
PKNOX1   MA0782.1   TGACAGSTGTCA
POU2F2   MA0507.1   ATTTGCATR
POU5F1   MA1115.1   ATGCAAA
PRDM1    MA0508.2   ACTTTCWC
RBPJ     MA1116.1   TGGGAA
RELA     MA0107.1   GGGRATTTCC
RELB     MA1117.1   ATTCCCC
REST     MA0138.2   TCAGCACCATGGACAGCKCC
RFX3     MA0798.1   GTTRCCATGGYAAC
RFX5     MA0510.2   GTTGCCATGGYAAC
RUNX3    MA0684.1   WAACCRCAA
RXRB     MA0855.1   GGGGTCAAAGGTCA
SCRT1    MA0743.1   GCAACAGGTG
SCRT2    MA0744.1   GCAACAGGTG
SOX13    MA1120.1   ACAATG
SP1      MA0079.3   CCCCKCCCCC
SP3      MA0746.1   CCACGCCCM
SPI1     MA0080.4   AAAAAGCGGAAGT
SREBF1   MA0595.1   RTCACCCCAY
SREBF2   MA0596.1   RTGGGGTGAY
SRF      MA0083.3   TGMCCATATATGGKCA
STAT1    MA0137.3   TTCYRGGAA
STAT3    MA0144.2   TTCYKGGAA
TBX21    MA0690.1   AGGTGTGAA
TCF7L2   MA0523.1   ASATCAAAG
TEAD1    MA0090.2   RCATTCC
TEAD2    MA1121.1   ACATTCC
TEAD3    MA0808.1   RCATTCC
TEAD4    MA0809.1   RCATTCC
TFAP4    MA0691.1   AWCAGCTGWT
TFDP1    MA1122.1   GCGGGAA
TFE3     MA0831.2   CACGTGAY
TGIF2    MA0797.1   TGACAGSTGTCA
THAP1    MA0597.1   TGCCC
USF1     MA0093.2   CAYGTGACC
USF2     MA0526.2   GTCACGTG
YY1      MA0095.2   CAARATGGCNGC
YY2      MA0748.1   CCGCCATT
ZBED1    MA0749.1   TRTCGCGACATR
ZBTB33   MA0527.1   TCTCGCGAGANY
ZBTB7A   MA0750.2   CCGGAAGTG
ZBTB7B   MA0694.1   CGACCACCGA
ZEB1     MA0103.3   CACCTG
ZNF143   MA0088.2   TWCCCAYAATGCAYYG
ZNF24    MA1124.1   CATTCATTCATTC
ZNF263   MA0528.1   GGAGGAGGRRGRGGRGGRRGR
ZNF282   MA1154.1   CTTTCCCMYAACACG
ZNF354C  MA0130.1   MTCCAC
ZNF384   MA1125.1   AAAAAAAA
ZNF740   MA0753.1   CCCCCCCAC
ZSCAN4   MA1155.1   TGCACACACTGAAA
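The consensus sequences in Table B.5 are written with IUPAC nucleotide ambiguity codes (for example, R = A or G, Y = C or T, N = any base). As a reading aid, the sketch below expands a consensus into a regular expression and reports exact matches on the forward strand. This is a deliberate simplification and not the method of this study, which, judging from Table B.5 and Figure A.4, scans with JASPAR motif matrices and an FPR threshold. The function names and the test sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_matches(consensus, sequence):
    """Return 0-based start positions of all matches on the forward strand.

    A lookahead is used so that overlapping occurrences are reported too.
    """
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Invented test sequence; consensus strings taken from Table B.5.
sequence = "TTCACGTGAACACGTGT"
max_hits = find_matches("CACGTG", sequence)        # MAX / MNT
bhlhe40_hits = find_matches("KCACGTGM", sequence)  # BHLHE40
clock_hits = find_matches("ACACGTGT", sequence)    # CLOCK
```

Exact consensus matching ignores position-specific base frequencies, so it will generally find a different set of sites than a matrix scan with a score threshold.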

Bibliography

Adams, C. C. and J. L. Workman (1995). “Binding of disparate transcriptional activators to nucleosomal DNA is inherently cooperative.” In: Molecular and Cellular Biology 15.3, pp. 1405–1421. ISSN: 0270-7306, 1098-5549. DOI: 10.1128/MCB.15.3.1405. URL: https://mcb.asm.org/content/15/3/1405 (visited on 08/14/2019).

Bai, Lu and Alexandre V. Morozov (2010). “Gene regulation by nucleosome positioning”. In: Trends in genetics: TIG 26.11, pp. 476–483. ISSN: 0168-9525. DOI: 10.1016/j.tig.2010.08.003.

Banville, Denis and Yves Boie (1989). “Retroviral long terminal repeat is the promoter of the gene encoding the tumor-associated calcium-binding protein oncomodulin in the rat”. In: Journal of molecular biology 207, pp. 481–90. DOI: 10.1016/0022-2836(89)90458-0.

Becker, K. G. et al. (1993). “Binding of the ubiquitous nuclear transcription factor YY1 to a cis regulatory sequence in the human LINE-1 transposable element”. In: Human Molecular Genetics 2.10, pp. 1697–1702. ISSN: 0964-6906. DOI: 10.1093/hmg/2.10.1697.

Bell, Adam C. and Gary Felsenfeld (2000). “Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene”. In: Nature 405.6785, pp. 482–485. ISSN: 1476-4687. DOI: 10.1038/35013100. URL: https://www.nature.com/articles/35013100 (visited on 08/19/2019).

Boeke, Jef D. and Victor G. Corces (1989). “Transcription and Reverse Transcription of Retrotransposons”. In: Annual Review of Microbiology 43.1, pp. 403–434. DOI: 10.1146/annurev.mi.43.100189.002155. URL: https://doi.org/10.1146/annurev.mi.43.100189.002155 (visited on 08/13/2019).

Britten, R. J. and E. H. Davidson (1969). “Gene Regulation for Higher Cells: A Theory”. In: Science 165.3891, pp. 349–357. ISSN: 0036-8075, 1095-9203. DOI: 10.1126/science.165.3891.349. URL: http://www.sciencemag.org/cgi/doi/10.1126/science.165.3891.349 (visited on 08/13/2019).

Britten, R. J. and D. E. Kohne (1968). “Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms”. In: Science (New York, N.Y.) 161.3841, pp. 529–540. ISSN: 0036-8075. DOI: 10.1126/science.161.3841.529.

Brosius, J. (1991). “Retroposons–seeds of evolution”. In: Science (New York, N.Y.) 251.4995, p. 753. ISSN: 0036-8075. DOI: 10.1126/science.1990437.

Chuong, Edward B. et al. (2013). “Endogenous retroviruses function as species-specific enhancer elements in the placenta”. In: Nature genetics 45.3, pp. 325–329. ISSN: 1061-4036. DOI: 10.1038/ng.2553. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3789077/ (visited on 08/16/2019).

Chuong, Edward B., Nels C. Elde, and Cédric Feschotte (2017). “Regulatory activities of transposable elements: from conflicts to benefits”. In: Nature Reviews Genetics 18.2, pp. 71–86. ISSN: 1471-0064. DOI: 10.1038/nrg.2016.139. URL: https://www.nature.com/articles/nrg.2016.139 (visited on 12/06/2018).

Cirillo, L. A. (1998). “Binding of the winged-helix transcription factor HNF3 to a linker histone site on the nucleosome”. In: The EMBO Journal 17.1, pp. 244–254. ISSN: 1460-2075. DOI: 10.1093/emboj/17.1.244. URL: http://emboj.embopress.org/cgi/doi/10.1093/emboj/17.1.244 (visited on 08/13/2019).

Cirillo, Lisa Ann et al. (2002). “Opening of Compacted Chromatin by Early Developmental Transcription Factors HNF3 (FoxA) and GATA-4”. In: Molecular Cell 9.2, pp. 279–289. ISSN: 1097-2765. DOI: 10.1016/S1097-2765(02)00459-8. URL: https://linkinghub.elsevier.com/retrieve/pii/S1097276502004598 (visited on 08/13/2019).

Cordaux, Richard and Mark A. Batzer (2009). “The impact of retrotransposons on human genome evolution”. In: Nature reviews. Genetics 10.10, pp. 691–703. ISSN: 1471-0056. DOI: 10.1038/nrg2640. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2884099/ (visited on 09/17/2018).

Cusanovich, Darren A. et al. (2014). “The Functional Consequences of Variation in Transcription Factor Binding”. In: PLOS Genetics 10.3, e1004226. ISSN: 1553-7404. DOI: 10.1371/journal.pgen.1004226. URL: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004226 (visited on 08/16/2019).

Davidson, E. H. and R. J. Britten (1979). “Regulation of gene expression: possible role of repetitive sequences”. In: Science 204.4397, pp. 1052–1059. ISSN: 0036-8075, 1095-9203. DOI: 10.1126/science.451548. URL: https://science.sciencemag.org/content/204/4397/1052 (visited on 08/12/2019).

Deininger, Prescott (2011). “Alu elements: know the SINEs”. In: Genome Biology 12.12, p. 236. ISSN: 1465-6906. DOI: 10.1186/gb-2011-12-12-236. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334610/ (visited on 08/13/2019).

Dewannieux, Marie, Cécile Esnault, and Thierry Heidmann (2003). “LINE-mediated retrotransposition of marked Alu sequences”. In: Nature Genetics 35.1, pp. 41–48. ISSN: 1546-1718. DOI: 10.1038/ng1223. URL: https://www.nature.com/articles/ng1223 (visited on 08/12/2019).

Doolittle, W. Ford and Carmen Sapienza (1980). “Selfish genes, the phenotype paradigm and genome evolution”. In: Nature 284.5757, pp. 601–603. ISSN: 1476-4687. DOI: 10.1038/284601a0. URL: https://www.nature.com/articles/284601a0 (visited on 08/12/2019).

Feng, Qinghua et al. (1996). “Human L1 Retrotransposon Encodes a Conserved Endonuclease Required for Retrotransposition”. In: Cell 87.5, pp. 905–916. ISSN: 0092-8674. DOI: 10.1016/S0092-8674(00)81997-2. URL: http://www.sciencedirect.com/science/article/pii/S0092867400819972 (visited on 08/15/2019).

Feschotte, Cédric (2008). “Transposable elements and the evolution of regulatory networks”. In: Nature Reviews Genetics 9.5, pp. 397–405. ISSN: 1471-0064. DOI: 10.1038/nrg2337. URL: https://www.nature.com/articles/nrg2337 (visited on 08/12/2019).

Finnegan, D. J. (1989). “Eukaryotic transposable elements and genome evolution”. In: Trends in genetics: TIG 5.4, pp. 103–107. ISSN: 0168-9525.

Frietze, Seth and Peggy J. Farnham (2011). “Transcription Factor Effector Domains”. In: A Handbook of Transcription Factors. Ed. by Timothy R. Hughes. Subcellular Biochemistry. Dordrecht: Springer Netherlands, pp. 261–277. ISBN: 978-90-481-9069-0. DOI: 10.1007/978-90-481-9069-0_12. URL: https://doi.org/10.1007/978-90-481-9069-0_12 (visited on 08/15/2019).

Fuda, Nicholas J., M. Behfar Ardehali, and John T. Lis (2009). “Defining mechanisms that regulate RNA polymerase II transcription in vivo”. In: Nature 461, pp. 186–192. ISSN: 1476-4687. DOI: 10.1038/nature08449. URL: https://www.nature.com/articles/nature08449 (visited on 08/15/2019).

Galas, D J and A Schmitz (1978). “DNAse footprinting: a simple method for the detection of protein-DNA binding specificity.” In: Nucleic Acids Research 5.9, pp. 3157–3170. ISSN: 0305-1048. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC342238/ (visited on 08/15/2019).

Gao, Ye et al. (2018). “Systematic discovery of uncharacterized transcription factors in Escherichia coli K-12 MG1655”. In: Nucleic Acids Research 46.20, pp. 10682–10696. ISSN: 0305-1048. DOI: 10.1093/nar/gky752. URL: https://academic.oup.com/nar/article/46/20/10682/5078243 (visited on 08/15/2019).

Garner, Mark M. and Arnold Revzin (1981). “A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system”. In: Nucleic Acids Research 9.13, pp. 3047–3060. ISSN: 0305-1048. DOI: 10.1093/nar/9.13.3047. URL: https://academic.oup.com/nar/article/9/13/3047/1056440 (visited on 08/15/2019).

Hall, David A. et al. (2004). “Regulation of Gene Expression by a Metabolic Enzyme”. In: Science 306.5695, pp. 482–484. ISSN: 0036-8075, 1095-9203. DOI: 10.1126/science.1096773. URL: https://science.sciencemag.org/content/306/5695/482 (visited on 08/15/2019).

Hamdi, Hamdi et al. (2000). “Alu-mediated phylogenetic novelties in gene regulation and development”. In: Journal of molecular biology 299, pp. 931–9. DOI: 10.1006/jmbi.2000.3795.

Hancks, Dustin C. and Haig H. Kazazian (2016). “Roles for retrotransposon insertions in human disease”. In: Mobile DNA 7.1. ISSN: 1759-8753. DOI: 10.1186/s13100-016-0065-9. URL: http://mobilednajournal.biomedcentral.com/articles/10.1186/s13100-016-0065-9 (visited on 08/12/2019).

Harrow, Jennifer et al. (2012). “GENCODE: the reference human genome annotation for The ENCODE Project”. In: Genome Research 22.9, pp. 1760–1774. ISSN: 1549-5469. DOI: 10.1101/gr.135350.111.

Havecker, Ericka R, Xiang Gao, and Daniel F Voytas (2004). “The diversity of LTR retrotransposons”. In: Genome Biology 5.6, p. 225. ISSN: 1465-6906. DOI: 10.1186/gb-2004-5-6-225. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC463057/ (visited on 11/26/2018).

Henikoff, Steven and Srinivas Ramachandran (2018). “Pioneers Invade the Nucleosomal Landscape”. In: Molecular Cell 71.2, pp. 193–194. ISSN: 1097-2765. DOI: 10.1016/j.molcel.2018.07.004. URL: http://www.sciencedirect.com/science/article/pii/S1097276518305513 (visited on 08/14/2019).

Hertz, G. Z. and G. D. Stormo (1999). “Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.” In: Bioinformatics 15.7, pp. 563–577. ISSN: 1367-4803. DOI: 10.1093/bioinformatics/15.7.563. URL: https://academic.oup.com/bioinformatics/article/15/7/563/278226 (visited on 08/15/2019).

Hickey, D. A. (1982). “Selfish DNA: a sexually-transmitted nuclear parasite”. In: Genetics 101.3, pp. 519–531. ISSN: 0016-6731.

Holton, Nicholas J. et al. (2001). “An active retrotransposon in Candida albicans”. In: Nucleic Acids Research 29.19, pp. 4014–4024. ISSN: 1362-4962, 0305-1048. DOI: 10.1093/nar/29.19.4014. URL: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/29.19.4014 (visited on 08/14/2019).

Hu, Zihua and Steven M Gallo (2010). “Identification of interacting transcription factors regulating tissue gene expression in human”. In: BMC Genomics 11.1, p. 49. ISSN: 1471-2164. DOI: 10.1186/1471-2164-11-49. URL: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-11-49 (visited on 08/13/2019).

Huntley, Rachael P et al. (2014). “Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt”. In: GigaScience 3, p. 4. ISSN: 2047-217X. DOI: 10.1186/2047-217X-3-4. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3995153/ (visited on 08/16/2019).

Hurst, Gregory D. D. and John H. Werren (2001). “The role of selfish genetic elements in eukaryotic evolution”. In: Nature Reviews Genetics 2.8, pp. 597–606. ISSN: 1471-0064. DOI: 10.1038/35084545. URL: https://www.nature.com/articles/35084545 (visited on 08/12/2019).

Ito, Jumpei et al. (2017). “Systematic identification and characterization of regulatory elements derived from human endogenous retroviruses”. In: PLoS Genetics 13.7. ISSN: 1553-7390. DOI: 10.1371/journal.pgen.1006883. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5529029/ (visited on 09/17/2018).

Iwafuchi-Doi, Makiko et al. (2016). “The Pioneer Transcription Factor FoxA Maintains an Accessible Nucleosome Configuration at Enhancers for Tissue-Specific Gene Activation”. In: Molecular Cell 62.1, pp. 79–91. ISSN: 1097-2765. DOI: 10.1016/j.molcel.2016.03.001. URL: http://www.sciencedirect.com/science/article/pii/S1097276516001799 (visited on 08/15/2019).

Jacques, Pierre-Étienne, Justin Jeyakani, and Guillaume Bourque (2013). “The Majority of Primate-Specific Regulatory Sequences Are Derived from Transposable Elements”. In: PLOS Genetics 9.5, e1003504. ISSN: 1553-7404. DOI: 10.1371/journal.pgen.1003504. URL: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003504 (visited on 08/12/2019).

Johnson, David S. et al. (2007). “Genome-Wide Mapping of in Vivo Protein-DNA Interactions”. In: Science 316.5830, pp. 1497–1502. ISSN: 0036-8075, 1095-9203. DOI: 10.1126/science.1141319. URL: https://science.sciencemag.org/content/316/5830/1497 (visited on 08/16/2019).

Jolma, Arttu et al. (2013). “DNA-binding specificities of human transcription factors”. In: Cell 152.1, pp. 327–339. ISSN: 1097-4172. DOI: 10.1016/j.cell.2012.12.009.

Jordan, I. King et al. (2003). “Origin of a substantial fraction of human regulatory sequences from transposable elements”. In: Trends in Genetics 19.2, pp. 68–72. ISSN: 0168-9525. DOI: 10.1016/S0168-9525(02)00006-9. URL: http://www.sciencedirect.com/science/article/pii/S0168952502000069 (visited on 07/26/2018).

Jurka, Jerzy (1997). “Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons”. In: Proceedings of the National Academy of Sciences 94.5, pp. 1872–1877. ISSN: 0027-8424, 1091-6490. DOI: 10.1073/pnas.94.5.1872. URL: https://www.pnas.org/content/94/5/1872 (visited on 08/15/2019).

Kajikawa, Masaki and Norihiro Okada (2002). “LINEs Mobilize SINEs in the Eel through a Shared 3′ Sequence”. In: Cell 111.3, pp. 433–444. ISSN: 0092-8674, 1097-4172. DOI: 10.1016/S0092-8674(02)01041-3. URL: https://www.cell.com/cell/abstract/S0092-8674(02)01041-3 (visited on 08/15/2019).

Kanehisa, Minoru et al. (2017). “KEGG: new perspectives on genomes, pathways, diseases and drugs”. In: Nucleic Acids Research 45 (D1), pp. D353–D361. ISSN: 0305-1048. DOI: 10.1093/nar/gkw1092. URL: https://academic.oup.com/nar/article/45/D1/D353/2605697 (visited on 08/12/2019).

Kapitonov, Vladimir V. and Jerzy Jurka (2003). “A novel class of SINE elements derived from 5S rRNA”. In: Molecular Biology and Evolution 20.5, pp. 694–702. ISSN: 0737-4038. DOI: 10.1093/molbev/msg075.

Kazazian, Haig H. and John V. Moran (1998). “The impact of L1 retrotransposons on the human genome”. In: Nature Genetics 19.1, pp. 19–24. ISSN: 1061-4036, 1546-1718. DOI: 10.1038/ng0598-19. URL: http://www.nature.com/articles/ng0598-19 (visited on 08/13/2019).

Khan, Aziz et al. (2018). “JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework”. In: Nucleic Acids Research 46 (Database issue), pp. D260–D266. ISSN: 0305-1048. DOI: 10.1093/nar/gkx1126. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753243/ (visited on 08/03/2019).

Kim, Tae Hoon et al. (2007). “Analysis of the vertebrate insulator protein CTCF binding sites in the human genome”. In: Cell 128.6, pp. 1231–1245. ISSN: 0092-8674. DOI: 10.1016/j.cell.2006.12.048. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2572726/ (visited on 12/20/2018).

King, M. C. and A. C. Wilson (1975). “Evolution at two levels in humans and chimpanzees”. In: Science 188.4184, pp. 107–116. ISSN: 0036-8075, 1095-9203. DOI: 10.1126/science.1090005. URL: https://science.sciencemag.org/content/188/4184/107 (visited on 08/16/2019).

Klein, Savannah J. and Rachel J. O’Neill (2018). “Transposable elements: genome innovation, chromosome diversity, and centromere conflict”. In: Chromosome Research 26.1, pp. 5–23. ISSN: 1573-6849. DOI: 10.1007/s10577-017-9569-5. URL: https://doi.org/10.1007/s10577-017-9569-5 (visited on 08/13/2019).

Koning, A. P. Jason de et al. (2011). “Repetitive Elements May Comprise Over Two-Thirds of the Human Genome”. In: PLoS Genetics 7.12. Ed. by Gregory P. Copenhaver, e1002384. ISSN: 1553-7404. DOI: 10.1371/journal.pgen.1002384. URL: https://dx.plos.org/10.1371/journal.pgen.1002384 (visited on 08/13/2019).

Korenberg, J. R. and M. C. Rykowski (1988). “Human genome organization: Alu, lines, and the molecular structure of metaphase chromosome bands”. In: Cell 53.3, pp. 391–400. ISSN: 0092-8674. DOI: 10.1016/0092-8674(88)90159-6.

Kunarso, Galih et al. (2010). “Transposable elements have rewired the core regulatory network of human embryonic stem cells”. In: Nature Genetics 42.7, pp. 631–634. ISSN: 1546-1718. DOI: 10.1038/ng.600. URL: https://www.nature.com/articles/ng.600 (visited on 08/12/2019).

Kwak, Hojoong and John T. Lis (2013). “Control of Transcriptional Elongation”. In: Annual review of genetics 47, pp. 483–508. ISSN: 0066-4197. DOI: 10.1146/annurev-genet-110711-155440.
URL: https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC3974797/ (visited on 08/15/2019). Labuda, D. et al. (1991). “Evolution of mouse B1 repeats: 7SL RNA folding pattern conserved”. In: Journal of Molecular Evolution 32.5, pp. 405–414. ISSN: 0022-2844. Lagemaat, Louie N. van de et al. (2003). “Transposable elements in mammals pro- mote regulatory variation and diversification of genes with specialized func- tions”. In: Trends in Genetics 19.10, pp. 530–536. ISSN: 0168-9525. DOI: 10.1016/ j.tig.2003.08.004. URL: http://www.sciencedirect.com/science/article/ pii/S0168952503002312 (visited on 08/12/2019). Lambert, Samuel A. et al. (2018). “The Human Transcription Factors”. In: Cell 172.4, pp. 650–665. ISSN: 0092-8674. DOI: 10.1016/j.cell.2018.01.029. URL: http: //www.sciencedirect.com/science/article/pii/S0092867418301065 (visited on 08/14/2019). Lander, E. S. et al. (2001). “Initial sequencing and analysis of the human genome”. In: Nature 409.6822, pp. 860–921. ISSN: 0028-0836. DOI: 10.1038/35057062. BIBLIOGRAPHY 70

Lazarovici, Allan et al. (2013). “Probing DNA shape and methylation state on a ge- nomic scale with DNase I”. In: Proceedings of the National Academy of Sciences of the United States of America 110.16, pp. 6376–6381. ISSN: 1091-6490. DOI: 10.1073/ pnas.1216822110. Lee, Tong Ihn and Richard A. Young (2013). “Transcriptional Regulation and Its Misregulation in Disease”. In: Cell 152.6, pp. 1237–1251. ISSN: 0092-8674. DOI: 10.1016/j.cell.2013.02.014. URL: http://www.sciencedirect.com/science/ article/pii/S0092867413002031 (visited on 08/15/2019). Lelli, Katherine M., Matthew Slattery, and Richard S. Mann (2012). “Disentangling the Many Layers of Eukaryotic Transcriptional Regulation”. In: Annual Review of Genetics 46.1, pp. 43–68. DOI: 10 . 1146 / annurev - genet - 110711 - 155437. URL: https : / / doi . org / 10 . 1146 / annurev - genet - 110711 - 155437 (visited on 08/15/2019). Li, Bing, Michael Carey, and Jerry L. Workman (2007). “The Role of Chromatin dur- ing Transcription”. In: Cell 128.4, pp. 707–719. ISSN: 00928674. DOI: 10.1016/j. cell.2007.01.015. URL: https://linkinghub.elsevier.com/retrieve/pii/ S0092867407001092 (visited on 08/13/2019). Lisch, Damon (2013). “How important are transposons for plant evolution?” In: Na- ture Reviews Genetics 14.1, pp. 49–61. ISSN: 1471-0056, 1471-0064. DOI: 10.1038/ nrg3374. URL: http://www.nature.com/articles/nrg3374 (visited on 08/12/2019). Liu, Xiao et al. (2006). “Whole-genome comparison of Leu3 binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site selection”. In: Genome Research 16.12, pp. 1517–1528. ISSN: 1088-9051. DOI: 10.1101/gr.5655606. Lynch, Vincent J. et al. (2011). “Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals”. In: Nature Genetics 43.11, pp. 1154–1159. ISSN: 1546-1718. DOI: 10.1038/ng.917. Lynch, Vincent J. et al. (2015). 
“Ancient Transposable Elements Transformed the Uterine Regulatory Landscape and Transcriptome during the Evolution of Mam- malian Pregnancy”. In: Cell Reports 10.4, pp. 551–561. ISSN: 22111247. DOI: 10. 1016 / j . celrep . 2014 . 12 . 052. URL: https : / / linkinghub . elsevier . com / retrieve/pii/S221112471401105X (visited on 08/12/2019). Magnani, Luca, Jérôme Eeckhoute, and Mathieu Lupien (2011). “Pioneer factors: di- recting transcriptional regulators within the chromatin environment”. In: Trends in Genetics 27.11, pp. 465–474. ISSN: 0168-9525. DOI: 10.1016/j.tig.2011.07. 002. URL: http://www.sciencedirect.com/science/article/pii/S0168952511001107 (visited on 08/14/2019). Makalowski, Wojciech (1995). “SINEs as a genomic scrap yard: an essay on genomic evolution”. In: The impact of short interspersed elements (SINEs) on the host genome, pp. 81–104. Makałowski, Wojciech (2000). “Genomic scrap yard: how genomes utilize all that junk”. In: Gene 259.1, pp. 61–67. ISSN: 03781119. DOI: 10.1016/S0378-1119(00) BIBLIOGRAPHY 71

00436-4. URL: https://linkinghub.elsevier.com/retrieve/pii/S0378111900004364 (visited on 08/13/2019). Malamy, M. H., M. Fiandt, and W. Szybalski (1972). “Electron microscopy of polar insertions in the lac operon of Escherichia coli”. In: Molecular & general genetics: MGG 119.3, pp. 207–222. ISSN: 0026-8925. DOI: 10.1007/bf00333859. Mariño-Ramírez, L. et al. (2005). “Transposable elements donate lineage-specific reg- ulatory sequences to host genomes”. In: Cytogenetic and Genome Research 110.1, pp. 333–341. ISSN: 1424-8581, 1424-859X. DOI: 10.1159/000084965. URL: https: //www.karger.com/Article/FullText/84965 (visited on 08/14/2019). Mathelier, Anthony et al. (2016). “DNA shape features improve transcription factor binding site predictions in vivo”. In: Cell systems 3.3, 278–286.e4. ISSN: 2405-4712. DOI: 10.1016/j.cels.2016.07.001. URL: https://www.ncbi.nlm.nih.gov/ pmc/articles/PMC5042832/ (visited on 08/15/2019). Matthews, G. D. et al. (1997). “pCal, a highly unusual Ty1/copia retrotransposon from the pathogenic yeast Candida albicans”. In: Journal of Bacteriology 179.22, pp. 7118–7128. ISSN: 0021-9193. DOI: 10.1128/jb.179.22.7118-7128.1997. Matys, V. et al. (2006). “TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes”. In: Nucleic Acids Research 34 (Database issue), pp. D108–110. ISSN: 1362-4962. DOI: 10.1093/nar/gkj143. McClintock, B. (1950). “The origin and behavior of mutable loci in maize”. In: Pro- ceedings of the National Academy of Sciences 36.6, pp. 344–355. ISSN: 0027-8424, 1091-6490. DOI: 10.1073/pnas.36.6.344. URL: http://www.pnas.org/cgi/ doi/10.1073/pnas.36.6.344 (visited on 08/13/2019). Mcclintock, B. (1956). “Intranuclear systems controlling gene action and mutation”. In: Brookhaven Symposia in Biology 8, pp. 58–74. ISSN: 0068-2799. Medstrand, P. et al. (2005). “Impact of transposable elements on the evolution of mammalian gene regulation”. In: Cytogenetic and Genome Research 110.1, pp. 342– 352. 
ISSN: 1424-8581, 1424-859X. DOI: 10.1159/000084966. URL: https://www. karger.com/Article/FullText/84966 (visited on 08/12/2019). Mills, Ryan E. et al. (2007). “Which transposable elements are active in the human genome?” In: Trends in Genetics 23.4, pp. 183–191. ISSN: 0168-9525. DOI: 10.1016/ j.tig.2007.02.006. URL: http://www.sciencedirect.com/science/article/ pii/S0168952507000595 (visited on 07/26/2018). Molkentin, J D et al. (1996). “Mutational analysis of the DNA binding, dimerization, and transcriptional activation domains of MEF2C.” In: Molecular and Cellular Bi- ology 16.6, pp. 2627–2636. ISSN: 0270-7306. URL: https://www.ncbi.nlm.nih. gov/pmc/articles/PMC231253/ (visited on 08/12/2019). Muñoz-López, Martín and José L. García-Pérez (2010). “DNA Transposons: Nature and Applications in Genomics”. In: Current Genomics 11.2, pp. 115–128. ISSN: 1389-2029. DOI: 10.2174/138920210790886871. URL: https://www.ncbi.nlm. nih.gov/pmc/articles/PMC2874221/ (visited on 04/03/2019). BIBLIOGRAPHY 72

Okada, Norihiro et al. (1997). “SINEs and LINEs share common 3 sequences: a re- view”. In: Gene 205.1, pp. 229–243. ISSN: 03781119. DOI: 10.1016/S0378-1119(97) 00409-5. URL: https://linkinghub.elsevier.com/retrieve/pii/S0378111997004095 (visited on 08/13/2019). Orgel, L. E. and F. H. C. Crick (1980). “Selfish DNA: the ultimate parasite”. In: Nature 284.5757, pp. 604–607. ISSN: 1476-4687. DOI: 10 . 1038 / 284604a0. URL: https : //www.nature.com/articles/284604a0 (visited on 08/12/2019). Panne, Daniel (2008). “The enhanceosome”. In: Current Opinion in Structural Biol- ogy. Theory and simulation / Macromolecular assemblages 18.2, pp. 236–242. ISSN: 0959-440X. DOI: 10 . 1016 / j . sbi . 2007 . 12 . 002. URL: http : / / www . sciencedirect . com / science / article / pii / S0959440X07002023 (visited on 08/15/2019). Park, Peter J. (2009). “ChIP–seq: advantages and challenges of a maturing technol- ogy”. In: Nature Reviews Genetics 10.10, pp. 669–680. ISSN: 1471-0064. DOI: 10. 1038/nrg2641. URL: https://www.nature.com/articles/nrg2641 (visited on 08/16/2019). Pepke, Shirley, Barbara Wold, and Ali Mortazavi (2009). “Computation for ChIP- seq and RNA-seq studies”. In: Nature Methods 6.11, S22–32. ISSN: 1548-7105. DOI: 10.1038/nmeth.1371. Piskurek, Oliver et al. (2003). “Unique mammalian tRNA-derived repetitive ele- ments in dermopterans: the t-SINE family and its retrotransposition through multiple sources”. In: Molecular Biology and Evolution 20.10, pp. 1659–1668. ISSN: 0737-4038. DOI: 10.1093/molbev/msg187. Ptashne, Mark (2011). “Principles of a switch”. In: Nature Chemical Biology 7.8, pp. 484– 487. ISSN: 1552-4450, 1552-4469. DOI: 10.1038/nchembio.611. URL: http://www. nature.com/articles/nchembio.611 (visited on 08/16/2019). Raha, Debasish, Miyoung Hong, and Michael Snyder (2010). “ChIP-Seq: A Method for Global Identification of Regulatory Elements in the Genome”. In: Current Pro- tocols in Molecular Biology 91.1, pp. 21.19.1–21.19.14. ISSN: 1934-3647. DOI: 10 . 
1002/0471142727.mb2119s91. URL: https://currentprotocols.onlinelibrary. wiley.com/doi/abs/10.1002/0471142727.mb2119s91 (visited on 08/15/2019). Rahl, Peter B. and Richard A. Young (2014). “MYC and Transcription Elongation”. In: Cold Spring Harbor Perspectives in Medicine 4.1. ISSN: 2157-1422. DOI: 10.1101/ cshperspect.a020990. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC3869279/ (visited on 08/15/2019). Rohs, Remo et al. (2009). “The role of DNA shape in protein–DNA recognition”. In: Nature 461.7268, pp. 1248–1253. ISSN: 1476-4687. DOI: 10.1038/nature08473. URL: https://www.nature.com/articles/nature08473 (visited on 08/15/2019). Rosenfeld, Michael G., Victoria V. Lunyak, and Christopher K. Glass (2006). “Sensors and signals: a coactivator/corepressor/epigenetic code for integrating signal- dependent programs of transcriptional response”. In: Genes & Development 20.11, BIBLIOGRAPHY 73

pp. 1405–1428. ISSN: 0890-9369, 1549-5477. DOI: 10 . 1101 / gad . 1424806. URL: http://genesdev.cshlp.org/content/20/11/1405 (visited on 08/16/2019). Samee, Md Abul Hassan, Benoit G. Bruneau, and Katherine S. Pollard (2017). “Tran- scription factors recognize DNA shape without nucleotide recognition”. In: bioRxiv, p. 143677. DOI: 10.1101/143677. URL: https://www.biorxiv.org/content/10. 1101/143677v1 (visited on 08/15/2019). Sandelin, Albin et al. (2004). “JASPAR: an open-access database for eukaryotic tran- scription factor binding profiles”. In: Nucleic Acids Research 32 (Database issue), pp. D91–94. ISSN: 1362-4962. DOI: 10.1093/nar/gkh012. Schmitges, Frank W. et al. (2016). “Multiparameter functional diversity of human C2H2 zinc finger proteins”. In: Genome Research 26.12, pp. 1742–1752. ISSN: 1088- 9051, 1549-5469. DOI: 10.1101/gr.209643.116. URL: http://genome.cshlp. org/content/26/12/1742 (visited on 08/16/2019). Schmitt, Armin O. and Hanspeter Herzel (1997). “Estimating the Entropy of DNA Sequences”. In: Journal of Theoretical Biology 188.3, pp. 369–377. ISSN: 0022-5193. DOI: 10.1006/jtbi.1997.0493. URL: http://www.sciencedirect.com/science/ article/pii/S0022519397904938 (visited on 08/19/2019). Sherwood, Richard I. et al. (2014). “Discovery of directional and nondirectional pio- neer transcription factors by modeling DNase profile magnitude and shape”. In: Nature Biotechnology 32.2, pp. 171–178. ISSN: 1546-1696. DOI: 10.1038/nbt.2798. URL: https://www.nature.com/articles/nbt.2798 (visited on 08/14/2019). Simonti, Corinne N., Mihaela Pavliˇcev,and John A. Capra (2017). “Transposable Element Exaptation into Regulatory Regions Is Rare, Influenced by Evolution- ary Age, and Subject to Pleiotropic Constraints”. In: Molecular Biology and Evolu- tion 34.11, pp. 2856–2869. ISSN: 0737-4038. DOI: 10.1093/molbev/msx219. URL: https://academic.oup.com/mbe/article/34/11/2856/4082767 (visited on 01/22/2019). Sinnett, Daniel et al. (1992). 
“Alu RNA transcripts in human embryonal carcinoma cells: Model of post-transcriptional selection of master sequences”. In: Journal of Molecular Biology 226.3, pp. 689–706. ISSN: 0022-2836. DOI: 10 . 1016 / 0022 - 2836(92)90626-U. URL: http://www.sciencedirect.com/science/article/ pii/002228369290626U (visited on 08/15/2019). Slattery, Matthew et al. (2014). “Absence of a simple code: how transcription factors read the genome”. In: Trends in biochemical sciences 39.9, pp. 381–399. ISSN: 0968- 0004. DOI: 10.1016/j.tibs.2014.07.002. URL: https://www.ncbi.nlm.nih. gov/pmc/articles/PMC4149858/ (visited on 11/23/2018). Sloan, Cricket A. et al. (2016). “ENCODE data at the ENCODE portal”. In: Nu- cleic Acids Research 44 (D1), pp. D726–732. ISSN: 1362-4962. DOI: 10.1093/nar/ gkv1160. Sudmant, Peter H. et al. (2015). “An integrated map of structural variation in 2,504 human genomes”. In: Nature 526.7571, pp. 75–81. ISSN: 0028-0836. DOI: 10.1038/ BIBLIOGRAPHY 74

nature15394. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4617611/ (visited on 08/12/2019). Sun, Wei et al. (2014). “An Adaptive Transposable Element Insertion in the Regula- tory Region of the EO Gene in the Domesticated Silkworm, Bombyx mori”. In: Molecular Biology and Evolution 31.12, pp. 3302–3313. ISSN: 1537-1719, 0737-4038. DOI: 10.1093/molbev/msu261. URL: https://academic.oup.com/mbe/article- lookup/doi/10.1093/molbev/msu261 (visited on 08/12/2019). Sundaram, Vasavi et al. (2014). “Widespread contribution of transposable elements to the innovation of gene regulatory networks”. In: Genome Research 24.12, pp. 1963– 1976. ISSN: 1088-9051, 1549-5469. DOI: 10 . 1101 / gr . 168872 . 113. URL: http : / / genome . cshlp . org / lookup / doi / 10 . 1101 / gr . 168872 . 113 (visited on 08/12/2019). Sverdlov, Eugene D (1998). “Perpetually mobile footprints of ancient infections in human genome”. In: FEBS Letters 428.1, pp. 1–6. ISSN: 0014-5793. DOI: 10.1016/ S0014 - 5793(98 ) 00478 - 5. URL: http : / / www . sciencedirect . com / science / article/pii/S0014579398004785 (visited on 08/13/2019). Taylor, I C et al. (1991). “Facilitated binding of GAL4 and heat shock factor to nu- cleosomal templates: differential function of DNA-binding domains.” In: Genes & Development 5.7, pp. 1285–1298. ISSN: 0890-9369. DOI: 10.1101/gad.5.7.1285. URL: http://www.genesdev.org/cgi/doi/10.1101/gad.5.7.1285 (visited on 08/13/2019). Thornburg, Bartley G., Valer Gotea, and Wojciech Makałowski (2006). “Transpos- able elements as a significant source of transcription regulating signals”. In: Gene. Genome and RNA: Expression and Functions 365, pp. 104–110. ISSN: 0378-1119. DOI: 10.1016/j.gene.2005.09.036. URL: http://www.sciencedirect.com/ science/article/pii/S0378111905006530 (visited on 08/12/2019). Thurman, Robert E. et al. (2012). “The accessible chromatin landscape of the hu- man genome”. In: Nature 489.7414, pp. 75–82. ISSN: 0028-0836. DOI: 10.1038/ nature11232. 
URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3721348/ (visited on 08/16/2019). Tolinski, Michael (2009). “Chapter 15 - Crosslinking”. In: Additives for Polyolefins. Ed. by Michael Tolinski. Oxford: William Andrew Publishing, pp. 215–220. ISBN: 978- 0-8155-2051-1. DOI: 10.1016/B978-0-8155-2051-1.00015-7. URL: http://www. sciencedirect.com/science/article/pii/B9780815520511000157 (visited on 08/16/2019). Trizzino, Marco, Aurélie Kapusta, and Christopher D. Brown (2018). “Transposable elements generate regulatory novelty in a tissue-specific fashion”. In: BMC Ge- nomics 19.1, p. 468. ISSN: 1471-2164. DOI: 10.1186/s12864- 018- 4850- 3. URL: https://doi.org/10.1186/s12864-018-4850-3 (visited on 11/26/2018). Ullu, Elisabetta and Christian Tschudi (1984). “Alu sequences are processed 7SL RNA genes”. In: Nature 312.5990, pp. 171–172. ISSN: 1476-4687. DOI: 10.1038/ BIBLIOGRAPHY 75

312171a0. URL: https : / / www . nature . com / articles / 312171a0 (visited on 08/14/2019). Valouev, Anton et al. (2008). “Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data”. In: Nature Methods 5.9, pp. 829–834. ISSN: 1548- 7105. DOI: 10.1038/nmeth.1246. URL: https://www.nature.com/articles/ nmeth.1246 (visited on 08/15/2019). Vaquerizas, Juan M., Sarah A. Teichmann, and Nicholas M. Luscombe (2012). “How Do You Find Transcription Factors? Computational Approaches to Compile and Annotate Repertoires of Regulators for Any Genome”. In: Gene Regulatory Net- works: Methods and Protocols. Ed. by Bart Deplancke and Nele Gheldof. Methods in Molecular Biology. Totowa, NJ: Humana Press, pp. 3–19. ISBN: 978-1-61779- 292-2. DOI: 10.1007/978-1-61779-292-2_1. URL: https://doi.org/10.1007/ 978-1-61779-292-2_1 (visited on 08/15/2019). Venter, J. C. et al. (2001). “The sequence of the human genome”. In: Science (New York, N.Y.) 291.5507, pp. 1304–1351. ISSN: 0036-8075. DOI: 10.1126/science.1058040. Viollet, Sébastien, Clément Monot, and Gaël Cristofari (2014). “L1 retrotransposi- tion”. In: Mobile Genetic Elements 4. ISSN: 2159-2543. DOI: 10.4161/mge.28907. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4014453/ (visited on 04/03/2019). Wallace, Margaret R. et al. (1991). “A de novo Alu insertion results in neurofibro- matosis type 1”. In: Nature 353.6347, pp. 864–866. ISSN: 1476-4687. DOI: 10.1038/ 353864a0. URL: https : / / www . nature . com / articles / 353864a0 (visited on 08/14/2019). Wang, Jing et al. (2017). “WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit”. In: Nucleic Acids Research 45 (W1), W130–W137. ISSN: 1362-4962. DOI: 10.1093/nar/gkx356. Waring, M. and R. J. Britten (1966). “Nucleotide Sequence Repetition: A Rapidly Reassociating Fraction of Mouse DNA”. In: Science 154.3750, pp. 791–794. ISSN: 0036-8075, 1095-9203. DOI: 10 . 1126 / science . 154 . 3750 . 791. 
URL: http : / / www.sciencemag.org/cgi/doi/10.1126/science.154.3750.791 (visited on 08/13/2019). Wasson, Todd and Alexander J. Hartemink (2009). “An ensemble model of compet- itive multi-factor binding of the genome”. In: Genome Research 19.11, pp. 2101– 2112. ISSN: 1549-5469. DOI: 10.1101/gr.093450.109. Wicker, Thomas et al. (2007). “A unified classification system for eukaryotic trans- posable elements”. In: Nature Reviews Genetics 8.12, pp. 973–982. ISSN: 1471-0064. DOI: 10 . 1038 / nrg2165. URL: https : / / www . nature . com / articles / nrg2165 (visited on 08/13/2019). Wingender, Edgar et al. (2018). “TFClass: expanding the classification of human transcription factors to their mammalian orthologs”. In: Nucleic Acids Research 46 (D1), pp. D343–D347. ISSN: 1362-4962. DOI: 10.1093/nar/gkx987. BIBLIOGRAPHY 76

Wong, K. et al. (2016). “A Comparison Study for DNA Motif Modeling on Pro- tein Binding Microarray”. In: IEEE/ACM Transactions on and Bioinformatics 13.2, pp. 261–271. ISSN: 1545-5963. DOI: 10.1109/TCBB.2015. 2443782. Zang, Chongzhi et al. (2009). “A clustering approach for identification of enriched domains from histone modification ChIP-Seq data”. In: Bioinformatics 25.15, pp. 1952– 1958. ISSN: 1367-4803. DOI: 10.1093/bioinformatics/btp340. URL: https:// academic.oup.com/bioinformatics/article/25/15/1952/212783 (visited on 08/15/2019). Zerbino, Daniel R. et al. (2018). “Ensembl 2018”. In: Nucleic Acids Research 46 (D1), pp. D754–D761. ISSN: 0305-1048. DOI: 10.1093/nar/gkx1098. URL: https:// academic.oup.com/nar/article/46/D1/D754/4634002 (visited on 08/04/2019). Zhang, Lulu et al. (2014). “The structure and retrotransposition mechanism of LTR- retrotransposons in the asexual yeast Candida albicans”. In: Virulence 5.6, pp. 655– 664. ISSN: 2150-5608. DOI: 10.4161/viru.32180. Zhu, Jiang et al. (2013). “Genome-wide Chromatin State Transitions Associated with Developmental and Environmental Cues”. In: Cell 152.3, pp. 642–654. ISSN: 0092- 8674. DOI: 10.1016/j.cell.2012.12.033. URL: http://www.sciencedirect. com/science/article/pii/S0092867412015553 (visited on 08/15/2019).