Transposable Elements Significantly Contributed to the Core
WESTFÄLISCHE WILHELMS-UNIVERSITÄT MÜNSTER

MASTER THESIS

Transposable Elements Significantly Contributed to the Core Promoters in the Human Genome

Author: Marten KELLNER
First Examiner: Dr. Francesco CATANIA
Supervisor/Second Examiner: Prof. Dr. Wojciech MAKAŁOWSKI

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in the Comparative Genomics Group, Institute of Bioinformatics, WWU Münster

August 20, 2019

Declaration of Academic Integrity

I, Marten KELLNER, declare that this thesis titled, “Transposable Elements Significantly Contributed to the Core Promoters in the Human Genome” and the work presented is solely my own work and that I have used no sources or aids other than the ones stated. All passages in my thesis for which other sources, including electronic media, have been used, be it direct quotes or content references, have been acknowledged as such and the sources cited.

Signed:
Date:

I agree to have my thesis checked in order to rule out potential similarities with other works and to have my thesis stored in a database for this purpose.

Signed:
Date:

“The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom.”
Isaac Asimov

Abstract

Faculty of Biology
Institute of Bioinformatics, WWU Münster
Master of Science

Transposable Elements Significantly Contributed to the Core Promoters in the Human Genome
by Marten KELLNER

Transposable elements (TEs) are major components of the human genome, constituting at least half of it. More than half a century ago, Barbara McClintock, and later Roy Britten and Eric Davidson, postulated that TEs might be major players in host gene regulation. In this study, a large amount of data produced by the ENCODE project was scanned for active transcription factor binding sites (TFBSs) located in TE-originated parts of polymerase II promoters.
In total, more than 35,000 promoters in six different tissues were analyzed, and over 26,000 of them harbored TEs. Moreover, these TEs usually provide one or more TFBSs to the host promoters; as a result, more than 6% of the active TFBSs in promoters are located in TE-originated sequences. Rewiring of transcription circuits played a significant role in mammalian evolution and consequently increased the functional and morphological diversity of mammals. This large-scale analysis demonstrated that TEs contributed a large fraction of human TFBSs. Interestingly, these TFBSs usually act in a tissue-specific manner. Many TFBSs carried by LINE and LTR elements into promoter regions became inactive, whereas potential TFBSs carried by SINE elements became active in promoter regions. Furthermore, we have shown that TE-originated TFBSs often influence transcription both positively and negatively, making their overall effect more neutral. Thus, our study clearly showed that TEs played a significant role in shaping expression patterns in mammals, and in humans in particular. Furthermore, since several TE families are still active in our genome, they continue to influence not only our genome architecture but also gene functioning in a broader sense.

Acknowledgements

First of all, I would like to express my deepest gratitude to my thesis advisor Prof. Dr. Wojciech Makalowski of the Institute of Bioinformatics in Münster. The door to Prof. Makalowski’s office was always open whenever I had a question about my research or writing. I am very grateful for his patience, his continued support, and his invaluable advice.

I would also like to acknowledge Dr. Francesco Catania of the Institute of Evolutionary Biodiversity at the Westfälische Wilhelms-Universität in Münster for his great support.

In addition, I would like to thank my colleagues from the Institute of Bioinformatics, who always had an open ear for me and my questions.
Thank you for supporting me with your inspiring discussions. They accepted me with open arms and created a friendly and supportive work environment.

Finally, I must express my very profound gratitude to my family and to my girlfriend for providing me with unfailing support and encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.

Contents

Declaration of Academic Integrity
Abstract
Acknowledgements
1 Introduction
  1.1 History of transposable elements
  1.2 Classification and duplication of transposons
  1.3 Transcription factors and methods to discover binding sites
  1.4 Transposons and their involvement with transcription factor binding sites
2 Materials and Methods
  2.1 Materials
    2.1.1 Programs
    2.1.2 Data Sources
  2.2 Analyses
    2.2.1 Data Preparation
    2.2.2 JASPAR Analysis of ENCODE Hits
    2.2.3 Genome Overview
    2.2.4 Tissue Comparison
    2.2.5 Analysis of Individual TFs
    2.2.6 Pathway Enrichment Analysis
3 Results
  3.1 Data selection
  3.2 TE distribution
  3.3 TFBSs located in TE-derived sequences
  3.4 Tissue specific TFBSs located in TE-derived sequences
    3.4.1 Influence of TFBSs on transcription in different genome regions
    3.4.2 Individual TFBSs
  3.5 Pathway analysis
    3.5.1 Pathway analysis of genes affected by TE-originated TFBSs
    3.5.2 Pathway analysis of promoters without any TEs
4 Discussion
A Supplementary - Figures
B Supplementary - Tables
Bibliography

List of Figures

1.1 Schematic representations of TF binding, ChIP-seq process, and ChIP-seq peak calling
2.1 Flowchart of the main analysis script
3.1 Violin-plot of the number of ENCODE entries for four TFs with the highest experiment amount
3.2 Density of TFBSs, TEs, and promoter regions on chromosome 18
3.3 Nucleotide Distribution of TE-families
3.4 Fraction of promoter area occupied by TE-originated and binding site associated sequences
3.5 Fraction of promoter area occupied by different families of TE-originated sequences
3.6 Distribution of TE derived sequences in pol II promoter regions
3.7 Distribution of TFBSs in TE derived sequences in pol II promoter regions
3.8 Distribution of TFBSs in different TE-families in promoter regions and out of promoter regions
3.9 FAMD of TFBSs of TE derived promoter sequences and TE sequences not in promoter
3.10 Pairwise comparison of TFBS’s uniqueness in TE derived promoter sequences
3.11 Pairwise comparison of TFBS’s uniqueness in promoter sequences not derived from TEs
3.12 Number of TFBSs with transcription related GO-Annotations in promoter without TEs, TEs not in promoter, and TEs in promoter regions
3.13 Fraction of shared TFBSs in TE derived promoter sequences of five tissues
3.14 Box-plots of fraction of active TFBSs against possible TFBSs in TEs in promoter and TEs not in promoter for five different TFs and TEs sub-families
3.15 WordCloud representation of pathway analysis from promoters with no TE derived sequences
A.1 Observed and expected numbers of TFBSs in TE-derived promoter sequences and TE sequences not in promoter
A.2 Graphical representation of the number of shared and unique TFBSs for TE derived promoter sequences in pairwise tissue comparison
A.3 Graphical representation of the number of shared and unique TFBSs for promoter regions without TEs in pairwise tissue comparison
A.4 Number of all possible motif positions for ENCODE entries from MEF2B and CTCF with FPR-threshold of α = 0.05 and 0.01

List of Tables

2.1 Number of available TFs experimentally analysed in different tissues
3.1 Percent of ENCODE entries with hits (α = 0.05) for TFBSs in range of ENCODE peak for different distances with 154 different TFs. For each ENCODE entry, either only one hit was counted per genome position (one per position) or hits were counted without this restriction (multiple per position).
3.2 Transposon distribution in different genomic regions
3.3 Human genes whose promoters almost completely originated in TEs
3.4 Number of TFs with GO-Annotations influencing transcription with chromatin strength, histone binding, transcription, or all sets
3.5 Number of TEs from the sub-families L1, L2, Alu, MIR, and ERV1 with ENCODE or JASPAR hits for each analysed TF
3.6 Pathways overrepresented in different tissue comparisons of TFBSs
B.1 Gene categories used in the study
B.2 Pathways enriched in the gene set whose promoters harbor TE-derived sequences
B.3 Pathways enriched in the gene set whose