Computational Modelling of Human Transcriptional Regulation by an Information Theory-Based Approach
Total Page:16
File Type:pdf, Size:1020Kb
Western University Scholarship@Western Electronic Thesis and Dissertation Repository 4-12-2018 2:00 PM Computational Modelling of Human Transcriptional Regulation by an Information Theory-based Approach Ruipeng Lu The University of Western Ontario Supervisor Rogan, Peter K. The University of Western Ontario Graduate Program in Computer Science A thesis submitted in partial fulfillment of the equirr ements for the degree in Doctor of Philosophy © Ruipeng Lu 2018 Follow this and additional works at: https://ir.lib.uwo.ca/etd Part of the Bioinformatics Commons, Biostatistics Commons, Computational Biology Commons, Congenital, Hereditary, and Neonatal Diseases and Abnormalities Commons, Genomics Commons, Microarrays Commons, Other Computer Sciences Commons, and the Statistical Methodology Commons Recommended Citation Lu, Ruipeng, "Computational Modelling of Human Transcriptional Regulation by an Information Theory- based Approach" (2018). Electronic Thesis and Dissertation Repository. 5305. https://ir.lib.uwo.ca/etd/5305 This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of Scholarship@Western. For more information, please contact [email protected]. Abstract ChIP-seq experiments can identify the genome-wide binding site motifs of a transcription factor (TF) and determine its sequence specificity. Multiple algorithms were developed to derive TF binding site (TFBS) motifs from ChIP-seq data, including the entropy minimization-based Bipad that can derive both contiguous and bipartite motifs. Prior studies applying these algorithms to ChIP-seq data only analyzed a small number of top peaks with the highest signal strengths, biasing their resultant position weight matrices (PWMs) towards consensus-like, strong binding sites; nor did they derive bipartite motifs, disabling the accurate modelling of binding behavior of dimeric TFs. This thesis presents a novel motif discovery pipeline by adding the recursive masking and thresholding functionalities to Bipad to improve detection of primary binding motifs. Analyzing 765 ENCODE ChIP-seq datasets with this pipeline generated contiguous and bipartite information theory-based PWMs (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs and revealed six high-confidence novel motifs. The accuracy of these iPWMs were determined via four independent validation methods, including detection of experimentally proven TFBSs, explanation of effects of characterized SNPs, comparison with previously published motifs and statistical analyses. Novel cofactor motifs supported previously unreported TF coregulatory interactions. This thesis further presents a unified framework to identify variants in hereditary breast and ovarian cancer (HBOC), successfully applying these iPWMs to prioritize TFBS variants in 20 complete genes of HBOC patients. The spatial distribution and information composition of cis-regulatory modules (e.g. TFBS clusters) in promoters substantially determine gene expression patterns and TF target genes. Multiple algorithms were developed to detect TFBS clusters, including the information density-based clustering (IDBC) algorithm that simultaneously considers the spatial and information densities of TFBSs. Prior studies predicting tissue-specific gene expression levels and differentially expressed (DE) TF targets used log likelihood ratios to quantify TFBS strengths and merged adjacent TFBSs into clusters. This thesis presents a machine learning framework that uses the Bray-Curtis function to quantify the similarity between tissue-wide expression profiles of genes, and IDBC-identified clusters from iPWM-detected TFBSs to predict gene expression profiles and DE direct TF targets. Multiple clusters enable gene expression to be robust against TFBS mutations. Keywords Transcription factor binding sites, Shannon information theory, position weight matrices, chromatin immunoprecipitation-sequencing, motif discovery, genetic variants, single nucleotide polymorphisms, mutation analyses, binding site clusters, transcription factor target genes, gene expression, machine learning, Bray-Curtis similarity ii Co-Authorship Statement Chapter 1 – Ruipeng Lu wrote the introduction and created all figures and tables. Peter Rogan provided helpful comments and feedback. Chapter 2 – Ruipeng Lu wrote the thesis overview. Peter Rogan provided helpful comments and feedback. Chapter 3 - Peter Rogan conceived of and directed the study. Eliseos Mucaki compared the resultant iPWMs with previous published motifs, and Ruipeng Lu performed other research and analyses. Ruipeng Lu, Eliseos Mucaki and Peter Rogan wrote the manuscript. Chapters 4 and 5 – Peter Rogan designed, coordinated, and supervised the study, which was motivated by discussions with Joan Knoll regarding prioritization of variants of uncertain significance. Eliseos Mucaki performed probe design and synthesis. Eliseos Mucaki and Natasha Caminsky performed sample preparation and sequencing. Eliseos Mucaki wrote software and performed bioinformatic analysis. Eliseos Mucaki, Natasha Caminsky, and Amy Perri conducted variant analysis and prioritization. Ruipeng Lu generated the TF iPWMs and Eliseos Mucaki generated the RNA-binding protein, splicing regulatory factor, and splicing iPWMs. Ami Perri confirmed prioritized variants by Sanger sequencing. In Chapter 3, Matthew Halvorsen and Alain Laederach conducted the SHAPE analysis. Eliseos Mucaki, Natasha Caminsky, Ami Perri, Joan Knoll, and Peter Rogan wrote the manuscript, which has been approved by all authors. Chapter 6 - Peter Rogan defined the objectives and directed the study. Ruipeng Lu and Peter Rogan devised the general machine learning framework. Ruipeng Lu implemented this framework and collected the results. Both Ruipeng Lu and Peter Rogan interpreted the results and wrote the manuscript. Chapter 7 – Ruipeng Lu wrote the discussion and created all tables and figures with helpful comments and feedback from Peter Rogan. iii Acknowledgments First, I would like to thank my supervisor, Dr. Peter Rogan. Four years ago, he gave me this precious opportunity to pursue a PhD degree, so that I was able to fulfill my dream to continue conducting academic research and making contributions to science. During the past four years, he has been wholeheartedly guiding, mentoring, supporting and encouraging me through my research projects. I learned a lot from his diligent research schedule, meticulously scholarly attitude and creatively divergent thinking. Without Dr. Rogan, I would not have anything I have now here. I would also like to thank Dr. Joan Knoll, for her continuous care and encouragement about my research progress. I would also like to thank Dr. Robert Mercer, Dr. Daniel Lizotte, Dr. Ilka Heinemann, and Dr. Wyeth Wasserman for examining my thesis. I would also like to thank Dr. Robert Mercer for examining my PhD Research Topics Survey/Proposal two years ago and the knowledge I learned from his Computational Linguistics course. I would also like to thank all of the past and present Rogan/Knoll Lab members. John taught me some basic bioinformatic knowledge after I joined the lab, and I have been happily collaborating with him and Ben on research projects. Li, Ben, Stephanie and Wahab selflessly helped me with my personal issues and research, and I also happily collaborated with Natasha and Amy during the past four years. Finally, I would like to thank my parents. iv Table of Contents Abstract .................................................................................................................................i Co-Authorship Statement................................................................................................... iii Acknowledgments...............................................................................................................iv Table of Contents ................................................................................................................. v List of Tables ......................................................................................................................xi List of Figures .................................................................................................................. xiii List of Appendices ............................................................................................................. xv Chapter 1 .............................................................................................................................. 1 1 Background ....................................................................................................................... 1 1.1 Transcription Factors ............................................................................................... 1 1.1.1 Determinants of Transcription Factor Binding ............................................ 3 1.1.2 Impacts of Transcription Factor Binding ..................................................... 6 1.2 Information Theory-based Position Weight Matrices .............................................. 8 1.2.1 Derivation of Information Theory-based Position Weight Matrices ........... 9 1.2.2 Computation of the Information Content of an Individual Binding Site ... 10 1.2.3 Relationship between 푅푖 Values and Thermodynamics ............................ 12 1.2.4 Sequence Logos ........................................................................................