Analysis, Visualization, and Machine Learning of Epigenomic Data

University of Massachusetts Medical School eScholarship@UMMS GSBS Dissertations and Theses Graduate School of Biomedical Sciences 2017-12-12 Analysis, Visualization, and Machine Learning of Epigenomic Data Michael J. Purcaro University of Massachusetts Medical School Let us know how access to this document benefits ou.y Follow this and additional works at: https://escholarship.umassmed.edu/gsbs_diss Part of the Computational Biology Commons, Genomics Commons, and the Integrative Biology Commons Repository Citation Purcaro MJ. (2017). Analysis, Visualization, and Machine Learning of Epigenomic Data. GSBS Dissertations and Theses. https://doi.org/10.13028/M23T1Q. Retrieved from https://escholarship.umassmed.edu/gsbs_diss/938 Creative Commons License This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License This material is brought to you by eScholarship@UMMS. It has been accepted for inclusion in GSBS Dissertations and Theses by an authorized administrator of eScholarship@UMMS. For more information, please contact [email protected]. ANALYSIS, VISUALIZATION, AND MACHINE LEARNING OF EPIGENOMIC DATA A Dissertation Presented By MICHAEL JOSEPH PURCARO Submitted to the Faculty of the University of Massachusetts Graduate School of Biomedical Sciences, Worcester in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY DECEMBER 12, 2017 BIOINFORMATICS AND COMPUTATIONAL BIOLOGY M.D., PH.D. PROGRAM I-ii ANALYSIS, VISUALIZATION, AND MACHINE LEARNING OF EPIGENOMIC DATA A Dissertation Presented By MICHAEL JOSEPH PURCARO This work was undertaken in the Graduate School of Biomedical Sciences Bioinformatics and Computational Biology Under the mentorship of ________________________________________________ Zhiping Weng, PhD, Thesis Advisor ________________________________________________ Elinor Karlsson, PhD, Member of Committee ________________________________________________ Manuel Garber, PhD, Member of Committee ________________________________________________ Robert Brewster, PhD, Member of Committee ________________________________________________ Ross Hardison, PhD, External Member of Committee ________________________________________________ Jeffrey Bailey, MD, PhD, Chair of Committee ________________________________________________ Anthony Carruthers, Ph.D., Dean of the Graduate School of Biomedical Sciences December 12, 2017 iii This work is dedicated to the most loving, understanding, and special woman in my world: my wife and best friend, Zemei. It is also dedicated to my parents, who have always encouraged me to pursue my dreams, despite the many, many years of training it has entailed. iv Acknowledgements Being involved for 4.5 years in an undertaking as complex as the ENCODE Consortium complicates writing acknowledgments. With so many moving pieces—data, metadata, code, papers, conference calls, and Google Docs—to deal with, ENCODE, and especially the Weng Lab, is a swirl of activity and ideas; keeping track of the intellectual and personal impact so many people have made on me would be a job in itself. Certain people, though, were particularly important in my journey: Arjan van der Velde, Nicholas Hathaway, Henry Pratt, Nathaniel Erskine, and Eugenio Mattei (in no particular order) were critical in my survival, allowing me to test ideas (about code, science, medicine, and life in general) and get honest, always entertaining feedback. I learned much from the intellectual prowess of our lab mates: Jill Moore, Tyler Borrman, Jack Huey, Sweta Vangaveti, Sowmya Iyer, Xiao-Ou Zhang, Shikui Tu, Junko Tsuhi, Jie Wang, Hao Chen, Thom Vreven, Yu Fu, Adam Wespiser, and Wei Wang. From the long suffering ENCODE DCC group, I would like to particularly thank "data heroes" Cricket Sloan, Kathrina Onate, and Jason Hilton for all their work in supporting us. From the IT department, I would like to thank David Plamondon, Lewis Robbins, and Charles Davidson for help and advice. I am very appreciative of mentoring and advice from Thomas Smith, who has helped put my work in a wider clinical perspective. I am also very appreciative of my TRAC committee—Jeffrey Bailey, Elinor Karlsson, Konstantin Zeldovich, and Robert Brewster—for helping me steer through this process, as well as the MD/PhD program for giving me the opportunity to undertake training here for a career as a physician scientist. v Life in the lab would come to a crashing halt without the endless support, advice, and motherly watchfulness from Barbara Bucciaglia, Heidi Beberman, Christine Tonevski, and Maureen Schulz. And, of course, none of this would have been possible without Zhiping Weng, whose endless generosity, curiosity, and intellectual insight have been deeply inspiring, and made the lab into an intellectual—and computational— playground. vi Abstract The goal of the Encyclopedia of DNA Elements (ENCODE) project has been to characterize all the functional elements of the human genome. These elements include expressed transcripts and genomic regions bound by transcription factors (TFs), occupied by nucleosomes, occupied by nucleosomes with modified histones, or hypersensitive to DNase I cleavage, etc. Chromatin Immunoprecipitation (ChIP-seq) is an experimental technique for detecting TF binding in living cells, and the genomic regions bound by TFs are called ChIP-seq peaks. ENCODE has performed and compiled results from tens of thousands of experiments, including ChIP-seq, DNase, RNA-seq and Hi-C. These efforts have culminated in two web-based resources from our lab— Factorbook and SCREEN—for the exploration of epigenomic data for both human and mouse. Factorbook is a peak-centric resource presenting data such as motif enrichment and histone modification profiles for transcription factor binding sites computed from ENCODE ChIP-seq data. SCREEN provides an encyclopedia of ~2 million regulatory elements, including promoters and enhancers, identified using ENCODE ChIP-seq and DNase data, with an extensive UI for searching and visualization. While we have successfully utilized the thousands of available ENCODE ChIP- seq experiments to build the Encyclopedia and visualizers, we have also struggled with the practical and theoretical inability to assay every possible experiment on every possible biosample under every conceivable biological scenario. We have used machine learning techniques to predict TF binding sites and enhancers location, and demonstrate machine learning is critical to help decipher functional regions of the genome. vii Table of Contents I. Chapter I: Introduction ........................................................................................ 2 II. Chapter II: Building and Visualizing an Encyclopedia of ENCODE candidate Regulatory Elements ........................................................ 9 II.1 Preface .......................................................................................................................................... 9 II.2 Summary ..................................................................................................................................... 11 II.3 Introduction ................................................................................................................................. 11 II.4 Results ......................................................................................................................................... 14 II.4.1 Summary of Encode Phase 3 Data Production .................................................................................. 14 II.4.2 The Encode Portal and Uniformly Processed Data ........................................................................... 18 II.4.3 The Encode Encyclopedia ................................................................................................................. 19 II.4.4 Encyclopedia Ground Level .............................................................................................................. 20 II.4.5 Encyclopedia Integrative Level ......................................................................................................... 21 II.4.6 The Registry of candidate Regulatory Elements ............................................................................... 23 II.4.7 Selection of cREs for the Registry .................................................................................................... 27 II.4.8 Comprehensiveness of the current Registry of cREs ........................................................................ 28 II.4.9 Classifying cREs in the Registry ....................................................................................................... 30 II.4.10 Relative abundance of cREs-PLS vs. cREs-ELS .............................................................................. 34 II.4.11 Comparison between cREs and the corresponding ChromHMM states............................................ 35 II.4.12 Cell and tissue type clustering ........................................................................................................... 35 II.5 SCREEN: A Web Engine for Searching and Visualizing cREs ................................................. 36 II.5.1 SCREEN Methods ............................................................................................................................ 40 II.5.2 SCREEN Usage ................................................................................................................................ 46 II.5.3 Testing SCREEN User Interface ......................................................................................................

Load more