Identifying Recurrent Patterns of Chromatin Modifications at Regulatory Regions on Genome
DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Nan Meng
Graduate Program in Computer Science and Engineering
The Ohio State University
2015

Dissertation Committee:
Dr. Raghu Machiraju, Adviser
Dr. Kun Huang, Co-adviser
Dr. Yuejie Chi

Copyright by Nan Meng 2015

Abstract

Epigenetics is an important regulatory layer of DNA transcription in cells. Post-translational modification is one of the most studied aspects of epigenetics. The distribution of chromatin modifications at known regulatory regions could provide invaluable information on the epigenetic regulation of DNA transcription.

First, a pipeline was developed to quantitatively identify and analyze distinct recurrent distribution patterns of a single chromatin modification at regulatory regions. The clustering-based method identified several recurrent patterns in cell lines from different tissue types. One particular pattern identified at promoter regions indicates activation of gene transcription. Further investigation shows that genes displaying this pattern play important roles in cell development and differentiation. Later the study was extended to regulatory regions for long non-coding RNAs (lncRNAs). This study shows that similar epigenetic regulatory functions regulate lncRNAs as well.

Then a computational framework was developed to study combinatorial patterns of multiple chromatin modifications. Results show that recurrent combinatorial patterns provide more insight into subtle details of epigenetic regulation. However, it is computationally challenging and unnecessary to include all chromatin modifications, as some of their distributions may be correlated. Our framework provides an efficient way to select a subset of chromatin modifications that best represents the recurrent combinatorial patterns.
In this framework, the high-dimensional distribution data of chromatin modifications are considered to be embedded in several local low-dimensional subspaces. The distribution of a chromatin modification can be expressed as a linear combination of the others in the same subspace. Hence, multiple chromatin modifications residing in the same subspace can be represented by a single representative. Recurrent combinatorial patterns of all the selected representatives are then identified and carefully studied. The study shows that different recurrent combinatorial patterns are associated with different states of promoters. The previously unannotated promoters identified by these patterns are further analyzed and show strong associations with activation of DNA transcription. This framework also provides information on similarity among chromatin modifications, which could guide the selection of chromatin modifications for future experiments.

The proposed methodology could be used to reveal distinct patterns of chromatin modifications associated with various epigenetic regulation mechanisms at various regulatory regions or across the whole human genome, depending on the user's choice of genomic resolution. It could also be adapted to analyze other types of genomic data or to generate interesting hypotheses for future investigation. These methods could be used in various ways to interpret the human genome and study epigenetic regulation.

Dedication

This document is dedicated to my family.

Acknowledgments

First, I would like to thank my adviser, Prof. Raghu Machiraju, for his unceasing guidance and support. Without his encouragement and help, I could not have completed my work. In addition to my adviser, I am very grateful to have worked with my co-adviser, Prof. Kun Huang. Many ideas were inspired by discussions with my two mentors, and it was a great honor to work with both of them. I am also very grateful to my committee members Prof. Yuejie Chi and Prof.
Paul Granello for their invaluable comments and questions regarding my research. Prof. Chi in particular has been a great role model for me since the first time I met her. I also want to thank Prof. Clark Anderson for his guidance and support during the early years of my PhD program. I will always be thankful to the people I work with in my lab: Chao Wang, Hao Ding, Qihang Li, Zhi Han, Jie Zhang, Brian Arand, Arunima Srivastava, Travis Johnson and Ross Donatelli. Furthermore, I am very grateful to have met some lifelong friends along the way: Jihyun Lee, Pei Zhang, Yufang Sun and James Gentry stood by my side through thick and thin for the past few years. I also want to send special thanks to the friends I knew before grad school, who accompanied me through this journey from near and far: Qian Gao, Chuan Lu, Jinyan Guan, Wei Zhang and Hua Fu. Finally, I want to thank my parents for their unconditional love and support.

Vita

2009: B.S. Computer Science, Hebei University of Technology, China
2013: M.S. Computer Science and Engineering, The Ohio State University
2014 to present: Graduate Teaching Associate, Department of Biomedical Informatics, The Ohio State University

Publications

Meng N, Machiraju R, Huang K. Identify Critical Genes in Development with Consistent H3K4me2 Patterns across Multiple Tissues. IEEE/ACM Trans Comput Biol Bioinforma 2015, PP:1–1.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
Chapter 1: Introduction
1.1 Investigate epigenetic regulatory regions with recurrent patterns of chromatin modifications
1.2 From regulatory regions to recurrent patterns of chromatin modifications
1.3 Thesis Statement
1.4 Outline of Solution
1.5 Contributions
Chapter 2: Related Work
2.1 Data collection and Challenges
2.2 Methods and tools
2.3 Sparse subspace clustering and combinatorial patterns of chromatin modifications
2.4 From past to present
Chapter 3: Identify Recurrent Patterns of H3K4me2 at promoters of Critical Developmental Genes across Multiple Tissues
3.1 Introduction
3.2 Material and Methods
3.3 Results
3.4 Discussion and Summary
Chapter 4: Identify recurrent patterns of H3K4me2 at promoters of Critical lncRNAs across Multiple Tissues
4.1 Background
4.2 Methods
4.3 Results
4.4 Summary
Chapter 5: Identify Combinatorial Recurrent Patterns of Chromatin Modifications at Promoters of Different States
5.1 Introduction
5.2 Method