Identifying Tissue Specific Distal Regulatory Sequences in the Mouse Genome
Identifying tissue specific distal regulatory sequences in the mouse genome

by

Julie Chih-yu Chen

A thesis submitted in conformity with the requirements for the degree of Master
Cell and Systems Biology
University of Toronto

© Copyright by Julie Chih-yu Chen 2011

Identifying tissue specific distal regulatory sequences in the mouse genome
Julie Chih-yu Chen
Master
Cell and Systems Biology
University of Toronto
2011

Abstract

Epigenetic modifications, transcription factor (TF) availability and chromatin conformation influence how a genome is interpreted by the transcriptional machinery responsible for gene expression. Enhancers buried in non-coding regions are associated with significant differences in histone marks between different cell types. In contrast, gene promoters show more uniform modifications across cell types. In this report, enhancer identification is first carried out using an enhancer associated feature in mouse erythroid cells. Taking advantage of public domain ChIP-Seq data sets in mouse embryonic stem cells, an integrative model is then used to assess features in enhancer prediction, and subsequently locate enhancers. Significant associations with multiple TF bound loci, higher expression in the closest genes, and active enhancer marks support functionality and tissue-specificity of these enhancers. Motif enrichment analysis further determines known and novel TFs regulating the target cell type. Furthermore, the features identified can facilitate more accurate enhancer prediction in other cell types.

“You cannot open a book without learning something.”
Confucius

Acknowledgments

My time at U of T has been a life changing experience. As a result of the environment and my commitment during this time, my desire in pursuing research has never been more evident. Much more than I set out to achieve, I have expanded my horizon and developed diverse sets of skills both in research and in personal growth. I would like to thank my supervisor, Dr.
Jennifer Mitchell, for the introduction to the research, the opportunity to learn wet lab techniques, the experience, and the frequent discussions that shaped the biological framework of the study. I am particularly grateful for her guidance on biological interpretations and her help in editing my NSERC application. I would also like to thank Dr. Quaid Morris for computational guidance in the mouse embryonic stem cell project, and for the discussions that enriched my research experience. I also thank both of my committee members, Dr. Sue Varmuza and Dr. Nicholas Provart, for their input on my study and all their help with the completion of my graduate study. I am especially thankful to Dr. Nicholas Provart for the thorough editing of my thesis.

I am very grateful to the government funding agencies, OGS and NSERC, for the awards that supported my research, and to the conference organizations, BioC2010 and CDB Symposium 2011, for the travel fellowships, which allowed me not only to present my posters, but also to interact with other researchers and broaden my scope of knowledge.

I would like to thank my lab mates, Anandi Bhattacharya and Mike Schwartz, and Jessica Yang and Dr. Yunchen Gong from the Guttman lab, for the discussions on biological and bioinformatics research, and for being there to brighten up the days. I also thank Dr. Paul Boutros for his final and very helpful input, and Dr. Ieuan Clay for the opportunity to participate in one of his projects.

Furthermore, a great deal of the bioinformatics and statistical skills applied in this thesis were acquired during periods when I was supervised by Dr. Chao A. Hsiung, Dr. I-Shou Chang and Dr. Von-Wun Soo. I am deeply grateful for their influence, and for the opportunities to learn and develop these skills.

Lastly, I would like to thank my parents and my best friend, Hui-yi, for the constant support and encouragement from the other end of the world throughout my graduate study.
I also thank my family, Andy, Alice, Cat and n, for their warm and loving support. I could not have reached this point without them. Finally, on a non-scientific note, I am grateful for the five positive messages in the fortune cookies, which had strong relevance to the stages of my life at the moments I received them.

Table of Contents

Acknowledgments
Table of Contents
Declaration
List of Abbreviations
List of Tables
List of Figures
List of Appendices

Chapter 1 Introduction
  1.1. One genome, multiple epigenomes and transcriptomes
  1.2. Distal regulatory elements: Enhancers
    1.2.1. Significance not to be overlooked
    1.2.2. Epigenetic states at enhancers in relation to tissue specificity
    1.2.3. Regulation of gene expression through chromatin looping
  1.3. Features predictive of enhancers
    1.3.1. Interaction of proteins and enhancers
    1.3.2. Histone modification states at enhancers
    1.3.3. Active and poised enhancers in embryonic stem cells
  1.4. Computational approaches relevant to enhancer identification
    1.4.1. Position specific matrices and comparative genomics
    1.4.2. Integrative modeling of ChIP-Seq data
  1.5. Thesis overview

Chapter 2 Methods
  2.1. Methods for enhancer identification in mouse erythroid cell
    2.1.1. Mapping of datasets to the mouse genome
    2.1.2. Enrichment of nucRNA-Seq and ChIP-Seq datasets
    2.1.3. Conservation, motif identification, and function annotation analyses
    2.1.4. Native ChIP-qPCR of H3K4me1
  2.2. Methods for enhancer identification in mouse ES cells
    2.2.1. Public datasets
    2.2.2. Data pre-processing
    2.2.3. Training data sets
    2.2.4. Feature combination assessment using Naive Bayes
    2.2.5. Feature extraction with lasso regularized multinomial logistic regression
    2.2.6. Absolute gene expression in mouse ES cell
    2.2.7. Gene Ontology functional enrichment analysis
    2.2.8. Association with multiple transcription factor bound loci
    2.2.9. Supervised motif analysis
    2.2.10. Comparison to other high throughput sequencing datasets

Chapter 3 Enhancer Identification in Erythroid Cells: A Biologically Directed Approach
  3.1. Introduction
  3.2. Results
    3.2.1. A closer look at the Hbb locus control region
    3.2.2. Identification of putative enhancers
    3.2.3. Overlap with transcription factors and conserved regions of the genome
    3.2.4. Multiple transcription factor peaks in proximity to putative enhancers
    3.2.5. H3K4me1 ChIP-qPCR results support putative