Computational Analysis of the Mammalian Cis-Regulatory Landscape a Dissertation Submitted to the Department of Computer Science

COMPUTATIONAL ANALYSIS OF THE MAMMALIAN CIS-REGULATORY LANDSCAPE A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Cory Yuen Fu McLean May 2011 © 2011 by Cory Yuen Fu McLean. All Rights Reserved. Re-distributed by Stanford University under license with the author. This work is licensed under a Creative Commons Attribution- Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/ This dissertation is online at: http://purl.stanford.edu/jh459nm5080 ii I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Gill Bejerano, Primary Adviser I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Serafim Batzoglou I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. David Kingsley Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives. iii Abstract Improvements in DNA sequencing technologies have made it possible to determine the genetic makeup of many organisms. Computational analyses of the massive amounts of sequence data available have produced many insights into evolutionary and developmental biology. For example, comparison of the full genome sequences of human and mouse discovered that the majority of functional sequence in the human genome does not code for protein. Much of this functional non-coding sequence appears to act in a regulatory role, dictating the precise tissues and developmental time points in which each protein should be produced. This dissertation describes three major contributions to the computational analysis of regulatory elements. First, I describe the Genomic Regions Enrichment of Annotations Tool (GREAT), a novel statistical method and associated web-based tool developed to infer the biological functions of regulatory elements based on the functions of their putative target genes. I demonstrate its marked improvement over current methods at interpreting functional enrichment signals for a variety of regulatory element types. Next, I discuss a computational methodology developed to identify medium- to large-scale (10-100,000 nucleotide) genomic deletions from whole genome sequences of multiple mammals. Using this methodology, I quantify the dispensability of highly conserved non-coding elements (CNEs) as their likelihood to be deleted in a subset of species. Despite their genomic prevalence and apparent redundancy in function, CNEs are very rarely lost in extant species. Even more surprisingly, there is a very weak relationship between dispensability and nucleotide conservation level. Sequences under purifying selection at moderate levels of nucleotide conservation are lost at a iv rate similar to those at perfect sequence conservation. Instead, evolutionary resis- tance to loss is more strongly correlated with depth of sequence homology, as ancient enhancers are more resistant to deletion than ones that arose more recently in evolution. Finally, I present the discovery and analysis of human-specific genomic deletions. By comparing the genome sequences of five species including human and our nearest ape relative, the chimpanzee, I identified 583 regions present in non-human species that contain highly-conserved sequence but are surprisingly deleted in humans. Sta- tistical analyses indicate that these deletions occur preferentially near steroid hormone receptor genes and brain-expressed genes that are known to inhibit proliferation. Ex- perimental results provide particular examples that may have contributed to unique human traits: the loss of an AR enhancer is correlated with the human loss of penile spines and sensory vibrissae, and the loss of a GADD45G enhancer is correlated with the human expansion of the cerebral cortex. v Acknowledgments Foremost, I would like to thank my advisor, Gill Bejerano, for all of his guidance and support during my graduate career. He introduced me to computational biology and played an active role in all stages of my research. His ideas and opinions often caused me to think in new ways, and my research usually improved because of them. Through our work together I grew a lot as a scientist and as a person. I would also like to thank everyone in the Bejerano lab: Saatvik Agarwal, Jenny Chen, Tisha Chung, Shoa Clarke, Andrew Doxey, Harendra Guturu, Michael Hiller, Jim Notwell, Bruce Schaar, Sushant Shankar, Geetu Tuteja, and Aaron Wenger. I appreciate their friendship and all of the interesting discussions we had. Three lab members had particularly strong influences on my graduate experience and warrant special mention. Aaron Wenger was an ideal collaborator and source of immense technical knowledge, and provided many useful critiques of GREAT in all stages of its development. Michael Hiller was a great colleague both for our brainstorming sessions to develop various genomic data analysis methodologies and for his compan- ionship on hikes in Yosemite National Park. Bruce Schaar was a source of many exciting discussions as well as a great collaborator with a knack for identifying good experiments and the tenacity to see them through. I feel very lucky to have been able to spend much of my graduate career in close collaboration with the Kingsley lab. David Kingsley is an inspiring scientist who provided many insights and ideas during our collaboration. He also took time to discuss important personal decisions with me. I am very grateful for his informal mentorship. Alex Pollen was a great friend whose endless enthusiasm I admire, Phil Reno and Terry Capellini were sources of ideas and advice, and Craig Lowe provided vi much technical knowledge about the UCSC codebase. The support of my friends and family has been very important to me during my entire time at Stanford. Mark Sellmyer was a great source of laughs and big con- versations. Vincent Chu, Eu-Jin Goh, and Justin Brockman were awesome climbing partners. Doug Allaire was someone I could count on to listen during rough patches. My parents and siblings George and Renée were constant sources of unflinching love, support, and encouragement. My wife Laura has always been there for me, in good times and bad. She is a source of inspiration to me both professionally and personally and without her this work would not have been possible. The results presented in Chapters 3–5 of this dissertation all derive from previously published articles: 1. Cory Y. McLean, Dave Bristor, Michael Hiller, Shoa L. Clarke, Bruce T. Schaar, Craig B. Lowe, Aaron M. Wenger and Gill Bejerano. GREAT improves functional interpretation of cis-regulatory regions. Nature Biotechnology, 28, 495– 501 (2010). 2. Cory McLean and Gill Bejerano. Dispensability of mammalian DNA. Genome Research, 18, 1743–1751 (2008). 3. Cory Y. McLean*, Philip L. Reno*, Alex A. Pollen*, Abraham I. Bassan, Ter- ence D. Capellini, Catherine Guenther, Vahan B. Indjeian, Xinhong Lim, Dou- glas B. Menke, Bruce T. Schaar, Aaron M. Wenger, Gill Bejerano, and David M. Kingsley. Human-specific loss of regulatory DNA and the evolution of human- specific traits. Nature, 472, 216–219 (2011). * Authors contributed equally. In the Chapter 3 work, Dave Bristor and Aaron Wenger designed and developed the web interface of the tool. Michael Hiller found and parsed 75% of the ontologies into a format usable by the tool. Shoa Clarke performed much of the analysis of the Serum Response Factor data. Bruce Schaar provided input on the analysis of the p300 data. I designed and developed the core calculation engine, processed the GO, InterPro, and Pathway Commons ontologies, analyzed and interpreted all data sets, and wrote the manuscript with my advisor Gill Bejerano. vii In the Chapter 4 work, I designed and implemented the methodologies used, performed the analyses, interpreted the results, and wrote the manuscript, all in concert with my advisor Gill Bejerano. In the Chapter 5 work, Philip Reno performed all experimental work associated with the AR hCONDEL. Alex Pollen performed all experimental work associated with the GADD45G hCONDEL. I designed and implemented the computational methodologies used and performed the computational analyses. Philip Reno, Alex Pollen, Gill Bejerano, David Kingsley, and I interpreted the results and wrote the manuscript. viii Contents Abstract iv Acknowledgments vi 1 Introduction 1 1.1 Organization . 2 2 Cis-regulatory elements 3 2.1 Background . 3 2.2 Cis-regulatory element identification . 4 2.3 Cis-regulatory element function . 5 2.4 Cis-regulatory element composition . 6 2.5 Cis-regulatory elements in disease . 6 2.6 Cis-regulatory elements in trait evolution . 8 3 Regulatory element functional analysis 9 3.1 Results . 11 3.1.1 A genomic region-based binomial test for long-range gene regulatory domains . 11 3.1.2 Comparison of enrichment tests and gene regulatory domain ranges . 14 3.1.3 Serum Response Factor binding in human Jurkat cells . 21 3.1.4 P300 binding in the developing mouse limbs . 25 3.1.5 P300 binding in the developing mouse forebrain and midbrain 30 ix 3.1.6 Evaluation of GREAT usage patterns . 32 3.2 Discussion . 36 3.3 Methods . 37 3.3.1 Gene set definition . 37 3.3.2 Association rules from genomic regions to genes . 38 3.3.3 Binomial test over genomic regions . 40 3.3.4 Hypergeometric test over genes . 41 3.3.5 Foreground/background hypergeometric test over genomic regions .

Computational Analysis of the Mammalian Cis-Regulatory Landscape a Dissertation Submitted to the Department of Computer Science

Anti-HABP2 Monoclonal Antibody, Clone FQS25662 (DCABH-7346) This Product Is for Research Use Only and Is Not Intended for Diagnostic Use

Nº Ref Uniprot Proteína Péptidos Identificados Por MS/MS 1 P01024

Genomic Analysis of Estrogen Cascade Reveals Histone Variant H2A.Z Associated with Breast Cancer Progression

Downloaded Per Proteome Cohort Via the Web- Site Links of Table 1, Also Providing Information on the Deposited Spectral Datasets

Characterisation of Genetic Regulatory Effects for Osteoporosis Risk Variants in Human Osteoclasts Benjamin H

Current Knowledge of Germline Genetic Risk Factors for the Development of Non-Medullary Thyroid Cancer

HABP2 Sirna (Human)

The Structure, Function and Evolution of the Extracellular Matrix: a Systems-Level Analysis

HABP2 Rabbit Polyclonal Antibody – TA335072 | Origene

Whole Genome Sequencing of Familial Non-Medullary Thyroid Cancer Identifies Germline Alterations in MAPK/ERK and PI3K/AKT Signaling Pathways

Peripheral Nerve Single-Cell Analysis Identifies Mesenchymal Ligands That Promote Axonal Growth

F IDENTIFICATION of AMYOTROPHIC LATERAL