COMPUTATIONAL ANALYSIS OF THE MAMMALIAN CIS-REGULATORY LANDSCAPE

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Cory Yuen Fu McLean May 2011

© 2011 by Cory Yuen Fu McLean. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution- Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/jh459nm5080

ii I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Gill Bejerano, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Serafim Batzoglou

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

David Kingsley

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

iii Abstract

Improvements in DNA sequencing technologies have made it possible to determine the genetic makeup of many organisms. Computational analyses of the massive amounts of sequence data available have produced many insights into evolutionary and devel- opmental biology. For example, comparison of the full genome sequences of human and mouse discovered that the majority of functional sequence in the does not code for . Much of this functional non-coding sequence appears to act in a regulatory role, dictating the precise tissues and developmental time points in which each protein should be produced. This dissertation describes three major contributions to the computational anal- ysis of regulatory elements. First, I describe the Genomic Regions Enrichment of Annotations Tool (GREAT), a novel statistical method and associated web-based tool developed to infer the biological functions of regulatory elements based on the functions of their putative target . I demonstrate its marked improvement over current methods at interpreting functional enrichment signals for a variety of regula- tory element types. Next, I discuss a computational methodology developed to identify medium- to large-scale (10-100,000 nucleotide) genomic deletions from whole genome sequences of multiple mammals. Using this methodology, I quantify the dispensability of highly conserved non-coding elements (CNEs) as their likelihood to be deleted in a subset of species. Despite their genomic prevalence and apparent redundancy in function, CNEs are very rarely lost in extant species. Even more surprisingly, there is a very weak relationship between dispensability and nucleotide conservation level. Sequences under purifying selection at moderate levels of nucleotide conservation are lost at a

iv rate similar to those at perfect sequence conservation. Instead, evolutionary resis- tance to loss is more strongly correlated with depth of , as ancient enhancers are more resistant to deletion than ones that arose more recently in evolu- tion. Finally, I present the discovery and analysis of human-specific genomic deletions. By comparing the genome sequences of five species including human and our nearest ape relative, the chimpanzee, I identified 583 regions present in non-human species that contain highly-conserved sequence but are surprisingly deleted in humans. Sta- tistical analyses indicate that these deletions occur preferentially near steroid hormone receptor genes and brain-expressed genes that are known to inhibit proliferation. Ex- perimental results provide particular examples that may have contributed to unique human traits: the loss of an AR enhancer is correlated with the human loss of penile spines and sensory vibrissae, and the loss of a GADD45G enhancer is correlated with the human expansion of the cerebral cortex.

v Acknowledgments

Foremost, I would like to thank my advisor, Gill Bejerano, for all of his guidance and support during my graduate career. He introduced me to computational biology and played an active role in all stages of my research. His ideas and opinions often caused me to think in new ways, and my research usually improved because of them. Through our work together I grew a lot as a scientist and as a person. I would also like to thank everyone in the Bejerano lab: Saatvik Agarwal, Jenny Chen, Tisha Chung, Shoa Clarke, Andrew Doxey, Harendra Guturu, Michael Hiller, Jim Notwell, Bruce Schaar, Sushant Shankar, Geetu Tuteja, and Aaron Wenger. I appreciate their friendship and all of the interesting discussions we had. Three lab members had particularly strong influences on my graduate experience and warrant special mention. Aaron Wenger was an ideal collaborator and source of immense technical knowledge, and provided many useful critiques of GREAT in all stages of its development. Michael Hiller was a great colleague both for our brainstorming sessions to develop various genomic data analysis methodologies and for his compan- ionship on hikes in Yosemite National Park. Bruce Schaar was a source of many exciting discussions as well as a great collaborator with a knack for identifying good experiments and the tenacity to see them through. I feel very lucky to have been able to spend much of my graduate career in close collaboration with the Kingsley lab. David Kingsley is an inspiring scientist who provided many insights and ideas during our collaboration. He also took time to discuss important personal decisions with me. I am very grateful for his informal mentorship. Alex Pollen was a great friend whose endless enthusiasm I admire, Phil Reno and Terry Capellini were sources of ideas and advice, and Craig Lowe provided

vi much technical knowledge about the UCSC codebase. The support of my friends and family has been very important to me during my entire time at Stanford. Mark Sellmyer was a great source of laughs and big con- versations. Vincent Chu, Eu-Jin Goh, and Justin Brockman were awesome climbing partners. Doug Allaire was someone I could count on to listen during rough patches. My parents and siblings George and Ren´ee were constant sources of unflinching love, support, and encouragement. My wife Laura has always been there for me, in good times and bad. She is a source of inspiration to me both professionally and personally and without her this work would not have been possible.

The results presented in Chapters 3–5 of this dissertation all derive from previously published articles: 1. Cory Y. McLean, Dave Bristor, Michael Hiller, Shoa L. Clarke, Bruce T. Schaar, Craig B. Lowe, Aaron M. Wenger and Gill Bejerano. GREAT improves func- tional interpretation of cis-regulatory regions. Nature Biotechnology, 28, 495– 501 (2010).

2. Cory McLean and Gill Bejerano. Dispensability of mammalian DNA. Genome Research, 18, 1743–1751 (2008).

3. Cory Y. McLean*, Philip L. Reno*, Alex A. Pollen*, Abraham I. Bassan, Ter- ence D. Capellini, Catherine Guenther, Vahan B. Indjeian, Xinhong Lim, Dou- glas B. Menke, Bruce T. Schaar, Aaron M. Wenger, Gill Bejerano, and David M. Kingsley. Human-specific loss of regulatory DNA and the evolution of human- specific traits. Nature, 472, 216–219 (2011). * Authors contributed equally. In the Chapter 3 work, Dave Bristor and Aaron Wenger designed and developed the web interface of the tool. Michael Hiller found and parsed 75% of the ontologies into a format usable by the tool. Shoa Clarke performed much of the analysis of the Serum Response Factor data. Bruce Schaar provided input on the analysis of the p300 data. I designed and developed the core calculation engine, processed the GO, InterPro, and Pathway Commons ontologies, analyzed and interpreted all data sets, and wrote the manuscript with my advisor Gill Bejerano.

vii In the Chapter 4 work, I designed and implemented the methodologies used, per- formed the analyses, interpreted the results, and wrote the manuscript, all in concert with my advisor Gill Bejerano. In the Chapter 5 work, Philip Reno performed all experimental work associated with the AR hCONDEL. Alex Pollen performed all experimental work associated with the GADD45G hCONDEL. I designed and implemented the computational method- ologies used and performed the computational analyses. Philip Reno, Alex Pollen, Gill Bejerano, David Kingsley, and I interpreted the results and wrote the manuscript.

viii Contents

Abstract iv

Acknowledgments vi

1 Introduction 1 1.1 Organization ...... 2

2 Cis-regulatory elements 3 2.1 Background ...... 3 2.2 Cis-regulatory element identification ...... 4 2.3 Cis-regulatory element function ...... 5 2.4 Cis-regulatory element composition ...... 6 2.5 Cis-regulatory elements in disease ...... 6 2.6 Cis-regulatory elements in trait evolution ...... 8

3 Regulatory element functional analysis 9 3.1 Results ...... 11 3.1.1 A genomic region-based binomial test for long-range reg- ulatory domains ...... 11 3.1.2 Comparison of enrichment tests and gene regulatory domain ranges ...... 14 3.1.3 Serum Response Factor binding in human Jurkat cells . . . . . 21 3.1.4 P300 binding in the developing mouse limbs ...... 25 3.1.5 P300 binding in the developing mouse forebrain and midbrain 30

ix 3.1.6 Evaluation of GREAT usage patterns ...... 32 3.2 Discussion ...... 36 3.3 Methods ...... 37 3.3.1 Gene set definition ...... 37 3.3.2 Association rules from genomic regions to genes ...... 38 3.3.3 Binomial test over genomic regions ...... 40 3.3.4 Hypergeometric test over genes ...... 41 3.3.5 Foreground/background hypergeometric test over genomic re- gions ...... 42

4 Dispensability of mammalian DNA 44 4.1 Results ...... 46 4.1.1 Ultraconserved-like element loss rates ...... 46 4.1.2 Dispensability of neutral DNA ...... 51 4.1.3 Rodent dispensability under extended branch lengths . . . . . 51 4.1.4 Dispensability of DNA under purifying selection ...... 55 4.1.5 Deep sequence conservation and dispensability ...... 57 4.1.6 Robustness of loss event predictions ...... 59 4.2 Discussion ...... 60 4.3 Methods ...... 64 4.3.1 Identifying rodent-specific losses of mammalian conserved DNA 64 4.3.2 Identifying primate-specific losses of mammalian conserved DNA 66 4.3.3 Deep sequence conservation and dispensability ...... 66

5 Human CNE losses and trait evolution 68 5.1 Results ...... 69 5.1.1 Computational identification of conserved sequences deleted in humans ...... 69 5.1.2 Experimental characterization of an AR enhancer deleted in humans ...... 72 5.1.3 Experimental characterization of a GADD45G enhancer deleted in humans ...... 75

x 5.2 Discussion ...... 77 5.3 Methods Summary ...... 78

6 Conclusion 79 6.1 Future directions ...... 81

A Supplementary material for Chapter 3 83 A.1 Ontologies supported ...... 83 A.1.1 ...... 84 A.1.2 Mouse Phenotype ...... 84 A.1.3 MSigDB Ontologies ...... 84 A.1.4 PANTHER Pathway ...... 85 A.1.5 Pathway Commons ...... 85 A.1.6 BioCyc Pathway ...... 86 A.1.7 MGI Expression: Detected & Not Detected ...... 86 A.1.8 Transcription Factor Targets ...... 86 A.1.9 miRNA Targets ...... 87 A.1.10 InterPro ...... 87 A.1.11 TreeFam ...... 87 A.1.12 HGNC Gene Families ...... 87 A.2 Additional ChIP-Seq analyses ...... 90 A.2.1 P300 in mouse embryonic stem cells ...... 91 A.2.2 Signal transducer and activator of transcription 3 (Stat3) in mouse embryonic stem cells ...... 92 A.2.3 Neuron-restrictive silencer factor (NRSF) in human Jurkat cells 94 A.2.4 GA-Binding Protein (GABP) in human Jurkat cells ...... 95 A.3 Supplementary Figures ...... 97 A.4 Supplementary Tables ...... 99

B Supplementary material for Chapter 5 142 B.1 Materials and methods ...... 142 B.1.1 Discovering Human-specific Deletions (hDELs) ...... 142

xi B.1.2 Finding chimpanzee sequences highly conserved in non-human species ...... 143 B.1.3 Identifying a set of Human Conserved Sequence Deletions (hCON- DELs) ...... 144 B.1.4 Data quality issues: false positive calls vs. rare deletion variants 145 B.1.5 In silico deletion validation and polymorphism analysis . . . . 146 B.1.6 PCR deletion validation and polymorphism analysis ...... 147 B.1.7 Neandertal allele identification of hCONDELs ...... 148 B.1.8 Simulation significance testing ...... 149 B.1.9 Exonic hCONDEL validation attempts ...... 151 B.1.10 Gene structure overlap ...... 151 B.1.11 Functional annotation enrichments of hCONDELs ...... 152 B.1.12 Syntenic 14-way mammal multiple alignments ...... 154 B.1.13 cCONDELs and mCONDELs ...... 155 B.1.14 Cell proliferation or cell migration suppressor gene enrichments 156 B.1.15 LacZ reporter assay ...... 157 B.1.16 In situ hybridization ...... 158 B.1.17 Cell culture assay ...... 158 B.2 Supplementary Figures ...... 160 B.3 Supplementary Tables ...... 167

Bibliography 185

xii List of Tables

3.1 Human and mouse ontologies supported by GREAT version 1.2.6 . . 15 3.2 GREAT version 1.2.6 ontology contents for human ...... 16 3.3 GREAT version 1.2.6 ontology contents for mouse ...... 17 3.4 GREAT parameters, filters, and options, and their effects ...... 18 3.5 Comparison of several enrichment tools ...... 20 3.6 Data sets analyzed by GREAT ...... 20 3.7 SRF gene-based ontology enrichments ...... 22 3.8 SRF 5+1 basal up to 1 Mb GREAT enrichments ...... 23 3.9 SRF GO enrichments highlight benefits of coupling binomial and hy- pergeometric tests ...... 24 3.10 p300 limb 5+1 basal up to 1 Mb GREAT enrichments ...... 27 3.11 p300 forebrain 5+1 basal up to 1 Mb GREAT enrichments ...... 31 3.12 p300 midbrain 5+1 basal up to 1 Mb GREAT enrichments ...... 33

4.1 Misannotated rodent coding exon losses ...... 60

5.1 Summary of annotation enrichments of hCONDELs...... 72

A.1 SRF “gene-based GREAT” enrichments ...... 99 A.2 SRF 5+1 basal up to 50 kb GREAT enrichments ...... 101 A.3 SRF two nearest genes up to 1 Mb GREAT enrichments ...... 102 A.4 SRF single nearest gene up to 1 Mb GREAT enrichments ...... 103 A.5 p300 limb DAVID gene-based enrichments ...... 104 A.6 p300 limb “gene-based GREAT” enrichments ...... 105

xiii A.7 p300 limb 5+1 basal up to 50 kb GREAT enrichments ...... 106 A.8 p300 limb two nearest genes up to 1 Mb GREAT enrichments . . . . 107 A.9 p300 limb single nearest gene up to 1 Mb GREAT enrichments . . . . 108 A.10 p300 forebrain DAVID gene-based enrichments ...... 109 A.11 p300 forebrain “gene-based GREAT” enrichments ...... 110 A.12 p300 forebrain 5+1 basal up to 50 kb GREAT enrichments ...... 111 A.13 p300 forebrain two nearest genes up to 1 Mb GREAT enrichments . . 112 A.14 p300 forebrain single nearest gene up to 1 Mb GREAT enrichments . 113 A.15 p300 midbrain “gene-based GREAT” enrichments ...... 114 A.16 p300 midbrain 5+1 basal up to 50 kb GREAT enrichments ...... 115 A.17 p300 midbrain two nearest genes up to 1 Mb GREAT enrichments . . 116 A.18 p300 midbrain single nearest gene up to 1 Mb GREAT enrichments . 117 A.19 p300 mESC comparison of DAVID and GREAT ...... 118 A.20 p300 mESC “gene-based GREAT” enrichments ...... 119 A.21 p300 mESC 5+1 basal up to 50 kb GREAT enrichments ...... 120 A.22 p300 mESC two nearest genes up to 1 Mb GREAT enrichments . . . 121 A.23 p300 mESC single nearest gene up to 1 Mb GREAT enrichments . . . 122 A.24 Stat3 mESC comparison of DAVID and GREAT ...... 123 A.25 Stat3 mESC “gene-based GREAT” enrichments ...... 125 A.26 Stat3 mESC 5+1 basal up to 50 kb GREAT enrichments ...... 126 A.27 Stat3 mESC two nearest genes up to 1 Mb GREAT enrichments . . . 127 A.28 Stat3 mESC single nearest gene up to 1 Mb GREAT enrichments . . 128 A.29 NRSF comparison of DAVID and GREAT ...... 129 A.30 NRSF “gene-based GREAT” enrichments ...... 131 A.31 NRSF 5+1 basal up to 50 kb GREAT enrichments ...... 132 A.32 NRSF two nearest genes up to 1 Mb GREAT enrichments ...... 134 A.33 NRSF single nearest gene up to 1 Mb GREAT enrichments ...... 135 A.34 GABP comparison of DAVID and GREAT ...... 136 A.35 GABP “gene-based GREAT” enrichments ...... 138 A.36 GABP 5+1 basal up to 50 kb GREAT enrichments ...... 139 A.37 GABP two nearest genes up to 1 Mb GREAT enrichments ...... 140

xiv A.38 GABP single nearest gene up to 1 Mb GREAT enrichments . . . . . 141

B.1 Sliding window conservation criteria ...... 167 B.2 The 583 human conserved sequence deletions ...... 167 B.3 PCR validation of 39 hCONDELs in 23 human populations ...... 178 B.4 PCR analysis of 8 polymorphic hCONDELs in 96 humans ...... 179 B.5 Derived allele frequency of polymorphic hCONDELs ...... 181 B.6 hCONDELs in regions of recent positive human selection ...... 181 B.7 Summary of all exonic human conserved sequence deletions ...... 182 B.8 PCR examination of unsupported exonic hCONDELs ...... 182 B.9 hCONDEL GO Molecular Function annotation enrichments . . . . . 182 B.10 hCONDEL InterPro annotation enrichments ...... 183 B.11 All cCONDEL annotation enrichments ...... 183 B.12 All mCONDEL annotation enrichments ...... 184

xv List of Figures

3.1 Enrichment analysis of a set of cis-regulatory regions ...... 12 3.2 Binding profiles and their effects on statistical tests ...... 13 3.3 The GREAT version 1.2.6 workflow ...... 19 3.4 Distal binding events contribute substantially to accurate functional enrichments of p300 limb peaks ...... 29 3.5 GREAT usage patterns ...... 34 3.6 Computationally-defined gene regulatory domains ...... 39

4.1 Computational pipeline to discover rodent-specific DNA losses . . . . 47 4.2 Mammalian phylogeny used to calculate DNA loss rates ...... 48 4.3 Maximum length of primate-dog non-exonic DNA lost in rodents . . . 49 4.4 Indispensability of mammalian DNA under purifying selection . . . . 50 4.5 The relationship between neutrally-evolving mammalian DNA loss rates and mammalian genome structure ...... 52 4.6 Abundance and loss rate of intergenic and intronic primate-dog sequences 53 4.7 Abundance and loss rates of human-dog-horse non-exonic DNA . . . 54 4.8 Abundance of coding exon sequences for all three alignment topologies 56 4.9 Functional DNA loss rates ...... 57 4.10 Deep sequence conservation and indispensability of mammalian DNA 58 4.11 An example mislabeled coding exon loss arising from accelerated evo- lution ...... 59

5.1 Hundreds of sequences highly conserved between chimpanzee and other species are deleted in humans...... 71

xvi 5.2 Transgenic analysis of a chimpanzee and mouse AR enhancer region missing in humans...... 74 5.3 Transgenic analysis of a chimpanzee and mouse forebrain enhancer missing from a tumor suppressor gene in humans...... 76

A.1 Robustness of genomic region-based and gene-based enrichment tests 97 A.2 Binomial and hypergeometric P value differences for several different datasets ...... 98

B.1 Phylogeny of species used ...... 160 B.2 Chimpanzee G/C content of hCONDELs ...... 161 B.3 Repeat masking artifacts produce false hDELs ...... 162 B.4 Repeat masking artifacts produce mis-annotated hDEL sizes . . . . . 163 B.5 Artifactual hDEL due to duplicated chimpanzee sequence ...... 163 B.6 Computational validation of deletion genotype and assessment of poly- morphism ...... 164 B.7 Human conserved sequence deletion Trace Archive validation results . 165 B.8 Chimpanzee browser screenshot of the AR hCONDEL locus . . . . . 165 B.9 Chimpanzee browser screenshot of the GADD45G hCONDEL locus . 166 B.10 Size distributions of chimpanzee and mouse conserved sequence deletions166

xvii Chapter 1

Introduction

Deoxyribonucleic acid (DNA) is the medium used to encode information pertaining to the development and function of living creatures. The human genome contains roughly three billion nucleotides packaged into discrete linear . From a computational perspective, the entire genetic information of an organism can be encoded as long strings over the small alphabet {A, C, G, T}. Deciphering the precise instructions encoded within DNA, and how changes to the instructions alter the development and function of organisms, are major questions within the fields of biology and genetics. Since the human genome was initially sequenced [105], the cost per nucleotide to sequence DNA has dropped ≈ 100,000-fold [132]. The growth rate of our sequencing capabilities has far exceeded that of Moore’s Law [161] and enables the sequencing of many individuals [1] and many different species [84]. The wealth of biological data now available has revolutionized the way biology is being performed today. Com- putational genomics is a burgeoning interdisciplinary field that integrates aspects of computer science, genetics, statistics, and mathematics to address questions of biolog- ical interest. Problems in computational genomics span many biological disciplines, as computational methods are applicable to virtually any domain in which large data sets must be mined and analyzed effectively. Examples include (but are not limited to) the assembly of full genome sequences from short fragments [219, 167, 12, 138], effi- cient alignment of similar DNA sequences [5, 204, 114, 137], gene prediction [34, 101],

1 CHAPTER 1. INTRODUCTION 2

ribonucleic acid (RNA) and protein folding [178, 174], assessment of evolutionary constraint on homologous sequences from distinct organisms [207, 57], and gene set enrichment analysis [218, 100]. Cis-regulatory elements are a fascinating class of functional regions in the genome. They act as the “control layer” within the genome, dictating in which cell types and at which time points genes should be expressed. A fuller understanding of cis-regulatory elements will be a prerequisite to deciphering the complexity of organismal develop- ment in its entirety.

1.1 Organization

The following chapters will describe our current understanding of cis-regulatory el- ements, the contributions this dissertation provides to their analysis, and areas of further study. Chapter 2 serves as a background for understanding the landscape of the human genome from a computational perspective and provides motivation for the study of cis-regulatory elements. In Chapter 3, I introduce the problem of de- termining the functions of a set of cis-regulatory elements based on their genomic positions, describe the statistical methodology and web-based tool we developed to address the problem, named GREAT, and demonstrate GREAT’s marked improve- ment over current methods at inferring the functions of multiple distinct types of cis-regulatory elements. In Chapter 4, I describe a computational methodology to identify large-scale deletions from whole genome sequences and quantify the func- tional importance of highly conserved non-coding elements. Chapter 5 extends this methodology in the application to human-specific losses of highly conserved DNA in its impact on human-specific evolution. Finally, in Chapter 6 I provide some ideas for future research in the study of cis-regulatory elements and their impact on human evolution and disease. Chapter 2

Cis-regulatory elements

This chapter introduces salient features of cis-regulatory elements: why they are believed to be important, how they can be identified within the genome, their function and genomic composition, their involvement in disease, and why they may be an ideal substrate on which evolution can act.

2.1 Background

One of the most surprising findings from the initial sequencing of the human genome [105] was the relative paucity of protein-coding genes. Since genes encode the information required to generate , and proteins play key roles in nearly all cellular pro- cesses, it was assumed that the number of protein coding genes correlated with the complexity of an organism. In reality, there are only ≈ 21,000 genes in the hu- man genome [52], similar to the ≈ 19,000 found in the nematode Caenorhabditis elegans [223], markedly fewer than the ≈ 45,000–55,000 found in rice [249], and dra- matically fewer than early predictions of up to 2,000,000 human genes [113]. The lack of correlation between gene number and organismal complexity suggests two possible (and compatible) scenarios to generate organismal complexity: single genes encoding multiple proteins and the presence of additional functional sequences that do not en- code proteins. We now know that both of these scenarios are active within humans. of pre-mRNA is a method to generate diversity in the protein

3 CHAPTER 2. CIS-REGULATORY ELEMENTS 4

products within an organism [23]. Additionally, many distinct classes of functional non-coding sequence have been discovered that regulate expression of protein-coding genes. Cis-regulatory elements are of much interest as they act to regulate the initia- tion of transcription, when the majority of gene regulation is thought to occur [150].

2.2 Cis-regulatory element identification

Comparative genomics is a powerful methodology for identifying functional genomic sequences. DNA replication is not perfect, with mutations occurring in humans at a rate of roughly 10−8 per (bp) [248, 192]. Mutations with beneficial or no functional consequences may be passed down through generations, but mutations that cause a negative fitness effect are less likely to be propagated. Given two organisms, an expected sequence identity between their genomes can be calculated based on the DNA mutation rate, the generation times of the organisms, and the total time since the organisms shared a common ancestor. Regions of higher sequence identity than expected are thus likely to be functional, since the evolutionary force of purifying selection has culled organisms containing harmful mutations in the regions. Human and mouse last shared a common ancestor between 84–99 million years ago [166, 214], providing an ideal evolutionary separation time to measure genomic constraint since enough time has passed that constrained elements can be differ- entiated from non-functional sequences, but the species are closely related enough that much non-functional DNA can still be aligned. The sequencing of the mouse genome [241] thus enabled comprehensive comparisons to the human genome. By aligning the sequence of repetitive elements inserted into the genome of the human- mouse ancestor in the extant human and mouse genomes, an estimate of the neutral rate of sequence evolution was calculated. When binned into 50 bp segments, align- ment scores for these ancestral repeats followed a normal distribution, as would be expected for neutrally evolving sequence. By performing whole genome alignments between human and mouse, calculating the distribution of alignment scores for all sequences, and scaling and comparing the distributions from ancestral repeats and whole genome alignments, an excess of highly conserved sequences within the whole CHAPTER 2. CIS-REGULATORY ELEMENTS 5

genome alignment was seen. This excess sequence under purifying selection comprises roughly 5% of the genome [241]. This estimate has remained robust or even increased as additional genomes became available for comparison (rat [107], dog [141]) and comprehensive functional study of 1% of the human genome occurred [224]. Since genes comprise only 1.5% of all human genome sequence, this implies that over two thirds of the functional sequence in the human genome does not code for protein. Additionally, since the total amount of functional DNA includes sequences not un- der detectable pan-mammalian purifying selection, this estimate of the total role of cis-regulatory elements in the genome is likely an underestimate. Efforts to identify conserved elements within the human genome have evolved from simple absolute nucleotide identity calculations [16, 246] to more sophisticated techniques that utilize the rates of neutral evolution within the species of interest to examine resistance to insertions or deletions [145], estimate the number of rejected substitutions within individual alignment columns [57], or apply phylogenetic Hidden Markov Models in which conserved regions are assumed to evolve under a similar tree scaled by a factor between 0 and 1 [207].

2.3 Cis-regulatory element function

Regardless of the conservation metric used, non-coding regions identified as being un- der purifying selection in the human genome cluster preferentially near transcription factors and genes involved in development [16, 246, 141, 237]. Systematic in vivo screening of conserved non-coding elements (CNEs) in zebrafish [246] and transgenic mice [179, 237] showed that many CNEs act as regulatory sequences driving reporter in particular tissues during development, often in a subset of the tissues in which the nearby genes are endogenously expressed. These results suggest that CNEs frequently act as tissue-specific enhancers to regulate some fraction of the expression of nearby genes. Strong interspecies non-coding conservation has been accepted as a standard identification method for cis-regulatory elements that pos- sesses decent specificity (40–50% of tested CNEs are cis-regulatory elements active CHAPTER 2. CIS-REGULATORY ELEMENTS 6

at the mouse embryonic day 11.5 developmental stage [236]). Notably, while gene pro- moters contain much important regulatory sequence, many vertebrate cis-regulatory sequences lie distal to their target genes [135, 179, 237, 235].

2.4 Cis-regulatory element composition

Transcription factors are proteins that can bind DNA in a sequence-specific man- ner [133]. The DNA sequences bound by a transcription factor are called its “binding sites”. Cis-regulatory enhancers are composed of clusters of transcription factor bind- ing sites (TFBS) [63, 25]. The presence of multiple TFBS in a single cis-regulatory element indicates that multiple proteins are required to activate the regulatory ele- ment, producing a combinatorial control of gene expression [191]. TFBS are generally short (6 to 12 bp in length) and degenerate sequences [20]. Computational prediction of functional TFBS from a single genome sequence is thus fraught with false positives [33]. Restricting TFBS predictions to binding sites that show evolutionary conservation in multiple diverged species, known as phylogenetic footprinting, can reduce false positive TFBS predictions [25]. However, the perva- siveness of binding site turnover likely causes phylogenetic footprinting methods to omit many functional binding site predictions [200, 201]. Experimental methods to identify cis-regulatory elements have benefited greatly from improvements in genome sequencing technologies. Of particular utility is the protocol of chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) [111, 148, 176]. ChIP-seq can identify the precise genomic locations to which a TF of interest is bound. Downstream analysis of the binding profile of a TF can include determining its sequence-specific binding preferences or predicting the cellular functions regulated by the TF.

2.5 Cis-regulatory elements in disease

Gene expression alterations are mechanisms by which diseases occur. Cis-regulatory element modifications have been linked to many abnormalities. Deletions of enhancer CHAPTER 2. CIS-REGULATORY ELEMENTS 7

cis-regulatory elements reduce or eliminate target gene expression and are associated with multiple diseases. A classic example of regulatory element disruption causing disease involves the locus control region (LCR) of the β-globin locus (reviewed in [93, 136]). The hemoglobin tetramer that carries oxygen within human red blood cells is composed of α-globin and β-globin molecules [54] . An imbalance in the levels of α- and β-globins causes thalassemia and has been linked to the deletion of regulatory sequences within the β-globin LCR [124, 74]. Similarly, deletion of a FOXL2 enhancer is linked to blepharophimosis ptosis epicanthus inversus syndrome [22], deletion of a POU3F4 enhancer is linked to deafness [64], deletion of a SOX9 enhancer is linked to Pierre Robin sequence [19], and deletion of a SHOX enhancer is linked to Leri-Weill dyschondrosteosis [198]. Single point mutations within cis-regulatory enhancers can also abrogate gene expression by eliminating transcription factor binding sites within cis-regulatory el- ements. Disruption of an AP-2alpha binding site within an enhancer of IRF6 is associated with cleft lip [190]. Similarly, a point mutation that disrupts an MSX1 binding site within a SOX9 enhancer is associated with Pierre Robin sequence [19], mutation of a Six3 binding site within a SHH forebrain enhancer suggests a link be- tween disrupted SHH expression and holoprosencephaly [109], and a point mutation within a RET enhancer causes RET expression to decrease and is associated with Hirschsprung disease risk [78]. Alternatively, diseases can arise from novel aberrant gene expression domains gained by mutation or duplication of cis-regulatory sequences. For example, a single point mutation in an enhancer located 1 Mb from its target gene SHH causes preaxial polydactyly due to the gain of SHH expression in the anterior region of the developing limb [135]. Duplications of regions containing cis-regulatory enhancers of SOX9 are associated with brachydactyly and nail aplasia [130]. These data emphasize that proper gene expression is essential for normal devel- opment and provide clear evidence linking the alteration of a cis-regulatory sequence to aberrant expression of its target gene, which in turn produces phenotypic conse- quences for the organism harboring the regulatory alteration. CHAPTER 2. CIS-REGULATORY ELEMENTS 8

2.6 Cis-regulatory elements in trait evolution

Cis-regulatory elements have been proposed to drive the majority of organismal com- plexity [123, 38, 39]. The heart of the argument for regulatory elements driving evolution has to do with the pleiotropy of mutations–how many distinct effects the mutation produces. Barring somatic mutations, every cell within an organism con- tains an identical copy of the entire genome sequence. Mutations to the coding region of a gene will cause an altered protein to be made throughout the organism. The more pleiotropic a gene is, the more likely a mutation will be deleterious to the gene’s ability to perform its function in one or more processes [216]. On the other hand, the abundance of highly conserved sequences surrounding genes suggests that cis-regulatory elements may be more modular, with each cis-regulatory element per- forming a unique role to affect gene expression in a particular tissue at a particular time point in development [62, 234]. This fine-grained control of gene expression via modular cis-regulatory elements provides a mechanism by which pleiotropic genes can alter single traits via gene expression changes while expression in other tissues remains wholly unaffected. Examples abound in which developmental changes occurred as a result of mod- ifications to cis-regulatory elements [17, 135, 88, 187, 152, 159, 43]. Some of the most dramatic regulatory changes driving evolution are those in which complete loss of gene expression is not viable. One example concerns the essential gene Pitx1, in which mouse experiments that eliminate Pitx1 expression entirely cause prenatal and perinatal lethality [131]. The parallel loss of a Pitx1 enhancer within multiple freshwa- ter populations of threespine stickleback fish causes the loss of a pelvic spine [43]. The region shows multiple signatures of positive selection, indicating that the enhancer loss (and concomitant pelvic spine loss) was beneficial to the populations of freshwater lakes. Regulatory alterations to other master developmental genes such as Ubx and yellow drive morphological diversity within Drosophila species [215, 88, 187], further emphasizing the role that cis-regulatory elements can play in driving developmental trait evolution. Chapter 3

Regulatory element functional analysis

Comparative genomic and experimental methods are enabling the genome-wide iden- tification of cis-regulatory elements. However, unlike genes, whose constituent protein domains and corresponding functions can be recognized from sequence data alone, the grammar of cis-regulatory elements remains unknown. Functional analysis of a set of cis-regulatory elements thus relies on a “guilt by association” methodology in which each cis-regulatory element is assumed to be affecting all the functions of all genes it putatively regulates. Existing techniques strongly favor specificity over sensitivity when assigning cis-regulatory elements to target genes by only considering binding events that occur within proximal promoters as functional regulatory elements. Ow- ing to the distal nature of much vertebrate gene regulation, this limited assignment severely hampers the interpretation of cis-regulatory element function. Additionally, genome structure (in particular, the uneven distribution of genes within the genome) renders efforts to increase assignments inappropriate within the existing statistical framework for enrichment. This chapter presents a new method for functional inter- pretation of cis-regulatory sequences that overcomes both drawbacks.

We developed the Genomic Regions Enrichment of Annotations Tool (GREAT) to analyze the functional significance of cis-regulatory regions identified by localized

9 CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 10

measurements of DNA binding events across an entire genome. Whereas previous methods took into account only binding proximal to genes, GREAT is able to prop- erly incorporate distal binding sites and control for false positives using a binomial test over the input genomic regions. GREAT incorporates annotations from 20 ontologies and is available as a web application. Applying GREAT to data sets from chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) of multi- ple transcription-associated factors, including SRF, NRSF, GABP, Stat3, and p300 in different developmental contexts, we recover many functions of these factors that are missed by existing gene-based tools, and we generate testable hypotheses. The utility of GREAT is not limited to ChIP-seq, as it can also be applied to open chromatin, localized epigenomic markers and similar functional data sets, as well as comparative genomics sets. The coupling of chromatin immunoprecipitation with massively parallel sequenc- ing (ChIP-seq) is ushering in a new era of genome-wide functional analysis [111, 148, 176]. Thus far, computational efforts have focused on pinpointing the genomic locations of binding events from the deluge of reads produced by deep sequenc- ing [110, 118, 194, 228, 232]. Functional interpretation is then performed using gene- based tools developed in the wake of the preceding microarray revolution [119, 4, 71]. In a typical analysis, one compares the total fraction of genes annotated for a given on- tology term with the fraction of annotated genes picked by proximal binding events to obtain a gene-based P value for enrichment (Figure 3.1a). This procedure has a fundamental drawback: associating only proximal binding events (for example, under 2–5 kb from the transcription start site) typically discards over half of the observed binding events (Figure 3.2a). On the other hand, the standard approach to capturing distal events–associating each binding site with the one or two near- est genes–introduces a strong bias toward genes that are flanked by large intergenic regions [144, 220]. For example, though the Gene Ontology [9] (GO) term ‘multicel- lular organismal development’ is associated with 14% of human genes, the ‘nearest genes’ approach associates over 33% of the genome with these genes. This biologi- cal bias results in numerous false positive enrichments, particularly for the input set sizes typical of a ChIP-seq experiment (Figure 3.2b, Figure A.1). Building on our CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 11

experience in addressing these pitfalls [16, 15, 144] we have built a tool that robustly integrates distal binding events while eliminating the bias that leads to false positive enrichments.

3.1 Results

Here we describe GREAT, which analyzes the functional significance of sets of cis- regulatory regions by explicitly modeling the vertebrate genome regulatory landscape and using many rich information sources.

3.1.1 A genomic region-based binomial test for long-range gene regulatory domains

GREAT associates genomic regions with genes by defining a ‘regulatory domain’ for each gene in the genome. Each genomic region is associated with all genes in whose regulatory domains it lies (Figure 3.1b). High-throughput chromosomal conformation capture (3C) approaches such as 5C [72], Hi-C [139] or enhanced ChIP-4C [202] are providing first glimpses of ac- tual gene regulatory domains. Because we still lack precise empirical maps, however, GREAT assigns each gene a regulatory domain consisting of a basal domain that ex- tends 5 kb upstream and 1 kb downstream from its transcription start site (denoted below as 5+1 kb) and an extension up to the basal regulatory domain of the nearest upstream and downstream genes within 1 Mb (GREAT allows the user to modify the rule and distances). GREAT further refines the regulatory domain of a handful of genes, including several global control regions [212], with their experimentally- determined regulatory domains. GREAT can also incorporate additional locus-based and genome-wide data as they become available. Given a set of input genomic regions and an ontology of gene annotations, GREAT computes ontology term enrichments using a binomial test over the genomic regions that explicitly accounts for variability in gene regulatory domain size by measuring the total fraction of the genome annotated for any given ontology term and counting CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 12

a Hypergeometric test over genes b Binomial test over genomic regions Step 1: Infer proximal gene regulatory domains Step 1: Infer distal gene regulatory domains

Gene transcription start site Gene transcription start site π Ontology annotation π Ontology annotation (e.g. “actin cytoskeleton”) (e.g. “actin cytoskeleton”) Proximal regulatory domain Distal regulatory domain of gene with/without π of gene with/without π π π π π π π

Step 2: Associate genomic regions with Step 2: Calculate annotated fraction of genome genes via regulatory domains Genomic region associated with nearby gene 0.6 of genome is annotated with π Ignored distal genomic region π π π Step 3: Count genomic regions associated with the annotation

Genomic region Step 3: Count genes selected by proximal genomic regions 2 genes selected by proximal genomic regions 5 genomic regions hit annotation π 1 gene selected carries annotation π

Step 4: Perform hypergeometric test over genes Step 4: Perform binomial test over genomic regions N = 8 genes in genome n = 6 total genomic regions Kπ = 3 genes in genome carry annotation π pπ = 0.6 fraction of genome annotated with π n = 2 genes selected by proximal genomic regions kπ = 5 genomic regions hit annotation π kπ = 1 gene selected carries annotation π

p-value = Pr (k≥1 | N=8, K=3, n=2) p-value = Pr (k≥5 | n=6, p=0.6) hyper binom

Figure 3.1: Enrichment analysis of a set of cis-regulatory regions. (a) The current pre- vailing methodology associates only proximal binding events with genes and performs a gene-list test of functional enrichments using tools originally designed for microarray analysis. (b) GREAT’s binomial approach over genomic regions uses the total fraction of the genome associated with a given ontology term (green bar) as the expected fraction of input regions associated with the term by chance. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 13

how many input genomic regions fall into those areas (Figure 3.1b). In the example above, GREAT expects 33% of all input elements to be associated with ‘multicellular organismal development’ by chance, rather than the 14% of input elements that a gene-based test assumes. The binomial test over genomic regions integrates distal binding events in a way that is robust to erroneous assignments of genomic regions to genes. Namely, the longer the regulatory domain of any gene, and by extension of any ontology term, the higher the expected number of regions associated with this term by chance. Indeed, the genomic region-based statistic markedly reduces the number of false positive enriched terms even when using very large regulatory domains (Figure 3.2b, Figure A.1). The binomial test treats each input genomic region as a point-binding event, thus making it most suitable for testing targets with localized binding peaks. The binomial test also highlights cases where a single gene attracts an unlikely number of input genomic regions. To separate these biologically interesting gene-specific events from term-derived enrichments that are distributed across multiple genes, we perform both the genomic region-based binomial test and the traditional gene-based hypergeometric test. In doing so we separately highlight ontology terms enriched by both tests (term- derived enrichment) from those enriched only by the genomic region-based binomial test (gene-specific enrichment) or the gene-based hypergeometric (regulatory domain bias) (Figure 3.2c, Figure A.2). GREAT supports direct enrichment analysis of both the human and mouse genomes. It integrates twenty separate ontologies containing biological knowledge about gene functions, phenotype and disease associations, regulatory and metabolic pathways, gene expression data, presence of regulatory motifs to capture cofactor dependencies, and gene families (Tables 3.1–3.3). Core computations are performed by the GREAT server while subsequent brows- ing is executed on the user’s machine. An overview of the tool’s functionality and options when analyzing data is given in Table 3.4, and its web interface is shown in Figure 3.3. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 14

SRF (H: Jurkat) NRSF (H: Jurkat) GABP (H: Jurkat) 10 a b c U Stat3 (M: ESC) p300 (M: ESC) B H Binomial test over Hypergeometric

p300 (M: limb) p300 (M: forebrain) p300 (M: midbrain) 8

s genomic regions test over genes t n 0.7 s 60 b10:h3 b7:h1 b3:h2 +x x e H \ B + m

r +x 6

m h5 0.6 e t 50 + +x b8:h4 e

l h9 h8 h7+ b9:h6 d + + h10 e e 0.5 o + o B \ H l o h o o o l 40 oo o c o i 4 a o o r o 0.4 o oooo o o

f o n o o o o o o oo o e 30 o o ooo o b1 o oo o ooo oo o x 0.3 e o oo o o o oo o n o o oo v o ooooooooooooooooo o o o o i o o ooo o o o

2 o o o o o o ooooo o o o t 20 o oooooooo ooo ooooo oooo oo o b5 i o oo o b2 i o ooooooooooooooooooo ooooooo oo x t 0.2 oooo oooooooooo ooooo o o x s oooooooooooo o o o x ooooooooooooooooooooooooooooooooooooooooooooooooo oo o o o c ooooooooooooo ooo oo o o oooooooooooooooooooooooooooooooooooooooooo oooooooo o o o o o x b6 b4 oooooooooooooooooooooooooooooooooooooooooo oo o p o o oooo a 10 ooooooooooooooooooooooooooooooooooooooooooooooo o o oo o ooooooooooooooooooooooooooooooooooooo o o r 0.1 oooooooooooooooooooooooooooooo ooo oo o e ooooooooooooooo o o ooooooooooooooooooooooooooo ooo -log(Hypergeometric p-value) oooo o o 0 oooooooooo F ooooooooooo s ooooooo l

0 a 0 0-2 2-5 5-50 50-500 > 500 F 0.1 0.5 1 5 10 50 100 0 2 4 6 8 10 Distance to nearest transcription start site (kb) Input set size (thousands) -log(Binomial p-value)

Figure 3.2: Binding profiles and their effects on statistical tests. (a) ChIP-seq data sets of several regulatory proteins show that the majority of binding events lie well outside the proximal promoter, both for sequence-specific transcription factors (SRF and NRSF, [232]; Stat3, [46]) and a general enhancer-associated protein (p300, [235, 46]). Cell type is given in parentheses: H, human; M, mouse. (b) When not restricted to proximal promoters, the gene-based hypergeometric test (red) generates false positive enriched terms, especially at the size range of 1,000–50,000 input regions typical of a ChIP-seq set. Negligible false positive enrichment was observed for the region- based binomial test (blue). For each set size, we generated 1,000 random input sets in which each base pair in the human genome was equally likely to be included in each set, avoiding assembly gaps. We calculated all GO term enrichments for both hypergeometric and binomial tests using GREAT’s 5+1 kb basal promoter and up to 1 Mb extension association rule (see Results). Plotted is the average number of terms artificially significant at a threshold of 0.05 after application of the conservative Bonferroni correction. (c) GO enrichment P values using the genomic region-based binomial (x axis) and gene-based hypergeometric (y axis) tests on the SRF data [232] with GREAT’s 5+1 kb basal promoter and up to 1 Mb extension association rule (see Results). b1 through b10 denote the top ten most enriched terms when we used the binomial test. h1 through h10 denote the top ten most enriched terms when we used the hypergeometric test. Terms significant by both tests (B ∩ H) provide specific and accurate annotations supported by multiple genes and binding events (Table 3.8). Terms significant by only the hypergeometric test (H\B) are general and often associated with genes of large regulatory domains, whereas terms significant by only the binomial test (B\H) cluster four to six genomic regions near only one or two genes annotated with the term (Table 3.9). CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 15

Table 3.1: Human and mouse ontologies supported by GREAT version 1.2.6.

Ontology References Gene Ontology GO Molecular Function [9] GO Biological Process [9] GO Cellular Component [9] Phenotype Data and Human Disease Mouse Phenotype [32, 24, 208] MSigDB Cancer Neighborhood* [28, 218] MSigDB Cancer Modules* [205, 218] Pathway Data PANTHER Pathway [157] Pathway Commons [41] BioCyc Pathway [40] MSigDB Pathway [218] Gene Expression Data MGI Expression: Detected [209, 32] MGI Expression: Not Detected [209, 32] MSigDB Perturbation [218] Regulatory Motifs MSigDB Predicted Promoter Motifs [218] Transcription Factor Targets [142] MSigDB miRNA Motifs [218] miRNA Targets [142] Gene Families InterPro [103] TreeFam [195] HGNC Gene Families* [31]

* Ontology only supported in human. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 16

Table 3.2: GREAT version 1.2.6 ontology contents for human.

Ontology Terms Genes Direct Download associations date GO Molecular Function 2,800 14,401 43,207 5 Mar 2009 GO Biological Process 5,215 13,293 47,287 5 Mar 2009 GO Cellular Component 834 15,210 39,984 5 Mar 2009 Mouse Phenotype 5,781 5,377 96,704 22 Apr 2009 MSigDB Cancer Neighborhood 427 4,717 41,713 11 Mar 2009 MSigDB Cancer Modules 456 7,918 47,511 11 Mar 2009 PANTHER Pathway 150 1,983 4,676 9 Mar 2009 Pathway Commons 1,253 3,921 52,505 20 Jul 2009 BioCyc Pathway 288 693 1,860 13 Mar 2009 MSigDB Pathway 706 6,473 25,280 11 Mar 2009 MGI Expression: Detected 6,700 7,330 190,463 23 Mar 2009 MGI Expression: Not Detected 3,079 4,812 82,621 23 Mar 2009 MSigDB Perturbation 911 11,189 67,052 11 Mar 2009 Transcription Factor Targets 19 5,375 9,980 12 Mar 2009 MSigDB Predicted Promoter Motifs 615 11,777 154,911 11 Mar 2009 MSigDB miRNA Motifs 222 6,896 32,101 11 Mar 2009 miRNA Targets 9 1,095 1,199 12 Mar 2009 InterPro 6,587 15,228 53,630 3 Mar 2009 TreeFam 8,272 16,684 16,812 3 Mar 2009 HGNC Gene Families 238 4,616 5,021 6 Mar 2009 CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 17

Table 3.3: GREAT version 1.2.6 ontology contents for mouse.

Ontology Terms Genes Direct Download associations date GO Molecular Function 2,380 13,932 47,595 23 Mar 2009 GO Biological Process 4,539 12,805 45,480 23 Mar 2009 GO Cellular Component 686 14,548 34,827 23 Mar 2009 Mouse Phenotype 5,798 5,536 98,514 22 Apr 2009 PANTHER Pathway 149 1,759 4,023 9 Mar 2009 Pathway Commons* 83 116 206 20 Jul 2009 BioCyc Pathway 275 888 2,109 13 Mar 2009 MSigDB Pathway 456 3,479 10,778 11 Mar 2009 MGI Expression: Detected 6,701 7,730 194,545 23 Mar 2009 MGI Expression: Not Detected 3,119 5,121 87,257 23 Mar 2009 MSigDB Perturbation 248 7,419 23,073 11 Mar 2009 Transcription Factor Targets 6 1,194 1,377 12 Mar 2009 MSigDB Predicted Promoter Motifs 615 9,117 127,459 11 Mar 2009 MSigDB miRNA Motifs 222 5,948 28,369 11 Mar 2009 miRNA Targets 1 97 97 12 Mar 2009 InterPro 6,281 16,096 51,140 3 Mar 2009 TreeFam 7,953 16,974 17,040 3 Mar 2009

* The mouse Pathway Commons ontology contains considerably less data than its human counterpart because many input databases in Pathway Commons are specific to human pathways. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 18

Table 3.4: GREAT parameters, filters, and options, and their effects.

Parameter Effect Region - gene association rule Determines how gene regulatory domains are calculated. When we allowed for distal associations, the sets we examined remained robust regardless of the exact choice of association rule. Our default rule (basal and extension) models a current hypothesis of gene regulatory domains.

Region - gene association rule parameters Determine the length of each inferred gene regulatory domain. As we show, when the right statistical model is used, including distal associations of up to 1 Mb can strongly increase biological signals.

Statistical significance visual filter Highlights statistically significant results in bold font. Multiple test correction options and thresholds for significance can be modified.

Binomial fold enrichment filter Complements P value by requiring that statistically significant terms have strong biological effects. Often filters ontology terms that apply to thousands of genes.

Observed gene hits filter Shows only enriched terms for which input regions select at least this many genes. Helps avoid enrichments owing to many regions selecting a small number of genes.

Minimum annotation count threshold Increases statistical power by reducing the number of tests performed, by testing only ontology terms associated a priori with at least this many genes.

Display type Summary display shows only terms statistically significant by both binomial and hypergeometric tests. Full display ignores the statistical significance filter and shows terms that meet all other criteria.

Export Export tables individually or in batches into a file of tab-separated values or publication-ready HTML.

UCSC custom tracks Clicking a specific region from within a term detail page region opens the University of California Santa Cruz Genome Browser [116] focused on that region, with two custom tracks automatically loaded – one for the total set of input regions and another for the subset of regions associated with the chosen term. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 19

a b

c d

Figure 3.3: Screenshots of the GREAT version 1.2.6 workflow. (a) The input screen lets the user choose an organism, input a set of cis-regulatory regions, and alter the mapping of cis- regulatory regions to genes. (b) The main output screen displays enriched terms from 20 different ontologies. Controls let the user set significance criteria and level of detail and download HTML or tab-separated output files. (c) The individual term details screen displays information related to the enrichment of a particular ontology term. (d) Clicking any region in a term details screen opens a UCSC Genome Browser display focused on that region that includes custom tracks for the entire set of input regions and for the subset that contributes to the term. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 20

3.1.2 Comparison of enrichment tests and gene regulatory domain ranges

To demonstrate the utility of our approach, we compared GREAT results to previ- ously published gene-based analyses as well as to enrichments from the Database for Annotation, Visualization, and Integrated Discovery (DAVID) [100]. Most gene-based tools assess enrichments in a very similar manner; we chose DAVID as a representa- tive gene-based tool owing to its popularity and its ability to test a breadth of data sources similar to that of GREAT (Table 3.5).

Table 3.5: Comparison of several enrichment tools.

Primary use Ontologies Web Real-time Reference supported based response GREAT Cis-regulatory regions Many Yes Yes [154] DAVID Gene sets Many Yes Yes [100] Babelomics Gene sets Many Yes No [3] OntoTools Gene sets GO, Yes Yes [120] GOstat Gene sets Only GO Yes Yes [14] GoMiner Gene sets Only GO No N/A [250]

We analyzed eight ChIP-seq data sets from a range of human and mouse cells and tissues (Table 3.6), each with a different distribution of proximal and distal binding events (Figure 3.2a). We tested each data set in six different ways: (i) by reproducing the original study’s list of enrichments, or using DAVID on the set of genes with binding events within 2 kb of the transcription start site where an original analysis was missing; (ii) by using GREAT with the default regulatory domain definition (basal promoter 5+1 kb and extension up to 1 Mb); (iii) by using GREAT’s hypergeometric test on the set of genes with binding events within 2 kb of the transcription start site, to control for the different gene mappings and ontologies in DAVID and GREAT; (iv) by using GREAT with a 5+1 kb basal promoter and a more limited 50 kb exten- sion; and (v, vi) by using GREAT with either one (v) or two (vi) nearest genes up CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 21

Table 3.6: Data sets analyzed by GREAT.

Dataset Species Tissue References Serum Response Factor Homo sapiens Jurkat cells [232] Neuron-Restrictive Silencer Factor Homo sapiens Jurkat cells [232] GA-Binding Protein Homo sapiens Jurkat cells [232] p300 Mus musculus Embryonic limb [235] p300 Mus musculus Embryonic forebrain [235] p300 Mus musculus Embryonic midbrain [235] p300 Mus musculus Embryonic stem cells [46] Signal transducer and Mus musculus Embryonic stem cells [46] activator of transcription 3

to 1 Mb (Table 3.7 and 3.8, Table A.1–A.38). GREAT invariably revealed strong enrichments for experimentally-validated functions of the specific factors, as well as for testable – and, to our knowledge, novel – functions. It also implicated subsets of regulatory regions in driving the assayed developmental processes and in activating key signaling pathways. In a majority of data sets, distal binding events were essential to recover known functions, strongly suggesting that many of the distal associations are biologically meaningful (see below). Furthermore, in most sets, restricting regu- latory domain extension to 50 kb retains many enriched terms but omits roughly half of both the binding events and the genes implicated using the full 1-Mb extension. Although including distal associations is crucial, the exact distal association rule is not – the default, nearest gene, and two nearest genes rules (tests ii, v, vi) behaved very similarly. Additionally, inclusion of the small set of experimentally-determined gene regulatory domains we curated from the literature made very little difference in the rankings of any of the sets (data not shown). We present the analysis of four ChIP-seq data sets below and discuss the remainder in Section A.2.

3.1.3 Serum Response Factor binding in human Jurkat cells

First, we analyzed a set of genomic regions bound by the Serum Response Factor (SRF) in the human Jurkat cell line identified via ChIP-seq and mapped to the genome CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 22

using the quantitative enrichment of sequence tags (QuEST) ChIP-seq peak-calling tool [232]. This data set’s authors applied existing gene-based enrichment tools, which did not discern specific functions of SRF from the set of regions it binds [232], and concluded that SRF is a regulator of basic cellular processes with no specific physiological roles (results reproduced in Table 3.7).

Table 3.7: Previously published [232] gene-based ontology enrichments for regions bound by SRF in human Jurkat cells.

Term P value nucleus 5.18 × 10−70 protein binding 2.16 × 10−50 cytoplasm 6.67 × 10−27 transcription 4.13 × 10−26 nucleotide binding 1.04 × 10−23 metal ion binding 1.92 × 10−22 zinc ion binding 5.76 × 10−20 RNA binding 3.38 × 10−18 regulation of transcription, DNA-dependent 1.15 × 10−15 ATP binding 4.84 × 10−15

Listed are the top enriched GO terms found using a gene-based analysis of the 1,936 genes with an SRF binding peak within 2 kb (adapted from [232]). Though the large number of selected genes produces strong P values, the enriched terms yield only a broad view of SRF functions. The first actin-related term, ‘actin binding’ is ranked 28th (data not shown).

Although SRF is indeed a regulator of basic cellular functions, numerous studies have implicated SRF in more specific biological contexts. SRF is a key regulator of the Fos oncogene [42] and has also been described as a ‘master regulator of actin cytoskeleton’ [158]. Neither FOS nor actin appeared in the top ten hypotheses gen- erated by the previous study (Table 3.7). The same was true when we used GREAT with only proximal (2 kb) associations (Table A.1). However, GREAT analysis of the most significant SRF ChIP-seq peaks [232] (QuEST score > 1; n = 556) using the default settings (5+1 kb basal, up to 1 Mb extension) prominently highlights the key observation that gene-based analyses were unable to reveal: SRF regulates genes associated with the actin cytoskeleton [158] (Table 3.8). CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 23

Table 3.8: GREAT cis-regulatory element enrichments of SRF using the default basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 1 Mb, using the highest-scoring SRF peaks anywhere in the genome (QuEST score > 1; n = 556).

Ontology Term Binomial Binomial Hyper Distal Support P value Fold P value Binding±

Gene Ontology: actin cytoskeleton 6.91 × 10−9 3.05 2.22 × 10−7 38.9% [158] Cellular Component cortical cytoskeleton 4.03 × 10−6 5.90 5.41 × 10−4 54.5% [158]

Gene Ontology: actin binding 5.21 × 10−5 2.03 2.74 × 10−5 51.4% [158] Molecular Function

Transcription Factor SRF targets 4.97 × 10−76 13.22 9.79 × 10−68 14.3% positive Targets, Identified control by ChIP-chip YY1 targets 1.45 × 10−6 2.09 0.0084 20.4% [168]* E2F4 and p130 0.0047 2.01 0.0027 44.4% Novel† E2F4 0.0194 2.08 0.0031 36.4% Novel†

Predicted SRF variants 4.54 × 10−28 to 3.69 to 1.71 × 10−25 to 17.4% to positive Promoter Motifs 4.19 × 10−12 15.46 2.04 × 10−9 28.6% controls GABPA/GABPB 4.20 × 10−9 3.67 6.68 × 10−6 27.6% Novel† Motif NGGGACTTTCCA 1.02 × 10−4 2.12 8.30 × 10−5 20.0% Novel† EGR1 1.71 × 10−4 2.03 0.0013 46.9% Novel†

Pathway Commons TRAIL signaling 2.37 × 10−7 2.45 1.71 × 10−5 46.3% [21] Class I PI3K signaling 9.92 × 10−7 2.56 4.45 × 10−5 44.1% [183]

TreeFam FOSL2 / JDP2 / FOS / 9.66 × 10−9 27.89 1.21 × 10−6 28.6% [42]‡ FOSL1 / FOSB / ATF3

Enriched terms for a variety of ontologies obtained using GREAT analysis (5+1 kb basal, up to 1 Mb extension) of proximal and distal binding events. The enriched terms highlight experimentally validated functions and cofactors of SRF that lend immediate insight into its biological roles as well as propose testable hypotheses of SRF functions that are, to our knowledge, novel (see Results). Shown are all binomial enriched terms at a false discovery rate of 0.05 with a genomic region fold enrichment of at least two that are also significant at a false discovery rate of 0.05 by the hypergeometric test, using the highest-scoring SRF peaks anywhere in the genome (QuEST score > 1; n = 556). ± The fraction of binding peaks contributing to the enrichment located > 10 kb from the transcription start site of the nearest gene. * Known interactions often also give rise to novel hypotheses; for example, SRF is known to co-regulate some genes with YY1, and GREAT identifies many additional genes potentially bound by both SRF and YY1. † Hypothesis: SRF acts with E2F4, GABP, EGR1, and a previously uncharacterized binding motif to co-regulate target genes (see Results for supporting evidence). ‡ SRF is known to regulate Fos and Fosb [42]; GREAT highlights three other members of the FOS family that may also be regulated by SRF. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 24

As postulated above, using both genomic region-based and gene-based enrichment tests does highlight informative GO terms more effectively than either test alone (Figure 3.2c and Table 3.9). Moreover, when extension of regulatory domains is limited to 50 kb, one third of the supporting regions and associated genes are lost, and actin-related terms drop in rank (Table A.2). Coupling distal (up to 1 Mb) associations with the many additional ontologies available within GREAT provides a wealth of enrichments for specific known func- tions of SRF. An enrichment analysis of TreeFam gene families [195] shows that SRF binds in proximity to five out of six members of the FOS family. Two genes within the Fos family, c-Fos and Fosb, are previously known targets of SRF [42]. The Tran- scription Factor Targets ontology [142] has compiled data from ChIP experiments that link transcription factor regulators to downstream target genes. GREAT shows that many genes proximal to SRF binding events (in Jurkat cells) are also proximal to YY1 binding events (in HeLa cells), consistent with experiments showing that SRF acts in conjunction with YY1 to regulate Fos [168]. The top six hits in the Predicted Promoter Motifs ontology [218] are all variants of the SRF motif generated from different experiments and thus serve as strong positive controls of our method. Using the Pathway Commons ontology [41], GREAT predicts that SRF regulates components of the TRAIL signaling pathway and the class I PI3K signaling pathway. Previous experimental work has demonstrated that there is an association between SRF and TRAIL signaling [21] and that SRF is needed for PI3K-dependent cell pro- liferation [183]. In addition to rediscovering and expanding specific known functions of SRF, GREAT produces testable hypotheses even for this well-studied transcription fac- tor. The Transcription Factor Targets ontology indicates that SRF binds near genes regulated by E2F4 (in T98G, U2OS, and WI-38 cells; Table 3.8). SRF and E2F4 have not been shown to co-regulate target genes; however, both SRF and E2F4 are known to interact with SMAD3 ([134] and [45]) and they may thus be co-regulators of a common set of genes. The Predicted Promoter Motifs ontology reveals additional potential cofactors and co-regulators. It is particularly useful given that many more genes have characterized binding motifs than have genome-wide ChIP data available. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 25

Table 3.9: SRF Gene Ontology term enrichments highlight the benefits of coupling bino- mial and hypergeometric tests. (a) Terms significant by both the binomial and hypergeometric tests highlight many genes involved in the process with many genomic regions implicating the genes as well. Skews between the fraction of genes annotated with the term and the fraction of the genome that maps to one or more genes annotated with the term are generally modest. (b) Terms significant by the hypergeometric test but not the binomial test arise either due to large differences between the fraction of genes annotated with the term and the fraction of the genome that maps to one or more genes annotated with the term or the association of a single genomic region to multiple genes annotated with the term. (c) Terms significant by the binomial test but not the hypergeometric test arise when many genomic regions cluster near one or few genes annotated with the term, and indicate gene-specific enrichments rather than broad term-based enrichment. a Terms significant by both binomial and hypergeometric tests (B ∩ H, listed in Table 1b) GO ID Description Genes Hit SRF Peaks Fraction of Genes Fraction of Genome GO:0015629 actin cytoskeleton 30 36 0.013185 0.021250 GO:0030863 cortical cytoskeleton 11 7 0.001859 0.003351 GO:0003779 actin binding 31 37 0.017483 0.032754 b Terms significant by the hypergeometric test but not the binomial test (H\B) GO ID Description Genes Hit SRF Peaks Fraction of Genes Fraction of Genome GO:0010604 positive regulation of 53 61 0.033862 0.073249 macromolecule metabolic process GO:0005634 nucleus 284 279 0.284951 0.419860 GO:0009893 positive regulation of 54 62 0.036011 0.077215 metabolic process GO:0005515 protein binding 397 351 0.423419 0.604715 GO:0019899 binding 33 37 0.018702 0.037255 c Terms significant by the binomial test but not the hypergeometric test (B\H) GO ID Description Genes Hit SRF Peaks Fraction of Genes Fraction of Genome GO:0032796 uropod organization 2 5 0.000116 0.000100 GO:0035267 NuA4 histone acetyltrans- 2 6 0.000348 0.000231 ferase complex GO:0043189 H4/H2A histone acetyl- 2 6 0.000407 0.000262 transferase complex GO:0043534 blood vessel endothelial 2 6 0.000290 0.000309 cell migration GO:0000212 meiotic spindle organiza- 1 4 0.000116 0.000092 tion CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 26

In this case, it shows enrichment for SRF binding near genes containing GABP motifs in their promoters. Notably, an independent experiment measuring GABP-bound re- gions of the genome in Jurkat cells has found that 29% of SRF peaks occur within 100 bp of a GABP peak, suggesting that SRF and GABP may indeed work together [232]. We were able to generate this same hypothesis using GREAT, without observing the GABP ChIP-seq data.

3.1.4 P300 binding in the developing mouse limbs

Second, we analyzed a recent ChIP-seq data set comprising 2,105 regions of the mouse genome bound by the general enhancer-associated protein p300 in embryonic limb tissue [235]. Of 25 such regions tested in transgenic mouse assays, 20 showed reproducible enhancer activity in the developing limbs [235]. Our analysis shows that GREAT identifies functions of enhancers active during embryonic development that gene-based tools do not detect. DAVID analysis of the genes with proximal p300 limb binding events produces only enrichments associated with transcription and involve- ment in organ morphogenesis, with the closest enrichments being the very broad terms ‘organ development’ and ‘anatomical structure morphogenesis’ (Table A.5). In con- trast, GREAT analysis of the 2,105 p300 limb peaks using the default settings (5+1 kb basal, up to 1 Mb extension) produces overwhelming support for their putative functional role in limb development (Table 3.10). GO enrichments highlight the regulation of transcription factors involved specif- ically in embryonic limb morphogenesis. The Mouse Phenotype ontology [24] points to the developing limbs and skull, hinting at the remarkable overlap of signaling pro- cesses involved in head and limb development [243]. The p300 limb peaks are enriched near genes in the TGF-β signaling pathway, which is known to be involved in limb development [36], and the InterPro ontology highlights genes in the Smad family containing the Dwarfin-type MAD homology-1 protein domain (Table 3.10) which is known to mediate and regulate TGF-β signaling [127]. Perhaps the strongest validation for the GREAT methodology comes from the MGI Expression: Detected ontology [32]. Notably, the enrichments highlighted most CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 27

Table 3.10: GREAT cis-regulatory element enrichments of p300 in limb using the default basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 1 Mb. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 28

prominently by GREAT pinpoint the exact tissue and time point at which the ChIP- seq experiment was performed [235], providing unique large-scale evidence for the relevance of p300-bound regions to limb gene regulation. The top two ontology terms suggest limb-specific expression during Theiler stage 19 (TS19), which corresponds precisely with the embryonic day 11.5 time point at which the p300 limb peaks were assayed [235] (Table 3.10). The GREAT enrichments all draw heavily from the ap- propriate integration of distal binding events; over 75% of the binding peaks that contribute to every GREAT enriched term occur further than 10 kb from the TSS of the nearest gene (Figure 3.4, Table 3.10). Further emphasizing the importance of distal binding, GREAT run with proximal (2 kb) associations retrieves only weak enrichments for limb-associated genes and limb TS19, implicating 7-fold fewer genes and 16-fold fewer p300 limb peaks as being involved in TS19 limb expression than GREAT run with the default association rule (Table A.6). Moreover, GREAT run with proximal associations completely misses genes with crucial roles in limb devel- opment such as Gli3, Grem1, and Wnt7a [169]. We examined properties of the 2,105 p300 mouse embryonic limb peaks [235] in the context of three known limb-related terms and a negative control term (GO cortical cytoskeleton) using three different association rules. For each term, we examined the relevance of distal binding peaks by comparing the experimental results to the average values of 1,000 simulated data sets in which the 192 proximal ChIP-seq peaks within 2 kb of the nearest transcription start site were fixed and the 1,913 distal peaks were shuffled uniformly within the mouse genome, avoiding assembly gaps and proximal promoters (Figure 3.4). By design, simulation results for proximal, 2-kb GREAT are identical to the actual data and are thus omitted. Lengthening a 2-kb proximal promoter to a 50-kb extension, expected to increase genome coverage per term (pπ in Figure 3.1b) by 25-fold, causes an actual increase of 19- to 24-fold; in contrast, lengthening a 50-kb extension rule to a 1-Mb extension rule, expected to raise genome coverage 20-fold, leads to an actual increase of only 2.5- to 6-fold because regulatory domains are not extended through neighboring genes (Figure 3.4a). As regulatory domains increase in length from only the proximal 2 kb up to 50 kb and 1 Mb, the number of relevant genes with a p300 limb peak in their regulatory domain increases CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 29

(Figure 3.4b). Indeed, when GREAT’s regulatory domains are extended up to 50 kb, it correctly recovers limb terms, but still implicates only half the genes found with the default association rule and yields P values many orders of magnitude weaker (Figure 3.4 and Table A.7). The added genes selected only by distal associations are typically enriched for limb functionality compared to simulated data. Similarly, as regulatory domains increase in length, the number of p300 limb peaks associated with a relevant gene in excess of the number expected by chance increases for all limb-related terms (Figure 3.4c). The inclusion of distal peaks markedly increases the statistical significance of the correct terms alone (Figure 3.4d), providing strong evidence that many of these distal associations are biologically meaningful.

3.1.5 P300 binding in the developing mouse forebrain and midbrain

Finally, we analyzed two ChIP-seq data sets comprising regions bound by p300 in mouse embryonic forebrain and midbrain tissue [235]. Using the 2,453 forebrain peaks, DAVID correctly highlights forebrain development (0.004 < P < 0.05), but with terms implicating fewer than ten genes (Table A.10). GREAT run with proximal regulatory domains (2 kb) ranks forebrain development higher and is able to implicate additional genes and regions using its unique phenotype and expression ontologies (Table A.11). Using up to 50 kb extension adds additional related terms and raises the number of genes associated with each term (Table A.12). This trend continues when the extension is increased to up to 1 Mb, and only this inclusion of distal binding allows detection of significant associations (P = 0.001) with Wnt signaling genes that have known roles in forebrain development [253] (Ta- ble 3.11). The default GREAT settings highlight the regulation of transcription factors as well as mouse phenotypes in axonal tract formation that are affected by defects in early stages of neuronal differentiation (Table 3.11). In particular, GREAT offers enrichment for the basic helix-loop-helix family of transcription factors that are known to play a prominent role in cell fate at this stage of development [184]. Other CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 30

a b c d Ontology: Term Genome fraction Genes Regions (obs-exp) Statistical significance* p300 limb set Gene Ontology: 2 kb simulations Embryonic limb 50 kb morphogenesis 1 Mb 0 0.007 0.014 0 15 30 45 60 0 25 50 75 100 0 7.5 15 22.5 30

NS PANTHER: 2 kb TGF-β signaling 50 kb NS pathway 1 Mb 0 0.005 0.01 0 10 20 30 0 10 20 30 0 0.75 1.5 2.25 3

MGI Expression: 2 kb Theiler stage 19 50 kb limb expression 1 Mb Regulatory domain extent 0 0.025 0.05 0 30 60 90 120 0 50 100 150 200 0 15 30 45 60

NS Gene Ontology: 2 kb Cortical 50 kb NS cytoskeleton 1 Mb NS (negative control) 0 0.0015 0.003 0 2 4 6 8 -2 -1 0 1 2 3 0 0.25 0.5 0.75 1 Fraction of the genome Number of genes Number of genomic -log (FDR q) overlapped by the annotated with the regions in the regulatory 10 regulatory domain of term containing a domain of a gene a gene annotated genomic region in annotated with the term with the term their regulatory domain in excess of the number expected by chance

Figure 3.4: Distal binding events contribute substantially to accurate functional enrich- ments of p300 limb peaks. (a) Fraction of the genome annotated with an ontology term for three different association rules: only associating peaks within 2 kb of the transcription start site (labeled 2kb), 5+1 kb basal and up to 50 kb extension (labeled 50 kb), and 5+1 kb basal and up to 1 Mb extension (labeled 1 Mb). (b) The number of genes annotated with the ontology term selected by p300 limb peaks or by 1,000 simulated data sets (see text). (c) The excess number of genomic regions associated with the ontology term beyond that expected by chance given the fraction of the genome annotated with the term. (d) The statistical significance of enrichment for the term in the p300 limb peaks compared to simulated data. *Statistical significance is measured using the hyper- geometric test over genes for 2 kb to mimic current gene-based approaches, and using the binomial test over genomic regions for 50 kb and 1 Mb. Error bars indicate s.d.; NS, not significant at a threshold of 0.05 after false discovery rate multiple test correction; obs, observed; exp, expected. Note scale changes on x axes. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 31

highly relevant findings of GREAT include the PANTHER Pathway enrichment for the Notch signaling pathway and multiple enrichments for the Wnt signaling pathway (Table 3.11. At this stage of forebrain development, the production of postmitotic neurons and proliferative progenitors is indeed tightly regulated by both Notch and Wnt signaling [206, 253]. When run on the 561 midbrain p300 peaks, DAVID does not yield significant results (P > 0.05; data not shown). Nearly all of the 561 midbrain peaks lie distal to genes: only 28 genes have a midbrain peak within their proximal promoter. Con- sequently, “gene-based GREAT” identifies only three total enriched terms involving seven total genes (Table A.15). In contrast, GREAT with up to 1 Mb extension highlights twelve brain-specific enriched terms (Table 3.12). Many GREAT enriched terms are shared between the forebrain and midbrain peaks, but GREAT correctly identifies midbrain-specific enrichments such as the GO term ‘compartment specification’. Compartment specification is of interest, as within this tissue at this developmental age, Fgf8 induces Wnt (also enriched within this set) to set up a gene network that establishes the boundary between the midbrain and hindbrain compartments [247]. GREAT with up to 50 kb extension is able to highlight many of the same terms, but loses roughly half the associated genes and regions and the Wnt enrichment (Table A.16). Thus, GREAT repeatedly highlights remarkably specific and relevant terms that gene-based enrichment analyses do not detect. In particular, our analysis shows that ignoring distal binding events often leads to missing target gene associations, weaker P values or even completely missed relevant enrichment terms.

3.1.6 Evaluation of GREAT usage patterns

The above results demonstrate the utility of GREAT at discerning functional enrich- ments of cis-regulatory regions. However, the goals of deploying a bioinformatics tool to the public are two-fold: not only should the tool address a novel question or improve analysis over existing tools, but also the tool should be intuitive and easy CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 32

Table 3.11: GREAT cis-regulatory element enrichments of p300 in forebrain using the default basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 1 Mb. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 33

Table 3.12: GREAT cis-regulatory element enrichments of p300 in midbrain using the default basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 1 Mb. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 34

enough to use to gain wide adoption within the community. These goals can be an- alyzed by examining the usage patterns of GREAT since its publication and website launch on May 2, 2010 (Figure 3.5). After an initial flurry of interest in GREAT immediately upon publication, during which time demo jobs contributed a substantial fraction of all jobs, the job submission rate leveled around 40 per day through the 2010 calendar year and into the beginning of 2011. In the past two months, submissions have grown dramatically, with peak days servicing over 200 jobs (Figure 3.5a). Human data sets are most frequently run, with mouse sets contributing a substantial but smaller fraction of all jobs run, and zebrafish contributing a consistent but small number of jobs since its introduction (Figure 3.5b). The usage growth can be separately quantified by the number of new GREAT users received each day (measured as the number of unique IP addresses running non-demo GREAT jobs). This user data corroborates the consistent number of total jobs submitted in 2010 and early 2011 followed by a recent increase in total jobs: the rate of new user adoption was relatively constant from June 2010 through February 2011 with an average of 3.77 new users each day, and has undergone a strong increase in the past two months (7.08 new users each day in March and April) (Figure 3.5c). This growth in usage is likely due in large part to recent demonstrations of GREAT’s utility in independent analyses of cis-regulatory elements from a wide range of cis-regulatory data of various types. GREAT’s ability to identify functions of sequence-specific regulators has now been repeatedly shown by groups other than ours. In particular, GREAT analyses of ChIP- seq data for the sequence-specific transcription factors FOXA2 and PPARγ reinforced their known role in regulating metabolic pathways in human and mouse [210]. Analy- sis of ChIP-seq data for Prdm14 predicted a role of the protein in chromatin modifica- tion and cell fate specification, and further experimental analysis confirmed Prdm14 as a regulator involved in mouse embryonic stem cell transcriptional network [147]. Analysis of predicted binding site motifs for a heterodimer of Emx2 and Pbx1, two proteins whose compound knockout produces pelvic and proximal hindlimb skeletal abnormalities, showed enrichment for hindlimb morphogenesis [37]. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 35

a Graphical User Interface Programming Interface Demo 200

i tt ed 150 m

i tt ed 150 b m b

200 Graphical User Interface Programming Interface Demo 100 j ob s u f 100 j ob s u o f e r o i tt ed b e r m m b 150 b m

N u 50

N u 50 j ob s u f 100 o

e r 0 b 0 m May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr

N u May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr 50 2010 2011 2010 2011

b 2000 2000 Human Mouse Zebrafish 1800 Human Mouse Zebrafish 1800 May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr 1600 2010 2011 1602000 1400

i tt ed 1401800 Human Mouse Zebrafish m i tt ed 1200 b m 1201600 b 1000 i tt ed 1400

m 1000 j ob s u b f 800 j ob s u

o 1200 f 800 e r o

b 600 e r 1000 m b j ob s u 600 f m N u o 804000 N u e r 400 b

m 602000 200 N u 4000 0 200 May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May2010 Jun Jul Aug Sep Oct Nov Dec J2011an Feb Mar Apr 2010 2011 18000 1800 May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr c 1600 2010 2011 16018000 sses

r e 1400 sses 1600 d r e d a d 1200

a d 1400 sses

e I P r e u d

e I P 1000 i q

u 1200 a d n

i q u n f 800 o e I P u

1000 f u o e r

i q b n e r 60800 b u f nu m o

al nu m 400 e r

t 600 b o al t T o 200 T

nu m 400 al t

o 200 T May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr 0 2010 2011 May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr 2010 2011

Figure 3.5: GREAT usage patterns between launch on May 2, 2010 and April 30, 2011. (a) The total number of jobs run each day binned by type. (b) The total number of jobs run using the graphical user interface of GREAT or demos binned by species examined and month. On November 19, 2010, the human hg19 assembly was introduced. On January 27, 2011, the zebrafish Zv9 assembly was introduced. (c) The cumulative number of unique IP addresses that have run one or more non-demo GREAT jobs. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 36

The utility of GREAT for analyzing compound epigenetic signatures was conclu- sively demonstrated in an analysis of regions bound by p300 and BRG1 and possessing the histone modification of monomethylation of histone H3 at lysine 4 (H3K4me1) in human embryonic stem cells. Of these regions, “Class I” regions defined as also containing acetylation of histone H3 at lysine 27 (H3K27ac) showed GREAT enrich- ments for processes in very early development, consistent with the demonstrated role of these regions as active enhancers in hESCs [189]. “Class II” regions, defined as containing trimethylation of histone H3 lysine 27 (H3K27me3) rather than H3K27ac, showed GREAT enrichments for later developmental stages. Strikingly, these Class II regions were experimentally demonstrated to be ‘poised enhancers’ active in later development, and GREAT analysis of a subset of Class II regions that change to Class I regions upon neuroectodermal differentiation showed enrichment for neural ectoderm and brain-expressed genes [189]. Finally, GREAT analysis of an evolutionary genomics set of conserved regions present in chimpanzee and other non-human mammals, but deleted uniquely in hu- mans, showed enrichment for steroid hormone receptor activity and genes expressed in the brain that suppress cell proliferation or migration [155]. Targeted further ex- periments of individual deletions contributing to those enriched functions identified enhancer elements whose human loss may have contributed to unique human traits (see Chapter 5 of this thesis).

3.2 Discussion

GREAT is a new-generation tool aimed at the interpretation of genome-wide cis- regulatory data sets. It explicitly models the vertebrate cis-regulatory landscape through the use of long-range regulatory domains and a genomic region-based en- richment test, allowing analyses that take into consideration the large number of binding events that occur far beyond proximal promoters. By accounting for the length of gene regulatory domains, GREAT is able to highlight biologically mean- ingful terms and their associated cis-regulatory regions and genes, in a manner that remains robust if there are some false associations between input regions and genes. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 37

Moreover, these regulatory-domain definitions can naturally incorporate future results from three-dimensional conformation capture studies [72, 139, 202], radiation hybrid maps [175], and other emerging approaches for measuring the regulatory genome in action. By coupling this methodology with many ontologies that span a wealth of biological information types, GREAT produces specific, accurate enrichments that provide insight into the biological roles of cis-regulatory data sets of interest. We comprehensively tested GREAT on multiple ChIP-seq data sets and found that it is able to reproduce many known biological facts that existing methods do not detect, as well as suggest novel hypotheses for further experimental characterization. In particular, our analysis shows that ignoring distal binding events often leads to missing target gene associations, to obtaining weaker P values, or even to completely omitting relevant enrichment terms. Besides ChIP-seq data, GREAT can also be applied to the analysis of any data set thought to be enriched for localized cis- regulatory regions. This includes functional genomic data sets of open chromatin, localized epigenomic markers, and comparative genomics sets. GREAT may thus prove invaluable in elucidating the cis-regulatory functions encoded in genomes. GREAT is available online (http://great.stanford.edu); also provided is a means for direct submission from other applications such as genome portals and peak calling tools.

3.3 Methods

3.3.1 Gene set definition

Statistical enrichment of ontology terms is dependent upon the genome-wide gene set used in the analysis. GREAT version 1.2.6 supports testing of human (Homo sapiens NCBI Build 36.1, or UCSC hg18) and mouse (Mus musculus NCBI Build 37, or UCSC mm9). To limit the gene sets to only high-confidence genes and gene predictions, we use only the subset of the UCSC Known Genes [99] that are protein- coding, are on assembled chromosomes, and possess at least one meaningful Gene Ontology (GO) annotation [9]. GO is an ontological representation of information CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 38

related to the biological processes, cellular components, and molecular functions of genes. We rely on the idea that if a gene has been annotated for function it should be included in the gene set, and if no function has been ascribed to a gene its status may be unclear and thus it is best omitted from the gene set. In GREAT version 1.2.6, we use GO data downloaded on March 5, 2009 for human and March 23, 2009 for mouse, leading to gene sets of 17,217 and 17,506 genes for human and mouse, respectively. A single gene may have multiple splice variants. As annotations are generally given at the gene level, GREAT uses a single transcription start site (TSS) to specify the location of each gene. The TSS used is that of the ‘canonical isoform’ of the gene as defined by the UCSC Known Genes track [99].

3.3.2 Association rules from genomic regions to genes

For each gene, we define a ‘regulatory domain’ such that all non-coding sequences that lie within the regulatory domain are assumed to regulate that gene. GREAT currently supports three different parametrized association rules to define gene regu- latory domains (Figure 3.6). The default basal plus extension association rule assigns a ‘basal regulatory domain’ irrespective of the presence of neighboring genes that extends (using default parameters) 5 kb upstream and 1 kb downstream of the TSS (Figure 3.6a). Each gene’s regulatory domain is then extended up to the basal reg- ulatory domain of the nearest upstream and downstream genes, but no longer than 1 Mb in each direction. The choice of basal regulatory domain size and placement was motivated by the location of histone modifications and measures of chromatin accessibility near the TSS of genes [224], and the maximum extension distance is based upon work showing that long-range distal enhancers can regulate expression of target genes up to 1 Mb away [135, 150]. All three parameters (basal upstream, basal downstream, and maximum extension distance) can be set by the user. In theory a single base pair could be associated with all genes on the chromosome using this metric, since each gene is guaranteed a basal regulatory domain. In practice, most base pairs are associated with at most two genes. The two nearest genes association rule extends each gene’s regulatory domain CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 39

a

( ) ( ) ( )

b

( ) ( ) ( )

c

( )( )( )

Figure 3.6: Computationally-defined gene regulatory domains. The transcription start site (TSS) of each gene is shown as an arrow. The corresponding regulatory domain for each gene is shown in matching color as a bracketed line. The association rule and relevant parameters used in a run of GREAT can be altered via the web interface prior to execution. (a) The basal plus extension association rule assigns a basal regulatory domain to each gene regardless of genes nearby (thick line). The domain is then extended to the basal regulatory domain of the nearest upstream and downstream genes. (b) The two nearest genes association rule extends the regulatory domain to the TSS of the nearest upstream and downstream genes. (c) The single nearest gene association rule extends the regulatory domain to the midpoint between this gene’s TSS and the nearest gene’s TSS both upstream and downstream. All regulatory domain extension rules limit extension to a user-defined maximum distance for genes that have no other genes nearby. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 40

from the TSS of the canonical isoform to the nearest upstream and downstream TSS (Figure 3.6b), up to 1 Mb in each direction. This association rule limits each base pair to be assigned to at most two genes. The single nearest gene association rule extends each gene’s regulatory domain from the TSS of the canonical isoform in each direction to the midpoint between the TSS and the nearest adjacent TSS (Figure 3.6c), up to 1 Mb in each direction. This association rule limits each base pair to be assigned to at most one gene. For well-studied genes with experimentally-detected distal regulatory elements (reviewed in [212]), we manually override the computationally-defined regulatory domains. GREAT version 1.2.6 uses experimentally-validated regulatory domains for SHH [135], genes in the β-globin locus [136], and LNP, EVX2, and HOXD10- 13 ([213]). Future releases of the tool will continue to refine regulatory domains as technological advances including three-dimensional conformation capture stud- ies [72, 139, 202] and radiation hybrid maps [175] further elucidate interactions be- tween regulatory DNA and its target genes.

3.3.3 Binomial test over genomic regions

To account for the length variability within gene regulatory domains, we implemented a binomial test over genomic regions that uses the fraction of the genome associated with each ontology term as the probability of selecting the term. The binomial test is executed separately for each ontology term π and is defined by three parameters:

1. n is the total number of genomic regions in the input set

2. pπ is the a priori probability of selecting a base pair annotated with π when selecting a single base pair uniformly from all non-assembly-gap base pairs in the genome

3. kπ is the number of genomic regions in the input set that cause annotation π to be selected CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 41

The test calculates the P value of the observed enrichment for term π as the prob- ability of selecting annotation π at least kπ times in n attempts using the formula below.

n n! X pi (1 − p )n−i (3.1) i π π i=kπ

The binomial test first maps each input genomic region to the left median base pair in its span, making it most appropriate for assessing enrichment of factors with narrow, precise peaks. The value of pπ is calculated for each ontology annotation π as the fraction of non-assembly-gap base pairs in the genome associated with annotation π. Each input genomic region can then be thought of as a ‘dart’ thrown at the genome, counting as a hit if the left median base pair is annotated with ontology term π. In this test, the length of each gene’s regulatory domain is explicitly accounted for in the calculation of pπ. This explicit use of regulatory domain size in the significance calculation provides a proper assessment of the enrichment for ontology terms by non- coding sequences. Importantly, since the binomial test incorporates the fraction of the genome assigned to each gene in the calculation of statistical significance, it is robust to variation in association rules and occasional misassignments of genomic regions to distal target genes. Ontology terms assigned to genes that have large regulatory domains are inherently downweighted so that each binding event associated with the term contributes less to the resulting enrichment than binding events associated with terms assigned to genes with small regulatory domains. However, enrichments under the binomial test may arise from clusters of non-coding regions all near one or a few genes with a particular ontology annotation, as well as from non-coding regions associating with many genes that possess a particular ontology annotation. The hypergeometric test over genes (described below) provides a measure of ‘term coverage’ that can be used to identify terms significant by the binomial test that have many annotated genes selected as well. CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 42

3.3.4 Hypergeometric test over genes

The hypergeometric test over genes identifies all genes whose regulatory domains possess one or more genomic regions from the input set and calculates enrichments over the genes with respect to the defined gene set using a hypergeometric distribution. More formally, the hypergeometric test is executed separately for each ontology term π, and is defined by four parameters:

1. N is the total number of genes in the genome

2. Kπ is the number of genes in the genome that possess ontology annotation π

3. n is the number of genes selected because one or more input genomic regions resides in their regulatory domains

4. kπ is the number of selected genes that possess ontology annotation π

The test calculates the P value of the observed enrichment for term π as the fraction of ways to choose n genes without replacement from the entire group of N genes such that at least kπ of the n possess ontology annotation π using the formula below.

   min(n,Kπ) Kπ N−Kπ X i n−i (3.2) N i=kπ n In particular, the hypergeometric test counts every gene only once even if it was picked by multiple genomic regions. Terms enriched by the hypergeometric test thus indicate a high ‘term coverage’ where a larger fraction of all genes annotated with the term are selected by the input genomic regions than expected by chance. An important aspect of the hypergeometric test is that the test is done entirely over genes, and the input genomic regions are only used to identify genes that are included in the list of n. Length variability within gene regulatory domains is not taken into account. Recent studies have shown that length variability in gene regulatory domains is present in common gene association definitions (e.g. single nearest gene or two nearest genes) due to the uneven distribution of genes within the genome, and CHAPTER 3. REGULATORY ELEMENT FUNCTIONAL ANALYSIS 43

leads to enrichment biases toward genes involved in development, morphogenesis, regulation, and signaling [220]. Though this bias should not be ignored (thus the binomial test described above), the hypergeometric test does provide utility due to its property of allowing a single gene only a fixed potential contribution to an ontology term enrichment.

3.3.5 Foreground/background hypergeometric test over ge- nomic regions

When a set of input genomic regions is selected from a superset of ‘background ge- nomic regions’ (e.g. the repetitive elements that have been exapted into functional roles selected from all repetitive elements in the genome [144]), one should consider whether the input genomic regions differ in functional composition from the entire set of background genomic regions as a whole. The foreground/background hyper- geometric test over genomic regions poses this statistical question by mapping all ontology annotations of each gene to all background genomic regions that lie within its regulatory domain, and then calculates enrichments over the input genomic regions with respect to the superset of background genomic regions using a hypergeometric distribution. Formally, the foreground/background hypergeometric test over genomic regions is executed separately for each ontology term π, and is defined by four pa- rameters:

1. N is the number of genomic regions in the background set

2. Kπ is the number of genomic regions in the background set that lie within the regulatory domain of some gene annotated with term π

3. n is the number of genomic regions in the foreground set

4. kπ is the number of genomic regions in the foreground set that lie within the regulatory domain of some gene annotated with term π

The test calculates the P value of the observed enrichment for term π using the hypergeometric equation shown above (Equation 3.2). Chapter 4

Dispensability of mammalian DNA

The abundance of CNEs within vertebrate genomes provokes the question of their importance as unique regulatory sequences. Experimental assays have demonstrated substantial redundancy in function [89] and overlapping expression domains of cis- regulatory elements [229, 15]. Furthermore, the variation in interspecies conserva- tion levels for demonstrated cis-regulatory sequences [237, 235] raises the question of whether the metric of sequence conservation is indicative of the functional importance of the sequence. This chapter addresses these questions by assaying functional impor- tance on an evolutionary timescale. By developing a computational methodology to detect genomic deletions on a genome-wide scale, we can both quantify the likelihood CNEs are lost within a species as well as identify individual CNE losses of potential functional interest.

In the lab, the cis-regulatory network seems to exhibit great functional redun- dancy. Many experiments testing enhancer activity of neighboring cis-regulatory elements show largely overlapping expression domains. Of recent interest, mice in which cis-regulatory ultraconserved elements were knocked out showed no obvious phenotype, further suggesting functional redundancy. Here, we present a global evo- lutionary analysis of mammalian conserved non-exonic elements (CNEs), and find strong evidence to the contrary. Given a set of CNEs conserved between several

44 CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 45

mammals, we characterize functional dispensability as the propensity for the ances- tral element to be lost in mammalian species internal to the spanned species tree. We show that ultraconserved-like elements are over 300-fold less likely than neutral DNA to have been lost during rodent evolution. In fact, many thousands of non-coding loci under purifying selection display near uniform indispensability during mammalian evolution, largely irrespective of nucleotide conservation level. These findings suggest that many genomic non-coding elements possess functions that contribute noticeably to organism fitness in naturally evolving populations. Comparison to other mammals finds at least 5% of the human genome evolv- ing under purifying selection [241]. Two-thirds of this genomic mass does not code for protein [56]. Strong sequence conservation among mammals and conservation to other vertebrates have repeatedly proven good indicators of gene cis-regulatory func- tion [179]. Ultraconserved elements are some of the genomic regions most strongly conserved during mammalian evolution, having resisted all changes between human, mouse, and rat, along contiguous stretches 200 base pairs (bp) [16]. Point mutations in these regions are under extreme purifying selection in the human population [112]. And yet, transgenic mice homozygous for four independent deletions of cis-regulatory ultraconserved elements are viable and fertile in the lab, and display no obvious phe- notypic abnormalities [2]. There are two compatible hypotheses that may explain this lack of phenotype. One suggests that ultraconserved element deletions do not affect organism viabil- ity because of functional redundancy among the large number of mammalian cis- regulatory loci [237]. The other posits that ultraconserved element deletions most likely lead to a selective disadvantage that either does not manifest itself under lab conditions [11], or is too subtle to be observed in the lab but is large enough that evolutionary pressures eliminate the mutation from the population [82]. We set out to investigate the extent to which the loss of non-exonic sequence under purifying selec- tion is tolerated in natural populations on an evolutionary timescale. While elements with complete functional redundancy may be dispensable, the loss of elements that provide unique functions may be deleterious and will not fix in natural populations. CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 46

By surveying genomic losses during the last 100 million years of eutherian (placen- tal mammal) evolution, we calculate evolutionary loss rate as a function of nucleotide conservation. We find that ultraconserved-like elements, along with many thousands of additional non-coding loci under purifying selection, are hundreds of fold less likely to be lost than neutrally evolving DNA. These results suggest that the vast majority of cis-regulatory sequences possess one or more unique functions whose loss is dele- terious. Furthermore, a very weak relationship between nucleotide conservation and evolutionary loss rate indicates that conservation %id (percentage of identical align- ment columns) is a poor measure of the relative importance of conserved non-exonic elements (CNEs).

4.1 Results

4.1.1 Ultraconserved-like element loss rates

To analyze the evolutionary pressures against CNE losses, we quantify the extent to which rodents have discarded sequences strongly conserved in multiple mammalian species. The computational pipeline to discover rodent-specific losses of mammalian conserved DNA is divided into three conceptual steps: discovery of strongly conserved mammalian sequence, identification of conserved sequence absent in rodents, and removal of assembly, alignment, and structural RNA migration artifacts (Figure 4.1). We use whole genome sequences of two rodents (mouse and rat), two primates (human and macaque), and a close carnivore outgroup species (dog). By examining sequences conserved between human, macaque, and dog, with no orthologous DNA in either mouse or rat, we infer by parsimony that the sequence losses were fixed in the rodent lineage prior to the mouse/rat split (Figure 4.2). Forty-six percent (1327 Mb) of the human genome aligns at 50%id or greater to macaque and dog. To avoid assembly and alignment artifacts, we examine only the 38.4% (1108 Mb) of the human genome alignable to macaque and dog that possesses unique orthologous sequence in both rodents. We rely on the UCSC chaining and netting algorithm to infer orthologous rodent sequences [115], and ignore regions CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 47

Human Genomic Regions Conserved in Macaque, Dog Human  macaQue  Rat Mouse Dog  1,327 Mb horsE (1,268 non-exonic) Multiple Possible Unique Orthologous Unique Rodent Ortholog? Rodent Matches Region in both Rodents

1,108 Mb (1,059 non-exonic) Lost in Both Lost in Rodents? Sequence Present In Mouse and Rat Either/Both Rodents

343 Mb (341 non-exonic)

Real Rodent Rodent Assembly Deletion is Rodent Genome Deletion Quality? Assembly Artifact

265 Mb (263 non-exonic) Found in Matches Any Rodent GenBank No Rodent Sequenced DNA? Sequence

Discard

261 Mb (259 non-exonic)

Figure 4.1: The computational pipeline used to discover rodent-specific genomic DNA losses. Total human sequence remaining after each step is displayed below icon. Human-dog-horse and mouse-rat-dog computations were performed the same way. CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 48

a b 23 0.1 subst/site 0.0188 20 My human human 0.0696 62

0.0298 macaque 23 macaque

0.0722 10 rat 23 rat 0.0111 0.2507 62

0.0855 mouse 23 0.1685 mouse dog 74

0.0330 0.0778 21 dog horse 74 horse

Figure 4.2: Phylogeny of species used to calculate pan-mammalian non-exonic loss rates. (a) Molecular evolution phylogenetic tree adapted from [149], showing rapid rodent evolution. (b) Evolutionary time phylogenetic tree adapted from [129], with updated rodent divergence given in [107], showing similar branch lengths for the primate and rodent pairs. where the chains and nets do not provide an unambiguous ortholog in both rodents. By excluding these regions from consideration we slightly bias against regions with paralogs. There are 218,824 distinct non-exonic sequences conserved at 90%id or more between human, macaque, and dog. Only 4646 of them (2.12%) resemble any other sequence in this set. We discard 7112 non-exonic sequences conserved at 90%id or more between human, macaque, and dog due to ambiguity. 273 (3.84%) of the 7112 possess a eutherian paralog, suggesting an ascertainment bias, which however affects only a small fraction of all non-exonic sequence we analyze. By sliding a 100 bp window over the alignment, we compute the overall level of primate-dog sequence conservation in the form of percentage of identical alignment columns (%id). The choice of 100 bp as the sliding window size provides enough resolution to identify distal functional elements and has proven effective in previous functional studies [217]. We then find all such windows that are completely lost in both rodents. By requiring complete window losses, we focus the study on losses of longer functional units, to directly compare to the ultraconserved element knockout experiments [2]. Smaller losses would eventually fall in the realm of individual binding site turnover, and potentially be subject to gap placement artifacts. In total, 261 Mb CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 49

of pan-eutherian sequence have no orthologous rodent DNA. We divide this set into coding and non-exonic (neither coding nor untranslated region [UTR]) windows, and compute the fraction of eutherian conserved bases lost in rodents per %id conservation bin. The non-exonic alignment, capturing 95.1% of all alignable bases, includes virtu- ally all mammalian cis-regulatory elements characterized to date. We define ultraconserved- like sequences as contiguous regions of 100 bp or more perfectly conserved between the primates and dog. The non-exonic ultraconserved-like sequences encompass 1,691,090 bp of the human genome in regions up to 1016 bp in length. Of these regions, only 1447 bp are identified as lost in rodents, in stretches no longer than 184 bp (Fig- ure 4.3).

Human  macaQue  Rat  Mouse  Dog  horsE Maximum Non-Exonic Element Length Lost in Mouse & Rat 1,600 bp

1,400 bp

1,200 bp

1,000 bp

800 bp

600 bp

400 bp

200 bp Deletion Length in Rodents in Length Deletion

0 bp 100 95 90 85 80 75 70 65 60 55 50 HQD Conservation [%id]

Figure 4.3: Maximum length of primate-dog non-exonic DNA lost in rodents. CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 50

Thus, the rodent loss rate of non-exonic ultraconserved-like elements is 1447/1,691,090 ≈ 0.086%. At 50%id–65%id, most aligning sequences are expected to be neutrally evolving [149]. Roughly a quarter of all neutrally evolving primate-dog non-exonic sequence (135,415,463 of 515,843,585 bp at 60%id) is lost in rodents (Figure 4.4a,e), making non-exonic ultraconserved-like elements over 300-foldHuman  more indispensable than macaQue  neutral sequence. Rat Mouse  Dog 

Human  Human  macaQue  macaQue Rat Rat  Mouse  Mouse  Dog  Dog  a b Human  Human-Macaque-Dog Non-Exonic Mouse-Rat-Dog Non-Exonic macaQue 600.0 Mb 600.0 Mb Rat  500.0 Mb 500.0 Mb Mouse  Dog  400.0 Mb 400.0 Mb

300.0 Mb 300.0 Mb

200.0 Mb 200.0 Mb

100.0 Mb 100.0 Mb

Total Bases Conserved Bases Total Total Bases Conserved Bases Total 0.0 Mb 0.0 Mb 100 95 90 85 80 75 70 65 60 55 50 100 95 90 85 80 75 70 65 60 55 50 HQD Conservation [%id] MRD Conservation [%id] c d Suggested Decomposition: Suggested Decomposition: Human-Macaque-Dog Non-Exonic Mouse-Rat-Dog Non-Exonic Neutral Evolution = Red Purifying Selection = Blue/Gray

All sequence = Royal

Total Bases Conserved Bases Total Total Bases Conserved Bases Total

100 75 50 100 75 50 HQD Conservation [%id] MRD Conservation [%id] e f Lost in Mouse & Rat Lost in Human & Macaque 30.0% 30.0%

25.0% 25.0% Over 300-fold difference 20.0% 20.0%

15.0% 15.0%

10.0% 10.0% Fraction Lost Fraction Fraction Lost Fraction 301 / 195,163 5.0% loci deleted 5.0%

0.0% 0.0% 100 95 90 85 80 75 70 65 60 55 50 100 95 90 85 80 75 70 65 60 55 50 HQD Conservation [%id] MRD Conservation [%id]

Figure 4.4: Indispensability of mammalian DNA under purifying selection. (a,b) Abun- dance of primate-dog, rodent-dog non-exonic DNA, respectively. (c,d) Suggested decomposition of above curves. Color scheme after [241]. (e) Fraction of primate-dog non-exonic DNA lost in rodents, where ultraconserved-like elements are over 300-fold more indispensable than neutral DNA. (f) Fraction of rodent-dog non-exonic DNA lost in primates. CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 51

4.1.2 Dispensability of neutral DNA

While we expect the neutrally evolving sequence loss rate to be essentially uniform, we observe a rodent tendency to retain more such sequences the less conserved they are (50%id–65%id range in Figure 4.4e). We show that this phenomenon stems from an enrichment of less conserved sequences immediately adjacent to coding exons. We annotate each 100 bp window of the least conserved alignable sequence (50%id– 65%id) with its distance to the nearest exon, rounded up to the nearest kilobase. We then compute the fraction of all non-exonic sequences aligned at 50%id–65%id for each distance (truncated at 20 kb). We repeat the analysis for highly-conserved sequences (85%id–100%id). The results (Figure 4.5a) show that poorly-conserved sequences are over-represented near exons. Elements within 1 kb of the nearest exon are markedly less likely to be lost in rodents (Figure 4.5b), suggesting that exons do shield flanking DNA from many deletions that result in gene structure disruption. In line with theoretical predictions [180], larger deletions are biased away from gene structures, and consequently away from the least conserved alignable sequence. Loss rates for sequences of decreasing maximum size show a monotonic decrease in slope magnitude (Figure 4.5c) as the probability to disrupt gene structures is lessened. A restriction to losses of at most 1000 bp shows a nearly constant rodent-specific neutral sequence loss rate (Figure 4.5d).

4.1.3 Rodent dispensability under extended branch lengths

As we move from 100% identical primate-dog sequences into less conserved align- ments, we expect the mixing of two populations: DNA under purifying selection, and the slowly mutating tail of neutrally evolving sequence (Figure 4.4c). This mixing, first observed genome-wide in the seminal mouse genome paper (Fig. 28 in [241]), must at least partially account for the increase in rodent loss rate we observe the fur- ther we move from 100% primate-dog identity (Figure 4.4e). Interestingly, intronic and intergenic sequences face similar evolutionary pressures against rodent-specific loss (Figure 4.6). To separate the loss rate of DNA under purifying selection from slowly-mutating CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 52

Human  macaQue  Rat  Mouse  Dog  horsE

a b 85 to 100% Conservation 50 to 65% Conservation Lost in Mouse & Rat 18.0% 30.0%

15.0% 2-10Kb Curves 25.0%

12.0% 20.0%

9.0% 15.0% Under 1Kb

6.0% 10.0% Fraction Lost Fraction

3.0% 5.0% Fraction at This Distance This at Fraction 0.0% 0.0% 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 100 95 90 85 80 75 70 65 60 55 50 Distance From Nearest Exon [Kb] HQD Conservation [%id] c d Lost in Mouse & Rat ≤ 1000bp Regions All Deletions ≤ 5000bp ≤ 2000bp ≤ 1000bp ≤ 750bp ≤ 500bp ≤ 250bp 30.0% 10.0% 9.0% 25.0% 8.0%

20.0% 7.0% 6.0% 15.0% 5.0% 4.0% 10.0%

Fraction Lost Fraction 3.0% Fraction Lost Lost Fraction 5.0% 2.0% 1.0% 0.0% 0.0% 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 100 95 90 85 80 75 70 65 60 55 50 HQD Conservation [%id] HQD Conservation [%id]

Figure 4.5: Relationship of neutrally-evolving mammalian DNA loss rates to mammalian genome structure. (a) Fraction of conserved non-exonic primate-dog elements as a function of distance from the nearest exon. (b) Rodent-specific loss rate of non-exonic elements as a function of distance from the nearest exon. Elements within 1 kb of the nearest exon are markedly less likely to be lost at all conservation levels. (c) Rodent-specific loss rate of low-conservation non-exonic regions as function of lost region size. (d) Rodent-specific loss rate of non-exonic regions at most 1 kb in size. CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 53

Human  macaQue  Rat  Mouse  Dog  horsE

a b Human-Macaque-Dog Intergenic Human-Macaque-Dog Intronic 300.0 Mb 300.0 Mb

250.0 Mb 250.0 Mb

200.0 Mb 200.0 Mb

150.0 Mb 150.0 Mb

100.0 Mb 100.0 Mb

50.0 Mb 50.0 Mb

Total Bases Conserved Bases Total Total Bases Conserved Bases Total

0.0 Mb 0.0 Mb 100 95 90 85 80 75 70 65 60 55 50 100 95 90 85 80 75 70 65 60 55 50 HQD Conservation [%id] HQD Conservation [%id] c d Lost in Mouse & Rat Lost in Mouse & Rat

30.0% 30.0%

25.0% 25.0%

20.0% 20.0%

15.0% 15.0%

10.0% 10.0%

Fraction Lost Fraction Fraction Lost Fraction

5.0% 5.0%

0.0% 0.0% 100 95 90 85 80 75 70 65 60 55 50 100 95 90 85 80 75 70 65 60 55 50 HQD Conservation [%id] HQD Conservation [%id]

Figure 4.6: Abundance and loss rate of intergenic and intronic primate-dog sequences. (a) Abundance of primate-dog intergenic DNA. (b) Abundance of primate-dog intronic DNA. (c) Fraction of primate-dog intergenic DNA lost in rodents. (d) Fraction of primate-dog intronic DNA lost in rodents. CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 54

neutrally evolving loci, we search for eutherian tree topologies that will increase the cumulative branch lengths between the three aligning species. The cumulative branch lengths of a human-dog-horse alignment are longer than the branch lengths of a human-macaque-dog alignment (Figure 4.2a), which suggests that neutrally-evolving sequence will align at a lower conservation %id in the human-dog-horse alignment. Forty-one percent (1182 Mb) of the human genome aligns at 50% or more to dog and horse; 35% (1009 Mb) has a unique homolog in both mouse and rat. Nested elements lost in rodents comprise 7.25% (209 Mb) of the human genome. The abundance of non-exonic sequence in the human-dog-horse alignment (Fig- ure 4.7a) shows a moderate shift in neutrally-evolving sequence toward lower conserva- tion %id values when compared to its human-macaque-dog counterpart (Figure 4.4a).

Human  macaQue Rat  Mouse  Dog  horsE 

a b Human-Dog-Horse Non-Exonic Lost in Mouse & Rat 600.0 Mb 30.0%

500.0 Mb 25.0%

400.0 Mb 20.0%

300.0 Mb 15.0%

200.0 Mb 10.0% Fraction Lost Fraction

100.0 Mb 5.0% Total Bases Conserved Bases Total 0.0 Mb 0.0% 100 95 90 85 80 75 70 65 60 55 50 100 95 90 85 80 75 70 65 60 55 50 HDE Conservation [%id] HDE Conservation [%id]

Figure 4.7: Abundance and loss rates of non-exonic sequences for the human-dog-horse alignment topology. (a) Abundance of human-dog-horse non-exonic DNA. (b) Fraction of human- dog-horse non-exonic DNA lost in rodents. y axis values at 80%id are shown as dashed-lines for visualization purposes.

The rodent-specific loss rates of highly-conserved sequence are lower in the human- dog-horse alignment (compare Figure 4.4e and Figure 4.7b), indicating that the ratio CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 55

of functional sequence to highly-conserved neutral sequence is larger in the human- dog-horse alignment than in the human-macaque-dog alignment. However, the slowly evolving tail of neutral sequences may still contribute to correlate loss events with sequence conservation in Figure 4.7b.

4.1.4 Dispensability of DNA under purifying selection

To investigate whether a correlation exists between conservation %id and dispensabil- ity for non-coding sequence under purifying selection, we must ensure the neutral rate of evolution is sufficiently large to separate sequence under purifying selection from the slowly mutating tail of neutrally evolving sequence [241]. To increase the neutral rate sufficiently we focus on mouse-rat-dog conserved elements deleted in the primate lineage. While the divergence time between human and macaque is comparable to that of mouse and rat, the two lineages exhibit markedly different substitution rates (Figure 4.2). Rodents have evolved much faster than primates, both prior to and following their respective splits [81]. Consequently, only 22.3% of the mouse genome (573 Mb) is spanned by regions alignable to rat and dog, and 20.5% (527 Mb) has a unique homolog in both human and macaque. Only 0.598% (15.3 Mb) of 100 bp alignment windows is additionally lost in both primates. By nearly doubling the neutral branch lengths of the three genomes we align, we expect the dominant curve of neutrally evolving eutherian DNA to shift dramatically toward lower conservation values. Indeed, the peak of the non-exonic abundance curve shifts from 65%id for primate-dog to well below the 50%id marker of reliable alignability for rodent-dog (Figure 4.4a–d). Based on the hypothesis that most losses of functional elements will not fix, we expect the curve of sequences under purify- ing selection to roughly retain its overall volume under increased branch lengths. Furthermore, the mass of sequence should remain at similar conservation values, as most tolerated mutations are the relative few that confer little to no fitness change. To test these predictions we examine the preservation of coding exons, the largest well-annotated genomic functional set, between our three tree configurations. With increasing branch lengths, we observe both the predicted volume preservation and CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 56

mild skew (Figure 4.8).

Human  Human  Human  macaQue  macaQue macaQue  Rat  Rat  Rat  Mouse  Mouse  Mouse  Dog  Dog  Dog  horsE horsE  horsE a b c Human-Macaque-Dog Coding Exons Human-Dog-Horse Coding Exons Mouse-Rat-Dog Coding Exons 20.0 Mb 20.0 Mb 20.0 Mb

15.0 Mb 15.0 Mb 15.0 Mb

10.0 Mb 10.0 Mb 10.0 Mb

5.0 Mb 5.0 Mb 5.0 Mb

Total Bases Conserved Bases Total

Total Bases Conserved Bases Total Total Bases Conserved Bases Total 0.0 Mb 0.0 Mb 0.0 Mb 100 95 90 85 80 75 70 65 60 55 50 100 95 90 85 80 75 70 65 60 55 50 100 95 90 85 80 75 70 65 60 55 50 HQD Conservation [%id] HDE Conservation [%id] MRD Conservation [%id]

Figure 4.8: Abundance of coding exon sequences for all three alignment topologies. (a) Abundance of primate-dog coding exon DNA. (b) Abundance of human-dog-horse coding exon DNA. (c) Abundance of rodent-dog coding exon DNA.

Thus, we conclude that most of the non-exonic 42.8 Mb at 80%id–100%id between rodents and dog in Figure 4.4b are under purifying selection. Of the 195,163 genomic loci in this set, only 301 (0.154%) are lost in the primate lineage (Figure 4.4f). By analyzing the relationship between conservation %id and loss rate, we see that the loss rate (non-exonic bp lost in primates/non-exonic bp conserved in rodent-dog) at 100%id is 0, at 80%id is 0.00122, and the least-squares best-fit line of the data has a slope magnitude of only 0.00006 (Figure 4.9a). The minuscule slope magnitude suggests pan-mammalian indispensability of nearly all non-exonic DNA under puri- fying selection, largely irrespective of substitution rate. Though we are restricted to sequence of 80%id or greater when examining the loss rate of non-exonic DNA under purifying selection to ensure the elements are functional, coding exon sequences con- served at lower %id can be analyzed. The primate loss rate of non-exonic sequence is also comparable to the primate loss rate of coding regions (0 at 100%id, 0.0000861 at 80%id, 0.00276 at 50%id), though the latter seems even more extreme (Figure 4.9b), probably, as above, because an exon loss will often lead to gene function loss. CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 57

Human  macaQue  Rat  Mouse  Dog  horsE

a b Non-Exonic Loss Rate Coding Exon Loss Rate 0.5% 0.5%

0.4% 0.4%

0.3% 0.3%

0.2% 0.2%

Best-fit slope = -0.00006 43,379

Fraction Lost Fraction Fraction Lost Fraction 0.1% 0.1% 12,764 2,820 314 110 1,326 0 147 346 0.0% 0 0.0% 100 98 96 94 92 90 88 86 84 82 80 100 95 90 85 80 75 70 65 60 55 50 MRD Conservation [%id] MRD Conservation [%id]

Figure 4.9: Functional DNA loss rates. (a) Fraction of rodent-dog non-exonic DNA lost in primates. (b) Fraction of rodent-dog coding exon DNA lost in primates. Total base pairs lost at 100%id, 95%id, 90%id, 85%id, and 80%id are displayed above corresponding loss rate points. Note the scale change on the x axis.

4.1.5 Deep sequence conservation and dispensability

By searching each eutherian element in our abundance curves (Figure 4.4a,b) in opos- sum, platypus, chicken, frog, and teleost fish, we trace the most ancient species diver- gence within the Euteleostomi (bony vertebrate) clade at which sequence homology of the element is retained [160]. Interestingly, the rodent-dog conserved regions, which at 83%id–100%id are presumably all functional, are dominated by regions aligning to chicken (amniote node, Figure 4.10a,b). The primate-dog conserved re- gions are qualitatively similar for conservation thresholds of 90%id–100%id. In both cases, the less-conserved sequence is predominantly conserved only to dog, further supporting the notion that the bulk of moderately conserved sequences is neutrally evolving [241, 185]. By tracing the depth of sequence conservation in the loss curves a clear dispensabil- ity ranking emerges (Figure 4.10c,d). Namely, the more deeply in the vertebrate tree a non-exonic element possesses sequence conservation, the less likely it is to be lost in either primates or rodents, at all levels of high conservation. Thus, depth of sequence conservation provides a stronger measure of indispensability than conservation %id alone. This finding reinforces the utility in sequencing additional non-mammalian CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 58

Human  macaQue  Rat Mouse  Dog 

Human  Human  macaQue  macaQue Rat Rat  Mouse  Mouse  Dog  Dog  a b Human Dog Opossum Platypus Chicken Frog Teleost Fish Dog Opossum Platypus Chicken Frog Teleost Fish macaQue 30.0 Mb 30.0 Mb Rat  25.0 Mb 25.0 Mb Mouse  Dog  20.0 Mb 20.0 Mb

15.0 Mb 15.0 Mb

10.0 Mb 10.0 Mb

5.0 Mb 5.0 Mb

Total Bases Conserved Bases Total Total Bases Conserved Bases Total 0.0 Mb 0.0 Mb 100 95 90 85 80 100 95 90 85 80 75 70 65 HQD Conservation [%id] MRD Conservation [%id] c d Human-Macaque-Dog Non-Exonic Mouse-Rat-Dog Non-Exonic Dog Opossum Platypus Chicken Frog Teleost Fish Dog Opossum Platypus Chicken Frog Teleost Fish 20.0% 0.4%

15.0% 0.3%

10.0% 0.2%

5.0% 0.1%

Fraction Lost Fraction Fraction Lost Fraction

0.0% 0.0% 100 95 90 85 80 100 95 90 85 80 HQD Conservation [%id] MRD Conservation [%id]

Figure 4.10: Deep sequence conservation and indispensability of mammalian DNA. (a) Abundance of primate-dog non-exonic DNA decomposed by conservation depth. Note x axis scale and truncated dog conservation depth (up to 66 Mb at 80%id). (b) Abundance of rodent-dog non-exonic DNA decomposed by conservation depth. Note x axis scale and truncated dog conser- vation depth (up to 61 Mb at 65%id). (c) Fraction of primate-dog non-exonic DNA lost in rodents decomposed by conservation depth. (d) Fraction of rodent-dog non-exonic DNA lost in primates decomposed by conservation depth. Note the scale change on the y axis. CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 59

species [77]. Repeating the analysis after adding the lizard genome draft (anoCar1) at the amniote node yields qualitatively similar results (data not shown).

4.1.6 Robustness of loss event predictions

Mutational hotspots and positive selection can cause orthologous regions to be mis- labeled as lost sequence. It is currently hard to estimate the magnitude of this phe- nomenon in non-exonic sequences, as functional non-exonic elements are identified largely via comparative methods. Coding sequences, the largest well-annotated set of functional elements in the genome, are much better characterized. Consequently, we examine mislabeled coding exon losses to estimate the false positive rate. Rodent exons that preserve gene structure but are unalignable to their human ortholog cause mislabeled coding exon losses (e.g., Figure 4.11).

MDQADDVSCPLLNSCLSES SPVVLQCTHVTPQRDKSV VVCGSLFHTPKFVK GRQTPKHISESLGAEVDPDMSW +| ||+ | | |||||| ||+ |+|| ||+| | || ||||+||| + | |||| |||||| |||||||| VDPVVDVASPPLKSCLSES SPLTLRCTQAVLQREKPV VVSGSLFYTPKLKE G-QTPKPISESLGVEVDPDMSW

Figure 4.11: Mislabeled coding exon loss arising from accelerated evolution. Red arrow indicates a coding exon of BRCA2 not aligned at the nucleotide level between human and mouse, evident by lack of an overlapping net block [115] in both human and mouse. The corresponding amino acid alignment below clearly shows this exon to be orthologous between the two species.

For each element labeled as a coding exon loss we examine the orthologous ge- nomic region in both mouse and rat for evidence of an exon annotated by the UCSC Genome Browser knownGene table of the species. Only 4.8% (27/567) of rodent- specific coding exon losses possess an unalignable rodent exon that preserves gene CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 60

structure. Interestingly, nearly half (12/25) of the genes harboring these exons have been associated with recent positive selection in human or rodents (Table 4.1).

Table 4.1: Alignment-based misannotation of orthologous human and mouse coding ex- ons.

Gene Name Hg18 Coordinates Positive Selection Evidence ACYP1 chr14 74,591,998 74,599,180 AHRR chr5 426,185 446,466 [102] BRCA2 chr13 31,797,502 31,798,307 [58] C13orf16 chr13 110,790,405 110,793,769 C19orf24 chr19 1,228,300 1,230,106 C21orf13 chr21 39,700,148 39,700,266 C6orf57 chr6 71,330,352 71,333,684 CASP1 chr11 104,408,414 104,409,933 [240] CCDC27 chr1 3,659,702 3,661,434 CCDC47 chr17 59,178,315 59,183,023 CCL25 chr19 8,027,414 8,033,639 Immune system response* DKFZp762E1312 chr2 234,409,533 234,411,309 DKFZp762E1312 chr2 234,419,906 234,422,794 FKSG24 chr19 18,167,042 18,168,552 HTN3 chr4 70,932,875 70,933,531 [196] ICOSLG chr21 44,474,747 44,479,569 [165] IL4 chr5 132,043,372 132,045,491 [231] IL4I1 chr19 55,084,643 55,084,800 [44] MKI67IP chr2 122,202,466 122,204,891 [30] MUC13 chr3 126,129,075 126,129,273 MUM1 chr19 1,311,947 1,315,329 Immune system response* PLB1 chr2 28,719,370 28,720,788 PTPN13 chr4 87,920,257 87,921,128 TCL1B chr14 95,221,984 95,223,032 TNFRSF14 chr1 2,476,243 2,480,078 Immune system response* TPRX1 chr19 52,994,285 53,003,834 [27]

*Immune system response genes are frequently under a rapid mutation rate [231].

4.2 Discussion

Recent lab experiments in mice [170, 2, 242, 59] and early theoretical predictions [254] suggest great redundancy and flux among regulatory genomic elements. We examined dispensability on an evolutionary timescale to assess the importance of regulatory elements in a manner much more sensitive to small fitness effects not measurable in the lab. Sequences perfectly conserved in multiple species show a strong selective pressure against being lost in rodents. We conclude that in all likelihood the four CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 61

ultraconserved element knockouts [2] also suffer an evolutionarily noticeable fitness loss that either is too small to be detected (e.g., 1% change), or does not manifest itself under lab conditions (where air is filtered, predators are absent, and water, food, and mate are served). The regions immediately flanking exons are enriched for poorly-conserved se- quence. Though cis-regulatory modules have been shown to cluster preferentially near transcription start sites [25], it is possible that distal regulatory elements prefer- entially occur away from exons to provide modularity and flexibility for regulatory ele- ments and exons to evolve under their different evolutionary pressures. Alternatively, the presence of nearby strongly-conserved exons may encourage a disproportionate fraction of poorly-conserved sequence to align [181]. The failure to identify a measurable phenotype in the ultraconserved knockout mice [2] is unexpected, given that the deleted ultraconserved elements are verified enhancer elements near ancient genes (Dmrt1, Dmrt2, Dmrt3, Pax6, Arx, Sox3, all conserved to teleost fish) that produce marked phenotypes when inactivated or ex- pression levels are significantly altered [2]. However, these results do have an ex- perimental analog in coding sequences. Our limited ability to assay for phenotypic changes in the lab can allow many deleterious effects to go unreported. To assess the prevalence of published mouse gene knockout experiments in which no measurable phenotype was identified, we examined results published in the Mouse Genome Infor- matics database [79]. The database contains 4234 genes annotated with phenotypes resulting from knockout experiments. 510 genes (12.0%) give rise to no measurable phenotype in at least one knockout experiment. Of these, 284 genes (6.7%) have no measurable phenotype for any knockout experiments. The set of genes for which no measurable knockout phenotype was identified includes the ancient genes Gli1, Hoxa7, and Dach2, which are conserved to Tetraodon nigroviridis (bony vertebrates), Xenopus tropicalis (tetrapods), and Gallus gallus (amniotes), respectively. These numbers also constitute an (potentially large) underestimate, as many journal edi- tors hesitate to publish lack of phenotype results. Estimates of the fraction of genes in which knockout experiments fail to identify a measurable phenotype run as high CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 62

as 20% of all genes [11]. Complete functional redundancy may be quite rare, how- ever, as a study in yeast recently showed a quantifiable mutant phenotype in some environment for nearly all genes that produce no phenotypic consequences in rich medium [96]. Similarly, multiple regulatory elements with an experimentally-verified overlap in functionality [89] or overlapping expression domains during a particular developmental stage [229, 15] suggest some functional redundancy exists among cis- regulatory elements, but does not imply that these elements are completely redun- dant. The demonstrated resistance of non-exonic sequences under purifying selection to deletion on an evolutionary timescale suggests in fact that complete functional redundancy of cis-regulatory elements is rare. In this context of partial redundancy, small fitness losses undetectable by conventional lab assays may be expected. An initial functional exploration of both more and less highly conserved cis- regulatory elements under purifying selection shows that many drive expression in similar precise patterns, irrespective of exact conservation level [237]. Our evolution- ary analysis corroborates the notion of importance of cis-regulatory elements under purifying selection, regardless of conservation %id. By looking beyond rodent losses of primate-dog sequence to three mammalian configurations of differing total branch lengths (Figure 4.7 and Figure 4.8), we observe a correlation shared between rodent and primate losses that strongly suggests the mixing of loci under purifying selec- tion with neutrally evolving ones is mostly responsible for higher loss rates as the conservation %id is lowered. This observation suggests that deletion of the majority of DNA elements under purifying selection leads to a fitness loss that is noticeable on an evolutionary timescale. If so, most such deletions, at a wide range of fitness effects, will be swept out of the population over time [82]. Indeed, our evolutionary analysis shows that when mammalian DNA under purifying selection is isolated using sufficient branch lengths, most loci in this set (of many thousands) are nearly equally indispensable over millions of years of evolution. In 1977, Allan Wilson and colleagues proposed that the rate of protein evolution depends on a combination of the probability that a substitution will be compatible with protein function and on the dispensability of the protein to the organism [245]. A rich body of protein literature in several organisms has since validated Wilson’s CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 63

claim by showing that loss rate correlates with essentiality more than substitution rate does [173]. Our evolutionary analysis suggests a similar scenario in cis-regulatory elements under purifying selection. In further analogy to proteins, the dispensability of cis-regulatory elements under purifying selection increases minimally as conser- vation across species is lowered. Nearly all cis-regulatory elements under purifying selection that make any meaningful contribution toward organism fitness (large or small) are indispensable, regardless of exact inter-species conservation level. No pri- mate losses occur in non-exonic sequences conserved at 97%id or higher in rodents and dog, which precludes analysis of the exact relationship between conservation %id and dispensability of these sequences. However, the relationship observed in the other species configurations (Figure 4.7) indicate this is probably a consequence of the small loss rate of primates in general, rather than a unique property of perfectly conserved sequences. In this study, we rely on sequence similarity methods to identify functional ho- mologs across species. Sequences that possess redundant functionality but lack se- quence homology may not be quantified using these methods. In particular, the ability of vertebrate cis-regulatory sequences to shuffle individual binding sites [200] and our inability to align many orthologous non-coding RNAs [227] may be contributors to false positive losses of non-exonic sequences, and thus may partially account for dif- ferences in the loss rates of non-exonic and coding sequences. Since this phenomenon would increase our estimates of the loss rates of non-exonic elements, the small loss rates of non-exonic sequences we report may even be slightly conservative. By including sequence conservation depth information in the analysis of mam- malian cis-regulatory elements under purifying selection, we achieve additional pre- dictive power of the dispensability of an element. Regulatory elements that possess sequence homology in distant species are less dispensable than those that are only found in close species, at all levels of conservation %id. While low coverage genomes are useful for extending branch-lengths of individual elements, the forthcoming avail- ability of over a dozen high coverage mammalian genomes [90] will allow us to deploy the presence/absence measure genome-wide to more accurately highlight indispens- able genomic regions evolving at a range of rates. CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 64

Our work serves to provide the evolutionary record of mammalian ultraconserved and highly conserved element indispensability. Additional work will be needed to better understand the extremes at which some genomic loci are conserved, as well as the manifested irreducibility of the non-coding portions of our genome.

4.3 Methods

4.3.1 Identifying rodent-specific losses of mammalian con- served DNA

The procedure we next describe is summarized in Figure 4.1. Five UCSC genome assemblies were used throughout this study: human (hg18), macaque (rheMac2), mouse (mm8), rat (rn4), and dog (canFam2). The phylogenetic relationships and evolutionary distances between the species are shown in Figure 4.2. To identify primate-dog conserved DNA, we created a multiple alignment of hu- man, macaque, and dog using MULTIZ [26] on syntenic human-macaque and human- dog pairwise alignments generated by running the BLASTZ [204] and UCSC chaining and netting tools [115]. Each 100 bp window within the multiple alignment was an- notated for conservation %id (the number of bases perfectly conserved between the three species), recorded in human coordinates. Overlapping windows of the same conservation %id were merged into single genomic intervals. Alignment reliability concerns limited the set to three-way alignments of 50%id or more. Roughly 46.0% (1327 Mb) of the human genome was categorized as alignable at this threshold. The pairwise alignment files from human to mouse and from human to rat iden- tified conserved elements potentially lost in the rodent lineage. To guarantee loss of orthologous rodent DNA, we required losses only from pairwise alignments that were one-to-one mappings, based on the UCSC chains and nets [115], leaving 1108 Mb of human, macaque, and dog conserved sequence as candidates for rodent loss. We identified rodent losses as the intersection of gaps from human to mouse and from human to rat within the conserved regions defined above. Only elements of 50 bp or larger (corresponding to indel mediated alignment windows of 100 bp at 50%id) were CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 65

retained. Rodent sequence assembly gaps have similar signatures to true rodent-specific losses, as they appear in pairwise alignments to human as stretches of human sequence that have no rodent orthologs. We removed all rodent assembly gaps from further consideration. We ran BLAT [114] on all remaining candidate losses to discover and remove false positives arising from alignment artifacts and structural RNA migration since the rodent.primate split. To further ensure that the elements were truly missing from the rodent lineage, we ran the more sensitive BLASTN [177] of the human sequence within candidate rodent-specific losses against a database of rodents that included mouse NCBI Build 36.1, the Celera Genomics whole-genome mouse shot- gun assembly, the mouse MGSC v3 assembly, and the rat RGSC v3.4 assembly. We removed elements with significant similarity (E-score ≤ 10−10) found in any of the four databases scanned. The stringent BLAST significance threshold identifies a very limited amount of additional sequence (0.25%) beyond that discovered by the con- servative BLAT screening. Further relaxing the significance level used to exclude sequences runs the risk of preferentially excluding sequences that have diverged par- alogs elsewhere in the genome. The resultant elements were classified as lost in rodents, and comprised 9.04% (261 Mb) of the human genome. We identified paralogs in the following manner: for each non-exonic sequence conserved at 90%id or more between human, macaque, and dog, we look for a human- to-human self alignment of at least 50 bp to a different region of the human genome. If that similar sequence is also conserved at 90%id or more between human, macaque, and dog, we infer that the original sequence has a eutherian paralog. We annotated portions of both the primate-dog conserved DNA and the subset lost in rodents as non-exonic. Nonexonic elements at a particular conservation %id c were identified by taking the union of all 100 bp windows of conservation %id c that do not overlap any portion of the reference genome annotated as coding exons or UTRs. Other regions of primate-dog conservation and rodent loss were labeled as coding exon elements. Coding exon elements at a particular conservation %id c consist of the union of all 100 bp windows of conservation %id c that overlap at CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 66

least 50 bp of a coding exon. All gene annotations used to identify both coding exon and non-exonic elements were based upon the UCSC Genome Browser knownGene table [99] of the human genome assembly. The loss rate at a particular conservation %id c was calculated as the total number of base pairs in rodent-specific loss elements of conservation %id c divided by the total number of base pairs in primate-dog conserved elements of conservation %id c. Nonexonic and coding exon loss rates were calculated using only elements in the appropriate classes in both the numerator and denominator of the equation. Repeating our entire analysis using both smaller and larger window sizes of 75 bp and 150 bp yields very similar results (data not shown). Losses within the rodent lineage of elements conserved between human, dog, and horse were processed in an identical manner to the primate-dog elements, using the UCSC horse (equCab1) genome assembly in place of macaque.

4.3.2 Identifying primate-specific losses of mammalian con- served DNA

Losses within the primate lineage of elements conserved between mouse, rat, and dog (Figure 4.2) were processed in a manner analogous to rodent-specific losses. Elements were identified in mouse coordinates. The BLASTN step was run against a database of primates that included the human NCBI Build 36.1, the Celera Genomics whole- genome human shotgun assembly, and the chimpanzee CSAC Build 2 v1.

4.3.3 Deep sequence conservation and dispensability

Pairwise syntenic genome alignments of human sequence conserved in macaque and dog with the following UCSC genomes were generated: opossum (monDom4), platy- pus (ornAna1), chicken (galGal3), frog (xenTro2), zebrafish (danRer4), tetraodon (tetNig1), fugu (fr2), and stickleback (gasAcu1). Each non-exonic 100 bp window within a primate-dog conserved region was tested for overlap of 50 bp or more with each pairwise alignment. Window conservation depth was assigned as the most distant species/clade for which the overlap criteria was CHAPTER 4. DISPENSABILITY OF MAMMALIAN DNA 67

fulfilled; the distance metric has teleost fish most distant (bony vertebrate ancestry), followed by frog (tetrapod), chicken (amniote), platypus (base of the mammalian lineage), opossum (therian. marsupial and placental.mammals), and dog (placental mammal specific). Windows satisfying the overlap criteria for one or more of the pairwise alignments to zebrafish, tetraodon, fugu, or stickleback were labeled with teleost fish conservation depth. Windows that did not satisfy the overlap criteria for any of the pairwise alignments were labeled with dog (eutherian) conservation depth. All 100 bp windows of the same conservation %id and conservation depth were merged into nonoverlapping elements, and the total abundance of sequence plotted (Figure 4.10a). The conservation depth of rodent-dog conserved regions (Figure 4.10b) was deter- mined using pairwise syntenic alignments from mouse to the same species mentioned above, using the same overlap criteria. Loss rates (Figure 4.10c,d) were calculated by dividing the number of bases lost of a particular conservation depth and conservation %id by the total number of bases of that conservation depth and conservation %id. Chapter 5

Human CNE losses and trait evolution

In Chapter 4 I showed that most sequences conserved in multiple species resist dele- tion in other extant species, arguing against complete functional redundancy of CNEs. Nonetheless, evidence of individual dramatically conserved CNEs lost uniquely in the rodent lineage were discovered. Here I extend the deletion discovery methodology to detect sequence losses unique to humans while remaining robust to assembly errors by incorporating data from individual sequence reads to support deletion calls. By integrating the GREAT cis-regulatory element analysis methodology I identify classes of genes for which CNEs have been preferentially deleted, providing candidates for experimental characterization.

Humans differ from other animals in many aspects of anatomy, physiology, and behavior; however, the genotypic basis of most human-specific traits remains un- known [233]. Recent whole-genome comparisons have made it possible to identify genes with elevated rates of amino acid change or divergent expression in humans, and non-coding sequences with accelerated base pair changes [35, 117, 182, 186]. Regulatory alterations may be particularly likely to produce phenotypic effects while preserving viability, and are known to underlie interesting evolutionary differences in other species [123, 39, 43]. Here we identify molecular events particularly likely

68 CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 69

to produce significant regulatory changes in humans [153]: complete deletion of se- quences otherwise highly conserved between chimpanzees and other mammals. We confirm 510 such deletions in humans, which fall almost exclusively in non-coding regions and are enriched near genes involved in steroid hormone signaling and neural function. One deletion removes a sensory vibrissae and penile spine enhancer from the human androgen receptor (AR) gene, a molecular change correlated with anatomical loss of androgen-dependent sensory vibrissae and penile spines in the human lin- eage [163, 95]. Another deletion removes a forebrain subventricular zone enhancer near the tumor suppressor gene growth arrest and DNA-damage inducible, gamma (GADD45G) [251, 252], a loss correlated with expansion of specific brain regions in humans. Deletions of tissue-specific enhancers may thus accompany both loss and gain traits in the human lineage, and provide specific examples of the kinds of regu- latory alterations [123, 39, 43] and inactivation events [171] long proposed to have an important role in human evolutionary divergence.

5.1 Results

5.1.1 Computational identification of conserved sequences deleted in humans

To discover human-specific deletions (hDELs) on a genome-wide scale (Figure 5.1a), we identified regions of the chimpanzee genome [222] with a clear macaque orthologue, but no closely related sequence in humans (Appendix Section B.1.1). This iden- tified 37,251 ancestral primate sequences lost in humans, spanning 34.0 megabases (Mb; 1.17%) of the chimpanzee genome. To find deletions most likely to produce functional consequences, we also identified highly conserved sequences spanning 70.0 Mb (2.41%) of the chimpanzee genome, producing a conservative estimate of pan- mammalian sequence under purifying selection in chimpanzee [222]. The intersection of these deletion and conservation surveys identified 583 regions with high sequence conservation that are surprisingly deleted in humans, which we term hCONDELs (Table B.2). CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 70

a Human deletions (hDELs) Chimp conserved sequence Chimp chr14 25,938,400 25,938,800 Chimp chr14 25,938,400 25,938,800 hDELs Conserved sequence Macaque alignment Conservation Human macaque mouse alignment dog 37,251 deletions opossum (34.0 Mb) 1,160,828 regions (70.0 Mb) Human deletions that contain chimp conserved sequence (hCONDELs) Chimp chr14 25,938,400 25,938,800 hDELs Conserved sequence hCONDELs Conservation human macaque mouse dog opossum Human alignment 583 deletions (4.0 Mb)

b1 22 2 21 3 20 4 19 5 18 6 17 7 16 8 15 9 14 10 13 11 12 X Y

c 140 Mean: 6,126 bp 120 Median: 2,804 bp 100

80

60

40

Number of hCONDELs 20

0 1 5 10 15 20 ≥25 hCONDEL size (kb)

Figure 5.1: Hundreds of sequences highly conserved between chimpanzee and other species are deleted in humans. (a) Computational approach used to discover human-specific deletions of functional DNA: identification of ancestral chimpanzee genomic sequences deleted in human; discovery of chimpanzee genomic sequences highly conserved in other species; and detec- tion of human-specific deletions that remove one or more chimpanzee conserved sequences. Total chimpanzee sequence identified in each step is displayed beneath each graphic. (b) Human genomic locations of the 583 hCONDELs. hCONDELs are displayed as blue ticks above the many locations where they are missing. (c) Size distribution of hCONDELs. CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 71

Of the 583 predicted hCONDELs, 510 (87.5%) were validated independently by single human sequence reads spanning both sides of the deletions (Appendix Sec- tion B.1.5). Of the sequence-validated hCONDELs, 85% seemed fixed in human trace archives. Experimental amplifications across human diversity panels confirmed human deletion of 39/39 randomly chosen hCONDELs, and showed fixation of 31/32 hCONDELs also fixed in the trace archives (Table B.3–B.5). Of the sequence- validated hCONDELs, 88% are missing from the draft Neanderthal genome [91] (Appendix Section B.1.7), in agreement with the length of the common lineage be- tween humans and Neanderthals following the divergence with chimpanzee. The 583 hCONDELs cover 3.96 Mb (0.14%) of the chimpanzee genome, remove an average of 95 base pairs (bp) of conserved sequence, and are found on every nuclear chromosome except Y, where macaque sequence is unavailable (Figure 5.1b). hCONDELs have a median size of 2,804 bp, and show a skew towards G/C-poor regions compared to chimpanzee genome-wide averages (Figure 5.1c and Figure B.2). We find no evidence for enrichment in areas with high recombination (P > 0.9) or pericentromeric and subtelomeric regions (P > 0.5) (Appendix Section B.1.8). Thus, loss of sequences is unlikely to be a secondary consequence of higher mutation rates in such regions, a possibility that has been debated for the accelerated sequence changes seen in other human non-coding elements [182, 186, 76]. Only 1 of 510 sequence-validated hCON- DELs removes a protein-coding region, corresponding to a previously known 92 bp deletion in CMAH [51]. All other sequence-validated hCONDELs map to non-protein coding regions of the genome (355 intergenic and 154 intronic). Similar highly con- served non-coding sequences frequently correspond to regulatory elements controlling expression of nearby genes [179]. Whereas hCONDELs cannot be tested directly for evidence of positive human selection, we can examine whether hCONDELs are pref- erentially located near genes with particular functions, using the Genomic Regions Enrichment of Annotations Tool (GREAT) [154]. We performed enrichment analysis using simulations to explicitly account for the overrepresentation of highly conserved elements near particular classes of genes (Appendix Section B.1.11). This analysis indicates that hCONDELs are significantly enriched near genes involved in steroid hormone receptor signaling and neural function (Table 5.1), and near genes encoding CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 72

fibronectin-type-III- or CD80-like immunoglobulin C2-set domains (Table B.10).

Table 5.1: Summary of annotation enrichments of hCONDELs.

Ontology: Term Expected Observed Fold Binomial hCONDELs hCONDELs enrich P-value [154] GO Molecular Function: Steroid hormone receptor activity 4.7 14 2.96 3.7 × 10−4 Gene: Neural genes 141.3 180 1.27 1.1 × 10−4 MGI Expression in Theiler Stage 21: Hindbrain 49.9 79 1.58 3.4 × 10−5 Cerebral cortex 42.1 68 1.62 7.0 × 10−5 Brain, ventricular layer 29.9 52 1.74 9.4 × 10−5 Midbrain 30.5 52 1.70 1.6 × 10−4 InterPro Protein Domains: Fibronectin, type III 16.7 34 2.03 1.0 × 10−4 CD80-like, immunoglobulin C2-set 2.1 8 3.84 1.4 × 10−3

Showing only non-redundant terms enriched after accounting for multiple testing and for conserved elements to be found near particular classes of genes (Appendix Section B.1.11).

To compare sequences lost in other lineages, we used a similar computational approach to identify conserved sequences lost specifically in chimpanzee or mouse (termed cCONDELs and mCONDELs, respectively). We identified 344 cCONDELs and 350 mCONDELs validated by sequence reads spanning predicted deletions (Ap- pendix Section B.1.5). cCONDELs and mCONDELs show enrichment for synapse and glutamate receptors (Table B.11) and metal ion binding (Table B.12), respec- tively, but not for the categories found near hCONDELs.

5.1.2 Experimental characterization of an AR enhancer deleted in humans

To explore further the possible functions of hCONDELs near steroid hormone signal- ing genes, we examined a 60.7 kilobase (kb) human deletion flanking the androgen receptor (AR) locus (Figure 5.2a) in detail. AR is required for response of tissues to circulating androgens, and thus represents a strong candidate locus for characteristic changes of secondary sexual traits known in the human lineage [68, 143]. Within this human-specific deletion lies a ≈ 5 kb region containing highly conserved non-coding CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 73

sequences. We cloned the corresponding 4,839 bp chimpanzee and 7,654 bp mouse re- gions and tested their capacity to drive expression of an hsp68 (also known as Hspa1b) basal promoter-lacZ reporter gene during normal mouse development. Chimpanzee and mouse constructs both drove consistent lacZ expression in the facial vibrissae and genital tubercle of five or more independent transgenic embryos (Figure 5.2b–g), and the mouse sequence also drove expression in hair follicles. lacZ expression was specifically located in the mesoderm surrounding vibrissae follicles, and in the superficial mesoderm within the presumptive glans of the devel- oping genital tubercle (Figure 5.2d,e), a site of known AR expression (Figure 5.2h). Four stable mouse lines all showed expression in the superficial tissue underlying epidermal spines of the penis (Figure 5.2i). Previous studies have shown that AR is expressed in mesenchyme surrounding developing epithelial structures [60] and is required for normal development of vibrissae and penile spines (see below). These results indicate that the human deletion removes a conserved enhancer sequence that directs expression in a spatially-restricted subset of the endogenous AR expression pattern. The chimpanzee sequence also drives significant reporter gene expression when tested in human foreskin fibroblasts, indicating that upstream pathways regu- lating the enhancer are still intact in humans (chimpanzee activity 2.2–5.9-fold greater than human deletion/basal control vector, P < 10−7). Interestingly, humans show obvious morphological differences at the anatomical locations controlled by the enhancer. Sensory vibrissae develop in many mammals including chimpanzees, macaques and mice [163]. In contrast, humans lack sensory vibrissae [163]. Vibrissae development is clearly androgen-responsive, as castration shortens vibrissae in mice, and excess testosterone increases vibrissal growth [104] (Figure 5.2j). Profound changes have also evolved in the genitalia of humans compared to other animals. Many mammals have keratinized epidermal spines overlying tactile receptors in the glans dermis [95, 68]. Penile spine growth is androgen-dependent, as primates lose spines upon castration, and treatment with exogenous testosterone restores spine formation [67] (Figure 5.2k). Mice with AR protein-coding mutations fail to form pe- nile spines [164], confirming an essential role for AR in penile spine development. Our CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 74

a Chimpanzee chrX 67,200,000 67,600,000 AR hCONDEL OPHN1

60.7 kb hCONDEL.569 Chimp construct Mouse construct Sequence conservation human macaque mouse dog opossum b c f g

d h

e i

j Mouse vibrissae k Galago spines l 35 (4th cycle) Androgen Penile Receptor Vibrissae spines Human 30 Chimp + + 25 Length (mm) Macaque + +

0 Mouse + +

Excess Castrated Castrated Castrated Normal male testosteroneNormal male + testosterone

Figure 5.2: Transgenic analysis of a chimpanzee and mouse AR enhancer region missing in humans. (a) Top panel: 1.1 Mb region of the chimpanzee X chromosome. The red bar shows the position of a 60.7-kb human deletion removing a well-conserved chimpanzee enhancer between the AR and OPHN1 genes. Bottom panel: multiple species comparison of the deleted region, show- ing sequences aligned between chimpanzee and other mammals. Blue and orange bars represent chimpanzee and mouse sequences tested for enhancer activity in transgenic mice. The chimpanzee sequence drives lacZ expression in (b), facial vibrissae (arrows), and (c), genital tubercle (dotted line) of E16.5 mouse embryos. Histological sections reveal strongest staining in the superficial mes- enchyme of (d), the prospective glans of the genital tubercle, and (e), the dermis surrounding the base of sensory vibrissae. The mouse enhancer also drives consistent expression in (f), facial vibris- sae, (g), genital tubercle, and hair follicles of E16.5 embryos. (h) Endogenous AR is expressed in the genital tubercle (dotted line) as demonstrated by in situ hybridization. (i) Histological section of a 60-day-old transgenic mouse penis showing postnatal lacZ expression in dermis of penile spines. Vibrissae and penile spines are androgen-dependent, as shown by (j), changes in vibrissae length in castrated and testosterone-treated mice and (k), loss and recovery of penile spines of a castrated and testosterone-treated primate (Galago crassicaudatus) (modified from [104] and [67], respectively). (l) Model depicting conserved tissue-specific enhancers (colored shapes) surrounding AR coding sequences (black bars) of different species. Loss of an ancestral vibrissae/penile spine enhancer in humans is correlated with the corresponding loss of vibrissae and penile spines in humans. CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 75

results show that humans have lost an ancestral penile spine enhancer from the AR locus. Humans also fail to form the penile spines commonly found in other animals, including chimpanzees, macaques, and mice [95, 68, 164] (Figure 5.2l). Simplified penile morphology tends to be associated with monogamous reproductive strategies in primates [68]. Ablation of spines decreases tactile sensitivity and increases the duration of intromission, indicating their loss in the human lineage may be associated with the longer duration of copulation in our species relative to chimpanzees [68]. This fits with an adaptive suite, including feminization of the male canine dentition, moderate-sized testes with low sperm motility, and concealed ovulation with perma- nently enlarged mammary glands [68], that suggests our ancestors evolved numer- ous morphological characteristics associated with pairbonding and increased paternal care [143].

5.1.3 Experimental characterization of a GADD45G enhancer deleted in humans

Sensory vibrissae and penile spines are examples of morphological structures lost in humans. However, deletion of enhancers could also be associated with tissue ex- pansion, by removing regulatory sequences from genes that control developmental patterns of cell proliferation, death or migration. One of the most dramatic tissue expansions in human evolution is expansion of the cerebral cortex [233]. Using a curated resource of gene expression domains during mouse development, we find that hCONDELs are significantly enriched near genes expressed during cortical neuroge- nesis (P < 7.0 × 10−5, Table 5.1). Within this set, hCONDELs are preferentially located near genes acting as suppressors of cell proliferation or migration (P = 0.003) (Appendix Section B.1.14). To test the function of one such hCONDEL, we analysed further a deletion re- moving a 3,181-bp region located next to the tumor suppressor gene growth arrest and DNA-damage-inducible, gamma (GADD45G) (Figure 5.3a). This hCONDEL re- moves a forebrain-specific p300 binding site [235], which predicts enhancer sequences with strong specificity. The chimpanzee version of this sequence, and a smaller 546 CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 76

bp mouse sequence overlapping the p300 enhancer-binding region, both drove lacZ expression in the developing ventral telencephalon and diencephalon in at least five independent embryonic day 14.5 (E14.5) transgenic embryos (Figure 5.3b–i), con- firming that the ancestral sequence corresponds to a conserved forebrain-specific en- hancer. The chimpanzee sequence also drives significant gene expression when tested in immortalized human fetal neural progenitor cells, suggesting the enhancer would also affect transcription if still present in humans (1.7–2-fold greater expression than human deletion/basal control vector, P < 10−7). Histological sections of transgenic embryos show that the chimpanzee and mouse sequences drive lacZ expression in the subventricular zone (SVZ) of the septum, the preoptic area, and in regions of the ventral thalamus and hypothalamus (Figure 5.3c– e, g–i). The lacZ expression matches a sub-domain of the expression pattern of GADD45G, which is normally expressed throughout the telencephalon SVZ and dien- cephalon [87]. The preoptic area generates inhibitory interneurons that migrate to the neocortex and ventral telencephalon [83]. The ventral thalamus domain corresponds to a region that generates inhibitory interneurons, which have notably increased in proportion in the human thalamus [239, 8]. Expression in brain subventricular zones is particularly interesting in light of previ- ous suggestions that increased proliferation of SVZ intermediate progenitors underlies the evolutionary expansion of the neocortex in primates [128]. GADD45G normally represses cell cycle and can activate apoptosis, and somatic loss of GADD45G expres- sion is clearly linked with excess tissue growth in human pituitary adenomas [251, 252]. Our results show that species-specific loss of an ancestral SVZ enhancer has clearly occurred in the human lineage, providing a plausible molecular basis for increasing production of particular neuronal cell types by regulatory changes in a tumor sup- pressor gene.

5.2 Discussion

We cannot exclude the possibility that loss of AR and GADD45G enhancers has occurred because of relaxed selection following other genetic changes that have led CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 77

a Chimpanzee chr9 88,800,000 89,200,000 89,600,000 GADD45g hCONDEL DIRAS2

3.2 kb hCONDEL.332 Chimp construct Mouse construct p300 binding

Sequence conservation human macaque mouse dog opossum b c f g

Th Th Th Th Se Se POA POA d h Se Se

VZ SVZ VZ SVZ

e POA i POA

Figure 5.3: Transgenic analysis of a chimpanzee and mouse forebrain enhancer missing from a tumor suppressor gene in humans. (a) Top panel: 1.3 Mb region of the chimpanzee chromosome 9. The red bar illustrates a 3,181 bp human-specific deletion removing a conserved chimpanzee enhancer located downstream of GADD45G. Bottom panel: multiple species comparison of the deleted region, showing sequences aligned between chimpanzee and other mammals. The green bar represents a mouse forebrain-specific p300 binding site [235], and the blue and orange bars represent chimpanzee and mouse sequences tested for enhancer activity in transgenic mice. The chimpanzee (b–e) and mouse (f–i) sequences both drive consistent lacZ expression in E14.5 mouse embryos in the ventral thalamus (c, g), the SVZ of the septum (d, h), and the preoptic area (e, i). Increased production of neuronal subtypes from these regions may contribute to thalamic and cortical expansion in humans [83, 239, 8, 128]. All sections are sagittal with anterior to right. POA, preoptic area; Se, septum; SVZ, subventricular zone; Th, thalamus; VZ, ventricular zone. CHAPTER 5. HUMAN CNE LOSSES AND TRAIT EVOLUTION 78

to anatomical differences in the human lineage. However, based on the previously established role of AR in vibrissae and penile spine development, and of GADD45G in negative regulation of tissue proliferation, we think it probable that deletions of tissue-specific enhancers in these genes have contributed to both loss and expansion of particular tissues during human evolution. The full set of hCONDELs may contain loci associated with other human-specific characteristics, a possibility that can now be tested by further functional studies of these conserved non-coding sequences that are surprisingly missing from the human genome.

5.3 Methods Summary

We examined 2,696 Mb (92.67%) of the chimpanzee genome aligning with single regions in humans, removing assembly and alignment artefacts, segmental duplica- tions, and other regions with complex genome histories (Appendix Section B.1.1). Chimpanzee sequences aligning to macaque but not human were identified as human- specific deletions. We identified 70.0 Mb (2.41%) of the chimpanzee genome as highly conserved by calculating the fraction of identical base pairs within multiple sequence alignments using a series of sliding window criteria (Table B.1). hCONDELs were val- idated by searching for individual human sequence reads in the NCBI Trace Archives spanning predicted deletions; and by testing 39 hCONDELs by experimental amplifi- cation from individuals of 23 human populations (Coriell Cell Repositories, Appendix Section B.1.6). Gene enrichments were analysed using GREAT [154]. For each on- tology we ran 1,000 simulations over 510 size-matched random deletions overlapping one or more chimpanzee conserved elements to derive a multiple test P value thresh- old for enrichment. Functional enhancer assays were carried out by injecting chimp andmouse expression constructs into FVB embryos (Xenogen Biosciences and Cya- gen Biosciences) and staining for lacZ expression activity at different developmental stages, or by transfecting chimp and mouse enhancer constructs into human fore- skin fibroblasts (System Biosciences) or ReNcell CX Human Neural Progenitor cells (Chemicon), and comparing expression to human-deletion/control constructs using Galacto-Light Plus (Applied Biosciences) (Appendix Section B.1.15–B.1.17). Chapter 6

Conclusion

In this dissertation I focused on the computational study and analysis of mammalian cis-regulatory elements. I began in Chapter 2 by introducing the notion of a reg- ulatory element and describing how to discover cis-regulatory elements using com- parative genomics. I then motivated their study as potential drivers of evolutionary change due to the demonstrated modularity of cis-regulatory elements as compared to the often pleiotropic genes they regulate. Because the grammar of cis-regulatory elements remains poorly understood, al- ternative mechanisms must be used to determine their functions. In Chapter 3, I described a method and tool named GREAT to analyze the functional enrichments of a set of cis-regulatory elements based on the functions of the genes putatively regulated by the cis-regulatory elements. Unlike previous methodologies, GREAT incorporates information from distal cis-regulatory elements while accounting for the uneven distribution of genes within the genome by using a genomic region-based test. I showed that many distal binding events are likely to be functionally relevant and demonstrated GREAT’s improvement over existing gene-based tools at discern- ing specific functions of both sequence-specific transcription factors and the general transcription-associated protein p300 from their ChIP-seq binding profiles. Chapter 4 introduced a method to identify CNE losses from whole genome se- quences. I quantified CNE dispensability by analyzing rodent- and primate-specific CNE losses. Low loss rates of CNEs suggest that despite their prevalence in the

79 CHAPTER 6. CONCLUSION 80

genome, most have a unique function whose loss produces a fitness cost that is de- tectable on an evolutionary timescale. This result, taken in conjunction with the result that ultraconserved knockout mice can be viable and fertile [2], highlights the breadth of the mutation spectrum and the gap between the selection coefficient nec- essary to ensure a deleterious mutation is swept out of a population by purifying selection and the much larger selection coefficient often required for laboratory assays to detect visible phenotype changes. Additionally, we found little correlation between nucleotide identity and loss rate for sequences under purifying selection: regions per- fectly conserved are only very slightly less likely to be retained than sequences strongly (but not perfectly) conserved. This result brings further attention to the breadth of the set of mutations whose selection coefficient upon loss is dramatic enough to be swept out of populations through purifying selection. Perfectly conserved regions may have more pleiotropic functions than strongly conserved regions, or even have a higher selection coefficient upon loss, but the vast majority of both sets of regions are under strong enough selection to be retained through evolution. In Chapter 5, I extend the work of Chapter 4 to robustly detect CNE losses unique to the human lineage. I found 583 human-specific deletions based on the reference human genome assembly. Of these, 510 could be validated using unassembled se- quence available in the NCBI human trace archives, and 434 deletions appeared fixed in all humans. The validated deletions lie preferentially near steroid hormone recep- tor genes and genes with neural function. By searching for genes whose expression alteration leads to phenotypes consistent with human evolution and narrowing down the set of fixed human deletions by conservation criteria or additional evidence of enhancer activity [235], we identified deletions particularly likely to confer human phenotypes. Experimental assays of a human-specific deletion near AR suggest that its loss may underlie the human loss of penile spines and is correlated with the hu- man loss of sensory vibrissae, and assays of a human-specific deletion near the tumor suppressor gene GADD45G are correlated with cortical expansion in humans. Rapid improvements in DNA sequencing technologies have enabled analyses of full mammalian genome sequences. These analyses have drawn attention to the wealth of CHAPTER 6. CONCLUSION 81

functional non-coding DNA in the human genome and the importance of gene regu- lation in the development of organismal complexity and its misregulation in causing disease. In this dissertation I introduced a statistical methodology and web-based tool which dramatically improves our ability to interpret the functional significance of cis-regulatory elements. Through the observation that sequences conserved during the evolution of multiple separate species are rarely lost in other lineages, I conclude that the vast majority of these sequences are involved in important roles for organ- ismal development. This observation suggests that loss of cis-regulatory elements could drive morphological evolution. We tested this hypothesis by examining unique human evolution for instances of cis-regulatory element loss and present two exam- ples of enhancers lost uniquely in humans whose loss correlates with human-specific traits. These results add to a growing body of evidence suggesting that sequence loss may be a major driver of evolutionary change [171, 88, 43, 155].

6.1 Future directions

Accurate analysis of the functions of cis-regulatory elements is becoming increasingly important as our ability to identify cis-regulatory elements increases. While we have demonstrated that GREAT improves our ability to interpret cis-regulatory elements over existing tools, the association rules used to link cis-regulatory elements to their regulated genes follow general assumptions about cis-regulation as we currently un- derstand it. Owing to the importance of proper region/gene associations to generate accurate enrichments, the incorporation of additional data to motivate particular regulatory domain assignments should improve the signal to noise ratio while still incorporating distal binding sites. While genome-wide maps of the three dimensional structure of the genome are not yet at a resolution usable by GREAT [139], the binding site locations of the insulator protein CTCF could be used to restrict regulatory domain extensions [121]. Analysis of synteny across diverged species could also differentiate shared regulatory elements from those acting on a single gene, an idea that has met with success when considering species that have undergone whole genome duplications [70]. CHAPTER 6. CONCLUSION 82

A complementary approach to alter GREAT results would be to soften regulatory domain assignment rules from a binary outcome, in which each cis-regulatory ele- ment is either associated with a gene or not, to a probabilistic measure of confidence in the association between cis-regulatory elements and genes. This could explicitly model increased confidence in binding sites occurring within the proximal promoter of a gene as compared to binding sites that are distal, and enrichments for individual ontology terms can still be computed efficiently when gene regulatory domains un- dergo pre-processing.

Large strides have been made to identify functional sequences within genomes without relying on interspecies nucleotide conservation. By integrating epigenetic information such as the presence of enhancer-associated histone modifications [94, 225] and open chromatin [199, 86] with the binding of enhancer-associated proteins [235] or enhancer transcription [122], other genomic losses with a high likelihood of producing functional consequences could be discovered. I focused on deletions as a genomic rearrangement of interest because of the a priori indication of functionality provided by nucleotide conservation among multiple species. By incorporating single-genome measures of enhancer activity, other types of rearrangements like insertions, inversions, translocations, and duplications could be identified as altering functional genomic regions. Finally, the population-scale sequencing project currently underway provides much data for discovery and validation of novel genomic rearrangements in humans [1]. Paired-end reads make the identification of deletions, insertions, and inversions tractable from short reads, and when integrated with enhancer identification and analysis meth- ods mentioned above, should provide additional insight into the human genome mod- ifications of greatest functional consequence. Appendix A

Supplementary material for Chapter 3

A.1 Ontologies supported

GREAT assimilates knowledge from 20 separate ontologies containing biological knowl- edge about gene functions, phenotype and disease associations, biological pathways, gene expression data, presence of regulatory motifs, and gene families (Table 3.1). Statistics for each ontology list the total number of terms in the ontology that are currently tested by GREAT, the number of genes annotated with one or more terms in the ontology, and the number of direct associations between ontology terms and genes (Tables 3.2 and 3.3). Some ontologies contain parent/child relationships be- tween terms expressed as a directed acyclic graph; general terms within these on- tologies inherit genes that are only labeled with more specific child terms as indirect associations. To increase statistical power (by reducing the multiple hypothesis cor- rection factor), GREAT does not test any general term whose associated gene list is identical to the associated gene list of a more specific child term. The following ontologies are currently used:

83 APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 84

A.1.1 Gene Ontology

The Gene Ontology (GO; http://www.geneontology.org/) provides a controlled vocabulary to describe attributes of gene products [9]. GO contains three separate ontologies that describe molecular functions, biological processes, and cellular com- ponents of proteins.

A.1.2 Mouse Phenotype

The Mouse Genome Informatics (MGI) resource contains data about mouse genotype– phenotype associations primarily obtained via literature curation [24, 32] (http: //www.informatics.jax.org/phenotypes.shtml). Phenotypic terms are canonical- ized and relationships between terms are enumerated in the Mammalian Phenotype Ontology [208].

A.1.3 MSigDB Ontologies

The Molecular Signatures Database (MSigDB; http://www.broad.mit.edu/gsea/ msigdb/) contains a collection of gene sets [218]. The following description of the various ontologies within MSigDB is taken from http://www.broad.mit.edu/gsea/ msigdb/collections.jsp.

• MSigDB Cancer Neighborhood Computational gene sets defined by mining large collections of cancer-oriented microarray data. Gene sets defined by expression neighborhoods centered on 380 cancer-associated genes [28]. This collection is identical to that previously reported in [218].

• MSigDB Cancer Modules Computational gene sets defined by mining large collections of cancer-oriented microarray data [205]. Briefly, the authors compiled gene sets (“modules”) from a variety of resources such as KEGG, GO, and others. By mining a large compendium of cancer-related microarray data, they identified 456 such modules as significantly changed in a variety of cancer conditions. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 85

• MSigDB Pathway Gene sets from pathway databases. Usually, these gene sets are canonical rep- resentations of a biological process compiled by domain experts.

• MSigDB Perturbation Gene sets that represent gene expression signatures of genetic and chemical perturbations.

• MSigDB Predicted Promoter Motifs Sets of genes that share a transcription factor binding site defined in the TRANS- FAC (version 7.4, http://www.gene-regulation.com/) database. Each of these gene sets is annotated by a TRANSFAC record.

• MSigDB miRNA Motifs Sets of genes that share a 3’-UTR microRNA binding motif.

A.1.4 PANTHER Pathway

PANTHER Pathway (http://www.pantherdb.org/pathway/) contains information on biological pathways (primarily signaling pathways) [157]. PANTHER pathways are collections of biological molecules and the reactions in which they participate. Only well-documented reactions and relationships are listed.

A.1.5 Pathway Commons

Pathway Commons [41] contains a comprehensive collection of pathways from multi- ple sources listed at http://www.pathwaycommons.org/pc/. According to the web- site, “[p]athways include biochemical reactions, complex assembly, transport and catalysis events, and physical interactions involving proteins, DNA, RNA, small molecules and complexes.” APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 86

A.1.6 BioCyc Pathway

BioCyc (http://biocyc.org/) contains information linking genes to the metabolic pathways in which they participate [40].

A.1.7 MGI Expression: Detected & Not Detected

The Gene Expression Database (GXD, http://www.informatics.jax.org/expression. shtml), a part of the Mouse Genome Informatics database, contains expression data with a focus on gene expression during mouse development [209, 32]. The information is primarily obtained from the literature via manual curation. Each entry gives the expression in a specific anatomical structure during a specific developmental period or “Theiler stage” [226]. The anatomy for each developmental stage is represented by a directed acyclic graph that gives a hierarchy of anatomical terms and their re- lationships. The database contains information about which genes are expressed and which are not found to be expressed. We represent the MGI Gene Expression Database by several sub-ontologies, where each sub-ontology is specific to a developmental stage. We then combine all sub- ontologies into one ontology so that all developmental stages are tested at once. MGI Expression: Detected contains data about genes that are expressed and MGI Expression: Not Detected contains data about genes whose expression is measured but not experimentally detected. The human MGI Expression ontologies are derived by mapping expression infor- mation from all genes in the mouse ontologies to their human orthologs and assume large-scale conservation of expression patterns.

A.1.8 Transcription Factor Targets

The Transcription Factor Targets ontology contains transcription factor (TF) target sets for human and mouse collected from literature [142]. Most TF target genes were identified by ChIP-chip experiments (see http://acgt.cs.tau.ac.il/amadeus/suppl/ metazoan compendium.htm). APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 87

A.1.9 miRNA Targets

The miRNA Targets ontology contains miRNA target sets for human and mouse collected from literature [142]. miRNA target genes were identified as genes down- regulated after miRNA overexpression (see http://acgt.cs.tau.ac.il/amadeus/ suppl/metazoan compendium.htm).

A.1.10 InterPro

InterPro (http://www.ebi.ac.uk/interpro/) is a database of protein domains, families and functional sites [103]. InterPro annotations give information about the function, structure and evolution of the domains. InterPro combines data from sev- eral other databases (PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, PANTHER and Gene3D).

A.1.11 TreeFam

The Tree families database (TreeFam, http://www.treefam.org/) contains infor- mation about the evolutionary history (both orthologs and paralogs) of gene fam- ilies [195]. A gene family is defined as “a group of genes that evolved after the speciation of single-metazoan animals”. We format this data into an ontology by creating an ontology term for each gene family and then associating each gene within the species (human or mouse) with its gene family.

A.1.12 HGNC Gene Families

The HUGO Committee groups genes into gene families based on sequence similarity, data from the literature and other databases, and manual curation [31]. The groupings are listed at http://www.genenames.org/genefamily. html. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 88

Website architecture

The GREAT website output is generated by way of server-side PHP code invoking a Python wrapper over a C program. The C code makes use of the UCSC Genome Browser [116] code libraries and performs the core calculations. Test results are for- matted for web display using Python and PHP code. GREAT uses the DataTable con- trol from the Yahoo! User Interface Library (http://developer.yahoo.com/yui/) and custom JavaScript code to allow many user operations without need for more server data. This allows rapid responses for all filtering requests once the initial re- sults page has loaded locally. Operations which generate new pages, such as getting details for a single ontology term or the generation of publication quality table display do involve return trips to the server, which are handled by PHP code.

Graphical User Interface

The graphical user interface (GUI) of GREAT version 1.2.6 has the following compo- nents:

• User Input The user input page (Supplementary Figure 3.3a) requires two inputs from the user: the organism genome assembly in which the analysis should be performed and the cis-regulatory regions to analyze in BED format [116]. Optionally, the user can upload a background set of genomic regions to test against rather than the whole genome. The user can also optionally alter the association rules between cis-regulatory regions and their putative target genes from the default basal plus extension rule via the Advanced options tab.

• Global Output & Controls Upon submission of a dataset, users are directed to the global output screen of GREAT (Supplementary Figure 3.3b). By default, only ontology terms sig- nificant by both the binomial and hypergeometric tests using the multiple hy- pothesis correction false discovery rate (FDR) [18] ≤ 0.05 whose binomial fold enrichment is at least 2.0 are displayed. The extensive global controls at the top APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 89

of the page allow users to change the ontologies shown, alter the data columns displayed for each result term, and change the multiple test correction type and threshold. The number of enriched terms shown for each ontology, the display of terms not significant by one or both of the enrichment tests, and the filtering of terms by their descriptions can all be changed via the global controls. Within each ontology table, terms can be sorted by any data column except FDR. The data for a single table can be downloaded either as HTML or as a tab-separated file. By clicking on a particular term, additional information is presented in an individual term page (described below).

• Individual Term Page Each ontology term tested for enrichment has an associated individual term page (Supplementary Figure 3.3c). The page lists all cis-regulatory regions that reside within the regulatory domain of any gene annotated with the term, all the genes annotated with the term that possess one or more input regions within its regulatory domain, and the definition of the ontology term from the source website inset into the page. By clicking on any cis-regulatory region listed, users can navigate to the UCSC Genome Browser with custom tracks of the entire input set and only the elements contributing to the specific term enrichment (Supplementary Figure 3.3d).

• Online Documentation User help documentation is available at http://great.stanford.edu/help and includes information regarding all aspects of GREAT.

• Demo Sets The SRF [232] and limb p300 ([235]) datasets presented in the main text, as well as several additional sets, are available as demonstration input sets for GREAT. These sets can be tested by navigating to the “Demo” link. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 90

A.2 Additional ChIP-Seq analyses

To assess the ability of GREAT to improve upon existing gene-based analyses of ChIP-Seq datasets, we compared GREAT enrichments to gene-based tool enrich- ments for multiple datasets. Where they were not performed by the original authors, we performed gene-based enrichment analyses using the Database for Annotation, Vi- sualization, and Integrated Discovery (DAVID) [100]. The analysis of next-generation binding data is not clearly mappable to more advanced techniques like the Gene Set Enrichment Analysis (GSEA) [218] which rank genes by the intensity difference of the different probes/genes and are typically applied to compare gene expression profiles from two classes (e.g. people with a disease vs. healthy controls). For each ChIP-Seq dataset we analyzed using DAVID, we mapped the identified peaks that reside within 2 kb of the TSS of the nearest gene as identified by the UCSC Known Genes track [99] to the nearby gene and ran enrichments over the resulting gene list. The ten most enriched terms from each annotation cluster re- ported significant by DAVID at a threshold of 0.05 after an FDR multiple hypothesis correction are shown in each enrichment table. To run a “gene-based GREAT”, we used the basal plus extension association rule with basal upstream and downstream parameters both set to 2 kb and an extension of 0 bp and excluded the set of curated regulatory domains. The ten most enriched terms significant by the hypergeometric test at a threshold of 0.05 after an FDR multiple hypothesis correction are shown in each enrichment table. Most Gene Ontology enrichments are nearly identical be- tween DAVID and “gene-based GREAT”, though some slight discrepancies occur due to different versions of GO and gene sets used. Since GREAT performs a cis-regulatory element-based test, no mappings from ChIP-Seq peaks to genes are required for preprocessing. For each ChIP-Seq dataset we analyzed using GREAT, we ran the identified peaks through GREAT using its default settings (the basal plus extension association rule with basal domains extending 5 kb upstream and 1 kb downstream of the TSS and extension to the basal domains of the nearest genes within 1 Mb). The top ten terms significant at a threshold of 0.05 by the binomial test after an FDR multiple hypothesis correction that have a fold APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 91

enrichment of at least two and are also significant by the hypergeometric test are shown in each enrichment table.

A.2.1 P300 in mouse embryonic stem cells

A recent ChIP-Seq analysis of transcription factors involved in the maintenance of the self-renewal and pluripotency capabilities of embryonic stem cells (ESCs) assayed genome-wide binding of p300 within mouse embryonic stem cells (mESCs) [46]. The binding profile of p300 was noted to co-localize with the binding profiles of Nanog, Oct4, and Sox2 ([46]), which are known to be involved in stem cell maintenance [108]. When we ran DAVID on the associated gene set, no annotation terms were found to be statistically significant (Table A.19a). We then analyzed all 524 identified p300 ChIP-Seq peaks [46] with GREAT using the default settings (5+1 kb basal, up to 1 Mb extension). GO enrichments indicate that chromatin binding and transcriptional regulator proteins are targets of p300 in ESCs, with striking enrichments for genes involved in stem cell maintenance and stem cell differentiation (Table A.19b). The MGI Expression: Detected ontology shows enrichment for genes expressed during the very first stages of development [226], consistent with stem cell maintenance. The Predicted Promoter Motifs ontology shows enrichment for p300 binding near genes whose promoters contain binding sites for GTF3A and NHLH1. GTF3A, which helps to assemble active chromatin, is required for transcription of the 5S RNA genes that drive growth in early developing embryos [73]. While the significance of NHLH1 binding to stem cell maintenance is not known, NHLH1 is known to be expressed in the developing nervous system [13]. The “gene-based GREAT” results identify stem cell differentiation as an enriched term, but none of the other top enriched terms overlap the enrichments displayed by GREAT (Table A.20). Running GREAT with a more limited extension (5+1 kb basal, up to 50 kb extension) highlights more general terms as enriched and no longer emphasizes enrichment for genes expressed in early development (Table A.21). Distal binding appears to contribute greatly to the function of p300 in embryonic stem cells, as demonstrated by nearly 40% of all binding occurring outside of 50 kb from the APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 92

TSS of any gene (Figure 3.2a). Variation in distal association rules leads to generally similar enrichments as the default GREAT, emphasizing genes involved in stem cell maintenance and expressed in early development (Tables A.22, A.23).

A.2.2 Signal transducer and activator of transcription 3 (Stat3) in mouse embryonic stem cells

The binding events of Signal transducer and activator of transcription 3 (Stat3) were also assayed within mESCs using ChIP-Seq in the study mentioned above [46]. Stat3 is a transcriptional activator whose activity is sufficient to maintain an undifferen- tiated state of mouse ESCs [151], but whose constitutive expression has also been linked to various cancers [29]. Stat3 transduces signals from the IL-6 family of cy- tokine receptors [97]. One of these cytokines, Leukemia Inhibitory Factor (LIF), is a component of media used to culture ESCs in an undifferentiated state. Gene-based DAVID enrichments for the Stat3 dataset were calculated in the man- ner described above. The enriched terms from the gene-based analysis are very gen- eral, hinting mainly at roles for Stat3 in metabolic processes and regulation (Ta- ble A.24a). Table A.24b displays the cis-regulatory element-based enrichments produced by GREAT for the same set using the default settings (5+1 kb basal, up to 1 Mb exten- sion). In contrast to the generality of DAVID’s term enrichments, GREAT produces many highly specific and accurate enrichments and yields novel, testable hypotheses. GO Biological Process enrichments indicate Stat3 regulates genes involved in both stem cell maintenance and differentiation. The Mouse Phenotype ontology shows enrichment for genes whose alteration leads to embryonic lethality before somite for- mation and abnormal placental development (Table A.24b). Stat3 is indeed essential; Stat3 knockout mice are embryonic lethal at early stages of development [221]. The PANTHER Pathway ontology suggests that Stat3 modulates the Interferon-γ sig- naling pathway. Interestingly, though Stat3 mediation of the Interferon-γ pathway has yet to be shown, Interferon-γ has recently been shown to suppress Stat3 via de- phosphorylation [80]. The MSigDB Pathway ontology shows enrichment for genes APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 93

expressed in breast cancers, especially those involved in estrogen-receptor-dependent signal transduction. STAT3 expression is frequently detected in breast cancer tis- sues [98], though a clear link between Stat3 and estrogen receptor expression has yet to be shown. The MSigDB Perturbation ontology enrichment for genes upregulated after LIF treatment highlights the link between LIF and Stat3 (Table A.24b); LIF activates the JAK/STAT signaling pathway of which Stat3 is a part [125]. LIF is also an important factor in uterine blastocyst implantation, though its expression is required in the utero-placental unit rather than the blastocyst (from which ESCs are derived) [10]. Thus it is striking that GREAT highlights the Mouse Phenotype “abnor- mal trophoblast layer morphology” and the MGI Expression: Detected enrichments for trophectoderm and extraembryonic component in Theiler stage 5 (Table A.24b). The binding of Stat3 to these regions may reflect the totipotent state of ESCs or an overlap between genes expressed in trophectoderm and the blastocyst. The MSigDB Perturbation ontology also shows enrichment for genes downregulated by expression of constitutively active JUN N-terminal kinase (JNK). Indeed, JNK is known to be a negative regulator of Stat3 [140]. Furthermore, the strong enrichment for genes upregulated by insulin is explained by the ability of insulin to activate Stat3 [53]. Finally, there is enrichment for transcriptional modulators present during myeloid differentiation, and indeed, Stat3 is an essential component for myeloid differentia- tion [156]. Results from a “gene-based GREAT” analysis recapitulate DAVID’s generality in GO enrichments, with none of the top enrichments indicating the roles of Stat3 in stem cell maintenance and differentiation (Table A.25). The Mouse Phenotype ontology enrichments are also more general than in the standard GREAT analysis, with placenta morphology identified but the early embryonic lethality unidentified. Similarly to the p300 in embryonic stem cells example, the basal plus extension with a restricted enrichment domain highlights more general terms and appears more similar to the “gene-based GREAT” (Table A.26), while the two nearest genes and single nearest gene association rules lead to similar enrichments to the default GREAT (Tables A.27 and A.28, respectively). Overall, by coupling appropriate integration of distal binding events with data APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 94

from many ontologies spanning a wide variety of biological phenomena, GREAT high- lights many known functions of Stat3 in mouse ESCs that gene-based tools fail to feature prominently (Table A.24). Experimentally-validated links between Stat3 and its pathway involvements, its known cofactors, and its role in stem cell maintenance are all highlighted by various ontologies. In addition, novel hypotheses of Stat3 in- volvement in trophectoderm development and its additional cofactors can be studied by targeted future experimentation.

A.2.3 Neuron-restrictive silencer factor (NRSF) in human Jurkat cells

To assess the functional roles of Neuron-Restrictive Silencer Factor (NRSF, also known as RE1-Silencing Transcription Factor or REST), we used ChIP-Seq bind- ing data from human Jurkat cells [232]. NRSF is a transcription factor involved in silencing neuron-specific genes [50, 203, 47]. A gene-based analysis of the dataset identified enrichment for genes “mostly in- volved in neuronal function” [232]. The top ten results of the analysis are reproduced in Table A.29a. We ran GREAT on a dataset comprised of the most significant ChIP-Seq peaks of NRSF (QuEST score > 1; n = 1,712) using a whole genome background and de- fault settings. Enriched terms overwhelmingly implicate NRSF as binding near genes involved in ion channel activity, neurotransmitter transport, and synaptic transmis- sion (Table A.29b). Additionally, the GO Cellular Component, Mouse Phenotype, InterPro, and HGNC Gene Families ontologies indicate that NRSF binds near both calcium channel and potassium channel genes. NRSF has been shown to modulate aldosterone and cortisol production by regulating a calcium channel subunit [211]. NRSF has also been shown to regulate potassium channel expression, affecting the phenotype of human vascular smooth muscle cells [48]. The GREAT enrichments sug- gest that NRSF may play a role in regulating other calcium and potassium channel genes as well. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 95

The enrichments are robust to all variations of association rule including a “gene- based GREAT” (Tables A.30–A.33). The ability of both the gene-based analyses and GREAT to identify the neuron-specific functions of NRSF binding data suggests that both proximal and distal binding events play a role in the transcriptional repression of neuron-specific genes [47]. Given the high information content of the 21 bp neuron- restrictive silencer element (NRSE) bound by NRSF [50, 203], binding of NRSF to the NRSE may be predominantly functional regardless of the location of the binding area relative to nearby genes (Figure 3.2a).

A.2.4 GA-Binding Protein (GABP) in human Jurkat cells

The binding profile of GA-Binding Protein (GABP) was assayed in human Jurkat cells via ChIP-Seq [232]. GABP is a ubiquitous transcription factor that controls tran- scriptional regulation of genes involved in many diverse functions including apoptosis, differentiation, cell cycle, and cellular energy metabolism [193]. A gene-based analysis of GABP-regulated genes showed enrichment for genes “in- volved in basic cellular processes, particularly those related to gene expression” [232]. The top ten results are reproduced in Table A.34a. Due to the large number (6,442) of GABP binding peaks present in the dataset, more than 3,000 genes possess a GABP binding peak even within their proximal promoter. As a result the DAVID website cannot even be used to analyze this set, as it can only analyze datasets of 3,000 or fewer genes. In contrast, GREAT can handle datasets of hundreds of thousands of genomic peaks, and any number of re- sulting gene picks. We ran GREAT on the most significant ChIP-Seq peaks of GABP (QuEST score > 1, n = 3,585) using a whole genome background and default settings (Table A.34b). Enrichments from the Pathway Commons ontology highlight the known functions of GABP as a transcriptional activator, as the strongest enrichment is for genes in- volved in transcription. Interestingly, there are also strong enrichments for genes involved in various aspects of mRNA processing, with the MSigDB Pathway ontology highlighting genes involved in mRNA splicing as its strongest enrichment. GABP APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 96

has been shown to regulate Hepatocyte Growth Factor-Regulated Tyrosine Kinase Substrate, a protein that mediates alternative mRNA splicing during liver regener- ation [75, 193]. The MSigDB Pathway and GO Molecular Function ontologies also show strong enrichments for ribosomal proteins. Indeed, GABP is known to regulate multiple ribosomal proteins [61, 85], though the extent of GABP binding suggests that many other ribosomal proteins may also be regulated by GABP. Furthermore, GABP regulates transcription of eIF6, an essential trans-acting factor in ribosome biogenesis [69]. The MSigDB Pathway enriched terms “oxidative phosphorylation” and “electron transport” also correspond to known functions of GABP; GABP regu- lates mtTFA, a mitochondrial transcription factor important in oxidative phosphory- lation [49]. Finally, the Transcription Factor Targets ontology indicates that GABP binding peaks occur near genes that are regulated by ETS1 and YY1, suggesting a possible cooperative role between GABP and these factors. GABP itself is part of the ETS family, and ChIP-Seq experiments examining the binding of both GABP and ETS1 show that the proteins do bind many similar promoters, though GABP is in general a more ubiquitous factor [55]. Interactions between GABP and YY1 have also been experimentally shown [193, 65]. As GABP binds predominantly near the promoter of its target genes (Figure 3.2a), the unique enrichments highlighted by GREAT ontologies are robust to both gene- based analysis (Table A.35) and for all tested variations of association rule (Ta- bles A.36–A.38). APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 97

A.3 Supplementary Figures

a b Region-based binomial Gene-based hypergeometric Region-based binomial Gene-based hypergeometric 300 120 ms ms 250 100 e d te r e d te r 200 80 c h c h i i r r e n e n 150 60 e e v v t i t i 100 40 po s i po s i 50 20 se se a l a l F F 0 0 100 500 1000 5000 10000 50000 100000 100 500 1000 5000 10000 50000 100000 Input set size Input set size c d Region-based binomial Gene-based hypergeometric Region-based binomial Gene-based hypergeometric 160 7

140 ms 6 rms e t 120

e d te r 5 hed c h

100 i r 4 nri c e n

e 80 e e

v 3 i v t 60 t i

pos i 2

40 po s i se

se 1

Fa l 20 a l F 0 0 100 500 1000 5000 10000 50000 100000 100 500 1000 5000 10000 50000 100000 Input set size Input set size

Figure A.1: Robustness of genomic region-based and gene-based enrichment tests. The gene-based hypergeometric test generates false positive enriched terms in many ontologies when not restricted to only proximal binding events for the (a) MGI Expression: Detected, (b) MGI Expression: Not Detected, (c) Mouse Phenotype, and (d) InterPro ontologies. Tests were performed as described in the caption for Figure 3.2b. Though the total number of average false positive terms varies across ontology (note scale changes on y-axis), the qualitative shape of the graphs is similar with the majority of false positive enriched terms occurring for input sets containing 1-50k elements. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 98

Figure A.2: Binomial and hypergeometric P value differences for several different datasets. (a) p300 mouse embryonic limb data set [235]. (b) p300 mouse embryonic forebrain data set [235]. (c) p300 mouse embryonic midbrain data set [235]. (d) NRSF human Jurkat data set [232]. (e) p300 mouse embryonic stem cell data set [46]. (f) Stat3 mouse embryonic stem cell data set [46]. The labeling scheme is as described in the caption for Figure 3.2c. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 99

A.4 Supplementary Tables

Table A.1: “Gene-based GREAT” enrichments of all genes that possess an SRF binding peak within 2 kb of its transcription start site. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 100 APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 101

Table A.2: GREAT cis-regulatory element enrichments of SRF using the basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 50 kb, using the highest-scoring SRF peaks anywhere in the genome (QuEST score > 1; n = 556). APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 102

Table A.3: GREAT enrichments of SRF using the two nearest genes association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test, using the highest-scoring SRF peaks anywhere in the genome (QuEST score > 1; n = 556). APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 103

Table A.4: GREAT enrichments of SRF using the single nearest gene association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test, using the highest-scoring SRF peaks anywhere in the genome (QuEST score > 1; n = 556). APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 104

Table A.5: DAVID gene-based enrichments of genes with proximal p300 limb binding events. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 105

Table A.6: “Gene-based GREAT” enrichments of all genes that possess a p300 limb binding peak within 2 kb of its transcription start site. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 106

Table A.7: GREAT cis-regulatory element enrichments of p300 in limb using the basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 50 kb. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 107

Table A.8: GREAT enrichments of all p300 limb peaks using the two nearest genes association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 108

Table A.9: GREAT enrichments of all p300 limb peaks using the single nearest gene association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 109

Table A.10: DAVID gene-based enrichments of genes with proximal p300 forebrain bind- ing events. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 110

Table A.11: “Gene-based GREAT” enrichments of all genes that possess a p300 forebrain binding peak within 2 kb of its transcription start site. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 111

Table A.12: GREAT cis-regulatory element enrichments of p300 in forebrain using the basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 50 kb. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 112

Table A.13: GREAT enrichments of all p300 forebrain peaks using the two nearest genes association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 113

Table A.14: GREAT enrichments of all p300 forebrain peaks using the single nearest gene association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 114

Table A.15: “Gene-based GREAT” enrichments of all genes that possess a p300 midbrain binding peak within 2 kb of its transcription start site. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 115

Table A.16: GREAT cis-regulatory element enrichments of p300 in midbrain using the basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 50 kb. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 116

Table A.17: GREAT enrichments of all p300 midbrain peaks using the two nearest genes association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 117

Table A.18: GREAT enrichments of all p300 midbrain peaks using the single nearest gene association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 118

Table A.19: GREAT enrichments for regions bound by p300 in mESCs. (a) DAVID gene- based enrichments of genes with proximal p300 binding events. (b) GREAT cis-regulatory element enrichments for all regions bound by p300. a No terms were found significant after multiple hypothesis correction in DAVID’s gene-based test. b GREAT Enrichments of p300 Binding Peaks in mESCs APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 119

Table A.20: “Gene-based GREAT” enrichments of all genes that possess a p300 binding peak in mESCs within 2 kb of its transcription start site. Shown are the top ten hypergeometric enriched terms at a false discovery rate of 0.05. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 120

Table A.21: GREAT enrichments of all p300 binding peaks in mESCs using the basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 50 kb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 121

Table A.22: GREAT enrichments of all p300 binding peaks in mESCs using the two nearest genes association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 122

Table A.23: GREAT enrichments of all p300 binding peaks in mESCs using the single nearest gene association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 123

Table A.24: Enrichments for regions bound by Stat3 in mESCs. (a) DAVID gene-based enrichments of genes with proximal Stat3 binding events. (b) GREAT cis-regulatory element en- richments for all regions bound by Stat3.

a DAVID Gene-based Enrichments of Stat3 Binding Peaks in mESCs

Continued on Next Page. . . APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 124

Table A.24 – Continued b GREAT Enrichments of Stat3 Binding Peaks in mESCs APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 125

Table A.25: “Gene-based GREAT” enrichments of all genes that possess a Stat3 binding peak in mESCs within 2 kb of its transcription start site. Shown are the top ten hypergeometric enriched terms at a false discovery rate of 0.05. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 126

Table A.26: GREAT enrichments of all Stat3 binding peaks in mESCs using the basal plus extension association rule with a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and a maximum extension of 50 kb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 127

Table A.27: GREAT enrichments of all Stat3 binding peaks in mESCs using the two nearest genes association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 128

Table A.28: GREAT enrichments of all Stat3 binding peaks in mESCs using the single nearest gene association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 129

Table A.29: Enrichments for regions bound by NRSF in human Jurkat cells. (a) The top ten proximal binding gene-based enrichments (reproduced from [232]). (b) GREAT cis-regulatory element enrichments for all regions bound by NRSF.

a Gene-based GO Enrichments of NRSF Promoter Binding Peaks

Continued on Next Page. . . APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 130

Table A.29 – Continued b GREAT Enrichments of NRSF Binding Peaks in Human Jurkat Cells APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 131

Table A.30: “Gene-based GREAT” enrichments of all genes that possess an NRSF bind- ing peak within 2 kb of its transcription start site. Shown are the top ten hypergeo- metric enriched terms at a false discovery rate of 0.05. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 132

Table A.31: GREAT enrichments of NRSF using the basal plus extension association rule with a maximum a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and extension of 50 kb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test, using the highest-scoring NRSF peaks anywhere in the genome (QuEST score > 1; n = 1,712).

Continued on Next Page. . . APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 133

Table A.31 – Continued APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 134

Table A.32: GREAT enrichments of NRSF using the two nearest genes association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test, using the highest-scoring NRSF peaks anywhere in the genome (QuEST score > 1; n = 1,712). APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 135

Table A.33: GREAT enrichments of NRSF using the single nearest gene association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test, using the highest-scoring NRSF peaks anywhere in the genome (QuEST score > 1; n = 1,712). APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 136

Table A.34: Enrichments for regions bound by GABP in human Jurkat cells. (a) The top ten proximal binding gene-based enrichments (reproduced from [232]). (b) GREAT cis-regulatory element enrichments for all regions bound by GABP.

a Gene-based GO Enrichments of GABP Promoter Binding Peaks

Continued on Next Page. . . APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 137

Table A.34 – Continued b GREAT Enrichments of GABP Binding Peaks in Human Jurkat Cells APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 138

Table A.35: “Gene-based GREAT” enrichments of all genes that possess an GABP binding peak within 2 kb of its transcription start site. Shown are the top ten hyper- geometric enriched terms at a false discovery rate of 0.05. APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 139

Table A.36: GREAT enrichments of GABP using the basal plus extension association rule with a maximum a basal regulatory region extending 5 kb upstream and 1 kb downstream of the transcription start site and extension of 50 kb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test, using the highest-scoring GABP peaks anywhere in the genome (QuEST score > 1; n = 3,585). APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 140

Table A.37: GREAT enrichments of GABP using the two nearest genes association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test, using the highest-scoring GABP peaks anywhere in the genome (QuEST score > 1; n = 3,585). APPENDIX A. SUPPLEMENTARY MATERIAL FOR CHAPTER 3 141

Table A.38: GREAT enrichments of GABP using the single nearest gene association rule with a maximum extension of 1 Mb. Shown are the top ten binomial enriched terms at a false discovery rate of 0.05 with a fold enrichment of at least two that are also significant by the hypergeometric test, using the highest-scoring GABP peaks anywhere in the genome (QuEST score > 1; n = 3,585). Appendix B

Supplementary material for Chapter 5

B.1 Materials and methods

Six genome assemblies were used throughout this study: Human NCBI Build 36.1 (UCSC hg18), Chimpanzee Build 2 Version 1 (panTro2), Rhesus macaque v1.0 Mmul 051212 (rheMac2), Mouse Build 36 “essentially complete” (mm8), Rat Rnor3.4 (rn4), and Chicken v1.0 (galGal2). The phylogenetic relationships and estimates of evolutionary distances between the species are shown in Figure B.1.

B.1.1 Discovering Human-specific Deletions (hDELs)

Using the pairwise alignment from chimpanzee to macaque, we identified high quality (sequencing quality scores ≥ 40) regions of the chimpanzee genome that align to orthologous regions in macaque. These regions are inferred to have been present in the chimpanzee-human ancestor. Using the pairwise alignment from chimpanzee to human, we then searched the chimpanzee ancestral DNA for regions that have no orthologous human sequence. To guarantee loss of orthologous human DNA, we required losses only from pairwise alignments that are one-to-one mappings, based on the UCSC chains and nets [115]. We also removed all human sequence assembly

142 APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 143

gaps, which otherwise have a similar signature to true human-specific deletions. We estimated the human deletion size using the formula CS-CG-HS, where CS is the length of the chimpanzee sequence in the unaligned region, CG is the length of all chimpanzee alignment gaps in the unaligned region, and HS is the length of the human sequence in the unaligned region. To focus on significant loss events that potentially remove more than a single binding site, we limited our search to human losses at least 23 bp in length (corresponding to the 23 out of 25 bp threshold applied in the next step). The search obtained 37,251 human deletions (hDELs). This likely contains a subset of all human-specific losses with respect to chimpanzee, since we relied on presence in macaque to infer presence in the human-chimpanzee ancestor. Macaque split from the human-chimpanzee ancestor 15-20 million years before the latter two species split (Figure B.1).

B.1.2 Finding chimpanzee sequences highly conserved in non- human species

We generated two syntenic multiple alignments to identify conserved sequences: a three-way alignment of chimpanzee, macaque, and mouse, and a four-way alignment of chimpanzee, macaque, mouse, and chicken. We used the LASTZ [204] and UCSC chain and net [115] tools to create pairwise syntenic alignments from chimpanzee to all other species. We used MULTIZ [26] to combine all pairwise alignments into a multiple alignment. Conserved sequences were defined as ones that satisfied one or more of four sliding window criteria in Table B.1, and cover in total 70.0 Mb (2.41%) of the chimpanzee genome. The stringent conservation criteria bias our search away from exonic regions of the genome; of the 70 Mb of conserved sequence nearly 49 Mb (70%) is non-exonic while the remainder 21 Mb of conserved exonic sequence represents only 38% of all 56 Mb of chimpanzee exonic sequence. Previous work estimates the amount of human genomic sequence under functional constraint at 5% or more [241, 224], suggesting that the set of conserved elements we identify is a conservative estimate of all constrained elements in the chimpanzee APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 144

genome.

B.1.3 Identifying a set of Human Conserved Sequence Dele- tions (hCONDELs)

The 37,251 hDELs are expected to be a mix of neutrally evolving deletions and human losses of functional elements. To identify human-specific deletions most likely to contribute to unique human phenotypes, we focused on hDELs that encompass one or more conserved elements that have no human orthologs. We BLAT [114] searched each chimpanzee conserved region defined above with default parameters against the human genome to discover conserved elements entirely absent in humans. Conserved elements identified using the 10 and 25 bp sliding window criteria were padded with 50 bp of chimpanzee sequence on either side prior to the BLAT search, to ensure that seeds of the sequence could be obtained. Conserved elements not found anywhere within the reference human genome in which 90% or more of the base pairs possessed a chimpanzee sequencing quality score of 40 or greater and were manually verified to be syntenic within chimp, macaque, and mouse were merged into 1,265 non-overlapping chimpanzee conserved elements (55.5 kb) that contain no orthologous human sequence. The 1,265 conserved elements that contain no orthologous human sequence re- side within 583 distinct human-specific deletion events. These 583 deletions, which comprise 3.96 Mb (0.136%) of the chimpanzee genome, were labeled as human con- served sequence deletions (hCONDELs). We used sliding window criteria to identify conserved elements in order to penalize small insertions and deletions rather than treat them as missing data. However, we further analyzed the conserved sequences uniquely lost in humans using Genomic Evolutionary Rate Profiling (GERP) [57], which provides estimates of the number of rejected substitutions per site of conserved elements based on the neutral estimates of substitution rates. Of all 1,265 conserved elements identified via the sliding window metrics, 1,229 (97.2%) of them residing within 574 (98.5%) of all hCONDELs satisfy the default GERP criteria to identify conserved elements (the sum of rejected substitutions in each column ≥ (num columns APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 145

* neutral rate/10)1.15) [57]. The nine hCONDELs that did not satisfy this criterion all possess sequence that clearly aligns syntenically to elephant, implying ancestry at the base of the eutherian (placental) mammal clade or beyond and were thus also retained. hCONDELs are present on every autosome and the X chromosome (Figure 5.1b), and their absence on the Y chromosome is an artifact of the current incompleteness of the macaque genome assembly. The mean and median estimated hCONDEL sizes are 6,126 bp and 2,804 bp, respectively. The mean and median G/C content of the hCONDELs are 0.390 and 0.383, respectively, exhibiting more A/T richness than the chimpanzee genome as a whole (Figure B.2).

B.1.4 Data quality issues: false positive calls vs. rare dele- tion variants

Despite our careful processing, manual inspection of the set of 37,251 hDELs re- vealed many artifactual calls, mostly repetitive sequence mediated. The UCSC align- ment tools mask repetitive elements when seeding cross-species alignments, but allow alignments to extend through repetitive elements in the extension step. Repetitive elements that lack a flanking seeded alignment that can be bridged, as well as species- specific inversions of repetitive sequence, create an alignment gap that appears similar to a real human deletion though the sequence is still present in human (Figure B.3). Masked repeats were also found to confound the reported size of actual human dele- tions (Figure B.4). Another common deletion artifact not associated with repetitive DNA was triggered by chimpanzee-specific tandem duplication where the human and macaque pairwise alignments were found at times to each match a different chim- panzee copy (Figure B.5). Manual inspection verified that the set of 583 hCONDELs is devoid of these types of alignment errors because we BLAT searched the conserved element(s) in each hCONDEL against the unmasked human genome and required that no homologous sequences exist anywhere within the human genome (Section B.1.1). Conceptually, the larger set of hDELs should allow for non-orthologous matches elsewhere in the APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 146

genome. However, high repetitive DNA content and the lack of one or more clear genomic landmarks make hDELs more difficult to handle. The generation of a global set of human-specific deletions was thus left to future efforts, where it can be facili- tated by the coming availability of the Gorilla genome, and by computational whole genome reconstruction approaches [146]. We used the NCBI Trace Archives to validate most human deletions (Section B.1.5). The remainder likely contains both rare human polymorphisms and rare reference genome errors [106].

B.1.5 In silico deletion validation and polymorphism analy- sis

As of June 2008, the NCBI Trace Archives (http://www.ncbi.nlm.nih.gov/Traces/ home/) held 214,905,334 unique human traces. We used these traces to obtain evi- dence of the existence of both the human deleted and ancestral genotypes, indepen- dent of genome assembly. We confirmed with NCBI that ancestry information for the vast majority of these traces is not available, and thus treated the entire Trace Archive as a single resource. To sidestep the issue of correctly assembling human reads, we focused our effort on validating hCONDELs using single trace reads (Figure B.6). We searched for both derived and ancestral versions by running BLAT queries over all human traces against both 1,500 bp of human sequence centered on the lesion and a “chimpized” version of each element that consisted of the chimpanzee- specific sequence within the genomic lesion and flanked by the surrounding human sequence (Figure B.6). We collected all traces that produced significant alignments to any of those sequences and BLAT searched them again against both the entire human genome and a “chimpized” genome that consisted primarily of human sequence, but had each of the 37,251 hDELs replaced by its “chimpized” version. Traces that overlapped a portion of the human sequence that completely spans a hCONDEL and did not map anywhere else in the human genome or “chimpized” genome validated the human deletion genotype. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 147

Current sequencing technologies cannot produce single reads long enough to val- idate hCONDELs with large sizes of unaligned human sequence remaining after the deletion using this method (Figure B.6 and Figure B.7a), though some of these hCON- DELs are likely real human deletions that will be validated as trace lengths increase and more human sequence becomes available. Nonetheless, 510 of the 583 hCONDELs (87.5%) were validated by this method, including all 39 PCR-validated hCONDELs (Figure B.7b). The largest human size of a computationally validated hCONDEL is 823 bp. Examination of only hCONDELs with ≤ 823 bp of human sequence in them yielded a 93.6% validation rate (510 of 545). Interestingly, we found that of these 545 hCONDELs, the relatively few we could not validate this way tended to be larger sized deletions of chimpanzee DNA (Figure B.7c). Though transposons provide a mechanism for sequence deletion [92], they may also have confounded accurate identification of hCONDELs, both as a source of poten- tial genome mis-assembly and because our validation method required single traces to map uniquely to the genome. We examined the RepeatMasker (http://www. repeatmasker.org) annotations of the chimpanzee sequence flanks of each hCON- DEL using the UCSC panTro2 RepeatMasker track, and labeled the 89 hCONDELs whose flanks were comprised of transposons within the same family as transposon- mediated. Only 67 (75.3%) of the transposon-mediated hCONDELs were computa- tionally validated. We assessed polymorphism of validated hCONDELs in a similar manner to iden- tify a conservative set of hCONDELs fixed within human populations. Validated hCONDELs with a human trace that mapped uniquely to the “chimpized” genome were identified as polymorphic (Figure B.6). We also searched for human traces that aligned well to the conserved elements within hCONDELs, not requiring the trace boundary to overlap the deletion edge, as evidence of polymorphism. Of the 510 validated hCONDELs, 76 (14.9%) are polymorphic based on Trace archive evidence, including seven of the eight polymorphisms we observed using PCR. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 148

B.1.6 PCR deletion validation and polymorphism analysis

We randomly chose 39 hCONDELs for PCR deletion validation and polymorphism analysis from all 111 hCONDELs in which the chimpanzee size was under 1 kb (to facilitate PCR amplification). For each of the 39 hCONDELs randomly selected, we designed primers to amplify part or all of the deletion (Table B.3). We tested all 39 elements using DNA samples of single individuals from 23 human populations (Coriell Cell Repositories). For 27 hCONDELs, a single set of primers flanking the deletion was used to amplify specific human and ancestral PCR products. For three hCON- DELs, two sets of nested primers were used in successive rounds of PCR to amplify the human DNA panel. The primers were designed such that the derived amplicon was expected to be smaller than the ancestral amplicon. For nine hCONDELs, three primers (two primers flanking the deletion and one internal, ancestral-specific primer) were used in a single PCR reaction to amplify human and ancestral-specific ampli- cons. For the eight hCONDELs that appeared polymorphic in humans, we expanded the screen to include three individuals each from 32 different populations (for a to- tal of 96 individuals). In each case, we used 10 ng of human DNA (derived element; Clontech #636401), chimpanzee DNA (ancestral element; Coriell NS06006, NS03659) and a 1:1 mix of human:chimpanzee DNA as PCR controls (Table B.4). Annealing and extension cycling conditions were optimized for the specific primer sequences and the sizes of predicted PCR products.

B.1.7 Neandertal allele identification of hCONDELs

The recent sequencing of DNA pooled from three Neandertal individuals to roughly 1.3x genome coverage enables the separation of hCONDELs that were deleted specif- ically on the human lineage from those that are absent in multiple hominin lin- eages [91]. While the limited coverage depth makes inference of Neandertal deletion of hCONDELs difficult, given the existing data that shows most detectable gene flow goes from Neandertals to modern humans [91], the presence of Neandertal sequence within an hCONDEL suggests that the associated change may have occurred specif- ically along the modern human lineage. To identify hCONDELs with evidence of APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 149

the ancestral allele in Neandertal, we examined the Neandertal sequence contigs gen- erated from all Neandertal sequence data and mapped to the chimpanzee panTro2 genome (UCSC track ntSeqContigs) for contigs that overlap the conserved elements used to identify the human deletion as an hCONDEL. Using this metric, 62 (12%) of the 510 computationally validated hCONDELs appear ancestral within Neandertal.

B.1.8 Simulation significance testing

Methodology

To analyze the significance of overlap between hCONDELs and another genomic fea- ture (e.g. human recombination “hotspots”), we ran 1,000,000 simulations in which we placed each hCONDEL in a random position in the chimpanzee genome chosen from a uniform distribution that would identify the element as a hCONDEL–ensuring that the random placement both avoided non-bridged sequence assembly gaps and covered at least one chimpanzee conserved element as defined in Table B.1. Enrich- ment p-values are reported as the total fraction of simulations in which the simulated hCONDELs overlapped the other genomic feature at least as many times as actually observed. Depletion p-values are reported as the total fraction of simulations in which the simulated hCONDELs overlapped the other genomic feature at most as many times as actually observed. Genomic features that are defined in the human genome (e.g., regions of recent positive selection) were mapped to their orthologous chim- panzee regions using the UCSC liftOver tool (http://genome.ucsc.edu/cgi-bin/ hgLiftOver) before running the simulations. Most enrichments (recombination rate, subtelomeric/pericentromeric regions) are reported on the entire set of 583 hCONDELs, though enrichments for the 510 vali- dated hCONDELs were also not significant (recombination rate P > 0.7, subtelom- eric/pericentromeric region P > 0.3). The enrichment of regions of recent human positive selection is reported on the 510 validated hCONDELs, though the enrich- ment for the entire set of 583 hCONDELs was also not significant (P > 0.4). APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 150

Recombination rate

Genotyping of over 5,000 microsatellite markers with 1,257 meiotic events has created a high-resolution recombination map of the human genome [126]. We identified the 5% of the human genome with the highest measured sex-averaged recombination rate using the UCSC hg18 Recomb Rate track. As described in the track details, each base pair is annotated with a recombination rate calculated by assuming a linear genetic distance across the nearest genetic markers.

Subtelomeric and pericentromeric regions

We define telomere and centromere coordinates using the UCSC panTro2 Gap track elements with types “telomere” and “centromere”, respectively, and define telomere coordinates within chromosomes that lack telomeric gap annotations as the mini- mum and maximum base pairs within the chromosome. Lacking a precise definition of subtelomeric and pericentromeric regions, we assessed the enrichment of hCON- DELs within ten different definitions of the track: fixed-distance extensions beyond telomeres and centromeres of 0.5 Mb, 1 Mb, 2 Mb, 3 Mb, and 5 Mb, and chromosome- size-dependent extensions beyond telomeres and centromeres of 0.5%, 1%, 2%, 3%, and 5% of the chromosome size. None showed enrichment (simulation P > 0.5 in all cases).

Recent positive human selection

Several recent studies have identified human genes that show evidence for positive selection, either during the multi-million year split that separates humans from other apes, or during much more recent migrations of human populations to different en- vironments around the world [197]. Four separate selection tests applied to Central European, Yoruban, and Asian populations identified regions of recent positive selec- tion [238]. The strongest regions of selection for each population (249 windows of 100 kb in East Asians, 256 windows of 100 kb in Europeans, and 271 windows of 100 kb in Yorubans) can be merged into 557 distinct windows that span 75.5 Mb (2.64%) of the hg16 build of the human genome. We used the UCSC liftOver tool to convert the APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 151

hg18 coordinates of the hCONDELs to hg16 coordinates, and defined hCONDELs under recent positive human selection as those that overlapped at least one of the 557 distinct windows. While no statistically significant enrichment is observed for the set as a whole (simulation P > 0.4), 16 hCONDELs (2.7%) do overlap a re- gion of recent human positive selection and may represent individual cases of interest (Table B.6). Interestingly, five of these 16 hCONDELs (31%) are computationally identified as polymorphic in humans, nearing a statistically significant enrichment compared to the overall hCONDELs polymorphism estimate of 14.9% (hypergeomet- ric P = 0.057).

B.1.9 Exonic hCONDEL validation attempts

Only one of the 510 computationally validated hCONDELs overlaps any annotated protein-coding gene exons, corresponding to a previously known 92 base pair frameshift- ing exon deletion in the CMAH gene [51]. However, the complete set of 583 hCON- DELs contains an additional seven predicted deletions that remove protein-coding exons. We attempted to computationally validate the deletion genotype in the seven novel exonic hCONDELs using the method described above (Section B.1.5), but the results were uniformly inconclusive (Table B.7). We chose three exonic hCONDELs that predict severe gene disruption for further experimental analysis. We performed PCR analyses on these three hCONDELs to detect the presence of ancestral and derived alleles using a panel of 94–96 diverse humans (methods described in PCR deletion validation and polymorphism analysis). In all cases, we confirmed the presence of the ancestral allele in all individuals (Table B.8). None of the three hCONDELs showed evidence of the derived allele. As discussed above, these results may have arisen because the deletions in the reference human genome are rare poly- morphisms. However, it is also possible that mis-assembly of the human genome led to spurious identification of some of the 73 unvalidated hCONDELs. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 152

B.1.10 Gene structure overlap

We used the NCBI reference sequences (RefSeq) of human and chimpanzee genes [188] downloaded from the UCSC Genome Browser on 15 March 2008 to classify hCON- DELs as exonic, intronic, or intergenic. Exonic hCONDELs are those in which the human deletion overlaps at least one base pair of coding exon or untranslated region (CDS or UTR) sequence. Intergenic hCONDELs are those that do not overlap the region between the transcription start site and transcription end site for any RefSeq gene. Intronic hCONDELs are all hCONDELs that are neither exonic nor intergenic. To supplement the RefSeq gene set, we sought to characterize hCONDELs “of possible exonic origins” as those whose conserved region(s) have any similarity to any curated amino acid sequence. To do so, we generated query sequences consisting of each conserved region within an hCONDEL. Elements under 200 bp in length were extended equally on each side to reach that size. Overlapping elements were then merged and restricted to lie within the boundaries of the hCONDEL if necessary. We used BLASTX [6] to search these elements against the Entrez database of all non-redundant protein sequences (‘nr’), screening for hits with an E − value < 10−4. Significant hits were manually examined for gene, mRNA, or EST evidence in both chimpanzee and mouse. Only 23 of the 583 hCONDELs (3.9%) appear “possibly ex- onic”, including all eight hCONDELs identified as exonic directly via RefSeq. Only 13 (56.5%) of the “possibly exonic” hCONDELs are computationally validated deletions (compared to 87.5% validation rate in the full set).

B.1.11 Functional annotation enrichments of hCONDELs

Genomic Regions Enrichment of Annotations Tool

We suspect most of the hCONDELs remove cis-regulatory sequences contributing to the regulation of nearby genes. Consequently, we can make use of different anno- tations of the genes (proteins) themselves to ask whether the set of hCONDELs is enriched for any specific type of function using the Genomic Regions Enrichment of Annotations Tool (GREAT) [154] (also see Chapter 3 of this dissertation). We ran APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 153

GREAT on the 510 computationally validated hCONDELs using its default associa- tion rule parameters. Conserved elements are known to cluster preferentially near genes involved in developmental processes [16, 246, 141]. Because one criterion used to identify hCON- DELs is that the deletion must remove one or more highly conserved elements, we wanted to ensure that enriched functions were calculated relative to this biased back- ground. To do so, we generated 1,000 random sets of 510 size-matched deletions placed such that they remove at least one of the 1,160,828 chimpanzee conserved el- ements (Table B.1) and ran the binomial test over each randomized set. From these runs we derive a multiple test p-value cutoff at which we expect a single term from the respective ontology to be enriched by chance alone. The estimated thresholds were P = 8.0 × 10−4 for GO Molecular Function and P = 2.1 × 10−3 for InterPro. We explored several additional tests of potential interest, including the comparison of hCONDELs to the incomplete set of hDELs (see above), and the comparison of the 1,265 chimpanzee conserved elements found in the hCONDELs to the general background of all chimpanzee conserved elements (Table B.1) which is confounded when multiple conserved elements are deleted by longer hCONDELs. Results of all enrichment tests over the entire set of 583 hCONDELs, or over the set of 434 hCONDELs computationally predicted as fixed within humans, were also examined and found to be qualitatively similar to those for the 510 validated hCONDELs, with significant enrichments for steroid hormone receptors and suppres- sor genes expressed in cortex (data not shown).

Ontologies tested

We investigated the functional enrichment of genes near validated hCONDELs for the Gene Ontology (GO) [9] and InterPro protein domain annotations [7]. To focus on enrichment of well-annotated categories, we restricted the data to only terms assigned to at least 25 or more genes genome-wide (1,815 and 365 terms for GO and InterPro, respectively). All results for GO and InterPro are given in Table B.9 and Table B.10, respectively. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 154

Genes that show evidence of neural function are identified in a manner analo- gous to [185], by searching the Entrez Gene server (http://www.ncbi.nlm.nih.gov/ sites/entrez?db=Gene) for keywords “neuron*”, “neural*”, “neurite”, or “axon” within current human protein-coding genes. For the neural gene enrichment term we can calculate an empirical p-value of enrichment for hCONDELs to land in prox- imity to these genes by comparing the true enrichment to simulations as described above. The empirical enrichment for hCONDELs to land in proximity to neural genes as compared to simulations of size-matched deletions that remove one or more conserved elements is P = 0.003. The Gene Expression Database [209] holds curated information regarding the expression domains of genes during all stages of mouse development. The curated entries highlight Theiler stage 21 of embryonic mouse development (embryonic day 12.5-14) [226] as a stage in which particularly many genes (1,208) are known to be expressed in the brain (data not shown). We analyzed enrichment of validated hCONDELs for all 111 tissues curated in Theiler stage 21 annotated to at least 25 genes using the method described above to correct for multiple hypothesis testing as well as the fact that hCONDELs remove conserved elements that tend to cluster around particular gene classes. The estimated p-value cutoff at which we expect one term to be enriched by chance alone was P = 2.2 × 10−4. A study that compared human-chimpanzee divergence to human polymorphism within coding sequence identified 304 genes that have undergone rapid amino acid evolution indicative of positive human selection [35]. We analyzed enrichment of validated hCONDELs for these genes against a background gene set that included only genes examined in the study. Using GREAT [154], we see no overall enrich- ment of validated hCONDELs near protein-coding genes with elevated amino acid replacement in the human lineage (P = 0.71) though there are validated hCONDELs near the nine recently-evolved genes ACTL7B, ADCYAP1, CENPE, CSF2RB, ESR1, HABP2, PIP, PLOD2, and ZFP42. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 155

B.1.12 Syntenic 14-way mammal multiple alignments

Browser screenshots for the experimentally tested AR (Figure 5.2a and Figure B.8) and GADD45G (Figure 5.3a and Figure B.9) hCONDELs were based off of chimpanzee- centric syntenic 14-way mammal multiple alignments. LASTZ [204] and the UCSC chain and net tools [115] were used to generate pairwise alignments between chim- panzee (panTro2) and 13 other mammals: human (hg19), orangutan (ponAbe2), rhesus macaque (rheMac2), marmoset (calJac1), mouse (mm9), rat (rn4), guinea pig (cavPor3), dog (canFam2), horse (equCab2), cow (bosTau4), elephant (loxAfr3), opossum (monDom5), and platypus (ornAna1). Chimpanzee-centric syntenic mul- tiple alignments were generated for the loci surrounding the AR and GADD45G hCONDELs using MULTIZ [26] and manually removing clearly paralogous sequences. PhastCons [207] was run using UCSC parameters calculated for the 44-way human- centric alignment.

B.1.13 cCONDELs and mCONDELs

Chimpanzee conserved sequence deletions (cCONDELs) were identified in the manner described above for hCONDELs using human-centric alignments. Conserved elements were identified via the sliding window criteria in Table B.1 after substituting human for chimpanzee, and span 72.3 Mb (2.51%) of the human genome. A total of 445 cCONDELs were identified, 344 (77.3%) of which could be computationally validated using chimpanzee traces within the NCBI Trace Archives and manual inspection. The size distribution of the 344 validated cCONDELs is similar to that of hCONDELs (Figure B.10) and they remove a total of 2.41 Mb of sequence. Mouse conserved sequence deletions (mCONDELs) were also identified in a man- ner similar to hCONDELs, using rat-centric alignments. Conserved elements were de- fined using the sliding window metrics in Table B.1 with the three species alignments using rat, human, and macaque and the four species alignment using rat, human, macaque, and chicken. The criteria identify 71.6 Mb (2.79%) of the rat genome as conserved. A total of 784 mCONDELs were identified, of which 350 (44.6%) were computationally validated via NCBI Trace Archive support and manual inspection. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 156

The 350 validated mCONDELs remove a total of 2.30 Mb of sequence (Figure B.10). The reduction in validated CONDELs for both chimpanzee and mouse when com- pared to human may stem from various practical and biological considerations, in- cluding

1. The chimpanzee genome assembly is a draft assembly that contains a higher frequency of errors than the human assembly.

2. The NCBI Trace Archives has less than 25% as much chimpanzee sequence as it does human, reducing the opportunity for computational validation of chimpanzee deletions.

3. The divergence time of mouse and rat is significantly longer than that of human and chimpanzee (Figure B.1), leading to many mCONDELs with large apparent mouse sizes that confound validation via single unassembled traces (as discussed above).

Functional enrichment analyses for validated cCONDELs and mCONDELs were performed using GREAT as described above, with multiple hypothesis test correc- tion p-value cutoffs estimated by running 1,000 simulated data sets of size-matched sequences that must hit conserved elements. For mCONDELs, deletions were sim- ulated in rat genome coordinates to find overlap between simulated deletions and identified conserved elements, and the resulting deletions were converted to mouse (mm9) coordinates to use the GREAT ontologies for mm9 [154].

B.1.14 Cell proliferation or cell migration suppressor gene enrichments

Loss of enhancer sequences of genes that suppress cell proliferation or cell migration is a possible mechanism for expansion or increased growth of specific tissues. Given the enrichment we observed for hCONDELs to lie in proximity to genes expressed in the cerebral cortex (Table 5.1), we examined whether hCONDELs were also enriched for the subset of cortex-expressed genes that suppress cell proliferation or cell migration. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 157

Manual curation of the Online Mendelian Inheritance in Man database [172] entries for all 553 genes known to be expressed in the cerebral cortex in Theiler stage 21 that have an associated OMIM entry identified a set of 49 genes that are both expressed in cortex and are known to suppress cell proliferation or cell migration. GREAT enrichment analysis indicates that hCONDELs are significantly enriched for proximity to 11 of these 49 genes (binomial P = 7.5 × 10−4, permutation-based empirical P = 0.003).

B.1.15 LacZ reporter assay

Transgenic mice were generated by pronuclear injections of FVB embryos (Xenogen Biosciences and Cyagen Biosciences). The chimpanzee AR enhancer was am- plified using primers 5’-GGATTGCGGCCGCTATCCAACAGGTATTTGAACCTAGGCA-3’ and 5’-GGATTGCGGCCGCTAGAGGATGCACATGTAAATTTGGAGG-3’ containing NotI re- striction sites and cloned into the NotI site of the Not5’hsp68lacZ minimal promoter expression vector [66], forming a new construct ptAR/hsp68lacZ. This construct was linearized with SalI prior to injection. The ho- mologous mouse enhancer was cloned in three parts using the follow- ing primer pairs: 5’-GGATTACTAGTTAACCTAGCAGACAGGCAGTTATTGGG-3’ and 5’-GGATTTCTAGATATTTGAGGCTTCTTAGCATGGTGATG-3’; 5’-GGATTACTAGTTACATTTCTTATCCTGGTGAACTGCCT-3’ and 5’-GGATTTCTAGATAGGTAGAAGAAAGAGCATGGGGAGAG-3’; and 5’-GGATTACTAGTTACCCTAGACTACCCTTAAATCCAAGA-3’ and 5’-GGATTTCTAGATATATAAACACACCATCTGGTGCTGAG-3’. In each case the first primer contained a SpeI site and the reverse primer contained an XbaI site, thus allowing each to be sequentially cloned into an XbaI site of a modified pBS(KS+) vector with the single addition of junction se- quence 5’-TTACCCTAGACTACCCTTAAATCCAAGATGGTGTGTTTATATATCTAG-3’. The final construct was linearized with NaeI and co-injected with Not5’hsp68lacZ. The hCONDEL containing the AR enhancer was confirmed in our total diversity panel (populations listed within Table B.4) via PCR using primers 5’-GGATTGCGGCCGCTAAAAGCCATAAGTGCCTAGGTGTAGG-3’; APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 158

5’-TATGCCACTTTCTTCATGTTGTGGG-3’; 5’-ATGCGGTTGGCAATAGCTAAACTAG-3’; and 5’-CCAGAATCTACGATGAACTCCAACA-3’. The chimpanzee GADD45G enhancer was amplified using primers 5’-AAGGATTGCGGCCGCTGTATGGATACAGGTACCTGCACATG-3’ and 5’-AAGGATTGCGGCCGCTTGAGGCTATTCCTGCTCACAGATGA-3’ containing NotI re- striction sites and cloned into the NotI site of Not5’hsp68lacZ [66], forming ptGADD45G/hsp68lacZ [66]. The mouse GADD45G enhancer was amplified using primers 5’-ATGTGGCCAAACAGGCCTATTGCTGAGCCATCTTGACAGTCTTAAT-3’ and 5’-ATGTGGCCTGTTTGGCCTATTCCCCCTGTGCTCCACGGAATC-3’ containing SfiI restriction sites and cloned as a 4x concatenate into an SfiI site added within the NotI site of Not5’hsp68lacZ [162]. Both constructs were linearized with SalI prior to injection. The hCONDEL containing the GADD45G enhancer was confirmed in our to- tal diversity panel (populations listed within Table B.4 via PCR using primers 5’-TTGCCTACACATCTGTATCACCTGG-3’; 5’-ATCATGTGCATGGACACATGTATGG-3’; and 5’-CAGCCCTTTCAATTGAGAGCACAGG-3’.

B.1.16 In situ hybridization

The AR riboprobe was generated by amplifying a 1,594bp fragment corre- sponding to the 5’ UTR and exon 1 from mouse genomic DNA using primers 5’-GAATTCGGTGGAAGCTACAGACAAG-3’ and 5’-CCATAAGGTCCGGAGTAGTTCTCCA-3’. The fragment was cloned into pCR4Blunt-TOPO vector (Invitrogen), and sense and anti-sense probes were produced by digesting with NotI and transcribing with Sp6 RNA polymerase or HindIII and T7 respectively. Whole-mount in situ hybridization was performed as described [244].

B.1.17 Cell culture assay

We performed cell culture assays to test the activity of the AR (ptAR/hsp68lacZ ) or GADD45G (ptGADD45G/hsp68lacZ ) enhancers in a human genetic background. For the AR enhancer, human foreskin fibroblasts (PC501A, System Biosciences) were maintained at 5% CO2 at 37C in high glucose DMEM (Gibco), 10% fetal bovine APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 159

serum, 2 mM L-Glutamine (Gibco), 1 mM MEM non-essential amino acids, and 0.5% Pen Strep (Gibco). Culture media was changed every two days and the cells were subcultured usually twice a week. Prior to transfection, cells were cultured for 24 hr in 24-well dishes at 8x104 cells/well. Cells were transiently transfected with 0.5 mg or 1 mg of ptAR/hsp68lacZ per well using the Superfect Transfection Reagent (Qiagen) following the manufacturers recommended protocols. Expression was com- pared to cells transfected with the molar equivalent of Not5’hsp68lacZ. Enhancer activity was measured with a GloMax-96 Microplate Luminometer (Promega) 42 to 48 hours after transfection using the Galacto-Light Plus System (Applied Biosystems) following manufacturer’s protocols. To examine the GADD45G enhancer activity, we used the ReNcell CX Human Neural Progenitor Cell Line (SCC007, Chemicon) de- rived from the cortical region of a gestational week 14 fetal brain. ReNcell CX cells were maintained at 5% CO2 at 37C in Neural Stem Cell (NSC) Maintenance Me- dia (SCM005, Chemicon) supplemented with 20 ng/mL bFGF and 20 ng/mL EGF (GF003, GF001, Chemicon) on tissue culture-ware previously coated with 20 mg/mL of laminin (L-2020, Sigma) diluted in DMEM/F12 (DF-042-B, Chemicon). Culture media was changed every two days and the cells were subcultured usually twice a week. Prior to transfection, cells were cultured for 16 hr in laminin-coated 96-well plates with NSC maintenance media supplemented with 6 ng/mL bFGF and 6 ng/mL EGF in 96-well dishes at 2.1x104 cells/well. Cells were transiently transfected with 1 mg of ptGADD45G/hsp68lacZ per well using the Superfect Transfection Reagent (Qiagen) following the manufacturer’s recommended protocol, and subsequently cul- turing in NSC maintenance media. Expression was compared to cells transfected with the molar equivalent of Not5’hsp68lacZ. Enhancer activity was measured with a GloMax-96 Microplate Luminometer (Promega) 56 hours after transfection using the Galacto-Light Plus System (Applied Biosystems) following manufacturer’s protocols. For each construct, we performed at least three independent transfection experiments containing 12 or 24 replicates each, and we compared the mean expression of the chim- panzee construct with the Not5’hsp68lacZ construct using a two-sample directional Student’s T-test. The p-values from independent experiments were combined using Fisher’s combined probability test. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 160

B.2 Supplementary Figures

a 5 human 18 5 chimpanzee 50 My 68 23 macaque 219 41 mouse 50 41 rat 310 chicken

b 0.0038 human 0.1 subst/site 0.0150

0.0049 0.0696 chimpanzee 0.0298 macaque 0.4086 0.0855 mouse 0.2507 0.0722 0.4177 rat chicken

Figure B.1: Phylogeny of species used. (a) Evolutionary distances in units of millions of years (adapted from [230]). (b) Branch lengths in units of substitutions per synonymous site (adapted from [149]). APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 161

Human Conserved Sequence Deletions Entire chimpanzee genome 0.45 0.40 0.35 0.30 0.25 0.20

Frequency 0.15 0.10 0.05 0.00 0.05 0.45 0.55 0.6 0 0.65 0.70 0.75 0.80 0.85 0.1 5 0.20 0.30 0.40 0.50 1.00 0.10 0.25 0.35 0.90 0.95 Chimpanzee G/C content

Figure B.2: Chimpanzee G/C content of hCONDELs. G/C content was calculated over unmasked sequences and includes repetitive DNA. Entire chimpanzee genome G/C content was calculated in non-overlapping 1,000 bp windows of sequenced unmasked chimpanzee DNA. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 162

a Scale 500 bases Chimp chr3: 45695500 45696000 45696500 Gap Locations Rhesus Alignment Net

Human Alignment Net

Repetitive Sequence Inverted in Human Non-deleted Sequence Repeating Elements by RepeatMasker SINE LINE LTR DNA Simple Low Complexity

b Scale 500 bases Human chr3: 44716000 44716500 44717000 Rhesus Alignment Net

Chimp Alignment Net

Repetitive Sequence Inverted in Human Non-deleted Sequence Repeating Elements by RepeatMasker SINE LINE LTR DNA Simple Low Complexity

Figure B.3: False detection of a human deletion event due to a repeat masking artifact. (a) Chimpanzee genome browser view of a 968 bp region that strongly aligns between chimpanzee and rhesus macaque, but does not align between chimpanzee and human, according to the human alignment net. (b) Human genome browser view of the orthologous region, which shows a 964 bp region that aligns between human and rhesus macaque but does not align between human and chimpanzee in the chimpanzee alignment net. Manual local alignment of the two regions aligned 964 bp of chimpanzee sequence to 960 bp on the reverse stranded human sequence, with 951 bp (99%) identically matching (red bar tracks added to both panels by the authors). APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 163

a Scale 10 kb Chimp chr3: 105455000 105460000 105465000 105470000 105475000 Gap Locations

Rhesus Alignment Net

Human Alignment Net

Repetitive Sequence Present in Human Non-deleted Sequence Repeating Elements by RepeatMasker SINE LINE LTR DNA Simple

b Scale 5 kb Human chr3: 102675000 102680000 102685000 Rhesus Alignment Net

Chimp Alignment Net

Repetitive Sequence Present in Human Non-deleted Sequence Repeating Elements by RepeatMasker SINE LINE

Figure B.4: Human-specific sequence loss of ≈ 11 kb mis-annotated as a 29 kb deletion. (a) Chimpanzee genome browser view of a 29 kb region that aligns to rhesus macaque but does not align to human. Red bar shows portion of sequence actually present in human. (b) Human genome browser view of the orthologous region, which shows 17 kb of sequence present in human that does not align to chimpanzee. Comparison of the RepeatMasker annotation of the orthologous sequences under the red bars shows the strong similarity. Manual local alignment of the initial 17,429 bp of chimpanzee sequence to the entire 17,762 bp in human produced an alignment of 17,438 bp in human to the chimpanzee sequence with 17,149 bp (97%) identically matching.

Scale 200 bases Chimp chr7: 125260800 125260900 125261000 125261100 125261200 Gap Locations

Rhesus Alignment Net

Human Alignment Net

Duplicated Chimpanzee Sequence Chimp Duplication

Figure B.5: Artifactual human deletion due to duplicated chimpanzee sequence. The criteria used to identify hDELs flagged this sequence as a human loss of 228 bp (corresponding to the gap shown in the human net). In actuality, there are two identical copies of a 129 bp region in chimpanzee (red bars) that appears only once in both rhesus macaque and human. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 164

a Chimp DNA C c C

Human DNA H h H

b Deletion genotype validation Trace must align to all three sequence blocks

Human Trace aligns Human Trace H h H better than H c H

c Actual chimp DNA C c C (and proper genome assembly) Incorrect human genome assembly H h H

Actual human DNA H h c H

d Ancestral genotype inference Trace must align to human and chimpanzee sequence

Human Trace aligns Human Trace H c H better than H h H

or Human Trace H highly conserved c H

Figure B.6: Computational validation of deletion genotype and assessment of polymor- phism. (a) hCONDEL composition: flanking regions of human-chimpanzee sequence homology (aligned blue bars), chimpanzee sequence that is not alignable to human (red bar), and possibly some human sequence that is not alignable to chimpanzee (green bar). (b) The deletion genotype is validated when a single trace (brown bar) aligns to the predicted deletion genotype such that the entire human sequence that is not aligned to chimpanzee is spanned by the trace, and there are no strong alignments of the trace to anywhere else in the entire human genome or to a “chimpized” human genome in which the unaligned chimpanzee sequence replaces the unaligned human sequence in all 37,251 hDELs. (c) Traces are required to span the entire hCONDEL to preclude the possi- bility that the proper assembly of human contains the chimpanzee sequence (red bar) adjacent to the human-specific sequence (green bar). This deliberately conservative validation metric avoids any assembly-related artifacts and ensures that at least one human possesses the deletion. (d) Ancestral genotype existence in human is inferred in two ways: when a single trace aligns strongly to either edge of the “chimpized” version of the hCONDEL, spanning the border between the flanking human sequence and some unaligned chimpanzee sequence, and does not strongly align elsewhere within the human or “chimpized” genome (top), or when a human trace aligns strongly to one or more of the highly conserved elements used to identify the sequence as an hCONDEL (bottom). APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 165

70 100,000 ) Validated Unvalidated

a s b

n 60 o i ) l 10,000 l

50 p ( b (m i

40 e

s 1,000 z e s i c

30 a n t r

a 100

f 20 m o

u r

10 H e 10 b

m 0 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 0 u 4 9 9 4 9 4 9 4 9 4 9 4 9 4 9 4 9 4 9 4 9 4 9 4 9 4 0 - - 1 1 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 0 0 2 2 3 0 0 ------0 N 1 1 1 1 1 1 1 - -

5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - - - - 0 0 ≥ 0 5 0 5 0 5 0 5 0 5 0 5 0 5 0 5 0 5 0 0 0 0 0 5 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 0 5 0 5 10 100 1,000 10,000 100,000 1,000,000 1 1 0 0 2 2 1 1 1 1 1 1 Chimpanzee size (bp) Human trace read length (bp)

c Fraction validated Fraction unvalidated 100% 3 22 10

) 80% ( %

s 60% E L

D 108 343 59

N 40% O C

h 20%

0% < 1,000 bp 1,000-10,000 bp > 10,000 bp Chimpanzee size (bp)

Figure B.7: Human conserved sequence deletion Trace Archive validation results. (a) Length distribution of human reads in the NCBI Trace Archives. (b) The 583 hCONDELs are displayed on a log-log plot according to their apparent sizes in the chimpanzee and human genome browsers (Figure B.6). Computational validation occurs when one or more individual human traces uniquely confirm the deletion genotype (Section B.1.5). There are 510 validated and 73 unvali- dated hCONDELs. No hCONDEL with human size > 823 bp was validated, likely due in part to the relatively few traces long enough to potentially validate those hCONDELs. (c) We observe a counterintuitive trend within the 545 hCONDELs of human size ≤ 823 bp where the fraction of unvalidated hCONDELs increases as the putative human deleted region grows bigger. The number of hCONDELs of each type is shown within the graph segments.

Figure B.8: Chimpanzee browser screenshot of the AR hCONDEL locus. Shown are the human deletion and all non-human species that possess alignable sequence orthologous to the chimpanzee. Conservation peaks are identified using the entire 14-way alignment (Section B.1.12). APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 166

Figure B.9: Chimpanzee browser screenshot of the GADD45G hCONDEL locus. Shown are the human deletion and all non-human species that possess alignable sequence orthologous to the chimpanzee. Conservation peaks are identified using the entire 14-way alignment (Section B.1.12). The rat genome has an assembly gap spanning the unaligned region that contains the highly con- served sequences.

chimpCONDELs mouseCONDELs 70

s 60

io n 50

dele t 40

o f 30

ber 20

Nu m 10

0 1 5 10 15 20 ≥25 Deletion size (kb)

Figure B.10: Size distributions of chimpanzee and mouse conserved sequence deletions. Shown are the total counts of computationally validated cCONDELs (n=344) and computationally validated mCONDELs (n=350). APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 167

B.3 Supplementary Tables

Table B.1: Sliding window criteria used to identify chimpanzee conserved elements.

Species aligned Window size Identical base pairs Chimpanzee genome coverage Chimpanzee, macaque, mouse 100 90 30.7 Mb (1.06%) Chimpanzee, macaque, mouse 50 45 45.8 Mb (1.57%) Chimpanzee, macaque, mouse 25 23 60.4 Mb (2.08%) Chimpanzee, macaque, mouse, chicken 10 9 14.7 Mb (0.51%) Union: 70.0 Mb (2.41%)

Table B.2: The 583 human conserved sequence deletions.

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.1 chr1:3551306-3551400 chr1:3617079-3617079 hCONDEL.2 chr1:5962438-5965784 chr1:5915074-5915073 hCONDEL.3 chr1:17292135-17294810 chr1:17488364-17488363 hCONDEL.4 chr1:23458317-23466227 chr1:23476353-23476352 hCONDEL.5 chr1:30722441-30726705 chr1:30619579-30619578 hCONDEL.6 chr1:31895475-31903616 chr1:31744129-31744128 hCONDEL.7 chr1:33665412-33667133 chr1:33424994-33425005 hCONDEL.8 chr1:35001287-35001383 chr1:34738661-34738660 hCONDEL.9 chr1:41468612-41468838 chr1:41040497-41040496 hCONDEL.10 chr1:54573570-54575825 chr1:53879603-53879605 hCONDEL.11 chr1:58175745-58177766 chr1:57372792-57372791 hCONDEL.12 chr1:62481257-62483100 chr1:61631524-61631523 hCONDEL.13 chr1:66911847-66912341 chr1:65993327-65993326 hCONDEL.14 chr1:67000355-67001170 chr1:66080786-66080785 hCONDEL.15 chr1:73585239-73594098 chr1:72578084-72578083 hCONDEL.16 chr1:78187347-78188427 chr1:77122820-77122819 hCONDEL.17 chr1:83810584-83811324 chr1:82658330-82658329 hCONDEL.18 chr1:86559612-86561964 chr1:85360994-85361010 hCONDEL.19 chr1:89861836-89861929 chr1:88633307-88633306 hCONDEL.20 chr1:93331129-93331396 chr1:92099525-92099544 hCONDEL.21 chr1:98162298-98164592 chr1:96860009-96860008 hCONDEL.22 chr1:100240118-100244391 chr1:98912159-98912158 hCONDEL.23 chr1:132521559-132523059 chr1:151637537-151637536 hCONDEL.24 chr1:137169062-137171016 chr1:156160569-156160568 hCONDEL.25 chr1:143822931-143832863 chr1:162780232-162780245 hCONDEL.26 chr1:146537593-146544011 chr1:165431305-165431582 hCONDEL.27 chr1:154236083-154245026 chr1:172946670-172947635 hCONDEL.28 chr1:161410222-161411008 chr1:180011781-180011780 hCONDEL.29 chr1:167485783-167486268 chr1:185974133-185974295 hCONDEL.30 chr1:171062877-171067060 chr1:189515976-189515992 hCONDEL.31 chr1:175828792-175831596 chr1:194133708-194133708 hCONDEL.32 chr1:179568677-179574787 chr1:197842818-197842817 hCONDEL.33 chr1:183558736-183559777 chr1:201717532-201717531 hCONDEL.34 chr1:191352617-191353398 chr1:209151329-209151605 hCONDEL.35 chr1:197202575-197206931 chr1:214866810-214866809 hCONDEL.36 chr1:197529644-197530745 chr1:215188487-215188536 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 168

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.37 chr1:202801147-202804923 chr1:220377485-220379061 hCONDEL.38 chr1:226816542-226823289 chr1:243969233-243969232 hCONDEL.39 chr1:227475093-227476619 chr1:244613683-244613682 hCONDEL.40 chr2a:8316069-8316766 chr2:8200261-8200260 hCONDEL.41 chr2a:14557032-14557951 chr2:14332054-14332053 hCONDEL.42 chr2a:35564668-35565203 chr2:34861480-34861481 hCONDEL.43 chr2a:36215002-36255309 chr2:35533609-35533608 hCONDEL.44 chr2a:40561991-40562569 chr2:39753811-39753810 hCONDEL.45 chr2a:40760886-40792993 chr2:39946840-39946839 hCONDEL.46 chr2a:40925161-40927468 chr2:40076989-40076989 hCONDEL.47 chr2a:42539973-42540536 chr2:41617299-41617299 hCONDEL.48 chr2a:49202256-49206166 chr2:48126503-48126513 hCONDEL.49 chr2a:51356637-51357594 chr2:50227268-50227267 hCONDEL.50 chr2a:52475788-52479265 chr2:51327018-51327017 hCONDEL.51 chr2a:57253387-57255317 chr2:56022056-56022056 hCONDEL.52 chr2a:65654718-65656321 chr2:64319340-64319339 hCONDEL.53 chr2a:71484393-71485696 chr2:70046105-70046104 hCONDEL.54 chr2a:76753304-76755417 chr2:75217837-75217838 hCONDEL.55 chr2a:78346973-78352641 chr2:76777296-76777310 hCONDEL.56 chr2a:78608866-78609659 chr2:77031461-77031477 hCONDEL.57 chr2a:85192434-85197701 chr2:83396143-83396176 hCONDEL.58 chr2a:98910622-98923377 chr2:98000408-98001254 hCONDEL.59 chr2a:99250589-99250862 chr2:98322211-98322236 hCONDEL.60 chr2a:109506767-109556454 chr2:108245371-108246609 hCONDEL.61 chr2b:115403795-115409316 chr2:115351931-115351930 hCONDEL.62 chr2b:117543692-117547728 chr2:117424581-117424999 hCONDEL.63 chr2b:117919048-117927241 chr2:117793634-117793861 hCONDEL.64 chr2b:128750930-128751273 chr2:128366158-128366159 hCONDEL.65 chr2b:136575119-136577224 chr2:132955804-132955803 hCONDEL.66 chr2b:136822982-136823621 chr2:133193390-133193390 hCONDEL.67 chr2b:141538721-141539932 chr2:137807734-137807733 hCONDEL.68 chr2b:141731775-141733966 chr2:137992993-137992992 hCONDEL.69 chr2b:149631395-149631588 chr2:145760604-145760603 hCONDEL.70 chr2b:152551703-152555846 chr2:148635025-148635024 hCONDEL.71 chr2b:155116775-155117678 chr2:151155777-151155788 hCONDEL.72 chr2b:160340709-160347989 chr2:156327839-156328157 hCONDEL.73 chr2b:162540614-162545339 chr2:158481771-158482085 hCONDEL.74 chr2b:163179201-163187308 chr2:159107911-159107910 hCONDEL.75 chr2b:164800247-164801407 chr2:160676314-160676314 hCONDEL.76 chr2b:176764180-176765710 chr2:172398237-172398355 hCONDEL.77 chr2b:180250333-180253718 chr2:175845513-175848660 hCONDEL.78 chr2b:180842618-180845551 chr2:176433444-176433443 hCONDEL.79 chr2b:181014077-181016670 chr2:176590639-176590638 hCONDEL.80 chr2b:182065626-182065815 chr2:177620083-177620082 hCONDEL.81 chr2b:188994901-189000286 chr2:184409776-184409775 hCONDEL.82 chr2b:205603254-205606150 chr2:200756149-200756148 hCONDEL.83 chr2b:210162921-210165533 chr2:205192685-205192684 hCONDEL.84 chr2b:210892924-210896381 chr2:205918754-205918867 hCONDEL.85 chr2b:211409147-211423174 chr2:206415621-206417427 hCONDEL.86 chr2b:214620251-214625926 chr2:209524018-209524017 hCONDEL.87 chr2b:217747610-217751695 chr2:212576596-212576687 hCONDEL.88 chr2b:219664376-219671191 chr2:214447724-214447724 hCONDEL.89 chr2b:220404404-220406712 chr2:215160718-215160717 hCONDEL.90 chr2b:224800194-224800473 chr2:219447462-219447466 hCONDEL.91 chr2b:229013941-229016789 chr2:223585542-223585549 hCONDEL.92 chr2b:229425595-229431741 chr2:223980291-223980310 hCONDEL.93 chr2b:230113786-230114426 chr2:224672257-224672376 hCONDEL.94 chr2b:231499280-231501374 chr2:226016180-226016179 hCONDEL.95 chr2b:232520378-232523730 chr2:227016982-227016981 hCONDEL.96 chr2b:241400205-241400787 chr2:235736411-235736410 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 169

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.97 chr2b:241749280-241749818 chr2:236079583-236079582 hCONDEL.98 chr3:467592-483221 chr3:432358-432357 hCONDEL.99 chr3:1065800-1066932 chr3:1044531-1044601 hCONDEL.100 chr3:2533868-2535501 chr3:2480647-2480646 hCONDEL.101 chr3:3104583-3106814 chr3:3029559-3029691 hCONDEL.102 chr3:3380794-3384227 chr3:3298914-3298934 hCONDEL.103 chr3:4104218-4109947 chr3:4003044-4003078 hCONDEL.104 chr3:5737986-5749494 chr3:5611256-5611255 hCONDEL.105 chr3:9213107-9214692 chr3:8961584-8961591 hCONDEL.106 chr3:48493065-48495731 chr3:47447762-47447761 hCONDEL.107 chr3:50568972-50571827 chr3:49413917-49413917 hCONDEL.108 chr3:52646487-52650398 chr3:51444639-51444638 hCONDEL.109 chr3:56836865-56838699 chr3:55531563-55531562 hCONDEL.110 chr3:57556661-57561406 chr3:56241786-56241785 hCONDEL.111 chr3:58690466-58740312 chr3:57362067-57362201 hCONDEL.112 chr3:61846388-61846493 chr3:60395823-60395826 hCONDEL.113 chr3:65240779-65242858 chr3:63762507-63762506 hCONDEL.114 chr3:69398501-69408085 chr3:67772115-67772114 hCONDEL.115 chr3:70445956-70458640 chr3:68784820-68784819 hCONDEL.116 chr3:71241785-71244340 chr3:69556154-69556154 hCONDEL.117 chr3:76086056-76086803 chr3:74277375-74277374 hCONDEL.118 chr3:76637619-76642975 chr3:74819875-74819878 hCONDEL.119 chr3:84677420-84679671 chr3:82628597-82628596 hCONDEL.120 chr3:86972200-86978652 chr3:84883018-84883017 hCONDEL.121 chr3:87132987-87146444 chr3:85032453-85032894 hCONDEL.122 chr3:87807377-87808320 chr3:85680065-85680075 hCONDEL.123 chr3:101265170-101267048 chr3:98608725-98608724 hCONDEL.124 chr3:108531407-108535997 chr3:105697448-105697447 hCONDEL.125 chr3:110510751-110513582 chr3:107633875-107633874 hCONDEL.126 chr3:110823710-110862696 chr3:107937925-107937924 hCONDEL.127 chr3:123762492-123765096 chr3:120660337-120660336 hCONDEL.128 chr3:129477204-129479887 chr3:126291320-126291319 hCONDEL.129 chr3:133455575-133455774 chr3:130223687-130223686 hCONDEL.130 chr3:137310810-137311397 chr3:133956298-133956304 hCONDEL.131 chr3:139702849-139704297 chr3:136309671-136309677 hCONDEL.132 chr3:141913528-141913755 chr3:138485312-138485311 hCONDEL.133 chr3:142109053-142110860 chr3:138690515-138690514 hCONDEL.134 chr3:142246575-142254985 chr3:138829941-138829940 hCONDEL.135 chr3:142896615-142898387 chr3:139467964-139467963 hCONDEL.136 chr3:150027235-150037082 chr3:146472977-146473150 hCONDEL.137 chr3:153201810-153207703 chr3:149573141-149573143 hCONDEL.138 chr3:155671598-155683302 chr3:151978718-151978717 hCONDEL.139 chr3:159147527-159150712 chr3:155381099-155381098 hCONDEL.140 chr3:163941461-163953935 chr3:160103694-160103811 hCONDEL.141 chr3:164187190-164188123 chr3:160331202-160331201 hCONDEL.142 chr3:164603264-164604649 chr3:160740521-160740520 hCONDEL.143 chr3:180055943-180059097 chr3:175962932-175963407 hCONDEL.144 chr3:180697398-180704140 chr3:176583070-176583069 hCONDEL.145 chr3:186295304-186300907 chr3:182081265-182081287 hCONDEL.146 chr3:198081514-198083691 chr3:193632862-193632861 hCONDEL.147 chr4:9769989-9775447 chr4:9636965-9636964 hCONDEL.148 chr4:10823665-10823892 chr4:10647613-10647612 hCONDEL.149 chr4:14869183-14872631 chr4:14565824-14565994 hCONDEL.150 chr4:15066076-15069768 chr4:14749761-14749760 hCONDEL.151 chr4:15077645-15081870 chr4:14757125-14757124 hCONDEL.152 chr4:15424104-15430996 chr4:15097779-15099314 hCONDEL.153 chr4:21610445-21617307 chr4:21205502-21205528 hCONDEL.154 chr4:28924814-28930360 chr4:28372753-28372752 hCONDEL.155 chr4:34827489-34828430 chr4:34292429-34292428 hCONDEL.156 chr4:35094336-35104628 chr4:34572261-34572260 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 170

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.157 chr4:44017239-44030791 chr4:43299147-43299146 hCONDEL.158 chr4:44860758-44861792 chr4:44114390-44114390 hCONDEL.159 chr4:95670524-95673012 chr4:93789186-93789320 hCONDEL.160 chr4:98466713-98472712 chr4:96511181-96511305 hCONDEL.161 chr4:100719084-100719351 chr4:98689188-98689187 hCONDEL.162 chr4:103331199-103336759 chr4:101244829-101245096 hCONDEL.163 chr4:105291657-105296499 chr4:103130262-103130261 hCONDEL.164 chr4:106697890-106698135 chr4:104519048-104519047 hCONDEL.165 chr4:107071135-107077883 chr4:104890922-104890921 hCONDEL.166 chr4:108704430-108704756 chr4:106477581-106477583 hCONDEL.167 chr4:114856148-114863728 chr4:112535306-112538141 hCONDEL.168 chr4:117382924-117386138 chr4:114985823-114985822 hCONDEL.169 chr4:117891052-117891262 chr4:115490368-115490367 hCONDEL.170 chr4:117904648-117906793 chr4:115503177-115503204 hCONDEL.171 chr4:124064030-124064132 chr4:121835918-121835930 hCONDEL.172 chr4:136571399-136586625 chr4:134143965-134148699 hCONDEL.173 chr4:138327758-138332868 chr4:135851031-135851030 hCONDEL.174 chr4:138677607-138689097 chr4:136175466-136175951 hCONDEL.175 chr4:140178866-140180016 chr4:137638327-137638326 hCONDEL.176 chr4:150840814-150851452 chr4:147936807-147936806 hCONDEL.177 chr4:152116217-152117351 chr4:149179429-149179428 hCONDEL.178 chr4:154816352-154829609 chr4:151856990-151856989 hCONDEL.179 chr4:160193964-160206419 chr4:157097896-157097895 hCONDEL.180 chr4:163788299-163791465 chr4:160651961-160651960 hCONDEL.181 chr4:167688427-167688637 chr4:164527210-164527252 hCONDEL.182 chr4:167712140-167720350 chr4:164549916-164549915 hCONDEL.183 chr4:168055357-168058468 chr4:164883610-164883609 hCONDEL.184 chr4:172488936-172490768 chr4:169272122-169272121 hCONDEL.185 chr4:174702120-174713353 chr4:171431844-171431843 hCONDEL.186 chr4:177414727-177416299 chr4:174104558-174104559 hCONDEL.187 chr4:180884364-180887965 chr4:177566434-177566433 hCONDEL.188 chr4:181219489-181237602 chr4:177890452-177890454 hCONDEL.189 chr4:186003255-186004451 chr4:182526308-182526315 hCONDEL.190 chr4:191777246-191781255 chr4:188234002-188234007 hCONDEL.191 chr4:192740552-192741372 chr4:189175420-189175419 hCONDEL.192 chr4:193759085-193761726 chr4:190194493-190194492 hCONDEL.193 chr5:10137787-10141103 chr5:10023851-10023850 hCONDEL.194 chr5:11855430-11857055 chr5:11684539-11684538 hCONDEL.195 chr5:12903469-12927075 chr5:12738943-12738942 hCONDEL.196 chr5:102931197-102941434 chr5:101231692-101231691 hCONDEL.197 chr5:104831875-104841437 chr5:103114361-103114421 hCONDEL.198 chr5:109304444-109306171 chr5:107482705-107482704 hCONDEL.199 chr5:115719856-115725101 chr5:113788954-113788970 hCONDEL.200 chr5:119469075-119473205 chr5:117461053-117461052 hCONDEL.201 chr5:121293214-121294030 chr5:119230422-119230421 hCONDEL.202 chr5:146335965-146365745 chr5:143875902-143877712 hCONDEL.203 chr5:157873850-157881260 chr5:155232812-155232811 hCONDEL.204 chr5:157966768-157968071 chr5:155316605-155316605 hCONDEL.205 chr5:164443897-164443970 chr5:161680743-161680742 hCONDEL.206 chr5:165502165-165503559 chr5:162703995-162703994 hCONDEL.207 chr5:166268612-166269762 chr5:163454087-163454086 hCONDEL.208 chr5:166943355-166944289 chr5:164121528-164121527 hCONDEL.209 chr5:177501485-177504658 chr5:174508932-174508931 hCONDEL.210 chr6:11327808-11329755 chr6:11251289-11251288 hCONDEL.211 chr6:13343842-13355125 chr6:13236389-13244419 hCONDEL.212 chr6:19992812-19995299 chr6:19734166-19734190 hCONDEL.213 chr6:23485565-23486750 chr6:23165480-23165483 hCONDEL.214 chr6:24452850-24459874 chr6:24121280-24121281 hCONDEL.215 chr6:25602808-25603326 chr6:25234713-25234760 hCONDEL.216 chr6:52283200-52390725 chr6:51201582-51201581 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 171

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.217 chr6:55632364-55648366 chr6:54379189-54382159 hCONDEL.218 chr6:67890813-67900352 chr6:67952968-67952968 hCONDEL.219 chr6:69016965-69028417 chr6:69055680-69056700 hCONDEL.220 chr6:78783048-78789673 chr6:78583104-78583103 hCONDEL.221 chr6:79524503-79530860 chr6:79301002-79301105 hCONDEL.222 chr6:80378647-80406573 chr6:80141622-80141621 hCONDEL.223 chr6:80759240-80760957 chr6:80484412-80484411 hCONDEL.224 chr6:80952460-80955358 chr6:80667992-80667991 hCONDEL.225 chr6:81105900-81119811 chr6:80818446-80818445 hCONDEL.226 chr6:82867784-82868917 chr6:82531876-82531877 hCONDEL.227 chr6:83374601-83387143 chr6:83026650-83026649 hCONDEL.228 chr6:91136138-91139453 chr6:90638555-90638822 hCONDEL.229 chr6:95434419-95434508 chr6:94821434-94821433 hCONDEL.230 chr6:104421885-104443060 chr6:103307430-103312285 hCONDEL.231 chr6:108669679-108804720 chr6:107416447-107416446 hCONDEL.232 chr6:117116353-117125637 chr6:115598027-115598205 hCONDEL.233 chr6:119602099-119617831 chr6:118044036-118056838 hCONDEL.234 chr6:120842603-120907388 chr6:119258179-119258178 hCONDEL.235 chr6:126899677-126906767 chr6:125100924-125100924 hCONDEL.236 chr6:128003572-128006646 chr6:126187769-126187770 hCONDEL.237 chr6:130230529-130234672 chr6:128368544-128368543 hCONDEL.238 chr6:131479127-131482424 chr6:129598239-129598246 hCONDEL.239 chr6:133044973-133050498 chr6:131129142-131129141 hCONDEL.240 chr6:135169897-135173050 chr6:133245764-133245770 hCONDEL.241 chr6:136676763-136678830 chr6:134740206-134740500 hCONDEL.242 chr6:142486211-142489236 chr6:140434709-140434708 hCONDEL.243 chr6:143734797-143736777 chr6:141661682-141662118 hCONDEL.244 chr6:147827857-147842450 chr6:145643441-145643443 hCONDEL.245 chr6:148539705-148547624 chr6:146345009-146345008 hCONDEL.246 chr6:155311858-155315515 chr6:152887196-152887201 hCONDEL.247 chr6:159317538-159320977 chr6:156813407-156813407 hCONDEL.248 chr6:165095994-165098948 chr6:162443771-162444086 hCONDEL.249 chr6:165257118-165259260 chr6:162598314-162598316 hCONDEL.250 chr6:168910434-168911688 chr6:166154162-166154161 hCONDEL.251 chr7:3484561-3484626 chr7:3529126-3529125 hCONDEL.252 chr7:9766426-9775486 chr7:9768264-9771774 hCONDEL.253 chr7:11232399-11233385 chr7:11213999-11214001 hCONDEL.254 chr7:11508992-11512168 chr7:11480805-11480805 hCONDEL.255 chr7:12211392-12230299 chr7:12166258-12166438 hCONDEL.256 chr7:20329907-20346699 chr7:20079227-20079226 hCONDEL.257 chr7:22358054-22361045 chr7:22071579-22071578 hCONDEL.258 chr7:32327604-32332551 chr7:31978593-31978602 hCONDEL.259 chr7:33891069-33891183 chr7:33502032-33502031 hCONDEL.260 chr7:36037815-36039300 chr7:35619014-35619013 hCONDEL.261 chr7:38764822-38769830 chr7:38287545-38287789 hCONDEL.262 chr7:68095731-68097826 chr7:67482578-67482577 hCONDEL.263 chr7:68483889-68486482 chr7:67864435-67864434 hCONDEL.264 chr7:72048380-72051374 chr7:71324622-71324621 hCONDEL.265 chr7:81238720-81239465 chr7:81263556-81263555 hCONDEL.266 chr7:85558448-85566485 chr7:85475121-85475120 hCONDEL.267 chr7:88859929-88863130 chr7:88710171-88710170 hCONDEL.268 chr7:89125790-89130017 chr7:88973920-88973920 hCONDEL.269 chr7:97491115-97512618 chr7:97253053-97259162 hCONDEL.270 chr7:98511032-98542116 chr7:98223934-98225125 hCONDEL.271 chr7:102777474-102778116 chr7:102360244-102360243 hCONDEL.272 chr7:103150392-103164421 chr7:102716453-102716452 hCONDEL.273 chr7:109462582-109471396 chr7:108854977-108854977 hCONDEL.274 chr7:110344547-110360289 chr7:109713773-109713772 hCONDEL.275 chr7:114803347-114803617 chr7:114093192-114093191 hCONDEL.276 chr7:115028738-115030984 chr7:114310071-114310070 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 172

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.277 chr7:116832222-116832716 chr7:116057824-116057825 hCONDEL.278 chr7:116832749-116834257 chr7:116057858-116057858 hCONDEL.279 chr7:120018200-120019644 chr7:119170123-119170122 hCONDEL.280 chr7:120155392-120190711 chr7:119304247-119304246 hCONDEL.281 chr7:125418733-125430772 chr7:124463572-124463642 hCONDEL.282 chr7:126555993-126561168 chr7:125570522-125570561 hCONDEL.283 chr7:126745474-126748654 chr7:125762839-125762838 hCONDEL.284 chr7:127554759-127557808 chr7:126555376-126555392 hCONDEL.285 chr7:130944756-131021455 chr7:129939049-129939134 hCONDEL.286 chr7:133510819-133514693 chr7:132368298-132368297 hCONDEL.287 chr7:142904107-142907764 chr7:141586939-141586938 hCONDEL.288 chr7:143832880-143838483 chr7:142479701-142479700 hCONDEL.289 chr7:144257448-144274184 chr7:142888826-142888825 hCONDEL.290 chr7:146356121-146388742 chr7:145147742-145147741 hCONDEL.291 chr7:153806871-153809069 chr7:152544176-152544175 hCONDEL.292 chr7:154291333-154333701 chr7:153028149-153028148 hCONDEL.293 chr8:2106994-2130132 chr8:2093982-2094002 hCONDEL.294 chr8:5304963-5308744 chr8:5142280-5142288 hCONDEL.295 chr8:6121142-6127949 chr8:5929414-5929413 hCONDEL.296 chr8:9377630-9379086 chr8:13305392-13305392 hCONDEL.297 chr8:21755898-21758644 chr8:25222809-25222808 hCONDEL.298 chr8:34260628-34262853 chr8:37479255-37479254 hCONDEL.299 chr8:61073195-61077757 chr8:63937222-63937221 hCONDEL.300 chr8:62103730-62108405 chr8:64962662-64962661 hCONDEL.301 chr8:72117509-72121848 chr8:74749765-74750179 hCONDEL.302 chr8:74466645-74468939 chr8:77045608-77045607 hCONDEL.303 chr8:78534074-78536422 chr8:81028144-81028144 hCONDEL.304 chr8:82457848-82471003 chr8:84887182-84887181 hCONDEL.305 chr8:94538975-94539466 chr8:96773220-96773219 hCONDEL.306 chr8:94894271-94899828 chr8:97116325-97116324 hCONDEL.307 chr8:97782163-97787622 chr8:99925317-99925598 hCONDEL.308 chr8:106089991-106093736 chr8:108041429-108041428 hCONDEL.309 chr8:107608129-107610914 chr8:109527335-109527334 hCONDEL.310 chr8:112648430-112651650 chr8:114444449-114444515 hCONDEL.311 chr8:112888044-112925672 chr8:114689301-114689301 hCONDEL.312 chr8:115804813-115806004 chr8:117515131-117515139 hCONDEL.313 chr8:117316459-117319690 chr8:118990212-118991046 hCONDEL.314 chr8:128240439-128240979 chr8:129714879-129714878 hCONDEL.315 chr8:129544888-129547054 chr8:131015832-131016365 hCONDEL.316 chr8:131019650-131035238 chr8:132454636-132454635 hCONDEL.317 chr8:135689031-135690213 chr8:137030237-137030238 hCONDEL.318 chr9:4030578-4047739 chr9:3972218-3972217 hCONDEL.319 chr9:4201357-4201394 chr9:4122098-4122097 hCONDEL.320 chr9:11757526-11765325 chr9:11519898-11519899 hCONDEL.321 chr9:12298112-12390631 chr9:12042668-12047247 hCONDEL.322 chr9:13775346-13776758 chr9:13414322-13414321 hCONDEL.323 chr9:22110643-22114100 chr9:21586941-21587117 hCONDEL.324 chr9:22590491-22596978 chr9:22083513-22085985 hCONDEL.325 chr9:25531966-25536368 chr9:24970780-24970779 hCONDEL.326 chr9:29830466-29832998 chr9:29239921-29239920 hCONDEL.327 chr9:38111324-38112034 chr9:37329516-37329515 hCONDEL.328 chr9:38380442-38382167 chr9:37589168-37589167 hCONDEL.329 chr9:69420571-69421055 chr9:72458202-72458205 hCONDEL.330 chr9:72153530-72154927 chr9:75144970-75144973 hCONDEL.331 chr9:83606341-83606785 chr9:86412800-86412799 hCONDEL.332 chr9:88824101-88827281 chr9:91485277-91485276 hCONDEL.333 chr9:93875342-93881580 chr9:96504040-96504039 hCONDEL.334 chr9:98626109-98626854 chr9:101191270-101191269 hCONDEL.335 chr9:98858991-98860205 chr9:101426753-101426752 hCONDEL.336 chr9:100234594-100236379 chr9:102775608-102775628 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 173

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.337 chr9:100540184-100547889 chr9:103086705-103086707 hCONDEL.338 chr9:100623227-100632860 chr9:103161033-103161032 hCONDEL.339 chr9:100904395-100905302 chr9:103431028-103431027 hCONDEL.340 chr9:107416467-107419544 chr9:109827092-109827091 hCONDEL.341 chr9:111594705-111596721 chr9:113939310-113939309 hCONDEL.342 chr9:119941826-119947514 chr9:122147295-122147294 hCONDEL.343 chr9:123970231-123971147 chr9:126089969-126089968 hCONDEL.344 chr9:131942489-131942550 chr9:133774777-133774776 hCONDEL.345 chr9:134592504-134595107 chr9:136508439-136508438 hCONDEL.346 chr9:135306080-135309322 chr9:137195130-137195129 hCONDEL.347 chr10:668163-678694 chr10:621343-621346 hCONDEL.348 chr10:21023235-21036054 chr10:20856945-20856944 hCONDEL.349 chr10:23626368-23633424 chr10:23401511-23401510 hCONDEL.350 chr10:28204540-28216292 chr10:27902575-27902574 hCONDEL.351 chr10:43346179-43347613 chr10:42997336-42997335 hCONDEL.352 chr10:44217134-44225273 chr10:43804716-43805234 hCONDEL.353 chr10:53626650-53627938 chr10:56165071-56165793 hCONDEL.354 chr10:54979818-54988412 chr10:57552908-57552907 hCONDEL.355 chr10:64874056-64877684 chr10:67350420-67350434 hCONDEL.356 chr10:65071602-65074987 chr10:67542985-67542984 hCONDEL.357 chr10:71797857-71810039 chr10:74079253-74079268 hCONDEL.358 chr10:71814890-71828325 chr10:74084101-74084150 hCONDEL.359 chr10:80787504-80789241 chr10:82514312-82514353 hCONDEL.360 chr10:81836680-81845669 chr10:83565240-83565239 hCONDEL.361 chr10:82398777-82401672 chr10:84102980-84102979 hCONDEL.362 chr10:84023358-84029686 chr10:85683734-85683733 hCONDEL.363 chr10:85183186-85190250 chr10:86800491-86800496 hCONDEL.364 chr10:85841800-85844004 chr10:87436052-87436053 hCONDEL.365 chr10:99684117-99686451 chr10:101101085-101101122 hCONDEL.366 chr10:107322188-107325632 chr10:108554017-108554016 hCONDEL.367 chr10:109799544-109807658 chr10:110952685-110952684 hCONDEL.368 chr10:110224288-110225291 chr10:111370458-111370457 hCONDEL.369 chr10:112844981-112846557 chr10:113934260-113934259 hCONDEL.370 chr10:114126069-114128004 chr10:115201026-115201025 hCONDEL.371 chr10:126466123-126466174 chr10:127180185-127180199 hCONDEL.372 chr10:126930186-126945771 chr10:127641822-127641821 hCONDEL.373 chr10:128322779-128323342 chr10:128951621-128951620 hCONDEL.374 chr11:1230609-1232410 chr11:1170151-1170499 hCONDEL.375 chr11:6731984-6738331 chr11:6851250-6851249 hCONDEL.376 chr11:7691655-7702256 chr11:7822673-7822817 hCONDEL.377 chr11:10877660-10878745 chr11:10902587-10902589 hCONDEL.378 chr11:13452670-13454662 chr11:13437157-13437181 hCONDEL.379 chr11:15610822-15611335 chr11:15644828-15644827 hCONDEL.380 chr11:27780407-27782610 chr11:27642570-27642569 hCONDEL.381 chr11:27858095-27859195 chr11:27713751-27713750 hCONDEL.382 chr11:31238346-31238494 chr11:31015271-31015270 hCONDEL.383 chr11:38109097-38122769 chr11:37770511-37770510 hCONDEL.384 chr11:40381877-40382005 chr11:39984285-39984284 hCONDEL.385 chr11:42798464-42800232 chr11:42337388-42337542 hCONDEL.386 chr11:43165591-43166620 chr11:42699755-42699761 hCONDEL.387 chr11:58799608-58808183 chr11:60097277-60098978 hCONDEL.388 chr11:60121969-60187771 chr11:61349716-61349715 hCONDEL.389 chr11:67432686-67436343 chr11:68706982-68706981 hCONDEL.390 chr11:69276147-69315310 chr11:70479222-70479221 hCONDEL.391 chr11:73731328-73732434 chr11:74742349-74742348 hCONDEL.392 chr11:75922871-75928373 chr11:76899067-76899066 hCONDEL.393 chr11:78508284-78509494 chr11:79424019-79424018 hCONDEL.394 chr11:78608059-78611746 chr11:79519725-79519724 hCONDEL.395 chr11:78809266-78814015 chr11:79708671-79708679 hCONDEL.396 chr11:79986492-79991574 chr11:80853101-80853100 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 174

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.397 chr11:81977549-81977850 chr11:82777929-82777928 hCONDEL.398 chr11:86332507-86349283 chr11:87086606-87098567 hCONDEL.399 chr11:88725282-88756818 chr11:89632372-89632371 hCONDEL.400 chr11:88938053-88958698 chr11:89810287-89810286 hCONDEL.401 chr11:90847322-90851568 chr11:91702547-91702552 hCONDEL.402 chr11:95168631-95168857 chr11:95980009-95980008 hCONDEL.403 chr11:111951280-111951438 chr11:112528727-112528726 hCONDEL.404 chr11:113836831-113836992 chr11:114437960-114437959 hCONDEL.405 chr11:114203048-114203161 chr11:114796284-114796286 hCONDEL.406 chr11:126901270-126902865 chr11:127249432-127249457 hCONDEL.407 chr11:129395344-129399693 chr11:129694377-129694376 hCONDEL.408 chr11:133146926-133149458 chr11:133362858-133362857 hCONDEL.409 chr12:1099798-1106456 chr12:1047623-1047622 hCONDEL.410 chr12:1764442-1764872 chr12:1666663-1666662 hCONDEL.411 chr12:5653405-5653778 chr12:5447232-5447231 hCONDEL.412 chr12:6515417-6517178 chr12:6284913-6284912 hCONDEL.413 chr12:17851329-17854275 chr12:17422254-17422253 hCONDEL.414 chr12:74510540-74514792 chr12:72757260-72757259 hCONDEL.415 chr12:74879977-74884912 chr12:73106072-73106082 hCONDEL.416 chr12:85324562-85329320 chr12:83495250-83498031 hCONDEL.417 chr12:87054419-87061701 chr12:85177702-85177704 hCONDEL.418 chr12:87134009-87134337 chr12:85249811-85249810 hCONDEL.419 chr12:91144667-91145550 chr12:89194096-89194095 hCONDEL.420 chr12:94297978-94304634 chr12:92275811-92275810 hCONDEL.421 chr12:96588333-96591788 chr12:94504110-94504816 hCONDEL.422 chr12:103736990-103737215 chr12:101534626-101534625 hCONDEL.423 chr12:113225484-113252894 chr12:110821957-110821961 hCONDEL.424 chr12:128055265-128056318 chr12:125178968-125178967 hCONDEL.425 chr13:25419303-25424365 chr13:25284629-25285265 hCONDEL.426 chr13:32903715-32904885 chr13:32596804-32596806 hCONDEL.427 chr13:57449526-57451242 chr13:57020327-57020326 hCONDEL.428 chr13:63210544-63211658 chr13:62698484-62698483 hCONDEL.429 chr13:65056713-65065284 chr13:64521062-64521061 hCONDEL.430 chr13:68352571-68382684 chr13:67740553-67740553 hCONDEL.431 chr13:70356648-70358950 chr13:69681112-69681111 hCONDEL.432 chr13:72860682-72860894 chr13:72125393-72125392 hCONDEL.433 chr13:77533179-77533737 chr13:76710397-76710401 hCONDEL.434 chr13:79095172-79096212 chr13:78210612-78210624 hCONDEL.435 chr13:88586206-88595726 chr13:87506668-87506714 hCONDEL.436 chr13:90463428-90469088 chr13:89289191-89294595 hCONDEL.437 chr13:93530954-93532028 chr13:92295586-92295585 hCONDEL.438 chr13:95782842-95784267 chr13:94505542-94505541 hCONDEL.439 chr14:21345829-21347043 chr14:21941105-21941117 hCONDEL.440 chr14:25708290-25711534 chr14:26326075-26326074 hCONDEL.441 chr14:25938414-25938776 chr14:26544495-26544496 hCONDEL.442 chr14:27598268-27598343 chr14:28196148-28196147 hCONDEL.443 chr14:29772615-29775361 chr14:30332296-30333068 hCONDEL.444 chr14:39264855-39272471 chr14:39613888-39613890 hCONDEL.445 chr14:42233666-42238397 chr14:42703830-42703829 hCONDEL.446 chr14:43446977-43449480 chr14:43886073-43886072 hCONDEL.447 chr14:43983894-43987252 chr14:44414629-44414628 hCONDEL.448 chr14:45191695-45192232 chr14:45615930-45615929 hCONDEL.449 chr14:48159451-48162025 chr14:48559527-48559609 hCONDEL.450 chr14:48393218-48397286 chr14:48784628-48785040 hCONDEL.451 chr14:55100838-55101299 chr14:55351524-55351584 hCONDEL.452 chr14:56302087-56303226 chr14:56536089-56536352 hCONDEL.453 chr14:65712114-65713055 chr14:65750845-65750844 hCONDEL.454 chr14:66786497-66788022 chr14:66784953-66784952 hCONDEL.455 chr14:67700436-67704757 chr14:67673911-67674158 hCONDEL.456 chr14:81558172-81559092 chr14:81185720-81185720 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 175

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.457 chr14:83231290-83236040 chr14:82829246-82829245 hCONDEL.458 chr14:86314892-86337093 chr14:85841514-85841513 hCONDEL.459 chr14:89986572-89987907 chr14:89407665-89407664 hCONDEL.460 chr14:104695467-104703378 chr14:103732418-103733222 hCONDEL.461 chr15:23416893-23429976 chr15:23772538-23772537 hCONDEL.462 chr15:23600833-23618430 chr15:23941283-23941282 hCONDEL.463 chr15:30224283-30226967 chr15:31501135-31501156 hCONDEL.464 chr15:34724698-34727781 chr15:35897179-35897178 hCONDEL.465 chr15:37453802-37463087 chr15:38559094-38559100 hCONDEL.466 chr15:41069154-41071162 chr15:42024235-42024234 hCONDEL.467 chr15:42857316-42858146 chr15:43756360-43756359 hCONDEL.468 chr15:50622774-50623941 chr15:51355430-51355429 hCONDEL.469 chr15:53648667-53650096 chr15:54303523-54303527 hCONDEL.470 chr15:60085387-60086111 chr15:60613601-60613627 hCONDEL.471 chr15:62786413-62786909 chr15:63243008-63243013 hCONDEL.472 chr15:62809831-62812708 chr15:63265561-63265572 hCONDEL.473 chr15:63757651-63770540 chr15:64181652-64181651 hCONDEL.474 chr15:65950930-65964335 chr15:66314868-66315036 hCONDEL.475 chr15:67353695-67357749 chr15:67677286-67677312 hCONDEL.476 chr15:70614588-70614669 chr15:70900453-70900517 hCONDEL.477 chr15:73713023-73714653 chr15:73962124-73962156 hCONDEL.478 chr15:78205269-78206327 chr15:78328790-78328789 hCONDEL.479 chr15:80948396-80951316 chr15:81516708-81516707 hCONDEL.480 chr15:81780199-81783875 chr15:82333110-82333109 hCONDEL.481 chr15:90143878-90147148 chr15:90690751-90690750 hCONDEL.482 chr15:93329052-93330760 chr15:93790943-93790942 hCONDEL.483 chr15:94480518-94483423 chr15:94916460-94916506 hCONDEL.484 chr15:94679203-94681729 chr15:95107240-95107239 hCONDEL.485 chr15:96486245-96488610 chr15:96876816-96876926 hCONDEL.486 chr16:8768244-8769099 chr16:8505938-8505937 hCONDEL.487 chr16:17594975-17595201 chr16:17472280-17472279 hCONDEL.488 chr16:21420772-21547160 chr16:21247986-21261362 hCONDEL.489 chr16:26547513-26552105 chr16:26090643-26090937 hCONDEL.490 chr16:50422210-50422314 chr16:49708746-49708745 hCONDEL.491 chr16:52103707-52106692 chr16:51361596-51361595 hCONDEL.492 chr16:55595723-55598172 chr16:54773829-54773828 hCONDEL.493 chr16:62075345-62077887 chr16:61099456-61099457 hCONDEL.494 chr16:63242820-63245978 chr16:62220030-62220928 hCONDEL.495 chr16:64579259-64588298 chr16:63512825-63513346 hCONDEL.496 chr16:65360912-65362678 chr16:64264928-64264927 hCONDEL.497 chr16:69001220-69007014 chr16:67782875-67783277 hCONDEL.498 chr16:69007538-69007691 chr16:67783803-67783802 hCONDEL.499 chr16:71811444-71821376 chr16:70533877-70533944 hCONDEL.500 chr16:73963132-73965258 chr16:72661114-72661115 hCONDEL.501 chr16:75989577-75997649 chr16:74614346-74614345 hCONDEL.502 chr16:76824164-76826148 chr16:75422973-75422975 hCONDEL.503 chr16:77711871-77722446 chr16:76287138-76287158 hCONDEL.504 chr16:78228008-78228421 chr16:76784948-76784948 hCONDEL.505 chr16:80296948-80297568 chr16:78796661-78796660 hCONDEL.506 chr17:4972299-4977796 chr17:4727435-4727434 hCONDEL.507 chr17:5392107-5393037 chr17:5122279-5122278 hCONDEL.508 chr17:50515572-50515803 chr17:46943230-46943229 hCONDEL.509 chr17:50913308-50920652 chr17:47334015-47334014 hCONDEL.510 chr17:50962577-50965241 chr17:47375983-47375982 hCONDEL.511 chr17:51633729-51640033 chr17:48030957-48030956 hCONDEL.512 chr17:51903440-51907464 chr17:48288464-48288492 hCONDEL.513 chr17:54504302-54504388 chr17:50824949-50824948 hCONDEL.514 chr17:59792781-59801941 chr17:56000091-56000090 hCONDEL.515 chr17:61574813-61583401 chr17:57780622-57780621 hCONDEL.516 chr17:63658983-63663260 chr17:59798473-59798472 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 176

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.517 chr17:65490317-65493602 chr17:61717155-61717645 hCONDEL.518 chr17:71058333-71059532 chr17:67156829-67156852 hCONDEL.519 chr17:72296331-72303316 chr17:68367612-68367611 hCONDEL.520 chr17:72325470-72334062 chr17:68388546-68388585 hCONDEL.521 chr17:73655869-73659805 chr17:69643820-69644236 hCONDEL.522 chr18:12652295-12655214 chr18:4025227-4025226 hCONDEL.523 chr18:13610843-13615410 chr18:3101094-3101093 hCONDEL.524 chr18:14930132-14953871 chr18:1830280-1830281 hCONDEL.525 chr18:15927314-15928782 chr18:856046-856045 hCONDEL.526 chr18:24818891-24821019 chr18:24735433-24735441 hCONDEL.527 chr18:29928716-29928809 chr18:29825941-29825940 hCONDEL.528 chr18:33871053-33873806 chr18:33682700-33682708 hCONDEL.529 chr18:34759734-34759865 chr18:34546539-34546538 hCONDEL.530 chr18:38325389-38327922 chr18:38069077-38069076 hCONDEL.531 chr18:38473072-38487853 chr18:38211866-38211865 hCONDEL.532 chr18:39947178-39955538 chr18:39636125-39638787 hCONDEL.533 chr18:47677254-47678384 chr18:47203791-47203790 hCONDEL.534 chr18:48538439-48546888 chr18:48056604-48056603 hCONDEL.535 chr18:62105266-62109777 chr18:61351187-61352009 hCONDEL.536 chr18:67943218-67945668 chr18:66975698-66975697 hCONDEL.537 chr18:68374324-68375556 chr18:67398259-67398258 hCONDEL.538 chr18:69800556-69807334 chr18:68784719-68790771 hCONDEL.539 chr19:39715022-39721601 chr19:39428900-39428899 hCONDEL.540 chr19:43325023-43371742 chr19:43006563-43006562 hCONDEL.541 chr20:3908070-3911547 chr20:3930694-3930693 hCONDEL.542 chr20:4384102-4387123 chr20:4402148-4402147 hCONDEL.543 chr20:12369666-12369826 chr20:12307713-12307721 hCONDEL.544 chr20:17991946-17992204 chr20:17726285-17726284 hCONDEL.545 chr20:18336088-18338792 chr20:18064595-18064599 hCONDEL.546 chr20:20720712-20723572 chr20:20399094-20399414 hCONDEL.547 chr20:23595375-23596680 chr20:23226565-23226564 hCONDEL.548 chr20:32901282-32905087 chr20:33849138-33849137 hCONDEL.549 chr20:36926986-36928997 chr20:37768003-37768006 hCONDEL.550 chr20:36950881-36962805 chr20:37790055-37790054 hCONDEL.551 chr20:40018280-40020391 chr20:40782886-40782885 hCONDEL.552 chr20:40081988-40082859 chr20:40844711-40844710 hCONDEL.553 chr20:53108918-53154000 chr20:53556641-53556640 hCONDEL.554 chr20:55966569-55968194 chr20:56278139-56278138 hCONDEL.555 chr21:26023122-26025584 chr21:26493005-26493181 hCONDEL.556 chr21:29796661-29798819 chr21:30291720-30291719 hCONDEL.557 chr21:32574848-32576856 chr21:33127421-33127420 hCONDEL.558 chr22:35741610-35744726 chr22:35589603-35589602 hCONDEL.559 chr22:40328941-40333556 chr22:40026461-40026460 hCONDEL.560 chrX:9011278-9017884 chrX:9063791-9063790 hCONDEL.561 chrX:10507958-10508029 chrX:10520864-10520863 hCONDEL.562 chrX:16864977-16866897 chrX:16803487-16803682 hCONDEL.563 chrX:19605502-19610782 chrX:19494077-19494076 hCONDEL.564 chrX:25145156-25145236 chrX:24950657-24950656 hCONDEL.565 chrX:25952294-25958714 chrX:25749517-25752762 hCONDEL.566 chrX:42814504-42817900 chrX:42288020-42288447 hCONDEL.567 chrX:50860207-50860816 chrX:50561798-50561800 hCONDEL.568 chrX:65527853-65615914 chrX:65455439-65482386 hCONDEL.569 chrX:67124889-67185574 chrX:66990372-66990992 hCONDEL.570 chrX:68804172-68824709 chrX:68615749-68619493 hCONDEL.571 chrX:92972917-93015141 chrX:92703940-92705816 hCONDEL.572 chrX:95807147-95817247 chrX:95468657-95468822 hCONDEL.573 chrX:96783291-96784638 chrX:96400485-96400484 hCONDEL.574 chrX:103406574-103412580 chrX:102912414-102912422 hCONDEL.575 chrX:104444629-104466536 chrX:104046741-104050563 hCONDEL.576 chrX:112410767-112410982 chrX:111998822-111998821 Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 177

Table B.2 – Continued

hCONDEL ID PanTro2 coordinates Hg18 coordinates

hCONDEL.577 chrX:114310141-114312312 chrX:113861179-113862591 hCONDEL.578 chrX:117180079-117182256 chrX:116755394-116755393 hCONDEL.579 chrX:117668052-117670746 chrX:117231839-117232975 hCONDEL.580 chrX:120431092-120442514 chrX:120018437-120027305 hCONDEL.581 chrX:128358743-128361999 chrX:127871330-127871329 hCONDEL.582 chrX:147258586-147260340 chrX:146681970-146682005 hCONDEL.583 chrX:152024872-152052312 chrX:151439689-151455378 APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 178

Table B.3: Validation of 39 randomly-selected human conserved sequence deletions by PCR in 23 human populations.

Chrom Start End ID Comp. PCR Chimp Human Num Primers (5’ -- 3’) Pred. Size Size Primers chr1 3551305 3551400 1 D D 674 580 2 GGATTGCGGCCGCTATGGCACCTCAGCTCTCTGGGGCCAG, GGATTGCGGCCGCTATGGGGTGGGAGCTACTAGAACCCTG chr1 83810583 83811324 17 D D 930 189 2 CAATCACAGATAATACTATGGGAGAC, ATTCCCTGGTGGCTATGAGAATAAG chr1 89861835 89861929 19 P P 453 359 2 GGATTTGGAACAGGTGTCAAGAACC, CAAGAAGAAGCATTTAGAATCCTGAC chr1 93331128 93331396 20 D D 1157 930 2 CAGGAAGCTCAGTGTATTTCTCTAC, GACTGGACGAGACGCACTGTTGGAG chr1 167485782 167486268 29 D D 809 229 2 AGTGCAAGGCCAAATCTCTTG, CTCTTTTCAGAGGCAGAGTAGA chr1 191352616 191353398 34 D D 1990 1210 2 ATGCATAGTAAACAGTCAGAGGTGG, ATGATGGAATGCCAACCGATAAGGG chr2a 35564667 35565203 42 D D 842 306 2 AGGGTTGGTATGGTAGGTCCAAGGA, GCAGAAGCCAGACAGGCTTTCATAA chr2a 40561990 40562569 44 D D 782 203 2 ATTGGCCATCTCTTCATGCCCTGAG, ACTTGCTAATGCATTCCCTGATGGG chr2a 42539972 42540536 47 D D 1677 1114 2 GGAAACAGTTGGAAATAGCCTCAGG, GCTGAAAGAGTGATGAAGCAGACAG chr2b 128750929 128751273 64 P P 1609 1951 2 ATCTACCAGGGTCCAGTCTACAAGG, TATTCCTGACATGTTCAGCCCATGG chr7 102777473 102778116 271 P P 2187 1544 2 GCTTATACTCACTGTGAATGTTGGG, AAGCCTGCAGTTCACCATCACATCG chr7 114803346 114803617 275 D D 1152 882 2 GGATTGCGGCCGCTAGTGGTAGAACTCAGAATTCAGAGGA, GGATTGCGGCCGCTACTGAAGTAGCAATTGCCAAACGA chr11 40381876 40382005 384 D D 562 433 2 CAGGACTGTTCAGCTGGATTTGAGG, GCAATTAGGTATTGTGGGCAGAGTC chr11 81977548 81977850 397 D D 452 150 2 GTGGAAGCACTTTGTAGAATGCAGC, TGTTGGCACAGACATTGGTCAAGA chr11 95168630 95168857 402 D D 472 245 2 TGATTCAAATGACCTTCGTGGGTGC, ACCAGGAGTGGGATCTACTTTCTGG chr11 114203047 114203161 405 D D 340 226 2 GGGACAAGTTATAGCAGAAAACTTGG, ATGGTGGCAGCTGAGGAATTAAATG chr12 1764441 1764872 410 D D 1323 873 2 TATGTCCATGTACATACCACCTAGG, AGTGAAATCCAGCCAGCTATTTAGG chr12 5653404 5653778 411 D D 1248 874 2 AAGGATTGCGGCCGCTCTGAATGCTAAAATCTTGCTTCTCTG, AAGGATTGCGGCCGCTATAGCCTCCATCAGAGCTTTCCCTG chr12 103736989 103737215 422 D D 504 278 2 TCACAAGAACTACAGTCCTCATCTC, GGTATGTATCTGAACCTGATTAGGC chr13 72860681 72860894 432 P P 1079 866 2 CACCCTGATATACATACACACCGTG, TGGTCACGGTCTTTCCCTAAGCGTA chr14 25938413 25938776 441 D D 1343 982 2 TTGAGTTATCTAGGAAGGGCAAGCG, ATACCATCACTGCCTAACTAGGATG chr14 65712113 65713055 453 D D 1349 407 2 CTGCTGGCTCAAACACAACACTTCA, GAACCTAACAGGGCAGCAGTGGTCA chr15 62786412 62786909 471 D D 1157 660 2 TAAATCAGAAATTGGGGCTGTCTGG, GGTGCTTTCAGTTGCTGTGTACAATC chr16 69007537 69007691 498 D D 1133 979 2 TGAGAGGTTGAAGAGATAGAAGAGG, AAAGCAAGAGGGTCCGAACACAGCA chr17 50515571 50515803 508 D P 1200 1154 2 TTCAGTCATGCTAATTGGAAGAGGG, GAAACTAGGTAGTATTGACAGAGTAG chr17 54504301 54504388 513 D D 777 690 2 AAAGACTGTCACCTCAGTCCTATAG, GAAATAGGCTTGCCTTTAGATCGGG chr20 40081987 40082859 552 D D 1328 456 2 GAAATACGGAATATGGTTAAGTGGC, CGATTAGTAGTCAAAGTCTAAGACATAG chr1 67000354 67001170 14 D D 1119, 420 303 3 GCAAATTGCCATGCTGTTTCTGAGG, CGCATGTATATGATGACTGTGAGCA, CTCAATGGTCTATTGCTACAGCAGA chr2a 51356636 51357594 49 D D 1467, 392 509 3 TTGCCACTACCCAAGGATGCTGTGA, GAGCATTTGTACTGAGTGTTACAGG, GGCATTTTTGGAAAGAGCCTCATGA chr2b 155116774 155117678 71 P P 1310, 314 417 3 AGGGAATACTCTGTTCACATCTGCA, CAAGTTCTTTACCATCTCAGCCTCA, CAGCAAGCTTCAGCCACTAAGCAGA chr3 133455574 133455774 129 D D 703, 405 503 3 TGGAATAGCACAGGGCCATTGTAGA, CTGGCTCCAGATCTTCATGTAACAG, GTGGCACTCAGATGAATGGCTAGGA chr14 55100837 55101299 451 D D 701, 360 239 3 GCTAGTTACTGCTCCTGGAAAGCCA, AATCAGTAAGCACTAACCAGGACTG, AATACCTCTCACTGGTGGGACTTGA chr14 81558171 81559092 456 P P 1355, 525 435 3 ATTGCTGGCTCCTGATCTTTGGAGA, GGAAGTACAGATTTTGGCGTCATCA, CTGGTCCCATTTCTCATTGCCTGGA chr16 50422209 50422314 490 D D 391, 199 286 3 CCATATTGTAGAGCTCAGAAATGCC, GACTTCATGGGTACACTGTAGATCG, TTGGTTCTATAAACCATGACCA chr16 80296947 80297568 505 D D 1074, 356 453 3 ATGTGAGCACCAAGGCACCAAGGA, AGACCATCCTGGCTAACATGGTGA, CCACGTAGTCTCTTACCTTGGCAA chr20 12369665 12369826 543 P P 635, 315 482 3 TTAAGCACTTCTGGCTCATTAGGCA, GCTTACAAACATGCAAAAGGAGCAG, CGACCGAACGGCATATTACCAGGTA chr2b 241400204 241400787 96 D D 882 293 4 TAACAAGTGTAGGGGAGGCTTCGAG, ATGAGGATTTGTGCAGATATGGTTG, TCCAGCGTGCCTCCTTAACACTCAC, CACAGGGTTGTCTCTCATGCAGTGC chr3 137310809 137311397 130 D D 1069 488 4 GACAGTGGTCACCTAGAGTGGCAAC, CCATACTACTTCAGTTCTAAAGGCAATG, CACCCAGTGTTCTGAGGATTATGTC, GCTTATATAGGCAAAACACATACACAC chr20 17991945 17992204 544 D D 1672 1415 4 CACCACATCAAACACTCTGAGGTTG, AGGTAACTCAGGTATGGTCAAGAGG, CGTAAATGAAGCTTGTACACCTGGG, GAATTTCTGGCAAGATTTTCCCTCC

ID column: Unique identifier for each hCONDEL from Table B.2. Comp. Pred. column: “D” indicates all NCBI Trace Archive reads support the deletion genotype, “P” indicates at least one Trace Archive read supports the ancestral genotype (Section B.1.5). PCR column: “D” indicates all 23 populations showed only the deletion genotype, “P” indicates that at least one population showed the ancestral genotype. Chimp Size column: Expected size of PCR product for ancestral genotype. Human Size column: Expected size of PCR product for deletion genotype. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 179

Table B.4: Analysis of eight polymorphic human conserved sequence deletions by PCR in 96 diverse humans.

Catalog # Population † hCONDEL ID ‡ 19 64 71 271 432 456 508 543

NA10473 Pygmy* A D H A H D A D NA10492 Pygmy A H H A A A A H NA10493 Pygmy A A A A D A D D NA10539 Melanesian* A A A A A A D D NA10540 Melanesian A A A A A A D D NA10541 Melanesian A A A A A A D H NA10965 Amerindian, Karitiana, Brazil A A D A A H D A NA10967 Amerindian, Karitiana, Brazil A A H A A A H D NA10968 Amerindian, Karitiana, Brazil A H D A A A D D NA10975 Amerindian, Mayan* A D A A A A D D NA10976 Amerindian, Mayan A D D A A H D H NA10978 Amerindian, Mayan A D A A A A D D NA11198 Amerindian, Quechua, Peru* A A D A A A D D NA11199 Amerindian, Quechua, Peru A D H A A A H H NA11200 Amerindian, Quechua, Peru A D D A A H H D NA11373 Cambodian* A H A A A H A D NA11376 Cambodian A H H A A H H D NA11377 Cambodian A D H A A D D D NA11521 Druze* A H H A A H D D NA11522 Druze A D D A A D H D NA11523 Druze A D H A A H A H NA11587 Japanese A H H A A H D D NA11589 Japanese A A H A A H D D NA11590 Japanese A H H A A H D D NA13597 Atayal* A H H A A A D H NA13598 Atayal A H A A A D H D NA13599 Atayal A D A A A H H D NA13607 Ami A D H A A A D D NA13608 Ami A H D A A H H D NA13609 Ami A D A A A H D D NA13619 Adygei* A D D A A A H H NA13620 Adygei A H H A A A D D NA13623 Adygei A A A A H D A D NA13835 Russian A A A A A H D H NA13836 Russian A D H A A H D D NA13838 Russian A A H A A D D D NA15199 HVP-Hungarian* A D H A A D D D NA15202 HVP-Hungarian A H H A A D D H NA15207 HVP-Hungarian A H H A A H D D NA15755 Icelandic A D A A D A D D NA15758 Icelandic A D A A H H D H NA15765 Icelandic A D A A A D D H NA15886 Basque* A D D A A H A D NA16185 Basque A D A A A H D H NA16190 Basque A D H A A H D H NA16654 Chinese* A H H A A H H D NA16688 Chinese A H H A A D H D NA16689 Chinese A H A A A D D D NA17003 HVP-Northern European* A A H A A A D H NA17004 HVP-Northern European A D H A A A D D NA17006 HVP-Northern European A D H A A H H D NA17021 HVP-Indo Pakistani* A H A A A A D D NA17027 HVP-Indo Pakistani A D A A A A D D NA17029 HVP-Indo Pakistani A D H A A H D D NA17041 HVP-Middle Eastern (Version 1)* A D H A A H D H Continued on Next Page. . . APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 180

Table B.4 – Continued

Catalog # Population † hCONDEL ID ‡ 19 64 71 271 432 456 508 543

NA17042 HVP-Middle Eastern (Version 1) A D D A A A D D NA17050 HVP-Middle Eastern (Version 1) A D H A A D D D NA17081 HVP-Southeast Asians A D H A A A D H NA17083 HVP-Southeast Asians A H H A A H H D NA17085 HVP-Southeast Asians A D A A A H D D NA17094 Iberians* A D A A A A D H NA17099 Iberians A A A A A A D H NA17100 Iberians A D A A A D D D NA17109 HVP-African American Panel of 50* A A H A A A D H NA17111 HVP-African American Panel of 50 A H H A H A H H NA17112 HVP-African American Panel of 50 H A A A H H D D NA17342 HVP-African South of the Sahara* A A H H H H H H NA17344 HVP-African South of the Sahara A D D A A A A H NA17347 HVP-African South of the Sahara A H A A D A A H NA17361 HVP-Ashkenazi Jewish A D A A A A D D NA17367 HVP-Ashkenazi Jewish A D D A A D D D NA17369 HVP-Ashkenazi Jewish A D A A A D D D NA17371 HVP-Greek* A D A A A H D H NA17372 HVP-Greek A H A A A H D H NA17377 HVP-Greek A H H A D D D D NA17380 HVP-Africans North of the Sahara* A H A A A A D D NA17381 HVP-Africans North of the Sahara A H H A A D H D NA17384 HVP-Africans North of the Sahara A A A A A H D D NA17385 HVP-Pacific* A D H A A D D H NA17390 HVP-Pacific A A A A A D D D NA17391 HVP-Pacific A A H A A H D H NA17392 HVP-Mexican Indian A D A A A A D D NA17394 HVP-Mexican Indian A D H A A A D H NA17395 HVP-Mexican Indian* A D H A A A D D NA19027 IHP-Luhya from Webuye* A A H H A A D H NA19238 IHP-Yoruba in Ibadan, Nigeria* A A H A A H H D NA19239 IHP-Yoruba in Ibadan, Nigeria A A H H D H H D NA19240 IHP-Yoruba in Ibadan, Nigeria A A D A H H A D NA19473 IHP-Luhya from Webuye H H H A A H H D NA19474 IHP-Luhya from Webuye A H H A A H A D NA20502 IHP-Toscani in Italia (Tuscans in Italy) A D A A A A D D NA20504 IHP-Toscani in Italia (Tuscans in Italy) A D A A A D D D NA20509 IHP-Toscani in Italia (Tuscans in Italy) A A H A A A D D NA20845 Gujarati Indians in Houston A D D A A A H D NA20846 Gujarati Indians in Houston A A H A A A H D NA21144 Gujarati Indians in Houston A A H A A A H H

* Sample was tested in 23-population polymorphism analysis (Table B.3). † HVP, Human Variation Panel; IHP, International HapMap Project. ‡ A, ancestral; H, heterozygous; D, derived. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 181

Table B.5: Derived allele frequency of hCONDELs in a human diversity panel, grouped by geographical region.

hCONDEL All Samples Africa Europe Central and Southern America Eastern and Southeastern Asia and Paci c Western and Southern Asia hCONDEL.543 0.83 0.81 0.80 0.79 0.90 0.88 hCONDEL.508 0.78 0.50 0.90 0.88 0.81 0.75 hCONDEL.64 0.59 0.31 0.72 0.71 0.52 0.75 hCONDEL.456 0.40 0.33 0.50 0.13 0.50 0.33 hCONDEL.71 0.39 0.42 0.30 0.58 0.31 0.54 hCONDEL.432 0.09 0.31 0.10 0.00 0.00 0.00 hCONDEL.271 0.02 0.08 0.00 0.00 0.00 0.00 hCONDEL.19 0.01 0.06 0.00 0.00 0.00 0.00

Number of chromosomes 192 36 60 24 48 24 in population

Population groupings are as follows, using population names from Table B.4. Africa: African American Panel of 50; Africans North of the Sahara; African South of the Sahara; Luhya from Webuye; Pygmy; Yoruba in Ibadan, Nigeria. Europe: Adygei; Basque; HVP-Ashkenazi Jewish; HVP-Greek; HVP-Hungarian; HVP-Northern European; Iberians; Icelandic; IHP-Toscani in Italia (Tuscans in Italy); Russian. Central and Southern America: Amerindian, Karitiana, Brazil; Amerindian, Mayan; Amerindian, Quechua, Peru; Mexican Indian. Eastern and Southeastern Asia and Pacific: Ami; Atayal; Cambodian; Chinese; HVP-Pacific; HVP-Southeast Asians; Japanese; Melanesian. Western and Southern Asia: Druze; Gujarati Indians in Houston; HVP-Indo Pakistani; HVP-Middle Eastern (Version 1).

Table B.6: hCONDELs in regions of the human genome under recent positive selection.

The nearest upstream and downstream genes within 1 Mb of the hCONDEL, as well as all genes to which the hCONDEL is intronic, are listed. All genes are identified as human or chimpanzee RefSeq genes with a status of Reviewed or Provisional. PanTro2 coordinates column: “*” indicates hCONDEL has been computationally identified as polymorphic in humans (Figure B.6). Comp. Valid. column: “Y” indicates the deletion genotype has been computationally validated. The only hCONDEL not validated has human size 1,239 bp and is thus biased against validation (Section B.1.5). APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 182

Table B.7: Summary of all exonic human conserved sequence deletions.

Comp. Valid. column: “Y” indicates the deletion genotype has been computationally validated (Section B.1.5). Exp. Valid. column: “Y” indicates the hCONDEL has been experimentally validated [51], “N” indicates that PCR attempts to detect the deletion genotype were unsuccessful (see Exonic hCONDEL validation attempts). “*”: hCONDEL.228, of size 3,316 bp in chimpanzee, corresponds to a 268 bp region in human that is flanked by two repeats rather than being entirely spanned by a single repeat in human.

Table B.8: Primers and results of PCR examination of unsupported exonic human con- served sequence deletions.

Primers used to verify ancestral sequence hCONDEL # Showing ancestral genotype Forward Reverse hCONDEL.231 CTCTTCCTGCCCTGCTATACA TGACTGGTCCAAAGTAAGTTC 94 of 94 hCONDEL.234 AGAGGGACAAGGTAGTCCTAAATGG TACAAATAGTCTGTCTTCAGTGTGG 96 of 96 hCONDEL.285 TGCTGTAGGGTCATCATCAGGCAAA TTTGACTGCACCAACACTCTCAATG 96 of 96

Table B.9: Gene Ontology Molecular Function annotation enrichments of computation- ally validated hCONDELs.

Term Exp Obs Fold Binomial Obs dels dels enrich P-value [154] genes Steroid hormone receptor activity 4.7 14 2.96 3.7 × 10−4 10 Ligand-dependent nuclear receptor activity 5.1 14 2.74 7.8 × 10−4 10

Showing uncorrected p-values of all terms annotated to at least 25 genes enriched when expecting a single term to be significant by chance after correcting for multiple hypothesis testing and clustering of conserved elements near particular gene classes. P-value cutoff for significance: 8.0 × 10−4. Exp, expected; Obs, observed; dels, hCONDELs. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 183

Table B.10: InterPro annotation enrichments of computationally validated hCONDELs.

Term Exp Obs Fold Binomial Obs dels dels enrich P-value [154] genes Fibronectin, type III 16.7 34 2.03 1.0 × 10−4 27 Zinc finger, nuclear hormone receptor-type 4.4 14 3.19 1.8 × 10−4 10 Nuclear hormone receptor, ligand-binding 4.4 14 3.15 2.1 × 10−4 10 Nuclear hormone receptor, ligand-binding, core 4.4 14 3.15 2.1 × 10−4 10 Fibronectin, type III-like fold 17.6 34 1.94 2.4 × 10−4 28 Zinc finger, NHR/GATA-type 5.2 15 2.89 3.0 × 10−4 11 Steroid hormone receptor 3.8 12 3.15 5.8 × 10−4 8 CD80-like, immunoglobulin C2-set 2.1 8 3.84 1.4 × 10−3 5

Showing uncorrected p-values of all terms annotated to at least 25 genes enriched when expecting a single term to be significant by chance after correcting for multiple hypothesis testing and clustering of conserved elements near particular gene classes. P-value cutoff for significance: 2.1 × 10−3. Exp, expected; Obs, observed; dels, hCONDELs.

Table B.11: All annotation enrichments of computationally validated cCONDELs.

Ontology: Term Exp Obs Fold Binomial Obs dels dels enrich P-value [154] genes GO Cellular Component: Synapse 15.6 31 1.98 2.7 × 10−4 27 Postsynaptic membrane 7.6 19 2.51 2.8 × 10−4 19 PANTHER pathway: Ionotropic glutamate receptor pathway 2.6 10 3.80 3.9 × 10−4 8 Metabotropic glutamate receptor group III pathway 3.5 11 3.15 9.4 × 10−4 9 Muscarinic acetylcholine receptor 1 and 3 signaling pathway 2.4 8 3.33 3.2 × 10−3 5 Heterotrimeric G-protein signaling pathway-Gq alpha and Go alpha mediated pathway 4.8 11 2.29 1.0 × 10−2 11

Showing uncorrected p-values of all terms annotated to at least 25 genes enriched when expecting a single term to be significant by chance after correcting for multiple hypothesis testing and clustering of conserved elements near particular gene classes. P-value cutoffs for significance: GO Cellular Component: 1.9 × 10−3, PANTHER pathway: 1.4 × 10−2. Exp, expected; Obs, observed; dels, cCONDELs. APPENDIX B. SUPPLEMENTARY MATERIAL FOR CHAPTER 5 184

Table B.12: All annotation enrichments of computationally validated mCONDELs.

Ontology: Term Exp Obs Fold Binomial Obs dels dels enrich P-value [154] genes GO Molecular Function: Metal ion binding 116.8 149 1.28 2.1 × 10−4 162 Peptidyl-prolyl cis-trans isomerase activity 0.5 5 9.50 2.1 × 10−4 3 Cis-trans isomerase activity 0.6 5 8.65 3.3 × 10−4 3 Ion binding 119.5 150 1.25 4.4 × 10−4 165 Transition metal ion binding 75.5 101 1.34 8.1 × 10−4 107 InterPro: ATPase, P-type, K/Mg/Cd/Cu/Zn/Na/ 1.7 8 4.62 4.2 × 10−4 7 Ca/Na/H-transporter ATPase, AAA+ type, core 4.6 13 2.83 8.8 × 10−4 13

Showing uncorrected p-values of all terms annotated to at least 25 genes enriched when expecting a single term to be significant by chance after correcting for multiple hypothesis testing and clustering of conserved elements near particular gene classes. P-value cutoffs for significance: GO Molecular Function: 9.3 × 10−4, InterPro: 1.7 × 10−3. Exp, expected; Obs, observed; dels, mCONDELs. Bibliography

[1] 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467:1061–1073, Oct 2010.

[2] N. Ahituv, Y. Zhu, A. Visel, A. Holt, V. Afzal, L. A. Pennacchio, and E. M. Rubin. Deletion of ultraconserved elements yields viable mice. PLoS Biol., 5:e234, Sep 2007.

[3] F. Al-Shahrour, J. Carbonell, P. Minguez, S. Goetz, A. Conesa, J. T´arraga, I. Medina, E. Alloza, D. Montaner, and J. Dopazo. Babelomics: advanced functional profiling of transcriptomics, proteomics and genomics experiments. Nucleic Acids Res., 36:W341–346, Jul 2008.

[4] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour. Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet., 7:55–65, Jan 2006.

[5] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403–410, Oct 1990.

[6] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389–3402, Sep 1997.

[7] R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, E. Birney, M. Biswas, P. Bucher, L. Cerutti, F. Corpet, M. D. Croning, R. Durbin, L. Falquet, W. Fleischmann, J. Gouzy, H. Hermjakob, N. Hulo, I. Jonassen, D. Kahn,

185 BIBLIOGRAPHY 186

A. Kanapin, Y. Karavidopoulou, R. Lopez, B. Marx, N. J. Mulder, T. M. Oinn, M. Pagni, F. Servant, C. J. Sigrist, and E. M. Zdobnov. The InterPro database, an integrated documentation resource for protein families, domains and func- tional sites. Nucleic Acids Res., 29:37–40, Jan 2001.

[8] P. Arcelli, C. Frassoni, M. C. Regondi, S. De Biasi, and R. Spreafico. GABAergic neurons in mammalian thalamus: a marker of thalamic complexity? Brain Res. Bull., 42:27–37, 1997.

[9] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ring- wald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25:25–29, May 2000.

[10] C. J. Auernhammer and S. Melmed. Leukemia-inhibitory factor-neuroimmune modulator of endocrine function. Endocr. Rev., 21:313–345, Jun 2000.

[11] I. Barbaric, G. Miller, and T. N. Dear. Appearances can be deceiving: phe- notypes of knockout mice. Brief Funct. Genomic Proteomic, 6:91–103, Jun 2007.

[12] S. Batzoglou, D. B. Jaffe, K. Stanley, J. Butler, S. Gnerre, E. Mauceli, B. Berger, J. P. Mesirov, and E. S. Lander. ARACHNE: a whole-genome shotgun assem- bler. Genome Res., 12:177–189, Jan 2002.

[13] C. G. Begley, S. Lipkowitz, V. G¨obel, K. A. Mahon, V. Bertness, A. R. Green, N. M. Gough, and I. R. Kirsch. Molecular characterization of NSCL, a gene encoding a helix-loop-helix protein expressed in the developing nervous system. Proc. Natl. Acad. Sci. U.S.A., 89:38–42, Jan 1992.

[14] T. Beissbarth and T. P. Speed. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics, 20:1464–1465, Jun 2004. BIBLIOGRAPHY 187

[15] G. Bejerano, C. B. Lowe, N. Ahituv, B. King, A. Siepel, S. R. Salama, E. M. Rubin, W. J. Kent, and D. Haussler. A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature, 441:87–90, May 2006.

[16] G. Bejerano, M. Pheasant, I. Makunin, S. Stephen, W. J. Kent, J. S. Mattick, and D. Haussler. Ultraconserved elements in the human genome. Science, 304:1321–1325, May 2004.

[17] H. G. Belting, C. S. Shashikant, and F. H. Ruddle. Modification of expression and cis-regulation of Hoxc8 in the evolution of diverged axial morphology. Proc. Natl. Acad. Sci. U.S.A., 95:2355–2360, Mar 1998.

[18] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate – a practical and powerful approach to multiple hypothesis testing. J. R. Stat. Soc. Ser. B, 57:289–300, 1995.

[19] S. Benko, J. A. Fantes, J. Amiel, D. J. Kleinjan, S. Thomas, J. Ramsay, N. Jamshidi, A. Essafi, S. Heaney, C. T. Gordon, D. McBride, C. Golzio, M. Fisher, P. Perry, V. Abadie, C. Ayuso, M. Holder-Espinasse, N. Kilpatrick, M. M. Lees, A. Picard, I. K. Temple, P. Thomas, M. P. Vazquez, M. Veke- mans, H. Roest Crollius, N. D. Hastie, A. Munnich, H. C. Etchevers, A. Pelet, P. G. Farlie, D. R. Fitzpatrick, and S. Lyonnet. Highly conserved non-coding elements on either side of SOX9 associated with Pierre Robin sequence. Nat. Genet., 41:359–364, Mar 2009.

[20] M. F. Berger, G. Badis, A. R. Gehrke, S. Talukder, A. A. Philippakis, L. Pena- Castillo, T. M. Alleyne, S. Mnaimneh, O. B. Botvinnik, E. T. Chan, F. Khalid, W. Zhang, D. Newburger, S. A. Jaeger, Q. D. Morris, M. L. Bulyk, and T. R. Hughes. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell, 133:1266–1276, Jun 2008. BIBLIOGRAPHY 188

[21] C. Bertolotto, J. E. Ricci, F. Luciano, B. Mari, J. C. Chambard, and P. Auberger. Cleavage of the serum response factor during death receptor- induced apoptosis results in an inhibition of the c-FOS promoter transcriptional activity. J. Biol. Chem., 275:12941–12947, Apr 2000.

[22] D. Beysen, J. Raes, B. P. Leroy, A. Lucassen, J. R. Yates, J. Clayton-Smith, H. Ilyina, S. S. Brooks, S. Christin-Maitre, M. Fellous, J. P. Fryns, J. R. Kim, P. Lapunzina, E. Lemyre, F. Meire, L. M. Messiaen, C. Oley, M. Splitt, J. Thom- son, Y. Van de Peer, R. A. Veitia, A. De Paepe, and E. De Baere. Deletions involving long-range conserved nongenic sequences upstream and downstream of FOXL2 as a novel disease-causing mechanism in blepharophimosis syndrome. Am. J. Hum. Genet., 77:205–218, Aug 2005.

[23] D. L. Black. Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem., 72:291–336, 2003.

[24] J. A. Blake, C. J. Bult, J. T. Eppig, J. A. Kadin, J. E. Richardson, M. T. Airey, A. Anagnostopoulos, R. P. Babiuk, R. M. Baldarelli, M. J. Baya, J. S. Beal, S. M. Bello, D. W. Bradt, D. L. Burkart, N. E. Butler, J. W. Campbell, L. E. Corbani, S. L. Cousins, D. J. Dahmen, H. Dene, A. D. Diehl, K. L. Forthofer, K. S. Frazer, D. E. Geel, M. M. Hall, M. Knowlton, J. R. Lewis, I. Lu, L. J. Maltais, M. McAndrews-Hill, S. McClatchy, M. J. McCrossin, T. F. Meehan, D. B. Miers, L. A. Miller, L. Ni, H. Onda, J. E. Ormsby, D. J. Reed, B. Richards-Smith, D. R. Shaw, R. Sinclair, D. Sitnikov, C. L. Smith, P. Szauter, M. Tomczu, L. L. Washburn, I. T. Witham, and Y. Zhu. The Mouse Genome Database genotypes::phenotypes. Nucleic Acids Res., 37:D712–719, Jan 2009.

[25] M. Blanchette, A. R. Bataille, X. Chen, C. Poitras, J. Laganire, C. Lefbvre, G. Deblois, V. Gigure, V. Ferretti, D. Bergeron, B. Coulombe, and F. Robert. Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res., 16:656–668, May 2006. BIBLIOGRAPHY 189

[26] M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. Smit, K. M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green, D. Haussler, and W. Miller. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res., 14:708–715, Apr 2004.

[27] H. A. Booth and P. W. Holland. Annotation, nomenclature and evolution of four novel homeobox genes expressed in the human germ line. Gene, 387:7–14, Jan 2007.

[28] H. Brentani, O. L. Caballero, A. A. Camargo, A. M. da Silva, W. A. da Silva, E. Dias Neto, M. Grivet, A. Gruber, P. E. Guimaraes, W. Hide, C. Iseli, C. V. Jongeneel, J. Kelso, M. A. Nagai, E. P. Ojopi, E. C. Osorio, E. M. Reis, G. J. Riggins, A. J. Simpson, S. de Souza, B. J. Stevenson, R. L. Strausberg, E. H. Ta- jara, S. Verjovski-Almeida, M. L. Acencio, M. H. Bengtson, F. Bettoni, W. F. Bodmer, M. R. Briones, L. P. Camargo, W. Cavenee, J. M. Cerutti, L. E. Coelho Andrade, P. C. Costa dos Santos, M. C. Ramos Costa, I. T. da Silva, M. R. Est´ecio,K. Sa Ferreira, F. B. Furnari, M. Faria, P. A. Galante, G. S. Guimaraes, A. J. Holanda, E. T. Kimura, M. R. Leerkes, X. Lu, R. M. Maciel, E. A. Martins, K. B. Massirer, A. S. Melo, C. A. Mestriner, E. C. Miracca, L. L. Miranda, F. G. Nobrega, P. S. Oliveira, A. C. Paquola, J. R. Pandolfi, M. I. Campos Pardini, F. Passetti, J. Quackenbush, B. Schnabel, M. C. Soga- yar, J. E. Souza, S. R. Valentini, A. C. Zaiats, E. J. Amaral, L. A. Arnaldi, A. G. de Ara´ujo,S. A. de Bessa, D. C. Bicknell, M. E. Ribeiro de Camaro, D. M. Carraro, H. Carrer, A. F. Carvalho, C. Colin, F. Costa, C. Curcio, I. D. Guerreiro da Silva, N. Pereira da Silva, M. Dellamano, H. El-Dorry, E. M. Espreafico, A. J. Scattone Ferreira, C. Ayres Ferreira, M. A. Fortes, A. H. Gama, D. Giannella-Neto, M. L. Giannella, R. R. Giorgi, G. H. Goldman, M. H. Goldman, C. Hackel, P. L. Ho, E. M. Kimura, L. P. Kowalski, J. E. Krieger, L. C. Leite, A. Lopes, A. M. Luna, A. Mackay, S. K. Mari, A. A. Marques, W. K. Martins, A. Montagnini, M. Mour˜ao Neto, A. L. Nascimento, A. M. Neville, M. P. Nobrega, M. J. O’Hare, A. Y. Otsuka, A. I. Ruas de Melo, M. L. Paco-Larson, G. Guimar˜aesPereira, N. Pereira da Silva, J. B. Pesquero, J. G. BIBLIOGRAPHY 190

Pessoa, P. Rahal, C. A. Rainho, V. Rodrigues, S. R. Rogatto, C. M. Romano, J. G. Romeiro, B. M. Rossi, M. Rusticci, R. Guerra de S´a, S. C. Sant’ Anna, M. L. Sarmazo, T. C. Silva, F. A. Soares, M. d. e. F. Sonati, J. de Freitas Sousa, D. Queiroz, V. Valente, A. L. Vettore, F. E. Villanova, M. A. Zago, and H. Zal- cberg. The generation and utilization of a cancer-oriented representation of the human transcriptome by using expressed sequence tags. Proc. Natl. Acad. Sci. U.S.A., 100:13418–13423, Nov 2003.

[29] J. F. Bromberg, M. H. Wrzeszczynska, G. Devgan, Y. Zhao, R. G. Pestell, C. Albanese, and J. E. Darnell. Stat3 as an oncogene. Cell, 98:295–303, Aug 1999.

[30] A. M. Bronikowski, J. S. Rhodes, T. Garland, T. A. Prolla, T. A. Awad, and S. C. Gammie. The evolution of gene expression in mouse hippocampus in response to selective breeding for increased locomotor activity. Evolution, 58:2079–2086, Sep 2004.

[31] E. A. Bruford, M. J. Lush, M. W. Wright, T. P. Sneddon, S. Povey, and E. Bir- ney. The HGNC Database in 2008: a resource for the human genome. Nucleic Acids Res., 36:D445–448, Jan 2008.

[32] C. J. Bult, J. T. Eppig, J. A. Kadin, J. E. Richardson, and J. A. Blake. The Mouse Genome Database (MGD): mouse biology and model systems. Nucleic Acids Res., 36:D724–728, Jan 2008.

[33] M. L. Bulyk. Computational prediction of transcription-factor binding site locations. Genome Biol., 5:201, 2003.

[34] C. Burge and S. Karlin. Prediction of complete gene structures in human ge- nomic DNA. J. Mol. Biol., 268:78–94, Apr 1997.

[35] C. D. Bustamante, A. Fledel-Alon, S. Williamson, R. Nielsen, M. T. Hubisz, S. Glanowski, D. M. Tanenbaum, T. J. White, J. J. Sninsky, R. D. Hernandez, D. Civello, M. D. Adams, M. Cargill, and A. G. Clark. Natural selection on protein-coding genes in the human genome. Nature, 437:1153–1157, Oct 2005. BIBLIOGRAPHY 191

[36] J. Capdevila and J. C. Izpis´uaBelmonte. Patterning mechanisms controlling vertebrate limb development. Annu. Rev. Cell Dev. Biol., 17:87–132, 2001.

[37] T. D. Capellini, K. Handschuh, L. Quintana, E. Ferretti, G. Di Giacomo, S. Fan- tini, G. Vaccari, S. L. Clarke, A. M. Wenger, G. Bejerano, J. Sharpe, V. Zap- pavigna, and L. Selleri. Control of pelvic girdle development by genes of the Pbx family and Emx2. Dev. Dyn., 240:1173–1189, May 2011.

[38] S. B. Carroll. Evolution at two levels: on genes and form. PLoS Biol., 3:e245, Jul 2005.

[39] S. B. Carroll. Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell, 134:25–36, Jul 2008.

[40] R. Caspi, H. Foerster, C. A. Fulcher, P. Kaipa, M. Krummenacker, M. Laten- dresse, S. Paley, S. Y. Rhee, A. G. Shearer, C. Tissier, T. C. Walk, P. Zhang, and P. D. Karp. The MetaCyc Database of metabolic pathways and and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res., 36:D623–631, Jan 2008.

[41] E. G. Cerami, G. D. Bader, B. E. Gross, and C. Sander. cPath: open source software for collecting, storing, and querying biological pathways. BMC Bioin- formatics, 7:497, 2006.

[42] J. Chai and A. S. Tarnawski. Serum response factor: discovery, biochemistry, biological roles and implications for tissue injury healing. J. Physiol. Pharma- col., 53:147–157, Jun 2002.

[43] Y. F. Chan, M. E. Marks, F. C. Jones, G. Villarreal, M. D. Shapiro, S. D. Brady, A. M. Southwick, D. M. Absher, J. Grimwood, J. Schmutz, R. M. Myers, D. Petrov, B. Jonsson, D. Schluter, M. A. Bell, and D. M. Kingsley. Adaptive evolution of pelvic reduction in sticklebacks by recurrent deletion of a Pitx1 enhancer. Science, 327:302–305, Jan 2010. BIBLIOGRAPHY 192

[44] S. S. Chavan, W. Tian, K. Hsueh, D. Jawaheer, P. K. Gregersen, and C. C. Chu. Characterization of the human homolog of the IL-4 induced gene-1 (Fig1). Biochim. Biophys. Acta, 1576:70–80, Jun 2002.

[45] C. R. Chen, Y. Kang, P. M. Siegel, and J. Massagu´e.E2F4/5 and p107 as Smad cofactors linking the TGFbeta receptor to c-myc repression. Cell, 110:19–32, Jul 2002.

[46] X. Chen, H. Xu, P. Yuan, F. Fang, M. Huss, V. B. Vega, E. Wong, Y. L. Orlov, W. Zhang, J. Jiang, Y. H. Loh, H. C. Yeo, Z. X. Yeo, V. Narang, K. R. Govindarajan, B. Leong, A. Shahab, Y. Ruan, G. Bourque, W. K. Sung, N. D. Clarke, C. L. Wei, and H. H. Ng. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133:1106–1117, Jun 2008.

[47] Z. F. Chen, A. J. Paquette, and D. J. Anderson. NRSF/REST is required in vivo for repression of multiple neuronal target genes during embryogenesis. Nat. Genet., 20:136–142, Oct 1998.

[48] A. Cheong, A. J. Bingham, J. Li, B. Kumar, P. Sukumar, C. Munsch, N. J. Buckley, C. B. Neylon, K. E. Porter, D. J. Beech, and I. C. Wood. Downregu- lated REST transcription factor is a switch enabling critical potassium channel expression and cell proliferation. Mol. Cell, 20:45–52, Oct 2005.

[49] Y. Chinenov, C. Coombs, and M. E. Martin. Isolation of a bi-directional pro- moter directing expression of the mouse GABPalpha and ATP synthase cou- pling factor 6 genes. Gene, 261:311–320, Dec 2000.

[50] J. A. Chong, J. Tapia-Ramirez, S. Kim, J. J. Toledo-Aral, Y. Zheng, M. C. Boutros, Y. M. Altshuller, M. A. Frohman, S. D. Kraner, and G. Mandel. REST: a mammalian silencer protein that restricts sodium channel gene expression to neurons. Cell, 80:949–957, Mar 1995.

[51] H. H. Chou, H. Takematsu, S. Diaz, J. Iber, E. Nickerson, K. L. Wright, E. A. Muchmore, D. L. Nelson, S. T. Warren, and A. Varki. A mutation in human BIBLIOGRAPHY 193

CMP-sialic acid hydroxylase occurred after the Homo-Pan divergence. Proc. Natl. Acad. Sci. U.S.A., 95:11751–11756, Sep 1998.

[52] M. Clamp, B. Fry, M. Kamal, X. Xie, J. Cuff, M. F. Lin, M. Kellis, K. Lindblad- Toh, and E. S. Lander. Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl. Acad. Sci. U.S.A., 104:19428–19433, Dec 2007.

[53] P. J. Coffer, A. van Puijenbroek, B. M. Burgering, M. Klop-de Jonge, L. Koen- derman, J. L. Bos, and W. Kruijer. Insulin activates Stat3 independently of p21ras-ERK and PI-3K signal transduction. Oncogene, 15:2529–2539, Nov 1997.

[54] F. S. Collins and S. M. Weissman. The molecular genetics of human hemoglobin. Prog. Nucleic Acid Res. Mol. Biol., 31:315–462, 1984.

[55] P. J. Collins, Y. Kobayashi, L. Nguyen, N. D. Trinklein, and R. M. Myers. The ets-related transcription factor GABP directs bidirectional transcription. PLoS Genet., 3:e208, Nov 2007.

[56] G. M. Cooper, M. Brudno, E. A. Stone, I. Dubchak, S. Batzoglou, and A. Sidow. Characterization of evolutionary rates and constraints in three Mammalian genomes. Genome Res., 14:539–548, Apr 2004.

[57] G. M. Cooper, E. A. Stone, G. Asimenos, E. D. Green, S. Batzoglou, and A. Sidow. Distribution and intensity of constraint in mammalian genomic se- quence. Genome Res., 15:901–913, Jul 2005.

[58] B. J. Crespi and K. Summers. Positive selection in the evolution of cancer. Biol. Rev. Camb. Philos. Soc., 81:407–424, Aug 2006.

[59] C. J. Cretekos, Y. Wang, E. D. Green, J. F. Martin, J. J. Rasweiler, and R. R. Behringer. Regulatory divergence modifies limb length between mammals. Genes Dev., 22:141–151, Jan 2008.

[60] A. Crocoll, C. C. Zhu, A. C. Cato, and M. Blum. Expression of androgen receptor mRNA during mouse embryogenesis. Mech. Dev., 72:175–178, Mar 1998. BIBLIOGRAPHY 194

[61] D. Curci´c, M. Glibeti´c,D. E. Larson, and B. H. Sells. GA-binding protein is involved in altered expression of ribosomal protein L32 gene. J. Cell. Biochem., 65:287–307, Jun 1997.

[62] E. H. Davidson. Genomic Regulatory Systems: Development and Evolution. Academic Press, 2001.

[63] E. H. Davidson. The Regulatory Genome: Gene Regulatory Networks in Devel- opment and Evolution. Academic Press, 2006.

[64] Y. J. de Kok, E. R. Vossenaar, C. W. Cremers, N. Dahl, J. Laporte, L. J. Hu, D. Lacombe, N. Fischel-Ghodsian, R. A. Friedman, L. S. Parnes, P. Thorpe, M. Bitner-Glindzicz, H. J. Pander, H. Heilbronner, J. Graveline, J. T. den Dunnen, H. G. Brunner, H. H. Ropers, and F. P. Cremers. Identification of a hot spot for microdeletions in patients with X-linked deafness type 3 (DFN3) 900 kb proximal to the DFN3 gene POU3F4. Hum. Mol. Genet., 5:1229–1235, Sep 1996.

[65] S. Del´ehouz´ee,T. Yoshikawa, C. Sawa, J. Sawada, T. Ito, M. Omori, T. Wada, Y. Yamaguchi, Y. Kabe, and H. Handa. GABP, HCF-1 and YY1 are involved in Rb gene expression during myogenesis. Genes Cells, 10:717–731, Jul 2005.

[66] R. J. DiLeone, L. B. Russell, and D. M. Kingsley. An extensive 3’ regulatory re- gion controls expression of Bmp5 in specific anatomical structures of the mouse embryo. Genetics, 148:401–408, Jan 1998.

[67] A. F. Dixson. Effects of testosterone on the sternal cutaneous glands and gen- italia of the male greater galago (Galago crassicaudatus crassicaudatus). Folia Primatol. (Basel), 26:207–213, 1976.

[68] A. F. Dixson. Primate Sexuality. Oxford University Press, 1998.

[69] A. Donadini, F. Giacopelli, R. Ravazzolo, V. Gandin, P. C. Marchisio, and S. Biffo. GABP complex regulates transcription of eIF6 (p27BBP), an essential BIBLIOGRAPHY 195

trans-acting factor in ribosome biogenesis. FEBS Lett., 580:1983–1987, Apr 2006.

[70] X. Dong, D. Fredman, and B. Lenhard. Synorth: exploring the evolution of synteny and long-range regulatory interactions in vertebrate genomes. Genome Biol., 10:R86, 2009.

[71] J. Dopazo. Functional interpretation of microarray experiments. OMICS, 10:398–410, 2006.

[72] J. Dostie, T. A. Richmond, R. A. Arnaout, R. R. Selzer, W. L. Lee, T. A. Honan, E. D. Rubio, A. Krumm, J. Lamb, C. Nusbaum, R. D. Green, and J. Dekker. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res., 16:1299–1309, Oct 2006.

[73] P. D. Drew, J. W. Nagle, R. D. Canning, K. Ozato, W. E. Biddison, and K. G. Becker. Cloning and expression analysis of a human cDNA homologous to Xenopus TFIIIA. Gene, 159:215–218, Jul 1995.

[74] M. C. Driscoll, C. S. Dobkin, and B. P. Alter. Gamma delta beta-thalassemia due to a de novo mutation deleting the 5’ beta-globin gene activation-region hypersensitive sites. Proc. Natl. Acad. Sci. U.S.A., 86:7470–7474, Oct 1989.

[75] K. Du, J. I. Leu, Y. Peng, and R. Taub. Transcriptional up-regulation of the delayed early gene HRS/SRp40 during liver regeneration. Interactions among YY1, GA-binding proteins, and mitogenic signals. J. Biol. Chem., 273:35208– 35215, Dec 1998.

[76] L. Duret and N. Galtier. Biased gene conversion and the evolution of mam- malian genomic landscapes. Annu Rev Genomics Hum Genet, 10:285–311, 2009.

[77] Editorial. Life as we know it. Nature, 449(7158):1, 2007. BIBLIOGRAPHY 196

[78] E. S. Emison, A. S. McCallion, C. S. Kashuk, R. T. Bush, E. Grice, S. Lin, M. E. Portnoy, D. J. Cutler, E. D. Green, and A. Chakravarti. A common sex- dependent mutation in a RET enhancer underlies Hirschsprung disease risk. Nature, 434:857–863, Apr 2005.

[79] J. T. Eppig, J. A. Blake, C. J. Bult, J. A. Kadin, and J. E. Richardson. The mouse genome database (MGD): new features facilitating a model system. Nu- cleic Acids Res., 35:D630–637, Jan 2007.

[80] P. Fang, V. Hwa, and R. G. Rosenfeld. Interferon-gamma-induced dephospho- rylation of STAT3 and apoptosis are dependent on the mTOR pathway. Exp. Cell Res., 312:1229–1239, May 2006.

[81] D. J. Gaffney and P. D. Keightley. Genomic selective constraints in murid noncoding DNA. PLoS Genet., 2:e204, Nov 2006.

[82] A. Garcia-Dorado, A. Caballero, and J. F. Crow. On the persistence and per- vasiveness of a new mutation. Evolution, 57:2644–2646, Nov 2003.

[83] D. M. Gelman, F. J. Martini, S. Nobrega-Pereira, A. Pierani, N. Kessaris, and O. Marin. The embryonic preoptic area is a novel source of cortical GABAergic interneurons. J. Neurosci., 29:9380–9389, Jul 2009.

[84] Genome 10K Community of Scientists. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J. Hered., 100:659–674, 2009.

[85] R. R. Genuario, D. E. Kelley, and R. P. Perry. Comparative utilization of transcription factor GABP by the promoters of ribosomal protein genes rpL30 and rpL32. Gene Expr., 3:279–288, 1993.

[86] P. G. Giresi, J. Kim, R. M. McDaniell, V. R. Iyer, and J. D. Lieb. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active reg- ulatory elements from human chromatin. Genome Res., 17:877–885, Jun 2007. BIBLIOGRAPHY 197

[87] J. M. Gohlke, O. Armant, F. M. Parham, M. V. Smith, C. Zimmer, D. S. Castro, L. Nguyen, J. S. Parker, G. Gradwohl, C. J. Portier, and F. Guille- mot. Characterization of the proneural gene regulatory network during mouse telencephalon development. BMC Biol., 6:15, 2008.

[88] N. Gompel, B. Prud’homme, P. J. Wittkopp, V. A. Kassner, and S. B. Carroll. Chance caught on the wing: cis-regulatory evolution and the origin of pigment patterns in Drosophila. Nature, 433:481–487, Feb 2005.

[89] B. G¨ottgens, C. Broccardo, M. J. Sanchez, S. Deveaux, G. Murphy, J. R. G¨othert, E. Kotsopoulou, S. Kinston, L. Delaney, S. Piltz, L. M. Barton, K. Knezevic, W. N. Erber, C. G. Begley, J. Frampton, and A. R. Green. The scl +18/19 stem cell enhancer is not required for hematopoiesis: identification of a 5’ bifunctional hematopoietic-endothelial enhancer bound by Fli-1 and Elf-1. Mol. Cell. Biol., 24:1870–1883, Mar 2004.

[90] P. Green. 2x genomes–does depth matter? Genome Res., 17:1547–1549, Nov 2007.

[91] R. E. Green, J. Krause, A. W. Briggs, T. Maricic, U. Stenzel, M. Kircher, N. Patterson, H. Li, W. Zhai, M. H. Fritz, N. F. Hansen, E. Y. Durand, A. S. Malaspinas, J. D. Jensen, T. Marques-Bonet, C. Alkan, K. Prufer, M. Meyer, H. A. Burbano, J. M. Good, R. Schultz, A. Aximu-Petri, A. Butthof, B. Hober, B. Hoffner, M. Siegemund, A. Weihmann, C. Nusbaum, E. S. Lander, C. Russ, N. Novod, J. Affourtit, M. Egholm, C. Verna, P. Rudan, D. Brajkovic, Z. Kucan, I. Gusic, V. B. Doronichev, L. V. Golovanova, C. Lalueza-Fox, M. de la Rasilla, J. Fortea, A. Rosas, R. W. Schmitz, P. L. Johnson, E. E. Eichler, D. Falush, E. Birney, J. C. Mullikin, M. Slatkin, R. Nielsen, J. Kelso, M. Lachmann, D. Reich, and S. P¨a¨abo. A draft sequence of the Neandertal genome. Science, 328:710–722, May 2010.

[92] K. Han, J. Lee, T. J. Meyer, P. Remedios, L. Goodwin, and M. A. Batzer. L1 recombination-associated deletions generate human genomic variation. Proc. Natl. Acad. Sci. U.S.A., 105:19366–19371, Dec 2008. BIBLIOGRAPHY 198

[93] R. Hardison, J. L. Slightom, D. L. Gumucio, M. Goodman, N. Stojanovic, and W. Miller. Locus control regions of mammalian beta-globin gene clusters: combining phylogenetic analyses and experimental results to gain functional insights. Gene, 205:73–94, Dec 1997.

[94] N. D. Heintzman, R. K. Stuart, G. Hon, Y. Fu, C. W. Ching, R. D. Hawkins, L. O. Barrera, S. Van Calcar, C. Qu, K. A. Ching, W. Wang, Z. Weng, R. D. Green, G. E. Crawford, and B. Ren. Distinct and predictive chromatin signa- tures of transcriptional promoters and enhancers in the human genome. Nat. Genet., 39:311–318, Mar 2007.

[95] W. C. O. Hill. Note on the male external genitalia of the chimpanzee. Proc. Zool. Soc. Lond., 116:129–132, 1946.

[96] M. E. Hillenmeyer, E. Fung, J. Wildenhain, S. E. Pierce, S. Hoon, W. Lee, M. Proctor, R. P. St Onge, M. Tyers, D. Koller, R. B. Altman, R. W. Davis, C. Nislow, and G. Giaever. The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science, 320:362–365, Apr 2008.

[97] T. Hirano, K. Ishihara, and M. Hibi. Roles of STAT3 in mediating the cell growth, differentiation and survival signals relayed through the IL-6 family of cytokine receptors. Oncogene, 19:2548–2556, May 2000.

[98] F. C. Hsieh, G. Cheng, and J. Lin. Evaluation of potential Stat3-regulated genes in human breast cancer. Biochem. Biophys. Res. Commun., 335:292–299, Sep 2005.

[99] F. Hsu, W. J. Kent, H. Clawson, R. M. Kuhn, M. Diekhans, and D. Haussler. The UCSC Known Genes. Bioinformatics, 22:1036–1046, May 2006.

[100] d. a. W. Huang, B. T. Sherman, Q. Tan, J. Kir, D. Liu, D. Bryant, Y. Guo, R. Stephens, M. W. Baseler, H. C. Lane, and R. A. Lempicki. DAVID Bioin- formatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res., 35:W169–175, Jul 2007. BIBLIOGRAPHY 199

[101] T. Hubbard, D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen, T. Down, R. Durbin, E. Eyras, J. Gilbert, M. Hammond, L. Huminiecki, A. Kasprzyk, H. Lehvaslaiho, P. Lijnzaad, C. Melsopp, E. Mon- gin, R. Pettett, M. Pocock, S. Potter, A. Rust, E. Schmidt, S. Searle, G. Slater, J. Smith, W. Spooner, A. Stabenau, J. Stalker, E. Stupka, A. Ureta-Vidal, I. Vastrik, and M. Clamp. The Ensembl genome database project. Nucleic Acids Res., 30:38–41, Jan 2002.

[102] A. L. Hughes, B. Packer, R. Welch, A. W. Bergen, S. J. Chanock, and M. Yeager. Effects of natural selection on interpopulation divergence at polymorphic sites in human protein-coding Loci. Genetics, 170:1181–1187, Jul 2005.

[103] S. Hunter, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bork, U. Das, L. Daugherty, L. Duquenne, R. D. Finn, J. Gough, D. Haft, N. Hulo, D. Kahn, E. Kelly, A. Laugraud, I. Letunic, D. Lonsdale, R. Lopez, M. Madera, J. Maslen, C. McAnulla, J. McDowall, J. Mistry, A. Mitchell, N. Mulder, D. Natale, C. Orengo, A. F. Quinn, J. D. Selengut, C. J. Sigrist, M. Thimma, P. D. Thomas, F. Valentin, D. Wilson, C. H. Wu, and C. Yeats. InterPro: the integrative protein signature database. Nucleic Acids Res., 37:D211–215, Jan 2009.

[104] L. Ibrahim and E. A. Wright. Effect of castration and testosterone propionate on mouse vibrissae. Br. J. Dermatol., 108:321–326, Mar 1983.

[105] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409:860–921, Feb 2001.

[106] International Human Genome Sequencing Consortium. Finishing the euchro- matic sequence of the human genome. Nature, 431:931–945, Oct 2004.

[107] International Rat Genome Sequencing Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428(6982):493–521, Apr 2004. BIBLIOGRAPHY 200

[108] N. Ivanova, R. Dobrin, R. Lu, I. Kotenko, J. Levorse, C. DeCoste, X. Schafer, Y. Lun, and I. R. Lemischka. Dissecting self-renewal in stem cells with RNA interference. Nature, 442:533–538, Aug 2006.

[109] Y. Jeong, F. C. Leskow, K. El-Jaick, E. Roessler, M. Muenke, A. Yocum, C. Dubourg, X. Li, X. Geng, G. Oliver, and D. J. Epstein. Regulation of a remote Shh forebrain enhancer by the Six3 homeoprotein. Nat. Genet., 40:1348– 1353, Nov 2008.

[110] H. Ji, H. Jiang, W. Ma, D. S. Johnson, R. M. Myers, and W. H. Wong. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol., 26:1293–1300, Nov 2008.

[111] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-DNA interactions. Science, 316:1497–1502, Jun 2007.

[112] S. Katzman, A. D. Kern, G. Bejerano, G. Fewell, L. Fulton, R. K. Wilson, S. R. Salama, and D. Haussler. Human genome ultraconserved elements are ultraselected. Science, 317:915, Aug 2007.

[113] S. A. Kauffman. Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theor. Biol., 22:437–467, Mar 1969.

[114] W. J. Kent. BLAT–the BLAST-like alignment tool. Genome Res., 12(4):656– 664, Apr 2002.

[115] W. J. Kent, R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. U.S.A., 100(20):11484–11489, Sep 2003.

[116] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at UCSC. Genome Res., 12:996–1006, Jun 2002.

[117] P. Khaitovich, I. Hellmann, W. Enard, K. Nowick, M. Leinweber, H. Franz, G. Weiss, M. Lachmann, and S. P¨a¨abo. Parallel patterns of evolution in the BIBLIOGRAPHY 201

genomes and transcriptomes of humans and chimpanzees. Science, 309:1850– 1854, Sep 2005.

[118] P. V. Kharchenko, M. Y. Tolstorukov, and P. J. Park. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat. Biotechnol., 26:1351– 1359, Dec 2008.

[119] P. Khatri and S. Draghici. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics, 21:3587–3595, Sep 2005.

[120] P. Khatri, C. Voichita, K. Kattan, N. Ansari, A. Khatri, C. Georgescu, A. L. Tarca, and S. Draghici. Onto-Tools: new additions and improvements in 2006. Nucleic Acids Res., 35:W206–211, Jul 2007.

[121] T. H. Kim, Z. K. Abdullaev, A. D. Smith, K. A. Ching, D. I. Loukinov, R. D. Green, M. Q. Zhang, V. V. Lobanenkov, and B. Ren. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell, 128:1231– 1245, Mar 2007.

[122] T. K. Kim, M. Hemberg, J. M. Gray, A. M. Costa, D. M. Bear, J. Wu, D. A. Harmin, M. Laptewicz, K. Barbara-Haley, S. Kuersten, E. Markenscoff- Papadimitriou, D. Kuhl, H. Bito, P. F. Worley, G. Kreiman, and M. E. Green- berg. Widespread transcription at neuronal activity-regulated enhancers. Na- ture, 465:182–187, May 2010.

[123] M. C. King and A. C. Wilson. Evolution at two levels in humans and chim- panzees. Science, 188:107–116, Apr 1975.

[124] D. Kioussis, E. Vanin, T. deLange, R. A. Flavell, and F. G. Grosveld. Beta- globin gene inactivation by DNA translocation in gamma beta-thalassaemia. Nature, 306:662–666, 1983.

[125] T. Kisseleva, S. Bhattacharya, J. Braunstein, and C. W. Schindler. Signaling through the JAK/STAT pathway, recent advances and future challenges. Gene, 285:1–24, Feb 2002. BIBLIOGRAPHY 202

[126] A. Kong, D. F. Gudbjartsson, J. Sainz, G. M. Jonsdottir, S. A. Gudjons- son, B. Richardsson, S. Sigurdardottir, J. Barnard, B. Hallbeck, G. Masson, A. Shlien, S. T. Palsson, M. L. Frigge, T. E. Thorgeirsson, J. R. Gulcher, and K. Stefansson. A high-resolution recombination map of the human genome. Nat. Genet., 31:241–247, Jul 2002.

[127] M. Kretzschmar and J. Massagu´e.SMADs: mediators and regulators of TGF- beta signaling. Curr. Opin. Genet. Dev., 8:103–111, Feb 1998.

[128] A. Kriegstein, S. Noctor, and V. Martinez-Cerde˜no. Patterns of neural stem and progenitor cell division may underlie evolutionary cortical expansion. Nat. Rev. Neurosci., 7:883–890, Nov 2006.

[129] S. Kumar and S. B. Hedges. A molecular timescale for vertebrate evolution. Nature, 392(6679):917–920, Apr 1998.

[130] I. Kurth, E. Klopocki, S. Stricker, J. van Oosterwijk, S. Vanek, J. Altmann, H. G. Santos, J. J. van Harssel, T. de Ravel, A. O. Wilkie, A. Gal, and S. Mundlos. Duplications of noncoding elements 5’ of SOX9 are associated with brachydactyly-anonychia. Nat. Genet., 41:862–863, Aug 2009.

[131] C. Lanct¨ot, A. Moreau, M. Chamberland, M. L. Tremblay, and J. Drouin. Hindlimb patterning and mandible development require the Ptx1 gene. Devel- opment, 126:1805–1810, May 1999.

[132] E. S. Lander. Initial impact of the sequencing of the human genome. Nature, 470:187–197, Feb 2011.

[133] D. S. Latchman. Transcription factors: an overview. Int. J. Biochem. Cell Biol., 29:1305–1312, Dec 1997.

[134] H. J. Lee, C. H. Yun, S. H. Lim, B. C. Kim, K. G. Baik, J. M. Kim, W. H. Kim, and S. J. Kim. SRF is a nuclear repressor of Smad3-mediated TGF-beta signaling. Oncogene, 26:173–185, Jan 2007. BIBLIOGRAPHY 203

[135] L. A. Lettice, S. J. Heaney, L. A. Purdie, L. Li, P. de Beer, B. A. Oostra, D. Goode, G. Elgar, R. E. Hill, and E. de Graaff. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preax- ial polydactyly. Hum. Mol. Genet., 12:1725–1735, Jul 2003.

[136] P. P. Levings and J. Bungert. The human beta-globin locus control region. Eur. J. Biochem., 269:1589–1599, Mar 2002.

[137] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinformatics, 25:1754–1760, Jul 2009.

[138] R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, S. Li, H. Yang, J. Wang, and J. Wang. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res., 20:265–272, Feb 2010.

[139] E. Lieberman-Aiden, N. L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B. R. Lajoie, P. J. Sabo, M. O. Dorschner, R. Sandstrom, B. Bernstein, M. A. Bender, M. Groudine, A. Gnirke, J. Stamatoyannopoulos, L. A. Mirny, E. S. Lander, and J. Dekker. Comprehensive mapping of long- range interactions reveals folding principles of the human genome. Science, 326:289–293, Oct 2009.

[140] C. P. Lim and X. Cao. Serine phosphorylation and negative regulation of Stat3 by JNK. J. Biol. Chem., 274:31055–31061, Oct 1999.

[141] K. Lindblad-Toh, C. M. Wade, T. S. Mikkelsen, E. K. Karlsson, D. B. Jaffe, M. Kamal, M. Clamp, J. L. Chang, E. J. Kulbokas, M. C. Zody, E. Mauceli, X. Xie, M. Breen, R. K. Wayne, E. A. Ostrander, C. P. Ponting, F. Galibert, D. R. Smith, P. J. DeJong, E. Kirkness, P. Alvarez, T. Biagi, W. Brockman, J. Butler, C. W. Chin, A. Cook, J. Cuff, M. J. Daly, D. DeCaprio, S. Gnerre, M. Grabherr, M. Kellis, M. Kleber, C. Bardeleben, L. Goodstadt, A. Heger, C. Hitte, L. Kim, K. P. Koepfli, H. G. Parker, J. P. Pollinger, S. M. Searle, N. B. Sutter, R. Thomas, C. Webber, J. Baldwin, A. Abebe, A. Abouelleil, BIBLIOGRAPHY 204

L. Aftuck, M. Ait-Zahra, T. Aldredge, N. Allen, P. An, S. Anderson, C. An- toine, H. Arachchi, A. Aslam, L. Ayotte, P. Bachantsang, A. Barry, T. Bayul, M. Benamara, A. Berlin, D. Bessette, B. Blitshteyn, T. Bloom, J. Blye, L. Bo- guslavskiy, C. Bonnet, B. Boukhgalter, A. Brown, P. Cahill, N. Calixte, J. Ca- marata, Y. Cheshatsang, J. Chu, M. Citroen, A. Collymore, P. Cooke, T. Da- woe, R. Daza, K. Decktor, S. DeGray, N. Dhargay, K. Dooley, K. Dooley, P. Dorje, K. Dorjee, L. Dorris, N. Duffey, A. Dupes, O. Egbiremolen, R. Elong, J. Falk, A. Farina, S. Faro, D. Ferguson, P. Ferreira, S. Fisher, M. FitzGerald, K. Foley, C. Foley, A. Franke, D. Friedrich, D. Gage, M. Garber, G. Gearin, G. Giannoukos, T. Goode, A. Goyette, J. Graham, E. Grandbois, K. Gyaltsen, N. Hafez, D. Hagopian, B. Hagos, J. Hall, C. Healy, R. Hegarty, T. Honan, A. Horn, N. Houde, L. Hughes, L. Hunnicutt, M. Husby, B. Jester, C. Jones, A. Kamat, B. Kanga, C. Kells, D. Khazanovich, A. C. Kieu, P. Kisner, M. Ku- mar, K. Lance, T. Landers, M. Lara, W. Lee, J. P. Leger, N. Lennon, L. Leuper, S. LeVine, J. Liu, X. Liu, Y. Lokyitsang, T. Lokyitsang, A. Lui, J. Macdonald, J. Major, R. Marabella, K. Maru, C. Matthews, S. McDonough, T. Mehta, J. Meldrim, A. Melnikov, L. Meneus, A. Mihalev, T. Mihova, K. Miller, R. Mit- telman, V. Mlenga, L. Mulrain, G. Munson, A. Navidi, J. Naylor, T. Nguyen, N. Nguyen, C. Nguyen, T. Nguyen, R. Nicol, N. Norbu, C. Norbu, N. Novod, T. Nyima, P. Olandt, B. O’Neill, K. O’Neill, S. Osman, L. Oyono, C. Patti, D. Perrin, P. Phunkhang, F. Pierre, M. Priest, A. Rachupka, S. Raghuraman, R. Rameau, V. Ray, C. Raymond, F. Rege, C. Rise, J. Rogers, P. Rogov, J. Sahalie, S. Settipalli, T. Sharpe, T. Shea, M. Sheehan, N. Sherpa, J. Shi, D. Shih, J. Sloan, C. Smith, T. Sparrow, J. Stalker, N. Stange-Thomann, S. Stavropoulos, C. Stone, S. Stone, S. Sykes, P. Tchuinga, P. Tenzing, S. Tes- faye, D. Thoulutsang, Y. Thoulutsang, K. Topham, I. Topping, T. Tsamla, H. Vassiliev, V. Venkataraman, A. Vo, T. Wangchuk, T. Wangdi, M. Weiand, J. Wilkinson, A. Wilson, S. Yadav, S. Yang, X. Yang, G. Young, Q. Yu, J. Zain- oun, L. Zembek, A. Zimmer, and E. S. Lander. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438:803–819, Dec 2005. BIBLIOGRAPHY 205

[142] C. Linhart, Y. Halperin, and R. Shamir. Transcription factor and microRNA motif discovery: the Amadeus platform and a compendium of metazoan target sets. Genome Res., 18:1180–1189, Jul 2008.

[143] C. O. Lovejoy. Reexamining human origins in light of Ardipithecus ramidus. Science, 326:1–8, Oct 2009.

[144] C. B. Lowe, G. Bejerano, and D. Haussler. Thousands of human mobile element fragments undergo strong purifying selection near developmental genes. Proc. Natl. Acad. Sci. U.S.A., 104:8005–8010, May 2007.

[145] G. Lunter. Probabilistic whole-genome alignments reveal high indel rates in the human and mouse genomes. Bioinformatics, 23(13):289–296, Jul 2007.

[146] J. Ma, A. Ratan, B. J. Raney, B. B. Suh, L. Zhang, W. Miller, and D. Haussler. DUPCAR: reconstructing contiguous ancestral regions with duplications. J. Comput. Biol., 15:1007–1027, Oct 2008.

[147] Z. Ma, T. Swigut, A. Valouev, A. Rada-Iglesias, and J. Wysocka. Sequence- specific regulator Prdm14 safeguards mouse ESCs from entering extraembryonic endoderm fates. Nat. Struct. Mol. Biol., 18:120–127, Feb 2011.

[148] E. R. Mardis. ChIP-seq: welcome to the new frontier. Nat. Methods, 4:613–614, Aug 2007.

[149] E. H. Margulies, V. V. Maduro, P. J. Thomas, J. P. Tomkins, C. T. Amemiya, M. Luo, and E. D. Green. Comparative sequencing provides insights about the structure and conservation of marsupial and monotreme genomes. Proc. Natl. Acad. Sci. U.S.A., 102:3354–3359, Mar 2005.

[150] G. A. Maston, S. K. Evans, and M. R. Green. Transcriptional regulatory ele- ments in the human genome. Annu Rev Genomics Hum Genet, 7:29–59, 2006.

[151] T. Matsuda, T. Nakamura, K. Nakao, T. Arai, M. Katsuki, T. Heike, and T. Yokota. STAT3 activation is sufficient to maintain an undifferentiated state of mouse embryonic stem cells. EMBO J., 18:4261–4269, Aug 1999. BIBLIOGRAPHY 206

[152] A. P. McGregor, V. Orgogozo, I. Delon, J. Zanet, D. G. Srinivasan, F. Payre, and D. L. Stern. Morphological evolution through multiple cis-regulatory mu- tations at a single gene. Nature, 448:587–590, Aug 2007.

[153] C. McLean and G. Bejerano. Dispensability of mammalian DNA. Genome Res., 18:1743–1751, Nov 2008.

[154] C. Y. McLean, D. Bristor, M. Hiller, S. L. Clarke, B. T. Schaar, C. B. Lowe, A. M. Wenger, and G. Bejerano. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol., 28:495–501, May 2010.

[155] C. Y. McLean, P. L. Reno, A. A. Pollen, A. I. Bassan, T. D. Capellini, C. Guen- ther, V. B. Indjeian, X. Lim, D. B. Menke, B. T. Schaar, A. M. Wenger, G. Be- jerano, and D. M. Kingsley. Human-specific loss of regulatory DNA and the evolution of human-specific traits. Nature, 472:216–219, Mar 2011.

[156] M. L. McLemore, S. Grewal, F. Liu, A. Archambault, J. Poursine-Laurent, J. Haug, and D. C. Link. STAT-3 activation is required for normal G-CSF- dependent proliferation and granulocytic differentiation. Immunity, 14:193–204, Feb 2001.

[157] H. Mi, N. Guo, A. Kejariwal, and P. D. Thomas. PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res., 35:D247–252, Jan 2007.

[158] J. M. Miano, X. Long, and K. Fujiwara. Serum response factor: master regu- lator of the actin cytoskeleton and contractile apparatus. Am. J. Physiol., Cell Physiol., 292:70–81, Jan 2007.

[159] C. T. Miller, S. Beleza, A. A. Pollen, D. Schluter, R. A. Kittles, M. D. Shriver, and D. M. Kingsley. cis-Regulatory changes in Kit ligand expression and parallel evolution of pigmentation in sticklebacks and humans. Cell, 131:1179–1189, Dec 2007. BIBLIOGRAPHY 207

[160] W. Miller, K. Rosenbloom, R. C. Hardison, M. Hou, J. Taylor, B. Raney, R. Burhans, D. C. King, R. Baertsch, D. Blankenberg, S. L. Kosakovsky Pond, A. Nekrutenko, B. Giardine, R. S. Harris, S. Tyekucheva, M. Diekhans, T. H. Pringle, W. J. Murphy, A. Lesk, G. M. Weinstock, K. Lindblad-Toh, R. A. Gibbs, E. S. Lander, A. Siepel, D. Haussler, and W. J. Kent. 28-way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res., 17:1797–1808, Dec 2007.

[161] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38:114–117, Apr 1965.

[162] D. P. Mortlock, C. Guenther, and D. M. Kingsley. A general approach for identifying distant regulatory elements applied to the Gdf6 gene. Genome Res., 13:2069–2081, Sep 2003.

[163] M. N. Muchlinski. A comparative analysis of vibrissa count and infraorbital foramen area in primates and other mammals. J. Hum. Evol., 58:447–473, Jun 2010.

[164] R. Murakami. A histological study of the development of the penis of wild-type and androgen-insensitive mice. J. Anat., 153:223–231, Aug 1987.

[165] K. M. Murphy, C. A. Nelson, and J. R. Sedy. Balancing co-stimulation and inhibition with BTLA and HVEM. Nat. Rev. Immunol., 6:671–681, Sep 2006.

[166] W. J. Murphy, T. H. Pringle, T. A. Crider, M. S. Springer, and W. Miller. Using genomic data to unravel the root of the placental mammal phylogeny. Genome Res., 17:413–421, Apr 2007.

[167] E. W. Myers, G. G. Sutton, A. L. Delcher, I. M. Dew, D. P. Fasulo, M. J. Flanigan, S. A. Kravitz, C. M. Mobarry, K. H. Reinert, K. A. Remington, E. L. Anson, R. A. Bolanos, H. H. Chou, C. M. Jordan, A. L. Halpern, S. Lonardi, E. M. Beasley, R. C. Brandon, L. Chen, P. J. Dunn, Z. Lai, Y. Liang, D. R. Nusskern, M. Zhan, Q. Zhang, X. Zheng, G. M. Rubin, M. D. Adams, and J. C. BIBLIOGRAPHY 208

Venter. A whole-genome assembly of Drosophila. Science, 287:2196–2204, Mar 2000.

[168] S. Natesan and M. Gilman. YY1 facilitates the association of serum response factor with the c-fos serum response element. Mol. Cell. Biol., 15:5975–5982, Nov 1995.

[169] L. Niswander. Pattern formation: old models out on a limb. Nat. Rev. Genet., 4:133–143, Feb 2003.

[170] M. A. N´obrega, Y. Zhu, I. Plajzer-Frick, V. Afzal, and E. M. Rubin. Megabase deletions of gene deserts result in viable mice. Nature, 431:988–993, Oct 2004.

[171] M. V. Olson. When less is more: gene loss as an engine of evolutionary change. Am. J. Hum. Genet., 64:18–23, Jan 1999.

[172] Online Mendelian Inheritance in Man (OMIM) (TM). McKusick- Nathans Institute of Genetic Medicine, Johns Hopkins University (Balti- more, MD) and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD), June 2009. World Wid Web URL: http://www.ncbi.nlm.nih.gov/omim/, 2009.

[173] C. P´al,B. Papp, and M. J. Lercher. An integrated view of protein evolution. Nat. Rev. Genet., 7:337–348, May 2006.

[174] V. S. Pande, I. Baker, J. Chapman, S. P. Elmer, S. Khaliq, S. M. Larson, Y. M. Rhee, M. R. Shirts, C. D. Snow, E. J. Sorin, and B. Zagrovic. Atomistic protein folding simulations on the submillisecond time scale using worldwide distributed computing. Biopolymers, 68:91–109, Jan 2003.

[175] C. C. Park, S. Ahn, J. S. Bloom, A. Lin, R. T. Wang, T. Wu, A. Sekar, A. H. Khan, C. J. Farr, A. J. Lusis, R. M. Leahy, K. Lange, and D. J. Smith. Fine mapping of regulatory loci for mammalian gene expression using radiation hy- brids. Nat. Genet., 40:421–429, Apr 2008. BIBLIOGRAPHY 209

[176] P. J. Park. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet., 10:669–680, Oct 2009.

[177] W. R. Pearson, T. Wood, Z. Zhang, and W. Miller. Comparison of DNA sequences with protein sequences. Genomics, 46(1):24–36, Nov 1997.

[178] J. S. Pedersen, G. Bejerano, A. Siepel, K. Rosenbloom, K. Lindblad-Toh, E. S. Lander, J. Kent, W. Miller, and D. Haussler. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput. Biol., 2:e33, Apr 2006.

[179] L. A. Pennacchio, N. Ahituv, A. M. Moses, S. Prabhakar, M. A. Nobrega, M. Shoukry, S. Minovitsky, I. Dubchak, A. Holt, K. D. Lewis, I. Plajzer-Frick, J. Akiyama, S. De Val, V. Afzal, B. L. Black, O. Couronne, M. B. Eisen, A. Visel, and E. M. Rubin. In vivo enhancer analysis of human conserved non-coding sequences. Nature, 444:499–502, Nov 2006.

[180] D. A. Petrov. Mutational equilibrium model of genome size evolution. Theor. Popul. Biol., 61:531–544, Jun 2002.

[181] D. A. Pollard, C. M. Bergman, J. Stoye, S. E. Celniker, and M. B. Eisen. Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics, 5:6, Jan 2004.

[182] K. S. Pollard, S. R. Salama, N. Lambert, M. A. Lambot, S. Coppens, J. S. Pedersen, S. Katzman, B. King, C. Onodera, A. Siepel, A. D. Kern, C. Dehay, H. Igel, M. Ares, P. Vanderhaeghen, and D. Haussler. An RNA gene expressed during cortical development evolved rapidly in humans. Nature, 443:167–172, Sep 2006.

[183] S. Poser, S. Impey, K. Trinh, Z. Xia, and D. R. Storm. SRF-dependent gene expression is required for PI3-kinase-regulated cell proliferation. EMBO J., 19:4955–4966, Sep 2000. BIBLIOGRAPHY 210

[184] L. M. Powell and A. P. Jarman. Context dependence of proneural bHLH pro- teins. Curr. Opin. Genet. Dev., 18:411–417, Oct 2008.

[185] S. Prabhakar, F. Poulin, M. Shoukry, V. Afzal, E. M. Rubin, O. Couronne, and L. A. Pennacchio. Close sequence comparisons are sufficient to identify human cis-regulatory elements. Genome Res., 16:855–863, Jul 2006.

[186] S. Prabhakar, A. Visel, J. A. Akiyama, M. Shoukry, K. D. Lewis, A. Holt, I. Plajzer-Frick, H. Morrison, D. R. Fitzpatrick, V. Afzal, L. A. Pennacchio, E. M. Rubin, and J. P. Noonan. Human-specific gain of function in a develop- mental enhancer. Science, 321:1346–1350, Sep 2008.

[187] B. Prud’homme, N. Gompel, A. Rokas, V. A. Kassner, T. M. Williams, S. D. Yeh, J. R. True, and S. B. Carroll. Repeated morphological evolution through cis-regulatory changes in a pleiotropic gene. Nature, 440:1050–1053, Apr 2006.

[188] K. D. Pruitt, T. Tatusova, and D. R. Maglott. NCBI reference sequences (Ref- Seq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35:D61–65, Jan 2007.

[189] A. Rada-Iglesias, R. Bajpai, T. Swigut, S. A. Brugmann, R. A. Flynn, and J. Wysocka. A unique chromatin signature uncovers early developmental en- hancers in humans. Nature, 470:279–283, Feb 2011.

[190] F. Rahimov, M. L. Marazita, A. Visel, M. E. Cooper, M. J. Hitchler, M. Rubini, F. E. Domann, M. Govil, K. Christensen, C. Bille, M. Melbye, A. Jugessur, R. T. Lie, A. J. Wilcox, D. R. Fitzpatrick, E. D. Green, P. A. Mossey, J. Little, R. P. Steegers-Theunissen, L. A. Pennacchio, B. C. Schutte, and J. C. Murray. Disruption of an AP-2alpha binding site in an IRF6 enhancer is associated with cleft lip. Nat. Genet., 40:1341–1347, Nov 2008.

[191] A. R´emenyi, H. R. Scholer, and M. Wilmanns. Combinatorial control of gene expression. Nat. Struct. Mol. Biol., 11:812–815, Sep 2004. BIBLIOGRAPHY 211

[192] J. C. Roach, G. Glusman, A. F. Smit, C. D. Huff, R. Hubley, P. T. Shannon, L. Rowen, K. P. Pant, N. Goodman, M. Bamshad, J. Shendure, R. Drmanac, L. B. Jorde, L. Hood, and D. J. Galas. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science, 328:636–639, Apr 2010.

[193] A. G. Rosmarin, K. K. Resendes, Z. Yang, J. N. McMillan, and S. L. Fleming. GA-binding protein transcription factor: a review of GABP as an integrator of intracellular signaling and protein-protein interactions. Blood Cells Mol. Dis., 32:143–154, 2004.

[194] J. Rozowsky, G. Euskirchen, R. K. Auerbach, Z. D. Zhang, T. Gibson, R. Bjorn- son, N. Carriero, M. Snyder, and M. B. Gerstein. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat. Biotechnol., 27:66–75, Jan 2009.

[195] J. Ruan, H. Li, Z. Chen, A. Coghlan, L. J. Coin, Y. Guo, J. K. H´erich´e,Y. Hu, K. Kristiansen, R. Li, T. Liu, A. Moses, J. Qin, S. Vang, A. J. Vilella, A. Ureta- Vidal, L. Bolund, J. Wang, and R. Durbin. TreeFam: 2008 Update. Nucleic Acids Res., 36:D735–740, Jan 2008.

[196] L. M. Sabatini and E. A. Azen. Two coding change mutations in the HIS2(2) al- lele characterize the salivary histatin 3-2 protein variant. Hum. Mutat., 4(1):12– 19, 1994.

[197] P. C. Sabeti, S. F. Schaffner, B. Fry, J. Lohmueller, P. Varilly, O. Shamovsky, A. Palma, T. S. Mikkelsen, D. Altshuler, and E. S. Lander. Positive natural selection in the human lineage. Science, 312:1614–1620, Jun 2006.

[198] N. Sabherwal, F. Bangs, R. R¨oth, B. Weiss, K. Jantz, E. Tiecke, G. K. Hinkel, C. Spaich, B. P. Hauffa, H. van der Kamp, J. Kapeller, C. Tickle, and G. Rap- pold. Long-range conserved non-coding SHOX sequences regulate expression in developing chicken limb and are associated with short stature phenotypes in human patients. Hum. Mol. Genet., 16:210–222, Jan 2007. BIBLIOGRAPHY 212

[199] P. J. Sabo, M. S. Kuehn, R. Thurman, B. E. Johnson, E. M. Johnson, H. Cao, M. Yu, E. Rosenzweig, J. Goldy, A. Haydock, M. Weaver, A. Shafer, K. Lee, F. Neri, R. Humbert, M. A. Singer, T. A. Richmond, M. O. Dorschner, M. McArthur, M. Hawrylycz, R. D. Green, P. A. Navas, W. S. Noble, and J. A. Stamatoyannopoulos. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat. Methods, 3:511–518, Jul 2006.

[200] R. Sanges, E. Kalmar, P. Claudiani, M. D’Amato, F. Muller, and E. Stupka. Shuffling of cis-regulatory elements is a pervasive feature of the vertebrate lin- eage. Genome Biol., 7:R56, 2006.

[201] D. Schmidt, M. D. Wilson, B. Ballester, P. C. Schwalie, G. D. Brown, A. Mar- shall, C. Kutter, S. Watt, C. P. Martinez-Jimenez, S. Mackay, I. Talianidis, P. Flicek, and D. T. Odom. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science, 328:1036–1040, May 2010.

[202] S. Schoenfelder, T. Sexton, L. Chakalova, N. F. Cope, A. Horton, S. Andrews, S. Kurukuti, J. A. Mitchell, D. Umlauf, D. S. Dimitrova, C. H. Eskiw, Y. Luo, C. L. Wei, Y. Ruan, J. J. Bieker, and P. Fraser. Preferential associations between co-regulated genes reveal a transcriptional interactome in erythroid cells. Nat. Genet., 42:53–61, Jan 2010.

[203] C. J. Schoenherr and D. J. Anderson. The neuron-restrictive silencer factor (NRSF): a coordinate repressor of multiple neuron-specific genes. Science, 267:1360–1363, Mar 1995.

[204] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller. Human-mouse alignments with BLASTZ. Genome Res., 13:103–107, Jan 2003.

[205] E. Segal, N. Friedman, D. Koller, and A. Regev. A module map showing conditional activity of expression modules in cancer. Nat. Genet., 36:1090– 1098, Oct 2004. BIBLIOGRAPHY 213

[206] Q. Shen, W. Zhong, Y. N. Jan, and S. Temple. Asymmetric Numb distribution is critical for asymmetric cell division of mouse cerebral cortical stem cells and neuroblasts. Development, 129:4843–4853, Oct 2002.

[207] A. Siepel, G. Bejerano, J. S. Pedersen, A. S. Hinrichs, M. Hou, K. Rosenbloom, H. Clawson, J. Spieth, L. W. Hillier, S. Richards, G. M. Weinstock, R. K. Wilson, R. A. Gibbs, W. J. Kent, W. Miller, and D. Haussler. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res., 15:1034–1050, Aug 2005.

[208] C. L. Smith, C. A. Goldsmith, and J. T. Eppig. The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic infor- mation. Genome Biol., 6:R7, 2005.

[209] C. M. Smith, J. H. Finger, T. F. Hayamizu, I. J. McCright, J. T. Eppig, J. A. Kadin, J. E. Richardson, and M. Ringwald. The mouse Gene Expres- sion Database (GXD): 2007 update. Nucleic Acids Res., 35:D618–623, Jan 2007.

[210] R. E. Soccio, G. Tuteja, L. J. Everett, Z. Li, M. A. Lazar, and K. H. Kaest- ner. Species-specific strategies underlying conserved functions of metabolic tran- scription factors. Mol. Endocrinol., 25:694–706, Apr 2011.

[211] S. Somekawa, K. Imagawa, N. Naya, Y. Takemoto, K. Onoue, S. Okayama, Y. Takeda, H. Kawata, M. Horii, T. Nakajima, S. Uemura, N. Mochizuki, and Y. Saito. Regulation of aldosterone and cortisol production by the transcrip- tional repressor neuron restrictive silencer factor. Endocrinology, 150:3110– 3117, Jul 2009.

[212] F. Spitz and D. Duboule. Global control regions and regulatory landscapes in vertebrate development and evolution. Adv. Genet., 61:175–205, 2008.

[213] F. Spitz, F. Gonzalez, and D. Duboule. A global control region defines a chro- mosomal regulatory landscape containing the HoxD cluster. Cell, 113:405–417, May 2003. BIBLIOGRAPHY 214

[214] M. S. Springer, W. J. Murphy, E. Eizirik, and S. J. O’Brien. Placental mammal diversification and the Cretaceous-Tertiary boundary. Proc. Natl. Acad. Sci. U.S.A., 100:1056–1061, Feb 2003.

[215] D. L. Stern. A role of Ultrabithorax in morphological differences between Drosophila species. Nature, 396:463–466, Dec 1998.

[216] D. L. Stern. Evolutionary developmental biology and the problem of variation. Evolution, 54:1079–1091, Aug 2000.

[217] E. A. Stone, G. M. Cooper, and A. Sidow. Trade-offs in detecting evolutionarily constrained sequence by comparative genomics. Annu. Rev. Genomics Hum. Genet., 6:143–164, 2005.

[218] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and J. P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for in- terpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A., 102:15545–15550, Oct 2005.

[219] G. G. Sutton, O. White, M. D. Adams, and A. R. Kerlavage. TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Sci. and Technology, 1:9–19, Jan 1995.

[220] L. Taher and I. Ovcharenko. Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements. Bioin- formatics, 25:578–584, Mar 2009.

[221] K. Takeda, K. Noguchi, W. Shi, T. Tanaka, M. Matsumoto, N. Yoshida, T. Kishimoto, and S. Akira. Targeted disruption of the mouse Stat3 gene leads to early embryonic lethality. Proc. Natl. Acad. Sci. U.S.A., 94:3801–3804, Apr 1997.

[222] The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature, 437:69– 87, Sep 2005. BIBLIOGRAPHY 215

[223] The C. elegans Sequencing Consortium. Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology. Science, 282:2012–2018, 1998.

[224] The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799–816, Jun 2007.

[225] The modENCODE Consortium. Identification of functional elements and regu- latory circuits by Drosophila modENCODE. Science, 330:1787–1797, Dec 2010.

[226] K. Theiler. The House Mouse: Development and Normal Stages from Fertiliza- tion to 4 Weeks of Age. Springer-Verlag, New York, 1989.

[227] E. Torarinsson, M. Sawera, J. H. Havgaard, M. Fredholm, and J. Gorodkin. Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res., 16:885–889, Jul 2006.

[228] G. Tuteja, P. White, J. Schug, and K. H. Kaestner. Extracting transcription factor targets from ChIP-Seq data. Nucleic Acids Res., 37:e113, Sep 2009.

[229] O. Uemura, Y. Okada, H. Ando, M. Guedj, S. Higashijima, T. Shimazaki, N. Chino, H. Okano, and H. Okamoto. Comparative functional genomics re- vealed conservation and diversification of three enhancers of the isl1 gene for motor and sensory neuron-specific expression. Dev. Biol., 278:587–606, Feb 2005.

[230] A. Ureta-Vidal, L. Ettwiller, and E. Birney. Comparative genomics: genome- wide analysis in metazoan eukaryotes. Nat. Rev. Genet., 4:251–262, Apr 2003.

[231] E. J. Vallender and B. T. Lahn. Positive selection on the human genome. Hum. Mol. Genet., 13 Spec No 2:R245–254, Oct 2004.

[232] A. Valouev, D. S. Johnson, A. Sundquist, C. Medina, E. Anton, S. Batzoglou, R. M. Myers, and A. Sidow. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat. Methods, 5:829–834, Sep 2008. BIBLIOGRAPHY 216

[233] A. Varki and T. K. Altheide. Comparing the human and chimpanzee genomes: searching for needles in a haystack. Genome Res., 15:1746–1758, Dec 2005.

[234] A. Visel, J. A. Akiyama, M. Shoukry, V. Afzal, E. M. Rubin, and L. A. Pen- nacchio. Functional autonomy of distant-acting human enhancers. Genomics, 93:509–513, Jun 2009.

[235] A. Visel, M. J. Blow, Z. Li, T. Zhang, J. A. Akiyama, A. Holt, I. Plajzer-Frick, M. Shoukry, C. Wright, F. Chen, V. Afzal, B. Ren, E. M. Rubin, and L. A. Pennacchio. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature, 457:854–858, Feb 2009.

[236] A. Visel, S. Minovitsky, I. Dubchak, and L. A. Pennacchio. VISTA Enhancer Browser–a database of tissue-specific human enhancers. Nucleic Acids Res., 35:88–92, Jan 2007.

[237] A. Visel, S. Prabhakar, J. A. Akiyama, M. Shoukry, K. D. Lewis, A. Holt, I. Plajzer-Frick, V. Afzal, E. M. Rubin, and L. A. Pennacchio. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet., 40:158–160, Feb 2008.

[238] B. F. Voight, S. Kudaravalli, X. Wen, and J. K. Pritchard. A map of recent positive selection in the human genome. PLoS Biol., 4:e72, Mar 2006.

[239] T. Y. Vue, J. Aaker, A. Taniguchi, C. Kazemzadeh, J. M. Skidmore, D. M. Martin, J. F. Martin, M. Treier, and Y. Nakagawa. Characterization of progen- itor domains in the developing mouse thalamus. J. Comp. Neurol., 505:73–91, Nov 2007.

[240] Y. Wang and X. Gu. Functional divergence in the caspase gene family and altered functional constraints: statistical analysis and prediction. Genetics, 158(3):1311–1320, Jul 2001.

[241] R. H. Waterston, K. Lindblad-Toh, E. Birney, J. Rogers, J. F. Abril, P. Agar- wal, R. Agarwala, R. Ainscough, M. Alexandersson, P. An, S. E. Antonarakis, BIBLIOGRAPHY 217

J. Attwood, R. Baertsch, J. Bailey, K. Barlow, S. Beck, E. Berry, B. Birren, T. Bloom, P. Bork, M. Botcherby, N. Bray, M. R. Brent, D. G. Brown, S. D. Brown, C. Bult, J. Burton, J. Butler, R. D. Campbell, P. Carninci, S. Caw- ley, F. Chiaromonte, A. T. Chinwalla, D. M. Church, M. Clamp, C. Clee, F. S. Collins, L. L. Cook, R. R. Copley, A. Coulson, O. Couronne, J. Cuff, V. Curwen, T. Cutts, M. Daly, R. David, J. Davies, K. D. Delehaunty, J. Deri, E. T. Der- mitzakis, C. Dewey, N. J. Dickens, M. Diekhans, S. Dodge, I. Dubchak, D. M. Dunn, S. R. Eddy, L. Elnitski, R. D. Emes, P. Eswara, E. Eyras, A. Felsenfeld, G. A. Fewell, P. Flicek, K. Foley, W. N. Frankel, L. A. Fulton, R. S. Ful- ton, T. S. Furey, D. Gage, R. A. Gibbs, G. Glusman, S. Gnerre, N. Goldman, L. Goodstadt, D. Grafham, T. A. Graves, E. D. Green, S. Gregory, R. Guig´o, M. Guyer, R. C. Hardison, D. Haussler, Y. Hayashizaki, L. W. Hillier, A. Hin- richs, W. Hlavina, T. Holzer, F. Hsu, A. Hua, T. Hubbard, A. Hunt, I. Jackson, D. B. Jaffe, L. S. Johnson, M. Jones, T. A. Jones, A. Joy, M. Kamal, E. K. Karlsson, D. Karolchik, A. Kasprzyk, J. Kawai, E. Keibler, C. Kells, W. J. Kent, A. Kirby, D. L. Kolbe, I. Korf, R. S. Kucherlapati, E. J. Kulbokas, D. Kulp, T. Landers, J. P. Leger, S. Leonard, I. Letunic, R. Levine, J. Li, M. Li, C. Lloyd, S. Lucas, B. Ma, D. R. Maglott, E. R. Mardis, L. Matthews, E. Mauceli, J. H. Mayer, M. McCarthy, W. R. McCombie, S. McLaren, K. McLay, J. D. McPher- son, J. Meldrim, B. Meredith, J. P. Mesirov, W. Miller, T. L. Miner, E. Mon- gin, K. T. Montgomery, M. Morgan, R. Mott, J. C. Mullikin, D. M. Muzny, W. E. Nash, J. O. Nelson, M. N. Nhan, R. Nicol, Z. Ning, C. Nusbaum, M. J. O’Connor, Y. Okazaki, K. Oliver, E. Overton-Larty, L. Pachter, G. Parra, K. H. Pepin, J. Peterson, P. Pevzner, R. Plumb, C. S. Pohl, A. Poliakov, T. C. Ponce, C. P. Ponting, S. Potter, M. Quail, A. Reymond, B. A. Roe, K. M. Roskin, E. M. Rubin, A. G. Rust, R. Santos, V. Sapojnikov, B. Schultz, J. Schultz, M. S. Schwartz, S. Schwartz, C. Scott, S. Seaman, S. Searle, T. Sharpe, A. Sheridan, R. Shownkeen, S. Sims, J. B. Singer, G. Slater, A. Smit, D. R. Smith, B. Spencer, A. Stabenau, N. Stange-Thomann, C. Sugnet, M. Suyama, G. Tesler, J. Thompson, D. Torrents, E. Trevaskis, J. Tromp, C. Ucla, A. Ureta- Vidal, J. P. Vinson, A. C. Von Niederhausern, C. M. Wade, M. Wall, R. J. BIBLIOGRAPHY 218

Weber, R. B. Weiss, M. C. Wendl, A. P. West, K. Wetterstrand, R. Wheeler, S. Whelan, J. Wierzbowski, D. Willey, S. Williams, R. K. Wilson, E. Winter, K. C. Worley, D. Wyman, S. Yang, S. P. Yang, E. M. Zdobnov, M. C. Zody, and E. S. Lander. Initial sequencing and comparative analysis of the mouse genome. Nature, 420:520–562, Dec 2002.

[242] T. Werner, A. Hammer, M. Wahlbuhl, M. R. Bosl, and M. Wegner. Multi- ple conserved regulatory elements with overlapping functions determine Sox10 expression in mouse embryogenesis. Nucleic Acids Res., 35:6526–6538, 2007.

[243] A. O. Wilkie and G. M. Morriss-Kay. Genetics of craniofacial development and malformation. Nat. Rev. Genet., 2:458–468, Jun 2001.

[244] D. G. Wilkinson. Whole mount in situ hybridization of vertebrate embryos, pages 75–83. IRL Press, 1992.

[245] A. C. Wilson, S. S. Carlson, and T. J. White. Biochemical evolution. Annu. Rev. Biochem., 46:573–639, 1977.

[246] A. Woolfe, M. Goodson, D. K. Goode, P. Snell, G. K. McEwen, T. Vavouri, S. F. Smith, P. North, H. Callaway, K. Kelly, K. Walter, I. Abnizova, W. Gilks, Y. J. Edwards, J. E. Cooke, and G. Elgar. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol., 3:e7, Jan 2005.

[247] W. Wurst and L. Bally-Cuif. Neural plate patterning: upstream and down- stream of the isthmic organizer. Nat. Rev. Neurosci., 2:99–108, Feb 2001.

[248] Y. Xue, Q. Wang, Q. Long, B. L. Ng, H. Swerdlow, J. Burton, C. Skuce, R. Taylor, Z. Abdellah, Y. Zhao, D. G. MacArthur, M. A. Quail, N. P. Carter, H. Yang, and C. Tyler-Smith. Human Y chromosome base-substitution muta- tion rate measured by direct sequencing in a deep-rooting pedigree. Curr. Biol., 19:1453–1457, Sep 2009.

[249] J. Yu, S. Hu, J. Wang, G. K. Wong, S. Li, B. Liu, Y. Deng, L. Dai, Y. Zhou, X. Zhang, M. Cao, J. Liu, J. Sun, J. Tang, Y. Chen, X. Huang, W. Lin, C. Ye, BIBLIOGRAPHY 219

W. Tong, L. Cong, J. Geng, Y. Han, L. Li, W. Li, G. Hu, X. Huang, W. Li, J. Li, Z. Liu, L. Li, J. Liu, Q. Qi, J. Liu, L. Li, T. Li, X. Wang, H. Lu, T. Wu, M. Zhu, P. Ni, H. Han, W. Dong, X. Ren, X. Feng, P. Cui, X. Li, H. Wang, X. Xu, W. Zhai, Z. Xu, J. Zhang, S. He, J. Zhang, J. Xu, K. Zhang, X. Zheng, J. Dong, W. Zeng, L. Tao, J. Ye, J. Tan, X. Ren, X. Chen, J. He, D. Liu, W. Tian, C. Tian, H. Xia, Q. Bao, G. Li, H. Gao, T. Cao, J. Wang, W. Zhao, P. Li, W. Chen, X. Wang, Y. Zhang, J. Hu, J. Wang, S. Liu, J. Yang, G. Zhang, Y. Xiong, Z. Li, L. Mao, C. Zhou, Z. Zhu, R. Chen, B. Hao, W. Zheng, S. Chen, W. Guo, G. Li, S. Liu, M. Tao, J. Wang, L. Zhu, L. Yuan, and H. Yang. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science, 296:79–92, Apr 2002.

[250] B. R. Zeeberg, W. Feng, G. Wang, M. D. Wang, A. T. Fojo, M. Sunshine, S. Narasimhan, D. W. Kane, W. C. Reinhold, S. Lababidi, K. J. Bussey, J. Riss, J. C. Barrett, and J. N. Weinstein. GoMiner: a resource for biological interpre- tation of genomic and proteomic data. Genome Biol., 4:R28, 2003.

[251] L. F. Zerbini, Y. Wang, A. Czibere, R. G. Correa, J. Y. Cho, K. Ijiri, W. Wei, M. Joseph, X. Gu, F. Grall, M. B. Goldring, J. R. Zhou, T. A. Libermann, and J. R. Zhou. NF-kappa B-mediated repression of growth arrest- and DNA- damage-inducible proteins 45alpha and gamma is essential for cancer cell sur- vival. Proc. Natl. Acad. Sci. U.S.A., 101:13618–13623, Sep 2004.

[252] X. Zhang, H. Sun, D. C. Danila, S. R. Johnson, Y. Zhou, B. Swearingen, and A. Klibanski. Loss of expression of GADD45 gamma, a growth inhibitory gene, in human pituitary adenomas: implications for tumorigenesis. J. Clin. En- docrinol. Metab., 87:1262–1267, Mar 2002.

[253] C. J. Zhou, U. Borello, J. L. Rubenstein, and S. J. Pleasure. Neuronal produc- tion and precursor proliferation defects in the neocortex of mice with loss of function in the canonical Wnt signaling pathway. Neuroscience, 142:1119–1131, Nov 2006.

[254] E. Zuckerkandl. Revisiting junk DNA. J. Mol. Evol., 34:259–271, Mar 1992.