UCSF UC San Francisco Electronic Theses and Dissertations

Title Prediction and validation of regulatory elements active in human development

Permalink https://escholarship.org/uc/item/72t7d0p9

Author Erwin Haliburton, Genevieve Dorothy

Publication Date 2014

Peer reviewed|Thesis/dissertation


Copyright 2014

by

Genevieve D. E. Haliburton


Dedication and acknowledgments

The work presented in this dissertation is a result of the hard work of many people. I’d first like to thank Dr. Katherine Pollard, my thesis advisor, for her guidance and support over the years. Katie is a thoughtful and dedicated mentor, and she has worked hard on behalf of my research, career, and personal development as a scientist. I immensely appreciate all that Katie has done for me.

Several former postdocs in the Pollard Lab contributed significantly to the computational work to develop and implement EnhancerFinder (as published in Erwin, Oksenberg et al. 2014), namely Tony Capra, Dennis Kostka, and Rebecca Truty. Zebrafish experiments presented in that paper and this dissertation were performed with the help of Nir Oksenberg and Karl Murphy in Nadav Ahituv’s lab. I’d like to thank the entire Ahituv lab, but Nir in particular, for teaching me the molecular biology and animal techniques used in these experiments, and for being such a joy to work with. I’d also like to thank Caroline Miller from the Gladstone Histology Core for help with mouse brain sectioning, and Kate Lovero for helpful discussions about neurodevelopment.

My thesis committee, chaired by Nadav Ahituv and including Patsy Babbitt, Deepak Srivastava, and Katie Pollard, has been a wonderful source of scientific guidance and career development advice. I’d also like to thank my “bonus mentor” Bruce Conklin for his advice and encouragement. I have been very lucky to have many inspirational mentors during my time at Gladstone and UCSF.

For their scientific support and camaraderie I’d like to thank the Pollard lab, especially fellow graduate student Aram Avila-Herrera and former postdoc Mariel Finucane, along with UCSF classmates Joanna Lipinski-Kruszka, Dan Lu, and Henry Lin.

And lastly, I’d like to thank my family and friends for being loving, supportive, and wonderful for so many years, especially my parents Tom and Carolyn Erwin and my husband John Haliburton.


Abstract

Embryonic development relies on well-tuned expression of thousands of genes across developing tissues and organs. Gene regulatory regions such as enhancers control these gene expression patterns, and properly functioning regulatory regions are vital for healthy development. To better characterize genes and regulatory regions that are important in embryonic development, we developed a new approach to identify developmental enhancers and a database of genes with known developmental functions, validated our predictions in transgenic mouse and zebrafish enhancer assays, and applied these tools to several interesting questions in embryonic development.

While many large-scale studies have investigated the location and function of enhancers in specific developmental tissues and timepoints, a general predictor of developmental enhancers was lacking. To leverage the massive amount of available data and fill this void, we developed EnhancerFinder, a computational tool that integrates thousands of genetic and epigenetic datasets to predict tissue-specific developmental enhancers. With this tool we predicted over 80,000 developmental enhancers, plus tissue specificity for thousands of these predicted enhancers.

We surveyed the enhancer landscape across the whole genome and found that predicted enhancers tend to cluster around developmental genes and that genes near tissue-specific enhancers are expressed in relevant tissues. We tested 12 developmental enhancers in transgenic mouse and zebrafish enhancer assays and found that 10 candidate enhancers validated with consistent expression patterns in at least one of the animal models. One cluster of these validated enhancers near the developmental gene FOXC1 pointed us towards a specific developmental brain structure known to be involved in cerebral malformations. We further investigated these candidate enhancers and developed a model for a possible non-coding genetic cause of the brain development disorder Dandy-Walker malformations.

We developed an improved framework for predicting enhancers, cataloged thousands of genes with developmental functions, predicted tens of thousands of novel developmental enhancers in the human genome that validate well in animal models, and showed applications of these tools in several interesting questions in developmental biology. We hope other researchers will be able to use these tools to further their own investigations in gene regulation during embryonic development.


Table of Contents

Chapter 1. Introduction

Chapter 2. EnhancerFinder: A new method to predict developmental enhancers
    Motivation
    Developing EnhancerFinder
    How EnhancerFinder works
    Evaluating the performance of EnhancerFinder
    EnhancerFinder predicts over 80,000 developmental enhancers across the human genome

Chapter 3. Cataloging important developmental regions of the genome
    Motivation
    Database components
    Application of the database in developmental research

Chapter 4. Developmental enhancers are clustered near important developmental transcription factors and exhibit relevant tissue-specific characteristics
    EnhancerFinder predicts a much stricter set of developmental enhancers than prominent existing methods
    Predicted enhancers are associated with relevant genomic regions
    Heart enhancers are easier to predict but a biological explanation remains elusive
    Developmental enhancers cluster near important developmental genes
    Methods

Chapter 5. In vivo validation of predicted enhancers confirms enhancer activity near transcription factors FOXC1 and FOXC2
    FOXC1 and FOXC2 are important developmental genes
    Candidate enhancers validate well for enhancer activity in mouse and fish models
    Mouse and zebrafish results mostly disagree on tissue specificity
    How “super” are the predicted enhancers?
    Methods
    Ethics

Chapter 6. EnhancerFinder’s novel characterization of the developmental enhancer landscape can help address many questions in developmental biology
    Many human accelerated regions function as enhancers
    Disruptions in enhancers near FOXC1 may influence developmental brain disorder Dandy-Walker

Chapter 7. Conclusion

References

Library release form

List of Tables

Table 1. Genes near predicted enhancers exhibit relevant tissue-specific expression. A) Genes near predicted brain enhancers. B) Genes near predicted heart enhancers

Table 2. Top 25 predicted transcription factor binding site motifs in predicted limb, heart, and brain enhancers

Table 3. Candidate enhancer regions tested in transgenic animal assays

Table 4. Developing brain-expressed genes that have many predicted transcription factor binding sites in candidate Dandy-Walker associated enhancers

List of Figures

Figure 1. Overview of EnhancerFinder enhancer prediction pipeline

Figure 2. Performance of EnhancerFinder. A) Step 1 performance. How well can EnhancerFinder distinguish enhancers from genomic background? B) Step 2 performance. How well can EnhancerFinder discern tissue specificity of a predicted enhancer region?

Figure 3. Tissue-specific enhancers are located near relevant genes

Figure 4. Enhancer landscapes near two developmental transcription factors, including images of candidate enhancers tested in transgenic mouse and zebrafish enhancer assay. A) Enhancer landscape near FOXC1. B) Enhancer landscape near FOXC2

Figure 5. Dandy-Walker related expression patterns of enhancers near FOXC1 enhancers. A) Whole-embryo images from transgenic mouse assay. B) Transverse cross-sections of brains from transgenic mouse assays

Figure 6. Suggested interactions between ZIC1 and FOXC1 enhancers. A) A model of possible interactions between ZIC1 and FOXC1 enhancers B) An image from the UCSC Genome Browser showing the enhancer landscape near FOXC1

Chapter 1: Introduction

Every cell in the human body contains the genetic information to make the genes a person will need over the course of an entire lifetime. A highly regulated series of events determines which genes are actually expressed, since each cell will only ever use a subset of the full assortment of human genes. Some genes are expressed for just a short time in a person’s life, such as during embryonic development or response to environmental events. Other genes are expressed for an entire lifetime in a variety of cells and tissues (Alberts 2008).

Transcriptional regulation is one stage of gene regulation that occurs just prior to and during transcription. Transcription requires sufficient binding between transcriptional machinery and the DNA that is to be transcribed. Transcription factors are small proteins that bind to specific DNA motifs and influence the binding of the transcriptional machinery. Enhancers are groups of transcription factor binding sites (TFBSs) that help recruit and stabilize transcriptional machinery to increase expression of a gene in specific tissues, or interfere with transcriptional machinery to decrease expression. It is through this mechanism that enhancers can control the timing and amplitude of a gene’s expression, leading to the complex and diverse range of gene expression we see.

Enhancers are very important functional regions of the genome, but they can be difficult to find. Although there are several genetic and epigenetic clues that help identify enhancers, there is no clear, universal signature. Since the human genome contains an estimated 3.2 billion base pairs (bp), and enhancers can be very small, identifying these functional regions presents an interesting research question (Smith, Riesenfeld et al. 2013).

Large groups of researchers have worked together in projects like ENCODE, the Epigenetic Roadmap, and FANTOM to annotate enhancers (Bernstein, Birney et al. 2012; Andersson, Gebhard et al. 2014), generating thousands of datasets through experimental methods such as ChIP-seq (Robertson, Hirst et al. 2007), which identifies where proteins of interest bind to DNA and where specific chemical modifications are made to the histone proteins that help package DNA, and DNaseI hypersensitivity assays (Akopov, Chernov et al. 2007), which highlight regions of open chromatin. These studies probed hundreds of cell lines and tissues and located epigenetic markers of functional regions of the genome such as enhancers, promoters, and repressive regions, resulting in massive amounts of functional data. Each of these genomic features informs gene regulation in a manner that is tissue- and timepoint-specific, but we hoped the data could be integrated and leveraged to make a new set of genome-wide developmental enhancer predictions we could then apply to study regions of the genome involved in general embryonic development. Designing this comprehensive predictor presented a significant computational and statistical challenge, details of which can be found in Chapter 2. We ultimately developed a computational tool, called EnhancerFinder, which incorporates thousands of genetic and epigenetic clues in a machine learning framework to learn what suite of these clues underlie developmental enhancers (Erwin, Oksenberg et al. 2014).


We applied EnhancerFinder’s predictive abilities to the whole genome and identified over 80,000 putative developmental enhancers. We tested a dozen of these predicted enhancers in transgenic mouse and zebrafish enhancer assays, and ten of the candidates showed confirmed or suggestive enhancer activity in at least one of the assays. Armed with confidence in our predictions and our newly characterized enhancer landscape of the human genome, we applied EnhancerFinder’s predictions to several interesting questions in developmental biology.

Small sequence changes in enhancers have been shown to have a big impact on development, disease, and drug response (Visel, Rubin et al. 2009; Ahituv 2012; Sakabe, Savic et al. 2012). We built a database of genomic regions and sequence variants involved in a wide range of developmental processes and disorders, with specific emphasis on regions important to heart development, which we discuss in Chapter 3.

This database enabled us to highlight dozens of novel enhancers for our own analyses of developmental genetics, which are detailed in Chapters 4, 5, and 6, and we found that EnhancerFinder’s predicted developmental enhancers contain significantly more disease-associated genetic changes than expected by chance. Other labs in the Gladstone Institutes were able to use this database for their own research projects as well, to highlight important genes for genome editing and other experiments.

Our work also confirms that enhancers cluster near genes that are required for development and other important, highly regulated genes such as those that encode transcription factors. The ten enhancers we validated in transgenic zebrafish and mouse enhancer assays fell in two such clusters near the genes FOXC1 and FOXC2. We further investigated four enhancers near FOXC1 that may be involved in a brain morphology disorder called Dandy-Walker malformations. We also found that many human accelerated regions (HARs), which are regions of the genome that likely contain human-specific sequence changes, are predicted to be developmental enhancers (Capra, Erwin et al. 2013).

EnhancerFinder is not an exhaustive predictor of developmental enhancers, but it gives us a good place to start our investigation of a wide range of developmental questions. We were able to characterize many functional regulatory regions of the human genome, and the predicted set of candidate developmental enhancers is freely available. We hope that future researchers will use these predictions to further investigate gene regulation in developmental biology. With minor changes, the EnhancerFinder framework can be applied to identify many other types of regulatory elements in different cell types and tissues outside of development as well. In the few years since we developed EnhancerFinder, it has become much easier and cheaper for researchers to collect the functional genomics data that underlie EnhancerFinder predictions. Since we found that more features lead to better performance, we hope researchers will be able to utilize the EnhancerFinder framework to integrate this new information in a broad range of genomic applications.


Chapter 2: EnhancerFinder: A new approach to predict developmental enhancers

Motivation

Enhancers are essentially the switches that turn genes on and off, or finely tune levels up and down. They are DNA sequences that bind transcriptional machinery proteins to regulate the timing and amplitude of gene expression. Beyond that, though, they are highly variable in characteristics. They are usually found in noncoding DNA but not always (Noonan and McCallion 2010; Birnbaum, Clowney et al. 2012). They are often found near the gene (or genes) they regulate but are sometimes very far away (Koch, Andrews et al. 2007; Heintzman, Hon et al. 2009). Some enhancers control the expression of one gene while some control many genes, and some genes are controlled by one enhancer while other genes are controlled by many enhancers (Visel, Akiyama et al. 2009; Visel, Taher et al. 2013). Many enhancers are active only at certain times and in certain tissues (Visel, Blow et al. 2009). All of this variability makes it difficult to pinpoint these functional regulatory elements in the genome.

Though researchers have yet to identify a single beacon highlighting the genomic location of all enhancers, several experimentally derived genomic signatures are known to mark large sets of enhancers. Many enhancers are conserved over evolutionary time since functional regions often undergo negative selection to maintain their function, and one of the first ways researchers identified enhancers was through in vivo assays of conserved noncoding regions (Nobrega, Ovcharenko et al. 2003; Pennacchio, Ahituv et al. 2006).

Active enhancers are often found near transcriptional co-factor proteins p300 and CREB binding protein (CBP) (Visel, Blow et al. 2009), as well as specific chemical modifications in histones, the proteins that help package DNA (Creyghton, Cheng et al. 2010). These histone modifications can change over time and vary between tissues, allowing researchers to identify tissue- and timepoint-specific enhancers. Additionally, the DNA sequence itself can help identify enhancers, since certain DNA motifs are common in some types of enhancers (Narlikar, Sakabe et al. 2010).

Previous computational work has used each of these genomic signatures, independently or in combination, to identify enhancers. Machine learning approaches such as support vector machines (SVMs) or random forests have been used to predict DNA motifs that underlie experimentally determined binding sites of enhancer-associated proteins p300 and CBP and histone modifications (Arvey, Agius et al. 2012; Rajagopal, Xie et al. 2013). Other researchers have used hidden Markov models (Ernst and Kellis 2012) and Bayesian approaches (Hoffman, Buske et al. 2012) to combine many epigenetic datasets to segment the genome into functional classes including enhancers. Each of these methods is able to identify enhancers with some success, but they are all designed to work in a single system, on a single cell type.

Identifying developmental enhancers is a bigger challenge, due largely to the heterogeneity of tissues and organs as compared to cell lines. Also, we were not interested in proxies for enhancers based on a single genomic clue such as p300 binding. We wanted to investigate enhancers that are active during embryonic development, which we defined by activity in an in vivo assay (Kothary, Clapoff et al. 1989), because we expected these regions to be most useful in our subsequent applications to questions in developmental biology. The VISTA Enhancer Browser provides the largest collection of developmental enhancers experimentally tested in a transient transgenic mouse enhancer assay (Visel, Minovitsky et al. 2007). Even when localized to a single embryonic tissue, these genomic regions are typically active in several different cell types that make up that tissue. We chose this more complex biologically active definition of “developmental enhancer” and created EnhancerFinder, a novel machine learning-based enhancer prediction pipeline that integrates genetic and epigenetic features (referred to as feature data) from a variety of experimental techniques and biological contexts that have previously been used individually to predict enhancers (Figure 1). We then used EnhancerFinder to identify over 80,000 candidate developmental enhancers in the human genome.

Figure 1. Overview of the EnhancerFinder enhancer prediction pipeline. In our two-step approach, we characterized genomic regions based on many features, such as their evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence patterns. We trained a multiple kernel learning (MKL) algorithm with positive training examples, shown in green, and negative training examples, shown in purple, to generate a trained classifier. We used 10-fold cross validation to evaluate the performance of all classifiers. In Step 1, we trained a classifier to distinguish between known developmental enhancers from VISTA and the genomic background, i.e. enhancer from non-enhancer. In Step 2, we trained several classifiers to distinguish enhancers active in tissues of interest from those without activity in the tissue. We applied the trained enhancer classifier from Step 1 to the entire human genome to produce more than 80,000 developmental enhancer predictions. We then applied the tissue-specific enhancer classifiers from Step 2 to further refine our predictions. Figure: Erwin, Oksenberg et al. 2014.

Developing EnhancerFinder

There were three main problems to solve before establishing a pipeline for EnhancerFinder: defining a set of non-enhancers to serve as the negative training data, determining which algorithm can best separate enhancers from genomic background, and selecting which genetic and epigenetic features best characterize the profiles of enhancers and non-enhancers.

One of the main advantages of having access to hundreds of validated enhancers is that this gave us a good set of positive training data to use in a supervised learning method. The flip side, though, is that we needed to define a good set of negative training data as well. The algorithm can then learn the profile of thousands of feature data sets that underlie the positives and negatives, and generate a predictor based on features that separate the positives from the negatives. With developmental enhancers, it is hard to say definitively that a region is not an enhancer if it hasn’t been exhaustively tested at every developmental timepoint with all appropriate assays. Because this exhaustive catalog of non-enhancers does not exist, we tried several different definitions of “non-enhancer” in a simple SVM framework with a small number of feature datasets to see which best captured the most biologically meaningful “non-enhancer-ness,” that is, which was most representative of genomic background. Negative training data that actually contains many positives would make classification very hard and lead to an incorrect signature of positives versus negatives.


The first potential set of genomic background non-enhancers came directly from the VISTA Enhancer Browser. Hundreds of genomic regions tested by VISTA in their transgenic mouse assay showed no enhancer activity at embryonic day 11.5 (E11.5) (VISTA negatives). However, these were nearly impossible to distinguish from positive enhancers in our test framework. VISTA negatives are not a true sample of genomic background because it is likely that they are active developmental enhancers at a different timepoint that was not assayed. They were selected by VISTA researchers for the expensive and time-consuming enhancer assay because these regions had promising signatures suggesting they would be enhancers, namely high levels of evolutionary conservation. While these regions were not appropriate “non-enhancers,” we were able to use the VISTA negatives in a later stage of EnhancerFinder design, as described later in this chapter.

We then tried a computationally generated list of genomic background regions that were matched to positive enhancers on several criteria: length, , distance to nearest gene, GC content, and conservation. These too were extremely hard to distinguish from positive enhancers, again because they likely contained too many real enhancers. Attributes of these regions that we were fixing such as GC content, conservation, and distance to the nearest gene are actually informative features of enhancers. We then relaxed the criteria of our computationally generated list of negatives, matching our positives only on length. While it is likely that this set still contained a small number of enhancers, this set of negative training data best represented true genomic background.
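As an illustration only (this is not the code used in this work), a length-matched background set of the kind described above could be drawn roughly as follows. The function names, the `chrom_sizes` input, and the choice to reject candidates that overlap a known positive are all assumptions made for this sketch.

```python
import random

def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def sample_length_matched_background(positives, chrom_sizes, seed=0):
    """Draw one random genomic region per positive, matched only on length.

    positives: list of (chrom, start, end) tuples, 0-based half-open.
    chrom_sizes: dict of chromosome name -> length (hypothetical input;
    every positive is assumed to fit on at least one chromosome).
    """
    rng = random.Random(seed)
    chroms = sorted(chrom_sizes)
    # Pick chromosomes in proportion to their size.
    weights = [chrom_sizes[c] for c in chroms]
    background = []
    for _, start_p, end_p in positives:
        length = end_p - start_p
        while True:
            chrom = rng.choices(chroms, weights=weights)[0]
            if chrom_sizes[chrom] < length:
                continue
            start = rng.randint(0, chrom_sizes[chrom] - length)
            candidate = (chrom, start, start + length)
            # Assumption: skip candidates that hit a known enhancer.
            if not any(overlaps(candidate, p) for p in positives):
                background.append(candidate)
                break
    return background
```

Because the regions are random, such a set will still contain some unannotated enhancers, which matches the caveat stated above.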


Once we settled on a set of negative training data that best biologically represented genomic background, we evaluated several algorithms on their ability to separate positives (enhancers) and negatives (genomic background). When we first started designing EnhancerFinder, we tried several machine learning approaches like SVMs and random forests, clustering methods such as K-nearest neighbors, and statistical models like linear regression. Initial tests showed SVMs and random forests consistently outperformed K-nearest neighbors. We were drawn to random forests because the results can be interpreted more intuitively than other approaches, but soon saw that SVM was more easily tunable to perform better.

Each of these methods required us to boil down complex biological data into simple mathematical representations. In the first few passes using SVM, we tried a number of different kernel functions and distance metrics to best encode diverse biological data in the most appropriate mathematical function. However, this approach was limited because it forced very different types of biological data to be represented by a single mathematical function. We were attempting to characterize the enhancer and non-enhancer regions based on three very different attributes: 1) What is the genetic sequence of this region? 2) Is this region evolutionarily conserved? and 3) Which known functional genomic marks does this region contain? Each of these features can provide a lot of information about whether a region is an enhancer or not, but only if the data is properly encoded. The solution to this problem was the multiple kernel learner (MKL), an extension of SVM that allowed us to encode different data types with different kernel functions (Sonnenburg, Zien et al. 2006; Kloft, Brefeld et al. 2011). We were then able to use the most appropriate mathematical representation for each biological class of data.
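The core idea of MKL, one kernel per data type combined as a weighted sum, can be sketched as below. This is a minimal illustration under stated assumptions: the three kernel choices (linear, Jaccard, Gaussian), the dictionary keys, and the fixed weights are hypothetical and are not the kernels or weights used in EnhancerFinder. In a real MKL the weights are learned together with the SVM.

```python
import math

def linear_kernel(x, y):
    """Dot product, e.g. for k-mer count vectors."""
    return sum(a * b for a, b in zip(x, y))

def jaccard_kernel(x, y):
    """Shared marks over total marks, for binary overlap vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 1.0

def gaussian_kernel(x, y, sigma=1.0):
    """Similarity of two scalar values, e.g. conservation scores."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

# One kernel per feature class (hypothetical pairing).
KERNELS = {
    "kmers": linear_kernel,
    "marks": jaccard_kernel,
    "conservation": gaussian_kernel,
}

def combined_kernel(region_a, region_b, weights):
    """Weighted sum of per-class kernels: K = sum_m w_m * K_m."""
    return sum(
        weights[name] * kernel(region_a[name], region_b[name])
        for name, kernel in KERNELS.items()
    )
```

A weighted sum of valid kernels with non-negative weights is itself a valid kernel, which is why the combined similarity can be handed to a standard SVM solver.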

Our next main challenge was to identify which set of genetic and epigenetic features best characterized the positive and negative training examples. We started with five feature classes: epigenetic marks like histone modifications and binding of p300 in a small number of relevant cell lines and tissues, DNA sequences via counts of known DNA motifs involved in promoters and transcription factor binding sites, evolutionary conservation, distance to nearest gene, and GC content. Early tests we performed with a simple (single kernel) SVM showed that the more features we included, the better we were able to separate the positives from negatives, a result that was particularly striking when we combined DNA sequence data with epigenetic data. We tried many ways of encoding epigenetic data, since ChIP-seq studies provide not just genomic location of the epigenetic mark but a proxy for binding intensity using peak height information. However, despite several attempts to leverage peak height or read count information to quantify intensity of epigenetic marks, we found that simply noting any overlap between a test region and epigenetic peak yielded comparable performance. We also tried different methods of encoding DNA sequence information. While early iterations of EnhancerFinder considered just known DNA motifs, we found that simply summarizing the occurrence of all 4 bp sequences in the genome performed just as well if not better than using known motifs, and was not limited by biases in known motifs.
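The two simple encodings described here, counts of every 4 bp sequence and a yes/no overlap with epigenetic peaks, can be sketched as follows. The function names and the (chrom, start, end) half-open interval convention are assumptions for illustration, not the original implementation.

```python
from itertools import product

def kmer_counts(sequence, k=4):
    """Count every possible k-mer in a region's DNA sequence.

    Returns a vector of length 4**k in lexicographic k-mer order
    (AAAA, AAAC, ..., TTTT for k=4).
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skips windows containing N or other symbols
            counts[kmer] += 1
    return [counts[kmer] for kmer in kmers]

def overlaps_any_peak(region, peaks):
    """Binary feature: 1 if the region overlaps any peak in a dataset."""
    chrom, start, end = region
    return int(any(c == chrom and s < end and start < e for c, s, e in peaks))
```

One binary value per epigenetic dataset, concatenated across datasets, gives the overlap profile for a region without using peak height at all.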

With the EnhancerFinder framework in place, we scaled up the feature data analysis to include not just dozens but thousands of epigenetic datasets, and evaluated many combinations of feature data. Results from this larger scale study confirm that integrating diverse feature data consistently improved performance. A particularly interesting finding showed that adding large numbers of feature sets from seemingly irrelevant biological contexts also improved performance. We think this is due to a better characterization of negative regions.

Our original goal was to predict not just the location of the developmental enhancers but also in which tissue the enhancer was active. In early iterations of EnhancerFinder, we saw that we could easily distinguish enhancers of one tissue from enhancers of another (e.g. heart enhancers from brain enhancers), but we had a hard time directly classifying tissue-specific enhancers when comparing them to genomic background (e.g. heart enhancers versus genomic background, brain enhancers versus genomic background). This approach was overwhelmed by the general enhancer features of these genomic regions, and failed to identify strong signatures of tissue specificity. We then decided to use a two-step approach, first determining whether a region is an enhancer or not, and then ascertaining tissue specificity.

How EnhancerFinder works

EnhancerFinder identified tissue-specific developmental enhancers using two rounds of MKL. In the first round, the MKL learned the profile of thousands of genomic feature data sets to determine what profile best separates over 700 validated VISTA developmental enhancers from an equal number of background genomic regions. In the second round, we built a separate classifier for each tissue and the MKL learned the feature profile that best separates enhancers of one tissue from VISTA-tested regions not expressed in that tissue (e.g. heart enhancers versus VISTA enhancers expressed in brain, limb, eye, etc., plus VISTA-tested regions that did not show expression; even though these are considered negative regions, there is good evidence that these are developmental enhancers at timepoints not assayed in VISTA).

This two-step training provided us with trained classifiers we could apply to the whole genome to identify novel tissue-specific developmental enhancers, based on the feature profiles that best separate the positives from negatives. Even though our study was focused on developmental enhancers, the basic framework of EnhancerFinder can be used to identify enhancers or other genomic elements in many different contexts, as long as there is a sufficient number of training examples.
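The two-step logic can be summarized in a few lines. This is a schematic sketch, not the EnhancerFinder code: the scoring callables stand in for the trained MKL classifiers, and the cutoffs and names are hypothetical.

```python
def predict_region(features, enhancer_score, tissue_scorers,
                   enhancer_cutoff, tissue_cutoff):
    """Apply Step 1, then Step 2 only to regions that pass Step 1.

    enhancer_score: callable returning the Step 1 score for a region.
    tissue_scorers: dict of tissue name -> callable returning a Step 2 score.
    Returns None for a predicted non-enhancer, otherwise the (possibly
    empty) sorted list of tissues whose classifier score passes the cutoff.
    """
    # Step 1: enhancer versus genomic background.
    if enhancer_score(features) <= enhancer_cutoff:
        return None
    # Step 2: one classifier per tissue, applied to predicted enhancers only.
    return [
        tissue
        for tissue, scorer in sorted(tissue_scorers.items())
        if scorer(features) > tissue_cutoff
    ]
```

Keeping the steps separate means the tissue classifiers never have to relearn what distinguishes enhancers from background.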

Evaluating the performance of EnhancerFinder

We evaluated the performance of each classifier using cross validation. In cross validation, we randomly divided the training examples into two sets, the training data (which is visible to the classifier) and the test data (which remains hidden). We trained the classifier on the first set, and then used the trained classifier to predict which regions were positive and which were negative in the second set. We repeated this ten times for each classifier.
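A minimal sketch of the train/test partitioning is given below. It assumes the standard 10-fold scheme named in the Figure 1 caption, in which every example is held out exactly once; the function name and seed are illustrative.

```python
import random

def k_fold_splits(n_examples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross validation.

    Examples are shuffled once, dealt into k folds, and each fold
    serves as the hidden test set exactly once.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Scores predicted for the held-out folds can then be pooled or averaged to draw the ROC curves.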

We then quantified the performance using receiver operating characteristic (ROC) curves. The area under the curve (AUC) of an ROC curve is a standard measure of algorithm performance. ROC AUC values range from 0 to 1, where 0.5 means the classifier performs as well as random chance, and 1 means the classifier can perfectly separate positives and negatives. At each score threshold, the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). We used the FPR to establish thresholds for high confidence enhancer predictions, described below. FPR is a measure of real non-enhancers that are incorrectly predicted to be enhancers (false positives) out of the total number of real non-enhancer regions, that is, how many real non-enhancers are misclassified. This is in contrast to false discovery rate (FDR), another statistic we examined to ensure the quality of our predictions. FDR is a measure of how many non-enhancers are incorrectly predicted to be enhancers (false positives), out of the total number of predicted enhancers, i.e. how many of our predicted enhancers are not really enhancers. With both FPR and FDR, lower values indicate higher quality predictions.
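The difference between FPR and FDR is easiest to see from the confusion counts: FPR = FP / (FP + TN), while FDR = FP / (FP + TP). The sketch below computes both, plus the AUC via its rank interpretation (the chance that a random positive outscores a random negative). Function names are illustrative, and this is not the evaluation code used for the figures.

```python
def confusion_counts(scores, labels, threshold):
    """Return (tp, fp, tn, fn) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, is_enhancer in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_enhancer:
            tp += 1
        elif predicted:
            fp += 1
        elif is_enhancer:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def false_positive_rate(tp, fp, tn, fn):
    """False positives out of all real non-enhancers."""
    return fp / (fp + tn) if (fp + tn) else 0.0

def false_discovery_rate(tp, fp, tn, fn):
    """False positives out of all predicted enhancers."""
    return fp / (fp + tp) if (fp + tp) else 0.0

def roc_auc(scores, labels):
    """AUC as P(positive score > negative score), ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The two rates can diverge sharply in genome-wide scans, because non-enhancers vastly outnumber enhancers: a small FPR can still produce many false positives relative to the number of predictions.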

Performance for both steps of EnhancerFinder is shown in Figure 2. In Step 1, we determined whether a region is an enhancer or not. The full EnhancerFinder classifier here was built with 2,469 functional genomics feature sets, DNA sequence, and conservation. This performs very well, with an AUC of 0.96. Classifiers based on each of those components are also shown. In Step 2, we predict a tissue of activity for each positive enhancer from Step 1. Note that with an AUC of 0.85, the heart classifier performs better than any other classifier. We discuss possible reasons for this performance in detail in Chapter 4.

Figure 2. Performance of EnhancerFinder. A) Step 1 performance. How well can EnhancerFinder distinguish enhancers from genomic background? These ROC curves show the performance of four trained classifiers, with the AUC listed after the classifier name. DNA Motifs, All Functional Genomics, and Evolutionary Conservation each use the single class of specified data to distinguish enhancers from genomic background using an SVM framework. Each of these classifiers performs well. EnhancerFinder combines all three classes of data in the MKL framework and outperforms the other three classifiers. B) Step 2 performance. How well can EnhancerFinder discern tissue specificity of a predicted enhancer region? These ROC curves show the performance of trained classifiers for six different developmental tissues. In each case, the classifier tried to distinguish enhancers active in the specified tissue from enhancers not active in that tissue. Figure adapted from Erwin, Oksenberg et al. 2014.

EnhancerFinder predicts over 80,000 developmental enhancers across the human genome

To generate our genome-wide predictions, we trained EnhancerFinder with a feature set of 509 relevant functional genomics datasets plus DNA sequence k-mers and applied the trained classifier to sliding windows of 1500 bp, moving along the human genome in 500 bp steps. (Although we showed that adding more features improves performance, it also increases the time it takes to compute the classifiers and predictions. This smaller set of features still performs very well and is much faster to run.) The feature profile for each window was computed as described above. To focus on high-confidence predictions, we filtered the enhancer scores for the windows at a 5% FPR, estimated from cross-validation using the genomic background, and combined the remaining overlapping windows to produce 84,301 high-confidence predicted developmental enhancers. These predictions can be viewed in the UCSC Genome Browser Public Track Data Hubs, or can be downloaded from http://lighthouse.ucsf.edu/public_files_no_password/genevieve.
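The windowing and merging steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the EnhancerFinder implementation: coordinates are half-open, windows that only touch end-to-start are not merged, and the function names and the score threshold argument are hypothetical.

```python
def sliding_windows(chrom_length, size=1500, step=500):
    """Yield half-open (start, end) windows that fit entirely within the chromosome."""
    for start in range(0, chrom_length - size + 1, step):
        yield start, start + size


def merge_overlapping(windows):
    """Merge overlapping (start, end) windows into sorted, non-overlapping regions."""
    merged = []
    for start, end in sorted(windows):
        if merged and start < merged[-1][1]:
            # Window overlaps the previous region, so extend that region.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


def high_confidence_regions(scored_windows, threshold):
    """Keep (start, end, score) windows scoring above threshold, then merge overlaps."""
    kept = [(start, end) for start, end, score in scored_windows if score > threshold]
    return merge_overlapping(kept)
```

With 1500 bp windows and 500 bp steps, each base away from the chromosome ends is covered by three windows, so a single strong signal typically yields a merged region longer than one window.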

We did not include evolutionary conservation data because the positives in our training data are almost universally conserved (706 of the 711 VISTA enhancers overlap a conserved element). While most enhancers likely exhibit some evolutionary conservation, this extremely high fraction is more likely due to the fact that conservation was the main selection criterion used by VISTA researchers. We didn’t want conservation to overwhelm our classifier and reduce our ability to detect less highly conserved novel enhancers genome-wide. The conservation-free classifier described in the paragraph above still performed extremely well in cross validation (AUC=0.92), and with this classifier as the basis of EnhancerFinder we were able to identify over 16,000 non-conserved enhancers in our genome-wide enhancer predictions (~20% of our predictions). We did not observe any other dramatic biases in the feature data associated with VISTA enhancers.

The 5% FPR threshold we used corresponds to a 65% TPR. To then calculate the FDR, we had to estimate the unknown fraction of 1.5 kilobase (kb) blocks of the human genome that harbor developmental enhancers. If this fraction was 50% (which would far exceed any standard biological estimate), a 5% FPR would correspond to an FDR of 9%. A more realistic estimate, where 10% of 1.5 kb windows contain a developmental enhancer, suggests an FDR estimate of 47%. This struck us as a high FDR (recall that a high FDR means many of our predicted enhancers are not really enhancers), but in practice we saw a much smaller number of false positives. In a separate application of EnhancerFinder discussed in more detail in Chapter 6, we tested HARs that are predicted to be enhancers, using the transgenic mouse enhancer assay described in the Chapter 5 Methods, and saw far fewer false discoveries (5/29, or 17%, at E11.5) (Capra, Erwin et al. 2013). Also, just two of twelve tested predictions did not validate with confirmed or suggestive activity in at least one of our animal model assays of enhancer activity (Chapter 5). This suggests that the FDR may be lower in experimental applications, especially when predicted enhancer regions are analyzed in the context of other relevant data. A truly accurate measure of the actual FDR would require experimental testing of a very large, random set of EnhancerFinder predictions, which is beyond the scope of my thesis.
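The dependence of the FDR on the assumed fraction of enhancer-containing windows follows from the definitions: expected false positives are FPR times the non-enhancer fraction, and expected true positives are TPR times the enhancer fraction. A minimal sketch, with a hypothetical function name:

```python
def estimated_fdr(fpr, tpr, enhancer_fraction):
    """Estimate FDR from FPR, TPR, and the assumed fraction of windows that are enhancers."""
    false_pos = fpr * (1.0 - enhancer_fraction)  # expected false positives per window
    true_pos = tpr * enhancer_fraction           # expected true positives per window
    if false_pos + true_pos == 0:
        raise ValueError("no predicted positives under these rates")
    return false_pos / (false_pos + true_pos)
```

The estimate is sensitive to both the TPR and the prior fraction plugged in, so this sketch is meant to show the dependence rather than to reproduce the percentages quoted above.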

Next, we trained heart, brain, and limb classifiers as Step 2 of EnhancerFinder and applied the trained classifiers to all 299,039 windows with positive enhancer scores in Step 1. We then applied a 5% FPR cutoff for each tissue and concatenated the remaining overlapping windows into merged enhancer regions. We focused on brain, limb, and heart, because these tissues are highly represented in VISTA and have been extensively studied in previous analyses of developmental enhancers. We predicted 7,400 limb enhancers, 19,051 heart enhancers, and 11,693 brain enhancers (Figure 3) at a 5% FPR threshold tuned separately for each tissue. Computational evidence we will discuss in Chapter 4 shows biological support for these tissue-specific predictions, e.g. genes near these tissue-specific enhancers are expressed in relevant tissues. However, experimental evidence described in Chapter 5 suggests that, in practice, our published tissue-specific predictions have much higher FDRs than our general enhancer prediction.
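One simple way to tune a score cutoff to a target FPR separately for each tissue is to take it from the scores that held-out negative examples receive. The sketch below assumes that approach; the function names and the input layout (a dictionary of negative scores per tissue) are hypothetical, and a region is called positive when its score is strictly above the cutoff.

```python
def threshold_at_fpr(negative_scores, target_fpr=0.05):
    """Return a cutoff such that at most target_fpr of negatives score strictly above it."""
    if not negative_scores:
        raise ValueError("need at least one negative score")
    if not 0 <= target_fpr < 1:
        raise ValueError("target_fpr must be in [0, 1)")
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # number of negatives allowed above the cutoff
    return ranked[allowed]


def tissue_thresholds(negatives_by_tissue, target_fpr=0.05):
    """Tune one cutoff per tissue from that tissue's own negative scores."""
    return {tissue: threshold_at_fpr(scores, target_fpr)
            for tissue, scores in negatives_by_tissue.items()}
```

Because each tissue classifier produces scores on its own scale, the same 5% FPR can correspond to quite different cutoffs and quite different numbers of predictions per tissue.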

Figure 3. Tissue-specific enhancers are located near relevant genes. EnhancerFinder identifies thousands of novel high-confidence (FPR<0.05) heart, brain, and limb enhancers. These enhancers are enriched for tissue-specific Gene Ontology (GO) Biological Processes. The five most enriched GO Biological Processes among genes near each enhancer set (as calculated using the gene annotation tool GREAT) are listed in the colored boxes. The larger number of high-confidence heart enhancers relative to brain and limb enhancers is likely the result of the superior performance of the heart classifier, and does not imply that there are more heart enhancers than other tissues’ enhancers in the human genome. Figure: Erwin, Oksenberg et al. 2014.

In later chapters we’ll discuss several cases where we applied these enhancer predictions to study gene regulation and possible medical implications in developmental genes including the autism-associated gene AUTS2 and the transcription factor genes FOXC1 and FOXC2. All genome-wide enhancer predictions were made available as tracks for import into the UCSC Genome Browser, which we hope other researchers will use for their own studies in developmental gene regulation.

Chapter 3: Cataloging important developmental regions of the genome

Motivation

The original goal of this thesis project was to identify regulatory regions that played an active role in cardiovascular development. EnhancerFinder takes care of the first part, identifying candidate regulatory regions. To highlight where in the genome we can find active regions in cardiovascular development, we created a database of important cardio-related regions of the genome based on many publicly available resources.

In the first pass at this database, before it became apparent that hacking skills are way more interesting in the movies than in real life, we set up a pipeline to scrape all of PubMed based on a large number of keywords manually identified from the MeSH (Medical Subject Headings) controlled vocabulary hierarchy. The goal was to capture nearly all genes mentioned in the context of heart development or specific diseases in all the major publications. When the keywords were seen in a paper, all gene names in the paper were extracted and entered into the database along with the keyword. However, since we included no natural language processing or other way to parse paper context, many irrelevant genes were included. And due to differences in the way genes are listed in different publications, important genes were missed. We scrapped these results and decided to refocus our efforts on building off of existing databases instead.

Database components

Online Mendelian Inheritance in Man (OMIM) is a large database of human genes and genetic phenotypes as they relate to human health, curated at the McKusick-Nathans Institute of Genetic Medicine at The Johns Hopkins University School of Medicine (www.omim.org). We applied the scraping and extracting method from above to this much smaller, more controlled environment, to identify genes discussed in written text entries related to diseases of interest. We also extracted all gene names and related phenotypes from the OMIM Gene Map and Morbid Map. The result was a list of over 15,000 genes and their associated phenotypes.

We combined this list with two other resources. The first was CardioGO, a gene ontology list of over 4,000 genes associated with a cardiovascular process, curated by researchers at the European Bioinformatics Institute and University College London (http://www.geneontology.org/GO.cardio.shtml; CardioGO is no longer maintained or updated online). The second resource was the National Human Genome Research Institute’s genome-wide association study (GWAS) catalog, which is a large collection of GWAS results including phenotypes, single nucleotide polymorphisms (SNPs), and nearby genes (Hindorff, Sethupathy et al. 2009). The GWAS catalog is continuously updated and has grown significantly since we started this thesis project, and we have periodically downloaded the catalog to update our database.

Since GWAS identify lead SNPs rather than causal SNPs, especially in the less-refined results from the early years of these studies, we initially tried to include population genetics information such as linkage disequilibrium so we could expand the GWAS catalog and investigate any possible causal SNP. However, at the time of analysis (2009-2010) population variant information was lacking, so we were unable to gain any functional insight with this approach.

Based on literature searches and discussions with cardiovascular clinicians within the Gladstone Institutes, we identified relevant cardiovascular phenotypes to include in our developmental cardio database. We were grateful for the input from clinicians, which allowed us to identify regions we might have ignored because of non-intuitive yet highly cardio-related phenotypes such as cleft palate. We now have a list of thousands of genes, annotated with their function, and prioritized by several lines of evidence supporting cardiovascular function. We later added gene expression data from work performed by Jeff Alexander in Benoit Bruneau’s lab on mouse cardiac progenitor cells and cardiomyocytes (Wamstad, Alexander et al. 2012) (results lifted over to the human genome based on homology) as yet another line of evidence in cardiac development.
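The idea of prioritizing genes by the number of independent lines of evidence can be sketched as below. This is only an illustration: the function name is hypothetical, upper-casing gene symbols to reconcile sources is an assumption, and the actual database may weight its sources differently.

```python
from collections import defaultdict


def prioritize_genes(evidence_sources):
    """Rank genes by how many sources support them.

    evidence_sources maps a source name (e.g. "OMIM") to an iterable of gene symbols.
    Returns a list of (gene, sorted source names), most supported first.
    """
    support = defaultdict(set)
    for source, genes in evidence_sources.items():
        for gene in genes:
            # Assumption: symbols are matched case-insensitively across sources.
            support[gene.upper()].add(source)
    ranked = sorted(support.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(gene, sorted(sources)) for gene, sources in ranked]


# Toy usage with placeholder symbols (not real database contents).
example = prioritize_genes({
    "OMIM": ["GENE_A", "GENE_B"],
    "CardioGO": ["gene_a"],
    "GWAS": ["GENE_A", "GENE_C"],
})
```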

Applications of the database in developmental research

We used this database, in conjunction with our EnhancerFinder predictions, to prioritize regions for further experimental investigation. Results are detailed in the following chapters. Several labs at the Gladstone Institutes have also been able to use this database in their own heart development research projects. In one application, researchers from Bruce Conklin’s lab have used this database to help identify top priority genes associated with dilated cardiomyopathy for investigation using the transcription activator-like effector nuclease (TALEN) genome editing approach. In another application, researchers from Deepak Srivastava’s lab have used this database to see where the results for their ChIP-seq studies fall in cardio-related genomic regions. Gene annotation files from this database can be viewed and downloaded from http://lighthouse.ucsf.edu/public_files_no_password/genevieve/, and we hope more researchers can use this to help inform their research in cardiovascular development and disease.

Chapter 4: Developmental enhancers are clustered near important developmental transcription factors and exhibit relevant tissue-specific characteristics

Our main motivation to develop EnhancerFinder was to generate a set of developmental enhancers that could enable further study of gene regulation during development. We characterized the 80,000+ predicted enhancers in several ways: 1) How do these predictions compare to other existing computational enhancer predictions? 2) Do our tissue-specific predictions exhibit relevant biological signatures? 3) Can we find a biological explanation for the differences we see in the performance of tissue-specific classifiers? and 4) How are these enhancers distributed across the genome?

EnhancerFinder predicts a much stricter set of developmental enhancers than prominent existing methods

Our enhancers significantly overlap predictions of enhancer states from ChromHMM and Segway. ChromHMM and Segway are algorithms that segment the human genome into sets of regions with similar patterns in functional genomics data. They were both applied to ENCODE data, and the resulting “states” were labeled with potential functions (Ernst and Kellis 2012; Hoffman, Buske et al. 2012). We merged all genomic regions assigned to an enhancer-related state by ChromHMM and calculated the proportion of EnhancerFinder predictions that lie within this enhancer segment of the genome. To assess statistical significance, we compared the observed proportion to the proportion from random sets of regions matched to the length and chromosome distribution of EnhancerFinder predictions. A similar analysis was conducted for Segway enhancer states. For both algorithms, we found that a significantly higher proportion of EnhancerFinder predictions lie in enhancer states than expected (p<0.001, permutation test). In both cases, EnhancerFinder predicts a smaller set of enhancers based on a more diverse collection of genomic features, providing a more refined set of regions for experimental analysis in further studies of developmental gene regulation.
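The observed proportion in this comparison can be computed with a sorted-interval lookup, sketched below for a single chromosome. Two assumptions are made here: a prediction is counted if it overlaps an enhancer-state segment by at least one base (the exact criterion for "lie within" is not specified above), and the segments have already been merged so they do not overlap each other. Function names are hypothetical.

```python
from bisect import bisect_right


def overlaps_any(region, starts, ends):
    """True if half-open (start, end) overlaps any sorted, non-overlapping interval."""
    start, end = region
    i = bisect_right(starts, start) - 1  # last interval starting at or before region start
    if i >= 0 and ends[i] > start:
        return True
    j = i + 1  # first interval starting after region start
    return j < len(starts) and starts[j] < end


def fraction_overlapping(regions, segments):
    """Fraction of regions that overlap at least one merged segment."""
    segments = sorted(segments)
    starts = [s for s, _ in segments]
    ends = [e for _, e in segments]
    hits = sum(overlaps_any(region, starts, ends) for region in regions)
    return hits / len(regions)
```

The same function applied to each randomized set of matched regions gives the null distribution against which the observed proportion is compared.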

Predicted enhancers are associated with relevant genomic regions

To characterize and further validate our genome-wide enhancer predictions, we examined their genomic distribution with respect to several independent indicators of function, including the expression patterns and Gene Ontology (GO) annotations of nearby genes, hits to TFBS models, and lead SNPs from GWAS.

Genes near tissue-specific enhancers are enriched for expression in the relevant tissue.

We investigated expression patterns of the potential gene targets for our predicted tissue-specific enhancers. Using gene expression data from 79 human tissues from the GNF Gene Expression Atlas 2 (Su, Wiltshire et al. 2004), we compared mean expression levels of genes associated with brain versus heart enhancers by associating each predicted enhancer with the nearest transcription start site. The GNF Atlas does not include enough relevant tissues to consider limb enhancers in this analysis. Genes near predicted brain enhancers had significantly higher expression levels in the 22 brain and neural tissues than genes associated with predicted heart enhancers (Table 1). Superior cervical ganglion, subthalamic nucleus, and pons show the most significantly elevated expression (all t-test p<0.001). Conversely, genes associated with heart enhancers showed elevated expression, compared to those associated with brain enhancers, in three of the four cardiovascular-related tissues: cardiac myocytes, heart, and whole blood (all t-test p<0.001). Interestingly, expression in the atrioventricular node, which is part of the heart’s electrical conduction system, is higher for genes associated with predicted brain enhancers compared to predicted heart enhancers.
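Associating each enhancer with the nearest transcription start site (TSS) can be done with a binary search over sorted TSS positions. The sketch below handles one chromosome and measures distance from the enhancer midpoint, which is an assumption, since the reference point on the enhancer is not stated above. Function names are hypothetical.

```python
from bisect import bisect_left


def nearest_tss(position, tss_positions, genes):
    """Return the gene whose TSS is closest to position (tss_positions sorted ascending)."""
    i = bisect_left(tss_positions, position)
    # Only the TSS just before and just after the position can be the nearest.
    candidates = [k for k in (i - 1, i) if 0 <= k < len(tss_positions)]
    best = min(candidates, key=lambda k: abs(tss_positions[k] - position))
    return genes[best]


def assign_enhancers(enhancers, tss):
    """Map each (start, end) enhancer to its nearest gene; tss is a non-empty list of (position, gene)."""
    tss = sorted(tss)
    positions = [p for p, _ in tss]
    genes = [g for _, g in tss]
    return [nearest_tss((start + end) // 2, positions, genes) for start, end in enhancers]
```

The resulting gene lists for brain and heart enhancers can then be compared tissue by tissue with a two-sample t-test on expression values.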

A.

Tissue                        p-value
Superior cervical ganglion    8.85E-101
Subthalamic nuclei            1.28E-82
Pons                          6.21E-77
Medulla oblongata             2.97E-65
Globus pallidus               1.16E-64
Fetal brain                   4.72E-63
Parietal lobe                 4.13E-62
Trigeminal ganglion           3.07E-60
Dorsal root ganglion          3.40E-57
Atrioventricular node         2.80E-56
Occipital lobe                4.24E-52
Ciliary ganglion              2.12E-45
Temporal lobe                 2.08E-44
Cingulate cortex              3.47E-38
Amygdala                      5.23E-29
Whole brain                   1.69E-17
Cerebellum                    3.70E-16
Caudate nuclei                1.27E-14
Prefrontal cortex             8.70E-14
Hypothalamus                  4.04E-12
Cerebellar peduncles          5.63E-08
Pituitary gland               4.51E-05
Thalamus                      0.039211

B.

Tissue                        p-value
Cardiac myocytes              2.56E-89
Heart                         1.06E-55
Whole blood                   8.94E-36

Table 1. Genes near predicted enhancers exhibit relevant tissue-specific expression. A) Genes near brain enhancers have significantly higher gene expression in brain and neural tissues than genes near heart enhancers. This table lists tissues with significantly higher mean expression in genes associated with predicted brain enhancers compared to predicted heart enhancers. B) Genes near heart enhancers have significantly higher gene expression in cardiac-related tissues than genes near brain enhancers. This table lists tissues with significantly higher mean expression in genes associated with predicted heart enhancers compared to predicted brain enhancers.

Genes near tissue-specific enhancers are enriched for relevant functional annotations. To explore functional annotations associated with genomic regions near predicted enhancers, we ran GREAT, an online tool that provides annotations of gene ontology, function, and expression (McLean, Bristor et al. 2010), with the “basal plus extension” method to map annotations onto our predicted brain, heart, and limb enhancers (see Methods, below). GO Biological Process enrichment results suggest that our predicted developmental enhancers target genes that function in each of their relevant cell types and tissues (Figure 3). For example, TGF-beta signaling, artery development, and artery morphogenesis are enriched among our heart enhancer predictions, while predicted brain enhancers are enriched for midbrain development and neuron differentiation. Predicted limb enhancers are enriched for kinase regulation and phosphorylation, which likely reflects the known role of kinases and phosphatases in limb patterning and development (Saxton, Ciruna et al. 2000; Dudley and Tabin 2003). Interestingly, some highly enriched GO Biological Process terms for predicted limb enhancers were also enriched for predicted heart enhancers (e.g., artery development and artery morphogenesis). This overlap appears to be due in part to the relatively large percentage of limb enhancer predictions that overlap heart enhancer predictions (e.g., the 1,960 overlapping enhancers comprise 25.6% of all limb enhancers and just 10.2% of all heart enhancers), and in part to the development of arteries and blood vessels in the limbs.

Tissue-specific enhancers contain many relevant TFBS motifs. We scanned the DNA sequences of our genome-wide tissue-specific enhancer predictions using TFBS motif models (see Methods, below). The most prevalent motifs (Table 2) differed between tissues and between enhancers predicted to be active only in a single tissue compared to all predicted enhancers of that tissue (e.g., enhancers unique to heart vs. all heart enhancers), suggesting potentially different functions and regulatory mechanisms for these classes of enhancers. For example, brain enhancers contained many NRSE (neural restrictive silencer element) sites, which are bound by a transcription factor thought to act as both a suppressor and an activator in neural cells (Schoenherr and Anderson 1995), whereas heart and limb enhancers did not have many NRSE binding sites. Binding sites for NFkappaB, which regulates transcription in heart morphogenesis (Hernandez-Gutierrez, Garcia-Pelaez et al. 2006), were abundant in heart enhancers but not brain or limb enhancers. Limb enhancers contained many binding sites for LHX3, a LIM gene that regulates limb growth and three-dimensional patterning (Tzchori, Day et al. 2009), while brain and heart enhancers did not.

Brain enhancers    Heart enhancers    Limb enhancers
HIC1               PITX2              AHRHIF
AP2                BACH1              E2F1DP1RB
LRF                BACH2              E2F4DP1
E2F1               E2F1DP1            AP2GAMMA
E2F1DP2            AP2ALPHA           NKX3A
E2F4DP2            AHRHIF             AP1
SP1                SREBP2             HMGIY
KROX               HIF1               TEF
WT1                LFA1               OCT1
HIC1               EGR                POU3F2
EGR                EGR3               CDC5
MAZR               KROX               FREAC7
CKROX              SREBP              NKX62
EGR1               NRF1               TBP
NRF1               SREBP1             RSRFC4
AP2                NFKAPPAB50         HNF1
CDPCR1             WT1                FOXO3A
MZF1               HES1               PIT1
MAZ                AHRARNT            FOXJ2
WHN                CREBP1             FOXO3
NRSE               SP1                FOX
AHRARNT            MAZR               FOXD3
GATA2              EGR1               NKX61
OCT                CKROX              LHX3

Table 2. Top 25 predicted transcription factor binding site motifs in predicted brain, heart, and limb enhancers. For each tissue, we computed the top ranking motifs present in each group of enhancers (brain, heart, or limb), after subtracting out background occurrence.

GWAS SNPs are enriched in predicted enhancers. We intersected the 9,687 SNPs in NHGRI’s GWAS catalog that were available at the time of our analysis (downloaded October 2012) with our predicted enhancers and found that our enhancers contain 676 lead GWAS SNPs, significantly more than expected at random (p<0.001, permutation test). Looking at the tissue-specific predictions, we found 209 lead GWAS SNPs in the predicted heart enhancers (p<0.001, permutation test), 68 in predicted brain enhancers (p=0.330, permutation test), and 47 in the limb enhancers (p=0.265, permutation test). (A complete list of lead GWAS SNPs found in predicted enhancers can be found in the supplementary material of Erwin, Oksenberg et al. 2014, or can be downloaded from http://lighthouse.ucsf.edu/public_files_no_password/genevieve.)

Taken together, these analyses suggest that EnhancerFinder identifies many active regulatory regions that contain functionally relevant variation. Our tissue-specific enhancer predictions give valuable annotations to previously uncharacterized non-coding regions of the human genome. For example, as illustrated above, hundreds of EnhancerFinder predictions that are associated with disease by GWAS are in non-coding regions with limited functional annotations. Our genome-wide enhancer predictions provide a simple resource for exploring the mechanisms and functional effects of uncharacterized GWAS hits.

Delving deeper into heart enhancers using the developmental cardio database we developed (described in Chapter 3), we found further support for the relevant biological functions. Starting with our 19,051 predicted heart enhancers, we found that over 13,600 enhancers are near genes expressed in early cardiac cells, 103 enhancers are near genes known to be important in heart development (per a manually curated list), 33 enhancers contain a cardiac-specific GWAS SNP (p=0.053, permutation test), 2,950 enhancers are near genes with a developmental function, and over 3,000 enhancers are near genes with an annotated potential cardiac function in OMIM. These results not only support the biological relevance of our heart enhancers, but also helped inform which regions to prioritize for further study, as detailed in Chapters 5 and 6.

Heart enhancers are easier to predict but a biological explanation remains elusive

As noted in Chapter 2, we saw that the tissue-specific classifier for heart enhancers performed better than the other tissues’ classifiers. Initially we thought that our results were biased by our feature data, which contained several epigenetic data sets in fetal heart and embryonic stem cells differentiated into cardiomyocytes. The other tissues lacked such specific and relevant feature sets. However, heart classifiers built without these features still outperformed the other tissues’ classifiers. Other results suggested that this could be due to two interesting aspects of heart enhancers: 1) while highly evolutionarily conserved compared to genomic background, they are much less conserved than other enhancers, and 2) they are on average located closer to the transcription start site of the nearest gene. Further investigation, though, showed that the increased performance could be attributed almost entirely to the uniquely high GC content seen in heart enhancers.

We investigated the source of this high GC content to determine if the high GC could be due to a biologically functional reason, or if it was just a result of high GC repeats (or just high G or high C repeats). We looked at the prevalence of three possible sources of high GC in VISTA heart enhancers, brain enhancers, and limb enhancers: CpG islands, known motifs of GC-rich transcription factor binding sites, and simple repeats. CpG islands are clusters of CG dinucleotides that can be targets of DNA methylation, an epigenetic chemical modification correlated with repressed gene expression. However, we did not see enrichment for functional high-GC regions in the heart enhancers. The only noteworthy result indicated that heart enhancers are significantly lacking an AT-rich simple repeat found in other tissues’ enhancers. This AT-rich repeat did not correspond to a known transcription factor binding site motif.

Developmental enhancers cluster near important developmental genes

We examined the genome-wide distribution of EnhancerFinder’s predicted enhancers and found that enhancers tend to cluster near important developmental genes. Since developmentally active genes typically rely on tight regulation to exhibit the timepoint- and tissue-specific expression patterns that drive development, it makes sense that many developmental enhancers are located nearby. For every gene in the human genome (based on an October 2012 download of the UCSC Genome Browser RefGene track), we counted the number of predicted enhancers within 500 kb upstream or downstream of the transcription start site. While it is hard to know which gene an enhancer is actually regulating, this 1 megabase (Mb) window is wide enough that most of the functional gene-enhancer pairs are likely captured. More broadly, this approach allowed us to identify regions of the genome that contain particularly dense clusters of enhancers and which genes are nearby.
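Counting enhancers in a fixed window around each TSS, and then ranking genes by that count, can be sketched as follows for a single chromosome. Representing each enhancer by its midpoint is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_nearby(tss_by_gene, enhancer_midpoints, flank=500_000):
    """Count enhancers whose midpoint lies within flank bp of each gene's TSS."""
    mids = sorted(enhancer_midpoints)
    return {
        gene: bisect_right(mids, tss + flank) - bisect_left(mids, tss - flank)
        for gene, tss in tss_by_gene.items()
    }


def top_fraction(counts, fraction=0.05):
    """Return the genes in the top fraction by enhancer count (at least one gene)."""
    n = max(1, int(len(counts) * fraction))
    ranked = sorted(counts, key=lambda gene: (-counts[gene], gene))
    return ranked[:n]
```

Because windows around neighboring genes overlap, a single dense enhancer cluster raises the counts of every gene within 500 kb of it, so genes in the same neighborhood tend to enter the top of the ranking together.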

We focused on the genes with the top 5% highest number of nearby enhancers, and found these genes were enriched for several biological functions related to development including epithelial cell development, artery development, and dorsal/ventral neural tube patterning (using the GREAT enrichment tool, described in Methods below). This top 5% list included many important genes from our cardiac development database described in Chapter 3, including five of the genes we had determined “highest priority” based on known cardiac development function: FOXC1, FOXC2, IRX3, JARID2, and NOTCH1. Each of these five genes had over 60 predicted enhancers clustered nearby. In the next chapter, we’ll discuss FOXC1 and FOXC2 further and present experimental validation of enhancers near these genes.

Another gene among the top 5% by number of nearby enhancers is AUTS2, the autism susceptibility candidate 2 gene. Previous research has shown that AUTS2 has an important neurodevelopmental function and is a suspected master regulator of genes implicated in autism spectrum disorder-related pathways. In a study performed outside the scope of this thesis, led by Nir Oksenberg (Oksenberg, Haliburton et al. In Review), we investigated the regulatory role and targets of the mouse homolog Auts2 using ChIP-seq and RNA-seq on mouse E16.5 forebrains.

We found 776 genes whose promoters are bound by Auts2, the majority of which are highly expressed in the developing forebrain. Of the 1,146 Auts2-bound genomic regions we identified that did not encompass a promoter region, 26% overlap known activator marks (H3K27ac in E14.5 mouse whole brain), which is more than expected by chance (p<0.001, permutation test), while only 2% overlap known repressive marks (H3K27me3 in E14.5 mouse whole brain), which was not statistically significant (p=0.081, permutation test). This suggests Auts2 functions as an activator. Interestingly, we found that the Auts2 gene itself contained five Auts2 peaks, suggesting a possible autoregulatory role for Auts2. Taken together with the very high number of enhancers predicted near AUTS2, we find a very compelling case that Auts2 is an active regulator of important neurodevelopmental genes and pathways.

Methods

We examined genomic regions near predicted developmental enhancers for enrichment of Gene Ontology functional annotations, known phenotypes, and pathways using GREAT (McLean, Bristor et al. 2010). Results were computed using the hypergeometric test for genome-wide significance, with the default settings and the “basal plus extension” association rule (proximal 5 kb upstream, 1 kb downstream, plus distal up to 100 kb).
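The “basal plus extension” rule gives each gene a basal regulatory domain around its TSS and then extends that domain toward the neighboring genes' basal domains, up to a maximum distance. The sketch below illustrates the idea using the distances quoted above. It is not GREAT's code: the function names are hypothetical, and measuring the extension from the edge of the basal domain is an assumption of this sketch.

```python
def basal_domain(tss, strand, upstream=5_000, downstream=1_000):
    """Basal domain around a TSS; upstream and downstream are relative to the gene's strand."""
    if strand == "+":
        return max(0, tss - upstream), tss + downstream
    return max(0, tss - downstream), tss + upstream


def regulatory_domain(basal, left_neighbor_end, right_neighbor_start, max_extension=100_000):
    """Extend a basal domain toward neighboring basal domains, capped at max_extension."""
    start, end = basal
    # Extend left until the left neighbor's basal domain or the cap, whichever comes first.
    ext_start = max(start - max_extension, min(start, left_neighbor_end), 0)
    # Extend right until the right neighbor's basal domain or the cap.
    ext_end = min(end + max_extension, max(end, right_neighbor_start))
    return ext_start, ext_end
```

A predicted enhancer is then associated with every gene whose regulatory domain it falls in, which is what allows annotation enrichments to be computed over enhancer sets.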

We identified the sequence motifs present in each set of enhancers using the FIMO tool (Find Individual Motif Occurrences) from the MEME Suite of sequence motif analysis tools (Grant, Bailey et al. 2011). We considered known transcription factor binding motifs from the April 2011 release of the TRANSFAC database (Wingender, Chen et al. 2000) with a FIMO score threshold of 10e-5. We identified those occurrences that fell in predicted enhancers, and summarized motifs to identify the most prevalent transcription factors in each tissue-specific set of enhancers.
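The summarization step can be sketched as counting motif occurrences that fall inside a set of enhancers and ranking motifs after subtracting a background count. The input layout (motif name plus position on one chromosome), the simple subtraction, and the function names are all assumptions for illustration; the sketch does not parse FIMO output.

```python
from collections import Counter


def motif_counts(hits, enhancers):
    """Count motif hits, given as (motif, position), that fall in any half-open enhancer."""
    counts = Counter()
    for motif, position in hits:
        if any(start <= position < end for start, end in enhancers):
            counts[motif] += 1
    return counts


def top_motifs(enhancer_counts, background_counts, n=25):
    """Rank motifs by enhancer count minus background count and return the top n names."""
    adjusted = {motif: count - background_counts.get(motif, 0)
                for motif, count in enhancer_counts.items()}
    ranked = sorted(adjusted, key=lambda motif: (-adjusted[motif], motif))
    return ranked[:n]
```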

We analyzed the overlap of predicted enhancers with GWAS SNPs, based on the NHGRI catalog of 9,687 GWAS SNPs downloaded from the UCSC Genome Browser in October 2012. Unadjusted permutation p-values were calculated by randomizing genomic locations of predicted enhancers (matching for length and chromosome, and avoiding assembly gaps) and overlapping these randomized regions with GWAS SNPs to assess significance of overlapping regions.
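A permutation test of this kind can be sketched as follows. Each region is re-placed at a random position on the same chromosome with the same length, rejecting placements that touch an assembly gap, and the p-value is the fraction of permutations with at least as many SNP overlaps as observed. The function names, the input layout, and the add-one correction in the p-value are assumptions of this sketch rather than details taken from the analysis above.

```python
import random
from bisect import bisect_left


def snps_in_regions(regions, snps):
    """Count SNPs inside half-open (chrom, start, end) regions; snps maps chrom to sorted positions."""
    total = 0
    for chrom, start, end in regions:
        positions = snps.get(chrom, [])
        total += bisect_left(positions, end) - bisect_left(positions, start)
    return total


def shuffle_regions(regions, chrom_sizes, gaps, rng):
    """Randomly re-place each region on its chromosome, keeping length and avoiding gaps."""
    shuffled = []
    for chrom, start, end in regions:
        length = end - start
        while True:  # rejection sampling; assumes a gap-free placement exists
            new_start = rng.randrange(0, chrom_sizes[chrom] - length + 1)
            new_end = new_start + length
            if not any(new_start < g_end and g_start < new_end
                       for g_start, g_end in gaps.get(chrom, [])):
                shuffled.append((chrom, new_start, new_end))
                break
    return shuffled


def permutation_p_value(regions, snps, chrom_sizes, gaps, n_perm=1000, seed=0):
    """One-sided p-value for enrichment of SNPs in regions."""
    rng = random.Random(seed)
    observed = snps_in_regions(regions, snps)
    as_extreme = sum(
        snps_in_regions(shuffle_regions(regions, chrom_sizes, gaps, rng), snps) >= observed
        for _ in range(n_perm)
    )
    return (as_extreme + 1) / (n_perm + 1)
```

With 1,000 permutations, the smallest p-value this scheme can report is just under 0.001, and randomized regions are allowed to overlap one another.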

Chapter 5: In vivo validation of predicted enhancers confirms enhancer activity near transcription factors FOXC1 and FOXC2

FOXC1 and FOXC2 are important developmental genes

Developmental enhancer predictions can be experimentally tested for enhancer activity in vivo using a transgenic enhancer assay in mouse and zebrafish embryos (Kothary, Clapoff et al. 1989; Kawakami 2005). With this assay, which is detailed below in this chapter’s Methods section, we tested enhancer activity in twelve candidate human enhancers near the genes FOXC1 and FOXC2, two forkhead box transcription factors. These two genes encode transcription factors that are known to be required for proper embryonic development, and are expressed in the developing heart, bone, and eye, among other tissues. The mouse homologs Foxc1 and Foxc2 have been studied extensively and are required for proper embryonic development, with Foxc1 null and Foxc2 null mutants being pre- or perinatal lethal (Kume, Deng et al. 1998). In humans, complete lack of FOXC1 is also typically pre- or perinatal lethal, and deletions near and point mutations in FOXC1 contribute to eye and brain development disorders, which we’ll discuss in more detail in the next chapter.

These genes are not only biologically interesting, but we also had several lines of evidence suggesting enhancers for these genes would work well with both the mouse and fish assays. These genes are highly conserved, and previous in situ hybridization work by other researchers has shown FOXC1 and FOXC2 are expressed in tissues of interest in mouse and zebrafish embryos (the EMAP eMouse Atlas Project at http://www.emouseatlas.org and ZFIN, the Zebrafish Model Organism Database at www.zfin.org) (Bradford, Conlin et al. 2011; Richardson, Venkataraman et al. 2014). We predicted which transcription factors may bind to these enhancers based on known transcription factor binding sites (see Chapter 4), and found that many enhancers near FOXC1 and FOXC2 contain binding sites for the transcription factor protein FOXC1. This finding initially suggested that these genes might be autoregulated. The results described below were unable to confirm or reject this idea.

Candidate enhancers validate well for enhancer activity in mouse and fish models

Figure 4 shows the enhancer landscape near FOXC1 and FOXC2, along with the candidate enhancers that we tested. One representative mouse and/or fish is shown for each validated enhancer. Ten of the twelve candidate human enhancers we tested showed consistent or suggestive enhancer activity in the zebrafish and/or mouse enhancer assay (Table 3).

Figure 4. Enhancer landscapes near two developmental transcription factors, including images of candidate enhancers tested in transgenic mouse and zebrafish enhancer assays. Images from the UCSC Genome Browser display the genomic locations of the 13 candidate enhancers (12 candidate enhancers plus one candidate super-enhancer) we tested in mouse and zebrafish. One representative animal from each positive assay is shown. For a detailed list of expression patterns in the positive assays, see Table 3. A) Enhancer landscape near FOXC1. We tested six candidate enhancers in zebrafish and seven in mouse. A zoomed-in view shows details of the positive zebrafish in the second panel of A. B) Enhancer landscape near FOXC2. We tested five candidate enhancers in zebrafish and two in mouse. The candidate super-enhancer we tested encompasses candidate enhancers 10, 11 and 12.

A.

Candidate enhancer Zebrafish enhancer assay Mouse enhancer assay N at Positive Positive Positive Genomic N at 24 PCR+ N Name 48 expression expression expression location (hg19) hpf at E11.5 hpf at 24hpf at 48hpf at E11.5 Eye (31%) Forebrain (39%) Hindbrain (16%) Forebrain Midbrain (12%)* (45%) chr6:1,573,927- Midbrain Spinal cord 1 58 51 (not tested in mouse) 1,577,560 (10%)* (59%) Epidermis Pharyngeal (12%)* arches (16%) Somitic muscles (10%)* Epidermis (29%) chr6:1523980- 2 (not tested in zebrafish) 9 (negative) 1524967 Eye (30%) 3 chr6:1,614,903- 63 68 (negative) (negative) 10 Neural tube (50%) (CDE1) 1,616,336 Hindbrain (30%) Somitic muscles 4 chr6:1,616,321- Neural tube (75%) 148 142 (10%)* (negative) 4 (CDE2) 1,617,337 Hindbrain (50%)* Epidermis (14%)* chr6:1,617,946- 5 80 79 (negative) (negative) 5 Forebrain (60%) 1,619,425 Eye (39%) 6 chr6:1619847- Motor neuron (39%) (not tested in zebrafish) 18 (CDE3) 1620844 Neural tube (44%) Hindbrain (33%) chr6:1,701,475- 7 142 137 (negative) (negative) 7 (negative) 1,702,475 Spinal cord Spinal cord Motor neuron (42%) (11%)* (17%) Cranial nerve (33%) Yolk (30%) Yolk (11%)* Neural tube (50%) 8 chr6:1,702,410- Epidermis 100 99 Epidermis 12 Hindbrain (42%) (CDE4) 1,703,764 (37%) (22%) Pharyngeal arches Pericardium Pericardium (42%) (23%) (21%) Face (33%) Heart (12%)*


B.

Candidate enhancer Zebrafish enhancer assay Mouse enhancer assay N at Positive Positive Positive Genomic N at 24 PCR+ N Name 48 expression expression expression location (hg19) hpf at E11.5 hpf at 24hpf at 48hpf at E11.5 Eye (28%) Spinal cord (17%) Somitic muscles chr16:86,597,061- Pericardium (18%) 9 66 58 (not tested in mouse) 86,598,447 (20%) Caudal fin (14%)* Epidermis (48%) Pericardium (12%)* Eye (10%)* Forebrain Forebrain (46%) (59%) Hindbrain Hindbrain (14%)* (56%) Midbrain Midbrain (73%) (61%) Nerves Nerves (22%) (62%) Notochord Spinal cord (14%)* chr16:86,602,685- (27%) Trigeminal (36%) 10 59 59 Spinal cord 11 86,605,098 Somitic Face (27%) (66%) muscles Somatic (28%) muscles (41%) Epidermis Epidermis (31%) (44%) Pericardium Pericardium (27%) (37%) Heart (51%) Pre-organ Organs - region (24%) undetermined (19%) Eye (16%) Eye (17%) Forebrain Forebrain (25%) (16%) Midbrain Spinal cord (17%) (18%) Spinal cord Somitic chr16:86,605,112- (13%)* muscles 11 100 100 8 (negative) 86,606,111 Somitic (27%) muscles Yolk (16%) (14%)* Epidermis Epidermis (40%) (24%) Pericardium Pericardium (17%) (16%) Heart (23%) Eye (13%)* Midbrain Notochord (13%)* (34%) Notochord Spinal cord (36%) (30%) Spinal cord Somitic (14%)* chr16:86,606,310- muscles 12 70 70 Somitic (not tested in mouse) 86,607,263 (14%)* muscles (19%) Yolk (26%) Yolk (24%) Epidermis Epidermis (63%) (60%) Pericardium Pericardium (10%)* (16%) Heart (21%)


C.

Candidate enhancer Zebrafish enhancer assay Mouse enhancer assay N at Positive Positive Positive Genomic N at 24 PCR+ N Name 48 expression expression expression location (hg19) hpf at E11.5 hpf at 24hpf at 48hpf at E11.5 Eye (13%)* Midbrain (30%) Notochord Notochord (41%) (32%) Somitic Somitic muscles chr16:86,602,685- muscles 13 (super) 56 56 (12%)* (not tested in mouse) 86,607,262 (13%)* Epidermis Yolk (18%) (14%)* Epidermis Pericardium (21%) (32%) Pericardium (52%)

Table 3. Candidate enhancer regions tested in transgenic animal assays. We tested 13 candidate enhancer regions in transgenic mouse and/or zebrafish assays. Representative positive fish are shown in Figure 4. For zebrafish, N is the number of zebrafish alive at the specified timepoint in hours post fertilization (hpf). For mouse, PCR+ N is the number of embryos that tested PCR-positive for at least one integration of the enhancer construct. For both zebrafish and mouse, * indicates that expression falls below the threshold required for a “positive enhancer” call, so expression in these tissues is suggestive but not confirmed. In zebrafish, this threshold is 15% of fish alive at that timepoint. For mouse, the threshold is 3 embryos. A) Candidate enhancers tested near FOXC1. CDE refers to candidate Dandy-Walker enhancers, which are discussed further for their relevance to brain malformations in Chapter 6. B) Candidate enhancers tested near FOXC2. C) Candidate super-enhancer.

However, we saw consistent or suggestive enhancer activity in the predicted tissue for only two of the ten tested regions for which we had predicted a tissue of activity (candidate enhancers 8 and 11, both with predicted heart activity), and both of these results were in fish only. No tissue-specific predictions were validated in mouse. These results indicate that while we are able to correctly predict developmental enhancers, our tissue-specific predictions may be either incorrect or for activity at a different timepoint.

(EnhancerFinder does not have a universally poor track record with regard to tissue specificity, though. In a different application of EnhancerFinder described in Chapter 6, where we used the transgenic mouse enhancer assay to test HARs that were predicted to be enhancers, we saw better recapitulation of tissue-specific predictions. Twenty-four HAR enhancers we tested had predictions of tissue specificity, and 16 of those displayed enhancer activity in at least one of the predicted tissues.)

Mouse and fish results mostly disagree on tissue specificity

In the seven candidate FOXC1 and FOXC2 regions tested with both assays, the zebrafish and mouse assays did not agree on tissue expression patterns. Moreover, the two animal models did not always agree on whether a region showed any enhancer activity. Two regions that were positive in mouse were negative in fish (candidate enhancers 3 and 5), and one that was positive in fish was negative in mouse (candidate enhancer 11).

It is unclear whether this is due to biological differences between zebrafish and mouse development or to slight differences between the assays. Many experimental differences may have contributed to these results. Different minimal promoters and reporter genes were used in the mouse and zebrafish assays. The mouse regions were synthesized to match the human reference genome, whereas fish regions were generated using PCR of human genomic DNA, introducing several known SNPs. These assays are also known to have a random and highly variable number of construct integrations in the animals, introducing more variation still.

In addition to contributions from technical differences, it is possible that biological differences contribute to these expression differences. Our training data comes from mouse E11.5 (Theiler Stage 19), which is midway through the organogenesis stage of development. This corresponds loosely with zebrafish 24 hours post fertilization (hpf), which is also midway through organogenesis, but the specific timing of organ development is different. For example, at this timepoint the zebrafish heart is a single tube (Kimmel, Ballard et al. 1995); similar mouse heart morphology is seen a full day earlier, at E10.5 (Theiler Stage 17) (EMAP 2014). Zebrafish 48 hpf is closer to the hatching or juvenile stage of development, corresponding to a much later stage of mouse development approaching postnatal day 0 (Haudry, Berube et al. 2008). This suggests that, in terms of heart development, the standard zebrafish annotation times of 24 and 48 hpf may fall before and after mouse E11.5.

How “super” are the predicted enhancers?

In Chapter 2 I described how EnhancerFinder’s predictions are based on scores assigned to 1.5 kb windows and cut at a threshold chosen to balance true and false positives. We do not expect that all, or even most, of the bases in the high-scoring windows actually function as enhancers, only that the enhancer is present somewhere in the predicted window. Previous research by others has shown that some enhancer regions, dubbed “super-enhancers,” can be longer than 10 kb and contain many functional domains (Hnisz, Abraham et al. 2013; Whyte, Orlando et al. 2013). In addition to the dense clusters of enhancers, these regions are defined by the presence of a transcriptional co-activator called Mediator, the so-called master coordinator of development (Yin and Wang 2014), along with master transcription factors Oct4, Nanog, and .

Many EnhancerFinder predictions are regions of approximately the length of super-enhancers, and 3,839 predicted enhancers overlap a super-enhancer in human embryonic stem cells or any of the five fetal tissues and cell lines tested by the super-enhancer researchers (thymus, intestine, large intestine, muscle, and CD34+ cells). Four of the six enhancers tested near FOXC1 and three of the four enhancers tested near FOXC2 are subunits of a larger, perhaps “super”, enhancer predicted by EnhancerFinder (Figure 4). Though none of these regions overlaps a known super-enhancer in the tissues and cell lines listed above, enhancers (and super-enhancers) are timepoint- and tissue-specific, so we could not rule out that these regions are super-enhancers in their active context. We wanted to investigate whether these larger regions drive a different expression pattern than their subunits, or whether the combined region simply drives an additive expression pattern of the subunits.

We tested a 4,577 bp region that encompassed three of our validated enhancers near FOXC2. We also attempted to test a similar region spanning approximately 6.5 kb across four of our validated enhancers near FOXC1, but we were unable to amplify this extremely GC-rich region by PCR. The potential super-enhancer near FOXC2 did not show any unique expression patterns not seen in the subunits, and did not display an additive expression pattern of the three tested subunits. This is more likely due to variability introduced by the assay itself than to an actual biological result. Previous studies with this enhancer assay and a similar assay show that expression patterns can be influenced by the distance from candidate enhancer to minimal promoter, as well as by the order of candidate enhancers (Smith, Riesenfeld et al. 2013). So while super-enhancers remain an interesting idea to us, it is not surprising that we were unable to shed light on possible “super” expression patterns in this system.

Methods

We tested candidate enhancer regions that ranged in length from 987 bp to 3,633 bp (see Table 3 for hg19 genomic coordinates), which we manually demarcated from within larger predicted enhancer regions based on signatures of likely enhancer function (including DNaseI hypersensitivity sites, transcription factor binding sites, histone modifications, and conservation).
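The reported length range can be checked against the Table 3 coordinates. The following is a minimal sketch, assuming that region length is computed as end minus start (this convention reproduces the 987 bp and 3,633 bp figures); the function names are illustrative and are not part of the original analysis.

```python
import re


def parse_region(region):
    """Parse a string such as 'chr6:1,573,927-1,577,560' into (chrom, start, end)."""
    match = re.fullmatch(r"(chr\w+):([\d,]+)-([\d,]+)", region.strip())
    if match is None:
        raise ValueError(f"unrecognized region: {region!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def region_length(region):
    """Length as end minus start (assumed convention, see lead-in)."""
    _, start, end = parse_region(region)
    return end - start


# Candidate enhancers 1 and 2 from Table 3A (hg19).
longest = region_length("chr6:1,573,927-1,577,560")
shortest = region_length("chr6:1523980-1524967")
print(shortest, longest)
```

Both comma-separated and plain coordinates appear in Table 3, so the parser accepts either form.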

Mouse enhancer assays were carried out in transient transgenic mouse embryos generated by pronuclear injections of enhancer assay constructs into FVB embryos (Cyagen Biosciences). Human DNA sequences were inserted upstream of an Hsp68 minimal promoter and a LacZ reporter gene. The embryos were collected and stained for LacZ expression at E11.5. In this assay, the minimal promoter should only activate the LacZ reporter gene when the candidate enhancer region effectively recruits and binds transcriptional machinery; that is, when the candidate enhancer displays enhancer function. Following the annotation policies of the VISTA Enhancer Browser, we required that consistent spatial expression patterns be present in three or more embryos with staining in order for the region to be considered an enhancer.

Zebrafish enhancer assays were performed transiently. We performed PCR to obtain the candidate enhancer sequences from human genomic DNA (Roche). These were cloned into the E1b-GFP-Tol2 enhancer assay vector containing an E1b minimal promoter followed by green fluorescent protein (GFP) (Li, Ritter et al. 2010), and each construct was verified by sequencing. Each construct was injected with Tol2 mRNA into at least 100 single-cell fertilized zebrafish embryos. The E1b minimal promoter was chosen for the zebrafish assay based on previous work by other researchers showing this promoter has low background expression in the tissues we are most interested in (namely, heart and brain). As with the mouse assay, the minimal promoter should only activate the reporter gene for GFP expression when the candidate enhancer displays enhancer function. Unlike the mouse assay, we did not need to sacrifice the fish to annotate enhancer expression, so we could assay the fish at multiple timepoints. We annotated GFP expression at approximately 24 and 48 hpf, and considered an enhancer to be positive if we observed consistent expression in at least 15% of all fish alive at either 24 or 48 hpf (Oksenberg, Stevison et al. 2013), and suggestive of enhancer activity if we observed consistent expression in at least 10% of all fish alive at 24 or 48 hpf, after subtracting out percentages of tissue expression in fish injected with the empty enhancer vector. For each construct, at least 50 fish were analyzed for GFP expression at 48 hpf.
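The annotation thresholds described above can be written as a small decision rule. This is an illustrative sketch rather than the code used for the analysis: the function names are hypothetical, and it assumes that the empty-vector background percentage is subtracted before both the 15% and the 10% cutoffs are applied.

```python
def classify_zebrafish(expressing, alive, background_pct=0.0):
    """Classify one tissue at one timepoint for a zebrafish construct.

    expressing: fish showing consistent GFP expression in the tissue
    alive: fish alive at that timepoint (N in Table 3)
    background_pct: percentage seen with the empty vector (assumed to be
        subtracted before applying either cutoff)
    """
    if alive <= 0:
        raise ValueError("alive must be a positive count")
    pct = 100.0 * expressing / alive - background_pct
    if pct >= 15.0:
        return "positive"
    if pct >= 10.0:
        return "suggestive"
    return "negative"


def meets_mouse_threshold(embryos_with_pattern):
    """Mouse criterion: a consistent pattern in three or more stained embryos."""
    return embryos_with_pattern >= 3


# Hypothetical counts, for illustration only.
print(classify_zebrafish(18, 100))                      # above the 15% cutoff
print(classify_zebrafish(12, 100))                      # between 10% and 15%
print(classify_zebrafish(12, 100, background_pct=5.0))  # background removes the signal
print(meets_mouse_threshold(2))
```

Keeping the background percentage as an explicit argument makes it easy to see how a tissue with high empty-vector expression can drop from suggestive to negative.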


Ethics

Transgenic mice were generated by Cyagen Biosciences (http://www.cyagen.com/). Their facility meets and often exceeds animal health and welfare guidelines. Animals were euthanized using techniques recommended by the American Veterinary Medical Association. All procedures were carried out in line with Gladstone Institutes and University of California guidelines. All zebrafish work was approved by the UCSF Institutional Animal Care and Use Committee (protocol number AN100466).

Chapter 6: EnhancerFinder’s novel characterization of the developmental enhancer landscape can help address many questions in developmental biology

Our original motivation to create EnhancerFinder was to identify a large, robust set of biologically active developmental enhancers. We wanted this set to be a good representation of many types of developmental enhancers, not just regions with a single histone modification or DNA motif, since previous research indicated that there was no single, universal marker of enhancer activity. EnhancerFinder was able to characterize the signature of thousands of different datasets that underlie in vivo validated developmental enhancers, and we applied this knowledge to the whole human genome to predict over 80,000 developmental enhancers. We were then able to use this novel characterization of the developmental enhancer landscape to address several interesting questions in developmental biology.

Many human accelerated regions function as enhancers

Like many other members of our species, we have long been curious about the fundamental question “what makes us human?” Thesis advisor Katie Pollard and other researchers developed statistical models to shed light on this question by highlighting regions of the human genome that are undergoing uniquely human evolution, which are the HARs we discussed in earlier chapters. These genomic regions are highly conserved in non-human species, indicating some kind of functional constraint on evolution, yet they display many DNA substitutions in the human lineage. HARs largely fall in non-coding regions of the genome, which suggested they might have a regulatory function, perhaps as enhancers. In a study outside the scope of this thesis, led by Tony Capra, EnhancerFinder was applied to a collection of 2,649 non-coding HARs from several studies (Pollard, Salama et al. 2006; Lindblad-Toh, Garber et al. 2011) and predicted 773 of them to be developmental enhancers in humans (Capra, Erwin et al. 2013). We tested human and chimpanzee sequences of 29 predicted enhancer regions using the transgenic mouse assay described in Chapter 5, and found that 24 of the regions showed consistent (17 regions) or suggestive (7 regions) enhancer activity at E11.5. Five of the 17 regions with consistent enhancer activity showed expression differences between human and chimp. As noted in Chapter 5, these results were much better at capturing tissue-specific predictions than the analysis near FOXC1 and FOXC2. Out of the 24 regions with a prediction of tissue specificity, 16 showed enhancer activity in at least one predicted tissue (as compared to 2 out of 12 tested enhancers near FOXC1 and FOXC2).

Further research on HAR enhancer regions continues in the Pollard lab and with collaborators, and will continue to help answer long-standing questions about what makes us human.

Disruptions in enhancers near FOXC1 may influence brain developmental disorder

Shifting to a more clinical application of EnhancerFinder predictions, we studied enhancers that may be involved in Dandy-Walker malformations (DWM). DWMs are congenital brain malformations of the cerebellum and fourth ventricle that occur in approximately 1 in 25,000 live births. Patients typically exhibit hypoplasia and upward rotation of the cerebellar vermis (the median region of the cerebellum that connects the two cerebellar hemispheres) and cystic dilation of the fourth ventricle (the bottom central fluid-filled cavity in the brain that extends from the cerebral aqueduct to the obex and develops from the central canal of the neural tube). In mild cases, DWM patients may have normal intellectual and cognitive development, but more often patients show moderate to severe intellectual disabilities. Patients also exhibit hydrocephaly, convulsions, and delayed motor skills (Parisi and Dobyns 2003).

DWM is a phenotypically heterogeneous disorder that has several suspected genetic causes (Lim, Park et al. 2011; Liao, Fu et al. 2012; Darbro, Mahajan et al. 2013). However, these genetic loci each contribute to just a small number of known DWM cases, and many cases lack a genetic explanation with the known deletions and mutations (Blank, Grinberg et al. 2011; Garel, Fallet-Bianco et al. 2011). As discussed in Chapter 5, FOXC1 is a transcription factor that is important in many aspects of embryonic development, and patient studies have shown that deletions or duplications in the genomic region surrounding FOXC1 are present in DWM patients (Aldinger, Lehmann et al. 2009). Homozygous deletion of FOXC1 is typically pre- or perinatal lethal in humans and in Foxc1 -/- null mouse mutants (Kume, Deng et al. 1998). These homozygous null embryos have severe brain malformations, among other significant problems. Patients with heterozygous deletions in the FOXC1 region that lack other associated mutations still show classic presentations of the DWM phenotype, indicating the FOXC1 mutations may be sufficient for DWM. Other genes such as ZIC1 and ZIC4, two of the Zinc finger of the Cerebellum (ZIC) proteins expressed in the developing cerebellum, have also been shown to be associated with DWM (Blank, Grinberg et al. 2011).


Patients with the largest genomic deletions in the FOXC1 locus, approximately 1.8 Mb in length, have the most severe DWM phenotype. Smaller deletions have a less severe presentation of DWM, and point mutations in coding regions of FOXC1 have the least severe phenotype but still exhibit classic DWM features (Aldinger, Lehmann et al. 2009). Patient point mutations that cause amino acid substitutions S82T and S131L have been shown in vitro to greatly reduce binding of FOXC1 to the known FOXC1 binding site. Binding affinity is reduced 3- to 20-fold, depending on which mutation is present (Saleem, Banerjee-Basu et al. 2001). Therefore, there is effectively just one copy (the non-deleted or non-mutated copy) of FOXC1 binding properly in the DWM patients, as compared to two copies in normal individuals. While gene expression does not have a strictly linear relationship to number of working copies of the gene, in the absence of compensatory up-regulation DWM patients could have fewer FOXC1 proteins binding to DNA. Since FOXC1 is a transcription factor protein that regulates many developmental genes, the loss of sufficient FOXC1 could presumably have many developmental effects in addition to DWM.

The decreased binding of FOXC1 to DNA achieved via point mutations in one copy of FOXC1 can lead to DWM, so we hypothesize that decreased binding of FOXC1 to DNA achieved via enhancer disruption could also lead to DWM. In both cases the net effect is that fewer FOXC1 proteins are present that are able to bind to DNA.

In the transgenic mouse enhancer assay results presented in Chapter 5, we identified a dense cluster of four novel enhancers near FOXC1 that are active in the brain structures relevant to DWM, which we refer to as candidate Dandy-Walker related enhancers (CDEs). These enhancers contain computationally predicted binding sites for many transcription factors expressed in fetal brain tissue, including the known DWM-related factor ZIC1.

Given that DWM is a highly heterogeneous disorder with multiple known genetic causes, we predict that sequencing the non-coding regions near FOXC1, particularly in patients who exhibit milder forms of the malformations, will reveal enhancer-mediated forms of the disease. By understanding the phenotypic impact of these gene regulatory regions, we can better understand the full range of genetic causes of DWM.

Results

As we described in Chapter 4, EnhancerFinder’s predicted enhancers cluster near important developmental genes. The known DWM-related genes follow this trend, with many enhancers clustered nearby. FOXC1 is found in one of the densest enhancer clusters in the genome, with 72 predicted nearby enhancers. The 1 Mb region surrounding the ZIC1/ZIC4 locus contains 30 predicted enhancers.

We validated four novel developmental enhancers near FOXC1 using a transgenic mouse enhancer assay, and saw that these enhancers are active at E11.5 in the rhombic lip region of the developing brain, an embryonic structure that is a precursor to the cerebellum. FOXC1 is expressed in the same tissue in human at both 8 post-conception weeks (pcw) (reads per kilobase per million (RPKM)=2.26) and 9 pcw (RPKM=2.59), according to RNA-seq data from the BrainSpan Atlas of the Developing Human Brain. Mouse E11.5 is an appropriate time to study the developing cerebellum because it is an active time of brain patterning and development and is one day prior to when differentiated cerebellar structures are apparent (Rossant and Tam 2002). Figure 5 shows images of these four enhancers from the transgenic mouse assay, both whole embryos and transverse cross-sections of the brain highlighting enhancer activity in neuronal tissues that contribute cells to the cerebellum and nearby mesenchyme. Table 3 details the complete expression patterns of these enhancers in the transgenic enhancer assay.
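For readers unfamiliar with the RPKM unit quoted above, the standard definition can be sketched as follows. The read counts and lengths in the example are hypothetical and are not BrainSpan values; the function name is illustrative.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads / (gene length in kb * mapped reads in millions),
    which is equivalent to reads * 1e9 / (length in bp * mapped reads).
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp transcript
# in a library of 20 million mapped reads.
example = rpkm(500, 2_000, 20_000_000)
print(example)
```

Because the value is normalized by both transcript length and library size, RPKM values such as those quoted for 8 and 9 pcw can be compared across samples of different sequencing depth.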

Figure 5. Dandy-Walker related expression patterns of enhancers near FOXC1. Four novel enhancers near FOXC1 show expression patterns consistent with Dandy-Walker related brain structures. A) Whole-embryo images from transgenic mouse assay. One representative image from each candidate Dandy-Walker related enhancer (CDE) is shown, with the red arrow highlighting the developing hindbrain tissue. Counts of the number of PCR-positive embryos showing a similar pattern are shown below each whole embryo. B) Transverse cross-sections of brains from transgenic mouse assays. These were each sectioned at a slightly different angle, allowing different views of brain structures. CDE1 shows expression in both the neuronal tissue (upper arrow, similar to ZIC1 expression seen in published in situ sections) and nearby mesenchyme (lower arrow, similar to FOXC1 expression seen in published in situ sections). CDE3 shows expression in the rhombic lip. CDE4 shows similar expression as CDE1, with staining in neuronal tissue. Neuronal tissues highlighted here all contribute cells to the cerebellum.


We hypothesized that disruptions in these enhancers can lead to DWM, similar to disruptions in the FOXC1 protein. Patients showing less severe DWM that lack deletions, duplications, or point mutations in coding regions of FOXC1 or other DWM-associated genes may carry small sequence changes in these enhancers. We did not have access to patient genotype data in these enhancer regions, so we could not directly test this hypothesis. Instead, we characterized the transcription factor binding sites present in these four enhancers. In Figure 6A, we present a model for possible interactions between ZIC1 and enhancers of FOXC1. Figure 6B shows the enhancer landscape near FOXC1 and the location of predicted binding sites for ZIC1.

Figure 6. Suggested interactions between ZIC1 and FOXC1 enhancers. A) A model of possible interactions between ZIC1 and FOXC1 enhancers. The enhancer region, shown in green, contains a binding site for ZIC1, shown in black. In the first panel, ZIC1 is able to bind to the enhancer and facilitate transcription of FOXC1, leading to a normal level of FOXC1 expression. In the second panel, a sequence mutation in the ZIC1 binding site, shown in red, interferes with ZIC1’s binding to the enhancer, leading to decreased expression of FOXC1. B) An image from the UCSC Genome Browser showing the enhancer landscape near FOXC1. The top panel shows all four CDEs, along with FOXC1 and neighboring gene GMDS. The lower panel zooms in on the region immediately surrounding FOXC1 to highlight the ZIC1 binding sites present in these enhancers.


The four CDEs contain binding motifs for hundreds of genes. We filtered this list to include just genes expressed in the developing brain that have at least 12 predicted binding motifs in the novel enhancers (Table 4). Given that FOXC1 itself is not expressed in fully formed cerebellum but is expressed in the precursor rhombic lip and adjacent mesenchyme that surrounds the developing cerebellum, we considered transcription factors with expression in any embryonic brain tissue. These genes that are potentially regulating FOXC1 provide a wider view of the many players in this regulatory pathway.

Transcription factor  Total in all CDEs  CDE1  CDE2  CDE3  CDE4
ESR1                  130                58    12    53    7
ELK1                  75                 29    11    17    18
ZIC3                  71                 37    10    11    13
PAX5                  66                 29    2     23    12
DEAF1                 62                 31    5     19    7
SP1                   56                 9     6     38    3
PAX3                  50                 26    3     12    9
MYF6                  48                 16    7     18    7
ZIC1                  48                 26    6     12    4
SMAD4                 41                 15    5     14    7
FOXO1                 40                 15    15    -     10
SOX13                 39                 13    6     15    5
EGR1                  38                 10    5     19    4
MYB                   38                 8     11    6     13
GATA1                 37                 3     3     2     29
E2F1                  33                 14    1     17    1
ATF6                  32                 11    7     6     8
TBP                   31                 6     7     6     12
ISL2                  30                 6     8     2     14
YY1                   30                 1     18    3     8
PBX1                  27                 2     15    4     6
CAD                   24                 7     12    3     2
CDX1                  23                 6     14    2     1
ELF1                  23                 6     5     6     6
ELF5                  23                 9     6     5     3
IK                    23                 9     3     8     3
IRF3                  23                 7     9     4     3
SP2                   23                 6     2     14    1
PAX4                  22                 12    4     3     3
HNF4A                 21                 6     8     1     6
IRF4                  21                 12    -     4     5
MAFB                  21                 6     7     2     6
SOX9                  21                 4     11    2     4
ARID3A                20                 1     8     1     10
CTCF                  20                 15    -     3     2
PAX6                  20                 5     4     6     5
SMAD3                 20                 4     5     5     6
CEBPA                 19                 6     7     1     5
FOXO3                 19                 4     9     -     6
GATA2                 19                 2     1     1     15
IRF6                  19                 11    1     4     3
FOXJ1                 18                 6     5     2     5
PAX8                  18                 4     5     2     7
TTF1                  18                 7     5     4     2
USF2                  18                 8     2     4     4
VDR                   18                 6     3     3     6
AR                    17                 8     3     4     2
CEBPB                 17                 2     7     1     7
SPIB                  17                 6     -     4     7
[name missing]        16                 6     -     10    -
POU5F1                16                 2     5     -     9
SP3                   16                 7     4     5     -
ARID5A                15                 2     6     1     6
ESRRA                 15                 5     4     -     6
RXRA                  15                 6     5     -     4
SP100                 15                 1     3     8     3
SRY                   15                 1     12    -     2
TBX5                  15                 6     3     2     4
GATA3                 14                 4     1     -     9
LEF1                  14                 5     5     4     -
NF1                   14                 2     6     4     2
ZBTB4                 14                 5     -     2     7
EGR2                  13                 4     3     6     -
PPARA                 13                 2     5     3     3
GLIS2                 12                 2     6     -     4
MAFA                  12                 2     3     1     6
MAX                   12                 6     3     3     -
SF1                   12                 6     -     3     3
SOX2                  12                 3     4     -     5

Table 4. Transcription factors with at least 12 predicted binding motifs found in CDEs. For each transcription factor, the total number of predicted binding sites is shown along with the number of predicted binding sites in each CDE. These genes are all expressed in the fetal brain (GNF Atlas2 pooled sample of fetal whole brain).
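The filtering behind Table 4 can be illustrated with a short sketch. The counts for ESR1, ZIC1, and SOX2 are taken from the table, with "-" read as zero (an assumption that is consistent with the row totals); the two HYPOTHETICAL entries and the brain-expressed set are invented for illustration, and the function names are not from the original analysis.

```python
MIN_TOTAL_MOTIFS = 12  # "at least 12 predicted binding motifs"


def total_motifs(per_cde_counts):
    """Sum predicted motif counts across the four CDEs."""
    return sum(per_cde_counts.values())


def filter_factors(motif_counts, brain_expressed, min_total=MIN_TOTAL_MOTIFS):
    """Keep brain-expressed factors with enough motifs, sorted by total (descending)."""
    kept = []
    for name, per_cde in motif_counts.items():
        total = total_motifs(per_cde)
        if name in brain_expressed and total >= min_total:
            kept.append((name, total))
    return sorted(kept, key=lambda item: (-item[1], item[0]))


motif_counts = {
    "ESR1": {"CDE1": 58, "CDE2": 12, "CDE3": 53, "CDE4": 7},
    "ZIC1": {"CDE1": 26, "CDE2": 6, "CDE3": 12, "CDE4": 4},
    "SOX2": {"CDE1": 3, "CDE2": 4, "CDE3": 0, "CDE4": 5},
    # Hypothetical factors, not from Table 4:
    "HYPOTHETICAL_LOW": {"CDE1": 1, "CDE2": 2, "CDE3": 0, "CDE4": 2},
    "HYPOTHETICAL_NOT_IN_BRAIN": {"CDE1": 10, "CDE2": 5, "CDE3": 3, "CDE4": 2},
}
# Assumed expression calls, standing in for the GNF Atlas2 fetal brain data.
brain_expressed = {"ESR1", "ZIC1", "SOX2", "HYPOTHETICAL_LOW"}

table = filter_factors(motif_counts, brain_expressed)
print(table)
```

SOX2 is retained with a total of exactly 12, which matches the inclusive threshold used for the table.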


We further investigated the expression patterns of several of the transcription factors that may bind to the CDEs. We looked at publicly available in situ hybridization images of E11.5 mouse embryos from the Allen Brain Map, and saw that Zic1 is expressed in the rhombic lip, consistent with the expression we saw in the four CDEs. According to RNA-seq data from the BrainSpan Atlas of the Developing Human Brain, ZIC1 is also expressed in the developing human brain at several stages, including the upper (rostral) rhombic lip at 8 pcw (RPKM=5.53) and 9 pcw (RPKM=6.06). After the cerebellar structures are established, ZIC1 is expressed in the 12 pcw cerebellum (RPKM=6.27) and cerebellar cortex (RPKM=6.51) (though these are effectively the same tissue, they are given separate listings in BrainSpan at this developmental timepoint, so both values are listed here). This further supports the idea that these transcription factors may bind to enhancers near FOXC1 and may therefore regulate FOXC1 expression. Disruption of these binding sites could lead to a reduced level of FOXC1, which is known to correlate with DWM.

Although the CDEs are located very close to FOXC1, they may regulate other nearby genes instead of or in addition to FOXC1. FOXF2 and FOXQ1 are located 215 kb and 300 kb upstream, respectively, of FOXC1 and the CDE clusters. These transcription factors are expressed in the same mouse neuronal tissues as the CDEs (as seen in in situ images of E11.5 mouse embryos from the Allen Brain Map), all of which contribute cells to the cerebellum. They are also expressed in the developing human brain, per the BrainSpan Atlas of the Developing Human Brain, in upper rhombic lip at 8 pcw and 9 pcw (with RPKM values ranging from 1.82 to 2.76), and the cerebellum and cerebellar cortex at 12 pcw (with RPKM values ranging from 1.64 to 2.00). The genomic distance between the CDEs and these other potential target genes is well within the range of enhancer function, so these genes provide additional loci for potential enhancer-driven changes in gene expression that could impact the developing cerebellum and DWM.

Materials and methods

Enhancer identification

Enhancers tested in this study were computationally predicted using EnhancerFinder (Erwin, Oksenberg et al. 2014). The novel enhancers are located in the chromosome band 6p25.3 at the genomic coordinates listed below (hg19). Three of the enhancers (CDE1-3) are located less than 7 kb from FOXC1. CDE4 is approximately 90 kb from FOXC1. In this study, DNA for these regions was synthesized rather than amplified via PCR, but the primers shown below have been used in the work shown in Chapter 5 to amplify the same regions. (CDE1, located at chr6:1614904-1616335, forward primer AGACCCCTGTTAGTTTCGCT and reverse primer ATTAGCTGATTCCCCGCCAT. CDE2, located at chr6:1616341-1617336, forward primer AAATAGCCTCTGTAAAAAGCTTTAGG and reverse primer GACTGACACAGTCTCTTGGTCCT. CDE3, located at chr6:1619847-1620844, forward primer GAGTCGAGTCCTCGGAGC and reverse primer TATGACTACGACGGCAGAGG. CDE4, located at chr6:1702408-1703763, forward primer CAGTAGCTGGACTCCGACTC and reverse primer ACTTCCACCCAGCACAGAAA.)

Computational analysis of enhancer clusters

For every gene in the human genome (based on an October 2012 download of the UCSC Genome Browser RefGene track), we counted the number of predicted enhancers within 500 kb upstream or downstream of the transcription start site. While it is hard to know which gene an enhancer is actually regulating, this 1 Mb window (500 kb on each side) is wide enough that most of the functional gene-enhancer pairs are likely captured.
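The counting step can be pictured with the minimal Python sketch below. It is an illustration under stated assumptions, not the code used in this work: an enhancer is assigned to a window by its midpoint (the text does not say whether overlap or full containment was required), and the function name, gene names, and coordinates are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

WINDOW = 500_000  # 500 kb on either side of the TSS, as stated in the text


def count_enhancers_near_tss(genes, enhancers, window=WINDOW):
    """Count enhancers near each gene's transcription start site (TSS).

    genes:     iterable of (name, chrom, tss)
    enhancers: iterable of (chrom, start, end)
    Assumption: an enhancer counts if its midpoint lies within `window` bp
    of the TSS (inclusive on both sides).
    """
    # Collect and sort enhancer midpoints per chromosome for binary search.
    midpoints = defaultdict(list)
    for chrom, start, end in enhancers:
        midpoints[chrom].append((start + end) // 2)
    for positions in midpoints.values():
        positions.sort()

    counts = {}
    for name, chrom, tss in genes:
        positions = midpoints.get(chrom, [])
        lo = bisect_left(positions, tss - window)
        hi = bisect_right(positions, tss + window)
        counts[name] = hi - lo
    return counts


# Toy input (hypothetical genes and enhancers, not real annotations).
toy_genes = [("GENE_A", "chr6", 1_000_000), ("GENE_B", "chr6", 3_000_000)]
toy_enhancers = [
    ("chr6", 600_000, 601_000),
    ("chr6", 1_499_000, 1_500_000),
    ("chr6", 1_600_000, 1_601_000),
    ("chr7", 1_000_000, 1_001_000),
]
toy_counts = count_enhancers_near_tss(toy_genes, toy_enhancers)
```

Sorting the midpoints once per chromosome keeps each gene lookup to two binary searches, which matters when the enhancer set runs to tens of thousands of regions.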

The GREAT tool (McLean, Bristor et al. 2010) was used to annotate the biological functions enriched in genes near predicted enhancers. Results were computed using the hypergeometric test for genome-wide significance, with the default settings and the “basal plus extension” association rule (proximal 5 kb upstream, 1 kb downstream, plus distal up to 100 kb).
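As a rough illustration of the statistic involved, the sketch below computes an upper-tail hypergeometric p-value over genes. It is a generic textbook formulation and does not reproduce GREAT's implementation; the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, annotated_genes, selected_genes, selected_annotated):
    """Upper-tail hypergeometric p-value P(X >= selected_annotated).

    total_genes:        N, all genes in the background
    annotated_genes:    K, background genes carrying the annotation term
    selected_genes:     n, genes associated with at least one enhancer
    selected_annotated: k, selected genes carrying the annotation term
    """
    N, K, n, k = total_genes, annotated_genes, selected_genes, selected_annotated
    upper = min(K, n)
    # Sum the probability mass for k, k+1, ..., min(K, n) annotated genes.
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


# Toy example: 10 genes, 4 annotated, 3 selected, 2 of those annotated.
p_toy = hypergeom_enrichment_p(10, 4, 3, 2)
```

A small p-value indicates that the annotation appears among the selected genes more often than expected if genes were drawn at random from the background.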

Transgenic mouse enhancer assay

Methods for mouse enhancer assays are described in Chapter 5. CDE1, CDE3, and CDE4 all met the annotation criterion of 3 embryos showing consistent expression patterns. CDE2 had suggestive expression, with 2 embryos out of 4 PCR-positive embryos of that construct showing consistent hindbrain expression.

Identification of genes potentially regulating FOXC1

Genome-wide locations of predicted transcription factor binding sites were generated using the FIMO tool (Find Individual Motif Occurrences) from the MEME suite of bioinformatics tools (Grant, Bailey et al. 2011), based on the April 2011 release of the TRANSFAC database of experimentally-derived transcription factor binding motifs (Wingender, Chen et al. 2000), with a FIMO score threshold of 10e-5. We intersected these predicted binding sites with the genomic coordinates of the four enhancers using the intersectBed tool from the BEDTools suite of bioinformatics tools (Quinlan and Hall 2010).
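The intersection step amounts to an interval-overlap test, sketched below in plain Python. The enhancer coordinates are the hg19 coordinates listed under Enhancer identification. The motif hits, the TF_X and TF_Y names, the function names, and the treatment of coordinates as closed intervals are all assumptions made for illustration. This is a stand-in for the idea and not a reimplementation of intersectBed.

```python
# hg19 coordinates of the four enhancers, as listed in this chapter.
ENHANCERS = {
    "CDE1": ("chr6", 1614904, 1616335),
    "CDE2": ("chr6", 1616341, 1617336),
    "CDE3": ("chr6", 1619847, 1620844),
    "CDE4": ("chr6", 1702408, 1703763),
}


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base (assumed convention)."""
    return a_start <= b_end and b_start <= a_end


def motifs_in_enhancers(hits, enhancers=ENHANCERS):
    """Map each enhancer to the sorted names of factors with an overlapping hit.

    hits: iterable of (tf_name, chrom, start, end)
    """
    found = {name: set() for name in enhancers}
    for tf, chrom, start, end in hits:
        for name, (e_chrom, e_start, e_end) in enhancers.items():
            if chrom == e_chrom and overlaps(start, end, e_start, e_end):
                found[name].add(tf)
    return {name: sorted(tfs) for name, tfs in found.items()}


# Hypothetical motif hits; the positions are invented for illustration.
toy_hits = [
    ("ZIC1", "chr6", 1615000, 1615010),  # falls inside CDE1
    ("TF_X", "chr6", 1617330, 1617345),  # straddles the end of CDE2
    ("TF_Y", "chr6", 1650000, 1650012),  # between enhancers, no overlap
    ("ZIC1", "chr7", 1615000, 1615010),  # wrong chromosome
]
toy_result = motifs_in_enhancers(toy_hits)
```

With only four enhancers a nested loop is adequate; a genome-wide intersection would call for sorted intervals or an interval tree.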

To determine which genes were expressed in the developing brain, we used the fetal brain data from the GNFAtlas2 database (Su, Wiltshire et al. 2004). The fetal brain data included in GNFAtlas2 is based on a Clontech pooled sample of normal whole brains from 59 spontaneously aborted male and female Caucasian fetuses, ages 20-33 weeks.

All array data was mapped to RefSeq gene names (RefGene track downloaded from the UCSC Genome Browser January 2014). Genes with a GNFAtlas2 expression score greater than or equal to 26 were considered expressed. To assess known gene expression patterns in the developing mouse, we viewed images of in situ hybridization studies performed in E11.5 mouse embryos in the Allen Brain Map (http://developingmouse.brain-map.org/). To assess known gene expression in the developing human brain, we viewed RNA-seq data from the BrainSpan Atlas of the Developing Human Brain (http://www.brainspan.org/). Genes with RPKM values greater than 1 were considered to be expressed (BrainSpan).
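The two expression cutoffs differ in whether the boundary value is included, which the short Python sketch below makes explicit. The thresholds come from the text; the function names, gene names, and values are hypothetical.

```python
GNF_MIN_SCORE = 26        # GNFAtlas2: scores >= 26 count as expressed (inclusive)
BRAINSPAN_MIN_RPKM = 1.0  # BrainSpan: RPKM > 1 counts as expressed (strict)


def expressed_gnf(score):
    """GNFAtlas2 criterion: greater than or equal to the cutoff."""
    return score >= GNF_MIN_SCORE


def expressed_brainspan(rpkm):
    """BrainSpan criterion: strictly greater than the cutoff."""
    return rpkm > BRAINSPAN_MIN_RPKM


def filter_expressed(values, is_expressed):
    """Return the sorted gene names whose value passes the given predicate."""
    return sorted(gene for gene, value in values.items() if is_expressed(value))


# Hypothetical values chosen to sit on and around each boundary.
toy_gnf = {"GENE_A": 26, "GENE_B": 25.9, "GENE_C": 140}
toy_brainspan = {"GENE_A": 1.0, "GENE_B": 1.64}
gnf_expressed = filter_expressed(toy_gnf, expressed_gnf)
brainspan_expressed = filter_expressed(toy_brainspan, expressed_brainspan)
```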

Discussion

DWM is a heterogeneous disorder with many possible genetic sources. While it is unknown how much of the disease burden is explained by known mutations and deletions, it is highly likely that more genetic loci will be implicated in this disorder as more patients have their whole genomes sequenced. We see several lines of highly suggestive evidence that non-coding mutations in enhancers near FOXC1 may contribute to mild cases of DWM. Four novel enhancers near FOXC1 are expressed in DWM-related brain structures at the relevant developmental timepoint. These enhancers contain many predicted binding sites for ZIC1 and dozens of other transcription factors expressed in the developing brain, linking these enhancers to a larger network of neurodevelopment genes. Variants present in these enhancers could reduce the binding affinity of essential transcription factors to their binding sites within these enhancers, reducing the expression of FOXC1 and potentially leading to DWM.

We suggest that clinical sequencing of these novel enhancers in the FOXC1 locus in DWM patients will reveal new genetic causes for this disease, as may other predicted enhancer regions near known DWM-associated genes. Enhancer mutations are particularly likely to cause mild forms of DWM, similar to those seen in patients with coding mutations of FOXC1 that reduce its transcription factor activity. We hope that further study of these developmental enhancers will enable a better understanding of the genetic cause of DWM and other brain malformations. This knowledge may provide helpful information for genetic counseling, as well as causal mechanisms for patients and their families who otherwise lack genetic explanations for DWM.

Chapter 7: Conclusion

We developed EnhancerFinder because we wanted to study gene regulation in many developmental tissues, without being limited to in vitro proxies or constrained by expensive and animal-intensive in vivo experiments. EnhancerFinder integrates many diverse clues about a wide range of genomic regions – enhancers, promoters, repressors, genes, regions with no known function – to predict enhancers active during human embryonic development. We were initially focused on regions of the genome active during cardiovascular development, but our results eventually led us to many other developmental applications as well.

EnhancerFinder predicted over 80,000 developmental enhancers, and we showed that these enhancers tend to cluster around genes known to be active during development. These enhancers validate well in transgenic animal enhancer assays. We were able to springboard from validation to a possible clinical application when several of our enhancers exhibited expression patterns that are highly relevant to a known brain developmental disorder.

Many of our initial hypotheses and methods did not pan out. But these attempts became the building blocks of several useful tools, namely the EnhancerFinder framework of integrating diverse data to predict regulatory regions, the EnhancerFinder predicted enhancers, and the database of important cardiovascular developmental regions. We have shown several applications of these tools in the study of developmental biology, and we highly encourage any researcher interested in developmental biology to use these tools and continue building the knowledge about fundamental aspects of life.

References

Ahituv, N. (2012). Gene regulatory sequences and human disease. New York, Springer.
Akopov, S. B., I. P. Chernov, et al. (2007). "Methods for identification of epigenetic elements in mammalian long multigenic genome sequences." Biochemistry. Biokhimiia 72(6): 589-594.
Alberts, B. (2008). Molecular biology of the cell. New York, Garland Science.
Aldinger, K. A., O. J. Lehmann, et al. (2009). "FOXC1 is required for normal cerebellar development and is a major contributor to chromosome 6p25.3 Dandy-Walker malformation." Nature genetics 41(9): 1037-1042.
Andersson, R., C. Gebhard, et al. (2014). "An atlas of active enhancers across human cell types and tissues." Nature 507(7493): 455-461.
Arvey, A., P. Agius, et al. (2012). "Sequence and chromatin determinants of cell-type-specific transcription factor binding." Genome research 22(9): 1723-1734.
Bernstein, B. E., E. Birney, et al. (2012). "An integrated encyclopedia of DNA elements in the human genome." Nature 489(7414): 57-74.
Birnbaum, R. Y., E. J. Clowney, et al. (2012). "Coding exons function as tissue-specific enhancers of nearby genes." Genome research 22(6): 1059-1068.
Blank, M. C., I. Grinberg, et al. (2011). "Multiple developmental programs are altered by loss of Zic1 and Zic4 to cause Dandy-Walker malformation cerebellar pathogenesis." Development 138(6): 1207-1216.
Bradford, Y., T. Conlin, et al. (2011). "ZFIN: enhancements and updates to the Zebrafish Model Organism Database." Nucleic acids research 39(Database issue): D822-829.
BrainSpan (2013). "Transcriptome profiling by RNA sequencing and exon microarray." Technical White Paper.
Capra, J. A., G. D. Erwin, et al. (2013). "Many human accelerated regions are developmental enhancers." Philosophical transactions of the Royal Society of London. Series B, Biological sciences 368(1632): 20130025.
Creyghton, M. P., A. W. Cheng, et al. (2010). "Histone H3K27ac separates active from poised enhancers and predicts developmental state." Proceedings of the National Academy of Sciences of the United States of America 107(50): 21931-21936.
Darbro, B. W., V. B. Mahajan, et al. (2013). "Mutations in extracellular matrix genes NID1 and LAMC1 cause autosomal dominant Dandy-Walker malformation and occipital cephaloceles." Human mutation 34(8): 1075-1079.
Dudley, A. T. and C. J. Tabin (2003). "Deconstructing phosphatases in limb development." Nature cell biology 5(6): 499-501.
EMAP (2014). "http://www.emouseatlas.org/emap/ema/theiler_stages/StageDefinition/Theiler/ts17%20 %20from%20Theiler.pdf."
Ernst, J. and M. Kellis (2012). "ChromHMM: automating chromatin-state discovery and characterization." Nature methods 9(3): 215-216.
Erwin, G. D., N. Oksenberg, et al. (2014). "Integrating diverse datasets improves developmental enhancer prediction." PLoS .
Garel, C., C. Fallet-Bianco, et al. (2011). "The fetal cerebellum: development and common malformations." Journal of child neurology 26(12): 1483-1492.
Grant, C. E., T. L. Bailey, et al. (2011). "FIMO: scanning for occurrences of a given motif." Bioinformatics 27(7): 1017-1018.
Haudry, Y., H. Berube, et al. (2008). "4DXpress: a database for cross-species expression pattern comparisons." Nucleic acids research 36(Database issue): D847-853.

Heintzman, N. D., G. C. Hon, et al. (2009). "Histone modifications at human enhancers reflect global cell-type-specific gene expression." Nature 459(7243): 108-112.
Hernandez-Gutierrez, S., I. Garcia-Pelaez, et al. (2006). "NF-kappaB signaling blockade by Bay 11-7085 during early cardiac morphogenesis induces alterations of the outflow tract in chicken heart." Apoptosis : an international journal on programmed cell death 11(7): 1101-1109.
Hindorff, L. A., P. Sethupathy, et al. (2009). "Potential etiologic and functional implications of genome-wide association loci for human diseases and traits." Proceedings of the National Academy of Sciences of the United States of America 106(23): 9362-9367.
Hnisz, D., B. J. Abraham, et al. (2013). "Super-enhancers in the control of cell identity and disease." Cell 155(4): 934-947.
Hoffman, M. M., O. J. Buske, et al. (2012). "Unsupervised pattern discovery in human chromatin structure through genomic segmentation." Nature methods 9(5): 473-476.
Kawakami, K. (2005). "Transposon tools and methods in zebrafish." Developmental dynamics : an official publication of the American Association of Anatomists 234(2): 244-254.
Kimmel, C. B., W. W. Ballard, et al. (1995). "Stages of embryonic development of the zebrafish." Developmental dynamics : an official publication of the American Association of Anatomists 203(3): 253-310.
Kloft, M., U. Brefeld, et al. (2011). "Lp-Norm Multiple Kernel Learning." Journal of machine learning research : JMLR 12: 953-997.
Koch, C. M., R. M. Andrews, et al. (2007). "The landscape of histone modifications across 1% of the human genome in five human cell lines." Genome research 17(6): 691-707.
Kothary, R., S. Clapoff, et al. (1989). "Inducible expression of an hsp68-lacZ hybrid gene in transgenic mice." Development 105(4): 707-714.
Kume, T., K. Y. Deng, et al. (1998). "The forkhead/winged helix gene Mf1 is disrupted in the pleiotropic mouse mutation congenital hydrocephalus." Cell 93(6): 985-996.
Li, Q., D. Ritter, et al. (2010). "A systematic approach to identify functional motifs within vertebrate developmental enhancers." Developmental biology 337(2): 484-495.
Liao, C., F. Fu, et al. (2012). "Prenatal diagnosis and molecular characterization of a novel locus for Dandy-Walker malformation on chromosome 7p21.3." European journal of medical genetics 55(8-9): 472-475.
Lim, B. C., W. Y. Park, et al. (2011). "De novo interstitial deletion of 3q22.3-q25.2 encompassing FOXL2, ATR, ZIC1, and ZIC4 in a patient with blepharophimosis/ptosis/epicanthus inversus syndrome, Dandy-Walker malformation, and global developmental delay." Journal of child neurology 26(5): 615-618.
Lindblad-Toh, K., M. Garber, et al. (2011). "A high-resolution map of human evolutionary constraint using 29 mammals." Nature 478(7370): 476-482.
McLean, C. Y., D. Bristor, et al. (2010). "GREAT improves functional interpretation of cis-regulatory regions." Nature biotechnology 28(5): 495-501.
Narlikar, L., N. J. Sakabe, et al. (2010). "Genome-wide discovery of human heart enhancers." Genome research 20(3): 381-392.
Nobrega, M. A., I. Ovcharenko, et al. (2003). "Scanning human gene deserts for long-range enhancers." Science 302(5644): 413.
Noonan, J. P. and A. S. McCallion (2010). "Genomics of long-range regulatory elements." Annual review of genomics and human genetics 11: 1-23.
Oksenberg, N., G. D. E. Haliburton, et al. (In Review). "Genome-wide distribution of Auts2 localizes with active neurodevelopmental genes." Translational Psychiatry.
Oksenberg, N., L. Stevison, et al. (2013). "Function and regulation of AUTS2, a gene implicated in autism and human evolution." PLoS genetics 9(1): e1003221.

Parisi, M. A. and W. B. Dobyns (2003). "Human malformations of the midbrain and hindbrain: review and proposed classification scheme." Molecular genetics and metabolism 80(1-2): 36-53.
Pennacchio, L. A., N. Ahituv, et al. (2006). "In vivo enhancer analysis of human conserved non-coding sequences." Nature 444(7118): 499-502.
Pollard, K. S., S. R. Salama, et al. (2006). "Forces shaping the fastest evolving regions in the human genome." PLoS genetics 2(10): e168.
Quinlan, A. R. and I. M. Hall (2010). "BEDTools: a flexible suite of utilities for comparing genomic features." Bioinformatics 26(6): 841-842.
Rajagopal, N., W. Xie, et al. (2013). "RFECS: a random-forest based algorithm for enhancer identification from chromatin state." PLoS computational biology 9(3): e1002968.
Richardson, L., S. Venkataraman, et al. (2014). "EMAGE mouse embryo spatial gene expression database: 2014 update." Nucleic acids research 42(Database issue): D835-844.
Robertson, G., M. Hirst, et al. (2007). "Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing." Nature methods 4(8): 651-657.
Rossant, J. and P. P. L. Tam (2002). Mouse development : patterning, morphogenesis, and organogenesis. San Diego, Academic Press.
Sakabe, N. J., D. Savic, et al. (2012). "Transcriptional enhancers in development and disease." Genome biology 13(1): 238.
Saleem, R. A., S. Banerjee-Basu, et al. (2001). "Analyses of the effects that disease-causing missense mutations have on the structure and function of the winged-helix protein FOXC1." American journal of human genetics 68(3): 627-641.
Saxton, T. M., B. G. Ciruna, et al. (2000). "The SH2 tyrosine phosphatase shp2 is required for mammalian limb development." Nature genetics 24(4): 420-423.
Schoenherr, C. J. and D. J. Anderson (1995). "The neuron-restrictive silencer factor (NRSF): a coordinate repressor of multiple neuron-specific genes." Science 267(5202): 1360-1363.
Smith, R. P., S. J. Riesenfeld, et al. (2013). "A compact, in vivo screen of all 6-mers reveals drivers of tissue-specific expression and guides synthetic regulatory element design." Genome biology 14(7): R72.
Sonnenburg, S., A. Zien, et al. (2006). "ARTS: accurate recognition of transcription starts in human." Bioinformatics 22(14): e472-480.
Su, A. I., T. Wiltshire, et al. (2004). "A gene atlas of the mouse and human protein-encoding transcriptomes." Proceedings of the National Academy of Sciences of the United States of America 101(16): 6062-6067.
Tzchori, I., T. F. Day, et al. (2009). "LIM homeobox transcription factors integrate signaling events that control three-dimensional limb patterning and growth." Development 136(8): 1375-1385.
Visel, A., J. A. Akiyama, et al. (2009). "Functional autonomy of distant-acting human enhancers." Genomics 93(6): 509-513.
Visel, A., M. J. Blow, et al. (2009). "ChIP-seq accurately predicts tissue-specific activity of enhancers." Nature 457(7231): 854-858.
Visel, A., S. Minovitsky, et al. (2007). "VISTA Enhancer Browser--a database of tissue-specific human enhancers." Nucleic acids research 35(Database issue): D88-92.
Visel, A., E. M. Rubin, et al. (2009). "Genomic views of distant-acting enhancers." Nature 461(7261): 199-205.
Visel, A., L. Taher, et al. (2013). "A high-resolution enhancer atlas of the developing telencephalon." Cell 152(4): 895-908.
Wamstad, J. A., J. M. Alexander, et al. (2012). "Dynamic and coordinated epigenetic regulation of developmental transitions in the cardiac lineage." Cell 151(1): 206-220.

Whyte, W. A., D. A. Orlando, et al. (2013). "Master transcription factors and mediator establish super-enhancers at key cell identity genes." Cell 153(2): 307-319.
Wingender, E., X. Chen, et al. (2000). "TRANSFAC: an integrated system for gene expression regulation." Nucleic acids research 28(1): 316-319.
Yin, J. W. and G. Wang (2014). "The Mediator complex: a master coordinator of transcription and cell lineage development." Development 141(5): 977-987.