LEVERAGING LARGE-SCALE DATASETS TO UNDERSTAND THE INTERACTION BETWEEN THE GENOME AND THE EPIGENOME

by

Leandros Boukas

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland

June, 2020

© 2020 Leandros Boukas

All rights reserved

Abstract

Epigenetics is typically described as a layer of molecular information above and beyond the DNA sequence. While this conceptualization is certainly accurate to some extent, there is also a tight connection between the genome and the epigenome, as the basic components of the epigenetic machinery (EM) are DNA-encoded. This thesis focuses on four such genetic components: genes encoding for the proteins of the histone machinery, genes encoding for the proteins of the DNA methylation machinery, genes encoding for chromatin remodelers, and CpG dinucleotides. We first perform a systematic analysis of all human EM genes, and characterize them with respect to their tolerance to variation, both at the whole-gene level and at the local, domain level.

We then discover a systems-level property (co-expression) that is specifically exhibited by a large subset of variation-intolerant EM genes, and may be particularly relevant to their involvement in neurodevelopment. Finally, we shift our focus to the CpG dinucleotides. We show that a high promoter CpG density is not merely a generic feature of human promoters, but is preferentially encountered at the promoters of the most loss-of-function intolerant genes. This coupling calls into question the prevailing view that CpG islands are not subject to selection. It also has practical utility, as it allows us to train a simple and easily interpretable predictive model of loss-of-function intolerance that outperforms existing predictors and classifies 1,760 genes - which are currently unascertained - as highly loss-of-function-intolerant or not. Together, the results presented in this thesis provide new insights into the interaction between the genome and the epigenome.

Advisors and Readers:

Kasper Daniel Hansen, PhD, and Hans Tomas Bjornsson, MD, PhD

Acknowledgments

I am indebted to many individuals for their contribution to my work during the past 5 years. First and foremost, I am deeply grateful to my advisors, Kasper and Hans. While having two advisors is not always guaranteed to succeed, in my case it was a true blessing and privilege.

In particular, I would have never gotten accepted into the Human Genetics PhD program at Johns Hopkins, if Hans hadn’t generously offered me the opportunity to work in his lab as a visiting student in the summer of 2014. Since then, I have watched from the front seat how he asks bold and important scientific questions, how he attacks them from many possible angles, and how he has developed a groundbreaking research program aimed at finding cures for his patients, while always keeping an eye on the basic science. For a budding physician-scientist like me, Hans is an ideal role model.

Aside from allowing me to get into this PhD program, Hans did another crucial thing for my training: he suggested that I work with Kasper. Following this suggestion turned out to be one of the best decisions I have ever made. Data analysis is both a science and an art, that I believe is impossible to truly learn unless one works closely, and for an extended period of time, with a true expert. I have seen how Kasper uses his solid foundation in theoretical statistics, and an understanding of biology that, over the years, has become as deep as that of biologists, to approach meaningful problems in a manner that is both rigorous and intuitive. Whenever I walked into his office confused about some analysis, I always, without exception, walked out with things being much clearer in my head. I have also learned quite a lot from our many discussions over lunch, which I have very much enjoyed.

Finally, I greatly appreciate the total freedom Kasper and Hans have given me to explore my own ideas, even when they are not directly related to their other research projects. If I manage to be half as good an advisor as they have been to me, my future students will be extremely lucky.

I also have to give special thanks to Dr. Valle and Sandy. I am very proud to call myself a product of the Human Genetics program, and it’s clear that their deep commitment is one of the driving forces behind it. They have created an environment where it is possible for us students to live and breathe genetics. The many departmental activities and courses have certainly helped me acquire breadth that I would not have acquired had I only been focused on my own research.

Thanks to my thesis committee members, Dr. Valle, Dani Fallin, and Alexis Battle. I especially want to thank Alexis who provided us with constructive feedback for the co-expression analysis.

I also thank Jill Fahrner, with whom we have joint lab meetings, and who has often provided me with very useful feedback. Thanks also to Loyal Goff and Dimitrios Avramopoulos, with whom I spent two very pleasant rotations in my first year. In addition, I have had some very stimulating interactions and discussions with Kirby Smith, Barbara Migeon, Haig Kazazian, and Stephanie Hicks.

Sincere thanks to Priya Duggal and Jennifer Deal, who accepted me into the MD-GEM program. The courses I took as part of it definitely helped me understand how to think about population-scale genetics.

Of course, I want to also thank my classmates. It has been great to have their friendship and support, and to be able to share this journey with them.

I have also been very fortunate to have friends outside Hopkins, with whom we made some great memories. Thanks to Thanos, Kleio, Chris, Maria, Greta, and Antonis. Very special thanks have to go to Ilias and Mike (who is my best friend from med school and now also my roommate) - I hope the three of us will continue to share this journey through residency training.

I am grateful to the Jenkins family - Stella, Larry, Christian, and Daniel. Ever since I came to Baltimore they adopted me as their third child (which, of course, they didn’t have to do), and it has been great to know that I have a family here, even though my real parents are far away.

I don’t need to say much about my best friends from Greece - Thomas, Janis, Kostas, Dimitris, and Iordanis. Ever since we graduated from college we disseminated around the globe, but our friendship (brotherhood, indeed) has no borders.

The final year of my PhD was by far the happiest for me. There is only one reason for this, which has nothing to do with the science. It is that I have been able to share my life with Giota, and I look forward to many more years with her.

Last, but not least, I wouldn’t be standing here without the endless and unconditional love and support from my parents, Effie and Andreas, who is also responsible for inspiring my love for science. I wholeheartedly dedicate this thesis to them.

To my mom and dad

Contents

Abstract
Acknowledgments
List of Tables
List of Figures
1 Introduction
2 Co-expression patterns define epigenetic regulators associated with neurological dysfunction
2.1 Preface
2.2 Introduction
2.3 Results
2.3.1 The modular composition of the epigenetic machinery
2.3.2 The human epigenetic machinery is highly intolerant to variation and contains many additional disease candidates
2.3.3 Dual function epigenetic regulators and remodelers are the most variation-intolerant categories
2.3.4 The intolerance to variation is primarily driven by the domains mediating the epigenetic function
2.3.5 A large subset of the epigenetic machinery is co-expressed
2.3.6 Dual function epigenetic regulators are enriched in the highly co-expressed group and are co-expressed with multiple other categories
2.3.7 The highly co-expressed epigenetic regulators are extremely intolerant to variation and enriched for genes causing neurological dysfunction
2.3.8 Brain-specific regulatory elements of highly co-expressed epigenetic regulators are enriched for SNPs that explain the heritability of common neurological traits
2.3.9 The promoters of highly co-expressed genes of the epigenetic machinery are bound by common trans-acting factors
2.4 Discussion
2.5 Methods
2.5.1 The creation of an epigenetic regulator list
2.5.2 Epigenetic regulators with disease associations
2.5.3 Variation tolerance analysis
2.5.4 CCR local constraint score
2.5.5 GTEx data
2.5.6 Tissue specificity and expression level analysis
2.5.7 Co-expression analysis
2.5.8 Trans-acting factor binding at EM gene promoters
2.5.9 Enrichment of disease genes in the highly co-expressed group
2.5.10 Stratified LD score regression
2.5.11 Genome assembly version
2.5.12 Code availability
2.5.13 Acknowledgments
2.6 Supplemental Materials
2.6.1 Supplemental Results
2.6.1.1 Variation intolerance of EM genes encoded on the sex chromosomes
2.6.1.2 Tissue specificity and expression levels of EM genes
2.6.1.3 The highly co-expressed genes are not enriched for protein-protein interactions
2.6.1.4 Co-expressed EM genes are not spatially clustered
2.6.2 Supplemental Tables
2.6.3 Supplemental Figures
3 Promoter CpG density predicts genic loss-of-function intolerance
3.1 Preface
3.2 Introduction
3.3 Results
3.3.1 Promoter CpG density is strongly and quantitatively associated with downstream gene LoF-intolerance
3.3.2 The association between CpG density and LoF-intolerance is not mediated through expression level or tissue specificity
3.3.3 Regulatory factor binding at promoters provides information about LoF-intolerance which adds to CpG density
3.3.4 Promoter CpG density with promoter and exonic across-species conservation can collectively predict LoF-intolerance with high accuracy
3.3.5 32.5% of currently unascertained genes in gnomAD receive high-confidence predictions by predLoF-CpG
3.3.6 predLoF-CpG reclassifies 101 genes with expected LoF variants between 10 and 20 as highly LoF-intolerant
3.4 Discussion
3.5 Methods
3.5.1 Selecting transcripts with high-confidence loss-of-function intolerance estimates
3.5.2 Selecting transcripts with high-confidence annotations in GENCODE v19
3.5.3 Calculating the CpG density of a promoter
3.5.4 The impact of promoter definition
3.5.5 Overlapping promoters
3.5.6 Promoters in subtelomeric regions
3.5.7 ENCODE ChIP-seq data
3.5.8 GTEx expression data
3.5.9 TSS coordinates of mouse orthologs
3.5.10 Across-species conservation quantification
3.5.11 Previously published LoF-intolerance predictions
3.5.12 Structural variation data
3.5.13 Gene catalogs
3.5.14 Enrichment quantification
3.5.15 Code
3.5.16 Acknowledgements
3.6 Supplementary Materials
3.6.1 Supplemental Tables
3.6.2 Supplemental Figures
4 Future directions
Bibliography
Curriculum Vitae

List of Tables

2.1 The protein domains used to define the epigenetic machinery
2.2 The components of the epigenetic machinery
2.3 Local CCR constraints
2.4 EM genes with DNA binding domains
2.5 Subunits of EM complexes
2.6 Novel EM disease candidate genes
2.7 Novel disease candidate genes encoding for accessory subunits of EM complexes
2.8 Histone modifiers for which the amino-acid substrate specificity is known
2.9 GTEx tissues used in the tissue specificity and co-expression analyses
2.10 The components of the TCA cycle
2.11 Common traits/diseases whose heritability enrichment in EM regulatory regions was examined
2.12 The protein components of the ribosome
3.1 predLoF-CpG predictions for genes unascertained in gnomAD
3.2 predLoF-CpG reclassifications for genes with expected LoF variants between 10 and 20
3.3 Non-canonical promoter coordinates
3.4 All promoter coordinates

List of Figures

2.1 The modular composition of the epigenetic machinery
2.2 A large subset of epigenetic regulators are very intolerant to variation.
2.3 The protein domains known to mediate epigenetic functions drive the observed constraint of EM genes.
2.4 A large subset of the components of the epigenetic machinery exhibit unusually high levels of co-expression.
2.5 Dual function EM genes are enriched within the highly co-expressed group.
2.6 EM genes linked to disorders with neurological dysfunction demonstrate significant enrichment within the highly co-expressed category.
2.7 The pLI scores of members of EM protein complexes.
2.8 pLI for EM genes with same substrate specificity.
2.9 The C2H2 zinc fingers are the main drivers of the mutational constraint of the PRDM family.
2.10 The protein domains known to mediate epigenetic functions drive the observed constraint of EM genes.
2.11 Identical copies of protein domains show within-gene variability in constraint.
2.12 The components of the epigenetic machinery are expressed in a highly non tissue-specific manner and at high levels across tissues.
2.13 The pLI distributions of genes with low tissue specificity.
2.14 EM genes show inter-individual variability in their expression levels.
2.15 Expression levels of EM genes.
2.16 EM genes are highly co-expressed irrespective of arbitrary choices.
2.17 Size and average degree of the maximally connected component in different tissues.
2.18 Dual function EM writers partner with multiple other EM categories.
2.19 Transcription Factors (TFs) and Protein Kinases/Phosphatases are not significantly co-expressed.
2.20 Co-expressed EM genes are not spatially clustered.
2.21 The promoters of the highly co-expressed EM genes are bound by common regulatory factors.
2.22 The lack of tissue specificity of EM genes is not driven by unwanted variation.
2.23 Removing noise in co-expression analysis by removing principal components.
2.24 Highly constrained EM genes are also more highly expressed.
2.25 Regulatory regions of EM genes are enriched for explained variation for some neurological traits: significance.
3.1 The relationship between promoter CpG density and downstream gene loss-of-function intolerance.
3.2 The relationship between promoter CpG density and loss-of-function intolerance conditional on downstream gene expression level and tissue specificity (τ).
3.3 The loss-of-function intolerance of tissue-specific genes conditional on high promoter CpG-density and promoter EZH2 binding.
3.4 Training and assessing predLoF-CpG: a predictor of loss-of-function intolerance based on CpG density.
3.5 Using predLoF-CpG to classify currently unascertained genes as highly loss-of-function intolerant or not.
3.6 Assessing the reliability of LOEUF estimates.
3.7 Assessing the relationship between tissue specificity of gene expression and POLR2A binding at the canonical promoter.
3.8 Partitioning genes according to the reliability of their LOEUF estimates and promoter annotation.
3.9 Scatterplot of promoter CpG density against downstream gene LOEUF.
3.10 The effect of filtering for high-confidence promoter annotations on the relationship between CpG density and LOEUF.
3.11 The impact of the size of promoter definition on the relationship between CpG density and LOEUF.
3.12 Distributions of downstream gene expression level and tissue specificity across promoter CpG density deciles.
3.13 The proportion of promoters with EZH2 peaks in 1-14 ENCODE experiments, stratified based on their CpG density and downstream gene tissue specificity.
3.14 The relationship between promoter CpG density and loss-of-function intolerance conditional on promoter and exonic across-species conservation.
3.15 The percentage of out-of-sample LOEUF variance explained by the different predictors of LoF-intolerance.
3.16 The relationship between promoter deletions seen in healthy individuals and downstream gene loss-of-function intolerance.
3.17 UCSC genome browser screenshot of a 10kb region containing the transcriptional start sites of the canonical and one alternative KMT2D transcripts.

Chapter 1

Introduction

The epigenome is often thought of, and depicted, as a layer of molecular information above and beyond the DNA sequence. Such a conceptualization is certainly accurate to some extent. It explains, for example, how different cell types in multicellular organisms can have vastly different morphologies and functions, despite containing the same (or almost the same) genome. However, it is important to recognize that there is also a very intimate relationship between the genome and the epigenome, because the basic components of the epigenetic machinery (EM) are DNA-encoded. The focus of this thesis will be on four such EM components [1–3]:

1. The protein members of the histone machinery (writers, erasers, readers), which act on the histone tails of the nucleosomes, and thereby generate (writers, erasers) and interpret (readers) the genome-wide histone modification landscape.

2. The protein members of the DNA methylation machinery (writers, erasers, readers), which act on CpG dinucleotides and thereby generate (writers, erasers) and interpret (readers) the genome-wide DNA methylation landscape.

3. The chromatin remodeler enzymes, which modulate the position and composition of nucleosomes genome-wide.

4. The CpG dinucleotides themselves. Aside from their obvious role as substrates for the DNA methyltransferases and DNA demethylases, CpGs have recently been shown to interact with the histone machinery as well. Specifically, clusters of unmethylated CpGs, which often occur at promoters, are recognized by CxxC-domain containing proteins, which are subunits of histone-modifying complexes. As a result, unmethylated CpG clusters can influence the histone modification state of the nearby nucleosomes.

The extent to which epigenetics is causally relevant to normal human physiology is often debated, and it is difficult to definitively resolve this debate purely with biochemical approaches. However, the description of the above genetic EM components makes it possible to approach this question in an unbiased and quantitative fashion. Specifically, we can ask: are these EM components under selection? And if so, how strong is this selective pressure, and why?

The first unequivocal demonstration of such selection came when Huda Zoghbi and colleagues described the genetic defect responsible for Rett syndrome [4], one of the most common causes of intellectual disability in females, that is lethal in males. The causative gene turned out to encode for a DNA methylation reader discovered by Adrian Bird’s group [5], who rather clumsily (as he later admitted) named it MECP2 in 1992. Within the last decade, the advent and widespread clinical use of exome sequencing led to the realization that the disruption of several genes encoding for EM proteins can cause Mendelian disorders. These disorders share common phenotypic features (most notably, intellectual disability and growth defects), and are now collectively termed the “Mendelian Disorders of the Epigenetic Machinery” [2, 6, 7].

In the first part of this thesis (Chapter 2), we focus on the genes encoding for the protein EM components. We leverage newly available data on naturally occurring genetic variation in the human population to systematically characterize all human EM genes with respect to their tolerance to loss-of-function (LoF) variation (which serves as a proxy for the strength of negative selection on LoF variants in them). We study this variation tolerance not only at the gene level, but also at the local, protein domain level, motivated by studies reporting that the domains mediating the epigenetic function of EM proteins can be dispensable in some model systems. Finally, we show that there is a systems-level property (co-expression) that is exclusively exhibited by a subset of EM genes under very strong selection.

In the second part of the thesis (Chapter 3), we shift our focus to the other genetic EM component, the CpG dinucleotides. Understanding whether CpGs are subject to selection is a less tractable question, because CpGs are non-coding genetic elements. While we do not directly tackle this problem here, we take a first step in that direction, by showing that a high promoter CpG density is preferentially encountered at the most selectively constrained genes in the genome. We explore some potential biological reasons for this coupling, and subsequently, as a practical application, we capitalize on it to train a simple and easily interpretable predictive model that allows us to classify 1,760 human genes - which currently lack reliable LoF-intolerance estimates - as highly LoF-intolerant or not.

Chapter 2

Co-expression patterns define epigenetic regulators associated with neurological dysfunction

2.1 Preface

This chapter has been published as:

Boukas et al., Genome Res. 2019. 29: 532-542, doi: 10.1101/gr.239442.118.

2.2 Introduction

The chromatin landscape of any cell is shaped and maintained by the epigenetic machinery (EM), hereafter defined as the group of proteins that can catalyze the addition or removal of epigenetic marks (writers or erasers, respectively), bind to preexisting marks (readers), or use the energy of ATP hydrolysis to alter the local chromatin environment via mechanisms such as nucleosome sliding (remodelers) [2, 3]. Recently, some EM genes have been associated with human diseases, with the most prevalent disease phenotypes falling broadly under the categories of neurological dysfunction [6, 8–11] and cancer [12–14]; those associations have indicated that the vast majority of known disease-causing EM genes are haploinsufficient [6, 13].

This paper addresses three main questions. First, how many additional disease candidate EM genes are there? Existing estimates [15, 16] suggest that EM genes with ascribed roles in disease only form a minority of the whole group. Thus, the number of additional disease candidates that a comprehensive EM gene list will harbor is unclear. It is also unknown whether disease genes tend to be evenly distributed among classes (e.g. erasers vs remodelers) and subclasses (e.g. histone methyltransferases vs histone acetyltransferases) of the machinery; such patterns could reflect the relative contribution of those categories to normal cellular function. Second, is the lost epigenetic function of these genes the most likely cause of disease? Studies in model systems have indicated that the domains mediating the epigenetic function can be dispensable [17, 18]. This raises the possibility that, even among known EM disease genes, the phenotype might have some alternative mechanistic basis. Third, are there expression signatures characteristic of disease candidates? In other words, are the expression patterns of EM genes that are intolerant to variation different from those of variation-tolerant EM genes? Related to this question, it would be of particular interest if there also exist expression signatures that distinguish between EM genes associated with neurological dysfunction versus those associated with cancer. Such signatures could not only prioritize candidate genes for specific phenotypes, but also provide insights into novel disease mechanisms. To answer these questions, we perform a systematic investigation of the human epigenetic machinery with respect to its composition, its tolerance to variation, and its expression in a diverse set of tissues.

2.3 Results

2.3.1 The modular composition of the epigenetic machinery

We defined EM genes as genes whose protein products contain domains classifying them as chromatin remodelers, or as writers/erasers/readers of DNA or histone methylation, or histone acetylation. Then, we utilized the UniProt database [19], combined with InterPro domain annotations [20], to systematically compile a list of all such human genes (Methods; a full list of the domains used for classification is provided in Supplemental Table 2.1). This stringent, domain-based definition minimizes the risk of false positives. We found a total of 295 EM genes (Figure 2.1A,B and Supplemental Table 2.2), the vast majority of which belong to the histone machinery, and only a small fraction are remodelers or components of the DNA methylation machinery (Figure 2.1A). The two latter categories overlap the histone machinery; most remodelers are also readers of either histone methylation or acetylation, whereas the overlap between the DNA methylation and histone components is multifaceted (Methods).

[Figure 2.1, panels A and B: Venn diagrams.]

Figure 2.1: The modular composition of the epigenetic machinery. (A) Venn diagram illustrating the 3 broad categories of the epigenetic machinery (histone machinery, DNA methylation machinery, and remodelers), their relative sizes, and their mutual relationships. (B) Venn diagram illustrating the 4 broad “action” categories of the machinery (writers, erasers, remodelers, and readers), their relative sizes, and their mutual relationships. The modularity of this organization is evident, with some reader components exhibiting enzymatic functions and/or more than one reading functions. In contrast, the individual enzymatic component types are pairwise mutually exclusive.

Considering the categorization of EM genes into readers, writers, erasers and remodelers, we found that the readers comprise the biggest group (n = 211), and the remodelers the smallest (n = 18) (Figure 2.1B). The writer and eraser groups are comparable in size (n = 62 and n = 55, respectively) (Figure 2.1B). We observed that the three enzymatic categories (writers, erasers, and remodelers) are pairwise mutually exclusive (Figure 2.1B). In contrast, we saw a subgroup of 51 genes encoding proteins which harbor both an enzymatic and a reader domain (Figure 2.1B), suggesting that these factors have dual epigenetic function; we will refer to these genes as dual function EM genes. In general, dual function writers tend to catalyze the addition of the same mark they read; this is also true for dual function histone demethylases, the only eraser category with some members that have reading activity. However, there are exceptions: histone methylation readers that enzymatically only function as remodelers (n = 11; 9 of these are members of the CHD family), or DNA methyltransferases (n = 2), and 1 dual function histone methyltransferase which only reads DNA methylation. Furthermore, within the reader category, there are 32 genes capable of recognizing more than one type of mark, indicating their participation in crosstalk between different parts of the machinery; we termed those dual readers, and found that some of them (n = 7) also have enzymatic activity. Moreover, we observed that the same reading function can be mediated by different domains within a single protein; among the 178 readers of histone methylation we found 23 proteins which contain 2 distinct reading domains. We also observe that 9 of the domains defining EM genes can be present in multiple copies within the same gene, with the exact multiplicity ranging from 2 to 8 (calculated for the EM genes in Supplemental Table 2.3).

Moreover, using a previously generated, high confidence list of 1,254 human transcription factors [21, 22], we found that 20 EM genes (12 of which are members of the PRDM family of histone methyltransferases), have a DNA binding domain found in transcription factors, suggesting their involvement in more than one aspect of transcriptional regulation (Supplemental Table 2.4).
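The domain-based categorization described in this section can be illustrated with a minimal sketch. The domain-to-role mapping, the domain labels, and the gene names below are hypothetical and chosen only for illustration; they are not the curated list of Supplemental Table 2.1, and the sketch is not the pipeline used in this work.

```python
# Hypothetical mapping from protein domain to EM role. The labels and
# assignments are illustrative only, not the list in Supplemental Table 2.1.
DOMAIN_ROLE = {
    "SET": "writer",            # histone methyltransferase domain
    "JmjC": "eraser",           # histone demethylase domain
    "SNF2_ATPase": "remodeler", # placeholder label for a remodeler ATPase domain
    "Bromodomain": "reader",    # reads histone acetylation
    "Chromodomain": "reader",   # reads histone methylation
}
ENZYMATIC_ROLES = {"writer", "eraser", "remodeler"}


def em_roles(domains):
    """Return the set of EM roles implied by a gene's annotated domains."""
    return {DOMAIN_ROLE[d] for d in domains if d in DOMAIN_ROLE}


def is_dual_function(domains):
    """True if the gene carries both a reader domain and an enzymatic domain."""
    roles = em_roles(domains)
    return "reader" in roles and bool(roles & ENZYMATIC_ROLES)


# Toy domain annotations with made-up gene names.
genes = {
    "GENE_A": ["SET", "Chromodomain"],          # writer + reader
    "GENE_B": ["Bromodomain"],                  # reader only
    "GENE_C": ["SNF2_ATPase", "Chromodomain"],  # remodeler + reader
    "GENE_D": ["Kinase"],                       # no EM domain, so not an EM gene
}
em_genes = {g for g, doms in genes.items() if em_roles(doms)}
dual_function = {g for g in em_genes if is_dual_function(genes[g])}

# Consistency check on the counts reported in the text: each dual function
# gene is counted once as a reader and once in a single enzymatic category.
total_em_genes = 211 + 62 + 55 + 18 - 51
```

Because the enzymatic categories are pairwise mutually exclusive, subtracting the 51 dual function genes from the sum of the four group sizes gives 295, which agrees with the total number of EM genes reported above.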

2.3.2 The human epigenetic machinery is highly intolerant to variation and contains many additional disease candidates

To identify novel EM disease candidates, we systematically investigated the tolerance of the entire EM group to loss-of-function variation. To achieve this, we used the ExAC database coupled with the pLI score [23], a metric which ranges between 0 and 1 and measures the extent to which a given gene tolerates heterozygous loss-of-function variants. pLIs have been derived from the whole exome sequences of 60,706 humans, after comparing the number of observed loss-of-function variants within a given gene to the expected such number (see Supplemental Material of [23] for details). In particular, genes with a pLI of more than 0.9 have been described as highly dosage sensitive [23], with virtually all known haploinsufficient human genes belonging to this category [23]. A similar approach, focused only on the histone methylation machinery, was very recently used to derive candidate genes for developmental disorders [24]. In total, ExAC provides a pLI for 18,225 human genes, of which 281 are EM genes. First, we observed that EM genes have significantly higher pLI scores compared to all other genes (Wilcoxon rank sum test, $p < 2.2 \cdot 10^{-16}$, Figure 2.2A) and show substantial enrichment in the highly intolerant category (Fisher’s test, $p < 2.2 \cdot 10^{-16}$, odds ratio = 7.7). Note that there are many EM genes with a pLI score between 0.7 and 0.9, a range which is almost absent for other genes. Given that pLI is a measure of haploinsufficiency, genes encoded on the X and Y chromosomes were not considered in this comparison (see Supplemental Results for details on EM genes encoded on the sex chromosomes).
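To give some intuition for how a comparison of observed to expected loss-of-function counts can yield a score between 0 and 1, the following sketch computes a posterior probability under a three-class Poisson mixture. It is a simplified stand-in and not the procedure of [23]: the function name `pli_like`, the class-specific depletion ratios, and the equal class weights are all assumptions made for illustration.

```python
import math


def log_poisson(k, mu):
    """Log of the Poisson probability of observing k events with mean mu (mu > 0)."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def pli_like(observed, expected,
             ratios=(1.0, 0.463, 0.089),   # assumed observed/expected ratio per class
             weights=(1 / 3, 1 / 3, 1 / 3)):  # assumed (equal) class proportions
    """Posterior probability that a gene belongs to the most depleted class.

    The three classes stand for tolerant, partially depleted, and strongly
    depleted genes. Both `ratios` and `weights` are illustrative assumptions.
    """
    log_terms = [
        math.log(w) + log_poisson(observed, r * expected)
        for r, w in zip(ratios, weights)
    ]
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(log_terms)
    terms = [math.exp(t - top) for t in log_terms]
    return terms[-1] / sum(terms)


# A gene with far fewer LoF variants than expected scores close to 1,
# whereas a gene with as many as expected scores close to 0.
depleted_score = pli_like(observed=0, expected=30.0)
tolerant_score = pli_like(observed=30, expected=30.0)
```

Under these assumptions, a threshold such as 0.9 separates genes whose observed counts are much better explained by the strongly depleted class than by either of the other two.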

We next compared EM to TF genes; this is a natural comparison, since EM proteins are usually recruited to target sites by TFs [25], and TF genes have been shown by previous analyses to be mostly haploinsufficient [26, 27]. Using the 1,155 TF genes in ExAC, we first showed that they have significantly higher pLI compared to other genes (Wilcoxon rank sum test, $p < 2.2 \cdot 10^{-16}$, Figure 2.2A), although they are less dosage sensitive than previously suggested [26], illustrating the value of our comprehensive approach in yielding accurate estimates. Comparing TF to EM genes however, we observed that EM genes have higher pLIs (Wilcoxon rank sum test, $p < 2.2 \cdot 10^{-16}$, Figure 2.2A) and are more strongly enriched in the highly dosage sensitive category (Fisher’s test, $p < 2.2 \cdot 10^{-16}$, odds ratio = 4.4).
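The enrichment statistics quoted above come from a 2×2 table that cross-classifies genes by group (for example, EM versus other) and by whether their pLI exceeds 0.9. The sketch below shows how an odds ratio and a one-sided Fisher's exact p-value can be obtained from such a table. The counts are made up, the function names are hypothetical, and the choice of a one-sided test is an assumption; the cross-product odds ratio shown here may also differ slightly from the conditional estimate that some statistical software reports.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for enrichment in the top-left cell.

    With all margins fixed, the top-left count follows a hypergeometric
    distribution; the p-value is the probability of a count >= a.
    """
    row1 = a + b          # e.g. number of EM genes
    col1 = a + c          # e.g. number of genes with pLI > 0.9
    n = a + b + c + d     # all genes in the table
    largest = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, largest + 1)
    )
    return favourable / comb(n, col1)


# Made-up counts: rows are (EM, other), columns are (pLI > 0.9, pLI <= 0.9).
a, b, c, d = 8, 2, 5, 15
toy_odds_ratio = odds_ratio(a, b, c, d)
toy_p_value = fisher_exact_greater(a, b, c, d)
```

An odds ratio above 1 indicates that the first group is over-represented among the highly constrained genes, and the p-value quantifies how surprising that over-representation would be if group membership and constraint were unrelated.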

Figure 2.2: A large subset of epigenetic regulators are very intolerant to variation. (A) The pLI distributions of EM genes (red curve), TF genes (green curve), and all other genes (blue curve). (B) The pLI distributions of EM genes (red curve), genes encoding for accessory subunits of EM protein complexes (black curve), and TF genes (green curve). (C) The pLI distribution of disease associated EM genes vs. non-disease associated EM genes. (D, E, F, G) The pLI distributions (D, F) and percentage of genes with pLI > 0.9 (E, G) of individual classes of EM genes. The shaded grey area (A, B, C, D, F) indicates highly constrained genes (pLI > 0.9). The vertical dashed grey line (E, G) corresponds to the percentage of all other genes with pLI > 0.9.

Since it is known that many EM gene products function as parts of multi-subunit complexes [28–31], we reasoned that genes encoding for accessory subunits of these complexes (which are not categorized as EM genes by our definition) would also show a similar intolerance to variation. We thus assembled a list of 95 non-EM accessory subunit genes and 46 EM subunit genes, spanning a total of 19 complexes with chromatin modifying activities (Methods; a full list of the complexes and their subunits is given in Supplemental Table 2.5). As expected, the 46 EM subunit genes are very dosage sensitive; 80% of these genes have a pLI > 0.9. Considering the accessory subunits, their pLI distribution confirmed that they are more constrained than all other genes, as well as TF genes; however, they are slightly less constrained than all EM genes (Figure 2.2B). A more detailed investigation revealed that in general, each complex contains multiple accessory as well as EM subunits that are highly constrained (Supplemental Fig. 2.7). Specifically, across all 19 complexes, the median percentage of accessory and EM subunits with a pLI > 0.9 was 64% and 100%, respectively.
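The per-complex summary above (percentage of subunits with pLI > 0.9 within each complex, then the median across complexes) can be sketched as follows. The complex names, subunit names, and pLI values are hypothetical, and skipping subunits without a pLI score is an assumption rather than a documented choice.

```python
from statistics import median


def percent_high_pli(genes, pli_by_gene, threshold=0.9):
    """Percentage of genes with a known pLI above the threshold.

    Returns None when none of the genes has a pLI score.
    """
    scored = [g for g in genes if g in pli_by_gene]
    if not scored:
        return None
    return 100 * sum(pli_by_gene[g] > threshold for g in scored) / len(scored)


def median_percent_across_complexes(complexes, pli_by_gene):
    """Median, over complexes, of the per-complex percentage of high-pLI subunits."""
    percents = [percent_high_pli(genes, pli_by_gene) for genes in complexes.values()]
    return median(p for p in percents if p is not None)


# Hypothetical complexes and pLI scores, for illustration only.
toy_complexes = {
    "complex1": ["subA", "subB"],
    "complex2": ["subC", "subD", "subE", "subF"],
    "complex3": ["subG"],
}
toy_scores = {"subA": 0.95, "subB": 0.20, "subC": 0.99, "subD": 0.91,
              "subE": 0.92, "subF": 0.30, "subG": 0.99}

print(median_percent_across_complexes(toy_complexes, toy_scores))
```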

After splitting EM genes into those with existing disease associations and those with no reported link to disease (the latter constituting approximately 70%; Methods), we discovered that in both the disease and the non-disease associated groups there exist many EM genes with elevated pLI scores (Figure 2.2C), although the disease associated ones exhibit higher skewing. It is notable that EM genes which are only associated with cancer have high pLIs (median pLI = 0.98, percentage with pLI > 0.9 = 65%). There is no a priori reason to expect this for somatic cancer driver genes, as pLI scores were derived after only excluding individuals with severe pediatric disease [23]. Overall, this result strongly suggests the existence of additional EM disease genes. Among 162 EM genes in ExAC with no reported link to disease, 78 have a pLI greater than 0.9 (percentage with pLI > 0.9 = 78/162 = 48%). Additionally, there are 24 EM genes that have only been associated with cancer but whose pLI is greater than 0.9, suggesting that they also cause some other disease phenotype. In total, this leads to 102 novel EM disease candidates (Supplemental Table 2.6). Finally, the same approach for the EM accessory subunits (Methods) highlighted 39 new disease candidate genes whose phenotypic consequences are likely to arise through similar mechanisms as in the case of EM genes (Supplemental Table 2.7).

2.3.3 Dual function epigenetic regulators and remodelers are the most variation-intolerant categories

We next explored the loss-of-function variation tolerance of the different types of machinery components. Chromatin remodelers are an extremely intolerant group, whereas both the writers and the erasers are equally distributed among the high and low pLI groups (Figure 2.2D,E). Collectively, the three enzymatic EM classes show high mutational constraint (Fisher’s test, p = 9.82 · 10^-9, odds ratio = 4.1 for enrichment of EM genes with enzymatic but not reading function in the pLI > 0.9 category). Readers are more skewed towards the high pLI category than writers and erasers (Figure 2.2D,E). Despite the differences between these single-function classes, dual function EM genes are extremely constrained, regardless of the specific enzymatic function; this underscores the importance of this unique category (Figure 2.2D,E). The 20 EM genes which also contain TF DNA-binding domains show a pLI distribution which mirrors that of the whole EM group (Figure 2.2D,E). Although histone methyltransferases (HMTs) appear less constrained than histone acetyltransferases (HATs) (Figure 2.2F,G), all but one of the HMTs also have a reader domain, and there is no significant difference in variation tolerance between dual-function HMTs and dual-function HATs (Wilcoxon rank sum test, p = 0.61). Two out of the three DNA methyltransferases (DNMT1 and DNMT3B) are constrained (Figure 2.2F,G), as is the case for DNA demethylases (with TET2 being the only tolerant member, Figure 2.2F,G), whereas histone demethylases and deacetylases are approximately evenly divided in the high and low pLI categories (Figure 2.2F,G). This analysis also highlighted dual readers as a very intolerant group (Figure 2.2F,G). Additionally, we found that all genes encoding for CxxC domain proteins, which recognize unmethylated CpG dinucleotides [32], show very high dosage sensitivity (median pLI = 0.97), with three of them being dual readers.

Finally, we investigated the impact of the exact amino-acid substrate specificities of EM genes on their variation tolerance (Methods, Supplemental Table 2.8). Specifically, we examined writers/erasers of H3K4, H3K27, H3K36, and H3K9 methylation, as well as H3K27 acetylation writers and H3K9 acetylation erasers. More than 50% of genes in each category had a pLI > 0.9 (Supplemental Fig. 2.8); this highlights that dosage sensitivity does not depend on the exact nucleosomal target position. It also reveals that multiple histone modifiers with seemingly redundant biochemical activities can be highly constrained, as also observed for DNA methylation writers/erasers (Figure 2.2F).

2.3.4 The intolerance to variation is primarily driven by the domains mediating the epigenetic function

A recent study showed that Drosophila embryos with a catalytically inactive version of trr (a homolog of the mammalian histone methyltransferases KMT2C and KMT2D) develop normally, despite altered histone methylation patterns [18]. This example shows that in some cases the inactivation of an epigenetic domain (in this case, the SET domain) might not have severe, easily detectable consequences. Our previous analysis is unable to determine if the observed variation intolerance is driven by the presence of epigenetic, or other non-epigenetic domains. We therefore asked if the EM-specific domain(s) in an EM gene had a different local mutational constraint than other domains in the same gene. To answer this, we used the constrained coding region (CCR) model [33] to examine the mutational constraint of EM genes at the domain level. Specifically, we classified a given domain as constrained or not; this classification reflects how devoid a domain is of missense or loss of function mutations in the gnomAD database [23], compared to other similar regions (Methods). We were able to study 237 out of 295 EM genes.

Figure 2.3: The protein domains known to mediate epigenetic functions drive the observed constraint of EM genes. (A) The number of constrained and not constrained EM-specific protein domains of high pLI (> 0.9) EM genes, versus low pLI (< 0.1) EM genes. (B) The within-gene differences in the total number of EM-specific constrained domains versus other constrained domains. Each dot corresponds to a gene. Red dots indicate genes with more EM-specific constrained domains; blue dots indicate genes with more other constrained domains; black dots indicate genes with an equal number of constrained EM-specific and other domains. (C) The percentage of high pLI EM genes with at least one constrained EM-specific domain, versus the corresponding percentage with at least one constrained other domain.

Under the hypothesis that EM-specific domains are not contributing to the observed variation intolerance of EM genes, there should be no difference in the constraint of EM-specific domains found in high pLI versus low pLI EM genes. In contrast to this, we found that, collectively, the EM-specific domains of high pLI genes (pLI greater than 0.9) are much more likely to be constrained than those of low pLI genes (pLI less than 0.1) (82% versus 19%; Figure 2.3A). To explore this further, we restricted our analysis to high pLI genes, and compared the contribution of EM-specific domains to that of other domains. First, at the individual gene level, we found that for most EM genes (65%), the number of EM-specific constrained domains exceeds that of other constrained domains (Figure 2.3B). Furthermore, almost all high pLI EM genes (92%) have at least one constrained EM-specific domain, whereas approximately half (47%) have no other constrained domains (Figure 2.3C). In fact, there are 54 high pLI EM genes which do not contain other domains. Notable exceptions in this analysis are 4 high pLI members of the PRDM family, where the C2H2-like Zinc Fingers are the main drivers of variation-intolerance (Supplemental Fig. 2.9).

We note that our approach is overall conservative, as there are domains that do not have catalytic or reading activity (and are thus labeled as non EM-specific), but are nevertheless found only in EM genes (see Methods). Returning to our initial example, in KMT2D we see that the catalytic SET domain is constrained, as is its associated post-SET domain, and 5 out of the 7 PHD-fingers.
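The within-gene comparison of EM-specific and other domains can be sketched as a simple tally. The input format (one tuple per domain, with a precomputed constrained/not constrained label) and the gene names are assumptions for illustration; how a domain is called constrained from the CCR model is described in Methods and is not reproduced here.

```python
from collections import defaultdict


def summarize_domain_constraint(domains):
    """Summarize constrained domains per gene.

    domains: iterable of (gene, is_em_specific, is_constrained) tuples,
    one tuple per annotated protein domain.
    """
    per_gene = defaultdict(lambda: [0, 0])  # [EM-specific, other] constrained counts
    for gene, is_em_specific, is_constrained in domains:
        counts = per_gene[gene]  # registers the gene even if nothing is constrained
        if is_constrained:
            counts[0 if is_em_specific else 1] += 1
    n = len(per_gene)
    return {
        "pct_more_em": 100 * sum(em > other for em, other in per_gene.values()) / n,
        "pct_any_em": 100 * sum(em >= 1 for em, _ in per_gene.values()) / n,
        "pct_no_other": 100 * sum(other == 0 for _, other in per_gene.values()) / n,
    }


# Hypothetical domain annotations for four toy genes.
toy_domains = [
    ("geneA", True, True), ("geneA", True, True), ("geneA", False, True),
    ("geneB", True, True), ("geneB", False, False),
    ("geneC", True, False), ("geneC", False, True), ("geneC", False, True),
    ("geneD", True, True), ("geneD", False, True),
]

print(summarize_domain_constraint(toy_domains))
```

The three returned percentages correspond, respectively, to the quantities shown in Figure 2.3B (genes with more EM-specific constrained domains) and Figure 2.3C (genes with at least one constrained EM-specific domain, and genes with no other constrained domain).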

Finally, we repeated this analysis using a quantitative version of domain-specific constraint, where each domain is assigned a score from 0 to 100 (with greater values indicating more constrained domains; Methods). The results we obtained regarding the relative constraint of EM-specific versus other domains recapitulated those described above (Supplemental Fig. 2.10A,B). Additionally, this revealed that multiple identical copies of a domain within a single gene can differ with respect to their constraint (Supplemental Figs. 2.10C, 2.11). This could reflect different contributions of these identical copies to gene function, although it is possible that this variability in domain constraint is a consequence of inadequate sampling of variation (since it has been estimated that even with 500,000 individuals, less than 10% of protein-coding variation will be captured [33, 34]).

2.3.5 A large subset of the epigenetic machinery is co-expressed

To identify functional properties specific to variation-intolerant EM genes, we systematically explored the expression patterns of the whole group across a spectrum of adult tissues, using publicly available RNA-seq data [35]. We selected 28 tissues on the basis of sample size and diversity in physiological function. First, we discovered that virtually all EM genes are expressed in a non-tissue specific manner, similarly to what is observed for known housekeeping genes (Methods, Supplemental Fig. 2.12A,B), with the exception of a small number that showed testis-specific expression (Supplemental Fig. 2.12C). Hence, tissue specificity cannot account for the differences in variation tolerance within the EM group; we also found that it cannot explain the high mutational constraint of EM genes vs TF genes and other genes, after restricting the pLI comparison to very broadly expressed genes from both groups (Methods, Supplemental Fig. 2.13). Additional details on the tissue specificity of EM genes can be found in Supplemental Results.

However, even though EM genes show ubiquitous expression, within any given tissue there is inter-individual variability in their expression levels (Supplemental Fig. 2.14). We noticed that in several cases, EM genes show coordinated fluctuations in their expression levels across individuals (Supplemental Fig. 2.14A). We reasoned that this might contribute to the precise epigenetic regulation of the transcriptional programs operating within each cell. Therefore, we hypothesized that EM genes whose expression patterns display this coordinated behavior (co-expression) would differ in their mutational constraint from those that do not. To test this, we constructed tissue-specific co-expression networks and determined modules of co-expressed genes using WGCNA [36] (Methods).

We noticed that for all tissues, EM genes were grouped in a few large modules (median 2 modules across tissues, range 0-4), with a substantial number of genes not belonging to any module (singletons; median 106 singletons across tissues, range 9-270). We asked if the division of EM genes into genes belonging to large modules and genes being singletons was stable across tissues. Because these modules were estimated separately for each tissue, it is not obvious how to compare them across tissues, and modules are affected by noise resulting from differences in sample size and other sources. To perform the comparison, we defined two genes to be module partners if they belonged to the same module in at least 10 tissues, stable module partners if they belonged to the same module in at least 14 tissues, and not module partners if they belonged to the same module in fewer than 10 tissues (light blue, orange, and dark gray squares respectively in cartoon Figure 2.4A). For each gene we computed the number of module partners, and then ordered EM genes according to this score (Figure 2.4B). We next collectively visualized the pairwise partnership statuses among EM genes in a symmetric matrix, keeping this ordering for both rows and columns (Figure 2.4C, blue). We observed a distinct clustering, with a set of genes which are predominantly stable module partners with each other (Figure 2.4C, orange), a large set of genes with no module partners (Figure 2.4C, dark gray), and a transition between these two groups (Figure 2.4C). We noted that the transition occurs as the number of module partners increases, meaning that EM genes not only have more partners, but they are also stable partners with the majority of them.
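The module partner definition above can be made concrete with a short sketch. It assumes that module assignments per tissue are already available (for example, exported from WGCNA) as a mapping from gene to module label; the data structure, function names, and toy labels are hypothetical. The thresholds follow the definition in the text (at least 10 tissues for partners, at least 14 for stable partners) and are exposed as parameters so the toy example can use smaller values.

```python
from collections import Counter
from itertools import combinations


def count_shared_modules(module_by_tissue):
    """Count, for every gene pair, the tissues in which both genes share a module.

    module_by_tissue maps tissue -> {gene: module label}. Genes absent from a
    tissue's mapping, or mapped to None, are singletons in that tissue.
    """
    shared = Counter()
    for assignment in module_by_tissue.values():
        members = {}
        for gene, module in assignment.items():
            if module is not None:
                members.setdefault(module, []).append(gene)
        for genes in members.values():
            for pair in combinations(sorted(genes), 2):
                shared[pair] += 1
    return shared


def partnership(n_tissues, partner_min=10, stable_min=14):
    """Classify a gene pair by the number of tissues in which it shares a module."""
    if n_tissues >= stable_min:
        return "stable partners"
    if n_tissues >= partner_min:
        return "partners"
    return "not partners"


def partner_counts(shared, genes, partner_min=10):
    """Number of module partners (stable or not) for each gene in `genes`."""
    counts = dict.fromkeys(genes, 0)
    for (g1, g2), n in shared.items():
        if n >= partner_min:
            counts[g1] += 1
            counts[g2] += 1
    return counts


# Toy example with 3 tissues and scaled-down thresholds (2 and 3 tissues).
toy_modules = {
    "tissue1": {"A": 1, "B": 1, "C": 1, "D": None},
    "tissue2": {"A": 1, "B": 1, "C": 2, "D": 2},
    "tissue3": {"A": "x", "B": "x", "C": "x"},
}
toy_shared = count_shared_modules(toy_modules)
toy_counts = partner_counts(toy_shared, ["A", "B", "C", "D"], partner_min=2)
print(toy_shared, toy_counts)
```

Ordering genes by the resulting partner counts gives the ordering used for the rows and columns of the partnership matrix.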

Figure 2.4: A large subset of the components of the epigenetic machinery exhibit unusually high levels of co-expression. (A) Schematic illustrating our definition and identification of module partners. WGCNA was used to construct tissue-specific co-expression networks and modules for 28 tissues profiled in GTEx. We determined if two EM genes were module partners (part of the same module in 10-14 tissues) or stable module partners (part of the same module in > 14 tissues). (B, C) The number of module partners for each EM gene and the module partner matrix, where rows and columns are ordered as in (B). We define 3 groups of EM genes, “highly co-expressed”, “co-expressed” and “not co-expressed” based on their number of module partners. (D) The pLI for each EM gene, ordered by its number of module partners as in (B). (E) The size of the (highly) co-expressed group of EM genes compared to 300 draws of 270 random genes, where the random genes are selected to have a similar expression level across tissues compared to EM genes (Supplemental Fig. 2.15).

We then divided EM genes into 3 groups: (1) a group of 74 genes with at least 75 module partners; we call this group of EM genes “highly co-expressed”, (2) a group of 83 genes with between 15 and 74 module partners; we call this group “co-expressed” and (3) a group of 113 genes with fewer than 15 module partners; we call this group “not co-expressed”. To assess the statistical significance of the size of these groups, we compared our results to those obtained with randomly chosen genes, and found that the groups of highly co-expressed as well as co-expressed EM genes are much larger than expected by chance (Figure 2.4E, Supplemental Fig. 2.16A,B). We also established that our results are robust to the choice of cutoffs, the presence of sample outliers, and the exact network reconstruction method employed (Methods, Supplemental Fig. 2.16C, Supplemental Fig. 2.17). Finally, we note that our across-tissue co-expression analysis provides confidence that our findings are not driven by the cell-type heterogeneity present in these tissue samples.
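The grouping by number of module partners, and the comparison against random gene sets, can be sketched as follows. The cutoffs (75 and 15) are those given in the text. Everything else is an assumption for illustration: `partner_counts_for` is a hypothetical callback that would rebuild partner counts for a drawn gene set, the expression matching of random genes is omitted, and the add-one empirical p-value is a common convention rather than a procedure stated in this chapter.

```python
import random


def coexpression_group(n_partners, high_min=75, co_min=15):
    """Assign a gene to a co-expression group from its number of module partners."""
    if n_partners >= high_min:
        return "highly co-expressed"
    if n_partners >= co_min:
        return "co-expressed"
    return "not co-expressed"


def null_group_sizes(background_genes, partner_counts_for, set_size, n_draws,
                     high_min=75, seed=0):
    """Size of the 'highly co-expressed' group in repeated random gene draws.

    partner_counts_for(draw) is a hypothetical callback returning a dict
    {gene: number of module partners within the draw}.
    """
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_draws):
        draw = rng.sample(background_genes, set_size)
        counts = partner_counts_for(draw)
        sizes.append(sum(n >= high_min for n in counts.values()))
    return sizes


def empirical_p_value(observed_size, null_sizes):
    """Fraction of null draws at least as large as the observed size (add-one)."""
    exceed = sum(size >= observed_size for size in null_sizes)
    return (exceed + 1) / (len(null_sizes) + 1)


# Toy usage: a stub callback under which random genes never have partners.
background = [f"gene{i}" for i in range(1000)]
toy_null = null_group_sizes(background, lambda draw: dict.fromkeys(draw, 0),
                            set_size=270, n_draws=300)
print(coexpression_group(80), empirical_p_value(74, toy_null))
```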

2.3.6 Dual function epigenetic regulators are enriched in the highly co-expressed group and are co-expressed with multiple other categories

Figure 2.5: Dual function EM genes are enriched within the highly co-expressed group. (A) The distribution of dual function EM genes (collectively and separately for each enzymatic group) within the three co-expression categories. (B) Log odds ratios and 95% confidence intervals for enrichment of dual function EM genes (collectively and separately for each enzymatic group) in the highly co-expressed category. The dashed vertical line at 0 marks an odds ratio of 1 (no enrichment); confidence intervals that exclude it indicate statistical significance. (C) Blue dots correspond to randomly chosen genes, sampled in sets of 270 genes from genes with a median expression (log(RPKM + 1)) greater than 0.5 in at least half the tissues, to match the expression of EM genes (as in Figure 2.4D). Orange, green and pink dots correspond to EM genes, TF genes, and protein kinases/phosphatases, respectively. Each dot corresponds to a single gene, and its position along the y axis corresponds to the number of other genes that it partners with. The genes are ordered on the x axis according to the number of their partners. This figure also serves as a sensitivity analysis with respect to the number of partners for this particular tissue cutoff.

To better understand the co-expression phenomenon, we examined whether some EM categories are overrepresented within the highly co-expressed group. We first observed an enrichment for dual function EM genes (Figure 2.5A, B). This was driven by the enrichment of dual function writers, as well as dual function erasers (Figure 2.5A, B). While we found 6 highly co-expressed dual function remodelers, there was no statistically significant overrepresentation compared to the co-expressed and not co-expressed groups (Figure 2.5A, B). We then performed a breakdown of the partners of highly co-expressed dual function histone methyltransferases and acetyltransferases. We observed that both of these EM groups partner with their corresponding readers and erasers, as well as with remodelers (Supplemental Fig. 2.18). In addition, the two groups partner with each other, and with the DNA methylation machinery (Supplemental Fig. 2.18). This is partly expected given the large number of partners of highly co-expressed genes (at least 75 by definition), compared to the size of the individual EM categories.

We next wanted to investigate if the observed co-expression is related to the involvement of EM genes in transcriptional regulation, their organization into writers/erasers/readers, or both. To test the first possibility, we used TF genes as a reference group, whereas to test the second possibility we used genes encoding for protein phosphorylation writers/erasers (i.e. kinases/phosphatases) [37,38]. We discovered that neither of these two classes of genes is co-expressed (Figure 2.5C, Supplemental Fig. 2.19), suggesting that the co-expression is a unique property of EM genes likely reflecting both their role in transcription and their modular composition.

2.3.7 The highly co-expressed epigenetic regulators are extremely intolerant to variation and enriched for genes causing neurological dysfunction

If this co-expression of EM genes is functionally important, we would anticipate a relationship with their mutational constraint. Indeed, examination of the pLI scores of the three co-expression groups revealed a very clear association (Figure 2.4E), with almost all highly co-expressed genes being extremely intolerant to variation (percentage of genes with pLI > 0.9 is > 90%), co-expressed genes exhibiting intermediate intolerance, and the not co-expressed group being the least intolerant (Figure 2.6A).

Figure 2.6: EM genes linked to disorders with neurological dysfunction demonstrate significant enrichment within the highly co-expressed category. (A) The percentage of EM genes with pLI > 0.9 in each of the co-expression categories. (B) The percentage of EM genes that are associated with different types of disease; individual disease categories are mutually exclusive. MDEM: Mendelian disorders of the epigenetic machinery, Neuro: includes autism, schizophrenia, developmental disorders, and MDEM whose phenotype includes dysfunction of the central nervous system (Methods). (C) Log odds ratios and 95% confidence intervals for enrichment of different subsets of EM genes in the highly co-expressed category. The dashed vertical line at 0 marks an odds ratio of 1 (no enrichment); confidence intervals that exclude it indicate statistical significance. (D) The percentage of EM genes that are associated with neurological dysfunction and have pLI > 0.9 in each of the co-expression categories. (E) Odds ratio (black line) and 95% confidence interval (shaded area) for enrichment of EM genes associated with neurological dysfunction in the highly co-expressed group, as a function of the size of the highly co-expressed group. For all sizes, the comparison was performed against the not co-expressed group. (F) Estimates for enrichment of explained heritability, and adjusted p-values, for 8 traits and 2 sets of regulatory features: regions marked by H3K27ac in brain within 1 Mb of the transcription start site of all-EM (red dots) or highly co-expressed (orange dots) EM genes.

We next asked whether, in addition to being very constrained, co-expressed EM genes are also preferentially associated with specific disease phenotypes. To perform this analysis, we first used our full list of EM genes to obtain a comprehensive picture of their links to disease. We examined associations with Mendelian disorders, cancer, and complex disorders (Methods). Consistent with previous observations [6], we found that neurological dysfunction is a very prevalent phenotype within those diseases. Specifically, a total of 50 out of the 101 disease-associated EM genes have been previously described to lead to neurological dysfunction (Figure 2.6B). Our analysis also yielded 64 EM genes associated with cancer (Figure 2.6B). We highlight a substantial overlap between those two groups: 24 of the cancer associated EM genes are also associated with neurological dysfunction (Figure 2.6A), with dual function EM genes showing extremely high enrichment in this category (Fisher’s test, p = 1.84 · 10^-8, odds ratio = 13.3). We did not find any EM gene both associated with cancer and also causing a Mendelian disease, without neurological dysfunction being part of the disease phenotype.

EM genes associated with any one of those disease phenotypes were enriched within the highly co-expressed group (Figure 2.6C). We thus sought to examine whether this enrichment was driven by associations with particular disease categories. We found a marked enrichment for genes causing neurological dysfunction (Figure 2.6C); genes implicated in cancer were also enriched, although less strongly (Figure 2.6C). We next tried to disentangle the contributions stemming from the associations with cancer versus neurological dysfunction. To accomplish this, we partitioned the EM genes into genes only associated with neurological dysfunction, genes only associated with cancer, and genes associated with both. Genes only associated with neurological dysfunction were still enriched, while we did not observe significant enrichment for genes only associated with cancer (Figure 2.6C). Subsequently, we asked if this result was a consequence of the association between co-expression and pLI, and restricted the analysis to EM genes with high pLI (> 0.9). We found that those associated with neurological dysfunction were still significantly enriched in the highly co-expressed category (Figure 2.6C,D). As expected given the overrepresentation of dual function genes (which are themselves enriched in the highly co-expressed subset) in the set of genes associated with both neurological dysfunction and cancer, the latter were particularly enriched in the highly co-expressed group (Figure 2.6C). We then examined the impact of the definition of the highly co-expressed group on the strength of the enrichment by varying the co-expression cutoff, while using the constant set of non-co-expressed EM genes as a control. As expected, this showed that the enrichment increased as the stringency of our cutoff increased (Figure 2.6E).
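Enrichment estimates of the kind reported above (log odds ratios with 95% confidence intervals) can be approximated with a short sketch. It uses the Wald (Woolf) interval on the log odds ratio, which is a swapped-in textbook approximation: the text does not state how its intervals were computed, and exact intervals from Fisher's test would differ somewhat. The counts in the example are hypothetical.

```python
from math import log, sqrt


def log2_odds_ratio_ci(a, b, c, d, z=1.96):
    """log2 odds ratio and Wald (Woolf) confidence interval for [[a, b], [c, d]].

    a, b: genes with / without the annotation in the highly co-expressed group.
    c, d: genes with / without the annotation in the comparison group.
    Adds 0.5 to every cell (Haldane-Anscombe correction) if any cell is zero.
    Returns (estimate, lower bound, upper bound), all on the log2 scale.
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = log(a * d / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    to_log2 = 1 / log(2)
    return tuple(v * to_log2 for v in (log_or, log_or - z * se, log_or + z * se))


# Hypothetical counts: 20 of 30 genes annotated in one group, 10 of 30 in the other.
estimate, lower, upper = log2_odds_ratio_ci(20, 10, 10, 20)
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```

An interval whose lower bound lies above 0 on the log2 scale corresponds to a significant enrichment at the chosen level, which is how the dashed line at 0 in Figure 2.6C is read.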

2.3.8 Brain-specific regulatory elements of highly co-expressed epigenetic regulators are enriched for SNPs that explain the heritability of common neurological traits

The above results establish that rare, coding variants in highly co-expressed EM genes preferentially cause neurological dysfunction. We next asked whether common variation in regulatory regions surrounding these genes, as well as all EM genes collectively, contributes to neurological disease risk. To achieve this, we used a set of brain enhancers (defined by the presence of H3K27ac in one or more of 87 distinct brain regions [39]; see also Methods), and labeled every such enhancer within 1 Mb of the TSS of EM genes as an EM regulatory region. We then performed stratified LD score regression [40] to assess whether these EM regulatory regions show enrichment for explained heritability in 24 neurological diseases/traits (Methods, Supplemental Fig. 2.25; Supplemental Table 2.11), compared to what is expected given the size of the regions and their overlap with regions of known genetic importance.

For 7 out of the 24 neurological traits, either the highly co-expressed, or the all-EM regulatory regions showed significantly enriched heritability at p = 0.05 (corrected for multiple testing; Figure 2.6F). Two of the 7 traits (neuroticism and bipolar disorder) were only significant for the set of highly co-expressed regulatory regions, and two other traits (openness and depressive symptoms) were only significant for the all-EM regulatory regions. The remaining 3 traits (schizophrenia, general epilepsy, IQ) were significant for both sets of regulatory regions. However, the enrichment of heritability for the highly co-expressed regulatory regions either exceeded or was on par with the enrichment for the all-EM regulatory regions, despite the fact that the former are considerably smaller (15 Mb vs 46 Mb). As a negative control, we also examined 5 non-neurological traits (Supplemental Fig. 2.25; Supplemental Table 2.11). For 4 of them, we observed that neither set of regions showed heritability enrichment, as expected. The only exception was BMI, in concordance with recent results implicating brain regulatory elements in BMI heritability [40].

2.3.9 The promoters of highly co-expressed genes of the epigenetic machinery are bound by common trans-acting factors

To gain insights into the mechanistic basis of the observed co-expression, we investigated: (1) whether these genes are co-localized in the genome, and (2) whether there is evidence that they are regulated by common trans-acting factors. It has been observed that highly expressed genes tend to reside in chromosomal clusters in the human genome [41], and clustered genes are often co-expressed [42]. However, we did not observe any evidence of spatial clustering of EM genes (Supplemental Results, Supplemental Fig. 2.20).

To test whether shared regulation potentially contributes to co-expression, we asked if the promoters of highly co-expressed EM genes are bound by common trans-acting factors. To answer this, we used ENCODE ChIP-seq data from K562 cells [43]. We chose this cell line because it has, by far, the most extensive collection of ChIP-seq data on such factors, and because our co-expression analysis suggests that the co-expression is tissue-independent.

We tested each of the 330 factors available from ENCODE for enriched binding at the promoters of the highly co-expressed EM genes relative to those of the non co-expressed EM genes. Even though these factors are a relatively small subset of the 1254 TFs encoded in the human genome [21, 22], we found that 53 factors exhibit at least 2-fold enrichment, in contrast to what is observed for randomly chosen genes, or after permuting the labels of EM genes (Supplemental Fig. 2.21, p = 0.02 and p = 5 × 10⁻⁴, respectively; Methods). We note that the direction of effect is consistent with our hypothesis: there is only one factor enriched in the non co-expressed group compared to the highly co-expressed group.

2.4 Discussion

We have performed a systematic investigation of all human genes encoding for epigenetic writers/erasers/remodelers/readers (EM genes). This enables us to make three basic contributions. First, we identify 102 novel disease candidates within this class of genes. Second, we provide strong evidence that genetic disruption of the epigenetic domains of these genes is the most likely cause of the disease phenotypes. This strongly suggests that these phenotypes arise through an impaired epigenomic state. Third, we discover that co-expression distinguishes a large subset of EM genes that are both extremely variation-intolerant and, independently, enriched for genes causing neurological dysfunction.


We note that our pLI-based approach for disease gene identification, while unbiased, cannot discriminate between genes that cause severe pediatric disease and genes that lead to lethality at the embryonic stage. Additionally, while our local mutational constraint analysis argues that the enzymatic or reading functions are the primary drivers of this intolerance to variation, some EM proteins may participate in biochemical events that affect other, non-chromatin related cellular functions [44, 45]. The importance of non-histone protein methylation and acetylation for signal transduction pathways and other molecular activities is not well understood, although there are examples of established functional relevance. These include cases of Cornelia de Lange syndrome caused by defective deacetylation of SMC3, a subunit of the cohesin complex [46], as well as the regulation of the tumor-suppressor functions of p53 through acetylation mediated by CBP/p300 [47, 48]. Further elucidation of such mechanisms will undoubtedly yield more insights into this issue.

Our most unexpected finding is that, among these 295 EM genes, we detected a subset of 74 that are highly co-expressed within tissues, as well as 83 others with an intermediate level of co-expression. The sizes of the two groups and the exact cutoff separating them might be refined with future interrogation of more tissues/cell types, and increases in sample size, but we anticipate the rank ordering of EM genes with respect to their partners to remain accurate. This co-expression appears to unite three seemingly independent properties of the machinery: variation-intolerance, association with neurological dysfunction (even after conditioning on haploinsufficiency), and dual function (enzymatic activity combined with reading function). From a functional standpoint, the clear relationship between co-expression and mutational constraint indicates that the former potentially plays a role in homeostasis and disease predilection. It also suggests a basis for the observed dosage sensitivity, a counterintuitive result given that many EM genes are enzymes, and enzymes are usually haplosufficient [26]. For co-expressed enzymes, however, a reduction of the normal amount of protein product present would not be tolerated, since it would compromise the coordinated expression of the module. Given the strong signal for enrichment of genes causing neurological dysfunction, it becomes tempting to speculate that the co-expression might be especially relevant to brain development and function; future examination of EM gene expression during fetal and early childhood development will likely yield profound insights into this. It will also be important to develop methods for experimental perturbations of this co-expression, to help define the precise cellular consequences of its disruption. Finally, prioritization of highly co-expressed EM genes might not only aid in the discovery of new pathogenic variants disrupting the epigenetic machinery, but also provide a starting point for the interpretation of the functional consequences of those variants, particularly in the context of neurological dysfunction.


Most of our work has focused on the disease causing potential of rare coding variation in EM genes, but we have also established that brain-specific regulatory regions surrounding them show enriched heritability signal for multiple common neurological traits. It is noteworthy that these traits include a measure of intellect (IQ) and a seizure phenotype (generalized epilepsy), given that intellectual disability and seizures are among the most common neurological manifestations of the Mendelian disorders of the epigenetic machinery [6]. For another epileptic phenotype, focal epilepsy, we did not find heritability enrichment for either set of EM genes. However, this GWAS included mostly adult individuals [49], and thus probably contains a different genetic signal. We also hypothesize that the other non-significant traits are those without a neurodevelopmental origin.

With respect to the underlying mechanistic basis of the co-expression, one way that such co-regulation could be achieved is with shared upstream regulators. Our data on trans-acting factor binding at the promoters of EM genes support this possibility. However, a definitive answer to this will only be provided after further delineation of human regulatory circuits, with mapping of enhancer-promoter interactions in different cell-types. Currently available data also argue against the formation of multi-subunit complexes between the co-expressed EM gene products (Supplemental Results). Hence, it is possible that the need for co-expression arises not to regulate protein-protein interactions, but because imbalance of the epigenetic system could over time lead to major changes in open versus closed chromatin [2].

In summary, our data provide the first evidence of widespread co-expression of epigenetic regulators, and link this novel phenomenon to both variation intolerance and neurological dysfunction, thus opening an avenue to better understand the role of the human epigenetic machinery in health and disease.

2.5 Methods

2.5.1 The creation of an epigenetic regulator list

We used InterPro domain annotations as provided by the UniProt database [19, 20], accessed in June 2016, to generate a list of proteins with at least one domain that classifies them as writers or erasers of histone lysine methylation [50, 51], writers or erasers of histone lysine acetylation [52, 53], readers of the two aforementioned histone modifications [54], and readers of methylated and unmethylated CpG dinucleotides [32, 55]. A full list of all the domains used along with the corresponding InterPro IDs is provided in Supplemental Table 2.1. Additionally, we included the known human DNA methyltransferases and demethylases [56], as well as the catalytic subunits of the known human chromatin remodeling complexes [29]. We also categorized all proteins belonging to the CHD family as chromatin remodelers [57], and we included the two histone lysine demethylases that do not harbor the JmjC domain, KDM1A and KDM1B [51]. After manual curation, we excluded the proteins COIL, MSL3P1, ASH2L, PHF24, and VPRBP. We did not include the atypical histone lysine methyltransferase DOT1L, and we did not classify transcription factors whose recognition motifs include methylated CpG dinucleotides as DNA methylation readers. Proteins containing Ankyrin repeats were classified as histone methylation readers, provided they were first included as members of the epigenetic machinery based on the domains in Supplemental Table 2.1. We also note the case of the PHD finger domain: it was generally classified as a histone methylation reader domain, with the exception of 5 proteins (DPF1, DPF2, DPF3, KAT6A, and KAT6B) that have a double PHD finger which, based on experimental evidence [58, 59], acts as a histone acetylation recognition module. We only included UniProt entries that have been manually annotated and reviewed by the database curators. The full list of all EM genes we used for our analyses, along with several of their features, is given in Supplemental Table 2.2.

We note that while our categorization of EM-specific domains only includes domains which have some catalytic or reading function, there are domains which we did not label as EM-specific, but which are exclusively or almost exclusively found in EM genes. Two such examples are the pre-SET domain (present in 7 proteins, all of which are HMTs) and the post-SET domain (present in 16 proteins, 15 of which are HMTs; the remaining protein has 8 domains, all of which are post-SET domains).

With respect to the investigation of EM protein complexes, we performed a manual literature curation, and subsequently assembled a catalog of the EM and accessory subunits of 19 complexes with chromatin modifying activities as described in [28–31] (Supplemental Table 2.5).

Finally, regarding the grouping of EM histone modifiers according to their amino acid substrate specificities, we performed a manual curation of the literature to identify cases where these specificities have been well defined. We subsequently classified EM genes as H3K4, H3K27, H3K36, H3K9 methylation writers/erasers [60, 61], H3K27 acetylation writers [62], and H3K9 acetylation erasers [53] (Supplemental Table 2.8).

2.5.2 Epigenetic regulators with disease associations

With respect to Mendelian disease associations, we included disorders with a phenotype mapping key equal to 3 (indicating sufficient evidence to ascribe causality for a particular gene) in OMIM (https://omim.org/), as accessed in June 2016. We determined which of those syndromes involved dysfunction of the central nervous system based on the corresponding clinical synopses in OMIM. We also labeled the following genes as associated with neurological dysfunction: 1) genes that have been associated with Autism at a false discovery rate of 0.1 [8] (those included the three genes later firmly associated with developmental disorders in [24]), 2) the top 15% of genes implicated in Schizophrenia (as ranked by their residual variation intolerance score in [9]), 3) SETD1A [10], 4) KMT2B [63], 5) genes that lacked previous associations with developmental disorders but achieved genome-wide significance in the DDD study [11]. We note that all of the above represent associations between protein-coding variants in EM genes and the corresponding phenotypes, with strong evidence for pathogenicity. This yielded a total of 50 EM genes associated with disorders exhibiting symptoms of abnormal brain function: 8 are linked to Autism, Schizophrenia, and Developmental Disorders, and 42 are known to cause a monogenic disorder in OMIM. As can be seen under the “Clinical Synopsis” in OMIM, in each of these disorders the affected children can have a variety of manifestations under the “Neurologic” category. These include intellectual disability of variable severity, seizures, speech delay, apraxia, balance/gait abnormalities, memory defects, and others. Additionally, patients with Autism and Schizophrenia also exhibit several different symptoms attributable to central nervous system dysfunction (such as seizures and memory deficits). Hence, we concluded that the most clinically meaningful classification of EM genes linked to such diseases is as “associated with neurological dysfunction”. Within our 102 novel disease candidates, we did not include candidate genes provided in [24].

To assess the potential contribution of EM genes to neurodegenerative disorders, we performed a literature curation to search for EM genes associated with either Parkinson’s or Alzheimer’s disease. We found no such reports [64–68], consistent with the existing view that EM genes are mainly involved in neurodevelopment [69]. The sole exception to this is DNMT1, which is included in our list of EM genes with Mendelian disease associations and causes two disorders with neurodegenerative features: autosomal dominant cerebellar ataxia, deafness, and narcolepsy (MIM: 604121), and hereditary sensory neuropathy type 1E (MIM: 614116).

Regarding associations with cancer, we first identified EM genes potentially functioning as cancer drivers using: 1) a list of 260 significantly mutated cancer genes, derived from data spanning 21 tumor types [70], and 2) genes that were predicted to be drivers by at least one of the top 3 performing methods in [71]. Both of the above studies evaluated genes based on point mutations and small insertions/deletions. We then also included other EM genes that have been reported to be involved in cancer, harboring either point mutations/small indels, or structural rearrangements [14, 72]. Taken together, the above studies show that EM genes are associated with a wide variety of tumor types, both solid and hematological (e.g. renal cell carcinoma, colorectal cancer, lung cancer, melanoma, pancreatic neuroendocrine tumors, T/B cell lymphoma, acute lymphoblastic leukemia, and others). This indicates that they broadly promote tumorigenesis when mutated in somatic cells, in a tissue-independent manner. Therefore, we collectively refer to all these EM genes as “cancer associated”.

All of the above disease associations are provided in Supplemental Table 2.2.

2.5.3 Variation tolerance analysis

pLI scores for heterozygous loss of function constraint were downloaded from the ExAC database [23]. When comparing the pLI distributions of different classes of genes, we excluded genes encoded on the X and Y chromosomes.

2.5.4 CCR local constraint score

The CCR model [33] identifies regions of the genome without any missense or loss of function mutations in gnomAD [23]. Each region devoid of mutations is assigned a CCR percentile score; the greater the difference between the observed and expected coverage-weighted length for regions with similar CpG density, the higher the constraint. As a result, the CCR model extends single gene-wide estimates of constraint to identify sub-regions within genes that exhibit “local constraint”. We mapped each protein domain to the genome, using the Pbase package. For each EM gene, we restricted our analysis to the single specific isoform for which the ExAC investigators provide a pLI score. We then classified a domain as constrained if at least 10% of bases in that domain resided in a genomic region with a CCR percentile score above 90. Our rationale for choosing this cutoff was that: 1) it had to be a sizable percentage, and 2) in high pLI genes which only contain a single domain (such as TET3), that domain had to be constrained. However, to examine whether our results are dependent on the choice of cutoff, we also performed our analysis with a quantitative version of this domain-specific constraint. Specifically, we defined the “CCR local constraint” of a domain to be the percentage of bases in the domain residing in a genomic region with a CCR percentile score above 90. Our analysis yielded results which mirrored those obtained with the binary version of the CCR constraint (Figure 2.10). The CCR local constraint score (quantitative version) for the protein domains in EM genes is included in Supplemental Table 2.3.
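The two versions of the domain-level constraint described above can be illustrated with a minimal Python sketch. This is not the code used for the analyses; the function names are hypothetical, the domain is simplified to a single contiguous half-open genomic interval (a real domain would typically span several exons), and the CCR regions are assumed to be non-overlapping.

```python
def ccr_local_constraint(domain_start, domain_end, ccr_regions, percentile_cutoff=90.0):
    """Percentage of domain bases lying in CCR regions that score above the cutoff.

    Coordinates are half-open [start, end). ccr_regions is a list of
    (start, end, ccr_percentile) tuples, assumed not to overlap each other.
    """
    domain_length = domain_end - domain_start
    if domain_length <= 0:
        raise ValueError("domain must have positive length")
    covered = 0
    for start, end, ccr_percentile in ccr_regions:
        if ccr_percentile > percentile_cutoff:
            overlap = min(end, domain_end) - max(start, domain_start)
            if overlap > 0:
                covered += overlap
    return 100.0 * covered / domain_length


def is_constrained(local_constraint, min_percent=10.0):
    """Binary version: constrained if at least min_percent of bases are covered."""
    return local_constraint >= min_percent


# Toy regions (made-up coordinates and percentiles, for illustration only).
toy_regions = [(100, 150, 95.0), (150, 400, 40.0), (400, 420, 99.0)]
toy_score = ccr_local_constraint(100, 500, toy_regions)
```

In this toy case, 70 of the 400 domain bases fall in regions above the 90th percentile, so the quantitative score is 17.5 and the binary call is "constrained".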

2.5.5 GTEx data

RNA-seq data from 28 tissues (Supplemental Table 2.9) and 449 individuals were downloaded from the GTEx portal, release V6p. Those 28 tissues were selected based on differences in physiological function. Our goal was to obtain as representative a picture of human physiology as possible, but, since we ultimately performed across-tissue analyses, we sought to avoid the inclusion of tissues whose presence could introduce similarities between genes that would confound our tissue specificity and co-expression analyses (see sections below). As an example, we only included samples from subcutaneous, and not from visceral, adipose tissue.


For expression pattern analysis, we downloaded the raw RPKM data as provided in the GTEx portal. For co-expression analysis, we downloaded the gene-level count table and transformed it to the log2(RPM + 1) scale (scaled to 10⁷ counts per sample instead of 10⁶). In this dataset, 5 EM genes were not available, leaving us with 290 for analysis.
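As an illustration of the count transformation, the following minimal Python sketch rescales one sample's gene-level counts to a fixed library size of 10^7 before taking log2(x + 1). The function name is hypothetical, and the sketch assumes that the library size is the sum of the gene-level counts in the sample, which the text does not specify.

```python
import math


def log2_scaled_counts(counts, scale=1e7):
    """log2(scaled count + 1) for one sample, rescaled to `scale` total counts."""
    library_size = sum(counts)  # assumption: library size = sum of gene-level counts
    if library_size == 0:
        raise ValueError("sample has no counts")
    return [math.log2(count * scale / library_size + 1.0) for count in counts]


# Toy sample with three genes (made-up counts).
toy_counts = [0, 10, 990]
toy_transformed = log2_scaled_counts(toy_counts)
```

A gene with zero counts maps to exactly 0 on this scale, and the ordering of genes within a sample is preserved.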

2.5.6 Tissue specificity and expression level analysis

Using the GTEx data described above, we calculated tissue specificity scores for Supplemental Fig. 2.12 as previously described [73]. Computations were done using the functions makeprobs() and JSdistFromP() from the cummeRbund package [74]. Supplemental Fig. 2.12B depicts the tissue specificity scores of EM genes vs those of 30 genes encoding for TCA cycle related proteins (Supplemental Table 2.10). To confirm that our findings were not driven by unwanted variation, we repeated our analysis after correcting for RIN as well as surrogate variables (SVs) [75, 76]. In particular, using log2(RPKM + 1) values, we estimated the SVs using the function sva() from the SVA R package [77], while protecting for the tissue effect and including RIN as a known confounder. This resulted in 182 significant SVs, which we then used along with RIN to obtain the corrected expression values using the function removeBatchEffect() in the limma R package [78]. Subsequently, negative values in the expression matrix were replaced by zeros, and genes with uniformly low values (< 0.01 in all samples) were removed. As depicted in Supplemental Fig. 2.22, the results were essentially the same as those obtained with our original analysis. It should be mentioned, however, that in situations where there is severe confounding of the factor of interest with some batch effect (for example, if samples from a particular tissue were all processed differently than others), correcting for surrogate variables cannot disentangle desired from undesired variation.
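To make the idea of a Jensen-Shannon based tissue specificity score concrete, here is a minimal Python sketch. It is not the cummeRbund implementation, and the exact definition in [73] may differ: the sketch assumes that the score for a tissue is one minus the Jensen-Shannon distance (square root of the base-2 divergence) between a gene's normalized expression profile and an idealized profile expressed only in that tissue. All function names are hypothetical.

```python
import math


def _entropy(probabilities):
    """Shannon entropy in bits, ignoring zero entries."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = _entropy(mixture) - (_entropy(p) + _entropy(q)) / 2
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def tissue_specificity(expression):
    """One score per tissue: 1 - JS distance to a profile expressed only there."""
    total = sum(expression)
    if total <= 0:
        raise ValueError("gene must be expressed in at least one tissue")
    profile = [value / total for value in expression]
    scores = []
    for tissue in range(len(profile)):
        ideal = [1.0 if i == tissue else 0.0 for i in range(len(profile))]
        scores.append(1.0 - js_distance(profile, ideal))
    return scores


# Toy genes: one expressed in a single tissue, one expressed uniformly.
specific_scores = tissue_specificity([0.0, 5.0, 0.0])
uniform_scores = tissue_specificity([2.0, 2.0, 2.0])
```

Under this formulation, a gene expressed in a single tissue receives a score of 1 for that tissue and 0 elsewhere, while a uniformly expressed gene receives the same intermediate score for every tissue.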

2.5.7 Co-expression analysis

Using the GTEx data described above, we estimated tissue-specific networks and modules using the following approach. First, for each tissue, we only included genes where the corresponding tissue-specific median expression (median(log2(RPKM + 1))) was greater than zero. Then, prior to network construction, we preprocessed the expression data to remove unwanted variation, since it is known that it can confound the estimation of pairwise correlation coefficients between genes [79, 80]. To achieve this, for each tissue, we standardized the expression matrix (containing log2(RPM + 1) values) to have mean 0 and variance 1 across every gene, and removed the 4 leading principal components from this matrix by regressing on the PCs and then reconstructing a new matrix with the regression residuals, using the function removePrincipalComponents() in the WGCNA package [36, 81]. This has been shown to remove unwanted variation for co-expression analysis [80]. In Supplemental Fig. 2.23 we depict the impact of doing this on the distribution of pairwise correlations across (1) 2000 randomly selected genes and (2) 80 genes encoding for the protein component of the ribosome (Supplemental Table 2.12), following ideas from [79]. We expect random genes to be uncorrelated (negative controls) whereas we expect genes encoding for ribosomal proteins to be highly co-expressed (positive controls). To avoid over-fitting, we removed the same number of principal components from all tissues.
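The preprocessing step (standardize each gene, then regress out the leading principal components and keep the residuals) can be sketched in plain Python as follows. This is an illustrative sketch only, not the WGCNA implementation: the function names are hypothetical, the leading components are found by simple power iteration rather than a full SVD, genes are assumed to have non-zero variance, and the toy matrix is made up.

```python
import math


def standardize_genes(matrix):
    """Scale every column (gene) to mean 0 and variance 1 across rows (samples)."""
    n_samples = len(matrix)
    scaled_columns = []
    for column in zip(*matrix):
        mean = sum(column) / n_samples
        sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n_samples - 1))
        scaled_columns.append([(x - mean) / sd for x in column])  # assumes sd > 0
    return [list(row) for row in zip(*scaled_columns)]


def leading_component(matrix, n_iterations=200):
    """Unit-length leading principal component in sample space (power iteration)."""
    n_samples, n_genes = len(matrix), len(matrix[0])
    u = [math.sin(i + 1.0) for i in range(n_samples)]  # deterministic generic start
    for _ in range(n_iterations):
        v = [sum(matrix[i][j] * u[i] for i in range(n_samples)) for j in range(n_genes)]
        u = [sum(matrix[i][j] * v[j] for j in range(n_genes)) for i in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u


def remove_leading_components(matrix, n_components):
    """Regress every gene on the leading components and keep the residuals."""
    n_samples, n_genes = len(matrix), len(matrix[0])
    residual = [row[:] for row in matrix]
    for _ in range(n_components):
        u = leading_component(residual)
        loadings = [sum(u[i] * residual[i][j] for i in range(n_samples))
                    for j in range(n_genes)]
        residual = [[residual[i][j] - u[i] * loadings[j] for j in range(n_genes)]
                    for i in range(n_samples)]
    return residual


# Toy data: 6 samples x 4 genes sharing one strong "technical" factor.
factor = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
toy = [[factor[i] * (j + 1) + math.sin(3.0 * i + 7.0 * j) for j in range(4)]
       for i in range(6)]
standardized = standardize_genes(toy)
cleaned = remove_leading_components(standardized, n_components=1)
```

The text above removes 4 components per tissue; the toy example removes 1 only because the matrix is tiny. After removal, every gene's residual profile is orthogonal to the removed component, which is what regression on that component implies.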

We then proceeded to perform tissue-specific network construction. For each tissue, we estimated the soft thresholding power using the entire expression matrix; we chose the first value for which the network was characterized by an approximately scale free topology, following standard WGCNA guidelines. To ensure that we can ultimately make comparisons across the 28 tissues, we next selected genes that have some minimal expression in all of them. Specifically, we required that the tissue-specific median expression (median(log2(RPKM + 1))) was greater than zero in all of the 28 tissues. This gave us a set of 14872 genes. With respect to EM genes, this requirement selects 270 EM genes out of 295; in addition to the 5 of the 295 EM genes that are not present in GTEx (see above), it excludes the 11 EM genes which are testis-specific (testis-specificity score > 0.5), 7 genes that are either not expressed or expressed at very low levels in those tissues, and 2 genes (DPF1 and PHF21B) that are expressed at a considerable level in more than 1 tissue but are very lowly expressed in some other tissues (DPF1 was especially expressed in cerebral cortex and cerebellum, but this was not as pronounced after correcting for RIN and surrogate variables). Subsequently, using only those 270 EM genes, we built unsigned tissue-specific networks and identified modules, by performing hierarchical clustering using the function cutreeDynamic(), with the dissimilarity measure based on the topological overlap matrix. We set the parameters minClusterSize and deepSplit equal to 15 and 2, respectively. Modules were merged when the correlation between the corresponding module eigengenes was 0.8 or greater. Any parameters that are not mentioned were left at their default values.
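For readers unfamiliar with the topological overlap matrix, the following Python sketch shows the standard unsigned formulation: the adjacency is the absolute correlation raised to the soft thresholding power, and the overlap between two genes combines their direct connection with the connections they share. This is an illustration under those assumptions rather than the WGCNA code; function names and the toy correlation matrix are hypothetical.

```python
def unsigned_adjacency(correlation, power):
    """Soft-thresholded unsigned adjacency: |cor|^power, with 1 on the diagonal."""
    n = len(correlation)
    return [[1.0 if i == j else abs(correlation[i][j]) ** power for j in range(n)]
            for i in range(n)]


def topological_overlap(adjacency):
    """Topological overlap: (shared + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adjacency)
    # Connectivity of each gene, excluding its self-connection.
    connectivity = [sum(adjacency[i][u] for u in range(n) if u != i) for i in range(n)]
    overlap = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adjacency[i][u] * adjacency[u][j]
                         for u in range(n) if u != i and u != j)
            value = (shared + adjacency[i][j]) / (
                min(connectivity[i], connectivity[j]) + 1.0 - adjacency[i][j])
            overlap[i][j] = overlap[j][i] = value
    return overlap


def tom_dissimilarity(overlap):
    """Dissimilarity used as input to hierarchical clustering: 1 - overlap."""
    return [[1.0 - value for value in row] for row in overlap]


# Toy correlation matrix for three genes (made-up values), with power 2.
toy_correlation = [[1.0, 0.8, 0.6], [0.8, 1.0, 0.5], [0.6, 0.5, 1.0]]
toy_tom = topological_overlap(unsigned_adjacency(toy_correlation, power=2))
toy_dissimilarity = tom_dissimilarity(toy_tom)
```

For the first two toy genes the overlap is (0.09 + 0.64) / (0.89 + 1 - 0.64) = 0.584, slightly lower than their direct adjacency of 0.64 because they share little connectivity through the third gene.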

To derive the reference distribution of the number of highly co-expressed and co-expressed genes (Figure 2.4D, Supplemental Fig. 2.16), we performed all of the steps described above with 270 randomly selected genes instead of EM genes; this was repeated 300 times. Because we observed that EM genes which belong to either the highly co-expressed or co-expressed group have higher expression across tissues than EM genes which are not co-expressed (Supplemental Fig. 2.15A), each time the random genes were sampled from the population of genes whose median expression level (median(log2(RPKM + 1))) was: (1) at least 0.5 in more than half of the tissues (11963 genes total), or (2) at least 3 in more than half of the tissues (5095 genes total). These two populations of genes have a similar expression level to the EM genes (group 1) or a considerably higher expression level compared to the EM genes (group 2) (see Supplemental Fig. 2.15B,C).
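The logic of the expression-matched reference distribution can be sketched in Python as below. The sketch is illustrative only: `statistic` is a hypothetical stand-in for the full network pipeline (returning, for example, the number of highly co-expressed genes in a gene set), and the add-one form of the empirical p-value is an assumption, since the text does not state how p-values were derived from the reference distribution.

```python
import random


def expression_matched_pool(median_expression, min_level=0.5):
    """Genes whose median expression is >= min_level in more than half of the tissues.

    median_expression maps a gene name to its list of per-tissue medians.
    """
    pool = []
    for gene, per_tissue in median_expression.items():
        n_passing = sum(1 for value in per_tissue if value >= min_level)
        if n_passing > len(per_tissue) / 2:
            pool.append(gene)
    return sorted(pool)


def reference_distribution(pool, statistic, set_size=270, n_draws=300, seed=0):
    """Apply `statistic` to repeated random gene sets drawn from the pool."""
    rng = random.Random(seed)
    return [statistic(rng.sample(pool, set_size)) for _ in range(n_draws)]


def empirical_p_value(observed, reference_values):
    """One-sided empirical p-value with an add-one correction (an assumption)."""
    n_as_extreme = sum(1 for value in reference_values if value >= observed)
    return (n_as_extreme + 1) / (len(reference_values) + 1)


# Toy example with made-up medians for four genes across three tissues.
toy_medians = {"geneA": [0.6, 0.7, 0.1], "geneB": [0.1, 0.2, 0.9],
               "geneC": [3.0, 2.0, 1.0], "geneD": [0.0, 0.0, 0.0]}
toy_pool = expression_matched_pool(toy_medians)
toy_reference = reference_distribution(toy_pool, statistic=len, set_size=2, n_draws=5)
```

Passing `min_level=3` would give the second, more highly expressed reference population described above.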

We examined the robustness of our results to the choice of arbitrary cutoffs, and determined that our choice of cutoffs does not impact the statistical significance of our finding (Supplemental Fig. 2.16). To ensure that our findings are not driven by the presence of outlier samples, we compared our results to those obtained when we randomly excluded samples from the network construction. In particular: 1) when there were more than 100 samples in a tissue, we randomly dropped half of them, 2) when there were between 50 and 100 samples in a tissue, we randomly dropped 20 of them, and 3) when there were fewer than 50 samples, we randomly dropped 5 of them. After this subsampling had taken place, the subsequent steps were performed as described above. This procedure was repeated 300 times. Across the random subsets, a median of 64.5 out of the 74 originally identified highly co-expressed genes were again classified as such.
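The subsampling rule can be written compactly in Python. This is a sketch with hypothetical function names; it assumes that tissues with exactly 50 or exactly 100 samples fall in the middle category and that "half" is rounded down for odd sample counts, neither of which is specified in the text.

```python
import random


def n_samples_to_drop(n_samples):
    """Number of samples removed from a tissue before rebuilding the network."""
    if n_samples > 100:
        return n_samples // 2  # assumption: half, rounded down
    if n_samples >= 50:        # assumption: boundaries 50 and 100 belong here
        return 20
    return 5


def subsample(sample_ids, seed=0):
    """Randomly keep the samples that are not dropped."""
    rng = random.Random(seed)
    n_keep = len(sample_ids) - n_samples_to_drop(len(sample_ids))
    return sorted(rng.sample(sample_ids, n_keep))


# Toy tissue with 80 samples: 20 are dropped, 60 are kept.
toy_kept = subsample(list(range(80)))
```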

For the analysis where we construct tissue-specific networks by thresholding the correlation matrix, we selected genes and removed principal components as described above. We then estimated a tissue-specific threshold as the 99.8th percentile of the correlation matrix of 2,000 randomly chosen genes. The topology of the resulting network is sensitive in both directions to the exact cutoff. We then thresholded the correlation matrix of the 270 EM genes and computed the maximally connected component of the resulting network. We then identified EM genes that shared membership in this component with more than 75 other EM genes in more than 10 tissues. We found 71 such genes, 60 of which are part of the originally detected highly co-expressed group, and 11 of which are part of the co-expressed group, thus largely recapitulating the result obtained with WGCNA. We then compared the size and average node degree of the maximally connected component to those of the maximally connected components of networks constructed from groups of 270 randomly chosen genes; each of the reference distributions was derived by sampling 300 times from the population of genes with a similar expression level to EM genes (i.e. median expression greater than 0.5 in more than half the tissues, Supplemental Fig. 2.15). We found that EM gene networks had a significantly larger maximally connected component, whose average node degree was generally also larger (Supplemental Fig. 2.17).
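The thresholding approach can be illustrated with the following Python sketch, which derives a cutoff as a percentile of reference correlations, connects genes whose correlation exceeds it, and extracts the largest connected component together with its average node degree. Function names are hypothetical, the percentile uses linear interpolation, and the sketch thresholds the absolute correlation, which is an assumption (the text does not say whether the sign was taken into account).

```python
from collections import deque


def percentile(values, q):
    """q-th percentile (0-100) by linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def is_edge(correlation, i, j, threshold):
    """Two distinct genes are connected if |correlation| exceeds the threshold."""
    return i != j and abs(correlation[i][j]) > threshold


def largest_connected_component(correlation, threshold):
    """Indices of the genes in the largest connected component (breadth-first search)."""
    n = len(correlation)
    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for other in range(n):
                if other not in seen and is_edge(correlation, node, other, threshold):
                    seen.add(other)
                    queue.append(other)
        if len(component) > len(best):
            best = component
    return sorted(best)


def average_degree(correlation, threshold, component):
    """Average number of within-component neighbours per gene."""
    degrees = [sum(1 for j in component if is_edge(correlation, i, j, threshold))
               for i in component]
    return sum(degrees) / len(degrees)


# Toy correlation matrix for four genes (made-up values) and a fixed threshold.
toy_cor = [[1.0, 0.9, 0.2, 0.1],
           [0.9, 1.0, 0.7, 0.0],
           [0.2, 0.7, 1.0, 0.3],
           [0.1, 0.0, 0.3, 1.0]]
toy_component = largest_connected_component(toy_cor, threshold=0.5)
toy_degree = average_degree(toy_cor, 0.5, toy_component)
```

In an analysis like the one described above, the threshold would come from something like `percentile(reference_correlations, 99.8)`, where `reference_correlations` (a hypothetical name) holds the pairwise correlations of the randomly chosen genes.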

2.5.8 Trans-acting factor binding at EM gene promoters

We defined promoters as 10 kb sequences centered around the transcriptional start site. We used the ENCODE portal (http://encodeproject.org) to download TF ChIP-seq data for the K562 cell line (we note here that those data provide information for transcription factors, as well as other regulators not strictly belonging to the TF group; for simplicity, in this section we will refer to all those factors as TFs). To take antibody quality into account, we followed the ENCODE guidelines and selected experiments that showed reproducibility across replicates and had narrow peak calls (that is, whose output type was labeled as “optimal IDR thresholded peaks”). Then, we only kept experiments performed in the absence of any treatment on the cells. We also randomly discarded any duplicate experiments (that is, experiments performed on the same TF target, regardless of the exact antibody used). Subsequently, we used the recount R package [82, 83] to select genes expressed in K562 cells ([84]; study identifier in the Sequence Read Archive: SRP010061); as in our co-expression analysis, we required that a gene had median RPKM > 0 across the 8 total samples. This yielded a total of 242 EM genes (72 highly co-expressed and 94 non co-expressed), 14355 other genes (excluding ribosomal protein genes), and 330 regulatory factors.

To test for enrichment of TF binding in the highly co-expressed versus the non co-expressed EM gene group, we first discarded any TFs that were binding at 10 or fewer promoters, as those were unlikely to be driving the observed co-expression. We then formed a 2x2 table for each of the remaining 295 TFs, and performed Fisher’s exact test. To derive a null distribution we used the following two approaches. For the first approach, we defined two subsets of the 14355 other genes: a set of 9495 genes with a median RPKM > 1.2 and a set of 10821 genes with a median RPKM > 0.4, to match the expression levels of the highly co-expressed and non co-expressed EM genes, respectively (Supplemental Fig. 2.21A). Then, we randomly sampled from these two sets to create two groups, consisting of 72 and 94 members respectively. We discarded any TFs binding at 50 or fewer promoters and then, as before, we tested each of the remaining 320 TFs for enrichment. Supplemental Fig. 2.21B depicts the null distribution of the number of TFs with an odds ratio > 2 (indicating at least a 2-fold enrichment in the 72-member group) and a p-value < 0.05, versus the observed number of TFs showing this enrichment in the highly co-expressed EM group. For the second approach, we randomly sampled EM genes and, after sampling, arbitrarily created two groups, one with 72 members and another with 94. We then repeated the same procedure as before, and tested each TF for enrichment in the promoters of one group versus the other. Supplemental Fig. 2.21C shows the resulting null distribution, as well as the actual observed value.
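The per-TF test can be illustrated with a small, dependency-free Python sketch of Fisher's exact test on a 2x2 table (rows: the two gene groups; columns: promoter bound or not bound by the TF). The function name is hypothetical, and the two-sided p-value shown here is an assumption, since the text does not state which alternative was used; the table values are a toy example, not data from the analysis.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Odds ratio and two-sided Fisher p-value for the table [[a, b], [c, d]].

    a, b: bound / unbound promoters in the first group;
    c, d: bound / unbound promoters in the second group.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of x bound promoters in the first group.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = probability(a)
    # Sum the probabilities of all tables at most as likely as the observed one.
    p_value = sum(probability(x) for x in range(lowest, highest + 1)
                  if probability(x) <= observed * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Toy table: 8 of 10 promoters bound in one group, 1 of 6 in the other.
toy_odds_ratio, toy_p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

A TF would then be counted as enriched under the criteria described above when its odds ratio exceeds 2 and its p-value is below 0.05, as is the case for this toy table.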

2.5.9 Enrichment of disease genes in the highly co-expressed group

For Figure 2.6B we formed 2x2 tables of EM genes used in the co-expression analysis. For the categories “any disease”, “neuro” and “ca”, all 270 such genes were included. For the categories “neuro (no ca)” and “ca (no neuro)”, all EM genes associated with both neurological dysfunction and cancer were excluded. In all cases, we compared EM genes in the highly co-expressed group to EM genes in either the co-expressed or the not co-expressed group (combined). For the “high pLI neuro vs. other high pLI” category we only included genes with a pLI greater than 0.9 (without excluding those on the sex chromosomes). For Figure 2.6C we compared EM genes in the highly co-expressed group (defined using different cutoffs) to EM genes in the not co-expressed group, keeping the latter reference group constant in all comparisons.

2.5.10 Stratified LD score regression

We first used the enhancer regions, defined by [39] as genomic locations distinct from genic promoters, which are marked by H3K27ac in one or more of 136 brain samples from 87 anatomically distinct brain regions. We labeled each brain enhancer region within 1 Mb of the transcription start site (TSS) of an EM gene as an EM regulatory region. This yielded 46 Mb of EM-regulatory regions and 15 Mb of highly co-expressed regulatory regions (with the former including the latter as a subset).
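The labeling of regulatory regions can be sketched in Python as follows. The function names and coordinates are hypothetical, intervals are half-open, and the sketch assumes that "within 1 Mb" is measured from the nearest edge of the enhancer to the TSS, which the text does not specify.

```python
def regulatory_regions(enhancers, tss_by_chromosome, window=1_000_000):
    """Enhancers lying within `window` bases of at least one TSS.

    enhancers: list of (chromosome, start, end) tuples;
    tss_by_chromosome: dict mapping a chromosome to a list of TSS positions.
    """
    selected = []
    for chromosome, start, end in enhancers:
        for tss in tss_by_chromosome.get(chromosome, []):
            # Distance from the nearest enhancer edge; 0 if the TSS is inside it.
            distance = max(start - tss, tss - end, 0)
            if distance <= window:
                selected.append((chromosome, start, end))
                break
    return selected


def total_megabases(regions):
    """Summed length of the regions in Mb (assumes they do not overlap)."""
    return sum(end - start for _, start, end in regions) / 1e6


# Toy enhancers and a single toy TSS (made-up coordinates).
toy_enhancers = [("chr1", 500_000, 501_000),
                 ("chr1", 5_000_000, 5_002_000),
                 ("chr2", 100, 200)]
toy_tss = {"chr1": [1_000_000]}
toy_selected = regulatory_regions(toy_enhancers, toy_tss)
```

Running the same function with the TSSs of only the highly co-expressed genes would give the smaller set, which is by construction a subset of the set obtained with all EM genes.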

We next used stratified LD score regression (SLDSR) [40,85] with the LDSC software (https://github.com/bulik/ldsc) to estimate coefficient z-scores and enrichment statistics for these two sets of regions, across 29 traits (information regarding the corresponding GWAS studies is provided in Supplemental Table 2.11). For a given trait, SLDSR estimates the proportion of heritability that is explained by SNPs that reside within a given set of genomic locations. It employs a linear model that incorporates GWAS summary statistics as well as linkage disequilibrium values (as derived from a reference panel matched for ancestry). Each of our two sets of features (the highly co-expressed and the all-EM regulatory regions) was separately examined for heritability enrichment, after adjusting for the full baseline set of features described in [85]. This includes standard features such as coding regions and conserved regions. The resulting p-values associated with the z-scores were corrected for multiple testing using the Benjamini-Hochberg procedure [86], across all 29 traits and the two features we considered.
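The multiple testing correction described above can be sketched as follows. This is a minimal, standard-library implementation of Benjamini-Hochberg adjusted p-values, offered as an illustration and not as the code used for the analysis; the function name and the example p-values are hypothetical. In the setting above, the input would be the 58 p-values obtained from 29 traits and two features.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that the
    # adjusted values are monotone in the original p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A test is then called significant at a false discovery rate of q if its adjusted p-value is at most q.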

The interpretation of the LDSC analysis is that a feature (a set of regions) is significantly enriched for explained heritability if the feature adds significantly to the explained heritability on top of the baseline model. This is a strong statement of enrichment for GWAS signal, because the baseline model already includes multiple regions of known genetic importance.
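For reference, the linear model underlying this interpretation can be written as below. The notation is an assumption borrowed from the standard formulation of stratified LD score regression and does not appear in this text: N is the GWAS sample size, C indexes the annotation categories (the baseline features plus the feature of interest), a is a term capturing confounding, and M is the total number of SNPs.

```latex
% Expected chi-squared statistic of SNP j under stratified LD score regression
\mathbb{E}\left[\chi_j^2\right] = N \sum_{C} \tau_C \, \ell(j, C) + N a + 1,
\qquad
\ell(j, C) = \sum_{k \in C} r_{jk}^2

% Enrichment of category C: share of heritability divided by share of SNPs
\mathrm{Enrichment}(C) = \frac{h^2(C) / h^2}{M(C) / M}
```

Under this formulation, the coefficient z-score for a feature tests whether its coefficient \tau_C is greater than zero conditional on all other categories in the model, which is why a significant z-score indicates a contribution beyond the baseline features.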

2.5.11 Genome assembly version

All our analyses were performed using the GRCh37 (hg19) genome assembly version, primarily because the publicly available datasets we relied on (ExAC, as well as the GWAS data used in LDSC) utilized this genome version. Given that we focus on well-annotated regions of the genome, our results would not be significantly impacted by use of the newer GRCh38.


2.5.12 Code availability

Analysis code for this work is available online at https://github.com/hansenlab/em_paper.

2.5.13 Acknowledgments

Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award numbers R01GM121459 and DP5OD017877. LB was supported by the Maryland Genetics, Epidemiology and Medicine (MD-GEM) training program, funded by the Burroughs-Wellcome Fund. HTB received support from the Louma G. Foundation. LB, HTB and KDH received support from a Discovery Award from Johns Hopkins University. JMH and ARQ were supported by National Institutes of Health awards from the National Human Genome Research Institute (R01HG006693 and R01HG009141) and the National Institute of General Medical Sciences (R01GM124355). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. HTB is a paid consultant for Millennium Pharmaceuticals, Inc.


2.6 Supplemental Materials

2.6.1 Supplemental Results

2.6.1.1 Variation intolerance of EM genes encoded on the sex chromosomes

In our main analysis of loss-of-function variation intolerance, we focused on genes encoded on the autosomes. When we exclusively considered the X chromosome, we observed a similar picture; 16 out of the 18 X-linked EM genes have a pLI greater than 0.9. Using data from a recent study on X inactivation [87], we found that all 3 EM genes that consistently escape X inactivation in different tissues have a pLI of 1. In contrast, only 31% of other X-linked genes have a pLI greater than 0.9 (median pLI = 0.65, median pLI for other genes that escape X inactivation = 0.41). With respect to the 2 out of 4 EM genes on the Y that are included in ExAC, UTY has an intermediate pLI of 0.63, while KDM5D is haplosufficient (pLI = 0.02).

2.6.1.2 Tissue specificity and expression levels of EM genes

The intolerance to variation suggests that for most of the EM genes, the loss of even a single copy is incompatible with a healthy organismal state. An important question, then, is the identification of the tissues and cell types through which this detrimental effect is mediated. There are two primary reasons why one might speculate that some EM genes have tissue-specific expression. First, the composition of the machinery suggests the existence of functionally redundant components (for instance, there are 116 histone methylation readers with no enzymatic or other reading activity). This redundancy could be explained if different components with the same role were specific to different tissues. Second, there is some evidence suggesting that TF binding to their target sites requires, at least in some cases, an already permissive chromatin state [88]. This could imply that there are certain EM components expressed in a cell-type-specific fashion that thereby help generate the epigenomic landscapes that facilitate TF binding [88, 89]. Given that specific genomic locations need to be marked, it has been postulated that this is achieved by EM genes with DNA-binding domains that differ from those encountered in classical TFs, yet confer some degree of sequence specificity [88, 89]. Four domains described as putative candidates are the ARID DNA-binding domain, the AT-hook DNA-binding motif, the CxxC domain, and the C2H2-like Zinc finger, which we found to be present in 21 EM genes.

To gain insight into the question of tissue specificity, we examined the expression patterns of EM genes across a spectrum of adult tissues, using RNA-seq data generated by the GTEx consortium [35]. We selected 28 tissues on the basis of sample size and differences in physiological function, as this enabled us to obtain a comprehensive picture under diverse cellular conditions, and allowed us to avoid spurious specificity estimates arising from the high similarity between some tissues (e.g. subcutaneous and visceral adipose tissue). For each EM gene, we calculated an entropy-based tissue specificity score, as previously described [73]. This score reflects the degree to which a gene is highly specific for some tissue (score close to 1) or is expressed broadly across tissues (score close to 0). We discovered that the vast majority of EM genes are characterized by very low specificity, after comparing their scores to those of TF genes as well as other genes (Supplemental Fig. 2.12A). In fact, when we compared the specificity of EM genes to that of genes encoding for proteins involved in the tricarboxylic acid (TCA) cycle, a well known category of housekeeping genes, we found a very similar distribution (Supplemental Fig. 2.12B). The lack of tissue specificity also characterizes EM genes with a DNA-binding domain that recognizes short motifs (median specificity score = 0.1), although we note that there also exist other genes that harbor those domains but do not fulfill the criteria for inclusion in our list of EM genes. This result, however, raises the possibility that the increased dosage sensitivity of EM genes is due to their non-specific expression pattern. Nevertheless, even after considering only highly non-specific genes (score less than 0.1), the enrichment of the 160 EM genes satisfying this criterion in the highly constrained category remains extremely pronounced (Supplemental Fig. 2.13). This is true both when comparing to all of the 5249 other non-specific genes (Fisher’s test, p < 2.2 · 10^−16, odds ratio = 5.6), as well as to the 232 non-specific TFs (Fisher’s test, p = 6.73 · 10^−9, odds ratio = 3.5), showing that this constraint is not merely a consequence of the presence of EM genes in a greater number of cell types, but rather reflects their function.
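To make the behavior of an entropy-based specificity score concrete, the sketch below computes one common variant: one minus the Shannon entropy of a gene's expression profile across tissues, normalized by the maximum possible entropy. This is an assumption made for illustration; the exact definition used in the analysis is the one given in [73], which is not reproduced here and may differ in its details. The function name and input values are hypothetical.

```python
from math import log2


def tissue_specificity(expression):
    """Normalized-entropy specificity score for one gene.

    `expression` holds one non-negative value per tissue (for example
    a median expression level). The score is 0 when expression is
    identical in all tissues and 1 when it is confined to one tissue.
    """
    n = len(expression)
    total = sum(expression)
    if n < 2 or total <= 0:
        raise ValueError("need at least two tissues and some expression")
    # Turn the expression profile into a probability distribution.
    probs = [x / total for x in expression]
    entropy = -sum(p * log2(p) for p in probs if p > 0)
    # log2(n) is the entropy of a perfectly uniform profile.
    return 1.0 - entropy / log2(n)


# Hypothetical profiles across 28 tissues, for illustration only.
broad = tissue_specificity([5.0] * 28)
testis_only = tissue_specificity([0.0] * 27 + [12.0])
```

With this definition a broadly expressed gene scores near 0 and a gene expressed in a single tissue scores 1, matching the orientation of the score described above.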

We subsequently reasoned that our analysis might be masking the presence of genes specific for only a small subset of tissues, and we performed a detailed analysis of the specificity of EM genes separately for each tissue (Supplemental Fig. 2.12C). We observed that testis stands out as the only tissue for which a small number of EM genes show specific expression (Supplemental Fig. 2.12C), indicating its dependence on not only the general machinery that operates in all other tissues, but also on a distinct subset of components. This is in agreement with the existing view that testis is an outlier tissue with respect to its transcriptomic state [35]. The testis-specific EM genes include PRDM9, in accordance with its reported role in meiotic recombination [90], as well as 10 other genes (Supplemental Fig. 2.12C), some of which (TDRD1, RNF17, BRDT, PRDM14, MORC1) possibly play roles in male germ cell differentiation and the repression of transposable elements in the germline [91–95], while the role of the others (CDY2A, HDGFL1, PRDM13, PRDM7, TDRD15) remains mostly unspecified. Three of those genes are also members of the PRDM family of histone methyltransferases, while the rest are all readers of histone methylation, with the exception of BRDT, a histone acetylation reader.


Finally, in all of the tissues analyzed, we observe that EM genes are highly expressed compared to TF genes, and other genes (Supplemental Fig. 2.12D). We also confirm that TF genes are lowly expressed (Supplemental Fig. 2.12D), as was previously observed using microarray data [21]. Furthermore, we find an association between pLI and expression level (Supplemental Fig. 2.24), which could be attributed to the fact that a reduction of this high expression to half the normal amount is not tolerated. As expected, there are exceptions, namely TF genes or other genes that are expressed at equally high or higher levels than EM genes, as well as a median of 44.5 EM genes across tissues expressed at low levels, with a median RPKM always less than 1. Within the latter, 9 components show consistently low expression across all tissues, with 6 being testis-specific. Collectively, the above results indicate that the majority of human EM genes are active across a heterogeneous set of adult tissues. This suggests that other factors primarily maintain cell identity in those tissues, and it can help explain the observation that in most cases of Mendelian disorders of the epigenetic machinery, more organ systems are affected compared to other genetic disorders [6].


2.6.1.3 The highly co-expressed genes are not enriched for protein-protein interactions

We tested whether the co-expression is associated with protein-protein interactions (PPIs) between EM gene products, using recent data on such interactions [96]. As our definition of EM genes only includes those with catalytic or reading activity, and not genes encoding for accessory subunits of chromatin modifying complexes, the extent to which such interactions will occur is not a priori known. We did not observe an increased frequency of interactions in the highly co-expressed group compared to the not co-expressed group, with both groups having very few PPIs (the probability of a pair interacting is 0.0007 and 0.003 in the two groups, respectively). This may suggest incomplete data on protein-protein interactions.
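One way to read the pair-level probabilities quoted above is as the fraction of unordered gene pairs within a group for which an interaction is reported. The sketch below computes that quantity under this assumption; the text does not spell out the exact estimator, and the function name, gene labels, and interactions shown are hypothetical.

```python
def pair_interaction_probability(genes, interactions):
    """Fraction of unordered gene pairs in `genes` with a reported PPI.

    `interactions` is an iterable of (gene_a, gene_b) tuples. Pairs that
    involve a gene outside the group, self-interactions, and duplicate
    or reversed entries are not counted more than once (or at all).
    """
    genes = set(genes)
    n_pairs = len(genes) * (len(genes) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two genes")
    observed = {
        frozenset(pair)
        for pair in interactions
        if len(set(pair)) == 2 and set(pair) <= genes
    }
    return len(observed) / n_pairs


# Hypothetical group and interactions, for illustration only.
prob = pair_interaction_probability(
    ["A", "B", "C", "D"],
    [("A", "B"), ("B", "A"), ("C", "X")],
)
```

For a group of 74 genes there are 2,701 possible pairs, so probabilities of the order reported above correspond to only a handful of interacting pairs.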

2.6.1.4 Co-expressed EM genes are not spatially clustered

Given that highly expressed genes in the human genome tend to reside in chromosomal clusters [41], and taking into account that clustered genes are often co-expressed [42], we investigated the relative chromosomal positions of co-expressed EM genes. We did not observe any notable clustering, with our 74 highly co-expressed, and 82 co-expressed EM genes being approximately uniformly distributed across chromosomes (chi-squared goodness of fit test, p = 1 for both groups), and only 9 and 6 pairs, respectively, having a within-pair chromosomal distance less than 1 megabase (Supplemental Fig. 2.20). The single exception to this is SETD1A and FBXL19, which are separated by only 8 kb. To see if the co-expression of those two genes is driven by a bidirectional promoter we looked at whether they are encoded on opposite strands, and found this not to be the case.
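A chi-squared goodness of fit test with a simulation-based p-value, of the kind referred to above, can be sketched as follows using only the Python standard library. The null probabilities per chromosome are left as an input, because the text does not specify whether "uniform" means equal probability per chromosome or probability proportional to some chromosome property; that choice, the function names, and the example counts are all assumptions made for illustration.

```python
import random


def chisq_stat(observed, expected):
    """Pearson chi-squared statistic for observed vs. expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulated_gof_p_value(observed, probs, n_sim=2000, seed=0):
    """Monte Carlo p-value for a chi-squared goodness of fit test.

    `observed` are counts per category (for example per chromosome) and
    `probs` are the null probabilities of each category (must be > 0).
    """
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    stat_obs = chisq_stat(observed, expected)
    categories = range(len(probs))
    hits = 0
    for _ in range(n_sim):
        # Draw n genes under the null and tabulate them by category.
        counts = [0] * len(probs)
        for k in rng.choices(categories, weights=probs, k=n):
            counts[k] += 1
        if chisq_stat(counts, expected) >= stat_obs:
            hits += 1
    # Add-one correction so that the p-value is never exactly zero.
    return (hits + 1) / (n_sim + 1)


# Hypothetical counts over four categories, for illustration only.
p_even = simulated_gof_p_value([10, 10, 10, 10], [0.25] * 4, n_sim=200)
```

A p-value close to 1, as reported above, means that the observed counts are at least as close to the null expectation as nearly all simulated draws.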

2.6.2 Supplemental Tables

Supplemental Table 2.1: The protein domains used to define the epigenetic machinery.

Domain name                                      Interpro ID
SET domain                                       IPR000182
GNAT domain                                      IPR000313
Histone acetyltransferase domain, MYST-type      IPR001025
Histone acetyltransferase Rtt109/CBP             IPR001214
JmjC domain                                      IPR001487
Histone deacetylase domain                       IPR001680
Sirtuin family & catalytic core domain           IPR001739
Chromo domain                                    IPR001965
Zinc finger & PHD-type                           IPR002110
Tudor domain                                     IPR002717
PWWP domain                                      IPR002857
Protein ASX-like & PHD domain                    IPR002999
Bromo adjacent homology (BAH) domain             IPR003347
ADD domain                                       IPR004092
Mbt repeat                                       IPR011124
Zinc finger, CW-type                             IPR013178
Bromodomain                                      IPR025766
Zinc finger, CXXC-type                           IPR026590
Methyl-CpG DNA binding                           IPR026905


Supplemental Table 2.2: The components of the epigenetic machinery. An online version is available at http://www.epigeneticmachinery.org. Also available at https://genome.cshlp.org/content/29/4/532/suppl/DC1

Supplemental Table 2.3: Local CCR constraints. Local CCR constraint for all domains (rows), for all genes included in the local constraint analysis. Available at https://genome.cshlp.org/content/29/4/532/suppl/DC1

Supplemental Table 2.4: EM genes with DNA binding domains. These 38 genes include: 1) the 20 genes which are also classified as TFs under the “TF activity” column in Supplemental Table 2.2, 2) EM genes containing a CxxC-type Zinc finger, or a High-mobility group box domain, or an AT-hook DNA-binding motif. Available at https://genome.cshlp.org/content/29/4/532/suppl/DC1

Supplemental Table 2.5: Subunits of EM complexes. A list of the accessory and EM subunits of the 19 complexes involving EM genes. Available at https://genome.cshlp.org/content/29/4/532/suppl/DC1

Supplemental Table 2.6: Novel EM disease candidate genes. Novel EM disease candidate genes, along with their co-expression status. We also provide the phenotype MIM number and the association with neurological dysfunction (if present) for 8 of these that had been associated with Mendelian phenotypes in OMIM at the time of publication. Available at https://genome.cshlp.org/content/29/4/532/suppl/DC1

Supplemental Table 2.7: Novel disease candidate genes encoding for accessory subunits of EM complexes. Available at https://genome.cshlp.org/content/29/4/532/suppl/DC1


Supplemental Table 2.8: Histone modifiers for which the amino-acid substrate specificity is known.

Specificity                  Gene name
H3K4 methylation writer      KMT2A, KMT2B, KMT2C, KMT2D, SETD1A, SETD1B, SETD7, SMYD1, SMYD2, ASH1L, PRDM9
H3K27 methylation writer     EZH1, EZH2
H3K36 methylation writer     NSD1, WHSC1, WHSC1L1, SETD2, SMYD2, ASH1L, SETD3, SETMAR
H3K9 methylation writer      PRDM2, EHMT1, EHMT2, SETDB1, SUV39H1
H3K4 methylation eraser      KDM1A, KDM1B, KDM5A, KDM5B, KDM5C, KDM5D, NO66
H3K27 methylation eraser     KDM6A, UTY, KDM6B, KDM7A, PHF8
H3K36 methylation eraser     KDM2A, KDM2B, KDM4A, KDM4B, KDM4C, KDM4D
H3K9 methylation eraser      KDM3A, KDM3B, JMJD1C, KDM4A, KDM4B, KDM4C, KDM4D, PHF8, PHF2
H3K27 acetylation writer     EP300, CREBBP
H3K9 acetylation eraser      SIRT1, SIRT2


Supplemental Table 2.9: GTEx tissues used in the tissue specificity and co-expression analyses.

Tissue name
Adipose – Subcutaneous
Adrenal Gland
Artery – Tibial
Brain – Cerebellum
Brain – Cortex
Breast – Mammary Tissue
Colon – Transverse
Esophagus – Mucosa
Heart – Left Ventricle
Kidney – Cortex
Liver
Lung
Minor Salivary Gland
Muscle – Skeletal
Nerve – Tibial
Ovary
Pancreas
Pituitary
Prostate
Skin – Not Sun Exposed (Suprapubic)
Small Intestine – Terminal Ileum
Spleen
Stomach
Testis
Thyroid
Uterus
Vagina
Whole Blood

Supplemental Table 2.10: The components of the TCA cycle. The 30 genes encoding for TCA cycle related proteins. Available at https://genome.cshlp.org/content/29/4/532/suppl/DC1

Supplemental Table 2.11: Common traits/diseases whose heritability enrichment in EM regulatory regions was examined. The 29 common traits for which stratified LD-score regression was performed. The table includes the sample size for each GWAS, as well as links to the summary statistics. Available at https://genome.cshlp.org/content/29/4/532/suppl/DC1

Supplemental Table 2.12: The protein components of the ribosome. 80 genes encoding for protein components of the ribosome. Available at https://genome.cshlp.org/content/29/4/532/suppl/DC1


2.6.3 Supplemental Figures

[Figure: pLI (axis ticks at 0.1 and 0.9) of the subunits of each complex. Complexes shown: BAF, PBAF, CHRACH, NURF, NURD, INO80, SRCAP, TRRAP, STAGA, PCAF, TFTC, KMT2A/B, KMT2C/D, KMT2F/G, PRC2, cPRC1, ncPRC1, COREST, SWI IND 3.]

Supplemental Figure 2.7: The pLI scores of members of EM protein complexes. The pLI scores of EM (red points) and accessory (black points) subunits, depicted separately within each of 19 EM protein complexes (Methods). The grey area indicates genes with a pLI > 0.9.


[Figure: pLI (axis ticks at 0.1 and 0.9) by substrate specificity group. Groups shown: H3K4 meth writers, H3K4 meth erasers, H3K27 meth writers, H3K27 meth erasers, H3K36 meth writers, H3K36 meth erasers, H3K9 meth writers, H3K9 meth erasers, H3K27 ac writers, H3K9 ac erasers.]

Supplemental Figure 2.8: pLI for EM genes with same substrate specificity. The pLI scores of EM genes grouped according to amino-acid substrate specificity, for the EM genes where this specificity is well defined. Only genes on autosomes are included.

[Figure: number of constrained domains (SET domain versus C2H2) for PRDM2, PRDM1, PRDM4 and PRDM16.]

Supplemental Figure 2.9: The C2H2 zinc fingers are the main drivers of the mutational constraint of the PRDM family. Depicted are the four members of the PRDM family with high pLI, that contain both a SET domain as well as C2H2 zinc fingers. (In total, there are 15 PRDM members that are EM genes. 5 have a high pLI, and for one of them (PRDM10), the SET domain is not annotated by the Pbase package (see also Methods).)


[Figure: (A) CCR local constraint for high pLI and low pLI genes. (B) Mean difference in CCR local constraint for high pLI and low pLI genes. (C) Range of CCR local constraint for the following domains: Chromodomain; Zinc finger, PHD type; Tudor domain; PWWP domain; Mbt repeat; WD40 repeat; Ankyrin repeat; Bromodomain; Zinc finger, CXXC type.]

Supplemental Figure 2.10: The protein domains known to mediate epigenetic functions drive the observed constraint of EM genes. (A) The CCR local constraint of all EM-specific protein domains for high pLI (> 0.9) EM genes (grey points) and low pLI (< 0.1) EM genes (black points). The two groups are significantly different (p < 2.2 × 10^−16, one-sided Wilcoxon rank-sum test). (B) The distribution of within-gene differences in the average CCR local constraint of EM-specific domains minus the average CCR local constraint of non EM-specific domains for high pLI EM genes (grey box; paired t-test, p = 0.02) and low pLI EM genes (black box; paired t-test, p = 0.25). (C) EM reader domains that appear in more than 1 copy within the same gene show within-gene variability in CCR local constraint (each point corresponds to the range of CCR local constraint scores for the different copies of a domain within the same gene; only data for high pLI EM genes are shown).


[Figure: CCR local constraint (0 to 100) per gene, one plot per domain. (A) Ankyrin repeat; Bromodomain; Chromo domain; Mbt repeat; PWWP domain; Tudor domain; WD40 repeat; Zinc finger, CXXC type; Zinc finger, PHD type. (B) AT hook DNA binding motif; B box type zinc finger; BRK domain; Homeodomain like; Leucine rich repeat cysteine containing subtype; Lysine specific demethylase like domain; Protein of unknown function DUF3776; SANT/Myb domain; Zinc finger C2H2 like.]

Supplemental Figure 2.11: Identical copies of protein domains show within-gene variability in constraint. Each plot corresponds to a domain, and the points therein are the CCR local constraint scores for the different copies of the same domain within each gene. Only genes with more than one copy of the particular domain are shown. (A) EM-specific domains. (B) non EM-specific domains in EM genes.


[Figure: (A) Density of the tissue specificity score for EM genes, TF genes and all other genes. (B) Density of the tissue specificity score for EM genes and TCA cycle genes. (C) Tissue specificity score of EM genes in testis, ovary and other tissues, with the following testis specificity scores: PRDM9 1, PRDM13 1, PRDM14 1, CDY2A 1, BRDT 0.87, RNF17 0.86, HDGFL1 0.81, PRDM7 0.77, MORC1 0.75, TDRD15 0.67, TDRD1 0.52. (D) Median RPKM across tissues 1-28 for EM genes, TF genes and all other genes.]

Supplemental Figure 2.12: The components of the epigenetic machinery are expressed in a highly non tissue-specific manner and at high levels across tissues. (A) The distribution of the tissue specificity score of EM genes (red curve) reveals their lack of tissue-specific expression, compared to TF genes (green curve), and all other genes (blue curve). (B) Comparison of the tissue specificity of EM genes (red curve) with that of genes encoding for tricarboxylic acid (TCA) cycle related proteins (black curve) shows that EM genes exhibit comparable tissue specificity to this class of well known housekeeping genes. (C) Testis is the sole tissue for which some EM genes have high specificity. (D) A comparison of expression levels of EM genes (red boxes) to those of TF genes (green boxes) and all other genes (blue boxes) shows their high relative expression. Each box shows the inter-quartile range of expression values, and tissues are ordered according to median expression for EM genes.


[Figure: density of pLI for EM genes, TF genes and all other genes.]

Supplemental Figure 2.13: The pLI distributions of genes with low tissue specificity. Density plots of pLI scores for genes with low tissue specificity score (< 0.1) highlight that the enrichment of EM genes (red curve) in the highly intolerant category (pLI > 0.9, gray shaded area) compared to TF genes (green) and other genes (blue) remains very pronounced even after considering only broadly expressed genes.


[Figure: scatterplots of expression levels. (A) SRCAP versus KMT2D. (B) EZH2 versus KMT2D.]

Supplemental Figure 2.14: EM genes show inter-individual variability in their expression levels. Data from subcutaneous adipose tissue from GTEx on 348 individuals. (A) Scatterplot of the expression levels of two EM genes (KMT2D and SRCAP) whose expression across individuals is highly correlated. (B) Scatterplot of the expression levels of two EM genes (KMT2D and EZH2) whose expression across individuals is uncorrelated.


[Figure: median log2 (RPKM) across tissues. (A) Highly co-expressed, co-expressed and not co-expressed EM genes. (B) EM genes and random genes (expression level ~ EM genes). (C) EM genes and random genes (expression level > EM genes).]

Supplemental Figure 2.15: Expression levels of EM genes. Expression (log(RPKM + 1)) for various groups of genes. (A) We categorize EM genes into 3 groups based on co-expressed module patterns across tissues (Figure 2.4). EM genes which are highly co-expressed or co-expressed have a higher expression level than EM genes which are not co-expressed. The former two categories show similar expression levels. (B) The expression level of EM genes compared to 11,963 genes where the median expression in each tissue is greater than 0.5 in more than half the tissues. The two groups of genes have similar expression level. We say these genes are similarly expressed to the EM genes. (C) The expression level of EM genes compared to 5,095 genes where the median expression in each tissue is greater than 3 in more than half the tissues. The latter group of genes are expressed at higher levels than the EM genes.


[Figure: densities of the number of genes with at least a given number of module partners, for random genes (reference distributions) with the observed value for EM genes marked. (A) Number of genes with >= 75 module partners; random genes with expression level ~ EM genes and random genes with expression level > EM genes. (B) Number of genes with >= 15 module partners. (C) Four panels: tissue cutoff = 7, gene cutoff = 50; tissue cutoff = 7, gene cutoff = 75; tissue cutoff = 14, gene cutoff = 50; tissue cutoff = 14, gene cutoff = 30.]

Supplemental Figure 2.16: EM genes are highly co-expressed irrespective of arbitrary choices. We examine the sensitivity, with respect to various cutoffs, of the result that EM genes are highly co-expressed. (A) Like Figure 2.4D, but with an additional reference distribution where random genes are selected to have higher expression level than EM genes (Supplemental Fig. 2.15). (B) Like Figure 2.4D but where we consider our observation that we have 157 EM genes which are either highly co-expressed or co-expressed. (C) Like Figure 2.4D, but for various choices of arbitrary cutoffs. Specifically we vary (1) in how many tissues two genes need to belong to the same module, to be considered partners (“tissue cutoff”, in the main text we use 10) and (2) how many module partners a gene needs to have, to be considered part of the highly co-expressed group (“gene cutoff”, in the main text we use 75).


[Figure: densities for EM genes and random genes (expression level ~ EM genes) in nine tissues: Adipose Subcutaneous, Adrenal Gland, Artery - Tibial, Brain Cerebellum, Brain Cortex, Breast Mammary Tissue, Colon Transverse, Esophagus Mucosa, Heart Left Ventricle. (A) Size of maximally connected component. (B) Average node degree.]

Supplemental Figure 2.17: Size and average degree of the maximally connected component in different tissues. We estimated tissue-specific networks by thresholding the correlation matrix. For each tissue we computed the maximally connected component of the EM genes. As comparison we did the same for 300 random samples of 270 genes with a similar expression level to the EM genes. Depicted are results from 9/28 tissues (same tissues as in Supplemental Fig. 2.23). (A) The size of the maximally connected component of the EM genes compared to the random samples. (B) The average node degree within the maximally connected components.


[Figure: number of partners (y axis) per EM category (x axis); categories combine DNAm, Hist meth and Hist ac with Writers, Erasers and Readers, plus Remodelers. Panels (A) and (B).]

Supplemental Figure 2.18: Dual function EM writers partner with multiple other EM categories. (A) Each point corresponds to a dual function histone methyltransferase, and the y axis depicts the number of its partners belonging to a given EM category (different positions on the x axis correspond to different categories). (B) Same as (A), but for dual function histone acetyltransferases.


[Figure: number of partners (y axis) for genes ordered by number of partners (x axis). (A) TFs and random genes. (B) Kinases/phosphatases and random genes.]

Supplemental Figure 2.19: Transcription Factors (TFs) and Protein Kinases/Phosphatases are not significantly co-expressed. (A) Each green dot corresponds to a TF, and its position along the y axis corresponds to the number of other TFs that it partners with. The TFs are ordered on the x axis according to the number of their partners. Blue dots correspond to randomly chosen genes, sampled from genes with a median expression (log(RPKM + 1)) greater than 0.1 and less than 2.8 in at least half the tissues, to match the expression of TFs. Each random set contained 915 genes. (B) As (A), but for protein kinases/phosphatases (yellow dots). Each random set of genes contained 395 genes, sampled from genes with median expression (log(RPKM + 1)) greater than 0.1 in at least half of the tissues.


[Figure: co-expressed EM distances (Mb) by chromosome (1 to 22, X, Y). Panels (A) and (B).]

Supplemental Figure 2.20: Co-expressed EM genes are not spatially clustered. The pairwise chromosomal distances (number of bp separating the transcription end site of a gene with the transcription start site of the most proximal downstream gene) between EM genes. Top panel are distances greater than 1 Mb and bottom panel less than 1 Mb. (A) Highly co-expressed EM genes (n = 74). (B) Co-expressed EM genes (n = 83). Both groups of genes are approximately uniformly distributed across chromosomes (chi-squared goodness of fit test, p = 1 based on simulation for both groups).


Supplemental Figure 2.21: The promoters of the highly co-expressed EM genes are bound by common regulatory factors. (A) The distributions of expression levels in K562 cells for genes that were used to derive the reference distribution in (B). (B) The distribution of the number of regulatory factors that show enriched binding (log2 OR > 1 and p < 0.05) at the promoters of the first group created after randomly sampling genes in K562 cells (black curve), versus the observed number of regulatory factors showing enriched binding at the promoters of highly co-expressed EM genes (orange vertical line). (C) The distribution of the number of regulatory factors that show enriched binding (log2 OR > 1 and p < 0.05) at the promoters of the first group created after shuffling the labels of EM genes in K562 cells (black curve), versus the observed number of regulatory factors showing enriched binding at the promoters of highly co-expressed EM genes (orange vertical line).


Supplemental Figure 2.22: The lack of tissue specificity of EM genes is not driven by unwanted variation. Results for the tissue specificity analyses after correcting for RIN and surrogate variables (Methods). (A) Like Supplemental Fig. 2.13. (B) Like Supplemental Fig. 2.12A. (C) Like Figure 2.12B. (D) Like Supplemental Fig. 2.12C.


Supplemental Figure 2.23: Removing noise in co-expression analysis by removing principal components. We remove unwanted variation in our co-expression analysis by removing 4 principal components from the expression matrix in all tissues. (A) The distribution of pairwise correlations between randomly sampled genes, serving as a negative control, for 9 out of the 28 tissues. (B) The distribution of pairwise correlations between 80 genes coding for ribosomal proteins, which serve as a positive control, for the same tissues as in (A).

Supplemental Figure 2.24: Highly constrained EM genes are also more highly expressed. Density plots of log2(RPKM) values for EM genes with pLI > 0.9 vs. those of EM genes with pLI < 0.9 show that the former exhibit higher expression levels.


Supplemental Figure 2.25: Regulatory regions of EM genes are enriched for explained variation for some neurological traits: significance. We performed an LDSC analysis for each of the traits listed in the figure. For each trait we included two different sets of features: regulatory regions for all EM genes (orange) and regulatory regions only for highly co-expressed EM genes (green). As baseline features we included the standard set of LDSC features including conserved regions and coding regions (Methods). For each feature and each trait we computed a coefficient z-score which is a test statistic for whether the feature is significantly enriched for heritability associated with the trait. Filled circles are trait-feature combinations which are significant at 5% after correcting for multiple testing across all feature-trait combinations.

Chapter 3

Promoter CpG density predicts genic loss-of-function intolerance

3.1 Preface

This chapter is available on bioRxiv as:

Boukas et al., bioRxiv 2020. doi: https://doi.org/10.1101/2020.02.15.936351.

3.2 Introduction

A powerful way of gaining insight into a gene’s contribution to organismal homeostasis is by studying the fitness effect exerted by loss-of-function (LoF) variants in that gene. Fully characterizing this effect is challenging, as it requires estimation of both the selection coefficient for individuals with biallelic LoF variants and the dominance coefficient [97, 98]. However, recent studies based on the joint processing and analysis of large numbers of exome sequences have developed metrics which serve as approximations to genic LoF-intolerance in humans [23, 99, 100]. These metrics correlate with several properties indicative of LoF-intolerance (such as enrichment for known haploinsufficient genes; 23, 100), and can substantially help in the assignment of pathogenicity to novel variants encountered in patients, as recommended by the American College of Medical Genetics and Genomics [101].

At the core of all these metrics is a comparison of the observed to the expected number of LoF variants. Hence, genes where the latter is small (e.g. due to small coding sequence length or low mutation rate) will not be amenable to this approach until the sample sizes become much larger than they presently are. Currently in gnomAD, the largest such effort with publicly available constraint data based on 125,748 exomes, approximately 28% of genes lack reliable LoF-intolerance estimates [100]. It has been estimated that even with 500,000 individuals, the discovery of LoF variants will remain far from saturation, with potentially a sizeable fraction of genes still difficult to ascertain [34].

The cardinal feature of highly LoF-intolerant genes, i.e. genes depleted of even monoallelic LoF variants in healthy individuals, is dosage sensitivity; a gene copy containing one or more LoF variants produces mRNAs that are typically degraded via nonsense-mediated decay [102,103]. Therefore, the deleterious effects of LoF variants in these genes are often mediated through a reduction of the normal amount of mRNA used for protein production. This, in turn, implies that studying the characteristics of regulatory elements controlling the expression of highly LoF-intolerant genes has the potential to yield two important benefits [104, 105]. First, it can highlight the features of the most functionally important regulatory elements in the human genome. Second, such features can then provide the basis for predictive models of LoF-intolerance, which can be applied to unascertained genes.

Figure 3.1: The relationship between promoter CpG density and downstream gene loss-of-function intolerance. (a) The distribution of genic LOEUF (as provided by gnomAD) in each decile of promoter CpG density. The vertical line corresponds to the cutoff for highly LoF-intolerant genes (LOEUF < 0.35). (b) Odds ratios and the corresponding 95% confidence intervals, quantifying the enrichment for highly LoF-intolerant genes (LOEUF < 0.35) that is exhibited by the set of genes in each decile of promoter CpG density. For each of the other deciles, the enrichment is computed against the 10th decile. The horizontal line corresponds to zero enrichment. In both (a) and (b), CpG density deciles are labeled from 1-10 with 1 being the most CpG-poor and 10 the most CpG-rich decile.

In promoters, one sequence feature that has been extensively studied is CpG density. A large number of mammalian promoters harbor CpG islands [1, 106], which typically remain constitutively unmethylated in all cell types [107,108]. Recently, it has been shown that clusters of unmethylated CpG dinucleotides are recognized by CxxC-domain containing proteins [32,109], thereby facilitating the deposition of transcription-associated marks such as H3K4me3 [110–112]. Additionally, there is now evidence that unmethylated CpGs surrounding transcription factor (TF) motifs may contribute to promoter activity by also increasing the probability that the cognate TFs will bind [113,114].


3.3 Results

3.3.1 Promoter CpG density is strongly and quantitatively associated with downstream gene LoF-intolerance

We discovered a strong relationship between the observed-to-expected CpG ratio (hereafter referred to as CpG density) of a promoter, and LoF-intolerance of the downstream gene (Figure 3.1a, b); high CpG density is associated with high LoF-intolerance. To establish this, we used the LOEUF metric provided by gnomAD, an updated and more accurate measure of genic LoF-intolerance compared to pLI [100]. LOEUF places human genes on a 0-to-2 continuous scale, with lower values indicating higher LoF-intolerance. Following previous work [115], we classified genes with LOEUF < 0.35 as highly LoF-intolerant.

In [100], genes with ≤ 10 expected LoF variants were found to have unreliable LOEUF estimates. Based on additional assessment (Supplemental Figure 3.6; Methods), we here adopted a more stringent threshold and considered 8,506 genes with ≥ 20 expected LoF variants. We further filtered this set down to 4,743 genes for which we could reliably determine the canonical promoter (Supplemental Figure 3.7; Methods; Supplemental Figure 3.8 contains a schematic of our approach to partitioning genes according to the reliability of their LOEUF estimate and promoter annotation).
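As an illustration of the CpG density measure used throughout this section, the following is a minimal sketch of an observed-to-expected CpG ratio for a single sequence. It assumes the commonly used formula (number of CpGs × sequence length) / (number of Cs × number of Gs); the exact definition and promoter interval used in this chapter are given in the Methods, and the function name and toy sequences below are hypothetical.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed-to-expected CpG ratio of a DNA sequence.

    Assumed formula (not taken from the Methods):
        (#CpG * L) / (#C * #G), where L is the sequence length.
    Returns 0.0 when the sequence has no C or no G.
    """
    s = seq.upper()
    n_c = s.count("C")
    n_g = s.count("G")
    n_cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if n_c == 0 or n_g == 0:
        return 0.0
    return n_cpg * len(s) / (n_c * n_g)


# Toy sequences, for illustration only.
cpg_rich = "CGCGCGTACGCGGCGC"
cpg_poor = "CATGCATGCAGTCAGT"
print(cpg_obs_exp(cpg_rich), cpg_obs_exp(cpg_poor))
```

Under this definition, a sequence with the same C and G content but fewer CpG dinucleotides receives a lower score, which is what makes the measure a ratio rather than a raw count.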

Figure 3.2: The relationship between promoter CpG density and loss-of-function intolerance conditional on downstream gene expression level and tissue specificity (τ). (a) The distribution of LOEUF, stratified by promoter CpG density (top 25%, in between, bottom 25%), in each quartile of downstream gene expression level, computed using the GTEx dataset (Methods). (b) The distribution of LOEUF, stratified by promoter CpG density, in each quartile of downstream tissue specificity. For each gene, tissue specificity is quantified by τ, and is computed using the GTEx dataset (Methods). For both (a) and (b) quartiles are labeled from 1-4, with 1 being the quartile with the lowest and 4 the quartile with the highest expression/tissue specificity, respectively. (c) The percentage of LOEUF variance (adjusted r^2) that is explained by downstream gene expression level, τ, the interaction between the two, and promoter CpG density.

When ranked according to the CpG density of their promoter, genes in the top 10% have a 67.2% probability of being highly LoF-intolerant. This is in contrast to 7.4% for genes in the bottom 10%, yielding a 25.6-fold enrichment (p < 2.2 · 10^−16; Figure 3.1b). We note that there is a continuous gradient of enrichment across CpG density deciles (Figure 3.1b). When splitting genes into just two groups, consisting of those with CpG island-overlapping promoters, and those without, we found that the enrichment for highly LoF-intolerant genes in the CpG-island-overlapping group is markedly weaker (odds ratio = 3.71, p < 2.2 · 10^−16), showing that this dichotomy masks the more continuous nature of CpG density. Finally, regression modeling revealed that CpG density alone can explain 19.3% of the variation in LOEUF (p < 2.2 · 10^−16, β = −1.02; Supplemental Figure 3.9; Methods), and that its effect on LOEUF is essentially unchanged when accounting for coding sequence length (p < 2.2 · 10^−16, β = −1.00).
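The fold enrichment between the top and bottom CpG density deciles can be read as an odds ratio, which is the quantity plotted in Figure 3.1b. The short sketch below recomputes it from the two rounded probabilities quoted above; the function names are hypothetical, and because the inputs are rounded percentages the result is only approximate.

```python
def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


def odds_ratio(p_group: float, p_reference: float) -> float:
    """Odds ratio of a group relative to a reference group."""
    return odds(p_group) / odds(p_reference)


# Probabilities of being highly LoF-intolerant, as quoted in the text.
p_top_decile = 0.672
p_bottom_decile = 0.074

enrichment = odds_ratio(p_top_decile, p_bottom_decile)
print(f"{enrichment:.1f}-fold enrichment")
```

Note that the ratio of the two probabilities themselves (about 9) is much smaller than the odds ratio, so it matters which of the two is being reported.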

We emphasize that our result remains pronounced even when we omit the filtering for high-confidence promoters, and merely consider all canonical promoters of genes with ≥ 20 expected LoF variants (p < 2.2 · 10^−16; Supplemental Figure 3.10). However, the association becomes weaker (14.6-fold enrichment of highly LoF-intolerant genes in the top CpG density decile), underscoring the importance of accurate promoter annotation. We also verified that the exact definition of the promoter (in terms of the size of the interval around the TSS) has only a small impact on the strength of the relationship between CpG density and LOEUF (Supplemental Figure 3.11).


3.3.2 The association between CpG density and LoF-intolerance is not mediated through expression level or tissue specificity

The more LoF-intolerant a gene is, the more broadly it tends to be expressed across tissues, and at higher levels [23, 100]. Even though it is well established that promoter CpG density is associated with these two properties as well [114, 116, 117], we found that neither variable explains our result (Figure 3.2, Supplemental Figure 3.12). First, after stratifying genes according to either expression level or tissue specificity (using RNA-seq data from the GTEx consortium; Methods), we saw a clear relationship between promoter CpG density and LOEUF within each stratum (Figure 3.2a, b). Second, the effect of CpG density on LOEUF is almost equally strong when adjusting for either expression level or tissue specificity (regression β = −1.00 and −0.85, respectively, p < 2.2 · 10^−16 for both regression models). Third, even the combination of the two expression properties explains less LOEUF variance than CpG density by itself (Figure 3.2c).
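The second and third arguments above rest on two standard regression quantities: the coefficient of CpG density after adjusting for a covariate, and the adjusted r^2 of competing models. The sketch below shows how both can be computed with ordinary least squares. It is not the analysis code of this chapter: the data are synthetic (simulated so that LOEUF depends on CpG density only), and all variable names are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept.

    Returns (coefficients including intercept, adjusted R^2).
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, float(adj_r2)


# Synthetic data: expression level and tau are correlated with CpG
# density, but LOEUF is generated from CpG density alone.
rng = np.random.default_rng(0)
n = 2000
cpg = rng.uniform(0.1, 1.0, n)
expr = 0.5 * cpg + rng.normal(0.0, 0.3, n)
tau = -0.5 * cpg + rng.normal(0.0, 0.3, n)
loeuf = 1.5 - 1.0 * cpg + rng.normal(0.0, 0.4, n)

# Effect of CpG density alone, and after adjusting for expression level.
beta_alone, r2_cpg = fit_ols(cpg, loeuf)
beta_adjusted, _ = fit_ols(np.column_stack([cpg, expr]), loeuf)

# Variance explained by expression level, tau and their interaction.
_, r2_expr = fit_ols(np.column_stack([expr, tau, expr * tau]), loeuf)

print(beta_alone[1], beta_adjusted[1], r2_cpg, r2_expr)
```

In this simulated setting the CpG coefficient barely moves after adjustment and the expression-only model explains less variance, which is the pattern the text reports for the real data.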

3.3.3 Regulatory factor binding at promoters provides information about LoF-intolerance which adds to CpG density

We next turned our attention to the fraction of LOEUF variation (80.7%) that remains unexplained by CpG density. We hypothesized that part of it might be explained by preferential binding of specific regulatory factors at LoF-intolerant gene promoters. Since a comprehensive assessment of this is currently out of reach (due to the lack of extensive genome-wide binding data for most regulatory factors), we focused on EZH2 as a proof-of-principle. EZH2 is a relatively well-characterized histone methyltransferase that specifically localizes to CpG islands of non-transcribed genes (Figure 3.3a, Supplemental Figure 3.13; 118, 119).

Figure 3.3: The loss-of-function intolerance of tissue-specific genes conditional on high promoter CpG-density and promoter EZH2 binding. (a) The median number of ENCODE ChIP-seq experiments (out of 14 total) where an EZH2 peak is detected, shown separately for tissue-specific (τ > 0.6) and broadly expressed (τ < 0.6) genes, within each quartile of promoter CpG density. The quartiles are labeled from 1-4, with 1 being the most CpG-poor and 4 the most CpG-rich. (b) The LOEUF distributions of tissue-specific genes with high-CpG-density (top 25%) promoters, stratified according to whether their promoters show EZH2 peaks in at least 2 ENCODE experiments, or in fewer than 2 experiments.

We discovered that tissue-specific genes with CpG-dense and EZH2-bound promoters (EZH2 binding in ≥ 2 ENCODE experiments) have lower LOEUF compared to their EZH2-unbound counterparts (Figure 3.3b; regression β = −5.66, p = 5.21 · 10^−8, for the interaction between CpG density and EZH2 binding, conditional on tissue specificity τ > 0.6). In this subset of promoters, the interaction of EZH2 binding with CpG density explains an additional 27.1% of LOEUF variance on top of what CpG density explains (2.1%).

Figure 3.4: Training and assessing predLoF-CpG: a predictor of loss-of-function intolerance based on CpG density. (a) The percentage of LOEUF variance (adjusted r^2) that is explained by CpG density, exonic or promoter conservation, and their combinations. (b) The out-of-sample performance of predLoF-CpG. Shown are the LOEUF distributions of 1,743 genes belonging to the holdout test set (which consists of genes with reliable LOEUF estimates), stratified according to their classification as highly LoF-intolerant or not. The dashed vertical line corresponds to the cutoff for highly LoF-intolerant genes (LOEUF < 0.35). (c) The precision (y axis, left column) and negative predictive value (y axis, right column) plotted against the number of correctly classified genes (x axis), for different predictors of loss-of-function intolerance (predLoF-CpG compared with Huang et al., Steinberg et al., and Han et al.). Each point corresponds to a threshold. The thresholds span the [0,1] interval, with a step size of 0.05. We note that because we are using two classification thresholds, a ROC curve would not be an appropriate evaluation metric here.


Figure 3.5: Using predLoF-CpG to classify currently unascertained genes as highly loss-of-function intolerant or not. (a) The distribution of point estimates of the observed/expected proportions of LoF variants. Genes are stratified according to their classification as highly LoF-intolerant or not. (b) The proportion of promoters which harbor deletions in a sample of 14,891 healthy individuals. Promoters are stratified according to downstream gene classification as highly LoF-intolerant or not. (c) The distribution of the size of deletions harbored by promoters in a sample of 14,891 healthy individuals. Promoters are stratified according to downstream gene classification as highly LoF-intolerant or not. (d) Odds ratios and the corresponding 95% confidence intervals quantifying the enrichment for genes in each of the x-axis groups (mouse heterozygous lethal genes, transcription factors, olfactory receptors) that is exhibited by genes predicted as highly LoF-intolerant by predLoF-CpG. The enrichment is computed against genes predicted as non-highly LoF-intolerant. The horizontal line at 1 corresponds to zero enrichment.

3.3.4 Promoter CpG density with promoter and exonic across-species conservation can collectively predict LoF-intolerance with high accuracy

We then sought to develop a predictive model for LoF-intolerance, with the goal of providing high-confidence predictions for the 2,430 genes with currently unreliable LOEUF scores and reliable promoter annotation. Specifically, we aimed to classify genes as highly LoF-intolerant (LOEUF < 0.35) or not.

To build our model, we first separately computed the promoter and exonic across-species conservation for each gene (using the PhastCons score; Methods), and asked if they provide information about LOEUF complementary to CpG density. We found this to be true (Figure 3.14c and Supplemental Figure 3.14a,b); notably, CpG density explains at least as much LOEUF variance as exonic or promoter conservation (Figure 3.4a). When all three metrics are combined, 33.4% of the total LOEUF variation is explained (Figure 3.4a). We note that while EZH2 binding explains a substantial amount of LOEUF variance when considering tissue-specific genes with high-CpG-density promoters, these are a small subset. Hence, inclusion of this feature only minimally increases the overall explained variance (0.4% increase). We therefore settled on training a logistic regression model with CpG density, and promoter/exonic conservation as three linear predictors. As our training set we used 3,000 genes, randomly selected from the 4,743 with high-confidence LOEUF estimates.

Our predictor, which we called predLoF-CpG (predictor of LoF-intolerance based on CpG density), showed strong out-of-sample performance on the test set of the remaining 1,743 genes. The precision (positive predictive value) was 82.6% at the 0.75 prediction probability cutoff, and the negative predictive value was 88.4% at the 0.25 cutoff (Figure 3.4b); 144 genes were predicted to be highly LoF-intolerant, 753 were predicted as non-highly LoF-intolerant, and 806 (47.3%) were left unclassified. We chose to use two thresholds instead of one, at the expense of leaving a fraction of genes unclassified, since this endows our predictor with precision and negative predictive value high enough to be useful in the clinical setting. We note that our predictive accuracy is comparable to that of widely adopted tools for predicting damaging missense variants [120, 121]. Further examining our out-of-sample classifications, we found that a) the genes falsely predicted as highly LoF-intolerant had a median LOEUF of 0.49, indicating that at least half of them are very LoF-intolerant despite not meeting the 0.35 cutoff, and b) the genes correctly predicted as non-highly LoF-intolerant had a median LOEUF of 0.86, suggesting that at least half of them are likely to tolerate biallelic inactivation as well (Figure 3.4b).
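The two-threshold scheme described above can be sketched as follows: predicted probabilities at or above the upper cutoff are called highly LoF-intolerant, those at or below the lower cutoff are called non-highly LoF-intolerant, and everything in between is left unclassified. Precision and negative predictive value are then computed only over the classified genes. The function names and the toy probabilities are hypothetical, and whether the cutoffs are inclusive is an assumption made here for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def classify(prob: float, hi: float = 0.75, lo: float = 0.25) -> Optional[bool]:
    """True = highly LoF-intolerant, False = non-highly, None = unclassified."""
    if prob >= hi:
        return True
    if prob <= lo:
        return False
    return None


def precision_npv(
    probs: Sequence[float], truth: Sequence[bool], hi: float = 0.75, lo: float = 0.25
) -> Tuple[float, float, float]:
    """Precision, negative predictive value, and fraction left unclassified."""
    calls: List[Optional[bool]] = [classify(p, hi, lo) for p in probs]
    pos = [t for c, t in zip(calls, truth) if c is True]
    neg = [t for c, t in zip(calls, truth) if c is False]
    precision = sum(pos) / len(pos) if pos else float("nan")
    npv = sum(not t for t in neg) / len(neg) if neg else float("nan")
    unclassified = sum(c is None for c in calls) / len(calls)
    return precision, npv, unclassified


# Toy predicted probabilities and true labels (True = LOEUF < 0.35).
toy_probs = [0.9, 0.8, 0.8, 0.6, 0.5, 0.3, 0.2, 0.1, 0.1, 0.05]
toy_truth = [True, True, False, True, False, False, False, False, True, False]

precision, npv, unclassified = precision_npv(toy_probs, toy_truth)
print(precision, npv, unclassified)
```

Moving the two cutoffs further apart trades coverage for confidence: fewer genes receive a call, but the calls that are made are more reliable.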

Regardless of the choices for the two classification thresholds, predLoF-CpG outperforms all of the previously published predictors of LoF-intolerance (Figure 3.4c). Specifically, all models have comparable and high negative predictive value, with ours being slightly superior (Figure 3.4c). However, within a range of thresholds that yield high precision, as would be required for use in clinical decision making, predLoF-CpG provides clear gain versus the rest (Figure 3.4d, upper left area of left column plots). As an additional evaluation, we found that predLoF-CpG is capable of explaining a greater proportion of out-of-sample LOEUF variance compared to the other three (Supplemental Figure 3.15).

Finally, we mention GeVIR, a recently developed metric (primarily for intolerance to missense, but also useful for LoF variation; 122) which identifies regions depleted of protein-altering variation, and weights these regions by conservation within each gene. As expected given its dependency on observed variation, GeVIR exhibits substantial correlation with the expected number of LoF variants (Spearman correlation = 0.42 vs 0.26 for predLoF-CpG). This limits its applicability to genes with unreliable LOEUF, even though the weighting step slightly alleviates this issue compared to LOEUF (Spearman correlation = 0.49).

3.3.5 32.5% of currently unascertained genes in gnomAD receive high-confidence predictions by predLoF-CpG

We applied predLoF-CpG to genes with unreliable LOEUF estimates in gnomAD. After filtering for those with high-confidence promoter annotation, we retained 2,430 (out of 5,413). Of these, 104 were classified as highly LoF-intolerant, 1,656 as non-highly LoF-intolerant and 670 were left unclassified (Supplemental Table 3.1). We first examined the ratio of observed-to-expected LoF variants in these genes. Even though these point estimates are uncertain, there is a clear difference in the distribution of the point estimates between genes we classify as highly LoF-intolerant (median = 0.14) and those we classify as not (median = 0.70), with the difference being in the expected direction (Figure 3.5a; Wilcoxon test, p < 2.2 · 10^−16).
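The gene counts quoted above can be checked against the 32.5% figure in the section heading with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are hypothetical.

```python
# Counts quoted in the text for genes with unreliable LOEUF estimates.
unascertained_total = 5413          # all genes with unreliable LOEUF
predicted = {
    "highly_intolerant": 104,
    "non_highly_intolerant": 1656,
    "unclassified": 670,
}

# Genes retained after filtering for high-confidence promoter annotation.
with_reliable_promoter = sum(predicted.values())

# Genes that receive a high-confidence call in either direction.
confident_calls = predicted["highly_intolerant"] + predicted["non_highly_intolerant"]

fraction_of_unascertained = confident_calls / unascertained_total
print(with_reliable_promoter, confident_calls, f"{fraction_of_unascertained:.1%}")
```

The percentage is therefore taken relative to all 5,413 unascertained genes, not relative to the 2,430 with reliable promoter annotation.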


Next, to provide orthogonal support for our predictions, we leveraged a set of 175,716 deletions detected in 14,891 healthy individuals using whole-genome sequencing (Methods) [123]. We reasoned that LoF-intolerant gene promoters should be depleted of such deletions; when they do harbor deletions, these should be small. By only considering promoters, we ensured that our assessment is not dependent on gene length, which confounds LOEUF estimation. Using the 4,743 genes with high-confidence LOEUF (from the training and test sets), we first observed that low LOEUF is indeed associated with the presence of both fewer (p = 2.39 · 10^−15) and smaller (p < 2.2 · 10^−16) promoter deletions (Supplemental Figure 3.16a, b), showing that this is a legitimate assessment strategy. Turning to our predictions, we found the same: genes predicted to be highly LoF-intolerant are less likely to contain deletions in their promoters compared to genes classified as non-highly LoF-intolerant (Figure 3.5b; probability of overlapping at least one deletion = 0.18 vs 0.33, permutation one-sided p = 4 · 10^−4 after 10,000 permutations); when such deletions are observed, they tend to be much smaller (Figure 3.5c; median size = 129 vs 1092; Wilcoxon test, p = 4.49 · 10^−5).
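A one-sided permutation test of the kind reported above for promoter deletion overlap can be sketched as follows. The group labels are shuffled repeatedly, and the p-value is the fraction of shuffles in which the difference in overlap proportions is at least as extreme as the observed one. This is an illustration under stated assumptions, not the procedure of the Methods: the input vectors are synthetic (chosen only to mimic the 0.18 vs 0.33 proportions), the add-one correction in the p-value is a choice made here, and all names are hypothetical.

```python
import numpy as np


def permutation_pvalue(group_a, group_b, n_perm=10_000, seed=0):
    """One-sided permutation p-value for mean(group_a) < mean(group_b).

    group_a, group_b: 0/1 indicators (1 = promoter overlaps a deletion).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    as_extreme = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[: a.size].mean() - perm[a.size:].mean()
        if diff <= observed:
            as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (as_extreme + 1) / (n_perm + 1)


# Synthetic indicators: 18 of 100 vs 330 of 1000 promoters with a deletion.
predicted_intolerant = np.array([1] * 18 + [0] * 82)
predicted_tolerant = np.array([1] * 330 + [0] * 670)

p_value = permutation_pvalue(predicted_intolerant, predicted_tolerant)
print(p_value)
```

Because the test only permutes labels, it makes no distributional assumption about the overlap indicator, which is convenient when the two groups differ greatly in size.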

Finally, we found that our predictions are in strong agreement with what would be expected based on known mouse phenotypes, and membership in specific gene classes (Figure 3.5d). First, the predicted highly LoF-intolerant genes show a 27.6-fold enrichment for genes heterozygous lethal in mouse (p = 1.03 · 10^−12), when compared against those predicted as non-highly LoF-intolerant. Second, they exhibit a 12.7-fold enrichment for transcription factors (p < 2.2 · 10^−16), consistent with the known dosage sensitivity of these genes [26, 27, 124]. Third, they show a total depletion (odds ratio = 0) of olfactory receptor genes (p = 2.5 · 10^−5).
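Enrichments of this kind are odds ratios from a 2×2 table (predicted class by membership in the gene set). The sketch below computes an odds ratio with a Wald-type 95% confidence interval on the log scale, and shows why a total depletion gives an odds ratio of exactly 0 with no Wald interval. The choice of interval is an assumption made for illustration, since the text does not state here how the intervals in Figure 3.5d were obtained, and the counts are invented.

```python
import math
from typing import Optional, Tuple


def odds_ratio_ci(
    a: int, b: int, c: int, d: int, z: float = 1.96
) -> Tuple[float, Optional[Tuple[float, float]]]:
    """Odds ratio and Wald 95% CI for a 2x2 table.

    a, b: predicted highly LoF-intolerant genes inside / outside the gene set
    c, d: predicted non-highly LoF-intolerant genes inside / outside the gene set
    The interval is None when any cell is zero (log odds ratio undefined).
    """
    if b * c == 0:
        return math.inf, None
    ratio = (a * d) / (b * c)
    if min(a, b, c, d) == 0:
        return ratio, None
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, (lower, upper)


# Invented counts, for illustration only.
ratio, interval = odds_ratio_ci(30, 70, 50, 950)
print(ratio, interval)

# No predicted highly LoF-intolerant gene falls in the gene set.
depleted_ratio, depleted_interval = odds_ratio_ci(0, 100, 50, 950)
print(depleted_ratio, depleted_interval)
```

When a cell is empty, an exact test such as Fisher's is the usual way to attach a p-value, since the Wald interval is not defined.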

3.3.6 predLoF-CpG reclassifies 101 genes with expected LoF variants between 10 and 20 as highly LoF-intolerant

In our analyses so far, we have ignored 3,440 genes with expected LoF variants between 10 and 20. Even though in [100] these were treated as having reliable LOEUF estimates, our assessment suggests that lack of power can affect whether they are categorized as highly LoF-intolerant or not (Supplemental Figure 3.6, Methods). After filtering for reliable promoter annotation, we applied predLoF-CpG to 2,772 genes, and obtained high-confidence classifications for 1,675. For the great majority (93.9%), we agree with the classification obtained by purely considering whether their LOEUF is < 0.35. However, we observed 101 genes that were classified as highly LoF-intolerant by predLoF-CpG but had LOEUF ≥ 0.35, a number not explained by the false positive rate of our predictor (Supplemental Table 3.2). 75% of these genes have an observed/expected LoF point estimate of 0.31, suggesting that they are indeed highly LoF-intolerant, but do not meet the required LOEUF threshold because of inadequate power. Therefore, when interpreting LoF variants in these genes, we suggest that both LOEUF and predLoF-CpG be taken into account.

3.4 Discussion

Our study reveals that: a) there exists a strong, widespread coupling between promoter CpG density and downstream gene LoF-intolerance in the human genome, and b) this coupling can be exploited to predict LoF-intolerance for almost 2000 genes that are otherwise largely intractable with current sample sizes. Our predictions for these genes (which we make available in Supplemental Table 3.1) can inform research into novel disease candidates and can now be incorporated in the clinical genetics laboratory setting. Similarly to existing tools for missense variants [120, 121], they can provide corroborating evidence during the evaluation of the pathogenicity of LoF variants harbored by patients, as recommended by the American College of Medical Genetics and Genomics [101].

In terms of understanding the regulatory architecture of the genome, our findings extend decades of work [1, 106] to show that high CpG density is not just a prevalent feature of many promoters, but is preferentially marking the promoters of the most selectively constrained genes. We believe this casts doubt on the prevailing view that CpG islands are not under selection [125], although we note that our current results are correlative in nature.

If promoter CpG density is indeed under selection, its presence at LoF-intolerant gene promoters has to be advantageous, which raises the question of the underlying biological mechanism. Our findings suggest that this mechanism is not related to the high and constitutive expression that LoF-intolerant genes typically exhibit. An intriguing possibility has been recently raised by single-cell expression measurements showing that promoter CpG islands are associated with reduced expression variability [126]. We hypothesize that this decreased variability is beneficial for many processes where LoF-intolerant genes are known to play central roles, such as neurodevelopment [7].

Our work represents an attempt at deciphering the link between regulatory element characteristics and the LoF-intolerance of the genes they control. The fact that taking promoter EZH2 binding into account improves our ability to recognize LoF-intolerant genes on top of CpG density implies that this mapping can be learned with even greater accuracy by incorporating information about other regulatory factors as well. However, a current barrier to achieving this is the relative paucity of genome-wide binding data across the full repertoire of transcription factors: the human genome encodes approximately 1500 transcription factors [21, 127, 128] and at least 300 epigenetic regulators [124]. In contrast to these numbers, ENCODE has currently profiled only 330 regulatory factors in K562 cells, the most extensively characterized cell line.

It is also natural to consider moving beyond promoters to other regulatory elements. An initial step in this direction has recently been taken in [105], motivated by work in Drosophila showing that developmentally important genes can have multiple redundant enhancers [129, 130]. While this “enhancer domain score” was not designed to capture LoF-intolerance and has poor association with LOEUF (adjusted r² = 0.03), it has been shown to have some predictive capacity for human disease genes, especially those with a developmental basis.

In summary, our study shows the existence of a strong and widespread association between promoter CpG density and genic LoF-intolerance, and leverages this relationship to predict LoF-intolerance for unascertained genes.

3.5 Methods

3.5.1 Selecting transcripts with high-confidence loss-of-function intolerance estimates

In total, gnomAD [100] provides LoF-intolerance estimates for 79,141 human protein-coding transcripts (hereafter referred to as transcripts) labeled with ENSEMBL identifiers, of which 19,172 are annotated as canonical. For each transcript, these LoF-intolerance estimates consist of the point estimate of the observed/expected number of LoF variants, as well as a 90% confidence interval around it. The upper bound of this confidence interval (LOEUF) is the suggested metric of LoF-intolerance [100]. For any given transcript, the ability to reliably estimate LOEUF is directly related to the expected number of LoF variants; when that expected number is small, there is uncertainty around the point estimate (and thus a large LOEUF value), because it is not possible to determine whether an observed depletion of LoF variants is due to negative selection against these variants in the population, or due to inadequate sample size. Therefore, for transcripts with high-confidence LOEUF values, there should be a strong positive correlation between the point estimate and LOEUF; in contrast, low-confidence LOEUF transcripts will have small point estimates coupled with large LOEUF values.

Based on this assessment, and consistent with [100], we considered transcripts with ≤ 10 expected LoF variants to have unreliable LOEUF (34,232 out of 79,141 total transcripts; 5,413 out of 19,172 canonical transcripts; Supplemental Figure 3.6). Throughout the text, we refer to the genes encoding these transcripts as “unascertained”.

Even though most of the analyses in [100] were performed using transcripts with > 10 expected LoF variants, we saw that, with increasing expected number of LoF variants, there was a non-negligible increase in the probability (conditional on a given point estimate) of a transcript belonging to the highly LoF-intolerant category (LOEUF < 0.35). We thus adopted a more stringent criterion, and considered transcripts with ≥ 20 expected LoF variants (25,474 out of 79,141 total transcripts; 8,506 out of 19,172 canonical transcripts; Supplemental Figure 3.6) to have high-confidence LOEUF. After further filtering based on promoter annotation (see the section “Annotating canonical promoters in the human genome”), these are the transcripts we used to establish the association between promoter CpG density and LOEUF, and to train predLoF-CpG.
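The thresholds described in this section can be summarized in a few lines of code. The following is a minimal Python sketch, not the code used for the analyses (which relied on R); the function names and the label "intermediate" for transcripts with between 10 and 20 expected LoF variants are hypothetical and introduced only for illustration.

```python
def loeuf_confidence(expected_lof: float) -> str:
    """Classify a transcript by the reliability of its LOEUF estimate.

    Thresholds follow the text: <= 10 expected LoF variants is
    "unascertained", >= 20 is "high-confidence". The label
    "intermediate" is a hypothetical name for everything in between.
    """
    if expected_lof <= 10:
        return "unascertained"
    if expected_lof >= 20:
        return "high-confidence"
    return "intermediate"


def is_highly_lof_intolerant(loeuf: float) -> bool:
    """Highly LoF-intolerant category as defined in the text (LOEUF < 0.35)."""
    return loeuf < 0.35


# Toy usage with made-up values (not real gnomAD estimates).
for expected, loeuf in [(4.2, 1.1), (15.0, 0.5), (48.7, 0.2)]:
    print(expected, loeuf_confidence(expected), is_highly_lof_intolerant(loeuf))
```

Only transcripts in the "high-confidence" tier would then be carried forward to establish the CpG density association and to train the predictor.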

3.5.2 Selecting transcripts with high-confidence annotations in GENCODE v19

gnomAD supplies LOEUF estimates for 79,141 transcripts in GENCODE v19. However, we conducted our analyses at the gene level, based on the following reasoning: typically, transcripts from the same gene overlap in their coding sequence, which makes it hard to disentangle their LOEUF estimates. For example, a transcript whose loss does not have severe phenotypic consequences, and whose promoter therefore does not contain informative features, may still have low LOEUF merely because it overlaps with a different transcript of the same gene.

For each gene, GENCODE labels a single transcript as canonical, and recognizes the difficulty of accurately annotating transcriptional start sites (TSSs) [131]. We manually inspected GENCODE’s choices of canonical transcripts, and found some problematic cases. An illustrative example is KMT2D (Supplemental Figure 3.17). First, even though this gene is broadly expressed across tissues in GTEx, its canonical promoter shows POLR2A (the major subunit of the RNA PolII complex) ChIP-seq peaks in only 4 ENCODE experiments (out of 74 total). Even though there does exist a non-canonical transcript whose promoter has POLR2A signal in 59 experiments, as would be expected for a broadly expressed gene, that non-canonical transcript has an unusually short coding sequence, which does not even encode the catalytic SET domain. In this particular case, we reasoned that the 5’ UTR of the canonical transcript needs to be extended up until the TSS of the non-canonical transcript. Such an annotation would also be consistent with the annotation of the mouse ortholog. Importantly, if this annotation error is ignored, it is impossible to select a KMT2D transcript with accurate estimates of both LOEUF and promoter CpG density.

With this example in mind, we developed an empirical approach to only retain transcripts with high-confidence GENCODE annotations in our analysis. First, we defined promoters as 4kb elements centered around the transcriptional start site (TSS). We then leveraged the main hallmark of transcriptional initiation at protein-coding gene promoters: binding of the RNA PolII complex, the major subunit of which is POLR2A [132, 133]. We used data from ENCODE [134] on the genome-wide binding locations of POLR2A from 74 ChIP-seq experiments on several cell lines, originating from diverse human tissues (see “POLR2A ENCODE ChIP-seq data” section below).

As expected, we observed that genes that are broadly expressed across the 53 different tissues in GTEx (τ < 0.6; see “GTEx expression data” section below) tend to have promoters with POLR2A ChIP-seq peaks in multiple experiments, while the opposite is true for genes expressed in a restricted number of tissues (τ > 0.6, Supplemental Figure 3.7c). However, as in the KMT2D example above, we also observed genes with broad expression and very low binding of POLR2A at their canonical promoter (Supplemental Figure 3.7c), and a few genes with restricted expression but POLR2A peaks at their canonical promoter in multiple experiments (Supplemental Figure 3.7c), raising our suspicion that these reflect inaccurate annotation of the canonical TSS.

Therefore, we required that the canonical promoter of a broadly expressed gene exhibits POLR2A peaks in multiple ENCODE experiments, and the canonical promoter of a gene with restricted expression exhibits POLR2A peaks only in a small number of ENCODE experiments. As additional layers of evidence for canonical promoters, we used the presence of CpG islands, which are known markers of promoters in mammalian genomes [1, 106], as well as the concordance between the human TSS coordinate and the TSS coordinate of a mouse ortholog transcript (when the latter is mapped onto the human genome).

Specifically, we first excluded genes on the sex chromosomes, since, due to X-inactivation in females and hemizygosity in males, LoF-intolerance estimates have a different interpretation in these cases. This gave us 17,657 genes with at least one canonical transcript, of which 17,359 had expression measurements in GTEx. We then applied the following criteria (when none of the criteria were satisfied, we entirely discarded the gene):

Criterion 1: The gene is broadly expressed (τ < 0.6) and the canonical promoter has a POLR2A peak in more than 35 ENCODE experiments.

We found 7,250 cases satisfying this criterion, and kept the canonical promoter annotation.

Criterion 2: The gene is broadly expressed (τ < 0.6), the canonical promoter has a POLR2A peak in fewer than 10 ENCODE experiments, and there is an alternative promoter with POLR2A peaks in more than 35 experiments.

We found 218 cases satisfying this criterion (Supplemental Figure 3.7d), and classified the alternative promoter as the canonical (all such cases are provided in Supplemental Table 3.3). When more than one alternative promoter satisfied our requirement, we distinguished the following subcases:

(a) If none of these alternative promoters overlapped a CpG island, we classified the promoter corresponding to the transcript with the greatest number of expected LoF variants as the canonical.

(b) If exactly one of these alternative promoters overlapped a CpG island, we classified that promoter as the canonical.

(c) If more than one of these alternative promoters overlapped a CpG island, we classified the promoter that, among the CpG-island-overlapping promoters, had the greatest number of expected LoF variants as the canonical.

For our subsequent analyses, we used the LOEUF value of the newly annotated canonical promoter.

Criterion 3: The gene is not broadly expressed (τ > 0.6), and the canonical promoter has a POLR2A peak in fewer than 10 ENCODE experiments and overlaps a CpG island.

We found 1,862 cases satisfying this criterion, and kept the canonical promoter annotation.

Criterion 4: The gene is not broadly expressed (τ > 0.6), the canonical promoter has a POLR2A peak in fewer than 10 ENCODE experiments, none of the promoters corresponding to the gene overlap a CpG island, and there is a mouse ortholog TSS in RefSeq no more than 500 bp away from the canonical human TSS.

We found 3,049 cases satisfying this criterion, and kept the canonical promoter annotation.

Criterion 5: The gene is not broadly expressed (τ > 0.6), the canonical promoter has a POLR2A peak in fewer than 10 ENCODE experiments, none of the promoters corresponding to the gene overlap a CpG island, there is no mouse ortholog TSS in RefSeq, and there are no alternative transcripts with different TSS coordinates.

We found 1,411 cases satisfying this criterion, and kept the canonical promoter annotation.

The promoters selected by the above five criteria, along with their coordinates, are provided in Supplemental Table 3.4.
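The five criteria above amount to a small decision procedure per gene. The following Python sketch illustrates that logic under stated assumptions; it is not the code used for the analyses. All argument names are hypothetical, the tie-breaking subcases (a) to (c) of Criterion 2 are omitted, and a mouse TSS that exists but lies more than 500 bp away is assumed to satisfy neither Criterion 4 nor Criterion 5.

```python
from typing import Optional


def promoter_criterion(
    tau: float,
    canonical_peaks: int,                 # experiments with a POLR2A peak at the canonical promoter
    best_alternative_peaks: int,          # same count for the best alternative promoter (0 if none)
    canonical_in_cpg_island: bool,
    any_promoter_in_cpg_island: bool,
    mouse_tss_distance_bp: Optional[int], # None if no mouse ortholog TSS in RefSeq
    has_alternative_tss: bool,
) -> Optional[int]:
    """Return the criterion (1-5) a gene satisfies, or None if it is discarded."""
    if tau < 0.6:  # broadly expressed
        if canonical_peaks > 35:
            return 1
        if canonical_peaks < 10 and best_alternative_peaks > 35:
            return 2  # the alternative promoter becomes the canonical one
        return None
    if tau > 0.6 and canonical_peaks < 10:  # restricted expression, little POLR2A binding
        if canonical_in_cpg_island:
            return 3
        if not any_promoter_in_cpg_island:
            if mouse_tss_distance_bp is not None and mouse_tss_distance_bp <= 500:
                return 4
            if mouse_tss_distance_bp is None and not has_alternative_tss:
                return 5
    return None


# A KMT2D-like case from the text: broadly expressed, 4 peaks at the
# canonical promoter, 59 at an alternative promoter (tau value is made up).
print(promoter_criterion(0.3, 4, 59, True, True, None, True))
```

Genes for which the function returns None correspond to the cases that were entirely discarded.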

Finally, regarding coding sequence annotations, errors such as the one in KMT2D described at the beginning of the section are difficult to systematically detect and correct, and our manual inspection suggested that they are also less frequent. We chose to entirely discard cases where:

(a) the transcript we had selected after promoter filtering had ≤ 10 expected LoF variants (placing the gene into the “unascertained” category), and

(b) there was an alternative transcript that had a longer coding sequence and ≥ 20 more expected LoF variants compared to the one our procedure selected.

This approach removes KMT2D and 14 more potentially problematic cases such as ZNF609.

3.5.3 Calculating the CpG density of a promoter

Using the BSgenome.Hsapiens.UCSC.hg19 R package, we obtained the sequence of each promoter with the getSeq function. We then calculated its CpG density using the definition of the observed-to-expected CpG ratio in [135], applied to the entire 4 kb sequence (that is, without using sliding windows).
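To make the calculation concrete, the following Python sketch computes an observed-to-expected CpG ratio over a whole sequence. It assumes the widely used form of the ratio, (number of CpG × sequence length) / (number of C × number of G); the exact definition in the cited reference may differ (for example, in how the expectation is normalized), so this should be read as an illustration rather than the formula used for the analyses. The function name is hypothetical.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed-to-expected CpG ratio over the whole sequence (no sliding windows).

    Assumed definition: (#CpG * L) / (#C * #G), where L is the sequence length.
    Returns 0.0 when the sequence contains no C or no G.
    """
    s = seq.upper()
    n_c = s.count("C")
    n_g = s.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    n_cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return n_cpg * len(s) / (n_c * n_g)


# Toy sequences, not real promoters.
print(cpg_obs_exp("ACGTACGTTTGGCCAA"))
print(cpg_obs_exp("AATTAATT"))
```

In the analyses described here, the input would be the full 4 kb promoter sequence centered on the TSS.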

3.5.4 The impact of promoter definition

There is currently no single accepted definition of a promoter in terms of the size of the interval around the TSS. We therefore examined how this parameter affects the relationship between CpG density and LOEUF, and found its impact to be small for 5 sensible choices (Supplemental Figure 3.11a,b).

3.5.5 Overlapping promoters

When defining the set of genes with high-confidence LOEUF estimates, we excluded genes whose promoters overlapped promoters of genes with fewer than 20 expected LoF variants, but whose observed/expected LoF point estimate was suggestive of LoF-intolerance (< 0.5). In cases of overlapping promoters with both genes having ≥ 20 expected LoF variants, we kept the promoter corresponding to the gene with the lowest LOEUF. In cases of overlapping promoters with both genes having ≤ 10 expected LoF variants, we kept the promoter with the highest CpG density. Finally, when defining the set of unascertained genes, we excluded genes whose promoters overlapped promoters of genes with greater than 10 expected LoF variants, unless there was strong evidence that these were LoF-tolerant (observed/expected LoF point estimate > 0.8 and at least 20 expected LoF variants).

We recognize, however, that in cases where promoters overlap, the predictions are potentially informative not only for the gene whose promoter was ultimately used, but also for the genes with overlapping promoters. In addition, in cases of genes predicted as highly LoF-intolerant, these predictions might also have been influenced by the overlapping promoter (there are only 3 potential such cases). With that in mind, in Supplemental Table 3.1, we provide such information under the column “other genes with overlapping promoter”.

3.5.6 Promoters in subtelomeric regions

It is known that subtelomeric regions are rich in CpG islands, which, however, differ from those in the rest of the genome in that they appear in clusters, and their CpG-richness is driven mainly by GC-biased gene conversion [125]. We thus excluded promoters residing in subtelomeric regions (defined as 2 Mb on each of the two chromosomal ends of each chromosome) from our analyses.
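The exclusion rule can be sketched as follows in Python. This is an illustration under assumptions, not the analysis code: coordinates are treated as 0-based half-open intervals, a promoter is the 4 kb window centered on the TSS, and a promoter is treated as subtelomeric if any part of it falls within 2 Mb of either chromosome end (the text does not specify whether partial overlap counts). The function names and the chromosome length in the example are hypothetical.

```python
SUBTELOMERE_BP = 2_000_000     # 2 Mb at each chromosome end, as in the text
PROMOTER_HALF_WIDTH = 2_000    # 4 kb promoter centered on the TSS


def promoter_interval(tss: int, chrom_length: int) -> tuple:
    """4 kb window centered on the TSS, clipped to the chromosome."""
    start = max(0, tss - PROMOTER_HALF_WIDTH)
    end = min(chrom_length, tss + PROMOTER_HALF_WIDTH)
    return start, end


def is_subtelomeric(start: int, end: int, chrom_length: int) -> bool:
    """True if the interval touches the first or last 2 Mb of the chromosome."""
    return start < SUBTELOMERE_BP or end > chrom_length - SUBTELOMERE_BP


# Toy chromosome of 100 Mb (not a real hg19 chromosome length).
chrom_length = 100_000_000
for tss in (1_000_000, 50_000_000, 98_500_000):
    start, end = promoter_interval(tss, chrom_length)
    print(tss, (start, end), is_subtelomeric(start, end, chrom_length))
```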

3.5.7 ENCODE ChIP-seq data

We used the rtracklayer R package to download the “wgEncodeRegTfbsClusteredV3” table from the “Txn Factor ChIP” track, as provided by the UCSC Table Browser for the hg19 human assembly. We then restricted to peak clusters corresponding to POLR2A. This gave us a set of genomic intervals, each of which has been derived from uniform processing of 74 POLR2A ChIP experiments on 32 distinct cell lines (some cell lines were represented by more than one experiment). Each genomic interval was associated with a single number, which ranged from 0 to 74 and indicated the number of ChIP experiments where a peak was detected at that interval. The EZH2 data were downloaded in an identical manner.

3.5.8 GTEx expression data

We used the GTEx portal to download a matrix with the gene-level TPM expression values from the v7 release, derived from RNA-seq expression measurements from 714 individuals, spanning 53 tissues [136].

As the metric of tissue specificity for a given gene, we used τ, which has been shown to be the most robust such measure when benchmarked against alternatives [137]. To calculate τ, we first computed the gene’s median expression across individuals, within each tissue. Since it has been shown that the transcriptomic profiles of the different brain regions are very similar, with the exception of the two cerebellar tissues [35], which are similar to one another, we aggregated the median expression of each gene in the different brain regions into two “meta-values”. One meta-value corresponded to the median of its median expression in the two cerebellar tissues, and the other to the median of its median expression in the other brain regions. We then formed a matrix where rows corresponded to genes, and columns to tissues, with one column for the across-brain-regions meta-value and another for the across-cerebellar-tissues meta-value; the entries in the matrix were log2(TPM + 1) median expression values. Finally, for each gene, τ was calculated as described in [137].
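The steps above can be sketched in Python as follows. This is an illustration, not the analysis code. It assumes the standard definition of τ, namely the sum over tissues of (1 − x_i / x_max) divided by (n − 1), where x_i is the expression in tissue i; it also assumes that the brain meta-values are computed on the TPM scale before the log2(TPM + 1) transformation, which the text does not state explicitly. Function names, the meta-value keys, and tissue names are hypothetical.

```python
import math
from statistics import median


def collapse_brain(tissue_medians, cerebellar, other_brain):
    """Replace individual brain tissues with two meta-values.

    tissue_medians maps tissue name -> median TPM across individuals.
    """
    brain = set(cerebellar) | set(other_brain)
    collapsed = {t: v for t, v in tissue_medians.items() if t not in brain}
    collapsed["brain_cerebellar_meta"] = median(tissue_medians[t] for t in cerebellar)
    collapsed["brain_other_meta"] = median(tissue_medians[t] for t in other_brain)
    return collapsed


def tau(tissue_medians_tpm):
    """Tissue specificity on log2(TPM + 1) values; None if undefined."""
    values = [math.log2(v + 1.0) for v in tissue_medians_tpm.values()]
    x_max = max(values)
    if x_max == 0 or len(values) < 2:
        return None  # gene not expressed anywhere, or too few tissues
    return sum(1.0 - x / x_max for x in values) / (len(values) - 1)


# Toy example with made-up tissue names and TPM values.
medians = {"liver": 3.0, "cb1": 1.0, "cb2": 3.0, "ctx": 5.0, "hip": 7.0, "amy": 9.0}
collapsed = collapse_brain(medians, cerebellar=["cb1", "cb2"], other_brain=["ctx", "hip", "amy"])
print(collapsed, tau(collapsed))
```

Under this definition τ is 0 for a gene expressed equally in all tissues and 1 for a gene expressed in a single tissue, which is why τ < 0.6 is used throughout as the threshold for broad expression.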

For our analyses of the association between promoter CpG density and expression level, we used the median (across individuals) expression (log2(TPM + 1)), computed for the tissue where the gene had the maximum median expression.

3.5.9 TSS coordinates of mouse orthologs

We used the biomaRt R package to obtain a list of mouse-human homolog pairs, using the human Ensembl gene IDs as the input. For this query, we set the ‘mmusculus_homolog_orthology_confidence’ parameter equal to 1 (indicating high-confidence homolog pairs). Then, for each of the mouse homolog Ensembl IDs, we retrieved the RefSeq mRNA IDs, again with biomaRt. We discarded cases where the same RefSeq mRNA ID was associated with more than one Ensembl gene ID. We then used the rtracklayer R package to download the “xenoRefGene” UCSC table, from the “Other RefSeq” track, containing the TSS coordinates for each of the mouse RefSeq transcripts.

3.5.10 Across-species conservation quantification

Using the phastCons100way.UCSC.hg19 R package, we quantified conservation for each nucleotide across 100 vertebrate species using the PhastCons score [138]. The PhastCons score ranges from 0 to 1 and represents the probability that a given nucleotide is conserved. As the promoter PhastCons score for a given gene, we computed the average PhastCons of all nucleotides in the 4kb region centered around the TSS. As the exonic PhastCons for a given gene, we pooled all nucleotides belonging to the coding sequence of the gene (that is, excluding the 5’ and 3’ UTRs), and computed their average PhastCons.

3.5.11 Previously published LoF-intolerance predictions

The updated version of the score of [139] was downloaded from the DECIPHER database (accessed November 2019). The scores of [140] and [104] were downloaded from the supplemental materials of the respective publications. In our comparison we did not include HIPred [141], since it only provides binary haploinsufficiency predictions for a small number of genes.

3.5.12 Structural variation data

We used the gnomAD browser to download a bed file containing the coordinates and characteristics of structural variants in gnomAD v2. We then restricted to deletions that passed quality control (“FILTER” column value equal to “PASS”). Subsequently, we excluded deletions that overlapped more than one of our high-confidence promoters, in order to avoid ambiguous links between deletions and genes.
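The filtering logic can be illustrated with the following Python sketch, which is not the analysis code. It assumes 0-based half-open intervals, represents each structural variant as a (chrom, start, end, svtype, filter) tuple with "DEL" marking deletions (the field layout and the "DEL" label are assumptions about the file format), and uses hypothetical gene names and coordinates.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def assign_deletions(variants, promoters):
    """Map each gene to the PASS deletions overlapping its promoter only.

    variants:  iterable of (chrom, start, end, svtype, filter_value)
    promoters: dict gene -> (chrom, start, end)
    Deletions overlapping more than one promoter are dropped as ambiguous.
    """
    by_gene = {}
    for chrom, start, end, svtype, filter_value in variants:
        if svtype != "DEL" or filter_value != "PASS":
            continue
        hits = [
            gene
            for gene, (p_chrom, p_start, p_end) in promoters.items()
            if p_chrom == chrom and overlaps(start, end, p_start, p_end)
        ]
        if len(hits) == 1:
            by_gene.setdefault(hits[0], []).append((chrom, start, end))
    return by_gene


# Toy data: GENE_A and GENE_B have overlapping promoters.
promoters = {
    "GENE_A": ("chr1", 1000, 5000),
    "GENE_B": ("chr1", 4500, 8500),
    "GENE_C": ("chr2", 1000, 5000),
}
variants = [
    ("chr1", 1200, 1300, "DEL", "PASS"),         # GENE_A only: kept
    ("chr1", 4600, 4700, "DEL", "PASS"),         # GENE_A and GENE_B: ambiguous
    ("chr2", 900, 1100, "DUP", "PASS"),          # not a deletion
    ("chr2", 2000, 2100, "DEL", "LOW_QUALITY"),  # fails quality control
    ("chr2", 3000, 3100, "DEL", "PASS"),         # GENE_C only: kept
]
print(assign_deletions(variants, promoters))
```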

3.5.13 Gene catalogs

The following gene catalogs were used for Figure 3.5d:

(a) 404 heterozygous lethal genes in mouse from https://github.com/macarthur-lab/gnomad_lof/blob/master/R/ko_gene_lists. Using the biomaRt R package, we mapped these genes to their human homolog Ensembl IDs using the “mgi_symbol” filter, keeping only high-confidence homologs. This yielded a total of 390 human homologs.

(b) 1,254 high-confidence transcription factor genes from [127].

(c) 371 olfactory receptor genes from https://github.com/macarthur-lab/gene_lists.

3.5.14 Enrichment quantification

All enrichment point estimates in the text correspond to odds ratios, and the associated p-values were calculated using Fisher’s exact test (two-sided) with the “fisher.test” function in R.
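For readers who want to see what this computation involves, the following Python sketch implements a sample odds ratio and a two-sided Fisher’s exact test for a 2×2 table using only the standard library. It is an illustration rather than the analysis code: the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is the same convention R uses, but the odds ratio here is the simple cross-product ratio, whereas fisher.test in R reports a conditional maximum likelihood estimate that can differ slightly. Function names are hypothetical.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample (cross-product) odds ratio for the table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    # Small relative tolerance so that ties with the observed table are included.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Toy table (made-up counts): rows = in top CpG density decile or not,
# columns = highly LoF-intolerant or not.
print(odds_ratio(3, 1, 1, 3), fisher_exact_two_sided(3, 1, 1, 3))
```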

3.5.15 Code

Code used in this manuscript is available at https://github.com/hansenlab/lof_prediction_paper_repro.

3.5.16 Acknowledgements

Funding: Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM121459. LB was supported by the Maryland Genetics, Epidemiology and Medicine (MD-GEM) training program, funded by the Burroughs-Wellcome Fund. HTB received support from the Louma G. Foundation.

Conflict of Interest: HTB is a paid consultant for Millennium Pharmaceuticals, Inc.

3.6 Supplementary Materials

3.6.1 Supplemental Tables

Supplemental Table 3.1: predLoF-CpG predictions for genes unascertained in gnomAD. Prediction probabilities are provided in the “prediction probability of high LoF intolerance by predLoF-CpG” column. Probabilities > 0.75 correspond to genes predicted as highly LoF-intolerant, and probabilities < 0.25 to genes predicted as non-highly LoF-intolerant. ENSEMBL gene/transcript ids and coordinates of the promoters used for prediction are also provided; all coordinates refer to hg19. Available at https://www.biorxiv.org/content/10.1101/2020.02.15.936351v3.supplementary-material

Supplemental Table 3.2: predLoF-CpG reclassifications for genes with expected LoF variants between 10 and 20. Similar to Supplemental Table 3.1, but for 101 genes with expected LoF variants between 10 and 20 that were classified as highly LoF-intolerant by predLoF-CpG but had LOEUF ≥ 0.35. Available at https://www.biorxiv.org/content/10.1101/2020.02.15.936351v3.supplementary-material

Supplemental Table 3.3: Non-canonical promoter coordinates. Promoter coordinates for cases where our promoter filtering procedure selected a non-canonical promoter. The table contains the promoter coordinates and transcript ENSEMBL ids of both the canonical transcript and the alternative transcript that was selected. All coordinates refer to hg19. Available at https://www.biorxiv.org/content/10.1101/2020.02.15.936351v3.supplementary-material

Supplemental Table 3.4: All promoter coordinates. Promoter coordinates for 11,059 transcripts where our filtering procedure selected a reliable promoter. Available at https://www.biorxiv.org/content/10.1101/2020.02.15.936351v3.supplementary-material

3.6.2 Supplemental Figures

[Figure: scatterplots, panels (a) and (b). x axis: Obs/Exp LoF point estimate (0 to 2); y axis: LOEUF (0 to 2). Legend: transcripts with >= 20 expected LoF variants; transcripts in between; transcripts with <= 10 expected LoF variants.]

Supplemental Figure 3.6: Assessing the reliability of LOEUF estimates. Scatterplots of the point estimates of the observed/expected proportion of loss-of-function variants (x axis), against LOEUF (y axis; defined as the upper bound of the 90% confidence interval around the point estimate). Each point corresponds to a transcript. The horizontal line corresponds to the 0.35 cutoff for highly LoF-intolerant genes. Shown for: (a) all transcripts, and (b) canonical transcripts only (based on GENCODE annotation).

[Figure: four panels, (a) to (d). Axes: tissue specificity (τ); number of ENCODE experiments with POLR2A peak (canonical promoter, and maximum across alternative promoters); density.]

Supplemental Figure 3.7: Assessing the relationship between tissue specificity of gene expression and POLR2A binding at the canonical promoter. (a) The distribution of the number of ENCODE ChIP-seq experiments showing POLR2A peaks, for all canonical promoters (4 kb regions centered around the TSS) in Ensembl (hg19 assembly). (b) The distribution of τ computed using gene-level expression quantifications from GTEx. (c) Scatterplot of τ against the number of ENCODE ChIP-seq experiments showing POLR2A peaks at the canonical promoter. Each point corresponds to a gene-promoter pair. (d) Scatterplot of the number of ENCODE ChIP-seq experiments showing POLR2A peaks at the canonical promoter (x axis) versus the corresponding number at the promoter with the greatest number of detected peaks (out of all the alternative promoters of a gene; y axis). Each point corresponds to a promoter pair for a single gene; shown are only genes that are broadly expressed (τ < 0.6) but whose canonical promoter shows POLR2A binding in fewer than 10 ENCODE experiments.

[Schematic: 17,359 autosomal genes in gnomAD+GTEx are split by expected number of LoF variants: >= 20 (8,506 genes), 10-20 (3,440 genes), and <= 10 (5,413 genes). After filtering (can the genic promoter be identified reliably; excluding sub-telomeric regions; handling bi-directional promoters), these become, respectively, 4,743 genes (train/test set with high-confidence LOEUF), 2,772 genes, and 2,430 genes (predictions for genes with unascertained LOEUF).]

Supplemental Figure 3.8: Partitioning genes according to the reliability of their LOEUF estimates and promoter annotation. Schematic illustrating our approach (see Methods for details). We start with 17,359 genes that: a) are present in both GTEx and gnomAD, b) reside in autosomes, and c) have promoters that are not subtelomeric. We then filter these according to whether they have reliable promoter annotations, and in cases of pairs of genes with overlapping promoters we keep only one gene from each pair. This gives us the set of high-confidence genes that we use to establish the relationship between CpG density and LOEUF and to train predLoF-CpG, and the set of unascertained genes to which we apply predLoF-CpG.

[Figure: scatterplot. x axis: CpG density; y axis: LOEUF.]

Supplemental Figure 3.9: Scatterplot of promoter CpG density against downstream gene LOEUF. Each point corresponds to a promoter-gene pair.

[Figure: enrichment in highly LoF-intolerant genes (odds ratio; y axis) across CpG density deciles 1 to 10 (x axis). Legend: promoter filtering; no promoter filtering.]

Supplemental Figure 3.10: The effect of filtering for high-confidence promoter annotations on the relationship between CpG density and LOEUF. Like Figure 3.1b, but shown both for the 4,859 genes with high-confidence promoter annotations (red), and for 6,656 genes with canonical (based on GENCODE) promoter annotations and at least 20 expected LoF variants, without further promoter filtering (blue).

[Figure: (a) distributions of LOEUF (x axis, 0 to 2) across CpG density deciles (y axis; 1 = low density, 10 = high density) for five promoter definitions: +/- 4kb, +/- 2kb, +/- 1kb, -1.5kb/+500bp, and -2kb only. (b) Percent LOEUF variance explained (y axis) for each of the five promoter definitions.]

Supplemental Figure 3.11: The impact of the size of promoter definition on the relationship between CpG density and LOEUF. (a) Like Figure 3.1a, but with different choices of the interval around the transcription start site that is defined as the promoter. (b) The percentage of LOEUF variance (adjusted r²) that is explained by promoter CpG density, for each of the promoter definitions in (a).

[Figure: three panels showing distributions across CpG density deciles (y axis; 1 = low density, 10 = high density). x axes: (a) expression level (log2(TPM+1)); (b) tissue specificity (τ); (c) number of tissues with detectable expression.]

Supplemental Figure 3.12: Distributions of downstream gene expression level and tissue specificity across promoter CpG density deciles. Both expression level and τ were computed from the GTEx dataset (see Methods). In all three figures, CpG density deciles are labeled 1-10, with 1 the most CpG-poor decile and 10 the most CpG-rich. In (c), detectable expression in a given tissue is defined as median TPM > 0.3.

[Figure: proportion of promoters (y axis) against number of ENCODE experiments with EZH2 peak (x axis, 0 to 14). Legend: low tissue specificity - all other quartiles; low tissue specificity - top CpG density quartile; high tissue specificity - all other quartiles; high tissue specificity - top CpG density quartile.]

Supplemental Figure 3.13: The proportion of promoters with EZH2 peaks in 1-14 ENCODE experiments, stratified based on their CpG density and downstream gene tissue specificity. Tissue specificity was quantified from the GTEx dataset using τ (Methods). Low tissue specificity corresponds to τ < 0.6 and high tissue specificity corresponds to τ > 0.6.

[Figure: two panels. x axis: LOEUF (0 to 2); y axis: PhastCons quartile (1 = low, 4 = high) for promoters and for exons. Legend: top 25% CpG density; in between; bottom 25% CpG density.]

Supplemental Figure 3.14: The relationship between promoter CpG density and loss-of-function intolerance conditional on promoter and exonic across-species conservation. (a) The distribution of LOEUF, stratified by promoter CpG density, in each quartile of promoter PhastCons score (Methods). (b) The distribution of LOEUF, stratified by promoter CpG density, in each quartile of exonic PhastCons (Methods). For both (a) and (b), quartiles are labeled from 1-4, with 1 being the least and 4 the most conserved.

[Figure: boxplots of the percent LOEUF variance explained out-of-sample for four predictors: Huang et al., Steinberg et al., Han et al., and predLoF-CpG.]

Supplemental Figure 3.15: The percentage of out-of-sample LOEUF variance explained by the different predictors of LoF-intolerance. Each boxplot corresponds to a LoF-intolerance predictor as shown on the x-axis, and shows the sampling distribution of the adjusted r² after regressing the LOEUF of genes in the test set on the corresponding predictor. We performed 1,000 random train/test splits. For predLoF-CpG, the regression was performed on the prediction probability of high LoF-intolerance.

[Figure: panel (a), x-axis: LOEUF quartile (1 to 4), with an arrow labeled "Increasing LoF-intolerance"; y-axis: % of promoters with deletions in healthy individuals (0 to 40). Panel (b), x-axis: deletion size (0 to 4000); y-axis: density (0 to 0.002), with one sub-panel per LOEUF quartile (1st to 4th).]

Supplemental Figure 3.16: The relationship between promoter deletions seen in healthy individuals and downstream gene loss-of-function intolerance. (a) The proportion of promoters harboring deletions across different strata of downstream gene loss-of-function intolerance. For each stratum, the distribution is obtained via the bootstrap. (b) The distribution of the size of deletions harbored by promoters across different strata of downstream gene loss-of-function intolerance.
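The bootstrap in panel (a) can be sketched as below: within one LOEUF stratum, promoters are resampled with replacement and the percentage harboring a deletion is recomputed for each resample. This is a minimal sketch, assuming that promoters are the resampling unit and that each promoter is coded as a boolean; the function name and the number of replicates are hypothetical, as the caption does not specify them.

```python
import random


def bootstrap_percent(has_deletion, n_boot=1000, seed=0):
    """Bootstrap distribution of the % of promoters with a deletion.

    `has_deletion` holds one boolean per promoter in a single stratum.
    Assumption: `n_boot` is illustrative, not taken from the text.
    """
    rng = random.Random(seed)
    n = len(has_deletion)
    percents = []
    for _ in range(n_boot):
        # Draw n promoters with replacement from the same stratum.
        resample = rng.choices(has_deletion, k=n)
        percents.append(100 * sum(resample) / n)
    return percents
```

Running this separately on the promoters of each LOEUF quartile gives one distribution per stratum, which can then be compared across quartiles.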

[Figure: UCSC genome browser screenshot (hg19, scale bar 5 kb). Tracks shown include average mCG, mCA, ATAC-seq and RNA-seq signal in BA9 and NAcc NeuN+ samples, GC percent, KMT2D gene annotations (UCSC, RefSeq, GENCODE, Ensembl), GTEx expression, CpG islands, ENCODE H3K27Ac, DNaseI hypersensitivity, transcription factor ChIP-seq, chromatin state segmentation and histone modification tracks, PhyloP conservation, and dbSNP variants.]

Supplemental Figure 3.17: UCSC genome browser screenshot of a 10kb region containing the transcriptional start sites of the canonical and one alternative KMT2D transcripts. The precise coordinates are chr12:49,446,107-49,456,107. The sequence of the canonical transcript extends beyond the 10kb region shown.
rs1232464376 rs1414548482 rs767724777 rs1231381730 rs1416078400 rs1322906469 rs1406962932 rs1263530475 rs1265792467 rs1310850564 rs587778487 rs750829880 rs752570040 rs1356714593 rs1242461182 rs1385956281 rs1422403901 rs1054648080 rs1315638413 rs564617479 rs1431535562 rs762896047 rs1312777002 rs1470438189 rs1249479125 rs1385898117 rs1447178426 rs1044106975 rs747046920 rs1269728206 rs1158592681 rs1421342018 rs1298704270 rs1188746807 rs1170647904 rs1299774842 rs1228799949 rs376751325 rs1362887681 rs1386287108 rs1446414592 rs1048678241 rs1485276403 rs1006948669 rs1156387570 rs1474463185 rs781256045 rs770105193 rs767591619 rs1179779960 rs1434726160 rs1032276744 rs1416681660 rs1051701453 rs930151761 rs746039927 rs1303794400 rs1323918085 rs1363262100 rs1004075455 rs1317160029 rs1202239248 rs1174315574 rs896767007 rs1020065475 rs1346858813 rs1330280764 rs1470327713 rs1349332573 rs1244194196 rs1015941246 rs1050759601 rs1237982122 rs770032317 rs766919843 rs750506197 rs1178914155 rs1448562943 rs1438611335 rs1041512289 rs1399896319 rs1483949579 rs975469976 rs1283225248 rs756336640 rs1410010992 rs1056835523 rs1350366649 rs1187901510 rs1472080165 rs1332498404 rs775685334 rs1412438708 rs1002765819 rs1167245047 rs1002269619 rs1260537532 rs1426011345 rs1426299246 rs1259718255 rs1208432882 rs753334940 rs780197677 rs1014886921 rs1028485386 rs1478664326 rs1433329564 rs1261306042 rs1218963393 rs1287920288 rs376086299 rs1227185621 rs1181231386 rs1367013846 rs1484246000 rs1297596494 rs1223974865 rs1472693187 rs1322211105 rs778490171 rs754250785 rs1464205569 rs1291793093 rs1005156359 rs1310598655 rs1227710022 rs1460307041 rs1311760425 rs747698407 rs779434145 rs1234642295 rs1428836493 rs1282594914 rs1335338294 rs1171921090 rs1000915682 rs1488388667 rs758096542 rs748617194 rs1298466163 rs1177825912 rs1364411697 rs1404638692 rs1293139649 rs1192030302 rs777393777 rs772489056 rs1389552112 rs1423933555 rs1034437384 rs1270222487 rs1407651626 rs1252447243 rs368882441 rs1396271427 rs1451360303 
rs1305708215 rs1253124517 rs1388179624 rs1364804869 rs1421916846 rs200245957 rs922856978 rs1197278365 rs1271194773 rs1002128318 rs1158500975 rs1337322418 rs1412858677 rs767076035 rs766046826 rs1057520167 rs1345636233 rs1402613473 rs1025682411 rs1194882442 rs1267974792 rs375948181 rs1290911238 rs1224128434 rs1430429671 rs1453086127 rs1265067304 rs1452310814 rs1160889076 rs1434696264 rs932938631 rs1372234933 rs1393848578 rs1291587739 rs1162028554 rs1289233305 rs1325901160 rs1398905636 rs752970063 rs1449639406 rs1203451605 rs1344703040 rs1005808806 rs1017223131 rs1402964156 rs760364650 rs1455219794 rs1355578039 rs1252203817 rs1041999443 rs1324147996 rs1354141359 rs1171862667 rs373271756 rs949877494 rs1433419251 rs1194629689 rs1227866707 rs1348965010 rs1279961022 rs1043517039 rs753553995 rs1413400165 rs1163743753 rs949515549 rs1422537817 rs1384883577 rs1016845899 rs754864555 rs888451411 rs1356792068 rs1250408125 rs1391743800 rs1453961702 rs1308060630 rs1398974964 rs941431713 rs1400465912 rs1215205015 rs1471189771 rs1158798261 rs1052756687 rs778714676 rs981300838 rs1406112365 rs1427466711 rs1308393300 rs1463432677 rs1433685371 rs550197466 rs924670274 rs1183848160 rs1296877329 rs1473852140 rs1460931522 rs1258290301 rs758348372 rs879089748 rs1475628134 rs1480198619 rs1048853917 rs1198714937 rs1332692911 rs1370093761 rs934801255 rs973027386 rs1306008312 rs1479809176 rs1446320868 rs1465556660 rs78617409 rs749454912 rs776043326 rs1170905599 rs1258007928 rs1486802586 rs1398608037 rs781206901 rs897261357 rs918553850 rs1374540502 rs1221670186 rs1309951164 rs1178660489 rs745984786 rs1223570033 rs559318235 rs1001525482 rs1040476179 rs1301568577 rs1303797265 rs1351148592 rs993733984 rs960683259 rs1171100876 rs1055235574 rs1358247553 rs1329681867 rs1204392134 rs370735843 rs1484687336 rs1428977892 rs1036597889 rs1326232687 rs1281690168 rs770976326 rs1262412226 rs1036865794 rs1020042891 rs1359026762 rs569941378 rs1410415839 rs1316091331 rs1406254642 rs1484852065 rs1041059279 
rs780376463 rs1278281176 rs1229700197 rs1267040870 rs1383690786 rs749611061 rs1401527918 rs1044589263 rs1438095240 rs1456471183 rs768998104 rs1317683279 rs1292140897 rs1443415627 rs1449470490 rs1192317373 rs1386555270 rs1394721826 rs1251123556 rs1339282655 rs774591142 rs776774168 rs1342987068 rs1175791516 rs1470580780 rs772738926 rs1326488356 rs1430592054 rs1469264460 rs1057170570 rs201166677 rs539551937 rs187202953 rs1420475330 rs1238898404 rs760311292 rs1230759771 rs1462163349 rs1179236756 rs1374090759 rs199891692 rs1281090809 rs1367999281 rs1462376746 rs1000286737 rs1407313666 rs1350888818 rs1417407301 rs1320208284 rs1246107726 rs377354964 rs1222852725 rs1403246748 rs1026297009 rs1487163144 rs781575851 rs745801358 rs1302566440 rs1344514421 rs1463998257 rs567986751 rs1254142487 rs1050750112 rs1386580812 rs1330702443 rs1442109235 rs1057519422 rs1230964354 rs1426354905 rs1016962393 rs762022117 rs1337815856 rs1215089304 rs1365893303 rs1237017872 rs1319713268 rs1165807500 rs1006594914 rs1296678178 rs1197374517 rs1278494018 rs755489693 rs1228665775 rs1042095538 rs1279840982 rs573735886 rs1385037757 rs1308781688 rs1442381995 rs1463022285 rs1055565891 rs1441285119 rs1295613803 rs1421108054 rs1259824899 rs776415531 rs1286065271 rs1186561431 rs1444541632 rs1014247624 rs1336548329 rs1034493008 rs1351979153 rs1336345180 rs1467396118 rs1298398357 rs759117840 rs1233341881 rs1384094219 rs1357969045 rs1272016298 rs1174857448 rs748886643 rs1356217775 rs1421259699 rs770673634 rs1208587474 rs1321377954 rs765026517 rs1275488821 rs1261132423 rs752517205 rs1438744754 rs1214778118 rs370649060 rs1181954085 rs1412497685 rs1170361306 rs1170149588 rs1384462258 rs1464283097 rs1417253763 rs1386287402 rs1253285245 rs1397208631 rs1288745859 rs915692803 rs1409124137 rs1419132894 rs1438942959 rs1306010839 rs1053645480 rs201801082 rs1026481963 rs1383537110 rs1043654062 rs1022611871 rs1455977102 rs1432657783 rs1356923003 rs1320240010 rs750484458 rs1415054479 rs1430285951 rs1473858119 rs1338209584 
rs1043013496 rs1165937379 rs1185809684 rs1262383763 rs553644335 rs1340739952 rs1185795140 rs1459936058 rs1319640242 rs1263223271 rs1258496375 rs1258286627 rs1326461019 rs1033142865 rs1226116563 rs1295503898 rs1397101572 rs1305221551 rs1398630977 rs1390772947 rs1204820599 rs1304121749 rs1326561189 rs1272970442 rs1047056938 rs867673406 rs1486736986 rs1462569176 rs372915490 rs1211881648 rs1264862790 rs794727860 rs1255808508 rs1006162222 rs749467510 rs1324311027 rs1197112227 rs1233409629 rs1373559377 rs1339495244 rs375915416 rs1475050662 rs1263488091 rs901202001 rs1375921887 rs1207998557 rs779383653 rs1432323879 rs1015910321 rs1449690449 rs1301033353 rs1236820260 rs1251402162 rs1261333670 rs1329649307 rs1486850045 rs1410821210 rs1168589331 rs1408651389 rs1455183787 rs1014830114 rs952824763 rs1189364625 rs1030597393 rs1173356358 rs1164661199 rs1264337812 rs1441486318 rs1421460851 rs1035363170 rs1238638728 rs1385502411 rs1047190414 rs1047443135 rs1295155264 rs1277832920 rs1016546469 rs1451346072 rs1307944720 rs1343798855 rs1227855868 rs1426161308 rs1404343548 rs1254805495 rs1206709412 rs1312097686 rs1273657235 rs1360394951 rs1166664002 rs1203805422 rs1431375455 rs1348627173 rs1281843442 rs1479245333 rs1445023949 rs1371792291 rs1365837408 rs1260278267 rs1358869467 rs1163588531 rs1299065755 rs1266358348 rs1030692193 rs1373067984 rs1047852374 rs1273977917 rs1270832205 rs1004347707 rs1055764954 rs1467779774 rs1415205254 rs1424826755 rs1370362540 rs1222576435 rs1314970592 rs1388854098 rs1360034258 rs1013020862 rs1447372379 rs1022710534 rs1453779890 rs1207781409 rs1339776566 rs1029561821 rs1213390684 rs1301564645 rs1441543137 rs1393174212 rs1243381790 rs1423264450 rs1260224688 rs1486732954 rs1448043458 rs1030357807 rs1347059574 rs1244923156 rs1442809880 rs1320522106 rs1031880491 rs1285412878 rs1479294319 rs1208925241 rs1396209546 rs1162663698 rs1430232730 rs1246739822 rs1270602477 rs1057029325 rs1230314197 rs1039442551 rs1188369777 rs1217214254 rs1347697921 rs1324469774 
rs1370312282 rs1388749383 rs1162944460 rs1471836357 rs1466164123 rs1044462807 rs1318243341 rs1453341366 rs1336070859 rs1457785775 rs1029905210 rs1257978482 rs1232859852 rs1329587431 rs1447619768 rs1334688821 rs1258851002 rs1186842232 rs1044038286 rs1252825924 rs1473674406 rs1238561255 rs1187388259 rs1217676945 Repeating Elements by RepeatMasker RepeatMasker Chapter 4

Future directions

The results presented in this thesis motivate several follow-up investigations.

First, regarding the co-expression phenomenon described in Chapter 2, the strong association with loss-of-function intolerance indicates that it must be functionally important, and may provide insight into why these EM genes are under such strong selection. It is, however, not easy to biologically interpret this co-expression as inferred from bulk expression data. Studies using single-cell RNA-seq have the potential to yield a clearer picture, provided that the network reconstruction methods are appropriately modified to deal with technical challenges inherent in such data, such as sparsity. Moreover, with single-cell RNA-seq data, it is, at least in theory, possible to estimate one EM network per biological sample, and then perform a “differential co-expression analysis” across different conditions. Such differential analyses between healthy and disease states, or across developmental time points, should offer valuable clues into the biological basis and role of EM co-expression. Experimental perturbations that disrupt this co-expression will of course also be very informative, but at present it is not clear how they can be achieved.
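As an illustration of what a differential co-expression analysis of this kind could look like, the sketch below compares gene-gene Pearson correlations between two conditions using Fisher's z-transformation. It is a minimal example under stated assumptions, not a method used in this thesis: the function name `differential_coexpression`, the cells-by-genes input layout, and the choice of Pearson correlation are all illustrative. The statistic treats cells as independent observations and ignores sparsity, both of which a real single-cell analysis would have to address.

```python
import numpy as np


def differential_coexpression(expr_a, expr_b):
    """Compare gene-gene correlations between two conditions.

    expr_a, expr_b: arrays of shape (cells, genes), one per condition
    (hypothetical layout). Returns a (genes, genes) matrix of z-statistics
    for the difference in Pearson correlation (condition A minus B).
    Genes with zero variance in either condition yield NaN entries.
    """
    expr_a = np.asarray(expr_a, dtype=float)
    expr_b = np.asarray(expr_b, dtype=float)
    if expr_a.shape[1] != expr_b.shape[1]:
        raise ValueError("both conditions must have the same genes (columns)")
    n_a, n_b = expr_a.shape[0], expr_b.shape[0]
    if min(n_a, n_b) < 4:
        raise ValueError("need at least 4 observations per condition")

    # Pearson correlation between genes (columns) within each condition.
    r_a = np.corrcoef(expr_a, rowvar=False)
    r_b = np.corrcoef(expr_b, rowvar=False)

    # Fisher z-transform; clip so that |r| = 1 does not map to infinity.
    limit = 1.0 - 1e-12
    z_a = np.arctanh(np.clip(r_a, -limit, limit))
    z_b = np.arctanh(np.clip(r_b, -limit, limit))

    # Standard error of the difference of two Fisher-transformed correlations.
    se = np.sqrt(1.0 / (n_a - 3) + 1.0 / (n_b - 3))
    z = (z_a - z_b) / se
    np.fill_diagonal(z, 0.0)  # self-correlations carry no information
    return z


if __name__ == "__main__":
    # Simulated data: genes 0 and 1 are co-expressed only in condition A.
    rng = np.random.default_rng(0)
    shared = rng.normal(size=200)
    cond_a = np.column_stack(
        [shared, shared + 0.1 * rng.normal(size=200), rng.normal(size=200)]
    )
    cond_b = rng.normal(size=(200, 3))
    print(differential_coexpression(cond_a, cond_b))
```

Entries that are large in absolute value flag gene pairs whose co-expression differs between the two conditions. In practice these statistics would require multiple-testing correction, and the correlation estimates themselves would need to account for the technical properties of single-cell data.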

Another logical next step is to characterize the epigenomic defect that arises downstream of the primary genetic defect in the Mendelian Disorders of the Epigenetic Machinery. The assays to achieve this (e.g., ATAC-seq or ChIP-seq) are well-developed. The challenge is to choose the appropriate disease-relevant cell types and developmental time points to interrogate. The reward, however, will ultimately be the discovery of epigenetic variation that has adverse fitness consequences. Such variation so far remains essentially uncharacterized, but these Mendelian disorders suggest a systematic way of pursuing its discovery and cataloging.

Finally, the results in Chapter 3 strongly suggest that the question of selection on promoter CpG density has to be directly tackled again, despite the prevailing view that such selection is not present. Given the mutagenic pressure exerted on CpGs by DNA methylation, it is clear that the potential action of selection on DNA methylation itself must be taken into account. It is, however, uncertain how this can be approached, given that DNA methylation is an epigenetic feature, and existing methods for selection inference only apply to the DNA sequence. Yet, if this question were to be successfully addressed, it could provide a fascinating link between the almost 80-year-old concept of selection on modifiers of the mutation rate, and the nascent field of population epigenetics.

Bibliography

[1] A. M. Deaton and A. Bird, “CpG islands and the regulation of transcription,” Genes & Development, vol. 25, no. 10, pp. 1010–1022, 2011.

[2] J. A. Fahrner and H. T. Bjornsson, “Mendelian disorders of the epigenetic machinery: tipping the balance of chromatin states,” Annu Rev Genomics Hum Genet, vol. 15, pp. 269–293, 2014.

[3] C. D. Allis and T. Jenuwein, “The molecular hallmarks of epigenetic control,” Nat Rev Genet, vol. 17, pp. 487–500, 2016.

[4] R. E. Amir, I. B. Van den Veyver, M. Wan, C. Q. Tran, U. Francke, and H. Y. Zoghbi, “Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2,” pp. 185–188.

[5] J. D. Lewis, R. R. Meehan, W. J. Henzel, I. Maurer-Fogy, P. Jeppesen, F. Klein, and A. Bird, “Purification, sequence, and cellular localization of a novel chromosomal protein that binds to methylated DNA,” Cell, vol. 69, no. 6, pp. 905–914, Jun. 1992.

[6] H. T. Bjornsson, “The Mendelian disorders of the epigenetic machinery,” Genome Res, vol. 25, pp. 1473–1481, 2015.

[7] J. A. Fahrner and H. T. Bjornsson, “Mendelian disorders of the epigenetic machinery: postnatal malleability and therapeutic prospects,” Human Molecular Genetics, vol. 28, no. R2, pp. R254–R264, 2019.

[8] S. De Rubeis, X. He, A. P. Goldberg, C. S. Poultney, K. Samocha, A. E. Cicek, Y. Kou, L. Liu, M. Fromer, S. Walker, T. Singh, L. Klei, J. Kosmicki, F. Shih-Chen, B. Aleksic, M. Biscaldi, P. F. Bolton, J. M. Brownfeld, J. Cai, N. G. Campbell, A. Carracedo, M. H. Chahrour, A. G. Chiocchetti, H. Coon, E. L. Crawford, S. R. Curran, G. Dawson, E. Duketis, B. A. Fernandez, L. Gallagher, E. Geller, S. J. Guter, R. S. Hill, J. Ionita-Laza, P. Jimenz Gonzalez, H. Kilpinen, S. M. Klauck, A. Kolevzon, I. Lee, I. Lei, J. Lei, T. Lehtimäki, C.-F. Lin, A. Ma’ayan, C. R. Marshall, A. L. McInnes, B. Neale, M. J. Owen, N. Ozaki, M. Parellada, J. R. Parr, S. Purcell, K. Puura, D. Rajagopalan, K. Rehnström, A. Reichenberg, A. Sabo, M. Sachse, S. J. Sanders, C. Schafer, M. Schulte-Rüther, D. Skuse, C. Stevens, P. Szatmari, K. Tammimies, O. Valladares, A. Voran, W. Li-San, L. A. Weiss, A. J. Willsey, T. W. Yu, R. K. C. Yuen, DDD Study, Homozygosity Mapping Collaborative for Autism, UK10K Consortium, E. H. Cook, C. M. Freitag, M. Gill, C. M. Hultman, T. Lehner, A. Palotie, G. D. Schellenberg, P. Sklar, M. W. State, J. S. Sutcliffe, C. A. Walsh, S. W. Scherer, M. E. Zwick, J. C. Barett, D. J. Cutler, K. Roeder, B. Devlin, M. J. Daly, and J. D. Buxbaum, “Synaptic, transcriptional and chromatin genes disrupted in autism,” Nature, vol. 515, pp. 209–215, 2014.

[9] S. E. McCarthy, J. Gillis, M. Kramer, J. Lihm, S. Yoon, Y. Berstein, M. Mistry, P. Pavlidis, R. Solomon, E. Ghiban, E. Antoniou, E. Kelleher, C. O’Brien, G. Donohoe, M. Gill, D. W. Morris, W. R. McCombie, and A. Corvin, “De novo mutations in schizophrenia implicate chromatin remodeling and support a genetic overlap with autism and intellectual disability,” Mol Psychiatry, vol. 19, pp. 652–658, 2014.

[10] T. Singh, M. I. Kurki, D. Curtis, S. M. Purcell, L. Crooks, J. McRae, J. Suvisaari, H. Chheda, D. Blackwood, G. Breen, O. Pietiläinen, S. S. Gerety, M. Ayub, M. Blyth, T. Cole, D. Collier, E. L. Coomber, N. Craddock, M. J. Daly, J. Danesh, M. DiForti, A. Foster, N. B. Freimer, D. Geschwind, M. Johnstone, S. Joss, G. Kirov, J. Körkkö, O. Kuismin, P. Holmans, C. M. Hultman, C. Iyegbe, J. Lönnqvist, M. Männikkö, S. A. McCarroll, P. McGuffin, A. M. McIntosh, A. McQuillin, J. S. Moilanen, C. Moore, R. M. Murray, R. Newbury-Ecob, W. Ouwehand, T. Paunio, E. Prigmore, E. Rees, D. Roberts, J. Sambrook, P. Sklar, D. St Clair, J. Veijola, J. T. R. Walters, H. Williams, Swedish Schizophrenia Study, INTERVAL Study, DDD Study, UK10K Consortium, P. F. Sullivan, M. E. Hurles, M. C. O’Donovan, A. Palotie, M. J. Owen, and J. C. Barrett, “Rare loss-of-function variants in SETD1A are associated with schizophrenia and developmental disorders,” Nat Neurosci, vol. 19, pp. 571–577, 2016.

[11] Deciphering Developmental Disorders Study, “Prevalence and architecture of de novo mutations in developmental disorders,” Nature, vol. 542, pp. 433–438, 2017.

[12] B. Vogelstein, N. Papadopoulos, V. E. Velculescu, S. Zhou, L. A. Diaz, Jr, and K. W. Kinzler, “Cancer genome landscapes,” Science, vol. 339, pp. 1546–1558, 2013.

[13] L. A. Garraway and E. S. Lander, “Lessons from the cancer genome,” Cell, vol. 153, pp. 17–37, 2013.

[14] A. P. Feinberg, M. A. Koldobskiy, and A. Göndör, “Epigenetic modulators, modifiers and mediators in cancer aetiology and progression,” Nat Rev Genet, vol. 17, pp. 284–299, 2016.

[15] S. P. Khare, F. Habib, R. Sharma, N. Gadewal, S. Gupta, and S. Galande, “HIstome–a relational knowledgebase of human histone proteins and histone modifying enzymes,” Nucleic Acids Res, vol. 40, pp. D337–42, 2012.

[16] Y. A. Medvedeva, A. Lennartsson, R. Ehsani, I. V. Kulakovskiy, I. E. Vorontsov, P. Panahandeh, G. Khimulya, T. Kasukawa, FANTOM Consortium, and F. Drabløs, “EpiFactors: a comprehensive database of human epigenetic factors and complexes,” Database, vol. 2015, p. bav067, 2015.

[17] K. M. Dorighi, T. Swigut, T. Henriques, N. V. Bhanu, B. S. Scruggs, N. Nady, C. D. Still, 2nd, B. A. Garcia, K. Adelman, and J. Wysocka, “Mll3 and Mll4 facilitate enhancer RNA synthesis and transcription from promoters independently of H3K4 monomethylation,” Mol Cell, vol. 66, pp. 568–576.e4, 2017.

[18] R. Rickels, H.-M. Herz, C. C. Sze, K. Cao, M. A. Morgan, C. K. Collings, M. Gause, Y.-H. Takahashi, L. Wang, E. J. Rendleman, S. A. Marshall, A. Krueger, E. T. Bartom, A. Piunti, E. R. Smith, N. A. Abshiru, N. L. Kelleher, D. Dorsett, and A. Shilatifard, “Histone H3K4 monomethylation catalyzed by Trr and mammalian COMPASS-like proteins at enhancers is dispensable for development and viability,” Nat Genet, vol. 49, pp. 1647–1653, 2017.

[19] The UniProt Consortium, “UniProt: a hub for protein information,” Nucleic Acids Res, vol. 43, pp. D204–D212, 2015.

[20] S. Hunter, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bork, U. Das, L. Daugherty, L. Duquenne, R. D. Finn, J. Gough, D. Haft, N. Hulo, D. Kahn, E. Kelly, A. Laugraud, I. Letunic, D. Lonsdale, R. Lopez, M. Madera, J. Maslen, C. McAnulla, J. McDowall, J. Mistry, A. Mitchell, N. Mulder, D. Natale, C. Orengo, A. F. Quinn, J. D. Selengut, C. J. A. Sigrist, M. Thimma, P. D. Thomas, F. Valentin, D. Wilson, C. H. Wu, and C. Yeats, “InterPro: the integrative protein signature database,” Nucleic Acids Res, vol. 37, pp. D211–D215, 2009.

[21] J. M. Vaquerizas, S. K. Kummerfeld, S. A. Teichmann, and N. M. Luscombe, “A census of human transcription factors: function, expression and evolution,” Nature Reviews Genetics, vol. 10, no. 4, pp. 252–263, 2009.

[22] L. A. Barrera, A. Vedenko, J. V. Kurland, J. M. Rogers, S. S. Gisselbrecht, E. J. Rossin, J. Woodard, L. Mariani, K. H. Kock, S. Inukai, T. Siggers, L. Shokri, R. Gordân, N. Sahni, C. Cotsapas, T. Hao, S. Yi, M. Kellis, M. J. Daly, M. Vidal, D. E. Hill, and M. L. Bulyk, “Survey of variation in human transcription factors reveals prevalent DNA binding changes,” Science, vol. 351, pp. 1450–1454, 2016.

[23] M. Lek, K. J. Karczewski, E. V. Minikel, K. E. Samocha, E. Banks, T. Fennell, A. H. O’Donnell-Luria, J. S. Ware, A. J. Hill, B. B. Cummings, T. Tukiainen, D. P. Birnbaum, J. A. Kosmicki, L. E. Duncan, K. Estrada, F. Zhao, J. Zou, E. Pierce-Hoffman, J. Berghout, D. N. Cooper, N. Deflaux, M. DePristo, R. Do, J. Flannick, M. Fromer, L. Gauthier, J. Goldstein, N. Gupta, D. Howrigan, A. Kiezun, M. I. Kurki, A. L. Moonshine, P. Natarajan, L. Orozco, G. M. Peloso, R. Poplin, M. A. Rivas, V. Ruano-Rubio, S. A. Rose, D. M. Ruderfer, K. Shakir, P. D. Stenson, C. Stevens, B. P. Thomas, G. Tiao, M. T. Tusie-Luna, B. Weisburd, H.-H. Won, D. Yu, D. M. Altshuler, D. Ardissino, M. Boehnke, J. Danesh, S. Donnelly, R. Elosua, J. C. Florez, S. B. Gabriel, G. Getz, S. J. Glatt, C. M. Hultman, S. Kathiresan, M. Laakso, S. McCarroll, M. I. McCarthy, D. McGovern, R. McPherson, B. M. Neale, A. Palotie, S. M. Purcell, D. Saleheen, J. M. Scharf, P. Sklar, P. F. Sullivan, J. Tuomilehto, M. T. Tsuang, H. C. Watkins, J. G. Wilson, M. J. Daly, D. G. MacArthur, and Exome Aggregation Consortium, “Analysis of protein-coding genetic variation in 60,706 humans,” Nature, vol. 536, no. 7616, pp. 285–291, 2016.

[24] V. Faundes, W. G. Newman, L. Bernardini, N. Canham, J. Clayton-Smith, B. Dallapiccola, S. J. Davies, M. K. Demos, A. Goldman, H. Gill, R. Horton, B. Kerr, D. Kumar, A. Lehman, S. McKee, J. Morton, M. J. Parker, J. Rankin, L. Robertson, I. K. Temple, Clinical Assessment of the Utility of Sequencing and Evaluation as a Service (CAUSES) Study, Deciphering Developmental Disorders (DDD) Study, and S. Banka, “Histone lysine methylases and demethylases in the landscape of human developmental disorders,” Am J Hum Genet, vol. 102, pp. 175–187, Jan. 2018.

[25] T. Lappalainen and J. M. Greally, “Associating cellular epigenetic models with human phenotypes,” Nat Rev Genet, vol. 18, pp. 441–451, 2017.

[26] G. Jimenez-Sanchez, B. Childs, and D. Valle, “Human disease genes,” Nature, vol. 409, no. 6822, pp. 853–855, 2001.

[27] J. G. Seidman and C. Seidman, “Transcription factor haploinsufficiency: when half a loaf is not enough,” The Journal of Clinical Investigation, vol. 109, no. 4, pp. 451–455, 2002.

[28] M. J. Carrozza, R. T. Utley, J. L. Workman, and J. Côté, “The diverse functions of histone acetyltransferase complexes,” Trends Genet, vol. 19, pp. 321–329, 2003.

[29] C. R. Clapier and B. R. Cairns, “The biology of chromatin remodeling complexes,” Annu Rev Biochem, vol. 78, pp. 273–304, 2009.

[30] A. Laugesen and K. Helin, “Chromatin repressive complexes in stem cells, development, and cancer,” Cell Stem Cell, vol. 14, pp. 735–751, 2014.

[31] R. C. Rao and Y. Dou, “Hijacked in cancer: the KMT2 (MLL) family of methyltransferases,” Nat Rev Cancer, vol. 15, pp. 334–346, 2015.

[32] J. H. Lee, K. S. Voo, and D. G. Skalnik, “Identification and characterization of the DNA binding domain of CpG-binding protein,” The Journal of Biological Chemistry, vol. 276, no. 48, pp. 44669–44676, 2001.

[33] J. M. Havrilla, B. S. Pedersen, R. M. Layer, and A. R. Quinlan, “A map of constrained coding regions in the human genome,” Nat Genet, 2018.

[34] J. Zou, G. Valiant, P. Valiant, K. Karczewski, S. O. Chan, K. Samocha, M. Lek, S. Sunyaev, M. Daly, and D. G. MacArthur, “Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects,” Nature Communications, vol. 7, p. 13293, 2016.

[35] GTEx Consortium, “Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans,” Science, vol. 348, no. 6235, pp. 648–660, 2015.

[36] B. Zhang and S. Horvath, “A general framework for weighted gene co-expression network analysis,” Stat Appl Genet Mol Biol, vol. 4, p. Article17, 2005.

[37] G. Manning, D. B. Whyte, R. Martinez, T. Hunter, and S. Sudarsanam, “The protein kinase complement of the human genome,” Science, vol. 298, pp. 1912–1934, 2002.

[38] M. J. Chen, J. E. Dixon, and G. Manning, “Genomics and evolution of protein phosphatases,” Sci Signal, vol. 10, p. eaag1796, 2017.

[39] M. W. Vermunt, P. Reinink, J. Korving, E. de Bruijn, P. M. Creyghton, O. Basak, G. Geeven, P. W. Toonen, N. Lansu, C. Meunier, S. van Heesch, Netherlands Brain Bank, H. Clevers, W. de Laat, E. Cuppen, and M. P. Creyghton, “Large-scale identification of coregulated enhancer networks in the adult human brain,” Cell Rep, vol. 9, pp. 767–779, 2014.

[40] H. K. Finucane, B. Bulik-Sullivan, A. Gusev, G. Trynka, Y. Reshef, P.-R. Loh, V. Anttila, H. Xu, C. Zang, K. Farh, S. Ripke, F. R. Day, ReproGen Consortium, Schizophrenia Working Group of the Psychiatric Genomics Consortium, RACI Consortium, S. Purcell, E. Stahl, S. Lindstrom, J. R. B. Perry, Y. Okada, S. Raychaudhuri, M. J. Daly, N. Patterson, B. M. Neale, and A. L. Price, “Partitioning heritability by functional annotation using genome-wide association summary statistics,” Nat Genet, vol. 47, pp. 1228–1235, 2015.

[41] H. Caron, B. van Schaik, M. van der Mee, F. Baas, G. Riggins, P. van Sluis, M. C. Hermus, R. van Asperen, K. Boon, P. A. Voûte, S. Heisterkamp, A. van Kampen, and R. Versteeg, “The human transcriptome map: clustering of highly expressed genes in chromosomal domains,” Science, vol. 291, pp. 1289–1292, 2001.

[42] B. A. Cohen, R. D. Mitra, J. D. Hughes, and G. M. Church, “A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression,” Nat Genet, vol. 26, pp. 183–186, 2000.

[43] The ENCODE Project Consortium, “An integrated encyclopedia of DNA elements in the human genome,” Nature, vol. 489, pp. 57–74, 2012.

[44] S. Spange, T. Wagner, T. Heinzel, and O. H. Krämer, “Acetylation of non-histone proteins modulates cellular signalling at multiple levels,” Int J Biochem Cell Biol, vol. 41, pp. 185–198, 2009.

[45] K. K. Biggar and S. S.-C. Li, “Non-histone protein methylation as a regulator of cellular signalling and function,” Nat Rev Mol Cell Biol, vol. 16, pp. 5–17, 2015.

[46] M. A. Deardorff, M. Bando, R. Nakato, E. Watrin, T. Itoh, M. Minamino, K. Saitoh, M. Komata, Y. Katou, D. Clark, K. E. Cole, E. De Baere, C. Decroos, N. Di Donato, S. Ernst, L. J. Francey, Y. Gyftodimou, K. Hirashima, M. Hullings, Y. Ishikawa, C. Jaulin, M. Kaur, T. Kiyono, P. M. Lombardi, L. Magnaghi-Jaulin, G. R. Mortier, N. Nozaki, M. B. Petersen, H. Seimiya, V. M. Siu, Y. Suzuki, K. Takagaki, J. J. Wilde, P. J. Willems, C. Prigent, G. Gillessen-Kaesbach, D. W. Christianson, F. J. Kaiser, L. G. Jackson, T. Hirota, I. D. Krantz, and K. Shirahige, “HDAC8 mutations in Cornelia de Lange syndrome affect the cohesin acetylation cycle,” Nature, vol. 489, pp. 313–317, 2012.

[47] L. Liu, D. M. Scolnick, R. C. Trievel, H. B. Zhang, R. Marmorstein, T. D. Halazonetis, and S. L. Berger, “p53 sites acetylated in vitro by PCAF and p300 are acetylated in vivo in response to DNA damage,” Mol Cell Biol, vol. 19, pp. 1202–1209, 1999.

[48] N. G. Iyer, S.-F. Chin, H. Ozdag, Y. Daigo, D.-E. Hu, M. Cariati, K. Brindle, S. Aparicio, and C. Caldas, “p300 regulates p53-dependent apoptosis after DNA damage in colorectal cancer cells by modulation of PUMA/p21 levels,” Proc Natl Acad Sci, vol. 101, pp. 7386–7391, 2004.

[49] International League Against Epilepsy Consortium on Complex Epilepsies, “Genetic determinants of common epilepsies: a meta-analysis of genome-wide association studies,” Lancet Neurol, vol. 13, pp. 893–903, 2014.

[50] S. C. Dillon, X. Zhang, R. C. Trievel, and X. Cheng, “The SET-domain protein superfamily: protein lysine methyltransferases,” Genome Biol, vol. 6, p. 227, 2005.

[51] Y. Shi, “Histone lysine demethylases: emerging roles in development, physiology and disease,” Nat Rev Genet, vol. 8, pp. 829–833, 2007.

[52] R. Marmorstein and M.-M. Zhou, “Writers and readers of histone acetylation: structure, mechanism, and inhibition,” Cold Spring Harb Perspect Biol, vol. 6, p. a018762, 2014.

[53] E. Seto and M. Yoshida, “Erasers of histone acetylation: the histone deacetylase enzymes,” Cold Spring Harb Perspect Biol, vol. 6, p. a018713, 2014.

[54] C. A. Musselman, M.-E. Lalonde, J. Côté, and T. G. Kutateladze, “Perceiving the epigenetic landscape through histone readers,” Nat Struct Mol Biol, vol. 19, pp. 1218–1227, 2012.

[55] H. F. Jørgensen and A. Bird, “MeCP2 and other methyl-CpG binding proteins,” Ment Retard Dev Disabil Res Rev, vol. 8, pp. 87–93, 2002.

[56] J. Weissman, S. Naidu, and H. T. Bjornsson, “Abnormalities of the DNA methylation mark and its machinery: an emerging cause of neurologic dysfunction,” Semin Neurol, vol. 34, pp. 249–257, 2014.

[57] C. G. A. Marfella and A. N. Imbalzano, “The Chd family of chromatin remodelers,” Mutat Res, vol. 618, pp. 30–40, 2007.

[58] L. Zeng, Q. Zhang, S. Li, A. N. Plotnikov, M. J. Walsh, and M.-M. Zhou, “Mechanism and regulation of acetylated histone binding by the tandem PHD finger of DPF3b,” Nature, vol. 466, pp. 258–262, 2010.

[59] F. M. Huber, S. M. Greenblatt, A. M. Davenport, C. Martinez, Y. Xu, L. P. Vu, S. D. Nimer, and A. Hoelz, “Histone-binding of DPF2 mediates its repressive role in myeloid differentiation,” Proc Natl Acad Sci, vol. 114, pp. 6016–6021, 2017.

[60] K. Hyun, J. Jeon, K. Park, and J. Kim, “Writing, erasing and reading histone lysine methylations,” Exp Mol Med, vol. 49, p. e324, 2017.

[61] Y. Zhao and B. A. Garcia, “Comprehensive catalog of currently documented histone modifications,” Cold Spring Harb Perspect Biol, vol. 7, p. a025064, 2015.

[62] R. Raisner, S. Kharbanda, L. Jin, E. Jeng, E. Chan, M. Merchant, P. M. Haverty, R. Bainer, T. Cheung, D. Arnott, E. M. Flynn, F. A. Romero, S. Magnuson, and K. E. Gascoigne, “Enhancer activity requires CBP/P300 bromodomain-dependent histone H3K27 acetylation,” Cell Reports, vol. 24, pp. 1722–1729, 2018.

[63] M. Zech, S. Boesch, E. M. Maier, I. Borggraefe, K. Vill, F. Laccone, V. Pilshofer, A. Ceballos-Baumann, B. Alhaddad, R. Berutti, W. Poewe, T. B. Haack, B. Haslinger, T. M. Strom, and J. Winkelmann, “Haploinsufficiency of KMT2B, encoding the lysine-specific histone methyltransferase 2B, results in early-onset generalized dystonia,” Am J Hum Genet, vol. 99, pp. 1377–1387, Dec. 2016.

[64] E. C. Schulte, A. Fukumori, B. Mollenhauer, H. Hor, T. Arzberger, R. Perneczky, A. Kurz, J. Diehl-Schmid, M. Hüll, P. Lichtner, G. Eckstein, A. Zimprich, D. Haubenberger, W. Pirker, T. Brücke, B. Bereznai, M. J. Molnar, O. Lorenzo-Betancor, P. Pastor, A. Peters, C. Gieger, X. Estivill, T. Meitinger, H. A. Kretzschmar, C. Trenkwalder, C. Haass, and J. Winkelmann, “Rare variants in β-amyloid precursor protein (APP) and Parkinson’s disease,” European Journal of Human Genetics, vol. 23, pp. 1328–1333, 2015.

[65] G. Nicolas, D. Wallon, C. Charbonnier, O. Quenez, S. Rousseau, A.-C. Richard, A. Rovelet-Lecrux, S. Coutant, K. Le Guennec, D. Bacq, J.-G. Garnier, R. Olaso, A. Boland, V. Meyer, J.-F. Deleuze, H. M. Munter, G. Bourque, D. Auld, A. Montpetit, M. Lathrop, L. Guyant-Maréchal, O. Martinaud, J. Pariente, A. Rollin-Sillaire, F. Pasquier, I. Le Ber, M. Sarazin, B. Croisile, C. Boutoleau-Bretonnière, C. Thomas-Antérion, C. Paquet, M. Sauvée, O. Moreaud, A. Gabelle, F. Sellal, M. Ceccaldi, L. Chamard, F. Blanc, T. Frebourg, D. Campion, and D. Hannequin, “Screening of dementia genes by whole-exome sequencing in early-onset Alzheimer disease: input and lessons,” European Journal of Human Genetics, vol. 24, pp. 710–716, 2016.

[66] J. L. Farlow, L. A. Robak, K. Hetrick, K. Bowling, E. Boerwinkle, Z. H.

Coban-Akdemir, T. Gambin, R. A. Gibbs, S. Gu, P. Jain, J. Jankovic,

S. Jhangiani, K. Kaw, D. Lai, H. Lin, H. Ling, Y. Liu, J. R. Lupski,

D. Muzny, P. Porter, E. Pugh, J. White, K. Doheny, R. M. Myers, J. M.

Shulman, and T. Foroud, “Whole-Exome sequencing in familial Parkinson

disease,” JAMA Neurology, vol. 73, pp. 68–75, 2016.

[67] M. V. Shulskaya, A. K. Alieva, I. N. Vlasov, V. V. Zyrin, E. Y. Fedotova,

N. Y. Abramycheva, T. S. Usenko, A. F. Yakimovsky, A. K. Emelyanov,

S. N. Pchelina, S. N. Illarioshkin, P. A. Slominsky, and M. I. Shadrina,

“Whole-Exome sequencing in searching for new variants associated

with the development of Parkinson’s disease,” Frontiers in Aging Neuroscience,

vol. 10, p. 136, 2018.

[68] C. Sandor, F. Honti, W. Haerty, K. Szewczyk-Krolikowski, P. Tomlinson,

S. Evetts, S. Millin, T. Keane, S. A. McCarthy, R. Durbin, K. Talbot,

M. Hu, C. Webber, C. P. Ponting, and R. Wade-Martins, “Whole-exome

sequencing of 228 patients with sporadic Parkinson’s disease,” Scientific

Reports, vol. 7, p. 41188, 2017.

[69] J. Cholewa-Waclaw, A. Bird, M. von Schimmelmann, A. Schaefer, H. Yu,

H. Song, R. Madabhushi, and L.-H. Tsai, “The role of epigenetic mechanisms

in the regulation of gene expression in the nervous system,” The

Journal of Neuroscience, vol. 36, pp. 11 427–11 434, 2016.

[70] M. S. Lawrence, P. Stojanov, C. H. Mermel, J. T. Robinson, L. A. Garraway,

T. R. Golub, M. Meyerson, S. B. Gabriel, E. S. Lander, and G. Getz,

“Discovery and saturation analysis of cancer genes across 21 tumour

types,” Nature, vol. 505, pp. 495–501, 2014.

[71] C. J. Tokheim, N. Papadopoulos, K. W. Kinzler, B. Vogelstein, and

R. Karchin, “Evaluating the evaluation of cancer driver genes,” Proc Natl

Acad Sci, vol. 113, pp. 14 330–14 335, 2016.

[72] H. Shen and P. W. Laird, “Interplay between the cancer genome and

epigenome,” Cell, vol. 153, pp. 38–55, 2013.

[73] M. N. Cabili, C. Trapnell, L. Goff, M. Koziol, B. Tazon-Vega, A. Regev,

and J. L. Rinn, “Integrative annotation of human large intergenic noncoding

RNAs reveals global properties and specific subclasses,” Genes

Dev, vol. 25, pp. 1915–1927, 2011.

[74] C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. R. Kelley, H. Pimentel,

S. L. Salzberg, J. L. Rinn, and L. Pachter, “Differential gene and

transcript expression analysis of RNA-seq experiments with TopHat and

Cufflinks,” Nat Protoc, vol. 7, pp. 562–578, 2012.

[75] J. T. Leek and J. D. Storey, “Capturing heterogeneity in gene expression

studies by surrogate variable analysis,” PLOS Genet, vol. 3,

pp. 1724–1735, 2007.

[76] ——, “A general framework for multiple testing dependence,” Proc Natl

Acad Sci, vol. 105, pp. 18 718–18 723, 2008.

[77] J. T. Leek, W. E. Johnson, H. S. Parker, A. E. Jaffe, and J. D. Storey, “The

sva package for removing batch effects and other unwanted variation

in high-throughput experiments,” Bioinformatics, vol. 28, pp. 882–883,

2012.

[78] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and

G. K. Smyth, “limma powers differential expression analyses for RNA-sequencing

and microarray studies,” Nucleic Acids Res, vol. 43, p. e47,

2015.

[79] S. Freytag, J. Gagnon-Bartsch, T. P. Speed, and M. Bahlo, “Systematic

noise degrades gene co-expression signals but can be corrected,” BMC

Bioinformatics, vol. 16, p. 309, 2015.

[80] P. Parsana, C. Ruberman, A. E. Jaffe, M. C. Schatz, A. Battle, and J. T.

Leek, “Addressing confounding artifacts in reconstruction of gene co-expression

networks,” bioRxiv, p. 202903, 2017.

[81] P. Langfelder and S. Horvath, “WGCNA: an R package for weighted correlation

network analysis,” BMC Bioinformatics, vol. 9, p. 559, 2008.

[82] L. Collado-Torres, A. Nellore, K. Kammers, S. E. Ellis, M. A. Taub, K. D.

Hansen, A. E. Jaffe, B. Langmead, and J. T. Leek, “Reproducible RNA-seq

analysis using recount2,” Nat Biotechnol, vol. 35, pp. 319–321, 2017.

[83] L. Collado-Torres, A. Nellore, and A. E. Jaffe, “recount workflow: Accessing

over 70,000 human RNA-seq samples with Bioconductor,” F1000Res,

vol. 6, p. 1558, 2017.

[84] S. A. Slavoff, A. J. Mitchell, A. G. Schwaid, M. N. Cabili, J. Ma, J. Z. Levin,

A. D. Karger, B. A. Budnik, J. L. Rinn, and A. Saghatelian, “Peptidomic

discovery of short open reading frame-encoded peptides in human cells,”

Nat Chem Biol, vol. 9, pp. 59–64, Jan. 2013.

[85] B. K. Bulik-Sullivan, P.-R. Loh, H. K. Finucane, S. Ripke, J. Yang,

Schizophrenia Working Group of the Psychiatric Genomics Consortium,

N. Patterson, M. J. Daly, A. L. Price, and B. M. Neale, “LD score regression

distinguishes confounding from polygenicity in genome-wide association

studies,” Nat Genet, vol. 47, pp. 291–295, 2015.

[86] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A

practical and powerful approach to multiple testing,” J Royal Stat Soc

Series B Stat Methodol, vol. 57, pp. 289–300, 1995.

[87] T. Tukiainen, A.-C. Villani, A. Yen, M. A. Rivas, J. L. Marshall,

R. Satija, M. Aguirre, L. Gauthier, M. Fleharty, A. Kirby, B. B. Cummings,

S. E. Castel, K. J. Karczewski, F. Aguet, A. Byrnes, GTEx Consortium,

Laboratory, Data Analysis & Coordinating Center (LDACC)—Analysis

Working Group, Statistical Methods groups—Analysis Working

Group, Enhancing GTEx (eGTEx) groups, NIH Common Fund,

NIH/NCI, NIH/NHGRI, NIH/NIMH, NIH/NIDA, Biospecimen Collection

Source Site—NDRI, Biospecimen Collection Source Site—RPCI, Biospecimen

Core Resource—VARI, Brain Bank Repository—University of Miami

Brain Endowment Bank, Leidos Biomedical—Project Management,

ELSI Study, Genome Browser Data Integration & Visualization—EBI,

Genome Browser Data Integration & Visualization—UCSC Genomics Institute,

University of California Santa Cruz, T. Lappalainen, A. Regev,

K. G. Ardlie, N. Hacohen, and D. G. MacArthur, “Landscape of X chromosome

inactivation across human tissues,” Nature, vol. 550, pp. 244–248,

2017.

[88] M. J. Guertin and J. T. Lis, “Mechanisms by which transcription factors

gain access to target sequence elements in chromatin,” Curr Opin Genet

Dev, vol. 23, pp. 116–123, 2013.

[89] T. Quante and A. Bird, “Do short, frequent DNA sequence motifs mould

the epigenome?” Nat Rev Mol Cell Biol, vol. 17, pp. 257–262, 2016.

[90] A. Hochwagen and G. A. B. Marais, “Meiosis: a PRDM9 guide to the

hotspots of recombination,” Curr Biol, vol. 20, pp. R271–R274, 2010.

[91] S. Chuma, M. Hosokawa, K. Kitamura, S. Kasai, M. Fujioka, M. Hiyoshi,

K. Takamune, T. Noce, and N. Nakatsuji, “Tdrd1/Mtr-1, a tudor-related

gene, is essential for male germ-cell differentiation and nuage/germinal

granule formation in mice,” Proc Natl Acad Sci, vol. 103,

pp. 15 894–15 899, 2006.

[92] J. Pan, M. Goodheart, S. Chuma, N. Nakatsuji, D. C. Page, and P. J.

Wang, “RNF17, a component of the mammalian germ cell nuage, is essential

for spermiogenesis,” Development, vol. 132, pp. 4029–4039, 2005.

[93] E. Shang, H. D. Nickerson, D. Wen, X. Wang, and D. J. Wolgemuth,

“The first bromodomain of Brdt, a testis-specific member of the BET sub-family

of double-bromodomain-containing proteins, is essential for male

germ cell differentiation,” Development, vol. 134, pp. 3507–3515, 2007.

[94] M. Yamaji, Y. Seki, K. Kurimoto, Y. Yabuta, M. Yuasa, M. Shigeta, K. Yamanaka,

Y. Ohinata, and M. Saitou, “Critical function of Prdm14 for the

establishment of the germ cell lineage in mice,” Nat Genet, vol. 40, pp.

1016–1022, 2008.

[95] W. A. Pastor, H. Stroud, K. Nee, W. Liu, D. Pezic, S. Manakov, S. A. Lee,

G. Moissiard, N. Zamudio, D. Bourc’his, A. A. Aravin, A. T. Clark, and

S. E. Jacobsen, “MORC1 represses transposable elements in the mouse

male germline,” Nat Commun, vol. 5, p. 5795, 2014.

[96] E. L. Huttlin, R. J. Bruckner, J. A. Paulo, J. R. Cannon, L. Ting,

K. Baltier, G. Colby, F. Gebreab, M. P. Gygi, H. Parzen, J. Szpyt, S. Tam,

G. Zarraga, L. Pontano-Vaites, S. Swarup, A. E. White, D. K. Schweppe,

R. Rad, B. K. Erickson, R. A. Obar, K. G. Guruharsha, K. Li, S. Artavanis-Tsakonas,

S. P. Gygi, and J. W. Harper, “Architecture of the human interactome

defines protein communities and disease networks,” Nature, vol.

545, pp. 505–509, 2017.

[97] D. S. Falconer and T. F. C. Mackay, Introduction to Quantitative Genetics.

Pearson, 1996.

[98] Z. L. Fuller, J. J. Berg, H. Mostafavi, G. Sella, and M. Przeworski, “Measuring

intolerance to mutation in human genetics,” Nature Genetics,

vol. 51, no. 5, pp. 772–776, 2019.

[99] S. Petrovski, Q. Wang, E. L. Heinzen, A. S. Allen, and D. B. Goldstein,

“Genic intolerance to functional variation and the interpretation of personal

genomes,” PLOS Genetics, vol. 9, no. 8, p. e1003709, 2013.

[100] K. J. Karczewski, L. C. Francioli, G. Tiao, B. B. Cummings, J. Alföldi,

Q. Wang, R. L. Collins, K. M. Laricchia, A. Ganna, D. P. Birnbaum, L. D.

Gauthier, H. Brand, M. Solomonson, N. A. Watts, D. Rhodes, M. Singer-Berk,

E. M. England, E. G. Seaby, J. A. Kosmicki, R. K. Walters, K. Tashman,

Y. Farjoun, E. Banks, T. Poterba, A. Wang, C. Seed, N. Whiffin, J. X.

Chong, K. E. Samocha, E. Pierce-Hoffman, Z. Zappala, A. H. O’Donnell-Luria,

E. V. Minikel, B. Weisburd, M. Lek, J. S. Ware, C. Vittal, I. M.

Armean, L. Bergelson, K. Cibulskis, K. M. Connolly, M. Covarrubias,

S. Donnelly, S. Ferriera, S. Gabriel, J. Gentry, N. Gupta, T. Jeandet,

D. Kaplan, C. Llanwarne, R. Munshi, S. Novod, N. Petrillo, D. Roazen,

V. Ruano-Rubio, A. Saltzman, M. Schleicher, J. Soto, K. Tibbetts, C. Tolonen,

G. Wade, M. E. Talkowski, The Genome Aggregation Database

Consortium, B. M. Neale, M. J. Daly, and D. G. MacArthur, “Variation

across 141,456 human exomes and genomes reveals the spectrum of loss-of-function

intolerance across human protein-coding genes,” bioRxiv, p.

531210, 2019.

[101] A. N. Abou Tayoun, T. Pesaran, M. T. DiStefano, A. Oza, H. L. Rehm, L. G.

Biesecker, S. M. Harrison, and ClinGen Sequence Variant Interpretation

Working Group (ClinGen SVI), “Recommendations for interpreting the

loss of function PVS1 ACMG/AMP variant criterion,” Human Mutation,

vol. 39, no. 11, pp. 1517–1524, 2018.

[102] S. Lykke-Andersen and T. H. Jensen, “Nonsense-mediated mRNA decay:

an intricate machinery that shapes transcriptomes,” Nature Reviews

Molecular Cell Biology, vol. 16, no. 11, pp. 665–677, 2015.

[103] R. G. H. Lindeboom, M. Vermeulen, B. Lehner, and F. Supek, “The impact

of nonsense-mediated mRNA decay on genetic disease, gene editing and

cancer immunotherapy,” Nature Genetics, vol. 51, no. 11, pp. 1645–1651,

2019.

[104] X. Han, S. Chen, E. Flynn, S. Wu, D. Wintner, and Y. Shen, “Distinct

epigenomic patterns are associated with haploinsufficiency and predict

risk genes of developmental disorders,” Nature Communications, vol. 9,

no. 1, p. 2138, 2018.

[105] X. Wang and D. B. Goldstein, “Enhancer domains predict gene

pathogenicity and inform gene discovery in complex disease,” Am. J.

Hum. Genet., vol. 106, no. 2, pp. 215–233, 2020.

[106] A. P. Bird, “CpG islands as gene markers in the vertebrate nucleus,”

Trends in Genetics, vol. 3, pp. 342–347, 1987.

[107] A. Meissner, T. S. Mikkelsen, H. Gu, M. Wernig, J. Hanna, A. Sivachenko,

X. Zhang, B. E. Bernstein, C. Nusbaum, D. B. Jaffe, A. Gnirke,

R. Jaenisch, and E. S. Lander, “Genome-scale DNA methylation maps

of pluripotent and differentiated cells,” Nature, vol. 454, no. 7205, pp.

766–770, 2008.

[108] R. Straussman, D. Nejman, D. Roberts, I. Steinfeld, B. Blum, N. Benvenisty,

I. Simon, Z. Yakhini, and H. Cedar, “Developmental programming

of CpG island methylation profiles in the human genome,” Nature

Structural & Molecular Biology, vol. 16, no. 5, pp. 564–571, 2009.

[109] H. K. Long, N. P. Blackledge, and R. J. Klose, “ZF-CxxC domain-containing

proteins, CpG islands and the chromatin connection,” Biochemical

Society Transactions, vol. 41, no. 3, pp. 727–740, 2013.

[110] J. P. Thomson, P. J. Skene, J. Selfridge, T. Clouaire, J. Guy, S. Webb,

A. R. W. Kerr, A. Deaton, R. Andrews, K. D. James, D. J. Turner, R. Illingworth,

and A. Bird, “CpG islands influence chromatin structure via the

CpG-binding protein Cfp1,” Nature, vol. 464, no. 7291, pp. 1082–1086,

2010.

[111] T. Clouaire, S. Webb, P. Skene, R. Illingworth, A. Kerr, R. Andrews, J.-H.

Lee, D. Skalnik, and A. Bird, “Cfp1 integrates both CpG content and

gene activity for accurate H3K4me3 deposition in embryonic stem cells,”

Genes & Development, vol. 26, no. 15, pp. 1714–1728, 2012.

[112] E. Wachter, T. Quante, C. Merusi, A. Arczewska, F. Stewart, S. Webb,

and A. Bird, “Synthetic CpG islands reveal DNA sequence determinants

of chromatin structure,” eLife, vol. 3, p. e03397, 2014.

[113] M. A. White, C. A. Myers, J. C. Corbo, and B. A. Cohen, “Massively parallel

in vivo enhancer assay reveals that highly local features determine

the cis-regulatory function of ChIP-seq peaks,” Proceedings of the National

Academy of Sciences of the United States of America, vol. 110,

no. 29, pp. 11 952–11 957, 2013.

[114] D. Hartl, A. R. Krebs, R. S. Grand, T. Baubec, L. Isbel, C. Wirbelauer,

L. Burger, and D. Schübeler, “CG dinucleotides enhance promoter activity

independent of DNA methylation,” Genome Research, vol. 29, no. 4,

pp. 554–563, 2019.

[115] B. B. Cummings, K. J. Karczewski, J. A. Kosmicki, E. G. Seaby, N. A.

Watts, M. Singer-Berk, J. M. Mudge, J. Karjalainen, F. Kyle Satterstrom,

A. O’Donnell-Luria, T. Poterba, C. Seed, M. Solomonson, J. Alföldi, The

Genome Aggregation Database Production Team, The Genome Aggregation

Database Consortium, M. J. Daly, and D. G. MacArthur, “Transcript

expression-aware annotation improves rare variant discovery and interpretation,”

bioRxiv, p. 554444, 2019.

[116] S. Saxonov, P. Berg, and D. L. Brutlag, “A genome-wide analysis of CpG

dinucleotides in the human genome distinguishes two distinct classes

of promoters,” Proceedings of the National Academy of Sciences of the

United States of America, vol. 103, no. 5, pp. 1412–1417, 2006.

[117] V. Agarwal and J. Shendure, “Predicting mRNA abundance directly from

genomic sequence using deep convolutional neural networks,” bioRxiv, p.

416685, 2018.

[118] E. M. Riising, I. Comet, B. Leblanc, X. Wu, J. V. Johansen, and K. Helin,

“Gene silencing triggers polycomb repressive complex 2 recruitment to

CpG islands genome wide,” Molecular Cell, vol. 55, no. 3, pp. 347–360,

2014.

[119] G. Berrozpe, G. O. Bryant, K. Warpinski, D. Spagna, S. Narayan, S. Shah,

and M. Ptashne, “Polycomb responds to low levels of transcription,” Cell

Reports, vol. 20, no. 4, pp. 785–793, 2017.

[120] N.-L. Sim, P. Kumar, J. Hu, S. Henikoff, G. Schneider, and P. C. Ng, “SIFT

web server: predicting effects of amino acid substitutions on proteins,”

Nucleic Acids Research, vol. 40, no. Web Server issue, pp. W452–7, 2012.

[121] I. Adzhubei, D. M. Jordan, and S. R. Sunyaev, Current Protocols in

Human Genetics, 2013, vol. Chapter 7.

[122] N. Abramovs, A. Brass, and M. Tassabehji, “GeVIR is a continuous gene-level

metric that uses variant distribution patterns to prioritize disease

candidate genes,” Nat. Genet., vol. 52, no. 1, pp. 35–39, Jan. 2020.

[123] R. L. Collins, H. Brand, K. J. Karczewski, X. Zhao, J. Alföldi, L. C. Francioli,

A. V. Khera, C. Lowther, L. D. Gauthier, H. Wang, N. A. Watts,

M. Solomonson, A. O’Donnell-Luria, A. Baumann, R. Munshi, M. Walker,

C. Whelan, Y. Huang, T. Brookings, T. Sharpe, M. R. Stone, E. Valkanas,

J. Fu, G. Tiao, K. M. Laricchia, V. Ruano-Rubio, C. Stevens, N. Gupta,

L. Margolin, Genome Aggregation Database Production Team, Genome

Aggregation Database Consortium, K. D. Taylor, H. J. Lin, S. S. Rich,

W. Post, Y.-D. I. Chen, J. I. Rotter, C. Nusbaum, A. Philippakis, E. Lander,

S. Gabriel, B. M. Neale, S. Kathiresan, M. J. Daly, E. Banks, D. G.

MacArthur, and M. E. Talkowski, “An open resource of structural variation

for medical and population genetics,” bioRxiv, p. 578674, Oct. 2019.

[124] L. Boukas, J. M. Havrilla, P. F. Hickey, A. R. Quinlan, H. T. Bjornsson,

and K. D. Hansen, “Coexpression patterns define epigenetic regulators

associated with neurological dysfunction,” Genome Research, vol. 29,

no. 4, pp. 532–542, 2019.

[125] N. M. Cohen, E. Kenigsberg, and A. Tanay, “Primate CpG islands are

maintained by heterogeneous evolutionary regimes involving minimal

selection,” Cell, vol. 145, no. 5, pp. 773–786, 2011.

[126] M. D. Morgan and J. C. Marioni, “CpG island composition differences are

a source of gene expression noise indicative of promoter responsiveness,”

Genome Biology, vol. 19, no. 1, p. 81, 2018.

[127] L. A. Barrera, A. Vedenko, J. V. Kurland, J. M. Rogers, S. S. Gisselbrecht,

E. J. Rossin, J. Woodard, L. Mariani, K. H. Kock, S. Inukai, T. Siggers,

L. Shokri, R. Gordân, N. Sahni, C. Cotsapas, T. Hao, S. Yi, M. Kellis,

M. J. Daly, M. Vidal, D. E. Hill, and M. L. Bulyk, “Survey of variation

in human transcription factors reveals prevalent DNA binding changes,”

Science, vol. 351, no. 6280, pp. 1450–1454, 2016.

[128] S. A. Lambert, A. Jolma, L. F. Campitelli, P. K. Das, Y. Yin, M. Albu,

X. Chen, J. Taipale, T. R. Hughes, and M. T. Weirauch, “The human transcription

factors,” Cell, vol. 175, no. 2, pp. 598–599, Oct. 2018.

[129] M. W. Perry, A. N. Boettiger, J. P. Bothma, and M. Levine, “Shadow enhancers

foster robustness of Drosophila gastrulation,” Curr. Biol., vol. 20,

no. 17, pp. 1562–1567, Sep. 2010.

[130] N. Frankel, G. K. Davis, D. Vargas, S. Wang, F. Payre, and D. L. Stern,

“Phenotypic robustness conferred by apparently redundant transcriptional

enhancers,” Nature, vol. 466, no. 7305, pp. 490–493, Jul. 2010.

[131] J. Harrow, A. Frankish, J. M. Gonzalez, E. Tapanari, M. Diekhans,

F. Kokocinski, B. L. Aken, D. Barrell, A. Zadissa, S. Searle, I. Barnes,

A. Bignell, V. Boychenko, T. Hunt, M. Kay, G. Mukherjee, J. Rajan,

G. Despacio-Reyes, G. Saunders, C. Steward, R. Harte, M. Lin,

C. Howald, A. Tanzer, T. Derrien, J. Chrast, N. Walters, S. Balasubramanian,

B. Pei, M. Tress, J. M. Rodriguez, I. Ezkurdia, J. van Baren,

M. Brent, D. Haussler, M. Kellis, A. Valencia, A. Reymond, M. Gerstein,

R. Guigó, and T. J. Hubbard, “GENCODE: the reference human genome

annotation for the ENCODE project,” Genome Research, vol. 22, no. 9,

pp. 1760–1774, 2012.

[132] M. Wintzerith, J. Acker, S. Vicaire, M. Vigneron, and C. Kedinger, “Complete

sequence of the human RNA polymerase II largest subunit,” Nucleic

Acids Research, vol. 20, no. 4, p. 910, 1992.

[133] K. Mita, H. Tsuji, M. Morimyo, E. Takahashi, M. Nenoi, S. Ichimura,

M. Yamauchi, E. Hongo, and A. Hayashi, “The human gene encoding the

largest subunit of RNA polymerase II,” Gene, vol. 159, no. 2, pp. 285–286,

1995.

[134] ENCODE Project Consortium, “An integrated encyclopedia of DNA elements

in the human genome,” Nature, vol. 489, no. 7414, pp. 57–74,

2012.

[135] M. Gardiner-Garden and M. Frommer, “CpG islands in vertebrate

genomes,” J. Mol. Biol., vol. 196, no. 2, pp. 261–282, Jul. 1987.

[136] GTEx Consortium, “Genetic effects on gene expression across human tissues,”

Nature, vol. 550, no. 7675, pp. 204–213, 2017.

[137] N. Kryuchkova-Mostacci and M. Robinson-Rechavi, “A benchmark of

gene expression tissue-specificity metrics,” Briefings in Bioinformatics,

vol. 18, no. 2, pp. 205–214, 2017.

[138] A. Siepel, G. Bejerano, J. S. Pedersen, A. S. Hinrichs, M. Hou, K. Rosenbloom,

H. Clawson, J. Spieth, L. W. Hillier, S. Richards, G. M. Weinstock,

R. K. Wilson, R. A. Gibbs, W. J. Kent, W. Miller, and D. Haussler, “Evolutionarily

conserved elements in vertebrate, insect, worm, and yeast

genomes,” Genome Research, vol. 15, no. 8, pp. 1034–1050, 2005.

[139] N. Huang, I. Lee, E. M. Marcotte, and M. E. Hurles, “Characterising

and predicting haploinsufficiency in the human genome,” PLOS Genetics,

vol. 6, no. 10, p. e1001154, 2010.

[140] J. Steinberg, F. Honti, S. Meader, and C. Webber, “Haploinsufficiency predictions

without study bias,” Nucleic Acids Research, vol. 43, no. 15, p.

e101, Sep. 2015.

[141] H. A. Shihab, M. F. Rogers, C. Campbell, and T. R. Gaunt, “HIPred: an

integrative approach to predicting haploinsufficient genes,” Bioinformatics,

vol. 33, no. 12, pp. 1751–1757, Jun. 2017.

Curriculum Vitae

Leandros Boukas, M.D.

Personal:

Born: August 29, 1991 in Athens, Greece

Citizenship: Greek

Residence: USA

E-mail: [email protected], [email protected]

Education:

Aug 2015 - June 2020: PhD in Human Genetics

McKusick-Nathans Department of Genetic Medicine, Johns Hopkins

University School of Medicine, Baltimore, MD, USA

Advisors: Kasper Daniel Hansen, PhD & Hans Tomas Bjornsson, MD,

PhD

Sep 2009 - July 2015: Doctor of Medicine (M.D.)

University of Patras, Patras, Greece

Publications and Preprints:

1. L. Boukas, H.T. Bjornsson & K.D. Hansen. Promoter CpG density predicts

downstream gene loss-of-function intolerance (2020). bioRxiv.

2. L. Boukas, J.M. Havrilla, P.F. Hickey, A.R. Quinlan, H.T. Bjornsson &

K.D. Hansen. Co-expression patterns define epigenetic regulators associated

with neurological dysfunction (2019). Genome Research.

3. J.A. Fahrner, W.Y. Lin, R.C. Riddle, L. Boukas, et al. Precocious chondrocyte

differentiation disrupts skeletal growth in Kabuki syndrome mice

(2019). JCI Insight.

4. L. Myint, R. Wang, L. Boukas, K.D. Hansen, et al. A screen of 1,049

schizophrenia and 30 Alzheimer’s-associated variants for regulatory potential

(2019). Am J Med Genet B Neuropsychiatr Genet.

5. G.A. Carosso, L. Boukas, J.J. Augustin, H.N. Nguyen, et al. Precocious

neuronal differentiation and disrupted oxygen responses in Kabuki syndrome

(2019). JCI Insight.

6. G.O. Pilarowski, H.J. Vernon, C.D. Applegate, L. Boukas, et al. Missense

variants in the chromatin remodeler CHD1 are associated with neurodevelopmental

disability (2017). Journal of Medical Genetics.

Awards/Honors:

2019: Fellowship from the Gerondelis Foundation

2018: 3rd place award, 12th Annual Symposium and Poster Session in Genomics

and Bioinformatics (held at Johns Hopkins University)

2017: Maryland Genetics, Epidemiology and Medicine (MD-GEM) Fellowship

(supported by the Burroughs Wellcome Fund)

2015: Travel Fellowship from the European Society of Human Genetics to

attend the 28th European School in Medical Genetics (held in Bertinoro,

Italy)

2008: 3rd Place Award (with Rafail Kotronias and Phillip Siaplaouras) in the

Attica Competition in Experimental Biology, Experimental Chemistry,

and Experimental Physics

2007: 3rd Place Award in the National Mathematics Competition “Euclid”

(organized by the Greek Mathematical Society)

Presentations/Posters:

2019: Signatures of selection on promoter CpG density in the human genome.

Poster, Human Genetics at NYC Day, New York, NY

2019: Co-expression patterns define epigenetic regulators associated with

neurological dysfunction. Poster, The Biology of Genomes annual meeting,

Cold Spring Harbor Laboratory, NY

2018: Co-expression patterns define epigenetic regulators associated with

neurological dysfunction. Poster, 12th Annual Symposium and Poster

Session in Genomics and Bioinformatics, Johns Hopkins University (3rd

prize award at the poster session)

2018: Co-expressed epigenetic regulators are severely intolerant to variation

and associated with neurological dysfunction. Poster, Genetics Research

Day, Johns Hopkins University

2017: A genetic analysis of the human epigenetic machinery. Oral presentation,

Chromatin Workshop, Johns Hopkins University

2017: The human epigenetic machinery is intolerant to variation and highly

co-expressed. Poster, American Society of Human Genetics annual meeting,

Orlando, FL
