Computational Methods to Advance From Genetic Association to Biological Insight


Citation Fine, Rebecca S. 2020. Computational Methods to Advance From Genetic Association to Biological Insight. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Citable link https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365548

Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Computational methods to advance from genetic association to biological insight

A dissertation presented

by

Rebecca S. Fine

to

The Division of Medical Sciences

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in the subject of

Genetics and Genomics

Harvard University

Cambridge, Massachusetts

February 2020

© 2020 Rebecca S. Fine All rights reserved.

Dissertation Advisor: Joel N. Hirschhorn Rebecca S. Fine

Computational methods to advance from genetic association to biological insight

Abstract

The increasing availability of large-scale genetic data has enabled massive efforts to understand the genetic basis of traits and diseases. However, roughly 90% of genome-wide association study (GWAS) signals can be attributed to noncoding variation. Such variants are challenging to interpret and connect to biological insight because (1) there are often multiple possible causal variants at each locus and (2) linking noncoding variants to the genes they regulate is difficult, which obscures the relevant biological processes. One useful approach in addressing this problem is the application of algorithms that search for commonalities across loci, which can implicate biological processes through gene set enrichment analysis and prioritize likely causal genes. To facilitate progress from genetic association to biological hypotheses, I have built upon this strategy to develop two different methods.

First, I developed a gene set enrichment analysis method for the ExomeChip, a genotyping array which focuses on rare and low-frequency coding variation. To do this, I adapted DEPICT, a gene set enrichment analysis method developed by our lab for GWAS data.

DEPICT is particularly powerful as a tool for biological interpretation due to its use of gene sets that have been extended via coexpression data to make predictions about the function of uncharacterized genes. I applied this method to many different types of traits, including anthropometric (e.g. height), hormonal (e.g. adiponectin), and glycemic (e.g. fasting glucose).

Second, many different gene prioritization algorithms for GWAS have been developed using a commonality-based strategy, but it is difficult to determine which are the most accurate.


Most new methods benchmark using “gold standards” (genes already known to play a role in the trait of interest). However, such gold standards are biased toward well-studied genes and pathways. Therefore, I developed Benchmarker, a leave-one-chromosome-out method for benchmarking prioritization methods that relies only on the original GWAS data and evaluates method performance using stratified LD score regression.

The methods described in this dissertation contribute to an important and growing body of work on more effective and rigorous analysis of GWAS data to obtain biological insight.


Table of Contents

Abstract
Table of Contents
Acknowledgements
Dedication
Attributions
    Chapter 2
    Chapter 3
    Chapter 4
Chapter 1: Introduction
    Rare and low-frequency coding variation
    Gene set enrichment analysis
    Partitioning heritability
    Gene prioritization
    Benchmarking strategies
    Tissue and cell type enrichment
    Summary
Chapter 2: EC-DEPICT: a gene set enrichment analysis method for ExomeChip data
    Abstract
    Introduction
    Methods
    Results
    Discussion
Chapter 3: Benchmarker: an unbiased, association-data-driven strategy to evaluate gene prioritization algorithms
    Abstract
    Introduction
    Materials and Methods
    Results
    Discussion
Chapter 4: Application of single-cell RNA sequencing data to interpret genome-wide association studies of body-mass index
    Abstract
    Introduction
    Methods
    Results
    Discussion
Chapter 5: Discussion
    Major findings and implications
    Future directions
    Conclusions
Appendix A: Supplementary materials for Chapter 2
    Supplementary figures for Chapter 2
    Supplementary tables for Chapter 2
    Supplementary text for Chapter 2
Appendix B: Supplementary materials for Chapter 3
    Supplementary figures for Chapter 3
    Supplementary tables for Chapter 3
References


Acknowledgements

Graduate school is a team sport, and there are many people without whom this dissertation would not have been possible. First, I want to thank my mentor, Joel Hirschhorn.

Joel, you have made me a better scientist, a better writer, and a better communicator. I started graduate school with only a vague concept of the definition of bioinformatics and more or less zero programming skills, and somehow I’m coming out the other side with a computational Ph.D. in genetics. I am so grateful that you took a chance on me five years ago, believed in me, and invested in me. Thank you for all that you’ve taught me, for your patience, and for your kindness.

I’d also like to thank my dissertation advisory committee members, Soumya Raychaudhuri, Richa Saxena, and Chris Newton-Cheh, for providing ideas and encouragement throughout my Ph.D. In addition, I thank the members of my defense committee, Richa Saxena, Ron Do, Marcy MacDonald, and Elise Robinson, for taking the time to read this dissertation and conduct my defense.

Next, I would like to acknowledge my grad school friends, in particular Kayla Davis, Kate Lachance, Sam Schilit, Lisa Witten, Katie Wu, and Adam Riesselman. Kayla and Kate, you are my personal superheroes, and I feel extremely lucky to have spent the last four years working on the Oklahoma Science Project with both of you. Sam, the emotional support (and pictures) you’ve given me over the years defies description. Lisa, I’m so glad a terrible class brought us together and cultivated a friendship that’s become incredibly important to me. Katie, I don’t know how you manage to do all of the amazing things you do, but your science communication skills (not to mention all of your awesome science) have always inspired me to push myself harder. Adam, my partner in sass and in politics, you’ve somehow always been able to both make me laugh and also teach me bioinformatics skills way above my pay grade. To all of you: the lunches we’ve had in NRB, the coffee breaks we’ve taken, and the long conversations about science and emotions and the emotions of science are probably the only reason I survived graduate school, and I am so grateful.

Next, my wonderful labmates, who are some of the best people I know. Yu-Han Hsu, my coworker-turned-ASHG-roommate-turned-actual-roommate: you made ASHG two hundred percent better for me every year, and I’ve learned so much about being a better scientist from you. Eric, your constant generosity in spending time with me to talk through problems even when you were stressed about your own work has never ceased to amaze me and make me want to be the best labmate I can be, and I look forward to our therapy lunch every month. Sailaja Vedantam, you are the best lab mom I could have asked for and I truly wish I could live off of your cooking. Joanne Cole, you have taught me so much, and I’m sorry for constantly distracting you with sparkling conversation (actually, I’m only a little bit sorry). Priya Nakka, thank you for obsessing over The Good Place with me. Kristin Tsuo, it’s been a joy to help mentor you, and I can’t believe we finally roped you into graduate school. Kathy Ushakov, you’re a wonderful labmate, and getting to chat with you always improves my day. Christina Astley, I am forever in awe of your statistical and epidemiological chops. Tune Pers, thank you for entrusting me with DEPICT. Hannah Whitley, your administrative skills and patience have honestly improved my graduate school experience to a degree I would not have thought possible. I’d also like to acknowledge all other labmates, past and present, all of whom have taught me and made me better every single day, especially Mike Guo, Jess Kremen, Jia Zhu, Nora Renthal, Rany Salem, Tonu Esko, Yisong Huang, Jenkai Miao, Greg Gorski, Chris Addis, Ben Weaver, Vidhu Thaker, and Jon Swartz.


I’d also like to thank three more of the most important people in my life. To Shir Livne and Emma Oppenheim, two of my best friends in the world: I’ve known you both for more than 15 years, and every time I see either of you it feels like no time has passed at all. Thank you both for lunches at Coach, for trading high school gossip, and for tolerating my incessant whining for more than five years. To Brian Soulliard: thank you for supporting me during this process and for always believing I could do it. Your faith has meant the world to me, and I love you. Good luck getting me out of the Bahamas.

Finally, I could not have done this without my family. To my grandparents, aunt, uncles, and cousins: thank you for the Thanksgivings, Seders, and summers in Maine. Anna Cline, you are the yin to my yang and my second sister. You pushed me when I needed to be pushed, and you have made me a much better (not to mention happier) person. Hannah, your values and your sacrifices to make the world a better place inspire me every day. If everyone in the world had five percent of your drive and integrity, we’d all be better off. Mom and Dad, I can honestly say I could not have gotten through this without you. I don’t know what I did to deserve parents like you, but your consistent support, love, and encouragement has meant everything to me. I love you all more than I can express.


Dedication

To Mom, Dad, Hannah, and Anna.


Attributions

Chapter 2

Rebecca S. Fine: Conceived, designed, and implemented EC-DEPICT. Performed the analyses and interpreted results. Wrote relevant sections of the manuscript with feedback from all authors.

Tune H. Pers: Provided input regarding the design of EC-DEPICT. Generated permutations from Swedish data and the first release of UK Biobank.

Joel N. Hirschhorn: Conceived and designed EC-DEPICT and interpreted results. Substantially edited the manuscript. Provided overall supervision.

David Lamparter and Zoltán Kutalik: Performed PASCAL analyses.

The GIANT Height ExomeChip Working Group (chairs: Guillaume Lettre, Panos Deloukas, Joel N. Hirschhorn, Timothy M. Frayling, Ruth J.F. Loos, Fernando Rivadeneira, Zoltán Kutalik, Claus Oxvig): Generated the summary statistics that were used as input and provided supervision. Contributed to relevant sections of the manuscript.

The GIANT Body Mass Index ExomeChip Working Group (chairs: Ruth J.F. Loos, Joel N. Hirschhorn, Cecilia M. Lindgren, Kari E. North, Guillaume Lettre): Generated the summary statistics that were used as input and provided supervision. Contributed to relevant sections of the manuscript.

The GIANT Waist-Hip Ratio ExomeChip Working Group (chairs: Cecilia M. Lindgren, Kari E. North, Ruth J.F. Loos, L. Adrienne Cupples): Generated the summary statistics that were used as input and provided supervision. Contributed to relevant sections of the manuscript.

The MAGIC ExomeChip Working Group (chairs: Inês Barroso, Anna Gloyn): Generated the summary statistics that were used as input, performed gold/silver/bronze analysis and variant lookups, and provided supervision. Contributed to relevant sections of the manuscript.


The GIANT Adiponectin ExomeChip Working Group (chairs: Karen Mohlke, Ruth J.F. Loos, Cecilia M. Lindgren): Generated the summary statistics that were used as input. Contributed to relevant sections of the manuscript.

The GIANT Leptin ExomeChip Working Group (chairs: Tuomas Kilpeläinen, Ruth J.F. Loos): Generated the summary statistics that were used as input. Contributed to relevant sections of the manuscript.

The GIANT Body Fat Percentage Working Group (chair: Ruth J.F. Loos): Generated the summary statistics that were used as input. Contributed to relevant sections of the manuscript.

Chapter 3

Rebecca S. Fine: Conceived, designed, and implemented Benchmarker. Performed the analyses and interpreted results. Wrote the manuscript with feedback from all authors.

Tune H. Pers: Provided input regarding design of Benchmarker. Implemented leave-one-chromosome-out modifications to GWAS-DEPICT.

Tiffany Amariuta: Provided input regarding analysis of data.

Soumya Raychaudhuri: Provided input regarding analysis of data.

Joel N. Hirschhorn: Conceived and designed Benchmarker and interpreted results. Substantially edited the manuscript. Provided overall supervision.

Chapter 4

Rebecca S. Fine: Conceived, designed, and implemented analyses.

Joel N. Hirschhorn: Conceived and designed analyses. Provided overall supervision.

Chapter 1: Introduction

The human genome consists of roughly three billion bases. Though the vast majority of these bases are shared among all humans, there are millions of locations that can differ from person to person; these are called single nucleotide variants, or SNVs. Thus, one of the overarching motivating questions for the field of human genetics is: how do these genetic differences relate to variation in human phenotypes?

A major tool for addressing this question is the genome-wide association study (GWAS)1,2. GWAS provides a systematic, hypothesis-free approach in which participants are genotyped and then each variant is separately tested (generally in a linear or logistic regression framework) for association with the phenotype of interest. The most basic output of a GWAS is thus a list of summary statistics for each assayed variant, consisting of its estimated effect size and p-value for the tested trait. When GWAS was first introduced over ten years ago, one of the ultimate goals was to find most of the genetic variants that contribute to trait heritability, defined as the proportion of phenotypic variance that is attributable to genotypic variance3. Even more ambitiously, there was hope that these studies would efficiently reveal the genetic basis of many human diseases, leading to a list of the most critical potential therapeutic targets and therefore a golden age of drug development. However, several challenges with respect to these goals quickly became apparent4.
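The per-variant testing described above can be sketched in a few lines. This is a minimal illustration on simulated data, not the pipeline of any particular study: the function name, the simulated effect size, and the use of a normal approximation for the p-value (adequate only for large samples) are all assumptions made for the example, and real analyses also adjust for covariates such as ancestry.

```python
import math
import random


def linear_association(genotypes, phenotypes):
    """Regress a quantitative phenotype on allele dosage (0/1/2) for one variant.

    Returns (beta, standard_error, p_value). The two-sided p-value uses a
    normal approximation to the t distribution (an assumption of this sketch).
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, se, p_value


# Simulated example: one variant with a true effect, one with none.
rng = random.Random(0)
n_people = 2000


def simulate_dosages(maf):
    # Each person carries two alleles, each the minor allele with probability maf.
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_people)]


variants = {"causal": simulate_dosages(0.3), "null": simulate_dosages(0.3)}
phenotype = [0.3 * g + rng.gauss(0, 1) for g in variants["causal"]]

# The "summary statistics": one (beta, se, p) tuple per assayed variant.
summary_stats = {name: linear_association(g, phenotype) for name, g in variants.items()}
```

The resulting `summary_stats` dictionary plays the role of the GWAS output described in the text: an effect size estimate and p-value for each variant, computed one variant at a time.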

One such challenge is that the genetic architecture of most traits is much more complex than had been anticipated. A critical and striking discovery from GWAS has been the extreme polygenicity of most phenotypes, with hundreds or even thousands of loci contributing to a given trait2. In addition, significantly associated SNVs tend to have individually fairly small effects on a given phenotype. As a result, significant variants (and even all tested variants genome-wide) discovered from GWAS often explain only a fraction of the predicted total heritability for many traits (where heritability estimates are based on twin and other family-based studies).

Another challenge is the unexpectedly difficult task of linking significant trait-associated variants with function. This is complicated by two main problems. Firstly, nearby variants in the genome tend to be correlated with each other, a phenomenon referred to as linkage disequilibrium (LD). As a result, the most significant variant in a region (the “index variant”) cannot be assumed to be causal – other nearby correlated variants may actually be responsible for the observed signal. The other main problem is that, even if we were certain we had the correct causal variant, we often would not know which gene that variant was exerting its effect through5. This is because roughly 90% of GWAS signals fall in the noncoding region of the genome6. Unlike coding regions (referred to as the exome, which represents roughly 1% of the genome), the noncoding regions often have no readily available interpretation. Nonsynonymous coding variants alter codon sequences and therefore directly change the amino acid sequence of the protein. In contrast, though trait-associated noncoding variants are presumed to act via regulating a relevant gene (e.g. by altering expression levels or gene splicing), there is no simple way to accurately link the variant with the correct gene.

This dissertation is largely concerned with strategies and tools for overcoming these challenges in GWAS interpretation. In the following sections, I will review several such strategies.

Rare and low-frequency coding variation

One approach for improving interpretability of GWAS results is to focus on the coding regions of the genome, where nonsynonymous changes can easily be deciphered. One method to do this is exome sequencing7, which captures the entire coding region of the genome, including rare and low-frequency (RLF) variation which often cannot be accurately imputed in GWAS data. (For the purposes of this dissertation, I classify variants with a minor allele frequency [MAF] of > 5% as common, variants with MAF between 1% and 5% as low-frequency, and variants with a MAF < 1% as rare.) Another advantage of focusing on RLF variation is that, for traits under stabilizing or negative selection, the largest effect sizes are usually found among variants in this space8–10. RLF coding variants found within known GWAS loci can also provide helpful evidence in indicating the correct gene by either explaining the original signal or providing an additional independent signal in the locus11. For example, a low-frequency coding variant association with body mass index in HIP1R explained a previously observed GWAS signal in an intron of the nearby CLIP1 gene, suggesting that HIP1R rather than CLIP1 is more likely to be responsible for the association12. In another example, the discovery of a coding type 2 diabetes association in RREB1 independent of a previously observed GWAS signal at the same locus provided evidence that RREB1 is likely the causal effector gene13.
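The frequency classes defined above translate directly into a small helper. This is a sketch: the function names are illustrative, and because the text does not say which class a variant with a MAF of exactly 1% or exactly 5% belongs to, assigning both boundaries to the low-frequency class is an assumption.

```python
def minor_allele_frequency(alt_allele_count, n_individuals):
    """MAF for a biallelic variant in diploid individuals."""
    freq = alt_allele_count / (2 * n_individuals)
    return min(freq, 1 - freq)


def classify_variant(maf):
    """Classify a variant by MAF: common (> 5%), low-frequency (1-5%), rare (< 1%)."""
    if not 0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf > 0.05:
        return "common"
    if maf >= 0.01:  # boundary handling is an assumption of this sketch
        return "low-frequency"
    return "rare"
```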

However, exome sequencing on very large numbers of people remains cost-prohibitive, so a cheaper alternative was proposed: the ExomeChip. The ExomeChip directly genotypes a large number of RLF coding variants, capturing the majority (>70%) of coding variants with a MAF of > 0.1%14. The ExomeChip thus allows the interrogation of variants below the MAF threshold that GWAS can typically reach (though of course as reference panels have become larger and more ancestry-specific, the lower MAF bound for accurate imputation in GWAS has dropped). Using this approach, rare and low-frequency variants have been identified for many different phenotypes, including anthropometric traits12,15,16, glycemic traits17–19, lipid levels14,20, diabetes status21, and many others. These studies have made many interesting observations; for example, the GIANT Consortium identified 83 rare and low-frequency variants associated with height, including variants with effect sizes up to ten times larger than those of the average common variant (parts of this work are described in Chapter 2 of this dissertation)15.

Gene set enrichment analysis

One important strategy for extracting biological insight from GWAS is gene set enrichment analysis, in which a list of trait-associated loci is analyzed for overrepresentation of genes in a known gene set, such as a biological pathway22–26. Gene set enrichment analysis can improve statistical power by combining multiple weak signals into a stronger one (i.e. if five genes in a gene set are weakly associated, collectively they may confer significance to the gene set) and reducing the multiple-testing burden23. (Gene set enrichment analysis is also known as pathway analysis. The term “pathway” usually refers to gene products which functionally interact to achieve a specific endpoint. “Gene set” is a more general term that includes pathways, members of the same cellular compartment, genes associated by coexpression, etc.23 For our purposes, we also consider networks to be a form of gene set, though they are not strictly equivalent because networks also contain topological information.) Gene set enrichment analysis has enabled important and novel inferences from GWAS data. For example, when applied to a large GWAS of body mass index (BMI), gene set enrichment analysis found a substantial and unexpected enrichment of brain-relevant pathways, such as neurotransmission and synaptic plasticity12,27. As another example, application of gene set enrichment analysis methods to GWAS data from schizophrenia has shown enrichment of genes in ion channel pathways, synaptic plasticity, regulation of neuronal differentiation, long-term potentiation, and dopaminergic and cholinergic synapses28,29.

Gene set enrichment analysis methods generally fall into one of two categories: two-step or one-step23–26. Two-step methods calculate a gene-level p-value first, then use those p-values as input for an enrichment test statistic; in contrast, one-step methods calculate an enrichment test statistic directly from the variants without aggregating by gene first23–26. For both approaches, it is important to consider how variants are mapped to genes or gene sets, respectively. An important question for both one- and two-step methods is whether to extend gene/gene set boundaries to include flanking noncoding variants, on the assumption that such variants are likely to regulate the genes they are closest to (for example by adding a 20- or 100-kb window around each gene)22–24,26. This can be a tricky balance for improving power, as for some variants this assumption will hold true and for others it will not (i.e. some variants near the gene of interest do not regulate it). Indeed, improving methods of mapping regulatory variants to genes they regulate is a major area of research interest.
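The window-based mapping described above can be illustrated with a short sketch. The gene names, variant identifiers, and coordinates used with it below are hypothetical, and the tuple layout is an assumption chosen for brevity; a variant is assigned to every gene whose extended boundaries contain it, so one variant may map to several genes.

```python
from collections import defaultdict


def map_variants_to_genes(variants, genes, window_kb=20):
    """Assign each variant to every gene whose body plus flanking window contains it.

    variants: iterable of (variant_id, chromosome, position)
    genes:    iterable of (gene_name, chromosome, start, end)
    """
    flank = window_kb * 1000
    assigned = defaultdict(list)
    for variant_id, v_chrom, pos in variants:
        for gene_name, g_chrom, start, end in genes:
            if v_chrom == g_chrom and start - flank <= pos <= end + flank:
                assigned[gene_name].append(variant_id)
    return dict(assigned)


# Hypothetical coordinates, for illustration only.
example_genes = [("GENE_A", "1", 100_000, 150_000), ("GENE_B", "1", 400_000, 450_000)]
example_variants = [("var1", "1", 90_000), ("var2", "1", 300_000), ("var3", "2", 120_000)]
```

With a 20-kb window only `var1` is assigned (to `GENE_A`); widening the window to 100 kb also assigns `var2` to `GENE_B`, which shows how the window choice trades recall of true regulatory variants against inclusion of unrelated ones.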

In addition, for two-step methods, gene-level p-values must be computed. Some methods use the average or highest χ² statistic of all SNVs within the gene (though the latter is often strongly biased by gene length); another approach is to use Fisher’s method to combine the variant p-values. Gene-level p-values can also be calculated with methods specifically designed to combine variant-level summary statistics while accounting for LD patterns, such as MAGMA30 and VEGAS31.
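As an illustration of combining variant p-values into a gene-level p-value, here is a minimal sketch of Fisher's method. It treats the variant p-values as independent, which LD between nearby variants violates; this is exactly why LD-aware tools such as those named above exist. The function names are illustrative, and the chi-square tail probability is computed with the closed form available for even degrees of freedom so that no external library is needed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) follows chi-square with 2k df under the null.

    Assumes the k input p-values are independent (not true for variants in LD).
    """
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))
```

For example, two variants that each have p = 0.05 combine to a gene-level p-value of roughly 0.017, showing how several individually weak signals can add up.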

Many different kinds of test statistics for enrichment have been proposed, which generally fall into three classes: mean-based, count-based, and rank-based25. Mean-based tests, such as Fisher’s method32,33, evaluate the average level of phenotype association within genes in the gene set. Count-based methods, such as the hypergeometric test34,35, use the proportion of trait-associated genes within the gene set as a test statistic (e.g. comparing the proportion of trait-associated genes in the gene set against the proportion of trait-associated genes outside the gene set). Finally, rank-based methods, such as the Kolmogorov-Smirnov algorithm36, rank all genes by the strength of their phenotype association and then test whether genes in the gene set are overrepresented at the top of that list. This test statistic is also often weighted by the mean level of phenotype association for each gene; this approach is borrowed conceptually from GSEA, a gene set enrichment analysis algorithm designed for expression data37. There are also several ways of obtaining a p-value from these test statistics. Many methods calculate these via comparing the test statistic to an empirical distribution derived from either phenotype permutation (shuffling the phenotypes with real genotype data) or gene permutation (shuffling the gene labels from the observed gene-trait association data)25.
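A count-based test of the kind described above can be sketched with the hypergeometric distribution. The function name and argument layout are illustrative assumptions; the calculation asks how likely it is to see at least the observed number of trait-associated genes in the gene set if associated genes were drawn at random from all genes.

```python
from math import comb


def hypergeometric_enrichment_p(n_genes, n_in_set, n_associated, n_overlap):
    """One-sided hypergeometric p-value: P(overlap >= n_overlap).

    n_genes:      total number of genes considered
    n_in_set:     number of genes in the gene set
    n_associated: number of trait-associated genes
    n_overlap:    trait-associated genes that are also in the gene set
    """
    upper = min(n_in_set, n_associated)
    total = comb(n_genes, n_associated)
    tail = sum(
        comb(n_in_set, k) * comb(n_genes - n_in_set, n_associated - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

Note that this simple test treats all genes as exchangeable, so it does not by itself address gene length, LD, or gene clustering.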

There are several major sources of potential confounding for gene set enrichment analysis. One such confounder is that SNV density and/or level of LD between SNVs within genes can vary considerably22,24,25. If a gene has many SNVs, it is more likely to contain variants with lower p-values just by chance; as mentioned above, methods which assign p-values to genes based on their most significant SNV are especially susceptible to this type of confounding24,25.

Another is the fact that nearby genes are more likely to be in the same pathway38,39, which can cause artifacts due to LD. An extreme example of this is the human major histocompatibility complex (MHC), which has unusually extensive LD over a long span of ~10 Mb. Because of this, a single variant in the MHC region might implicate several functionally related genes and therefore inappropriately suggest enrichment of immune-related pathways22. It is also necessary to correct for differently sized pathways, as small pathways can show significance based on one or two strong SNV associations, while especially large pathways may be prone to significance by chance (especially if they contain many long genes)22–26. These many potential pitfalls make permutation-based p-value calculation attractive, as (if correctly implemented) in theory it creates an appropriate empirical null that accounts for many of these factors.
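The gene-permutation idea can be sketched as follows. This is a deliberately naive version with illustrative names: it draws random gene sets uniformly, so it does not control for the confounders discussed above. A careful implementation would, for instance, match random sets to the real one on gene length, SNV density, and genomic proximity; that matching is omitted here.

```python
import random


def gene_permutation_p(gene_scores, gene_set, n_permutations=10000, seed=0):
    """Empirical p-value for the mean association score of a gene set.

    gene_scores: dict mapping gene name -> association score (higher = stronger)
    gene_set:    collection of gene names; only genes present in gene_scores are used
    The null distribution comes from random gene sets of the same size
    (uniform sampling, an assumption of this sketch).
    """
    rng = random.Random(seed)
    genes = list(gene_scores)
    members = [g for g in genes if g in gene_set]
    observed = sum(gene_scores[g] for g in members) / len(members)
    hits = 0
    for _ in range(n_permutations):
        sample = rng.sample(genes, len(members))
        null_stat = sum(gene_scores[g] for g in sample) / len(members)
        if null_stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```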


One popular two-step gene set enrichment analysis method, MAGMA, works as follows30. First, gene-level p-values are calculated; if this is done from summary statistics (i.e. without available individual-level genotype data), this is generally based on the mean χ² statistic for all SNVs in a gene while accounting for correlation among SNVs. Then, the gene set enrichment analysis is performed as a regression of the gene-level z-scores on a binary indicator variable representing, for each gene, whether it is a member of the tested gene set.
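The second step, regressing gene-level z-scores on a binary membership indicator, can be sketched as below. This is not MAGMA's implementation: it is an ordinary least-squares toy with illustrative names that ignores covariates and correlation between genes, and it uses a one-sided normal approximation for the p-value by assumption. With a single binary predictor, the slope equals the difference in mean z-score between genes inside and outside the set.

```python
import math


def gene_set_regression(gene_z, gene_set):
    """Regress gene-level z-scores on a 0/1 gene set membership indicator.

    Returns (beta, standard_error, one_sided_p) where beta is the mean z-score
    of member genes minus the mean z-score of all other genes.
    """
    genes = list(gene_z)
    x = [1.0 if g in gene_set else 0.0 for g in genes]
    y = [gene_z[g] for g in genes]
    n = len(genes)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    rss = sum((yi - intercept - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    # Upper-tail test: is association stronger inside the set than outside?
    p_one_sided = 0.5 * math.erfc(beta / se / math.sqrt(2))
    return beta, se, p_one_sided
```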

Another two-step gene set enrichment analysis method, DEPICT (developed by our lab), has a key advantage over previous algorithms40,41. All previous algorithms use gene sets with binary membership: a given gene is either present or absent in a gene set. DEPICT, in contrast, used data from 77,840 expression microarrays to “reconstitute” 14,462 existing gene sets (derived from various sources, such as KEGG42 and Reactome43). The reconstitution was based on the idea that genes with similar expression patterns are more likely to reside in the same gene sets. After reconstitution, every gene in the genome has a probability of being in every gene set; that is, each gene set consists of every gene and its associated probability (in the form of a z-score)40,41. (Due to the use of probabilistically-defined gene sets, DEPICT does not fit neatly into many of the general gene set enrichment paradigms described above.)

These reconstituted gene sets provide enormous value. First, their probabilistic nature helps capture signal that would be lost by a hard-and-fast cutoff. Second, they were reconstituted in a hypothesis-free manner (using the microarray coexpression data), which lessens some of the bias inherent in manually curated gene sets. For example, more well-studied genes are generally overrepresented in established gene sets, and in any given gene set database, there are many genes that do not appear at all24. By including information about uncharacterized genes, the reconstituted gene sets can thus leverage more information and improve power.


Most existing gene set enrichment analysis methods for genetic data focus on GWAS, with only a handful of studies considering rare variants (generally in the context of next-generation sequencing data)26,44–47. There was therefore a need to create such a method specifically for ExomeChip data. Generally, methods that are designed for GWAS cannot be directly applied to the ExomeChip because the approaches differ from each other in ways that need to be accounted for in calculating appropriate test statistics and p-values. Most obviously, the ExomeChip focuses largely on RLF coding variants, while the majority of variants from GWAS are common and noncoding. In addition, a method for the ExomeChip should leverage its main advantage over GWAS, which is the ability to use rare coding variants to more specifically pinpoint causal genes.

Partitioning heritability

A complementary approach for extracting biological insight from GWAS data is partitioning heritability among genomic annotations of interest48–52. Most published GWAS include an estimate of SNP-heritability for the trait of interest, which is the total percentage of additive phenotypic variation explained by a given set of observed SNPs. Partitioning heritability takes this a step further, investigating the relative contribution of different genomic categories to overall heritability, where “categories” simply refers to SNP-level annotations such as being located in a DNase hypersensitivity site, being in an evolutionarily conserved region, or being near a gene with a particular pattern of tissue expression. For example, we might expect that coding SNPs and SNPs in enhancer regions explain more heritability on average than SNPs in repressed regions of the genome.

One popular method for partitioning heritability is stratified LD score regression (S-LDSC)53. S-LDSC is an extension of LD score regression (LDSC)54, a method for calculating total SNP-heritability. The key insight behind LDSC is that we expect SNPs with high LD to many other SNPs to be more likely to tag truly causal SNPs. We can quantify the amount of LD a SNP has to other SNPs with an “LD score,” which is simply the sum of the squared Pearson correlation coefficients (r²) between a given SNP j and all nearby SNPs. The intuition of LDSC is then that the strength of the relationship between LD scores and GWAS χ² association statistics provides an estimate of overall heritability for a given trait.
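To make the LD score definition concrete, the sketch below computes LD scores from a small genotype matrix and shows how a regression slope relates to heritability. Several simplifications are assumptions of this example rather than features of the published method: all SNPs are taken to be on one chromosome, the window is in base pairs, each SNP's LD with itself (r² = 1) is included in its score, plain sample r² is used without bias correction, and the regression is unweighted. The slope-to-heritability conversion assumes the standard LDSC model in which the expected χ² statistic increases by N·h²/M per unit of LD score, for sample size N and M SNPs.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length dosage vectors (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ld_scores(dosages, positions, window_bp=1_000_000):
    """LD score of SNP j = sum of r^2 between SNP j and every SNP within the window."""
    scores = []
    for j, pos_j in enumerate(positions):
        total = 0.0
        for k, pos_k in enumerate(positions):
            if abs(pos_j - pos_k) <= window_bp:
                total += pearson_r(dosages[j], dosages[k]) ** 2
        scores.append(total)
    return scores


def ols_slope(x, y):
    """Unweighted least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def snp_heritability_from_slope(slope, n_samples, n_snps):
    """Invert slope = N * h2 / M to recover h2."""
    return slope * n_snps / n_samples


# Toy genotypes: SNPs 0 and 1 are in perfect LD; SNP 2 is uncorrelated with both.
toy_dosages = [[0, 1, 2, 1, 0, 2], [0, 1, 2, 1, 0, 2], [1, 0, 1, 2, 1, 1]]
toy_positions = [100, 200, 300]
toy_scores = ld_scores(toy_dosages, toy_positions)
```

In the toy data, the two SNPs in perfect LD each receive a score of 2 and the uncorrelated SNP a score of 1, matching the intuition that a SNP tagging more variation has a higher LD score.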

S-LDSC extends this framework by considering only LD to SNPs in the genomic category of interest rather than LD to all SNPs. In this way, category-specific LD scores are calculated and can then be used to measure the heritability explained by that category. When applying S-LDSC to a new annotation of interest, we always include a group of 53 annotations of known genomic importance (e.g. coding, enhancers, etc.), referred to as the “baseline model.”

This is done to avoid model misspecification and ensure that heritability is assigned to the correct category. For example, if the annotation of interest correlates strongly with evolutionary conservation and evolutionary conservation is not included in the model, the annotation may appear to be important when it is merely piggybacking off the known enrichment of evolutionarily conserved regions to heritability.
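For reference, the regression models behind LDSC and S-LDSC can be written compactly. These equations follow the standard formulation in the LDSC literature cited above rather than anything derived in this text, and the symbols are introduced here for illustration: N is the GWAS sample size, M the number of SNPs, a a term capturing confounding, r²_jk the squared correlation between SNPs j and k, and τ_C the per-SNP contribution of category C to heritability.

```latex
% LDSC: one LD score per SNP
\ell_j = \sum_{k} r^2_{jk},
\qquad
E\left[\chi^2_j\right] = \frac{N h^2}{M}\,\ell_j + N a + 1

% S-LDSC: one LD score per SNP and category C
\ell(j, C) = \sum_{k \in C} r^2_{jk},
\qquad
E\left[\chi^2_j\right] = N \sum_{C} \tau_C\,\ell(j, C) + N a + 1
```

Because the stratified model fits all categories jointly, a new annotation is credited only with signal that the baseline annotations do not already explain, which is the protection against misspecification described above.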

An example of an interesting insight that has emerged from S-LDSC involves the two subtypes of inflammatory bowel disease, Crohn’s disease and ulcerative colitis. S-LDSC analysis of GWAS summary statistics from these two diseases identified different patterns of enrichment with respect to cell-type-specific histone marks53. As expected, both showed strong enrichment for immune cell types. However, of the 40 cell types significantly associated with Crohn’s disease, 39 were immune and one was gastrointestinal; in contrast, ulcerative colitis was significant for 30 immune cell types and nine gastrointestinal, suggesting a different pattern of tissue susceptibility between the two subtypes. This illustrates the utility of approaches like partitioning heritability in interpretation of GWAS results.

Gene prioritization

Gene set enrichment analysis and partitioning heritability are ways of interpreting GWAS by assessing the contribution of biologically meaningful units of analysis (e.g. all coding SNPs, all genes that are members of a given pathway) to a given set of results. A complementary approach is the use of computational tools to directly prioritize genes for follow-up55–59.

Prioritization tools can help shorten long lists of potentially associated genes from GWAS; this has become particularly critical as sample sizes for GWAS have increased and lists of significant loci have correspondingly grown much longer.

There are many types of approaches that might be considered “gene prioritization.” For example, the methods VEGAS31 and MAGMA30 (mentioned above) can calculate a gene-based p-value from summary statistics and a reference LD panel. However, the class of prioritization methods I will focus on in this dissertation is based on the principle of “guilt by association”: genes that are similar in some way (e.g. correlated expression patterns or shared protein interaction partners) are more likely to be functionally related55,60,61. Thus, if genes A, B, and C are highly correlated in some capacity and genes A and B are known to cause disease X, gene C may be relevant for disease X as well.

Many of the earliest prioritization methods were not designed specifically for GWAS data, though some can be used for this purpose62–66. These methods usually rely on the designation of some genes (called “seed genes”) or keywords known to be involved with the trait of interest. Then, in accordance with the guilt-by-association principle, genes of interest are compared to the seed genes or keywords such that the most similar genes are highly prioritized; this can be done based on a prespecified list of candidate genes (e.g. all genes within a particular locus) or genome-wide. Many different types of data can be used to assess similarity. Some popular choices include protein-protein interaction (PPI) networks62,64,66, biological pathways/annotations62,65, and published literature62 (which can be text-mined to assess, for example, how frequently two gene names are mentioned together). Many methods also aggregate multiple sources of data together to either (1) prioritize based on an overall similarity metric spanning all data sources63–65 or (2) produce a single composite data source (e.g. a network incorporating information from multiple PPI networks) upon which to base prioritization66.

For example, one popular method, Endeavour, works as follows62,63. Endeavour provides a collection of 75 data sources, spanning many different data types (e.g. PPIs, expression profiles, sequence-based features, gene ontology, and chemical information). The user defines a set of seed genes known to be associated with the trait of interest, such as genes associated with the trait of interest in the OMIM database. Then, the user selects which of the 75 data sources they would like to use for prioritization; the authors recommend thinking carefully about this choice in order to avoid, for instance, using multiple data sources that contain redundant information. Finally, the user selects candidate genes to choose from (e.g. only genes falling in GWAS loci); alternatively, they can choose to prioritize across the whole genome. Endeavour uses the user-provided seed genes to train a sub-model from each selected data source (e.g. calculating the average expression profile of the seed genes across a collection of tissues and comparing that to the expression profile of each test gene). Then, each gene receives a ranking for each sub-model. Finally, the rankings from each sub-model are integrated to generate a single overall ranking, along with a p-value.
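The seed-based workflow described above can be sketched in a few lines. This is an illustration of the general pattern only: the sub-model shown (correlation with the mean seed profile) follows the expression example in the text, while the rank integration is a plain mean of ranks, used here as a simple stand-in and not as Endeavour's actual rank-fusion procedure. All function names are hypothetical, and no p-value is computed.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def profile_submodel(profiles, seeds, candidates):
    """Score each candidate by correlation with the mean seed profile."""
    n_features = len(profiles[seeds[0]])
    mean_seed = [sum(profiles[s][i] for s in seeds) / len(seeds)
                 for i in range(n_features)]
    return {g: pearson_r(profiles[g], mean_seed) for g in candidates}


def rank_descending(scores):
    """Map gene -> rank, where rank 1 is the highest-scoring gene."""
    ordered = sorted(scores, key=lambda g: scores[g], reverse=True)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def integrate_rankings(rankings):
    """Combine per-source rankings by mean rank (assumed stand-in method)."""
    genes = list(rankings[0])
    mean_rank = {g: sum(r[g] for r in rankings) / len(rankings) for g in genes}
    return sorted(genes, key=lambda g: mean_rank[g])
```

Each data source contributes one call to profile_submodel followed by rank_descending; the resulting dictionaries are passed together to integrate_rankings to obtain a single ordered candidate list.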

The use of user-defined seed genes and keywords is a critical feature in this type of method (and such methods continue to be developed67,68). However, this type of approach can be problematic, as it relies inherently on existing trait knowledge which can be biased (and at worst, potentially inaccurate). In addition, there are many traits for which we have very little prior knowledge; such traits would be difficult to analyze with this kind of strategy. This is especially disappointing because such traits also can hold the most promise for novel discovery55.

GWAS, however, presents a particularly interesting case in this context because it enables an alternate possibility for prioritization that bypasses the use of predefined seed genes.

Specifically, with GWAS it is possible to define “seeds” directly from the summary statistics as, for example, all genes in trait-associated loci (or, alternatively, a cutoff can be avoided by instead defining a weight for each gene based on the GWAS p-values of nearby variants). These seeds will of course include genes that are not truly associated, but they are relatively unbiased and highly enriched for true associations. Then, the prioritization method can compare each gene in significant loci (or every gene in the genome) against this data-driven list of seed genes. The difference between this type of approach and those described above can be thought of as “supervised” versus “unsupervised” in the sense that the user defines what is important in the former case, while in the latter the algorithm learns what is important directly from the data.

Both types of method do of course rely on external sources of information (i.e. whatever is used as the basis for guilt-by-association, such as the PPI network), but the difference is the source of seed genes: in supervised prioritization, the user provides these, while in unsupervised prioritization these are based directly on the observed p-values from the GWAS itself.
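As a minimal sketch of how seeds might be derived from summary statistics, the code below shows both options mentioned above: a hard p-value cutoff and a cutoff-free weight per gene. It assumes each gene has a single genomic window and that a gene's weight is the negative log10 of the smallest p-value in that window; both are illustrative choices, and the names seeds_by_cutoff and gene_weights are hypothetical.

```python
import math


def seeds_by_cutoff(variants, gene_windows, p_cutoff=5e-8):
    """Return the genes whose window contains a variant with p < p_cutoff.

    variants: iterable of (chrom, pos, p)
    gene_windows: dict gene -> (chrom, start, end)
    """
    seeds = set()
    for chrom, pos, p in variants:
        if p >= p_cutoff:
            continue
        for gene, (g_chrom, start, end) in gene_windows.items():
            if g_chrom == chrom and start <= pos <= end:
                seeds.add(gene)
    return seeds


def gene_weights(variants, gene_windows):
    """Cutoff-free alternative: weight = -log10(min p) within the gene window.

    Genes with no variant in their window get a weight of 0.
    """
    weights = {}
    for gene, (g_chrom, start, end) in gene_windows.items():
        p_values = [p for chrom, pos, p in variants
                    if chrom == g_chrom and start <= pos <= end]
        weights[gene] = -math.log10(min(p_values)) if p_values else 0.0
    return weights
```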

Many existing methods designed specifically for GWAS fall into this “unsupervised” category. One example of such a strategy is used in DEPICT, which, in addition to the gene set enrichment analysis function described earlier, also has a gene prioritization component40. First, DEPICT identifies all genes in loci that are trait-associated at a user-specified p-value cutoff (usually 5 × 10⁻⁸ or 1 × 10⁻⁵); I refer to this set of genes as S. Then, each gene is compared to every other gene in S across the reconstituted gene sets, looking for similar patterns of pathway membership. Genes with high similarity to many other genes in S will be strongly prioritized. As an example, when DEPICT was applied to a large GWAS of height, it prioritized many genes (such as FAM101A and CRISPLD1) without published connections to growth but that were highly predicted to be relevant to bone/cartilage development pathways based on the reconstituted gene sets69.
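The comparison of each gene against the rest of S can be sketched as follows. This is a simplified illustration, not DEPICT's implementation: it assumes each gene is represented by a vector of gene set membership z-scores, uses Pearson correlation as the similarity measure, and scores a gene by its mean similarity to all other genes in S. The name prioritize_by_similarity is hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def prioritize_by_similarity(membership, genes_in_s):
    """Rank genes in S by mean similarity to the other genes in S.

    membership: dict gene -> list of membership z-scores, one per gene set.
    Returns (gene, score) pairs sorted from most to least similar.
    """
    scores = {}
    for gene in genes_in_s:
        others = [h for h in genes_in_s if h != gene]
        total = sum(pearson_r(membership[gene], membership[h]) for h in others)
        scores[gene] = total / len(others)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A gene whose membership profile resembles that of many other genes in S rises to the top, while a gene that is in an associated locus but shares little predicted biology with the rest of S is ranked low.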

There are many other types of these methods. For example, GRAIL assesses gene relatedness based on text from PubMed abstracts; genes highly connected with trait-associated genes are highly scored70. Another group of approaches is based on networks, which can contain information not only from PPIs but many other evidence sources (e.g. coexpression, co-occurrence in PubMed abstracts)71,72. The general idea of such methods is to take as input GWAS summary statistics and overlay them with network information. The network can then be used to reprioritize the genes such that genes with high connectivity to genes in trait-associated loci are given extra weight73–77. Another type of method evaluates combinations of prioritized genes within significant loci and tries to prioritize the set of genes with the highest connectivity between them78,79.

As an example of how methods like this can prioritize genes, Fernandez-Tajes et al. identified a set of high-confidence seed genes from GWAS data of type 2 diabetes and used them to build general and tissue-specific PPI networks80. They used a pancreatic-islet-specific network to prioritize both among the original seed genes with high functional connectivity and among “linking” genes (non-seed genes closely connected to the seed genes). Validating their approach, experimental evidence in the literature has shown that perturbation of many of the genes prioritized by this strategy, such as the linking genes CDK2 and SMAD4, leads to changes in islet development, beta-cell function and/or insulin signaling81,82. In addition, at some GWAS loci, genes other than the highest-confidence gene were prioritized based on high network connectivity (e.g. TBCA at the ZBED3 locus), illustrating how PPI information can be used to boost signal from less obvious candidates.

Benchmarking strategies

A major challenge for the field of gene prioritization is benchmarking: how do we know which approaches perform the best? This is a critical question, as the development of more of these methods is not useful without a measure of performance accuracy, particularly because different methods often prioritize fairly different groups of genes.

When new methods are published, the most commonly used benchmarking approach involves “gold standard” genes. These are similar in spirit to the seed genes described in the previous section: they are generally groups of genes that have well-characterized and well-validated links to the trait, based on expert curation and/or extensive literature review40,70,77. For example, some studies use Mendelian disease genes relevant to the tested trait (e.g. for height, genes known to harbor variants that cause disorders of skeletal growth)40,73,76,83. Gold standard genes can be used to evaluate the method by either (1) assessing, for a given set of prioritization results, how strongly the gold standard genes were prioritized or (2) applying a cross-validation framework in which the gold standard genes are divided into “training” and “testing” sets and the performance on testing sets is evaluated with a metric such as the area under a receiver operating characteristic curve (AUROC)55. A variation on this theme is “gold standard” tissues or gene sets, where prioritized genes are evaluated based on enrichment of expression within disease-relevant tissues or membership in disease-relevant pathways40,55,73,76.

Several large-scale cross-validation-style benchmarking studies have been done comparing different types of gene prioritization algorithms, generally focusing on network-based approaches. For example, Guala et al. tested several network-based methods using the FunCoup PPI network84,85. The authors used genes annotated to specific GO terms as training and testing data in three-fold cross-validation. Picart-Armada et al. compared performance of 12 network propagation approaches using drug targets from the Open Targets database86 as gold standard genes87. Finally, Shim et al. compared direct neighborhood and neighborhood diffusion network methods using two-fold cross-validation with human and worm networks and phenotypes88. In this analysis, the authors observed that neighborhood diffusion methods tended to perform better when evaluated by AUROC over the full ranking, but when limiting to only the top 200 prioritized candidates, the direct neighborhood methods were often more successful. This observation underscores an important point in benchmarking prioritization algorithms: the practical considerations for functional follow-up studies mean that emphasis should often be on early retrieval of true positives.
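The distinction between whole-ranking performance and early retrieval can be made concrete with a short sketch. It assumes prioritization output is a dictionary of gene scores (higher is better) and that the gold standard is a set of gene names; recall among the top k genes is used here as one simple early-retrieval metric, chosen for illustration rather than taken from the studies cited above.

```python
def auroc(scores, positives):
    """AUROC as the probability that a random gold standard gene outranks
    a random non-gold-standard gene (ties count as one half)."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def recall_at_k(scores, positives, k):
    """Fraction of gold standard genes found among the top k ranked genes."""
    top_k = sorted(scores, key=lambda g: scores[g], reverse=True)[:k]
    return sum(1 for g in top_k if g in positives) / len(positives)
```

Two methods can have similar AUROC values yet differ substantially in recall_at_k for small k, which is the situation that matters when only a limited number of candidates can be followed up experimentally.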

Despite the ubiquity of gold-standard-based benchmarks, they are highly problematic for some of the same reasons cited above for seed genes. They rely on trait knowledge which does not always exist, and preexisting knowledge tends to be highly biased toward well-studied genes (and, indeed, can sometimes turn out to be inaccurate). In addition, such an approach will penalize a method that discovers truly novel biology, as the benchmark is built on the assumption that all newly discovered genes should be as similar as possible to the set of known (biased) ones. Furthermore, many benchmarking approaches can suffer from a phenomenon called knowledge contamination. This term refers to the idea that as soon as a genetic association study is published, new disease-gene association information will work its way into the types of data sets often used within prioritization tools, such as KEGG, GO, and Medline abstracts. This can facilitate a method’s ability to predict such gene-trait associations in a benchmarking setting and therefore lead to overly optimistic estimates of method performance55,84,89. Indeed, several studies have observed that performance estimates based on cross-validation were higher than those based on a prospective benchmark, suggesting the influence of knowledge contamination63,89.

A less commonly used approach for benchmarking, designed to avoid the problem of knowledge contamination, is prospective validation. In this strategy, prioritization is performed on an older data set, and then results from an independent (and generally newer) set serve as the benchmark. This situation more accurately mimics the setting of novel discovery. Börnigen et al. conducted a large-scale comparison of several methods using this strategy, specifically performing the prioritization as soon as a novel gene-disease association was reported so that knowledge of the new association did not yet have time to be integrated into databases89. Some researchers have also used a prospective-style benchmark when describing their prioritization tools. For example, the authors of the GRAIL tool applied it to a GWAS of Crohn’s disease and evaluated method performance by comparing against results from an independent GWAS replication data set70 (i.e. to confirm that GRAIL tended to prioritize genes that actually replicated in the second data set).

Tissue and cell type enrichment

One of the critical questions GWAS seeks to answer is which tissues are most relevant to the etiology of particular phenotypes. Thus, another useful way of interpreting GWAS data is to ask which tissues are likely to be affected by the phenotype-associated variants. Many methods have been developed to address this question90–96. For example, S-LDSC (described above) has been applied to many different phenotypes to quantify the heritability explained by genes specifically expressed in different tissues90. Many studies have also developed frameworks for integrating GWAS data with expression quantitative trait loci (eQTLs) for different tissues to determine enriched tissues and/or make predictions about likely effector transcripts and tissues on a locus-specific basis91–94. An especially useful resource for these types of methods has been the Genotype-Tissue Expression project (GTEx)97,98, which is collecting genetic and expression information from 54 tissue types in almost 1,000 individuals.

However, tissues generally consist of mixtures of many different cell types. Therefore, bulk tissue-level analysis will overlook many potentially important relationships, especially for cell types which make up only a small fraction of the tissue. To address this problem, in recent years there has been a rapid increase in the development and application of single-cell RNA sequencing (scRNA-seq) technologies, which measure expression profiles within individual cells and therefore provide much higher resolution data99–101. scRNA-seq has been used to classify and characterize known and novel cell types in various physiological systems and tissues, such as the lung102, liver103, and intestinal epithelium104. There has been particular interest in characterizing the brain and nervous system, which are comprised of extremely diverse cell populations; scRNA sequencing studies have been performed focusing on particular regions of the brain (e.g. the cerebral cortex, the hypothalamus)105–109 and on larger-scale systems (e.g. whole brain, whole nervous system)110,111.

The importance of having single-cell data resolution has been illustrated by the fact that for many psychiatric diseases, one of the most strongly associated brain regions is often the cerebellum90. Given what is already known about the etiology of these diseases, this association is likely driven not by a truly critical role of the cerebellum but rather by the fact that the cerebellum has an unusually high density of neurons; therefore, if neurons are substantially more important than non-neurons in a given disease, the cerebellum may artifactually appear highly significant90,112.

There are many challenges inherent in the analysis of scRNA-seq data. Such data generally contain a large fraction (sometimes >50%) of zero values, often referred to as dropouts; these represent genes with no reads for a particular cell113,114. Some of these genes are truly not expressed in the given cell, but many may simply be lowly expressed and therefore are often randomly not captured in the library preparation. Zero-inflation leads to substantial non-normality of data and therefore can complicate statistical analysis and interpretation. Another challenge in scRNA-seq is that there is no consensus on the best way to identify different cell populations within the data. Generally, this is done with a combination of a dimensionality reduction approach (e.g. principal components analysis, t-distributed stochastic neighbor embedding [t-SNE], or diffusion maps) and an unsupervised clustering approach (e.g. k-means, hierarchical clustering, or graph-based community detection methods)100,101,113. Each method has its own particular assumptions, advantages, and drawbacks, complicating the decision of how to process any particular data set. Adding further complexity, the definition of a “cell type” is not clear-cut, as the line between a cell type and a cell state can be blurry. For example, the same type of cell in two different states (e.g. activated versus quiescent) may cluster separately101.

A handful of studies have attempted to integrate GWAS results with single-cell RNA sequencing data115–119. For example, genes in schizophrenia-associated loci were found to be specifically expressed in hippocampal and neocortical somatosensory pyramidal cells, medium spiny neurons (inhibitory neurons located within the striatum), and certain interneurons118. A GWAS of insomnia also identified medium spiny neurons, as well as neuroblasts, claustrum pyramidal neurons, and certain hypothalamic cell types. In contrast, Parkinson’s disease implicated not only dopaminergic and serotonergic neurons but also oligodendrocytes, indicating an important role for non-neuronal cell types in neurological disorders117.

The studies described above begin by calculating “specificity” scores for each gene in each cell type from the scRNA-seq data, which theoretically reflect the degree to which a gene’s expression in a particular cell type is unique to that cell type. The inherent assumption in this approach is that there exist certain cell types which are most relevant to a phenotype of interest, and therefore we expect that genes that most clearly define those cell types should be strongly associated with the phenotype by GWAS.

Based on these specificity scores, the studies above use two main methodologies to assess cell type enrichment. One method is based on MAGMA, a gene set enrichment analysis method described above. In addition to gene set enrichment, MAGMA can be used to analyze GWAS with a continuous covariate of interest, asking the question: does the strength of the GWAS association signal correlate with significantly increasing (or decreasing) values of the covariate? Transformed specificity scores were used as the continuous covariate such that the method tested whether increasing specificity to the tested cell type was associated with more significant GWAS association. The other main method used by these studies is based on S-LDSC, also described above. In this case, the top 10% of specifically expressed genes for each cell type were treated as a genomic annotation, following conventions laid out in a large-scale S-LDSC analysis of tissue expression based on the GTEx data90. Then, S-LDSC was used to quantify the heritability explained by the specifically expressed genes for each cell type.
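To make the notion of specificity and the top-10% annotation concrete, the sketch below uses one simple definition: a gene's mean expression in a cell type divided by the sum of its mean expression across all cell types. This definition is an assumption made for illustration (as discussed in this section, there are many ways to calculate such a metric), and the function names are hypothetical.

```python
def specificity_scores(mean_expr):
    """Convert mean expression into per-gene specificity proportions.

    mean_expr: dict gene -> dict cell_type -> mean expression (non-negative)
    Returns dict gene -> dict cell_type -> value in [0, 1]; for an expressed
    gene the values sum to 1 across cell types.
    """
    specificity = {}
    for gene, by_type in mean_expr.items():
        total = sum(by_type.values())
        specificity[gene] = {
            cell_type: (value / total if total > 0 else 0.0)
            for cell_type, value in by_type.items()
        }
    return specificity


def top_specific_genes(specificity, cell_type, fraction=0.10):
    """Return the top `fraction` of genes ranked by specificity to cell_type."""
    ranked = sorted(specificity,
                    key=lambda g: specificity[g][cell_type],
                    reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return set(ranked[:n_top])
```

The set returned by top_specific_genes corresponds to the kind of gene list that would be converted into a genomic annotation for S-LDSC, whereas the continuous values from specificity_scores correspond to the kind of covariate used in the MAGMA-based approach.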

These initial approaches to use scRNA-seq data to facilitate GWAS interpretation are exciting, but there remain many questions about which methodologies are most appropriate. For example, the studies above calculate a “specificity” metric, but there are many ways to calculate such a metric. In particular, which cell types should a cell type of interest be compared against when calculating specificity? This is especially complicated because of the high degree of correlation between related cell types. In addition, another recently developed method proposes an alternative approach for applying MAGMA to evaluate cell type enrichment: instead of using specificity scores as the covariate of interest, the regression model uses average gene expression per gene per cell type conditioned on that gene’s average global expression119. More work needs to be done exploring these different methods and critically evaluating the assumptions made by each approach.

Summary

In this introduction, I have outlined several major strategies for interpretation of GWAS results: gene set enrichment, interrogation of rare/low-frequency coding variants, partitioning heritability, gene prioritization and benchmarking, and integration with single-cell RNA sequencing data. In the dissertation that follows, I will discuss the two major relevant new approaches I have developed that leverage and extend these existing strategies. Chapter 2 describes EC-DEPICT, a gene set enrichment analysis method designed specifically for ExomeChip data, and its application to several different traits. Chapter 3 describes Benchmarker, a gold-standard-free and unbiased method for benchmarking gene and variant prioritization strategies. I also include an additional Chapter 4, in which I discuss a preliminary effort to identify relevant nervous system cell types for BMI using scRNA-seq data.

Chapter 2:

EC-DEPICT: a gene set enrichment analysis method for ExomeChip data

Rebecca S. Fine,1-4 Tune H. Pers,5,6 the GIANT Consortium, the MAGIC Consortium & Joel N. Hirschhorn1-3,7

1 Department of Genetics, Harvard Medical School, Boston, MA 02115, USA. 2 Division of Endocrinology and Center for Basic and Translational Obesity Research, Boston Children's Hospital, Boston, MA 02115, USA. 3 Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA. 4 Ph.D. Program in Biological and Biomedical Sciences, Graduate School of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA. 5 The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark. 6 Department of Epidemiology Research, Statens Serum Institut, 2300 Copenhagen, Denmark. 7 Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA.

Some of the text in this chapter has been modified and adapted from the following publications:

Marouli E.*, Graff M.*, Medina-Gomez C.*, Lo K.S.*, Wood A.R.*, Kjaer T.*, Fine R.S.*, Lu Y.* et al. (2017). Rare and low-frequency coding variants alter human adult height. Nature 542, 186-190.

Turcot, V.*, Lu, Y.*, Highland, H.M.*, Schurmann, C.*, Justice, A.E.*, Fine, R.S.*, et al. (2018). Protein-altering variants associated with body mass index implicate pathways that control energy intake and expenditure underpinning obesity. Nature Genetics 50, 26–41.

Justice, A.E.*, Karaderi, T.*, Highland H.M.*, Young K.L.*, Graff M.*, Lu, Y., Turcot, V., Auer, P.L., Fine R.S. et al. (2019). Protein-coding variants implicate novel genes related to lipid homeostasis contributing to body fat distribution. Nature Genetics 51, 452-469.

Spracklen, C.N.*, Karaderi, T.*, Yaghootkar, H., Schurmann, C., Fine R.S. et al. (2019). Novel adiponectin-associated variants implicate new obesity and lipid biology. American Journal of Human Genetics 105(1), 15-28.

Ng, N.H.J.*, Willems, S.*, Fernandez, J., Fine, R.S., Wheeler, E., Wessel, J. et al. (2019). Tissue-specific alteration of metabolic pathways influences glycemic regulation. BioRxiv doi: https://doi.org/10.1101/790618.

Yaghootkar, H.*, Zhang, Y.*, Spracklen, C.N., Karaderi, T., Huang, O.L., Bradfield, J., Schurmann, C., Fine, R.S. et al. (in review). Coding variant in LEP associated with lower leptin concentrations implicates leptin in the regulation of early adiposity.

*Authors contributed equally to this work.

Abstract

Gene set enrichment analysis has become an important method for analyzing and interpreting results from genome-wide association studies (GWAS). In recent years, a complementary approach to GWAS has gained popularity: the ExomeChip, an array which focuses specifically on rare and low-frequency coding variation. Such studies would benefit greatly from a gene set enrichment analysis tool designed specifically for ExomeChip data. To address this need, we took an existing gene set enrichment analysis method for GWAS data, DEPICT, and developed a version of it for the ExomeChip (“EC-DEPICT”). DEPICT has extended existing gene sets using large-scale coexpression data to make predictions about gene function, making it an especially useful approach. In collaboration with the GIANT and MAGIC Consortia, we applied EC-DEPICT to many different traits, including height, body mass index, waist-hip ratio, several glycemic measures, body fat percentage, fat-free mass index, adiponectin levels, and leptin levels. For most traits, EC-DEPICT identified many relevant biological mechanisms, replicating previously identified biology as well as discovering novel associations. We also developed a visualization-based strategy for qualitative prioritization of genes that are the strongest drivers of the gene set enrichment and successfully used it to identify promising candidates for follow-up. EC-DEPICT provides a powerful new approach for downstream analysis of ExomeChip data.

Introduction

The genome-wide association study (GWAS) has become an increasingly popular technique for investigating the genetic basis of human phenotypes, but interpretation of this data remains challenging. Roughly 90% of GWAS signals fall in noncoding regions of the genome, and linking noncoding variants to the genes they regulate remains extremely difficult5,6. In addition, linkage disequilibrium (LD) between nearby variants makes it challenging to determine which variants are responsible for an observed association signal.

One approach that can facilitate GWAS interpretation is gene set enrichment analysis, a technique in which groups of genes with a known biological relationship (e.g. interaction within the same biochemical pathway, expression within the same cellular compartment) are analyzed for overrepresentation among trait-associated genes22–25. Such overrepresentation of a particular group of genes suggests possible involvement of the relevant biological process in the trait of interest. Gene set enrichment analysis can also improve statistical power because (1) the number of gene sets tested is generally much smaller than the number of variants tested, reducing the multiple-testing burden, and (2) it can combine several weak signals into a stronger one by aggregating evidence (e.g. several suggestively significant genes in the same pathway may collectively reach significance when tested as a group).
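As a minimal illustration of what “overrepresentation” means, the sketch below computes a one-sided hypergeometric p-value for the overlap between a gene set and a list of trait-associated genes. This is a generic textbook test shown only to make the concept concrete; it is not the test statistic used by DEPICT, and the function name is hypothetical.

```python
from math import comb


def overrepresentation_p(n_universe, n_in_set, n_associated, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    n_universe:   total number of genes considered
    n_in_set:     number of genes in the gene set
    n_associated: number of trait-associated genes
    n_overlap:    number of trait-associated genes that are in the gene set
    """
    max_overlap = min(n_in_set, n_associated)
    total = comb(n_universe, n_associated)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_associated - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total
```

Because one such test is run per gene set rather than per variant, the number of tests that must be corrected for is typically far smaller than in the underlying association analysis.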

A sometimes-underappreciated aspect of gene set enrichment analysis is that, in addition to providing hypotheses about the biological processes that underlie a particular trait, it can also be used to prioritize trait-associated genes for follow-up. Specifically, when performing gene set enrichment analysis, it is possible to identify the trait-associated genes that are the strongest drivers of the enrichment; we consider these genes somewhat more likely to be causal. That is, genes that participate in significantly enriched biological processes have stronger evidence for prioritization than those that do not, since we assume that truly trait-associated genes should share biological features with one another. This can be especially useful if suggestively significant genes are included in the analysis, since such genes can also be prioritized using this approach; this can provide an evidentiary boost for genes that might otherwise be overlooked due to a subthreshold p-value.

One popular gene set enrichment analysis method for GWAS data, developed by our lab, is Data-driven Expression Prioritized Integration for Complex Traits (DEPICT)40,41. DEPICT’s primary advantage over other such methods is the use of what we refer to as “reconstituted” gene sets. These consist of gene sets downloaded from five different sources (Kyoto Encyclopedia of Genes and Genomes [KEGG]42, Reactome43, Gene Ontology [GO]120, mouse phenotypes [MP] from the Mouse Genetics Initiative121, and protein-protein interaction [PPI] networks from InWeb122) which have been extended based on the use of roughly 80,000 publicly available expression microarrays. Specifically, genes with similar patterns of expression were predicted to be members of similar gene sets based on the logic of “guilt by association” (genes that are similar in some way – in this case, genes with correlated expression patterns – are more likely to be functionally related)60,61. The reconstituted gene sets then consist of z-scores which indicate how strongly each gene is predicted to be a member of each gene set. These predictions are valuable because a large fraction of genes in the genome are not annotated as members of any known biological pathways; such understudied genes normally cannot contribute to a gene set enrichment analysis.

Another approach that can facilitate GWAS interpretation is the use of a complementary study design focusing on rare and low-frequency (RLF) coding variation. Coding variation is more readily interpretable than noncoding variation, as it directly alters codons and thus (if nonsynonymous) the amino acid sequence of a protein (though care must be taken to ensure that an observed coding signal is not attributable to nearby noncoding variation21). In addition, RLF variants can often have larger effect sizes than common variants due to reduced selection pressure8–10. When discovered, coding genetic associations can provide additional insight into GWAS signals by explaining known GWAS signals and/or by implicating additional signals in known loci11. For the purposes of this work, I classify variants with a minor allele frequency (MAF) of > 5% as common, variants with MAF between 1% and 5% as low-frequency, and variants with a MAF < 1% as rare.
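The frequency classes defined above can be written as a small helper. The handling of values falling exactly on 1% or 5% is an assumption (both are treated here as low-frequency, reading “between 1% and 5%” inclusively), and the function name is hypothetical.

```python
def classify_maf(maf):
    """Assign a variant to a frequency class based on minor allele frequency.

    Boundary handling is an assumption: exactly 1% and exactly 5% are
    both treated as low-frequency.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    if maf > 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"
```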

The most comprehensive approach to investigate coding RLF variation is exome sequencing, which covers the entire coding region of the genome7. However, exome sequencing remains very expensive, so a cheaper alternative gained popularity: the ExomeChip (EC). The ExomeChip is a genotyping array designed to cover many rare and low-frequency coding variants; it includes the majority of variants with a minor allele frequency (MAF) of greater than 0.1% in populations of European and African ancestry14.

Since the EC study design was a somewhat recent development, little work had been done to develop gene set enrichment algorithms that could handle these data. Such a method could have particular utility for EC data because for many complex traits, the number of variants achieving array-wide EC significance is small (presumably in large part due to the difficulty of achieving sufficient power to detect RLF associations). Therefore, the ability of gene set enrichment to enhance weak signal by aggregating across gene sets might prove invaluable to finding trait-associated genes that do not pass the significance threshold.

Existing methods designed for GWAS data generally cannot be applied directly to ExomeChip results because different assumptions must be made for each data type: most GWAS variants are common and noncoding, while most ExomeChip variants are rare/low-frequency and coding. We therefore developed a gene set enrichment analysis method designed specifically for ExomeChip data. We modeled this method after DEPICT so as to (1) leverage its advantage of using the reconstituted gene sets and (2) have the ability to directly compare ExomeChip enrichment results to corresponding GWAS enrichment results for a given trait.

Methods

As the EC-DEPICT method has been applied to many different traits, the method has slightly evolved and improved. Therefore, in this methods section, I describe the general overview of how the method works and the various data sources used, noting how approaches differed between traits where relevant. Where additional trait-specific implementational details are necessary, they are included in Text S2.1.

Method overview

The method for gene set enrichment is implemented as follows. First, a significance threshold is chosen; common choices are p < 1 × 10⁻⁵ or p < 5 × 10⁻⁴, which will include suggestively significant variants along with array-wide significant ones. Then, the input variants from the ExomeChip meeting the chosen significance threshold are filtered to include only nonsynonymous and splice-site variants (see “Variant annotations” for details). These variants are clumped by distance to obtain index SNPs (in these analyses, the distance was defined as either +/- 500 kb or +/- 1 Mb flanking each index SNP). Next, we map the variants to genes (see “Variant annotations” for details).
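The filtering and clumping steps described above can be sketched as follows. This is not the EC-DEPICT code itself: the `Variant` record, the function name, and the greedy clumping rule (the most significant variant becomes an index variant and absorbs everything within the window) are assumptions made for illustration. The consequence labels are the ANNOVAR categories listed under “Variant annotations.”

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    chrom: str
    pos: int          # base-pair position
    pvalue: float
    consequence: str  # annotation category
    gene: str


# ANNOVAR categories treated as nonsynonymous/splice-site in the text.
CODING = {"nonsynonymous", "splicing", "stopgain", "stoploss", "frameshift"}


def select_index_variants(variants, p_threshold=5e-4, window_bp=500_000):
    """Threshold, keep coding variants, then clump by distance.

    Greedy clumping (an assumption of this sketch): walk the variants from
    most to least significant and keep one as an index variant only if no
    previously chosen index variant lies within window_bp on the same
    chromosome.
    """
    kept = [v for v in variants
            if v.pvalue < p_threshold and v.consequence in CODING]
    kept.sort(key=lambda v: v.pvalue)
    index_variants = []
    for v in kept:
        near_index = any(v.chrom == i.chrom and abs(v.pos - i.pos) <= window_bp
                         for i in index_variants)
        if not near_index:
            index_variants.append(v)
    return index_variants


def input_genes(index_variants):
    """Each index variant contributes only the gene that contains it."""
    return [v.gene for v in index_variants]
```

Passing `window_bp=1_000_000` corresponds to the +/- 1 Mb setting used for some traits.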

Then, test statistics and p-values are calculated in an approach designed to be as similar as possible to the original DEPICT formulation. For each gene set, the z-scores for pathway membership of each significant gene from the ExomeChip are retrieved from the reconstituted gene sets and summed. The summed z-scores are normalized relative to 2,000 null ExomeChip permutations (see “Permutations” for details) by subtracting off the mean null z-score for a given gene set and dividing by the null standard deviation. The resulting adjusted z-score is then converted to a p-value based on the normal distribution.
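As a minimal sketch of the test statistic just described, the function below sums membership z-scores for the observed genes, normalizes against the sums from null permutations, and converts the result to a p-value. The function and argument names are hypothetical. Two details are assumptions rather than statements about EC-DEPICT: the use of the sample standard deviation and a one-sided (upper-tail) normal p-value.

```python
from statistics import NormalDist, mean, stdev


def gene_set_enrichment(membership, observed_genes, null_gene_lists):
    """Return (adjusted z-score, p-value) for one reconstituted gene set.

    membership      : dict mapping gene -> membership z-score in this gene set
    observed_genes  : genes implicated by the significant ExomeChip variants
    null_gene_lists : one gene list per null permutation (2,000 in the text)
    """
    observed_sum = sum(membership[g] for g in observed_genes)
    null_sums = [sum(membership[g] for g in genes) for genes in null_gene_lists]
    # Normalize: subtract the null mean, divide by the null standard deviation.
    adjusted_z = (observed_sum - mean(null_sums)) / stdev(null_sums)
    # Upper-tail p-value from the standard normal (assumed one-sided).
    p_value = 1.0 - NormalDist().cdf(adjusted_z)
    return adjusted_z, p_value
```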

False discovery rates (FDRs) are calculated using 50 null permutations separate from the 2,000 used for p-value calculation. For each of the 50 permutations, an EC-DEPICT p-value is calculated for each reconstituted gene set to generate a null distribution of p-values. The FDR at a given threshold is calculated as the average number of null p-values per permutation that fall below that threshold, divided by the number of observed p-values below that threshold.
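The FDR calculation can be illustrated with a short function. This is a sketch under stated assumptions: the name `empirical_fdr` is hypothetical, the estimate is capped at 1, and the case with no observed p-values below the threshold (where the ratio is undefined) returns NaN.

```python
def empirical_fdr(observed_pvalues, null_pvalue_sets, threshold):
    """Permutation-based FDR at a p-value threshold.

    observed_pvalues : p-values for all gene sets in the real analysis
    null_pvalue_sets : one list of gene set p-values per null permutation
                       (50 permutations in the text)
    """
    n_observed = sum(p < threshold for p in observed_pvalues)
    if n_observed == 0:
        return float("nan")  # undefined when nothing passes the threshold
    null_counts = [sum(p < threshold for p in null) for null in null_pvalue_sets]
    mean_null = sum(null_counts) / len(null_counts)
    return min(1.0, mean_null / n_observed)  # capping at 1 is an assumption
```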

The Python programming language was used to implement EC-DEPICT, and the code can be downloaded from our GitHub repository (see URLs).

GWAS-DEPICT versus EC-DEPICT

The approach used in EC-DEPICT was modeled as closely as possible after that used in DEPICT for GWAS (hereafter referred to as GWAS-DEPICT). The most substantial implementational differences with respect to GWAS-DEPICT are (1) instead of including all genes within a specified amount of linkage disequilibrium to each index variant, we include only the gene containing the index variant, (2) we include only nonsynonymous and splicing (coding) variants (discarding noncoding associations), and (3) we use permuted ExomeChip data for p-value calculation rather than permuted GWAS data.

Variant annotations

Two sets of annotations were used as the basis for prediction of variant consequences and the mapping of variants to genes. For height, body mass index, waist-hip ratio, and the glycemic traits, we used ANNOVAR annotations from the CHARGE Consortium (version 6, download date 2/10/16, see URLs), consisting of 247,083 variants in 27,196 genes. We retained only nonsynonymous and splice-site variants (for variants with different functional consequences that mapped to more than one gene, the most severe one was retained) and removed genes absent from the reconstituted gene sets, leaving 204,396 variants in 15,690 genes. Nonsynonymous and splice-site variants were defined as variants in any of the following categories: “stopgain,” “stoploss,” “splicing,” “frameshift,” and “nonsynonymous.” (Note that for height, which was the first EC-DEPICT analysis, the filtering steps were performed slightly differently. See Text S2.1 for specific implementational details.)

For leptin, adiponectin, body fat percentage, and fat-free mass index, we used 246,328 annotated SNPs in 19,924 genes from the Variant Effect Predictor (VEP) tool (version 82)123. We again retained only nonsynonymous/splice-site variants and removed genes absent from the reconstituted gene sets, leaving 203,578 variants in 15,751 genes. Nonsynonymous and splice-site variants were defined as variants in any of the following categories: “missense variant,” “stop gained,” “splice donor variant,” “splice acceptor variant,” “start lost,” and “stop lost.”

HUGO Gene Nomenclature Committee (HGNC) gene identifiers from the CHARGE and VEP annotation files were converted to Ensembl identifiers using Ensembl BioMart gene homology mapping (version 84, GRCh37, download date 4/18/2016, see URLs). Additionally, for all analyses except height, we removed all variants from the human major histocompatibility complex (chromosome 6: 25,000,000 – 35,000,000) due to its unusually complex LD structure.
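The MHC exclusion is a simple positional filter. A minimal sketch follows; the helper names are hypothetical, and two details are assumptions: the interval bounds are treated as inclusive, and an optional "chr" prefix on the chromosome label is tolerated.

```python
MHC_CHROM = "6"
MHC_START = 25_000_000  # chromosome 6: 25,000,000 - 35,000,000, as in the text
MHC_END = 35_000_000


def in_mhc(chrom, pos):
    """True if a variant falls in the excluded MHC interval."""
    chrom = str(chrom).lower()
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom == MHC_CHROM and MHC_START <= pos <= MHC_END


def drop_mhc(variants):
    """variants: iterable of (name, chrom, pos) tuples; returns those outside the MHC."""
    return [v for v in variants if not in_mhc(v[1], v[2])]
```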

Permutation generation

We use permuted ExomeChip data to calculate p-values. This strategy is designed to mitigate potential confounding effects from inherent properties of variants, genes, and gene sets. For example, a gene set which contains many long genes will be more likely to contain variants with significant p-values by chance, which can cause artificial inflation of gene set significance. However, permutations enable us to correct for this phenomenon because the gene length bias will be reflected in the null data.

We generated three sets of null permutation data used within our analysis. The first set of permutations was used for ExomeChip studies of height, body-mass index, and waist-hip ratio. The first and second sets were used for ExomeChip studies of glycemic traits. The third set was used for ExomeChip studies of leptin levels, adiponectin levels, body fat percentage, and fat-free mass index. All permutations were conducted using a linear model association in PLINK based on 2,200 simulated normally-distributed phenotypes.

The first set of null ExomeChip data was from the Malmö Diet and Cancer (MDC), All New Diabetics in Scania (ANDIS), and Scania Diabetes Registry (SDR) cohorts (a total of 11,899 samples with Swedish ancestry). We refer to these as the “Swedish nulls.” We analyzed 60,038 variants with a minor allele count (MAC) ≥ 10. After filtering to nonsynonymous/splice-site variants based on the CHARGE annotations and removing genes absent from the reconstituted gene sets (the “nonsynonymous/CHARGE/DEPICT filter”), we were left with 41,538 variants.

The second and third sets of null permutation data were based on UK Biobank data after extracting nonsynonymous/splice-site variants that are present on the ExomeChip. The second set was based on the first release of UK Biobank and thus contained 152,249 subjects. Variants present in the CHARGE file and with a MAC ≥ 10 were included, leading to a final count of 136,384 variants. After the nonsynonymous/CHARGE/DEPICT filter, we were left with 106,257 variants. The third set was based on the full release of UK Biobank and so contained 378,141 unrelated European subjects. Variants present in the CHARGE file and with a MAC ≥ 5, info score > 0.3, and missingness ≤ 0.10 were retained, leaving 155,266 variants. After the nonsynonymous/VEP/DEPICT filter, we were left with 124,692 variants.

The variants in each null set of permutations were sorted by ascending p-value and clumped (+/- 500 kb for glycemic traits, body fat percentage and fat-free mass index; +/- 1 Mb for body-mass index, waist-hip ratio, adiponectin and leptin). Annotations from the CHARGE or VEP data (see “Variant annotations,” above) were used to assign variants to genes and retain only nonsynonymous/splice-site variants. Additionally, the number of top genes we took from each null as “input genes” was matched to the observed number of significant input genes (so for example, if there were 119 significant input variants, the top-scoring 119 variants from each permutation were used in the p-value calculation). For analyses of RLF variants, a separate set of backgrounds was created that retained only loci where the index variant had a MAF of < 5%. To match the permutations as accurately as possible to the observed data, we also removed all variants absent from the specific ExomeChip summary statistics to be analyzed.
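The matching of null input size to the observed data can be sketched as below. The function name and the `(gene, p-value, MAF)` tuple layout are hypothetical, and the input is assumed to be the already clumped and annotated index variants from a single permutation.

```python
def top_null_genes(null_loci, n_observed, max_maf=None):
    """Pick the input genes for one null permutation.

    null_loci  : list of (gene, pvalue, maf) tuples, one per clumped index variant
    n_observed : number of significant input genes in the real analysis
    max_maf    : if given (e.g. 0.05 for RLF analyses), keep only loci whose
                 index variant has MAF below this value
    """
    loci = [locus for locus in null_loci
            if max_maf is None or locus[2] < max_maf]
    loci.sort(key=lambda locus: locus[1])  # ascending p-value
    return [gene for gene, _, _ in loci[:n_observed]]
```

Because every null contributes exactly as many genes as the observed analysis, the summed membership z-scores from the nulls are on the same scale as the observed sum.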

As noted above, before performing an EC-DEPICT analysis, we also remove all variants absent from the set of null permutations being used. This is the main filtering bottleneck, as it often removes a substantial fraction of variants (for example, when using the Swedish nulls, this step removes about 30% of the variants associated with body mass index). Most of the excluded variants tend to be in the very rarest allele frequency bins, especially when using the Swedish nulls; this is expected due to the much smaller sample size of that cohort relative to most of our analyzed ones. For further discussion of this issue, see Text S2.1.

Type I error

To assess type I error, we calculated all reconstituted gene set enrichment p-values for 50 null ExomeChip studies and observed that their distribution was close to uniform. However, in more detailed analyses, we also observed that extreme non-normality of gene set membership z-scores could, in certain situations, cause minor inflation of type I error. To address this issue, we repeated the original EC-DEPICT analysis with an inverse-normal-transformed version of the reconstituted gene sets, in which every gene set is forced to have a normal z-score distribution for pathway membership. We then compared the rank of each significant gene set in the original results with the rank in the inverse normal transform and flagged “outliers” with respect to the change in rank (> 1.5 × the interquartile range). We then excluded these outliers from further analysis, removing them from tables and heatmaps (except for height, where this issue only minimally impacted results; see Text S2.1). This strategy allowed us to strike a balance: we continued to use the full reconstituted gene sets in the main analysis (thus avoiding the power loss inherent in the inverse normal transform), but, by including the additional check, we erred on the side of conservatism with respect to type I error.
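The two pieces of this check can be sketched as follows. Both functions are illustrative and rest on assumptions not stated in the text: the rank-based transform uses the Blom offset (c = 3/8) and breaks ties by input order, and the outlier rule is read as Tukey's fences (a rank change more than 1.5 × IQR beyond the first or third quartile).

```python
from statistics import NormalDist, quantiles


def inverse_normal_transform(values, c=0.375):
    """Rank-based inverse normal transform of one gene set's membership scores."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, i in enumerate(order, start=1):
        transformed[i] = NormalDist().inv_cdf((rank - c) / (n - 2 * c + 1))
    return transformed


def flag_rank_outliers(original_rank, transformed_rank):
    """Gene sets whose rank shifts unusually between the two analyses.

    Both arguments map gene set name -> rank (1 = most significant).
    """
    changes = {gs: transformed_rank[gs] - original_rank[gs] for gs in original_rank}
    q1, _, q3 = quantiles(list(changes.values()), n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {gs for gs, change in changes.items() if change < lower or change > upper}
```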

Correlation and clustering of reconstituted gene sets

Many of the reconstituted gene sets are correlated with each other; it is therefore logical to group together substantially similar gene sets to facilitate interpretation and analysis of the data. We thus generated a list of “meta-gene sets” in which each meta-gene set contained a group of highly related gene sets. To do this, we calculated the Pearson correlation matrix for all pairs of the 14,462 gene sets (based on gene set membership) and subsequently clustered the reconstituted gene sets based on their similarity (i.e. Pearson correlation coefficients). A Python implementation of the Affinity Propagation algorithm124 was used for clustering (sklearn.cluster.AffinityPropagation from scikit-learn125, version 0.17). Due to the relatively large number of reconstituted gene sets, we used a maximum iteration of 10,000 and a convergence iteration of 1,000. In tables and figures, we used the “representative” gene set (as determined by the affinity propagation algorithm) as the gene set label for each meta-gene set. For analysis purposes, the p-value for each “meta-gene set” was assigned to be the best p-value of any member gene set.
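A sketch of the clustering step with scikit-learn is shown below. The function name and return structure are hypothetical. `random_state` is added here only for reproducibility and exists only in newer scikit-learn releases than the version cited; the iteration settings follow the text. If the algorithm fails to converge, the gene sets are grouped under an "unclustered" label rather than raising an error, which is a choice made for this sketch.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def build_meta_gene_sets(membership, names, pvalues):
    """Cluster gene sets by Pearson correlation of their membership z-scores.

    membership : array of shape (n_gene_sets, n_genes)
    names      : gene set labels, one per row
    pvalues    : enrichment p-value for each gene set
    Returns {representative name: {"members": [...], "p_value": best p}}.
    """
    similarity = np.corrcoef(membership)  # Pearson r for every pair of rows
    ap = AffinityPropagation(affinity="precomputed", max_iter=10000,
                             convergence_iter=1000, random_state=0)
    labels = ap.fit_predict(similarity)
    # Exemplars chosen by the algorithm serve as meta-gene set labels.
    representative = {k: names[i]
                      for k, i in enumerate(ap.cluster_centers_indices_)}
    meta = {}
    for name, label, p in zip(names, labels, pvalues):
        key = representative.get(int(label), "unclustered")
        entry = meta.setdefault(key, {"members": [], "p_value": 1.0})
        entry["members"].append(name)
        # The meta-gene set takes the best p-value of any member.
        entry["p_value"] = min(entry["p_value"], p)
    return meta
```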

Visualizations

Heatmaps were generated using the ComplexHeatmap package in R126. Meta-gene sets were used as labels for the heatmaps; each z-score for gene set membership in the heatmaps represents the z-score for the gene set with the best association p-value for that trait.

Although some variants were excluded from the EC-DEPICT analysis based on filtering steps described above (mainly absence from the permutations used), we have included them in the heatmap figures. This is because we assume that if the genes containing those variants have strong predicted membership in gene sets found to be significantly enriched, they are still good candidates for prioritization (and one of the main purposes of the heatmap strategy is to visually prioritize the best candidate genes). In fact, this is arguably even stronger evidence for prioritization of these genes, because they had no opportunity to influence the gene sets that are identified as enriched and, as such, independently support the biology implicated by these gene sets.

Results

Height

Human height is a well-studied polygenic trait with heritability estimated at roughly 80%69. At the time of this analysis, the largest GWAS of height (N = 253,288) had identified 697 independent signals in 423 loci which captured ~16% of phenotypic variance, focusing on variants with MAF > 1%69. Genes in these loci implicated growth- and development-relevant pathways, such as Wnt and fibroblast growth factor signaling. In an effort to study the contribution of rare and low-frequency variants to height, the GIANT Height ExomeChip Working Group subsequently conducted a large ExomeChip study in up to 711,428 individuals127, uncovering 83 rare and low-frequency (RLF) height-associated variants at array-wide significance (p < 2 × 10⁻⁷).

Previous pathway analyses of height loci identified by GWAS have highlighted gene sets related to both general biological processes (such as chromatin modification and regulation of embryonic size) and more skeletal growth-specific pathways (such as chondrocyte biology, extracellular matrix [ECM], and skeletal development)69. We used two different methods, DEPICT and PASCAL128, to perform pathway analyses using the ExomeChip results to test whether nonsynonymous variants across the allele frequency spectrum could either (1) independently confirm the relevance of these previously highlighted pathways (and further implicate specific genes in these pathways) or (2) identify new pathways. To compare the pathways emerging from coding and noncoding variation, we applied DEPICT separately to (1) exome array-wide significant coding variants independent of known GWAS signals and (2) noncoding GWAS loci, excluding all novel height-associated genes implicated by coding variants (see Text S2.1).

We identified a total of 496 and 1,623 enriched gene sets, respectively, at a false discovery rate (FDR) of < 1% (Table 2.1, Table S2.1, Table S2.2); similar analyses with PASCAL yielded 362 and 278 enriched gene sets. Comparison of the results revealed a high degree of shared biology between coding and noncoding variants (for DEPICT, gene set p-values compared between coding and noncoding results had Pearson’s r = 0.583, p < 2.2 × 10⁻¹⁶; for PASCAL, Pearson’s r = 0.605, p < 2.2 × 10⁻¹⁶). However, some pathways showed stronger enrichment with either coding or noncoding variation. In general, coding variants more strongly implicated pathways specific to skeletal growth (such as ECM and bone growth), while GWAS signals highlighted more global biological processes (such as transcription factor binding and embryonic size/lethality) (Figure 2.1). The two meta-gene sets that were significant in both DEPICT and PASCAL analyses and uniquely implicated by coding variants were “BCAN protein–protein interaction subnetwork” and “proteoglycan binding” (meta-gene sets are clusters of correlated gene sets defined by affinity propagation clustering; for details, see Methods). Both of these pathways relate to the biology of proteoglycans, which are proteins (such as aggrecan) that contain glycosaminoglycans (such as chondroitin sulfate) and that have well-established connections to skeletal growth129.

Table 2.1. EC-DEPICT results for height-associated coding ExomeChip variants independent of known GWAS signals. The top 15 gene sets are shown; for the full list see Table S2.1. “Wood p-value” refers to the original GWAS-DEPICT p-value of the gene set from a recent height GWAS conducted by Wood et al.69 FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Wood p-value | Meta-gene set ID | Meta-gene set description |
|---|---|---|---|---|---|---|
| MP:0002427 | disproportionate dwarf | 8.31E-18 | <0.01 | 1.76E-13 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| GO:0051216 | cartilage development | 2.83E-17 | <0.01 | 1.96E-19 | GO:0060350 | endochondral bone morphogenesis |
| MP:0008271 | abnormal bone ossification | 3.41E-17 | <0.01 | 2.60E-12 | MP:0000130 | abnormal trabecular bone morphology |
| MP:0004686 | decreased length of long bones | 3.56E-17 | <0.01 | 2.67E-12 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| MP:0004351 | short humerus | 2.01E-16 | <0.01 | 4.40E-09 | MP:0004359 | short ulna |
| GO:0001501 | skeletal system development | 7.58E-16 | <0.01 | 1.80E-20 | GO:0048704 | embryonic skeletal system morphogenesis |
| MP:0004359 | short ulna | 9.96E-16 | <0.01 | 4.71E-13 | MP:0004359 | short ulna |
| MP:0000445 | short snout | 1.60E-15 | <0.01 | 1.35E-18 | MP:0000088 | short mandible |
| MP:0000150 | abnormal rib morphology | 3.12E-15 | <0.01 | 1.23E-17 | MP:0000150 | abnormal rib morphology |
| MP:0000166 | abnormal chondrocyte morphology | 5.75E-15 | <0.01 | 4.44E-09 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| MP:0003723 | abnormal long bone morphology | 7.13E-15 | <0.01 | 3.11E-13 | MP:0004359 | short ulna |
| MP:0004509 | abnormal pelvic girdle bone morphology | 9.76E-15 | <0.01 | 1.79E-11 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| MP:0000547 | short limbs | 1.20E-14 | <0.01 | 2.79E-13 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| MP:0003662 | abnormal long bone epiphyseal plate proliferative zone | 1.24E-14 | <0.01 | 1.20E-10 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| MP:0000163 | abnormal cartilage morphology | 1.43E-14 | <0.01 | 1.91E-13 | MP:0010029 | abnormal basicranium morphology |

Figure 2.1. Comparison of DEPICT gene set enrichment results based on coding variation from ExomeChip or noncoding variation from GWAS data. The x-axis indicates the p-value for enrichment of a given gene set using EC-DEPICT, using as input genes implicated by coding ExomeChip variants that are independent of known GWAS signals. The y-axis indicates the p-value for gene set enrichment using GWAS-DEPICT, using as input the GWAS loci that do not overlap the coding signals. To visually reduce redundancy and increase clarity, we chose one representative meta-gene set for each group of highly correlated gene sets, based on affinity propagation clustering (Methods). Each point represents a meta-gene set, and the best p-value for any gene set within the meta-gene set is shown. Only significant (FDR < 0.01) gene set enrichment results are plotted. Colors correspond to whether the meta-gene set was significant for ExomeChip only (blue), GWAS only (green), both but more significant for ExomeChip (purple), or both but more significant for GWAS (orange), and the most significant meta-gene sets within each category are labelled. A line is drawn at x = y for ease of comparison.

We also examined which height-associated genes identified by ExomeChip analyses were driving enrichment of pathways such as proteoglycan binding. Using unsupervised clustering analysis to aid in visualization, we observed that a cluster of 15 height-associated genes was strongly implicated in a group of correlated pathways that include biology related to proteoglycans/glycosaminoglycans (Figure 2.2, Figure S2.1). Nearly half (seven) of these 15 genes overlap a previously curated list of 277 genes annotated in the Online Mendelian Inheritance in Man (OMIM) database as causing skeletal growth disorders69; strikingly, genes in this small cluster were significantly enriched for OMIM annotations relative to genes outside the cluster (odds ratio = 27.6, Fisher’s exact p = 1.1 × 10⁻⁵). As such, we consider the remaining genes in this cluster to be strong candidates for harboring variants that cause Mendelian growth disorders. Within this group are genes that are largely uncharacterized (SUSD5), have relevant biochemical functions (GLT8D2, a glycosyltransferase studied mostly in the context of the liver; LOXL4, a lysyl oxidase expressed in cartilage)130,131, modulate pathways known to affect skeletal growth (FIBIN, SFRP4)132,133, or lead to increased body length when knocked out in mice (SFRP4)134. The predictive power of our approach was validated shortly after submission of our manuscript, when SFRP4 was independently found to be causal for Pyle’s disease, a Mendelian disorder of cortical bone fragility135.

Figure 2.2. Heatmap showing subset of DEPICT gene set enrichment results for height-associated coding ExomeChip variants independent of known GWAS signals. The full heatmap is available in Figure S2.1. For any given square, the color indicates how strongly the corresponding gene (shown as columns) is predicted to belong to the reconstituted gene set (rows). This value is based on the gene’s z-score for gene set inclusion in DEPICT’s reconstituted gene sets, where red indicates a higher z-score and blue indicates a lower one. To visually reduce redundancy and increase clarity, we chose one representative meta-gene set for each group of highly correlated gene sets, based on affinity propagation clustering (Methods); meta-gene sets are listed with their database source. Heatmap intensity and DEPICT p-values correspond to the most significantly enriched gene set within the meta-gene set. The proteoglycan-binding pathway (bold) was uniquely implicated by coding variants by DEPICT and PASCAL. Annotations for the genes indicate whether the gene has an OMIM annotation as underlying a disorder of skeletal growth (black and grey) and the MAF of the significant ExomeChip (EC) variant (shades of blue; if multiple variants, the lowest-frequency variant was kept). Annotations for the gene sets indicate if the gene set was also found significant for ExomeChip by PASCAL (yellow, orange and grey) and if the gene set was found significant by DEPICT for ExomeChip only or for both ExomeChip and GWAS (purple and green). GO, Gene Ontology; MP, mouse phenotype in the Mouse Genetics Initiative; PPI, protein–protein interaction in the InWeb database.

Body mass index (BMI)

Body mass index (BMI) is a measurement of obesity that is relatively independent of height136; it is calculated as weight divided by height squared (kg/m²). Estimates for the heritability of BMI have ranged widely, from ~40 to 70%27. At the time of this analysis, the most recent BMI GWAS (Nmax = 339,224) had identified 97 BMI-associated loci, which explained ~2.7% of total BMI variation27. In the GWAS, the authors also performed gene set enrichment analysis using GWAS-DEPICT. Nearly 500 gene sets were significant, some of which had an obvious obesity connection (e.g. energy metabolism, insulin, polyphagia) and some of them less so (e.g. a constellation of central-nervous-system-related gene sets, including synaptic function and neurotransmitter signaling). The GIANT BMI ExomeChip Working Group subsequently conducted a large-scale ExomeChip study12 to evaluate the contribution of rare and low-frequency (RLF) variants to BMI with Nmax = 718,734, discovering 14 independent BMI-associated coding variants at array-wide significance in 13 genes.

Figure 2.3. Heatmap showing DEPICT gene set enrichment results for rare and low-frequency nonsynonymous variants with suggestive and significant evidence of association with BMI. The full heatmap is available in Figure S2.2a. For any given square, the color indicates how strongly the corresponding gene (shown as columns) is predicted to belong to the reconstituted gene set (rows). This value is based on the gene’s z-score for gene set inclusion in DEPICT’s reconstituted gene sets, where red indicates a higher z-score and blue indicates a lower one. To visually reduce redundancy and increase clarity, we chose one representative meta-gene set for each group of highly correlated gene sets, based on affinity propagation clustering (Methods). Annotations for genes indicate (i) whether the gene has an Online Mendelian Inheritance in Man (OMIM) annotation as underlying a monogenic obesity disorder (black/gray), (ii) the MAF of the significant ExomeChip variant (blue), (iii) whether the variant’s p-value reached array-wide significance (p < 2 × 10⁻⁷) or suggestive significance (p < 5 × 10⁻⁴) (purple), (iv) whether the variant was new, overlapped ‘relaxed’ GWAS signals from Locke et al.27 (GWAS p < 5 × 10⁻⁴) or overlapped ‘stringent’ GWAS hits (GWAS p < 5 × 10⁻⁸) (pink), and (v) whether the gene was included in the gene set enrichment analysis or excluded by filters (orange/brown) (Methods). Annotations for gene sets indicate if the meta-gene set was found significant (shades of green; FDR < 0.01, < 0.05, or not significant) in the DEPICT analysis of GWAS results from Locke et al.27 Here, two regions of particularly strong gene set membership are shown.

Table 2.2. EC-DEPICT results when including rare and low-frequency nonsynonymous variants associated with BMI at p < 5 × 10⁻⁴ from the ExomeChip. The top 15 gene sets are shown; full results are available in Table S2.3. “Locke p-value” refers to the original GWAS-DEPICT p-value of the gene set from a recent BMI GWAS conducted by Locke et al.27 FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Locke p-value | Meta-gene set ID | Meta-gene set description |
|---|---|---|---|---|---|---|
| MP:0003635 | abnormal synaptic transmission | 2.65E-10 | <0.01 | 0.0002 | GO:0044456 | synapse part |
| GO:0045202 | synapse | 4.99E-10 | <0.01 | 3.92E-08 | GO:0044456 | synapse part |
| GO:0033267 | axon part | 6.66E-10 | <0.01 | 1.27E-06 | GO:0033267 | axon part |
| MP:0001399 | hyperactivity | 9.42E-10 | <0.01 | 8.83E-05 | MP:0001463 | abnormal spatial learning |
| GO:0044456 | synapse part | 1.13E-09 | <0.01 | 3.87E-08 | GO:0044456 | synapse part |
| GO:0008021 | synaptic vesicle | 1.82E-09 | <0.01 | 7.90E-08 | GO:0044456 | synapse part |
| GO:0030424 | axon | 3.27E-09 | <0.01 | 1.07E-06 | GO:0033267 | axon part |
| ENSG00000078053 | AMPH PPI subnetwork | 4.43E-09 | <0.01 | 0.006 | ENSG00000181790 | BAI1 PPI subnetwork |
| ENSG00000008056 | SYN1 PPI subnetwork | 4.80E-09 | <0.01 | 1.35E-05 | ENSG00000181790 | BAI1 PPI subnetwork |
| GO:0030136 | clathrin-coated vesicle | 1.10E-08 | <0.01 | 2.97E-09 | GO:0030659 | cytoplasmic vesicle membrane |
| MP:0002906 | increased susceptibility to pharmacologically induced seizures | 1.40E-08 | <0.01 | 1.47E-05 | GO:0044456 | synapse part |
| ENSG00000171533 | MAP6 PPI subnetwork | 4.87E-08 | <0.01 | 0.002 | ENSG00000181790 | BAI1 PPI subnetwork |
| GO:0007270 | neuron-neuron synaptic transmission | 5.85E-08 | <0.01 | 0.0006 | GO:0050804 | regulation of synaptic transmission |
| MP:0002272 | abnormal nervous system electrophysiology | 7.37E-08 | <0.01 | 3.01E-05 | GO:0022836 | gated channel activity |
| Reactome transmission across chemical synapses | Reactome transmission across chemical synapses | 7.68E-08 | <0.01 | 2.66E-07 | GO:0044456 | synapse part |

To test whether the EC association data from RLF variants could implicate biological pathways or functions, we performed gene set enrichment analyses on variants with MAF < 5%. Using a p-value threshold of p < 5 × 10⁻⁴ (as in our previous analysis of GWAS data for BMI), after applying our filters (see Methods) we retained 50 variants for analysis. We observed significant enrichment (FDR < 0.05) for 471 gene sets that clustered in 106 meta-gene sets (Figure 2.3, Figure S2.2a, Table 2.2, Table S2.3). Many of these were related to neuronal processes, such as neurotransmitter release and synaptic function (e.g. glutamate receptor activity, regulation of neurotransmitter levels, and synapse part), consistent with previous findings from GWAS27. When we excluded variants near (+/- 1 Mb) previously identified GWAS loci, leaving 30 variants, we still observed 29 significantly enriched gene sets in 12 meta-gene sets, thereby providing an independent confirmation of the GWAS gene set enrichment results (Table S2.4, Figure S2.3b) (we also retained novel secondary signals in known loci for this analysis). In addition to neuron-relevant gene sets, the analyses with RLF coding variants identified a cluster of metabolic pathways related to insulin action and adipocyte/lipid metabolism (e.g. enhanced lipolysis, abnormal lipid homeostasis, and increased circulating insulin level; Figure 2.3, Figure S2.2a, Table S2.3). Finally, we observed that RLF BMI-associated coding variants can be more effective at identifying enriched gene sets relative to common coding variants. Specifically, adding 192 common coding variants (all p < 5 × 10⁻⁴) to the analysis (a total of 242 variants) decreased the number of enriched gene sets from 471 (in 106 meta-gene sets) to 62 (in 24 meta-gene sets) (Table S2.5, Figure S2.2c). One possible explanation is that the RLF coding variants may fall in the causal gene more often than do the common coding variants, which would suggest that the RLF variants are more likely to be causal as opposed to simply in LD with causal variants.

We also used the gene set enrichment analysis to prioritize candidate genes. Among the genes with RLF coding variants associated with BMI at p < 5 × 10⁻⁴, a subset were prominently represented in the CNS-related enriched gene sets (Figure 2.3, Figure S2.2a, Table 2.2, Table S2.3) and have been proposed to influence neurotransmission and/or synaptic organization, function and plasticity. These include genes in regions with suggestive evidence of association from GWAS (such as CARTPT, MAP1A, and ERC2) and genes in regions not previously implicated by GWAS (such as CALY, ACHE, PTPRD, and GRIN2A). The non-neuronal metabolic gene sets implicate two genes (CIDEA and ADH1B) that are markers of brown or “beige” adipose tissue, providing new supporting evidence for a causal role for this aspect of adipocyte biology137,138.

Waist-hip ratio

The distribution of fat on the body has been associated with different risk profiles for metabolic and cardiovascular diseases. Specifically, central adiposity (intra-abdominal fat deposition) is associated with less favorable outcomes compared to peripheral adiposity139–141. A useful and easily obtained metric of fat distribution is waist-to-hip ratio (WHR), defined as the ratio of waist circumference to hip circumference. A higher WHR indicates more central fat deposition relative to a lower WHR. To disentangle the effects of WHR from the effects of BMI (as the two are correlated), it is common to study WHR adjusted for BMI (WHRadjBMI). At the time of this analysis, the most recent large GWAS of WHRadjBMI (Nmax = 224,459) had identified 49 associated loci, explaining 1.4% of phenotypic variation overall142. Gene set enrichment analysis of these data highlighted many significant gene sets, including many involved in insulin sensitivity, glucose regulation, and skeletal growth. Subsequently, the GIANT WHR ExomeChip Working Group conducted an ExomeChip study in up to 344,369 individuals to investigate RLF contributions to WHRadjBMI, identifying 15 common and 9 RLF novel coding variants16.

To determine if the significant EC coding variants highlight novel biological pathways and/or provide additional support for previously identified biological pathways, we applied EC-DEPICT to 361 suggestively significant variants (p < 5 × 10⁻⁴) from the combined ancestries and combined sexes analysis (which after clumping and filtering became 101 lead variants in 101 genes). We identified 49 significantly enriched gene sets (FDR < 0.05) that grouped into 25 meta-gene sets (Figure 2.4, Figure S2.3a, Table 2.3, Table S2.6). We noted a cluster of meta-gene sets with direct relevance to metabolic aspects of obesity (“enhanced lipolysis,” “abnormal glucose homeostasis,” “increased circulating insulin level,” and “decreased susceptibility to diet-induced obesity”). While these pathway groups had previously been identified in the GWAS-DEPICT analysis, many of the individual gene sets within these meta-gene sets were not significant in the previous GWAS analysis, such as “insulin resistance,” “abnormal white adipose tissue physiology,” and “abnormal fat cell morphology” (Figure 2.4, Figure S2.3a, Table 2.3, Table S2.6), but represent similar biological underpinnings implied by the shared meta-gene sets. Despite their overlap with the GWAS results, these analyses highlight novel genes that fall outside known GWAS loci, based on their strong contribution to the significantly enriched gene sets related to adipocyte and insulin biology (e.g. MLXIPL, ACVR1C, and ITIH5).


Table 2.3. EC-DEPICT results when including nonsynonymous variants associated with WHRadjBMI with p < 5 × 10⁻⁴ in the GIANT ExomeChip analyses. The top 15 gene sets are shown; full results are available in Table S2.6. “Shungin p-value” refers to the original GWAS-DEPICT p-value of the gene set from the recent WHRadjBMI GWAS conducted by Shungin et al.142 FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Shungin p-value | Meta-gene set ID | Meta-gene set description |
| --- | --- | --- | --- | --- | --- | --- |
| GO:0005811 | lipid particle | 8.38E-08 | <0.01 | 0.009 | MP:0008034 | enhanced lipolysis |
| MP:0003055 | abnormal long bone epiphyseal plate morphology | 9.00E-08 | <0.01 | 3.21E-05 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| MP:0003109 | short femur | 3.39E-07 | <0.01 | 0.006 | MP:0000130 | abnormal trabecular bone morphology |
| MP:0005508 | abnormal skeleton morphology | 1.40E-06 | <0.01 | 1.16E-06 | MP:0000130 | abnormal trabecular bone morphology |
| MP:0005331 | insulin resistance | 1.42E-06 | <0.01 | 0.001 | MP:0002079 | increased circulating insulin level |
| MP:0005670 | abnormal white adipose tissue physiology | 2.69E-06 | <0.05 | 0.006 | MP:0008034 | enhanced lipolysis |
| MP:0005659 | decreased susceptibility to diet-induced obesity | 3.51E-06 | <0.05 | 0.000399 | MP:0005659 | decreased susceptibility to diet-induced obesity |
| MP:0003662 | abnormal long bone epiphyseal plate proliferative zone | 4.15E-06 | <0.05 | 0.000247 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| MP:0000547 | short limbs | 4.53E-06 | <0.05 | 0.003 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| MP:0009115 | abnormal fat cell morphology | 4.58E-06 | <0.05 | 0.008 | MP:0005659 | decreased susceptibility to diet-induced obesity |
| ENSG00000157227 | MMP14 PPI subnetwork | 5.03E-06 | <0.05 | 0.001 | ENSG00000149968 | MMP3 PPI subnetwork |
| ENSG00000150093 | ITGB1 PPI subnetwork | 5.20E-06 | <0.05 | 0.02 | ENSG00000150093 | ITGB1 PPI subnetwork |
| GO:0031589 | cell-substrate adhesion | 5.45E-06 | <0.05 | 0.003 | GO:0010810 | regulation of cell-substrate adhesion |
| ENSG00000154928 | EPHB1 PPI subnetwork | 5.48E-06 | <0.05 | 0.109 | ENSG00000133216 | EPHB2 PPI subnetwork |
| MP:0004686 | decreased length of long bones | 6.22E-06 | <0.05 | 0.000524 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |


Figure 2.4. Heatmap showing EC-DEPICT gene set enrichment results for WHRadjBMI. The full heatmap is available in Figure S2.3a. For any given square, the color indicates how strongly the corresponding gene (shown as the column) is predicted to belong to the reconstituted gene set (rows). This value is based on the gene’s z-score for gene set inclusion in DEPICT’s reconstituted gene sets, where red indicates a higher z-score and blue indicates a lower one. To visually reduce redundancy and increase clarity, we chose one representative meta-gene set for each group of highly correlated gene sets, based on affinity propagation clustering (Methods). Annotations for the genes indicate (1) the minor allele frequency of the significant ExomeChip (EC) variants (shades of blue; if multiple variants, the lowest-frequency variant was kept), (2) whether the variant’s p-value reached array-wide significance (p < 2 × 10⁻⁷) or suggestive significance (p < 5 × 10⁻⁴) (shades of purple), (3) whether the variant was novel or known (shades of pink; novel, within 500 kb of a known locus, or within 1 Mb of a known locus overlapping GWAS signals from Shungin et al.142), and (4) whether the gene was included in the gene set enrichment analysis or excluded by filters (shades of brown/orange) (Methods). Annotations for the gene sets indicate if the meta-gene set was found significant (shades of green; FDR < 0.01, < 0.05, or not significant) in the DEPICT GWAS results from Shungin et al.

To focus on novel findings, we conducted an additional pathway analysis after excluding variants from previous WHRadjBMI analyses. Seventy-five genes were included in the DEPICT analysis, and we identified 26 significantly enriched gene sets in 13 meta-gene sets (Table S2.7, Figure S2.3b). Here, all but one gene set (“lipid particle size”) were related to skeletal biology. This result likely reflects an effect on the pelvic skeleton (hip circumference), shared signaling pathways between bone and fat (such as TGF-β), and shared developmental origin. Many of these pathways were previously found to be significant in the GWAS-DEPICT analysis; these findings provide a fully independent replication of their biological relevance for WHRadjBMI.

Glycemic traits

Type 2 diabetes is a serious condition that results from insufficient insulin secretion and loss of responsiveness to insulin signaling, leading to elevated levels of glucose in the blood. There are many glycemic metrics available for quantifying physiological variation relevant to blood sugar and risk of type 2 diabetes, and, to date, more than 97 loci have been associated with various such measures143. The MAGIC ExomeChip Working Group recently conducted a large-scale ExomeChip study in up to 144,060 non-diabetic individuals on four glycemic metrics (all adjusted for BMI): fasting insulin (FI), fasting glucose (FG), 2-hour glucose tolerance test (2hGlu), and glycated hemoglobin (HbA1c; this measures the amount of hemoglobin chemically linked to a sugar and is used to estimate two-to-three-month average blood sugar levels)18. Across all traits, the working group identified 74 coding variant associations in 58 genes, including 21 coding variants in 18 novel loci. For each association, an effort was undertaken to identify the causal effector transcript and indicate how strong the evidence was for each one. This included a literature search to see if the gene had a known role in glucose metabolism or red blood cell biology, lookups in association efforts of type 2 diabetes and blood cell traits, and a reciprocal conditional analysis with known noncoding GWAS index variants to determine if the coding variant was likely to be driving the observed signal. Loci with evidence from both literature and conditional analysis were labeled “gold,” loci without evidence from conditional analysis but with strong biological plausibility were labeled “silver,” and genes with some evidence for causality but not enough for “silver” classification were labeled “bronze.” Genes without sufficient evidence for “bronze” status were left with an undetermined effector transcript. Based on this effort, conducted independently by four expert authors, 24 transcripts were classified as gold, 11 as silver, and 16 as bronze (Figure 2.5).
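The tiering logic described above can be sketched as a small decision rule. This is only an illustration: the actual classification was reached by expert consensus rather than by an automated rule, and the field names, the `Evidence` container, and the treatment of conditional-analysis support without literature support as “bronze” are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    # All three fields are hypothetical simplifications of the expert review.
    biological_support: bool           # literature support / strong biological plausibility
    conditional_support: bool          # conditional analysis suggests the coding variant drives the signal
    some_causal_evidence: bool = False  # weaker, partial evidence for causality


def classify_effector_transcript(ev):
    """Map evidence flags to a tier, following the description in the text."""
    if ev.biological_support and ev.conditional_support:
        return "gold"
    if ev.biological_support:
        return "silver"
    # Assumption: any remaining partial evidence is treated as "bronze".
    if ev.conditional_support or ev.some_causal_evidence:
        return "bronze"
    return "undetermined"
```

For example, `classify_effector_transcript(Evidence(True, False))` falls into the “silver” tier because biological plausibility is present but conditional-analysis support is not.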

Next, we used EC-DEPICT to identify enriched pathways among our trait-associated variants. We began by analyzing all variants that reached p < 1 × 10⁻⁵ for any of the four glycemic traits in our analysis (after filtering, 79 variants in 78 genes, some of which were associated with more than one trait). We observed 234 significant gene sets in 86 meta-gene sets with FDR < 0.05 (Table S2.8, Figure S2.4a). As expected, we observed a strong enrichment of insulin- and glucose-related gene sets, as well as exocytosis biology (in keeping with insulin vesicle release). We also noted a strong enrichment for blood-related pathways, which was primarily driven by HbA1c-associated variants; this was likely because HbA1c levels are influenced not only by glycation but also by blood cell turnover rate144,145. To disentangle blood cell turnover from effects due to glycation, we repeated the analysis excluding variants that were significantly associated with HbA1c only (after filtering, 45 variants in 45 genes) and found 128 significant gene sets in 53 meta-gene sets (FDR < 0.05) (Figure 2.6a, Figure S2.4b, Table 2.4, Table S2.9). We also analyzed each of the four traits separately (Figure 2.6b, Figures S2.4c-e, Tables S2.10-12; for additional details and results, see Text S2.1).


Figure 2.5. Effector transcript classification into “gold,” “silver” and “bronze” categories based on strength of genetic and biological evidence. A total of 51 effector transcripts from 74 single variant and six gene-based signals were identified, with many of them shared across traits. The classification was undertaken independently by four of the authors and the consensus was used as the final classification for effector transcripts. *Asterisk indicates “silver” for FG, “bronze” for 2hGlu. Figure from ref 18.

Table 2.4. EC-DEPICT results when including all variants associated with any glycemic trait except HbA1c with p-value < 1 × 10⁻⁵. The top 15 gene sets are shown; full results are available in Table S2.9. FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Meta-gene set ID | Meta-gene set description |
| --- | --- | --- | --- | --- | --- |
| Reactome regulation of gene expression in beta cells | Reactome regulation of gene expression in beta cells | 3.49E-22 | <0.01 | KEGG Maturity Onset Diabetes of the Young | KEGG Maturity Onset Diabetes of the Young |
| MP:0003059 | decreased insulin secretion | 3.00E-19 | <0.01 | GO:0030072 | peptide hormone secretion |
| Reactome regulation of beta-cell development | Reactome regulation of beta-cell development | 1.31E-18 | <0.01 | KEGG Maturity Onset Diabetes of the Young | KEGG Maturity Onset Diabetes of the Young |
| KEGG Maturity Onset Diabetes of the Young | KEGG Maturity Onset Diabetes of the Young | 1.68E-17 | <0.01 | KEGG Maturity Onset Diabetes of the Young | KEGG Maturity Onset Diabetes of the Young |
| MP:0003564 | abnormal insulin secretion | 4.96E-17 | <0.01 | GO:0030072 | peptide hormone secretion |
| MP:0002727 | decreased circulating insulin level | 5.84E-16 | <0.01 | MP:0002078 | abnormal glucose homeostasis |
| MP:0005216 | abnormal pancreatic alpha cell morphology | 1.22E-15 | <0.01 | KEGG Maturity Onset Diabetes of the Young | KEGG Maturity Onset Diabetes of the Young |
| MP:0009254 | disorganized pancreatic islets | 2.80E-15 | <0.01 | KEGG Maturity Onset Diabetes of the Young | KEGG Maturity Onset Diabetes of the Young |
| MP:0003339 | decreased pancreatic beta cell number | 3.09E-15 | <0.01 | KEGG Maturity Onset Diabetes of the Young | KEGG Maturity Onset Diabetes of the Young |
| MP:0002693 | abnormal pancreas physiology | 5.57E-15 | <0.01 | KEGG Maturity Onset Diabetes of the Young | KEGG Maturity Onset Diabetes of the Young |
| Reactome integration of energy metabolism | Reactome integration of energy metabolism | 4.53E-14 | <0.01 | GO:0030072 | peptide hormone secretion |
| Reactome regulation of insulin secretion | Reactome regulation of insulin secretion | 5.20E-13 | <0.01 | GO:0030072 | peptide hormone secretion |
| MP:0005293 | impaired glucose tolerance | 7.26E-12 | <0.01 | MP:0002079 | increased circulating insulin level |
| MP:0009114 | decreased pancreatic beta cell mass | 2.78E-11 | <0.01 | KEGG Maturity Onset Diabetes of the Young | KEGG Maturity Onset Diabetes of the Young |
| Reactome signal attenuation | Reactome signal attenuation | 2.98E-11 | <0.01 | Reactome SHC-mediated signaling | Reactome SHC-mediated signaling |


Figure 2.6. EC-DEPICT results from analysis of (a) all traits except HbA1c and (b) FG. The full heatmaps are available in Figure S2.4b and Figure S2.4d. The columns represent the input genes for the analysis. In panel (a), these are genes containing variant associations of p < 1 × 10⁻⁵ for FG, FI, and/or 2hGlu, and in panel (b) these are genes containing variant associations of p < 1 × 10⁻⁵ for FG. Rows in the heatmap represent significant meta-gene sets (FDR < 0.05). For any given square, the color indicates how strongly the corresponding gene (shown as columns) is predicted to belong to the reconstituted gene set (rows). This value is based on the gene’s z-score for gene set inclusion in DEPICT’s reconstituted gene sets, where red indicates a higher z-score and blue indicates a lower one. To visually reduce redundancy and increase clarity, we chose one representative meta-gene set for each group of highly correlated gene sets, based on affinity propagation clustering (Methods). Heatmap intensity and DEPICT p-values correspond to the most significantly enriched gene set within the meta-gene set. The gene set annotations indicate whether that meta-gene set was significant at FDR < 0.05 or not significant (n.s.) for each of the other EC-DEPICT analyses. The gene variant annotations are as follows: (1) the European MAF of the input variant, where rare is MAF < 1%, low-frequency is MAF 1-5%, and common is MAF > 5%, (2) whether the gene has an OMIM annotation as causal for a diabetes/glycemic-relevant syndrome or blood disorder, (3) the effector transcript classification for that variant: gold, silver, bronze, or NA (note that only array-wide significant variants were classified, so suggestively-significant variants are by default classified as “NA”), (4-7) whether each variant was significant (p < 2 × 10⁻⁷), suggestively significant (p < 1 × 10⁻⁵), or not significant in Europeans for each of the four traits, and (8) whether each variant was included in the analysis or excluded by filters (Methods). AWS: array-wide significant.
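The frequency and significance categories used in these annotations can be written as two small helper functions. The thresholds are those stated in the caption; the function names are hypothetical, and placing a MAF of exactly 5% in the low-frequency class is an assumption based on the caption's "MAF 1-5%" and "MAF > 5%" wording.

```python
def maf_class(maf):
    """Classify a minor allele frequency given as a proportion (e.g. 0.013 for 1.3%)."""
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:  # assumption: exactly 5% counts as low-frequency
        return "low-frequency"
    return "common"


def significance_class(p_value):
    """Classify a p-value using the glycemic-trait thresholds from the caption."""
    if p_value < 2e-7:
        return "array-wide significant"
    if p_value < 1e-5:
        return "suggestively significant"
    return "not significant"
```

Under these thresholds, a variant with a MAF of 1.3% and p = 1.9 × 10⁻⁶ would be annotated as low-frequency and suggestively significant.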


To prioritize candidate genes, we then performed heatmap visualization with unsupervised clustering of the membership predictions (z-scores) of significant trait-associated genes for each significant gene set (Figure 2.6, Figure S2.4). This strategy has previously been effective for gene prioritization, as it becomes visually apparent which genes are the strongest drivers of the significant gene sets and thus natural targets for follow-up. This can be particularly helpful for prioritizing genes that are not well-characterized, as it leverages DEPICT’s prediction of gene function. For the all-except-HbA1c analysis, one cluster showed particularly strong predicted membership for highly relevant gene sets, including glucose homeostasis and insulin secretion, glycogen, incretin, carbohydrate metabolism, and Maturity Onset Diabetes of the Young (Figure 2.6a). Strikingly, this cluster of six genes (PCSK1, GLP1R, GIPR, G6PC2, SLC30A8 and CTRB2) contained five of the genes that had independently been assigned to “gold” status during effector transcript identification. Therefore, the sixth gene, CTRB2, represented a novel gene for prioritization, since it showed strong similarity to other genes for which there was already biological evidence (hence their “gold” status). CTRB2 encodes chymotrypsinogen B2, which acts as a digestive enzyme and is expressed in the pancreas and subsequently secreted into the gut. The gene contains a nominally significant low-frequency variant for 2hGlu (rs147238447; p = 1.9 × 10⁻⁶).

The prioritization of CTRB2 is interesting, as it adds further support to the current hypothesis that the exocrine pancreas contributes to complex mechanisms influencing 2hGlu levels and diabetes risk146,147. Additional lines of evidence also support the potential importance of this locus: other variants in this region have been shown to be associated with glucagon-like peptide 1 (GLP-1)-stimulated insulin secretion148, risk of chronic pancreatitis149, and pancreatic cancer (although all with r² < 0.01 with the coding variant rs147238447 described here)150. In addition, two independent signals in this region (referred to as the BCAR1 locus) have been associated with T2D risk, though again rs147238447 is not151,152. Given the relatively low minor allele frequency of rs147238447 (1.3% observed in our HbA1c cohort, but 0.23% in non-Finnish Europeans in gnomAD153), it is possible that the lack of observed association for some of these other traits could be due to low power.

We also noted a small but distinct cluster in the FG-only analysis implicating a role for the cilium/axoneme, pointing to novel biology relating to sensing and signaling in response to the extracellular environment (Figure 2.6b). Two genes were the main drivers of this association: WDR78 and AGBL2. These represent potentially interesting candidates for follow-up, although we note that the AGBL2 signal may be driven through effects of the nearby MADD gene, which harbors an FG-associated coding variant in our study and is labeled “silver” in our effector transcript classification (Figure 2.5). Overall, our pathway analyses highlighted several trait-associated genes that do not reach exome-wide significance in conventional single-variant or gene-based tests but show evidence of contribution to glycemic regulation.

Fat-free mass index and body fat percentage

BMI is a useful measure of obesity on a population scale, but it can be misleading: two individuals with the same BMI may have different proportions of lean to fat mass, which are associated with different risk profiles for obesity-associated diseases. Therefore, some studies have specifically investigated the genetic underpinnings of fat and lean mass. One metric commonly used to address this is body fat percentage, which can be measured via bioimpedance analysis or dual X-ray absorptiometry. A previous GWAS (Nmax = 100,716) found 12 loci associated with body fat percentage, explaining 0.58% of phenotypic variance154; in addition, another GWAS (Nmax = ~100,000) identified five loci associated with lean mass155. To evaluate the contribution of RLF variation to body composition, the GIANT Body Fat Percentage ExomeChip Working Group conducted an ExomeChip study of two phenotypes: body fat percentage and fat-free mass index (FFMI) (N = ~440,000 for each phenotype). FFMI is a metric analogous to BMI; it is calculated as fat-free mass divided by height squared.
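To make the relationship between these body composition metrics concrete, the sketch below computes BMI and FFMI from their definitions. The function names and example values are hypothetical; the units (kilograms and meters) and the derivation of fat-free mass from body fat percentage are assumptions for illustration rather than the phenotype definitions used in the study.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared."""
    return weight_kg / height_m ** 2


def fat_free_mass(weight_kg, body_fat_pct):
    """Fat-free mass implied by a body fat percentage (assumed derivation)."""
    return weight_kg * (1 - body_fat_pct / 100)


def ffmi(fat_free_mass_kg, height_m):
    """Fat-free mass index: fat-free mass divided by height squared."""
    return fat_free_mass_kg / height_m ** 2


# Two hypothetical individuals with identical BMI but different composition.
person_a = ffmi(fat_free_mass(80.0, 25.0), 2.0)  # 25% body fat
person_b = ffmi(fat_free_mass(80.0, 50.0), 2.0)  # 50% body fat
```

Both hypothetical individuals have the same BMI, yet their FFMI values differ, which is the distinction the paragraph above draws between BMI and direct measures of lean and fat mass.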

For FFMI, we analyzed variants with p < 5 × 10⁻⁴ and a MAF between 0.0001 and 0.05, which after clumping and filtering became 106 input variants in as many genes. We observed three significant gene sets at FDR < 0.05 (“reduced female fertility,” “abnormal energy expenditure,” and “internal hemorrhage”) and an additional 15 at a relaxed significance threshold of FDR < 0.20. As a whole, these gene sets were relevant to height/growth (e.g. “osteoarthritis,” “transforming growth factor beta binding”), fertility (e.g. “reduced female fertility,” “absent estrous cycle”), and insulin/glucose signaling (e.g. “impaired glucose tolerance,” “carbohydrate binding”) (Table 2.5). This clearly relevant set of pathways could in the future be used to divide genes into groups by predicted mechanism (e.g. genes that are more relevant for general growth versus genes that are more relevant for glucose signaling). For body fat percentage, we analyzed 279 variants in 279 genes (with the same MAF filter as for FFMI) and found no significant pathways (Table S2.13).

Table 2.5. Results from EC-DEPICT analysis of variants associated with FFMI at p < 5 × 10⁻⁴ (MAF < 5% and > 0.01%). All gene sets with FDR < 0.20 are shown. FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Meta-gene set ID | Meta-gene set description |
| --- | --- | --- | --- | --- | --- |
| MP:0001923 | reduced female fertility | 2.11E-06 | <0.05 | MP:0011086 | partial postnatal lethality |
| MP:0005450 | abnormal energy expenditure | 2.84E-06 | <0.05 | MP:0002079 | increased circulating insulin level |
| MP:0001634 | internal hemorrhage | 4.96E-06 | <0.05 | Reactome formation of fibrin clot clotting cascade | Reactome formation of fibrin clot clotting cascade |
| MP:0003560 | osteoarthritis | 2.19E-05 | <0.20 | MP:0003662 | abnormal long bone epiphyseal plate proliferative zone |
| ENSG00000163520 | FBLN2 PPI subnetwork | 3.69E-05 | <0.20 | GO:0005604 | basement membrane |
| MP:0001732 | postnatal growth retardation | 4.49E-05 | <0.20 | MP:0011086 | partial postnatal lethality |
| ENSG00000119699 | TGFB3 PPI subnetwork | 4.71E-05 | <0.20 | ENSG00000092969 | TGFB2 PPI subnetwork |
| MP:0000477 | abnormal intestine morphology | 6.22E-05 | <0.20 | MP:0004936 | impaired branching involved in ureteric bud morphogenesis |
| GO:0050431 | transforming growth factor beta binding | 6.22E-05 | <0.20 | GO:0004675 | transmembrane receptor protein serine/threonine kinase activity |
| MP:0009009 | absent estrous cycle | 6.53E-05 | <0.20 | MP:0001134 | absent corpus luteum |
| ENSG00000092969 | TGFB2 PPI subnetwork | 7.03E-05 | <0.20 | ENSG00000092969 | TGFB2 PPI subnetwork |
| MP:0000642 | enlarged adrenal glands | 7.30E-05 | <0.20 | GO:0006700 | C21-steroid hormone biosynthetic process |
| ENSG00000166147 | FBN1 PPI subnetwork | 8.01E-05 | <0.20 | GO:0031012 | extracellular matrix |
| MP:0005293 | impaired glucose tolerance | 0.0001253 | <0.20 | MP:0002079 | increased circulating insulin level |
| GO:0030246 | carbohydrate binding | 0.0001288 | <0.20 | GO:0005539 | glycosaminoglycan binding |
| MP:0003400 | kinked neural tube | 0.0001452 | <0.20 | MP:0005221 | abnormal rostral-caudal axis patterning |
| MP:0010024 | increased total body fat amount | 0.000163 | <0.20 | MP:0002079 | increased circulating insulin level |
| MP:0003567 | abnormal fetal cardiomyocyte proliferation | 0.0001778 | <0.20 | MP:0000267 | abnormal heart development |

Adiponectin

Adiponectin is an adipocyte-secreted hormone which has been associated with glucose and insulin metabolism. Increasing levels of adiponectin have been associated with improved risk profiles for obesity, cardiovascular disease, type 2 diabetes, and inflammation156–158. A recent GWAS of adiponectin levels adjusted for BMI (adiponectinadjBMI) in up to 45,981 individuals identified 12 loci159. To expand upon these results, the GIANT Adiponectin ExomeChip Working Group conducted an ExomeChip analysis of adiponectinadjBMI levels in up to 67,739 individuals, identifying 20 significant loci, 9 of which were novel160.

Using p < 5 × 10⁻⁴ as a threshold, we conducted two EC-DEPICT analyses of adiponectinadjBMI: (1) 70 variants across all allele frequencies and (2) 51 variants with MAF < 5%. For both analyses, none of the gene sets achieved significance (all FDR > 0.05). However, most of the top-ranked gene sets from both analyses had clear phenotypic relevance. The top pathways from the all-variants analysis were “enhanced lipolysis” (p = 2.30 × 10⁻⁵), “SELP PPI subnetwork” (p = 7.10 × 10⁻⁵), “decreased triglyceride level” (p = 7.56 × 10⁻⁵), and “abnormal white adipose tissue morphology” (p = 9.62 × 10⁻⁵) (Table S2.14). The top pathways from the RLF variants analysis (Table S2.15) were “decreased triglyceride level” (p = 5.84 × 10⁻⁶), “abnormal fat cell morphology” (p = 1.01 × 10⁻⁵), “positive regulation of interleukin-8 production” (p = 1.53 × 10⁻⁵), and “decreased adiponectin level” (p = 1.85 × 10⁻⁵). With more power, some of these gene sets might achieve significance. Additionally, genes can still be (very preliminarily) prioritized based on these results, using the genes with the best z-scores for these clearly related pathways. For example, two of the genes with the strongest membership scores for the most significant pathways, PLIN1 and PNPLA2 (neither of which achieved array-wide significance), have critical roles in adipocytes and have been associated with monogenic lipid storage disorders and lipodystrophies161,162.
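One simple way to operationalize this kind of preliminary prioritization is to rank input genes by their average membership z-score across the top gene sets. The sketch below is an assumption-laden illustration: the ranking by mean z-score, the function name, and the toy gene and gene set labels are all hypothetical and do not reproduce the clustering-based visualization used in this chapter.

```python
from statistics import mean


def rank_genes_by_membership(z_scores, gene_sets):
    """Rank genes by mean z-score across the chosen gene sets (highest first).

    `z_scores` maps gene -> {gene set name -> membership z-score}.
    """
    mean_z = {
        gene: mean(per_set[name] for name in gene_sets)
        for gene, per_set in z_scores.items()
    }
    return sorted(mean_z, key=mean_z.get, reverse=True)


# Hypothetical z-scores for illustration only (not results from the study).
toy_z = {
    "GENE_A": {"set_1": 4.0, "set_2": 3.0},
    "GENE_B": {"set_1": 0.5, "set_2": -0.5},
    "GENE_C": {"set_1": 2.0, "set_2": 2.5},
}
ranking = rank_genes_by_membership(toy_z, ["set_1", "set_2"])
```

Genes at the top of such a ranking would be candidates for closer inspection, subject to the caveat in the text that the underlying gene sets did not reach significance.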

Leptin

Leptin is a hormone produced by adipocytes that signals satiety via hypothalamic circuitry in the brain163. Deficiencies in leptin production resulting from homozygous mutations in the leptin gene (LEP) have been associated with a severe form of monogenic obesity, and higher leptin levels generally correlate with higher body fat mass163. A large GWAS of leptin levels (Nmax = ~50,000) identified four loci influencing leptin levels independent of BMI164. The GIANT Leptin ExomeChip Working Group conducted an ExomeChip analysis of leptin levels adjusted for BMI in up to 57,232 individuals and identified ten loci, half of which were novel.

Using p < 5 × 10⁻⁴ as a threshold, we applied EC-DEPICT to leptin adjusted for BMI (leptinadjBMI) (45 variants across all allele frequencies, 33 variants with MAF < 5%). We observed no significant gene sets for either analysis (Table S2.16, Table S2.17).

Discussion

We have developed EC-DEPICT, a gene set enrichment analysis method designed specifically for ExomeChip data, and a complementary qualitative visualization-based approach for gene prioritization. EC-DEPICT uses predicted gene functions from GWAS-DEPICT’s reconstituted gene sets to perform a powerful analysis that can leverage information even from genes which are not annotated with any known functions. EC-DEPICT also enables contributions from genes at subthreshold levels of significance with tested phenotypes, improving power to detect biology and enabling prioritization of such genes based on strong contributions to biological enrichment. We applied EC-DEPICT to a variety of different traits, leading to a number of interesting observations.

First, for many traits, EC-DEPICT was able to replicate gene sets previously found to be significant by gene set enrichment analysis of GWAS data. In particular, the initial discovery in Locke et al.27 that BMI was strongly associated with many neuron-relevant gene sets was surprising: though a hypothalamic connection with obesity has been well-established (via the leptin-melanocortin signaling pathway), the enrichment results seemed to suggest a much more widespread and systematic association with the nervous system. Therefore, the fact that we were able to independently replicate this association by applying EC-DEPICT to novel RLF variants (i.e. variants not discovered in the original GWAS) provides an important line of confirmatory evidence.

EC-DEPICT also enabled novel biological insight for some traits. For example, for height, we applied EC-DEPICT to coding variants independent of known GWAS signals and GWAS-DEPICT to noncoding variants discovered by GWAS (after removing any novel genes discovered by the ExomeChip). We found that the most strongly enriched gene sets from the noncoding set were general biological processes, such as transcription factor binding, while the most strongly enriched gene sets from the coding set were much more specific to skeletal biology and growth. Thus, we hypothesize that genes in the “coding” pathways may be more specific to growth biology, while the genes in the “noncoding” pathways may have more global functions and therefore affect height in a regulatory/noncoding tissue-specific manner.

We also used a visualization approach of EC-DEPICT results to prioritize genes for follow-up, which can be especially helpful in identifying potentially associated genes that do not quite reach statistical significance. The validity of this method was confirmed by two main observations. First, for height, our approach independently identified a cluster of genes that contained a statistically significant enrichment of genes annotated in OMIM as causal for Mendelian disorders of skeletal growth. We predicted that at least some of the remaining genes in this cluster might themselves be causal for such disorders, and, indeed, one of these genes was soon discovered to cause Pyle’s disease, a Mendelian disorder of bone fragility. Second, in our analysis of glycemic traits, four experts from the MAGIC Consortium had already developed a gold/silver/bronze annotation system for classifying the strength of evidence for each predicted effector transcript. Of the six genes in the most obviously striking cluster in our heatmap of all glycemic traits except HbA1c, five had previously been annotated as “gold,” providing an independent validation that this cluster identified highly relevant genes. When we performed a literature search investigating the remaining gene, CTRB2 (harboring a suggestively significant variant for 2hGlu), we determined that a large amount of evidence from related traits (e.g. type 2 diabetes and pancreatic cancer) had previously identified this locus as relevant. CTRB2 is especially interesting because of a developing theory that, in addition to the well-studied role of the endocrine pancreas in type 2 diabetes risk, the exocrine pancreas may be relevant as well (as the CTRB2 protein is a digestive enzyme secreted by the pancreas). Thus, EC-DEPICT highlighted a suggestively significant variant that, upon further investigation, showed substantial support from other lines of evidence and may point to a less-understood biological mechanism.

An important limitation of our approach is that it cannot be used to analyze ExomeChip and GWAS data that have been meta-analyzed together. GWAS-DEPICT and EC-DEPICT have each been developed specifically for their respective data types, since the permutations used for p-value calculation come from their respective genetic sources. In addition, each makes different assumptions about how to define potentially causal genes, with GWAS-DEPICT including all genes within r² > 0.5 of an index SNP and EC-DEPICT including only genes harboring the most significant nonsynonymous SNP per locus. Therefore, a major improvement would be developing a DEPICT version capable of analyzing any type of input data, ideally also expanding to next-generation sequencing data and different types of data models (e.g. burden testing in addition to single-variant). This would likely involve a substantial redesign of the p-value calculation component, as it would be impossible to directly model every possible input format.
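The EC-DEPICT gene definition described above (one gene per locus, taken from the most significant nonsynonymous variant) can be sketched as follows. This is a simplified illustration rather than the EC-DEPICT implementation: the `CodingVariant` structure, the assumption that locus labels have already been assigned upstream (for example, by clumping), and the toy identifiers are all hypothetical.

```python
from typing import NamedTuple


class CodingVariant(NamedTuple):
    variant_id: str  # hypothetical identifiers are used below
    locus: str       # locus label assumed to be assigned upstream
    gene: str
    p_value: float


def lead_variant_per_locus(variants):
    """Keep only the most significant (lowest p-value) variant in each locus."""
    lead = {}
    for v in variants:
        if v.locus not in lead or v.p_value < lead[v.locus].p_value:
            lead[v.locus] = v
    return lead


def input_genes(variants):
    """Genes harboring the lead nonsynonymous variant of each locus."""
    return sorted({v.gene for v in lead_variant_per_locus(variants).values()})


# Toy input: two variants share locus1, so only the stronger one contributes.
toy = [
    CodingVariant("var1", "locus1", "GENE_A", 3e-6),
    CodingVariant("var2", "locus1", "GENE_B", 8e-5),
    CodingVariant("var3", "locus2", "GENE_C", 1e-4),
]
genes = input_genes(toy)
```

By contrast, the GWAS-DEPICT definition described in the text would collect every gene within the LD-defined boundaries of an index SNP, so a single locus could contribute several genes.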

The rapidly increasing sample size of GWAS and corresponding increase in the number of trait-associated loci has made the development of efficient computational tools for interpretation ever more critical. EC-DEPICT represents one such tool, as it can be applied to ExomeChip data for any trait of interest and can be used to generate hypotheses about specific biological mechanisms and genes to prioritize. We used EC-DEPICT to analyze a wide variety of different traits and observed that it often produced robust, interesting results which substantially aided in interpretation of the genetic association data; we therefore conclude EC-DEPICT is a useful new method which can add value to ExomeChip-based studies.

URLs

CHARGE Consortium ExomeChip annotation file: http://www.chargeconsortium.com/main/exomechip/ (file: SNPInfo_HumanExome-12v1_rev6.tsv.txt.gz)

Ensembl Biomart: http://grch37.ensembl.org/biomart/

Reconstituted gene sets: http://www.broadinstitute.org/mpg/depict/depict_download/reconstituted_genesets/GPL570-GPL96-GPL1261-GPL1355TermGeneZScores-MGI_MF_CC_RT_IW_BP_KEGG_z_z.txt.gz

Meta-cluster labels: https://github.com/RebeccaFine/obesity-ecdepict/blob/master/data/metacluster_labels.txt

GWAS-DEPICT: https://github.com/perslab/depict

EC-DEPICT for height: https://github.com/RebeccaFine/height-ec-depict

EC-DEPICT for BMI and WHR: https://github.com/RebeccaFine/obesity-ec-depict

Acknowledgements

This research has been conducted using the UK Biobank resource. This work was supported by NHGRI NIH F31HG009850 (R.S.F.), NIDDK NIH R01DK075787 (J.N.H.), the Lundbeck Foundation R190-2014-3904 (T.H.P.), and the Novo Nordisk Foundation NNF18CC0034900 (T.H.P.).

Chapter 3:

Benchmarker: an unbiased, association-data-driven strategy to evaluate gene prioritization algorithms

Rebecca S. Fine,1-4 Tune H. Pers,5,6 Tiffany Amariuta,1,3,7-11 Soumya Raychaudhuri,3,7-10,12 & Joel N. Hirschhorn1-3,13

1 Department of Genetics, Harvard Medical School, Boston, MA 02115, USA. 2 Division of Endocrinology and Center for Basic and Translational Obesity Research, Boston Children's Hospital, Boston, MA 02115, USA. 3 Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA. 4 Ph.D. Program in Biological and Biomedical Sciences, Graduate School of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA. 5 The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark. 6 Department of Epidemiology Research, Statens Serum Institut, 2300 Copenhagen, Denmark. 7 Center for Data Sciences, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA. 8 Division of Genetics, Brigham and Women’s Hospital, Boston, MA 02115 USA. 9 Division of Rheumatology, Immunology, and Allergy, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA. 10 Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA. 11 Ph.D. Program in Bioinformatics and Integrative Genomics, Graduate School of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA. 12 Arthritis Research UK Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester Academic Health Science Centre, The University of Manchester, Manchester M13 9PL, UK. 13 Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA. Correspondence: Joel N. Hirschhorn

This chapter originally appeared in:

Fine, R.S., Pers, T.H., Amariuta, T., Raychaudhuri, S., Hirschhorn, J.N. (2019). Benchmarker: an unbiased, association-data-driven strategy to evaluate gene prioritization algorithms. American Journal of Human Genetics 104(6), 1025-1039.

Abstract

Genome-wide association studies (GWAS) are valuable for understanding human biology, but associated loci typically contain multiple associated variants and genes. Thus, algorithms that prioritize likely causal genes and variants for a given phenotype can provide biological interpretations of association data. However, a critical, currently missing capability is to objectively compare performance of such algorithms. Typical comparisons rely on “gold standard” genes harboring causal coding variants, but such gold standards may be biased and incomplete. To address this issue, we developed Benchmarker, an unbiased, data-driven benchmarking method that compares performance of similarity-based prioritization strategies to each other (and to random chance) by leave-one-chromosome-out cross-validation with stratified linkage disequilibrium (LD) score regression. We first applied Benchmarker to twenty well-powered GWAS and compared gene prioritization based on strategies employing three different data sources, including annotated gene sets and gene expression; genes prioritized based on gene sets had higher per-SNP heritability than those prioritized based on gene expression. Additionally, in a direct comparison of three methods, DEPICT and MAGMA outperformed NetWAS. We also evaluated combinations of methods; our results indicated that combining data sources and algorithms can help prioritize higher-quality genes for follow-up. Benchmarker provides an unbiased approach to evaluate any similarity-based method that provides genome-wide prioritization of genes, variants, or gene sets, and can determine the best such method for any particular GWAS. Our method addresses an important unmet need for rigorous tool assessment and can assist in mapping genetic associations to causal function.


Introduction

Genome-wide association studies (GWAS) have successfully identified thousands of loci genetically associated with a wide range of human diseases and traits2. However, determining the causal variants and genes within these loci remains challenging: the true identities of the causal variants are often obfuscated by linkage disequilibrium (LD) between neighboring variants, and assigning noncoding variants to the genes they regulate has proven difficult. To address these issues, numerous types of algorithms to prioritize the most likely causal variants and genes have been developed5,55,71,165–168. Many of these algorithms are based on a simple intuition: when all potentially causal genes or variants are pooled, we expect that those that are actually causal should share more genomic features and/or annotations in common with other causal genes and variants than with non-causal genes and variants. In other words, genomic features or annotations that are shared more strongly than expected by chance in the pool of potential causal genes can be used to prioritize genes and variants. For example, algorithms have been developed that prioritize genes or variants that share similar profiles within gene sets40, protein-protein interaction (PPI) or coexpression networks71,73–76,79,83,167,169, PubMed abstracts70, and regulatory features such as DNase hypersensitivity sites170.

Gene prioritization is a critical step in translating genetic discoveries into biological insights, so many methods for gene prioritization have been developed. However, it is not straightforward to compare, or “benchmark,” the performance of these methods and assess which of them produces the most accurate results. Most published prioritization algorithms contain a validation component, but each study takes its own approach to do this, making comparison between algorithms difficult. A common benchmarking approach is to use “gold standard” genes (i.e. genes with a known link to the trait of interest)40,63,87 to calculate a receiver operating characteristic or similar metric. Unfortunately, this strategy relies heavily on prior knowledge of disease etiology and is biased toward well-studied genes in well-characterized biological pathways. In fact, using gold standard genes may actually penalize a method that successfully discovers novel biology (and of course the accuracy of this strategy will suffer if any of the genes classified as “gold standards” turn out not to be truly causal). Another common approach is prospective validation, using a newer (and generally better-powered) data set to benchmark the prioritization results from an older one. One study conducted a large benchmarking effort with this strategy by collecting 42 novel trait-gene associations over six months to use for benchmarking purposes89. This methodology, however, requires the existence of multiple independent well-powered GWAS studies, which may not always be available. Another study used Gene Ontology (GO) annotations120 and the FunCoup network85 to benchmark several network-based prioritization strategies using cross-validation84. However, this strategy assumes that a method’s ability to use PPI network connectivity to recover withheld members of a GO gene set is equivalent to its ability to use that connectivity to prioritize causal genes from a GWAS, which may not be the case. Causal genes for a trait likely relate to one another in ways more complex than membership in one gene set, so this analysis may measure the relationship between GO and the FunCoup network rather than the effectiveness of the prioritization methods per se.

An ideal strategy would combine the best features of the previously described large-scale benchmarking efforts: (1) cross-validation and (2) the use of GWAS data (rather than external sources of information that can be biased or in many cases nonexistent) as a benchmark. To that end, we propose a “leave-one-chromosome-out” strategy for benchmarking, in which the full set of GWAS data is used for both prediction and validation. Specifically, to benchmark one or more similarity-based prioritization methods, we use those methods to prioritize genes on each chromosome in turn, using GWAS data for all the other chromosomes. Next, for each method being benchmarked, we assemble all of the prioritized genes on each chromosome into a single group. Finally, we apply stratified LD score regression53 to each group of prioritized genes to determine whether the prioritized group significantly contributes to trait heritability. To compare multiple methods within a given trait, we compare the contribution to trait heritability by genes prioritized by each method. In this way, the GWAS data itself serves as its own control, without the need for incorporating additional data sources, and the use of the leave-one-chromosome-out approach prevents overfitting because association signals are not correlated across chromosomes. This strategy is highly generalizable because it can be applied to any method that prioritizes genes or variants based on their similarity to each other with respect to some feature(s) of interest (e.g. similar patterns of gene set membership, similar epigenetic marks).


Materials and Methods

We first describe the GWAS data we have used to test our method. Then, we describe our approach, which we refer to as Benchmarker; we include an overview of stratified LD score regression. Finally, we discuss the specific prioritization approaches tested.

GWAS data

We obtained GWAS summary statistics from publicly available resources (Table S3.1). Specifically, we used published summary statistics for height69, schizophrenia171, inflammatory bowel disease (IBD)172, and several lipid measures (low-density lipoprotein [LDL] cholesterol, high-density lipoprotein [HDL] cholesterol, triglyceride level, and total cholesterol)173. We also used published UK Biobank summary statistics for body-mass index (BMI), waist-hip ratio adjusted for BMI (WHRadjBMI), skin pigment, red blood cell count, white blood cell count, diastolic and systolic blood pressure, years of education, smoking status, diagnosis of allergy or eczema, age of menarche, and age of menopause174. The UK Biobank GWAS each consist of an average of 448,690 European samples analyzed with BOLT-LMM (with the exception of menarche and menopause, which comprise 242,278 and 143,025 female-only samples, respectively).

Benchmarker approach

To evaluate a given prioritization method, we assume that variants near and within the set of truly causal genes will be, on average, enriched for heritability. We first apply the prioritization method of interest to a GWAS from which one chromosome has been removed (Figure 3.1). Methods compatible with Benchmarker share the following general framework: (1) they take as input a set S of trait-associated genes (or variants), where all GWAS data from one or more chromosomes have been withheld, and (2) they rank each gene/variant in the genome, including those on the withheld chromosome(s), such that genes/variants similar to those in S rank highly. Based on an iterative implementation of this basic strategy, withholding each chromosome in turn, a Benchmarker-compatible method can produce a similarity-based ranking of all genes or variants in the genome. We consider the top-ranked 10% of genes from the withheld chromosome to be “prioritized.” We repeat this for all 22 autosomal chromosomes and combine all prioritized genes together, which represents the set to be tested.
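The leave-one-chromosome-out loop described above can be sketched as follows. This is a minimal illustration, not the published implementation: the function name, the `prioritize` callable (a stand-in for any compatible method such as DEPICT), and the rounding rule used to pick the top 10% are all assumptions made for the example.

```python
from collections import defaultdict


def loco_prioritized_genes(gene_chrom, prioritize, chromosomes=range(1, 23),
                           top_fraction=0.10):
    """Return the union of top-ranked genes from each withheld chromosome.

    gene_chrom: dict mapping gene name -> chromosome number.
    prioritize: hypothetical callable; given the withheld chromosome, it
        returns a dict mapping every gene -> prioritization p-value computed
        from GWAS data on the remaining chromosomes only.
    """
    genes_on = defaultdict(list)
    for gene, chrom in gene_chrom.items():
        genes_on[chrom].append(gene)

    prioritized = set()
    for chrom in chromosomes:
        p_values = prioritize(chrom)
        # Rank only genes on the withheld chromosome, smallest p-value first.
        ranked = sorted(genes_on[chrom], key=lambda g: p_values[g])
        # Rounding to the nearest integer is an assumption of this sketch.
        n_top = round(top_fraction * len(ranked))
        prioritized.update(ranked[:n_top])
    return prioritized
```

The returned set corresponds to the combined group of prioritized genes that is subsequently converted to a SNP annotation and tested.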

[Figure 3.1 panels: (1) remove chromosome 1 from the full GWAS summary statistics; (2) prioritize genes on chromosome 1 (e.g. with DEPICT) using the GWAS for chromosomes 2-22, yielding a ranked gene prioritization list for chromosome 1 genes only; (3) annotate SNPs within +/- 50 kb of the top-ranked 10% of genes as “prioritized”; (4) repeat for each chromosome and combine results; (5) run S-LDSC on the combined annotation and the original GWAS to estimate the per-SNP heritability coefficient (τ) and its z-score.]

Figure 3.1. Schematic of the Benchmarker strategy.


Comparing methods

To evaluate method performance, we turned to a well-established method: stratified LD score regression53. LD score regression is based on the intuition that SNPs with more LD to other SNPs are more likely to tag a truly causal variant (and therefore to have a higher χ² statistic)54. An “LD score” for each SNP is calculated by summing its squared Pearson correlation coefficient (r²) with all nearby SNPs. Specifically, the LD score of index SNP j is given by:

$$\text{LD score of index SNP } j = \sum_{k} r_{j,k}^{2}$$

where $r_{j,k}$ is the correlation between SNPs j and k calculated from a reference panel. The slope of the relationship between LD scores and observed chi-square values, computed in a weighted regression, provides a reliable estimate of overall heritability for a given trait53.
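As a small illustration of the LD score definition, the sketch below sums squared correlations over a toy correlation matrix. It also includes the category-restricted sum that underlies the stratified variant of the method. The function names are hypothetical, and the sketch assumes a precomputed correlation matrix for the SNPs in one window; it does not reproduce the windowing or the r² estimator used by the actual LD score software.

```python
def ld_scores(r):
    """LD score of each SNP j: sum over k of r[j][k] squared.

    r: square list-of-lists of pairwise correlations between SNPs in a
       window (diagonal entries equal to 1).
    """
    return [sum(r_jk ** 2 for r_jk in row) for row in r]


def stratified_ld_scores(r, in_category):
    """As ld_scores, but summing only over SNPs k that belong to category C.

    in_category: list of booleans, one per SNP, marking membership in C.
    """
    return [
        sum(r_jk ** 2 for r_jk, member in zip(row, in_category) if member)
        for row in r
    ]
```

For three SNPs with correlations r(1,2) = 0.5, r(2,3) = 0.2, and r(1,3) = 0, the middle SNP has LD score 0.25 + 1 + 0.04 = 1.29.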

Stratified LD score regression (S-LDSC) is an extension of LD score regression that allows for the estimation of the heritability explained by a particular genomic annotation (e.g. coding SNPs, SNPs predicted to be enhancers)53. In stratified LD score regression, LD scores are calculated as described above, but only consider the LD of the index SNP with other SNPs in the category of interest C (rather than across all SNPs). In our case, C represents the set of prioritized SNPs (or SNPs in prioritized genes). For robust and more accurate estimation of the heritability captured by an annotation of interest, it is recommended that a set of annotations of known genomic importance be included in the S-LDSC regression model as conditional covariates. These 53 annotations are referred to as the “baseline model”53 and include annotations such as sequence characteristics (e.g. exon, intron) and cell-type-nonspecific regulatory marks (e.g. histone modifications). We therefore included the baseline model in each iteration of S-LDSC, as well as an additional category for SNPs that lie within 50 kb of any gene in our prioritization method as a control. Our reference SNPs for European LD score estimation were the set of 9,997,231 SNPs with a minor allele count ≥ 5 from 489 unrelated European individuals in Phase 3 of 1000 Genomes175. Heritability was partitioned for the set of 5,961,159 SNPs with MAF ≥ 0.05; regression coefficient estimation was performed with 1,217,312 HapMap3 SNPs (SNPs in HapMap3 are used because they are generally well-imputed).

To evaluate model performance, we focused on two metrics: the regression coefficient τ and its p-value (derived from a block jackknife) for our annotation. τ measures the average per-SNP contribution of the annotation to heritability after accounting for the other categories in the model. We note that, since we used the top 10% of genes for each method, the number of prioritized SNPs for each trait will vary mainly by the average gene length of prioritized genes. To make τ comparable across traits, we normalized by the average per-SNP heritability for each trait (i.e. we divided each estimate by the total trait heritability / the number of SNPs used to compute total trait heritability); we refer to this as “normalized τ”. In previous work176, normalized τ may be multiplied by the standard deviation of the annotation; this quantity, τ*, measures the increase in per-SNP heritability per standard deviation increase in the annotation value. For a binary annotation, the standard deviation corresponds directly to the proportion of SNPs assigned to that annotation. For our purposes, we were less interested in the change in heritability associated with a one-standard-deviation increase in the annotation value (which, for a binary annotation, does not have an intuitive interpretation) and more interested in the change in heritability associated with the value of the annotation increasing from 0 to 1, which is captured by normalized τ rather than τ*. We therefore computed normalized τ as an evaluation metric throughout our analyses (this normalization has also been used in Finucane et al. 201890).
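The normalization described above amounts to simple arithmetic, sketched here with hypothetical function and argument names. The standard deviation of a binary annotation covering a proportion p of SNPs is sqrt(p(1 - p)), which is the only ingredient not spelled out in the text.

```python
import math


def normalized_tau(tau, total_h2, n_snps):
    """Divide the S-LDSC coefficient by the average per-SNP heritability.

    tau: regression coefficient for the annotation.
    total_h2: total trait heritability.
    n_snps: number of SNPs used to compute total_h2.
    """
    return tau / (total_h2 / n_snps)


def tau_star(tau, total_h2, n_snps, proportion):
    """Normalized coefficient scaled by the annotation's standard deviation.

    proportion: fraction of SNPs assigned to a binary annotation.
    """
    sd = math.sqrt(proportion * (1.0 - proportion))
    return normalized_tau(tau, total_h2, n_snps) * sd
```

With purely illustrative inputs (a coefficient of 2e-8, heritability 0.5, and 5,000,000 SNPs), the average per-SNP heritability is 1e-7 and the normalized coefficient is 0.2.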

To compute p-values for pairwise comparisons of prioritization methods, we used a block jackknife to compute the standard error around the difference between two τ estimates within a trait (i.e. τA − τB). We also calculated random-effects meta-analysis p-values from the normalized τ values to determine whether one annotation generally outperformed another across multiple traits, using the R package rmeta. To do this, we selected only one GWAS from groups of obviously overlapping sets of traits; specifically, from the lipid trait group we retained LDL cholesterol and excluded HDL cholesterol, triglycerides, and total cholesterol, and from the blood pressure traits we included systolic blood pressure and excluded diastolic blood pressure.
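A delete-one block jackknife for the difference between two coefficients can be sketched as below. This is a simplified illustration that assumes equally sized blocks and that the per-block delete-one estimates for both annotations are already available; the function names are hypothetical, and the sketch is not the S-LDSC code itself.

```python
import math


def jackknife_se(delete_one_estimates):
    """Standard error from delete-one block jackknife estimates."""
    n = len(delete_one_estimates)
    mean = sum(delete_one_estimates) / n
    squared_dev = sum((x - mean) ** 2 for x in delete_one_estimates)
    return math.sqrt((n - 1) / n * squared_dev)


def difference_test(tau_a, tau_b, delete_one_a, delete_one_b):
    """z-score and two-sided p-value for tau_a - tau_b.

    delete_one_a, delete_one_b: estimates of each coefficient with the same
        block removed, in the same block order.
    """
    diffs = [a - b for a, b in zip(delete_one_a, delete_one_b)]
    z = (tau_a - tau_b) / jackknife_se(diffs)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided
```

Jackknifing the difference directly, rather than combining two separate standard errors, accounts for the correlation between the two estimates, which are fit to the same GWAS.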

We also report overall meta-analyzed τ along with standard error and p-value estimates from each analysis (Table S3.2). Results were visualized in R version 3.5, using ggplot2177 and ComplexHeatmap126.

Choice of parameters

Benchmarker relies on two major parameters: (1) the percentage cutoff to use for top-ranked prioritized genes and (2) the size of the window used to map SNPs to genes. We used 10% and 50 kb, respectively, but our results do not differ substantially with respect to comparative method performance with alternative values of 5%, 15%, 25 kb, and 100 kb (Figure S3.1). There is also precedent in the literature for both of these parameters90,128; we chose 10% largely to strike a balance between minimizing standard error and obtaining a number of genes that would not be too large to be useful in prioritization. However, users can vary these parameters using the provided scripts.

Assessment of type I error

For each SNP-based type I error simulation, we randomly selected 10% of the SNPs that were within +/- 50 kb of any gene on each chromosome. These SNPs were then used as an annotation for input to S-LDSC. We tested each simulation against summary statistics from 10 different well-powered GWAS (for a total of 10,000 runs: 1,000 null simulations each using 10 GWAS), including BMI, diastolic blood pressure, age of menopause, height, schizophrenia, years of education, age of menarche, IBD, total cholesterol, and allergy/eczema. For the gene-based type I error simulations, we randomly selected 10% of genes and prioritized all SNPs within +/- 50 kb of those genes; these were then tested the same way as in the SNP-based type I error analysis (10,000 runs: 1,000 null simulations each tested with 10 GWAS).
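One gene-based null simulation can be sketched as follows: sample a random 10% of genes, then flag every SNP within the +/- 50 kb window of a sampled gene. The data layout (tuples of chromosome and coordinates), the function names, and the rounding rule are assumptions for illustration only.

```python
import random


def snps_near_genes(snps, genes, window=50_000):
    """Flag SNPs lying within `window` bp of any gene.

    snps: list of (chromosome, position) tuples.
    genes: list of (chromosome, start, end) tuples.
    """
    return [
        any(g_chrom == chrom and start - window <= pos <= end + window
            for g_chrom, start, end in genes)
        for chrom, pos in snps
    ]


def null_gene_annotation(snps, genes, fraction=0.10, seed=None):
    """Binary annotation built from a random `fraction` of genes."""
    rng = random.Random(seed)
    n_sampled = round(fraction * len(genes))  # rounding is an assumption
    sampled = rng.sample(genes, n_sampled)
    return snps_near_genes(snps, sampled, window=50_000)
```

Each such annotation would then be supplied to S-LDSC in place of a real prioritized annotation.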

Prioritization methods

We began by evaluating DEPICT (release 194)40, a method for gene prioritization, gene set enrichment analysis, and tissue enrichment analysis. DEPICT’s primary innovation is the use of “reconstituted” gene sets, which consist of 14,462 gene sets downloaded from multiple databases that have been extended based on 77,840 publicly available expression microarrays41. The reconstituted gene sets contain z-scores for each gene in the genome for each of the 14,462 gene sets, representing how strongly each gene is predicted to be a member of each gene set. DEPICT’s gene prioritization algorithm involves (1) identifying all genes in trait-associated loci (referred to as S), which in DEPICT are defined as all genes overlapping any SNP with r² > 0.5 to an index variant and (2) for each of the genes identified in S, assessing its correlation with the rest of the genes in S across the reconstituted gene sets. The stronger the overall correlation a gene has with the rest of the genes in S, the more highly it will be prioritized. We adapted this method for Benchmarker by forcing the prioritization to be genome-wide rather than across only the genes in S (that is, each gene in the genome is compared to the genes in S across the reconstituted gene sets). DEPICT requires a GWAS p-value threshold to define “trait-associated loci.” We used p < 1 × 10⁻⁵ for our analyses here, except for a few GWAS for which this threshold caused DEPICT to exceed its maximum number of loci. For these GWAS (height, BMI, WHRadjBMI, red blood cell count, white blood cell count, diastolic blood pressure, and systolic blood pressure), we used a cutoff of p < 5 × 10⁻⁸.
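The intuition behind the genome-wide similarity scoring can be illustrated with the sketch below, which scores every gene by its average Pearson correlation with the genes in S across a matrix of z-score profiles. This is a deliberately simplified stand-in: the use of a plain mean correlation and all names are assumptions, and DEPICT's actual scoring statistic and its null adjustments are not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def similarity_scores(profiles, trait_genes):
    """Score every gene by mean correlation with the genes in S.

    profiles: dict gene -> list of z-scores (one per gene set or tissue).
    trait_genes: genes in trait-associated loci (the set S).
    """
    scores = {}
    for gene, profile in profiles.items():
        # A gene in S is compared only with the *rest* of S.
        others = [g for g in trait_genes if g != gene]
        scores[gene] = sum(pearson(profile, profiles[g]) for g in others) / len(others)
    return scores
```

Swapping the gene × gene set matrix for a gene × tissue expression matrix changes only the contents of `profiles`, which is the substitution examined in the comparisons that follow.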

As described, DEPICT’s default behavior is to prioritize genes based on correlation across the reconstituted gene sets. The implicit assumption in this method is that the genes most likely to be truly causal are the ones with the most similar profile across these gene sets. As a first question for Benchmarker, we asked: how would correlating across tissue expression (rather than gene set membership) fare in comparison? To answer this question, we applied DEPICT’s prioritization algorithm, exchanging the matrix of z-scores of reconstituted gene sets for a matrix of z-scores of expression data (gene expression for each gene across a range of tissues). We used two different expression data sources. The first was the matrix DEPICT typically uses to perform tissue enrichment analysis, which was derived from 37,427 publicly available human microarrays representing 209 different tissues (based on Medical Subject Heading annotations). We note that this is a subset of the microarray data from the Gene Expression Omnibus (GEO)178 used in the process of “reconstituting” the gene sets. The second source of expression data was a matrix based on RNA sequencing data from the Genotype Tissue Expression project (GTEx, v6)97, including 53 human tissues with an average of 161.32 samples per tissue (processed as in Finucane et al. 201890). For each tissue, we calculated the mean expression across all samples. To make the expression data as comparable as possible, we normalized the GTEx expression matrix in the same way as the GEO matrix40: z-score normalizing across all tissues, then across genes. For all DEPICT analyses, we considered the top-ranked 10% of genes “prioritized.” All analyses were done on a set of 16,876 genes that were (1) outside the major histocompatibility complex region (chromosome 6: 25-35 Mb in hg19 genome build) and (2) present in both the DEPICT and GTEx data.


The second method we evaluated was MAGMA (v1.06b)30. MAGMA works in two steps. First, a gene-based p-value is computed as the mean association of SNPs in the gene, corrected for LD. Then, competitive gene set and/or continuous covariate p-values are calculated based on the association of the gene-based p-values with the category of interest. We ran MAGMA with default parameters. For gene set enrichment analysis, we treated the reconstituted gene sets as a continuous covariate and calculated one-tailed p-values (alternative hypothesis = genes with high gene set membership z-scores have a stronger trait association than those with low gene set membership z-scores).

Comparing DEPICT to MAGMA presents an immediate challenge: MAGMA does not explicitly prioritize genes based on the gene set enrichment analysis, as DEPICT does. Therefore, we needed to establish a framework for deriving gene prioritization from gene set enrichment analysis results (Figure S3.2). Specifically, we first generated gene-based p-values from MAGMA. Then, we removed all genes from each chromosome in turn and applied the gene set enrichment analysis function to the remaining genes. From the gene set enrichment analysis results, we reasoned that we could prioritize genes that were members of the most highly enriched gene sets. However, the reconstituted gene sets used for the gene set enrichment do not have “members” per se; rather, they have z-scores for gene set membership prediction. To address this issue, we created several binarized forms of the gene sets in which we used z-score cutoffs (Z > 1.96, 2.58, or 3.29, which correspond to 2-tailed p-values of 0.05, 0.01, and 0.001, respectively) or rankings (top 50, 100, or 200 genes per gene set) to define the gene set “members” (Figure S3.2, Figure S3.3). (For the Z > 3.29 condition, we removed 14 gene sets that contained fewer than 10 genes.) Then, for each withhold-one-chromosome gene set enrichment analysis result, we (1) ranked the gene sets and (2) for each gene set, annotated each member gene on the withheld chromosome as “prioritized” until we reached 10% of genes on the withheld chromosome (where gene set members were based on one of the six versions of the binarized gene sets). For the purposes of comparison, we applied DEPICT in exactly the same way, performing the prioritization based directly on the leave-one-chromosome-out gene set enrichment results and the binarized gene sets. We note that this basic strategy is itself a useful technique for converting the results of any gene set enrichment analysis to gene prioritization; this may be helpful in generalizing the types of data that can be used with Benchmarker.
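The conversion from gene set enrichment results to gene prioritization can be sketched in two steps: binarize each gene set from its membership z-scores, then walk down the ranked gene sets collecting member genes on the withheld chromosome until the 10% quota is filled. The function names, the within-set ordering (by descending z-score), and the rounding of the quota are assumptions of this sketch rather than details given in the text.

```python
def binarize(z_scores, cutoff=None, top_n=None):
    """Define gene set 'members' by a z-score cutoff or a top-N ranking.

    z_scores: dict gene -> membership z-score for one gene set.
    Returns member genes ordered by descending z-score.
    """
    ranked = sorted(z_scores, key=z_scores.get, reverse=True)
    if cutoff is not None:
        return [gene for gene in ranked if z_scores[gene] > cutoff]
    return ranked[:top_n]


def prioritize_from_gene_sets(ranked_gene_sets, members, withheld_genes,
                              fraction=0.10):
    """Collect member genes on the withheld chromosome, best gene set first.

    ranked_gene_sets: gene set names, most enriched first.
    members: dict gene set name -> list of member genes (binarized).
    withheld_genes: all genes on the withheld chromosome.
    """
    quota = round(fraction * len(withheld_genes))
    withheld = set(withheld_genes)
    prioritized = []
    for gene_set in ranked_gene_sets:
        for gene in members[gene_set]:
            if len(prioritized) >= quota:
                return prioritized
            if gene in withheld and gene not in prioritized:
                prioritized.append(gene)
    return prioritized
```

Because the procedure needs only a ranked list of gene sets and a membership definition, it can be applied to the output of any enrichment method.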

The third method we evaluated was NetWAS76,179, which prioritizes genes based on PPI network connectivity. One hundred and forty-four tissue-specific PPI networks are available for use with the algorithm, in addition to a tissue-naïve global network. NetWAS takes gene-level p-values from a GWAS as input and uses genes below a specified p-value threshold (e.g. p < 0.01) as “positive examples” of trait association. The positive examples are used in a support vector machine classifier, which learns the patterns of network connectivity in a user-specified tissue and uses them to reprioritize all genes in the genome.

For our NetWAS analysis, we first generated gene-level p-values for each of our twenty GWAS with MAGMA. We removed all genes that were not present in the previously defined DEPICT-GTEx overlapping set of 16,876 genes. The recommended p-value threshold cutoff for setting “positive examples” is nominal significance (p < 0.01)76,179. However, our GWAS are so well-powered that using p < 0.01 could define too many genes as positive examples for the method to be successful. We therefore tested three different p-value cutoffs for each trait: p < 0.01, p < 0.0001, and a Bonferroni-corrected threshold for the number of genes tested (roughly p < 3 × 10⁻⁶, differing slightly from trait to trait). We tested these three thresholds with the “global” (i.e. tissue-nonspecific) PPI network for all twenty GWAS. Then, for nine of the GWAS, we also tested one to four relevant tissue-specific networks, chosen based on a combination of DEPICT tissue enrichment results and published S-LDSC analyses90, and compared their performance to the global network. For each trait, we used the p-value threshold that was most successful from the global analyses.

eQTL analyses

We used data from a recent analysis of blood expression quantitative trait loci (eQTLs) from 31,684 individuals180, including all significant cis-eQTLs (a total of 3,699,823 SNPs regulating 16,989 genes). For all S-LDSC eQTL analyses, an additional control annotation was included that consisted of either (1) all annotated significant cis-eQTLs (for the analysis in which we split all prioritized SNPs into eQTLs and non-eQTLs) or (2) all annotated cis-eQTLs regulating at least one gene in our dataset (for the analysis in which we mapped each prioritized gene to eQTLs found to regulate it).

Nearest-gene analysis

For each GWAS, we used the default clumping and loci generation procedure from DEPICT to define genes in significant loci (which we refer to as S), as well as the closest gene to each index SNP. The p-value threshold for “significant” loci was defined as described above for the DEPICT analyses (p < 1 × 10⁻⁵ or p < 5 × 10⁻⁸, depending on the number of loci for the trait). For each of our seven analyses (comparison of DEPICT-gene-sets/DEPICT-GEO/DEPICT-GTEx and comparison of DEPICT and MAGMA for all six binarizations), we then restricted genes prioritized by at least one method to those in S (which we refer to as S-prioritized) and did the same for the intersect (S-intersect) and the outersect (S-outersect) (i.e. S-prioritized is the union of S-intersect and S-outersect). We performed three Fisher’s exact tests for each trait, comparing the fraction of nearest-genes in S-prioritized, S-intersect, and S-outersect to the fraction of nearest-genes in S overall.


Results

The Benchmarker framework is outlined in Figure 3.1. First, we remove one chromosome from a set of GWAS summary statistics. Then, we apply a gene prioritization method of interest to this partial GWAS; this produces prioritization p-values for each gene in the genome, including on the withheld chromosome. We rank the genes on the withheld chromosome by prioritization p-value and take the top 10% as “prioritized.” Then, we annotate all SNPs within +/- 50 kb of these genes as prioritized SNPs. We repeat this process for each chromosome, successively withholding each chromosome and annotating prioritized SNPs. We then combine all the prioritized SNPs for each chromosome into a single “prioritized” annotation. Finally, we apply stratified LD score regression (S-LDSC), which produces an estimate of the average per-SNP heritability (τ) of the prioritized annotation. For each of our applications, we tested GWAS of 20 different traits; to improve comparability across traits, we normalize our estimates of τ by the average genome-wide per-SNP heritability of each trait. We note that this method can easily be used for variant prioritization strategies in addition to gene prioritization strategies, with the only difference being that prioritized variants on the withheld chromosome can be directly annotated for stratified LD score regression without the step of mapping variants to genes.

We first assessed whether the type I error rate of our method was well-controlled by conducting 1,000 null simulations using randomly prioritized SNPs, each tested against ten different sets of actual GWAS summary statistics, for a total of 10,000 runs (see Methods). The coefficient (τ) z-scores were well-controlled, with a one-tailed type I error rate of 0.0456 (95% confidence interval = 0.0415, 0.050) at p = 0.05 (Figure S3.4a; we report one-tailed p-values because we were concerned about type I error that overestimates rather than underestimates heritability). This result indicates that S-LDSC is well-calibrated and correctly determines that a group of randomly selected SNPs does not significantly explain any heritability in any of the tested GWAS (i.e. in actual GWAS results). To more explicitly mimic our experimental setup, we also tested type I error by randomly sampling 10% of genes rather than variants (Figure S3.4b; again, 1,000 null simulations each tested against 10 different GWAS, for a total of 10,000 runs). Then, we annotated all SNPs within +/- 50 kb of these genes as “prioritized.” For these simulations, we observed a slightly conservative one-tailed type I error rate of 0.0441 (95% confidence interval = 0.040, 0.048) at p = 0.05; we therefore proceeded without additional correction.
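As a quick check on the scale of these intervals, the sketch below computes a normal-approximation (Wald) 95% confidence interval for a proportion estimated from 10,000 runs. The text does not state which interval method was used, so the choice of the Wald interval and the function name are assumptions; the reported intervals are numerically consistent with it.

```python
import math


def wald_interval(rate, n_runs, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(rate * (1.0 - rate) / n_runs)
    return rate - z * se, rate + z * se


snp_based = wald_interval(0.0456, 10_000)   # approximately (0.0415, 0.0497)
gene_based = wald_interval(0.0441, 10_000)  # approximately (0.0401, 0.0481)
```

In both cases the nominal rate of 0.05 lies at or just above the upper bound, in line with the description of the procedure as well-controlled to slightly conservative.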

Next, we applied Benchmarker to compare three different methods of running DEPICT’s prioritization algorithm: (1) prioritizing based on shared patterns of gene set membership (DEPICT’s standard approach, using a gene × gene set matrix of z-scores) and (2) prioritizing based on shared patterns of tissue expression, using either of two different gene × tissue expression matrices of z-scores (one microarray-based from GEO178 and one RNA-seq-based from GTEx97; see Methods). We refer to these approaches as DEPICT-gene-sets, DEPICT-GEO, and DEPICT-GTEx, respectively. We observed that the prioritized variants had coefficients significantly greater than zero for nearly all traits and all three input matrices, indicating that prioritization by DEPICT using any of these data sources captures meaningful additional heritability beyond the effects of the baseline model (Figure S3.5, Table S3.3). To more directly compare these annotations to each other, we performed a conditional analysis in which all three annotations were modeled jointly (i.e. in the same S-LDSC model). This indicated that over all traits, DEPICT-gene-sets performed better than both DEPICT-GEO (random-effects meta-analysis p-value calculated over 16 traits = 2.40 × 10⁻³; we excluded four phenotypically overlapping traits, see Methods) and DEPICT-GTEx (meta-analysis p-value = 8.00 × 10⁻⁶) (Table S3.4).
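The comparisons above are summarized with a random-effects meta-analysis across traits. The text does not name the estimator, so the sketch below uses the DerSimonian-Laird method as an assumed, commonly used choice; the function names and the example inputs (per-trait differences and their variances) are hypothetical.

```python
import math

def random_effects_meta(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate.

    effects:   per-trait effect estimates (e.g. a difference between annotations)
    variances: their sampling variances
    Returns (pooled estimate, standard error, between-trait variance tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical per-trait differences and variances, for illustration only.
pooled, se, tau2 = random_effects_meta([0.8, 0.3, 0.5, 1.1], [0.04, 0.05, 0.03, 0.06])
print(pooled, se, tau2, two_sided_p(pooled / se))
```

When the per-trait estimates agree closely, tau² is truncated to zero and the result coincides with a fixed-effect (inverse-variance) meta-analysis.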

All three of these methods use gene expression information as a data source, either explicitly (for the two gene × tissue expression matrices) or implicitly (for the gene × gene set matrix, where z-scores for gene set membership are derived from gene expression data). We therefore considered whether the three methods prioritize similar genes. However, when we compared the overlap of prioritized genes, we found that the approaches in fact prioritized relatively distinct groups of genes. Specifically, the average number of genes prioritized by all three methods for a given trait was 374.1; in contrast, the average numbers of genes prioritized by DEPICT-gene-sets only, DEPICT-GEO only, and DEPICT-GTEx only were 796.7, 732.4, and 757.7, respectively (Figure 3.2). Therefore, the use of three different data sources produced three substantially different groups of genes that each significantly contribute to heritability, suggesting that each set of prioritized genes contains different and useful information.
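The overlap categories described above (prioritized by one method only, by a given pair, or by all three) can be tabulated with plain set operations. This is a minimal sketch with made-up gene names; the function name and the method labels are illustrative and are not taken from the Benchmarker code.

```python
def overlap_categories(gene_sets, geo, gtex):
    """Group genes by the exact combination of methods that prioritized them.

    Returns {tuple_of_method_names: set_of_genes}; with three methods there are
    up to seven non-empty categories.
    """
    methods = {"gene-sets": set(gene_sets), "GEO": set(geo), "GTEx": set(gtex)}
    categories = {}
    for gene in set().union(*methods.values()):
        key = tuple(name for name, genes in methods.items() if gene in genes)
        categories.setdefault(key, set()).add(gene)
    return categories

# Toy example with hypothetical gene names.
cats = overlap_categories({"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"})
for methods_used, genes in sorted(cats.items()):
    print(methods_used, sorted(genes))
```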


[Figure 3.2 image: heat map. Traits shown: height, BMI, WHRadjBMI, skin pigment, red blood cell count, white blood cell count, diastolic blood pressure, systolic blood pressure, LDL cholesterol, HDL cholesterol, total cholesterol, triglycerides, type 2 diabetes, years of education, smoking status, schizophrenia, allergy or eczema, IBD, age of menarche, age of menopause. Color scale: Number of Genes (0 to 1600). Annotation bar: gene.sets, GEO, GTEx (yes/no).]

Figure 3.2. Overlap in prioritized genes for DEPICT-gene-sets, DEPICT-GEO, and DEPICT-GTEx. Columns of the heat map represent all possible categories of overlap, also illustrated by the Venn diagrams on top and the annotation bar below (e.g. prioritized by DEPICT-gene-sets only, prioritized by DEPICT-gene-sets and DEPICT-GEO, prioritized by all three methods, etc.). Darker blues indicate more genes in the category.

Because of this partial overlap in prioritized genes across the three methods, we tested whether we could combine results to create a smaller set of prioritized genes that still captured most of the heritability in the genes prioritized across different methods. Specifically, we created a new annotation consisting of genes prioritized by at least two of the three input matrices, which we refer to as the “intersect” set (average number of genes = 1203.05; average proportion of SNPs = 0.072) (Figure S3.6, Figure S3.7). We also created a separate annotation consisting of the remaining prioritized genes, i.e. genes prioritized by only one of those three methods; for simplicity, we refer to this group as the “outersect” set (average number of genes = 2286.80; average proportion of SNPs = 0.120). We first tested the performance of the “intersect” and “outersect” annotations in separate LD score regression models (Figure S3.5, Table S3.3) and found that across most traits, the intersect performed better than the outersect, in some cases significantly. We then directly compared the intersect and outersect in a joint S-LDSC model (Figure 3.3, Table S3.5). We observed a clear difference between the two annotations: across most traits, the intersect group of genes substantially outperformed the outersect (random-effects meta-analysis p-value for 16 traits = 1.39 × 10⁻⁴). These results indicate that the majority of the heritability explained by all prioritized genes could actually be localized to SNPs near or within the genes prioritized by more than one method. To analyze this further, we also tested intersect and outersect sets for additional combinations of the DEPICT implementations (Figure S3.8). All three methods showed the same pattern of intersect genes outperforming outersect genes. Additionally, consistent with the observation that DEPICT-gene-sets outperformed the other two DEPICT approaches, the intersect of genes prioritized by DEPICT-gene-sets and at least one other implementation performed best (in fact, somewhat better than the intersect containing genes prioritized by all three implementations).
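The definitions of the two annotations in this paragraph translate directly into a counting rule: a gene is in the “intersect” if at least two methods prioritized it and in the “outersect” if exactly one did. The sketch below illustrates that rule on toy data; the function name is hypothetical and the gene names are invented.

```python
from collections import Counter

def intersect_outersect(*prioritized_sets):
    """Split prioritized genes by how many methods selected them.

    intersect: genes prioritized by two or more of the input sets
    outersect: genes prioritized by exactly one input set
    """
    counts = Counter(gene for genes in prioritized_sets for gene in set(genes))
    intersect = {g for g, n in counts.items() if n >= 2}
    outersect = {g for g, n in counts.items() if n == 1}
    return intersect, outersect

# Toy example with hypothetical gene names for three methods.
inter, outer = intersect_outersect({"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"})
print(sorted(inter), sorted(outer))
```

With only two input sets, the same rule gives the two-method case used later for DEPICT and MAGMA: the intersect is the set of genes found by both, and the outersect is the union of the genes found by only one.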


[Figure 3.3 image: (A) one panel per trait (20 traits) comparing intersect and outersect; y-axis: Prioritization Effect Size (Normalized t). (B) intersect versus outersect, meta-analyzed; y-axis: Meta-Analyzed Prioritization Effect Size (Normalized t).]

Figure 3.3. Effect sizes (normalized t) for the joint LD score regression model comparing “intersect” and “outersect” genes for 20 different GWAS. Here, the “intersect” represents genes prioritized by at least two of (1) DEPICT-gene-sets, (2) DEPICT-GEO, and (3) DEPICT-GTEx. The “outersect” represents genes prioritized by only one of those three methods. Asterisks mark comparisons for which the difference between the intersect and outersect achieved nominal significance (p < 0.05). Error bars represent 95% confidence intervals. (a) Results for each trait; note that y-axis scales differ for each panel. (b) Results meta-analyzed over 16 traits.


We also noted that the intersect outperformed the outersect most strongly for the lipid traits (LDL and HDL cholesterol, total cholesterol, and triglycerides), immune traits (allergy/eczema and IBD), and height. In contrast, we observed that the most brain-related traits we tested (BMI, years of education, smoking status, schizophrenia, and age of menarche) failed to show a nominally significant difference between the intersect and outersect (p > 0.05) (meta-analysis p-value for brain-related traits = 0.210; meta-analysis p-value for all other traits = 3.72 × 10⁻⁶). (We consider these traits to be brain-related based on empirical evidence from both S-LDSC analyses on tissue-specific expression and DEPICT analyses on general tissue expression enrichment)27,53,90,181. This suggests that brain-related traits may not benefit as much as other traits from combining information across these particular data sources (i.e. the reconstituted gene sets and tissue expression matrices).

We next wanted to compare DEPICT with another popular gene set enrichment analysis algorithm, MAGMA30. However, MAGMA does not perform gene prioritization based on its gene set enrichment, so we needed a way to convert prioritized gene sets to prioritized genes. We accomplished this by (1) ranking the prioritized gene sets, (2) using binarized versions of the reconstituted gene sets to assign genes to gene sets, and (3) prioritizing genes in the most enriched gene sets on the withheld chromosome (Figure S3.2). For step 2, we created six different binarized versions of the reconstituted gene sets, three based on z-score (Z > 1.96, 2.58, or 3.29) and three based on ranking (top 50, 100, or 200 genes per gene set). For the purposes of comparison, we used the same strategy for DEPICT (i.e. using the binarized gene sets to identify and prioritize genes on the withheld chromosome within enriched gene sets). This basic approach can be used for any method that prioritizes genomic features (e.g. gene sets, tissue expression, epigenomic annotations) but does not necessarily explicitly prioritize genes or variants; it also illustrates that the Benchmarker strategy can be used to evaluate a wide variety of algorithm types.
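The three steps above can be illustrated with a small sketch. It binarizes a toy table of gene set z-scores either by threshold or by rank, then collects genes on a withheld chromosome from the top-ranked gene sets. All names and values are hypothetical, and the `n_sets` cutoff is an assumption: the text says only that genes in “the most enriched gene sets” were prioritized, without giving the stopping rule here.

```python
def binarize_by_z(scores, threshold):
    """scores: {gene_set: {gene: z}}. Keep genes whose z-score exceeds threshold."""
    return {gs: {g for g, z in genes.items() if z > threshold}
            for gs, genes in scores.items()}

def binarize_by_rank(scores, top_n):
    """Keep the top_n genes per gene set, ranked by z-score (highest first)."""
    return {gs: set(sorted(genes, key=genes.get, reverse=True)[:top_n])
            for gs, genes in scores.items()}

def prioritize_withheld(ranked_gene_sets, members, withheld_genes, n_sets):
    """Genes on the withheld chromosome that belong to the n_sets most enriched sets.

    ranked_gene_sets: gene set names ordered from most to least enriched
    members:          binarized membership, {gene_set: set_of_genes}
    n_sets:           illustrative cutoff (an assumption, not from the text)
    """
    prioritized = set()
    for gs in ranked_gene_sets[:n_sets]:
        prioritized |= members[gs] & withheld_genes
    return prioritized

# Toy z-score table with hypothetical gene sets and genes.
scores = {
    "setA": {"G1": 3.5, "G2": 2.0, "G3": 0.5},
    "setB": {"G1": 1.0, "G2": 2.7, "G4": 3.4},
}
by_z = binarize_by_z(scores, 2.58)
by_rank = binarize_by_rank(scores, 2)
hits = prioritize_withheld(["setB", "setA"], by_z, {"G2", "G3"}, n_sets=1)
print(by_z, by_rank, hits)
```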

DEPICT and MAGMA performed similarly: we observed no strongly significant differences between the two methods for any trait, whether modeled separately or jointly (Figure S3.9, Tables S3.6 and S3.7). However, as we observed for the comparison across different data sources, we again noted that DEPICT and MAGMA prioritized fairly different groups of genes (average number of genes prioritized by both methods = 931.18, average number of genes prioritized by one method only = 757.82) (Figure 3.4, Figures S3.10-S3.12). Using the same logic as before, we again asked whether the genes found by both methods (the “intersect”) outperformed the genes found by either method individually (the “outersect”). (Note that the outersect includes the union of genes prioritized by only DEPICT and only MAGMA, so it is on average slightly larger than the intersect [Figure S3.12].) Interestingly, we observed an even stronger trend toward the overlapping set outperforming the individual sets than for the analysis of different data sources, and for several traits this difference was nominally significant (Figure S3.9; Table S3.6). When we modeled the intersect and outersect jointly, this difference became even more apparent, with the intersecting group of genes outperforming the outersect for nearly every trait (Table 3.1, Figure 3.5, Table S3.8). The differences were particularly pronounced for immune traits (IBD, allergy/eczema) and total cholesterol (which, interestingly, also were some of the best-performing traits in the previous analysis). For these traits, not only did the intersect significantly outperform the outersect for every gene set binarization, but outersect per-SNP heritability often did not significantly differ from zero. This implies that the majority of the heritability originally explained by both sets of prioritized genes (i.e. the union of DEPICT and MAGMA) was captured by the intersect genes only, with almost none remaining in the outersect.


[Figure 3.4 image: violin plots overlaid with box plots. x-axis: Gene Set Binarization (Z > 1.96, Z > 2.58, Z > 3.29, Top 50 Genes, Top 100 Genes, Top 200 Genes); y-axis: Percentage of Overlapping Prioritized Genes (0% to 100%).]

Figure 3.4: Overlap in prioritized genes from DEPICT and MAGMA. For each version of the binarized gene sets (x-axis), the distribution of the percentage of overlapping genes across all 20 traits is represented as a violin plot overlaid with a box plot.

Table 3.1. Results for comparison of “intersect” and “outersect” genes for DEPICT and MAGMA. Each row includes data based on one of the six gene set binarizations. The columns represent the overall p-value for comparing t_intersect and t_outersect and results from the height OMIM enrichment analysis (odds ratio and p-value from Fisher’s exact test). OMIM = Online Mendelian Inheritance in Man.

Gene Set         Meta-analysis p-value    Height OMIM enrichment    Height OMIM enrichment
Binarization     for t_intersect          odds ratio (intersect     p-value (intersect
                 versus t_outersect       versus outersect)         versus outersect)

Top 50 Genes     8.546 × 10⁻⁷             2.531                     1.541 × 10⁻⁴
Top 100 Genes    1.558 × 10⁻⁵             2.421                     2.101 × 10⁻⁴
Top 200 Genes    1.084 × 10⁻⁴             1.364                     0.185
Z > 1.96         2.537 × 10⁻⁴             0.832                     0.437
Z > 2.58         8.489 × 10⁻⁵             1.298                     0.243
Z > 3.29         9.110 × 10⁻⁶             1.869                     6.180 × 10⁻³


[Figure 3.5 image: (A) one panel per trait (20 traits) comparing DEPICT_MAGMA_intersect and DEPICT_MAGMA_outersect for each gene set binarization (Z > 1.96, Z > 2.58, Z > 3.29, Top 50 Genes, Top 100 Genes, Top 200 Genes); y-axis: Prioritization Effect Size (Normalized t). (B) the same comparison, meta-analyzed; y-axis: Meta-Analyzed Prioritization Effect Size (Normalized t).]

Figure 3.5. Effect sizes (normalized t) for the joint LD score regression model comparing “intersect” and “outersect” genes for 20 different GWAS. Here, the “intersect” represents genes prioritized by both DEPICT and MAGMA. The “outersect” represents genes prioritized by only DEPICT or only MAGMA. Asterisks mark comparisons for which the difference between the intersect and outersect achieved nominal significance (p < 0.05). Error bars represent 95% confidence intervals. (a) Results for each trait; note that y-axis scales differ for each panel. (b) Results meta-analyzed over 16 traits.


We also observed that the top 50 genes, top 100 genes, and Z > 3.29 conditions showed the most significant difference between intersect and outersect performance (Table 3.1), and the Z > 3.29 intersect also had the highest overall per-SNP heritability (Table S3.2). These conditions also represent the gene sets with the smallest number of genes (Figure S3.3), suggesting that for our choice of method, more stringent cutoffs for gene set membership produce more power. Given our prioritization schema (Figure S3.2; see Methods), this is not surprising: when gene sets have a very large number of members, a few gene sets will dominate the gene prioritization results. In contrast, when smaller gene sets are used, more gene sets will end up contributing prioritized genes, likely beneficially increasing diversity.

We next performed some additional validation for these sets of prioritized genes. In the following sections, for brevity, we will refer to the DEPICT-gene-sets vs. DEPICT-GEO vs. DEPICT-GTEx analysis as “DEPICT-datatype” and the DEPICT vs. MAGMA analysis as “DEPICT-MAGMA.” We considered whether the intersects from these analyses are enriched for the genes nearest to GWAS index SNPs, as such genes are slightly more likely to be causal than the other genes in associated loci. We observed that for most traits, the set of prioritized genes was enriched for nearest-genes, and that the intersect was much more strongly enriched than the outersect (Table S3.9; see Methods). This suggests that at least some of the information captured by nearest-genes is also captured by the different prioritization methods and that this is largely driven by genes in the intersects.

In addition, to validate prioritized genes, we used a previously curated list of 277 genes from Online Mendelian Inheritance in Man (OMIM) known to cause disorders of skeletal growth69. We note that in general, we do not endorse gold standards, but we use them here as an orthogonal source of validation data for height, where there are a large number of well-established and well-validated Mendelian disease genes. We performed a Fisher’s exact test to determine whether the intersect genes for height were enriched for these genes relative to the outersect genes. We found that in the DEPICT-datatype analysis, the intersect was indeed enriched for OMIM genes (odds ratio = 2.004, p = 4.72 × 10⁻⁴). Similarly, for the DEPICT-MAGMA analysis, in the three versions of the binarized gene sets with the most significant meta-analysis p-value for intersect versus outersect, we also observed an enrichment of OMIM genes (Table 3.1).
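The enrichment test described above compares a 2 × 2 table of gene counts: intersect versus outersect, in the reference list versus not. The sketch below computes the sample odds ratio and a one-sided Fisher’s exact p-value from the hypergeometric distribution using only the standard library. Whether the reported test was one- or two-sided is not stated in the text, so the one-sided form is an assumption, and the counts used are invented for illustration.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

    a: intersect genes in the reference list   b: intersect genes not in it
    c: outersect genes in the reference list   d: outersect genes not in it
    Returns (sample odds ratio, P(X >= a) under the hypergeometric null).
    """
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    p = sum(comb(col1, x) * comb(total - col1, row1 - x)
            for x in range(a, min(row1, col1) + 1)) / denom
    return odds_ratio, p

# Hypothetical counts, not taken from the dissertation.
odds, p_value = fisher_exact_greater(10, 90, 5, 95)
print(odds, p_value)
```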

Our prioritization annotations, which encompass the entire gene body and a 50-kb window for each prioritized gene, will include a fair amount of noise, since many of the SNPs assigned to each gene will not be in LD with variants that have functional effects on that gene. Therefore, we next investigated whether we could improve our heritability enrichment at a variant level by incorporating expression quantitative trait loci (eQTL) information, using a recently published set of blood cis-eQTL data from more than 31,000 individuals180. Based on these data, we first performed S-LDSC on a single annotation consisting of all significant eQTL variants regulating at least one gene in our data set. We observed that this annotation generally explained a small but statistically significant amount of per-SNP heritability (Table S3.2, Figure S3.13, meta-analysis p-value = 1.02 × 10⁻³). Unsurprisingly, given that the eQTLs were derived from blood, the annotation was most significant for our three most immune-relevant traits: white blood cell count, IBD, and allergy/eczema. Furthermore, by separating out the subset of eQTL variants that are associated with expression of prioritized genes in our intersect sets and testing them jointly with the all-eQTL annotation, we found that the heritability explained for IBD and allergy/eczema by the all-eQTL annotation was largely driven specifically by this subset of eQTLs (Figure S3.13).


In general, prioritization of variants that were cis-eQTLs of our prioritized intersect genes (which can include SNPs up to 1 Mb away from a prioritized gene) resulted in worse performance than simply including all SNPs within 50 kb of prioritized genes, with the possible exception of the most brain-relevant traits (for DEPICT-datatype intersect, meta-analysis p-value = 0.0612; for DEPICT-MAGMA intersect with Z > 3.29, p = 3.63 × 10⁻⁴; Figure S3.14). We also performed an additional analysis in which we split our intersect sets of SNPs (i.e. all SNPs within 50 kb of each prioritized gene) into two categories each: eQTL (i.e. listed as a significant cis-eQTL for any gene in the genome) and non-eQTL. For many, though not all, traits, these eQTL variants performed better than the non-eQTL variants in a joint model (for DEPICT-datatype intersect, p = 0.0212; for DEPICT-MAGMA intersect, p = 4.24 × 10⁻⁴; Figure S3.15). Thus, eQTL information likely helps with prioritization, but the prioritization may be particularly helpful at the variant level rather than at the gene level. Indeed, the overall meta-analyzed normalized t values for eQTLs within or near prioritized genes from our two intersect sets were the highest observed for any of our analyses (Table S3.2).
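The variant-level split described above has two parts: select SNPs that fall within the gene body or a 50-kb window around a prioritized gene, then partition them by whether they appear in an eQTL list. The sketch below shows both steps on toy coordinates; the data structures, function names, and identifiers are hypothetical, and a real analysis over genome-wide SNPs would use an interval index rather than this nested loop.

```python
def snps_near_genes(snps, genes, window=50_000):
    """Select SNPs inside a gene body or within `window` bp of it.

    snps:  {snp_id: (chromosome, position)}
    genes: {gene_name: (chromosome, start, end)}
    """
    selected = set()
    for snp_id, (chrom, pos) in snps.items():
        for g_chrom, start, end in genes.values():
            if chrom == g_chrom and start - window <= pos <= end + window:
                selected.add(snp_id)
                break
    return selected

def split_by_eqtl(selected_snps, eqtl_snps):
    """Partition selected SNPs into (eQTL, non-eQTL) sets."""
    eqtl = selected_snps & eqtl_snps
    return eqtl, selected_snps - eqtl

# Toy coordinates and identifiers, for illustration only.
genes = {"GENE1": ("1", 100_000, 120_000)}
snps = {"rs1": ("1", 60_000), "rs2": ("1", 171_000),
        "rs3": ("2", 110_000), "rs4": ("1", 110_000)}
near = snps_near_genes(snps, genes)
eqtl_set, non_eqtl_set = split_by_eqtl(near, {"rs4", "rs2"})
print(sorted(near), sorted(eqtl_set), sorted(non_eqtl_set))
```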

Finally, we evaluated NetWAS, an algorithm that uses a different approach than either DEPICT or MAGMA: prioritizing genes from a GWAS based on patterns of PPI network connectivity76,179. We used Benchmarker to test different parameter options with NetWAS. NetWAS uses gene-level p-values from a GWAS as input, and the user provides a p-value threshold below which NetWAS will consider a gene a “positive” example for trait association in model training. We tested three p-value thresholds: 0.01, 0.0001, and a Bonferroni-corrected p-value for the number of genes tested (roughly 3 × 10⁻⁶, depending on the trait). Using these thresholds, we evaluated NetWAS performance on our twenty GWAS using the provided global (tissue-nonspecific) network. We observed that in general, the Bonferroni-corrected p-value threshold performed best; for most traits, NetWAS with at least one of the p-value thresholds performed above chance (Figure S3.16, Table S3.10). However, with the parameters we tested, NetWAS generally did not perform as well as our DEPICT-gene-sets or MAGMA analyses (in a joint model including NetWAS, DEPICT-gene-sets, and MAGMA, DEPICT-gene-sets versus NetWAS p = 0.0178, MAGMA versus NetWAS p = 2.28 × 10⁻³, DEPICT-gene-sets versus MAGMA p = 0.635; Figure 3.6; Table S3.11). We also carried out an additional analysis of NetWAS using several tissue-specific PPI networks; for each of nine traits, we chose one to four relevant tissues based on a combination of evidence from S-LDSC90 and DEPICT. For the majority of these analyses, the global network outperformed all of the tested tissue-specific networks (Figure S3.17, Table S3.10). In general, the brain-related tissues (“brain,” “cerebellum,” “cerebral cortex,” and “”) and the blood/immune-related tissues (“blood,” “spleen,” “leukocyte,” “lymphocyte,” and “bone marrow”) came closest to the global network performance (when used with relevant phenotypes).


[Figure 3.6 image: (A) one panel per trait (20 traits) comparing DEPICT, MAGMA, and NetWAS; y-axis: Prioritization Effect Size (Normalized t). (B) the same comparison, meta-analyzed; y-axis: Meta-Analyzed Prioritization Effect Size (Normalized t).]

Figure 3.6. Effect sizes (normalized t) for the joint LD score regression model comparing DEPICT-gene-sets, MAGMA (with the Z > 2.58 gene set binarization), and NetWAS (Bonferroni-corrected p-value threshold with the global network). Asterisks mark comparisons for which the difference between any pair of conditions achieved nominal significance (p < 0.05). Error bars represent 95% confidence intervals. (a) Results for each trait; note that y-axis scales differ for each panel. (b) Results meta-analyzed over 16 traits.


Discussion

We have developed and implemented a leave-one-chromosome-out approach for benchmarking similarity-based gene prioritization algorithms. The use of heritability explained as a metric has several advantages over methods that have been used elsewhere84,89. Firstly, it does not rely on the idea of “true positive” or “gold standard” genes, which is inherent in commonly used area-under-the-curve benchmarks; true positives are only as good as the existing knowledge base on a given trait and are likely biased towards a subset of relevant biology. Secondly, heritability explained directly measures an actual quantity of interest, namely, how well the method can independently identify genes that colocalize with GWAS association signals. Our leave-one-chromosome-out strategy is also an unbiased way of measuring method performance, as it enables the use of the GWAS data itself as its own control rather than external sources of data. We recommend that this idea be used in benchmarking new and existing prioritization methods; that is, testing should be done on genes or variants prioritized on a chromosome (or a set of chromosomes) that has been withheld from the input data. We also recommend leaving out at least an entire chromosome to avoid overfitting from correlation of association signals from neighboring genes. In this way, it is possible to gain the advantages of cross-validation without the disadvantages of relying on gold standards. With this approach, Benchmarker can assess any method that prioritizes genes or variants based on common features between the associated genes/variants and the left-out genes/variants, as long as the features do not depend on the GWAS association results themselves (because those are used as the benchmark). We also note that some existing prioritization methods that theoretically are capable of prioritizing genome-wide are not implemented to do so (i.e. they prioritize only genes in trait-associated loci rather than across the genome); we recommend that future developers of such methods include this capability, at a minimum for benchmarking purposes.
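The withholding scheme recommended above can be sketched as a simple loop: for each chromosome, the prioritization method sees association results from every other chromosome and must nominate genes on the withheld one. This is a schematic illustration under stated assumptions, not the Benchmarker implementation; the `prioritize` callback, the toy prioritizer, and all data are hypothetical, and the downstream heritability evaluation with S-LDSC is omitted.

```python
def leave_one_chromosome_out(gene_chrom, gwas_scores, prioritize):
    """Collect genes prioritized on each chromosome while it is withheld.

    gene_chrom:  {gene: chromosome}
    gwas_scores: {gene: association score from the GWAS}
    prioritize:  callable(training_scores, candidate_genes) -> set of genes;
                 it never sees GWAS scores for the withheld chromosome.
    """
    prioritized = set()
    for withheld in sorted(set(gene_chrom.values())):
        training = {g: s for g, s in gwas_scores.items() if gene_chrom[g] != withheld}
        candidates = {g for g, c in gene_chrom.items() if c == withheld}
        prioritized |= prioritize(training, candidates)
    return prioritized

# Toy data: a feature (pathway) that does not depend on the GWAS results.
pathway = {"G1": "growth", "G2": "growth", "G3": "immune", "G4": "growth"}
gene_chrom = {"G1": "1", "G2": "2", "G3": "2", "G4": "3"}
gwas_scores = {"G1": 5.0, "G2": 4.0, "G3": 0.1, "G4": 3.0}

def toy_prioritize(training, candidates):
    """Nominate withheld genes sharing a pathway with the top training gene."""
    top = max(training, key=training.get)
    return {g for g in candidates if pathway[g] == pathway[top]}

result = leave_one_chromosome_out(gene_chrom, gwas_scores, toy_prioritize)
print(sorted(result))
```

The key design point is that the similarity feature (here, the toy pathway label) is independent of the withheld association results, so the withheld chromosome can serve as the benchmark.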

In two different sets of comparisons, one using different data sources, and one using different prioritization algorithms, we showed that selecting genes prioritized by multiple approaches outperforms the use of genes prioritized by exactly one approach. Supporting this observation, Mendelian genes for skeletal growth disorders are more enriched in “intersect” than “outersect” genes for height; similarly, nearest-genes from GWAS data are generally more enriched in intersect than outersect genes across most tested traits. This finding has important implications for translating genetic associations into biological insights, as it empirically demonstrates that combining prioritization approaches is superior to relying on the somewhat arbitrary choice of a single approach. We also observed that intersect performance was particularly strong for immune and lipid traits. In contrast, brain-related traits generally showed fewer significant differences between intersect and outersect gene performance. One possible explanation is the heterogeneity of the brain, which has extremely high regional and cell-type specificity. Gene prioritization for these traits might therefore improve with a different approach, such as restricting tissue expression data to brain regions only, or, conversely, including only a single representative brain region in the tissue expression analysis rather than many. Another possible reason for this finding is the importance of brain-region-specific isoform expression, which is not accounted for in our general tissue expression data.

We also explored the possibility of combining gene prioritization with additional variant-level information, using eQTLs as an example. First, we showed that SNPs within 50 kb of prioritized genes that are annotated as blood cis-eQTLs outperform similar variants not annotated as eQTLs for some (but not all) traits, and that this combination of cis-eQTL information and prioritized genes from the intersect sets yielded a set of variants with particularly high per-variant heritability explained. We speculate that this enrichment of heritability arises in part because, regardless of the tissue they affect, SNPs that are eQTLs are more likely to be in functional elements; therefore, partitioning the prioritized SNPs into eQTLs and non-eQTLs may help separate SNPs enriched for function from SNPs enriched for inactivity. It is interesting that this finding was particularly pronounced for brain-related traits, where our prioritization methods produced lower per-SNP heritability estimates in general; this could reflect general differences in genetic architecture of brain-related traits compared to others, an idea for which there is some support from other S-LDSC analyses8,182.

In addition, we used Benchmarker to analyze a PPI-based method, NetWAS, and determined that it also prioritized genes better than random chance but did not perform quite as well as our other tested approaches. We emphasize, however, that many parameters in NetWAS could be changed and optimized beyond those we tested, and Benchmarker could be used to evaluate such potential improvements (for example, different methods could be used to generate the gene-based p-values used as input). In addition, the NetWAS analysis suggests another useful application of Benchmarker: determining the best combination of prioritization approaches and tissue-specific data sets to use for any trait of interest, in a manner complementary to that used in Finucane et al. (2018)90.

We also note that, because Benchmarker relies on a cross-validation strategy, it can be used to fairly determine the best prioritization method for any given trait. This strategy provides a route to best practices for gene prioritization for the field: by benchmarking multiple approaches (and particularly the intersection of multiple approaches), it should be possible to objectively improve the gene prioritization of any given trait. For example, we observed that age of menarche was one of the few traits for which combining information across gene-set-based and tissue-matrix-based prioritization did not improve per-SNP heritability; one possible explanation based on our original analysis is that the GTEx matrix alone did not significantly contribute to heritability, so using information from that analysis may have actually worsened the signal-to-noise ratio. Such observations have the potential to inform specific decisions for performing gene prioritization for GWAS from individual traits. In addition, with our analysis of MAGMA, we have shown that this benchmarking strategy can be extended to methods that evaluate enrichment of genomic features (e.g. pathways or tissue expression) but do not explicitly assign prioritization p-values to genes. This high generalizability will allow for the comparison of a wide variety of approaches for different traits.

An important caveat to our results is that LD score regression may have insufficient power to identify small differences in explained heritability using different approaches to gene prioritization, and that other metrics (such as the genomic inflation factor for prioritized variants) may be more sensitive. We note, however, that LD score regression is a widely used method in the field and has found important and significant differences in heritability explained by a variety of other types of annotation, such as cell-type-specific expression90 and epigenomic marks53. We believe that any small losses in power from our choice of method are outweighed by the benefits of the output (i.e. per-SNP heritability) being directly interpretable and extremely meaningful. Furthermore, improvements to the annotations we have used here will provide boosts in power (as well as answers to other questions about how effective such “improvements” actually are). For example, using a 50-kb window around prioritized genes, as we have done here, means that a large amount of noise is included in our annotations. A major improvement would therefore be assignment of noncoding SNPs to the genes they regulate based on expression, epigenetic, and/or chromatin conformation data; we demonstrated one such possibility with our eQTL analyses, but there are many additional approaches that could be taken.

In conclusion, we have developed a powerful and well-controlled approach for benchmarking gene prioritization strategies that relies solely on GWAS data and does not require any assumptions about the “correct” biology. Our method shows that combining prioritization strategies can improve heritability enrichment and suggests a strong recommendation that follow-up studies be focused on genes prioritized using multiple approaches. Future prioritization methods will benefit from incorporating different statistical approaches and data sources; even apparently similar data (such as two different sources of tissue expression) can provide different and complementary information. We believe that the overall cross-validation approach described and implemented here provides a better “gold standard” for benchmarking existing and future methods for gene and variant prioritization. Finally, Benchmarker can be used to determine the best algorithm and data set for any particular trait of interest.

Declaration of Interests

J.N.H. is a member of the scientific advisory board of Camp4 Therapeutics. All other authors declare no competing interests.

Acknowledgements

We gratefully acknowledge Hilary Finucane, Steven Gazal, Eric Bartell, and Jack Kosmicki for helpful discussion. This research has been conducted using the UK Biobank resource. This work was supported by NHGRI NIH F31HG009850 (R.S.F.), NIDDK NIH R01DK075787 (J.N.H.), the Lundbeck Foundation R190-2014-3904 and the Novo Nordisk Foundation NNF18CC0034900 (T.H.P.), NHGRI NIH T32 HG002295 (T.A.), and NIAMS NIH 1R01AR063759-01A1 (S.R.).

Web Resources

Benchmarker: https://github.com/RebeccaFine/benchmarker

LDSC software: https://github.com/bulik/ldsc (version 1.0)

LD scores and other files for LDSC: https://data.broadinstitute.org/alkesgroup/LDSCORE/

DEPICT software: https://data.broadinstitute.org/mpg/depict/ (release 194)

MAGMA software: https://ctg.cncr.nl/software/magma (version 1.06b)

GEO: https://www.ncbi.nlm.nih.gov/geo/

GTEx: http://www.gtexportal.org

eQTLs: https://molgenis26.gcc.rug.nl/downloads/eqtlgen/cis-eqtl/

NetWAS: https://hb.flatironinstitute.org/api/, https://hb.flatironinstitute.org/netwas

Chapter 4:

Application of single-cell RNA sequencing data to interpret genome-wide association studies of body-mass index

Rebecca S. Fine1-4 & Joel N. Hirschhorn1-3,5

1 Department of Genetics, Harvard Medical School, Boston, MA 02115, USA. 2 Division of Endocrinology and Center for Basic and Translational Obesity Research, Boston Children's Hospital, Boston, MA 02115, USA. 3 Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA. 4 Ph.D. Program in Biological and Biomedical Sciences, Graduate School of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA. 5 Department of Pediatrics, Harvard Medical School, Boston, MA 02115, USA.

This chapter includes a description of a preliminary exploratory analysis.

Abstract

Genome-wide association studies of body mass index (BMI) have uncovered hundreds of associated SNPs and have indicated a strong and somewhat unexpected enrichment for expression in tissues of the nervous system. However, the nervous system is a highly diverse system comprised of many different cell types. To determine which cell types are most relevant for BMI, we analyzed a large, publicly available single-cell RNA sequencing data set which includes most of the mouse nervous system. We used several different approaches to ask whether increasing specificity of expression within any of the catalogued cell types is associated with stronger genetic association to BMI. We found many cell types with possible associations, nearly all of which were neurons in the central nervous system. These included neurons in the hindbrain, di- and mesencephalon, and cerebral cortex; we hypothesize that some of these associations may reflect vagal-gut signaling and reward circuitry. However, based on univariate and conditional analyses, we concluded that results are quite sensitive to the choice of method used, making interpretation difficult. In this preliminary work, we describe our findings as well as many of the difficulties we encountered in performing this analysis, which we believe are largely unsolved problems in the field.

Introduction

BMI and the nervous system

Body mass index (BMI), calculated as weight divided by height squared, is a commonly used measure of obesity. BMI is known to have a strong genetic component, with heritability estimated at between 40 and 70%27. A recent large GWAS of BMI (N = ~700,000) identified 716 nearly independent associated SNPs, explaining ~6% of phenotypic variance183.

One of the most interesting observations emerging from GWAS analysis of BMI has been a strong, consistent enrichment for gene sets involving the brain and nervous system12,27.

Though it has long been understood that the hypothalamus is a major center for control of appetite (in particular involving the leptin-melanocortin signaling pathway184), the enrichments found in GWAS seem to suggest more generalized mechanisms, particularly involving neurotransmission and the synapse. Supporting this observation, stratified LD score regression analysis (S-LDSC) has shown that genes with the most brain-specific expression contribute significantly to BMI heritability90; a further analysis indicated that of the three main cell types in the brain (neurons, oligodendrocytes, and astrocytes), neurons were by far the most significantly enriched. However, much more work is needed to more specifically pinpoint the exact tissues and cell types driving this enrichment, given the extreme complexity and diversity of the nervous system.

To more specifically investigate the relationship between BMI and the nervous system, we used a mouse single-cell RNA sequencing (scRNA-seq) data set recently published by Zeisel et al.110 This data set is unique because it includes tissue samples from both the central and peripheral nervous system; therefore, it is both broad and deep in scope and enables comparison among many different data types that have all been normalized together (avoiding many of the difficulties in harmonizing across scRNA-seq data sets that have been collected for different experiments).

Anatomy of the nervous system

Here, I provide a brief overview of relevant classifications of the nervous system to facilitate interpretation of results from the Zeisel et al. data set. The nervous system contains two main subdivisions: the central nervous system (CNS, consisting of the brain and spinal cord) and the peripheral nervous system (PNS, consisting of everything else). Under the umbrella of the peripheral nervous system are two branches: sensory and motor. The motor division is further divided into two branches: somatic (under voluntary control) and autonomic (under involuntary control). Finally, the autonomic nervous system consists of three branches: sympathetic (“fight-or-flight”), parasympathetic (“rest-and-digest”), and enteric. The enteric nervous system, which can function independently from the CNS, is a separate nervous system devoted specifically to the gut185,186. The enteric nervous system also sends afferent projections to and receives efferent projections from the brain, particularly through the vagus nerve185.

The brain consists of many different regions, which can be classified by their developmental origin as well as their final anatomical structure and location. During development, the central nervous system arises from a structure called the neural tube187. The neural tube consists of four subdivisions, from most anterior to most posterior: the forebrain (prosencephalon), midbrain (mesencephalon), hindbrain (rhombencephalon), and spinal cord. As it develops further, the forebrain is eventually subdivided into the telencephalon (consisting of the cerebral cortex, hippocampus, basal ganglia, and olfactory bulb) and the diencephalon (thalamus and hypothalamus). Finally, the hindbrain develops into the pons, cerebellum, and medulla oblongata.

An important feature in classifying neurons is the type of neurotransmitter/peptide they release. The primary excitatory neurotransmitter is glutamate, while the primary inhibitory neurotransmitter is gamma-aminobutyric acid (GABA). Some neurons are also “cholinergic” (producing acetylcholine) or “monoaminergic” (producing dopamine, serotonin, epinephrine, or norepinephrine). Finally, “peptidergic” neurons release peptides such as oxytocin and Agouti-related protein (AgRP, a hypothalamic peptide with an important role in promoting appetite).

Integrating scRNA-seq with GWAS data

Next, I will describe some existing work which has used scRNA-seq data to interpret GWAS data, as we are interested in doing here. Then, I will give an overview of what these methods have uncovered with respect to BMI specifically.

A common strategy for using scRNA-seq to interpret GWAS data involves the use of calculated “specificity” scores. These aim to quantify the degree to which expression of a particular gene is unique to a given cell type; there are many variations on how this can be calculated, but the general framework is usually a ratio of the mean expression of gene A in cell type X to the mean expression of gene A in all other cell types. The specificity scores can then be used as input to other methods, including MAGMA30 and stratified LD score regression (S-LDSC)53. MAGMA is a gene set enrichment analysis framework that first calculates gene-level p-values for a given GWAS. Then, it can assess the strength of the relationship between a continuous covariate (e.g. specificity scores) and the gene-level p-values; in this case, the question is whether there is a significant relationship between increasing specificity to a given cell type and the GWAS gene-level p-value. Another approach is to use the specificity scores as input to S-LDSC. S-LDSC is a method that can calculate the heritability explained by a particular genomic annotation of interest (e.g. specificity scores). In this case, the specificity scores are often binarized such that the top 10% of specifically expressed genes are coded as “yes” and all others as “no.” Then, S-LDSC can be used to ask whether the top 10% of specifically expressed genes for a given cell type are enriched for heritability.

A handful of studies have used this approach, calculating specificity scores and then reporting results from MAGMA and/or S-LDSC115–118. One recent study (preprint: Bryois et al.117) used five scRNA-seq data sets to analyze 18 nervous-system-related and 8 non-nervous-system-related traits, including BMI183. With respect to BMI, the authors observed the strongest associations with midbrain, hypothalamic, thalamic, hindbrain, and cortical excitatory neurons. Conditional analysis indicated that associations of at least some of the cell types were independent of all other cell types. Since BMI was not the focus of this paper, more exhaustive analysis of the conditional analysis results (e.g. determining which cell types were the drivers of significance) was not performed.

An alternative strategy to the use of specificity scores was recently proposed by Watanabe et al. as part of their tool, FUMA119. The authors argue that, because specificity scores are often correlated with average expression across cell types, it is important to model average expression of a given gene across cell types as a potential confounder. They therefore apply MAGMA with a different formula: rather than a regression of gene-level z-scores on cell type specificity scores, they use a regression of gene-level z-scores on log-transformed gene expression in the relevant cell type and log-transformed average expression across cell types. Therefore, two separate betas are assigned, one for expression in the relevant cell type and one for average expression level across cell types.

Using this model, the authors evaluated 43 publicly available scRNA-seq datasets (including the data set used here) and performed marginal and conditional analyses for cell type specificity of 26 different traits119. For BMI (using summary statistics from Pulit et al.188), they observed 20 cell types representing independently significant signals within their original scRNA-seq data sets, including both glutamatergic and GABAergic neurons. When performing conditional analysis between data sets, these 20 cell types were collapsed into four independent clusters. However, these clusters did not break down in a completely intuitive manner; for example, the first cluster included both pyramidal (excitatory) and GABAergic neurons.

Both Watanabe et al. and Bryois et al. included the Zeisel et al. data set we analyze here within their results. Within this data set, Watanabe et al. found 13 marginally significant cell types after multiple testing correction, including several glutamatergic and inhibitory hindbrain cell types, a cholinergic/monoaminergic hindbrain cell type, and two excitatory cortical cell types; when performing conditional analysis, they observed that one of the inhibitory hindbrain types (HBINH9) largely explained the signal of the other significant cell types. In contrast, Bryois et al. found 48 cell types significant by both LDSC and MAGMA after multiple testing correction, including some of the same general types of cells found by Watanabe et al. (e.g. hindbrain, cortical excitatory), but also including substantial signal from other regions, such as the midbrain and diencephalon.

Though these studies have begun to examine the cell type specificity profile of BMI, both were in the context of large-scale analyses of many traits, where most traits did not receive much individual attention. In addition, the results from the two studies are not completely concordant with one another, and there remain many methodological difficulties in this type of analysis.

Thus, we conducted a scRNA-seq analysis focusing specifically on BMI, asking whether we could identify enriched cell types using GWAS data.

Methods

Input data

As input for our analyses, we used publicly available summary statistics from a GWAS of BMI conducted with BOLT-LMM based on Europeans in UK Biobank data (N = 457,824)174.

We downloaded scRNA-seq data from http://mousebrain.org/ and filtered out all genes not marked as “valid” (corresponding to genes detected in fewer than 20 cells or more than 60% of all cells)110. We downloaded mouse-human orthologue mappings from https://github.com/NathanSkene/EWCE, consisting of 15,604 unique entries189. We then removed all genes from the scRNA-seq data that did not have a mapped human orthologue.

Specificity metric

We began with the specificity metric originally calculated by Zeisel et al.110 Given gene i and cell type j:

$$S_{i,j} = \left(\frac{f_{i,j} + 0.1}{f_{i,\bar{j}} + 0.1}\right)\left(\frac{\mu_{i,j} + 0.01}{\mu_{i,\bar{j}} + 0.01}\right)$$

In this equation, $f_{i,j}$ is the fraction of cells in cell type j expressing gene i and $f_{i,\bar{j}}$ is the fraction of cells not in cell type j expressing gene i. Analogously, $\mu_{i,j}$ is the mean expression of gene i in cell type j, and $\mu_{i,\bar{j}}$ is the mean expression of gene i in cells not in cell type j. The 0.1 and 0.01 constants prevent the fractions from going to zero.
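As a minimal sketch of the calculation above (not the code used by Zeisel et al. or in this work), the following Python function computes the score for a single gene from per-cell expression values. The function name, the input layout (a list of expression values with a parallel list of cell type labels), and the rule that a cell "expresses" the gene when its value is greater than zero are all illustrative assumptions.

```python
def specificity_score(expr, labels, cell_type, eps_frac=0.1, eps_mean=0.01):
    """Specificity of one gene for `cell_type` (hypothetical helper).

    expr   : expression values of the gene, one per cell
    labels : cell type label of each cell, in the same order as `expr`
    A cell is counted as expressing the gene when its value is > 0,
    which is an assumption made for this sketch.
    """
    inside = [x for x, lab in zip(expr, labels) if lab == cell_type]
    outside = [x for x, lab in zip(expr, labels) if lab != cell_type]
    if not inside or not outside:
        raise ValueError("need cells both inside and outside the cell type")

    def frac_expressing(values):
        return sum(1 for v in values if v > 0) / len(values)

    def mean(values):
        return sum(values) / len(values)

    # Ratio of expressing fractions times ratio of mean expression,
    # each stabilized by its small constant.
    frac_ratio = (frac_expressing(inside) + eps_frac) / (frac_expressing(outside) + eps_frac)
    mean_ratio = (mean(inside) + eps_mean) / (mean(outside) + eps_mean)
    return frac_ratio * mean_ratio


# Toy data: a gene expressed in every cell of type "X" and in no other cell.
expr = [4.0, 2.0, 0.0, 0.0, 0.0, 0.0]
labels = ["X", "X", "Y", "Y", "Z", "Z"]
score_x = specificity_score(expr, labels, "X")  # far above 1
score_y = specificity_score(expr, labels, "Y")  # far below 1
```

In this toy example the score for "X" is large because both ratios are large, while the score for "Y" falls well below 1; without the two constants, the "X" score would require division by zero.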

We first considered the question of whether and how to transform these specificity scores, which have a highly non-normal distribution, for input to MAGMA and S-LDSC. Most of the studies that have used this approach have grouped the scores into 40 equally sized bins for MAGMA, while the most recent one (Bryois et al.117) transformed the score to a normal distribution. Studies that have used S-LDSC also binarized the scores such that the top 10% of specifically expressed genes becomes a gene set.

The inherent question in this analysis is how best to enhance our signal-to-noise ratio. On the one hand, a continuous distribution maximally preserves the underlying information. On the other hand, it is very likely that, while the difference in ranking between the first and twentieth most specifically expressed genes is meaningful, the difference between the 10,001st and 10,020th most specifically expressed genes is not. Furthermore, this is a biological question as much as a technical one: in terms of the mechanism by which we expect BMI-associated variants to be acting, how closely correlated do we believe the expression specificity and the genetic association to be? It may well be that a crude binarization is more reflective of the biological reality than a more sensitive continuous distribution, which may exaggerate the influence of small differences in expression. In addition, on top of these questions, it remains unclear whether binning should be used, how to make a decision about what size bins should be, or what percentage of genes to binarize.

In this exploratory work, we decided to use the top 10% metric as input to both S-LDSC and MAGMA, as well as a bin-based approach (40 equally sized bins) for MAGMA.

Future work in this direction should try more combinations of these approaches, ideally in a simulation setting where the accuracy of these approaches under different assumptions can be measured. For example, there is now a version of S-LDSC that can handle continuous annotations176, which would enable the 40-bin annotation to be tested with this approach as well.
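The two transformations discussed above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used for the analyses: the function names are hypothetical, scores are assumed to be held in a dictionary keyed by gene, and ties are broken arbitrarily by sort order.

```python
def top_fraction_indicator(scores, fraction=0.10):
    """Return {gene: 1 or 0}, with 1 for the top `fraction` of genes by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = int(round(len(ranked) * fraction))
    top = set(ranked[:n_top])
    return {gene: int(gene in top) for gene in scores}


def equal_sized_bins(scores, n_bins=40):
    """Return {gene: bin}, with bins numbered 1 (lowest scores) to n_bins (highest)."""
    ranked = sorted(scores, key=scores.get)  # ascending by score
    n = len(ranked)
    return {gene: (rank * n_bins) // n + 1 for rank, gene in enumerate(ranked)}


# Toy input: 400 genes with distinct scores for a single cell type.
toy_scores = {f"gene{i}": float(i) for i in range(400)}
binary = top_fraction_indicator(toy_scores)   # 40 genes coded 1
bins = equal_sized_bins(toy_scores)           # 10 genes in each of 40 bins
```

The binary indicator discards all ranking information outside the top 10%, whereas the binned version keeps a coarse ranking across the whole distribution; this is the trade-off between the two inputs described above.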

We next considered a question which, to our knowledge, has not been explicitly addressed by any of the previously performed scRNA-seq/GWAS integration analyses. Specificity scores are by definition dependent on which cell types are being compared to each other, as the denominator averages across other cells in the data set. This creates a complicated issue, as some cell types have expression patterns that are strongly correlated with other cell types, while others are more strongly differentiated (Figure 4.1, Figure 4.2). For example, the Zeisel et al. data set contains 9 subtypes of enteric neurons, ENT1-ENT9. A specificity score will treat the “distance” between ENT1 and ENT2 (two closely related cell types) the same way as the “distance” between ENT1 and a completely unrelated cell type, such as a subtype of astrocytes. This behavior complicates the intuitive interpretation of “specificity.” This may also be problematic because cell types with many close relatives in a data set may be treated differently than those without them. (Note that even within the FUMA approach, which does not use specificity scores, this remains problematic because the other cell types are included within the “average expression” term.)

We propose a potential solution to this problem. Instead of using the specificity scores as calculated by Zeisel et al., we modified the formula to account for cell type-cell type correlations. More precisely, consider each cell type k (including all cell types other than the tested cell type j). For cells in each cell type k, their fractions and means of expression are then weighted by $(1 - r_{j,k})$, where $r_{j,k}$ is the Pearson correlation coefficient between the two cell types. Thus, cell types highly correlated to cell type j will contribute much less to the specificity score than less highly correlated cell types.
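The following Python sketch shows one possible reading of this weighting; it is not the implementation used in this work. It assumes that every cell in cell type k receives weight $(1 - r_{j,k})$, so that the "outside" fraction and mean become weighted averages over the other cell types with weights $n_k (1 - r_{j,k})$, where $n_k$ is the number of cells in cell type k. The function name and the per-cell-type summary layout are hypothetical.

```python
def weighted_specificity(summary, corr, j, eps_frac=0.1, eps_mean=0.01):
    """Correlation-weighted specificity of one gene for cell type `j`.

    summary : {cell_type: (n_cells, frac_expressing, mean_expr)} for the gene
    corr    : {(j, k): Pearson r between cell types j and k}
    Assumption of this sketch: each cell of type k gets weight (1 - r_jk).
    """
    _, frac_j, mean_j = summary[j]
    weight_total = frac_sum = mean_sum = 0.0
    for k, (n_k, frac_k, mean_k) in summary.items():
        if k == j:
            continue
        w = n_k * (1.0 - corr[(j, k)])
        weight_total += w
        frac_sum += w * frac_k
        mean_sum += w * mean_k
    if weight_total <= 0:
        raise ValueError("all other cell types are perfectly correlated with j")
    frac_out = frac_sum / weight_total
    mean_out = mean_sum / weight_total
    return ((frac_j + eps_frac) / (frac_out + eps_frac)) * (
        (mean_j + eps_mean) / (mean_out + eps_mean)
    )


# Toy gene expressed in two closely related cell types but not in an unrelated one.
summary = {
    "ENT1": (100, 0.9, 2.0),
    "ENT2": (100, 0.9, 2.0),
    "ASTRO": (100, 0.0, 0.0),
}
corr = {("ENT1", "ENT2"): 0.95, ("ENT1", "ASTRO"): 0.10}   # made-up correlations
no_corr = {("ENT1", "ENT2"): 0.0, ("ENT1", "ASTRO"): 0.0}  # reduces to the unweighted score

weighted = weighted_specificity(summary, corr, "ENT1")
unweighted = weighted_specificity(summary, no_corr, "ENT1")
```

With these made-up numbers, the weighted score for ENT1 is much higher than the unweighted one, because the closely related ENT2 cells no longer dilute the comparison.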

Figure 4.1. Pearson correlation matrix between mean expression values of level 4 cell types.

Figure 4.2. Pearson correlation matrix between mean expression values of level 6 neurons.

Cell types to compare

When defining cell type clusters, Zeisel et al. created several levels of classification, with increasing levels of specificity. For example, the broadest classification (“level 1”) consists of four classes: neurons, glia, immune cells, and vascular cells. We therefore considered which level(s) we should compare, as different ways of grouping cells might reveal different information. Analogous to the specificity issues discussed above, this also represents an issue that could be biological or technical: we do not know what level of cell type specificity is actually relevant for mechanisms linking expression perturbation in certain cell types to effects on BMI. In addition, although we have attempted to correct for cell type correlation in our specificity score calculations, the level of cell type we analyze will likely still have a large effect.

In our analysis, we decided to focus on two different levels of classification, representing the most detailed level of classification (“level 6”) as well as a more general level (“level 4”).

Level 6 consists of 265 cell types (214 of which are neuronal), while the level 4 classifications consist of 39 cell types, including 21 neuronal subtypes (e.g. “hindbrain neurons,” “dentate gyrus granule neurons”).

MAGMA analysis

We calculated gene-based p-values for BMI in MAGMA v1.07b. Variants within 10 kb upstream of a gene or 1.5 kb downstream of a gene were assigned to that gene (following 118).

For gene boundaries, we used NCBI build 37 gene locations, and we used Europeans from the 1000 Genomes phase 3 reference panel175 (both downloaded from the MAGMA website). We also removed variants with info score < 0.3 and those which could not be matched to an rsID in our reference data, leaving 8,919,303 variants.
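The asymmetric window used to assign variants to genes can be illustrated with a short Python function. This is a sketch of the rule as stated above, not MAGMA's own annotation code: the function name is hypothetical, coordinates are assumed to be base-pair positions on the same chromosome with gene_start <= gene_end, and the window boundaries are assumed to be inclusive.

```python
def in_gene_window(pos, gene_start, gene_end, strand,
                   upstream_bp=10_000, downstream_bp=1_500):
    """True if a variant at `pos` falls in the gene body or its flanking window.

    "Upstream" is relative to the direction of transcription, so the two
    flanks swap sides for genes on the minus strand.
    """
    if strand == "+":
        lo, hi = gene_start - upstream_bp, gene_end + downstream_bp
    elif strand == "-":
        lo, hi = gene_start - downstream_bp, gene_end + upstream_bp
    else:
        raise ValueError("strand must be '+' or '-'")
    return lo <= pos <= hi


# Toy gene spanning 100,000-120,000 bp.
plus_upstream = in_gene_window(91_000, 100_000, 120_000, "+")    # 9 kb upstream
plus_too_far = in_gene_window(122_000, 100_000, 120_000, "+")    # 2 kb downstream
minus_upstream = in_gene_window(129_000, 100_000, 120_000, "-")  # 9 kb upstream on minus strand
```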

For binned scores, we used the --gene-covar flag in MAGMA, specifying a one-sided test (as we were only interested in unusually high rather than unusually low specificity scores). This model tests:

$$Z = \beta_0 + S_c \beta_1 + B \beta_B + \varepsilon$$

where $Z$ is the gene-level z-scores, $S_c$ is the specificity scores of cell type c, and $B$ is a matrix of technical confounders (by default MAGMA internally conditions on gene size, gene density, inverse minor allele count, and the log of these three variables). For binarized (top 10%) scores, we used the --set-annot flag in MAGMA to test each list of genes as a gene set. This tests the same model as above except that $S_c$ becomes a binary indicator variable for gene set membership. For conditional analyses, we performed a pairwise conditional analysis conditioning each significant cell type on each other significant cell type (using the --joint-pairs flag).

Stratified LD score regression analysis

We used S-LDSC to analyze the binarized specificity scores (top 10% based on our correlation-corrected specificity scores). In each analysis, we included the baseline model (53 annotations of known genomic importance such as sequence characteristics and cell-type-nonspecific regulatory marks)53. We used the cell-type-specific mode of S-LDSC (the --ref-ld-chr-cts flag), including a window of 100 kb on either side of each gene (following 90). As a control, we also included in each analysis an annotation consisting of all annotated genes for the specificity matrix being tested (including the 100 kb window).

Our reference SNPs for European LD score estimation were the set of 9,997,231 SNPs with a minor allele count ≥ 5 from 489 unrelated European individuals in Phase 3 of 1000 Genomes175. Heritability was partitioned for the set of 5,961,159 SNPs with MAF ≥ 0.05; regression coefficient estimation was performed with 1,217,312 HapMap3 SNPs (SNPs in HapMap3 are used because they are generally well-imputed).

FUMA analysis

As described above, the FUMA model is also based on MAGMA, but handles specificity differently. This model tests:

$$Z = \beta_0 + E_c \beta_1 + A \beta_2 + B \beta_B + \varepsilon$$

where $Z$ is the gene-level z-scores, $E_c$ is the log-transformed gene expression of cell type c, $A$ is the average log expression across cell types, and $B$ is a matrix of technical confounders. As FUMA is not yet implemented as a standalone program, we could not apply any correlation-based correction to the average term (which would make it more comparable to our correlation-adjusted specificity scores). We used the FUMA web application to analyze our BMI GWAS at our two levels of interest (levels 4 and 6) with default settings.
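To illustrate why the average-expression term matters, the following Python sketch fits a much simplified version of this model by ordinary least squares on toy data. It is not FUMA or MAGMA: the technical confounders and the gene-gene correlation handled by MAGMA are omitted, the helper `ols` is a hypothetical name, and the data are constructed so that the z-scores depend only on average expression.

```python
def ols(columns, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ...]."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations: (X'X) beta = X'y
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    b = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    beta = [0.0] * p
    for r in reversed(range(p)):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    return beta


# Toy data for six genes (made-up values).
avg_expr = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # A: average log expression
cell_expr = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]   # E_c: correlated with A
z = [0.5 * a for a in avg_expr]              # z-scores driven by A only

marginal = ols([cell_expr], z)            # z ~ E_c
joint = ols([cell_expr, avg_expr], z)     # z ~ E_c + A
```

In the marginal fit, expression in the cell type appears associated with the z-scores (slope of about 0.5) only because it tracks average expression; once average expression is included, the coefficient on the cell type term drops to approximately zero.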

Results

Level 4 cell types

We began by applying different MAGMA approaches to each of the 39 individual cell types within level 4. First, we applied MAGMA with 40 percentile bins (“MAGMA-40Bins”) and observed 11 significant cell types at a Bonferroni-corrected p-value of 0.05 / 39 (= 0.00128) (Figure 4.3). Next, we applied MAGMA with the top 10 percent of specifically expressed genes (“MAGMA-Top10%”) and observed 12 significant cell types, nine of which overlapped the significant set from MAGMA-40Bins (Figure 4.3). Finally, we applied the FUMA method and observed 9 significant cell types (Figure 4.3). Seven cell types, all of which were CNS neurons, were significant for all three methods: hindbrain, di- and mesencephalon excitatory neurons, di- and mesencephalon inhibitory neurons, cholinergic and monoaminergic neurons, telencephalon projecting excitatory neurons, spinal cord inhibitory neurons, and glutamatergic neuroblasts. The Pearson correlations between the three methods were generally high: MAGMA-40Bins v. MAGMA-Top10% r = 0.791 (p = 1.98 × 10^-9), MAGMA-40Bins v. FUMA r = 0.941 (p = 4.42 × 10^-19), and MAGMA-Top10% v. FUMA r = 0.866 (p = 1.07 × 10^-12).

Figure 4.3. Univariate MAGMA-40Bins, MAGMA-Top10%, and FUMA results for level 4 cell types. The x-axis shows the negative log 10 p-value for significance for each cell type, and cell types are colored by their general neuroanatomical classification. Dashed line is at Bonferroni significance (p < 0.05/39).

Many of these cell types have highly correlated expression patterns, so we next used conditional analysis to determine which cell types were driving the enrichment. For each analysis, we applied a conditional analysis in MAGMA to test each pairwise combination of observed significant gene sets. For MAGMA-40Bins, we applied a stringent and highly conservative Bonferroni correction for the total number of pairwise tests (p < 0.05/55 pairwise tests = 4.55 × 10^-4) and observed that no cell type survived this correction for conditioning on every other cell type (Figure 4.4). However, hindbrain came close, failing to achieve significance only when conditioned on cholinergic/monoaminergic neurons and di- and mesencephalon excitatory neurons (at p = 6.56 × 10^-4 and p = 5.11 × 10^-4, respectively).

Similarly, we performed the MAGMA pairwise conditional analysis using the MAGMA-Top10% results (with a p-value threshold of 0.05/66 = 7.58 × 10^-4). In this case, we observed that only telencephalon projecting excitatory neurons survived all corrections (Figure 4.5). In general, the results from this analysis were not strongly concordant with the previous one, as each pointed to different conclusions about which cell types were the strongest drivers of the enrichment. For example, spinal cord inhibitory neurons appeared to have a much stronger residual signal in MAGMA-Top10% than in MAGMA-40Bins, where its association was abolished by conditioning on any of the six most strongly associated cell types.
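The pairwise test counts used as Bonferroni denominators for the level 4 analyses (55, 66, and 36) are the numbers of unordered pairs among the 11, 12, and 9 significant cell types, that is, n(n - 1)/2. A minimal Python check of this arithmetic, using only the counts reported in the text (the variable names are illustrative), is:

```python
from math import comb

# Number of significant level 4 cell types reported for each method.
significant = {"MAGMA-40Bins": 11, "MAGMA-Top10%": 12, "FUMA": 9}

# Unordered pairs of significant cell types: n choose 2.
pairwise_tests = {method: comb(n, 2) for method, n in significant.items()}

# Univariate Bonferroni threshold for the 39 level 4 cell types.
univariate_threshold = 0.05 / 39
```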

Figure 4.4. Pairwise conditional analysis for significant cell types (MAGMA-40Bins). Each square represents the significance of the cell type in the row conditioned on the cell type in the column. Color indicates -log10 p-value (legend on the right). Double asterisks indicate cell types significant after Bonferroni correction (p < 0.05/55, correcting for the total number of pairwise tests), while single asterisks indicate cell types significant after a more lenient correction (p < 0.05/11, correcting for the number of significant cell types).

Figure 4.5. Pairwise conditional analysis for significant cell types (MAGMA-Top10%). Each square represents the significance of the cell type in the row conditioned on the cell type in the column. Color indicates -log10 p-value (legend on the right). Double asterisks indicate cell types significant after Bonferroni correction (p < 0.05/66, correcting for the total number of pairwise tests), while single asterisks indicate cell types significant after a more lenient correction (p < 0.05/12, correcting for the number of significant cell types).

Figure 4.6. Pairwise conditional analysis for significant cell types (FUMA). Each square represents the significance of the cell type in the row conditioned on the cell type in the column. Color indicates -log10 p-value (legend on the right). Double asterisks indicate cell types significant after Bonferroni correction (p < 0.05/36, correcting for total number of pairwise tests); single asterisks indicate cell types significant after less stringent correction (p < 0.05/9, correcting for the number of significant cell types).

In contrast with both sets of previous results, pairwise conditional analysis within FUMA indicated that only the di- and mesencephalon excitatory neurons survived all corrections (Figure 4.6). In addition, the FUMA pipeline includes a forward selection procedure for determining how many independent signals are represented by the significant cell type associations. Based on this procedure, di- and mesencephalon excitatory neurons and enteric neurons were the only two cell types retained. Hindbrain, telencephalon projecting excitatory neurons, di- and mesencephalon inhibitory neurons, and cholinergic/monoaminergic neurons were all partially but not completely explained by the di- and mesencephalon excitatory neuron signal.

We also performed an additional regression analysis to try to determine which signals were independent. First, we conditioned on the most significant cell type. Then, we conditioned on the most significant remaining cell type. We repeated this procedure until no cell type achieved a nominal significance p-value of less than 0.05. For MAGMA-40Bins, this analysis implicated (in order) hindbrain, telencephalon projecting excitatory neurons, and cholinergic/monoaminergic neurons. In contrast, for MAGMA-Top10%, the cell types implicated were telencephalon projecting excitatory neurons, spinal cord inhibitory neurons, telencephalon projecting inhibitory neurons, and di- and mesencephalon inhibitory neurons.
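The stepwise procedure can be sketched as a simple loop in Python. This is an outline under stated assumptions, not the code used for the analysis: it assumes that conditioning is cumulative (each step conditions on all cell types selected so far), and it takes the regression as a caller-supplied function `pvalue_fn(cell_type, conditioned_on)`. The function names and the toy p-values are hypothetical.

```python
def forward_select(cell_types, pvalue_fn, alpha=0.05):
    """Repeatedly add the most significant remaining cell type.

    pvalue_fn(cell_type, conditioned_on) must return the p-value of
    `cell_type` after conditioning on the tuple `conditioned_on`.
    Stops when no remaining cell type has p < alpha.
    """
    selected = []
    remaining = list(cell_types)
    while remaining:
        pvals = {c: pvalue_fn(c, tuple(selected)) for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-in for the conditional regression (made-up numbers):
# type_A and type_B share one underlying signal, type_C carries another.
TOY = {
    "type_A": ("signal_1", 1e-6),
    "type_B": ("signal_1", 1e-5),
    "type_C": ("signal_2", 1e-3),
}

def toy_pvalue(cell_type, conditioned_on):
    group, marginal_p = TOY[cell_type]
    shared = any(TOY[c][0] == group for c in conditioned_on)
    return 0.5 if shared else marginal_p

independent = forward_select(TOY, toy_pvalue)
```

In the toy example the loop keeps type_A and type_C and drops type_B, whose association disappears once type_A has been conditioned on.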

We also used S-LDSC to evaluate the heritability explained by each cell type. Here, we observed that 10 cell types achieved p < 0.05/39 (which included all seven of the cell types significant in all three MAGMA analyses): telencephalon projecting excitatory neurons, di- and mesencephalon inhibitory neurons, glutamatergic neuroblasts, di- and mesencephalon excitatory neurons, spinal cord excitatory neurons, cholinergic and monoaminergic neurons, hindbrain neurons, dentate gyrus granule neurons, telencephalon inhibitory interneurons, and spinal cord inhibitory neurons.

Level 6 neurons

We next analyzed the highest resolution cell type groupings available from Zeisel et al. (“level 6”). Specifically, we included the 214 neuronal cell types, as our results (and the results of others90) have found that the BMI nervous system enrichments are predominantly neuronal in origin. When applying MAGMA-40Bins, MAGMA-Top10%, and FUMA to the data, we observed 55, 27, and 44 significant cell types, respectively (p < 0.05/214); in total, 70 unique cell types were implicated by at least one analysis (Figure 4.7). Fourteen cell types were shared by all three methods, including three hindbrain glutamatergic cell types (HBGLU2, HBGLU3, HBGLU9), three midbrain cell types (MEGLU6, MEGLU11, MEINH2), two excitatory thalamus cell types (DEGLU1, DEGLU4), and six cortical excitatory cell types (TEGLU9, TEGLU10, TEGLU11, TEGLU16, TEGLU17, TEGLU21). An additional 20 cell types were shared by MAGMA-40Bins and FUMA, including four hindbrain inhibitory cell types, three monoaminergic/cholinergic hindbrain cell types, and additional glutamatergic hindbrain, midbrain, and cortical cell types.

We then carried out conditional analysis to see if we could determine which cell types were responsible for the observed signal. Pairwise conditional analysis with MAGMA-40Bins, MAGMA-Top10%, and FUMA indicated that no cell type survived stringent Bonferroni correction for all other cell types (p < 0.05/1485, p < 0.05/351, and p < 0.05/946, respectively) (Figures 4.8-4.10). Due to the complex correlation structure between cell types, interpretation of these data is difficult and remains an ongoing challenge. Preliminary visual inspection of the results indicated that there were likely multiple, at least partially independent, associated cell types in each analysis. Clusters of related excitatory cortical cell types (e.g. TEGLU9, TEGLU10, TEGLU17) and hindbrain cell types (e.g. HBGLU9, HBGLU10) were apparent for all three approaches; consistent with the univariate MAGMA results, a possible midbrain signal was much more prominent in MAGMA-40Bins and MAGMA-Top10% than in FUMA. Additionally, in FUMA’s forward selection procedure, HBINH9 and TEGLU17 were chosen to represent distinct signals.

Figure 4.7. Results from MAGMA analysis of level 6 neurons. (a) Only cell types significant in at least one of the analyses are shown (70 out of a possible 214 cell types). The x-axis shows the negative log 10 p-value for significance for each cell type, and cell types are colored by their general neuroanatomical classification. Dashed line is at Bonferroni significance (p < 0.05/214). (b) Number of overlapping significant cell types for each analysis (p < 0.05/214).

Figure 4.8. Pairwise conditional analysis for significant cell types (MAGMA-40Bins). Each square represents the significance of the cell type in the row conditioned on the cell type in the column. Color indicates -log10 p-value (legend on the right). Double asterisks indicate cell types significant after Bonferroni correction (p < 0.05/1485, correcting for the total number of pairwise tests); single asterisks indicate cell types significant after less stringent correction (p < 0.05/55, correcting for the number of significant cell types).

Figure 4.9. Pairwise conditional analysis for significant cell types (MAGMA-Top10%). Each square represents the significance of the cell type in the row conditioned on the cell type in the column. Color indicates -log10 p-value (legend on the right). Double asterisks indicate cell types significant after Bonferroni correction (p < 0.05/351, correcting for the total number of pairwise tests); single asterisks indicate cell types significant after less stringent correction (p < 0.05/27, correcting for the number of significant cell types).

Figure 4.10. Pairwise conditional analysis for significant cell types (FUMA). Each square represents the significance of the cell type in the row conditioned on the cell type in the column. Color indicates -log10 p-value (legend on the right). Double asterisks indicate cell types significant after Bonferroni correction (p < 0.05/946, correcting for the total number of pairwise tests); single asterisks indicate cell types significant after less stringent correction (p < 0.05/44, correcting for the number of significant cell types).
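The stringent thresholds in Figures 4.8-4.10 can be reproduced from the number of significant cell types in each analysis: each denominator equals the number of unordered pairs among those cell types. The sketch below is only an arithmetic illustration, not the code used for the analysis; the function name is hypothetical.

```python
from math import comb

# Number of significant cell types per analysis, taken from the figure captions.
n_significant = {"MAGMA-40Bins": 55, "MAGMA-Top10%": 27, "FUMA": 44}


def pairwise_thresholds(n, alpha=0.05):
    """Return (number of unordered pairs, stringent threshold, lenient threshold).

    The stringent threshold corrects for all unordered pairs of cell types;
    the lenient threshold corrects only for the number of cell types.
    """
    n_pairs = comb(n, 2)
    return n_pairs, alpha / n_pairs, alpha / n


for name, n in n_significant.items():
    n_pairs, strict, lenient = pairwise_thresholds(n)
    print(f"{name}: {n_pairs} pairs, strict p < {strict:.2e}, lenient p < {lenient:.2e}")
```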

We also performed the stepwise procedure described in the level 4 analysis, successively conditioning on the most significant remaining cell type until all signal was abolished. MAGMA-40Bins implicated an excitatory cortical cell type (TEGLU10), a hindbrain excitatory cell type (HBGLU3), afferent nuclei of cranial nerves VI-XII (HBCHO3), and a midbrain excitatory cell type (MEGLU6). In contrast, MAGMA-Top10% implicated a midbrain excitatory cell type (MEGLU2), a telencephalon excitatory cell type (TEGLU11), a thalamic excitatory cell type (DEGLU4), a hindbrain inhibitory cell type (HBINH4), non-border Cck interneurons in the cortex/hippocampus (TEINH12), another midbrain excitatory cell type (MEGLU11), another excitatory telencephalon cell type (TEGLU16), and a hindbrain serotonergic cell type (HBSER1).
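The stepwise procedure can be sketched as a greedy forward selection. This is a minimal illustration, not the implementation used here: `conditional_p` is a hypothetical callback standing in for a conditional MAGMA run, the stopping threshold is an assumption, and the toy cell type names and p-values are invented for the example.

```python
def stepwise_conditional(cell_types, conditional_p, threshold):
    """Greedy forward selection of independent signals.

    conditional_p(cell_type, selected) is a hypothetical callback that returns
    the p-value of `cell_type` conditioned on the already-selected cell types.
    Selection stops when no remaining cell type passes `threshold`.
    """
    selected = []
    remaining = list(cell_types)
    while remaining:
        pvals = {ct: conditional_p(ct, tuple(selected)) for ct in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= threshold:
            break  # all remaining signal has been abolished
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (invented): A1 and A2 are correlated and share one signal.
toy_p = {"A1": 1e-8, "A2": 1e-6, "B1": 1e-5, "C1": 0.2}
toy_cluster = {"A1": "A", "A2": "A", "B1": "B", "C1": "C"}


def toy_conditional_p(cell_type, selected):
    # Conditioning on a cell type from the same cluster removes the signal.
    if any(toy_cluster[s] == toy_cluster[cell_type] for s in selected):
        return 0.5
    return toy_p[cell_type]


print(stepwise_conditional(toy_p, toy_conditional_p, threshold=0.05 / len(toy_p)))
```

In the toy example, A2 is never selected because its signal disappears once A1 is in the model, which mirrors the difficulty of distinguishing highly correlated cell types.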

Finally, we applied S-LDSC to the level 6 cell types. We observed that only two cell types, a thalamic excitatory cell type (DEGLU5) and a telencephalon excitatory cell type (TEGLU4), survived Bonferroni correction (p < 0.05/214). Given the high number of significant cell types (48) reported by Bryois et al.117 using S-LDSC, this was surprising. Further investigation indicated that the likely reason for this difference is that S-LDSC provides two sets of p-values. We report p-values associated with the S-LDSC coefficient, which is a measure of how much heritability is contributed by the cell type of interest after accounting for other annotations in the model (e.g. being in an exon, being in a DNase hypersensitivity site, etc.). Bryois et al., however, appear to have reported S-LDSC p-values associated with heritability enrichment, which is calculated as the proportion of heritability explained by the annotation divided by the proportion of SNPs in that annotation. Enrichment thus does not depend on the other categories in the model. Therefore, the two S-LDSC analyses measure different quantities.
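To make the distinction concrete, the two quantities can be written out as follows. The notation is assumed for this illustration rather than taken from this chapter: h^2_g(C) is the heritability attributed to SNPs in annotation C, M(C) is the number of SNPs in C, M is the total number of SNPs, N is the GWAS sample size, l(j, C) is the LD score of SNP j with respect to annotation C, and a captures confounding.

```latex
% Heritability enrichment of annotation C, as described in the text:
\mathrm{Enrichment}(C) = \frac{h^2_g(C) / h^2_g}{M(C) / M}

% S-LDSC regression model; \tau_C is the coefficient for annotation C,
% estimated jointly with the coefficients of all other annotations:
\mathbb{E}\left[\chi^2_j\right] = N \sum_{C} \tau_C \, \ell(j, C) + N a + 1
```

A test on the coefficient asks whether annotation C contributes per-SNP heritability beyond the other annotations in the model, whereas a test on enrichment asks whether SNPs in C explain more heritability than their share of all SNPs would suggest.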

Discussion

We have performed a preliminary analysis attempting to use scRNA-seq and GWAS data to identify nervous system cell types relevant to BMI. Our results corroborate previous observations that the BMI nervous system enrichment is driven predominantly by neurons rather than other nervous system cell types, such as astrocytes and oligodendrocytes. Using multiple different approaches, we found significant results for many different types of neurons in different brain regions, prominently cortical excitatory neurons, hindbrain neurons, di- and mesencephalon neurons, and cholinergic/monoaminergic neurons.

We have observed a very large number of significant cell types, which do not clearly point to one or two regions of importance. The univariate level 4 results were somewhat consistent across different methods, with multiple methods indicating significance for (1) hindbrain, (2) di- and mesencephalon excitatory neurons, (3) di- and mesencephalon inhibitory neurons, (4) cholinergic and monoaminergic neurons, (5) telencephalon projecting excitatory neurons, (6) spinal cord inhibitory neurons, and (7) glutamatergic neuroblasts. This indicates that many neurons in many parts of the central nervous system are likely to be relevant and supports the idea that the neuronal relevance to BMI extends far beyond the well-characterized hypothalamic appetite pathway184 (which is part of the di- and mesencephalon class).

The association with the hindbrain is intriguing, as it may relate to the signaling pathways between the gut and brain; the vagus nerve (which is located in the hindbrain and is also known as cranial nerve X) receives afferent projections from and sends efferent projections to the gut. Of note, in the level 6 results, two of the three analyses identified “afferent nuclei of cranial nerves VI-XII” as significant. In addition, two of the cell types significant across all three analyses were located in the nucleus of the solitary tract, which receives these vagal afferent projections from the gut190. Signaling between the vagus and gut is known to play an important role in feeding behavior; indeed, vagal blocking is an approved treatment for obesity191.

The precise nature of a possible relationship with neurons in the telencephalon is unclear, but one possibility is that these neurons relate to reward circuitry. In line with this hypothesis, work using neuroimaging in humans has found that obese individuals show altered cortical and subcortical responses to visual and gustatory food cues, especially in brain regions known to be associated with reward190,192. The most well-established reward pathway involves dopaminergic neurons in the ventral tegmental area (VTA) (in the midbrain), which project to the striatum (in the telencephalon). Two cell types within the striatum or VTA were found significant in our BMI analysis (both by MAGMA-40Bins only): MBDOP2 (“dopaminergic neurons, ventral midbrain [substantia nigra, VTA]”) and TECHO (“cholinergic interneurons, telencephalon,” annotated as striatum and amygdala). However, although medium spiny neurons (MSNs) represent ~95% of neurons in the striatum193, no MSN cell types were significant in any of our analyses.

However, further dissection of these relationships was challenging. First, we observed that, in general, the univariate level 6 results varied substantially between our three analyses. For example, as mentioned above, MBDOP2 was significant in MAGMA-40Bins only; we did not observe any other significant dopaminergic or medium spiny neuron cell types. Therefore, it is difficult to say whether our work meaningfully suggests that changes in dopamine signaling might be relevant to obesity etiology. Additionally, when we applied conditional analysis, the results diverged even further, with the driver cell types fairly inconsistent between methods. For example, in the level 4 analysis, MAGMA-40Bins, MAGMA-Top10%, and FUMA indicated that the most significant cell types by both univariate and conditional analysis were hindbrain, telencephalon projecting excitatory neurons, and di- and mesencephalon neurons, respectively.

A major difficulty in resolving these results lies in distinguishing between biological and technical questions: how biologically meaningful are the differences between the results? Each approach makes different assumptions about the relationship between expression of a gene in a trait-relevant cell type and the probability that variants in that gene affect that trait (e.g. the top-10% versus specificity bins approach). Questions about the proper definition and scope of “specificity” are inherent in this issue: how do we calculate specificity, how do we handle substantial and complex correlations between cell types, and, most broadly, does specificity actually measure the biological quantity that is most relevant? To this point, the authors of FUMA have proposed a framework that does not use a calculated specificity metric per se, instead conditioning on average gene expression. They argue that because average expression itself can be correlated to phenotypic outcome, this is a necessary approach to correct for confounding; however, it is unclear whether such a correlation represents a metaphorical “bug” or a “feature” (that is, the correlation between average expression and phenotype might be real biological signal rather than a confounder). Regardless, as mentioned above, this paradigm still fundamentally relies on the “specificity” concept, as correcting for average expression involves deciding which cell types to compare against the cell type of interest.

A more thorough comparison of our results with those of others who have analyzed scRNA-seq data and BMI (Watanabe et al.119 and Bryois et al.117, both of whom used level 6 data) is warranted. For example, many of our significant results overlap with those of Bryois et al., but they observed many more significant cell types with MAGMA. Of the 55 significant cell types we observed with MAGMA-40Bins, they found 54 significant (they did not evaluate the remaining cell type because it did not pass their quality control filters); however, they also found 98 additional significant cell types. There are many possible methodological reasons for this difference (for example, they used different quality control steps, different GWAS summary statistics of BMI, and a different p-value threshold for multiple testing correction), and it will be important to evaluate what these differences are and why they occurred.

One future direction would be to implement simulations of trait-associated variants and gene expression. We could use these simulations to systematically vary different parameters and assess how well our different methodological approaches perform at identifying relevant cell types. This is of course complicated by the fact that we do not know which underlying models are most likely to be biologically accurate, but this would at least allow us to have a better understanding of how each model performs under different scenarios.

Another future direction is to develop a more sophisticated way of interpreting our pairwise conditional analysis data. In particular, we would like to know, given a set of such data, how many independent signals there are likely to be and which cell types might be responsible for each such signal. Ideally, for each independent signal, we would be able to generate a list of multiple possibly causal cell types, as our preliminary analyses here indicate that we may not be able to distinguish between highly correlated cell types. Such an approach might involve some type of clustering and/or forward selection procedure. It will also be useful in this context to determine the best way to correct for multiple testing. Here, we have generally used Bonferroni correction for the number of tests, which, for the pairwise analyses, is quite large (sometimes > 1000). Given that these tests are not all independent because of the strong correlations between cell types, it would be useful to develop a method for determining the appropriate number of tests to correct for, which will facilitate interpretation. Additionally, we will likely wish to incorporate information not only about the significance of each pairwise test but also about how that significance compares to the original p-values of the two tested cell types; this will be helpful in interpreting the conditional analysis results.

In conclusion, our preliminary analysis indicates some possible cell types of interest for BMI, but much more work remains to be done in clarifying, interpreting, and harmonizing our results. We hope this work raises questions other scientists may find useful to consider as pipelines for this type of analysis are developed and refined.

URLs

Single-cell RNA sequencing dataset from Zeisel et al.: www.mousebrain.org

LDSC software: https://github.com/bulik/ldsc (version 1.0)

LD scores and other files for LDSC: https://data.broadinstitute.org/alkesgroup/LDSCORE/

MAGMA software: https://ctg.cncr.nl/software/magma (version 1.07)

Acknowledgements

This research has been conducted using the UK Biobank resource. This work was supported by NHGRI NIH F31HG009850 (R.S.F.) and NIDDK NIH R01DK075787 (J.N.H.).

Chapter 5:

Discussion

In this section, I will review the major findings and conclusions from the work I have described in this thesis. Then, I will discuss some of my thoughts on ways this work can be extended and how I believe it fits into the broader field.

Major findings and implications

Chapter 2

In Chapter 2, we described EC-DEPICT, a gene set enrichment analysis method designed specifically for ExomeChip (EC) data12,15,16,18,160. The ExomeChip targets mainly rare and low-frequency coding variants, which are generally easier to interpret than variants in the noncoding region. EC-DEPICT leverages predictions of gene function that were developed several years ago in the form of “reconstituted” gene sets, which enable the use of information across all potentially trait-associated genes even if those genes do not yet have known functions40,41. In collaboration with the GIANT and MAGIC Consortia, we applied EC-DEPICT to a wide variety of traits to identify significantly enriched biological pathways and to prioritize genes which seem to be especially strong drivers of observed biological enrichments12,15,16,18,160.

First, I will summarize a few of the interesting trait-specific observations enabled by EC-DEPICT. For height, we observed that coding variants identified by the ExomeChip more strongly implicated pathways specific to skeletal growth such as bone growth, while noncoding variants identified by GWAS more strongly implicated general processes such as transcription factor binding15. We also prioritized eight genes as possibly undiscovered causes of monogenic disorders of skeletal growth (one of which has now been independently validated135). Next, for body mass index, we independently replicated a previous enrichment observed in GWAS of nervous-system-relevant gene sets, such as synaptic function12. We also observed that rare and low-frequency coding variants may fall into the causal gene more often than common variants, providing evidence that searching for these variant associations may be valuable even though they tend to explain very little heritability (i.e. because they help with causal gene identification). Finally, for 2-hour glucose, we prioritized the gene CTRB2, which harbored a suggestively significant variant (meaning it would likely have been overlooked as a potential candidate)18. CTRB2 is particularly interesting because it encodes a digestive enzyme and therefore suggests a potentially exocrine role for the pancreas in diabetes risk in addition to its well-established endocrine role146,147. In addition, other variants in this locus have been associated with a number of other relevant phenotypes, such as type 2 diabetes, glucagon-like peptide 1-stimulated insulin secretion, and pancreatitis148,149,151,152.

More generally, we conclude that EC-DEPICT can be used to identify novel trait biology, replicate previously discovered trait biology, and prioritize potentially causal genes for follow-up. In particular, EC-DEPICT is useful for identifying genes at subthreshold levels of trait association that show strong biological support for causality. We believe that this approach is an important addition to the literature, as, at the time of our work, many studies on the ExomeChip were being conducted and there was no well-established method for gene set enrichment analysis on this type of data. The ability to perform gene set enrichment analysis on rare and low-frequency variation represents an important step forward, as well-powered data including such variants has become increasingly available. Careful consideration of how different types of data (e.g. GWAS versus ExomeChip) should be handled is an important challenge which we addressed in this work; this challenge will continue as the number of trait-associated loci from other data types (e.g. exome sequencing and whole-genome sequencing) increases, creating a stronger need for computational tools that can assist in interpretation.

Chapter 3

In Chapter 3, we described Benchmarker, a method for benchmarking gene and variant prioritization algorithms for GWAS194. Most existing approaches for benchmarking these types of algorithms involve the use of “gold standards,” lists of genes already known to be involved in a tested phenotype. However, gold standard genes are biased toward well-studied and well-characterized biology, which means that they are not optimally suited for benchmarking purposes: they will punish a method that successfully identifies novel biology. Therefore, we developed Benchmarker, which uses a leave-one-chromosome-out approach combined with stratified LD score regression (S-LDSC)53 to evaluate method performance in a manner that depends solely on the original GWAS data. We measure performance based on per-SNP heritability explained, a useful metric with an intuitive and meaningful definition. Benchmarker can be applied to any method that prioritizes genes or variants from GWAS based on their similarity to one another (e.g. by identifying shared biological pathways, patterns of tissue expression, etc.).
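The leave-one-chromosome-out idea can be sketched as a simple loop. This is a schematic under stated assumptions, not the Benchmarker code: `prioritize` is a hypothetical callback that is assumed to run a prioritization method on GWAS data from the training chromosomes only and to return prioritized genes on the withheld chromosome. The combined gene list would then define the SNP annotation passed to S-LDSC, which is not shown.

```python
def loco_prioritize(chromosomes, prioritize):
    """Collect prioritized genes across leave-one-chromosome-out folds.

    prioritize(training, withheld) is a hypothetical callback: it sees GWAS
    data only from `training` chromosomes and returns the genes it prioritizes
    on the `withheld` chromosome.
    """
    chromosomes = list(chromosomes)
    prioritized = set()
    for withheld in chromosomes:
        training = [c for c in chromosomes if c != withheld]
        prioritized.update(prioritize(training, withheld))
    return prioritized


def toy_prioritize(training, withheld):
    # Toy stand-in: checks that the withheld chromosome was never used for training.
    assert withheld not in training
    return {f"gene_on_chr{withheld}"}


genes = loco_prioritize(range(1, 23), toy_prioritize)
print(len(genes))
```

Because each gene is prioritized without using association data from its own chromosome, the heritability later attributed to SNPs near prioritized genes is not a by-product of the signal that drove the prioritization.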

First, I will summarize some of our findings from specific comparisons. We observed that different methods and data sources find (often fairly different) groups of genes that each contribute significantly to heritability. In our first comparison, we observed that performing prioritization using DEPICT with gene sets outperformed using DEPICT with tissue expression. Then, when using MAGMA30 results to perform gene-set-based prioritization, DEPICT and MAGMA performed similarly despite prioritizing very different sets of genes; we also found that both DEPICT and MAGMA outperformed NetWAS76. Finally, we noted that across many of these comparisons, selecting genes prioritized by more than one approach can often outperform the use of genes prioritized by exactly one approach.

More generally, we concluded that Benchmarker fills a pressing need in the field for an unbiased, powerful, and well-controlled method for testing the efficiency of prioritization when using different algorithms, data types, and combinations thereof. We also observed that gene set enrichment analysis results can effectively be used to prioritize genes; Benchmarker can be used to evaluate the performance of this approach. Importantly, Benchmarker can also be used to find the most appropriate prioritization method for a given set of GWAS data, i.e. on a phenotype-specific level. This is significant because different methods may perform better for different traits, and therefore having phenotype-level resolution is critical for maximal performance gain.

Until now, there has been no unbiased standard for benchmarking new prioritization algorithms when they are developed, leading to the use of many different approaches and thus difficulty in comparing published algorithms; Benchmarker thus provides a clearer path forward. We believe that cross-validation-style frameworks such as the one we propose here should provide a guideline for development of future prioritization (and perhaps even enrichment) methods, and we hope that future developers of such methods will use Benchmarker to compare their approach to existing approaches and to fine-tune their parameter choices.

Chapter 4

In Chapter 4, we performed a preliminary analysis using a single-cell RNA sequencing dataset of the mouse nervous system to identify relevant cell types for body mass index (BMI). In particular, we were interested in identifying cell types for which the most specifically expressed genes showed genetic evidence of association with BMI. We calculated specificity scores for each gene in each cell type, using a novel weighting scheme to downweight contributions from highly related cell types and upweight contributions from unrelated cell types, and used these as input to MAGMA and stratified LD score regression. We also employed an analysis implemented within the FUMA software, which is based on MAGMA but uses average expression rather than specificity scores. In addition, we transformed our data in multiple ways (creating specificity bins versus making a gene set of the most specifically expressed genes). Finally, we used conditional analysis in MAGMA to see if we could find cell types that were the strongest drivers of observed enrichments (i.e. disentangling associations between related cell types).
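The two transformations of specificity scores can be contrasted in a short sketch. This is illustrative only: the function names are hypothetical, the bins are assumed to be equal-sized rank bins, and the exact binning and cutoff rules used in Chapter 4 may differ (for example in how non-expressed genes are handled). The toy scores are invented.

```python
def top_fraction_gene_set(specificity, fraction=0.10):
    """Return the most specifically expressed genes as a binary gene set."""
    ranked = sorted(specificity, key=specificity.get, reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return set(ranked[:n_top])


def specificity_bins(specificity, n_bins=40):
    """Assign each gene to one of n_bins equal-sized rank bins (0 = least specific)."""
    ranked = sorted(specificity, key=specificity.get)
    return {gene: (rank * n_bins) // len(ranked) for rank, gene in enumerate(ranked)}


# Toy specificity scores for one cell type (invented values).
toy_specificity = {f"g{i}": i / 100 for i in range(100)}

top_genes = top_fraction_gene_set(toy_specificity)
bins = specificity_bins(toy_specificity)
print(len(top_genes), bins["g0"], bins["g99"])
```

The cutoff approach treats all genes above the threshold as equivalent and discards the rest, whereas the binned approach keeps a graded covariate across all genes; these correspond to different assumptions about how specificity relates to trait relevance.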

We analyzed nervous system cell types based on two different levels of cell type classification: “level 4” (fairly large-scale groupings) and “level 6” (more fine-grained groupings). Within the level 4 cell type grouping, we found significant enrichments only in neurons and most often in the central nervous system. Seven particular cell types stood out as significant in all of our analyses: (1) hindbrain, (2) di- and mesencephalon excitatory neurons, (3) di- and mesencephalon inhibitory neurons, (4) cholinergic and monoaminergic neurons, (5) telencephalon projecting excitatory neurons, (6) spinal cord inhibitory neurons, and (7) glutamatergic neuroblasts. These findings indicate that, as suggested by gene set and tissue enrichment analysis of BMI GWAS data, the relevance of the nervous system to BMI is not restricted to the hypothalamic appetite control centers. In the level 6 analysis, we observed fourteen cell types significant across all MAGMA analyses, including cell types in the hindbrain (glutamatergic), midbrain (glutamatergic and GABAergic), thalamus (glutamatergic), and cortex (glutamatergic). Interestingly, two of these hindbrain cell types are located in the nucleus of the solitary tract, which receives vagal input from the gut; additionally, two of our three analyses implicated the sensory vagal afferents themselves. This observation contributes to a growing body of work on the importance of vagal-gut communication in obesity191. We further hypothesize that the enrichments we observed from other parts of the brain may be involved in neural reward circuitry.

More generally, we observed that, although some of our findings may represent interesting new hypotheses, results were not as consistent between different methods as we would prefer them to be. This was most evident in the level 6 analyses, where there was a fair amount of variability in the significant cell types for each of our methods, and in our conditional analyses, where driver cell types also seemed to differ substantially. We emphasize that more work needs to be done in the field to determine which modeling assumptions are most reasonable (e.g. binning versus cutoff approaches), more thoughtfully define specificity (including consideration of whether this is the most appropriate metric for the biological quantity of interest), and interpret complex conditional analysis results (the authors of FUMA have implemented a forward selection procedure which is an important step in this direction). We hope our analysis can be a good starting point for consideration of some of these important issues.

Future directions

Improved linking of noncoding elements to genes

One of the most substantial limitations of the Benchmarker algorithm is the largely unsolved problem of linking noncoding variants to the genes they regulate. This is a critical issue for correctly evaluating the heritability explained by prioritized genes because S-LDSC evaluates variants rather than genes. To deal with this problem, we used a distance-based window to assign all SNPs near a given gene to that gene; this is a commonly used approach (e.g. 90,128), but it is obviously not an ideal solution, because (1) many such SNPs will not actually be associated with the gene, (2) some SNPs may regulate genes at a distance greater than our window size, and (3) using a window requires a necessarily arbitrary choice of its size. Many resources to improve noncoding variant annotations and variant-gene linking have been and continue to be published, such as genome-wide maps generated by the ENCODE, Roadmap, and FANTOM Consortia195–198. Additionally, researchers are working to develop improved methods to predict enhancer-gene connections; for example, Fulco et al. recently published an approach predicting enhancer-gene connections based on a large-scale CRISPR interference screen and an “activity-by-contact” model199. We hope to eventually integrate data from these types of approaches into Benchmarker to improve its accuracy.
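A distance-based window assignment of the kind described above can be sketched as follows. This is a minimal illustration, not the Benchmarker implementation: the function name, the input tuple layout, the window size, and the toy coordinates are all assumptions made for the example.

```python
def assign_snps_to_genes(snps, genes, window):
    """Assign each SNP to every gene whose boundaries lie within `window` bp.

    snps:  list of (snp_id, chromosome, position)
    genes: list of (gene_id, chromosome, start, end)
    Returns a dict mapping gene_id to the list of assigned snp_ids.
    """
    assignments = {gene_id: [] for gene_id, _, _, _ in genes}
    for snp_id, snp_chrom, pos in snps:
        for gene_id, gene_chrom, start, end in genes:
            if snp_chrom == gene_chrom and start - window <= pos <= end + window:
                assignments[gene_id].append(snp_id)
    return assignments


# Toy coordinates (invented) with a hypothetical 50 kb window.
toy_genes = [("GENE_A", "1", 100_000, 120_000), ("GENE_B", "1", 200_000, 210_000)]
toy_snps = [("snp_a", "1", 95_000), ("snp_b", "1", 160_000),
            ("snp_c", "1", 215_000), ("snp_d", "2", 105_000)]

print(assign_snps_to_genes(toy_snps, toy_genes, window=50_000))
```

In the toy data, snp_b falls within the window of both genes, which illustrates limitation (1): a window cannot tell which, if either, gene a SNP actually regulates.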

Benchmarking many combinations of algorithms and data types

The Benchmarker approach will hopefully enable a large-scale effort to determine, on a phenotype-specific level, which combinations of algorithm and data type are the most effective for prioritization. This is an even more complex and necessary endeavor because the next generation of approaches will likely combine many types of data together; for example, we have begun a collaboration with Dr. Hilary Finucane and Elle Weeks, who are developing a prioritization method based on gene features derived from 12 gene expression data sets, 5 pathway databases, and a protein-protein interaction network200. The ability to test features individually and combinatorially will be beneficial to developing the best possible version of this method.

As an example, there are now massive repositories of tissue and cell type expression data, spanning the entire human body. Applying Benchmarker to test which of these data sets or combination thereof is most useful for prioritization for a given phenotype may be very informative. For instance, we could ask whether prioritization based on scRNA-seq data performs better than prioritization based on bulk RNA-seq. Even more specifically, we could ask if, for a brain-relevant phenotype, prioritization explains the most heritability when based on expression patterns across (1) many different tissues across the body, including many different brain regions, (2) many different tissues across the body, including a single representative brain region, (3) tissues from many different brain regions, or the equivalent of (1)-(3) using cell types rather than tissues (i.e. based on scRNA-seq rather than bulk RNA-seq). As another example, different types of network-based approaches could be compared using many different biological networks that have been developed, such as STRING201, GIANT76, and HumanNet74. We hope that scientists will take advantage of this capability when developing new methods and when constructing new networks and other types of data sets.

Gene prioritization and benchmarking power in the era of biobanks

GWAS sample sizes have skyrocketed in recent years, leading to massive increases in the number of loci achieving genome-wide significance for many traits. For example, a height GWAS published in 2014 (N = 253,288) identified 697 independent signals in 423 loci69. However, a height GWAS published in 2018 with a much larger sample size (N = ~700,000) identified 3,290 independent signals in 712 loci183. This rapid acceleration offers tantalizing opportunities for better understanding of trait biology, but it has also presented a thorny problem: how meaningful do we consider each additional new locus to be? Newly discovered common variant associations for already well-studied traits nearly always have very small effect sizes.

Small effect sizes, of course, do not necessarily imply that the relevant gene is unimportant to the underlying biology – the effect of a small regulatory tweak may cause a correspondingly small change in phenotype, but it may well be that a more substantial effect on gene function (e.g. from a loss of function variant, a highly deleterious missense variant, or a pharmacological perturbation) would produce large phenotypic consequences. Nevertheless, there is no consensus on the best way to handle this problem, and it has raised a number of difficult questions. Should we increase thresholds for genome-wide significance? How might this affect our ability to use gene set associations to discover underlying biology? For instance, do we expect that as the number of associated variants increases, eventually the strength of gene set enrichment results may actually begin to decline as genes with less immediate relationships to the trait of interest are included?

In collaboration with Eric Bartell, a graduate student in the Hirschhorn lab, I have been considering ways in which Benchmarker might be used to help address this issue. One idea we have proposed is to conduct a GWAS at a range of sample sizes, starting from a relatively small number (e.g. N = 2,000) and ending at a high one (e.g. N = 500,000). We can then apply prioritization methods such as DEPICT to the GWAS at each sample size and use Benchmarker to evaluate performance. This might give us a window into the relationship between increasing sample sizes and ability to discover biology: will the relationship saturate and flatten (or even begin to drop off) at some sample size? Will it continue to increase, and if it does continue to increase, does it do so at a consistent rate or does the slope begin to flatten? Of course, there is a caveat here that each result will be specific to the particular prioritization methods used, but, even so, evaluating this relationship across a variety of traits at differing levels of heritability and across many different methods may produce useful insight. For prioritization methods that use a p-value cutoff, we could also use this approach to test the effect of varying the threshold for genome-wide significance.

Improvements to gene set enrichment analysis

Gene set enrichment analysis is an important tool, but it has also been the target of criticism. Critics ask (and fairly so): how much do these results really tell us, specifically and mechanistically, about trait biology? For example, DEPICT finds many significant nervous-system-relevant gene sets to be enriched within BMI genetic associations. This itself of course is a valuable observation. However, what we would really like to do is ask what specifically within the nervous system is relevant to BMI. For example, which neurotransmitters are most relevant? Are there particular parts of the neuron whose function is perturbed? In theory, the EC-DEPICT analysis of BMI answers some of these questions; significant gene sets include components of the neuron (e.g. “synapse,” “axon,” and “dendrite”) and neurotransmitters (e.g. “synaptic transmission, glutamatergic,” and “abnormal dopamine level”)12. However, I am not confident that these results really provide such specific insight, mostly because it is likely that most nervous-system-relevant gene sets are strongly correlated with each other: it is difficult to distinguish a specific association from a general one, and that general effect may be driven mostly by expression in the brain and/or the larger nervous system. Therefore, seeing these gene sets as a group gives me some confidence that there is a relationship between BMI and the nervous system, perhaps even BMI and neuronal signaling, but I would hesitate to conclude that, for instance, a significant enrichment of a dopamine-related gene set truly specifically implicates dopamine relative to other neurotransmitters. The difficulty of obtaining mechanistic insight is even clearer with other traits such as fasting glucose, where the significant gene sets as a whole implicate already well-known biology18. In this case, the real value of gene set enrichment analysis would be true specificity, the ability to distinguish between multiple related possible mechanisms.

Though this is a serious obstacle, I believe the field is beginning to work through some of these issues to (1) reliably differentiate between similar gene sets and (2) test whether a gene set association is driven mainly by expression in a trait-relevant tissue. In particular, the developers of MAGMA have done some rigorous and thoughtful research in this area. Because MAGMA’s general framework is regression-based, it is especially flexible in terms of what types of analyses can be performed. To wit, de Leeuw et al.202 recently released a new version of MAGMA that can easily perform conditional and interaction analyses; that is, multiple gene sets or continuous covariates such as tissue expression can be modeled together to test whether one association is dependent on another and/or whether two associations are more significant together than either individually (i.e. an interaction effect). The ability to perform these types of analyses will improve the precision of gene set enrichment analysis results tremendously; we can now directly test whether two gene set associations are independent of each other, or whether a gene set association is really driven solely by an effect of tissue expression. For example, would the association of the gene set “abnormal dopamine level” disappear if it were tested in a MAGMA model along with tissue expression data from different brain regions? I also note two relevant points from our work on Benchmarker. First, we directly compared the efficacy of prioritization based on gene sets versus tissue expression and observed that gene sets performed somewhat better, indicating that, at least in some cases, gene sets are likely to contain information beyond the effect of expression only. Second, we used the reconstituted gene sets from DEPICT as input to MAGMA, which combines what I think are the best features of each method: DEPICT’s gene set predictions and MAGMA’s flexible statistical framework194. I believe that evaluating gene sets in combination with tissue (and cell type) expression within conditional and interaction models is the way forward, and I expect to see many future methods using and expanding these approaches.
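As a minimal sketch of what such a conditional test looks like, the simulation below uses ordinary least squares rather than MAGMA’s actual gene-level model (which, among other things, accounts for correlation between genes). The variable names, effect sizes, and simulated data are all illustrative assumptions. Gene-level z-scores are driven only by brain expression, yet a gene set enriched for brain-expressed genes appears associated until expression is added as a covariate.

```python
import numpy as np

def ols_t_stats(y, X):
    """Fit y = intercept + X b + e by least squares; return coefficients and t-statistics."""
    X = np.column_stack([np.ones(len(y)), X])        # prepend an intercept column
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]                    # residual degrees of freedom
    sigma2 = resid @ resid / dof                     # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)            # covariance of the estimates
    return beta, beta / np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n_genes = 5000

# Hypothetical per-gene covariate: standardized expression in brain tissue.
brain_expr = rng.normal(size=n_genes)

# Hypothetical gene set whose members tend to be brain-expressed.
in_set = (brain_expr + rng.normal(size=n_genes) > 1.5).astype(float)

# Simulated gene-level association z-scores: they depend on brain expression
# only, so the gene set has no effect of its own.
gene_z = 0.3 * brain_expr + rng.normal(size=n_genes)

# Marginal model: gene set membership alone.
_, t_marginal = ols_t_stats(gene_z, in_set)

# Conditional model: gene set membership plus the tissue expression covariate.
_, t_conditional = ols_t_stats(gene_z, np.column_stack([in_set, brain_expr]))

print("t for gene set, marginal model:   ", t_marginal[1])
print("t for gene set, conditional model:", t_conditional[1])
```

In this toy setting the gene set coefficient is large in the marginal model and shrinks toward zero once expression is included, which is the pattern one would expect if an apparent gene set association were really a tissue expression effect.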

I also believe that gene set enrichment analysis has been underappreciated in its ability to prioritize causal genes. In my opinion, the most compelling aspect of our EC-DEPICT results was the genes we highlighted as having particularly strong biological evidence for causality; the fact that we predicted a novel Mendelian gene for height that nearly immediately was independently found to be so is strong validation for our approach15,135. Most gene set enrichment analysis methods do not directly include a prioritization component (though there are exceptions203,204), which is a wasted opportunity: as I have discussed, gene set enrichment analysis often produces gene sets with known and self-explanatory biological links, but the opportunity to identify potentially novel genes is often substantial, especially with the inclusion of genes at subthreshold levels of phenotype association. Our Benchmarker results also emphasize that this approach to prioritization works surprisingly well even when implemented in fairly crude fashion; using MAGMA and DEPICT to prioritize the genes that were the strongest predicted members of the most significant gene sets showed strong performance based on S-LDSC194. More work in the area of linking gene set enrichment to prioritization is therefore warranted. For example, we would like to develop a method to quantify the visually intuitive heatmap-based prioritization approach we have used with EC-DEPICT, particularly incorporating the unsupervised clustering component.
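One crude version of this kind of prioritization can be sketched as follows. The scoring rule (maximum membership z-score across the top-ranked gene sets), the cutoffs, and the toy gene names and values are assumptions for illustration, not the exact procedure used in Benchmarker.

```python
import numpy as np

def prioritize_genes(membership_z, set_pvalues, gene_names, n_sets=2, n_genes=3):
    """Rank genes by their strongest predicted membership in the top gene sets.

    membership_z : array of shape (n_gene_sets, n_genes), where a higher value
                   means a stronger predicted membership of the gene in the set
    set_pvalues  : enrichment p-value for each gene set
    """
    membership_z = np.asarray(membership_z, dtype=float)
    top_sets = np.argsort(set_pvalues)[:n_sets]      # most significant gene sets
    score = membership_z[top_sets].max(axis=0)       # best membership per gene
    order = np.argsort(-score)[:n_genes]             # highest-scoring genes first
    return [gene_names[i] for i in order]

# Toy input: three hypothetical gene sets (rows) and four hypothetical genes (columns).
gene_names = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
set_pvalues = [1e-6, 0.4, 1e-4]
membership_z = [
    [3.1, 0.2, -0.5, 1.0],
    [0.0, 4.0, 0.1, 0.2],
    [0.5, 0.1, 2.5, 2.9],
]

prioritized = prioritize_genes(membership_z, set_pvalues, gene_names)
print(prioritized)  # ['GENE_A', 'GENE_D', 'GENE_C']
```

GENE_B scores highly only in the non-significant gene set, so it is not prioritized; a quantitative version of the heatmap-based approach would need a more principled way to weight gene sets and membership scores than this cutoff-based rule.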

Another challenge in the field is that gene set enrichment analysis is also generally rather difficult to benchmark, as simulations require many parameters and assumptions (though heroic efforts have been made25). Our work suggests that, given the good performance of gene set enrichment analysis methods in gene prioritization, applying Benchmarker to gene set enrichment analysis methods (in addition to algorithms designed explicitly for gene prioritization) may be another way to test method performance and improve rigor. This may provide an attractive orthogonal and complementary approach to simulations and related strategies.

The future of gene set enrichment and gene prioritization

The ultimate goal of human genetics is to better understand the biology underlying complex traits and to use that biology to develop improved therapeutic approaches. To advance possible candidates into drug development, however, we must be highly confident that proposed targets are truly relevant to the disease. To attain this level of confidence, experimental dissection of GWAS loci is the gold standard (e.g. 205,206); unfortunately, this process remains time-consuming, expensive, and technically challenging. Computational approaches are thus critical so that we can make the best possible guesses in choosing genes for experimental follow-up, and I believe that gene set enrichment and gene prioritization methods are important tools for doing so. Combining evidence from these approaches with other complementary ones (e.g. statistical fine-mapping, integration with regulatory and/or molecular quantitative trait loci information) can generate lists of possible gene targets with as many lines of supporting evidence as possible. Ideally, the best candidates will have independent statistical and functional (e.g. membership in relevant pathways and/or expression in relevant tissues) support. For example, a fine-mapping approach might identify a GWAS locus with a high-probability causal coding variant. If the gene harboring that variant is also known to participate in a relevant (enriched) biological process for the trait of interest, these together might be good evidence for potential causality of that gene. Gene set enrichment analysis has the added benefit of generating biological hypotheses, so even if none of the GWAS genes are good drug targets, other genes that are members of the same pathways may be considered. For example, GWAS-derived association of the IL-23R pathway with ankylosing spondylitis has led to clinical trials for blockers of IL-17, a member of the same pathway which did not itself achieve genome-wide significance but was a promising drug target4,207. In combination with the arsenal of other computational approaches that have been developed in the last ten years, gene set enrichment analysis and gene prioritization are thus likely to have increasingly important roles in interpretation and translation of GWAS data.


Conclusions

It is hard to believe that the Human Genome Project, which set out to sequence the entire human genome for the first time, was only completed in 2003. The field of human genomics that sprang up in its wake has moved at breakneck speed, progressing from GWAS of thousands of individuals to hundreds of thousands and now to millions. But if GWAS has taught us anything, it is that complex trait genetics is just that: incredibly complex. Many traits have been associated with thousands of variants, most associations are within hard-to-interpret noncoding regions (areas once termed “junk DNA” because they were thought to be largely nonfunctional), and extensive linkage disequilibrium within the human genome has turned out to be a blessing and a curse: it has enabled imputation and thus genome-wide studies at a fraction of the cost of sequencing, but it has frustrated attempts to pinpoint causal variants.

In this dissertation, I was primarily interested in this difficult task of interpreting GWAS data and using it to understand biology. The methods we have developed for gene set enrichment analysis of a particular type of data and for benchmarking prioritization algorithms contribute to a growing mountain of work in this area, work which has become more urgent with the explosive growth of biobanks and the possibility of genetics research on an unprecedented scale. We look to a future in which, someday, the interpretation of GWAS summary statistics becomes routine and straightforward, enabling important mechanistic insights and accelerating development of therapeutics for human disease.


Appendix A: Supplementary materials for Chapter 2

Supplementary figures for Chapter 2


Figure S2.1. Heatmap showing EC-DEPICT results for height-associated coding ExomeChip variants independent of known GWAS signals (related to Figure 2.2). Rows in the heatmap represent significant meta-gene sets (FDR < 0.01). For any given square, the color indicates how strongly the corresponding gene (shown as columns) is predicted to belong to the reconstituted gene set (rows). This value is based on the gene’s z-score for gene set inclusion in DEPICT’s reconstituted gene sets, where red indicates a higher z-score and blue indicates a lower one. To visually reduce redundancy and increase clarity, we chose one representative meta-gene set for each group of highly correlated gene sets, based on affinity propagation clustering (Methods). Heatmap intensity and DEPICT p-values correspond to the most significantly enriched gene set within the meta-gene set; meta-gene sets are listed with their database source. The proteoglycan-binding pathway (bold) was uniquely implicated by coding variants by DEPICT and PASCAL. Annotations for the genes indicate whether the gene has OMIM annotation as underlying a disorder of skeletal growth (black and grey) and the MAF of the significant ExomeChip variant (shades of blue; if multiple variants, the lowest-frequency variant was kept). Annotations for the gene sets indicate if the gene set was also found significant for ExomeChip by PASCAL (yellow, orange and grey) and if the gene set was found significant by DEPICT for ExomeChip only or for both ExomeChip and GWAS (purple and green). GO, Gene Ontology; MP, mouse phenotype in the Mouse Genetics Initiative; PPI, protein–protein interaction in the InWeb database.


Figure S2.2. Heatmaps showing full EC-DEPICT gene set enrichment results for BMI. Panel (a) shows results from analysis of rare and low-frequency nonsynonymous variants associated with BMI at p < 5 × 10⁻⁴ (related to Figure 2.3), panel (b) shows results from rare and low-frequency nonsynonymous variants associated with BMI at p < 5 × 10⁻⁴ that were independent of known GWAS loci (p < 5 × 10⁻⁴) from Locke et al., and panel (c) shows results from nonsynonymous variants (rare, low-frequency and common) associated with BMI at p < 5 × 10⁻⁴. Rows in the heatmap represent significant meta-gene sets (FDR < 0.05). For any given square, the color indicates how strongly the corresponding gene (shown as columns) is predicted to belong to the reconstituted gene set (rows). This value is based on the gene’s z-score for gene set inclusion in DEPICT’s reconstituted gene sets, where red indicates a higher z-score and blue indicates a lower one. To visually reduce redundancy and increase clarity, we chose one representative meta-gene set for each group of highly correlated gene sets, based on affinity propagation clustering (Methods). Heatmap intensity and DEPICT p-values correspond to the most significantly enriched gene set within the meta-gene set. Annotations for the genes indicate (1) whether the gene has an OMIM annotation as underlying a monogenic obesity disorder (black and grey), (2) the MAF of the significant ExomeChip (EC) variant (shades of blue; if multiple variants, the lowest-frequency variant was kept), (3) whether the variant’s p-value reached array-wide significance (p < 2 × 10⁻⁷) or suggestive significance (p < 5 × 10⁻⁴) (shades of purple), (4) whether the variant was novel, overlapping “relaxed” GWAS signals from Locke et al.27 (GWAS p < 5 × 10⁻⁴), or overlapping “stringent” GWAS signals (GWAS p < 5 × 10⁻⁸) (shades of pink), and (5) whether the gene was included in the gene set enrichment analysis or excluded by filters (shades of brown/orange) (Methods). Annotations for the gene sets indicate if the meta-gene set was found significant (shades of green; FDR < 0.01, < 0.05, or not significant) in the DEPICT analysis of GWAS results from Locke et al.27


Figure S2.2 (continued). Heatmap panels (a), (b), and (c). Rows are meta-gene sets and columns are genes. Gene annotation tracks: Inclusion (included, excluded); Novelty (novel or secondary signal with no overlap with known GWAS loci; relaxed known, overlapping Locke loci with p < 5 × 10⁻⁴; stringent known, overlapping genome-wide significant GWAS loci); EC significance (p < 2 × 10⁻⁷, “stringent”; p < 5 × 10⁻⁴, “relaxed”); MAF (rare, low-frequency, common); OMIM (yes, no). Gene set annotation track: GWAS significance (FDR < 0.01, < 0.05, ≥ 0.05). Color scale: pathway z-score.

Figure S2.3. Heatmaps showing full EC-DEPICT gene set enrichment results for WHRadjBMI. Panel (a) shows results from the analysis with nonsynonymous variants with p < 5 × 10⁻⁴ (related to Figure 2.4) and panel (b) shows results from the analysis with nonsynonymous variants with p < 5 × 10⁻⁴ independent of known GWAS loci from Shungin et al.142 Rows in the heatmap represent significant meta-gene sets (FDR < 0.05). For any given square, the color indicates how strongly the corresponding gene (shown as columns) is predicted to belong to the reconstituted gene set (rows). This value is based on the gene’s z-score for gene set inclusion in DEPICT’s reconstituted gene sets, where red indicates a higher z-score and blue indicates a lower one. To visually reduce redundancy and increase clarity, we chose one representative meta-gene set for each group of highly correlated gene sets, based on affinity propagation clustering (Methods). Heatmap intensity and DEPICT p-values correspond to the most significantly enriched gene set within the meta-gene set. Annotations for the genes indicate (1) the minor allele frequency of the significant ExomeChip (EC) variant (blue; if multiple variants, the lowest-frequency variant was kept), (2) whether the variant’s p-value reached array-wide significance (p < 2 × 10⁻⁷) or suggestive significance (p < 5 × 10⁻⁴) (shades of purple), (3) whether the variant was novel or overlapping GWAS signals from Shungin et al.142 (shades of pink; novel, within 1 Mb of a known locus, or within 500 kb of a known locus) and (4) whether the gene was included in the gene set enrichment analysis or excluded by filters (shades of brown/orange) (Methods). Annotations for the gene sets indicate if the meta-gene set was found significant (shades of green; FDR < 0.01, < 0.05, or not significant) in the DEPICT analysis of GWAS results from Shungin et al.


Figure S2.3 (continued). Heatmap panels (a) and (b). Rows are meta-gene sets and columns are genes. Gene annotation tracks: Inclusion (included, excluded); Novelty relative to Shungin et al. (novel, within 1 Mb, within 500 kb); EC significance (p < 2 × 10⁻⁷, “stringent”; p < 5 × 10⁻⁴, “relaxed”); MAF (rare, low-frequency, common). Gene set annotation track: GWAS significance (FDR < 0.01, < 0.05, ≥ 0.05). Color scale: pathway z-score.

Figure S2.4. EC-DEPICT results from analysis of glycemic traits. Panel (a) shows results from nonsynonymous variants with p < 1 × 10⁻⁵ for any of the four glycemic traits, panel (b) shows results from analysis of nonsynonymous variants with p < 1 × 10⁻⁵ for any of the glycemic traits except HbA1c (related to Figure 2.6a), panel (c) shows results from analysis of nonsynonymous variants with p < 1 × 10⁻⁵ for HbA1c, panel (d) shows results from analysis of nonsynonymous variants with p < 1 × 10⁻⁵ for fasting glucose (related to Figure 2.6b), and panel (e) shows results from analysis of nonsynonymous variants with p < 1 × 10⁻⁵ for 2hGlu. Rows in the heatmap represent significant meta-gene sets (FDR < 0.05). For any given square, the color indicates how strongly the corresponding gene (shown as columns) is predicted to belong to the reconstituted gene set (rows). This value is based on the gene’s z-score for gene set inclusion in DEPICT’s reconstituted gene sets, where red indicates a higher z-score and blue indicates a lower one. To visually reduce redundancy and increase clarity, we chose one representative meta-gene set for each group of highly correlated gene sets, based on affinity propagation clustering (Methods). Heatmap intensity and DEPICT p-values correspond to the most significantly enriched gene set within the meta-gene set. The gene set annotations indicate whether that meta-gene set was significant at FDR < 0.05 or not significant (n.s.) for each of the other EC-DEPICT analyses. For heatmap intensity and EC-DEPICT p-values, the meta-gene set values are taken from the most significantly enriched member gene set. The gene variant annotations are as follows: (1) the European MAF of the input variant, where rare is MAF < 1%, low-frequency is MAF 1-5%, and common is MAF > 5%, (2) whether the gene has an Online Mendelian Inheritance in Man (OMIM) annotation as causal for a diabetes/glycemic-relevant syndrome or blood disorder, (3) the effector transcript classification for that variant: gold, silver, bronze, or NA (note that only array-wide significant variants were classified, so suggestively significant variants are by default classified as “NA”), (4-7) whether each variant was significant (p < 2 × 10⁻⁷), suggestively significant (p < 1 × 10⁻⁵), or not significant in Europeans for each of the four traits, and (8) whether each variant was included in the analysis or excluded by filters (Methods). AWS: array-wide significant.


Figure S2.4 (continued). K9 specific) − specific DNA binding − induced obesity − specific DNA binding transcription factor activity specific DNA binding transcription factor − protein complex − dependent protein kinase activity dependent histone deacetylase activity (H3 − −

REACTOME_DOWNSTREAM_SIGNAL_TRANSDUCTION REACTOME_SIGNALLING_BY_NGF SRC PPI subnetwork PPI subnetwork ACOX3 ADCY2 PPI subnetwork transportGolgi to plasma membrane cAMP membrane cytoplasmic vesicle SNARE binding PPI subnetwork WAS KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT heterotrimeric G REACTOME_GLUCAGON_SIGNALING_IN_METABOLIC_REGULATION clathrin coat of coated pit YIF1A PPI subnetwork AP1S3 PPI subnetwork SHC1 PPI subnetwork CHRM2 PPI subnetwork acid catabolic process fatty regulation of sequence abnormal bone morphology trabecular mitochondrial envelope PPI subnetwork PRKACG gated channel activity PPFIA2 PPI subnetwork cellular polysaccharide biosynthetic process glucan catabolic process glycogen metabolic process catabolic process regulation of carbohydrate regulation of polysaccharide metabolic process metabolic process regulation of cellular carbohydrate acid metabolic process regulation of fatty monosaccharide catabolic process REACTOME_GLUCOSE_METABOLISM acid binding carboxylic cellular response to peptide hormone stimulus REACTOME_MTOR_SIGNALLING abnormal gastric parietal cell morphology abnormal lipid homeostasis binding cofactor BBS4 PPI subnetwork metabolic process cofactor stimulus cellular response to dexamethasone REACTOME_METABOLISM_OF_NUCLEOTIDES REACTOME_REGULATION_OF_LIPID_METABOLISM_BY_PEROXISOME_PROLIFERATOR:ACTIVATED_RECEPTOR_ALPHA_PPARALPHA activity binding transcription factor transcription factor PPI subnetwork HDAC1 WRB PPI subnetwork regulation of histone modification abnormal streak morphology primitive BICD2 PPI subnetwork binding, bridging hormonegrowth receptor signaling pathway PPI subnetwork STAT5B abnormal lactation npBAF complex NAD partial postnatal lethality abnormal megakaryocyte progenitor cell morphology abnormal dorsal aorta morphology complete embryonic lethality during organogenesis size vessel regulation of blood RNF31 PPI subnetwork REACTOME_EGFR_DOWNREGULATION transcription regulatory region sequence 
REACTOME_SHC:MEDIATED_SIGNALLING norepinephrine secretion cellular response to glucose stimulus response to monosaccharide stimulus axon part lumen platelet alpha granule CNGA3 PPI subnetwork REACTOME_LIPOPROTEIN_METABOLISM lipid digestion decreased susceptibility to diet abnormal glucose homeostasis increased circulating insulin level peptide hormone secretion abnormal erythropoiesis enlarged spleen erythrocyte differentiation myofibril cortical cytoskeleton iron level increased liver heme metabolic process reticulocytosis

MYO9B

PRR14

CERS2

G6PD

EXOC6

TMC6

ATP11A

SWAP70

RAPGEF3

SEZ6

SV2C

MADD

TIAM2

MAST2

PLCB3

ARID1B

DVL2 ULK3

included excluded DHX58 SH2B3

Inclusion ARAP1

5 AMPD3

PIEZO1

MLX

NBR1 AWS <1e n.s. AKT2

FI

RREB1 NFIC 5

− ZHX3

ANKH

GPSM1 AWS <1e n.s. VPS13C

2hGlu

OBSL1 HELB 5

− RGS9BP

PDE6C

HNF1A AWS <1e n.s. C2orf44

FG

ZGPAT CCDC67 5

− CCDC36

MUC6

TPCN2 AWS <1e n.s. C3orf18

HbA1c RASIP1

TMEM255B

COBLL1

EGF

B3GNTL1

gold silver bronze NA

HFE

ZNF641 Medal

PIEZO2 DCAF4

All traits combined DPAGT1

THADA

ERAP2

LRP4

PJA1

FAM206A

WDR78

AGBL2

STEAP2

diabetes/glycemic disorder blood none

MLXIPL

PPARG OMIM

LPCAT3

SREBF1

GCKR

SLC25A47

SLC2A2 TMPRSS6

frequency

GIPR

KCNJ11

PFKL

rare low common SLC30A8

MAF G6PC2

PCSK1

GLP1R

CTRB2

ANK1

SPTB

HK1

BLVRB

GYPC

DCAF12

SPTA1

KEL GYPB a. HbA1c Analysis − Only Analysis Only Only Analysis Only − − Only Analysis Only Except − − score − 2 4 sig n.s. sig n.s. sig n.s. sig n.s. 4 2 0 − − Significance in HbA1c Significance in All Significance in FG Significance in 2hGlu Z Pathway

165

Figure S2.4 (continued).

[Figure S2.4 (continued), panel b: all traits except HbA1c. Image not recoverable from text extraction. The panel plots gene sets against genes (for example SLC30A8, G6PC2, GIPR, GLP1R, PCSK1) on a pathway Z-score scale from −4 to 4. Legend annotations: significance (sig / n.s.) in the all-trait, HbA1c-only, FG-only, and 2hGlu-only analyses; per-trait association for HbA1c, FG, 2hGlu, and FI (AWS, < 1 × 10⁻⁵, n.s.); MAF (rare, low-frequency, common); OMIM (diabetes/glycemic, none); medal (gold, silver, bronze, NA); inclusion (included, excluded).]

Figure S2.4 (continued).

c. HbA1c only

[Panel image not recoverable from text extraction. The panel plots gene sets against genes (for example G6PC2, SLC30A8, HK1, G6PD, PIEZO1) on a pathway Z-score scale from −4 to 4. Legend annotations: significance (sig / n.s.) in the all-trait, all-except-HbA1c, FG-only, and 2hGlu-only analyses; per-trait association for HbA1c, FG, 2hGlu, and FI (AWS, < 1 × 10⁻⁵, n.s.); MAF (rare, low-frequency, common); OMIM (diabetes/glycemic, blood disorder, none); medal (gold, silver, bronze, NA); inclusion (included, excluded).]

Figure S2.4 (continued).

d. Fasting glucose only

[Panel image not recoverable from text extraction. The panel plots gene sets against genes (for example SLC30A8, G6PC2, GLP1R, GIPR, PCSK1, SLC2A2, GCKR) on a pathway Z-score scale from −4 to 4. Legend annotations: significance (sig / n.s.) in the all-trait, all-except-HbA1c, HbA1c-only, and 2hGlu-only analyses; per-trait association for HbA1c, FG, 2hGlu, and FI (AWS, < 1 × 10⁻⁵, n.s.); MAF (rare, low-frequency, common); OMIM (diabetes/glycemic, none); medal (gold, bronze, NA); inclusion (included, excluded).]

Figure S2.4 (continued).

e. 2hGlu only

[Panel image not recoverable from text extraction. The panel plots gene sets against genes (for example HNF1A, GCKR, KCNJ11, GIPR, CTRB2) on a pathway Z-score scale from −4 to 4. Legend annotations: significance (sig / n.s.) in the all-trait, all-except-HbA1c, HbA1c-only, and FG-only analyses; per-trait association for HbA1c, FG, 2hGlu, and FI (AWS, < 1 × 10⁻⁵, n.s.); MAF (rare, low-frequency, common); OMIM (diabetes/glycemic, none); medal (gold, silver, NA); inclusion (included).]

Supplementary tables for Chapter 2

Table S2.1. EC-DEPICT results for height-associated coding ExomeChip variants independent of known GWAS signals. Attached as digital information.

Table S2.2. GWAS-DEPICT results for height-associated noncoding GWAS variants. Attached as digital information.

Table S2.3. EC-DEPICT results when including rare and low-frequency nonsynonymous and splicing variants associated with BMI at p < 5 × 10⁻⁴ in the GIANT ExomeChip analyses. Attached as digital information.

Table S2.4. EC-DEPICT results when including rare and low-frequency nonsynonymous variants associated with BMI at p < 5 × 10⁻⁴ in the ExomeChip analysis and independent of known GWAS loci (p < 5 × 10⁻⁴) from Locke et al.²⁷ Attached as digital information.

Table S2.5. EC-DEPICT results when including nonsynonymous and splicing variants (rare, low-frequency, and common) associated with BMI at p < 5 × 10⁻⁴ in the GIANT ExomeChip analyses. Attached as digital information.

Table S2.6. EC-DEPICT results when including variants associated with WHRadjBMI at p < 5 × 10⁻⁴ in the GIANT ExomeChip analyses. Attached as digital information.

Table S2.7. EC-DEPICT results when including variants associated with WHRadjBMI at p < 5 × 10⁻⁴ in the GIANT ExomeChip analyses that are independent of known GWAS loci from Shungin et al. (2015)¹⁴². Attached as digital information.

Table S2.8. EC-DEPICT results from analysis of variants with p < 1 × 10⁻⁵ for any glycemic trait. Attached as digital information.

Table S2.9. EC-DEPICT results when including variants associated with any glycemic trait except HbA1c with p-value < 1 × 10⁻⁵. Attached as digital information.

Table S2.10. EC-DEPICT results when including variants associated with HbA1c with p-value < 1 × 10⁻⁵. Attached as digital information.

Table S2.11. EC-DEPICT results when including variants associated with fasting glucose with p-value < 1 × 10⁻⁵. Attached as digital information.

Table S2.12. EC-DEPICT results when including variants associated with 2hGlu with p-value < 1 × 10⁻⁵. Attached as digital information.


Table S2.13. Results from EC-DEPICT analysis of variants associated with body fat percentage at p < 5 × 10⁻⁴ (MAF between 0.0001 and 0.05). The top 10 gene sets are shown. Note that no gene set achieves significance. FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Meta-gene set ID | Meta-gene set description |
|---|---|---|---|---|---|
| MP:0001516 | abnormal motor coordination/balance | 8.28E-06 | >=0.20 | MP:0001393 | ataxia |
| GO:0045202 | synapse | 1.41E-05 | >=0.20 | GO:0044456 | synapse part |
| ENSG00000082701 | GSK3B PPI subnetwork | 2.59E-05 | >=0.20 | ENSG00000091831 | ESR1 PPI subnetwork |
| ENSG00000068305 | MEF2A PPI subnetwork | 4.32E-05 | >=0.20 | ENSG00000068305 | MEF2A PPI subnetwork |
| GO:0031594 | neuromuscular junction | 5.72E-05 | >=0.20 | MP:0001053 | abnormal neuromuscular synapse morphology |
| GO:0030424 | axon | 9.39E-05 | >=0.20 | GO:0033267 | axon part |
| MP:0001513 | limb grasping | 0.0001042 | >=0.20 | MP:0001393 | ataxia |
| GO:0019717 | synaptosome | 0.0001047 | >=0.20 | GO:0044456 | synapse part |
| GO:0008021 | synaptic vesicle | 0.0001156 | >=0.20 | GO:0044456 | synapse part |
| MP:0005176 | eyelids fail to open | 0.0001373 | >=0.20 | MP:0005193 | abnormal anterior eye segment morphology |


Table S2.14. Results from EC-DEPICT analysis of variants associated with adiponectin levels adjusted for BMI at p < 5 × 10⁻⁴ (all MAF). The top 10 gene sets are shown. Note that no gene set achieves significance. FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Meta-gene set ID | Meta-gene set description |
|---|---|---|---|---|---|
| MP:0008034 | enhanced lipolysis | 2.30E-05 | >=0.20 | MP:0008034 | enhanced lipolysis |
| ENSG00000174175 | SELP PPI subnetwork | 7.10E-05 | >=0.20 | Reactome platelet adhesion to exposed collagen | Reactome platelet adhesion to exposed collagen |
| MP:0005318 | decreased triglyceride level | 7.56E-05 | >=0.20 | MP:0002078 | abnormal glucose homeostasis |
| MP:0002970 | abnormal white adipose tissue morphology | 9.62E-05 | >=0.20 | MP:0008034 | enhanced lipolysis |
| GO:0032640 | tumor necrosis factor production | 1.021e-4 | >=0.20 | GO:0032640 | tumor necrosis factor production |
| GO:0032680 | regulation of tumor necrosis factor production | 1.021e-4 | >=0.20 | GO:0032640 | tumor necrosis factor production |
| GO:0043030 | regulation of macrophage activation | 1.355e-4 | >=0.20 | GO:0042116 | macrophage activation |
| GO:0045071 | negative regulation of viral genome replication | 1.413e-4 | >=0.20 | GO:0045069 | regulation of viral genome replication |
| GO:0048525 | negative regulation of viral reproduction | 1.413e-4 | >=0.20 | GO:0045069 | regulation of viral genome replication |
| GO:0032757 | positive regulation of interleukin-8 production | 0.000155 | >=0.20 | GO:0032677 | regulation of interleukin-8 production |


Table S2.15. Results from EC-DEPICT analysis of variants associated with adiponectin levels adjusted for BMI at p < 5 × 10⁻⁴ (MAF < 5% only). The top 10 gene sets are shown. Note that no gene set achieves significance. FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Meta-gene set ID | Meta-gene set description |
|---|---|---|---|---|---|
| MP:0005318 | decreased triglyceride level | 5.84E-06 | >=0.20 | MP:0002078 | abnormal glucose homeostasis |
| MP:0009115 | abnormal fat cell morphology | 1.01E-05 | >=0.20 | MP:0005659 | decreased susceptibility to diet-induced obesity |
| GO:0032757 | positive regulation of interleukin-8 production | 1.53E-05 | >=0.20 | GO:0032677 | regulation of interleukin-8 production |
| MP:0004893 | decreased adiponectin level | 1.85E-05 | >=0.20 | MP:0002078 | abnormal glucose homeostasis |
| MP:0001559 | hyperglycemia | 3.57E-05 | >=0.20 | MP:0002079 | increased circulating insulin level |
| GO:0032642 | regulation of chemokine production | 4.72E-05 | >=0.20 | GO:0042033 | chemokine biosynthetic process |
| MP:0000242 | impaired fertilization | 7.23E-05 | >=0.20 | MP:0002675 | asthenozoospermia |
| Reactome hormone-sensitive lipase HSL-mediated triacylglycerol hydrolysis | Reactome hormone-sensitive lipase HSL-mediated triacylglycerol hydrolysis | 8.14E-05 | >=0.20 | GO:0019433 | triglyceride catabolic process |
| MP:0008034 | enhanced lipolysis | 8.50E-05 | >=0.20 | MP:0008034 | enhanced lipolysis |
| GO:0032602 | chemokine production | 1.019E-04 | >=0.20 | GO:0042033 | chemokine biosynthetic process |


Table S2.16. Results from EC-DEPICT analysis of variants associated with leptin levels adjusted for BMI at p < 5 × 10⁻⁴ (all MAF). The top 10 gene sets are shown. Note that no gene set achieves significance. FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Meta-gene set ID | Meta-gene set description |
|---|---|---|---|---|---|
| GO:0005523 | tropomyosin binding | 7.55E-05 | >=0.20 | GO:0030016 | myofibril |
| KEGG p53 signaling pathway | KEGG p53 signaling pathway | 2.818E-04 | >=0.20 | KEGG cell cycle | KEGG cell cycle |
| GO:0016460 | myosin II complex | 3.046E-04 | >=0.20 | GO:0030016 | myofibril |
| GO:0030521 | androgen receptor signaling pathway | 0.0004674 | >=0.20 | GO:0030518 | intracellular steroid hormone receptor signaling pathway |
| GO:0005859 | muscle myosin complex | 0.0005459 | >=0.20 | GO:0030016 | myofibril |
| MP:0003992 | increased mortality induced by ionizing radiation | 0.0005789 | >=0.20 | Reactome homologous recombination repair | Reactome homologous recombination repair |
| GO:0051170 | nuclear import | 0.0006254 | >=0.20 | GO:0051170 | nuclear import |
| ENSG00000173991 | TCAP PPI subnetwork | 0.0006394 | >=0.20 | GO:0030016 | myofibril |
| GO:0006606 | protein import into nucleus | 0.0006565 | >=0.20 | GO:0051170 | nuclear import |
| ENSG00000134014 | ELP3 PPI subnetwork | 0.000701 | >=0.20 | ENSG00000109911 | ELP4 PPI subnetwork |


Table S2.17. Results from EC-DEPICT analysis of variants associated with leptin levels adjusted for BMI at p < 5 × 10⁻⁴ (MAF < 5%). The top 10 gene sets are shown. Note that no gene set achieves significance. FDR: false discovery rate.

| Original gene set ID | Original gene set description | Nominal p-value | FDR | Meta-gene set ID | Meta-gene set description |
|---|---|---|---|---|---|
| KEGG p53 signaling pathway | KEGG p53 signaling pathway | 0.0004007 | >=0.20 | KEGG cell cycle | KEGG cell cycle |
| ENSG00000151164 | RAD9B PPI subnetwork | 0.0005308 | >=0.20 | REACTOME lagging strand synthesis | REACTOME lagging strand synthesis |
| MP:0004220 | abnormal peripheral nervous system regeneration | 0.0006636 | >=0.20 | GO:0031012 | extracellular matrix |
| KEGG bladder cancer | KEGG bladder cancer | 0.0007082 | >=0.20 | KEGG chronic myeloid leukemia | KEGG chronic myeloid leukemia |
| ENSG00000163629 | PTPN13 PPI subnetwork | 0.0007894 | >=0.20 | GO:0005912 | adherens junction |
| KEGG renin angiotensin system | KEGG renin angiotensin system | 0.0008678 | >=0.20 | GO:0003081 | regulation of systemic arterial blood pressure by renin-angiotensin |
| ENSG00000136002 | ARHGEF4 PPI subnetwork | 0.001108 | >=0.20 | ENSG00000144744 | UBA3 PPI subnetwork |
| ENSG00000135744 | AGT PPI subnetwork | 0.00112 | >=0.20 | GO:0003081 | regulation of systemic arterial blood pressure by renin-angiotensin |
| MP:0008133 | decreased Peyer's patch number | 0.001139 | >=0.20 | MP:0008135 | small Peyer's patches |
| MP:0002216 | abnormal seminiferous tubule morphology | 0.001299 | >=0.20 | MP:0001147 | small testis |


Supplementary text for Chapter 2

Text S2.1 Trait-specific implementational details and results for EC-DEPICT

Height

We started by defining the set of variants and genes to analyze. To define a “background” set of variants with coherent annotations of their functional consequences on gene products, we used annotations from the CHARGE consortium file (see URLs, download date 2/10/16), which contained 247,037 variants in 27,196 genes. First, we limited the analysis to 220,647 variants with frameshift, nonsynonymous, stopgain, stoploss, or splice-site annotations (in 17,473 unique genes). We then removed all variants that were not assessed in the height ExomeChip data, leaving 215,689 variants in 17,416 genes. For variants with different functional consequences that mapped to more than one gene, the most severe consequence was retained, leaving 17,415 genes. Where a variant mapped to multiple genes with equally severe consequences, the gene was chosen at random. HUGO gene identifiers from the CHARGE annotation file were converted to Ensembl identifiers using Ensembl Biomart (version 84, GRCh37) gene homology mapping (see URLs, download date 4/18/2016). After removing variants that did not map to an Ensembl identifier present in DEPICT’s reconstituted gene sets, we were left with 199,907 variants in 15,652 genes. Finally, to ensure valid null distributions, we also excluded all variants absent from the null ExomeChip data that were used to calculate p-values. After all filtering steps, a total of 41,538 variants in 11,756 genes were included in the analysis.
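As an illustration only, the filtering steps above can be sketched in Python. Everything here is hypothetical scaffolding rather than the EC-DEPICT code: the function names, the input dictionaries, and in particular the severity ordering of consequence classes are assumptions made for the example.

```python
import random

# Consequence classes retained in the analysis. The ordering from most to
# least severe is an assumption made for this illustration.
SEVERITY = ["frameshift", "stopgain", "stoploss", "splice-site", "nonsynonymous"]


def pick_gene(annotations, rng):
    """Choose one gene for a variant from (gene, consequence) pairs.

    Keeps the most severe retained consequence; ties are broken at random.
    Returns None if the variant has no retained consequence.
    """
    kept = [(g, c) for g, c in annotations if c in SEVERITY]
    if not kept:
        return None
    best = min(SEVERITY.index(c) for _, c in kept)
    candidates = sorted(g for g, c in kept if SEVERITY.index(c) == best)
    return rng.choice(candidates)


def filter_variants(annotated, assessed, gene_to_ensembl, depict_genes,
                    null_variants, seed=0):
    """Apply all filters and return {variant: ensembl_id}.

    A variant survives only if it was assessed in the trait data, is present
    in the null data, has a retained consequence, and maps to an Ensembl
    identifier found in the reconstituted gene sets.
    """
    rng = random.Random(seed)
    result = {}
    for variant, annotations in annotated.items():
        if variant not in assessed or variant not in null_variants:
            continue
        gene = pick_gene(annotations, rng)
        ensembl = gene_to_ensembl.get(gene)
        if ensembl is not None and ensembl in depict_genes:
            result[variant] = ensembl
    return result
```

The filters are independent set-membership checks, so the order in which they are applied changes the intermediate counts reported in the text but not the final set.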

As input to EC-DEPICT, we took variants that either (1) had p-values < 2 × 10⁻⁷ in the discovery or conditional analyses or (2) had p-values between 2 × 10⁻⁶ and 2 × 10⁻⁷ in the discovery analysis and a p-value of < 2 × 10⁻⁷ in the combined discovery and replication analysis. Then, to discern pathways that were highlighted by coding variants independently of known GWAS signals, we filtered for variants found only by ExomeChip (i.e. ExomeChip hits for height that were independent of known GWAS loci) and removed variants where there was a common ExomeChip missense variant equivalent to a GWAS variant. A total of 128 variants in 119 genes were retained and used in the analysis.
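The two-part p-value rule for input variants can be written as a small predicate. This is a sketch under stated assumptions: the function and argument names are hypothetical, and the treatment of p-values lying exactly on a threshold is our choice, since the text does not specify it.

```python
SIGNIFICANT = 2e-7   # threshold applied to discovery, conditional, and combined p-values
SUGGESTIVE = 2e-6    # upper bound of the band used in criterion (2)


def is_input_variant(p_discovery, p_conditional=None, p_combined=None):
    """Return True if a variant meets criterion (1) or (2) from the text."""
    # Criterion (1): significant in the discovery or the conditional analysis.
    if p_discovery < SIGNIFICANT:
        return True
    if p_conditional is not None and p_conditional < SIGNIFICANT:
        return True
    # Criterion (2): between the two thresholds in discovery and significant
    # in the combined discovery and replication analysis.
    in_band = SIGNIFICANT <= p_discovery < SUGGESTIVE
    return in_band and p_combined is not None and p_combined < SIGNIFICANT
```

For example, a variant with a discovery p-value of 1 × 10⁻⁶ is kept only if its combined p-value falls below 2 × 10⁻⁷.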

In further exploring the DEPICT results, we observed that in certain scenarios, extreme nonnormality of pathway membership z-scores for certain gene sets could slightly increase type I error in DEPICT. To determine if this issue affected our results, we performed another DEPICT analysis using a version of the reconstituted gene sets where pathway membership scores were inverse normal transformed. We then compared the rank of each gene set in the original data with its rank in the inverse normal transformed data. Of the 496 originally significant gene sets at FDR < 0.01, only 22 were “outliers” (> 1.5 × the interquartile range). We note that all but one of these outliers still had an FDR of < 0.05 in the inverse normal transformed data, and the remaining gene set had an FDR of < 0.10. These outliers are denoted in the supplementary table with an asterisk. Additionally, we note that removing these outliers leads to the loss of only one significant meta-gene set (which is still significant at FDR < 0.05). We conclude that this issue only minimally impacts these results, but recommend that future users of this method should also repeat the analysis using inverse normal transformed pathway membership scores and look for outliers as an additional check.
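A minimal sketch of the recommended check follows, using only the Python standard library. The function names are hypothetical, the transform shown is a generic rank-based inverse normal transform, and the outlier rule (Tukey fences at 1.5 × the interquartile range of the rank shifts) is our reading of the criterion above, not a confirmed description of the original implementation.

```python
from statistics import NormalDist, quantiles


def inverse_normal_transform(values):
    """Rank-based inverse normal transform (ties are not averaged here)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # (rank - 0.5) / n keeps the quantile strictly inside (0, 1).
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out


def rank_shift_outliers(rank_original, rank_transformed):
    """Return indices of gene sets whose rank change is an IQR outlier.

    A gene set is flagged when its rank shift lies more than 1.5 x IQR
    below the first quartile or above the third quartile of all shifts.
    """
    shifts = [b - a for a, b in zip(rank_original, rank_transformed)]
    q1, _, q3 = quantiles(shifts, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [i for i, s in enumerate(shifts) if s < low or s > high]
```

In use, the transform would be applied to each gene set's membership scores before rerunning the enrichment analysis, and `rank_shift_outliers` would then compare gene set ranks between the two runs.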

We also performed a DEPICT analysis of height GWAS results from Wood et al.⁶⁹ For this, we started with significant noncoding variants found by GWAS (using the Variant Effect Predictor tool¹²³ for annotations). We constructed loci using DEPICT as previously described (beta version 1.1, release 194, see URLs), using 1000 Genomes phase 1 data and default parameters to perform the analysis (20 repetitions to compute FDRs, 500 permutations to adjust for biases), and eliminated loci containing overlap with any of the novel EC genes. This left 446 loci, which were used in the subsequent DEPICT gene set enrichment step.

We also applied the PASCAL pathway analysis tool to these data. PASCAL derives gene-based scores from maximum or summed chi-square scores and calculates a test statistic based on summing across gene-based scores for a given pathway¹²⁸. Within this study, PASCAL was applied separately to (1) all coding variants from the ExomeChip and (2) regulatory variants (20 kb upstream, gene body excluded) within association summary statistics from height GWAS results⁶⁹.

For Online Mendelian Inheritance in Man (OMIM) annotations, we supplemented a previously annotated list of OMIM genes (from Wood et al.⁶⁹) with the results of a more recent search in OMIM for the terms “short stature,” “overgrowth,” “skeletal dysplasia,” or “brachydactyly,” and manually curated the combined list to include genes where mutations cause clinical abnormalities of skeletal growth. The manual curation was performed prior to obtaining the ExomeChip results.

BMI

Before gene set enrichment analysis of BMI-associated variants, we removed variants absent from the permuted ExomeChip data. This resulted in the exclusion of a substantial number of variants from most analyses when using the Swedish nulls (for example, about 30% of the BMI-associated variants at suggestive or array-wide significance). Most of the excluded variants were in the very rarest allele frequency bins, and most reached suggestive rather than array-wide significance. The exclusion of the rarest variants is expected given the much smaller sample size of the null cohorts relative to the BMI data. To try to include more variants in the analysis, we also generated null ExomeChip data based on the UK Biobank (N = 152,249), which resulted in the exclusion of fewer BMI-associated variants (due to the much larger sample size of the UK Biobank data). However, we observed that use of these null data, despite including more of the rarest BMI-associated variants in the gene set enrichment analysis, dampened the signal and resulted in fewer significantly enriched gene sets. This result suggests that (1) the rarest variants are likely to have a lower true-positive rate and/or (2) the heterogeneity of the underlying biology increases with the inclusion of very rare variants.

Glycemic traits

As part of our gene set enrichment analysis pipeline, we remove input trait-associated variants that are not present in the permuted data to ensure that all variants are appropriately modeled. When using the Swedish permutations, this generally results in removing a substantial fraction of the variants, especially of the very rarest variants (due to the smaller sample size of the Swedish data relative to the data being analyzed). We have previously observed that this filtering can actually improve the enrichment signal, possibly due to more heterogeneous biology or a higher false-positive rate in these very rare variants¹². However, in this case, we observed that in performing this filtering, we excluded variants in several known monogenic disease genes, such as HNF1A and SLC2A2. Therefore, we wished to repeat the analysis with a set of permutations which would allow us to retain these variants. We thus repeated the analysis with a set of permutations from UK Biobank (UKBB) (N = 152,249). The larger sample size in the UKBB permutations means more variants are present and can therefore be included in the analysis.

Concordance of results from two different sets of permuted distributions across phenotypes: For completeness, we report the results from the use of both sets of permutations. We note that they are strongly concordant. The larger number of significant gene sets reported based on the UK Biobank permutations is generally a combination of (1) overall improved power (i.e. more variants are included) and (2) the inclusion of variants in key driver genes absent in the Swedish permutations, encompassing both the monogenic genes mentioned above (e.g. SLC2A2) and additional genes with clearly relevant biology (e.g. CTRB2, SLC30A8). The results from both sets of permutations are summarized below. For all analyses, “significance” refers to a false discovery rate of < 0.05.

All-trait analysis: After filtering, 78 input genes were included for the analysis with the UKBB permutations and 60 for the analysis with the Swedish permutations. (Note that the difference in the number of input genes is due to the presence of a larger number of input variants in the UKBB permutations; see above.) We found 234 significant gene sets in 86 meta-gene sets based on the UKBB permutations (Figure S2.4a, Table S2.8) and 133 gene sets in 51 meta-gene sets based on the Swedish permutations. The correlation between the UKBB and Swedish analyses was r = 0.902, p < 10⁻³⁰⁰.

All-traits-except-HbA1c analysis: After filtering, 45 input genes were included for the analysis with the UKBB permutations and 33 for the analysis with the Swedish permutations. We found 128 significant gene sets in 53 meta-gene sets based on the UKBB permutations (Figure S2.4b, Table S2.9) and 45 significant gene sets in 18 meta-gene sets based on the Swedish permutations. The correlation between the UKBB and Swedish analyses was r = 0.882, p < 10⁻³⁰⁰.

HbA1c-only analysis: After filtering, 41 input genes were included for the analysis with the UKBB permutations and 33 for the analysis with the Swedish permutations. We found 191 significant gene sets in 73 meta-gene sets based on the UKBB permutations (Figure S2.4c, Table S2.10) and 120 gene sets in 41 meta-gene sets based on the Swedish permutations. The correlation between the UKBB and Swedish analyses was r = 0.936, p < 10⁻³⁰⁰.

FG-only analysis: After filtering, 26 input genes were included for the analysis with the UKBB permutations and 22 for the analysis with the Swedish permutations. We found 106 significant gene sets in 39 meta-gene sets based on the UKBB permutations (Figure S2.4d, Table S2.11) and 48 significant gene sets in 15 meta-gene sets based on the Swedish permutations. The correlation between the UKBB and Swedish analyses was r = 0.939, p < 10⁻³⁰⁰.

2hGlu-only analysis: After filtering, 12 input genes were included for the analysis with the UKBB permutations and 7 for the analysis based on the Swedish permutations. We found 56 significant gene sets in 17 meta-gene sets based on the UKBB permutations (Figure S2.4e, Table S2.12), with no significant gene sets based on the Swedish permutations. The correlation between the UKBB and Swedish analyses was r = 0.787, p < 10⁻³⁰⁰.

FI-only analysis: After filtering, 11 input genes were included for the analysis with the UKBB permutations and 8 for the analysis with the Swedish permutations. There were no significant gene sets from either analysis. The correlation between the UKBB and Swedish analyses was r = 0.860, p < 10⁻³⁰⁰.

Fat-free mass index and body fat percentage

These traits included a large amount of data from both the ExomeChip and UK Biobank. Since EC-DEPICT was designed specifically for the ExomeChip set of variants, we removed variants present in UK Biobank but absent from the ExomeChip. Creating permutations that include all RLF coding variants from the UK Biobank (rather than restricting to the ExomeChip) is a future direction.


Appendix B:

Supplementary materials for Chapter 3

Supplementary figures for Chapter 3

[Figure S3.1 image not recoverable from text extraction. Panel A: top 5%, top 10%, and top 15% cutoffs. Panel B: 25 kb, 50 kb, and 100 kb windows. y-axis: meta-analyzed effect size (normalized); x-axis: gene sets, GEO, GTEx, intersect, outersect.]

Figure S3.1. Meta-analyzed effect sizes (normalized τ) for varying parameters of Benchmarker. (a) Varying percentage cutoffs for prioritizing genes (top 5%, top 10%, and top 15%). (b) Varying kb window size for assigning SNPs to genes (25 kb, 50 kb, and 100 kb). Figure includes data for separate LD score regression models comparing (1) DEPICT-gene-sets, (2) DEPICT-GEO, (3) DEPICT-GTEx, (4) the “intersect” (genes prioritized by at least two of the three preceding methods) and (5) the “outersect” (genes prioritized by only one method). (“Separate LD score regression models” indicates that these annotations are tested individually rather than jointly.) Error bars represent 95% confidence intervals.
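The “intersect” and “outersect” definitions in this caption amount to counting, for each gene, how many methods prioritized it. A short Python sketch follows; the function name and the gene labels in the usage example are placeholders, not data from this work.

```python
from collections import Counter


def intersect_outersect(*prioritized_sets):
    """Split genes by how many methods prioritized them.

    intersect: genes prioritized by at least two methods.
    outersect: genes prioritized by exactly one method.
    """
    counts = Counter(gene for genes in prioritized_sets for gene in set(genes))
    intersect = {g for g, c in counts.items() if c >= 2}
    outersect = {g for g, c in counts.items() if c == 1}
    return intersect, outersect


# Usage with three placeholder gene lists, one per method.
gene_sets = {"G1", "G2", "G3"}
geo = {"G2", "G3", "G4"}
gtex = {"G3", "G5"}
both, single = intersect_outersect(gene_sets, geo, gtex)
```

With three methods, the intersect and outersect together equal the union of all prioritized genes, and the two sets never overlap.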


[Figure S3.2 image not recoverable from text extraction. The schematic has two panels.
A. Gene set binarization. Gene set membership scores (ranging from highly likely member to highly unlikely member) are binarized using either a z-score cutoff (stringent or liberal) or a ranking cutoff (stringent or liberal).
B. Gene prioritization with MAGMA. (1) Calculate gene-based p-values with MAGMA. (2) Remove chromosome 1 and run MAGMA gene set enrichment analysis on the gene-based p-values for chromosomes 2-22, giving ranked gene set p-values. (3) Identify gene members of the enriched gene sets (based on one of the binarizations). (4) Prioritize the top 10% of genes in enriched gene sets on the withheld chromosome. (5) Annotate SNPs within +/- 50 kb of the top-ranked 10% of genes as “prioritized” (the LDSC annotation). (6) Repeat for each chromosome and combine results for Benchmarker.]

Figure S3.2. Schematic of prioritization with MAGMA.
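The leave-one-chromosome-out loop in this schematic can be sketched as follows. This is a hedged illustration, not the Benchmarker code: `enrich` is a hypothetical callable standing in for the MAGMA enrichment step, all names are ours, and the way the 10% quota is filled (walking down the ranked gene sets and adding member genes on the withheld chromosome) is an assumption about a step the schematic does not spell out.

```python
def prioritize_leave_one_chromosome_out(gene_chrom, enrich, members, fraction=0.10):
    """Prioritize genes on each chromosome using only the other chromosomes.

    gene_chrom: dict gene -> chromosome.
    enrich: hypothetical callable; given the set of training genes (those not
        on the withheld chromosome), returns gene set names ranked from most
        to least enriched.
    members: dict gene set name -> set of member genes (after binarization).
    """
    prioritized = set()
    for chrom in sorted(set(gene_chrom.values())):
        held_out = {g for g, c in gene_chrom.items() if c == chrom}
        training = set(gene_chrom) - held_out
        quota = max(1, round(fraction * len(held_out)))
        chosen = []
        for gene_set in enrich(training):
            for gene in sorted(members[gene_set] & held_out):
                if gene not in chosen and len(chosen) < quota:
                    chosen.append(gene)
            if len(chosen) >= quota:
                break
        prioritized.update(chosen)
    return prioritized


def annotate_snps(snps, gene_span, genes, window=50_000):
    """Return ids of SNPs within +/- window bp of any prioritized gene.

    snps: dict snp_id -> (chrom, position).
    gene_span: dict gene -> (chrom, start, end).
    """
    hits = set()
    for snp, (chrom, pos) in snps.items():
        for gene in genes:
            g_chrom, start, end = gene_span[gene]
            if g_chrom == chrom and start - window <= pos <= end + window:
                hits.add(snp)
                break
    return hits
```

Because the enrichment step never sees the withheld chromosome, the genes prioritized on it are chosen without using that chromosome's association data, and the resulting SNP annotation can then be passed to LD score regression.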


[Figure S3.3 image not recoverable from text extraction. Histogram; x-axis: number of genes passing threshold for each gene set (0 to 1,250); y-axis: count (0 to 4,000); legend (gene set binarization): Z > 1.96, Z > 2.58, Z > 3.29.]

Figure S3.3. Distribution of size of gene sets after binarization with different z-score cutoffs. Dashed vertical lines are at 50, 100, and 200, representing the cutoffs for the three rank- based binarization strategies.
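The two binarization strategies compared in this figure can be illustrated with a short Python sketch. The data layout (a dictionary of gene set to per-gene z-scores) and the function names are assumptions made for the example; the cutoffs come from the figure legend and caption.

```python
def binarize_by_zscore(scores, cutoff):
    """Keep genes whose membership z-score exceeds the cutoff.

    scores: dict gene set -> dict gene -> z-score.
    """
    return {gs: {g for g, z in genes.items() if z > cutoff}
            for gs, genes in scores.items()}


def binarize_by_rank(scores, top_n):
    """Keep the top_n genes per gene set, ranked by descending z-score."""
    return {gs: set(sorted(genes, key=genes.get, reverse=True)[:top_n])
            for gs, genes in scores.items()}


# Toy scores for one gene set; real cutoffs would be Z > 1.96, 2.58, or 3.29
# and rank cutoffs of 50, 100, or 200 genes.
toy = {"gs1": {"a": 3.5, "b": 2.0, "c": 0.1, "d": -1.0}}
by_z = binarize_by_zscore(toy, 1.96)
by_rank = binarize_by_rank(toy, 3)
```

A z-score cutoff yields gene sets of varying size, which is what the histogram shows, whereas a rank cutoff fixes every gene set at the same size.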


[Figure S3.4 image not recoverable from text extraction. Panel A: 1000 null SNP runs across all traits. Panel B: 1000 null gene runs across all traits. Both panels are histograms; x-axis: coefficient z-score (about −2.5 to 2.5); y-axis: count (0 to 600).]

Figure S3.4. Results from the Type I error analysis. Panel (a) shows results from randomly prioritizing 10% of SNPs and panel (b) shows results from randomly prioritizing 10% of genes. Histogram displays the z-score distribution for τ in both sets of simulations (each consisting of 1,000 null simulations, each tested on 10 real GWAS, for a total of 10,000 data points).
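The random prioritization used for these null simulations can be sketched in a few lines of Python. The function name and the seeding scheme are hypothetical; the sketch only draws the random 10% subset and does not run the downstream LD score regression.

```python
import random


def random_prioritization(items, fraction=0.10, seed=None):
    """Return a uniformly random subset holding `fraction` of `items`."""
    rng = random.Random(seed)
    pool = sorted(items)                 # fixed order so a seed is reproducible
    k = round(fraction * len(pool))      # e.g. 10% of all genes or SNPs
    return set(rng.sample(pool, k))


# Usage: 1,000 null draws over a placeholder list of gene names.
genes = [f"gene{i}" for i in range(200)]
null_draws = [random_prioritization(genes, seed=s) for s in range(1000)]
```

Each draw would then be scored against every GWAS in the same way as a real prioritization, so that the resulting z-scores form the null distribution shown in the histograms.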


Figure S3.5. Effect sizes (normalized τ) for separate LD score regression models comparing (1) DEPICT-gene-sets, (2) DEPICT-GEO, (3) DEPICT-GTEx, (4) the “intersect” (genes prioritized by at least two of the three preceding methods) and (5) the “outersect” (genes prioritized by only one method). (“Separate LD score regression models” indicates that these annotations are tested individually rather than jointly.) Error bars represent 95% confidence intervals. Note that y-axis scales are different for each trait. (a) Results from each trait. (b) Results meta-analyzed across 16 traits.

Figure S3.5 (continued).

[Figure S3.5 image not recoverable from text extraction. Panel A: one sub-panel per trait (height, BMI, WHRadjBMI, skin pigment, red blood cell count, white blood cell count, diastolic blood pressure, systolic blood pressure, LDL cholesterol, HDL cholesterol, total cholesterol, triglycerides, type 2 diabetes, years of education, smoking status, schizophrenia, allergy or eczema, IBD, age of menarche, age of menopause); y-axis: prioritization effect size (normalized); x-axis: gene sets, GEO, GTEx, intersect, outersect. Panel B: meta-analyzed prioritization effect size (normalized) for the same five categories.]

[Figure S3.6 plot. Title: “Proportion of Prioritized SNPs”. One sub-panel per trait; x-axis: gene sets, GEO, GTEx, intersect, outersect; y-axis: Proportion of SNPs Prioritized.]

Figure S3.6. For each trait, proportion of SNPs prioritized by DEPICT-gene-sets, DEPICT-GEO, DEPICT-GTEx, the intersect, and the outersect.

[Figure S3.7 plot. Title: “Number of Genes In Each Condition”. x-axis: Trait; y-axis: Number of Genes; categories: intersect, outersect.]

Figure S3.7. For each trait, number of genes in the “intersect” and “outersect” conditions for DEPICT-gene-sets, DEPICT-GEO, and DEPICT-GTEx. Dashed line is at 10% of the genome, which represents the number of genes included in each individual DEPICT condition.

Figure S3.8. Effect sizes (normalized τ) for separate LD score regression models in which each of DEPICT-gene-sets, DEPICT-GEO, and DEPICT-GTEx is split into individual intersects and outersects. For example, the DEPICT-gene-sets intersect consists of genes prioritized by DEPICT-gene-sets and at least one of the other two methods. The DEPICT-gene-sets outersect consists of genes prioritized by DEPICT-gene-sets only. The “all methods combined” column represents the union of all genes prioritized by any method (and the black and grey bars for that column represent the original intersect and outersect data). Venn diagrams show which sets of genes are included for each analysis. Error bars represent 95% confidence intervals. Results meta-analyzed across 16 traits.

Figure S3.9. Effect sizes (normalized τ) for separate LD score regression models comparing (1) DEPICT with binarized gene sets, (2) MAGMA, (3) the “intersect” (genes prioritized by both DEPICT and MAGMA) and (4) the “outersect” (genes prioritized by only DEPICT or only MAGMA). (“Separate LD score regression models” indicates that these annotations are tested individually rather than jointly.) Error bars represent 95% confidence intervals. Note that y-axis scales are different for each trait. (a) Results from each trait. (b) Results meta-analyzed across 16 traits.

Figure S3.9 (continued).

[Figure S3.9 plot. Panel A: one sub-panel per trait; x-axis: binarization (Top 50 Genes, Top 100 Genes, Top 200 Genes, Z > 1.96, Z > 2.58, Z > 3.29); y-axis: Prioritization Effect Size (Normalized τ). Panel B: same x-axis; y-axis: Meta-Analyzed Prioritization Effect Size (Normalized τ). Legend: DEPICT_binarized, MAGMA, DEPICT_MAGMA_intersect, DEPICT_MAGMA_outersect.]

[Figure S3.10 plot. Panels: Top 50 Genes, Top 100 Genes, Top 200 Genes. x-axis: Trait; y-axis: Proportion of SNPs Prioritized. Legend: DEPICT_binarized, MAGMA, DEPICT_MAGMA_intersect, DEPICT_MAGMA_outersect.]

Figure S3.10. For each trait, proportion of SNPs prioritized by (1) DEPICT with binarized gene sets, (2) MAGMA, (3) the “intersect”, and (4) the “outersect.” Figure displays results for the three rank-based binarizations.

[Figure S3.11 plot. Panels: Z > 1.96, Z > 2.58, Z > 3.29. x-axis: Trait; y-axis: Proportion of SNPs Prioritized. Legend: DEPICT_binarized, MAGMA, DEPICT_MAGMA_intersect, DEPICT_MAGMA_outersect.]

Figure S3.11. For each trait, proportion of SNPs prioritized by (1) DEPICT with binarized gene sets, (2) MAGMA, (3) the “intersect”, and (4) the “outersect.” Figure displays results for the three z-score-based binarizations.
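The rank-based (Figure S3.10) and z-score-based (Figure S3.11) binarizations can be sketched as two small functions. This is a minimal Python illustration, assuming each gene set is represented as a mapping from gene to a membership z-score; that representation, the function names, and the toy values are assumptions made for the example. The cutoffs 1.96, 2.58, and 3.29 correspond to two-sided normal p-values of roughly 0.05, 0.01, and 0.001.

```python
def binarize_top_n(z_scores, n):
    """Keep the n genes with the highest z-scores (rank-based binarization)."""
    ranked = sorted(z_scores, key=z_scores.get, reverse=True)
    return set(ranked[:n])


def binarize_z_threshold(z_scores, threshold):
    """Keep genes whose z-score exceeds the threshold (z-score-based binarization)."""
    return {gene for gene, z in z_scores.items() if z > threshold}


# Toy membership z-scores for one gene set (hypothetical values).
z_scores = {"G1": 4.0, "G2": 2.7, "G3": 2.0, "G4": 0.5, "G5": -1.2}

print(sorted(binarize_top_n(z_scores, 2)))           # ['G1', 'G2']
for cutoff in (1.96, 2.58, 3.29):
    print(cutoff, sorted(binarize_z_threshold(z_scores, cutoff)))
```

Rank-based cutoffs fix the size of every binarized gene set, whereas z-score cutoffs let the size vary with the strength of the membership signal.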

[Figure S3.12 plot. Panels: Top 50 Genes, Top 100 Genes, Top 200 Genes, Z > 1.96, Z > 2.58, Z > 3.29. x-axis: trait; y-axis: Number of Genes; categories: intersect, outersect.]

Figure S3.12. For each trait, number of genes in the “intersect” and “outersect” conditions for binarized DEPICT and MAGMA. Each version of the binarized gene sets is shown separately.

Figure S3.13. Effect sizes (normalized τ) for an annotation consisting of all blood eQTLs regulating at least one gene in our data set. In the “Separate” model, the all-eQTL annotation was modeled with the baseline model only. In the “Joint (DEPICT-datatype)” and “Joint (DEPICT-MAGMA)” models, the all-eQTL annotation was jointly modeled with all eQTLs regulating prioritized genes from either the DEPICT-datatype intersect set or DEPICT-MAGMA intersect set (where the DEPICT-MAGMA intersect set was based on the Z > 3.29 binarization). Error bars represent 95% confidence intervals. (a) Results for all traits. (b) Results meta-analyzed across 16 traits.

Figure S3.13 (continued).

Figure S3.14. Effect sizes (normalized τ) for a joint S-LDSC model including (1) all eQTLs, (2) the original gene intersect prioritization annotation (i.e., all SNPs in the gene body and a 50-kb window), and (3) all eQTLs regulating the prioritized intersect genes. Results are shown for two intersect sets: DEPICT-datatype and DEPICT-MAGMA (the latter based on the Z > 3.29 binarization). (a) Results for all traits. (b) Results meta-analyzed across 16 traits.
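The gene-based annotation mentioned in this caption (all SNPs in the gene body and a 50-kb window) can be sketched as a simple positional lookup. The Python example below assumes the window extends 50 kb on each side of the gene body, which is one reading of the caption rather than a confirmed detail; the function name, tuple layouts, and SNP identifiers are hypothetical.

```python
def annotate_snps(snps, genes, window=50_000):
    """Return {snp_id: 0 or 1}, where 1 means the SNP lies within a
    prioritized gene body extended by `window` bp on each side (assumed).

    snps:  iterable of (snp_id, chrom, pos)
    genes: iterable of (chrom, start, end) for prioritized genes
    """
    by_chrom = {}
    for chrom, start, end in genes:
        by_chrom.setdefault(chrom, []).append((start - window, end + window))

    annotation = {}
    for snp_id, chrom, pos in snps:
        intervals = by_chrom.get(chrom, [])
        annotation[snp_id] = int(any(lo <= pos <= hi for lo, hi in intervals))
    return annotation


# Toy coordinates for illustration only.
genes = [("1", 100_000, 120_000)]
snps = [
    ("rs_a", "1", 60_000),    # inside the upstream window
    ("rs_b", "1", 49_999),    # just outside the window
    ("rs_c", "1", 170_000),   # at the downstream window edge
    ("rs_d", "2", 110_000),   # different chromosome
]
print(annotate_snps(snps, genes))
```

The linear scan over intervals keeps the sketch short; a genome-scale annotation would more likely use sorted intervals or an interval tree.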

Figure S3.15. Effect sizes (normalized τ) for an analysis in which we divided our intersect annotations into eQTLs and non-eQTLs and modeled them jointly for each trait. (a) Results for all traits. (b) Results meta-analyzed across 16 traits.

Figure S3.16. Effect sizes (normalized τ) for separate LD score regression models comparing (1) DEPICT-gene-sets, (2) MAGMA (based on the Z > 2.58 binarization), and (3) NetWAS using the global tissue network with three p-value thresholds: p < 0.01, p < 0.0001, and a Bonferroni correction for the number of genes tested. (“Separate LD score regression models” indicates that these annotations are tested individually rather than jointly.) Error bars represent 95% confidence intervals. Note that y-axis scales are different for each trait. (a) Results from each trait. (b) Results meta-analyzed across 16 traits.
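The three p-value thresholds named in this caption can be illustrated with a short Python sketch. The Bonferroni cutoff below divides a family-wise level of 0.05 by the number of genes tested; the 0.05 level, the gene count of 20,000, the function names, and the toy p-values are all assumptions for illustration and are not stated in the caption.

```python
def p_value_cutoffs(n_genes_tested, alpha=0.05):
    """Return the three cutoffs; `alpha` for Bonferroni is an assumed level."""
    return {
        "p < 0.01": 0.01,
        "p < 0.0001": 0.0001,
        "Bonferroni": alpha / n_genes_tested,
    }


def select_genes(p_values, cutoff):
    """Keep genes whose p-value falls strictly below the cutoff."""
    return {gene for gene, p in p_values.items() if p < cutoff}


# Toy p-values (hypothetical); 20,000 tested genes is an assumed count.
p_values = {"G1": 1e-7, "G2": 5e-5, "G3": 5e-3, "G4": 0.2}
cutoffs = p_value_cutoffs(n_genes_tested=20_000)

for label, cutoff in cutoffs.items():
    print(label, sorted(select_genes(p_values, cutoff)))
```

Stricter cutoffs prioritize fewer genes, so the three annotations trade off set size against the confidence of each prioritized gene.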

Figure S3.17. Effect sizes (normalized τ) for separate LD score regression models comparing NetWAS performance for the global tissue network and several trait-relevant tissues for nine of our GWAS. For each trait, we used the p-value threshold with the best performance from the global network. (“Separate LD score regression models” indicates that these annotations are tested individually rather than jointly.) Error bars represent 95% confidence intervals.

Supplementary tables for Chapter 3

Table S3.1. GWAS data used for each trait. Attached as digital information.

Table S3.2. Meta-analyzed normalized coefficient (tau) value for each of the major individual analyses (meta-analyzed across 16 traits; we excluded four phenotypically overlapping traits – diastolic blood pressure, HDL cholesterol, total cholesterol, and triglycerides). Analyses are also performed separately for brain- and non-brain-related traits. Brain-related traits = BMI, menarche, schizophrenia, smoking, years of education. Attached as digital information.
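A meta-analyzed coefficient of this kind can be thought of as a weighted average of per-trait estimates. The Python sketch below assumes an inverse-variance-weighted fixed-effects model, which is an assumption made purely for illustration: this section does not state which meta-analysis model was used. The 95% confidence interval uses the normal approximation (estimate ± 1.96 standard errors). The function name and the input values are hypothetical.

```python
import math


def inverse_variance_meta(estimates, standard_errors):
    """Fixed-effects inverse-variance meta-analysis (assumed model).

    Returns the pooled estimate, its standard error, and a 95% CI.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical per-trait normalized coefficients and standard errors.
estimates = [1.0, 3.0]
standard_errors = [1.0, 1.0]
pooled, pooled_se, ci = inverse_variance_meta(estimates, standard_errors)
print(pooled, round(pooled_se, 4), ci)
```

A random-effects model would add a between-trait variance term to each weight, which widens the interval when traits are heterogeneous.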

Table S3.3. Stratified LD score regression results for individual models of DEPICT-gene-sets, DEPICT-GEO, DEPICT-GTEx, the “intersect” (prioritized by at least two of the three preceding methods), and the “outersect” (prioritized by only one method). Attached as digital information.

Table S3.4. Stratified LD score regression results for joint models of DEPICT-gene-sets, DEPICT-GEO, and DEPICT-GTEx. Attached as digital information.

Table S3.5. Joint stratified LD score regression results for the “intersect” (prioritized by at least two of DEPICT-gene-sets, DEPICT-GEO, and DEPICT-GTEx), and the “outersect” (prioritized by only one of those three methods). Attached as digital information.

Table S3.6. Stratified LD score regression results for individual models of binarized DEPICT, MAGMA, the “intersect” (prioritized by both methods), and the “outersect” (prioritized by one method only). Attached as digital information.

Table S3.7. Stratified LD score regression results for joint models of DEPICT and MAGMA. Attached as digital information.

Table S3.8. Joint stratified LD score regression results for the “intersect” (prioritized by both DEPICT-binarized and MAGMA), and the “outersect” (prioritized by only one of those two methods). Attached as digital information.

Table S3.9. Fisher's exact tests for the enrichment of nearest-genes within prioritized genes, intersect, and outersect (where prioritized genes = the union of intersect and outersect genes). DEPICT-datatype = DEPICT-gene-sets, DEPICT-GEO, and DEPICT-GTEx analysis. Attached as digital information.
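For readers unfamiliar with the test, the following Python sketch computes a one-sided Fisher's exact p-value for enrichment from a 2×2 table using the hypergeometric distribution. The one-sided ("greater") alternative and the table layout (rows: in the prioritized set or not; columns: nearest gene or not) are assumptions for illustration; the table description does not specify either, and the counts shown are made up.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Computes P(X >= a), where X is hypergeometric with the table's
    margins held fixed (tests for over-representation in cell a).
    """
    row1 = a + b          # e.g. genes in the prioritized set
    col1 = a + c          # e.g. genes that are nearest genes
    n = a + b + c + d     # all genes considered
    denom = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / denom


# Made-up counts: 3 of 4 prioritized genes are nearest genes,
# versus 1 of 4 non-prioritized genes.
p = fisher_exact_greater(3, 1, 1, 3)
print(round(p, 6))  # 0.242857
```

Because the computation uses exact integer binomial coefficients, it avoids floating-point underflow for moderately sized tables.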

Table S3.10. NetWAS results from individual models. Attached as digital information.

Table S3.11. Joint S-LDSC model with DEPICT-gene-sets, MAGMA (with the Z > 2.58 binarized gene sets), and NetWAS (Bonferroni-corrected, global network). Attached as digital information.

References

1. Hirschhorn, J. N. & Daly, M. J. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6, 95–108 (2005).

2. Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101, 5–22 (2017).

3. Visscher, P. M., Hill, W. G. & Wray, N. R. Heritability in the genomics era – concepts and misconceptions. Nat. Rev. Genet. 9, 255–266 (2008).

4. Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS discovery. Am. J. Hum. Genet. 90, 7–24 (2012).

5. Ward, L. D. & Kellis, M. Interpreting noncoding genetic variation in complex traits and human disease. Nat. Biotechnol. 30, 1095–106 (2012).

6. Maurano, M. T. et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science 337, 1190–1195 (2012).

7. Kiezun, A. et al. Exome sequencing and the genetic basis of complex traits. Nat. Genet. 44, 623–630 (2012).

8. Gazal, S. et al. Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nat. Genet. 50, 1600–1607 (2018).

9. Zeng, J. et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet. 50, 746–753 (2018).

10. Schoech, A. P. et al. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nat. Commun. 10, 790 (2019).

11. Auer, P. L. & Lettre, G. Rare variant association studies: considerations, challenges and opportunities. Genome Med. 7, 16 (2015).

12. Turcot, V. et al. Protein-altering variants associated with body mass index implicate pathways that control energy intake and expenditure in obesity. Nat. Genet. 50, 26–41 (2018).

13. Fuchsberger, C. et al. The genetic architecture of type 2 diabetes. Nature (2016). doi:10.1038/nature18642

14. Peloso, G. M. et al. Association of Low-Frequency and Rare Coding-Sequence Variants with Blood Lipids and Coronary Heart Disease in 56,000 Whites and Blacks. Am. J. Hum. Genet. 94, 223–232 (2014).

15. Marouli, E. et al. Rare and low-frequency coding variants alter human adult height. Nature (2016).

16. Justice, A. E. et al. Protein-coding variants implicate novel genes related to lipid homeostasis contributing to body-fat distribution. Nat. Genet. 51, 452–469 (2019).

17. Wessel, J. et al. Low-frequency and rare exome chip variants associate with fasting glucose and type 2 diabetes susceptibility. Nat. Commun. 6, 5897 (2015).

18. Ng, N. H. J. et al. Tissue-Specific Alteration of Metabolic Pathways Influences Glycemic Regulation. BioRxiv (2019).

19. Huyghe, J. R. et al. Exome array analysis identifies new loci and low-frequency variants influencing insulin processing and secretion. Nat. Genet. 45, 197–201 (2012).

20. Liu, D. J. et al. Exome-wide association study of plasma lipids in >300,000 individuals. Nat. Genet. 49, 1758–1766 (2017).

21. Mahajan, A. et al. Refining the accuracy of validated target identification through coding variant fine-mapping in type 2 diabetes. Nat. Genet. 50, 559–571 (2018).

22. Ramanan, V. K., Shen, L., Moore, J. H. & Saykin, A. J. Pathway analysis of genomic data: Concepts, methods, and prospects for future development. Trends Genet. 28, 323–332 (2012).

23. Mooney, M. A., Nigg, J. T., McWeeney, S. K. & Wilmot, B. Functional and genomic context in pathway analysis of GWAS data. Trends Genet. 30, 390–400 (2014).

24. Wang, L., Jia, P., Wolfinger, R. D., Chen, X. & Zhao, Z. Gene set analysis of genome-wide association studies: Methodological issues and perspectives. Genomics 98, 1–8 (2011).

25. de Leeuw, C. A., Neale, B. M., Heskes, T. & Posthuma, D. The statistical properties of gene-set analysis. Nat. Rev. Genet. 17, 353–364 (2016).

26. Kao, P. Y. P., Leung, K. H., Chan, L. W. C., Yip, S. P. & Yap, M. K. H. Pathway analysis of complex diseases for GWAS, extending to consider rare variants, multi-omics and interactions. Biochim. Biophys. Acta - Gen. Subj. 1861, 335–353 (2017).

27. Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).

28. Pers, T. H. et al. Comprehensive analysis of schizophrenia-associated loci highlights ion channel pathways and biologically plausible candidate causal genes. Hum. Mol. Genet. 1–28 (2016). doi:10.1093/hmg/ddw007

29. Schijven, D. et al. Comprehensive pathway analyses of schizophrenia risk loci point to dysfunctional postsynaptic signaling. Schizophr. Res. 199, 195–202 (2018).

30. de Leeuw, C. A., Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: Generalized Gene-Set Analysis of GWAS Data. PLOS Comput. Biol. 11, e1004219 (2015).

31. Liu, J. Z. et al. A Versatile Gene-Based Test for Genome-wide Association Studies. Am. J. Hum. Genet. 87, 139–145 (2010).

32. Purcell, S. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).

33. Lips, E. S., Kooyman, M., de Leeuw, C. & Posthuma, D. JAG: A computational tool to evaluate the role of gene-sets in complex traits. Genes (Basel). 6, 238–251 (2015).

34. Lee, P. H., O’Dushlaine, C., Thomas, B. & Purcell, S. M. INRICH: Interval-based enrichment analysis for genome-wide association studies. Bioinformatics 28, 1797–1799 (2012).

35. Holmans, P. et al. Gene Ontology Analysis of GWA Study Data Sets Provides Insights into the Biology of Bipolar Disorder. Am. J. Hum. Genet. 85, 13–24 (2009).

36. Wang, K., Li, M. & Bucan, M. Pathway-Based Approaches for Analysis of Genomewide Association Studies. Am. J. Hum. Genet. 81, 1278–1283 (2007).

37. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102, 15545–50 (2005).

38. Thévenin, A., Ein-Dor, L., Ozery-Flato, M. & Shamir, R. Functional gene groups are concentrated within chromosomes, among chromosomes and in the nuclear space of the human genome. Nucleic Acids Res. 42, 9854–9861 (2014).

39. Hong, M. G., Pawitan, Y., Magnusson, P. K. E. & Prince, J. A. Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum. Genet. 126, 289–301 (2009).

40. Pers, T. H. et al. Biological interpretation of genome-wide association studies using predicted gene functions. Nat. Commun. 6, 5890 (2015).

41. Fehrmann, R. S. N. et al. Gene expression analysis identifies global gene dosage sensitivity in cancer. Nat. Genet. 47, 115–125 (2015).

42. Ogata, H. et al. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 27, 29–34 (1999).

43. Croft, D. et al. Reactome: A database of reactions, pathways and biological processes. Nucleic Acids Res. 39, 691–697 (2011).

44. Zhao, J., Zhu, Y., Boerwinkle, E. & Xiong, M. Pathway analysis with next-generation sequencing data. Eur. J. Hum. Genet. 23, 507–515 (2015).

45. Pan, W., Kim, J., Zhang, Y., Shen, X. & Wei, P. A powerful and adaptive association test for rare variants. Genetics 197, 1081–1095 (2014).

46. Petersen, A. et al. Evaluating methods for combining rare variant data in pathway-based tests of genetic association. BMC Proc. 5, S48 (2011).

47. Wu, G. & Zhi, D. Pathway-Based Approaches for Sequencing-Based Genome-Wide Association Studies. Genet. Epidemiol. 37, 478–494 (2013).

48. Evans, L. M. & Keller, M. C. Using partitioned heritability methods to explore genetic architecture. Nat. Rev. Genet. 19, 185 (2018).

49. Yang, J. et al. Common SNPs Explain a Large Proportion of the Heritability for Human Height. Nat. Genet. 42, 565–569 (2010).

50. Lee, S. H. et al. Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat. Genet. 44, 247–250 (2012).

51. Gusev, A. et al. Partitioning Heritability of Regulatory and Cell-Type-Specific Variants across 11 Common Diseases. Am. J. Hum. Genet. 95, 535–552 (2014).

52. Speed, D., Cai, N., Johnson, M. R., Nejentsev, S. & Balding, D. J. Reevaluation of SNP heritability in complex human traits. Nat. Genet. 49, 986–992 (2017).

53. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).

54. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).

55. Moreau, Y. & Tranchevent, L.-C. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat. Rev. Genet. 13, 523–36 (2012).

56. Tranchevent, L.-C. et al. A guide to web tools to prioritize candidate genes. Brief. Bioinform. 12, 22–32 (2011).

57. Zolotareva, O. & Kleine, M. A Survey of Gene Prioritization Tools for Mendelian and Complex Human Diseases. J. Integr. Bioinform. 1–20 (2019). doi:10.1515/jib-2018-0069

58. Bromberg, Y. Chapter 15: Disease Gene Prioritization. PLoS Comput. Biol. 9, (2013).

59. Seyyedrazzagi, E. & Navimipour, N. J. Disease genes prioritizing mechanisms: a comprehensive and systematic literature review. Netw. Model. Anal. Heal. Informatics Bioinforma. 6, (2017).

60. Oti, M. & Brunner, H. G. The modular nature of genetic diseases. Clin. Genet. 71, 1–11 (2007).

61. Goh, K. Il et al. The human disease network. Proc. Natl. Acad. Sci. U. S. A. 104, 8685–8690 (2007).

62. Aerts, S. et al. Gene prioritization through genomic data fusion. Nat. Biotechnol. 24, 537–544 (2006).

63. Tranchevent, L.-C. et al. Candidate gene prioritization with Endeavour. Nucleic Acids Res. 44, 117–121 (2016).

64. Hutz, J. E., Kraja, A. T., McLeod, H. L. & Province, M. A. CANDID: a flexible method for prioritizing candidate genes for complex human traits. Genet. Epidemiol. 32, 779–90 (2008).

65. Adie, E. A., Adams, R. R., Evans, K. L., Porteous, D. J. & Pickard, B. S. SUSPECTS: Enabling fast and effective prioritization of positional candidates. Bioinformatics 22, 773–774 (2006).

66. Köhler, S., Bauer, S., Horn, D. & Robinson, P. N. Walking the Interactome for Prioritization of Candidate Disease Genes. Am. J. Hum. Genet. 82, 949–958 (2008).

67. Kumar, A. A. et al. pBRIT: gene prioritization by correlating functional and phenotypic annotations through integrative data fusion. Bioinformatics 34, 2254–2262 (2018).

68. Zampieri, G. et al. Scuba: Scalable kernel-based gene prioritization. BMC Bioinformatics 19, 1–12 (2018).

69. Wood, A. R. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 46, 1173–86 (2014).

70. Raychaudhuri, S. et al. Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions. PLoS Genet. 5, e1000534 (2009).

71. Cowen, L., Ideker, T., Raphael, B. J. & Sharan, R. Network propagation: a universal amplifier of genetic associations. Nat. Rev. Genet. 18, 551–562 (2017).

72. Leiserson, M. D. M., Eldridge, J. V., Ramachandran, S. & Raphael, B. J. Network analysis of GWAS data. Curr. Opin. Genet. Dev. 23, 602–610 (2013).

73. Rossin, E. J. et al. Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS Genet. 7, e1001273 (2011).

74. Lee, I., Blom, U. M., Wang, P. I., Shim, J. E. & Marcotte, E. M. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 21, 1109–21 (2011).

75. Shim, J. E. et al. GWAB: a web server for the network-based boosting of human genome-wide association data. Nucleic Acids Res. 45, W154–W161 (2017).

76. Greene, C. S. et al. Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet. 47, 569–76 (2015).

77. Carlin, D. E. et al. A Fast and Flexible Framework for Network-Assisted Genomic Association. iScience 16, 155–161 (2019).

78. Wang, Q. et al. A Bayesian framework that integrates multi-omics data and gene networks predicts risk genes from schizophrenia GWAS data. Nat. Neurosci. 22, (2019).

79. Taşan, M. et al. Selecting causal genes from genome-wide association studies via functionally coherent subnetworks. Nat. Methods 12, 154–9 (2015).

80. Fernández-Tajes, J. et al. Developing a network view of type 2 diabetes risk pathways through integration of genetic, genomic and functional data. Genome Med. 11, 1–14 (2019).

81. Simeone, D. M. et al. Islet hypertrophy following pancreatic disruption of Smad4 signaling. Am. J. Physiol. - Endocrinol. Metab. 291, 1305–1316 (2006).

82. Kim, S. Y. et al. Loss of cyclin-dependent kinase 2 in the pancreas links primary β-cell dysfunction to progressive depletion of β-cell mass and diabetes. J. Biol. Chem. 292, 3841–3853 (2017).

83. Vanunu, O., Magger, O., Ruppin, E., Shlomi, T. & Sharan, R. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol. 6, e1000641 (2010).

84. Guala, D. & Sonnhammer, E. L. L. A large-scale benchmark of gene prioritization methods. Sci. Rep. 7, 1–10 (2017).

85. Schmitt, T., Ogris, C. & Sonnhammer, E. L. L. FunCoup 3.0: Database of genome-wide functional coupling networks. Nucleic Acids Res. 42, 380–388 (2014).

86. Koscielny, G. et al. Open Targets: A platform for therapeutic target identification and validation. Nucleic Acids Res. 45, D985–D994 (2017).

87. Picart-Armada, S. et al. Benchmarking network propagation methods for disease gene identification. PLoS Comput. Biol. 15, e1007276 (2019).

88. Shim, J. E., Hwang, S. & Lee, I. Pathway-dependent effectiveness of network algorithms for gene prioritization. PLoS One 10, 1–10 (2015).

89. Börnigen, D. et al. An unbiased evaluation of gene prioritization tools. Bioinformatics 28, 3081–3088 (2012).

90. Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat. Genet. 50, 621–629 (2018).

91. Ongen, H. et al. Estimating the causal tissues for complex traits and diseases. Nat. Genet. 49, 1676–1683 (2017).

92. Gamazon, E. R. et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat. Genet. 50, 956–967 (2018).

93. Liu, B., Gloudemans, M. J., Rao, A. S., Ingelsson, E. & Montgomery, S. B. Abundant associations with gene expression complicate GWAS follow-up. Nat. Genet. 51, (2019).

94. Hormozdiari, F. et al. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am. J. Hum. Genet. 99, 1245–1260 (2016).

95. Calderon, D. et al. Inferring Relevant Cell Types for Complex Traits by Using Single-Cell Gene Expression. Am. J. Hum. Genet. 101, 686–699 (2017).

96. Shang, L., Smith, J. A. & Zhu, X. Leveraging Gene Co-expression Patterns to Infer Trait-Relevant Tissues in Genome-wide Association Studies. BioRxiv (2019).

97. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–5 (2013).

98. Aguet, F. et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).

99. Hedlund, E. & Deng, Q. Single-cell RNA sequencing: Technical advancements and biological applications. Mol. Aspects Med. 59, 36–46 (2018).

100. Andrews, T. S. & Hemberg, M. Identifying cell populations with scRNASeq. Mol. Aspects Med. 59, 114–122 (2018).

101. Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).

102. Vieira Braga, F. A. et al. A cellular census of human lungs identifies novel cell states in health and in asthma. Nat. Med. 25, 1153–1163 (2019).

103. Aizarani, N. et al. A human liver cell atlas reveals heterogeneity and epithelial progenitors. Nature (2019). doi:10.1038/s41586-019-1373-2

104. Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).

105. Zeisel, A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–42 (2015).

106. Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).

107. Wallace, M. L. et al. Genetically Distinct Parallel Pathways in the Entopeduncular Nucleus for Limbic and Sensorimotor Output of the Basal Ganglia. Neuron 94, 138–152.e5 (2017).

108. Campbell, J. N. et al. A molecular census of arcuate hypothalamus and median eminence cell types. Nat. Neurosci. 20, 484–496 (2017).

109. Chen, R., Wu, X., Jiang, L. & Zhang, Y. Single-Cell RNA-Seq Reveals Hypothalamic Cell Diversity. Cell Rep. 18, 3227–3241 (2017).

110. Zeisel, A. et al. Molecular Architecture of the Mouse Nervous System. Cell 174, 999–1014.e22 (2018).

111. Saunders, A. et al. Molecular Diversity and Specializations among the Cells of the Adult Mouse Brain. Cell 174, 1015–1030.e16 (2018).

112. Herculano-Houzel, S. & Lent, R. Isotropic fractionator: A simple, rapid method for the quantification of total cell and neuron numbers in the brain. J. Neurosci. 25, 2518–2521 (2005).

113. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).

114. Andrews, T. S. & Hemberg, M. M3Drop: dropout-based feature selection for scRNASeq. Bioinformatics 35, 2865–2867 (2019).

115. Nagel, M. et al. Meta-analysis of genome-wide association studies for neuroticism in 449,484 individuals identifies novel genetic loci and pathways. Nat. Genet. 50, 920–927 (2018).

116. Jansen, P. R. et al. Genome-wide analysis of insomnia in 1,331,010 individuals identifies new risk loci and functional pathways. Nat. Genet. 51, 394–403 (2019).

117. Bryois, J. et al. Genetic Identification of Cell Types Underlying Brain Complex Traits Yields Novel Insights Into the Etiology of Parkinson’s Disease. bioRxiv 528463 (2019). doi:10.1101/528463

118. Skene, N. G. et al. Genetic identification of brain cell types underlying schizophrenia. Nat. Genet. 50, 825–833 (2018).

119. Watanabe, K., Umićević Mirkov, M., de Leeuw, C. A., van den Heuvel, M. P. & Posthuma, D. Genetic mapping of cell type specificity for complex traits. Nat. Commun. 10, 3222 (2019).

120. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

121. Blake, J. A., Bult, C. J., Eppig, J. T., Kadin, J. A. & Richardson, J. E. The Mouse Genome Database: Integration of and access to knowledge about the laboratory mouse. Nucleic Acids Res. 42, 810–817 (2014).

122. Lage, K. et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat. Biotechnol. 25, 309–316 (2007).

123. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 1–14 (2016).

124. Frey, B. & Dueck, D. Clustering by Passing Messages Between Data Points. Science 315, 972–976 (2007).

125. Pedregosa, F., Grisel, O., Weiss, R., Passos, A. & Brucher, M. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

126. Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847–2849 (2016).

127. Marouli, E. et al. Rare and low-frequency coding variants alter human adult height. Nature 542, 186–190 (2017).

128. Lamparter, D., Marbach, D., Rueedi, R., Kutalik, Z. & Bergmann, S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLOS Comput. Biol. 12, e1004714 (2016).

129. Schwartz, N. B. & Domowicz, M. Chondrodysplasias due to proteoglycan defects. Glycobiology 12, 57–68 (2002).

130. Wei, H. S., Wei, H. L., Zhao, F., Zhong, L. P. & Zhan, Y. T. Glycosyltransferase GLT8D2 positively regulates apoB100 protein expression in hepatocytes. Int. J. Mol. Sci. 14, 21435–21446 (2013).

131. Ito, H. et al. Molecular Cloning and Biological Activity of a Novel Lysyl Oxidase-related Gene Expressed in Cartilage. J. Biol. Chem. 276, 24023–24029 (2001).

132. Wakahara, T. et al. Fibin, a Novel Secreted Lateral Plate Mesoderm Signal, Is Essential for Pectoral Fin Bud Initiation in Zebrafish. Dev. Biol. 303, 527–535 (2007).

133. Kawano, Y. & Kypta, R. Secreted antagonists of the Wnt signalling pathway. J. Cell Sci. 116, 2627–2634 (2003).

134. Mastaitis, J. et al. Loss of SFRP4 alters body size, food intake, and energy expenditure in diet-induced obese male mice. Endocrinology 156, 4502–4510 (2015).

135. Simsek Kiper, P. O. et al. Cortical-Bone Fragility — Insights from sFRP4 Deficiency in Pyle’s Disease. N. Engl. J. Med. 374, 2553–2562 (2016).

136. Heymsfield, S. B. et al. Scaling of adult body weight to height across sex and race/ethnic groups: relevance to BMI. Am. J. Clin. Nutr. 100, 1455–1461 (2014).

137. Zhou, Z. et al. Cidea-deficient mice have lean phenotype and are resistant to obesity. Nat. Genet. 35, 49–56 (2003).

138. Tews, D. et al. Comparative gene array analysis of progenitor cells from human paired deep neck and subcutaneous adipose tissue. Mol. Cell. Endocrinol. 395, 41–50 (2014).

139. Pischon, T. et al. General and abdominal adiposity and risk of death in Europe. N. Engl. J. Med. 359, 2105–2120 (2008).

140. Canoy, D. Distribution of body fat and risk of coronary heart disease in men and women. Curr. Opin. Cardiol. 23, 591–598 (2008).

141. Wang, Y., Rimm, E. B., Stampfer, M. J., Willett, W. C. & Hu, F. B. Comparison of abdominal adiposity and overall obesity in predicting risk of type 2 diabetes among men. Am. J. Clin. Nutr. 81, 555–563 (2005).

142. Shungin, D. et al. New genetic loci link adipose and insulin biology to body fat distribution. Nature 518, 187–196 (2015).

143. Wheeler, E., Marenne, G. & Barroso, I. Genetic aetiology of glycaemic traits: Approaches and insights. Hum. Mol. Genet. 26, R172–R184 (2017).

144. Cohen, R. M. et al. Red cell life span heterogeneity in hematologically normal people is sufficient to alter HbA1c. Blood 112, 4284–4291 (2008).

145. Wheeler, E. et al. Impact of common genetic determinants of Hemoglobin A1c on type 2 diabetes risk and diagnosis in ancestrally diverse populations: A transethnic genome-wide meta-analysis. PLoS Med. 14, 1–30 (2017).

146. Hart, N. J. et al. Cystic fibrosis-related diabetes is caused by islet loss and inflammation. JCI insight 3, (2018).

147. Woodmansey, C. et al. Incidence, demographics, and clinical characteristics of diabetes of the exocrine pancreas (Type 3c): A retrospective cohort study. Diabetes Care 40, 1486–1493 (2017).

148. ’t Hart, L. M. et al. The CTRB1/2 locus affects diabetes susceptibility and treatment via the incretin pathway. Diabetes 62, 3275–81 (2013).

149. Rosendahl, J. et al. Genome-wide association study identifies inversion in the CTRB1-CTRB2 locus to modify risk for alcoholic and non-alcoholic chronic pancreatitis. Gut 67, 1855–1863 (2018).

150. Wolpin, B. M. et al. Genome-wide association study identifies multiple susceptibility loci for pancreatic cancer. Nat. Genet. 46, 994–1000 (2014).

151. Mahajan, A. et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat. Genet. 50, 1505–1513 (2018).

152. Morris, A. P. et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat. Genet. 44, 981–990 (2012).

153. Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv 531210 (2019). doi:10.1101/531210

154. Lu, Y. et al. New loci for body fat percentage reveal link between adiposity and cardiometabolic disease risk. Nat. Commun. 7, (2016).

155. Zillikens, M. C. et al. Large meta-analysis of genome-wide association studies identifies five loci for lean body mass. Nat. Commun. 8, 80 (2017).

156. Chandran, M., Phillips, S. A., Ciaraldi, T. & Henry, R. R. Adiponectin: More than just another fat cell hormone? Diabetes Care 26, 2442–2450 (2003).

157. Duncan, B. B. et al. Adiponectin and the development of type 2 diabetes: the atherosclerosis risk in communities study. Diabetes 53, 2473–8 (2004).

158. Shibata, R., Ouchi, N. & Murohara, T. Adiponectin and cardiovascular disease. Circ. J. 73, 608–614 (2009).

159. Dastani, Z. et al. Novel loci for adiponectin levels and their influence on type 2 diabetes and metabolic traits: A multi-ethnic meta-analysis of 45,891 individuals. PLoS Genet. 8, (2012).

160. Spracklen, C. N. et al. Exome-Derived Adiponectin-Associated Variants Implicate Obesity and Lipid Biology. Am. J. Hum. Genet. 105, 15–28 (2019).

161. Fischer, J. et al. The gene encoding adipose triglyceride lipase (PNPLA2) is mutated in neutral lipid storage disease with myopathy. Nat. Genet. 39, 28–30 (2007).

162. Gandotra, S. et al. Perilipin deficiency and autosomal dominant partial lipodystrophy. N. Engl. J. Med. 364, 740–748 (2011).

163. Kelesidis, T., Kelesidis, I., Chou, S. & Mantzoros, C. S. Narrative review: the role of leptin in human physiology: emerging clinical applications. Ann. Intern. Med. 152, 93–100 (2010).

164. Kilpeläinen, T. O. et al. Genome-wide meta-analysis uncovers novel loci influencing circulating leptin levels. Nat. Commun. 7, (2016).

165. Tranchevent, L. C. et al. A guide to web tools to prioritize candidate genes. Brief. Bioinform. 12, 22–32 (2011).

166. Spain, S. L. & Barrett, J. C. Strategies for fine-mapping complex traits. Hum. Mol. Genet. 24, R111–R119 (2015).

167. Jia, P. & Zhao, Z. Network-assisted analysis to prioritize GWAS results: Principles, methods and perspectives. Hum. Genet. 133, 125–138 (2014).

168. Hou, L. & Zhao, H. A review of post-GWAS prioritization approaches. Front. Genet. 4, 1–6 (2013).

169. Gottlieb, A., Magger, O., Berman, I., Ruppin, E. & Sharan, R. PRINCIPLE: a tool for associating genes with diseases via network propagation. Bioinformatics 27, 3325–6 (2011).

170. Schmidt, E. M. et al. GREGOR: Evaluating global enrichment of trait-associated variants in epigenomic features using a systematic, data-driven approach. Bioinformatics 31, 2601–2606 (2015).

171. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).

172. Liu, J. Z. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–989 (2015).

173. Teslovich, T. M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–13 (2010).

174. Loh, P. R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).

175. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

176. Gazal, S. et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017).

177. Wickham, H. ggplot2: Elegant Graphics for Data Analysis. (Springer-Verlag New York, 2016).

178. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 41, D991-5 (2013).

179. Wong, A. K., Krishnan, A. & Troyanskaya, O. G. GIANT 2.0: Genome-scale integrated analysis of gene networks in tissues. Nucleic Acids Res. 46, W65–W70 (2018).

180. Võsa, U. et al. Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis. bioRxiv 447367 (2018). doi:10.1101/447367

181. Day, F. R. et al. Genomic analyses identify hundreds of variants associated with age at menarche and support a role for puberty timing in cancer risk. Nat. Genet. 49, 834–841 (2017).

182. O’Connor, L. J. et al. Extreme Polygenicity of Complex Traits Is Explained by Negative Selection. Am. J. Hum. Genet. 105, 456–476 (2019).

183. Yengo, L. et al. Meta-analysis of genome-wide association studies for height and body mass index in ~700 000 individuals of European ancestry. Hum. Mol. Genet. 27, 3641–3649 (2018).

184. Yeo, G. S. H. & Heisler, L. K. Unraveling the brain regulation of appetite: lessons from genetics. Nat. Neurosci. 15, 1343–1349 (2012).

185. Furness, J. B. The enteric nervous system and neurogastroenterology. Nat. Rev. Gastroenterol. Hepatol. 9, 286–294 (2012).

186. Rao, M. & Gershon, M. D. The bowel and beyond: The enteric nervous system in neurological disorders. Nat. Rev. Gastroenterol. Hepatol. 13, 517–528 (2016).

187. Stiles, J. & Jernigan, T. L. The basics of brain development. Neuropsychol. Rev. 20, 327–348 (2010).

188. Pulit, S. L. et al. Meta-Analysis of genome-wide association studies for body fat distribution in 694 649 individuals of European ancestry. Hum. Mol. Genet. 28, 166–174 (2019).

189. Skene, N. G. & Grant, S. G. N. Identification of vulnerable cell types in major brain disorders using single cell transcriptomes and expression weighted cell type enrichment. Front. Neurosci. 10, 1–11 (2016).

190. Andermann, M. L. & Lowell, B. B. Toward a Wiring Diagram Understanding of Appetite Control. Neuron 95, 757–778 (2017).

191. de Lartigue, G. Role of the vagus nerve in the development and treatment of diet-induced obesity. J. Physiol. 594, 5791–5815 (2016).

192. Makaronidis, J. M. & Batterham, R. L. Obesity, body weight regulation and the brain: Insights from fMRI. Br. J. Radiol. 91, (2018).

193. Kemp, J. M. & Powell, T. P. The structure of the caudate nucleus of the cat: light and electron microscopy. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 262, 383–401 (1971).

194. Fine, R. S., Pers, T. H., Amariuta, T., Raychaudhuri, S. & Hirschhorn, J. N. Benchmarker: An Unbiased, Association-Data-Driven Strategy to Evaluate Gene Prioritization Algorithms. Am. J. Hum. Genet. 104, 1025–1039 (2019).

195. The ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

196. The Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

197. Forrest, A. R. R. et al. A promoter-level mammalian expression atlas. Nature 507, 462–70 (2014).

198. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–61 (2014).

199. Fulco, C. P. et al. Activity-by-contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).

200. Weeks, E. et al. PgmNr 2991: Computational gene prioritization from GWAS using local and polygenic signal. in American Society of Human Genetics Annual Meeting (2019).

201. Snel, B., Lehmann, G., Bork, P. & Huynen, M. A. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 28, 3442–4 (2000).

202. de Leeuw, C. A., Stringer, S., Dekkers, I. A., Heskes, T. & Posthuma, D. Conditional and interaction gene-set analysis reveals novel functional pathways for blood pressure. Nat. Commun. 9, (2018).

203. Carbonetto, P. & Stephens, M. Integrated Enrichment Analysis of Variants and Pathways in Genome-Wide Association Studies Indicates Central Role for IL-2 Signaling Genes in Type 1 Diabetes, and Cytokine Signaling Genes in Crohn’s Disease. PLoS Genet. 9, (2013).

204. Zhu, X. & Stephens, M. Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nat. Commun. 9, 1–14 (2018).

205. Smemo, S. et al. Obesity-associated variants within FTO form long-range functional connections with IRX3. Nature 507, 371–5 (2014).

206. Claussnitzer, M. et al. FTO obesity variant circuitry and adipocyte browning in humans. N. Engl. J. Med. 373, 895–907 (2015).

207. Zhou, K. & Pearson, E. R. Insights from Genome-Wide Association Studies of Drug Response. Annu. Rev. Pharmacol. Toxicol. 53, 299–310 (2012).
