Network based analysis of genetic disease associations

Sarah Gilman

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy under the Executive Committee of the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY 2014

© 2013 Sarah Gilman All Rights Reserved

ABSTRACT Network based analysis of genetic disease associations Sarah Gilman

Despite extensive efforts and many promising early findings, -wide association studies have explained only a small fraction of the genetic factors contributing to common diseases.

There are many theories about where this “missing heritability” might lie, but increasingly the prevailing view is that common variants, the target of GWAS, are not solely responsible for susceptibility to common diseases and a substantial portion of human disease risk will be found among rare variants.

Relatively new, such variants have not been subject to purifying selection, and therefore may be particularly pertinent for neuropsychiatric disorders and other diseases with greatly reduced fecundity.

Recently, several researchers have made great progress towards uncovering the genetics behind autism and schizophrenia. By sequencing families, they have found hundreds of de novo variants occurring only in affected individuals, both large structural copy number variants and single nucleotide variants. Despite studying large cohorts there has been little recurrence among the implicated suggesting that many hundreds of genes may underlie these complex phenotypes. The question becomes how to tie these rare together into a cohesive picture of disease risk.

Biological networks represent an intuitive answer, as different mutations which converge on the same phenotype must share some underlying biological process. Network-based analysis offers three major advantages: it allows easy integration of both common and rare variants, it allows us to assign significance to collection of genes where individual genes may not be significant due to rarity, and it allows easier identification of the biological processes underlying physical consequences.

This work presents the construction of a novel phenotype network and a method for the analysis of disease-associated variants. This method has been applied to de novo mutations and GWAS results associated with both autism and schizophrenia and found clusters of genes strongly connected by shared function for both diseases. The results help elucidate the real physical consequences of putative disease mutations, leading to a better understanding of the pathophysiology of the diseases.

Table of contents

1. Introduction ...... 1 1.1. Thesis Overview ...... 4 1.2. References ...... 7 2. Pleiotropy: contribution to multiple diseases ...... 9 2.1. Introduction ...... 9 2.2. Results ...... 11 2.2.1. Disease mutations and domains ...... 11 2.2.2. functional annotations and Pleiotropy ...... 13 2.2.3. Pleiotropy in yeast genes ...... 16 2.3. Discussion ...... 17 2.4. Methods ...... 19 2.5. References ...... 21 3. A functional network of phenotypic interactions ...... 22 3.1. Introduction ...... 22 3.2. Results and Discussion ...... 24 3.3. Methods ...... 29 3.4. References ...... 33 4. Network based analysis of de novo copy number variants associated with autism ...... 35 4.1. Introduction ...... 35 4.2. Results ...... 38 4.2.1. Autism associated networks ...... 38 4.2.2. Gender Bias in autism variants ...... 39 4.2.3. ASD network and related sets ...... 41 4.2.4. General Network functions ...... 41 4.2.5. Synaptic contacts and Dendritic Morphology ...... 43 4.2.6. Consequences for Dendritic Morphology ...... 45 4.3. Discussion ...... 47 4.4. Methods ...... 49 4.5. References ...... 57

i

5. Diverse genetic variations in schizophrenia converge on functional gene networks ...... 62 5.1. Introduction ...... 62 5.2. Results ...... 64 5.2.1. Gene networks from schizophrenia-associated variations ...... 64 5.2.2. Biological processes associated with schizophrenia clusters ...... 67 5.2.3. Expression of genes from schizophrenia-associated clusters ...... 69 5.2.4. Common functions between schizophrenia and autism ...... 70 5.3. Discussion ...... 73 5.4. Methods ...... 75 5.5. References ...... 90 6. Function and phenotype across de novo genetic mutations associated with autism ...... 92 6.1. Introduction ...... 92 6.2. Results ...... 94 6.2.1. Gene cluster affected by de novo mutations ...... 94 6.2.2. Biological functions associated with autism cluster ...... 97 6.2.3. Association of NETBAG network with functionally relevant gene sets ...... 99 6.2.4. Patterns of neuronal expression associated with ASD ...... 100 6.2.5. Temporal and regional bias in neuronal expression ...... 104 6.2.6. Phenotypic variation in ASD-associated genes ...... 106 6.3. Discussion ...... 109 6.4. Methods ...... 111 6.5. References ...... 118

ii

List of Figures and Tables Figure 2.1: Sample SNPs in protein domains...... 11 Figure 2.2: Associations between disease gene pleiotropy and various descriptors...... 14 Figure 3.1: Disease gene predictions by evidence source...... 25 Figure 3.2: Naïve versus Deweighted scoring...... 26 Table 3.1: Selected ranks from figure 3.2...... 26 Figure 3.3: Disease gene predictions using different decoy genes...... 27 Table 3.2: Area under curve based on network source and decoy genes selected...... 28 Figure 4.1: The NETBAG method...... 37 Figure 4.2: NETBAG gene clusters found in de novo CNV regions associated with autistic individuals. .... 39 Figure 4.3: Functional importance of genes perturbed by CNV events in females and males...... 40 Table 4.1: Gene Ontology terms associated with autism cluster ...... 42 Figure 4.4: Important genes associated with the morphogenesis of dendritic spines...... 45 Table 4.2: Likely impact on dendritic growth for ASD genes ...... 46 Figure 4.5: Illustration of the cluster building procedure used in the NETBAG approach...... 51 Table 4.3: Function notes on NETBAG cluster genes ...... 53 Figure 5.1: The NETBAG+ algorithm...... 63 Table 5.1: Clusters identified by the NETBAG algorithm ...... 65 Figure 5.2: Significant clusters identified using schizophrenia associated genes...... 66 Table 5.2: Selected Gene Ontology (GO) terms associated with schizophrenia clusters...... 68 Figure 5.3: Temporal brain expression profiles for genes forming the identified clusters...... 70 Figure 5.4: Likely impact of genes from de novo CNVs in autism and schizophrenia on growth of dendrites or dendritic spines...... 72 Figure 5.5: Genes forming the NETBAG clusters in the context of cellular signaling pathways...... 74 Table 5.3: Literature review of SNV genes ...... 78 Table 5.4: All Gene Ontology (GO) terms associated with schizophrenia clusters ...... 84 Table 5.5: Likely impact of CNVs on the growth of dendrites and dendritic spines ...... 89 Figure 6.1: NETBAG selected cluster using autism-associated de novo SNVs and CNVs...... 95 Table 6.1: Clusters selected by NETBAG for various input sets ...... 96 Table 6.2: Selected Gene Ontology functions associated with autism cluster ...... 96

iii

Table 6.3: Overlap of various autism gene sets with FMRP targets and PSD ...... 100 Figure 6.2: profiles in the brain across developmental stages for ASD genes...... 103 Figure 6.3: Average expression difference between early developmental periods and later periods. ... 104 Figure 6.4: Average expression difference across brain regions...... 105 Table 6.4: Early development expression bias calculations for regions of the brain ...... 106 Figure 6.5: Expression by phenotype measurements...... 107 Figure 6.6: Histograms showing distribution of subject IQ based on gene subsets...... 109 Figure 6.7: Hierarchical clustering of genes selected by NETBAG...... 114 Figure 6.8: Gene expression profiles for disease-matched gene sets...... 115 Table 6.5: Gene Ontology terms significantly associated with functional clusters ...... 116

iv

Acknowledgements

I would like to thank my lab mates past and present, including Tsu-Lin Hsiao, Jie Hue, Eugenia Lyashenko, German Plata, and Mariam Konate, for their thoughtful comments and patience during many practice talks.

I would like to thank my thesis committee, George Hripcsak, Noemie Elhadad, Maria Karayiorgou, Stephen Johnson for their support.

I would like to thank all of our collaborators in the Wigler lab at Cold Spring Harbor, the State lab at Stanford, and the Gogos and Karayiorgou labs at Columbia University for all their hard work bringing this research to publication

In particular, I’d like thank my co-authors, Ivan Iossifov who was integral in developing the methods on which this research is based, and Jonathan Chang for helping refine those methods and for his work on expression analysis.

I would like to thank my advisor, Dennis Vitkup, for his intelligence and support throughout the years. His commitment to science will be an inspiration to me the rest of my life.

I want to thank my family – my parents, brother, and even my little distractions – whose love gets me out of bed in the morning. Often literally.

And finally, thank you to my husband Felix, for all the usual reasons.

v

To Felix

vi

1

1. Introduction

Common Diseases and Human Genetic Variation

At the start of the century, the popular theory held that for common diseases there must exist common genetic variants which caused the disease, or at least increase individual susceptibility. Around this time, the International HapMap project was formed with the goal of describing patterns of variation in the . They examined common single nucleotide polymorphisms (SNPs), and defined blocks of genetic code, or haplotypes, which could be used to characterize the genome of an individual

[1, 2]. With this knowledge and the decreasing cost of SNP genotyping, it became possible to run genome-wide association studies (GWAS), in which genotypes from a large cohort of unrelated subjects with a particular disease were compared to genotypes from healthy subjects in the hopes of finding the common genetic factors associated with the disorder [3-5].

GWAS have had many notable successes, finding genes associated with age-related macular degeneration [6] and a well-documented association between APOE and late-onset Alzheimer’s disease

[7]. In 2007, the Wellcome Trust Consortium published a Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls, one of the seminal works in this field, the first to sequence thousands of subjects [8]. Along with several meta-analyses, this study implicated many susceptibility loci for Type 1 Diabetes and Crohn’s disease [9, 10]. After several years, hundreds of loci have been associated with different phenotypes and many interesting disease genes uncovered [11-

14].

It has become clear however that common variants alone cannot explain genetic risk for all common diseases [15]. For many disorders with high heritability, different studies frequently implicate different genomic loci. Most confirmed loci may have only a moderate effect on disease risk ratios, showing very little impact on disease risk for the individual. Even when taken together, GWAS results often explain only a small fraction of the phenotypic variability [16, 17]. There are many theories about

2

where this missing heritability may be found, with some suggesting a new focus on environmental factors and others calling for more advanced models of epistasis, the interactions between SNPs.

Recently, however, there has been increasing interest in favor of an alternate theory of disease susceptibility: “common disease – rare variant”. This theory posits that a collection of rare variants, occurring in genes with similar function or a shared biological pathway, might lead to a common phenotype despite their differences. One researcher has even demonstrated in silico that rare mutations might cause spurious associations between common SNPs and diseases [18]. Rare mutations are less frequent because they occurred more recently and therefore have not had time to affect a substantial portion of the population. However, they might have a larger impact on disease risk and odds ratios as they have not been exposed to purifying selection yet, and can therefore be more deleterious than common variants [16, 19-23]. This might be particularly relevant for neuropsychiatric disorders where causal mutations will face particularly strong pressures from natural selection due to greatly reduced fecundity [24].

Unfortunately, rare variants pose new problems for researchers. The foremost issue is how to locate them; while the cost of sequencing is dropping, it is impossible to estimate the study size needed to detect variants of unknown frequency. Many researchers have addressed this by returning to family studies and searching for de novo variants, or novel mutations, in the affected individual by comparison with their healthy parents and siblings. A second problem is that even relatively large studies show very low recurrence among the genes implicated by de novo variants. This makes it difficult to assign significance to the association between disease and [25]. Finally, while it may be possible to trace the biological implications of single mutations, such studies are difficult and costly. Some association studies report lists of putative disease genes, trusting future studies to confirm the physiological ramifications.

3

Biological networks represent an intuitive answer to many of these problems. The underlying idea is that for different mutations to converge on the same phenotype they must share cellular processes [26, 27]. This shifts the focus of genetic studies from finding disease mutations to finding biological pathways perturbed by disorders [28-31]. This has three key advantages. First, it becomes possible to assign significance to a collection of mutations where it might be impossible to assign it to individual elements while also making it easier to identify unrelated spurious results. Second, it becomes possible to consider multiple types of mutations together, unifying common and rare variants through their shared functions. Finally, viewing mutations through the lens of biological process brings the focus to predicting real physical consequences, and it is these consequences that will lead to a better understanding of human disease.

The major contribution of this work will be a novel method for analysis of disease associated variants which not only helps to select true disease associations from spurious results, but also helps define biological functions underlying diseases. The major limitation is that this work will require pre- existing sets of putative disease associations discovered through unbiased genome-wide assessment; several such results have been published, but there remain many diseases that have not been the subject of this type of analysis.

4

1.1. Thesis Overview

Pleiotropy: gene contribution to multiple diseases

In the study of human disease, pleiotropic genes, those which affect susceptibility to multiple conditions, are of special importance as they impact disease co-occurrence models and drug side- effects. The first step in understanding pleiotropy is finding the mechanisms that allow genes to contribute to multiple diseases. There are two competing models for how genes exhibit pleiotropy: through proteins capable of multiple molecular functions, the elemental biochemical activities of proteins, or through the reuse of such functions in multiple biological processes and contexts. This chapter examines the evidence for both hypotheses, ultimately favoring the latter; participation in multiple high-level processes and pathways represent the key to understanding pleiotropy in human disease genes. These results suggest that the use of such processes would assist in the prioritization and understanding of disease genes.

A functional network of phenotypic interactions

With an increasing catalog of genes effecting disease susceptibility, there is a growing need for methods to combine and interpret those results in order to better understand their underlying biological effects. Functional networks represent the perfect framework for promoting this understanding; shared function can be used to reduce noise among putative disease genes, predict real physiological changes, and potentially improve our understanding of drug targets and side-effects. Rather than use an existing network of function annotations or protein-protein interactions, this chapter proposes the construction of phenotypic network, one which combines multiple sources of evidence into likelihoods that gene pairs share phenotypic assignments. The network is validated using the task of prioritizing disease genes, and we demonstrate the value of combining evidence rather than relying on a single existing network of functional relationships.

5

Network based analysis of de novo copy number variants associated with autism

Preliminary results from a large study of high-functioning autistic children reveal a collection of de novo copy number variants (CNV) associated with the disease, but understanding theser variants remains challenging. Using the developed phenotypic network, this chapter proposes a method for network-based analysis of genetic associations (NETBAG) to address some of the issues of studying rare and de novo disease mutations. The method finds a significant cluster of genes primarily related to synapse development, axon targeting, and neuron motility, highlighting perturbed synaptogenesis as the heart of autism. In addition, the results support the hypothesis that significantly stronger functional perturbations are required to trigger the autistic phenotype in females compared to males. This work provides an important proof of the principle that the biological underpinnings of complex human phenotypes can be identified by a network-based functional analysis of rare genetic variants.

Diverse genetic variations converge on functional gene networks involved in schizophrenia

Different types of genetic variation have shown a significant contribution to schizophrenia susceptibility, confounding the search for a cohesive explanation of disease etiology. Using the developed phenotype network and analysis method (NETBAG), we combine the results of GWAS studies, de novo copy number variants (CNV) and single nucleotide variants (SNVs) to find a cluster of functionally related genes highly expressed in the brain. This cluster – related to axon guidance, neuronal mobility, synaptic function, and chromosomal remodeling – implicates many of the same biological processes underlying autism. A comparative analysis of CNVs associated the two diseases suggests that while the molecular networks implicated in these distinct disorders may be related, the mutations associated with each disease are likely to lead to different functional consequences.

6

Function and phenotype across de novo genetic mutations associated with autism

The analysis of de novo single nucleotide variants (SNVs), implicates many hundreds of genes for autism spectrum disorders (ASD). These are converging on core biological pathways underpinning the disease, such as synaptic plasticity, dendritic growth, and neuronal signaling. The next step is to use this knowledge to better understand the wide phenotypic variation that ASD represents. We explore this problem using the information available in the Simons Simplex Collection (SSC), a collection of de novo mutations in affected individuals alongside a substantial array of measurements relevant to disease severity. This chapter highlights the biological functions associated with the disease but more importantly the consequences of mutation on individual phenotype. We find a wide range of biological processes, expression patterns, and phenotypic effects mirroring the complexity of the disorder.

7

1.2. References

1. International HapMap, C., The International HapMap Project. Nature, 2003. 426(6968): p. 789- 96. 2. Thorisson, G.A., et al., The International HapMap Project Web site. Genome Res, 2005. 15(11): p. 1592-3. 3. Frazer, K.A., et al., A second generation human haplotype map of over 3.1 million SNPs. Nature, 2007. 449(7164): p. 851-61. 4. Hardy, J. and A. Singleton, Genomewide association studies and human disease. N Engl J Med, 2009. 360(17): p. 1759-68. 5. Hirschhorn, J.N. and M.J. Daly, Genome-wide association studies for common diseases and complex traits. Nat Rev Genet, 2005. 6(2): p. 95-108. 6. Klein, R.J., et al., Complement factor H polymorphism in age-related macular degeneration. Science, 2005. 308(5720): p. 385-9. 7. Ertekin-Taner, N., Genetics of Alzheimer disease in the pre- and post-GWAS era. Alzheimers Res Ther, 2010. 2(1): p. 3. 8. Wellcome Trust Case Control, C., Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 2007. 447(7145): p. 661-78. 9. Barrett, J.C., et al., Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat Genet, 2009. 41(6): p. 703-7. 10. Barrett, J.C., et al., Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat Genet, 2008. 40(8): p. 955-62. 11. Hindorff, L.A., et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A, 2009. 106(23): p. 9362-7. 12. Manolio, T.A., Genomewide association studies and assessment of the risk of disease. N Engl J Med, 2010. 363(2): p. 166-76. 13. Visscher, P.M., et al., Five Years of GWAS Discovery. American Journal of Human Genetics, 2012. 90(1): p. 7-24. 14. Manolio, T.A., L.D. Brooks, and F.S. Collins, A HapMap harvest of insights into the genetics of common disease. J Clin Invest, 2008. 118(5): p. 1590-605. 15. Eichler, E.E., et al., Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet, 2010. 11(6): p. 446-50. 16. Bodmer, W. and C. Bonilla, Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet, 2008. 40(6): p. 695-701. 17. Manolio, T.A., et al., Finding the missing heritability of complex diseases. Nature, 2009. 461(7265): p. 747-53. 18. Dickson, S.P., et al., Rare variants create synthetic genome-wide associations. PLoS Biol, 2010. 8(1): p. e1000294. 19. Gibson, G., Rare and common variants: twenty arguments. Nat Rev Genet, 2012. 13(2): p. 135- 45.

8

20. Goldstein, D.B., Common genetic variation and human traits. N Engl J Med, 2009. 360(17): p. 1696-8. 21. McClellan, J. and M.C. King, Genetic heterogeneity in human disease. Cell, 2010. 141(2): p. 210- 7. 22. Pritchard, J.K. and N.J. Cox, The allelic architecture of human disease genes: common disease- common variant...or not? Hum Mol Genet, 2002. 11(20): p. 2417-23. 23. Schork, N.J., et al., Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev, 2009. 19(3): p. 212-9. 24. Malhotra, D. and J. Sebat, CNVs: harbingers of a rare variant revolution in psychiatric genetics. Cell, 2012. 148(6): p. 1223-41. 25. Kiezun, A., et al., Exome sequencing and the genetic basis of complex traits. Nat Genet, 2012. 44(6): p. 623-30. 26. Schadt, E.E., Molecular networks as sensors and drivers of common human diseases. Nature, 2009. 461(7261): p. 218-23. 27. Feldman, I., A. Rzhetsky, and D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc Natl Acad Sci U S A, 2008. 105(11): p. 4323-8. 28. Barabasi, A.L., N. Gulbahce, and J. Loscalzo, Network medicine: a network-based approach to human disease. Nat Rev Genet, 2011. 12(1): p. 56-68. 29. Hirschhorn, J.N., Genomewide association studies--illuminating biologic pathways. N Engl J Med, 2009. 360(17): p. 1699-701. 30. Cho, D.Y., Y.A. Kim, and T.M. Przytycka, Chapter 5: Network biology approach to complex diseases. PLoS Comput Biol, 2012. 8(12): p. e1002820. 31. Loscalzo, J. and A.L. Barabasi, Systems biology and the future of medicine. Wiley Interdiscip Rev Syst Biol Med, 2011. 3(6): p. 619-27.

9

2. Pleiotropy: gene contribution to multiple diseases

2.1. Introduction

In the study of human disease, pleiotropic genes, those which affect susceptibility to multiple conditions, are of special importance; they are crucial to understanding disease co-occurrence models and predicting side effects for drugs with specific gene targets [1, 2]. In addition, it has been proposed that such genes play a central role in human senescence as genes which are advantageous in early life may exhibit deleterious effects later [3, 4]. Pleiotropy is common with almost 30% of genes in the disease database Online Mendelian Inheritance in Man (OMIM) implicated in more than one phenotype, and more than one third of these were implicated in more than two phenotypes.

The first step in understanding pleiotropy is finding the mechanisms that allow genes to contribute to multiple diseases. Two potential models were developed in the context of yeast genes and growth phenotypes: genes may produce products which participate in multiple molecular functions that contribute to different conditions, or genes may produce products with a single function that is reused in multiple biological processes and phenotypes [2, 5, 6]. In these models, molecular functions represent the elemental biochemical activities of proteins while biological processes represent cellular-level activities. Both can range from very broad categories which apply to thousands of genes, to more specific annotations that apply to only a handful of genes.

In this chapter, we perform three experiments to explore which mechanism represents the more likely path to pleiotropy in human disease genes. First, we examine disease specific alleles to see if separate diseases are significantly tied to unique protein domains as this would support multiple molecular functions as the key to pleiotropy. Second, we calculate Spearman rank correlations between the number of disease associations for a gene and the number of Gene Ontology (GO) annotations in either the molecular functions or biological process category. This correlation should indicate which category of annotation is more linked to pleiotropy in OMIM genes. Finally, using slow-growth in yeast

10

as a proxy for human disease we look for the same correlations over a larger set of genes from a recent study of yeast phenotypes [7].

The major goal of this chapter will be to define the likely mechanism for pleiotropy among human disease genes. While both models undoubtedly play some role, we find more evidence for the latter, that is that pleiotropic genes are more likely to have a single molecular function that is reused in multiple biological processes and phenotypes, rather than produce gene products acting in multiple low- level molecular functions.

11

2.2. Results

2.2.1. Disease mutations and protein domains

We hypothesize that if genes contribute to multiple diseases through multiple molecular functions, disease causing mutations are more likely to be appear in separate functional protein domains. For example, the protein PSAP is associated with Methachromatic leukodystrophy and Atypical

Gaucher disease, each of which has multiple single nucleotide polymorphisms (SNPs) associated with increased risk of the disease. The SNPs for these diseases are localized in separate domains within the

PSAP protein (Figure 2.1).

Figure 2.1: Sample SNPs in protein domains. Protein sequence for SAP_HUMAN with domains from InterPro shown in purple and disease associated SNPs shown in red and blue. Three SNPs associated with Metachromatic leukodystrophy fell into one protein domain, while the two SNPs associated with Atypical Gaucher fell in another.

In order to test this idea, we downloaded known disease polymorphisms (SNPs) from UniProt and protein domains from InterPro and examined the strength of the associations between diseases and domains. We used only those genes with at least two domains and two diseases, leaving a total of 355 genes containing 1418 domains and 6338 SNPs and associated to 1051 diseases. We used Fisher’s exact test to determine for each disease and domain if the number of disease SNPs in the domain was significant compared to the number of disease SNPs outside the domain based on the size of the domain and the protein. Under our null hypothesis the disease and domain are not related, with SNPs as likely to appear outside the domain as inside. Thus a low p-value indicates that the SNPs are more likely to

12

appear in the domain, and suggests that the disease is associated with the molecular function of that domain.

We found that only 10.8% of the diseases were associated to a domain, at a p-value of less than

0.05. In addition, only 6.8% were uniquely associated to a domain; that is there were no other diseases associated with this domain at such a low p-value. Less than 0.5% of genes had multiple diseases mapped to separate domains.

To assess the significance of this finding, we randomly reassigned SNP positions within each gene, allowing us to randomize the locations of disease mutations while maintaining the number of such mutations across genes and diseases. In 1000 trials we found that the mean number of diseases associated with a domain was 4.6% (±0.5) and the mean number associated uniquely to a domain was

4.3% (±0.5). This gives our results with the actual disease labels an estimated p-value less than 0.005.

However, the number of genes with all diseases uniquely tied to separate domains remained similar to our results using actual disease associations, 0.2% (±0.2).

We noted that 40% of these genes are related to forms of cancer which may exhibit unique behavior from other diseases, so we split these into a separate subset and repeated our analysis.

Cancers showed weaker associations between diseases and domains, with 2.8% percent of diseases associated to domains, and 1.4% associated uniquely to domains. By contrast, the data without cancers showed strong associations between diseases and domains, with 16.5% percent of diseases associated to domains and 9.5% associated uniquely to domains. This suggests that cancerous genes may be less likely than other disease genes to exhibit pleiotropy through multiple molecular functions.

While this result demonstrated a significant association between functional domains and disease it is not a strong one as very few diseases were uniquely associated to a functional domain.

Furthermore, not all genes are complex enough to have multiple functional domains. The data from

UniProt also contains 100 genes with SNPs from multiple diseases but only a single domain. More than

13

10% of these diseases were significantly tied to the protein’s one functional domain with a p-value of

0.05 or less. Therefore having multiple functional domains is far from the only source of pleiotropy in disease genes.

2.2.2. Gene Ontology functional annotations and Pleiotropy

For our next analysis, we used diseases annotations in OMIM which contains 2212 genes associated with 2495 phenotypes. While these are not specific locations, we can examine the correlation between the number of diseases associated with each gene and certain key annotations to indicate the source of pleiotropy. First, we focus on pleiotropy though multiple molecular functions. For this we examine the correlation between the number of diseases in OMIM and the number of annotations from the Gene Ontology (GO) classified under ‘molecular function’ as well as the number of codes associated with the gene in the Kyoto Encyclopedia of Genes and (KEGG) and the number of protein domains associated with the gene in InterPro. We found no correlation between disease count and the number of enzyme codes (Spearman’s rank correlation coefficient: r = 0.026, two-tailed significance: p = 0.53; Figure 2.2) although this was not unexpected as only about 10% of genes had multiple enzyme codes. We do find modest but significant correlations with both molecular function

(Spearman’s r = 0.121, p = 3e-8; Figure 2.2) and the number of domains (Spearman’s r = 0.096, p = 1e-4;

Figure 2.2). This supports our finding of disease-specific domains from the UniProt alleles by showing genes with multiple functional domains have more associated diseases.

We contrast this analysis by considering the correlation between the number of diseases in

OMIM and the number of annotations from GO classified under ‘biological processes’ and ‘cellular component’ as well as the number of tissues expressing the gene in the TiGER database and the number of distinct interaction partners in the Human Protein Reference Database (HPRD). These measures would be indicative of a single molecular function being reused in multiple biological processes and

14

contexts. We found strong correlations with biological processes (Spearman’s r = 0.228, p = 0; Figure

2.2) as well as the number of interaction partners (Spearman’s r = 0.170, p = 3e-11; Figure 2.2), but only a modest correlation with the number of cellular components (Spearman’s r = 0.088, p = 5e-5; Figure 2.2) and the number of tissues (Spearman’s r = 0.098, p = 0.006; Figure 2.2).

Figure 2.2: Associations between disease gene pleiotropy and various descriptors. Plotted is the average number of annotations in each descriptor binned by the number of diseases associated with each gene. Bins represent genes associated 1 disease (1556 genes), 2 diseases (370 genes), 3 diseases (143 genes) and 4 or more diseases (143 genes). Error bars show one standard error from the mean. (A) Descriptors which would imply pleiotropy was linked to multiple molecular functions: number of protein domains in the gene, number of molecular functions in the Gene Ontology (GO), and the number of enzyme codes in KEGG. (B) Descriptors which would imply pleiotropy was linked to molecular function being reused in multiple biological processes: number of protein-protein interaction partners for the gene, number of biological processes or cellular components in GO, and the number of human tissues expressing the protein (TiGER). There is some correlation between the number of diseases associated with a gene and the number of domains or molecular functions, but stronger correlations are seen with the number of biological processes and interaction partners.

Confounding this analysis, we note a high correlation between the number of molecular functions and the number of biological processes or interaction partners (Spearman’s r = 0.471, p = 0; r =

0.163, p = 1e-10 respectively). To correct for this, we examined partial correlation coefficients. We found

15

that the correlations between disease count and the number of molecular functions or protein domains were insignificant when controlled by the number of biological process and greatly reduced when controlled by the number of interaction partners (Partial correlation coefficient c = 0.016, p = 0.49; c =

0.077, p = 0.003 respectively). By contrast, the correlation between disease count and the number of biological process remained strong when controlled by the number of molecular functions or protein domains (c = 0.195, p = 1e-19; c = 0.200, p = 1e-15 respectively). We also found that correlations between disease count and the number of biological processes or interaction partners were largely independent with strong partial correlations in both directions (Process controlled by partner count: c = 0.179, p = 2e-

12; Partners controlled by process count: c = 0.117, p = 5e-6).

As in our first experiment, we divided our genes into cancers and other diseases. In the OMIM database, more than 90% of genes were associated with other diseases and given their we were not surprised to find that this set of disease genes had correlations comparable to our original results. Cancer genes however showed some interesting differences. We found a much higher correlation between multiple disease associations and the number of protein domains (Spearman’s r =

0.265, p = 0.002), although given our earlier results showing that cancers were less likely to be significantly associated with particular domains, this may indicate that cancer causing genes are more likely to have multiple functional domains not that these domains are tied to separate diseases. We also found that among cancer genes there was no significant correlation between disease count and the number of molecular functions, but strong correlation with the number of biological process

(Spearman’s r = 0.280, p = 3e-4), cellular components (Spearman’s r = 0.217, p = 0.007), and the number of interaction partners (Spearman’s r = 0.332, p = 6e-5).

Because the correlations with biological process are more robust, we conclude that participation in multiple molecular functions can lead to multiple diseases but that this likely works through allowing

16

the gene to participate in multiple higher level processes and interaction pathways. Thus functional reuse in multiple environments is likely to be more important to disease gene pleiotropy.

2.2.3. Pleiotropy in yeast genes

In the OMIM database, genes with high pleiotropy are rare; there are only 72 genes associated with more than four diseases. In order to expand our analysis to a data set with higher levels of pleiotropy, we used slow-growth phenotypes from yeast in the place of human diseases. A recent study attempted to uncover phenotypes for nonessential yeast genes by measuring growth rates of mutants in the presence of small molecules. This data set contained 4327 genes covering 161 conditions, with over

87% of genes relating to at least two conditions and nearly 50% relating to more than 4 conditions [7].

We used similar annotations for yeast genes as with human disease genes. For molecular function this included: GO annotations under the heading ‘molecular function’, enzyme codes (EC) and protein domains associated with each gene through Munich Information Center for Protein Sequences (MIPS).

For biological process this included: GO annotations under the heading ‘biological process’ and ‘cellular component’, and of the number protein-protein interaction partners manually curated from results of yeast two-hybrid experiments [7].

Similar to OMIM genes, we found strong correlations between the number of slow growth conditions and the number of biological processes (Spearman’s r = 0.244, p = 0), cellular components

(Spearman’s r = 0.132, p = 0), interaction partners (Spearman’s r = 0.245, p = 1x1e-11), and a small but significant correlation with molecular function (Spearman’s r = 0.101, p = 3e-11). By contrast, we did not see a correlation between the number of slow growth conditions and the number of protein domains

(Spearman’s r = 0.010, p = 7e-1), although this correlation was weak among human disease genes as well.

Similar to our previous results, the correlation between the number of slow growth conditions and the number of functions can be greatly reduced by controlling for the number of interaction partners or

17

biological processes (c = 0.022, p = 0.55; c = 0.040, p = 8e-3 respectively), while the correlations between the number of slow growth conditions and the number of biological process or interaction partners are largely unaffected by controlling for the number of molecular function (c = 0.22, p = 7e-53; c = 0.24, p =

9e-12).

2.3. Discussion

There are some important limitations to our approach. First, we were not able to guarantee that all associated GO terms were non-redundant. We did address this problem to some degree by using the

GO hierarchy to remove parent terms, however, not all redundancy can be accounted for this way as sibling terms are often similar. We considered restricting our annotations to various levels in the GO hierarchy, so that very specific siblings might be replaced by a single more general parent term, but this would not guarantee that all redundancy was removed and in fact may remove many non-redundant sibling annotations. Trials using different levels of the GO hierarchy as a cutoff had little or no effect on the correlations observed.

Another weakness of our method is that we considered diseases independent despite the fact that many may be related. For example, we would consider gene AGRP, associated with leanness and obesity, to have the same pleiotropy as gene COX15, associated with Cardiomyopathy and , a neuro-metabolic disorder. While both genes are associated with two distinct diseases, it seems obvious that the former pair of diseases have more in common with each other than the latter pair. Unfortunately OMIM does not have a disease classification scheme which we could utilize to account for this problem.

Despite these limitations we feel our results strongly suggest that participation in multiple high- level processes and pathways represents the key to understanding pleiotropic disease genes, particularly for cancers. We did find some support for the theory that different diseases are associated

18

with separate functional domains, however, this was far from conclusive as many pleiotropic proteins had only one functional domain. In addition, the fact that correlations between multiple diseases and multiple biological processes were stronger and more independent than the correlations between multiple diseases and multiple molecular functions indicates that this is the more common source of pleiotropy. These patterns were confirmed in yeast genes, in contrast with earlier studies [6] which may have been limited due to a smaller set of phenotypes.

This conclusion has important implications for the study of human disease genes. First, it suggests that finding disease specific alleles may be problematic as a mutation which disrupts a single molecular function may affect multiple diseases [5, 6]. This will be particularly problematic for genes with simpler products consisting of few functional domains. Second, our results suggest that using shared processes or high level pathways would be a good way to prioritize disease genes. Such methods are being applied to the problem of finding disease genes from the very noisy results of genome-wide association studies [8-11]. Third, our results confirm the importance of varied protein-protein interaction partners as a source of pleiotropy in disease genes, and support the idea that targeting context-specific interactions rather genes will lead to more effective drug therapies with fewer side effects [5, 6]. Finally, understanding the shared biological processes or interaction pathways can help our understanding of how diseases relate to each other and increase our understanding of a patient’s susceptibility to specific diseases.

19

2.4. Methods

Disease mutations and protein domains

For our analysis of disease distinct domains, we used the disease SNPs available through UniProt mapped to protein domains from Interpro. Any overlapping domain annotations were assigned to a single domain. We used Fisher’s exact test with the length of the gene, length of the domain and the number of disease SNPs inside and outside the domain to measure the association between domain and disease. To assess the significance of our results, we ran 1000 trials where we reassigned the disease labels within each gene and measured the association between disease and domain again.

o Uniprot: release 57.6 - http://www.uniprot.org/docs/humsavar o InterPro: release 22.0 - ftp://ftp.ebi.ac.uk/pub/databases//

Gene Ontology functional annotations and Pleiotropy

Disease genes were downloaded from the National Center for Biotechnology Information (NCBI) through OMIM which contained 2212 genes in 2495 diseases or phenotypes, with 656 genes participating in more than one disease. GO annotations were also downloaded from NCBI. We removed the root “unknown” annotations and any redundant ancestor annotations using the “is_a” and

“part_of” relationships to determine ancestry. We also downloaded enzyme codes from the Kyoto

Encyclopedia of Genes and Genomes (KEGG) database, protein domains from the InterPro database, tissue indicators from the TiGER database, and protein-protein interactions available through the

Human Protein Reference Database.

All correlations were measured using the Spearman rank algorithm built into R. Partial correlation coefficients were calculated with an R extension available from Georgia Institute of

Technology (http://www.yilab.gatech.edu/pcor.html).

o GO annotations: ftp://ftp.ncbi.nlm.nih.gov/gene/ (01/2009) o KEGG pathways: ftp://ftp.genome.jp/pub/kegg/genes/organisms/hsa/ (01/2009)

20

o TiGER: http://bioinfo.wilmer.jhu.edu/tiger/ (09/2009) o HPRD: http://www.hprd.org/download (01/2009)

Pleiotropy in yeast genes

Yeast genes were obtained from a study which analyzed 1144 chemical genomic assays for heterozygous and homozygous deletions. The purpose of this study was measure the growth rate of mutants in the presence of small molecules to gain “insight into gene dispensability, multidrug resistance, and gene functions.” [7] Our final data set included 4327 genes which displayed slowed growth with a p-value of 1e-5 or less. These genes covered 161 conditions, and we found 3780 genes present in multiple conditions. See Hillenmeyer et al. for details on the small molecules tested.

For yeast annotation data we used the same sources as an earlier study by He and Zhang. GO annotations came from a list mined from literature, enzyme codes from the Yeast Enzyme Commission

(EC), and predicted functional domains from Munich Information Center for Protein Sequences (MIPS).

Protein-protein interactions were downloaded from Han et al. based on yeast two-hybrid experiments, protein complexes, and in silico interaction predictions known interactions in the MIPS database. [12]

o GO: http://www.ncbi.nlm.nih.gov/pubmed/17942445 o Domains: (06/2009) ftp://ftpmips.gsf.de/yeast/catalogues/eccat/eccat_data_20062005 o MIPS: (06/2009) ftp://ftpmips.gsf.de/yeast/catalogues/motifs/

21

2.5. References

1. Rzhetsky, A., et al., Probing genetic overlap among complex human phenotypes. Proc Natl Acad Sci U S A, 2007. 104(28): p. 11694-9. 2. van de Peppel, J. and F.C. Holstege, Multifunctional genes. Mol Syst Biol, 2005. 1: p. 2005 0003. 3. Promislow, D.E., Protein networks, pleiotropy and the evolution of senescence. Proc Biol Sci, 2004. 271(1545): p. 1225-34. 4. Williams, P.D. and T. Day, Antagonistic pleiotropy, mortality source interactions, and the evolutionary theory of senescence. Evolution, 2003. 57(7): p. 1478-88. 5. Dudley, A.M., et al., A global view of pleiotropy and phenotypically derived gene function in yeast. Mol Syst Biol, 2005. 1: p. 2005 0001. 6. He, X. and J. Zhang, Toward a molecular understanding of pleiotropy. Genetics, 2006. 173(4): p. 1885-91. 7. Hillenmeyer, M.E., et al., The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science, 2008. 320(5874): p. 362-5. 8. Franke, L., et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet, 2006. 78(6): p. 1011-25. 9. Wang, K., M. Li, and M. Bucan, Pathway-Based Approaches for Analysis of Genomewide Association Studies. Am J Hum Genet, 2007. 81(6). 10. Lage, K., et al., A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol, 2007. 25(3): p. 309-16. 11. Iossifov, I., et al., Genetic-linkage mapping of complex hereditary disorders to a whole-genome molecular-interaction network. Genome Res, 2008. 18(7): p. 1150-62. 12. Han, J.D., et al., Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 2004. 430(6995): p. 88-93.

22

3. A functional network of phenotypic interactions

3.1. Introduction

With an increasing catalog of genes effecting disease susceptibility, there is a growing need for methods to combine and interpret those results in order to better understand their underlying biological effects. Functional networks represent the perfect framework for promoting this understanding. The core concept behind this type of analysis is that if two genes contribute to a single phenotype they must share some biological process, one directly relevant to the disease etiology [1, 2]. Shared function can be used to reduce noise among putative disease genes, predict real physiological changes, and potentially help fulfill the promise of personalized medicine, improving our understanding of drug targets and side-effects [3-8].

In recent years, the focus has been on using network methods to interpret the results of genome-wide association studies (GWAS) [3, 4]. In GWAS, large populations of sick and healthy individuals are compared in order to find haplotypes, or genetic regions, which increase susceptibility to a particular disease. One of the major issues is that by implicating haplotypes rather than specific mutations, GWAS results often leave the causative gene in question. Various methods prioritize likely candidate genes from an implicated genomic region using functional similarity to previously confirmed disease genes (Endeavour [9], [10, 11]), similarity with genes from other implicated regions (GRAIL [12]), or the existence of protein interactions with genes from other implicated regions (DAPPLE [13]).

Network methods have also been applied to other tasks; for example, many methods use protein-protein interactions (PPI) networks to better predict disease associations and draw conclusions about the topological features of disease genes in the network itself [14]. Other methods search for protein interactions specifically targeted by cancer [15-17]. There have been methods which predict novel disease genes, those not implicated through whole-genome analysis, based on current knowledge

23

of the disorder [18, 19]. Finally, some methods ignore the question of finding new associations and focus instead on explaining the biological impact of previously documented disease genes [20-22].

Here, we propose that instead of using an existing network of protein interactions or functional annotations, we will combine multiple sources into a network where every gene pair is assigned a score based on the likelihood that these genes contribute to a shared phenotype. We use a method similar to that used in the construction of functional PPI networks for several model organisms [23-25], although we differ from these methods by predicting shared phenotype rather than direct interaction. This chapter will focus on construction and validation of the network, while later chapters will demonstrate its utility in the prioritization of genomic regions and in the analysis of rare variants associated with disease. The major goals of this chapter will be:

(1) Demonstrate that functional descriptors, such as shared functional annotations or protein

interaction partners, can be used to predict the likelihood that two genes contribute to a shared

phenotype by performing a leave-one-out validation on a collection of known disease gene sets.

(2) Verify that the naïve combination of likelihood scores from various evidence sources performs

better than any single functional descriptor, and demonstrate a simple de-weighting heuristic to

control for the fact that edges within a collection of genes that share a single disease are not

fully independent.

24

3.2. Results and Discussion

We build a network in which each pair of genes is connected by score proportional to the likelihood ratio that these genes either share a phenotype or not. This network was built using individual sources of evidence, for example the existence of a PPI between the genes, or using a naïve sum over multiple sources of evidence. In order to verify that these scores capture true phenotypic connections, we apply the network to the task of prioritizing known disease genes from their genomic regions. We used a collection of diseases from the OMIM database, excluding any phenotypes used in training our network; this left 596 genes over 205 phenotypes. We performed leave-one-out cross validation where each gene was placed among its 99 closest chromosomal neighbors and all 100 genes were prioritized based on network connections to the remaining genes from that disease.

Figure 3.1 shows the number of tests where the true disease gene appeared at, or below, a certain rank using single-source networks (described in Methods) and our combined phenotype network. The combined network consistently out-performs all individual sources. The poor performance of phylogenetic profiling and co-clustering was due to limited coverage; in most tests none of the 100 genes had any network connections to the disease genes making the ranking essentially random. The same problem effected sequence alignment, shared protein domains, and tissue expression where only a handful of tests had some network connections between disease genes.

Removing these sources did not significantly impact the performance of the combined network, so they were kept as they do provide some value in extremely rare cases. The best performing sources were those with the broadest coverage, shared annotations in KEGG or GO, or shared protein interaction partners.

25

Figure 3.1: Disease gene predictions by evidence source. For each trial, a disease gene was placed among its 99 closest neighbors and the entire genomic region was prioritized based on the sum of all connections from the candidate to the remaining genes from that disease. Area under curve (AUC) was calculated based on the number of trials at or below each rank. Performance using a naïve sum over all functional descriptors is shown in black. Other lines represent performance using only single sources of functional descriptor. These included: Direct protein-protein interaction between candidate and disease genes (PPI Interaction); overlap between protein-protein interaction partners (PPI Partners); shared annotations in the Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), tissue expression database (TiGER), or protein domains (InterPro); Blast sequence similarity; phylogenetic profile similarity and co-clustering across from Chen et al. [26]

26

It should be apparent that edges in the phenotype network are not wholly independent; that is, if a candidate gene has strong connections to one disease gene, it is likely to have strong connections to other genes associated with the same disease. To account for this dependence, we employed a simple de-weighting heuristic previously described in Lee et al. [23]. Figure 3.2 and Table 3.1 show the percent of trials at or below each rank under each scoring scheme. While performance of naïve scoring is slightly higher at the first rank, the two are virtually equivalent after that demonstrating that our de-weighting heuristic may be conceptually valuable but not necessary to our final results.

Table 3.1 Trials at or below rank Rank Deweighted Naïve 1 40.3 42.1 2 49.0 49.5 3 54.9 54.9 4 57.0 57.4 5 59.6 59.9 6 61.7 61.4 7 63.1 63.3 8 64.6 64.9 9 66.1 66.1 10 66.8 66.9 50 87.8 87.9 AUC 85.3 85.3

Figure 3.2: Naïve versus Deweighted scoring. Results from the cross validation trials where each disease gene was placed among its 99 closest neighbors and the entire genomic region was prioritized based on connections in the combined phenotype network from the candidate to the remaining genes from that disease. Results using naïve scoring (red) where final score is the sum of all edges to the disease genes, and using dewieghted scoring (black) where final score is weighted to reflect dependence between disease genes. Table 3.1: Selected ranks from figure 3.2.

Finally, we note that having been identified in OMIM our trial genes may have been the subject of increased study resulting in more numerous or detailed functional annotations and therefore more or

27

stronger connections in our underlying phenotype network than the average gene. As they are being compared to random genes, this might artificially inflate our results. To test this, we select decoy genes for prioritization in three ways: we take neighboring genes as described to simulate a genomic region, we take 99 randomly selected genes, and finally we take 99 random genes from other diseases in

OMIM. For the latter two cases, we used 100 random sets to estimate rank of each trial gene. Figure 3.3 plots the percent of trials at or below each rank based on decoy genes selected, and Table 3.2 shows area under curve for based on the network source, scoring method, and decoys genes selected. We note that there is virtually no difference between neighboring genes and randomly selected genes showing that genes are not subject to any bias merely by being neighbors of a well-studied disease gene.

However, performance declines when decoys are other disease genes, particularly using our combined phenotype network. While our combined network still has value over individual networks this results demonstrates that it is important to consider a gene’s overall connectivity in the phenotype network.

Figure 3.3: Disease gene predictions using different decoy genes. Results from the cross validation trials where each disease gene was placed among its 99 decoys and all candidate genes were prioritized based on connections in the combined phenotype network to the remaining genes from that disease. Results were obtained using deweighted scoring. Results using neighboring genes as decoys (black), randomly selected genes (red), and randomly selected genes from other diseases (blue).

28

Table 3.2: Area under curve based on network source and decoy genes selected.

Neighbor Neighbor (deweighted) Random (deweighted) Gene Disease (deweighted) Neighbor (naive) Random (naive) Gene Disease (naive) Combined phenotype network 0.853 0.859 0.782 0.853 0.857 0.779 PPI Interaction 0.755 0.755 0.639 0.753 0.755 0.640 Shared KEGG annotations 0.710 0.709 0.697 0.709 0.709 0.695 Shared GO annotations 0.701 0.720 0.682 0.727 0.719 0.682 Shared PPI partners 0.611 0.613 0.676 0.615 0.612 0.675 Shared TiGER tissue annotations 0.573 0.567 0.597 0.577 0.556 0.591 Blast sequence similarity 0.556 0.575 0.573 0.572 0.573 0.573 Shared InterPro annotations 0.529 0.531 0.604 0.528 0.532 0.604 Phylogenetic profile similarity 0.507 0.499 0.498 0.511 0.501 0.509 Chromosome co-clustering 0.489 0.507 0.504 0.485 0.506 0.504

These results demonstrate three important points. First, our method of defining a likelihood ratio does accurately reflect phenotypic associations between disease genes. Second, while there are some individual networks which capture phenotypic connections well, the naïve combination of all data outperforms individual sources. Finally, when scoring connections between groups of genes, a simple de-weighting heuristic provides comparable performance to naïve scoring while being conceptually preferable.

29

3.3. Methods

Network Construction

In our underlying phenotype network every gene pair is connected by a score proportional to our expectation that the two genes share a phenotype. This score is based on evidence sources such as shared functional annotations or protein-protein interaction partners. The final score is calculated according the following formula:

⁄ ( ) ∑ ( ) ⁄

s Where for each evidence source, s, there is a score, E ij, which is used to predict the probability that this pair shares a disease phenotype, P(D|E), or not, P(~D|E). Although the prior odds ratio, P(D)/P(~D) is not precisely known, it is constant and since our results rely on comparing edges in the network, they are not sensitive to the exact value of this ratio.

For each source of evidence, we trained a model to estimate the ratio (P(D|E) / P(~D|E)). For this task, we collected gene pairs with a shared phenotype from a carefully curated set of disease genes

[2]. This set contained 476 genes associated with 132 genetic diseases, for a total of 1183 unique pairs.

For gene pairs without a shared phenotype we used 500,000 randomly selected gene pairs. We selected this large number of pairs both to reflect our belief that sharing a phenotype was unlikely and also to dilute the effect of any false negatives selected due to imperfect knowledge about human disease. Gene pairs were binned by their evidence scores and we counted pairs with and without a shared phenotype in each bin. We fit these values with a combination of linear and quadratic formulas which were then used to estimate the ratio for novel gene pairs. For any gene pair without a score in the evidence source, for example two genes with no associated GO terms, we used a default score equivalent to the median value from the random gene pairs.

30

When considering network connections from a candidate gene to a collection of n disease genes, we used a simple de-weighting heuristic to account for the fact that these edges are not independent. Under this scheme, the edges from one gene to the collection are sorted by their strength and de-weighted according to their position in the sorted array. That is:

A similar method was previously used in the construction of functional protein-protein interaction networks for several model organisms [23-25].

Network Sources

Shared Annotations: We considered shared annotation from a number of publicly available databases. Specifically, we used the descriptions of protein function and cellular localization in Gene

Ontology (GO), enzyme codes and pathways defined in Kyoto Encyclopedia of Genes and Genomes

(KEGG), protein domains from the InterPro database, tissue expression from TiGER database. For these sources, we used a score based on the number and strength of shared annotations.

First, we calculated a weight for every annotation based on its specificity, defined as the number of genes annotated with a particular term (Mt ) divided by the number of genes with annotations (N).

Second, we summed the weights for all terms shared by a given gene pair using Fisher’s omnibus

annotation procedure (see the equation below) to obtain a final score, Eij . This method has been used previously to prioritize candidate genes based on their similarity to known disease genes [9]. In the case of GO terms, we utilize the GO annotation hierarchy, so that if a gene was assigned a specific term we also assigned it to all more general parents of that term; this ensures that general terms have more gene assignments than their more specific children terms, giving them a lower weight.

⁄ ∑ ( )

31

Protein-protein interactions: We used the following databases in our analysis of protein-protein interactions (PPI): BIND, BioGRID, DIP, HPRD, InNetDB, IntAct, BiGG, MINT, and MIPS. We derived two different measures based on PPI networks. First, we checked if a given gene pair directly interacts, and

PPI counted the number of PPI databases that contain that interaction. This measure, Eij was a good predictor of a shared phenotype, but naturally covered a very small fraction of gene pairs. Second, for each pair of genes, gi and gj, we compared the number of interactions they share with other human proteins. Specifically, for each gene I we found a vector I=(i1, i2, …,in) where ia is the number of databases which contained a PPI between genes I and A. We then used the Tanimoto coefficient, an extension of

neighbors cosine similarity, to measure this distance between vectors I and J, Eij , where  is the scalar product of two vectors and ‖ ‖ is the vector magnitude.

‖ ‖ ‖ ‖

Sequence homology: To detect between each gene pair, we used BLAST.

The homology-related score for a gene pair was the sequence homology for the most significant (lowest

E-value) BLAST alignment.

Phylogenetic profiles and chromosomal co-clustering across genome locations: We used two genomic context correlations which are commonly used to infer functional similarity between genes: phylogenetic profiles and chromosomal co-clustering calculated previously by Chen et al. [26]. For phylogenetic profiles, this work used binary vectors, describing the presence or absence of gene homologs in 70 diverse species. The gene profiles were compared using Pearson’s correlation to produce a score. For chromosomal co-clustering, the mean distance (in base pairs) between gene homologs across multiple diverse genomes was used.

32

Information was downloaded from the following resources:

 Gene Ontology annotations from NCBI (01/2009 – ftp://ftp.ncbi.nlm.nih.gov/gene/)  Pathways and enzyme codes from Kyoto Encyclopedia of Genes and Genomes (KEGG) database (01/2009 – ftp://ftp.genome.jp/pub/kegg/genes/organisms/hsa/)  Domains from InterPro database (01/2009 – ftp://ftp.ebi.ac.uk/pub/databases/interpro/)  Tissue indicators from the TiGER database (09/2009 – http://bioinfo.wilmer.jhu.edu/tiger/)  Protein-protein Interactions: o BIND with protein complexes (08/2009 – http://bond.unleashedinformatics.com/) o BioGRID interactions (08/2009 – http://www.thebiogrid.org/downloads.php) o DIP interactions (10/2009 – http://dip.doe-mbi.ucla.edu/dip/) o HPRD interactions (10/2009 – http://www.hprd.org/) o InNetDB interactions (05/2009 – http://hanlab.genetics.ac.cn/sys/intnetdb) o IntAct (10/2009 – http://www.ebi.ac.uk/intact/) o BiGG metabolic interactions (04/2009 – http://gcrg.ucsd.edu/Downloads ) o MINT (05/2009 – http://mint.bio.uniroma2.it/mint/) o MIPS (05/2009 – http://mips.gsf.de/proj/ppi/)

33

3.4. References

1. Schadt, E.E., Molecular networks as sensors and drivers of common human diseases. Nature, 2009. 461(7261): p. 218-23. 2. Feldman, I., A. Rzhetsky, and D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc Natl Acad Sci U S A, 2008. 105(11): p. 4323-8. 3. Cantor, R.M., K. Lange, and J.S. Sinsheimer, Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am J Hum Genet, 2010. 86(1): p. 6-22. 4. Hardy, J. and A. Singleton, Genomewide association studies and human disease. N Engl J Med, 2009. 360(17): p. 1759-68. 5. Sorger, P.K. and B. Schoeberl, An expanding role for cell biologists in drug discovery and pharmacology. Mol Biol Cell, 2012. 23(21): p. 4162-4. 6. Barabasi, A.L., N. Gulbahce, and J. Loscalzo, Network medicine: a network-based approach to human disease. Nat Rev Genet, 2011. 12(1): p. 56-68. 7. Cho, D.Y., Y.A. Kim, and T.M. Przytycka, Chapter 5: Network biology approach to complex diseases. PLoS Comput Biol, 2012. 8(12): p. e1002820. 8. Loscalzo, J. and A.L. Barabasi, Systems biology and the future of medicine. Wiley Interdiscip Rev Syst Biol Med, 2011. 3(6): p. 619-27. 9. Aerts, S., et al., Gene prioritization through genomic data fusion. Nat Biotechnol, 2006. 24(5): p. 537-44. 10. Franke, L., et al., Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet, 2006. 78(6): p. 1011-25. 11. Freudenberg, J. and P. Propping, A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics, 2002. 18 Suppl 2: p. S110-5. 12. Raychaudhuri, S., et al., Identifying relationships among genomic disease regions: predicting genes at pathogenic SNP associations and rare deletions. PLoS Genet, 2009. 5(6): p. e1000534. 13. Rossin, E.J., et al., Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS Genet, 2011. 7(1): p. e1001273. 14. Navlakha, S. and C. Kingsford, The power of protein interaction networks for associating genes with diseases. Bioinformatics, 2010. 26(8): p. 1057-63. 15. Vandin, F., E. Upfal, and B.J. Raphael, Algorithms for detecting significantly mutated pathways in cancer. J Comput Biol, 2011. 18(3): p. 507-22. 16. Chuang, H.Y., et al., Network-based classification of breast cancer metastasis. Mol Syst Biol, 2007. 3: p. 140. 17. Mani, K.M., et al., A systems biology approach to prediction of oncogenes and molecular perturbation targets in B-cell lymphomas. Mol Syst Biol, 2008. 4: p. 169. 18. Lage, K., et al., A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat Biotechnol, 2007. 25(3): p. 309-16. 19. Wu, X., et al., Network-based global inference of human disease genes. Mol Syst Biol, 2008. 4: p. 189.

34

20. Holmans, P., et al., Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am J Hum Genet, 2009. 85(1): p. 13-24. 21. Torkamani, A. and N.J. Schork, Pathway and network analysis with high-density allelic association data. Methods Mol Biol, 2009. 563: p. 289-301. 22. Subramanian, A., et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A, 2005. 102(43): p. 15545- 50. 23. Lee, I., et al., A probabilistic functional network of yeast genes. Science, 2004. 306(5701): p. 1555-8. 24. Lee, I., et al., A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat Genet, 2008. 40(2): p. 181-8. 25. Jansen, R., et al., A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 2003. 302(5644): p. 449-53. 26. Chen, L. and D. Vitkup, Predicting genes for orphan metabolic activities using phylogenetic profiles. Genome Biol, 2006. 7(2): p. R17.

35

4. Network based analysis of de novo copy number variants associated with autism

4.1. Introduction

Autism represents one of the most common neurological disorders in our society today [1]; the combined prevalence of autistic spectrum disorders (ASD) has been steadily increasing for several decades, due in part to better detection strategies, and now approaches 1%. Despite an estimated heritability as high as 90% [2], genome-wide association studies have implicated only a handful of genes

[3, 4], and agreement between published findings remains poor [5]. Even taken as a collection these variants explain only a small fraction of overall disease risk. It is likely that a substantial portion of the

“missing heritability” can be accounted for by rare structural variations such as copy number variants

(CNV) [6]. Among such events, de novo variants offer special promise, as they are more likely than inherited events to contain a causative mutation, and a gene relevant to disease etiology [7-11].

The main challenge in the analysis of de novo CNVs is precisely their rarity; the vast majority of the observed genetic events are unique and consequently not statistically significant by themselves. In addition, each CNV event may cover multiple genes most of which have no relevance to the disease.

Finally, while it may be possible to trace the biological implications of a single mutation, this task is more difficult with a large number of variants.

We addressed these problems with a network based approach, NETBAG (NETwork Based

Association of Genetic variants). In brief, we used the defined network of phenotypic connections in which all gene pairs are connected with a score proportional to the likelihood ratio that they either contribute a shared phenotype or not (Figure 4.1a). All putative disease variants are mapped to the corresponding genes in the phenotype network (Figure 4.1b) where true causative genes from a single disease should form a strongly interconnected cluster. A greedy search algorithm is employed to find high scoring clusters among implicated genes (Figure 4.1c) with limitations on the number of genes which each genomic event may contribute to the cluster; this prevents large CNV events, those with

36

many genes, from dominating the cluster. To evaluate the significance of the disease cluster found, random genomic events are generated and the same greedy algorithm is used to find clusters using these events (Figure 4.1d); P-values are assigned to disease clusters based on a distribution of scores from the randomized clusters which are then corrected for multiple hypothesis testing due to considering clusters of different sizes.

This method achieves three important goals:

(1) By limiting the number of genes each CNV may contribute to the cluster, we select the one or

two genes most relevant to the disease. Furthermore, since we consider clusters of various

sizes, the final selection may ignore less relevant events altogether. Importantly, the method

makes these selections without any previous knowledge of autism genetics.

(2) By considering the cluster as a whole, we are able to assign significance even if all events are

unique and thus not significant by themselves. In addition, gene contribution to the cluster

score is not uniform but instead proportional phenotypic connections which tie a gene to the

cluster, which allows us to evaluate the importance of individual genes to the disease etiology.

(3) As the genes in our network are linked through shared function, the underlying biological

pathways relevant to the disease will be more distinct than if we were considering a collection

of all implicated genes.

37

Figure 4.1: The NETBAG method. (A) A background network is constructed in which nodes represent human genes and edges represent the likelihood that two genes participate in a shared phenotype. (B) One or two genes are selected from each of de novo CNV region (green) to form a cluster. The genes are mapped to the likelihood network and a combined score is calculated for each cluster based on interactions between genes forming the cluster. (C) A greedy search procedure is used to identify the cluster with maximal score. (D) The significance of the cluster with maximum score is determined by comparing it to the distribution of maximal scores from randomly selected genomic regions (blue) with similar gene counts.

38

4.2. Results

4.2.1. Autism associated networks

The approach outlined in Figure 4.1 was applied to the experimental CNV dataset described in a recent analysis of de novo CNVs associated with ASD [12]. From these events, we identified statistically significant clusters allowing each event to contribute either one or two genes (Figure 4.2, P-value = 0.03 for one gene per event, P-value = 0.02 for two genes per event). If genes forming the high scoring clusters were removed from the input, no other significant clusters were detected among remaining genes. In addition, Levy et al. also identified de novo events in the unaffected siblings of autistic individuals, and rare inherited events in autistic children and their parents. No statistically significant clusters were obtained using either of these sets (rare inherited P-value = 0.6, sibling P-value = 0.3).

Analysis of the established annotation resources – Swiss-Prot [13], GeneCards

(www..org), WikiGenes (www.wikigenes.org), IHOP [14] – suggests that a significant fraction of genes in the identified clusters are known to be active in the brain or have been previously implicated in neurodegenerative and psychiatric disorders. Between 50-70% of genes in the identified clusters have been previously associated to the brain in this fashion (Figure 4.2, orange genes), compared to only

~20% of the all genes within the de novo CNV regions (P-value < 10-3, see Table 4.3 for functional description of cluster genes).

The contribution of each gene to the cluster score is not uniform. While some genes have strong interactions with many genes in the cluster, others have relatively weaker functional connections. Each gene’s contribution to the final score can be captured as a de-weighted sum of the edges from that gene to all other genes in the cluster, such that the final contribution for a gene is the strongest edge from that gene plus half of the second strongest edge plus one third of the third strongest edge and so on.

The size of each gene in Figure 4.2 is proportional to that score. In addition, we performed Markov Chain

Monte Carlo (MCMC) simulations to sample clusters based on their overall strength and assigned each

39

gene a score proportional to its membership in strongly interconnected clusters (see Methods). This obtained similar results to the individual contributions to the cluster score shown in Figure 4.2.

Figure 4.2: NETBAG gene clusters found in de novo CNV regions associated with autistic individuals. In the figure, genes (nodes) with known functions in the brain and nervous systems are colored in orange (see Table 4.3 for functional information about all genes shown). Node sizes represent the importance of each gene to the overall cluster score. Importance was determined by taking the weighted sum of all edges from that gene to each other cluster gene. Edge widths are proportional to the prior likelihood that the two corresponding genes contribute to a shared genetic phenotype. For clarity, we show only the two strongest edges from each node. (A) The highest scoring cluster obtained using the search procedure with up to one gene per each CNV region. (B) The cluster obtained using the search with up to two genes per region.

4.2.2. Gender Bias in autism variants

One of the striking characteristics of autism is the male-to-female incidence ratio, potentially as high as 7:1 for high-functioning ASD [15]. It has been previously suggested that stronger genetic perturbations are required, on average, to trigger an autistic phenotype in females than males, due to currently unknown compensatory mechanisms [11]. Interestingly, we found that genes perturbed by

40

rare de novo CNVs from female subjects are significantly more important for the overall cluster score, compared to genes implicated by male subjects (see Figure 4.3, One-tail Mann-Whitney test, female > male, P-value=0.013). Thus genes disrupted in female subjects were more central to the disease cluster or more related to its core functions, supporting the hypothesis that females require more functionally severe mutations to trigger the phenotype than their male counterparts.

Figure 4.3: Functional importance of genes perturbed by CNV events in females and males. Autism is a disease with strikingly different prevalence in males versus females, especially for high functioning children. In Simons Simplex Collection (SSC), analyzed by Levy et al., the ratio of males to females is 7:1. It was proposed that, due to a currently unknown mechanism(s), females may be more resistant to autism-causing genetic perturbations. Our results are consistent with this hypothesis as the genes from CNV events observed in females made more significant contribution to the cluster score, i.e. they interact more strongly with other genes in the cluster. A) The identified cluster is shown with children gender indicated by node shapes.. Round nodes represent genes coming from the CNV events observed exclusively in males, square nodes from females, and diamond nodes come from genomic regions perturbed in both males and females. B) Distributions of gene contributions for genes perturbed in males and females CNV (One-tail Mann-Whitney test, female > male, P-value=0.013).

41

4.2.3. ASD network and related sets

To confirm that our cluster is associated with existing knowledge about autism, we examined manually curated gene sets implicated in autism and intellectual disability [8]. For this analysis, we examined the connections in our phenotype network between the cluster and a set of known disease genes. These were compared to the connections between random clusters and the same disease sets.

We found that the cluster using one gene from autism CNVs (Figure 4.2a) is strongly related to genes implicated in autism in previous studies (P-value=0.001) and genes associated with intellectual disability

(P-value=0.017), despite a relatively small overlap between the gene sets (~3%). The cluster was also connected (P-value=0.013) to proteins identified experimentally by recent proteomic profiling of postsynaptic density (PSD), a neuronal location long thought to be key to autism development [16].

4.2.4. General Network functions

To characterize the specific biological processes related to our cluster in more detail, we investigated the strength of functional interactions between the cluster genes and various Gene

Ontology (GO) categories [17]. We submitted the genes in Figure 4.2b to DAVID (Table 4.1) revealing a strong connection with cytoskeleton development and actin formation. In addition, we employed the same method described above to assess the connections in our phenotype network between our cluster and GO terms, correcting for multiple hypothesis testing using a False Discovery Rate (FDR) procedure.

The 25 GO categories with highest significance (Table 4.1) also contain many terms related to actin network dynamics and reorganization. The remaining terms implicated a diverse collection of molecular and cellular processes essential for proper synaptic formation and axon guidance including synaptogenesis, axonogenesis, cell-cell adhesion, small GTPase signaling, and neurite development.

42

Table 4.1: Gene Ontology terms associated with autism cluster through DAVID Gene Ontology Term GO Category P-value GO:0007010: cytoskeleton organization Biological process 0.006 GO:0030054: cell junction Cellular component 0.008 GO:0030036: actin cytoskeleton organization Biological process 0.011 GO:0008022: protein C-terminus binding Molecular function 0.014 GO:0030029: actin filament-based process Biological process 0.018 GO:0008092: cytoskeletal protein binding Molecular function 0.033 Gene Ontology terms associated with autism cluster using the phenotype network Gene Ontology Term GO Category q-value GO:0007015: actin filament organization Biological process < 0.01 GO:0030424: axon Cellular component < 0.01 GO:0048469: cell maturation Biological process < 0.01 GO:0007611: learning and or memory Biological process < 0.01 GO:0044456: synapse part Cellular component < 0.01 GO:0045202: synapse Cellular component < 0.01 GO:0007163: establishment and or maintenance of cell polarity Biological process 0.025 GO:0045216: intercellular junction assembly and maintenance Biological process 0.025 GO:0019201: nucleotide kinase activity Molecular function 0.025 GO:0005912: adherens junction Cellular component 0.025 GO:0007409: axonogenesis Biological process 0.025 GO:0016323: basolateral plasma membrane Cellular component 0.025 GO:0030041: actin filament polymerization Biological process 0.025 GO:0051258: protein polymerization Biological process 0.025 GO:0021700: developmental maturation Biological process 0.025 GO:0030863: cortical cytoskeleton Cellular component 0.027 GO:0000904: cellular morphogenesis during differentiation Biological process 0.027 GO:0005925: focal adhesion Cellular component 0.027 GO:0030427: site of polarized growth Cellular component 0.027 GO:0032271: regulation of protein polymerization Biological process 0.027 GO:0031175: neurite development Biological process 0.027 GO:0048666: neuron development Biological process 0.027 GO:0030055: cell matrix junction Cellular component 0.027 GO:0030832: regulation of actin filament length Biological process 0.027 GO:0030036: actin cytoskeleton organization and biogenesis Biological process 0.027 Table 4.1: Gene Ontology (GO) terms highly connected to the clusters in Figure 4.2. Terms were found by submitting cluster genes to DAVID, or by an analysis of the connections in our phenotype network between the cluster and genes annotated with a particular term. Multiple hypothesis correction, due to examination of hundreds of GO terms, was done in the former case through Bonferroni correction, and the in the latter case through a False Discovery Rate (FDR) procedure.

43

4.2.5. Synaptic contacts and Dendritic Morphology

At the core of the processes described above is the development and maturation of synaptic contacts in the brain. Thus the relationships between cluster proteins can be better appreciated when considered in the context of molecular interactions involved in formation and maturation of the excitatory (glutamatergic) synapse (Figure 4.4). Excitatory synaptic connections are formed between axons and dendritic spines, which are complex and dynamic post-synaptic structures containing thousands of proteins [18, 19]. The formation, maturation, and elimination of dendritic spines—and their precursors, filopodia—lie at the core of synaptic transmission and memory formation [20, 21].

Although Figure 4.4 shows a dense and interconnected web of molecular interaction, the processes depicted can be understood in terms of several signaling and structural pathways, many of which converge on the regulation of the growth and branching of the actin filament network essential for spine structuring and morphogenesis.

The initial contacts between axons and dendrites are mediated by specific adhesion-related proteins, such as neurexin (e.g. NRXN1, genes perturbed by de novo CNVs in ASD are italicized here and below) and neuroligin (e.g. NLGN3) [22]. On the postsynaptic side of an excitatory synapse, these initial axon-dendrite contacts ultimately develop into a complex and dense structure, the postsynaptic density

(PSD), dominated by several types of glutamate receptors (such as AMPA and NMDA), various scaffolding proteins (DLG4/PSD95, DLG2, SHANK2/3, SynGAP1, DLGAP2) and trafficking/signaling proteins (CNND2). In total, PSD contains many hundreds of distinct proteins [16, 23]. From these contacts, information for activity-dependent regulation of spine morphology is passed through an intermediate level of signaling protein, such as Rho family [24] of small GTPases: RhoA/B, Cdc42, Rac1, to downstream targets ultimately connected (for example, through LIMK1 and PAK1/2/3) to proteins

(such as cofilin and Arp2/3) modifying morphology of the actin network [25]. The activity of the GTPases

44

is regulated postsynaptically by many guanine exchange factors (GEFs, such as GDI1) and GTP activating proteins (GAPs).

Another signaling pathway, the WNT pathway, plays a crucial role in diverse processes associated with formation of neural circuits [26], and is known to be directly involved in the regulation of dendrite morphogenesis [27, 28]. WNT signaling is accomplished through the canonical branch (DVL,

AXIN1, beta-catenin) and the non-canonical branch (DVL1/2/3, Rac1, and JNK); both of these pathway branches converge on regulation of actin network morphogenesis. Similar to WNT, the signaling also plays a prominent role in the context of autism phenotype and specifically dendritic spine morphogenesis [29, 30]. Signaling by secreted extracellular RELN protein acts though VLDR and Apoer2 receptors and the PI3K/Akt pathway [31] regulating the mammalian target of rapamycin (mTOR) pathway [32, 33]. Another important pathway converging on mTOR is MAPK3/ERK, which is activated, in part, by Ras and NF1. mTOR integrates various inputs from upstream growth-related pathways, and is also known to regulate dendrite morphogenesis [34].

45

Figure 4.4: Important genes associated with the morphogenesis of dendritic spines. Dendritic spines are dynamically forming protrusions from a neuron’s dendrite which mediate excitatory connection to axons and determine synaptic strength. Proteins shown in the figure play crucial roles in formation of physical contacts between axons and dendrites, organization of postsynaptic density (PSD), and signaling processes controlling spine morphology. Many of the signaling pathways ultimately converge on the regulation of the growth and branching of the actin filament network, which is essential for spine structural remodeling. The proteins encoded by genes from the identified functional cluster (Figure 4.2) are shown in yellow, other genes hit by de novo CNV from Levy et al., in blue, and genes previously implicated or discussed in the context of autism are circled in orange

4.2.6. Consequences for Dendritic Morphology

Considering our cluster genes from the perspective of the functional molecular network (Figure 4.4) allowed us to investigate likely morphological consequences of some CNVs events. There is growing evidence that changes in dendritic spine morphology play an important role in a number of neurological disorders [35]. A decrease in the density of dendritic spines in regions of the cerebral cortex has been

46

linked to schizophrenia [36-38]. On the other hand, an increase in spine size or density has been connected to Fragile X syndrome, a disorder related to autism [39, 40]. Following the logic that CNV deletions should decrease the dosage of the perturbed genes and duplications should increase it, we can infer—based on the structure and regulatory logic of the functional network in Figure 4.4—the morphological effects of 13 gene perturbations on dendritic spines. Specifically, we found that in 11 out of 13 cases (~85%) the gene perturbations caused by the observed CNV events should increase either dendritic spine growth or their density (Table 4.2). This result is consistent with recent findings that autistic individuals have increased spine density in portions of their cerebral cortex [41, 42] and possibly a local brain over-connectivity [43].

Table 4.2: Likely impact on dendritic growth for ASD genes Gene Literature Support CNV Event Likely impact A negative regulator promoting the phosphorylation and destruction of β- AXIN1 catenin [44], which is necessary for dendritic growth [45] deletion Increase Knockdown of BAIAP2 through RNA reduces the density, length and width of BAIAP2 dendritic spines [46] deletion Decrease CRKL Reduction of CRKL block dendrogensis [47] duplication Increase Alpha-catenein regulates actin-filament assembly , potentially through CTNNA3 competition with Arp2/3 [48] duplication increase CTNND2 Delta-catenin knockdowns have been shown to increase spine density [49] Deletion increase DKK1 can stimulate neurite outgrowth through Frizzled and JNK signaling DKK1 [50] duplication increase DVL2 DVL2 knockdown suppresses neurite outgrowth [50] duplication increase Actin-binding protein that crosslinks filaments, which is necessary for the FLNA development of neuronal filopodia [51] duplication increase LIMK1 deficits result in decreased number of dendritic spines or abnormally LIMK1 small, thing spines [36, 52] duplication increase Overexpression of Tau leads to descreased levels of β-catenin [53], which is MAPT necessary for dendritic growth [45] deletion increase NDE1 Promotes neuron outgrowth by regulating microtubules [54] duplication increase Pak2 phosphorylates LIMK1, inactivating cofilin [55], alternatively Pak2 can decrease / PAK2 inhibit effect of Pak1 which may lead to an opposite effect deletion increase Stabilizes cytoskeleton and binds it to plasma membrane. Located throughout brain where intensity corresponds to the density of neuronal SPTAN1 cells [56] duplication Increase Table 4.2: We examined genes from our cluster to understand whether it is possible to predict a likely morphological impact. We found that, based on analysis of our network or previously published results, it is possible to infer effects for 13 genes which appeared in our cluster. Two genes (MAPK and CYFIP1) were not considered because they appeared in both duplication and deletion events, making their impact uncertain. Interestingly, we observed that events affecting 11 out of 13 analyzed genes the observed CNV events should increase growth or density of dendritic spines. One further event, a deletion containing gene PAK2, may also increase dendritic density through a non-canonical pathway.

47

4.3. Discussion

Although we discussed cluster proteins in the context of dendritic spine development, many of the genes in Figure 4.4 proteins also participate in diverse cellular processes and are reused in the context of axon guidance and neuron motility [57]. Such a recycling of proteins is natural because actin network dynamics are essential for such processes as growth of axonal filopodia used in searching for growth cone guidance cues [58]. The presence of DCC protein in the identified network (Figure 4.2 and

Figure 4.4), also suggests an important role of perturbed axonal guidance in autism. Although DCC is involved in dendrite development [59], this receptor and its signaling protein netrin are essential for guiding axons to their final destinations [58]. In addition, several signaling pathways, for example the

WNT and reelin pathways, also play a prominent role in neuron motility [26, 60] and several specific proteins, such as PAKs and LIMK, which regulate the dynamics of actin network, are reused in axonal morphogenesis. Consequently, malfunction of many proteins shown in Figure 4.4 may influence autistic phenotypes through their role in either dendrite or axon signaling, or possibly a combination of these processes.

The presented analysis strongly supports the hypothesis that autism is primarily a disease of synaptic and neuronal connectivity malfunction [61] with genetic events affecting the whole arch of molecular processes essential for proper synapse formation and function. Genes identified by our network analysis reveal a striking genetic complexity of autism, similar to that already apparent in many cancers [62, 63]. We and others believe this complexity will be a hallmark of many common human phenotypes and maladies [64]. In spite of the observed complexity, our study provides an important proof of the principle that underlying functional networks responsible for common phenotypes can be identified by an unbiased analysis of multiple rare genetic perturbations from a large collection of affected individuals.

48

The functional network presented in Figure 4.4 contains approximately 70 genes, with about

40% of them perturbed by rare de novo CNVs observed by Levy et al. As more genetic data are analyzed it is likely that the network will growth in size and significance. Considering, for example, that up to a thousand [23] distinct proteins are associated with postsynaptic density or that hundreds of different

GAPs/GEFs modify activity of Rho GTPases associated with actin network remodeling, it is likely that many hundreds of genes could ultimately contribute to the autistic phenotype. This estimate is consistent with independent estimates based on recurrent mutations and the overall incidence of autism in the human population [11, 12]. Deleterious variants in different genes contributing to autistic phenotype will almost certainly have different penetrance and vulnerabilities. To illuminate both the complete set of genes responsible for ASD and their respective contributions to the phenotype will require analyses of next generation sequencing data coupled with investigation of underlying molecular networks.

49

4.4. Methods

Copy number variants

The recent analysis by Levy et al. identified 75 rare de novo CNV events. These were mapped to

746 unique human genes using NCBI build 36. We combined any overlapping events into a single region and removed events that did not intersect any genes; we also removed six very large CNV events (length

>5 Mb). As a result, the final set used for our analysis contained 47 genomic regions intersecting 433 genes. In addition, Levy et al. also identified de novo CNV events identified from unaffected siblings as well as rare CNVs inherited by autistic subjects but not their siblings. The average number of genes within each de novo CNV region was ~9, with the median of 3 genes per regions.

Levy et al. also identified 19 de novo CNV events in siblings of autistic children and 157 rare inherited CNV events, those seen a parent and the autistic child but not the seen in the unaffected siblings. Appling the preprocessing steps described above resulted in the sibling dataset containing 14 regions hitting 69 genes, and inherited data containing 156 regions hitting 418 genes.

Greedy Search Algorithm

Once putative disease genes are mapped to the underlying phenotype network, a greedy growth algorithm was used to find strongly connected clusters among these genes (Figure 4.5).

Specifically, the search algorithm was started from every possible gene in CNV regions, and then at each iteration the gene which most increased the cluster score was added. The method allowed only one or two genes per each CNV region in the growing cluster; once a region had contributed the limited number other genes from that event were ignored. This procedure was run until no further genes could be added. At each cluster size, results obtained by starting from different genes were compared and the cluster with the highest score was selected to represent that size.

50

The score for a cluster was the sum of the contributions from each gene it contained. Each gene’s contribution was the de-weighted sum of all edges from that gene to other cluster genes. The sum included the strongest edge, plus half of the second strongest edge, plus one third of the third strongest edge and so on. This heuristic was used to reflect our belief that genes in a cluster will contain common functions and therefore their edges are not wholly independent.

Network Significance

In order to determine the significance of the disease-associated network it was compared with results from random genomic events. The randomization procedure preserved either the size in base pairs, or the number of genes in each observed in each genomic event. Preserving gene count produced more conservative P-values and so this was used throughout our analysis. Once random events were generated, the same search algorithm was applied to produce a distribution of cluster scores from random genomic events at every possible cluster size.

Using 10,000 random trials, we first determined the P-value for the highest scoring cluster at each size based on the scores of random clusters of the same size. We refer to this as the local P-value.

The final cluster is the one with the lowest local P-value. Next, we find the lowest local P-value in each random trial and used this distribution to assign significance to our local P-value which we refer to this as the global P-value. This corrects for multiple hypothesis due to different cluster sizes. The global P- value is more stringent, and therefore is considered the final P-value of the cluster (Figure 4.5).

51

Figure 4.5: Illustration of the cluster building procedure used in the NETBAG approach. Gene clusters were built using a greedy iterative search, i.e. by adding the gene which most increased the overall cluster score at each iteration. (A) A cluster of size 3 is on the left, and genes mapped to five CNV regions are shown on the right. Genes in grey cannot be used because either they or other genes from their CNV regions are already in the cluster. Genes in blue are still available for the cluster building algorithm. One available gene, X1, has weak connections to the cluster. (B) Another available gene from another CNV, Y1, has relatively stronger connections to the cluster when compared to X1. (C) Gene Y1 is added to the cluster. (D) The greedy algorithm was started with all possible genes within CNV regions (Gene 1, Gene 2 … Gene G) and clusters were grown to a maximum size. NETBAG selects the Max highest scoring clyster at each size (S i), assigns local P-values (Pi) and selects the cluster with the lowest P-value as the final result.

Gene contribution to cluster and MCMC simulations

Contributions of different genes to the score of the identified network vary substantially.

Because of de-weighting these measurements are largely based on the strength of the two strongest edges from that gene. The size of nodes in Figure 4.2 is relative to this score. In addition, we devised a formal method to assign weights to individual genes reflecting their contribution to high scoring clusters. The method is based on two distributions over networks: pu(C) in which all clusters are equally likely, and p(C) in which clusters are assigned a probability relative to the their scores such that high scoring clusters are more likely to be seen. Thus, genes from a particular genomic event are equally

52

likely to appear in clusters sampled from the uniform distribution, pu(C), while genes with stronger functional connections to other genomic events are more likely to appear in clusters sampled from the non-uniform distribution, p(C). We used Markov-Chain Monte Carlo (MCMC) to sample five million clusters from each of the two distributions. Each gene is assigned a score equal to the number of times it was seen in non-uniform clusters versus the number of times it was seen in uniform clusters. This method accounts not only for an individual gene’s connections but also the number of genes in each genomic event. Results were comparable to considering a simple weighted sum of the edges containing each gene, although the simulations had the advantage of allowing us to assign weights to genes not seen in the selected cluster.

Gene Set Relationship Analysis

To characterize the identified cluster we investigated its interactions with a collection of defined functional sets of human genes, the 5518 gene sets from the MSigDB database which included Gene

Ontology (GO) categories along with molecular pathways [65]. Using the phenotype network, for each gene set we calculated its average interaction to the identified clusters shown in Figure 4.2. To determine the significance of the calculated interaction scores we built gene set-specific background distributions by generating clusters from the randomized genomic regions with the same gene count as in Levy et al. We used the background distribution to assign an empirical P-value for every gene set and then applied FDR procedure to address the multiple hypothesis involved in testing all gene sets within the collection.

53

Table 4.3: Function notes on NETBAG cluster genes GeneID Name 2A 2B Neural Documented Protein Function Plays a role in apoptosis via interaction with MAP3K12; unclassified interactions with aberrant amyloid peptide associated with Alzheimer 26574 AATF Yes Yes No disease 255926 ADAM5P Yes Yes No Pseudogene; unknown function Involved in AMP biosynthesis; point mutations implicated, though 158 ADSL No Yes Yes rare, in autistic individuals [66] Involved in the transport of phosphatidylserine and phosphatidylethanolamine across the bilayer; mutations may result 5205 ATP8B1 Yes Yes No in epithelial disorders Part of the WNT signaling pathway; facilitates phosphorylation of 8312 AXIN1 No Yes Yes beta-catenin with APC and GSK3B. Plays a role in CDC42-mediated reorganization of the actin 10458 BAIAP2 Yes Yes Yes cytoskeleton, and may affect neuronal growth-cone guidance Plays a central role in DNA repair acting as marker to distinguish repair from apoptosis, and in targeting the WINAC complex to acetylated ; may be involved in Williams-Bueren syndrome 9031 BAZ1B No Yes No [67] Voltage-dependent Ca(2+) calcium channel, involved in neurotransmitter release; may play a role in directed migration of 774 CACNA1B Yes Yes Yes immature neurons; implicated in schizophrenia [68] Member of the highly conserved multiprotein complex, including Cdc6/Cdc18, important during early steps of DNA replication in 8318 CDC45 No Yes No eukaryotes Member of the nicotinic acetylcholine receptors (nAChRs) super- family, ion channels that mediate fast signal transmission at synapses; located in a chromosomal region previously implicated in 1139 CHRNA7 Yes Yes Yes schizophrenia May function as a regulatory ATPase and be related to 81570 CLPB Yes Yes No secretion/protein trafficking process C-protein coupled receptor that binds neuropeptides, major regulators of the hypothalamic-pituitary-adrenal pathway; may have 1394 CRHR1 No Yes Yes protective effect against adult depression [69] Shown to activate the RAS and JUN kinase signaling pathways; binds to DAB1 to help mediate transduction of Reelin signaling important 1399 CRKL Yes Yes Yes for neuronal positioning [70] Potential tumor suppressor; highest expression is in brain, may play a 64478 CSMD1 Yes Yes Yes crucial role in nerve growth cone function [71] Alpha-catenin; functions in cell-cell adhesion and regulation of actin- filament organization, possibly through competition with Arp2/3- 29119 CTNNA3 Yes Yes Yes mediated actin polymerization Delta-catenin; primarily expressed in brain; may be involved in 1501 CTNND2 Yes Yes Yes neuronal cell adhesion; involved in regulation of neurite outgrowth Involved in formation of membrane ruffles and lamellipodia 23191 CYFIP1 Yes Yes Yes protrusions and in axon outgrowth through binding to F-actin Transmembrane receptor of neutrin-1 ligand; mediates axon 1630 DCC Yes Yes Yes guidance Primarily expressed in testis, and potentially related to tumorigenesis; recently identified as part of x-linked associated 168400 DDX53 Yes Yes Yes with autism [8] Involved in embryonic development through its inhibition of the WNT signaling pathway; in adults implicated in bone disease, cancer and 22943 DKK1 No Yes Yes Alzheimer's disease Post-synaptic scaffolding protein which may be involved in the 1739 DLG1 No Yes Yes trafficking AMPAR crucial to synaptic plasticity in the brain

54

GeneID Name 2A 2B Neural Documented Protein Function Member of the post-synaptic density (PSD) in excitatory synapses; plays a role in NMDA receptor signaling and stability of certain 1740 DLG2 Yes Yes Yes synapses Key post-synaptic scaffolding protein utilized for clustering/assembly of receptors, ion channels, and associated signaling proteins; changes in this gene expression were shown to alter the ratio of excitatory to 1742 DLG4 Yes Yes Yes inhibitory synapses in hippocampal neurons Part of the WNT signaling pathway, involved in neural connectivity 1856 DVL2 No Yes Yes [72] Methylates H3, which tags it for transcriptional repression; may play a role in the G0/G1 cell cycle transition through silencing of 79813 EHMT1 No Yes No MYC- and E2F-responsive genes May operate in a postreplication repair or a cell cycle checkpoint function; may be involved in interstrand DNA cross-link repair and in 2175 FANCA Yes Yes No the maintenance of normal chromosome stability Interacts with actin filaments to remodel the cytoskeleton, effecting cell shape and migration; may assist in certain neuroblast migration; 2316 FLNA Yes Yes Yes defects can cause neurological developmental disorders Plays a role in cytoskeletal organization and/or establishment of cell 56776 FMN2 No Yes No polarity Encodes a protein localized to the Golgi apparatus, and postulated to play a role in Rab6-regulated membrane tethering events, and 2803 GOLGA4 Yes Yes No vesicular transport from the trans-Golgi Encodes a member of the BMP antagonist family, involved in organogenesis, body patterning, and tissue differentiation; other members of this protein family, although not this gene, implicated in 64388 GREM2 Yes Yes No axon cone guidance GRK family member whose primary function is to phosphorylate and thus deactivate G protein-coupled receptors; mouse knockouts displayed an age-dependent neurodegeneration, comparable to early 2869 GRK5 Yes Yes Yes Alzheimer disease [73] Encodes the regulatory subunit of the inhibitor of kappaB kinase (IKK) complex, which indirectly activates genes involved in inflammation, 8517 IKBKG No Yes No immunity, cell survival, and other pathways Encodes an intracellular receptor for inositol 1,4,5-trisphosphate, which mediates calcium release from the smooth endoplasmic 3708 ITPR1 Yes Yes Yes reticulum; mutations associated with a various cerebellum disorders Part of a transcription factor family that binds to C-rich sequences; related to inflammation response through activation of RANTES 51621 KLF13 No Yes No expression in T-cells Phosphorylates and inactivates cofilin, thereby stabilizing the actin cytoskeleton; likely to be involved in brain development; implicated in the impaired visuospatial constructive cognition of Williams 3984 LIMK1 Yes Yes Yes syndrome Predominantly localizes in glycolipid- and cholesterol-enriched membrane (GEM) rafts, involved in raft-mediated trafficking in 7851 MALL No Yes No endothelial cells Member of kinase family which responds to extracellular signals; transduces signals from growth factors and key in regulating differentiation and proliferation; MAPK kinase pathways associated 5595 MAPK3/ERK1 Yes Yes Yes with various diseases including neurodegeneration Promotes microtubule assembly and stability, and might be involved in the establishment and maintenance of neuronal polarity; expressed at various stages and locations in the nervous system; mutations have been associated with several neurodegenerative 4137 MAPT Yes Yes Yes disorders

55

GeneID Name 2A 2B Neural Documented Protein Function Component of the Mediator complex, a coactivator involved in the regulated transcription of nearly all RNA polymerase II-dependent genes; serves as a scaffold for the assembly of necessary elements 9442 MED27 Yes Yes No and general transcription factors Binds DNA as a heterodimer with MLX, and plays a role in transcriptional activation of glycolytic target genes; involved in 22877 MLXIP Yes Yes No glucose-responsive gene regulation; abundant in skeletal muscle A smooth muscle myosin, predominantly found in myocytes; functions as a major contractile protein, converting chemical energy into mechanical energy through the hydrolysis of ATP; was identified 4629 MYH11 Yes Yes Yes as a component of post synaptic density Plays an essential role in microtubule organization, mitosis and neuronal migration; essential for the development of the cerebral cortex; may regulate the production of neurons by controlling the 54820 NDE1 No Yes Yes orientation of the mitotic spindle Member of a family of neuronal cell surface proteins, binding to beta- neurexins; involved in the formation and remodeling of central nervous system synapses; mutations in this gene may be associated 54413 NLGN3 Yes Yes Yes with autism and Asperger syndrome Interacts with Crk-associated substrate, and appears to function in the control of cell division, and in cell-cell and cell-matrix adhesion signaling through a complex localized in actin- and microtubule-based 4867 NPHP1 Yes Yes No structures Functions in the vertebrate nervous system as cell adhesion molecules and receptors; plays an important role in formation or 9378 NRXN1 Yes Yes Yes maintenance of synaptic junctions through binding to neuroligins Catalyzes the fusion of transport vesicles within the Golgi cisternae; neuronal function depends on vesicles fusing with the presynaptic membrane for the release of neurotransmitters; a putative role in 4905 NSF Yes Yes Yes delivery and expression of AMPA receptors at the synapse PAK kinases are critical effectors that link Rho GTPases to cytoskeleton reorganization and nuclear signaling; it was recently suggested that PAK2 mediates neurite outgrowth via Rac1 and promotes neuronal differentiation through interaction with RhoGDI1- 5062 PAK2 Yes Yes Yes Rac1/Cdc42 complex [74] 5253 PHF2 No Yes No Probable regulator of histone methyltransferase complexes A precursor to GDP-mannose; mutations in this gene have been shown to cause defects in glycoprotein biosynthesis, which manifests 5373 PMM2 No Yes No as carbohydrate-deficient glycoprotein syndrome type I [75] Encodes a protein with antioxidant function, localized in the ; involved in regulation of cellular proliferation, 10935 PRDX3 No Yes No differentiation, and antioxidant functions Involved in mediate smooth muscle cell relaxation and vasodilation in responses to rises in cGMP through interaction with PPP1R12A; interacts with ITPR1. Mouse knockouts show subtle mental 5592 PRKG1 Yes Yes Yes disabilities Mediates arginine methylation, a widespread posttranslational 56341 PRMT8 Yes Yes Yes modification involved in many cellular processes; brain-specific Member of regulatory factor X family; required for various 5991 RFX3 No Yes No developmental and differentiation activities Involved in apoptosis, transcription, nucleosome assembly and histone binding; stimulates DNA replication of the adenovirus 6418 SET No Yes No genome complex with viral core proteins Plays a role in NF2-mediated growth suppression of cells; strongly expressed in neurons and associates with Rab proteins involved in 27352 SGSM3 Yes Yes Yes vesicle trafficking [76]

56

GeneID Name 2A 2B Neural Documented Protein Function Interacts with WINAC complex; belongs to the neural progenitors- specific remodeling complex (npBAF complex) and the neuron-specific chromatin remodeling complex (nBAF complex) 6595 SMARCA2 Yes Yes Yes involved in neuronal development and regulation of dendrite growth Is a member of a family of widely-distributed cytoskeletal proteins involved in actin cross-linking; potentially anchors dendritic spines to actin cytoskeleton and may regulate synaptogenesis and/or synaptic 6709 SPTAN1 Yes Yes Yes plasticity [77] Modulates the activity of several chaperone complexes involved in degradation of misfolded proteins; linked to Alzheimer's disease due to reduction in aberrant protein scavenging causing abnormal protein 10273 STUB1 Yes Yes Yes accumulation in the brain [78] Catalyzes the hydrolysis of sulfate esters; mutations in this gene 285362 SUMF1 No Yes No cause multiple sulfatase deficiency, a lysosomal storage disorder Component of the ATAC complex required for the function of some acidic activation domains which activate transcription from a distant 6871 TADA2A No Yes No site Plays a role in cell signaling, microtubule organization and stability, and in apoptosis; Isoform1 activates the JNK MAP kinase pathway 9344 TAOK2 No Yes No and binds to microtubules Chromatin-associated protein that represses transcription via 80764 THAP7 No Yes No recruitment of HDAC3 and nuclear hormone receptor corepressors 6434 TRA2B Yes Yes No Participates in the control of pre-mRNA splicing Involved in the late stages of spermatogenesis, during the 23617 TSSK2 Yes Yes No reconstruction of the cytoplasm 23331 TTC28 Yes Yes No Function is unclear Gamma- complex is necessary for microtubule nucleation at 114791 TUBGCP5 No Yes No the centrosome Cleaves ubiquitin fusion protein substrates. Stabilizes TP53 and induces TP53-dependent repression and apoptosis; 7874 USP7 Yes Yes No involved in cell proliferation during early embryonic development Serine/threonine-protein kinase, and member of the WNK family; 65268 WNK2 Yes Yes No potentially related to cell growth or proliferation [79, 80] WNT family protein, implicated in oncogenesis and in several developmental processes. Plays a role in cell-cell signaling during morphogenesis of the developing neural tube. Regulates 7473 WNT3 No Yes Yes neurogenesis [81] 92822 ZNF276 No Yes No protein. May be involved in transcriptional regulation Zinc finger protein. May function as a transcriptional repressor for 163087 ZNF383 Yes Yes No activities mediated by MAPK signaling pathways 92285 ZNF585B No Yes No Zinc finger protein. May be involved in transcriptional regulation Table 4.3: We examined functional annotation of genes in the autism associated clusters to determine if they are involved in neural or brain-related functions. Table indicates which genes were present in each cluster, with cluster 2A allowing one gene per CNV and cluster 2B allowed two genes per CNV. Also included is a short summary of gene function based primarily on GeneCards (www.genecards.org) and IHOP [14], with other papers as referenced in the table.

57

4.5. References

1. Geschwind, D.H., Autism: many genes, common pathways? Cell, 2008. 135(3): p. 391-5. 2. Hyman, S.E., A glimmer of light for neuropsychiatric disorders. Nature, 2008. 455(7215): p. 890- 3. 3. Wang, K., et al., Common genetic variants on 5p14.1 associate with autism spectrum disorders. Nature, 2009. 459(7246): p. 528-33. 4. Weiss, L.A., et al., A genome-wide linkage and association scan reveals novel loci for autism. Nature, 2009. 461(7265): p. 802-8. 5. Manolio, T.A., et al., Finding the missing heritability of complex diseases. Nature, 2009. 461(7265): p. 747-53. 6. McClellan, J. and M.C. King, Genetic heterogeneity in human disease. Cell, 2010. 141(2): p. 210-7. 7. Sebat, J., et al., Strong association of de novo copy number mutations with autism. Science, 2007. 316(5823): p. 445-9. 8. Pinto, D., et al., Functional impact of global rare copy number variation in autism spectrum disorders. Nature, 2010. 466(7304): p. 368-72. 9. Moessner, R., et al., Contribution of SHANK3 mutations to autism spectrum disorder. Am J Hum Genet, 2007. 81(6): p. 1289-97. 10. Marshall, C.R., et al., Structural variation of chromosomes in autism spectrum disorder. Am J Hum Genet, 2008. 82(2): p. 477-88. 11. Zhao, X., et al., A unified genetic theory for sporadic and inherited autism. Proc Natl Acad Sci U S A, 2007. 104(31): p. 12831-6. 12. Levy, D., et al., Rare de novo and transmitted copy-number variation in autistic spectrum disorders. Neuron, 2011. 70(5): p. 886-97. 13. Consortium, T.U., The Universal Protein Resource (UniProt). Nucleic Acids Res, 2007. 35(Database issue): p. D193-7. 14. Hoffmann, R. and A. Valencia, A gene network for navigating the literature. Nat Genet, 2004. 36(7): p. 664. 15. Newschaffer, C.J., et al., The epidemiology of autism spectrum disorders. Annu Rev Public Health, 2007. 28: p. 235-58. 16. Bayes, A., et al., Characterization of the proteome, diseases and evolution of the human postsynaptic density. Nat Neurosci, 2011. 14(1): p. 19-21. 17. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9. 18. Tada, T. and M. Sheng, Molecular mechanisms of dendritic spine morphogenesis. Curr Opin Neurobiol, 2006. 16(1): p. 95-101. 19. Alvarez, V.A. and B.L. Sabatini, Anatomical and physiological plasticity of dendritic spines. Annu Rev Neurosci, 2007. 30: p. 79-97.

58

20. Yang, G., F. Pan, and W.B. Gan, Stably maintained dendritic spines are associated with lifelong memories. Nature, 2009. 462(7275): p. 920-4. 21. Roberts, T.F., et al., Rapid spine stabilization and synaptic enhancement at the onset of behavioural learning. Nature, 2010. 463(7283): p. 948-52. 22. Sudhof, T.C., Neuroligins and neurexins link synaptic function to cognitive disease. Nature, 2008. 455(7215): p. 903-11. 23. Sheng, M. and C.C. Hoogenraad, The postsynaptic architecture of excitatory synapses: a more quantitative view. Annu Rev Biochem, 2007. 76: p. 823-47. 24. Linseman, D.A. and F.A. Loucks, Diverse roles of Rho family GTPases in neuronal development, survival, and death. Front Biosci, 2008. 13: p. 657-76. 25. Blanchoin, L., T.D. Pollard, and R.D. Mullins, Interactions of ADF/cofilin, Arp2/3 complex, capping protein and profilin in remodeling of branched actin filament networks. Curr Biol, 2000. 10(20): p. 1273-82. 26. Salinas, P.C. and Y. Zou, Wnt signaling in neural circuit assembly. Annu Rev Neurosci, 2008. 31: p. 339-58. 27. Rosso, S.B., et al., Wnt signaling through Dishevelled, Rac and JNK regulates dendritic development. Nat Neurosci, 2005. 8(1): p. 34-42. 28. Salinas, P.C., et al., Maintenance of Wnt-3 expression in Purkinje cells of the mouse cerebellum depends on interactions with granule cells. Development, 1994. 120(5): p. 1277-86. 29. Fatemi, S.H., et al., Reelin signaling is impaired in autism. Biol Psychiatry, 2005. 57(7): p. 777-87. 30. Niu, S., O. Yabut, and G. D'Arcangelo, The Reelin signaling pathway promotes dendritic spine development in hippocampal neurons. J Neurosci, 2008. 28(41): p. 10339-48. 31. Jossin, Y. and A.M. Goffinet, Reelin signals through phosphatidylinositol 3-kinase and Akt to control cortical development and through mTor to regulate dendritic growth. Mol Cell Biol, 2007. 27(20): p. 7113-24. 32. Kumar, V., et al., Regulation of dendritic morphogenesis by Ras-PI3K-Akt-mTOR and Ras-MAPK signaling pathways. J Neurosci, 2005. 25(49): p. 11288-99. 33. Shaw, R.J. and L.C. Cantley, Ras, PI(3)K and mTOR signalling controls tumour cell growth. Nature, 2006. 441(7092): p. 424-30. 34. Tavazoie, S.F., et al., Regulation of neuronal morphology and function by the tumor suppressors Tsc1 and Tsc2. Nat Neurosci, 2005. 8(12): p. 1727-34. 35. Halpain, S., K. Spencer, and S. Graber, Dynamics and pathology of dendritic spines. Prog Brain Res, 2005. 147: p. 29-37. 36. Blanpied, T.A. and M.D. Ehlers, Microanatomy of dendritic spines: emerging principles of synaptic pathology in psychiatric and neurological disease. Biol Psychiatry, 2004. 55(12): p. 1121-7. 37. Garey, L.J., et al., Reduced dendritic spine density on cerebral cortical pyramidal neurons in schizophrenia. J Neurol Neurosurg Psychiatry, 1998. 65(4): p. 446-53. 38. Glantz, L.A. and D.A. Lewis, Decreased dendritic spine density on prefrontal cortical pyramidal neurons in schizophrenia. Arch Gen Psychiatry, 2000. 57(1): p. 65-73.

59

39. Fiala, J.C., J. Spacek, and K.M. Harris, Dendritic spine pathology: cause or consequence of neurological disorders? Brain Res Brain Res Rev, 2002. 39(1): p. 29-54. 40. Kaufmann, W.E. and H.W. Moser, Dendritic anomalies in disorders associated with mental retardation. Cereb Cortex, 2000. 10(10): p. 981-91. 41. Hutsler, J.J. and H. Zhang, Increased dendritic spine densities on cortical projection neurons in autism spectrum disorders. Brain Res, 2010. 1309: p. 83-94. 42. Woolfrey, K.M., et al., Epac2 induces synapse remodeling and depression and its disease- associated forms alter spines. Nat Neurosci, 2009. 12(10): p. 1275-84. 43. Scott-Van Zeeland, A.A., et al., Altered functional connectivity in frontal lobe circuits is associated with variation in the autism risk gene CNTNAP2. Sci Transl Med, 2010. 2(56): p. 56ra80. 44. Ikeda, S., et al., Axin, a negative regulator of the Wnt signaling pathway, forms a complex with GSK-3beta and beta-catenin and promotes GSK-3beta-dependent phosphorylation of beta- catenin. EMBO J, 1998. 17(5): p. 1371-84. 45. Yu, X. and R.C. Malenka, Beta-catenin is critical for dendritic morphogenesis. Nat Neurosci, 2003. 6(11): p. 1169-77. 46. Choi, J., et al., Regulation of dendritic spine morphogenesis by insulin receptor substrate 53, a downstream effector of Rac1 and Cdc42 small GTPases. J Neurosci, 2005. 25(4): p. 869-79. 47. Matsuki, T., A. Pramatarova, and B.W. Howell, Reduction of Crk and CrkL expression blocks reelin-induced dendritogenesis. J Cell Sci, 2008. 121(Pt 11): p. 1869-75. 48. Drees, F., et al., Alpha-catenin is a molecular switch that binds E-cadherin-beta-catenin and regulates actin-filament assembly. Cell, 2005. 123(5): p. 903-15. 49. Arikkath, J., et al., Delta-catenin regulates spine and synapse morphogenesis and function in hippocampal neurons during development. J Neurosci, 2009. 29(17): p. 5435-42. 50. Endo, Y., et al., Wnt-3a and Dickkopf-1 stimulate neurite outgrowth in Ewing tumor cells via a Frizzled3- and c-Jun N-terminal kinase-dependent mechanism. Mol Cell Biol, 2008. 28(7): p. 2368-79. 51. Rao, A. and A.M. Craig, Signaling between the actin cytoskeleton and the postsynaptic density of dendritic spines. Hippocampus, 2000. 10(5): p. 527-41. 52. McNair, K., et al., A role for RhoB in synaptic plasticity and the regulation of neuronal morphology. J Neurosci, 2010. 30(9): p. 3508-17. 53. Kwok, J.B., et al., Glycogen synthase kinase-3beta and tau genes interact in Alzheimer's disease. Ann Neurol, 2008. 64(4): p. 446-54. 54. Shim, S.Y., et al., Ndel1 controls the dynein-mediated transport of vimentin during neurite outgrowth. J Biol Chem, 2008. 283(18): p. 12232-40. 55. Bernstein, B.W., M.T. Maloney, and J.R. Bamburg, Actin and Diseases of the Nervous System, in Neurobiology of Actin, G. Gallo and L.M. Lanier, Editors. 2011, Springer New York. p. 201-234. 56. Goodman, S.R. and I.S. Zagon, Brain spectrin: a review. Brain Res Bull, 1984. 13(6): p. 813-32. 57. Shen, K. and C.W. Cowan, Guidance molecules in synapse formation and plasticity. Cold Spring Harb Perspect Biol, 2010. 2(4): p. a001842.

60

58. Tessier-Lavigne, M. and C.S. Goodman, The molecular biology of axon guidance. Science, 1996. 274(5290): p. 1123-33. 59. Suli, A., et al., Netrin/DCC signaling controls contralateral dendrites of octavolateralis efferent neurons. J Neurosci, 2006. 26(51): p. 13328-37. 60. Reiner, O. and T. Sapir, Similarities and differences between the Wnt and reelin pathways in the forming brain. Mol Neurobiol, 2005. 31(1-3): p. 117-34. 61. Zoghbi, H.Y., Postnatal neurodevelopmental disorders: meeting at the synapse? Science, 2003. 302(5646): p. 826-30. 62. Network, C.G.A.R., Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 2008. 455(7216): p. 1061-8. 63. Wood, L.D., et al., The genomic landscapes of human breast and colorectal cancers. Science, 2007. 318(5853): p. 1108-13. 64. Wang, K., et al., Strategies for genetic studies of complex diseases. Cell, 2010. 142(3): p. 351-3; author reply 353-5. 65. Subramanian, A., et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A, 2005. 102(43): p. 15545- 50. 66. Fon, E.A., et al., Adenylosuccinate lyase (ADSL) and infantile autism: absence of previously reported point mutation. Am J Med Genet, 1995. 60(6): p. 554-7. 67. Kitagawa, H., et al., The chromatin-remodeling complex WINAC targets a nuclear receptor to promoters and is impaired in Williams syndrome. Cell, 2003. 113(7): p. 905-17. 68. Glessner, J.T., et al., Strong synaptic transmission impact by copy number variations in schizophrenia. Proc Natl Acad Sci U S A, 2010. 107(23): p. 10584-9. 69. Polanczyk, G., et al., Protective effect of CRHR1 gene variants on the development of adult depression following childhood maltreatment: replication and extension. Arch Gen Psychiatry, 2009. 66(9): p. 978-85. 70. Park, T.J. and T. Curran, Crk and Crk-like play essential overlapping roles downstream of disabled-1 in the Reelin pathway. J Neurosci, 2008. 28(50): p. 13551-62. 71. Kraus, D.M., et al., CSMD1 is a novel multiple domain complement-regulatory protein highly expressed in the central nervous system and epithelial tissues. J Immunol, 2006. 176(7): p. 4419- 30. 72. Almuedo-Castillo, M., E. Salo, and T. Adell, Dishevelled is essential for neural connectivity and planar cell polarity in planarians. Proc Natl Acad Sci U S A, 2011. 73. Suo, Z., et al., GRK5 deficiency leads to early Alzheimer-like pathology and working memory impairment. Neurobiol Aging, 2007. 28(12): p. 1873-88. 74. Shin, E.Y., et al., Phosphorylation of RhoGDI1 by p21-activated kinase 2 mediates basic fibroblast growth factor-stimulated neurite outgrowth in PC12 cells. Biochem Biophys Res Commun, 2009. 379(2): p. 384-9. 75. Feil, R., et al., cGMP-dependent protein kinase I, the circadian clock, sleep and learning. Commun Integr Biol, 2009. 2(4): p. 298-301.

61

76. Yang, H., et al., Identification of three novel proteins (SGSM1, 2, 3) which modulate small G protein (RAP and RAB)-mediated signaling pathway. Genomics, 2007. 90(2): p. 249-60. 77. Hirai, H. and S. Matsuda, Interaction of the C-terminal domain of delta glutamate receptor with spectrin in the dendritic spines of cultured Purkinje cells. Neurosci Res, 1999. 34(4): p. 281-7. 78. Zhang, G.R., et al., Age-related expression of STUB1 in senescence-accelerated mice and its response to anti-Alzheimer's disease traditional Chinese medicine. Neurosci Lett, 2008. 438(3): p. 371-5. 79. Hong, C., et al., Epigenome scans and cancer genome sequencing converge on WNK2, a kinase- independent suppressor of cell growth. Proc Natl Acad Sci U S A, 2007. 104(26): p. 10974-9. 80. Moniz, S., et al., Protein kinase WNK2 inhibits cell proliferation by negatively modulating the activation of MEK1/ERK1/2. Oncogene, 2007. 26(41): p. 6071-81. 81. Lie, D.C., et al., Wnt signalling regulates adult hippocampal neurogenesis. Nature, 2005. 437(7063): p. 1370-5.

62

5. Diverse genetic variations in schizophrenia converge on functional gene networks

5.1. Introduction

The genetic causes for increased susceptibility to common complex diseases are likely to be numerous and varied. Schizophrenia is an example of this phenomenon; several genomic loci have been implicated by genome-wide association studies (GWAS) [1-4], a contribution from de novo and rare copy number variants (CNVs) has been established [5-7], and a significant contribution from de novo single nucleotide variants (SNVs) is demonstrated in a recent study based on exome sequencing in two populations [8]. An important challenge is to combine these diverse disease-related genetic variations into a single comprehensive analysis this common disorder.

Biological networks provide an intuitively appealing framework for integration of diverse genetic variations [9, 10], as mutations of all types must be connected through function in order to trigger a single phenotype. In this chapter, we employ a modified version of our previously described algorithm in the analysis of existing schizophrenia variants. The new approach is able to integrate data from multiple types of genetic variation: SNVs, CNVs, and GWAS-implicated loci. Genes from all variants are mapped to the phenotype network (Figure 5.1), a greedy search algorithm identifies highly connected gene clusters among putative disease genes, and the significance of the cluster as a whole is established using an appropriate randomization. Although we and others have previously developed methods to identify and analyze disease-related gene networks [11-15], to our knowledge NETBAG is the first principled approach for integration of diverse sources of genome-wide genetic variation under a unified framework. The major goals for this work are:

(1) As previously described, the method provides significance measures to a collection of disease

genes where evaluating the significance of some individual genes may be difficult due to their

rarity. In addition, because gene contribution to the cluster score is not uniform, we are able to

explore the relative significance of different types of variation.

63

(2) By linking genes through shared function, our method will be able to find biological functions

behind the disorder, demonstrating that different types of mutation converge on interrelated

molecular processes.

(3) Autism was once classified as “childhood schizophrenia” and our analysis suggests that the two

share many common biological functions. We examine how mutations in shared genes and

functions can lead to strikingly different phenotypes.

Figure 5.1: The NETBAG+ algorithm. different types of genetic variations are mapped to a phenotype network (pale grey) where every pair of genes is assigned a score proportional to the likelihood ratio that those genes share a genetic phenotype. Strongly inter-connected clusters (dark grey) are identified among putative disease genes. Clusters scores are based on the de-weighted sum of edges between all genes in the cluster; this score is proportional to the likelihood that all cluster genes share the same phenotype. Cluster significance is then established by an appropriate randomization (see Methods).

64

5.2. Results

5.2.1. Gene networks from schizophrenia-associated variations

We considered non-synonymous de novo SNVs from recent studies by Xu et al. [8] and Girard et al. [16], de novo CNVs from published genome-wide scans [7, 17-23] , and genomic regions implicated by genome-wide association studies[24-28]. In total, this set contained 1044 genes (159 from non- synonymous de novo SNVs, 712 from de novo CNVs, 173 from GWAS) from 213 genomic locations. In searching for cohesive gene cluster, the algorithm was allowed to pick any gene affected by a de novo

SNV, any gene within a de novo CNV (one gene per CNV), or any gene within a GWAS-implicated region

(one gene per region).

NETBAG identified a significant gene cluster (P-value < 0.001, Figure 5.2a) containing in total 47 genes (22 from SNVs, 20 from CNVs, 6 from GWAS regions), and divided into two weakly connected sub- clusters as marked in the figure. NETBAG was also run using different combinations of genetic variations

(Table 5.1), but the strongest significance was achieved when all types of genetic variations were considered together demonstrating that different sources of genetic variations appear to reinforce each other increasing the overall cluster significance. While all sources contributed to the final cluster, taken alone SNVs and CNVs formed marginally significant clusters while GWAS did not, perhaps reflecting their relatively lower impact on disease risk. Despite their lack of significance alone, genes implicated through

GWAS did improve the overall cluster significance, highlighting the fact that both common and rare variants may converge on shared functions. After masking the genes forming cluster I, i.e. removing these genes from the input data, NETBAG was able to identify another marginally significant cluster

(Figure 5.2c, cluster II, P-value = 0.071). In contrast to these results, no significant clusters were detected in various control sets, for example synonymous SNVs observed in the control population, or synonymous de novo SNVs observed in schizophrenia patients [8].

65

Notably, taken together cluster I and cluster II included the three out of four genes (LAMA2,

TRRAP, DPYD) with recurrent non-synonymous SNVs in the cohort analyzed by Xu et al. [8] (Fisher’s exact test, one-tailed, P-value = 0.05,), demonstrating that NETBAG is selecting likely causative genes. In addition, we performed a manual literature review of all 159 genes from Xu et al. containing a de novo

SNV. We include brief functional descriptions for these genes with a focus on information about known neuronal and/or brain functions (Table 5.3). We demonstrate from this review that our clusters are enriched in genes with known brain and neuronal functions when compared with genes not selected by the algorithm. Specifically, the identified clusters contain 26 genes with known brain and neural functions out of 56 in total (Fisher’s exact test P-value= 10-4, Barnard’s exact test P-value=2e-5).

Table 5.1: Clusters identified by the NETBAG algorithm Input Events Cluster size Cluster P-value SNVs + CNVs + GWAS 47 < 0.001 CNVs + GWAS 29 0.013 SNVs + CNVs 37 0.014 SNVs 16 0.056 CNVs 6 0.057 Masked SNVs + CNVs + GWAS 42 0.071 SNVs + GWAS 23 0.079 GWAS 10 0.463 Control SNVs 10 0.568 Control SNVs + CNVs 18 0.572 SNVs + CNVs + GWAS (100kb) 51 < 0.001 SNVs + CNVs + GWAS (450kb) 48 < 0.001

Table 5.1: The left column shows various combinations of input events, the middle column shows the number of genes in the best cluster, and the right column shows the best cluster P-value. The rows in the table are sorted by the P-values. “Masked” indicates an input set where the genes forming cluster I were removed. Control SNVs were taken from three sources: de novo mutations in unaffected individuals and de novo synonymous mutations in probands from the recent paper by Xu et al., and de novo mutations in unaffected siblings from two recent autism exome sequencing studies by O'Roak et al. and Sanders et al. In total, this combined set contained 161 genes. Control de novo CNVs were taken from unaffected siblings in a recent autism study by Levy et al., and contained 67 genes in 14 events.

66

Figure 5.2: Significant clusters identified using schizophrenia associated genes. (A) Cluster results from the combined set of schizophrenia-associated genetic variations: genes from de novo CNVs are in blue, genes from non-synonymous de novo SNVs are in green, and genes from GWAS-implicated regions in red. Edge width represents the strength of the likelihood score between two genes, and node size represents the gene’s contribution to the overall cluster score. For simplicity, only the top two edges from each gene are shown. Cluster I was the best cluster from the combined set of all schizophrenia genetic variations (P-value < 0.001). (B) The best cluster found when using only genes affected by non-synonymous de novo SNVs (P-value = 0.056). All genes from this cluster appear in the combined cluster, some with increased significance due to their connections with genes from other variants. (C) Cluster II was the best cluster from the combined set of all schizophrenia genetic variations when the genes forming cluster I were removed from the input data (P-value = 0.071).

67

5.2.2. Biological processes associated with schizophrenia clusters

To determine functional roles of genes forming the identified schizophrenia clusters we used two computational tools (FuncAssociate [29] and DAVID [30]) which identify over-represented Gene

Ontology (GO) terms within a given gene set. These analyses showed that the genes in cluster I participate in several important neurodevelopmental processes, such as axon guidance, neuron projection development, cell migration and locomotion (Select terms in Table 5.2, All terms in Tables

5.4). The GO analysis also implicated several cellular pathways, including signaling through essential second messengers: calcium, cAMP, and inositol trisphosphate. Separate analysis of genes forming sub- clusters Ia and Ib showed that the former was enriched for gene functions related to signaling and axon guidance, while the latter for functions related to neuron mobility and locomotion. The genes forming cluster II were enriched for functions related to chromosomal organization and chromosomal remodeling. Importantly, a similar GO-enrichment analysis of all genes affected by non-synonymous de novo SNVs or de novo CNVs did not identify any significantly enriched functional terms. Thus, the developed computational approach is essential for revealing cohesive functional networks hidden within the genomic loci affected in schizophrenia.

68

Table 5.2a. Selected Gene Ontology (GO) terms associated with cluster I by FuncAssociate N X P-value GO ID GO Term 16 326 <0.001 GO:0007411 axon guidance 11 335 <0.001 GO:0040012 regulation of locomotion 7 108 <0.001 GO:0000187 activation of MAPK activity 8 193 <0.001 GO:0001666 response to hypoxia 9 295 <0.001 GO:0030334 regulation of cell migration 9 333 <0.001 GO:0051960 regulation of nervous system development 8 289 0.001 GO:0019932 second-messenger-mediated signaling 6 132 0.001 GO:0008286 insulin receptor signaling pathway 8 307 0.001 GO:0050767 regulation of neurogenesis 7 227 0.001 GO:0071375 cellular response to peptide hormone stimulus 6 155 0.001 GO:0010975 regulation of neuron projection development 7 253 0.002 GO:0045664 regulation of neuron differentiation 3 16 0.015 GO:0035004 phosphatidylinositol 3-kinase activity 4 54 0.018 GO:0051896 regulation of protein kinase B signaling cascade 5 119 0.021 GO:0007204 elevation of cytosolic calcium ion concentration 4 58 0.024 GO:0007190 activation of adenylate cyclase activity 7 323 0.046 GO:0032870 cellular response to hormone stimulus 6 217 0.048 GO:0048011 nerve growth factor receptor signaling pathway

Table 5.2b. Selected Gene Ontology (GO) terms associated with cluster I by DAVID N X P-value GO ID GO Term 7 107 8.85E-05 GO:0007411 axon guidance 8 169 8.94E-05 GO:0030334 regulation of cell migration 9 256 1.09E-04 GO:0031175 neuron projection development 8 184 1.33E-04 GO:0000165 MAPKKK cascade 8 193 1.70E-04 GO:0007409 axonogenesis 9 339 6.14E-04 GO:0048666 neuron development 6 96 6.47E-04 GO:0009894 regulation of catabolic process 7 163 9.33E-04 GO:0030425 dendrite 9 342 0.001 GO:0043005 neuron projection 7 183 0.001 GO:0006874 cellular calcium ion homeostasis

Table 5.2: Selected Gene Ontology (GO) terms associated with schizophrenia clusters. GO annotation terms that were over-represented among genes in cluster I (Figure 5.2a) based on the analysis with FuncAssociate and DAVID. In the tables N is the number of cluster genes annotated with a given GO term, and X is the total number of human genes with that GO annotation. Repetitive and very broad GO terms, i.e. terms associated with more than 350 genes, are not listed in the tables. For a list of terms associated with each sub-cluster see Table 5.4. (A) The cluster- associated GO terms identified by FuncAssociate. P-values adjusted for multiple hypothesis testing using simulations. (B) The cluster-associated GO terms identified by DAVID. P-values adjusted for multiple hypothesis testing using Benjamini-Hochberg procedure.

69

5.2.3. Expression of genes from schizophrenia-associated clusters

Complementary to curated gene ontology terms, another important descriptor of biological function is temporal gene expression profile. To investigate brain-related gene expression we took advantage of the Human Brain Transcriptome (HBT) database [31]. This database contains expression levels from post-mortem samples across a range of ages from subjects who were considered neurologically healthy. Our goal is to examine not the functioning of a schizophrenic brain, but to see how genes affected by the disease might function in a healthy brain. In Figure 5.3a we show the median brain expression profiles for the genes forming the identified clusters across 15 developmental stages from embryonic to late adulthood. The level of brain expression for all genes forming the identified clusters (orange in Figure 5.3a) is notably higher than expression of all genes in the HBT database

(dashed grey) and all genes used as the input for NETBAG but not selected by the algorithm (dashed black). Moreover, the expression of the cluster genes is higher during the prenatal (left of the vertical dashed line) than the postnatal developmental stages. This result is in agreement with significantly higher enrichment for functional de novo mutations in genes biased towards prenatal expression observed in the recent study by Xu et al., and suggests that prenatal genetic insults are important for etiology of schizophrenia.

Intriguingly, genes forming sub-cluster Ia, sub-cluster Ib, and cluster II showed distinct expression profiles. Sub-cluster Ia contains many genes with broad brain-related functions that are essential across all developmental periods; the median gene expression in this sub-cluster is quite uniform across considered developmental stages, and only slightly higher levels during prenatal periods.

In contrast, genes forming cluster II are primarily responsible for chromosomal organization and remodeling; their expression is likely to be particularly important during periods of neuronal development and differentiation. Naturally, the median expression profile for the cluster II genes is much higher in prenatal compared to postnatal developmental stages. Although the genes forming sub-

70

cluster Ib also display higher prenatal expression, their median expression profile shows a prominent decrease between early fetal to late mid-fetal stages, approximately corresponding to the period between 10 and 20 post-conception weeks. Several genes (DOCK1, ITGA6, COL3A1, LAMA2, and THBS1) in this sub-cluster independently show U-like expression profiles (Figure 5.3b). This observation suggests that specific processes occurring early or late in corticogenesis may be affected in schizophrenia.

Figure 5.3: Temporal brain expression profiles for genes forming the identified clusters. Gene expression profiles in the brain across developmental stages. Gene expression data were obtained from the Human Brain Transcriptome database (hbatlas.org). Median expression levels for each gene were quantile normalized values (log2 transformed) across all samples at that time point. Error bars in the figure represent standard error on the mean across samples, and where applicable all genes, at a given developmental stage. (A) Temporal profiles of the median gene expression for the schizophrenia clusters (Figure 5.1). (B) Temporal expression profiles for individual genes forming sub-cluster Ib in which several genes independently exhibit U- like expression profiles, i.e. decreased expression in early or mid-fetal development.

5.2.4. Common functions between schizophrenia and autism

Our analysis of functions underlying the schizophrenia cluster revealed many GO terms in common with our recent analysis of autism associated CNVs. This observation raises an intriguing question: how can mutations in related and overlapping genes lead to different clinical phenotypes?

Although a detailed understanding of this question will certainly require extensive clinical and biological

71

research, we decided to gain an initial insight by focusing on a distinct phenotype previously considered by us and others: growth of dendrites and dendritic spines. The majority of the excitatory glutamatergic synapses in the human brain are formed on dendritic spines, and their structural aberrations have been implicated in several psychiatric and neurodegenerative disorders [32, 33]. Likely impact on the growth of dendrites or dendritic spines by a gene within a CNV can be investigated based on the corresponding dosage changes, i.e. either a deletion or duplication. Using this approach we previously noted that CNVs associated with autism should primarily lead to an increase in spine or dendritic growth [11].

Interestingly, a similar analysis in schizophrenia – based on mutant phenotypes available in published literature for CNV-associated cluster genes (Table 5.5) – reveals the opposite trend (Figure 5.4, Fisher’s exact test, two-tailed, P-value = 0.01, Barnard’s exact test, two-tailed, P-value = 0.007), i.e. a majority of schizophrenia-associated CNVs should lead to a decrease in growth of dendrites or spines. Notably, postmortem brain analyses observed an increase in spine density increase in autism [34] and a decrease in schizophrenia [35]. Of course, morphological changes in dendritic spines are crucial to understanding their clinical impact. For example, it is known that in Fragile X syndrome patients had a higher density of dendritic spines in certain regions of the brain, but these show an immature morphology suggesting fewer functioning connections [36]. Thus increases and/or decreases are clearly not the only factor relevant to distinct clinical phenotypes. Our analysis does demonstrate, however, that different types of mutation in a single gene can lead, at least on average, to different functional consequences and distinct psychiatric disorders.

72

Figure 5.4: Likely impact of genes from de novo CNVs in autism and schizophrenia on growth of dendrites or dendritic spines. Using the dosage changes (deletion or duplication) for CNV-associated genes in the schizophrenia and autism clusters, we explored available literature for phenotypes related to growth changes of dendrites or dendritic spines. This analysis showed that while de novo CNVs in autism primarily lead to an increase in growth of dendrites or dendritic spines, de novo CNVs in schizophrenia lead to the opposite effect. The difference in the phenotypic impact for the two disorders was significant (Fisher’s exact test, two-tailed, P = 0.01; Barnard’s exact test, two-tailed, P = 0.007). Genes that were considered in the analysis, their corresponding CNVs and predicted functional impact are provided in Table 5.5.

73

5.3. Discussion

It is interesting to consider the implicated genes not only as a cluster in Figure 5.2, but also in the context of relevant signaling pathways. In figure 5.5, the genes forming the significant clusters identified by NETBAG are shown in yellow, other relevant genes that were present in the input data but not selected by the algorithm in cyan, and genes and signaling molecules previously implicated as playing a role in schizophrenia are circled in orange. Individual components of the presented network are active in diverse developmental and functional contexts, such as cell motility, axonal guidance, and synaptogenesis. Several conceptual signaling levels can be delineated in the network. The first layer is formed primarily by a diverse array of receptors and channels, ranging from receptors involved in axonal guidance (such as Ephrin and DCC) to ionotropic and metabotropic neurotransmitter receptors (such as

CHRNA7 and HTR7). The second signaling layer is formed by cellular kinases, phosphatases, and GTPases that are, either directly or indirectly, regulated by the first signaling layer. The third layer consists of regulatory (such as CREB) or structural proteins (such as Cofilin) involved in neurite outgrowth, synaptogenesis, and synaptic plasticity. In addition to the aforementioned horizontal layers, several well-defined top-down pathways that were previously discussed in connection with schizophrenia and other brain-related diseases can be recognized [37, 38]. These include the reelin, WNT, and insulin signaling pathways; pathways involving Akt/PI3K, MAPK, and mTOR signaling; as well as the PKC and PKA pathways. Considering the remarkable diversity of the implicated molecular circuits, it is likely that many hundreds of genes (>1000, according to the estimate the recent study by Xu et al.) may ultimately contribute to the schizophrenia etiology.

Although genetic variations considered in this chapter differ in their type and origin, in combination they perturb a complex but interrelated set of molecular processes. This functional convergence allows the presented integrative approach to identify a cohesive functional cluster. A similar convergence – resulting from common biological mechanisms underlying phenotypes – should

74

occur in many other human disorders. If this is indeed the case, it is likely that genetic data collected using unbiased whole-genome approaches and analyzed by proper computational methods will reveal the underlying molecular networks for a significant fraction of common human maladies, thus realizing an important goal of the human genome project.

Figure 5.5: Genes forming the NETBAG clusters in the context of cellular signaling pathways. Proteins encoded by cluster genes (Figure 5.2) are shown in yellow, and those corresponding to other relevant genes that were present in the input data but not selected by the NETBAG algorithm are shown in cyan. Proteins and signaling molecules that were previously implicated in schizophrenia are circled in red. ER, endoplasmic reticulum; IP3, inositol-1,4,5- trisphosphate; PIP3, phosphatidylinositol-1,4,5-trisphosphate.

75

5.4. Methods

Schizophrenia-associated genetic variation

We used three types of genetic variation: 159 non-synonymous de novo SNVs from two recent studies by Xu et al. and Girard et al., de novo CNVs from several previous studies, and 14 genomic regions that were implicated by SNPs (P-value < 5e-8) in recent genome-wide association studies. We considered all genes affected by non-synonymous de novo SNVs, all genes that overlap the de novo

CNVs events according to the human genome NCBI build 36, and – following previous studies – all genes overlapping a region 250 kb in either direction from SNPs implicated by GWAS; similar results were obtained using calculations with distances of 100 kb and 450 kb from GWAS-implicates SNPs (Table 5.1).

In total, our set contained 1044 genes from 213 genomic regions: 159 from SNVs, 712 from CNVs, and

173 from genomic loci implicated by GWAS.

NETBAG Algorithm

Genes affected by the considered genetic variations were mapped to the phenotype network where every gene pair is connected by a score representing the likelihood that those genes contribute to a shared phenotype. NETBAG employs a greedy algorithm previously described to find the highest scoring cluster at each size. Cluster score is the sum of contributions from each gene, and each gene contributes based on a de-weighted sum of the edges between it and other cluster genes. De-weighting, in which the sum is the strongest edge plus half of the second strongest edge, plus one third of the third strongest edge and so on, reflects our belief that because all cluster genes share common functions their edges are not wholly independent. Cluster significance was determined based on a distribution of cluster scores obtained by applying the same greedy search algorithm to randomized data. The cluster size with the strongest significance is selected and the P-value is adjusted for multiple hypothesis correction as we have examined clusters of multiple sizes.

76

To generate random datasets, we selected genes with average connection strengths in the phenotype network comparable to the disease-associated input genes. This ensures that it is not high overall connectivity of potential disease genes that drives cluster significance. The average connection strength was calculated by considering the weights of the 20 strongest edges from a particular gene to all other network genes.

For a disease cluster of a given size, we assigned a size-specific P-value based on randomized clusters of the same size. To correct for multiple hypothesis testing (due to considering clusters at multiple sizes), we took the best P-value from each random trial regardless of cluster size and used this distribution to assign a P-value to the size-specific P-value. Throughout the paper we used this corrected

P-value to characterize cluster significances. We ignored clusters with five genes or less to ensure that our analysis is not overly influenced by very small gene clusters with strong connections.

In this analysis we found that many clusters shared the same, highly significant P-values. We therefore defined a z-score for each cluster based on scores from clusters of the same size found using random events and selected the final cluster based on a combination of P-value and z-score.

Cluster functional analysis

To establish specific biological functions associated with the schizophrenia clusters, we used two computational tools, FuncAssociate and DAVID, to find all over-represented GO terms. For clarity, we show only the terms with fewer than 350 associated human genes. In the tables we report only multiple hypothesis corrected P-values.

Brain expression profiles

To examine the expression of cluster genes across developmental stages, we used expression dataset from post-mortem analysis of healthy human brains obtained from the Human Brain

77

Transcriptome database (http://hbatlas.org/). Median expression levels for each gene were quantile normalized values and log2-transformed across all samples. The database contained samples across a range of developmental stages including: embryonic, early/mid/late fetal, neonatal/early/late infancy, early/middle/late childhood, adolescence, and young/middle/late adulthood. This dataset allowed us to analyze what functions in a healthy human brain might be compromised by these mutations.

Likely impact of CNV events on dendrites and dendritic spines

To assess the impact of cluster genes associated with de novo CNVs on the growth of dendrites and dendritic spines, we performed an analysis of available literature. CNV polarity, i.e. deletion or duplication, allowed us to determine a corresponding change in gene dosage. CNV-associated genes were taken from either the schizophrenia clusters or the autism cluster identified in our previous work.

For the two genes with both duplication and deletion events (CRKL and PIAS3), we used the frequency of

CNVs reported in Malhotra et al. [5] to determine the predominant polarity associated with each disease. The information about CNV-associated genes, polarities, and phenotypes reported in the literature is provided in Table 5.5.

78

Table 5.3: Literature review of SNV genes Gene Cluster Neural Function Membrane transport molecule. Linked to cholesterol transport and ABCG4 Yes accumulation in brain. (18039927) Protein family plays a role in intracellular lipid metabolism. (11755680, ACOT6 No 21822266) Protein encoded is the major procollagen II N-propeptidase. Family members ADAMTS3 No linked to diabetes and defects in connective tissue. (21822266) Membrane bound catalyst inhibited by calcium, found in neurons, and part of ADCY7 I Yes cAMP pathway. (17135423, 21822266, 20299190) AHNAK2 No Protein class associated with calcium channels in cardiomyocytes and other cells. Acts as a guanine nucleotide exchange factor (GEF) for Rab5 GTPase. Involved in endosome dynamics, potentially relevant to cell migration during development. ALS2CL No (16806877) ANO9 No May act as a calcium-activated chloride channel. AP-2 seems to play a role in the recycling of synaptic vesicle membranes from AP2A2 Yes the presynaptic surface. May play a role in developmental myelination of peripheral nerves. (17893707, ARHGEF10 Yes 19204725) ARHGEF38 No May act as a guanine-nucleotide releasing factor. Transcriptional corepressor. May specifically inhibit gene expression when recruited to promoter regions by sequence specific DNA-binding proteins such BCORL1 No as BCL6. This protein inhibits apoptosis by facilitating the degradation of apoptotic BIRC6 No proteins by ubiquitination. Positively regulates the transcription of RUNX1 and RUNX2. Potentially required BRPF1 II Yes for maintenance of cranial Hox gene expression. (18469222) C16orf62 No Little information Component of chromatin complexes such as the MLL1/MLL and NURF C17orf49 No complexes. C1orf185 No Little information C2cd3 is required for cilia formation and is a recessive lethal mouse mutant with C2CD3 Yes multiple defects, including neural tube defects. (19004860) May play a role in the consolidation/retention of hippocampus-dependent long- CAMK4 I Yes term memory. (21963396) Calcium-binding protein that interacts with newly synthesized glycoproteins in CANX Yes the endoplasmic reticulum. CARD6 No Little information. Involved in the signal transduction pathways of apoptosis, necrosis and CASP4 No inflammation. CCDC108 No Little information. (21822266) CCDC137 No Little information. CCDC39 is required for assembly of inner dynein arms and the dynein regulatory CCDC39 No complex and for normal ciliary motility in (21131972) CCDC84 No Little information. CEACAM18 No Little information. Main component of the nucleosome remodeling and deacetylase complex and CHD4 II Yes plays an important role in epigenetic transcriptional repression. Plays an important role in the regulation of cytokinesis and the development of CIT I Yes the central nervous system. (20084519) Required for the development and/or maintenance of the proper glomerular CLIC5 No endothelial cell and podocyte architecture. Collagen type III occurs in most soft connective tissues along with type I COL3A1 I Yes collagen. Mutations associated with neuronal migration in mice. (21768377)

79

Gene Cluster Neural Function Conserved in multicellular eukaryotic organisms, functions in the nucleus by COMMD9 No affecting the association of NF-kappaB with chromatin. May inhibit neurite outgrowth and growth cone collapse during axon CSPG4 Yes regeneration. CYTH1 No Promotes the activation of ARF through replacement of GDP with GTP. DAB2IP Yes Potential tumor suppressor gene. (12877983) May be involved in the maintenance of the endoplasmic reticulum and/or Golgi DDHD2 Yes structures. (22037555) May be involved in ribosome assembly. Family members believed to be involved DDX10 II No in embryogenesis, spermatogenesis, and cellular growth and division. Encodes a novel putative adhesion receptor protein, which could play a role in neural crest cells migration, a process which has been proposed to be altered in DGCR2 Yes DiGeorge syndrome (16783572) Facilitates nuclear export of spliced mRNA by releasing the RNA from the DHX8 II No spliceosome. Involved in cytoskeletal rearrangements required for phagocytosis of apoptotic DOCK1 I Yes cells and cell motility. Dock3 stimulates axon outgrowth (22219288) Involved in pyrimidine base degradation. Catalyzes the reduction of uracil and DPYD II Yes thymine. Implicated in autism. (19571472) Plays a role in spinal chord development, guiding commissural axons projection DSCAM Yes and path finding across the ventral midline. EDEM2 No Degrades mis-folded proteins in a process known as ER-associated degradation EIF5 II No Catalyzes the hydrolysis of GTP bound to the 40S ribosomal initiation complex. Receptor probably involved in myeloid interactions during immune and EMR3 No inflammatory responses. ESAM No Little information. EVC2 No Plays a critical role in bone formation and skeletal development. FAM13C No Little information. FAM3D No Little information. FASTKD5 No Kinase protein involved in apoptosis. Component of a complex which mediates the ubiquitination and subsequent proteasomal degradation of target proteins. Mutations associated with FBXO7 II Yes Parkinson's disease. (21611747) May be involved in the maintenance of the mucosal structure as a gel-like FCGBP No component of the mucosa. Glycoprotein secreted by parietal cells of the gastric mucosa and is required for GIF Yes adequate absorption of vitamin B12. Guanine nucleotide-binding proteins (G proteins) are involved as modulators or GNAO1 Yes transducers in various transmembrane signaling systems. (14741323) GPR115 No Little information. GPR153 No Little information. GPRIN3 Yes May be involved in neurite outgrowth (by similarity). Variant histone H2A which replaces conventional H2A in a subset of nucleosomes. May be involved in the formation of constitutive heterochromatin, H2AFV No or required for chromosome segregation during cell division. Probable E3 ubiquitin-protein ligase. May be required for development of the HECTD1 Yes head mesenchyme and neural tube closure. (17442300) Encodes a member of the histone H1 family. Found in the large histone gene HIST1H1E No cluster on . HLA-C No Major histocompatibility complex. HLA-DQA1 No Major histocompatibility complex. HMGCR No Rate-limiting enzyme for cholesterol synthesis.

80

Gene Cluster Neural Function Serotonin receptor belonging to the superfamily of G protein-coupled receptors. A candidate locus for involvement in autistic disorder and other neuropsychiatric HTR7 I Yes disorders. (16192982, 10490701) IFT140 No Little information. IFT81 No Little information. Family of signaling molecules that play critical roles in cellular energy IGFL2 No metabolism and in growth and development, especially prenatal growth. Mobilizes intracellular calcium and acts as a second messenger mediating cell INPP5A Yes responses to various stimulation brain. (8013665) Subunit of the Integrator complex, which mediates 3-prime end processing of INTS10 No small nuclear RNAs. Phosphorylated by insulin receptor tyrosine kinase. Mutations in this gene are IRS1 I Yes associated with type II diabetes and susceptibility to insulin resistance. Integrins are known to participate in cell adhesion as well as cell-surface mediated signaling. a receptor forlaminin in epithelial cells and it plays a critical ITGA6 I Yes structural role in the hemidesmosome. F-box proteins function in phosphorylation-dependent ubiquitination, deficiency KDM2B Yes leads to neural progenitor death. (21220025) Participates in transcriptional repression of neuronal genes by recruiting histone KDM5C II Yes deacetylases and REST at neuron-restrictive silencer elements. May be involved in superoxide dismutase activity and in neuroprotective effect KIAA0467 Yes of peroxisomes, May enhance epileptogenesis and confer low threshold. KIAA1109 No Membrane gene, involved in rheumatoid arthritis. Developmentally-regulated transcription factor and important regulator of gene KLF12 II No expression during vertebrate development and carcinogenesis. KPNA1 II No Functions in nuclear protein import. Co-expressed during differentiation of simple and stratified epithelial tissues, specifically expressed in differentiated layers of the mucosal and esophageal KRT4 No epithelia. Mediates attachment, migration and organization of cells into tissues during LAMA1 I Yes embryonic development. (21743468, 18369103) Mediates attachment, migration and organization of cells into tissues during LAMA2 I Yes embryonic development. LARP7 No Negative transcriptional regulator of polymerase II genes. Integral to plasma membrane and has both phlorizin hydrolase activity and LCT No lactase activity. Required for early embryonic development. Involved in cellular lipid LRP1 I Yes homeostasis. Linked with Amyloid protein and Alzheimer's disease. (17920016) Forms bridges between different cytoskeletal elements through specialized MACF1 No modular domains. MAGEC1 No Little information. Sterol-regulated subtilisin-like serine protease involved in regulation of lipid MBTPS1 No metabolism in cells. Central regulator of cellular metabolism, growth and survival in response to MTOR I Yes hormones, growth factors, nutrients, energy and stress signals. Mucins are high molecular mass, highly glycosylated macromolecules that are MUC5B No the major components of mucus secretions Large family of motor proteins. Appears to play a role in cytokinesis, cell shape, MYH10 I Yes and specialized functions such as secretion and capping. Plays a role in homeostatic control of innate immunity and in antiviral defense NLRC5 No mechanisms. May act as a tumor suppressor. Suppresses cell growth and enhanced sensitivity NPRL2 No to various anticancer drug. NRIP1 II No Modulates transcriptional repression by nuclear hormone receptors.

81

Gene Cluster Neural Function Component of the nuclear pore complex, a complex required for the trafficking NUP54 II No across the nuclear membrane. Olfactory receptors interact with odorant molecules in the nose, to initiate a OR4C46 No neuronal response that triggers the perception of a smell. Receptor responsive to both adenosine and uridine nucleotides; may participate P2RY2 I No in control of the cell cycle of endometrial carcinoma cells. Inhibits immunological synapse formation by preventing dynamic arrangement PAG1 Yes of lipid raft proteins. (15310269) Cleaves insulin-like growth factor-binding protein 5; thought to be a local PAPPA2 No regulator of insulin-like growth factor. PHF23 No Little information. Participates in signaling pathway coordinating a diverse range of cell functions including proliferation, cell survival, degranulation, vesicular trafficking and cell PIK3CB I Yes migration. Associated with autism. (19246517) Regulates RHOA activity, and plays a role in cytoskeleton remodeling. Necessary PITPNM1 No for normal completion of cytokinesis. Member of phospholipase family ubiquitously expressed with diverse biological functions including inflammation, cell growth, signaling and death and PLA2G12B No maintenance of membrane phospholipids. PLCL2 No Little information Regulator involved in stress response, cell cycle progression, mitosis, cytokinesis, and the DNA damage response. Expression in rat brain associated with synaptic PLK3 I Yes plasticity. (15640841) Regulates a large number of cellular processes. Required for normal immunity to microbial infections. Required for normal development of the brain cortex PML I Yes during embryogenesis. Creates a molecular switch for regulating the phosphorylation status of PPP1CA PPP1R14D No substrates and smooth muscle contraction. May play a role in the regulation of phospholipid turnover as well as in protection against oxidative injury. Differentially expressed in schizophrenic PRDX6 Yes brains. (19405953) Calcium-activated, phospholipid involved in various cellular processes including regulation of gene expression and control of cell division and differentiation. PRKCB I Yes Linked to autism (18317465) Pregnancy-specific glycoproteins (PSGs), released into the maternal circulation PSG2 No during pregnancy. May play a key role in signal transduction and growth control. Component of PTPRM I Yes developmental pathways important in schizophrenia. (18369103) RARG No Receptor for retinoic acid required for limb bud development. Activates the Erk/MAP kinase cascade and regulates T-cells and B-cells development, homeostasis and differentiation. Enriched in brain and potentially RASGRP1 Yes linked to BDNF signaling. (17453053) RB1CC1 No Tumor suppressor which also enhances retinoblastoma 1 gene expression. Down-regulation by oncogenic signals may facilitate tumor invasion and RECK No metastasis. Required for the biogenesis of motile cilia by governing growth and beating efficiency of motile cells. Implicated in developmental conditions and RFX3 Yes schizophrenia. (22479201, 21792059) Mediates GPR signaling throughout central nervous system. (14992813, RGS12 Yes 21822266) Plays a role in regulation of ribosome production in response to cellular stress RRP1B No and/or changes in growth conditions. Induces cell death. May act as a transcriptional co-repressor of a gene related to SAP30BP No cell survival.

82

Gene Cluster Neural Function Expressed in development of brain and central nervous system for zebrafish. SBNO1 Yes (20503374) SDF4 No Involved in UV and radiation response. May be involved in actively transporting phosphate into cells via Na(+) co- SLC17A1 No transport in the renal brush border membrane. High-affinity transporter for the intake of thiamine. Mutations in this gene cause SLC19A2 No thiamin-responsive megaloblastic anemia syndrome (TRMA). Sulfate transporter. May play a role in spermatogenesis and fetal development. SLC26A8 No (21419855) SLC4A8 Yes Plays a major role in pH regulation in neurons. May play a role in clathrin-dependent retrograde transport from early SMAP2 No endosomes to the trans-Golgi network Plays a role in the maintenance of X-inactivation and the hypermethylation of SMCHD1 No CpG islands associated with inactive X. SPATA22 No Little information. Associated with spermatogenesis. SPATA5 No Little information. Associated with spermatogenesis. May be required for B-cell receptor (BCR) signaling, which is necessary for SPIB II No normal B-cell development and antigenic stimulation. May be involved in transcription regulation of the alpha 2(I) collagen gene SSBP3 No (COL1A2) associated with disease of connective tissue and bones. STAC2 No Little information. Component of cohesin complex, a complex required for the cohesion of sister STAG1 No chromatids after DNA replication. May play a regulatory role in the acute-phase response in systemic inflammation STAP2 Yes and may modulate STAT3 activity. Enriched in DMH neurons. (20380814) Major constituent of the post-synaptic density. Mutations associated with SYNGAP1 Yes mental retardation. (19196676) Type-VI intermediate filament (IF) which plays an important role within the SYNM No muscle cell cytoskeleton. A spermatogensis-specific component of the DNA-binding general transcription TAF7L II No factorcomplex TFIID. TBC1D14 No May act as a GTPase-activating protein for Rab family proteins. TEKT5 No Forms filamentous polymers in the walls of ciliary and flagellar microtubules. Component of the complex responsible for telomerase activity. Potentially TEP1 II Yes relevant to neuron development. (11718848) Catalyzes the cross-linking of proteins and the conjugation of polyamines to TGM6 No proteins (by similarity). Adhesive glycoprotein that mediates cell-to-cell and cell-to-matrix interactions. Potentially linked to the Reelin pathway and neuronal migration and THBS1 I Yes schizophrenia. (18946489, 22311024) DNA , an enzyme that controls and alters the topologic states of TOP3B No DNA during transcription. Involved in the regulation of trafficking of EGF-EGFR complexes and GABA-A receptors important in developmental of hippocampal neurons and linked to TRAK1 II Yes schizophrenia. (21454691, 18382465) Membrane protein that may be involved in chronic inflammation by triggering the production of constitutive inflammatory cytokines. Potentially important in TREM2 Yes microglia function. (15686960) Adapter protein, which is found in various multi-protein chromatin complexes with histone acetyltransferase activity (HAT), which gives a specific tag for TRRAP I No epigenetic transcription activation. A membrane-bound transcription regulator that translocates to the nucleus in TUB No response to phosphoinositide hydrolysis.

83

Gene Cluster Neural Function Ubiquilin thought to functionally link the ubiquitination machinery to the proteasome to affect in vivo protein degradation. Family members associated UBQLN1 Yes with ALS and dementia. (21857683) Recognizes and binds to proteins bearing specific amino-terminal residues UBR5 I No leading to their ubiquitination and subsequent degradation. Important in the conjugation and subsequent elimination of potentially toxic UGT1A3 No xenobiotics and endogenous compounds. Probably plays a role in vesicle maturation during exocytosis. Probably is UNC13C Yes involved in neurotransmitter release. URB2 No Little information. VN1R4 No Little information. Component of the retromer complex, regulates transcytosis of the polymeric immunoglobulin receptor. Might regulate levels of the Aβ peptide in Alzheimer VPS35 Yes and schizophrenia patients. (18039130) May act as an adaptor protein that modulates the transforming growth factor- beta response by coupling the transforming growth factor-beta receptor VPS39 No complex to the Smad pathway. Members of this family are involved in a variety of cellular processes, including WDR11 No cell cycle progression, signal transduction, apoptosis, and gene regulation. XPR1 Potential receptor for xenotropic andpolytropic murine leukemia retroviruses. ZBTB40 No Little information. ZNF229 II No Little information. Positive regulator in MAPK-mediated signaling pathways that lead to the ZNF480 II No activation of AP-1 and SRE important for human heart development. ZNF530 No Little information. ZNF565 II No Little information. Table 5.3: We performed a manual literature review of information for the 159 genes with de novo SNVs from the recent study by Xu et al. In the table we indicate whether each gene is a member of the cluster I or II (Figure 5.2), and whether the gene has a known neural or brain-related function. Functional descriptions in the table were primarily taken from GeneBank and NCBI. We also included Pubmed identifiers for key references.

84

Table 5.4: All Gene Ontology (GO) terms associated with schizophrenia clusters Cluster Ia terms from FuncAssociate N X P-value GO ID GO Term 12 326 <0.001 GO:0007411 axon guidance 10 324 <0.001 GO:0033674 positive regulation of kinase activity 10 334 <0.001 GO:0051347 positive regulation of transferase activity 8 193 <0.001 GO:0006874 cellular calcium ion homeostasis 8 198 <0.001 GO:0072503 cellular divalent inorganic cation homeostasis 8 202 <0.001 GO:0055074 calcium ion homeostasis 8 208 <0.001 GO:0072507 divalent inorganic cation homeostasis 9 310 <0.001 GO:0045860 positive regulation of protein kinase activity 8 265 0.001 GO:0006875 cellular metal ion homeostasis 6 108 0.001 GO:0000187 activation of MAPK activity 8 277 0.001 GO:0055065 metal ion homeostasis 7 193 0.001 GO:0001666 response to hypoxia 8 289 0.001 GO:0019932 second-messenger-mediated signaling 8 293 0.001 GO:0030003 cellular cation homeostasis 7 209 0.001 GO:0070482 response to oxygen levels 6 132 0.001 GO:0008286 insulin receptor signaling pathway 8 333 0.001 GO:0051960 regulation of nervous system development 7 227 0.001 GO:0071375 cellular response to peptide hormone stimulus 8 337 0.001 GO:0055080 cation homeostasis 6 147 0.001 GO:0043406 positive regulation of MAP kinase activity 8 349 0.001 GO:0030030 cell projection organization 6 151 0.001 GO:0010638 positive regulation of organelle organization 6 151 0.001 GO:0045121 membrane raft 3 10 0.001 GO:0046934 phosphatidylinositol-4,5-bisphosphate 3-kinase activity 5 89 0.001 GO:0043900 regulation of multi-organism process 7 264 0.001 GO:0019901 protein kinase binding 6 174 0.001 GO:0032869 cellular response to insulin stimulus positive regulation of protein serine/threonine kinase 6 178 0.001 GO:0071902 activity 5 101 0.001 GO:0031346 positive regulation of cell projection organization 6 188 0.002 GO:0031344 regulation of cell projection organization 2 2 0.008 GO:0005899 insulin receptor complex 7 307 0.008 GO:0019900 kinase binding 7 307 0.008 GO:0050767 regulation of neurogenesis 6 200 0.008 GO:0043405 regulation of MAP kinase activity 3 16 0.008 GO:0035004 phosphatidylinositol 3-kinase activity 4 53 0.008 GO:0045995 regulation of embryonic development 3 17 0.011 GO:0034504 protein localization to nucleus 4 55 0.011 GO:0009895 negative regulation of catabolic process 5 119 0.011 GO:0007204 elevation of cytosolic calcium ion concentration 7 323 0.011 GO:0032870 cellular response to hormone stimulus 4 58 0.011 GO:0007190 activation of adenylate cyclase activity 6 217 0.011 GO:0048011 nerve growth factor receptor signaling pathway 5 126 0.011 GO:0046777 protein autophosphorylation 4 59 0.011 GO:0045762 positive regulation of adenylate cyclase activity 7 335 0.011 GO:0040012 regulation of locomotion 6 220 0.011 GO:0042327 positive regulation of phosphorylation

85

Cluster Ia terms from FuncAssociate N X P-value GO ID GO Term 4 60 0.011 GO:0031281 positive regulation of cyclase activity 7 339 0.011 GO:0043434 response to peptide hormone stimulus 6 225 0.012 GO:0032868 response to insulin stimulus 4 62 0.012 GO:0051349 positive regulation of lyase activity 7 345 0.012 GO:0031401 positive regulation of protein modification process 6 228 0.012 GO:0010562 positive regulation of phosphorus metabolic process 6 228 0.012 GO:0045937 positive regulation of phosphate metabolic process 5 134 0.012 GO:0051480 cytosolic calcium ion homeostasis 4 65 0.012 GO:0007163 establishment or maintenance of cell polarity 5 140 0.027 GO:0032386 regulation of intracellular transport 3 22 0.028 GO:0005979 regulation of glycogen biosynthetic process 3 22 0.028 GO:0010962 regulation of glucan biosynthetic process 3 22 0.028 GO:0032885 regulation of polysaccharide biosynthetic process 6 241 0.028 GO:0032269 negative regulation of cellular protein metabolic process 5 145 0.029 GO:0032147 activation of protein kinase activity 6 253 0.031 GO:0045664 regulation of neuron differentiation 3 25 0.033 GO:0032881 regulation of polysaccharide metabolic process 3 25 0.033 GO:0070873 regulation of glycogen metabolic process 5 155 0.033 GO:0010975 regulation of neuron projection development 4 76 0.033 GO:0008543 fibroblast growth factor receptor signaling pathway

Cluster Ib terms from FuncAssociate N X P-value GO ID GO Term 3 61 0.001 GO:0007229 integrin-mediated signaling pathway 2 7 0.002 GO:0005606 laminin-1 complex 4 326 0.002 GO:0007411 axon guidance 4 332 0.002 GO:0051270 regulation of cellular component movement 4 335 0.002 GO:0040012 regulation of locomotion 2 9 0.002 GO:0006911 phagocytosis, engulfment 2 9 0.002 GO:0043256 laminin complex 2 14 0.01 GO:0043394 proteoglycan binding 3 169 0.026 GO:0044420 extracellular matrix part 2 27 0.039 GO:0018149 peptide cross-linking 3 197 0.039 GO:0030155 regulation of cell adhesion

Cluster II terms from FuncAssociate N X Padj-value GO ID GO Term 7 137 <0.001 GO:0004386 helicase activity 6 101 <0.001 GO:0008026 ATP-dependent helicase activity 6 101 <0.001 GO:0070035 purine NTP-dependent helicase activity 7 252 0.002 GO:0042623 ATPase activity, coupled 7 309 0.025 GO:0016887 ATPase activity

86

Cluster Ia terms from DAVID N X P-value GO ID GO Term 10 345 4.90E-05 GO:0045859 regulation of protein kinase activity 6 61 7.54E-05 GO:0031329 regulation of cellular catabolic process 8 209 1.56E-04 GO:0048667 cell morphogenesis involved in neuron differentiation 8 213 1.65E-04 GO:0048812 neuron projection morphogenesis 8 231 2.67E-04 GO:0033674 positive regulation of kinase activity 8 240 3.24E-04 GO:0051347 positive regulation of transferase activity 8 245 3.34E-04 GO:0048858 cell projection morphogenesis 8 244 3.42E-04 GO:0000904 cell morphogenesis involved in differentiation 8 256 4.25E-04 GO:0031175 neuron projection development 8 256 4.25E-04 GO:0032990 cell part morphogenesis 6 96 4.26E-04 GO:0009894 regulation of catabolic process 7 163 5.06E-04 GO:0030425 dendrite 7 184 5.90E-04 GO:0000165 MAPKKK cascade 6 107 6.12E-04 GO:0007411 axon guidance 7 183 6.18E-04 GO:0006874 cellular calcium ion homeostasis 7 188 6.43E-04 GO:0055074 calcium ion homeostasis 7 193 7.20E-04 GO:0007409 axonogenesis 9 342 7.30E-04 GO:0043005 neuron projection 7 196 7.60E-04 GO:0006875 cellular metal ion homeostasis 7 205 9.48E-04 GO:0055065 metal ion homeostasis 6 143 0.001 GO:0045121 membrane raft 7 223 0.001 GO:0045860 positive regulation of protein kinase activity 7 227 0.002 GO:0030005 cellular di-, tri-valent inorganic cation homeostasis 8 339 0.002 GO:0048666 neuron development 7 239 0.002 GO:0055066 di-, tri-valent inorganic cation homeostasis 7 254 0.003 GO:0030003 cellular cation homeostasis 5 82 0.003 GO:0000187 activation of MAPK activity 7 286 0.004 GO:0055080 cation homeostasis 4 37 0.005 GO:0009895 negative regulation of catabolic process 7 295 0.005 GO:0009967 positive regulation of signal transduction 5 102 0.005 GO:0043406 positive regulation of MAP kinase activity 7 329 0.008 GO:0010647 positive regulation of cell communication 4 55 0.012 GO:0007190 activation of adenylate cyclase activity 4 56 0.012 GO:0045762 positive regulation of adenylate cyclase activity 4 57 0.013 GO:0031281 positive regulation of cyclase activity 5 134 0.013 GO:0001666 response to hypoxia 4 59 0.013 GO:0051349 positive regulation of lyase activity 5 141 0.014 GO:0070482 response to oxygen levels 5 141 0.014 GO:0043405 regulation of MAP kinase activity 3 14 0.014 GO:0032885 regulation of polysaccharide biosynthetic process 3 14 0.014 GO:0010962 regulation of glucan biosynthetic process 3 14 0.014 GO:0005979 regulation of glycogen biosynthetic process 3 15 0.016 GO:0032881 regulation of polysaccharide metabolic process 5 147 0.017 GO:0019901 protein kinase binding 5 168 0.021 GO:0043025 cell soma 6 276 0.021 GO:0016477 cell migration 5 169 0.026 GO:0030334 regulation of cell migration 6 295 0.027 GO:0031399 regulation of protein modification process

87

Cluster Ia terms from DAVID N X P-value GO ID GO Term 3 20 0.027 GO:0031330 negative regulation of cellular catabolic process 3 21 0.029 GO:0043255 regulation of carbohydrate biosynthetic process 6 307 0.030 GO:0048870 cell motility 6 307 0.030 GO:0051674 localization of cell 4 83 0.030 GO:0010638 positive regulation of organelle organization 5 181 0.030 GO:0051130 positive regulation of cellular component organization 5 180 0.030 GO:0032269 negative regulation of cellular protein metabolic process 5 179 0.030 GO:0019900 kinase binding 4 85 0.030 GO:0046777 protein amino acid autophosphorylation 5 187 0.032 GO:0031401 positive regulation of protein modification process 5 187 0.032 GO:0051248 negative regulation of protein metabolic process 5 192 0.035 GO:0040012 regulation of locomotion 5 193 0.035 GO:0051270 regulation of cell motion 3 26 0.039 GO:0050994 regulation of lipid catabolic process 4 96 0.040 GO:0045761 regulation of adenylate cyclase activity 4 99 0.043 GO:0031279 regulation of cyclase activity 4 100 0.043 GO:0048585 negative regulation of response to stimulus 3 28 0.043 GO:0046320 regulation of fatty acid oxidation 6 342 0.043 GO:0007167 enzyme linked receptor protein signaling pathway 4 101 0.044 GO:0030817 regulation of cAMP biosynthetic process 4 101 0.044 GO:0051339 regulation of lyase activity 4 103 0.045 GO:0030814 regulation of cAMP metabolic process 5 215 0.046 GO:0007507 heart development regulation of generation of precursor metabolites and 3 30 0.046 GO:0043467 energy 4 107 0.049 GO:0043010 camera-type eye development

Cluster Ib terms from DAVID N X P-value GO ID GO Term 4 345 0.004 GO:0031012 extracellular matrix 4 137 0.006 GO:0030155 regulation of cell adhesion 3 117 0.012 GO:0044420 extracellular matrix part 3 59 0.012 GO:0005178 integrin binding 2 7 0.024 GO:0005606 laminin-1 complex 2 9 0.025 GO:0043256 laminin complex 3 70 0.029 GO:0007229 integrin-mediated signaling pathway 2 17 0.036 GO:0005605 basal lamina 3 320 0.037 GO:0005578 proteinaceous extracellular matrix 3 196 0.043 GO:0032403 protein complex binding 2 11 0.050 GO:0043236 laminin binding

88

Cluster II terms from DAVID N X P-value GO ID GO Term 8 140 3.03E-05 GO:0004386 helicase activity 6 98 8.85E-04 GO:0008026 ATP-dependent helicase activity 6 98 8.85E-04 GO:0070035 purine NTP-dependent helicase activity 8 274 0.002 GO:0016568 chromatin modification 7 272 0.006 GO:0042623 ATPase activity, coupled 4 71 0.008 GO:0016585 chromatin remodeling complex 7 334 0.013 GO:0016887 ATPase activity Table 5.4: We searched for over-represented GO terms describing the schizophrenia clusters in Figure 5.1. In the table, N is the number of cluster genes annotated with a given term, X is the number of all human genes with that GO term. All P-values have been corrected for multiple hypothesis testing based on the number of considered GO terms. For FuncAssociate, correction was done through randomized simulations. For DAVID, correction was done using the Benjamini-Hochberg procedure.

89

Table 5.5: Likely impact of CNVs on the growth of dendrites and dendritic spines SCZ ASD SCZ ASD Gene Gene function and reported phenotype Polarity Polarity Impact Impact A negative regulator promoting the phosphorylation and destruction of β-catenin AXIN1 (9482734), necessary for dendritic growth (14528308) Del Inc Knockdown of BAIAP2 through RNA reduces the density, length and width of dendritic spines (15673667); overexpression of IRSp53 in cultured hippocampal neurons can increase spine density, whereas RNAi knockdown of IRSp53 protein BAIAP2 decreases spine density (18549783) Del Dec CAP1 promotes actin dynamics, with depletion of CAP1 leading to cofilin-1 improperly localized (15004221), while reduction in cofilin-1 expression exhibited CAP1 longer dendritic protrusions (19380880) Dupl Dec CAP2 less well studied but 64% amino acid identity to CAP1 (8761950) and potentially linked function (11919151); CAP2Δ mutations may cause defects in CAP2 capping filament barbed ends (14680631) Dupl Dec CIT Citron-N knockdown results in a decreased density of mature spines (18309323) Dupl Inc Reduction of CRKL block dendrogensis (18477607); deletions in SCZ versus ASD, CRKL 4:1; duplications in ASD versus SCZ, 9:19 Del Dupl Dec Inc Alpha-catenin regulates actin-filament assembly, potentially through competition CTNNA3 with Arp2/3 (16325583), while Arp2/3 increases actin growth (18430734 ) Del Inc CTNND2 Delta-catenin knockdowns have been shown to increase spine density (19403811 ) Del Inc DKK1 can stimulate neurite outgrowth through Frizzled and JNK signaling DKK1 (18212053) Dupl Inc PDZ (PSD-95/Dlg/ZO-1) scaffold interacts with and promotes actin nucleation on DLG2 the synaptic adhesion molecule neurexin (19118189) Del Dec Mice lacking DVL1 had reduced dendritic arborization in hippocampal neurons (15608632) ; DVL1 is a core component of Wnt7a, which increases the density and DVL1 maturity of dendritic spines (21670302) Del Dec DVL2 DVL2 knockdown suppresses neurite outgrowth38 (11075823) Dupl Inc FLNA is an actin-binding protein that crosslinks filaments, which is necessary for FLNA the development of neuronal filopodia (11075823) Dupl Inc Results suggest that IRs stimulate the growth rate of dendrites and prevent light- induced dendritic plasticity (18549776); overexpression of IRSp53 in cultured hippocampal neurons can increase spine density, whereas RNAi knockdown of INSR IRSp53 protein decreases spine density (18549783) Dupl Inc LIMK1 deficits result in decreased number of dendritic spines or abnormally small, Dupl / LIMK1 thin spines (15184030, 20203211) Del Dupl Inc / Dec Inc Overexpression of Tau leads to decreased levels of β-catenin46, which is necessary MAPT for dendritic growth (18991351) Del Inc NDE1 Promotes neuron outgrowth by regulating microtubules (18303022) Dupl Inc Pak2 phosphorylates LIMK1, inactivating cofilin which increases growth (DOI: PAK2 10.1007/978-1-4419-7368-9_11) Del Del Dec Dec PIASx, as a MEF2 SUMO E3 ligase, promotes dendritic claw differentiation in the PIAS3 cerebellar cortex (17855618); deletions in SCZ versus ASD, 6:19 Del Dec SPTAN1 stabilizes cytoskeleton, intensity corresponds to the density of neuronal SPTAN1 cells (6099746) Dupl Inc Table 5.5: We performed a literature review to assess the impact of cluster genes associated with de novo CNVs on the growth of dendrites and dendritic spines – increase (inc) or decrease (dec). CNV polarity, i.e. deletion (del) or duplication (dupl), allowed us to determine a corresponding change in the gene dosage. CNV-associated genes were taken from either the schizophrenia clusters or the autism cluster identified in our previous work. For the two genes with both duplication and deletion events (CRKL and PIAS3), we used the frequency of CNVs reported in Malhotra et al. to determine the predominant polarity associated with each disease. We included Pubmed identifiers for key references.

90

5.5. References

1. International Schizophrenia, C., et al., Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 2009. 460(7256): p. 748-52. 2. O'Donovan, M.C., et al., Identification of loci associated with schizophrenia by genome-wide association and follow-up. Nat Genet, 2008. 40(9): p. 1053-5. 3. Ripke, S., et al., Genome-wide association study identifies five new schizophrenia loci. Nat Genet, 2011. 43(10): p. 969-76. 4. Yue, W.H., et al., Genome-wide association study identifies a susceptibility locus for schizophrenia in Han Chinese at 11p11.2. Nat Genet, 2011. 43(12): p. 1228-31. 5. Malhotra, D. and J. Sebat, CNVs: harbingers of a rare variant revolution in psychiatric genetics. Cell, 2012. 148(6): p. 1223-41. 6. Walsh, T., et al., Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science, 2008. 320(5875): p. 539-43. 7. Xu, B., et al., Strong association of de novo copy number mutations with sporadic schizophrenia. Nat Genet, 2008. 40(7): p. 880-5. 8. Xu, B., et al., De novo gene mutations highlight patterns of genetic and neural complexity in schizophrenia. Nat Genet, 2012. 9. Girard, S.L., P.A. Dion, and G.A. Rouleau, Schizophrenia genetics: putting all the pieces together. Curr Neurol Neurosci Rep, 2012. 12(3): p. 261-6. 10. Tandon, R., M.S. Keshavan, and H.A. Nasrallah, Schizophrenia, "Just the Facts": what we know in 2008 part 1: overview. Schizophr Res, 2008. 100(1-3): p. 4-19. 11. Gilman, S.R., et al., Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses. Neuron, 2011. 70(5): p. 898-907. 12. Feldman, I., A. Rzhetsky, and D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc Natl Acad Sci U S A, 2008. 105(11): p. 4323-8. 13. Goh, K.I., et al., The human disease network. Proc Natl Acad Sci U S A, 2007. 104(21): p. 8685-90. 14. Iossifov, I., et al., Genetic-linkage mapping of complex hereditary disorders to a whole-genome molecular-interaction network. Genome Res, 2008. 18(7): p. 1150-62. 15. Rossin, E.J., et al., Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS Genet, 2011. 7(1): p. e1001273. 16. Girard, S.L., et al., Increased exonic de novo mutation rate in individuals with schizophrenia. Nat Genet, 2011. 43(9): p. 860-3. 17. Bassett, A.S., et al., Clinically detectable copy number variations in a Canadian catchment population of schizophrenia. J Psychiatr Res, 2010. 44(15): p. 1005-9. 18. Guilmatre, A., et al., Recurrent rearrangements in synaptic and neurodevelopmental genes and shared biologic pathways in schizophrenia, autism, and mental retardation. Arch Gen Psychiatry, 2009. 66(9): p. 947-56. 19. Kirov, G., et al., Comparative genome hybridization suggests a role for NRXN1 and APBA2 in schizophrenia. Hum Mol Genet, 2008. 17(3): p. 458-65.

91

20. Kirov, G., et al., De novo CNV analysis implicates specific abnormalities of postsynaptic signalling complexes in the pathogenesis of schizophrenia. Mol Psychiatry, 2012. 17(2): p. 142-53. 21. Malhotra, D., et al., High frequencies of de novo CNVs in bipolar disorder and schizophrenia. Neuron, 2011. 72(6): p. 951-63. 22. Mulle, J.G., et al., Microdeletions of 3q29 confer high risk for schizophrenia. Am J Hum Genet, 2010. 87(2): p. 229-36. 23. Stefansson, H., et al., Large recurrent microdeletions associated with schizophrenia. Nature, 2008. 455(7210): p. 232-6. 24. Kirov, G., et al., A genome-wide association study in 574 schizophrenia trios using DNA pooling. Mol Psychiatry, 2009. 14(8): p. 796-803. 25. Lencz, T., et al., Converging evidence for a pseudoautosomal cytokine receptor gene locus in schizophrenia. Mol Psychiatry, 2007. 12(6): p. 572-80. 26. Shi, J., et al., Common variants on chromosome 6p22.1 are associated with schizophrenia. Nature, 2009. 460(7256): p. 753-7. 27. Stefansson, H., et al., Common variants conferring risk of schizophrenia. Nature, 2009. 460(7256): p. 744-7. 28. Sullivan, P.F., et al., Genomewide association for schizophrenia in the CATIE study: results of stage 1. Mol Psychiatry, 2008. 13(6): p. 570-84. 29. Berriz, G.F., et al., Next generation software for functional trend analysis. Bioinformatics, 2009. 25(22): p. 3043-4. 30. Huang da, W., B.T. Sherman, and R.A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc, 2009. 4(1): p. 44-57. 31. Kang, H.J., et al., Spatio-temporal transcriptome of the human brain. Nature, 2011. 478(7370): p. 483-9. 32. Fiala, J.C., J. Spacek, and K.M. Harris, Dendritic spine pathology: cause or consequence of neurological disorders? Brain Res Brain Res Rev, 2002. 39(1): p. 29-54. 33. Penzes, P., et al., Dendritic spine pathology in neuropsychiatric disorders. Nat Neurosci, 2011. 14(3): p. 285-93. 34. Hutsler, J.J. and H. Zhang, Increased dendritic spine densities on cortical projection neurons in autism spectrum disorders. Brain Res, 2010. 1309: p. 83-94. 35. Glantz, L.A. and D.A. Lewis, Decreased dendritic spine density on prefrontal cortical pyramidal neurons in schizophrenia. Arch Gen Psychiatry, 2000. 57(1): p. 65-73. 36. Irwin, S.A., et al., Abnormal dendritic spine characteristics in the temporal and visual cortices of patients with fragile-X syndrome: a quantitative examination. Am J Med Genet, 2001. 98(2): p. 161-7. 37. Kvajo, M., H. McKellar, and J.A. Gogos, Molecules, signaling, and schizophrenia. Curr Top Behav Neurosci, 2010. 4: p. 629-56. 38. Pickard, B., Progress in defining the biological causes of schizophrenia. Expert Rev Mol Med, 2011. 13: p. e25.

92

6. Function and phenotype across de novo genetic mutations associated with autism

6.1. Introduction

The list of genes affecting an individual’s susceptibility to autism is rapidly growing; recent studies have implicated many causal genes and it is estimated that hundreds more may contribute to disease risk [1]. Studies focusing on de novo events have been particularly successful as these mutations provide a mechanism by which cases may continue to arise even for a disease like autism with a strong negative impact on fecundity [2-7]. With this data has come a new understanding of the biological pathways unifying disease genes and the impact these have on pathophysiology, an understanding crucial to developing better diagnostic tests and uncovering new therapeutic interventions [8-10]. The next step in the study of ASD is to use this biological knowledge to better describe the phenotypes represented by the disorder as autism is associated with particularly wide range of cognitive and behavioral symptoms [4, 10, 11].

In this chapter, we employ our method, NETBAG, to integrate copy number variants (CNVs) and single nucleotide variants (SNVs) from several recent studies of the SSC collection [3-6]. Our method searches a diverse collection of disease-linked genetic variants for gene clusters with high functional similarity, utilizing the fact that true causal mutations will share common molecular mechanisms [12].

We find these mutations converge on a several processes relevant to ASD: signaling at the post synaptic density, neuronal growth particularly of dendritic spines, and DNA modification.

We take this analysis further, leveraging the data from two large scale projects. First and foremost, the Simons Simplex Collection (SSC) [13] is a pioneering effort to collect a comprehensive array of genetic and phenotypic data for ASD subjects, including assessment scores measuring intelligence and social responsiveness. Second, the Human Brain Transcriptome has produced to large collection gene expression microarray data covering multiple healthy individuals over a range of developmental stages and distinct anatomical regions within the brain (HBT, hbatlas.org) [14] allowing

93

for a more detailed study of the impact of certain genes on neuronal development. We explore expression profiles for functional subsets across multiple brain regions and developmental stages, and across phenotype scores collected in the SSC to better understand the connections between mutation, function, and phenotype.

The major goals of this chapter are:

(1) Expand on our previous analysis of de novo CNVs associated with autism by integrating de novo

SNVs. We show that these different types of mutation form an integrated cluster representing

multiple functions relevant to the disease. In addition, there has been a focus on truncating

mutations – such as nonsense, splice site, and frameshift mutations – which are more disruptive

of protein function. These mutations occur more frequently in autistic subjects than their

unaffected siblings making them likely causative [3]. We show, however, that non-truncating

mutations are also crucial to the disease cluster.

(2) We find interesting distinct expression patterns for ASD variants based on mutation type,

function, and phenotype values such as subject sex and disease severity, mirroring the

heterogeneous nature of ASD phenotypes.

(3) Finally, we show that for some phenotypic measurements relate to gene contribution to the

NETBAG cluster, suggesting that in the future the functional relevance of individual mutations

might have some predictive power over disease severity.

94

6.2. Results

6.2.1. Gene cluster affected by de novo mutations

To elucidate the molecular networks underlying ASD, NETBAG was applied to a collection of genes affected by non-synonymous de novo CNVs and SNVs observed in autistic patients in the Simons

Simplex Collection (SSC) [3-6]. The combined data from four recent studies collectively identified 985 unique genes at 619 genomic locations. Our method searched for a subset of these genes that are strongly connected in an underlying phenotype network, where every gene pair is connected by a score proportional to the likelihood that contribute to a shared disease. This search revealed a highly significant cluster containing 157 genes (P-value=0.007, Figure 6.1). Notably, every study contributed a number of genes to the identified cluster highlighting the importance of aggregating genetic variants and variant types (31/157 genes from CNVs in Levy et al. and 129/157 genes were from SNVs; 51 from

O’Roak et al., 54 from Iossifov et al., and 31 from Sanders et al.). In addition, the cluster obtained using all data was more significant than any of the clusters found using a single study (Table 6.1), and no significant clusters were found among the non-synonymous de novo variants identified in unaffected siblings from these studies.

As validation that NETBAG is preferentially selecting causative mutations, we consider gene recurrence and haploinsufficiency. Over the three studies of SNVs, recurrent truncating mutations were reported for four genes (KATNAL2, CHD8, DYRK1A, and SCN2A), making them likely causative genes [3,

15]. Three of these genes were selected by NETBAG, although only 23% (129/574) of all SNV-implicated genes were selected (Fisher’s exact test, two-tail P-value = 0.037). In addition, as all of the de novo variants were reportedly heterozygous, true causative mutations are more likely to occur in haploinsufficient genes. Using predicted probabilities from a recent study [16], we found that genes in our cluster are significantly more likely to be haploinsufficient than the remaining implicated genes

(median probability in network = 0.562, median out of network = 0.31, Mann-Whitney, two-tail P-value

95

< 1e-20). Results were comparable even when considering only SNV-implicated genes. In contrast, no difference in haploinsufficiency was observed between truncating and non-truncating mutations

(median truncating probability = 0.39, median non-truncating probability=0.37, Mann-Whitney P- value=0.95).

Figure 6.1: NETBAG selected cluster using autism-associated de novo SNVs and CNVs. Events taken from recent studies, cluster P-value = 0.007. Node sizes are proportional to the importance of each gene to the final cluster significance, and edge widths are proportional to the likelihood each pair of genes contributes to the same phenotype. For clarity, only the strongest two edges from each gene are shown. Node shape indicates source of the gene: squares represent genes from CNV, circles from SNV, and diamonds represent genes implicated by both types of variation. Hierarchical clustering was applied to divide the cluster into functional groups as indicated by color (Figure 6.7), and general functions of these clusters were identified using DAVID as labeled (Table 6.5). Grey nodes are those which did not fall into a cohesive subset in hierarchical clustering.

96

Table 6.1: Clusters selected by NETBAG for various input sets Input set Cluster Size Cluster P-value Proband SNVs & CNVs 157 0.007 Proband SNVs from O’Roak et al. 32 0.018 Proband SNVs 62 0.019 Proband CNVs 36 0.071 Proband SNVs from Sanders et al. 25 0.527 Proband SNVs from Iossifov et al. 25 0.709 Sibling SNVs & CNVs 72 0.946 Sibling SNVs 50 0.969 Table 6.1: Clusters and P-values determined by NETBAG based on various input sets. Proband sets included 47 unique CNV loci covering 433 genes from Levy et al, 187 genes implicated through SNVs from O’Roak et al., 268 SNV genes from Iossifov et al., and 141 from Sanders et al. Sibling sets from the same sources contained 13 unique CNV loci covering 66 genes, and 352 genes implicated through SNVs. Sibling CNVs were not run independently due to the limited number of events.

Table 6.2: Selected Gene Ontology functions associated with autism cluster GOID Ontology Term P-value GO:0045202 CC synapse 6.94e-8 GO:0016568 BP chromatin modification 2.51e-5 GO:0051015 MF actin filament binding 7.30e-5 GO:0031252 CC cell leading edge 9.53e-5 GO:0005262 MF calcium channel activity 1.47e-4 GO:0004386 MF helicase activity 3.26e-4 GO:0015629 CC actin cytoskeleton 4.18e-4 GO:0030036 BP actin cytoskeleton organization 5.16e-4 GO:0005913 CC cell-cell adherens junction 6.09e-4 GO:0045211 CC postsynaptic membrane 6.64e-4 GO:0016887 MF ATPase activity 8.43e-4 GO:0044456 CC synapse part 8.66e-4 GO:0007611 BP learning or memory 9.21e-4 GO:0030029 BP actin filament-based process 0.001 GO:0005938 CC cell cortex 0.001 GO:0005911 CC cell-cell junction 0.002 Table 6.2: Selected Gene Ontology (GO) term annotations over-represented among genes in the NETBAG cluster (Figure 6.1), as identified by DAVID (david.abcc.ncifcrf.gov). P-values shown have been corrected for multiple hypothesis testing using Bonferroni correction available through DAVID. Ontology indicates GO domain; BP: Biological process, MF: Molecular function, CC: Cellular component. General terms associated with more than 350 human genes are not shown; for a full list of terms associated with each sub-cluster see Table 6.5.

97

6.2.2. Biological functions associated with autism cluster

To explore the biological functions represented by our cluster, we used DAVID [17] to identify

Gene Ontology (GO) terms common to our cluster genes (Table 6.2). Several interesting terms emerged in very different functional areas, while no GO terms were significantly enriched in either the full set of de novo variants or truncating SNVs. We applied hierarchical clustering to these genes based on scores in our underlying phenotype network and uncovered several interesting sub-clusters associated with distinct biological functions as labeled in Figure 6.1 (Figure 6.7). Several of these clusters contained genes and functions with well-documented links to autism.

One sub-cluster contained genes responsible for synapse formation (Figure 6.1, cyan), such as neurexin (NRXN) and neuroligin (NRLG) as well as important scaffolding proteins localized to the postsynaptic density (PSD) of excitatory synapses. These functions have been known to be associated with ASD for several years [18-21]. A closely related sub-cluster involved primarily channel activity at the postsynaptic membrane (Figure 6.1, blue). This sub-cluster contained several voltage-dependent calcium channel genes (CACNA1B/D/E/S), and cholinergic receptors (CHRND/A7). Ca2+ signaling regulates neuronal excitability and has been previously implicated in ASD due to its role in synapse formation and dendritic growth [22]. While calcium channel activity dominated the associated GO terms, several other ion channel and receptor genes were also present in this cluster including: the sodium channel SCN2A linked to familial autism through its binding with calmodulin [23]; metabotropic glutamate receptors

GRM5/7 relatives of genes linked to ASD through their reactions with L-glutamate, the major excitatory neurotransmitter in the central nervous system [24-26]; and channel receptor subunit GABRB3 which processes the inhibitory neurotransmitter GABA and is linked to ASD, Rett syndrome, and

Angelman/Prader-Willi syndrome [27-29].

The largest sub-cluster related to functions convergent on neuronal signaling and growth, particularly relevant to the formation of neuronal structures (Figure 6.1, green). This sub-cluster

98

contained genes directly related to the growth of actin cytoskeleton (such as FLNA, CTNNA3, SPTAN1,

NF1, and ACTN4) and to neuronal projections, axonogenesis, or dendritic formation (DCC and CYFIP1). In addition, there were several genes associated with cell adhesions (CTNNB1/A3 and PTPRK/M), which have an emerging role in ASD [30]. There were also several genes associated with neuronal signaling including some directly involved in axon guidance (EPHA1/B2) [31]. It also contains MAPK13/3, and

WNK2 which modulates the activity of MAPK3, all of which affect the mTOR pathway known to regulate dendritic growth [32]. Overall, the large number of genes in this sub-cluster highlights the critical role of altered synaptogenesis, dendritic growth, and the formation and proper functioning of neuronal junctions in the development of ASD.

The final sub-cluster was functionally convergent on chromosomal modification, containing genes associated with helicase activity and histone modification (Figure 6.1, red). Although the relationships between these functions to ASD are not as well understood, there is evidence that chromatin regulatory mechanisms can effect neural development [33], and histone hyperacetylation was shown to produce autism-like behavior in mice [34]. In addition, several genes in this cluster have been linked to other neurodevelopmental diseases. SMARCA2, a critical hub gene in our cluster, has been implicated in schizophrenia through its role in chromatin remodeling [35]. MeCP2, responsible for

Rett syndrome [36], regulates genes involved in brain function through binding to methylated DNA [37].

CHD7, a relative of CHD8 in our cluster, causes CHARGE syndrome, a disorder with developmental delays and autism-like symptoms [38]. Like the previous sub-cluster, this one also covers a broad range of functions potentially interesting in the study of autism. For example, the ribonuclease DICER1 functions with microRNAs to produce distinct neuroanatomical changes related to ASD [39, 40], and eukaryotic translation initiation factors (EIF2c1/3G/4A1/4G1) may be associated with ASD through altered synaptic protein synthesis [41, 42].

99

6.2.3. Association of NETBAG network with functionally relevant gene sets

It is interesting to consider cluster genes and the wide range of functions they represent in relation to other gene sets relevant to related diseases and functions. Fragile X syndrome is a genetic disorder resulting in a spectrum of cognitive disabilities and often accompanied by autistic symptoms. It is caused by a failure to express the RNA-binding protein FMRP, which promotes a wide array of functions including synaptic plasticity and signaling. One recent study attempted to catalog FMRP interaction partners [43], and Iossifov et al. found a significant overlap between these genes and truncating mutations associated with ASD [3]. Following the method in Iossifov et al., we calculated the expected number of FMRP targets in our cluster using the observed overlap with a set of rare synonymous mutations in a healthy population [44]. We found a significantly higher overlap between our cluster genes and the FMRP proteins, with three times as many FMRP targets than expected by chance (17.6 expected, 47 observed, Binomial P-value = 1.69e-10, Table 6.3). This was comparable to the overlap with truncating mutations observed by Iossifov et al. While there was a modest overlap between FMRP targets and all genes with nonsynonymous proband SNVs (P-value = 0.034), there was no significant overlap with genes with nonsynonymous SNVs in unaffected siblings. We also note that these FMRP targets are spread across all biological functions represented in our cluster.

We performed a similar analysis using proteins identified in the post synaptic density (PSD) [45].

The PSD is a dense structure localized to excitatory synapses and crucial to signaling at synaptic junctions. In previous work, we demonstrated that PSD proteins have many functions in common with

ASD-associated genes [46]. We observed a significant overlap between PSD genes and our network

(8.5:29, Binomial P-value = 7.59e-9), and a modest overlap with truncating mutations (5.8:11, P-value =

0.049). There was no significant overlap with either the full set of proband or sibling nonsynonymous

SNVs (Table 6.3).

100

Table 6.3: Overlap of various autism gene sets with FMRP targets and PSD proteins Darnell et al. FMRP Bayes et al. PSD

Gene set Size Exp : Obs Exp : Obs % P-value Exp : Obs Exp : Obs % P-value Truncating SNV 107 12.0 : 25 11.2 : 23.4% 3.12e-04 5.8 : 11 5.4 : 10.3% 0.049 NETBAG Cluster 157 17.6 : 49 11.2 : 29.9% 1.69e-10 8.5 : 29 5.4 : 18.5% 7.59e-09 SNV Cluster 129 14.4 : 40 11.2 : 31.0% 1.18e-09 7.0 : 19 5.4 : 14.7% 7.51e-05 All Proband SNV 575 64.3 : 81 11.2 : 14.1% 0.034 31.3 : 33 5.4 : 5.7% 0.713 All Sibling SNV 354 39.6 : 49 11.2 : 13.8% 0.128 19.3 : 20 5.4 : 5.6% 0.815 Table 6.3: We observed the rate at which FMRP targets identified by Darnell et al. or postsynaptic density proteins identified by Bayes et al. were hit by rare synonymous mutations in a control population from a recent analysis of human mutations by Fu et al. These rates (11.2% and 5.4%) allowed us to estimate the expected number of FMRP targets or PSD proteins in a set of mutations controlling for higher mutation rates in longer genes. We compared expected overlap (exp) to observed overlap (obs) and assigned significance to the observed overlaps using a two- sided binomial test.

6.2.4. Patterns of neuronal expression associated with ASD

To investigate the dynamics of the network genes in the developing brain, we used the Human

Brain Transcriptome database (HBT, hbatlas.org) [14], which contains gene expression data for samples retrieved postmortem from brains of otherwise healthy individuals of both sexes at various stages of development. The expression trajectories for the NETBAG cluster genes and for truncating variants demonstrate that genes in both sets are highly expressed in the brain, with NETBAG selected genes showing higher expression than truncating mutations (Figure 6.2a).

We note that for truncating mutations there is distinctly reduced expression during the very earliest phases of fetal development, and we see the same pattern among CNVs. Comparing these mutations with the non-truncating mutations from the NETBAG cluster, we see that the two groups have virtually identical expression profiles (Figure 6.2b) from the mid fetal period onwards. This implies that while all these mutations exist in the same functional space, the relatively more severe mutations are not tolerated during the very earliest phases of embryonic development. Although we did not observe a significant overlap between severe mutations and a particular function in our network, several

101

neurodevelopmental processes peak in activity during the tenth post-conception week (PCW) when expression of genes affected by truncating mutations is highest; neuronal migration, a process previously implicated in ASD [47, 48], is known to begin around the seventh PCW and continue until the sixteenth PCW [49-51]. Thus our high expression during these periods emphasizes the importance of structural changes in the autistic brain.

The high male-to-female incidence ratio is one of the distinguishing features of ASD, estimated at 5:1 or 7:1 for high-functioning individuals [4, 52]. It is theorized that females have unknown protective mechanisms, therefore requiring much stronger genetic perturbations to trigger the disorder

[53]. This effect was observed in probands from the collection as females had fewer of the functionally weaker non-truncating mutations (Fisher’s exact test, one-tail P-value = 0.033), a difference strengthened by considering only genes in the NETBAG network (P-value = 0.019). Female probands also had lower IQ values overall (Female median = 72, Male median = 83, Mann-Whitney P-value = 3.37e-4).

In addition, expression trajectories for mutations found in female probands were higher than their male counterparts among either truncating mutations or NETBAG implicated genes (Figure 6.2c) implying that their mutations hit genes which were more critical to neurodevelopment. Interestingly, among truncating mutations there was no significant difference in IQ values (Female median = 78, Male median

= 79.5), however the expression profile for these genes remained much higher for females than for males, demonstrating that even among similar phenotypes and mutation severity female mutations affected genes with relatively stronger functional implications.

Finally, we examined the expression of the functional groups defined by the hierarchical clustering (Figure 6.2d). Not surprisingly, postsynaptic density genes had the highest overall expression in the brain. Both the PSD cluster and the related channel activity cluster demonstrated a distinct rise during early fetal development, consistent with the occurrence of neuronal migration in early brain development [50]. In contrast, the genes associated with DNA modification show highest expression in

102

the earliest developmental phases followed by a decline approaching birth. Genes associated with actin growth and formation of synaptic junctions, which are involved in learning and memory formation throughout human life, show consistent expression across all periods.

103

Figure 6.2: Gene expression profiles in the brain across developmental stages for ASD genes. Expression data were obtained from the Human Brain Transcriptome database (HBT, hbatlas.org). Average expression levels were calculated across all genes in a given set, and error bars represent standard error of the average across samples. (a) Expression profiles for NETBAG cluster genes (see Figure 6.1), all disease-associated truncating mutations, and genes implicated through SNVs but not selected in the NETBAG network. (b) Expression for severe cluster mutations (all truncating SNVs and CNV genes from NETBAG cluster) and other cluster genes. (c) Expression profiles for cluster and truncating mutations delineated by the gender of the autistic subject. (d) Expression profiles for network genes divided into functional clusters using hierarchical clustering (see Figure 6.1).

104

6.2.5. Temporal and regional bias in neuronal expression

For cluster and truncating mutations, we see a notably higher expression during earlier phases including prenatal and late infancy periods. We calculated the expression bias as the average expression during these periods minus the average expression during later periods, and compared this to randomly generated gene sets with similar levels of overall expression in the brain. While our matched sets did show slightly higher expression during the early period, both cluster and truncating sets showed larger differences (Figure 6.3, Bias = 0.134 and 0.18, P-values = 0.03 and 0.017 respectively).

Figure 6.3: Average expression difference between early developmental periods and later periods. Early periods defined as fetal through late infancy, later as early childhood through adulthood. Expression data were obtained from the Human Brain Transcriptome database (HBT, hbatlas.org) and expression difference was calculated as the average expression of all genes in the earlier period minus the average expression in the later period, so that a positive value indicates higher expression during early development. Random gene sets were generated by matching each disease-associated gene to a random gene with a similar level of overall brain expression. A distribution of expression differences across random sets was used to assign significance to the disease difference. Shown are distributions of random sets with values from the disease-associated sets as vertical dashed lines for NETBAG network genes (blue, Bias = 0.134 and P-value = 0.03), and all truncating mutations (orange, Bias = 0.18 and P-value = 0.017).

105

The HBT database distinguishes various regions of the brain, allowing for analysis of the developmental trajectories across locations. The bias towards early development is present among all areas of the brain and significant in several regions, consistent with the pervasive nature of the biological functions uncovered by our analysis, and the phenotypic diversity in ASD (Figure 6.4 and Table

6.4). The largest expression difference, however, was seen in the amygdala. This region is known to play a critical role in facial perception and involuntary responses to emotionally charged stimuli [54] and failure to recognize facial and emotional social cues are notable characteristics of ASD. Autistic individuals also show enhanced fear processing and response concordant with excessive amygdala activity [55]. There was no significant bias observed in any region using sibling SNVs.

Figure 6.4: Average expression difference across brain regions. A positive value indicating higher expression during periods of early development (fetal through late infancy) than later (early childhood through adulthood). Expression data were obtained from the Human Brain Transcriptome database (HBT, hbatlas.org). Error bars represent standard error of the average determined by bootstrap sampling the gene set. Values for truncating mutations are in blue, NETBAG cluster genes in orange, and sibling SNVs in green. Random gene sets were generated by matching each disease-associated gene to a random gene with a similar level of overall brain expression and the distribution of random differences was used to assess significance of the disease-associated expression difference.

106

Table 6.4: Early development expression bias calculations for regions of the brain NETBAG cluster genes Early Bootstrap estimation Expression-matched genesets Region Bias standard Bias standard Average bias Average bias Z-score P-value bias error deviation amygdala 0.217 0.218 0.069 0.005 0.095 2.231 0.013 cerebellum 0.198 0.197 0.071 -0.002 0.096 2.077 0.021 frontal 0.109 0.109 0.069 0.014 0.097 0.973 0.164 hippocampus 0.136 0.135 0.067 0.021 0.092 1.244 0.108 occipital 0.148 0.149 0.075 0.014 0.101 1.325 0.092 parietal 0.110 0.111 0.067 0.013 0.096 1.003 0.157 temporal 0.135 0.136 0.067 0.007 0.095 1.339 0.088 thalamus 0.130 0.130 0.072 0.039 0.092 0.989 0.158 ventral 0.194 0.193 0.078 0.010 0.109 1.682 0.044 Truncating genes Early Bootstrap estimation Expression-matched genesets Region Bias standard Bias standard bias Average bias Average bias Z-score P-value error deviation amygdala 0.261 0.260 0.083 0.002 0.114 2.263 0.012 cerebellum 0.154 0.154 0.090 -0.002 0.117 1.339 0.089 frontal 0.170 0.171 0.082 0.012 0.118 1.339 0.088 hippocampus 0.180 0.179 0.083 0.019 0.112 1.450 0.072 occipital 0.164 0.165 0.087 0.011 0.124 1.233 0.105 parietal 0.155 0.154 0.083 0.011 0.117 1.222 0.109 temporal 0.185 0.185 0.079 0.005 0.115 1.561 0.060 thalamus 0.223 0.222 0.080 0.037 0.111 1.679 0.045 ventral 0.225 0.223 0.094 0.008 0.131 1.647 0.047 Table 6.4: We examined the bias of ASD towards earlier stages of life by comparing expression during early developmental periods (fetal through late infancy) with later periods (early childhood through adulthood) across various regions of the brain. Expression data were obtained from the Human Brain Transcriptome database (HBT, hbatlas.org). The difference between the average expression levels during both periods was calculated, where a positive value indicates a greater bias towards early development. Bootstrap sampling of each gene set was used to calculate the error around the early bias value. Significance was determined by comparison with the bias uncovered in random gene sets where genes were selected with matching expression in brain. The average bias of matched sets is also included, as well z-scores.

6.2.6. Phenotypic variation in ASD-associated genes

Along with genetic variants, the SSC collected numerous test results measuring the severity of the disease including: the Autism Diagnostic Interview-Revised (ADIR, both sections on social interaction and restricted and repetitive behaviors), Social Responsiveness Scores (SRS), and the Autism Behavior

Checklist (ABC). They also measured general IQ, as well as verbal and non-verbal IQ values. We found that some measurements had an impact on gene expression levels, while others affected a gene’s

107

contribution to the significance of the NETBAG cluster. Interestingly general IQ affected both, indicative of the broad factors measured by intelligence tests.

First, to explore the effect of phenotypic variation on gene expression, we divided genes into subsets based on phenotype measurements. We divided genes at the median value of a particular measurement and plotted expression of higher and lower scoring subsets. Mutations from subjects with lower IQ scores, or higher ADIR (both indicating more severe damage) were associated with higher overall expression in the brain (Figure 6.5). This was true for all truncating genes and all genes from the

NETBAG cluster. We note that the pattern is less distinct for cluster genes, possibly reflecting the method’s purpose of finding functionally related genes.

Figure 6.5: Expression by phenotype measurements. Subjects were divided into two groups based on their phenotype scores and divided at the median score for the entire set. Scores associated with more severe damage are shown in red, less damage in blue. Cluster genes are plotted using solid lines, truncating genes using dashed lines. (A) Subject divided by their full scale, or combined verbal and non- verbal IQ scores, median of 78. (B) Subjects divided by their scores on Autism Diagnostic Interview Revised, social interaction section, median of 21. (C) Subjects divided by their scores on the ADIR, restricted and repetitive behaviors section, median of 6.

As it had the most consistent impact on our results, we chose to examine IQ in greater detail.

We observed, not surprisingly, that severe mutations were more likely to occur in individuals with lower

108

IQ scores; truncating mutations were associated with lower IQ scores (truncating median IQ = 79, non- truncating median IQ = 83, Mann-Whitney, one-tail P-value = 0.09). In addition, NETBAG selects genes from subjects with lower IQ scores regardless of mutation severity. We considered three sub- populations: those who were likely unaffected by their mutation and had an average or higher IQ (above

100), those who were potentially affected having an IQ on lower side but within one standard deviation of average (between 80-100), and finally those subjects with more compromised IQs (below 80). While

23% of all SNVs are selected in our cluster, subjects with the lowest IQ values contribute more genes

(26%) while subjects with moderate IQ values contribute fewer (20.4%). Subjects with the highest IQ values are significantly under-represented in our cluster (18.1% selected in cluster, Fisher’s exact test P- value = 5.0e-4). This suggests that functions uncovered by the NETBAG+ algorithm may be particularly relevant to reduced IQ.

Focusing on only cluster genes, we examined the effect of phenotype on a particular gene’s contribution to the overall network significance. Each gene contributes to the score and significance of the final network based on its connections to all other genes in the network; therefore genes with a great degree of functional relevance to the disease will have higher contribution scores. Comparable to our expression results, we found that genes which contributed less to the cluster score were associated with higher IQ values (Figure 6.6a, Mann-Whitney, one-tail P-value = 0.021). Finally, we also examined the association of IQ to the different functional sub-clusters identified (Figure 6.6b). We found that mutations genes associated with neuronal structure and signaling (Figure 6.1, green) were associated with lower IQ scores than other cluster genes (median IQ in sub-cluster = 67, in remaining cluster = 83,

Mann-Whitney P-value=0.005). In contrast, genes associated with chromosomal modification (Figure

6.1, red) were associated with higher IQ scores than the rest of the cluster (median IQ in sub-cluster =

84, in entire cluster = 74, Mann-Whitney P-value=0.007).

109

Figure 6.6: Histograms showing distribution of subject IQ based on gene subsets. Shown is full scale IQ which combines including verbal and non-verbal IQ. (A) Cluster genes are divided based on their contribution to the final cluster score and significance. Genes with higher contribution (red) were more important to cluster significance and associated with stronger effects or more reduced IQ values. Median IQ values for high contribution genes was 73, for low contribution genes was 81. (B) Distribution for functional sub-clusters from Figure 6.1. Median values were 67 for neuron structure and signaling (green), 84 for DNA modification (red), 85 for postsynaptic density (cyan), and 83 for channel activity (blue).

6.3. Discussion

Our analysis clearly demonstrates that the phenotypic heterogeneity of autism spectrum disorders is mirrored by diversity in mutation type, biological function, and gene expression patterns.

Genes affected by de novo mutations converged on multiple biological pathways, each with distinct expression profiles. Despite this variety, several interesting results emerged across biological functions and mutation types.

We observed a significant bias towards higher gene expression during prenatal and early infancy periods, with maximum expression typically occurring during the mid-fetal periods. The emphasis on these early periods is consistent with the typical onset of ASD in early childhood and our understanding of it as a disease of neurological development. Despite the variation in gene expression across functions, this bias was consistent across most regions of the brain indicative of broad ramifications of ASD

110

mutations. Elevated gene expression in the brain was consistent with more substantial phenotypic alterations, as witnessed by the impact of expression on IQ and other phenotype measurements. In addition, it has often been suggested that females have unknown protective mechanisms against ASD, and thus require stronger genetic perturbations in order to trigger the phenotype. This is further supported by higher average expression for genes with identified mutations from female subjects. There has been recent focus on truncating mutations in ASD. We see the importance of such mutations in determining phenotype through their relatively stronger impact on IQ and other phenotype measurements. However, our method, which is agnostic to mutation severity, suggests that non- truncating mutations also significantly contribute to the disease.

Taken together, our work validates the importance of combining data from multiple sources and using a method such as NETBAG to find the biological processes underlying a complex disease like ASD.

In addition, we show that the key properties of mutations – such as their severity, function, and expression – can directly influence the phenotype of the individual, an important step towards using genetic variants to understanding the manifestation of human disease.

111

6.4. Methods

ASD associated de novo variants.

Variants were obtained from four recent studies of families in the Simons Simplex Collection and included de novo copy number variants (CNVs) [4] and de novo single nucleotide variants (SNVs) [3, 5,

6]. Overlapping CNVs were combined into a single event. When SNV genes were contained within CNV events, the SNV gene was considered individually while the remaining genes in the CNV were considered as a separate event. After combining overlaps and removing duplicate genes the set contained 985 genes at 619 distinct genomic locations remained; 433 genes at 47 distinct loci from Levy et al., and 573

SNVs, 141 from Sanders et al., 187 from O’Roak et al., and 269 from Iossifov et al.

Phenotype network and the NETBAG algorithm

NETBAG relies on our previously defined phenotype network in which all pairs of genes are assigned a score proportional to the likelihood that those genes are involved in a shared phenotype [46,

56]. Genes implicated by ASD-associated variants were mapped to this phenotype network and a greedy search algorithm was used to find the highest scoring cluster at each size. Cluster score is the sum of contributions from each gene, and each gene contributes based on a de-weighted sum of the edges between it and other cluster genes. De-weighting, in which the sum is the strongest edge plus half of the second strongest edge, plus one third of the third strongest edge and so on, reflects our belief that because all cluster genes share common functions their edges are not wholly independent. The significance of the best cluster at each size is determined through comparison with clusters built from randomly generated events. The cluster size with the strongest significance is selected and the P-value is adjusted for multiple hypothesis correction as we have examined clusters of multiple sizes.

Random events were constructed by sampling genes that match both protein length and overall functional connectivity of the putative disease gene in the phenotype network. This ensures that

112

significance is not driven by disease genes with high overall connectivity in the phenotype network, or by genes more susceptible to mutations due to longer sequence length. Similar results were obtained by comparison with random events generated by selecting genes based on synonymous mutation rates taken from a large study of rare mutations in healthy population [44]. This also ensures that significance is not affected by the length of particular genes in the input set, as longer genes will also be more likely to be selected for random trials, but it does not rely on any information in the underlying phenotype network.

A distribution of cluster scores from random trials was used to assign significance to disease- associated clusters at each size. Then for each random trial the most significant P-value is selected regardless of size and a distribution of best P-values is used to assign a final significance to the disease- associated cluster. The final P-value represents the strength of the connections in a cluster and the probability of finding a similarly interconnected cluster of any size among random trials, thus correcting for multiple hypothesis testing due to consideration of networks of different sizes. Clusters with five genes or fewer were ignored to ensure that trivial gene networks did not influence our analysis.

In this analysis we found that many clusters shared the same, highly significant P-values. We therefore defined a z-score for each cluster based on scores from clusters of the same size found using random events and selected the final cluster based on a combination of P-value and z-score.

FMRP and PSD enrichment of ASD gene sets

We calculated the expected number of FMRP targets or PSD-associated proteins for a particular gene set by using the observed overlap between FMRP or PSD genes within a set of rare synonymous mutations from a recent large-scale survey of human genetic variation [44]. This ensures that significance was not biased by higher mutation rates in longer genes. Using this percentage, we used a

113

two-sided binomial test to assign P-values to the observed overlaps between disease gene sets and

FMRP targets or PSD-associated proteins.

Hierarchical clustering

An implementation of hierarchical clustering available in R was used to separate the NETBAG results into functional sub-clusters. Pairwise distances between genes were computed by taking the inverse of the corresponding likelihood score from the phenotype network, such that gene pairs with stronger phenotype scores were considered closer together. Clusters were joined using average linkage.

Brain expression trajectories

High throughput gene expression data were obtained from the Human Brain Transcriptome [14]

(HBT, hbatlas.org) database. This database contained gene expression from multiple post-mortem samples of healthy human brains. Preprocessing of expression microarray measurements included quantile normalization and log2 transformation across all samples. To investigate trajectory of gene expression in the brain, we calculated the average for all genes in a given subset at each of the 15 developmental stages, (from embryonic to late adulthood).

Brain expression difference between early and late development

To characterize the relationship between brain expression during phases of early development

(fetal through late infancy, 4 post-conception weeks to 1 year of age) and later phases (early childhood through adulthood, 1 year to 60 years of age), average expression for each age range was computed and the difference determined for each region of the brain. A positive value indicates higher expression during periods of early development. The standard error of the average was calculated by bootstrap sampling the network or truncating genes. To assess significance, random gene sets were generated by

114

matching each disease-associated gene to a random gene with a similar level of overall brain expression.

The brain expression difference was calculated for each random gene set and a distribution of differences for 10,000 such random trials was used to assign a P-value to the observed brain expression difference for a given disease gene set.

Figure 6.7: Hierarchical clustering of genes selected by NETBAG. Pairwise distances between genes were determined using the phenotype network such that genes with strong functional links were considered closer together. Clustering was performed using R with average linkage to join clusters. Clusters were assigned the following functions using DAVID: green represents neuronal signaling along with actin growth and neuronal projection, red represents chromosomal modification, blue represents calcium channel activity and signaling at the postsynaptic membrane, cyan represents genes localized to the post synaptic density, and grey nodes were uncharacterized.

115

Figure 6.8: Gene expression profiles for disease-matched gene sets. Expression data were obtained from the Human Brain Transcriptome database (HBT, hbatlas.org). Average expression levels were calculated across all genes in a given set, and error bars represent standard error of the average across samples. Shown are expression trajectories for NETBAG cluster genes (see Figure 6.1) along with random genes selected to match cluster genes based on various criteria. Genes were matched using protein sequence length (blue), average connectivity in the phenotype network (purple), or both protein sequence length and average network connectivity (green).

116

Table 6.5: Gene Ontology terms significantly associated with functional clusters Postsynaptic density (Figure 6.1, cyan) GOID Ontology Term P-value GO:0044456 CC synapse part 1.77 e-07 GO:0014069 CC postsynaptic density 5.17 e-06 GO:0045211 CC postsynaptic membrane 6.91 e-05 GO:0004385 MF guanylate kinase activity 0.001 GO:0007268 BP synaptic transmission 0.005 GO:0019201 MF nucleotide kinase activity 0.006 GO:0016776 MF phosphotransferase activity, phosphate group as acceptor 0.012 GO:0019205 MF nucleobase, nucleoside, nucleotide kinase activity 0.021 GO:0048489 BP synaptic vesicle transport 0.037 Channel activity (Figure 6.1, blue) GOID Ontology Term P-value GO:0005261 MF cation channel activity 1.64 e-14 GO:0005262 MF calcium channel activity 6.77 e-12 GO:0006816 BP calcium ion transport 5.26 e-11 GO:0015674 BP di-, tri-valent inorganic cation transport 3.73 e-10 GO:0034702 CC ion channel complex 1.02 e-08 GO:0005245 MF voltage-gated calcium channel activity 3.34 e-08 GO:0022843 MF voltage-gated cation channel activity 7.26 e-08 GO:0022832 MF voltage-gated channel activity 5.25 e-07 GO:0005244 MF voltage-gated ion channel activity 5.25 e-07 GO:0034703 CC cation channel complex 7.16 e-05 GO:0022834 MF ligand-gated channel activity 7.95 e-05 GO:0015276 MF ligand-gated ion channel activity 7.95 e-05 GO:0005891 CC voltage-gated calcium channel complex 3.61 e-04 GO:0034704 CC calcium channel complex 6.82 e-04 GO:0044456 CC synapse part 0.002 GO:0045211 CC postsynaptic membrane 0.003 GO:0019932 BP second-messenger-mediated signaling 0.007 GO:0007187 BP G-protein signaling, coupled to cyclic nucleotide second messenger 0.008 GO:0042383 CC sarcolemma 0.011 GO:0019935 BP cyclic-nucleotide-mediated signaling 0.012 GO:0005230 MF extracellular ligand-gated ion channel activity 0.016 GO:0007268 BP synaptic transmission 0.020 GO:0030315 CC T-tubule 0.039 DNA Modification (Figure 6.1, red) GOID Ontology Term P-value GO:0016568 BP chromatin modification 7.48 e-13 GO:0004386 MF helicase activity 1.92 e-10 GO:0042623 MF ATPase activity, coupled 5.23 e-05 GO:0016569 BP covalent chromatin modification 2.39 e-04 GO:0016585 CC chromatin remodeling complex 4.27 e-04 GO:0000123 CC histone acetyltransferase complex 0.003 GO:0016570 BP histone modification 0.004 GO:0070035 MF purine NTP-dependent helicase activity 0.004 GO:0008026 MF ATP-dependent helicase activity 0.004 GO:0016573 BP histone acetylation 0.019 GO:0006473 BP protein amino acid acetylation 0.027

117

(cont.) DNA Modification (Figure 6.1, red) GOID GOID GOID GOID GO:0000122 BP negative regulation of transcription from RNA polymerase II promoter 0.033 GO:0017053 CC transcriptional repressor complex 0.041 GO:0000118 CC histone deacetylase complex 0.041 GO:0000122 BP negative regulation of transcription from RNA polymerase II promoter 0.033 GO:0017053 CC transcriptional repressor complex 0.041 GO:0000118 CC histone deacetylase complex 0.041 GO:0043543 BP protein amino acid acylation 0.046 Neuron structure and signaling (Figure 6.1, green) GOID Ontology Term P-value GO:0051015 MF actin filament binding 9.80 e-9 GO:0015629 CC actin cytoskeleton 5.93 e-8 GO:0030036 BP actin cytoskeleton organization 3.79 e-7 GO:0031252 CC cell leading edge 4.67 e-7 GO:0005913 CC cell-cell adherens junction 7.51 e-7 GO:0030029 BP actin filament-based process 7.94 e-7 GO:0004713 MF protein tyrosine kinase activity 5.93 e-6 GO:0005911 CC cell-cell junction 7.71 e-6 GO:0031175 BP neuron projection development 2.08 e-5 GO:0005912 CC adherens junction 2.48 e-5 GO:0070161 CC anchoring junction 5.52 e-5 GO:0004714 MF transmembrane receptor protein tyrosine kinase activity 7.52 e-5 GO:0001725 CC stress fiber 4.30 e-4 GO:0007507 BP heart development 5.96 e-4 GO:0032432 CC actin filament bundle 6.02 e-4 GO:0022603 BP regulation of anatomical structure morphogenesis 6.96 e-4 GO:0042641 CC actomyosin 7.05 e-4 GO:0007169 BP transmembrane receptor protein tyrosine kinase signaling pathway 8.41 e-4 GO:0005916 CC fascia adherens 0.001 GO:0030027 CC lamellipodium 0.001 GO:0000904 BP cell morphogenesis involved in differentiation 0.002 GO:0022604 BP regulation of cell morphogenesis 0.002 GO:0005938 CC cell cortex 0.004 GO:0014704 CC intercalated disc 0.005 GO:0005516 MF calmodulin binding 0.006 GO:0048812 BP neuron projection morphogenesis 0.006 GO:0030425 CC dendrite 0.008 GO:0045296 MF cadherin binding 0.010 GO:0043531 MF ADP binding 0.014 GO:0044449 CC contractile fiber part 0.015 GO:0048858 BP cell projection morphogenesis 0.017 GO:0043292 CC contractile fiber 0.021 GO:0032990 BP cell part morphogenesis 0.023 GO:0007409 BP axonogenesis 0.031 GO:0016477 BP cell migration 0.040 Table 6.5: Gene Ontology (GO) term annotations identified by DAVID among genes in each of the sub-clusters from the NETBAG results (Figure 6.1). P-values have been corrected for multiple hypothesis testing using Bonferroni correction available through DAVID. Ontology indicates GO domain; BP: Biological process, MF: Molecular function, CC: Cellular component. General terms associated with more than 350 human genes are not shown.

118

6.5. References

1. Berg, J.M. and D.H. Geschwind, Autism genetics: searching for specificity and convergence. Genome Biol, 2012. 13(7): p. 247. 2. Hyman, S.E., A glimmer of light for neuropsychiatric disorders. Nature, 2008. 455(7215): p. 890-3. 3. Iossifov, I., et al., De novo gene disruptions in children on the autistic spectrum. Neuron, 2012. 74(2): p. 285-99. 4. Levy, D., et al., Rare de novo and transmitted copy-number variation in autistic spectrum disorders. Neuron, 2011. 70(5): p. 886-97. 5. O'Roak, B.J., et al., Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature, 2012. 485(7397): p. 246-50. 6. Sanders, S.J., et al., De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature, 2012. 485(7397): p. 237-41. 7. Veltman, J.A. and H.G. Brunner, De novo mutations in human genetic disease. Nat Rev Genet, 2012. 13(8): p. 565-75. 8. Bodmer, W. and C. Bonilla, Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet, 2008. 40(6): p. 695-701. 9. Hirschhorn, J.N., Genomewide association studies--illuminating biologic pathways. N Engl J Med, 2009. 360(17): p. 1699-701. 10. Geschwind, D.H., Genetics of autism spectrum disorders. Trends Cogn Sci, 2011. 15(9): p. 409-16. 11. Freitag, C.M., The genetics of autistic disorders and its clinical relevance: a review of the literature. Mol Psychiatry, 2007. 12(1): p. 2-22. 12. Feldman, I., A. Rzhetsky, and D. Vitkup, Network properties of genes harboring inherited disease mutations. Proc Natl Acad Sci U S A, 2008. 105(11): p. 4323-8. 13. Fischbach, G.D. and C. Lord, The Simons Simplex Collection: a resource for identification of autism genetic risk factors. Neuron, 2010. 68(2): p. 192-5. 14. Kang, H.J., et al., Spatio-temporal transcriptome of the human brain. Nature, 2011. 478(7370): p. 483-9. 15. O'Roak, B.J., et al., Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders. Science, 2012. 338(6114): p. 1619-22. 16. Huang, N., et al., Characterising and predicting haploinsufficiency in the human genome. PLoS Genet, 2010. 6(10): p. e1001154. 17. Huang da, W., B.T. Sherman, and R.A. Lempicki, Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc, 2009. 4(1): p. 44-57. 18. Sudhof, T.C., Neuroligins and neurexins link synaptic function to cognitive disease. Nature, 2008. 455(7215): p. 903-11. 19. Berkel, S., et al., Mutations in the SHANK2 synaptic scaffolding gene in autism spectrum disorder and mental retardation. Nat Genet, 2010. 42(6): p. 489-91. 20. Feyder, M., et al., Association of mouse Dlg4 (PSD-95) gene deletion and human DLG4 gene variation with phenotypes relevant to autism spectrum disorders and Williams' syndrome. Am J Psychiatry, 2010. 167(12): p. 1508-17.

119

21. Muddashetty, R.S., et al., Dysregulated metabotropic glutamate receptor-dependent translation of AMPA receptor and postsynaptic density-95 mRNAs at synapses in a mouse model of fragile X syndrome. J Neurosci, 2007. 27(20): p. 5338-48. 22. Krey, J.F. and R.E. Dolmetsch, Molecular mechanisms of autism: a possible role for Ca2+ signaling. Curr Opin Neurobiol, 2007. 17(1): p. 112-9. 23. Weiss, L.A., et al., Sodium channels SCN1A, SCN2A and SCN3A in familial autism. Mol Psychiatry, 2003. 8(2): p. 186-94. 24. Dutta, S., et al., Glutamate receptor 6 gene (GluR6 or GRIK2) polymorphisms in the Indian population: a genetic association study on autism spectrum disorder. Cell Mol Neurobiol, 2007. 27(8): p. 1035-47. 25. Jamain, S., et al., Linkage and association of the glutamate receptor 6 gene with autism. Mol Psychiatry, 2002. 7(3): p. 302-10. 26. Serajee, F.J., et al., The metabotropic glutamate receptor 8 gene at 7q31: partial duplication and possible association with autism. J Med Genet, 2003. 40(4): p. e42. 27. Buxbaum, J.D., et al., Association between a GABRB3 polymorphism and autism. Mol Psychiatry, 2002. 7(3): p. 311-6. 28. Samaco, R.C., A. Hogart, and J.M. LaSalle, Epigenetic overlap in autism-spectrum neurodevelopmental disorders: MECP2 deficiency causes reduced expression of UBE3A and GABRB3. Hum Mol Genet, 2005. 14(4): p. 483-92. 29. Cassidy, S.B., et al., Prader-Willi syndrome. Genet Med, 2012. 14(1): p. 10-26. 30. Betancur, C., T. Sakurai, and J.D. Buxbaum, The emerging role of synaptic cell-adhesion pathways in the pathogenesis of autism spectrum disorders. Trends Neurosci, 2009. 32(7): p. 402-12. 31. Egea, J. and R. Klein, Bidirectional Eph-ephrin signaling during axon guidance. Trends Cell Biol, 2007. 17(5): p. 230-8. 32. Tavazoie, S.F., et al., Regulation of neuronal morphology and function by the tumor suppressors Tsc1 and Tsc2. Nat Neurosci, 2005. 8(12): p. 1727-34. 33. Ronan, J.L., W. Wu, and G.R. Crabtree, From neural development to cognition: unexpected roles for chromatin. Nat Rev Genet, 2013. 14(5): p. 347-59. 34. Kataoka, S., et al., Autism-like behaviours with transient histone hyperacetylation in mice treated prenatally with valproic acid. Int J Neuropsychopharmacol, 2013. 16(1): p. 91-103. 35. Koga, M., et al., Involvement of SMARCA2/BRM in the SWI/SNF chromatin-remodeling complex in schizophrenia. Hum Mol Genet, 2009. 18(13): p. 2483-94. 36. Amir, R.E., et al., Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl- CpG-binding protein 2. Nat Genet, 1999. 23(2): p. 185-8. 37. Skene, P.J., et al., Neuronal MeCP2 is expressed at near histone-octamer levels and globally alters the chromatin state. Mol Cell, 2010. 37(4): p. 457-68. 38. Johansson, M., C. Gillberg, and M. Rastam, Autism spectrum conditions in individuals with Mobius sequence, CHARGE syndrome and oculo-auriculo-vertebral spectrum: diagnostic aspects. Res Dev Disabil, 2010. 31(1): p. 9-24.

120

39. Cuellar, T.L., et al., Dicer loss in striatal neurons produces behavioral and neuroanatomical phenotypes in the absence of neurodegeneration. Proc Natl Acad Sci U S A, 2008. 105(14): p. 5614-9. 40. Mellios, N. and M. Sur, The Emerging Role of microRNAs in Schizophrenia and Autism Spectrum Disorders. Front Psychiatry, 2012. 3: p. 39. 41. Kelleher, R.J., 3rd and M.F. Bear, The autistic neuron: troubled translation? Cell, 2008. 135(3): p. 401-6. 42. Santini, E., et al., Exaggerated translation causes synaptic and behavioural aberrations associated with autism. Nature, 2013. 493(7432): p. 411-5. 43. Darnell, J.C., et al., FMRP stalls ribosomal translocation on mRNAs linked to synaptic function and autism. Cell, 2011. 146(2): p. 247-61. 44. Fu, W., et al., Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature, 2013. 493(7431): p. 216-20. 45. Bayes, A., et al., Characterization of the proteome, diseases and evolution of the human postsynaptic density. Nat Neurosci, 2011. 14(1): p. 19-21. 46. Gilman, S.R., et al., Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses. Neuron, 2011. 70(5): p. 898- 907. 47. Wegiel, J., et al., The neuropathology of autism: defects of neurogenesis and neuronal migration, and dysplastic changes. Acta Neuropathol, 2010. 119(6): p. 755-70. 48. Persico, A.M. and T. Bourgeron, Searching for ways out of the autism maze: genetic, epigenetic and environmental clues. Trends Neurosci, 2006. 29(7): p. 349-58. 49. Sidman, R.L. and P. Rakic, Neuronal migration, with special reference to developing human brain: a review. Brain Res, 1973. 62(1): p. 1-35. 50. Andersen, S.L., Trajectories of brain development: point of vulnerability or window of opportunity? Neurosci Biobehav Rev, 2003. 27(1-2): p. 3-18. 51. Zoghbi, H.Y., Postnatal neurodevelopmental disorders: meeting at the synapse? Science, 2003. 302(5646): p. 826-30. 52. Newschaffer, C.J., et al., The epidemiology of autism spectrum disorders. Annual Review of Public Health, 2007. 28: p. 235-258. 53. Werling, D.M. and D.H. Geschwind, Understanding sex bias in autism spectrum disorder. Proc Natl Acad Sci U S A, 2013. 110(13): p. 4868-9. 54. Adolphs, R., What does the amygdala contribute to social cognition? Ann N Y Acad Sci, 2010. 1191: p. 42-61. 55. Markram, K., et al., Abnormal fear conditioning and amygdala processing in an animal model of autism. Neuropsychopharmacology, 2008. 33(4): p. 901-12. 56. Gilman, S.R., et al., Diverse types of genetic variation converge on functional gene networks involved in schizophrenia. Nat Neurosci, 2012. 15(12): p. 1723-8.