
Integrative Analysis of Heterogeneous Genomic Datasets to Discover Genetic Etiology of Autism Spectrum Disorders

by

Sumaiya Nazeen

B.Sc. in Computer Science and Engineering, Bangladesh University of Engineering and Technology (2011)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Science in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author ...... Signature redacted ...... Department of Electrical Engineering and Computer Science August 28, 2014

Certified by...... Signature ...... Bonnie A. Berger Professor of Applied Mathematics and Computer Science Thesis Supervisor

Accepted by ...... Signature redacted ...... Leslie A. Kolodziejski Professor of Electrical Engineering Chair, Department Committee on Graduate Students

Integrative Analysis of Heterogeneous Genomic Datasets to Discover Genetic Etiology of Autism Spectrum Disorders by Sumaiya Nazeen

Submitted to the Department of Electrical Engineering and Computer Science on August 28, 2014, in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science

Abstract

Understanding the genetic background of complex diseases is crucial to medical research, with implications for diagnosis, treatment, and drug development. As molecular approaches to this challenge are time consuming and costly, computational approaches offer an efficient alternative. Such approaches aim at predicting and prioritizing genes for a particular disease of interest. State-of-the-art prediction and prioritization methods rely on the observation that disease-causing genes have some sort of functional similarity based on either sequence, phenotype, protein-protein interaction (PPI) network, or functional annotation. Another increasingly accepted view is that human diseases result from perturbations of molecular networks, and genes causing the same or similar diseases tend to be close to one another in molecular networks. Such observations have built the basis for a large collection of computational approaches to find previously unknown genes associated with certain diseases. The majority of the methods are designed based on protein interactome networks, with integration of other large-scale omics data, to infer how likely it is that a gene is associated with a disease. In this thesis, we set out to address the outstanding challenge of understanding the genetic etiology of autism spectrum disorder (ASD), which refers to a group of complex neurodevelopmental disorders sharing the common feature of dysfunctional reciprocal social interaction. We introduce three novel methods for computing how likely a given gene is to be involved in ASDs based on copy number variations (CNVs), phenotype similarity, and protein interactome network topology. We also customize a random walk with restarts algorithm for ASD gene prioritization for the first time. Finally, we provide a novel integrative approach for combining CNV, phenotype similarity, and topology-related information with existing knowledge from literature.
Our integrative approach outperforms the individual schemes in identifying and ranking ASD related genes. Our candidate gene set provides a number of interesting biological insights in that it is overrepresented in a number of interesting signaling, cell-adhesion and neurological pathways, molecular functions, and biological processes that are worth further investigation in connection with ASDs. We also find evidence for an interesting connection between gastrointestinal disorders, particularly inflammatory bowel diseases (IBD), and ASDs. The subnetworks we identify indicate the possibility of existence of subclasses of disorders along the autism spectrum.

Thesis Supervisor: Bonnie A. Berger Title: Professor of Applied Mathematics and Computer Science

Acknowledgments

This thesis owes its existence to Professor Bonnie Berger. It has been an amazing experience to work with her. She has been an excellent source of encouragement and inspiration to me. She has been incredibly patient with me and always put my personal growth as a researcher first. I cannot thank her more for teaching me how to approach the process of learning and research.

I am indebted to Dr. Rohit Singh for his constant help, advice, support, and mentorship in all aspects of my thesis. This work would not have been possible without his invaluable advice and support. I remember countless meetings with him in which I walked in frustrated, yet walked out encouraged and excited again. I'd like to thank Rohit for his warm support and patience in teaching me how to face the moments when progress seems slow. I would like to thank Professor Isaac Kohane, Dr. Nathan Palmer, and Dr. Finale Doshi-Velez for sharing their knowledge of autism spectrum disorders with me. Many thanks to the members of Berger lab for sharing my exciting as well as frustrating moments. I'd like to thank Patrice for lightening up my days with her warm greetings. I am grateful to George, Hoon, Sean, and Christina for having discussions with me and encouraging me along in my research. Thanks to Andrew, Deniz, Jian, Noah, and William for being there whenever I needed help.

I owe my gratitude to the Bangladeshi Students Association at MIT, which has become my family in Boston. As always, I am ever grateful to my parents and siblings for their love and constant support. Finally, I express my utmost gratitude to my greatest supporter: to the Almighty Allah, who has bestowed good health upon me, kept me free from anxiety, and filled my everyday with joy and hope.

Contents

Abstract

Acknowledgments

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
  1.2 State of the art
    1.2.1 General Trends in Disease Gene Prediction
    1.2.2 Computational Advances in ASD Gene Prediction
  1.3 Contributions
  1.4 Outline

2 Predicting and Prioritizing Candidate Genes for ASD
  2.1 CNV Information Entropy based Prioritizer
    2.1.1 Copy Number Variation (CNV)
    2.1.2 Copy Number Variants in ASD
    2.1.3 Calculating Information Entropy Score from CNVs
    2.1.4 Quality of CNV Information Entropy based Prioritization
  2.2 ASD Similarity based Prioritizer
    2.2.1 Similarity of Phenotypes or Diseases
    2.2.2 Gene-Phenotype Association Data
    2.2.3 Calculating ASD Similarity Scores
    2.2.4 Performance of ASD Similarity based Prioritizer
  2.3 Diffusion State ASD Proximity based Prioritizer
    2.3.1 Diffusion State Distance (DSD) in PPI Network
    2.3.2 Calculating Diffusion State ASD Proximity (DSAP) of Genes
    2.3.3 Quality of DSAP-based Ranking
  2.4 Network Crosstalk based Prioritizer
    2.4.1 Motivation
    2.4.2 Problem Formulation
    2.4.3 Calculating Network Crosstalk Scores
    2.4.4 Dealing with Statistical Bias
    2.4.5 Performance of Network Crosstalk based Prioritizer

3 Integrative Approach for Identifying ASD Risk Genes
  3.1 Background
    3.1.1 Lasso-penalized Logistic Regression
  3.2 Predicting ASD Association via Logistic Regression based Integrative Approach
    3.2.1 Preparing Data for Training and Validation
    3.2.2 Constructing Lasso-regularized Binomial Regression Model
    3.2.3 Selecting Model Coefficients
    3.2.4 Creating Regularized Model and Making Predictions
  3.3 Performance Analysis

4 ASD Genetics: Implications from Candidate ASD Risk Genes
  4.1 Gene Sets for Analysis
  4.2 Hypergeometric Test for Enrichment
  4.3 Pathway Enrichment Analysis
    4.3.1 An Interesting Connection with Inflammatory Bowel Disease (IBD)
  4.4 Enrichment Analysis on GO gene sets
  4.5 Enrichment Analysis for Subnetworks
  4.6 Functional Analysis for Overlap with Diseases and Bio-functions

5 Conclusion

Appendix A SFARI Genes for Autism Spectrum Disorders

Appendix B Risk Genes for ASDs Identified by Integrative Approach

Appendix C Subnetworks in ASD Risk Gene Set

Bibliography

List of Figures

2-1 Copy number variations in a pair of chromosomes.
2-2 Steps in CNV-based prediction-prioritization of ASD genes.
2-3 Receiver operating characteristic curves for CNV-based prioritizer using different scaling factors.
2-4 Lift chart for CNV-based prioritizer.
2-5 Receiver operating characteristic curve for ASD similarity based prioritizer.
2-6 Lift chart for ASD similarity based prioritizer.
2-7 Receiver operating characteristic curve for Diffusion State ASD Proximity (DSAP) based prioritizer.
2-8 Lift chart for Diffusion State ASD Proximity (DSAP) based prioritizer.
2-9 Receiver operating characteristic curves for network crosstalk based prioritizer using different restart probabilities (r).
2-10 Lift chart for network crosstalk-based prioritizer.

3-1 Performance curves for integrative approach on training data.
3-2 Receiver operating characteristics curves for different ASD gene prediction-prioritization methods.
3-3 Lift chart of integrative approach for ASD gene prediction-prioritization.

4-1 Significant GO biological processes associated with ASD risk gene set.
4-2 Significant GO molecular functions associated with ASD risk gene set.
4-3 Top four subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway Analysis (IPA).

List of Tables

1.1 Summary of general trends in disease gene prediction-prioritization methods.

3.1 Selected regression coefficients for the integrative approach from logistic regression in order of predictive value.
3.2 Selected logistic regression coefficients for integrating different ASD association scores in order of predictive value.
3.3 Selected logistic regression coefficients for integrating ASD-pathway membership information with weights in order of predictive value.

4.1 Canonical pathways having significant overlap with ASD risk genes.
4.2 IBD-related pathways having significant overlap with ASD risk genes.
4.3 Top 10 diseases having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).
4.4 Top 30 functions having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).

A.1 ASD risk genes reported by SFARI gene module.

B.1 Probabilities of association with ASDs for candidate genes identified by our integrative analysis approach.

C.1 Subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway Analysis (IPA).

Chapter 1

Introduction

1.1 Motivation

Identifying disease-causing genes is a fundamental challenge in human health with applications in understanding disease mechanisms, diagnosis, and therapy. Many approaches have been adopted for discovery of candidate genes to date [124]. Traditional genetic mapping methods include linkage analysis and genome-wide association studies (GWAS) of Mendelian diseases and complex traits. While GWAS are powerful and effective, they face challenges in narrowing down long lists of candidate genes [5]. Furthermore, diseases often do not follow the simple genotype-phenotype model, but are rather the consequences of perturbations of multiple genes connected in a molecular network, induced by various factors such as genetic mutations, epigenetic changes, and pathogens [114]. Efforts towards discovering the properties of disease genes in molecular networks have shown that genes associated with the same or similar diseases tend to have some degree of functional similarity. Such similarity can be based on sequence [36], functional annotation [89], protein-protein interactions [34,56,85], etc. [84]. These findings became the basis for the development of computational approaches for predicting and prioritizing disease genes. While traditional disease-causing gene identification methods are time-consuming and costly, these computational approaches offer an efficient alternative.

Autism spectrum disorder (ASD) refers to a group of neurodevelopmental disorders defined by three categories of deficits: abnormal development or impairment of social interaction, abnormal development or impairment of communication skills, and stereotypic and repetitive behaviors [9]. Recent estimates show that ASDs are prevalent in 0.75% to 1% of the population [33,53,54]. Among the conditions encompassed by ASDs, pervasive developmental disorder-not otherwise specified (PDD-NOS) and autistic disorder are the most common, whereas Asperger syndrome appears less frequently. ASD is almost five times more common among boys (1 in 42) than among girls (1 in 189) [28], an effect that becomes even more pronounced in so-called high-functioning cases. Before the 1970s, autism was not widely appreciated to have a strong genetic basis. Instead, various psychodynamic interpretations, including the role of a cold style of mothering, were considered as potential causes. The importance of genetic contributions came to light in the 1980s, when the co-occurrence of chromosomal disorders and rare syndromes with ASDs was identified [16]. Subsequent twin and family studies provided support for a strong genetic component, but lack of uniform diagnostic criteria limited the power of those studies. The development of validated diagnostic and assessment tools such as ADI-R and ADOS for ASDs in the 1990s has proven crucial to the advancement of ASD research, and since then the diagnosis of ASDs has been gaining in frequency. These tools, in concert with important technological advances, have made it possible to carry out a range of studies, such as candidate gene association studies, resequencing studies, and genome-wide assessment of copy number variations (CNVs). This ability has led to the identification of a large number of autism susceptibility genes and increased attention to the effects of de novo and inherited CNVs, thus supporting the notion that genetic factors are a predominant cause of ASDs. Moreover, higher ASD concordance rates in monozygotic twins (36-95%) compared to dizygotic twins (0-31%) [40,95,96,108] and increased risk (at least 2-18%) in families with a history of related disorders [46,86,106] also suggest a strong genetic component behind ASD. However, genetic studies have been able to connect only 1-2% of autism cases to individual mutations in the autism susceptibility genes and loci, and about 20% of cases to their combined effect [2].

One difficulty in studying genetic causes of ASDs is that different conditions are caused by different genetic mutations. In addition, since a condition is caused by the combined effect of many mutations, the individual effect of each is often small and thus hard to detect. An additional difficulty in studying ASD relates to its heterogeneous nature. Specifically, the ASD population exhibits a wide range of conditions characterized by impairments in reciprocal social interaction and communication, as well as restricted and repetitive behaviors. Although some common pathways related to ASD have been identified [21,91,98], this heterogeneity makes it challenging to pinpoint shared genetic causes. Furthermore, small sample sizes in studies limit their statistical power in most cases. Thus, to comprehensively identify risk genes and molecular pathways in ASDs, we need to perform either molecular analysis with substantially larger sample sizes, stratifying patients into more homogeneous groups by diagnostic criteria, sex, or family history, or more sophisticated computational analysis.

Towards understanding the genetics of ASDs over the past two decades, researchers have mainly focused on linkage studies, genome-wide association studies, and microarray studies. Linkage studies aim at finding out the rough location of a disease gene relative to another DNA sequence called a genetic marker, which has its position already known. Affected families are genotyped using a collection of genetic markers across the genome, and how those genetic markers segregate with the disease across multiple families is examined. Most autism-related linkage studies have identified linkage regions reaching the threshold of suggestive linkage at best [35]. Loci on most chromosomes have been suggested to harbor ASD risk, but only a few of them have been independently identified. To date, only loci 7q22-23 [80,81] and 17q11-21 [22,104,123] have been replicated and considered significant on a genome-wide scale. Currently, there are over 25 different loci that may be considered to contain autism susceptibility candidate genes (ASCG), and many more complicated loci are under observation [2]. The lack of genome-wide significant results in most published linkage studies is a consequence of small sample sizes. Thus the establishment of collaborative groups, such as the International Molecular Genetic Study of Autism Consortium (IMGSAC) and Autism Genome Project (AGP) Consortium [80,107], and shared resources, such as the Autism Genetic Research Exchange (AGRE) Consortium [37] have become important steps in facilitating the identification of ASD candidate genes [14].

Unlike linkage studies, genome-wide association studies examine many common genetic variants in different individuals (either in case-control groups or within families) to see if any variant is associated with a disease. Association studies have identified a good number of genome-wide significant chromosomal variations, including copy number variations (CNVs; the presence of a variable number of copies of a particular gene in the genotype of an individual compared to a reference genome), which play an important role in the etiology of ASD [101]. De novo CNVs, hypothesized to be ASD-specific, have been found in up to 7-10% of sporadic ASD cases [14,74]. To date, more than two thousand CNV loci, harboring both rare and common variants [7,20,29,62,83,90,116], have been identified in more than three hundred studies, implicating a very large number of candidate genes. The challenge for CNV studies is to narrow down this list of candidate genes.

Besides linkage and association studies, microarray gene expression studies are also being conducted to provide important insights into genes and pathways that might be dysregulated across ASDs [12,15, 39,41,78] and within individual subtypes of pervasive developmental disorders. Gene expression studies measure the activity (i.e., expression) of thousands of genes across the genome at once, to create a global picture of a specific cellular function or disease. But these studies often suffer from the problem of small sample sizes, and probe and platform specific artifacts [100].

However, the availability of vast collections of omics data from all these different types of studies motivates the development of sophisticated computational approaches to extract knowledge that will help us better understand the biological underpinnings of ASD. This goal is further motivated by the recent successes of computational methods in detecting and ranking causal genes for various complex diseases, including glioblastoma multiforme (GBM) [52], pancreatic cancer [120], and type 2 diabetes [111].

1.2 State of the art

In this section, we provide a brief overview of the computational methods currently available for predicting and prioritizing genes for diseases in general and ASDs in particular. As the challenge of predicting and prioritizing disease-causing genes is central to human health research, a large collection of computational methods have been developed to solve the general problem. The vast majority of these approaches are based on the human protein- protein interaction (PPI) network. We describe the main themes of these approaches as well as some representative methods. On the other hand, researchers have started to design other methods for the problem of ASD gene prediction and prioritization. We discuss the most recent work here.

1.2.1 General Trends in Disease Gene Prediction

General trends in designing computational methods for disease gene prediction-prioritization can be grouped loosely under four categories as discussed below (Table 1.1).

Methods Based on Protein Proximity in PPI Networks

Many of the current approaches for disease gene prioritization are based on the proximity of candidate genes to known disease genes within interactome networks using different scoring schemes. The intuition behind this is the 'guilt-by-association' hypothesis, which suggests that genes that are physically or functionally close to each other tend to be involved in the same biological pathways and have similar phenotypic effects [4,82]. Thus a key step in these approaches is to measure the distance between candidate genes and known disease genes in the PPI network. Approaches to measure proximity of elements in the PPI network are based on direct neighborhood, shortest path length, diffusion kernel, random walk with restart, propagation flow, etc. [117]

Oti et al. [85] predicted disease-causing genes in known disease loci by counting the number of known causative genes that are direct network neighbors (Table 1.1). The authors achieved approximately 10-fold enrichment by comparing their candidates to a random selection of candidate genes at the same locus. Krauthammer et al. [57] assigned known disease genes as seed nodes and computed the shortest path length between these and other nodes in the network. A node that has close proximity to multiple seed nodes receives a higher score as a candidate disease gene. However, Köhler et al. [56] demonstrated that the closeness of two proteins cannot be fully captured by their shortest path length. Different network structures surrounding two proteins imply different degrees of closeness between them. This can be captured by global distance measures, such as random walk with restarts and similarity-based diffusion kernel, by allowing equal probability of each protein diffusing along the links of the PPI network. The authors tested 783 genes under 110 disease families and achieved an area under the Receiver Operating Characteristic (ROC) curve of up to 98% on simulated linkage intervals containing 100 genes. Navlakha and Kingsford [75] compared the performance of disease gene prediction using different proximity measurements, including network neighbors, random walk with restarts, propagation flow, unsupervised graph partitioning, Markov clustering, and semi-supervised graph partitioning. They reported that random walk with restarts gave the best performance in terms of precision and recall. They also proposed a consensus method combining all closeness measures, which could capture different topological properties of the PPIs and yielded better performance than the individual measures.
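The random walk with restarts idea discussed above can be illustrated with a minimal sketch. This is not the implementation used in any of the cited works; the toy network, the function name `random_walk_with_restart`, the restart probability of 0.3, and the convergence tolerance are all assumptions chosen for illustration. The walker starts at the known disease gene (the seed), moves to a uniformly chosen neighbor at each step, and jumps back to the seed with a fixed probability, so the steady-state visiting probability acts as a global proximity score.

```python
import numpy as np


def random_walk_with_restart(adjacency, seeds, restart_prob=0.3,
                             tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Column-normalize so that column j holds the step probabilities out of node j.
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # leave isolated nodes as all-zero columns
    transition = adjacency / col_sums
    # The walker starts, and restarts, uniformly over the seed nodes.
    p0 = np.zeros(adjacency.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart_prob) * (transition @ p) + restart_prob * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Hypothetical toy network (not real PPI data): node 0 is the known disease gene.
# Edges: 0-1, 0-2, 1-2, 2-3, 3-4.
toy_adjacency = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
scores = random_walk_with_restart(toy_adjacency, seeds=[0])
ranking = np.argsort(-scores)  # nodes ordered from most to least proximal
```

In this toy network the score decays along the chain 2, 3, 4 as nodes get farther from the seed. Unlike a shortest-path measure, the score of a node also depends on how many alternative routes connect it to the seed.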

Methods Integrating Large-scale Genomic Data

In addition to being proximal in the interactome network, disease genes are assumed to share common features in annotations, gene expression, protein sequences, and domains and are likely to be involved in similar biological and functional pathways [38]. Thus, a number of computational methods have been designed to integrate genomic data from multiple sources to achieve better performance [45].

Endeavour, a prioritization algorithm through genomic data fusion, integrates functional annotations, microarray expression, expressed sequence tag (EST) expression, literature, protein domains, PPIs, pathway membership, cis-regulatory modules, transcriptional motifs, sequence similarity, and user data, and ranks the candidate genes based on their similarity to known disease genes for each of these features [3]. A global ranking to prioritize candidate genes is generated by combining the ranks of individual features using order statistics. Prioritizer, a Bayesian classifier based tool, consolidates data from different sources, such as gene ontology, gene expression, and PPIs, onto functional networks [34]. It assesses the closeness in the functional network of a candidate gene in one susceptible locus to genes residing in another locus, and assigns a higher score for a shorter distance. Prioritizer achieves 2.8-fold enrichment compared to random selection. While at least two susceptible loci are desired by Prioritizer, Linghu et al. [65] performed genome-wide prioritization by constructing an evidence-weighted functional linkage network of 21657 genes based on 16 data sources using a naive Bayes classifier. Candidate genes were assigned a score based on the sum of the weights of the network links to known disease genes. The method was able to achieve a 62% success rate on monogenic, polygenic, and cancer disease families, which was a marked improvement over the 44% success rate achieved by PPI network-only methods, confirming the importance of data integration in prioritizing disease genes.
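The last scoring rule described above, summing the weights of links to known disease genes, can be sketched in a few lines. This is a simplified illustration rather than the method of Linghu et al.: the gene names, link weights, and the function name `score_candidates` are hypothetical, and the step that derives the weights from multiple data sources with a naive Bayes classifier is not shown.

```python
def score_candidates(weighted_links, known_disease_genes):
    """Score each candidate by the summed weight of its links to known disease genes."""
    known = set(known_disease_genes)
    scores = {}
    for (gene_a, gene_b), weight in weighted_links.items():
        # A link counts only when exactly one endpoint is a known disease gene;
        # the other endpoint is the candidate that receives the weight.
        if gene_a in known and gene_b not in known:
            scores[gene_b] = scores.get(gene_b, 0.0) + weight
        elif gene_b in known and gene_a not in known:
            scores[gene_a] = scores.get(gene_a, 0.0) + weight
    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical functional linkage network: D* are known disease genes,
# C* are candidates, and the weights are made-up evidence scores.
toy_links = {
    ("D1", "C1"): 0.5,
    ("D2", "C1"): 0.25,
    ("D1", "C2"): 0.5,
    ("C2", "C3"): 1.0,   # no known disease gene involved, so it is ignored
    ("D1", "D2"): 2.0,   # both endpoints are known, so it is ignored
}
candidate_ranking = score_candidates(toy_links, known_disease_genes=["D1", "D2"])
```

Here `C1` ranks above `C2` because it is linked to two known disease genes, while `C3` receives no score because it has no direct link to any of them.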

Methods Integrating Phenotype Similarity

Diseases with similar phenotypes often share either a common set of underlying genes or functionally related genes [38]. Several studies reported that the integration of disease phenotype networks and PPI networks outperforms other approaches in the gene prioritization task [24,36,58,63,113,121,122]. Wu et al. [121] used a simple linear regression method called CIPHER (Correlating protein Interaction network and PHEnotype network to pRedict disease genes) to model the correlation between the phenotype similarity profile and closeness profile in the PPI network. The algorithm used the phenotype similarity data from van Driel et al.'s [112] text mining results along with curated PPIs from the Human Protein Reference Database (HPRD), Biomolecular Interaction Network Database (BIND), Molecular Interaction Database (MINT), and Online Predicted Human Interaction Database (OPHID). For each disease-gene pair, it calculated the Pearson correlation coefficient between the disease similarity profile and the gene closeness profile, which was recorded as a concordance score representing the association of the gene with the disease. CIPHER's performance was shown to be reliable and comparable to Endeavour.
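The concordance score described above is a Pearson correlation between two profiles defined over the same set of reference phenotypes. The sketch below shows only that final step, under stated assumptions: the profile values are made up, the function name `concordance_score` is hypothetical, and the way CIPHER derives the gene closeness profile from the PPI network is not reproduced here.

```python
import numpy as np


def concordance_score(phenotype_similarity, gene_closeness):
    """Pearson correlation between a phenotype similarity profile and a gene closeness profile."""
    x = np.asarray(phenotype_similarity, dtype=float)
    y = np.asarray(gene_closeness, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical profiles over four reference phenotypes.
# similarity_profile[i]: similarity of the query disease to reference phenotype i.
similarity_profile = [0.9, 0.1, 0.5, 0.3]
# closeness_*[i]: closeness of the candidate gene to the genes of reference phenotype i.
closeness_gene_a = [0.8, 0.2, 0.6, 0.3]   # tracks the similarity profile
closeness_gene_b = [0.1, 0.9, 0.4, 0.5]   # runs against the similarity profile

score_a = concordance_score(similarity_profile, closeness_gene_a)
score_b = concordance_score(similarity_profile, closeness_gene_b)
```

Gene A obtains a high positive score because it lies close to the genes of phenotypes that resemble the query disease, whereas gene B obtains a negative score and would be ranked lower.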

Based on the same phenotype similarity metric computed by van Driel et al. [112], Vanunu et al. [113] developed a slightly different method named PRINCE (PRIoritizatioN and Complex Elucidation). They calculated the association between a query disease and a gene with a known disease association using a logistic function dependent on the phenotype similarity between the query disease and the known disease. This disease-gene association was then used as prior knowledge in the prioritization function and iteratively smoothed over the network using propagation flow. PRINCE outperformed CIPHER by approximately 10% in ranking the real disease gene as the top scoring one when prioritizing genes for 1369 diseases with a known causal gene. Li and Patra [63] constructed a heterogeneous network by integrating the PPI network and phenotype network based on disease-gene relationships in the Online Mendelian Inheritance in Man (OMIM) database [1]. They developed an algorithm, RWRH (Random Walk with Restart on Heterogeneous network), which extends the random walk with restart algorithm from the PPI network alone to the entire heterogeneous network of PPIs and phenotypes. The authors reported that RWRH was superior to CIPHER in prioritizing disease genes under three circumstances: known disease genes and genetic loci, known disease genes but no known genetic loci, and no known disease genes or loci.
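The two ingredients of the prior-plus-smoothing scheme described above can be sketched as follows. This is an illustrative approximation, not the published PRINCE implementation: the logistic parameters (`steepness`, `midpoint`), the symmetric degree normalization, the mixing weight `alpha`, and the toy path network are all assumptions made for this example.

```python
import numpy as np


def logistic_prior(similarity, steepness=10.0, midpoint=0.5):
    """Map a phenotype similarity in [0, 1] to a prior association strength."""
    return 1.0 / (1.0 + np.exp(-steepness * (similarity - midpoint)))


def propagate(adjacency, prior, alpha=0.5, tol=1e-10, max_iter=1000):
    """Iteratively smooth prior scores over the network until they converge."""
    adjacency = np.asarray(adjacency, dtype=float)
    prior = np.asarray(prior, dtype=float)
    degree = adjacency.sum(axis=1)
    degree[degree == 0] = 1.0  # avoid division by zero for isolated nodes
    inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalization: W'[i, j] = A[i, j] / sqrt(d_i * d_j).
    normalized = adjacency * inv_sqrt[:, None] * inv_sqrt[None, :]
    scores = prior.copy()
    for _ in range(max_iter):
        # Each node mixes the flow from its neighbors with its own prior.
        updated = alpha * (normalized @ scores) + (1.0 - alpha) * prior
        if np.abs(updated - scores).max() < tol:
            return updated
        scores = updated
    return scores


# Hypothetical path network 0-1-2-3. Gene 0 is known to cause a disease whose
# phenotype similarity to the query disease is 0.9; the other genes have no prior.
path_adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
prior_scores = np.array([logistic_prior(0.9), 0.0, 0.0, 0.0])
propagated = propagate(path_adjacency, prior_scores)
```

After convergence, every gene has a positive score that decreases with distance from gene 0, so candidates near genes of phenotypically similar diseases are ranked first.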

Table 1.1 - Summary of the trends in disease gene prediction-prioritization methods.

Columns: Category; Reference; Method name; Features; PPI data source; Proximity measurement of elements in networks; Prediction methods. Entries are grouped by category below.

Category: Methods based on protein proximity in PPI network

Reference: Oti et al. [85]
Method name: -
Features: PPI
PPI data source: HPRD, human Y2H, fly, worm, yeast
Proximity measurement: Direct neighbor
Prediction methods: Predict a candidate gene as disease gene if it directly interacts with a known disease gene and resides in a known disease locus lacking identified disease genes.

Reference: Köhler et al. [56]
Method name: -
Features: PPI
PPI data source: HPRD, BIND, BioGrid, IntACT, DIP [97], STRING [115], mapped from worm, mouse, fruit-fly, and yeast
Proximity measurement: Direct neighbor, shortest path length, diffusion kernel, random walk with restart
Prediction methods: Rank candidate genes based on the proximity scores to known disease genes.

Reference: Navlakha and Kingsford [75]
Method name: -
Features: PPI
PPI data source: HPRD, OPHID
Proximity measurement: Random walk with restart, propagation flow, direct neighbor, graph partitioning, clustering, Markov clustering and their variants
Prediction methods: (i) Predict a gene as disease gene if it is located in a locus known to be associated with the disease and the network measurement score is above a threshold; (ii) Combine all 13 closeness measurements for the ensemble decision trees using a random forest classifier.

Category: Methods integrating large-scale genomic data

Reference: Aerts et al. [3]
Method name: Endeavour
Features: PPI, TXT, GO, EXP (microarray and EST), PDS, KEGG, TOUCAN, TRANSFAC, SEQ, and others
PPI data source: BIND
Proximity measurement: Direct neighbor
Prediction methods: Rank each candidate gene based on their similarity to known disease genes for each feature, then combine the ranks using order statistics to obtain final rank.

Reference: Franke et al. [34]
Method name: Prioritizer
Features: PPI, GO, EXP
PPI data source: BIND, HPRD, and large-scale experiments
Proximity measurement: Shortest path length in the integrated network
Prediction methods: Use a Bayesian classifier to build integrative networks, then score candidate genes based on distance to known disease genes using Gaussian kernel scoring function.

Reference: Radivojac et al. [92]
Method name: PhenoPred
Features: PPI, GO, Structure, SEQ, DO
PPI data source: HPRD, OPHID
Proximity measurement: Shortest path length
Prediction methods: Employ support vector machines for prediction.

Reference: Linghu et al. [65]
Method name: -
Features: Curated PPI, Y2H, Masspec, DDI, EXP, PDS, PG, GN, TXT, GO (molecular function and cellular component)
PPI data source: Curated PPIs are from HPRD, BIND, BioGrid, IntACT, MIPS [71], DIP, MINT [64], STRING, yeast, worm, fly, mouse-rat
Proximity measurement: Direct neighbor
Prediction methods: Use a Bayesian classifier to construct a weighted functional linkage network through integrating large-scale genomic data. The weights of the links to known disease genes are summed to score the candidate genes.

Reference: Karni et al. [49]
Method name: -
Features: PPI, EXP under disease conditions
PPI data source: HPRD, Y2H
Proximity measurement: Shortest path length
Prediction methods: Find smallest set of genes that cover the disease related genes using the Maximum expectation gene cover algorithm.

Category: Methods integrating disease phenotype information

Reference: Lage et al. [58]
Method name: -
Features: PPI
PPI data source: MINT, BIND, IntAct, Pprel [47,48], Ecrel [47,48], Reactome [27,72]
Proximity measurement: Direct neighbor (virtual pull down) counting in the evidence-weighted PPI network
Prediction methods: Construct candidate complexes by virtual pull down, then score the candidate gene by measuring the similarity between phenotype caused by the genes in the complex to the disease phenotype.

Reference: Wu et al. [121]
Method name: CIPHER
Features: PPI
PPI data source: HPRD, OPHID, BIND, MINT
Proximity measurement: Direct neighbor, shortest path length
Prediction methods: Use correlation coefficients as concordance score for each candidate gene based on linear regression of phenotype profile and PPI profile.

Reference: Wu et al. [122]
Method name: AlignPI
Features: PPI
PPI data source: HPRD
Proximity measurement: -
Prediction methods: Use NetworkBlast algorithm to align the PPI and phenotype networks and obtain high scoring sub-networks. Candidate genes are first assumed to be disease-associated and used in constructing sub-networks. The candidate gene with the highest scoring sub-network is taken as a positive prediction.

Reference: Care et al. [24]
Method name: -
Features: PPI, SNPs
PPI data source: BIND, IntAct, BioGrid, MINT
Proximity measurement: Direct neighbor
Prediction methods: Predict deleterious SNPs using random forest, then predict disease genes using the same learning approach based on PPI networks, phenotype similarities and deleterious SNPs.

Reference: Li and Patra [63]
Method name: RWRH
Features: PPI
PPI data source: HPRD
Proximity measurement: Random walk with restart on heterogeneous network
Prediction methods: Use RWRH to score genes and diseases simultaneously by allowing random walker jump between PPI and phenotype networks.

Reference: Vanunu et al. [113]
Method name: PRINCE
Features: PPI
PPI data source: HPRD and large scale experiments
Proximity measurement: Network propagation flow in the evidence weighted PPI network
Prediction methods: Use network propagation method to smooth flow in the PPI network and then use the final converged flow as scores for candidate genes.

Reference: George et al. [36]
Method name: Gentrepid
Features: Domain comparison, pathways, PPI
PPI data source: OPHID
Proximity measurement: Common pathway, similarity of protein domains
Prediction methods: Combines two methods, common pathway scanning (CPS) and common module profiling, for the automated prediction of disease genes within known disease intervals. First, known disease genes are used to predict novel disease genes in chromosomal intervals associated with the same disease. Second, without knowledge of the disease genes, candidate disease genes are predicted by comparing all the genes in the multiple intervals associated with the same disease to find common pathways or shared modules between proteins linking the intervals.

Category: Disease-module based methods

Reference: Taylor et al. [109]
Method name: -
Features: PPI, EXP (microarray)
PPI data source: OPHID, yeast, literature-curated
Proximity measurement: Betweenness centrality, shortest path length
Prediction methods: Identify hubs in the global network, then for each hub assess the average Pearson correlation coefficient of co-expression for each interaction and the hub and remove insignificant hubs. Classify remaining hubs based on length, phosphorylation, linear motifs, globularity, domain architecture, etc. For each hub, identify and disease-related genes.

Reference: Chen et al. [26]
Method name: -
Features: Co-expression network, disease-specific QTL
PPI data source: -
Proximity measurement: -
Prediction methods: Combine the gene expression and genotype data to construct co-expression networks, identify highly connected subnetworks within these. Identify QTL with pleiotropic effects using forward stepwise regression and multivariate likelihood test. Identify causal subnetworks by testing for enrichment of expression traits. Mark genes in the causal subnetworks as causal genes.

Reference: Liu et al. [67]
Method name: GNEA
Features: PPI, GO, DGAP expression data, manually curated gene sets
PPI data source: HPRD
Proximity measurement: Cumulative expression level
Prediction methods: Map the relative mRNA expression of every gene in each insulin resistance or diabetes condition to the associated protein in a global network of protein-protein interactions and identify significantly transcriptionally affected subnetworks. Test each gene set for over-representation in each identified subnetwork. For each gene set, assign a p-value to the number of conditions in which it was enriched based on comparison against a background distribution.

Reference: Dezső et al. [30]
Method name: -
Features: PPI, EXP
PPI data source: MetaCore [19]
Proximity measurement: Shortest path length
Prediction methods: Identify differentially expressed disease genes and map them on to PPI network. Construct shortest path subnetworks containing only the nodes in the shortest paths to disease genes. Calculate topological score for each gene in the subnetwork based on the number of shortest paths through the gene in the subnetwork as well as in the global PPI network.

PPI, protein-protein interaction; Y2H, yeast two hybrid experiment; PDS, protein domain sharing; PG, phylogenetic profiles; GN, gene neighbor; GO, gene ontology; EXP, gene expression; KEGG, Kyoto encyclopedia for genes and genomes for pathway membership; TOUCAN, cis-regulatory modules; TRANSFAC, transcriptional motifs; SEQ, sequence similarity; DO, disease ontology; TXT, literature text mining; Masspec, mass spectrometry; DDI, domain-domain interactions; SNPs, single nucleotide polymorphisms; DIP, database of interacting proteins; STRING, search tool for the retrieval of interacting genes/proteins; QTL, quantitative trait loci.

Disease Module-based Methods

In addition to generic candidate gene prioritization methods, significant efforts have been made towards the prediction of disease genes for individual diseases by constructing disease modules [11]. These methods start with identifying the disease modules or subnetworks, in which members would share similar functions, expression patterns or metabolic pathways assuming that breakdown of one such module causes a disease. This concept has been applied to a wide range of diseases, including several different types of cancers [25,59,77,109], type 2 diabetes [67], obesity [26], asthma [44], neurological diseases [43,73,93], and psoriasis [30].

Liu et al. [67] used a network based approach to identify an insulin signaling module as well as a network of molecular receptors that play significant roles in type 2 diabetes. Chen et al. [26] identified subnetworks in liver and adipose tissues that contain genes for which variants associated with obesity and diabetes have been identified. Taylor et al. [109] constructed disease-associated protein interaction modules for adenocarcinoma of the breast, providing useful predictors of breast cancer outcome. A slightly different approach was developed to prioritize disease-specific genes by constructing disease- and condition-specific subnetworks [30]. Disease-specific genes, differentially expressed under disease conditions, were mapped to the global PPI network. The shortest path subnetwork was then built by including only the nodes on the shortest paths connecting the disease-specific genes. Each node in this subnetwork was assigned a topological score by comparing the number of shortest paths through it in the subnetwork to the number of shortest paths through it in the global network. This scheme was able to identify novel candidate genes for psoriasis.

1.2.2 Computational Advances in ASD Gene Prediction

To implicate ASD risk genes, Liu et al. recently developed an algorithm called DAWN (Detecting Association With Networks) [66]. The algorithm is based on the intuition that ASD genes cluster within a co-expression network [87,119]. DAWN uses two kinds of data: rare variations from exome sequencing and gene co-expression in the mid-fetal prefrontal and motor-somatosensory neocortex. The algorithm casts the ensemble data as a Markov random field in which the graph structure is determined by gene co-expression, and it combines these interrelationships with node-specific observations, namely gene identity, expression, genetic data, and the estimated effect on disease risk. The algorithm works as follows: first it identifies 'hot spots' within the co-expression network at which multiple ASD risk genes (identified from exome data) cluster together. For these hot spots, it uses evidence from neighboring genes to reinforce the ASD signal, while in 'cooler' regions the absence of neighboring genes with evidence of ASD association downgrades the signal. By modeling this data, DAWN was able to identify 127 ASD risk genes, many of which are novel. It was also successful in predicting some known ASD genes that were not included in the genetic data used to create the model. In addition, the method was able to find three interesting sub-networks in support of the role of aberrant connectivity of neuronal circuits due to intrinsically abnormal synapses in ASD. Although DAWN's findings are currently limited by the power of test statistics derived from available samples with exome sequencing, its success shows that computational approaches hold promise in identifying ASD associated genes.

1.3 Contributions

To address the classic problem of disease-gene prediction in the context of ASDs, this thesis designs three novel computational methods, one modified random walk with restarts method, and a novel integrative method for combining these four with prior knowledge. While the recent computational approach for solving the problem of ASD gene prediction focuses mainly on rare variations from exome sequencing and gene co-expression data, our methods focus on computationally extracting knowledge from other data sources, including copy number variations (CNVs), phenotype similarity to ASD, and proximity to ASD genes in the PPI network.

Our first method utilizes the copy number variations that have ever been observed in the ASD population as well as appropriate control groups. We calculate an information entropy based score for all the genes that can be mapped to the reported CNV loci, taking into account their frequency of occurrence in ASD case-control groups. To the best of our knowledge, this is the first information theoretic approach to extract knowledge from disease CNVs.

In our second method we incorporate phenotype similarity information to quantify functional association of ASD genes to the rest of the genes. Our method incorporates disease/phenotype similarity scores computed by van Driel et al. [112] and gene-phenotype relationships from the Online Mendelian Inheritance in Man (OMIM) database [1]. This method is seeded by high confidence ASD genes from the literature to identify ASD-like phenotypes in OMIM. Genes involved in diseases with phenotypes similar to ASDs are ranked highly by this algorithm.

In our third method, we use the power of topological proximity in the network. We introduce a new diffusion based proximity metric for the proteins in the PPI network, namely Diffusion State ASD Proximity (DSAP). DSAP is defined on diffusion state distances (DSDs) in the PPI network, which are better than direct neighborhood and shortest path distances at capturing the functional association of proteins in the PPI network. The DSAP of a gene is calculated based on its diffusion state distances to ASD seed genes.

Since random walks with restarts are one of the most effective approaches to the generic disease-gene prediction problem, we customize this approach specifically for the ASD context for the first time. Our approach uses the global PPI network structure and can be considered a generalization of Google's PageRank algorithm. This method starts by identifying high confidence ASD genes from the literature and then runs a random walk with restarts on the connected PPI network to simulate network crosstalk between the genes in the network. The simulated crosstalk quantifies the functional association of ASD genes to the rest of the genes in the network. All these methods are shown to perform better than random selection.

Finally, we propose a novel integrative approach which incorporates CNV, phenotype similarity, and connectivity, proximity, and topological similarity in the PPI network with ASD-pathway knowledge from available literature. Each gene is assigned an association probability based on a logistic regression model. Lasso regularization with cross validation is performed to avoid over-fitting of the model. We show that the integrative approach significantly outperforms the above four methods.

We provide a number of interesting biological insights into the mechanism of ASDs by performing a series of analyses on the candidate genes selected by our integrative method. Pathway enrichment analysis reveals that our candidate gene set is overrepresented in a number of pathways related to , , and nervous system development. These pathways can be useful in explaining the pathophysiology of ASDs. We also find an interesting link between ASDs and Inflammatory Bowel Diseases (IBD) in that our candidate gene set has significant overlap with the majority of the IBD-related pathways. Furthermore, we identify a number of disjoint subnetworks in our candidate gene set, characterized by different categories of diseases and bio-functions, which provide an indication of the existence of subclasses of disorders in the autism spectrum. The topmost subnetwork, characterized by gastrointestinal disorders, is particularly interesting. Functional and gene ontology enrichment analyses help us identify a number of interesting molecular functions and biological processes in which the candidate genes are overrepresented. For some of these terms, the connection to ASDs is not obvious and is thus worth further investigation.

1.4 Outline

In Chapter 2, we describe three novel computational methods for predicting and prioritizing ASD genes. We also introduce a random walk based approach for solving the disease gene prediction problem in the context of ASDs for the first time. In Chapter 3, we describe a novel integrative analysis approach which outperforms the individual methods described in the previous chapter in identifying and ranking ASD genes. We select a set of candidate genes which are highly likely to be associated with ASDs. We perform a series of analyses to find significant pathways, bio-functions, diseases and subnetworks in which the candidate gene set is overrepresented. The methodology of the analyses as well as the results and their biological implications are discussed in Chapter 4. Finally, we present closing remarks and discussion in Chapter 5.

Chapter 2

Predicting and Prioritizing Candidate Genes for ASD

In this chapter we introduce three novel methods for gene prediction-prioritization for ASDs. The first one is based on the copy number variations observed in the ASD population as well as appropriate control groups. The second method incorporates disease similarity information with gene-phenotype mappings from OMIM to quantify the association of a gene to ASDs. The third method computes functional association of ASD seed genes with the rest of the genes in the network based on a new diffusion based proximity measure. Finally, we customize a random walk with restarts based algorithm for ASDs which takes into consideration the proximity and connectivity information of the genes in the global PPI network to quantify the ASD-association of genes in the network.

The landscape of genes for our methods covers the largest connected component of the PPI network constructed using human PPIs collected from BioGRID [103] and ASD related PPIs collected from the SFARI Autism PIN module [13]. It comprises 22192 genes and 227341 interactions. In what follows we refer to this largest connected component of the PPI network as the "connected PPI network". For measuring the performance of our methods, we need to consider a set of ASD genes as a gold standard. We collected a list of known ASD genes from the SFARI Human Gene Module [13]. As of June 2014, this module reported 606 known human genes in connection to ASDs, 548 of which reside in the largest connected component of our PPI network. We use these genes as our gold standard (Appendix A).
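Restricting a PPI network to its largest connected component can be sketched as follows. This is a minimal illustration on a toy edge list, not the pipeline used in this thesis; the function name `largest_connected_component` and the toy gene labels are hypothetical.

```python
from collections import deque

def largest_connected_component(edges):
    """Return the node set of the largest connected component of an undirected graph."""
    # Build an adjacency map from the (u, v) interaction pairs.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Toy interactions (hypothetical labels): A-B-C form one component, D-E another.
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
lcc = largest_connected_component(toy_edges)
```

Genes outside the returned set would simply be dropped before any network-based scoring.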

2.1 CNV Information Entropy based Prioritizer

2.1.1 Copy Number Variation (CNV)

For decades, it has been known to researchers that chromosomal rearrangements can result in a wide range of developmental disorders. However, technological and computational advances in the past decade have enabled the development of assays capable of identifying submicroscopic structural changes in chromosomes that could not have been detected by traditional cytogenetic analysis. Among the most heavily scrutinized of these structural variants are copy number variants, or CNVs. CNVs refer to submicroscopic chromosomal deletions and/or duplications that are typically defined as DNA segments of 1000 base pairs or larger in size that are present in a varying (or zero) number of copies when compared to a reference genome [94] (Figure 2-1).

[Figure 2-1 panels, left to right: Deletion (pair of chromosomes with one copy of "C"); Normal pair of chromosomes; Duplication (pair of chromosomes with three copies of "C").]

Figure 2-1: Copy number variations in a pair of chromosomes. The pair of normal chromosomes (middle pair) each have sections A-B-C-D. However, the loss of section C from one of the chromosomes results in an abnormal chromosome with only sections A-B-D (left pair); an individual with this deletion has only one copy of section C in their chromosomes. On the other hand, the gain of an extra copy of section C on one of the chromosomes results in an abnormal chromosome with sections A-B-C-C-D (right pair); an individual with this duplication has three copies of section C in their chromosomes. Thus, both of the individuals (left and right) have CNVs involving section C: one has lost a copy, the other has gained a copy, but both have a varied number of copies of C when compared to the reference pair of chromosomes.

There are many CNVs throughout the genome that have no adverse influence on the individual(s) harboring them in the general population. However, there are also a large number of CNVs that have been definitively linked with diseases. Evidence also indicates that interaction with additional genetic or environmental factors may influence whether CNVs have a detectable adverse effect on an individual.

2.1.2 Copy Number Variants in ASD

Analyses of large autistic populations over the past decade suggest that CNVs at specific locations in the genome result in increased susceptibility to ASD [69]. It has been estimated that 10-20% of ASD cases result from the presence of one or more pathogenic CNVs in an affected individual [2]. This finding implies that CNVs are one of the most common genetic causes of ASD, if not the most common.

In 2003, Simons Foundation launched the project "Simons Foundation Autism Research Initiative (SFARI)" to advance the research of autism spectrum disorders. SFARI Gene [13] is a publicly available, curated, web-based, searchable, integrated resource, made available to the autism research community by SFARI. This resource is built on information extracted from the studies on molecular genetics and biology of ASD. SFARI Gene includes genetic, proteomic, and structural variation data from linkage and association studies, cytogenetic abnormalities, and specific mutations associated with ASD. The Copy Number Variant (CNV) module of SFARI Gene is a comprehensive, up-to-date collection of all copy number variants associated with autism spectrum disorders (ASD). The content of the CNV module is compiled in a systematic way from available case studies, CNV studies, and large-scale, genome-wide CNV screens. CNVs from autistic case cohorts and, when available, unaffected control cohorts are reported by the module. CNVs in the module are organized based upon the locus (chromosomal region or band) in which they were observed in each study. As of March 2014, more than 1800 CNV loci have been reported in connection with ASDs. These CNVs map to thousands of genes, which is too large a number to be useful. Thus, we sought an intelligent approach to narrow down the number of ASD risk genes by utilizing the copy number variants reported in ASDs.

2.1.3 Calculating Information Entropy Score from CNVs

We downloaded the CNV loci and the corresponding case-control occurrence data from the SFARI CNV module [13]. We collected sideband annotations for chromosomes from Ensembl [51]. Human gene-locus mapping information was collected from [68]. We designed a mapper that maps the CNVs to the corresponding genes and calculates their frequency of occurrence in cases and controls using the aforementioned information. Then, for each mapped gene $g$, we calculated the information entropy score $\rho_g$ using Formula 2.1. The workflow for our CNV-based prioritizer is shown in Figure 2-2.

$$\rho_g = K_g \times (1 - IE_g) + \text{offset} \tag{2.1}$$

Here, $f_g^{y}$ denotes the number of occurrences of gene $g$ in disease group $y$, where $y \in \{\text{case}, \text{control}\}$; $p_g$ denotes the probability of gene $g$ occurring in ASD cases; $K_g$ is the scaling factor corresponding to gene $g$, for which three alternatives are considered; and $IE_g$ denotes the information entropy of gene $g$. These terms are defined by Formula 2.2.

$$K_g = \begin{cases} \ldots \\ \sqrt{\left(f_g^{case}\right)^2 + \left(f_g^{control}\right)^2} \\ \ldots \end{cases} \tag{2.2a}$$

$$IE_g = -p_g \log_2(p_g) - (1 - p_g)\log_2(1 - p_g) \tag{2.2b}$$

$$p_g = \frac{f_g^{case}}{f_g^{case} + f_g^{control}}$$

We calculated $\rho_g$ using three different scaling factors $K_g$ and chose the one that gives the largest area under the Receiver Operating Characteristic (ROC) curve (AUC). The selected scaling factor is $K_g = \sqrt{\left(f_g^{case}\right)^2 + \left(f_g^{control}\right)^2}$, as it gives an AUC of 59.81% (Figure 2-3). We chose a small positive number, such as $10^{-6}$, as the offset. All genes in the human PPI network to which the mapper maps no CNV were assigned a score equal to the offset value.
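The scoring step of Formulas 2.1 and 2.2 can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the function names are hypothetical, the scaling factor is assumed to be the square root of the sum of squared case and control counts, and the entropy is taken to be zero when a gene occurs only in cases or only in controls (the usual convention that 0 log 0 = 0).

```python
import math

OFFSET = 1e-6  # small positive offset, as described in the text

def binary_entropy(p):
    """Information entropy (in bits) of a case probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # convention: 0 * log2(0) = 0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def cnv_score(f_case, f_control, offset=OFFSET):
    """Information entropy based score of one gene from its CNV counts."""
    total = f_case + f_control
    if total == 0:
        return offset  # gene with no mapped CNV gets the offset only
    p = f_case / total                          # probability of occurring in cases
    k = math.sqrt(f_case ** 2 + f_control ** 2) # assumed scaling factor
    return k * (1.0 - binary_entropy(p)) + offset

# Toy counts (hypothetical): (occurrences in cases, occurrences in controls).
toy_counts = {"geneA": (3, 0), "geneB": (5, 5), "geneC": (0, 0)}
toy_scores = {g: cnv_score(fc, fn) for g, (fc, fn) in toy_counts.items()}
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

A gene seen equally often in cases and controls has entropy 1 and therefore falls back to the offset, while a gene seen in only one group keeps its full scaling factor.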

[Figure 2-2 workflow: chromosome sideband annotations from Ensembl, CNV loci in cases and controls from SFARI, and gene-locus mappings from Entrez are the inputs to the Mapper; the Mapper outputs gene frequencies in cases and controls; the Scorer converts these into information entropy based scores.]

Figure 2-2: Steps in CNV-based prediction-prioritization of ASD genes. At first, our custom-built mapper maps CNV loci in ASD case-control groups from SFARI CNV module to genes using chromosome sideband annotations from Ensembl and gene-locus mapping information from Entrez. The mapper also counts the numbers of occurrences of each gene in the case group and control group separately. Next, the scorer calculates an information entropy based score for each gene based on its frequency of occurrence in ASD case-control groups. Genes are ranked in descending order of entropy based scores.

2.1.4 Quality of CNV Information Entropy based Prioritization

To measure the quality of our information entropy based ranking, we calculated the area under the Receiver Operating Characteristic (ROC) curve (AUC) (Figure 2-3). The true positive rate (TPR), or recall, and the false positive rate (FPR) are calculated using Equations 2.3 and 2.4, respectively. Using any of the scaling factors we get an AUC of approximately 59%, which is better than the random case (AUC = 50%).

Since we are more interested in identifying the ASD genes than the non-ASD ones, we look at the lift chart for this method (Figure 2-4). The lift chart shows how much more likely we are to identify ASD genes than if we make random guesses. For example, by considering only the top 2% of genes in the ranklist found by our method, we are able to identify 2.3 times as many known ASD genes as we would by random selection. This enrichment indicates a reasonable improvement considering the unbalanced nature of our dataset, with ASD genes accounting for only 2.5% of the entire dataset. Here, the lift of a bucket, or a group of genes in the dataset, is calculated using Equation 2.5.

$$\text{recall} = \text{TPR} = \frac{\text{Number of ASD genes correctly identified by the method}}{\text{Total number of ASD genes in the dataset}} \tag{2.3}$$

$$\text{FPR} = \frac{\text{Number of non-ASD genes wrongly identified as ASD genes by the method}}{\text{Total number of non-ASD genes in the dataset}} \tag{2.4}$$

$$\text{lift of a bucket} = \frac{\text{Percentage of true ASD genes in the bucket identified by the method}}{\text{Percentage of ASD genes in the bucket selected randomly}} \tag{2.5}$$
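The evaluation measures of Equations 2.3 to 2.5 can be sketched for a ranked gene list as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, a "bucket" is taken to be the top fraction of the ranklist, tied scores are not treated specially, and the list is assumed to contain at least one ASD gene and one non-ASD gene.

```python
def roc_points(ranked_genes, positives):
    """Return (FPR, TPR) pairs obtained by cutting the ranking after each gene."""
    n_pos = sum(1 for g in ranked_genes if g in positives)
    n_neg = len(ranked_genes) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for gene in ranked_genes:
        if gene in positives:
            tp += 1  # known ASD gene ranked above the cut
        else:
            fp += 1  # non-ASD gene ranked above the cut
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def lift_at(ranked_genes, positives, fraction):
    """Lift of the top `fraction` of the ranklist over random selection."""
    size = max(1, round(fraction * len(ranked_genes)))
    top = ranked_genes[:size]
    rate_in_top = sum(1 for g in top if g in positives) / len(top)
    rate_overall = sum(1 for g in ranked_genes if g in positives) / len(ranked_genes)
    return rate_in_top / rate_overall

# Toy ranking (hypothetical labels); g1 and g3 play the role of known ASD genes.
ranking = ["g1", "g2", "g3", "g4"]
known = {"g1", "g3"}
toy_auc = auc(roc_points(ranking, known))
toy_lift = lift_at(ranking, known, 0.25)
```

A lift above 1 in the top bucket means the ranking concentrates known ASD genes there more than random selection would.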

[ROC plot: x-axis, False Positive Rate (FPR). Legend: Scaling Factor 1, AUC = 0.5930; Scaling Factor 2, AUC = 0.5981; Scaling Factor 3, AUC = 0.5977; Baseline, AUC = 0.5000.]

Figure 2-3: Receiver operating characteristic curves for CNV-based prioritizer using different scaling factors.

2.2 ASD Similarity based Prioritizer

2.2.1 Similarity of Phenotypes or Diseases

Similarity between phenotypes reflects biological modules of interacting functionally-related genes. These similarities are positively correlated with a number of measures of gene function, including relatedness at the level of protein sequence, protein motifs, functional annotation, and direct protein-protein interaction [112]. In fact, genes or proteins associated with similar diseases or phenotypes lie in close proximity in the PPI network. Furthermore, phenotype grouping reflects the modular nature of human disease genetics. These facts bring forth the idea of utilizing disease or phenotype similarity information for identification of disease genes.

[Lift chart: x-axis, % of genes in the ranklist.]

Figure 2-4: Lift chart for CNV-based prioritizer.

In 2006, van Driel et al. [112] introduced a text mining algorithm to compute disease or phenotype similarity information for 5080 phenotypes collected from the OMIM database. The steps of the algorithm can be summarized as follows.

* First, all the OMIM records are scanned, and their keywords are checked for presence in the anatomy (A) and the disease (C) sections of the Medical Subject Headings (MeSH) vocabulary. MeSH is a controlled vocabulary of the U.S. National Library of Medicine. It is especially useful for applications that use information containing different terminologies for identical medical concepts.

* Each OMIM record is then represented by a (0,1)-vector where each entry of the vector corresponds to whether a term is present (denoted by 1) or absent (denoted by 0) in the record.

* Similarity of two phenotypes is then computed by calculating the cosine of the angle between their respective feature vectors. The similarity score ranges from 0 to 1.

We collected the phenotype similarity matrix computed by van Driel et al., which is available through a web interface (http://www.cmbi.ru.nl/MimMiner/).
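The last two steps of the summary above, (0,1)-vectors followed by cosine similarity, can be sketched as follows. This is only an illustration of the simplified description given here, not the MimMiner implementation; the vocabulary terms, record contents, and function names are hypothetical.

```python
import math

def term_vector(record_terms, vocabulary):
    """(0,1)-vector: 1 if the vocabulary term occurs in the record, else 0."""
    return [1 if term in record_terms else 0 for term in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary and two hypothetical phenotype records.
vocabulary = ["brain", "seizures", "heart", "intellectual disability"]
record_a = {"brain", "seizures"}
record_b = {"brain", "intellectual disability"}

similarity = cosine_similarity(term_vector(record_a, vocabulary),
                               term_vector(record_b, vocabulary))
```

Because the entries are non-negative, the cosine always lies between 0 (no shared terms) and 1 (identical term sets), which matches the stated score range.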

2.2.2 Gene-Phenotype Association Data

OMIM provides a publicly accessible and comprehensive database of genotype-phenotype relationships in humans. We downloaded gene-phenotype relationship information from the OMIM database [1]. We retained only those gene-phenotype relationships where the phenotype also has a similarity score available in the disease similarity matrix computed by van Driel et al. [112]. We then mapped the genes associated with those phenotypes onto the connected PPI network. Note that multiple genes can be mapped to a single phenotype and one gene can be involved in multiple phenotypes. Thus, after this step, we are left with a total of 1474 genes mapped to 1999 OMIM phenotypes.

2.2.3 Calculating ASD Similarity Scores

We use the disease similarity matrix computed by van Driel et al. [112] and the gene-phenotype association data from the OMIM database [1] to compute the association between each gene and our disease of interest, ASD. We call this association the ASD similarity score of the gene. Let $\delta = \{d_1, d_2, d_3, \ldots, d_t\}$ be the set of diseases for which similarity scores are available, and let $\varphi(d_i, d_j)$ denote the similarity between diseases $d_i$ and $d_j$. Also let $\delta_g \subseteq \delta$ be the set of phenotypes associated with gene $g$. Let $S$ be the set of seed genes which are known to be associated with ASD with high confidence. We select the genes that appear in eight or more ASD-related studies from our gold standard (Appendix A) as the seed set. Thus our seed set $S$ contains 106 genes. Let $\delta_S$ denote the set of phenotypes related to the ASD seed genes. We compute the association of a gene to ASD by looking at the similarity of the phenotypes related to it to the ASD-like phenotypes $\delta_S$ (Equation 2.6). The association score is normalized by the sum of pairwise similarities of the ASD-like phenotypes. Thus we get an ASD similarity score $\psi_g$ for each gene in the largest connected component of the PPI network. Genes with no phenotype mapping receive an ASD similarity score of zero.

$$\psi_g = \frac{\sum_{d_i \in \delta_g} \sum_{d_j \in \delta_S} \varphi(d_i, d_j)}{0.5 \times \sum_{d_m \in \delta_S} \sum_{d_n \in \delta_S} \varphi(d_m, d_n)} \tag{2.6}$$
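The ASD similarity score of Equation 2.6 can be sketched as a pair of nested sums. This is a minimal illustration, assuming the numerator sums the similarity between every phenotype of the gene and every ASD-like phenotype, and the denominator is half the sum over all ordered pairs of ASD-like phenotypes (including each phenotype paired with itself). The function names, phenotype labels, and toy similarity values are hypothetical.

```python
def asd_similarity_score(gene_phenotypes, asd_phenotypes, similarity):
    """Normalized similarity of a gene's phenotypes to the ASD-like phenotypes."""
    numerator = sum(similarity(di, dj)
                    for di in gene_phenotypes for dj in asd_phenotypes)
    denominator = 0.5 * sum(similarity(dm, dn)
                            for dm in asd_phenotypes for dn in asd_phenotypes)
    if denominator == 0:
        return 0.0
    return numerator / denominator

# Toy symmetric similarity table keyed by unordered phenotype pairs (hypothetical values).
toy_table = {
    frozenset(["p1"]): 1.0,          # self-similarity of p1
    frozenset(["p2"]): 1.0,          # self-similarity of p2
    frozenset(["p1", "p2"]): 0.4,
    frozenset(["p1", "p3"]): 0.7,
}

def toy_similarity(a, b):
    """Look up the similarity of two phenotypes; unknown pairs count as 0."""
    return toy_table.get(frozenset([a, b]), 0.0)

asd_like = ["p1", "p2"]                                               # phenotypes of seed genes
score_with_phenotype = asd_similarity_score(["p3"], asd_like, toy_similarity)
score_without_phenotype = asd_similarity_score([], asd_like, toy_similarity)
```

A gene with no mapped phenotype has an empty numerator and therefore scores zero, in line with the text.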

2.2.4 Performance of ASD Similarity based Prioritizer

By sorting the genes in descending order of ASD similarity scores, we obtain the ASD similarity based ranking of genes. To measure the performance of the ASD similarity based prioritizer, we calculate the area under the Receiver Operating Characteristic (ROC) curve (AUC) (Figure 2-5). The TPR and FPR are calculated using Equations 2.3 and 2.4, respectively, as before. We measure the performance of this method on 22086 genes of the PPI network. These genes do not include the 106 ASD genes used in identifying ASD-like phenotypes. We achieve an AUC of 55.96% using this method, which is better than the random case (AUC = 50%).

Figure 2-6 depicts the lift chart for this method which shows how much more likely we are to identify ASD genes than if we make random guesses. By considering only the top 2% of genes in the ranklist found by our method, we are able to identify 3.62 times as many known ASD genes, in comparison to using no method. This enrichment indicates quite an improvement considering the imbalanced nature of our dataset with ASD genes accounting for roughly 2% of the entire dataset. Here, lift is calculated using Equation 2.5.

[ROC plot: x-axis, False Positive Rate (FPR); AUC = 0.5596.]

Figure 2-5: Receiver operating characteristic curve for ASD similarity based prioritizer.

[Lift chart: x-axis, % of genes in the ranklist. Legend: Excluding Seeds; Including Seeds; Baseline.]

Figure 2-6: Lift chart for ASD similarity based prioritizer.

2.3 Diffusion State ASD Proximity based Prioritizer

2.3.1 Diffusion State Distance (DSD) in PPI Network

As discussed in Section 1.2.1, functional similarity of genes or proteins in the PPI network is often inferred based on direct interaction or some notion of network proximity in a local neighborhood. Most disease gene prediction-prioritization methods measure local proximity based on either direct neighborhood or shortest path distance, but this has only a limited ability to capture fine-grained neighborhood distinctions because most proteins are close to each other, and there are many ties in proximity. Also, the accuracy of these methods is often limited by the incomplete and noisy nature of the PPI data. Addressing these issues, Cao et al. [23] introduced Diffusion State Distance (DSD), a new distance metric based on the graph diffusion property. DSD captures fine-grained distinctions in proximity for transfer of functional annotation in the PPI network and is able to perform much better than the conventional distance metrics.

Definition of DSD Metric

Cao et al. [23] defined diffusion state distance (DSD) as follows: let $G(V, E)$ be the undirected connected PPI network, where $V = \{v_1, v_2, v_3, \ldots, v_n\}$ is the set of genes or proteins in the network with $|V| = n$, and $E = \{e_1, e_2, e_3, \ldots, e_m\}$ is the set of interactions, where an edge $e = (v_i, v_j)$ denotes the interaction between genes $v_i$ and $v_j$. Let $He^{(k)}(A, B)$ be the expected number of times that a random walk starting at node $A$ and proceeding for $k$ steps will visit node $B$. Assuming $k$ is fixed, $He^{(k)}(A, B)$ can be simply denoted as $He(A, B)$. The $n$-dimensional vector $He(v_i)$, $\forall v_i \in V$, is defined as

$$He(v_i) = \left(He(v_i, v_1), He(v_i, v_2), \ldots, He(v_i, v_n)\right).$$

Then, the DSD between two genes $u$ and $v$, $\forall u, v \in V$, is given by Equation 2.7.

$$DSD(u, v) = \|He(u) - He(v)\|_1 \tag{2.7}$$

where $\|\cdot\|_1$ denotes the $L_1$ norm of a vector.

Cao et al. [23] proved three lemmas establishing that DSD is a metric: it is symmetric, positive definite (non-zero whenever $u \neq v$), and it obeys the triangle inequality. It also converges as $k$ approaches infinity, and thus can be defined independently of $k$.
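The definition of DSD for a fixed walk length can be sketched directly from the random walk. This is a minimal illustration for toy graphs, not the implementation of Cao et al. [23]: the function names are hypothetical, the walk is unweighted, and the start node is assumed to count as a visit (step 0 is included in the expected visit counts).

```python
def expected_visits(adj, k):
    """For every start node, the expected number of visits to each node in a k-step walk."""
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    he = {}
    for start in nodes:
        dist = [0.0] * n
        dist[index[start]] = 1.0  # position distribution at step 0
        visits = dist[:]          # assumption: the start counts as one visit
        for _ in range(k):
            nxt = [0.0] * n
            for v in nodes:
                mass = dist[index[v]]
                if mass:
                    share = mass / len(adj[v])  # uniform step to a neighbor
                    for neighbor in adj[v]:
                        nxt[index[neighbor]] += share
            dist = nxt
            visits = [a + b for a, b in zip(visits, dist)]
        he[start] = visits
    return he

def dsd(he, u, v):
    """Diffusion state distance: L1 norm of the difference of the two He vectors."""
    return sum(abs(a - b) for a, b in zip(he[u], he[v]))

# Toy undirected path graph a - b - c - d (hypothetical labels).
toy_adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
he = expected_visits(toy_adj, k=5)
```

For a network of realistic size this per-node simulation is far too slow; the closed form of Lemma 3 is the practical route, which this sketch does not attempt.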

Lemma 1 DSD is a metric on V, where V is the vertex set of a simple connected graph G(V, E).

Lemma 2 Let G be a connected graph whose random walk one-step transition probability matrix P is diagonalizable and ergodic as a Markov chain, then for any u, v E V, DSD(u, v) converges as k, the length of the random walk, approaches infinity.

Lemma 3 Let $G$ be a connected graph whose random walk one-step transition probability matrix $P$ is diagonalizable and ergodic as a Markov chain. Then for any $u, v \in V$, we have $\lim_{k \to \infty} DSD^{(k)}(u, v) = \|(b_u^T - b_v^T)(I - P + W)^{-1}\|_1$, where $I$ is the identity matrix, $W$ is the constant matrix in which each row is a copy of $\pi^T$, $\pi^T$ being the unique steady state distribution, and for any $i \in V$, $b_i^T$ is the $i$th basis vector, i.e., the row vector of all zeros except for a 1 in the $i$th position.

Proofs of these lemmas are out of the scope of this discussion, but can be found in [23].

2.3.2 Calculating Diffusion State ASD Proximity (DSAP) of Genes

A key step in our approach towards ASD gene prediction-prioritization is to measure the proximity between candidate genes and known ASD genes in the connected PPI network. For this purpose, we define a new proximity measure based on DSD and call it Diffusion State ASD Proximity, or DSAP for short. Let $DSD(u, v)$ denote the pairwise diffusion state distance between any two nodes $u, v \in V$ in the connected PPI network $G(V, E)$, as defined by Equation 2.7. Let $S$ be the set of genes known to be associated with ASD with high confidence. Out of the 548 genes in our gold standard (Appendix A), 106 genes appear in eight or more ASD related studies. We build our ASD gene set $S$ using these genes. We define the pairwise diffusion state proximity (DSP) of two nodes $u, v \in V$ by a Gaussian kernel over $DSD(u, v)$ as follows.

$$DSP(u, v) = e^{-\left(\frac{DSD(u, v)}{7}\right)^2}$$

Here, we divide $DSD(u, v)$ by 7 so that the $DSP(u, v)$ value does not become too small, given that the median $DSD(u, v)$ for the connected human PPI network is found to be approximately equal to 7. Then, we define the diffusion state ASD proximity of a gene $g \in V$ by Equation 2.8.

$$DSAP(g) = \sum_{s \in S} DSP(g, s) \tag{2.8}$$

We calculate DSAP scores for all the genes in the connected PPI network and sort them in descending order of DSAP scores which gives us the DSAP-based ranking of genes.
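The scoring and ranking steps can be sketched as follows. This is a minimal Python illustration under stated assumptions: the pairwise DSD values are taken as already computed, Equation 2.8 is read as a plain sum of DSP(g, s) over the seed genes s in S, and the gene names, DSD values, and function names are hypothetical.

```python
import math


def dsp(dsd_uv, scale=7.0):
    """Diffusion state proximity: Gaussian kernel over a DSD value."""
    return math.exp(-((dsd_uv / scale) ** 2))


def dsap_scores(dsd, genes, seeds):
    """DSAP of each gene: sum of its DSP to every seed gene.

    dsd maps a (gene, seed) pair to the pairwise DSD between them.
    """
    return {g: sum(dsp(dsd[(g, s)]) for s in seeds) for g in genes}


def rank_by_dsap(scores):
    """Genes sorted in descending order of DSAP score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical DSD values for three genes and a single seed gene "A".
TOY_DSD = {("A", "A"): 0.0, ("B", "A"): 7.0, ("C", "A"): 14.0}
TOY_SCORES = dsap_scores(TOY_DSD, ["A", "B", "C"], ["A"])
TOY_RANKING = rank_by_dsap(TOY_SCORES)
```

A gene at the median distance of 7 from a seed contributes exp(-1) to the score, so the scale of 7 keeps typical contributions away from zero.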

2.3.3 Quality of DSAP-based Ranking

To measure the performance of the DSAP-based prioritizer, we calculate the area (AUC) under the Receiver Operating Characteristic (ROC) curve (Figure 2-7). The TPR and FPR are calculated using Equations 2.3 and 2.4 respectively as before. We measure the quality of ranking on 22086 genes of the PPI network. These genes do not include the 106 ASD genes used in measuring proximity to ASD. We achieve an AUC of 54.05% using this method, which is better than the random case (AUC = 50%). With this approach we achieve a lift of 1.1% of ASD genes (excluding seeds) in the top 4% of the ranklist over random selection. Although this measure is worse than the previous two methods, it is still able to identify more non-seed ASD genes than random selection.

Inclusion of seed genes boosts the lift up to 5.7-fold, which means that the seed genes are very close to each other in the network in terms of DSAP and hence make up a significant portion of the top 4% of the ranklist. The lift chart including the seeds is shown in Figure 2-8. Here, lift is calculated using Equation 2.5.
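The two evaluation measures used in this section can be sketched in a few lines of Python. This is an illustration under assumptions, not the evaluation code used in this work: AUC is computed through its pairwise-comparison form rather than by integrating the ROC curve, and lift is taken to be the number of ASD genes found in the top fraction of the ranklist divided by the number expected under random selection, which is one common reading of a lift chart and may differ in detail from Equation 2.5. The function names are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive is scored above a
    randomly chosen negative, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def lift_at(scores, labels, fraction):
    """Lift in the top `fraction` of the ranklist relative to random selection."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    hits = sum(y for _, y in ranked[:k])
    expected = k * sum(labels) / len(labels)
    return hits / expected
```

With labels set to 1 for known ASD genes and 0 otherwise, `lift_at(scores, labels, 0.04)` corresponds to the lift in the top 4% of the ranklist.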

[Figure 2-7 plot. Horizontal axis: False Positive Rate (FPR).]

Figure 2-7: Receiver operating characteristic curve for Diffusion State ASD Proximity (DSAP) based prioritizer.

[Figure 2-8 plot. Legend: Excluding Seeds, Including Seeds, Baseline. Horizontal axis: % of Genes in the Ranklist.]

Figure 2-8: Lift chart for Diffusion State ASD Proximity (DSAP) based prioritizer.

2.4 Network Crosstalk based Prioritizer

2.4.1 Motivation

Functional association between genes or proteins in the PPI network is often measured using diffusion kernel, random walk with restart, or propagation flow based algorithms. These approaches are global in nature in that they consider multiple alternate paths and the whole topology of the PPI network. The basic steps for most of these approaches are as follows. First, identify seed genes that are significantly associated with the disease of interest. Next, map these seed genes onto the PPI network. Finally, quantify the functional association between genes in the PPI network and the seed genes based on network proximity and connectivity in a global manner. As discussed in Section 1.2.1, these approaches have recently been successfully applied to identify genes for a number of complex diseases including different types of cancers, type-2 diabetes, neurological disorders, psoriasis, and asthma. Motivated by these successes, we aim to develop a global network-based scoring scheme to quantify functional association between ASD seed genes and the rest of the genes in the human PPI network. We redefine the notion of network crosstalk introduced by Nibbe et al. [76] in the context of ASDs and compute ASD association in an approach based on random walk with restarts. To the best of our knowledge, this is the first attempt to capture functional association of ASD genes via network connectivity and proximity in a global manner.

2.4.2 Problem Formulation

Following closely the approach adopted by Nibbe et al. [76] for identifying candidate genes and subnetworks for human colorectal cancers, we reformulate the problem of disease gene prediction-prioritization for ASDs. Let G = (V, E) be the connected PPI network, where V consists of the genes in the network, and an undirected edge $e(u, v) \in E$ represents an interaction between genes $u \in V$ and $v \in V$. Let N(v) be the set of direct neighbors (i.e., interacting partners) of gene $v \in V$, i.e., $N(v) = \{u \in V : (u, v) \in E\}$. Let $S \subset V$ be the set of genes known to be associated with ASD with high confidence. Among the 548 gold standard ASD genes, 106 genes appear in eight or more ASD related studies. We build our ASD gene set S using these genes. Our goal is to compute a score $\alpha(v)$ for each gene $v \in V$, to quantify network crosstalk between v and the genes in S, network crosstalk being the indicator of functional association between genes.

In order to develop a biologically sound measure of network crosstalk, Nibbe et al. [76] relied on two observations.

(i) Functional similarity between proteins is significantly correlated with their network proximity, as measured by the number of hops between these proteins.

(ii) Existence of multiple alternate paths between two proteins is an indicator of their functional association, since functional multiple paths are often conserved through evolution owing to their contribution to robustness against perturbations, as well as amplification of signals.

Like Nibbe et al. [76], we compute network crosstalk scores for genes in the PPI network using an information flow approach based on random walks with restarts. This approach incorporates both the number of hops and multiple alternate paths between genes into the assessment and can be considered as a generalization of Google's well-known Pagerank algorithm [17].

2.4.3 Calculating Network Crosstalk Scores

For a given ASD seed gene set, S, we calculate network crosstalk scores for all the genes in the PPI network by simulating a random walk as follows. The random walk starts at a randomly chosen gene in S. At each step, when the random walk is at some gene $v \in V$, it either moves to a neighbor of v with probability 1 - r, or it restarts at a gene in S with probability r. Here, the parameter 0 < r < 1 is called the restart probability. For each move, the neighbor to be moved to is chosen uniformly at random from N(v). Similarly, for each restart the gene to be restarted from is selected uniformly at random from S.

The network crosstalk between the genes in S and each gene $v \in V$ can be computed as the relative amount of time spent at v by such an infinite random walk, or equivalently, the probability that the random walk will be at gene v at a randomly chosen time step after the random walk proceeds for a sufficiently long time. Formally, let $\alpha_t$ be the $|V|$-dimensional vector such that $\alpha_t(v)$ denotes the probability that the random walk will be at gene v at step t, where $\|\alpha_t\|_1 = 1$ (here, $\|\cdot\|_1$ denotes the $L_1$-norm of a vector). Let P denote the stochastic matrix derived from network G = (V, E), i.e., $P(u, v) = 1/|N(v)|$ if $(u, v) \in E$, and 0 otherwise. Then, at any step t + 1, the crosstalk score vector can be defined by Equation 2.9.

$$\alpha_{t+1} = (1 - r) P \alpha_t + r \gamma \qquad (2.9)$$

where $\gamma$ denotes the restart vector with $\gamma(u) = 1/|S|$ for $u \in S$, and 0 otherwise. With initial crosstalk scores set to $\alpha_0 = \gamma$, the vector of final crosstalk scores for each gene in the network is given by $\alpha = \lim_{t \to \infty} \alpha_t$. In our experiments, we stopped iterating when $\|\alpha_{t+1} - \alpha_t\|_1 < 10^{-9}$.
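The iteration of Equation 2.9 can be sketched in Python as below. This is a minimal illustration rather than the code used in the experiments: the network is given as a hypothetical adjacency dictionary, the stopping rule is read as an L1 difference below 1e-09, and the function name `crosstalk_scores` and the `max_iter` safeguard are assumptions.

```python
def crosstalk_scores(adj, seeds, r=0.5, tol=1e-9, max_iter=10000):
    """Random walk with restarts on an undirected network.

    adj maps each gene to the list of its neighbors; seeds is the seed set S.
    Returns a dict mapping each gene to its crosstalk score.
    """
    genes = list(adj)
    seed_set = set(seeds)
    # Restart vector: uniform over the seed genes, zero elsewhere.
    gamma = {g: (1.0 / len(seed_set) if g in seed_set else 0.0) for g in genes}
    alpha = dict(gamma)
    for _ in range(max_iter):
        nxt = {g: r * gamma[g] for g in genes}
        for v in genes:
            # Gene v passes (1 - r) of its mass equally to each neighbor.
            share = (1.0 - r) * alpha[v] / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
        delta = sum(abs(nxt[g] - alpha[g]) for g in genes)
        alpha = nxt
        if delta < tol:
            break
    return alpha


# Toy example (an assumption): a path A - B - C - D with seed gene A.
TOY_ADJ = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
TOY_ALPHA = crosstalk_scores(TOY_ADJ, ["A"], r=0.5)
```

On this path the scores fall off with distance from the seed (26/45, 14/45, 4/45, and 1/45 for A through D), and they sum to one because each update preserves the total probability mass.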

As we can see, when r = 0, $\alpha$ is equal to the eigenvector of P that corresponds to its largest eigenvalue (with numerical value 1), i.e., $\alpha(v)$ is exactly equal to the page rank of v in G for all $v \in V$. Thus, the crosstalk score of a gene v is not only an indicator of its connectivity and proximity to ASD genes, but it also reflects the centrality of the gene in the network.

2.4.4 Dealing with Statistical Bias

PPI networks are often noisy in that well-studied proteins or genes are highly connected having a lot of interactions, whereas less studied ones often miss interactions. Thus there is a high probability the highly connected hub genes will be assigned artificially high crosstalk scores just by chance, skewing the result towards well-studied genes. However, we are interested in finding those genes that are less characterized but may provide novel insights into ASDs.

To correct for this bias, we assign significance scores to the crosstalk scores using Monte Carlo simulations. We define a null model that accurately captures the degree distribution of the ASD seed genes in S as follows. For a given ASD seed set S, in order to generate a random instance $S^{(i)}$ representative of S, first, for every gene $u \in S$, we create a bucket B(u) of genes in the network, such that $\bigcup_{u \in S} B(u) = V$ and $B(u) \cap B(u') = \emptyset$ for all distinct $u, u' \in S$. A gene $v \in V$ is assigned to bucket B(u) if $\big|\,|N(v)| - |N(u)|\,\big| \le \big|\,|N(v)| - |N(u')|\,\big|$ for all $u' \in S$, and ties are broken randomly. Next we choose one gene from each bucket uniformly at random to construct $S^{(i)}$, so that $|S^{(i)}| = |S|$. Note that each bucket consists of genes whose number of interactions is similar to that of a particular ASD seed gene; therefore each seed gene is represented in $S^{(i)}$ by exactly one gene in terms of its number of neighbors. Thus, the expected total degree of genes in $S^{(i)}$ is likely to be very close to the total degree of the genes in S. After generating a random instance $S^{(i)}$, we compute the corresponding crosstalk vector $\alpha^{(i)}$ by letting $\gamma^{(i)}(u) = 1/|S^{(i)}|$ for $u \in S^{(i)}$, and 0 otherwise. We repeat this procedure N times, where N is sufficiently large (we use N = 1000 in our experiments), to obtain a sampling $\{\alpha^{(1)}, \alpha^{(2)}, \alpha^{(3)}, \ldots, \alpha^{(N)}\}$ of the null distribution of the crosstalk scores, with respect to seed sets that are representative of S in terms of their sizes and degree distributions. Next, we estimate the mean $\mu_S(v) = \frac{1}{N}\sum_{i=1}^{N} \alpha^{(i)}(v)$ and standard deviation $\sigma_S(v) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} \left(\alpha^{(i)}(v) - \mu_S(v)\right)^{2}}$ of this distribution, and compute the adjusted crosstalk score of each gene as the z-score in Equation 2.10.

$$z_S(v) = \frac{\alpha(v) - \mu_S(v)}{\sigma_S(v)}, \quad \forall v \in V \qquad (2.10)$$
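The degree-matched null model can be sketched as follows. This Python illustration makes several assumptions that are not taken from the text: each seed gene is placed in its own bucket so that no bucket is empty, the standard deviation is the sample standard deviation (N - 1 in the denominator), a gene whose null scores have zero spread receives an adjusted score of 0, and all function names are hypothetical. A compact random walk with restarts is included so that the block is self-contained.

```python
import random
import statistics


def crosstalk_scores(adj, seeds, r=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restarts (Equation 2.9), iterated to an L1 tolerance."""
    gamma = {g: (1.0 / len(seeds) if g in seeds else 0.0) for g in adj}
    alpha = dict(gamma)
    for _ in range(max_iter):
        nxt = {g: r * gamma[g] for g in adj}
        for v in adj:
            for u in adj[v]:
                nxt[u] += (1.0 - r) * alpha[v] / len(adj[v])
        delta = sum(abs(nxt[g] - alpha[g]) for g in adj)
        alpha = nxt
        if delta < tol:
            break
    return alpha


def degree_buckets(adj, seeds, rng):
    """Partition all genes into one bucket per seed by closest degree."""
    buckets = {u: [u] for u in seeds}  # assumption: a seed stays in its own bucket
    for v in adj:
        if v in buckets:
            continue
        gaps = {u: abs(len(adj[v]) - len(adj[u])) for u in seeds}
        best = min(gaps.values())
        ties = [u for u in seeds if gaps[u] == best]
        buckets[rng.choice(ties)].append(v)  # ties are broken at random
    return buckets


def adjusted_crosstalk(adj, seeds, r=0.5, n_samples=1000, rng_seed=0):
    """z-score of each gene's crosstalk against degree-matched random seed sets."""
    rng = random.Random(rng_seed)
    seeds = sorted(seeds)
    observed = crosstalk_scores(adj, seeds, r)
    buckets = degree_buckets(adj, seeds, rng)
    null = {g: [] for g in adj}
    for _ in range(n_samples):
        # One gene per bucket, so the random seed set has the same size as S.
        sample = [rng.choice(buckets[u]) for u in seeds]
        scores = crosstalk_scores(adj, sample, r)
        for g in adj:
            null[g].append(scores[g])
    z = {}
    for g in adj:
        mu = statistics.mean(null[g])
        sigma = statistics.stdev(null[g])  # sample standard deviation (N - 1)
        z[g] = (observed[g] - mu) / sigma if sigma > 0 else 0.0
    return z
```

Sorting the genes by the returned values in descending order gives a ranking in which a hub gene is promoted only if its score exceeds what degree-matched random seed sets would produce.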

These adjusted crosstalk scores represent the statistical significance of the crosstalk between each gene and the genes in the ASD seed set, accounting for the centrality and degree distribution of the genes in the PPI network. We sort the genes in our PPI network in descending order of the adjusted crosstalk scores, which gives us the network crosstalk based ranking of genes.

2.4.5 Performance of Network Crosstalk based Prioritizer

To measure the performance of our network crosstalk based prioritizer, we calculate the area under the Receiver Operating Characteristic (ROC) curve (AUC). As before, the TPR and FPR are calculated using Equations 2.3 and 2.4 respectively. We measure the quality of ranking on 22086 genes of the PPI network. These genes do not include the 106 ASD genes used as seeds. We measure AUC for different values of the restart probability, $r \in \{0, 0.25, 0.5, 0.75, 0.9\}$ (Figure 2-9).

Figure 2-10 depicts the lift chart for this method, which shows how much more likely we are to identify ASD genes than if we make random guesses. By considering only the top 2% of genes in the ranklist found by our method, we are able to identify 2.37 times as many known ASD genes (excluding seeds) as if we selected randomly. This gain indicates quite an improvement considering the unbalanced nature of our dataset, with ASD genes accounting for roughly 2% of the entire dataset. Here, lift is calculated using Equation 2.5. Inclusion of seed genes boosts the gain up to 10.5-fold.

[Figure 2-9 plot. Legend: r = 0.00, AUC = 0.4474; r = 0.25, AUC = 0.5611; r = 0.50 and r = 0.75, AUC values not fully legible; r = 0.90, AUC = 0.5525; Baseline, AUC = 0.5000. Horizontal axis: False Positive Rate (FPR).]

Figure 2-9: Receiver operating characteristic curves for network crosstalk based prioritizer using different restart probabilities (r).

[Figure 2-10 plot. Legend: Excluding Seeds, Including Seeds, Baseline. Horizontal axis: % Genes in the Ranklist.]

Figure 2-10: Lift chart for network crosstalk based prioritizer.

Chapter 3

Integrative Approach for Identifying ASD Risk Genes

3.1 Background

To recapitulate, we are interested in the problem of quantifying the association of a gene with ASD and ranking the genes based on the strength of association. Each of the methods we have discussed so far focuses on a single aspect of functional similarity of genes, based on either sequence, phenotype, or topological similarity. However, as discussed in Section 1.2.1, there is plenty of evidence in the literature that an integrative approach incorporating multiple aspects of functional similarity of genes simultaneously can perform better in predicting and prioritizing disease genes than methods focusing on a single aspect. Motivated by this fact, we propose a logistic regression based integrative approach for solving this problem in the context of ASDs. We use lasso-penalized logistic regression [31, 110] to develop a predictor that predicts the probability of a gene being associated with ASDs. To avoid over-fitting the model, we used the adaptive lasso procedure, which simultaneously identifies influential variables and provides the model parameters. Our choice of variables includes ASD association scores computed by the methods described in Section 2 as well as information on ASD-pathway membership of genes.

3.1.1 Lasso-penalized Logistic Regression

Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable. It is a simple but widely used approach for integrating predictors from multiple sources. It is often used with lasso regularization. Lasso is a shrinkage estimator, often used to identify important predictors, select among redundant parameters, and produce shrinkage estimates. Lasso estimates have potentially lower predictive errors than an ordinary maximum likelihood estimator. Thus, lasso is a useful alternative to stepwise regression and other dimensionality reduction techniques.

3.2 Predicting ASD Association via Logistic Regression based Integrative Approach

3.2.1 Preparing Data for Training and Validation

Our landscape of genes consists of all 22192 genes of the connected PPI network. Each gene in the set is labeled as an ASD gene if it belongs to the gold standard ASD gene set (Appendix A), or non-ASD otherwise. We establish a training set of 4292 genes. It includes the 106 high confidence ASD genes from the ASD gold standard gene set. These genes appear in eight or more ASD-related studies. These 106 genes make up roughly 19.34% of the total ASD genes in the dataset. To retain this proportion for non-ASD genes as well, we randomly select 4186 of the 21644 non-ASD genes in the connected PPI network for the training set. The rest of the ASD and non-ASD genes are set aside for validating the performance of the logistic regression based predictor. Thus, the validation set consists of 17900 genes, of which 442 are ASD genes and the rest are non-ASD genes. Note that our dataset under consideration is a highly unbalanced one, and we are interested in accurately predicting the ASD genes rather than the non-ASD genes.
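The set sizes quoted above follow from a few lines of arithmetic, shown here as a small Python check. The only assumption is that the number of non-ASD training genes is obtained by rounding the proportional share down, which reproduces the reported figure of 4186.

```python
TOTAL_GENES = 22192   # genes in the connected PPI network
GOLD_ASD = 548        # gold standard ASD genes
SEED_ASD = 106        # ASD genes reported in eight or more studies

non_asd = TOTAL_GENES - GOLD_ASD                  # 21644 non-ASD genes
seed_fraction = SEED_ASD / GOLD_ASD               # about 0.1934
train_non_asd = SEED_ASD * non_asd // GOLD_ASD    # proportional share, rounded down
train_size = SEED_ASD + train_non_asd             # training set size
validation_size = TOTAL_GENES - train_size        # validation set size
validation_asd = GOLD_ASD - SEED_ASD              # ASD genes left for validation
```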

3.2.2 Constructing Lasso-regularized Binomial Regression Model

We formulate the logistic regression based predictor as follows. Let $V = \{v_1, v_2, v_3, \ldots, v_n\}$ be the set of genes. Let the dependent variable $\mu = \{\mu_1, \mu_2, \mu_3, \ldots, \mu_n\}$ be the vector of predictions, where $\mu_i$ denotes the probability that gene $v_i$ is associated with ASDs. $\mu_i$ can be any real value between 0 and 1 inclusive. We construct the set of independent variables, X = {CNVIE, AutSim, DSAP, NetCrTk, NeuronPath, SkeletalPath, SynapsePath, CaPath}, with eight predictors, where CNVIE refers to CNV information entropy based scores; AutSim, to autism similarity based scores; DSAP, to diffusion state ASD proximity scores; NetCrTk, to adjusted network crosstalk scores; and NeuronPath, SkeletalPath, SynapsePath, and CaPath refer to the membership information of genes in the neuron development pathway, skeletal development pathway, synapse pathway, and Calcium (Ca) signaling pathway, respectively. These four pathways have been associated with ASDs in recent studies [21, 91]. The gene membership information was extracted using the corresponding pathway gene sets from Molecular Signatures Database (MSigDB) version 4.0 [105]. Here, X is an $n \times 8$ matrix where each row corresponds to the values of the eight predictors for the corresponding gene. We fit a lasso regularized weighted binomial regression model with the aforementioned dependent and independent variables on the training data using 100 values of the penalty parameter, Lambda, and 10-fold cross validation. Cross validation is used to correct for potential over-fitting bias. The weights are given by the number of ASD association studies related to each gene. If a gene does not have any association study associated with it, it is given a very small weight of 1e-06. For each non-negative value $\lambda$ in Lambda, lasso tries to minimize the deviance of the model (often estimated as the negative log-odds ratio) fit to the responses using the predictor coefficients as well as a constant term. We use the lassoglm function from Matlab (version 2012a) to fit the lasso regularized binomial regression model.
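To make the fitting step concrete, the following Python sketch fits a weighted L1-penalized logistic regression for a single penalty value by proximal gradient descent. It is a stand-in technique chosen for brevity and is not Matlab's lassoglm: it does not search over 100 penalty values, performs no cross validation, and its penalty scaling may differ from the one used in this work. The function names, the step size, and the iteration count are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(x, t):
    """Proximal operator of the L1 penalty: shrink x toward zero by t."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def fit_lasso_logistic(X, y, weights, lam, step=0.1, n_iter=2000):
    """Weighted L1-penalized logistic regression by proximal gradient descent.

    Minimizes the weighted mean negative log-likelihood plus lam * ||beta||_1.
    The intercept beta0 is not penalized.
    """
    n_features = len(X[0])
    beta0, beta = 0.0, [0.0] * n_features
    total_weight = sum(weights)
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * n_features
        for x, label, w in zip(X, y, weights):
            mu = sigmoid(beta0 + sum(b * v for b, v in zip(beta, x)))
            err = w * (mu - label)
            grad0 += err
            for j, v in enumerate(x):
                grad[j] += err * v
        # Gradient step on the smooth part, then shrinkage for the L1 part.
        beta0 -= step * grad0 / total_weight
        beta = [soft_threshold(b - step * g / total_weight, step * lam)
                for b, g in zip(beta, grad)]
    return beta0, beta
```

A large penalty drives every coefficient exactly to zero, while a small penalty keeps the informative predictors. This is the variable selection behavior that motivates the use of lasso here.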

3.2.3 Selecting Model Coefficients

We select the constant term as well as the set of predictor coefficients such that the deviance of the model remains within one standard error of the minimum deviance found by lasso. The selected model coefficients are shown in Table 3.1 in order of predictive value. According to the fitted lasso penalized logistic regression model all the predictors are informative to some extent.

Variable        Coefficient
AutSim          65.4212
DSAP            1.7637
SkeletalPath    1.4463
NeuronPath      1.2254
CaPath          1.1516
SynapsePath     1.1487
CNVIE           0.3529
NetCrTk         0.2223

Table 3.1: Selected regression coefficients for the integrative approach from logistic regression in order of predictive value.

3.2.4 Creating Regularized Model and Making Predictions

Let the constant term be denoted by $\beta_0$ and the vector of lasso regularized predictor coefficients be denoted by $\beta = \{\beta_1, \beta_2, \beta_3, \ldots, \beta_8\}^T$. The resulting regularized model is given by Equation 3.1.

$$\mathrm{logit}(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = X\beta + \beta_0 \qquad (3.1)$$

Thus, the predictions are given by Equation 3.2.

$$\mu = \frac{e^{(X\beta + \beta_0)}}{1 + e^{(X\beta + \beta_0)}} \qquad (3.2)$$

We evaluate the model predictions on the training and validation data using this equation. The Matlab function glmval is used for that purpose.
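As a rough Python counterpart to this prediction step, the sketch below applies Equation 3.2 to one gene using the coefficients listed in Table 3.1. The intercept is not reported in the text, so any value passed for it is a placeholder, and the function name `predict_probability` is hypothetical.

```python
import math

# Coefficients from Table 3.1. The intercept is not reported in the text,
# so callers must supply their own placeholder value.
COEFFICIENTS = {
    "AutSim": 65.4212, "DSAP": 1.7637, "SkeletalPath": 1.4463,
    "NeuronPath": 1.2254, "CaPath": 1.1516, "SynapsePath": 1.1487,
    "CNVIE": 0.3529, "NetCrTk": 0.2223,
}


def predict_probability(features, intercept, coefficients=COEFFICIENTS):
    """Equation 3.2: inverse logit of the linear predictor X*beta + beta0.

    features maps predictor names to values; missing predictors count as 0.
    """
    eta = intercept + sum(coefficients[name] * value
                          for name, value in features.items())
    # Branch on the sign of eta to avoid overflow in exp().
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)
```

For example, `predict_probability({"SkeletalPath": 1}, intercept=-1.4463)` gives a probability of about 0.5, since the linear predictor is then zero.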

3.3 Performance Analysis

First we assess the accuracy of the model on the training data by measuring the area under the ROC curve as well as the area under the precision-recall curve. The TPR, or recall, and FPR are calculated using Equations 2.3 and 2.4 respectively. Precision is given by Equation 3.3.

$$\text{Precision} = \frac{\text{Number of ASD genes correctly identified by the method}}{\text{Number of genes identified as ASD genes by the method}} \qquad (3.3)$$

Figure 3-1 shows the precision-recall curve and ROC curve for the model on training data. We achieve an area of 99.54% under the ROC curve and an area of 78.63% under the precision-recall curve, which indicates the high quality of the fit on training data. To assess the overall accuracy of the model, we measure the AUC for the ROC curve using the validation data set. It achieves an AUC of 65.34% (Figure 3-2). To compare its performance, we also fit two other lasso regularized logistic regression models: one integrating only the ASD association scores from the four methods described in Section 2, and the other integrating only the gene membership information in the four pathways and the weights from the literature. The standardized regression coefficients are listed in Tables 3.2 and 3.3.

[Figure 3-1 plot. Panel A: Precision-Recall Curve, horizontal axis: Recall. Panel B: ROC Curve, horizontal axis: False Positive Rate (FPR).]

Figure 3-1: Performance curves for integrative approach on training data. A. Precision-recall curve with an AUC of 0.7863; B. ROC curve with an AUC of 0.9954.

Variable        Coefficient
AutSim          18.0409
DSAP            3.1580
CNVIE           0.2005
NetCrTk         0.1870

Table 3.2: Selected logistic regression coefficients for integrating different ASD association scores in order of predictive value.

Variable        Coefficient
CaPath          2.6001
SkeletalPath    1.7088
NeuronPath      0.9576
SynapsePath     0.6344

Table 3.3: Selected logistic regression coefficients for integrating ASD-pathway membership information with weights in order of predictive value.

We compute the area under the ROC curve (AUC) for each of these models as well as for the methods from Section 2 using the validation dataset. As we can see in Figure 3-2, our integrative approach, which uses both the ASD association scores from different methods and the ASD-pathway membership information with weights, gives the best performance among all of them.

[Figure 3-2 plot. Legend: IntApp, AUC = 0.6534; IntMthd, AUC = 0.6173; IntPath, AUC = 0.5309; CNVIE, AUC = 0.5954; AutSim, AUC = 0.5596; DSAP, AUC = 0.5416; NetCrTk, AUC = 0.5625; Baseline, AUC = 0.5000. Horizontal axis: False Positive Rate (FPR).]

Figure 3-2: Receiver operating characteristic curves for different ASD gene prediction-prioritization methods. IntApp: Integrative approach incorporating different ASD association scores as well as ASD pathway membership information from the literature; IntMthd: Integrative approach incorporating different ASD association scores only; IntPath: Integrative approach incorporating only ASD pathway membership information from the literature; CNVIE: CNV Information Entropy based Prioritizer; AutSim: Autism Similarity based Prioritizer; DSAP: Diffusion State ASD Proximity based Prioritizer; NetCrTk: Network Crosstalk based Prioritizer.

Figure 3-3 depicts the lift chart for this method, which shows how much more likely we are to identify ASD genes than if we make random guesses. By considering only the top 2% of genes in the ranklist found by our method, we are able to identify 3 times as many known ASD genes (excluding seeds) as if we selected randomly. This gain indicates quite an improvement considering the imbalanced nature of our dataset, with ASD genes accounting for roughly 2% of the entire dataset. Here, lift is calculated using Equation 2.5. Inclusion of seed genes boosts the gain up to 11.2-fold. Thus, we construct our risk gene set for ASDs using the genes from the top 2% of our ranklist, which yields 443 genes. Note that among these genes, 123 are known ASD genes (102 seeds and 21 non-seeds). Among the 21 non-seed ASD genes identified by our integrative approach, DLGAP3, APC, GPC6, and NTRK1 appear in seven ASD association studies; AR, ATRX, and RPS6KA3 appear in six studies; SCN8A and TBR1 appear in five studies; GNAS and EGR2 appear in four studies; KCND2 and BIN1 appear in three studies; SETD2, TYR, and EPHB2 appear in two studies; and TBX1, PTPN11, DUSP22, BRCA2, and KIT appear in one study. Considering the high gain of known ASD genes in the candidate set, we can hypothesize that the other genes in it have a strong possibility of being associated with ASDs and are worth investigating. The complete list of risk genes along with the probabilities of their association to ASDs is given in Appendix B.

[Figure 3-3 plot. Horizontal axis: % Genes in the Ranklist.]

Figure 3-3: Lift chart of integrative approach for ASD gene prediction-prioritization.

Chapter 4

ASD Genetics: Implications from Candidate ASD Risk Genes

4.1 Gene Sets for Analysis

We downloaded prior knowledge-based gene sets consisting of gene symbols in Gene Matrix Transposed (GMT) format from the MSigDB version 4.0 [105]. Of the available pathway gene sets, we collected 1320 expert curated ones, which include gene sets from Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (http://www.genome.jp/kegg, [47,48]), Reactome pathways (http://www.reactome.org/ [27, 72]), BioCarta pathways (http://www.biocarta.com), Pathway Interaction Database (PID) pathways [99], SigmaAldrich gene sets (http://www.sigmaaldrich.com/life-science.html), Signaling Gateway gene sets (http://www.signaling-gateway.org), Signal Transduction KE gene sets (http://stke.sciencemag.org), and SuperArray gene sets (http://www.superarray.com). From the collection, we filtered out the disease and drug related gene sets. We also excluded very large (> 300 genes) and very small (< 10 genes) gene sets. Thus we were left with 1221 gene sets as of June 2014.

From MSigDB version 4.0, we also collected Gene Ontology (GO) gene sets, which are derived from the controlled vocabulary of the GO project: "The Gene Ontology Consortium" [8]. These gene sets are based on GO terms and their associations to human genes. We collected gene sets belonging to two categories: C5:BP (biological process) and C5:MF (molecular function). After filtering out very small and very large gene sets, we were left with 763 and 382 gene sets respectively under these categories. From the collection, we excluded the KEGG calcium signaling pathway, neuron development, skeletal development, and synapse gene sets since they have been used as prior knowledge in our integrative approach. We also filtered out the neurite development and axon guidance gene sets as they are subsets of the neuron development pathway gene set.

4.2 Hypergeometric Test for Enrichment

We use a hypergeometric test to determine the statistical significance of overlap of a pathway or GO gene set with our candidate gene set. The hypergeometric test uses the hypergeometric distribution to calculate the probability of more than k ASD risk genes (out of a set of n ASD risk genes in a dataset of N total genes) appearing in a specific pathway or GO gene set of size K. The probability mass function of the hypergeometric distribution is given by the following expression.

$$P(X = k) = \frac{\binom{K}{k}\binom{N - K}{n - k}}{\binom{N}{n}}$$

This test helps identify whether the ASD risk gene set is overrepresented in a certain pathway or GO gene set, and provides us with a p-value. We use the phyper function from R (version 2.15.1) for computing the hypergeometric p-values for the pathways and GO gene sets in our gene set collection.
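The enrichment p-value can be sketched in Python with exact integer binomial coefficients. This is an illustration rather than the R code used in this work. It computes the upper tail P(X >= k) for an observed overlap k, the usual convention for enrichment testing, which corresponds to R's `phyper(k - 1, K, N - K, n, lower.tail = FALSE)`. Whether the analysis here used exactly this tail is an assumption, and the function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): k of the n risk genes fall in a gene set of size K, out of N genes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_pvalue(k, N, K, n):
    """Upper tail P(X >= k) of the hypergeometric distribution."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))


def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

As a small check, with N = 10 genes, a gene set of size K = 4, and n = 3 risk genes, an overlap of k = 2 gives `enrichment_pvalue(2, 10, 4, 3)` equal to 1/3.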

4.3 Pathway Enrichment Analysis

Performing hypergeometric enrichment analysis on pathway gene sets, we found a total of 32 canonical pathways in which ASD risk genes are overrepresented after Bonferroni correction (Table 4.1). Of note, there was an increased frequency of affected pathways associated with signal transduction pertinent to brain, cellular assembly and communication, synaptic development, and neuronal development. Most of the affected signaling pathways (e.g., MAPK signaling, FGF signaling, SHP2 signaling etc.) are found to be highly involved in processes such as cell growth and death, specifically neuron apoptosis, neurite outgrowth, inter neurite cell adhesion etc. They also affect cell proliferation, differentiation, and migration processes. We found a number of affected pathways involving L1 family cell adhesion molecules (L1CAM), which play important roles in neuronal migration and synaptic formation [6]. (SHANK proteins) bind to L1 proteins and couple them to ion channel proteins, and thus mediate branching and synaptogenesis of cortical inhibitory neurons. Pathways involving neural cell adhesion molecules (NCAM) play important roles in formation and maintenance of the nervous system. Clearly, incorrect synapse development, neuron development, and erosion of synaptic function are widely considered to be key contributors to ASDs. Our results also show that pathways involved in immune response, protein catabolism or modification, tissue and organ morphogenesis, differentiation, inflammatory response, etc. are associated with ASDs, and may be involved in immune related disorders, developmental regression, metabolic abnormalities, morphological impairments, and sleep disturbances.

Table 4.1 - Canonical pathways (note 1) having significant overlap with ASD risk genes.

Name | Category | Functions | # Genes | Genes | Adjusted p-value
KEGG MAPK SIGNALING PATHWAY | Signal Transduction | cell proliferation, differentiation, migration | 21 | MEF2C, FLNA, FLNB, BDNF, NF1, FGFR2, FGFR3, FGFR1, FGF14, RPS6KA3, IKBKG, MAPT, CACNA1H, CACNA1G, CACNA1A, CACNA1C, CACNA1S, SOS1, NTF3, TP53, NTRK1 | 1.00E-04
PID FGF PATHWAY | Signal Transduction | cell death, neuron apoptosis, proteosomal dependent protein catabolism | 10 | FGFR1, PLCG1, FGFR2, FGFR3, RUNX2, PTPN11, MET, FRS2, HGF, SOS1 | 0.000108898
PID SHP2 PATHWAY | Signal Transduction | activation of IL2, IL10, FGF, ERBB1 signaling cascades | 10 | SOS1, NOS3, FRS2, IL2RG, NTRK3, NTRK1, BDNF, PTPN11, IL6, NTF3 | 0.000184167
REACTOME INTERACTION BETWEEN L1 AND ANKYRINS | Cell Adhesion | nervous system development, branching and synaptogenesis of cortical neurons | 7 | ANK3, L1CAM, SCN2A, SCN4A, SCN5A, SCN8A, SPTAN1 | 0.000204988
REACTOME NCAM SIGNALING FOR NEURITE OUT GROWTH | Cell Adhesion | neural cell adhesion, formation and maintenance of nervous system | 10 | COL1A1, COL2A1, COL4A3, FGFR1, PRNP, SOS1, SPTAN1, CACNA1S, CACNA1H, CACNA1G | 0.000480482
PID BETA- NUC PATHWAY | Signal Transduction | cell proliferation, immune response | 11 | AR, KRT1, TCF4, SALL4, PITX2, MITF, MED12, NCOA2, AES, CACNA1G, APC | 0.000492712
KEGG FOCAL ADHESION | Cell Adhesion | cell communication, motility, proliferation, differentiation, survival | 17 | HGF, FLT4, RELN, FLNA, FLNB, MET, COL11A1, COL11A2, VWF, PTEN, CAV3, ITGA2B, COL2A1, COL1A1, BCL2, SOS1, LAMA3 | 0.00059351
PID DOWNSTREAM PATHWAY | Signal Transduction | cell growth and death | | FDXR, TSC2, BCL2, TP63, TP53, RFWD2, MET, PTEN, HGF, APC, RB1, VDR, NDRG1, DDB2 | 0.000603742
KEGG ECM RECEPTOR INTERACTION | Signal Transduction | tissue and organ morphogenesis, cell adhesion, proliferation, differentiation and morphogenesis | | GP1BA, VWF, HSPG2, ITGA2B, COL2A1, RELN, COL1A1, LAMA3, SDC3, COL11A1, COL11A2 | 0.000813133
REACTOME L1CAM INTERACTIONS | Cell Adhesion | neurite outgrowth, neurite fasciculation, inter neuronal adhesion | | DNM2, FGFR1, ANK3, ITGA2B, L1CAM, RPS6KA3, SCN2A, SCN4A, SCN5A, SCN8A, SPTAN1 | 0.001033363
REACTOME INSULIN RECEPTOR SIGNALLING CASCADE | Signal Transduction | activation of MAPK, Ras/RAF cascades | | FRS2, EIF4E, FGFR1, FGFR3, FGFR2, INSR, PRKAG2, SOS1, STK11, TSC1, TSC2 | 0.001161965
REACTOME PI3K CASCADE | Signal Transduction | insulin substrate mediated signaling | | FRS2, EIF4E, FGFR1, FGFR3, FGFR2, INSR, PRKAG2, STK11, TSC1, TSC2 | 0.001293256
REACTOME SIGNALING BY INSULIN RECEPTOR | Signal Transduction | insulin binding | | FRS2, EIF4E, FGFR1, FGFR3, FGFR2, ATP6V0A2, INSR, PRKAG2, SOS1, STK11, TSC1, TSC2 | 0.001561111
PID SYNDECAN 1 PATHWAY | Signal Transduction | tumor necrosis factor mediated signaling, protein ubiquitation, degradation | | COL11A2, MET, COL7A1, COL11A1, COL2A1, COL4A3, HGF, COL1A1 | 0.002906631
PID SMAD 2,3 NUCLEAR PATHWAY | Signal Transduction | muscle cell differentiation, endothelial cell migration, negative regulation | | NCOA2, FOXO4, FOXH1, MEF2C, AR, ESR1, RUNX2, DLX1, FOXG1, VDR | 0.004928594
PID TCR PATHWAY | Protein Catabolism, Signal Transduction | protein catabolytic process, activation of calcium signaling, NFKB signaling | | IKBKG, WAS, FLNA, PLCG1, PTPRC, PTPN11, PTEN, SOS1, STIM1 | 0.005740308
PID INTEGRIN1 PATHWAY | Cell Adhesion | family cell surface interactions for adhesion | | COL11A1, TGFBI, COL7A1, COL2A1, COL1A1, COL4A3, LAMA3, FBN1, COL11A2 | 0.005740308
SIG PIP3 SIGNALING IN CARDIAC MYOCTES | Signal Transduction | cell growth and survival | | RPS6KA3, MET, ERBB4, SOS1, PTPN1, TSC1, INPPL1, TSC2, PTEN | 0.00651532
KEGG NEUROTROPHIN SIGNALING PATHWAY | Nervous System | differentiation and survival of neuronal cells, learning and | 12 | FRS2, BDNF, BCL2, PLCG1, PTPN11, RPS6KA3, SOS1, PSEN1, NTF3, TP53, NTRK1, NTRK3 | 0.007891206
REACTOME NCAM1 INTERACTIONS | Cell Adhesion | neuronal -cell adhesion, cellular migration, differentiation, survival, | | COL1A1, COL2A1, COL4A3, PRNP, CACNA1S, CACNA1H, CACNA1G | 0.009759881
ST MYOCYTE AD PATHWAY | Signal Transduction | formation of interface between nervous system and cardiovascular system | | APC, GNAQ, EPHB2, PITX2, CAV3, RYR1 | 0.011606902
BIOCARTA GH PATHWAY | Signal Transduction | growth factor mediated signaling, dwarfism, activation of JAK-STAT, MAPK cascades | | HNF1A, GHR, GH1, PLCG1, INSR, SOS1 | 0.014524747
REACTOME NGF SIGNALLING VIA TRKA FROM THE PLASMA MEMBRANE | Signal Transduction | neuronal differentiation in response to neurotrophins | | FRS2, DNM2, MEF2A, MEF2C, FOXO4, NTRK1, PLCG1, PRKAR1A, PTEN, RPS6KA3, SOS1, TSC2 | 0.018499536
REACTOME INTEGRIN CELL SURFACE INTERACTIONS | Protein Catabolism, Cell Adhesion | cell adhesion to ECM, protein catabolitic process | | RAPGEF4, COL1A1, COL2A1, COL4A3, FBN1, ITGA2B, PTPN1, SOS1, VWF | 0.025394674
PID PATHWAY | Transcription, Protein Catabolism | apoptosis, proteosomal ubiquitation dependent protein catabolitic process | | WT1, GATA1, BIN1, BRCA2, GNB2L1, TP53AIP1, RB1, NTRK1, TP63 | 0.025394674
ST DIFFERENTIATION PATHWAY IN PC12 CELLS | Signal Transduction | PC12 cell differentiation | | PTPN11, NTRK1, RPS6KA3, GNAQ, FRS2, EGR2, OPN1LW | 0.025972966
PID MET PATHWAY | Signal Transduction | growth factor mediated signaling | | PLCG1, PTPN1, PTPN11, HGF, EIF4E, INPPL1, APC, SOS1, MET | 0.028115505
PID TRKR PATHWAY | Signal Transduction | growth factor mediated signaling | | NTRK3, SOS1, FRS2, PLCG1, NTRK1, PTPN11, BDNF, NTF3 | 0.028499715
REACTOME DOWNSTREAM SIGNALING OF ACTIVATED FGFR | Signal Transduction | cell proliferation, differentiation, migration, survival and cell shape | | FRS2, FGFR1, FGFR3, FGFR2, FOXO4, PLCG1, PRKAR1A, PTEN, SOS1, TSC2 | 0.028989079
PID DELTA NP63 PATHWAY | Signal Transduction | calcium signaling | | TP63, VDR, GNB2L1, DLX6, BRCA2, KRT14, DLX5 | 0.03478047
PID NCADHERIN PATHWAY (TOLL PATHWAY) | Signal Transduction | inflammatory response, interferon production, cell proliferation and migration | | PTPN1, PLCG1, LRP5, FGFR1, GJA1, PTPN11 | 0.039242846
REACTOME NEURONAL SYSTEM | Nervous System | cell communication, neuronal development | 17 | CHRNA1, CHRNA7, GABRB3, GRIK2, GRIN2A, GRIN2B, KCND2, KCND3, KCNQ1, MAOA, RPS6KA3, SLC1A1, STXBP1, ABCC8, SYN1, CACNA1A, PICK1 | 0.046181684

1. We excluded neuron development, neurite development, axon guidance, synapse development, and calcium signaling pathways as they were used as input knowledge in our integrative approach.

4.3.1 An Interesting Connection with Inflammatory Bowel Disease (IBD)

The fact that ASD patients often suffer from chronic inflammation of the gastrointestinal tract motivated us to look for possible shared pathogenesis between ASD and inflammatory bowel disease (IBD). We first identified a number of pathways related to IBD through an extensive literature review. These IBD pathways are often related to innate and adaptive immunity (T-cell signaling, chemokine signaling, NOD2 signaling, NF-κB signaling, IL23/Th17 signaling, etc. [61]), autophagy (IL2 signaling, IL2RB signaling, IL10 signaling, IL6 signaling, TGF-β signaling, etc. [61,79]), necrosis (TNF signaling, TNFR1/2 signaling, etc.), and apoptosis (cytokine signaling [79,88]). A number of signaling pathways, such as ERK-MAPK signaling [18,118], WNT signaling [42], Notch signaling [70], adipocytokine signaling [50], integrin signaling, Hedgehog signaling [60], BMP signaling, Hippo signaling, and JAK-STAT signaling [10,32,102], have also been mentioned in relation to IBD. We collected the corresponding pathway gene sets from MSigDB. When we examined the overlap of these pathways with our candidate gene set, we found that ASD risk genes are overrepresented in most of them. Table 4.2 lists the IBD-related pathways that have significant overlap (p-value < 0.05) with ASD risk genes. This suggests that IBD and ASDs may share part of their pathogenesis, which is worth further investigation.

4.4 Enrichment Analysis on GO gene sets

We performed hypergeometric enrichment analysis on the gene sets under GO biological processes and molecular functions categories to find in which biological processes and molecular

Name | # Genes | Overlapped Genes | p-value
KEGG MAPK SIGNALING PATHWAY | 21 | MEF2C, FLNA, FLNB, BDNF, NF1, FGFR2, FGFR3, FGFR1, FGF14, RPS6KA3, IKBKG, MAPT, CACNA1H, CACNA1G, CACNA1A, CACNA1C, CACNA1S, SOS1, NTF3, TP53, NTRK1 | 1.09E-07
BIOCARTA ERK5 PATHWAY | 4 | MEF2C, MEF2A, PLCG1, NTRK1 | 0.000383858
BIOCARTA AKT PATHWAY | 4 | IKBKG, FOXO4, GHR, GH1 | 0.000861447
REACTOME CYTOKINE SIGNALING IN IMMUNE SYSTEM | 14 | EIF4E, FLNB, GH1, GHR, HGF, IL2RG, IL6, INPPL1, IRF6, PLCG1, PTPN1, SOS1, TRIM25, IKBKG | 0.001132398
REACTOME SIGNALLING TO ERKS | 4 | FRS2, NTRK1, PLCG1, SOS1 | 0.005568586
REACTOME PROLONGED ERK ACTIVATION EVENTS | 3 | FRS2, NTRK1, PLCG1 | 0.006035527
REACTOME PI3K AKT ACTIVATION | 4 | FOXO4, NTRK1, PTEN, TSC2 | 0.006763695
REACTOME ERK MAPK TARGETS | 3 | MEF2A, MEF2C, RPS6KA3 | 0.008043482
WNT SIGNALING | 6 | AES, APC, LRP5, PITX2, WNT2, HPRT1 | 0.008835118
BIOCARTA IL6 PATHWAY | 3 | PTPN11, IL6, SOS1 | 0.009177488
ST T CELL SIGNAL TRANSDUCTION | 4 | SOS1, PTPRC, PLCG1, EPHB2 | 0.012244256
PID TNF PATHWAY | 4 | SMPD1, GNB2L1, IKBKG, CYLD | 0.013204051
PID IL6 7PATHWAY | 4 | PTPN11, IL6, MITF, SOS1 | 0.014210455
BIOCARTA TNFR1 PATHWAY | 3 | LMNA, SPTAN1, RB1 | 0.019653376
REACTOME NFKB ACTIVATION THROUGH FADD RIP1 PATHWAY MEDIATED BY CASPASE 8 AND10 | 2 | TRIM25, IKBKG | 0.02298827
PID IL2 1PATHWAY | 4 | SOS1, PTPN11, BCL2, IL2RG | 0.024014914
ST INTEGRIN SIGNALING PATHWAY | 5 | EPHB2, SOS1, PLCG1, WAS, PTEN | 0.024148774
KEGG HEDGEHOG SIGNALING PATHWAY | 4 | SHH, WNT2, BMP4, GLI3 | 0.025467375
ST ERK1 ERK2 MAPK PATHWAY | 3 | SOS1, RPS6KA3, EIF4E | 0.025536874
REACTOME APOPTOSIS | 7 | DSP, APC, LMNA, MAPT, BCL2, SPTAN1, TP53 | 0.029317822
BIOCARTA IL2RB PATHWAY | 3 | IL2RG, BCL2, SOS1 | 0.039815059
KEGG ADIPOCYTOKINE SIGNALING PATHWAY | 4 | PTPN11, PRKAG2, IKBKG, STK11 | 0.044909084
REACTOME IL 2 SIGNALING | 3 | IL2RG, INPPL1, SOS1 | 0.048181081

Table 4.2: IBD-related pathways having significant overlap with ASD risk genes.

functions our ASD risk genes are overrepresented. Significant biological processes and molecular functions were selected based on hypergeometric p-values (< 0.05 after Bonferroni correction). As expected, the candidate gene set was overrepresented in a number of developmental processes related to the nervous system and brain. However, it is interesting to see that the risk gene set is also significantly involved in processes such as tissue, muscle, epidermis, and ectoderm development, and organ morphogenesis. This finding may support the observation that muscular dystrophy is a comorbid condition in many ASD cases [55]. Our candidate gene set is also involved in a number of molecular functions related to ion channel activity, protein dimerization and binding, gated channel activity, sodium and calcium channel activity, etc. Figures 4-1 and 4-2 show the significant biological processes and molecular functions found by our analysis.
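The enrichment test used in this section can be sketched as follows. This is a minimal illustration, assuming a one-sided hypergeometric test for over-representation followed by Bonferroni correction; the function names, the universe size, and the toy counts are hypothetical and are not taken from our analysis.

```python
from math import comb

def hypergeom_pvalue(universe, set_size, candidates, overlap):
    """Upper-tail probability P(X >= overlap), where X is the number of
    gene-set members among `candidates` genes drawn without replacement
    from `universe` genes, `set_size` of which belong to the gene set."""
    upper = min(set_size, candidates)
    total = comb(universe, candidates)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, candidates - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# Toy example (hypothetical numbers): 20,000 background genes, a GO term
# with 100 genes, 400 candidate genes, 10 of which fall in the term.
raw = hypergeom_pvalue(20000, 100, 400, 10)
# Pretend 3 terms were tested; the other two p-values are made up.
adjusted = bonferroni([raw, 0.2, 0.9])
print(raw, adjusted[0])
```

A gene set would be reported as significant under this sketch when its adjusted p-value falls below 0.05.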

4.5 Enrichment Analysis for Subnetworks

Analysis for subnetworks was performed using QIAGEN's Ingenuity® Pathway Analysis (IPA®, QIAGEN Redwood City, http://www.qiagen.com/ingenuity). IPA assembled subnetworks based on gene-to-gene connectivity, assuming that the more connected a gene is, the more influence it has and the more "important" it is. IPA selected a set of seed genes

Figure 4-1: Significant GO biological processes associated with ASD risk gene set. The primary vertical axis shows the negative log of hypergeometric p-values. The secondary vertical axis shows the ratio of overlapped genes.

[Figure 4-2: bar chart of significant GO molecular functions for the ASD risk gene set; axes: -log(p-value) and ratio of overlap.]

from our ASD risk gene set. Seeds with the most connections were then connected to other seeds to form a network. Non-seed genes as well as molecules from the IPA Knowledge Base were added to the network to fill or join the areas lacking connectivity. For visualization purposes, we limited each subnetwork to a maximum of 35 nodes. Subnetworks were annotated with high-level functional categories, scored, and sorted in descending order of score. IPA network analysis revealed 25 significant subnetworks in our supplied ASD risk gene set. Figure 4-3 shows the top four subnetworks. The topmost subnetwork is characterized by tissue morphology and gastrointestinal disease terms. Nervous system development and function characterize the second subnetwork. The third subnetwork is annotated by developmental, hereditary, and neurological disorders. Organismal injury and abnormalities, as well as reproductive system disease, characterize the fourth subnetwork. The complete list of significant subnetworks is given in Appendix C. These findings suggest that ASDs may comprise subclasses, each characterized by one type of disorder, such as gastrointestinal disorders, developmental disorders, hereditary disorders, neurological disorders, or organismal abnormalities, and call for further investigation.

4.6 Functional Analysis for Overlap with Diseases and Bio-functions

We performed functional analysis on our ASD risk gene set using QIAGEN's Ingenuity® Pathway Analysis (IPA®, QIAGEN Redwood City, http://www.qiagen.com/ingenuity). With the goal of providing a molecular understanding or model that could explain the functionality of the provided gene set, IPA analyzed it for diseases and functions using high-quality GO information, manually curated information on diseases and disorders, and normal processes in abnormal tissues available in the IPA Knowledge Base. The significance of the overlap between risk genes and genes in diseases and functions was calculated using Fisher's exact test.

IPA functional analysis revealed that our ASD risk gene set is significantly overrepresented in a number of diseases under different disease categories, including developmental and hereditary disorders, neurological disorders, disorders, auditory disorders, gastrointestinal disorders, psychological disorders, dermatological disorders, inflammatory disorders, organismal abnormalities, cancers, etc. While the overlap of ASD with neurological, psychological, developmental, and hereditary disorders is obvious, its connection with gastrointestinal, auditory, and inflammatory disorders is less obvious and hence more interesting for further investigation. The top 10 diseases having significant overlap with our risk gene set are shown in Table 4.3.

Figure 4-3: Top four subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity® Pathway Analysis (IPA). Panels: A. Network 1; B. Network 2; C. Network 3; D. Network 4.

Diseases | Categories | p-Value | # Genes
Autosomal Dominant Disease | Hereditary Disorder | 5.19E-66 | 106
Multiple Congenital Anomalies | Developmental Disorder | 1.14E-61 | 93
Congenital Anomaly of Musculoskeletal System | Developmental Disorder, Skeletal and Muscular Disorders | 2.42E-54 | 108
Dysplasia | Developmental Disorder | 3.90E-40 | 59
Cognitive Impairment | Neurological Disease | 1.79E-38 | 61
Autosomal Recessive disease | Hereditary Disorder | 7.52E-36 | 95
Mental Retardation | Developmental Disorder, Neurological Disease | 4.05E-33 | 47
Congenital Anomaly of Limb | Developmental Disorder, Skeletal and Muscular Disorders | 3.41E-32 | 43
Dysplasia of Skeleton | Connective Tissue Disorders, Developmental Disorder, Skeletal and Muscular Disorders | 2.53E-29 | 37
Hypoplasia | Developmental Disorder | 3.89E-29 | 68

Table 4.3: Top 10 diseases having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).

Functions | Categories | p-Value | # Genes
Organismal Death | Organismal Survival | 5.39E-61 | 212
Differentiation of cells | Cellular Development | 2.69E-55 | 190
Morphology of head | Organismal Development | 4.37E-54 | 122
Cell Death | Cell Death and Survival | 2.76E-51 | 238
Abnormal Morphology of head | Organismal Development | 2.81E-51 | 116
Morphology of Cells | Cell Morphology | 3.08E-50 | 179
Apoptosis | Cell Death and Survival | 2.97E-49 | 206
Morphology of Nervous System | Nervous System Development and Function | 3.70E-48 | 112
Development of Body Axis | Embryonic Development, Organismal Development | 1.16E-44 | 115
Development of Head | Embryonic Development, Organismal Development | 3.62E-44 | 109
Abnormal Morphology of Nervous System | Nervous System Development and Function | 2.02E-43 | 102
Development of Body Trunk | Embryonic Development, Organismal Development | 4.16E-41 | 115
Proliferation of Cells | Cellular Growth and Proliferation | 1.31E-40 | 231
Quantity of Cells | Tissue Morphology | 2.40E-40 | 151
Development of Central Nervous System | Nervous System Development and Function | 6.36E-40 | 87
Development of Neurons | Cellular Development, Nervous System Development and Function, Tissue Development | 1.45E-39 | 91
Development of Brain | Embryonic Development, Nervous System Development and Function, Organ Development, Organismal Development, Tissue Development | 3.42E-39 | 76
Length of Animal | Organismal Development | 5.02E-38 | 104
Necrosis | Cell Death and Survival | 9.09E-38 | 186
Size of Body | Organismal Development | 2.12E-37 | 103
Abnormal Morphology of Cells | Cell Morphology | 3.14E-37 | 127
Dynamics | Cellular Assembly and Organization, Cellular Function and Maintenance | 8.73E-37 | 112
Morphology of Central Nervous System | Nervous System Development and Function | 3.16E-35 | 77
Behavior | Behavior | 3.95E-35 | 103
Morphology of Brain | Nervous System Development and Function, Organ Morphology, Organismal Development | 1.26E-34 | 73
Organization of Cytoskeleton | Cellular Assembly and Organization, Cellular Function and Maintenance | 1.15E-33 | 117
Abnormal Morphology of Brain | Nervous System Development and Function, Organ Morphology, Organismal Development | 2.56E-33 | 70
Morphology of Bone | Connective Tissue Development and Function, Embryonic Development, Organ Development, Organ Morphology, Organismal Development, Skeletal and Muscular System Development and Function, Tissue Development | 4.06E-33 | 71
Cell Movement | Cellular Movement | 4.66E-33 | 155
Abnormal Morphology of Central Nervous System | Nervous System Development and Function | 6.86E-33 | 72

Table 4.4: Top 30 functions having significant overlap with ASD risk genes found by QIAGEN's Ingenuity® Pathway Analysis (IPA).

ASD risk genes are also overrepresented in a number of functional categories, including nervous system development, cell death and survival, cellular development, embryonic development, organismal survival and development, cell and tissue morphology, behavior, etc. The top 30 functions having significant overlap with our risk gene set are shown in Table 4.4.

Chapter 5

Conclusion

In this thesis, we have explored different computational approaches for addressing the classic problem of disease gene prediction and prioritization in the context of autism spectrum disorders (ASDs). We have introduced three novel computational methods, one ASD-specific generalized PageRank method, and a novel method that integrates the four, for solving the ASD gene prediction and prioritization problem.

Our first method calculates information-entropy-based scores for all genes that can be mapped to copy number variations observed in ASD populations as well as in appropriate control groups, taking into account their frequency of occurrence in the ASD case and control groups. Ranking the genes in descending order of CNV-based score yields an area of 59.81% under the ROC curve and a 2.3-fold enrichment of ASD genes in the top 2% of the ranklist.
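The two evaluation measures quoted throughout this chapter can be sketched as follows. This is a minimal illustration under stated assumptions: the AUC is computed as the probability that a known ASD gene outranks a non-ASD gene, and fold enrichment is taken to be the fraction of known ASD genes in the top portion of the ranklist divided by their fraction in the whole list. The function names and toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a
    positive (label 1) receives a higher score than a negative (label 0);
    ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fold_enrichment(scores, labels, top_fraction):
    """Rate of positives in the top `top_fraction` of the ranklist,
    relative to the rate of positives in the full list."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    k = max(1, round(top_fraction * len(ranked)))
    hits_top = sum(y for _, y in ranked[:k])
    return (hits_top / k) / (sum(labels) / len(labels))

# Toy ranklist (hypothetical): 10 genes, 2 of them known ASD genes.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(roc_auc(toy_scores, toy_labels))
print(fold_enrichment(toy_scores, toy_labels, 0.2))
```

The pairwise AUC computation is quadratic in the number of genes, which is acceptable for a sketch; a rank-based formulation would be preferable for genome-scale ranklists.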

Our second method incorporates disease/phenotype similarity scores computed by van Driel et al. [112] and gene-phenotype relationships from the OMIM database. This method is seeded by high-confidence ASD genes from the literature to identify ASD-like phenotypes in OMIM. Genes involved in diseases with phenotypes similar to ASDs are scored highly by this algorithm. This method achieves an area of 55.96% under the ROC curve excluding the seed genes, and a 3.62-fold gain in ASD genes in the top 2% of the ranklist.

In our third method, we introduce diffusion state ASD proximity (DSAP) for proteins, based on the diffusion state distance (DSD) metric, which is superior to direct neighborhood and shortest-path distances in capturing the functional association of proteins in the PPI network. Genes are ranked in descending order of their diffusion state proximity to ASD seed genes. The DSAP-based prioritizer achieves an AUC of 54.05% excluding the seed genes. The top 4% of the ranklist shows a 1.1-fold enrichment of ASD genes (excluding seed genes); including the seed genes boosts this enrichment up to 5.7-fold.
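The DSD metric underlying DSAP can be sketched as follows. This is a minimal illustration of the standard k-step definition, assuming that the DSD between two nodes is the L1 distance between their vectors of expected visit counts under a k-step random walk (the starting node counts as a visit). It does not reproduce the DSAP score itself; the function name, the adjacency-dictionary input, and the toy network are hypothetical.

```python
def diffusion_state_distance(adj, k=5):
    """Return a function dsd(a, b) for an undirected graph given as
    {node: [neighbours]} with no isolated nodes. For each start node we
    accumulate the expected number of visits to every node over a
    k-step random walk, then compare two such vectors with the L1 norm."""
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    visits_from = []
    for a in nodes:
        dist = [0.0] * n
        dist[index[a]] = 1.0
        visits = dist[:]  # step 0: the walk is at its start node
        for _ in range(k):
            nxt = [0.0] * n
            for v in nodes:
                mass = dist[index[v]]
                if mass:
                    share = mass / len(adj[v])
                    for w in adj[v]:
                        nxt[index[w]] += share
            dist = nxt
            visits = [x + y for x, y in zip(visits, dist)]
        visits_from.append(visits)

    def dsd(a, b):
        va, vb = visits_from[index[a]], visits_from[index[b]]
        return sum(abs(x - y) for x, y in zip(va, vb))

    return dsd

# Toy PPI network (hypothetical): a simple path a - b - c - d.
toy_ppi = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
toy_dsd = diffusion_state_distance(toy_ppi, k=5)
print(toy_dsd("a", "b"), toy_dsd("a", "d"))
```

Because the distance compares how two proteins "see" the whole network rather than only the path between them, hub-mediated short paths contribute less than they would under a shortest-path distance.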

The fourth method we introduce is a generalization of Google's PageRank algorithm for ASDs. This approach uses the global PPI network structure to simulate network crosstalk between the genes in the network and high-confidence ASD seed genes. The simulated crosstalk quantifies the functional association of ASD genes with the rest of the genes in the network. Genes are ranked in descending order of their association scores. We achieve an AUC of 56.11% using this method. In the top 2% of the ranklist of genes, we achieve a 2.37-fold enrichment of ASD genes (excluding seeds).
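The idea of propagating seed information over the network can be sketched as follows. This is a minimal illustration of a generic random walk with restarts (personalized PageRank), not the exact generalization used in this thesis; the function name, the restart probability of 0.3, and the toy network are assumptions made for the example.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that, at each step,
    jumps back to a uniformly chosen seed with probability `restart` and
    otherwise moves to a uniformly chosen neighbour. `adj` maps each node
    to its neighbour list (no isolated nodes); seeds must be in `adj`."""
    nodes = sorted(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for v in nodes:
            share = (1.0 - restart) * p[v] / len(adj[v])
            for w in adj[v]:
                nxt[w] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy PPI network (hypothetical): a path a - b - c - d with seed gene "a".
toy_net = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
toy_rank = random_walk_with_restart(toy_net, seeds={"a"})
for gene in sorted(toy_rank, key=toy_rank.get, reverse=True):
    print(gene, round(toy_rank[gene], 4))
```

Genes would then be ranked in descending order of the returned probabilities, so that genes close to many seeds in the network receive high association scores.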

Considering the unbalanced nature of our dataset, these methods can be considered to perform reasonably well, as each of them achieves an AUC above 50%. However, their performance is limited in that none of them gives an AUC above 60%. Thus, to increase the overall accuracy of ASD gene prediction, we propose a novel integrative approach that incorporates not only CNV, phenotype similarity, connectivity, proximity, and topological similarity in the PPI network, but also ASD pathway knowledge from the available literature. Each gene is assigned an association probability based on a simple yet powerful logistic regression model. Adaptive lasso penalization with cross-validation is performed to avoid over-fitting of the model. Genes are ranked in descending order of their association probabilities. This integrative approach outperforms the above four individual methods, achieving an AUC of 65.34% on test data. The top 2% of the ranklist gives a 3-fold enrichment of ASD genes (excluding seeds), which increases up to 11.2-fold with the inclusion of seed genes. Thus we obtain a high-quality candidate gene set for ASDs consisting of the top 2% of genes in the ranklist.

Our candidate gene set provides a number of interesting insights into the genetic background and pathophysiology of ASDs. Pathway enrichment analysis reveals that the candidate gene set is overrepresented in a number of signaling, cell adhesion, and neurological pathways, which can be used to better explain the pathophysiology of ASDs. We have discovered an interesting connection between ASDs and IBD by showing that our candidate gene set has significant overlap with the majority of the IBD-related pathways. We have also found several disjoint subnetworks in our candidate gene set characterized by different categories of diseases and bio-functions, which indicate the possible existence of subclasses of disorders in the autism spectrum. The topmost subnetwork, characterized by gastrointestinal disorders, is particularly interesting and needs further investigation. Furthermore, we have identified a number of interesting molecular functions and biological processes through functional analysis and enrichment analysis on GO terms. For some of these (e.g., molecular functions related to metabolism, organ and tissue morphology, muscle cell differentiation, etc.), the connection to ASDs is not obvious and is thus worth further investigation.

There is considerable room for the further development of more sophisticated computational integrative approaches for combining ASD-related omics data from different sources. These techniques will become more important as the omics data related to ASD grow rapidly, with more and more studies being performed on larger ASD cohorts. Thus, sophisticated computational analysis is key to understanding the still poorly understood etiology of ASDs. This thesis provides a significant step towards a better understanding of the biological underpinnings of ASDs.

Appendix A

SFARI Genes for Autism Spectrum Disorders

1 Table A.1 - ASD risk genes reported by SFARI gene module. Gene Symbol Gene Name Chromosomal Location # Reports 2 NRXN1 1 p16.3 51 MECP2 Methyl CpG binding protein 2 Xq28 39 CNTNAP2 contactin associated protein-like 2 7q35-q36 38 SHANK3 SH3 and multiple repeat domains 3 22q13.3 33 FMR1 fragile X mental retardation 1 Xq27.3 29 MET met proto-oncogene (hepatocyte growth factor receptor) 7q31 29 CACNA1C calcium channel, voltage-dependent, L type, alpha 1C sub- 12p13.3 27 unit RELN Reelin 7q22 27 FOXP2 forkhead box P2 7q31 26 OXTR oxytocin receptor 3p25 26 DISCI disrupted in schisophrenia 1 1q42.1 24 DMD (muscular dystrophy, Duchenne and Becker types) Xp2l.2 22 NLGN3 3 Xql3.1 22 RBFOX1 RNA binding protein, fox-1 homolog (C. elegans) 1 16p13.3 22 PTEN phosphatase and tensin homolog (mutated in multiple ad- 10q23.3 21 vanced cancers 1) GABRB3 gamma-aminobutyric acid (GABA) A receptor, beta 3 15q11.2-q12 20 NLGN4X neuroligin 4, X-linked Xp22.32-p22.31 20 SYNGAPI. synaptic Rae GTPase activating protein 1 6p21.3 20 AUTS2 autism susceptibility candidate 2 7q11.22 19 SCNIA sodium channel, voltage-gated, type I, alpha subunit 2q24.3 19 SLC6A4 solute carrier family 6 (neurotransmitter transporter, sero- 17q11.l-q12 19 tonin), member 4 DPP6 dipeptidyl-peptidase 6 7q36.2 18 GRIN2B glutamate receptor, inotropic, N-methyl D-apartate 2B 12p12 18 1 3 2 GRIN2A glutamate receptor, ionotropic, N-methyl D-aspartate 2A i6p . 17 MBDS5 Methyl-CpG binding domain protein 5 2q23.1 17 EN2 homolog 2 7q36 16 CDKL5 -dependent kinase-like 5 Xp22 15 HOXAI Al. 7pl5.3 15 NFl neurofibromin 1 (neurofibromatosis, von Recklinghausen dis- 17q11.2 15 ease, Watson disease) SCN2A sodium channel, voltage-gated, type II, alpha subunit 2q23-q24 15 SHANK2 SH3 and multiple ankyrin repeat domains 2 11q13.3-q13.4 15 (contin ued on next page)

75 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports AHII Abelson helper integration site 1 6q23.3 14 CACNA1H calcium channel, voltage-dependent, alpha 1H subunit 16pl3.3 14 CNTN4 contactin 4 3p26-p25 14 ILlRAPL1 interleukin 1 receptor accessory protein-like 1 Xp22.1-p21.3 14 KCNMAI potassium large conductance calcium-activated channel, sub- 10q22.3 14 family M, alpha member 1 RORA RAR-related orphan receptor A 15q22.2 14 SYNI Synapsin 1 Xp1l.23 14 TSC2 2 l6p13.3 14 MEF2C myocyte enhancer factor 2C 5q14 13 NLGN1 neuroligin 1 3q26.31 13 PCDH19 19 Xq13.3 13 SLC25A12 solute carrier family 25 (mitochondrial carrier, Aralar), mem- 2q24 13 ber 12 TSC1 tuberous sclerosis 1 9q34 13 UBE3A ubiquitin protein E3A 15q11.2 13 AVPR1A arginine vasopressin receptor 1A 12q14-q15 12 KDM5C Lysine (K)-specific 5C Xpll.22-pll.21 12 NTRK3 neurotrophic tyrosine kinase, receptor, type 3 15q25 12 PARK2 Parkinson disease (autosomal recessive, juvenile) 2, parkin 6q25.2-q27 12 CACNA1G calcium channel, voltage-dependent, T type, alpha 1G sub- 17q22 11 unit DLX2 distal-less homeobox 2 2q32 11 ERBB4 v-erb-a erythroblastic leukemia viral oncogene homolog 4 2q33.3-q34 11 (avian) ITGB3 integrin, beta 3 (platelet glycoprotein Ilia, antigen CD61) 17q21.32 11 MACROD2 MACRO domain containing 2 20pl2.1 11 MAOA monoamine oxidase A Xpl1.3 11 MCPH1 microcephalin 1 8p23.1 11 MED12 mediator complex subunit 12 Xql3 11 MTHFR methylenetetrahydrofolate reductase (NAD(P)H) 1p36.3 11 RAPGEF4 Rap guanine nucleotide exchange factor (GEF) 4 2q31-q32 11 4 SLC1A1 solute carrier family 1 (neuronal/epithelial high affinity glu- 9p2 11 tamate transporter, system Xag), member 1 STXBP1 Syntaxin binding protein 1 9q34.1 TCF4 4 18q21.1 ADRB2 adrenergic, beta-2-, receptor, surface 5q31-q32 AFF2 AF4/FMR2 family, member 2 Xq28 ANK3 Ankyrin 3, node of Ranvier (ankyrin G) 10q21 BAIAP2 BAll-associated protein 2 17q25 BCL2 B-cell CLL/lymphoma 2 18q21.3 EIF4E eukaryotic translation initiation factor 4E 
4q21-q25 FOXPi forkhead box P1 3p14.1 GRIK2 glutamate receptor, ionotropic, kainate 2 6q16.3-q21 HDAC4 deacetylase 4 2q37.3 SYNEI repeat containing, nuclear envelope 1 6q25 ARID1B AT rich interactive domain 1B (SWIl-like) 6q25.1 ARNT2 aryl-hydrocarbon receptor nuclear translocator 2 15q24 ASTN2 astrotactin 2 9q33.1 DIAPH3 Diaphanous-related formin 3 13q21.2 DPP1O Dipeptidyl-peptidase 10 2q14.1 GRIPI glutamate receptor interacting protein 1 12q14.3 IMMP2L IMP2 inner mitochondrial membrane peptidase-like (S. cere- 7q31 visiae) OPHN1 oligophrenin 1 Xq12 9 SEMA5A sema domain, seven thrombospondin repeats (type 1 and type 5p15.2 9 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A TTN 2q31 9 (continued on next page)

76 Table A.1 - Continued.

Gene Symbol Gene Name Chromosomal Location # Reports WNT2 wingless-type MMTV integration site family member 2 7q31 9 ANKRD11 ankyrin repeat domain 11 16q24.3 8 ARX aristaless related homeobox Xp22.1-22.3 8 CADPS2 Ca2+-dependent activator protein for secretion 2 7q31.3 8 CHRNA7 cholinergic receptor, nicotinic, alpha 7 15q14 8 CTNNA3 catenin (-associated protein), alpha 3 10q22.2 8 DLX1 distal-less homeobox 1 2q32 8 DLX6 distal-less homeobox 6 7q22 8 ESR1 1 6q25.1 8 FHIT fragile histidine triad gene 3p14.2 8 FOXG1 Forkhead box G1 14q13 8 GLRA2 glycine receptor, alpha 2 Xp22.1-p21.3 8 HOXB1 homeobox Bi 17q21.3 8 HSD11B1 hydroxysteroid (11-beta) dehydrogenase 1 1q32-q41 8 NTNG1 netrin G1 lp13.3 8 PLCD1 , delta 1 3p22-p21.3 8 PTPRC protein tyrosine phosphatase, receptor type, C 1q31-q32 8 RAIl retinoic acid induced 1 17plI.2 8 RFWD2 ring finger and WD repeat domain 2 1q25.1-q25.2 8 RPLlO ribosomal protein L10 Xq28 8 SLC9A9 solute carrier family 9 (sodium/hydrogen exchanger), mem- 3q24 8 ber 9 SNDl staphylococcal nuclease and tudor domain containing 1 7q31.3 8 XPC xeroderma pigmentosum, complementation group C 3p25 8 ADORA2A adenosine A2a receptor 22q11.23 7 APC adenomatosis polyposis coli 5q21-q22 7 CADMI 1 11q23.2 7 4 CDHIO cadherin 10, type 2 (T2-cadherin) 5p1 -p13 7 CHD7 chromodomain DNA binding protein 7 Sq12.2 7 CNTNAP5 contactin associated protein-like 5 2ql4.3 7 CXCR3 chemokine (C-X-C motif) receptor 3 Xql3 7 DHCR7 7-dehydrocholesterol reductase 11q13.2-q13.5 7 DLGAP2 discs, large (Drosophila) homolog-associated protein 2 8p23 7 DLGAP3 Discs, large (Drosophila) homolog-associated protein 3 lp35.3-p34.1 7 DRD3 dopamine receptor D3 3q13.3 7 ESR2 estrogen receptor 2 (ER beta) 14q23.2 7 ESRRB estrogen-related receptor beta 12 41.0 cM 7 GPC6 glypican 6 13q32 7 GRPR Gastrin-releasing peptide receptor Xp22.2-p22.13 7 1 5 5 HRAS v-Ha-ras Harvey rat sarcoma viral oncogene homolog lip . 
7 KCNJ10 Potassium inwardly-rectifying channel, subfamily J, member 1q23.2 7 10 MARK1 MAP/microtubule affinity-regulating kinase 1 1q41 7 MKL2 MKL/myocardin-like 2 16p13.12 7 NRXN3 neurexin 3 14q31 7 NTRK1 neurotrophic tyrosine kinase, receptor, type 1 1q21-q22 7 ROBO roundabout, axon guidance receptor, homolog 1 (Drosophila) 3p12 7 SLC9A6 solute carrier family 9 (sodium/hydrogen exchanger), mem- Xq26.3 7 ber 6 VPS13B vacuolar protein sorting 13 homolog B (yeast) 8q22.2 7 AFF4 AF4/FMR2 family, member 4 5q31 6 AR Xqll.2-q12 6 ATRX alpha thalassemia/mental retardation syndrome X-linked Xq2l.1 6 CA6 carbonic anhydrase VI 1p36.2 6 3 14 3 CACNA1D calcium channel, voltage-dependent, L type, alpha 1D p . 6 CDH8 cadherin 8, type 2 16q22.1 6 CDH9 cadherin 9, type 2 (Ti-cadherin) 5p14 6 CHD2 Chromodomain helicase DNA binding protein 2 15q26 6 CNR1 cannabinoid receptor 1 (brain) 6ql4-q15 6 (continued on next page)

77 Table A.1 - Continu ed. Gene Symbol Gene Name Chromosomal Location # Reports DABI1 disabled homolog 1 (Drosophila) 1p32-p31 6 DCX doublecortex, lissencephaly, X-linked (doublecortin) Xq2i.3-q23 6 DPYD dihydropyrimidine dehydrogenase lp22 6 DYRK1A Dual-specificity tyrosine-(Y)-phosphorylation regulated ki- 21q22.13 6 nase 1A EPHA6 EPH receptor A6 3q11.2 6 GABRA4 gamma-aminobutyric acid (GABA) A receptor, alpha 4 4p12 6 GLO1 glyoxalase I 6p21.3-p21.l 6 GRID2 glutamate receptor, ionotropic, delta 2 4q22 6 GRM8 glutamate receptor, metabotropic 8 7q31.3-q32. 6 HTR1B 5-hydroxytryptamine (serotonin) receptor 1B 6q13 6 KCNQ2 Potassium voltage-gated channel, KQT-like subfamily, mem- 20q13.3 6 ber 2 MYO1A IA 12q13-q14 6 MYTlL Myelin transcription factor 1-like 2p25.3 6 NOSIAP 1 (neuronal) adaptor protein 1q23.3 6 NOS2A nitric oxide synthase 2A (inducible, hepatocytes) 17q11.2-q12 6 NRP2 neuropilin 2 2q33.3 6 PCDH9 protocadherin 9 13q21.32 6 PSMD1O proteasome (prosome, macropain) 26S subunit, non-ATPase, Xq22.3 6 10 RGS7 regulator of G-protein signaling 7 1q23.1 6 RPS6KA3 Ribosomal protein S6 kinase, 90kDa, polypeptide 3 Xp22.2-p22.1 6 SLC6A8 solute carrier family 6 (neurotransmitter transporter, crea- Xq28 6 tine), member 8 TBC1D5 TBC1 domain family, member 5 3p24.3 6 TH 11p15.5 6 UPF3B UPF3 regulator of nonsense transcripts homolog B (yeast) Xq25-q26 6 WNK3 WNK lysine deficient protein kinase 3 Xpll.23-pll.21 6 ADA adenosine deaminase 20qI2-q13.11 5 AGAP1 ArfGAP with GTPase domain, ankyrin repeat and PH do- 2q37 5 main 1 ALDH5A1 aldehyde dehydrogenase 5 family, member Al (succinate- 6p22.2-p22.3 semialdehyde dehydrogenase ) APBA2 amyloid beta (A4) precursor protein-binding, family A, mem- 15q11-q12 ber 2 ARHGAP15 Rho GTPase activating protein 15 2q22.2-q22.3 5 BRAF v-raf murine sarcoma viral oncogene homolog B 7q34 5 C4B complement component 4B 6p21.3 5 CACNA1B Calcium channel, voltage-dependent, N type, alpha 1B sub- 9q34 5 unit CELF4 CUGBP, Elav-like family member 4 
18q12 CTCF CCCTC-binding factor (sinc finger protein) 16q21-q22.3 CYFIP1 cytoplasmic FMR1 interacting protein 1 15q11 DMPK dystrophia myotonica-protein kinase 19q13.3 DOCK4 Dedicator of cytokinesis 4 7q31.1 4 F13A1 coagulation factor XIII, Al polypeptide 6p25.3-p2 .3 FABP5 fatty acid binding protein 5 (psoriasis-associated) 8q21.13 GPXl glutathione peroxidase 1 3p21.3 GTF2I general transcription factor IIi 7q11.23 HEPACAM hepatic and glial cell adhesion molecule 11q24.2 6 HLA-A major histocompatibility complex, class I, A p21.3 HS3ST5 heparan sulfate (glucosamine) 3-0-sulfotransferase 5 6q21 HTR2A 5-hydroxytryptamine (serotonin) receptor 2A 13ql4-q21 HTR3C 5-hydroxytryptamine (serotonin) receptor 3, family member 3q27.1 C HTR7 5-hydroxytryptamine (serotonin) receptor 7 (adenylate 10q21-q24 5 cyclase-coupled) IL1R2 interleukin 1 receptor, type II 2q12 5 (continued on next page)

78 Table A.l - Continued. Gene Symbol Gene Name Chromosomal Location # Reports ITGA4 integrin, alpha 4 (antigen CD49D, alpha 4 subunit of VLA-4 2q31.3 5 receptor) JARID2 Jumonji, AT rich interactive domain 2 6p24-p23 5 KCNQ3 Potassium voltage-gated channel, KQT-like subfamily, mem- 8q24 5 ber 3 LAMC3 laminin, gamma 3 9q31-q34 5 MAP2 microtubule-associated protein 2 2q34-q35 5 MBD1 methyl-CpG binding domain protein 1 18q21 5 MBD4 methyl-CpG binding domain protein 4 3q21-q22 5 MDGA2 MAM domain containing glycosylphosphatidylinositol anchor 14q21.3 5

MYO16 myosin XVI 13q33.3 5 NRCAM neuronal cell adhesion molecule 7q31.1-q31.2 5 PCDH10 protocadherin 10 4q28.3 5 7 1 3 PER1 period homolog 1 (Drosophila) l p .l-p12 5 PINX1 PIN2/TERF1 interacting, telomerase inhibitor 1 8p23 5 PITX1 paired-like homeodomain 1 5q31 5 PONI paraoxonase 1 7q21.3 5 PTCHD1 patched domain containing 1 Xp22.11 5 PTGS2 prostaglandin-endoperoxide synthase 2 (prostaglandin G/H 1q25.2-q25.3 5 synthase and cyclooxyge nase) SATB2 SATB homeobox 2 2q33 5 SCN8A sodium channel, voltage gated, type VIII, alpha subunit 12q13 5 SEZ6L2 SEZ6L2 seizure related 6 homolog (mouse)-like 2 16pll.2 5 SLC4A10 solute carrier family 4, sodium bicarbonate transporter-like, 2q23-q24 5 member 10 SNTG2 , gamma 2 2p25.3 ST8SIA2 ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 2 15q26 STK39 serine threonine kinase 39 (STE20/SPS1 homolog, yeast) 2q24.3 TBR1 T-box, brain, 1 2q24 TGM3 traneglutaminase 3 20q11.2 TSPAN7 tetraspanin 7 Xpll.4 VIP Vasoactive intestinal peptide 6q25 1 2 ABAT 4-aminobutyrate aminotransferase 16p 3. ACYl Aminoacylase 1 3p2l.1 ADSL adenylosuccinate 22q13.1, 22q13.2 ALOX5AP arachidonate 5-lipoxygenase-activating protein 13qI2 AP1S2 Adaptor-related 1, sigma 2 subunit Xp22.2 ATP2B2 ATPase, Ca++ transporting, plasma membrane 2 3p25.3 BZRAPI. 
bensodiasapine receptor (peripheral) associated protein 1 17q22-q23 CASC4 cancer susceptibility candidate 4 15q15.3 CD38 CD38 molecule 4p15 CDH22 cadherin-like 22 20q13.1 CHD8 chromodomain helicase DNA binding protein 8 14q11.2 CREBBP CREB binding protein 16p13.3 CTTNBP2 cortactin binding protein 2 7q31 CUL3 3 2q36.2 CYPIBI cytochrome P450, family 11, subfamily B, polypeptide 1 8q21 EGR2 early growth response 2 (Krox-20 homolog, Drosophila) 10q21.1 EPC2 Enhancer of polycomb homolog 2 (Drosophila) 2q23.1 EPHB6 EPH receptor B6 7q33-q35 FBXO40 F-box protein 40 3q13.33 4 GABRB1 gamma-aminobutyric acid (GABA) A receptor, beta 1 p12 GALNT13 UDP-N-acetyl-alpha-D-galactosamine:polypeptide N- 2q23.3-q24.1 acetylgalactosaminyltransferase 13 (GalNAc-T13) GNAS GNAS complex locus 20q13.3 4 GPHN 14q23.3 4 GRM Glutamate receptor, metabotropic 1 6q24 4 HLA-DRB1 major histocompatibility complex, class II, DR beta 1 6p2l.3 4 (continued on next page)

79 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports HUWE1 HECT, UBA and WWE domain containing 1, E3 ubiquitin Xp1l.22 4 protein ligase ICA1 islet cell autoantigen 1, 69kDa 7p22 4 JMJD1C jumonji domain containing 1C 10q21.2 4 KANK1 KN motif and ankyrin repeat domains 1 9p24.3 4 LRP2 Low density lipoprotein receptor-related protein 2 2q24-q31 4 LRRC1 leucine rich repeat containing 1 6p12.1 4 LZTS2 leucine sipper, putative tumor suppressor 2 10q24 4 MBD3 methyl-CpG binding domain protein 3 19p13.3 4 MCC mutated in colorectal cancers 5q21 4 MTF1 metal-regulatory transcription factor 1 1p33 4 NBEA neurobeachin 13q13 4 NPAS2 neuronal PAS domain protein 2 2q11.2 4 NSD1 binding SET domain protein 1 5q35 4 PHF8 PHD finger protein 8 Xp11.22 4 PIK3CG phosphoinositide-3-kinase, catalytic, gamma polypeptide 7q22.3 4 PLN 6q22.1 4 PRICKLE1 Prickle homolog 1 (Drosophila) 12q12 4 PRKCB , beta 16p11.2 4 RAB39B RAB39B, member RAS oncogene family Xq28 4 SGSH N-sulfoglucosamine sulfohydrolase 17q25.3 4 SH3KBP1 SH3-domain kinase binding protein 1 Xp22.1-p21.3 4 SLC30A5 solute carrier family 30 5q12.1 4 SLC6A3 Solute carrier family 6 (neurotransmitter transporter), mem- 5pl5.3 4 ber 3 SYN2 Synapsin II 3p25 4 TDO2 tryptophan 2,3-dioxygenase 4q31-q32 4 UBE3B ubiquitin protein ligase E3B 12q24.11 4 VASH1 vasohibin 1 14q24.3 4 ADNP Activity-dependent neuroprotector homeobox 20q13.13 3 AGBL4 ATP/GTP binding protein-like 4 1p3 3 3 AGTR2 angiotensin II receptor, type 2 Xq22-q23 3 ALDH1A3 Aldehyde dehydrogenase 1 family, member A3 15q26.3 3 APP Amyloid beta (A4) precursor protein 21q21.3 3 ASS1 argininosuccinate synthetase 9q34.1 3 BCKDK Branched chain ketoacid dehydrogenase kinase l6pl1.2 3 BIN1 Bridging integrator 1 2q14 3 C12orf57 open reading frame 57 12p13.31 3 C3orf58 open reading frame 58 3q24 3 CAMTA1 binding transcription activator 1 1p36.31-p36.23 3 CBS cystathionine beta-synthase 21q22.3 3 1 3 CD44 CD44 molecule (Indian blood group) lIp 3 CEP290 Centrosomal 
protein 29OkDa 12q21.32 3 CEP41 testis specific, 14 7q32 3 CMIP c-Maf inducing protein 16q23 3 DAPK1 death-associated protein kinase 1 9q34.1 3 DCTN5 5 16pl2.2 3 DCUNID1 DCN1, defective in cullin neddylation 1, domain containing 3q26.3 3 1 (S. cerevisiae) DDX11 DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 11 12pll 3 DDX53 DEAD (Asp-Glu-Ala-Asp) box polypeptide 53 Xp22.11 3 DRD1 Dopamine receptor D1 5q35.1 3 EHMT1 Euchromatic histone-lysine N-methyltransferase 1 9q34.3 3 EXT1 Exostosin 1 8q24.11 3 FATI FAT tumor suppressor homolog 1 (Drosophila) 4q35 3 FLT1 fms-related tyrosine kinase 1 (vascular endothelial growth 13q12 3 factor/vascular perme ability factor receptor) FRK fyn-related kinase 6q21-q22.3 3 FRMPD4 FERM and PDZ domain containing 4 Xp22.2 3 (continued on next page)

Table A.1 - Continued.

Gene Symbol Gene Name Chromosomal Location # Reports GPD2 Glycerol-3-phosphate dehydrogenase 2 (mitochondrial) 2q24.1 3 GRIDI Glutamate receptor, ionotropic, delta 1 10q22 3 GRM5 Glutamate receptor, metabotropic 5 11ql4.3 3 GSTM1 glutathione S- M1 1p13.3 3 HCFC1 Host cell factor C1 (VP16-accessory protein) Xq28 3 HNRNPH2 heterogeneous nuclear ribonucleoprotein H2 (H') Xq22 3 HOMER1 Homer homolog 1 (Drosophila) 5ql4.2 3 INPP1 inositol polyphosphate-l-phosphatase 2q32 3 IQSEC2 IQ motif and Sec7 domain 2 Xpl1.22 3 ITGB7 integrin, beta 7 12q13.13 3 KCND2 Potassium voltage-gated channel, Shal-related subfamily, 7q31 3 member 2 KIAA1586 KIAA1586 6p12.1 3 NDNL2 necdin-like 2 15q13.1 3 NDUFA5 NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 5, 7q32 3 l3kDa NFIA /A 1p31.3-p31.2 NXF5 Nuclear RNA export factor 5 Xq22 OPRM1 opioid receptor, mu 1 6q24-q25 OTX1 Orthodenticle homeobox 1 2p13 PCDHA11 Protocadherin alpha 11 5q31 PCDHA13 Protocadherin alpha 13 5q31 PCDHA2 Protocadherin alpha 2 5q31 PCDHA4 Protocadherin alpha 4 5q31 PCDHA5 Protocadherin alpha 5 5q31 PCDHA6 Protocadherin alpha 6 5q31 PCDHA7 Protocadherin alpha 7 5q31 PCDHA9 Protocadherin alpha 9 5q31 PDZD4 PDZ domain containing 4 Xq28 PLCB1 phospholipase C, beta 1 (phosphoinositide-specific) 20p12 POGZ Pogo transposable element with ZNF domain 1q21.3 4 PRICKLE2 Prickle homolog 2 (Drosophila) 3p1 .1 PRUNE2 prune homolog 2 (Drosophila) 9q21.2 PSD3 pleckstrin and Sec7 domain containing 3 8p2l.3 RBlCC1 RB1-inducible coiled-coil 1 Sql1 REEP3 receptor accessory protein 3 10q21.3 RHOXF1 Rhox homeobox family, member 1 Xq24 RIMS3 regulating synaptic membrane exocytosis 3 lpter-p22.2 RPS6KA2 ribosomal protein S6 kinase, 9OkDa, polypeptide 2 6q27 SDC2 syndecan 2 (heparan sulfate proteoglycan 1, cell surface- 8q22-q23 associated, fibroglycan ) SOX5 SRY (sex determining region Y)-box 5 12p12.1 STX1A Syntaxin 1A (brain) 7q11.23 4 SUCLG2 succinate-CoA ligase, GDP-forming, beta subunit 3p1 .1 TAFIL TAF1 RNA polymerase II 9p21.1 2 
4 TBClD7 TBC1 domain family, member 7 6p .1 TLK2 tousled-like kinase 2 17q23 TMLHE trimethyllysine hydroxylase, epsilon Xq28 TOP1 Topoisomerase (DNA) I 20ql2-q13.1 TOP3B Topoisomerase (DNA) III beta 22q11.22 TRIP12 Thyroid interactor 12 2q36.3 TSN translin 2q21.1 TUBGCP5 , gamma complex associated protein 5 15q11.2 WNT1 Wingless-type MMTV integration site family, member 1 12q13 ADARB1 Adenosine deaminase, RNA-specific, BI 21q22.3 ADCY5 Adenylate cyclase 5 3q21.1 ADORAS Adenosine A3 receptor lpl3.2 ANK2 Ankyrin 2, neuronal 4q25-q27 ASXL3 Additional sex combs like 3 (Drosophila) 18qil (continued on next page)

81 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports CACNA1I Calcium channel, voltage-dependent, T type, alpha 11 sub- 22q13.1 2 unit 1 3 CAPRIN1 Cell cycle associated protein 1 Ilp 2 2 CCDC64 coiled-coil domain containing 64 12q24. 3 2 CHRM3 Cholinergic receptor, muscarinic 3 1q43 2 CLTCL1 clathrin, heavy chain-like 1 22q11.21 2 CNTN3 contactin 3 (plasmacytoma associated) 3p12.3 2 CSMD1 CUB and Sushi multiple domains 1 8p23.2 2 CTNNB1 Catenin (cadherin-associated protein), beta 1, 88kDa 3p21 2 DDC Dopa decarboxylase (aromatic L-amino acid decarboxylase) 7pl2.2 2 DEPDC5 DEP domain containing 5 22ql2.3 2 DLG4 Discs, large homolog 4 (Drosophila) 17p13.1 2 DRD2 Dopamine receptor D2 11q23 2 EML1 echinoderm microtubule associated protein like 1 14q32 2 EP400 ElA binding protein p400 12q24.33 2 EPHB2 EPH receptor B2 1p36.1-p35 2 EXOC6B Exocyst complex component 6B 2pl3.2 2 FAM135B Family with sequence similarity 135, member B 8q24.23 2 FBXO33 F-box protein 33 14q21.1 2 FGD1 FYVE, RhoGEF and PH domain containing 1 Xpll.21 2 FOLHI Folate (prostate-specific membrane antigen) 1 l1pI1.2 2 GALNT14 UDP-N-acetyl-alpha-D-galactosamine:polypeptide N- 2p23.1 2 acetylgalactosaminyltransferase 14 (GalNAc-T14) GSK3B Glycogen synthase kinase 3 beta 3q13.3 2 HERC2 HECT and RLD domain containing E3 ubiquitin protein lig- 15q13 2 ase 2 KATNAL2 Katanin p60 subunit A-like 2 18q21.1 2 KHDRBS2 KH domain containing, RNA binding, signal transduction as- 6q11.1 2 sociated 2 KIAA2022 KIAA2022 Xq13.3 2 KIF5C family member 5C 2q23.1 2 LRPPRC Leucine-rich pentatricopeptide repeat containing 2p2I 2 MAOB Monoamine oxidase B Xp1l1.23 2 MC4R Melanocortin 4 receptor 18q22 2 1 5 1 NELL1 NEL-like 1 (chicken) 11p . 
2 NIPA1 non imprinted in Prader-Willi/Angelman syndrome 1 15q11.2 2 NIPA2 non imprinted in Prader-Willi/Angelman syndrome 2 15q11.2 2 NIPBL Nipped-B homolog (Drosophila) 5pl3.2 2 NRXN2 neurexin 2 11q13 2 PCDH15 Protocadherin-related 15 10q21.1 2 PCDHAC2 Protocadherin alpha subfamily C, 2 5q31 2 PDE4A phosphodiesterase 4A, cAMP-specific 19pl3.2 2 PDE4B phosphodiesterase 4B, cAMP-specific lp3l 2 PEX7 peroxisomal biogenesis factor 7 6q23.3 2 3 4 1 POMGNT1 Protein O-linked mannose betal,2-N- 'p . 2 acetylglucosaminyltransferase PTPRT protein tyrosine phosphatase, receptor type, T 20q12-ql3 2 PXDN Peroxidasin homolog (Drosophila) 2p25 2 RAB11FIP5 RABIl family interacting protein 5 2p13 2 RBM8A RNA binding motif protein 8A 1q21.1 2 SAE1 SUMO1 activating ensyme subunit 1 19q13.32 2 SBF1 SET binding factor 1 22q13.33 2 SDK1 Sidekick cell adhesion molecule 1 7p22.2 2 SERPINE1 Serpin peptidase inhibitor, clade E (nexin, plasminogen acti- 7q21.3-q22 2 vator inhibitor type 1), member 1 3 SETD2 SET domain containing 2 p21.31 2 SETDB2 SET domain, bifurcated 2 13q14 2 SGSM3 Small signaling modulator 3 22q13.1-q13.2 2 SLIT3 Slit homolog 3 (Drosophila) 5q35 2 (continued on next page)

82 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports SODI Superoxide dismutase 1, soluble 21q22.11 2 ST7 suppression of tumorigenicity 7 7q31.1-q31.3 2 STXBP5 Syntaxin binding protein 5 (tomosyn) 6q24.3 2 SUV420H1 suppressor of variegation 4-20 homolog 1 (Drosophila) 11q13.2 2 SYAPI Synapse associated protein 1 Xp22.2 2 SYT17 XVII 16p12.3 2 TBL1XR1 (beta)-like 1 X-linked receptor 1 3q26.32 2 TYR (oculocutaneous albinism IA) 11ql4-q21 2 UBE3C Ubiquitin protein ligase E3C 7q36.3 2 YWHAE Tyrosine 3-monooxygenase/tryptophan 5-monooxygenase ac- 17p13.3 2 tivation protein, epsilon polypeptide ABCA7 ATP-binding cassette, sub-family A (ABC1), member 7 19p13.3 ADK adenosine kinase 10qI1-q24 1 ARHGAP24 Rho GTPase activating protein 24 4q22.1 ATRNL1 Attractin-like 1 10q26 1 ATXN7 Ataxin 7 3p21.1-p12 1 BBS4 Bardet-Biedl syndrome 4 15q22.3-q23 BRCA2 breast cancer 2, early onset 13q12.3 BTAF1 RNA polymerase II, B-TFIID transcription factor-associated, 10q22-q23 170kDa (Motl homolog, S. 
cerevisiae) C15orf43 open reading frame 43 15q21.1 1 CAMK4 Calcium/calmodulin-dependent protein kinase IV 5q21.3 1 CAMSAP2 calmodulin regulated spectrin-associated , lq32.1 1 member 2 CD99L2 CD99 molecule-like 2 Xq28 2 CDKN1B Cyclin-dependent kinase inhibitor 1B (p27, Kipl) 12p13.1-p1 CECR2 Cat eye syndrome chromosome region, candidate 2 22q11.2 CLSTN3 Calsyntenin 3 12p13.31 CNTNAP3 contactin associated protein-like 3 9p13.1 CSNK1D , delta 17q25 DAPPI1 Dual adaptor of phosphotyrosine and 3-phosphoinositides 4q25-q27 DNAJC19 DnaJ (Hsp40) homolog, subfamily C, member 19 3q26.33 DNM1L 1-like 12pl1.21 DOCK10 Dedicator of cytokinesis 10 2q36.2 DOLK Dolichol kinase 9q34.11 DST 6p12.1 DUSP22 dual specificity phosphatase 22 6p25.3 DYDC1 DPY30 domain containing 1 10q23.1 DYDC2 DPY30 domain containing 2 10q23.1 EIF4EBP2 Eukaryotic translation initiation factor 4E binding protein 2 10q21-q22 EP300 ElA binding protein p300 22q13.2 EPS8 Epidermal growth factor receptor pathway substrate 8 12p12.3 ERG v-ets erythroblastosis virus E26 oncogene homolog (avian) 21q22.3 FANI FANCD2/FANCI-associated nuclease 1 15q13.2-q13.3 FBXO15 F-box protein 15 18q22.3 FER Fer (fps/fes related) tyrosine kinase 5q21 FGA Fibrinogen alpha chain 4q28 GABRA3 Gamma-aminobutyric acid (GABA) A receptor, alpha 3 Xq28 GAN 16q24.1 GAP43 Growth associated protein 43 3q13.1-q13.2 GAS2 Growth arrest-specific 2 llp4.3 GNA14 Guanine nucleotide binding protein (G protein), alpha 14 9q21 GNBIL guanine nucleotide binding protein (G protein), beta 22q11.2 polypeptide 1-like GPR37 G protein-coupled receptor 37 (endothelin receptor type B- 7q31 like) GRM4 Glutamate receptor, metabotropic 4 6p2l.3 1 GSN 9q33 1 GUCY1A2 1, soluble, alpha 2 11q21-q22 (contin ued on next page)

83 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports HDAC6 Histone deacetylase 6 Xpl1.23 1 2 HMGN1 high mobility group nucleosome binding domain 1 21q22. 1 HYDIN HYDIN, axonemal central pair apparatus protein 16q22.2 1 INADL InaD-like (Drosophila) lp31.3 KCTD13 tetramerisation domain containing 13 16pll.2 1 KIT V-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene ho- 4q11-q12 molog KLC2 Kinesin light chain 2 11q13.2 KPTN Kaptin ( binding protein) 19q13.32 LAMA1 Laminin, alpha 1 18pll.3 LAMB1 laminin, beta 1 7q22 LEP Leptin 7q31.3 LMX1B LIM homeobox transcription factor 1, beta 9q33.3 LRRC7 Leucine rich repeat containing 7 1p31.1 MAGED1 Melanoma antigen family D, 1 Xp1l.23 MAGEL2 MAGE-like 2 15q11-q12 MAPK1 Mitogen-activated protein kinase 1 22q11.21 MAPK3 mitogen-activated protein kinase 3 16p11.2 MAPK8IP2 Mitogen-activated protein kinase 8 interacting protein 2 22q13.33 MBD6 Methyl-CpG binding domain protein 6 12q13 MSN Moesin Xql1.1 MSR1 macrophage scavenger receptor 1 8p22 MTR 5-methyltetrahydrofolate-homocysteine methyltransferase 1q43 MTX2 Metaxin 2 2q31.1 MYH4 Myosin, heavy chain 4, 17pl3.1 NCKAP5L NCK-associated protein 5-like 12q13.12 NCKAP5 NCK-associated protein 5 2q21.2 NEFL , light polypeptide 8p2l ODF3L2 outer dense fiber of sperm tails 3-like 2 19p13.3 OGT O-linked N-acetylglucosamine (GlcNAc) transferase Xql3 PAH Phenylalanine hydroxylase 12q22-q24.2 PARD3B Par-3 partitioning defective 3 homolog B (C. elegans) 2q33.3 PCDH8 protocadherin 8 13q21.1 PCDHGA11 protocadherin gamma subfamily A, 11 5q31 PECR peroxisomal trans-2-enoyl-CoA reductase 2q35 PIK3R2 Phosphoinositide-3-kinase, regulatory subunit 2 (beta) 19q13.2-q13.4 PLAUR Plasminogen activator, urokinase receptor 19q13 POTI Protection of telomeres 1 homolog (S. 
pombe) 7q31.33 PPFIA1 Protein tyrosine phosphatase, receptor type, f polypeptide 11ql3.3 (PTPRF), interacting protein (liprin), alpha 1 PPP1R1B 1, regulatory (inhibitor) subunit 1B 17q12 PRKD1 Protein kinase Dl 14q11 PTGER3 Prostaglandin E receptor 3 (subtype EP3) lp3l.2 PTPN11 protein tyrosine phosphatase, non-receptor type 11 12q24 PTPRB Protein Tyrosine Phosphatase, Receptor Type, B 12q15-q21 RASD1 RAS, dexamethasone-induced 1 17pll.2 RASSF5 Ras association (RalGDS/AF-6) domain family member 5 1q32.1 RERE Arginine-glutamic acid dipeptide (RE) repeats 1p36.23 RNPS1 RNA binding protein S1, serine-rich domain l6p13.3 ROBO2 Roundabout, axon guidance receptor, homolog 2 (Drosophila) 3p12.3 RPP25 Ribonuclease P/MRP 25kDa subunit 15q24.2 SCFD2 seci family domain containing 2 4q12 SETDB1 SET domain, bifurcated 1 1q21 SHANKi SH3 and multiple ankyrin repeat domains 1 19q13.3 SLC16A3 solute carrier family 16, member 3 (monocarboxylic acid 17q25 transporter 4) SLC16A7 Solute carrier family 16, member 7 (monocarboxylic acid 12q13 transporter 2) (continued on next page)

84 Table A.1 - Continued. Gene Symbol Gene Name Chromosomal Location # Reports SLC25A14 Solute carrier family 25 (mitochondrial carrier, brain), mem- Xq24 1 ber 14 SLC25A24 Solute carrier family 25 (mitochondrial carrier; phosphate lpl3.3 1 carrier), member 24 SLC35A3 Solute carrier family 35 (UDP-N-acetylglucosamine (UDP- lp2l 1 GlcNAc) transporter), member A3 SLC38A1O solute carrier family 38, member 10 17q25.3 1 SLC39A11 Solute carrier family 39 (metal ), member 11 17q21.31 1 SMG6 Smg-6 homolog, nonsense mediated mRNA decay factor (C. 17p13.3 1 elegans) SNRPN small nuclear ribonucleoprotein polypeptide N 15q11.2 1 SNX19 Sorting nexin 19 11q25 1 SPAST Spastin 2p24-p21 1 SYN3 Synapsin III 22q12.3 1 SYT3 synaptotagmin III 19q13.33 1 TAFlC TATA box binding protein (TBP)-associated factor, RNA 16q24 1 polymerase I, C, 110kDa TBL1X transducin (beta)-like 1X-linked Xp22.3 1 TBX1 T-box 1 22q11.21 1 THRA , alpha 17q11.2 1 TM4SF20 Transmembrane 4 L six family member 20 2q36.3 1 4 TNIP2 TNFAIP3 interacting protein 2 p16.3 1 TOMM20 of outer mitochondrial membrane 20 homolog 1q42 1 (yeast) TPO Thyroid peroxidase 2p25 1 TRIM33 Tripartite motif containing 33 lp13.1 1 TTI2 TELO2 interacting protein 2 8p12 1 UBA6 Ubiquitin-like modifier activating 6 4q13.2 1 UBE2H ubiquitin-conjugating enzyme E2H (UBC8 homolog, yeast) 7q32 1 UBL7 ubiquitin-like 7 (bone marrow stromal cell-derived) 15q24.1 1 UBR5 Ubiquitin protein ligase E3 component n-recognin 5 8q22 1 UBR7 ubiquitin protein ligase E3 component n-recognin 7 (puta- 14q32.12 1 tive) UPF2 UPF2 regulator of nonsense transcripts homolog (yeast) lOpl4-pl3 1 USP9Y ubiquitin specific peptidase 9, Y-linked Yql1.2 1 XPO1 Exportin 1 (CRM1 homolog, yeast) 2p15 1 YEATS2 YEATS domain containing 2 3q27.1 1 YTHDC2 YTH domain containing 2 5q22.2 1 ZBTB16 and BTB domain containing 16 11q23.1 1 ZNF18 zinc finger protein 18 17pll.2 1 ZNF407 Zinc finger protein 407 18q23 1 ZNF827 Zinc finger protein 827 4q31.22 1 ZSWIM5 zinc finger, SWIM-type 
containing 5 lp34.1 1

1. We consider only the genes which can be mapped to the largest connected component of our PPI network.
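The filtering step described in this footnote can be sketched as follows. This is a minimal illustration, assuming the PPI network is available as a list of undirected gene-gene edges; the function names and the toy edges below are hypothetical placeholders, not the thesis implementation or real interaction data.

```python
from collections import deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component of an undirected graph."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    visited = set()
    largest = set()
    for start in adjacency:
        if start in visited:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        visited |= component
        if len(component) > len(largest):
            largest = component
    return largest


def filter_to_lcc(genes, edges):
    """Keep only the genes that map to the largest connected component, preserving order."""
    lcc = largest_connected_component(edges)
    return [gene for gene in genes if gene in lcc]


# Toy edges for illustration only (assumed, not taken from any PPI database).
toy_edges = [("SHANK3", "DLG4"), ("DLG4", "NLGN3"), ("NLGN3", "NRXN1"), ("TBX1", "GNAS")]
toy_genes = ["SHANK3", "NRXN1", "TBX1", "XIST"]
kept_genes = filter_to_lcc(toy_genes, toy_edges)
```

In this toy example, TBX1 sits in a smaller component and XIST is absent from the network, so both would be dropped from the candidate list.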

Appendix B

Risk Genes for ASDs Identified by Integrative Approach

Table B.1 - Probabilities of association with ASDs for candidate genes identified by our integrative analysis approach.
Gene Symbol | Association Probability
SHANK2 | 1.000000
HGF | 1.000000
CACNA1H | 1.000000
EN2 | 1.000000
MTHFR | 1.000000
GRIN2A | 1.000000
ANKRD11 | 1.000000
GATAD2B | 1.000000
FBN1 | 1.000000
COBL | 1.000000
BAIAP2 | 1.000000
TTN | 1.000000
SLC1A1 | 1.000000
FOXP4 | 1.000000
GABRB3 | 1.000000
MACROD2 | 1.000000
TBX1 | 1.000000
STOX1 | 1.000000
TSC1 | 1.000000
SND1 | 1.000000
HSPC215 | 1.000000
GJB2 | 1.000000
STXBP1 | 1.000000
GAN2B | 1.000000
KRTAP5-9 | 1.000000
FAM154A | 1.000000
RP11-220B22.3 | 1.000000
LGALS13 | 1.000000
HSFY1 | 1.000000
KRTAP26-1 | 1.000000
KRTAP3-3 | 1.000000
MIR10A | 1.000000
ARX | 1.000000
(continued on next page)

Table B.1 - Continued.
Gene Symbol | Association Probability
RP11-328M4.1 | 1.000000
SCT | 1.000000
SCN1A | 1.000000
NF1 | 1.000000
AVPR1A | 1.000000
MIR223 | 1.000000
MIR181A1 | 1.000000
LMNA | 1.000000
BCL2 | 1.000000
TCF4 | 1.000000
PAX5 | 1.000000
PAX6 | 1.000000
PTEN | 1.000000
MEF2C | 1.000000
MECP2 | 1.000000
MYH9 | 1.000000
TRPV4 | 1.000000
FGFR3 | 1.000000
NCKIPSD | 1.000000
SYN1 | 1.000000
COL1A1 | 1.000000
CTNNA3 | 1.000000
RAI1 | 1.000000
MCPH1 | 1.000000
CPE | 1.000000
RPL10 | 1.000000
TP63 | 1.000000
DIAPH3 | 1.000000
OPHN1 | 1.000000
MET | 1.000000
SLC25A12 | 1.000000
SETD2 | 1.000000
DLX1 | 1.000000
HOXA1 | 1.000000
GLI3 | 1.000000
FGFR2 | 1.000000
COL2A1 | 1.000000
FLNA | 1.000000
MED12 | 1.000000
REST | 1.000000
GNAS | 1.000000
MSX2 | 1.000000
CNTNAP2 | 1.000000
ANK3 | 1.000000
GRIK2 | 1.000000
NLGN3 | 1.000000
FOXP2 | 1.000000
GBA | 1.000000
GJA1 | 1.000000
TWIST2 | 1.000000
TP53AIP1 | 1.000000
AVP | 1.000000
SLC6A4 | 1.000000
FLNB | 1.000000
DMD | 1.000000
OXTR | 1.000000
DLGAP3 | 0.999900
CHM | 0.999900
ALPL | 0.999900
(continued on next page)

Table B.1 - Continued.
Gene Symbol | Association Probability
FGF14 | 0.999900
PAX2 | 0.999900
SNCG | 0.999900
RFWD2 | 0.999900
FGFR1 | 0.999900
COL7A1 | 0.999900
LRP5 | 0.999800
SMN1 | 0.999800
AR | 0.999800
PRNP | 0.999800
GLB1 | 0.999800
SCN8A | 0.999700
RNU5A-1 | 0.999700
BDNF | 0.999700
AES | 0.999700
DSP | 0.999700
HOXD13 | 0.999700
ZMIZ1 | 0.999700
ITLN1 | 0.999700
PLCD1 | 0.999600
MAPT | 0.999600
XPC | 0.999600
PAX3 | 0.999600
SCN5A | 0.999500
WAS | 0.999500
PITX2 | 0.999500
RP11-519K18.1 | 0.999500
MSX1 | 0.999500
UNQ640/PRO1270 | 0.999400
GATA1 | 0.999200
RELN | 0.999100
CACNA1G | 0.999100
RET | 0.999100
MPZ | 0.999000
KRT1 | 0.999000
CHRNA7 | 0.998800
RECQL4 | 0.998800
DLX2 | 0.998800
CNTN4 | 0.998700
IL1RAPL1 | 0.998700
DISC1 | 0.998600
DLX5 | 0.998400
PELO | 0.998400
EDA | 0.998300
CACNA1C | 0.998200
FAM108A1 | 0.998200
COL11A2 | 0.998100
SLC26A2 | 0.998000
ABCC8 | 0.998000
NOG | 0.997900
DLGAP1 | 0.997900
POLG | 0.997800
MGAT3 | 0.997800
CHRNA1 | 0.997800
L1CAM | 0.997600
CGI-17 | 0.997600
IKBKG | 0.997500
COL11A1 | 0.997300
SDC3 | 0.997300
(continued on next page)

Table B.1 - Continued.
Gene Symbol | Association Probability
TDRD7 | 0.997200
TYR | 0.997000
ASCL3 | 0.997000
CACNA1A | 0.997000
GRIP1 | 0.996800
ESCO2 | 0.996700
FOXP1 | 0.996700
CADPS2 | 0.996700
AFF2 | 0.996600
FMR1 | 0.996500
LOC347475 | 0.996400
MYH7 | 0.996400
PARK2 | 0.996300
GNPTAB | 0.996300
TP53 | 0.996300
FHIT | 0.996300
RP11-298P3.3 | 0.996200
KDM5C | 0.996200
WT1 | 0.996200
PRKAR1A | 0.996100
RNU4-1 | 0.995900
CAPN3 | 0.995700
DNM2 | 0.995500
GP1BA | 0.995500
ATP7A | 0.995400
RORA | 0.995400
RYR1 | 0.995300
RP11-394C23.1 | 0.995300
DPP10 | 0.995200
SCN9A | 0.995100
ARID1B | 0.995000
NCAPG2 | 0.994800
KCNMA1 | 0.994700
SYNE1 | 0.994700
AUTS2 | 0.994500
KRT14 | 0.994400
UGT1A1 | 0.994300
EFNB1 | 0.994300
AHI1 | 0.994200
ERBB4 | 0.994200
KLHL1 | 0.994100
SLC9A9 | 0.994000
TWIST1 | 0.993700
PMP22 | 0.993600
PCDH19 | 0.993600
SEMA5A | 0.993500
CPT2 | 0.993500
ATRX | 0.992900
ARNT2 | 0.992800
ATR | 0.992800
HDAC4 | 0.992200
LBR | 0.991800
PTPN11 | 0.991700
KCTD3 | 0.991600
NTRK3 | 0.991400
NLRP3 | 0.991200
UBE3A | 0.991000
VDR | 0.990800
ERCC6 | 0.990800
(continued on next page)

Table B.1 - Continued.
Gene Symbol | Association Probability
SRGAP2 | 0.990700
RP11-5F19.1 | 0.990600
NLGN1 | 0.990500
VLDLR | 0.990300
EPHB2 | 0.990200
EDNRB | 0.990100
GH1 | 0.990000
SCN2A | 0.990000
NLGN4X | 0.989500
CMC1 | 0.989500
RUNX2 | 0.989400
APC | 0.989300
HSD11B1 | 0.989300
PSEN1 | 0.988700
ADRB2 | 0.988600
SPSB3 | 0.988500
CBP | 0.988300
GNE | 0.988300
EIF4E | 0.988100
PTPRC | 0.987700
DLX6 | 0.987200
NRXN1 | 0.986700
NDRG1 | 0.986400
DUSP22 | 0.986400
ERCC6L2 | 0.986200
EVC | 0.986000
RMRP | 0.985600
MGAT5B | 0.985300
ERAG | 0.984900
4-OCT | 0.984900
SPINK5 | 0.984400
OFD1 | 0.984200
MBD5 | 0.983700
TRPS1 | 0.983500
SYNE3 | 0.983100
BSCL2 | 0.982600
TSC2 | 0.982400
FH | 0.982100
COQ2 | 0.981700
BRCA2 | 0.981600
SALL4 | 0.980500
NCS-1 | 0.980200
FDXR | 0.979900
CTR9 | 0.979400
MAOA | 0.979100
EYA1 | 0.979000
HR | 0.978900
TBCE | 0.978200
TR-B | 0.977400
HOXA13 | 0.977100
SPSB4 | 0.976700
MBNL2 | 0.976700
PC | 0.976400
ANKH | 0.976300
PRPS1 | 0.976200
PHLDA3 | 0.975900
EPB41L3 | 0.975800
HOXC8 | 0.975700
MYO7A | 0.975400
(continued on next page)

Table B.1 - Continued.
Gene Symbol | Association Probability
SHANK3 | 0.974700
CAV3 | 0.974400
MYO5A | 0.973400
SCN4A | 0.973200
IL2RG | 0.972800
FAM111A | 0.972300
NR0B1 | 0.972200
DYSF | 0.971900
HOXD3 | 0.971200
DPP6 | 0.971100
SOST | 0.971000
RNU6-1 | 0.970500
DYT10 | 0.970500
GNB2L1 | 0.970200
ALG6 | 0.969900
WFS1 | 0.969600
XPA | 0.966300
BUB1B | 0.966300
INPPL1 | 0.966200
CDKL5 | 0.966100
IL6 | 0.966000
ZEB2 | 0.965700
RB1 | 0.965700
ALS2 | 0.965400
PDP1 | 0.964600
ESR1 | 0.962300
INSR | 0.962200
LRIG1 | 0.962200
SHH | 0.962100
GDAP1 | 0.961900
CD96 | 0.960800
HHF1 | 0.960000
CTSC | 0.958600
PTPN1 | 0.958200
KCND3 | 0.956900
TGFBI | 0.956600
FRS2 | 0.955700
THTPA | 0.955300
CACNA1S | 0.955200
ITGA2B | 0.953700
GDF5 | 0.953300
TACC3 | 0.952700
GHR | 0.952500
COL4A3 | 0.952400
DLL3 | 0.950700
SLC17A5 | 0.950700
WNT2 | 0.949700
FOXG1 | 0.949300
PIGL | 0.949200
AIM1 | 0.949100
TADA3 | 0.948600
EMD | 0.948200
MIF | 0.948000
NTF3 | 0.948000
IRF6 | 0.947700
PAX8 | 0.947400
AASS | 0.947300
HMGA2 | 0.947200
NF2 | 0.946400
(continued on next page)

Table B.1 - Continued.
Gene Symbol | Association Probability
NPHP1 | 0.946000
KCND2 | 0.945900
SMARCA2 | 0.945800
PEX5 | 0.945700
TBR1 | 0.944600
THRB | 0.944600
SCARF2 | 0.944000
PANK2 | 0.943900
HSPG2 | 0.943600
ARSB | 0.943100
FAM123B | 0.942500
LAMA3 | 0.940800
SMS | 0.940000
ABCA4 | 0.939900
SLC13A3 | 0.939100
KAT6B | 0.938900
SMPD1 | 0.937900
ERCC2 | 0.937000
TREX1 | 0.936700
FLCN | 0.936000
HOXB1 | 0.934500
SIM2 | 0.933800
SNTG1 | 0.933200
NKX2-1 | 0.933100
RP11-258C19.2 | 0.930600
PKHD1 | 0.929300
FCP1 | 0.928300
HPRT1 | 0.928300
ELANE | 0.928100
NELL2 | 0.926300
PRRT2 | 0.925700
ROR2 | 0.925600
APC2 | 0.925500
FKRP | 0.925300
OPA1 | 0.925200
SLC37A4 | 0.924400
MEF2A | 0.923000
HBB | 0.922400
STK11 | 0.922300
RAPGEF4 | 0.922000
EYA4 | 0.918500
CDH3 | 0.917900
ZNF81 | 0.917700
N4BP2L2 | 0.917500
DYM | 0.917500
EGR2 | 0.917400
BIN1 | 0.917000
HNF1A | 0.915500
NTNG1 | 0.910400
CPS1 | 0.910100
KIT | 0.909700
AHNAK | 0.908500
CFTR | 0.907800
TDO2 | 0.907000
TTR | 0.906300
AVEN | 0.906300
MITF | 0.904200
SPTAN1 | 0.903100
TBPL1 | 0.902200
(continued on next page)

Table B.1 - Continued.
Gene Symbol | Association Probability
GLRA2 | 0.901600
PTH | 0.901600
SMOC1 | 0.900300
RPS6KA3 | 0.899000
ADCK4 | 0.897800
SEC23A | 0.896300
ASTN2 | 0.892600
CYLD | 0.892400
BMP4 | 0.892100
MBNL1 | 0.889700
FOXH1 | 0.888200
KCNQ1 | 0.887100
ATP2A2 | 0.885500
DES | 0.884400
CASR | 0.882600
FLT4 | 0.877300
BCL7B | 0.874000
DAAP-218M18.8 | 0.872500
FOXO4 | 0.872500
SOX3 | 0.872500
SYNM | 0.871800
RAB40B | 0.871400
GNAQ | 0.869900
RP11-419L10.1 | 0.869900
SMARCE1 | 0.869400
FAM189A1 | 0.868800
PHF11 | 0.866700
PICK1 | 0.866200
XIST | 0.866100
GPC6 | 0.864900
ATP8B1 | 0.862900
HSD17B4 | 0.861100
TRIM25 | 0.861000
NTRK1 | 0.859300
PLP1 | 0.858700
TBC1D24 | 0.857300
PLCG1 | 0.857300
OPN1LW | 0.857200
GBE1 | 0.855300
ELN | 0.854300
DDX59 | 0.853800
FOXL2 | 0.853400
ABCC6 | 0.852900
LHCGR | 0.848500
VWF | 0.846900
NOS3 | 0.846500
TSSK2 | 0.846100
STIM1 | 0.845000
DDB2 | 0.844800
VHL | 0.844800
ATP6V0A2 | 0.844700
PRKAG2 | 0.844700
NCOA2 | 0.844600
NPHP3 | 0.842200
SOS1 | 0.842000
ITM2B | 0.841300

Appendix C

Subnetworks in ASD Risk Gene Set

Table C.1 - Subnetworks in ASD risk gene set generated by QIAGEN's Ingenuity Pathway Analysis (IPA).1
ID | Molecules in Network | Score | Seed Genes | Top Diseases and Functions
1 | ADRB2, Ap1, ARNT2, ATP6V0A2, ATP8B1, CFTR, EGR2, FLNA, GBE1, GH1, GNB2L1, HBB, HNF1A, HPRT1, IL1RAPL1, INSR, Insulin, KIT, Mek, MET, p70 S6k, p85 (pik3r), PAX6, PDGF BB, Proinsulin, SLC37A4, SLC9A9, SND1, SNTG1, SPSB3, THRB, UGT1A1, VHL, WFS1, XIST | 46 | 28 | Cancer, Tissue Morphology, Gastrointestinal Disease
2 | ABHD17A, ADCK4, BSCL2, CHRNA1, CHRNA7, DUSP22, ERBB, ERK, FAM154A, FDXR, Hnf3, HOXA1, HOXD3, HSFY1/HSFY2, Igfbp, ITGA2B, KRTAP26-1, KRTAP3-3, KRTAP5-9, L1CAM, LAMA3, LGALS13, MGAT5B, N4BP2L2, Nicotinic acetylcholine receptor, NKX2-1, NRG (family), PAX8, PC, POLG, SERCA, sGC, SMOC1, SOST, TGFBI | 44 | 27 | Cancer, Tissue Morphology, Nervous System Development and Function
3 | AES, Akt, ANK3, ARX, , CDKL5, CTNNβ-TCF/LEF, CYP19, DLX1, DLX2, DLX5, DLX6, FGF14, FOXG1, FOXH1, Foxo, FOXO4, GABRB3, HOXC8, KDM5C, MECP2, MIR124, MSX1, MSX2, PMP22, REST, SCN1A, SCN2A, SCN4A, SCN5A, SCN8A, SCN9A, -OCT4-NANOG, SYNM, voltage-gated sodium channel | 43 | 27 | Developmental Disorder, Hereditary Disorder, Neurological Disease
4 | ABCC6, ABCC8, ARID1B, ATPase, BCL7B, CADPS2, CTR9, , DES, DISC1, DMD, EMD, GDAP1, GNPTAB, IL23, KAT6B, LDL-cholesterol, LMNA, MEF2C, MYH7, MYH9, NCAPG2, OFD1, P38 MAPK, PELO, PHLDA3, SLC25A12, SMARCA2, SMARCE1, Spectrin, SPTAN1, SRGAP2, , t, tubulin (family) | 41 | 26 | Cancer, Organismal Injury and Abnormalities, Reproductive System Disease
5 | 20s proteasome, AHI1, AMER1, APC, APC (complex), B-cell receptor, BUB1B, Ctbp, Eif4g, FH, GBA, GJB2, Glycogen synthase, Histone H1, INPPL1, KRT1, KRT14, , Mapk, MBD5, MYO7A, NF2, NPHP1, NPHP3, OPA1, PRKAC, PRKAR1A, Rab5, Snare, SNCG, STXBP1, SYN1, SYNE1, SYNE3, TACC3 | 34 | 23 | Hereditary Disorder, Auditory Disease, Neurological Disease

(continued on next page)

Table C.1 - Continued.
ID | Molecules in Network | Score | Seed Genes | Top Diseases and Functions
6 | ASTN2, ATR, BRCA2, Cdc2, CNTNAP2, Cyclin B, DDB2, ERCC2, ERCC6, MBNL1, MBNL2, MCPH1, NCOA2, Nuclear factor 1, PARP, Pde4, PDP1, PHF11, PRKAG2, RECQL4, RFWD2, RNA polymerase I, RNA polymerase II, Rnr, RPA, SETD2, TDO2, TFIIH, TP53, TP53AIP1, TRIM25, Ube3, XPA, XPC, ZMIZ1 | 34 | 24 | Cancer, Dermatological Diseases and Conditions, Hereditary Disorder
7 | 7S NGF, Arp2/3, BAIAP2, Beta Tubulin, BMP, CAPN3, COBL, COL4A3, DIAPH3, DLGAP1, DLGAP3, EDA, elastase, ETS, , G-Actin, Gli, GRIN2B, Integrin alpha 3 beta 1, KCND3, LBR, MYO5A, NCKIPSD, NFkB (complex), NLGN1, NLGN3, , RAPGEF4, RELN, RPS6KA3, SHANK2, SHANK3, SMS, TBR1, Wave | 32 | 22 | Cell-To-Cell Signaling and Interaction, Nervous System Development and Function, Behavior
8 | Alpha tubulin, APC2, APC/APC2, BMP4, CK1, Cyclin D, CYLD, Dgk, Dishevelled, , FLCN, GLI3, Hedgehog, IRS, Jnk, KCNQ1, LRP, LRP5, mir-181, MPZ, PAX2, PAX3, PITX2, ROR2, Secretase gamma, SHH, SOX3, TBX1, TWIST2, Vdac, VLDL, VLDLR, Wnt, WNT2, ZEB2 | 28 | 20 | Embryonic Development, Organismal Development, Gene Expression
9 | 14-3-3, ATP7A, c-Src, COL11A1, COL11A2, COL1A1, COL2A1, COL7A1, collagen, Collagen type II, Collagen Type XI, Collagen(s), Cpla2, DLL3, ELANE, ELN, EPB41L3, FBN1, Fc gamma receptor, Fibrin, GDF5, HOXD13, HSD11B1, HSD17B4, mir-10, MTORC2, Notch, PEX5, ACTIN, SPINK5, trypsin, TSC1, TSC2, Vegf, VWF | 28 | 21 | Developmental Disorder, Connective Tissue Disorders, Skeletal and Muscular Disorders
10 | 26s Proteasome, AR, ATRX, caspase, Cdk, Cyclin E, Cytochrome bc1, cytochrome C, cytochrome-c oxidase, EIF4E, ESCO2, GATA1, HDAC4, HOXA13, , , , MAOA, MED12, Mitochondrial complex 1, NDRG1, PARK2, PP2A, PRPS1, Rb, SEC23A, SLC6A4, SRC (family), STK11, TBCE, TRPS1, TSSK2, TYR, Ubiquitin, WT1 | 27 | 20 | Cancer, Cell Death and Survival, Cell Cycle
11 | ANKH, Atrial Natriuretic Peptide, cacn, Cacna1, CACNA1A, CACNA1C, CACNA1G, CACNA1H, CACNA1S, CAV3, DPP6, DPP10, ERK1/2, GLRA2, GNAS, Homer, ITPR, KCND2, KCNMA1, KLHL1, L-type Calcium Channel, MGAT3, NELL2, Neurotrophin, NOG, NRXN1, Pka catalytic subunit, Pkg, Pki, potassium channel, Presenilin, Ryr, RYR1, T-type Calcium Channel, voltage-gated calcium channel | 26 | 19 | Molecular Transport, Cancer, Organismal Injury and Abnormalities
12 | ABCA4, Ahr-aryl hydrocarbon-Arnt, ARSB, Bcl9-Cbp/p300-Ctnnb1-Lef/Tcf, Cbp/p300, CDH3, CPE, Cyclin A, DSP, , EN2, ESR1, Esr1-Esr1-estrogen-estrogen, estrogen receptor, FMR1, FOXL2, glutathione peroxidase, Hat, HISTONE, Histone h4, HMGA2, NR0B1, RNase A, RPL10, SDC3, SEMA5A, Smad1/5/8, Smad2/3, Smad2/3-Smad4, SMN1/SMN2, TADA3, TBPL1, TP63, TWIST1, UBE3A | 24 | 20 | Renal and Urological System Development and Function, Reproductive System Development and Function, Organismal Development
13 | AHNAK, ATP2A2, Cebp, CPT1, CPT2, DYM, FOXP1, FOXP2, FOXP4, GATAD2B, HR, IL6, INTERLEUKIN, IRF6, ITLN1, JUN/JUNB/JUND, N-cor, Na+,K+-ATPase, Nr1h, PEPCK, Pmca, PRKAA, PTH, RAI1, Rar, Rbp, Rxr, SALL4, STIM1, SWI-SNF, Tcf 1/3/4, thymidine kinase, thyroid hormone receptor, TTR, VitaminD3-VDR-RXR | 22 | 17 | Cancer, Gastrointestinal Disease, Hematological Disease
(continued on next page)

Table C.1 - Continued.
ID | Molecules in Network | Score | Seed Genes | Top Diseases and Functions
14 | ALS2, Ampa Receptor, CaMKII, Cofilin, Ctnna, EFNB1, EPHB2, F Actin, FHIT, FLNB, GNE, GRI, GRIK2, GRIN2A, GRIP1, Integrin alpha V beta 3, mGluR, Mic, Myosin, N-Cadherin, OPHN1, Pak, PCDH19, PICK1, Pkc(s), PLCD1, PP1 protein complex group, Pp2b, Rab11, RAB40B, Rap, Rap1, SLC1A1, TSH, TTN | 18 | 16 | Cell Death and Survival, Cancer, Gastrointestinal Disease
15 | amylase, chymotrypsin, Collagen type III, Cytokeratin, DNM2, EFNB, ENaC, Fgf, Fgfr, FGFR1, FGFR2, FGFR3, FLT4, FRS2, Gap, GHR, GP1BA, GPIIB-IIIA, growth factor receptor, Hspg, HSPG2, IRS1/2, NCK, NTRK1, NTRK3, Ntrk dimer, Pdgfr, PI3K (complex), PI3K p85, PLC gamma, PLCG1, PTPN11, RPS6KA, SCT, Vla-4 | 17 | 14 | Cancer, Cell Death and Survival, Cellular Function and Maintenance
16 | ABRACL, AFF2, ANKRD11, BCL6, C9orf78, CHM, CPS1, CUTA, CWC27, DYSF, GRB2, KCTD3, LSM5, LSM6, LSM12, MT-ATP8, NOS2, PANK2, POU2F3, PRPF8, RNU2-1, RNU4-1, RNU5A-1, RNU6-1, SCARF2, SLC17A5, SNRNP25, SPSB4, TDRD7, TESPA1, TTYH2, UBC, XIRP2, ZNF443, ZNF609 | 17 | 14 | Cancer, Organismal Injury and Abnormalities, Reproductive System Disease
17 | Adaptor protein 2, ADCY, ADRB, AIM1, Ap2 alpha, ASCL3, BDNF, BIN1, Caveolin, Ck2, Clathrin, Creb, DNA-methyltransferase, Dynamin, GLB1, Gm-csf, Go-coupled receptor, GTPase, Hdac, mGLUR Group I, MITF, MITF-p300/CBP, NF1, NTF3, Pias, PKHD1, Ppp2c, Ras, RET, Rsk, SIM2, SLC26A2, Syntaxin, TCF, TCF4 | 15 | 13 | Cell Death and Survival, Cellular Development, Cellular Growth and Proliferation
18 | Actin, , CD3, Cg, CNTN4, Cel, DPY19L3, E130116L18Rik, ERBB4, FAM111A, FSH, GJA1, HIAT1, Histone h3, I kappa b kinase, Ikb, IKBKG, IKK (complex), Integrin, MAPT, mir-223, MTORC1, NLRP3, PRNP, PSEN1, PTEN, RB1, RORA, STAT, STEAP1, STOX1, SUN5, TCR, Tnf receptor, ZNF211 | 14 | 14 | Cell Morphology, Cellular Assembly and Organization, Cellular Development
19 | AUTS2, AVEN, BCL2, C1q, CTNNA3, CTSC, Ifn, IFN alpha/beta, IFN Beta, IFN type 1, Iga, Ige, IgG, IgG1, Igg3, Igm, IL-2R, IL12 (complex), IL12 (family), IL2RG, Immunoglobulin, Interferon alpha, ITM2B, Ldh (complex), LRIG1, mediator, MHC Class II (complex), MHC II, MIR101, PAX5, PLP1, snRNP, STAT5a/b, Tlr, TREX1 | 12 | 11 | Cellular Growth and Proliferation, Lymphoid Tissue Structure and Development, Organ Morphology
20 | ALG6, Basp1, CCND1, CD96, COQ2, DAG1, DONSON, DPH1, EPB41L4B, ESCO2, EVC, FKRP, GLI1, GPC6, H2AFY2, HACL1, HRAS, IMPA2, MACROD2, PKD1, POP1, POP4, PPCS, PRELID1, PTTG1IP, RAB23, RASSF6, RMRP, SFXN3, SLC13A3, TBC1D24, TENM3, THG1L, UBC, ZNF711 | 12 | 11 | Cancer, Developmental Disorder, Cellular Growth and Proliferation
21 | ALPL, BCR (complex), A, Calcineurin protein(s), EYA1, EYA4, Fcer1, Gsk3, HOXB1, JAK, Lh, MAP2K1/2, , MEF2A, NFAT (complex), Nfat (family), Pdgf (complex), phosphatase, PI3K (family), Pka, Ptk, PTPase, PTPN1, PTPRC, Raf, Shc, SHP, Sod, Sos, SOS1, SYK/ZAP, THTPA, TRPV4, tyrosine kinase, WAS | 10 | 11 | Post-Translational Modification, Organismal Survival, Organismal Development
(continued on next page)

97 Table C.1 - Continued. ID Molecules in Network Score Seed Genes Top Diseases and Functions 22 AASS, ALDOC, ATP2B2, BAIAP2, CIT, CMC1, 8 9 Cell-To-Cell Signaling and In- CNP, DLG4, EPB41L3, GRID2, Grik, GRIK2, teraction, Nervous System GRIK5, GRIN2C, GRIN2D, HTT, HUNK, KCNA1, Development and Function, KCNAB1, KCNJ2, KCNJ4, KCNJ12, MAP3K10, Cancer MBTPS1, NLGN4X, NRXN1, PCLO, PLP1, PRELP, SDHA, SFXN3, SRGAP3, STX1B, STXBP1, TYRO3 23 Alp, , ALT, AMPK, C/ebp, Collagen 8 8 Cell Death and Survival, Cel- Alphal, Collagen type I, Collagen type IV, crea- lular Growth and Prolifera- tine kinase, CYP, Fibrinogen, Focal adhesion kinase, tion, Tissue Development Growth hormone, HDL, HDL-cholesterol, , HGF, Ifn gamma, ILl, JINK1/2, Laminin, LDL, MIF, MTHFR, Nos, NOS3, Pro-inflammatory Cytokine, Rock, RUNX2, SCARF2, Smad, SMPD1, Tgf beta, Tnf (family), VDR 24 Alpha , Angiotensin II receptor type 1, 5 8 Carbohydrate Metabolism, AVP, AVPRlA, Beta Arrestin, Calmodulin, CASR, Molecular Transport, Small chemokine, EDNRB, Endothelin, G protein, G pro- Molecule Biochemistry tein alpha, G protein alphai, G protein beta gamma, G-protein beta, GNAQ, GNRH, Gpcr, IgG2a, IgG2b, LHCGR, Metalloprotease, Mmp, NMDA Receptor, OPN1LW, OXTR, PLC, PId, Rac, Ras homolog, Re- laxin, Sapk, Sfk, Trk Receptor, tubulin (complex) 25 ERCC6L2, NEK6 2 1 Cell Cycle, Cellular Move- ment, Cell Morphology

1. IPA®, QIAGEN Redwood City, http://www.qiagen.com/ingenuity

Bibliography

[1] Online Mendelian Inheritance in Man, OMIM®, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), March 2014. World Wide Web URL: http://omim.org/.

[2] Brett S Abrahams and Daniel H Geschwind. Advances in autism genetics: on the threshold of a new neurobiology. Nature Reviews Genetics, 9(5):341-355, 2008.

[3] Stein Aerts, Diether Lambrechts, Sunit Maity, Peter Van Loo, Bert Coessens, Frederik De Smet, Leon-Charles Tranchevent, Bart De Moor, Peter Marynen, Bassem Hassan, et al. Gene prioritization through genomic data fusion. Nature Biotechnology, 24(5):537-544, 2006.

[4] David Altshuler, Mark Daly, and Leonid Kruglyak. Guilt by association. Nature Genetics, 26(2):135-138, 2000.

[5] David Altshuler, Mark J Daly, and Eric S Lander. Genetic mapping in human disease. Science, 322(5903):881-888, 2008.

[6] JY An, AS Cristino, Q Zhao, J Edson, SM Williams, D Ravine, J Wray, VM Marshall, A Hunt, AJO Whitehouse, et al. Towards a molecular characterization of autism spectrum disorders: an exome sequencing and systems approach. Translational Psychiatry, 4(6):e394, 2014.

[7] Richard Anney, Lambertus Klei, Dalila Pinto, Joana Almeida, Elena Bacchelli, Gillian Baird, Nadia Bolshakova, Sven Bölte, Patrick F Bolton, Thomas Bourgeron, et al. Individual common variants exert weak effects on the risk for autism spectrum disorders. Human Molecular Genetics, 21(21):4781-4792, 2012.

[8] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler, J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25-29, 2000.

[9] American Psychiatric Association. The Diagnostic and Statistical Manual of Mental Disorders (5th ed.). American Psychiatric Publishing, 2013.

[10] Samy A Azer. Overview of molecular pathways in inflammatory bowel disease associated with colorectal cancer development. European Journal of Gastroenterology & Hepatology, 25(3):271-281, 2013.

[11] Albert-László Barabási, Natali Gulbahce, and Joseph Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1):56-68, 2011.

[12] Colin A Baron, Clifford G Tepper, Stephenie Y Liu, Ryan R Davis, Nicholas J Wang, N Carolyn Schanen, and Jeffrey P Gregg. Genomic and functional profiling of duplicated chromosome 15 cell lines reveal regulatory alterations in UBE3A-associated ubiquitin-proteasome pathway processes. Human Molecular Genetics, 15(6):853-869, 2006.

[13] Saumyendra N Basu, Ravi Kollu, and Sharmila Banerjee-Basu. AutDB: a gene reference resource for autism research. Nucleic Acids Research, 37(suppl 1):D832-D836, 2009.

[14] Brent R Bill and Daniel H Geschwind. Genetic advances in autism: heterogeneity and convergence on shared pathways. Current Opinion in Genetics & Development, 19(3):271-278, 2009.

[15] Douglas C Bittel, Nataliya Kibiryeva, and Merlin G Butler. Whole genome microarray analysis of gene expression in subjects with . Genetics in Medicine, 9(7):464-472, 2007.

[16] Hans K Blomquist, Michael Bohman, Sven Olof Edvinsson, Christopher Gillberg, Karl-Henrik Gustavson, Gösta Holmgren, Jan Wahlström, et al. Frequency of the fragile X syndrome in infantile autism. Clinical Genetics, 27(2):113-117, 1985.

[17] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1):107-117, 1998.

[18] OJ Broom, B Widjaya, J Troelsen, Jorgen Olsen, and OH Nielsen. Mitogen activated protein : a role in inflammatory bowel disease? Clinical & Experimental Immunology, 158(3):272-280, 2009.

[19] Andrej Bugrim, Tatiana Nikolskaya, and Yuri Nikolsky. Early prediction of drug metabolism and toxicity: systems biology approach and modeling. Drug Discovery Today, 9(3):127-135, 2004.

[20] Joseph D Buxbaum. Multiple rare variants in the etiology of autism spectrum disorders. Dialogues in Clinical Neuroscience, 11(1):35, 2009.

[21] Malcolm G Campbell, Isaac S Kohane, and Sek Won Kong. Pathway-based outlier method reveals heterogeneous genomic structure of autism in blood transcriptome. BMC Medical Genomics, 6(1):34, 2013.

[22] Rita M Cantor, Naoko Kono, Jackie A Duvall, Ana Alvarez-Retuerto, Jennifer L Stone, Maricela Alarcón, Stanley F Nelson, and Daniel H Geschwind. Replication of autism linkage: fine-mapping peak at 17q21. The American Journal of Human Genetics, 76(6):1050-1056, 2005.

[23] Mengfei Cao, Hao Zhang, Jisoo Park, Noah M Daniels, Mark E Crovella, Lenore J Cowen, and Benjamin Hescott. Going the Distance for Protein Function Prediction: A New Distance Metric for Protein Interaction Networks. PloS One, 8(10):e76339, 2013.

[24] MA Care, JR Bradford, CJ Needham, AJ Bulpitt, and DR Westhead. Combining the interactome and deleterious SNP predictions to improve disease gene identification. Human Mutation, 30(3):485-492, 2009.

[25] Wenjun Chang, Liye Ma, Liping Lin, Liqiang Gu, Xiaokang Liu, Hui Cai, Yongwei Yu, Xiaojie Tan, Yujia Zhai, Xingxing Xu, et al. Identification of novel hub genes associated with liver metastasis of gastric cancer. International Journal of Cancer, 125(12):2844-2853, 2009.

[26] Yanqing Chen, Jun Zhu, Pek Yee Lum, Xia Yang, Shirly Pinto, Douglas J MacNeil, Chunsheng Zhang, John Lamb, Stephen Edwards, Solveig K Sieberts, et al. Variations in DNA elucidate molecular networks that cause disease. Nature, 452(7186):429-435, 2008.

[27] David Croft, Antonio Fabregat Mundo, Robin Haw, Marija Milacic, Joel Weiser, Guanming Wu, Michael Caudy, Phani Garapati, Marc Gillespie, Maulik R Kamdar, et al. The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1):D472-D477, 2014.

[28] Disabilities Monitoring Network Surveillance Year Developmental, 2010 Principal Investigators, et al. Prevalence of autism spectrum disorder among children aged 8 years - autism and developmental disabilities monitoring network, 11 sites, United States, 2010. Morbidity and Mortality Weekly Report. Surveillance Summaries (Washington, DC: 2002), 63:1, 2014.

[29] Bernie Devlin, Nadine Melhem, and Kathryn Roeder. Do common variants play a role in risk for autism? Evidence and theoretical musings. Brain Research, 1380:78-84, 2011.

[30] Zoltán Dezső, Yuri Nikolsky, Tatiana Nikolskaya, Jeremy Miller, David Cherba, Craig Webb, and Andrej Bugrim. Identifying disease-specific genes based on their topological significance in protein networks. BMC Systems Biology, 3(1):36, 2009.

[31] Annette J Dobson. An introduction to generalized linear models. CRC press, 2001.

[32] Lynnette R Ferguson. Nutrigenetics, nutrigenomics and inflammatory bowel diseases. Expert Review of Clinical Immunology, 9(8):717-726, 2013.

[33] Eric Fombonne. Epidemiology of autistic disorder and other pervasive developmental disorders. The Journal of Clinical Psychiatry, 66:3-8, 2004.

[34] Lude Franke, Harm van Bakel, Like Fokkens, Edwin D De Jong, Michael Egmont-Petersen, and Cisca Wijmenga. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. The American Journal of Human Genetics, 78(6):1011-1025, 2006.

[35] Christine M Freitag. The genetics of autistic disorders and its clinical relevance: a review of the literature. Molecular Psychiatry, 12(1):2-22, 2006.

[36] Richard A George, Jason Y Liu, Lina L Feng, Robert J Bryson-Richardson, Diane Fatkin, and Merridee A Wouters. Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucleic Acids Research, 34(19):e130-e130, 2006.

[37] Daniel H Geschwind, Janice Sowinski, Catherine Lord, Portia Iversen, Jonathan Shestack, Patrick Jones, Lee Ducat, Sarah J Spence, AGRE Steering Committee, et al. The autism genetic resource exchange: a resource for the study of autism and related neuropsychiatric conditions. American Journal of Human Genetics, 69(2):463, 2001.

[38] Kwang-Il Goh, Michael E Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-László Barabási. The human disease network. Proceedings of the National Academy of Sciences, 104(21):8685-8690, 2007.

[39] Jeffrey P Gregg, Lisa Lit, Colin A Baron, Irva Hertz-Picciotto, Wynn Walker, Ryan A Davis, Lisa A Croen, Sally Ozonoff, Robin Hansen, Isaac N Pessah, et al. Gene expression changes in children with autism. Genomics, 91(1):22-29, 2008.

[40] Joachim Hallmayer, Sue Cleveland, Andrea Torres, Jennifer Phillips, Brianne Cohen, Tiffany Torigoe, Janet Miller, Angie Fedele, Jack Collins, Karen Smith, et al. Genetic heritability and shared environmental factors among twin pairs with autism. Archives of General Psychiatry, 68(11):1095-1102, 2011.

[41] Valerie W Hu, Bryan C Frank, Shannon Heine, Norman H Lee, and John Quackenbush. Gene expression profiling of lymphoblastoid cell lines from monozygotic twins discordant in severity of autism reveals differential regulation of neurologically relevant genes. BMC Genomics, 7(1):118, 2006.

[42] KR Hughes, F Sablitzky, and YR Mahida. Expression profiling of Wnt family of genes in normal and inflammatory bowel disease primary human intestinal myofibroblasts and normal human colonic crypt epithelial cells. Inflammatory Bowel Diseases, 17(1):213-220, 2011.

[43] Daehee Hwang, Inyoul Y Lee, Hyuntae Yoo, Nils Gehlenborg, Ji-Hoon Cho, Brianne Petritis, David Baxter, Rose Pitstick, Rebecca Young, Doug Spicer, et al. A systems approach to prion disease. Molecular Systems Biology, 5(1), 2009.

[44] Sohyun Hwang, Seung-Woo Son, Sang Cheol Kim, Young Joo Kim, Hawoong Jeong, and Doheon Lee. A protein interaction network associated with asthma. Journal of Theoretical Biology, 252(4):722-731, 2008.

[45] Ronald Jansen, Haiyuan Yu, Dov Greenbaum, Yuval Kluger, Nevan J Krogan, Sambath Chung, Andrew Emili, Michael Snyder, Jack F Greenblatt, and Mark Gerstein. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644):449-453, 2003.

[46] LB Jorde, SJ Hasstedt, ER Ritvo, A Mason-Brothers, BJ Freeman, C Pingree, WM McMahon, B Petersen, WR Jenson, and A Mo. Complex segregation analysis of autism. The American Journal of Human Genetics, 49(5):932, 1991.

[47] Minoru Kanehisa and Susumu Goto. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28(1):27-30, 2000.

[48] Minoru Kanehisa, Susumu Goto, Yoko Sato, Masayuki Kawashima, Miho Furumichi, and Mao Tanabe. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Research, 42(D1):D199-D205, 2014.

[49] Shaul Karni, Hermona Soreq, and Roded Sharan. A network-based method for predicting disease-causing genes. Journal of Computational Biology, 16(2):181-189, 2009.

[50] Arthur Kaser and Herbert Tilg. "Metabolic aspects" in inflammatory bowel diseases. Current Drug Delivery, 9(4):326-332, 2012.

[51] Paul Julian Kersey, James E Allen, Mikkel Christensen, Paul Davis, Lee J Falin, Christoph Grabmueller, Daniel Seth Toney Hughes, Jay Humphrey, Arnaud Kerhornou, Julia Khobova, et al. Ensembl Genomes 2013: scaling up access to genome-wide data. Nucleic Acids Research, 42(D1):D546-D552, 2014.

[52] Yoo-Ah Kim, Stefan Wuchty, and Teresa M Przytycka. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Computational Biology, 7(3):e1001095, 2011.

[53] Young Shin Kim, Bennett L Leventhal, Yun-Joo Koh, Eric Fombonne, Eugene Laska, Eun-Chung Lim, Keun-Ah Cheon, Soo-Jeong Kim, Young-Key Kim, HyunKyung Lee, et al. Prevalence of autism spectrum disorders in a total population sample. American Journal of Psychiatry, 168(9):904-912, 2011.

[54] Michael D Kogan, Stephen J Blumberg, Laura A Schieve, Coleen A Boyle, James M Perrin, Reem M Ghandour, Gopal K Singh, Bonnie B Strickland, Edwin Trevathan, and Peter C van Dyck. Prevalence of parent-reported diagnosis of autism spectrum disorder among children in the US, 2007. Pediatrics, 124(5):1395-1403, 2009.

[55] Isaac S Kohane, Andrew McMurry, Griffin Weber, Douglas MacFadden, Leonard Rappaport, Louis Kunkel, Jonathan Bickel, Nich Wattanasin, Sarah Spence, Shawn Murphy, et al. The co-morbidity burden of children and young adults with autism spectrum disorders. PloS One, 7(4):e33224, 2012.

[56] Sebastian Köhler, Sebastian Bauer, Denise Horn, and Peter N Robinson. Walking the interactome for prioritization of candidate disease genes. The American Journal of Human Genetics, 82(4):949-958, 2008.

[57] Michael Krauthammer, Charles A Kaufmann, T Conrad Gilliam, and Andrey Rzhetsky. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. Proceedings of the National Academy of Sciences of the United States of America, 101(42):15148-15153, 2004.

[58] Kasper Lage, E Olof Karlberg, Zenia M Storling, Pál I Olason, Anders G Pedersen, Olga Rigina, Anders M Hinsby, Zeynep Tümer, Flemming Pociot, Niels Tommerup, et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature Biotechnology, 25(3):309-316, 2007.

[59] Eunjung Lee, Hyunchul Jung, Predrag Radivojac, Jong-Won Kim, and Doheon Lee. Analysis of AML genes in dysregulated molecular networks. BMC Bioinformatics, 10(Suppl 9):S2, 2009.

[60] Charles William Lees. Role of the hedgehog signalling pathway in inflammatory bowel disease. PhD thesis, University of Edinburgh, 2009.

[61] CW Lees, JC Barrett, M Parkes, and J Satsangi. New IBD genetics: common pathways with other diseases. Gut, 60(12):1739-1753, 2011.

[62] Dan Levy, Michael Ronemus, Boris Yamrom, Yoon-ha Lee, Anthony Leotta, Jude Kendall, Steven Marks, B Lakshmi, Deepa Pai, Kenny Ye, et al. Rare de novo and transmitted copy-number variation in autistic spectrum disorders. Neuron, 70(5):886-897, 2011.

[63] Yongjin Li and Jagdish C Patra. Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network. Bioinformatics, 26(9):1219-1224, 2010.

[64] Luana Licata, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli, Eugenia Galeota, Francesca Sacco, Anita Palma, Aurelio Pio Nardozza, Elena Santonico, et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Research, 40(D1):D857-D861, 2012.

[65] Bolan Linghu, Evan S Snitkin, Zhenjun Hu, Yu Xia, and Charles DeLisi. Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network. Genome Biology, 10(9):R91, 2009.

[66] Li Liu, Jing Lei, Stephan J Sanders, Arthur Jeremy Willsey, Yan Kou, Abdullah Ercument Cicek, Lambertus Klei, Cong Lu, Xin He, Mingfeng Li, et al. DAWN: a framework to identify autism genes and subnetworks using gene expression and genetics. Molecular Autism, 5(1):22, 2014.

[67] Manway Liu, Arthur Liberzon, Sek Won Kong, Weil R Lai, Peter J Park, Isaac S Kohane, and Simon Kasif. Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genetics, 3(6):e96, 2007.

[68] Donna Maglott, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research, 39(suppl 1):D52-D57, 2011.

[69] Christian R Marshall and Stephen W Scherer. Detection and characterization of copy number variation in autism spectrum disorder. In Genomic Structural Variants, pages 115-135. Springer, 2012.

[70] Douglas R Mathern, Avantika Chitre, Lloyd Mayer, and Stephanie Dahan. The Notch signaling pathway mediates tight junction protein stoichiometry in IBD: P-203. Inflammatory Bowel Diseases, 17:S72, 2011.

[71] Hans-Werner Mewes, Sabine Dietmann, Dmitrij Frishman, Richard Gregory, Gertrud Mannhaupt, Klaus FX Mayer, Martin Münsterkötter, Andreas Ruepp, Manuel Spannagl, Volker Stümpflen, et al. MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Research, 36(suppl 1):D196-D201, 2008.

[72] Marcela K Monaco, Joshua Stein, Sushma Naithani, Sharon Wei, Palitha Dharmawardhana, Sunita Kumari, Vindhya Amarasinghe, Ken Youens-Clark, James Thomason, Justin Preece, et al. Gramene 2013: comparative plant genomics resources. Nucleic Acids Research, 42(D1):D1193-D1199, 2014.

[73] Linda B Moran and Manuel B Graeber. Towards a pathway definition of Parkinson's disease: a complex disorder with links to cancer, diabetes and inflammation. Neurogenetics, 9(1):1-13, 2008.

[74] Eric M Morrow, Seung-Yun Yoo, Steven W Flavell, Tae-Kyung Kim, Yingxi Lin, Robert Sean Hill, Nahit M Mukaddes, Soher Balkhy, Generoso Gascon, Asif Hashmi, et al. Identifying autism loci and genes by tracing recent shared ancestry. Science, 321(5886):218-223, 2008.

[75] Saket Navlakha and Carl Kingsford. The power of protein interaction networks for associating genes with diseases. Bioinformatics, 26(8):1057-1063, 2010.

[76] Rod K Nibbe, Mehmet Koyutürk, and Mark R Chance. An integrative-omics approach to identify functional sub-networks in human colorectal cancer. PLoS Computational Biology, 6(1):e1000639, 2010.

[77] Rod K Nibbe, Sanford Markowitz, Lois Myeroff, Rob Ewing, and Mark R Chance. Discovery and scoring of protein interaction subnetworks discriminative of late stage human colon cancer. Molecular & Cellular Proteomics, 8(4):827-845, 2009.

[78] Yuhei Nishimura, Christa L Martin, Araceli Vazquez-Lopez, Sarah J Spence, Ana Isabel Alvarez-Retuerto, Marian Sigman, Corinna Steindler, Sandra Pellegrini, N Carolyn Schanen, Stephen T Warren, et al. Genome-wide expression profiling of lymphoblastoid cell lines distinguishes different forms of autism and reveals shared pathways. Human Molecular Genetics, 16(14):1682-1698, 2007.

[79] Tiago Nunes, Claudio Bernardazzi, and Heitor S de Souza. Cell Death and Inflammatory Bowel Diseases: Apoptosis, Necrosis, and Autophagy in the Intestinal Epithelium. BioMed Research International, 2014, 2014.

[80] International Molecular Genetic Study of Autism Consortium et al. A full genome screen for autism with evidence for linkage to a region on chromosome 7q. Human Molecular Genetics, 7(3), 1998.

[81] International Molecular Genetic Study of Autism Consortium et al. A genomewide screen for autism: strong evidence for linkage to chromosomes 2q, 7q, and 16p. American Journal of Human Genetics, 69(3):570, 2001.

[82] Stephen Oliver. Proteomics: guilt-by-association goes global. Nature, 403(6770):601-603, 2000.

[83] Brian J O'Roak, Laura Vives, Santhosh Girirajan, Emre Karakoc, Niklas Krumm, Bradley P Coe, Roie Levy, Arthur Ko, Choli Lee, Joshua D Smith, et al. Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature, 485(7397):246-250, 2012.

[84] Martin Oti and Han G Brunner. The modular nature of genetic diseases. Clinical Genetics, 71(1):1-11, 2007.

[85] Martin Oti, Berend Snel, Martijn A Huynen, and Han G Brunner. Predicting disease genes using protein-protein interactions. Journal of Medical Genetics, 43(8):691-698, 2006.

[86] Sally Ozonoff, Gregory S Young, Alice Carter, Daniel Messinger, Nurit Yirmiya, Lonnie Zwaigenbaum, Susan Bryson, Leslie J Carver, John N Constantino, Karen Dobkins, et al. Recurrence risk for autism spectrum disorders: a Baby Siblings Research Consortium study. Pediatrics, 128(3):e488-e495, 2011.

[87] Neelroop N Parikshak, Rui Luo, Alice Zhang, Hyejung Won, Jennifer K Lowe, Vijayendran Chandran, Steve Horvath, and Daniel H Geschwind. Integrative functional genomic analyses implicate specific molecular pathways and circuits in autism. Cell, 155(5):1008-1021, 2013.

[88] Luca Pastorelli, Carlo De Salvo, Marissa A Cominelli, Maurizio Vecchi, and Theresa T Pizarro. Novel cytokine signaling pathways in inflammatory bowel disease: insight into the dichotomous functions of IL-33 during chronic intestinal inflammation. Therapeutic Advances in Gastroenterology, 4(5):311-323, 2011.

[89] Carolina Perez-Iratxeta, Peer Bork, and Miguel A Andrade-Navarro. Update of the G2D tool for prioritization of gene candidates to inherited diseases. Nucleic Acids Research, 35(suppl 2):W212-W216, 2007.

[90] Dalila Pinto, Alistair T Pagnamenta, Lambertus Klei, Richard Anney, Daniele Merico, Regina Regan, Judith Conroy, Tiago R Magalhaes, Catarina Correia, Brett S Abrahams, et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature, 466(7304):368-372, 2010.

[91] G Poelmans, B Franke, DL Pauls, JC Glennon, and JK Buitelaar. AKAPs integrate genetic findings for autism spectrum disorders. Translational Psychiatry, 3(6):e270, 2013.

[92] Predrag Radivojac, Kang Peng, Wyatt T Clark, Brandon J Peters, Amrita Mohan, Sean M Boyle, and Sean D Mooney. An integrated approach to inferring gene-disease associations in humans. Proteins: Structure, Function, and Bioinformatics, 72(3):1030-1037, 2008.

[93] Monika Ray, Jianhua Ruan, and Weixiong Zhang. Variations in the transcriptome of Alzheimer's disease reveal molecular networks involved in cardiovascular diseases. Genome Biology, 9(10):R148, 2008.

[94] Richard Redon, Shumpei Ishikawa, Karen R Fitch, Lars Feuk, George H Perry, T Daniel Andrews, Heike Fiegler, Michael H Shapero, Andrew R Carson, Wenwei Chen, et al. Global variation in copy number in the human genome. Nature, 444(7118):444-454, 2006.

[95] Angelica Ronald, Francesca Happé, Patrick Bolton, Lee M Butcher, Thomas S Price, Sally Wheelwright, Simon Baron-Cohen, and Robert Plomin. Genetic heterogeneity between the three components of the autism spectrum: a twin study. Journal of the American Academy of Child & Adolescent Psychiatry, 45(6):691-699, 2006.

[96] Rebecca E Rosenberg, J Kiely Law, Gayane Yenokyan, John McGready, Walter E Kaufmann, and Paul A Law. Characteristics and concordance of autism spectrum disorders among 277 twin pairs. Archives of Pediatrics & Adolescent Medicine, 163(10):907-914, 2009.

[97] Lukasz Salwinski, Christopher S Miller, Adam J Smith, Frank K Pettit, James U Bowie, and David Eisenberg. The database of interacting proteins: 2004 update. Nucleic Acids Research, 32(suppl 1):D449-D451, 2004.

[98] Rodney C Samaco, Amber Hogart, and Janine M LaSalle. Epigenetic overlap in autism-spectrum neurodevelopmental disorders: MECP2 deficiency causes reduced expression of UBE3A and GABRB3. Human Molecular Genetics, 14(4):483-492, 2005.

[99] Carl F Schaefer, Kira Anthony, Shiva Krupa, Jeffrey Buchoff, Matthew Day, Timo Hannay, and Kenneth H Buetow. PID: the pathway interaction database. Nucleic Acids Research, 37(suppl 1):D674-D679, 2009.

[100] Patrick R Schmid, Nathan P Palmer, Isaac S Kohane, and Bonnie Berger. Making sense out of massive data by going beyond differential expression. Proceedings of the National Academy of Sciences, 109(15):5594-5599, 2012.

[101] Jonathan Sebat, B Lakshmi, Dheeraj Malhotra, Jennifer Troge, Christa Lese-Martin, Tom Walsh, Boris Yamrom, Seungtai Yoon, Alex Krasnitz, Jude Kendall, et al. Strong association of de novo copy number mutations with autism. Science, 316(5823):445-449, 2007.

[102] David Q Shih and Stephan R Targan. Insights into IBD pathogenesis. Current Gastroenterology Reports, 11(6):473-480, 2009.

[103] Chris Stark, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, and Mike Tyers. BioGRID: a general repository for interaction datasets. Nucleic Acids Research, 34(suppl 1):D535-D539, 2006.

[104] Jennifer L Stone, Barry Merriman, Rita M Cantor, Amanda L Yonan, T Conrad Gilliam, Daniel H Geschwind, and Stanley F Nelson. Evidence for sex-specific risk alleles in autism spectrum disorder. American Journal of Human Genetics, 75(6):1117-1123, 2004.

[105] Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545-15550, 2005.

[106] Satoshi Sumi, Hiroko Taniai, Taishi Miyachi, and Mitsuyo Tanemura. Sibling risk of pervasive developmental disorder estimated by means of an epidemiologic survey in Nagoya, Japan. Journal of Human Genetics, 51(6):518-522, 2006.

[107] Peter Szatmari, Andrew D Paterson, Lonnie Zwaigenbaum, Wendy Roberts, Jessica Brian, Xiao-Qing Liu, John B Vincent, Jennifer L Skaug, Ann P Thompson, Lill Senman, et al. Mapping autism risk loci using genetic linkage and chromosomal rearrangements. Nature Genetics, 39(3):319-328, 2007.

[108] Hiroko Taniai, Takeshi Nishiyama, Taishi Miyachi, Masayuki Imaeda, and Satoshi Sumi. Genetic influences on the broad spectrum of autism: Study of proband-ascertained twins. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 147(6):844-849, 2008.

[109] Ian W Taylor, Rune Linding, David Warde-Farley, Yongmei Liu, Catia Pesquita, Daniel Faria, Shelley Bull, Tony Pawson, Quaid Morris, and Jeffrey L Wrana. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nature Biotechnology, 27(2):199-204, 2009.

[110] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267-288, 1996.

[111] Nicki Tiffin, Euan Adie, Frances Turner, Han G Brunner, Marc A van Driel, Martin Oti, Nuria Lopez-Bigas, Christos Ouzounis, Carolina Perez-Iratxeta, Miguel A Andrade-Navarro, et al. Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nucleic Acids Research, 34(10):3067-3081, 2006.

[112] Marc A van Driel, Jorn Bruggeman, Gert Vriend, Han G Brunner, and Jack AM Leunissen. A text-mining analysis of the human phenome. European Journal of Human Genetics, 14(5):535-542, 2006.

[113] Oron Vanunu, Oded Magger, Eytan Ruppin, Tomer Shlomi, and Roded Sharan. Associating genes and protein complexes with disease via network propagation. PLoS Computational Biology, 6(1):e1000641, 2010.

[114] Marc Vidal. A unifying view of 21st century systems biology. FEBS Letters, 583(24):3891-3894, 2009.

[115] Christian Von Mering, Lars J Jensen, Michael Kuhn, Samuel Chaffron, Tobias Doerks, Beate Kruger, Berend Snel, and Peer Bork. STRING 7-recent developments in the integration and prediction of protein interactions. Nucleic Acids Research, 35(suppl 1):D358-D362, 2007.

[116] Kai Wang, Haitao Zhang, Deqiong Ma, Maja Bucan, Joseph T Glessner, Brett S Abrahams, Daia Salyakina, Marcin Imielinski, Jonathan P Bradfield, Patrick MA Sleiman, et al. Common genetic variants on 5p14.1 associate with autism spectrum disorders. Nature, 459(7246):528-533, 2009.

[117] Xiujuan Wang, Natali Gulbahce, and Haiyuan Yu. Network-based methods for human disease gene prediction. Briefings in Functional Genomics, 10(5):280-293, 2011.

[118] Jia Wei and Jiexiong Feng. Signaling pathways associated with inflammatory bowel disease. Recent Patents on Inflammation & Allergy Drug Discovery, 4(2):105-117, 2010.

[119] A Jeremy Willsey, Stephan J Sanders, Mingfeng Li, Shan Dong, Andrew T Tebbenkamp, Rebecca A Muhle, Steven K Reilly, Leon Lin, Sofia Fertuzinhos, Jeremy A Miller, et al. Coexpression networks implicate human midfetal deep cortical projection neurons in the pathogenesis of autism. Cell, 155(5):997-1007, 2013.

[120] Christof Winter, Glen Kristiansen, Stephan Kersting, Janine Roy, Daniela Aust, Thomas Knösel, Petra Rimmele, Beatrix Jahnke, Vera Hentrich, Felix Rickert, et al. Google goes cancer: improving outcome prediction for cancer patients by network-based ranking of marker genes. PLoS Computational Biology, 8(5):e1002511, 2012.

[121] Xuebing Wu, Rui Jiang, Michael Q Zhang, and Shao Li. Network-based global inference of human disease genes. Molecular Systems Biology, 4(1), 2008.

[122] Xuebing Wu, Qifang Liu, and Rui Jiang. Align human interactome with phenome to identify causative genes and networks underlying disease families. Bioinformatics, 25(1):98-104, 2009.

[123] Amanda L Yonan, Maricela Alarcon, Rong Cheng, Patrik KE Magnusson, Sarah J Spence, Abraham A Palmer, Adina Grunn, Suh-Hang Hank Juo, Joseph D Terwilliger, Jianjun Liu, et al. A genomewide screen of 345 families for autism-susceptibility loci. The American Journal of Human Genetics, 73(4):886-897, 2003.

[124] Mengjin Zhu and Shuhong Zhao. Candidate gene identification approach: progress and challenges. International Journal of Biological Sciences, 3(7):420, 2007.
