PREDICTING INFORMATIVE SPATIO-TEMPORAL NEURODEVELOPMENTAL WINDOWS AND GENE RISK FOR AUTISM SPECTRUM DISORDER.

A thesis submitted to the Graduate School of Engineering and Science of Bilkent University in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering

By Oğuzhan Karakahya
October 2020

We certify that we have read this thesis and that in our opinion it is fully adequate, in scope and in quality, as a thesis for the degree of Master of Science.

A. Ercüment Çiçek (Advisor)

Can Alkan

Tunca Doğan

Approved for the Graduate School of Engineering and Science:

Ezhan Karaşan
Director of the Graduate School

ABSTRACT

PREDICTING INFORMATIVE SPATIO-TEMPORAL NEURODEVELOPMENTAL WINDOWS AND GENE RISK FOR AUTISM SPECTRUM DISORDER.

Oğuzhan Karakahya
M.S. in Computer Engineering
Advisor: A. Ercüment Çiçek
October 2020

Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder with a strong genetic basis. Due to its intricate nature, only a fraction of the risk genes have been identified despite the effort spent on large-scale sequencing studies. To uncover the underlying mechanisms of ASD and predict new risk genes, we design a deep learning architecture that processes the mutational burden of genes and gene co-expression networks using graph convolutional networks. In addition, a mixture of experts model is employed to detect specific neurodevelopmental periods that are of particular importance for the etiology of the disorder. This end-to-end trainable model produces a posterior ASD risk probability for each gene and learns the importance of each network for this prediction. Our results show that ASD gene risk prediction power is improved compared to the state-of-the-art models. We identify the mediodorsal nucleus of the thalamus and cerebellum brain region, during the neonatal and early infancy to middle and late childhood period (0 months - 12 years), as the most informative neurodevelopmental window for prediction. Top predicted risk genes are found to be highly enriched in ASD-associated pathways and transcription factor targets. We pinpoint several new candidate risk genes in CNV regions associated with ASD. We also investigate confident false positives and false negatives of the method and point to studies that support its predictions.

Keywords: Autism Spectrum Disorder, Graph Convolutional Networks, Deep Learning.

ÖZET

PREDICTING INFORMATIVE SPATIO-TEMPORAL NEURODEVELOPMENTAL WINDOWS AND GENE RISK FOR AUTISM SPECTRUM DISORDER

Oğuzhan Karakahya
M.S. in Computer Engineering
Advisor: A. Ercüment Çiçek
October 2020

Autism Spectrum Disorder (ASD) is a complex disorder that can arise from genetic causes and adversely affects mental development. Because of its complex nature, only a small percentage of the risk genes that cause this disorder have been identified through gene sequencing studies. To understand the factors that cause this disorder, a deep learning architecture that can use mutational burden data on gene co-expression graphs was designed. In addition, a mixture of experts model was added to the deep learning model to detect the neurodevelopmental periods that are important for the disorder. This end-to-end trainable system learns a weight per graph and can assign a probability to every gene. The results of our model show that ASD gene risk prediction power is improved compared to the state-of-the-art models. The mediodorsal nucleus of the thalamus and cerebellum brain region, in the period from neonatal/early infancy to middle/late childhood (0 months - 12 years), was identified as the highest-risk window. Our results also point to good enrichment for key biological pathways and gene targets known to be related to autism. Several candidate risk genes were observed in copy number variation regions that carry no ASD-associated label. False-positive ground truth genes, genes that are unlabeled but highly likely to be associated with ASD, and highly ranked false-negative ground truth genes were examined.

Keywords: Autism Spectrum Disorder, Graph Convolutional Networks, Deep Learning.

Acknowledgement

First of all, I would like to thank my advisor, Asst. Prof. Ercüment Çiçek, for his understanding and assistance throughout my study. It would not have been possible for me to conduct this study or to become a researcher without his guidance and patience. I am very grateful for his continuous support.

I am also thankful to my jury members X and Y for reading my thesis and for agreeing to serve on my thesis committee. I thank the Simons Foundation Autism Research Initiative for funding this research via the SFARI 640935 pilot grant awarded to Ercüment Çiçek.

I would like to thank İlayda Beyreli for her support on this work. She helped me design the architecture and implement the source code for this study. She also shared her invaluable feedback with me during the whole process.

I feel very lucky to have been part of the Bilkent University community for 7 years. During my studies, I made invaluable friendships and collected wonderful memories. I would like to thank Doğukan, Arda, Yusuf, Ozan and Muammer for their friendship and for our good memories. I am also very grateful for all the good times we had with Ayberk, Sezernaz and Yağmur, and for their precious friendship. I would also like to thank my colleague and friend İlayda for all of her support, feedback and effort. She contributed a great deal to this thesis. I also thank Mustafa for his friendship and for introducing me to the life-long hobby of board games.

Finally, I am endlessly thankful for all the efforts of my parents. They were always on my side, supporting me all the time. Without their love, support and belief, I would not be where I am now. Whatever I do in return, I cannot repay all of their efforts. I will do my best to make them proud.

Contents

1 Introduction

2 Background Information

2.1 De Novo Gene Disrupting Mutation

2.2 Biological Pathways and DNA/RNA Binding Proteins

2.3 Copy Number Variation

2.4 Gene Co-expression Network

2.5 Protein-Protein Interaction Networks

2.6 TADA Framework

3 Related Work

3.1 DAWN

3.2 Genome-wide Ranking by SVM-based Classifier

3.3 DAMAGES Score

3.4 ST-Steiner

4 Methods

4.1 Construction of Gene Co-Expression Networks

4.2 Ground Truth Gene Sets

4.3 DeepASD Model

4.3.1 Graph Convolutional Network

4.3.2 Early Stopping

4.3.3 Weight Decay

4.3.4 Dropout Regularization

4.3.5 Mixture of Experts

4.3.6 Cross-validation Setting

4.3.7 Optimizer and Loss Function

5 Results

5.1 Comparison against the State-of-the-art Methods

5.2 Enrichment Analysis

5.3 CNV Region Analysis

5.4 Neurodevelopmental Period Analysis

5.5 Evaluation of Edge Case Predictions

5.6 Protein-Protein Interactions between ASD Genes

6 Conclusion

A Supplementary Tables

List of Figures

4.1 The architectural model of DeepASD for genome-wide ASD gene risk assessment. The model uses TADA features as well as 52 gene co-expression networks extracted from the BrainSpan dataset. The feature set includes de novo loss-of-function (and missense) mutation counts, transmitted mutation counts for control and case groups, pLI value, de novo mutation frequency values and protein truncating variant counts. The TADA dataset is used by all GCN modules and the gating network. The whole system produces a single probability value ŷ for each of the 25,825 genes and is end-to-end trainable.

5.1 ROC and precision-recall curve distribution comparison between DeepASD and Krishnan et al. (a) Area under ROC curve distributions for the two methods. (b) Area under precision-recall curve distributions comparing the same methods as in (a). In both (a) and (b), outlier points are depicted. The solid center line depicts the median and the dashed line depicts the mean value for each panel. Box limits denote lower and upper quartiles and whiskers denote 1.5 times the interquartile range.

5.2 Process of smoothing SVM output values. Using ten-fold cross-validation, 10 different isotonic regression models are fit. Then, we find the knots in each model and combine them to obtain (a). The transitions are not smooth and further smoothing is required. (b) Isotonic regression is applied once more on this knot collection to obtain a smoother curve. (c) Linear interpolation is used to obtain the final mapping.

5.3 Probability value scatter plot of DeepASD and Krishnan et al. (a) Probability values of E1 genes against all other genes for both methods. (b) Non-mental-health related genes compared to all other genes. Both panels contain the same 25,825 genes. The y = x line (gray) is also drawn as a visual aid.

5.4 Precision-recall curves for DeepASD, DAWN (PFC-MSC3-5 and PFC-MSC4-6), Krishnan et al. and the Zhang and Shen DAMAGES score. (a) The curve for E1 plus non-mental-health genes. (b) E1 + E2 genes are used. (c) All gold standard genes are used. All precision-recall curves have a cutoff rank of 2000. (c) has a starting rank value of 5 whereas (a) and (b) have a starting rank value of 1.

5.5 Enrichment analysis for WNT and MAPK pathways, CHD8, RBFOX (splice and peak), FMRP and TOP1 target genes, post-synaptic density complex and histone modification processes. P values are calculated using the binomial test.

5.6 The genes spanned by 16p11.2 and 15q11-13 CNVs. For each gene, the DeepASD rank is provided along with its evidence level if it is labeled.

5.7 The genes spanned by 15q13.3 and 1q21.1 CNVs. For each gene, the DeepASD rank is provided along with its evidence level if it is labeled.

5.8 The genes spanned by the 22q11 CNV. For each gene, the DeepASD rank is provided along with its evidence level if it is labeled.

5.9 The heatmap of average posterior probabilities assigned by each GCN module of DeepASD to E1 genes. The heatmap shows 4 brain regions with 13 neurodevelopmental windows to illustrate the average posterior probability for all 52 co-expression networks.

5.10 Heatmap illustrating the weights assigned by the gating network to each GCN module (network). The 52 networks are plotted as a 4 × 13 matrix, by brain region and by neurodevelopmental period.

5.11 Tissue-specific frontal cortex PPI network from the DifferentialNet database, constructed using the NetworkAnalyst system. Node sizes indicate betweenness centrality (larger nodes have higher betweenness centrality) and node colors map node degrees, with red indicating higher degree.

List of Tables

4.1 Neurodevelopmental time periods proposed by Willsey et al. (2013). Since neurodevelopment mainly occurs during the pregnancy period, higher precision is provided for this period. The sliding window approach uses these 15 periods to create 13 windows. The window size is 3. Thus, the windows are constructed as [1-3], [2-4], [3-5], ..., [13-15].

4.2 Brain region clusters used to construct 52 co-expression networks using BrainSpan data. These clusters are formed by hierarchical clustering based on the transcriptional similarity of these brain regions (Willsey et al., 2013).

5.1 P value comparison for the rankings of DeepASD, Krishnan et al., DAWN and DAMAGES score using different sets of gold standard genes for evaluation. Each P value is calculated using a two-sided Wilcoxon rank-sum test.

5.2 DeepASD top 1% gene enrichment in the BioPlanet 2019 dataset. The top 29 pathways are included out of 560 pathways. The Enrichr tool is used to calculate enrichment values. Pathways are sorted in increasing order of P value.

5.3 Edge case genes with their 2-hop distance neighbor statistics. 2 genes per group described in Section 5.5 are provided. CHD8 is also added to the table as a reference since it is ranked 1st by TADA and 3rd by DeepASD, and it is a well-established ASD risk gene. TADA rank describes the ranking of a gene in TADA with respect to its q-value. E1, E2, E3/E4 and negative genes describe the count of such genes in the 2-hop neighborhood. The TADA genes column is the count of unlabeled genes within the 2-hop neighborhood with TADA q-value rank < 1000.

A.1 46 E1 ground truth gene list. These genes are the most confident ASD risk genes and are also used for performance calculation along with non-mental-health related genes. An evidence weight of 1.0 is used while training with these genes.

A.2 67 E2 ground truth genes. Their evidence weight is set to 0.5 and they are excluded while calculating performance metrics. These genes are the most confident ASD genes after E1 genes.

A.3 525 E3-E4 ground truth genes with evidence weight 0.25. They are the positive ground truth genes with the lowest confidence. They are not included in performance metric calculation.

A.4 First 600 non-mental-health related genes. All non-mental-health related genes have an evidence weight of 1.0. They are used along with E1 genes for performance metric calculation. The remaining genes are given in Table A.5.

A.5 Remaining 585 non-mental-health related genes. The first 600 genes are given in Table A.4.

A.6 Gene list for the WNT signaling process. This gene set is used to calculate enrichment for WNT signaling.

A.7 Gene list for the MAPK signaling process. This set is used to perform enrichment analysis using the DeepASD ranking for MAPK signaling.

A.8 CHD8 target gene list part 1 (of 3 parts) containing 750 genes. This gene set is used for enrichment calculations of the DeepASD ranking with respect to CHD8 targets.

A.9 Part 2 (out of 3) of CHD8 target gene list containing the second 750 genes.

A.10 Part 3 (out of 3) of CHD8 target gene list containing the last 584 genes.

A.11 RBFOX (splice) target genes used for calculating DeepASD ranking enrichment.

A.12 Part 1 (out of 2) of RBFOX (peak) target gene list (first 588 genes) that is used for DeepASD enrichment analysis.

A.13 Part 2 (out of 2) of RBFOX (peak) target gene list containing the last 397 genes.

A.14 Part 1 (out of 2) of gene list for FMRP targets containing the first 588 genes. These genes are used to calculate the enrichment score of DeepASD for FMRP targets.

A.15 Part 2 (out of 2) of gene list for FMRP targets containing the last 207 genes.

A.16 TOP1 target gene list that is used to calculate DeepASD enrichment for TOP1 targets.

A.17 Part 1 (out of 3) of post synaptic density (PSD) complex gene set containing the first 588 genes. These genes are used to calculate enrichment of the DeepASD ranking on the PSD complex gene set.

A.18 Part 2 (out of 3) of PSD complex gene set containing the second 588 genes.

A.19 Part 3 (out of 3) of post synaptic density (PSD) complex gene set containing the last 282 genes.

A.20 Part 1 (out of 2) of histone modifier gene list containing the first 588 genes. These genes are used to calculate enrichment for the DeepASD ranking on histone modifiers.

A.21 Part 2 (out of 2) of histone modifier gene list containing the last 129 genes.

A.22 First 132 gene rankings and posterior probabilities generated by DeepASD. Probabilities are obtained by averaging results from 200 epochs, excluding results coming from training data.

A.23 Genes with ranks between 133 and 258 and their posterior probabilities generated by DeepASD. Probabilities are obtained by averaging results from 200 epochs, excluding results coming from training data.

Chapter 1

Introduction

Autism Spectrum Disorder (ASD) is a neurodevelopmental genetic disorder in whose etiology approximately a thousand genes are involved [1]. Due to its complexity and high prevalence in the population, many large-scale exome [2, 3, 4, 5, 6, 7, 8] and genome [9, 10, 11, 12, 13, 14] sequencing studies have been conducted to understand the cellular, functional and genetic structure of ASD. Despite all of this effort, the most comprehensive and recent study found strong evidence for only 102 risk genes (FDR ≤ 0.1) after inspecting ∼36k samples [15]. Therefore, roughly 90% of the risk genes are yet to be discovered and validated. This large information gap stems from the complexity of the disorder's mechanisms and the insufficient number of samples (i.e., trios). One of the most valuable sources of information for gene risk assessment is de novo loss-of-function (dnLoF) mutations. These mutations are scarce and are spread over a varied set of genes, which hinders the discovery of important signals. As a result, the number of identified risk genes is low compared to the number of samples inspected. dnLoF mutations and their significance in ASD gene risk assessment are also described in Section 2.1. The mutation burden data is sparse, since most genes have no observed dnLoF mutation. Moreover, most dnLoF mutations are observed only once in a gene. Although dnLoF mutations generally provide useful information, a single observed dnLoF mutation might be insufficient to identify a gene as a risk gene, and might be noise. Therefore, the data is sparse and has intrinsic noise, which makes it essential to design algorithms that are robust against this noise and capable of discovering missing risk genes. This has led to a broad literature on gene risk prediction methods.

The first type of gene risk assessment method calculates risk using statistical methods on genetic burden data obtained from case-control and family studies [1]. Thanks to recent improvements, these methods can also work with multiple traits [16]. The second type is rather a gene discovery approach. In this line of work, gene disruptive mutation counts and de novo gene disruptive mutation counts are used as prior risk to obtain posterior gene risk values adjusted by a gene interaction network. These methods use the guilt-by-association principle and (i) provide a genome-wide ranking by propagating available data through the network and predicting risk for unlabeled genes, and (ii) infer biological pathways, molecular functions and gene subnetworks that are associated with the disorder [1, 17, 18, 19, 20, 21, 22, 23, 24]. These methods generally utilize a single source of network data, such as gene co-expression networks, protein-protein interaction networks, or brain-specific gene interaction data. Several methods combine a portion of these data; however, they are not capable of utilizing all of them simultaneously [19, 21, 25, 26], which reduces the overall prediction capability since many sources of information are discarded. For instance, gene expression values measured from various brain regions can be used to construct gene interaction networks that model neurodevelopment. However, since these methods cannot process multiple networks at once, gene risk assessment per brain region and per developmental window cannot be performed. Moreover, it is not possible to pinpoint important brain regions and neurodevelopmental time points for disease etiology. A common practice is to perform such an analysis independently from gene risk assessment, as a separate downstream task [26, 27]. This is unfortunate: if the model were able to analyze all networks simultaneously, this task would be embedded in the genome-wide risk assessment task, which would then (i) improve the prediction power by utilizing a diverse set of gene interaction information, and (ii) let the model learn the importance of each brain region and neurodevelopmental period. This way, it would be possible to obtain important biological insights while possibly producing a better ranking of the risk genes.

To address the issues stated above, we introduce a novel deep learning model (Deep Autism Spectrum Disorder algorithm - DeepASD). Our model uses graph convolution and analyzes multiple gene co-expression networks to learn an embedding for each gene on each network. Then, it uses a gating network (mixture of experts) to learn the importance of each network for risk prediction. Hence, the model is interpretable in this sense. Our results show that the mediodorsal nucleus of the thalamus and cerebellar cortex brain region, during the neonatal/early infancy to middle and late childhood period (0 months - 12 years), is the most important neurodevelopmental window for ASD gene risk prediction. We compare DeepASD with other state-of-the-art methods in terms of gene risk prediction performance. Additionally, we show that the top percentile genes predicted by DeepASD are highly enriched in pathways that are known to be associated with ASD. We also inspect CNV regions that are associated with ASD, and the genes within these regions, to show that DeepASD can prioritize risk genes inside these regions. We discuss the candidacy of several unlabeled genes inside these CNV regions that are ranked high by DeepASD. Finally, we discuss new findings and confident false positive/negative predictions of the method: top-ranked negatively labeled genes and bottom-ranked positively labeled ground truth genes.
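As a rough illustration of how a gating network can weigh per-network predictions, the sketch below combines one risk probability per network into a single value. It is not the DeepASD implementation: the softmax gating, the function names and the toy numbers are assumptions made for illustration only.

```python
from math import exp

def softmax(logits):
    """Turn arbitrary gate scores into non-negative weights that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(expert_probs, gate_logits):
    """Weighted average of per-network probabilities (assumed combination rule)."""
    weights = softmax(gate_logits)
    y_hat = sum(w * p for w, p in zip(weights, expert_probs))
    return y_hat, weights

# Hypothetical gene scored by three networks (experts); values are made up.
expert_probs = [0.9, 0.2, 0.6]
gate_logits = [2.0, 0.0, 0.0]            # the gate favors the first network
y_hat, weights = mixture_of_experts(expert_probs, gate_logits)
print(round(y_hat, 3), [round(w, 3) for w in weights])
```

In such a scheme the learned weights double as an importance score per network, which is the sense in which the text calls the model interpretable.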

The subsequent chapters are organized as follows. Chapter 2 contains more background information, such as descriptions of technical terms and concepts. In Chapter 3, previous studies and their findings on ASD gene risk prediction are discussed. In Chapter 4, all the methods and algorithms we have used, such as graph convolutional networks and mixture of experts, are described. In Chapter 5, we present our results. Finally, we conclude this thesis in Chapter 6 with a brief summary of the findings of this study.

Chapter 2

Background Information

2.1 De Novo Gene Disrupting Mutation

A gene is defined as a sequence of chromosomal DNA that can be transcribed into either functional or non-functional RNA, which can then be used to produce proteins [28]. Genes are the hereditary building blocks. Since humans have two copies of each chromosome, there exist at least two copies of each gene, which are called alleles. Alleles determine the phenotype, the physical traits, of a person. Although human DNA is 99.9% similar between individuals, the actual phenotypes can vary a lot as a result of the variation in the DNA. Most variants are harmless. However, some can cause harm when they alter the protein produced by the gene. Thus, the location of the mutation is of the greatest importance. There are multiple mutation types, such as missense, nonsense, silent and frameshift, that may or may not affect the final protein of a gene. If, after a mutation, a gene cannot produce its functional protein, that gene is disrupted, or knocked out.

The term de novo (literally 'of new') can be used for mutations. When a variant is present in a child's DNA but not in the parents' DNA, it is called a de novo variant. This happens when a germ cell of one of the parents is mutated, or when the fertilized egg is mutated during the early embryogenesis period.
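The definition above amounts to a set difference over a parent-child trio. The following minimal sketch shows this; the function name and the variant identifiers are hypothetical and the data is made up.

```python
def de_novo_variants(child, mother, father):
    """Return variants seen in the child but in neither parent."""
    return set(child) - set(mother) - set(father)

# Hypothetical variant identifiers of the form "chrom:pos:ref>alt".
child = {"1:1000:A>T", "2:5000:G>C", "7:42:C>CT"}
mother = {"1:1000:A>T"}
father = {"2:5000:G>C"}

print(sorted(de_novo_variants(child, mother, father)))
# → ['7:42:C>CT']
```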

De novo gene disrupting mutations have been shown to be essential for ASD gene risk assessment [21]. When both parents are unaffected and the child has ASD, it is likely that a de novo mutation disrupted one or more genes and contributed to the disorder. Therefore, genes with de novo gene disrupting mutations in ASD probands are good candidates for being risk genes.

2.2 Biological Pathways and DNA/RNA Binding Proteins

A biological pathway is defined as a series of successive processes within a cell that alter the cell structure or produce a certain product. Biological pathways control various kinds of processes within the body, and can be disrupted by mutations occurring in at least one of the genes involved.

Recent studies showed that many different pathways are commonly disrupted among affected ASD probands [5, 29, 30, 31]. These pathways include the Wnt/β-catenin, mTOR, and MAPK pathways. It is important to identify such pathways since all genes involved in a pathway become candidate risk genes, which helps to narrow down the list of suspected genes.

Apart from biological pathways, we are also concerned with the target genes of specific proteins. A protein is called an RNA-binding protein (RBP) if it is responsible for regulating mRNA translation and maintenance. Similarly, there are proteins that bind to DNA, which are called DNA-binding proteins (DBPs). DBPs and RBPs regulate the production of other proteins. Some RBPs and DBPs have been shown to be associated with ASD [32, 33, 34, 35]. We inspect RBPs and DBPs related to ASD to find possible risk genes, and use known RBP and DBP target genes to validate our findings.

In the context of ASD gene risk assessment, we perform enrichment analyses to show that the ranking generated by DeepASD is sensible. The target gene lists of RBPs and DBPs, the genes involved in a biological pathway, or the genes involved in a molecular process can be used to perform enrichment analysis. First, we divide the gene ranking produced by DeepASD into deciles. Then, for each decile, we count the number of genes that belong to the gene list of interest (a pathway, a target gene list, etc.). Next, we plot the number of genes from the gene list in each decile. If the top decile contains more of these genes than the other deciles, we can infer that our ranking is sensible. To quantify this, we calculate a P-value indicating the probability of obtaining such an enrichment profile by chance. This analysis is performed and the results are described in Chapter 5 (Results).
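The decile-based procedure described above can be sketched as follows. This is a minimal illustration, not the thesis code: the helper names and the toy ranking are hypothetical, and the null model (each gene-set member lands in the top decile with probability 0.1, tested one-sided) is an assumption about how the binomial test is set up.

```python
from math import comb

def decile_counts(ranking, gene_set):
    """Count gene-set members in each of the 10 deciles of a ranked gene list."""
    n = len(ranking)
    counts = [0] * 10
    for i, gene in enumerate(ranking):
        if gene in gene_set:
            counts[i * 10 // n] += 1     # index 0 is the top decile
    return counts

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Toy ranking of 100 hypothetical genes, best first.
ranking = [f"G{i}" for i in range(100)]
gene_set = {"G0", "G1", "G2", "G3", "G4", "G5", "G50", "G90"}

counts = decile_counts(ranking, gene_set)
hits = sum(counts)
# Assumed null: a gene-set member falls in the top decile with probability 0.1.
p_value = binomial_upper_tail(counts[0], hits, 0.1)

print(counts)               # → [6, 0, 0, 0, 0, 1, 0, 0, 0, 1]
print(f"{p_value:.2e}")     # → 2.34e-05
```

A small P-value here means that the concentration of gene-set members in the top decile is unlikely under a random ranking.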

2.3 Copy Number Variation

After the Human Genome Project was completed, it became evident that some people have more (or fewer) copies of certain regions in their genome, and that this is quite common in the population [36]. This phenomenon is called copy number variation (CNV). CNVs can occur anywhere within the genome through insertions or deletions of a region (or complex rearrangements of regions). CNV regions can span multiple genes and range from more than a thousand nucleotides up to millions of nucleotides [37]. CNVs can be transmitted, but they can also occur de novo. CNVs can be harmless, or can cause disorders such as ASD, schizophrenia and Parkinson's disease [38].

Similar to de novo mutation analysis, studies show that particular CNVs are associated with ASD [37, 39, 40, 41, 42, 43, 44]. Sebat et al. [45] identified de novo CNVs in 10% of the case group but in only 1% of the control group, which shows that, like de novo gene disrupting mutations, de novo CNVs are important for identifying ASD risk genes. As an example, Chung et al. [43] point out that the maternal 15q11-13 CNV is associated with ASD. Here, 15q denotes the long arm of (maternal) chromosome 15, and 11-13 denotes the bands within that arm.

We inspect the CNVs occurring at the 16p11.2, 15q11-13, 15q13.3, 1q21.1 and 22q11 locations, which have been reported to be associated with ASD [26]. Since these CNVs span multiple genes, we identified the genes that lie inside the boundaries of these regions and plotted their rankings in Chapter 5 (Results).

2.4 Gene Co-expression Network

In the human body, approximately 400,000 proteins are produced [46]. Different cells across the body need to produce particular sets of proteins. Therefore, it is not necessary for the whole genome to be active in every cell. The gene expression mechanism regulates this by adjusting the types and amounts of mRNA in a particular cell. We can use this information to identify genes that are active in a specific tissue, or to calculate their co-expression.

Co-expression is calculated as the correlation between genes over a set of gene expression samples. Given the gene expression matrix, we calculate the Pearson correlation between every pair of genes to obtain a correlation matrix, which we call the co-expression matrix. Each entry of this matrix lies between −1 and 1 and indicates the co-expression of two genes. To obtain a network from a co-expression matrix, we apply a threshold to the absolute co-expression values to convert them to boolean values. The resulting matrix is the adjacency matrix of the gene co-expression network. A typical threshold value is 0.7, which is often enough to capture strong co-expression between genes. The threshold can be tuned to obtain a network of the desired size: a higher threshold yields smaller networks with stronger connections, whereas a lower threshold yields bigger networks with looser connections.
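The construction described above can be sketched in a few lines. This is an illustrative sketch using only the standard library, not the pipeline used in the thesis; the gene names and expression values are made up, and treating a value exactly equal to the threshold as an edge is an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:               # constant profile: treat as uncorrelated
        return 0.0
    return cov / (sx * sy)

def coexpression_adjacency(expr, threshold=0.7):
    """Link two genes when |Pearson r| >= threshold; returns neighbor sets."""
    genes = sorted(expr)
    adj = {g: set() for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expr[g], expr[h])) >= threshold:
                adj[g].add(h)
                adj[h].add(g)
    return adj

# Hypothetical expression values of four genes over four samples.
expr = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.1, 5.9, 8.0],      # rises with GENE_A
    "GENE_C": [4.0, 3.0, 2.1, 1.0],      # falls as GENE_A rises
    "GENE_D": [1.0, 3.0, 1.0, 3.0],      # only weakly related to the others
}
adj = coexpression_adjacency(expr)
for gene in sorted(adj):
    print(gene, sorted(adj[gene]))
```

Because the threshold is applied to the absolute value, strongly anti-correlated genes (GENE_A and GENE_C here) are connected just like positively correlated ones.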

For ASD gene risk assessment, we focus on gene expression in various brain regions. The BrainSpan dataset from the Allen Brain Atlas [47, 48] provides 57 samples from 15 brain regions, with ages ranging from 8 postconception weeks to 40 years. We cluster these samples with respect to brain regions and then create a gene co-expression network for each brain region and neurodevelopmental period. These spatio-temporal networks allow us to feed the neurodevelopmental process into our model. More detail on constructing these networks is provided in Chapter 4 (Methods).
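The captions of Tables 4.1 and 4.2 state that 15 developmental periods are grouped into 13 sliding windows of size 3 and crossed with 4 brain region clusters, giving 52 networks. The sketch below only enumerates these combinations; the region labels are placeholders rather than the actual cluster names.

```python
def sliding_windows(n_periods=15, size=3):
    """Enumerate windows [1-3], [2-4], ..., as (first, last) period indices."""
    return [(start, start + size - 1) for start in range(1, n_periods - size + 2)]

windows = sliding_windows()
regions = ["region_1", "region_2", "region_3", "region_4"]   # placeholder labels

# One co-expression network per (brain region cluster, window) pair.
networks = [(region, window) for region in regions for window in windows]

print(windows[0], windows[-1], len(windows), len(networks))
# → (1, 3) (13, 15) 13 52
```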

2.5 Protein-Protein Interaction Networks

Protein-protein interaction (PPI) networks denote biochemical interactions between proteins. These interactions are often physical and they represent a specific function within an organism. There are multiple types of protein-protein interactions, such as transient, permanent, covalent and non-covalent interactions. Similarly, PPIs can be categorized depending on the region of interest. Cell PPIs focus on interactions occurring within a cell, whereas tissue-specific PPIs include interactions specific to a tissue. There are many PPI databases, such as BioGrid [49], HPRD [50] and STRING [51]. These databases are compiled from thousands of academic studies and provide various types of PPIs, including the types described above. In the scope of ASD risk assessment, PPIs can be used to exploit functional similarity among genes: a gene is more likely to be a risk gene if it is functionally similar to other risk genes. However, we did not use PPI networks during the training of DeepASD. Instead, we used the top 1% of genes ranked by DeepASD as a gene filter for tissue-specific PPI networks, and then identified the genes with the most edges in these filtered networks. Such genes are candidate risk genes since they are functionally similar to many other candidate or known risk genes.
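The filtering step described above can be sketched as follows: keep only PPI edges whose two endpoints are both in the top 1% of the ranking, then order genes by their number of remaining edges. The function name, gene identifiers and edge list are hypothetical, and the sketch assumes the edge list contains no duplicate edges.

```python
from collections import Counter

def hub_genes(ppi_edges, ranking, top_fraction=0.01):
    """Rank top-fraction genes by degree in the PPI network restricted to them."""
    top_n = max(1, int(len(ranking) * top_fraction))
    keep = set(ranking[:top_n])
    degree = Counter()
    for a, b in ppi_edges:
        if a in keep and b in keep and a != b:   # drop edges leaving the top set
            degree[a] += 1
            degree[b] += 1
    return degree.most_common()

# Toy ranking of 300 hypothetical genes, so the top 1% is G0, G1 and G2.
ranking = [f"G{i}" for i in range(300)]
ppi_edges = [("G0", "G1"), ("G0", "G2"), ("G1", "G250"), ("G2", "G299")]

hubs = hub_genes(ppi_edges, ranking)
print(hubs)
# → [('G0', 2), ('G1', 1), ('G2', 1)]
```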

2.6 TADA Framework

He et al. [1] propose a statistical framework called Transmission and De Novo Association (TADA) that uses transmitted and de novo gene disrupting mutation data to obtain an ASD gene risk P-value for each gene. The framework uses a hierarchical Bayesian model, which pools all mutation information to improve overall prediction while keeping scores gene specific. The model assumes that some genes are disorder-related and the others are not; it learns the parameters of each group by likelihood maximization and then calculates the score [1]. This approach improves on previous ASD gene risk prediction power and forms the basis of our mutation burden dataset.

The TADA framework was later extended with more trios and more data to improve ASD gene risk prediction performance [15]. This new dataset (extTADA) is the most comprehensive to date, and it is the dataset we used for our study. In addition to the mutation counts for control and case groups, the dataset also includes pLI, a measure of the intolerance of genes to mutations [52]. pLI values vary between 0 and 1. A value of 1 indicates that the gene is totally intolerant to mutations, i.e., a person with such a mutated gene cannot survive. Conversely, a pLI value of 0 indicates that a gene is totally tolerant to mutations, and hence different versions of such a gene exist in the population. Our aim is for the model to learn the relationships between these values (mutation counts and pLI) and ASD gene risk association. Thus, we feed all these variables to our model instead of simply using TADA scores.

Chapter 3

Related Work

In this chapter, we discuss and explain related methods in the field.

3.1 DAWN

Detecting Association with Networks (DAWN) is a network-driven, Hidden Markov Random Field (HMRF) based ASD gene risk detection algorithm [25, 53]. The DAWN algorithm consists of two main parts: Partial Neighbourhood Selection (PNS) and HMRF parameter estimation. It uses TADA scores for ASD as well as gene co-expression networks extracted from the BrainSpan transcriptome dataset [48]. Among 4 brain regions and 13 neurodevelopmental windows, the authors chose two networks: the prefrontal cortex (PFC) region in the early fetal to early mid-fetal window (3-5), and the PFC region in the early mid-fetal to late mid-fetal window (4-6), since these networks are labeled as the most critical networks for ASD by Willsey et al. [27]. Since the algorithm is designed to operate on a single network, they report results for each of these networks separately.

The PNS step of the algorithm eliminates nodes with P-values higher than a threshold. Then, it applies a threshold based on the correlation between nodes. The first-order neighbors of the selected nodes are added, and finally the network is constructed using a special regression-based approach. This partial network concentrates on the known ASD risk genes along with their first-order neighbors. The parameter estimation part of the algorithm assumes two clusters: the first consists of the risk genes and the second of all other genes. Assuming these clusters follow a Gaussian mixture distribution, the model iteratively constructs the two clusters. The final version of the first cluster contains the risk genes, ranked by their FDR.

Although the risk genes selected by DAWN are shown to be associated with ASD, the method lacks a few features. DAWN cannot utilize more than one network, and it cannot provide a genome-wide ranking since it can only rank the selected genes.

3.2 Genome-wide Ranking by SVM-based Classifier

Krishnan et al. [26] propose an SVM-based method for ASD gene risk assessment. They obtain a large gene interaction dataset by combining thousands of large genomics datasets using a naive Bayes classifier. This dataset contains gene expression, molecular function, pathway and various other genomics data. They run the SVM ten times with 5-fold cross-validation and obtain genome-wide prediction values. While training, they assign evidence weights to the input genes. These evidence weights are calculated using information obtained from multiple sources. The highest-evidence genes (E1) are curated from category 1 and 2 genes of the Simons Foundation Autism Research Initiative (SFARI) [54] and genes from Online Mendelian Inheritance in Man (OMIM) [55]. Second evidence level genes (E2) are curated from category 3 SFARI genes. Third evidence level genes (E3) are obtained from the HUGE [56] and GAD [57] databases. Finally, E4 genes are collected from Gene2MeSH (http://gene2mesh.ncibi.org), SFARI category 4 genes and the DGA [58] database. In their model, E1 genes are given a weight of 1.0, E2 genes a weight of 0.5, and E3 and E4 genes a weight of 0.25. Since there are more E3 and E4 genes than E1 and E2 genes combined, it is important to reduce the overall importance of E3 and E4 genes because they contain a considerable amount of noise. We use the same evidence weights in DeepASD since we use the same ground truth set curated by Krishnan et al.

An important analysis performed by Krishnan et al. is the spatio-temporal enrichment of brain regions and neurodevelopmental periods. Using the gene expression values for each brain region and each neurodevelopmental time period, they obtained a heatmap showing the important brain regions and neurodevelopmental periods. According to their findings, almost all brain regions are active in terms of gene expression during the early fetal to late mid-fetal periods, which indicates that genes responsible for neurodevelopmental processes are more active during these periods. In this study, we perform a similar analysis by letting our model learn a weight for each network during the training process. The explanation and discussion of that analysis can be found in Chapter 5 (Results).

3.3 DAMAGES Score

The DAMAGES method uses gene expression data from central nervous system cells of 24 mice, coming from 6 brain regions [59]. 76% of the genes within the database are orthologous to human genes. Likely gene disrupting (LGD) mutation data are used to calculate the DAMAGES score in a 2-step process. First, PCA is performed on log2-transformed expression intensity values. Then, LASSO regularization is applied to the resulting principal components with respect to mutation sources (probands versus siblings) to evaluate the contribution of each principal component. After this process, a DAMAGES score indicating ASD risk is obtained for each gene. The score is then converted to an ensemble score by applying logistic regression on (i) the combined dataset of ExAC metrics (pLI and mis-Z) and (ii) the original DAMAGES scores. In this study, instead of creating a network out of gene expression data, the authors used the gene expression data itself. This study provides improved results compared to Krishnan et al. and DAWN.

3.4 ST-Steiner

ST-Steiner is a method based on prize-collecting Steiner trees [20]. Similar to other studies in this area, ST-Steiner also uses gene co-expression networks. The algorithm picks the optimum tree that maximizes the profit, where each node has a prize based on its TADA q-value and each edge has a penalty (cost) inversely correlated with the co-expression value of the two nodes it connects. ST-Steiner optimizes the following objective.

$$o_F(F) = \sum_{e \in E_F} c(e) + \beta \sum_{v \notin V_F} p(v) + \omega \kappa_F, \qquad \beta \ge 0,\ \omega \ge 0 \tag{3.1}$$

In this equation, β is a multiplier for the node prizes, c(e) is the cost function for edges, p(v) is the prize function, and κF denotes the number of connected subgraphs (trees). Thus, this algorithm can find more than one tree. In its basic form, it operates on a single gene interaction network. However, it can also model the spatio-temporal development of the brain by incorporating multiple networks as a spatio-temporal cascade: the results of the algorithm on networks that belong to earlier time windows are used to boost the node prizes of the genes in the following (later) time windows, so the algorithm has a higher chance of selecting these nodes (genes). Nevertheless, adjusting the output tree size is time-consuming, and the method cannot provide a genome-wide ranking since it only selects a subset of genes as candidate risk genes.
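To make Equation 3.1 concrete, the following minimal sketch evaluates the objective for a candidate forest. It is an illustration under stated assumptions, not the ST-Steiner implementation: the function names, the toy costs and prizes, and the representation of a forest as a plain edge list (so single-node trees are not representable) are all hypothetical.

```python
def count_trees(edges):
    """Count connected components among the nodes touched by `edges` (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in list(parent)})


def steiner_objective(forest_edges, all_nodes, edge_cost, node_prize, beta, omega):
    """Edge costs + beta * prizes of unselected nodes + omega * number of trees."""
    selected = {v for edge in forest_edges for v in edge}
    edge_term = sum(edge_cost[frozenset(edge)] for edge in forest_edges)
    missed_prizes = sum(node_prize[v] for v in all_nodes if v not in selected)
    return edge_term + beta * missed_prizes + omega * count_trees(forest_edges)


# Toy example (hypothetical values): two trees, one unselected gene "F".
example_edges = [("A", "B"), ("B", "C"), ("D", "E")]
example_nodes = ["A", "B", "C", "D", "E", "F"]
example_cost = {
    frozenset(("A", "B")): 0.2,
    frozenset(("B", "C")): 0.3,
    frozenset(("D", "E")): 0.1,
}
example_prize = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0, "E": 1.0, "F": 1.5}
example_value = steiner_objective(
    example_edges, example_nodes, example_cost, example_prize, beta=2.0, omega=0.5
)
```

In this reading, leaving a high-prize gene out of the forest is penalized through the second term, which is why boosting prizes from earlier time windows makes those genes more likely to be selected later.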

Chapter 4

Methods

4.1 Construction of Gene Co-Expression Networks

Using the BrainSpan data published by the Allen Brain Atlas [48, 60], we constructed 52 networks based on 4 brain region clusters and 13 neurodevelopmental windows. BrainSpan data contain samples from 57 postmortem brains and 16 brain regions. The ages of these samples range from 8 post-conception weeks (pcw) to 40 years, which is modeled using 13 neurodevelopmental windows constructed by applying a sliding window approach to 15 neurodevelopmental time periods. The sliding windows and the original time periods are given in Table 4.1. This spatio-temporal system of networks helps us model human brain development. We grouped the 16 brain regions into 4 clusters based on their transcriptional similarity, as proposed by Willsey et al. [27]. These 4 clusters are (i) PFC-MSC (prefrontal cortex and primary motor-somatosensory cortex), (ii) MDCBC (mediodorsal nucleus of the thalamus and cerebellum), (iii) V1C-STC (primary visual cortex and superior temporal cortex) and (iv) SHA (striatum, hippocampus and amygdala). The brain regions inside each cluster are given in Table 4.2.

Each of the 52 networks is constructed after filtering the samples according to their brain regions and ages. After filtering, each network is constructed by computing pairwise Pearson correlations between genes across the samples and then applying a threshold of r = 0.8 on the absolute correlation value. We use 25,825 nodes in our model, so each network must contain exactly 25,825 nodes. However, after thresholding, some nodes become isolated and would be removed. We add self-loops to these isolated nodes to retain the required node count. Each network is named after its brain region and time window. For instance, the MDCBC9-11 network is constructed using samples from the mediodorsal nucleus of the thalamus and the cerebellar cortex that fall within time periods 9 to 11 (inclusive).
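The construction steps above can be sketched on a toy expression matrix as follows. This is a minimal illustration rather than the pipeline used in the thesis: the function names and example values are hypothetical, the threshold is assumed to be inclusive (|r| >= 0.8), and the quadratic loop is only practical for a handful of genes.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def build_network(expression, threshold=0.8):
    """expression: dict mapping gene -> expression values over the filtered samples.

    Returns a set of edges; isolated genes receive a self-loop so that every
    gene stays in the network.
    """
    genes = sorted(expression)
    edges = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expression[g], expression[h])) >= threshold:
                edges.add((g, h))
    connected = {v for edge in edges for v in edge}
    for g in genes:
        if g not in connected:
            edges.add((g, g))  # self-loop keeps the node count fixed
    return edges


# Toy example (hypothetical values): g4 is uncorrelated with the other genes.
toy_expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, -1.0, 1.0, -1.0],
}
toy_edges = build_network(toy_expression)
```

Because the threshold is applied to the absolute value, strongly anti-correlated genes (such as g1 and g3 here) are connected as well.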

Table 4.1: Neurodevelopmental time periods proposed by Willsey et al. (2013). Since neurodevelopment mainly occurs during pregnancy, this period is divided at a finer resolution. The sliding window approach uses these 15 periods to create 13 windows. The window size is 3. Thus, the windows are constructed as [1-3], [2-4], [3-5], ..., [13-15].

Time period | Age interval
1 | Embryonic (4-8 pcw)
2 | Early fetal (8-10 pcw)
3 | Early fetal 2 (10-13 pcw)
4 | Early mid-fetal (13-16 pcw)
5 | Early mid-fetal 2 (16-19 pcw)
6 | Late mid-fetal (19-24 pcw)
7 | Late fetal (24-38 pcw)
8 | Neonatal & early infancy (0 - 6 months)
9 | Late infancy (6 - 12 months)
10 | Early childhood (1 - 6 years)
11 | Middle and late childhood (6 - 12 years)
12 | Adolescence (12 - 20 years)
13 | Young adulthood (20 - 40 years)
14 | Middle adulthood (40 - 60 years)
15 | Late adulthood (60+ years)
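The window construction and network naming can be enumerated with a few lines of code. This is a small sketch with hypothetical names; the region cluster labels and the naming pattern (cluster name followed by the first and last period of the window) follow the MDCBC9-11 example in the text.

```python
REGION_CLUSTERS = ["PFC-MSC", "MDCBC", "V1C-STC", "SHA"]


def sliding_windows(n_periods=15, size=3):
    """Return (first_period, last_period) pairs for windows of `size` consecutive periods."""
    return [(start, start + size - 1) for start in range(1, n_periods - size + 2)]


windows = sliding_windows()
network_names = [
    f"{cluster}{first}-{last}"
    for cluster in REGION_CLUSTERS
    for first, last in windows
]
```

With 15 periods and a window size of 3 this yields 13 windows, and combining them with the 4 region clusters gives the 52 network names.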

Table 4.2: Brain region clusters used to construct the 52 co-expression networks from BrainSpan data. These clusters are formed by hierarchical clustering based on the transcriptional similarity of the brain regions (Willsey et al., 2013).

Cluster name | Brain regions in cluster
PFC-MSC | M1C (primary motor cortex), S1C (primary somatosensory cortex), VFC (ventral prefrontal cortex), MFC (medial prefrontal cortex), DFC (dorsal prefrontal cortex), OFC (orbital prefrontal cortex)
MDCBC | MD (mediodorsal nucleus of the thalamus), CBC (cerebellar cortex)
V1C-STC | V1C (primary visual cortex), ITC (inferior temporal cortex), IPC (posterior inferior parietal cortex), A1C (primary auditory cortex), STC (superior temporal cortex)
SHA | S (striatum), H (hippocampus), A (amygdala)

4.2 Ground Truth Gene Sets

DeepASD works in a semi-supervised setting. A subset of genes with various evidence levels is used as the ground truth. There are 1,823 labeled genes out of 25,825 genes. Using the networks and mutation burden (TADA) data, a score is calculated for each gene so that we can obtain a genome-wide ranking. We use the same labels (both positive and negative) provided by Krishnan et al. (2016) to ensure a fair comparison.

Our positive ground truth set consists of 638 genes with 4 different evidence levels (E1, E2, E3 and E4). These genes are collected from SFARI [54], OMIM [55], DGA [58], GAD [57], HUGE [56] and Gene2Mesh (http://gene2mesh.ncbi.org/). E1 genes have the highest evidence, and evidence levels decrease towards E4 genes. There are 46 E1 genes (evidence weight 1.0) provided in Table A.1, 67 E2 genes (evidence weight 0.5) provided in Table A.2 and 525 E3 and E4 genes (evidence weight 0.25) provided in Table A.3. Similarly, there are 1,185 non-mental-health related genes with a negative label, provided in Table A.4 and Table A.5. The evidence weight for all non-mental-health related genes is 1.0, the same as for E1 genes. Performance metrics are calculated using E1 and non-mental-health related genes, as is also done by Krishnan et al. (2016).

4.3 DeepASD Model

Parts of the DeepASD model and its training process are described in the following sections. The DeepASD model is illustrated in Figure 4.1.

4.3.1 Graph Convolutional Network

The field of computer vision reached a milestone when convolutional neural networks (CNNs) were first proposed [61]. Exploiting the spatial information within images using filters (kernels) proved more successful than traditional computer vision techniques. Applying CNNs directly to graphs is not possible due to the irregular structure of graphs. Thus, spectral and spatial graph convolution algorithms were proposed to apply the same concepts to graphs, and they proved successful [62, 63, 64]. Nevertheless, these algorithms were slow. Kipf and Welling proposed the graph convolutional network (GCN) algorithm, which approximates convolutional filters using the Chebyshev expansion of the graph Laplacian [65, 66]. This algorithm uses message passing to perform convolution over first-order neighbors. GCN learns a weight for each of its layer neurons. Graph convolution uses the formula below:

$$H_k[i] = \sigma\left(\hat{D}^{-0.5}\,\hat{E}\,\hat{D}^{-0.5}\,H_{k-1}[i]\,W_{k-1}\right) \tag{4.1}$$

In the equation above, $H_k[i]$ denotes the output of a single graph convolution operation at layer k for gene i, $\hat{D}$ denotes the diagonal degree matrix, $\hat{E}$ is the adjacency matrix with self-loops (normalized on both sides by $\hat{D}^{-0.5}$), $W_{k-1}$ is the trainable weight matrix, and σ is an activation function. We use the Rectified Linear Unit (ReLU) as our activation function. Kipf and Welling reported that cascading 2 GCN layers yields the best results, and hence we do the same. At the end of the second layer, we use Softmax instead of ReLU to obtain probability values. For DeepASD, the first layer has d1 = 4 hidden neurons and the final layer has d2 = 2 output neurons.

Each of the 52 networks is fed into its own 2-layer GCN module, yielding 52 probability values for each gene (one per network). Layer neuron counts are the same for all 52 GCN modules. The probability values produced by the GCN modules are then combined using a mixture of experts model.
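As an illustration of Equation 4.1, the sketch below runs a 2-layer GCN forward pass (ReLU, then Softmax) on a tiny dense graph using plain Python lists. It is not the DeepASD code: the helper names, the toy graph, the feature values and the weight values are hypothetical, and a practical implementation would rely on sparse tensor operations in a deep learning framework.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^-0.5 (A + I) D^-0.5 for a dense 0/1 adjacency matrix.

    The diagonal is set to 1, so an existing self-loop is not counted twice.
    """
    n = len(adj)
    a_hat = [[1.0 if i == j else float(adj[i][j]) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m):
    out = []
    for row in m:
        shift = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(adj, features, w0, w1):
    """Two graph convolutions: ReLU after the first, Softmax after the second."""
    a_norm = normalize_adjacency(adj)
    hidden = relu(matmul(matmul(a_norm, features), w0))
    return softmax_rows(matmul(matmul(a_norm, hidden), w1))


# Toy example (hypothetical values): 3 genes on a path graph, 2 input features,
# d1 = 4 hidden units and d2 = 2 output classes as in the text.
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_w0 = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]
toy_w1 = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6], [-0.7, 0.8]]
toy_probs = gcn_forward(toy_adj, toy_features, toy_w0, toy_w1)
```

Each row of `toy_probs` is a two-class probability vector for one gene; in DeepASD, one such module would exist per co-expression network.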

4.3.2 Early Stopping

Early stopping is a widely used technique in deep learning that provides implicit regularization on some convex problems [67, 68]. DeepASD performs early stopping during cross-validation to prevent overfitting. As the complexity of a model increases, the model becomes more prone to overfitting. To avoid this, one generally uses regularization methods or performs early stopping. In DeepASD, early stopping is performed by monitoring the validation loss during training with a sliding window of size 7. If the validation loss fails to decrease throughout the entire window, training halts immediately.
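One possible reading of the sliding-window criterion is sketched below. It assumes that training should halt once the validation loss has stopped improving, i.e. no step within the last 7 recorded losses shows a decrease; the exact rule used in DeepASD may differ, and the function name is hypothetical.

```python
def should_stop(val_losses, window=7):
    """Return True if no decrease occurred within the last `window` validation losses."""
    if len(val_losses) < window:
        return False  # not enough history yet
    recent = val_losses[-window:]
    return all(later >= earlier for earlier, later in zip(recent, recent[1:]))


# Usage inside a (hypothetical) training loop:
#     history.append(validation_loss)
#     if should_stop(history):
#         break
```

A larger window makes the rule more tolerant of noisy validation curves at the cost of a few extra epochs.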

4.3.3 Weight Decay

Weight decay is a simple regularization method that shrinks the model weights by a given amount after each epoch [69]. The sum of the squares of all parameters is multiplied by the weight decay parameter and added to the loss, so that larger weight values are penalized. This forces the model to keep the weights as small as possible while optimizing the loss function. During backpropagation, the derivative of the weight decay term becomes 2 × wd × w, where wd is the weight decay parameter and w is the weight matrix, and it is subtracted from the previous weights. Thus, the parameters 'decay' over time. We use a weight decay value of 1e-4 for our model.
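The effect of the 2 × wd × w term can be seen in a single plain gradient descent update. This sketch is illustrative only: DeepASD uses the Adam optimizer, whereas the function below (a hypothetical name) applies an unadapted step with learning rate `lr`.

```python
def gradient_step_with_weight_decay(weights, grads, lr, wd=1e-4):
    """One gradient descent step on loss + wd * sum(w ** 2).

    The penalty contributes 2 * wd * w to the gradient of each weight.
    """
    return [w - lr * (g + 2.0 * wd * w) for w, g in zip(weights, grads)]


# With a zero data gradient, the weights shrink towards zero at every step.
decayed = gradient_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, wd=0.5)
```

Here both weights move 10% closer to zero, which is the "decay" the method is named after.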

4.3.4 Dropout Regularization

Dropout is another widely used regularization method. Its main idea is to drop random neurons during training [70]. The method uses a parameter p, which indicates the probability of dropping a neuron. Lower values of p make dropping rare, whereas higher values of p increase the dropping rate. It has been shown that using p = 0.2 for input neurons and p = 0.5 for hidden neurons generally gives the best results [70].

Randomly dropping neurons forces the model to operate with a different set of neurons each time. This way, we prevent a small number of neurons from becoming the main source of information, and the model becomes capable of performing with any subset of neurons. For DeepASD, we apply dropout between the first and second layers of the GCNs, and p = 0.5 is used for all GCN modules.
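A minimal sketch of dropout on a list of activations is given below. It assumes the common "inverted" variant, in which surviving activations are rescaled by 1 / (1 - p) during training so that no rescaling is needed at test time; the source does not state which variant is used, and the function name is hypothetical.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Zero each value with probability p; rescale survivors by 1 / (1 - p).

    Assumes 0 <= p < 1. At evaluation time (training=False) values pass through.
    """
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


# Example with a seeded generator so the run is repeatable.
hidden_after_dropout = dropout([1.0] * 1000, p=0.5, rng=random.Random(0))
```

With p = 0.5, roughly half of the activations are zeroed in each training pass, and the remaining ones are doubled.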

4.3.5 Mixture of Experts

Mixture of experts (MoE) is a method for combining multiple weak learners by dividing the feature space among them [71]. MoE consists of multiple learners and a gating network. In our case, the gating network is a single-layer neural network without any hidden layers, and it acts as a supervisor that assigns a part of the feature space to each learner. It takes the feature matrix X as input and outputs a weight vector $\vec{v}$, which contains one weight per weak learner. The gating network applies Softmax to its outputs so that the assigned weights sum to 1.

DeepASD uses MoE to combine the outputs produced by each of the 52 networks. It takes the output of each GCN module and multiplies it by the weight assigned by the gating network. Let $\vec{H}[i]$ be the output vector of all 52 networks for gene i, and let $\vec{v}[i] = \mathrm{MoE}(X[i])$ be the weight vector produced by the gating network for gene i. The final output of DeepASD for gene i then becomes $y[i] = \vec{H}[i] \cdot \vec{v}[i]$.
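The combination step can be sketched for a single gene as follows. The sketch assumes a linear gating layer with a bias term followed by Softmax; the bias, the function names and the toy numbers are assumptions for illustration, and only 3 experts are used instead of 52.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(features, gate_w, gate_b):
    """Single linear layer + Softmax: one weight per expert for this gene."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(gate_w, gate_b)]
    return softmax(logits)


def moe_output(expert_probs, features, gate_w, gate_b):
    """Dot product of the expert outputs H[i] and the gating weights v[i]."""
    weights = gate_weights(features, gate_w, gate_b)
    return sum(h * v for h, v in zip(expert_probs, weights))


# Toy example (hypothetical values): 3 experts, 2 features for one gene.
toy_expert_probs = [0.2, 0.4, 0.9]
toy_gene_features = [1.0, 2.0]
toy_gate_w = [[0.5, -0.1], [0.0, 0.3], [1.0, 0.2]]
toy_gate_b = [0.0, 0.1, -0.2]
toy_gate = gate_weights(toy_gene_features, toy_gate_w, toy_gate_b)
toy_risk = moe_output(toy_expert_probs, toy_gene_features, toy_gate_w, toy_gate_b)
```

Because the gating weights are non-negative and sum to 1, the combined output is a convex combination of the expert probabilities, and the learned weights indicate how much each network contributes.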

4.3.6 Cross-validation Setting

Cross-validation is a popular method used in machine learning for hyper-parameter optimization. We have used it to optimize the parameters of DeepASD as well. Additionally, we use cross-validation to report our results and to compensate for the low E1 gene count. DeepASD uses an adjusted version of k-fold cross-validation with k = 5. All labeled genes are divided uniformly at random among 5 folds. One fold is selected as the test fold, another fold is selected as the validation fold, and the rest form the training set. During an iteration, the model is trained until it meets the early stopping condition on the validation fold, and then we record the performance on the test fold. Next, the validation fold changes while the test fold stays fixed. This process is repeated until every fold except the fixed test fold has served as the validation fold. Then, the test fold changes and the whole process repeats until every fold has served as the test fold. With k folds, this process yields k × (k − 1) train/validation/test configurations, which is 20 for k = 5. We run this algorithm 10 times, and therefore obtain 200 results.
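The fold rotation described above can be enumerated with a short generator. This is a sketch with hypothetical names; the way genes are assigned to folds (shuffle, then take every k-th element) is an assumption, since the source only states that the division is uniform and random.

```python
import random


def nested_splits(labeled_genes, k=5, seed=0):
    """Yield (train, validation, test) gene lists for every ordered fold pair."""
    genes = list(labeled_genes)
    random.Random(seed).shuffle(genes)
    folds = [genes[i::k] for i in range(k)]
    for test_idx in range(k):
        for val_idx in range(k):
            if val_idx == test_idx:
                continue
            train = [g for j, fold in enumerate(folds)
                     if j not in (test_idx, val_idx) for g in fold]
            yield train, folds[val_idx], folds[test_idx]


# Toy example: 25 labeled gene indices.
example_splits = list(nested_splits(range(25)))
```

For k = 5 the generator yields 20 configurations; repeating the procedure with 10 different seeds would give the 200 results mentioned in the text.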

4.3.7 Optimizer and Loss Function

DeepASD uses the Adam optimizer for training. Adam is an adaptive moment estimation algorithm [72]. It was developed to combine the advantages of two other optimizers: Adagrad and RMSProp. Like Adagrad and RMSProp, Adam uses a separate learning rate for each parameter, and unlike them it uses running averages of both the first and second moments of the gradients. We use Adam with the default parameters provided by PyTorch. The learning rate we use in DeepASD is 7 × 10^-4.

Categorical cross-entropy loss with evidence weights is used in DeepASD since we have two classes and our task is classification. We also tried hinge loss, L1 and L2 regression losses, and focal loss; cross-entropy loss provided the best results among them [73].
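A sketch of an evidence-weighted cross-entropy for the two-class setting is given below. The evidence weights follow Section 4.2; normalizing by the sum of the weights is an assumption (a plain sum or mean would be equally plausible), and the names are hypothetical.

```python
import math

# Evidence weights as listed in Section 4.2; "negative" denotes the
# non-mental-health related genes.
EVIDENCE_WEIGHT = {"E1": 1.0, "E2": 0.5, "E3": 0.25, "E4": 0.25, "negative": 1.0}


def weighted_cross_entropy(probs, labels, levels):
    """probs: per-gene [P(class 0), P(class 1)]; labels: 0 or 1; levels: evidence keys."""
    total, weight_sum = 0.0, 0.0
    for p, y, level in zip(probs, labels, levels):
        w = EVIDENCE_WEIGHT[level]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum


# Toy example: one E1 gene (label 1), one E3 gene (label 1), one negative gene (label 0).
toy_loss = weighted_cross_entropy(
    [[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]],
    [1, 1, 0],
    ["E1", "E3", "negative"],
)
```

In this example, the poorly predicted E3 gene contributes only a quarter as much to the loss as it would with full weight, which reflects the lower confidence in its label.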

Figure 4.1: The architectural model of DeepASD for genome-wide ASD gene risk assessment. The model uses TADA features as well as 52 gene co-expression networks extracted from the BrainSpan dataset. The feature set includes de novo loss of function (and missense) mutation counts, transmitted mutation counts for control and case groups, pLI value, de novo mutation frequency values and protein truncating variant counts. The TADA dataset is used by all GCN modules and the gating network. The whole system produces a single probability value $\hat{y}$ for each of the 25,825 genes and is end-to-end trainable.

Chapter 5

Results

5.1 Comparison against the State-of-the-art Methods

[Figure 5.1 plot: box plots for Krishnan et al. and DeepASD; axes: area under ROC curve, area under precision-recall curve]

Figure 5.1: ROC and precision-recall curve distribution comparison between DeepASD and Krishnan et al. (a) Area under ROC curve distributions of the two methods. (b) Area under precision-recall curve distributions of the same methods as in (a). In both (a) and (b), outlier points are depicted. The solid center line depicts the median and the dashed line depicts the mean value for each panel. Box limits indicate the lower and upper quartiles, and whiskers denote 1.5 times the interquartile range.

We first compare DeepASD against the SVM-based genome-wide ranking algorithm of Krishnan et al. (2016), which is run 10 times with a 5-fold cross-validation setting, obtaining 50 values each for the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPRC) [26]. DeepASD uses a modified version of 5-fold cross-validation and is also run 10 times, yielding 200 results. DeepASD achieves a median AUC of 0.9127 and a mean AUC of 0.9113. The SVM-based method of Krishnan et al. achieves a median AUC of 0.8983 and a mean AUC of 0.8961. DeepASD thus improves the state-of-the-art median AUC by 1.4 percentage points. Similarly, DeepASD improves AUPRC by 26 percentage points, with a median AUPRC of 0.6304 (mean: 0.6396) compared to a median AUPRC of 0.3695 (mean: 0.377) for Krishnan et al. All of these performances are calculated using only E1 and non-mental-health related genes, as is also done by Krishnan et al. The comparison between DeepASD and Krishnan et al. is illustrated in Figure 5.1.

We also compare our method to the SVM-based algorithm of Krishnan et al. in terms of the distribution of probability values calculated by each algorithm. Since DeepASD uses Softmax layers, it produces probability values. The SVM approach of Krishnan et al., however, gives the distance to the decision boundary. As suggested in their paper, these distance values are converted to probability values by applying isotonic regression. Then, they detect the 'knots' (groups of points with the same value) and use a cubic Hermite spline to interpolate between these knots. This method gives a smooth transition and maps distance values to the [0, 1] range. Such a method can be applied to any result coming from an SVM. As an example, we converted SVM output values to probabilities for intellectual disability (ID) using isotonic regression twice. First, we applied ten-fold cross-validation on the original data to obtain knots for each fold. We observed that this does not yield a smooth curve with any interpolation method. Thus, we applied isotonic regression a second time. The resulting knots are connected via linear interpolation (Figure 5.2). In Figure 5.3, the probability values of DeepASD and Krishnan et al. are illustrated by scatter plots. Figure 5.3a depicts the probabilities assigned to E1 genes (red) versus all other genes (gray). Points above the x = y line favor DeepASD, meaning that DeepASD assigns higher probabilities to those genes. Among the 46 E1 genes, 33 are above the x = y line. This clearly shows that DeepASD assigns higher probabilities to E1 genes than the SVM method of Krishnan et al. The vertical bands are formed since the SVM-based algorithm of Krishnan et al. assigns a fixed probability to a group of genes due to the 'knots' of isotonic regression. In Figure 5.3b, non-mental-health related genes are highlighted against all other genes. In this case, points below the x = y line are favorable for DeepASD, since they indicate that Krishnan et al. assign a higher probability value to these negatively labeled genes than DeepASD does. 953 of the 1,185 non-mental-health related genes are below the x = y line, meaning that DeepASD assigns a lower probability of being a risk gene to these genes. These probability values are calculated by taking the mean over the 200 runs, excluding the results coming from the training set. Thus, we can conclude that the model learns to assign high probability to known ASD risk genes by looking at other similar known ASD risk genes, and the other way around for non-mental-health related genes.

Figure 5.2: Process of smoothing SVM output values. Using ten-fold cross-validation, 10 different isotonic regression models are fit. Then, we find the knots in each model and combine them to obtain (a). The transitions are not smooth and further smoothing is required. (b) Isotonic regression is applied once more on this knot collection to obtain a smoother curve. (c) Linear interpolation is used to obtain the final mapping.

Finally, we compare DeepASD with Krishnan et al., Zhang and Shen's DAMAGES score [59] and DAWN [25] (with PFC-MSC3-5 and PFC-MSC4-6) in terms of precision-recall (Figure 5.4). Since all these methods produce a final ranking, we can use the rankings to obtain precision-recall curves. Three plots are created based on the precision-recall values for E1 genes, E1 + E2 genes and all positive genes, respectively. For the first two settings, DeepASD has a better precision-recall plot. For the last one, the performance gain of DeepASD over the other methods degrades. This degradation stems from the fact that E3 and E4 genes are relatively low-evidence genes, and therefore they might not be good indicators for ASD etiology. In addition, the count of E3 + E4 genes is almost 5 times the count of E1 + E2 genes. Thus, they dominate the resulting precision-recall plot. Still, DeepASD manages to produce better precision-recall curves than the other methods when E1 and E2 genes are considered as the ground truth. To see whether the improvement of DeepASD is statistically significant, we provide P values for each method with each gold standard combination in Table 5.1. Table 5.1 verifies that DeepASD produces significantly better results when E1 and E2 genes are the ground truth genes, and that Krishnan et al. produce better results for all positive gold standard genes. DeepASD still yields improved results compared to the remaining state-of-the-art methods for all positive gold standard genes. In Figure 5.4, all plots are shown for the top 2,000 genes, since after that rank all recall values become 1 and precision values get close to 0. Similarly, we start plotting from the first rank except in Figure 5.4c, where the starting rank is set to 5 to obtain a smoother precision-recall curve and avoid zig-zags for recall values around 0.
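Precision-recall points can be derived from a ranking as sketched below, including the rank cutoff and starting rank mentioned above. This is an illustrative sketch with hypothetical names: it treats every ranked gene outside the positive set as a negative, which may differ from the exact evaluation protocol used for Figure 5.4.

```python
def precision_recall_points(ranking, positives, start_rank=1, cutoff=2000):
    """Return (recall, precision) pairs for ranks start_rank..cutoff of `ranking`."""
    positives = set(positives)
    hits = 0
    points = []
    for rank, gene in enumerate(ranking[:cutoff], start=1):
        if gene in positives:
            hits += 1
        if rank >= start_rank:
            points.append((hits / len(positives), hits / rank))
    return points


# Toy example: 4 ranked genes, 2 of which are gold-standard positives.
toy_points = precision_recall_points(["a", "b", "c", "d"], {"a", "c"})
toy_points_from_3 = precision_recall_points(["a", "b", "c", "d"], {"a", "c"}, start_rank=3)
```

Skipping the first few ranks removes the large precision jumps that occur when only one or two genes have been seen, which is the motivation for the starting rank of 5 in panel (c).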

Figure 5.3: Probability value scatter plot of DeepASD and Krishnan et al. (a) Probability values of E1 genes against all other genes for both methods. (b) Non-mental-health related genes compared to all other genes. Both panels contain the same 25,825 genes. The y = x line (gray) is also drawn as a visual aid.

Table 5.1: P value comparison for the rankings of DeepASD, Krishnan et al., DAWN and DAMAGES score using different sets of gold standard genes for evaluation. Each P value is calculated using a two-sided Wilcoxon rank-sum test.

Evidence set | DeepASD | Krishnan et al. | DAMAGES score | DAWN PFC-MSC3-5 | DAWN PFC-MSC4-6
E1 | 7.33 × 10^-22 | 5.07 × 10^-20 | 3.61 × 10^-16 | 2.23 × 10^-01 | 5.57 × 10^-01
E1+E2 | 3.7 × 10^-30 | 1.11 × 10^-25 | 4.75 × 10^-21 | 1.73 × 10^-01 | 2.56 × 10^-01
E1+E2+E3+E4 | 7.69 × 10^-37 | 2.97 × 10^-52 | 1.4 × 10^-13 | 1.82 × 10^-21 | 4.38 × 10^-17

[Figure 5.4 plot: precision (y-axis) versus recall (x-axis); panels: (a) E1 genes, (b) E1 + E2 genes, (c) E1 + E2 + E3 + E4 genes; legend: Krishnan et al., DAWN PFC-MSC4-6, DAWN PFC-MSC3-5, Zhang and Shen, DeepASD]

Figure 5.4: Precision-recall curves for DeepASD, DAWN (PFC-MSC3-5 and PFC-MSC4-6), Krishnan et al. and the DAMAGES score of Zhang and Shen. (a) The curve for E1 plus non-mental-health genes. (b) E1 + E2 genes are used. (c) All gold standard genes are used. All precision-recall curves have a cutoff rank of 2,000. (c) has a starting rank of 5, whereas (a) and (b) have a starting rank of 1.

5.2 Enrichment Analysis

We use the average (final) ranking produced by DeepASD to perform enrichment analyses on various biological pathways related to ASD, such as WNT (Table A.6) [74] and MAPK (Table A.7) [75] signaling; target gene lists of transcription regulators such as CHD8 (Table A.8, Table A.9 and Table A.10) [76], RBFOX (Table A.11, Table A.12 and Table A.13) [77, 78], FMRP (Table A.14 and Table A.15) [79, 80] and TOP1 (Table A.16) [35]; and biological processes and molecular functions such as the post-synaptic density complex (Table A.17, Table A.18 and Table A.19) [81, 82] and histone modifications (Table A.20 and Table A.21) [2, 83]. Although these gene sets are not the gold standard, they have been shown to be associated with the etiology. Thus, we aim to show that the DeepASD gene ranking is enriched in these gene sets. We divide the whole gene set into deciles (2,582 genes in each decile except the last) and perform enrichment analysis by calculating the overlap percentage of a particular gene set with each decile. Since the top-ranked genes are in the first decile, we expect a high overlap percentage for the first decile compared to the other deciles, and a decreasing overlap percentage towards the last decile.

Figure 5.5 shows the enrichment results for the specified gene sets. We observe that the first decile is highly enriched in all of these gene lists, which is also supported by the P values calculated using the binomial test. To calculate the P value, we state our null hypothesis as follows: the gene ranking is determined randomly, and therefore each decile is evenly enriched. Among all of the pathways, target genes and biological processes we inspected, the DeepASD ranking is enriched the most in FMRP targets, followed by post-synaptic density genes and CHD8 target genes, in terms of statistical significance. We also expected to see a decreasing enrichment pattern towards the last decile; however, such a pattern does not exist for any of the gene lists except FMRP target genes. This is because the remaining genes (after the first decile) are not ranked high by DeepASD: either these genes are not ASD related, or DeepASD cannot find sufficient evidence to assign a higher risk probability to them. Still, the high enrichment for all these sets shows that the ranking produced by DeepASD is sensible.
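The decile-based enrichment can be sketched as follows. The overlap counting follows the description above; the exact form of the binomial test is not specified in the text, so the sketch assumes a one-sided test in which each gene of the set falls into a given decile with probability 0.1 under the null hypothesis. All names are hypothetical.

```python
from math import comb


def decile_overlap(ranking, gene_set, n_deciles=10):
    """Count gene_set members in each decile; the last decile takes the remainder."""
    gene_set = set(gene_set)
    size = len(ranking) // n_deciles
    counts = []
    for d in range(n_deciles):
        start = d * size
        end = len(ranking) if d == n_deciles - 1 else start + size
        counts.append(sum(1 for g in ranking[start:end] if g in gene_set))
    return counts


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy example: 100 ranked genes, 5 set members, 4 of them in the first decile.
toy_ranking = [f"g{i}" for i in range(100)]
toy_set = {"g0", "g1", "g2", "g3", "g50"}
toy_counts = decile_overlap(toy_ranking, toy_set)
toy_p_value = binomial_upper_tail(toy_counts[0], sum(toy_counts), 0.1)
```

A small P value for the first decile indicates that more set members are ranked at the top than a random ranking would be expected to place there.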


Figure 5.5: Enrichment analysis for WNT and MAPK pathways, CHD8, RBFOX (splice and peak), FMRP and TOP1 target genes, post-synaptic density complex and histone modification processes. P values are calculated using Binomial test.

In addition to the enrichment analysis described above, we also take the top 1% (258) genes (Table A.22 and Table A.23) from the DeepASD ranking and check their enrichment against other pathways using Enrichr (https://amp.pharm.mssm.edu/Enrichr/). Enrichr reports that DeepASD's top 1% genes are highly enriched in the developmental biology, neuronal system, transmission across chemical synapses and axon guidance pathways of the BioPlanet 2019 dataset [84]. The BioPlanet 2019 dataset contains 560 pathways. Table 5.2 contains the top 29 pathways, obtained by keeping those with P < 10^-3. We can observe that the top genes predicted by DeepASD regulate the neurodevelopmental system and biological development. In addition, the circadian rhythm pathway is also enriched. Pinato et al. and Zuculo et al. suggest that ASD could cause dysregulations in the circadian cycle, which could cause sleep disorders that exacerbate ASD-associated behavioral disorders [85, 86]. The pathway named "RORA activates circadian expression" from the BioPlanet 2019 dataset is also enriched, which supports the relation between circadian rhythm disruption and ASD. A mutation in the top 258 genes predicted by DeepASD could cause a disruption in the nerve growth factor (NGF) signaling pathway. NGFs are responsible for neuron growth and maintenance, and therefore they have a crucial role in human neurodevelopment. Consequently, most of the pathways in Table 5.2 are related to neurodevelopment and the nervous system. For comparison, we can check the pathways with the highest P values. The bottom 5 pathways with the highest P values are as follows: Olfactory transduction, , GPCR ligand binding, class A GPCRs (rhodopsin-like) and carbohydrate metabolism. None of these pathways are known to be associated with ASD, which supports the view that the pathways with the lowest P values are good candidates for further research on ASD to reveal more about its complex nature.

Table 5.2: DeepASD top 1% gene enrichment in BioPlanet 2019 dataset. Top 29 pathways are included out of 560 pathways. Enrichr tool is used to calculate enrichment values. Pathways are sorted in increasing order with respect to P values.

| Pathway | Overlap | P-value | Odds Ratio |
|---|---|---|---|
| Developmental biology | 27/420 | 8.21E-12 | 4.96 |
| Neuronal system | 22/283 | 2.01E-11 | 6.00 |
| Transmission across chemical synapses | 17/190 | 4.80E-10 | 6.91 |
| Axon guidance | 18/325 | 2.67E-7 | 4.28 |
| Circadian rhythm | 8/62 | 1.31E-6 | 9.96 |
| Multi-step regulation of transcription by PITX2 | 6/28 | 1.32E-6 | 16.6 |
| Interaction between L1-type proteins and ankyrins | 5/26 | 1.85E-5 | 14.9 |
| Signaling by NGF | 12/221 | 3.30E-5 | 4.19 |
| Long-term potentiation | 7/70 | 3.37E-5 | 7.72 |
| Presenilin action in Notch and Wnt signaling | 6/48 | 3.46E-5 | 9.65 |
| Integration of energy metabolism | 9/125 | 3.69E-5 | 5.56 |
| CRMPs in Sema3A signaling | 4/16 | 4.42E-5 | 19.3 |
| NGF signaling via TRKA from the plasma membrane | 9/136 | 7.14E-5 | 5.11 |
| Lipid metabolism regulation by peroxisome proliferator-activated receptor alpha (PPAR-alpha) | 8/112 | 1.06E-4 | 5.52 |
| L1CAM interactions | 7/94 | 2.21E-4 | 5.75 |
| Wnt signaling pathway | 11/231 | 2.23E-4 | 3.68 |
| YAP1- and WWTR1 (TAZ)-stimulated gene expression | 4/26 | 3.28E-4 | 11.88 |
| Energy metabolism | 5/48 | 3.81E-4 | 8.04 |
| RORA activates circadian expression | 4/27 | 3.81E-4 | 11.4 |
| Control of gene expression by vitamin D receptor | 4/27 | 3.81E-4 | 11.4 |
| GABA (A) receptor activation | 3/12 | 4.33E-4 | 19.3 |
| Opening of calcium channels triggered by depolarization of the presynaptic terminal | 3/12 | 4.33E-4 | 19.3 |
| Transcriptional regulation of white adipocyte differentiation | 6/77 | 4.89E-4 | 6.02 |
| Reelin signaling pathway | 4/29 | 5.06E-4 | 10.7 |
| CDO in myogenesis | 4/29 | 5.06E-4 | 10.7 |
| Inactivation of GSK3 by Akt causes accumulation of beta-catenin in alveolar macrophages | 4/30 | 5.78E-4 | 10.3 |
| GABA A and B receptor activation | 5/53 | 6.06E-4 | 7.29 |
| SMAD2/3 nuclear pathway | 6/82 | 6.85E-4 | 5.65 |
| Calcium regulation in the cardiac cell | 8/149 | 7.35E-4 | 4.15 |

5.3 CNV Region Analysis

Certain copy number variants (CNVs) are associated with ASD. These CNVs are large regions that contain multiple genes. We inspect 5 CNVs that are particularly associated with ASD: 16p11.2, 15q11-13, 15q13.3, 1q21.1 and 22q11 [26]. In this section, we consider novel genes that are within these CNVs and are (i) not labeled or labeled with weak confidence, (ii) assigned a high posterior probability by DeepASD and (iii) assigned a relatively lower posterior probability by other methods. Genes with evidence levels E1 and E2 are expected to be assigned a higher posterior probability since they also have strong prior information. Such genes are excluded from this analysis. Genes spanned by these CNVs are likely risk genes, but they are not necessarily ASD related. Gene ranks for each CNV are given in Figure 5.6, Figure 5.7 and Figure 5.8.
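The three selection criteria above can be expressed as a simple filter. The sketch below is illustrative only: the `Gene` record, the `is_cnv_candidate` function and the rank cutoff of 1000 are hypothetical, and criterion (iii) is approximated by requiring every other method to rank the gene lower than DeepASD does. The ranks for NIPA1, SLC25A1 and CORO1A are the ones quoted in this section; the last gene is a made-up counterexample.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gene:
    symbol: str
    label: Optional[str]          # "E1", "E2", "E3-E4", "Neg." or None if unlabeled
    deepasd_rank: int
    other_ranks: List[int] = field(default_factory=list)  # ranks from other methods


def is_cnv_candidate(gene: Gene, top_rank: int = 1000) -> bool:
    """Apply criteria (i)-(iii); `top_rank` is an assumed cutoff, not from the thesis."""
    not_strongly_labeled = gene.label not in ("E1", "E2")           # (i)
    ranked_high = gene.deepasd_rank <= top_rank                      # (ii)
    lower_elsewhere = all(r > gene.deepasd_rank for r in gene.other_ranks)  # (iii)
    return not_strongly_labeled and ranked_high and lower_elsewhere


genes = [
    Gene("NIPA1", "Neg.", 619, [5185, 3267]),
    Gene("SLC25A1", None, 430, [2643, 8333, 12689]),
    Gene("CORO1A", None, 293, [19729, 3745, 8863, 408]),
    Gene("HYPOTHETICAL_E1", "E1", 10, [50, 80]),  # made-up gene, excluded by (i)
]

candidates = [g.symbol for g in genes if is_cnv_candidate(g)]
print(candidates)
```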

The NIPA1 gene, located in 15q11-13, is ranked 619th although it is labeled as non-mental-health related. DAWN does not have a ranking for this gene since it is not co-expressed inside the network used by the DAWN algorithm. The DAMAGES score rank of NIPA1 is 5185 and Krishnan et al. rank it 3267th. The TADA q-value for NIPA1 is 0.718, which indicates no significant association. This shows that DeepASD can differentiate important genes even if they have a negative label and/or no significant prior knowledge. NIPA1 is not involved in any of the pathways or target gene lists discussed in section 5.2. According to the NCBI entry of NIPA1, it encodes a magnesium transporter in various epithelial and neuronal cells (https://www.ncbi.nlm.nih.gov/gene/123606). This gene is believed to play a role in neuron development and maintenance. Therefore, disruption of this gene may damage neurodevelopment and cause ASD as a result.

SLC25A1 is another gene of interest in 22q11 since it is not labeled and is ranked 430th by DeepASD. SLC25A1 is ranked 2643rd, 8333rd and 12689th by Krishnan et al., DAWN PFC-MSC4-6 and DAMAGES score, respectively. DAWN PFC-MSC3-5 does not have a ranking for SLC25A1. SLC25A1 has a TADA q-value of 0.1819 and it is in the post-synaptic density complex. The SLC25A1 protein regulates the movement of citrate inside mitochondria, and it is indicated that the dysregulation of mitochondrial metabolism may lead to neurodevelopmental issues [87]. Therefore, mutations of this gene might be important for ASD symptomatology.

The CORO1A gene, which is contained within 16p11.2, is another candidate ASD risk gene and is ranked 293rd by DeepASD. CORO1A is ranked 19729th, 3745th, 8863rd and 408th by Krishnan et al., DAWN (with both networks) and DAMAGES score, respectively. CORO1A is unlabeled and its TADA q-value is 0.0922. Its prior information is relatively significant and it is in the post-synaptic density complex gene list. According to SFARI, CORO1A is a strong-evidence ASD gene due to two detected de novo gene-disrupting mutations in CORO1A, and it is labeled as a risk gene by SFARI with 0.05 < FDR < 0.1. Our results, along with the DAMAGES score ranking, also support the scoring provided by SFARI for this gene.

[Figure 5.6 image: two panels, 16p11.2 and 15q11-13. Each panel lists the genes spanned by the CNV, annotated with evidence level where labeled (E1, E2, E3-E4 or Neg.), against DeepASD rank on the horizontal axis (0 to 20k).]

Figure 5.6: The genes spanned by 16p11.2 and 15q11-13 CNVs. For each gene, DeepASD rank is provided along with its evidence level if it is labeled.

[Figure 5.7 image: two panels, 15q13.3 and 1q21.1. Each panel lists the genes spanned by the CNV, annotated with evidence level where labeled (E2 or Neg.), against DeepASD rank on the horizontal axis (5k to 25k).]

Figure 5.7: The genes spanned by 15q13.3 and 1q21.1 CNVs. For each gene, DeepASD rank is provided along with its evidence level if it is labeled.

[Figure 5.8 image: one panel, 22q11, listing the genes spanned by the CNV, annotated with evidence level where labeled (E3-E4), against DeepASD rank on the horizontal axis (0 to 20k).]

Figure 5.8: The genes spanned by 22q11 CNV. For each gene, DeepASD rank is provided along with its evidence level if it is labeled.

5.4 Neurodevelopmental Period Analysis

ASD is generally diagnosed after the 24th month although the symptoms start to appear between 12 and 18 months [88]. Yet, disruption of the neurodevelopmental process is the main cause of ASD, and this disruption starts during the early fetal period (postconception week 8). Therefore, it is crucial to identify when and where the disruption happens. Using brain co-expression data, Willsey et al. [27] identified the PFC-MSC3-5 (prefrontal cortex, early fetal 2 to early mid-fetal 2) window as the most important neurodevelopmental period and PFC-MSC as the most important brain region for ASD etiology. Similarly, Krishnan et al. state that early neurodevelopmental periods until the late mid-fetal period are significant. These studies determine significance by checking co-expression values of the original data. We, however, use a different approach. Using a mixture of experts model, we obtain the weights learned by the gating network for each gene and co-expression network, as well as the probability values assigned by each network. This way, we can visualize both the weights assigned by the gating network and the average probability values assigned by the model to each network. With this approach, we allow the model to learn the importance of each network instead of assessing each network independently for significance.

For our model, we use 52 networks from 4 brain regions (PFC-MSC, MDCBC, V1C-STC and SHA) and 13 neurodevelopmental periods as described in section 4.1 (Chapter 4, Methods). DeepASD assigns the highest probability for E1 genes to the MDCBC9-11 network, followed by MDCBC8-10 and MDCBC10-12 (Figure 5.9). It is important to note that these probabilities are generated by each GCN module (co-expression network) right before they are multiplied by the weights calculated by the gating network. Willsey et al. marked MDCBC8-10 as the second most important network after PFC-MSC3-5. However, DeepASD cannot detect a powerful signal for the PFC-MSC3-5 and PFC-MSC4-6 networks. Instead, DeepASD prioritizes PFC-MSC8-10, PFC-MSC9-11 and PFC-MSC10-12, which is also consistent with Willsey et al. Additionally, the gating network assigns weights parallel to the probabilities produced by each network (Figure 5.10). The gating network prioritizes MDCBC between periods 5 and 12 (early mid-fetal to adolescence). The signals coming from PFC-MSC9-11, SHA5-7 and SHA9-11 are of medium strength. Consequently, DeepASD learns that the 9-11 period (late infancy to late childhood) is the most important period and that MDCBC (mediodorsal nucleus of the thalamus and cerebellum) is the most important brain region for ASD etiology.
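The per-network values discussed above can be arranged as a 4 × 13 grid (brain region by window) before plotting them as a heatmap. The following is a minimal sketch under stated assumptions: window names are built as `<region><start>-<start+2>`, following names such as MDCBC9-11 used in the text, and the numeric values are placeholders rather than results of the model.

```python
REGIONS = ["PFC-MSC", "MDCBC", "V1C-STC", "SHA"]
N_WINDOWS = 13  # windows per region, each assumed to span three consecutive periods


def network_name(region, start):
    """Build a network name such as 'MDCBC9-11' (naming scheme assumed from the text)."""
    return f"{region}{start}-{start + 2}"


def to_grid(values):
    """Arrange a {network name: value} mapping as a 4 x 13 list of rows (one row per region)."""
    return [
        [values[network_name(region, start)] for start in range(1, N_WINDOWS + 1)]
        for region in REGIONS
    ]


def most_informative(values):
    """Return the network name with the largest value."""
    return max(values, key=values.get)


# Placeholder values only: a flat background with one raised cell.
values = {
    network_name(region, start): 0.1
    for region in REGIONS
    for start in range(1, N_WINDOWS + 1)
}
values["MDCBC9-11"] = 0.9

grid = to_grid(values)
print(len(grid), "x", len(grid[0]), "grid; top network:", most_informative(values))
```

The same grid layout can hold either the average posterior probabilities of the GCN modules or the gating weights, so the two heatmaps are directly comparable cell by cell.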

Figure 5.9: The heatmap for average posterior probabilities assigned by each GCN module of DeepASD to E1 genes. The heatmap shows 4 brain regions with 13 neurodevelopmental windows to illustrate average posterior probability for all 52 co-expression networks.

Figure 5.10: Heatmap illustrating the weights assigned by gating network to each GCN module (network). 52 networks are plotted as 4 × 13 matrix for each brain region and for each neurodevelopmental period.

5.5 Evaluation of Edge Case Predictions

In this section, we discuss several edge case predictions made by DeepASD. We consider these edge cases in 3 groups: (i) non-mental-health related genes that are ranked within the top 1%, (ii) unlabeled genes with low prior evidence (high TADA q-value) that are ranked within the top 1% and (iii) E1 genes that are ranked low. We consider the two most extreme genes in each group and discuss them together with findings from the literature. It is important to understand how DeepASD ranks genes since there are genes such as SLC9A9 (E1 evidence) that are ranked near the bottom of the list (24,620th). In the context of ASD, it is worthwhile to question even the genes with the highest confidence, as they might not be directly related to ASD due to the lack of complete knowledge. Thus, we consider these edge cases to propose candidate genes and to question the existing labels. Since DeepASD uses GCNs at its core, genes pass messages among each other, and by using 2-layer GCN modules, we effectively perform 2-hop message passing throughout the whole network. Therefore, the numbers of E1, E2, E3-E4 and negative genes, and of unlabeled genes with rank < 1000 in the TADA q-value ranking (referred to as TADA neighbors for the rest of this section), within 2-hop distance are calculated for these edge case genes. We only consider the MDCBC8-10 network since it is selected as one of the most informative networks by the gating network. Table 5.3 shows the calculated values for each edge case gene.
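Counting labeled genes within 2-hop distance amounts to a breadth-first search that stops after two steps. The sketch below illustrates this on a toy graph; the function name, the adjacency-dictionary representation and the node names are hypothetical, and the TADA-neighbor category is not modeled (unlabeled genes are simply counted as "unlabeled").

```python
from collections import Counter, deque


def two_hop_label_counts(adjacency, labels, source, max_hops=2):
    """Count labels of all nodes reachable from `source` within `max_hops` edges."""
    seen = {source}
    queue = deque([(source, 0)])
    counts = Counter()
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                counts[labels.get(neighbor, "unlabeled")] += 1
                queue.append((neighbor, dist + 1))
    return counts


# Toy undirected graph: A-B, B-C, C-D, A-E (node names are placeholders).
adjacency = {
    "A": ["B", "E"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["A"],
}
labels = {"B": "E1", "C": "E2", "D": "Negative"}  # E has no label

counts = two_hop_label_counts(adjacency, labels, "A")
print(dict(counts))  # D is three hops away from A and is therefore not counted
```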

In the first group, we consider the GIGYF2 and BRAF genes. They are labeled as non-mental-health related, but their DeepASD ranks are 52 and 180, respectively. Their TADA ranks are 1466 and 1646, but since their E1, E2 and TADA neighbor counts are similar to the neighbor counts of CHD8, they are ranked higher by DeepASD. In fact, both of these genes are marked as high-confidence ASD genes by SFARI (https://gene.sfari.org/, accessed 19 August 2020). GIGYF2 and BRAF are ranked 2264th and 11033rd, respectively, by Krishnan et al. Considering our ranking, the network topology of these genes and SFARI's classification, we believe that these genes could be marked as E1 genes instead of non-mental-health related genes. This example shows that DeepASD can make predictions that contradict the label of a gene if the model finds sufficient evidence. Therefore, DeepASD can detect false positive and false negative labels, as we demonstrate in this case.

The HIVEP2 and UBE2O genes are in group 2 and are considered edge cases since their prior information is weak. Their TADA q-values would put them towards the end of the ranking scale; however, they are ranked 165th and 197th by DeepASD, respectively. Such a ranking is only achievable by having E1 and E2 neighbors in close proximity. Their labeled 2-hop neighbors are similar to those of CHD8 in terms of their counts, which explains these scores. Having many positively labeled neighbors means being co-expressed with these positive genes; therefore, it is a good indicator for being an ASD risk gene. HIVEP2 is considered to be a risk gene for intellectual disability (ID), which is a disorder comorbid with ASD (https://ghr.nlm.nih.gov/condition/hivep2-related-intellectual-disability). HIVEP2 is also an FMRP (fragile X mental retardation protein) target, which supports the claim. It is known that some portion of the ASD risk genes in fact overlap with ID risk genes [89]. Therefore, HIVEP2 could be a culprit gene for ASD risk. Similarly, UBE2O is also an FMRP target, which makes both of these genes risk gene candidates considering their co-expression pattern and ranking.

SLC9A9 and CACNA1H are the two E1 genes that are ranked the lowest by DeepASD (24,620 and 21,032, respectively). These rankings stem from the fact that both of these genes have high TADA q-values and their neighbors are not as strong as those of CHD8 or similar E1 genes. SLC9A9 has a better TADA q-value than CACNA1H, but its neighbors carry a weaker signal. For CACNA1H, the opposite holds. SLC9A9 is a sodium/hydrogen exchange protein with reported de novo mutations in some cohorts [90, 54]. SLC9A9 is highly enriched in the spinal cord, which makes it an unlikely ASD risk gene [91]. Similarly, CACNA1H is highly enriched in the basal ganglia and pituitary gland in terms of RNA expression levels, which also makes it a controversial ASD risk gene [91]. These genes are marked with scores 2 and 3 (strong to medium evidence) by SFARI. Although the mutational evidence points otherwise, these two genes may be false positives.

Table 5.3: Edge case genes with their 2-hop distance neighbor statistics. 2 genes per group described in section 5.5 are provided. CHD8 is also added to table as a reference since it is ranked as 1st by TADA 3rd by DeepASD, and it is a well-established ASD risk gene. TADA rank describes the ranking of a gene in TADA with respect to its q-value. E1, E2, E3/E4 and negative genes describe the count of such genes in 2-hop neighborhood. TADA genes column is the count of unlabeled genes within 2-hop neighborhood with their TADA q-value rank < 1000.

| Gene ID | Gene Name | Label | DeepASD Rank | DeepASD Probability | TADA Rank | E1 Neighbors | E2 Neighbors | E3/E4 Neighbors | Negative Neighbors | TADA Neighbors |
|---|---|---|---|---|---|---|---|---|---|---|
| 57680 | CHD8 | E1 | 1 | 0.9814 | 1 | 38 | 35 | 239 | 357 | 231 |
| 26058 | GIGYF2 | Negative | 52 | 0.8366 | 1466 | 29 | 23 | 116 | 161 | 127 |
| 673 | BRAF | Negative | 180 | 0.711 | 1646 | 38 | 37 | 232 | 312 | 209 |
| 3097 | HIVEP2 | Non-labeled | 165 | 0.712 | 12492 | 38 | 37 | 233 | 317 | 207 |
| 63893 | UBE2O | Non-labeled | 197 | 0.6892 | 16973 | 40 | 39 | 256 | 394 | 242 |
| 8912 | CACNA1H | E1 | 21032 | 0.3552 | 10459 | 19 | 14 | 122 | 152 | 103 |
| 285195 | SLC9A9 | E1 | 24620 | 0.2774 | 4379 | 7 | 11 | 78 | 171 | 72 |

5.6 Protein-Protein Interactions between ASD Genes

We investigate the interactions between the top percentile genes selected by DeepASD in a tissue-specific protein-protein interaction (PPI) network and search for hub genes which frequently interact with our predicted genes. Using this orthogonal source of interaction information, we look for new ASD risk genes. The network we use is a frontal cortex PPI network taken from the DifferentialNet database [92], and it is constructed by the NetworkAnalyst system (Figure 5.11) [93]. In this network, we observe several hub proteins such as ELAVL1, HECW2, EP300, CUL3 and CREBBP. We detected hubs using betweenness centrality and node degree and considered the top 5 hubs for this analysis. ELAVL1 is an RNA-binding protein regulating other mRNAs containing AU-rich elements. Therefore, ELAVL1 potentially regulates all proteins connected to it by direct edges in Figure 5.11. These genes include E1 genes such as KDM5B and DYRK1A, E2 genes such as SPAST and E3/E4 genes such as SATB2 and SETD2. Therefore, ELAVL1 is a candidate risk gene for ASD since it is the largest hub among the top 1% genes. HECW2 is the second largest hub in our PPI network. HECW2 has low prior information (TADA P = 0.95) and it is marked as a strong candidate by SFARI although it is unlabeled in our ground truth set. DeepASD ranked HECW2 in the top 1% without any information regarding PPI, and yet it is the second largest hub in the tissue-specific PPI network. CREBBP is the third largest hub in our network, and it is labeled as non-mental-health related in our ground truth set. However, the SFARI ranking of CREBBP indicates high confidence. Being the third largest hub supports the confidence level assigned to CREBBP by SFARI. This also demonstrates the prediction power of DeepASD. Our model can still assign high probabilities to genes marked as non-mental-health related, as we have discussed in the previous section. The next hub, CUL3, is an E1 gene in our ground truth set and is also marked as a high-confidence ASD gene by SFARI. Independent PPI network analysis also supports the ASD risk candidacy of this gene. Finally, EP300 is the fifth largest hub in our network. Similar to CREBBP, EP300 is also labeled as non-mental-health related in our ground truth set but is marked as a high-confidence ASD risk gene by SFARI.
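Hub detection by betweenness centrality and node degree can be illustrated on a toy graph. The sketch below is not the NetworkAnalyst procedure; it is a small stand-alone implementation of Brandes' algorithm for unweighted, undirected graphs, and the way the two measures are combined for ranking (betweenness first, degree as tie-breaker) is an assumption. Node names are placeholders.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted, undirected)."""
    centrality = {v: 0.0 for v in adjacency}
    for source in adjacency:
        order = []                                  # nodes in order of non-decreasing distance
        predecessors = {v: [] for v in adjacency}
        n_paths = {v: 0 for v in adjacency}         # number of shortest paths from source
        distance = {v: -1 for v in adjacency}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:                 # first time w is reached
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        dependency = {v: 0.0 for v in adjacency}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in centrality.items()}


def top_hubs(adjacency, k=5):
    """Rank nodes by betweenness, then degree (assumed tie-breaking rule)."""
    centrality = betweenness_centrality(adjacency)
    ranked = sorted(adjacency, key=lambda v: (centrality[v], len(adjacency[v])), reverse=True)
    return ranked[:k]


# Toy star graph: one hub connected to four leaves.
star = {"hub": ["g1", "g2", "g3", "g4"], "g1": ["hub"], "g2": ["hub"], "g3": ["hub"], "g4": ["hub"]}
scores = betweenness_centrality(star)
print(scores["hub"], top_hubs(star, k=1))
```

In the star graph, every shortest path between two leaves passes through the hub, so its unnormalized betweenness equals the number of leaf pairs.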

[Figure 5.11 image: PPI network diagram of the top 1% genes, with hub nodes including ELAVL1, HECW2, EP300, CUL3 and CREBBP.]

Figure 5.11: Tissue-specific frontal cortex PPI network from DifferentialNet database constructed by using NetworkAnalyst system. Node sizes indicate betweenness centrality of nodes (larger nodes have higher betweenness centrality) and node colors map node degrees, red indicating higher degree.

Chapter 6

Conclusion

ASD is one of the most common neurodevelopmental disorders and affects many children and their families around the world. It has a strong genetic basis. An estimated one thousand risk genes contribute to its etiology, and yet only a fraction of them have been identified with sufficient evidence. Although there is a long way to go before this disorder is fully deciphered, ASD gene risk prediction algorithms have proven effective since they prioritize genes for further investigation.

In this study, we have developed a GCN-based deep learner to prioritize ASD risk genes and learn important neurodevelopmental periods. We utilized gene co-expression networks along with mutation burden data by using GCNs. By adding a mixture of experts module, our model is capable of using arbitrarily many networks and learning the relative importance of these networks. Our aim is to surpass the state-of-the-art ASD gene risk prediction methods by utilizing 52 co-expression networks constructed using BrainSpan data. We use 52 2-layer GCN modules as learners for the mixture of experts model. Each of these modules produces a probability value for each gene. These probabilities are multiplied by the weights learned by the gating network and then summed up to obtain the final prediction for each gene, which is then multiplied by the evidence weight of the gene to obtain its risk probability.
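The combination step of the mixture of experts model can be sketched for a single gene as a weighted sum of expert probabilities followed by scaling with an evidence weight. This is a minimal illustration only: the softmax normalization of the gating outputs, the function names and all numeric values are assumptions and are not taken from the model description.

```python
import math


def softmax(logits):
    """Numerically stable softmax; assumed here as the gating normalization."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gene_risk(expert_probs, gate_logits, evidence_weight=1.0):
    """Combine per-network probabilities for one gene.

    expert_probs: one probability per GCN module (52 in the thesis setting).
    gate_logits: unnormalized gating scores for the same modules (hypothetical input).
    evidence_weight: per-gene evidence weight applied to the mixed prediction.
    """
    if len(expert_probs) != len(gate_logits):
        raise ValueError("one gating score is needed per expert")
    weights = softmax(gate_logits)
    mixed = sum(w * p for w, p in zip(weights, expert_probs))
    return mixed * evidence_weight


# Placeholder example: 52 experts that all output 0.5 with uniform gating.
uniform = gene_risk([0.5] * 52, [0.0] * 52)

# Two experts where the gate strongly prefers the second one.
gated = gene_risk([0.2, 0.9], [0.0, 50.0], evidence_weight=0.5)
print(round(uniform, 6), round(gated, 6))
```

Because the gating weights sum to one, the mixed prediction stays within the range spanned by the individual expert probabilities before the evidence weight is applied.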

DeepASD improves on the state-of-the-art method of Krishnan et al. in terms of the distributions of the area under the ROC curve and the area under the precision-recall curve. In addition, the ranking produced by DeepASD performs better than DAMAGES score and DAWN in terms of E1 and E1 + E2 based precision-recall curves.

The results of DeepASD are also tested by checking enrichment in pathways, target gene lists and biological processes that are related to ASD etiology. In all of the tests, the first decile of DeepASD is highly enriched, and the corresponding P values justify the significance. Among all, the FMRP target gene list had the most statistically significant enrichment. We also perform enrichment analysis on the BioPlanet 2019 [84] dataset, and the results show that developmental biology is the most affected pathway considering the top 1% genes of DeepASD. We also inspected 5 ASD-associated CNV regions to find candidate genes within these regions. NIPA1, CORO1A and SLC25A1 are candidate genes that are ranked high by DeepASD.

Using the information that the mixture of experts model provides, we found that the MDCBC8-10 network is selected as the most significant network by the gating network. The MDCBC9-11 and MDCBC10-12 networks are also marked as significant by both the gating network and the average individual GCN module probabilities. The 9-11 neurodevelopmental window (late infancy to late childhood) is selected as the most informative period and MDCBC is selected as the most informative brain region for ASD etiology.

We inspect edge case predictions to point out that the GIGYF2 and BRAF genes are more likely to be E1 genes than non-mental-health related genes. Similarly, the CACNA1H and SLC9A9 genes are ranked low by DeepASD although their evidence level is E1. These genes are highly enriched in regions such as the spinal cord (SLC9A9) and the basal ganglia and pituitary gland (CACNA1H), which makes them controversial ASD risk genes.

Finally, we constructed a tissue-specific PPI network for observing interactions between the top 1% genes predicted by DeepASD. After identifying the hubs of this network by using betweenness centrality and node degree information, we identified the top 5 hubs as ELAVL1, HECW2, EP300, CUL3 and CREBBP. Among these hub genes, ELAVL1 and HECW2 are the two largest hubs and have no label. Therefore, they are good candidate ASD risk genes. EP300 and CREBBP are two other hubs that are labeled as non-mental-health related. However, both of these genes are marked as high-confidence ASD risk genes by SFARI. CUL3 is the last hub in our PPI network. It is labeled as an E1 gene and is marked as a high-confidence ASD risk gene by SFARI as well. All of these hubs are formed independently of our model. In other words, our model did not have any information regarding PPIs. Still, all major hubs in the PPI network are either already marked as high-confidence genes by SFARI or associated with ASD by other means.

Bibliography

[1] X. He, S. J. Sanders, L. Liu, S. D. Rubeis, E. T. Lim, J. S. Sutcliffe, G. D. Schellenberg, R. A. Gibbs, M. J. Daly, J. D. Buxbaum, et al., “Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes,” PLoS Genetics, vol. 9, no. 8, 2013.

[2] S. De Rubeis, X. He, A. P. Goldberg, C. S. Poultney, K. Samocha, A. E. Cicek, Y. Kou, L. Liu, M. Fromer, S. Walker, et al., “Synaptic, transcriptional and chromatin genes disrupted in autism,” Nature, vol. 515, no. 7526, pp. 209–215, 2014.

[3] I. Iossifov, B. J. O’Roak, S. J. Sanders, M. Ronemus, N. Krumm, D. Levy, H. A. Stessman, K. T. Witherspoon, L. Vives, K. E. Patterson, et al., “The contribution of de novo coding mutations to autism spectrum disorder,” Nature, vol. 515, no. 7526, pp. 216–221, 2014.

[4] S. J. Sanders, M. T. Murtha, A. R. Gupta, J. D. Murdoch, M. J. Raubeson, A. J. Willsey, A. G. Ercan-Sencicek, N. M. DiLullo, N. N. Parikshak, J. L. Stein, M. F. Walker, G. T. Ober, N. A. Teran, Y. Song, P. El-Fishawy, R. C. Murtha, M. Choi, J. D. Overton, R. D. Bjornson, N. J. Carriero, K. A. Meyer, K. Bilguvar, S. M. Mane, N. Sestan, R. P. Lifton, M. Gunel, K. Roeder, D. H. Geschwind, B. Devlin, and M. W. State, “De novo mutations revealed by whole-exome sequencing are strongly associated with autism,” Nature, vol. 485, 2012.

[5] B. J. O’Roak, L. Vives, S. Girirajan, E. Karakoc, N. Krumm, B. P. Coe, R. Levy, A. Ko, C. Lee, J. D. Smith, E. H. Turner, I. B. Stanaway, B. Vernot, M. Malig, C. Baker, B. Reilly, J. M. Akey, E. Borenstein, M. J. Rieder, D. A. Nickerson, R. Bernier, J. Shendure, and E. E. Eichler, “Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations,” Nature, vol. 485, 2012.

[6] B. J. O’Roak, L. Vives, W. Fu, J. D. Egertson, I. B. Stanaway, I. G. Phelps, G. Carvill, A. Kumar, C. Lee, K. Ankenman, J. Munson, J. B. Hiatt, E. H. Turner, R. Levy, D. R. O’Day, N. Krumm, B. P. Coe, B. K. Martin, E. Borenstein, D. A. Nickerson, H. C. Mefford, D. Doherty, J. M. Akey, R. Bernier, E. E. Eichler, and J. Shendure, “Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders,” Science, vol. 338, 2012.

[7] B. M. Neale, Y. Kou, L. Liu, A. Ma’ayan, K. E. Samocha, A. Sabo, C. F. Lin, C. Stevens, L. S. Wang, V. Makarov, P. Polak, S. Yoon, J. Maguire, E. L. Crawford, N. G. Campbell, E. T. Geller, O. Valladares, C. Schafer, H. Liu, T. Zhao, G. Cai, J. Lihm, R. Dannenfelser, O. Jabado, Z. Peralta, U. Nagaswamy, D. Muzny, J. G. Reid, I. Newsham, and Y. Wu, “Patterns and rates of exonic de novo mutations in autism spectrum disorders,” Nature, vol. 485, 2012.

[8] I. Iossifov, M. Ronemus, D. Levy, Z. Wang, I. Hakker, J. Rosenbaum, B. Yamrom, Y. H. Lee, G. Narzisi, A. Leotta, J. Kendall, E. Grabowska, B. Ma, S. Marks, L. Rodgers, A. Stepansky, J. Troge, P. Andrews, M. Bekritsky, K. Pradhan, E. Ghiban, M. Kramer, J. Parla, R. Demeter, L. L. Fulton, R. S. Fulton, V. J. Magrini, K. Ye, J. C. Darnell, and R. B. Darnell, “De novo gene disruptions in children on the autistic spectrum,” Neuron, vol. 74, 2012.

[9] J. Grove, S. Ripke, T. D. Als, M. Mattheisen, R. K. Walters, H. Won, J. Pallesen, E. Agerbo, O. A. Andreassen, R. Anney, et al., “Identification of common genetic risk variants for autism spectrum disorder,” Nature genetics, vol. 51, no. 3, p. 431, 2019.

[10] P. Feliciano, A. M. Daniels, L. G. Snyder, A. Beaumont, A. Camba, A. Esler, A. G. Gulsrud, A. Mason, A. Gutierrez, A. Nicholson, et al., “Spark: a us cohort of 50,000 families to accelerate autism research,” Neuron, vol. 97, no. 3, pp. 488–493, 2018.

[11] B. Devlin, N. Melhem, and K. Roeder, “Do common variants play a role in risk for autism? evidence and theoretical musings,” Brain research, vol. 1380, pp. 78–84, 2011.

[12] D. Ma, D. Salyakina, J. M. Jaworski, I. Konidari, P. L. Whitehead, A. N. Andersen, J. D. Hoffman, S. H. Slifer, D. J. Hedges, H. N. Cukier, et al., “A genome-wide association study of autism reveals a common novel risk at 5p14.1,” Annals of human genetics, vol. 73, no. 3, pp. 263–273, 2009.

[13] “Meta-analysis of gwas of over 16,000 individuals with autism spectrum disorder highlights a novel locus at 10q24.32 and a significant overlap with schizophrenia,” Molecular autism, vol. 8, pp. 1–17, 2017.

[14] R. Anney, L. Klei, D. Pinto, J. Almeida, E. Bacchelli, G. Baird, N. Bolshakova, S. Bolte, P. F. Bolton, T. Bourgeron, S. Brennan, J. Brian, J. Casey, J. Conroy, C. Correia, C. Corsello, E. L. Crawford, M. de Jonge, R. Delorme, E. Duketis, F. Duque, A. Estes, P. Farrar, B. A. Fernandez, S. E. Folstein, E. Fombonne, J. Gilbert, C. Gillberg, J. T. Glessner, and A. Green, “Individual common variants exert weak effects on risk for autism spectrum disorders,” Hum Mol Genet, vol. 21, 2012.

[15] F. K. Satterstrom, J. A. Kosmicki, J. Wang, M. S. Breen, S. De Rubeis, J.-Y. An, M. Peng, R. L. Collins, J. Grove, L. Klei, et al., “Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism,” 2019.

[16] T.-H. Nguyen, A. Dobbyn, R. C. Brown, B. P. Riley, J. D. Buxbaum, D. Pinto, S. M. Purcell, P. F. Sullivan, X. He, and E. A. Stahl, “mtada is a framework for identifying risk genes from de novo mutations in multiple traits,” Nature Communications, vol. 11, no. 1, pp. 1–12, 2020.

[17] A. J. Gonzalez-Mantilla, A. Moreno-De-Luca, D. H. Ledbetter, and C. L. Martin, “A cross-disorder method to identify novel candidate genes for developmental brain disorders,” JAMA psychiatry, vol. 73, no. 3, pp. 275–283, 2016.

[18] M. J. Gandal, J. R. Haney, N. N. Parikshak, V. Leppa, G. Ramaswami, C. Hartl, A. J. Schork, V. Appadurai, A. Buil, T. M. Werge, et al., “Shared molecular neuropathology across major psychiatric disorders parallels polygenic overlap,” Science, vol. 359, no. 6376, pp. 693–697, 2018.

[19] F. Hormozdiari, O. Penn, E. Borenstein, and E. E. Eichler, “The discovery of integrated gene networks for autism and related disorders,” Genome Research, vol. 25, pp. 142–154, May 2014.

[20] U. Norman and A. E. Cicek, “St-steiner: a spatio-temporal gene discovery algorithm,” Bioinformatics, vol. 35, no. 18, pp. 3433–3440, 2019.

[21] S. R. Gilman, I. Iossifov, D. Levy, M. Ronemus, M. Wigler, and D. Vitkup, “Rare de novo variants associated with autism implicate a large functional network of genes involved in formation and function of synapses,” Neuron, vol. 70, no. 5, pp. 898–907, 2011.

[22] S. R. Gilman, J. Chang, B. Xu, T. S. Bawa, J. A. Gogos, M. Karayiorgou, and D. Vitkup, “Diverse types of genetic variation converge on functional gene networks involved in schizophrenia,” Nature neuroscience, vol. 15, no. 12, pp. 1723–1728, 2012.

[23] H. T. Nguyen, J. Bryois, A. Kim, A. Dobbyn, L. M. Huckins, A. B. Munoz-Manchado, D. M. Ruderfer, G. Genovese, M. Fromer, X. Xu, et al., “Integrated bayesian analysis of rare exonic variants to identify risk genes for schizophrenia and neurodevelopmental disorders,” Genome medicine, vol. 9, no. 1, p. 114, 2017.

[24] L. Brueggeman, T. Koomar, and J. J. Michaelson, “Forecasting risk gene discovery in autism with machine learning and genome-scale data,” Scientific Reports, vol. 10, no. 1, pp. 1–11, 2020.

[25] L. Liu, J. Lei, S. J. Sanders, A. J. Willsey, Y. Kou, A. E. Cicek, L. Klei, C. Lu, X. He, M. Li, R. A. Muhle, A. Ma’ayan, J. P. Noonan, N. Šestan, K. A. McFadden, M. W. State, J. D. Buxbaum, B. Devlin, and K. Roeder, “Dawn: a framework to identify autism genes and subnetworks using gene expression and genetics,” Molecular Autism, vol. 5, p. 22, Mar 2014.

[26] A. Krishnan, R. Zhang, V. Yao, C. L. Theesfeld, A. K. Wong, A. Tadych, N. Volfovsky, A. Packer, A. Lash, and O. G. Troyanskaya, “Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder,” Nature neuroscience, vol. 19, no. 11, pp. 1454–1462, 2016.

[27] A. J. Willsey, S. J. Sanders, M. Li, S. Dong, A. T. Tebbenkamp, R. Muhle, S. K. Reilly, L. Lin, S. Fertuzinhos, J. A. Miller, M. Murtha, C. Bichsel, W. Niu, J. Cotney, A. G. Ercan-Sencicek, J. Gockley, A. R. Gupta, W. Han, X. He, E. J. Hoffman, L. Klei, J. Lei, W. Liu, L. Liu, C. Lu, X. Xu, Y. Zhu, S. M. Mane, E. S. Lein, and L. Wei, “Coexpression networks implicate human midfetal deep cortical projection neurons in the pathogenesis of autism,” Cell, vol. 155, 2013.

[28] S. L. Salzberg, “Open questions: How many genes do we have?,” BMC Biology, vol. 16, p. 94, 2018.

[29] O. Oron and E. Elliott, “Delineating the common biological pathways perturbed by asd’s genetic etiology: Lessons from network-based studies,” International journal of molecular sciences, vol. 18, no. 4, p. 828, 2017.

[30] M. A. Mines, C. J. Yuskaitis, M. K. King, E. Beurel, and R. S. Jope, “Gsk3 influences social preference and anxiety-related behaviors during social interaction in a mouse model of fragile x syndrome and autism,” PLOS ONE, vol. 5, pp. 1–12, 03 2010.

[31] N. D. Okerlund and B. N. Cheyette, “Synaptic wnt signaling-a contributor to major psychiatric disorders?,” J Neurodev Disord, vol. 3, pp. 162–174, Jun 2011.

[32] J. A. Lee, A. Damianov, C. H. Lin, M. Fontes, N. N. Parikshak, E. S. Anderson, D. H. Geschwind, D. L. Black, and K. C. Martin, “Cytoplasmic rbfox1 regulates the expression of synaptic and autism-related genes,” Neuron, vol. 89, pp. 113–128, Jan 2016.

[33] C. R. Casingal, T. Kikkawa, H. Inada, and N. Osumi, “Identification of fmrp target genes expressed in corticogenesis: implication for common phenotypes among neurodevelopmental disorders,” bioRxiv, 2019.

[34] M. Alotaibi and K. Ramzan, “A de novo variant of chd8 in a patient with autism spectrum disorder,” Discoveries (Craiova), 2020.

[35] I. F. King, C. N. Yandava, A. M. Mabb, J. S. Hsiao, H.-S. Huang, B. L. Pearson, J. M. Calabrese, J. Starmer, J. S. Parker, T. Magnuson, et al., “Topoisomerases facilitate transcription of long genes linked to autism,” Nature, vol. 501, no. 7465, pp. 58–62, 2013.

[36] “National human genome research institute home,” Aug 2020. Online; accessed 25 August 2020.

[37] A. Thapar and M. Cooper, “Copy number variation: What is it and what has it told us about child psychiatric disorders?,” Journal of the American Academy of Child and Adolescent Psychiatry, vol. 52, no. 8, pp. 772–774, 2013.

[38] F. Zhang, W. Gu, M. E. Hurles, and J. R. Lupski, “Copy number variation in human health, disease, and evolution,” Annual review of genomics and human genetics, vol. 10, pp. 451–481, 2009.

[39] D. Levy, M. Ronemus, B. Yamrom, Y.-h. Lee, A. Leotta, J. Kendall, S. Marks, B. Lakshmi, D. Pai, K. Ye, et al., “Rare de novo and transmitted copy-number variation in autistic spectrum disorders,” Neuron, vol. 70, no. 5, pp. 886–897, 2011.

[40] C. R. Marshall, A. Noor, J. B. Vincent, A. C. Lionel, L. Feuk, J. Skaug, M. Shago, R. Moessner, D. Pinto, Y. Ren, et al., “Structural variation of chromosomes in autism spectrum disorder,” The American Journal of Human Genetics, vol. 82, no. 2, pp. 477–488, 2008.

[41] D. Pinto, A. T. Pagnamenta, L. Klei, R. Anney, D. Merico, R. Regan, J. Conroy, T. R. Magalhaes, C. Correia, B. S. Abrahams, J. Almeida, E. Bacchelli, G. D. Bader, A. J. Bailey, G. Baird, A. Battaglia, T. Berney, N. Bolshakova, S. Bolte, P. F. Bolton, T. Bourgeron, S. Brennan, J. Brian, S. E. Bryson, A. R. Carson, G. Casallo, J. Casey, B. H. Chung, and L. Cochrane, “Functional impact of global rare copy number variation in autism spectrum disorders,” Nature, vol. 466, 2010.

[42] S. J. Sanders, A. G. Ercan-Sencicek, V. Hus, R. Luo, M. T. Murtha, D. Moreno-De-Luca, S. H. Chu, M. P. Moreau, A. R. Gupta, S. A. Thomson, C. E. Mason, K. Bilguvar, P. B. Celestino-Soper, M. Choi, E. L. Crawford, L. Davis, N. R. Wright, R. M. Dhodapkar, M. DiCola, N. M. DiLullo, T. V. Fernandez, V. Fielding-Singh, D. O. Fishman, S. Frahm, R. Garagaloyan, G. S. Goh, S. Kammela, L. Klei, J. K. Lowe, and S. C. Lund, “Multiple recurrent de novo cnvs, including duplications of the 7q11.23 williams syndrome region, are strongly associated with autism,” Neuron, vol. 70, 2011.

[43] B. H.-Y. Chung, V. Q. Tao, and W. W.-Y. Tso, “Copy number variation and autism: New insights and clinical implications,” Journal of the Formosan Medical Association, vol. 113, no. 7, pp. 400–408, 2014.

[44] M. Velinov, “Genomic copy number variations in the autism clinic—work in progress,” Frontiers in Cellular Neuroscience, vol. 13, p. 57, 2019.

[45] J. Sebat, B. Lakshmi, D. Malhotra, J. Troge, C. Lese-Martin, T. Walsh, B. Yamrom, S. Yoon, A. Krasnitz, and J. Kendall, “Strong association of de novo copy number mutations with autism,” Science, vol. 316, no. 5823, pp. 445–449, 2007.

[46] T. Schröder, “The protein puzzle,” Biology Medicine, Max Planck Institute, vol. 3, no. 17, pp. 54–59, 2017.

[47] R. E. Stevenson, C. E. Schwartz, and R. J. Schroer, X-linked mental retardation. No. 39, Oxford University Press, USA, 2000.

[48] H. J. Kang, Y. I. Kawasawa, F. Cheng, Y. Zhu, X. Xu, M. Li, A. M. Sousa, M. Pletikos, K. A. Meyer, G. Sedmak, T. Guennel, Y. Shin, M. B. Johnson, Z. Krsnik, S. Mayer, S. Fertuzinhos, S. Umlauf, S. N. Lisgo, A. Vortmeyer, D. R. Weinberger, S. Mane, T. M. Hyde, A. Huttner, M. Reimers, J. E. Kleinman, and N. Sestan, “Spatio-temporal transcriptome of the human brain,” Nature, vol. 478, 2011.

[49] C. Stark, B.-J. Breitkreutz, T. Reguly, L. Boucher, and A. Breitkreutz, “Biogrid: a general repository for interaction datasets,” Nucleic Acids Res, vol. 34, 2006.

[50] T. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, et al., “Human protein reference database-2009 update,” Nucleic acids research, vol. 37, no. suppl 1, pp. D767–D772, 2008.

[51] D. Szklarczyk, A. Franceschini, M. Kuhn, M. Simonovic, A. Roth, P. Minguez, T. Doerks, M. Stark, J. Muller, P. Bork, et al., “The string database in 2011: functional interaction networks of proteins, globally integrated and scored,” Nucleic acids research, vol. 39, no. suppl 1, pp. D561–D568, 2010.

[52] J. A. Kosmicki, K. E. Samocha, D. P. Howrigan, S. J. Sanders, K. Slowikowski, M. Lek, K. J. Karczewski, D. J. Cutler, K. Roeder, J. D. Buxbaum, et al., “Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples,” Nature genetics, vol. 49, pp. 504–510, Apr 2017.

[53] L. Liu, J. Lei, and K. Roeder, “Network assisted analysis to reveal the genetic basis of autism,” The annals of applied statistics, vol. 9, no. 3, p. 1571, 2015.

[54] B. S. Abrahams, D. E. Arking, D. B. Campbell, H. C. Mefford, E. M. Morrow, L. A. Weiss, I. Menashe, T. Wadkins, S. Banerjee-Basu, and A. Packer, “Sfari gene 2.0: a community-driven knowledgebase for the autism spectrum disorders (asds),” Molecular autism, vol. 4, no. 1, p. 36, 2013.

[55] A. Hamosh, A. F. Scott, J. S. Amberger, C. A. Bocchini, and V. A. McKusick, “Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders,” Nucleic acids research, vol. 33, no. suppl 1, pp. D514–D517, 2005.

[56] W. Yu, M. Gwinn, M. Clyne, A. Yesupriya, and M. J. Khoury, “A navigator for human genome epidemiology,” Nature genetics, vol. 40, no. 2, p. 124, 2008.

[57] K. G. Becker, K. C. Barnes, T. J. Bright, and S. A. Wang, “The genetic association database,” Nature genetics, vol. 36, no. 5, p. 431, 2004.

[58] K. Peng, W. Xu, J. Zheng, K. Huang, H. Wang, J. Tong, Z. Lin, J. Liu, W. Cheng, D. Fu, et al., “The disease and gene annotations (dga): an annotation resource for human disease,” Nucleic acids research, vol. 41, no. D1, pp. D553–D560, 2012.

[59] C. Zhang and Y. Shen, “A cell type-specific expression signature predicts haploinsufficient autism-susceptibility genes,” Human Mutation, vol. 38, pp. 204–215, 2 2017.

[60] S. M. Sunkin, L. Ng, C. Lau, T. Dolbeare, T. L. Gilbert, C. L. Thompson, M. Hawrylycz, and C. Dang, “Allen brain atlas: an integrated spatio-temporal portal for exploring the central nervous system,” Nucleic acids research, vol. 41, no. D1, pp. D996–D1008, 2012.

[61] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.

[62] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.

[63] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in neural information processing systems, pp. 2224–2232, 2015.

[64] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.

[65] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in neural information processing systems, pp. 3844–3852, 2016.

[66] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.

[67] Y. Yao, L. Rosasco, and A. Caponnetto, “On early stopping in gradient descent learning,” Constructive Approximation, vol. 26, no. 2, p. 289, 2007.

[68] J. Lin, R. Camoriano, and L. Rosasco, “Generalization properties and implicit regularization for multiple passes sgm,” Proceedings of Machine Learning Research, vol. 48, pp. 2340–2348, 2016.

[69] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in Neural Information Processing Systems 4, pp. 950–957, 1991.

[70] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014.

[71] S. Masoudnia and R. Ebrahimpour, “Mixture of experts: a literature survey,” Artificial Intelligence Review, vol. 42, no. 2, pp. 275–293, 2014.

[72] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[73] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007, 2017.

[74] H. O. Kalkman, “A review of the evidence for the canonical wnt pathway in autism spectrum disorders,” Molecular autism, vol. 3, no. 1, p. 10, 2012.

[75] D. Pinto, E. Delaby, D. Merico, M. Barbosa, A. Merikangas, L. Klei, B. Thiruvahindrapuram, X. Xu, R. Ziman, Z. Wang, et al., “Convergence of genes and cellular pathways dysregulated in autism spectrum disorders,” The American Journal of Human Genetics, vol. 94, no. 5, pp. 677–694, 2014.

[76] J. Cotney, R. A. Muhle, S. J. Sanders, L. Liu, A. J. Willsey, W. Niu, W. Liu, L. Klei, J. Lei, J. Yin, et al., “The autism-associated chromatin modifier chd8 regulates other autism risk genes during human neurodevelopment,” Nature communications, vol. 6, p. 6404, 2015.

[77] I. Voineagu, X. Wang, P. Johnston, J. K. Lowe, Y. Tian, S. Horvath, J. Mill, R. M. Cantor, B. J. Blencowe, and D. H. Geschwind, “Transcriptomic analysis of autistic brain reveals convergent molecular pathology,” Nature, vol. 474, no. 7351, pp. 380–384, 2011.

[78] S. M. Weyn-Vanhentenryck, A. Mele, Q. Yan, S. Sun, N. Farny, Z. Zhang, C. Xue, M. Herre, P. A. Silver, M. Q. Zhang, et al., “Hits-clip and integrative modeling define the rbfox splicing-regulatory network linked to brain development and autism,” Cell reports, vol. 6, no. 6, pp. 1139–1152, 2014.

[79] J. C. Darnell, S. J. Van Driesche, C. Zhang, K. Y. S. Hung, A. Mele, C. E. Fraser, E. F. Stone, C. Chen, J. J. Fak, S. W. Chi, et al., “Fmrp stalls ribosomal translocation on mrnas linked to synaptic function and autism,” Cell, vol. 146, no. 2, pp. 247–261, 2011.

[80] E. Rosina, B. Battan, M. Siracusano, L. Di Criscio, F. Hollis, L. Pacini, P. Curatolo, and C. Bagni, “Disruption of mtor and mapk pathways correlates with severity in idiopathic autism,” Translational psychiatry, vol. 9, no. 1, pp. 1–10, 2019.

[81] À. Bayés, L. N. Van De Lagemaat, M. O. Collins, M. D. Croning, I. R. Whittle, J. S. Choudhary, and S. G. Grant, “Characterization of the proteome, diseases and evolution of the human postsynaptic density,” Nature neuroscience, vol. 14, no. 1, pp. 19–21, 2011.

[82] H. Y. Zoghbi and M. F. Bear, “Synaptic dysfunction in neurodevelopmental disorders associated with autism and intellectual disabilities,” Cold Spring Harbor perspectives in biology, vol. 4, no. 3, p. a009886, 2012.

[83] S. Iwase, N. G. Bérubé, Z. Zhou, N. N. Kasri, E. Battaglioli, M. Scandaglia, and A. Barco, “Epigenetic etiology of intellectual disability,” Journal of Neuroscience, vol. 37, no. 45, pp. 10773–10782, 2017.

[84] R. Huang, I. Grishagin, Y. Wang, T. Zhao, J. Greene, J. C. Obenauer, D. Ngan, D.-T. Nguyen, R. Guha, A. Jadhav, et al., “The ncats bioplanet - an integrated platform for exploring the universe of cellular signaling pathways for toxicology, systems biology, and chemical genomics,” Frontiers in pharmacology, vol. 10, p. 445, 2019.

[85] L. Pinato, C. S. Galina Spilla, R. P. Markus, and S. da Silveira Cruz-Machado, “Dysregulation of circadian rhythms in autism spectrum disorders,” Current pharmaceutical design, vol. 25, no. 41, pp. 4379–4393, 2019.

[86] G. M. Zuculo, B. S. Gonçalves, C. Brittes, L. Menna-Barreto, and L. Pinato, “Melatonin and circadian rhythms in autism: Case report,” Chronobiology International, vol. 34, no. 4, pp. 527–530, 2017. PMID: 28426389.

[87] J. Kopel, Citrate and Autism, pp. 1–2. New York, NY: Springer New York, 2019.

[88] H. Brentana, C. S. d. Paula, D. Bordini, D. Rolim, F. Sato, J. Portolese, M. C. Pacifico, and J. T. McCracken, “Autism spectrum disorders: an overview on diagnosis and treatment,” Brazilian Journal of Psychiatry, vol. 35, pp. 62–72, 2013.

[89] J. Li, T. Cai, Y. Jiang, H. Chen, X. He, C. Chen, X. Li, Q. Shao, X. Ran, Z. Li, et al., “Genes with de novo mutations are shared by four neuropsychiatric disorders discovered from npdenovo database,” Molecular psychiatry, vol. 21, no. 2, p. 290, 2016.

[90] E. M. Morrow, S.-Y. Yoo, S. W. Flavell, T.-K. Kim, Y. Lin, R. S. Hill, N. M. Mukaddes, S. Balkhy, G. Gascon, A. Hashmi, et al., “Identifying autism loci and genes by tracing recent shared ancestry,” Science, vol. 321, no. 5886, pp. 218–223, 2008.

[91] “The gtex portal,” Aug 2019. Online; accessed 20 February 2020.

[92] O. Basha, R. Shpringer, C. M. Argov, and E. Yeger-Lotem, “The differentialnet database of differential protein–protein interactions in human tissues,” Nucleic acids research, vol. 46, no. D1, pp. D522–D526, 2018.

[93] G. Zhou, O. Soufan, J. Ewald, R. E. Hancock, N. Basu, and J. Xia, “Networkanalyst 3.0: a visual analytics platform for comprehensive gene expression profiling and meta-analysis,” Nucleic acids research, vol. 47, no. W1, pp. W234–W241, 2019.

Appendix A

Supplementary Tables

Table A.1: 46 E1 ground truth gene list. These are the highest-confidence ASD risk genes and are also used, together with the non-mental-health related genes, for performance calculation. An evidence weight of 1.0 is used when training with these genes.

NRXN1 ADNP FOXP1 ANKRD11 CNTN4 SYNGAP1 KMT2A CNTNAP2 TBR1 KDM5B CACNA2D3 SHANK3 ANK2 MED13L POGZ PTEN NLGN3 CHD2 SHANK2 SUV420H1 MAGEL2 CHD8 DEAF1 MET BCL11A MYT1L RELN ASH1L ARID1B SLC9A9 CTNND2 DYRK1A GRIN2B KMT2C GRIP1 KATNAL2 GABRB3 DSCAM CUL3 NLGN4X ASXL3 MECP2 SCN2A SETD5 PTCHD1 CACNA1H

Table A.2: 67 E2 ground truth genes. Their evidence weight is set to 0.5 and they are excluded while calculating performance metrics. These genes are the most confident ASD genes after E1 genes.

ATP2B2 CACNA1D SEMA5A MBD5 LMX1B SLC6A3 VIL1 NRXN3 ZMYND11 ASTN2 CNTNAP4 MYO9B SHANK1 ANXA1 CACNB2 CTCF SETBP1 MACROD2 PAX5 CTNNB1 ELP4 TRIO ZBTB20 NR3C2 TCF7L2 CEP41 GIGYF1 LAMB1 WAC RIMS1 CHRNA7 AVPR1A NCKAP1 CC2D1A ADA TNRC6B KDM6B TRIP12 ATP10A BCKDK ITGB3 ETFB RBFOX1 PHF2 SPAST GPHN DIP2A NAA15 SLC38A10 GRIK2 PRKCB HMGN1 CTTNBP2 OXTR APH1A WDFY3 KMT2E CDC42BPB FOXP2 OPHN1 TMLHE AUTS2 DISC1 TRPC6 PON1 EHMT1 EFR3A

Table A.3: 525 E3-E4 ground truth genes with evidence weight 0.25. They are the positive ground truth genes with the lowest confidence. They are not included in performance metric calculation.

NTRK3 GABRA1 CACNA1G RAB3A CLOCK SNRPN H2AFY CTNNA3 SETDB1 DRD3 ZNF778 EGR2 RHOXF1 DBH HES6 MCPH1 ITGA4 SLC19A1 DPP6 NPAS2 ADCY1 NOS1 MIB1 ST8SIA2 CGA NFIL3 LHFPL3 CCDC148 CASC4 OPRL1 MC3R SMO CTSD CDKL4 MUC1 HFE AFF2 KCNMA1 SORCS1 EPC2 ING3 TAOK2 WNT2 CADPS2 OSBPL6 CALCRL MBD1 MTNR1B HOXD12 ASMT DYX1C1 TP53 NDNL2 TCN2 SNAP25 CDH8 CALB1 SOX5 SCG5 NPTX2 DLX2 VPS13B CNTNAP5 PAH HGF MAOB SLC7A5 TDO2 HOXB1 ST7-OT3 CYP1B1 PCBD1 CNR1 ASIC2 ATP6V0C PCDH19 NPY1R GALNT13 HTR7 PRL SLC30A4 ESRRB TAS2R1 HLA-B LRRC1 ABCA13 MLPH MAOA KIAA1586 SPR NBEA CDH10 NTNG1 SCN3A TBL1X DOCK4 HMGB1 IL1RAPL2 CDH9 EIF4E SRD5A2 MAPK3 ADORA2A MBD4 CYP21A2 MYO1D IGF2 SERPINE1 PSD3 GRM5 STX1A KCNJ3 MAP2K2 VGF DLX1 GABRA6 CHRM5 CHRNA4 AANAT TPH1 NKX2-5 HEPACAM COMT FAM120C DAO PEX7 GABBR2 RIMS3 CDK14 DRD1 BZRAP1 ESR1 GHSR UBE2A RARA ZBBX MFSD6 HTR3C NTRK1 FHIT FABP5 PITX3 LEPR MTRR GRIA2 GABRA2 CASK RAPGEF4 GRIK3 TUB BCL2 RASGRP1 HS3ST5 NEUROG1 OR1C1 PLN AVPR1B OXT FRK SCT C4B IL1RN NTRK2 GABBR1 PAX6 MAP2K1 GABRG3 HCRTR2 PTK2 LHCGR LAMB4 BHMT TPH2 CYP11B1 RARG XIRP2 GJA1 LHB SETDB2 SHMT1 DCX APC VLDLR SYN1 RXRB CCK SHFM1 AQP4 DRD2 NTS FOSB SLC13A1 ARC LZTS2 SDC2 HLA-A VWA8 MTR CAMK2B MARK1 TBL1XR1 SLC6A2 DCLK2 AR COBL PPP1R1B CYP19A1 EXT1 CCDC64 HTR3A AGRP NOS2 NEDD4 FRAXE DHCR7 SCP2 HTR2C YWHAZ NRXN2 NRG1 NXPH1 PER2 NLGN4Y LRFN5 STK39 KDR TAF1C CASP3 CA6 CACNA1F IMMP2L FRAXA GNB1L MCM7 GRIN2A NOS1AP NPY JMJD1C SPON1 FMNL2 NRP2 DLX6 SLC4A10 MTNR1A TMEM100 PYY COPG2 HCRT NSD1 TGFB1 ADRB2 ADSL NRCAM NLGN1 ARNT2 PARK2 MOG AGTR2 HTR5A KCTD13 MC4R TH TBX1 UBL5 GSTP1 GPX1 HIRIP3 CHRNB2 HLA-DRB1 DHFR PAFAH1B1 RARB SETD2 GRM1 BAIAP2 CARTPT FBXO40 VASH1 GABRG1 ADRA1B UBE2H PTS KIR2DS5 CNR2 QDPR TFPI ALDH3A1 MAPT SLC6A1 FABP7 FOLH1 DOC2A PPIG SCAMP5 CDK5 TYR NLRP3 HCRTR1 AGBL4 HSD11B1 KIR2DS4 NOSTRIN SND1 IL1R2 GRIA3 DOCK3 MAZ NR2E1 PIK3CA SCN1A MDGA2 STS FOXO1 SRD5A1 TBL1Y C15orf43 KIR2DS1 MSR1 NTF4 EIF1AY APOE GAD2 DLG2 RPS6KA2 TSPAN7 FEZF2 EN2 GLO1 MBD6 FLT1 PER1 UPP2 DRD4 RFC1 STK38 PRODH INPP1 FIGNL1 
ERBB4 SLC1A1 MYO16 LRP2 ICA1 DPP10 COL11A1 GRID2 GRPR HOXD13 HES1 CPA2 GHRL LEP SSTR5 FASTK SLC25A13 DLGAP2 UBE3A HTR1E SEZ6L2 PRLR GABRA3 ALDOA NTF3 CYP17A1 LASP1 ATP1A1 TRPV1 VIPR1 GCH1 ODF3L2 MTHFR GCHFR

YWHAB NF1 NPY5R NR4A2 CPLX2 MED12 ABCB1 RAF1 TSPO CBS DDC CHMP2B TSC2 DDX53 POMC DRG1 PECAM1 PCDH9 SORCS3 TAC1 OPRK1 ADRA2A CDH22 PLD5 KCNN3 MTF1 CYP1A1 MEST KIR3DS1 ST7 TSPAN12 SLC6A14 CASP9 RPL10 GAD1 KCND2 CSNK1E SYNE1 GRM8 FBXO33 IGF1 SNTG2 VEGFA PTGS2 AGRN CYP7A1 SHBG RXRA NDUFA5 CD38 CYP2D6 HTR2A CADM1 AGAP1 AKT1 SYT17 PHF8 WNK3 HTR1A HTR6 HRAS OPRD1 SLC1A3 ROBO4 DMPK AGMO HSD17B2 DPYD RGS7 APBA2 IL1RAPL1 GABRB1 KHDRBS2 ESR2 CNTN5 EFHC2 SLC6A4 HOXD11 ADIPOQ HTR1D DAOA C3orf58 TNF PTGIR GSTM1 GRM7 PTPRN2 TSC1 NPY2R PCDH10 PTPRZ1 RXRG NDN FABP3 HOXA1 SULT2A1 MT1A TRIM32 WFS1 CLTCL1 PITX1 ADK SATB2 MEF2C BDNF NDEL1 PPP4C GTF2I POU6F2 HTR3B ZNF385B MKL2 GPC6 SLC6A8 HSD17B3 RETN MBD3 POMGNT1 AMT PIK3R1 FMR1 OMG DRD5 MIF CUX2 RAB39B HOMER1 TBC1D5 APAF1 NGFR CYP11A1 SLC25A12 EGF FEZF1 REEP3 HSD17B4 GABRA5 GABRA4 CSRNP3 AHI1 RYR3 OPRM1 MCHR1 PNOC ABAT POR NGF HTR1B PIK3CG FSHB ROBO3 ARX FOLR2

Table A.4: First 600 non-mental-health related genes. All non-mental-health related genes have evidence weight of 1.0. They are used along with E1 genes for performance metric calculation. Remaining genes are given in Table A.5.

ATXN1 MAPK8IP1 FGD1 BCR PRKAR1A SLC45A2 NPHP4 FGF10 TNFRSF11A CHMP4B SEPT9 SLC12A3 LFNG TBX20 PRPF3 NIPBL WNK1 PAX3 CNGB1 MYO15A HR SLC30A8 IFT80 ALOX12B F9 CHIC2 RAB7A PRX SPINK5 GHR ATRX SNCA ROBO2 GNAS GFAP LPP ARHGAP26 CLRN1 XPA MS4A1 MAD1L1 KCNK9 PDX1 LRRK2 BBS4 PLCB1 FGF14 GPR98 TRPM1 ATP7A FUS ARFGEF2 CYLD NDUFS7 ACVR1 PAX8 POU4F3 CCND1 CABP4 USH1G KRAS UCHL1 CNGA3 OCRL SIX3 MSX1 SMAD9 CHN1 BARD1 ENPP1 NOTCH2 ZNF592 SH3BP2 MYL2 STRC KCNQ2 TARDBP GUCA1A ZEB1 ATXN7 PHF6 FAM134B PROK2 RPGRIP1 PHEX SCN1B TGFBR1 GJB4 TMPRSS3 SLC19A3 REEP1 ATP2A2 SPTAN1 HCN4 EGFR NODAL FOXE3 GDNF RP1 VAPB ALMS1 DNAH11 MC1R UCP3 DLD NEUROD1 PABPN1 NDUFS8 OTX2 ATL1 SH3TC2 LCA5 SCN5A CBX2 COCH TNNT2 ATG16L1 PTGIS STXBP2 SLC2A4 TRIM33 TRPS1 OCA2 SPTBN2 GRM6 ADAMTS17 TLL1 GPR143 KLK4 NME8 ATXN3 GLI2 KCNJ5 CHM TAP1 MAPK10 KCNJ10 IRS2 KCNJ1 PICALM ZFHX3 SLC12A1 PPP1R3A KRT83 SCN9A RB1 GJB6 ABCA4 EYS TGFBR2 CREBBP KIF1B COL4A3 MMADHC SLC25A22 PRICKLE2 SAG RDX FHL1 ATP2C1 CCDC28B CYBA MYO7A TMC1 RP9 SCN3B KCNA5 TRIM37 USH1C EWSR1 TGIF1 DMP1 ABCA1 PRNP PORCN GCGR NR5A1 G6PC CEBPA CLN3 NEFL PPIB IRS1 HTT NEFH NIPA1 NRL SFTPA2 HABP2 GLI3 TUBA1A DLX3 PDE6C PTPRJ FIG4 LARGE DICER1 RHO PTHLH FGFR3 GATA2 EDNRA F11 GAA VSX2 F13B CDH23 LTBP3 SLC26A2 SLC5A2 OPCML ZIC3 ATXN10 SETX KCNJ2 SCN4B FGF23 COL2A1 NOG AGTR1 CIITA SLC34A2 DNAI2 KCNJ11 SEMA4A SOX2 ACTN2 FSHR EDNRB RHAG PARK7 FOXE1 BMP2 TRIM24 DIAPH2 MPZ FGD4 KCNQ1OT1 NPPA EDAR DCTN1 CACNA1C ATXN2 XYLT1 GCK OAT AIP AIPL1 ASPA DNAAF1 SPTA1 EDARADD RSPO4 KCNE1 IL23R FGFR2 SOS1 RUNX1 LEMD3 ACTN4 LRRC8A AKT2 RBM20 TMIE MSH2 PITPNM3 DHH CD24 GJB2 KRT1 CACNA1A DMD PRKCG CDHR1 SOD1 ATXN8OS ALOXE3 SDCCAG8 FASLG BBS9 LIPI GUCY2D RD3 SUFU F5 BMPR2 LYST NDP SLC3A1 RBM8A MAP3K1 AASS GDF5 AKR1D1 ALS2 TWIST1 CFC1 PDE6B SLC40A1 NYX PDE8B HPRT1 PRPF8 KCNC3 SNRNP200 LDB3 FLT3 HSPD1 GIF MITF PSAT1 FOXL2 ZNF513 LRP5 SCN4A EP300 DLL3 SNCB NDUFA13 SLC34A1 MYO3A KLF11 MEN1 PRICKLE1 TPO DNAI1 RDH12 ABCC6 PRCD STRA6 PDGFRA ZFPM2 SOX10 ABCC9 CYP2A6 PAX4 DNM2 ETV6
UBR1 PCDH15 DYNC2H1 ENAM ASPSCR1 FGA SPATA7 KIF5A ARHGEF9 EDN3 FZD4 TYRP1 AMELX GLUD1 LPL FIGLA IFT122 RSPH9 KARS PANK2 IL10 IL1B DPM1 THRB CA4 SLC26A4 THPO IMPG2 PRKAG2 SYNE2 RPE65 MYO6 AVPR2 ACTG1 C2orf71 IGF2BP2 GDF3 ARHGEF12 TOPORS TTBK2 RNF6 CNBP RLBP1 SDC3 TAT ARL6 IRF4 GJB1 LHFPL5 GNB3 MYOC GATA6 ATN1 EDA MLLT10 ATP1A2 CLCN5 MSH6 PTPN1 HMCN1 TTN LITAF IKBKAP COG1 BEST1 KRT86 CSRP3 ABCC8 SBF2 ZIC2 BMP15 NOTCH1 CLCNKB AXIN2 LRAT COL4A5 MMP20 SNTA1 LRTOMT CSTB FGB FLCN SUMO1 RYR2 PDYN MSX2 ADD1 PLA2G6 FLNA GLIS2 CRB1 GLDC FLG UROS MYOT TTR ALPL CHD7 NR4A3 AKAP9 GIGYF2 SLC7A9 OTOF PRPH CYP7B1 WWOX CLN8 FKRP MYBPC3 RGR HGSNAT NDUFS3 GPI SOX9 ACVR2B NR0B1 TFAP2A DFNB31 GDF1 PROM1 GARS NDRG1 REN SLC5A5 GPC3 KIF7 BBS5 KCNA1 PTCH1 HMGA1 BCOR ATP7B FOXF1 MPL LBR TMPRSS6 PKHD1 ERCC3 KCNQ1 MYH9 CRX ACAN RUNX2 PNPLA6 JAG1 GDF6 CDKN1C SIM1 SCNN1G PNKP CCDC50 GRHL2 NPR2 PTPN11 STAT3 HNF1B TSG101 PPP2R2B KIT LZTS1 PTCH2 CFTR PRPH2 SLC16A2 HBA1 MTTP BMP4 PEX5 ADAMTS10 FGF8 DBT HES7 RET PLP1 RNF139 BMPR1B ADRB3 PAX7 SRY SLC17A8 INSR USP9Y SGCG COL10A1 NOBOX PADI4 FBN1 KCNH2 WHSC1L1 IDS EYA4 POU3F4 GDAP1 AXIN1 TMEM67 RXFP2 FOXI1 IHH PYGM OPN1MW CNGA1 CEL RB1CC1 KLF6 MN1 BEAN1 PLAG1 STOX1 OTC ERCC4 CST3 EFNB1 GNE MMP13 TFR2 BANK1 ROR2 STX16 MAMLD1 MXI1 SGCD TECTA CDKL5 RAX USH2A RSPH4A ZFYVE27 KCNE2 EPB41 PITX2 PCCA RNASEH2B GABRG2 PKD1 SPG7 CNGB3 TAPBP UCP1 WT1 MCM6 NF2 PCM1 PDZD7 GNAS-AS1 RP1L1 TULP1 FLT4 ITPR1 EYA1 COL4A4 PRCC PLEKHM1 ATM TF TNFSF11 STK11 HSPB3 LIG4 MEFV NUP62 FAM161A DFNB59

Table A.5: Remaining 585 non-mental-health related genes. First 600 genes are given in Table A.4.

GFI1 GATM CEP152 OCLN HCCS PALB2 NCSTN PLA2G2A HLA-DQA1 CLCN1 PCCB PMM2 SLC39A4 COX15 AGPS PC ZNF469 PHKG2 RMRP UPK3A TMC8 ABCC2 PTPN22 RNASEH2C ERBB2 COG4 TPM1 SMN2 EXT2 RAD54L BRAF GP9 MTMR2 COMP SHH TBP PEX26 BBS2 SLC22A5 GBE1 FOXRED1 SLC25A15 WRN SLC35C1 COL6A2 FANCC SPG20 ERCC6 MAP3K8 TERC IL10RA F12 CD79B FCGR2B GLA COG5 BCKDHB ENG IDH3B STIL CEP290 SIX1 NHLRC1 MYH3 F2 WDPCP UNC13D CYBB DIAPH1 XPC TNNC1 LMNA AK1 NAGLU FANCA NPC1 PTPRQ POMT1 MKKS PINK1 AP3B1 TNFRSF13B CFI MYC COL6A3 APOL1 IRF5 ECM1 GALK1 AGA SLC6A19 FOXP3 IMPDH1 ESCO2 POLR1D SLC4A11 KLHDC8B DYSF SLC22A4 CFH MPI SLC17A5 AGPAT2 RPS7 GLB1 SCO2 GATA4 AGT PHGDH KIAA0196 DSP HLCS C5 COL1A2 GJB3 DPM3 IL2RG CHEK2 ALG9 LAMP2 CLCN7 FKBP10 COX6B1 NT5C3A PEX10 JUP KRT13 SLC16A1 RFXANK CD79A MINPP1 NDUFA12 DOLK COL5A1 PHYH NPC2 ACTC1 STX11 RPS26 EPM2A SLC52A3 CDK5RAP2 BBS7 COL11A2 PPOX NDUFAF6 ASS1 HTRA2 GM2A RAD51 EPCAM PHKA2 DFNA5 HNF1A APOA1 CENPJ HEXA KLHL7 NEXN TCIRG1 ATR HPS1 EPHX1 RPS24 RPS17 MLH3 CP AVP GALC ARSB PDE6G NCF1 INS CDAN1 BLNK CASP10 TMC6 MUTYH HADH RPL11 BBS12 NLRP7 NPHP3 INVS INSL3 F13A1 KLF1 TTC8 LYZ HSPB1 MGAT2 SGCA SLC37A4 DDB2 DPAGT1 PFKM H19 GMPS CERKL TLR2 MVK CA2 RP2 B4GALT1 CFHR1 DNAAF2 COL6A1 HTRA1 C3 RECQL4 VHL CLCNKA OPN1LW TGM1 CYP3A5 MYH7 NR0B2 FBP1 NDUFA2 GALT SMN1 MYH14 FASTKD2 TDP1 CLN5 MFN2 CFHR3 ODC1 GJC2 DSPP BSCL2 SAMHD1 TCAP MESP2 ARSA SNAI2 IQCB1 ANG BTD PHB LMBR1 IL6 NME1 SCNN1B MYO1A FKTN SPTLC2 COL8A2 PEX19 MCCC1 WDR36 TPP1 GALNS CAV1 PTRF LOXHD1 RFXAP CD244 NCOA4 EPHA2 CAPN3 NOD2 SPG11 PPARG OSMR LMBRD1 IGHM RNASEH2A GNPAT BUB1B IRF1 IGLL1 FBXO7 POLH ARG1 MYL3 COA5 MARVELD2 ALG10 HBB COX14 PEX13 ASPM OPTN HMMR ACVRL1 TNNI3 BBS10 VWF G6PC3 ECE1 ABHD5 COG7 GATA1 MAN2B1 GSN KRT6B NDUFAF2 ASAH1 AURKA SGSH SLC11A2 MYLK2 GP1BB CD19 PSENEN IDUA ATP2A1 NDUFS4 MYH11 AK2 OSTM1 TNXB TPM2 GNS DUOX2 HFE2 GRK1 FSCN2 FAM83H GTF2H5 AMACR BTK PSPH SLC2A10 CLN6 SLC22A18 CRYAB FUCA1 SLC33A1 BLK EVC2 POLG YARS JAK2 CDKN2A ERCC5 BRIP1 DSG2 LAMA2 SDHB WAS TAZ GOSR2 BLM PRKCH
ALG12 FOXC1 SH2B3 D2HGDH TRIP11 TPRN GCSH CYP4V2 RFT1 SLC35A1 ALG3 GALE IL10RB POLR1C EVC MKS1 RYR1 PPARGC1B HGD ERCC8 MMAA SDHA COL3A1 SPTLC1 ADCK3 UROD GOLGA5 COG8 HMBS HOGA1 TLR4 PEX14 CAV3 WDR72 TERT PRPF31 HPGD SCARB2 IKBKG KRT16 ALDH3A2 LEPRE1 RPS10 NKX2-6 CPS1 MYH6 ELANE ABCB4 PKD2 AFG3L2 KRT4 KRT17 GYG1 SLC25A38 TAP2 PDHA1 HEXB PAX9 IFNG APOA5 CDC73 WDR35 VCL ITPKC HPD TREX1 PPP2R1B SERPINB6 BRCA1 CTSC MLH1 AP5Z1 BSND PPT1 MPDU1 ITGA7 ALAS2 RFX5 HAMP XRCC3 CARD9 TRIOBP PLOD1 IVD SEC23B POMT2 ESPN KRT81 FERMT3 TNFRSF13C L2HGDH SGCB BCL10 GUSB SQSTM1 MMACHC CASP8 ALG6 PNP CD2AP PROKR2 CDKAL1 RAX2 DSC2 TMEM43 MANBA ELN KRT6A CDH1 COL1A1 OGG1 SMPD1 CRTAP AGXT KCNQ4 PMP22 CLEC7A PKP2 HSD3B7 MED25 SERPINA1 HNF4A ITGB2 C2 BRCA2 FECH OAS1 OTOA SMARCB1 BBS1 SLC25A4 GNAT1 DES ERCC2 MOGS HLA-DQB1 CYP27A1 HAX1 CTNS SERPING1 C10orf2 LIPC FLNB SFTPA1 TRPV4 SRD5A3 AAAS ITGA2B RDH5 NCF2 CFB HSPB8 TNNI2 ABCB7 NDUFA9 NPHP1 RPGR ACTA2 MMAB EMD ADH1C SIX5 GLRX5 INF2 GPD2 PRF1 RPL5 MPLKIP FXN CD81 IDH2 GRXCR1 NBN SEPN1 OFD1 ASL UGT1A1 IRF6 GBA TNNT3 B4GALT7 PEX2 LTBP2 ALG11 AHCY NRAS RPL35A NDUFA10 PHKB MFSD8 GRHPR ALOX5AP NUP214 PDGFRL RPS19 TCOF1 FAH KRT10 ABCD1 UMOD KCTD7 GPD1L SURF1 EFHC1 DNASE1 GP1BA FBLN5 PEX1 ALG1 MCCC2 DKC1 HBA2 TGFB3 MCOLN1 KCNE3 COL5A2 XYLT2 PEX3 BCKDHA EMG1 MUC5B SDHC FAS ADAM9 ACAT1 NR2E3 ICOS DUOXA2 AGL CACNA1S LCAT NPM1 TP63 CISD2 PRPS1 SUMF1 LOXL1 CPOX ALG8

Table A.6: Gene list for WNT signaling process. This gene set is used to calculate enrichment for WNT signaling.

CHD8 ROCK2 NFAT5 PPP3R1 NFATC1 GPC4 WNT11 FOSL1 SKP1 WNT5B CREBBP PRICKLE2 MAPK8 SENP2 WNT3 PLCB4 NKD2 DVL1 WNT3A WIF1 GSK3B CSNK1E PLCB1 CTBP1 MAPK9 WNT7A FZD8 RHOA PPP3CC TP53 CTNNB1 CAMK2B PRICKLE1 DAAM1 PRKX CSNK2B WNT4 WNT8A MMP7 SFRP4 TBL1XR1 DVL3 SMAD3 MAPK10 APC MYC CCND3 WNT2 CCND1 PLCB2 LRP6 TCF7L2 VANGL2 CSNK1A1 DKK2 PSEN1 DKK4 SFRP5 RBX1 BAMBI FBXW11 SMAD4 FZD3 MAP3K7 FZD4 TBL1X SOX17 NKD1 RAC2 DAAM2 EP300 RAC1 PRKACA CXXC4 PRKACG WNT8B TCF7 NFATC3 CER1 FZD6 CUL1 PPP3CA PRKCG LEF1 FZD1 JUN WNT9B WNT5A DVL2 BTRC CSNK2A2 CAMK2G RAC3 WNT1 WNT10B NFATC4 WNT6 WNT2B PRKCB NLK CCND2 PORCN FRAT2 AXIN1 ROCK1 VANGL1 TCF7L1 PRKCA PRKACB PPP3CB RUVBL1 PPARD CSNK1A1L SFRP1 CTBP2 CHP2 CSNK2A1 WNT7B SOST CHP1 FRAT1 PPP3R2 WNT16 CACYBP WNT10A WNT9A APC2 SIAH1 AXIN2 FZD9 DKK1 FZD10 SFRP2 NFATC2 CAMK2A LRP5 TBL1Y CAMK2D FZD5 CTNNBIP1 FZD2 FZD7 PLCB3

Table A.7: Gene list for MAPK signaling process. This set is used to perform enrichment analysis using DeepASD ranking for MAPK signaling.

NF1 RAC1 MAPK1 FGF16 MAPK10 CACNG8 JUND RAP1A FGF11 TGFB3 CDC25B PTPN7 TAOK1 MAP2K7 CACNG3 STK4 MAP3K7 CACNA2D2 MAX MAPK11 RAP1B CACNG6 MAPK13 PDGFRB CACNA2D1 CACNA1A FGF12 PPP3CB TGFB2 FGF13 NFKB1 CASP3 TNF RELA FGF5 CACNA1S CACNA1C DUSP3 MAPT CACNA1G RAC3 PRKACG FGF8 CDC42 HRAS FLNA FGF2 EGFR CACNA1E TAOK2 RPS6KA3 DUSP6 CACNA1F MRAS MAP3K5 PPP3R2 NFATC4 PTPN5 CHP2 PLA2G4D RAPGEF2 MAP2K6 RPS6KA6 PPM1A RPS6KA5 PDGFA MECOM PDGFRA RRAS FGF1 TGFBR2 MAP3K20 BRAF PPP3CA MAPK14 AKT2 RAF1 MAP3K14 RRAS2 LAMTOR3 HSPB1 MAP4K1 FGFR3 MKNK1 MEF2C CACNG2 PAK1 GADD45G MAP3K2 MAP3K3 MAPK7 DUSP10 RPS6KA1 FGFR1 NFATC2 PLA2G4F

CACNA2D3 MAP2K4 MAPK8 MAPK8IP2 CHP1 DUSP16 TAB2 MAP3K13 MAP3K11 DUSP5 CACNA1H RASGRP3 PRKCB SOS1 ATF4 GRB2 TRAF2 CHUK IL1A HSPA1A DDIT3 PPP3CC GNA12 PLA2G4E MAPK8IP3 NLK PTPRR IKBKG ATF2 DUSP2 MAP3K1 IKBKB CACNG1 ECSIT RASA2 MAP4K2 PRKCA DUSP8 PRKACA ELK4 FGF19 NRAS FGF22 MAP2K2 MAP2K3 MAP3K8 PLA2G4C MAPKAPK3 RASGRF1 TRAF6 MAP4K4 PPP3R1 MAPKAPK5 DUSP14 STMN1 FAS MKNK2 FGF6 NTRK2 CACNA2D4 CACNA1I SRF PRKCG FGF10 AKT1 RASGRP2 MYC NTF3 HSPA6 FGFR4 FLNC MAP3K6 CACNA1D MAPK8IP1 FGF14 CRKL PAK2 DUSP1 TAB1 HSPA1B FGF3 MAPKAPK2 TNFRSF1A FLNB MAP3K4 PRKACB BDNF DUSP9 MAPK9 DUSP22 PPM1B FGF20 MAPK12 TGFB1 IL1R2 EGF MAP2K5 FGF4 CACNB2 ARRB2 PRKX HSPA8 MAPK3 FGF7 RASGRP4 IL1R1 CACNG5 JMJD7-PLA2G4B MOS MAP2K1 RPS6KA4 RELB CRK CACNB3 CACNG4 NGF SOS2 FGFR2 TP53 DUSP7 RASA1 ELK1 FGF18 PDGFB FGF23 NR4A1 DAXX FGF17 RAC2 STK3 PPP5C RASGRP1 MAP4K3 DUSP4 RASGRF2 MAP3K12 TGFBR1 NTF4 GNG12 IL1B PLA2G4A AKT3 ARRB1 FOS RPS6KA2 CACNG7 KRAS FASLG HSPA1L HSPA2 PLA2G4B NFKB2 CACNA1B CACNB1 GADD45B TAOK3 FGF9 CACNB4 JUN GADD45A FGF21 CD14 NTRK1

Table A.8: CHD8 target gene list part 1 (of 3 parts) containing 750 genes. This gene set is used for enrichment calculations of DeepASD ranking with respect to CHD8 targets.

ARID1B MORC3 UBAP2L SMARCA4 CUL5 TMEM39B NR2F1 ZNF512B ATRNL1 UBFD1 ZKSCAN2 USP47 CKAP5 FKRP SMAD2 ADNP LDB1 APBA1 CAPRIN1 HNRNPA3 ITSN1 ETS2 HELZ OGA HNRNPA1P10 USP10 USP48 FAM133DP ZSWIM6 FAM117B MED13L KAT6A ZBTB20 SPIRE1 DUSP7 KDM2A TOP2B RNF6 PROSER1 TPI1P2 ZBTB4 ZNF384 RALBP1 ZDHHC8 INA ASH1L TBL1XR1 TERF2 ZNF608 DVL3 NCOA2 DDI2 RGL1 TLE3 ZNRF1 ATN1 HECTD2 PTPN4 CNOT6 LCORL POGZ PHF21A BIRC6 ZMIZ1 EIF4G3 SLC22A23 FAM155A USP9X ZNF746 HNRNPR ASH1L-AS1 RNF44 DIS3L2 MED15 DBN1 SETD5 NCOA6 ING1 GATAD2B KLC2 BMPR1A SMG7 ANKRD52 NAA25 TMEM121B ACVR2A SRRM2 SPIN1 SMCHD1 ZC3H7A WDFY3 NCKAP1 ZC3H4 GPBP1 UBR4 PEG10 GTF2IRD1 PHC3 NECTIN1 ABCE1 YWHAH RO60 ZNF319 TSPYL1 CCSAP ANKRD11 DIP2C ARHGEF7 SPATS2L MEMO1 DEDD PAFAH1B1 LATS1 CDK5R1 TMEM131 STK35 LRRC41 PPP3CB STT3B PHF8 TRIP12 NUAK1 PBRM1 ITCH VEZF1 KLF13 DOT1L PTPRS PRRC2A DCAF10 NUFIP2 NCK2 SMIM10L1 DSTYK FBXO42 DYRK1A RFX3 MAP1B DIDO1 RPTOR AFF4 LMTK2 MAGI1 VPS54 WDTC1 OPA1 KIAA1549 FAM171A1 NCL ZBED6 SIN3A PTEN CIC UBE2R2 KLF7 RAPH1 RALA CORO2B INSR POLR2A MEIS2 SNORD68 PPM1A SRSF4 PPP3R1 TRIO COL4A3BP CDK19 PDS5A RNF165 ECPAS PIP4K2B KANSL1 ZC3H7B LINC00167 HMG20A YME1L1 ZFP91-CNTF SYNCRIP HNRNPM PHF12 CHD1 PBX1 MGRN1 BPTF SCAF8 PLEKHO1 MNT HNRNPUL2 ZFAT NCAM2 MIR1281 SNORD105 TM9SF3 SNORA80B TAOK1 SPTBN1 ATP9A TRRAP USP22 SPECC1L LRP8 PDS5B MSL1 EGR3 KDM5A WIZ ASB7 BNIP1 TTC9 MED13 KIF3C AGO2 ZFP64 SNX27 KPNA6 AMD1 EIF4ENIF1 MAP2 ATXN2L KLHL42 NR1D2 CDH2 HDGFL3 LNPEP TCF4 CLASP1 MARK2 ELAVL3 CPEB3 TENT4A KDM3A GTF2I DEK SKIL SNORD42B FAXC SRGAP1 HP1BP3 EPC1 KMT5B RAPGEF2 TNPO2 ADCY5 RAB14 DDX6 TNPO3 CCDC6 FBXL17 NUP50 ADAR SNORD101 AGPAT3 HSPA4 LRRC4B FOXP1 KIF1B CUL1 NR2C2 XPO6 ZNF398 SLC38A1 ZNF532 RBM15B HERC4 C16ORF72 DMXL2 HIST2H2AB ZNF146 NRIP1 KDM5B PPP4R3A ZC3H11A NSD3 PRDM10 RAB2A SRF KLF12 KDM5C ANKH UBQLN2 AKIRIN1 SNORD12 EHMT2 TMEM151B RAI1 DHX29 RBBP5 CMIP ZNF827 SF1 RC3H2 GTF3C4 WASF1 ZNF777 PTBP2 PLAGL2 LOC145783 MBTD1 GNL3L GIGYF2 XPO4 USP7 ATP2A2 ATRN HIVEP1 CHAMP1 MAN1A2 PIM1 ATXN7 CBLL1 
RCOR2 RAP2B RTN4R ZC3H3

WAC MYO5A SCAF4 XPO1 DENND5B SSH2 CDC40 AUTS2 PRDM15 TTYH3 SSBP3 FBXL19 AKT2 RPL21P28 NUCKS1 TLK2 FLRT2 GGNBP2 HNRNPU LOC256880 CAMSAP1 TRIM2 CPEB4 FUBP3 TFAP4 HNRNPA0 SNAP25 C11ORF68 SMAD2 ZNF821 TNRC6B UBE3C YTHDC1 SETD1B LOC257357 BCL9 BCAR1 TEX10 PDE4DIP SYT14 FOXO3 DYNC1LI2 SF3A2 FAM117B CRKL SMARCC2 TAF4 DCAF5 PPP1R9B LOC400684 RIMS2 FARP1 LARGE1 CCNT1 USP12 CBX6 TUG1 SP4 INA FUS CTCF LRP6 HUNK ARID2 LOC100272217 STRN4 SMAD7 REPIN1 GSE1 CNPPD1 EXOSC6 ZFX DIP2B LCORL ARHGDIA SPEN BRAF KPNA1 TNRC18 LOC100289361 KDM1A KDM3B CAND1 PHF13 ZCCHC2 CHD9 CDC42EP3 GPATCH8 DBN1 ZNF689 GSK3B ATP1A1 ARID5B RPRD2 LOC100289230 HIRA WRNIP1 TRIM33 ZDHHC14 ZSWIM8 RPRD1B CBX3P2 SEMA4C ZC3H7A IRS2 KIAA0232 PUM1 ZNF362 CACTIN NIPBL-DT YY1 WDR82 DPYSL5 KIAA2026 RIF1 TUT4 MROH8 CCDC88A CCSAP TNPO1 SRCAP RFX7 RNF19A STAG1 MZF1-AS1 UBTF NOL4L SFPQ RIPOR1 GABBR1 PSMD11 FAM160B1 CNOT6L PHF8 PTBP3 FBXO11 FBXW11 IREB2 U2SURP MIR2861 ARHGAP21 DNMT1 ZFYVE9 ZFC3H1 H3F3A SETD1A TBL3 DLG5-AS1 FBXO42 PSD CTNNB1 INO80 NIPBL IGF1R GABPB1-IT1 MAP2K4 NPLOC4 PRDM2 CELSR3 TLK1 PRKAR1A SMAD1 CALM3 ZBED6 CTBP1 PHIP NAA15 NCOR1 SOCS7 SNORD43 ARHGEF11 ZNF618 ZDHHC17 TRAPPC10 MAP4K4 ZNF592 PRCC MAPK8IP2 PPP3R1 TSC1 ZMYND8 EP300 PDIK1L GNB1 SNORD84 AMBRA1 UBXN7 UBQLN1 SOGA3 FAM13B DDX3X MIR1915 CIRBP-AS1 HNRNPM MVK PHF2 INO80D KAT6B SF3B2 STPG3-AS1 FBXL14 HCFC1 USP24 B4GALT5 UBE2Q1 ZNF668 ETV6 NAMPT SNORA80B ZDHHC21 SRPK2 DPYSL2 RSF1 MAP3K4 LOC646903 NEDD4L NUP153 TSC22D2 ETV3 CBX5 CLSTN1 GPAT4 HSP90AA1 TTC9 SNORD12C SETD2 HECTD1 YWHAG LRFN3 LOC644656 AP1G1 APP GID8 ATAD2B NFKBIL1 CUL4A DHX30 B4GALNT1 LNPEP GTF2A1 IRF2BPL PSD3 FUBP1 CHERP FBXL19-AS1 ZBTB2 ZNF629 NCOR2 USP14 PPP2R5E ELMSAN1 PAN3-AS1 ZBTB11 EPC1 HMGB1 BAZ1B MEF2D SPTAN1 RESF1 LOC729683 GMPS RANBP9 WDR33 KDM6A ZMYM4 NEO1 WBP11 PAPOLA LRRC4B TEX22 CLTC ZMYND11 GNAS CTDSPL2 LOC7795 ZFAND5 PRKACB KPNB1 UBE3A WDR5 UCHL5 C1ORF21 ZNF710 NRIP1 ZBTB12 PRR12 TNKS KLC1 MAP2K5 LOC730183 PPP1R10 SLC23A2 PSMD3 NFAT5 ARHGAP33 CDK17 BCL7A NCOA5 TMEM151B
INTS2 SETBP1 FBH1 ZFP14 RNF146 BICRA XPO7 FASN ATXN2 SMARCE1 LANCL2 AGFG1 MACO1 SENP1 GNL3L TASOR NSD2 PRKAR1B USP46 ARIH1 EP400 ARFGEF1 CLOCK PTMS SREBF2 CARM1 STKLD1 LPGAT1 VPS35 ZC3H3 PELP1 SATB2 RNF145 USP15 ANP32A YWHAZ REXO1 UCK2 FEM1B EPC2 VPS4A KCTD20 LRFN4 IVNS1ABP NUCKS1 POU3F3 CNOT3 SON HNRNPD ARID1A BICD2 INTS6 TBC1D9 MATR3 ELOVL6 CEP170 ANKRD50 RBM33 PCM1 ZNF821 JAK2 DYNC1H1 CBX4 CELF2 CNOT4 CDC42BPB LSM14B USP49 MEF2A HEXIM1 SNORA3A HIPK1 KCTD1 FAM120A CRKL DDHD1 ZNF462 ASAP2 CPSF6 ATP1B1 ELAVL2 RCOR1 ZBTB45 FOXJ3 KBTBD2 TIAL1 ARF1 CCDC71L HNRNPUL1 FUS MFSD14A DNMT3A USP34 RUFY3 PPP2R1A SP3 NRF1 FNBP1L MAPK1 ZC3H6 SOGA1 TRERF1 UBE2K PTK2 ARHGDIA SOCS5 MYH10 DICER1 LDLRAD4 NEURL4 ZCCHC14 MORC2 CLUH EFNB2 FERMT2 SAFB CTTNBP2 ZZEF1 APLP1 ZNF689 CREBL2 CDK13 VCP ZEB2 CLIP1 MYCBP2 AP2A2 EXOC5 BRD2 DNAJC11 CALM1 BASP1-AS1 BMPR2 ADD1 IRS2 SCRN1

Table A.9: Part 2 (out of 3) of CHD8 target gene list containing second 750 genes.

GDI1 CREBRF IFFO2 ZBTB14 SLC16A14 MLLT10 MTDH ZNF576 CNEP1R1 TTC39C DDX5 HSPBAP1 KIAA0355 MVD DDX17 ZNF282 NRBP1 MED12L SNAP25-AS1 HNRNPH3 IMP4 FAR1 FSD1L TKT SPINDOC SLC25A4 CWC27 ZNF687 DLEU2 RANBP1 PTP4A1 SOBP EIF4G2 R3HDM1 CEBPG ARRDC3-AS1 CNOT7 TNRC6A HIPK3 TAPT1 DENND1B IER3IP1 PKM EEF1A1 ZC3H15 FAM171A2 ZBTB7A UBE2G1 YBX1 PPAN-P2RY11 SPCS2 CHMP1B USB1 SRSF2 RNF126 PPM1B MTCH2 BNIP3L TK2 PRR19 MYO18A ADO BEND3 TRIB1 MIDN PIM3 DACH1 CCNT2 LUC7L2 TOPORS BORCS8-MEF2B RGMA NDUFAF3 RBBP4 SRSF3 PIAS4 LRRC47 NAT8L IRGQ C7ORF26 PAXIP1 C6ORF136 LRRC8D FAM53B PFN2 SENP6 PLXNB1 GPC1 SMARCD3 LIMD1-AS1 BRWD1 MAP7D1 SAFB2 YARS USO1 UBTD2 U2AF1 TMEM50A MBLAC2 GMFB RAB5A ZBTB8OS URI1 RPL27A DICER1-AS1 SS18L1 RSBN1 DPAGT1 PGBD4 VASH1 CITED2 NAPG RBM25 PTPN2 EIF5 TTC9C STX16 PNISR CCDC92 EPM2A C11ORF95 GTF3C2 SEC24B WDR4 C22ORF24 FJX1 ZNF518A IBTK SACM1L AP5B1 ZBED3 TMEM259 ZNF143 VDAC3 C4ORF48 MAZ DBP AXIN2 AEBP2 SOX12 DNAJC27-AS1 LMNB1 FKBP4 SMURF2 CBFB PIP5K1A GPBP1L1 SNX18 RAB1B RBM17 PATZ1 AKIRIN2 TAPT1-AS1 ATF1 RHOBTB1 NEMP2 N4BP2 HIST1H2AC ATXN3 KCTD3 PAIP1 ARPC5 NHEJ1 MASTL LRRC27 METTL8 SCAMP1 HNRNPA2B1 CUL2 KDELR2 MIER3 SPRY2 KCMF1 SPNS1 WTIP C1D PPP4R1 MYEF2 PSMC2 PCBP2 SUGP2 PCBP1 TTI1 LRP12 TIAM1 MEX3D RPS26 DRAXIN MLF2 PPP2CB YWHAQ SLC25A40 NHLRC2 MNAT1 SLC44A1 IPO7 BASP1 CIPC CLPTM1 ITPK1 UBALD2 PEX5 DDX39B RBM26 AMFR DCUN1D4 MEAF6 SOD1 MRPL40 ATAD3A FAF1 JMY WAC-AS1 SLC20A2 ATL2 HNRNPA1 CYSRT1 DNAJC7 EMC8 LINC01465 ATAD5 ARFGAP2 RWDD1 LPIN2 TSN HACD2 ZNF770 ATF2 XXYLT1 PANX1 ARID3A HSPA8 EIF1AD UBE2V1 MEIS3 C7ORF25 RCC2 SCAND1 FKBP7 NSA2 NEGR1 CNOT11 RTN3 ZNF664 RBBP6 EIF4H MAP4K5 JUND ZBTB1 DGAT1 C21ORF62-AS1 CYB5D1 GET4 SENP8 LIMD2 NDST2 TMEM170B API5 ARF6 SOCS6 HNRNPH1 ZFR PXT1 TRAF3IP2-AS1 UNC50 TRIR SFR1 PDCD7 GNB1L CAPZA2 FAM110B MANEAL CPSF2 CDC27 PTMA RBM4 CIZ1 ZMIZ1-AS1 UBE2M TRA2B METTL9 DPP3 TLE5 SLC2A1-AS1 SLAIN1 DLG5 ATXN7L3B MAPKAPK5 DNAJC27 GABPB1 SOWAHC HMBOX1 HEXIM2 SEC1P CHN1 CIRBP CLUAP1 POFUT1 MAF1 VAPA TRPC4AP PDZD8 H3F3AP4 CEP350 YTHDF2 
RPL21 THAP11 TM2D3 HIST2H2AC KLLN TLCD3A MAPK11 SUB1 TCF25 TMEM9B EIF2B5 MEX3C ELMO2 PPAN TYW5 ODC1 ZCCHC3 MGAT4B PEX13 HDGF REPS1 ATP6V1G2 FAM133B PEX3 ARFRP1 67 FNBP4 HDAC2 DCLK2 PLCG1 WASL ZNF341 SMG7-AS1 PSIP1 VPS37D ZNF219 RBL2 FOXN2 ATP6V1G2-DDX39B E4F1 C8G CAB39 HNRNPK MVB12B BCL2L1 TMEM187 TEF SH3GLB1 SFT2D3 RAB33B GABPA FAM120AOS CASP3 EIF2B4 STAG2 SNRPD1 ZC3H12C NETO2 NACA VDAC1 C1ORF122 SRRM1 FHOD3 RRAGA PPP4R3B DNAJA2 TMEM245 MRPL48 SPRED1 UBE2B REX1BD PPP2CA IRF2BP2 DAPK1 PPP2R5C FIP1L1 RUFY2 MAP3K12 GCLM SURF4 TMEM17 H1FX CHMP6 NAPEPLD RNF11 CASC3 PTPN9 RAC3 ARL5B PSMG4 C9ORF40 SEC62 BZW1 FIZ1 TAB2 SREK1IP1 H1FX-AS1 LINS1 PKIG GALE DTD1 C16ORF87 SLTM NFYA YPEL1 PET117 TMPO-AS1 GOLPH3 IER2 ZNF579 STMN1 MPND CDC42 PLAG1 CCT2 NDUFA4L2 USF2 MKRN1 MGAT5B FLVCR1-DT ZEB1-AS1 NDFIP1 TMEFF2 DENND6B MYO9A SLBP USP1 LRRC58 FRA10AC1 MSI1 COX4I1 RAD21-AS1 USP3 UBN2 CHSY1 BAG5 UBE2L3 DRAP1 MLLT11 UBE2I TMEM253 HOMEZ BEND5 TMEM87A RNF103 FMC1 ATP6V0C PPP6R3 FAM20B TCERG1 MAPKBP1 RTL10 ARMC5 CKLF-CMTM1 SOD2 KANSL1-AS1 TUBB2B ARID4B IKBKB CUX1 VEZT MN1 LOC100128164 ASB6 IGF2BP1 UBC RETREG2 RPL22 LINC01184 YRDC NUDT3 SPAG7 TRIM13 TOB2 PPP5D1 MSRA ARHGEF12 ACTG1P20 RAN BRD9 AP4E1 RDH14 KLHL24 SMG6 ZNF428 GAR1 MAML1 MKRN3 DBF4 TMA16 TRDMT1 TP53I11 IRF2BP1 KLHL9 DLG1 MARCKS ENAH ATF6 RCOR3 CAPZB MYC MPLKIP NELFB FAM47E MTX1 MED18 HMGXB4 SLAIN2 PELI2 ZNF628 BANP TOB1-AS1 TOX IGF2BP3 NDUFC1 MAPK1IP1L UBOX5 RAB5C YWHAE ZNF24 ZBED3-AS1 DUSP4 ERF SRRT RIMKLB FBXO9 TMEM184B PPP1CB ARPC4 UBXN1 SRRM2-AS1 PCGF2 CDKN2AIP ANAPC13 ZEB1 MZT2B ZMYM2 FYN ZNF507 LMO4 MRAS EIF4E CASD1 LOC729614 RNF208 DCLRE1A SENP3 MED16 PXYLP1 FCHSD2 FRG1 PYGO2 LACTB2-AS1 LOC344967 KIF9-AS1 SET DCAF17 CAMK2N1 MTF2 EARS2 ERBIN TGFBR1 AHCYL2 LINC00461 EXOC6 MRTO4 BAHCC1 NKIRAS2 ZNF652 NELFA TESK1 COPS7B RER1 MIR17HG PRELID3A ABCA17P MMAB C10ORF25 HIST2H2BE CDC42SE1 HOMER3 PANK3 B4GALT6 RBPJ TET2 DCUN1D1 AKAP8 ACTG1 STK39 ACADSB MAT2A MALAT1 TRMT5 PPDPF ZBED5 SCRIB PYGO1 GNB2 H2AFY2 SNAI3-AS1 SLC7A6OS 
PHF20L1 BLMH CCDC85C SKIDA1 SLC43A2 NCOA7 CEP97 HIST1H3E MTHFD2 ANKLE2 LOC100128398 REV3L HDAC5 IRS1 PCNX1 SLC12A2 ZNF524 FDFT1 MAP3K1 RNF138 CCDC115 ZNF581 MPPE1 TTC32 CENPS-CORT ARL8A MEX3B KCTD15 ACTR3 LSM14A RSRC2 MBOAT2 RHOT1 RPL3 SNRPD3 S100PBP EMC7 MAPKAPK5-AS1 GPR161 CFL2 MOB4 PPP1CC MEPCE PCBP1-AS1 NME6 RNF130 RAD21 TOB1 CKLF VPS37A FAM219B MTX2 NDUFS7 GDI2 LSM4 MAEA ATE1 PPP1R9A MKLN1 FNDC3A TRIP13 GABPB1-AS1 ATG3 WNK1 DYNC1I2 CCDC97 PTCH1 FAM173A TSPAN4 C1ORF174 NAA30 KLHL23 WDR43 SUMO2 CLPX LRP3 CHRAC1 GORASP2 RHOB CHTOP GAS1 IRAK1BP1 COPB1 BMI1 OSBPL5 ZNF638 SCMH1 FAM98B SLC25A51 ZNFX1 PRMT2 TMEM250 PALM TRIM8 PCIF1 VPS16 MTREX ZNF580 PIGP AP5S1 ZFP91 RAF1 SPOP HERC3 ZNF48 SBF1 RABL6 ERC1 TLNRD1 KIF5B SLC35B1 MBD3 CHCHD7 TBC1D5 DCAF12 ATL1 GMEB1 ACSL3 PDE7A DNAJB14 RTN4 SAMD1 CREBZF WBP4 TMEM263 ESF1 ATG4B PHF21B SSU72 S100A13 PPP6R1 UBE2J1 PWWP2B CCNG2 IBA57 POU2F1 UNK MSANTD3 ANP32B CCT7 THUMPD3-AS1 IKZF5 BLOC1S6 PRKAA1 ARL5A

Table A.10: Part 3 (out of 3) of CHD8 target gene list containing last 584 genes.

COX6C FBXL15 NSMCE4A THAP2 UBXN4 GPSM3 TTLL12 UQCRC2 CYP51A1 ATRAID PKD2 MTBP MANEA ACO2 SERINC2 C11ORF58 NBR1 USP30 PSKH1 PAQR3 SEC23A CRBN AGK RETSAT STAT5B CMPK1 SP1 PEAK1 OARD1 STX18 ADAMTS10 LEMD2 DHX40 MTMR10 SAAL1 TCHP TBCD PKN2 UIMC1 MRPS18B MARK3 PPCS SLC35D1 SIDT2 ZFP36L1 TBCK PIK3R3 CEP104 DDX28 TSC22D1 EFCAB7 AVEN UQCRH ARMC6 FBXO24 ALG9 GUCD1 PHACTR4 POGLUT2 RAD52 RPL23A SLC6A9 COX11 FGD6 ANKRD16 TSFM SDCCAG8 RBM12 FLVCR1 ZRANB3 ZNF646 PPP2R3A COX20 BIVM UBE2D3 RBM5 SLC2A1 PSMG2 DNAAF5 MTMR9 RTCA MRPL44 RDX TDP1 ITGB3BP FBXW5 FZD2 APOLD1 NAE1 REEP3 ITFG2 ERLIN2 LIMS1 PUS10 PPP1R21 ANAPC5 MPC1 TMED2 TACC1 TYSND1 BORCS8 SKP1 AK3 LINC01089 ALOXE3 WRAP53 PEX19 CPNE1 ZNF624 STRN3 ZBTB25 ACBD5 NAA38 YARS2 UTP18 DTYMK SGCE KLHDC2 SFXN5 ICE2 SGMS1 WDR45B TMED1 C9ORF43 THAP3 CERS2 ATP5MC2 GRK4 RASA2 DDX41 DDR1 SPG7 RAB18 ATXN7L2 DDX54 HDAC7 ACYP2 MRM3 LRRC8A CD276 RAB34 PFKM CEP76 KNTC1 LIG3 TMEM242 ADAT2 CCNE2 USPL1 AAMDC AP4S1 MMP15 SLC38A6 UHRF1BP1L ELOB NOP14 UBE2N GAPDH PHF5A RPL24 AASDHPPT ARMT1 ITPKC USP6NL LRRC75B GTPBP3 ATG4C TMPO EIF1 POLR2L ULK2 MYL12B EFS MRPL22 FITM2 USP13 BAD ZNF335 SELENOO LZTR1 BANF1 RCN2 CLK3 ZMPSTE24 PSMD14 GALNT7 OGFOD2 EEF1B2 PNN TTC3 SAMD14 NEK4 NDUFB1 CASC4 CBX3 POLE3 NUP205 H2AFZ MSI2 UNC119B WDR34 LENG8 PHC2 EFCAB5 ARRDC3 PSMC6 CTDSP2 ZNF518B LUC7L UBE2Q2 NDUFS4 POLK HACD3 ALS2 COIL SCARB2 BSCL2 ID4 GPR137 TRAPPC3 INTS9 NSL1 INTS14 ARID4A ABHD12 ELMOD3 GNL3 MIB1 STX16-NPEPL1 ZNF22 WTAP MAD2L2 SHLD1 ZHX2 C2 DISP3 DDX31 NT5C2 BBX TSEN54 SEPHS1 AMOTL2 NPDC1 BTF3 STX10 LRCH4 MAML3 CLK4 MORN1 NDUFS1 ZNF180 FAM160B2 CALM2 ARL3 TBCC PPP1R15B ACTR1B THAP4 COX15 KPTN ESYT2 ZNF276 CRELD1 PXDN

68 SLC66A2 TADA3 DLAT ANAPC10 MGAT1 IRF2 ACBD6 DPY19L3 TXNDC15 GEMIN5 AP4M1 TMEM94 USF1 EWSR1 GINS1 TMEM222 SRR FBLN1 CHD7 IST1 SASS6 CRYZL1 FBXO7 CCDC57 SEC63 PAN3 COPS8 NHLRC3 CTBP2 MDK CYB5D2 NME5 SMPD4 KIAA0753 CWF19L1 ZNF606 PAFAH1B3 COA8 ABCA3 CUTC RHBDD3 CLIP4 TRUB2 TBC1D12 ITGB1 KAT2B CPS1 THBS3 GLUL CRLS1 GNA13 SZRD1 SCCPDH SIRT5 HPD MTFP1 PFKFB3 QSER1 ZMYND12 CCDC50 CFAP20 ANKRD40 MORF4L1 THOC7 CDK9 LPXN CDS2 CLK1 VTI1B USP36 NSMAF RBL1 FAM89B PCED1A TXNL1 WDFY3-AS2 TIMM50 HMGN1 TP53BP2 RNH1 B3GLCT PRMT7 UFL1 TTLL4 PSMA3-AS1 PTGES2 MRPL58 LUC7L3 SMAD5 DSE WEE1 DDHD2 NUDT5 XRRA1 PBLD PHTF1 USP25 ZFAS1 UBE2D2 RPL13 GGPS1 BACH1 PCNA DZIP1 DDX59 FLAD1 SLC35F5 GANC RBM39 C1ORF43 NEAT1 SNRNP70 ZNRD2 APPL1 SHARPIN PTBP1 TMEM69 ST7L CCDC136 CEP95 PDE3B YPEL5 PRADC1 B4GALT7 CACYBP CCNL1 ARPC4-TTLL3 EDEM1 ZNF786 MGME1 RMI1 MAN2C1 CCT4 C17ORF75 AP5M1 RPL35A KBTBD3 TPM4 MRPL1 ATP11B ZNF32 PIGL KIF13A TTC31 BOLA3 PTGES3 MIS18BP1 NOL12 ZGPAT MYO1B RPN2 PTPRJ DOCK7 PCID2 FBXO36 HIST1H2BC PRPSAP2 ANAPC7 PSMA4 STAM2 TMEM267 BOD1 MRPS35 FUT11 CEP63 SCAF11 FADS2 SHF CASKIN2 ZNF565 PDCD4 PXK AIMP1 C12ORF60 ECHDC1 NBAS IQCG GSTA4 TEDC1 TSPAN3 SYPL1 USP8 SLC49A4 PHLPP1 SNX17 SECISBP2L PARD3 NDUFAF5 RPL38 PLEKHJ1 GABARAP NDUFA9 SFSWAP PARK7 GSR NIPSNAP2 CNN3 ORC6 EXD2 PLEKHA3 VEGFA DALRD3 RFXANK SS18 PAXBP1 EPB41L5 COQ8B DCP1A KLHL35 GART SQSTM1 TCTEX1D2 TATDN3 PHPT1 C1ORF198 HIST1H4H PCBP4 GLOD4 TRMT13 KIAA0586 SNAPC2 LSM5 NSRP1 CDK2AP1 TRAM1 ARL2BP MCCC1 COA5 IGDCC3 COQ4 GFM2 MCM7 C1ORF35 RPS12 TARBP2 SELENOT EFCAB2 PSMA1 CENPS UVSSA POLR3G GPATCH1 ANGEL2 ASF1A ZCWPW1 KDSR ATP5PF EFCAB11 MAIP1 BUD13 EMC1 MFSD11 SLC35A5 ME2 YDJC EIF3E RCC1 NFKBIA TXNDC17 MRPL13 STXBP4 FASTKD5 CDC123 ASPSCR1 CCDC77 MCL1 PSMA7 FAM204A ENOPH1 RCBTB2 TMTC4 TMEM168 SLC39A6 TP53 NBN DENND4C CCDC142 TIMM9 FAM102A PLPBP HIPK2 PSMB3 DFFB ATR BCAR3 TBXAS1 ELP2 COX5A WDR36 SNX3 NREP MXI1 DMTF1 LETM2 TCF12 LPIN1 MRPL35 DIS3L CAPZA1 FBXL5 PFDN4 VPS9D1 SNRPC TDRD3 SLC5A6 RMND1 TMEM237 ZNF79 KYAT1 ANO8 
ATXN10 SOCS1 PSMD9 LANCL1 CNBP AVL9 TRMT2A SNX5 SFXN2 CARMIL1

Table A.11: RBFOX (splice) target genes used for calculating DeepASD ranking enrichment.

CAMTA1 COL25A1 DOCK7 PPP1R9A NFIA QRICH1 CACNA1S YTHDC1 GRIP1 SRC CHL1 WAC C330007P06RIK PRUNE2 KCND3 RHOT1 MYO1B SCN2A1 NRAP BCAS3 FOXM1 NADKD1 STRN3 ZFP532 ARHGEF12 2210018M11RIK CLDN25 PCNX STXBP5L SH3GLB2 GSN KITL HNRNPA2B1 CADPS2 RAPGEF6 SETD5 PLAG1 TMED2 DBN1 POSTN SULF1 PLEKHA6 GRIN1 WNK2 HEATR5A LAMA2 PLCH1 ABLIM3 FAM38A STXBP1 PAM GIT2 COL6A3 IZUMO2 TRIM24 PLEKHA5 RBFOX1 PPP3CB MTAP7D1 NTNG1 KIF13A CTAGE5 KCNH7 KIF1A AP1AR SLC38A10 TKT GRIK1 FAM70A FIGN RTN3 NCAM1 DNM1L SH3PXD2A NRXN1 SMARCA2 KTN1 NFIB HPN PUS7 KIF9 CHKA ZHX1 UBE2H CAMKK2 BAZ2B DOCK4 SCN9A SEC24B ADCYAP1R1 8030462N17RIK LRRFIP2 SYT6 UNC79 MYBPC1 TMEM183A BRD2 TBC1D1 RBFOX2 MTMR1 MAP3K7 SNAP23 LUC7L2 SLMAP PTPRZ1 NDEL1 AUTS2 ZFP112 FBXW11 RNF167 AKAP9 SIRPA MBNL2 CLASP1 VPS13D EPB4.1L3 FAM168A GFPT1 BCLAF1 ERC2 NDRG4 MAP2K5 STRADA ABLIM2 ROCK2 ZBTB20 CAMK2G CADM1 UPF3B SEC24C AFTPH MYCBP2 PAK3 MYO18A MYEF2 RPS6KC1 PICALM MED15 ARPP21 FNBP1L KIF21A ODF2 SHANK3 DIAP2 GOLGA2 CNNM2 REEP6 GRM5 CSNK1G1 PRPF40A ATP2B4 SRSF11 ARID4B ATF2 PTPRD ERI3 ZFP384 NAV3 CCDC88A USP47 TRAPPC9 CORO6 SLC22A23 EPB4.1L2 EXOC7 SUMO2 FBXO8 TTC13 EXOC1 PLOD2 TIMM44 UGGT1 4932438A13RIK GOPC LRRFIP1 RAI14 COL4A3BP DCLK2 2010106G01RIK SIDT2 CASP2 ZKSCAN5 CADPS FAT1 IQSEC1 RYR1 MPP6 EP400 STK39 RSRC2 SRSF3 VEGFA OCIAD1 SMEK2 BBX HOOK3 NRCAM TIA1 DCTN1 PPFIA3 EVI5 ARPC1A ACIN1 SMARCE1 DEPDC5 RALGAPA1 RNF130 CACNA1G PHF21A SCN3A ADAM22 0610005C13RIK MAPT MYO9B WNK1 PXDN MPRIP LPHN3 CAR12 NASP FAM13B DYNC1I1 TPD52 GRIA2 GPATCH8 CADM2 WHSC1 DTNB SLC9A7 MCRS1 ITGB1 TJP1 OSBPL9 MAX HIPK1 PI16 OSBPL3 USP28 BAI3 NUMB DIAP1 PDE4DIP KIF2A RBM33 FCHSD2 2500003M10RIK LRRC20 NACA TCF12 ACAP2 NELF FAT3

69 SYTL2 APP LMO7 COL13A1 PBRM1 APOBEC3 MGRN1 ADD1 KIFAP3 PEX5L PROM1 UNC5D POLDIP3 CRX HIVEP2 NUMA1 RIMS2 PTPRM PARD3 ERBB2IP UNC80 GYG CD47 DEPDC1A EIF4A2 2210404J11RIK MLLT10 RBM6 SEC16A PPP1R12A ESYT2 MADD SYNE2 EIF4G1 MALT1 MYO9A NDRG3 MEG3 NFIX TARBP2 ANK1 CDK7 FNBP1 ABI1 ADD3 OGDH ST7 FN1 SLC4A10 H13 SF3B1 CACNA1C PLCE1 MCM10 TRIP10 ADNP DYNC1I2 PTPRK IP6K2 DAB2 MYH11 ATG16L1 CLASP2 PPFIBP1 R3HDM2 GAPVD1 5730419I09RIK PNPLA6 FKBP14 TSG101 FRY LRP8 EPB4.1 OSCP1 CASK TTC3 SLK OSBPL6 DCUN1D4 LAYN ARID3C NFATC3 CAMK2B 4930506M07RIK CAPN3 ROBO2 R3HDM1 GABRG2 SORBS2 SLIT2 SNCAIP RLTPR USP37 CLTA 2310035C23RIK RYR2 CNKSR2 VPS13C PPIP5K2 FYN ZFP207 KCNMA1 ABI2 EPB4.1L1 EML4 G3BP2 LSM14B TCF4 RHBDF1 MFF RNF115 MBP PRPF18 NFYA PCBP2 KCNQ5 KCNQ2 NKTR ANKRD10 MLLT4 GOLIM4 PLEKHA1 FMNL2 PRRC2B IQGAP1 ACTN1 CSDE1 CNNM1 RAB11FIP3 MACF1 ARHGEF6 MAST4 ALCAM LRRC7 PCID2 MON2 CCNL2 TPK1 TBC1D24 SCN8A SRSF6 RBM39 TLE4 PPP6R3 SPEF2 MUSK AKAP2 TEAD1 KCNIP1 CNOT2 NFASC EIF1AD PLA2G6 MEF2A LPHN2 LIMCH1 HNRNPD D4WSU53E TCERG1 RFX3 APBA2 ITFG2 FLNB TTN RAPGEF1 APC CACNA2D1 SNAP25 MAP4K4 BIN1 CSNK1D NEK1 ARFGAP1 TRDN PPFIA1 TRA2B TSC22D2 RPS24 KRIT1 EFNA5 NUP98 SLC7A2 MORF4L1 BPTF SEC11A SLC24A2 ARNT DGKH ADAM15 SIPA1L2 2410002O22RIK MAGI1 PTK2B GTF2I KCNAB2 NEB ELAVL2 CAMK2D DST KIF16B SNX14 1110054O05RIK TBX3 STMN4 UBTF EEA1 DOCK11 ARAP1 AP3S1 CEPT1 PLEKHM2 ECT2 CKAP5 KIF1B DNM2 SYNE1 CAMSAP1 SSBP3 CDC42BPA TBC1D4 TCF7L2 ERG XIAP RAB6A EIF4G3 SORBS1 AKAP11 CTNND1 ACTG1 LMAN2L ITPR1 RPRD2 SRRM1 BTRC NBEAL1 TNIK ARHGEF25 H2AFY DNM1 CSNK1G3 MAST2 SYNJ1 MSI1 DLG1 NLGN2 ATP11C TOR1AIP2 HNRNPK PRPF38B ATXN2 MAP2K4 MAPK14 ODZ4 CCNL1 KLC1 PTK2 RUFY3 NRXN3 PHC3 FAM126B ZFP385B SUN1 SARNP TNFRSF12A WHRN SLC25A3 MBNL1 CLTC EWSR1 CLIP1 SPAG9 ANK2 PPP3CA FGFR1OP2 RNF152 ARHGAP26 USH2A MYL6 ATP8A1 DLG3 CACNA1D UAP1 DOCK9 FXR1 UBR4 RPN2 MARK3 ACTB FIP1L1 NAV2 ABLIM1 UBE2D3 IRF2 MAPK8 FAM49B SPNA2 REPS1 SEC31A EPN2 NRBP1 RBM5 SNRK CSNK1A1 PPFIBP2 FAM38B 6230409E13RIK CACNA1A KCNQ4 RNF114 DLG2 TSC2 
CLSTN1 ARMCX1 INSR DCAF6 ARHGAP21 PTPRS ITGA7 SMC5 GRM1 ATP2B2 WASF3 HNRNPA1 MEF2D TPP2 NRG3 TANC2 PTBP1 CCDC85A NCOR1 GPHN HNRPDL BAI2 SGIP1 SH3GLB1

Table A.12: Part 1 (out of 2) of RBFOX (peak) target gene list (first 588 genes) that is used for DeepASD enrichment analysis.

RN45S NFE2L1 SEPT7 GABBR2 PAM TACC1 ATP2B1 NRXN3 THRA TTC9 SLC6A17 TXNL4A UBTF ZFP963 MALAT1 RTN4 SEPT5 SMAP2 ZFP365 MAST1 AKAP6 AF357425 FMNL1 UNC79 RTCD1 SPNB3 ARL16 UNC13A MEG3 RIAN AKAP11 AHI1 COPB1 SLITRK4 SRSF5 HCN1 PPM1A ZFP386 FUT9 GANAB DDX1 ZFP266 SNURF TTLL7 LRRC7 NTRK2 PJA1 INPP4A DYNC1H1 THTPA EML1 ESD UQCRH SMARCA2 SNHG10 IDH3G TUG1 HNRNPM ANK3 ENC1 PTCHD1 ZMIZ2 ANKRA2 TSC22D1 MAPRE2 RICTOR RNFT2 SETX MAPK8 LMBRD1 ATP1A1 SNRNP70 MAPK9 NAV2 SNHG1 TM9SF2 ZMIZ1 NDUFB9 SLC4A10 CTNND2 CCDC92 RIF1 SUPT16H KIFAP3 ATP2B2 PGM2L1 EGR1 PTPRN DCLK1 ACO2 ZFR PRRT1 FBXO3 GOLGB1 ATP5J2 GPD2 MARCH6 UHRF1BP1L GNAS CPE PPP3CA MAP2K4 SV2A CNTN1 KHDRBS3 CSNK1A1 ATG4C ITSN1 IQSEC1 UBR3 KCNV1 TBC1D9B MIR5109 ELMOD1 GM2694 ADAM22 CLTA PGP RBFOX1 SNORD22 RERE SRSF3 GPI1 MTCH2 KCNQ3 GRIA1 MIB1 RPLP1 HNRNPU ARL8A HP1BP3 GABBR1 SOCS5 FRMD4A DCUN1D4 REEP2 NIPA2 MBNL1 BRWD1 MINK1 STRN3 NRCAM TRIM2 EEF2 PHF14 ATP5A1 COQ4 LRP8 WASF3 B3GNT1 COL4A1 SH3GL2 NDUFB10 PRPF8 MTAP1B ZEB1 SRSF6 YWHAE CAPZA2 DISP2 RAPGEF4 WHSC1 PAN3 KLF9 WDR6 KRIT1 STK38 TAF15 CCNL2 CLSTN3 SLITRK2 SDC3 PDE2A TNIK CLCC1 RUFY3 PTPRZ1 ATRNL1 MAGED1 YWHAH NDUFV2 CAR10 ATP1A3 DHCR24 GRIA2 DLG2 SNRK LPHN3 BRSK1 FBXO21 CUL1 KIF5C NAP1L2 USO1 PREPL RPRML MIAT GDE1 ANK2 SLITRK1 XKR4 RNF34 NEDD4 PPP1R9A MTAP6 CKAP5 PCDH19 LMTK2 NDUFA2 PUM2 FTH1 DST ATP5B STXBP5L ZFP938 COL1A2 CSPG5 GRM5 MT1 CHST1 TRO FRY ARL2 NOL7 CNR1 LPHN1 NCRNA00085 NDUFB8 NCOR1 CASD1 ATP2B3 UQCRC2 SPG7 SNRNP200 TMEM70 TAX1BP1 ATP5G3 TAF9 PDPK1 PRKCB SPNA2 SCG5 CLTC ITPR1 OGT ARHGEF7 SACM1L USP33 BMPR2 RNF103 MADD SAMD8 70 APP CALM2 TSPAN7 MARK4 KIF13A AI414108 SMAP1 ARFGEF1 USP9X SLC44A1 ATG16L1 STRAP SPG11 ERCC6 FARP1 CALM3 SNRPN MAZ ITM2B CACNA2D2 STK25 NDUFS1 GPRASP1 TSHZ3 COPA NOMO1 CST3 ZFHX2AS CLSTN1 NCAM1 GDAP1 PIAS1 ADCY8 PCYT1B DHX9 NAV1 GM20300 SMARCA4 RANBP2 BRSK2 NBEA CLU TTC3 ARIH1 PSAP DOCK3 ATP13A3 LONRF2 KIF5A CYFIP2 POLR2A MOBP PALM MT3 DCLK2 STMN4 SYNJ1 EIF4G1 FBXL16 MOB4 FBXO11 AKT3 RNF187 RALGAPA1 ABR NKTR FBXW11 VPS4A AMPD2 
PCDH17 PAFAH1B1 CADPS MTA3 BSG PRMT8 DOS MDGA2 GFOD1 UQCRFS1 RIMS1 GIT1 SESN3 ZYG11B LIFR RASGRF1 SYT4 KCNT1 ADCY1 APBB1 RTN1 RANBP9 DLEU2 JPH4 CUL3 MAPT DNM2 B4GALT2 ZBTB20 MYO5A APLP2 PCM1 MIRG TMEFF2 GM10336 HMGCR LNPEP NELL2 ADARB1 SEC14L1 VPS13C SKI TULP4 PCSK2 SON UQCRC1 KLC1 PISD-PS1 CACNA1D SLC22A17 NRXN1 TBC1D23 CANX KIDINS220 RORA GRM3 RPUSD1 RTN3 APC VSNL1 PIAS2 SHANK3 EXOC5 KCNJ4 SLC25A46 MAPK8IP3 CPD DGKB CDK16 CAMKK2 FTSJD2 HNRNPH1 NELF UBE2QL1 PITPNM1 DMXL1 ADRBK1 FADS2 ATG13 ZFP119B WSB1 ADARB2 MTMR1 ANAPC5 RCAN2 RIMS2 LUC7L2 SPARCL1 HECTD1 MBP GARNL3 MGEA5 STMN3 PTPRS GPATCH8 VPS41 GDI1 CLIP1 LCLAT1 HSP90AB1 KRBA1 SV2B TRA2B CDS2 CHN1 ZMYND8 ASPH SORBS1 PRKAR2B PAK1IP1 HUWE1 CTTNBP2 TMEM178 CCDC88A SPNB2 NSD1 DNM1 TSPAN2 NASP AI314180 EPB4.1 CELF2 PTCH1 TPPP PHF3 JAZF1 WAC KCNMA1 NPTN RNF114 HIPK3 NFX1 ATPIF1 PTPRD CNBP SNPH PEG13 FLNB SRGAP2 CAMK1 RIOK3 APLP1 PFKP FUBP1 GAK CACNA2D1 WBSCR17 MACF1 SEPW1 KCNQ2 ARF3 XKR6 PRRC2C ERC1 PCDHGB2 ZFP207 CTTN PCDH7 PPP3R1 GABRB3 NDUFA4 TARDBP MAPK1IP1 ZFP687 TMEM242 IGSF5 REEP3 WNK1 NHLRC2 TRAPPC9 GABRB2 PEG10 TRIM37 MYO9A PPP6R1 KIF1B TSPAN3 MKRN1 REEP5 SRRM2 PHYHIPL PTMS RAPGEF1 DLG4 PURG NDUFA3 MEF2C GRIA3 GRLF1 TTYH3 MTAP2 GRIN2B FEM1C PRAM1 HSP90B1 WBP11 HSPA4L WAPAL ADAM10 SHANK1 EIF4A2 PTPN4 EIF4G2 SNORD35A FUCA2 UNC80 RBM14 MRPL2 UQCRQ APOE MEF2D ATP2A2 SYT1 CACNA1A SLC1A2 CHD3 SLC25A4 CALR MED23 MIR5117 SCN1A ZFP959 FAT2 NUDT19 ASH1L DTNA ALCAM FZD3 CLDN11 NALCN DMXL2 BSN PPFIA2 CNIH3 ATF2 CAMK4 TOM1L2 TBC1D10B MIR669B OLFM1 GAS5 SH3GLB2 ZFP788 RMRP ATP2C1 EIF4ENIF1 LIN7A FSTL4 GLRB PCDHB20 RNF112 HRAS1 RAB2A CAMK2B NDRG4 STXBP1 FAR1 GAS6 XIST SEZ6 AFF4 SEPT4 SNX27 ST8SIA3 GGNBP2 TNKS ANGPTL3 Table A.13: Part 2 (out of 2) of RBFOX (peak) target gene list containing last 397 genes.

PHC2 CSF2RA MTDH ITGB1 THRAP3 NME3 HIATL1 SNX1 PLXNB2 BC037704 SNHG12 TTLL11 CYC1 ICAM5 MTAP7D1 ZFP955B USP54 MANF ZFP385A GANC UBR4 ZEB2 PDXP DDX6 STX12 CUL2 ELP3 ITM2A NLRC3 CTXN2 DPP6 BAZ2B SREBF2 USP28 TMEM50A RBM27 PTK2B SMPDL3A TEKT5 ABCB1A GM15800 SP3 KCNH3 RWDD2A GABRD RPL17 KCNK9 RPS15 ARHGAP31 TMEM43 ATXN2 NCKAP1 NAA50 ZFP949 WFS1 CTSF RGMB MATK ST3GAL6 SLC5A11 TRIM24 MYEF2 SNRPD1 IP6K1 CABP1 AHNAK ARHGDIG NR2C1 NDC80 THAP1 TIA1 CHRNA4 PCDHB18 DYNC1LI1 PCYOX1 SLC12A6 SATB1 GNB2L1 MCC AW046200 AAK1 SPEN SLK MYRIP ATN1 DSTN SNX15 ZFP692 RPS6KB2 NDUFB7 CADM4 PRKCZ CACNB2 STAG2 TSPAN9 SYS1 SGMS1 MED9 IDE SNORD68 HERC2 MLL3 MBD5 SPHKAP SNORD35B PIGT SLIT1 SLC35B1 IGSF10 SOX17 AI854517 SEL1L3 EIF2AK4 XPR1 DENND5A SSX2IP ARL3 FBXW17 FAM43B PRRX1 PAK1 MAPKAPK5 TYRO3 DCAF6 TMEM184C RNF19B ABLIM1 NSUN2 SLC25A33 DTL ILK PLXNA4 TTL CAND1 NETO2 BSDC1 SURF4 CLN5 SEPSECS SHMT2 SEMA7A CTNNA2 SIRPA MON2 NEO1 RCC2 SCAI NADKD1 KDR P2RY6 CLASP2 FOXP1 TSHZ2 LRP1 UNC13C ATP13A2 TCP11L1 COX17 ADAMTS3 SPNS1 DEB1 SRGAP3 FXR1 MIR3470A STT3B MAD2L2 GM14005 BRPF3 PRKG2 FANCA PNMA3 DENND5B RNF13 ERLEC1 MED14 MTOR ABHD12 MSH6 ABCB9 MYZAP SNHG6 PPP5C PI4KB SNORD42A RPS4X CHD5 BC029722 RPS14 MEPCE TCEAL6 71 BEND6 GABRG3 SORT1 USP32 ZDBF2 SPPL3 LAMTOR2 GM13544 FBXL18 NGEF SC4MOL EXTL2 DGKE RAB3GAP1 ALMS1 FPGT ROMO1 PTOV1 LIN28B ZFP868 AKIRIN2 TIMM9 KCNT2 SLC25A26 ERP44 GM9833 GABRA5 GOPC ZFP869 RIMS3 SGPP1 EDEM3 PHB2 TXN1 CP UROS SLC38A10 RNF214 KLHL7 GM5141 NME7 EML2 CDK14 ELTD1 IKBKB RAB10 POLR2M TMEM214 SDHA IGSF8 KCTD13 DMTF1 ACO1 BRF2 MAP4K5 CLSTN2 UCHL1 JMY NHSL1 FKBP8 PPM1G CCDC164 GM10033 TRIP11 ACSL4 STX1A ENPP2 SIM1 RLTPR GABRA2 BC005561 AGT DDX24 TPP2 COPS6 SLC38A2 STK11 RPL10 FRYL RASA4 PORCN RYR2 STK16 SUN1 CCT8 CCM2 ARMCX4 RASGEF1B TTC9B CASK DBN1 PHLPP1 WIPI2 DAAM2 LRRC45 SERPINE2 TNRC18 FAM57B SYN1 CMYA5 DCAF8 ZYX TRMT61B SYT16 ZBED6 CCZ1 OTUD5 NKRF CHD8 INTS7 NPY THUMPD2 RIOK1 NEK7 KATNAL1 SLC25A5 FLNA MYCBP2 MAP3K5 GRM7 APBB3 GRK6 BC003331 DLX6AS1 PRPS1 ARMCX2 SLC1A3 
PTPRK CRELD1 ZFP407 CLPTM1L HK1 TMEM111 OBFC2B AEBP1 CCT5 RNF41 LRRC4B KCNIP2 TOP2B GM872 GPR162 DHX40 SRP54B UBR5 AP2B1 GM2115 GFRA1 MAPK1IP1L GUK1 GRAMD1A RAB5C ACOT1 ZHX1 AW555464 FBXL19 ATP9A UCHL3 VPS53 AP2A1 LRRN3 FOXD1 AKAP8L LARP4B FAT1 UGT8A RPL37 ITGA3 PIK3C2A STON2 GPAA1 ZFP955A B4GALT7 TNPO2 RABGGTB KCTD17 ABCA5 ABCE1 ZFP712 SLC12A8 PRRC2A CCDC127 NUP93 RIPK2 PICK1 ZFYVE1 GOT2 IPO4 MYLK FEZ2 POC5 ATMIN PNRC1 OPA1 LRRC16A CEP164 RHOBTB2 PEX6 ZFP608 MUDENG JPH3 RPS8 SENP7 CLTB CUL5 RPL3 DDX11

Table A.14: Part 1 (out of 2) of gene list for FMRP targets containing first 588 genes. These genes are used to calculate enrichment score of DeepASD for FMRP targets.

ARID1B ZNF462 GABBR2 PLPPR4 PACS1 NEURL4 UNC5A XPO7 SPTBN2 ZER1 ARFGEF3 LRRC41 SIPA1L1 OLFM1 ADNP SKI EP300 AGAP1 GRIN1 NBEA KALRN ARFGEF1 PRKACB SPEG FSCN1 GNAL FAM120A MACF1 CHD8 MYH10 NLGN2 PRKCB GTF3C1 CLASP2 KIF1A AP2A2 IQSEC3 CDK5R1 TTYH3 TMOD2 HNRNPUL1 MYO18A MED13L KCNB1 EPB41L1 KCNMA1 DIDO1 HTT TULP4 NR2F1 ZFYVE1 PHYHIP TTBK2 IPO5 PTK2 PPP2R5B MBD5 STXBP5 DPYSL2 AGAP2 CAMK2A ADGRB2 DLG4 GNAZ FASN KIF21B LRRC8B BAP1 APLP1 BSN ASH1L NCOA6 USF3 PCDH10 TRRAP PPM1E GRAMD1B MINK1 NRXN2 ZC3H7B AGTPBP1 PITPNM1 ADD1 TSPYL4 TCF20 STXBP1 BRD4 ADGRL1 SEZ6L2 GRIK5 PHACTR1 DOT1L TBC1D9 SLC8A2 KCNC3 NPTXR GBF1 MAZ SETD5 NCKAP1 GRIK3 MAPK8IP3 ADCY5 EEF1A2 CYFIP2 LMTK2 SMG1 MAP2 GABBR1 ARVCF MYO16 ATP6V0D1 SYNGAP1 DLGAP1 HIVEP3 DAGLA ADARB1 EIF4G3 PDE2A HCN2 PTPRD KDM5C MAP4K4 DCTN1 KCNT1 RPH3A TANC2 SLC6A1 MEF2D PTPN11 MPRIP NCDN DAB2IP DUSP8 USP9X JPH3 CELSR2 CHST2 SNPH PJA2 SCN2A DIP2C TNKS RIMBP2 ATP2A2 PTPRF NDRG4 FAM91A1 ANKRD52 WASF1 ARHGAP33 AKAP6 NCS1 EGR1 SHANK2 SMARCA2 RALGAPB RELN SYNJ1 KIF21A GRM5 TMEM151A CIT PLXND1 ADGRB1 DMXL2 ZMIZ2 MAPK4 NF1 PTEN GAS7 DSCAML1 PPP1R9B ANK3 EPHA4 MICAL2 GIT1 APC2 CALM1 GPRIN1 SPTB AKAP9 WDFY3 DOCK3 SCN8A NCOR1 KCNH1 RPTOR AFF4 FRMPD4 PTPRS PDE4DIP TLN2 CRTC1 UBA1 DLG5 ANK2 HIVEP2 SON UNC13A ARID2 HERC1 TCAF1 C2CD2L BCR SGIP1 PRKCG FBXL19 EHMT2 RAPGEF4 ANKRD11 SPTBN1 BRSK1 HDLBP PRICKLE2 HERC2 RUSC1 LRP8 MTMR4 ATP6V1B2 ATN1 SNAP25 RTN4R TRPC4AP TRIP12 SHANK3 ABR ADGRL3 TNRC18 BPTF CACNA1A CTNND2 PDS5B RIPOR1 MFHAS1 LMTK3 RAP1GAP ARHGEF12 TRIO INTS1 USP34 DLG2 RPRD2 SUPT6H SHANK1 AP2A1 ATXN1 CELSR3 MAST1 SLC8A1 TUBB3 SEC16A 72 PHF12 UBR5 PLXNA1 ADCY1 CAMTA1 SV2B TMEM63B PPP2R2C IQSEC2 MTOR DIRAS2 CAMTA2 FBXL16 RC3H1 MYT1L PTPRT DICER1 ATP1A3 SLC4A8 USP22 PRKCE TRIM9 NTRK3 TRAPPC10 BRSK2 DHX30 IDS EXTL3 TAOK1 KIF3C PCDH1 YWHAG R3HDM2 XPO6 WNK2 RC3H2 AUTS2 PCDH7 CLCN3 SYMPK RTN1 CADPS MED13 KDM4B UBR3 GNAO1 ATP2B4 ZNF827 SIPA1L3 KIF5C LARGE1 ULK1 DLGAP3 TRAK1 SYNPO CKB TCF4 ENC1 UBAP2L SPTAN1 NAV1 CHD6 TAOK2 JPH4 CAND1 MTSS2 
PGM2L1 AP1B1 FAT4 DISP2 GRIN2B AP3D1 RYR2 GNAS GNB1 EP400 EEF2 TRIM2 PDE8B SREBF2 CELF5 TTC7B OXR1 SOBP ANKRD17 NFIX APBA1 NRXN3 PITPNM2 USP32 GCN1 SLC12A6 FBXO41 CRMP1 UBQLN2 CASKIN1 NFIC PCLO NRXN1 SRGAP3 AFF3 HDAC4 EIF4G1 SLC6A17 HIVEP1 DNAJC6 UBQLN1 SYNGR1 DMWD BMPR2 LRRC4B PCNX3 TNRC6B CACNA1E DCLK1 KLC1 PLXNA4 MAGI2 AAK1 VAMP2 PACS2 ZNF704 FAM160A2 CKAP5 NRIP1 MAP7D1 NSD1 ITPR1 SAP130 SNAP91 ACTB PPFIA3 CAMSAP1 KCNA2 RHOBTB2 TNIK FOXO3 CACNA1G TMEM151B GTF3C2 SMARCC2 CLASP1 BIRC6 SLC12A5 FOXK2 CDC42BPB PHF20 GRK2 NCOR2 RAPGEFL1 CBX6 ARF3 SMPD3 DDN CREBBP RAPGEF2 ZC3H4 RASGRF1 DNM1 AKT3 DOCK9 RGS7BP RASGRP1 ELFN2 NDST1 SORBS2 TRO NCAN SPEN KIF1B ARHGEF7 ARHGAP32 CAMK2B PIP5K1C PPP3CA MRTFB ARNT2 PLCB1 CLSTN1 PPARGC1A TRIM37 IGSF9B GSK3B PLXNA2 ARHGEF2 HSP90AB1 SPAG9 ZCCHC14 STOX2 ATP6V0A1 ARRB1 DPP8 SLC17A7 AGPAT3 CPT1C WASHC2A DSCAM RERE SYT7 PRPF8 CHD5 TTBK1 GARNL3 ATF7IP ANAPC1 RAP1GAP2 CDK17 MAGED1 IRS2 PDZD8 CTNNB1 MYO5A EHMT1 CACNA1I KCND2 CHD3 STRN4 ZYG11B BCL9L TPPP MED14 BAZ2A SLC24A2 CDK5R2 KCNQ3 KCNH7 MAP1B APBB1 INPP4A HUWE1 ARHGAP21 LRRC7 MARF1 PHF24 HMGXB3 LRRN2 PSD SEPTIN3 NCOA1 UBE3C CIC ZEB2 LINGO1 CACNA1B DLGAP4 CHD4 HK1 PTPRN2 HIPK1 NPAS2 CTBP1 FYN DIP2A GRIN2A ATP9A SMARCA4 SCAF1 MYCBP2 AP2B1 HCFC1 ATG2A NSF TTLL7 DIP2B NUP98 REV3L CLTC ATP1A1 SYT1 ACLY ARID1A NAV2 ARHGEF11 APP CACNB1 NCAM1 ICE1 VPS41 USP5 PDZD2 KCNQ2 CUX2 PUM2 RALGAPA1 RAPGEF1 ITSN1 CDKL5 CLIP3 SPRN KIF5A SYN1 PCDH9 CAMKK2 PAK6 PRR12 PUM1 AHDC1 CDC42BPA ATP2B2 DLGAP2 PCDHAC2 MAPK8IP1 SV2A RUSC2 SRRM2 CALM3 SPARCL1 CDK16 KDM6B SLITRK5 TNPO2 SPIRE1 ATP1B1 OGDH KCNH3 LHFPL4 MAPK1 POLR2A SEPTIN5 AGAP3 DGCR2 NLGN3 DYNC1H1 UBE2O TSHZ1 ZMIZ1 PPP2R1A NCOA2 SRCIN1 NISCH TLE3 NGEF B3GAT1 SORT1 CPLX2 EIF4G2 Table A.15: Part 2 (out of 2) of gene list for FMRP targets containing last 207 genes.

NOMO1 ZFHX2 AATK TMEM8B FKBP8 KIFC2 CNP PINK1 MAST4 PHLDB1 ADAP1 GPM6A TSPAN7 PDE4B AMPH SLC25A23 CUL9 ARPP21 NTRK2 ARAP2 NAT8L ARHGAP23 ARHGEF17 FRY TRIL MADD APOE DTNA RUBCN SALL2 NACAD CHN2 LRP3 KNDC1 PLXNB1 VPS13D PTPN5 NAV3 UNC13C BCAN RTN3 NRGN KLHL22 HIPK3 WDR13 TMEM132A DST PCNX2 PLEC LARS2 SCAP TIAM1 SBF1 AGRN MED16 LYNX1 TRIM32 PLCH2 PTPRG WDR6 PREX1 DTX1 RTN4 SORL1 TNK2 KIAA0100 WWC1 KIAA1109 SECISBP2L MAST2 ELMO2 GRM4 DDX24 ABCA2 PTCH1 GLUL TRAK2 TOGARAM1 PREX2 TBC1D9B DAPK1 ATG9A NHSL1 PSAP THRA ACO2 RALGDS SETX DOP1B SASH1 MGAT5B MAPKBP1 ZFR JAK1 SPRED1 TSC22D1 COBL QKI GPAM SYNE1 APC TSC2 PLD3 SAMD4B ATG2B FAT3 HIPK2 SH3BP4 TNS3 MAN2A2 PCDHA4 FAM171B CACNB3 ATMIN FCHO1 ULK2 ALDOC DLC1 NDRG2 NWD1 ABCG1 IPO13 MAP3K12 ATP5F1B LPIN2 ALDOA PLP1 TTYH1 PFKM MIB1 TRIM3 PCNX1 CAMK2N1 WNK1 TCF25 ATP13A2 PCDHGC3 UQCRC1 ATP1A2 IPO4 HDAC5 ZNFX1 DGKZ RHOB CPE PEG3 SEC14L1 SGSM2 PKP4 FAT2 CPLX1 ZHX3 ANK1 STK25 PER1 ABCA3 SLC1A2 PTK2B TSPOAP1 FAT1 SLC22A17 TEF DENND5A MON2 CUX1 CABIN1 UBE3B PCDHGA12 UHRF1BP1L NEDD4 R3HDM1 ZNF365 PI4KA CHN1 GPR158 PIKFYVE ZNF521 SPHKAP TTC3 MYO10 EPN1 ARHGEF4 SIPA1L2 DCAF6 MAP4 PIGQ SLC4A4 PTPRJ ALS2 MAP1A CLEC16A ZNF536 ATP5F1A MMP24 TRPM3 LLGL1 HEATR5B DOP1A ARHGAP20 GPR162 SLC4A3 DOCK4 MBP LRP1 EML2 SEC23A PKD1

Table A.16: TOP1 target gene list that is used to calculate DeepASD enrichment for TOP1 targets.

MYT1L CDH12 MYB FBXL17 B3GALT1 GRID1 C3 DLK1 AQP4 WWOX NRXN1 RELN NTM MDGA2 ROBO2 CNTN4 PDGFRA TRPM3 RNASE1 VPS13B DSCAM ADGRL3 OPCML CTNNA2 NLGN1 NKAIN2 LUZP2 NPAS3 SLC1A3 GPC5 CACNA2D1 CADM2 CSMD1 EXOC6B TMEM132B GLYCAM1 GRIK2 RBMS3 NTRK2 MYH8 CACNA1C NRXN3 FRMPD4 HS6ST3 EPHA6 SLIT3 CNTN5 PTPRM PLPP3 DEUP1

DLGAP1 NFIA CTNND2 IL1RAPL2 PRKG1 HES5 FAT3 LRP1B PTPRG AK7 PTPRT GRM7 LRRC7 PLCB1 NRG3 CSMD3 CXCL5 SLC4A4 NCKAP5 PCDH15 UNC5D KCNT2 PTPRK LSAMP CCNO NXPH1 EXOC4 GRID2 GFAP CFAP65 RYR2 KCNQ5 PTPRD IL1RAPL1 PCDH9 ASTN2 IGSF1 GALNTL6 RFX4 CCDC146 SYT1 NBEA ATXN1 PTPRN2 PRKN GPC6 FOXN4 TRPS1 SFXN5 PARD3B KCNMA1 MAGI2 CNKSR2 ERBB4 DCC ADARB2 FAM107A DTNA BCAN DOCK1 GABRB1 KALRN LARGE1 DPP10 CDH13 KCNIP4 C4B FHIT ADAMTSL3 SLIT2 ERC2 ATRNL1 CLSTN2 NEGR1 C1QL1 SOX6 SLC2A4 LPP FMN2 GRM5 RBFOX1 LINGO2 TAFA1 SLC39A12 PGM5 ATP6V0D2 FBXO36

Table A.17: Part 1 (out of 3) of post synaptic density (PSD) complex gene set containing first 588 genes. These genes are used to calculate enrichment of DeepASD ranking on PSD complex gene set.

SYNGAP1 ABI2 UNC13A ACTR2 KALRN PURB USP9X UBE4A UPF1 SLC17A7 ARF3 GGA3 CSE1L ACSL4 SHANK2 EPB41L1 HDLBP AFDN KIF1A ARHGAP35 RAB3A WASF1 FSCN1 NEO1 PDHA1 LMNB2 MYO18A DPYSL4 ANK2 DPYSL2 ADGRL3 EIF4G1 DLG4 AP2A2 CIT STX1A PITPNA VAC14 SORBS2 INA PABPC4 PALM2 TRIO SRPRA LZTS3 PLXNA4 PHACTR1 GNAZ GIT1 AP1M1 WDR1 GK NPTN DBN1 MAPRE3 NDUFS2 TAOK1 WDR37 DLG2 ACTB ERC2 NUMBL PTPRS SGIP1 ARCN1 ARF1 VPS39 BAG6 SYN2 PLEKHA5 GRIN2B PSD3 CSNK2A1 DNM1 CYFIP2 MINK1 ARHGAP26 MAPT PPP1R12A SYN1 CDH2 IGHA1 BSN HNRNPK NRXN1 GAS7 ATP1A3 CAMK2B PDE2A LMTK2 BCR ATP6V1B2 GABBR1 GNB5 NPTX1 VPS53 KIAA1211L VPS11 GSK3B BRSK1 YWHAG LINGO1 RAC1 GLS CORO2B RAB6B CASK SEPTIN5 STUM RASAL2 DNM1L PGAM5 CTNNB1 ABR GNAO1 DPYSL3 DAB2IP RALA GRM2 CTNNA2 KIAA0513 MYH11 ATP13A1 HNRNPM ATP6V0D1 TPM1 GRIA2 ATAT1 SPTAN1 ATP2B2 GRM5 PIP4K2B GNL1 CNTNAP1 CS SPTBN4 ENO2 RTN1 IPO7 MPP1 SRPK2 PLXNA1 GNAS ATP1B1 EPHA4 HOMER1 SEC24C EXOC6B NPEPPS KIAA1549 DIP2B SYNPO RPH3A IGSF21 CLTC PCDH1 NRXN3 PPP2R1A MAPRE2 RELCH IQSEC2 GOT2 KIAA1549L REPS2 TRAPPC9 PRRT1 FMNL2 VSNL1 KCNQ2 VCP KLC1 CLASP2 RAPH1 GABRA1 EPS15L1 USP14 LANCL2 TMOD2 CALM3 DMTN PDE1A FAM49B PPP2R5D RYR2 ADAM23 HTT SHANK1 CTNND2 GNA11 COPG1 VPS4A IPO5 AGAP3 OXR1 GRIA4 SEPTIN3 DYNC1H1 DCLK1 SNAP91 PDPK1 PRKCE AP2A1 MAP2K1 MPP2 NCALD SYN3 HSP90AA1 TCP11L1 BAIAP2 GDAP1L1 CACNA2D1 ARHGEF7 SLC12A5 EEF1A2 DUSP3 KIF5C ATP2B3 IQSEC1 ADGRB1 KIAA0408 VARS DLST PDCD6IP FYN MYH10 ARHGAP44 HSP90AB1 CORO1A DDX6 TRIM2 CAND1 FAM49A CALM1 GRIA3 CEP170P1 CRKL NEGR1 HSPA9 RIMS1 ARHGEF2 USP15 GAK TAOK2 DNAJC6 ETFB TUBB4A AKR1C2 BEGAIN ABLIM2 RAB6A HSPH1 AK5 75 STXBP5 SYT7 RUFY3 NCDN EEF2 FARP1 FBXO41 CRMP1 TLN2 DCTN1 ST13P5 ARHGDIA CPNE5 TMOD1 STXBP1 MAP1B AP2M1 PTPRF CSNK1D VAMP2 CAMSAP3 PAK1 PRKCG DMXL2 CDH4 MFN2 RAPGEF4 GNB2 NCKAP1 DPP6 ACLY KIF21A AAK1 GRK2 EFR3B HGS YWHAH PSMD2 SIPA1L1 WDR47 CTNND1 SIRPA DLGAP1 SYT1 RALGAPA1 KLC2 FLII SLC25A1 LONP1 SORBS1 CAMK2G GPRIN1 VPS35 TNPO1 EIF3C SLK MARK1 PURA GNB4 UBR4 DOCK9 KCTD16 PRKDC HOOK3 OPA1 ACTN2 PTK2 ANKFY1 PDK3 
PPP1CC DOCK3 MARK2 CDC42BPA ANK3 TUBG1 KIF3A KPNB1 DLG3 DIRAS2 SNAP25 ADD1 PSD CDC42EP4 NRCAM SPTBN1 RTN4RL2 SPIRE1 SNX27 PPP3CA CNTN1 FMNL1 TNR NCAM2 DYNC1LI2 CPNE6 CTBP1 STK32C RRBP1 SHANK3 GRIA1 PACS1 RAB14 JUP ATP6V0A1 ARRB1 SND1 PRKAR2A LMTK3 FXR2 USP5 DAAM1 ANKS1B HECW2 PLPPR4 GRIN1 MT-CO2 STRN4 SGSM1 DGKB PACSIN1 DLGAP3 PRKAR2B SNPH CAMKK2 EXOC3 RANGAP1 VCPIP1 AGAP1 CAMK2A SLC25A6 ARHGAP21 CEP170B AP2S1 ELFN2 PGM2L1 ADAM22 SYNCRIP TOLLIP SYP SNX12 AP3D1 PRKCB MPRIP SEC22B TBK1 LRRC7 ARHGAP39 PLCB1 DMWD CAMKK1 SEPTIN11 ATP6V1C1 KCNAB2 WASF3 MPP6 WDR7 ATP2A2 YWHAZ CSNK2A2 SRC HK1 LSAMP CYTH3 DHX30 VAPB MAP6 RASAL1 NOMO1 SRGAP3 AGAP2 ROCK2 ARFGEF2 DLGAP4 CLIP3 ATP6V1A INF2 SGCD NAPB SLC25A22 COG3 CADPS FAM81A CLASP1 GAPVD1 XPO1 PPFIA2 NTM ACTN1 CACNB1 DNAJC11 SHISA7 RGS7 VGF CDH13 CKB DNM3 RAPGEF2 ADGRL1 DYNC1I1 MAGI2 ACACA GABRA4 SV2A TPPP DNAJC13 AP1B1 CAP2 CORO1C PCLO TUBA1B PLXNA2 MAPK8IP3 GNAI1 PPFIA3 AP2B1 SPTBN2 EHD1 PHF24 CC2D1A TTC7B ACTR1A TOMM70 LRRC47 CAMK2D CDH10 AP3B2 SYNJ1 CDC42BPB AP1G1 PRKACB MAPK1 SESTD1 PSMD11 CASKIN1 HSPA4 LETM1 MAP7D1 SEPTIN6 MYO5A KIF2A PPP1R9B PIP5K1C CDKL5 IQSEC3 LASP1 NSF EIF2AK2 CKAP5 WIPF2 PC RAB3C PTPN23 UBE3C PTPN11 CLIP2 SCAI SRCIN1 SLC25A12 PHYHIP CHMP4B PLEKHA6 WDR48 GLG1 SCRN1 CKAP4 GNG2 GRIN2A RIMBP2 ATP2B1 TUBA1A HECW1 FASN SLC8A2 NCAM1 APOOL YWHAB SPTB GDI1 SNX9 RTN3 ATP1A1 FMN2 ATP2B4 MYCBP2 XPO7 GPHN MAP2 DCX EXOC8 DYNLL2 UBA1 OLFM1 PCBP1 OGT HSPA12A PRKCA KBTBD11 ITSN1 ADD2 PABPC1 COPA KIF5A PRKAR1A PPP3CB CAMKV SKIV2L BASP1 LRRC59 GABBR2 KPNA1 CSNK1E DLGAP2 OPCML EXOC5 L1CAM RAB35 DDX3X FAM171A1 RAP1GAP RAB15 RPL12 CHMP1A NLGN2 CYLD GNB1 OGDH CRTAC1 PTPRD STX1B GNAQ NFASC VPS18 TUBB3 MACF1 NCAN PREX1 Table A.18: Part 2 (out of 3) of PSD complex gene set containing second 588 genes.

ELMO2 AHSA1 SH3GL2 ARPC4 WNK1 THEM6 STX12 RPS27 EXOC6 CHCHD3 CAPZA1 HSD17B12 NME3 DST DCLK2 USO1 RAB5B GAP43 RHOB CCT5 TUFM LYN GPR158 GPX1 VTA1 DNAJA1 CISD1 RPS3 RPL7 CYFIP1 MTDH NRN1 FKBP1B SFXN3 LY6H SLC25A3 MAP4 KIAA1217 ROCK1 ERLIN1 ENO3 PRDX2 LGI1 STIP1 NAPA RTCB RAB13 YWHAQ ARPC3 EEF1G ABI1 SLC25A11 CALCOCO1 ALDOA RPS15A RPL10A KCTD12 CACNA2D2 DYNC1LI1 SARS SCFD1 MACROD1 UGP2 RPS25 DSTN SCD5 NDUFB4 RIC8A RFTN1 ACTR1B HSPA5 RAB11FIP5 NAPG STK39 EHD3 SEPTIN7 CDC42 NDUFS7 PEA15 COX6C HSPB1 LAP3 SNX1 AQP1 APC ATL2 FABP3 ANKRD24 RAB8B TMEM245 MDH1 NEFH GDI2 NPM1 MYL6B ACSL6 RIN1 SCCPDH SYNGR3 SNTB2 LIN7A RHOT1 ATP8A2 TUBB2B PIP4K2A PYCR3 DCTN4 NDUFB8 ACO2 DLAT GNG12 SSBP1 RAB11B ABHD16A HSPA8 PALM FDPS RAB21 MAP6D1 UBE2V2 ABLIM1 PIN1 SNX4 TSC22D4 HSPA2 TIMM50 PLEKHA1 FARSA RAP1GDS1 ERC1 DNAJA2 AMPH RAB5C CHL1 DDX17 ACOT13 CMPK1 GNA13 CD59 ATP1B2 TRIM3 PLXNA3 ACKR1 TKT FSD1 ACTN3 TOM1L2 MAP2K2 KARS ALDH3A2 FLOT1 TXNL1 COX5B NDUFB6 PDXK PFKP PLD3 PCMT1 MAP1LC3A SHTN1 FAM241B IDH2 CCDC127 CBR3 FBXO2 PSMB6 TUBG2 CACYBP PPP1R9A RAB1A TAGLN3 BIN1 CCDC22 MYO1E EFHD2 CCDC124 ATAD1 TBCD RMDN3 C2ORF72 GPX4 RPN1 SH3PXD2A HIP1 CACNB3 SEPTIN9 SVIP VPS16 PRDX6 PPP1CA COX6B1 DDAH1 RPL9 NDUFA4 RHOG HSPB8 NEFL WASL SLC27A4 CFL1 SH3GL3 TBC1D24 CKMT1B ACTN4 MAOA RPL23A GAPDH RPL36 RPS3A DLD ACSL3 BAG5 MPP7 LRRC73 PRMT5 RPL18A TOMM34 EEF1A1 MBP MARC2 NINJ2 RPS11 RPL13 PAM16 EPRS UBC AMER2 SACM1L TBCB HSPD1 MAPRE1 SFN WFS1 STRAP NDUFA12 HOMER2 APOD LCP1 IRGQ TUBB2A CRYM MTCH1 ERBIN EPB41L2 RAP2C FN3K GUK1 HAPLN1 DYNLL1 GABARAPL2 CNP TPI1 YARS TSC2 PPP1CB NAP1L4 FABP7 GNAI2 PPIA QDPR HSPA1L LYNX1 AP3S2 SIRT2 NDUFA9 C1ORF198 76 G6PD MRAS SAR1A LRRC57 DNAJB6 EEF1D PKM HSPA1B RAB4B RAB18 DECR2 RHOA SLIRP HCK AMPD2 PGD KRAS GIPC1 TPM3 NSFL1C PRAF2 VDAC3 PCBP2 UBE2N PMVK FLOT2 PLPBP GOT1 PSMC1 EFR3A REEP5 PKN1 CLTB UQCRB GPC1 RAB1B ATAD3A VPS51 ANXA6 RPL35 EMC2 DARS ATP2A1 VPS52 MARCKSL1 EVL CEND1 PRKRA CCNY RPL6 ATP6V1E1 AKAP12 NDUFA5 HSPA6 ATP5MF ATP5MG RASGRF2 SH3GL1 NIPSNAP1 RAB10 SYT5 
TCP1 AP3M2 GNG7 GLIPR2 STK38L VPS33A ABCB8 NDUFS3 CAPN5 PDIA3 FTH1 CCT3 CORO1B DYNC1I2 COQ10B ATP5ME ATP6V1G1 VAT1 PRDX1 EXOC4 SNX3 SNX6 ALDOC ICAM5 PRPS1 NNT UBE2M ATP5F1A MTCH2 AHCY EIF4A2 PACSIN2 CALM2 DCTN2 MTHFD1 ATP5MD PCCA PGAM1 IGHM C2CD4C PPFIA1 PEBP1 SNX2 OXCT1 NDUFA10 CAPZA2 PLCD3 NDUFB7 TOMM40L SLC2A1 PLP1 PLCG1 VCL CACNB4 PYGB KIF5B MDH2 DDRGK1 STAT1 ACP1 MRPS36 PDHX DDOST STXBP3 LANCL1 VDAC1 HNRNPA1 EPB41L3 PLCB4 DOCK4 STX4 GRIN2D CRAT VAPA UBXN6 CRYAB ACBD5 AARS ALDH16A1 MAP1A LIN7C SRPRB PPIB CCT7 IMPA1 UCHL1 DDX1 DNAJA3 GLUL ATP6V1H RPL30 RPS14 NME1 PTPRZ1 PNPLA6 FKBP4 ARPC2 AP1S1 EXOC2 EIF2S1 CRIP2 NDUFS6 SDCBP SH3GLB2 YES1 ACYP2 COX7A2L DLG1 MYH9 TFAM ATP8A1 CCT6A PHB PDHB SLC25A5 PLGRKT PGK1 HAPLN2 STUB1 GDAP1 ANXA2 CYTH2 LDHB HPCA ABCD1 SEPTIN8 SFXN1 TAGLN2 DAD1 CLTA VDAC2 UQCRFS1 SPECC1 PLLP SEPTIN4 PRR7 TWF1 ARHGAP1 PFN1 SLC25A4 ARFGAP2 CAPN1 EDARADD ACAA1 GLUD1 GSTO1 RPL24 FIS1 KTN1 ACTR3 PDXP ANK1 NEFM PEX11B RAP1A SEC13 CCT2 COX4I1 CCT4 LDHA MYL12B AASDHPPT C1QBP LMO7 DYNLRB1 GNG3 IDH3A GOLGA7B STX7 GPSM1 RPL4 RPL14 TUBA4A FAM107A DES ARF5 CNTN2 LIMCH1 NRAS DNAJB2 CAPZB FKBP8 AP3S1 VCAN EEA1 TALDO1 DCAKD ATXN10 RPS13 APOE PNKD HSP90B1 TOMM20 PI4KA ACOT7 CAP1 C3 HSPA4L MTX1 FGG RPL38 PFKL TSG101 FLNA PHB2 RPS17 HSDL1 MAOB ATP5F1B RAB5A CCT8 HSPA1A CYCS SCRIB RPL8 BSG GSTP1 RPS18 NUDCD2 RAB7A SLC12A2 PPP1R7 DNAJB4 MAPK3 UBL4A DNAJB1 BTBD17 ATP5F1D GPRC5B NOP58 TRAPPC3 PSMD14 STOML2 ARHGAP23 SBF1 CDK5 RPL3 DBNL MFF YWHAE NDRG1 MADD COX5A TACC1 ANXA11 DNAJC19 SLC1A2 TUBB4B RTN4 KCTD8 OMG FH ATP6V1G2 LIN7B AK1 CFL2 RPL13A RPS16 LRP1 MOG ATP6V1D

Table A.19: Part 3 (out of 3) of post synaptic density (PSD) complex gene set containing last 282 genes.

RPLP0 SUCLA2 ERLIN2 AKR7A2 DSP NEBL MCCC2 RPS5 VIM SCIN TBC1D17 PFKM SFXN5 DOCK10 LRSAM1 EZR WASF2 BCAS1 ATP12A DAAM2 MYO1C LGALS8 DTNA SEC14L2 DOCK2 AIFM3 COASY ALDH6A1 FARSB LAMTOR1 NDUFA13 KIF20B VPS8 FMNL3 SEPTIN10 AFG3L2 OLA1 RAB3GAP1 EPB42 SLC25A18 LRPPRC EPX AHCYL1 MPO C1QC DNM2 HSDL2 AKR1A1 WASHC5 SUCLG1 SLC4A1 GNPAT SEC31A ATP1A2 TLN1 TTC37 HIGD1A NDUFB9 SLC9A3R2 HACD3 PKP4 PHLDB1 CDK5RAP3 FGB CD9 NDUFA7 LRRC40 NDUFS1 B3GAT3 ALDH4A1 ACAT1 AK3 IST1 ESYT2 EXOC1 PARP1 SLC25A13 NDUFB10 SAMM50 GSTK1 NDUFV2 C2CD5 KIF2C PYGM BLVRB LRRC8A RAB3GAP2 ENPP6 GFAP POR TJP2 CYB5R1 CMC1 WASHC4 IMMT NT5E BCAN ESYT1 GSTM2 ATP5F1C KIF2B DYNLRB2 SACS QARS TARSL2 SLC4A4 BAG3 GRHPR KANK2 PRPH LIMA1 DGLUCY APPL1 CRYZ NDUFV1 ITSN2 ROGDI AKR1C1 SBF2 RPS8 PAICS MAG EPB41 ECI2 SNTB1 FKBP15 ATP5PB CST3 MSN CORO2A ALDH2 GOLGB1 TNC LLGL1 NDUFA2 TPP1 PLEC CPT1A ANXA7 SHMT2 TPM4 SLC3A2 EXOC7 ANXA1 MOGS TRAP1 MYO1D MYO1B VPS45 MYL6 HADHA NCKIPSD UFL1 AHNAK AIP CYBRD1 SLC25A46 PLPP3 PHGDH AUH PRODH CA1 CNTNAP2 CAND2 TMEM256 HARS2 AGL ALDH1L1 PARK7 HADH CTNNA1 GPD2 PADI2 ANXA5 HADHB RPS19 PPFIA4 IGSF8 DPM1 SEPTIN2 ENO1 DOCK5 ATP5PO RPN2 MYH14 ABCF3 ACOT8 OGDHL DOCK1 PRDX5 FGF2 HIBCH GPI CCDC93 PBXIP1 PLCD1 PDE8A EMC4 NIPSNAP2 ADD3 CHCHD6 SLC27A1 APPL2 AP3B1 FRYL RARS SNX5 MVP MTHFD1L EPPK1 DBT SLC25A31 CLU CAPG AGK IARS ATP5PD CSRP1 ATP6V1E2 GSTM3 ALDH5A1 ABLIM3 PDIA6 ATP4A GLOD4 CNDP2 MYO6 CDK18 GNAI3 SRI FAAH GJA1 TJP1 IQGAP1 CBR1 UQCRC1 EMC1 APOL2 LIPE SYNE1 VAMP3 PPID DECR1 WDR91 MARS ATIC EHD4 PTK2B DCTN3 GNA14 SYNM HIP1R RPL7A AQP4 SNTA1 NCKAP1L RDX HSD17B4 RBX1 CAD ABCD3 MYO1F PPP1R21 ALDH7A1 BCKDK UQCRC2 CA2 TMEM126A NT5C1B RHOT2 VPS29 SLC1A3 CA4 GSN PPIL1 ATP1A4 STOM HMOX2 SEC23A NDRG2 CCT6B AGPAT5

Table A.20: Part 1 (out of 2) of histone modifier gene list containing first 588 genes. These genes are used to calculate enrichment for DeepASD ranking on histone modifiers.

ARID1B NCOA6 BRPF1 HUWE1 PRDM2 SMARCA5 PPARGC1A HIF1AN PRKAG2 CHRAC1 RNF2 DND1 EXOSC2 ING4 ADNP SMARCA2 SF3B3 KDM2A BCORL1 ZMYM3 TAF1 ASH2L BRD9 USP17L2 PRMT5 ANP32E UBE2A MORF4L1 CHD8 CHD1 SUPT16H UBN1 MBD2 TRIM24 BAZ2A ZMYM2 TET2 SUV39H2 JDP2 TFDP1 IKZF1 SIRT6 ASXL3 UBR5 HDAC4 NCOA2 NCOR2 WDR5 NPAS2 NAP1L2 JADE3 HDAC9 AICDA MSL3 UBE2N SIRT2 MBD5 KDM4B BRPF3 SSRP1 PRKDC CARM1 GADD45G FBRS EED FBL CHTOP BRD8 FOXA1 PRDM13 KMT2E MSL2 USP46 CDC73 ARRB1 SAFB SMARCAD1 SMARCA1 CBX1 EID2 SGF29 RBBP4 EYA4 RCC1 ASH1L LEO1 NFRKB BRD1 SETD7 ATN1 DPF1 MAP3K7 TAF2 EID2B HMGN3 SMARCD3 CBX2 TADA1 POGZ PPP4R3A TET3 PHF20 SFMBT1 ACTR3B SENP1 HDAC2 TTK MAX CXXC1 ZNF541 BRCA1 CIR1 SETD5 TAF4 USP15 KDM1A MECP2 RNF40 RIOX1 SMYD2 TYW5 EXOSC3 PRMT1 HSPA1B SNRPB2 ZNF516 KMT2C INO80 APBB1 HIRA BRD2 HMG20A TRIM27 USP3 TAF12 SIRT1 PRKAB1 ZNF217 KANSL2 EYA1 SIN3A SIN3B SMARCA4 YY1 SMARCD1 KDM5A DDX21 PRPF31 SMYD1 PRDM12 TFPT MASTL MORF4L2 HDAC7 PHF12 EP300 ZZZ3 KMT2B OGA USP11 SMARCC1 KAT2A UBR7 PSIP1 TRIM16 SETDB2 PHF19 POLE3 KMT5B INO80D GATAD2B DDB1 PAF1 PDP1 RNF20 CHAF1B BANP MLLT6 GFI1 NAP1L1 ASF1A L3MBTL2 KMT2A BRD4 TRRAP RCOR1 SLF1 CBX6 ASXL2 RLIM SNAI2 CDK5 RUVBL2 ING5 PRDM14 UBE2T FOXP1 JADE2 NSD3 BAZ2B CDYL2 EXOSC6 NCL RPS6KA5 SET HLCS MRGBP KAT8 KEAP1 SMYD3 KDM5B ZMYND11 SETD1B TOP2B DEK CHD9 SCML2 PRDM4 BRD7 RCOR3 PCGF2 UBE2B WDR77 NASP RAI1 GLYR1 ARID2 DOT1L KDM5C SETD1A SYNCRIP SCMH1 TDG MTF2 SENP3 L3MBTL4 EXOSC1 NEK6 CHD2 FOXP2 RAD54L2 KDM3A GSE1 TLE1 ARNTL DNMT3B MLLT10 SUPT7L APOBEC3H HIRIP3 CDC6 ACTR8 WAC CBX4 MPHOSPH8 TADA2B PHF13 ZNF592 PRDM8 RUVBL1 PAXIP1 KDM8 ARTN PCGF6 PRDM1 SMARCAL1 TLK2 CUL3 SF3B2 MEN1 RPS6KA3 BRWD3 HP1BP3 UHRF2 CUL4B PWWP3A EZH2 ABRAXAS2 ZNF22 NOC2L NSD1 TLE4 BRD3 KDM3B KAT7 CUL4A EHMT2 DAPK3 DR1 ZBTB33 RBBP7 BMI1 PRDM6 PADI3 SMARCC2 SAP130 CTR9 SRSF1 KDM7A SFMBT2 MBTD1 TAF5L CHUK RING1 TSSK6 DAXX HMGN4 SAP25 CTCF ING1 ACTB WDR82 FOXO1 UCHL5 TAF3 EYA3 SETDB1 NAP1L4 NFYB PRKAA1 VPS72 PARP3 CREBBP PBRM1 KMT2D DNMT1 ATAD2B CDK17 NCOA3 TAF5 GATAD2A
HMGN5 APOBEC2 DPPA3 TADA3 CTBP2 SPEN EHMT1 CHD5 ATF7IP KDM6A RPS6KA4 UTY ATF2 IKZF3 PKN1 MEAF6 KDM4E TAF7 CDK9 SATB1 CUL1 ANP32A CHD4 SMARCE1 TAF1L EPC1 OGT UBR2 TET1 INO80E SRSF3 EZH1 CDK7 SRCAP RBBP5 ARID1A HCFC1 ING2 BAP1 ATXN7L3 MYSM1 PHF20L1 CECR2 DPF2 ENY2 PAGR1 CDK3 PHIP PRKCB CUL5 BAHD1 EPC2 PCGF3 SUV39H1 PAK2 PRMT2 TAF6L ARID4B KAT5 YEATS4 APOBEC1 ZMYND8 USP7 KDM2B BCOR ATRX HCFC2 PRDM11 RYBP ZHX1 TNP2 KDM4D BABAM2 ZCWPW1 DNTTIP2 NCOA1 JARID2 BPTF MBD1 TRIM28 PRKCD CTBP1 MTA2 PHC1 PPP4R3B YAF2 RB1 APOBEC3C ELP5 PHF2 MDC1 SUPT6H CLOCK UBE2H PRDM16 HMGB1 FBRSL1 ZBTB7C INO80C CENPC PCGF1 SP1 ZGPAT SETD2 YEATS2 USP22 USP49 MLLT1 ZNF711 PELP1 HDAC5 BRMS1L TNP1 ZFP57 HDAC3 UIMC1 HELLS TP53BP1 KDM4A CHD6 CIT L3MBTL3 JMJD1C JAK2 SUZ12 HDAC6 KMT5A MBD3 E2F6 ING3 C17ORF49 BAZ1B PRKCA CDY2B PHC3 ERBB4 ZBTB16 JMJD6 MTA1 PPP4R2 PHF14 ZNF687 GADD45A UBE2D3 PPP4C KDM6B NIPBL CDY2A KANSL1 FOXP3 GADD45B BRWD1 BAZ1A CBX7 FOXP4 PKM BRDT SIRT7 EXOSC4 NSD2 NCOR1 CDY1B GTF2I NFYC PPM1G SS18L1 WSB2 UBE2D1 SAP30 VDR ACTL6A HAT1 APOBEC3B SATB2 CHAF1A CDY1 ZNF532 ATXN7 RAD51 MAZ MSH6 SAP30L ANP32B PHF1 MCRS1 CBX3 MOV10 DNMT3A KAT6B PRMT8 GTF3C4 USP12 MTA3 ASXL1 SPOP PRMT6 PCGF5 HDAC11 PRKAG3 SCML4 PRDM5 EMSY KDM1B EP400 MGA JADE1 HDAC8 SP140 SMARCB1 NAA60 TDRKH HSPA1A CDYL PRR14 DPY30 KAT6A RSF1 YWHAZ TEX10 RNF8 YWHAB PARG AEBP2 PRKAB2 SAP18 SS18L2 TAF9B UBE2E1 SKP1 TBL1XR1 CSNK2A1 RARA TRIM33 TLK1 KDM5D SUDS3 CBX8 HMGN2 HDGF ACTR5 NPM1 EXOSC5 HMG20B PHF21A KANSL3 CHD3 SFPQ CBX5 STK4 PPP2CA CUL2 ACTL6B APOBEC3A YWHAE DDX50 EID1 DNAJC2

Table A.21: Part 2 (out of 2) of histone modifier gene list containing last 129 genes.

ACTR6 RRP8 PRKAG1 RAD54L INO80B ARID4A PADI2 ATM BABAM1 AURKC GFI1B NEK9 HMGN1 DPF3 UHRF1BP1 HDAC1 HINFP SP100 SETMAR DEPDC1B TOP2A SETD3 DZIP3 SUPT3H CTCFL HDAC10 PHC2 ZNHIT1 ATR HR ASF1B CHEK1 MAPKAPK3 VRK1 USP21 PARP1 CLNS1A ERCC6 HASPIN TDRD3 TADA2A MYBBP1A LBR SMARCD2 ELP6 GATAD1 NPM2 ELP1 RBX1 HJURP RMI1 AIRE PADI4 PARP2 RAG1 LRWD1 APOBEC3F TAF10 DTX3L APOBEC3G LAS1L RNF168 ATAD2 KMT5C EXOSC9 LOC10600 ELP3 PIWIL4 DNAJC1 MBD6 TAF8 TONSL TAF9 PRMT9 AURKB EXOSC7 RAG2 ELP2 CDK1 BARD1 MST1 EYA2 NAT10 BUB1 BRMS1 RIOX2 RAD54B ABRAXAS1 TP53 ALKBH1 AURKA PRDM7 MYO1C REST TAF6 SMYD4 CHD7 DNMT3L APEX1 SHPRH SETD6 APOBEC3D PCNA ZRANB3 PHF10 TDRD7 MBD4 USP44 PRKAA2 PRDM9 L3MBTL1 DMAP1 CRB2 PADI1 MBIP KAT2B KDM4C CDK2 USP36 TLE2 EXOSC8 KAT14 HLTF ELP4 PRMT7 CHD1L PBK DDB2 BRCA2

Table A.22: First 132 gene rankings and posterior probabilities generated by DeepASD. Probabilities are obtained by averaging results from 200 epochs, excluding results coming from training data.

Gene Rank  Posterior Probability  Gene Name | Gene Rank  Posterior Probability  Gene Name | Gene Rank  Posterior Probability  Gene Name | Gene Rank  Posterior Probability  Gene Name
1  0.9516  ASH1L | 34  0.8487  TNRC6B | 67  0.7869  RFX3 | 100  0.7479  MPP6
2  0.9504  TMED7-TICAM2 | 35  0.8472  FBXO11 | 68  0.7859  DHX29 | 101  0.7471  NAA15
3  0.9427  CHD8 | 36  0.8447  GSK3B | 69  0.7858  OR51A2 | 102  0.7453  BCL11A
4  0.9290  ANK2 | 37  0.8425  PHF12 | 70  0.7836  LDB1 | 103  0.7384  ENC1
5  0.9270  MED13L | 38  0.8420  SRPK2 | 71  0.7824  LRRC4C | 104  0.7374  DYNC1H1
6  0.9201  KMT2E | 39  0.8408  GRIN2B | 72  0.7819  PHIP | 105  0.7368  SMARCA2
7  0.9167  MBD5 | 40  0.8394  RAI1 | 73  0.7807  TBL1XR1 | 106  0.7360  THRB
8  0.9162  ARID1B | 41  0.8358  CACNA1C | 74  0.7796  DIP2A | 107  0.7359  SKI
9  0.9162  NF1 | 42  0.8307  RORB | 75  0.7783  PCDH10 | 108  0.7355  SORCS3
10  0.9158  TANC2 | 43  0.8288  TRIO | 76  0.7769  PRR14L | 109  0.7353  NUAK1
11  0.9131  TCF20 | 44  0.8284  SETD5 | 77  0.7762  SMURF1 | 110  0.7326  KDM6B
12  0.9115  SCN2A | 45  0.8230  SATB1 | 78  0.7756  PTEN | 111  0.7326  INTS1
13  0.9065  ADNP | 46  0.8227  GRIA2 | 79  0.7716  SRCAP | 112  0.7318  ZMYND11
14  0.9061  MED13 | 47  0.8221  DSCAM | 80  0.7713  TLK2 | 113  0.7306  HIVEP2
15  0.9061  KDM5B | 48  0.8217  KCNQ3 | 81  0.7710  MARK1 | 114  0.7296  VCPIP1
16  0.9058  ASXL3 | 49  0.8202  NCKAP1 | 82  0.7707  SETBP1 | 115  0.7289  DLGAP1
17  0.9048  WDFY3 | 50  0.8159  TBR1 | 83  0.7695  NSD2 | 116  0.7280  CUX2
18  0.9016  KMT5B | 51  0.8152  MYH10 | 84  0.7653  GABRB2 | 117  0.7266  BAZ1B
19  0.8955  NRXN1 | 52  0.8150  POGZ | 85  0.7653  ZC3H14 | 118  0.7257  NSD1
20  0.8936  DYRK1A | 53  0.8124  FOXP1 | 86  0.7642  CNOT3 | 119  0.7257  PRKAR1B
21  0.8928  TAOK1 | 54  0.8097  KCNB1 | 87  0.7633  TP53BP1 | 120  0.7256  PCSK2
22  0.8901  SIN3A | 55  0.8093  PRPF4B | 88  0.7627  GABRB3 | 121  0.7244  ZMYND8
23  0.8854  KMT2C | 56  0.8046  CACNA2D1 | 89  0.7615  CHD1 | 122  0.7222  NCOA6
24  0.8825  MYT1L | 57  0.8010  SMARCC2 | 90  0.7583  SPEN | 123  0.7218  RIMS1
25  0.8796  SYNGAP1 | 58  0.7973  QRICH1 | 91  0.7582  STXBP1 | 124  0.7212  SATB2
26  0.8729  SHANK2 | 59  0.7971  SPAST | 92  0.7558  ZNF407 | 125  0.7212  LRRTM2
27  0.8692  KMT2A | 60  0.7953  CLTC | 93  0.7554  SHANK3 | 126  0.7211  XPO4
28  0.8649  KIAA0232 | 61  0.7933  LRRC4 | 94  0.7547  ANKRD11 | 127  0.7208  SPTBN1
29  0.8600  TRIP12 | 62  0.7929  MORC3 | 95  0.7535  CDK13 | 128  0.7205  PHF2
30  0.8599  TCF4 | 63  0.7920  CELF4 | 96  0.7519  LARP4B | 129  0.7191  CUL3
31  0.8598  CTCF | 64  0.7920  GIGYF2 | 97  0.7516  WAC | 130  0.6612  APBA1
32  0.8494  NCOA1 | 65  0.7912  NMT1 | 98  0.7493  USP34 | 131  0.6612  CAMK2A
33  0.8488  ANKRD17 | 66  0.7877  CPD | 99  0.7488  STXBP5 | 132  0.6609  MARK2

Table A.23: Genes with ranks from 133 to 258 and their posterior probabilities generated by DeepASD. Probabilities are obtained by averaging results from 200 epochs, excluding results coming from training data.

Gene Rank  Posterior Probability  Gene Name | Gene Rank  Posterior Probability  Gene Name | Gene Rank  Posterior Probability  Gene Name
133  0.6598  HIVEP3 | 166  0.6432  KPNA1 | 231  0.6810  PPP2R5D
134  0.6596  LRP6 | 167  0.6432  KIRREL3 | 232  0.6802  GLYR1
135  0.6591  FOXP2 | 168  0.6428  PLPPR4 | 233  0.6783  GABRG3
136  0.6587  IGSF3 | 169  0.6425  RTN4RL1 | 234  0.6781  SRPRA
137  0.6583  NLGN2 | 170  0.6424  ARHGEF7 | 235  0.6760  UNC5D
138  0.6581  TERF2 | 171  0.6416  DPP6 | 236  0.6755  KCNMA1
139  0.6578  CDH10 | 172  0.6415  LRFN5 | 237  0.6754  CREBBP
140  0.6570  USF3 | 173  0.6410  RBFOX2 | 238  0.6753  DOCK3
141  0.6567  PCDH1 | 174  0.6407  PRAMEF10 | 239  0.6751  FAM169A
142  0.6565  WDR37 | 175  0.6405  BTRC | 240  0.6744  CCPG1
143  0.6564  KDM1B | 176  0.6391  PLXNA2 | 241  0.6740  SLITRK5
144  0.6559  AGAP1 | 177  0.6389  TLE4 | 242  0.6736  COL4A3BP
145  0.6551  CLASP1 | 178  0.6386  GGNBP2 | 243  0.6729  EPB41L1
146  0.6550  SYT1 | 179  0.6383  CCSER1 | 244  0.6728  GIGYF1
147  0.6536  MTMR12 | 180  0.6378  ASAP2 | 245  0.6693  GABRB1
148  0.6517  RALGAPB | 181  0.6377  PRR21 | 246  0.6692  TM9SF4
149  0.6516  MAP1B | 182  0.6377  CUL1 | 247  0.6686  UBAP2L
150  0.6516  IREB2 | 183  0.6376  INO80D | 248  0.6684  ABI2
151  0.6512  NCBP1 | 184  0.6374  GAS7 | 249  0.6684  PRR12
152  0.6502  CYLD | 185  0.6373  HUNK | 250  0.6682  PLCL2
153  0.6500  SYT7 | 186  0.6365  PSD3 | 251  0.6675  MSL2
154  0.6495  SLC12A5 | 187  0.6364  ATAT1 | 252  0.6657  RUNX1T1
155  0.6484  KAT6A | 188  0.6358  CIC | 253  0.6653  AP3D1
156  0.6482  ARID5B | 189  0.6356  TNKS | 254  0.6642  CHD2
157  0.6475  BRAF | 190  0.6355  ADCY9 | 255  0.6637  TRIM23
158  0.6471  KIF2A | 191  0.6355  ZZZ3 | 256  0.6635  HRH2
159  0.6464  TRIM71 | 192  0.6347  MYO5A | 257  0.6633  ARHGAP44
160  0.6463  SIN3B | 193  0.6339  UNC80 | 258  0.6620  JADE2
161  0.6457  CSNK2A1 | 194  0.7188  RNF38
162  0.6444  MEF2C | 195  0.7176  KIF3C
163  0.6439  STRIP1 | 228  0.6854  PUM1
164  0.6436  MKX | 229  0.6841  SRGAP3
165  0.6433  CDK19 | 230  0.6822  ZNF462
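
The captions of Tables A.22 and A.23 state that each probability is an average over 200 epochs, excluding results that come from training data. A minimal Python sketch of that kind of held-out averaging follows. It is an illustration only, not the thesis implementation: the function name average_heldout_probabilities, the input layout (one dictionary of gene probabilities plus one set of training genes per epoch or run), and the tie-breaking by gene name are all assumptions made here.

```python
from collections import defaultdict


def average_heldout_probabilities(runs):
    """Average per-gene probabilities over runs, skipping training genes.

    `runs` is an iterable of (probs, train_genes) pairs (hypothetical layout):
      probs       -- dict mapping gene symbol -> posterior probability
      train_genes -- set of gene symbols used for training in that run
    Returns a list of (rank, gene, mean_probability), best gene first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for probs, train_genes in runs:
        for gene, p in probs.items():
            if gene in train_genes:
                continue  # exclude predictions made on training data
            totals[gene] += p
            counts[gene] += 1

    averaged = {gene: totals[gene] / counts[gene] for gene in totals}
    # Sort by descending probability; ties broken by gene name (an assumption).
    ordered = sorted(averaged.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, gene, prob) for rank, (gene, prob) in enumerate(ordered, start=1)]


# Toy usage with made-up values (not taken from the tables above).
toy_runs = [
    ({"A": 0.9, "B": 0.2, "C": 0.5}, {"A"}),
    ({"A": 0.7, "B": 0.4, "C": 0.5}, {"B"}),
]
toy_ranking = average_heldout_probabilities(toy_runs)
```

A gene that appears in the training set of every run receives no averaged score under this sketch and is simply absent from the ranking.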