
Meta-analysis of Gene Expression in Individuals with Autism Spectrum Disorders

by

Carolyn Lin Wei Ch’ng

BSc., University of Michigan Ann Arbor, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Bioinformatics)

The University of British Columbia (Vancouver)

August 2013

© Carolyn Lin Wei Ch’ng, 2013

Abstract

Autism spectrum disorders (ASD) are clinically heterogeneous and biologically complex. State-of-the-art genetics research has unveiled a large number of variants linked to ASD, but in general it remains unclear what biological factors lead to changes in the brains of autistic individuals. We build on the premise that these heterogeneous genetic or genomic aberrations will converge towards a common impact downstream, which might be reflected in the transcriptomes of individuals with ASD. A considerable number of transcriptome analyses have been performed in attempts to address this question, but their findings lack a clear consensus. As a result, these individual studies have not led to any significant advance in understanding the autistic phenotype as a whole. The goal of this research is to comprehensively re-evaluate these expression profiling studies by conducting a systematic meta-analysis. Here, we report a meta-analysis of over 1000 microarrays across twelve independent studies on expression changes in ASD compared to unaffected individuals, in blood and brain. We identified a number of genes that are consistently differentially expressed across studies of the brain, suggestive of effects on mitochondrial function. In blood, consistent changes were more difficult to identify, despite individual studies tending to exhibit larger effects than the brain studies. Our results are the strongest evidence to date of a common transcriptome signature in the brains of individuals with ASD.

Preface

Under the supervision of Dr. Paul Pavlidis, I conducted and authored the work presented henceforth. Willie Kwok performed preliminary research under the mentorship of Dr. Sanja Rogic, who, together with Dr. Paul Pavlidis, contributed to the development of this project.

A version of this work will be submitted to a peer reviewed journal for publication. Carolyn Ch’ng, Willie Kwok, Sanja Rogic, Paul Pavlidis. Meta-analysis of expression profiles in the blood and brains of individuals with autism spectrum disorders (in preparation).

Eloi Mercier provided all the aggregated networks for the network analysis in Chapter 2. Portions of Chapter 2 are used with permission from Portales-Casamar et al., of which I am a second author. Elodie Portales-Casamar, Carolyn Ch’ng, Frances Lui, Nicolas St-Georges, Anton Zoubarev, Artemis Y. Lai, Mark Lee, Cathy Kwok, Willie Kwok, Luchia Tseng, and Paul Pavlidis. Neurocarta: aggregating and sharing disease-gene relations for the neurosciences. BMC Genomics, 14(1):129, February 2013. ISSN 1471-2164. doi:10.1186/1471-2164-14-129. URL http://www.biomedcentral.com/1471-2164/14/129/abstract. PMID: 23442263.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication

1 Introduction
  1.1 History and early theories in autism
  1.2 Emerging theories in autism genetics
  1.3 The search for convergence in the autism spectrum
    1.3.1 Transcriptomic analysis in ASD
  1.4 Meta-analysis in neuropsychiatry

2 Meta-analysis of expression profiles in the blood and brain tissues of individuals with autism spectrum disorders
  2.1 Methods
    2.1.1 Data retrieval, preprocessing and quality control
    2.1.2 Re-analysis of differential expression in existing autism data sets
    2.1.3 Meta-analysis of differentially expressed genes
    2.1.4 Functional enrichment analysis
    2.1.5 Literature derived ASD candidates
    2.1.6 Copy number variation enrichment analysis and prediction classifier
    2.1.7 Network analysis
  2.2 Results
    2.2.1 Systematic review shows technical differences and heterogeneity in independent Autism Spectrum Disorders (ASD) transcriptome studies
    2.2.2 Re-analysis for differential expression
    2.2.3 Meta-analysis of differential expression
    2.2.4 Robust molecular commonalities are more evident in brain samples compared to blood
    2.2.5 Functional analyses reveal perturbations in metabolic processes
    2.2.6 Shared signatures between autism and other neurodevelopmental syndromes
    2.2.7 Meta-signature genes in rare structural variants associated with ASD
    2.2.8 Network analysis and candidate gene characterization

3 Discussion and conclusion
  3.1 Similarities and differences between key findings and previous results
  3.2 Biological interpretations of meta-analyzed ASD expression profiles
  3.3 Limitations and future directions
  3.4 Conclusion

Bibliography

A Appendix

List of Tables

Table 1.1 Data sets from transcriptomic analysis in ASD.

Table 2.1 Summary of platform annotations from Gemma. Number of probes and unique genes for each platform were obtained from the Gemma platform database.

Table 2.2 Summary of tissue sources.

Table 2.3 Samples excluded in each study.

Table 2.4 Summary of diagnosis criteria and ASD phenotypes in the original studies. Refer to Table 1.1 for study citations.

Table 2.5 Demographics I - Gender. Gender imbalance is seen in some data sets, such as GSE37772. OR: Odds ratio.

Table 2.6 Demographics II - Age, PMI and race of subjects in each study. C: Caucasian or white; AA: African American; A: Asian; M: Mixed or multiracial; U: Unknown.

Table 2.7 Differentially expressed genes in each data set after re-analysis. DE: Differentially expressed genes at FDR threshold of 0.05; Up: Up-regulated genes; Down: Down-regulated genes; Number of genes: Number of genes after applying filters.

Table 2.8 Overlap between results reported in the literature and individual re-analysis of differential expression. Significant probes: Per data set significant probes from re-analysis, reported at a false discovery rate (FDR) threshold of 0.05. Probes reported: Differentially expressed probes published in original papers of each study. Gene symbols are used as a proxy for probes in GSE18123.1; GenBank accessions are used in GSE15451 and GSE15402; Spot IDs are used for GSE7329. GSE25507 computed differences in expression variance instead of differential expression; GSE37772 reported outlier genes instead of differentially expressed genes; GSE32136 is not published.

Table 2.9 Overlap (overlap/total up or down-regulated in data set) between meta-signature (FDR < 0.05) and significantly differentially expressed genes per data set (FDR < 0.05), as well as enrichment of meta-signatures in the results of individual differential expression analysis. One-sided p-values were used to compute FDR here. AU-ROC: area under receiver operating characteristic curve; AP: average precision.

Table 2.10 Comparisons of blood and brain signatures. AU-ROC reported for signature of tissue A on ranked gene list from meta-analysis of tissue B (A-B).

Table 2.11 Top genes in the “cellular respiration” (GO) category at a meta-analysis raw p-value threshold of 0.0001. There are a total of 116 genes in this functional group.

Table 2.12 Top genes in the Simons Foundation Autism Research Initiative (SFARI) “syndromic” category at a meta-analysis raw p-value threshold of 0.05. There are a total of 19 genes in this gene set.

Table 2.13 Dysregulated genes (FDR < 0.05, meta-signature) within ASD-associated CNV. Fisher’s exact test was used to compute significance. NS: Not significant.

Table 2.14 Dysregulated genes in the brain that are found in known ASD CNVs. CNVs that span the same gene or set of genes are grouped together.

Table 2.15 Dysregulated genes in the blood that are found in known ASD CNVs.

Table 2.16 Predictions on GSE37772 samples using preliminary copy number variation (CNV) classifier. CV: cross validated; SV: support vectors; AU-ROC: AU-ROC computed for other 15q samples (originally predicted but not confirmed).

Table 2.17 Categorization of our candidate genes based on Neurocarta. Ndev.: neurodevelopment.

Table 2.18 Meta-signature genes that are also dysregulated in schizophrenia. Meta-analysis FDR < 0.1.

Table 3.1 Comparisons between core signature genes in blood and differentially expressed genes reported in original studies. Total hits: Total hits reported in original study (Genes); Total genes analyzed: Estimated total number of genes analyzed in each study based on Gemma platform annotations.

Table 3.2 Similar to Table 3.1, for core signature genes in the brain.

Table A.1 Up-regulated brain meta-signature. FDR computed before removal of sex-biased genes. A: Known Candidate; B: Gender Biased; C: Known CNV. Y: Yes; N: No.

Table A.2 Down-regulated brain meta-signature. FDR computed before removal of sex-biased genes. A: Known Candidate; B: Gender Biased; C: Known CNV. Y: Yes; N: No.

Table A.3 Up-regulated blood meta-signature. FDR computed before removal of sex-biased genes. A: Known Candidate; B: Gender Biased; C: Known CNV. Y: Yes; N: No.

Table A.4 Down-regulated blood meta-signature. FDR computed before removal of sex-biased genes. A: Known Candidate; B: Gender Biased; C: Known CNV. Y: Yes; N: No.

Table A.5 Genes that have been shown to exhibit sexual dimorphism in blood and brain. Asterisks denote known ASD candidates.

List of Figures

Figure 1.1 Two possible models leading to similar behavioral characteristics in ASD.

Figure 2.1 Overview of analysis pipeline.

Figure 2.2 Expression profiles of samples from the cerebellum differ from those of the cortex, as seen in the sample correlation matrix of GSE28521.

Figure 2.3 Samples obtained from the temporal cortex and frontal cortex of the same individual exhibit highly correlated expression values. Data from GSE28521 shown here.

Figure 2.4 Batch effects: a) Clustering of data points into distinct batches with respective percentage variances, suggesting the presence of batch effects. b) Batch effects were removed after batch correction (includes robust probes only). Each point marks a sample. Colours represent different batches; shapes indicate ASD or control.

Figure 2.5 Comparison between results obtained from Fisher’s method and the Meta-Rank method. The peaks on the left suggest that genes ranked at the top are similar for both methods.

Figure 2.6 π0 values for each study against sample size. Error bars denote standard errors for 100 bootstrap iterations. PB: peripheral blood; PBL: peripheral blood lymphocytes; LCL: lymphoblastoid cell lines; WB: whole blood.

Figure 2.7 Profiles of meta-signatures from the blood and brain: raw p-values for each individual data set are plotted against corrected p-values (FDR) of the meta-signatures. Local Polynomial Regression (LOESS) is used to obtain a smooth fit. The shaded areas represent 95% confidence intervals of the prediction using the t-based approximation (see “stat_smooth” in the ggplot2 R package).

Figure 2.8 Heat map visualizations of core-signature expression values in each of the brain data sets. Batch corrected expression values were scaled across samples within each data set. Relative expression levels: yellow - high; blue - low. A different visualization for each core signature gene can be seen in Fig. 2.11.

Figure 2.9 Gene expression levels of core blood signature. Relative expression levels: yellow - high; blue - low; grey - missing values.

Figure 2.10 P-values of core blood signature in individual studies. Deviation from the diagonal for quantile-quantile plots. a) Up-regulated genes. b) Down-regulated genes.

Figure 2.11 P-values of core brain signature in individual studies. a) Up-regulated genes. b) Down-regulated genes.

Figure 2.12 Raw p-values of genes located in 15q11-13 (UBE3A, CYFIP1), Xp22 (CDKL5), and 7q11.23 (RFC2). Top (Q-Q plots): The lack of an overall deviation from the uniform diagonal suggests that the signals are skewed. Bottom: Per-data set p-value with a p-value threshold of 0.05 (dashed grey); genes that meet an FDR threshold of 0.05 in the data set are marked with a triangle. Compare with core-blood signatures in Figure 2.10.

Figure 2.13 PPIN network properties of core candidate genes in the blood and brain compared to that of respective random gene sets.

Figure 2.14 Brain co-expression network properties of core candidate genes in the blood and brain compared to that of respective random gene sets.

Figure 2.15 Liver co-expression network properties of core candidate genes in the blood and brain compared to that of respective random gene sets.

Figure 2.16 Common brain meta-signatures between the autism (current study) and schizophrenia meta-analyses by Mistry et al. a) Up-regulated genes; b) Down-regulated genes.

Glossary

ACRD Autism Chromosomal Rearrangement Database

ADI-R Autism Diagnostic Interview, Revised

ADOS Autism Diagnostic Observation Schedule

ASD Autism Spectrum Disorders

AP average precision, equivalent to the area under the precision-recall curve.

AU-ROC area under receiver operating characteristic curve, equivalent to the Wilcoxon rank-sum test.

CNV copy number variation

DSM-IV Diagnostic and Statistical Manual of Mental Disorders, 4th edition

DSM-5 Diagnostic and Statistical Manual of Mental Disorders, 5th edition

FDR false discovery rate

GO Gene Ontology

MD mitochondrial dysfunction

PPIN protein-protein interaction network

PDD-NOS pervasive developmental disorder not otherwise specified

SFARI Simons Foundation Autism Research Initiative

Acknowledgments

First and foremost I would like to thank my supervisor, Dr. Paul Pavlidis, whose brilliance and tenacity brought me this far.

I would like to express my deepest appreciation for my thesis committee, comprising Dr. Jennifer Bryan and Dr. Suzanne Lewis, for their time and effort in reviewing my work. Special thanks to Dr. Steven Jones, the program director and examination chair.

To research associates, Dr. Sanja Rogic and Dr. Elodie Portales-Casamar, thank you for providing valuable advice throughout my research. I would also like to thank past and present Pavlidis lab members for their support.

I am grateful to all investigators and institutions who made their data publicly available, as well as Dr. Christian Marshall (The Centre for Applied Genomics, The Hospital for Sick Children, Toronto), who shared data from the Autism Chromosomal Rearrangement Database (ACRD).

Finally, many thanks to faculty members, staff members and funding agencies of the Canadian Institutes of Health Research (CIHR) Strategic Training Program in Bioinformatics for making this program a fulfilling one.

Dedication

For my family.

Chapter 1

Introduction

Neuropsychiatry has come a long way in the last century. Thanks to technological ingenuity, we saw a shift from Freudian psychoanalysis to the rigorous analyses of high-throughput biological information we have today. Apart from its scale, high-throughput biology offers systems-level insights that complement targeted approaches in molecular neurobiology. Biological information is encoded in various molecular forms, such as DNA, RNA and amino acids. The fundamental relationship between these components, as described in the central dogma, is that the most basic encoding, DNA, is transcribed to RNA and subsequently translated to protein. New technologies quickly emerged after the completion of the Human Genome Project. These high-throughput technologies yielded a massive amount of molecular data on human illnesses, which were digitized and deposited into bio-repositories. These repositories continue to grow. But for many disorders, particularly complex ones like heart diseases or mental illnesses, the underlying biology remains cryptic. Knowledge accumulated over the years has shown that the mechanisms are far more complicated than anticipated, even more so when the biological system in question is the human brain.

Discussions on the cause of mental illnesses have spanned many disciplines. But ultimately the symptoms are manifestations of biological processes that occur in a life form, governed by the genome. To better understand the biological basis of a common and complex neuropsychiatric disorder, I conducted a comprehensive investigation of the gene expression profiles (transcriptomes) of individuals with autism. Autism Spectrum Disorders (ASD) encompass a range of neuropsychiatric disorders that together manifest as a highly heritable neurodevelopmental disease [2]. ASD is currently characterized by a set of behavioral characteristics including social communication deficits, as well as restrictive and repetitive behaviors (DSM-5 299.00) [3]. In this introduction, I will first review the history of ASD and the current state of research. I will then describe analytical approaches previously applied in neuropsychiatry, in pursuit of biological commonalities that might explain the autistic phenotype.

1.1 History and early theories in autism

According to the Oxford Dictionary, the word autism stems from the Greek word autos, meaning “self”. In 1911, Eugen Bleuler used the term “autism” to describe one of the fundamental symptoms in schizophrenia [4]. He likened the autistic behavior to that of monks in monasteries or scientists absorbed in their studies. But schizophrenia has an adult onset. The “autism” that occurs in early childhood, as it is known today, was first described by Leo Kanner in 1943. He presented eleven cases with what he called “infantile autism”, representing symptoms that distinguished these children from those with childhood schizophrenia [5]. Some of the symptoms Kanner documented, such as communication impairment (echolalia) and obsessive repetitiousness, are still used in diagnosis to this day. Remarkably, Hans Asperger independently reported similar findings around the same time (1944). Asperger described an “autistic psychopathy” in four boys, whose symptoms were similar to the cases reported by Kanner [6]. However, these boys did not have communication impairments. Based on DSM-IV, individuals with this condition, later termed Asperger’s syndrome, would receive a separate diagnosis from those with classical autism. The other autism “subtypes” in DSM-IV are autistic disorder and pervasive developmental disorder not otherwise specified (PDD-NOS). But these subtypes were removed in DSM-5 for several non-biological reasons that remain debated.

The implications of Kanner’s original report were threefold. First, following its publication in the 1940s, several psychogenic theories emerged, putting the blame on parents or “refrigerator mothers” for their children’s autism [7]. This was partially influenced by Kanner’s ending remarks, that “there are very few really warmhearted fathers and mothers” in the families of the eleven children studied. Secondly, along with the advent of neuroimaging techniques, Kanner’s documentation of enlarged head size in five of his eleven cases shifted some of the focus to neuroanatomical abnormalities in individuals with autism [8]. A third major implication, which arose later in the 1970s, is the possibility of a heritable component or inborn defect in this disorder, given that it appears in early infancy. Because initial single family or twin pair studies were inconclusive, it was not until 1977, when Folstein and Rutter reported a study of 21 twin pairs, that genetic influences in autism came to light [9]. Folstein and Rutter also acknowledged the potential impact of environmental factors. The exact mode of inheritance was unclear. But as evidence accrued, more studies explored the neurobiology and genetics of autism, which in effect invalidated claims of “bad parenting”.

1.2 Emerging theories in autism genetics

ASD is strongly (perhaps primarily) influenced by genetics, but variations in single genes account for only a small fraction of cases. A relatively common ASD-associated single gene aberration, in FMR1, accounts for merely 1-3% of cases [10]. To understand the genetic basis of idiopathic autism, large collaborations and consortia such as the Autism Genome Project (AGP) and Autism Genetic Resource Exchange (AGRE) were formed. Like their predecessors in linkage mapping, genome wide association studies (GWAS) on different cohorts yielded a few variants or loci that confer risk of ASD. But the effects were weak, in that associations were only established with combined cohorts or “mega-analysis” [11]. The results were also not entirely reproducible across studies [12].

Other than monogenetic forms of autism, rare copy number variations (CNVs), both transmitted and de novo, are perhaps the next “well-established” category of genetic variation contributing to ASD (in terms of the fraction of cases accounted for). Among the structural variants that show compelling evidence of association with ASD are 7q11.23, 15q11.2-13.1 and 16p11.2 [13, 14]. These findings were replicated in multiple individuals. While some of these structural variants have been implicated in other neurodevelopmental disorders, their etiological role in the brain remains poorly understood.

More variants were identified with the availability of affordable next generation sequencing technology. In a single issue of Nature (Vol. 485, 2012), three high resolution exome sequencing studies were published [15–17], reporting many point mutations in coding regions that presumably originated from germ line mutations (though there could be exceptions where mutations occur in the embryonic stages of development). I emphasize that the “genetics” I discuss here do not necessarily imply some form of inheritance, as the term de novo indicates that pathological changes in risk genes occur in autistic children of healthy or unaffected parents. This is also the case in simplex families, where only a single child is diagnosed with autism or where siblings are discordant for autism. Findings in sporadic cases might therefore explain the incomplete penetrance and variable expressivity. On the other hand, studies that focused on the heritable components have indeed identified deleterious variants in multiplex or consanguineous families [18, 19]. But because these are isolated cases that, like known monogenetic forms, usually present as part of a syndrome, it is difficult to validate how they lead to ASD.

Figure 1.1: Two possible models leading to similar behavioral characteristics in ASD.

The underlying etiology of ASD is still unknown despite the recognition of a common set of behavioral traits and intensive research. At present, there are several hundred ASD candidate genes. What molecular autism genetics has revealed is the genetic heterogeneity of this neurodevelopmental disorder. There is substantial biological variability among individuals with autism, such that genetic variants identified in small fractions of individuals may not be sufficient for seeing the “big picture” or understanding the functional architecture of an autistic brain. Given this complexity, there appear to be two models for how ASD arises (Fig. 1.1). One is that many different genetic lesions lead to a common set of changes in the brain, which gives rise to a common range of behavioral traits. Alternatively, the behavioral manifestations may be due to widely varying underlying pathologies. The truth may lie between these two extremes, and is further complicated by the phenotypic heterogeneity of ASD. In recognizing the different manifestations of the disorder, Uta Frith first coined the term Autism Spectrum Disorder [20]. Yet there is a prevailing sense that there must, at some level, be aspects in common beyond the behavior that is diagnosed or documented. As the search for genetic biomarkers has not been tremendously fruitful, systems-level approaches have emerged in the search for a unifying theory in autism neurobiology.

1.3 The search for convergence in the autism spectrum

Autism was thought to be an uncommon disorder back in the 1970s, when genetic influences were first explored. But with the widening of diagnostic criteria and increasing awareness, the number of autism cases has been rising, with a currently reported average prevalence of approximately 1% (1 in 100 children) worldwide according to the Centers for Disease Control and Prevention (CDC)¹. Therefore, while quests for a unitary genetic element have diminished, there is growing interest in exploring other biological dimensions to find a “common ground” for autism.

¹ http://www.cdc.gov/ncbddd/autism/data.html, retrieved in 2013.

Besides GWAS, the search for converging endophenotypes or biomarkers has spanned many modalities, including imaging, proteomics and transcriptomics. Markers highlighted in imaging studies include facial features [21] and neural responses to facial expressions [22]. In a recent protein interactome study, Sakai et al. revealed new interactions among ASD-associated genes [23]. Comparing transcriptomes of groups of individuals with ASD to individuals without ASD has been another approach in the search for biological convergences among ASD cases (Table 1.1).

Of the different biological strata, I am particularly interested in the transcriptome because it marks the initial expression of the genome. An oversimplified concept is that a common transcriptome profile gives rise to higher level similarities observed in the proteome, as well as in physiological and anatomical properties of the brain. Unfortunately, even though some ASD transcriptome studies individually reported striking results, there seems to be little agreement when research findings are compared across related studies. I will provide some background on transcriptome analysis in the section below.

Data set     Platform   Reference          Tissue type                                  Number of samples (ASD : Control)

Brain
GSE28475     GPL6883    Chow et al.        DLPFC, middle frontal gyrus                  13 : 21
GSE28521     GPL6883    Voineagu et al.    Frontal (BA9)/Temporal Cortex (BA41,42,22)   12 : 15
GSE38322     GPL10558   Ginsberg et al.    Occipital Cortex (BA19)                      4 : 6
Total                                                                                   29 : 42 = 71

Blood
GSE6575      GPL570     Gregg et al.       Whole Blood                                  33 : 11
GSE7329      GPL1708    Nishimura et al.   LCL                                          7 : 6
GSE15402     GPL3427    Hu et al.          LCL                                          77 : 29
GSE15451     GPL3427    Hu et al.          LCL                                          15 : 12
GSE18123.1   GPL570     Kong et al.        Whole Blood                                  64 : 28
GSE18123.2   GPL6244    Kong et al.        Whole Blood                                  93 : 63
GSE25507     GPL570     Alter et al.       PBL                                          80 : 63
GSE32136     GPL3427    Unpublished        LCL                                          9 : 7
GSE37772     GPL6883    Luo et al.         LCL                                          232 : 199
Total                                                                                   610 : 418 = 1028

DLPFC: dorsolateral prefrontal cortex; LCL: lymphoblastoid cell line; PBL: peripheral blood lymphocytes

Table 1.1: Data sets from transcriptomic analysis in ASD.

1.3.1 Transcriptomic analysis in ASD

I now take up the idea that commonalities among ASD cases might be discerned in the transcriptome. One potential benefit of transcriptome analysis is that it is one step removed from what are likely to be diverse genetic influences, though there are exceptions, such as copy number sensitive genes. Furthermore, it has been relatively easy and practical to analyze transcriptomes compared to proteomes, whose analysis involves more molecular dynamics and kinetics. To provide some perspective on the popularity of transcriptome analysis, the Gene Expression Omnibus (GEO)² currently holds 732,789 RNA samples, 173,588 genomic samples and 6,421 protein samples. The hypothesis is that molecular commonalities might be revealed across the transcriptomes of individuals, helping to explain the autistic phenotype regardless of their genetic background or the specific causal variants underlying their autism.

² http://www.ncbi.nlm.nih.gov/geo/summary/?type=samples, retrieved on July 30, 2013.

In agreement with this, two previous studies reported some convergence in the transcriptomes of independent ASD cohorts [25, 28]. Nishimura et al. (2007) studied ASD individuals with either maternally derived 15q11-13 duplications (15q for brevity) or fragile-X mutations (FMR1-FM). They reported similarities in the molecular pathways affected between the two syndromes. Voineagu et al. (2011) found evidence for convergent molecular abnormalities between gene expression in post-mortem brain samples and an independent cohort from a GWAS. However, while these reports described some agreement within studies, it is not clear how much agreement there is across different studies. For example, Nishimura et al.’s gene list was most enriched for “cell communication”, whereas Voineagu et al. reported enrichment of genes involved in “synaptic function”, “vesicular transport” and “neuronal projection”. Other transcriptome studies have implicated an even more diverse array of biological functions, ranging from circadian rhythms [29] to metabolism [26].

Discrepancies in research findings can be partially attributed to methodological differences. Among the differences seen in previous transcriptome studies are tissue type and the expression profiling platform used. In autism, many researchers have turned to examining biological samples that are more accessible, namely blood samples. Although analyses of brain samples may be more relevant to the disorder, limited resources pose problems for achieving sufficient statistical power. Unlike the genome, which is in theory the same throughout an organism, the transcriptome comprises different compositions of RNA transcripts in different cell types and developmental stages. Because the blood and brain are fundamentally different tissues with different biological roles, one wonders what the tradeoff is in using blood RNA samples to boost statistical power. A direct comparison between gene expression in the blood and brain was inconclusive [34]. A more recent study suggests that transcriptome profiles of different tissue types are distinguishable [35], so cross-tissue comparisons should be done with caution. However, cross-tissue comparisons are often omitted in blood transcriptome studies, raising questions about whether accurate functional inferences can be made from the results. I decided to take a more conservative approach, analyzing blood and brain samples separately.

Autism research is evolving rapidly with next generation sequencing technology, but currently the larger fraction of transcriptome studies has been performed on microarray platforms. Therefore, we will focus on microarray expression profiling data. There is a variety of microarray platforms, but the fundamental process in microarray expression profiling is the hybridization of labeled RNA samples onto the DNA of known genes, also known as probes. The specific details of how this is done depend on the manufacturer of the microarrays or the platform used. The readouts are similar, in that the abundance of each RNA transcript probed is quantified (within a certain range). However, differences among these platforms have been documented. Previous work found that the inconsistencies originate from varying experimental protocols, including preprocessing, quality control, and gene annotations [36]. These incompatibilities become a problem when we make comparisons across studies, potentially contributing to the discrepancies in current research findings. Thankfully, such issues have been extensively addressed due to the widespread use of microarray technology [37]. It is possible to reconcile these differences by conducting a systematic re-analysis, and thereby obtain comparable expression profiles that are platform or laboratory independent. As far as we know, a detailed comparison or meta-analysis has not been conducted for ASD expression profiling studies. It remains unclear whether there might be more subtle similarities among expression profiles of independent cohorts.

1.4 Meta-analysis in neuropsychiatry

The explosion of genomic data is a double-edged sword. The information now available has propelled biomedical research forward at an unprecedented pace, but it has also brought considerable disorder to research findings. There are many possible reasons why previous studies report different genes and pathways as being affected in ASD, even if there are commonalities present. One is the difference in tissues or cell types analyzed. Another likely source of lack of consensus is clinical heterogeneity [38, 39], which might lead to some differences in research populations among studies. Also contributing are methodological differences in the design and implementation of analyses as mentioned earlier. Finally, small sample sizes of individual studies might not provide sufficient statistical power to

uncover subtle perturbations. These issues may mask reproducible aspects of the transcriptome in ASD, which might be revealed by re-examining the original data and performing a meta-analysis. A systematic meta-analysis can overcome sample size limitations and reduce the effects of methodological differences.

The term “meta-analysis” can be interpreted as the “analysis of analyses” [40]. Like review articles, meta-analyses aim to summarize the findings of multiple independent studies in a field. However, meta-analyses can be more thorough, in that the primary data of each study are integrated and combined quantitatively. The summaries provided in literature reviews are sometimes inconclusive and uninformative due to divergent findings [41]. On the other hand, a meta-analysis can yield new insights that were not previously discovered in individual studies, leading to advances in the research area.

Meta-analysis techniques have been successfully applied in diverse fields, including social sciences and pharmaceutical studies, and have gained traction in neuropsychiatry in recent years [42–44]. In designing a meta-analysis, one has to account for a number of factors, including the information available for re-analysis and whether it is compatible across studies. Partly driven by the data, different methods were used in previous meta-analyses of expression profiles in neuropsychiatry: Mistry et al. (2013) first combined expression profiles across studies and subsequently tested for differential expression with a fixed effects linear model; Rogic and Pavlidis (2009) reanalyzed individual studies with a fixed effects model and then combined the p-values; Choi et al. (2008) computed a consensus fold change. To our knowledge, cross-cohort gene expression analyses have only been done in at most two independent ASD cohorts, primarily for cross validation purposes [25, 31].
Other ASD related meta-analyses are geared towards examining pathogenic variations in whole exomes of individuals [46, 47], not transcriptomes. A systematic integration of expression data across multiple independent ASD cohorts will add value to the existing data, and may yield novel insights. In the next chapters I will present research methods used, followed by the findings and a discussion on the subject matter.

Chapter 2

Meta-analysis of gene expression profiles in the blood and brain tissues of individuals with autism spectrum disorders

We report the meta-analysis of expression data from twelve ASD transcriptome studies (Table 1.1). Together, they comprise over 1000 human sample microarrays, 639 of which are from ASD individuals (Fig. 2.1). Our analysis reveals a small number of genes with consistently altered expression levels in the brains or blood of individuals with ASD. The blood and brain profiles are dissimilar, which has important implications for the interpretation of our findings. Functional analysis performed on the results of the meta-analysis suggests pathological convergence towards neurological and metabolic co-morbidities, both of which have been previously associated with the disorder.

2.1 Methods

2.1.1 Data retrieval, preprocessing and quality control

We retrieved gene expression data sets matching the keywords “autism” or “autistic” from the GEO¹ [48]. There were no additional unique data sets found in ArrayExpress². Short-listed data sets include human blood and brain expression profiling studies with case-control experimental designs only.

¹ http://www.ncbi.nlm.nih.gov/geo/, retrieved on September 10, 2012.
² http://www.ebi.ac.uk/arrayexpress/

Figure 2.1: Overview of analysis pipeline.

A preliminary analysis was conducted on these data sets. Two studies in the initial pool, GSE4187 and GSE26415, were disqualified from the analysis. GSE4187 was removed to avoid biasing the analysis, because all of its autistic subjects (case-control channels) are also included in GSE15402 (after removing outlying samples). GSE26415 was disqualified as an outlier, based on the preliminary analysis. This data set exhibits an implausibly large number of differentially expressed genes, to the extent that it provided some evidence of differential expression for nearly all genes (the estimated overall fraction of differentially expressed genes is 70% based on q-value analysis; applying a false discovery rate (FDR) threshold of 0.05 yields 4857 differentially expressed genes, nearly all being up-regulated in the ASD cases). The reason for this unusual finding is unclear, but it agrees with the original report by Kuwano et al. [49], who found that 9784 probes were differentially expressed at an FDR of 0.05. The final set of twelve studies (Table 1.1) consists of data collected on a variety of platforms, including one channel intensity data from Affymetrix and Illumina platforms, and two channel intensity data from Agilent and TIGR platforms (Table 2.1).

| Platform | Platform Name | Gemma Probes | Unique Genes |
|---|---|---|---|
| GPL10558 | Illumina HumanHT-12 V4.0 expression beadchip | 47323 | 22020 |
| GPL1708 | Agilent-012391 Whole Human Genome Oligo Microarray G4112A | 44347 | 19123 |
| GPL3427 | TIGR 40k Human Array | 41472 | 15422 |
| GPL570 | Affymetrix U133 Plus 2.0 Array | 54681 | 19460 |
| GPL6244 | Affymetrix Human Gene 1.0 ST Array | 33297 | 20404 |
| GPL6883 | Illumina HumanRef-8 v3.0 expression beadchip | 24526 | 17562 |

Table 2.1: Summary of platform annotations from Gemma. Number of probes and unique genes for each platform were obtained from the Gemma platform database.

Raw data (e.g., .CEL, .MEV) are often processed with various methods that differ across studies. Whenever possible, we downloaded raw data files for data sets on Affymetrix and Illumina platforms and uniformly preprocessed them locally. Affymetrix data sets were subjected to Robust Multi-array Analysis (RMA) from the affy [50] package in Bioconductor. Illumina data sets were quantile normalized and log2 transformed using the lumi [51] package. Data sets on the TIGR platform were not locally preprocessed, as the submitters’ preprocessing methods are similar across the studies. Only one data set uses Agilent arrays, so standardized preprocessing was not necessary for that platform. Sample sources and tissue types for each data set are specified in Table 2.2.

The processed data were then subjected to an additional set of quality controls. We identified and excluded 17 samples that were used in more than one study, retaining data for the samples in the study with a smaller sample size. We removed eight samples from subjects with syndromic disorders of known genetic etiology (Fragile-X syndrome), nine non-ASD cases with mental retardation, as well as samples which were prepared differently than the rest of the samples (e.g. formalin fixed). Cerebellar expression profiles differ from those of the cortex (Fig. 2.2). Ideally we would analyze the cerebellum separately, however only two out of the three brain data sets had samples from the cerebellum, so we excluded

| Group | Accession | Tissue Type | Tissue Source |
|---|---|---|---|
| Brain | GSE28475 | Dorsal lateral prefrontal cortex, middle frontal gyrus | HBTRC, NICHD |
| Brain | GSE28521 | Frontal cortex (BA9), temporal cortex (BA41/42 or BA22) | ATP, HBB |
| Brain | GSE38322 | BA19 | ATP, HBTRC, NICHD |
| Blood | GSE6575 | Whole blood | CHARGE |
| Blood | GSE7329 | Lymphoblastoid cell lines | AGRE |
| Blood | GSE15402 | Lymphoblastoid cell lines | AGRE |
| Blood | GSE15451 | Lymphoblastoid cell lines | AGRE |
| Blood | GSE18123.1 | Whole blood | CHB, ACB |
| Blood | GSE18123.2 | Whole blood | CHB, ACB |
| Blood | GSE25507 | Peripheral blood lymphocytes | Phoenix |
| Blood | GSE32136 | Lymphoblastoid cell lines | AGRE |
| Blood | GSE37772 | Lymphoblastoid cell lines | Simons Simplex Collection |

HBTRC: Harvard Brain Tissue Resource Centre. NICHD: National Institute for Child Health and Human Development Brain and Tissue Bank. ATP: Autism Tissue Program. HBB: Harvard Brain Bank. CHARGE: Childhood Autism Risks from Genetics and the Environment. AGRE: Autism Genetic Resource Exchange. CHB: Children’s Hospital Boston. ACB: Autism Consortium Boston.

Table 2.2: Summary of tissue sources.

them entirely from the analysis. Further details of exclusion are in Table 2.3. We removed sample outliers in each study using a sample correlation analysis. Outlying samples were identified as those with correlation more than two standard deviations from the mean sample-to-sample expression profile correlation, and removed iteratively until no samples met the threshold for removal. This resulted in the removal of a total of 54 samples, affecting seven studies. The remaining samples in the data set were renormalized using quantile normalization.

Figure 2.2: Expression profiles of samples from the cerebellum differ from those of the cortex, as seen in the sample correlation matrix of GSE28521.

| Accession | Samples excluded |
|---|---|
| GSE6575 | Removed non ASD subjects with mental retardation or developmental delay. |
| GSE7329 | Removed samples with a Fragile X (FMR1-FM) mutation. Removed samples in 2005 batch (scan date). |
| GSE15402 | Removed samples in batches not performed by user “KyungS”. |
| GSE15451 | Removed tissue samples that overlap with GSE32136 [Blood ID: HI0779, HI2022, HI2772, HI3143, HI3914, HI2044, HI2769, HI3144, HI4360, HI0777], as well as a subject that overlaps with GSE15402 [Subject ID: AU0325]. |
| GSE18123.2 | Excluded samples without an assigned batch. |
| GSE28475 | Excluded seizure samples, in-vitro (IVT) assays. Excluded formalin-fixed samples. Removed samples that are also present in GSE38322 [Subject ID: UMB4670, UMB1860]. |
| GSE28521 | Excluded samples from the cerebellum. Removed subjects that overlap with subjects in GSE38322 [Subject ID: AN19511, AN06420, AN08873, AN10833]. |
| GSE32136 | Excluded propionic acid (PPA) treated samples and samples in batch “bcmmes” (user ID). |
| GSE38322 | Excluded samples from the cerebellum. |
| GSE37772 | Excluded samples from mothers. |

Table 2.3: Samples excluded in each study.

The identification of independent units is crucial for statistical analysis, because hidden correlations (Figure 2.3) can lead to biases and inflate statistical significance [52]. As the question of interest in our study concerns a biological comparison between whole organisms, that is individuals with ASD and individuals without ASD, an independent biological unit is then equivalent to a single sample from a unique subject. Multiple samples obtained from the same subject were regarded as technical replicates. Two of the studies (GSE28521, GSE28475) included such technical replicates for some specimens, in which case we computed the mean of the expression values to get a single expression profile for each subject.
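The iterative outlier removal described above can be illustrated with a short Python sketch. It is not the code used in this work. The function names are hypothetical, and two details that the text leaves open are assumed here: only samples lying more than two standard deviations below the mean are treated as outliers, and a single sample (the worst one) is removed per iteration.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def remove_outliers(profiles, n_sd=2.0):
    """profiles: dict sample_id -> list of expression values (same gene order).

    Returns (kept, removed). Assumption: a sample is an outlier when its mean
    correlation with all other samples is more than n_sd standard deviations
    below the average of those per-sample means.
    """
    kept = dict(profiles)
    removed = []
    while len(kept) > 3:
        ids = list(kept)
        avg_cor = {
            s: mean([pearson(kept[s], kept[t]) for t in ids if t != s])
            for s in ids
        }
        values = list(avg_cor.values())
        mu, sd = mean(values), stdev(values)
        worst = min(ids, key=avg_cor.get)
        if sd == 0 or avg_cor[worst] >= mu - n_sd * sd:
            break  # no sample meets the removal threshold any more
        removed.append(worst)
        del kept[worst]
    return kept, removed
```

Recomputing the correlations after every removal matters because a single strongly deviant sample inflates the standard deviation and can hide milder outliers in the first pass.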

Figure 2.3: Samples obtained from the temporal cortex and frontal cortex of the same individual exhibit highly correlated expression values. Data from GSE28521 shown here.

We also looked at possible batch effects whenever batch information was available. Batch information was obtained by automated extraction of “scan dates” or “users” in raw data files, as well as supplementary texts and metadata provided by submitters. To detect possible batch effects, we compared the first two principal components to batch data, and visually identified groups that are separated by the principal components. The amount of variation explained by each principal component is also reported (Figure 2.4). In this analysis, we discovered 34 samples in which batch effects were confounded with the case grouping. These samples were removed. In other data sets, we corrected for possible batch effects using ComBat [53] after discarding probes that are missing in more than 20% of the samples. Batch effects could not be corrected in GSE28521 due to small batch sizes, such that priors cannot be estimated in ComBat. Illumina slide numbers were used as batch information here.

Figure 2.4: Batch effects: a) Clustering of datapoints into distinct batches with respective percentage variances, suggesting the presence of batch effects. b) Batch effects were removed after batch correction (includes robust probes only). Each point marks a sample. Colours represent different batches; shapes indicate ASD or control.

2.1.2 Re-analysis of differential expression in existing autism data sets

Differential expression analysis was conducted using analysis of variance (ANOVA) based on an empirical Bayes approach provided in the limma R package. For each data set, we conducted a two-group disease-control comparison for all probes. Phenotypic subgroups such as Asperger’s disorder and PDD-NOS were pooled into one generic autism disease group. To consider the direction of expression change in the meta-analyses, we computed one-tailed p-values from the resulting two-tailed p-values and t-statistics. Probes were annotated with platform specific annotations in Gemma³, where gene assignments are made based on current genome annotations obtained via sequence analysis [54]. Each data set was then collapsed to the gene level to allow cross-platform integration. When multiple probes map to a single gene, we assigned the Bonferroni corrected minimum p-value (p-value of best scoring probe, n×min(p) < α) to the gene, as the smallest p-value is least likely to occur by chance [37]. We excluded from the analysis probes that map to multiple genes or do not map to a gene at all. The proportion of differentially expressed genes (π1 = 1−π0) was estimated using the qvalue package in R. As the internal “bootstrap” method in the package does not return a standard error, we computed π0 standard errors over one hundred bootstrap iterations locally. π0 values from the “bootstrap” and “smoother” methods were similar.

We compared the results from our re-analysis to those of previous publications, and evaluated the outcomes using the area under the receiver operating characteristic curve (AU-ROC), as well as average precision (AP). The AU-ROC gives the probability that a true positive is ranked higher than a false positive. AP is the fraction of correct hits among the top N ranked genes, averaged across N = {1...n}, where n is the total number of genes in the rankings. While AU-ROC is an evaluation of where the true positives lie relative to the false positives, AP is sensitive to genes with higher rankings (top hits). The same threshold free methods are also used in later sections.
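To make the steps of this subsection concrete, the following Python sketch shows the one-tailed conversion, the Bonferroni-corrected minimum p-value per gene, and the two threshold free metrics. It is only an illustration under stated assumptions, not the R code used in this work. All function names are hypothetical, and AP follows the definition given above (precision among the top N genes averaged over every N), which differs from the common variant that averages only over ranks at which a true positive occurs.

```python
def one_tailed(p_two_tailed, t_stat, direction="up"):
    """One-tailed p-value from a two-tailed p-value and the sign of t."""
    agrees = t_stat > 0 if direction == "up" else t_stat < 0
    return p_two_tailed / 2 if agrees else 1 - p_two_tailed / 2


def collapse_to_genes(probe_pvalues):
    """probe_pvalues: dict gene -> list of probe p-values.

    Returns the Bonferroni-corrected minimum, n * min(p), capped at 1.
    """
    return {g: min(1.0, len(ps) * min(ps)) for g, ps in probe_pvalues.items()}


def au_roc(ranking, positives):
    """Fraction of (positive, negative) pairs with the positive ranked first."""
    seen_pos, ordered_pairs = 0, 0
    for gene in ranking:  # ranking[0] is the top hit
        if gene in positives:
            seen_pos += 1
        else:
            ordered_pairs += seen_pos  # positives ranked above this negative
    n_neg = len(ranking) - seen_pos
    if seen_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    return ordered_pairs / (seen_pos * n_neg)


def average_precision(ranking, positives):
    """Precision among the top N genes, averaged over N = 1..n."""
    hits, total = 0, 0.0
    for n, gene in enumerate(ranking, start=1):
        hits += gene in positives
        total += hits / n
    return total / len(ranking)
```

For a ranking `["a", "b", "c", "d"]` with positives `{"a", "c"}`, three of the four positive-negative pairs are correctly ordered, so the AU-ROC is 0.75.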

2.1.3 Meta-analysis of differentially expressed genes

Several meta-analysis techniques have been implemented in previous studies mentioned in the introduction. We abandoned the idea of combining raw data because the studies we surveyed were conducted on very different analysis platforms. One example of incompatibility is that data from two channel arrays are reported as expression ratios, whereas one channel arrays provide expression levels. This leaves us with the option of analyzing individual studies separately, and subsequently combining the results. As we have seen in the examples, one can either combine significance levels (p-value) or the effect size (fold change) of the genes. These two approaches are related, but because the size of study is accounted for in the test of significance [45], we chose to combine significance levels, as implemented by Rogic and Pavlidis (2009). Fisher’s combined probability test [55] was applied independently to the blood and brain data sets. Genes were only analyzed if they were represented in at least three data sets

³ http://www.chibi.ubc.ca/Gemma/arrays/showAllArrayDesigns.html, retrieved on October 10, 2012.

in each of the meta-analyses. In total, 18994 and 16591 genes were included in the blood and brain meta-analyses respectively. The resulting p-values were corrected for multiple testing using Benjamini-Hochberg’s false discovery rate approach [56]. A second meta-analysis method, “Meta-Rank analysis”, gave similar results. Details of both methods are provided in the following subsections.

Because of the gender imbalance in some of the data sets, we excluded from downstream analysis genes which were known or strongly suspected to show changes in expression between genders. This set of genes includes Y-linked genes, X-linked genes that escape X-inactivation (genes with strong evidence only; n=61) [57], as well as autosomal genes previously shown to exhibit sexual dimorphism [58, 59] (total number of genes excluded: Brain = 202; Blood = 116; see A.5). We noted that some of the genes so filtered (e.g., USP9Y and KDM5C) have been previously associated with ASD, but we were not confident that we could discriminate gender from disease effects for them in our analysis.

The combined probability method is sensitive to outliers; that is, a single study with a very low p-value can result in statistical significance even when the other studies provide little evidence for rejection of the null. To control for this, we used a jackknife approach to further select for genes that are robust to statistical outliers (a similar approach was used in Mistry et al. (2013)). The jackknife procedure involves repeating the meta-analysis k times, where k is the number of data sets; in the jth trial, the jth data set is left out. The agreement among these k jackknife meta-analyses was used as a basis for identifying a “core” signature that excludes genes appearing due to the influence of a single data set.
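The jackknife idea can be sketched in a few lines of Python. This is an illustration rather than the procedure as implemented here: the function names are hypothetical, and the agreement criterion is an assumption, namely that a gene belongs to the core signature only if its combined p-value stays below a fixed threshold in every leave-one-out run. The text does not spell out the exact criterion, and the actual analysis worked with FDR-corrected values.

```python
from math import exp, factorial, log


def fisher_combined(pvalues):
    """Fisher's combined p-value for k independent p-values (all > 0).

    Uses the closed form of the chi-square upper tail for 2k degrees
    of freedom, so no external statistics library is needed.
    """
    k = len(pvalues)
    half_s = -sum(log(p) for p in pvalues)  # S / 2
    return exp(-half_s) * sum(half_s ** i / factorial(i) for i in range(k))


def jackknife_core(gene_pvalues, alpha=0.05):
    """gene_pvalues: dict gene -> list of per-study p-values.

    Assumed criterion: keep a gene only if every leave-one-out
    meta-analysis still gives a combined p-value below alpha.
    """
    core = []
    for gene, pvals in gene_pvalues.items():
        leave_one_out = [
            fisher_combined(pvals[:j] + pvals[j + 1:])
            for j in range(len(pvals))
        ]
        if all(p < alpha for p in leave_one_out):
            core.append(gene)
    return core
```

A gene with p-values (1e-8, 0.6, 0.7, 0.8) is significant when all four studies are combined, but drops out of the core set because removing the first study leaves no evidence against the null.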

Fisher’s combined probability test

Fisher’s combined probability test [55] takes the raw p-values calculated for a gene in the individual differential expression analyses of all data sets and combines them to generate a summary statistic (S) using the equation:

\[ S_i = -2 \sum_{j=1}^{k} \log(p_j) \]

where k = number of studies and p_j = p-value of gene i in study j. Using this method, we combined our results from multiple independent tests, all having the same null hypothesis (no difference between autism and control groups). Under the null hypothesis, the resulting test statistic has a χ² distribution. P-values for the meta-analysis can then be obtained from the test statistic using the χ² distribution with 2k degrees of freedom. We corrected the p-values for multiple testing using the Benjamini-Hochberg [56] correction method. The resulting FDR represents the expected proportion of false positives among all the positive results returned at a given threshold.
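As a worked illustration of the two steps in this subsection, the Python sketch below computes the statistic S, converts it to a p-value, and applies the Benjamini-Hochberg adjustment. It is a minimal sketch and not the code used for the analysis. The function names are hypothetical, and the χ² tail probability is evaluated with the closed form that holds for an even number of degrees of freedom, which is an implementation choice made here to avoid external libraries.

```python
from math import exp, factorial, log


def fisher_statistic(pvalues):
    """S = -2 * sum(log(p_j)) over the k studies (all p_j > 0)."""
    return -2.0 * sum(log(p) for p in pvalues)


def chi2_upper_tail_even(s, k):
    """P(X > s) for a chi-square variable with 2k degrees of freedom."""
    half = s / 2.0
    return exp(-half) * sum(half ** i / factorial(i) for i in range(k))


def fisher_pvalue(pvalues):
    """Meta-analysis p-value for one gene across k studies."""
    return chi2_upper_tail_even(fisher_statistic(pvalues), len(pvalues))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

With a single study the combined p-value equals the input p-value, which is a quick sanity check on the tail formula.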

Meta-rank analysis

The meta-rank analysis is a rank aggregation strategy that uses the average rank of a gene instead of the combination of p-values. Within each data set, genes were ranked by their p-values, with the smallest p-value receiving the highest rank. The meta-rank of gene i is computed by averaging the ranks across data sets 1 to k,

\[ R_i = \frac{1}{k} \sum_{j=1}^{k} R_{ij} \]

The ranks of these averages were then computed for all genes. Unlike Fisher’s method, where individual data sets that produce deviant significance values can be detected, this method is less sensitive to the influence of a single data set. Another difference is that the distribution of ranks cannot be represented by a known statistical distribution. In order to obtain a test statistic, we computed the permutation null distribution. We randomly permuted the p-values in each study, and recalculated the metric R. Repeating the process 10000 times gives an M × 10000 matrix of permuted values, where M is the number of genes. The permutation null is then the empirical distribution of all values in that matrix. We can assess significance by testing against the permutation null: F(x) is the empirical cumulative distribution function of the permutation null, evaluated at x, which is in this case the meta-rank of gene i. The null distribution was computed for each number of studies, k = 3, 4, 5, ..., 9, for the blood data and for k = 3 for the brain data.

To ensure that the meta-signature genes are not sensitive to the choice of the meta-analysis method used, we reanalyzed both blood and brain data sets using the method described. The rank correlations between these two methods are 0.80 and 0.59 for brain and blood data sets respectively (averaged over up-regulated and down-regulated lists), suggesting that there are some discrepancies between these methods. However, by quantifying the predictive power of this method with respect to meta-signatures from Fisher’s method, we observed that the choice of method will not have a substantial effect on our selection of the top hits in the signatures (Figure 2.5). Subsequent functional analyses were thus based on results from Fisher’s method.
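The meta-rank computation and its permutation null can be sketched as follows. This Python example is illustrative only. The function names are hypothetical, rank 1 is assigned to the smallest p-value (so a small mean rank indicates strong evidence), ties are broken by input order, and every gene is assumed to be present in every study, whereas the actual analysis handled genes measured in different numbers of studies.

```python
import random
from bisect import bisect_right


def ranks(pvalues):
    """Rank 1 = smallest p-value; ties broken by input order (assumption)."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    result = [0] * len(pvalues)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def meta_rank(study_pvalues):
    """study_pvalues: k lists with one p-value per gene (same gene order).

    Returns R_i, the mean rank of each gene across the k studies.
    """
    per_study = [ranks(p) for p in study_pvalues]
    k = len(per_study)
    return [sum(column) / k for column in zip(*per_study)]


def permutation_pvalues(study_pvalues, n_perm=1000, seed=0):
    """Empirical CDF of the permutation null evaluated at each observed R_i."""
    rng = random.Random(seed)
    observed = meta_rank(study_pvalues)
    null = []
    for _ in range(n_perm):
        # Shuffle gene labels independently within every study.
        shuffled = [rng.sample(p, len(p)) for p in study_pvalues]
        null.extend(meta_rank(shuffled))
    null.sort()
    return [bisect_right(null, r) / len(null) for r in observed]
```

Pooling the permuted mean ranks of all genes into one null distribution mirrors the M × 10000 matrix described above; `n_perm` is kept small in the sketch only to keep the example fast.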

Figure 2.5: Comparison between results obtained from the Fisher’s method and the Meta-Rank method. The peaks on the left suggest that genes ranked at the top are similar for both methods.

2.1.4 Functional enrichment analysis

Gene set enrichment analysis was performed using ErmineJ Version 3.0⁴ [60], a software tool for determining enrichment of Gene Ontology (GO) [61] terms for a given gene list. GO terms represent a controlled vocabulary that links a certain molecular function, biological process, or cellular component to genes. We focused on terms under the “biological process” tree for our analysis. Significant enrichment of a specific GO term could suggest that the gene list is enriched for genes involved in a particular biological process. We used the Precision-Recall method. Precision-Recall uses average precision as a scoring function, thus it is sensitive to genes at the top of the rankings, without having to set a threshold.

⁴ http://erminej.chibi.ubc.ca

For each run, the negative logs of Fisher’s corrected p-values were used as the input, testing against gene sets within the 5-200 size range over 500000 iterations [62]. ErmineJ also accounts for the “multifunctionality” bias of the gene sets (refer to ErmineJ’s manual⁵ for more details on “multifunctionality”). Gene sets that are less affected by this bias are prioritized.

We also downloaded candidate gene categories from the Simons Foundation Autism Research Initiative (SFARI)⁶ database. Seven categories were established by SFARI based on their gene scoring system: 1. High Confidence; 2. Strong Candidate; 3. Suggestive Evidence; 4. Minimal Evidence; 5. Hypothesized; 6. Not Supported; S. Syndromic. Categories 1 and 6 were excluded from the analysis. None of the genes met SFARI’s requirements for the “High Confidence” category (zero genes), and the “Not Supported” category is irrelevant because these genes show no association with ASD.

2.1.5 Literature derived ASD candidates

Known ASD candidate genes were downloaded from Neurocarta⁷ [1], a knowledge base of gene-phenotype associations encompassing 664 unique genes linked to ASD. This includes candidate genes from model organisms (mouse = 11, zebrafish = 1) which were mapped to their human homologs using HomoloGene⁸ [63]. In addition to the studies in Table 1.1, differentially expressed genes reported in two additional expression profiling studies [64, 65] for which data were not publicly available were also obtained and compared to our results. Because autism is historically associated with schizophrenia, we were also interested to see if there were similarities between the autism meta-signatures and schizophrenia meta-signatures. We obtained differentially expressed genes reported in a meta-analysis of schizophrenia expression profiles [42] and compared them with our brain meta-signatures.

2.1.6 Copy number variation enrichment analysis and prediction classifier

We collated CNV data from the Autism Chromosomal Rearrangement Database (ACRD)⁹ [14], Sanders et al. [13] (Table S4 in original study) as well as Pinto et al. [66] (Table S8 in original study), obtaining a total of 1023 CNVs. These variants are thought to be pathological, but to ensure uniformity, we computed their frequencies in the Database of Genomic Variants (DGV)¹⁰ [67] as described in Sanders et al. We identified seven common CNVs. After lifting the genomic coordinates of genes over to hg18 (to match CNV data) using the UCSC liftOver tool¹¹ [68], Fisher’s exact test was used to compute global enrichment of dysregulated genes in ASD-associated rare CNVs (n = 1016). To reduce the amount of overlap for the exact test, we merged individual CNVs into CNV regions using a 90% reciprocal overlap. We also merged small CNVs that are completely nested within larger CNVs by taking the breakpoints of the larger CNV (union). The total number of merged CNV regions is 732 (Gain = 385, Loss = 340, Unknown = 1). We included all classes of CNV transmissions (inherited, de novo, unknown). Restricting our analysis to de novo CNVs did not substantially affect the results.

⁵ http://erminej.chibi.ubc.ca/help/tutorials/multifunctionality/
⁶ www.sfari.org, retrieved in December 2012.
⁷ www.neurocarta.chibi.ubc.ca, retrieved in February 2013.
⁸ ftp://ftp.ncbi.nih.gov/pub/HomoloGene, build 67.
⁹ http://projects.tcag.ca/autism/

Using expression profiling data, we also sought to predict the CNV status of an individual. We used Gist¹² [69], a support vector machine and kernel principal components analysis software toolkit, to build a preliminary CNV classifier. The performance of the resulting classifier was evaluated using leave-one-out cross validation. We used expression profiles from GSE7329 for training and GSE37772 for testing. Because the training and test data are different in several aspects, we attempted to make the data comparable by scaling the expression values or computing values relative to expression levels of a housekeeping gene, GAPDH. Because this is exploratory, we used default parameters, focusing on the predicted gene ranks (based on discriminants) rather than the predicted class labels.
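The merging of CNVs into CNV regions described in this section can be illustrated with the following Python sketch. It is a simplified example, not the procedure as implemented. The function names are hypothetical, intervals are treated as half-open, merged regions always take the union of the breakpoints (the text states this only for nested CNVs), and the merge is greedy, so a region that grows is not re-checked against regions created earlier.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals a and b.

    a, b: (start, end) on the same chromosome, with end exclusive.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def merge_cnvs(cnvs, threshold=0.9):
    """cnvs: list of (chrom, start, end). Returns merged CNV regions.

    A CNV joins an existing region if the two share at least `threshold`
    reciprocal overlap, or if one is completely nested in the other.
    """
    regions = []
    for chrom, start, end in sorted(cnvs):
        for i, (r_chrom, r_start, r_end) in enumerate(regions):
            if r_chrom != chrom:
                continue
            nested = (r_start <= start and end <= r_end) or (
                start <= r_start and r_end <= end
            )
            if nested or reciprocal_overlap(
                (start, end), (r_start, r_end)
            ) >= threshold:
                # Union of breakpoints (assumption for the overlap case).
                regions[i] = (chrom, min(r_start, start), max(r_end, end))
                break
        else:
            regions.append((chrom, start, end))
    return regions
```

Requiring the overlap to be reciprocal prevents a short CNV from being absorbed by a much longer one that merely touches it, while the separate nesting rule handles the case of a small CNV lying entirely inside a large one.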

2.1.7 Network analysis

We conducted the network analysis on a human protein-protein interaction network (PPIN). The PPIN comprises data from the Human Protein Reference Database (HPRD) [70], Molecular Interaction Database (MINT) [71], Database of Interacting Proteins (DIP) [72], innateDB [73] and irefIndex [74]. With the aggregated network, we computed local network properties for the core candidate gene sets. We sampled 10000 random gene sets with similar size and node degree (±50 window) from the network to construct permutation distributions of the average shortest path length (Dijkstra’s algorithm) and local clustering coefficient [42]. The same analysis was repeated on aggregated co-expression networks of brain or liver tissues, except that it was performed separately for up-regulated and down-regulated gene sets.
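The permutation test on the average shortest path length can be sketched in Python as follows. This is a hedged illustration, not the implementation used here. The function names are hypothetical, edges are assumed to be unweighted (in which case Dijkstra’s algorithm reduces to a breadth-first search), the ±50 window is interpreted as a tolerance on node degree, and unreachable gene pairs are simply skipped.

```python
import random
from collections import deque


def bfs_distances(graph, source):
    """Shortest path lengths from source in an unweighted graph.

    graph: dict node -> set of neighbouring nodes.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def mean_shortest_path(graph, genes):
    """Average shortest path length over all reachable pairs in `genes`."""
    genes = list(genes)
    lengths = []
    for i, gene in enumerate(genes):
        dist = bfs_distances(graph, gene)
        lengths += [dist[other] for other in genes[i + 1:] if other in dist]
    return sum(lengths) / len(lengths) if lengths else float("inf")


def degree_matched_sample(graph, genes, window, rng):
    """One random gene set with the same size and similar node degrees."""
    sample = set()
    for gene in genes:
        degree = len(graph[gene])
        pool = [
            n for n in graph
            if n not in sample and abs(len(graph[n]) - degree) <= window
        ]
        if not pool:  # fall back to any unused node
            pool = [n for n in graph if n not in sample]
        sample.add(rng.choice(pool))
    return sample


def permutation_pvalue(graph, genes, n_sets=1000, window=50, seed=0):
    """Fraction of random sets that are at least as tightly connected."""
    rng = random.Random(seed)
    observed = mean_shortest_path(graph, genes)
    null = [
        mean_shortest_path(graph, degree_matched_sample(graph, genes, window, rng))
        for _ in range(n_sets)
    ]
    return sum(x <= observed for x in null) / n_sets
```

Matching on node degree matters because highly connected proteins sit close to everything else in the network, so an unmatched null would make hub-rich gene sets look spuriously clustered.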

¹⁰ http://projects.tcag.ca/variation/, retrieved in March 2013.
¹¹ http://genome.ucsc.edu/cgi-bin/hgLiftOver
¹² http://www.chibi.ubc.ca/gist/

2.2 Results

2.2.1 Systematic review shows technical differences and heterogeneity in independent ASD transcriptome studies

We analyzed twelve independent ASD expression profiling studies and identified differences in microarray preprocessing and data quality control. To ensure comparability among data sets from different laboratories, we identified and corrected for technical variation where possible (Figure 2.1). From the original total of 1371 samples, the data remaining after quality control comprise 639 ASD microarray samples and 460 controls, for a total of 1099 samples from both blood-derived and brain tissues. As summarized in Table 2.4, there are differences in the criteria used to select the pool of ASD individuals. Some individuals were diagnosed based on DSM-IV [75]; others were assessed using alternative forms of evaluation such as the Autism Diagnostic Interview, Revised (ADI-R) [76] and Autism Diagnostic Observation Schedule (ADOS) [77]. More importantly, the range of autistic phenotypes included in each cohort differs, particularly among the blood studies. While some focused on classical autism, others included milder forms like Asperger’s syndrome and PDD-NOS.

We then compared the original lists of significantly differentially expressed genes to see if there is any consistency in previous results. None of the reported genes overlapped across all brain data sets or all blood data sets. Though this is partially due to different significance thresholds or filters (so there could be some overlaps in smaller subsets of studies that are more similar in some aspects), there is generally no evidence of concordance. We expect the heterogeneity to influence the results of our study too. In later sections, we will demonstrate our approach to circumventing the problem and the results achieved.

Because ASD is generally more prevalent in males than females [29, 30], we also investigated whether gender imbalance was a factor affecting study designs. Indeed, a few studies showed evidence of gender imbalance (Table 2.5). There were no striking differences in the age, race and post-mortem interval (the latter being relevant to brain studies only) between cases and controls of each study (Table 2.6).

| Group | Accession | Diagnosis Criteria | Phenotypic descriptions |
|---|---|---|---|
| Brain | GSE28475 | ADI-R, ADOS, TARF, medical records | Autism |
| Brain | GSE28521 | Available upon request, includes ADI-R diagnostic scores, AN-Brain Bank Case Number | Autism |
| Brain | GSE38322 | ADI-R | Autism |
| Blood | GSE6575 | DSM-IV, ADI-R, ADOS | Autism no regression, autism with regression |
| Blood | GSE7329 | ADI-R, ADOS, Raven-IQ | ASD with dup(15q) |
| Blood | GSE15402 | ADI-R, Raven’s score, Peabody Picture Vocabulary Test | ASD (a) |
| Blood | GSE15451 | ADI-R | ASD (b) |
| Blood | GSE18123.1 | DSM-IV-TR, ADOS, ADI-R, comprehensive clinical testing | Autism, Asperger’s Disorder, PDD-NOS |
| Blood | GSE18123.2 | DSM-IV-TR, ADOS, ADI-R, comprehensive clinical testing | Autism, Asperger’s Disorder, PDD-NOS |
| Blood | GSE25507 | DSM-IV, ADOS, ADI-R | Classical autism |
| Blood | GSE32136 | - | ASD |
| Blood | GSE37772 | Refer to the SFARI database for phenotype information | Autism |

(a) Language, Mild, Savant (cluster analysis of ADI-R scores)
(b) Severe language impairment (cluster analysis of ADI-R scores)
ADI-R: Autism Diagnostic Interview-Revised. ADOS: Autism Diagnostic Observation Schedule. TARF: The Autism Research Foundation. DSM-IV: Diagnostic and Statistical Manual of Mental Disorders IV (TR: text revision). SFARI: Simons Foundation Autism Research Initiative.

Table 2.4: Summary of diagnosis criteria and ASD phenotypes in the original studies. Refer to Table 1.1 for study citations.

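| Group | Accession | ASD Male | ASD Female | Control Male | Control Female | OR | Total |
|---|---|---|---|---|---|---|---|
| Brain | GSE28475 | 11 | 2 | 18 | 3 | 0.92 | 34 |
| Brain | GSE28521 | 8 | 4 | 14 | 1 | 0.14 | 27 |
| Brain | GSE38322 | 4 | 0 | 6 | 0 | - | 10 |
| Blood | GSE6575 | 28 | 5 | 8 | 3 | 2.10 | 44 |
| Blood | GSE7329 | 7 | 0 | 6 | 0 | - | 13 |
| Blood | GSE15402 | 77 | 0 | 29 | 0 | - | 106 |
| Blood | GSE15451 | 15 | 0 | 12 | 0 | - | 27 |
| Blood | GSE18123.1 | 64 | 0 | 28 | 0 | - | 92 |
| Blood | GSE18123.2 | 72 | 21 | 30 | 33 | 3.80 | 156 |
| Blood | GSE25507 | 80 | 0 | 63 | 0 | - | 143 |
| Blood | GSE32136 | 9 | 0 | 7 | 0 | - | 16 |
| Blood | GSE37772 | 198 | 34 | 105 | 94 | 5.20 | 431 |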

Table 2.5: Demographics I - Gender. Gender imbalance is seen in some data sets, such as GSE37772. OR: Odds ratio.
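The odds ratios in Table 2.5 are consistent with the ratio of the male-to-female odds in ASD cases to the male-to-female odds in controls. The short Python sketch below reproduces a few of them as a check. The function name is hypothetical, and returning `None` when any cell is zero is an assumption that mirrors the “-” entries in the table.

```python
def gender_odds_ratio(asd_male, asd_female, ctl_male, ctl_female):
    """(ASD male / ASD female) divided by (control male / control female).

    Returns None when any count is zero, since the ratio is then
    undefined or uninformative (shown as "-" in Table 2.5).
    """
    if 0 in (asd_male, asd_female, ctl_male, ctl_female):
        return None
    return (asd_male * ctl_female) / (asd_female * ctl_male)


# Counts taken from Table 2.5.
print(round(gender_odds_ratio(11, 2, 18, 3), 2))      # GSE28475
print(round(gender_odds_ratio(8, 4, 14, 1), 2))       # GSE28521
print(round(gender_odds_ratio(198, 34, 105, 94), 2))  # GSE37772
print(gender_odds_ratio(4, 0, 6, 0))                  # GSE38322
```

An odds ratio above 1 means that males are over-represented among cases relative to controls, as in GSE37772, whereas a value below 1, as in GSE28521, indicates the opposite imbalance.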

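| Group | Accession | Age range (years), ASD | Age range (years), Control | PMI, ASD | PMI, Control | Race |
|---|---|---|---|---|---|---|
| Brain | GSE28475 | 2-56 | 3-56 | 4-43.25 | 5-36 | C, AA, U, M |
| Brain | GSE28521 | 5-51 | 6.75-43.25 | 4.75-32.92 | 16-56 | Predominantly C, A |
| Brain | GSE38322 | 2-39 | 4-22.5 | 13-24.2 | 1-60 | C, U |
| Blood | GSE6575 | matched | matched | - | - | - |
| Blood | GSE7329 | - | - | - | - | - |
| Blood | GSE15402 | 5-28 | 3-34 | - | - | Predominantly C, A, M |
| Blood | GSE15451 | 4-12 | 2-12 | - | - | Predominantly C, U |
| Blood | GSE18123.1 | 3.4-17.5 | 2.8-16 | - | - | Predominantly C, A, U, M |
| Blood | GSE18123.2 | 2-21 | 2.5-22 | - | - | Predominantly C, A, U, M |
| Blood | GSE25507 | 2-14 | 3-11 | - | - | Primarily C |
| Blood | GSE32136 | - | - | - | - | - |
| Blood | GSE37772 | 4-17.7 | 3-23.8 | - | - | C and non-C |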

Table 2.6: Demographics II - Age, PMI and race of subjects in each study. C: Caucasian or white; AA: African American; A: Asian; M: Mixed or multiracial; U: Unknown

2.2.2 Re-analysis for differential expression

The first stage of our meta-analysis was to analyze each data set individually for differential expression. The results are summarized in Table 2.7. Most data sets had low levels of differential expression, but a few had up to several hundred significantly differentially expressed genes at an FDR of 0.05.

Data set     DE     Up     Down    1 − π0    Number of genes    Number of samples
Brain
GSE28475     0      0      0       0.20      16598              34
GSE28521     4      1      3       0.25      16598              27
GSE38322     0      0      0       0.15      19558              10
Blood
GSE6575      0      0      0       0.00      18305              44
GSE7329      314    160    154     0.41      17159              13
GSE15402     5      1      4       0.11      9821               106
GSE15451     0      0      0       0.04      12066              27
GSE18123.1   333    103    230     0.27      18305              92
GSE18123.2   57     35     22      0.47      18617              156
GSE25507     2      2      0       0.28      18305              143
GSE32136     0      0      0       0.10      8100               16
GSE37772     0      0      0       0.00      16598              431

Table 2.7: Differentially expressed genes in each data set after re-analysis. DE: Differentially expressed genes at FDR threshold of 0.05; Up: Up-regulated genes; Down: Down-regulated genes; Number of genes: Number of genes after applying filters.
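The 1 − π0 column of Table 2.7 is an estimate of the proportion of genes that are truly differentially expressed. As a minimal sketch of how such an estimate can be obtained, the code below uses a Storey-type estimator with a single tuning value λ = 0.5. This is an assumption made for illustration: the estimator, the value of λ and the simulated p-values are not taken from the analysis reported here.

```python
import random


def estimate_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses.

    P-values from true nulls are uniform on [0, 1], so the number of
    p-values above `lam`, divided by m * (1 - lam), estimates pi0.
    The estimate is capped at 1.
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


# Simulated (hypothetical) p-values: 8000 null genes and 2000 genes
# with very small p-values, so the true pi0 is 0.8.
rng = random.Random(0)
null_p = [rng.random() for _ in range(8000)]
signal_p = [rng.random() * 0.01 for _ in range(2000)]

pi0 = estimate_pi0(null_p + signal_p)
print(f"estimated pi0 = {pi0:.2f}, 1 - pi0 = {1 - pi0:.2f}")
```

A value of 1 − π0 above zero indicates an excess of small p-values even when no individual gene passes the FDR threshold, which is the situation seen for several data sets in Table 2.7.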

We checked whether sample size or FDR threshold could explain the variable amount of differential expression. If one assumes the effect size of ASD on expression is similar across studies, the amount of differential expression should be consistent. A comparison between the estimated proportion of differentially expressed genes (1 − π0) and sample size shows that this is clearly not the case for these data (Fig. 2.6). We also grouped π0 values by tissue type, cell type and platform type, but there were no obvious trends. Other possible explanations for this phenomenon include phenotype heterogeneity and comorbidities, but we were unable to quantify or directly address these factors with the information available.

Figure 2.6: π0 values for each study against sample size. Error bars denote standard errors for 100 bootstrap iterations. PB: peripheral blood; PBL: peripheral blood lymphocytes; LCL: lymphoblastoid cell lines; WB: whole blood.

We next compared the result of each analysis to that previously published for the same data set, where available. This was done by examining where the differentially expressed genes from the original studies rank in our results (using the AU-ROC). Despite the extensive additional data cleanup we performed and differences in the statistical analysis methods, our re-analyses were generally concordant with the original reports (Table 2.8).

Data sets     Significant probes    Probes reported    Overlap    AUC     Precision(%)
GSE15402      73                    530                65         0.98    58.70
GSE15451      0                     45                 0          0.86    1.58
GSE18123.1    284                   489                43         0.79    8.34
GSE18123.2    69                    610                43         0.92    31.00
GSE25507      -                     -                  -          -       -
GSE28475      0                     200                0          0.89    6.94
GSE28521      4                     588                2          0.89    21.50
GSE32136      -                     -                  -          -       -
GSE37772      -                     -                  -          -       -
GSE38322      0                     41                 0          0.98    6.84
GSE6575       0                     55                 0          0.96    3.34
GSE7329       596                   1281               339        0.95    44.10

Table 2.8: Overlap between results reported in the literature and individual re-analysis of differential expression. Significant probes: Per data set significant probes from re-analysis, reported at an FDR threshold of 0.05. Probes reported: Differentially expressed probes published in original papers of each study. Gene symbols are used as a proxy for probes in GSE18123.1; GenBank accessions are used in GSE15451 and GSE15402; Spot IDs are used for GSE7329. GSE25507 computed differences in expression variance instead of differential expression; GSE37772 reported outlier genes instead of differentially expressed genes; GSE32136 is not published.
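The AU-ROC and precision measures used to compare a reported gene list against a ranking can be illustrated as follows. This is a minimal sketch rather than the code used for Table 2.8: the function names `auroc` and `average_precision` and the toy gene identifiers are hypothetical, and ties in the ranking are not handled.

```python
def auroc(ranked_genes, positives):
    """Area under the ROC curve for a gene set within a ranked list.

    `ranked_genes` is ordered from most to least significant. The result
    is the fraction of (positive, negative) gene pairs in which the
    positive gene is ranked above the negative one.
    """
    pos = set(positives) & set(ranked_genes)
    n_pos = len(pos)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")
    negatives_seen = 0
    concordant_pairs = 0
    for gene in ranked_genes:
        if gene in pos:
            # Every negative not yet seen is ranked below this positive.
            concordant_pairs += n_neg - negatives_seen
        else:
            negatives_seen += 1
    return concordant_pairs / (n_pos * n_neg)


def average_precision(ranked_genes, positives):
    """Mean of the precision values at the rank of each positive gene."""
    pos = set(positives)
    hits = 0
    total = 0.0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in pos:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy ranking (hypothetical genes), most significant first.
ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
reported = {"g1", "g4"}
print(auroc(ranking, reported), average_precision(ranking, reported))
```

An AU-ROC of 0.5 corresponds to reported genes being scattered at random through the ranking, and a value of 1 to all reported genes sitting at the top.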

2.2.3 Meta-analysis of differential expression

A key observation at this point is that most of the data sets showed clear evidence for differential expression (π0 < 1), but were largely underpowered to separate differentially expressed genes from the background. In the re-analyses, there was also no overlap across any of the studies among the genes selected at an FDR of 0.05. We hypothesized that there might still be similarities among the studies that would emerge in a combined or meta-analysis. We therefore applied a p-value combination strategy, choosing to analyze the blood and brain data sets separately. This approach combines the results for all the data sets without applying any statistical threshold, and thus provides a p-value for every gene analyzed. The meta-analysis yields four ranked gene lists: one pair each for blood and brain, with separate lists for up- and down-regulation. At this stage the lists contain all the genes considered, with no threshold applied.

We then compared the results of individual study re-analyses to the ranked gene lists. If each data set contributes some signal to the meta-analysis, its results should individually resemble the ranked gene lists. Generally, the trends we observed concur with the amount of differential expression estimated (1 − π0) for each data set: data sets with more differential expression displayed stronger associations with the results of the meta-analyses. As shown in Fig. 2.7A, there is a clear similarity among the three brain data sets in their contributions towards the final gene rankings, as evidenced by the similar trend lines for all three data sets. In contrast, the blood data sets were in lower agreement, with a fraction showing a stronger relationship to the meta-analysis results while others showed weak associations (Fig. 2.7B).
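To make the p-value combination step concrete, the sketch below combines per-study p-values for a single gene using Fisher's combined probability test. The choice of Fisher's method is an assumption made for illustration, since the text above refers only to a p-value combination strategy. The function names are hypothetical, and the chi-square tail probability is computed in closed form, which is valid because the degrees of freedom (twice the number of studies) are always even.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square distribution with even df.

    For df = 2k: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Combine independent p-values for one gene across studies.

    The statistic -2 * sum(log p) follows a chi-square distribution with
    2 * len(pvalues) degrees of freedom under the null hypothesis.
    """
    if not pvalues or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_df(statistic, 2 * len(pvalues))


# Hypothetical one-sided p-values for one gene in three studies: none is
# individually striking, but together they give a smaller combined p-value.
print(fisher_combine([0.04, 0.08, 0.03]))
```

Running the same combination on one-sided p-values for up-regulation and, separately, for down-regulation would give the two ranked lists per tissue described above.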

Figure 2.7: Profiles of meta-signatures from the blood and brain: raw p-values for each individual data set are plotted against corrected p-values (FDR) of the meta-signatures. Local polynomial regression (LOESS) is used to obtain a smooth fit. The shaded areas represent 95% confidence intervals of the prediction using the t-based approximation (see “stat_smooth” in the ggplot2 R package).

Applying a threshold to these rankings yielded blood and brain “meta-signatures”. At an FDR threshold of 0.05, 30 up-regulated genes and 49 down-regulated genes were found in the brain. The blood meta-analysis yielded 111 up-regulated and 87 down-regulated hits (see appendix A.1-A.4). As the number of studies in the brain and blood differs, we cannot tell whether the smaller number of brain hits is due to an underpowered analysis, or simply due to a weaker biological effect in the brain. While the studies were balanced for covariates such as age and post-mortem interval (for the brain data), we checked the lists for genes previously reported to be influenced by these factors [78]. There were minimal overlaps, indicating that our results were not strongly influenced by them. Genes affected by sex differences were removed from the results reported here, though they can be found in the appendix for reference.

We further characterized the relative contributions of each data set towards the hits we obtained, to more directly identify any single study that “drives” genes towards significance in the meta-analyses. By assessing the amount of overlap between meta-signatures and differential expression in each data set, we quantified the contribution of each data set to the meta-analysis (Table 2.9). For instance, the results of GSE28521 analyzed alone can identify, with relatively high precision (29.47%), down-regulated genes in the brain meta-signature. Overall, GSE28521, GSE18123.1 and GSE7329 appeared to have a strong impact on the meta-analysis. As described in the next section, we implemented procedures to find genes robust to the selection of data sets.

                 Up-regulated                  Down-regulated
Data set     Overlap    AUC     AP(%)      Overlap    AUC     AP(%)
Brain
GSE28475     2/3        0.92    10.77      0/0        0.90    5.60
GSE28521     0/0        0.96    15.33      5/5        0.94    29.47
GSE38322     0/0        0.90    5.70       0/0        0.84    10.80
Blood
GSE15402     0/1        0.78    3.31       0/29       0.71    1.35
GSE15451     0/0        0.54    0.80       0/0        0.55    1.22
GSE18123.1   16/92      0.82    8.36       28/235     0.84    15.01
GSE18123.2   13/38      0.86    10.73      2/9        0.74    5.41
GSE25507     0/3        0.67    2.83       0/0        0.59    2.03
GSE32136     0/0        0.76    5.95       0/0        0.69    2.18
GSE37772     0/2        0.65    0.97       0/0        0.57    0.80
GSE6575      0/0        0.67    3.05       0/0        0.67    1.69
GSE7329      25/183     0.84    13.21      20/234     0.78    5.84

Table 2.9: Overlap (overlap/total up or down-regulated in data set) between meta-signature (FDR <0.05) and significantly differentially expressed genes per data set (FDR <0.05), as well as enrichment of meta-signatures in the results of individual differential expression analysis. One sided p-values were used to compute FDR here. AU-ROC: area under receiver operating characteristic curve; AP: average precision.

To see if the meta-signatures in blood and brain are similar, we quantified the reciprocal predictive value of meta-signatures from both tissue types using AU-ROC. The blood meta-signatures were essentially randomly placed in the brain ranked gene list, and vice versa, with AU-ROC values between 0.51 and 0.59 (Table 2.10). Thus there was no indication of a common signature between the blood and brain, supporting our choice to conduct separate analyses.

               Upregulation    Downregulation
Blood-Brain    0.51            0.52
Brain-Blood    0.51            0.59

Table 2.10: Comparisons of blood and brain signatures. AU-ROC reported for signature of tissue A on ranked gene list from meta-analysis of tissue B (A-B).

2.2.4 Robust molecular commonalities are more evident in brain samples compared to blood

To focus our attention on the genes that show the strongest concordance across studies, we employed a jackknife procedure. Jackknifing yields as many ranked gene lists as there are data sets in the meta-analysis, where each list is the result with one data set left out. We initially performed this at the same stringency as the initial meta-analysis, applying an FDR threshold of 0.05 to every jackknife result. With this conservative approach, we identified 10 genes from the blood data for which significance is not dominated by any single data set, but none from the brain. Because removing data sets reduces power, we established a less stringent criterion for identifying robust patterns: we define our core signatures as the intersection of the top 200 genes (an arbitrary cut-off) retrieved from each leave-one-out iteration [42]. From this analysis, the core blood signature consists of 15 up-regulated genes and 8 down-regulated genes (corresponding FDRup = 0.14, FDRdown = 0.15). The core brain signature consists of 15 up-regulated genes and 10 down-regulated genes (corresponding FDRup = 0.29, FDRdown = 0.24). We visualized these core signatures using heat maps of the gene expression levels for each sample in the twelve data sets meta-analyzed. While there is a relative lack of a clear pattern in blood data sets, the heat maps for the core brain hits showed good concordance across all three brain data sets (Fig. 2.8).
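The leave-one-out definition of a core signature can be sketched as follows. The code assumes Fisher's method as the p-value combination step (the method is not named in this section), uses hypothetical study and gene names with made-up p-values, and uses a cut-off of 2 genes rather than 200 so that the toy example stays readable.

```python
import math


def fisher_combine(pvalues):
    """Fisher's combined p-value (closed-form chi-square tail, even df)."""
    half = -sum(math.log(p) for p in pvalues)  # half of the test statistic
    term = 1.0
    total = 1.0
    for i in range(1, len(pvalues)):
        term *= half / i
        total += term
    return math.exp(-half) * total


def rank_genes(studies):
    """Rank genes measured in every study by their combined p-value."""
    genes = set.intersection(*(set(s) for s in studies.values()))
    combined = {g: fisher_combine([s[g] for s in studies.values()]) for g in genes}
    return sorted(genes, key=lambda g: (combined[g], g))


def core_signature(studies, top_k):
    """Intersect the top_k genes from every leave-one-out meta-analysis."""
    core = None
    for left_out in studies:
        remaining = {name: s for name, s in studies.items() if name != left_out}
        top = set(rank_genes(remaining)[:top_k])
        core = top if core is None else core & top
    return core


# Hypothetical p-values: gene A is modestly significant in every study,
# whereas gene B owes its rank to a single extreme value in study s1.
studies = {
    "s1": {"A": 0.01, "B": 1e-8, "C": 0.5, "D": 0.9},
    "s2": {"A": 0.02, "B": 0.6, "C": 0.4, "D": 0.8},
    "s3": {"A": 0.01, "B": 0.7, "C": 0.6, "D": 0.5},
}
print(rank_genes(studies)[0])        # top gene when all studies are used
print(core_signature(studies, 2))    # genes robust to dropping any one study
```

In this toy example gene B tops the full meta-analysis but drops out of the core signature once study s1 is left out, which is the behaviour the jackknife is intended to expose.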

Figure 2.8: Heat map visualizations of core signature expression values in each of the brain data sets. Batch-corrected expression values were scaled across samples within each data set. Relative expression levels: yellow - high; blue - low. A different visualization for each core signature gene can be seen in Fig. 2.11.

The relatively noisy expression profiles of the core blood signature (Fig. 2.9) expose general difficulties in detecting genes with heterogeneous expression levels. Very few genes exhibited robust concordance when visualized individually (Fig. 2.10).

Figure 2.9: Gene expression levels of core blood signature. Relative expression levels: yellow - high; blue - low; grey - missing values.

Figure 2.10: P-values of core blood signature in individual studies. Deviation from the diagonal for quantile-quantile plots. a) Up-regulated genes. b) Down-regulated genes.

Figure 2.11: P-values of core brain signature in individual studies. a) Up-regulated genes. b) Down-regulated genes.

2.2.5 Functional analyses reveal perturbations in metabolic processes

To explore gene functional themes in our data, we conducted a threshold-free enrichment analysis (using the full list of ranked genes). None of the functions tested were significantly enriched in the blood. The brain was enriched for genes involved in “cellular respiration” (GO:0045333, FDR = 0.11), suggesting differences at a functional level between individuals with and without autism. An analysis using the three jackknifed gene lists from the brain data (that is, meta-analysis of each pair of data sets) showed that the result is robust. Dysregulated genes in this functional group are shown in Table 2.11. Other top enriched functions were also related to respiration, including GO:0022904 (“respiratory electron transport chain”) and GO:0022900 (“electron transport chain”).

Gene Symbol   Gene Name                                                                            p-value
ATP5O         ATP synthase, H+ transporting, mitochondrial F1 complex, O subunit                   1.83E-05
UQCRQ         ubiquinol-cytochrome c reductase, complex III subunit VII, 9.5kDa                    5.45E-05
UQCRC1        ubiquinol-cytochrome c reductase core protein I                                      1.86E-04
CYC1          cytochrome c-1                                                                       2.90E-04
COX5B         cytochrome c oxidase subunit Vb                                                      2.98E-04
NDUFA11       NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 11, 14.7kDa                      4.38E-04
ATP5L         ATP synthase, H+ transporting, mitochondrial Fo complex, subunit G                   4.53E-04
UQCR10        ubiquinol-cytochrome c reductase, complex III subunit X                              4.53E-04
UQCRC2        ubiquinol-cytochrome c reductase core protein II                                     5.25E-04
NDUFA13       NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 13                               5.35E-04
SLC25A12      solute carrier family 25 (aspartate/glutamate carrier), member 12                    5.37E-04
FH            fumarate hydratase                                                                   7.55E-04
UQCR11        ubiquinol-cytochrome c reductase, complex III subunit XI                             7.74E-04
NDUFS4        NADH dehydrogenase (ubiquinone) Fe-S protein 4, 18kDa (NADH-coenzyme Q reductase)    8.29E-04
IDH3A         isocitrate dehydrogenase 3 (NAD+) alpha                                              9.06E-04

Table 2.11: Top genes in the “cellular respiration” GO category at a meta-analysis raw p-value threshold of 0.001. There are a total of 116 genes in this functional group.

2.2.6 Shared signatures between autism and other neurodevelopmental syndromes

A natural question is whether any of the signature genes are known ASD candidates reported in previous genetics or functional studies. We first checked for overall patterns of enrichment based on the ranked gene lists from the blood and brain. We observed enrichment of genes in the SFARI “syndromic” category (FDR=0.15; Table 2.12) in the blood signature.

Gene Symbol   Gene Name                                       p-value
UBE3A         ubiquitin protein ligase E3A                    1.76E-06
CDKL5         cyclin-dependent kinase-like 5                  1.43E-03
DMD           dystrophin                                      2.36E-03
SHANK3        SH3 and multiple ankyrin repeat domains 3       6.58E-03
HOXA1         homeobox A1                                     2.02E-02
PTEN          phosphatase and tensin homolog                  2.09E-02
TSC1          tuberous sclerosis 1                            3.09E-02

Table 2.12: Top genes in the SFARI “syndromic” category at a meta-analysis raw p-value threshold of 0.05. There are a total of 19 genes in this gene set.

Analysis of the jackknifed gene lists indicated this was primarily due to the influence of the 15q duplication cohort (GSE7329). We can directly observe the skew in the top two syndromic genes: UBE3A (FDR=0.003) and CDKL5 (FDR=0.095) (Fig. 2.12). While UBE3A resides in the 15q11-13 region, CDKL5 (Xp22) does not. The link between 15q duplication and CDKL5 dysregulation is unclear.

We repeated this analysis using a more inclusive list of 664 ASD candidates from Neurocarta [1], but found no significant enrichment. Among the few Neurocarta candidate genes identified in our meta-signatures are 11 genes in the blood signature (CAMSAP2, UBE3A, CYFIP1, JARID2, PAFAH1B1, FAN1, BRAF, CXCR3, PRDX4, GAP43, GABRA4) and one gene in the brain signature (GAS2). We also looked for known candidates in the brain using a relaxed FDR threshold of 0.1. Additional genes found in the brain include ADM, CADM1, STAT3, CD44, CYP19A1, PTCHD1, SLC30A5, SLC25A12, APBA2 and DLX1. Note that only CAMSAP2 and BRAF are core hits. None of the existing candidates are common to the meta-signatures of both tissue types.

Figure 2.12: Raw p-values of genes located in 15q11-13 (UBE3A, CYFIP1), Xp22 (CDKL5), and 7q11.23 (RFC2). Top (Q-Q plots): The lack of an overall deviation from the uniform diagonal suggests that the signals are skewed. Bottom: Per-data set p-value with a p-value threshold of 0.05 (dashed grey); genes that meet an FDR threshold of 0.05 in the data set are marked with a triangle. Compare with core blood signatures in Figure 2.10.

2.2.7 Meta-signature genes in rare structural variants associated with ASD

The candidate gene lists used in the last section do not, for the most part, include genes covered by rare structural variants associated with ASD, because the precise gene or genes involved are often not known and are thus not documented by SFARI or Neurocarta. To explore the potential links between gene expression and rare structural variations, we assembled ASD-associated CNVs from several sources. We first observed that genes in the meta-signatures are distributed widely across the genome. There were no obvious hot spots, and none of the CNVs analyzed were significantly enriched for dysregulated genes (corrected p-value > 0.05). Globally, 6.3% of the brain meta-signature (total=79) and 10.6% of the blood meta-signature (total=198) are located in known CNV regions, which is not a significant enrichment (Table 2.13).

This computation was constrained to genes showing positive associations between expression levels and copy number changes (up-regulated genes within a duplicated region and down-regulated genes within a deleted region). All dysregulated CNV genes are shown in Tables 2.14 and 2.15.

                  Observed          Expected
                  n      %          n      %         Total    p-value
Brain   Up        1      3.33       3      10.00     30       NS
        Down      4      8.16       4      8.16      49       NS
        Total     5      6.3        7      8.86      79
Blood   Up        18     16.22      12     10.81     111      0.04
        Down      3      3.45       7      8.05      87       NS
        Total     21     10.6       19     9.60      198

Table 2.13: Dysregulated genes (FDR <0.05, meta-signature) within ASD-associated CNV. Fisher’s exact test was used to compute significance. NS: Not significant.
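For reference, a one-sided Fisher's exact test for over-representation can be written directly from the hypergeometric distribution. The sketch below is illustrative only: the function name and the 2x2 counts are hypothetical and are not the counts underlying Table 2.13, since the background gene totals are not listed in the table.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in a 2x2 table.

    Table layout:
                      in CNV    not in CNV
        signature       a           b
        background      c           d

    Returns P(X >= a), where X is hypergeometric with the table margins
    held fixed.
    """
    row1 = a + b          # signature genes
    col1 = a + c          # genes located in a CNV
    n = a + b + c + d     # all genes considered
    denominator = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / denominator


# Hypothetical counts: 3 of 4 signature genes fall in a CNV region,
# against 1 of 4 background genes.
print(fisher_exact_greater(3, 1, 1, 3))
```

With these toy counts the tail probability is 17/70, so an excess of this size in such a small table would not be considered significant.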

Genes      Gain/Loss   Chromosome   CNV Start     CNV End       Reference
SCIN       Gain        7            12186385      17527285      AGP Consortium (2007)
ABCG2      Loss        4            86507718      101626937     Jacquemont et al. (2006)
GRK6       Loss        5            175492445     177359136     Sanders et al. (2011)
PANX2      Loss        22           46277400      49509100      Sanders et al. (2011)
                                    46335545      49565822      Marshall et al. (2008)
                                    45202172      49522605      Sebat et al. (2007)
                                    45144027      49465883      Sanders et al. (2011)
SNRNP25    Loss        16           835           1253638       Sanders et al. (2011)

Table 2.14: Dysregulated genes in the brain that are found in known ASD CNVs. CNVs that span the same gene or set of genes are grouped together.

Genes                  Gain/Loss   Chromosome   CNV Start     CNV End       Reference
CSTF2T                 Gain        10           52699516      54408816      Sanders et al. (2011)
                                                51672210      61490637      Sanders et al. (2011)
                                                50562149      61478511      Sebat et al. (2007)
CTDSP2                 Gain        12           54218922      58779615      Pinto et al. (2010)
                                                54218922      58779615      Sanders et al. (2011)
CYFIP1                 Gain        15           20235613      20807351      Pinto et al. (2010)
                                                20303106      20800564      Pinto et al. (2010)
                                                20090262      21038099      Pinto et al. (2010)
FAN1                   Gain        15           28723577      30231488      Sanders et al. (2011)
                                                28723577      30238780      Sanders et al. (2011)
FUT8-AS1               Gain        14           61897100      65075600      AGP Consortium (2007)
HCK, C20orf112         Gain        20           28251057      35143867      Sanders et al. (2011)
IRF2BPL                Gain        14           76007842      76924400      Marshall et al. (2008)
P2RX7, GPR133          Gain        12           114191663     132287723     Marshall et al. (2008)
                                                114170000     132388000     Sanders et al. (2011)
RFC2                   Gain        7            72411506      73811186      Sanders et al. (2011)
                                                72300351      73782113      Sanders et al. (2011)
                                                72344426      73782113      Sanders et al. (2011)
                                                72355583      73782113      Sanders et al. (2011)
SCCPDH                 Gain        1            244912594     245041638     Marshall et al. (2008)
SH2D1B                 Gain        1            160435966     161133966     AGP Consortium (2007)
SMARCA2                Gain        9            175632        3373495       Sanders et al. (2011)
UBE3A                  Gain        15           22736034      25689610      Jacquemont et al. (2006)
                                                21490300      25698400      AGP Consortium (2007)
                                                21190624      26203954      Pinto et al. (2010)
                                                21190624      26203954      Sanders et al. (2011)
                                                21240037      26095621      Sanders et al. (2011)
UBE3A, CYFIP1          Gain        15           20428583      26069606      Christian et al. (2008)
                                                19925826      26069606      Christian et al. (2008)
                                                20197683      26069606      Christian et al. (2008)
                                                19767013      26134114      Sanders et al. (2011)
UBE3A, CYFIP1, FAN1    Gain        15           18376200      30298800      Marshall et al. (2008)
                                                18427100      30298847      Marshall et al. (2008)
                                                18376200      30298800      Sanders et al. (2011)
                                                18427100      30298847      Sanders et al. (2011)
                                                18526971      30756771      Sebat et al. (2007)
                                                18526971      30756771      Sanders et al. (2011)
ZNF611                 Gain        19           57836600      58246200      Marshall et al. (2008)
ZNF721                 Gain        4            328851        542862        Marshall et al. (2008)
ZNF721, SPON2          Gain        4            35410         3511385       Sanders et al. (2011)
                                                398952        6722859       AGP Consortium (2007)
CCDC50                 Loss        3            187295051     193862987     Jacquemont et al. (2006)
SLC17A9                Loss        20           61056624      61076763      Sanders et al. (2011)
TSPAN12                Loss        7            113335000     128821721     Sanders et al. (2011)
                                                113528285     129015006     Marshall et al. (2008)

Table 2.15: Dysregulated genes in the blood that are found in known ASD CNVs.

Because 15q11-13 duplication is one of the most common CNV aberrations in ASD [10], it was unsurprising that we detected dysregulated genes in this region. A closer look at these genes (UBE3A, CYFIP1; Fig. 2.12) again reveals the sensitivity towards the data set in which all autistic subjects carry maternally derived 15q duplications (GSE7329). In other ASD-associated CNVs, we detected genes from the core signatures that are dysregulated in the same direction as the change in copy number: ZNF721 (4p16) in the blood; SCIN (7p21.1), SNRNP25 and ABCG2 (4q21) in the brain. However, we conclude that while some of the genes in our signatures are ASD candidate genes or fall in known rare CNV regions, there is no striking overall relationship between the expression patterns and the current state of knowledge of ASD genetics.

Previous work has shown that gene expression profiles can be used to predict cytogenetic abnormalities [33, 79]. We attempted to build a classifier using expression levels of dysregulated genes as features. Because class labels for 15q duplications were available (among other CNVs), we focused on predicting the presence of 15q duplications using preselected features (i.e., CYFIP1, UBE3A and FAN1 expression levels). Our training data, GSE7329, comprise seven subjects with 15q duplication and six subjects without. We tested the classifier on GSE37772 (the only other data set with CNV labels). Out of a total of 431 samples in GSE37772, there was only one sample with a confirmed 15q duplication status, as reported by Luo et al. Eleven other samples were predicted to have 15q duplications. The ideal classifier would be able to predict the presence of 15q duplications in individuals whose CNV statuses are unknown, but the lack of samples with confirmed CNV statuses makes it a challenge to evaluate the performance of the classifier. We report our predictions given the information we have in Table 2.16. While the sample with a confirmed 15q duplication, GSM927674, had relatively high rankings, other samples originally predicted were randomly placed. Besides the lack of CNV labels, technical differences between the two data sets used for training and testing might pose problems in predicting outcomes. A classifier that can be generalized across different data sets might require more complex machine learning methods such as transfer learning.

Standardization     Training    CV         SV            Rank of       AU-ROC
method              AU-ROC      AU-ROC     (total=13)    GSM927674
None                1.00        1.00       12            1             0.58
Scaled              1.00        1.00       9             18            0.53
GAPDH normalized    0.86        0.79       13            4             0.36

Table 2.16: Predictions on GSE37772 samples using preliminary CNV classifier. CV: cross validated; SV: support vectors; AU-ROC: AU-ROC computed for other 15q samples (originally predicted but not confirmed).

2.2.8 Network analysis and candidate gene characterization

We were also interested to see whether these core signature genes possess distinctive properties at the systems level. To do so, we compared PPIN properties of core signature genes to those of random gene sets with similar size and node degree. Our analysis shows there is no evidence of significant changes in the functional connectivity of core dysregulated genes in autistic individuals (Fig. 2.13). Analyses on the coexpression networks of brain (Fig. 2.14) and liver (Fig. 2.15) showed similar results. Because the biological mechanism of autism is unknown, it is conceivable that different subsets or combinations of genes might share a common functional topology. In other words, the core signatures alone might be insufficient to cause global alterations in the brain.

Figure 2.13: PPIN network properties of core candidate genes in the blood and brain compared to that of respective random gene sets. (a) Blood. 15 out of 23 core candidate genes included. (b) Brain. 20 out of 25 core candidate genes included.

We have shown earlier that there is little or no change in global functional connectivity between autistic individuals and controls. We further explored local PPIN neighbourhoods of the same core signatures and found several known candidates in the vicinity (direct or first degree connections). A few known candidates were directly linked to our core signatures (blood = 15, brain = 37). But since the number of known candidates observed did not significantly differ from what is expected given random gene sets of similar size and node degree, the links observed are likely to arise by chance due to the large node degrees of some core signature genes.
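The comparison against random gene sets of similar size and node degree can be sketched as a resampling procedure. The code below is a simplified illustration under stated assumptions: random genes are matched on exact node degree (the text above only requires similar degree), and the function names, the toy network degrees and the numbers in the usage example are all hypothetical.

```python
import random
from collections import defaultdict


def matched_random_sets(degree, core, n_sets, seed=0):
    """Draw random gene sets matching `core` in size and node degree.

    `degree` maps every gene in the network to its node degree. For each
    core gene, one non-core gene with the same degree is drawn without
    replacement within a set.
    """
    rng = random.Random(seed)
    core_genes = set(core)
    by_degree = defaultdict(list)
    for gene, d in degree.items():
        if gene not in core_genes:
            by_degree[d].append(gene)
    sets = []
    for _ in range(n_sets):
        used = set()
        chosen = []
        for gene in core:
            pool = [g for g in by_degree[degree[gene]] if g not in used]
            if not pool:
                raise ValueError(f"no unused gene with degree {degree[gene]}")
            pick = rng.choice(pool)
            used.add(pick)
            chosen.append(pick)
        sets.append(chosen)
    return sets


def empirical_p(observed, null_values):
    """Fraction of null values at least as large as the observed value,
    with the usual +1 correction so the p-value is never zero."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Toy network (hypothetical degrees) and a two-gene "core signature".
degree = {"a": 3, "b": 3, "c": 3, "d": 5, "e": 5, "f": 5, "g": 1}
core = ["a", "d"]
random_sets = matched_random_sets(degree, core, n_sets=20, seed=1)
print(random_sets[:3])
print(empirical_p(5, [1, 2, 3, 4]))
```

In practice a network statistic (for example, the number of known candidates among direct neighbours) would be computed for the core set and for each random set, and `empirical_p` would then place the observed value within that null distribution.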

(a) Blood; 13/15 up-regulated genes included. (b) Blood; 4/8 down-regulated genes included.

(c) Brain; 11/15 up-regulated genes included. (d) Brain; 9/10 down-regulated genes included.

Figure 2.14: Brain co-expression network properties of core candidate genes in the blood and brain compared to that of respective random gene sets.

(a) Blood; 12/15 up-regulated genes included. (b) Blood; 7/8 down-regulated genes included.

(c) Brain; 11/15 up-regulated genes included. (d) Brain; All down-regulated genes included.

Figure 2.15: Liver co-expression network properties of core candidate genes in the blood and brain compared to that of respective random gene sets.

To further characterize the candidate genes, we extracted their phenotype associations from Neurocarta [1]. Neurocarta is an in-house knowledge base of gene-phenotype associations, so it provides a global view of what is currently known about the genes. We categorized the candidates based on their phenotype associations, defining genes that are associated with only autism as “ASD specific” genes. Those that are only linked to a list of manually curated neurodevelopmental disorders are considered “neurodevelopment specific”. Results show that a large fraction of genes was not catalogued in Neurocarta. This could be because Neurocarta mainly focuses on the genetic basis of neurodevelopmental disorders [1], thus capturing only a subset of genes. There were very few “specific” candidate genes. Because the operational definition of “specificity” is only valid to the extent of our prior knowledge, it is not clear whether these genes are actually biologically specific to the disorder, or if they are not well studied.

                          Total    In Neurocarta    ASD Candidates    Ndev. Specific    ASD Specific
Blood   Core signature    23       10               2                 2                 1
        Meta-signature    198      63               11                11                3
Brain   Core signature    25       12               0                 0                 0
        Meta-signature    79       29               1                 1                 0

Table 2.17: Categorization of our candidate genes based on Neurocarta. Ndev.: neurodevelopment.

The genetic basis of neurodevelopment or neuropsychiatric disorders might not be reflected in the transcriptome. To look for potential biological similarities in other neurodevelopmental disorders, we compared our signatures to candidates from previous gene expression studies. We were interested in schizophrenia as a methodologically similar study was done in our lab. Results suggest that there are some overlaps between autism and schizophrenia expression profiles in postmortem brain (Figure 2.16). Further investigations are required to associate these similarities with overlapping symptoms seen between the disorders.

Figure 2.16: Common brain meta-signatures between the autism (current study) and schizophrenia meta-analyses by Mistry et al. a) Up-regulated genes; b) Down-regulated genes.

                  Genes       p-value     FDR
Up-regulated      ABCA1       7.19e-05    4.16e-02
                  P4HA1       6.83e-04    9.19e-02
Down-regulated    CCDC25      2.59e-05    3.27e-02
                  LRRC17      3.06e-04    5.77e-02
                  RMND5B      4.30e-04    6.09e-02
                  SLC25A12    5.37e-04    6.57e-02
                  FARSA       8.45e-04    7.78e-02
                  PPA2        9.61e-04    8.26e-02
                  APBA2       1.11e-03    9.01e-02

Table 2.18: Meta-signature genes that are also dysregulated in schizophrenia. Meta-analysis FDR <0.1.

Chapter 3

Discussion and conclusion

I presented a meta-analysis of autism gene expression profiling studies, providing the most comprehensive survey on gene expression in autism available to date. The main finding is that there are molecular commonalities across multiple independent groups of individuals with ASD. These similarities have, to our knowledge, gone overlooked in individual gene expression studies. Genes I identified as most robustly changed across cohorts were not previously underscored in ASD literature. In this final section, I will discuss these findings in the context of other autism research, noting some limitations of the current study and avenues for future work.

3.1 Similarities and differences between key findings and previous results

The question of whether one should expect some homogeneous molecular aspects across individuals with ASD is an open one. The studies included in this analysis used a range of criteria to select subjects, but they were by and large made up of idiopathic cases (the exception being GSE7329). Since they are not of monogenic etiology, we anticipated variability within and among individual ASD data sets. Besides methodological differences (which we minimized by handling the data sets uniformly), there are other sources of heterogeneity that are difficult to address, and that raise some questions about the interpretation of ASD expression studies. For example, the smallest data set we used (GSE7329) showed substantially more differential expression than the largest data set (GSE37772). GSE7329 comprises individuals with 15q duplications while GSE37772 comprises idiopathic probands from the Simons Simplex Collection (SSC) (Table 2.4), who have moderate to severe symptoms [80].

Given the high degree of variability among and within studies, it is striking that we found some genes showing differences that are relatively consistent across cohorts. At this time the full biological significance of the genes we identified is unclear. Several of the concordant genes (core hits) we found are linked to genetic disorders with neurological implications. Among the genes in the core brain signature are PDYN (prodynorphin) and ABCA1 (ATP-binding cassette, sub-family A). Mutations in PDYN, a gene that codes for a neuropeptide precursor, have causal links to spinocerebellar ataxia (MIM 610245) [81]. Mutations in ABCA1 are an established cause of Tangier disease (MIM 205400), a disorder whose features include neuropathies [82]. There were fewer clear hits in the blood data, but several genes stand out (Fig. 2.10). Two known ASD candidates, CAMSAP2 (calmodulin regulated spectrin-associated protein family, member 2) and BRAF (v-raf murine sarcoma viral oncogene homolog B1), showed consistent dysregulation in at least three cohorts. Other novel candidates in blood are PRKCH (protein kinase C eta, a member of the protein kinase C family) and ABLIM1 (actin binding LIM protein 1), which have been widely studied in cellular signaling and axon guidance [83], respectively.

As discussed, results from previously published transcriptome analyses have, on the surface, shown little agreement. We have also described some reasons why this might occur, including differences in clinical properties or technical aspects of the expression analysis. However, we note that some of our candidates were hits reported in the original studies, as well as in other transcriptome studies not included for analysis (Tables 3.1 and 3.2). In fact, a few were validated with a second independent cohort in the original studies: ZNF322 (zinc finger protein 322) in Kong et al. (2012); HSPA1A (heat shock 70kDa protein 1A) and PDYN in Voineagu et al. (2011), further suggesting bona fide associations with ASD. However, these genes were not discussed in previous publications, perhaps because of their relatively low rankings in the results or the lack of known functional implications. In addition, most existing studies have not dwelled upon the findings of other related studies, either choosing to ignore them or attributing differences to experimental procedures. Our results suggest that in fact many of the molecular or functional differences observed in individual studies are likely to be specific to that study and thus of questionable interpretation when the entire autism spectrum is considered. While inferences made based on our findings are preliminary, the fact that some changes show a tendency to be reproducible opens promising avenues for further research.

Genes are grouped as up-regulated, then down-regulated.

Core blood signature   Overlaps        Total studies
SCARNA17               X               1
ZNF594                 X               1
GIMAP8                 X               1
PRKCH                  X               1
CXCR7 (CMKOR1)         XX XX X         3
MALAT1                 X               1
ZNF322                 X* X* X         3
CAMSAP2                X               1
ENO3                   X               1
MAN2A2                 XX              2
ZNF721                                 0
APBB1                  X               1
ABLIM1                                 0
ZFP62                  X               1
CYBRD1                 X               1
BRAF                   X               1
STRA13                                 0
SERPINB9               X               1
MED25                                  0
ZNF784                                 0
SNX22                                  0
FAM46C                 X               1
BLVRB                  X               1

Study            Tissue   Total hits   Total genes analyzed
GSE28475         Brain    184          17979
GSE28521         Brain    537          17979
GSE38322         Brain    32           21348
Purcell et al.   Brain    29           -
Garbett et al.   Brain    130          19763
GSE15402         Blood    361          14753
GSE15451         Blood    35           14753
GSE18123.1       Blood    487          19763
GSE18123.2       Blood    457          20353
GSE37772         Blood    332          17979
GSE6575          Blood    24           19763
GSE7329          Blood    747          19326

XX  Overlap with concordant direction.
X-  Overlap with discordant direction.
X   Overlap with unknown direction.
*   Validated in replication cohort.
    Not tested.

Table 3.1: Comparisons between core signature genes in blood and differentially expressed genes reported in original studies. Total hits: Total hits reported in original study (Genes); Total genes analyzed: Estimated total number of genes analyzed in each study based on Gemma platform annotations.

Genes are grouped as up-regulated, then down-regulated.

Core brain signature   Overlaps        Total studies
HSPA1A                 XX* XX X-       3
IGFBP5                 XX              1
PDYN                   XX*             1
ZC3HAV1                XX              1
PTPN1                  X-              1
C2CD4A                                 0
DNAJB1                 XX              1
C5AR1                                  0
C1orf106                               0
TAGLN2                 XX XX           2
ABCA1                  XX              1
LILRB3                 XX dup.         2
SCIN                   XX X            2
CD93                                   0
HMOX1                                  0
ABCG2                                  0
KLHDC2                                 0
COA1                   X               1
TTC1                                   0
FBXL15                                 0
FAM58A                                 0
SNRNP25 (C16orf33)     del.            1
C12orf57                               0
FIS1                                   0
PIH1D1                                 0

Study            Tissue   Total hits   Total genes analyzed
GSE28475         Brain    184          17979
GSE28521         Brain    537          17979
GSE38322         Brain    32           21348
Purcell et al.   Brain    29           -
Garbett et al.   Brain    130          19763
GSE15402         Blood    361          14753
GSE15451         Blood    35           14753
GSE18123.1       Blood    487          19763
GSE18123.2       Blood    457          20353
GSE37772         Blood    332          17979
GSE6575          Blood    24           19763
GSE7329          Blood    747          19326

XX  Overlap with concordant direction.
X-  Overlap with discordant direction.
X   Overlap with unknown direction.
*   Validated in replication cohort.
    Not tested.

Table 3.2: Similar to Table 3.1, for core signature genes in the brain.

3.2 Biological interpretations of meta-analyzed ASD expression profiles

Taken as a whole, the expression patterns we observed in brain point to the possibility of effects relating to cellular respiration. Within the cellular respiration group, SLC25A12, a mitochondrial aspartate/glutamate carrier (not a hit at an FDR of 0.05, but within a relaxed FDR threshold of 0.1), was previously reported as a susceptibility gene because it harbors SNPs strongly associated with autism [84]. In addition to the genes directly annotated with this function, further examination reveals other highly ranked genes in our data that are known to play regulatory roles in cellular metabolism or mitochondria-related functions, though they are not directly annotated in the GO functional groups. For instance, P2RX7 (purinergic receptor P2X, ligand-gated ion channel, 7; CNV gene) is involved in purinergic signaling, a pathway that might play a role in mitochondrial dysfunction-associated ASD [85]. Mitochondrial dysfunction (MD) has been a topic of study in some neuropsychiatric disorders (notably bipolar disorder [86, 87]). Some have conjectured a 4-5% prevalence of MD in individuals on the autism spectrum [10, 88], but there is little direct evidence in the literature. Investigations of mitochondrial DNA mutations in ASD have yielded mixed conclusions [89, 90]. Current research, supported in part by the enrichment of “cellular respiration” (a group consisting only of nuclear-encoded genes), seems to indicate a role for nuclear genes in the co-occurrence of MD and ASD [91, 92]. The underlying genetics might not be as simple as in monogenic metabolic disorders with a high prevalence of ASD, such as Smith-Lemli-Opitz syndrome (MIM 270400). However, as our analysis of brain transcriptomes has shown, there are convergent functional consequences of what could be heterogeneous genetic or genomic aberrations underlying the disorders.
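To illustrate how a gene such as SLC25A12 can miss an FDR threshold of 0.05 yet fall within a relaxed threshold of 0.1, the following is a minimal sketch of false discovery rate adjustment, assuming the Benjamini-Hochberg procedure [56]. The gene labels and p-values are hypothetical placeholders, not results from this thesis, and the function name is ours.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical gene labels and meta-analysis p-values (not thesis results).
genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
p_values = [0.001, 0.012, 0.045, 0.6, 0.035]

q_values = dict(zip(genes, benjamini_hochberg(p_values)))
hits_strict = sorted(g for g, q in q_values.items() if q <= 0.05)
hits_relaxed = sorted(g for g, q in q_values.items() if q <= 0.10)

print("FDR 0.05:", hits_strict)
print("FDR 0.10:", hits_relaxed)
```

In this toy example, two of the placeholder genes pass only at the relaxed threshold, which is the situation described for SLC25A12 above.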
Potentially causative rare CNVs are found in up to 20% of ASD cases [10]. While several genes we identified are within regions implicated in CNV studies of ASD, there was no overall significant enrichment. It is still possible that the changes in RNA levels we observed are linked indirectly to CNVs or other types of rare genetic variants, which we were not able to determine because the genomic backgrounds for most of the cases in our data sets were unknown. Genes suggestive of direct correlations include PANX2, RFC2 and 15q genes, which reside in regions with recurrent rare CNVs (previously reported in several ASD cases). RFC2 lies in the 7q11.23 region, deletions of which are associated with Williams-Beuren syndrome (MIM 194050). Duplications of this region, concordant with the up-regulation of RFC2 we found, have previously been strongly linked to autism [13]. It is likely, based on the data at hand, that expression changes in an individual reflect underlying chromosome abnormalities. However, it should be noted that the large effects of some genes, such as the 15q genes, are driven by a subset of cohorts. There might also be other complex links between copy number variation and RNA expression that are not obvious because of varying dosage sensitivity [93], potentially explaining observations of genes with common expression changes in rare or uncommon CNVs. For genes that are actually copy number-sensitive, previous work showed that their expression profiles can be used to predict chromosome abnormalities in blood [79]. Efforts like these, if successful, will improve prognosis of the disorder. Our attempt was impeded partly by the lack of data or “labels” needed to train an accurate classifier. Another question raised is whether the blood “markers” are relevant to the brain, because the two tissue types exhibit different profiles.
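The check for enrichment of signature genes in CNV regions discussed in this section can be framed as an over-representation test. The sketch below uses a one-sided hypergeometric test, which is an assumption on our part: this section does not specify the test used. All counts are hypothetical placeholders, and the function name is ours.

```python
from math import comb


def enrichment_p_value(overlap, n_background, n_cnv, n_hits):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of CNV-region genes among n_hits genes drawn without
    replacement from n_background genes, n_cnv of which lie in CNV regions.
    """
    max_overlap = min(n_cnv, n_hits)
    total = comb(n_background, n_hits)
    favourable = sum(
        comb(n_cnv, k) * comb(n_background - n_cnv, n_hits - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / total


# Hypothetical counts (not thesis results).
n_background = 18000  # genes analyzed
n_cnv = 900           # background genes inside ASD-implicated CNV regions
n_hits = 40           # genes in the core signature
overlap = 3           # signature genes that fall inside those regions

p_value = enrichment_p_value(overlap, n_background, n_cnv, n_hits)
print(f"P(overlap >= {overlap}) = {p_value:.3f}")
```

With these placeholder counts the expected overlap by chance is two genes, so an observed overlap of three would not indicate enrichment, mirroring the situation in which individual genes sit in CNV regions without an overall significant signal.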

3.3 Limitations and future directions

Most of the literature on ASD expression profiling focuses on analysis of blood samples rather than brain. Because brain samples are hard to obtain, the hope is that blood cells can serve as a surrogate for brain [30, 31]. This was not supported by our results: we showed that the biological profiles of brain and blood differ at the molecular and functional levels. External factors such as medication, age range (blood samples are often obtained from a younger population, whereas the brain samples span a wider age range), and cause of death may affect gene expression in post mortem brain tissue, but it may be infeasible to address these issues at present because of the lack of brain tissue resources. In addition, current data do not provide information on developmental trajectories, so we are limited to “snapshots” of expression levels in both tissue types. Longitudinal studies in neuroimaging [94] or embryonic brain cells [95] may yield further insights.

Another important caveat for our interpretation is the difficulty of attributing any causal role to the changes we observe. They could be sequelae of ASD, or due to comorbid conditions. Most of the studies we used did not provide any details about comorbidities, making this difficult to address in our analysis. Future studies should endeavor to provide such details to allow further dissection of real effects from potential confounds. As current leading hypotheses on the etiology of ASD focus on brain connectivity or synaptic function [96], it is a challenge to determine where mitochondrial function fits into the picture. The multitude of genes involved in this function also makes it hard to determine its specificity to the disorder. Indeed, specificity could be refuted by evidence of similar trends in unrelated expression studies.

In previous work from the lab, gene annotation biases were shown to impact gene function prediction [62, 97]. Briefly, genes or gene groups that are deeply annotated (multifunctional) are more likely to appear enriched in a functional enrichment analysis, whereas genes that are not well characterized tend to fall out of favour. This bias is ameliorated in our functional enrichment analysis, where we recovered a set of genes (with relatively low multifunctionality) representing “cellular respiration”. When evaluating individual genes, however, one has to be aware that although they could be relevant to autism, they might not be specific. This is evident in their associations with schizophrenia, as well as numerous other non-neuropsychiatric phenotypes or disorders. In retrospect, this is also reflected in the lack of enrichment in GO terms associated with neurological functions, although several hits in the meta-signatures are associated with neuropsychiatric or neurodevelopmental syndromes. This is not to say that “non-specific” genes are uninteresting. Rather, I emphasize the importance of making cautious assessments by accounting for potential biases.

3.4 Conclusion

In conclusion, this meta-analysis reveals subtle but consistent changes in expression in the brains of individuals with ASD. Future work could explore whether these changes are replicable in additional cohorts. In blood, the signals were much weaker and more heterogeneous, with the clearest effects being associated with duplications in 15q. The tentative interpretation of this is that blood may not be a good tissue for identification of commonalities in the transcriptome in ASD, but it might be useful in probing the biological effects of specific chromosome abnormalities.

Bibliography

[1] Elodie Portales-Casamar, Carolyn Ch’ng, Frances Lui, Nicolas St-Georges, Anton Zoubarev, Artemis Y. Lai, Mark Lee, Cathy Kwok, Willie Kwok, Luchia Tseng, and Paul Pavlidis. Neurocarta: aggregating and sharing disease-gene relations for the neurosciences. BMC Genomics, 14(1):129, February 2013. ISSN 1471-2164. doi:10.1186/1471-2164-14-129. URL http://www.biomedcentral.com/1471-2164/14/129/abstract. PMID: 23442263. → pages iii, 22, 40, 48

[2] Jamee M. Berg and Daniel H. Geschwind. Autism genetics: searching for specificity and convergence. Genome Biology, 13(7):247, July 2012. ISSN 1465-6906. doi:10.1186/gb4034. URL http://genomebiology.com/2012/13/7/247/abstract. → pages 1

[3] American Psychiatric Association. Diagnostic and statistical manual of mental disorders DSM-5. Arlington, VA, 2013. ISBN 9780890425596 0890425590 9780890425541 089042554X 9780890425558 0890425558. URL http://dsm.psychiatryonline.org/book.aspx?bookid=556. → pages 2

[4] E. Bleuler and N.S. Kline. Synopsis of Eugen Bleuler’s Dementia Praecox, Or The Group of Schizophrenias. International Universities Press, 1952. URL http://books.google.ca/books?id=Jg54OwAACAAJ. → pages 2

[5] L Kanner. Autistic disturbances of affective contact. Acta Paedopsychiatr, 35(4): 100–136, 1968. ISSN 0001-6586. PMID: 4880460. → pages 2

[6] Dennis S Charney and Eric J Nestler. Neurobiology of mental illness. Oxford University Press, Oxford; New York, 2004. ISBN 9780199725250 019972525X 9780195149623 0195149629. URL http://site.ebrary.com/id/10317706. → pages 2

[7] Stuart Murray. Autism. The Routledge series integrating science and culture. Routledge, New York, 2012. ISBN 9780415884983. → pages 2

[8] Joseph D Buxbaum and Patrick R Hof. The neuroscience of autism spectrum disorders. Academic Press, Oxford; Waltham, MA, 2013. ISBN 9780123919243 012391924X. URL http://www.sciencedirect.com/science/book/9780123919243. → pages 2

[9] S Folstein and M Rutter. Infantile autism: a genetic study of 21 twin pairs. Journal Of Child Psychology And Psychiatry, And Allied Disciplines, 18(4):297–321, 1977. ISSN 0021-9630. URL http://search.ebscohost.com/login.aspx?direct=true&db=mnh&AN=562353&site=ehost-live&scope=site. → pages 2

[10] Judith H. Miles. Autism spectrum disorders–A genetics review. Genetics in Medicine, 13(4):278–294, 2011. ISSN 1098-3600. doi:10.1097/GIM.0b013e3181ff67ba. → pages 3, 44, 54

[11] Kai Wang, Haitao Zhang, Deqiong Ma, Maja Bucan, Joseph T. Glessner, Brett S. Abrahams, Daria Salyakina, Marcin Imielinski, Jonathan P. Bradfield, Patrick M. A. Sleiman, Cecilia E. Kim, Cuiping Hou, Edward Frackelton, Rosetta Chiavacci, Nagahide Takahashi, Takeshi Sakurai, Eric Rappaport, Clara M. Lajonchere, Jeffrey Munson, Annette Estes, Olena Korvatska, Joseph Piven, Lisa I. Sonnenblick, Ana I. Alvarez Retuerto, Edward I. Herman, Hongmei Dong, Ted Hutman, Marian Sigman, Sally Ozonoff, Ami Klin, Thomas Owley, John A. Sweeney, Camille W. Brune, Rita M. Cantor, Raphael Bernier, John R. Gilbert, Michael L. Cuccaro, William M. McMahon, Judith Miller, Matthew W. State, Thomas H. Wassink, Hilary Coon, Susan E. Levy, Robert T. Schultz, John I. Nurnberger, Jonathan L. Haines, James S. Sutcliffe, Edwin H. Cook, Nancy J. Minshew, Joseph D. Buxbaum, Geraldine Dawson, Struan F. A. Grant, Daniel H. Geschwind, Margaret A. Pericak-Vance, Gerard D. Schellenberg, and Hakon Hakonarson. Common genetic variants on 5p14.1 associate with autism spectrum disorders. Nature, 459(7246): 528–533, April 2009. ISSN 0028-0836. doi:10.1038/nature07999. URL http://www.nature.com/nature/journal/v459/n7246/full/nature07999.html. → pages 3

[12] Lauren A. Weiss et al. A genome-wide linkage and association scan reveals novel loci for autism. Nature, 461(7265):802–808, October 2009. ISSN 0028-0836. doi:10.1038/nature08490. URL http://www.nature.com/nature/journal/v461/n7265/full/nature08490.html. → pages 3

[13] Stephan J Sanders, A Gulhan Ercan-Sencicek, Vanessa Hus, Rui Luo, Michael T Murtha, Daniel Moreno-De-Luca, Su H Chu, Michael P Moreau, Abha R Gupta, Susanne A Thomson, Christopher E Mason, Kaya Bilguvar, Patricia B S Celestino-Soper, Murim Choi, Emily L Crawford, Lea Davis, Nicole R Davis Wright, Rahul M Dhodapkar, Michael DiCola, Nicholas M DiLullo, Thomas V Fernandez, Vikram Fielding-Singh, Daniel O Fishman, Stephanie Frahm, Rouben Garagaloyan, Gerald S Goh, Sindhuja Kammela, Lambertus Klei, Jennifer K Lowe, Sabata C Lund, Anna D McGrew, Kyle A Meyer, William J Moffat, John D Murdoch, Brian J O’Roak, Gordon T Ober, Rebecca S Pottenger, Melanie J Raubeson, Youeun Song, Qi Wang, Brian L Yaspan, Timothy W Yu, Ilana R Yurkiewicz, Arthur L Beaudet, Rita M Cantor, Martin Curland, Dorothy E Grice, Murat Günel, Richard P Lifton, Shrikant M Mane, Donna M Martin, Chad A Shaw, Michael Sheldon, Jay A Tischfield, Christopher A Walsh, Eric M Morrow, David H Ledbetter, Eric Fombonne, Catherine Lord, Christa Lese Martin, Andrew I Brooks, James S Sutcliffe, Edwin H Cook Jr, Daniel Geschwind, Kathryn Roeder, Bernie Devlin, and Matthew W State. Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism. Neuron, 70(5):863–885, June 2011. ISSN 1097-4199. doi:10.1016/j.neuron.2011.05.002. PMID: 21658581. → pages 3, 22, 23, 54

[14] Christian R Marshall, Abdul Noor, John B Vincent, Anath C Lionel, Lars Feuk, Jennifer Skaug, Mary Shago, Rainald Moessner, Dalila Pinto, Yan Ren, Bhooma Thiruvahindrapduram, Andreas Fiebig, Stefan Schreiber, Jan Friedman, Cees E J Ketelaars, Yvonne J Vos, Can Ficicioglu, Susan Kirkpatrick, Rob Nicolson, Leon Sloman, Anne Summers, Clare A Gibbons, Ahmad Teebi, David Chitayat, Rosanna Weksberg, Ann Thompson, Cathy Vardy, Vicki Crosbie, Sandra Luscombe, Rebecca Baatjes, Lonnie Zwaigenbaum, Wendy Roberts, Bridget Fernandez, Peter Szatmari, and Stephen W Scherer. Structural variation of chromosomes in autism spectrum disorder. Am. J. Hum. Genet., 82(2):477–488, February 2008. ISSN 1537-6605. doi:10.1016/j.ajhg.2007.12.009. PMID: 18252227. → pages 3, 22

[15] Brian J. O’Roak, Laura Vives, Santhosh Girirajan, Emre Karakoc, Niklas Krumm, Bradley P. Coe, Roie Levy, Arthur Ko, Choli Lee, Joshua D. Smith, Emily H. Turner, Ian B. Stanaway, Benjamin Vernot, Maika Malig, Carl Baker, Beau Reilly, Joshua M. Akey, Elhanan Borenstein, Mark J. Rieder, Deborah A. Nickerson, Raphael Bernier, Jay Shendure, and Evan E. Eichler. Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature, April 2012. ISSN 0028-0836. doi:10.1038/nature10989. URL http://www.nature.com/nature/journal/vaop/ncurrent/full/nature10989.html?WT.ec_id=NATURE-20120405. → pages 3

[16] Benjamin M. Neale, Yan Kou, Li Liu, Avi Ma’ayan, Kaitlin E. Samocha, Aniko Sabo, Chiao-Feng Lin, Christine Stevens, Li-San Wang, Vladimir Makarov, Paz Polak, Seungtai Yoon, Jared Maguire, Emily L. Crawford, Nicholas G. Campbell, Evan T. Geller, Otto Valladares, Chad Schafer, Han Liu, Tuo Zhao, Guiqing Cai, Jayon Lihm, Ruth Dannenfelser, Omar Jabado, Zuleyma Peralta, Uma Nagaswamy, Donna Muzny, Jeffrey G. Reid, Irene Newsham, Yuanqing Wu, Lora Lewis, Yi Han, Benjamin F. Voight, Elaine Lim, Elizabeth Rossin, Andrew Kirby, Jason Flannick, Menachem Fromer, Khalid Shakir, Tim Fennell, Kiran Garimella, Eric Banks, Ryan Poplin, Stacey Gabriel, Mark DePristo, Jack R. Wimbish, Braden E. Boone, Shawn E. Levy, Catalina Betancur, Shamil Sunyaev, Eric Boerwinkle, Joseph D. Buxbaum, Edwin H. Cook, Bernie Devlin, Richard A. Gibbs, Kathryn Roeder, Gerard D. Schellenberg, James S. Sutcliffe, and Mark J. Daly. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature, April 2012. ISSN 0028-0836. doi:10.1038/nature11011. URL http://www.nature.com/nature/journal/vaop/ncurrent/full/nature11011.html?WT.ec_id=NATURE-20120405. → pages

[17] Stephan J. Sanders, Michael T. Murtha, Abha R. Gupta, John D. Murdoch, Melanie J. Raubeson, A. Jeremy Willsey, A. Gulhan Ercan-Sencicek, Nicholas M. DiLullo, Neelroop N. Parikshak, Jason L. Stein, Michael F. Walker, Gordon T. Ober, Nicole A. Teran, Youeun Song, Paul El-Fishawy, Ryan C. Murtha, Murim Choi, John D. Overton, Robert D. Bjornson, Nicholas J. Carriero, Kyle A. Meyer, Kaya Bilguvar, Shrikant M. Mane, Nenad Šestan, Richard P. Lifton, Murat Günel, Kathryn Roeder, Daniel H. Geschwind, Bernie Devlin, and Matthew W. State. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature, April 2012. ISSN 0028-0836. doi:10.1038/nature10945. URL http://www.nature.com/nature/journal/vaop/ncurrent/full/nature10945.html?WT.ec_id=NATURE-20120405. → pages 3

[18] Timothy W. Yu, Maria H. Chahrour, Michael E. Coulter, Sarn Jiralerspong, Kazuko Okamura-Ikeda, Bulent Ataman, Klaus Schmitz-Abe, David A. Harmin, Mazhar Adli, Athar N. Malik, Alissa M. D’Gama, Elaine T. Lim, Stephan J. Sanders, Ganesh H. Mochida, Jennifer N. Partlow, Christine M. Sunu, Jillian M. Felie, Jacqueline Rodriguez, Ramzi H. Nasir, Janice Ware, Robert M. Joseph, R. Sean Hill, Benjamin Y. Kwan, Muna Al-Saffar, Nahit M. Mukaddes, Asif Hashmi, Soher Balkhy, Generoso G. Gascon, Fuki M. Hisama, Elaine LeClair, Annapurna Poduri, Ozgur Oner, Samira Al-Saad, Sadika A. Al-Awadi, Laila Bastaki, Tawfeg Ben-Omran, Ahmad S. Teebi, Lihadh Al-Gazali, Valsamma Eapen, Christine R. Stevens, Leonard Rappaport, Stacey B. Gabriel, Kyriacos Markianos, Matthew W. State, Michael E. Greenberg, Hisaaki Taniguchi, Nancy E. Braverman, Eric M. Morrow, and Christopher A. Walsh. Using whole-exome sequencing to identify inherited causes of autism. Neuron, 77(2):259–273, January 2013. ISSN 0896-6273. doi:10.1016/j.neuron.2012.11.002. URL http://www.cell.com/neuron/abstract/S0896-6273(12)00993-2. → pages 3

[19] Gaia Novarino, Paul El-Fishawy, Hulya Kayserili, Nagwa A Meguid, Eric M Scott, Jana Schroth, Jennifer L Silhavy, Majdi Kara, Rehab O Khalil, Tawfeg Ben-Omran, A Gulhan Ercan-Sencicek, Adel F Hashish, Stephan J Sanders, Abha R Gupta, Hebatalla S Hashem, Dietrich Matern, Stacey Gabriel, Larry Sweetman, Yasmeen Rahimi, Robert A Harris, Matthew W State, and Joseph G Gleeson. Mutations in BCKD-kinase lead to a potentially treatable form of autism with epilepsy. Science, 338(6105):394–397, October 2012. ISSN 1095-9203. doi:10.1126/science.1224631. PMID: 22956686. → pages 3

[20] Uta Frith. Autism: explaining the enigma. Blackwell Pub, Malden, MA, 2nd edition, 2003. ISBN 0631229000. → pages 5

[21] P. Hammond, C. Forster-Gibson, A. E. Chudley, J. E. Allanson, T. J. Hutton, S. A. Farrell, J. McKenzie, J. J. A. Holden, and M. E. S. Lewis. Face–brain asymmetry in autism spectrum disorders. Mol Psychiatry, 13(6):614–623, March 2008. ISSN 1359-4184. doi:10.1038/mp.2008.18. URL http://www.nature.com/mp/journal/v13/n6/full/mp200818a.html. → pages 5

[22] M. D. Spencer, R. J. Holt, L. R. Chura, J. Suckling, A. J. Calder, E. T. Bullmore, and S. Baron-Cohen. A novel functional brain imaging endophenotype of autism: the neural response to facial expression of emotion. Transl Psychiatry, 1(7):e19, July 2011. doi:10.1038/tp.2011.18. URL http://www.nature.com/tp/journal/v1/n7/full/tp201118a.html. → pages 5

[23] Yasunari Sakai, Chad A Shaw, Brian C Dawson, Diana V Dugas, Zaina Al-Mohtaseb, David E Hill, and Huda Y Zoghbi. Protein interactome reveals converging molecular pathways among autism disorders. Sci Transl Med, 3(86): 86ra49, June 2011. ISSN 1946-6242. doi:10.1126/scitranslmed.3002166. PMID: 21653829. → pages 5

[24] Maggie L. Chow, Tiziano Pramparo, Mary E. Winn, Cynthia Carter Barnes, Hai-Ri Li, Lauren Weiss, Jian-Bing Fan, Sarah Murray, Craig April, Haim Belinson, Xiang-Dong Fu, Anthony Wynshaw-Boris, Nicholas J. Schork, and Eric Courchesne. Age-dependent brain gene expression and copy number anomalies in autism suggest distinct pathological processes at young versus mature ages. PLoS Genet, 8(3):e1002592, March 2012. doi:10.1371/journal.pgen.1002592. URL http://dx.doi.org/10.1371/journal.pgen.1002592. → pages 6

[25] Irina Voineagu, Xinchen Wang, Patrick Johnston, Jennifer K. Lowe, Yuan Tian, Steve Horvath, Jonathan Mill, Rita M. Cantor, Benjamin J. Blencowe, and Daniel H. Geschwind. Transcriptomic analysis of autistic brain reveals convergent molecular pathology. Nature, advance online publication, May 2011. ISSN 1476-4687. doi:10.1038/nature10110. URL http://dx.doi.org/10.1038/nature10110. → pages 6, 7, 9, 51

[26] Matthew R. Ginsberg, Robert A. Rubin, Tatiana Falcone, Angela H. Ting, and Marvin R. Natowicz. Brain transcriptional and epigenetic associations with autism. PLoS ONE, 7(9):e44736, September 2012. doi:10.1371/journal.pone.0044736. URL http://dx.doi.org/10.1371/journal.pone.0044736. → pages 6, 7

[27] Jeffrey P. Gregg, Lisa Lit, Colin A. Baron, Irva Hertz-Picciotto, Wynn Walker, Ryan A. Davis, Lisa A. Croen, Sally Ozonoff, Robin Hansen, Isaac N. Pessah, and Frank R. Sharp. Gene expression changes in children with autism. Genomics, 91(1): 22–29, January 2008. ISSN 0888-7543. doi:10.1016/j.ygeno.2007.09.003. URL http://www.sciencedirect.com/science/article/pii/S0888754307002327. → pages 6

[28] Yuhei Nishimura, Christa L. Martin, Araceli Vazquez-Lopez, Sarah J. Spence, Ana Isabel Alvarez-Retuerto, Marian Sigman, Corinna Steindler, Sandra Pellegrini, N. Carolyn Schanen, Stephen T. Warren, and Daniel H. Geschwind. Genome-wide expression profiling of lymphoblastoid cell lines distinguishes different forms of autism and reveals shared pathways. Hum. Mol. Genet., 16(14):1682–1698, July 2007. ISSN 0964-6906, 1460-2083. doi:10.1093/hmg/ddm116. URL http://hmg.oxfordjournals.org/content/16/14/1682. → pages 6, 7

[29] Valerie W Hu, Tewarit Sarachana, Kyung Soon Kim, AnhThu Nguyen, Shreya Kulkarni, Mara E Steinberg, Truong Luu, Yinglei Lai, and Norman H Lee. Gene expression profiling differentiates autism case-controls and phenotypic variants of autism spectrum disorders: evidence for circadian rhythm dysfunction in severe autism. Autism Res, 2(2):78–97, April 2009. ISSN 1939-3806. doi:10.1002/aur.73. URL http://www.ncbi.nlm.nih.gov/pubmed/19418574. PMID: 19418574. → pages 6, 7, 24

[30] Valerie W. Hu, AnhThu Nguyen, Kyung Soon Kim, Mara E. Steinberg, Tewarit Sarachana, Michele A. Scully, Steven J. Soldin, Truong Luu, and Norman H. Lee. Gene expression profiling of lymphoblasts from autistic and nonaffected sib pairs: Altered pathways in neuronal development and steroid biosynthesis. PLoS ONE, 4 (6):e5775, June 2009. doi:10.1371/journal.pone.0005775. URL http://dx.doi.org/10.1371/journal.pone.0005775. → pages 6, 24, 55

[31] Sek Won Kong, Christin D. Collins, Yuko Shimizu-Motohashi, Ingrid A. Holm, Malcolm G. Campbell, In-Hee Lee, Stephanie J. Brewster, Ellen Hanson, Heather K. Harris, Kathryn R. Lowe, Adrianna Saada, Andrea Mora, Kimberly Madison, Rachel Hundley, Jessica Egan, Jillian McCarthy, Ally Eran, Michal Galdzicki, Leonard Rappaport, Louis M. Kunkel, and Isaac S. Kohane. Characteristics and predictive value of blood transcriptome signature in males with autism spectrum disorders. PLoS ONE, 7(12):e49475, December 2012. ISSN 1932-6203. doi:10.1371/journal.pone.0049475. → pages 6, 9, 51, 55

[32] Mark D. Alter, Rutwik Kharkar, Keri E. Ramsey, David W. Craig, Raun D. Melmed, Theresa A. Grebe, R. Curtis Bay, Sharman Ober-Reynolds, Janet Kirwan, Josh J. Jones, J. Blake Turner, Rene Hen, and Dietrich A. Stephan. Autism and increased paternal age related changes in global levels of gene expression regulation. PLoS ONE, 6(2):e16715, February 2011. doi:10.1371/journal.pone.0016715. URL http://dx.doi.org/10.1371/journal.pone.0016715. → pages 6

[33] Rui Luo, Stephan J Sanders, Yuan Tian, Irina Voineagu, Ni Huang, Su H Chu, Lambertus Klei, Chaochao Cai, Jing Ou, Jennifer K Lowe, Matthew E Hurles, Bernie Devlin, Matthew W State, and Daniel H Geschwind. Genome-wide transcriptome profiling reveals the functional impact of rare de novo and recurrent CNVs in autism spectrum disorders. Am J Hum Genet, June 2012. ISSN 1537-6605. doi:10.1016/j.ajhg.2012.05.011. URL http://www.ncbi.nlm.nih.gov/pubmed/22726847. PMID: 22726847. → pages 6, 44, 45

[34] Patrick F. Sullivan, Cheng Fan, and Charles M. Perou. Evaluating the comparability of gene expression in blood and brain. Am. J. Med. Genet. Part B, 141B(3):261–268, 2006. ISSN 1552-485X. doi:10.1002/ajmg.b.30272. URL http://onlinelibrary.wiley.com/doi/10.1002/ajmg.b.30272/abstract. → pages 8

[35] Margus Lukk, Misha Kapushesky, Janne Nikkilä, Helen Parkinson, Angela Goncalves, Wolfgang Huber, Esko Ukkonen, and Alvis Brazma. A global map of human gene expression. Nature biotechnology, 28(4):322–324, 2010. doi:10.1038/nbt0410-322. URL http://dx.doi.org/10.1038/nbt0410-322. → pages 8

[36] Jennie E Larkin, Bryan C Frank, Haralambos Gavras, Razvan Sultana, and John Quackenbush. Independence and reproducibility across microarray platforms. Nat. Methods, 2(5):337–344, April 2005. ISSN 1548-7091, 1548-7105. doi:10.1038/nmeth757. URL http://www.nature.com/doifinder/10.1038/nmeth757. → pages 8

[37] Adaikalavan Ramasamy, Adrian Mondry, Chris C Holmes, and Douglas G Altman. Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Med, 5(9):e184, September 2008. doi:10.1371/journal.pmed.0050184. URL http://dx.doi.org/10.1371/journal.pmed.0050184. → pages 8, 18

[38] Daniel H Geschwind and Pat Levitt. Autism spectrum disorders: developmental disconnection syndromes. Curr. Opin. Neurobiol., 17(1):103–111, February 2007. ISSN 0959-4388. doi:10.1016/j.conb.2007.01.009. PMID: 17275283. → pages 8

[39] Jon McClellan and Mary-Claire King. Genetic heterogeneity in human disease. Cell, 141(2):210–217, April 2010. ISSN 1097-4172. doi:10.1016/j.cell.2010.03.032. URL http://www.ncbi.nlm.nih.gov/pubmed/20403315. PMID: 20403315. → pages 8

[40] Julia Feichtinger, Gerhard G. Thallinger, Ramsay J. McFarlane, and Lee D. Larcombe. Microarray meta-analysis: From data to expression to biological relationships. In Zlatko Trajanoski, editor, Computational Medicine, pages 59–77. Springer Vienna, Vienna, 2012. ISBN 978-3-7091-0946-5, 978-3-7091-0947-2. URL http://www.springerlink.com/content/u961373j411101w0/fulltext.html. → pages 9

[41] Morton M Hunt. How science takes stock: the story of meta-analysis. Russell Sage Foundation, New York, 1997. ISBN 0871543893 9780871543899 0871543982 9780871543981. → pages 9

[42] M. Mistry, J. Gillis, and P. Pavlidis. Genome-wide expression profiling of schizophrenia using a large combined cohort. Mol Psychiatry, 18(2):215–225, February 2013. ISSN 1359-4184. doi:10.1038/mp.2011.172. URL http://www.nature.com/mp/journal/v18/n2/full/mp2011172a.html. → pages 9, 19, 22, 23, 32

[43] S. Rogic and P. Pavlidis. Meta-analysis of kindling-induced gene expression changes in the rat hippocampus. Front Neurosci, 3:53, 2009. → pages 9, 18

[44] Kwang Ho Choi, Michael Elashoff, Brandon W Higgs, Jonathan Song, Sanghyeon Kim, Sarven Sabunciyan, Suad Diglisic, Robert H Yolken, Michael B Knable, E Fuller Torrey, and Maree J Webster. Putative psychosis genes in the prefrontal cortex: combined analysis of gene expression microarrays. BMC Psychiatry, 8:87, November 2008. ISSN 1471-244X. doi:10.1186/1471-244X-8-87. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2585075/. PMID: 18992145 PMCID: PMC2585075. → pages 9

[45] Harris M. Cooper, Larry V. Hedges, and Jeff C. Valentine. The handbook of research synthesis and meta-analysis. Russell Sage Foundation, New York, 2nd edition, 2009. ISBN 9780871541635. → pages 18

[46] Li Liu, Aniko Sabo, Benjamin M. Neale, Uma Nagaswamy, Christine Stevens, Elaine Lim, Corneliu A. Bodea, Donna Muzny, Jeffrey G. Reid, Eric Banks, Hillary Coon, Mark DePristo, Huyen Dinh, Tim Fennel, Jason Flannick, Stacey Gabriel, Kiran Garimella, Shannon Gross, Alicia Hawes, Lora Lewis, Vladimir Makarov, Jared Maguire, Irene Newsham, Ryan Poplin, Stephan Ripke, Khalid Shakir, Kaitlin E. Samocha, Yuanqing Wu, Eric Boerwinkle, Joseph D. Buxbaum, Edwin H. Cook, Bernie Devlin, Gerard D. Schellenberg, James S. Sutcliffe, Mark J. Daly, Richard A. Gibbs, and Kathryn Roeder. Analysis of rare, exonic variation amongst subjects with autism spectrum disorders and population controls. PLoS Genet, 9(4): e1003443, April 2013. doi:10.1371/journal.pgen.1003443. URL http://dx.doi.org/10.1371/journal.pgen.1003443. → pages 9

[47] E Ben-David and S Shifman. Combined analysis of exome sequencing points toward a major role for transcription regulation during brain development in autism. Mol. Psychiatry, November 2012. ISSN 1476-5578. doi:10.1038/mp.2012.148. PMID: 23147383. → pages 9

[48] T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, M. Holko, A. Yefanov, H. Lee, N. Zhang, C. L. Robertson, N. Serova, S. Davis, and A. Soboleva. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res, 41(D1): D991–D995, November 2012. ISSN 0305-1048, 1362-4962. doi:10.1093/nar/gks1193. URL http://www.nar.oxfordjournals.org/cgi/doi/10.1093/nar/gks1193. → pages 10

[49] Yuki Kuwano, Yoko Kamio, Tomoko Kawai, Sakurako Katsuura, Naoko Inada, Akiko Takaki, and Kazuhito Rokutan. Autism-associated gene expression in peripheral leucocytes commonly observed between subjects with autism and healthy women having autistic children. PLoS ONE, 6(9):e24723, 2011. doi:10.1371/journal.pone.0024723. URL http://dx.doi.org/10.1371/journal.pone.0024723. → pages 11

[50] Laurent Gautier, Leslie Cope, Benjamin M. Bolstad, and Rafael A. Irizarry. affy–analysis of affymetrix genechip data at the probe level. Bioinformatics, 20(3): 307–315, 2004. ISSN 1367-4803. doi:10.1093/bioinformatics/btg405. → pages 12

[51] Pan Du, Warren A. Kibbe, and Simon M. Lin. lumi: a pipeline for processing illumina microarray. Bioinformatics, 24(13):1547–1548, July 2008. ISSN 1367-4803, 1460-2059. doi:10.1093/bioinformatics/btn224. URL http://bioinformatics.oxfordjournals.org/content/24/13/1547. PMID: 18467348. → pages 12

[52] Gary A. Churchill. Fundamentals of experimental design for cDNA microarrays. Nat Genet, 32:490–495, December 2002. doi:10.1038/ng1031. URL http://www.nature.com/ng/journal/v32/n4s/full/ng1031.html. → pages 13

[53] W. Evan Johnson, Cheng Li, and Ariel Rabinovic. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostat, 8(1):118–127, January 2007. ISSN 1465-4644, 1468-4357. doi:10.1093/biostatistics/kxj037. URL http://biostatistics.oxfordjournals.org/content/8/1/118. PMID: 16632515. → pages 16

[54] Anton Zoubarev, Kelsey M Hamer, Kiran D Keshav, E Luke McCarthy, Joseph Roy C Santos, Thea Van Rossum, Cameron McDonald, Adam Hall, Xiang Wan, Raymond Lim, Jesse Gillis, and Paul Pavlidis. Gemma: A resource for the re-use, sharing and meta-analysis of expression profiling data. Bioinformatics, 28(17): 2272–3, July 2012. ISSN 1367-4811. doi:10.1093/bioinformatics/bts430. URL http://www.ncbi.nlm.nih.gov/pubmed/22782548. PMID: 22782548. → pages 18

[55] RA Fisher. Combining independent tests of significance. American Statistician, 2:30, 1948. → pages 18, 19

[56] Y Benjamini and Y Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300, 1995. → pages 19

[57] Laura Carrel and Huntington F. Willard. X-inactivation profile reveals extensive variability in x-linked gene expression in females. Nature, 434(7031):400–404, March 2005. ISSN 0028-0836. doi:10.1038/nature03479. URL http://www.nature.com/nature/journal/v434/n7031/full/nature03479.html. → pages 19

[58] Adeline R. Whitney, Maximilian Diehn, Stephen J. Popper, Ash A. Alizadeh, Jennifer C. Boldrick, David A. Relman, and Patrick O. Brown. Individuality and variation in gene expression patterns in human blood. PNAS, 100(4):1896–1901, February 2003. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.252784499. URL http://www.pnas.org/content/100/4/1896. PMID: 12578971. → pages 19

[59] Hyo Jung Kang, Yuka Imamura Kawasawa, Feng Cheng, Ying Zhu, Xuming Xu, Mingfeng Li, André M. M. Sousa, Mihovil Pletikos, Kyle A. Meyer, Goran Sedmak, Tobias Guennel, Yurae Shin, Matthew B. Johnson, Željka Krsnik, Simone Mayer, Sofia Fertuzinhos, Sheila Umlauf, Steven N. Lisgo, Alexander Vortmeyer, Daniel R. Weinberger, Shrikant Mane, Thomas M. Hyde, Anita Huttner, Mark Reimers, Joel E. Kleinman, and Nenad Šestan. Spatio-temporal transcriptome of the human brain. Nature, 478(7370):483–489, October 2011. ISSN 0028-0836. doi:10.1038/nature10523. URL http://www.nature.com/nature/journal/v478/n7370/full/nature10523.html. → pages 19

[60] Homin K Lee, William Braynen, Kiran Keshav, and Paul Pavlidis. ErmineJ: tool for functional analysis of gene expression data sets. BMC Bioinformatics, 6:269, 2005. ISSN 1471-2105. doi:10.1186/1471-2105-6-269. PMID: 16280084. → pages 21

[61] M Ashburner, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis, K Dolinski, S S Dwight, J T Eppig, M A Harris, D P Hill, L Issel-Tarver, A Kasarskis, S Lewis, J C Matese, J E Richardson, M Ringwald, G M Rubin, and G Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat. Genet., 25(1):25–29, May 2000. ISSN 1061-4036. doi:10.1038/75556. PMID: 10802651. → pages 21

[62] Jesse Gillis, Meeta Mistry, and Paul Pavlidis. Gene function analysis in complex data sets using ErmineJ. Nat. Protocols, 5(6):1148–1159, June 2010. ISSN 1754-2189. doi:10.1038/nprot.2010.78. URL http://www.nature.com/nprot/journal/v5/n6/full/nprot.2010.78.html. → pages 22, 55

[63] D. L. Wheeler, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, L. Y. Geer, Y. Kapustin, O. Khovayko, D. Landsman, D. J. Lipman, T. L. Madden, D. R. Maglott, J. Ostell, V. Miller, K. D. Pruitt, G. D. Schuler, E. Sequeira, S. T. Sherry, K. Sirotkin, A. Souvorov, G. Starchenko, R. L. Tatusov, T. A. Tatusova, L. Wagner, and E. Yaschenko. Database resources of the national center for biotechnology information. Nucl. Acids Res., 35(Database):D5–D12, January 2007. ISSN 0305-1048, 1362-4962. doi:10.1093/nar/gkl1031. URL http://www.nar.oxfordjournals.org/cgi/doi/10.1093/nar/gkl1031. → pages 22

[64] Krassimira Garbett, Philip J. Ebert, Amanda Mitchell, Carla Lintas, Barbara Manzi, Károly Mirnics, and Antonio M. Persico. Immune transcriptome alterations in the temporal cortex of subjects with autism. Neurobiol Dis, 30(3):303–311, June 2008. ISSN 0969-9961. doi:10.1016/j.nbd.2008.01.012. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2693090/. PMID: 18378158 PMCID: PMC2693090. → pages 22, 52, 53

[65] A E Purcell, O H Jeon, A W Zimmerman, M E Blue, and J Pevsner. Postmortem brain abnormalities of the glutamate neurotransmitter system in autism. Neurology, 57(9):1618–1628, November 2001. ISSN 0028-3878. PMID: 11706102. → pages 22, 52, 53

[66] Dalila Pinto, Alistair T. Pagnamenta, Lambertus Klei, Richard Anney, Daniele Merico, Regina Regan, Judith Conroy, Tiago R. Magalhaes, Catarina Correia, Brett S. Abrahams, Joana Almeida, Elena Bacchelli, Gary D. Bader, Anthony J. Bailey, Gillian Baird, Agatino Battaglia, Tom Berney, Nadia Bolshakova, Sven Bölte, Patrick F. Bolton, Thomas Bourgeron, Sean Brennan, Jessica Brian, Susan E. Bryson, Andrew R. Carson, Guillermo Casallo, Jillian Casey, Brian H. Y. Chung, Lynne Cochrane, Christina Corsello, Emily L. Crawford, Andrew Crossett, Cheryl Cytrynbaum, Geraldine Dawson, Maretha de Jonge, Richard Delorme, Irene Drmic, Eftichia Duketis, Frederico Duque, Annette Estes, Penny Farrar, Bridget A. Fernandez, Susan E. Folstein, Eric Fombonne, Christine M. Freitag, John Gilbert, Christopher Gillberg, Joseph T. Glessner, Jeremy Goldberg, Andrew Green, Jonathan Green, Stephen J. Guter, Hakon Hakonarson, Elizabeth A. Heron, Matthew Hill, Richard Holt, Jennifer L. Howe, Gillian Hughes, Vanessa Hus, Roberta Igliozzi, Cecilia Kim, Sabine M. Klauck, Alexander Kolevzon, Olena Korvatska, Vlad Kustanovich, Clara M. Lajonchere, Janine A. Lamb, Magdalena Laskawiec, Marion Leboyer, Ann Le Couteur, Bennett L. Leventhal, Anath C. Lionel, Xiao-Qing Liu, Catherine Lord, Linda Lotspeich, Sabata C. Lund, Elena Maestrini, William Mahoney, Carine Mantoulan, Christian R. Marshall, Helen McConachie, Christopher J. McDougle, Jane McGrath, William M. McMahon, Alison Merikangas, Ohsuke Migita, Nancy J. Minshew, Ghazala K. Mirza, Jeff Munson, Stanley F. Nelson, Carolyn Noakes, Abdul Noor, Gudrun Nygren, Guiomar Oliveira, Katerina Papanikolaou, Jeremy R. Parr, Barbara Parrini, Tara Paton, Andrew Pickles, Marion Pilorge, Joseph Piven, Chris P. Ponting, David J. Posey, Annemarie Poustka, Fritz Poustka, Aparna Prasad, Jiannis Ragoussis, Katy Renshaw, Jessica Rickaby, Wendy Roberts, Kathryn Roeder, Bernadette Roge, Michael L. Rutter, Laura J. Bierut, John P. Rice, Jeff Salt, Katherine Sansom, Daisuke Sato, Ricardo Segurado, Ana F. Sequeira, Lili Senman, Naisha Shah, Val C. Sheffield, Latha Soorya, Inês Sousa, Olaf Stein, Nuala Sykes, Vera Stoppioni, Christina Strawbridge, Raffaella Tancredi, Katherine Tansey, Bhooma Thiruvahindrapduram, Ann P. Thompson, Susanne Thomson, Ana Tryfon, John Tsiantis, Herman Van Engeland, John B. Vincent, Fred Volkmar, Simon Wallace, Kai Wang, Zhouzhi Wang, Thomas H. Wassink, Caleb Webber, Rosanna Weksberg, Kirsty Wing, Kerstin Wittemeyer, Shawn Wood, Jing Wu, Brian L. Yaspan, Danielle Zurawiecki, Lonnie Zwaigenbaum, Joseph D. Buxbaum, Rita M. Cantor, Edwin H. Cook, Hilary Coon, Michael L. Cuccaro, Bernie Devlin, Sean Ennis, Louise Gallagher, Daniel H. Geschwind, Michael Gill, Jonathan L. Haines, Joachim Hallmayer, Judith Miller, Anthony P. Monaco, John I. Nurnberger Jr, Andrew D. Paterson, Margaret A. Pericak-Vance, Gerard D. Schellenberg, Peter Szatmari, Astrid M. Vicente, Veronica J. Vieland, Ellen M. Wijsman, Stephen W. Scherer, James S. Sutcliffe, and Catalina Betancur. Functional impact of global rare copy number variation in autism spectrum disorders. Nature, 466(7304):368–372, July 2010. ISSN 0028-0836. doi:10.1038/nature09146. URL http://www.nature.com/nature/journal/v466/n7304/full/nature09146.html. → pages 22

[67] A. John Iafrate, Lars Feuk, Miguel N. Rivera, Marc L. Listewnik, Patricia K. Donahoe, Ying Qi, Stephen W. Scherer, and Charles Lee. Detection of large-scale variation in the human genome. Nat Genet, 36(9):949–951, September 2004. ISSN 1061-4036. doi:10.1038/ng1416. URL http://www.nature.com/ng/journal/v36/n9/abs/ng1416.html. → pages 23

[68] A. S. Hinrichs, D. Karolchik, R. Baertsch, G. P. Barber, G. Bejerano, H. Clawson, M. Diekhans, T. S. Furey, R. A. Harte, F. Hsu, J. Hillman-Jackson, R. M. Kuhn, J. S. Pedersen, A. Pohl, B. J. Raney, K. R. Rosenbloom, A. Siepel, K. E. Smith, C. W. Sugnet, A. Sultan-Qurraie, D. J. Thomas, H. Trumbower, R. J. Weber, M. Weirauch, A. S. Zweig, D. Haussler, and W. J. Kent. The UCSC genome browser database: update 2006. Nucleic Acids Res, 34(Database issue):D590–D598, January 2006. ISSN 0305-1048. doi:10.1093/nar/gkj144. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1347506/. PMID: 16381938 PMCID: PMC1347506. → pages 23

[69] P. Pavlidis, I. Wapinski, and W. S. Noble. Support vector machine classification on the web. Bioinformatics, 20(4):586–7, 2004. ISSN 1367-4803 (Print). URL http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=14990457. → pages 23

[70] T S Keshava Prasad, Kumaran Kandasamy, and Akhilesh Pandey. Human protein reference database and human proteinpedia as discovery tools for systems biology. Methods Mol. Biol., 577:67–79, 2009. ISSN 1940-6029. doi:10.1007/978-1-60761-232-2_6. PMID: 19718509. → pages 23

[71] Andrew Chatr-aryamontri, Arnaud Ceol, Luisa Montecchi Palazzi, Giuliano Nardelli, Maria Victoria Schneider, Luisa Castagnoli, and Gianni Cesareni. MINT: the molecular INTeraction database. Nucl. Acids Res., 35(suppl 1):D572–D574, January 2007. ISSN 0305-1048, 1362-4962. doi:10.1093/nar/gkl950. URL http://nar.oxfordjournals.org/content/35/suppl_1/D572. PMID: 17135203. → pages 23

[72] Ioannis Xenarios, Danny W. Rice, Lukasz Salwinski, Marisa K. Baron, Edward M. Marcotte, and David Eisenberg. DIP: the database of interacting proteins. Nucleic Acids Res, 28(1):289–291, January 2000. ISSN 0305-1048. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC102387/. PMID: 10592249 PMCID: PMC102387. → pages 23

[73] David J. Lynn, Geoffrey L. Winsor, Calvin Chan, Nicolas Richard, Matthew R. Laird, Aaron Barsky, Jennifer L. Gardy, Fiona M. Roche, Timothy H. W. Chan, Naisha Shah, Raymond Lo, Misbah Naseer, Jaimmie Que, Melissa Yau, Michael Acab, Dan Tulpan, Matthew D. Whiteside, Avinash Chikatamarla, Bernadette Mah, Tamara Munzner, Karsten Hokamp, Robert E. W. Hancock, and Fiona S. L. Brinkman. InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol, 4(1), September 2008. doi:10.1038/msb.2008.55. URL http://www.nature.com/msb/journal/v4/n1/full/msb200855.html. → pages 23

[74] Sabry Razick, George Magklaras, and Ian M. Donaldson. iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics, 9(1):405, September 2008. ISSN 1471-2105. doi:10.1186/1471-2105-9-405. URL http://www.biomedcentral.com/1471-2105/9/405/abstract. PMID: 18823568. → pages 23

[75] American Psychiatric Association and Task Force on DSM-IV. Diagnostic and statistical manual of mental disorders DSM-IV-TR. American Psychiatric Association, Washington, DC, 2000. ISBN 0890423342 9780890423349. URL http://dsm.psychiatryonline.org/book.aspx?bookid=22. → pages 24

[76] C Lord, M Rutter, and A Le Couteur. Autism diagnostic interview-revised: a revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. J Autism Dev Disord, 24(5):659–685, October 1994. ISSN 0162-3257. PMID: 7814313. → pages 24

[77] C Lord, M Rutter, S Goode, J Heemsbergen, H Jordan, L Mawhood, and E Schopler. Autism diagnostic observation schedule: a standardized observation of communicative and social behavior. J Autism Dev Disord, 19(2):185–212, June 1989. ISSN 0162-3257. PMID: 2745388. → pages 24

[78] M. Mistry and P. Pavlidis. A cross-laboratory comparison of expression profiling data from normal human postmortem brain. Neuroscience, 167(2):384–95, 2010. ISSN 1873-7544 (Electronic) 0306-4522 (Linking). doi:10.1016/j.neuroscience.2010.01.016. URL http://www.ncbi.nlm.nih.gov/pubmed/20138973. → pages 31

[79] Yiming Zhou, Qing Zhang, Owen Stephens, Christoph J Heuck, Erming Tian, Jeffrey R Sawyer, Marie-Astrid Cartron-Mizeracki, Pingping Qu, Jason Keller, Joshua Epstein, Bart Barlogie, and John D Shaughnessy Jr. Prediction of cytogenetic abnormalities with gene expression profiles. Blood, 119(21):e148–150, May 2012. ISSN 1528-0020. doi:10.1182/blood-2011-10-388702. PMID: 22496154. → pages 44, 55

[80] Gerald D. Fischbach and Catherine Lord. The simons simplex collection: A resource for identification of autism genetic risk factors. Neuron, 68(2):192–195, October 2010. ISSN 0896-6273. doi:10.1016/j.neuron.2010.10.006. URL http://www.sciencedirect.com/science/article/pii/S0896627310008305. → pages 50

[81] Georgy Bakalkin, Hiroyuki Watanabe, Justyna Jezierska, Cloë Depoorter, Corien Verschuuren-Bemelmans, Igor Bazov, Konstantin A Artemenko, Tatjana Yakovleva, Dennis Dooijes, Bart P C Van de Warrenburg, Roman A Zubarev, Berry Kremer, Pamela E Knapp, Kurt F Hauser, Cisca Wijmenga, Fred Nyberg, Richard J Sinke, and Dineke S Verbeek. Prodynorphin mutations cause the neurodegenerative disorder spinocerebellar ataxia type 23. Am. J. Hum. Genet., 87(5):593–603, November 2010. ISSN 1537-6605. doi:10.1016/j.ajhg.2010.10.001. PMID: 21035104. → pages 51

[82] J F Oram. Tangier disease and ABCA1. Biochim. Biophys. Acta, 1529(1-3): 321–330, December 2000. ISSN 0006-3002. PMID: 11111099. → pages 51

[83] Linda Erkman, Paul A. Yates, Todd McLaughlin, Robert J. McEvilly, Thomas Whisenhunt, Shawn M. O’Connell, Anna I. Krones, Michael A. Kirby, David H. Rapaport, John R. Bermingham Jr., Dennis D.M. O’Leary, and Michael G. Rosenfeld. A POU domain transcription factor-dependent program regulates axon pathfinding in the vertebrate visual system. Neuron, 28(3):779–792, December 2000. ISSN 0896-6273. doi:10.1016/S0896-6273(00)00153-7. URL http://www.sciencedirect.com/science/article/pii/S0896627300001537. → pages 51

[84] Nicolas Ramoz, Jennifer G. Reichert, Christopher J. Smith, Jeremy M. Silverman, Irina N. Bespalova, Kenneth L. Davis, and Joseph D. Buxbaum. Linkage and association of the mitochondrial Aspartate/Glutamate carrier SLC25A12 gene with autism. Am J Psychiatry, 161(4):662–669, April 2004. ISSN 0002-953X. doi:10.1176/appi.ajp.161.4.662. URL http://dx.doi.org/10.1176/appi.ajp.161.4.662. → pages 54

[85] Maria P Abbracchio, Geoffrey Burnstock, Alexei Verkhratsky, and Herbert Zimmermann. Purinergic signalling in the nervous system: an overview. Trends Neurosci., 32(1):19–29, January 2009. ISSN 0166-2236. doi:10.1016/j.tins.2008.10.001. PMID: 19008000. → pages 54

[86] Ana C Andreazza, Li Shao, Jun-Feng Wang, and L Trevor Young. Mitochondrial complex i activity and oxidative damage to mitochondrial proteins in the prefrontal cortex of patients with bipolar disorder. Arch. Gen. Psychiatry, 67(4):360–368, April 2010. ISSN 1538-3636. doi:10.1001/archgenpsychiatry.2010.22. PMID: 20368511. → pages 54

[87] Xiujun Sun, Jun-Feng Wang, Michael Tseng, and L Trevor Young. Downregulation in components of the mitochondrial electron transport chain in the postmortem frontal cortex of subjects with bipolar disorder. J Psychiatry Neurosci, 31(3): 189–196, May 2006. ISSN 1180-4882. PMID: 16699605. → pages 54

[88] D. A. Rossignol and R. E. Frye. Mitochondrial dysfunction in autism spectrum disorders: a systematic review and meta-analysis. Mol Psychiatry, 17(3):290–314, 2012. ISSN 1359-4184. doi:10.1038/mp.2010.136. URL http://www.nature.com/mp/journal/v17/n3/full/mp2010136a.html. → pages 54

[89] Fahimeh Piryaei, Massoud Houshmand, Omid Aryani, Sepideh Dadgar, and Zahra-Soheila Soheili. Investigation of the mitochondrial ATPase 6/8 and tRNA(Lys) genes mutations in autism. Cell J, 14(2):98–101, 2012. ISSN 2228-5806. PMID: 23508290. → pages 54

[90] Vanesa Álvarez-Iglesias, Ana Mosquera-Miguel, Ivón Cuscó, Ángel Carracedo, Luis Alberto Pérez-Jurado, and Antonio Salas. Reassessing the role of mitochondrial DNA mutations in autism spectrum disorder. BMC Med. Genet., 12:50, 2011. ISSN 1471-2350. doi:10.1186/1471-2350-12-50. PMID: 21470425. → pages 54

[91] Sukhbir Dhillon, Jessica A Hellings, and Merlin G Butler. Genetics and mitochondrial abnormalities in autism spectrum disorders: A review. Curr Genomics, 12(5):322–332, August 2011. ISSN 1389-2029. doi:10.2174/138920211796429745. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3145262/. PMID: 22294875 PMCID: PMC3145262. → pages 54

[92] Ayyappan Anitha, Kazuhiko Nakamura, Ismail Thanseem, Kazuo Yamada, Yoshimi Iwayama, Tomoko Toyota, Hideo Matsuzaki, Taishi Miyachi, Satoru Yamada, Masatsugu Tsujii, Kenji J Tsuchiya, Kaori Matsumoto, Yasuhide Iwata, Katsuaki Suzuki, Hironobu Ichikawa, Toshiro Sugiyama, Takeo Yoshikawa, and Norio Mori. Brain region-specific altered expression and association of mitochondria-related genes in autism. Mol Autism, 3(1):12, 2012. ISSN 2040-2392. doi:10.1186/2040-2392-3-12. PMID: 23116158. → pages 54

[93] C. N. Henrichsen, E. Chaignat, and A. Reymond. Copy number variants, diseases and gene expression. Hum Mol Gen, 18(R1):R1–R8, April 2009. ISSN 0964-6906, 1460-2083. doi:10.1093/hmg/ddp011. URL http://www.hmg.oxfordjournals.org/cgi/doi/10.1093/hmg/ddp011. → pages 55

[94] Jason J. Wolff. Differences in white matter fiber tract development present from 6 to 24 months in infants with autism. Am J Psychiatry, February 2012. ISSN 0002-953X. doi:10.1176/appi.ajp.2011.11091447. URL http://neuro.psychiatryonline.org/article.aspx?articleid=668180&RelatedWidgetArticles=true. → pages 55

[95] G Konopka, E Wexler, E Rosen, Z Mukamel, G E Osborn, L Chen, D Lu, F Gao, K Gao, J K Lowe, and D H Geschwind. Modeling the functional genomics of autism using human neurons. Mol Psychiatry, 17(2):202–214, February 2012. ISSN 1359-4184. URL http://dx.doi.org/10.1038/mp.2011.60. → pages 55

[96] C. Ecker, W. Spooren, and D. G. M. Murphy. Translational approaches to the biology of autism: false dawn or a new era? Mol Psychiatry, 18(4):435–442, April 2013. ISSN 1359-4184. doi:10.1038/mp.2012.102. URL http://www.nature.com/mp/journal/v18/n4/full/mp2012102a.html?WT.ec_id=MP-201304. → pages 55

[97] Jesse Gillis and Paul Pavlidis. The impact of multifunctional genes on “Guilt by association” analysis. PLoS ONE, 6(2):e17258, February 2011. doi:10.1371/journal.pone.0017258. URL http://dx.doi.org/10.1371/journal.pone.0017258. → pages 55

Appendix A

Appendix

Table A.1: Up-regulated brain meta-signature. FDR computed before removal of sex-biased genes. A: Known Candidate; B: Gender Biased; C: Known CNV. Y: Yes; N: No.

Gene Symbol  Entrez ID  Number of Studies  p-value  FDR  A  B  C  Locus
PTPN1  5770  3  1.47E-06  1.37E-02  N  N  N  20q13.1-q13.2
HSPA1A  3303  3  2.48E-06  1.37E-02  N  N  N  6p21.3
ZC3HAV1  56829  3  2.52E-06  1.37E-02  N  N  N  7q34
C5AR1  728  3  4.19E-06  1.37E-02  N  N  N  19q13.3-q13.4
STC1  6781  3  4.55E-06  1.37E-02  N  N  N  8p21-p11.2
KIF20A  10112  3  5.55E-06  1.37E-02  N  N  N  5q31
PDYN  5173  3  6.05E-06  1.37E-02  N  N  N  20p13
GANAB  23193  3  6.60E-06  1.37E-02  N  N  N  11q12.3
TMED9  54732  3  1.16E-05  2.14E-02  N  N  N  5q35.3
CALR  811  3  1.87E-05  2.98E-02  N  N  N  19p13.3-p13.2
FAM159A  348378  3  2.00E-05  2.98E-02  N  N  N  1p32.3
SCIN  85477  3  2.27E-05  2.98E-02  N  N  Y  7p21.3
HSPA5  3309  3  2.52E-05  2.98E-02  N  N  N  9q33.3
C1orf106  55765  3  2.60E-05  2.98E-02  N  N  N  1q32.1
BRPF1  7862  3  2.93E-05  2.98E-02  N  N  N  3p26-p25
IGFBP5  3488  3  3.11E-05  2.98E-02  N  N  N  2q33-q36
C2CD4A  145741  3  3.31E-05  2.98E-02  N  N  N  15q22.2
LILRB3  11025  3  3.34E-05  2.98E-02  N  N  N  19q13.4
CALU  813  3  3.56E-05  2.98E-02  N  N  N  7q32.1
TAGLN2  8407  3  3.59E-05  2.98E-02  N  N  N  1q21-q25
CD93  22918  3  4.37E-05  3.45E-02  N  N  N  20p11.21
DNAJB1  3337  3  4.91E-05  3.50E-02  N  N  N  19p13.2
CLDN23  137075  3  5.00E-05  3.50E-02  N  N  N  8p23.1
LMAN2L  81562  3  5.07E-05  3.50E-02  N  N  N  2q11.2
ADAMTS9  56999  3  5.73E-05  3.80E-02  N  N  N  3p14.1
CRISPLD2  83716  3  6.19E-05  3.95E-02  N  N  N  16q24.1
PCOLCE2  26577  3  7.02E-05  4.18E-02  N  N  N  3q21-q24
ABCA1  19  3  7.19E-05  4.18E-02  N  N  N  9q31.1
JUN  3725  3  7.36E-05  4.18E-02  N  N  N  1p32-p31
ITPRIP  85450  3  7.65E-05  4.18E-02  N  Y  N  10q25.1
ALPK1  80216  3  7.81E-05  4.18E-02  N  N  N  4q25

Table A.2: Down-regulated brain meta-signature. FDR computed before removal of sex-biased genes. A: Known Candidate; B: Gender Biased; C: Known CNV. Y: Yes; N: No.

Gene Symbol  Entrez ID  Number of Studies  p-value  FDR  A  B  C  Locus
ABCG2  9429  3  3.18E-07  5.27E-03  N  N  Y  4q22
SHD  56961  3  1.09E-06  9.04E-03  N  N  N  19p13.3
FIS1  51024  3  4.39E-06  1.66E-02  N  N  N  7q22.1
RCAN2  10231  3  5.50E-06  1.66E-02  N  N  N  6p12.3
MRPL2  51069  3  6.51E-06  1.66E-02  N  N  N  6p21.3
COA1  55744  3  6.72E-06  1.66E-02  N  N  N  7p13
C12orf57  113246  3  8.19E-06  1.66E-02  N  N  N  12p13.31
SLC22A18AS  5003  3  8.66E-06  1.66E-02  N  N  N  11p15.5
TTC1  7265  3  9.01E-06  1.66E-02  N  N  N  5q33.3
KLHDC2  23588  3  1.72E-05  2.64E-02  N  N  N  14q21.3
ATP5O  539  3  1.83E-05  2.64E-02  N  N  N  21q22.11
ZFAND2B  130617  3  1.91E-05  2.64E-02  N  N  N  2q35
CCDC25  55246  3  2.59E-05  3.31E-02  N  N  N  8p21.1
THOC5  8563  3  2.90E-05  3.43E-02  N  N  N  22q12.2
ANKRD29  147463  3  3.11E-05  3.44E-02  N  N  N  18q11.2
ZNF25  219749  3  3.64E-05  3.72E-02  N  N  N  10p11.1
HAPLN4  404037  3  3.81E-05  3.72E-02  N  N  N  19p13.1
ANO1  55107  3  4.19E-05  3.86E-02  N  N  N  11q13.3
FBXL15  79176  3  4.70E-05  3.98E-02  N  N  N  10q24.32
ABTB1  80325  3  4.79E-05  3.98E-02  N  N  N  3q21
UQCRQ  27089  3  5.45E-05  4.20E-02  N  N  N  5q31.1
MRPL54  116541  3  5.57E-05  4.20E-02  N  N  N  19p13.3
ASGR2  433  3  6.85E-05  4.89E-02  N  N  N  17p
PRKAB1  5564  3  7.74E-05  4.89E-02  N  N  N  12q24.1-q24.3
VILL  50853  3  7.97E-05  4.89E-02  N  N  N  3p21.3
PANX2  56666  3  8.07E-05  4.89E-02  N  N  Y  22q13.33
SNRNP25  79622  3  8.69E-05  4.89E-02  N  N  Y  16p13.3
CCBL2  56267  3  8.75E-05  4.89E-02  N  N  N  1p22.2
COQ3  51805  3  8.80E-05  4.89E-02  N  N  N  6q16.2
PIH1D1  55011  3  9.13E-05  4.89E-02  N  N  N  19q13.33
CHCHD6  84303  3  9.32E-05  4.89E-02  N  N  N  3q21.3
NAT6  24142  3  9.79E-05  4.89E-02  N  N  N  3p21.3
EIF3K  27335  3  9.82E-05  4.89E-02  N  N  N  19q13.2
FAM58A  92002  3  1.00E-04  4.89E-02  N  N  N  Xq28
GRK6  2870  3  1.10E-04  4.97E-02  N  N  Y  5q35
GAS2  2620  3  1.15E-04  4.97E-02  Y  N  N  11p14.3
HINT2  84681  3  1.16E-04  4.97E-02  N  N  N  9p13.3
NEFH  4744  3  1.18E-04  4.97E-02  N  N  N  22q12.2
PVALB  5816  3  1.25E-04  4.97E-02  N  N  N  22q13.1
ZBTB8OS  339487  3  1.29E-04  4.97E-02  N  N  N  1p35.1
TOX2  84969  3  1.33E-04  4.97E-02  N  N  N  20q13.12
NFU1  27247  3  1.33E-04  4.97E-02  N  N  N  2p15-p13
ACTR6  64431  3  1.34E-04  4.97E-02  N  N  N  12q23.1
NDUFAF2  91942  3  1.38E-04  4.97E-02  N  N  N  5q12.1
EDN3  1908  3  1.41E-04  4.97E-02  N  N  N  20q13.2-q13.3
ACTR1B  10120  3  1.43E-04  4.97E-02  N  N  N  2q11.1-q11.2
GLI4  2738  3  1.47E-04  4.97E-02  N  N  N  8q24.3
FDX1L  112812  3  1.49E-04  4.97E-02  N  N  N  19p13.2
ALOX12B  242  3  1.49E-04  4.97E-02  N  N  N  17p13.1

Table A.3: Up-regulated blood meta-signature. FDR computed before removal of sex-biased genes. A: Known Candidate; B: Gender Biased; C: Known CNV. Y: Yes; N: No.

Gene Symbol  Entrez ID  Number of Studies  p-value  FDR  A  B  C  Locus
KDM5D  8284  9  2.40E-16  4.56E-12  N  Y  N  Yq11
ZFY  7544  6  1.23E-14  1.17E-10  N  Y  N  Yp11.3
RPS4Y1  6192  6  3.03E-13  1.92E-09  N  Y  N  Yp11.3
USP9Y  8287  9  2.81E-11  1.33E-07  Y  Y  N  Yq11.2
EIF1AY  9086  9  5.27E-11  2.00E-07  N  Y  N  Yq11.223
PRKY  5616  8  2.76E-10  8.23E-07  N  Y  N  Yp11.2
UTY  7404  9  3.03E-10  8.23E-07  N  Y  N  Yq11
DDX3Y  8653  9  2.13E-09  5.05E-06  N  Y  N  Yq11
TXLNG2P  246126  9  3.85E-08  7.32E-05  N  Y  N  Yq11.222
SCARNA17  677769  5  3.55E-08  7.32E-05  N  N  N  18q21.1
ZNF322  79692  9  8.12E-08  1.40E-04  N  N  N  6p22.1
ZNF594  84622  6  1.94E-07  3.07E-04  N  N  N  17p13
CXCR7  57007  9  2.10E-07  3.07E-04  N  N  N  2q37.3
CAMSAP2  23271  9  3.84E-07  5.21E-04  Y  N  N  1q32.1
GIMAP8  155038  6  5.05E-07  6.40E-04  N  N  N  7q36.1
TMSB4Y  9087  8  5.92E-07  6.63E-04  N  Y  N  Yq11.221
ZNF721  170960  9  5.94E-07  6.63E-04  N  N  Y  4p16.3
NOTCH2  4853  9  8.69E-07  9.17E-04  N  Y  N  1p13-p11
ABLIM1  3983  9  9.89E-07  9.39E-04  N  N  N  10q25
PRKCH  5583  8  9.63E-07  9.39E-04  N  N  N  14q23.1
UBE3A  7337  9  1.76E-06  1.59E-03  Y  N  Y  15q11.2
MALAT1  378938  8  1.97E-06  1.70E-03  N  N  N  11q13.1
ZNF763  284390  4  2.71E-06  2.24E-03  N  N  N  19p13.2
PLAG1  5324  7  2.99E-06  2.37E-03  N  N  N  8q12
MAN2A2  4122  9  4.66E-06  3.54E-03  N  N  N  15q26.1
BAHD1  22893  6  6.38E-06  4.39E-03  N  N  N  15q15.1
HCK  3055  8  6.40E-06  4.39E-03  N  N  Y  20q11-q12
APBB1  322  9  6.47E-06  4.39E-03  N  N  N  11p15
FAM82A1  151393  6  7.22E-06  4.57E-03  N  N  N  2p22.2
LOC728392  728392  7  7.06E-06  4.57E-03  N  N  N  17p13.2
ENO3  2027  6  7.67E-06  4.70E-03  N  N  N  17pter-p11
P2RX7  5027  6  8.50E-06  4.83E-03  N  N  Y  12q24
GFOD1  54438  9  8.62E-06  4.83E-03  N  N  N  6pter-p22.1
SEMA4C  54910  6  8.65E-06  4.83E-03  N  N  N  2q11.2
LOC100287482  100287482  5  9.03E-06  4.87E-03  N  N  N  7q32.1
ZNF611  81856  5  9.22E-06  4.87E-03  N  N  Y  19q13.41
ZNF445  353274  7  1.05E-05  5.24E-03  N  N  N  3p21.32
LCP2  3937  6  1.03E-05  5.24E-03  N  N  N  5q35.1
ADCY1  107  9  1.78E-05  8.67E-03  N  N  N  7p13-p12
TRUB1  142940  9  1.98E-05  9.17E-03  N  N  N  10q25.3
ADCY10P1  221442  6  2.03E-05  9.17E-03  N  N  N  6p21.1
ZFP62  643836  7  2.02E-05  9.17E-03  N  N  N  5q35.3
CCDC88C  440193  5  2.12E-05  9.34E-03  N  N  N  14q32.11
KIF26B  55083  6  2.16E-05  9.34E-03  N  N  N  1q44
IFFO2  126917  8  2.60E-05  9.48E-03  N  N  N  1p36.13
MTERFD2  130916  9  2.49E-05  9.48E-03  N  N  N  2q37.3
ESF1  51575  9  2.58E-05  9.48E-03  N  N  N  20p12.1
TXK  7294  7  2.52E-05  9.48E-03  N  N  N  4p12
AHNAK  79026  9  2.56E-05  9.48E-03  N  N  N  11q12.2
ATRN  8455  9  2.26E-05  9.48E-03  N  N  N  20p13
ORMDL1  94101  9  2.57E-05  9.48E-03  N  N  N  2q32
CYBRD1  79901  9  2.59E-05  9.48E-03  N  N  N  2q31.1
ZNF292  23036  8  2.65E-05  9.49E-03  N  N  N  6q14.3
LOC284757  284757  4  3.11E-05  1.07E-02  N  N  N  20q13.33
IL21R  50615  9  3.11E-05  1.07E-02  N  N  N  16p11
GVINP1  387751  5  3.22E-05  1.09E-02  N  N  N  11p15.4
CYFIP1  23191  9  4.69E-05  1.53E-02  Y  N  Y  15q11
SMARCA2  6595  9  4.65E-05  1.53E-02  N  N  Y  9p22.3
SH2D1B  117157  6  4.98E-05  1.59E-02  N  N  Y  1q23.3
CIB4  130106  6  5.08E-05  1.59E-02  N  N  N  2p23.3
BIN2  51411  6  5.11E-05  1.59E-02  N  N  N  12q13
SEPT8  23176  6  5.32E-05  1.61E-02  N  N  N  5q31
RPS24  6229  8  5.35E-05  1.61E-02  N  N  N  10q22
C1orf63  57035  8  5.92E-05  1.76E-02  N  N  N  1p36.13-p35.1
HIF1AN  55662  9  6.92E-05  2.02E-02  N  N  N  10q24
AGPAT5  55326  9  7.22E-05  2.08E-02  N  N  N  8p23.1
ZNF37A  7587  7  7.94E-05  2.25E-02  N  N  N  10p11.2
ITPKB  3707  8  9.00E-05  2.48E-02  N  N  N  1q42.13
ENGASE  64772  9  8.89E-05  2.48E-02  N  N  N  17q25.3
ZNF197  10168  9  9.69E-05  2.63E-02  N  N  N  3p21
JARID2  3720  9  9.91E-05  2.65E-02  Y  N  N  6p24-p23
CARD11  84433  9  1.02E-04  2.68E-02  N  N  N  7p22
SPON2  10417  6  1.06E-04  2.71E-02  N  N  Y  4p16.3
CSTF2T  23283  7  1.07E-04  2.71E-02  N  N  Y  10q11
ARMCX3  51566  9  1.06E-04  2.71E-02  N  N  N  Xq22.1
CX3CR1  1524  7  1.17E-04  2.76E-02  N  N  N  3p21.3
FOXK1  221937  9  1.19E-04  2.76E-02  N  N  N  7p22.1
HIPK2  28996  9  1.19E-04  2.76E-02  N  N  N  7q32-q34
HOXB2  3212  9  1.20E-04  2.76E-02  N  N  N  17q21.32
INSR  3643  9  1.20E-04  2.76E-02  N  N  N  19p13.3-p13.2
CD244  51744  6  1.18E-04  2.76E-02  N  N  N  1q23.3
ZFP14  57677  7  1.11E-04  2.76E-02  N  N  N  19q13.12
FUT8-AS1  645431  4  1.19E-04  2.76E-02  N  N  Y  14q23.3
IFFO1  25900  6  1.26E-04  2.84E-02  N  N  N  12p13.3
ZNF514  84874  6  1.28E-04  2.85E-02  N  N  N  2q11.1
RAG1  5896  8  1.31E-04  2.88E-02  N  N  N  11p13
SON  6651  9  1.39E-04  3.04E-02  N  N  N  21q22.11
RCN1  5954  9  1.43E-04  3.05E-02  N  N  N  11p13
RFC2  5982  6  1.42E-04  3.05E-02  N  N  Y  7q11.23
CARS2  79587  6  1.45E-04  3.06E-02  N  N  N  13q34
KIAA1107  23285  6  1.51E-04  3.16E-02  N  N  N  1p22.1
IRF2BPL  64207  9  1.59E-04  3.26E-02  N  N  Y  14q24.3
DENND2D  79961  6  1.60E-04  3.26E-02  N  N  N  1p13.3
FAM161A  84140  9  1.62E-04  3.28E-02  N  N  N  2p15
TET3  200424  8  1.64E-04  3.29E-02  N  N  N  2p13.1
LMO7  4008  9  1.67E-04  3.30E-02  N  N  N  13q22.2
KLHL29  114818  8  1.78E-04  3.48E-02  N  N  N  2p24.1
CYTH3  9265  9  1.84E-04  3.58E-02  N  N  N  7p22.1
MKRN2  23609  9  1.92E-04  3.68E-02  N  N  N  3p25
BICD2  23299  9  1.95E-04  3.70E-02  N  N  N  9q22.31
ZNF337  26152  9  2.00E-04  3.70E-02  N  N  N  20p11.1
HK2  3099  9  1.99E-04  3.70E-02  N  N  N  2p13
ACYP1  97  9  2.01E-04  3.70E-02  N  N  N  14q24.3
NLGN4Y  22829  6  2.07E-04  3.77E-02  Y  Y  N  Yq11.221
LUC7L3  51747  9  2.09E-04  3.77E-02  N  N  N  17q21.33
GPR133  283383  6  2.20E-04  3.94E-02  N  N  Y  12q24.33
C20orf112  140688  9  2.25E-04  3.99E-02  N  N  Y  20q11.21
ACRC  93953  8  2.27E-04  3.99E-02  N  N  N  Xq13.1
PAFAH1B1  5048  9  2.31E-04  4.02E-02  Y  N  N  17p13.3
PPP1R12A  4659  9  2.34E-04  4.03E-02  N  N  N  12q15-q21
ERCC5  2073  7  2.37E-04  4.05E-02  N  N  N  13q33
BOK  666  8  2.43E-04  4.08E-02  N  N  N  2q37.3
ZC3H14  79882  9  2.43E-04  4.08E-02  N  N  N  14q31.3
SNRNP200  23020  9  2.57E-04  4.23E-02  N  N  N  2q11.2
ZC3H7B  23264  9  2.55E-04  4.23E-02  N  N  N  22q13.2
SPOCK2  9806  7  2.59E-04  4.23E-02  N  N  N  10pter-q25.3
HNRNPA2B1  3181  9  2.71E-04  4.40E-02  N  N  N  7p15
SCCPDH  51097  9  2.78E-04  4.48E-02  N  N  Y  1q44
CTDSP2  10106  9  2.89E-04  4.53E-02  N  N  Y  12q14.1
RNF168  165918  7  2.93E-04  4.53E-02  N  N  N  3q29
FAN1  22909  7  2.90E-04  4.53E-02  Y  N  Y  15q13.2-q13.3
MIR17HG  407975  7  2.89E-04  4.53E-02  N  N  N  13q31.3
KLRF1  51348  6  2.94E-04  4.53E-02  N  N  N  12p13.31

Table A.4: Down-regulated blood meta-signature. FDR computed before removal of sex-biased genes. A: Known Candidate; B: Gender Biased; C: Known CNV. Y: Yes; N: No.

Gene Symbol  Entrez ID  Number of Studies  p-value  FDR  A  B  C  Locus
HDHD1  8226  9  1.45E-10  1.51E-06  N  Y  N  Xp22.32
KDM6A  7403  9  1.59E-10  1.51E-06  N  Y  N  Xp11.2
BRAF  673  9  5.50E-07  3.02E-03  Y  N  N  7q34
SERPINB9  5272  8  7.02E-07  3.02E-03  N  N  N  6p25
HIST1H3F  8968  4  7.95E-07  3.02E-03  N  N  N  6p22.2
PNOC  5368  7  1.12E-06  3.55E-03  N  N  N  8p21
STRA13  201254  7  2.47E-06  6.24E-03  N  N  N  17q25.3
SLC17A9  63910  6  2.83E-06  6.24E-03  N  N  Y  20q13.33
AURKB  9212  9  2.96E-06  6.24E-03  N  N  N  17p13.1
MPC2  25874  6  4.00E-06  7.03E-03  N  N  N  1q24
TRANK1  9881  8  4.07E-06  7.03E-03  N  N  N  3p22.2
MRPL10  124995  6  5.07E-06  8.02E-03  N  N  N  17q21.32
LOC338758  338758  6  6.74E-06  9.84E-03  N  N  N  12q21.33
EMC4  51234  6  9.10E-06  1.21E-02  N  N  N  15q14
KDM5C  8242  9  9.59E-06  1.21E-02  Y  Y  N  Xp11.22-p11.21
RNASE2  6036  5  1.07E-05  1.27E-02  N  N  N  14q24-q31
MED25  81857  6  1.33E-05  1.48E-02  N  N  N  19q13.3
SNX22  79856  9  1.75E-05  1.77E-02  N  N  N  15q22.31
BLVRB  645  9  1.80E-05  1.77E-02  N  N  N  19q13.1-q13.2
CHRM3  1131  5  1.87E-05  1.77E-02  N  N  N  1q43
MCM6  4175  9  2.20E-05  1.99E-02  N  N  N  2q21
ZNF784  147808  8  2.35E-05  2.03E-02  N  N  N  19q13.42
ISG15  9636  9  2.47E-05  2.04E-02  N  N  N  1p36.33
CXCR3  2833  6  2.76E-05  2.11E-02  Y  N  N  Xq13
CCDC50  152137  9  2.83E-05  2.11E-02  N  N  Y  3q28
C1orf170  84808  5  2.99E-05  2.11E-02  N  N  N  1p36.33
NINJ2  4815  8  2.99E-05  2.11E-02  N  N  N  12p13
TSPAN12  23554  8  3.23E-05  2.19E-02  N  N  Y  7q31.31
KLF1  10661  7  3.52E-05  2.27E-02  N  N  N  19p13.2
C12orf49  79794  7  3.69E-05  2.27E-02  N  N  N  12q24.22
TK1  7083  7  3.81E-05  2.27E-02  N  N  N  17q23.2-q25.3
GNPDA1  10007  8  3.83E-05  2.27E-02  N  N  N  5q21
TREML3P  340206  4  4.30E-05  2.47E-02  N  N  N  6p21.1
KCND1  3750  8  4.82E-05  2.69E-02  N  N  N  Xp11.23
FAM46C  54855  9  5.12E-05  2.78E-02  N  N  N  1p12
DALRD3  55152  6  5.55E-05  2.93E-02  N  N  N  3p21.31
ZDHHC16  84287  9  5.92E-05  3.04E-02  N  N  N  10q24.1
PLK1  5347  6  6.43E-05  3.21E-02  N  N  N  16p12.2
POU2AF1  5450  9  6.75E-05  3.23E-02  N  N  N  11q23.1
MYBL2  4605  9  6.93E-05  3.23E-02  N  N  N  20q13.1
PSMA5  5686  9  6.97E-05  3.23E-02  N  N  N  1p13
RUVBL2  10856  7  7.40E-05  3.31E-02  N  N  N  19q13.3
SLC25A2  83884  9  7.49E-05  3.31E-02  N  N  N  5q31
WDR31  114987  9  7.70E-05  3.32E-02  N  N  N  9q32
MGC39372  221756  5  8.31E-05  3.51E-02  N  N  N  6p25.2
TESK2  10420  6  8.54E-05  3.53E-02  N  N  N  1p32
EGR1  1958  9  9.11E-05  3.68E-02  N  N  N  5q31.1
STAP1  26228  6  9.54E-05  3.70E-02  N  N  N  4q13.2
IFI27L1  122509  9  9.54E-05  3.70E-02  N  N  N  14q32.12
APOM  55937  9  1.00E-04  3.73E-02  N  N  N  6p21.33
SLC17A8  246213  6  1.01E-04  3.73E-02  N  N  N  12q23.1
FTL  2512  6  1.02E-04  3.73E-02  N  N  N  19q13.33
C3orf37  56941  6  1.13E-04  4.03E-02  N  N  N  3q21.3
OPTC  26254  6  1.17E-04  4.10E-02  N  N  N  1q32.1
PRDX4  10549  5  1.26E-04  4.16E-02  Y  N  N  Xp22.11
SLC16A6  9120  6  1.27E-04  4.16E-02  N  N  N  17q24.2
GAP43  2596  9  1.28E-04  4.16E-02  Y  N  N  3q13.1-q13.2
FAM98C  147965  9  1.28E-04  4.16E-02  N  N  N  19q13.2
SNX7  51375  7  1.29E-04  4.16E-02  N  N  N  1p21.3
RBM4B  83759  6  1.32E-04  4.17E-02  N  N  N  11q13
NXT1  29107  6  1.39E-04  4.32E-02  N  N  N  20p12-p11.2
MYL4  4635  9  1.42E-04  4.35E-02  N  N  N  17q21-qter
OR1J2  26740  5  1.49E-04  4.50E-02  N  N  N  9q34.11
LCT  3938  6  1.56E-04  4.55E-02  N  N  N  2q21
TIMM17B  10245  6  1.68E-04  4.55E-02  N  N  N  Xp11.23
SF3B4  10262  9  1.69E-04  4.55E-02  N  N  N  1q21.2
RAPGEF2  9693  8  1.73E-04  4.55E-02  N  N  N  4q32.1
TXNDC12  51060  6  1.74E-04  4.55E-02  N  N  N  1p32.3
ARHGAP39  80728  9  1.74E-04  4.55E-02  N  N  N  8q24.3
RAB30  27314  6  1.74E-04  4.55E-02  N  N  N  11q12-q14
SH2B2  10603  7  1.74E-04  4.55E-02  N  N  N  7q22
CCER1  196477  5  1.75E-04  4.55E-02  N  N  N  12q21.33
DCTPP1  79077  9  1.76E-04  4.55E-02  N  N  N  16p11.2
PAFAH1B3  5050  9  1.78E-04  4.55E-02  N  N  N  19q13.1
DGUOK  1716  7  1.87E-04  4.55E-02  N  N  N  2p13
MRPS30  10884  9  1.89E-04  4.55E-02  N  N  N  5q11
MRPS18A  55168  7  1.89E-04  4.55E-02  N  N  N  6p21.3
SHISA4  149345  9  1.90E-04  4.55E-02  N  N  N  1q32.1
C18orf61  497259  4  1.90E-04  4.55E-02  N  N  N  18p11.22
HMBS  3145  9  1.92E-04  4.55E-02  N  N  N  11q23.3
CLIC1  1192  9  1.95E-04  4.55E-02  N  N  N  6p21.3
DESI1  27351  9  1.97E-04  4.55E-02  N  N  N  22q13.2
ADCY6  112  6  2.01E-04  4.58E-02  N  N  N  12q12-q13
ASPM  259266  6  2.03E-04  4.58E-02  N  N  N  1q31
C16orf59  80178  6  2.10E-04  4.67E-02  N  N  N  16p13.3
FAM59A  64762  8  2.12E-04  4.67E-02  N  N  N  18q12.1
SEC13  6396  8  2.16E-04  4.72E-02  N  N  N  3p25-p24
LOC100130776  100130776  4  2.28E-04  4.85E-02  N  N  N  12q14.1
GABRA4  2557  6  2.28E-04  4.85E-02  Y  N  N  4p12
SPSB2  84727  6  2.30E-04  4.85E-02  N  N  N  12p13.31

Table A.5: Genes that have been shown to exhibit sexual dimorphism in blood and brain. Asterisks denote known ASD candidates.

Blood: KDM5D, ZFY, RPS4Y1, USP9Y*, EIF1AY, PRKY, UTY, DDX3Y, TXLNG2P, TMSB4Y, NOTCH2, NLGN4Y*, BCORP1, AKAP17A, SLA, NCRNA00185, RPS4X, OFD1, CSF2RA, RAB9A, ADAM19, TTTY15, RXRA, P2RY8, SH3BGRL, TTTY7, KAL1, KPNB1, CD99, MYO5A, S100G, SHOX, MAL, RBMY2FP, AP1S2*, TNFRSF1B*, RAB27A, GYG2*, TTTY10, ASMT*, ARHGDIB, CDK16, RBBP7, GPM6B, ASMTL-AS1, TBL1Y, CTPS2, EIF2S3, MCL1, PCDH11Y, ZDHHC9, TTTY12, PNPLA4, LINC00685, ZBED1, INE1, ASMTL, TMEM27, SEPT6, DDX3X, TRAPPC2, PIR*, IFITM3, EIF1AX, VAMP7, DHRSX, CA5BP1, MAOA*, MED14, FUNDC1, SELL, TTTY5, RBMY3AP, LINC00102, SPRY3, BTG3, CA5B, CSPG4P1Y, TSC22D3, ALPL, AMELY, PPP2R3B, SLC25A6, HBZ, IFNGR2, IL3RA, IL9R, CD99P1, STS, ARSD, ARSE, NHLH2, OAZ1, PLXNB3, GEMIN8, PLCXD1, TXLNG, NLGN4X*, RPS9, DUSP21, SRY, UBA1, KDM6A, XG, XIST, ZFX, CA4, GTPBP6, HDHD1, USP9X, KDM5C*, TTTY11, TTTY13, TTTY14, SYAP1, ITM2B

Brain: ITPRIP, SERPINH1, S100A10, ANXA1, TIMP1, CKS2, MSN*, EHD2, SDC2*, CEBPD, RAB27A, BAMBI, EMP3, FBLN5, TM4SF1, PDK4, LAPTM5, COL1A1, KDM5C*, LY6E, DUSP21, MFAP2, CTSC, BGN, CEP135, HIST1H2BK, HIST1H4F, HIST1H4D, NPC2, VIM, SCN9A, TTTY19, IL9R, PLXNA4, PREX1, HIST1H1A, GYPC, TYROBP, PRRX1, TET2, LIFR, KIAA1009, LGALS3BP, H2AFX, KDM6A, P2RY8, USP9X, ALOX5AP*, GEMIN8, ARHGDIB, CD99, CSMD3*, TTTY12, NEUROG2, STOM, CD53, IQGAP2, DHRSX, MAOA*, SNCAIP, ZNF423, HIST1H4B, JAG1, CBR1, GSN*, TRAPPC2, EDNRA, NLGN4X*, PLTP, SPARC, C3, GTPBP6, OFD1, CFH, CSF2RA, DHRS3, UBA1, EMP2, COL3A1, TXLNG, SRY, ASCL1, ZNF804B, PIR*, FOLR2, RAB9A, CYBB, AKAP17A, GPM6B, HYDIN, TMEM27, CALD1, EZR, CSF1R, CTGF, ZFX, TTTY10, SPRY3, LUC7L, SLC9A3R1, HIST1H2BB, ASMT*, KDM5D, PRKY, KIAA1199, WIPF1, ZDHHC9, LYVE1, TXLNG2P, ZNF337, CTPS2, KAL1, NR3C2, ISLR, FZD8, PARVA, COLEC12, MEGF10, NLGN4Y*, COL1A2, LPAR6, C3orf62, ZBED1, ZNF793, GNG11, NKTR, ARGLU1, GLO1*, SHOX, GYG2*, ACTG2, MYL9, CA5B, CSPG4P1Y, COX7A2, CRABP1, FUNDC1, CX3CR1, RIBC1, RBMY2EP, RBMY2FP, DDX3X, EIF1AX, EIF2S3, AIF1, KHDRBS2*, TSPAN15, PPP2R3B, RCOR2, BCORP1, IL3RA, ITIH2, LGALS1, LUM, STS, ARSD, ARSE, MYO5A, OGN, PCSK2, CDK16, PLXNB3, PPP1R2, PLCXD1, KIF16B, OSGEP, OLFML3, CCDC146, ACTA2, RBBP7, RNASE1, RPS4Y1, SH3BGRL, CAPRIN2, VAMP7, TSPAN6, UTY, XG, ZFY, ZIC1, S100G, CYBRD1, KCNIP4, ITIH5*, EEPD1, HDHD1, PNPLA4, USP9Y*, PCDH11Y, HIST1H4A, HIST1H4C, TTTY5, TTTY11, TTTY13, TTTY14, LSMD1, ASMTL, DDX3Y, AP1S2*, TBL1Y, EIF1AY, TMSB4Y
