Supplement.

April 11, 2008

Contents

1 Simulation: comparison of our method with the traditional linkage
2 Overlap Significance
3 Simulated Annealing
  3.1 Network Jumps
  3.2 Expectation-maximization (EM) Update versus Maximum-Likelihood Estimation of Cluster Probabilities
  3.3 Iteration Number and Temperature Schedule
4 Unlinked Data Simulation
5 Analysis Settings and Important Observations
6 Data
  6.1 Molecular Network
  6.2 Genes
  6.3 Linkage Datasets
  6.4 Marker Maps
7 Linkage Tools and Parameters

List of Figures

S1 Simulation LOD scores
S2 Simulation Results
S3 An outline of the generative model of data in our analysis, depicted as a graphical model
S4 A hypothetical family
S5 A graphical model representation of a pedigree
S6 Testing significance of the gene-specific statistic values

List of Tables

S1 Additional highly significant and suggestively significant genes
S2 Overlaps for the analyses including the X chromosome
S3 Analyses parameters
S4 Two- and three-way global overlap p-values
S5 Data sets
S6 Multipoint linkage analysis tools

1 Simulation: comparison of our method with the traditional linkage

We used our generative model to produce and analyze 100 artificial data sets. For each data set we defined a randomly chosen “true” (disease-predisposing) cluster of ten genes. We then simulated genotypes and phenotypes of individuals in 1,000 families in the following way. First, we generated 1,000 “empty” pedigree topologies by randomly choosing a real pedigree with fewer than 17 meioses (to reduce computation time) from “our” autism data. Second, for each pedigree, we sampled a gene from the predisposing gene cluster using uniform cluster probabilities (1/10), and assigned the sampled gene to the pedigree. Third, we simulated states for all genetic marker loci and for the family-specific “causal” (disease) gene for all individuals in each family. We assigned states of genetic markers for founders of each pedigree by sampling from the empirical frequency distribution estimated from the autism dataset. We assumed a single disease allele and a single wild-type allele for each “causal” gene, setting allele frequencies to 0.01 and 0.99, respectively. Given the genotypes of pedigree founders, we simulated the meiotic events leading to the rest of the pedigree, assigning allelic states to all “empty” genomes in the family. Fourth, given the known genotypes of all individuals, we simulated their phenotypes. Considering only the family-specific “causal” disease gene (and not the rest of the genes in the disease cluster), we assigned the disease phenotype to individuals with two wild-type alleles of the gene with probability 0.001. Similarly, individuals with one wild-type and one disease allele, or with two disease alleles, were assigned the disease phenotype with probability 0.8. Fifth, we modeled ascertainment by rejecting families with fewer than two affected individuals. Using these 100 artificial datasets of 1,000 families each, we compared our current approach with the conventional linkage analysis.
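The per-family sampling steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: all function names are hypothetical, the two founder alleles are assumed to be drawn independently, and the meiosis step (transmission from founders to descendants) is deliberately left out.

```python
import random

# Values taken from the text: dominant-like penetrance and allele frequency.
PENETRANCE = {0: 0.001, 1: 0.8, 2: 0.8}   # keyed by number of disease alleles
DISEASE_ALLELE_FREQ = 0.01


def assign_causal_gene(cluster, rng):
    """Second step: draw the family's causal gene uniformly from the cluster."""
    return rng.choice(cluster)


def founder_disease_alleles(rng, freq=DISEASE_ALLELE_FREQ):
    """Third step (founders only): number of disease alleles, 0 to 2.

    Assumption of this sketch: the two alleles are drawn independently.
    """
    return sum(rng.random() < freq for _ in range(2))


def simulate_phenotype(n_disease_alleles, rng):
    """Fourth step: True if the individual is assigned the disease phenotype."""
    return rng.random() < PENETRANCE[n_disease_alleles]


def is_ascertained(phenotypes):
    """Fifth step: keep only families with at least two affected individuals."""
    return sum(phenotypes) >= 2
```

A family would be simulated repeatedly until `is_ascertained` returns True, which mirrors the rejection step described in the text.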
We could not afford computationally to estimate a separate background distribution for each of the 100 simulated datasets, as we did in the analyses of the real datasets. Instead, we estimated one background distribution for all 100 samples by simulating 10 unlinked genotype sets for each of the 100 simulations, thus gathering a total of 1,000 unlinked simulations.

As expected, due to the extreme heterogeneity of the data, the conventional linkage analysis produces overwhelmingly low LOD scores, often smaller than -100 (see Figure S1 for an example LOD score curve over one of the correct genes). Such abysmal LOD scores would certainly lead to the conclusion of absence of linkage; the data would likely be discarded without reporting.

Figure S1: Simulation LOD scores. In the shown simulation the TRIT1 gene was part of the correct cluster and it was assigned to the top 10% of the families. No other gene on chromosome 1 was linked to the phenotype. A. LOD score curve produced by traditional linkage analysis. B. Per-family LOD scores.

Figure S2: Simulation Results

However, if we use gene-specific LOD scores to rank genes by their likely involvement in the disease etiology (by integrating over all possible thresholds for the LOD scores), we achieve a rather high mean AUC (area under the ROC curve) value of 0.77 [0.76, 0.78]. This suggests that even the traditional linkage analysis can be modified to extract more information under the assumption of genetic heterogeneity than it does now. When we applied our current method to prioritize the genes based on the significance of their estimated cluster probabilities, we achieved a significantly higher mean AUC value of 0.92 [0.90, 0.93].

Further, we measured how many of the 20 top-ranking genes produced by each method are true positives. As expected, our method, implementing the “true” statistical model, significantly outperformed the conventional linkage. The average numbers of true positives were 4.13 [3.79, 4.47] and 0.11 [0.05, 0.17] for our method and the conventional linkage method, respectively (see Figure S2).
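The two evaluation measures used above (AUC over a gene ranking, and the number of true positives among the 20 top-ranking genes) can be computed as in the following sketch. It assumes that each gene has a single score where higher means more likely involved; the function names are hypothetical, and the AUC is obtained through the standard rank-sum identity rather than by explicitly sweeping thresholds.

```python
def auc_from_scores(scores, is_true):
    """Area under the ROC curve for a gene ranking.

    Equals the probability that a randomly chosen true gene scores higher
    than a randomly chosen false gene (ties count as one half).
    """
    pos = [s for s, t in zip(scores, is_true) if t]
    neg = [s for s, t in zip(scores, is_true) if not t]
    if not pos or not neg:
        raise ValueError("need at least one true and one false gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def true_positives_in_top(scores, is_true, top_n=20):
    """Number of true genes among the top_n highest-scoring genes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in order[:top_n] if is_true[i])
```

For a ranking based on p-values rather than scores, the negated p-value can be passed as the score.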

[Figure S3 diagram. Node labels: Molecular network; Map (gene and marker positions in the genome); Penetrance parameters; Pedigree topology; Allelic frequencies for genes and markers; Gene cluster of c genes; Gene-specific cluster probability; Pedigree-specific genotypes (marker alleles, gene alleles); Pedigree-specific predisposing gene; Observed phenotype; Observed marker alleles. Legend: Assumption; Dependence; Parameter; Data or variable.]

Assumptions shown in the diagram:
- We assume that every gene in the disease-predisposing gene cluster C has one normal and one disease-predisposing allele; the frequency of the disease-predisposing allele is the same for all genes in the cluster C.
- Penetrance parameters are common for all genes in the gene cluster C.
- All markers and genes are arranged according to a sex-averaged genetic map.
- Uniform prior distribution over connected components of size c.
- No errors in genotyping.

Figure S3: An outline of the generative model of data in our analysis, depicted as a graphical model.

[Figure S4 diagram: a pedigree drawing with individuals numbered 1 to 15.]

Figure S4: A hypothetical family. Each pedigree (family) described in our data is provided with a kinship structure and phenotypic information. Circles and squares represent female and male individuals, respectively. By convention, the filled shapes represent affected (sick) individuals; the open shapes represent unaffected (healthy) ones.

[Figure S5 diagram: genotype nodes g1 to g15, each attached to an observed phenotype value (0 or 1), connected through haplotype nodes (h1, h2, h3,1, h3,2, ..., h13,2).]

Figure S5: The same pedigree as that shown in Figure S4, represented as a graphical model. We can directly observe phenotypes (Φ0 and Φ1 are phenotypes of unaffected and affected individuals, respectively). We can think of the phenotypes as emitted messages produced by the hidden (not directly observable) genotypes, gi. Variables hkl represent haplotypes that are passed from parents to their children.

[Figure S6 flowchart. Recoverable content:]

Real-data branch: original data (pedigrees, disease phenotypes, marker map, marker genotypes) -> real chromosome LOD scores (LOD score for every position on every chromosome) -> real gene LOD score matrix (LOD score for every gene and every pedigree) -> Bootstrap Loop -> real-data gene-specific statistic value.

Null-model branch (repeat K times): simulate the k-th set of disease-unlinked genotypes -> k-th simulated set of chromosome LOD scores -> k-th simulated gene LOD score matrix (LOD score for every gene and every pedigree) -> Bootstrap Loop -> k-th simulation-based gene-specific statistic value.

For every gene: the K simulation-based gene-specific statistic values form the null-model distribution against which the real-data value is compared to obtain a p-value.

Bootstrap Loop:
  Input: gene LOD score matrix [LOD scores for G genes and F pedigrees]
  Set all gene statistic counts to 0
  Repeat B times
    Sample with replacement F pedigrees
    Find the cluster C with the highest likelihood (via simulated annealing)
    Update statistic values for all genes
  Output: gene-specific statistic values

Figure S6: Testing significance of the gene-specific statistic values. See the main text of the paper for discussion and explanation.
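The final step of the procedure in Figure S6 (turning K simulation-based statistic values into a per-gene p-value) might look as follows. This sketch assumes that the p-value is the fraction of simulated values that are at least as large as the real-data value, which is consistent with the p-values of 0 and the 0.001 resolution reported for 1,000 simulations; the function names and the dictionary layout are hypothetical.

```python
def empirical_p_value(real_value, simulated_values):
    """Fraction of null (simulated) statistic values >= the real-data value."""
    k = len(simulated_values)
    if k == 0:
        raise ValueError("at least one simulated value is required")
    hits = sum(1 for s in simulated_values if s >= real_value)
    return hits / k


def gene_p_values(real_stats, simulated_stats):
    """Per-gene p-values.

    real_stats:      {gene: real-data statistic value}
    simulated_stats: list of K dicts, one per simulated unlinked data set,
                     each mapping gene -> simulation-based statistic value
    """
    return {
        gene: empirical_p_value(value, [sim[gene] for sim in simulated_stats])
        for gene, value in real_stats.items()
    }
```

With K = 1,000 simulations the smallest non-zero value this estimator can return is 0.001.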

GeneID  Symbol    Location       Max p-value  Sum p-value  Gene name

autism-no-x
10913   EDAR      2q11-q13       <0.000       <0.000       ectodysplasin A receptor
3991    LIPE      19q13.2        <0.000       <0.000       lipase, hormone-sensitive
889     KRIT1     7q21-q22       <0.000       0.001        ankyrin repeat containing

autism-x
2067    ERCC1     19q13.2-q13.3  <0.000       <0.000       excision repair
10658   CUGBP1    11p11          <0.000       0.001        CUG triplet repeat, RNA binding
7486    WRN       8p12-p11.2     <0.000       0.003        Werner syndrome
10018   BCL2L11   2q13           <0.000       0.009        BCL2-like 11 (apoptosis facilitator)
1747    DLX3      17q21          <0.000       0.013        distal-less homeo box 3
11030   RBPMS     8p12-p11       0.002        <0.000       RNA binding protein
23209   MLC1      22q13.33       0.002        <0.000       megalencephalic leukoencephalopathy with subcortical cysts 1

autism-x-dom-rec
10893   MMP24     20q11.2        <0.000       <0.000       matrix metallopeptidase 24
1992    SERPINB1  6p25           <0.000       <0.000       serpin peptidase inhibitor
5903    RANBP2    2q12.3         <0.000       0.007        RAN binding protein 2
8205    TAM*      21q11.2        0.004        <0.000       myeloproliferative syndrome, transient

bipolar-x
1437    CSF2      5q31.1         0.001        <0.000       colony stimulating factor 2
1604    CD55      1q32           0.005        <0.000       decay accelerating factor for complement
1869    E2F1      20q11.2        0.002        <0.000       E2F transcription factor 1
3586    IL10      1q31-q32       0.001        <0.000       interleukin 10
5075    PAX1      20p11.2        0.009        <0.000       paired box gene 1
6696    SPP1      4q21-q25       0.003        <0.000       secreted phosphoprotein 1
10893   MMP24     20q11.2        0.002        0.001        matrix metallopeptidase 24
22915   MMRN1     4q22           0.018        0.001        multimerin 1
23463   ICMT      1p36.21        0.001        0.001        isoprenylcysteine carboxyl methyltransferase
5335    PLCG1     20q12-q13.1    0.002        0.001        phospholipase C, gamma 1
6364    CCL20     2q33-q37       0.017        0.001        chemokine (C-C motif) ligand 20
9261    MAPKAPK2  1q32           0.001        0.001        MAPK-activated protein kinase 2
10544   PROCR     20q11.2        0.002        0.002        protein C receptor, endothelial (EPCR)
1380    CR2       1q32           0.004        0.002        complement component receptor 2
4092    SMAD7     18q21.1        0.001        0.002        mothers against DPP
5707    PSMD1     2q37.1         0.015        0.002        proteasome 26S subunit 1
5783    PTPN13    4q21.3         0.013        0.002        protein tyrosine phosphatase
7027    TFDP1     13q34          0.010        0.002        transcription factor Dp-1

schizophrenia-x
10783   NEK6      9q33.3-q34.11  <0.000       <0.000       (never in mitosis gene a)-related kinase
901     CCNG2     4q21.1         <0.000       <0.000       cyclin G2
2801    GOLGA2    9q34.11        <0.000       0.001        golgi autoantigen, golgin subfamily a, 2
4948    OCA2      15q11.2-q12    <0.000       0.001        oculocutaneous albinism II
6638    SNRPN     15q11.2        <0.000       0.001        small nuclear ribonucleoprotein
2057    EPOR      19p13.3-p13.2  <0.000       0.005        erythropoietin receptor
55571   C2orf29   2q11.2         <0.000       0.006        chromosome 2 open reading frame 29
5903    RANBP2    2q12.3         <0.000       0.009        RAN binding protein 2
10474   TADA3L    3p25.3         0.001        0.001        transcriptional adaptor 3
8178    ELL       19p13.1        0.001        0.001        elongation factor RNA polymerase II
467     ATF3      1q32.3         0.001        0.002        activating transcription factor 3
4217    MAP3K5    6q22.33        0.001        0.007        MAPK kinase kinase 5
6054    RNR3      15p12          0.003        0.001        RNA, ribosomal 3
2104    ESRRG     1q41           0.004        <0.000       estrogen-related receptor gamma
238     ALK       2p23           0.006        0.001        anaplastic lymphoma kinase (Ki-1)

Table S1: Additional highly significant and suggestively significant genes. Note that the TAM gene (marked with *) is not yet approved by HUGO (see the genes not in HUGO.xls file).

Autism and bipolar disorder
GeneID  Symbol    Location       Autism   Bipolar  Combined  Gene name
22915   MMRN1     4q22           0.024    0.001    0.0002    multimerin 1
3381    IBSP      4q21-q25       0.041    0.004    0.0010    integrin-binding sialoprotein
5602    MAPK10    4q22.1-q23     0.078    0.003    0.0016    MAPK 10
5734    PTGER4    5p13.1         0.014    0.038    0.0025    prostaglandin E receptor
4907    NT5E      6q14-q21       0.043    0.014    0.0028    5'-nucleotidase, ecto (CD73)
598     BCL2L1    20q11.21       0.048    0.018    0.0039    BCL2-like 1
8027    STAM      10p14-p13      0.022    0.052    0.0050    signal transducing adaptor
9584    RBM39     20q11.22       0.039    0.035    0.0057    RNA-binding motif protein 39
3055    HCK       20q11-q12      0.048    0.033    0.0066    hemopoietic cell kinase
10611   PDLIM5    4q22           0.071    0.024    0.0074    PDZ and LIM domain 5
6714    SRC       20q12-q13      0.026    0.091    0.0102    v-src sarcoma
2690    GHR       5p13-p12       0.076    0.066    0.0189    growth hormone receptor
5933    RBL1      20q11.2        0.092    0.084    0.0278    retinoblastoma-like 1 (p107)

Autism and schizophrenia
GeneID  Symbol    Location       Autism   Schiz.   Combined  Gene name
30818   KCNIP3    2q21.1         0.010    0.069    0.0035    calsenilin
9994    CASP8AP2  6q15           0.080    0.049    0.0153    CASP8 associated protein 2
171425  CLYBL     13q32.3        0.039    0.056    0.0089    citrate lyase beta like
266710  COMA*     2q13           0.058    0.027    0.0067    congenital oculomotor apraxia
2274    FHL2      2q12-q14       0.015    0.002    0.0002    four and a half LIM domains 2
5602    MAPK10    4q22.1-q23     0.078    0.043    0.0133    MAPK 10
8440    NCK2      2q12           0.002    0.013    0.0002    NCK adaptor protein 2
4867    NPHP1     2q13           0.015    0.054    0.0038    nephronophthisis 1 (juvenile)
5734    PTGER4    5p13.1         0.014    0.025    0.0016    prostaglandin E receptor
5903    RANBP2    2q12.3         0.020    0.009    0.0009    RAN binding protein 2
8027    STAM      10p14-p13      0.022    0.035    0.0034    signal transducing adaptor
7408    VASP      19q13.2-q13.3  0.076    0.046    0.0137    vasodilator-stimulated phosphoprotein

Bipolar disorder and schizophrenia
GeneID  Symbol    Location       Bipolar  Schiz.   Combined  Gene name
5602    MAPK10    4q22.1-q23     0.003    0.043    0.0008    MAPK 10
1380    CR2       1q32           0.002    0.075    0.0011    complement component receptor
2745    GLRX      5q14           0.017    0.033    0.0026    glutaredoxin (thioltransferase)
5734    PTGER4    5p13.1         0.038    0.025    0.0041    prostaglandin E receptor
3290    HSD11B1   1q32-q41       0.034    0.040    0.0057    hydroxysteroid dehydrogenase 1
8027    STAM      10p14-p13      0.052    0.035    0.0075    signal transducing adaptor
10769   PLK2      5q12.1-q13.2   0.051    0.038    0.0080    polo-like kinase 2 (Drosophila)
5900    RALGDS    9q34.3         0.039    0.050    0.0080    ral guanine nucleotide dissociation stimulator
23114   NFASC     1q32.1         0.025    0.095    0.0104    neurofascin homolog (chicken)
2159    F10       13q34          0.093    0.035    0.0133    coagulation factor X
26765   SNORD12C  20q13.13       0.061    0.094    0.0215    small nucleolar RNA

Bipolar disorder, schizophrenia and autism
GeneID  Symbol    Location       Autism   Schiz.   Bipolar  Combined  Gene name
5602    MAPK10    4q22.1-q23     0.078    0.043    0.003    0.0003    MAPK 10
5734    PTGER4    5p13.1         0.014    0.025    0.038    0.0003    prostaglandin E receptor
8027    STAM      10p14-p13      0.022    0.035    0.052    0.0008    signal transducing adaptor

Table S2: Overlaps for the analyses including the X chromosome: significant overlaps between suggestively linked genes for pairs and the triplet of disorders. The numeric columns are p-values. Note that the COMA gene (marked with *) is not yet approved by HUGO (see the genes not in HUGO.xls file).

2 Overlap Significance

Our lists of candidate genes contain a number of genes that overlap between two or three of the analyzed disorders. We measure the significance of this apparent overlap in two distinct ways. One approach (local overlap) involves assigning each gene a two-disorder or three-disorder-specific overlap p-value. Another approach (global overlap) involves estimating overlap significance related to the total number of overlapping genes, regardless of their identity. To compute the global overlap p-value we use 1,000 simulated phenotype-unlinked data sets per disorder.

We compute the local (gene-specific) overlap p-value by applying the Z-method (see Table 2 in the main text, Table S2, and [1]). For example, to gene MAPK10, which has disorder-specific p-values of 0.067, 0.019, and 0.077 for autism, bipolar disorder, and schizophrenia (no-x analyses), respectively, we assign a three-way local overlap probability of 0.00516 with the Z-method. While computing the local overlap p-values, we substitute the zero estimates of the disorder-specific p-values with 0.0005 (half of the smallest positive p-value that can be estimated in 1,000 data simulations); otherwise each gene that has a zero estimate of the p-value for at least one disorder would also have a zero estimate of the local overlap probability regardless of the p-value estimates for the rest of the disorders.

We measure the significance of the two-way global overlap in the following way. We estimate the distribution of the number of overlapping genes by computing the random overlap between all pairs of simulated data sets for the two disorders. For every simulated data set we estimate gene-specific p-values by using the other 999 disorder-specific simulated datasets to build a background distribution. A gene is included in the overlap between two disorders if both of its disorder-specific p-values are smaller than a predefined threshold. Table S4 shows the overlap significance with a threshold of 0.1.
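The two-way global overlap test just described can be sketched as follows. The sketch takes the gene-specific p-values of each simulated data set as given (the leave-one-out background step is not shown) and assumes that the global p-value is the fraction of simulated pairs whose overlap is at least as large as the observed one. Function and argument names are hypothetical.

```python
from itertools import product


def overlap_size(p_a, p_b, threshold=0.1):
    """Number of genes whose p-values are below the threshold in both disorders."""
    return sum(
        1 for gene, p in p_a.items()
        if p < threshold and p_b.get(gene, 1.0) < threshold
    )


def global_overlap_test(observed, sims_a, sims_b, threshold=0.1):
    """Expected random overlap and global overlap p-value.

    sims_a, sims_b: lists of {gene: p-value} dicts, one per simulated
    unlinked data set of disorder A and disorder B, respectively.
    """
    counts = [overlap_size(a, b, threshold) for a, b in product(sims_a, sims_b)]
    expected = sum(counts) / len(counts)
    p_value = sum(1 for c in counts if c >= observed) / len(counts)
    return expected, p_value
```

With 1,000 simulated data sets per disorder, `product` enumerates 1,000,000 pairs, so a production version would likely precompute the set of sub-threshold genes for each simulation.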
The three-way overlap is measured analogously using 10,000 randomly selected triplets of simulated data sets.

Note that the two types of tests (local and global) measure the significance of overlap under different null models and therefore produce different results. The local overlap p-value for a specific gene measures how likely it is that a gene that is unlinked to any of the disorders considered will have a signal (gene-specific statistic) as strong as or stronger than the actual values of the gene-specific statistics for each of the disorders considered. The global overlap p-value evaluates the probability of observing a spurious overlap of k genes (unlinked to any of the disorders) between two or three disorders, averaged over all possible overlapping sets of genes of the same cardinality, k.
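For the local overlap p-value, a minimal sketch of the Z-method with equal weights is given below. Equal weighting is an assumption of this sketch; it appears consistent with the combined values listed in Table S2 (for example, 0.078, 0.043, and 0.003 combine to roughly 0.0003). The function name and the `zero_substitute` parameter are hypothetical; the value 0.0005 is the substitution described in the text.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_method(p_values, zero_substitute=0.0005):
    """Combine one-sided p-values with the unweighted Z-method.

    Each p-value is converted to a z-score, the z-scores are summed and
    divided by sqrt(n), and the result is converted back to a p-value.
    Zero estimates are replaced by zero_substitute first.
    Inputs must be strictly smaller than 1.
    """
    z_scores = []
    for p in p_values:
        if p <= 0.0:
            p = zero_substitute
        z_scores.append(_STD_NORMAL.inv_cdf(1.0 - p))
    z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - _STD_NORMAL.cdf(z)
```

Without the substitution, a single zero estimate would map to an infinite z-score and force the combined p-value to zero, which is the problem the text describes.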

3 Simulated Annealing

We implemented a version of the simulated annealing technique [2] to identify those clusters that have the maximum LOD score. In a nutshell, simulated annealing is a random walk through the space of network clusters (connected subgraphs with c nodes) in which a new cluster is proposed at every step; the proposal is accepted if its LOD score is higher than that of the current cluster. If its LOD score is lower, the new cluster is accepted with a probability that depends on two quantities: a parameter called temperature and the difference between the LOD scores of the newly proposed and the current clusters. The temperature of the annealing process decreases through the annealing run. At the beginning the temperature is high and clusters with lower (worse) LOD scores are accepted readily; towards the end of the annealing run the temperature gets lower and, in practice, only clusters with higher (better) LOD scores are accepted.

3.1 Network Jumps

In our implementation, a new cluster is proposed by randomly removing a gene from the current cluster, then randomly adding a new gene while making sure that the genes in the new reconstituted cluster remain connected.
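One way to realize such a jump is sketched below. The details are assumptions of this sketch rather than a description of the actual implementation: the network is a hypothetical adjacency dictionary, candidate genes are restricted to network neighbours of the remaining genes, and proposals that break connectivity are simply redrawn.

```python
import random
from collections import deque


def is_connected(nodes, adjacency):
    """True if the given genes form a connected subgraph of the network."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search restricted to `nodes`
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v in nodes and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes


def propose_cluster(cluster, adjacency, rng, max_tries=1000):
    """Swap one gene out and one gene in, keeping the cluster connected."""
    cluster = set(cluster)
    for _ in range(max_tries):
        removed = rng.choice(sorted(cluster))
        remaining = cluster - {removed}
        # Candidate additions: neighbours of the remaining genes.
        candidates = sorted(
            {v for u in remaining for v in adjacency.get(u, ())} - cluster
        )
        if not candidates:
            continue
        proposal = remaining | {rng.choice(candidates)}
        if is_connected(proposal, adjacency):
            return proposal
    return cluster  # no valid jump found; stay at the current cluster
```

The cluster size c is preserved by construction, since exactly one gene leaves and one gene enters.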

3.2 Expectation-maximization (EM) Update versus Maximum-Likelihood Estimation of Cluster Probabilities

We identify the optimum LOD score of a cluster by maximizing that score with respect to the cluster probability parameters, the results being the maximum-likelihood estimates of the parameter values. To perform this estimation with computational efficiency, we use the expectation-maximization (EM) algorithm [3]. The basic idea of EM is that we start with an arbitrary guess of all parameter values (we use uniform cluster probabilities as the initial guess), then update estimates iteratively until we see convergence to a stable set of parameter values. We use the following update equation:

$$\hat{p}_i^{(j+1)} = \frac{1}{F} \sum_{f=1}^{F} \frac{\hat{p}_i^{(j)} \, 10^{\mathrm{LOD}_f(\mathrm{gene}_i)}}{\sum_{k} \hat{p}_k^{(j)} \, 10^{\mathrm{LOD}_f(\mathrm{gene}_k)}}, \tag{1}$$

where F is the number of pedigrees described by our data set and k iterates over all genes in the current gene cluster. This sequence converges to the maximum-likelihood estimates of the p_i monotonically and, in our case, the maximum always exists and is unique.

To decrease the computational cost of simulated annealing, we split the annealing iterations into two consecutive steps. In the first (hotter) step we use the cluster probabilities obtained after only one EM update, starting from uniform cluster probabilities. In the second (colder) step, we use the cluster probabilities after EM has converged, which can take several hundred iterations. Our motivation for this shortcut is the observation that a strong positive and statistically significant correlation exists between the cluster LOD scores computed with the rigorous maximum-likelihood computation and with one-update EM (data not shown). The crude one-step EM updating allows the algorithm to jump faster at the beginning of the search, and the full EM updating gives more precise parameter estimates when the search converges.
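The EM update of equation (1) translates directly into code. The sketch below is an illustration under stated assumptions: `lod[f][i]` is taken to hold the LOD score of gene i in pedigree f, the function names and the convergence tolerance are hypothetical, and the largest LOD score of each pedigree is subtracted before exponentiation purely for numerical stability (this cancels in the ratio and does not change the update).

```python
def em_update(p, lod):
    """One EM update of the cluster probabilities, as in equation (1).

    p:   current estimates, one probability per gene in the cluster
    lod: lod[f][i] is the LOD score of gene i in pedigree f
    """
    n_pedigrees = len(lod)
    new_p = [0.0] * len(p)
    for row in lod:
        shift = max(row)  # cancels in the ratio; avoids overflow/underflow
        weights = [p_k * 10.0 ** (l - shift) for p_k, l in zip(p, row)]
        total = sum(weights)
        for i, w in enumerate(weights):
            new_p[i] += w / total
    return [x / n_pedigrees for x in new_p]


def em_fit(lod, tol=1e-9, max_iter=10000):
    """Iterate em_update from uniform probabilities until the change is small."""
    n_genes = len(lod[0])
    p = [1.0 / n_genes] * n_genes
    for _ in range(max_iter):
        new_p = em_update(p, lod)
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            return new_p
        p = new_p
    return p
```

The one-update shortcut used in the hotter annealing step corresponds to a single call of `em_update` with uniform `p`, while the colder step corresponds to `em_fit`.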

3.3 Iteration Number and Temperature Schedule

We use a total of 5,000 annealing iterations for the gene-specific significance experiments and 20,000 runs of 10,000 annealing iterations to identify the best clusters for the real data. In every case, the final 100 iterations of the annealing run are based on the maximum-likelihood estimates of the cluster probabilities. We use the following probability of accepting a cluster with a smaller LOD score:

$$p_{\mathrm{accept}} = \min\left[1,\; 10^{(\mathrm{LOD}_{\mathrm{new}} - \mathrm{LOD}_{\mathrm{old}})/T}\right], \tag{2}$$

where T is the temperature parameter. Our initial temperature is T = 10; after every 10% of the iterations, we decrease the temperature by a factor of 0.4.
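The acceptance rule of equation (2) and the stepwise cooling schedule can be sketched as follows. Two points are assumptions of this sketch: "decrease by a factor of 0.4" is read as multiplying the temperature by 0.4 at each step, and the `propose` and `score` callables are hypothetical stand-ins for the network jump and the cluster LOD score.

```python
import random


def acceptance_probability(lod_new, lod_old, temperature):
    """Equation (2): always accept improvements, sometimes accept worse clusters."""
    if lod_new >= lod_old:
        return 1.0
    return 10.0 ** ((lod_new - lod_old) / temperature)


def temperature_at(iteration, n_iterations, t0=10.0, factor=0.4):
    """Temperature after `iteration` steps: multiplied by `factor` every 10%."""
    stage = (10 * iteration) // n_iterations  # 0..9
    return t0 * factor ** stage


def anneal(initial, propose, score, n_iterations, rng):
    """Generic annealing loop that returns the best state visited."""
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    for it in range(n_iterations):
        t = temperature_at(it, n_iterations)
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        if rng.random() < acceptance_probability(candidate_score, current_score, t):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score
```

Under this reading, a run of 5,000 iterations starts at T = 10 and ends at T = 10 x 0.4^9, or about 0.0026, at which point a drop of even 0.01 LOD units is accepted with probability below 0.001.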

4 Unlinked Data Simulation

In the simulation of the unlinked genotypes, we preserve the structure of the pedigrees while assuming that the phenotypes (the diagnoses of the individuals) and the state of the unobserved markers are unknown. The simulation starts by assigning marker alleles to the markers of the founder individuals in the family by sampling from the given marker-allele frequencies independently for each marker. Next, for every child we simulate the two meioses of the child’s two parents. For each meiosis, we randomly choose to have or not have a recombination between each pair of adjacent markers, based on the transmission probability determined from the distance between the markers on the marker map and the chosen map function. The recombination status for every interval, taken together with the two parental chromosomes, uniquely determines the chromosome inherited by the child. We used the SIMULATE tool [4] to perform this process.
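The gene-dropping procedure described above can be illustrated with a short sketch. It is not the SIMULATE tool and makes its own assumptions: the Haldane map function is used to convert distances into recombination fractions, haplotypes are plain lists of alleles ordered along the chromosome, and all names are hypothetical.

```python
import math
import random


def haldane_recombination_fraction(distance_cm):
    """Recombination fraction for a map distance in Haldane centimorgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def sample_founder_haplotype(allele_freqs, rng):
    """Draw one allele per marker, independently, from {allele: frequency} dicts."""
    return [
        rng.choices(list(freqs), weights=list(freqs.values()))[0]
        for freqs in allele_freqs
    ]


def simulate_meiosis(hap_a, hap_b, positions_cm, rng):
    """Chromosome transmitted by a parent carrying haplotypes hap_a and hap_b."""
    parental = (hap_a, hap_b)
    current = rng.randrange(2)  # which parental chromosome we start copying
    child = [parental[current][0]]
    for m in range(1, len(positions_cm)):
        theta = haldane_recombination_fraction(positions_cm[m] - positions_cm[m - 1])
        if rng.random() < theta:  # recombination in this interval
            current = 1 - current
        child.append(parental[current][m])
    return child
```

A child's genotype is obtained by calling `simulate_meiosis` once for each parent; because phenotypes play no role, the resulting genotypes are unlinked to the disease by construction.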

5 Analysis Settings and Important Observations

Table S3 describes the full range of analyses performed and their parameters. The three parameters that varied across the different analyses were: i) whether or not to include the X chromosome in the analysis; ii) the penetrance model; and iii) the number of simulations used to build the background distributions of the gene-specific statistics.

Analysis             Number of Simulations  X chr  Penetrance Model  Number of Families
autism-no-x          1000                   NO     dom               336
autism-x             1000                   YES    dom               336
autism-x-dom-rec     1000                   YES    dom-rec           336
autism-x-rec         10000                  YES    rec               336
bipolar-no-x (A)     1000                   NO     dom               414
bipolar-x (A)        1000                   YES    dom               295
schizophrenia-no-x   1000                   NO     dom               87
schizophrenia-x      1000                   YES    dom               87

Table S3: Analyses parameters. A. Note that the bipolar analyses use jointly the families from the three studies b1, b2 and nimh-wave4.

Many of the families in the bipolar disorder dataset were not genotyped for markers on the X chromosome (see Table S5), which led us to perform two sets of analyses. In the first set of experiments (autism-x, bipolar-x, schizophrenia-x) we used the X chromosome markers and genotypes at the expense of losing some of the bipolar families. In the second set of experiments (autism-no-x, bipolar-no-x, schizophrenia-no-x) we used all of the families but excluded the X chromosome. The obvious benefit of the former analysis is being able to use a larger and thus more informative molecular network (with the X chromosome genes); the benefit of the latter is the larger number of families that can be included in the analysis.

When performing 1,000 simulations of unlinked datasets, the smallest possible non-zero p-value is 0.001 and typically we observe a few genes with p-values of 0. To check the robustness of the 1,000-simulation significance estimates, we performed one computationally very expensive 10,000-simulation analysis (autism-x-rec). None of the genes (with the exception of the SFRP1 gene, which had a 10,000-simulation p-value of 0 according to the MAX statistic) turned out to have higher significance. Thus, we interpret p-values of 0 in our analyses of 1,000 simulations as being close to 0.0005.

We analyzed the autism dataset under three different penetrance models. Although the results of the three analyses (autism-x, autism-x-rec, and autism-x-dom-rec) were not identical, they overlapped significantly (see Table S4). The observation that changes in the penetrance model do not lead to major changes in the results is important because the use of the penetrance model is typically deemed to be a weak point of the parametric linkage analysis framework.

We use a commonly applied dominant-like (dom) penetrance model for most of the analyses: we set the frequency of the disease allele to 0.01 and the penetrance parameters to 0.001 for two wild-type alleles, 0.8 for one wild-type and one disease allele, and 0.8 for two disease alleles. For the autism-x-rec analysis we use a recessive-like (rec) penetrance model; we set the frequency of the disease allele to 0.1 and the penetrance parameters to 0.001, 0.001, and 0.8 for two, one, and no wild-type alleles, respectively. These models are the same ones used in recent analyses of bipolar disorder data [5, 6]. For the autism-x-dom-rec analysis we use a mixture (dom-rec) penetrance model with a dominant-like model for the autosomal chromosomes and a recessive-like model for the X chromosome.

6 Data

6.1 Molecular Network

The molecular network that we use in our analysis is a human-specific subset of the GeneWays 6.0 database [7]. This database was compiled through automated text mining of nearly 250,000 full-text articles from 78 leading biomedical journals. We remove all non-human-specific interactions; of the remaining interactions we use only those that are direct (such as bind and phosphorylate). Furthermore, we use only those molecular interactions for which all names of the involved genes or proteins are unambiguously mapped to human genes. To identify genes uniquely, we use the GeneIDs defined by the National Center for Biotechnology Information (NCBI). We map the gene names using synonym lists provided by the NCBI. In addition, we use only those genes for which we can identify physical coordinates in the human genome.

We use two different networks depending on whether the X chromosome is included in the analysis. Our resulting molecular-interaction network for the analyses without the X chromosome comprises 4,374 genes and 13,612 interactions, including 1,120 self-interactions. It contains one large and several small connected components. We use only the large connected component, which includes 3,829 genes and 12,446 interactions (of the latter, 985 are self-interactions, which we exclude from analysis). The largest connected component of the network for the analyses including the X chromosome has 4,019 genes and 13,256 interactions, including 1,025 self-interactions.

6.2 Genes

We integrated the NCBI Gene [8] and the UCSC Genome Browser [9, 10] databases. We used the GeneIDs, gene symbols, and gene synonyms from the NCBI Gene database and the physical coordinates from the UCSC database.

6.3 Linkage Datasets

For autism, bipolar disorder, and schizophrenia, we analyzed 336, 434, and 87 affected families, respectively, with 334/11, 384/16, and 473/16 microsatellite markers (a forward slash separates the numbers of autosomal and X-chromosomal markers used for each analysis; see Table S5).

6.4 Marker Maps

We obtained marker maps from the Marshfield human genetic map [11]. The transformation from Kosambi to Haldane centimorgan (cM) maps was performed by first joining the markers of all analyzed studies into a combined Kosambi marker map, then transforming the interval between every pair of neighboring markers from Kosambi cM distance to recombination fraction and then to Haldane cM distance using the corresponding mapping functions.
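The interval-by-interval conversion described above relies on the standard Kosambi and Haldane map functions. The sketch below shows one way to apply them; the function names are hypothetical, and anchoring the converted map at the position of the first marker is an assumption of this sketch.

```python
import math


def kosambi_cm_to_theta(distance_cm):
    """Recombination fraction for a Kosambi map distance in cM."""
    return 0.5 * math.tanh(2.0 * distance_cm / 100.0)


def theta_to_haldane_cm(theta):
    """Haldane map distance in cM for a recombination fraction below 0.5."""
    return -50.0 * math.log(1.0 - 2.0 * theta)


def kosambi_map_to_haldane(positions_cm):
    """Convert sorted Kosambi marker positions to Haldane positions.

    Each interval between neighboring markers is converted separately and
    the converted intervals are accumulated from the first marker onward.
    """
    converted = [positions_cm[0]]
    for left, right in zip(positions_cm, positions_cm[1:]):
        theta = kosambi_cm_to_theta(right - left)
        converted.append(converted[-1] + theta_to_haldane_cm(theta))
    return converted
```

For example, a 10 cM Kosambi interval corresponds to a recombination fraction of about 0.0987 and therefore to about 10.99 Haldane cM; converting interval by interval matters because the Haldane distance grows faster than the Kosambi distance as intervals get longer.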

Disorders                               Observed Overlap  Expected Overlap  Overlap p-value

no-x analyses, cutoff: 0.1
Autism–bipolar disorder                 17                12.357            0.1924
Autism–schizophrenia                    15                10.972            0.2155
Bipolar disorder–schizophrenia          13                11.252            0.3643
Autism–bipolar disorder–schizophrenia   3                 1.007             0.1106

x analyses, cutoff: 0.1
Autism–bipolar disorder                 14                12.527            0.3888
Autism–schizophrenia                    12                10.854            0.4087
Bipolar disorder–schizophrenia          12                11.089            0.4087
Autism–bipolar disorder–schizophrenia   3                 0.990             0.1080

penetrance model autism analyses, cutoff: 0.1
dom–rec                                 37                12.645            0.0001
dom–dom-rec                             147               32.891            0.0000
rec–dom-rec                             93                33.996            0.0000
dom–rec–dom-rec                         61                5.837             0.0000

Table S4: Two- and three-way global overlap p-values.

Study                             Phenotype Model              Number of Families  Number of Markers

Bipolar Disorder
b1 [6]                            BP3 (A)                      27/20 (B)           332+11 (E)
b2 [5]                            BP3 (A)                      112/0 (B)           382+0 (E)
nimh-wave4 [12, 13, 14, 15, 16]   BP3 (A)                      275                 384+16 (E)

Autism
a1 [17]                           Autism Inclusive (C)         336                 334+11 (E)

Schizophrenia
nimh-sz8                          Schizophrenia Inclusive (D)  87                  473+16 (E)

Table S5: Data sets. A. BP3: major psychiatric disorder characterized by mania (BPI) or hypomania (BPII) alternating with periods of depression (schizoaffective disorder manic type) or recurrent major depressive disorder (MDD) (including some cases of recurrent schizoaffective, mainly affective-depressed only). B. The number of families used in the analysis without the X chromosome/the number of families used in the analysis with the X chromosome. C. Autism + PDD (pervasive developmental disorders) + Asperger syndrome. D. SZ: schizophrenia; SADD: schizoaffective disorder depressed type; NSPECT: schizotypal psychotic (PD) or nonaffective psychotic disorder or mood-incongruent psychotic disorder; BSPECT: schizoid PD or paranoid PD or mood-congruent psychotic depressive disorder or “unknown psychotic disorder” with or without psychiatric hospitalization; SADB: schizoaffective disorder bipolar type. E. Number of autosomal markers + number of markers on chromosome X.

7 Linkage Tools and Parameters

We compute 10 per-family LOD values between every pair of adjacent genetic markers and 10 measurements spanning 30 cM before the first and after the last genetic marker on each chromosome. With the MORGAN program, we use 50,000 burn-in, 30,000 preliminary (for building the pseudo-priors), and 20,000 MC iterations. We also perform 10,000 sequential imputation realizations. The LOD scores are computed every 10 iterations. The L-sampler probability is set to 0.5, and the sequential imputation is proposed every 100 iterations (see MORGAN User Manual).

Precise likelihood computation
  Elston-Stewart (Peeling) algorithm [19]: LINKAGE (a) [18]; FASTLINK 4.1P (a) [20]
  Lander-Green algorithm [22]: GeneHunter 2.1 (b) [21]; Merlin 1.0.1 (b) [23]; Allegro 2.0f (b) [25]

Estimation via MCMC
  Morgan 2.7 [24]

(a) Unable to handle large number of markers. (b) Unable to handle large pedigrees.

Table S6: Multipoint linkage analysis tools.

References

[1] Whitlock MC (2005) Combining probability from independent tests: the weighted Z-method is superior to Fisher’s approach. J Evol Biol 18:1368–73.

[2] Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220:671–680.

[3] Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39:1–38.

[4] Terwilliger JD, Speer M, Ott J (1993) Chromosome-based method for rapid computer simulation in human genetic linkage analysis. Genet Epidemiol 10:217–224.

[5] Cheng R, Juo SH, Loth JE, Nee J, Iossifov I, et al. (2006) Genome-wide linkage scan in a large bipolar disorder sample from the National Institute of Mental Health genetics initiative suggests putative loci for bipolar disorder, psychosis, suicide, and panic disorder. Mol Psychiatry 11:252–260.

[6] Park N, Juo SH, Cheng R, Liu J, Loth JE, et al. (2004) Linkage analysis of psychosis in bipolar pedigrees suggests novel putative loci for bipolar disorder and shared susceptibility with schizophrenia. Mol Psychiatry 9:1091–1099.

[7] Rzhetsky A, Iossifov I, Koike T, Krauthammer M, Kra P, et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43–53.

[8] Maglott D, Ostell J, Pruitt KD, Tatusova T (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 33:D54–D58.

[9] Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, et al. (2006) The UCSC Genome Browser Database: update 2006. Nucleic Acids Res 34:D590–D598.

[10] Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, et al. (2003) The UCSC Genome Browser Database. Nucleic Acids Res 31:51–54.

[11] Broman KW, Murray JC, Sheffield VC, White RL, Weber JL (1998) Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 63:861–9.

[12] Rice JP, Goate A, Williams JT, Bierut L, Dorr D, et al. (1997) Initial genome scan of the NIMH genetics initiative bipolar pedigrees: chromosomes 1, 6, 8, 10, and 12. Am J Med Genet 74:247–253.

[13] Detera-Wadleigh SD, Badner JA, Yoshikawa T, Sanders AR, Goldin LR, et al. (1997) Initial genome scan of the NIMH genetics initiative bipolar pedigrees: chromosomes 4, 7, 9, 18, 19, 20, and 21q. Am J Med Genet 74:254–262.

[14] Stine OC, McMahon FJ, Chen L, Xu J, Meyers DA, et al. (1997) Initial genome screen for bipolar disorder in the NIMH genetics initiative pedigrees: chromosomes 2, 11, 13, 14, and X. Am J Med Genet 74:263–269.

[15] Edenberg HJ, Foroud T, Conneally PM, Sorbel JJ, Carr K, et al. (1997) Initial genomic scan of the NIMH genetics initiative bipolar pedigrees: chromosomes 3, 5, 15, 16, 17, and 22. Am J Med Genet 74:238–246.

[16] (1997) Genomic survey of bipolar illness in the NIMH genetics initiative pedigrees: a preliminary report. Am J Med Genet 74:227–237.

[17] Yonan AL, Alarcon M, Cheng R, Magnusson PK, Spence SJ, et al. (2003) A genomewide screen of 345 families for autism-susceptibility loci. Am J Hum Genet 73:886–897.

[18] Lathrop GM, Lalouel JM, Julier C, Ott J (1984) Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci U S A 81:3443–3446.

[19] Elston RC, Stewart J (1971) A general model for the genetic analysis of pedigree data. Hum Hered 21:523–542.

[20] Cottingham RW Jr, Idury RM, Schaffer AA (1993) Faster sequential genetic linkage computations. Am J Hum Genet 53:252–263.

[21] Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet 58:1347–1363.

[22] Lander ES, Green P (1987) Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A 84:2363–2367.

[23] Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin–rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30:97–101.

[24] George AW, Thompson EA (2003) Discovering disease genes: Multipoint linkage analysis via a new Markov chain Monte Carlo approach. Statistical Science 18:515–531.

[25] Gudbjartsson DF, Thorvaldsson T, Kong A, Gunnarsson G, Ingolfsdottir A (2005) Allegro version 2. Nat Genet 37:1015–6.
