Supplement

April 11, 2008

Contents

1 Simulation: comparison of our method with the traditional linkage
2 Overlap Significance
3 Simulated Annealing
  3.1 Network Jumps
  3.2 Expectation-maximization (EM) Update versus Maximum-Likelihood Estimation of Cluster Probabilities
  3.3 Iteration Number and Temperature Schedule
4 Unlinked Data Simulation
5 Analysis Settings and Important Observations
6 Data
  6.1 Molecular Network
  6.2 Genes
  6.3 Linkage Datasets
  6.4 Marker Maps
7 Linkage Tools and Parameters

List of Figures

S1 Simulation LOD scores
S2 Simulation Results
S3 An outline of the generative model of data in our analysis, depicted as a graphical model.
S4 A hypothetical family.
S5 A graphical model representation of a pedigree.
S6 Testing significance of the gene-specific statistic values.

List of Tables

S1 Additional highly significant and suggestively significant genes
S2 Overlaps for the analyses including the X chromosome.
S3 Analyses parameters.
S4 Two- and three-way global overlap p-values.
S5 Data sets.
S6 Multipoint linkage analysis tools.

1 Simulation: comparison of our method with the traditional linkage

We used our generative model to produce and analyze 100 artificial data sets. For each data set we defined a randomly chosen “true” (disease-predisposing) cluster of ten genes. We then simulated genotypes and phenotypes of individuals in 1,000 families in the following way. First, we generated 1,000 “empty” pedigree topologies by randomly choosing a real pedigree with fewer than 17 meioses (to reduce computation time) from “our” autism data. Second, for each pedigree, we sampled a gene from the predisposing gene cluster using uniform cluster probabilities (1/10), and assigned the sampled gene to the pedigree. Third, we simulated states for all genetic marker loci and for the family-specific “causal” (disease) gene for all individuals in each family.
We assigned states of genetic markers for founders of each pedigree by sampling from the empirical frequency distribution estimated from the autism dataset. We assumed a single disease allele and a single wild-type allele for each “causal” gene, setting allele frequencies to 0.01 and 0.99, respectively. Given genotypes of pedigree founders, we simulated meiotic events leading to the rest of the pedigree, assigning allelic states to all “empty” genomes in the family. Fourth, given known genotypes of all individuals, we simulated their phenotypes. Considering only the family-specific “causal” disease gene (and not the rest of the genes in the disease cluster), we attributed the disease phenotype to individuals with two wild-type alleles of the gene with probability 0.001. Similarly, individuals with one wild-type and one disease allele, or with two disease alleles, of the gene were assigned the disease phenotype with probability 0.8. Fifth, we modeled ascertainment by rejecting families with fewer than two affected individuals.

Using these 100 artificial datasets of 1,000 families each, we compared our current approach with conventional linkage analysis. We could not afford computationally to estimate a separate background distribution for each of the 100 simulated datasets, as we did in the analyses of the real datasets. Instead, we estimated one background distribution for all 100 samples by simulating 10 unlinked genotype sets for each of the 100 simulations, thus gathering a total of 1,000 unlinked simulations.

As expected, due to the extreme heterogeneity of the data, conventional linkage analysis produces overwhelmingly low LOD scores, often smaller than -100 (see Figure S1 for an example LOD score curve over one of the correct genes). Such abysmal LOD scores would certainly lead to the conclusion of absence of linkage; the data would likely be discarded without reporting.

Figure S1: Simulation LOD scores. In the shown simulation, TRIT1 was part of the correct cluster and it was assigned to the top 10% of the families. No other locus on chromosome 1 was linked to the phenotype. A. LOD score curve produced by traditional linkage analysis. B. Per-family LOD scores.

However, if we use gene-specific LOD scores to rank genes by their likely involvement in the disease etiology (by integrating over all possible thresholds for the LOD scores), we achieve a rather high mean AUC (area under the ROC curve) value of 0.77 [0.76, 0.78], suggesting that even traditional linkage analysis can be modified to extract more information under the assumption of genetic heterogeneity than the approach does now. When we applied our current method to prioritize the genes based on the significance of their estimated cluster probabilities, we achieved a significantly higher mean AUC value of 0.92 [0.90, 0.93].

Further, we measured how many of the 20 top-ranking genes produced by each method are true positives. As expected, our method, implementing the “true” statistical model, significantly outperformed conventional linkage. The average numbers of true positives were 4.13 [3.79, 4.47] and 0.11 [0.05, 0.17] for our method and conventional linkage, respectively (see Figure S2).

Figure S2: Simulation Results
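As a minimal sketch of the two evaluation measures reported in this section, the following Python computes the AUC of a gene ranking through the rank-sum identity (the probability that a randomly chosen true gene outscores a randomly chosen other gene, which equals the area obtained by sweeping over all score thresholds) and counts the true positives among the k top-ranking genes. The function names, the dictionary input, and the toy scores are illustrative assumptions and are not taken from the authors' implementation.

```python
from typing import Dict, Set


def ranking_auc(scores: Dict[str, float], true_genes: Set[str]) -> float:
    """Area under the ROC curve for a gene ranking.

    Uses the rank-sum identity: AUC is the fraction of (true, other) gene
    pairs in which the true gene has the higher score; ties count one half.
    """
    pos = [s for g, s in scores.items() if g in true_genes]
    neg = [s for g, s in scores.items() if g not in true_genes]
    if not pos or not neg:
        raise ValueError("need at least one true gene and one other gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def true_positives_in_top_k(scores: Dict[str, float],
                            true_genes: Set[str],
                            k: int = 20) -> int:
    """Number of true cluster genes among the k highest-scoring genes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for g in ranked[:k] if g in true_genes)


# Toy example with hypothetical gene labels and scores.
toy_scores = {"A": 3.0, "B": 2.0, "C": 1.0, "D": 0.5}
toy_truth = {"A", "C"}
toy_auc = ranking_auc(toy_scores, toy_truth)
toy_hits = true_positives_in_top_k(toy_scores, toy_truth, k=2)
```

In the toy example, three of the four (true, other) pairs are ordered correctly, so the AUC is 0.75, and one of the two top-ranking genes is a true positive. The scores passed in could be gene-specific LOD scores or any other gene-level statistic.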
[Figure S3: graphical model diagram. The labels recoverable from the figure are listed below.]

Assumptions shown in the figure:
  We assume that every gene in the disease-predisposing gene cluster C has one normal and one disease-predisposing allele; the frequency of the disease-predisposing allele is the same for all genes in the cluster C.
  Penetrance parameters are common for all genes in the gene cluster C.
  All markers and genes are arranged according to a sex-averaged genetic map.
  Uniform prior distribution over connected components of size c.
  No errors in genotyping.

Parameters, data and variables shown in the figure:
  Molecular network
  Map (gene and marker positions in the genome)
  Penetrance parameters
  Pedigree topology
  Allelic frequencies for genes and markers
  Gene cluster of c genes
  Gene-specific cluster probability
  Pedigree-specific predisposing gene
  Pedigree-specific genotypes (marker alleles, gene alleles)
  Observed phenotype
  Observed marker alleles

Legend: Assumption; Dependence; Parameter; Data or variable.

Figure S3: An outline of the generative model of data in our analysis, depicted as a graphical model.

[Figure S4: pedigree diagram with individuals numbered 1 to 15.]

Figure S4: A hypothetical family. Each pedigree (family) described in our data is provided with a kinship structure and phenotypic information. Circles and squares represent female and male individuals, respectively. By convention, the filled shapes represent affected (sick) individuals; the open shapes represent unaffected (healthy) ones.

[Figure S5: graphical model with genotype nodes g_1 to g_15, haplotype nodes (for example h_{3,1} and h_{3,2}), and observed phenotype values 0 or 1.]

Figure S5: The same pedigree as that shown in Figure S4, represented as a graphical model. We can directly observe phenotypes (Φ_0 and Φ_1 are phenotypes of unaffected and affected individuals, respectively). We can think of the phenotypes as emitted messages produced by the hidden (not directly observable) genotypes, g_i. Variables h_{k,l} represent haplotypes that are passed from parents to their children.
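The idea of hidden genotypes that are passed down a pedigree and that emit observed phenotypes can be sketched for a single biallelic disease gene as follows. This is a simplified illustration under stated assumptions: it uses the disease-allele frequency (0.01) and penetrances (0.001 and 0.8) quoted in the simulation section, it ignores marker loci and recombination entirely, and the pedigree encoding, function names, and the four-person example family are hypothetical rather than taken from the authors' software.

```python
import random
from typing import Dict, Optional, Tuple

DISEASE_ALLELE_FREQ = 0.01                # value quoted in the simulation section
PENETRANCE = {0: 0.001, 1: 0.8, 2: 0.8}   # P(affected | number of disease alleles)

Genotype = Tuple[int, int]                # (paternal allele, maternal allele); 1 = disease
Pedigree = Dict[str, Optional[Tuple[str, str]]]  # person -> (father, mother) or None


def simulate_family(pedigree: Pedigree, rng: random.Random):
    """Simulate hidden genotypes and emitted phenotypes for one family.

    Assumes parents appear before their children in the pedigree dict.
    """
    genotypes: Dict[str, Genotype] = {}
    phenotypes: Dict[str, int] = {}
    for person, parents in pedigree.items():
        if parents is None:
            # Founder: draw both alleles from the population frequency.
            genotypes[person] = (int(rng.random() < DISEASE_ALLELE_FREQ),
                                 int(rng.random() < DISEASE_ALLELE_FREQ))
        else:
            # Non-founder: each parent transmits one of its two alleles.
            father, mother = parents
            genotypes[person] = (rng.choice(genotypes[father]),
                                 rng.choice(genotypes[mother]))
        # The phenotype is "emitted" by the hidden genotype.
        n_disease = sum(genotypes[person])
        phenotypes[person] = int(rng.random() < PENETRANCE[n_disease])
    return genotypes, phenotypes


def is_ascertained(phenotypes: Dict[str, int]) -> bool:
    """Keep only families with at least two affected individuals."""
    return sum(phenotypes.values()) >= 2


# Hypothetical nuclear family: two founders and two children.
example_pedigree: Pedigree = {
    "father": None,
    "mother": None,
    "child1": ("father", "mother"),
    "child2": ("father", "mother"),
}
genos, phenos = simulate_family(example_pedigree, random.Random(0))
```

With a disease-allele frequency of 0.01, most simulated families contain fewer than two affected individuals, so a full simulation would typically repeat `simulate_family` until `is_ascertained` returns true.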
[Figure S6: flowchart. The recoverable content is summarized below.]

Real data:
  Original data (pedigrees, disease phenotypes, marker map, marker genotypes)
  -> Real chromosome LOD scores (LOD score for every position on every chromosome)
  -> Real gene LOD score matrix (LOD score for every gene and every pedigree)
  -> Bootstrap Loop
  -> Real-data gene-specific statistic value

Null model (repeat K times):
  Simulate the k-th set of disease-unlinked genotypes
  -> k-th simulated set of chromosome LOD scores
  -> k-th simulated gene LOD score matrix (LOD score for every gene and every pedigree)
  -> Bootstrap Loop
  -> k-th simulation-based gene-specific statistic value

For every gene: distribution of the K gene-specific statistic values -> p-value

Bootstrap Loop:
  Input: gene LOD score matrix (LOD scores for G genes and F pedigrees)
  Set all gene statistic counts to 0
  Repeat B times:
    Sample with replacement F pedigrees
    Find the cluster C with the highest likelihood (via simulated annealing)
    Update statistic values for all genes
  Output: gene-specific statistic values

Figure S6: Testing significance of the gene-specific statistic values. See the main text of the paper for discussion and explanation.
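The bootstrap loop and the final p-value step in Figure S6 can be sketched in Python as below. Several parts are assumptions made only for illustration: the cluster search is replaced by a simple stand-in that picks the c genes with the highest summed LOD score (the paper uses simulated annealing over a molecular network, which is not reproduced here), the gene statistic is taken to be the number of bootstrap replicates in which a gene falls in the selected cluster, and the p-value is the plain fraction of null values at least as large as the real one. All function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence

LodMatrix = Sequence[Sequence[float]]  # lod[g][f]: LOD score of gene g in pedigree f


def top_c_by_summed_lod(lod: LodMatrix, pedigrees: Sequence[int], c: int) -> List[int]:
    """Stand-in for the cluster search (NOT simulated annealing):
    return the c genes with the highest LOD sum over the sampled pedigrees."""
    totals = [sum(row[f] for f in pedigrees) for row in lod]
    return sorted(range(len(lod)), key=lambda g: totals[g], reverse=True)[:c]


def bootstrap_gene_counts(lod: LodMatrix, c: int, n_boot: int, rng: random.Random,
                          find_cluster: Callable = top_c_by_summed_lod) -> List[int]:
    """Bootstrap loop: resample pedigrees, find the best cluster, update counts."""
    n_genes, n_peds = len(lod), len(lod[0])
    counts = [0] * n_genes                      # set all gene statistic counts to 0
    for _ in range(n_boot):                     # repeat B times
        sample = [rng.randrange(n_peds) for _ in range(n_peds)]  # F pedigrees, with replacement
        for g in find_cluster(lod, sample, c):  # best cluster for this replicate
            counts[g] += 1                      # update statistic values
    return counts


def empirical_p_value(real_value: float, null_values: Sequence[float]) -> float:
    """Fraction of the K null statistic values that are >= the real-data value."""
    return sum(1 for v in null_values if v >= real_value) / len(null_values)


# Toy LOD matrix: 3 genes by 3 pedigrees (hypothetical numbers).
toy_lod = [[5.0, 5.0, 5.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
toy_counts = bootstrap_gene_counts(toy_lod, c=1, n_boot=10, rng=random.Random(0))
toy_p = empirical_p_value(3, [1, 2, 3, 4])
```

Passing the cluster search in as `find_cluster` keeps the loop independent of how the best cluster is found, so a network-constrained search could be substituted without changing the resampling or the p-value step. The same loop would be run once on the real gene LOD score matrix and once on each of the K simulated matrices.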
Table S1: Additional significant and suggestively significant genes

GeneID  Symbol    Chromosome location  Gene name                                                      Max p-value  Sum p-value

autism-no-x
10913   EDAR      2q11-q13             ectodysplasin A receptor                                       <0.000       <0.000
3991    LIPE      19q13.2              lipase, hormone-sensitive                                      <0.000       <0.000
889     KRIT1     7q21-q22             ankyrin repeat containing                                      <0.000       0.001

autism-x
2067    ERCC1     19q13.2-q13.3        excision repair protein                                        <0.000       <0.000
10658   CUGBP1    11p11                CUG triplet repeat, RNA binding                                <0.000       0.001
7486    WRN       8p12-p11.2           Werner syndrome                                                <0.000       0.003
10018   BCL2L11   2q13                 BCL2-like 11 (apoptosis facilitator)                           <0.000       0.009
1747    DLX3      17q21                distal-less homeo box 3                                        <0.000       0.013
11030   RBPMS     8p12-p11             RNA binding protein                                            0.002        <0.000
23209   MLC1      22q13.33             megalencephalic leukoencephalopathy with subcortical cysts 1   0.002        <0.000

autism-x-dom-rec
10893   MMP24     20q11.2              matrix metallopeptidase 24                                     <0.000       <0.000
1992    SERPINB1  6p25                 serpin peptidase inhibitor                                     <0.000       <0.000
5903    RANBP2    2q12.3               RAN binding protein 2                                          <0.000       0.007
8205    TAM*      21q11.2              myeloproliferative syndrome, transient                         0.004        <0.000

bipolar-x
1437    CSF2      5q31.1               colony stimulating factor 2                                    0.001        <0.000
1604    CD55      1q32                 decay accelerating for complement                              0.005        <0.000
1869    E2F1      20q11.2              E2F transcription factor 1                                     0.002        <0.000
3586    IL10      1q31-q32             interleukin 10                                                 0.001        <0.000
5075    PAX1      20p11.2              paired box gene 1                                              0.009        <0.000
6696    SPP1      4q21-q25             secreted phosphoprotein