Mutagenesys: Estimating Individual Disease Susceptibility Based On
Total Page:16
File Type:pdf, Size:1020Kb
Vol. 24 no. 3 2008, pages 440–442 BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btm587 Databases and ontologies MutaGeneSys: estimating individual disease susceptibility based on genome-wide SNP array data Julia Stoyanovich* and Itsik Pe’er Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, New York, NY 10025, USA Received on May 29, 2007; revised on October 16, 2007; accepted on November 22, 2007 Advance Access publication November 29, 2007 Associate Editor: Jonathan Wren ABSTRACT First and foremost, genetic information remains expensive Summary: We present MutaGeneSys: a system that uses genome- to collect, and it is currently economically prohibitive to wide genotype data to estimate disease susceptibility. Our system make a complete set of an individual’s genotypes of SNPs integrates three data sources: the International HapMap project, available for analysis. Nature Genetics’ ‘Question of the Year’ whole-genome marker correlation data and the Online Mendelian (www.nature.com/ng/qoty) announced the sequencing of the Inheritance in Man (OMIM) database. It accepts SNP data of indivi- entire human genome for $1000 as a goal for the genetics duals as query input and delivers disease susceptibility hypotheses community. Cost-effective methods (e.g. SNP arrays) currently even if the original set of typed SNPs is incomplete. Our system exist for collecting genetic data from 1% to 5% of all 11 million is scalable and flexible: it produces population, technology and human SNPs. This calls for the development of techniques that confidence-specific predictions in interactive time. can effectively utilize partial genetic information for disease Availability: Our system is available as an online resource at http:// prediction. The second reason is that OMIM is only accessible magnet.c2b2.columbia.edu/mutagenesys/. Our findings have also in text form, and the limited cross-reference between OMIM been incorporated into the HapMap Genome Browser as the and other NCBI databases makes it difficult to use this data for OMIM_Disease_Associations track. automated genome-wide diagnostics. Contact: [email protected] Several studies, culminating with the International HapMap project (International HapMap consortium), report on a sig- nificant amount of correlation among markers in the genome 1 INTRODUCTION (Pe’er et al., 2006). This genomic redundancy enables one to experimentally type an incomplete set of SNPs, and expand this Availability of genetic information continues to revolutionize set by including correlated proxies. Indirect association between the way we perceive medicine, with an ever stronger trend proxy genotype and phenotype thus facilitates effective and towards personalized diagnostics and treatment of heritable efficient association analysis. conditions. One challenge towards this goal is per-patient In the simplest case, SNPs are correlated pairwise, and one evaluation of susceptibility to disease and potential to gain of them may be predicted by the other; such correlations are from treatment based on single nucleotide polymorphisms referred to as single-marker predictors. Many two-marker and (SNPs) — DNA sequence variations occurring when a single three-marker predictors are also known. Correlations between nucleotide in the genome differs between individuals. causal SNPs and their proxies are associated with a coefficient Significant attention of the research community is devoted to of determination (squared Pearson’s correlation coefficient) determining direct causal association between the genotype and r2½0; 1. Marker correlation is population-dependent the phenotype, and many interesting associations have et al already been reported. The Online Mendelian Inheritance in (de Bakker ., 2006); the typed SNPs, and hence, the allowed Man (OMIM) (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db¼ predictors also depend on the genotyping technology. For OMIM) database is currently the most complete source of example, according to our marker correlation dataset (www. such associations. A text search of OMIM yields, for example, cs.columbia.edu/itsik/StandardGenotyping.htm), we can best a correlation between a C/T SNP (rs908832) in exon 14 of the predict the minor allele T of rs1205 on chromosome 1 based on ABCA2 gene and Alzheimer disease and a connection between the the Affymetrix GeneChip 500K genotyping technology, and 2 a SNP in the IFIH1 gene, rs1990760 and type-I diabetes. the prediction accuracy is r ¼ 0:733 (in the Japanese and Much individual genetic data is now being collected in Chinese population). OMIM links the predicted SNP rs1205 the context of association studies, typing thousands of with Systemic Lupus Erythematosus (SLE) and antinuclear individuals for millions of variants. Yet, fully exploiting genetic antibody production. information for disease prediction is difficult for two reasons. Genome-wide correlation can be used to augment an individual’s genetic information, greatly enhancing its diag- nostic value. In the example above, if rs1205 was not typed, but rs12076827 and rs1572970 were, probabilities for the *To whom correspondence should be addressed. presence of the SLE-associated variant may be estimated. ß 2007 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. MutaGeneSys: estimating individual disease susceptibility Our project is the first step in this direction. Our goal is to 3 RESULTS stream-line the process of correlating SNP information with Our database contains a significant amount of SNP and marker heritable disorders, and to enable real-time retrieval of disease correlation data, but only a limited number of SNP to OMIM susceptibility hypotheses on genome-wide scale. Our system is associations. Across all populations and platforms we store over scalable and flexible: it produces population, technology and 10 million SNPs, close to 50 million single correlations, and confidence-specific hypotheses in real time. MutaGeneSys over 20 million two-marker correlations. OMIM contains about (Mutation Genome System) is available to the scientific 18 000 scientific articles, many of which are not relevant community as an online resource (magnet.c2b2.columbia.edu/ to association studies. However, we still identify 187 articles mutagenesys/). Our findings have also been incorporated into that mention associations between heritable disorders and HapMap GBrowse (International HapMap consortium). SNPs, with 133 unique participating SNPs. Combining OMIM with marker correlation data, we are able to estimate dis- ease susceptibility for additional 328 unique single-marker pairs, 2 METHODS and 396 double-marker sets. The dataset is enriched with a total of 1312 population-specific correlations. The number of 2.1 Data processing and integration susceptibility hypotheses will grow as more information about We integrate three datasets: the International HapMap project direct associations between SNPs and heritable disorders (International HapMap consortium), Online Mendelian Inheritance becomes available. in Man (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db¼OMIM), and a For an example of the effectiveness of MutaGeneSys, con- dataset of marker correlations (www.cs.columbia.edu/itsik/Standard sider age-related macular degeneration (ARMD). According to Genotyping.htm). We use HapMap — a comprehensive repository OMIM, two SNPs are implicated in this disorder: rs3793784 in of SNP genotypes, to compile population-specific lists of SNP alleles the ERCC6 gene and rs380390 in the CFH gene. MutaGeneSys and frequencies. Our prediction dataset consists of single- and two- associates 72 additional SNPs with ARMD. As another marker correlations; consequently, these are the types of correlation example, Systemic Lupus Erythematosus (SLE) is associated that our system supports. with two CRP polymorphisms in OMIM; MutaGeneSys uses Both HapMap and marker correlation datasets are clean, non- redundant, and available in machine-processable form. The challenge 15 additional SNPs to indicate potential susceptibility to SLE. with these datasets is the sheer volume of data. However, while there is a lot of information regarding correlations along the genome (the marker correlation dataset is large), relatively little is still known about 4 DISCUSSION correlations between SNPs and heritable disorders. We observe that our MutaGeneSys cannot yet be considered as a source of diagnostic system can take advantage of marker correlations only if they ultimately predictions, because of a number of uncertainties involved lead to a hypothesis of disease susceptibility, and use available marker- in going from a specific marker to disease. Given the available to-disorder data as the limiting factor. A correlation between SNP and 1 data, we made our best effort to control for population SNP2 is only useful if at least one of these SNPs is associated with a heritable disorder. and correlation-specific effects: marker associations are com- We currently use OMIM, a repository of publications about human puted separately for different populations, and susceptibility genes and genetic disorders, as our data source for marker to disorder results include correlation parameters. For lack of information, associations. Associations between SNPs and diseases