Vol. 23 no. 19 2007, pages 2581–2588
BIOINFORMATICS ORIGINAL PAPER
doi:10.1093/bioinformatics/btm386

Genetics and population analysis

Simulating association studies: a data-based resampling method for candidate regions or whole genome scans

Fred A. Wright1,2,3,*, Hanwen Huang1, Xiaojun Guan4, Kevin Gamiel4, Clark Jeffries4,5, William T. Barry1, Fernando Pardo-Manuel de Villena2,6, Patrick F. Sullivan2,6, Kirk C. Wilhelmsen2,6 and Fei Zou1,2,3

1Department of Biostatistics, 2Center for Genome Sciences, 3Center for Environmental Bioinformatics, University of North Carolina, Chapel Hill, NC 27599, 4Renaissance Computing Institute, Europa Drive, 5School of Pharmacy and 6Department of Genetics, UNC Chapel Hill, NC, USA

*To whom correspondence should be addressed.

Received on August 25, 2006; revised on June 21, 2007; accepted on July 20, 2007
Advance Access publication September 4, 2007
Associate Editor: Keith Crandall

© 2007 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Motivation: Reductions in genotyping costs have heightened interest in performing whole genome association scans and in the fine mapping of candidate regions. Improvements in study design and analytic techniques will require the simulation of datasets with realistic patterns of linkage disequilibrium and allele frequencies for typed SNPs.

Methods: We describe a general approach to simulate genotyped datasets for standard case-control or affected child trio data, by resampling from existing phased datasets. The approach allows for considerable flexibility in disease models, potentially involving a large number of interacting loci. The method is most applicable for diseases caused by common variants that have not been under strong selection, a class specifically targeted by the International HapMap project.

Results: Using the three population Phase I/II HapMap data as a testbed for our approach, we have implemented the approach in HAP-SAMPLE, a web-based simulation tool.

Availability: The web-based tool is available at http://www.hapsample.org

Contact: [email protected]; [email protected]; [email protected]

1 INTRODUCTION

It has long been recognized (Risch and Merikangas, 1996) that large-scale genotype–phenotype association studies will have great power and precision to elucidate genetic influences in complex disease (Gibbs et al., 2003; Hirschhorn and Daly, 2005). However, key issues in optimal design and analysis remain unresolved. A partial list of areas of active research (Hirschhorn and Daly, 2005) includes reassessment of the relative strengths of case-control versus family based designs (Hintsanen et al., 2006), design of multistage association studies (Lowe et al., 2004; Satagopan et al., 2004), selection of appropriate significance thresholds (Dudbridge and Koeleman, 2004; Thomas et al., 2005), methods for fine-mapping and reconstructing haplotypes (De La Chapelle and Wright, 1998; Stephens and Donnelly, 2003) and approaches for handling multiple interacting susceptibility loci (Marchini et al., 2005).

In many cases, the best approaches depend on specifics of the disease model and polymorphism in the population (Pritchard and Cox, 2002). In order to rigorously compare competing approaches, simulation studies must be performed which provide realistic patterns of allele frequencies and linkage disequilibrium (LD) structure. Unfortunately, uncertainty of human population genetic history makes it difficult to perform such simulations. Forward simulation approaches (Dudek et al., 2006; Peng et al., 2007) can be sensitive to underlying assumptions and starting genotypes, and are typically highly variable across simulations (Calafell et al., 2000) for observed LD and disease outcomes. Backward coalescent approaches for multiple loci (Laval and Excoffier, 2004; Posada and Wiuf, 2003; Wang and Rannala, 2005) can be ‘calibrated’ to fit observed data structures (Schaffner et al., 2005), but remain computationally infeasible for dense SNP collections spanning large genomic regions. Moreover, coalescent methods are not well suited to handle unknown and variable selection pressures that may have affected broad genomic regions (Altshuler et al., 2005). These approaches involve de novo simulation of artificial SNPs, while the researcher may be interested in simulation tailored to a certain genomic region or the actual list of SNPs from a favored genotyping platform. Alternatively, one might simulate SNPs to fit pairwise LD measures observed in real data (Montana, 2005), but this approach may be unable to reflect higher-order haplotype structures likely crucial for evaluating haplotype reconstruction and inference (Liu and Lin, 2005). Moreover, it is not clear how to incorporate disease models into this approach.

Another possible data-based approach involves cataloguing the frequency of inferred haplotypes in real data across regions constituting haplotype blocks (Altshuler et al., 2005), and resampling from these haplotypes. The specification of distinct block boundaries is somewhat artificial (Schwartz et al., 2003), and may not reflect longer-range LD that is apparent in real data.
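The limitation of block-wise resampling noted above can be illustrated with a small sketch. It is not part of HAP-SAMPLE: the toy haplotype pool, the block boundaries and the helper names (`ld_r2`, `resample_whole`, `resample_by_block`) are all hypothetical, chosen only to show that drawing each block from an independent donor removes LD between blocks, whereas drawing whole haplotypes retains it.

```python
import random

def ld_r2(haps, i, j):
    """Squared allelic correlation (r^2) between SNP i and SNP j."""
    n = len(haps)
    pi = sum(h[i] for h in haps) / n
    pj = sum(h[j] for h in haps) / n
    pij = sum(h[i] * h[j] for h in haps) / n
    denom = pi * (1 - pi) * pj * (1 - pj)
    return 0.0 if denom == 0 else (pij - pi * pj) ** 2 / denom

def resample_whole(pool, n, rng):
    """Draw n full-length haplotypes from the pool, with replacement."""
    return [rng.choice(pool) for _ in range(n)]

def resample_by_block(pool, blocks, n, rng):
    """Build n haplotypes, taking each block from an independently drawn donor."""
    out = []
    for _ in range(n):
        hap = []
        for start, end in blocks:
            hap.extend(rng.choice(pool)[start:end])
        out.append(tuple(hap))
    return out

# Toy pool (assumption): four SNPs in complete LD along the whole haplotype.
pool = [(0, 0, 0, 0)] * 10 + [(1, 1, 1, 1)] * 10
# Hypothetical block definition: SNPs 0-1 form one block, SNPs 2-3 another.
blocks = [(0, 2), (2, 4)]

rng = random.Random(0)
whole = resample_whole(pool, 2000, rng)
blockwise = resample_by_block(pool, blocks, 2000, rng)

print("r^2 across blocks, whole-haplotype resampling:", ld_r2(whole, 0, 3))
print("r^2 across blocks, block-wise resampling:     ", ld_r2(blockwise, 0, 3))
print("r^2 within a block, block-wise resampling:    ", ld_r2(blockwise, 0, 1))
```

In this toy setting, LD within a block survives either scheme, while LD between SNP 0 and SNP 3 is retained only when whole haplotypes are drawn.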
By sampling from haplotypes of very long range, we may avoid the problem of applying arbitrary haplotype block definitions. Recent work (de Bakker et al., 2005) employed resampling data across the 500-kb HapMap ENCODE regions (Feingold et al., 2004), but it is not clear how to extend these efforts to a large scale or how to flexibly specify disease models.

Many complex diseases are likely to be influenced by ancient SNP variants that are common, and maintain appreciable frequencies across continent-level populations (Altshuler et al., 2005; Lohmueller et al., 2003; Peng and Kimmel, 2007). Variants with low penetrance or predisposing for diseases of old age will not have undergone strong selection, and investigation of this class of diseases is among the motivations for the HapMap project (Altshuler et al., 2005; Gibbs et al., 2003). Under such a model, disease chromosomes may be thought of as drawn from the same population as control chromosomes, but with selection probabilities that differ from controls at causal disease loci.

With these considerations, we developed a method to simulate realistic human autosomal SNP data for disease association studies, by resampling chromosome-length haplotypes derived from real data. The simulated data follows observed linkage disequilibrium structure and allele frequencies at actual SNP loci, and thus is well suited for power analyses and investigations of competing techniques for study design and analysis. We started by assuming that phased SNP data are available at a series of loci from a sample of individuals. Assuming random mating, the individual typed chromosomes form the relevant pool from which we draw in order to simulate SNP alleles for new individuals. We further implemented an artificial ‘crossover’ process that allows recombination of chromosomes at simulated crossovers. This crossover process mimics meiosis, but is arguably not necessary, as the original chromosome sample is already reflective of the population. However, we reasoned that a modest crossover process would increase novelty and avoid long-range allelic association produced by chance variation or subtle population substructure. As described below, we take care to simulate crossovers in a manner that preserves haplotype block structure, and the crossover rate is controlled by the user.

Figure 1 shows a schematic of the approach. The user identifies one or more ‘disease’ SNPs in the pool data (currently limited to one per autosome), for which ascertainment of cases

Fig. 1. Simulation of case and control chromosomes. Chromosome pool. These are derived from HapMap or other source as chromosome-length haplotypes. Case chromosomes. Genotypes at the disease SNP are determined according to P(g|D = 1), and pool chromosomes are chosen to be compatible with the genotype. Then the artificial crossover process is simulated, following the process described in the text, using the disease SNP location as the origin. The example depicted here shows a recessive disease, for which two ‘1’ alleles are required. Control chromosomes. Genotypes at the disease SNP are determined according to P(g|D = 0), and otherwise the process is the same as with case chromosomes. Simulation of affected-child trio data proceeds similarly, with transmitted chromosomes simulated in the same manner as case chromosomes. Non-transmitted chromosomes in the trios are simulated in a similar manner to control chromosomes, but follow the unconditional genotype frequencies P(g) at the disease loci.

replacement, as homozygosity of short-range haplotypes may arise in real data from a single shared ancestry. For low penetrance diseases, this conditional sampling scheme is far more efficient than unconditionally generating large numbers of genotypes and retaining only the small fraction with disease. The remaining allele values are generated by following an artificial crossover process using the disease locus as the origin (see Methods Section). At each generated crossover, a random chromosome from the