Robustness of Inference of Haplotype Block Structure
JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 10, Number 1, 2003
© Mary Ann Liebert, Inc.
Pp. 13–19

Robustness of Inference of Haplotype Block Structure

RUSSELL SCHWARTZ,1,3 BJARNI V. HALLDÓRSSON,1 VINEET BAFNA,1,4 ANDREW G. CLARK,2 and SORIN ISTRAIL1

1 Celera Genomics, Informatics Research, Rockville, MD 20850.
2 Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853.
3 Current address: Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213.
4 Current address: The Center for Advancement of Genetics, Rockville, MD 20850.

ABSTRACT

In this report, we examine the validity of the haplotype block concept by comparing block decompositions derived from public datasets by variants of several leading methods of block detection. We first develop a statistical method for assessing the concordance of two block decompositions. We then assess the robustness of inferred haplotype blocks to the specific detection method chosen, to arbitrary choices made in the block-detection algorithms, and to the sample analyzed. Although the block decompositions show levels of concordance that are very unlikely by chance, the absolute magnitude of the concordance may be low enough to limit the utility of the inference. For purposes of SNP selection, it seems likely that methods that do not arbitrarily impose block boundaries among correlated SNPs might perform better than block-based methods.

Key words: haplotype block, SNP, recombination, linkage disequilibrium, diversity.

1. INTRODUCTION

Single nucleotide polymorphisms (SNPs) in the genome show great promise as predictors of disease, but the large number of known SNPs presents a significant challenge in locating those meaningfully correlated with disease phenotypes. The detection of a haplotype block structure in the human genome (Daly et al., 2001; Jeffreys et al., 2001; Johnson et al., 2001; Patil et al., 2001), in which the genome is largely made up of regions of low diversity, each of which can be characterized by a small number of SNPs, presents a possible way to reduce the complexity of the problem. However, whether construction of haplotype blocks is the most efficient way to reduce the number of SNPs that one needs to type to represent the extant genetic variation remains a controversial topic.

Many methods have been suggested for defining block structures, and these can be roughly classified into three groups. One group consists of linkage disequilibrium (LD) methods, such as that of Gabriel et al. (2002), which define blocks so as to enforce generally high pairwise LD within blocks and generally low pairwise LD between blocks. Another group consists of diversity-based methods, such as that of Patil et al. (2001), which define blocks so as to enforce low sequence diversity, by some diversity measure, within each block. Finally, there are methods that look for direct evidence of recombination, such as the four-gamete test applied by Hudson and Kaplan (1985), defining blocks as apparently recombination-free regions. In addition to distinct block definitions, blocking methods can differ according to the optimization criterion by which the best block decomposition is chosen among all possible decompositions satisfying the block test. Two important optimization criteria, which were examined by Zhang et al. (2002) in a paper describing a general algorithm for efficient block decomposition, are minimization of the number of blocks consistent with the block definition and minimization of the number of SNPs needed to characterize all sequences within each block.

Here we assess the merits of haplotype block inference by examining the robustness of the block concept to multiple variants in block detection strategies. If haplotype blocks genuinely capture islands of low diversity separated by recombination sites, as they are generally represented by proponents, then the various methods proposed in the literature for locating blocks ought to derive essentially the same decomposition each time they are applied. We assess this prediction by developing a statistic for comparing block decompositions and applying it to decompositions derived on publicly available phase-known datasets using variants of the commonly used block detection methods.

In the remainder of this paper, we describe our empirical analysis of the robustness of the block concept. We first present our methodology, describing the computational methods used for calculating block boundaries and presenting a statistical method for comparing two block decompositions. We then describe the results of an analysis assessing the robustness of block methods to variants in the block detection protocol. In particular, we examine robustness to changes in block definition, optimization criterion, and population sample, as well as robustness to arbitrary choices made in the algorithms. We conclude that different block decompositions of a single genetic region tend to be far more consistent than can be explained by chance, but that the absolute similarity is nonetheless frequently small. It therefore appears that while there is a common underlying structure to haplotype blocks which all methods detect, that structure is less rigidly defined than individual block decompositions might suggest.

2. METHODS AND DATA

2.1. Finding block boundaries

In this paper, we compare variants of the three general block-detection methodologies that have so far appeared in the literature: recombination-based, diversity-based, and linkage-disequilibrium (LD) based. We use the four-gamete test (Hudson and Kaplan, 1985) as the simplest example of a recombination-based test. One problem with the four-gamete test is that it is based on the infinite sites model (in which a given site mutates at most once during a sequence's evolutionary history) and can falsely infer recombination events when that model fails due to recurrent mutations. That problem is not a significant drawback for block construction, however, since sites of recurrent mutation can reasonably be considered to disrupt block patterns even if they do not correspond to recombination sites.
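As an illustrative sketch (not the implementation used in this study), the four-gamete test for a single pair of biallelic SNPs can be written as follows, assuming haplotypes are encoded as strings of '0'/'1' characters; a block is then any contiguous run of SNPs containing no violating pair. All identifiers are ours.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Four-gamete test for SNP sites i and j: if all four gametes (0-0, 0-1, 1-0,
// 1-1) are observed among the haplotypes, then under the infinite-sites model
// at least one historical recombination (or a recurrent mutation) must have
// occurred between the two sites, so they cannot lie in the same block.
bool four_gamete_violation(const std::vector<std::string>& haplotypes,
                           std::size_t i, std::size_t j) {
    std::set<std::pair<char, char>> gametes;
    for (const std::string& h : haplotypes)
        gametes.insert({h[i], h[j]});
    return gametes.size() == 4;
}
```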
For a diversity-based test, we use a generalization of the Patil et al. (2001) test. In their test, a region is a block if at least 80% of the sequences occur in more than one chromosome. This test was developed for a sample of only 20 chromosomes and does not scale well to larger sample sizes, as it will tend to yield larger blocks as more chromosomes are studied. We generalized this test by defining a region as a potential block if the sequences within that region (haplotypes) accounting for at least 80% of the sampled population each occur in at least 10% of the sample.

Finally, we developed a test based on the D' statistic, similar to that used by Gabriel et al. (2002) but with the test of significance tuned to produce more meaningful results on small population samples. In this approach, we consider a set of SNPs to form a block if they are contiguous and the D' value of every pair of SNPs within the block shows significant LD with a P-value below 0.001, as estimated by simulations using the empirically measured single-site allele frequencies for the given SNP pair and the assumption of complete linkage equilibrium.

We further consider the two common optimization criteria discussed in the introduction: minimizing the total number of blocks and minimizing the number of SNPs needed to characterize the block decomposition of every sequence, assuming all sequences contain only observed haplotypes within each block. We solve optimally for each measure using a variant of the dynamic programming algorithm of Zhang et al. (2002), modified to sample uniformly at random from all optimal solutions rather than deterministically choosing a single optimum. All of the methods were implemented in C++ running on DEC Alpha Unix machines.
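To make the remaining block tests and the optimization step concrete, the following are minimal illustrative sketches; they are not the original implementation, and all identifiers and default thresholds shown are ours. First, the generalized diversity criterion described above, assuming the haplotypes spanning a candidate region are given as strings of allele characters:

```cpp
#include <map>
#include <string>
#include <vector>

// Generalized Patil et al. (2001) criterion: the region is a potential block
// if the haplotypes that each cover at least min_freq of the sample together
// account for at least coverage of the sampled chromosomes.
bool is_potential_block(const std::vector<std::string>& region_haplotypes,
                        double coverage = 0.80, double min_freq = 0.10) {
    std::map<std::string, int> counts;
    for (const std::string& h : region_haplotypes) ++counts[h];
    const double n = static_cast<double>(region_haplotypes.size());
    double covered = 0.0;
    for (const auto& entry : counts)
        if (entry.second / n >= min_freq)   // keep only "common" haplotypes
            covered += entry.second;
    return covered / n >= coverage;
}
```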
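Next, a sketch of the simulation-based significance test for pairwise D': the two sites are resampled independently under complete linkage equilibrium at the empirical allele frequencies, and the p-value is the fraction of simulations whose |D'| reaches the observed value (using |D'| itself as the test statistic is an assumption of this sketch, not a detail given in the text).

```cpp
#include <algorithm>
#include <cmath>
#include <random>

// |D'| for a pair of biallelic SNPs given the four two-site haplotype counts.
double abs_d_prime(double n00, double n01, double n10, double n11) {
    const double n = n00 + n01 + n10 + n11;
    const double p = (n00 + n01) / n, q = (n00 + n10) / n;  // allele-0 frequencies
    const double D = n00 / n - p * q;
    const double dmax = D >= 0 ? std::min(p * (1 - q), (1 - p) * q)
                               : std::min(p * q, (1 - p) * (1 - q));
    return dmax > 0 ? std::fabs(D) / dmax : 0.0;
}

// Monte Carlo p-value: fraction of linkage-equilibrium simulations (sites
// drawn independently at the empirical allele frequencies p and q over
// n_chrom chromosomes) whose |D'| is at least the observed value.
double dprime_pvalue(double observed, double p, double q, int n_chrom,
                     int n_sims, std::mt19937& rng) {
    std::bernoulli_distribution a(p), b(q);
    int hits = 0;
    for (int s = 0; s < n_sims; ++s) {
        double c[2][2] = {{0, 0}, {0, 0}};
        for (int k = 0; k < n_chrom; ++k) ++c[a(rng) ? 0 : 1][b(rng) ? 0 : 1];
        if (abs_d_prime(c[0][0], c[0][1], c[1][0], c[1][1]) >= observed) ++hits;
    }
    return static_cast<double>(hits) / n_sims;
}
```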
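Finally, a sketch of the block-partition dynamic program in the style of Zhang et al. (2002), here for the block-minimizing criterion; counting the optimal solutions during the forward pass and tracing back proportionally yields a decomposition sampled uniformly at random from all optima. The SNP-minimizing criterion can be handled analogously by changing the cost term; `is_block` stands in for any of the block tests above.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <limits>
#include <random>
#include <vector>

// Partition SNPs 0..num_snps-1 into the minimum number of blocks, where
// is_block(i, j) reports whether SNPs i..j (inclusive) pass the block test.
// Returns block start positions of one decomposition drawn uniformly at
// random from all optima. Assumes at least one valid decomposition exists
// (e.g., every single SNP passes the block test on its own).
std::vector<int> sample_optimal_partition(
    int num_snps, const std::function<bool(int, int)>& is_block,
    std::mt19937& rng) {
    const int INF = std::numeric_limits<int>::max() / 2;
    std::vector<int> cost(num_snps + 1, INF);   // cost[j]: min blocks for SNPs 0..j-1
    std::vector<double> ways(num_snps + 1, 0);  // ways[j]: count of optimal partitions
    cost[0] = 0;
    ways[0] = 1;
    for (int j = 1; j <= num_snps; ++j) {
        for (int i = 0; i < j; ++i) {
            if (!is_block(i, j - 1) || cost[i] >= INF) continue;
            if (cost[i] + 1 < cost[j]) { cost[j] = cost[i] + 1; ways[j] = ways[i]; }
            else if (cost[i] + 1 == cost[j]) ways[j] += ways[i];
        }
    }
    // Trace back, choosing each predecessor with probability proportional to
    // the number of optimal partitions passing through it (counts are kept as
    // doubles, so very large counts are sampled only approximately uniformly).
    std::vector<int> starts;
    int j = num_snps;
    while (j > 0) {
        std::vector<int> preds;
        std::vector<double> weights;
        for (int i = 0; i < j; ++i) {
            if (is_block(i, j - 1) && cost[i] + 1 == cost[j]) {
                preds.push_back(i);
                weights.push_back(ways[i]);
            }
        }
        std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
        j = preds[pick(rng)];
        starts.push_back(j);
    }
    std::reverse(starts.begin(), starts.end());
    return starts;
}
```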
2.2. A statistic for block comparison

We use the number of shared block boundaries as a statistic for the similarity of two block partitions. This yields a computationally tractable method for exactly computing the p-value for rejecting the null hypothesis that two block decompositions were chosen at random (uniformly) and independently of one another. If $B_1$ and $B_2$ are the numbers of boundaries in the two partitions, $m$ the number of boundaries shared by the partitions, and $S$ the total number of SNPs (so that there are $S-1$ possible boundary positions), then the probability that they share exactly $m$ boundaries under the null hypothesis is

$$\frac{\binom{B_1}{m}\binom{S-1-B_1}{B_2-m}}{\binom{S-1}{B_2}},$$

yielding a p-value (the probability of the statistic being at least $m$) of

$$\sum_{i=m}^{\min(B_1,B_2)} \frac{\binom{B_1}{i}\binom{S-1-B_1}{B_2-i}}{\binom{S-1}{B_2}}.$$

This measure thus allows us to test the hypothesis that two partitions are related and provides a degree of confidence with which we can reject the null hypothesis of independence.

2.3. Data

For evaluation, we rely on two publicly available datasets. The first is the Perlegen chromosome 21 dataset (Patil et al., 2001), which consists of 24,047 SNPs typed on 20 phased chromosomes. This dataset contains a large contiguous set of SNPs, providing an excellent test of blocking algorithms. Although a substantial portion of chromosome 21 was not covered by multi-SNP blocks by any method, it yielded enough long blocks to clearly detect statistically significant concordances between distinct block decompositions. We also use a dataset derived from 71 individuals typed at 88 polymorphic sites in the human lipoprotein lipase (LPL) gene (Nickerson et al., 2000), from which we ignored one multi-allelic site to simplify our analysis. The smaller number of SNPs in the LPL dataset makes it more manageable for illustrative purposes. In addition, its greater depth of coverage allows us to draw more confident predictions and provides enough individuals to compare results from distinct subsets of the population.
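For concreteness, the boundary-sharing p-value defined in Section 2.2 can be evaluated directly as the hypergeometric tail sum; the sketch below (identifiers ours) uses log-binomial coefficients for numerical stability.

```cpp
#include <algorithm>
#include <cmath>

// Logarithm of the binomial coefficient C(n, k), computed via lgamma; returns
// -infinity for impossible terms so that they contribute zero to the sum.
double log_choose(int n, int k) {
    if (k < 0 || k > n) return -INFINITY;
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
}

// P-value for observing at least m shared boundaries between two partitions
// with B1 and B2 boundaries over S SNPs (S - 1 candidate boundary positions),
// under the null hypothesis of independent, uniformly chosen partitions.
double shared_boundary_pvalue(int B1, int B2, int S, int m) {
    double p = 0.0;
    for (int i = m; i <= std::min(B1, B2); ++i)
        p += std::exp(log_choose(B1, i) + log_choose(S - 1 - B1, B2 - i)
                      - log_choose(S - 1, B2));
    return p;
}
```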