JOURNAL OFCOMPUTATIONAL BIOLOGY Volume10, Number 1,2003 ©MaryAnn Liebert,Inc. Pp. 13–19

Robustness of Inference of Block Structure

RUSSELL SCHWARTZ, 1;3 BJARNI V.HALLDÓRSSON, 1 VINEET BAFNA, 1;4 ANDREW G.CLARK, 2 andSORIN ISTRAIL 1

ABSTRACT

In this report,we examinethe validity ofthe haplotypeblock conceptby comparing block decompositions derived frompublic datasets byvariants of several leadingmethods of block detection.We Žrst developa statistical methodfor assessing the concordanceof two block decompositions.We then assess the robustness ofinferred haplotypeblocks tothe speciŽc detection methodchosen, to arbitrarychoices madein the block-detection algorithms,and to the sample analyzed.Although the block decompositions showlevels ofconcordancethat are veryunlikely bychance, the absolute magnitudeof the concordancemay be lowenough to limit the utility ofthe inference. Forpurposes ofSNPselection, it seems likely thatmethods thatdo not arbitrarily impose block boundariesamong correlated SNPs might perform better thanblock-based methods.

Key words: haplotypeblock, SNP ,recombination,linkage disequilibrium, diversity.

1.INTRODUCTION

inglenucleotides polymorphisms (SNPs) inthe showgreat promise as predictors of Sdisease,but the large number of knownSNPs presentsa signiŽcant challenge in locating those mean- ingfullycorrelated with disease phenotypes. The detection of a haplotypeblock structure to the (Daly et al.,2001;Jeffreys et al.,2001;Johnson et al.,2001;Patil et al.,2001)—in which the genomeis largely made up of regions of low diversity, each of which can be characterized by a small numberof SNPs— presents a possibleway to reduce the complexity of the problem. However, whether constructionof haplotype blocks is the most efŽ cient way to reduce the number of SNPs thatone needs totype to represent the extant genetic variation remains a controversialtopic. Manymethods have been suggested for deŽning block structures, which can be roughly classiŽ ed into threegroups. One group consists of linkage disequilibrium (LD) methods,such as that of Gabriel et al. (2002),which deŽ ne blocks so as to enforce generally high pairwise LD withinblocks and generally low pairwiseLD betweenblocks. Another group consists of diversity-based methods, such as that of Patil et al. (2001),which deŽ ne blocks so as to enforce low sequence diversity, by some diversity measure,

1CeleraGenomics, Informatics Research, Rockville, MD 20850. 2Departmentof Molecular Biology and , Cornell University, Ithaca, NY 14853. 3Currentaddress :Departmentof Biological Sciences, Carnegie Mellon University, Pittsburgh, P A15213. 4Currentaddress :TheCenter for Advancement of Genetics, Rockville, MD 20850.

13 14 SCHWARTZ ETAL. withineach block. Finally, there are methods that look for directevidence of recombination, such as the four-gametetest applied by Hudson and Kaplan (1985), deŽ ning blocks as apparently recombination-free regions.In addition to distinctblock deŽ nitions, blocking methods can differ according to the optimization criterionby which the best block decomposition is chosen among all possible decompositions satisfying theblock test. T woimportant optimization criteria, which were examinedby Zhang et al. (2002)in a paper describinga generalalgorithm for efŽcient block decomposition, are minimization of thenumber of blocks consistentwith the block deŽ nition and minimization of the number of SNPs neededto characterize all sequenceswithin each block. Here we assessthe merits of haplotype block inference by examining robustness of the block concept tomultiple variants in block detection strategies. If haplotypeblocks are genuinely capturing islands of lowdiversity separated by recombination sites, as they are generally represented by proponents, then thevarious methods proposed in the literature for locatingblocks ought to derive essentially the same decompositionseach time they are applied. W eassessthis prediction by developinga statisticfor comparing blockdecompositions and applying it to decompositions derived on publiclyavailable phase-known datasets usingvariants of thecommonly used block detection methods. Inthe remainder of thispaper, we describeour empirical analysis of therobustness of theblock concept. WeŽrst presentour methodology, describing computational methods used for calculatingblock boundaries andpresenting a statisticalmethod for comparingtwo block decompositions. W ethendescribe results ofan analysis assessing the robustness of block methods to variants in the block detection protocol. In particular,we examinerobustness to changes in block deŽ nition, optimization criterion, and population sample,as well as robustness to arbitrarychoices made in thealgorithms. W econcludethat different block decompositionsof a singlegenetic region tend to be far moreconsistent than can be explained by chance, butthat the absolute similarity is nonetheless frequently small. It therefore appears that while there is a commonunderlying structure to haplotype blocks which all methods detect, that structure is less rigidly deŽned than individual block decompositions might suggest.

2.METHODS AND DATA

2.1.Finding block boundaries Inthis paper, we comparevariants of the three general block-detection methodologies that have so far appearedin the literature: recombination-based, diversity-based, and linkage-disequilibri um(LD) based. Weusethe four-gamete test (Hudson and Kaplan, 1985) as the simplest example of arecombination-based test.One problem with the four-gamete test is that it isbased on the inŽ nite sites model (in which a given sitemutates at most once during a sequence’s evolutionaryhistory) and can falsely infer recombination eventswhen that model fails due to recurrent mutations. That problem is not a signiŽcant drawback for blockconstruction, however, since sites of recurrent mutation can be reasonably considered to disrupt blockpatterns even if they do not correspond to recombination sites. For adiversity-basedtest, we usea generalizationof thePatil et al. (2001)test. In their test, a regionis ablockif at least80% of the sequences occurin more than one chromosome. This test was developedfor asampleof only20 chromosomes and doesnot scale well to larger sample sizes as it will tend to yield larger blocks as more chromosomes are studied.W egeneralizedthis test by deŽ ning a regionas a potentialblock if sequences within that region ()accounting for atleast 80% of the sampled population each occur in at least 10% of the sample.Finally, we developeda testbased on theD’ statistic,similar to that used by Gabriel et al. (2002) butwith the test of signiŽcance tuned to produce more meaningful results on smallpopulation samples. In thisapproach, we considera setof SNPs toform ablockif theyare contiguous and the D’ valueof every pairof SNPs withinthe block shows signiŽ cant LD witha P-valueof < 0:001,as estimated by simulations usingthe empirically measured single-site allele frequencies for agivenSNP pairand the assumption of completelinkage equilibrium. Wefurtherconsider the two common optimization criteria discussed in theintroduction: minimizing the totalnumber of blocksand minimizing the number of SNPs neededto characterizethe block decomposition ofevery sequence assuming all sequences contain only observed haplotypes within each block. W esolve optimallyfor eachmeasure using a variantof thedynamic programming algorithm of Zhang et al. (2002) modiŽed to sample uniformly at random from alloptimal solutions rather than deterministically choosing asingleoptimum. All of themethods were implementedin C++ running on DEC AlphaUnix machines. ROBUSTNESS OFHAPLOTYPEBLOCK INFERENCE 15

2.2.A statistic for blockcomparison Weusethe number of shared block boundaries as a statisticfor thesimilarity of two block partitions. Thisyields a computationallytractable method for exactlycomputing the p-value for rejectingthe null hypothesisthat two block decompositions were chosenat random (uniformly) and independently from one another. If B1 and B2 arethe number of boundariesin the two partitions, m thenumber of boundariesshared bythe partitions, and S thetotal number of SNPs, thenthe probability they share exactly m boundaries underthe null hypothesis is

B S 1 B 1 ¡ ¡ 1 m B2 m ³ ´³ ¡ ´ S 1 ¡ B ³ 2 ´ yieldinga p-value(probability of the statistic being at least m) of

B1 S 1 B1 min.B1;B2/ ¡ ¡ i B2 i ³ ´³ ¡ ´: S 1 i m ¡ XD B ³ 2 ´ Thismeasure thus allows us totest the hypothesis that two partitions are related and provides a degreeof conŽdence with which we canreject the null hypothesis of independence.

2.3. Data For evaluation,we relyon two publicly available datasets. The Ž rst isthe Perlegen chromosome 21 dataset (Patil et al.,2001),which consists of 24,047SNPs typedon 20phased chromosomes. This dataset contains alargecontiguous set of SNPs providingan excellent test of blocking algorithms. Although a substantial portionof chromosome 21 was notcovered by multi-SNP blocks by any method, it yielded enough long blocksto clearly detect statistically signiŽ cant concordances between distinct block decompositions. W e alsouse a datasetderived from 71individualstyped at 88polymorphicsites in the human lipoprotein lipase (LPL) gene(Nickerson et al.,2000),from whichwe ignoredone multi-alleic site to simplifyour analysis. Thefewer SNPs inthe LPL datasetmakes it more manageable for illustrativepurposes. In addition, its greaterdepth of coverage allows us to draw moreconŽ dent predictions and provides enough individuals tocompare results from distinctsubsets of the population sample.

3. RESULTS

WeŽrst askedwhether the concept of blocks is robust to the various block measures. W econducted comparisonsby running the available algorithms on the two datasets and comparing the outcomes using theshared boundary statistic described above. T able1 describesthe results when run on the chromosome 21data of Patil et al. (2001).W enotethat it is also possible to compare an algorithm to itself because ofthefact that the block deŽ nitions generally yield many equally good solutions of whichthe algorithms mustchoose just one. By independently sampling among the optima, we cancompare distinct runs of a singlealgorithm, revealing what is robust to that single deŽ nition as well as providing a baselinefor how muchsolutions derived by distinct methods can possibly coincide with one another .Althoughthe percent similaritiesbetween distinct methods or even within any one method are not high in an absolute sense, theyare much greater than can be explained by chance. The four-gamete test appears much closer to the diversity-and LD-based tests than either of those is to the other .Furthermore,minimum SNP solutions appearto have more in common with one another than minimum block solutions. For purposesof illustration,we providevisual comparisons of blockassignments for theLPL datasetof Nickerson et al. (2000).Due to themuch smaller number of SNPs intheLPL datasetcompared to thechro- mosome21 dataset, pairwise comparisons do not generally yield statistically signiŽ cant results, although 16 SCHWARTZ ETAL.

Table 1. Comparisons of Block DeŽ nitionson the Chromosome 21 Dataset of Patil et al. Minimizing Blocksa

4-gamete Diversity LD

Min. blocks 4-gamete45.3%/ 7 :25 10 1206 15.7%/3:02 10 162 12.3%/1:70 10 18 £ ¡ £ ¡ £ ¡ Diversity -/- 33.8%/2:97 10 626 7.19%/3:89 10 5 £ ¡ £ ¡ LD -/- -/- 54.3%/3:09 10 1761 £ ¡ Min. SNPs 4-gamete65.3%/ 2 :16 10 2276 22.6%/5:30 10 366 16.0%/1:03 10 54 £ ¡ £ ¡ £ ¡ Diversity -/- 51.5%/6:71 10 1222 9.70%/6:14 10 23 £ ¡ £ ¡ LD -/- -/- 70.5%/7:67 10 2921 £ ¡ aEach element ofthe matrix gives the percentage ofblock boundaries assigned by either method that are sharedby both,followed by the p-value of theoverlap. Elements comparinga methodto itself showvalues for twodistinct runs of the same algorithmeach ofwhichis choosingan optimal solution uniformly at random.

FIG. 1. Pairwisecomparison of block boundaries for the different block deŽ nitions for the LPL datasetof Nickerson et al. Eachimage shows the block boundaries as verticallines, with boundaries above the horizontal line coming from theŽ rstmethod and those below the horizontal line coming from the second method. Those boundaries appearing aboveand below the horizontal line are shared by both methods and are drawn thicker in order to highlight them.

Table 2. Comparisons of Minimal Block to Minimal SNP Optimization Criteria for Each Block DeŽ nition on the Chromosome 21 Dataset of Patil et al.a

Tests % Identity P-value

4-gamete 41.4% 4:51 10 1038 £ ¡ Diversity-based 29.9% 6:75 10 526 £ ¡ LD-based 49.8% 7:19 10 1513 £ ¡ aEach element ofthe table gives the percentage ofblock boundaries assigned by eithermethod that are sharedby both, followed by the p-value of theoverlap.

theyshow comparable percent agreement to the pairwise comparisons. Figure 1 illustratesthe correspon- dencebetween the different block measures by showingside-by-side comparisons of blockboundaries for each.The results appear to showgenerally poor agreement between block boundaries derived from distinct measures,with somewhat better agreement between distinct runs of asinglemeasure. Inorder to assess the role of optimizationcriteria in concordance, we conductedfurther pairwise com- parisonsbetween the two criteria— minimum blocks or minimum SNPs— for eachof the three block deŽnitions. T able2 showscomparisons of the two criteria for eachof thethree block deŽ nitions examined. ROBUSTNESS OFHAPLOTYPEBLOCK INFERENCE 17

FIG. 2. Pairwisecomparison of block boundaries for the different optimization methods for the LPL datasetof Nickerson et al. Thecomparisons shown are ( A) 4-gamete (B)diversity-based( C) LD-based.

Table 3. Comparisons of Runs of a Single Block Assignment Algorithm on Two Halves of a Balanced Randomly Selected Partitionof the LPL Dataset of Nickerson et al.

Minimizingblocks MinimizingSNPs

Tests % Identity P-value % Identity P-value

4-gamete 35.1% 0.00287 34.0% 0.024 Diversity-based 0.00% 1.00 13.8% 0.372 LD-based 22.9% 0.072 16.3% 0.569

FIG. 3. Comparisonsof block assignments using distinct population samples for a singlemethod. Each row shows, fora singleblock deŽ nition and optimization criterion, a comparisonof boundaries derived from one part of the partitioneddataset to boundaries derived from the other part followed by comparisons of each of the partial-dataset solutionsto the full-dataset solution.

Weseesubstantially higher correspondence between distinct optimization criteria for asingleblock deŽ - nitionthan we saw betweendistinct block criteria for asingleoptimization method. This result suggests thatminimizing the number of blocks is a reasonableapproximation to minimizing the number of SNPs. Figure2 illustratesthis greater concordance in the LPL data,particularly for thefour-gamete method. Furthermore,the Ž gureshows that when block decompositions do disagree, it is often because of aslight slippagein some boundaries rather than a sizableoverall change in block structure. Itis also important to assess our ability to detect haplotype blocks for realisticsizes of datasets. W e assessedthis with the LPL databy dividingthe dataset into two subsets, randomly placing individuals into oneor the other, then comparing block decompositions drawn from thetwo samples by a singlemethod. Wedidnot attempt this analysis on the Patil et al. (2001)chromosome 21 dataset because the number ofchromosomes would have been too small for thehalf-size datasets to yield reasonable results. T able3 describesthe results. There was substantiallyworse agreementbetween runs on separate half-size samples (eachwith 71 chromosomes) than we saw betweendistinct runs on the same full-size sample (with 142 chromosomes).This is discouraging news, because it shows that even a sampleof 35 individuals may be toosmall to reliably capture haplotype block structures. The four-gamete test shows the most robustness tosample and the diversity-based test the least. W eattributethe lack of statistical signiŽ cance of most resultsprimarily to the small number of SNPs inthe dataset. Figure3 illustratesthe similarities of theblock decompositions. Different block decomposition algorithms shownoticeably weaker correspondence than when the same algorithm is run twice on the same full-size 18 SCHWARTZ ETAL. dataset.Some slippage of boundaries no longer seems an adequate explanation for thediscrepancies. While theblock structures have some correspondence, individual boundaries in oneno longergenerally have clear correspondingboundaries nearby in the other.

4.DISCUSSION

Our resultsfor comparisonsbetween distinct methods support the notion that there is atendencyfor the variabilityin the human genome to be organized into blocks of adjacentsites that share ancestral history. Theseblocks can be identiŽ ed crudely by a varietyof adhoc algorithms. Our resultsshow unequivocally thatthe different block-Ž nding algorithms identify similar structure to an extent that cannot be explained bychance. They also show ,however,that the absolute correspondence between block assignments can differmarkedly in response to changes in both block deŽ nition and optimization criterion. Our resultsappear consistent with two contradictory explanations: that haplotype blocks are a valid scientiŽc realitybut require greater sample sizes or better algorithms to detect reliably, or thatthe population levelprocesses of mutation,drift, and recombination that give rise to haplotypeblocks do soin awaythat makesit very difŽ cult to settle on a singleuniŽ ed deŽ nition that best captures that history. W earguein favorof the latter hypothesis on the grounds that if the blocks were simplyand universally well-deŽ ned, thencurrent methods ought to be robust to a varietyof measures of diversity, linkage disequilibrium, andrecombination probability. The fact that these measures do not consistently provide the same block decompositionsuggests that the imperfections lie in the blocks themselves and not the methods. Thisresult does not contradict the evidence for recombinationhotspots in the genome (Jeffreys et al., 2001),but does suggest that hotspots do notin general lead to a welldeŽ ned block structure; this may be eitherbecause existing computational methods cannot detect them reliably or becausethey do notaccount for asufŽciently large fraction of the total recombination in the genome. Nor doour results contradict thenotion that there are genuine regions of low diversity, high linkage disequilibrium, or low historic recombinationin the human genome. Rather, they argue that such regions are not as sharply deŽ ned as theconcept of discrete blocks would imply. Nonetheless, the fact that blocks, however deŽ ned, do allow oneto signiŽ cantly compress the information in SNP datasetssuggests that they can be quite useful for theprimary task of representingSNP datasetsby an informative subset of theSNPs. Theinconsistency of blockdecompositions, however, coupled with the correlation of SNPs amongadjacent blocks, suggests that othermethods might perform better than haplotype blocks. In particular, we suggestconsidering methods for reducinghaplotype redundancy, such as the Alignment Algorithm of Schwartz et al. (2000),that detect conservedhaplotype regions without requiring a globalblock structure.

5.ACKNOWLEDGMENTS

Wewouldlike to thank Francis Kalush, Francisco de la V ega,Ross Lippert, Michele Cargill, Kit Lau andMark Adams for manyvaluable discussions and technical suggestions about SNPs andhaplotypes data analysischallenges, and for sharingthe Applera Resequencing Project data with us.

REFERENCES

Daly,M., Rioux, J., Schaffner,S., and Hudson, T .2001.High-resolution haplotype structure in the human genome. Nat. Genet. 29,229– 232. Gabriel,S., Schaffner,S., Nguyen, H., Moore, J., Roy,J., Blumenstiel,B., Higgins, J., DeFelice,M., Lochner,A., Faggart,M., Liu-Cordero, S.N., Rotimi, C., Adeyemo, A., Cooper,R., W ard,R., Lander, E.S., Daly,M.J., and Altschuler,D. 2002.The structure of haplotype blocks in the human genome. Science 296,2225– 2229. Hudson,R., Kaplan, N. 1985.Statistical properties of the number of recombination events in the history of a sample ofDNA sequences. Genetics 111,147– 164. Jeffreys,A., Kauppi, L., Neumann, R. 2001.Intensely punctute meiotic recombination in the class II regionof the majorhistocompatibility complex. Nat. Genet. 29,217– 222. ROBUSTNESS OFHAPLOTYPEBLOCK INFERENCE 19

Johnson,G.C., Esposito, L., Barratt, B.J., Smith, A.N., Heward, J., Di Genova,G., Ueda,H., Cordell, H.J., Eaves, I.A.,Dudbridge, F .,Twells,R.C., Payne, F .,Hughes,W .,Nutland,S., Stevens, H., Carr,P .,Tuomilehto-Wolf,E., Tuomilehto,J., Gough, S.C., Clayton,D.G., and T odd,J.A. 2001. Haplotype tagging for the identiŽ cation of common diseasegenes. Nat. Genet. 29,233– 237. Nickerson,D.A., T aylor,S.L., Fullerton,S.M., W eiss,K.M., Clark, A.G., Stengaard, J.H., Salomaa, V .,Boerwinkle,E., Sing,C.F .2000.Sequence diversity and large-scale typing of SNPs inthe human apolipoprotein E gene. Genome Res. 10,1532– 1545. Patil,N., Berno,A.J., Hinds, D.A., Barrett, W .A.,Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C.,McDonough,D.P .,Nguyen,B.T .,Norris,M.C., Sheehan, J.B., Shen, N., Stern, D., Stokowski, R.P .,Thomas, D.J.,Trulson, M.O., Vyas, K.R., Frazer, K.A., Fodor, S.P .,andCox, D.R. 2001.Blocks of limitedhaplotype diversity revealedby high resolution scanning of human chromosome 21. Science 294,1719– 1722. Schwartz,R., Clark, A.G., and Istrail, S. Methodsfor inferring block-wise ancestral history from haploid sequences. InGuigó, R.,and GusŽ eld, D., eds., Algorithmsin Bioinformatics.Lecture Notes in ComputerScience 2452,44– 59, Springer,Berlin. Zhang,K., Deng,M., Chen, T .,Waterman,M., andSun, F .2002.A dynamicprogramming algorithm for haplotype blockpartitioning. Proc.Natl. Acad. Sci. USA 99,7335– 7339.

Addresscorrespondence to: SorinIstrail CeleraGenomics InformaticsResearch 45W estGude Drive Rockville,MD 20850

E-mail: [email protected]