Copyright © 2004 by the Genetics Society of America
DOI: 10.1534/genetics.103.021584

Application of Coalescent Methods to Reveal Fine-Scale Rate Variation and Recombination Hotspots

Paul Fearnhead,* Rosalind M. Harding,†,1 Julie A. Schneider,†,2 Simon Myers‡ and Peter Donnelly‡,3

*Department of Mathematics and Statistics, Lancaster University, Lancaster LA1 4YF, United Kingdom, †Medical Research Council Molecular Hematology Unit, Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, United Kingdom and ‡Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom

Manuscript received August 22, 2003
Accepted for publication April 29, 2004

ABSTRACT

There has been considerable recent interest in understanding the way in which recombination rates vary over small physical distances, and the extent of recombination hotspots, in various genomes. Here we adapt, apply, and assess the power of recently developed coalescent-based approaches to estimating recombination rates from sequence polymorphism data. We apply full-likelihood estimation to study rate variation in and around a well-characterized recombination hotspot in humans, in the β-globin gene cluster, and show that it provides similar estimates, consistent with those from sperm studies, from two populations deliberately chosen to have different demographic and selectional histories. We also demonstrate how approximate-likelihood methods can be used to detect local recombination hotspots from genomic-scale SNP data. In a simulation study based on 80 100-kb regions, these methods detect 43 out of 60 hotspots (ranging from 1 to 2 kb in size), with only two false positives out of 2000 subregions that were tested for the presence of a hotspot. Our study suggests that new computational tools for sophisticated analysis of population diversity data are valuable for hotspot detection and fine-scale mapping of local recombination rates.

1 Present address: Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom.
2 Present address: National Cancer Institute, 31 Center Dr., 31/10A52 Bethesda, MD 20892-2590.
3 Corresponding author: Department of Statistics, University of Oxford, 1 S. Parks Rd., Oxford OX1 3TG, United Kingdom. E-mail: [email protected]

Genetics 167: 2067–2081 (August 2004)

RECOMBINATION plays a central role in shaping patterns of molecular genetic diversity in natural populations. There is growing evidence of extensive variation in recombination rates over scales as small as kilobases, although neither the mechanism nor the pattern of this variation is well understood (Daly et al. 2001; Jeffreys et al. 2001; Petes 2001; Nachman 2002; Schneider et al. 2002). In humans, the traditional approach to estimating recombination rates has been through pedigree studies, but these have resolution only over scales of centimorgans (or megabases in physical distance at the human genome-wide average recombination rate). Even in organisms where breeding experiments are more straightforward, the landscape of fine-scale variation in the recombination rate remains relatively uncharted.

Recently, laboratory-based studies have mapped recombination hotspots within several human loci from analysis of crossovers detected in sperm, for example, to intervals of 1–2 kb within the major histocompatibility complex (MHC) class II region (Jeffreys et al. 2001; Jeffreys and Neumann 2002). However, achieving high resolution using laboratory-based methods is technically difficult and costly, and the recombination rate estimates are specific to males. As there are substantial differences in sex-specific recombination rates at the centimorgan scale (Kong et al. 2002), we should not expect fine-scale recombination rate estimates from sperm studies to be fully informative of the evolutionary process of recombination, which averages across males and females, and over long spans of time.

Another source of information about local recombination rates is population genetics data. These data do not simply enumerate a direct count of the recombination events, but have the advantage of reflecting the evolutionary process. While qualitative information about underlying recombination rates can sometimes be obtained from standard pairwise measures of linkage disequilibrium (LD), quantitative estimation of recombination rates from polymorphism data is a challenging statistical problem that has only recently begun to attract considerable attention (e.g., Griffiths and Marjoram 1996a; Kuhner et al. 2000; Wall 2000; Fearnhead and Donnelly 2001, 2002; Hudson 2001; McVean et al. 2002). At one end of a spectrum of statistical approaches are simpler methods, which ignore much of the information in the sample and base inference on one or a few summary statistics (Wall 2000). At the other extreme, so-called full-likelihood methods utilize all of the information in the data, but at the expense of being highly computationally intensive (Griffiths and Marjoram 1996a; Kuhner et al. 2000; Fearnhead and Donnelly 2001). An intermediate class of computational methods approximates the true likelihood in various ways (Hudson 2001; Fearnhead and Donnelly 2002; McVean et al. 2002; Li and Stephens 2003). To date, virtually all statistical methods have assumed that the recombination rate is constant over the interval being studied.

With the advent of surveys of molecular genetic variation on genomic scales, particularly but not exclusively in humans, computational population-based methods should enable fine-scale recombination rate variation to be characterized across the human and other genomes, providing information that may be crucial not only for disentangling the molecular, demographic, and selective factors generating linkage disequilibrium, but also for association mapping of complex diseases (Kruglyak 1999; Jorde 2000; Ott 2000; Pritchard and Przeworski 2001; Reich et al. 2001).

Here we adapt, apply, and assess two methods to study different aspects of fine-scale variation in recombination rates from population data. We first apply full-likelihood estimation (Fearnhead and Donnelly 2001) to study fine-scale rate variation around and inside one of the most well-characterized recombination hotspots in the human genome. The β-globin gene complex on human chromosome 11p (Figure 1) includes a recombination hotspot that was originally identified by Chakravarti et al. (1984). In that study, a pattern of random association between 5′ and 3′ haplotypes for the β-globin complex, based on restriction fragment length polymorphisms (RFLPs), was replicated in four populations, suggesting a recombination hotspot 5′ to the β-globin gene within a 9.1-kb interval marked by a TaqI RFLP 5′ to the δ-globin gene (Figure 1). Chakravarti et al. (1984) estimated recombination rates to be elevated 3–30 times above the genome-wide average. Recent single-sperm typing of ~5000 informative meioses gave an estimate for the male recombination fraction across the hotspot of 80 × 10⁻⁵ kb⁻¹ [95% confidence interval (C.I.): 9 × 10⁻⁵–160 × 10⁻⁵] and detected no recombination events over the adjacent 5′ 90-kb region (Schneider et al. 2002). (The human genome-wide average recombination rate is ~1 × 10⁻⁵ kb⁻¹.) There have also been six observations to date in families of recombination with flanking exchange in the β-globin cluster (Smith et al. 1998). The precision with which these crossovers can be localized to the hotspot varies, being limited by the location of informative single-nucleotide polymorphisms (SNPs) in the parental chromosomes. However, location within the hotspot can be confirmed for most, and none is definitely outside it. A recent population-based analysis, adopting different statistical approaches from those described here, narrowed the recombination hotspot to a 2-kb region just 5′ of the β-globin gene (Wall et al. 2003).

The availability of molecularly phased population data from a worldwide sample of populations, and of direct estimates of recombination rates across the hotspot from sperm studies, is ideal for examining both the accuracy of full-likelihood estimation and its sensitivity to different population demographic histories. We analyze samples from two populations, The Gambia and the United Kingdom, deliberately chosen on the basis of their different demographic histories and selective pressures. Another valuable feature of these data for the present analyses is that there are no additional uncertainties associated with allelic (haplotype) inference by statistical methods.

There is growing evidence that recombination hotspots may be widespread across the human genome, although to date relatively few have been characterized. It is thus of considerable interest to be able to detect hotspots (defined here as small regions where the recombination rate is increased considerably relative to the local background rate) on genomic scales, and such information would also be invaluable for the design and analysis of disease studies. Again, one potential source of information is population data, with the increasing availability of population surveys of genetic diversity over genomic scales. Because of the computational burden involved, it does not seem practicable to adapt full-likelihood methods to this problem. Here we present and study a new method for detecting hotspots on the basis of an approximate-likelihood approach (Fearnhead and Donnelly 2002), which can closely mimic full-likelihood inference. Informally, the approach is to separately analyze small subregions of the genome, and then to detect hotspots by comparing the likelihood curves for each subregion against that for an underlying background rate.

MATERIALS AND METHODS

β-Globin DNA sequences: The haplotype sequences were reported in Harding et al. (1997) and consist of data from 31 chromosomes from The Gambia and 46 chromosomes from Oxfordshire, United Kingdom. The chromosomes were sequenced in a 3-kb region that encompasses the β-globin gene (see Figure 1). Haplotypes were determined experimentally. Polymorphisms at sites 318 and 320 are associated with length variation rather than with nucleotide substitution and were excluded from our analysis. All other polymorphic sites were included.

Full-likelihood inference: Within the coalescent framework, the optimal approach to analyzing population data is via full-likelihood statistical methods: these methods use all of the information in the data. Calculating the likelihood surface is usually intractable, as it involves integrating over all possible genealogical histories that are consistent with the data. There are a number of computational approaches to approximating the full-likelihood surface, via importance sampling (Griffiths and Tavaré 1994; Fearnhead and Donnelly 2001) or

Markov chain Monte Carlo (Kuhner et al. 1995); see Stephens and Donnelly (2000) for a review.

Approximating the full-likelihood surface in the presence of recombination is a particularly challenging statistical problem. All the computational methods can produce an arbitrarily accurate approximation to the full-likelihood surface given infinite computing time. This is not necessarily the case for practical amounts of computing time, especially for some available methods, for which the approximations can be poor (Fearnhead and Donnelly 2001).

We use the infs program of Fearnhead and Donnelly (2001) to estimate the local recombination rate for the β-globin region. This is an importance sampling method for approximating the likelihood surface of the recombination and mutation rates, under a coalescent model, and has been demonstrated to be substantially more efficient than alternative methods [orders of magnitude more efficient than the methods of Griffiths and Marjoram (1996a) and Kuhner et al. (2000), but it can still take days of computing time to analyze even small data sets accurately, particularly when the recombination rate is large]. The model assumes a panmictic, constant-sized population, and an infinite-sites mutation model (which excludes the possibility of repeat mutation), but has been shown in simulation studies to perform reasonably well when the true demographic scenario differs from this (Fearnhead and Donnelly 2001). We return in the discussion to possible consequences of departures from these assumptions. The method is based on (i) simulating a set of possible genealogical histories for the data, (ii) calculating the likelihood surface for each of these histories, and (iii) approximating the full-likelihood surface by a suitable weighted average of the surfaces for each history.

Approximate-likelihood inference: The computational cost of full-likelihood inference has led to a number of approximate-likelihood methods being developed (Hudson 2001; Fearnhead and Donnelly 2002; McVean et al. 2002; Li and Stephens 2003). Such approximate methods seem essential for analysis of data over large genomic regions.

The approximate marginal-likelihood (AML) method of Fearnhead and Donnelly (2002) is based on an approximation to the likelihood surface for the data at sites whose minor allele frequency is above a certain threshold. The simulation results in Fearnhead and Donnelly (2002) suggest that if only singleton mutations are removed, then the AML can be a very close approximation to the full likelihood, but with a computational saving of 1–3 orders of magnitude. For detecting the hotspots from the simulated data, we used the program sequenceLD (available from www.maths.lancs.ac.uk/~fearnhea/software) to calculate the AML surface.

The pairwise likelihood (PL), which has also been called the composite likelihood, is a method devised by Hudson (2001) and extended by McVean et al. (2002). It involves multiplying together the likelihood surface for all pairs of segregating sites. This is a different approximation from that of the AML, but its computational cost is much smaller. In particular it can be used to analyze data with a large number of segregating sites (which the AML cannot). We adapted the program LDhat (available from www.stats.ox.ac.uk/~mcvean) to calculate the PL surface for the recombination rate for region II of the β-globin gene (see Figure 1).

Interval estimation: Both the full- and the approximate-likelihood methods estimate the parameters as the values that maximize the corresponding likelihood surface. Obtaining confidence intervals is more difficult, as the standard asymptotic results for the distribution of either the estimators or the likelihood-ratio statistic do not necessarily hold due to the dependence between the data from different chromosomes. Currently, the only available theoretical results concerning estimators of the population-scaled recombination rate are to do with their consistency (Fearnhead 2003).

For the case of full likelihood and AML, simulation results suggest that confidence intervals with approximately the correct coverage probabilities can be obtained using the usual chi-square approximation to the generalized likelihood-ratio statistic (Fearnhead and Donnelly 2001, 2002). Less seems to be known about obtaining confidence intervals from the PL.

Recombination detection: One alternative to estimating recombination rates is to try to detect historical recombination events from population data. A recent method, that of Myers and Griffiths (2003), uses a dynamic programming approach to provide a lower bound on the minimum number of recombination events in each subset of SNPs in the region. The approach is nonparametric, in the sense that it is not based on any model for the evolution of the population (although as currently implemented it does assume no repeat mutations). In contrast, all the statistical methods described above do require an evolutionary model [most are based on the standard coalescent, although several, for example, those of McVean et al. (2002), Fearnhead and Donnelly (2002), and Li and Stephens (2003), explicitly allow for repeat mutation], but their output is also different, in providing an estimate of the underlying recombination rate.

RESULTS

β-Globin hotspot: We applied the full-likelihood approach for estimating recombination rates from population data to previously reported sequence haplotypes (see Materials and Methods) from a 3-kb region (Figure 1) extending across the 3′ boundary of the hotspot. This data set is just within what can be practicably analyzed using the full-likelihood approach. Comparison of full-likelihood estimates with direct estimates from sperm typing, and with other population-based estimates, allows an assessment of the accuracy of the method. Comparison of estimates across two different population samples allows an empirical check of the robustness of the estimates to population demographic history.

To analyze rate variation, we divided the 3-kb region into a number of small subregions and separately estimated the recombination rate in each subregion. We used four subregions (see Figure 1). The choice of region I (as the first 510 bp) was based in part on the need for a region small enough to allow reliable estimation of the likelihood surface, and in part on the lack of polymorphism across the remainder of the sequence within the hotspot (effectively our region II), suggesting that recombination rates across region II would be better estimated by comparing haplotypes between regions I and III–IV. Region III consists of two exons and one intron. We excluded it from our analysis because it contains only two SNPs, one being the sickle cell mutation (position 917). There are no nonsynonymous mutations in region IV in our data.

A summary of the pattern of polymorphism in each sample is given in Table 1. The polymorphism data are consistent with a larger effective population size for the African compared to the non-African population, as

has been observed elsewhere (e.g., Frisse et al. 2001). Previous analyses of this region have shown that the pattern of polymorphism is consistent with selective neutrality (Harding et al. 1997).

Figure 2 gives the joint-likelihood surfaces in the population mutation and recombination parameters for regions I and IV in both samples. These convey the information about recombination rates in the sequence data to which they relate. Table 2 gives point estimates and confidence intervals obtained from the likelihood surfaces as described in Materials and Methods.

In light of the relatively high level of variation in the region, one possible explanation for the high ρ estimates for region I is that it happens to have a deep genealogy by chance, and hence a greater opportunity for recombination events to happen and to be detected. The full-likelihood method incorporates inferences about the depth of the genealogy when producing estimates of the recombination rate. So given a fixed number of recombination events, evidence for a deep genealogy will reduce (as it should) the estimate of ρ. Furthermore, changing the estimate of θ will affect how deep the genealogy is likely to be, and thus varying θ enables us to assess the sensitivity of our estimate of ρ to this effect. Here, even varying θ by a factor of 2 has little effect on the estimates of ρ (data not shown).

Figure 1. Map of the β-globin gene cluster (GenBank accession no. NG_000007). The 3067-bp region sequenced extends from positions 61290 to 64357. Throughout this article we number from the first (5′) position of the sequenced region, as in Harding et al. (1997). To simplify discussion we break the sequenced region into four subregions: (I) 1–510, (II) 511–897, (III) 898–1342, and (IV) 1343–3067 (see inset). Subregions I and II lie in the 5′ flanking sequence and untranslated region of the β-globin gene. Subregion III contains the first two exons and the first intron, and subregion IV contains the second intron, third exon, and 746 bp of 3′ flanking sequence. The hotspot is known to be (somewhere) between a TaqI site at location −8578 and a HgiAI site at position 906. Our regions I and II are within this region. All but the first few base pairs of region III and all of region IV are 3′ of it.

TABLE 1
Summary of polymorphism in the two samples

Population        Region   No. of       No. of              π kb⁻¹   θ̂ kb⁻¹
                           haplotypes   polymorphic sites
The Gambia        I        13           9                   5.43     4.42
                  II       1            0                   0.00     0.00
                  III      2            2                   0.80     1.14
                  IV       8            10                  1.30     1.44
                  Total    21           21                  1.70     1.63
United Kingdom    I        8            5                   3.70     2.23
                  II       1            0                   0.00     0.00
                  III      2            1                   0.53     0.52
                  IV       5            9                   1.51     1.18
                  Total    12           15                  1.55     1.11

Sample sizes are 31 chromosomes from The Gambia and 46 from the United Kingdom. Sequence regions are defined in Figure 1. The statistics π, the average number of pairwise differences, and θ̂, Watterson's estimator, are both unbiased estimators of the scaled neutral mutation rate θ = 4N_e u, where u is the mutation rate across the relevant region per generation and N_e is the effective population size.
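As a rough illustration of the two summary statistics defined in the note to Table 1, the following Python sketch computes Watterson's estimator from a count of polymorphic sites and π from a set of haplotypes. The function names and the three toy haplotypes are hypothetical and are not part of the original analysis; the region I figures (9 polymorphic sites, 31 chromosomes, 510 bp) are taken from Table 1 and Figure 1.

```python
from itertools import combinations


def watterson_theta(num_polymorphic_sites, sample_size):
    """Watterson's estimator: S divided by the harmonic sum 1 + 1/2 + ... + 1/(n-1)."""
    harmonic_sum = sum(1.0 / i for i in range(1, sample_size))
    return num_polymorphic_sites / harmonic_sum


def pairwise_diversity(haplotypes):
    """Average number of differences over all pairs of equal-length haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


# Region I, The Gambia (Table 1): 9 polymorphic sites in 31 chromosomes over 510 bp.
theta_per_kb = watterson_theta(9, 31) / 0.510
print(f"Watterson's theta per kb: {theta_per_kb:.2f}")

# Hypothetical toy haplotypes, coded 0/1 at three sites.
toy_haplotypes = ["000", "010", "011"]
print(f"pi for the toy sample: {pairwise_diversity(toy_haplotypes):.3f}")
```

Dividing the per-region estimate by the region length in kilobases agrees, to rounding, with the region I entries for θ̂ in Table 1 for both samples.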

Figure 2. Likelihood surfaces in the scaled mutation and recombination rates θ = 4N_e u and ρ = 4N_e r, where u and r are the mutation and recombination rates per generation across the relevant region, and N_e is the effective population size. (a) Region I, The Gambia; (b) region I, United Kingdom; (c) region IV, The Gambia; (d) region IV, United Kingdom.
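The definition ρ = 4N_e r can be inverted to turn a scaled estimate into a per-generation rate. The sketch below does this for the region I estimates in Table 2, using the effective population sizes adopted in the text (15,000 for The Gambia and 9500 for the United Kingdom). It assumes that the tabulated ρ values are expressed per kilobase, an assumption made here only because it reproduces the tabulated rates to within rounding; the function name is hypothetical.

```python
def recombination_rate_per_kb(rho_per_kb, effective_size):
    """Invert rho = 4 * N_e * r to get r, the recombination rate per kb per generation."""
    return rho_per_kb / (4.0 * effective_size)


# Region I maximum-likelihood estimates of rho from Table 2 (assumed to be per kb),
# paired with the effective population sizes used in the text.
region_one = {
    "The Gambia": (32.3, 15_000),
    "United Kingdom": (13.5, 9_500),
}

for population, (rho, n_e) in region_one.items():
    rate = recombination_rate_per_kb(rho, n_e)
    print(f"{population}: {rate * 1e5:.1f} x 10^-5 per kb")
```

Because N_e enters only as a divisor, changing it rescales every region within a population by the same factor, which is why the choice of N_e does not affect comparisons between regions.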

To estimate the population recombination rate in region II, we assumed that the recombination rates in regions I, III, and IV were given by the respective full-likelihood estimates. The recombination rate between any pair of sites that straddle region II can then be parameterized solely in terms of the recombination rate within region II. We estimated this parameter using an extension of the PL method described in the Materials and Methods section. For each pair of sites that straddle the region we calculated the likelihood of the recombination rate in region II. Multiplying these likelihoods together for all such pairs gives us a pairwise likelihood that we maximized to estimate the recombination rate in region II. We converted the point estimates of ρ into estimates of the recombination rate per kilobase, using values of 15,000 and 9500 for the effective population size, N_e, of The Gambia and the United Kingdom, respectively. These values were obtained from the analysis of Harding et al. (1997), but we note that there are now many estimates of the human effective population size in the region 10,000–20,000 (Przeworski et al. 2000), and so the exact values used for N_e are not crucial to our main conclusions, nor are they a factor when comparing recombination rate estimates between regions within a population.

TABLE 2
Recombination rate estimates in the two samples

Population        Region   MLE of θ     MLE of ρ     Estimated recombination   ~95% C.I.
                           for region   for region   rate/kb
The Gambia        I        3.4          32.3         54 × 10⁻⁵                 [8.4 × 10⁻⁵, 460 × 10⁻⁵]
                  II       0.0          37.0         62 × 10⁻⁵
                  IV       1.0          0.0          0.0 × 10⁻⁵                [0.0 × 10⁻⁵, 1.7 × 10⁻⁵]
United Kingdom    I        1.5          13.5         35 × 10⁻⁵                 [3.9 × 10⁻⁵, 170 × 10⁻⁵]
                  II       0.0          15.0         40 × 10⁻⁵
                  IV       1.2          0.2          0.5 × 10⁻⁵                [0.0 × 10⁻⁵, 4.0 × 10⁻⁵]

Estimates of scaled mutation (θ) and recombination (ρ) rates, and of recombination rate (with ~95% C.I.) per kilobase, for regions I, II, and IV in the two populations are shown. The genome-wide average recombination rate is ~1 × 10⁻⁵ kb⁻¹. See Materials and Methods for details of estimation. Estimation for regions I and IV is based on full likelihood. The lack of polymorphism in region II precludes this, and estimates for region II are based on maximizing a pairwise likelihood, which does not provide confidence intervals.

While the exact numerical estimates have associated uncertainty, two features are striking. The analysis suggests a difference of around two orders of magnitude in recombination rates in two short sequences separated by less than 1 kb. Further, the estimated rate in regions I and II is 30–50 times higher than the genome-wide average, suggesting that this 897-bp region is “hotter” than all but one of the recently characterized hotspots in the MHC (Jeffreys et al. 2001). The recombination rate per kilobase in regions I and II is similar to the average (male) rate per kilobase across the entire hotspot as estimated directly from sperm typing. The data are not consistent with all of the recombination activity across the known hotspot occurring within the 897 bp we studied, so that elevated recombination rates must also occur elsewhere in the 9.1-kb region. Each of these results is consistent with a recently published population-based analysis based on different samples and approximate-likelihood estimation techniques (Wall et al. 2003).

The AML method, which allows for repeat mutation, applied to regions I and II together gave similar results to those presented here. Single-sperm analysis (Schneider et al. 2002) has established that recombination rates are low across region III, and inclusion or exclusion of it made little difference to recombination rate estimates (data not shown). Similar results are obtained if different choices of subregions are made.

We now contrast the information available from the preceding coalescent analysis with that from some other current approaches. Figure 3 plots commonly used summary measures (e.g., Abecasis et al. 2001; Jeffreys et al. 2001; Dawson et al. 2002; Wall and Pritchard 2003) of pairwise LD for each sample. These are much less informative than the full-likelihood analysis described above: neither the qualitative pattern nor the quantitative extent of sharp variation revealed below is apparent from such plots.

Figure 3. Plots of LD measured by |D′| (below diagonal) and the likelihood ratio for significant LD (above diagonal), as described in Jeffreys et al. (2001), for all pairs of SNPs in each population with minor allele frequency greater than 15%: (a) Gambian sample; (b) United Kingdom sample. The edges of each rectangle are midway between adjacent SNPs. The estimated 3′ boundary of the region delineated by Chakravarti et al. (1984) is at position 906. The boundaries of the regions I, II, III, and IV are marked on each top and right-hand side.

On the same polymorphism samples we also applied the method of Myers and Griffiths (2003), which detects historical recombination events (see Materials and Methods). In the Gambian sample, the method revealed that there must have been at least nine recombination events in the history of the sample. Of these nine, all potentially occurred within the hotspot. Three can be definitively localized to within the hotspot and a further three could be localized to an area that includes, but is slightly larger than, the hotspot. None of the nine events is definitively localized outside the hotspot. In the United Kingdom sample there must have been at least three historical recombination events, of which two can be definitively localized within the hotspot, and the third localized to an area slightly larger than the hotspot with none definitively localized outside the hotspot. The results of the detection method are shown graphically in Figure 4.

The detection approach has the advantage of not relying on an evolutionary model, and the method we have used extends the well-known four-gamete test to “find” more historical recombination events than other approaches find. [Hudson and Kaplan’s (1985) R_M, the number of recombination events detected through the four-gamete test, is 2 for each data set.] Further, the results are consistent with, and complement, our estimation results, and the evidence for a hotspot is much clearer from the detection plot (Figure 4) than from plots of pairwise summaries of LD such as those in Figure 3.

It is important to remember that the detection approach provides a different kind of information from the estimation methods. Only a small proportion of actual historical recombination events will leave a conclusive trace even when, as here, resequencing provides all available SNPs in the data. Adopting the sperm-based rate estimate of 80 × 10⁻⁵ kb⁻¹ (Schneider et al. 2002), assuming the rate to be the same in females, and N_e of 15,000 and 9500 for The Gambia and the United Kingdom, theory (Griffiths and Marjoram 1996b) would predict an expectation of 170 historical recombination events in the hotspot (regions I and II) in the Gambian sample and 110 in the United Kingdom sample. As another comparison, our coalescent approach estimates an average of 150 and 60 recombination events, respectively, in the Gambian and United Kingdom samples for region I alone.

Detection of hotspots: We now consider the problem of detecting recombination hotspots using SNP data over large genomic regions. We present results from a detailed simulation study aimed at evaluating how feasible it is to detect recombination hotspots using the AML method. For definiteness, we based our simulation study on a model appropriate for human populations. The extent to which this model, and the results from the simulation study, are applicable to other organisms is discussed later. The parameter values for our model are given in terms of the population-scaled mutation rate,

Figure 4. Graphical representation of information about detected recombination events. The plot is composed of colored rectangles whose boundaries are the positions of informative SNPs found in either population. The top left half relates to the Gambian population and the lower right half to the United Kingdom population. For a pair of SNPs located at positions x and y, respectively, the color of the rectangle whose nearest point to the diagonal is at position (x, y) reflects the density of recombinations definitively localized to fall between these two SNPs, that is, the number of such localized events divided by the physical distance between the SNPs. The highest densities are colored white (hot) with decreasing densities colored through the spectrum of yellow, orange, red, purple, and blue (cold). The green squares on the diagonal identify the positions of the SNPs but do not carry information about detected recombination events.

θ, and recombination rate, ρ. For a diploid population of effective population size N_e and probabilities of mutation and recombination per generation of u and r, respectively, these are defined by θ = 4N_e u and ρ = 4N_e r.

We simulated samples of 50 haplotypes, defined by SNPs, over 100-kb regions, using a coalescent model, which assumes a panmictic population, constant population size, and no selection. We assumed a finite-sites mutation model, which allows for repeat mutation, with each of the 100,000 bases treated as a separate site. At each site we used a two-allele mutation model, with mutations switching between these alleles. We modeled variable mutation rates by assuming that 99.9% of sites mutated at a rate equivalent to θ = 1 kb⁻¹; the remaining sites had a mutation rate drawn from a gamma distribution with mean and standard deviation of θ = 0.1 base⁻¹ (the mean rate is thus 100 times the background rate). Because of these hypervariable sites, the overall value of θ is 1.1 kb⁻¹. Very little is known about the pattern of mutation rate variation in humans. Recurrent mutation can be misinterpreted as a signal of recombination, so that recombination hotspots should be more difficult to detect in the presence of recurrent mutation. Our average θ value of ~1 kb⁻¹ is consistent with available data for humans (Li and Sadler 1991; Cargill et al. 1999). The explicit inclusion of recurrent mutation at each base, and the presence of hypervariable sites, should move our assessment of the power to detect hotspots in the direction of being conservative.

We assumed a background recombination rate of ρ = 0.2 kb⁻¹ and used four models for recombination hotspots: (A) no hotspot, (B) a 1-kb hotspot with ρ = 50 kb⁻¹, (C) a 1-kb hotspot with ρ = 10 kb⁻¹, and (D) a 2-kb hotspot with ρ = 10 kb⁻¹. We did not allow for gene conversion.

The average recombination rate in the human genome is 1.2 cM/Mb (Kong et al. 2002). It is not currently known how much of this is due to hotspots, nor what their density is, so it is not clear in simulations how to apportion overall recombination rates to “background” and “hotspots.” Jeffreys et al. (2001) estimated the background recombination rate in the major MHC to be 0.04 cM/Mb, but extrapolation of this to the whole genome seems unwarranted at this stage. We have thus chosen a middle course, with the background rate in our simulations corresponding to one-third to one-sixth of the genome-wide average, depending on choice of N_e. The size of hotspots we assume is consistent with those detected in the MHC (Jeffreys et al. 2001). The hotspot rates correspond to 10 cM/Mb and 50 cM/Mb, or 50 and 250 times the background rate we use; the hotspots characterized in the MHC have rates varying from 0.4 to 140 cM/Mb (Jeffreys et al. 2001). The SNP density in our simulated data was 1 kb⁻¹ on average.

Ascertainment effects can be an important factor in analyzing SNP data. Of course, different SNP discovery studies use different experimental designs. We modeled SNP ascertainment under one of the simplest, and perhaps most common, scenarios, that in which a SNP is discovered if it is polymorphic in a sample of two chromosomes. Thus samples of 52 sequences were simulated, with 2 randomly chosen sequences constituting

Figure 5. Quantile-quantile plot of the empirical distribution of the LR statistic against the asymptotic distribution of an equal mixture of a $\chi^2_1$ distribution and an atom at 0.
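The tail probability of the mixture distribution in Figure 5 can be checked with a short calculation. The sketch below assumes only the standard identity $P(\chi^2_1 > c) = \mathrm{erfc}(\sqrt{c/2})$; the function names are illustrative, and the independence of subregions is the same conservative assumption made in the text.

```python
import math


def mixture_tail(c):
    """P(LR > c) when LR is 0 with probability 1/2 and chi-square(1) otherwise.

    Uses P(chi2_1 > c) = erfc(sqrt(c / 2)), valid for c >= 0.
    """
    if c < 0:
        raise ValueError("cutoff must be non-negative")
    return 0.5 * math.erfc(math.sqrt(c / 2.0))


def regionwise_rate(per_subregion, n_subregions):
    """Probability of at least one false positive among independent subregions."""
    return 1.0 - (1.0 - per_subregion) ** n_subregions


per_subregion = mixture_tail(10.0)                 # cutoff c = 10, as in the text
per_region = regionwise_rate(per_subregion, 35)    # about 35 subregions per 100 kb
```

With c = 10 the per-subregion value comes out close to the 0.0008 quoted in the text, and the per-region value close to the quoted 0.028 (the exact product form gives a slightly smaller number than the simple multiple 35 x 0.0008).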

a panel. Sites that were polymorphic in the panel were identified as SNPs. Our sample of sequences consisted of the remaining 50 sequences defined by their alleles at these SNPs. (Note that only our simulation method incorporated SNP ascertainment. The AML method for detecting hotspots does not use details of SNP ascertainment.)

Our approach to detecting hotspots was to first calculate the AML for a set of subregions that span the region of interest. The computational cost of calculating the AML surface depends primarily on the number of SNPs the subregion contains, and we chose each subregion to contain six SNPs (which gives a good compromise between the accuracy of inferences from the AML surface and the computational cost of calculating the AML surface). We allowed the subregions to overlap and had a new subregion starting after every third SNP.

For each subregion we calculated the AML surface using 100,000 iterations of the program sequenceLD, which took an average of 4 hr on a 900-MHz PC. The surface was calculated over a grid of 101 $\rho$ values, equally spaced over the interval 0–10 kb$^{-1}$, and a single $\theta$ value, 0.0015 base$^{-1}$ (in practice the choice of $\theta$ value makes negligible difference, data not shown). When calculating the AML we included all SNPs that were segregating in the sample, and thus expect the surface to be very close to the full-likelihood surface for the data (ignoring any ascertainment).

After calculating the AML surfaces, for each subregion we calculated a likelihood-ratio test statistic for the presence of a hotspot. For subregion $i$, the statistic is calculated by the following:

i. Estimate the background recombination rate. The estimate, $\hat{\rho}_b$, is the value that maximizes the product of AML surfaces from all other subregions.

ii. Let $l_i(\rho)$ be the log of the AML of region $i$ evaluated at $\rho$. The likelihood-ratio statistic for a hotspot is

$$\mathrm{LR}_i = 2\left(\max_{\rho > \hat{\rho}_b} l_i(\rho) - l_i(\hat{\rho}_b)\right).$$

We detect a hotspot in a subregion $i$ if $\mathrm{LR}_i > c$ for a suitably chosen cutoff value $c$.

To make an informed choice of $c$ requires knowledge of the distribution of the likelihood-ratio (LR) statistic under the null hypothesis of no hotspot. Standard asymptotic theory would suggest that the statistic takes the value 0 with probability 1/2 and otherwise has a $\chi^2_1$ distribution. We used the empirical distribution of the LR statistics in non-hotspot regions to test whether their actual distribution is close to this asymptotic distribution. A quantile-quantile plot is given in Figure 5, which suggests that the asymptotic distribution approximately holds. (The empirical distribution has too much mass, ~0.7, on the value 0.)

We chose $c = 10$ for the results that we present. Under the asymptotic distribution, this gives a false-positive rate of 0.0008 per subregion. On average there is 1 SNP/kb,

and therefore ~35 subregions for each 100-kb region. Under the conservative assumption that each subregion is independent, this equates to an approximate probability of 0.028 of a false positive when analyzing a 100-kb region.

We analyzed 10 data sets simulated for each of the four hotspot models. The LR values for model C, the model with the smallest hotspot, are shown in Figure 6, and a summary of the results is given in Table 3. There were no false positives for any of the 40 data sets. In total 80% of the hotspots were detected. Surprisingly the size or strength of the hotspot seems to have no obvious effect on the power of the method. More important seems to be the density of SNPs in and near the hotspot. Half the hotspots that were not detected were due to not having SNPs near the hotspot. In these three cases, there were 20- to 30-kb regions that incorporated the hotspot that contained 0 or 1 SNP. Detection efficiency decreases further for weaker hotspots. For example, in further simulations, 6 of 10 hotspots with a rate of 4 kb$^{-1}$ (20 times our assumed background rate) were detected, and none of 5 hotspots with a rate of 2 kb$^{-1}$ (10 times our assumed background rate) were detected.

To test the robustness of our method to model misspecification, we repeated our analysis but with data simulated under a model of recent exponential growth. Our model was for a constant population size followed by exponential growth by a factor of 500 over the last 1600 generations (for human populations, such a model is suggested by the results of Sherry et al. 1994; Rogers and Jorde 1995). We again used four hotspot models, with the same mutation and recombination models as above, and with the values of $\theta$ and $\rho$ being for the population before the exponential growth.

The results of our study are given in Table 4. They are notably worse than the results for the constant population size. This time there were two false positives, although this is still consistent with our putative 2.8% rate of false positives. The overall rate of detecting hotspots is now 63%. (There is significant evidence for different rates of detection for the two models; P = 0.001.)

As an empirical complement to our simulation study, we applied our detection method to population data from 50 individuals over 216 kb of the HLA region covering six recently characterized hotspots (Jeffreys et al. 2001). The available data is unphased, so we applied our method as described above to haplotypes estimated from the genotype data by the program PHASE (Stephens et al. 2001; Stephens and Donnelly 2003). (The estimated haplotypes were kindly provided by Matthew Stephens.) As before, we chose subregions that contain six SNPs, here with the last SNP of one subregion being the first SNP of the next subregion. To deal with the presence of multiple hotspots, we inferred the background rate using only the AML curves for subregions that had not currently been inferred as hotspot regions. In practice this involved iterating the calculation of background rates and the detection of hotspots until no new hotspots were detected.

Figure 7 shows that all six of the characterized hotspots are detected, and in particular that the method has resolution to distinguish between closely separated hotspots in this region. In addition there are two further putative hotspots, one ~5 kb 3′ of the TAP2 hotspot (right-hand side of Figure 7) and the other between DNA1 and DNA2 (the two leftmost known hotspots in Figure 7). It is hard to assess whether these are false positives or additional hotspots. In the case of the former, the sperm data in Jeffreys et al. (2000) is uninformative for recombination 3′ of the TAP2 hotspot, and Jeffreys et al. (2001) conjecture that the TAP2 hotspot may be part of a cluster. In the latter case, our subregion covers a 464-bp region to which two crossovers were localized in the sperm data of Jeffreys et al. (2001; see their Figure 2c). However, the small number of crossover events means that it is impossible to say from the sperm data whether this is an additional hotspot or a false positive.

DISCUSSION

We have demonstrated that new computational tools for calculating the full-likelihood surface, or an accurate approximation of it, from population data are highly valuable for detection and analysis of local recombination hotspots. We have presented results for two related but distinct problems: the estimation of recombination rates and the detection of recombination hotspots.

The results of our first study show that recombination rates vary by around two orders of magnitude between a 897-bp region and a 1725-bp region that are separated by 445 bp. Furthermore the estimated rate in the hotspot region is 30–50 times greater than the genome-wide average, suggesting that this region is hotter than all but one of the recently characterized hotspots in the HLA region. Comparisons with sperm studies, and other analyses of data from this region, confirm full-likelihood estimation of recombination rates to be reliable.

Diversity in the human β-globin gene cluster has been intensively studied over many years, not least because

Figure 6. The value of the LR statistic across the 100-kb region. For ease of presentation we show results from just nine of the simulations under model C, a constant population size model with a 1-kb hotspot whose rate is 50 times the background rate. For positions in two subregions, we have plotted the maximum LR value. The position of the hotspot is shown by the two vertical dashed lines, and the cutoff value for detecting a hotspot by the horizontal dashed line.
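The LR statistic plotted in Figure 6 can be sketched on a grid of $\rho$ values as follows. This is a minimal illustration, not the sequenceLD implementation: it assumes the log-AML curves are already available as lists over a shared, increasing grid, it takes the background estimate to be the grid value that maximizes the combined AML of all other subregions, and the function names and toy quadratic curves are hypothetical.

```python
def lr_statistics(log_aml, rho_grid):
    """Likelihood-ratio statistic for a hotspot in each subregion.

    log_aml[i][k]: log AML of subregion i at rho_grid[k] (grid must be increasing).
    """
    n_sub = len(log_aml)
    n_grid = len(rho_grid)
    stats = []
    for i in range(n_sub):
        # Product of AML surfaces over all OTHER subregions = sum of their logs.
        combined = [
            sum(log_aml[j][k] for j in range(n_sub) if j != i)
            for k in range(n_grid)
        ]
        # Grid index of the background-rate estimate for subregion i.
        b = max(range(n_grid), key=lambda k: combined[k])
        # One-sided maximization over rho at or above the background estimate;
        # including the boundary point makes the statistic exactly 0 when the
        # subregion's curve does not rise above its value at the background rate.
        best = max(log_aml[i][b:])
        stats.append(2.0 * (best - log_aml[i][b]))
    return stats


def detect_hotspots(stats, cutoff=10.0):
    """Indices of subregions whose LR statistic exceeds the cutoff c."""
    return [i for i, s in enumerate(stats) if s > cutoff]


# Toy curves (illustrative only): three background-like subregions, one hot one.
rho_grid = [k / 10 for k in range(101)]      # 101 values on 0-10 per kb
peaks = [0.2, 0.2, 0.2, 5.0]
log_aml = [[-(rho - p) ** 2 for rho in rho_grid] for p in peaks]
stats = lr_statistics(log_aml, rho_grid)
flagged = detect_hotspots(stats)
```

Leaving subregion i out of its own background estimate keeps a strong hotspot from inflating the rate it is compared against; the iterative variant used for the HLA data would additionally drop subregions already flagged as hotspots before re-estimating the background.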

TABLE 3
Detection of hotspots (constant population-size model)

Model    False positives    % of hotspots detected
A        0                  -
B        0                  70
C        0                  90
D        0                  80

Results of detecting recombination hotspots for each of four models are shown (all assume a constant population size). The number of false positives and the number of correctly detected hotspots are given. For each model ten 100-kb data sets were simulated. Model A had no hotspot; models B–D each had a single hotspot, respectively, of size 1, 1, and 2 kb; and rate 250, 50, and 50 times the background recombination rate. (We classified a hotspot as detected provided at least one subregion that overlapped the hotspot had an LR value that exceeded the preset threshold c.)

TABLE 4
Detection of hotspots (recent exponential growth)

Model    False positives    % of hotspots detected
A        2                  -
B        0                  70
C        0                  40
D        0                  80

Results of detecting recombination hotspots for each of four models are shown (all assume a constant population size followed by exponential growth by a factor of 500 over the past 1600 generations). The number of false positives and the number of correctly detected hotspots are given. For each model ten 100-kb data sets were simulated. Model A had no hotspot; models B–D each had a single hotspot, respectively, of size 1, 1, and 2 kb; and rate 250, 50, and 50 times the background recombination rate. (We classified a hotspot as detected provided at least one subregion that overlapped the hotspot had an LR value that exceeded the preset threshold c.)

many mutations in it cause a range of mild to severe inherited hemoglobinopathies (Weatherall and Clegg 1996). On the basis of these studies we can be sure that none of the diversity represented in the United Kingdom sample causes functional variation. We also know that it is inappropriate to assume neutrality for the sample from The Gambia. One of the best known of human mutations, Hemoglobin S (HbS), causes sickle-cell anemia in homozygous individuals and has been elevated in frequency in some populations, mainly in sub-Saharan Africa, by the advantage conferred on HbS heterozygotes who are protected against severe malarial mortality. The HbS allele is present in the sample from The Gambia, at a frequency of 3 out of 31 chromosomes. The recent rise of HbS alleles to polymorphic frequencies by malarial selection in Africa is evident from the presence of characteristic RFLP haplotypes that span the recombination hotspot (Flint et al. 1998; Webster et al. 2003). Consequently, our data set from The Gambia might have been expected to underestimate the number of recombination events compared with expectations under neutrality, with our estimate based on the United Kingdom sample, and with the estimate based on sperm typing, but this does not seem to be a big effect in practice.

Estimation of recombination rates from polymorphism data necessarily relies on an underlying evolutionary model, in our case the so-called standard neutral model. This model does not provide an exact account of the evolutionary history of the sequences we have examined, but we conclude that departures from these assumptions have not greatly influenced our results. These departures include the possibility for recurrent mutation, malarial selection in The Gambia, and a likely bottleneck in the ancestral population for the United Kingdom sample (Reich et al. 2001). Recurrent mutation is most likely at CpG sites but no evidence for their hypermutability has been found in any of the polymorphism data available to us from either β-globin (Harding et al. 1997) or other genes of the complex (Webster et al. 2003). Rate estimates for these regions from the approximate methods of Fearnhead and Donnelly (2002) and McVean et al. (2002), which explicitly allow repeat mutation, are similar (data not shown) to those from the full-likelihood approach reported here. Our computational methods have been shown (Fearnhead and Donnelly 2001) to be reasonably robust to departures from underlying demographic assumptions. Selection as heterozygous advantage has an effect equivalent to changing demographic assumptions and introducing population structure (Nordborg 2001). Furthermore, power studies (C. Spencer, personal communication) for detection of selection in LD data show that regions containing polymorphisms held at intermediate frequencies by balancing selection provide similar information about recombination rates as do neutral regions. In marked contrast, fixations by selective sweeps do leave clear signals of extensive LD that would lead to underestimates of recombination rates. The latter is not the case in our data.

But over and above these theoretical arguments, we are considerably encouraged by the fact that when applied to two populations, United Kingdom and The Gambia, deliberately chosen to have been subject to different demographic histories and selection regimes, the full-likelihood approach gave similar rate estimates across populations within and outside the hotspot. This complements earlier simulation studies (Fearnhead and Donnelly 2001) in showing that at least for these real data sets, full-likelihood estimation of recombination rates is robust to aspects of population history and deviations from neutrality.

It is unclear to what extent homologous recombination fails to resolve as a crossover with flanking exchange, but instead resolves as short tracts (of a few hundred or so base pairs) of gene conversion. Several

Figure 7. The value of the LR statistic across a 216-kb segment of the class II region of the major histocompatibility complex. The data analyzed are from Jeffreys et al. (2001) and consist of genotype data on 241 SNP markers for 50 unrelated United Kingdom donors. The horizontal dashed line denotes the cutoff value used for the detection of hotspots. The vertical dashed lines show the locations of the centers of the six hotspots that were inferred from sperm data in Jeffreys et al. (2000, 2001). Reading from the left, these are the DNA1, DNA2, DNA3, DMB1, DMB2, and TAP2 hotspots, respectively.

population analyses have suggested gene conversion may play an important role (Frisse et al. 2001; Przeworski and Wall 2001), and data on gene conversion processes in humans are beginning to become available (Jeffreys and May 2004). The RFLP haplotypes, LD patterns, and single-sperm typing data for the β-globin complex are mainly informative of recombination with flanking exchange, rather than gene conversion. On the other hand, gene conversion events either wholly within or overlapping one boundary of the regions we have considered would be expected to be reflected in increased rate estimates using our approach. The quantitative consequences of gene conversion for coalescent-based estimates of recombination rate are not yet well understood and warrant further investigation.

The results of the second study show that approximate-likelihood methods are powerful at detecting recombination hotspots: >70% of the hotspots were correctly detected, with only two false positives from >2000 subregions that were tested for the presence of a hotspot. The results were encouraging even for data simulated under a model of recent population growth, but analyzed under a constant population size model. We also applied the method to population data across the HLA region in which six hotspots have recently been characterized. All the known hotspots were detected, and the method successfully distinguished hotspots that were physically close. We also found evidence for a fourth hotspot in the HLA-DNA region and a hotspot 5 kb 3′ of the TAP2 hotspot. For our simulation study we fixed the sample size to be 50 chromosomes. Larger sample sizes, and more importantly denser SNP maps, should improve the power of the method to detect hotspots. Our simulation study was based on haplotype data. If unphased genotype data were available, one natural approach would be first to estimate the haplotypes statistically (Stephens et al. 2001; Stephens and Donnelly 2003) and then to apply the methods described above. This is the approach we used for the HLA region, so those results speak to the accuracy of hotspot detection when haplotype phase needs to be estimated. We note that our method for detecting hotspots explicitly allows for repeat mutations.

We considered the formal problem of testing for the presence of a hotspot and suggest a test that declares a hotspot when the likelihood-ratio statistic exceeds a threshold c, determined by simulation. In some applications a less formal approach may be appropriate, and putative hotspots could be studied in order of decreasing value of the likelihood-ratio statistic, or quantile-quantile plots of observed values of the statistic against those generated by simulation could be used to detect outliers indicating putative hotspots.

We simulated data under assumptions plausible for human populations, in view of the current interest, and available data, for that particular organism. Other organisms have different effective population sizes and different values for the probabilities of mutation and recombination per generation, $u$ and $r$. For example, Drosophila melanogaster has an effective population size that is ~10 times greater than that of humans, but has similar values of $u$ and $r$ (Li and Sadler 1991; Nachman 2002). This means that the values of $\theta$ and $\rho$ per kilobase that we chose for our model would be appropriate for $\theta$ and $\rho$ values per 100 bases of D. melanogaster, and our results for 100 kb of human DNA would correspond to results for 10 kb of D. melanogaster DNA. For other organisms the appropriate scaling of genomic regions would be different. (As we use a finite-sites mutation model, our model does not strictly scale; for example, we will be modeling 100 bases of D. melanogaster DNA with a 1000-sites model, but in practice this will not greatly affect results: the only effect this has is on the amount of repeat mutation, and experience suggests that, at least for scalings of one or maybe two orders of magnitude, any effect of increased repeat mutation would be small. For example, the analysis of LPL data in Fearnhead and Donnelly 2002 gave almost identical results even when some sites were allowed to mutate either 10 or 100 times more frequently than normal.)

On the one hand, population-based estimates of recombination rates rest on simplistic models for the evolution of diversity and the uncertainty surrounding even the most sophisticated estimates cannot be dismissed. Uncertainty of this order is not a feature of pedigree or sperm-typing studies, although we note that quantifying uncertainty in rate estimates is not straightforward for any pedigree-, sperm-, or population-based methods. On the other hand, it is feasible and practicable to apply computational methods to diversity data sampled from anywhere in the human genome, or indeed from any genome, and substantial information will accumulate. It will be interesting to see the future contribution of coalescent methods toward detection of recombination hotspots and robust estimation of recombination rates in hotspots, compared with the results of faster algorithms based on summary statistics. Our study demonstrates that computationally intensive inference methods applied to polymorphism data provide practical and valuable tools for learning about fine-scale variation in recombination rates.

LITERATURE CITED

Abecasis, G. R., E. Noguchi, A. Heinzmann, J. A. Traherne, S. Bhattacharyya et al., 2001 Extent and distribution of linkage disequilibrium in three genomic regions. Am. J. Hum. Genet. 68: 191–197.
Cargill, M., D. Altshuler, J. Ireland, P. Sklar, K. Ardlie et al., 1999 Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22: 231–238.
Chakravarti, A., K. H. Buetow, S. E. Antonarakis, P. G. Waber, C. D. Boehm et al., 1984 Nonuniform recombination within the human β-globin gene cluster. Am. J. Hum. Genet. 36: 1239–1258.
Daly, M. J., J. D. Rioux, S. F. Schaffner, T. J. Hudson and E. S. Lander, 2001 High-resolution haplotype structure in the human genome. Nat. Genet. 29: 229–232.
Dawson, E., G. R. Abecasis, S. Bumpstead, Y. Chen, S. Hunt et al., 2002 A first-generation linkage disequilibrium map of human chromosome 22. Nature 418: 544–548.
Fearnhead, P., 2003 Consistency of estimators of the population-scaled recombination rate. Theor. Popul. Biol. 64: 67–79.
Fearnhead, P., and P. Donnelly, 2001 Estimating recombination rates from population genetic data. Genetics 159: 1299–1318.
Fearnhead, P., and P. Donnelly, 2002 Approximate likelihood methods for estimating local recombination rates (with discussion). J. R. Stat. Soc. Ser. B 64: 657–680.
Flint, J., R. M. Harding and J. B. Clegg, 1998 The population genetics of the haemoglobinopathies. Baillieres Clin. Haematol. 11: 1–51.
Frisse, L., R. R. Hudson, A. Bartoszewicz, J. D. Wall, J. Donfack et al., 2001 Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. 69: 831–843.
Griffiths, R. C., and P. Marjoram, 1996a Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3: 479–502.
Griffiths, R. C., and P. Marjoram, 1996b An ancestral recombination graph, pp. 257–270 in IMA Volume on Mathematical Population Genetics, edited by P. Donnelly and S. Tavaré. Springer-Verlag, Berlin/Heidelberg, Germany/New York.
Griffiths, R. C., and S. Tavaré, 1994 Unrooted genealogical tree probabilities in the infinitely-many-sites model. Math. Biosci. 127: 77–98.
Harding, R. M., S. M. Fullerton, R. C. Griffiths, J. Bond, M. J. Cox et al., 1997 Archaic African and Asian lineages in the genetic ancestry of modern humans. Am. J. Hum. Genet. 60: 772–789.
Hudson, R. R., 2001 Two-locus sampling distributions and their application. Genetics 159: 1805–1817.
Hudson, R. R., and N. Kaplan, 1985 Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147–164.
Jeffreys, A. J., and C. A. May, 2004 Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet. 36: 151–156.
Jeffreys, A. J., and R. Neumann, 2002 Reciprocal crossover asymmetry and meiotic drive in a human recombination hotspot. Nat. Genet. 31: 267–271.
Jeffreys, A. J., A. Ritchie and R. Neumann, 2000 High resolution analysis of haplotype diversity and meiotic crossover in the human TAP2 recombination hotspot. Hum. Mol. Genet. 9: 725–733.
Jeffreys, A. J., L. Kauppi and R. Neumann, 2001 Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29: 217–222.
Jorde, L. B., 2000 Linkage disequilibrium and the search for complex disease genes. Genome Res. 10: 1435–1444.
Kong, A., D. F. Gubdjartsson, J. Sainz, G. M. Jonsdottir, S. A. Gudjonson et al., 2002 A high-resolution recombination map of the human genome. Nat. Genet. 31: 241–247.
Kruglyak, L., 1999 Prospects for whole-genome linkage disequilibrium mapping of common diseases. Nat. Genet. 22: 139–144.
Kuhner, M. K., J. Yamato and J. Felsenstein, 1995 Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140: 1421–1430.
Kuhner, M. K., J. Yamato and J. Felsenstein, 2000 Maximum likelihood estimation of recombination rates from population data. Genetics 156: 1393–1401.
Li, N., and M. Stephens, 2003 Modelling LD, and identifying recombination hotspots from SNP data. Genetics 165: 2213–2233.
Li, W. H., and L. Sadler, 1991 Low nucleotide diversity in man. Genetics 129: 513–523.
McVean, G. A. T., P. Awadalla and P. Fearnhead, 2002 A coalescent method for detecting recombination from gene sequences. Genetics 160: 1231–1241.
Myers, S. R., and R. C. Griffiths, 2003 Bounds on the minimum number of recombination events in a sample history. Genetics 163: 375–394.
Nachman, M. W., 2002 Variation in recombination rate across the genome: evidence and implications. Curr. Opin. Genet. Dev. 12: 657–663.
Nordborg, M., 2001 , pp. 179–212 in Handbook of Statistical Genetics, edited by D. J. Balding, M. Bishop and C. Cannings. John Wiley & Sons, Chichester, England.
Ott, J., 2000 Predicting the range of linkage disequilibrium. Proc. Natl. Acad. Sci. USA 97: 2–3.
Petes, T. D., 2001 Meiotic recombination hot spots and cold spots. Nat. Rev. Genet. 2: 360–369.
Pritchard, J. K., and M. Przeworski, 2001 Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69: 1–14.
Przeworski, M., and J. D. Wall, 2001 Why is there so little intragenic linkage disequilibrium in humans? Genet. Res. 77: 143–151.
Przeworski, M., R. R. Hudson and A. DiRienzo, 2000 Adjusting the focus on human variation. Trends Genet. 16: 296–302.
Reich, D. E., M. Cargill, S. Bolk, J. Ireland, P. C. Sabeti et al., 2001 Linkage disequilibrium in the human genome. Nature 411: 199–204.
Rogers, A. R., and L. B. Jorde, 1995 Genetic evidence on modern human origins. Hum. Biol. 67: 1–36.
Schneider, J. A., T. E. A. Peto, R. A. Boone, A. J. Boyce and J. B. Clegg, 2002 Direct measurement of the male recombination fraction in the human beta-globin hot spot. Hum. Mol. Genet. 11: 207–215.
Sherry, S. T., A. R. Rogers, H. Harpending, H. Soodyall, T. Jenkins et al., 1994 Mismatch distribution of mtDNA reveals recent human population expansions. Hum. Biol. 66: 761–775.
Smith, R. A., P. J. Ho, J. B. Clegg, J. R. Kidd and S. L. Thein, 1998 Recombination breakpoints in the human β-globin gene cluster. Blood 92: 4415–4421.
Stephens, M., and P. Donnelly, 2000 Inference in molecular population genetics (with discussion). J. R. Stat. Soc. Ser. B 62: 605–655.
Stephens, M., and P. Donnelly, 2003 A comparison of Bayesian methods for haplotype reconstruction. Am. J. Hum. Genet. 70: 1162–1169.
Stephens, M., N. J. Smith and P. Donnelly, 2001 A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978–989.
Wall, J. D., 2000 A comparison of estimators of the population recombination rate. Mol. Biol. Evol. 17: 156–163.
Wall, J. D., and J. K. Pritchard, 2003 Haplotype blocks and LD in the human genome. Nat. Rev. Genet. 4: 587–597.
Wall, J. D., L. A. Frisse, R. R. Hudson and A. Di Rienzo, 2003 Comparative linkage-disequilibrium analysis of the beta-globin hotspot in primates. Am. J. Hum. Genet. 73: 1330–1340.
Weatherall, D. J., and J. B. Clegg, 1996 Thalassemia - a global public health problem. Nat. Med. 2: 847.
Webster, M. T., J. B. Clegg and R. M. Harding, 2003 Common 5′ β-globin RFLP haplotypes harbour a surprising level of ancestral sequence mosaicism. Hum. Genet. 113: 123–139.

Communicating editor: D. Charlesworth