Multipoint Identity-By-Descent Prediction Using Dense Markers to Map Quantitative Trait Loci and Estimate Effective Population Size
Total Page:16
File Type:pdf, Size:1020Kb
Copyright Ó 2007 by the Genetics Society of America DOI: 10.1534/genetics.107.070953 Multipoint Identity-by-Descent Prediction Using Dense Markers to Map Quantitative Trait Loci and Estimate Effective Population Size Theo H. E. Meuwissen*,1 and Mike E. Goddard†,‡ *IHA, Norwegian University of Life Sciences, 1432 A˚ s, Norway, †Department of Primary Industry, Attwood, 3049 Australia and ‡University of Melbourne, Melbourne, 3052 Australia Manuscript received January 15, 2007 Accepted for publication May 22, 2007 ABSTRACT A novel multipoint method, based on an approximate coalescence approach, to analyze multiple linked markers is presented. Unlike other approximate coalescence methods, it considers all markers simulta- neously but only two haplotypes at a time. We demonstrate the use of this method for linkage disequilibrium (LD) mapping of QTL and estimation of effective population size. The method estimates identity-by-descent (IBD) probabilities between pairs of marker haplotypes. Both LD and combined linkage and LD mapping rely on such IBD probabilities. The method is approximate in that it considers only the information on a pair of haplotypes, whereas a full modeling of the coalescence process would simultaneously consider all hap- lotypes. However, full coalescence modeling is computationally feasible only for few linked markers. Using simulations of the coalescence process, the method is shown to give almost unbiased estimates of the effec- tive population size. Compared to direct marker and haplotype association analyses, IBD-based QTL map- ping showed clearly a higher power to detect a QTL and a more realistic confidence interval for its position. The modeling of LD could be extended to estimate other LD-related parameters such as recombination rates. XTENSIVE genotyping of individuals for tens to haplotypes (Hudson 1993). Hence, markers provide E hundreds of thousands of SNP markers is becom- information about QTL alleles only through their in- ing common as automated high-throughput techniques formation on the underlying coalescence tree or IBD are established (Wang et al. 2005). These detailed ge- structure. Therefore the logical approach to using LD notype data provide important information about link- is to use the markers to infer the coalescent tree or age disequilibrium (LD) between the genes or markers. properties of it and then to use the coalescent tree to LD, in turn, can be used to test hypotheses about the infer properties of the population (e.g., effective size) or evolutionary history of the population (Hayes et al. to map QTL or to discover recombination hotspots. In 2003), to map QTL (Carlson et al. 2004), and to esti- line with this approach, all QTL mapping methods can mate the recombination rate at each position along a be described conceptually as following three steps: chromosome (Li and Stephens 2003). Thus, extract- 1. Calculate the probability G that two individuals carry ing maximum information from the pattern of LD ij chromosomes that are identical by descent at the pu- observed is very important. tative QTL position. LD, as pointed out by Chapman and Thompson 2. Compare the similarity in phenotype to G . (2003), occurs because multiple gametes inherit a chro- ij 3. The position of the QTL that maximizes the likeli- mosome segment from a common ancestor that is hood of the phenotypes given G is the estimated identical by descent (IBD), i.e., inherited without any ij position. recombination. Hayes et al. (2003) used this under- standing of LD to define a measure of LD called chro- Linkage mapping of QTL also follows this approach, mosome segment homozygosity (CSH) as the probability except that Gij is here solely due to within-family IBD. that random chromosome segments sampled from a Unfortunately, a full coalescent analysis of many linked population are IBD. markers is not computationally feasible. Early multi- According to coalescence theory, the mutations at all point IBD estimation methods assumed that the markers loci (markers and QTL) are independent given the provided independent information about the IBD status coalescence tree(s), i.e., given the IBD structure of the (e.g., Terwilliger 1995). In the case of dense SNP markers this assumption is clearly invalid. More recent multipoint methods approximate the coalescence; for 1Corresponding author: IHA, Norwegian University of Life Sciences, Box instance, composite-likelihood methods consider markers 5003, 1432 A˚ s, Norway. E-mail: [email protected] only two at a time (Hudson 2001). An alternative approach Genetics 176: 2551–2560 (August 2007) 2552 T. H. E. Meuwissen and M. E. Goddard (Meuwissen and Goddard 2001) is to model the coa- this pattern of IBD and non-IBD segments. P(p) can be lescence of all markers but only a pair of chromosomes factored because, once a recombination occurs, chro- at a time, which fits with the definition of CSH and is mosomes segments on either side of the recombination computationally much more tractable (can deal withthou- are assumed to evolve independently in coalescence the- sands of markers). This approach has been very useful ory. The latter assumes an unstructured population, i.e., for mapping QTL (e.g., Olsen et al. 2005) but it assumed no subpopulations or lineages. Therefore, we group el- that genes in the current population derived, without ements in p into IBD segments (continuous sequences any mutation, from a base population that contained a of 1’s) and others. For example, p ¼½0110109 consists single copy of the QTL mutant a known number of gen- of two IBD segments, brackets 2 and 3 and bracket 5. erations ago. This is satisfactory provided the mutation Therefore, rate is negligible relative to the recombination rate, but as the density of markers increases, this assumption is less Pðp ¼½0110109Þ¼Pðp ¼½0110::9Þ justified. Furthermore, Meuwissen and Goddard (2001) 3 Pðp ¼½... 0109 j p ¼½... 0 ::9Þ; assumed that the effective population size and time since the most recent QTL mutation were known. Here, we where a dot ½. denotes that the IBD status for this extend their approach to overcome these limitations. bracket is not specified; i.e., it is not accounted for in the We present a method that predicts IBD probabilities probability calculation and could be either 0 or 1. This at putative QTL positions using information from many allows the probability of a long sequence of data on a dense markers. As part of the predictions, the effective chromosome to be factored into manageable pieces. population size is estimated directly from the linked The prior probability of an IBD chromosome segment marker data. Predictions are compared to true IBD states that extends over n marker brackets is approximately and to direct association analyses using single markers and marker haplotypes. The approach is general and Pðp ¼ 1nÞ¼1=ð4Nc 1 1Þð2Þ can be extended to effective population sizes that varied in the past and to estimate recombination rates that vary (Hayes et al. 2003), where 1n is a vector of n ones, c is at different points in the genome. the size of the IBD segment in morgans, and N is the effective population size. The approximation in (2) assumes that the size of the segment, c, is small relative THE MODEL to 1. Terms like the above P(p ¼½01109), where the IBD segment is bounded by non-IBD segments, can For each pair of gametes, i and j, define a vector (yij) be rewritten involving only unbounded IBD segments summarizing the observed haplotypes where yijk ¼ 1if the alleles are alike-in-state at position k and 0 if they are as in Equation 2, not, where k ¼ 1, ..., l and l denotes the number of p 9 p : : p : 9 marker loci. To simplify the notation, for the moment, Pð ¼½0110 Þ¼Pð ¼½ 11 Þ À Pð ¼½111 Þ we suppress the ij subscript in what follows. That is, all À Pðp ¼½: 1119Þ 1 Pðp ¼½11119Þ data and parameters are defined for the pair of gametes euwissen oddard i and j. The parameters for pair i and j are connected (M and G 2001), which follows from with the parameters for other pairs in a hierarchical rearranging the equations model. Underlying this observed y is a pattern of IBD ðp ¼½: :9Þ¼ ðp ¼½ 9Þ 1 ðp ¼½ 9Þ relationships described by the vector p. If the bth P 11 P 0110 P 1110 marker bracket is IBD, i.e., inherited as an IBD chro- 1 Pðp ¼½01119Þ 1 Pðp ¼½11109Þ mosome segment from a common ancestor without any recombination, the bth element of p is 1, and otherwise and it is 0, where a marker bracket denotes the chromosome segment between two adjacent markers (including the Pðp ¼½111:9Þ¼Pðp ¼½11109Þ 1 Pðp ¼½11119Þ: marker positions themselves). For example, the vector p ¼½11109 denotes that the first three brackets are Also, the conditional probability P(p ¼½0109 j p ¼ inherited IBD from a common ancestor and the fourth ½0..9) can be rewritten in terms of unbounded prob- bracket is not entirely inherited from one common abilities of IBD segments as in Equation 2, using ancestor. The probability of observing y is thus X Pðp ¼½0109 j p ¼½0 ::9Þ¼Pðp ¼½0109Þ=Pðp ¼½0 ::9Þ p 3 p ; PðyÞ¼ all pPðy j Þ Pð Þ ð1Þ ¼ Pðp ¼½0109Þ=½1 À Pðp ¼½1 ::9Þ: where the summation is over all possible p-vectors, Thus, all these terms can be calculated by using (2) re- P(y j p) states the conditional probability of observing peatedly. A factorization to computationally speed up y given the pattern of IBD and non-IBD segments de- the summation in Equation 1 is described by Meuwissen noted by p,andP(p) is the prior probability of observing and Goddard (2001). Multipoint IBD Prediction 2553 Conditional marker homozygosity: Let P(yk j p) de- TABLE 1 note the probability that marker locus k is observed Probability of (un)equal marker alleles given the IBD pattern alike-in-state at a pair of gametes i and j, or not, given the pattern of IBD and non-IBD segments, p.