Heredity (2005) 95, 476–484
© 2005 Nature Publishing Group All rights reserved 0018-067X/05 $30.00
www.nature.com/hdy

Biased estimators of quantitative trait locus heritability and location in interval mapping

M Bogdan¹,² and RW Doerge²,³
¹Institute of Mathematics, Wroclaw University of Technology, Wroclaw, Poland; ²Department of Statistics, Purdue University, West Lafayette, IN 47907, USA; ³Department of Agronomy, Purdue University, West Lafayette, IN 47907, USA

In many empirical studies, it has been observed that genome scans yield biased estimates of heritability, as well as genetic effects. It is widely accepted that quantitative trait locus (QTL) mapping is a model selection procedure, and that the overestimation of genetic effects is the result of using the same data for model selection as estimation of parameters. There are two key steps in QTL modeling, each of which biases the estimation of genetic effects. First, test procedures are employed to select the regions of the genome for which there is significant evidence for the presence of QTL. Second, and most important for this demonstration, estimates of the genetic effects are reported only at the locations for which the evidence is maximal. We demonstrate that even when we know there is just one QTL present (ignoring the testing bias), and we use interval mapping to estimate its location and effect, the estimator of the effect will be biased. As evidence, we present results of simulations investigating the relative importance of the two sources of bias and the dependence of bias of heritability estimators on the true QTL heritability, sample size, and the length of the investigated part of the genome. Moreover, we present results of simulations demonstrating the skewness of the distribution of estimators of QTL locations and the resulting bias in estimation of location. We use computer simulations to investigate the dependence of this bias on the true QTL location, heritability, and the sample size.
Heredity (2005) 95, 476–484. doi:10.1038/sj.hdy.6800747; published online 28 September 2005

Keywords: quantitative trait locus; interval mapping; biased estimators; model selection

Correspondence: RW Doerge, Department of Statistics, Purdue University, 1399 Mathematical Sciences Building, West Lafayette, IN 47907-1399, USA. E-mail: [email protected]
Received 29 June 2004; accepted 18 July 2005; published online 28 September 2005

Bias in interval mapping, M Bogdan and RW Doerge

Introduction

Interval mapping (Thoday, 1961; Lander and Botstein, 1989) is a well-known method for identifying and locating quantitative trait loci (QTL). Extended (Zeng, 1993, 1994) to include, and condition on, additional regions of the genome, interval mapping has been expanded to composite interval mapping, and both are currently enjoying renewed popularity as statistical tools for identifying genomic regions associated with quantitative data (eg expression QTL or eQTL). Where QTL mapping was once an end point for experimentation, it is now a jumping-off point for exciting genomic applications (Jansen and Nap, 2001; Doerge, 2002; Wayne and McIntyre, 2002; Schadt et al, 2003). Since QTL are genomic regions associated with variation of a quantitative trait, it is useful to estimate the effects of QTL and heritability, as well as their genetic map location, for the purpose of narrowing in on important components involved in the genetic architecture of many experimental populations (Mackay, 2001).

In practice, it has been observed that the estimators of effects for QTL estimated via interval mapping, and its extensions (Zeng, 1993, 1994), can be severely inflated (see Beavis, 1994, 1998; Utz et al, 2000; Allison et al, 2002, and references given there). It is already known that the bias in the estimation of QTL effects arises naturally from the fact that standard QTL mapping can be interpreted as a model selection procedure and that it uses the same data to choose the best model and estimate its parameters (Utz et al, 2000; Broman, 2001; Göring et al, 2001; Ball, 2001; Allison et al, 2002). There are two main steps in the model selection process, each of which adds to the bias of the estimators of genetic effects. First, in performing the genome scan, a QTL is declared wherever the corresponding test statistic (log-odds score or likelihood-ratio test) exceeds a certain threshold. The bias resulting from using the same data for testing and estimation has been very well investigated (Broman, 2001; Göring et al, 2001; Allison et al, 2002; Xu, 2003). The second, more subtle, step consists of choosing the locations for which the evidence of a QTL is maximal, and then reporting only the estimates of the genetic effects obtained for these optimal locations. In QTL mapping, while many test statistics are significant, only the location that provides the largest test statistic value, along with the estimated QTL effect(s) for that location, is reported. Although Broman (2001) and Allison et al (2002) suggest that this latter step is also a source of the bias, they do not discuss this issue in detail.

In this work, we explore the second source of bias. Namely, we demonstrate via interval mapping that even when one QTL is assumed present (ie ignoring the testing step), model selection is still performed and the resulting estimator of the genetic effect is biased. Furthermore, the relative importance of the two sources of bias is investigated with respect to the dependence of bias of heritability estimators on true QTL heritability, sample size, and the length of the genome that is tested. We also demonstrate that the distribution of estimators for QTL location is skewed toward the middle of the chromosome under investigation, which in turn results in the bias of QTL location.

Although we explain the phenomenon of biased QTL effects and location via interval mapping, it is well known for cases involving multiple QTL that the estimators provided by simple interval mapping can still be biased because the effects of the other QTL are neglected (ie employing a single QTL model). To address this problem, improved methods such as composite interval mapping (CIM) (Zeng, 1993, 1994), multiple QTL mapping (MQM) (Jansen, 1993), or multiple interval mapping (MIM) (Kao et al, 1999) have been developed. These methods search the entire genome and report QTL effects at the locations where the test statistics are largest and exceed a given threshold. The model selection process that is incorporated in these more advanced methods (eg CIM, MQM, MIM) is the same as that used for traditional interval mapping; therefore, the resulting estimators of QTL effects and heritability will also suffer from the bias issue that is the focus of this research.

Methods

Explanation of bias of heritability estimates resulting from model selection

Consider a backcross population where $X_i$ denotes the QTL genotype for the ith individual, and allow it to take on two values (Kao et al, 1999): $X_i = \frac{1}{2}$ if the ith individual is homozygous at a QTL and $X_i = -\frac{1}{2}$ if it is heterozygous. We assume that the relationship between the quantitative trait value $Y_i$ and a QTL genotype $X_i$ is described by a normal regression model

$$Y_i = \mu + aX_i + \xi_i \tag{1}$$

where $\mu$ and $a$ are the overall mean and QTL effect parameters, respectively, and $\xi_i$ is the error term (environmental noise), which is normally distributed with mean 0 and variance $\sigma^2$.

Interval mapping is based on an estimated genetic map and the assumption that a putative QTL is at a particular location. Each incremental location is flanked by a pair of markers that are related to each other via an a priori estimated recombination value $r$. The probability of recombination between a QTL and a left flanking marker is denoted by $r_1$, and when the proposed location of the QTL changes incrementally across the interval, say left to right, the range of $r_1$ is restricted to $0 \le r_1 \le r$. After fixing the QTL at a specific location in the interval (which corresponds to fixing $r_1$), the known flanking marker genotypes are used to assign conditional probabilities to the two possible QTL genotypes. We denote these probabilities (Lander and Botstein, 1989) by $p_i = P(X_i = \frac{1}{2})$ and $1 - p_i = P(X_i = -\frac{1}{2})$. Because $X_i$ can take on two values, the distribution of the trait values in a backcross population is not normal; instead, it is a mixture of two normal distributions with means $\mu + \frac{1}{2}a$ and $\mu - \frac{1}{2}a$, respectively (Doerge et al, 1997). The density of this distribution is given by

$$f_i(y) = p_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y - \mu - 0.5a)^2}{2\sigma^2}\right) + (1 - p_i) \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y - \mu + 0.5a)^2}{2\sigma^2}\right) \tag{2}$$

There are no closed-form expressions for the maximum-likelihood estimators of the parameters $\mu$, $a$, and $\sigma^2$ for a given $r_1$; therefore, the likelihood

$$L(\mu, a, \sigma^2) = \prod_{i=1}^{n} f_i(Y_i) \tag{3}$$

may be maximized by employing the EM algorithm (Dempster et al, 1977). For an application of the EM algorithm to interval mapping, see Jansen and Stam (1994) and Kao and Zeng (1997). The maximization procedure is repeated at each increment through the interval, and then over a dense grid of locations, or marker intervals, covering the genome. At each location, the likelihood of the fitted model is recorded, and the location that maximizes the likelihood function is regularly used as a point estimator of QTL location, with the corresponding estimate $\hat{a}$ taken as the point estimate for the QTL effect. Therefore, interval mapping can be viewed as a process that selects from statistical models corresponding to different QTL locations. Since the likelihood of a given model depends on the estimate of heritability (see Appendix A1), the model selection process chooses models with the highest heritability estimates, thus leading to overestimation of this quantity.

We demonstrate the overestimation of heritability in more detail by considering a situation where the markers are densely spaced and no model evaluations within the interval are performed. Essentially, the interval mapping process, in this extreme situation, reduces to a regression over markers and chooses the marker that results in the highest coefficient of determination $R^2$ (ie explains the most phenotypic variation) as a candidate location for a QTL. To understand this, assume that a QTL is located at the first marker (M1) such that the quantitative trait data are distributed according to the normal regression model

$$Y_i = \mu + aZ_{1i} + \xi_i \tag{4}$$

where $Z_{1i}$ is the genotype of the ith individual at first marker M1 and $\xi_i \sim N(0, \sigma^2)$. If the position of the QTL is known, then the problem of estimating the genetic effect $a$ and heritability $h^2$ (proportion of the phenotypic variance explained by the QTL) reduces to a standard regression problem. Following Utz et al (2000), we use an asymptotically unbiased estimator of heritability

$$\hat{h}_1^2 = R_1^2 - \frac{1 - R_1^2}{n - 2} = \frac{(n - 1)R_1^2 - 1}{n - 2} \tag{5}$$

where $R_1^2$ is the coefficient of determination for the model with the genotype of the first marker M1 as the predictor variable, and $n$ is the sample size. Extending this idea, let $\hat{h}_i^2$ denote the estimator of heritability if the ith marker is taken as a predictor variable. If we continue in this manner by selecting a model not within an interval, but rather only at the markers, $k$ markers are considered and the marker yielding the highest $R^2$ is chosen. The corresponding estimator of the heritability is equal to

$$\hat{h}^2 = \max_{1 \le i \le k} \hat{h}_i^2 \ge \hat{h}_1^2 \tag{6}$$

For each marker $i$, the distribution of $\hat{h}_i^2$ is continuous on the support $[-1/(n-2), 1]$. When $k > 1$, the probability that $\hat{h}^2 > \hat{h}_1^2$ is larger than zero (proof of this is included in Appendix A1). Therefore, the distribution of the heritability estimate $\hat{h}^2$, when compared to the distribution of the approximately unbiased estimator $\hat{h}_1^2$, is shifted in the direction of larger values. There is a relatively large probability that $\hat{h}^2 > \hat{h}_1^2$ when the true heritability is small. In other words, there is a greater chance of obtaining an estimated QTL effect larger than the one obtained at the proper location in the genome. It is important to note that the conditional distribution of the estimator of heritability, conditioned on the event that we select the proper QTL location, is also shifted to the right (ie larger values) with respect to the distribution of $\hat{h}_1^2$. This is because we select the correct QTL location only when the estimated heritability corresponding to the true model is large enough to outperform estimated heritabilities corresponding to other investigated locations, which may be large just by random chance. The impact of this is that the small values for the heritability estimates are effectively removed from the resulting conditional distribution (see Figure 1d). Further ramifications of this overestimation are carried through to the estimator of the QTL effect, $\hat{a}_i$, when viewed as a function of the coefficient of determination,

$$\hat{a}_i^2 = R_i^2 \frac{s_Y^2}{s_{Z_i}^2} \tag{7}$$

where $s_{Z_i}^2$ and $s_Y^2$ are the estimators of variance for $Z_i$ (the genotype of putative QTL) and $Y$ (the quantitative trait), respectively. Therefore, the estimators of the magnitude of QTL effects (their absolute values) reported from interval mapping are also positively biased. In this work, we demonstrate the bias of QTL effects assuming that the search is performed only at marker positions. When the genetic map is sparse and the search is performed at intermarker positions, the number of points at which the models are fitted is larger than the number of markers, and the problem of bias of QTL effects may be even greater.

Figure 1 Histograms of the estimators of the QTL effect under three different estimation procedures. Results based on 500 replicates consisting of 200 individuals from backcross population. Trait data are generated according to model (1) with parameters $\mu = 0$, $a = 0.4588$, and $\sigma = 1$, and the QTL is located 52.5 cM from the top end of chromosome 1. Panels: (a) Estimation at known QTL location; (b) Interval mapping over 10 chromosomes; (c) Interval mapping for 177 significant replicates; (d) Replicates yielding "precise" QTL location. Panel (d) reports the results for replicates where the estimated QTL location was within 10 cM of the true value. 'X' marks the true value of QTL effect. Axes: estimate of QTL effect (x) against number of replicates (y).

Simulations

We investigate the bias in the estimators of QTL heritability and QTL location via a simulation study. Marker and QTL data (genotypes) were simulated on up to 15 chromosomes, each of length 225 cM, for a sample size that ranged from 100 to 500. The experimental system is a backcross design, and the markers are located every 15 cM. In experiments demonstrating the bias of heritability, a single QTL was fixed at 52.5 cM from the left (top) end of chromosome 1. For the purpose of illustrating the bias of location, we used different QTL locations spanning the whole of chromosome 1. For each combination of sample size and length of the investigated portion of the genome, thresholds for the likelihood ratio statistic to detect the presence of QTL at the significance level $\alpha = 0.05$ were simulated using 1000 replicates from the null distribution with no QTL.
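The selection effect described by equations (5) and (6) can be illustrated with a small marker-regression simulation. The following is a minimal sketch, not the authors' simulation code: the function names, the Haldane map function used to turn the 15 cM marker spacing into a recombination fraction, and the placement of the QTL exactly on the first marker (as in model (4)) are all assumptions made for illustration.

```python
import math
import random


def effect_from_heritability(h2, sigma=1.0):
    """QTL effect a giving heritability h2 when Var(X) = 1/4 (backcross)."""
    return 2.0 * sigma * math.sqrt(h2 / (1.0 - h2))


def adjusted_h2(r2, n):
    """Equation (5): ((n - 1) * R^2 - 1) / (n - 2)."""
    return ((n - 1) * r2 - 1.0) / (n - 2)


def r_squared(z, y):
    """Coefficient of determination of the simple regression of y on z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    syy = sum((yi - my) ** 2 for yi in y)
    szy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    if szz == 0.0 or syy == 0.0:
        return 0.0
    return szy * szy / (szz * syy)


def simulate_markers(n, k, r, rng):
    """Backcross genotypes (+1/2 or -1/2) at k equally spaced markers."""
    markers = [[0.0] * n for _ in range(k)]
    for i in range(n):
        g = rng.choice((0.5, -0.5))
        markers[0][i] = g
        for j in range(1, k):
            if rng.random() < r:  # recombination between adjacent markers
                g = -g
            markers[j][i] = g
    return markers


def selection_demo(n=200, k=16, h2=0.05, reps=200, seed=1):
    """Mean of h1-hat (known location) and of max_i hi-hat (equation (6))."""
    rng = random.Random(seed)
    r = 0.5 * (1.0 - math.exp(-2.0 * 0.15))  # Haldane map, 15 cM (assumed)
    a = effect_from_heritability(h2)
    known, selected = [], []
    for _ in range(reps):
        z = simulate_markers(n, k, r, rng)
        # Model (4) with mu = 0 and sigma = 1: the QTL sits on marker M1.
        y = [a * g + rng.gauss(0.0, 1.0) for g in z[0]]
        h2_hat = [adjusted_h2(r_squared(zj, y), n) for zj in z]
        known.append(h2_hat[0])
        selected.append(max(h2_hat))
    return sum(known) / reps, sum(selected) / reps
```

Calling `selection_demo()` returns the mean estimate at the known location and the mean of the selected (maximal) estimate over one 225 cM chromosome with 16 markers. Because the maximum is taken over a set that includes the true marker, the second value can never be smaller than the first; how much larger it is depends on the heritability, sample size, and number of markers.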

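The mixture density (2) and the likelihood (3) from the Methods can be written down directly, and one EM fit at a fixed putative location can be sketched as below. This is a textbook EM for a two-component normal mixture with known, individual-specific weights $p_i$ and a common variance, offered as an assumption about how such a fit might look; it is not taken from the paper or from the references it cites. The conditional probabilities `ps` are treated as given inputs, and all function names are hypothetical.

```python
import math
import random


def normal_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def mixture_density(y, p, mu, a, sigma):
    """Equation (2): weight p on mean mu + a/2, weight 1 - p on mu - a/2."""
    return (p * normal_pdf(y, mu + 0.5 * a, sigma)
            + (1.0 - p) * normal_pdf(y, mu - 0.5 * a, sigma))


def log_likelihood(ys, ps, mu, a, sigma):
    """Logarithm of the likelihood in equation (3)."""
    return sum(math.log(mixture_density(y, p, mu, a, sigma))
               for y, p in zip(ys, ps))


def em_fit(ys, ps, iterations=50):
    """Return (mu, a, sigma, trace); trace holds the log-likelihood per step."""
    n = len(ys)
    mu = sum(ys) / n
    sigma = math.sqrt(sum((y - mu) ** 2 for y in ys) / n)
    a = 0.0  # start from "no QTL effect" (arbitrary starting point)
    trace = [log_likelihood(ys, ps, mu, a, sigma)]
    for _ in range(iterations):
        # E-step: posterior probability that X_i = +1/2 given Y_i.
        w = [p * normal_pdf(y, mu + 0.5 * a, sigma)
             / mixture_density(y, p, mu, a, sigma)
             for y, p in zip(ys, ps)]
        # M-step: weighted means of the two genotype classes, pooled variance.
        sw = sum(w)
        m_hom = sum(wi * y for wi, y in zip(w, ys)) / sw
        m_het = sum((1.0 - wi) * y for wi, y in zip(w, ys)) / (n - sw)
        var = sum(wi * (y - m_hom) ** 2 + (1.0 - wi) * (y - m_het) ** 2
                  for wi, y in zip(w, ys)) / n
        mu, a, sigma = 0.5 * (m_hom + m_het), m_hom - m_het, math.sqrt(var)
        trace.append(log_likelihood(ys, ps, mu, a, sigma))
    return mu, a, sigma, trace
```

In an interval-mapping scan, a fit of this kind would be repeated at every putative location (each with its own `ps`), and the location with the largest final log-likelihood would be reported; that repetition is the model selection step discussed in the text.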
Results 0.1 For the first investigation, marker and QTL information were simulated on 10 chromosomes for 200 backcross 0.08 individuals. The trait data were simulated according to Mean of heritability estimates model (1) with m ¼ 0, a ¼ 0.4588, and s ¼ 1(h2 ¼ 0.05). A 0.06 total of 500 replicate data sets were used to investigate the estimators for the QTL effect and heritability at the 0.04 known (52.5 cM from the top end of chromosome 1) QTL 100 150 200 250 300 350 400 location, and when the search was performed on all 10 Sample size chromosomes. To estimate heritability, we first estimated 2 c Dependence on the number of chromosomes the genetic effect a and the coefficient of determination R 0.1 (7), replacing the estimator of the variance for Zi (the 1 genotype of putative QTL) by the true value (4). We then calculated an estimate of heritability according to 0.09 equation (5). Our simulations strongly suggest that the bias of the resulting estimator at a proper QTL location is 0.08 negligible; thus, the reported bias resulting from interval true heritability mapping is related only to the model selection process only selection and testing. 0.07 selection and testing Figure 1a illustrates that when the QTL location is known, the estimator of the QTL effect is approximately unbiased with a mean 0.4567 (true value 0.4588) and the 0.06 corresponding mean heritability 0.0513 (true value 0.05). When the QTL location is assumed unknown, and the adjacent intervals searched using interval mapping, a Mean of heritability estimates 0.05 histogram (Figure 1b) of the QTL effects illustrates no values near zero. This reflects that the estimated QTL 0.04 effects achieve considerably large values randomly 0 5 10 15 throughout the genome. 
Since the true QTL effect was Number of chromosomes relatively small, the probability that the estimated Figure 2 Dependence of the bias of heritability estimates on true heritability will have a maximum at a location away heritability, sample size, and the number of the investigated from the true location is quite high. As illustrated chromosomes. Panels (a) and (b) show the results of the search (Figure 1b), it is also possible that the estimator of the over ten 225 cM chromosomes. Panels (a) and (c), the sample size QTL effect can be negative. Such a phenomenon usually n ¼ 200. In panels (b) and (c) heritability value h2 ¼ 0.05.

Heredity Bias in interval mapping M Bogdan and RW Doerge 480 occurs when the estimated location is not near (unlinked) to exceed the heritability estimates at other locations, the true location, but as seen in Figure 1a this can also thus effectively eliminating small values from the occur at the true location. Finally, when each location resulting distribution. For this simulation, the mean of was tested for a QTL, and attention restricted to only the corresponding QTL effect estimates was 0.5647 (true significant replicates (of the 500), the distribution of the value 0.4588). estimator of QTL effect is seen to move further away The dependence of bias of heritability estimators on from zero (Figure 1c). In this situation, the value of the true QTL heritability, sample size and the length of the heritability estimate and the absolute value of the QTL investigated part of the genome was explored and is effect estimate were always larger than 0.0612 (0.05 true illustrated in Figures 2a–c. Figures 2a and b illustrate that value) and 0.4865 (0.4588 true value), respectively, the bias of heritability estimators diminishes over indicating significant upward bias. Note that the thresh- increasing true QTL heritability and the sample size. old for the heritability estimate for significant replicates This can be explained by the fact that both the accuracy can be easily predicted by using the relationship between of estimating QTL location and the power to detect QTL the heritability estimate and the likelihood ratio statistic increase when heritability and sample size increase. For (see Appendix A1). Furthermore (Figure 1d), when we very small heritabilities, the bias resulting from the joint conditioned on the event that the QTL was precisely estimation of location and genetic effect is comparable located, the distribution of the estimate of the QTL effect with the bias resulting from testing; however, for larger is also shifted in the direction of large values. 
This is due heritabilities, the bias introduced by the testing proce- to the fact that we picked the proper location only when dure is almost three times larger than the bias resulting the corresponding heritability estimate was large enough from the choice of optimal location. Figure 2c demon-

a 150 b 140

120

100 100 80

60 50 Number of replicates

Number of replicates 40

20

0 0 0 50 100 150 200 250 0 50 100 150 200 250 Estimate of location Estimate of location

c 140 d 80

120 70

100 60

50 80 40 60 30 Number of replicates 40 Number of replicates 20

20 10 erge 0 0 0 50 100 150 200 250 0 50 100 150 200 250 Estimate of location Estimate of location Figure 3 Histograms of the estimators of QTL location resulting from the search over a 225 cM chromosome. Panels (a)–(c) are based on all 500 replicates. Panel (d) is based on 187 significant replicates. Heritability is h2 ¼ 0.03 and the sample size is 200. Means of estimators of QTL location in panels (a)–(d) are marked by ‘O’, and are equal to 55.9, 111.7, 174.8, and 31.6, respectively. True values of QTL location are marked by ‘X’ (22.5, 112.5, 202.5, 22.5, respectively).

Heredity Bias in interval mapping M Bogdan and RW Doerge 481 strates the dependence of bias of heritability estimators a 40 on the number of investigated chromosomes. The value 0 on the x-axis (ie 0 chromosomes) corresponds to only estimation 35 estimation and testing estimating the heritability at the true QTL location. As shown in Figure 2c, the bias of QTL heritability increases 30 when the length of the investigated part of the genome increases (ie number of chromosomes). An increase of 25 bias due to the choice of optimal location results from the fact that the probability of obtaining a large heritability 20 estimate in a position different from QTL location increases when we increase the search area. Additionally, 15 the threshold values for the likelihood ratio test statistic that are used to detect the presence of QTL increase as 10 the investigated portion of the genome increases. This Absolute value of bias (in cM) results in a decrease of power, as well as an increase in 5 bias due to testing (see Appendix A1 for the relationship 0 between the likelihood ratio statistic and the heritability 0 50 100 150 200 250 estimate). True location Figures 3 and 4 demonstrate the bias involved in estimating QTL location. When the QTL heritability is b 35 low, there is a relatively large chance of making an error only estimation estimation and testing in the QTL location. Furthermore, if the QTL is located 30 close to one end of the chromosome, then the number of tested positions (and the margin of error) in the direction 25 toward the other end of the chromosome is much larger (Figure 3). Figure 3 is based on simulations with a 20 fixed heritability value h2 ¼ 0.03 and sample size 200. Figures 3a–c present histograms of the estimators of 15

QTL location resulting from interval mapping over a Bias (in cM) 225 cM chromosome based on 500 replicates. Figure 3d illustrates the histogram of the estimator of QTL 10 location based on only significant replicates. As seen in Figure 3b, when the QTL is located in the middle of the 5 chromosome, the distribution of the QTL location estimates is symmetric and a bias in QTL location is 0 not observed. However, when the QTL is close to 0.02 0.04 0.06 0.08 0.1 one end of the chromosome (Figure 3a and c), the Heritability distribution of estimators of QTL location is skewed c 45 toward the opposing end of the chromosome, and only estimation therefore the means of these estimators are shifted in 40 estimation and testing this direction. This effect (Figure 3d) is also observed when we restrict attention to only those replicates that 35 produce significant results. Finally, Figure 3 demon- strates that the bias in QTL location is directly related to a 30 very low accuracy of locating QTL with small herit- 25 ability. The standard deviation of estimators of QTL location that are reported in Figure 3 is approximately 20 equal to 50 cM when testing is not employed and 30 cM Bias (in cM) when we use testing and restrict the attention to only 15 significant replicates. 10 Figure 4 demonstrates the dependence of bias of QTL location on true location, QTL heritability, and 5 sample size. The bias resulting from the estimation 0 procedure is computed as the difference between the 100 200 300 400 500 mean of location estimators over all 500 replicates and Sample size the true location. To compute the bias resulting from using both estimation and testing, we used only Figure 4 Dependence of the bias of location estimators on true significant replicates. As demonstrated in Figure 4a, the location, heritability, and sample size. The QTL search is performed absolute value of the bias of location is equal to 0 if the over a 225 cM chromosome. 
In panels (a) and (b), the sample size is 200. In panels (a) and (c), the heritability is 0.03, and in panels (b) QTL is in the middle of the chromosome and increases and (c) the true QTL location is equal to 22.5 cM. symmetrically with increasing distance from the center of the chromosome. Furthermore, the bias of the estimators of QTL locations obtained for only significant testing, and the sample size and heritability are small, the replicates is much lower than in the case when we do not bias at locations close to the end of the chromosome can use testing (Figure 4). However, even when we apply reach 15 cM.
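The relationship between the likelihood ratio statistic and the heritability estimate that the Results rely on (equation (A.1) in Appendix A1) is simple enough to check numerically. The sketch below uses hypothetical function names; the formulas are those stated in the appendix, and the inverse mapping is plain algebra on them.

```python
import math


def r2_from_lrt(lrt, n):
    """Appendix A1: R^2 = 1 - exp(-LRT / n)."""
    return 1.0 - math.exp(-lrt / n)


def h2_from_lrt(lrt, n):
    """Equation (A.1): heritability estimate implied by an LRT value."""
    return ((n - 1) * r2_from_lrt(lrt, n) - 1.0) / (n - 2)


def lrt_from_h2(h2_hat, n):
    """Inverse of equation (A.1), obtained by solving it for LRT."""
    r2 = ((n - 2) * h2_hat + 1.0) / (n - 1)
    return -n * math.log(1.0 - r2)


if __name__ == "__main__":
    # Simulated 5% threshold reported for n = 200 and ten chromosomes.
    print(round(h2_from_lrt(13.45, 200), 4))
```

With the reported threshold of 13.45 and $n = 200$, the implied minimal heritability estimate among significant replicates is about 0.0603, which is the value quoted in the appendix and close to the 0.0612 observed under interval mapping.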

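The boundary effect behind the location bias can be reproduced in a simplified setting. The sketch below is an illustration under stated assumptions, not the procedure used for Figures 3 and 4: it scans markers only (marker regression rather than interval mapping), places the QTL exactly on a marker, uses the Haldane map function for the 15 cM spacing, and skips the testing step. All function names are hypothetical.

```python
import math
import random


def r_squared(z, y):
    """Coefficient of determination of the simple regression of y on z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    syy = sum((yi - my) ** 2 for yi in y)
    szy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    if szz == 0.0 or syy == 0.0:
        return 0.0
    return szy * szy / (szz * syy)


def simulate_markers(n, k, r, rng):
    """Backcross genotypes (+1/2 or -1/2) at k equally spaced markers."""
    markers = [[0.0] * n for _ in range(k)]
    for i in range(n):
        g = rng.choice((0.5, -0.5))
        markers[0][i] = g
        for j in range(1, k):
            if rng.random() < r:  # recombination between adjacent markers
                g = -g
            markers[j][i] = g
    return markers


def location_estimates(qtl_marker, h2=0.03, n=200, k=16,
                       spacing_cm=15.0, reps=200, seed=7):
    """Estimated QTL positions (cM): the marker with the largest R^2."""
    rng = random.Random(seed)
    r = 0.5 * (1.0 - math.exp(-2.0 * spacing_cm / 100.0))  # Haldane (assumed)
    a = 2.0 * math.sqrt(h2 / (1.0 - h2))  # effect giving h2 when sigma = 1
    estimates = []
    for _ in range(reps):
        z = simulate_markers(n, k, r, rng)
        y = [a * g + rng.gauss(0.0, 1.0) for g in z[qtl_marker]]
        best = max(range(k), key=lambda j: r_squared(z[j], y))
        estimates.append(best * spacing_cm)
    return estimates


def location_bias(estimates, true_location):
    """Mean of the location estimates minus the true location."""
    return sum(estimates) / len(estimates) - true_location
```

For a QTL on the first marker (`qtl_marker=0`, true location 0 cM), every error necessarily points toward the interior of the chromosome, so `location_bias(location_estimates(0), 0.0)` cannot be negative; for a QTL on the central marker the errors can fall on either side. This is the mechanism the text describes for QTL near a chromosome end.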
Discussion

Interval mapping (Lander and Botstein, 1989) is a popular statistical method for locating QTL. However, as demonstrated by our simulations, the estimators of QTL locations and heritability resulting from this procedure can be severely biased. We explain this phenomenon by observing that inherent in the interval mapping framework lies the issue of a constantly changing model that ultimately affects the accuracy of the parameter estimates. To further exacerbate this problem, a direct consequence of both increased genome size and marker number is an increase in the number of models considered. Therefore, as the genome and/or marker number increases, the number of models increases, and hence the effect of bias on the estimators resulting from interval mapping increases.

The search for a solution to remedy the bias problem in QTL mapping is ongoing (Utz et al, 2000; Ball, 2001; Allison et al, 2002). We have great hope that this problem can be solved by developing appropriate Bayesian methods. The advantages of Bayesian approaches are that they give the investigator the chance to apply expert, prior knowledge to the problem at hand, keep track of many possible models, and quantify the uncertainty related to the model choice by reporting posterior probabilities. The quantities of interest (ie parameters) can then be estimated by computing the mean or the mode of the related posterior distribution (for the tutorial on Bayesian model averaging, see Hoeting et al, 1999). Because Bayesian methods avoid concentrating on only one model, their results do not suffer from the biased estimates of QTL effects that are addressed in this work. This is not to say that we expect Bayesian methods to provide a solution to the bias in the location of weak QTL, which does not result from the model selection procedure. With this issue aside, it is important to realize the impact that the prior distributions have on the outcome of Bayesian methods (Clyde, 1999). By comparison to standard, classical estimates, Bayesian estimators are shrunken in the direction specified by the prior distribution, which in itself is a different form of bias that can be severe if the prior is selected inappropriately. Although Bayesian methods are becoming more mainstream in the QTL mapping community (Satagopan et al, 1996; Uimari and Hoeschele, 1997; Stephens and Fisch, 1998; Yi and Xu, 2000; Sen and Churchill, 2001; Vogl and Xu, 2002; Yi and Xu, 2002; Yi et al, 2003a; Xu, 2003; Kilpikari and Silanpää, 2003; Yi et al, 2003b; Bogdan et al, 2004; Jannink and Fernando, 2004; van de Ven, 2004; Zhang et al, 2005; Hayashi and Awata, 2005), we feel that there is a need for a more systematic investigation of the influence that the choice of priors has on the resulting estimates. Our hope is that work toward solving the bias problem will continue, but our larger hope is that readers will be aware that bias exists in the estimators rendered from QTL mapping methodologies and will view their results with a cautious eye.

Acknowledgements

We thank Professors Witold Klonecki, Dan Nettleton, Friedrich Utz, Krzysztof Bogdan, and Przemyslaw Biecek as well as two anonymous referees for helpful comments and suggestions. This work was funded by a USDA-IFAFs (00-52100-9615) grant to RWD.

References

Allison DB, Fernandez JR, Heo M, Zhu S, Etzel C, Beasley TM et al (2002). Bias in estimates of quantitative trait locus effect in genome scans: demonstration of the phenomenon and a methods-of-moments procedure for reducing bias. Am J Hum Genet 70: 575–585.
Ball RD (2001). Bayesian methods for quantitative trait loci mapping based on model selection: approximate analysis using the Bayesian information criterion. Genetics 159: 1351–1364.
Beavis WD (1994). The power and deceit of QTL experiments: lessons from comparative QTL studies. In: Proceedings of the Forty-Ninth Annual Corn & Sorghum Industry Research Conference. American Seed Trade Association: Washington, DC. pp 250–266.
Beavis WD (1998). QTL analyses: power, precision, and accuracy. In: Paterson AH (ed) Molecular Dissection of Complex Traits. CRC Press: New York. pp 145–162.
Bogdan M, Ghosh JK, Doerge RW (2004). Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics 167: 989–999.
Broman KW (2001). Review of statistical methods for QTL mapping in experimental crosses. Lab Anim 30: 44–52.
Clyde M (1999). Comment for 'Bayesian model averaging: a tutorial' by Hoeting et al. Stat Sci 14: 401–404.
Dempster AP, Laird NM, Rubin DB (1977). Maximum likelihood from incomplete data via EM algorithm. J R Stat Soc Ser B 39: 1–38.
Doerge RW (2002). Mapping and analysis of quantitative trait loci in experimental populations. Nat Rev Genet 3: 43–52.
Doerge RW, Zeng Z-B, Weir BS (1997). Statistical issues in the search for genes affecting quantitative traits in experimental populations. Stat Sci 12: 195–219.
Göring HHH, Terwilliger JD, Blangero J (2001). Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet 69: 1357–1369.
Hayashi T, Awata T (2005). Bayesian mapping of QTL in outbred F2 families allowing inference about whether F0 grandparents are homozygous or heterozygous at QTL. Heredity 94: 326–338.
Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999). Bayesian model averaging: a tutorial. Stat Sci 14: 382–401.
Jannink J, Fernando RL (2004). On the Metropolis–Hastings acceptance probability to add or drop a quantitative trait locus in Markov Chain Monte Carlo-based Bayesian Analyses. Genetics 166: 641–643.
Jansen RC (1993). Interval mapping of multiple quantitative trait loci. Genetics 135: 205–211.
Jansen RC, Stam P (1994). High resolution of quantitative traits into multiple loci via interval mapping. Genetics 136: 1447–1455.
Jansen RC, Nap J-N (2001). Genetical genomics: the added value from segregation. Trends Genet 17: 388–391.
Kao C-H, Zeng Z-B (1997). General formulas for obtaining the MLEs and the asymptotic variance–covariance matrix in mapping quantitative trait loci when using the EM algorithm. Biometrics 53: 653–665.
Kao C-H, Zeng Z-B, Teasdale RD (1999). Multiple interval mapping for quantitative trait loci. Genetics 152: 1203–1216.
Kilpikari R, Silanpää MJ (2003). Bayesian analysis of multilocus association in quantitative and qualitative traits. Genetic Epidemiol 25: 122–135.
Lander ES, Botstein D (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199.

Mackay TFC (2001). Quantitative trait loci in Drosophila. Nat Rev Genet 2: 11–20.
Satagopan JM, Yandell BS, Newton MA, Osborn TC (1996). Bayesian model determination for quantitative trait loci. Genetics 144: 805–816.
Schadt EE, Monks SA, Drake TA, Lusis AJ, Che N, Colinayo V et al (2003). Genetics of gene expression surveyed in maize, mouse and man. Nature 422: 297–302.
Sen S, Churchill GA (2001). A statistical framework for quantitative trait mapping. Genetics 159: 371–387.
Stephens DA, Fisch RD (1998). Bayesian analysis of quantitative trait locus data using reversible jump Markov chain Monte Carlo. Biometrics 54: 1334–1347.
Thoday JM (1961). Location of . Nature 191: 368–370.
Uimari P, Hoeschele I (1997). Mapping linked quantitative trait loci using Bayesian analysis and Markov chain Monte Carlo algorithms. Genetics 146: 735–743.
Utz HF, Melchinger AE, Schön CC (2000). Bias and sampling error of the estimated proportion of genotypic variance explained by quantitative trait loci determined from experimental data in maize using cross validation and validation with independent samples. Genetics 154: 1839–1849.
van de Ven R (2004). Reversible-jump Markov Chain Monte Carlo for quantitative trait loci mapping. Genetics 167: 1033–1035.
Vogl C, Xu S (2002). QTL analysis in arbitrary pedigrees with incomplete marker information. Heredity 89: 339–334.
Wayne M, McIntyre LM (2002). Combining mapping and arraying: an approach to identification. Proc Natl Acad Sci USA 99,23: 14903–14906.
Xu S (2003). Theoretical basis of the Beavis effect. Genetics 165: 2259–2268.
Yi N, Xu S (2000). Bayesian mapping of quantitative trait loci for complex binary traits. Genetics 155: 1391–1403.
Yi N, Xu S (2002). Mapping quantitative trait loci with epistatic effects. Genet Res 79: 185–198.
Yi N, George V, Allison DB (2003a). Stochastic search variable selection for identifying multiple quantitative trait loci. Genetics 164: 1129–1138.
Yi N, Xu S, Allison DB (2003b). Bayesian model choice and search strategies for mapping interacting quantitative trait loci. Genetics 165: 867–883.
Zeng Z-B (1993). Theoretical basis of precision mapping of quantitative trait loci. Proc Natl Acad Sci USA 90: 10972–10976.
Zeng Z-B (1994). Precision mapping of quantitative trait loci. Genetics 136: 1457–1468.
Zhang M, Montooth KL, Wells MT, Clark AG, Zhang D (2005). Mapping multiple quantitative trait loci by Bayesian classification. Genetics 169: 2305–2318.

Appendix A1

Relationship between the likelihood ratio statistic and the heritability estimate

Consider the likelihood ratio statistic to detect a QTL at a marker position

$$\mathrm{LRT} = 2\ln\frac{L(\hat{\mu}, \hat{a}, \hat{\sigma}^2)}{L(\bar{Y}, 0, \hat{\sigma}_0^2)}$$

where $L$ is the likelihood given by equation (3), $\bar{Y} = (1/n)\sum_{i=1}^{n} Y_i$ and $\hat{\sigma}_0^2 = (1/n)\sum_{i=1}^{n}(Y_i - \bar{Y})^2$. Simple calculations yield $\mathrm{LRT} = n\ln(\mathrm{SST}/\mathrm{SSE})$, where SST and SSE are the total sum of squares and the residual sum of squares from regression of the trait data on marker genotypes. Thus, $\mathrm{LRT} = n\ln\bigl(1/(1 - R^2)\bigr)$, which results in $R^2 = 1 - \exp(-\mathrm{LRT}/n)$, and the heritability estimate

$$\hat{h}^2 = \frac{(n-1)\bigl(1 - \exp(-\mathrm{LRT}/n)\bigr) - 1}{n-2} \qquad (A.1)$$

This representation can be used to compute the threshold for a heritability estimate to detect the presence of QTL. For example, in our first simulation, the simulated threshold value for the likelihood ratio test statistic at the significance level $\alpha = 0.05$ was equal to 13.45, and the corresponding heritability estimate given by equation (A.1) is $\hat{h}^2 = 0.0603$. These computations, based on a regression on markers, agree well with the minimal value of the heritability estimate for significant replicates resulting from interval mapping, which in our experiment was equal to 0.0612.

The probability that the heritability estimate will obtain its maximum at the wrong marker is larger than zero

Proposition 1 Let the QTL be located at marker 1, and let marker 2 be located at any other position in the genome. Let $\hat{h}_1^2$ and $\hat{h}_2^2$ be estimators of heritability resulting from fitting a simple regression model (4) with predictor variables specified by genotypes of markers 1 and 2, respectively. If the sample size $n \ge 3$, then

$$P(\hat{h}_2^2 > \hat{h}_1^2) > 0$$

(ie probability that the maximum of heritability estimates will be obtained at the wrong marker is larger than zero).

Proof Let $Z_1 = (Z_{11}, Z_{12}, \ldots, Z_{1n})$ and $Z_2 = (Z_{21}, Z_{22}, \ldots, Z_{2n})$ be the genotypes at markers 1 and 2, respectively. Consider the $Z_{1i}$ as i.i.d. random variables with $P(Z_{1i} = 1/2) = P(Z_{1i} = -1/2) = 1/2$. Assume that the same holds for $Z_{2i}$. Denote the probability of recombination between markers 1 and 2 as $c \in (0, 1/2]$. Therefore, $P(Z_{2i} = 1/2 \mid Z_{1i} = 1/2) = P(Z_{2i} = -1/2 \mid Z_{1i} = -1/2) = 1 - c$ and $P(Z_{2i} = -1/2 \mid Z_{1i} = 1/2) = P(Z_{2i} = 1/2 \mid Z_{1i} = -1/2) = c$, $i = 1, \ldots, n$. Letting $\mathbf{1} = (1, 1, \ldots, 1) \in \mathbb{R}^n$, $(c, c, \ldots, c)$ is written as $c$ instead of $c\mathbf{1}$.

Let $A$ be the event that $\{Z_1 \ne \text{const},\ Z_2 \ne \text{const},\ Z_1 \ne Z_2 \text{ and } Z_1 \ne -Z_2\}$. It is easy to check that $P(A) > 0$. Denote the standard inner product in $\mathbb{R}^n$ as $\langle \cdot, \cdot \rangle$ (ie $\langle y, z \rangle = \sum_{i=1}^{n} y_i z_i$), and let $|\cdot|$ denote the Euclidean norm in $\mathbb{R}^n$ (ie $|y| = \sqrt{\sum_{i=1}^{n} y_i^2}$). The square of the sample correlation coefficient between vectors $y$ and $z$ is given by

$$r^2(y, z) = \frac{\langle y - \bar{y}, z - \bar{z} \rangle^2}{|y - \bar{y}|^2\,|z - \bar{z}|^2}$$

It is straightforward to verify that on $A$ the vector $Z_2 - \bar{Z}_2$ is linearly independent of $Z_1 - \bar{Z}_1$ (ie cannot be represented as $k(Z_1 - \bar{Z}_1)$, where $k \in \mathbb{R}$).

Let $\xi \sim N(0, \sigma^2 I_{n \times n})$ be independent of $Z_1$ and $Z_2$. According to model (4), we define the vector of trait values as $Y = \mu + aZ_1 + \xi$. Let $R = |\xi|$ and $X = (1/|\xi|)\xi$, so that $\xi = RX$. Observe that $X$ is uniformly distributed on the unit sphere, $R^2/\sigma^2$ is $\chi^2$ distributed with $n$ degrees of freedom, and that $X$ and $R$ are independent. It holds

$$\frac{r^2(Y, Z_2)}{r^2(Y, Z_1)} = \frac{\bigl((a/R)\langle Z_1, Z_2 - \bar{Z}_2 \rangle + \langle X, Z_2 - \bar{Z}_2 \rangle\bigr)^2\,|Z_1 - \bar{Z}_1|^2}{\bigl((a/R)|Z_1 - \bar{Z}_1|^2 + \langle X, Z_1 - \bar{Z}_1 \rangle\bigr)^2\,|Z_2 - \bar{Z}_2|^2}$$

Since $Z_1 - \bar{Z}_1$ and $Z_2 - \bar{Z}_2$ are linearly independent (a.s.) on $A$, there exists $X_0$ on the unit sphere in $\mathbb{R}^n$ such that $\langle X_0, Z_2 - \bar{Z}_2 \rangle \ne 0$ and $\langle X_0, Z_1 - \bar{Z}_1 \rangle = 0$. Therefore, $r^2(Y, Z_2)/r^2(Y, Z_1)$ can be arbitrarily large, provided $R$ is large and $X$ is in a sufficiently small neighborhood of $X_0$. Since the distribution of $\xi$ is multivariate normal and independent of $Z_1$ and $Z_2$, the probability of such event is larger than zero and it immediately holds that

$$P(\hat{h}_2^2 > \hat{h}_1^2) = P\!\left(\frac{r^2(Y, Z_2)}{r^2(Y, Z_1)} > 1\right) > 0. \qquad \blacksquare$$
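Proposition 1 can be illustrated numerically. The following Python sketch is not part of the original analysis: it draws marker genotypes and trait values as in the proof (with $\mu = 0$), computes the adjusted heritability estimate at each marker, and counts how often marker 2 beats the true QTL marker. The function names and the parameter values ($n = 50$, $c = 0.1$, $a = 0.5$, $\sigma = 1$) are hypothetical choices made only for illustration.

```python
import random


def simulate_backcross(n, c, a, sigma, rng):
    """Draw Z1, Z2 in {-1/2, +1/2} and trait values Y = a*Z1 + xi (mu assumed 0)."""
    z1 = [rng.choice((-0.5, 0.5)) for _ in range(n)]
    # Marker 2 carries the opposite genotype with probability c (recombination).
    z2 = [-z if rng.random() < c else z for z in z1]
    y = [a * z + rng.gauss(0.0, sigma) for z in z1]
    return z1, z2, y


def r_squared(y, z):
    """Squared sample correlation; returns 0.0 when either vector is constant."""
    n = len(y)
    ybar = sum(y) / n
    zbar = sum(z) / n
    syz = sum((yi - ybar) * (zi - zbar) for yi, zi in zip(y, z))
    syy = sum((yi - ybar) ** 2 for yi in y)
    szz = sum((zi - zbar) ** 2 for zi in z)
    if syy == 0.0 or szz == 0.0:
        return 0.0
    return syz * syz / (syy * szz)


def adjusted_h2(r2, n):
    """Heritability estimate as adjusted R^2, the form used in equation (A.1)."""
    return ((n - 1) * r2 - 1.0) / (n - 2)


def prob_wrong_marker(n, c, a, sigma, reps, seed=0):
    """Monte Carlo estimate of P(h2_hat at marker 2 > h2_hat at marker 1)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(reps):
        z1, z2, y = simulate_backcross(n, c, a, sigma, rng)
        if adjusted_h2(r_squared(y, z2), n) > adjusted_h2(r_squared(y, z1), n):
            wins += 1
    return wins / reps


if __name__ == "__main__":
    # Hypothetical settings, chosen only to make the effect visible.
    print(prob_wrong_marker(n=50, c=0.1, a=0.5, sigma=1.0, reps=2000))
```

Because the adjusted estimate is an increasing function of $r^2$ for fixed $n$, comparing heritability estimates is equivalent to comparing squared correlations, which is the reduction used in the last line of the proof.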

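The threshold conversion in Appendix A1 can be checked with a few lines of Python. This is a minimal sketch of equation (A.1) and its inverse; the function names are hypothetical, and the sample size $n = 200$ is an assumption that is not stated in the appendix itself. It is used here only because it reproduces the reported value when the likelihood ratio threshold is 13.45.

```python
import math


def heritability_from_lrt(lrt, n):
    """Equation (A.1): adjusted R^2 written in terms of the LRT statistic."""
    r2 = 1.0 - math.exp(-lrt / n)
    return ((n - 1) * r2 - 1.0) / (n - 2)


def lrt_from_heritability(h2, n):
    """Inverse of (A.1): the LRT value that corresponds to a given estimate."""
    r2 = ((n - 2) * h2 + 1.0) / (n - 1)
    return -n * math.log(1.0 - r2)


if __name__ == "__main__":
    n = 200  # assumed sample size, not given in the appendix
    h2_threshold = heritability_from_lrt(13.45, n)
    print(round(h2_threshold, 4))  # 0.0603 under the assumed n
    print(round(lrt_from_heritability(h2_threshold, n), 2))  # 13.45 (round trip)
```

Under this assumption, any replicate whose heritability estimate falls below about 0.06 would not reach significance, which is consistent with the minimal significant estimate of 0.0612 reported for interval mapping.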