Copyright Ó 2009 by the Genetics Society of America DOI: 10.1534/genetics.109.104190

Bayesian Mapping Based on Reconstruction of Recent Genetic Histories

Dario Gasbarra,*,1 Matti Pirinen,* Mikko J. Sillanpa¨a¨*,† and Elja Arjas*,‡ *Department of Mathematics and Statistics and †Department of Animal Science, University of Helsinki, FIN-00014 Helsinki, Finland and ‡National Institute for Health and Welfare, FI-00271 Helsinki, Finland Manuscript received April 20, 2009 Accepted for publication July 15, 2009

ABSTRACT We assume that quantitative measurements on a considered trait and unphased genotype data at certain marker loci are available on a sample of individuals from a background population. Our goal is to map quantitative trait loci by using a Bayesian model that performs, and makes use of, probabilistic reconstructions of the recent unobserved genealogical history (a pedigree and a flow at the marker loci) of the sampled individuals. This work extends variance component-based linkage analysis to settings where the unobserved pedigrees are considered as latent variables. In addition to the measured trait values and unphased genotype data at the marker loci, the method requires as an input estimates of the population allele frequencies and of a marker map, as well as some parameters related to the population size and the mating behavior. Given such data, the posterior distribution of the trait parameters (the number, the locations, and the relative variance contributions of the trait loci) is studied by using the reversible-jump Markov chain Monte Carlo methodology. We also introduce two shortcuts related to the trait parameters that allow us to do analytic integration, instead of stochastic sampling, in some parts of the algorithm. The method is tested on two simulated data sets. Comparisons with traditional variance component linkage analysis and association analysis demonstrate the benefits of our approach in a gene mapping context.

SSUME that quantitative measurements on certain (Gasbarra et al. 2007a). Here we extend that work to A trait and unphased genotype data at certain mark- map QTL. er loci are available on a sample of individuals from a Many QTL mapping methods adopt a two-step strat- population. In quantitative trait locus (QTL) mapping egy where one first estimates the relatedness structure the goal is to find positions in the genome in which the from marker data, either locuswise or at the level of genetic variation between the individuals explains the individuals (Bink et al. 2008), and then incorporates the observed differences in the considered quantitative trait. estimates into subsequent QTL or association analyses A central task in QTL mapping is to estimate how the as if they were observed quantities (e.g.,Meuwissen and sampled individuals are related to each other in different Goddard 2004, 2007; Yu et al. 2006). A special feature of parts of the genome, since such information provides a our approach is that both of these tasks are carried out meanstoidentifyregionswheretheestimatedinheritance simultaneously within a single Bayesian model. Similar pattern is in accordance with the observed similarities ideas have been used in some coalescent-based gene in the trait values. Important pieces of information for mapping methods that model jointly the relatedness estimating the locus-specific relatedness structures would structure of a trait locus and the (e.g.,Larribe be given by pedigree records and genotyped ancestors. et al. 2002; Morris et al. 2002; Zo¨llner and Pritchard However, for example, in wild animal and plant pop- 2005; Minichiello and Durbin 2006). Our method ulations there are situations where no data on the genetic differs from these in that we are working with the recent history are available or where such data are available only genetic history of the sample (individuals) and not using on a very recent history of the sampled individuals. In view continuous time approximations of coalescent trees and of such situations, we have earlier introduced a Bayesian recombination graphs. model that estimates the unobserved recent history of the Our method can be seen as an extension of pedigree- sampled individuals by performing reconstructions of based linkage analysis to situations where no pedigrees pedigrees and gene flows at the linked marker loci are observed. Thus, even though the relationships be- tween the sampled individuals may be unknown, there should still be close relatives in the sample so that 1Corresponding author: Department of MathematicsandStatistics,P.O. Box 68, University of Helsinki, FIN-00014 Helsinki, Finland. linkage information can be extracted. These kinds of E-mail: [email protected].fi data may be mostly available in wild plant or animal

Genetics 183: 709–721 (October 2009) 710 D. Gasbarra et al. species (Frentiu et al. 2008), but may also be available founder generation. The population is characterized by $ for humans. We also consider similar marker maps that four sets of parameters: Nt9, Nt, at,andbt,fort ¼ 1, ..., T. $ are used in pedigree-based linkage analysis. In our The parameters Nt9 and Nt describe, respectively, the examples we consider 100lociwithpolymorphic number of males and females belonging to generation t markers (say, with six alleles) that are moderately spaced of the population. Parameter at controls the differences along the chromosomes (say, 4 cM apart from each other). of reproductive success between males in generation t: In this article our phenotype model considers only large values of at imply nearly equal numbers of children additive genetic effects for QTL and for the polygene, for each male, whereas for small values of at there will be a but extensions to environmental covariates and genetic few dominant males who are mainly responsible for the interaction effects would be technically straightforward. reproduction. Parameter bt tunes the degree of monog- We analyze the phenotype model by estimating the amy (of males) in generation t: large values of bt lead to genetic components of phenotypic variance. An advan- random mating and small values of bt introduce more tage of a variance component model is that the numbers permanent family structures into the pedigree. Naturally of QTL alleles and their effect sizes need not be the roles of males and females can be changed in the specified (see Yi and Xu 2000). model. We denote this probability measure on pedigree Computationally, we use the reversible-jump Markov graphs by PGðÞ. chain Monte Carlo methodology, which allows us to treat Flow of alleles through the pedigree: We assume a fixed the number of QTL, their positions, and their variance marker map with L loci and denote the recombination contributions as random variables. Because the method fractions between loci by r ¼ (r(l, l9): 1 # l , l9 # L). is computationally intensive, we have introduced two Note that several chromosomes can be modeled simul- shortcuts in the algorithm that allow for an analytic taneously by using the recombination fraction rðl; l9Þ¼ 1 integration over the allelic paths at the QTL and over the 2 to indicate that markers l and l9 lie in different linkage residual variance. As a result, the central parameters of groups. interest are the relative phenotypic variance contribu- By definition, the genome of each individual in the tions of the QTL with respect to the residual variance. pedigree consists of a pair of paternal and maternal We illustrate the usefulness of the method with two haplotypes. The flow of alleles through the pedigree is examples. Comparisons of our results with the fixed- determined by the grandparental origins that for pedigree variance component linkage analysis program haplotype i are denoted by ci ¼ðcið1Þ; ...; ciðLÞÞ 2 lmasy langero L SOLAR (A and B 1998) and with the f0; 1g . The convention used here is that ci(l) ¼ 0if association analysis package TASSEL (Bradbury et al. the allele at locus l of haplotype i is of grandmaternal 2007) clearly show the advantage that is gained from origin, and ci(l) ¼ 1 in the case of grandpaternal origin. being able to model the unobserved part of the recent If an allele carried by an individual in generation t . 0 genetic history of the sampled individuals. is transmitted to some individual in the present gener- ation, we say that the allele is ancestral and otherwise that it is censored. Since we are actually interested only in the paths of the ancestral alleles, we set c (l) ¼ Ø if the allele MODEL FOR GENETIC HISTORY i at locus l of haplotype i is censored.

Consider a sample of n individuals belonging to the The probability of a set C ¼ðciÞi2N of grandparental current generation of the population. We build a origins of nonfounder haplotypes on the pedigree is probability model for their joint genetic history, up to given by T generations backward in time, at certain marker loci Y Y whose relative positions (with respect to each other) are Di ðlÞ ð1Di ðlÞÞ PcðCÞ¼ rð jðci; lÞ; lÞ ð1 rð jðci; lÞ; lÞÞ ; assumed to be known. i2N l2Li The configuration space of possible ancestral histo- ries has three components: the pedigree specifying the where Li ¼ {l : ci(l) 6¼ Ø}, r(l, l9) is the recombination relationships between the individuals, the paths of fraction between loci l and l9, with the convention that 1 alleles of these individuals at the marker loci, and the rð‘; lÞ¼2 , j(ci, l) denotes the last uncensored locus types of the alleles of the founder individuals at gen- of haplotype i before l, with the convention that j(ci, l) ¼ eration T. We described the same probability model on ‘ if l is the first uncensored locus of its chromosome the configuration space earlier (Gasbarra et al. 2007a) in the haplotype i, and Di(l) ¼jci(l) ci(j(ci, l))j with and therefore provide here only a brief summary of this the convention that ci(‘) ¼ 0. model. Types of founder alleles: Denote by gk ¼ðgkðlÞ : l ¼ Pedigree model: For pedigrees we use the probability 1; ...; LÞ the ordered genotype of individual k and model introduced by Gasbarra et al. (2005). The model let A ¼fgk : k 2Fg be the set of founder geno- considers an isolated population with nonoverlapping types. Assuming linkage equilibrium at the founder generations indexed backward in time by t ¼ 0; 1; ...; generation, the probability of the founder alleles is T ,witht ¼ 0 referring to the present and t ¼ T to the given by QTL Mapping on Reconstructed Pedigrees 711

Y YL XNqtl ; ; PAðAÞ¼ frðgkðlÞ lÞ y ¼ m 1 Xqbq 1 h 1 e; ð1Þ k2F l¼1 q¼1 where the population genotype frequencies fr(; l)ateach where y is the n-dimensional vector of , m is marker locus l are assumed given. (If Hardy–Weinberg the population mean, Xq is the n 3 2f matrix describing equilibrium is assumed, we can use the population allele which of the 2f founder alleles each individual carries frequencies instead.) The genotype frequencies are at QTL q, bq is the 2f-dimensional vector of founder allele extended to partially or totally censored genotypes in effects at QTL q, h is the n-dimensional vector of poly- the obvious way. Note that the ordered founder alleles genic contributions (sum of the effects of QTL located together with the grandparental origins of the non- outside of the marker map), and e is the n-dimensional founder haplotypes determine the flow of alleles in the vector of residual errors. Possible extensions ofthe model, pedigree. which would contain also covariates and genetic effects of Prior distribution for genetic history: Given the pedigree higher degrees, are considered in the discussion. $ 2 parameters (Nt9, Nt, bt, at, T), the population genotype We assume that m Nð0; smÞ (univariate normal 2 frequencies and the recombination fractions between the distribution), bq Nð0; sq IÞ (multivariate normal 2 marker loci, a configuration v consisting of a pedigree G distribution), h Nð0; 4spFÞ, where F is the kinship and a gene flow with founder alleles A and grandparental matrix calculated from the pedigree structure, and 2 2 2 2 2 origins C is assigned the (prior) probability e Nð0; sr IÞ. Here sm, sq , sp, and sr are the vari- ance components related to the population mean, a single allele at QTL q, the polygenic contribution, and pðvÞ¼PGðGÞ 3 PAðAÞ 3 PcðCÞ: the residual effect, respectively. The element Fij of the kinship matrix is the conditional probability, given the Our earlier work (Gasbarra et al. 2007a) studied this pedigree structure, that a randomly sampled allele at a distribution conditionally on the observed marker data random locus from individual i is IBD with a randomly and in this article we further add to the model a sampled allele from the same locus from individual j ange contribution from a quantitative phenotype. (see, e.g.,L 2002). In this model the number of Variance component model for the phenotypes: We QTL (Nqtl) is considered as a random variable, and each use a variance component model that is similar to, for QTL has an exact position specified in terms of the example, those by Almasy and Blangero (1998) and genetic distance to the nearest marker locus. Yi and Xu (2000). A slight difference between the For a given genetic history (pedigree, allele paths, standard model development (e.g., Equation 1 in Yi and founder alleles at the marker loci), the phenotype and Xu 2000) and the one given below is that here the model (1) is simply covariances at the QTL are derived explicitly using the y Nð0; SÞ; ð2Þ corresponding incidence matrices (Xq). This formula- tion is chosen because our state space can be aug- where the covariance matrix has the form mented to contain the exact paths at the putative XNqtl QTL, and thus we could work with the exact identity- 2 2 2 2 T S ¼ sm1 1 s Yq 1 4s F 1 s I ð3Þ by-descent (IBD) matrices at the QTL (XqXq ). How- q p r ¼ ever, in this article we use the conditional expectations q 1 of the IBD matrices at QTL, given the allelic paths at the ¼ s2Q; ð4Þ flanking markers. We formulate the model at the level r of single alleles, and therefore certain variance param- with eters in our model equal half of the corresponding XNqtl quantities when they are considered at the level of Q ¼ um1 1 uq Yq 1 4upF 1 I: ð5Þ genotypes. q¼1 To derive the model, we suppose that a quantitative T phenotype is measured from each of the n sampled Here 1 is the n 3 n matrix full of ones, Yq ¼ XqXq is the 2 2 individuals from the current generation of the popula- (unscaled) covariance matrix at QTL q, and uz ¼ sz/sr tion. In the following we explain the phenotype model for z ¼ m, q, p. The representation of S in terms of Q conditionally on genetic history v. Note, however, that becomes useful in the practical computations that we both structures (genetic history and phenotype param- describe below. 2 eters) are random variables in our model and we are Priors: We assume that a priori sr IG(a, d) (inverse- studying their joint posterior distribution conditionally gamma distribution) and that each uz Exp(1) (expo- on the observed marker data and the observed pheno- nential distribution). A priori, each marker interval type values. Conditionally on genetic history v we adopt (between two adjacent markers or between the extreme a simple regression model for the phenotypes, markers and the endpoints of the chromosome) may 712 D. Gasbarra et al. contain at most one QTL and this happens with we add to the algorithm further Metropolis–Hastings probability 1 exp(lD), where D is the genetic length proposal steps involving QTL parameters. As the number of the interval and l . 0 is a hyperparameter chosen of QTL (Nqtl) is a random variable that affects the according to our prior guess on the total number of dimensionality of the model, we update it using the QTL. reversible-jump MCMC methodology (Green 1995). Ear- Posteriors: In our earlier works we estimated the lier, such ideas were used for QTL mapping, for example, posterior distribution of the genetic histories of the by Heath (1997) and Sillanpa¨a¨ and Arjas (1998). sampled individuals, conditionally on the genotype Next we describe the Metropolis–Hastings schemes that observations at the marker loci. In this article our focus are used to update the variables (in each iteration of is on the posterior distribution of the QTL parameters, the MCMC run). namely the number, the positions, and the relative Updating genetic history: We use the proposal variances of the QTL, as well as the relative variance of distributions explained in Gasbarra et al. (2007a) to the polygenic component with respect to the residual update the pedigree and allelic paths. The only modi- variance. The posterior is calculated conditionally on fication is that in the current algorithm the phenotype- the marker and phenotype data and on fixed popula- likelihood contribution is taken into account when tion parameters. We described earlier how to apply calculating the acceptance probability of the proposed Markov chain Monte Carlo (MCMC) methods to esti- configuration. mate the posterior distribution of the genetic histories. Adding and removing QTL: In each iteration with 1 Here we add new parts to our MCMC algorithm that also probability 2 we propose a deletion of an existing QTL. update the QTL parameters. Before introducing the The candidate QTL for deletion is sampled with a MCMC algorithm we consider briefly two simplifying steps probability proportional to the inverses of the effect by which we are able to integrate out analytically several variance. When we propose to delete a QTL, we also variables of the phenotype model during the MCMC run. propose to transfer its variance contribution to the These analytic integrations decrease the number of polygenic variance. Note that the variance related to a 2 variables that need to be updated in the MCMC algorithm QTL q is 2sq , i.e., twice the variance contribution of a and thereby shorten its running time. single QTL allele, since each individual carries two QTL 2 2 Integrating sr out: Instead of updating parameter sr alleles. in the MCMC algorithm, we combine the likelihood In the case that no deletion is proposed, we attempt to 2 function from (2), an inverse-gamma prior for sr and a add a new QTL at a location sampled uniformly on the 2 2 representation S ¼ sr Q to analytically integrate sr out available genetic map. If the sampled interval already from the likelihood formula. (See appendix b for contains a QTL, we propose to replace the existing one details.) As a result we get the likelihood function with a new one. The variance of the proposed QTL is sampled as a uniformly random proportion of the ða=2Þd=2Gððd 1 nÞ=2Þ pðy j QÞ¼ð2pÞðn=2ÞdetðQÞð1=2Þ ; current polygenic variance that is simultaneously pro- ða*=2Þðd1nÞ=2Gðd=2Þ posed to decrease accordingly. Both of these updates propose a change in the number where a* ¼ a 1 yT Q1y, and a and d are the parameters of QTL and since there are continuous variables in- 2 of the inverse-gamma prior for sr. volved in the QTL model, the theory of reversible-jump Integrating over allelic paths at QTL: It would be MCMC (Green 1995) is adapted in computing accep- possible to include in the configuration the complete tance probabilities of the moves. information of the allelic paths at the QTL and thus use Updating current QTL: We choose randomly a QTL the exact Yq matrices in the model. However, using the from Nqtl possibilities and propose a modification to its appendix a 2 exact recursive formulas given in we are able relative variance. In addition, with probability 3 we to compute Yˆ q ¼ EðYq j vÞ at each QTL q, i.e., the simultaneously attempt to modify its location. conditional expectation of Yq given the available allelic The new position is sampled from a normal distribu- paths at the (nearest informative) marker loci. In this tion centered at the current position (with fixed standard article we have approximated the exact model (2) by deviation given as a tuning parameter to the algorithm). ˆ replacing Yq with Yq in Equation 5 when calculating If the new position is outside of the chromosome or if matrix Q. there is already a QTL in the proposed interval, we refrain from modifying the location. Parameter uq for the chosen QTL is perturbed by multiplying it with a random variable from a lognormal MCMC ALGORITHM distribution (whose parameters are fixed for the whole In our earlier articles (Gasbarra et al. 2007a,b) the MCMC run). MCMC algorithms explored the space of possible Simultaneously with the perturbation of uq we also genetic histories of the sampled individuals using several propose to change the parameters up and um with similar different Metropolis–Hastings updating schemes. Here lognormal proposals. QTL Mapping on Reconstructed Pedigrees 713

Updating up and um: We also have separate updates that are equally frequent in the population at the for up and um that propose no modifications to any other founder generation. The recombination fractions be- variable. These are implemented with lognormal pro- tween adjacent markers of the same chromosome are posals (see above). 0.04. Conditionally on the marker data, two QTL were simulated on the genome, q1 between markers 38 and 39, and q between markers 96 and 97, on chromosomes RESULTS 2 2 and 4, respectively. For the founder individuals, five There are three main points that we aim to illustrate equally frequent alleles are assumed to exist in the by the following examples. First, our method produces population at both QTL. The effects of the five alleles 2 good results in comparison with a widely used variance are sampled from Nð0; sq Þ, with s1 ¼ 0.5 and s2 ¼ 0.7 component QTL mapping software SOLAR (Almasy for QTL q1 and q2, respectively. Finally, the phenotypes and Blangero, 1998), even in cases where we are using for the sampled individuals at generation 0 are simu- less information than SOLAR. More specifically, our lated as yi ¼ Q i 1 ei, where Q i is the sum of the effects of method is able to handle also data that do not include the four QTL alleles that i carries and ei Nð0; 1Þ are records on pedigrees, whereas traditional linkage anal- sampled as independent residual effects. The resulting ysis packages like SOLAR work only on the known parts empirical sample variance of the QTL allele effects of a considered pedigree. Second, we show that if the among the sampled individuals was 0.36 for q1 and 0.48 data include close relatives, then our method is able for q2, and the residual variance was 0.79. to separate the true signals from false positives more Results: We applied our method to analyze the phe- efficiently than an association analysis method TASSEL notypic values of 150 individuals belonging to the (Bradbury et al. 2007) that utilizes an estimated re- youngest generation of the above-explained simulated latedness matrix as a part of the mixed linear model genetic history combined with their unphased marker for the phenotype. Our third point is that, in some genotype data. We carried out a pedigree reconstruction genetic mapping situations, it is important to be able to for only a single generation backward, which turned out explicitly model the genetic relatedness between the to be enough in this case, as the nuclear families were sampled individuals beyond only one or two genera- large enough to contain linkage information (three tions backward in time to find strong enough QTL children in each family). The prior distribution for the signals. And this is exactly what our method is designed existence of QTL was chosen in such a way that the to do. expected number of QTL over all four chromosomes Linkage analysis on close relatives: When sampling was about five ðl ¼ 0:01ð1=cMÞÞ. By simulations the from natural populations, one frequently encounters population parameters for the parents’ generation were situations where the pedigree relationships between adjusted to correspond to a population where monog- individuals are not known, but where it is possible that amy is common and very large numbers of offspring in even close relatives are included among the sampled the same family are rare (N9 ¼ N$ ¼ 625, b ¼ 104, and individuals. When such samples are analyzed for the a ¼ 0.625). associations between genetic markers and phenotypic The allele frequencies among parents were assumed values, it is necessary to account for the relatedness to be uniform and the recombination fractions were structure to avoid false positives. In this example we show assumed known exactly (i.e., 0.04 between adjacent how our method is able to simultaneously account for markers). the unknown relatedness and to estimate the locations We executed four independent runs of the algorithm, on the genome that are partly responsible for the phe- each running for 55,000 MCMC iterations, and then notypic variation. discarded the first 5000 iterations as a burn-in part. The Data simulation: We consider 50 nuclear families each results from different runs were similar, suggesting that with three children, whose parents are interconnected the chains had converged. The elapsed time of the runs via a pedigree structure that is simulated, 20 genera- was 5 days but a running time of a single day would tions backward in time, using the pedigree model of have already been sufficient to achieve qualitatively the Gasbarra et al. (2005). (The population parameters are same results. The results (averaged over all four runs) 3 $ at ¼ 5, bt ¼ 10 , Nt9¼ Nt ¼ 1500 50t, for generations describing the expected value of the ratio of the t ¼ 1, 2, ..., 19, where 1 is the parents’ generation and QTL variance to the residual variance, as a function of 19 is the founder generation.) Given the simulated chromosomal location, are shown in the top panel of pedigree structure, the genetic marker data are simu- Figure 1. The measurement unit of location is an interval lated on four chromosomes by first sampling the alleles between two adjacent markers. The positions of the for the founders (who are assumed to be in linkage and simulated QTL are marked with asterisks on chromo- Hardy–Weinberg equilibria) and then dropping the somes 2 and 4. Both QTL can be identified from the through the pedigree according to Mendelian posterior curve of QTL variance, and no considerable rules and the recombination model. Each chromosome additional signals are present elsewhere on the marker contains 30 microsatellite markers, each with six alleles map. The top panel of Figure 1 also shows two intervals 714 D. Gasbarra et al.

Figure 1.—Results of example I. Rows from top to bottom correspond to our method, SOLAR, and TASSEL, respectively. The two true QTL are marked with ‘‘*’’ and the posterior probabilities that the marked intervals contain at least one QTL are shown for our method. surrounding the QTL positions and the corresponding these data contain two QTL, this may result in some posterior probabilities (0.68 and 0.74) that these additional noise in SOLAR’s tests. The overall shapes of intervals contain at least one QTL. These intervals were the curves in the top and middle panels of Figure 1 are chosen manually to cover the regions that were estimated quite similar and both concentrate strongly in the to have relatively high contributions to the phenotypic vicinity of the true QTL. The fact that we have used less variance. information in our analysis than in the SOLAR run does For comparison we also run the data with SOLAR not seem to result in less accurate signals for the true (Almasy and Blangero 1998), which is a widely used QTL. Indeed, with these data our method was able to variance component linkage analysis package for quan- reconstruct the nuclear families with three children very titative traits. It requires pedigree(s), marker data, accurately. and the marker map as input and estimates the IBD The bottom panel of Figure 1 displays marker related distribution between members of the same pedigree at P-values from the association analysis program TASSEL several locations between the markers. The resulting (Bradbury et al. 2007). TASSEL applies a mixed linear IBD information is then utilized in a sequential testing model (MLM) to explain the phenotypes by using procedure for the existence of a QTL between the markers (each marker separately) and can also include markers. SOLAR is able to compute the IBD estimates covariates, population structure, and relatedness struc- using the multipoint method unless the pedigree struc- ture in the analysis. Here we applied a model that tures are too complicated. included the relatedness structure between individuals In this example we gave SOLAR the correct family calculated using the moment estimator of Lynch and structures (50 nuclear families with three children each) Ritland (1999). This approach is close to our own in accompanied with the correct marker map and the the sense that both try to accommodate for the related- genetic marker data on the children. The resulting ness between the sampled individuals and also allow for a multipoint LOD score curves are displayed in the middle polygenic component in the model. panel of Figure 1. It can be seen that also SOLAR gives For these data it turned out that this combination of signals for the two real QTL, but it also assigns nonzero an association analysis with relatedness estimates was not scores to some other regions. The reason for this may be able to separate the true QTL signals from false positives. that SOLAR has a model for only a single QTL, and since However, one must keep in mind that this data set is QTL Mapping on Reconstructed Pedigrees 715

Figure 2.—The 7 youngest genera- tions from the 20-generation pedigree that was used in the data simulation of example II, drawn with Pedfiddler (J. C. Loredo-Osti and K. Morgan).

more suitable for linkage analysis than for association Data simulation: We use a similar procedure to analysis as the marker distances are relatively long and generate the data set as in the previous example, there are close relatives in the data. The association considering 50 nuclear families but now with only two analysis results indicate, however, that these data are not children in each. The parents of those families are too obvious in the sense that one could have established connected by a simulated 18-generation pedigree, the QTL positions directly from the correlations be- which has a bottleneck in its recent history as shown in tween the marker data and phenotypes without model- Figure 2. Only the first 7 generations of the complete ing the linkage within the families. pedigree are actually shown in Figure 2, and the Advantage from modeling several generations: In the remaining generations were simulated as in the previous above example the nuclear families were large enough example. We consider marker data on a single chromo- (with respect to the QTL effects, sample size, and marker some containing 100 microsatellite markers, each with allele distribution) so that the QTL could be found six alleles. The recombination fractions between the already by a model that included only a single genera- adjacent markers are again 0.04. Conditionally on the tion of the ancestors of the sampled individuals. We now marker data, two QTL were simulated on the chromo- consider a more difficult situation where we have to some, with q1 between markers 13 and 14 and q2 between model several additional generations backward in time markers 86 and 87. For the founder individuals, five before clear signals of the QTL can be identified. equally frequent alleles were assumed to exist in the 716 D. Gasbarra et al.

Figure 3.—Results of example II with varying numbers of reconstructed generations. The numerical values are the posterior probabilities that the marked intervals contain at least one QTL. population at both QTL. The effects for the five types of simulation. Thus it seems evident that in these data the 2 founder alleles were sampled from Nð0; sq Þ, with s1 ¼ power for detecting QTL comes from a more distant 0.5 for QTL q1 and s2 ¼ 0.3 for q2. The final phenotypes past than the parents’ or grandparents’ generations. for the sampled individuals at generation 0 were created Each panel of Figure 3 also shows two intervals by adding independent standard normal random vari- surrounding the QTL positions and the corresponding ables to the sum of the QTL effects of each individual. posterior probabilities that these intervals contain at The resulting sample variance of the QTL allele effects least one QTL. These intervals were chosen manually to among the sampled individuals was 0.26 for q1, and 0.12 cover the regions that were estimated to have relatively for q2, with residual variance equal to 0.96. high contributions to the phenotypic variance. For both Results: We applied our method to analyze the intervals the corresponding QTL probabilities consis- phenotypic values of 100 individuals belonging to the tently increase as more generations are included in the youngest generation of the above-explained simulated model. This phenomenon further confirms that we are genetic history combined with their unphased marker able to capture stronger QTL signals from these data by genotype data. Separate reconstructions were carried modeling several generations simultaneously. out for T ¼ 1, 2, 3, 4 generations backward in time. For The same phenomenon can also be seen in Figure 4, the values of the population parameters we used at ¼ 10, where we have also included results from a SOLAR $ bt ¼ 0.001, Nt9¼ Nt ¼ 1000 25t, for generations t ¼ 1, analysis based on the correct nuclear family structure 2, ..., 4. They were chosen on the basis of a similar (middle panel) and from a TASSEL analysis with the simulation experiment as in the previous example. relatedness matrix estimated as in the previous example For each value of T, four separate MCMC runs were (bottom panel). The top panel contains the data from executed and their average values are reported here. our method with T ¼ 4. Since the families are smaller The elapsed time of the runs was 6 days but qualita- and QTL weaker than in the previous example, SOLAR tively similar results were already achieved within a does not seem to be able to catch the QTL q1. However, single day. Figure 3 shows how the posterior of relative SOLAR might perform better if one first identifies q2 and QTL effects, as a function of location, develops as more then fixes it as a covariate when searching for a second generations are included. In particular the step from QTL. Note again that we have used more information in T ¼ 3toT ¼ 4 seems to bring the QTL effects to levels the SOLAR run than with our own method, since even that correspond well to those calculated in the data though we have modeled the history four generations QTL Mapping on Reconstructed Pedigrees 717

Figure 4.—Results of example II. Rows from top to bottom correspond to our method, SOLAR, and TASSEL, respectively. The two true QTL are marked with ‘‘*’’ and the posterior probabilities that the marked intervals contain at least one QTL are shown for our method.

backward in time, it is still based only on the genotype determined by the expected IBD sharing at the QTL data at the youngest generation and no part of the positions, given the IBD sharing at the flanking marker pedigree was considered known to us. We also attempted loci, and the polygenic covariance structure is dictated to analyze these data with SOLAR by giving it the correct by the pedigree configuration. This approach is tar- pedigree structure up to the grandparents’ generation, geted at the settings where the sampled individuals are but SOLAR was not able to handle that due to the likely to be related to each other within a few of the most complexity of the pedigree. recent generations, but where their exact relationships The association analysis with relatedness estimates are not known. If more specific knowledge of the incorporated into the linear model was not able to catch relationships were available, it would also be possible the true QTL positions either. Moreover, since the data to fix the known parts of the pedigree. Thus the tra- include close relatives, association analyses that test a ditional variance component linkage analysis that oper- single marker at a time can be expected to produce ates only on the fixed pedigree becomes a special case of some amount of false positives regardless of the our framework. correction. Our approach is applicable to diploid species and can adapt to several (nonrandom) mating scenarios as well as to user-specified marker maps. The practical useful- DISCUSSION ness of our method depends especially on the amount In this article we have extended our earlier model for of available marker data and on the degree of re- marker data-based pedigree and gene flow estimation to latedness between the sampled individuals. also account for a quantitative phenotype via a variance Marker data: In recent years the development of component approach. In the model, the phenotypic genotyping technologies has been extremely rapid, and variance is decomposed into a random number of QTL therefore the gene mapping studies in humans are effects, a polygenic effect, and a residual. The struc- currently carried out with hundreds of thousands of tures of the covariance matrices for QTL effects are single-nucleotide polymorphisms (SNPs) distributed 718 D. Gasbarra et al. over the genome. It is not computationally possible to of a known pedigree structure together with relatively handle such data at once by the approach taken in this small sample sizes may also be a reason for the relatively article. However, a stepwise strategy, where linkage large estimates for the additive polygenic component in analysis is carried out first, and then association analy- our examples, [uˆp ¼ 0:62 and uˆp ¼ 0:70 in examples I sis is employed conditionally on the results of the and II (T ¼ 4), respectively]. It is likely that the preliminary linkage analysis, is possible, especially in polygenic component has here captured some of the population isolates. Such a strategy is also the motiva- variation due to the QTL and to the residual variance. tion behind the methods that detect associations in the To estimate it more accurately, we would need larger presence of linkage signals (Cantor et al. 2005). It sample and family sizes. For example, Yi and Xu (2000) could also be possible to treat several tightly linked assumed in their examples that the pedigrees of 500 full- SNPs as a single multiallelic locus, if credible haplotype sib families with six siblings in each were known, com- information were available. In that case the linkage pared to our settings of 50 full-sib families with two to between the combined SNPs should be so tight that it three siblings in each and without a known pedigree would be feasible to assume that no recombinations structure. The important thing in our examples was that have occurred within the SNP blocks during the most we were able to get good estimates of the relative effects recent generations of the history of the sample. This of the QTL. would allow us to pick a sparser and computationally MCMC algorithm: The proposal distributions that more tractable marker map from the original large SNP modify the pedigree and the gene flow at the marker panel, while still maintaining some of the information loci are the same as in our previous article (Gasbarra carried by the polymorphic haplotype blocks. et al. 2007a), while the corresponding acceptance ratios Relatedness: In our approach the detection of QTL is have been updated to take into account the phenotype based on the correlations between the allele sharing and model. The main addition was the updates for the phenotypic similarities among the sampled individuals. phenotype parameters, implemented with the reversible- Thus, to identify the QTL, it is necessary that the jump MCMC (RJMCMC) methodology. In recent QTL sampled individuals share ancestors within the esti- literature, RJMCMC algorithms have been accused of mated pedigree. Furthermore, close relatives such as being complicated and slowly mixing, and other ap- siblings and cousins provide a valuable source of in- proaches to model selection have been considered formation concerning the transmission of the alleles (e.g.,Wang et al. 2005). On the other hand, in a recent between the generations. In contrast, the genetic mate- comparison by O’Hara and Sillanpa¨a¨ (2009) RJMCMC rial of the sampled individuals who are isolated from the was found to provide a competent alternative to other rest of the pedigree may spread out arbitrarily among the Bayesian model selection methods. Here updating of the ancestors, because there is no haplotype information phenotype parameters did not notably slow down the coming from genotyped relatives. Hence, our approach sampler, as most of the time is still spent on the is likely to be most useful in settings where the sampled computationally demanding block updates for the ped- individuals are closely related but the exact pedigree igree and the gene flow at the marker loci. Furthermore, records are not available or not reliable. Such cases could our phenotype updates were speeded up by integrating be encountered, for example, in some wild animal out the residual variance and the exact inheritance paths populations. Indeed, a recent topic of general interest at the QTL loci. We also note that in our variance in such a context is the estimation of the relatedness component model the effect of each QTL is character- structure or of the pedigree of the sampled individuals ized by a single (relative) variance value. This may further (Frentiu et al. 2008; Pemberton 2008). The approach facilitate the mixing of our algorithm compared to the introduced here not only provides a means to estimate models that estimate the absolute effects of all possible the relatedness structure, but also offers a simultaneous QTL alleles/genotypes (e.g.,Wang et al. 2005). analysis of a quantitative trait. During the example analyses we found that a good Phenotype model: In this article we considered only initial state for the multigeneration pedigree model can additive genetic effects, but if required, our phenotype be created sequentially, one generation at a time. The model could be easily extended to include environmen- idea is that first the algorithm is run for a certain tal covariates as well as genetic interaction effects. For number of iterations on the state space of t-generation example, inclusion of dominance effects would require pedigrees, and then the final configuration of that run is estimating whether each pair of the sampled individuals augmented to a (t 1 1)-generation configuration. By shares two alleles IBD at the QTL or a polygene and repeating this procedure for t ¼ 1, ..., T 1, an initial would be a technically straightforward addition to the state containing T ancestral generations can be created. current MCMC algorithm. However, identifiability prob- This strategy seemed to yield more realistic initial states lems (for the polygene) due to small sample sizes are than our previous practice (Gasbarra et al. 2007a), common already with known pedigrees (Miztal 1997; which sampled a pedigree and allelic paths from the Waldmann et al. 2008), and when also a pedigree is youngest generation to the founder generation in one estimated, such problems are likely to increase. The lack go. A natural explanation for this improvement is that a QTL Mapping on Reconstructed Pedigrees 719 multigeneration pedigree with a gene flow imposes Frentiu, F. D., S. M. Clegg,J.Chittock,T.Burke,M.W.Blows strong dependencies between the variables, and there- et al., 2008 Pedigree-free animal models: the relatedness matrix reloaded. Proc. R. Soc. Lond. Ser. B Biol. Sci. 275: 639–647. fore it is better to resample and improve the current Gasbarra, D., M. J. Sillanpa¨a¨ and E. Arjas, 2005 Backward simu- configuration locally before it is extended to the next lation of ancestors of sampled individuals. Theor. Popul. Biol. 67: generation. 75–83. asbarra irinen illanpa¨a¨ rjas Another way to improve the mixing of the sampler G , D., M. P ,M.J.S and E. A , 2007a Estimating genealogies from linked marker data: a could be an application of parallel computation, e.g.,in Bayesian approach. BMC Bioinformatics 8: 411. the form of Metropolis-coupled Markov chain Monte Gasbarra, D., M. Pirinen,M.J.Sillanpa¨a¨,E.Salmela and E. Arjas, Carlo (MCMCMC) (Geyer 1991), where a number of 2007b Estimating genealogies from unlinked marker data: a Bayesian approach. Theor. Popul. Biol. 72: 305–322. processors running in parallel would execute separate Geyer, C. J., 1991 Markov chain Monte Carlo maximum likelihood, MCMC algorithms, of which only one would correspond pp. 156–163 in Computing Science and Statistics: Proceedings of the exactly to the target distribution. The recombination 23rd Symposium on the Interface. Interface Foundation, Fairfax, VA. reen likelihoods of the other MCMC samplers would then G , P. J., 1995 Reversible jump Markov chain Monte Carlo com- putation and Bayesian model determination. Biometrika 82: have different degrees of relaxation, represented by a 711–732. temperature parameter. The chains with higher tempera- Heath, S. C., 1997 Markov chain Monte Carlo segregation and link- ture values would pay less attention to the recombination age analysis for oligogenic models. Am. J. Hum. Genet. 61: 748– likelihoods and as a result would explore the configura- 760. Lange, K., 2002 Mathematical and Statistical Methods for Genetic Anal- tion space more freely. At certain points of time the chains ysis. Springer, New York. at the adjacent temperature values would communicate Larribe, F., S. Lessard and N. J. Schork, 2002 Gene mapping via and possibly switch their temperature values according to the ancestral recombination graph. Theor. Popul. Biol. 62: 215– 229. the Metropolis–Hastings rule. The final results would be Lynch, M., and K. Ritland, 1999 Estimation of pairwise relatedness collected only from the coldest chain, i.e., from the chain with molecular markers. Genetics 152: 1753–1766. with the original target distribution. Meuwissen, T. H. E., and M. E. Goddard, 2004 Mapping multiple Conclusion: Our experiences with the method re- QTL using linkage disequilibrium and linkage analysis informa- tion and multitrait data. Genet. Sel. Evol. 36: 261–279. ported here suggest that a joint estimation of the recent Meuwissen, T. H. E., and M. E. Goddard, 2007 Multipoint identity- relatedness structure and of the locations of quantitative by-descent prediction using dense markers to map quantitative trait loci is not only feasible but also advantageous in trait loci and estimate effective population size. Genetics 176: certain situations compared to other approaches that are 2551–2560. Minichiello, M. J., and R. Durbin, 2006 Mapping trait loci by use not able to model the inheritance process using esti- of inferred ancestral recombination graphs. Am. J. Hum. Genet. mated pedigrees. The practical usefulness of this method 79: 910–922. depends, in a complex way, on several factors such as the Miztal, I., 1997 Estimation of variance components with large-scale dominant models. J. Dairy Sci. 80: 965–974. degree of relatedness of the sampled individuals, the Morris, A. P., J. C. Whittaker and D. J. Balding, 2002 Fine-scale of the trait, and the effect sizes of the mapping of disease loci via shattered coalescent modeling of ge- QTL and needs to be considered separately in any nealogies. Am. J. Hum. Genet. 70: 686–707. ara illanpa¨a¨ particular situation. An ability to analyze these kinds of O’H , R. B., and M. J. S , 2009 A review of Bayesian vari- able selection methods: what, how and which. Bayesian Anal. 4: complex models helps in filling the gap between the 85–118. frameworks of linkage and association analyses and as Pemberton, J. M., 2008 Wild pedigrees: the way forward. Proc. R. such deserves further development. Soc. Lond. Ser. B Biol. Sci. 275: 613–621. Sillanpa¨a¨, M. J., and E. Arjas, 1998 Bayesian mapping of multiple We are thankful to two anonymous reviewers for their comments quantitative trait loci from incomplete inbred line cross data. that helped us to improve the manuscript. This work was supported by Genetics 148: 1373–1388. grant nos. 50178, 202324, and 53297 (Centre of Population Genetic Waldmann, P., J. Hallander,F.Hoti and M. J. Sillanpa¨a¨, Analyses) from the Academy of Finland and by the ComBi Graduate 2008 Efficient Markov chain Monte Carlo implementation of School (M.P.). Bayesian analysis of additive and dominance genetic variances in noninbred pedigrees. Genetics 179: 1101–1112. Wang, H., Y. M. Zhang,X.Li,G.L.Masinde,S.Mohan et al., 2005 Bayesian shrinkage estimation of quantitative trait loci pa- LITERATURE CITED rameters. Genetics 170: 465–480. Yi, N., and S. Xu, 2000 Bayesian mapping of quantitative trait loci Almasy, L., and J. Blangero, 1998 Multipoint quantitative trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 62: under the identity-by-descent-based variance component model. Genetics 156: 411–422. 1198–1211. u ressoir riggs roh i amasaki Bink, M. C. A. M., A. D. Anderson,W.E.van de Weg and E. A. Y , J., G. P ,W.H.B ,I.V B ,M.Y et al., Thompson, 2008 Comparison of marker-based pairwise relat- 2006 A unified mixed-model method for association mapping edness estimators on a pedigreed plant population. Theor. Appl. that accounts for multiple levels of relatedness. Nat. Genet. 38: Genet. 117: 843–855. 203–208. Bradbury,P.J.,Z.Zhang,D.E.Kroon,T.M.Casstevens,Y. Zo¨llner, S., and J. K. Pritchard, 2005 Coalescent-based associa- Ramdoss et al., 2007 TASSEL: software for association map- tion mapping and fine mapping of complex trait loci. Genetics ping of in diverse samples. Bioinformatics 23: 169: 1071–1092. 2633–2635. Cantor, R. M., G. K. Chen,P.Pajukanta and K. Lange, 2005 Association testing in a linked region using large pedi- grees. Am. J. Hum. Genet. 76: 538–542. Communicating editor: H. Zhao 720 D. Gasbarra et al.

APPENDIX A: EXACT RECURSIVE COMPUTATION OF THE CONDITIONAL COVARIANCES AT THE QTL GIVEN THE ALLELIC PATHS AT THE FLANKING MARKER LOCI As usual, we denote by v a configuration that describes the pedigree together with the paths and types of the (ancestral) alleles at marker loci. We numerate the haplotypes in the pedigree, starting from the founder generation. Let ci(l) 2 {0, 1} be the grandparental origin of the allele at marker locus l on haplotype i, with the convention that ci(l) ¼ 0 when the allele is inherited from the grandmother and ci(l) ¼ 1 if from the grandfather. Let AiðlÞ2f1; ...; 2f g be the founder allele of haplotype i; that is, Ai(l) ¼ k if the allele at position l of haplotype i is inherited from the kth founder haplotype. Suppose that a candidate QTL is located at position l 1 D between markers l and l9 (l , l 1 D , l9). If the ith haplotype belongs to the founder generation,

1; ifi ¼ k; PðA ðl 1 DÞ¼k j vÞ¼ i 0; ifi 6¼ k:

Otherwise, haplotype i is formed by recombining the haplotypes gf and gm transmitted from a grandfather and a grandmother, respectively. Then the conditional distribution of the grandparental origin at the locus (l 1 D), given the flanking markers, is

Pðciðl 1 DÞ¼0 j vÞ¼1 Pðciðl 1 DÞ¼1 j vÞ

¼ Pðciðl 1 DÞ¼0 j ciðlÞ; ciðl9ÞÞ and according to the Haldane recombination model

Pðciðl 1 DÞ¼0 j ciðlÞ¼0; ciðl9Þ¼0Þ

¼ Pðciðl 1 DÞ¼1 j ciðlÞ¼1; ciðl9Þ¼1Þ ð1 1 expð2DÞÞð1 1 expð2ðl9 l DÞÞÞ ¼ ; 2ð1 1 expð2ðl9 lÞÞÞ and

Pðciðl 1 DÞ¼0 j ciðlÞ¼0; ciðl9Þ¼1Þ

¼ Pðciðl 1 DÞ¼1 j ciðlÞ¼1; ciðl9Þ¼0Þ ð1 1 expð2DÞÞð1 expð2ðl9 l DÞÞÞ ¼ : 2ð1 expð2ðl9 lÞÞÞ

Using these conditional recombination probabilities we find the recursive formulas for the conditional IBD probabilities of nonfounder haplotypes:

PðAiðl 1 DÞ¼k j vÞ¼PðAgf ðl 1 DÞ¼k j vÞPðciðl 1 DÞ¼1 j ciðlÞ; ciðl9ÞÞ

1 PðAgmðl 1 DÞ¼k j vÞPðciðl 1 DÞ¼0 j ciðlÞ; ciðl9ÞÞ:

Since we follow only the paths of the ancestral alleles, it is possible that the grandparental origins of the ith haplotype are not known at the flanking markers of the QTL. Then in the formulas above, l and l9 are the closest markers on the left and on the right of the candidate QTL at which the configuration v determines the grandparental origins. In the case that there are not any ancestral alleles on the left (or right) side of the QTL, the corresponding grandparental 1 origin probabilities are set equal to 2 . Since we want to compute conditional genetic covariances between individuals, we use the same idea to compute the joint conditional distribution of the IBD indicators on two haplotypes. Again the recursion starts from the founder generation, where the IBD indicators are fully determined. If i and i9 are two separate haplotypes obtained by recombining the haplotypes gf, gm and gf9,gm9, respectively, then QTL Mapping on Reconstructed Pedigrees 721

PðAiðl 1 DÞ¼k; Ai9ðl 1 DÞ¼k9 j vÞ

¼ Pðciðl 1 DÞ¼0 j ciðlÞ; ciðl9ÞÞPðci9ðl 1 DÞ¼0 j ci9ðlÞ; ci9ðl9ÞÞ

3 f1ðgf ¼ gf9Þ1ðk ¼ k9ÞPðAgf ðl 1 DÞ¼k j vÞ

1 1ðgf 6¼ gf9ÞPðAgf ðl 1 DÞ¼k; Agf9ðl 1 DÞ¼k9 j vÞg

1 Pðciðl 1 DÞ¼0 j ciðlÞ; ciðl9ÞÞPðci9ðl 1 DÞ¼1 j ci9ðlÞ; ci9ðl9ÞÞ

3 PðAgf ðl 1 DÞ¼k; Agm9ðl 1 DÞ¼k9 j vÞ

1 Pðciðl 1 DÞ¼1 j ciðlÞ; ciðl9ÞÞPðci9ðl 1 DÞ¼0 j ci9ðlÞ; ci9ðl9ÞÞ

3 PðAgmðl 1 DÞ¼k; Agf9ðl 1 DÞ¼k9 j vÞ

1 Pðciðl 1 DÞ¼1 j ciðlÞ; ciðl9ÞÞPðci9ðl 1 DÞ¼1 j ci9ðlÞ; ci9ðl9ÞÞ

3 f1ðgm ¼ gm9Þ1ðk ¼ k9ÞPðAgmðl 1 DÞ¼k j vÞ

1 1ðgm 6¼ gm9ÞPðAgmðl 1 DÞ¼k; Agm9ðl 1 DÞ¼k9 j vÞg:

Here 1(x ¼ y) is the indicator that equals 1, if x ¼ y, and 0 otherwise. Using the above formulas, the expectation of the conditional covariance between individuals I and J at QTL q given the allele paths is calculated as

X X X2f

Eð½Yq IJ j vÞ¼ PðAiðl 1 DÞ¼k ¼ Aj ðl 1 DÞjvÞ; i¼i1;i2 j¼j1;j2 k¼1 where i1 and i2 are the haplotypes of I and j1 and j2 are the haplotypes of J.

2 APPENDIX B: ANALYTIC INTEGRATION OF sr FROM THE LIKELIHOOD With the notation introduced in the variance component model for the phenotypes section we assume that 2 2 2 ðy j sr ; QÞNð0; sr QÞ and that sr IG(a, d). Then

2 2 2 pðy; sr j QÞ¼pðsr Þ 3 pðy j sr ; QÞ d=2 ða=2Þ 2 ððd12Þ=2Þ a ðn=2Þ 2 ð1=2Þ 1 T 2 1 ¼ ðsr Þ exp 2 3 ð2pÞ detðsr QÞ exp y ðsr QÞ y Gðd=2Þ 2sr 2 d=2 ða=2Þ 2 ððd12Þ=2Þ a ðn=2Þ 2 ðn=2Þ ð1=2Þ 1 T 1 ¼ ðsr Þ exp 2 3 ð2pÞ ðsr Þ detðQÞ exp 2 y Q y Gðd=2Þ 2sr 2sr T 1 d=2 2 ððd1n12Þ=2Þ a 1 y Q y ðn=2Þ ð1=2Þ ða=2Þ ¼ðsr Þ exp 2 ð2pÞ detðQÞ 2sr Gðd=2Þ ða*=2Þðd1nÞ=2 a* ða=2Þd=2Gððd 1 nÞ=2Þ s2 ððd1n12Þ=2Þ 3 p ðn=2Þ Q ð1=2Þ ; ¼ ð r Þ exp 2 ð2 Þ detð Þ * ðd1nÞ=2 Gððd 1 nÞ=2Þ 2sr ða =2Þ Gðd=2Þ where a* ¼ a 1 yT Q1y. Now the term left of the multiplication symbol is the density of the IG(a*, d 1 n) distribution 2 2 (as the function of sr), whence its integral (with respect to sr) is 1, and only the term on the right side remains. That is,

ða=2Þd=2Gððd 1 nÞ=2Þ pðy j QÞ¼ð2pÞðn=2ÞdetðQÞð1=2Þ : ða*=2Þðd1nÞ=2Gðd=2Þ