Genetics: Published Articles Ahead of Print, published on July 20, 2009 as 10.1534/genetics.109.103952
1 Additive genetic variability and the Bayesian alphabet
; ; ; \ 2 Daniel Gianola y z , Gustavo de los Campos, William G. Hill , # 3 Eduardo Manfrediz and Rohan Fernando
e-mail: [email protected]
Department of Animal Sciences, University of Wisconsin-Madison, Madison, Wisconsin 53706
yDepartment of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, N-1432 Ås, Norway
zInstitut National de la Recherche Agronomique, UR631 Station d’Amélioration Génétique des Animaux, BP 52627, 32326 Castanet-Tolosan, France \Institute of Evolutionary Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JT, UK #Department of Animal Science, Iowa State University, Ames, Iowa 50011
4 July 14, 2009
5 Corresponding author: Daniel Gianola, Department of Animal Sciences, 1675 Observatory
6 Dr., Madison, WI 53706. Phone: (608) 265-2054. Fax: (608) 262-5157. E-mail:
8 Running head: Quantitative variation and the Bayesian alphabet
9 Key Words: quantitative genetics, marker assisted selection, genomic selection, single nu-
10 cleotide polymorphisms, Bayesian methods, prior distributions.
1 11 ABSTRACT
12 The use of all available molecular markers in statistical models for prediction of quantitative
13 traits has led to what could be termed a genomic-assisted selection paradigm in animal and plant
14 breeding. This paper provides a critical review of some theoretical and statistical concepts in the
15 context of genomic-assisted genetic evaluation of animals and crops. First, relationships between
16 the (Bayesian) variance of marker e¤ects in some regression models and additive genetic variance
17 are examined under standard assumptions. Second, the connection between marker genotypes and
18 resemblance between relatives is explored, and linkages between a marker-based model and the
19 in…nitesimal model are reviewed. Third, issues associated with the use of Bayesian models for
20 marker-assisted selection, with a focus on the role of the priors, are examined from a theoretical
21 angle. The sensitivity of some Bayesian speci…cation that has been proposed (called "Bayes A" )
22 with respect to priors is illustrated with a simulation. Methods that can solve potential shortcomings
23 of some of these Bayesian regression procedures are discussed brie‡y.
24 1 Introduction
25 In an in‡uential paper in animal breeding, MEUWISSEN et al. (2001) suggested using all avail-
26 able molecular markers as covariates in linear regression models for prediction of genetic value for
27 quantitative traits. This has led to a genome-enabled selection paradigm. For example, major dairy
28 cattle breeding countries are now genotyping elite animals and genetic evaluations based on SNPs
29 (single nucleotide polymorphisms) are becoming routine (HAYES et al. 2009; VAN RADEN et al.
30 2009). A similar trend is taking place in poultry (e.g., LONG et al. 2007; GONZÁLEZ-RECIO et
31 al. 2008, 2009), beef cattle (GARRICK 2008 personal communication) and plants (HEFFNER et
32 al. 2009).
33 The extraordinary speed with which events are taking place hampers the process of relating
34 new developments to extant theory, as well as the understanding of some of the statistical methods
35 proposed so far. These span from Bayes hierarchical models (e.g., MEUWISSEN et al. 2001)
36 and the Bayesian LASSO (e.g., de LOS CAMPOS et al. 2009b) to ad hoc procedures (e.g., VAN
37 RADEN 2008). Another issue is how parameters of models for dense markers relate to those of
38 classical models of quantitative genetics.
39 Many statistical models and approaches have been proposed for marker-assisted selection. These
40 include multiple regression on marker genotypes (LANDE and THOMPSON 1990), best linear unbi-
41 ased prediction (BLUP) including e¤ects of a single marker locus (FERNANDO and GROSSMAN
42 1989), ridge regression (WHITTAKER et al. 2000), Bayesian procedures (MEUWISSEN et al.
43 2001; GIANOLA et al. 2003; XU 2003; de los CAMPOS et al. 2009b) and semi-parametric speci…-
44 cations (GIANOLA et al. 2006a; GIANOLA and de los CAMPOS 2008). In particular, the methods
45 proposed by MEUWISSEN et al. (2001) have captured enormous attention in animal breeding, be-
46 cause of several reasons. First, the procedures cope well with data structures in which the number
47 of markers amply exceeds the number of markers, the so called "small n; large p" situation. Second,
2 48 the methods in MEUWISSEN et al. (2001) constitute a logical progression from the standard BLUP
49 widely used in animal breeding to richer speci…cations, where marker-speci…c variances are allowed
50 to vary at random over many loci. Third, Bayesian methods have a natural way of taking into
51 account uncertainty about all unknowns in a model (e.g., GIANOLA and FERNANDO 1986) and,
52 when coupled with the power and ‡exibility of Markov chain Monte Carlo, can be applied to almost
53 any parametric statistical model. In MEUWISSEN et al. (2001) normality assumptions are used
54 together with conjugate prior distributions for variance parameters; this leads to computational
55 representations that are well known and have been widely used in animal breeding (e.g., WANG
56 et al. 1993, 1994). An important question is that of the impact of the prior assumptions made in
57 these Bayesian models on estimates of marker e¤ects and, more importantly, on prediction of future
58 outcomes, which is central in animal and plant breeding.
59 A second aspect mentioned above, is how parameters from these marker based models relate
60 to those of standard quantitative genetics theory, with either a …nite or an in…nite number (the
61 in…nitesimal model) of loci assumed. The relationships depend on the hypotheses made at the
62 genetic level and some of the formulae presented, e.g., in MEUWISSEN et al. (2001) use linkage
63 equilibrium; however, the reality is that marker-assisted selection relies on the existence of linkage
64 disequilibrium. While a general treatment of linkage disequilibrium is very di¢ cult in the context of
65 models for predicting complex phenotypic traits, it is important to be aware of its potential impact.
66 Another question is the de…nition of additive genetic variance, according to whether marker e¤ects
67 are assumed …xed or random, with the latter corresponding to the situation in which such e¤ects
68 are viewed as sampled randomly from some hypothetical distribution. MEUWISSEN et al. (2001)
69 employ formulae for eliciting the variance of marker e¤ects, given some knowledge of the additive
70 genetic variance in the population. Their developments begin with the assumption that markers
71 e¤ects are …xed, but these eventually become random without a clear elaboration of why this is so.
72 Since their formulae are used in practice for eliciting priors in the Bayesian treatment (e.g., HAYES
73 et al. 2009), it is essential to understand their ontogeny, especially considering that priors may have
74 an impact on prediction of outcomes.
75 The objective of this paper is to provide a critical review of some of these aspects in the context
76 of genomic-assisted evaluation of quantitative traits. First, relationships between the (Bayesian)
77 variance of marker e¤ects in some regression models and additive genetic variance are examined.
78 Second, connections between marker genotypes and resemblance between relatives are explored.
79 Third, the liaisons between a marker-based model and the in…nitesimal model are reviewed. Fourth,
80 and in the context of the quantitative genetics theory discussed in the preceding sections, some
81 statistical issues associated with the use of Bayesian models for (massive) marker-assisted selection
82 are examined, with main focus on the role of the priors. The sensitivity of some of these methods
83 with respect to priors is illustrated with a simulation.
3 84 2 Relationship between the variance of marker e¤ects and additive
85 genetic variance
86 2.1 Additive genetic variance
87 A simple speci…cation serves to set the stage. The phenotypic value (y) for a quantitative trait is
88 described by the linear model
89 y = wa + E; (1)
90 where a is the additive e¤ect of a bi-allelic locus on the trait, w is a random indicator variable
91 (covariate) relating a to the phenotype, and E is an independently distributed random deviate, E 92 (0;VE) ; where VE is the environmental or residual variance, provided gene action is additive. Under
93 Hardy-Weinberg (HW) equilibrium, the frequencies of the three possible genotypes are Pr (MM) = 2 2 94 p ; Pr (Mm) = 2pq and Pr (mm) = q ; where p = Pr (M) and q = 1 p = Pr (m) : Code arbitrarily 95 the states of w such that 1 with probability p2 96 w = 8 0 with probability 2pq (2) <> 1 with probability q2 97 The genetic values of MM; Mm and mm:> individuals are a; 0 and a; respectively. Then
98 E (w) = p q; (3) 99 2 100 E w = 1 2pq (4)
101 and
102 V ar (w) = 2pq: (5)
103 This is the variance among genotypes (not among their genetic values) at the locus under HW
104 equilibrium.
105 A standard treatment (e.g., FALCONER AND MACKAY 1996) regards the additive e¤ect a as
106 a …xed parameter. The conditional distribution of phenotypes, given a (but unconditionally with
107 respect to w; since genotypes vary at random according to HW frequencies) has mean
108 E (y a) = E (w) a = (p q) a; (6) j
109 and variance 2 2 110 V ar (y a) = a V ar (w) + VE = 2pqa + VE = VA + VE (7) j 2 111 where VA = 2pqa is the additive genetic variance at the locus. In this standard treatment, the
112 additive genetic variance depends on the additive e¤ect a but not on its variance; unless a is assigned
113 a probability distribution it does not possess variance.
4 114 2.2 Uncertainty about the additive e¤ect
2 115 Assume now that a ; ; where is the mean of the distribution. In the Appendix of MEUWIS- a 116 SEN et al. (2001), a suddenly mutates from …xed to random without explanation. How does the 2 117 variance of a arise? Parameter a can be assigned at least two interpretations. The …rst one is as
118 the variance between a e¤ects in a conceptual sampling scheme in which such e¤ects are drawn at 2 119 random from a population of loci. In the second (Bayesian), a represents uncertainty about the
120 true but unknown value of the additive e¤ect of the speci…c locus, but without invoking a scheme 2 121 of random sampling over a population of loci. For example, a = 0 means in a Bayesian sense that
122 a = with complete certainty, but not necessarily that the locus does not have an e¤ect, since
123 may (may not) be distinct from 0; this cannot be overemphasized. With a single locus the Bayesian
124 interpretation is more intuitive, since it is hard to envisage a reference population of loci in this
125 case. 2 2 126 Irrespective of the interpretation assigned to , the assumption a ; induces another a a 127 conditional distribution (given w) with mean and variance
128 E (y w) = E (wa w) = w; (8) j j
129 and 2 2 130 V ar (y w) = w + VE; (9) j a
131 respectively. However, both w (the genotypes) and a (their e¤ects) are now random variables
132 possessing a joint distribution. Here, it is assumed that w and a are independent, but this may not
133 be so, as in a mutation-stabilizing selection model (e.g., TURELLI 1985) or in situations discussed
134 by ZHANG and HILL (2005) where the distribution of gene frequencies after selection depends on
135 a. Deconditioning over both a and w (that is, averaging over all genotypes at the locus and all
136 values that a can take), the expected value and variance of the marginal distribution of y are
137 E (y) = Ew [E (y w)] = E (w) = (p q) ; (10) j
138 and
139 V ar (y) = Ew [V ar (y w)] + V arw [E (y w)] j j 2 2 140 = Ew w a + VE + V arw [w] 2 2 141 = (1 2pq) + 2pq + VE: (11) a
142 This distribution is not normal: the phenotype results from multiplying a discrete random variable
143 w by the normal variate a; and then adding a deviate E; which may or may not be normal, depending 2 2 144 on what is assumed about the residual distribution. The genetic variance is now (1 2pq) a+2pq . 2 145 The term (1 2pq) a stems from randomness (uncertainty) about a; and it dissipates from (11) 2 2 2 146 only if a = 0. Note that there is additive genetic variance even if a = 0; and it is equal to 2pq .
5 147 Further, if the standard assumption = 0 is adopted, the variance of the marginal distribution of
148 phenotypes becomes 2 149 V ar (y) = (1 2pq) + VE: (12) a 2 150 Here, the standard term for additive genetic variance disappears and yet, a may not be zero, since
151 one poses that a locus has no e¤ect (on average) but there is some uncertainty or variation among 2 152 loci e¤ects, as represented by a: 2 153 In a nutshell: the additive genetic variance at the locus, VA = 2pqa ; does not appear in (11), 2 2 154 because a has been integrated out; actually, it is replaced by 2pq in (11). The term (1 2pq) a 155 does not arise in the standard (…xed) model, where additive genetic variance in stricto sensu is VA . 2 156 When a is known with certainty, then a = 0 and yet the locus generates additive genetic variance, 2 2 157 as measured by VA = 2pq : If = 0 and a = 0; then the variance is purely environmental. In 2 158 short, the connection between the uncertainty variance a and the additive variance VA (which
159 involves the e¤ect of the locus) is elusive when both genotypes and e¤ects are random variables.
160 2.3 Several loci
161 Consider now K loci with additive e¤ect ak at locus k, without dominance or epistasis. The
162 phenotype is expressible as K 163 y = wkak + e; (13) k=1 X K 164 with E (y a1; a2; :::; ak) = (pk qk) ak under HW equilibrium. If all markers were quantitative j k=1 X 165 trait loci (QTL), this would be a "…nite number of QTL" model; it will be assumed that these loci
166 are markers hereinafter unless stated otherwise. Under this speci…cation
K 167 V ar (y a1; a2; :::; ak) = V ar wkak a1; a2; :::; ak + VE: (14) j j k=1 ! X
168 In order to deduce the additive variance, some assumption must be made about the joint distribution
169 of w1; w2:::; wK , the genotypes at the K loci.
170 Linkage equilibrium (LE) and HW frequencies will be assumed to make the problem tractable.
171 Some expressions are available to accommodate linkage disequilibrium, but parameters are not
172 generally available to evaluate them (see the Appendix). When there is selection, genetic drift or
173 introgression and many loci are considered jointly, some of which will be even physically linked, the
174 LE assumption is unrealistic. Hence, as in the case of many other authors, e.g., BARTON et al.
175 (2009) in a study of evolution of traits using statistical mechanics, LE-HW assumptions are used
176 here.
6 177 If there is LE, the distributions of genotypes at the K loci are mutually independent, so that
K 2 178 V ar (y a1; a2; :::; ak) = V ar (wk) a + VE j k k=1 X K 2 179 = 2pkqkak + VE: (15) k=1 X K 2 180 The multi-locus additive genetic variance under LE-HW is then VA = 2pkqkak: k=1 X
181 Suppose now that all e¤ects ak (k = 1; 2; :::; K) are drawn from the same random process with 2 182 some distribution function P (a), mean and variance a. Using the same reasoning as before, the
183 variance of the marginal distribution of the phenotypes is
K K 2 184 V ar (y) = V ara (pk qk) ak + Ea 2pkqka + VE : (16) k k=1 ! k=1 ! X X
185 Then K K 2 2 186 V ar (y) = (1 2pkqk) + 2pkqk + VE; (17) a k=1 k=1 X X 187 which generalizes (11) to K loci. The …rst term is variance due to uncertainty, the second term is
188 the standard additive genetic variance, and the third one is residual variation. K 2 What is the relationship between the multi-locus additive genetic variance VA = 2pkqkak i=1 2 X and a? Let VA0 be the mean variance, obtained by averaging VA over the distribution of the a0s: This operation yields K K 2 2 2 VA0 = E 2pkqkak = a + 2pkqk: i=1 ! i=1 X X 189 Hence, if = 0 (additive e¤ects expressed as a deviation from their mean) then
K 2 190 VA0 = a 2pkqk: (18) i=1 X 2 191 The relationship between the uncertainty variance a and the marked average additive genetic
192 variance VA0 would then be 2 VA0 193 = ; (19) a K 2pkqk i=1 X 194 in agreement with HABIER et al. (2007), but di¤erent from MEUWISSEN et al. (2001). Unless
195 the markers are QTL, VA0 gives only the part of the additive genetic variance captured by markers,
7 196 and this may be a tiny fraction only (MAHER 2008). This makes the connection between additive
197 genetic variance for a trait and marked variance even more elusive.
198 Corresponding formulae under linkage disequilibrium are in the Appendix.
199 2.4 Heterogeneity of variance
Suppose now that locus e¤ects are independently (but not identically) distributed as a N ; 2 : k k ak K The mean of the marginal distribution of phenotypes is E (y) = (pk qk) k; and the variance k=1 becomes X K K 2 2 V ar (y) = (1 2pkqk) + 2pkqk + VE: ak k k=1 k=1 X X 2 200 If all were equal to 0 (complete Bayesian certainty about all marker e¤ects a ), there would ak k 201 still be genetic variance, as measured by the second term in V ar (y) : Apart from the di¢ culty of 2 202 inferring a given with any reasonable precision, there is the question of possible “commonality” ak 203 between loci e¤ects. For instance, some loci may have correlated e¤ects, and alternative forms for
204 the prior distribution of the a0s have been suggested by GIANOLA et al. (2003). Also, some of
205 these e¤ects are expected to be identically equal to 0; especially if ak represents a marker e¤ect,
206 as opposed to being the result of a QTL (if the marker is in LD with the QTL, its e¤ect would be
207 expected to be non-null). In such a case, a more ‡exible prior distribution might be useful, such as
208 a mixture of normals, a double exponential, or a Dirichlet process. 2 209 If a frequentist interpretation is adopted for the assumption ak N k; , it is di¢ cult to ak 210 envisage the conceptual population from which ak is sampled, unless the variances are also random 211 draws from some population. Posing a locus-speci…c variance is equivalent to assuming that each
212 sire in a sample of Holsteins is drawn from a di¤erent conceptual population with sire-speci…c
213 variance. There would be as many variances as there are sires!
214 2.5 Variation in allelic frequencies
215 In addition to assuming random variation of a e¤ects over loci, a distribution of gene frequencies may
216 be posed as well. WRIGHT (1937) found that a beta distribution arose from a di¤usion equation
217 that was used to study changes in allele frequencies in …nite populations, so this is well grounded
218 in population genetics. In a Bayesian context, on the other hand, assigning a beta distribution to
219 gene frequencies would be a (mathematically convenient) representation of uncertainty.
220 Suppose that allelic frequencies pk (k = 1; 2; :::; K) vary over loci according to the same beta
221 B (a; b) process, where a; b are parameters determining the form of the distribution (it can be
222 made U shaped or skewed by appropriate choice); WRIGHT (1937) expressed these parameters as 223 functions of e¤ective population size and mutation rates. Standard results yield
8 a 224 E (p) = = p0; (20) (a + b) b 225 E (q) = = q0; (21) (a + b)
226 and p0q0 227 V ar (p) = : (22) a + b + 1
228 The expected heterozygosity is given by
(a + b) 229 2pq = 2p0q0 : (23) (a + b + 1)
230 There are now four sources of variation: 1) due to random sampling of genotypes over individuals
231 in the population; 2) due to uncertainty about additive e¤ects (equivalently, variability due to
232 sampling additive e¤ects over loci); 3) due to spread of gene frequencies over loci or about some
233 equilibrium distribution, and 4) environmental variability. The third source contributes to variance
234 under conceptual repeated sampling, since gene frequencies would ‡uctuate around the equilibrium
235 distribution, or over loci.
236 Consider now additive genetic variance when dispersion in allelic frequencies is taken into ac-
237 count, assuming LE. Let p = (p1; p2; :::; pK )0 and a = (a1; a2; :::; aK )0. Using previous results, (22)
238 and (23)
239 V ar (y a) = Ep [V ar (y p; a)] + V arp [E (y p; a)] j j j K K 2 240 = Ep 2pkqka + VE + V arp (pk qk) ak k "k=1 # "k=1 # X X K (a + b + 2) 2 241 = 2p q a + V (24) 0 0 ( + + 1) k E a b k=1 X
242 The expected additive genetic variance is now
K (a + b + 2) 2 243 V = 2p q a (25) A 0 0 ( + + 1) k a b k=1 X
244 In order to arrive at the marginal distribution of phenotypes, variation in a e¤ects is brought into
245 the picture, producing the variance decomposition
246 V ar (y) = Ea [V ar (y a)] + V ara [E (y a)] ; (26) j j
247 after variation in genotypes and frequencies (i.e., with respect to w and p) has been taken into
9 248 account. From (13)
K K 249 E (y a; p) = Ew [E (y a; w; p)] = Ew wkak a; p = (2pk 1) ak; j j j "k=1 # k=1 X X K 250 E (y a) = Ep [E (y a; p)] = (p0 q0) ak; j j k=1 X
251 so that in the absence of correlations between locus e¤ects
2 2 252 V ara [E (y a)] = K (p0 q0) : (27) j a
253 Likewise, using (24)
(a + b + 2) 2 2 254 Ea [V ar (y a)] = 2p0q0 K a + + VE: (28) j (a + b + 1) 255 Combining (27) and (28) as required by (26) leads to
2 (a + b + 2) 2 256 V ar (y) = (p0 q0) + 2p0q0 K ( + + 1) a a b a (a + b + 2) 2 257 +2pq K + VE: (29) (a + b)
The variance of the marginal distribution of the phenotypes has, thus, three components. The 2 one involving a relates to uncertainty about or random variation of marker e¤ects. The second,
( + + 2) 2pq a b K2; (a + b)
258 is exactly the additive genetic variance when the mean of the distribution of marker e¤ects () is
259 known with complete certainty, and the third component is the environmental variance VE: If = 0
260 and the …rst term of (29) is interpreted as additive genetic variance (VA00), it turns out that
2 VA00 261 a = : (30) 2 (a+b+2) (p0 q0) + 2pq K (a+b) n o This is similar to HABIER et al. (2007) only if p0 = q0; a + b is large enough, and K is very large, such that 1 K lim 2pkqk = 2p(1 p)f (p a; b) dp; K K j !1 k=1 X Z 262 where f (p ; ) is the beta density representing variation of allelic frequencies. j a b
10 263 3 Resemblance between relatives
264 3.1 Standard results
265 Quantitative trait loci (QTL) are most often unknown, so their e¤ects and their relationships to
266 those of markers are di¢ cult to model explicitly. An alternative is to focus on e¤ects of markers,
267 whose genotypes are presumably in linkage disequilibrium with one or several QTL, so can be
268 thought of as proxies. Using a slightly di¤erent notation, the marker-based linear model suggested
269 by MEUWISSEN et al. (2001) for genomic-assisted prediction of genetic values is
270 y = X + Wam+e; (31)
271 where y is an n 1 vector of phenotypic values, is a vector of macro-environmental nuisance 272 parameters, X is an incidence matrix, am = amk is a K 1 vector of additive e¤ects of markers, f g 273 and W = wik is a known incidence matrix containing codes for marker genotypes, e.g., -1, 0 and f g th 2 274 1 for mm; Mm and MM; respectively. Let the i row of W be w and assume that e N 0; I i0 e 275 is a vector of micro-environmental residual e¤ects. 276 If genotypes are sampled at random from the population, this induces a probability distribution
277 with mean vector
278 E (y ; a ) = X +E (W) am j m 279 = X + (P Q) am; (32) where p1 p2 : : : pK q1 q2 : : : qK 2 p1 p2 : : : pK 3 2 q1 q2 : : : qK 3 :::::: :::::: 6 7 6 7 E (W) = 6 7 6 7 = P Q; 6 :::::: 7 6 :::::: 7 6 7 6 7 6 7 6 7 6 :::::: 7 6 :::::: 7 6 7 6 7 6 p1 p2 : : : pK 7 6 q1 q2 : : : qK 7 6 7 6 7 4 5 4 5 280 and P and Q are matrices whose columns contain pk and qk; respectively, in every position of column K 281 k. Every element of the column vector (P Q) am is the same and equal to = (pk qk) ak, a k=1 282 constant that can be absorbed into the intercept element of : Hence, and withoutX loss of generality,
283 E (y ; a ) = E (y ). j m j The covariance matrix (regarding and marker e¤ects am as …xed parameters) is
2 V ar (y ; a) = V ar (Wam) + I : j e
11 Here
am0 V ar (w1) am am0 Cov (w1; w20 ) am ::: am0 Cov (w1; wn0 ) am 2 : am0 V ar (w2) am ::: am0 Cov (w2; wn0 ) am 3 ::::: : 6 7 V ar (Wam am) = 6 7 : j 6 ::::: : 7 6 7 6 7 6 ::::: : 7 6 7 6 symmetric : : : : a0 V ar (wn) am 7 6 m 7 4 5 284 If genotypes are drawn at random from the same population, all wi vectors have the same distri-
285 bution, that is, wi (p; D) for all i: If the population is in HW-LE, D = 2pkqk is a diagonal f g 286 matrix. Further, if individuals are genetically related, o¤-diagonal terms am0 Cov wi; wj0 am are 287 not null, because of covariances due to genotypic similarity. To illustrate, let i be a randomly chosen male mated to a random female, and let j be a randomly chosen descendant. Under HW-LE, genotypes at di¤erent marker loci are mutually independent, within and between individuals, so it su¢ ces to consider a single locus. It turns out that
1 1 Cov (w ; w ) = pq = (2pq) = V ar (w ) ; i j 2 2 i
1 288 yielding the standard result that the covariance between genotypes of o¤spring and parent is 2 of 289 the variance between genotypes at the locus in question. This generalizes readily to any type of
290 additive relationships in the population.
291 Letting rij be the additive relationship between i and j
am0 Dam r12am0 Dam : : : r1nam0 Dam 2 am0 Dam : : : r2nam0 Dam 3 :::: 6 7 292 V ar (Wam am) = 6 7 j 6 :: 7 6 7 6 7 6 :: 7 6 7 6 symmetric a0 Dam 7 6 m 7 293 = 4VAAr; 5 (33)
294 where K 2 295 VA = am0 Dam= 2pkqkak (34) k=1 X 296 is the additive genetic variance among multi-locus genotypes in HW-LE, and Ar= rij is the matrix f g 297 of additive relationships between individuals; D is diagonal in this case. The variance-covariance
298 matrix (33) involves the …xed marker e¤ects am; but these get absorbed into VA:
299 The implication is that a model with the conditional (given marker e¤ects and gene frequencies)
300 expectation function
301 E (y ; a ) = X + (P Q) am; (35) j m
12 302 (recall that (P Q) am = 1 ) and conditional covariance matrix K 2 2 303 V ar (y ; a ) = 2pkqka Ar + I ; (36) j m k e k=1 ! X
304 has the equivalent representation
305 y = X + a + e; (37)
K 306 with a = Wam = ai = wikak distributed as k=1 P 307 a (0; ArVA) : (38)
308 This is precisely the standard model of quantitative genetics applied to a …nite number of marker
309 loci (K) ; where additive genetic variation stems from the sampling of genotypes but not of their
310 e¤ects. The assumption of normality is not required.
311 Formulae under LD are given in the second section of the Appendix.
312 3.2 Estimating the pedigree based relationship matrix and expected heterozy-
313 gosity using markers
Consider model (37)
y = X + a + e = X + Wam + e:
Given the observed marker genotypes W; the distribution of W is irrelevant with regards to inference
about a. It is relevant only if one seeks to estimate parameters of the genotypic distribution, e.g., gene frequencies and linkage disequilibrium statistics. Consider, for instance, the expected value of
WW0:
E WW0 = E wi0wj ; i; j = 1; 2; :::; n: K 314 where wi0wj = wi;kwj;k; and the sum is over markers. If all individuals belong to the same k=1 315 genotypic distribution,P as argued above,
K K K 2 316 E w0wj = E (wi;kwj;k) = rij 2pkqk + (pk qk) ; (39) i k=1 k=1 k=1 X X X 1 K 2 as in HABIER et al. (2007); if pk = qk = 2 ; then E (wi0wj) = rij 2 so E K WW0 = Ar: 2 WW Under this idealized assumption K 0 provides an unbiased estimator of the pedigree based additive relationship matrix, but only if HW-LE holds. Note that under a beta distribution of gene frequencies K 1 (a + b) lim 2pkqk = 2p0q0 : K K ( + + 1) !1 k=1 a b X
13 Likewise K 1 2 (a + b) lim (pk qk) = 1 4p0q0 : K K ( + + 1) !1 k=1 a b X 317 Using this continuous approximation (39) becomes
318 E w0wj = rijKH0 + K (1 2H0) i 319 = K + KH0(rij 2);
(a+b) where H0 = 2pq = 2p0q0 is the expected heterozygosity. Since the relationship holds for (a+b+1) any combination i; j and there are n(n + 1)=2 distinct elements in WW0 and in Ar, one can form an estimator of mean heterozygosity as
wi0wj K H0 = ; K(rij 2) b 320 where wi0wj and rij are averages over all distinct elements in WW0 and Ar: The estimator is simple, 321 but no claim is made about its properties.
322 4 Connection with the in…nitesimal model
323 Clearly, (31) or (37) with (38) involve a …nite number of markers, and the marked additive genetic
324 variance is a function of allelic frequencies and of e¤ects of individual markers. In the in…nitesimal
325 model, on the other hand, the e¤ects of individual loci or of gene frequencies do not appear explicitly.
326 How do these two types of models connect?
327 The vector of additive genetic e¤ects (or marked breeding value) of all individuals is a = Wam = K 328 a = wikak : Next, assume that e¤ects a1; a2; :::; aK are independently and identically f ig (k=1 ) X 2 2 329 distributed as ai N 0; ; but this does not need to be so. Again, is not the additive genetic a a 330 variance, which is given by (34) above. 2 331 Assuming am N 0; I implies that a = Wam must be normal (given W). However, the a 332 elements of W (indicators of genotypes, discrete) are also random so the …nite sample distribution of 333 a is not normal; however, as K goes to in…nity, the distribution approaches normality, as discussed
334 later in this section of the paper. If am and W are independently distributed, the mean vector and
335 covariance matrix of the distribution of marked breeding values are
336 E (a) = Ea [EW (Wam a )] = Ea [(P Q) am] = 0; (40) m j m m
337 and
338 V ar (a) = V ara [E (Wam a )] + Ea [V ar (Wam a )] : (41) m j m m j m
14 Using the fact that EW (Wam a ) = (P Q) am and (33), then under LE assumptions j m K 2 2 V ar (a) = (P Q)(P Q)0 + ArEa 2pkqka : a m k k=1 ! X K K 2 2 339 Since Eam 2pkqkak = a 2pkqk, it follows that k=1 ! k=1 X X K 2 2 340 V ar (a) = (P Q)(P Q)0 + Ar 2pkqk : (42) a a " k=1 # X 341 This shows that when both genotypes (the w0s) and their e¤ects (the a0s) vary at random according
342 to independent trinomial (at each locus) and normal distributions, respectively, the variance of the
343 distribution of a marked breeding value a is a¤ected by di¤erences in gene frequencies pk qk, 2 344 by the variance of the distribution of marker e¤ects a; and by the level of heterozygosity. In the 1 345 special case where allelic frequencies are equal to 2 at each of the loci, the …rst term vanishes and 2 2 2 Ka 346 one gets V ar (a ) = A ; where = : This can be construed as as the counterpart of the r a a 2 347 polygenic additive variance of the standard in…nitesimal model, but in a special situation. Again, 2 e e 348 this illustrates that a relates to additive genetic variance in a more subtle manner than would
349 appear at …rst sight. K
What is the distribution of marked breeding values ai = wikak when both wik and ak vary k=1 X at random? Because ak is normal, the conditional distribution ai wi is normal, with mean 0 and K j 2 2 variance a wik: Thus, one can write the density of the marginal distribution of ai as k=1 X
p (ai) = ::: p (ai wi) Pr (wi1; wi2; :::wik) ; w w w j Xi1 Xi2 XiK
where Pr (wi1; wi2; :::wik) = Pr (wi) is the probability of observing the K-dimensional marker geno-
type wi in individual i: If the population is in HW-LE at all K marker loci, the joint distribution of genotypes of an individual over the K loci is the product of the marginal distributions at each of the marker loci, that is
K 2 wik1 wik2 2 wik3 Pr (wi) = pk (2pkqk) qk : k=1 Y
For example, if individual i is AA at locus k; wik1 = 1; wik2 = 0 and wik3 = 0; if i is heterozygote
wik1 = 0; wik2 = 1 and wik3 = 0; and if i has genotype aa then wik1 = 0; wik2 = 0 and wik3 = 1.
Note that wik3 = 1 wik1 wik2; so that there are only 2 free indicator variates. It follows that K the marginal distribution of ai is a mixture of 3 normal distributions each with a null mean,
15 K 2 2 but distribution speci…c variance w : As K ; the mixing probabilities Pr (wi) become a ik ! 1 k=1 in…nitesimally small, so that X
K 2 wikak 1 2 k=1 ! 3 p (ai) exp X f(w)dw; K 6 K 7 Z 2 6 2 2 7 22 w 6 2a wik 7 v a ik 6 7 u k=1 k=1 u X 6 X 7 t 4 5 for some density f(w): The mixture distribution of ai must necessarily converge towards a Gaussian K
one, because ai = wikak is the sum of a large (now in…nite) number of independent random k=1 X variates (note that under LE wikak is independent of any other wik0 ak0 because genotypes at di¤erent loci are mutually independent), so the central limit theorem holds; it holds even under weaker assumptions. Then 2 a N 0; : i a Since all components of the mixture have null means, its variance is just the average of the variances of all components of the mixture (GIANOLA et al. 2006), that is
K 2 2 2 a = lim ::: Pr (wi) a wik : K !1 wi1 wi2 w k=1 ! X X XiK X
350 This is the additive genetic variance of an in…nitesimal model, i.e., one with an in…nite number of
351 loci, so that the probability of any genotypic con…guration is in…nitesimally small.
When the joint distribution of additive genetic values a of a set of individuals is considered, it is reasonable to conjecture that it should be multivariate normal, especially under LE. Its mean
vector is E (a) = 0, as shown in (40), and the covariance matrix of the limiting process can be deduced from (42). The …rst term becomes null, because allelic frequencies became in…nitesimally small as K ; so the covariance matrix tends to ! 1 V ar (a ) = A 2 : r a
352 As shown by HABIER et al. (2007), markers may act as a proxy for a pedigree. Hence, unless the
353 pedigree is introduced into the model in some explicit form, markers may be capturing relationships
354 among individuals, as opposed to representing ‡ags for QTL regions. At any rate, it is essential to
355 keep in mind that markers are not genes.
16 356 5 The Bayesian alphabet
357 The preceding part of the paper sets the quantitative genetics theory basis upon which marker
358 assisted prediction of breeding value (using linear models) rests. Speci…cally, MEUWISSEN et al.
359 (2001) use this theory to assess values of hyper-parameters of some Bayesian models proposed by
360 these authors. In what follows, a critique of some methods proposed for inference is presented.
361 5.1 Bayes A
362 MEUWISSEN et al. (2001) suggested using model (31) for marker-enabled prediction of additive
363 genetic e¤ects and proposed two Bayesian hierarchical structures, termed "Bayes A" and "Bayes
364 B". A brief review of these two methods follows, assuming that k denotes a SNP locus.
In Bayes A (using notation employed here), the prior distribution of a marker e¤ect ak, given some marker-speci…c uncertainty variance 2 , is assumed to be normal with null mean and disper- ak sion 2 . In turn, the variance associated to the e¤ect of each marker k = 1; 2; :::; K is assigned the ak same scaled inverse chi-square prior distribution 2 ; S2 where and S2 are known degrees of ak j freedom and scale parameters, respectively. This hierarchy is represented as:
2 2 2 2 2 2 ak N 0; ; ; S S : j ak ak ak j