INVESTIGATION

A Multiple-Trait Bayesian Lasso for Genome-Enabled Analysis and Prediction of Complex Traits

Daniel Gianola*,†,‡,§,1 and Rohan L. Fernando‡ *Department of Animal Sciences, and †Department of Dairy Science, University of Wisconsin-Madison, Wisconsin 53706, ‡Department of Animal Science, , Ames, Iowa 50011, and §Department of Plant Sciences, Technical University of Munich (TUM), TUM School of Life Sciences, Freising, 85354 Germany ORCID IDs: 0000-0001-8217-2348 (D.G.); 0000-0001-5821-099X (R.L.F.)

ABSTRACT A multiple-trait Bayesian LASSO (MBL) for genome-based analysis and prediction of quantitative traits is presented and applied to two real data sets. The data-generating model is a multivariate linear Bayesian regression on a possibly huge number of molecular markers, with a Gaussian residual distribution. Each (one per marker) of the T × 1 vectors of regression coefficients (T: number of traits) is assigned the same T-variate Laplace prior distribution, with a null mean vector and unknown scale matrix Σ. The multivariate prior reduces to that of the standard univariate Bayesian LASSO when T = 1. The covariance matrix of the residual distribution is assigned a multivariate Jeffreys prior, and Σ is given an inverse-Wishart prior. The unknown quantities in the model are learned using a Markov chain Monte Carlo sampling scheme constructed using a scale-mixture of normal distributions representation. MBL is demonstrated in a bivariate context employing two publicly available data sets, using a bivariate genomic best linear unbiased prediction model (GBLUP) for benchmarking results. The first data set is one where wheat grain yields in two different environments are treated as distinct traits. The second data set comes from genotyped Pinus trees, with each individual measured for two traits: rust bin and gall volume. In MBL, the bivariate marker effects are shrunk differentially, i.e., “short” vectors are more strongly shrunk toward the origin than in GBLUP; conversely, “long” vectors are shrunk less. A predictive comparison was carried out as well in wheat, where the comparators of MBL were bivariate GBLUP and bivariate Bayes Cπ, a variable selection procedure. A training-testing layout was used, with 100 random reconstructions of training and testing sets. For the wheat data, all methods produced similar predictions. In Pinus, MBL gave better predictions than either a Bayesian bivariate GBLUP or the single-trait Bayesian LASSO.
MBL has been implemented in the Julia language package JWAS, and is now available for the scientific community to explore with different traits, species, and environments. It is well known that there is no universally best prediction machine, and MBL represents a new resource in the armamentarium for genome-enabled analysis and prediction of complex traits.

KEYWORDS Bayesian Lasso; complex traits; GWAS; multiple-traits; prediction; quantitative genetics; quantitative genomics

Two main paradigms have been employed for investigating statistical associations between molecular markers and complex traits: marker-by-marker genome-wide association studies (GWAS) and whole-genome regression approaches (WGR). GWAS is dominant in human genetics; Visscher et al. (2017) present a perspective and Gianola et al. (2016) formulate a statistically orientated critique. WGR was developed mostly in animal and plant breeding (e.g., Lande and Thompson 1990; Meuwissen et al. 2001; Gianola et al. 2003) primarily for predicting future performance, but it has received some attention in human genetics as well (e.g., de los Campos et al. 2010; Yang et al. 2010; Lee et al. 2011; Makowsky et al. 2011; López de Maturana et al. 2014). de los Campos et al. (2013), Gianola (2013) and Isik et al. (2017) reviewed an extensive collection of WGR approaches. Other studies noted that WGR can be used both for “discovery” of associations and for prediction (Moser et al. 2015; Goddard et al. 2016; Fernando et al. 2017). Hence, WGR methodology is an active area of research.

Copyright © 2020 by the Genetics Society of America
doi: https://doi.org/10.1534/genetics.119.302934
Manuscript received July 5, 2019; accepted for publication December 20, 2019; published Early Online December 26, 2019.
Available freely online through the author-supported open access option.
Supplemental material available at figshare: https://doi.org/10.25386/genetics.10689380.
1Corresponding author: University of Wisconsin-Madison, 1675 Observatory Dr., Madison, WI 53706. E-mail: [email protected]

Genetics, Vol. 214, 305–331, February 2020

Multiple-trait analysis has long been of great interest in plant and animal breeding, mainly from the point of view of joint selection for many traits (Smith 1936; Hazel 1943; Walsh and Lynch 2018). Henderson and Quaas (1976) developed multi-trait best linear unbiased prediction of breeding values for all individuals and traits measured in a population of animals, a methodology that has gradually become routine in the field. For example, Gao et al. (2018) described an application of a nine-variate model to data representing close to 7 million and 4 million Holstein and Nordic Red cattle, respectively; the nine traits were milk, fat, and protein yields in each of the first three lactations of the cows.

A multiple-trait analysis is also a natural choice in quests for understanding and dissecting genetic correlations between traits using molecular markers, e.g., evaluating whether pleiotropy or linkage disequilibrium are at the root of between-trait associations (Gianola et al. 2015; Cheng et al. 2018a). For instance, Galesloot et al. (2014) compared six methods of multivariate GWAS via simulation and found that all delivered a higher power than single-trait GWAS, even when genetic correlations were weak. Many single-trait WGR methods extend directly to the multiple-trait domain, e.g., genomic best linear unbiased prediction (GBLUP; VanRaden 2007). Other procedures, such as Bayesian mixture models, are more involved, but extensions are available (Calus and Veerkamp 2011; Jia and Jannink 2012; Cheng et al. 2018a). The mixture model of Cheng et al. (2018a) is particularly interesting because it provides insight into whether markers affect all, some, or none of the traits addressed. For example, the proportion of markers in each of the (0, 0), (0, 1), (1, 0) and (1, 1) categories, where (0, 0) means “no effect,” and (1, 1) denotes “effect” on each of two disease traits in Pinus taeda, was estimated by Cheng et al. (2018a) using single nucleotide polymorphisms (SNPs). The proportion of Markov chain Monte Carlo (MCMC) samples falling into the (1, 1) class was <3%, with 140 markers appearing as candidates for further scrutiny of pleiotropy; 97% of the SNP were in the (0, 0) class, and 0.5% were in the (0, 1) and (1, 0) classes. It must be noted that Cheng et al. (2018a) used Bayesian model averaging, so posterior estimates of effects, and of their uncertainties, constitute averages over all possible configurations. The resulting “average model” is not truly sparse, as Bayesian mixture models always assign some posterior probability to each of the possible configurations. An alternative to a mixture is to use a prior distribution that produces strong shrinkage toward the origin of “weak” vectors of marker effects; here, each marker has a vector with dimension equal to the number of traits.

The LASSO (least absolute shrinkage and selection operator) method presented by Tibshirani (1996) is a single-response method based on minimizing a linear regression residual sum of squares subject to a constraint based on an L1 norm. It can produce a sparse model, i.e., if the linear regression model has p regression coefficients, the LASSO yields a smaller model (i.e., model selection), but with a complexity that cannot exceed N, the number of observations. Tibshirani (1996) noted that the LASSO solutions can also be obtained by calculating the mode of the conditional posterior distribution of the regression coefficients in a Bayesian model in which each coefficient is assigned the same conditional double exponential (DE) or Laplace prior. Using a ridge regression reformulation of LASSO, it can be seen that its Bayesian version shrinks small-value regression coefficients very strongly toward zero, whereas large-effect variants are regularized to a much lesser extent than in ridge regression (Tibshirani 1996; Gianola 2013). Yuan and Lin (2006) and Yuan et al. (2007) considered the problem of clustering regression coefficients into groups (factors), with the focus becoming factor selection, as opposed to the predictor variable selection that takes place in LASSO. For instance, a cluster could consist of a group of markers in tight physical linkage. These authors noted that, in some instances, grouping enhances prediction performance over ridge regression, while in others, it does not. Such findings are consistent with knowledge accumulated in close to two decades of experience with genome-enabled prediction in animal breeding: there is no universally best prediction machine. A multiple-trait application of a LASSO penalty on regression coefficients was presented by Li et al. (2015). These authors assigned a multivariate Laplace distribution to the model residuals, and a group-LASSO penalty (Yuan and Lin 2006) to the regression coefficients. The procedure differs from Tibshirani’s LASSO in that the model selects vectors of regressors (corresponding to regressions of a given marker over traits) as opposed to single-trait predictor variables.

Park and Casella (2008) introduced a fully Bayesian LASSO (BL). Contrary to LASSO, BL produces a model where all regression coefficients are non-null (even if p > N); most regressions are often tiny in value, except those associated with covariates (markers) with strong effects. In short, LASSO produces a sparse model, whereas BL yields an effectively sparse specification, similar to Bayesian mixture models such as Bayes B (Meuwissen et al. 2001). The first application of BL in quantitative genetics was made by Yi and Xu (2008) in the context of quantitative trait locus (QTL) mapping, with subsequent applications reported in de los Campos et al. (2009), Legarra et al. (2011), Li et al. (2011) and Lehermeier et al. (2013).

It appears that no multiple-trait generalization of the BL has yet been reported. The present paper describes a multi-trait Bayesian LASSO (MBL) model based on adopting a multivariate Laplace distribution with unknown scale matrix as prior distribution for the markers or variants under scrutiny. The MBL is introduced and compared with a multiple-trait GBLUP (MTGBLUP) using wheat and pine tree data sets. The section Multi-Trait Regression Model describes MBL, including an MCMC sampling algorithm. Subsequently, MBL is compared with MTGBLUP using a wheat data set. Finally, bivariate MBL and bivariate MTGBLUP are contrasted from a predictive perspective, showing a better performance of MBL over BLUP and over a single-trait Bayesian LASSO specification, corroborating the usefulness of multiple-trait analyses. The paper concludes with a general discussion, and technical details are presented in Appendices.
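The contrast drawn above between LASSO-type and ridge-type shrinkage can be sketched for a single coefficient. This is a minimal illustration and not part of the MBL method: it assumes an orthonormal design, so that each least-squares estimate `b_ols` is shrunk on its own, and it uses the LASSO solution (the posterior mode under a DE prior) rather than the posterior mean of the Bayesian LASSO. The function names and the penalty value `lam = 1.0` are hypothetical choices made for the example.

```python
def lasso_estimate(b_ols: float, lam: float) -> float:
    """Soft-thresholding: minimizer of 0.5*(b_ols - b)**2 + lam*abs(b)."""
    shrunk = max(abs(b_ols) - lam, 0.0)
    return shrunk if b_ols >= 0 else -shrunk


def ridge_estimate(b_ols: float, lam: float) -> float:
    """Proportional shrinkage: minimizer of 0.5*(b_ols - b)**2 + 0.5*lam*b**2."""
    return b_ols / (1.0 + lam)


lam = 1.0  # illustrative penalty, not a value taken from the paper
for b_ols in (0.5, 2.0, 5.0):
    # Print the least-squares value next to its LASSO and ridge counterparts.
    print(b_ols, lasso_estimate(b_ols, lam), ridge_estimate(b_ols, lam))
```

With this penalty, a small estimate is set to zero by the LASSO but only halved by ridge, whereas a large estimate keeps a greater share of its size under the LASSO than under ridge.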

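Multi-Trait Regression Model

Assume there are $T$ traits observed in each of $N$ individuals, and let $\boldsymbol{\beta}_j = \{\beta_{jt}\}$ be a $T \times 1$ vector of allelic substitution effects at marker $j = 1, 2, \ldots, p$, with $\beta_{jt}$ representing the effect of marker $j$ on trait $t$ $(t = 1, 2, \ldots, T)$. The multi-trait regression model (assuming no nuisance location effects other than a mean) for the $T$ responses is

$$\mathbf{y}_i = \boldsymbol{\mu} + \sum_{j=1}^{p} x_{ij}\boldsymbol{\beta}_j + \mathbf{e}_i; \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, p, \tag{1}$$

where $\mathbf{y}_i$ is a $T \times 1$ vector of responses observed in individual $i$; $\boldsymbol{\mu} = \{\mu_t\}$ is the vector of trait means, and $x_{ij}$ is the genotype individual $i$ possesses at marker locus $j$. The residual vector $\mathbf{e}_i$ $(T \times 1)$ is assumed to follow the Gaussian distribution $\mathbf{e}_i \mid \mathbf{R}_0 \sim N(\mathbf{0}, \mathbf{R}_0)$, where $\mathbf{R}_0$ is a $T \times T$ covariance matrix. All $\mathbf{e}_i$ vectors are assumed to be mutually independent and identically distributed.

If traits are sorted within individuals, the probability model associated with Equation 1 can be represented as

$$\begin{aligned} p(\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N \mid \boldsymbol{\mu}, \boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_p, \mathbf{R}_0) &\propto \frac{1}{|\mathbf{R}_0|^{N/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^{N}\left(\mathbf{y}_i - \boldsymbol{\mu} - \sum_{j=1}^{p} x_{ij}\boldsymbol{\beta}_j\right)' \mathbf{R}_0^{-1} \left(\mathbf{y}_i - \boldsymbol{\mu} - \sum_{j=1}^{p} x_{ij}\boldsymbol{\beta}_j\right)\right] \\ &\propto \frac{1}{|\mathbf{R}_0|^{N/2}} \exp\left[-\frac{1}{2}\,\mathrm{tr}\left(\mathbf{R}_0^{-1}\mathbf{S}_e\right)\right], \end{aligned} \tag{2}$$

where

$$\mathbf{S}_e = \sum_{i=1}^{N}\left(\mathbf{y}_i - \boldsymbol{\mu} - \sum_{j=1}^{p} x_{ij}\boldsymbol{\beta}_j\right)\left(\mathbf{y}_i - \boldsymbol{\mu} - \sum_{j=1}^{p} x_{ij}\boldsymbol{\beta}_j\right)' \tag{3}$$

is a $T \times T$ matrix of sums of squares and products of the unobserved regression residuals.

The regression model can be formulated in an equivalent manner by sorting individuals within traits; hereafter, we will use $T = 3$. Let $\mathbf{y}_1^*$, $\mathbf{y}_2^*$ and $\mathbf{y}_3^*$ be response vectors of order $N$ each observed for traits 1, 2, and 3, respectively. The representation of the model is

$$\begin{bmatrix} \mathbf{y}_1^* \\ \mathbf{y}_2^* \\ \mathbf{y}_3^* \end{bmatrix} = \begin{bmatrix} \mathbf{1}_N & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_N & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{1}_N \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} + \begin{bmatrix} \mathbf{X} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{X} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{X} \end{bmatrix} \begin{bmatrix} \boldsymbol{\beta}_1^* \\ \boldsymbol{\beta}_2^* \\ \boldsymbol{\beta}_3^* \end{bmatrix} + \begin{bmatrix} \mathbf{e}_1^* \\ \mathbf{e}_2^* \\ \mathbf{e}_3^* \end{bmatrix} = (\mathbf{I}_3 \otimes \mathbf{1}_N)\boldsymbol{\mu} + (\mathbf{I}_3 \otimes \mathbf{X})\boldsymbol{\beta}^* + \mathbf{e}^*, \tag{4}$$

where $\mathbf{1}_N$ is an $N \times 1$ vector of 1's, $\mathbf{X} = \{x_{ij}\}$ is an $N \times p$ matrix of marker genotypes, and $\boldsymbol{\beta}_t^*$ $(p \times 1)$ and $\mathbf{e}_t^*$ $(N \times 1)$ are vectors of regression coefficients and of residuals for trait $t$, respectively. Above, $\boldsymbol{\beta}^* = \mathrm{vec}(\boldsymbol{\beta}_1^*, \boldsymbol{\beta}_2^*, \boldsymbol{\beta}_3^*)$ is a $3p \times 1$ vector and $\mathbf{e}^* = \mathrm{vec}(\mathbf{e}_1^*, \mathbf{e}_2^*, \mathbf{e}_3^*)$ has dimension $3N \times 1$. Note that $\mathrm{Var}(\mathbf{e}^*) = \mathbf{R}_0 \otimes \mathbf{I}_N = \mathbf{R}$. Putting $\mathbf{y}^*(\boldsymbol{\mu}, \boldsymbol{\beta}^*) = (\mathbf{I}_3 \otimes \mathbf{1}_N)\boldsymbol{\mu} + (\mathbf{I}_3 \otimes \mathbf{X})\boldsymbol{\beta}^*$, the probability model is

$$p(\mathbf{y}^* \mid \boldsymbol{\mu}, \boldsymbol{\beta}^*, \mathbf{R}_0) \propto \frac{1}{|\mathbf{R}_0|^{N/2}} \exp\left[-\frac{1}{2}\left(\mathbf{y}^* - \mathbf{y}^*(\boldsymbol{\mu}, \boldsymbol{\beta}^*)\right)' \mathbf{R}^{-1} \left(\mathbf{y}^* - \mathbf{y}^*(\boldsymbol{\mu}, \boldsymbol{\beta}^*)\right)\right]. \tag{5}$$

We will work with either (1) or (4), depending on the context.

Bayesian prior assumptions

Parameters $\boldsymbol{\mu}$ and $\mathbf{R}_0$: The vector $\boldsymbol{\mu}$ will be assigned a “flat” improper prior, and Jeffreys noninformative prior (e.g., Sorensen and Gianola 2002) will be adopted for $\mathbf{R}_0$, so that their joint prior density is

$$p(\boldsymbol{\mu}, \mathbf{R}_0) \propto |\mathbf{R}_0|^{-\frac{T+1}{2}}. \tag{6}$$

Multivariate Laplace prior distribution for marker effects: The same $T$-variate Laplace prior distribution with a null mean vector will be assigned to each of the $T \times 1$ vectors $\boldsymbol{\beta}_j$ $(j = 1, 2, \ldots, p)$, assumed mutually independent, a priori. Gómez et al. (1998) presented a multi-dimensional version of the power exponential family of distributions; one special case is the multivariate Laplace distribution (MLAP). The density of the MLAP with a zero-mean vector used here is

$$p(\boldsymbol{\beta}_j \mid \boldsymbol{\Sigma}) = \frac{T\,\Gamma\!\left(\frac{T}{2}\right)}{|\boldsymbol{\Sigma}|^{\frac{1}{2}}\,\pi^{\frac{T}{2}}\,\Gamma(1+T)\,2^{(1+T)}} \exp\left[-\frac{1}{2}\sqrt{\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j}\right], \quad j = 1, 2, \ldots, p, \tag{7}$$

where $\boldsymbol{\Sigma} = \{\Sigma_{tt'}\}$ is a $T \times T$ positive-definite scale matrix. The variance–covariance matrix of MLAP is

$$\mathrm{Var}(\boldsymbol{\beta}_j \mid \boldsymbol{\Sigma}) = 4(T+1)\boldsymbol{\Sigma} = \mathbf{B}; \tag{8}$$

note that the absolute values of the elements of $\mathbf{B}$, the inter-trait variance–covariance matrix of marker effects, are larger than those of $\boldsymbol{\Sigma}$. Hence, $\beta_{jt} \sim (0, \sigma^2_{\beta,t})$ for all $j$, where $\sigma^2_{\beta,t} = 4(T+1)\Sigma_{tt}$ is the appropriate diagonal element of $\mathbf{B}$; likewise, $\sigma_{\beta,tt'} = 4(T+1)\Sigma_{tt'}$ is the covariance of marker effects between traits $t$ and $t'$, for all $j$. Putting $T = 1$ in (7) yields

$$p(\beta \mid \Sigma) = \frac{1}{2\sqrt{4\Sigma}} \exp\left(-\frac{|\beta|}{\sqrt{4\Sigma}}\right). \tag{9}$$

The preceding is the density of a DE distribution with null mean, parameter $\sqrt{4\Sigma}$ and variance $\mathrm{Var}(\beta) = 8\Sigma$. As mentioned earlier, Tibshirani (1996) and Park and Casella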

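(2008) used the DE distribution as conditional (given $\Sigma$) prior for regression coefficients in the BL, a member of the “Bayesian Alphabet” (Gianola et al. 2009). Gianola et al. (2018) assigned the DE distribution to residuals of a linear model for the purpose of attenuating outliers, and Li et al. (2015) used the MLAP distribution for the residuals in a “robust” linear regression model for QTL mapping.

MLAP is therefore an interesting candidate prior for multi-trait marker effects in a multiple-trait generalization of the Bayesian LASSO (MBL). A zero-mean MLAP distribution has a sharp peak at the 0 coordinates. Although MLAP reduces to a DE distribution when $T = 1$, the marginal and conditional densities of MLAP are not DE: Gómez et al. (1998) showed that such densities are elliptically contoured, and, thus, not DE. Appendix A and Figures S1–S3 in the Supplemental Material give background on MLAP.

Gómez-Sánchez-Manzano et al. (2008) showed that MLAP can be represented as a scaled mixture of normal distributions under the hierarchy: (1) $[\boldsymbol{\beta}_j \mid \boldsymbol{\Sigma}, v_j^2] \sim N_T(\mathbf{0}, v_j^2\boldsymbol{\Sigma})$ and (2) $v_j^2 \sim \mathrm{Gamma}\left(\frac{T+1}{2}, \frac{1}{8}\right)$ for $j = 1, 2, \ldots, p$; $N_T(\mathbf{0}, \boldsymbol{\Sigma}v_j^2)$ denotes a $T$-variate normal distribution with null mean and covariance matrix $\boldsymbol{\Sigma}v_j^2$. The density of $v_j^2$ is

$$h\left(v_j^2\right) \propto \left(v_j^2\right)^{\frac{T+1}{2}-1} \exp\left(-\frac{v_j^2}{8}\right). \tag{10}$$

Let the collection of all marker effects over traits be represented by the $Tp \times 1$ vector

$$\boldsymbol{\beta} = \left[\boldsymbol{\beta}_1' \;\; \boldsymbol{\beta}_2' \;\; \cdots \;\; \boldsymbol{\beta}_p'\right]'. \tag{11}$$

If independent and identical MLAP prior distributions are assigned to each of the subvectors, the joint prior density of all marker effects, given $\boldsymbol{\Sigma}$, can be represented as

$$p(\boldsymbol{\beta} \mid \boldsymbol{\Sigma}) = \prod_{j=1}^{p} p(\boldsymbol{\beta}_j \mid \boldsymbol{\Sigma}) = \prod_{j=1}^{p} \int_0^{\infty} N_T\left(\boldsymbol{\beta}_j \mid \mathbf{0}, \boldsymbol{\Sigma}v_j^2\right) h\left(v_j^2\right) dv_j^2, \tag{12}$$

and the joint density of $\boldsymbol{\beta}$ and $\mathbf{v}^2 = \left[v_1^2, v_2^2, \ldots, v_p^2\right]'$ is

$$p\left(\boldsymbol{\beta}, \mathbf{v}^2 \mid \boldsymbol{\Sigma}\right) = \prod_{j=1}^{p} N_T\left(\boldsymbol{\beta}_j \mid \mathbf{0}, \boldsymbol{\Sigma}v_j^2\right) h\left(v_j^2\right). \tag{13}$$

When individuals are sorted within traits (e.g., $T = 3$), note that $[\boldsymbol{\beta}^* \mid \boldsymbol{\Sigma}, \mathbf{v}^2]$ is a $Tp$-dimensional normal distribution, with null mean vector and covariance matrix

$$\mathrm{Var}\left(\begin{bmatrix} \boldsymbol{\beta}_1^* \\ \boldsymbol{\beta}_2^* \\ \boldsymbol{\beta}_3^* \end{bmatrix} \Bigg|\, \boldsymbol{\Sigma}, \mathbf{v}^2\right) = \begin{bmatrix} \mathbf{D}\Sigma_{11} & \mathbf{D}\Sigma_{12} & \mathbf{D}\Sigma_{13} \\ \mathbf{D}\Sigma_{21} & \mathbf{D}\Sigma_{22} & \mathbf{D}\Sigma_{23} \\ \mathbf{D}\Sigma_{31} & \mathbf{D}\Sigma_{32} & \mathbf{D}\Sigma_{33} \end{bmatrix} = \boldsymbol{\Sigma} \otimes \mathbf{D}, \tag{14}$$

where $\mathbf{D} = \mathrm{diag}(v_1^2, v_2^2, \ldots, v_p^2)$ is a diagonal matrix. Hence,

$$p\left(\boldsymbol{\beta}^* \mid \boldsymbol{\Sigma}, \mathbf{v}^2\right) \propto \exp\left[-\frac{1}{2}\boldsymbol{\beta}^{*\prime}\left(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{D}^{-1}\right)\boldsymbol{\beta}^*\right]. \tag{15}$$

Scale matrix $\boldsymbol{\Sigma}$: The scale matrix of MLAP can be given a fixed value (becoming a hyper-parameter), or inferred, in which case a prior distribution is needed. Here, an inverse-Wishart (IW) distribution with scale matrix $\mathbf{V}_\beta$ and $\nu_\beta$ degrees of freedom will be assigned as prior. The density is

$$p\left(\boldsymbol{\Sigma} \mid \mathbf{V}_\beta, \nu_\beta\right) \propto |\boldsymbol{\Sigma}|^{-\frac{T+\nu_\beta+1}{2}} \exp\left[-\frac{1}{2}\,\mathrm{tr}\left(\boldsymbol{\Sigma}^{-1}\mathbf{V}_\beta\right)\right]. \tag{16}$$

Joint posterior and fully conditional distributions

The joint posterior distribution, including $\mathbf{v}^2 = \{v_j^2\}$ from the scale-mixture of normals representation of the prior distribution of $\boldsymbol{\beta}$, was assumed to take the form

$$p\left(\boldsymbol{\mu}, \boldsymbol{\beta}^*, \mathbf{R}_0, \boldsymbol{\Sigma}, \mathbf{v}^2 \mid \mathbf{y}^*, H\right) \propto p\left(\mathbf{y}^* \mid \boldsymbol{\mu}, \boldsymbol{\beta}^*, \mathbf{R}_0\right) p(\mathbf{R}_0 \mid H) \times p\left(\boldsymbol{\beta}^* \mid \boldsymbol{\Sigma}, \mathbf{v}^2\right) p\left(\mathbf{v}^2\right) p(\boldsymbol{\Sigma} \mid H), \tag{17}$$

where $H$ denotes the hyper-parameters; recall that $\mathbf{y}^*$ is the data vector sorted by individuals within trait. The fully conditional distributions are presented below, with ELSE used to denote all parameters that are kept fixed, together with $H$, in a specific conditional distribution.

Parameters $\boldsymbol{\mu}$ and $\boldsymbol{\beta}^*$ given ELSE: From (17), and using representations (4) and (15), the fully conditional posterior distribution of $\boldsymbol{\mu}$ and $\boldsymbol{\beta}^*$ has density

$$\begin{aligned} p\left(\boldsymbol{\mu}, \boldsymbol{\beta}^* \mid \mathrm{ELSE}\right) &\propto p\left(\mathbf{y}^* \mid \boldsymbol{\mu}, \boldsymbol{\beta}^*, \mathbf{R}_0\right) p\left(\boldsymbol{\beta}^* \mid \boldsymbol{\Sigma}, \mathbf{v}^2\right) \\ &\propto \exp\left[-\frac{1}{2}\left(\mathbf{y}^* - \mathbf{y}^*(\boldsymbol{\mu}, \boldsymbol{\beta}^*)\right)' \mathbf{R}^{-1}\left(\mathbf{y}^* - \mathbf{y}^*(\boldsymbol{\mu}, \boldsymbol{\beta}^*)\right)\right] \times \exp\left[-\frac{1}{2}\boldsymbol{\beta}^{*\prime}\left(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{D}^{-1}\right)\boldsymbol{\beta}^*\right]. \end{aligned} \tag{18}$$

The preceding is a multivariate normal density (e.g., Sorensen and Gianola 2002). The mean vector of the distribution is

$$\begin{bmatrix} \bar{\boldsymbol{\mu}} \\ \bar{\boldsymbol{\beta}}^* \end{bmatrix} = \begin{bmatrix} \mathbf{R}_0^{-1}N & \mathbf{R}_0^{-1} \otimes \mathbf{1}_N'\mathbf{X} \\ \mathbf{R}_0^{-1} \otimes \mathbf{X}'\mathbf{1}_N & \mathbf{R}_0^{-1} \otimes \mathbf{X}'\mathbf{X} + \boldsymbol{\Sigma}^{-1} \otimes \mathbf{D}^{-1} \end{bmatrix}^{-1} \begin{bmatrix} \left(\mathbf{R}_0^{-1} \otimes \mathbf{1}_N'\right)\mathbf{y}^* \\ \left(\mathbf{R}_0^{-1} \otimes \mathbf{X}'\right)\mathbf{y}^* \end{bmatrix}, \tag{19}$$

and the variance–covariance matrix is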

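$$\mathrm{Var}\left(\begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\beta}^* \end{bmatrix} \Bigg|\, \mathrm{ELSE}\right) = \begin{bmatrix} \mathbf{R}_0^{-1}N & \mathbf{R}_0^{-1} \otimes \mathbf{1}_N'\mathbf{X} \\ \mathbf{R}_0^{-1} \otimes \mathbf{X}'\mathbf{1}_N & \mathbf{R}_0^{-1} \otimes \mathbf{X}'\mathbf{X} + \boldsymbol{\Sigma}^{-1} \otimes \mathbf{D}^{-1} \end{bmatrix}^{-1} = \mathbf{C}^{-1}. \tag{20}$$

A more explicit representation is presented in Appendix B for the case $T = 3$.

Fully conditional distributions of partitions of the location vector: For details, see Van Tassell and Van Vleck (1996) and Sorensen and Gianola (2002). Since the joint posterior of the location parameters, given $\boldsymbol{\Sigma}$, $\mathbf{v}^2$, and $\mathbf{R}_0$, is multivariate normal, all conditionals and linear combinations thereof are normal as well. In particular $(T = 3)$, with $r^{tt'}$ and $\Sigma^{tt'}$ denoting the elements of $\mathbf{R}_0^{-1}$ and $\boldsymbol{\Sigma}^{-1}$, respectively,

$$E(\mu_t \mid \mathrm{ELSE}) = \frac{1}{r^{tt}N}\left[\mathbf{1}_N'\sum_{t'=1}^{3} r^{tt'}\left(\mathbf{y}_{t'}^* - \mathbf{X}\boldsymbol{\beta}_{t'}^*\right) - N\sum_{t' \neq t} r^{tt'}\mu_{t'}\right], \quad t = 1, 2, 3, \tag{21}$$

and

$$\mathrm{Var}(\mu_t \mid \mathrm{ELSE}) = \frac{1}{r^{tt}N}, \quad t = 1, 2, 3. \tag{22}$$

Likewise

$$E\left(\boldsymbol{\beta}_t^* \mid \mathrm{ELSE}\right) = \left(r^{tt}\mathbf{X}'\mathbf{X} + \mathbf{D}^{-1}\Sigma^{tt}\right)^{-1}\left[\mathbf{X}'\sum_{t'=1}^{3} r^{tt'}\left(\mathbf{y}_{t'}^* - \mathbf{1}_N\mu_{t'}\right) - \sum_{t' \neq t}\left(r^{tt'}\mathbf{X}'\mathbf{X} + \mathbf{D}^{-1}\Sigma^{tt'}\right)\boldsymbol{\beta}_{t'}^*\right], \quad t = 1, 2, 3, \tag{23}$$

and

$$\mathrm{Var}\left(\boldsymbol{\beta}_t^* \mid \mathrm{ELSE}\right) = \left(r^{tt}\mathbf{X}'\mathbf{X} + \mathbf{D}^{-1}\Sigma^{tt}\right)^{-1}, \quad t = 1, 2, 3. \tag{24}$$

Fully conditional distributions of $\mathbf{R}_0$ and $\boldsymbol{\Sigma}$: From (17), using (2) and (6),

$$p(\mathbf{R}_0 \mid \mathrm{ELSE}) \propto |\mathbf{R}_0|^{-\frac{N+T+1}{2}} \exp\left[-\frac{1}{2}\,\mathrm{tr}\left(\mathbf{R}_0^{-1}\mathbf{S}_e\right)\right], \tag{25}$$

so $[\mathbf{R}_0 \mid \mathrm{ELSE}]$ is an IW distribution with $N + T$ degrees of freedom and scale matrix $\mathbf{S}_e$. In IW, the kernel of the density is often written as $\exp\left\{-\frac{1}{2}\,\mathrm{tr}\left[\mathbf{R}_0^{-1}(N+T)\bar{\mathbf{S}}_e\right]\right\}$, where $\bar{\mathbf{S}}_e = \mathbf{S}_e/(N+T)$.

Recall that

$$p\left(\boldsymbol{\beta}^* \mid \boldsymbol{\Sigma}, \mathbf{v}^2\right) = \prod_{j=1}^{p} N_T\left(\boldsymbol{\beta}_j \mid \mathbf{0}, \boldsymbol{\Sigma}v_j^2\right), \tag{26}$$

so, from (17),

$$\begin{aligned} p(\boldsymbol{\Sigma} \mid \mathrm{ELSE}) &\propto p\left(\boldsymbol{\beta}^* \mid \boldsymbol{\Sigma}, \mathbf{v}^2\right) p(\boldsymbol{\Sigma} \mid H) \\ &\propto \prod_{j=1}^{p} \frac{1}{\left|\boldsymbol{\Sigma}v_j^2\right|^{\frac{1}{2}}} \exp\left(-\frac{1}{2v_j^2}\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j\right) \times |\boldsymbol{\Sigma}|^{-\frac{T+\nu_\beta+1}{2}} \exp\left[-\frac{1}{2}\,\mathrm{tr}\left(\boldsymbol{\Sigma}^{-1}\mathbf{V}_\beta\right)\right] \\ &\propto |\boldsymbol{\Sigma}|^{-\frac{p+T+\nu_\beta+1}{2}} \exp\left[-\frac{1}{2}\,\mathrm{tr}\left(\boldsymbol{\Sigma}^{-1}\mathbf{S}_\beta\right)\right], \end{aligned} \tag{27}$$

where

$$\mathbf{S}_\beta = \sum_{j=1}^{p} \frac{\boldsymbol{\beta}_j\boldsymbol{\beta}_j'}{v_j^2} + \mathbf{V}_\beta \tag{28}$$

is a $T \times T$ matrix. Hence the conditional posterior distribution of $\boldsymbol{\Sigma}$ is $IW(p + T + \nu_\beta, \mathbf{S}_\beta)$. The kernel of the density of $\boldsymbol{\Sigma}$ is often represented as $\exp\left\{-\frac{1}{2}\,\mathrm{tr}\left[\boldsymbol{\Sigma}^{-1}(p+T+\nu_\beta)\bar{\mathbf{S}}_\beta\right]\right\}$, where $\bar{\mathbf{S}}_\beta = \mathbf{S}_\beta/(p+T+\nu_\beta)$.

Fully conditional distribution of $\mathbf{v}^2$: From (17), and using (13),

$$\begin{aligned} p\left(\mathbf{v}^2 \mid \mathrm{ELSE}\right) &\propto p\left(\boldsymbol{\beta}^* \mid \boldsymbol{\Sigma}, \mathbf{v}^2\right) p\left(\mathbf{v}^2\right) \propto \prod_{j=1}^{p} N_T\left(\boldsymbol{\beta}_j \mid \mathbf{0}, \boldsymbol{\Sigma}v_j^2\right) h\left(v_j^2\right) \\ &\propto \prod_{j=1}^{p} \frac{1}{\left(v_j^2\right)^{\frac{T}{2}}} \exp\left(-\frac{\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j}{2v_j^2}\right)\left(v_j^2\right)^{\frac{T+1}{2}-1} \exp\left(-\frac{v_j^2}{8}\right) \\ &\propto \prod_{j=1}^{p} \left(v_j^2\right)^{-\frac{1}{2}} \exp\left[-\frac{\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j + \frac{v_j^4}{4}}{2v_j^2}\right]. \end{aligned} \tag{29}$$

The preceding density is not in a recognizable form. Appendix C gives details of a Metropolis-Hastings algorithm tailored for making draws from the distribution having density (29). A brief description of the procedure follows.

MCMC algorithm

Starting values for $\mathbf{R}_0$ and $\boldsymbol{\Sigma}$ can be obtained “externally” from some estimates of $\mathbf{R}_0$ and $\mathbf{B}$ (the $T \times T$ matrix of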

Figure 1 Bivariate Bayesian LASSO: trace plot and posterior density of residual correlation.

variances and covariances of marker effects) calculated with standard methods such as maximum likelihood. Recall that $\boldsymbol{\Sigma} = \mathbf{B}/[4(T+1)]$.

Sample each $v_j^2$ $(j = 1, 2, \ldots, p)$ using the following Metropolis-Hastings sampler:

1. At round $t$, draw $y$ from $Y \sim IG\left(\alpha = \frac{1}{2}, \beta = \frac{1}{4}\right)$ and evaluate $y$ as proposal; IG stands for an inverse-gamma distribution.
2. Draw $U \sim U(0, 1)$, with the probability of move being $\min(1, R)$, with $R$ as in Appendix C.
3. If $U < R$, set $w_j^{[t+1]} = y$ and form $v_j^{2[t+1]} = 2/w_j^{[t+1]}$ as a new state. Otherwise, set $v_j^{2[t+1]} = v_j^{2[t]}$; $j = 1, 2, \ldots, p$.

In a “single-pass” sampler, use (19) and (20) for sampling the entire location vector jointly. Otherwise, adopt a blocking strategy; for example, draw $\boldsymbol{\mu}$ and $\boldsymbol{\beta}^*$ using (21), (22), (23) and (24).

Sample $\mathbf{R}_0$ from $IW(N + T, \mathbf{S}_e)$ and $\boldsymbol{\Sigma}$ from $IW(p + T + \nu_\beta, \mathbf{S}_\beta)$.

Remarks

Appendix E shows that the degree of shrinkage of marker effects results from a joint action between $\boldsymbol{\Sigma}$ and the strength of marker effects. A vector of effects of a marker with a short Mahalanobis distance away from $\mathbf{0}$ is more strongly shrunk toward the origin (i.e., the mean of the prior distribution) than vectors containing strong effects on at least one trait. MLAP preserves the spirit of BL, producing “pseudoselection” of covariates: all markers stay in the model, but some are effectively nullified. A marker with strong marginal and joint effects on the traits under consideration could flag potentially pleiotropic regions.

Missing records for some traits

Often, not all traits are measured in all individuals, a situation that is more common in animal breeding than in plant breeding. A standard approach (“data augmentation”) treats missing phenotypes as unknowns in an expanded joint posterior distribution. As shown in Appendix F, a predictive distribution can be used to produce an imputation of missing data.

Alternative Formulation in TN Dimensions

The MCMC sampler described above is based on a regression on markers formulation stemming from either (1) or (4). In a

Figure 2 Bivariate Bayesian LASSO: trace plot and posterior density of correlation between marker effects.

“single-pass” sampler, $T(1 + p)$ parameters must be drawn together; when $p$ is very large, direct inversion is typically unfeasible, so the scheme must be reformulated into a “block-sampling” one, i.e., by drawing some of the location parameters jointly by conditioning on the other location parameters, or by using a single-site sampler (Sorensen and Gianola 2002). Blocking or single-site sampling facilitates computation at the expense of slowing down convergence to the target distribution. Appendix D gives a scheme in which $T(1 + N)$ effects (trait means and bivariate genomic breeding values) are inferred, and the $Tp$ marker effects are calculated indirectly, following ideas of Henderson (1977) and adapted by Goddard (2009) to a genome-based model.

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article. The wheat yield data set in the R package BGLR (Pérez and de los Campos 2014) was employed to contrast MBL with GBLUP and Bayes Cπ. This wheat data set has been studied extensively, e.g., by Crossa et al. (2010), Gianola et al. (2011, 2016), and Long et al. (2011). There are $n = 599$ wheat inbred lines, each genotyped with $p = 1279$ DArT (Diversity Array Technology) markers, and each planted in four environments. The DArT markers are binary $(0, 1)$ and denote the presence or absence of an allele at a marker locus in a given line. Grain yields in environments 1 and 2 were employed to compare outcomes between analyses based on bivariate GBLUP and the bivariate BL. In the bivariate model, yields in the two environments are treated as distinct traits, conceptually, an idea that dates back to Falconer (1952). This type of setting arises in breeding of dairy cattle, where milk production of daughters of bulls in different countries is regarded as different traits, and in multi-environment situations in plant breeding; both instances can be represented as special cases of a multiple-trait mixed effects model.

A publicly available Loblolly pine (Pinus taeda) data set described in Cheng et al. (2018a) was used to carry out a predictive comparison between a Bayesian bivariate GBLUP and the bivariate Bayesian LASSO, as well as between the latter and a single-trait Bayesian LASSO. After edits, there were $n = 807$ individuals with $p = 4828$ SNP markers with measurements on rust bin scores and rust gall volume, two disease traits; see Cheng et al. (2018a). Supplemental material available at figshare: https://doi.org/10.25386/genetics.10689380.

Bivariate Analysis of Wheat Yield: MBL vs. GBLUP

Genomic BLUP and Bayesian BLUP

The bivariate model was

$$\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{1}\mu_1 \\ \mathbf{1}\mu_2 \end{bmatrix} + \begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix} + \begin{bmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \end{bmatrix}, \tag{30}$$

where $\mathbf{y}_1$ $(\mathbf{y}_2)$ is the vector of grain yields in environment 1 (2) of the 599 inbred lines; $\mu_1$ and $\mu_2$ are the trait means in the two environments, and $\mathbf{1}$ is a $599 \times 1$ incidence vector of ones; $\mathbf{g}_1$ and $\mathbf{g}_2$ are the additive “genomic values” of the lines, and $\mathbf{e}_1$ and $\mathbf{e}_2$ are model residuals. In GBLUP (VanRaden 2008) the genetic signals captured by markers are represented as $\mathbf{g}_1 = \mathbf{X}\boldsymbol{\beta}_1$ and $\mathbf{g}_2 = \mathbf{X}\boldsymbol{\beta}_2$, where $\mathbf{X}$ is a $599 \times 1279$ centered and scaled matrix of genotype codes, and $\boldsymbol{\beta}_1$ $(\boldsymbol{\beta}_2)$ contains the marker allele substitution effects on trait 1 (2). The residual distribution was

Figure 3 Bivariate GBLUP vs. bivariate Bayesian LASSO (posterior mean) estimates of marker effects on wheat grain yield.

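$$\begin{bmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \end{bmatrix} \sim N(\mathbf{0}, \mathbf{R}_0 \otimes \mathbf{I}), \tag{31}$$

where, as before, $\mathbf{R}_0$ is the $2 \times 2$ between-trait residual variance–covariance matrix. Effects of environment 1 are expected to be uncorrelated with those of environment 2. However, allowance was made for a non-null residual covariance because the additive genomic model may not capture extant epistasis involving additive effects, potentially creating correlations among residuals of the same lines in different environmental conditions.

GBLUP assumed $\boldsymbol{\beta}_1 \sim N(\mathbf{0}, \mathbf{I}\sigma^2_{\beta_1})$, $\boldsymbol{\beta}_2 \sim N(\mathbf{0}, \mathbf{I}\sigma^2_{\beta_2})$ and $\mathrm{Cov}(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2') = \mathbf{I}\sigma_{\beta_{12}}$, so

$$\mathbf{B} = \begin{bmatrix} \sigma^2_{\beta_1} & \sigma_{\beta_{12}} \\ \sigma_{\beta_{12}} & \sigma^2_{\beta_2} \end{bmatrix} \tag{32}$$

is the variance–covariance matrix of marker effects. It follows that

$$\begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix} \sim N\left(\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}, \mathbf{G}_0 \otimes \mathbf{G}\right), \tag{33}$$

where

$$\mathbf{G}_0 = p\mathbf{B} = \begin{bmatrix} \sigma^2_{g_1} & \sigma_{g_{12}} \\ \sigma_{g_{12}} & \sigma^2_{g_2} \end{bmatrix} \tag{34}$$

is a between-trait variance–covariance matrix of the additive genomic values (here, e.g., $\sigma^2_{g_1} = p\sigma^2_{\beta_1}$) and $\mathbf{G} = \mathbf{X}\mathbf{X}'/p$ is a genomic-relationship matrix describing genome-based similarities among the 599 lines. The preceding assumptions induce the marginal distribution

$$\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \sim N\left(\begin{bmatrix} \mathbf{1}\mu_1 \\ \mathbf{1}\mu_2 \end{bmatrix}, \mathbf{V} = \mathbf{G}_0 \otimes \mathbf{G} + \mathbf{R}_0 \otimes \mathbf{I}\right), \tag{35}$$

where $\mathbf{V}$ is the phenotypic covariance matrix. The bivariate best linear unbiased predictor of $\mathbf{g}_1$ and $\mathbf{g}_2$ (Henderson 1975) is

$$\begin{bmatrix} \hat{\mathbf{g}}_1 \\ \hat{\mathbf{g}}_2 \end{bmatrix} = (\mathbf{G}_0 \otimes \mathbf{G})\mathbf{V}^{-1}\begin{bmatrix} \mathbf{y}_1 - \mathbf{1}\hat{\mu}_1 \\ \mathbf{y}_2 - \mathbf{1}\hat{\mu}_2 \end{bmatrix}, \tag{36}$$

where

$$\begin{bmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \end{bmatrix} = \left(\begin{bmatrix} \mathbf{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{1} \end{bmatrix}' \mathbf{V}^{-1} \begin{bmatrix} \mathbf{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{1} \end{bmatrix}\right)^{-1} \begin{bmatrix} \mathbf{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{1} \end{bmatrix}' \mathbf{V}^{-1} \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \tag{37}$$

is the bivariate generalized least-squares (GLS) estimator of the trait means.

BLUP and GLS require knowledge of $\mathbf{G}_0$ and $\mathbf{R}_0$, and we replaced these unknown matrices by estimates obtained using a crude, but simple, procedure. Genomic and residual variance components were obtained by univariate maximum likelihood analyses of traits 1, 2, and 1 + 2, and covariance component estimates were calculated from the expression $\mathrm{Cov}(X, Y) = [\mathrm{Var}(X + Y) - \mathrm{Var}(X) - \mathrm{Var}(Y)]/2$. The resulting estimates of $\mathbf{G}_0$ and $\mathbf{R}_0$ were inside their corresponding parameter spaces. An estimate of $\mathbf{B}$ was obtained by applying relationship (34) to the estimated $\mathbf{G}_0$.

Henderson (1977) showed how BLUP of vectors that are not likelihood identified can be obtained from best linear unbiased predictions of likelihood-identified random effects (see Gianola 2013). Goddard (2009) and Strandén and Garrick (2009) used this property to obtain predictions of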

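Figure 4 t-Statistics for marker effects on wheat grain yield: GBLUP vs. bivariate Bayesian LASSO (MBL).

marker effects $(\boldsymbol{\beta})$ given predictions of signal $(\mathbf{g})$. If $\boldsymbol{\beta}$ and $\mathbf{g}$ have a joint normal distribution, under (30) one has

$$E\left(\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} \Bigg|\, \begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix}\right) = \left(\mathbf{B}\mathbf{G}_0^{-1} \otimes \mathbf{X}'\mathbf{G}^{-1}\right)\begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix}.$$

Using iterated expectations, and recalling that BLUP can be viewed as an estimated conditional expectation (with fixed effects replaced by their GLS estimates), BLUP of marker effects is expressible as

$$\begin{bmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix} = \hat{E}\left(\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} \Bigg|\, \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix}\right) = \left(\mathbf{B}\mathbf{G}_0^{-1} \otimes \mathbf{X}'\mathbf{G}^{-1}\right)\begin{bmatrix} \hat{\mathbf{g}}_1 \\ \hat{\mathbf{g}}_2 \end{bmatrix} = \left(\mathbf{B}\mathbf{G}_0^{-1} \otimes \mathbf{X}'\mathbf{G}^{-1}\right)(\mathbf{G}_0 \otimes \mathbf{G})\mathbf{V}^{-1}\begin{bmatrix} \mathbf{y}_1 - \mathbf{1}\hat{\mu}_1 \\ \mathbf{y}_2 - \mathbf{1}\hat{\mu}_2 \end{bmatrix} = (\mathbf{B} \otimes \mathbf{X}')\mathbf{V}^{-1}\begin{bmatrix} \mathbf{y}_1 - \mathbf{1}\hat{\mu}_1 \\ \mathbf{y}_2 - \mathbf{1}\hat{\mu}_2 \end{bmatrix}, \tag{38}$$

with $\hat{\boldsymbol{\beta}}_i = \hat{E}(\boldsymbol{\beta}_i \mid \mathbf{y}_1, \mathbf{y}_2)$, $i = 1, 2$; the second equality uses (36). After lengthy algebra, and using Henderson (1975), the prediction error variance–covariance matrix of the BLUP of marker effects is given by

$$\mathrm{Var}\begin{bmatrix} \hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_1 \\ \hat{\boldsymbol{\beta}}_2 - \boldsymbol{\beta}_2 \end{bmatrix} = \mathbf{B} \otimes \mathbf{I}_p - (\mathbf{B} \otimes \mathbf{X}')\left[\mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{W}\left(\mathbf{W}'\mathbf{V}^{-1}\mathbf{W}\right)^{-1}\mathbf{W}'\mathbf{V}^{-1}\right](\mathbf{B} \otimes \mathbf{X}')', \tag{39}$$

where $\mathbf{W} = \mathbf{I}_2 \otimes \mathbf{1}$ is the incidence matrix of the trait means appearing in (37). A set of $t$-statistics can be formed by taking the ratio between the BLUP of a given marker effect as in (38), and the square root of the corresponding diagonal element of (39). The statistic is a crude criterion for association between marker and phenotype, as it ignores uncertainty associated with the fact that $\mathbf{B}$ and $\mathbf{R}_0$ are estimated from the data, as opposed to being the “true values” required by BLUP theory.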
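The back-solving of marker effects from a bivariate GBLUP, and the crude t-statistics built from it, can be sketched numerically. The example below is a toy illustration under stated assumptions, not the authors' implementation: the data are simulated, the sizes `n` and `p` and the matrices `B` and `R0` are hypothetical values chosen for the example, and the prediction error covariance is computed as Var(β) minus Cov(β, y′) P Cov(y, β′), with P the usual GLS projection-type matrix. It relies on the identity that (B G0⁻¹ ⊗ X′G⁻¹)(G0 ⊗ G) equals B ⊗ X′, which avoids inverting G.

```python
import numpy as np

rng = np.random.default_rng(2020)
n, p = 8, 5                        # toy sizes (hypothetical), not the wheat data
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                # centered marker codes
y = rng.standard_normal(2 * n)     # stacked (y1', y2')': individuals within traits

B = np.array([[0.5, 0.2], [0.2, 0.8]])    # assumed marker-effect covariance
R0 = np.array([[1.0, 0.3], [0.3, 1.5]])   # assumed residual covariance

G = X @ X.T / p                    # genomic relationship matrix
G0 = p * B                         # between-trait genomic covariance
V = np.kron(G0, G) + np.kron(R0, np.eye(n))   # phenotypic covariance
W = np.kron(np.eye(2), np.ones((n, 1)))       # incidence matrix of trait means

Vinv = np.linalg.inv(V)
mu_hat = np.linalg.solve(W.T @ Vinv @ W, W.T @ Vinv @ y)   # GLS trait means
Vinv_resid = Vinv @ (y - W @ mu_hat)

g_hat = np.kron(G0, G) @ Vinv_resid    # BLUP of genomic values
C_by = np.kron(B, X.T)                 # Cov(beta, y') = B kron X'
beta_hat = C_by @ Vinv_resid           # BLUP of marker effects

# Prediction error covariance and the resulting t-statistics.
P = Vinv - Vinv @ W @ np.linalg.solve(W.T @ Vinv @ W, W.T @ Vinv)
pev = np.kron(B, np.eye(p)) - C_by @ P @ C_by.T
t_stat = beta_hat / np.sqrt(np.diag(pev))

# The marker-effect BLUPs reproduce the genomic-value BLUPs: g_hat = (I kron X) beta_hat.
signal_from_markers = np.kron(np.eye(2), X) @ beta_hat
print(np.allclose(signal_from_markers, g_hat))
```

Working with B ⊗ X′ directly is a design choice of this sketch: with centered genotype codes, G has rank at most n − 1, so a form that does not require its inverse is convenient.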

Figure 5 t-Statistics for marker effects on wheat grain yield: ordinary least-squares (OLS) vs. bivariate Bayesian LASSO (MBL) and bivariate GBLUP.

The Bayesian bivariate GBLUP model used standard assumptions as in Sorensen and Gianola (2002), i.e., it was a multivariate normal-inverse Wishart hierarchical specification. The only difference with GBLUP is that, in the Bayesian treatment, $\mathbf{G}_0$ and $\mathbf{R}_0$ were treated as unknown parameters, with the uncertainty about their values accounted for.

Bivariate LASSO

Our MCMC implementation for MBL was applied to markers directly, as opposed to inferring their effects from signal indirectly, as is done for GBLUP. The model was as in (4) with $T = 2$. Each marker was assigned a conditional bivariate Laplace prior distribution with scale matrix $\boldsymbol{\Sigma}$; in turn, $\boldsymbol{\Sigma}$ was given a two-dimensional (2D) inverse Wishart distribution on $\nu_\beta = 4$ degrees of freedom and with scale matrix $\mathbf{V}_\beta = \nu_\beta\mathbf{B}/12 = \mathbf{B}/3$. The residual variance–covariance matrix $\mathbf{R}_0$ was assigned the 2D Jeffreys improper prior in (6).

The MCMC scheme employed the scale mixture of normal representation of the bivariate Laplace distribution. First, six independent chains of 1500 iterations each were run. The shrinkage diagnostic metric of Gelman and Rubin (1992) was calculated for $\mu_1$, $\mu_2$, $\mathbf{R}_0$ and $\boldsymbol{\Sigma}$, for the effect of marker 10 on trait 1, and for the effect of marker 200 on trait 2; the R package CODA was used for this purpose. Figures S4–S13 gave no strong evidence of lack of convergence, as indicated by shrinkage factor values close to 1.

Post burn-in samples were collected for an additional 2000 iterations in each chain, so a total of 12,000 samples (without thinning) was used for inference. Figures S14 and S15 depict post burn-in trace plots for the elements of $\mathbf{R}_0$ and $\boldsymbol{\Sigma}$, respectively. The six chains “joined” eventually, and sample values fluctuated thereafter within what seemed to be stationary distributions. To assess convergence further, a test suggested by Geweke (1992) was applied to the combined 12,000 samples from the posterior distributions of $\mu_1$, $\rho_{e_{12}}$

(residual correlation between yields in environments 1 and 2) and $r^{\beta}_{12}$, the correlation between effects of a marker on the two traits. The test compared means of two parts of the combined collection of 12,000 samples at each of 10 segments of the collection: there was no evidence of lack of convergence. In short, the implementation successfully met the convergence tests applied.

Figure 1 and Figure 2 display estimated posterior densities of $r_{e12}$ and $r^{\beta}_{12}$. Mixing for $r^{\beta}_{12}$ was poorer than for $r_{e12}$; the effective number of samples was 220.6 and 979.0, respectively, and Monte Carlo errors were small enough. The residual correlation (posterior mean, 0.17) was positive and different from 0, whereas the $r^{\beta}_{12}$ parameter was estimated at 0.35, also different from zero. However, the posterior densities were not sharp enough for precise inference, probably due to the small sample size ($n = 599$) and low density of the marker panel ($p = 1279$). The quality of these estimates is of subsidiary interest here as our objective was to demonstrate the MBL in a comparison with bivariate BLUP of marker effects.

Location parameters mixed well. For example, the average effective sample size of $\mu_1$ over the six chains during burn-in was 1499 for a nominal 1500 iterations. For marker 10 effect on trait 1, it was 962 out of 1500, and, for marker 200 effect on trait 2, the effective size was 1130 out of 1500. These numbers suggest that all 2558 marker effects were estimated with a very small Monte Carlo error in our MBL implementation with 12,000 samples used for inference.

MBL vs. BLUP estimates of marker effects

Figure 3 gives a comparison between bivariate BLUP and posterior mean estimates of effects from MBL. The upper panel shows good alignment between estimates, except at the extremes of the scatter plots. The lower panel shows that markers with the strongest absolute effects, as estimated by BLUP, had an even stronger effect when estimated under the bivariate BL. Figure 4 presents standardized estimates of each of the 1279 marker effects, by trait. For GBLUP the t-statistic was the estimated marker effect divided by the square root of its prediction error variance; for MBL, it was the posterior mean divided by its posterior standard deviation. There is no evidence that any of the markers had an effect differing from 0, corroborating the view that wheat yield is a typical quantitative trait affected by many variants each having small effects (Singh et al. 1986; Sleper and Poehlman 2006). Using a univariate least-squares, GWAS-type analysis, there were 29 (yield 1) and 56 (yield 2) significant hits after a Bonferroni correction (1279 tests, $\alpha = 0.05$). A comparison between the t-statistics from the GWAS-type analysis with the standardized BLUP and MBL effects is provided in Figure 5. As expected, shrinkage toward null-mean distributions (bivariate Gaussian in BLUP and bivariate Laplace in MBL) made t-statistics much smaller in absolute value than the corresponding ones from GWAS.

Standard GWAS aims to find connections between markers and genomic regions having an effect on a single trait (e.g., Manolio et al. 2009; Visscher et al. 2012; Gianola et al. 2016; Schaid et al. 2018). A search for pleiotropy, on the other hand, focuses on markers having multi-trait effects. The latter can be viewed as a search for vectors of effects with non-null coordinates that are distant from a $T$-dimensional $\mathbf{0}$ origin. Mahalanobis squared distances ($m_j^2$) away from $(0, 0)$ for each of the 1279 bivariate vectors of marker effects were calculated for both BLUP and MBL. For BLUP and marker $j$, the squared distance was computed as $m^2_{\text{Blup},j} = \boldsymbol{\beta}'_{\text{Blup},j}\mathbf{B}^{-1}\boldsymbol{\beta}_{\text{Blup},j}$, and for MBL it was $m^2_{\text{MBL},j} = \boldsymbol{\beta}'_{\text{MBL},j}(12\hat{\boldsymbol{\Sigma}})^{-1}\boldsymbol{\beta}_{\text{MBL},j}$, where $\boldsymbol{\beta}_{\cdot,j}$ are effect estimates for marker $j$, and $\hat{\boldsymbol{\Sigma}}$ is the estimated posterior expectation of $\boldsymbol{\Sigma}$. For BLUP, $m^2_{\text{Blup},j}$ had median and maximum values of 0.16 and 2.94, respectively, over markers. For MBL the corresponding values were 0.14 and 3.83. Figure 6 shows that the largest estimated distances were obtained with MBL, supporting the view that the method produces less shrinkage of multiple-trait effect sizes than BLUP. If the 95th percentile of a chi-squared distribution on 2 degrees of freedom (5.99 and 14.4 without and with a Bonferroni correction at $\alpha = 0.05$) is used as “significance threshold”, none of the 1279 markers could be claimed to have a bivariate effect on the trait, which is consistent with the t-statistics.

Figure 6 Mahalanobis squared distance (M) away from (0, 0) for bivariate effects on grain yield of 1279 markers: GBLUP vs. bivariate Bayesian LASSO (BLASSO).

Predictive Comparison between MBL, MTGBLUP, and MT-BayesCπ: Wheat

Bivariate Bayesian GBLUP and BayesCπ models (Cheng et al. 2018a) were also fitted to the wheat data set. Multiple-trait Bayesian linear models are well known (e.g., Sorensen and Gianola 2002); BayesCπ consisted of a bivariate mixture in which each of the 1279 markers was allowed to fall, a priori,

Figure 7 Predictive correlations for wheat grain yield: bivariate Bayesian LASSO (Bayes L) vs. bivariate GBLUP (RR-BLUP).

into one of four disjoint classes: $(0, 0)$, $(0, 1)$, $(1, 0)$, and $(1, 1)$, where $(0, 0)$ means that a marker has no effect on either trait, $(0, 1)$ indicates that a marker affects yield 2 only, and so on. The prior for the four probabilities of membership was a Dirichlet$(1, 1, 1, 1)$ distribution. All three methods were run in each of 100 randomly constructed training sets, and predictions were formed for lines in corresponding testing sets. Training and testing sets had 300 and 299 wheat lines, respectively, in each of the 100 runs. For all methods, the MCMC scheme was a single long chain of 50,000 iterations, with a burn-in period of 1000 draws. The analyses were run using the JWAS package written in the JULIA language (Cheng et al. 2018b).

Figure 7 and Figure 8 present pairwise plots (bivariate Bayesian GBLUP denoted as RR-BLUP in the plots) of predictive correlations and predictive mean-squared errors, respectively; the plots display fewer than 100 $(X, Y)$ points because numbers were rounded to two decimal places. There were no appreciable differences in predictive performance between the three methods, supporting the view that cereal grain yield is multi-factorial and that there are few, if any, genomic regions with large effects. The variability among replications of the training-testing layout is essentially random, reinforcing the notion of the importance of measuring uncertainty of prediction (Gianola et al. 2018). Many studies do not replicate their analyses, and often claim differences between methods based on single realizations of predictive analyses.

Predictive Comparison between MBL vs. MTGBLUP and MBL vs. Single Trait Bayesian LASSO: Pinus

Figure 9 and Figure 10 present scatter plots of the predictive performance (mean-squared error and correlation, respectively) of the bivariate Bayesian LASSO and bivariate Bayesian GBLUP (MTGBLUP, denoted as RR-BLUP in the plots) in the 100 testing sets. There were no obvious differences in mean-squared error for either rust bin or gall volume although, for the latter trait, a slight superiority of MBL was noted (Figure 9); the plot contains 12 distinct points because the overlap in numerical values produced “clusters” of points. On the other hand, there was a decisive superiority (Figure 10) of MBL over MTGBLUP in predictive correlation.

Figure 11 contrasts the predictive performance of the bivariate Bayesian LASSO with that of the single-trait Bayesian LASSO for gall volume. The two-trait analysis tended to produce larger predictive correlations and smaller mean-squared errors, illustrating instances in which a multiple-trait specification clearly constitutes a better prediction machine.
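The replicated training-testing layout used in the predictive comparisons can be sketched as follows. This is only an illustration under stated assumptions, not the JWAS workflow used in the study: the helper names (`replicate_splits`, `ols_fit_predict`), the simulated data, and the single-covariate least-squares predictor are hypothetical stand-ins for the bivariate models that were actually fitted. What the sketch does share with the text is the layout: 599 lines split at random into 300 training and 299 testing lines, repeated 100 times, with predictive correlation and mean-squared error recorded in each testing set.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mse(obs, pred):
    """Mean-squared error of prediction."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)


def replicate_splits(n_lines, n_train, n_reps, fit_predict, y, seed=0):
    """Repeat a random training-testing split n_reps times.

    fit_predict(train_ids, test_ids) is any user-supplied (hypothetical)
    function that fits a model on the training lines and returns
    predictions for the testing lines, in the order of test_ids.
    Returns a list of (predictive correlation, predictive MSE) pairs.
    """
    rng = random.Random(seed)
    ids = list(range(n_lines))
    results = []
    for _ in range(n_reps):
        rng.shuffle(ids)
        train, test = ids[:n_train], ids[n_train:]
        pred = fit_predict(train, test)
        obs = [y[i] for i in test]
        results.append((pearson(obs, pred), mse(obs, pred)))
    return results


# Simulated data (hypothetical): one covariate stands in for a
# marker-based predictor; it is NOT the wheat or pine data.
_rng = random.Random(1)
N = 599
x = [_rng.gauss(0.0, 1.0) for _ in range(N)]
y = [0.5 * xi + _rng.gauss(0.0, 1.0) for xi in x]


def ols_fit_predict(train, test):
    """Simple least-squares predictor, a stand-in for GBLUP or MBL."""
    mx = sum(x[i] for i in train) / len(train)
    my = sum(y[i] for i in train) / len(train)
    sxy = sum((x[i] - mx) * (y[i] - my) for i in train)
    sxx = sum((x[i] - mx) ** 2 for i in train)
    slope = sxy / sxx
    return [my + slope * (x[i] - mx) for i in test]


results = replicate_splits(N, 300, 100, ols_fit_predict, y)
mean_cor = sum(r for r, _ in results) / len(results)
mean_mse = sum(m for _, m in results) / len(results)
print(f"mean predictive correlation: {mean_cor:.3f}")
print(f"mean predictive MSE: {mean_mse:.3f}")
```

Keeping all 100 pairs, rather than only their averages, is what allows the spread among replications to be inspected, which is the point made in the text about measuring uncertainty of prediction.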

Figure 8 Predictive mean-squared error for grain yield: bivariate Bayesian LASSO (Bayes L), bivariate GBLUP (RR-BLUP), and bivariate Bayes Cπ.
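As a complement to the wheat results, the Mahalanobis squared distance of a bivariate marker effect from the origin, and its comparison with a chi-squared percentile on 2 degrees of freedom, can be sketched as below. This is a minimal sketch under stated assumptions: the function names and all numerical values (`Sigma_hat`, `beta_j`) are hypothetical and do not come from the study. For MBL, the text uses $12\hat{\boldsymbol{\Sigma}}$ as the matrix in the quadratic form, which is what the last part of the example mimics.

```python
import math


def mahalanobis_sq(beta, S):
    """Squared Mahalanobis distance of a 2-vector from (0, 0).

    Computes beta' S^{-1} beta for a 2x2 symmetric positive-definite
    matrix S, using the closed-form inverse of a 2x2 matrix.
    """
    (a, b), (c, d) = S
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("S must be positive definite")
    b1, b2 = beta
    return (d * b1 * b1 - (b + c) * b1 * b2 + a * b2 * b2) / det


def chi2_2df_quantile(prob):
    """Quantile of a chi-squared distribution on 2 degrees of freedom.

    With 2 degrees of freedom the CDF is 1 - exp(-x / 2), so the
    quantile has the closed form -2 * log(1 - prob).
    """
    return -2.0 * math.log(1.0 - prob)


# Small check with hypothetical numbers.
m2 = mahalanobis_sq((1.0, 1.0), ((1.0, 0.5), (0.5, 2.0)))  # 8/7, about 1.143
threshold = chi2_2df_quantile(0.95)                          # about 5.99

# MBL-style distance: scale a hypothetical posterior mean of Sigma by 12.
Sigma_hat = ((0.010, 0.002), (0.002, 0.020))  # hypothetical values
scaled = tuple(tuple(12.0 * s for s in row) for row in Sigma_hat)
beta_j = (0.15, -0.10)                        # hypothetical effect estimate
m2_mbl = mahalanobis_sq(beta_j, scaled)
print(m2, threshold, m2_mbl, m2_mbl > threshold)
```

The closed-form quantile is specific to 2 degrees of freedom; for $T > 2$ traits a general chi-squared quantile function would be needed instead.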

Conclusion

Our study is possibly the first report in the quantitative genetics literature of an MBL, inspired by the BL of Park and Casella (2008). MBL assumes that vectors of marker effects on $T$ traits follow a null-mean multivariate Laplace distribution, a priori. This distribution has a sharp peak at the origin, and reduces to the DE prior of the BL when applied to a single trait. The implementation of MBL requires MCMC sampling, and a relatively simple Metropolis-Hastings algorithm based on a scaled mixture of normals representation (Gómez-Sánchez-Manzano et al. 2008) was presented. The algorithm was tested thoroughly with a wheat data set and found to mix well, with no evidence of lack of convergence to the posterior distribution, and with a small Monte Carlo error.

A question that arises often in practice is the extent to which a multiple-trait method will produce a better performance than a single-trait specification. If the parameters of the model (assuming it holds) representing the inter-trait distribution are either known or well estimated, one should expect more power for QTL detection and a better predictive performance for the multivariate specification. In our study, we found that MBL outperformed the single-trait analysis, delivering a better predictive performance for gall volume but not for rust bin in Pinus. On the other hand, a multiple-trait analysis is more complex and requires more assumptions, so it may be less robust than a single-trait procedure, and fail to deliver according to expectation in real-life circumstances. It is risky to make sweeping statements arguing in favor of a specific treatment of data, as outcomes are heavily dependent on the biological architecture of the traits considered, and on the data structure as well. The picture emerging from two decades of experience in genome-enabled prediction in the fields of animal and plant breeding is that, in view of the large variability of performance with respect to data structure for any given prediction machine, it is largely futile to categorize methods in terms of expected predictive performance using broad criteria (Morota and Gianola 2014; Gianola and Rosa 2015; Momen et al. 2018; Montesinos-López et al. 2018a,b, 2019a,b; Azodi et al. 2019).

MBL is expected to shrink more strongly toward zero the vectors of markers with small effects in their coordinates, thus producing differential shrinkage and preserving the modus operandi of BL. Mimicking the single-trait argument in Tibshirani (1996), which shows equivalence between LASSO and a posterior mode, the representation in Appendix E illustrates that the degree of shrinkage of the vectorial effects of a marker ($j$, say) on a set of traits is inversely proportional to the quadratic form $\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j$. Thus, multivariate

Figure 9 Mean-squared error of prediction for rust bin and gall volume in pine trees: bivariate Bayesian LASSO (Bayes L) vs. bivariate Bayesian GBLUP (RR-BLUP).

Bayesian pseudosparsity is induced by MBL to an extent depending on the heterogeneity of $\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j$ over markers.

We note, in passing, that the term $\sum_{j=1}^{p}\sqrt{\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j}$ given in (66) in Appendix E is the counterpart of $\sum_{g=1}^{G}\sqrt{\sum_{t=1}^{T}\beta_{gt}^{2}}$, part of the “group-penalty” in Li et al. (2015), where $g$ is some meaningful group of markers arrived at, say, on the basis of biological considerations, and $\beta_{gt}$ is the group regression coefficient for trait $t$. The latter penalty assigns the same weight to these regressions over traits, contrary to MBL, where weights and coweights are driven by $\boldsymbol{\Sigma}^{-1}$. The BL or MBL can be adapted to situations where a group structure may be needed via hierarchical modeling; this fairly straightforward issue is outside the scope of the paper, but may be pursued in future extensions of MBL. Actually, Liquet et al. (2017) described a Bayesian multiple-trait analysis where a LASSO-type penalty is assigned to group effects, and a spike-slab prior induces additional Bayesian sparsity at the level of individual regression coefficients. The authors did not address the predictive ability of their method, so it would be interesting to compare it against MBL and the multiple-trait mixture model of Cheng et al. (2018a). We plan to carry out this comparison in collaboration with CIMMYT (Centro Internacional de Mejoramiento de Maíz y Trigo, México) using a large number of data sets in various cereal crops.

Knowledge of the genetic basis of complex traits is limited, and not vast enough to enable formulation of a priori prescriptions for any specific trait or situation. The number, location, and effects of causal variants, the linkage disequilibrium structure between such variants and markers, and the mode of gene action of QTL are largely unknown, this holding for all species of domesticated plants and animals, and for most common diseases in humans. Theoretically, MBL is expected to perform better than multiple-trait BLUP whenever appreciable heterogeneity exists over the effects of the

Figure 10 Predictive correlation for rust bin and gall volume in pine trees: bivariate Bayesian LASSO (Bayes L) vs. bivariate Bayesian GBLUP (RR-BLUP).

markers in the panel employed, while behaving as multiple-trait GBLUP when all markers have tiny and similar effects. This consideration follows directly from the structure of the method, and computer simulations could be easily tailored to create scenarios where MBL has a better or a worse performance simply by design but without necessarily being relevant to a real-life inferential or predictive problem.

The expectation stated above was verified empirically: markers with stronger (positive or negative) effects on the wheat yields examined had larger Mahalanobis distances away from zero than markers with small effects. Further, markers with short distances in GBLUP had even shorter distances under MBL. Neither of the two methods was able to detect variants having a strong effect on wheat yield, contrary to least-squares GWAS. However, outcomes from GWAS are not strictly comparable with those from shrinkage-based procedures. In single-marker least-squares, the estimator is potentially biased because other genomic regions are ignored in the model; further, short- and long-range linkage disequilibria create statistical ambiguity (Gianola et al. 2016). In WGR, on the other hand, regressions are akin to partial derivatives, i.e., the coefficient gives the net effect of the marker given that the other markers are fitted; typically, regressions become smaller as $p$ is increased at a fixed $n$.

In plant and animal breeding, a focal point is the evaluation of genetic merit of candidates for artificial selection, and the prediction of expected performance in either collateral relatives or in descendants. Under the assumptions of additive inheritance, genome-enabled prediction (Meuwissen et al. 2001) produces estimates of marked additive genomic value, $\mathbf{g}$, or signal as referred to in our paper. In MBL, $\mathbf{g}$ and marker effects can be inferred from their posterior mean or from a modal approximation (MAP-MBL) that does not involve MCMC and is described in Appendix E. A rough comparison between GBLUP, MBL, and MAP-MBL was carried out with the wheat data. For the latter, we used $\boldsymbol{\Sigma} = \mathbf{G}_0/(12p)$, and

Figure 11 Predictive mean-squared error and correlation for gall volume in pine trees: bivariate Bayesian LASSO (MTBayesL) vs. univariate Bayesian LASSO (STBayesL).

starting values for the iteration were calculated using BLUP estimates of marker effects. MAP with $T = 2$ was iterated for 500 rounds. Figure S16 shows that, at iteration 500, the metric used for monitoring convergence had stabilized at the third decimal place, but, for our purposes, iteration could have stopped after 200 rounds. Figure S17 presents a scatter plot of the 2558 (bivariate) marker effect solutions at iterations 1 and 500 against the corresponding BLUP or MBL posterior mean estimates. Clearly, the MAP approach gave markedly different results, producing a stronger shrinkage to 0 of small-effect markers and, thus, an effectively sparser model. Figure S18 gives a comparison of the fitted genomic values, i.e., $\mathbf{g}_1 = \mathbf{X}\boldsymbol{\beta}^{*}_1$ and $\mathbf{g}_2 = \mathbf{X}\boldsymbol{\beta}^{*}_2$ for the two traits. GBLUP and MBL estimates were closely aligned and fit the data in a similar manner. On the other hand, MAP-MBL gave a larger mean-squared error of fit, and a smaller correlation between fitted and observed phenotypes, possibly because of the larger effective sparsity of MAP-MBL. A worse fit to the data does not necessarily imply a poorer predictive ability. A thorough comparison of predictive ability between MBL and MAP-MBL will be carried out in future research.

Our predictive comparison in wheat involved three bivariate models: GBLUP, MBL and BayesCπ, which employs Bayesian model averaging. A training-testing validation replicated 100 times at random indicated no differences among methods. However, it was found that MBL was better than MT Bayesian BLUP for the two pine tree traits. After almost two decades of genome-enabled prediction it is now clear that no universally best prediction machine exists (Gianola et al. 2011; Heslot et al. 2012; de los Campos et al. 2013; Momen et al. 2018; Bellot et al. 2018; Montesinos-López et al. 2018a,b, 2019a,b), even when nonparametric or deep learning techniques are brought into the comparisons. For this reason, we refrain from making any sweeping claim about the superiority (inferiority) of MBL over any other competing multiple-trait Bayesian regression method. If the situation is

such that multiple-trait vectors of effects are fairly homogeneous over markers, it is to be expected that most methods will have a similar performance. On the other hand, if there is underlying heterogeneity of vectors of effects, possibly reflecting “structural sparsity” at the QTL level, it is quite likely that MBL and multiple-trait Bayesian variable selection methods (e.g., Cheng et al. 2018a) will outperform MTGBLUP or a multiple-trait version of Bayes A. Unfortunately, it is difficult to anticipate a priori which method will deliver the best performance, given the limited knowledge of the biological architecture of complex traits, the strong influence of the data structure, and the variability of $\mathbf{X}$ matrices over data sets, in dimension and content.

As far as we know, our paper represents the first report in the quantitative genetics literature of a multiple-trait LASSO, implemented in a Bayesian or empirical Bayes (Appendix E) manner. MBL adds to the armamentarium of genome-enabled prediction, and expands the family of members of the Bayesian alphabet (Gianola et al. 2009; Habier et al. 2011; Gianola 2013). Further, it has been implemented in the publicly available JWAS software (Cheng et al. 2018b). We take the view that every prediction problem is unique, and that no claims about the superiority of a specific procedure over others should be made without qualification. For instance, MBL could perform worse or better than here when applied to other species, traits, or when confronted with different data structures. Most quantitative and disease traits are truly complex, and it is dangerous to offer simplistic solutions or predictive panaceas (Goddard et al. 2019).

Acknowledgments

Chris-Carolin Schön is thanked for useful discussions and comments. J.M. Marín provided technical information on the MLAP distribution. Two anonymous reviewers are thanked for helpful suggestions. Financial support to D.G. was provided by the Deutsche Forschungsgemeinschaft (grant No. SCHO 690/4-1 to CCS) and by the J. Lush Endowment as Visiting Professor at Iowa State University in 2018. The University of Wisconsin-Madison, the Technical University of Munich (TUM) Institute for Advanced Study, TUM School of Life Sciences, and the Institut Pasteur de Montevideo, , are thanked for providing office space, library and computing resources, and logistic support.

Literature Cited

Azodi, C., E. Bolger, A. McCarren, M. Roantree, G. de los Campos et al., 2019 Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3 (Bethesda) 9: 3691–3702. https://doi.org/10.1534/g3.119.400498
Bellot, P., G. de los Campos, and M. Pérez-Enciso, 2018 Can deep learning improve genomic prediction of complex human traits? Genetics 210: 809–819. https://doi.org/10.1534/genetics.118.301298
Calus, M. P. L., and R. F. Veerkamp, 2011 Accuracy of multi-trait genomic selection using different methods. Genet. Sel. Evol. 43: 26. https://doi.org/10.1186/1297-9686-43-26
Cheng, H., K. Kizilkaya, J. Zeng, and D. Garrick, 2018a Genomic prediction from multiple-trait Bayesian regression methods using mixture priors. Genetics 209: 89–103. https://doi.org/10.1534/genetics.118.300650
Cheng, H., R. Fernando, and D. Garrick, 2018b Julia implementation of whole-genome analyses Software. In: Proceedings of the World Congress on Genetics Applied to Livestock Production, available at: http://www.wcgalp.org/proceedings/2018/jwas-julia-implementation-whole-genome-analyses-software
Crossa, J., G. de los Campos, P. Pérez, D. Gianola, J. Burgueño et al., 2010 Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186: 713–724. https://doi.org/10.1534/genetics.110.118521
de los Campos, G., H. Naya, D. Gianola, J. Crossa, A. Legarra et al., 2009 Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics 182: 375–385. https://doi.org/10.1534/genetics.109.101501
de los Campos, G., D. Gianola, and D. A. B. Allison, 2010 Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat. Rev. Genet. 11: 880–886. https://doi.org/10.1038/nrg2898
de los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. L. Calus, 2013 Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193: 327–345. https://doi.org/10.1534/genetics.112.143313
Falconer, D. S., 1952 The problem of environment and selection. Am. Nat. 86: 293–298. https://doi.org/10.1086/281736
Fernando, R., A. Toosi, A. Wolc, D. Garrick, and J. Dekkers, 2017 Application of whole-genome prediction methods for genome-wide association studies: a Bayesian approach. J. Agric. Biol. Environ. Stat. 22: 172–193. https://doi.org/10.1007/s13253-017-0277-6
Galesloot, T. E., K. van Steen, L. A. L. M. Kiemeney, L. L. Janss, and S. H. Vermeulen, 2014 A comparison of multivariate genome-wide association methods. PLoS One 9: e95923. https://doi.org/10.1371/journal.pone.0095923
Gao, H., P. Madsen, J. Pösö, G. P. Aamand, M. Lidauer et al., 2018 Short communication: Multivariate outlier detection for routine Nordic dairy cattle genetic evaluation in the Nordic Holstein and Red population. J. Dairy Sci. 101: 11159–11164. https://doi.org/10.3168/jds.2018-15123
Gelfand, A. E., D. K. Dey, and H. Chang, 1992 Model determination using predictive distributions with implementation via sampling-based methods, pp. 147–167 in Bayesian Statistics, Vol. 4, edited by J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith. Oxford University Press, UK.
Gelman, A., and D. B. Rubin, 1992 Inference from iterative simulation using multiple sequences (with discussion). Stat. Sci. 7: 457–472. https://doi.org/10.1214/ss/1177011136
Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari et al., 2014 Bayesian Data Analysis, Ed. 3. Chapman and Hall/CRC Press, Boca Raton.
Geweke, J., 1992 Evaluating the accuracy of sampling-based approaches to calculating posterior moments, pp. 169–193 in Bayesian Statistics 4, edited by J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith. Clarendon Press, Oxford.
Gianola, D., 2013 Priors in whole genome regression: the Bayesian alphabet returns. Genetics 194: 573–596. https://doi.org/10.1534/genetics.113.151753
Gianola, D., and G. J. M. Rosa, 2015 One hundred years of statistical developments in animal breeding. Annu. Rev. Anim. Biosci. 3: 19–56. https://doi.org/10.1146/annurev-animal-022114-110733
Gianola, D., M. Pérez-Enciso, and M. A. Toro, 2003 On marker-assisted prediction of genetic value: beyond the ridge. Genetics 163: 347–365.
Gianola, D., G. de los Campos, W. G. Hill, E. Manfredi, and R. L. Fernando, 2009 Additive genetic variability and the Bayesian alphabet. Genetics 183: 347–363. https://doi.org/10.1534/genetics.109.103952
Gianola, D., H. Okut, K. A. Weigel, and G. J. M. Rosa, 2011 Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC Genet. 12: 87. https://doi.org/10.1186/1471-2156-12-87
Gianola, D., G. de los Campos, M. A. Toro, H. Naya, C.-C. Schön et al., 2015 Do molecular markers inform about pleiotropy? Genetics 201: 23–29. https://doi.org/10.1534/genetics.115.179978
Gianola, D., M. I. Fariello, H. Naya, and C.-C. Schön, 2016 Genome-wide association studies with a genomic relationship matrix: a case study with wheat and Arabidopsis. G3 (Bethesda) 6: 3241–3256. https://doi.org/10.1534/g3.116.034256
Gianola, D., A. Cecchinato, H. Naya, and C.-C. Schön, 2018 Prediction of complex traits: robust alternatives to best linear unbiased prediction. Front. Genet. 9: 195. https://doi.org/10.3389/fgene.2018.00195
Goddard, M. E., 2009 Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136: 245–257. https://doi.org/10.1007/s10709-008-9308-0
Goddard, M. E., K. E. Kemper, I. M. MacLeod, A. J. Chamberlain, and B. J. Hayes, 2016 Genetics of complex traits: prediction of phenotype, identification of causal polymorphisms and genetic architecture. Proc. Biol. Sci. 283: 20160569. https://doi.org/10.1098/rspb.2016.0569
Goddard, M. E., T. H. E. Meuwissen, and H. D. Daetwyler, 2019 Prediction of phenotype from DNA variants, in Handbook of Statistical Genomics, edited by D. Balding, I. Moltke, and J. Marioni. Wiley, Hoboken. https://doi.org/10.1002/9781119487845.ch28
Gómez, E., M. A. Gómez-Villegas, and J. M. Marín, 1998 A multivariate generalization of the power exponential family of distributions. Commun. Stat. Theory Methods 27: 589–600. https://doi.org/10.1080/03610929808832115
Gómez-Sánchez-Manzano, E., M. A. Gómez-Villegas, E. Gómez, and J. M. Marín, 2008 Multivariate exponential power distributions as mixtures of normal distributions with Bayesian applications. Commun. Stat. Theory Methods 37: 972–985. https://doi.org/10.1080/03610920701762754
Habier, D., R. L. Fernando, K. Kizilkaya, and D. J. Garrick, 2011 Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12: 186. https://doi.org/10.1186/1471-2105-12-186
Hazel, L. N., 1943 The genetic basis for constructing selection indexes. Genetics 28: 476–490.
Henderson, C. R., 1975 Best linear unbiased estimation and prediction under a selection model. Biometrics 31: 423–449. https://doi.org/10.2307/2529430
Henderson, C. R., 1977 Best linear unbiased prediction of breeding values not in the model for records. J. Dairy Sci. 60: 783–787. https://doi.org/10.3168/jds.S0022-0302(77)83935-0
Henderson, C. R., and R. L. Quaas, 1976 Multiple trait evaluation using relatives’ records. J. Anim. Sci. 43: 1188–1197. https://doi.org/10.2527/jas1976.4361188x
Heslot, N., H. P. Yang, M. E. Sorrells, and J. L. Jannink, 2012 Genomic selection in plant breeding: a comparison of models. Crop Sci. 52: 146–160. https://doi.org/10.2135/cropsci2011.06.0297
Isik, F., J. Holland, and C. Maltecca, 2017 Genetic Data Analysis for Plant and Animal Breeding, Springer, Cham. https://doi.org/10.1007/978-3-319-55177-7
Jia, Y., and J. L. Jannink, 2012 Multiple-trait genomic selection methods increase genetic value prediction accuracy. Genetics 192: 1513–1522. https://doi.org/10.1534/genetics.112.144246
Lande, R., and R. Thompson, 1990 Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124: 743–756.
Lee, S. H., N. R. Wray, M. E. Goddard, and P. M. Visscher, 2011 Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 88: 294–305. https://doi.org/10.1016/j.ajhg.2011.02.002
Legarra, A., C. Robert-Granié, P. Croiseau, F. Guillaume, and S. Fritz, 2011 Improved LASSO for genomic selection. Genet. Res. 93: 77–87. https://doi.org/10.1017/S0016672310000534
Lehermeier, C., V. Wimmer, T. Albrecht, H. Auinger, D. Gianola et al., 2013 Sensitivity to prior specification in Bayesian genome-based prediction models. Stat. Appl. Genet. Mol. Biol. 12: 1–17. https://doi.org/10.1515/sagmb-2012-0042
Li, J., K. Das, G. Fu, R. Li, and R. Wu, 2011 The Bayesian LASSO for genome-wide association studies. Bioinformatics 27: 516–523. https://doi.org/10.1093/bioinformatics/btq688
Li, Z., J. Möttönen, and M. J. Sillanpää, 2015 A robust multiple-locus method for quantitative trait locus analysis of non-normally distributed multiple traits. Heredity 115: 556–564. https://doi.org/10.1038/hdy.2015.61
Liquet, B., K. Mengersen, A. N. Pettitt, and M. Sutton, 2017 Bayesian variable selection regression of multivariate responses for group data. Bayesian Anal. 12: 1039–1067. https://doi.org/10.1214/17-BA1081
Long, N., D. Gianola, G. J. M. Rosa, and K. A. Weigel, 2011 Marker-assisted prediction of non-additive genetic values. Genetica 139: 843–854. https://doi.org/10.1007/s10709-011-9588-7
López de Maturana, E., S. J. Chanok, A. C. Picornell, N. Rothman, J. Herranz et al., 2014 Whole genome prediction of bladder cancer risk with the Bayesian LASSO. Genet. Epidemiol. 38: 467–476. https://doi.org/10.1002/gepi.21809
Makowsky, R., N. M. Pajewski, Y. C. Klimentidis, A. I. Vázquez, C. W. Duarte et al., 2011 Beyond missing heritability: prediction of complex traits. PLoS Genet. 7: e1002051. https://doi.org/10.1371/journal.pgen.1002051
Manolio, T. A., F. S. Collins, N. J. Cox, D. B. Goldstein, L. A. Hindorff et al., 2009 Finding the missing heritability of complex diseases. Nature 461: 747–753. https://doi.org/10.1038/nature08494
Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
Momen, M., A. Ayatollahi-Mehrgardi, A. Sheikhi, A. Kranis, L. Tusell et al., 2018 Predictive ability of genome-assisted statistical models under various forms of gene action. Sci. Rep. 8: 12309. https://doi.org/10.1038/s41598-018-30089-2
Montesinos-López, A., O. A. Montesinos-López, D. Gianola, J. Crossa, and C. M. Hernández-Suárez, 2018a Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 (Bethesda) 8: 3813–3828.
Montesinos-López, O. A., A. Montesinos-López, J. Crossa, D. Gianola, and C. M. Hernández-Suárez, 2018b Multi-trait, multi-environment deep learning modeling for genomic-enabled prediction of plant traits. G3 (Bethesda) 8: 3829–3840. https://doi.org/10.1534/g3.118.200728
Montesinos-López, O. A., J. Martín-Vallejo, J. Crossa, D. Gianola, C. M. Hernández-Suárez et al., 2019a New deep learning genomic-based prediction model for multiple traits with binary, ordinal, and continuous phenotypes. G3 (Bethesda) 9: 1545–1556.
Montesinos-López, O. A., J. Martín-Vallejo, J. Crossa, D. Gianola, C. M. Hernández-Suárez et al., 2019b A benchmarking between deep learning, support vector machine and Bayesian threshold best linear unbiased prediction for predicting ordinal traits in plant breeding. G3 (Bethesda) 9: 601–618.
Morota, G., and D. Gianola, 2014 Kernel-based whole-genome prediction of complex traits: a review. Front. Genet. 5: 363. https://doi.org/10.3389/fgene.2014.00363
Moser, G., S. H. Lee, B. J. Hayes, M. E. Goddard, N. M. Wray et al., 2015 Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genet. 11: e1004969. https://doi.org/10.1371/journal.pgen.1004969
Park, T., and G. Casella, 2008 The Bayesian LASSO. J. Am. Stat. Assoc. 103: 681–686. https://doi.org/10.1198/016214508000000337
Pérez, P., and G. de los Campos, 2014 Genome-wide regression and prediction with the BGLR statistical package. Genetics 198: 483–495. https://doi.org/10.1534/genetics.114.164442
Samorodnitsky, G., and M. S. Taqqu, 2000 Stable non-Gaussian random processes, Chapman-Hall/CRC, Boca Raton.
Schaid, D. J., C. Wenan, and N. B. Larson, 2018 From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 19: 491–504. https://doi.org/10.1038/s41576-018-0016-z
Singh, G., G. S. Bhullar, and K. S. Gill, 1986 Genetic control of grain yield and its related traits in bread wheat. Theor. Appl. Genet. 72: 536–540. https://doi.org/10.1007/BF00289537
Sleper, D. A., and M. Poehlman, 2006 Breeding Field Crops, Ed. 5, Wiley-Blackwell, Hoboken, NJ.
Smith, H. F., 1936 A discriminant function for plant selection. Ann. Eugen. 7: 240–250. https://doi.org/10.1111/j.1469-1809.1936.tb02143.x
Sorensen, D., and D. Gianola, 2002 Likelihood, Bayesian, and MCMC methods in quantitative genetics, Springer, New York. https://doi.org/10.1007/b98952
Strandén, I., and D. J. Garrick, 2009 Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J. Dairy Sci. 92: 2971–2975. https://doi.org/10.3168/jds.2008-1929
Tibshirani, R., 1996 Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 58: 267–288.
VanRaden, P. M., 2007 Genomic measures of relationship and inbreeding. Interbull Bulletin 37: 33–36.
VanRaden, P. M., 2008 Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423. https://doi.org/10.3168/jds.2007-0980
Van Tassell, C. P., and L. D. Van Vleck, 1996 Multiple-trait Gibbs sampler for animal models: flexible programs for Bayesian and likelihood-based (co)variance component inference. J. Anim. Sci. 74: 2586–2597. https://doi.org/10.2527/1996.74112586x
Visscher, P. M., M. A. Brown, M. I. McCarthy, and J. Yang, 2012 Five years of GWAS discovery. Am. J. Hum. Genet. 90: 7–24. https://doi.org/10.1016/j.ajhg.2011.11.029
Visscher, P. M., N. R. Wray, Q. Zhang, P. Sklar, M. I. McCarthy et al., 2017 10 Years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 101: 5–22. https://doi.org/10.1016/j.ajhg.2017.06.005
Walsh, B., and M. Lynch, 2018 Evolution and selection of quantitative traits, Oxford University Press, New York. https://doi.org/10.1093/oso/9780198830870.001.0001
Yang, J., B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders et al., 2010 Common SNP’s explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569. https://doi.org/10.1038/ng.608
Yi, N., and S. Xu, 2008 Bayesian LASSO for quantitative trait loci mapping. Genetics 179: 1045–1055. https://doi.org/10.1534/genetics.107.085589
Yuan, M., and Y. Lin, 2006 Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. B 68: 49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x
Yuan, M., A. Ekici, Z. Lu, and R. Monteiro, 2007 Dimension reduction and coefficient estimation in multivariate linear regression. J. R. Stat. Soc. B 69: 329–346. https://doi.org/10.1111/j.1467-9868.2007.00591.x

Communicating editor: M. Sillanpää

13 Appendices

13.1 Appendix A: Excursus on the MLAP distribution

13.1.1 Three bivariate Laplace distributions

For illustration, consider three bivariate Laplace distributions, all having null means but distinct scale matrices, as follows:

$$\boldsymbol{\Sigma}_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \boldsymbol{\Sigma}_2 = \begin{bmatrix} 1 & 0.2 \\ 0.2 & 1 \end{bmatrix}, \quad \boldsymbol{\Sigma}_3 = \begin{bmatrix} 1 & -0.8 \\ -0.8 & 1 \end{bmatrix}. \tag{40}$$

Using (7), the density under $\boldsymbol{\Sigma}_1$ is

$$p(\beta_1, \beta_2 \mid \boldsymbol{\Sigma}_1) = \frac{1}{8\pi} \exp\left(-\frac{1}{2}\sqrt{\beta_1^2 + \beta_2^2}\right). \tag{41}$$

The covariance matrix here, $\mathbf{B}_1 = 12\boldsymbol{\Sigma}_1$, is diagonal, so the random variables are uncorrelated but not independent, since (41) cannot be written as the product of two marginal densities. Under $\boldsymbol{\Sigma}_2$ and $\boldsymbol{\Sigma}_3$, the densities are

$$p(\beta_1, \beta_2 \mid \boldsymbol{\Sigma}_2) = \frac{5\sqrt{6}}{96\pi} \exp\left(-\frac{1}{2}\sqrt{\frac{25}{24}\beta_1^2 - \frac{5}{12}\beta_1\beta_2 + \frac{25}{24}\beta_2^2}\right), \tag{42}$$

and

$$p(\beta_1, \beta_2 \mid \boldsymbol{\Sigma}_3) = \frac{5}{24\pi} \exp\left(-\frac{1}{2}\sqrt{\frac{25}{9}\beta_1^2 + \frac{40}{9}\beta_1\beta_2 + \frac{25}{9}\beta_2^2}\right). \tag{43}$$
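The closed forms (41) to (43) can be checked numerically. The following is a minimal sketch in Python (NumPy assumed); it takes the bivariate normalizing constant $1/(8\pi)$ from (41) and scales it by $|\boldsymbol{\Sigma}|^{-1/2}$, which is an assumption about the general form of (7) that is consistent with (42) and (43). The function name `mlap_density_2d` and the evaluation point are hypothetical.

```python
import numpy as np

def mlap_density_2d(beta, Sigma):
    """Zero-mean bivariate Laplace density evaluated at beta.

    Assumed form: |Sigma|^(-1/2) / (8*pi) * exp(-0.5 * sqrt(beta' Sigma^-1 beta)).
    """
    beta = np.asarray(beta, dtype=float)
    q = beta @ np.linalg.solve(Sigma, beta)          # quadratic form beta' Sigma^-1 beta
    return np.exp(-0.5 * np.sqrt(q)) / (8.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))

Sigma2 = np.array([[1.0, 0.2], [0.2, 1.0]])
Sigma3 = np.array([[1.0, -0.8], [-0.8, 1.0]])
b = np.array([0.7, -1.3])                            # arbitrary evaluation point

# Closed forms (42) and (43) at the same point
q2 = 25 / 24 * b[0] ** 2 - 5 / 12 * b[0] * b[1] + 25 / 24 * b[1] ** 2
q3 = 25 / 9 * b[0] ** 2 + 40 / 9 * b[0] * b[1] + 25 / 9 * b[1] ** 2
p42 = 5 * np.sqrt(6) / (96 * np.pi) * np.exp(-0.5 * np.sqrt(q2))
p43 = 5 / (24 * np.pi) * np.exp(-0.5 * np.sqrt(q3))

print(mlap_density_2d(b, Sigma2), p42)
print(mlap_density_2d(b, Sigma3), p43)
```

Each printed pair should agree to floating-point precision if the general form assumed above is the one used in (7).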

Five bivariate Laplace densities are shown in Figure S1: (a) gives the density of the distribution of the two uncorrelated bivariate Laplace random variables ($\boldsymbol{\Sigma}_1$), and (b) and (c) show the positively (i.e., with $\boldsymbol{\Sigma}_2$) and negatively (with $\boldsymbol{\Sigma}_3$) correlated situations, respectively. These three densities have a sharp mode at $\beta_1 = \beta_2 = 0$, indicating that a bivariate Laplace prior would strongly shrink vectors to the $(0, 0)$ point, acting similarly to the DE prior in the univariate Bayesian LASSO. Panels (d) and (e) display bivariate Laplace densities of distributions with non-null means.

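13.1.2 Conditional distributions

Dropping subscript $j$ denoting a specific marker, partition the $T \times 1$ vector of effects into $\boldsymbol{\beta}' = (\boldsymbol{\beta}_1', \boldsymbol{\beta}_2')$, where the subvectors have orders $T_1$ and $T_2$, respectively; recall that $T$ is the number of traits. Correspondingly, put

$$\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}. \tag{44}$$

According to J. M. Marín (personal communication), the conditional distribution $[\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1]$ has density

$$p(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1, \boldsymbol{\mu}_{2|1}, \boldsymbol{\Sigma}_{2|1}) = \frac{\Gamma\left(\frac{T - T_1}{2}\right)}{|\boldsymbol{\Sigma}_{2|1}|^{\frac{1}{2}} \pi^{\frac{T - T_1}{2}} \int_0^\infty t^{\frac{T - T_1}{2} - 1} \exp\left(-\frac{1}{2}\sqrt{t + q_1}\right) dt} \times \exp\left[-\frac{1}{2}\sqrt{(\boldsymbol{\beta}_2 - \boldsymbol{\mu}_{2|1})' \boldsymbol{\Sigma}_{2|1}^{-1} (\boldsymbol{\beta}_2 - \boldsymbol{\mu}_{2|1}) + q_1}\right], \tag{45}$$

where $\boldsymbol{\mu}_{2|1} = \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\beta}_1$, $\boldsymbol{\Sigma}_{2|1} = \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}$ and $q_1 = \boldsymbol{\beta}_1'\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\beta}_1$. As in the multivariate normal distribution, the conditional expectation is linear in the conditioning variable, and $\boldsymbol{\Sigma}_{2|1}$ does not involve $\boldsymbol{\beta}_1$.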

13.1.3 Simulation of a multivariate Laplace distribution

Gómez et al. (1998) showed that $S$ independent draws from a MLAP distribution with a null mean vector can be made as

$$\boldsymbol{\beta}_i = r_i \mathbf{C}' \mathbf{u}_i, \quad i = 1, 2, \ldots, S, \tag{46}$$

where $\mathbf{C}'$ results from the Cholesky decomposition $\boldsymbol{\Sigma} = \mathbf{C}'\mathbf{C}$, $\mathbf{u}$ is a $T \times 1$ vector uniformly distributed on a $T$-dimensional unit sphere, and $r$ is a realization of a Gamma distribution with shape parameter $T$ and scale 2. Vector $\mathbf{u}$ can be simulated by effecting $T$ independent draws $(x_t,\ t = 1, 2, \ldots, T)$ from a $N(0, 1)$ distribution, and then forming the $t$th element of $\mathbf{u}$ as

$$u_t = \frac{x_t}{\sqrt{\sum_{k=1}^{T} x_k^2}}, \quad t = 1, 2, \ldots, T.$$
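The recipe in (46) can be sketched in a few lines of Python (NumPy assumed). The function name `sample_mlap`, the seed and the sample size are illustrative choices, not part of the original analysis. NumPy's Cholesky routine returns a lower-triangular factor $\mathbf{L}$ with $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}'$, so $\mathbf{L}$ plays the role of $\mathbf{C}'$.

```python
import numpy as np

def sample_mlap(Sigma, S, rng):
    """Draw S zero-mean MLAP vectors via beta_i = r_i * C' * u_i, as in (46)."""
    T = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)                       # Sigma = L L', so L corresponds to C'
    x = rng.standard_normal((S, T))
    u = x / np.linalg.norm(x, axis=1, keepdims=True)    # uniform on the unit sphere
    r = rng.gamma(shape=T, scale=2.0, size=S)           # Gamma with shape T and scale 2
    return r[:, None] * (u @ L.T)                       # row i holds (r_i * L u_i)'

rng = np.random.default_rng(2020)
Sigma2 = np.array([[1.0, 0.2], [0.2, 1.0]])
draws = sample_mlap(Sigma2, 200_000, rng)
emp_cov = np.cov(draws, rowvar=False)
print(emp_cov)
```

Since $\boldsymbol{\Sigma} = \mathbf{B}/[4(T + 1)]$, the empirical covariance of the draws should be close to $12\boldsymbol{\Sigma}_2$ in this bivariate case, up to Monte Carlo error.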

Marginal distributions for the three bivariate Laplace distributions with scale matrices $\boldsymbol{\Sigma}_1$, $\boldsymbol{\Sigma}_2$ and $\boldsymbol{\Sigma}_3$ given above were estimated by sampling $S = 300{,}000$ independent realizations; Equation 46 was employed. Using the samples, zero-mean DE and normal distributions with the same variances were fitted, and the resulting densities were compared with the estimated densities based on the draws. As shown in Figure S2, a normal distribution provided a poor approximation to the marginals from the three bivariate Laplace cases, and the sharp peak at 0, characteristic of a DE density, was not matched by such marginals. This corroborates theoretical results in Gómez et al. (1998): marginals from MLAP distributions are elliptically contoured and not DE.

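13.2 Appendix B: Mean vector of location parameters given ELSE

Consider Equation 19. For $T = 3$, let

$$\mathbf{R}_0^{-1} = \begin{bmatrix} r^{11} & r^{12} & r^{13} \\ r^{21} & r^{22} & r^{23} \\ r^{31} & r^{32} & r^{33} \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma}^{-1} = \begin{bmatrix} \Sigma^{11} & \Sigma^{12} & \Sigma^{13} \\ \Sigma^{21} & \Sigma^{22} & \Sigma^{23} \\ \Sigma^{31} & \Sigma^{32} & \Sigma^{33} \end{bmatrix}. \tag{47}$$

Expansion of the Kronecker products in Equation 19 produces the system

$$\begin{bmatrix}
r^{11}N & r^{12}N & r^{13}N & r^{11}\mathbf{1}_N'\mathbf{X} & r^{12}\mathbf{1}_N'\mathbf{X} & r^{13}\mathbf{1}_N'\mathbf{X} \\
r^{21}N & r^{22}N & r^{23}N & r^{21}\mathbf{1}_N'\mathbf{X} & r^{22}\mathbf{1}_N'\mathbf{X} & r^{23}\mathbf{1}_N'\mathbf{X} \\
r^{31}N & r^{32}N & r^{33}N & r^{31}\mathbf{1}_N'\mathbf{X} & r^{32}\mathbf{1}_N'\mathbf{X} & r^{33}\mathbf{1}_N'\mathbf{X} \\
r^{11}\mathbf{X}'\mathbf{1}_N & r^{12}\mathbf{X}'\mathbf{1}_N & r^{13}\mathbf{X}'\mathbf{1}_N & r^{11}\mathbf{X}'\mathbf{X} + \Sigma^{11}\mathbf{D}^{-1} & r^{12}\mathbf{X}'\mathbf{X} + \Sigma^{12}\mathbf{D}^{-1} & r^{13}\mathbf{X}'\mathbf{X} + \Sigma^{13}\mathbf{D}^{-1} \\
r^{21}\mathbf{X}'\mathbf{1}_N & r^{22}\mathbf{X}'\mathbf{1}_N & r^{23}\mathbf{X}'\mathbf{1}_N & r^{21}\mathbf{X}'\mathbf{X} + \Sigma^{21}\mathbf{D}^{-1} & r^{22}\mathbf{X}'\mathbf{X} + \Sigma^{22}\mathbf{D}^{-1} & r^{23}\mathbf{X}'\mathbf{X} + \Sigma^{23}\mathbf{D}^{-1} \\
r^{31}\mathbf{X}'\mathbf{1}_N & r^{32}\mathbf{X}'\mathbf{1}_N & r^{33}\mathbf{X}'\mathbf{1}_N & r^{31}\mathbf{X}'\mathbf{X} + \Sigma^{31}\mathbf{D}^{-1} & r^{32}\mathbf{X}'\mathbf{X} + \Sigma^{32}\mathbf{D}^{-1} & r^{33}\mathbf{X}'\mathbf{X} + \Sigma^{33}\mathbf{D}^{-1}
\end{bmatrix}
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \boldsymbol{\beta}_1^* \\ \boldsymbol{\beta}_2^* \\ \boldsymbol{\beta}_3^* \end{bmatrix}
=
\begin{bmatrix}
\mathbf{1}_N'(r^{11}\mathbf{y}_1^* + r^{12}\mathbf{y}_2^* + r^{13}\mathbf{y}_3^*) \\
\mathbf{1}_N'(r^{21}\mathbf{y}_1^* + r^{22}\mathbf{y}_2^* + r^{23}\mathbf{y}_3^*) \\
\mathbf{1}_N'(r^{31}\mathbf{y}_1^* + r^{32}\mathbf{y}_2^* + r^{33}\mathbf{y}_3^*) \\
\mathbf{X}'(r^{11}\mathbf{y}_1^* + r^{12}\mathbf{y}_2^* + r^{13}\mathbf{y}_3^*) \\
\mathbf{X}'(r^{21}\mathbf{y}_1^* + r^{22}\mathbf{y}_2^* + r^{23}\mathbf{y}_3^*) \\
\mathbf{X}'(r^{31}\mathbf{y}_1^* + r^{32}\mathbf{y}_2^* + r^{33}\mathbf{y}_3^*)
\end{bmatrix}. \tag{48}$$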

Observe how phenotypes for all traits contribute to the solutions of trait-specific effects.

13.3 Appendix C: Sampling from the conditional posterior distribution of $v_j^2$

Consider (29). Let $Q_j = \boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j$ take values $\frac{1}{2}$, 1, 4 and 10, say. Numerical integration of (29) between 0 and 1000 produces 3.5203, 3.0407, 1.8443 and 1.0314 as reciprocals of the resulting integration constants, with the normalized densities shown in Figure S3. The distributions are skewed and, as $Q_j$ increases, the density becomes flatter.
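The four values can be reproduced approximately with a short numerical check. The sketch below (Python with NumPy) assumes that the kernel of (29) is $(v_j^2)^{-1/2}\exp[-\frac{1}{2}(Q_j/v_j^2 + v_j^2/4)]$, which is the form implied by (50) under the transformation $w_j = 2/v_j^2$; the substitution $v_j^2 = s^2$, the grid size and the function name `integration_constant` are choices made here for illustration only.

```python
import numpy as np

def integration_constant(Q, upper=1000.0, n=200_001):
    """Integrate the assumed kernel of (29) over (0, upper) with a trapezoid rule.

    With v2 = s**2, the integrand v2**(-1/2) * exp(-Q/(2*v2) - v2/8) dv2
    becomes 2 * exp(-Q/(2*s**2) - s**2/8) ds, which has no singularity at 0.
    """
    s = np.linspace(1e-6, np.sqrt(upper), n)
    f = 2.0 * np.exp(-Q / (2.0 * s ** 2) - s ** 2 / 8.0)
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s))

constants = [integration_constant(Q) for Q in (0.5, 1.0, 4.0, 10.0)]
print(np.round(constants, 4))
```

Under the same assumption the integral over $(0, \infty)$ has the closed form $\sqrt{8\pi}\exp(-\sqrt{Q_j}/2)$, which gives a second check on the numbers.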

Let $S_1(y; s)$ be the Lévy density of a positive random variable $Y$ having a positive stable distribution with parameter $s$ (Samorodnitsky and Taqqu 2000). From Gómez et al. (1998) and Gómez-Sánchez-Manzano et al. (2008), the Lévy density is

$$S_1(y; s) = \frac{\left(\frac{s}{4}\right)^{\frac{1}{2}}}{\Gamma\left(\frac{1}{2}\right)} y^{-\frac{1}{2} - 1} \exp\left(-\frac{s}{4y}\right), \tag{49}$$

which is that of an inverse Gamma (IG) distribution with parameters $\alpha = \frac{1}{2}$ and $\beta = \frac{s}{4}$. Consider now the transformation (Gómez et al. 1998) $w_j = 2/v_j^2$, so using (29)

$$p(w_j \mid \text{ELSE}) \propto \left(\frac{2}{w_j}\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}\left[Q_j\left(\frac{2}{w_j}\right)^{-1} + \frac{1}{4}\cdot\frac{2}{w_j}\right]\right\} w_j^{-2} \propto w_j^{-\frac{1}{2} - 1} \exp\left[-\frac{w_j Q_j}{4}\right] \exp\left[-\frac{1}{4w_j}\right]. \tag{50}$$

Consider a Metropolis-Hastings ratio $R$ using (49) with $s = 1$ as proposal distribution, and let $y_j$ be a proposed value and $w_j$ be a member of the target distribution. The ratio is then

$$R = \frac{y_j^{-\frac{1}{2} - 1} \exp\left[-\frac{y_j Q_j}{4}\right] \exp\left[-\frac{1}{4y_j}\right]}{w_j^{-\frac{1}{2} - 1} \exp\left[-\frac{w_j Q_j}{4}\right] \exp\left[-\frac{1}{4w_j}\right]} \times \frac{w_j^{-\frac{1}{2} - 1} \exp\left[-\frac{1}{4w_j}\right]}{y_j^{-\frac{1}{2} - 1} \exp\left[-\frac{1}{4y_j}\right]} = \exp\left[\frac{Q_j}{4}(w_j - y_j)\right], \quad j = 1, 2, \ldots, p. \tag{51}$$

Hence, if a proposal $y_j$ is drawn from $IG(\alpha = \frac{1}{2}, \beta = \frac{1}{4})$, it can be accepted as belonging to the conditional posterior distribution of $w_j$ with probability $\min(1, R)$, with $R$ as above. If accepted, the new value $w_j = y_j$ yields a "new" $v_j^2 = 2/w_j$ that is taken as a draw from $p(v_j^2 \mid \text{ELSE})$; otherwise, stay with the current $v_j^2$.
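The update described above can be sketched as an independence Metropolis-Hastings step in Python (NumPy assumed). The function name `mh_update_w`, the starting value, the seed and the chain length are hypothetical. The proposal is generated as the reciprocal of a Gamma draw with shape 1/2 and rate 1/4, which has the $IG(\frac{1}{2}, \frac{1}{4})$ distribution.

```python
import numpy as np

def mh_update_w(w_current, Q, rng):
    """One Metropolis-Hastings update of w_j = 2 / v_j^2 using the ratio in (51)."""
    # IG(alpha=1/2, beta=1/4) proposal: reciprocal of Gamma(shape=1/2, rate=1/4), i.e., scale=4
    y = 1.0 / rng.gamma(shape=0.5, scale=4.0)
    log_R = 0.25 * Q * (w_current - y)               # log of (51)
    if np.log(rng.uniform()) < log_R:                # accept with probability min(1, R)
        return y
    return w_current

rng = np.random.default_rng(7)
Q = 1.0                                              # illustrative value of Q_j
w = 1.0                                              # arbitrary starting value
draws = np.empty(40_000)
for k in range(draws.size):
    w = mh_update_w(w, Q, rng)
    draws[k] = w
v2_draws = 2.0 / draws                               # implied draws of v_j^2
print(draws.mean())
```

As a sanity check, the kernel in (50) matches an inverse Gaussian density with mean $Q_j^{-1/2}$ and shape parameter 1/2, so with $Q_j = 1$ the chain average of $w_j$ should be near 1, up to Monte Carlo error.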

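13.4 Appendix D: Alternative algorithm for indirect sampling of marker effects

An alternative sampling scheme that uses an equivalent formulation of the model is presented; a two-trait $(T = 2)$ situation is employed for ease of presentation. Let $\mathbf{g}_1 = \mathbf{X}\boldsymbol{\beta}_1^*$ and $\mathbf{g}_2 = \mathbf{X}\boldsymbol{\beta}_2^*$ be the genomic values of the $N$ individuals for each of the traits. A model could be

$$\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{1}\mu_1 \\ \mathbf{1}\mu_2 \end{bmatrix} + \begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix} + \begin{bmatrix} \mathbf{e}_1 \\ \mathbf{e}_2 \end{bmatrix}, \tag{52}$$

where residuals are as before. In a standard genomic best linear unbiased prediction (GBLUP, VanRaden 2008) setting, it is assumed that

$$\begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix} \Big|\, \mathbf{G}, \mathbf{G}_0 \sim N\left(\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}, \mathbf{G}_0 \otimes \mathbf{G}\right), \tag{53}$$

where $\mathbf{G}$ is an $N \times N$ marker-based matrix of "genomic relationships," and

$$\mathbf{G}_0 = \begin{bmatrix} \sigma^2_{g_1} & \sigma_{g_{12}} \\ \sigma_{g_{12}} & \sigma^2_{g_2} \end{bmatrix} \tag{54}$$

is a matrix containing the trait-specific genomic variances and their covariances. Specifically, from the definition of $\mathbf{g}_1$ and $\mathbf{g}_2$ and assuming that $\boldsymbol{\beta}_t^* \mid \sigma^2_{\beta_t} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_{\beta_t})$ $(t = 1, 2)$,

$$Var(\mathbf{g}_t \mid \mathbf{X}, \sigma^2_{\beta_t}) = \frac{1}{p}\mathbf{XX}'\, p\sigma^2_{\beta_t} = \mathbf{G}\sigma^2_{g_t}, \quad t = 1, 2, \tag{55}$$

for $\mathbf{G} = \mathbf{XX}'/p$ and $\sigma^2_{g_t} = p\sigma^2_{\beta_t}$. Similarly, $Cov(\mathbf{g}_1, \mathbf{g}_2' \mid \mathbf{X}, \sigma_{\beta_{12}}) = \mathbf{G}\sigma_{g_{12}}$, where $\sigma_{g_{12}} = p\sigma_{\beta_{12}}$ and $\sigma_{\beta_{12}}$ is the covariance between marker effects on traits 1 and 2. Let $\mathbf{B} = \{\sigma_{\beta_{tt'}}\}$ be the $2 \times 2$ variance–covariance matrix of marker effects.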

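For a bivariate Bayesian LASSO model, conditionally on the $p \times 1$ vector $\mathbf{v}^2$, one has

$$\begin{bmatrix} \boldsymbol{\beta}_1^* \\ \boldsymbol{\beta}_2^* \end{bmatrix} \Big|\, \boldsymbol{\Sigma}, \mathbf{v}^2 \sim N\left(\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}, \boldsymbol{\Sigma} \otimes \mathbf{D}\right). \tag{56}$$

Hence

$$\begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix} \Big|\, \boldsymbol{\Sigma}, \mathbf{v}^2 \sim N\left(\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}, \begin{bmatrix} \mathbf{X} & \mathbf{0} \\ \mathbf{0} & \mathbf{X} \end{bmatrix} (\boldsymbol{\Sigma} \otimes \mathbf{D}) \begin{bmatrix} \mathbf{X}' & \mathbf{0} \\ \mathbf{0} & \mathbf{X}' \end{bmatrix}\right) = N\left(\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix}, \boldsymbol{\Sigma} \otimes \mathbf{XDX}'\right). \tag{57}$$

Let $\mathbf{C}_{Cond} = \boldsymbol{\Sigma} \otimes \mathbf{XDX}'$. Further,

$$E\left(\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \Big|\, \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0\right) = \begin{bmatrix} \mathbf{1}\mu_1 \\ \mathbf{1}\mu_2 \end{bmatrix} \tag{58}$$

and

$$Var\left(\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \Big|\, \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0\right) = \mathbf{C}_{Cond} + \mathbf{R}_0 \otimes \mathbf{I} = \mathbf{V}_{Cond}. \tag{59}$$

After assigning a flat prior to each of $\mu_1$ and $\mu_2$, standard results give that the posterior distribution of the genotypic values given $\boldsymbol{\Sigma}$, $\mathbf{v}^2$ and $\mathbf{R}_0$ is normal, with mean vector

$$E\left(\begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix} \Big|\, \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0, \mathbf{y}\right) = \mathbf{C}_{Cond}\mathbf{V}_{Cond}^{-1} \begin{bmatrix} \mathbf{y}_1 - \mathbf{1}\tilde{\mu}_1 \\ \mathbf{y}_2 - \mathbf{1}\tilde{\mu}_2 \end{bmatrix} = \begin{bmatrix} \tilde{\mathbf{g}}_1 \\ \tilde{\mathbf{g}}_2 \end{bmatrix}, \tag{60}$$

where

$$\begin{bmatrix} \tilde{\mu}_1 \\ \tilde{\mu}_2 \end{bmatrix} = \left(\begin{bmatrix} \mathbf{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{1} \end{bmatrix}' \mathbf{V}_{Cond}^{-1} \begin{bmatrix} \mathbf{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{1} \end{bmatrix}\right)^{-1} \begin{bmatrix} \mathbf{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{1} \end{bmatrix}' \mathbf{V}_{Cond}^{-1} \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix}.$$

Further (Henderson 1975), with $\mathbf{1}$ below denoting the $2N \times 2$ block-diagonal matrix of ones used in (60),

$$Var\left(\begin{bmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{bmatrix} \Big|\, \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0, \mathbf{y}\right) = \mathbf{C}_{Cond} - \mathbf{C}_{Cond}\mathbf{V}_{Cond}^{-1}\mathbf{C}_{Cond} + \mathbf{C}_{Cond}\mathbf{V}_{Cond}^{-1}\mathbf{1}\left(\mathbf{1}'\mathbf{V}_{Cond}^{-1}\mathbf{1}\right)^{-1}\mathbf{1}'\mathbf{V}_{Cond}^{-1}\mathbf{C}_{Cond}$$
$$= \mathbf{C}_{Cond} - \mathbf{C}_{Cond}\left[\mathbf{V}_{Cond}^{-1} - \mathbf{V}_{Cond}^{-1}\mathbf{1}\left(\mathbf{1}'\mathbf{V}_{Cond}^{-1}\mathbf{1}\right)^{-1}\mathbf{1}'\mathbf{V}_{Cond}^{-1}\right]\mathbf{C}_{Cond}. \tag{61}$$

Hence, draws from the conditional posterior distribution of $\mathbf{g} = [\mathbf{g}_1' \ \mathbf{g}_2']'$, given $\boldsymbol{\Sigma}$, $\mathbf{v}^2$ and $\mathbf{R}_0$, can be obtained by sampling from a multivariate normal distribution with mean vector (60) and covariance matrix (61).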

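Assume that, given $\boldsymbol{\Sigma}$, $\mathbf{v}^2$ and $\mathbf{R}_0$, the vector $[\boldsymbol{\beta}_1^{*\prime} \ \boldsymbol{\beta}_2^{*\prime} \ \mathbf{g}_1' \ \mathbf{g}_2']'$ has a multivariate normal distribution, and let $\boldsymbol{\beta}^{*\prime} = [\boldsymbol{\beta}_1^{*\prime} \ \boldsymbol{\beta}_2^{*\prime}]$. Hence

$$E(\boldsymbol{\beta}^* \mid \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0, \mathbf{y}) = E_{\mathbf{g} \mid \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0, \mathbf{y}}\left[E(\boldsymbol{\beta}^* \mid \mathbf{g}, \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0)\right]$$
$$= E_{\mathbf{g} \mid \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0, \mathbf{y}}\left[Cov(\boldsymbol{\beta}^*, \mathbf{g}')(\boldsymbol{\Sigma} \otimes \mathbf{XDX}')^{-1}\mathbf{g}\right]$$
$$= Cov(\boldsymbol{\beta}^*, \mathbf{g}')(\boldsymbol{\Sigma} \otimes \mathbf{XDX}')^{-1}\tilde{\mathbf{g}} = \tilde{\boldsymbol{\beta}}^*. \tag{62}$$

Now,

$$Cov(\boldsymbol{\beta}^*, \mathbf{g}') = Cov\left(\begin{bmatrix} \boldsymbol{\beta}_1^* \\ \boldsymbol{\beta}_2^* \end{bmatrix}, [\mathbf{g}_1' \ \mathbf{g}_2']\right) = \mathbf{B} \otimes \mathbf{X}', \tag{63}$$

so

$$\tilde{\boldsymbol{\beta}}^* = (\mathbf{B} \otimes \mathbf{X}')(\boldsymbol{\Sigma} \otimes \mathbf{XDX}')^{-1}\tilde{\mathbf{g}} = \left(\mathbf{B}\boldsymbol{\Sigma}^{-1} \otimes \mathbf{X}'(\mathbf{XDX}')^{-1}\right)\tilde{\mathbf{g}}. \tag{64}$$

Similarly

$$Var(\boldsymbol{\beta}^* \mid \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0, \mathbf{y}) = Var_{\mathbf{g} \mid \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0, \mathbf{y}}\left[E(\boldsymbol{\beta}^* \mid \mathbf{g}, \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0)\right] + E_{\mathbf{g} \mid \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0, \mathbf{y}}\left[Var(\boldsymbol{\beta}^* \mid \mathbf{g}, \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0)\right]$$
$$= Var_{\mathbf{g} \mid \boldsymbol{\Sigma}, \mathbf{v}^2, \mathbf{R}_0, \mathbf{y}}\left[\left(\mathbf{B}\boldsymbol{\Sigma}^{-1} \otimes \mathbf{X}'(\mathbf{XDX}')^{-1}\right)\mathbf{g}\right] + (\mathbf{B} \otimes \mathbf{I})(\boldsymbol{\Sigma} \otimes \mathbf{XDX}')^{-1}(\mathbf{B} \otimes \mathbf{I})$$
$$= \mathbf{B}\boldsymbol{\Sigma}^{-1}\mathbf{B} \otimes (\mathbf{XDX}')^{-1}. \tag{65}$$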

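13.5 Appendix E: A conditional posterior mode approximation to marker effects

Despite important advances in high-throughput computing, routine genetic evaluation of plants and animals is seldom done with MCMC methods. As an alternative to MCMC, we describe an iterative algorithm that produces point estimates of marker effects (and of linear functions thereof) and approximate measures of uncertainty in a computationally simpler manner. The algorithm uses a reweighted set of linear "mixed model equations," for which extremely efficient solvers exist. It is assumed that "good" estimates of $\mathbf{R}_0$ (the residual covariance matrix) and of $\mathbf{B}$ (the $T \times T$ variance–covariance matrix of marker effects) are available. From (8), $\boldsymbol{\Sigma} = \mathbf{B}/[4(T + 1)]$; e.g., for $T = 3$, $\boldsymbol{\Sigma} = \mathbf{B}/16$; hence, an assessment of the scale matrix of the MLAP distribution is easily available. We make use of (2) and (7), but employ the "markers within trait" representation given in (4). Letting $\boldsymbol{\theta} = (\boldsymbol{\mu}', \boldsymbol{\beta}')'$, the logarithm of the joint (conditionally on the dispersion matrices) posterior density of location effects, apart from a constant, is

$$\log[p(\boldsymbol{\theta} \mid \mathbf{R}_0, \boldsymbol{\Sigma}, \text{DATA})] = -\frac{1}{2}\left\{\mathbf{y}^* - (\mathbf{I}_3 \otimes \mathbf{1}_N)\boldsymbol{\mu} - (\mathbf{I}_3 \otimes \mathbf{X})\boldsymbol{\beta}^*\right\}'\left(\mathbf{R}_0^{-1} \otimes \mathbf{I}_N\right)\left\{\mathbf{y}^* - (\mathbf{I}_3 \otimes \mathbf{1}_N)\boldsymbol{\mu} - (\mathbf{I}_3 \otimes \mathbf{X})\boldsymbol{\beta}^*\right\} - \frac{1}{2}\sum_{j=1}^{p}\sqrt{\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j} = L(\boldsymbol{\theta}). \tag{66}$$

Let

$$L(\boldsymbol{\theta}) = L_{lik}(\boldsymbol{\mu}, \boldsymbol{\beta}^*) + L_{prior}(\boldsymbol{\beta}), \tag{67}$$

where $L_{lik}(\boldsymbol{\mu}, \boldsymbol{\beta}^*)$ and $L_{prior}(\boldsymbol{\beta})$ are the two terms in (66). Then

$$\frac{\partial L_{lik}(\boldsymbol{\mu}, \boldsymbol{\beta}^*)}{\partial \boldsymbol{\mu}} = \left(\mathbf{R}_0^{-1} \otimes \mathbf{1}_N'\right)\left[\mathbf{y}^* - (\mathbf{I}_3 \otimes \mathbf{1}_N)\boldsymbol{\mu} - (\mathbf{I}_3 \otimes \mathbf{X})\boldsymbol{\beta}^*\right] \tag{68}$$

and

$$\frac{\partial L_{lik}(\boldsymbol{\mu}, \boldsymbol{\beta}^*)}{\partial \boldsymbol{\beta}^*} = \left(\mathbf{R}_0^{-1} \otimes \mathbf{X}'\right)\left[\mathbf{y}^* - (\mathbf{I}_3 \otimes \mathbf{1}_N)\boldsymbol{\mu} - (\mathbf{I}_3 \otimes \mathbf{X})\boldsymbol{\beta}^*\right]. \tag{69}$$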

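Observe now that the relationship between $\boldsymbol{\beta}$ and marker effects sorted within traits $(\boldsymbol{\beta}^*)$ can be expressed as $\boldsymbol{\beta} = \mathbf{L}\boldsymbol{\beta}^*$, where $\mathbf{L}$ is a $3p \times 3p$ nonsingular matrix of elementary operators that rearrange rows and columns. For example, for $T = 3$ and $p = 2$, and with $\beta_{jt}$ representing the effect of marker $j$ on trait $t$,

$$\begin{bmatrix} \beta_{11} \\ \beta_{12} \\ \beta_{13} \\ \beta_{21} \\ \beta_{22} \\ \beta_{23} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \beta^*_{11} \\ \beta^*_{21} \\ \beta^*_{12} \\ \beta^*_{22} \\ \beta^*_{13} \\ \beta^*_{23} \end{bmatrix}. \tag{70}$$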
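The reordering in (70) can be generated for any $p$ and $T$. The following minimal Python sketch (NumPy assumed; the function name `sort_permutation` is hypothetical) builds the matrix $\mathbf{L}$ and confirms that it reproduces the $T = 3$, $p = 2$ case and that it is a permutation matrix, so that $\mathbf{L}'\mathbf{L} = \mathbf{I}$.

```python
import numpy as np

def sort_permutation(p, T):
    """Return L with beta = L @ beta_star.

    beta is sorted by traits within markers; beta_star by markers within traits.
    """
    L = np.zeros((p * T, p * T))
    for j in range(p):              # marker index (0-based)
        for t in range(T):          # trait index (0-based)
            L[j * T + t, t * p + j] = 1.0
    return L

L = sort_permutation(p=2, T=3)
beta_star = np.array([11, 21, 12, 22, 13, 23], dtype=float)   # entries coded as "jt"
beta = L @ beta_star
print(L.astype(int))
print(beta)
```

Coding each effect by its marker and trait indices makes the reordering easy to read off the printed vector.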

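Since $\mathbf{L}$ is a matrix of elementary operators, $\mathbf{L}^{-1} = \mathbf{L}'$ (orthogonality) and $\boldsymbol{\beta}^* = \mathbf{L}'\boldsymbol{\beta}$; the absolute value of the Jacobian of the transformation from $\boldsymbol{\beta}$ to $\boldsymbol{\beta}^*$ is equal to 1. The contribution of the prior to the gradient for marker effects is then

$$\frac{\partial}{\partial \boldsymbol{\beta}_j} L_{prior}(\boldsymbol{\beta}) = -\frac{1}{m_j}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j = -\begin{bmatrix} \frac{1}{m_j}\left(\Sigma^{11}\beta_{j1} + \Sigma^{12}\beta_{j2} + \Sigma^{13}\beta_{j3}\right) \\ \frac{1}{m_j}\left(\Sigma^{21}\beta_{j1} + \Sigma^{22}\beta_{j2} + \Sigma^{23}\beta_{j3}\right) \\ \frac{1}{m_j}\left(\Sigma^{31}\beta_{j1} + \Sigma^{32}\beta_{j2} + \Sigma^{33}\beta_{j3}\right) \end{bmatrix}, \quad j = 1, 2, \ldots, p, \tag{71}$$

where $m_j = 2\sqrt{\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j}$ is proportional to the Mahalanobis distance of $\boldsymbol{\beta}_j$ away from $(0, 0, 0)$ for $T = 3$. Hence, the $3p \times 1$ vector of derivatives with respect to all marker effects, sorted by traits within markers, is

$$\frac{\partial}{\partial \boldsymbol{\beta}} L_{prior}(\boldsymbol{\beta}) = -\begin{bmatrix} m_1^{-1}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_1 \\ m_2^{-1}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_2 \\ \vdots \\ m_p^{-1}\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_p \end{bmatrix} = -\left(\mathbf{M}^{-1} \otimes \boldsymbol{\Sigma}^{-1}\right)\boldsymbol{\beta}, \tag{72}$$

where $\mathbf{M} = Diag\{m_j\}$ is a $p \times p$ diagonal matrix with typical element $m_j$. Rearranging the differentials such that the sorting is by markers within traits,

$$\frac{\partial}{\partial \boldsymbol{\beta}^*} L_{prior}(\boldsymbol{\beta}^*) = -\left(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{M}^{-1}\right)\boldsymbol{\beta}^*. \tag{73}$$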
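The prior gradient in (73) can be verified against finite differences. The sketch below (Python with NumPy) is only an illustrative check; the function names, the scale matrix, the number of markers and the step size are arbitrary choices made here, and the prior term is taken as $-\frac{1}{2}\sum_j \sqrt{\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j}$ from (66).

```python
import numpy as np

def quad_forms(beta_star, Sigma_inv, p):
    """Return beta_j' Sigma^-1 beta_j for each marker, with beta_star sorted by markers within traits."""
    T = Sigma_inv.shape[0]
    B = beta_star.reshape(T, p)                       # row t holds the p effects for trait t
    return np.einsum("tj,ts,sj->j", B, Sigma_inv, B)

def log_prior(beta_star, Sigma, p):
    """Prior term of (66): -0.5 * sum_j sqrt(beta_j' Sigma^-1 beta_j)."""
    return -0.5 * np.sum(np.sqrt(quad_forms(beta_star, np.linalg.inv(Sigma), p)))

def prior_gradient_star(beta_star, Sigma, p):
    """Analytic gradient (73): -(Sigma^-1 kron M^-1) beta_star."""
    Sigma_inv = np.linalg.inv(Sigma)
    m = 2.0 * np.sqrt(quad_forms(beta_star, Sigma_inv, p))
    return -np.kron(Sigma_inv, np.diag(1.0 / m)) @ beta_star

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.0]])
p = 4
beta_star = rng.standard_normal(3 * p)
eps = 1e-6
numeric = np.array([
    (log_prior(beta_star + eps * e, Sigma, p) - log_prior(beta_star - eps * e, Sigma, p)) / (2 * eps)
    for e in np.eye(3 * p)
])
analytic = prior_gradient_star(beta_star, Sigma, p)
print(np.max(np.abs(numeric - analytic)))
```

The printed maximum discrepancy should be at the level of finite-difference error, provided no marker vector is exactly at the origin, where the prior is not differentiable.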

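Collecting (69) and (73),

$$\frac{\partial}{\partial \boldsymbol{\beta}^*} L(\boldsymbol{\theta}) = \frac{\partial}{\partial \boldsymbol{\beta}^*} L_{lik}(\boldsymbol{\theta}) + \frac{\partial}{\partial \boldsymbol{\beta}^*} L_{prior}(\boldsymbol{\beta}^*) = \left(\mathbf{R}_0^{-1} \otimes \mathbf{X}'\right)\left[\mathbf{y}^* - (\mathbf{I}_3 \otimes \mathbf{1}_N)\boldsymbol{\mu} - (\mathbf{I}_3 \otimes \mathbf{X})\boldsymbol{\beta}^*\right] - \left(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{M}^{-1}\right)\boldsymbol{\beta}^*. \tag{74}$$

Setting (68) and (74) simultaneously to 0 produces the system of equations (not explicit)

$$\begin{bmatrix} \mathbf{R}_0^{-1}N & \mathbf{R}_0^{-1} \otimes \mathbf{1}_N'\mathbf{X} \\ \mathbf{R}_0^{-1} \otimes \mathbf{X}'\mathbf{1}_N & \left(\mathbf{R}_0^{-1} \otimes \mathbf{X}'\mathbf{X}\right) + \boldsymbol{\Sigma}^{-1} \otimes \mathbf{M}^{-1} \end{bmatrix} \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\beta}^* \end{bmatrix} = \begin{bmatrix} \left(\mathbf{R}_0^{-1} \otimes \mathbf{1}_N'\right)\mathbf{y}^* \\ \left(\mathbf{R}_0^{-1} \otimes \mathbf{X}'\right)\mathbf{y}^* \end{bmatrix}. \tag{75}$$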
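Because $\mathbf{M}$ depends on the marker effects, (75) can be solved by rebuilding $\mathbf{M}$ from the current solution and solving the linear system again until the solutions stabilize. The Python sketch below (NumPy assumed) illustrates this on small simulated data; it is not the authors' implementation. The function name `mbl_conditional_mode`, the starting weights, the stopping rule, the lower bound `floor` on $m_j$ (added here to avoid division by zero) and all simulated values are assumptions made for illustration.

```python
import numpy as np

def mbl_conditional_mode(Y, X, R0, Sigma, n_iter=500, tol=1e-10, floor=1e-10):
    """Iteratively re-solve system (75), updating M = Diag{m_j} at every round."""
    N, T = Y.shape
    p = X.shape[1]
    R0_inv, Sigma_inv = np.linalg.inv(R0), np.linalg.inv(Sigma)
    # Design for theta = (mu', beta*')': [I_T kron 1_N, I_T kron X]
    Z = np.hstack([np.kron(np.eye(T), np.ones((N, 1))), np.kron(np.eye(T), X)])
    W = np.kron(R0_inv, np.eye(N))
    y_star = Y.T.reshape(-1)                     # phenotypes stacked trait by trait
    K_lik = Z.T @ W @ Z                          # likelihood part of the coefficient matrix
    rhs = Z.T @ W @ y_star
    m = np.ones(p)                               # arbitrary starting weights
    theta = np.zeros(T + T * p)
    for _ in range(n_iter):
        K = K_lik.copy()
        K[T:, T:] += np.kron(Sigma_inv, np.diag(1.0 / m))
        theta_new = np.linalg.solve(K, rhs)
        change = np.max(np.abs(theta_new - theta))
        theta = theta_new
        B = theta[T:].reshape(T, p)              # row t: marker effects for trait t
        q = np.einsum("tj,ts,sj->j", B, Sigma_inv, B)
        m = np.maximum(2.0 * np.sqrt(q), floor)  # m_j = 2 * sqrt(beta_j' Sigma^-1 beta_j)
        if change < tol:
            break
    return theta[:T], B, m

# Small simulated example (all values hypothetical)
rng = np.random.default_rng(1)
N, p, T = 60, 5, 2
X = rng.standard_normal((N, p))
B_true = np.array([[1.0, -0.8, 0.6, -1.2, 0.9],
                   [0.5, 0.7, -1.0, 0.8, -0.6]])
R0 = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
E = rng.multivariate_normal(np.zeros(T), R0, size=N)
Y = np.array([2.0, -1.0]) + X @ B_true.T + E
mu_hat, B_hat, m_hat = mbl_conditional_mode(Y, X, R0, Sigma)
print(mu_hat)
print(B_hat)
```

The dense Kronecker products are used only to keep the sketch short; with many markers one would exploit the block structure of (75) rather than forming these matrices explicitly.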

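Expanding the equations above for $T = 3$ yields

$$\begin{bmatrix}
r^{11}N & r^{12}N & r^{13}N & r^{11}\mathbf{1}_N'\mathbf{X} & r^{12}\mathbf{1}_N'\mathbf{X} & r^{13}\mathbf{1}_N'\mathbf{X} \\
r^{21}N & r^{22}N & r^{23}N & r^{21}\mathbf{1}_N'\mathbf{X} & r^{22}\mathbf{1}_N'\mathbf{X} & r^{23}\mathbf{1}_N'\mathbf{X} \\
r^{31}N & r^{32}N & r^{33}N & r^{31}\mathbf{1}_N'\mathbf{X} & r^{32}\mathbf{1}_N'\mathbf{X} & r^{33}\mathbf{1}_N'\mathbf{X} \\
r^{11}\mathbf{X}'\mathbf{1}_N & r^{12}\mathbf{X}'\mathbf{1}_N & r^{13}\mathbf{X}'\mathbf{1}_N & r^{11}\mathbf{X}'\mathbf{X} + \Sigma^{11}\mathbf{M}^{-1} & r^{12}\mathbf{X}'\mathbf{X} + \Sigma^{12}\mathbf{M}^{-1} & r^{13}\mathbf{X}'\mathbf{X} + \Sigma^{13}\mathbf{M}^{-1} \\
r^{21}\mathbf{X}'\mathbf{1}_N & r^{22}\mathbf{X}'\mathbf{1}_N & r^{23}\mathbf{X}'\mathbf{1}_N & r^{21}\mathbf{X}'\mathbf{X} + \Sigma^{21}\mathbf{M}^{-1} & r^{22}\mathbf{X}'\mathbf{X} + \Sigma^{22}\mathbf{M}^{-1} & r^{23}\mathbf{X}'\mathbf{X} + \Sigma^{23}\mathbf{M}^{-1} \\
r^{31}\mathbf{X}'\mathbf{1}_N & r^{32}\mathbf{X}'\mathbf{1}_N & r^{33}\mathbf{X}'\mathbf{1}_N & r^{31}\mathbf{X}'\mathbf{X} + \Sigma^{31}\mathbf{M}^{-1} & r^{32}\mathbf{X}'\mathbf{X} + \Sigma^{32}\mathbf{M}^{-1} & r^{33}\mathbf{X}'\mathbf{X} + \Sigma^{33}\mathbf{M}^{-1}
\end{bmatrix}^{[b]}
\times
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \boldsymbol{\beta}_1^* \\ \boldsymbol{\beta}_2^* \\ \boldsymbol{\beta}_3^* \end{bmatrix}^{[b+1]}
=
\begin{bmatrix}
\mathbf{1}_N'(r^{11}\mathbf{y}_1^* + r^{12}\mathbf{y}_2^* + r^{13}\mathbf{y}_3^*) \\
\mathbf{1}_N'(r^{21}\mathbf{y}_1^* + r^{22}\mathbf{y}_2^* + r^{23}\mathbf{y}_3^*) \\
\mathbf{1}_N'(r^{31}\mathbf{y}_1^* + r^{32}\mathbf{y}_2^* + r^{33}\mathbf{y}_3^*) \\
\mathbf{X}'(r^{11}\mathbf{y}_1^* + r^{12}\mathbf{y}_2^* + r^{13}\mathbf{y}_3^*) \\
\mathbf{X}'(r^{21}\mathbf{y}_1^* + r^{22}\mathbf{y}_2^* + r^{23}\mathbf{y}_3^*) \\
\mathbf{X}'(r^{31}\mathbf{y}_1^* + r^{32}\mathbf{y}_2^* + r^{33}\mathbf{y}_3^*)
\end{bmatrix}, \tag{76}$$

where $b$ denotes the round of iteration. Matrix $\mathbf{M} = Diag(m_j)$ changes at every round of iteration, so the system needs to be reconstituted repeatedly. Marker effects producing small values of the Mahalanobis distance away from $\mathbf{0}$ result in tiny $m_j$ values and, consequently, $\mathbf{M}^{-1}$ will have large diagonal elements. Hence, vectors of markers with weak effects are more strongly shrunk toward the $\mathbf{0}$ coordinate than those having strong effects in at least one trait. The variance–covariance matrix of the conditional posterior distribution can be approximated as

$$Var\left(\begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \boldsymbol{\beta}_1^* \\ \boldsymbol{\beta}_2^* \\ \boldsymbol{\beta}_3^* \end{bmatrix} \Bigg|\, \mathbf{R}_0, \boldsymbol{\Sigma}, \text{DATA}\right) \approx \mathbf{K}^{-1}_{[\infty]}, \tag{77}$$

with $\infty$ indicating parameters evaluated at converged values, assuming that convergence has been attained at a hopefully global mode.

13.6 Appendix F: Treatment of missing data

Let a multi-trait data point $(T \times 1)$ on individual $i$ be represented as

$$\mathbf{y}_i^{complete} = \left(\mathbf{y}_i^{miss}, \mathbf{y}_i^{obs}\right), \tag{78}$$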

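where miss denotes a missing record, e.g., if $T = 2$ a record could be missing for trait 1 or for trait 2; $\mathbf{y}_i^{obs}$ represents the phenotypes for the traits observed in individual $i$. The posterior predictive distribution of $\mathbf{y}_i^{miss}$ has density

$$p\left(\mathbf{y}_i^{miss} \mid \mathbf{y}\right) = \int p\left(\mathbf{y}_i^{miss} \mid \boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0, \mathbf{y}_i^{obs}\right) p(\boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0, \boldsymbol{\Sigma} \mid \mathbf{y})\, d\boldsymbol{\mu}\, d\boldsymbol{\beta}\, d\mathbf{R}_0\, d\boldsymbol{\Sigma}, \tag{79}$$

provided that data points in $i$ are conditionally (given $\boldsymbol{\mu}$, $\boldsymbol{\beta}$, $\mathbf{R}_0$) independent of any other $i'$ in the sample, and with $\mathbf{y}$ being all observed data. The preceding formula implies that $\mathbf{y}_i^{miss}$ can be imputed by sampling $\boldsymbol{\mu}$, $\boldsymbol{\beta}$, $\mathbf{R}_0$ and $\boldsymbol{\Sigma}$ from their posterior distribution, and then drawing from

$$\mathbf{y}_i^{miss} \mid \boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0, \mathbf{y}_i^{obs} \sim N\left(E\left(\mathbf{y}_i^{miss} \mid \boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0, \mathbf{y}_i^{obs}\right), Var\left(\mathbf{y}_i^{miss} \mid \boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0, \mathbf{y}_i^{obs}\right)\right). \tag{80}$$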

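Since the sampling model is normal, for $T = 3$, one has

$$E\left(\mathbf{y}_i^{miss} \mid \boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0, \mathbf{y}_i^{obs}\right) = \begin{bmatrix} \delta_1\mu_1 \\ \delta_2\mu_2 \\ \delta_3\mu_3 \end{bmatrix} + \begin{bmatrix} \delta_1\mathbf{x}_i' & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \delta_2\mathbf{x}_i' & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \delta_3\mathbf{x}_i' \end{bmatrix} \begin{bmatrix} \boldsymbol{\beta}_1^* \\ \boldsymbol{\beta}_2^* \\ \boldsymbol{\beta}_3^* \end{bmatrix} + \mathbf{R}_0^{[miss,obs]}\left(\mathbf{R}_0^{[obs,obs]}\right)^{-1}\mathbf{e}^{[obs]}, \tag{81}$$

where $\delta_1, \delta_2, \delta_3$ take the value 1 when a given trait is missing in case $i$ or denote "exclude from formula" otherwise; $\mathbf{R}_0^{[obs,obs]}$ is the part of $\mathbf{R}_0$ pertaining to observed phenotypes for case $i$, $\mathbf{R}_0^{[miss,obs]}$ is the submatrix of residual covariances between missing and observed traits, and $\mathbf{e}^{[obs]}$ contains the residuals for the observed traits. Further,

$$Var\left(\mathbf{y}_i^{miss} \mid \boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0, \mathbf{y}_i^{obs}\right) = \mathbf{R}_0^{[miss,miss]} - \mathbf{R}_0^{[miss,obs]}\left(\mathbf{R}_0^{[obs,obs]}\right)^{-1}\mathbf{R}_0^{[obs,miss]}. \tag{82}$$

For example, let $T = 3$ and suppose that trait 1 is missing in case 250, but that traits 2 and 3 have been recorded; here

$$E\left(y_{250}^{miss} \mid \boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0, \mathbf{y}_{250}^{obs}\right) = \mu_1 + \mathbf{x}_{250}'\boldsymbol{\beta}_1^* + \begin{bmatrix} r_{12} & r_{13} \end{bmatrix} \begin{bmatrix} r_{22} & r_{23} \\ r_{32} & r_{33} \end{bmatrix}^{-1} \begin{bmatrix} e_{2,250} \\ e_{3,250} \end{bmatrix}, \tag{83}$$

and

$$Var\left(y_{250}^{miss} \mid \boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0, \mathbf{y}_{250}^{obs}\right) = r_{11} - \begin{bmatrix} r_{12} & r_{13} \end{bmatrix} \begin{bmatrix} r_{22} & r_{23} \\ r_{32} & r_{33} \end{bmatrix}^{-1} \begin{bmatrix} r_{12} \\ r_{13} \end{bmatrix}. \tag{84}$$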

In the MCMC algorithm, missing data are sampled independently across cases, but dependently within case by addressing the pattern of missingness peculiar to each observation. Samples for missing observations can be used to estimate predictive distributions for the missing data (Gelfand et al. 1992; Sorensen and Gianola 2002; Gelman et al. 2014).
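The imputation step in (80) to (82) can be sketched for a single case as follows (Python with NumPy). The function name `impute_moments` and every numerical value are hypothetical; in an MCMC run the location parameters and $\mathbf{R}_0$ would be the current draws. The example mirrors the situation in (83) and (84), with trait 1 missing and traits 2 and 3 observed.

```python
import numpy as np

def impute_moments(mu, x_i, B_star, R0, y_i, missing):
    """Conditional mean and covariance of the missing traits of one case, following (81)-(82).

    B_star is p x T (column t holds the marker effects for trait t);
    missing is a boolean mask of length T; entries of y_i at missing traits are ignored.
    """
    missing = np.asarray(missing, dtype=bool)
    obs = ~missing
    fitted = mu + x_i @ B_star                        # mu_t + x_i' beta_t* for every trait
    e_obs = y_i[obs] - fitted[obs]                    # residuals of the observed traits
    R_mo = R0[np.ix_(missing, obs)]
    R_oo = R0[np.ix_(obs, obs)]
    mean = fitted[missing] + R_mo @ np.linalg.solve(R_oo, e_obs)
    cov = R0[np.ix_(missing, missing)] - R_mo @ np.linalg.solve(R_oo, R_mo.T)
    return mean, cov

# Hypothetical current values for one case with trait 1 missing
mu = np.array([1.0, 2.0, 3.0])
x_i = np.array([1.0, 2.0])
B_star = np.array([[0.1, 0.2, 0.0],
                   [-0.3, 0.0, 0.4]])
R0 = np.array([[2.0, 0.5, 0.3],
               [0.5, 1.0, 0.2],
               [0.3, 0.2, 1.5]])
y_i = np.array([np.nan, 2.7, 3.5])
mean, cov = impute_moments(mu, x_i, B_star, R0, y_i, missing=[True, False, False])

rng = np.random.default_rng(3)
y_miss_draw = rng.multivariate_normal(mean, cov)      # one imputed value, as in (80)
print(mean, cov, y_miss_draw)
```

Handling the mask per case is what allows sampling to be independent across cases while respecting each case's own pattern of missingness.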
