Joint Genome-Wide Prediction in Several Populations Accounting for Randomness of Genotypes

1 Joint genome-wide prediction in several populations accounting for randomness of genotypes: 2 A hierarchical Bayes approach. II: Multivariate spike and slab priors for marker effects and 3 derivation of approximate Bayes and fractional Bayes factors for the complete family of models 4 Carlos Alberto Martíneza,b, Kshitij Khareb, Arunava Banerjeec, Mauricio A. Elzoa 5 aDepartment of Animal Sciences 6 bDepartment of Statistics 7 cDepartment of Computer and Information Science and Engineering 8 University of Florida, Gainesville, FL, USA 9 Correspondence: Carlos Alberto Martínez, Department of Animal Sciences, University of Florida, 10 Gainesville, FL 32611, USA. 11 Tel: 352-328-1624. 12 Fax: 352-392-7851. 13 E-mail: [email protected] 14 15 16 17 18 19 20 21 22 23 24 25 1 26 Abstract 27 This study corresponds to the second part of a companion paper devoted to the development of 28 Bayesian multiple regression models accounting for randomness of genotypes in across population 29 genome-wide prediction. This family of models considers heterogeneous and correlated marker 30 effects and allelic frequencies across populations, and has the ability of considering records from 31 non-genotyped individuals and individuals with missing genotypes in any subset of loci without the 32 need for previous imputation, taking into account uncertainty about imputed genotypes. This paper 33 extends this family of models by considering multivariate spike and slab conditional priors for 34 marker allele substitution effects and contains derivations of approximate Bayes factors and 35 fractional Bayes factors to compare models from part I and those developed here with their null 36 versions. These null versions correspond to simpler models ignoring heterogeneity of populations, 37 but still accounting for randomness of genotypes. For each marker loci, the spike component of 38 priors corresponded to point mass at 0 in ℝ풮, where 풮 is the number of populations, and the slab 39 component was a 풮-variate Gaussian distribution, independent conditional priors were assumed. For 40 the Gaussian components, covariance matrices were assumed to be either the same for all markers or 41 different for each marker. For null models, the priors were simply univariate versions of these finite 42 mixture distributions. Approximate algebraic expressions for Bayes factors and fractional Bayes 43 factors were found using the Laplace approximation. Using the simulated datasets described in part I, 44 these models were implemented and compared with models derived in part I using measures of 45 predictive performance based on squared Pearson correlations, Deviance Information Criterion, 46 Bayes factors, and fractional Bayes factors. The extensions presented here enlarge our family of 47 genome-wide prediction models making it more flexible in the sense that it now offers more 48 modeling options. 2 49 Key words: Bayesian whole-genome regressions; Finite mixture priors; genetic heterogeneity; 50 Laplace approximation; multi-population genome-enabled prediction. 51 1. Introduction 52 The scenario of across population genome-wide prediction accounting for randomness of genotypes 53 was addressed in part I of our series of studies. There, we adopted a hierarchical Bayesian modeling 54 strategy to accommodate heterogeneous and correlated marker effects across subpopulations and 55 random genotypes. In that companion paper we provided a detailed derivation of the joint pmf of the 56 genotypes conditional on pedigree information and allelic frequencies and also discussed some of its 57 properties. Furthermore, the flexibility of hierarchical Bayesian modeling allowed us to account for 58 heterogeneous and correlated allelic frequencies. The “MG-GBLUP” model proposed by Lehermeir 59 et al. (2015) is similar to the models developed in part I of this study, except that they did not 60 consider randomness of genotypes. In addition, they did not consider models with different 61 (heterogeneous) covariance matrices of marker effects. One of the main properties of our models is 62 that individuals with phenotypic records and missing genotypes at any subset of loci (including non- 63 genotyped individuals) can be considered in the analysis without previous imputation. Furthermore, 64 due to the use of a Bayesian approach, uncertainty about imputed genotypes is automatically taken 65 into account. 66 The so called “spike and slab” priors, are finite mixtures of a continuous distribution (the slab) and a 67 mass point at some constant (the spike) (Mitchell and Beauchamp, 1988). A particular case of these 68 priors are the zero-inflated priors which have point mass at zero. This sort of priors has been used in 69 high dimensional problems to induce a stronger shrinkage and perform variable selection. In single 70 population analyses, it has been reported that when there are genes with major effects controlling the 71 trait under study or the number of genes controlling the trait is low, Bayesian variable selection 72 models tend to perform better (Daetwyler et al., 2012; Heslot et al., 2012; Gianola and Rosa, 2015). 73 In the case of multiple population analyses, van den Berg et al. (2015) studied scenarios under which 3 74 Bayesian variable selection models outperformed genomic BLUP (GBLUP). They found that 75 GBLUP was outperformed when the number of QTL was smaller than the number of independent 76 chromosome segments. They also found that the difference in accuracy between these models was 77 larger than in the single population case. 78 In a Bayesian framework, model comparison can be performed via Bayes factors and some 79 modifications of them known as non-subjective Bayes factors (Ghosh et al, 2006). Bayes factors 80 measure the change in the odds favoring a model once data are observed (Lavine and Schervish, 81 1999). On the other hand, O’Hagan (1994; 1995) proposed a non-subjective Bayes factor known as 82 fractional Bayes factor which uses a fractional part of the likelihood resulting in a “partial” Bayes 83 factor. Analytical forms of Bayes factors involve integration of the joint distribution of data and 84 parameters over the parameter space of a given model to obtain marginal likelihoods, and even for 85 some simple models these integrals do not have a closed form solution. One option to obtain 86 algebraic approximations is to use the Laplace approximation after arranging the integrand in an 87 appropriate form (Ghosh et al, 2006). Another criterion to compare models is the Deviance 88 Information Criterion (DIC, Spiegelhalter et al., 2002; 2014) which combines measures of model fit 89 and model complexity and, despite some limitations, it has been used in several research areas 90 (Spiegelhalter et al., 2014). 91 Thus, the objectives of this study were to extend the family of models developed in a companion 92 paper (part I) by considering the so called spike and slab priors for marker effects and to derive 93 approximate expressions for Bayes factors and fractional Bayes factors to compare the proposed 94 models with their corresponding null versions that ignore population structure. 95 2. Methods 96 2.1 The models 4 97 The complete population or simply the population is defined as the set of individuals with 98 phenotypes considered in the analysis, which is comprised by a set of 풮 subpopulations defined by 99 some criterion like environment, race, breed, line, etc. Also the following assumptions are made: 100 linkage equilibrium, Hardy-Weinberg equilibrium, no mutation, and starting from the oldest 101 individuals with phenotypes, the pedigree is fully known. 102 The following is the linear model describing the relationship between records and mappings of 103 marker genotypes: 풚 = 푊품 + 풆, where 풚 ∈ ℝ푛 is a vector containing response variables (e.g., 104 records corrected for non-genetic factors), 푊 ∈ ℝ푛×푚 is an observable random matrix with entries 105 corresponding to a one to one mapping from the set of individual marker genotypes to a subset of the 106 integers (defined later), 품 ∈ ℝ푚 is an unknown random vector of average marker allele substitution 107 effects for every population and 풆 ∈ ℝ푛 is a random vector of residuals. If records are sorted by 108 subpopulation as well as the columns of 푊 and the elements of 품, then for every 푙 = 1,2, … , 풮, ( ) ( ) ( ) ( ) 109 풚푙 = 푊푙품푙 + 풆푙, with dimensions: 풚푙 푛푙×1, 푊푙 푛푙×푚, 품푙 푚×1 and 풆푙 푛푙×1 where 푛푙 is the sample 풮 110 size of subpopulation 푙, and 푚 is the number of marker loci; therefore, 푛 = ∑푙=1 푛푙. 111 In our models, the mapping from the set of genotypes at each locus and each individual into a subset 112 of the integers is defined as follows, biallelic loci are considered. If A and B are the marker alleles at 113 each locus and B is considered the reference allele then: 1, 푖푓 푔푒푛표푡푦푝푒 = 퐵퐵 푊 = {푤푙 } = { 0, 푖푓 푔푒푛표푡푦푝푒 = 퐴퐵 . 푙 푖푗 푛 ×푚 푙 −1, 푖푓 푔푒푛표푡푦푝푒 = 퐴퐴 2 2 114 The following is the hierarchical representation of our models. Let 푅 = (휎푒1, … , 휎푒풮) and 푉 = 풮 115 퐵푙표푐푘 퐷푖푎푔. {휎2 퐼 } then 푒푙 푛푙 푙 = 1 풚|푊, 품, 푅~푀푉푁(푊품, 푉) ∗ ∗ ∗ ∗ ∗ ∗ 푊|풑1, 풑2, … , 풑푚~휋(∙|풑1, 풑2, … , 풑푚) 5 푖푖푑 ∗ ∗ 풑푗 ∼ 휋(풑 ), 푗 = 1,2, … , 푚 푖푖푑 휏2 푣 휏2 푣 휎2 , … , 휎2 ∼ 퐼푛푣푒푟푠푒 퐺푎푚푚푎 ( , ) ≔ 퐼퐺 ( , ) 푒1 푒풮 2 2 2 2 푖푛푑 푃표푖푛푡 푚푎푠푠 푎푡 ퟎ 푤푖푡ℎ 푝푟표푏푎푏푖푙푖푡푦 휋0 품푗|퐺푗, 휋0 ∼ { 푀푉푁(ퟎ, 퐺푗) 푤푖푡ℎ 푝푟표푏푎푏푖푙푖푡푦 1 − 휋0 푖푖푑 퐺푗 ∼ 퐼푛푣푒푟푠푒 푊푖푠ℎ푎푟푡(푎, 휮) ≔ 퐼푊(푎, 휮) 휎2 휎 ⋯ 휎 푗1 푗1,2 푗1,풮 휎2 ⋯ 휎 푗2 푗2,풮 퐺푗 = ⋱ ⋮ 푠푦푚 휎2 [ 푗풮 ] where 휎2 is the residual variance in subpopulation 푙, 휎2 is the variance of the effect of the 푗푡ℎ marker 116 푒푙 푗푙 푡ℎ 117 in the 푙 subpopulation, 휎푗푙,푙′ is the covariance between effects of marker 푗 in subpopulations 푙 and ∗ 푡ℎ ∗ 118 푙′, 풑푗 is a parameter related to allelic frequencies of the 푗 marker in each subpopulation and 휋(풑 ) 119 is its probability density function (pdf).

Load more