
bioRxiv preprint doi: https://doi.org/10.1101/102244; this version posted January 22, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

A New Distribution for Bayesian Regression Model with Application in Genomics

Renato Rodrigues Silva 1,*

1 Institute of Mathematics and Statistics, Federal University of Goiás, Goiânia, Goiás, Brazil

*[email protected]

Abstract

In regression analysis, there are situations where the model has more predictor variables than observations of the dependent variable, resulting in the problem known as "large p small n". In the last fifteen years, this problem has received a lot of attention, especially in the genome-wide context. Here we propose the Bayes H model, a Bayesian regression model that uses a mixture of two scaled inverse chi-square distributions as the hyperprior for the variance of each regression coefficient. This model is implemented in the R package BayesH.

Introduction

In regression analysis, there are situations where the model has more predictor variables than observations of the dependent variable, resulting in the problem known as "large p small n" [1].

To address this problem, several methods have already been developed, such as ridge regression [2], least absolute shrinkage and selection operator (LASSO) regression [3], bridge regression [4], smoothly clipped absolute deviation (SCAD) regression [5], and others. This class of regression models is known in the literature as regression with penalized likelihood [6]. In the Bayesian paradigm, some methods have also been proposed, such as stochastic search variable selection [7] and the Bayesian LASSO [8].

Recently, the "large p small n" problem has been receiving more attention from scientists who work with animal or plant genetics, specifically for application in genome-wide selection studies [9].

Genome-wide selection is an approach in quantitative genetics to predict the breeding value of the individuals of a testing population based on estimates of the molecular marker effects obtained from a training population. The training population comprises individuals that were both genotyped and phenotyped, while in the testing population the individuals are only genotyped [10], [11]. Genotyping refers to obtaining the genetic makeup of individuals through some technology [12], and phenotyping is the measurement of traits of economic importance, such as yield, height, etc. [11,13].

With the advent of high-throughput genotyping platforms, it is nowadays possible to define a statistical model to identify associations between molecular markers and an observed phenotype. In these models, the effects of all markers are estimated simultaneously, capturing even small effects [10,14].

In the context of genome-wide selection, many animal and plant breeders have developed Bayesian regression models to predict complex traits when there are more covariates than observations of the response variable. In the cornerstone publication [10], the Bayes A and Bayes B models were presented. In Bayes A, a scaled-t density was used as the prior distribution of marker effects, while in Bayes B the prior distribution was modeled as a mixture of a point mass at zero and a scaled-t density. More recently, the use of a mixture of a point mass at zero and a Gaussian slab was proposed. This model is known in the literature as Bayes Cπ [15, 17–19].

However, there are issues in these models which should be taken into account. The prior distribution is always influential, therefore its choice is crucial. In this paper we propose a Bayesian regression model with a mixture of two scaled inverse chi-square distributions as the hyperprior for the variance of each regression coefficient (the Bayes H model). To our knowledge, it has never been reported before.

An advantage of the model is its flexibility. Depending on the values chosen for the hyperparameters, it is possible to obtain models equivalent to others (Bayesian Ridge Regression and Bayes A) or even to perform variable selection via Gibbs sampling in a broad sense. To illustrate the application of the Bayes H model, we analyzed simulated and real datasets.

Materials and Methods

Simulated Data

The aim of these simulations was to compare the effect of the prior distribution on the prediction of complex traits in situations such as the presence or absence of strong linkage disequilibrium and an oligogenic or polygenic genetic architecture. The settings of the four simulated scenarios are presented below. The phenotypes were calculated using the equation described in (2).

Table 1. Parameter settings for the four simulated scenarios

Scenario | Population* | N. individuals | N. markers | N. QTLs | Distribution of QTL effects
1 | heterogeneous stock mice | 250 | 2500 | 50 | Gamma(3, 0.75)
2 | heterogeneous stock mice | 250 | 2500 | 10 | Gamma(3, 0.75)
3 | random mating | 250 | 2500 | 50 | Gamma(3, 0.75)
4 | random mating | 250 | 2500 | 10 | Gamma(3, 0.75)

* Heterogeneous stock mice genotypes were obtained by subsampling the mice dataset already analyzed by [25] and available in the BGLR library of the R statistical software [19]; random-mating genotypes were sampled from a Bernoulli distribution with allele frequency equal to 0.5.
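As a concrete illustration of how such a scenario can be generated, the sketch below simulates a random-mating dataset (scenarios 3 and 4 of Table 1). It is a minimal sketch, not the simulation code used in the paper: the 0/1 genotype coding follows the Bernoulli(0.5) description above, but the Gamma parameterization (shape 3, rate 0.75), the random signs of the QTL effects, and the residual standard deviation are assumptions made here for illustration.

```r
# Minimal sketch of a random-mating scenario (Table 1, scenarios 3-4)
set.seed(123)
n <- 250      # individuals
p <- 2500     # markers
n_qtl <- 50   # QTLs (scenario 3)

# Genotypes: alleles sampled from a Bernoulli distribution with frequency 0.5
X <- matrix(rbinom(n * p, size = 1, prob = 0.5), nrow = n, ncol = p)

# QTL positions and effects: Gamma(3, 0.75), signs assumed random here
qtl <- sample(p, n_qtl)
beta <- numeric(p)
beta[qtl] <- rgamma(n_qtl, shape = 3, rate = 0.75) * sample(c(-1, 1), n_qtl, replace = TRUE)

# Phenotype from equation (2): y_i = mu + sum_k beta_k * x_ik + e_i
mu <- 10
y <- as.numeric(mu + X %*% beta + rnorm(n, sd = 1))
```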

For each scenario, the predictive performance of Bayesian Ridge Regression and the Bayes H model was compared. Table (2) displays the values of the hyperparameters used in the mixture of scaled inverse chi-square distributions. Depending on the values assigned to the hyperparameters, different models can be defined. For example, using hyperprior A (Table 2), the model is equivalent to the Bayes A model [10]. On the other hand, the use of hyperprior B can be considered a variable selection model in a broad sense.

Table 2. Information about hyperparameters used in each component of the mixture

Hyperprior | ν1 | s1² | ν2 | s2²
A | 5 | 0.04 | 5 | 0.04
B | 5 | 0.5 | 7 | 0.002

The values of the hyperparameters of the prior distribution for σ² were defined as follows: degrees of freedom equal to 5 and scale parameter equal to 0.1. Figure (1) shows the influence of the hyperprior distribution for τ² on the marginal prior for βj. Assuming σ² = 1, we observe that when hyperprior A is used (a mixture whose components have the same scale), the resulting marginal prior for βj is a scaled-t distribution. On the contrary, when hyperprior B is used, the marginal prior obtained for βj is a mixture of scaled-t distributions with the same location parameter but different scale parameters, which results in a distribution with heavier tails and a sharper peak than the scaled-t.

Fig 1. Hyperprior distribution for τ² and marginal prior distribution for βj.
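Since the marginal prior in Figure 1 is obtained by integrating the Gaussian prior (3) over the hyperprior (4) (see the Mathematical Details section below), it can also be visualized by simple Monte Carlo, as in the hedged sketch below. The mixing weight π = 0.5 and the number of draws are illustrative assumptions; a draw from Scaled-Inv-χ²(ν, s²) is obtained as νs²/χ²ν.

```r
# Monte Carlo view of the marginal prior for beta_j (cf. Figure 1)
rscinvchisq <- function(n, nu, s2) nu * s2 / rchisq(n, df = nu)

# tau2 from the two-component mixture (4), then beta | tau2 ~ N(0, tau2 * sigma2)
rbeta_marginal <- function(n, pi0, nu1, s21, nu2, s22, sigma2 = 1) {
  z <- rbinom(n, 1, pi0)
  tau2 <- ifelse(z == 1, rscinvchisq(n, nu1, s21), rscinvchisq(n, nu2, s22))
  rnorm(n, mean = 0, sd = sqrt(tau2 * sigma2))
}

set.seed(1)
# Hyperprior A (Table 2): identical components -> scaled-t marginal
bA <- rbeta_marginal(1e5, pi0 = 0.5, nu1 = 5, s21 = 0.04, nu2 = 5, s22 = 0.04)
# Hyperprior B: different scales -> heavier tails and a sharper peak
bB <- rbeta_marginal(1e5, pi0 = 0.5, nu1 = 5, s21 = 0.5, nu2 = 7, s22 = 0.002)

plot(density(bB), xlim = c(-2, 2), main = "Marginal prior for beta_j")
lines(density(bA), lty = 2)
```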

Real Data

The real dataset comprises 10346 polymorphic markers scored in a mice population of 1814 individuals. The phenotype measured was body mass index (BMI) [28]. The dataset was previously analyzed by [25], where further details about the heterogeneous stock mice population can be found. It is important to mention that the dataset is available in the R package BGLR [19].

Prediction of the complex trait in the mice dataset was done in two steps. First of all, a mixed model was fitted to remove the population structure and kinship effects from the dataset. In the second step, Bayes H or Bayesian Ridge Regression was fitted to make predictions, considering the BLUPs predicted from the mixed model as the response variable.

The inference of population structure was based on clustering of the loadings of the two top principal components obtained from the genomic relationship matrix. The clustering was done using Gaussian mixture models implemented in the library mclust of the R statistical software [23], [29] and [30]. Several mixture models were fitted and the Bayesian information criterion was used to select the best model. The candidate models differ from each other with respect to the covariance matrix of each component of the mixture. The general structure of the covariance matrix is

$$\Sigma_k = \lambda D_k A_k D_k^{\top}$$

where Σk is the covariance matrix of the kth component of the mixture, Dk is the orthogonal matrix of eigenvectors, Ak is a diagonal matrix whose elements are proportional to the eigenvalues, and λ is a scale parameter [30].
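A hedged sketch of this clustering step is shown below. Here X is an n × p marker matrix (for example, the mice genotypes available in BGLR), and the simple standardized cross-product used for the genomic relationship matrix is one common choice, not necessarily the exact matrix used in the paper.

```r
# Population structure: Gaussian mixtures on the two top principal components
library(mclust)

G_mat <- tcrossprod(scale(X)) / ncol(X)   # genomic relationship matrix (one common choice)
ev <- eigen(G_mat, symmetric = TRUE)
pcs <- ev$vectors[, 1:2]                  # loadings of the two top principal components

# Candidate mixtures differ in the structure of Sigma_k = lambda * D_k A_k D_k';
# Mclust fits them and selects the best one by BIC
fit <- Mclust(pcs, G = 1:9)
summary(fit)                              # chosen number of components
plot(fit, what = "classification")        # scatterplot of the clusters
```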

Phenotypic Analysis - Mixed Model

Before predicting the molecular breeding value for BMI, a phenotypic analysis was done. The phenotypic analysis consisted of fitting a mixed model to the heterogeneous stock mice population and predicting the best linear unbiased predictor (BLUP) for each individual [10, 14].

The mixed model is defined by

$$y = X\beta + Zb + \epsilon \quad (1)$$

where y is the vector of the response variable; X is the incidence matrix of the fixed effects; β is the vector of fixed effects (litter, gender, year and population structure); Z is the incidence matrix of the random effects; b is the vector of random effects, which follows a Gaussian distribution with mean 0 and variance σb²; and ε is the random error. Population structure was inferred using the results from the fit of Gaussian mixture models implemented in the package mclust, a library of the R statistical software [23,29,30].
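A minimal sketch of this first step is given below, assuming a hypothetical data frame pheno with columns BMI, litter, gender and year, the mclust classification pop, and the genomic relationship matrix G_mat from the previous sketch. The paper does not name the mixed-model software used for this step; rrBLUP::mixed.solve is one standard REML solver for model (1).

```r
# Step 1 of the real-data analysis: mixed model (1) and BLUPs
library(rrBLUP)

Xfix <- model.matrix(~ litter + gender + year + factor(pop), data = pheno)
fit <- mixed.solve(y = pheno$BMI,          # response variable
                   X = Xfix,               # fixed effects
                   Z = diag(nrow(Xfix)),   # incidence matrix of random effects
                   K = G_mat)              # kinship / genomic relationship
blup <- fit$u   # BLUP of each individual, the response variable of step 2
```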


Genomic Selection - Statistical Model

The statistical model is defined by

$$y_i = \mu + \sum_{k=1}^{p} \beta_k x_{ik} + \epsilon_i \quad (2)$$

where yi is the i-th observation of the response variable; βk is the k-th regression coefficient of the model; xik is the value of the k-th explanatory variable for the i-th individual; and εi is the random error for the i-th individual, which follows N(0, σ²).

Prior Distributions

Considering the fitting of the model, the prior distribution for the intercept µ is defined by

$$\mu \sim N(0, \omega^2)$$

where ω² is a hyperparameter. In practice, we used a large value of ω² to set up this prior distribution as vague.

The prior distribution for each βk given a value of τk² is Gaussian, i.e.,

$$\beta_k \mid \tau_k^2, \sigma^2 \sim N(0, \tau_k^2 \sigma^2) \quad (3)$$

Here is the novelty of the manuscript. The hyperprior distribution for τk² is conditioned on a latent variable Zk. Hence, the hyperprior distribution for τk² follows a mixture of two scaled inverse chi-square distributions, i.e.,

$$Z_k \sim \text{Bernoulli}(1, \pi)$$
$$\tau_k^2 \mid Z_k = 1 \sim \text{Scaled-Inv-}\chi^2(\nu_1, s_1^2) \quad (4)$$
$$\tau_k^2 \mid Z_k = 0 \sim \text{Scaled-Inv-}\chi^2(\nu_2, s_2^2)$$

The hyperprior distribution of π is a Beta distribution with parameters (α, γ). In practice, we adopted α = 1 and γ = 1 to obtain a vague hyperprior. Finally, the prior distribution for σ² is

$$\sigma^2 \sim \text{Scaled-Inv-}\chi^2(\nu_\sigma, s_\sigma^2) \quad (5)$$

where the hyperparameters νσ and sσ² represent the degrees of freedom and the scale parameter of the scaled inverse chi-square distribution.

Likelihood and Posterior Distribution

The likelihood is defined by

$$P(y \mid \mu, \beta, \sigma^2) = \frac{1}{(\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \mu - \sum_{k=1}^{p} \beta_k x_{ik} \right)^2 \right\} \quad (6)$$

Hence, the joint posterior distribution is given by

$$P(\mu, \beta, \tau^2, \sigma^2, z, \pi \mid y) \propto P(y \mid \mu, \beta, \sigma^2)\, P(\mu) \prod_{k=1}^{p}\left[ P(\beta_k \mid \tau_k^2, \sigma^2)\, P(\tau_k^2 \mid Z_k)\, P(Z_k \mid \pi)\right] P(\pi)\, P(\sigma^2) \quad (7)$$


 n p !2 2 2 1  1 X X  P r(µ, β, τ , σ z,π|y) ∝ n exp − 2 yi − µ − βkxik (σ2) 2 2σ  i=1 k=1  I(Z =1)   p "  2 ! k 1 2 Y 1 ν1s1 exp − 2 µ π ν1 exp − 2 2ω 2( 2 +1) 2τ k=1 τk !I(Zk=0) # 1  ν s2  1  1  (1 − π) exp − 2 2 exp − β2 2( ν2 +1) 2 2 1 2 k 2 2τk (τ ) 2 2τk τk k 1 1  ν s2  πα−1(1 − π)γ−1 exp − σ σ ( νσ +1) 2 B(α, γ) (σ2) 2 2σ

Gibbs sampling

Gibbs sampling was used to obtain a sequence of observed values of the parameters [20], [21]. The full conditional posterior distribution for µ is given by

$$\mu^{(g)} \mid \beta^{(g-1)}, \tau_k^{2(g-1)}, \sigma^{2(g-1)}, \pi^{(g-1)}, z^{(g-1)}, y \sim N\left(\tilde{\mu}^{(g)}, \tilde{\sigma}^{2(g)}\right) \quad (8)$$

where

$$\tilde{\mu}^{(g)} = \frac{\sum_{i=1}^{n} y_i^{*(g)}/\sigma^{2(g-1)}}{\frac{n}{\sigma^{2(g-1)}} + \frac{1}{\omega^2}}, \qquad \tilde{\sigma}^{2(g)} = \frac{1}{\frac{n}{\sigma^{2(g-1)}} + \frac{1}{\omega^2}}, \qquad y_i^{*(g)} = y_i - \sum_{k=1}^{p}\beta_k^{(g-1)} x_{ik},$$

and g is the counter of the Gibbs sampling algorithm.

The full conditional posterior distribution for τk² given Zk = 1 is defined by

$$\tau_k^{2(g)} \mid \mu^{(g)}, \beta^{(g-1)}, \pi^{(g-1)}, \sigma^{2(g-1)}, Z_k = 1, y \sim \text{Scaled-Inv-}\chi^2\left(\nu_1 + 1,\; \frac{\beta_k^{2(g-1)}/\sigma^{2(g-1)} + \nu_1 s_1^2}{\nu_1 + 1}\right) \quad (9)$$

Likewise, for Zk = 0 we have

$$\tau_k^{2(g)} \mid \mu^{(g)}, \beta^{(g-1)}, \pi^{(g-1)}, \sigma^{2(g-1)}, Z_k = 0, y \sim \text{Scaled-Inv-}\chi^2\left(\nu_2 + 1,\; \frac{\beta_k^{2(g-1)}/\sigma^{2(g-1)} + \nu_2 s_2^2}{\nu_2 + 1}\right) \quad (10)$$

The values of zk(g) are obtained by computing the probability of Zk = 1 given τk2(g) and π(g−1), i.e.,

$$P(Z_k = 1 \mid \tau_k^{2(g)}, \pi^{(g-1)}) = \frac{\pi f_1(\tau_k^{2(g)})}{\pi f_1(\tau_k^{2(g)}) + (1-\pi) f_2(\tau_k^{2(g)})} \quad (11)$$

where f1(τ²) and f2(τ²) are the probability density functions of the scaled inverse chi-square distribution with parameters (ν1, s1²) and (ν2, s2²), respectively.


Moreover,

$$Z_k^{(g)} \sim \text{Bernoulli}\left(1,\; P(Z_k = 1 \mid \tau_k^{2(g)}, \pi^{(g-1)})\right). \quad (12)$$

The full conditional posterior distribution for π is defined by

$$\pi^{(g)} \mid \mu^{(g)}, \beta^{(g-1)}, \tau_k^{2(g)}, \sigma^{2(g-1)}, z^{(g)}, y \sim \text{Beta}\left(\alpha + \sum_{k=1}^{p} z_k^{(g)},\; \gamma + p - \sum_{k=1}^{p} z_k^{(g)}\right) \quad (13)$$

The procedure for sampling each βk from its full conditional posterior distribution was adapted from the strategy proposed by [22], i.e.,

$$\beta_k^{(g)} \mid \mu^{(g)}, \tau_k^{2(g)}, \sigma^{2(g-1)}, \pi^{(g)}, \beta_{-k}^{(g-1)}, Z_k, y \sim N\left(\dot{\beta}_k^{(g)}, \dot{\sigma}_k^{2(g)}\right) \quad (14)$$

where

$$\dot{\beta}_k^{(g)} = \frac{\sum_{i=1}^{n} x_{ik}\, y_{ik}^{**(g)}/\sigma^{2(g-1)}}{\frac{\sum_{i=1}^{n} x_{ik}^2}{\sigma^{2(g-1)}} + \frac{1}{\sigma^{2(g-1)}\tau_k^{2(g)}}}, \qquad \dot{\sigma}_k^{2(g)} = \frac{1}{\frac{\sum_{i=1}^{n} x_{ik}^2}{\sigma^{2(g-1)}} + \frac{1}{\sigma^{2(g-1)}\tau_k^{2(g)}}},$$

and

$$y_{ik}^{**(g)} = y_i - \mu^{(g)} - \sum_{j \neq k}\beta_j^{(g-1)} x_{ij}.$$

Finally, the full conditional posterior distribution for σ² is given by

$$\sigma^{2(g)} \mid \mu^{(g)}, \beta^{(g)}, \tau_k^{2(g)}, y \sim \text{Scaled-Inv-}\chi^2\left(\nu_\sigma + n + p,\; \frac{SSE + SSB + \nu_\sigma s_\sigma^2}{\nu_\sigma + n + p}\right) \quad (15)$$

where $SSE = \sum_{i=1}^{n}\left(y_i - \mu^{(g)} - \sum_{k=1}^{p}\beta_k^{(g)} x_{ik}\right)^2$ and $SSB = \sum_{k=1}^{p} \beta_k^{2(g)}/\tau_k^{2(g)}$.

For Bayesian Ridge Regression, there is a single hyperparameter τ². The prior distributions for τ² and σ² follow scaled inverse chi-square distributions with hyperparameters (ν, s²) and (νσ, sσ²), respectively. The full conditional posterior distribution for τ² is

$$\tau^{2(g)} \mid \mu^{(g)}, \beta^{(g-1)}, \sigma^{2(g-1)}, y \sim \text{Scaled-Inv-}\chi^2\left(\nu + p,\; \frac{\sum_{k=1}^{p}\beta_k^{2(g-1)}/\sigma^{2(g-1)} + \nu s^2}{\nu + p}\right)$$

and for σ², we have

$$\sigma^{2(g)} \mid \mu^{(g)}, \beta^{(g)}, \tau^{2(g)}, y \sim \text{Scaled-Inv-}\chi^2\left(\nu_\sigma + n + p,\; \frac{SSE + SSB + \nu_\sigma s_\sigma^2}{\nu_\sigma + n + p}\right)$$

Consequently, the full conditional posterior for the βk parameters is given by

$$\beta_k^{(g)} \mid \mu^{(g)}, \tau^{2(g)}, \sigma^{2(g-1)}, \beta_{-k}^{(g-1)}, y \sim N\left(\dot{\beta}_k^{(g)}, \dot{\sigma}_k^{2(g)}\right)$$


where

$$\dot{\beta}_k^{(g)} = \frac{\sum_{i=1}^{n} x_{ik}\, y_{ik}^{**(g)}/\sigma^{2(g-1)}}{\frac{\sum_{i=1}^{n} x_{ik}^2}{\sigma^{2(g-1)}} + \frac{1}{\sigma^{2(g-1)}\tau^{2(g)}}}, \qquad \dot{\sigma}_k^{2(g)} = \frac{1}{\frac{\sum_{i=1}^{n} x_{ik}^2}{\sigma^{2(g-1)}} + \frac{1}{\sigma^{2(g-1)}\tau^{2(g)}}}.$$

In summary, the Gibbs sampling algorithm can be defined as follows.

For g = 1 to G:

1. Generate µ(g) from (8).
2. Generate each τk2(g) from (9) or (10), according to zk(g−1).
3. Generate z(g) from (11) and (12).
4. Generate π(g) from (13).
5. Generate each βk(g) from (14).
6. Generate σ2(g) from (15).

where G is the number of iterations.

This algorithm was implemented in an R package [23] called BayesH, available at https://cran.r-project.org/web/packages/BayesH/index.html.
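For concreteness, the sketch below transcribes the full conditionals (8)–(15) into a minimal R Gibbs sampler. It is a sketch, not the source code of the BayesH package: the default hyperparameters follow hyperprior B and the σ² prior described above, while the starting values, the running posterior mean and the absence of a burn-in period are simplifications assumed here.

```r
bayesH_gibbs <- function(y, X, G = 5000,
                         nu1 = 5, s21 = 0.5, nu2 = 7, s22 = 0.002, # hyperprior B
                         nu_s = 5, s2_s = 0.1,                     # prior for sigma2
                         a0 = 1, g0 = 1, omega2 = 1e6) {
  n <- nrow(X); p <- ncol(X)
  rscinvchisq <- function(nu, s2) nu * s2 / rchisq(1, df = nu)  # Scaled-Inv-chi2 draw
  dscinvchisq <- function(x, nu, s2)                            # density, used in (11)
    (s2 * nu / 2)^(nu / 2) / gamma(nu / 2) * x^(-(1 + nu / 2)) * exp(-nu * s2 / (2 * x))

  # starting values (illustrative)
  mu <- mean(y); beta <- rep(0, p); tau2 <- rep(0.01, p)
  z <- rbinom(p, 1, 0.5); pi0 <- 0.5; sigma2 <- var(y)
  xtx <- colSums(X^2)
  e <- as.numeric(y - mu - X %*% beta)   # current residuals
  bhat <- numeric(p)                     # running posterior mean of beta

  for (g in 1:G) {
    # (8) intercept
    e <- e + mu
    v <- 1 / (n / sigma2 + 1 / omega2)
    mu <- rnorm(1, v * sum(e) / sigma2, sqrt(v))
    e <- e - mu

    # (9)-(10) tau2_k given z_k
    for (k in 1:p) {
      if (z[k] == 1)
        tau2[k] <- rscinvchisq(nu1 + 1, (beta[k]^2 / sigma2 + nu1 * s21) / (nu1 + 1))
      else
        tau2[k] <- rscinvchisq(nu2 + 1, (beta[k]^2 / sigma2 + nu2 * s22) / (nu2 + 1))
    }

    # (11)-(12) latent indicators
    p1 <- pi0 * dscinvchisq(tau2, nu1, s21)
    p2 <- (1 - pi0) * dscinvchisq(tau2, nu2, s22)
    z <- rbinom(p, 1, p1 / (p1 + p2))

    # (13) mixture weight
    pi0 <- rbeta(1, a0 + sum(z), g0 + p - sum(z))

    # (14) regression coefficients, one at a time
    for (k in 1:p) {
      e <- e + X[, k] * beta[k]          # residual without marker k: y**
      v <- 1 / (xtx[k] / sigma2 + 1 / (sigma2 * tau2[k]))
      beta[k] <- rnorm(1, v * sum(X[, k] * e) / sigma2, sqrt(v))
      e <- e - X[, k] * beta[k]
    }

    # (15) residual variance
    sse <- sum(e^2); ssb <- sum(beta^2 / tau2)
    sigma2 <- rscinvchisq(nu_s + n + p, (sse + ssb + nu_s * s2_s) / (nu_s + n + p))
    bhat <- bhat + beta / G
  }
  # predictions for new genotypes: X_new %*% bhat (plus the intercept)
  list(beta_hat = bhat, mu = mu, sigma2 = sigma2)
}
```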

Mathematical Details about the Prior Distribution

In this section we show some details about the conditional prior distribution of βk given τk², given a prior distribution for τk²|Zk. This demonstration is based on [17]. The distribution of the hyperparameter τk² given Zk = 1 is described by

$$f_{\tau_k^2 \mid z_k=1}(\tau_k^2) = \frac{\left(\frac{s_1^2\nu_1}{2}\right)^{\frac{\nu_1}{2}}}{\Gamma\!\left(\frac{\nu_1}{2}\right)}\,\frac{1}{(\tau_k^2)^{1+\frac{\nu_1}{2}}}\,\exp\left\{-\frac{s_1^2\nu_1}{2\tau_k^2}\right\},$$

and the prior distribution for βk given σ² and τk² is defined by

$$f_{\beta_k \mid \sigma^2, z_k=1} = \frac{1}{\sqrt{2\pi\,\sigma^2\tau_k^2}}\,\exp\left\{-\frac{\beta_k^2}{2\tau_k^2\sigma^2}\right\}.$$

Consequently, the prior distribution for βk conditioned on σ² and Zk = 1 is given by

$$\begin{aligned}
f_{\beta_k \mid \sigma^2, z_k=1} &= \int_0^\infty \frac{\left(\frac{s_1^2\nu_1}{2}\right)^{\frac{\nu_1}{2}}}{\Gamma\!\left(\frac{\nu_1}{2}\right)}\,\frac{1}{(\tau_k^2)^{1+\frac{\nu_1}{2}}}\,\frac{1}{\sqrt{2\pi\,\sigma^2\tau_k^2}}\,\exp\left\{-\frac{1}{2\tau_k^2}\left(s_1^2\nu_1 + \frac{\beta_k^2}{\sigma^2}\right)\right\} d\tau_k^2 \\
&\propto \int_0^\infty \frac{1}{(\tau_k^2)^{1+\frac{\nu_1+1}{2}}}\,\exp\left\{-\frac{1}{2\tau_k^2}\left(s_1^2\nu_1 + \frac{\beta_k^2}{\sigma^2}\right)\right\} d\tau_k^2 \quad (16)
\end{aligned}$$

To solve the integral written in (16), we make the change of variable


$$u = \frac{\nu_1 s_1^2\sigma^2 + \beta_k^2}{2\tau_k^2\sigma^2}$$

and find dτk² in terms of u and du. This implies

$$\tau_k^2 = \frac{1}{2u}\left(\frac{\nu_1 s_1^2\sigma^2 + \beta_k^2}{\sigma^2}\right) \quad (17)$$

and

$$d\tau_k^2 = -\frac{1}{2}\,u^{-2}\left(\frac{\nu_1 s_1^2\sigma^2 + \beta_k^2}{\sigma^2}\right) du. \quad (18)$$

Substituting both (17) and (18) in (16), we have

$$\begin{aligned}
f_{\beta_k \mid \sigma^2, z_k=1} &\propto \int_0^\infty \left(\frac{\nu_1 s_1^2\sigma^2 + \beta_k^2}{\sigma^2}\right)^{-\frac{\nu_1+1}{2}} u^{\frac{\nu_1+1}{2}-1}\exp\{-u\}\, du \\
&\propto \left(\frac{\nu_1 s_1^2\sigma^2 + \beta_k^2}{\sigma^2}\right)^{-\frac{\nu_1+1}{2}} \int_0^\infty u^{\frac{\nu_1+1}{2}-1}\exp\{-u\}\, du \\
&\propto \left(\frac{\nu_1 s_1^2\sigma^2 + \beta_k^2}{\sigma^2}\right)^{-\frac{\nu_1+1}{2}} \Gamma\!\left(\frac{\nu_1+1}{2}\right) \\
&\propto \left(\nu_1 s_1^2\sigma^2 + \beta_k^2\right)^{-\frac{\nu_1+1}{2}} \Gamma\!\left(\frac{\nu_1+1}{2}\right) \\
&\propto \left(1 + \frac{\beta_k^2}{\nu_1 s_1^2\sigma^2}\right)^{-\frac{\nu_1+1}{2}} \\
&\propto \left(1 + \frac{\beta_k^2}{\nu_1 \tilde{s}_{1\sigma^2}}\right)^{-\frac{\nu_1+1}{2}} \quad (19)
\end{aligned}$$

showing that (19) is the kernel of a scaled-t distribution [24], [17] with ν1 degrees of freedom and scale parameter s̃1σ² = s1²σ².

Likewise, for Zk = 0, we have

$$\beta_k \mid \sigma^2, z_k = 0 \sim \text{Scaled-t}(0, \nu_2, \tilde{s}_{2\sigma^2}),$$

where s̃2σ² = s2²σ². Consequently,

$$\beta_k \mid \sigma^2, Z_k = 1 \sim \text{Scaled-t}(\nu_1, \tilde{s}_{1\sigma^2}) \quad \text{with probability } \pi$$
$$\beta_k \mid \sigma^2, Z_k = 0 \sim \text{Scaled-t}(\nu_2, \tilde{s}_{2\sigma^2}) \quad \text{with probability } (1-\pi)$$

showing that for ν1 = ν2 and s1² = s2², the prior distribution of each βk in the Bayes H model is equivalent to the prior distribution of the Bayes A model. Furthermore, for s1² or s2² tending to zero, the prior distribution for each βk is equivalent to that of the Bayes B model. There are other possibilities; for example, letting s1² or s2² tend to infinity, a mixture of a Gaussian slab and a scaled-t distribution is obtained as the prior for each βk.
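This equivalence can be checked numerically: integrating the conditional Gaussian prior against the Scaled-Inv-χ²(ν1, s1²) hyperprior should reproduce the scaled-t density with ν1 degrees of freedom and squared scale s1²σ². A small sketch, with illustrative values of the hyperparameters:

```r
# Numerical check of (19)
nu1 <- 5; s21 <- 0.04; sigma2 <- 1

dscinvchisq <- function(x, nu, s2)
  (s2 * nu / 2)^(nu / 2) / gamma(nu / 2) * x^(-(1 + nu / 2)) * exp(-nu * s2 / (2 * x))

# marginal prior: integral of N(0, tau2 * sigma2) against the hyperprior
marginal <- function(b)
  integrate(function(t2) dnorm(b, 0, sqrt(t2 * sigma2)) * dscinvchisq(t2, nu1, s21),
            lower = 0, upper = Inf)$value

# scaled-t density with df nu1 and scale sqrt(s1^2 * sigma2)
scaled_t <- function(b) {
  s <- sqrt(s21 * sigma2)
  dt(b / s, df = nu1) / s
}

b <- c(0, 0.1, 0.5, 1)
cbind(numeric = sapply(b, marginal), scaled_t = scaled_t(b))
# the two columns agree up to numerical error
```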

Results and Discussion

In this study, we proposed a new hyperprior for a Bayesian regression model to predict complex traits. This model was applied to real and simulated datasets.


Results from the cross-validation studies show that the prediction accuracy of the Bayes H model is slightly higher than that of Bayesian Ridge Regression in the scenarios where the dataset was generated from the heterogeneous stock mice population, and considerably higher for the datasets simulated under random mating. Hence, the type of population, and consequently the strength of linkage disequilibrium, is more influential on prediction than the number of QTLs (Figure 2). However, comparing the two datasets generated under random mating, the number of QTLs caused an increase in the prediction accuracy of the Bayes H model. No difference was observed in the prediction accuracy of Bayes H when different hyperpriors were used.

Fig 2. Evaluation of the predictive performance of the Bayes H model using 5-fold cross validation. Box plots of the distribution of Pearson's correlation between observed and predicted values for each simulated scenario.
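The cross-validation scheme itself is simple; a hedged sketch is shown below. Here fit_and_predict is a hypothetical wrapper, assumed to fit the chosen model (Bayes H or Bayesian Ridge Regression) on the training fold and return predictions for the test fold; for instance, it could call the bayesH_gibbs() sketch above and predict with X_test %*% beta_hat.

```r
# 5-fold cross validation with Pearson's correlation (cf. Figures 2 and 4)
set.seed(42)
n <- length(y)
fold <- sample(rep(1:5, length.out = n))

acc <- sapply(1:5, function(f) {
  train <- fold != f
  pred <- fit_and_predict(y[train], X[train, ], X[!train, ])  # hypothetical wrapper
  cor(y[!train], pred)                                        # Pearson's correlation
})
acc   # one accuracy value per fold; box plots of these give Figures 2 and 4
```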

Figure (3) shows the population structure inferred from the top two eigenvectors obtained from the correlation matrix of the mice dataset. The Bayesian information criterion indicates that the best model is a Gaussian mixture with 8 components (Table 3). Thus, we can infer the presence of eight subpopulations, which matches the number of founders of the heterogeneous stock mice population. The scatterplot of the top two eigenvectors estimated from the genomic relationship matrix shows the clusters (Figure 3). Prediction accuracy for BMI was compared between different hyperpriors of Bayesian regression models in the heterogeneous stock mice population. To make the comparison, 5-fold cross validation was used. Box plots of the distribution of Pearson's correlation between observed and predicted values reveal moderate to high accuracy for all models (Figure 4). Moreover, Bayesian Ridge Regression presented slightly higher correlation than the variable selection model (hyperprior B) and considerably higher correlation than the model with hyperprior A.

Fig 3. Population structure of heterogeneous stock mice.

A possible explanation for the fact that the Bayes H model outperformed Bayesian Ridge Regression only in the simulated datasets is the genetic architecture of the trait BMI. In the simulated data, the QTLs were the only source of variation considered. Furthermore, the number of QTLs used in the simulations was at most moderate (50). Therefore, models that take into consideration that markers have different variances depending on their effects normally predict better than Bayesian Ridge Regression [10, 15]. Using a fat percentage dataset from a dairy cattle population, in which a single gene explains 50% of the genetic variation, Verbyla et al. [26] reported that predictions from the fit of Bayes Cπ were more accurate than predictions obtained by RR-BLUP. Moreover, Resende et al. [27] analyzed data from 17 traits measured in a Pinus taeda population comprising 951 individuals genotyped with 4853 SNPs. They concluded that for traits controlled by few genes, fusiform rust for example, models such as Bayes A and Bayes Cπ had higher predictive ability in comparison to RR-BLUP. On the other hand, BMI is considered a complex trait controlled by a large number of genes [25]. Thus, it is expected that RR-BLUP would have a good predictive performance, because this model considers homogeneous shrinkage of marker effects.

This hypothesis is supported when we consider that the genetic architecture of the trait can be described by the infinitesimal model. However, we should be cautious with these arguments: Gianola showed heuristically that Bayesian Ridge Regression or RR-BLUP does not shrink the marker effects all in the same manner, since the best linear unbiased predictor is sample size and allele frequency dependent [17,18]. Here we would like to speculate on another hypothesis for the good predictive performance of Bayesian Ridge Regression in the real mice dataset. In the real dataset there are many sources of genetic variation besides the QTLs, such as genetic background, linkage disequilibrium and epistatic effects. Consequently, the linear model declared in all the Bayesian models is not true. Hence, the idea of selecting the markers that contribute to the phenotypic variation does not work well, and in this case the predictions provided by RR-BLUP or Bayesian Ridge Regression would be a better approximation.

Table 3. Gaussian mixture model selection using the Bayesian Information Criterion

N. of mixture components | BIC
1 | 16888.60
2 | 18724.63
3 | 19643.02
4 | 19906.28
5 | 20030.32
6 | 20199.56
7 | 20253.73
8 | 20288.74


Fig 4. Evaluation of the predictive performance of the Bayes H model using 5-fold cross validation. Box plots of the distribution of Pearson's correlation between observed and predicted values for the mice dataset.

References

1. de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL. Whole-Genome Regression and Prediction Methods Applied to Plant and Animal Breeding. Genetics. 2013;193(1):327–345. doi:10.1534/genetics.112.143313.

2. Hoerl AE, Kennard RW. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12(1):55–67. doi:10.1080/00401706.1970.10488634.

3. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B. 1996;58(1):267–288. doi:10.2307/2346178.

4. Frank IE, Friedman JH. A Statistical View of Some Chemometrics Regression Tools. Technometrics. 1993;35(2):109–135. doi:10.2307/1269656.

5. Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association. 2001;96(456):1348–1360. doi:10.1198/016214501753382273.

6. Fan J, Lv J. A Selective Overview of Variable Selection in High Dimensional Feature Space. Statistica Sinica. 2010;20(1):101–148.

7. George EI, McCulloch RE. Variable Selection via Gibbs Sampling. Journal of the American Statistical Association. 1993;88(423):881–889. doi:10.1080/01621459.1993.10476353.

8. Park T, Casella G. The Bayesian Lasso. Journal of the American Statistical Association. 2008;103(482):681–686. doi:10.1198/016214508000000337.

9. Heffner EL, Sorrells ME, Jannink JL. Genomic selection for crop improvement. Crop Science. 2009;49(1):1–12. doi:10.2135/cropsci2008.08.0512.

10. Meuwissen THE, Hayes BJ, Goddard ME. Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics. 2001;157(4):1819–1829.

11. Bernardo R. Breeding for Quantitative Traits in Plants. 2nd ed. Woodbury: Stemma Press; 2010.

12. Lateef DD. DNA Marker Technologies in Plants and Applications for Crop Improvements. Journal of Biosciences and Medicines. 2015;3(5):7–18. doi:10.4236/jbm.2015.35002.

13. Hallauer AR, Carena MJ, Miranda Filho JB. Quantitative Genetics in Maize Breeding (Handbook of Plant Breeding, Vol. 6). New York: Springer; 2010.

14. Piepho HP. Ridge regression and extensions for genomewide selection in maize. Crop Science. 2009;49(4):1165–1179. doi:10.2135/cropsci2008.10.0595.

15. Habier D, Fernando RL, Kizilkaya K, Garrick DJ. Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics. 2011;12:186. doi:10.1186/1471-2105-12-186.

16. Fisher RA. The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh. 1918;52:399–433.

17. Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R. Additive Genetic Variability and the Bayesian Alphabet. Genetics. 2009;183(1):347–363. doi:10.1534/genetics.109.103952.

18. Gianola D. Priors in whole-genome regression: The Bayesian alphabet returns. Genetics. 2013;194(3):573–596. doi:10.1534/genetics.113.151753.

19. Perez P, de los Campos G. Genome-wide regression and prediction with the BGLR statistical package. Genetics. 2014;198(2):483–495. doi:10.1534/genetics.114.164442.

20. Gilks WR, Richardson S, Spiegelhalter DJ. Markov Chain Monte Carlo in Practice. London: Chapman & Hall; 1996.

21. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd ed. London: Chapman & Hall; 2013.

22. de los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, et al. Predicting Quantitative Traits With Regression Models for Dense Molecular Markers and Pedigree. Genetics. 2009;182(1):375–385. doi:10.1534/genetics.109.101501.

23. R Core Team. R: A Language and Environment for Statistical Computing; 2014. Available from: http://www.R-project.org/.

24. Box GEP, Tiao GC. Bayesian Inference in Statistical Analysis. New York: John Wiley and Sons, Inc.; 1992.

25. Legarra A, Robert-Granié C, Manfredi E, Elsen JM. Performance of Genomic Selection in Mice. Genetics. 2008;180(1):611–618. doi:10.1534/genetics.108.088575.

26. Verbyla KL, et al. Accuracy of genomic selection using stochastic search variable selection in Australian Holstein Friesian dairy cattle. Genetics Research. 2009;91(5):307–311. doi:10.1017/S0016672309990243.

27. Resende MFR Jr, et al. Accuracy of Genomic Selection Methods in a Standard Data Set of Loblolly Pine (Pinus taeda L.). Genetics. 2012;190(4):1503–1510. doi:10.1534/genetics.111.137026.

28. Valdar W, Solberg LC, Gauguier D, Cookson WO, Rawlins JNP, Mott R, et al. Genetic and Environmental Effects on Complex Traits in Mice. Genetics. 2006;174(2):959–984. doi:10.1534/genetics.106.060004.

29. Fraley C, Raftery AE, Murphy TB, Scrucca L. mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation; 2012.

30. Fraley C, Raftery AE. Model-based Clustering, Discriminant Analysis and Density Estimation. Journal of the American Statistical Association. 2002;97(458):611–631.