
Sparse Bayesian Nonparametric Regression

François Caron [email protected]   Arnaud Doucet [email protected]
Departments of Computer Science and Statistics, University of British Columbia, Vancouver, Canada

Abstract

One of the most common problems in machine learning and statistics consists of estimating the response Xβ from a vector of observations y assuming y = Xβ + ε, where X is known, β is a vector of parameters of interest and ε a vector of stochastic errors. We are particularly interested here in the case where the dimension K of β is much higher than the dimension of y. We propose some flexible Bayesian models which can yield sparse estimates of β. We show that as K → ∞ these models are closely related to a class of Lévy processes. Simulations demonstrate that our models outperform significantly a range of popular alternatives.

(Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).)

1. Introduction

Consider the following model

y = Xβ + ε   (1)

where y ∈ R^L is the observation, β = (β_1, ..., β_K) ∈ R^K is the vector of unknown parameters and X is a known L × K matrix. We will assume that ε follows a zero-mean normal distribution, ε ∼ N(0, σ² I_L), where I_L is the identity matrix of dimension L.

We do not impose here any restriction on L and K but we are particularly interested in the case where K >> L. This scenario is very common in many application domains. In such cases, we are interested in obtaining a sparse estimate of β; that is, an estimate β̂ = (β̂_1, ..., β̂_K) such that only a subset of the components β̂_k differ from zero. This might be for the sake of variable selection (Tibshirani, 1996; Figueiredo, 2003; Griffin & Brown, 2007) or to decompose a signal over an overcomplete basis (Lewicki & Sejnowski, 2000; Chen et al., 2001).

Numerous models and algorithms have been proposed in the machine learning and statistics literature to address this problem, including Bayesian stochastic search methods based on the 'spike and slab' prior (West, 2003), the Lasso (Tibshirani, 1996), projection pursuit or the Relevance Vector Machine (RVM) (Tipping, 2001). We follow here a Bayesian approach where we set a prior distribution on β, and we will primarily focus on the case where β̂ is the resulting Maximum a Posteriori (MAP) estimate or, equivalently, the Penalized Maximum Likelihood (PML) estimate. Such MAP/PML approaches have been discussed many times in the literature and include the Lasso (the corresponding prior being the Laplace distribution) (Tibshirani, 1996; Lewicki & Sejnowski, 2000; Girolami, 2001), the normal-Jeffreys (NJ) prior (Figueiredo, 2003) or the normal-exponential-gamma prior (Griffin & Brown, 2007). Asymptotic theoretical properties of such PML estimates are discussed in (Fan & Li, 2001).

We propose here a class of prior distributions based on scale mixtures of Gaussians for β. For a finite K, our prior models correspond to normal-gamma (NG) and normal-inverse Gaussian (NIG) models. This class of models includes as limiting cases both the popular Laplace and normal-Jeffreys priors but is more flexible. As K → ∞, we show that the proposed priors are closely related to the gamma and normal-inverse Gaussian processes, which are Lévy processes (Applebaum, 2004). In this respect, our models are somehow complementary to two recently proposed Bayesian nonparametric models: the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). Under given conditions, the normal-gamma prior yields sparse MAP estimates β̂. The log-posterior distributions associated to these prior distributions are not concave, but we propose an Expectation-Maximization (EM) algorithm to find modes of the posteriors and a Markov Chain Monte Carlo (MCMC) algorithm to sample from them. We demonstrate through simulations that these Bayesian models outperform significantly a range of established procedures on a variety of applications.

The rest of the paper is organized as follows. In Section 2, we propose the NG and NIG models for β. We establish some properties of these models for K finite and in the asymptotic case where K → ∞. We also relate our models to the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). In Section 3, we establish conditions under which the MAP/PML estimate β̂ can enjoy sparsity properties. Section 4 presents an EM algorithm to find modes of the posterior distributions and a Gibbs sampling algorithm to sample from them. We demonstrate the performance of our models and algorithms in Section 5. Finally, we discuss some extensions in Section 6.

2. Sparse Bayesian Nonparametric Models

We will consider models where the components β_k are independent and identically distributed

p(β) = ∏_{k=1}^K p(β_k)

and p(β_k) is a scale mixture of Gaussians; that is

p(β_k) = ∫ N(β_k; 0, σ_k²) p(σ_k²) dσ_k²   (2)

where N(x; µ, σ²) denotes the Gaussian distribution of argument x, mean µ and variance σ². We propose two conjugate distributions for σ_k²; namely the gamma and the inverse Gaussian distributions. The resulting marginal distribution for β_k belongs in both cases to the class of generalized hyperbolic distributions. In the models presented here, the unknown scale parameters are random and integrated out, so that the marginal priors on the regression coefficients are not Gaussian. This differs from the RVM (Tipping, 2001), where these parameters are unknown and estimated through maximum likelihood.

2.1. Normal-Gamma Model

2.1.1. Definition

Consider the following gamma prior distribution

σ_k² ∼ G(α/K, γ²/2)

whose probability density function (pdf) G(σ_k²; α/K, γ²/2) is given by

((γ²/2)^{α/K} / Γ(α/K)) (σ_k²)^{α/K − 1} exp(−(γ²/2) σ_k²).

Following Eq. (2), the marginal pdf of β_k is given for β_k ≠ 0 by

p(β_k) = (γ^{α/K + 1/2} / (√π 2^{α/K − 1/2} Γ(α/K))) |β_k|^{α/K − 1/2} K_{α/K − 1/2}(γ|β_k|)   (3)

where K_ν(·) is the modified Bessel function of the second kind. We have

lim_{β_k → 0} p(β_k) = γ Γ(α/K − 1/2) / (2√π Γ(α/K)) if α/K > 1/2, and ∞ otherwise,

and the tails of this distribution decrease in |β_k|^{α/K − 1} exp(−γ|β_k|); see Figure 1(a). The parameters α and γ respectively control the shape and the scale of the distribution. When α → 0, there is a high discrepancy between the values of σ_k², while when α → ∞, most of the values are equal.
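As a numerical sanity check of Eq. (3) (a small sketch added here, not part of the original paper), the marginal density can be evaluated with SciPy's modified Bessel function and compared against Monte Carlo draws from the scale-mixture representation (2), with σ_k² drawn from a gamma distribution of shape α/K and rate γ²/2:

```python
import numpy as np
from scipy.special import kv, gammaln

def ng_marginal_pdf(beta, a, gam):
    """Normal-gamma marginal of Eq. (3); a = alpha/K, gam = gamma."""
    b = np.abs(beta)
    log_pdf = ((a + 0.5) * np.log(gam)
               - 0.5 * np.log(np.pi) - (a - 0.5) * np.log(2.0) - gammaln(a)
               + (a - 0.5) * np.log(b)
               + np.log(kv(a - 0.5, gam * b)))
    return np.exp(log_pdf)

rng = np.random.default_rng(0)
a, gam = 0.75, 1.0
# Scale mixture of Eq. (2): sigma2 ~ Gamma(shape=a, rate=gam^2/2).
sigma2 = rng.gamma(shape=a, scale=2.0 / gam**2, size=200_000)
beta = rng.normal(0.0, np.sqrt(sigma2))

grid, h = np.array([0.25, 0.5, 1.0, 2.0]), 0.05
empirical = np.array([np.mean(np.abs(beta - x) < h) / (2 * h) for x in grid])
print(np.round(empirical, 3))                      # histogram-type estimate
print(np.round(ng_marginal_pdf(grid, a, gam), 3))  # Eq. (3), should be close
```

For α/K = 1 the same function reduces to the Laplace density (γ/2) exp(−γ|β_k|), which is one way to see the connection to the Lasso discussed below.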

Figure 1. Probability density functions of the NG (a) and NIG (b) priors for different values of the parameters.

This class of priors includes many standard priors. Indeed, Eq. (3) reduces to the Laplace prior when α/K = 1 and we obtain the NJ prior when α/K → 0 and γ → 0. In Figure 2 some realizations of the process are given for different values α = 1, 5, 100 and γ²/2 = α.

2.1.2. Properties

It follows from Eq. (3) that

E[|β_k|] = √(4/(πγ²)) Γ(α/K + 1/2) / Γ(α/K),   E[β_k²] = 2α/(γ²K)

and we obtain

lim_{K→∞} E[∑_{k=1}^K |β_k|] = 2α/γ,   E[∑_{k=1}^K β_k²] = 2α/γ².

Hence the expected sums remain bounded whatever the value of K.
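These closed-form moments and the bounded expected sums can be confirmed by simulation; the short sketch below (not from the paper, with arbitrary parameter values) compares Monte Carlo estimates to the expressions above:

```python
import numpy as np
from scipy.special import gamma as gamma_fn

rng = np.random.default_rng(1)
alpha, gam, K = 2.0, 1.5, 500
a = alpha / K  # shape of the gamma mixing distribution

# Replicates of a single coefficient beta_k under the NG prior.
sigma2 = rng.gamma(shape=a, scale=2.0 / gam**2, size=500_000)
beta = rng.normal(0.0, np.sqrt(sigma2))

# Closed-form moments of Section 2.1.2.
e_abs = np.sqrt(4.0 / (np.pi * gam**2)) * gamma_fn(a + 0.5) / gamma_fn(a)
e_sq = 2.0 * alpha / (gam**2 * K)
print(np.abs(beta).mean(), e_abs)      # should agree up to Monte Carlo error
print((beta**2).mean(), e_sq)

# Expected sums over the K coefficients stay bounded as K grows.
print(K * e_abs, 2.0 * alpha / gam)    # -> 2*alpha/gamma as K -> infinity
print(K * e_sq, 2.0 * alpha / gam**2)  # exactly 2*alpha/gamma^2 for any K
```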

Figure 2. Realizations of {σ_k²}_{k=1,...,K} (top) and {β_k}_{k=1,...,K} (bottom) from the NG model for α = 1, 5, 100.

Using properties of the gamma distribution, it is possible to relate β to a Lévy process known as the variance gamma process as K → ∞. First consider a finite K. Let σ²_(1) ≥ σ²_(2) ≥ ... ≥ σ²_(K) be the order statistics of the sequence σ_1², σ_2², ..., σ_K² and let π_1, ..., π_K be random variables verifying the following (finite) stick-breaking construction

π_k = ζ_k ∏_{j=1}^{k−1} (1 − ζ_j)   with   ζ_j ∼ B(1 + α/K, α − jα/K)   (4)

where B is the Beta distribution. Finally, if g ∼ G(α, γ²/2) then we can check that the order statistics (σ²_(k)) follow the same distribution as the order statistics of (g π_k). The characteristic function of β_k is given by

Φ_{β_k}(u) = 1 / ((1 − iu/γ)^{α/K} (1 + iu/γ)^{α/K})

and therefore

β_k =_d w_1 − w_2   where   w_1 ∼ G(α/K, γ) and w_2 ∼ G(α/K, γ).

It follows that β_k can be written as the difference of two variables following a gamma distribution.

As K → ∞, the order statistics (σ²_(k)) are the ranked points (jumps) of a gamma process with shape parameter α and scale parameter γ²/2; see (Tsilevich et al., 2000) for details. In particular, σ̄² = (σ²_(1)/∑_k σ²_(k), σ²_(2)/∑_k σ²_(k), ...) and ∑_k σ²_(k) are independent and respectively distributed according to PD(α) and G(α, γ²/2), where PD(α) is the Poisson-Dirichlet distribution of parameter α. It is well known that this distribution can be recovered by the following (infinite) stick-breaking construction (Tsilevich et al., 2000): if we set

π_k = ζ_k ∏_{j=1}^{k−1} (1 − ζ_j)   with   ζ_j ∼ B(1, α)   (5)

for any k, then the order statistics (π_(k)) are distributed from the Poisson-Dirichlet distribution. The coefficients (β_k) are thus nothing but the weights (jumps) of the so-called variance gamma process, which is a Brownian motion evaluated at times given by a gamma process (Applebaum, 2004; Madan & Seneta, 1990).
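The distributional identity around Eq. (4) can be checked by simulation. The sketch below (not the authors' code) reads the Beta parameters in (4) as (1 + α/K, α − jα/K), so that the π_k form a size-biased Dirichlet(α/K, ..., α/K) vector, and compares the ranked values of g π_k with ranked i.i.d. Gamma(α/K, γ²/2) draws:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, gam, K, n_rep = 3.0, 1.0, 50, 20_000

# Direct construction: K i.i.d. Gamma(alpha/K, rate gam^2/2) scales, ranked.
direct = np.sort(rng.gamma(alpha / K, 2.0 / gam**2, size=(n_rep, K)), axis=1)[:, ::-1]

# Finite stick-breaking of Eq. (4); the last break uses up the remaining stick.
zeta = np.ones((n_rep, K))
for j in range(1, K):
    zeta[:, j - 1] = rng.beta(1.0 + alpha / K, alpha - j * alpha / K, size=n_rep)
rem = np.cumprod(1.0 - zeta, axis=1)
prev = np.hstack([np.ones((n_rep, 1)), rem[:, :-1]])
sticks = zeta * prev                                 # pi_k, summing to one

g = rng.gamma(alpha, 2.0 / gam**2, size=(n_rep, 1))  # g ~ G(alpha, gamma^2/2)
sb = np.sort(g * sticks, axis=1)[:, ::-1]

# The ranked values should match in distribution; compare a few mean ranked scales.
print(np.round(direct.mean(axis=0)[:5], 4))
print(np.round(sb.mean(axis=0)[:5], 4))
```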

2.2. Normal-Inverse Gaussian Model

2.2.1. Definition

Consider the following inverse Gaussian prior distribution

σ_k² ∼ IG(α/K, γ)   (6)

whose pdf IG(σ_k²; α/K, γ) is given by (Barndorff-Nielsen, 1997)

(1/√(2π)) (α/K) exp(γα/K) (σ_k²)^{−3/2} exp(−(1/2)(α²/(K²σ_k²) + γ²σ_k²)).   (7)

Following Eq. (2), the marginal pdf of β_k is given by

(αγ/(πK)) exp(αγ/K) (α²/K² + β_k²)^{−1/2} K_1(γ √(α²/K² + β_k²))   (8)

and the tails of this distribution decrease in |β_k|^{−3/2} exp(−γ|β_k|). It is displayed in Figure 1(b). The parameters α and γ respectively control the shape and the scale of the distribution. When α → 0, there is a high discrepancy between the values of σ_k², while when α → ∞, most of the values are equal. Some realizations of the model, for different values of α, are represented in Figure 3.

2.2.2. Properties

The moments are given by

E[|β_k|] = (2α/(Kπ)) exp(γα/K) K_0(γα/K),   E[β_k²] = α/(Kγ).

Therefore, as K → ∞, the mean of the sum of the absolute values is infinite while the expected sum of the squares is α/γ. We can also establish in this case that the coefficients (β_k) tend to the weights (jumps) of a normal-inverse Gaussian process (Barndorff-Nielsen, 1997).
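Eq. (8) can be checked in the same way as the NG marginal; in the usual (mean, shape) parameterization, the inverse Gaussian prior (6)-(7) has mean (α/K)/γ and shape (α/K)². The following sketch (not from the paper) compares the closed-form density with draws from the scale mixture:

```python
import numpy as np
from scipy.special import kv

def nig_marginal_pdf(beta, delta, gam):
    """Symmetric NIG marginal of Eq. (8); delta = alpha/K."""
    s = np.sqrt(delta**2 + beta**2)
    return (delta * gam / np.pi) * np.exp(delta * gam) * kv(1.0, gam * s) / s

rng = np.random.default_rng(3)
delta, gam = 0.5, 2.0
# IG(delta, gam) of Eq. (7) as a Wald distribution: mean delta/gam, shape delta^2.
sigma2 = rng.wald(mean=delta / gam, scale=delta**2, size=200_000)
beta = rng.normal(0.0, np.sqrt(sigma2))

grid, h = np.array([0.0, 0.5, 1.0, 2.0]), 0.05
empirical = np.array([np.mean(np.abs(beta - x) < h) / (2 * h) for x in grid])
print(np.round(empirical, 3))
print(np.round(nig_marginal_pdf(grid, delta, gam), 3))
print((beta**2).mean(), delta / gam)   # E[beta_k^2] = alpha/(K*gamma)
```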

Figure 3. Realizations of (σ_k²)_{k=1,...,K} (top) and (β_k)_{k=1,...,K} (bottom) from the NIG model for K = 100, α = 1, 10, 100 and γ = α.

Figure 4. Realizations of (σ_k²)_{k=1,...,K} (top) and (β_k^n)_{n=1,...,N, k=1,...,K} (bottom) from the normal-gamma model of Section 2.3 for K = 100, N = 20, α = 1, 10, 100 and γ²/2 = α. The lighter the colour, the larger |β_k^n|.

2.3. Extension

Consider now the case where we have N vectors of observations {y_n}_{n=1}^N, where y_n ∈ R^L. We would like to model the fact that, for a given k, the random variables {β_k^n}_{n=1}^N are statistically dependent and exchangeable. We consider the following hierarchical model

σ_k² ∼ G(α/K, γ²/2)   or   σ_k² ∼ IG(α/K, γ)

for k = 1, ..., K and

β_k^n ∼ N(0, σ_k²)

for n = 1, ..., N. Some realizations of the process for different values α = 1, 5, 100 are represented in Figure 4.

In this respect, this work is complementary to two recently proposed Bayesian nonparametric models: the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). In these two contributions, prior distributions over infinite matrices with integer-valued entries are defined. These models are constructed as the limits of finite-dimensional models based respectively on the beta-binomial and gamma-Poisson models. They enjoy the following property: while the number of non-zero entries of an (infinite) row is potentially infinite, the expected number of these entries is finite. These models are also closely related to the beta and gamma processes, which are Lévy processes (Applebaum, 2004; Teh et al., 2007; Thibaux & Jordan, 2007). Our models could be interpreted as prior distributions over infinite matrices with real-valued entries. In our case, the number of non-zero entries of an (infinite) row is always infinite but we can have

lim_{K→∞} E[∑_{k=1}^K |β_k^n|^ρ] < ∞   (9)

for ρ = 1 or ρ = 2. Moreover, for some values of α and γ we can also ensure that for any x > 0

lim_{K→∞} Pr(∃k : |β_k^n| > x) > 0;   (10)

that is, there is still a non-vanishing probability of having coefficients with large values as K → ∞ despite Eq. (9).

The joint distribution is given by p(β_{1:K}^{1:N}) = ∏_{k=1}^K p(β_k^{1:N}), where for the NG model

p(β_k^{1:N}) ∝ u_k^{α/K − N/2} K_{α/K − N/2}(γ u_k)

and for the NIG model

p(β_k^{1:N}) ∝ q_k^{−(N+1)/2} K_{(N+1)/2}(γ q_k)

where

u_k = √(∑_{n=1}^N (β_k^n)²),   q_k = √(α²/K² + u_k²).   (11)
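A sketch of drawing one realization of this hierarchical model (the NG version) and computing the per-feature statistics u_k and q_k of Eq. (11); the hyperparameter values below are illustrative only and are not prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, alpha = 20, 100, 5.0
gam = np.sqrt(2.0 * alpha)          # gamma^2/2 = alpha, as in Figure 4

# One scale per feature k, shared by the N regressions (NG version).
sigma2 = rng.gamma(alpha / K, 2.0 / gam**2, size=K)
beta = rng.normal(0.0, np.sqrt(sigma2), size=(N, K))   # beta[n, k] = beta_k^n

# Per-feature statistics entering the joint marginals and Eq. (11).
u = np.sqrt((beta**2).sum(axis=0))        # u_k
q = np.sqrt((alpha / K)**2 + u**2)        # q_k (used by the NIG model)
print(np.round(u[:5], 3), np.round(q[:5], 3))
```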

3. Sparsity Properties

Further on, we will also use the following notation for any u:

pen(u) ≡ −log p(u)

where '≡' denotes equality up to an additive constant independent of u. When computing the MAP/PML estimate for N observations, we select

β̂^{1:N} = arg min_{β^{1:N}} ∑_{n=1}^N ||y_n − Xβ^n||² / (2σ²) + ∑_{k=1}^K pen(β_k^{1:N}).   (12)

We give in Table 1 the penalizations pen(β_k^{1:N}) and their derivatives for different prior distributions, as functions of u_k and q_k defined in Eq. (11).

Table 1. Penalizations and their derivatives for different prior distributions.

  Lasso (N = 1):  pen(β_k^{1:N}) = γ|β_k|
                  pen'(β_k^{1:N}) = γ
  NJ:             pen(β_k^{1:N}) = N log u_k
                  pen'(β_k^{1:N}) = N/u_k
  NG:             pen(β_k^{1:N}) = (N/2 − α/K) log u_k − log K_{α/K − N/2}(γ u_k)
                  pen'(β_k^{1:N}) = γ K_{α/K − N/2 − 1}(γ u_k) / K_{α/K − N/2}(γ u_k)
  NIG:            pen(β_k^{1:N}) = ((N+1)/2) log q_k − log K_{(N+1)/2}(γ q_k)
                  pen'(β_k^{1:N}) = (N+1) u_k / q_k² + γ u_k K_{(N−1)/2}(γ q_k) / (q_k K_{(N+1)/2}(γ q_k))

When α/K = 1, the NG prior is equal to the Laplace prior, so its penalization reduces to the ℓ_1 penalization used in the Lasso and basis pursuit (Tibshirani, 1996; Chen et al., 2001). When α/K → 0 and γ → 0, the prior is the NJ prior and the penalization reduces to log(|β_k|), which has been used in (Figueiredo, 2003). We display in Figure 5 the contours of constant value for various prior distributions when N = 1 and K = 2.

Figure 5. Contours of constant value of pen(β_1) + pen(β_2) for different prior distributions: (a) Laplace, (b) normal-Jeffreys, (c) normal-gamma, (d) normal-inverse Gaussian.

For α/K < 1/2, the MAP estimate (12) does not exist as the pdf (3) is unbounded. For other values of the parameters, a spike at zero can dominate, whereas we are interested in the data-driven turning point/local minimum (Griffin & Brown, 2007).

Consider now the case where the matrix X is orthogonal, σ = 1 and N = 1. The turning point and/or MAP/PML estimate is obtained by minimizing Eq. (12), which is equivalent to minimizing componentwise

(1/2)(z_k − β_k)² + pen(β_k)   (13)

where z = Xᵀy. The first derivative of (13) is sign(β_k)(|β_k| + pen'(|β_k|)) − z_k. As stated in (Fan & Li, 2001, p. 1350), a sufficient condition for the estimate to be a thresholding rule is that the minimum of the function |β_k| + pen'(|β_k|) is strictly positive. Plots of the function |β_k| + pen'(|β_k|) are given in Figure 6 and the resulting thresholds, corresponding to the argument minimizing (13), are presented in Figure 7. It follows that the normal-gamma prior is a thresholding rule for α/K ≤ 1 and yields sparse estimates. The normal-inverse Gaussian prior is not a thresholding rule, as the derivative of the penalization is 0 when β_k = 0 whatever the values of the parameters. However, from Figure 7(d), it is clear that it can yield "almost sparse" estimates; that is, most components are such that |β̂_k| ≃ 0.

Figure 6. Plots of |β_k| + pen'(|β_k|) for (a) the normal-gamma and (b) the normal-inverse Gaussian priors, for different values of α/K.

Figure 7. Thresholds for the different prior distributions: (a) Laplace, (b) normal-Jeffreys, (c) normal-gamma, (d) normal-inverse Gaussian.

4. Algorithms

4.1. EM

The log-posterior in Eq. (12) is not concave but we can use the EM algorithm to find modes of it. The EM algorithm relies on the introduction of the missing data σ_{1:K} = (σ_1, ..., σ_K). Conditional upon these missing data, the regression model is linear Gaussian and all the EM quantities can be easily computed in closed form; see for example (Figueiredo, 2003; Griffin & Brown, 2007). We have at iteration i + 1 of the EM

β̂^{1:N}_{(i+1)} = arg max_{β^{1:N}} Q(β^{1:N}; β̂^{1:N}_{(i)})

where Q(β^{1:N}; β̂^{1:N}_{(i)}) is given by

∫ log p(β^{1:N} | y_{1:N}, σ_{1:K}) p(σ_{1:K} | β̂^{1:N}_{(i)}, y_{1:N}) dσ_{1:K}.

After a few calculations, we obtain

β̂^n_{(i+1)} = (σ² V_{(i)} + XᵀX)^{−1} Xᵀ y_n

with V_{(i)} = diag(m_{1,(i)}, ..., m_{K,(i)}) and m_{k,(i)} = (û_{k,(i)})^{−1} pen'(û_{k,(i)}), where û_{k,(i)} = √(∑_{n=1}^N (β̂^n_{k,(i)})²) and pen'(û_{k,(i)}) = ∂pen(u_k)/∂u_k evaluated at û_{k,(i)} (see Table 1).
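The M-step above is a ridge regression with coefficient-specific weights m_k = pen'(û_k)/û_k. The following minimal sketch is my own illustration of Section 4.1 for the NG penalization with N = 1; it uses the exponentially scaled Bessel function kve for numerical stability, and the ridge initialization, floor eps and hyperparameters are arbitrary choices, not the authors':

```python
import numpy as np
from scipy.special import kve

def ng_pen_prime(u, a, gam, N=1):
    """Derivative of the NG penalization w.r.t. u_k (Table 1), a = alpha/K."""
    nu = a - N / 2.0
    return gam * kve(nu - 1.0, gam * u) / kve(nu, gam * u)

def em_map_ng(X, y, a, gam, sigma=1.0, n_iter=100, eps=1e-6):
    """EM sketch for the MAP estimate (12) with N = 1 and the NG prior."""
    K = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX + np.eye(K), Xty)    # rough ridge start
    for _ in range(n_iter):
        u = np.maximum(np.abs(beta), eps)           # u_k = |beta_k| when N = 1
        m = ng_pen_prime(u, a, gam) / u             # m_k = pen'(u_k)/u_k
        beta = np.linalg.solve(sigma**2 * np.diag(m) + XtX, Xty)
    return beta

# Tiny usage example on synthetic data.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 100))
beta_true = np.zeros(100)
beta_true[:3] = [3.0, 1.5, 2.0]
y = X @ beta_true + rng.normal(size=50)
print(np.round(em_map_ng(X, y, a=0.1, gam=1.0)[:6], 2))
```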
4.2. MCMC

We can also easily sample from the posterior distribution p(β^{1:N} | y_{1:N}) by sampling from p(β^{1:N}, σ²_{1:K} | y_{1:N}) using the Gibbs sampler. Indeed, the full conditional distributions p(β^{1:N} | σ²_{1:K}, y_{1:N}) and p(σ²_{1:K} | β^{1:N}, y_{1:N}) are available in closed form. The distribution p(β^{1:N} | σ²_{1:K}, y_{1:N}) is a multivariate normal, whereas we have p(σ²_{1:K} | β^{1:N}, y_{1:N}) = ∏_{k=1}^K p(σ_k² | β_k^{1:N}). For the NG prior, we obtain

p(σ_k² | β_k^{1:N}) = (σ_k²)^{α/K − N/2 − 1} exp(−(1/2)(u_k²/σ_k² + γ²σ_k²)) / (2 (u_k/γ)^{α/K − N/2} K_{α/K − N/2}(γ u_k))

which is a generalized inverse Gaussian distribution from which we can sample exactly. For the NIG distribution, we also obtain a generalized inverse Gaussian distribution.
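The two conditionals translate into a short Gibbs loop. The sketch below (again for N = 1, with illustrative settings, and not the authors' implementation) uses SciPy's geninvgauss, whose (p, b, scale) parameterization matches the generalized inverse Gaussian above with p = α/K − N/2, b = γ u_k and scale u_k/γ:

```python
import numpy as np
from scipy.stats import geninvgauss

def gibbs_ng(X, y, a, gam, sigma=1.0, n_iter=1000, seed=0):
    """Gibbs sketch for the NG model with N = 1 (Section 4.2)."""
    rng = np.random.default_rng(seed)
    K = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    sigma2_k = np.ones(K)
    draws = np.empty((n_iter, K))
    for it in range(n_iter):
        # beta | sigma2_{1:K}, y : multivariate normal.
        prec = XtX / sigma**2 + np.diag(1.0 / sigma2_k)
        cov = np.linalg.inv(prec)
        beta = rng.multivariate_normal(cov @ Xty / sigma**2, cov)
        # sigma2_k | beta_k : generalized inverse Gaussian, independently over k.
        u = np.maximum(np.abs(beta), 1e-8)     # u_k = |beta_k| when N = 1
        p = a - 0.5                            # alpha/K - N/2
        sigma2_k = geninvgauss.rvs(p, gam * u, scale=u / gam, random_state=rng)
        draws[it] = beta
    return draws

# e.g. draws = gibbs_ng(X, y, a=0.1, gam=1.0) with X, y as in the EM sketch above.
```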

5. Applications

5.1. Simulated Data

In the following, we provide numerical comparisons between the Laplace (that is, Lasso), RVM, NJ, NG and NIG models. We simulate 100 datasets from (1) with L = 50 and σ = 1. The correlation between X_{k,i} and X_{k,j} is ρ^{|i−j|} with ρ = 0.5. We set β = (3, 1.5, 0, 0, 2, 0, 0, ...)ᵀ ∈ R^K, where the remaining components of the vector are set to zero. We consider the cases where K = 20, 60, 100, 200. Parameters of the Lasso, NG and NIG are estimated by 5-fold cross-validation, as described in (Tibshirani, 1996). The Lasso estimate is obtained with the Matlab implementation of the interior point method downloadable at http://www.stanford.edu/~boyd/l1_ls/. For the other priors, the estimate is obtained via 100 iterations of the EM algorithm. Box plots of the mean square error (MSE) are reported in Figure 8. These plots show that the estimators based on the NG and NIG priors outperform the classical models in this case. In Figure 9 are represented the box plots of the number of estimated coefficients whose absolute value is below a threshold T, for T = 10^{−10} (the precision tuned for the Lasso estimate) and T = 10^{−3}, for K = 200. The true number of zeros in that case is 197. The NG outperforms the other models in identifying the zeros of the model. On the contrary, as the NIG estimate is not a thresholding rule, the number of coefficients whose absolute value is below 10^{−10} for this model is zero. However, most of the coefficients have a very low absolute value, as the median of the number of coefficients with absolute value below 10^{−3} is equal to the true value 197 (see Figure 9(b)). Moreover, the estimator obtained by thresholding to zero the coefficients whose absolute value is below 10^{−3} yields very minor differences in terms of MSE.
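For concreteness, here is a sketch of the data-generating process of this section as I read it; the row-wise i.i.d. Gaussian assumption for X (with covariance Σ_{ij} = ρ^{|i−j|}) and the seed are my own choices, not stated in the paper:

```python
import numpy as np

def simulate_dataset(K=200, L=50, rho=0.5, sigma=1.0, seed=0):
    """One simulated dataset in the spirit of Section 5.1."""
    rng = np.random.default_rng(seed)
    idx = np.arange(K)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # corr = rho^|i-j|
    X = rng.multivariate_normal(np.zeros(K), Sigma, size=L)
    beta = np.zeros(K)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]                    # beta = (3, 1.5, 0, 0, 2, 0, ...)
    y = X @ beta + sigma * rng.normal(size=L)
    return X, y, beta

X, y, beta_true = simulate_dataset()
# beta_hat = em_map_ng(X, y, a=0.1, gam=1.0)   # e.g. with the EM sketch of Section 4.1
# print(np.mean((beta_hat - beta_true)**2), np.sum(np.abs(beta_hat) < 1e-3))
```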
5.2. Biscuit NIR Dataset

We consider the biscuits data, which have been studied in (Griffin & Brown, 2007; West, 2003). The matrix X is composed of 300 (centered) NIR reflectance measurements from 70 biscuit dough pieces. The observations y are the percentages of fat, sucrose, flour and water associated with each piece. The objective here is to predict the level of each of the ingredients from the NIR reflectance measurements. The data are divided into a training dataset (39 measurements) and a test dataset (31 measurements). The fitted coefficients for fat and flour, using 5-fold cross-validation, are represented in Figure 10. The estimated spikes are consistent with the results obtained in (West, 2003; Griffin & Brown, 2007). In particular, both models detect a spike at 1726nm, which lies in a region known for fat absorbance. The predicted observations versus the true observations are given in Figure 11 for the training and test datasets. The test data are well fitted by the estimated coefficients. MSE errors for the test dataset are reported in Table 2. The proposed models show better performance for flour and similar performance for fat.

Figure 8. Box plots of the MSE associated to the simulated data for (a) K = 20, (b) K = 60, (c) K = 100 and (d) K = 200.

Figure 9. Box plots of the number of estimated coefficients whose absolute value is below a threshold T, for (a) T = 10^{−10} and (b) T = 10^{−3}. The dashed line represents the true number of zero coefficients (197).

Figure 10. Coefficients estimated with a normal-gamma (left) and normal-inverse Gaussian (right) prior for the fat (top) and flour (bottom) ingredients, as a function of wavelength (nm).

Table 2. MSE for the biscuits NIR data.

         Flour   Fat
  NJ      9.93   0.56
  RVM     6.48   0.56
  NG      3.44   0.55
  NIG     1.94   0.49

6. Discussion

We have presented some flexible priors for linear regression based on the NG and NIG models. The NG prior yields sparse local maxima of the posterior distribution, whereas the NIG prior yields "almost sparse" estimates; that is, most of the coefficients are extremely close to zero. We have shown that asymptotically these models are closely related to the variance gamma process and the normal-inverse Gaussian process. Contrary to the NJ model or the RVM, these models require specifying two hyperparameters. However, using a simple cross-validation procedure, we have demonstrated that these models can perform significantly better than well-established procedures. In particular, the experimental performance of the NIG model is surprisingly good and deserves to be studied further. The NG prior has been discussed in (Griffin & Brown, 2007). It was discarded because of its spike at zero and the flatness of the penalty for large values, but no simulations were provided. They favour another model which relies on a parabolic cylinder function¹. The NG prior has nonetheless interesting asymptotic properties in terms of Lévy processes and we have demonstrated its empirical performance. The NG, NIG and Laplace priors can also be considered as particular cases of generalized hyperbolic distributions. This class of distributions has been used in (Snoussi & Idier, 2006) for blind source separation.

¹ The authors provide a link to a program to compute this function. Unfortunately, it is extremely slow. The resulting algorithm is at least one order of magnitude slower than our algorithms, which rely on Bessel functions.

The extension to (probit) classification is straightforward by adding latent variables corresponding to the regression function plus some normal noise. Computationally, it only requires adding one line in the EM algorithm and one simulation step in the Gibbs sampler.

Figure 11. Observations versus predicted observations, estimated with a normal-gamma (left) and normal-inverse Gaussian (right) prior, for the fat (top) and flour (bottom) ingredients (training and test data).

References

Applebaum, D. (2004). Lévy processes and stochastic calculus. Cambridge University Press.

Barndorff-Nielsen, O. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics, 24, 1–13.

Chen, S., Donoho, D., & Saunders, M. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43, 129–159.

Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

Figueiredo, M. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1150–1159.

Ghahramani, Z., Griffiths, T., & Sollich, P. (2006). Bayesian nonparametric latent feature models. Proceedings Valencia/ISBA World Meeting on Bayesian Statistics.

Girolami, M. (2001). A variational method for learning sparse and overcomplete representations. Neural Computation, 13, 2517–2532.

Griffin, J., & Brown, P. (2007). Bayesian adaptive lasso with non-convex penalization (Technical Report). Dept of Statistics, University of Warwick.

Lewicki, M. S., & Sejnowski, T. (2000). Learning overcomplete representations. Neural Computation, 12, 337–365.

Madan, D., & Seneta, E. (1990). The variance-gamma model for share market returns. Journal of Business, 63, 511–524.

Snoussi, H., & Idier, J. (2006). Bayesian blind separation of generalized hyperbolic processes in noisy and underdeterminate mixtures. IEEE Transactions on Signal Processing, 54, 3257–3269.

Teh, Y., Gorur, D., & Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. International Conference on Artificial Intelligence and Statistics.

Thibaux, R., & Jordan, M. (2007). Hierarchical beta processes and the Indian buffet process. International Conference on Artificial Intelligence and Statistics.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society B, 58, 267–288.

Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.

Titsias, M. (2007). The infinite gamma-Poisson feature model. International Conference on Neural Information Processing Systems.

Tsilevich, N., Vershik, A., & Yor, M. (2000). Distinguished properties of the gamma process, and related topics (Technical Report). Laboratoire de Probabilités et Modèles Aléatoires, Paris.

West, M. (2003). Bayesian factor regression models in the "Large p, Small n" paradigm. In J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith and M. West (Eds.), Bayesian Statistics 7, 723–732. Oxford University Press.
