Sparse Bayesian Nonparametric Regression
François Caron [email protected]
Arnaud Doucet [email protected]
Departments of Computer Science and Statistics, University of British Columbia, Vancouver, Canada

Abstract

One of the most common problems in machine learning and statistics consists of estimating the mean response $X\beta$ from a vector of observations $y$, assuming $y = X\beta + \varepsilon$ where $X$ is known, $\beta$ is a vector of parameters of interest and $\varepsilon$ a vector of stochastic errors. We are particularly interested here in the case where the dimension $K$ of $\beta$ is much higher than the dimension of $y$. We propose some flexible Bayesian models which can yield sparse estimates of $\beta$. We show that as $K \to \infty$ these models are closely related to a class of Lévy processes. Simulations demonstrate that our models significantly outperform a range of popular alternatives.

1. Introduction

Consider the following linear regression model

$$y = X\beta + \varepsilon \qquad (1)$$

where $y \in \mathbb{R}^L$ is the observation, $\beta = (\beta_1, \ldots, \beta_K) \in \mathbb{R}^K$ is the vector of unknown parameters and $X$ is a known $L \times K$ matrix. We assume that $\varepsilon$ follows a zero-mean normal distribution $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_L)$, where $I_L$ is the identity matrix of dimension $L$.

We do not impose here any restriction on $L$ and $K$, but we are particularly interested in the case where $K \gg L$. This scenario is very common in many application domains. In such cases, we are interested in obtaining a sparse estimate of $\beta$; that is, an estimate $\widehat{\beta} = (\widehat{\beta}_1, \ldots, \widehat{\beta}_K)$ such that only a subset of the components $\widehat{\beta}_k$ differ from zero. This might be for the sake of variable selection (Tibshirani, 1996; Figueiredo, 2003; Griffin & Brown, 2007) or to decompose a signal over an overcomplete basis (Lewicki & Sejnowski, 2000; Chen et al., 2001).

Numerous models and algorithms have been proposed in the machine learning and statistics literature to address this problem, including Bayesian stochastic search methods based on the 'spike and slab' prior (West, 2003), the Lasso (Tibshirani, 1996), projection pursuit and the Relevance Vector Machine (RVM) (Tipping, 2001). We follow here a Bayesian approach where we set a prior distribution on $\beta$ and primarily focus on the case where $\widehat{\beta}$ is the resulting Maximum a Posteriori (MAP) estimate or, equivalently, the Penalized Maximum Likelihood (PML) estimate. Such MAP/PML approaches have been discussed many times in the literature and include the Lasso (the corresponding prior being the Laplace distribution) (Tibshirani, 1996; Lewicki & Sejnowski, 2000; Girolami, 2001), the normal-Jeffreys (NJ) prior (Figueiredo, 2003) and the normal-exponential-gamma prior (Griffin & Brown, 2007). Asymptotic theoretical properties of such PML estimates are discussed in (Fan & Li, 2001).

We propose here a class of prior distributions for $\beta$ based on scale mixtures of Gaussians. For finite $K$, our prior models correspond to normal-gamma (NG) and normal-inverse Gaussian (NIG) models. This class of models includes as limiting cases both the popular Laplace and normal-Jeffreys priors but is more flexible. As $K \to \infty$, we show that the proposed priors are closely related to the variance gamma and normal-inverse Gaussian processes, which are Lévy processes (Applebaum, 2004). In this respect, our models are somewhat complementary to two recently proposed Bayesian nonparametric models: the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). Under given conditions, the normal-gamma prior yields sparse MAP estimates $\widehat{\beta}$.
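To make the regression setting of Eq. (1) concrete, the following minimal sketch simulates a problem with $K \gg L$ and a sparse coefficient vector. The dimensions, sparsity level and noise scale are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

L, K = 50, 500          # far more coefficients than observations (K >> L)
X = rng.normal(size=(L, K))

beta = np.zeros(K)      # sparse "true" coefficient vector
support = rng.choice(K, size=10, replace=False)
beta[support] = rng.normal(scale=2.0, size=10)

sigma = 0.5             # noise standard deviation
y = X @ beta + sigma * rng.normal(size=L)   # Eq. (1): y = X beta + eps

print(y.shape, np.count_nonzero(beta))      # (50,) 10
```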
The log-posterior distributions associated with these prior distributions are not convex, but we propose an Expectation-Maximization (EM) algorithm to find modes of the posteriors and a Markov Chain Monte Carlo (MCMC) algorithm to sample from them. We demonstrate through simulations that these Bayesian models significantly outperform a range of established procedures on a variety of applications.

The rest of the paper is organized as follows. In Section 2, we propose the NG and NIG models for $\beta$. We establish some properties of these models for finite $K$ and in the asymptotic case where $K \to \infty$. We also relate our models to the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). In Section 3, we establish conditions under which the MAP/PML estimate $\widehat{\beta}$ enjoys sparsity properties. Section 4 presents an EM algorithm to find modes of the posterior distributions and a Gibbs sampling algorithm to sample from them. We demonstrate the performance of our models and algorithms in Section 5. Finally, we discuss some extensions in Section 6.

2. Sparse Bayesian Nonparametric Models

We consider models where the components $\beta_k$ are independent and identically distributed,

$$p(\beta) = \prod_{k=1}^{K} p(\beta_k),$$

and $p(\beta_k)$ is a scale mixture of Gaussians; that is,

$$p(\beta_k) = \int \mathcal{N}(\beta_k; 0, \sigma_k^2)\, p(\sigma_k^2)\, d\sigma_k^2 \qquad (2)$$

where $\mathcal{N}(x; \mu, \sigma^2)$ denotes the Gaussian distribution of argument $x$, mean $\mu$ and variance $\sigma^2$. We propose two conjugate distributions for $\sigma_k^2$, namely the gamma and the inverse Gaussian distributions. The resulting marginal distribution of $\beta_k$ belongs in both cases to the class of generalized hyperbolic distributions.

In the models presented here, the unknown scale parameters are random and integrated out, so that the marginal priors on the regression coefficients are not Gaussian. This differs from the RVM (Tipping, 2001), where these parameters are unknown and estimated by maximum likelihood.

2.1. Normal-Gamma Model

2.1.1. Definition

Consider the following gamma prior distribution,

$$\sigma_k^2 \sim \mathcal{G}\left(\frac{\alpha}{K}, \frac{\gamma^2}{2}\right),$$

whose probability density function (pdf) $\mathcal{G}(\sigma_k^2; \alpha/K, \gamma^2/2)$ is given by

$$\frac{(\gamma^2/2)^{\alpha/K}}{\Gamma(\alpha/K)}\, (\sigma_k^2)^{\alpha/K - 1} \exp\left(-\frac{\gamma^2}{2}\, \sigma_k^2\right).$$

Following Eq. (2), the marginal pdf of $\beta_k$ is given for $\beta_k \neq 0$ by

$$p(\beta_k) = \frac{\gamma^{\alpha/K + 1/2}}{\sqrt{\pi}\, 2^{\alpha/K - 1/2}\, \Gamma(\alpha/K)}\, |\beta_k|^{\alpha/K - 1/2}\, K_{\alpha/K - 1/2}(\gamma |\beta_k|) \qquad (3)$$

where $K_\nu(\cdot)$ is the modified Bessel function of the second kind. We have

$$\lim_{\beta_k \to 0} p(\beta_k) = \begin{cases} \dfrac{\gamma\, \Gamma(\alpha/K - 1/2)}{2\sqrt{\pi}\, \Gamma(\alpha/K)} & \text{if } \alpha/K > 1/2 \\ \infty & \text{otherwise} \end{cases}$$

and the tails of this distribution decrease as $|\beta_k|^{\alpha/K - 1} \exp(-\gamma |\beta_k|)$; see Figure 1(a). The parameters $\alpha$ and $\gamma$ respectively control the shape and scale of the distribution. When $\alpha \to 0$ there is a high discrepancy between the values of $\sigma_k^2$, whereas when $\alpha \to \infty$ most of the values are equal.

Figure 1. Probability density functions of the NG and NIG priors for different values of the parameters: (a) normal-gamma, (b) normal-inverse Gaussian.

This class of priors includes many standard priors. Indeed, Eq. (3) reduces to the Laplace prior when $\alpha/K = 1$, and we obtain the NJ prior when $\alpha/K \to 0$ and $\gamma \to 0$. Figure 2 shows some realizations of the process for the values $\alpha = 1, 5, 100$ and $\gamma^2/2 = \alpha$.
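As a quick numerical check of this definition, the sketch below draws $\beta_k$ from the NG prior via the scale-mixture representation of Eq. (2) and also evaluates the closed-form marginal of Eq. (3) using scipy.special.kv for $K_\nu$. The helper names and parameter values are mine and purely illustrative.

```python
import numpy as np
from scipy.special import kv, gammaln

def sample_ng_prior(alpha, gamma, K, size=None, rng=None):
    """Draw beta_k by the scale mixture of Eq. (2):
    sigma_k^2 ~ Gamma(alpha/K, rate=gamma^2/2), then beta_k ~ N(0, sigma_k^2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma2 = rng.gamma(shape=alpha / K, scale=2.0 / gamma**2, size=size)
    return rng.normal(0.0, np.sqrt(sigma2))

def ng_marginal_logpdf(beta, alpha, gamma, K):
    """Log of the closed-form marginal of Eq. (3), with lam = alpha/K (beta != 0)."""
    lam = alpha / K
    b = np.abs(beta)
    return ((lam + 0.5) * np.log(gamma)
            - 0.5 * np.log(np.pi) - (lam - 0.5) * np.log(2.0) - gammaln(lam)
            + (lam - 0.5) * np.log(b) + np.log(kv(lam - 0.5, gamma * b)))

# Sanity check: E[beta_k^2] should equal 2*alpha/(gamma^2*K) (see Section 2.1.2).
alpha, gamma, K = 5.0, 2.0, 100
draws = sample_ng_prior(alpha, gamma, K, size=200_000)
print(draws.var(), 2 * alpha / (gamma**2 * K))
print(ng_marginal_logpdf(0.1, alpha, gamma, K))
```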
2.1.2. Properties

It follows from Eq. (3) that

$$E[|\beta_k|] = \sqrt{\frac{4}{\pi \gamma^2}}\, \frac{\Gamma(\alpha/K + 1/2)}{\Gamma(\alpha/K)}, \qquad E[\beta_k^2] = \frac{2\alpha}{\gamma^2 K}$$

and we obtain

$$\lim_{K \to \infty} E\left[\sum_{k=1}^{K} |\beta_k|\right] = \frac{2\alpha}{\gamma}, \qquad E\left[\sum_{k=1}^{K} \beta_k^2\right] = \frac{2\alpha}{\gamma^2}.$$

Hence the sum of the terms remains bounded whatever the value of $K$.

Using properties of the gamma distribution, it is possible to relate $\beta$ to a Lévy process known as the variance gamma process as $K \to \infty$. As $K \to \infty$, the ordered normalized scales $\sigma_k^2 / \sum_j \sigma_j^2$ and their sum $\sum_j \sigma_j^2$ are distributed according to PD($\alpha$) and $\mathcal{G}(\alpha, \gamma^2/2)$ respectively, where PD($\alpha$) is the Poisson-Dirichlet distribution of scale parameter $\alpha$. It is well known that this distribution can be recovered by the following (infinite) stick-breaking construction (Tsilevich et al., 2000): if we set

$$\pi_k = \zeta_k \prod_{j=1}^{k-1} (1 - \zeta_j) \quad \text{with} \quad \zeta_j \sim \mathcal{B}(1, \alpha) \qquad (5)$$

for any $k$, then the order statistics $(\pi_{(k)})$ are distributed according to the Poisson-Dirichlet distribution (see the sketch at the end of this section). The coefficients $(\beta_k)$ are thus nothing but the weights (jumps) of the so-called variance gamma process, which is a Brownian motion evaluated at times given by a gamma process (Applebaum, 2004; Madan & Seneta, 1990).

Figure 2. Realizations of (top) $\{\sigma_k^2\}_{k=1,\ldots,K}$ and (bottom) $\{\beta_k\}_{k=1,\ldots,K}$ from the NG model for $\alpha = 1, 5, 100$.

2.2. Normal-Inverse Gaussian Model

2.2.1. Definition

Consider the following inverse Gaussian prior distribution:

$$\sigma_k^2 \sim \mathcal{IG}\left(\frac{\alpha}{K}, \gamma\right). \qquad (6)$$
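Referring back to the stick-breaking construction of Eq. (5), the following minimal sketch (my own illustrative code, not from the paper) generates the first weights $\pi_k$ with $\zeta_j \sim \mathcal{B}(1, \alpha)$ and checks that their total mass approaches one as more sticks are broken; sorting them in decreasing order gives an approximate draw from PD($\alpha$).

```python
import numpy as np

def stick_breaking_weights(alpha, n_weights, rng=None):
    """First n_weights of Eq. (5): pi_k = zeta_k * prod_{j<k} (1 - zeta_j),
    with zeta_j ~ Beta(1, alpha)."""
    if rng is None:
        rng = np.random.default_rng(0)
    zeta = rng.beta(1.0, alpha, size=n_weights)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - zeta)[:-1]))  # stick left before break k
    return zeta * remaining

pi = stick_breaking_weights(alpha=1.0, n_weights=1000)
print(pi[:5])                 # a few raw weights
print(np.sort(pi)[::-1][:5])  # ordered weights, approximately Poisson-Dirichlet(alpha)
print(pi.sum())               # approaches 1 as n_weights grows
```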