Sparse Bayesian Nonparametric Regression


François Caron ([email protected])
Arnaud Doucet ([email protected])
Departments of Computer Science and Statistics, University of British Columbia, Vancouver, Canada

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

One of the most common problems in machine learning and statistics consists of estimating the mean response Xβ from a vector of observations y, assuming y = Xβ + ε where X is known, β is a vector of parameters of interest and ε a vector of stochastic errors. We are particularly interested here in the case where the dimension K of β is much higher than the dimension of y. We propose some flexible Bayesian models which can yield sparse estimates of β. We show that as K → ∞ these models are closely related to a class of Lévy processes. Simulations demonstrate that our models significantly outperform a range of popular alternatives.

1. Introduction

Consider the following linear regression model

$$y = X\beta + \varepsilon \qquad (1)$$

where y ∈ R^L is the observation, β = (β_1, ..., β_K) ∈ R^K is the vector of unknown parameters, and X is a known L × K matrix. We assume that ε follows a zero-mean normal distribution, ε ∼ N(0, σ²I_L), where I_L is the identity matrix of dimension L.

We do not impose any restriction on L and K, but we are particularly interested in the case where K ≫ L. This scenario is very common in many application domains. In such cases, we are interested in obtaining a sparse estimate of β; that is, an estimate β̂ = (β̂_1, ..., β̂_K) such that only a subset of the components β̂_k differ from zero. This might be for the sake of variable selection (Tibshirani, 1996; Figueiredo, 2003; Griffin & Brown, 2007) or to decompose a signal over an overcomplete basis (Lewicki & Sejnowski, 2000; Chen et al., 2001).
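To make the setting concrete, the following minimal sketch (not from the paper; the dimensions, sparsity level and noise scale are illustrative choices of ours) generates data from model (1) in the K ≫ L regime with a sparse β:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for the K >> L regime.
L, K = 20, 200

X = rng.standard_normal((L, K))            # known design matrix
beta = np.zeros(K)                         # true coefficients, mostly zero
support = rng.choice(K, size=5, replace=False)
beta[support] = rng.standard_normal(5)     # a few nonzero components
sigma = 0.1                                # noise standard deviation

y = X @ beta + sigma * rng.standard_normal(L)   # model (1): y = X beta + eps
```

A sparse estimator β̂ is then judged by how well it recovers the few nonzero components of β from such data.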
Numerous models and algorithms have been proposed in the machine learning and statistics literature to address this problem, including Bayesian stochastic search methods based on the 'spike and slab' prior (West, 2003), the Lasso (Tibshirani, 1996), projection pursuit and the Relevance Vector Machine (RVM) (Tipping, 2001). We follow here a Bayesian approach where we set a prior distribution on β and primarily focus on the case where β̂ is the resulting Maximum a Posteriori (MAP) estimate or, equivalently, the Penalized Maximum Likelihood (PML) estimate. Such MAP/PML approaches have been discussed many times in the literature and include the Lasso (the corresponding prior being the Laplace distribution) (Tibshirani, 1996; Lewicki & Sejnowski, 2000; Girolami, 2001), the normal-Jeffreys (NJ) prior (Figueiredo, 2003) and the normal-exponential-gamma prior (Griffin & Brown, 2007). Asymptotic theoretical properties of such PML estimates are discussed in (Fan & Li, 2001).

We propose here a class of prior distributions for β based on scale mixtures of Gaussians. For finite K, our prior models correspond to normal-gamma (NG) and normal-inverse Gaussian (NIG) models. This class of models includes as limiting cases both the popular Laplace and normal-Jeffreys priors but is more flexible. As K → ∞, we show that the proposed priors are closely related to the variance gamma and normal-inverse Gaussian processes, which are Lévy processes (Applebaum, 2004). In this respect, our models are somehow complementary to two recently proposed Bayesian nonparametric models: the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). Under given conditions, the normal-gamma prior yields sparse MAP estimates β̂. The log-posterior distributions associated with these priors are not convex, but we propose an Expectation-Maximization (EM) algorithm to find modes of the posteriors and a Markov chain Monte Carlo (MCMC) algorithm to sample from them. We demonstrate through simulations that these Bayesian models significantly outperform a range of established procedures on a variety of applications.

The rest of the paper is organized as follows. In Section 2, we propose the NG and NIG models for β. We establish some properties of these models for finite K and in the asymptotic case where K → ∞. We also relate our models to the Indian buffet process (Ghahramani et al., 2006) and the infinite gamma-Poisson process (Titsias, 2007). In Section 3, we establish conditions under which the MAP/PML estimate β̂ can enjoy sparsity properties. Section 4 presents an EM algorithm to find modes of the posterior distributions and a Gibbs sampling algorithm to sample from them. We demonstrate the performance of our models and algorithms in Section 5. Finally, we discuss some extensions in Section 6.

2. Sparse Bayesian Nonparametric Models

We consider models where the components β_k are independent and identically distributed,

$$p(\beta) = \prod_{k=1}^{K} p(\beta_k),$$

and p(β_k) is a scale mixture of Gaussians; that is,

$$p(\beta_k) = \int \mathcal{N}(\beta_k; 0, \sigma_k^2)\, p(\sigma_k^2)\, d\sigma_k^2 \qquad (2)$$

where N(x; μ, σ²) denotes the Gaussian distribution of argument x, mean μ and variance σ². We propose two conjugate distributions for σ_k², namely the gamma and the inverse Gaussian distributions. The resulting marginal distribution of β_k belongs in both cases to the class of generalized hyperbolic distributions.

In the models presented here, the unknown scale parameters are random and integrated out, so that the marginal priors on the regression coefficients are not Gaussian. This differs from the RVM (Tipping, 2001), where these parameters are unknown and estimated through maximum likelihood.

2.1. Normal-Gamma Model

2.1.1. Definition

Consider the following gamma prior distribution

$$\sigma_k^2 \sim \mathcal{G}\!\left(\frac{\alpha}{K}, \frac{\gamma^2}{2}\right)$$

whose probability density function (pdf) G(σ_k²; α/K, γ²/2) is given by

$$\frac{(\gamma^2/2)^{\alpha/K}}{\Gamma(\alpha/K)}\, (\sigma_k^2)^{\alpha/K - 1} \exp\!\left(-\frac{\gamma^2}{2}\sigma_k^2\right).$$

Following Eq. (2), the marginal pdf of β_k is given for β_k ≠ 0 by

$$p(\beta_k) = \frac{\gamma^{\alpha/K + 1/2}}{\sqrt{\pi}\, 2^{\alpha/K - 1/2}\, \Gamma(\alpha/K)}\, |\beta_k|^{\alpha/K - 1/2}\, K_{\alpha/K - 1/2}(\gamma |\beta_k|) \qquad (3)$$

where K_ν(·) is the modified Bessel function of the second kind. We have

$$\lim_{\beta_k \to 0} p(\beta_k) = \begin{cases} \dfrac{\gamma\, \Gamma(\alpha/K - 1/2)}{2\sqrt{\pi}\, \Gamma(\alpha/K)} & \text{if } \alpha/K > 1/2 \\ \infty & \text{otherwise} \end{cases}$$

and the tails of this distribution decrease as |β_k|^{α/K − 1} exp(−γ|β_k|); see Figure 1(a). The parameters α and γ respectively control the shape and scale of the distribution. When α → 0, there is a high discrepancy between the values of σ_k², while when α → ∞, most of the values are equal.

[Figure 1. Probability density functions of the NG and NIG priors for different values of the parameters. (a) Normal-gamma. (b) Normal-inverse Gaussian.]

This class of priors includes many standard priors. Indeed, Eq. (3) reduces to the Laplace prior when α/K = 1, and we obtain the NJ prior when α/K → 0 and γ → 0. In Figure 2, some realizations of the process are given for the values α = 1, 5, 100 and γ²/2 = α.
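As a numerical sanity check on Eq. (3), the following sketch (function name and parameter values are ours) evaluates the marginal NG density via scipy's modified Bessel function K_ν and compares it with direct numerical integration of the scale mixture (2):

```python
import numpy as np
from scipy.special import kv, gammaln
from scipy.integrate import quad
from scipy.stats import norm, gamma

def ng_pdf(beta, alpha, gam, K):
    """Marginal NG density p(beta_k) of Eq. (3), with lam = alpha/K."""
    lam = alpha / K
    b = np.abs(beta)
    logp = ((lam + 0.5) * np.log(gam)
            - 0.5 * np.log(np.pi) - (lam - 0.5) * np.log(2.0)
            - gammaln(lam)
            + (lam - 0.5) * np.log(b)
            + np.log(kv(lam - 0.5, gam * b)))
    return np.exp(logp)

# Check against the mixture (2): integrate N(beta; 0, s) against
# Gamma(s; shape=alpha/K, rate=gam^2/2), i.e. scale = 2/gam^2 in scipy.
alpha, gam, K, beta = 1.0, 2.0, 10, 0.3
mix = quad(lambda s: norm.pdf(beta, scale=np.sqrt(s))
                     * gamma.pdf(s, a=alpha / K, scale=2.0 / gam**2),
           0, np.inf)[0]
print(ng_pdf(beta, alpha, gam, K), mix)   # the two values should agree
```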
2.1.2. Properties

It follows from Eq. (3) that

$$\mathbb{E}[|\beta_k|] = \sqrt{\frac{4}{\pi\gamma^2}}\, \frac{\Gamma(\alpha/K + 1/2)}{\Gamma(\alpha/K)}, \qquad \mathbb{E}[\beta_k^2] = \frac{2\alpha}{\gamma^2 K},$$

and we obtain

$$\lim_{K \to \infty} \mathbb{E}\!\left[\sum_{k=1}^{K} |\beta_k|\right] = \frac{2\alpha}{\gamma}, \qquad \mathbb{E}\!\left[\sum_{k=1}^{K} \beta_k^2\right] = \frac{2\alpha}{\gamma^2}.$$

Hence the sum of the terms remains bounded whatever K is.

[Figure 2. Realizations of {σ_k²}_{k=1,...,K} (top) and {β_k}_{k=1,...,K} (bottom) from the NG model for α = 1, 5, 100.]

Using properties of the gamma distribution, it is possible to relate β to a Lévy process known as the variance gamma process as K → ∞. Define the normalized weights

$$\pi_k = \frac{\sigma_k^2}{\sum_{j=1}^{K} \sigma_j^2}. \qquad (4)$$

As K → ∞, the ordered weights (π_{(k)}) and the sum ∑_j σ_j² are distributed according to PD(α) and G(α, γ²/2), where PD(α) is the Poisson-Dirichlet distribution of scale parameter α. It is well known that this distribution can be recovered by the following (infinite) stick-breaking construction (Tsilevich et al., 2000): if we set

$$\pi_k = \zeta_k \prod_{j=1}^{k-1} (1 - \zeta_j) \quad \text{with } \zeta_j \sim \mathcal{B}(1, \alpha) \qquad (5)$$

for any k, then the order statistics (π_{(k)}) are distributed from the Poisson-Dirichlet distribution (see the code sketch below, after Section 2.2.1). The coefficients (β_k) are thus nothing but the weights (jumps) of the so-called variance gamma process, which is a Brownian motion evaluated at times given by a gamma process (Applebaum, 2004; Madan & Seneta, 1990).

2.2. Normal-Inverse Gaussian Model

2.2.1. Definition

Consider the following inverse Gaussian prior distribution

$$\sigma_k^2 \sim \mathcal{IG}\!\left(\frac{\alpha}{K}, \gamma\right). \qquad (6)$$
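Sampling from this prior is straightforward with standard libraries. The sketch below assumes the IG(δ, γ) parameterization common in the normal-inverse Gaussian literature (mean δ/γ, shape δ²) for Eq. (6) with δ = α/K; the mapping to scipy.stats.invgauss is spelled out in the comments, and the function name is ours:

```python
import numpy as np
from scipy.stats import invgauss

def sample_nig_prior(K, alpha, gam, rng):
    """Draw (sigma2, beta) from the NIG prior, reading Eq. (6) as
    sigma2_k ~ IG(delta, gam) with delta = alpha/K, mean delta/gam and
    shape delta^2 (an assumed parameterization). scipy's
    invgauss(mu, scale) has mean mu*scale and shape scale, so we map:
    scale = delta**2 and mu = 1/(delta*gam)."""
    delta = alpha / K
    sigma2 = invgauss.rvs(mu=1.0 / (delta * gam), scale=delta**2,
                          size=K, random_state=rng)
    beta = rng.standard_normal(K) * np.sqrt(sigma2)   # beta_k ~ N(0, sigma2_k)
    return sigma2, beta

rng = np.random.default_rng(1)
alpha, gam = 5.0, 2.0
sigma2, beta = sample_nig_prior(K=100, alpha=alpha, gam=gam, rng=rng)
# Under this parameterization, E[sum_k sigma2_k] = K * delta/gam = alpha/gam.
print(sigma2.sum(), alpha / gam)
```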

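Finally, here is the promised minimal sketch of the stick-breaking construction (5), truncated at K sticks (a finite approximation; the function name is ours). Sorting the resulting weights in decreasing order approximates a draw of the ranked weights from PD(α):

```python
import numpy as np

def stick_breaking_pd(alpha, K, rng):
    """Truncated stick-breaking (5): zeta_j ~ Beta(1, alpha),
    pi_k = zeta_k * prod_{j<k}(1 - zeta_j); ranked weights approximate
    a Poisson-Dirichlet PD(alpha) draw."""
    zeta = rng.beta(1.0, alpha, size=K)
    # Length of stick remaining before the k-th break: prod_{j<k}(1 - zeta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - zeta)[:-1]))
    pi = zeta * remaining
    return np.sort(pi)[::-1]

rng = np.random.default_rng(2)
pi = stick_breaking_pd(alpha=1.0, K=1000, rng=rng)
print(pi[:5])      # a handful of dominant weights
print(pi.sum())    # close to 1 for large K; 1 - sum is the truncation error
```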