A Geometric View of Conjugate Priors
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Arvind Agarwal and Hal Daumé III
Department of Computer Science, University of Maryland, College Park, Maryland, USA
[email protected], [email protected]

Abstract

In Bayesian machine learning, conjugate priors are popular, mostly due to mathematical convenience. In this paper, we show that there are deeper reasons for choosing a conjugate prior. Specifically, we formulate the conjugate prior in the form of a Bregman divergence and show that it is the inherent geometry of conjugate priors that makes them appropriate and intuitive. This geometric interpretation allows one to view the hyperparameters of conjugate priors as effective sample points, thus providing additional intuition. We use this geometric understanding of conjugate priors to derive the hyperparameters and expression of the prior used to couple the generative and discriminative components of a hybrid model for semi-supervised learning.

1 Introduction

In probabilistic modeling, a practitioner typically chooses a likelihood function (model) based on her knowledge of the problem domain. With limited training data, a simple maximum likelihood estimation (MLE) of the parameters of this model will lead to overfitting and poor generalization. One can regularize the model by adding a prior, but the fundamental question is: which prior? We give a turn-key answer to this problem by analyzing the underlying geometry of the likelihood model, and suggest choosing the unique prior with the same geometry as the likelihood. This unique prior turns out to be the conjugate prior, in the case of the exponential family. This provides justification beyond “computational convenience” for using the conjugate prior in machine learning and data mining applications.

In this work, we give a geometric understanding of the maximum likelihood estimation method and a geometric argument in favor of using conjugate priors. Empirical evidence showing the effectiveness of conjugate priors can be found in our earlier work [1]. In Section 4.1, we first formulate the MLE problem as a completely geometric problem, with no explicit mention of probability distributions. We then show that this geometric problem carries a geometry that is inherent to the structure of the likelihood model. For the reasons given in Sections 4.3 and 4.4, when considering the prior, it is important that one uses the same geometry as the likelihood. Using the same geometry also gives the closed-form solution for the maximum-a-posteriori (MAP) problem. We then analyze the prior using concepts borrowed from information geometry. We show that this geometry induces the Fisher information metric and the 1-connection, which are, respectively, the natural metric and connection for the exponential family (Section 5). One important outcome of this analysis is that it allows us to treat the hyperparameters of the conjugate prior as the effective sample points drawn from the distribution under consideration. We finally extend this geometric interpretation of conjugate priors to analyze the hybrid model given by [7] in a purely geometric setting, and justify the argument presented in [1] (i.e., that a coupling prior should be conjugate) using a much simpler analysis (Section 6). Our analysis couples the discriminative and generative components of the hybrid model using a Bregman divergence, which reduces to the coupling prior given in [1]. This analysis avoids an explicit derivation of the hyperparameters; instead, it automatically yields the expression of the conjugate prior along with its hyperparameters.

2 Motivation

Our analysis is driven by the desire to understand the geometry of conjugate priors for the exponential families. We motivate our analysis by asking ourselves the following question: given a parametric model p(x; θ) for the data likelihood and a prior p(θ; α, β) on its parameters θ, what should the hyperparameters α and β of the prior encode? We know that θ in the likelihood model is an estimate of the parameter obtained from the given data points. In other words, the estimated parameter fits the model to the given data, while the prior on the parameter provides the generalization. This generalization is enforced by some prior belief encoded in the hyperparameters. Unfortunately, one does not know the likely values of the parameters; rather, one might have some belief about which data points are likely to be sampled from the model. Now the question is: do the hyperparameters encode this belief about the parameters in terms of the sampled points? Our analysis shows that the hyperparameters of the conjugate prior are nothing but the effective sample points. In the case of non-conjugate priors, the interpretation of the hyperparameters is not clear.

A second motivation is the following geometric analysis. Before we go into the problem, consider two points in a Euclidean space which one would like to interpolate using a parameter γ ∈ [0, 1]. A natural way to do so is to interpolate them linearly, i.e., connect the two points by a straight line, and then find the interpolating point at the desired γ, as shown in Figure 1(a). This interpolation scheme does not change if we move to a non-Euclidean space. In other words, if we were to interpolate two points in a non-Euclidean space, we would find the interpolating point by connecting the two points by a geodesic (the equivalent of a straight line in a non-Euclidean space) and then finding the point at the desired γ, as shown in Figure 1(b).

This situation arises when one has two models and wants to build a better model by interpolating them. This exact situation is encountered in [7], where the objective is to build a hybrid model by interpolating (or coupling) discriminative and generative models. Agarwal et al. [1] couple these two models using the conjugate prior, and empirically show that using a conjugate prior for the coupling outperforms the original choice [7] of a Gaussian prior. In this work, we find the hybrid model by interpolating the two models using the inherent geometry¹ of the space (interpolating along the geodesic in the space defined by the inherent geometry), which automatically results in the conjugate prior along with its hyperparameters. Our analysis and the analysis of Agarwal et al. lead to the same result, but ours is much simpler and naturally extends to cases where one wants to couple more than two models. One big advantage of our analysis is that, unlike prior approaches [1], we need not know the expression and the hyperparameters of the prior in advance; they are derived automatically by the analysis. Our analysis requires only the inherent geometry of the models under consideration and the interpolation parameters.
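To make the contrast between the two interpolation schemes of Figure 1 concrete, the following Python sketch (ours, not the paper's; the unit sphere is used only as a familiar non-Euclidean space, standing in for the statistical manifolds considered here) compares straight-line interpolation with interpolation along a geodesic:

```python
import numpy as np

# Straight-line vs. geodesic interpolation (cf. Figure 1).
# The unit sphere stands in for a generic non-Euclidean space.

def lerp(a, b, gamma):
    """Euclidean interpolation: the point at fraction gamma along segment ab."""
    return (1 - gamma) * a + gamma * b

def slerp(a, b, gamma):
    """Geodesic (great-circle) interpolation between unit vectors a and b."""
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # angle between a and b
    return (np.sin((1 - gamma) * omega) * a + np.sin(gamma * omega) * b) / np.sin(omega)

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])

print(lerp(a, b, 0.5))   # [0.5, 0.5, 0.0]       -- leaves the sphere
print(slerp(a, b, 0.5))  # [0.707, 0.707, 0.0]   -- stays on the sphere
```

In the spaces considered in this paper, the role of the great circle is played by the geodesic induced by the divergence function d_g.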
Figure 1: Interpolation of two points a and b using (a) Euclidean geometry, and (b) non-Euclidean geometry. Here the geometry is defined by the respective distance/divergence functions d_e and d_g. It is important to notice that the divergence is a generalized notion of distance in non-Euclidean spaces, in particular in the spaces of exponential-family statistical manifolds. In these spaces, it is the divergence function that defines the geometry.

Figure 2: Duality between mean parameters and natural parameters.

An exponential family is a parametric family of probability distributions whose density function can be expressed in the following form:

p(x; θ) = p_o(x) exp(⟨θ, φ(x)⟩ − G(θ))   (1)

here φ(x) : X^m → R^d is a vector of potentials or sufficient statistics, and G(θ) is the normalization term known as the log-partition function. With the potential functions φ(x) fixed, every θ induces a particular member p(x; θ) of the family. In our framework, we deal with exponential families that are regular and have a minimal representation [9].
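As a concrete instance of the form (1) (our illustration, not an example from the paper), the following Python sketch writes the Bernoulli distribution as an exponential family member with φ(x) = x, p_o(x) = 1, and G(θ) = log(1 + e^θ):

```python
import numpy as np

# Bernoulli as an exponential family member in the canonical form of Eq. (1):
#   p(x; theta) = p_o(x) * exp(<theta, phi(x)> - G(theta))
# with phi(x) = x, p_o(x) = 1, and log-partition G(theta) = log(1 + e^theta).

def G(theta):
    """Log-partition function of the Bernoulli family."""
    return np.log1p(np.exp(theta))

def p(x, theta):
    """Density in the canonical form of Eq. (1)."""
    return np.exp(theta * x - G(theta))

theta = 0.7                          # a natural parameter
mu = 1.0 / (1.0 + np.exp(-theta))    # the usual success probability

# The canonical form recovers the familiar pmf mu^x * (1 - mu)^(1 - x):
for x in (0, 1):
    assert np.isclose(p(x, theta), mu**x * (1 - mu)**(1 - x))
```

The same recipe applies to, e.g., the Gaussian and multinomial families, with different choices of φ and G.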
One important property of the exponential family is the existence of conjugate priors. Given any member of the exponential family in (1), the conjugate prior is a distribution over its parameters with the following form:

p(θ | α, β) = m(α, β) exp(⟨θ, α⟩ − β G(θ))   (2)

here α and β are the hyperparameters of the conjugate prior. Importantly, the function G(·) is the same in the exponential family member and in its conjugate prior.

A second important property of an exponential family member is that the log-partition function G is convex and defined over the convex set Θ := {θ ∈ R^d : G(θ) < ∞}; and since it is convex over this set, it induces a Bregman divergence [3]² on the space Θ, given by B_G(θ₁, θ₂) = G(θ₁) − G(θ₂) − ⟨∇G(θ₂), θ₁ − θ₂⟩.

Another important property of the exponential family is the one-to-one mapping between the canonical parameters θ and the so-called “mean parameters,” which we denote by μ. For each canonical parameter θ ∈ Θ, there exists a mean parameter μ, which belongs to the space M defined as:

M := { μ ∈ R^d : μ = ∫ φ(x) p(x; θ) dx, θ ∈ Θ }   (3)

It is easy to see that Θ and M are dual spaces, in the sense of Legendre (conjugate) duality, because of the following relationship between the log-partition function G(θ) and the expected value of the sufficient statistics φ(x): ∇G(θ) = E[φ(x)] = μ. In Legendre duality, we know that two spaces Θ and M are dual of each other if for each θ ∈ Θ, ∇G(θ) = μ ∈ M.
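Continuing the Bernoulli illustration (again a sketch in our notation, not code from the paper), the following shows both the map ∇G(θ) = μ between Θ and M and the sense in which the hyperparameters (α, β) of the conjugate prior (2) behave as effective sample points: each observation adds its sufficient statistic φ(x) to α and one unit to β.

```python
import numpy as np

# Conjugate prior for the Bernoulli family in the form of Eq. (2):
#   p(theta | alpha, beta) ∝ exp(<theta, alpha> - beta * G(theta)),
# i.e. a Beta distribution written over the natural parameter theta.

def grad_G(theta):
    """∇G(theta) = E[phi(x)] = mu: the mean-parameter map from Θ to M."""
    return 1.0 / (1.0 + np.exp(-theta))

def conjugate_update(alpha, beta, xs):
    """Each observation x contributes phi(x) = x to alpha and one unit to
    beta, so (alpha, beta) act as a sum of 'effective sample points' plus
    an effective sample count."""
    return alpha + sum(xs), beta + len(xs)

alpha, beta = 2.0, 5.0        # prior pseudo-data: 2 successes in 5 trials
xs = [1, 0, 1, 1]             # observed data
a_post, b_post = conjugate_update(alpha, beta, xs)   # (5.0, 9.0)

# The posterior mode over theta solves alpha' - beta' * grad_G(theta) = 0,
# so in mean coordinates the MAP estimate is the effective sample mean a'/b':
theta_map = np.log(a_post / (b_post - a_post))
assert np.isclose(grad_G(theta_map), a_post / b_post)
```

For this example, the closed-form MAP behavior mentioned in the introduction is visible directly: in the mean-parameter space, the estimate is simply an average over the observed and effective sample points.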