
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

A Geometric View of Conjugate Priors

Arvind Agarwal and Hal Daumé III
Department of Computer Science, University of Maryland
College Park, Maryland USA
[email protected], [email protected]

Abstract

In Bayesian machine learning, conjugate priors are popular, mostly due to mathematical convenience. In this paper, we show that there are deeper reasons for choosing a conjugate prior. Specifically, we formulate the conjugate prior in the form of Bregman divergence and show that it is the inherent geometry of conjugate priors that makes them appropriate and intuitive. This geometric interpretation allows one to view the hyperparameters of conjugate priors as the effective sample points, thus providing additional intuition. We use this geometric understanding of conjugate priors to derive the hyperparameters and expression of the prior used to couple the generative and discriminative components of a hybrid model for semi-supervised learning.

1 Introduction

In probabilistic modeling, a practitioner typically chooses a likelihood (model) based on her knowledge of the problem domain. With limited training data, a simple maximum likelihood estimation (MLE) of the parameters of this model will lead to overfitting and poor generalization. One can regularize the model by adding a prior, but the fundamental question is: which prior? We give a turn-key answer to this problem by analyzing the underlying geometry of the likelihood model, and suggest choosing the unique prior with the same geometry as the likelihood. This unique prior turns out to be the conjugate prior, in the case of the exponential family. This provides justification beyond "computational convenience" for using the conjugate prior in machine learning and data mining applications.

In this work, we give a geometric understanding of the maximum likelihood estimation method and a geometric argument in favor of using conjugate priors. Empirical evidence showing the effectiveness of conjugate priors can be found in our earlier work [1]. In Section 4.1, we first formulate the MLE problem as a completely geometric problem with no explicit mention of probability distributions. We then show that this geometric problem carries a geometry that is inherent to the structure of the likelihood model. For reasons given in Sections 4.3 and 4.4, when considering the prior, it is important that one use the same geometry as the likelihood. Using the same geometry also gives the closed-form solution for the maximum-a-posteriori (MAP) problem. We then analyze the prior using concepts borrowed from information geometry. We show that this geometry induces the Fisher information metric and 1-connection, which are, respectively, the natural metric and connection for the exponential family (Section 5). One important outcome of this analysis is that it allows us to treat the hyperparameters of the conjugate prior as the effective sample points drawn from the distribution under consideration. We finally extend this geometric interpretation of conjugate priors to analyze the hybrid model given by [7] in a purely geometric setting, and justify the argument presented in [1] (i.e., a coupling prior should be conjugate) using a much simpler analysis (Section 6). Our analysis couples the discriminative and generative components of the hybrid model using the Bregman divergence, which reduces to the coupling prior given in [1]. This analysis avoids the explicit derivation of the hyperparameters; rather, it automatically gives the hyperparameters of the conjugate prior along with the expression.

2 Motivation

Our analysis is driven by the desire to understand the geometry of conjugate priors for the exponential families. We motivate our analysis by asking ourselves the following question: given a parametric model p(x; θ) for the data likelihood, and a prior on its parameters θ, p(θ; α, β), what should the hyperparameters α and β of the prior encode? We know that θ in the likelihood model is the estimate of the parameter using the given data points. In other words, the estimated parameter fits the model according to the given data, while the prior on the parameter provides the generalization. This generalization is enforced by some prior belief encoded in the hyperparameters. Unfortunately, one does not know what the likely value of the parameters is; rather, one might have some belief in what data points are likely to be sampled from the model. Now the question is: do the hyperparameters encode this belief in the parameters in terms of the sampling points? Our analysis shows that the hyperparameters of the conjugate prior are nothing but the effective sampling points. In the case of non-conjugate priors, the interpretation of the hyperparameters is not clear.


A second motivation is the following geometric analysis. Before we go into the problem, consider two points in the Euclidean space which one would like to interpolate using a parameter γ ∈ [0, 1]. A natural way to do so is to interpolate them linearly, i.e., connect the two points using a straight line, and then find the interpolating point at the desired γ, as shown in Figure 1(a). This interpolation scheme does not change if we move to a non-Euclidean space. In other words, if we were to interpolate two points in a non-Euclidean space, we would find the interpolating point by connecting the two points by a geodesic (the equivalent of the straight line in the non-Euclidean space) and then finding the point at the desired γ, as shown in Figure 1(b).

Figure 1: Interpolation of two points a and b using (a) Euclidean geometry, and (b) non-Euclidean geometry. Here geometry is defined by the respective distance/divergence functions d_e and d_g. It is important to notice that the divergence is a generalized notion of the distance in the non-Euclidean spaces, in particular, in the spaces of the exponential family statistical manifolds. In these spaces, it is the divergence function that defines the geometry.

Figure 2: Duality between mean parameters and natural parameters.

This situation arises when one has two models, and wants to build a better model by interpolating them. This exact situation is encountered in [7], where the objective is to build a hybrid model by interpolating (or coupling) discriminative and generative models. Agarwal et al. [1] couple these two models using the conjugate prior, and empirically show that using a conjugate prior for the coupling outperforms the original choice [7] of a Gaussian prior. In this work, we find the hybrid model by interpolating the two models using the inherent geometry¹ of the space (interpolating along the geodesic in the space defined by the inherent geometry), which automatically results in the conjugate prior along with its hyperparameters. Our analysis and the analysis of Agarwal et al. lead to the same result, but ours is much simpler and naturally extends to the cases where one wants to couple more than two models. One big advantage of our analysis is that, unlike prior approaches [1], we need not know the expression and the hyperparameters of the prior in advance. They are automatically derived by the analysis. Our analysis only requires the inherent geometry of the models under consideration and the interpolation parameters. No explicit expression of the coupling prior is needed.

3 Exponential Family

In this section, we review the exponential family. The exponential family is a set of distributions whose probability density function can be expressed in the following form:

p(x; θ) = p_o(x) exp(⟨θ, φ(x)⟩ − G(θ))    (1)

here φ(x): X^m → R^d is a vector of potentials or sufficient statistics and G(θ) is a normalization constant or log-partition function. With the potential functions φ(x) fixed, every θ induces a particular member p(x; θ) of the family. In our framework, we deal with exponential families that are regular and have the minimal representation [9].

One important property of the exponential family is the existence of conjugate priors. Given any member of the exponential family in (1), the conjugate prior is a distribution over its parameters with the following form:

p(θ|α, β) = m(α, β) exp(⟨θ, α⟩ − βG(θ))    (2)

here α and β are hyperparameters of the conjugate prior. Importantly, the function G(·) is the same between the exponential family member and its conjugate prior.

A second important property of an exponential family member is that the log-partition function G is convex and defined over the convex set Θ := {θ ∈ R^d : G(θ) < ∞}; and since it is convex over this set, it induces a Bregman divergence [3]² on the space Θ.

Another important property of the exponential family is the one-to-one mapping between the canonical parameters θ and the so-called "mean parameters", which we denote by μ. For each canonical parameter θ ∈ Θ, there exists a mean parameter μ, which belongs to the space M defined as:

M := {μ ∈ R^d : μ = ∫ φ(x) p(x; θ) dx, ∀θ ∈ Θ}    (3)

It is easy to see that Θ and M are dual spaces, in the sense of Legendre (conjugate) duality, because of the following relationship between the log-partition function G(θ) and the expected value of the sufficient statistics φ(x): ∇G(θ) = E(φ(x)) = μ. In Legendre duality, we know that two spaces Θ and M are dual of each other if for each θ ∈ Θ, ∇G(θ) = μ ∈ M. We call the function in the dual space M F, i.e., F = G*. A pictorial representation of the duality between the canonical parameter space Θ and the mean parameter space M is given in Figure 2.

¹ In the exponential family statistical manifold, the inherent geometry is defined by the divergence function because it is the divergence function that induces the metric structure and connection of the manifold. Refer to [2] for more details.

² Two important points to note about Bregman divergence are: 1) for dual spaces F and G, B_F(p‖q) = B_G(q*‖p*), where p* and q* are the conjugate duals of p and q respectively; 2) Bregman divergence is not symmetric, i.e., in general, B_F(p‖q) ≠ B_F(q‖p), therefore it is important in which direction these divergences are measured.
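To make (1)-(3) and the duality ∇G(θ) = μ concrete, the following is a minimal sketch in Python. The Bernoulli family (with its Beta-style conjugate prior in natural form) is our own illustrative choice, and all function names below are ours, not the paper's; the point is only that the prior (2) reuses the same log-partition function G as the likelihood (1), and that ∇G maps natural parameters to mean parameters.

# Sketch (our own example, not from the paper): the Bernoulli family in the
# form of Eq. (1), its conjugate prior in the form of Eq. (2), and the
# Legendre duality grad G(theta) = mu of Section 3.
import numpy as np

def G(theta):                      # log-partition of the Bernoulli family
    return np.log1p(np.exp(theta))

def grad_G(theta):                 # grad G(theta) = E[phi(x)] = mu (the sigmoid)
    return 1.0 / (1.0 + np.exp(-theta))

def grad_G_inv(mu):                # maps a mean parameter back to Theta (the logit)
    return np.log(mu / (1.0 - mu))

def log_lik(x, theta):             # Eq. (1) with phi(x) = x and p_o(x) = 1
    return theta * x - G(theta)

def log_conj_prior(theta, alpha, beta):   # Eq. (2), up to log m(alpha, beta)
    return theta * alpha - beta * G(theta)

theta = 0.3
mu = grad_G(theta)
assert np.allclose(grad_G_inv(mu), theta)   # duality between Theta and M
# The prior (2) reuses the same G as the likelihood (1): that shared
# log-partition function is exactly what "conjugate" means here.
print(log_lik(1.0, theta), log_conj_prior(theta, alpha=2.0, beta=3.0))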

In our analysis, we will need the Bregman divergence over φ(x), which can be obtained by showing that an augmented M contains all possible φ(x). In order to define the Bregman divergence over all φ(x), we define a new set of mean parameters w.r.t. all probability distributions (not only w.r.t. exponential family distributions): M⁺ := {μ ∈ R^d : μ = ∫ φ(x) p(x) dx s.t. ∫ p(x) dx = 1}.

Note that M⁺ is the convex hull of φ(x) and thus contains all φ(x). We know (see Theorem 3.3, [10]) that M is the interior of M⁺. Now we augment M with the boundary of M⁺, and Θ with the canonical parameters (limiting distributions) that will generate the mean parameters corresponding to this boundary. We know (see Theorem 2, [9]) that such parameters exist. Call these new sets M⁺ and Θ⁺ respectively. We also know [9] that Θ⁺ and M⁺ are conjugate dual of each other (for the boundary, duality exists in the limiting sense), i.e., the Bregman divergence is defined over the entire M⁺. In the following discussion, M and Θ will denote the closed sets, i.e., M⁺ and Θ⁺ respectively.

4 Likelihood, Prior and Geometry

In this section, we first formulate the ML problem as a Bregman median problem (Section 4.1) and then show that the corresponding MAP (maximum-a-posteriori) problem can also be converted into a Bregman median problem (Section 4.3). The MAP Bregman median problem consists of two parts: a likelihood model and a prior. We argue (Sections 4.3 and 4.4) that a Bregman median problem makes sense only when both of these parts have the same geometry. Having the same geometry amounts to having the same log-partition function, leading to the property of conjugate priors.

4.1 Likelihood in the form of Bregman Divergence

Following [5], we can write the distributions belonging to the exponential family (1) in terms of Bregman divergence³:

log p(x; θ) = log p_o(x) + F(x) − B_F(x‖∇G(θ))    (4)

This representation of the likelihood in the form of Bregman divergence gives insight into the geometry of the likelihood function. Gaining this insight into the exponential family distributions, and establishing a meaningful relationship between likelihood and prior, is the primary objective of this work.

In learning problems, one is interested in estimating the parameters θ of the model which result in low generalization error. Perhaps the most standard estimation method is maximum likelihood (ML). The ML estimate, θ̂_ML, of a set of n i.i.d. training data points X = {x_1, ..., x_n} drawn from the exponential family is obtained by solving the following problem: θ̂_ML = max_{θ∈Θ} log p(X; θ) = max_{θ∈Θ} Σ_{i=1}^n log p(x_i; θ).

Theorem 1. Let X be a set of n i.i.d. training data points drawn from the exponential family distribution with the log-partition function G, and let F be the dual function of G; then the dual of the ML estimate (θ̂_ML) of X under the assumed exponential family model solves the following Bregman median problem: μ̂_ML = min_{μ∈M} Σ_{i=1}^n B_F(x_i‖μ).

Proof. The proof is straightforward. Using (4) in the MLE problem max_{θ∈Θ} Σ_{i=1}^n log p(x_i; θ), and ignoring terms that do not depend on θ:

θ̂_ML = min_{θ∈Θ} Σ_{i=1}^n B_F(x_i‖∇G(θ))    (5)

which, using the expression ∇G(θ) = μ, gives the desired result.

The above theorem converts the problem of maximizing the log likelihood log p(X; θ) into an equivalent problem of minimizing the corresponding Bregman divergences, which is nothing but a Bregman median problem, the solution to which is given by μ̂_ML = (1/n) Σ_{i=1}^n x_i. The ML estimate θ̂_ML can now be computed using the expression ∇G(θ) = μ, i.e., θ̂_ML = (∇G)^{-1}(μ̂_ML).

Lemma 1. If x is the sufficient statistic of the exponential family with the log-partition function G, and F is the dual function of G defined over the mean parameter space M, then (1) x ∈ M; (2) there exists a θ ∈ Θ such that x* = θ.

Proof. (1) By construction of M, we know x ∈ M. (2) From the duality of M and Θ, for every μ ∈ M there exists a θ ∈ Θ such that θ = μ*; since x ∈ M, this implies x* = θ.

Corollary 1 (ML as Bregman Median). Let G and X be defined as earlier, and let θ_i be the dual of x_i; then the ML estimate θ̂_ML of X solves the following optimization problem:

θ̂_ML = min_{θ∈Θ} Σ_{i=1}^n B_G(θ‖θ_i)    (6)

Proof. The proof directly follows from Lemma 1 and Theorem 1. From Lemma 1, we know that x_i* = θ_i. Using Theorem 1 and the expression B_F(x_i‖μ) = B_G(θ‖x_i*) = B_G(θ‖θ_i) gives the desired result.

The above expression requires us to find a θ such that the divergence from θ to the other θ_i is minimized. Now note that G is what defines this divergence and hence the geometry of the Θ space (as discussed earlier in Section 2). Since G is the log-partition function of an exponential family, it is the log-partition function that determines the geometry of the space. We emphasize that the divergence is measured from the parameter θ being estimated to the other parameters θ_i, as shown in Figure 3.

4.2 Conjugate Prior in the form of Bregman Divergence

We now give an expression similar to the likelihood for the conjugate prior (ignoring the term log m(α, β)):

log p(θ|α, β) = β(⟨θ, α/β⟩ − G(θ))    (7)

which can be written in the form of Bregman divergence by a direct comparison to (1), replacing x with α/β:

log p(θ|α, β) = β(F(α/β) − B_F(α/β‖∇G(θ)))    (8)

³ For simplicity of notation, we will use x instead of φ(x), assuming that x ∈ R^d. This does not change the analysis.
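As a quick numeric check of Theorem 1 and Corollary 1, here is a sketch of our own that continues the Bernoulli example above (it is not code from the paper, and the helper names are ours): the Bregman median in the mean-parameter space is the plain average of the sufficient statistics, and (∇G)^{-1} maps it back to the natural parameter.

# Sketch (our own example): Eq. (5)/(6) for Bernoulli data. The minimizer of
# sum_i B_F(x_i || mu) over mu is the plain average of the sufficient
# statistics, and theta_ML = (grad G)^{-1}(mu_ML) recovers the natural parameter.
import numpy as np

grad_G     = lambda t: 1.0 / (1.0 + np.exp(-t))    # mu = grad G(theta)
grad_G_inv = lambda m: np.log(m / (1.0 - m))       # theta = (grad G)^{-1}(mu)

X = np.array([1, 0, 1, 1, 0, 1], dtype=float)      # i.i.d. draws, phi(x) = x
mu_ML    = X.mean()                                # Bregman median in M
theta_ML = grad_G_inv(mu_ML)                       # pull back to Theta

# Check against a brute-force maximization of the log likelihood in theta.
thetas = np.linspace(-5, 5, 20001)
loglik = X.sum() * thetas - len(X) * np.log1p(np.exp(thetas))
assert abs(thetas[np.argmax(loglik)] - theta_ML) < 1e-3
print(mu_ML, theta_ML)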

The expression for the joint probability of data and parameters (combining all terms that do not depend on θ in const) is given by:

log p(x, θ|α, β) = const − B_F(x‖μ) − βB_F(α/β‖μ)    (9)

where μ = ∇G(θ).

Figure 3: Prior in the conjugate case has the same geometry as the likelihood, while in the non-conjugate case they have different geometries.

4.3 Geometric Interpretation of Conjugate Prior

In this section we give a geometric interpretation of the term B_F(x‖μ) + βB_F(α/β‖μ) from (9).

Theorem 2 (MAP as Bregman median). Given a set X of n i.i.d. examples drawn from the exponential family distribution with the log-partition function G and a conjugate prior as in (8), the MAP estimate of the parameters is θ̂_MAP = μ̂*_MAP, where μ̂_MAP solves the following problem:

μ̂_MAP = min_{μ∈M} Σ_{i=1}^n B_F(x_i‖μ) + βB_F(α/β‖μ)    (10)

which admits the following solution: μ̂_MAP = (Σ_{i=1}^n x_i + α) / (n + β).

Proof. The proof is a direct result of applying the definition of Bregman divergence to (9) for all points.

The above solution gives a natural interpretation of MAP estimation. One can think of the prior as β extra points at position α/β; β works as the effective sample size of the prior, which is clear from the following expression of the dual of θ̂_MAP:

μ̂_MAP = (Σ_{i=1}^n x_i + Σ_{i=1}^β α/β) / (n + β)    (11)

The expression (10) is analogous to (5) in the sense that both are defined in the dual space M. One can convert (10) into an expression similar to (6) in the dual space, which is again a Bregman median problem in the parameter space:

θ̂_MAP = min_{θ∈Θ} Σ_{i=1}^n B_G(θ‖θ_i) + Σ_{i=1}^β B_G(θ‖(α/β)*)    (12)

here (α/β)* ∈ Θ is the dual of α/β. The above problem is a Bregman median problem over n + β points, {θ_1, ..., θ_n, (α/β)*, ..., (α/β)*}, with (α/β)* repeated β times, as shown in Figure 3 (left).

A geometric interpretation is also shown in Figure 3. When the prior is conjugate to the likelihood, they both have the same log-partition function (Figure 3, left). Therefore they induce the same Bregman divergence. Having the same divergence means that the distances from θ to the θ_i (in the likelihood) and the distance from θ to (α/β)* are measured with the same divergence function, yielding the same geometry for both spaces. It is easier to see using the median formulation of the MAP estimation problem that one must choose a prior that is conjugate. If one chooses a conjugate prior, then the distances among all points are measured using the same function. It is also clear from (11) that in the conjugate prior case, the point induced by the conjugate prior behaves as a sample point (α/β)*. A median problem over a space that has different geometries is an ill-formed problem, as discussed further in the next section.

4.4 Geometric Interpretation of Non-conjugate Prior

We derived expression (12) because we considered the prior conjugate to the likelihood function. Had we chosen a non-conjugate prior with log-partition function Q, we would have obtained:

θ̂_MAP = min_{θ∈Θ} Σ_{i=1}^n B_G(θ‖θ_i) + Σ_{i=1}^β B_Q(θ‖(α/β)*)    (13)

Here G and Q are different functions defined over Θ. Since these are the functions that define the geometry of the parameter space, having G ≠ Q is equivalent to considering them as being defined over different (metric) spaces. Here, it should be noted that the distance between a sample point (θ_i) and the parameter θ is measured using the Bregman divergence B_G. On the other hand, the distance between the point induced by the prior, (α/β)*, and θ is measured using the divergence function B_Q. This means that (α/β)* cannot be treated as one of the sample points. This tells us that, unlike the conjugate case, belief in the non-conjugate prior cannot be encoded in the form of the sample points.

Another problem with considering a non-conjugate prior is that the dual space of Θ under different functions would be different. Thus, one will not be able to find the alternate expression for (13) equivalent to (10), and therefore will not be able to find the closed-form expression similar to (11). This tells us why the non-conjugate case does not give us a closed-form solution for θ̂_MAP. A pictorial representation of this is also shown in Figure 3. Note that, unlike the conjugate case, in the non-conjugate case the data likelihood and the prior belong to different spaces. We emphasize that it is possible to find the solution of (13); that is, in practice, there is nothing that prohibits the use of a non-conjugate prior. However, using the conjugate prior is intuitive, and allows one to treat the hyperparameters as pseudo data points.
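To make the pseudo-observation reading of (10)-(11) concrete, here is a small sketch of our own (a Beta-Bernoulli instance; the data, hyperparameter values, and helper names are ours, not the paper's) in which the conjugate prior contributes β extra points sitting at α/β.

# Sketch (our own Beta-Bernoulli instance): Eq. (11) says the MAP mean
# parameter treats the conjugate prior as beta pseudo-observations sitting at
# alpha/beta; the conjugate update is just an enlarged sample average.
import numpy as np

grad_G_inv = lambda m: np.log(m / (1.0 - m))   # Bernoulli: theta = logit(mu)

X = np.array([1, 0, 1, 1, 0, 1], dtype=float)
alpha, beta = 2.0, 4.0                          # hyperparameters of Eq. (2)

mu_MAP = (X.sum() + alpha) / (len(X) + beta)    # Eq. (11): (sum x_i + alpha)/(n + beta)

# The same number, written as an ordinary average over n + beta points,
# beta of which sit at the "effective sample point" alpha/beta = 0.5.
augmented = np.concatenate([X, np.full(int(beta), alpha / beta)])
assert np.allclose(mu_MAP, augmented.mean())

theta_MAP = grad_G_inv(mu_MAP)                  # back to the natural parameters
print(mu_MAP, theta_MAP)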
5 Information Geometric View

In this section, we show the appropriateness of the conjugate prior from the information-geometric angle. In information geometry, Θ is a statistical manifold such that each θ ∈ Θ defines a probability distribution. This statistical manifold has an inherent geometry, given by a metric and an affine connection. One natural metric is the Fisher information metric because of its many attractive properties: it is Riemannian and is invariant under reparameterization (for more details refer to [2]).
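For exponential families, the claim that the log-partition function carries this Fisher geometry can be checked directly, since the Fisher information in the natural parameters is the Hessian ∇²G(θ) = Cov[φ(x)]. The following is a small numeric check of our own (Bernoulli case; not from the paper); the next paragraph makes the same point via the KL/Bregman divergence of G.

# Sketch (our own check of Section 5's claim, Bernoulli case): the Fisher
# information in the natural parameter equals the second derivative of the
# log-partition function, G''(theta) = Var[phi(x)] = mu (1 - mu).
import numpy as np

theta = 0.7
mu = 1.0 / (1.0 + np.exp(-theta))      # grad G(theta)
fisher_from_G = mu * (1.0 - mu)        # Hessian of G = variance of phi(x)

# Fisher information from its definition E[(d/dtheta log p(x;theta))^2],
# summing over the two outcomes x in {0, 1}.
score_sq = lambda x: (x - mu) ** 2     # d/dtheta log p(x;theta) = x - grad G(theta)
fisher_from_def = (1 - mu) * score_sq(0) + mu * score_sq(1)

assert np.allclose(fisher_from_G, fisher_from_def)
print(fisher_from_G)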

In exponential family distributions, the Fisher metric M(θ) is induced by the KL-divergence KL(·‖θ), which is equivalent to the Bregman divergence defined by the log-partition function. Thus, it is the log-partition function G that induces the Fisher metric, and therefore determines the natural geometry of the space. This justifies our earlier argument for choosing the log-partition function to define the geometry. Now, if we were to treat the prior as a point on the statistical manifold defined by the likelihood model, the Fisher information metric on the point given by the prior must be the same as the one defined on the likelihood manifold. This means that the prior must have the same log-partition function as the likelihood, i.e., it must be conjugate.

6 Hybrid Model

In this section, we show an application of our analysis to a common supervised and semi-supervised learning framework. In particular, we consider a generative/discriminative hybrid model [1; 6; 7] that has been shown to be successful in many applications. The hybrid model is a mixture of discriminative and generative models, each of which has its own separate set of parameters. These two sets of parameters (hence two models) are combined using a prior called the coupling prior. Let p(y|x, θ_d) be the discriminative component, p(x, y|θ_g) be the generative component and p(θ_d, θ_g) be the coupling prior; the joint likelihood of the data and parameters can be written as (combining all three):

p(x, y, θ_d, θ_g) = p(θ_g, θ_d) p(y|x, θ_d) Σ_{y′} p(x, y′|θ_g)    (14)

The most important aspect of this model is the coupling prior p(θ_g, θ_d), which interpolates the hybrid model between two extremes: fully generative when the prior forces θ_d = θ_g, and fully discriminative when the prior renders θ_d and θ_g independent. In non-extreme cases, the goal of the coupling prior is to encourage the generative model and the discriminative model to have similar parameters. It is easy to see that this effect can be induced by many functions. One obvious way is to linearly interpolate them, as done by [7; 6] using a Gaussian prior (or the Euclidean distance) of the following form:

p(θ_g, θ_d) ∝ exp(−λ ‖θ_g − θ_d‖²)    (15)

where, when λ = 0, the model is purely discriminative, while for λ = ∞, the model is purely generative. Thus λ in the above expression is the interpolating parameter, and plays the same role as γ in Section 2. Note that the log of the prior is nothing but the squared Euclidean distance between the two sets of parameters.

It has been noted multiple times [4; 1] that a Gaussian prior is not always appropriate, and the prior should instead be chosen according to the models being considered. Agarwal et al. [1] suggested using a prior that is conjugate to the generative model. Their main argument for choosing the conjugate prior came from the fact that this provides a closed-form solution for the generative parameters and therefore is mathematically convenient. We will show that it is more than convenience that makes the conjugate prior appropriate. Moreover, our analysis does not assume anything about the expression and the hyperparameters of the prior beforehand, but rather derives them automatically.

Figure 4: Parameters θ_d and θ_g are interpolated using the Bregman divergence.

6.1 Generalized Hybrid Model

In order to see the effect of the geometry, we present the discriminative and generative models associated with the hybrid model in Bregman divergence form and obtain their geometry. Following the expression used in [1], the generative model can be written as:

p(x, y|θ_g) = h(x, y) exp(⟨θ_g, T(x, y)⟩ − G(θ_g))    (16)

where T(·) is the potential function similar to φ in (1), now defined on (x, y). Let G* be the dual function of G; the corresponding Bregman divergence is given by B_{G*}((x, y)‖∇G(θ_g)). Solving the generative model independently reduces to choosing a θ_g from the space of all generative parameters Θ_g, which has a geometry defined by the log-partition function G. Similarly to the generative model, the exponential form of the discriminative model is given as:

p(y|x, θ_d) = exp(⟨θ_d, T(x, y)⟩ − M(θ_d, x))    (17)

Importantly, the sufficient statistics T are the same in the generative and discriminative models; such generative/discriminative pairs occur naturally: logistic regression/naive Bayes and hidden Markov models/conditional random fields are examples. However, observe that in the discriminative case, the log-partition function M depends on both x and θ_d, which makes the analysis of the discriminative model (and hence of the hybrid model) harder.

6.2 Geometry of the Hybrid Model

We simplify the analysis of the hybrid model by rewriting the discriminative model in a form that makes its underlying geometry obvious. Note that the only difference between the two models is that the discriminative model models the conditional distribution while the generative model models the joint distribution. We can use this observation to write the discriminative model in the following alternate form, using the expression p(y|x, θ) = p(y, x|θ) / Σ_{y′} p(y′, x|θ) and (16):

p(y|x, θ_d) = h(x, y) exp(⟨θ_d, T(x, y)⟩ − G(θ_d)) / Σ_{y′} h(x, y′) exp(⟨θ_d, T(x, y′)⟩ − G(θ_d))    (18)

Denote the space of parameters of the discriminative model by Θ_d. It is easy to see that the geometry of Θ_d is defined by G, since the function G is defined over θ_d. This is the same as the geometry of the parameter space of the generative model Θ_g. Now let us define a new space Θ_H which is the affine combination of Θ_d and Θ_g. Θ_H will have the same geometry as Θ_d and Θ_g, i.e., the geometry defined by G. Now the goal of the hybrid model is to find a θ ∈ Θ_H that maximizes the likelihood of the data under the hybrid model. These two spaces are shown pictorially in Figure 4.
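The following is a small self-contained sketch of (16)-(18) on a finite toy domain. The domain, feature map, and parameter values are our own construction (not an example from the paper); it only illustrates that the joint and the conditional share the sufficient statistics T(x, y), while the normalizer changes from G(θ) to an x-dependent M(θ, x).

# Sketch (our own finite toy instance of Eqs. (16)-(18), not from the paper):
# a joint exponential family over (x, y) and the conditional obtained from it.
# The same sufficient statistics T(x, y) appear in both; only the normalizer
# changes: G(theta) for the joint, M(theta, x) for the conditional. Here h = 1.
import numpy as np

X_vals, Y_vals = [0, 1, 2], [0, 1]                       # finite toy domain
T = lambda x, y: np.array([x * y, y, x], dtype=float)    # shared sufficient statistics

def G(theta):                                            # joint log-partition, Eq. (16)
    return np.log(sum(np.exp(theta @ T(x, y)) for x in X_vals for y in Y_vals))

def M(theta, x):                                         # conditional log-partition, Eq. (17)
    return np.log(sum(np.exp(theta @ T(x, y)) for y in Y_vals))

theta = np.array([0.5, -1.0, 0.2])

def p_joint(x, y):                                       # generative member
    return np.exp(theta @ T(x, y) - G(theta))

def p_cond(y, x):                                        # discriminative member
    return np.exp(theta @ T(x, y) - M(theta, x))

# Eq. (18): the conditional is the joint renormalized over y; G(theta) cancels.
for x in X_vals:
    for y in Y_vals:
        assert np.allclose(p_cond(y, x),
                           p_joint(x, y) / sum(p_joint(x, yp) for yp in Y_vals))
print(p_cond(1, 2))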

6.3 Prior Selection

As mentioned earlier, the coupling prior is the most important part of the hybrid model, since it controls the amount of coupling between the generative and discriminative models. There are many ways to do this, one of which is given by [7; 6]. By their choice of a Gaussian prior as the coupling prior, they implicitly couple the discriminative and generative parameters by the squared Euclidean distance. We suggest coupling these two models by a general prior, of which the Gaussian prior is a special case.

Bregman Divergence and Coupling Prior:
Let a general coupling be given by B_S(θ_g‖θ_d). Notice the direction of the divergence. We have chosen this direction because the prior is induced on the generative parameters, and it is clear from (12) that the parameters on which the prior is induced are placed in the first argument of the divergence function. The direction of the divergence is also shown in Figure 4. Now we rewrite (8), replacing ∇G(θ) by θ*:

log p(θ_g|α, β) = β(F(α/β) − B_F(α/β‖θ_g*))    (19)

Now taking α = λθ_d* and β = λ, we get:

p(θ_g|λθ_d*, λ) = exp(λF(θ_d*)) exp(−λB_F(θ_d*‖θ_g*))    (20)

For the general coupling divergence function B_S(θ_g‖θ_d), the corresponding coupling prior is given by:

exp(−λB_{S*}(θ_d*‖θ_g*)) = exp(−λF(θ_d*)) p(θ_g|λθ_d*, λ)    (21)

The above relationship between the divergence function (left side of the expression) and the coupling prior (right side of the expression) allows one to define a Bregman divergence for a given coupling prior and vice versa.

Coupling Prior for the Hybrid Model:
We now use (21) to derive the expression for the coupling prior using the geometry of the hybrid model, which is given by the log-partition function G of the generative model. This argument suggests coupling the hybrid model by the divergence B_G(θ_g‖θ_d), which gives the coupling prior as:

exp(−λB_G(θ_g‖θ_d)) = p(θ_g|λθ_d*, λ) exp(−λF(θ_d*))    (22)

where λ ∈ [0, ∞] is the interpolation parameter, interpolating between the discriminative and generative extremes. In dual form, the above expression can be written as:

exp(−λB_G(θ_g‖θ_d)) = p(θ_g|λθ_d*, λ) exp(−λG(θ_d))    (23)

Here exp(−λG(θ_d)) can be thought of as a prior on the discriminative parameters, p(θ_d). In the above expression, exp(−λB_G(θ_g‖θ_d)) = p(θ_g|θ_d) p(θ_d) behaves as a joint coupling prior p(θ_d, θ_g), as originally expected in the model (14). Note that the hyperparameters of the prior, α and β, are naturally derived from the geometric view of the conjugate prior. Here α = λθ_d* and β = λ.
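To see how the coupling term of (22)-(23) behaves, here is a sketch of our own (a scalar instance using the Bernoulli log-partition function; the grid, values, and names are ours, not the paper's): exp(−λB_G(θ_g‖θ_d)) is maximized at θ_g = θ_d, leaves the two parameter sets uncoupled as λ → 0, and pins θ_g to θ_d as λ grows.

# Sketch (our own scalar instance, Bernoulli log-partition): the coupling term
# exp(-lambda * B_G(theta_g || theta_d)) of Eqs. (22)-(23). lambda = 0 leaves
# theta_d and theta_g uncoupled (fully discriminative); large lambda forces
# theta_g -> theta_d (fully generative).
import numpy as np

G      = lambda t: np.log1p(np.exp(t))          # shared log-partition function
grad_G = lambda t: 1.0 / (1.0 + np.exp(-t))

def bregman_G(a, b):                             # B_G(a || b) = G(a) - G(b) - G'(b)(a - b)
    return G(a) - G(b) - grad_G(b) * (a - b)

def log_coupling_prior(theta_g, theta_d, lam):   # log exp(-lambda B_G(theta_g || theta_d))
    return -lam * bregman_G(theta_g, theta_d)

theta_d = 0.8
grid = np.linspace(-4, 4, 8001)
for lam in [0.0, 1.0, 100.0]:
    weights = np.exp(log_coupling_prior(grid, theta_d, lam))
    mode = grid[np.argmax(weights)]
    print(lam, mode)   # lambda = 0: flat (no coupling); larger lambda: mode at theta_d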

Relation with Agarwal et al.:
The prior we derived in the previous section turns out to be exactly the same as that proposed by Agarwal et al. [1], even though theirs was not formally justified. In that work, the authors break the coupling prior p(θ_g, θ_d) into two parts: p(θ_d) and p(θ_g|θ_d). They then derive an expression for p(θ_g|θ_d) based on the intuition that the mode of p(θ_g|θ_d) should be θ_d. Our analysis takes a different approach by coupling the two models with the Bregman divergence rather than the prior, and results in the same expression and hyperparameters for the prior as in [1].

7 Related Work and Conclusion

To our knowledge, there have been no previous attempts to understand Bayesian priors from a geometric perspective. One related piece of work [8] uses the Bayesian framework to find the best prior for a given distribution. It is noted that, in that work, the authors use the δ-geometry for the data space and the α-geometry for the prior space, and then show the different cases for different values of (δ, α). We emphasize that even though it is possible to use different geometries for the two spaces, it always makes more sense to use the same geometry. As mentioned in Remark 1 in [8], useful cases are obtained only when we consider the same geometry.

We have shown that by considering the geometry induced by a likelihood function, the natural prior that results is exactly the conjugate prior. We have used this geometric understanding of the conjugate prior to derive the coupling prior for the discriminative/generative hybrid model. Our derivation naturally gives us the expression and the hyperparameters of this coupling prior.

References

[1] Agarwal, A., Daumé III, H.: Exponential family hybrid semi-supervised learning. In: IJCAI. Pasadena, CA (2009)
[2] Amari, S.I., Nagaoka, H.: Methods of Information Geometry (Translations of Mathematical Monographs). American Mathematical Society (April 2001)
[3] Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6 (October 2005)
[4] Bouchard, G.: Bias-variance tradeoff in hybrid generative-discriminative models. In: ICMLA '07. pp. 124–129. IEEE Computer Society, Washington, DC, USA (2007)
[5] Collins, M., Dasgupta, S., Schapire, R.E.: A generalization of principal component analysis to the exponential family. In: NIPS 14. MIT Press (2001)
[6] Druck, G., Pal, C., McCallum, A., Zhu, X.: Semi-supervised classification with hybrid generative/discriminative methods. In: KDD '07. pp. 280–289. ACM, New York, NY, USA (2007)
[7] Lasserre, J.A., Bishop, C.M., Minka, T.P.: Principled hybrids of generative and discriminative models. In: CVPR '06. pp. 87–94. IEEE Computer Society, Washington, DC, USA (2006)
[8] Snoussi, H., Mohammad-Djafari, A.: Information geometry and prior selection (2002)
[9] Wainwright, M., Jordan, M.: Graphical models, exponential families, and variational inference. Tech. rep., University of California, Berkeley (2003)
[10] Wainwright, M.J., Jordan, M.I.: Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1(1-2), 1–305 (2008)
