All in the Exponential Family: Bregman Duality in Thermodynamic

Bregman Duality in Thermodynamic Variational Inference A. Conjugate Duality A.2. Dual Divergence The Bregman divergence associated with a convex function We can leverage convex duality to derive an alternative f :⌦ R can be written as (Banerjee et al., 2005): divergence based on the conjugate function ⇤. ! ⇤(⌘)=sup⌘ β (β)= ⌘ = β (β) D [p : q]=f(p)+f(q) p q, f(q) β · − ) r Bf h − r i = ⌘ β (β ) (36) · ⌘ − ⌘ The family of Bregman divergences includes many familiar quantities, including the KL divergence corresponding to The conjugate measures the maximum distance between the negative entropy generator f(p)= p log pd!. Geo- the line ⌘ β and the function (β), which occurs at the − · metrically, the divergence can be viewed as the difference unique point β⌘ where ⌘ = β (β). This yields a bijective R r between f(p) and its linear approximation around q. Since mapping between ⌘ and β for minimal exponential fami- f is convex, we know that a first order estimator will lie lies (Wainwright & Jordan, 2008). Thus, a distribution p may be indexed by either its natural parameters βp or mean below the function, yielding Df [p : q] 0. ≥ parameters ⌘p. For our purposes, we can let f , (β) = log Zβ over Noting that ( ⇤)⇤ = (β)=sup ⌘ β ⇤(⌘) (Boyd the domain of probability distributions indexed by natural ⌘ · − & Vandenberghe, 2004), we can use a similar argument as parameters of an exponential family (e.g. (13)) : above to write this correspondence as β = ⇤(⌘).We r⌘ can then write the dual divergence D ⇤ as: D [β : β ]= (β ) (β ) β β , (β ) p q p − q h p − q rβ q i (34) D [⌘p : ⌘q]= ⇤(⌘p) ⇤(⌘q) ⌘p ⌘q, ⌘ ⇤(⌘q) ⇤ − h − r i = ⇤(⌘ ) ⇤(⌘ ) ⌘ β + ⌘ β p − q − p · q q · q This is a common setting in the field of information geom- = ⇤(⌘p)+ (βq) ⌘p βq (37) etry (Amari, 2016), which introduces dually flat manifold − · structures based on the natural parameters and the mean where we have used (36) to simplify the underlined terms. parameters. Similarly, D [β : β ]= (β ) (β ) β β , (β ) A.1. KL Divergence as a Bregman Divergence p q p − q h p − q rβ q i = (βp) (βq) βp ⌘q + βq ⌘q For an exponential family with partition function (β) and − − · · = (β )+ ⇤(⌘ ) β ⌘ (38) sufficient statistics T (!) over a random variable !, the Breg- p q − p · q man divergence D corresponds to a KL divergence. Re- Comparing (37) and (38), we see that the divergences are calling that β (β)=⌘β = E⇡β [T (!)] from (16), we r simplify the definition (34) to obtain equivalent with the arguments reversed, so that: D [βp : βq]=D ⇤ [⌘q : ⌘p] (39) D [β : β ]= (β ) (β ) β ⌘ + β ⌘ p q p − q − p · q q · q = (βp) (βq) Eq[βp T (!)] This indicates that the Bregman divergence D should also − − · ⇤ be a KL divergence, but with the same order of arguments. + Eq[βq T (!)] · We derive this fact directly in (44) , after investigating the = Eq βq T (!) (βq) + Eq[⇡0(!)] form of the conjugate function ⇤. · − ⇥ log q(!) ⇤ A.3. Conjugate ⇤ as Negative Entropy | Eq βp T (!){z (βp) Eq[⇡0}(!)] − · − − We first treat the case of an exponential family with no base ⇥ log p(!) ⇤ measure ⇡0(!), with derivations including a base measure | q(!) {z } in App. A.4. For a distribution p in an exponential family, = Eq log p(!) indexed by β or ⌘ , we can write log p(!)=β T (!) p p p · − = D [q(!) p(!)] (35) (β). Then, (36) becomes: KL || ⇤(⌘p)=βp ⌘p (βp) (40) where we have added and subtracted terms involving the · − = βp Ep[T (!)] (βp) (41) base measure ⇡0(!), and used the definition of our expo- · − = log p(!) nential family from (13). The Bregman divergence D is Ep (42) thus equal to the KL divergence with arguments reversed. = H (!) (43) − p Bregman Duality in Thermodynamic Variational Inference since βp and (βp) are constant with respect to !. Utilizing B. Renyi Divergence Variational Inference ⇤(⌘p)=Ep log p(!) from above, the dual divergence with q becomes: In this section, we show that each intermediate partition function log Zβ corresponds to a scaled version of the Renyi´ D [⌘p : ⌘q]= ⇤(⌘p) ⇤(⌘q) ⌘p ⌘q, ⌘ ⇤(⌘q) VI objective (Li & Turner, 2016). ⇤ − h − r i L↵ = Ep log p(!) ⇤(⌘q) ⌘p βq + ⌘q βq − − · · To begin, we recall the definition of Renyi’s ↵ divergence. = Ep log p(!) ⌘p βq + (βq) − · 1 1 ↵ ↵ = Ep log p(!) Ep[T (!) βq]+ (βq) D↵[p q]= log q(!) − p(!) d! − · || ↵ 1 − Z = Ep log p(!) Ep log q(!) − = DKL[p(!) q(!)] (44) Note that this involves geometric mixtures similar to (14). || Pulling out the factor of log p(x) to consider normalized distributions over z x, we obtain the objective of Li & Thus, the conjugate function is the negative entropy and in- | duces the KL divergence as its Bregman divergence (Wain- Turner (2016). This is similar to the ELBO, but instead wright & Jordan, 2008). subtracts a Renyi divergence of order ↵. Note that, by ignoring the base distribution over !, we have 1 β β (β) = log q(z x) − p(x, z) d z instead assumed that ⇡0(!):=u(!) is uniform over the | domain. In the next section, we illustrate that the effect of Z = β log p(x) (1 β)D [p (z x) q (z x)] adding a base distribution is to turn the conjugate function − − β ✓ | || φ | ⇡ (!) = β log p(x) βD1 β[qφ(z x) p✓(z x)] into a KL divergence, with the base 0 in the second − − | || | argument. This is consistent with our derivation of negative := β 1 β entropy, since D [p (!) u(!)] = H (⌦) + const. L − KL β || − pβ where we have used the skew symmetry property ↵ A.4. Conjugate ⇤ as a KL Divergence D↵[p q]= 1 ↵ D1 ↵[q p] for 0 <↵<1 (Van Erven || − − || & Harremos, 2014). Note that 0 =0and 1 = log p✓(x) As noted above, the derivation of the conjugate ⇤(⌘) in L L as in Li & Turner (2016) and Sec. 3. (40)-(43) ignored the possibilty of a base distribution in our exponential family. We see that ⇤(⌘) takes the form of a KL divergence when considering a base measure ⇡0(!). C. TVO using Taylor Series Remainders ⇤(⌘)=supβ ⌘ (β) (45) Recall that in Sec. 4, we have viewed the KL divergence · − β D [β : β0] as the remainder in a first order Taylor approxi- = β⌘ ⌘ (β⌘) mation of (β) around β0. The TVO objectives correspond · − to the linear term in this approximation, with the gap in = E⇡β⌘ [β⌘ T (!)] (β⌘) · − TVOL(✓,φ, x) and TVOU (✓,φ, x) bounds amounting to a = E⇡β [β⌘ T (!)] (β⌘) E⇡β [log ⇡0(!)] ⌘ · − ± ⌘ sum of KL divergences or Taylor remainders. Thus, the TVO may be viewed as a first order method. = E⇡β [log ⇡β⌘ (!) log ⇡0(!)] ⌘ − = DKL[⇡β⌘ (!) ⇡0(!)] (46) Yet we may also ask, what happens when considering other || approximation orders? We proceed to show that thermody- Note that we have added and subtracted a factor of namic integration arises from a zero-order approximation, ⇡ log ⇡0(!) in the fourth line, where our base mea- E β⌘ while the symmetrized KL divergence corresponds to a sim- sure ⇡ (!)=q(z x) in the case of the TVO. Compar- 0 | ilar application of the fundamental theorem of calculus in ing with the derivations in (41)-(42), we need to include a the mean parameter space ⌘ = (β). In App. E, we β rβ term of Ep⇡0(!) in moving to an expected log-probability briefly describe how ‘higher-order’ TVO objectives might Ep log p(!), with the extra, subtracted base measure term be constructed, although these will no longer be guaranteed transforming the negative entropy into a KL divergence. to provide upper or lower bounds on likelihood. In the TVO setting, this corresponds to We will repeatedly utilize the integral form of the Taylor remainder theorem, which characterizes the error in a k-th ⇤(⌘)=DKL[⇡β (z x) q(z x)] . (47) ⌘ | || | order approximation of (x) around a, with β [a, x]2. 2 This identity can be derived using the fundamental theo- When including a base distribution, the induced Bregman rem of calculus and repeated integration by parts (see, e.g. divergence is still the KL divergence since, as in the derivation of (35), both Ep log p(!) and Ep log q(!) will contain 2We use generic variable x, not to be confused with data x, for terms involving the base distribution Ep log ⇡0(!). notational simplicity. Bregman Duality in Thermodynamic Variational Inference (Kountourogiannis & Loya, 2003) and references therein): sentation of the KL divergence: βk (k+1) x p✓(x, z) β (β) k R (x)= r (x β) dβ (48) DKL! [⇡βk 1 ⇡βk ]= (βk β) Var⇡β log dβ k − || − q (z x) a k! − φ βkZ 1 | Z − (53) C.1. Thermodynamic Integration as 0th Order Remainder Following similar arguments in the reverse direction, we can obtain an integral form for the TVO upper bound gap Consider a zero-order Taylor approximation of (1) around R1 (βk 1)=DKL[⇡βk ⇡βk 1 ] via the first-order approxi- a =0, which simply uses (0) as an estimator. Applying − || − mation of (βk 1) around a = βk. the remainder theorem, we obtain the identity (6) underly- − ing thermodynamic integration in the TVO: R1 (βk 1)= (βk 1) (βk)+(βk 1 βk) β (βk) − − − − − r (1) = (0) + R (1) =(βk βk 1) β (βk) ( (βk) (βk 1)) 0 (49) − − r − − − 1 βk 1 (1) (0) = β (β)dβ (50) − 2 (β) − r β 1 Z0 = r (βk 1 β) dβ (54) 1 1! − − p✓(x, z) βZk log p✓(x)= E⇡β log dβ (51) 0 qφ(z x) Z | Note that the TVO upper bound (10) arises from the second line, with R1 (βk 1) 0 and (βk βk 1) β (βk) where the last line follows as the definition of ⌘ = − ≥ − − r (β)= log Z in (16).

Load more