arXiv:0911.2784v2 [cs.IT] 5 Oct 2011

TO APPEAR IN IEEE TRANSACTIONS ON INFORMATION THEORY

On Bregman Distances and Divergences of Probability Measures

Wolfgang Stummer and Igor Vajda, Fellow, IEEE

Abstract—The paper introduces scaled Bregman distances of probability distributions which admit non-uniform contributions of observed events. They are introduced in a general form covering not only the distances of discrete and continuous stochastic observations, but also the distances of random processes and signals. It is shown that the scaled Bregman distances extend not only the classical ones studied in the previous literature, but also the information divergence and the related wider class of convex divergences of probability measures. An information processing theorem is established too, but only in the sense of invariance w.r.t. statistically sufficient transformations and not in the sense of universal monotonicity. Pathological situations where coding can increase the classical Bregman distance are illustrated by a concrete example. In addition to the classical areas of application of the Bregman distances and divergences such as recognition, classification, learning and evaluation of proximity of various features and signals, the paper mentions also the application to 3D-exploratory data analysis. Explicit expressions for the scaled Bregman distances are obtained in general exponential families, with concrete applications in the binomial, Poisson and Rayleigh families, and in the families of exponential processes such as the Poisson and diffusion processes, including the classical examples of the Wiener process and geometric Brownian motion.

Index Terms—Bregman distances, classification, divergences, exponential distributions, exponential processes, information retrieval, machine learning, statistical decision, sufficiency.

© 2011 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Manuscript received October 26, 2009; revised August 4, 2011. This work was supported by the MŠMT grant 1M0572 and the GAČR grant 102/07/1131.

W. Stummer is with the Department of Mathematics, University of Erlangen–Nürnberg, Cauerstrasse 11, 91058 Erlangen, Germany (e-mail: [email protected]), as well as with the School of Business and Economics of the University of Erlangen–Nürnberg.

I. Vajda is with the Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod Vodárenskou Věží 4, 182 08 Prague, Czech Republic (e-mail: [email protected]).

I. INTRODUCTION

BREGMAN (1967) introduced for convex functions φ: R^d → R with gradient ∇φ the Bregman distance

B_φ(p, q) = φ(p) − φ(q) − ∇φ(q)(p − q)    (1)

as a measure of dissimilarity of d-dimensional vectors p, q ∈ R^d. His motivation was the problem of convex programming, but it became widely applied in many other subsequent problems, in the literature under the name Bregman distance in spite of that it is not a distance in the usual sense (it is a pseudodistance which is reflexive but neither symmetric nor satisfying the triangle inequality).

In the optimization-theoretic context, the Bregman distances are usually studied in the general form (1) for differentiable functions φ – see, e.g., Csiszár and Matúš (2008, 2009), as well as the adjacent random projection and information-theoretic studies of Bauschke and Borwein (1997). In the statistical and information-theoretic context they are typically used in the separable form

B_φ(p, q) = Σ_{i=1}^d [φ(p_i) − φ(q_i) − φ'_+(q_i)(p_i − q_i)]    (2)

for vectors p = (p_1, ..., p_d), q = (q_1, ..., q_d) with nonnegative coordinates representing discrete generalized probability distributions (finite measures), and for functions φ: [0, ∞) → R which are convex on [0, ∞) and differentiable on (0, ∞) (the problem with the derivative at 0 is solved by resorting to the right-hand derivative φ'_+(0)). For example, the function φ(t) = t² leads to the classical squared Euclidean distance

B_φ(p, q) = Σ_{i=1}^d (p_i − q_i)².    (3)

Important alternatives to the Bregman distances (2) are the divergences

D_φ(p, q) = Σ_{i=1}^d q_i φ(p_i/q_i)    (4)

defined for functions φ: (0, ∞) → R which are convex on (0, ∞) and strictly convex at t = 1 with φ(1) = 0. Originating in the paper of Csiszár (1963), they became widely known under the name φ-divergences. They share some properties with the Bregman distances (2), e.g., they are pseudodistances too. For example, the function φ(t) = t ln t leads to the well-known Kullback divergence

D_φ(p, q) = Σ_{i=1}^d p_i ln(p_i/q_i).

Of course, the most common context are discrete probability distributions p, q, since vectors of hypothetical or observed frequencies are easily transformed to the relative frequencies normed to 1. For example, Csiszár (1991, 1994, 1995) or Pardo and Vajda (1997, 2003) used the Bregman distances of probability distributions in the context of information theory and asymptotic statistics. The concrete examples φ(t) = (t − 1)²
and φ(t) = t ln t lead in this case to the classical Pearson divergence

D_φ(p, q) = Σ_{i=1}^d (p_i − q_i)²/q_i    (5)

and the above mentioned Kullback divergence D_φ(p, q) ≡ B_φ(p, q), which are asymmetric in p, q and contradict the triangle inequality. On the other hand, φ(t) = |t − 1| leads to the L1-norm ||p − q|| which is a metric distance, and φ(t) = (t − 1)²/(t + 1) defines the LeCam divergence

D_φ(p, q) = Σ_{i=1}^d (p_i − q_i)²/(p_i + q_i)

which is a squared metric distance (for more about the metricity of φ-divergences the reader is referred to Vajda (2009)).

However, there exist also some sharp differences between these two types of pseudodistances of distributions. One distinguishing property of Bregman distances is that their use as loss criterion induces the conditional expectation as the unique optimal predictor from given data (cf. Banerjee et al. (2005a)); this is for instance used in Banerjee et al. (2005b) for designing generalizations of the k-means algorithm which deals with the special case of the squared Euclidean error (3) (cf. the seminal work of Lloyd (1982), reprinting a Technical Report of Bell Laboratories dated by 1957). These features are generally not shared by those of the φ-divergences which are not Bregman distances, e.g., by the Pearson divergence (5). On the other hand, a distinguishing property of φ-divergences is the information processing property, i.e., the impossibility to increase the value D_φ(p, q) by transformations of the observations distributed by p, q, and the preservation of this value by the statistically sufficient transformations (Csiszár (1967), see in this respect also Liese and Vajda (2006)). This property is not shared by the Bregman distances which are not φ-divergences. For example, the distributions p = (1/2, 1/4, 1/4) and q = (1, 0, 0) are mutually closer (less discernible) in the Euclidean sense (3) than their reductions p̃ = (1/2, 1/2) and q̃ = (1, 0) obtained by merging the second and third observation outcomes into one.

Depending on the need to exploit one or the other of these distinguished properties, the Bregman distances or the Csiszár divergences are preferred, and both of them are widely applied in important areas of information theory, statistics and computer science, for example in
(Ai) information retrieval (see, e.g., Do and Vetterli (2002), Hertz et al. (2004)),
(Aii) optimal decision (for general decision see, e.g., Boratynska (1997), Freund et al. (1997), Bartlett et al. (2006), Vajda and Zvárová (2007), for speech processing see, e.g., Carlson and Clements (1991), Veldhuis and Klabers (2002), and for image processing see, e.g., Xu and Osher (2007), Marquina and Osher (2008), Scherzer et al. (2008)),
(Aiii) machine learning (see, e.g., Lafferty (1999), Banerjee et al. (2005), Amari (2007), Teboulle (2007), Nock and Nielsen (2009)), and
(Aiv) parallel optimization and computing (see, e.g., Censor and Zenios (1997)).

In this context, the importance of the functionals of distributions which are simultaneously divergences in both the Csiszár and the Bregman sense – or, more broadly, of the research of relations between the Csiszár and Bregman divergences – is obvious. This paper is devoted to this research. It generalizes the separable Bregman distances (2) as well as the φ-divergences (4) by introducing the scaled Bregman distances which for the discrete setup reduce to

B_φ(p, q | m) = Σ_{i=1}^d [φ(p_i/m_i) − φ(q_i/m_i) − φ'_+(q_i/m_i)(p_i/m_i − q_i/m_i)] m_i    (6)

for arbitrary finite scale vectors m = (m_1, ..., m_d), convex functions φ and right-hand derivatives φ'_+. Obviously, the uniform scales m = (1, ..., 1) lead to the Bregman distances (2), and the probability distribution scales m = q = (q_1, ..., q_d) lead to the φ-divergences (4). We shall work out further interesting relations of the B_φ(p, q | m) distances to the φ-divergences D_φ(p, q) and D_φ(p, m), and evaluate explicit formulas for the stochastically scaled Bregman distances in arbitrary exponential families of distributions, including also the non-discrete setup.

Section II defines the φ-divergences D_φ(P, M) of general probability measures P and arbitrary finite measures M and briefly reviews their basic properties. Section III introduces the scaled Bregman distances B_φ(P, Q | M) and investigates their relations to the φ-divergences D_φ(P, Q) and D_φ(P, M). Section IV studies in detail the situation where all three measures P, Q, M are from the family of general exponential distributions. Finally, Section V illustrates the results by investigating concrete examples of P, Q, M from classical statistical families as well as from a family of important random processes.

Notational conventions: Throughout the paper, M denotes the space of all finite measures on a measurable space (X, A) and P ⊂ M the subspace of all probability measures. Unless otherwise explicitly stated, P, Q, M are mutually measure-theoretically equivalent measures on (X, A) dominated by a σ-finite measure λ on (X, A). Then the densities

p = dP/dλ,  q = dQ/dλ  and  m = dM/dλ    (7)

have a common support which will be identified with X (i.e., the densities (7) are positive on X). Unless otherwise explicitly stated, it is assumed that P, Q ∈ P, M ∈ M and that φ: (0, ∞) ↦ R is a continuous and convex function. It is known that then the possibly infinite extension φ(0) = lim_{t↓0} φ(t) and the right-hand derivatives φ'_+(t) for t ∈ [0, ∞) exist, and that the adjoint function

φ*(t) = t φ(1/t)    (8)

is continuous and convex on (0, ∞) with possibly infinite extension φ*(0). We shall assume that φ(1) ≡ φ*(1) = 0.
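The discrete scaled Bregman distance (6) is straightforward to compute. The following minimal Python sketch (our own code and naming, not part of the paper) checks the two boundary cases stated above – the uniform scales recover the Bregman distance (2), here the squared Euclidean distance (3), and the scales m = q recover the φ-divergence (4), here the Pearson divergence (5) – and reproduces the merging example from the Introduction.

```python
def scaled_bregman(p, q, m, phi, dphi):
    # discrete scaled Bregman distance (6); p, q, m are positive weight vectors
    return sum((phi(pi / mi) - phi(qi / mi) - dphi(qi / mi) * (pi - qi) / mi) * mi
               for pi, qi, mi in zip(p, q, m))

p = [0.3, 0.5, 0.2]
q = [0.2, 0.4, 0.4]

# phi(t) = t^2 with uniform scales m = (1,...,1): reduces to the squared
# Euclidean distance (3)
b_uniform = scaled_bregman(p, q, [1.0, 1.0, 1.0], lambda t: t * t, lambda t: 2 * t)
b_euclid = sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# phi(t) = (t - 1)^2 with probability scales m = q: reduces to the Pearson
# phi-divergence (5)
b_scale_q = scaled_bregman(p, q, q, lambda t: (t - 1) ** 2, lambda t: 2 * (t - 1))
d_pearson = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

# the merging example: the squared Euclidean distance (3) of p = (1/2,1/4,1/4)
# and q = (1,0,0) grows when the 2nd and 3rd outcomes are merged into one
dist_full = (0.5 - 1.0) ** 2 + 0.25 ** 2 + 0.25 ** 2    # 0.375
dist_merged = (0.5 - 1.0) ** 2 + (0.5 - 0.0) ** 2       # 0.5
```

The last two lines make the non-monotonicity under coarsening concrete: 0.375 < 0.5, so this particular Bregman distance is increased by the (non-sufficient) merging transformation.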

II. DIVERGENCES

For P ∈ P and M ∈ M we consider the divergences

D_φ(P, M) = ∫_X φ(p/m) dM = ∫_X m φ(p/m) dλ  (cf. (7))    (9)

generated by the same convex functions φ as considered in the formula (4) for discrete P and M. An important special case is D_φ(P, Q) with Q ∈ P.

The existence (but possible infinity) of the φ-divergences (9) follows from the bounds

φ'_+(1)(p − m) ≤ m φ(p/m) ≤ m φ(0) + p φ*(0)    (10)

on the integrand, leading to the φ-divergence bounds

φ'_+(1)(1 − M(X)) ≤ D_φ(P, M) ≤ M(X) φ(0) + φ*(0).    (11)

The integrand bounds (10) follow by putting s = 1 and t = p/m in the inequality

φ(s) + φ'_+(s)(t − s) ≤ φ(t) ≤ φ(0) + t φ*(0),    (12)

where the left-hand side is the well-known support line of φ(t) at t = s. The right-hand inequality is obvious for φ(0) = ∞. If φ(0) < ∞ then it follows by taking s → ∞ in the inequality

φ(t) ≤ φ(0) + t (φ(s) − φ(0))/s

obtained from the Jensen inequality for φ(t) situated between φ(0) and φ(s). Since the function ψ(p, m) = m φ(p/m) is homogeneous of order 1 in the sense ψ(tp, tm) = t ψ(p, m) for all t > 0, the divergences (9) do not depend on the choice of the dominating measure λ.

Notice that D_φ(P, M) might be negative. For probability measures P, Q the bounds (11) take on the form

0 ≤ D_φ(P, Q) ≤ φ(0) + φ*(0),    (13)

and the equalities are achieved under well-known conditions (cf. Liese and Vajda (1987), (2006)): the left equality holds if P = Q, and the right one holds if P ⊥ Q (singularity). Moreover, if φ(t) is strictly convex at t = 1, the first "if" can be replaced by "iff", and in the case φ(0) + φ*(0) < ∞ also the second "if" can be replaced by "iff".

An alternative to the left-hand inequality in (11), which extends the left-hand inequality in (13) including the conditions for the equality, is given by the following statement (for a systematic theory of φ-divergences of finite measures we refer to the recent paper of Stummer and Vajda (2010)).

Lemma 1: For every P ∈ P, M ∈ M one gets the lower divergence bound

M(X) φ(1/M(X)) ≤ D_φ(P, M),    (14)

where the equality holds if

p = m/M(X)  P-a.s.    (15)

If D_φ(P, M) < ∞ and φ(t) is strictly convex at t = 1/M(X), the equality in (14) holds if and only if (15) holds.

Proof: By (9) and the definition (8) of the convex function φ*,

D_φ(P, M) = ∫_X m φ(p/m) dλ = ∫_X φ*(m/p) dP.

Hence by Jensen's inequality

D_φ(P, M) ≥ φ*(∫_X (m/p) dP) = φ*(M(X)),    (16)

which proves the desired inequality (14). Since

m/p = M(X)  P-a.s.

is the condition for equality in (16), the rest is clear from the easily verifiable fact that φ*(t) is strictly convex at t = s if and only if φ(t) is strictly convex at t = 1/s. ∎

For some of the representation investigations below, it will also be useful to take into account that for probability measures P, Q we get directly from definition (9) the "skew symmetry" φ-divergence formula

D_{φ*}(P, Q) = D_φ(Q, P),

as well as the sufficiency of the condition

φ(t) − φ*(t) ≡ constant · (t − 1)    (17)

for the φ-divergence symmetry

D_φ(P, Q) = D_φ(Q, P) for all P, Q.    (18)

Liese and Vajda (1987) proved that under the assumed strict convexity of φ(t) at t = 1, the condition (17) is not only sufficient but also necessary for the symmetry (18).

III. SCALED BREGMAN DISTANCES

Let us now introduce the basic concept of the current paper, which is a measure-theoretic version of the Bregman distance (6). In this definition it is assumed that φ is a finite convex function in the domain t > 0, continuously extended to t = 0. As before, φ'_+(t) denotes the right-hand derivative which for such φ(t) exists, and p, q, m are the densities defined in (7).

Definition 1: The Bregman distance of probability measures P, Q scaled by an arbitrary measure M on (X, A) measure-theoretically equivalent with P, Q is defined by the formula

B_φ(P, Q | M) = ∫_X [φ(p/m) − φ(q/m) − φ'_+(q/m)(p/m − q/m)] dM    (19)
= ∫_X [m φ(p/m) − m φ(q/m) − φ'_+(q/m)(p − q)] dλ.

The convex φ under consideration can be interpreted as a generating function of the distance.
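For finite X, the integrals in (9) reduce to sums, and the "skew symmetry" of Section II can be checked numerically via the adjoint function (8). A minimal sketch with our own naming (the Kullback case φ(t) = t ln t is used only as an illustration):

```python
import math

def d_phi(p, m, phi):
    # discrete version of (9): D_phi(P, M) = sum_i m_i * phi(p_i / m_i)
    return sum(mi * phi(pi / mi) for pi, mi in zip(p, m))

def adjoint(phi):
    # adjoint function (8): phi*(t) = t * phi(1/t)
    return lambda t: t * phi(1.0 / t)

phi_kl = lambda t: t * math.log(t)        # convex on (0, inf), phi(1) = 0
p = [0.3, 0.5, 0.2]
q = [0.2, 0.4, 0.4]

kl_pq = d_phi(p, q, phi_kl)               # D_phi(P, Q), the Kullback divergence
skew = d_phi(p, q, adjoint(phi_kl))       # D_{phi*}(P, Q)
kl_qp = d_phi(q, p, phi_kl)               # D_phi(Q, P); should equal skew
```

The nonnegativity of both Kullback divergences also illustrates the left-hand bound in (13).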

Remarks 1: (1) By putting t = p/m and s = q/m in (12), we find the argument of the integral in (19) to be nonnegative. Hence the Bregman distance B_φ(P, Q | M) is well-defined by (19) and is always nonnegative (possibly infinite).

(2) Notice that the integrand in the first (respectively second) integral of (19) constitutes a function, say, Υ(p, q, m) (respectively Υ̃(p, q, m)) which is homogeneous of order 0 (respectively order 1), i.e., for all t > 0 there holds Υ(tp, tq, tm) = Υ(p, q, m) (respectively Υ̃(tp, tq, tm) = t · Υ̃(p, q, m)). Analogously, as already partially indicated above, the integrand in the first (respectively second) integral of (9) is also a function, say, ψ(p, m) (respectively ψ̃(p, m)) which is homogeneous of order 0 (respectively order 1).

(3) In our measure-theoretic context (19), we have incorporated the possible non-differentiability of φ by using its right-hand derivative, which will be essential at several places below. For general Banach spaces, one typically employs various directional derivatives – see, e.g., Butnariu and Resmerita (2006) in connection with different types of convexity properties.

The special scaled Bregman distances B_φ(P, Q | M) for probability scales M ∈ P were introduced by Stummer (2007). Let us mention some other important previously considered special cases.

(a) For finite or countable X and the counting measure M = λ, some authors were already cited above in connection with the formula (2) and the research areas (Ai) - (Aiii). In addition to them, one can mention also Byrne (1999), Collins et al. (2002), Murata et al. (2004), Cesa-Bianchi and Lugosi (2006).

(b) For an open Euclidean set X and the Lebesgue measure M = λ on it, one can mention Jones and Byrne (1990), as well as Resmerita and Anderssen (2007).

In the rest of this paper, we restrict ourselves to the Bregman distances B_φ(P, Q | M) scaled by finite measures M ∈ M and to the same class of convex functions φ as considered in the φ-divergence formulas (4) and (9). By using the remark after Definition 1 and applying (12) we get

D_φ(P, M) ≥ D_φ(Q, M) + ∫_X φ'_+(q/m)(p − q) dλ

if at least one of the right-hand side expressions is finite. Similarly,

B_φ(P, Q | M) = D_φ(P, M) − D_φ(Q, M) − ∫_X φ'_+(q/m)(p − q) dλ    (20)

if at least two of the right-hand side expressions are finite (which can be checked, e.g., by using (11) or (14)).

The formula (19) simplifies in the important special cases M = P and M = Q. In the first case, due to φ(1) = 0, it reduces to

B_φ(P, Q | P) = ∫_X [φ'_+(q/p)(q − p) − p φ(q/p)] dλ
= ∫_X φ'_+(q/p)(q − p) dλ − D_φ(Q, P),    (21)

where the difference (21) is meaningful if and only if D_φ(Q, P) ≡ D_{φ*}(P, Q) is finite. The nonnegative divergence measure ℬ_φ(P, Q) := B_φ(P, Q | P) is thus the difference between the nonnegative dissimilarity measure

𝒟_φ(Q, P) = ∫_X φ'_+(q/p)(q − p) dλ ≥ D_φ(Q, P)

and the nonnegative φ-divergence D_φ(Q, P). Furthermore, in the second special case M = Q, the formula (19) leads to the equality

B_φ(P, Q | Q) = D_φ(P, Q)    (22)

without any restriction on P, Q ∈ P, as realized already by Stummer (2007).

Conclusion 1: Equality (22) – together with the fact that B_φ(P, Q | M) depends in general on M (see, e.g., Subsection B below) – shows that the concept of scaled Bregman distance (19) strictly generalizes the concept of φ-divergence D_φ(P, Q) of probability measures P, Q.

Example 1: As an illustration not considered earlier, we can take the non-differentiable function φ(t) = |t − 1| for which

B_φ(P, Q | Q) = V(P, Q),

i.e., this particular scaled Bregman distance reduces to the well-known total variation.

As demonstrated by an example in the Introduction, measurable transformations (statistics)

T: (X, A) ↦ (Y, B)    (23)

which are not sufficient for the pair {P, Q} can increase those of the scaled Bregman distances B_φ(P, Q | M) which are not φ-divergences. On the other hand, the transformations (23) which are sufficient for the pair {P, Q} need not preserve these distances either. Next we formulate conditions under which the scaled Bregman distances B_φ(P, Q | M) are preserved by transformations of observations.

Definition 2: We say that the transformation (23) is sufficient for the triplet {P, Q, M} if there exist measurable functions g_P, g_Q, g_M: Y ↦ R and h: X ↦ R such that

p(x) = g_P(Tx) h(x),  q(x) = g_Q(Tx) h(x)  and  m(x) = g_M(Tx) h(x).    (24)

If M is a probability measure then our definition reduces to the classical statistical sufficiency of the statistic T for the family {P, Q, M} (see pp. 18-19 in Lehmann (2005)). All transformations (23) induce the probability measures PT⁻¹, QT⁻¹ and the finite measure MT⁻¹ on (Y, B). We prove that the scaled Bregman distances of the induced probability measures PT⁻¹, QT⁻¹ scaled by MT⁻¹ are preserved by sufficient transformations T.
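Example 1 can be verified directly. The sketch below (our own code; the right-hand derivative of the non-differentiable φ(t) = |t − 1| is coded explicitly) confirms that with the scale M = Q the equality (22) produces the total variation V(P, Q) = Σ_i |p_i − q_i| in the discrete case:

```python
def scaled_bregman(p, q, m, phi, dphi_plus):
    # discrete version of (19), i.e. formula (6)
    return sum((phi(pi / mi) - phi(qi / mi) - dphi_plus(qi / mi) * (pi - qi) / mi) * mi
               for pi, qi, mi in zip(p, q, m))

phi_tv = lambda t: abs(t - 1)
dphi_tv = lambda t: 1.0 if t >= 1 else -1.0   # right-hand derivative of |t - 1|

p = [0.3, 0.5, 0.2]
q = [0.2, 0.4, 0.4]

# Example 1: B_phi(P, Q | Q) = V(P, Q), the total variation
b = scaled_bregman(p, q, q, phi_tv, dphi_tv)
tv = sum(abs(pi - qi) for pi, qi in zip(p, q))
```

Note that only the right-hand derivative at t = 1 enters here, since q/m = 1 everywhere for M = Q.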

Theorem 1: The transformations (23) sufficient for the triplet {P, Q, M} preserve the scaled Bregman distances in the sense that

B_φ(PT⁻¹, QT⁻¹ | MT⁻¹) = B_φ(P, Q | M).    (25)

Proof: By (19) and (24), the right-hand side of (25) is equal to

∫_X [φ_{P,M}(Tx) − φ_{Q,M}(Tx) − Δ_{P,Q,M}(Tx)] dM    (26)

for

φ_{P,M}(y) = φ(g_P(y)/g_M(y)),  φ_{Q,M}(y) = φ(g_Q(y)/g_M(y))    (27)

and

Δ_{P,Q,M}(y) = φ'_+(g_Q(y)/g_M(y)) (g_P(y) − g_Q(y))/g_M(y).    (28)

By Theorem D in Section 39 of Halmos (1964), the integral (26) is equal to

∫_Y [φ_{P,M}(y) − φ_{Q,M}(y) − Δ_{P,Q,M}(y)] dMT⁻¹    (29)

and, moreover,

P(T⁻¹B) = ∫_B g_P(y) h(T⁻¹y) dλT⁻¹

and similarly for Q instead of P. Therefore

dPT⁻¹/dλT⁻¹ = g_P(y) h(T⁻¹y)  and  dQT⁻¹/dλT⁻¹ = g_Q(y) h(T⁻¹y),

which together with (27), (28) and (19) implies that the integral (29) is nothing but the left-hand side of (25). This completes the proof. ∎

Remark 2: Notice that by means of Remark 1(2) after Definition 1, the assertion of Theorem 1 can be principally related to the preservation of φ-divergences by transformations which are sufficient for the pair {P, Q}.

In the rest of this section we discuss some important special classes of scaled Bregman distances obtained for special distance-generating functions φ.

A. Bregman logarithmic distance

Let us consider the special function φ(t) = t ln t. Then φ'(t) = ln t + 1, so that (19) implies

B_{t ln t}(P, Q | M) = ∫_X [p ln(p/m) − q ln(q/m) − (ln(q/m) + 1)(p − q)] dλ
= ∫_X [p ln(p/m) − p ln(q/m)] dλ
= ∫_X p ln(p/q) dλ = D_{t ln t}(P, Q).    (30)

Thus, for φ(t) = t ln t the Bregman distance B_φ(P, Q | M) exceptionally does not depend on the choice of the scaling and reference measures M and λ; in fact, it always leads to the Kullback-Leibler information divergence (relative entropy) D_{t ln t}(P, Q) (cf. Stummer (2007)). As a side effect, this independence gives also rise to examples for the conclusion that the validity of (25) does generally not imply that T is sufficient for the triplet {P, Q, M}.

B. Bregman reversed logarithmic distance

Let now φ(t) = −ln t so that φ'(t) = −1/t. Then (19) implies

B_{−ln t}(P, Q | M) = ∫_X [m ln(m/p) − m ln(m/q) + (m/q)(p − q)] dλ    (31)
= D_{t ln t}(M, P) − D_{t ln t}(M, Q) + ∫_X (mp/q) dλ − M(X)    (32)
= D_{−ln t}(P, M) − D_{−ln t}(Q, M) + ∫_X (mp/q) dλ − M(X),    (33)

where the equalities (32) and (33) hold if at least two out of the first three expressions on the right-hand side are finite. In particular, (31) implies (consistently with (22))

B_{−ln t}(P, Q | Q) = D_{−ln t}(P, Q)    (34)

and (32) implies for D_{t ln t}(P, Q) < ∞ (consistently with (21))

B_{−ln t}(P, Q | P) = χ²(P, Q) − D_{t ln t}(P, Q),    (35)

where

χ²(P, Q) = ∫_X (p − q)²/q dλ

is the well-known Pearson information divergence. From (34) and (35) one can also see that the Bregman distance B_φ(P, Q | M) does in general depend on the choice of the reference measure M.

C. Bregman power distances

In this subsection we restrict ourselves for simplicity to probability measures M ∈ P, i.e., we suppose M(X) = 1. Under this assumption we investigate the scaled Bregman distances

B_α(P, Q | M) = B_{φ_α}(P, Q | M),  α ∈ R, α ≠ 0, α ≠ 1    (36)

for the family of power convex functions

φ_α(t) = (t^α − 1)/(α(α − 1))  with  φ'_α(t) = t^{α−1}/(α − 1).    (37)

For comparison and representation purposes, we use for P (and analogously for Q instead of P) the power divergences

D_α(P, M) = D_{φ_α}(P, M) = (1/(α(α − 1))) [∫_X p^α m^{1−α} dλ − 1]
= (exp{ρ_α(P, M)} − 1)/(α(α − 1))  with  ρ_α(P, M) = ln ∫_X p^α m^{1−α} dλ    (38)

of real powers α different from 0 and 1, studied for arbitrary probability measures P, M in Liese and Vajda (1987). They are one-one related to the Rényi divergences

R_α(P, M) = ρ_α(P, M)/(α(α − 1)),  α ∈ R, α ≠ 0, α ≠ 1,

introduced in Liese and Vajda (1987) as an extension of the original narrower class of the divergences

R̃_α(P, M) = ρ_α(P, M)/(α − 1),  α > 0, α ≠ 1,

of Rényi (1961).

Returning now to the Bregman power distances, observe that if D_α(P, M) + D_α(Q, M) is finite, then (20), (36) and (37) imply for α ≠ 0, α ≠ 1

B_α(P, Q | M) = D_α(P, M) − D_α(Q, M) − (1/(α − 1)) ∫_X (q/m)^{α−1}(p − q) dλ
= D_α(P, M) − D_α(Q, M) − (1/(α − 1)) ∫_X [(q/m)^{α−1} p − (q/m)^α m] dλ
= D_α(P, M) − (1 − α) D_α(Q, M) − (1/(α − 1)) [∫_X p (q/m)^{α−1} dλ − 1].    (39)

In particular, we get from here (consistently with (22))

B_α(P, Q | Q) = D_α(P, Q),

and in the case of D_α(Q, P) ≡ D_{1−α}(P, Q) < ∞ also

B_α(P, Q | P) = (α − 1) D_α(Q, P) − (α − 2) D_{α−1}(Q, P)
≡ (α − 1) D_{1−α}(P, Q) − (α − 2) D_{2−α}(P, Q).

In the following theorem, and elsewhere in the sequel, we use the simplified notation

D_1(P, M) = D_{t ln t}(P, M)  and  D_0(P, M) = D_{−ln t}(P, M)

for the probability measures P, M under consideration (and also later on where M is only a finite measure). This step is motivated by the limit relations

lim_{α↓0} D_α(P, M) = D_{−ln t}(P, M)  and  lim_{α↑1} D_α(P, M) = D_{t ln t}(P, M)    (40)

proved as Proposition 2.9 in Liese and Vajda (1987) for arbitrary probability measures P, M. Applying these relations to the Bregman distances, we obtain

Theorem 2: If D_0(P, M) + D_0(Q, M) < ∞ then

lim_{α↓0} B_α(P, Q | M) = D_0(P, M) − D_0(Q, M) + ∫_X (mp/q) dλ − 1    (41)
= B_{−ln t}(P, Q | M).    (42)

If D_1(P, M) + D_1(Q, M) < ∞ and

lim_{β↓0} ∫_X ((q/m)^{−β} − 1)/β dP = ∫_X lim_{β↓0} ((q/m)^{−β} − 1)/β dP = −∫_X ln(q/m) dP,    (43)

then

lim_{α↑1} B_α(P, Q | M) = D_1(P, M) − ∫_X ln(q/m) dP    (44)
= D_1(P, Q) = B_{t ln t}(P, Q | M).    (45)

Proof: If 0 < α < 1 then D_α(P, M), D_α(Q, M) are finite so that (39) holds. Applying the first relation of (40) in (39), we get (41), where the right-hand side is well defined because D_0(P, M) + D_0(Q, M) is by assumption finite. Similarly, by using the second relation of (40) and the assumption (43) in (39), we end up at (44), where the right-hand side is well defined because D_1(P, M) + D_1(Q, M) is assumed to be finite. The identity (42) follows from (41), (33), and the identity (45) from (44), (30). ∎

Motivated by this theorem, we introduce for all probability measures P, Q, M under consideration the simplified notations

B_1(P, Q | M) ≡ B_{t ln t}(P, Q | M)    (46)

and

B_0(P, Q | M) ≡ B_{−ln t}(P, Q | M),    (47)

and thus, (45) and (42) become

B_1(P, Q | M) = lim_{α↑1} B_α(P, Q | M)  and  B_0(P, Q | M) = lim_{α↓0} B_α(P, Q | M).

Furthermore, in these notations the relations (30), (34) and (35) reformulate (under the corresponding assumptions) as follows:

B_1(P, Q | M) = D_1(P, Q),

B_0(P, Q | Q) = D_0(P, Q)

and

B_0(P, Q | P) = χ²(P, Q) − D_1(P, Q) = 2 D_2(P, Q) − D_1(P, Q).    (48)
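The limit relations above can be probed numerically. In the following sketch (our own code, not part of the paper), B_α is computed directly from the power functions (37); for α close to 1 it approaches B_1(P, Q | M) = D_1(P, Q) for an arbitrary probability scale M (cf. (45)), and for α close to 0 with M = P it approaches B_0(P, Q | P) = χ²(P, Q) − D_1(P, Q) of (48):

```python
import math

def scaled_bregman(p, q, m, phi, dphi):
    # discrete scaled Bregman distance (6)
    return sum((phi(pi / mi) - phi(qi / mi) - dphi(qi / mi) * (pi - qi) / mi) * mi
               for pi, qi, mi in zip(p, q, m))

def b_alpha(p, q, m, a):
    # power convex functions (37): phi_a(t) = (t^a - 1)/(a(a-1)), phi_a'(t) = t^(a-1)/(a-1)
    phi = lambda t: (t ** a - 1) / (a * (a - 1))
    dphi = lambda t: t ** (a - 1) / (a - 1)
    return scaled_bregman(p, q, m, phi, dphi)

p = [0.3, 0.5, 0.2]
q = [0.2, 0.4, 0.4]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))     # D_1(P, Q)
chi2 = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))      # Pearson chi^2(P, Q)

m = [0.25, 0.45, 0.30]                # an arbitrary probability scale
b_near_1 = b_alpha(p, q, m, 1.0 - 1e-6)   # should approach D_1(P, Q), cf. (45)
b_near_0 = b_alpha(p, q, p, 1e-6)         # should approach chi2 - D_1, cf. (48)
```

Note that the affine parts of φ_α which blow up as α → 0 or α → 1 cancel inside the Bregman construction, which is why the direct numerical evaluation stays stable.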

Remark 3: The power divergences D_α(P, Q) are usually applied in statistics as criteria of discrimination or goodness-of-fit between the distributions P and Q. The scaled Bregman distances B_α(P, Q | M), as generalizations of the power divergences D_α(P, Q) ≡ B_α(P, Q | Q), allow us to extend the 2D-discrimination plots {[D_α(P, Q); α]: c ≤ α ≤ d} ⊂ R² into the more informative 3D-discrimination plots

{[B_α(P, Q | βP + (1 − β)Q); α; β]: c ≤ α, β ≤ d} ⊂ R³    (49)

reducing to the former ones for β = 0. The simpler 2D-plots known under the name Q–Q-plots are famous tools of exploratory data analysis. Computer-aided, appropriately coloured projections of the 3D-plots (49) allow a much more intimate insight into the relation between data and their statistical models. Therefore this computer-aided 3D-exploratory analysis deserves deeper attention and research. The next example presents projections of two such plots obtained for a binomial model P and its data-based binomial alternative Q.

Example 2: Let P = Bin(n, p̃) be a binomial distribution with parameters n, p̃ (with a slight abuse of notation), and Q = Bin(n, q̃). Figure 1 presents projections of the corresponding 3D-discrimination plots (49) for 0.2 ≤ α ≤ 2 and 0 ≤ β ≤ 1, where Subfigure (a) used the parameter constellation n = 10, p̃ = 0.25, q̃ = 0.20, whereas Subfigure (b) used n = 10, p̃ = 0.25, q̃ = 0.30. In both cases, the ranges of B_α(P, Q | βP + (1 − β)Q) are subsets of the interval [0.06, 0.088].

IV. EXPONENTIAL FAMILIES

In this section we show that the scaled Bregman power distances B_α(P, Q | M) can be explicitly evaluated for probability measures P, Q, M from exponential families. Let us restrict ourselves to the Euclidean observation spaces (X, A) ⊆ (R^d, B^d) and denote by x·θ the scalar product of x, θ ∈ R^d. The convex extended real-valued function

b(θ) = ln ∫_{R^d} e^{x·θ} dλ(x),  θ ∈ R^d,    (50)

and the convex set

Θ = {θ ∈ R^d : b(θ) < ∞}

define on (X, A) an exponential family of probability measures {P_θ : θ ∈ Θ} with the densities

p_θ(x) ≡ (dP_θ/dλ)(x) = exp{x·θ − b(θ)},  x ∈ R^d, θ ∈ Θ.    (51)

The cumulant function b(θ) is infinitely differentiable on the interior Θ˚ with the gradient

∇b(θ) = (∂/∂θ_1, ..., ∂/∂θ_d) b(θ),  θ ∈ Θ˚.

Note that (51) are exponential type densities in the natural form. All exponential type distributions such as Poisson, normal etc. can be transformed into this form (cf., e.g., Brown (1986)).

The formula

∫_{R^d} e^{x·θ} dλ(x) = e^{b(θ)},  θ ∈ Θ    (52)

follows from (50) and implies

∫_{R^d} x e^{x·θ} dλ(x) = e^{b(θ)} ∇b(θ),  θ ∈ Θ˚.    (53)

Both formulas (52) and (53) will be useful in the sequel.

We are interested in the scaled Bregman power distances

B_α(P_{θ1}, P_{θ2} | P_{θ0})  for  θ0, θ1, θ2 ∈ Θ, α ∈ R.

Here P_{θ1}, P_{θ2}, P_{θ0} are measure-theoretically equivalent probability measures, so that we can turn attention to the formulas (39), (30), (33), and (46) to (48), promising to reduce the evaluation of B_α(P_{θ1}, P_{θ2} | P_{θ0}) to the evaluation of the power divergences D_α(P_{θ1}, P_{θ2}). Therefore we first study these divergences and in particular verify their finiteness, which was a sufficient condition for the applicability of the formulas (39), (30) and (33).

To begin with, let us mention the following well-established representation:

Theorem 3: If α ∈ R differs from 0 and 1, then the power divergence D_α(P_{θ1}, P_{θ2}) is for all θ1, θ2 ∈ Θ finite and given by the expression

(exp{b(αθ1 + (1 − α)θ2) − α b(θ1) − (1 − α) b(θ2)} − 1)/(α(α − 1)).    (54)

In particular, it is invariant with respect to the shifts of the cumulant function linear in θ ∈ Θ, in the sense that it coincides with the power divergence D_α(P̃_{θ1}, P̃_{θ2}) in the exponential family with the cumulant function b̃(θ) = b(θ) + c + v·θ, where c is a real number and v a d-vector.

This can be easily seen by slightly extending (38) to get for arbitrary α ∈ R and θ1, θ2 ∈ Θ

1 + α(α − 1) D_α(P_{θ1}, P_{θ2}) = ∫_X p_{θ1}^α p_{θ2}^{1−α} dλ
= ∫_{R^d} exp{x·[αθ1 + (1 − α)θ2]} dλ(x) / exp{α b(θ1) + (1 − α) b(θ2)},

which together with (52) gives the desired result.

The skew symmetry as well as the remaining power divergences D_0(P_{θ1}, P_{θ2}) and D_1(P_{θ1}, P_{θ2}) are given in the next, straightforward theorem.

Theorem 4: For all θ1, θ2 ∈ Θ and α different from 0 and 1 there holds

D_α(P_{θ2}, P_{θ1}) = D_{1−α}(P_{θ1}, P_{θ2}),

and for θ2 ∈ Θ˚

D_{−ln t}(P_{θ1}, P_{θ2}) = D_0(P_{θ1}, P_{θ2}) = lim_{α↓0} D_α(P_{θ1}, P_{θ2})
= b(θ1) − b(θ2) − ∇b(θ2)·(θ1 − θ2)    (55)
= lim_{α↑1} D_α(P_{θ2}, P_{θ1}) = D_1(P_{θ2}, P_{θ1}) = D_{t ln t}(P_{θ2}, P_{θ1}).    (56)
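Theorems 3 and 4 can be illustrated with the binomial family of Example 2. In the natural form (50)-(51) one may take λ({x}) = C(n, x) on X = {0, ..., n}, so that b(θ) = n ln(1 + e^θ) and P_θ = Bin(n, e^θ/(1 + e^θ)) — this construction and all names below are ours, chosen consistently with (50)-(51). The sketch checks the closed form (54) against the direct sum, and the Kullback limit (55)-(56):

```python
import math

n = 10
b = lambda th: n * math.log1p(math.exp(th))            # cumulant function (50)
pmf = lambda th, x: math.comb(n, x) * math.exp(x * th - b(th))

def d_alpha_direct(th1, th2, a):
    # power divergence (38) summed over the full support
    s = sum(pmf(th1, x) ** a * pmf(th2, x) ** (1 - a) for x in range(n + 1))
    return (s - 1) / (a * (a - 1))

def d_alpha_cumulant(th1, th2, a):
    # closed form (54), using only the cumulant function
    return (math.exp(b(a * th1 + (1 - a) * th2)
                     - a * b(th1) - (1 - a) * b(th2)) - 1) / (a * (a - 1))

nat = lambda pr: math.log(pr / (1 - pr))               # natural parameter of Bin(n, pr)
t1, t2 = nat(0.25), nat(0.20)                          # the constellation of Example 2(a)
gap54 = abs(d_alpha_direct(t1, t2, 0.7) - d_alpha_cumulant(t1, t2, 0.7))

# limits (55)-(56): D_0(P_t1, P_t2) = b(t1) - b(t2) - b'(t2)(t1 - t2)
db = lambda th: n * math.exp(th) / (1.0 + math.exp(th))    # gradient of b
d0_closed = b(t1) - b(t2) - db(t2) * (t1 - t2)
d0_direct = sum(pmf(t2, x) * math.log(pmf(t2, x) / pmf(t1, x)) for x in range(n + 1))
```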

Fig. 1. 3D-discrimination plots (49) for P = Bin(10, p̃), Q = Bin(10, q̃) with 0.2 ≤ α ≤ 2 and 0 ≤ β ≤ 1: (a) p̃ = 0.25, q̃ = 0.20; (b) p̃ = 0.25, q̃ = 0.30.

The main result of this section is the following representation theorem for Bregman distances in exponential families. We formulate it in terms of the functions

ρ_α(θ1, θ2) = b(αθ1 + (1 − α)θ2) − α b(θ1) − (1 − α) b(θ2)    (57)

(where the right-hand side is finite if 0 ≤ α ≤ 1), as well as the functions σ_α(θ0, θ1, θ2) (α ∈ R, θ0, θ1, θ2 ∈ Θ) defined as the difference

σ_α(θ0, θ1, θ2) = σ_α^I(θ0, θ1, θ2) − σ_α^{II}(θ0, θ1, θ2)    (58)

of the nonnegative (possibly infinite)

σ_α^I(θ0, θ1, θ2) = b(αθ1 + (1 − α)[θ1 − θ2 + θ0])    (59)

and the finite

σ_α^{II}(θ0, θ1, θ2) = α b(θ1) + (1 − α)[b(θ1) − b(θ2) + b(θ0)].    (60)

Alternatively,

σ_α(θ0, θ1, θ2) = ρ_α(θ1, θ0 + θ1 − θ2) + (1 − α)[b(θ0 + θ1 − θ2) − b(θ0) − b(θ1) + b(θ2)].    (61)

Theorem 5: Let θ0, θ1, θ2 ∈ Θ be arbitrary. If α(α − 1) ≠ 0 then the Bregman distance of the exponential family distributions P_{θ1} and P_{θ2} scaled by P_{θ0} is given by the formula

B_α(P_{θ1}, P_{θ2} | P_{θ0}) = exp{ρ_α(θ1, θ0)}/(α(α − 1)) + exp{ρ_α(θ2, θ0)}/α + exp{σ_α(θ0, θ1, θ2)}/(1 − α).    (62)

If θ0 respectively θ1 is from the interior Θ˚, then the limiting Bregman power distances are

B_0(P_{θ1}, P_{θ2} | P_{θ0}) = b(θ1) − b(θ2) − ∇b(θ0)·(θ1 − θ2) + exp{σ_0(θ0, θ1, θ2)} − 1    (63)

respectively

B_1(P_{θ1}, P_{θ2} | P_{θ0}) = b(θ2) − b(θ1) − ∇b(θ1)·(θ2 − θ1).    (64)

In particular, all scaled Bregman distances (62) - (64) are invariant with respect to the shifts of the cumulant function linear in θ ∈ Θ, in the sense that they coincide with the scaled Bregman distances B_α(P̃_{θ1}, P̃_{θ2} | P̃_{θ0}) in the exponential family with the cumulant function b̃(θ) = b(θ) + c + v·θ, where c is a real number and v a d-vector.

Proof: (a) By (51), it holds for every α ∈ R and θ0, θ1, θ2 ∈ Θ

p_{θ1}(x) (p_{θ2}(x)/p_{θ0}(x))^{α−1}
= exp{(α − 1)[x·(θ2 − θ0) − (b(θ2) − b(θ0))] + x·θ1 − b(θ1)}
= exp{x·(αθ1 + (1 − α)[θ1 − θ2 + θ0]) − σ_α^{II}(θ0, θ1, θ2)}

with σ_α^{II}(θ0, θ1, θ2) from (60). Since (52) leads to

∫_{R^d} exp{x·(αθ1 + (1 − α)[θ1 − θ2 + θ0])} dλ = exp{σ_α^I(θ0, θ1, θ2)}

for σ_α^I(θ0, θ1, θ2) given by (59), it holds

∫_X p_{θ1} (p_{θ2}/p_{θ0})^{α−1} dλ = exp{σ_α(θ0, θ1, θ2)},    (65)

where σ_α(θ0, θ1, θ2) was defined in (58). Now, by plugging P = P_{θ1}, Q = P_{θ2}, M = P_{θ0} (cf. (51)) in (39), we get for α(α − 1) ≠ 0 the Bregman distances

B_α(P_{θ1}, P_{θ2} | P_{θ0}) = D_α(P_{θ1}, P_{θ0}) − (1 − α) D_α(P_{θ2}, P_{θ0}) + (1/(1 − α)) [∫_X p_{θ1} (p_{θ2}/p_{θ0})^{α−1} dλ − 1].    (66)

By combining the power divergence formula (54) with (57), one ends up with

Dα(Pθ1, Pθ2) = [exp{ρα(θ1, θ2)} − 1]/(α(α−1)),

which relation together with (65) and (66) leads to the desired representation (62).

(b) By the definition of B0(P, Q | M) in (47) and by (41),

B0(Pθ1, Pθ2 | Pθ0) = D0(Pθ1, Pθ0) − D0(Pθ2, Pθ0) + ∫_X (pθ0 pθ1 / pθ2) dλ − 1,

where

∫_X (pθ0 pθ1 / pθ2) dλ = exp{σ0(θ0, θ1, θ2)}   (cf. (65)).

For θ0 ∈ Θ̊ the desired assertion (63) follows from here and from the formulas

D0(Pθi, Pθ0) = b(θi) − b(θ0) − ∇b(θ0) · (θi − θ0) for i = 1, 2

obtained from (55).

(c) The desired formula (64) follows immediately from the definition (46) and from the formulas (44), (45), (55) and (56).

(d) The finally stated invariance is immediate.

The Conclusion 1 of Section III about the relation between scaled Bregman distances and φ-divergences can be completed by the following relation between both of them and the classical Bregman distances (1).

Conclusion 2: Let Bφ(x, y) be the classical Bregman distance (1) of x, y ∈ R^d and P = {Pθ : θ ∈ R^d} the exponential family with cumulant function φ, i.e., with densities pθ(s) = exp{s · θ − φ(θ)}, s ∈ R^d. Then for all Px, Py, Pz ∈ P

Bφ(x, y) = B1(Py, Px | Pz) = D1(Py, Px),

i.e., there is a one-to-one relation between the classical Bregman distance Bφ(x, y) and the scaled Bregman distances B1(Py, Px | Pz) and power divergences D1(Py, Px) of the exponential probability measures generated by the cumulant function φ. This means that the family {Bα(Py, Px | Pz) : α ∈ R, z ∈ R^d} of scaled Bregman power distances and the family {Dα(Py, Px) : α ∈ R} of power divergences extend the classical Bregman distances Bφ(x, y), to which they reduce at α = 1 and arbitrary Pz ∈ P. In fact, we meet here an extension of the classical Bregman distances in three different directions: the first represented by the various power parameters α ∈ R, the second by the various possible exponential distributions parametrized by θ ∈ R^d, and the third by the exponential distribution parameters z ∈ R^d, which are relevant when α ≠ 1.

Remark 4: We see from Theorems 4 and 5 that – consistent with (30), (45) – for arbitrary interior parameters θ0, θ1, θ2 ∈ Θ̊

B1(Pθ1, Pθ2 | Pθ0) = D1(Pθ1, Pθ2),

i.e. that the Bregman distance of order α = 1 of exponential family distributions Pθ1, Pθ2 does not depend on the scaling distribution Pθ0. The distance of order α = 0 satisfies the relation

B0(Pθ1, Pθ2 | Pθ0) = D0(Pθ1, Pθ2) + exp{σ0(θ0, θ1, θ2)} − 1 = B1(Pθ2, Pθ1 | Pθ0) + Δ(θ0, θ1, θ2),

where

Δ(θ0, θ1, θ2) = exp{σ0(θ0, θ1, θ2)} − 1

represents a deviation from the skew-symmetry of the Bregman distances B0(Pθ1, Pθ2 | Pθ0) and B1(Pθ2, Pθ1 | Pθ0) of Pθ1 and Pθ2. This deviation is zero if (for strictly convex b(θ), if and only if) θ0 = θ2.

Remark 5: We see from the formulas (54)–(64) that for all α ∈ R the quantities Dα(Pθ1, Pθ2), ρα(θ1, θ2), σα(θ0, θ1, θ2) and Bα(Pθ1, Pθ2 | Pθ0) depend only on the cumulant function b(θ) defined in (50), and not directly on the reference measure λ used in the definition formulas (50), (51).

V. EXPONENTIAL APPLICATIONS

In this section we illustrate the evaluation of scaled Bregman divergences Bα(Pθ1, Pθ2 | Pθ0) for some important discrete and continuous exponential families, and also for exponentially distributed random processes.

Binomial model: Consider for fixed n ≥ 2 on the observation space X = {0, ..., n} the binomial distribution Pθ determined by

Pθ[{x}] = λ[{x}] exp{xθ − b(θ)} = (n choose x) p^x (1−p)^(n−x)

for x ∈ {0, ..., n}, where

λ[{x}] = (n choose x), θ = ln(p/(1−p)) ∈ Θ = R and b(θ) = n ln(1 + e^θ).

After some calculations one obtains from (57) and (61)

ρα(θ1, θ2) = n ln [ (1 + e^(αθ1+(1−α)θ2)) / ((1 + e^θ1)^α (1 + e^θ2)^(1−α)) ]

and

σα(θ0, θ1, θ2) = n ln [ (1 + e^(αθ1+(1−α)(θ0+θ1−θ2))) (1 + e^θ2)^(1−α) / ((1 + e^θ1) (1 + e^θ0)^(1−α)) ].

Applying Theorem 5 one obtains from here an explicit formula for the binomial Bregman distances Bα(Pθ1, Pθ2 | Pθ0).
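The binomial closed forms above can be checked numerically against Theorem 5. The following Python sketch (ours, not from the paper) builds ρα and σα from the cumulant function b(θ) = n ln(1 + e^θ), evaluates the representation (62), and compares it with a direct summation over the observation space; the power-type integrand used for the direct sum corresponds to our assumption that (39) is generated by φα(t) = t^α/(α(α−1)).

```python
import math
from math import comb

n = 10

def b(theta):
    # binomial cumulant function b(θ) = n ln(1 + e^θ)
    return n * math.log1p(math.exp(theta))

def pmf(theta, x):
    # Pθ[{x}] = C(n, x) exp(xθ − b(θ))
    return comb(n, x) * math.exp(x * theta - b(theta))

def rho(a, t1, t2):
    # (57)
    return b(a * t1 + (1 - a) * t2) - a * b(t1) - (1 - a) * b(t2)

def sigma(a, t0, t1, t2):
    # (58)-(60)
    return (b(a * t1 + (1 - a) * (t1 - t2 + t0))
            - a * b(t1) - (1 - a) * (b(t1) - b(t2) + b(t0)))

def B_closed(a, t0, t1, t2):
    # representation (62)
    return (math.exp(rho(a, t1, t0)) / (a * (a - 1))
            + math.exp(rho(a, t2, t0)) / a
            + math.exp(sigma(a, t0, t1, t2)) / (1 - a))

def B_direct(a, t0, t1, t2):
    # direct sum of the power-type scaled Bregman integrand (our reading of (39))
    s = 0.0
    for x in range(n + 1):
        p, q, m = pmf(t1, x), pmf(t2, x), pmf(t0, x)
        s += (p**a * m**(1 - a) / (a * (a - 1))
              + q**a * m**(1 - a) / a
              - p * q**(a - 1) * m**(1 - a) / (a - 1))
    return s

# θ0, θ1, θ2 via the logit reparametrization θ = ln(p/(1−p))
theta = [math.log(pe / (1 - pe)) for pe in (0.4, 0.25, 0.3)]
for a in (0.5, 2.0, -1.0):
    assert abs(B_closed(a, *theta) - B_direct(a, *theta)) < 1e-10
```

Since Θ = R for the binomial family, every linear combination of parameters stays inside Θ, so the agreement holds for all α with α(α−1) ≠ 0, not only for 0 ≤ α ≤ 1.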

Rayleigh model: An important role in communication theory is played by the Rayleigh distributions defined by the probability densities

pθ(x) = θ x exp{−θ x²/2}, θ ∈ Θ = (0, ∞)   (67)

with respect to the restriction λ+ of the Lebesgue measure λ on the observation space X = (0, ∞). The mapping

T(x) = −x²/2

from the positive halfline (0, ∞) to the negative halfline (−∞, 0) transforms (67) into the family of Rayleigh densities

pθ(x) = θ exp{θx} = exp{θx − b(θ)} for b(θ) = −ln θ, θ > 0,

with respect to the restriction λ− of the Lebesgue measure λ on the observation space X = (−∞, 0). These are the Rayleigh densities in the natural form assumed in (51). After some calculations one derives from (57)

ρα(θ1, θ2) = ln [ θ1^α θ2^(1−α) / (αθ1 + (1−α)θ2) ]   (68)

and

σα(θ0, θ1, θ2) = ln [ θ1 θ0^(1−α) / ((αθ1 + (1−α)(θ0 + θ1 − θ2)) θ2^(1−α)) ].

Applying Theorem 5 one obtains the Rayleigh–Bregman distances Bα(Pθ1, Pθ2 | Pθ0) from here.

Theorem 1 about the preservation of the scaled Bregman distances by statistically sufficient transformations is useful for the evaluation of these distances in exponential families. It implies for example that these distances in the normal and lognormal families coincide. The next two examples, dealing with distances of stochastic processes, make use of this theorem too.

Exponentially distributed signals: Most of the random processes modelling physical, social and economic phenomena are exponentially distributed. Important among them are the real-valued Lévy processes X_t = (X_s : 0 ≤ s ≤ t) with trajectories x_t = (x_s : 0 ≤ s ≤ t) from the Skorokhod observation spaces (X_t, A_t) and parameters from the set Θ = {θ ∈ R : c(θ) < ∞} defined by means of the function

c(θ) = ∫_{R\{0}} x² e^(θx)/(1 + x²) dν(x),

where ν is a Lévy measure which determines the probability distribution of the size of jumps of the process and the intensity with which jumps occur. It is assumed that 0 belongs to Θ, and it is known (cf., e.g., Küchler and Sorensen (1994)) that the probability distributions P_{t,θ} induced by these processes on (X_t, A_t) are mutually measure-theoretically equivalent, with the relative densities

(dP_{t,θ}/dP_{t,0})(x_t) = exp{θ x_t − b_t(θ)}   (69)

for the end x_t of the trajectory x_t. The cumulant function appearing here is

b_t(θ) = t [ δθ + σ²θ²/2 + γ(θ) ]   (70)

for two genuine parameters δ ∈ R respectively σ > 0 of the process, which determine its intensity of drift respectively its volatility, and for the function

γ(θ) = ∫_{R\{0}} [e^(θx) − 1 − θx/(1 + x²)] dν(x).

The formula (69) implies that the family P_t = {P_{t,θ} : θ ∈ Θ} is exponential on (X_t, A_t), for which the "extremally reduced" observation T(x_t) = x_t is statistically sufficient. Thus, by Theorem 1,

B(P_{t,θ1}, P_{t,θ2} | P_{t,θ0}) = B(Q_{t,θ1}, Q_{t,θ2} | Q_{t,θ0})   (71)

where Q_{t,θ} is a probability distribution on the real line governing the marginal distribution of the last observed value X_t of the process X_t.

Queueing processes and Brownian motions: For illustration of the general result of the previous subsection we can take the family of Poisson processes with initial value X_0 = 0 and intensities η_θ = e^θ, θ ∈ Θ = R, for which δ = σ = 0 and c(θ) = e^θ − 1, so that b_t(θ) = t(e^θ − 1). Then Q_{t,θ} is the Poisson distribution Poi(τ) with parameter τ = tη_θ = te^θ and probabilities

Q_{t,θ}[{x}] = e^(−τ) τ^x / x! = λ[{x}] exp{xϑ − e^ϑ} for ϑ = ln τ = θ + ln t, λ[{x}] = 1/x!.

The exponential structure is similar as above, so that by applying (57) to the cumulant function b(ϑ) = e^ϑ = te^θ we get for the Poisson processes with parameters θ1 and θ2

ρα(θ1, θ2) = t [ e^(αθ1+(1−α)θ2) − α e^θ1 − (1−α) e^θ2 ].

Combining this with (61) and Theorem 5 we obtain an explicit formula for the scaled Bregman distance (71) of these Poisson processes.

To give another illustration of the result of the previous subsection, let us first introduce the standard Wiener process X̃_t, which is the Lévy process with ν ≡ 0, δ = 0, σ = 1 and θ = 1. It defines the family of Wiener processes

X_s = θ X̃_s, 0 ≤ s ≤ t, θ ∈ (0, ∞),

which are Lévy processes with δ = 0, σ = 1 and c(θ) ≡ 0, so that (70) implies b_t(θ) = tθ²/2. They are well-known models of the random fluctuations called Brownian motions. If the initial value X_0 is zero, then Q_{t,θ} is the normal distribution with mean zero and variance v² = tθ².

The corresponding Lebesgue densities

(1/√(2πv²)) exp{−x²/(2v²)} = √(ϑ/π) exp{−ϑx²} for ϑ = 1/(2v²)

are transformed by the mapping x ↦ −x² of R onto the negative halfline (−∞, 0) into the natural exponential densities exp{ϑx − b(ϑ)} with respect to the dominating density 1/√(π|x|), where b(ϑ) = −(1/2) ln ϑ = ln θ + (1/2) ln 2t. Thus by (57)

ρα(ϑ1, ϑ2) = (1/2) ln [ ϑ1^α ϑ2^(1−α) / (αϑ1 + (1−α)ϑ2) ]   (cf. (68)).

This together with (61) and Theorem 5 leads to the explicit formula for the scaled Bregman distance (71) of the Wiener processes under consideration.
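The Rayleigh formula (68) can be sanity-checked by numerical integration of the natural-form densities pθ(x) = θ e^{θx} on (−∞, 0), using the identity ∫ pθ1^α pθ2^(1−α) dλ = exp{ρα(θ1, θ2)} that underlies (54). A minimal sketch (ours; the midpoint rule and the truncation point −60 are ad hoc choices, valid while αθ1 + (1−α)θ2 > 0):

```python
import math

def rho_rayleigh(a, t1, t2):
    # (68): ρα(θ1, θ2) = ln[ θ1^α θ2^(1−α) / (αθ1 + (1−α)θ2) ]
    return (a * math.log(t1) + (1 - a) * math.log(t2)
            - math.log(a * t1 + (1 - a) * t2))

def hellinger_integral(a, t1, t2, n=200000, L=60.0):
    # ∫_{-∞}^{0} p1(x)^a p2(x)^(1-a) dx for pθ(x) = θ e^{θx} on (−∞, 0),
    # approximated by a composite midpoint rule on the truncated interval [−L, 0]
    h = L / n
    s = 0.0
    for i in range(n):
        x = -L + (i + 0.5) * h
        s += (t1 * math.exp(t1 * x))**a * (t2 * math.exp(t2 * x))**(1 - a) * h
    return s

# the quadrature reproduces exp(ρα) both inside and outside 0 ≤ α ≤ 1
for a, t1, t2 in [(0.5, 1.0, 2.0), (1.7, 0.8, 1.5)]:
    assert abs(hellinger_integral(a, t1, t2) - math.exp(rho_rayleigh(a, t1, t2))) < 1e-5
```

The same check applied to the Wiener parametrization simply halves the logarithm, matching the factor 1/2 obtained above from b(ϑ) = −(1/2) ln ϑ.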

Geometric Brownian motions: From the abovementioned standard Wiener process one can also build up the family of geometric Brownian motions (geometric Wiener processes)

Y_s = exp{σ X̃_s + θs}, 0 ≤ s ≤ t, θ ∈ R,

where the family-generating θ can be interpreted as drift parameter, and the volatility parameter σ > 0 is assumed to be constant all over the family. Then σX̃_t + θt is normally distributed with mean m = θt and variance v² = σ²t, and Y_t is lognormally distributed with the same parameters m and v². By (71), the scaled Bregman distance of two geometric Brownian motions with parameters θ1, θ2 reduces to the scaled Bregman distance of two lognormal distributions LN(θ1 t, σ²t), LN(θ2 t, σ²t). As said above, it coincides with the scaled Bregman distance of two normal distributions N(θ1 t, σ²t), N(θ2 t, σ²t). This is seen also from the fact that the reparametrization

ϑ = µ/v², τ = 1/(2v²)

and R ↦ R² transformations similar to that from the previous example lead in both distributions N(µ, v²) and LN(µ, v²) to the same natural exponential density

p_{ϑ,τ}(x1, x2) = exp{x1 ϑ + x2 τ − b(ϑ, τ)}

with

b(ϑ, τ) = −(1/2) ln τ + ϑ²/(4τ).

These two distributions differ just in the dominating measures on the transformed observation space X = R². For (µ1, v1²) = (θ1 t, σ²t) and (µ2, v2²) = (θ2 t, σ²t) we get

(ϑ1, τ1) = (θ1/σ², 1/(2σ²t)) and (ϑ2, τ2) = (θ2/σ², 1/(2σ²t))

and thus

b(α(ϑ1, τ1) + (1−α)(ϑ2, τ2)) − α b(ϑ1, τ1) − (1−α) b(ϑ2, τ2) = [ (αθ1 + (1−α)θ2)² − αθ1² − (1−α)θ2² ] t/(2σ²).

Hence, for distributions P_{t,θ1}, P_{t,θ2} of the geometric Brownian motions considered above we get from (57)

ρα(θ1, θ2) = [ (αθ1 + (1−α)θ2)² − αθ1² − (1−α)θ2² ] t/(2σ²).

The expression (61) can be evaluated in the same way. Applying both these results in Theorem 5 one obtains an explicit formula for the scaled Bregman distance (71) of these geometric Brownian motions.

ACKNOWLEDGMENT

We are grateful to all three referees for useful suggestions.

REFERENCES

Amari, S.-I. (2007), "Integration of stochastic models by minimizing α-divergence," Neural Computation, vol. 19, no. 10, pp. 2780-2796.

Banerjee, A., Guo, X., and Wang, H. (2005a), "On the optimality of conditional expectation as a Bregman predictor," IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2664-2669.

Banerjee, A., Merugu, S., Dhillon, I.S. and Ghosh, J. (2005b), "Clustering with Bregman divergences," J. Machine Learning Research, vol. 6, pp. 1705-1749.

Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D. (2006), "Convexity, classification and risk bounds," JASA, vol. 101, pp. 138-156.

Bauschke, H.H. and Borwein, J.M. (1997), "Legendre functions and the method of random Bregman projections," J. Convex Analysis, vol. 4, no. 1, pp. 27-67.

Boratynska, A. (1997), "Stability of Bayesian inference in exponential families," Statist. & Probab. Letters, vol. 36, pp. 173-178.

Bregman, L.M. (1967), "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, vol. 7, no. 3, pp. 200-217.

Brown, L.D. (1986), Fundamentals of Statistical Exponential Families. Hayward, California: Inst. of Math. Statistics.

Butnariu, D. and Resmerita, E. (2006), "Bregman distances, totally convex functions, and a method for solving operator equations in Banach spaces," Abstr. Appl. Anal., vol. 2006, Art. ID 84919, 39 pp.

Byrne, C. (1999), "Iterative projection onto convex sets using multiple Bregman distances," Inverse Problems, vol. 15, pp. 1295-1313.

Carlson, B.A. and Clements, M.A. (1991), "A computationally compact divergence measure for speech processing," IEEE Transactions on PAMI, vol. 13, pp. 1255-1260.

Censor, Y. and Zenios, S.A. (1997), Parallel Optimization - Theory, Algorithms, and Applications. New York: Oxford University Press.

Cesa-Bianchi, N. and Lugosi, G. (2006), Prediction, Learning, Games. Cambridge: Cambridge University Press.

Collins, M., Schapire, R.E. and Singer, Y. (2002), "Logistic regression, AdaBoost and Bregman distances," Machine Learning, vol. 48, pp. 253-285.

Csiszár, I. (1963), "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten," Publ. Math. Inst. Hungar. Acad. Sci., ser. A, vol. 8, pp. 85-108.

Csiszár, I. (1967), "Information-type measures of difference of probability distributions and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299-318.

Csiszár, I. (1991), "Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems," Annals of Statistics, vol. 19, no. 4, pp. 2032-2066.

Csiszár, I. (1994), "Maximum entropy and related methods," Trans. 12th Prague Conf. Information Theory, Statistical Decision Functions and Random Processes. Prague, Czech Acad. Sci., pp. 58-62.

Csiszár, I. (1995), "Generalized projections for non-negative functions," Acta Mathematica Hungarica, vol. 68, pp. 161-186.

Csiszár, I. and Matúš, F. (2008), "On minimization of entropy functionals under moment constraints," Proceedings of ISIT 2008, Toronto, Canada, pp. 2101-2105.

Csiszár, I. and Matúš, F. (2009), "On minimization of multivariate entropy functionals," Proceedings of ITW 2009, Volos, Greece, pp. 96-100.

Do, M.N. and Vetterli, M. (2002), "Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance," IEEE Transactions on Image Processing, vol. 11, pp. 146-158.

Freund, Y. and Schapire, R.E. (1997), "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, pp. 119-139.

Halmos, P.R. (1964), Measure Theory. New York: D. Van Nostrand.

Hertz, T., Bar-Hillel, A. and Weinshall, D. (2004), "Learning distance functions for information retrieval," in Proc. IEEE Comput. Soc. Conf. on Computer Vision and Pattern Rec. CVPR, vol. 2, pp. II-570 - II-577.

Jones, L.K. and Byrne, C.L. (1990), "General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis," IEEE Trans. Inform. Theory, vol. 36, no. 1, pp. 23-30.

Küchler, U. and Sorensen, M. (1994), "Exponential families of stochastic processes and Lévy processes," J. of Statist. Planning and Inference, vol. 39, pp. 211-237.

Lehman, E.L. and Romano, J.P. (2005), Testing Statistical Hypotheses. Berlin: Springer.

Liese, F. and Vajda, I. (1987), Convex Statistical Distances. Leipzig: Teubner.

Liese, F. and Vajda, I. (2006), "On divergences and informations in statistics and information theory," IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394-4412.

Lloyd, S.P. (1982), "Least squares quantization in PCM," IEEE Transactions on Inform. Theory, vol. 28, no. 2, pp. 129-137.

Marquina, A. and Osher, S.J. (2008), "Image super-resolution by TV-regularization and Bregman iteration," J. Sci. Comput., vol. 37, pp. 367-382.

Murata, N., Takenouchi, T., Kanamori, T. and Eguchi, S. (2004), "Information geometry of U-Boost and Bregman divergence," Neural Computation, vol. 16, no. 7, pp. 1437-1481.

Nock, R. and Nielsen, F. (2009), "Bregman divergences and surrogates for learning," IEEE Transactions on PAMI, vol. 31, no. 11, pp. 2048-2059.

Pardo, M.C. and Vajda, I. (1997), "About distances of discrete distributions satisfying the data processing theorem of information theory," IEEE Transactions on Information Theory, vol. 43, no. 4, pp. 1288-1293.

Pardo, M.C. and Vajda, I. (2003), "On asymptotic properties of information-theoretic divergences," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1860-1868.

Rényi, A. (1961), "On measures of entropy and information," in Proc. 4th Berkeley Symp. Math. Stat. Probab., vol. 1, pp. 547-561. Berkeley, CA: Univ. of California Press.

Resmerita, E. and Anderssen, R.S. (2007), "Joint additive Kullback-Leibler residual minimization and regularization for linear inverse problems," Math. Meth. Appl. Sci., vol. 30, no. 13, pp. 1527-1544.

Scherzer, O., Grasmair, M., Grossauer, H., Haltmeier, M. and Lenzen, F. (2008), Variational Methods in Imaging. New York: Springer.

Stummer, W. (2007), "Some Bregman distances between financial diffusion processes," Proc. Appl. Math. Mech., vol. 7, no. 1, pp. 1050503-1050504.

Stummer, W. and Vajda, I. (2010), "On divergences of finite measures and their applicability in statistics and information theory," Statistics, vol. 44, pp. 169-187.

Teboulle, M. (2007), "A unified continuous optimization framework for center-based clustering methods," Journal of Machine Learning Research, vol. 8, pp. 65-102.

Vajda, I. (2009), "On metric divergences of probability measures," Kybernetika, vol. 45, no. 5 (in print).

Vajda, I. and Zvárová, J. (2007), "On generalized entropies, Bayesian decisions and statistical diversity," Kybernetika, vol. 43, no. 5, pp. 675-696.

Veldhuis, R.N.J. (2002), "The centroid of the Kullback-Leibler distance," IEEE Signal Processing Letters, vol. 9, no. 3, pp. 96-99.

Xu, J. and Osher, S. (2007), "Iterative regularization and nonlinear inverse scale space applied to wavelet-based denoising," IEEE Transactions on Image Processing, vol. 16, no. 2, pp. 534-544.

Wolfgang Stummer graduated from the Johannes Kepler University Linz, Austria, in 1987 and received the Ph.D. degree in 1991 from the University of Zurich, Switzerland. From 1993 to 1995 he worked as Research Assistant at the University of London and the University of Bath (UK). From 1995 to 2001 he was Assistant Professor at the University of Ulm (Germany). From 2001 to 2003 he held a Term Position as a Full Professor at the University of Karlsruhe (now KIT; Germany), where he continued as Associate Professor until 2005. Since then, he has been affiliated as Full Professor at the Department of Mathematics, University of Erlangen-Nürnberg FAU (Germany); at the latter, he is also a Member of the School of Business and Economics.

Igor Vajda (M'90 - F'01) was born in 1942 and passed away suddenly after a short illness on May 2, 2010. He graduated from the Czech Technical University, Czech Republic, in 1965 and received the Ph.D. degree in 1968 from Charles University, Prague, Czech Republic. He worked at UTIA (Institute of Information Theory and Automation, Czech Academy of Sciences) from his graduation until his death, and became a member of the Board of UTIA in 1990. He was a visiting professor at the Katholieke Universiteit Leuven, Belgium; the Universidad Complutense Madrid, Spain; the Université de Montpellier, France; and the Universidad Miguel Hernández, Alicante, Spain. He published four monographs and more than 100 journal publications. Dr. Vajda received the Prize of the Academy of Sciences, the Jacob Wolfowitz Prize, the Medal of Merits of the Czech Technical University, several Annual Prizes from UTIA, and, posthumously, the Bolzano Medal from the Czech Academy of Sciences.