
arXiv:1810.09503v2 [cs.IT] 23 Apr 2020

Properties of a Generalized Divergence Related to Tsallis Relative Entropy

Rui F. Vigelis, Luiza H.F. de Andrade, Charles C. Cavalcante, Senior Member, IEEE

Abstract—In this paper, we investigate the partition inequality, joint convexity, and Pinsker's inequality, for a divergence that generalizes the Tsallis relative entropy and Kullback–Leibler divergence. The generalized divergence is defined in terms of a deformed exponential function, which replaces the Tsallis q-exponential. We also constructed a family of probability distributions related to the generalized divergence. We found necessary and sufficient conditions for the partition inequality to be satisfied. A sufficient condition for the joint convexity was established. We proved that the generalized divergence satisfies the partition inequality, and is jointly convex, if, and only if, it coincides with the Tsallis relative entropy. As an application of the partition inequality, a criterion for Pinsker's inequality was found.

Index Terms—Kullback–Leibler divergence, Tsallis relative entropy, generalized divergence, family of probability distributions, partition inequality, joint convexity, Pinsker's inequality.

I. INTRODUCTION

Statistical divergences play an essential role in Information Theory [1]. A divergence can be interpreted as a measure of dissimilarity between two probability distributions. Applications that use it span areas such as communications, econometrics, and physical systems [1]. The interest in different statistical divergences is motivated by applications to optimization and statistical learning, since more flexible functions and expressions may be suitable for larger classes of data and signals, leading to more efficient information recovery methods [2], [3], [4]. The usefulness of a divergence depends on its properties, such as non-negativity, monotonicity and joint convexity, among others.

The counterpart of Shannon entropy is the well-known Kullback–Leibler (KL) divergence [5], denoted by DKL(· || ·), which is extensively used in Information Theory. Tsallis relative entropy, denoted by Dq(· || ·), is defined in terms of the q-logarithm [6], [7]. Both KL divergence and Tsallis relative entropy satisfy some important properties, such as non-negativity, joint convexity, and Pinsker's inequality [8], [9], [10].

In [11], [12], the proposed divergences were investigated from geometric and minimization perspectives. Some properties, which are useful in Information Theory, have not been analyzed for these divergences. In this work, we investigate the partition inequality, joint convexity, and Pinsker's inequality for the generalized divergence Dϕ(· || ·), which generalizes the KL divergence, and is defined in terms of a deformed exponential function ϕ, whose inverse ϕ−1 plays the role of the q-logarithm in Tsallis relative entropy. The generalized divergence appeared before in the literature, as a specific case in a broader class of divergences. Zhang in [11] introduced a divergence denoted by D(α)f,ρ(· || ·), where f and ρ are functions; the generalized divergence corresponds to Zhang's divergence for suitable choices of α, f and ρ. In [12], another class of divergences was investigated, whose members are given in terms of parameters β, c and functions φ, M1, M2, M3. Expression (1) in [12], which defines Dβc(· || ·), reduces to the generalized divergence for appropriate choices of these parameters, with λ taken as the counting measure. We showed necessary and sufficient conditions for the generalized divergence to satisfy the partition inequality. A sufficient condition for the joint convexity of Dϕ(· || ·) was found. We proved that Dϕ(· || ·) satisfies the partition inequality, and is jointly convex, if, and only if, it coincides with the Tsallis relative entropy. We also consider the family of probability distributions associated with the generalized divergence. Our results for Pinsker's inequality are in accordance with previous works [13], [14].

The rest of the paper is organized as follows. In Section II-A we provide the definition of the generalized divergence. Section II-B is devoted to the construction of a family of probability distributions related to the generalized divergence. Properties of the generalized divergence are studied in Section III. Finally, conclusions and perspectives are stated in Section IV.

R. F. Vigelis is with Computer Engineering, Campus Sobral, Federal University of Ceará, 62.010-560, Sobral-CE, Brazil, e-mail: [email protected]. L. H.F. Andrade is with the Department of Natural Sciences, Mathematics and Statistics, Federal Rural University of the Semi-arid Region, 59.625-900, Mossoró-RN, Brazil, e-mail: [email protected]. C. C. Cavalcante is with the Department of Teleinformatics Engineering, Federal University of Ceará, 60020-181, Fortaleza-CE, Brazil, e-mail: [email protected].
Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

II. GENERALIZED DIVERGENCE

The generalized divergence is defined in terms of a deformed exponential function ϕ(·). Writing the KL divergence or Tsallis relative entropy in an appropriate form, we can obtain the generalized divergence by replacing ln(·) or lnq(·) by the inverse of a deformed exponential ϕ−1(·). We also provide a construction of a family of probability distributions related to the generalized divergence.

A. Definitions

For simplicity we denote the set of all probability distributions on In = {1,...,n} by
∆n = {(p1,...,pn) : Σ_{i=1}^n pi = 1 and pi ≥ 0 for all i}.
The generalized divergence is defined for probability distributions in the interior of ∆n, which is denoted by ∆°n. A probability distribution p = (pi) belongs to ∆°n if and only if pi > 0 for each i. A deformed exponential function is a convex function ϕ: R → [0, ∞) such that lim_{u→−∞} ϕ(u) = 0 and lim_{u→∞} ϕ(u) = ∞. It is easy to verify that the ordinary exponential and the Tsallis q-exponential are deformed exponential functions. The Tsallis q-exponential expq : R → [0, ∞) is given by
expq(x) = [1 + (1 − q)x]_+^{1/(1−q)}, if q ∈ (0, 1), and expq(x) = exp(x), if q = 1,
where [x]_+ = x for x ≥ 0, and [x]_+ = 0 otherwise. The Tsallis q-logarithm lnq : (0, ∞) → R is defined as the inverse of expq(·), which is given by lnq(x) = (x^{1−q} − 1)/(1 − q) if q ∈ (0, 1).
Fixed a deformed exponential function ϕ: R → [0, ∞), the generalized divergence (or generalized relative entropy) between two probability distributions p = (pi) and q = (qi) in ∆°n is defined as
Dϕ(p || q) = Σ_{i=1}^n [ϕ−1(pi) − ϕ−1(qi)] / (ϕ−1)′(pi). (1)
Clearly, expression (1) reduces to the KL divergence DKL(p || q) = −Σ_{i=1}^n pi ln(qi/pi) if ϕ is the exponential function. Tsallis relative entropy in its standard form is given by Dq(p || q) = −Σ_{i=1}^n pi lnq(qi/pi). The equality
−pi [(qi/pi)^{1−q} − 1]/(1 − q) = (1/pi^{−q}) [ (pi^{1−q} − 1)/(1 − q) − (qi^{1−q} − 1)/(1 − q) ]
shows that Dq(· || ·) can be written as in (1) if ϕ is the Tsallis q-exponential.
The non-negativity of Dϕ(· || ·) is a consequence of the concavity of ϕ−1(·). Because ϕ−1(·) is concave, it follows that
(y − x)(ϕ−1)′(y) ≤ ϕ−1(y) − ϕ−1(x), for all x, y > 0. (2)

Using this inequality with y = pi and x = qi, we can write
Dϕ(p || q) = Σ_{i=1}^n [ϕ−1(pi) − ϕ−1(qi)] / (ϕ−1)′(pi) ≥ Σ_{i=1}^n (pi − qi) = 0.
It is clear that Dϕ(p || q) = 0 if p = q. The converse depends on whether ϕ−1(x) is strictly concave. Indeed, if we suppose that ϕ−1(x) is strictly concave, then equality in (2) is attained if and only if x = y. Therefore, when ϕ−1(x) is strictly concave, the equality Dϕ(p || q) = 0 is satisfied if and only if p = q.
In addition to the similarities between the generalized divergence, KL divergence, and Tsallis relative entropy, there exists another motivation for the choice of expression (1). We can associate with the generalized relative entropy Dϕ(· || ·) a ϕ-family of probability distributions, just as the KL divergence is related to the cumulant-generating function in an exponential family of probability distributions.
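As a numerical sanity check (an illustrative sketch of ours, not part of the paper; the parameter values q = 0.7 and κ = 0.5 and the test distributions are arbitrary), the code below confirms that (1) reproduces the KL divergence for ϕ = exp and the Tsallis relative entropy for ϕ = expq, and spot-checks non-negativity for a third deformed exponential, the Kaniadakis κ-exponential, whose inverse lnκ(x) = (x^κ − x^{−κ})/(2κ) is strictly concave on (0, ∞) for κ ∈ (0, 1).

```python
import numpy as np
rng = np.random.default_rng(0)

def d_phi(p, q, phi_inv, phi_inv_prime):
    """Generalized divergence (1): sum_i [phi^{-1}(p_i) - phi^{-1}(q_i)] / (phi^{-1})'(p_i)."""
    return np.sum((phi_inv(p) - phi_inv(q)) / phi_inv_prime(p))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# phi = exp: phi^{-1} = ln, (phi^{-1})'(x) = 1/x  ->  KL divergence
assert np.isclose(d_phi(p, q, np.log, lambda x: 1 / x), np.sum(-p * np.log(q / p)))

# phi = exp_q: phi^{-1} = ln_q, (ln_q)'(x) = x^{-q}  ->  Tsallis relative entropy
tq = 0.7
ln_q = lambda x: (x**(1 - tq) - 1) / (1 - tq)
assert np.isclose(d_phi(p, q, ln_q, lambda x: x**(-tq)), np.sum(-p * ln_q(q / p)))

# Kaniadakis kappa-exponential: non-negativity, and D = 0 when p = q
kappa = 0.5
ln_k = lambda x: (x**kappa - x**(-kappa)) / (2 * kappa)
ln_k_prime = lambda x: (x**(kappa - 1) + x**(-kappa - 1)) / 2
for _ in range(1000):
    r, s = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    assert d_phi(r, s, ln_k, ln_k_prime) >= -1e-9
    assert np.isclose(d_phi(r, r, ln_k, ln_k_prime), 0)
```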

B. Families of probability distributions

For each probability distribution p = (pi) ∈ ∆°n, we can define a deformed exponential family (of probability distributions) centered at p. A deformed exponential family consists of a parameterization for the set ∆°n. We remark that a deformed exponential family depends on the centered probability distribution p. We can associate with each probability distribution p ∈ ∆°n a deformed exponential family centered at p.

Assume that ϕ: R → [0, ∞) is a positive, deformed exponential function with continuous derivative. Fixed p = (pi) ∈ ∆°n, let c = (ci) be a vector such that pi = ϕ(ci) for each i. We also fix a vector u0 = (u0i) such that u0i > 0 for each i, and
Σ_{i=1}^n u0i ϕ′(ci) = 1. (3)
A deformed exponential family (of probability distributions) centered at p is a parameterization of ∆°n, which maps each vector u = (ui) in the subspace
Bcϕ = {(u1,...,un) : Σ_{i=1}^n ui ϕ′(ci) = 0}
to a probability distribution q = (qi) ∈ ∆°n by the expression

qi = ϕ(ci + ui − ψc(u)u0i), (4)
where ψc : Bcϕ → [0, ∞) is the normalizing function, which is introduced so that (4) defines a probability density in ∆°n.
The choice for u ∈ Bcϕ is not arbitrary. Thanks to this choice, it is possible to find ψc(u) ≥ 0 for which expression (4) is a probability density in ∆°n. We will justify this claim. Because ϕ(·) is convex, it follows that
yϕ′(x) ≤ ϕ(x + y) − ϕ(x), for all x, y ∈ R. (5)

Using (5) with x = ci and y = ui, we can write, for any u ∈ Bcϕ,
1 = Σ_{i=1}^n ui ϕ′(ci) + Σ_{i=1}^n ϕ(ci) ≤ Σ_{i=1}^n ϕ(ci + ui).
By the definition of ϕ(·), the map
g(λ) = Σ_{i=1}^n ϕ(ci + ui − λu0i)
is continuous, approaches 0 as λ → ∞, and tends to ∞ as λ → −∞. Since ϕ(·) is strictly increasing, it follows that g(·) is strictly decreasing. Then we can conclude that there exists a unique λ0 = ψc(u) ≥ 0 for which qi = ϕ(ci + ui − λ0u0i) is a probability distribution in ∆°n.
The generalized divergence Dϕ(· || ·) is associated with the deformed exponential family (4) by the equality

ψc(u) = Dϕ(p || q) = Σ_{i=1}^n [ϕ−1(pi) − ϕ−1(qi)] / (ϕ−1)′(pi). (6)
Using Σ_{i=1}^n ui ϕ′(ci) = 0, together with the constraint (3), we can write
ψc(u) = Σ_{i=1}^n (−ui + ψc(u)u0i) ϕ′(ci). (7)
It is clear that
−ui + ψc(u)u0i = ϕ−1(pi) − ϕ−1(qi), (8)
and
ϕ′(ci) = 1/(ϕ−1)′(pi). (9)
Inserting (8) and (9) into (7), we obtain (6). If ϕ is the exponential function, and u0i = 1, the deformed exponential family reduces to the well-known exponential family:

qi = exp(ui − Kp(u)) · pi, (10) where Kp(u) is the cumulant-generating function, which equals the normalizing function ψc(u).
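The construction above can be exercised numerically. The sketch below (our own illustration; it assumes ϕ = expq with q = 0.8, an arbitrary p, and a randomly projected u ∈ Bcϕ) computes ψc(u) by bisection, which is justified because g(λ) is strictly decreasing with g(0) ≥ 1, then verifies the identity (6), ψc(u) = Dϕ(p || q), and the exponential special case (10), where ψc(u) equals Kp(u).

```python
import numpy as np
rng = np.random.default_rng(1)

tq = 0.8  # illustrative Tsallis parameter
phi = lambda u: np.maximum(1 + (1 - tq) * u, 0.0) ** (1 / (1 - tq))      # exp_q
phi_prime = lambda u: np.maximum(1 + (1 - tq) * u, 0.0) ** (tq / (1 - tq))
phi_inv = lambda x: (x ** (1 - tq) - 1) / (1 - tq)                        # ln_q
phi_inv_prime = lambda x: x ** (-tq)

p = np.array([0.5, 0.3, 0.2])
n = len(p)
c = phi_inv(p)                          # p_i = phi(c_i)
u0 = 1.0 / (n * phi_prime(c))           # constraint (3): sum_i u0_i phi'(c_i) = 1

v = rng.normal(size=n) * 0.1
u = v - np.sum(v * phi_prime(c)) * u0   # u in B_c^phi: sum_i u_i phi'(c_i) = 0

def psi(u, tol=1e-13):
    """Normalizing function psi_c(u): the unique lambda >= 0 with g(lambda) = 1."""
    g = lambda lam: np.sum(phi(c + u - lam * u0))
    lo, hi = 0.0, 1.0
    while g(hi) > 1:                    # g is strictly decreasing and g(0) >= 1
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(mid) > 1 else (lo, mid)
    return (lo + hi) / 2

lam = psi(u)
q = phi(c + u - lam * u0)               # expression (4)
div = np.sum((phi_inv(p) - phi_inv(q)) / phi_inv_prime(p))
assert np.isclose(q.sum(), 1.0) and np.isclose(lam, div)     # identity (6)

# phi = exp, u0_i = 1: psi_c(u) reduces to K_p(u) and (4) to the family (10)
w = rng.normal(size=n)
w -= np.sum(w * p)                      # sum_i w_i p_i = 0
K = np.log(np.sum(p * np.exp(w)))       # cumulant-generating function K_p(u)
q_exp = np.exp(w - K) * p
assert np.isclose(np.sum(p * np.log(p / q_exp)), K)          # (6) for phi = exp
```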

III. PROPERTIES OF THE GENERALIZED DIVERGENCE

The KL divergence and Tsallis relative entropy satisfy the partition inequality, and are jointly convex. They also satisfy Pinsker's inequality. We will investigate under what conditions these properties hold for the generalized divergence. Throughout this section we assume that (ϕ−1)′′(x) is continuous and (ϕ−1)′′(x) < 0.

A. Partition inequality

The partition inequality, which is a special case of the data processing inequality, will be used in the proof of Pinsker's inequality. Let A = {A1,...,Ak} be a partition of In = {1,...,n}, i.e., A is a collection of subsets Aj ⊆ In such that Ai ∩ Aj = ∅, for i ≠ j, and ∪_{j=1}^k Aj = In. For any probability distribution p = (pi), we define the probability distribution pA = (pAj) as
pAj = Σ_{i∈Aj} pi, for each j = 1, ..., k.
The next result gives a necessary and sufficient condition for the partition inequality to be satisfied.

Proposition 1. For the divergence Dϕ(· || ·) to satisfy the partition inequality
Dϕ(p || q) ≥ Dϕ(pA || qA), (11)
for all probability distributions p = (pi) and q = (qi), and any partition A of In, it is necessary and sufficient that the function g = −(ϕ−1)′/(ϕ−1)′′ be superadditive, i.e., the inequality
g(x + y) ≥ g(x) + g(y), (12)
be satisfied for all x, y ∈ (0, 1) such that x + y ∈ (0, 1).
The proof of Proposition 1 requires some preliminary results which are presented in the sequel.
Lemma 2. Fix any α ∈ (0, 1). The mapping
Fα(x, y) = ϕ((1 − α)ϕ−1(x) + αϕ−1(y)),
is superadditive in (0, 1) × (0, 1) if, and only if, G(x, y) = ϕ−1(ϕ(x) + ϕ(y)) is convex in {(x, y) ∈ R² : ϕ(x) + ϕ(y) ∈ (0, 1)}.

Proof: Let xi,yi ∈ (0, 1) be such that x1 + x2 ∈ (0, 1) and y1 + y2 ∈ (0, 1). The superadditivity of Fα implies that

ϕ((1 − α)ϕ−1(x1 + x2) + αϕ−1(y1 + y2))
≥ ϕ((1 − α)ϕ−1(x1) + αϕ−1(y1)) + ϕ((1 − α)ϕ−1(x2) + αϕ−1(y2)). (13)
Denote si = ϕ−1(xi) and ti = ϕ−1(yi) for i = 1, 2. Thus inequality (13) is equivalent to

(1 − α)ϕ−1(ϕ(s1) + ϕ(s2)) + αϕ−1(ϕ(t1) + ϕ(t2))
≥ ϕ−1[ϕ((1 − α)s1 + αt1) + ϕ((1 − α)s2 + αt2)],
which shows the desired result.
Lemma 3. The function G, as defined in Lemma 2, is convex if and only if g = −(ϕ−1)′/(ϕ−1)′′ is superadditive in (0, 1).

Proof: For the function G to be convex, it is necessary and sufficient that its Hessian HG be positive semi-definite, which is equivalent to tr(HG) ≥ 0 and JG = det(HG) ≥ 0, where tr(·) denotes the trace of a matrix and det(·) is the determinant of a matrix (see [15]). Letting z = ϕ(x) + ϕ(y), we can express
∂²G/∂x² (x, y) = ϕ′′(x)(ϕ−1)′(z) + [ϕ′(x)]²(ϕ−1)′′(z), (14)
∂²G/∂y² (x, y) = ϕ′′(y)(ϕ−1)′(z) + [ϕ′(y)]²(ϕ−1)′′(z), (15)
and
∂²G/∂x∂y (x, y) = ϕ′(x)ϕ′(y)(ϕ−1)′′(z). (16)
If we divide the right-hand side of (14) by −ϕ′′(x)(ϕ−1)′′(z) ≥ 0, and we use
−[ϕ′(x)]²/ϕ′′(x) = (ϕ−1)′(ϕ(x))/(ϕ−1)′′(ϕ(x)) (17)
in the resulting expression, we obtain
−(ϕ−1)′(z)/(ϕ−1)′′(z) + (ϕ−1)′(ϕ(x))/(ϕ−1)′′(ϕ(x)) = g(z) − g(ϕ(x)).
As a result, we conclude that ∂²G/∂x² ≥ 0 (and similarly ∂²G/∂y² ≥ 0) if g is superadditive. Using expressions (14)–(16) for the partial derivatives of G, we find

JG(x, y) = (ϕ−1)′(z)(ϕ−1)′′(z)ϕ′′(x)ϕ′′(y) · [ (ϕ−1)′(z)/(ϕ−1)′′(z) + [ϕ′(y)]²/ϕ′′(y) + [ϕ′(x)]²/ϕ′′(x) ].
In view of (17), it follows that JG(x, y) ≥ 0 is equivalent to g(z) ≥ g(ϕ(x)) + g(ϕ(y)). Thus G is convex if and only if g is superadditive in (0, 1).
Remark 4. Similar versions of these lemmas appeared previously in the literature (see [16, sec. 3.16] and [17]). The hypothesis in these versions was weaker, or just one direction was proved.
Now, we may proceed to the proof of the main result in this section.
Proof of Proposition 1: Sufficiency. Lemmas 2 and 3 imply that

−1 −1 Fα(x, y)= ϕ((1 − α)ϕ (x)+ αϕ (y)), is superadditive in , for each . Considering , we denote A p and (0, 1) × (0, 1) α ∈ (0, 1) A = {A1,...,Ak} pj = i∈Aj i A q = qi. By the superadditivity of Fα(x, y), we can write j i∈Aj P n 1P [p − F (q ,p )] 1 − α i α i i i=1 X 1 k ≥ [pA − F (qA,pA)]. (18) 1 − α j α j j j=1 X An application of L’Hôpital’s rule on the limit below provides y − F (x, y) y − ϕ((1 − α)ϕ−1(x)+ αϕ−1(y)) lim α = lim α↑1 1 − α α↑1 1 − α = ϕ′(ϕ−1(y))[−ϕ−1(x)+ ϕ−1(y)] ϕ−1(y) − ϕ−1(x) = . (ϕ−1)′(y) Thus, in the limit α ↑ 1, expression (18) becomes A A Dϕ(p || q) ≥ Dϕ(p || q ), which is the asserted inequality. Necessity. It is clear that if (11) holds for all p = (pi), q = (qi), and A, then

[ϕ−1(p1) − ϕ−1(q1)]/(ϕ−1)′(p1) + [ϕ−1(p2) − ϕ−1(q2)]/(ϕ−1)′(p2) ≥ [ϕ−1(p1 + p2) − ϕ−1(q1 + q2)]/(ϕ−1)′(p1 + p2) (19)
is satisfied for all p1, p2 and q1, q2 in (0, 1) such that the sums p1 + p2 and q1 + q2 are in (0, 1). Let us fix p1, p2 ∈ (0, 1). We rewrite (19) as

ϕ−1(p1)/(ϕ−1)′(p1) + ϕ−1(p2)/(ϕ−1)′(p2) − ϕ−1(p1 + p2)/(ϕ−1)′(p1 + p2)
≥ ϕ−1(q1)/(ϕ−1)′(p1) + ϕ−1(q2)/(ϕ−1)′(p2) − ϕ−1(q1 + q2)/(ϕ−1)′(p1 + p2),
which is satisfied if and only if the function
F(q1, q2) = ϕ−1(q1)/(ϕ−1)′(p1) + ϕ−1(q2)/(ϕ−1)′(p2) − ϕ−1(q1 + q2)/(ϕ−1)′(p1 + p2)

attains a global maximum at (q1, q2) = (p1, p2). By a simple calculation, it can be verified that ∇F(p1, p2) = 0. Moreover, we express the determinant of the Hessian of F at (p1, p2) as
JF(p1, p2) = [(ϕ−1)′′(p1)/(ϕ−1)′(p1)] · [(ϕ−1)′′(p2)/(ϕ−1)′(p2)]
− [(ϕ−1)′′(p1 + p2)/(ϕ−1)′(p1 + p2)] · [(ϕ−1)′′(p2)/(ϕ−1)′(p2)]
− [(ϕ−1)′′(p1)/(ϕ−1)′(p1)] · [(ϕ−1)′′(p1 + p2)/(ϕ−1)′(p1 + p2)]
= 1/(g(p1)g(p2)) − (1/g(p1 + p2)) [1/g(p1) + 1/g(p2)].
Because JF(p1, p2) ≥ 0, it follows that g(p1 + p2) ≥ g(p1) + g(p2).
Remark 5. If ϕ(x) = exp(x), the function g = −(ϕ−1)′/(ϕ−1)′′ is the identity function, which is additive, and therefore superadditive.
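To illustrate Proposition 1 numerically: for ϕ = expq one has g(x) = x/q (see the proof of Proposition 10 below), which is additive and hence superadditive, so the partition inequality (11) must hold. The following sketch (our own; the dimension, the partition, and q = 0.6 are arbitrary choices) spot-checks (11) on random distributions.

```python
import numpy as np
rng = np.random.default_rng(3)

tq = 0.6
ln_q = lambda x: (x**(1 - tq) - 1) / (1 - tq)
d = lambda p, q: np.sum((ln_q(p) - ln_q(q)) * p**tq)   # (1) with (ln_q)'(x) = x^{-q}

parts = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]  # a partition of I_6
coarse = lambda p: np.array([p[A].sum() for A in parts])        # p -> p^A

for _ in range(1000):
    p, q = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
    assert d(p, q) >= d(coarse(p), coarse(q)) - 1e-9            # inequality (11)
```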

B. Joint convexity

In this section, we find a sufficient condition for the joint convexity of Dϕ(· || ·). We also show that Dϕ(· || ·) satisfies the partition inequality, and is jointly convex, if, and only if, the deformed exponential function is a scaled and translated version of the Tsallis exponential. The generalized divergence Dϕ(· || ·) is said to be jointly convex if the inequality

Dϕ(λp1 + (1 − λ)p2 || λq1 + (1 − λ)q2)

≤ λDϕ(p1 || q1) + (1 − λ)Dϕ(p2 || q2) (20)
is satisfied for all probability distributions p1, p2 and q1, q2 in ∆°n, and each λ ∈ [0, 1]. Before we find a sufficient condition for the joint convexity of Dϕ(· || ·), we show some preliminary results.
Lemma 6. The function g = −(ϕ−1)′/(ϕ−1)′′ is (strictly) concave if and only if h = ϕ′/ϕ′′ is (strictly) concave.
Proof: Inserting the expressions (ϕ−1)′ = 1/ϕ′(ϕ−1) and
(ϕ−1)′′ = −ϕ′′(ϕ−1) · (ϕ−1)′/[ϕ′(ϕ−1)]²
into the definition of g, we can write
g = ϕ′(ϕ−1) · ϕ′(ϕ−1)/ϕ′′(ϕ−1) = ϕ′(ϕ−1)h(ϕ−1).
Some calculations show that
g′+ = 1 + h′+(ϕ−1),
where (·)′+ denotes the right derivative. Because ϕ−1 is strictly increasing, we conclude that g′+ is (strictly) decreasing if and only if h′+ is (strictly) decreasing. As a result, for g to be (strictly) concave, it is necessary and sufficient that h be (strictly) concave.
Lemma 7. The function g = −(ϕ−1)′/(ϕ−1)′′ is concave if and only if the mapping
Fα(x, y) = ϕ((1 − α)ϕ−1(x) + αϕ−1(y)), (x, y) ∈ R²,
is concave for each α ∈ (0, 1).
Proof: Let us denote zα = (1 − α)ϕ−1(x) + αϕ−1(y). Some calculations show that
∂²Fα/∂x² (x, y) = (1 − α)(ϕ−1)′′(x)ϕ′(zα) + [(1 − α)(ϕ−1)′(x)]²ϕ′′(zα),
∂²Fα/∂y² (x, y) = α(ϕ−1)′′(y)ϕ′(zα) + [α(ϕ−1)′(y)]²ϕ′′(zα),
and
∂²Fα/∂x∂y (x, y) = α(1 − α)(ϕ−1)′(x)(ϕ−1)′(y)ϕ′′(zα),
which we use to find the following expression for the determinant of the Hessian of Fα at (x, y):

JFα(x, y) = α(1 − α)ϕ′(zα)ϕ′′(zα)(ϕ−1)′′(x)(ϕ−1)′′(y) · [ ϕ′(zα)/ϕ′′(zα) + α[(ϕ−1)′(y)]²/(ϕ−1)′′(y) + (1 − α)[(ϕ−1)′(x)]²/(ϕ−1)′′(x) ].
Denote h = ϕ′/ϕ′′. Noticing that
[(ϕ−1)′]²/(ϕ−1)′′ = −ϕ′(ϕ−1)/ϕ′′(ϕ−1), (21)
we conclude that JFα(x, y) ≥ 0 is equivalent to h(zα) ≥ (1 − α)h(ϕ−1(x)) + αh(ϕ−1(y)). To show that the Hessian of Fα is negative semi-definite, we have to verify, in addition, that its trace is non-positive. Since h is concave and non-negative, we have
ϕ′(zα)/ϕ′′(zα) − (1 − α)ϕ′(ϕ−1(x))/ϕ′′(ϕ−1(x)) ≥ 0. (22)
If we insert (21) into (22), and multiply the resulting expression by (1 − α)ϕ′′(zα)(ϕ−1)′′(x) ≤ 0, we get
∂²Fα/∂x² (x, y) = (1 − α)(ϕ−1)′′(x)ϕ′(zα) + [(1 − α)(ϕ−1)′(x)]²ϕ′′(zα) ≤ 0.
Analogously, we also have (∂²Fα/∂y²)(x, y) ≤ 0. Consequently, the Hessian of Fα has a non-positive trace. From Lemma 6, it follows that g is concave if and only if Fα is concave for each α ∈ (0, 1).
Proposition 8. If the function g = −(ϕ−1)′/(ϕ−1)′′ is concave, then the divergence Dϕ(· || ·) is jointly convex.
Proof: According to Lemma 7, the mapping

Fα(x, y) = ϕ((1 − α)ϕ−1(x) + αϕ−1(y)), (x, y) ∈ R²,
is concave for each α ∈ (0, 1). Fixed arbitrary pj = (pji) and qj = (qji) in ∆°n for j = 0, 1, define

pλ = (1 − λ)p0 + λp1,

qλ = (1 − λ)q0 + λq1, for each λ ∈ (0, 1). Hence we can write

(1/(1 − α)) Σ_{i=1}^n [pλi − Fα(qλi, pλi)]
≤ (1 − λ) (1/(1 − α)) Σ_{i=1}^n [p0i − Fα(q0i, p0i)] + λ (1/(1 − α)) Σ_{i=1}^n [p1i − Fα(q1i, p1i)]. (23)
Using L'Hôpital's rule in the limit below, we obtain
lim_{α↑1} [y − Fα(x, y)]/(1 − α) = lim_{α↑1} [y − ϕ((1 − α)ϕ−1(x) + αϕ−1(y))]/(1 − α)
= ϕ′(ϕ−1(y))[−ϕ−1(x) + ϕ−1(y)] = [ϕ−1(y) − ϕ−1(x)]/(ϕ−1)′(y).
Thus, in the limit α ↑ 1, expression (23) becomes

Dϕ(pλ || qλ) ≤ (1 − λ)Dϕ(p0 || q0) + λDϕ(p1 || q1),
which is the desired result.
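For ϕ = expq, the function g(x) = x/q is linear and therefore concave, so Proposition 8 guarantees joint convexity. A quick spot-check of inequality (20) (our own illustration, with arbitrary dimension and q = 0.6):

```python
import numpy as np
rng = np.random.default_rng(4)

tq = 0.6
ln_q = lambda x: (x**(1 - tq) - 1) / (1 - tq)
d = lambda p, q: np.sum((ln_q(p) - ln_q(q)) * p**tq)

for _ in range(1000):
    p0, p1 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    q0, q1 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    lam = rng.uniform()
    lhs = d(lam * p1 + (1 - lam) * p0, lam * q1 + (1 - lam) * q0)
    assert lhs <= lam * d(p1, q1) + (1 - lam) * d(p0, q0) + 1e-9   # inequality (20)
```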

The next result is a partial converse of Proposition 8.
Lemma 9. If the divergence Dϕ(· || ·) is jointly convex for some n ≥ 3, then the function g = −(ϕ−1)′/(ϕ−1)′′ satisfies the inequality
g((x + y)/2) ≥ (g(x) + g(y))/2, (24)
for all x, y ∈ (0, 1) such that x + y ∈ (0, 1).

Proof: If p1 = (p1i), p2 = (p2i) and q1 = (q1i), q2 = (q2i) then inequality (20) is equivalent to

Σ_{i=1}^n [ λ ϕ−1(p1i)/(ϕ−1)′(p1i) + (1 − λ) ϕ−1(p2i)/(ϕ−1)′(p2i) − ϕ−1(λp1i + (1 − λ)p2i)/(ϕ−1)′(λp1i + (1 − λ)p2i) ]
≥ Σ_{i=1}^n [ λ ϕ−1(q1i)/(ϕ−1)′(p1i) + (1 − λ) ϕ−1(q2i)/(ϕ−1)′(p2i) − ϕ−1(λq1i + (1 − λ)q2i)/(ϕ−1)′(λp1i + (1 − λ)p2i) ]. (25)
For the fixed probability distributions

p1 = (p1,p2,p,p13,...,p1n), (26)

p2 = (p2, p1, p, p23,...,p2n), (27)
in ∆°n, we consider

q1 = (p1 + x, p2 + y,p − x − y,p13,...,p1n), (28)

q2 = (p2 + y, p1 + x, p − x − y, p23,...,p2n), (29)
where x and y are taken so that q1 and q2 are in ∆°n. Inserting these probability distributions into (25) with λ = 1/2, we can infer that the function
F(x, y) = (1/2) ϕ−1(p1 + x)/(ϕ−1)′(p1) + (1/2) ϕ−1(p2 + y)/(ϕ−1)′(p2) − ϕ−1((1/2)(p1 + x) + (1/2)(p2 + y))/(ϕ−1)′((1/2)p1 + (1/2)p2)
+ (1/2) ϕ−1(p2 + y)/(ϕ−1)′(p2) + (1/2) ϕ−1(p1 + x)/(ϕ−1)′(p1) − ϕ−1((1/2)(p2 + y) + (1/2)(p1 + x))/(ϕ−1)′((1/2)p2 + (1/2)p1)
attains a global maximum at (x, y) = (0, 0). Further, we can also write
JF(0, 0) = [1/g(p1) − (1/2) · 1/g((1/2)p1 + (1/2)p2)] · [1/g(p2) − (1/2) · 1/g((1/2)p1 + (1/2)p2)] − [(1/2) · 1/g((1/2)p1 + (1/2)p2)]²
= 1/(g(p1)g(p2)) − (1/g((1/2)p1 + (1/2)p2)) [ (1/2) · 1/g(p2) + (1/2) · 1/g(p1) ],
where JF(0, 0) is the determinant of the Hessian of F at (0, 0). Since F(x, y) attains a maximum at (0, 0), inequality JF(0, 0) ≥ 0 implies g((1/2)p1 + (1/2)p2) ≥ (1/2)g(p1) + (1/2)g(p2).

Proposition 10. Assume that n ≥ 3. Then the generalized divergence Dϕ(· || ·) satisfies the partition inequality, and is jointly convex, if, and only if,
ϕ−1(x) = b lnq(x) − a, for x ∈ (0, 1),
for some q > 0 and b > 0, a ∈ R.
Proof: Clearly, inequalities (12) and (24) are satisfied for all x, y ∈ (0, 1) such that x + y ∈ (0, 1). Therefore, the function g(x) is superadditive and concave for x ∈ (0, 1/2). It is easy to verify that g(0+) = 0. To see this, we apply the limit x ↓ 0 in 0 ≤ g(x) ≤ g(x + y) − g(y), and use the continuity of g at y. In addition, because g(x) is concave with g(0+) = 0, the function g(x) is also subadditive for x ∈ (0, 1/2). Making y ↓ 0 in g(λx + (1 − λ)y) ≥ λg(x) + (1 − λ)g(y), we obtain that g(λx) ≥ λg(x) for λ ∈ [0, 1]. From the inequalities g(x) ≥ [x/(x + y)]g(x + y) and g(y) ≥ [y/(x + y)]g(x + y), it follows that g(x) + g(y) ≥ g(x + y). Hence we conclude that g(x) is additive for x ∈ (0, 1/2). By [18, Theorem 13.5.2], there exists q > 0 such that g(x) = x/q for x ∈ (0, 1/2). Using (12), and letting y ↓ 0 in (24), we get
g(x) ≥ g(x/2) + g(x/2), and g(x/2) ≥ g(x)/2,
which imply g(x) = 2g(x/2) for all x ∈ (0, 1). Hence, the expression g(x) = x/q is also verified for x ∈ (0, 1). Solving
g(x) = −(ϕ−1)′(x)/(ϕ−1)′′(x) = x/q
with respect to ϕ−1(x), we find b > 0 and a ∈ R such that
ϕ−1(x) = b x^{1−q}/(1 − q) − a = b lnq(x) − a, for q ≠ 1,
and ϕ−1(x) = b ln(x) − a, for q = 1, for every x ∈ (0, 1). The converse direction follows from Propositions 1 and 8.
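The final computation can be confirmed symbolically. A short sketch (ours, using sympy; it implicitly assumes q ≠ 1) checks that ϕ−1(x) = b lnq(x) − a indeed yields g(x) = x/q:

```python
import sympy as sp

x, a, b, q = sp.symbols('x a b q', positive=True)   # q != 1 is assumed implicitly
phi_inv = b * (x**(1 - q) - 1) / (1 - q) - a        # b ln_q(x) - a
g = -sp.diff(phi_inv, x) / sp.diff(phi_inv, x, 2)   # g = -(phi^{-1})'/(phi^{-1})''
assert sp.simplify(g - x / q) == 0                  # g(x) = x/q: additive and concave
```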

C. Pinsker’s inequality

Pinsker’s inequality relates the divergence with the ℓ1-. This inequality implies that convergence in divergence is stronger than convergence in the ℓ1-distance For the KL divergence, Pinsker’s inequality is given by 1 D (p || q) ≥ kp − qk2, (30) KL 2 1 n ◦ where kp − qk1 = i=1 |pi − qi| is the ℓ1-distance between probability distributions p = (pi) and q = (qi) in ∆n. The next result shows Pinsker’s inequality for the generalized divergence. P Theorem 11 (Pinsker’s Inequality). Suppose that the partition inequality (11) holds. In addition, assume that 1 1 (ϕ−1)′(q) (ϕ−1)′(1 − q) c = inf − + > 0. (31) 0

Proof: Let A = {A1, A2} be a partition of In, where A1 = {i : pi ≥ qi} and A2 = {i : pi < qi}. Hence we can write
‖p − q‖₁ = Σ_{i=1}^n |pi − qi| = Σ_{i∈A1} (pi − qi) + Σ_{i∈A2} (qi − pi) = (pA1 − qA1) + (qA2 − pA2) = ‖pA − qA‖₁.
By the partition inequality
Dϕ(p || q) ≥ Dϕ(pA || qA),
we see that it suffices to show
Dϕ(pA || qA) ≥ c ‖pA − qA‖₁². (33)
Let us denote pA1 = p and qA1 = q. Then inequality (33) can be rewritten as
[ϕ−1(p) − ϕ−1(q)]/(ϕ−1)′(p) + [ϕ−1(1 − p) − ϕ−1(1 − q)]/(ϕ−1)′(1 − p) ≥ 4c(p − q)²,

since ‖pA − qA‖₁ = 2(p − q). For a fixed p ∈ (0, 1), we define the function
F(q) = [ϕ−1(p) − ϕ−1(q)]/(ϕ−1)′(p) + [ϕ−1(1 − p) − ϕ−1(1 − q)]/(ϕ−1)′(1 − p) − 4c(p − q)²,
for q ∈ (0, 1). By the symmetry of the terms p and q in (31), it is clear that
c = inf_{0<p<q<1} (1/8) (1/(q − p)) [ −(ϕ−1)′(q)/(ϕ−1)′(p) + (ϕ−1)′(1 − q)/(ϕ−1)′(1 − p) ] > 0.
The derivative of F,
F′(q) = −(ϕ−1)′(q)/(ϕ−1)′(p) + (ϕ−1)′(1 − q)/(ϕ−1)′(1 − p) + 8c(p − q),
is ≥ 0 for q > p, and ≤ 0 for q < p. Hence F attains a global minimum at q = p, where F(p) = 0. It follows that F(q) ≥ 0 for every q ∈ (0, 1), which shows (33).
Remark 12. For an f-divergence Df(· || ·), Pinsker's inequality holds with the constant f′′(1)/2 (see [13], [14]). Tsallis relative entropy is an f-divergence with f(x) = −lnq(x). In this case, we have f′′(1) = q.
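A numerical illustration of Theorem 11 (our own sketch; it uses the constant c exactly as in the criterion (31) reconstructed above, with ϕ = expq and q = 0.6): the grid estimate of c comes out close to q/2, matching the f-divergence constant f′′(1)/2 of Remark 12, and Pinsker's inequality (32) holds on random draws.

```python
import numpy as np
rng = np.random.default_rng(5)

tq = 0.6
ln_q = lambda x: (x**(1 - tq) - 1) / (1 - tq)
ln_q_prime = lambda x: x**(-tq)
d = lambda p, q: np.sum((ln_q(p) - ln_q(q)) / ln_q_prime(p))

# grid estimate of the infimum in (31); the grid minimum upper-bounds the true c
ts = np.linspace(1e-3, 1 - 1e-3, 300)
c = np.inf
for i, pp in enumerate(ts):
    for qv in ts[:i]:                      # 0 < q < p < 1
        val = (ln_q_prime(qv) / ln_q_prime(pp)
               - ln_q_prime(1 - qv) / ln_q_prime(1 - pp)) / (8 * (pp - qv))
        c = min(c, val)
print("estimated c:", c, "  q/2 =", tq / 2)

# spot-check of (32), with 1% slack because c is only a grid estimate
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    assert d(p, q) >= 0.99 * c * np.sum(np.abs(p - q))**2
```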

IV. CONCLUSIONS

In this work, we found necessary and sufficient conditions for the generalized divergence Dϕ(· || ·) to satisfy the partition inequality. We also showed a condition that implies the joint convexity of Dϕ(· || ·). It was proved that, for the generalized divergence Dϕ(· || ·) to coincide with the Tsallis relative entropy Dq(· || ·), it is necessary and sufficient that Dϕ(· || ·) satisfy the partition inequality, and be jointly convex. As an application of the partition inequality, a criterion for Pinsker's inequality was found. We also constructed a family of probability distributions associated with the generalized divergence.
This work can be extended in many aspects. The data processing inequality was not proved. Comparisons between generalized divergences, as investigated in [19] for f-divergences, are a promising topic of research. In [20], a generalization of the Rényi divergence was defined in terms of a deformed exponential. As future work, we aim to investigate the properties of this generalized Rényi divergence.

ACKNOWLEDGMENT The authors would like to thank CNPq (Procs. 408609/2016-8 and 309472/2017-2) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance Code 001 for partial funding of this research. We would also like to thank Sueli I.R. Costa for the valuable contributions to this work.

REFERENCES
[1] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, 2006.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., ser. Springer Series in Statistics. Springer, New York, 2009, data mining, inference, and prediction.
[3] J. C. Principe, Information Theoretic Learning, ser. Information Science and Statistics. Springer, New York, 2010, Rényi's entropy and kernel perspectives.
[4] S. Konishi and G. Kitagawa, Information Criteria and Statistical Modeling, ser. Springer Series in Statistics. Springer, New York, 2008.
[5] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statistics, vol. 22, pp. 79–86, 1951. [Online]. Available: https://doi.org/10.1214/aoms/1177729694
[6] L. Borland, A. R. Plastino, and C. Tsallis, "Information gain within nonextensive thermostatistics," J. Math. Phys., vol. 39, no. 12, pp. 6490–6501, 1998. [Online]. Available: https://doi.org/10.1063/1.532660
[7] ——, "Erratum: 'Information gain within generalized [non-extensive] thermostatistics'," J. Math. Phys., vol. 40, no. 4, p. 2196, 1999. [Online]. Available: https://doi.org/10.1063/1.533119
[8] T. van Erven and P. Harremoës, "Rényi divergence and Kullback-Leibler divergence," IEEE Trans. Inform. Theory, vol. 60, no. 7, pp. 3797–3820, 2014. [Online]. Available: https://doi.org/10.1109/TIT.2014.2320500
[9] S. Furuichi, K. Yanagi, and K. Kuriyama, "Fundamental properties of Tsallis relative entropy," J. Math. Phys., vol. 45, no. 12, pp. 4868–4877, 2004.
[10] S. Furuichi, "On uniqueness theorems for Tsallis entropy and Tsallis relative entropy," IEEE Trans. Inform. Theory, vol. 51, no. 10, pp. 3638–3645, 2005. [Online]. Available: https://doi.org/10.1109/TIT.2005.855606
[11] J. Zhang, "Divergence function, duality, and convex analysis," Neural Comput., vol. 16, no. 1, pp. 159–195, Jan. 2004.
[12] M. Broniatowski and W. Stummer, "Some universal insights on divergences for statistics, machine learning and artificial intelligence," in Geometric Structures of Information, ser. Signals Commun. Technol. Springer, Cham, 2019, pp. 149–211.

[13] G. L. Gilardoni, "On Pinsker's and Vajda's type inequalities for Csiszár's f-divergences," IEEE Trans. Inform. Theory, vol. 56, no. 11, pp. 5377–5386, 2010.
[14] I. Sason and S. Verdú, "f-divergence inequalities," IEEE Trans. Inform. Theory, vol. 62, no. 11, pp. 5973–6006, 2016. [Online]. Available: https://doi.org/10.1109/TIT.2016.2603151
[15] R. Bhatia, Positive Definite Matrices, ser. Princeton Series in Applied Mathematics. Princeton University Press, Princeton, NJ, 2007.
[16] G. H. Hardy, J. E. Littlewood, and G. Pólya, Inequalities, ser. Cambridge Mathematical Library. Cambridge University Press, Cambridge, 1988, reprint of the 1952 edition.
[17] J. Matkowski, "The converse of the Minkowski's inequality theorem and its generalization," Proc. Amer. Math. Soc., vol. 109, no. 3, pp. 663–675, 1990. [Online]. Available: https://doi.org/10.2307/2048205
[18] M. Kuczma, An Introduction to the Theory of Functional Equations and Inequalities, 2nd ed. Birkhäuser Verlag, Basel, 2009.
[19] P. Harremoës and I. Vajda, "On pairs of f-divergences and their joint range," IEEE Trans. Inform. Theory, vol. 57, no. 6, pp. 3230–3235, 2011. [Online]. Available: https://doi.org/10.1109/TIT.2011.2137353
[20] D. C. de Souza, R. F. Vigelis, and C. C. Cavalcante, "Geometry induced by a generalization of Rényi divergence," Entropy, vol. 18, no. 11, Paper No. 407, 16 pp., 2016. [Online]. Available: https://doi.org/10.3390/e18110407

Rui F. Vigelis received the B.Sc degree in Electrical Engineering from the Federal University of Ceará, Brazil, in 2005, and the M.Sc and Ph.D. degrees in Teleinformatics Engineering from the Federal University of Ceará, Brazil, in 2006 and 2011, respectively. Since 2012, he has been an Assistant Professor at the Federal University of Ceará, Campus Sobral. His primary research interests are in the analysis of non-standard function spaces (e.g., Musielak–Orlicz spaces), non-parametric information geometry, and measures of information.

Luiza H.F. Andrade received the B.Sc degree in Mathematics from the Ceará State University, in 2002, the M.Sc degree in Mathematics and the Ph.D. degree in Teleinformatics Engineering, both from the Federal University of Ceará, in 2007 and 2018, respectively. She has been developing research in information geometry and information theory.

Charles C. Cavalcante (S'98 - M'04 - SM'11) received the B.Sc and M.Sc in Electrical Engineering from the Federal University of Ceará (UFC), Brazil, in 1999 and 2001, respectively, and the Ph.D. degree from the University of Campinas (UNICAMP), Brazil, in 2004. He held a grant for Scientific and Technological Development from 2004 to 2007, and since March 2009 he has held a grant of Scientific Research Productivity, both from the Brazilian Research Council (CNPq). He is now an Associate Professor at the Teleinformatics Engineering Department of UFC, holding the Statistical Signal Processing chair. From August 2014 to July 2015 he was a Visiting Assistant Professor at the Department of Computer Science and Electrical Engineering (CSEE) of the University of Maryland, Baltimore County (UMBC), in the United States. He has been working on signal processing strategies for communications, where he has several papers published in journals and conferences, has authored three international patents, and has worked on several funded research projects in the signal processing and wireless communications areas. He is also a co-author of the book Unsupervised Signal Processing: Channel Equalization and Source Separation and co-editor of the book Signals and Images: Advances and Results in Speech, Estimation, Compression, Recognition, Filtering, and Processing, both published by CRC Press. He is a researcher of the Wireless Telecommunications Research Group (GTEL), where he leads research on signal processing and wireless communications. Dr. Cavalcante is a Senior Member of the IEEE and a Senior Member of the Brazilian Telecommunications Society (SBrT) for the term 2018-2020. Since March 2018 he has been the President of the Brazilian Telecommunications Society (SBrT), and he has just been elected to the IEEE Signal Processing Society Board of Governors in the capacity of Regional Director-at-Large for Regions 7 & 9 for the term 2020-2021. His main research interests are in signal processing for communications, statistical signal processing and information geometry.