Relative Entropy and Score Function: New Information–Estimation Relationships through Arbitrary Additive Perturbation

Dongning Guo, Department of Electrical Engineering & Computer Science, Northwestern University, Evanston, IL 60208, USA

Abstract—This paper establishes new information–estimation relationships pertaining to models with additive noise of arbitrary distribution. In particular, we study the change in the relative entropy between two probability measures when both of them are perturbed by a small amount of the same additive noise. It is shown that the rate of the change with respect to the energy of the perturbation can be expressed in terms of the mean squared difference of the score functions of the two distributions, and, rather surprisingly, is otherwise unrelated to the distribution of the perturbation. The result holds true for the classical relative entropy (or Kullback–Leibler distance), as well as two of its generalizations: Rényi's relative entropy and the f-divergence. The result generalizes a recent relationship between the relative entropy and mean squared errors pertaining to Gaussian noise models, which in turn supersedes many previous information–estimation relationships. A generalization of the de Bruijn identity to non-Gaussian models can also be regarded as a consequence of this new result.

(This work is supported by the NSF under grant CCF-0644344 and by DARPA under grant W911NF-07-1-0028.)

I. INTRODUCTION

To date, a number of connections between basic information measures and estimation measures have been discovered. By information measures we mean notions which describe the amount of information, such as entropy and mutual information, as well as several closely related quantities, such as differential entropy and relative entropy (also known as information divergence or Kullback–Leibler distance). By estimation measures we mean key notions in estimation theory, which include in particular the mean squared error (MSE) and Fisher information, among others.

An early such connection is attributed to de Bruijn [1], which relates the differential entropy of an arbitrary random variable corrupted by Gaussian noise to its Fisher information:

    d/dδ h(X + √δ N) = (1/2) J(X + √δ N)                                    (1)

for every δ ≥ 0, where X denotes an arbitrary random variable and N ∼ N(0, 1) denotes a standard Gaussian random variable independent of X throughout this paper. Here J(Y) denotes the Fisher information of its distribution with respect to (w.r.t.) the location family. The de Bruijn identity is equivalent to a recent connection between the input–output mutual information and the minimum mean-square error (MMSE) of a Gaussian model [2]:

    d/dγ I(X; √γ X + N) = (1/2) mmse(P_X, γ)                                (2)

where X ∼ P_X and mmse(P_X, γ) denotes the MMSE of estimating X given √γ X + N. The parameter γ ≥ 0 is understood as the signal-to-noise ratio (SNR) of the Gaussian model. By-products of formula (2) include the representation of the entropy, and of the non-Gaussianness (measured in relative entropy), in terms of the MMSE [2]–[4]. Several generalizations and extensions of the previous results are found in [5]–[7]. Moreover, the derivative of the mutual information and entropy w.r.t. channel gains has also been studied for non-additive-noise channels [7], [8].
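As a quick numerical illustration of (2), added here as a sketch and not part of the original paper, one can check the identity for an equiprobable binary input X ∈ {−1, +1}, using the standard facts that E{X | √γ X + N = y} = tanh(√γ y) and that I(X; √γ X + N) = h(√γ X + N) − (1/2) log(2πe) in nats. The grid and step sizes below are arbitrary choices.

    import numpy as np

    # Sanity check of (2): d/dgamma I(X; sqrt(gamma) X + N) = (1/2) mmse(P_X, gamma)
    # for an equiprobable binary input X in {-1, +1}. Illustrative sketch only.
    y = np.linspace(-12.0, 12.0, 48001)
    dy = y[1] - y[0]

    def gauss(u):
        return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

    def output_pdf(snr):
        return 0.5 * gauss(y - np.sqrt(snr)) + 0.5 * gauss(y + np.sqrt(snr))

    def mutual_info(snr):                      # in nats
        pY = output_pdf(snr)
        hY = -np.sum(pY * np.log(pY)) * dy     # differential entropy of the output
        return hY - 0.5 * np.log(2 * np.pi * np.e)

    def mmse(snr):                             # E{X | Y = y} = tanh(sqrt(snr) * y)
        pY = output_pdf(snr)
        return 1.0 - np.sum(pY * np.tanh(np.sqrt(snr) * y)**2) * dy

    snr, h = 1.0, 1e-3
    lhs = (mutual_info(snr + h) - mutual_info(snr - h)) / (2 * h)
    rhs = 0.5 * mmse(snr)
    print(lhs, rhs)   # the two values agree up to discretization and step-size error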
Among the aforementioned information measures, relative entropy is the most general and versatile in the sense that it is defined for distributions which are discrete, continuous, or neither, and all the other information measures can be easily expressed in terms of relative entropy. The following relationship between the relative entropy and the Fisher information is known [9]: Let {p_θ} be a family of probability density functions (pdfs) parameterized by θ ∈ ℝ. Then

    D(p_{θ+δ} ‖ p_θ) = (δ²/2) J(p_θ) + o(δ²)                                (3)

where J(p_θ) is the Fisher information of p_θ w.r.t. θ.

In a recent work [10], Verdú established an interesting relationship between the relative entropy and mismatched estimation. Let mse_Q(P, γ) represent the MSE for estimating the input X of distribution P to a Gaussian channel of SNR equal to γ based on the channel output, with the estimator assuming the prior distribution of X to be Q. Then

    2 d/dγ D( P ∗ N(0, γ⁻¹) ‖ Q ∗ N(0, γ⁻¹) ) = mse_Q(P, γ) − mmse(P, γ)     (4)

where the convolution P ∗ N(0, γ⁻¹) represents the distribution of X + N/√γ with X ∼ P. Obviously mse_Q(P, γ) = mmse(P, γ) if Q is identical to P. The formula is particularly satisfying because the left-hand side (l.h.s.) is an information-theoretic measure of the mismatch between two distributions, whereas the right-hand side (r.h.s.) measures the mismatch using an estimation-theoretic metric, i.e., the increase in the estimation error due to the mismatch.

In another recent work, Narayanan and Srinivasa [11] consider an additive non-Gaussian noise channel model and provide the following generalization of the de Bruijn identity:

    d/dδ h(X + √δ V)|_{δ=0+} = (1/2) J(X)                                   (5)

where the pdf of V is symmetric about 0, twice differentiable, and of unit variance but otherwise arbitrary. The significance of (5) is that the derivative does not depend on the detailed statistics of the noise. Thus, if we view the differential entropy as a manifold of the distribution of the perturbation √δ V, then the geometry of the manifold appears to be locally a bowl which is uniform in every direction of the perturbation.

In this work, we consider the change in the relative entropy between two distributions when both of them are perturbed by an infinitesimal amount of the same arbitrary additive noise. We show that the rate of this change can be expressed as the mean-squared difference of the score functions of the two distributions. Note that the score function is an important notion in estimation theory, whose mean square is the Fisher information. Like formula (5), the new general relationship turns out to be independent of the noise distribution.
The general relationship is found to hold for both the classical relative entropy (or Kullback–Leibler distance) and the more general Rényi's relative entropy, as well as the general f-divergence due to Csiszár and, independently, Ali and Silvey [12]. In the special case of Gaussian perturbations, it is shown that (1), (2), (4), (5) can all be obtained as consequences of the new result.

II. MAIN RESULTS

Theorem 1: Let Ψ denote an arbitrary distribution with zero mean and variance δ. Let P and Q be two distributions whose respective pdfs p and q are twice differentiable. If P ≪ Q and

    lim_{z→∞} d/dz [ p(z) log( p(z)/q(z) ) ] = 0                            (6)

then

    d/dδ D(P ∗ Ψ ‖ Q ∗ Ψ)|_{δ=0+}
      = −(1/2) ∫_{−∞}^{∞} p(z) ( ∇ log( p(z)/q(z) ) )² dz                   (7)
      = −(1/2) E_P{ ( ∇ log p(Z) − ∇ log q(Z) )² }                          (8)

where the expectation in (8) is taken with respect to Z ∼ P.

The classical relative entropy (Kullback–Leibler distance) is defined for two probability measures P ≪ Q as

    D(P ‖ Q) = ∫ log( dP/dQ ) dP.                                           (9)

When the corresponding densities exist, it is also customary to denote the relative entropy by D(p‖q).

The notation d/dδ in Theorem 1 can be understood as taking the derivative w.r.t. the variance of the distribution Ψ with its shape fixed, i.e., Ψ is the distribution of √δ V with the random variable V fixed. We note that the r.h.s. of (8) does not depend on the distribution of V, i.e., the change in the relative entropy due to a small perturbation is proportional to the variance of the perturbation but independent of its shape. Thus the notation d/dδ is not ambiguous.

For every function f, let ∇f denote its derivative f′ for notational convenience. For every differentiable pdf p, the function ∇ log p(x) = p′(x)/p(x) is known as its score function, hence the r.h.s. of (8) is the mean squared difference of two score functions. As with the previous result (4), this is satisfying because both sides of (8) represent some error due to the mismatch between the prior distribution q supplied to the estimator and the actual distribution p. Obviously, if p and q are identical, then both sides of the formula are equal to zero; otherwise, the derivative is negative (i.e., perturbation reduces relative entropy).
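The following sketch (ours, not from the paper) illustrates Theorem 1 with a non-Gaussian perturbation. P and Q are taken Gaussian so that their score functions are explicit, V is uniform on [−√3, √3] (zero mean, unit variance), and the finite-difference derivative of D(P ∗ Ψ ‖ Q ∗ Ψ) in the variance δ is compared against the r.h.s. of (8); the particular parameters are arbitrary.

    import numpy as np
    from scipy.stats import norm

    # Theorem 1 with a uniform (non-Gaussian) perturbation: illustrative sketch.
    mu1, s1 = 0.0, 1.0        # P = N(0, 1)      (s1, s2 are variances)
    mu2, s2 = 0.5, 1.5        # Q = N(0.5, 1.5)
    y = np.linspace(-10.0, 10.0, 8001)
    dy = y[1] - y[0]

    def perturbed_pdf(mu, var, delta):
        # pdf of X + sqrt(delta) V with X ~ N(mu, var), V ~ Unif[-sqrt(3), sqrt(3)];
        # the convolution with a uniform density has a closed form via the normal cdf
        w = np.sqrt(3.0 * delta)
        sd = np.sqrt(var)
        return (norm.cdf((y - mu + w) / sd) - norm.cdf((y - mu - w) / sd)) / (2 * w)

    def kl(delta):
        pd = np.maximum(perturbed_pdf(mu1, s1, delta), 1e-300)   # floor avoids log(0)
        qd = np.maximum(perturbed_pdf(mu2, s2, delta), 1e-300)
        return np.sum(pd * np.log(pd / qd)) * dy

    d1, d2 = 0.005, 0.010
    lhs = (kl(d2) - kl(d1)) / (d2 - d1)              # finite-difference d/d(delta)

    p = norm.pdf(y, mu1, np.sqrt(s1))
    score_diff = (-(y - mu1) / s1) - (-(y - mu2) / s2)
    rhs = -0.5 * np.sum(p * score_diff**2) * dy      # r.h.s. of (8)
    print(lhs, rhs)   # both close to -0.111; the remaining gap is O(delta)

Replacing the uniform V by any other zero-mean, unit-variance perturbation leaves the limiting value unchanged, which is precisely the point of the theorem.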

Consider now the Rényi relative entropy, which is defined for two probability measures P ≪ Q and every α > 0 as

    D_α(P ‖ Q) = (1/(α − 1)) log ∫ ( dP/dQ )^{α−1} dP                       (10)

where D_1(P ‖ Q) is defined as the classical relative entropy D(P ‖ Q) because lim_{α→1} D_α(P ‖ Q) = D(P ‖ Q).

Theorem 2: Let the distributions P, Q and Ψ be defined the same way as in Theorem 1. Let δ denote the variance of Ψ. If P ≪ Q and

    lim_{z→∞} d/dz [ p^α(z) q^{1−α}(z) ] = 0                                (11)

then

    d/dδ D_α(P ∗ Ψ ‖ Q ∗ Ψ)|_{δ=0+}
      = −(α/2) ∫_{−∞}^{∞} [ p^α(z) q^{1−α}(z) / ∫_{−∞}^{∞} p^α(u) q^{1−α}(u) du ]
                ( ∇ log( p(z)/q(z) ) )² dz.                                 (12)

Note that as α → 1, the r.h.s. of (12) becomes the r.h.s. of (7). We also point out that, similar to that in (7), the outer integral in (12) can be viewed as the mean square difference of two scores, (∇ log p(Z) − ∇ log q(Z))², with the pdf of Z being proportional to p^α(z) q^{1−α}(z).

Theorems 1 and 2 are quite general because conditions (6) and (11) are satisfied by most distributions of interest. For example, if both p and q belong to the exponential family, then the derivatives in (6) and (11) also vanish exponentially fast. Not all distributions satisfy those conditions, however. This is because, although the functions p(z) and p(z) log(p(z)/q(z)) integrate to 1 and D(P ‖ Q) respectively, they need not vanish. For example, p(z) may consist of narrower and narrower Gaussian pulses of the same height as z → ∞, so that not only does p(z) not vanish, but p′(z) is unbounded.

Another family of generalized relative entropy, called the f-divergence, was introduced by Csiszár and independently by Ali and Silvey (see e.g., [12]). It is defined for P ≪ Q as

    I_f(P ‖ Q) = ∫ f( dP/dQ ) dQ.                                           (13)

Theorem 3: Let the distributions P, Q and Ψ be defined the same way as in Theorem 1. Let δ denote the variance of Ψ. Suppose the second derivative of f(·) exists and is denoted by f″(·). If P ≪ Q and

    lim_{z→∞} d/dz [ q(z) f( p(z)/q(z) ) ] = 0                              (14)

then

    d/dδ I_f(P ∗ Ψ ‖ Q ∗ Ψ)|_{δ=0+}
      = −(1/2) ∫_{−∞}^{∞} q(y) f″( p(y)/q(y) ) ( ∇( p(y)/q(y) ) )² dy.      (15)

The integral in (15) can still be expressed in terms of the difference of the score functions because

    ∇( p(y)/q(y) ) = ( p(y)/q(y) ) [ ∇ log p(y) − ∇ log q(y) ].             (16)

Note that the special case f(t) = t log t corresponds to the Kullback–Leibler distance, whereas the case f(t) = (t^q − t)/(q − 1) corresponds to the Tsallis relative entropy [13]. Indeed, Theorem 1 is a special case of Theorem 3.
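As a small supplementary check, not in the paper, the reduction of Theorem 3 to Theorem 1 for f(t) = t log t can be verified symbolically: since f″(t) = 1/t, the integrand of (15) collapses to the integrand of (7). A SymPy sketch:

    import sympy as sp

    # f(t) = t*log(t): check that q f''(p/q) (d/dy (p/q))^2 equals p (d/dy log(p/q))^2
    y, t = sp.symbols('y t', positive=True)
    p = sp.Function('p')(y)
    q = sp.Function('q')(y)

    f = t * sp.log(t)
    fpp = sp.diff(f, t, 2)                                   # equals 1/t

    integrand_15 = q * fpp.subs(t, p / q) * sp.diff(p / q, y)**2
    integrand_7 = p * sp.diff(sp.log(p / q), y)**2
    print(sp.simplify(integrand_15 - integrand_7))           # prints 0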
III. PROOF

The key property that underlies Theorems 1–3 is the following observation on the local geometry of an additive-noise-perturbed distribution made in [11]:

Lemma 1: Let the pdf p(·) of a random variable Z be twice differentiable. Let p_δ denote the pdf of Z + √δ V, where V is of zero mean and unit variance, and is independent of Z. Then for every y ∈ ℝ, as δ → 0+,

    ∂/∂δ p_δ(y)|_{δ=0+} = (1/2) d²/dy² p(y).                                (17)

Formula (17) allows the derivative w.r.t. the energy of the perturbation δ to be transformed to the second derivative of the original pdf. In the Appendix we provide a brief proof of Lemma 1 which is slightly different from that in [11]. Note that Lemma 1 does not require the distribution of the perturbation to be symmetric, as is required in [11].
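To illustrate Lemma 1, and in particular that no symmetry of V is needed, the following sketch (ours) takes Z standard normal and V a deliberately asymmetric two-point variable with zero mean and unit variance, and compares the finite-difference ∂p_δ(y)/∂δ with (1/2) p″(y):

    import numpy as np

    # Lemma 1 with an asymmetric perturbation: V = 2 w.p. 0.2 and V = -0.5 w.p. 0.8
    # (zero mean, unit variance). Z ~ N(0, 1). Illustrative sketch only.
    def p(y):                       # pdf of Z
        return np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)

    def p_second(y):                # p''(y) for the standard normal pdf
        return (y**2 - 1) * p(y)

    def p_delta(y, delta):          # pdf of Z + sqrt(delta) V (a two-point mixture)
        r = np.sqrt(delta)
        return 0.2 * p(y - 2 * r) + 0.8 * p(y + 0.5 * r)

    delta = 1e-4
    ygrid = np.array([-2.0, -1.0, 0.0, 0.5, 2.0])
    lhs = (p_delta(ygrid, delta) - p(ygrid)) / delta    # finite difference in delta
    rhs = 0.5 * p_second(ygrid)
    print(lhs)
    print(rhs)    # the two rows nearly coincide; the residual is O(sqrt(delta))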

In the following we first prove Theorem 3, which implies Theorem 1 as a special case, and then prove Theorem 2.

A. Proof for Theorem 3

Let V be a random variable with fixed distribution P_V. For convenience, we use the shorthand p_δ to denote the pdf of the random variable Y = Z + √δ V with (Z, V) ∼ P × P_V. Similarly, let q_δ denote the pdf of Y = Z + √δ V with (Z, V) ∼ Q × P_V. Clearly,

    d/dδ I_f(P ∗ Ψ ‖ Q ∗ Ψ) = d/dδ I_f(p_δ ‖ q_δ).                          (18)

For any single-variable function g, let g′ and g″ denote its first and second derivative respectively. Consider now

    d/dδ I_f(p_δ ‖ q_δ)
      = d/dδ ∫_{−∞}^{∞} q_δ(y) f( p_δ(y)/q_δ(y) ) dy                        (19)
      = ∫_{−∞}^{∞} [ ∂q_δ(y)/∂δ · f( p_δ(y)/q_δ(y) )
                     + q_δ(y) ∂/∂δ f( p_δ(y)/q_δ(y) ) ] dy                  (20)
      = ∫_{−∞}^{∞} [ ∂q_δ(y)/∂δ ( f( p_δ(y)/q_δ(y) ) − ( p_δ(y)/q_δ(y) ) f′( p_δ(y)/q_δ(y) ) )
                     + ∂p_δ(y)/∂δ · f′( p_δ(y)/q_δ(y) ) ] dy.               (21)

Invoking Lemma 1 on (21) yields

    d/dδ I_f(p_δ ‖ q_δ)|_{δ=0+}
      = (1/2) ∫_{−∞}^{∞} [ p″(y) f′( p(y)/q(y) )
                           + q″(y) ( f( p(y)/q(y) ) − ( p(y)/q(y) ) f′( p(y)/q(y) ) ) ] dy.     (22)

To proceed, we reorganize the integrand in (22) into the desired form. The key technique is integration by parts, which we carry out implicitly with the help of a modest amount of foresight. For convenience, we use p and q as shorthand for p(y) and q(y) respectively. We use the fact g″h = (g′h)′ − g′h′ to rewrite the integrand in (22) as

    p″ f′(p/q) + q″ [ f(p/q) − (p/q) f′(p/q) ]
      = [ p′ f′(p/q) + q′ ( f(p/q) − (p/q) f′(p/q) ) ]′
        − p′ [ f′(p/q) ]′ − q′ [ f(p/q) − (p/q) f′(p/q) ]′.                 (23)

Combining the first two terms and simplifying the last term on the r.h.s. of (23) yields

    [ q f(p/q) ]″ − p′ [ f′(p/q) ]′ + (p q′/q) [ f′(p/q) ]′.                (24)

The first term in (24) integrates to zero by assumption (14). The last two terms in (24) can be combined to obtain

    − q ∇(p/q) [ f′(p/q) ]′ = − q ( ∇(p/q) )² f″(p/q).                      (25)

Collecting the results from (22) to (25), we have

    d/dδ I_f(p_δ ‖ q_δ)|_{δ=0+}
      = −(1/2) ∫_{−∞}^{∞} q(y) ( ∇( p(y)/q(y) ) )² f″( p(y)/q(y) ) dy       (26)

which is equivalent to (15). Hence the proof of Theorem 3.

We note that the preceding calculation is tantamount to two uses of integration by parts. The treatment here, however, requires the minimum regularity conditions on the densities p and q.

B. Proof for Theorem 2

Consider now

    d/dδ D_α(p_δ ‖ q_δ)
      = (1/(α − 1)) d/dδ log ∫_{−∞}^{∞} p_δ^α(y) q_δ^{1−α}(y) dy            (27)
      = (1/(α − 1)) [ ∫_{−∞}^{∞} ∂/∂δ ( p_δ^α(y) q_δ^{1−α}(y) ) dy ]
                    / [ ∫_{−∞}^{∞} p_δ^α(y) q_δ^{1−α}(y) dy ].              (28)

By Lemma 1, the integral in the numerator of (28) can be written as

    ∫_{−∞}^{∞} [ α ( p_δ(y)/q_δ(y) )^{α−1} ∂p_δ(y)/∂δ
                 + (1 − α) ( p_δ(y)/q_δ(y) )^{α} ∂q_δ(y)/∂δ ] dy
      = ∫_{−∞}^{∞} [ (α/2) p″(y) ( p(y)/q(y) )^{α−1}
                     + ((1 − α)/2) q″(y) ( p(y)/q(y) )^{α} ] dy             (29)

at δ = 0+. Note that (11) implies that

    α ( p(y)/q(y) )^{α−1} p′(y) + (1 − α) ( p(y)/q(y) )^{α} q′(y)           (30)

vanishes as y → ∞. Using integration by parts, we further equate the integral on the r.h.s. of (29) to

    ∫_{−∞}^{∞} [ −(α/2) p′ d/dy ( p/q )^{α−1} + ((α − 1)/2) q′ d/dy ( p/q )^{α} ] dy.     (31)

The integrand in (31) can be written as

    (α(1 − α)/2) [ p′ ( p/q )^{α−2} − q′ ( p/q )^{α−1} ] ∇( p/q ).          (32)

In the following we omit the coefficient α(1 − α)/2 to write the remaining terms in (32) as

    [ p′ ( p/q )^{α−2} − q′ ( p/q )^{α−1} ] ( p′ q − p q′ )/q²
      = (p′)² p^{α−2} q^{1−α} − 2 p′ q′ p^{α−1} q^{−α} + (q′)² p^{α} q^{−1−α}     (33)
      = ( ∇ log p − ∇ log q )² p^{α} q^{1−α}.                               (34)

Collecting the preceding results from (28) to (34), we have established (12) in Theorem 2.
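A numerical sketch of Theorem 2 (ours, not from the paper): with a Gaussian perturbation the perturbed densities remain Gaussian, so D_α(P ∗ Ψ ‖ Q ∗ Ψ) is easy to evaluate on a grid and its finite-difference derivative can be compared with the r.h.s. of (12). Here α = 2 and the Gaussian parameters are arbitrary choices.

    import numpy as np
    from scipy.stats import norm

    # Check of (12) for alpha = 2, P = N(0, 1), Q = N(0.3, 1.2), Gaussian perturbation.
    alpha = 2.0
    mu1, s1 = 0.0, 1.0
    mu2, s2 = 0.3, 1.2
    z = np.linspace(-15.0, 15.0, 12001)
    dz = z[1] - z[0]

    def renyi(delta):
        pd = norm.pdf(z, mu1, np.sqrt(s1 + delta))
        qd = norm.pdf(z, mu2, np.sqrt(s2 + delta))
        return np.log(np.sum(pd**alpha * qd**(1 - alpha)) * dz) / (alpha - 1)

    d1, d2 = 1e-3, 2e-3
    lhs = (renyi(d2) - renyi(d1)) / (d2 - d1)        # finite-difference derivative

    p = norm.pdf(z, mu1, np.sqrt(s1))
    q = norm.pdf(z, mu2, np.sqrt(s2))
    tilted = p**alpha * q**(1 - alpha)
    tilted = tilted / (np.sum(tilted) * dz)          # normalized tilted density in (12)
    score_diff = (-(z - mu1) / s1) - (-(z - mu2) / s2)
    rhs = -(alpha / 2) * np.sum(tilted * score_diff**2) * dz
    print(lhs, rhs)   # the two values should nearly coincide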

IV. RECOVERING EXISTING INFORMATION–ESTIMATION RELATIONSHIPS USING THEOREM 1

A. Mutual information and MMSE

We first use Theorem 1 to recover formula (2) established in [2]. For convenience, consider the following alternative Gaussian model:

    Z = X + σ W                                                             (35)

where X and W ∼ N(0, 1) are independent, so that σ² represents the noise variance. It suffices to show the following result, which is equivalent to (2):

    d/d(σ²) I(X; X + σW) = −(1/(2σ⁴)) mmse(P_X, 1/σ²).                      (36)

For any x ∈ ℝ, let P_{Z|X=x} denote the distribution of Z as the output of the model (35) conditioned on X = x. The mutual information can be expressed as

    I(X; X + σW) = I(X; Z) = D(P_{Z|X} ‖ P_Z | P_X)                         (37)

which is the average of D(P_{Z|X=x} ‖ P_Z) over x according to the distribution P_X, which does not depend on σ². Let N ∼ N(0, 1) be independent of Z. Consider the derivative of D(P_{Z|X=x} ‖ P_Z) w.r.t. σ², or equivalently, by introducing a small perturbation,

    d/dδ D( P_{Z+√δ N | X=x} ‖ P_{Z+√δ N} )|_{δ=0+}
      = −(1/2) E{ ( ∇ log p_{Z|X=x}(Z) − ∇ log p_Z(Z) )² }                  (38)

due to Theorem 1, where the expectation is over P_{Z|X=x}, which is a Gaussian distribution centered at x with variance σ². The first score is easy to evaluate: ∇ log p_{Z|X=x}(Z) = (x − Z)/σ². The second score is determined by the following simple variation of a result due to Esposito [14] (see also Lemma 2 in [2]):

Lemma 2: ∇ log p_Z(z) = ( E{X | Z = z} − z )/σ².

Clearly, the r.h.s. of (38) becomes

    − E_{P_{Z|X=x}}{ ( x − E{X | Z} )² } / (2σ⁴),                           (39)

the average of which over x is equal to the r.h.s. of (36). Thus (36) is established, and so is (2).
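A quick check of Lemma 2 (ours): for an equiprobable binary input X ∈ {−1, +1} one has E{X | Z = z} = tanh(z/σ²), and the score of p_Z computed by numerical differentiation matches (E{X | Z = z} − z)/σ². The value of σ is an arbitrary choice.

    import numpy as np

    # Lemma 2 for binary X = +/-1 and Z = X + sigma*W: illustrative sketch.
    sigma = 0.8
    z = np.linspace(-4.0, 4.0, 4001)

    def phi(u):                     # N(0, sigma^2) density
        return np.exp(-u**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

    pZ = 0.5 * phi(z - 1) + 0.5 * phi(z + 1)
    score = np.gradient(np.log(pZ), z)                   # numerical d/dz log p_Z(z)
    lemma2 = (np.tanh(z / sigma**2) - z) / sigma**2      # (E{X | Z=z} - z)/sigma^2
    print(np.max(np.abs(score - lemma2)))                # close to zero (finite-difference error only)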

B. Differential Entropy and MMSE

Consider again the model (35). It is not difficult to see that

    D( P_{Z+√δ N} ‖ N(0, σ² + δ) )
      = (1/2) log( 2πe(σ² + δ) ) + E{X²}/( 2(σ² + δ) ) − h( Z + √δ N ).     (40)

By Theorem 1 and Lemma 2, we have

    d/dδ D( P_{Z+√δ N} ‖ N(0, σ² + δ) )|_{δ=0+}
      = −(1/2) E{ ( ( E{X | Z} − Z )/σ² + Z/σ² )² }                         (41)
      = − E{ ( E{X | Z} )² } / (2σ⁴).                                       (42)

Plugging into (40), we have

    d/d(σ²) [ h(Z) − (1/2) log( 2πe(1 + σ²) ) ]
      = (1/2) · 1/( (1 + σ²) σ² ) − [ E{X²} − E{ ( E{X | Z} )² } ]/(2σ⁴).   (43)

Note that E{X²} − E{(E{X | Z})²} = mmse(P_X, 1/σ²). Moreover, h(X) = h(Z)|_{σ=0}, and h(Z) − (1/2) log(2πe(1 + σ²)) vanishes as σ² → ∞. Therefore, by integrating w.r.t. σ² from 0 to ∞, we obtain

    h(X) = (1/2) log(2πe) + (1/2) ∫_0^∞ [ (1/s²) mmse(P_X, 1/s) − 1/( s(s + 1) ) ] ds     (44)

which is equivalent to the integral expression in [2], in which the SNR is used as the integration variable.

C. Relative Entropy and MMSE

The connection (4) between relative entropy and MMSE can also be regarded as a special case of Theorem 1. Consider again the model (35) and apply Theorem 1. We have

    −2 d/dδ D( P_{X+√δ N} ‖ Q_{X+√δ N} )
      = E_P{ ( ∇ log p_Z(X + √δ N) − ∇ log q_Z(X + √δ N) )² }               (45)

where p_Z and q_Z denote the pdf of Z = X + √δ N when X ∼ P and X ∼ Q respectively. By Lemma 2 (with the noise variance δ in place of σ²), the r.h.s. of (45) equals γ² E_P{ ( E_P{X | Z} − E_Q{X | Z} )² }, where γ = 1/δ. Furthermore,

    E_P{ ( E_P{X | Z} − E_Q{X | Z} )² }
      = E_P{ [ ( X − E_Q{X | Z} ) − ( X − E_P{X | Z} ) ]² }                 (46)
      = E_P{ ( X − E_Q{X | Z} )² } + E_P{ ( X − E_P{X | Z} )² }
        − 2 E_P{ ( X − E_Q{X | Z} )( X − E_P{X | Z} ) }.                    (47)

Using the orthogonality of ( X − E_P{X | Z} ) and every function of Z under the probability measure P, we can replace E_Q{X | Z} in the last term by E_P{X | Z} (both are functions of Z), and continue the equality as

      = E_P{ ( X − E_Q{X | Z} )² } − E_P{ ( X − E_P{X | Z} )² }             (48)
      = mse_Q(P_X, γ) − mmse(P_X, γ).                                       (49)

Since d/dγ = −δ² d/dδ, combining (45)–(49) yields the desired formula (4).
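The following sketch (ours) checks the recovered relationship (4) numerically for a binary true prior P (mass 1/2 on ±1) and a mismatched Gaussian prior Q = N(0, 1); under Q the estimator is linear, E_Q{X | Y} = γY/(1 + γ), with MSE 1/(1 + γ) under P. All parameter choices are illustrative.

    import numpy as np

    # Check of (4): 2 d/dgamma D(P*N(0,1/gamma) || Q*N(0,1/gamma)) = mse_Q - mmse.
    y = np.linspace(-12.0, 12.0, 24001)
    dy = y[1] - y[0]

    def phi(u, var):
        return np.exp(-u**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def kl(gamma):
        pY = 0.5 * phi(y - 1, 1 / gamma) + 0.5 * phi(y + 1, 1 / gamma)   # P * N(0, 1/gamma)
        qY = phi(y, 1 + 1 / gamma)                                       # Q * N(0, 1/gamma)
        return np.sum(pY * np.log(pY / qY)) * dy

    def mmse(gamma):
        pY = 0.5 * phi(y - 1, 1 / gamma) + 0.5 * phi(y + 1, 1 / gamma)
        return 1.0 - np.sum(pY * np.tanh(gamma * y)**2) * dy   # E_P{X | Y=y} = tanh(gamma*y)

    gamma, h = 1.0, 1e-3
    lhs = 2 * (kl(gamma + h) - kl(gamma - h)) / (2 * h)
    rhs = 1.0 / (1 + gamma) - mmse(gamma)      # mse_Q(P, gamma) - mmse(P, gamma)
    print(lhs, rhs)   # the two values agree up to discretization and step-size error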
D. Differential Entropy and Fisher Information

The generalized de Bruijn identity (5) can be recovered basically by inspection of (8). Consider a distribution Q_Z which is uniform on [−m, m] with m a large number and which vanishes smoothly outside the interval (e.g., a raised-cosine function with roll-off). Then the perturbed distribution Q_{Z+√δ V} remains essentially uniform, so that ∇ log q_Z(z) ≈ 0 over almost all of the probability mass of P_Z. As m → ∞, (8) reduces to (5).

V. CONCLUDING REMARKS

The relationships connecting the score function and various forms of relative entropy shown in this paper are the most general for additive-noise models to date. It is by now clear that such derivative relationships between basic information- and estimation-theoretic measures rely on neither the normality of the additive perturbation, nor the logarithm functional in classical information measures. The results, however, do not directly translate into integral relationships unless the noise is Gaussian, which has the infinite divisibility property.

VI. ACKNOWLEDGEMENTS

The author would like to thank Sergio Verdú for sharing an earlier conjecture of his on the relationship between the Kullback–Leibler distance and MSE, which inspired this work.

APPENDIX
PROOF OF LEMMA 1

Proof: Recall that p_δ denotes the pdf of Y = Z + √δ V. Denote the characteristic function of p_δ by

    ϕ(u, δ) = E{ e^{iuY} }.                                                 (50)

Due to the independence of Z and V,

    ϕ(u, δ) = E{ e^{iuZ} } E{ e^{iu√δ V} }                                  (51)
            = ϕ(u, 0) E{ Σ_{k=0}^{∞} ( iu√δ V )^k / k! }                    (52)
            = ϕ(u, 0) [ 1 + δ(iu)²/2 + Σ_{k=3}^{∞} ( (iu√δ)^k / k! ) E{V^k} ]     (53)

where we have used the assumptions that V has zero mean and unit variance. Note also that the series sum in (53) vanishes as o(δ). Taking the inverse Fourier transform on both sides of (53) yields

    p_δ(y) = p_0(y) + (δ/2) ∂²/∂y² p_0(y) + o(δ).                           (54)

Hence Lemma 1 is proved, as p_0(y) = p(y).

REFERENCES

[1] A. J. Stam, "Some inequalities satisfied by the quantities of information of Fisher and Shannon," Information and Control, vol. 2, pp. 101–112, 1959.
[2] D. Guo, S. Shamai, and S. Verdú, "Mutual information and minimum mean-square error in Gaussian channels," IEEE Trans. Inform. Theory, vol. 51, pp. 1261–1282, Apr. 2005.
[3] D. Guo, S. Shamai, and S. Verdú, "Proof of entropy power inequalities via MMSE," in Proc. IEEE Int. Symp. Inform. Theory, pp. 1011–1015, Seattle, WA, USA, July 2006.
[4] S. Verdú and D. Guo, "A simple proof of the entropy power inequality," IEEE Trans. Inform. Theory, pp. 2165–2166, May 2006.
[5] M. Zakai, "On mutual information, likelihood ratios, and estimation error for the additive Gaussian channel," IEEE Trans. Inform. Theory, vol. 51, pp. 3017–3024, Sept. 2005.
[6] D. Guo, S. Shamai, and S. Verdú, "Additive non-Gaussian noise channels: Mutual information and conditional mean estimation," in Proc. IEEE Int. Symp. Inform. Theory, Adelaide, Australia, Sept. 2005.
[7] D. Palomar and S. Verdú, "Representation of mutual information via input estimates," IEEE Trans. Inform. Theory, vol. 53, pp. 453–470, Feb. 2007.
[8] C. Méasson, A. Montanari, and R. Urbanke, "Maxwell construction: The hidden bridge between iterative and maximum a posteriori decoding," IEEE Trans. Inform. Theory, vol. 54, pp. 5277–5307, 2008.
[9] S. Kullback, Information Theory and Statistics. New York: Dover, 1968.
[10] S. Verdú, "Mismatched estimation and relative entropy," in Proc. IEEE Int. Symp. Inform. Theory, Seoul, Korea, 2009.
[11] K. R. Narayanan and A. R. Srinivasa, "On the thermodynamic temperature of a general distribution," http://arxiv.org/abs/0711.1460v2, 2007.
[12] I. Csiszár, "Axiomatic characterizations of information measures," Entropy, vol. 10, pp. 261–273, 2008.
[13] C. Tsallis, "Possible generalization of Boltzmann-Gibbs statistics," Journal of Statistical Physics, vol. 52, pp. 479–487, 1988.
[14] R. Esposito, "On a relation between detection and estimation in decision theory," Information and Control, vol. 12, pp. 116–120, 1968.