Nonparametric estimation of the likelihood ratio and divergence functionals

XuanLong Nguyen$^1$, Martin J. Wainwright$^{1,2}$, and Michael I. Jordan$^{1,2}$
$^1$Department of Electrical Engineering and Computer Science, $^2$Department of Statistics
University of California, Berkeley
{xuanlong,wainwrig,jordan}@eecs.berkeley.edu

Abstract— We develop and analyze a nonparametric method for estimating the class of f-divergence functionals, and the density ratio of two probability distributions. Our method is based on a non-asymptotic variational characterization of the f-divergence, which allows us to cast the problem of estimating divergences in terms of risk minimization. We thus obtain an M-estimator for divergences, based on a convex and differentiable optimization problem that can be solved efficiently. We analyze the consistency and convergence rates of this M-estimator given conditions only on the ratio of the densities.

I. INTRODUCTION

Given samples from two (multivariate) probability distributions P and Q, it is frequently of interest to estimate the values of functionals measuring the divergence between the unknown P and Q. Of particular interest is the Kullback-Leibler (KL) divergence, but the approach of this paper applies to the more general class of Ali-Silvey or f-divergences [1], [6]. An f-divergence, to be defined formally in the sequel, is of the form $D_\phi(P, Q) = \int \phi(dQ/dP) \, dP$, where $\phi$ is a convex function of the likelihood ratio.

These divergences play fundamental roles in statistics and information theory. In particular, divergences are often used as measures of discrimination in binary hypothesis testing and classification applications. Examples include signal selection [11] and decentralized detection [14], where f-divergences are used to solve experimental design problems. The Shannon mutual information (a particular type of KL divergence), in addition to its role in coding theorems, is often used as a measure of independence to be extremized in dimensionality reduction and feature selection. In all of these cases, if divergences are to be used as objective functionals, one has to be able to estimate them efficiently from data.

There are two ways in which divergences are typically characterized. The classical characterization is an asymptotic one; for example, the KL divergence emerges as the asymptotic rate of the probability of error in Neyman-Pearson binary hypothesis testing (a result known as Stein's lemma). But it is also possible to provide non-asymptotic characterizations of the divergences; in particular, Fano's lemma shows that the KL divergence provides a lower bound on the probability of a decoding error [5]. This paper is motivated by a non-asymptotic characterization of f-divergence in the spirit of Fano's lemma, first explicated in our earlier work [14]. This characterization states that there is a one-to-one correspondence between the family of f-divergences and the family of "surrogate loss functions", such that the (optimum) Bayes risk is equal to the negative of the divergence. In other words, any negative f-divergence can serve as a lower bound for a risk minimization problem.

This variational characterization of divergence, stated formally in Lemma 1, allows us to estimate a divergence $D_\phi(P, Q)$ by solving a binary decision problem. Not surprisingly, we show how the problem of estimating an f-divergence is intrinsically linked to that of estimating the likelihood ratio $g_0 = dP/dQ$. Overall, we obtain an M-estimator whose optimal value estimates the divergence and whose optimizing argument estimates the likelihood ratio.

Our estimator is nonparametric, in that it imposes no strong assumptions on the form of the densities of P and Q. We establish consistency of this estimator by exploiting analysis techniques for M-estimators in the setting of nonparametric density estimation and regression [18], [20]. At a high level, the key to the proof is suitable control on the modulus of continuity of the suprema of two empirical processes, one for each of P and Q, with respect to a metric defined over density ratios. This metric turns out to be a surrogate lower bound of a Bregman divergence defined on a pair of density ratios. In this way, we not only establish consistency of our estimator, but also obtain convergence rates. As one concrete example, when the likelihood ratio $g_0$ lies in a function class $\mathcal{G}$ of smoothness $\alpha$ with $\alpha > d/2$, where $d$ is the number of data dimensions, our estimator of the likelihood ratio achieves the optimal minimax rate $n^{-\alpha/(2\alpha + d)}$ in the Hellinger metric, while the divergence estimator also achieves the rate $n^{-\alpha/(2\alpha + d)}$.
In abstract terms, an f-divergence can be viewed as an integral functional of a pair of densities. While there is relatively little work focusing on integral functionals for pairs of densities (such as the f-divergences of interest here), there is an extensive literature on the estimation of integral functionals of the form $\int \phi(p) \, p \, d\mu$, where $p$ is the density of an unknown probability distribution. Work on this topic dates back to the 1970s [9], [13]; see also [2], [3], [12] and the references therein. There are also a number of papers that focus specifically on the entropy functional (see, e.g., [7], [10], [8]).

In a separate line of work, Wang et al. [21] proposed a histogram-based KL estimator, which is based on estimating the likelihood ratio by building partitions of equivalent (empirical) Q-measure. The estimator was empirically shown to outperform direct plug-in methods, but no theoretical convergence rate analysis was given. A concern with histogram-based methods is their possible inefficiency, in both statistical and computational terms, when applied to higher-dimensional data. Our preliminary empirical results [15] suggest that our estimator exhibits comparable or superior convergence rates in a number of examples.

The remainder of this paper is organized as follows. In Section II, we describe a general variational characterization of f-divergences, and derive an M-estimator for the KL divergence and the likelihood ratio. Section III is devoted to the analysis of consistency and convergence rates. In Section IV we briefly discuss how our analysis extends to general f-divergences. Additional results and complete proofs of all theorems can be found in [15].

II. M-ESTIMATOR FORMULATION

We begin by describing an M-estimator formulation for the KL divergence and the density ratio. Let $X_1, \ldots, X_n$ be $n$ i.i.d. random variables drawn from an unknown distribution P; similarly, let $Y_1, \ldots, Y_n$ be $n$ i.i.d. random variables drawn from an unknown distribution Q. Assume that both are absolutely continuous with respect to Lebesgue measure $\mu$, with positive densities $p_0$ and $q_0$, respectively, on some compact domain $\mathcal{X} \subset \mathbb{R}^d$. The KL divergence between P and Q is defined as:

D_K(P, Q) = \int p_0 \log(p_0/q_0) \, d\mu.

A. Variational characterization of f-divergence

The KL divergence is a special case of a broader class of divergences known as Ali-Silvey distances, or f-divergences [6], [1]:

D_\phi(P, Q) = \int p_0 \, \phi(q_0/p_0) \, d\mu,

where $\phi: \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is a convex function. Different choices of $\phi$ result in many divergences that play important roles in information theory and statistics, including the variational distance, the Hellinger distance, the KL divergence and so on (see, e.g., [17]).

Since $\phi$ is a convex function, by Legendre-Fenchel convex duality [16] we can write:

\phi(u) = \sup_{v \in \mathbb{R}} \big( uv - \phi^*(v) \big),

where $\phi^*$ is the convex conjugate of $\phi$. As a result,

D_\phi(P, Q) = \int p_0 \sup_{f} \big( f \, q_0/p_0 - \phi^*(f) \big) \, d\mu = \sup_{f} \Big( \int f \, dQ - \int \phi^*(f) \, dP \Big),

where the supremum is taken over all measurable functions $f: \mathcal{X} \to \mathbb{R}$, and $\int f \, dP$ denotes the expectation of $f$ under the distribution P. It is easy to see that equality in the supremum is attained for functions $f$ such that $q_0/p_0 \in \partial\phi^*(f)$, where $q_0$, $p_0$ and $f$ are evaluated at any $x \in \mathcal{X}$. By convex duality, this is true if $f \in \partial\phi(q_0/p_0)$ for any $x \in \mathcal{X}$. Thus, we have proved the following:

Lemma 1. Let $\mathcal{F}$ be any class of functions $\mathcal{X} \to \mathbb{R}$. Then:

D_\phi(P, Q) \geq \sup_{f \in \mathcal{F}} \int f \, dQ - \int \phi^*(f) \, dP.   (1)

Furthermore, equality holds if $\mathcal{F} \cap \partial\phi(q_0/p_0) \neq \emptyset$.
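For concreteness, the following standard choices of $\phi$ recover the divergences mentioned above under the representation $D_\phi(P, Q) = \int p_0 \, \phi(q_0/p_0) \, d\mu$ (we list these well-known instances only to fix conventions):

\phi(u) = -\log u: \quad D_\phi(P, Q) = \int p_0 \log(p_0/q_0) \, d\mu = D_K(P, Q) \quad \text{(KL divergence)},
\phi(u) = \tfrac{1}{2}|u - 1|: \quad D_\phi(P, Q) = \tfrac{1}{2} \int |q_0 - p_0| \, d\mu \quad \text{(variational distance)},
\phi(u) = (\sqrt{u} - 1)^2: \quad D_\phi(P, Q) = \int (\sqrt{q_0} - \sqrt{p_0})^2 \, d\mu \quad \text{(squared Hellinger distance)}.

Each of these $\phi$ is convex, so Lemma 1 applies to all three.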
B. An M-estimator of density ratio and KL divergence

Returning to the KL divergence, $\phi$ has the form $\phi(u) = -\log(u)$ for $u > 0$ and $+\infty$ for $u \leq 0$. The convex conjugate of $\phi$ is $\phi^*(v) = \sup_u (uv - \phi(u)) = -1 - \log(-v)$ if $v < 0$, and $+\infty$ otherwise. By Lemma 1,

D_K(P, Q) = \sup_{f < 0} \int f \, dQ - \int \big( -1 - \log(-f) \big) \, dP
          = \sup_{g > 0} \int \log g \, dP - \int g \, dQ + 1.   (2)

In addition, the supremum in (2) is attained at $g_0 = p_0/q_0$.
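For completeness, the conjugate computation behind (2) is elementary. For $v < 0$, the objective $u \mapsto uv + \log u$ is concave on $u > 0$ with derivative $v + 1/u$, which vanishes at $u = -1/v$, so

\phi^*(v) = \sup_{u > 0} \big( uv + \log u \big) = (-1/v) \, v + \log(-1/v) = -1 - \log(-v);

for $v \geq 0$ the objective is unbounded above, so $\phi^*(v) = +\infty$. Substituting $f = -g$ with $g > 0$ into the bound of Lemma 1 then yields the second line of (2).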

This motivates the following estimator of the KL divergence. Let $\mathcal{G}$ be a class of functions $\mathcal{X} \to \mathbb{R}_+$, let $\int \cdot \, dP_n$ and $\int \cdot \, dQ_n$ denote expectations under the empirical measures $P_n$ and $Q_n$, respectively, and consider the following optimization problem:

\hat{D}_K = \sup_{g \in \mathcal{G}} \int \log g \, dP_n - \int g \, dQ_n + 1.   (3)

In practice we generally choose $\mathcal{G}$ to be a convex function class, which turns (3) into a convex optimization problem that can be solved efficiently [15]. Suppose that the supremum is attained at $\hat{g}_n$. Then $\hat{g}_n$ is an M-estimator of the density ratio $g_0 = p_0/q_0$.

In the case of KL divergence estimation, we need to analyze the behavior of $|\hat{D}_K - D_K(P, Q)|$ as $n \to \infty$. In the case of density ratio estimation, we also need a performance measure. Since $g_0 = p_0/q_0$ can be viewed as a density function with respect to the measure Q, a natural metric is the Hellinger distance:

h_Q^2(g, g_0) := \frac{1}{2} \int \big( g^{1/2} - g_0^{1/2} \big)^2 \, dQ.   (4)

As we shall see, this distance measure is weaker than $|\hat{D}_K - D_K(P, Q)|$, with the advantage of allowing us to obtain guarantees under milder assumptions.
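To make (3) concrete, the following minimal sketch solves it for a simple log-linear class $\mathcal{G} = \{ g_\theta(x) = \exp(\theta^\top \psi(x)) \}$ built from a user-supplied feature map $\psi$. This parametric choice, and the function names below, are our own illustration and not the kernel-based procedure used in [15]; with this parameterization the empirical objective in (3) is concave in $\theta$, so a quasi-Newton solver suffices.

```python
import numpy as np
from scipy.optimize import minimize

def kl_and_ratio_estimate(X, Y, psi):
    """Illustrative sketch of the M-estimator (3) for the log-linear class
    G = { g_theta(x) = exp(theta^T psi(x)) }.

    X : (n, d) samples from P;  Y : (m, d) samples from Q.
    psi : feature map, psi(Z) -> (num_samples, k) array.
    Returns (D_hat, g_hat): D_hat estimates D_K(P, Q), and g_hat(x)
    estimates the likelihood ratio g_0 = p_0/q_0.
    """
    PX, PY = psi(X), psi(Y)          # features of the P- and Q-samples

    def neg_objective(theta):
        # negative of: mean log g(X_i) - mean g(Y_j) + 1, cf. equation (3)
        log_g_X = PX @ theta
        g_Y = np.exp(PY @ theta)
        return -(log_g_X.mean() - g_Y.mean() + 1.0)

    def grad(theta):
        g_Y = np.exp(PY @ theta)
        return -(PX.mean(axis=0) - (PY * g_Y[:, None]).mean(axis=0))

    theta0 = np.zeros(PX.shape[1])
    res = minimize(neg_objective, theta0, jac=grad, method="L-BFGS-B")
    D_hat = -res.fun
    g_hat = lambda x: np.exp(psi(x) @ res.x)
    return D_hat, g_hat

# Example usage: P = N(0, 1), Q = N(1, 1), so D_K(P, Q) = 0.5.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(2000, 1))
    Y = rng.normal(1.0, 1.0, size=(2000, 1))
    # log g_0(x) = 1/2 - x here, so linear features already suffice
    psi = lambda Z: np.hstack([np.ones((len(Z), 1)), Z, Z**2])
    D_hat, g_hat = kl_and_ratio_estimate(X, Y, psi)
    print(D_hat)   # should be close to the population value 0.5
```

The empirical means and the constant $+1$ mirror (3) exactly, and the returned optimizing argument plays the role of $\hat{g}_n$.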
III. CONSISTENCY AND CONVERGENCE RATES

In this section we present consistency results and obtain convergence rates for our estimators. Throughout the paper, we impose the following conditions on the distributions P, Q and the function class $\mathcal{G}$:

(i) $D_K(P, Q) < \infty$; and
(ii) $\mathcal{G}$ is sufficiently rich so that $g_0 \in \mathcal{G}$.

To analyze the overall error, we define the approximation error $E_0(\mathcal{G})$ and the estimation error $E_1(\mathcal{G})$ as follows:

E_0(\mathcal{G}) = D_K(P, Q) - \sup_{g \in \mathcal{G}} \Big( \int \log g \, dP - \int g \, dQ + 1 \Big),
E_1(\mathcal{G}) = \sup_{g \in \mathcal{G}} \Big| \int \log g \, d(P_n - P) - \int g \, d(Q_n - Q) \Big|.

Combining with (2) and (3), it is easy to see that:

-E_1(\mathcal{G}) - E_0(\mathcal{G}) \leq \hat{D}_K - D_K(P, Q) \leq E_1(\mathcal{G}).

Since we have imposed condition (ii), the approximation error $E_0(\mathcal{G})$ vanishes, so this paper focuses on the estimation error $E_1(\mathcal{G})$ only. If (ii) does not hold, we obtain instead a lower bound on the KL divergence.

A. Set-up and some basic inequalities

We begin by stating a few basic inequalities used throughout our analysis of consistency and convergence rates. We bound $E_1(\mathcal{G})$ in terms of the following empirical processes:

v_n(\mathcal{G}) = \sup_{g \in \mathcal{G}} \Big| \int \log \frac{g}{g_0} \, d(P_n - P) - \int (g - g_0) \, d(Q_n - Q) \Big|,
w_n(g_0) = \Big| \int \log g_0 \, d(P_n - P) - \int g_0 \, d(Q_n - Q) \Big|.

Note that by construction, we have:

E_1(\mathcal{G}) \leq v_n(\mathcal{G}) + w_n(g_0).   (5)

Our first lemma deals with the term $w_n$:

Lemma 2. We have the almost-sure convergence $w_n(g_0) \xrightarrow{a.s.} 0$.

Note that in this lemma and the other theorems, all "a.s. convergence" statements can be understood with respect to either P or Q, because of the mutual absolute continuity. Next, we relate $v_n(\mathcal{G})$ to the Hellinger distance. This link is made via an intermediate term that is also a (pseudo-)distance between $g_0$ and $g$:

d(g_0, g) = \int (g - g_0) \, dQ - \int \log \frac{g}{g_0} \, dP.   (6)

Lemma 3. (i) $d(g_0, g) \geq 2 h_Q^2(g, g_0)$ for any $g \in \mathcal{G}$. (ii) If $\hat{g}_n$ is the estimate of $g_0$, then $d(g_0, \hat{g}_n) \leq v_n(\mathcal{G})$.

Lemma 3 asserts that the Hellinger distance between $\hat{g}_n$ and $g_0$ is bounded by the supremum of the empirical process $v_n(\mathcal{G})$. One difficulty with the function class $\{\log(g/g_0)\}$ is that it can be unbounded when $g$ takes the value $0$ or $\infty$. The following lemma borrows an idea due to Birgé and Massart (cf. [18]), considering the functions $\log \frac{g_0 + g}{2 g_0}$, which are always bounded from below.

Lemma 4. If $\hat{g}_n$ is the estimate of $g_0$, then:

\frac{1}{8} h_Q^2(g_0, \hat{g}_n) \leq 2 h_Q^2\Big( g_0, \frac{g_0 + \hat{g}_n}{2} \Big)
  \leq -\int \frac{\hat{g}_n - g_0}{2} \, d(Q_n - Q) + \int \log \frac{\hat{g}_n + g_0}{2 g_0} \, d(P_n - P).
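For intuition, the pointwise inequality underlying Lemma 3(i) can be verified directly. Since $dP = g_0 \, dQ$,

d(g_0, g) - 2 h_Q^2(g, g_0) = \int \Big( g - g_0 - g_0 \log \frac{g}{g_0} \Big) dQ - \int \big( g + g_0 - 2\sqrt{g g_0} \big) \, dQ
  = 2 \int g_0 \big( t - 1 - \log t \big) \, dQ \geq 0, \qquad t := \sqrt{g/g_0},

where the final inequality is the elementary bound $\log t \leq t - 1$. Part (ii) follows from the basic inequality for M-estimators: since $\hat{g}_n$ maximizes the empirical objective in (3), the quantity $\int (\hat{g}_n - g_0) \, dQ_n - \int \log(\hat{g}_n/g_0) \, dP_n$ is nonpositive, and subtracting it from $d(g_0, \hat{g}_n)$ can only increase the latter while leaving exactly an empirical-process increment whose absolute value is at most $v_n(\mathcal{G})$.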

B. Consistency results

Our analysis relies on results from empirical process theory. We first introduce several standard notions of entropy of a function class [20]. For each $\delta > 0$, a $\delta$-covering of the function class $\mathcal{G}$ in the metric $L_r(Q)$ is a collection of functions whose $L_r(Q)$-balls of radius $\delta$ cover all of $\mathcal{G}$. Let $N_\delta(\mathcal{G}, L_r(Q))$ be the smallest cardinality of such a covering; then $H_\delta(\mathcal{G}, L_r(Q)) := \log N_\delta(\mathcal{G}, L_r(Q))$ is called the entropy of $\mathcal{G}$ in the $L_r(Q)$ metric. A related notion is entropy with bracketing. Let $N^B_\delta(\mathcal{G}, L_r(Q))$ be the smallest value of $N$ for which there exist $N$ pairs of functions $\{g_j^L, g_j^U\}$ such that $\|g_j^U - g_j^L\|_{L_r(Q)} \leq \delta$, and such that for each $g \in \mathcal{G}$ there is a $j$ such that $g_j^L \leq g \leq g_j^U$. Then $H^B_\delta(\mathcal{G}, L_r(Q)) := \log N^B_\delta(\mathcal{G}, L_r(Q))$ is called the entropy with bracketing of $\mathcal{G}$. Define the envelope functions:

G_0(x) = \sup_{g \in \mathcal{G}} |g(x)|; \qquad G_1(x) = \sup_{g \in \mathcal{G}} \Big| \log \frac{g(x)}{g_0(x)} \Big|.

Proposition 5. Assume the envelope conditions

(a) \ \int G_0 \, dQ < \infty, \quad \text{and} \quad (b) \ \int G_1 \, dP < \infty,   (7)

and suppose that for all $\delta > 0$ there holds:

\frac{1}{n} H_\delta(\mathcal{G} - g_0, L_1(Q_n)) \xrightarrow{Q} 0,   (8a)
\frac{1}{n} H_\delta(\log(\mathcal{G}/g_0), L_1(P_n)) \xrightarrow{P} 0.   (8b)

Then $v_n(\mathcal{G}) \xrightarrow{a.s.} 0$. As a result, $E_1(\mathcal{G}) \xrightarrow{a.s.} 0$ and $h_Q(g_0, \hat{g}_n) \xrightarrow{a.s.} 0$.

Envelope condition (7)(b) is quite severe, because it essentially requires all functions in $\mathcal{G}$ to be bounded from both above and below. To ensure the Hellinger consistency of the estimate of $g_0$, however, we can essentially drop envelope condition (7)(b) and replace entropy condition (8b) by a milder entropy condition.

Proposition 6. Assume that (7)(a) and (8a) hold, and that for all $\delta > 0$,

\frac{1}{n} H_\delta\Big( \log \frac{\mathcal{G} + g_0}{2 g_0}, L_1(P_n) \Big) \xrightarrow{P} 0.   (9)

Then $h_Q(g_0, \hat{g}_n) \xrightarrow{a.s.} 0$.

It can be shown that both entropy conditions (8a) and (9) can be deduced from a single condition—namely, that for all $\delta > 0$, the bracketing entropy is bounded as $H^B_\delta(\mathcal{G}, L_1(Q)) < \infty$. As a concrete illustration, let us consider:

Example (Sobolev classes $W_2^\alpha$): For $x \in \mathbb{R}^d$ and a $d$-dimensional multi-index $\kappa = (\kappa_1, \ldots, \kappa_d)$ (all $\kappa_i$ natural numbers), write $x^\kappa = \prod_{i=1}^d x_i^{\kappa_i}$ and $|\kappa| = \sum_{i=1}^d \kappa_i$. Let $D^\kappa$ denote the differential operator:

D^\kappa g(x) = \frac{\partial^{|\kappa|}}{\partial x_1^{\kappa_1} \cdots \partial x_d^{\kappa_d}} g(x_1, \ldots, x_d).

Let $W_2^\alpha(\mathcal{X})$ denote the Sobolev space of functions $f: \mathcal{X} \to \mathbb{R}$ for which $\sum_{|\kappa| = \alpha} \int |D^\kappa f(x)|^2 \, dx$ is bounded by a constant $M^2$. If $\mathcal{G}$ is restricted to a subspace of $W_2^\alpha(\mathcal{X})$ such that $\mathcal{G}$ is uniformly bounded from above, then all conditions of Prop. 6 can be shown to hold. If, in addition, all functions in $\mathcal{G}$ are also bounded from below, then all conditions of Prop. 5 hold.
C. Convergence rate of density ratio estimation

We can obtain the convergence rate of our estimator $\hat{g}_n$ in the Hellinger metric. Our result is based on Lemma 4, which bounds the Hellinger metric in terms of the supremum of empirical processes, and on the modulus of continuity of this supremum. We shall assume that

\sup_{g \in \mathcal{G}} \|g\|_\infty < K_2,   (10)

in addition to an entropy condition on the function class $\tilde{\mathcal{G}} := \{ ((g + g_0)/2)^{1/2} : g \in \mathcal{G} \}$. In particular, we assume that for some constant $0 < \gamma < 2$, there holds for any $\delta > 0$:

H^B_\delta(\tilde{\mathcal{G}}, L_2(Q)) = O(\delta^{-\gamma}).   (11)

Theorem 7. Under conditions (10) and (11), we have $h_Q(g_0, \hat{g}_n) = O_P(n^{-1/(\gamma + 2)})$, where $O_P$ is with respect to P.

Remarks: To follow up on our earlier example, if $\mathcal{G}$ is a Sobolev class with smoothness $\alpha$, and $g_0$ is bounded from below, then it is known [4] that $\gamma = d/\alpha$. In this particular case, we obtain the rate $n^{-\alpha/(2\alpha + d)}$. It is worthwhile comparing this to the optimal minimax rate with respect to the Hellinger metric. More precisely, the minimax rate is defined as:

r_n := \inf_{\hat{g}_n \in \mathcal{G}} \sup_{P, Q} \mathbb{E}_P \, h_Q(g_0, \hat{g}_n).

Here the infimum is taken with respect to all estimators $\hat{g}_n \in \mathcal{G}$, where $\mathcal{G}$ is a Sobolev class with smoothness $\alpha$. First, note that $r_n \geq \inf_{\hat{g}_n \in \mathcal{G}} \sup_{P} \mathbb{E} \, h_\mu(g_0, \hat{g}_n)$, where we have fixed $Q = \mu$ to be the Lebesgue measure on $\mathcal{X}$. Thus, we have lower bounded the minimax rate by that of a nonparametric density estimation problem.(1) In this way we obtain the following:

Proposition 8. When the likelihood ratio $g_0$ lies in the Sobolev class of smoothness $\alpha$, the optimal minimax rate $r_n = \Theta(n^{-\alpha/(2\alpha + d)})$ is achieved by our estimator.

(1) There is a small technical aspect here: the space $\mathcal{G}$ ranges over smooth functions that need not be valid probability densities. Nonetheless, a standard hypercube argument is still directly applicable (see [19], Sec. 24.3, for such an argument).

D. Convergence rates of divergence estimation

We now turn to the convergence rate of our estimator for the KL divergence. We assume that all functions in $\mathcal{G}$ are bounded from above and below:

0 < K_1 \leq g \leq K_2 \quad \text{for all } g \in \mathcal{G}.   (12)

Theorem 9. Under conditions (12) and (11), we have

|\hat{D}_K - D_K(P, Q)| = O_P(n^{-1/(\gamma + 2)}).
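As a check on the exponents in Theorems 7 and 9, note that in the Sobolev case $\gamma = d/\alpha$ (the setting of Proposition 8) we have

n^{-1/(\gamma + 2)} = n^{-1/(d/\alpha + 2)} = n^{-\alpha/(2\alpha + d)},

so both results recover the familiar nonparametric rate, and the restriction $\gamma < 2$ in (11) corresponds exactly to the smoothness condition $\alpha > d/2$ stated in the Introduction.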

Proof of Theorem 9: We provide only a sketch here. From equations (2) and (3), we can bound $|\hat{D}_K - D_K(P, Q)|$ from above by the sum $A + B + C$ of three terms, where

A := \Big| \int \log(\hat{g}_n/g_0) \, d(P_n - P) - \int (\hat{g}_n - g_0) \, d(Q_n - Q) \Big|,
B := \Big| \int \log(\hat{g}_n/g_0) \, dP - \int (\hat{g}_n - g_0) \, dQ \Big|,
C := \Big| \int \log g_0 \, d(P_n - P) - \int g_0 \, d(Q_n - Q) \Big|.

We have $C = O_P(n^{-1/2})$ by the central limit theorem. Using assumption (12),

B \leq \frac{K_2}{K_1} \int |\hat{g}_n - g_0| \, dQ + \int |\hat{g}_n - g_0| \, dQ
  \leq (K_2/K_1 + 1) \, \|\hat{g}_n - g_0\|_{L_2(Q)}
  \leq (K_2/K_1 + 1) \, K_2^{1/2} \, 4 \, h_Q(g_0, \hat{g}_n)
  \stackrel{(a)}{=} O_P(n^{-1/(\gamma + 2)}),

where equality (a) is due to Theorem 7.

Finally, to bound $A$, we apply a modulus of continuity result for the suprema of empirical processes with respect to the functions $(g - g_0)$ and $(\log g - \log g_0)$. From (12), the bracketing entropy of both function classes $\mathcal{G}$ and $\log \mathcal{G}$ has the same order as that of $\tilde{\mathcal{G}}$, as given in (11). Applying Lemma 5.13 from van de Geer [18], we obtain that for $\delta_n = n^{-1/(2 + \gamma)}$, there holds:

A = O_P\big( n^{-1/2} \, \|\hat{g}_n - g_0\|^{1 - \gamma/2}_{L_2(Q)} \vee \delta_n^2 \big) = O_P(n^{-2/(2 + \gamma)}).

Overall, we have established that the sum $A + B + C$ is upper bounded by $O_P(n^{-1/(2 + \gamma)})$.

IV. OTHER RESULTS

In [15], we provide several further results that cannot be presented here due to space limitations. At a high level, these results include the following:

M-estimation of $D_\phi$ and $\partial\phi(q_0/p_0)$: Although we have focused primarily here on the KL divergence, our approach is applicable to the estimation of any f-divergence $D_\phi$ and the subgradient $\partial\phi(q_0/p_0)$. In general, the estimator of $D_\phi$ takes the following form:

\hat{D}_\phi := \sup_{f \in \mathcal{F}} \int f \, dQ_n - \int \phi^*(f) \, dP_n,

where the supremum is attained at an estimate of $\partial\phi(q_0/p_0)$. The analysis hinges on the modulus of continuity of the suprema of certain empirical processes with respect to the following Bregman divergence (a special case of which was defined in Eq. (6)):

d_\phi(f_0, f) = \int \Big( \phi^*(f) - \phi^*(f_0) - \frac{\partial \phi^*}{\partial f}\Big|_{f_0} (f - f_0) \Big) \, dP.

Implementation and experimental results: In practice, we implement our estimator by taking $\mathcal{G}$ to be the reproducing kernel Hilbert space induced by a Gaussian kernel. Doing so yields a convex optimization problem that can be solved efficiently [15]. Our empirical results [15] demonstrate the practical viability of such estimators, which complements the theoretical analysis of consistency and convergence rates reported herein. A sketch of one such implementation is given below.
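The following sketch shows one way such a kernel-based implementation could look. The finite expansion $g(x) = \sum_j \alpha_j k(x, Y_j)$ over the Q-samples, the nonnegativity constraint on $\alpha$, the ridge penalty `lam`, and all function names are our own illustrative choices, not the exact procedure of [15]; the point is only that the resulting problem remains concave in $\alpha$, hence a tractable convex program, as claimed above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * sigma**2))

def kl_estimate_kernel(X, Y, sigma=1.0, lam=0.1):
    """Hedged sketch of a Gaussian-kernel instance of estimator (3),
    with the density-ratio model g(x) = sum_j alpha_j k(x, Y_j), alpha_j >= 0,
    and a small ridge penalty for numerical stability."""
    m = len(Y)
    K_XY = gaussian_kernel(X, Y, sigma)      # k(X_i, Y_j)
    K_YY = gaussian_kernel(Y, Y, sigma)      # k(Y_l, Y_j)
    mean_KY = K_YY.mean(axis=0)              # (1/m) sum_l k(Y_l, Y_j)

    def neg_obj(alpha):
        g_X = K_XY @ alpha                   # g evaluated at the P-samples
        fit = np.log(g_X + 1e-12).mean() - mean_KY @ alpha + 1.0
        return -(fit - 0.5 * lam * alpha @ K_YY @ alpha)

    def grad(alpha):
        g_X = K_XY @ alpha
        d = (K_XY / (g_X[:, None] + 1e-12)).mean(axis=0) - mean_KY - lam * (K_YY @ alpha)
        return -d

    alpha0 = np.full(m, 1.0 / m)
    res = minimize(neg_obj, alpha0, jac=grad, method="L-BFGS-B",
                   bounds=[(0.0, None)] * m)
    alpha = res.x
    D_hat = np.log(K_XY @ alpha + 1e-12).mean() - mean_KY @ alpha + 1.0
    g_hat = lambda x: gaussian_kernel(np.atleast_2d(x), Y, sigma) @ alpha
    return D_hat, g_hat
```

In practice the bandwidth `sigma` and penalty `lam` would be chosen by cross-validation, as is standard for kernel methods.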
REFERENCES

[1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Royal Stat. Soc. Series B, 28:131-142, 1966.
[2] P. Bickel and Y. Ritov. Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhya Ser. A, 50:381-393, 1988.
[3] L. Birgé and P. Massart. Estimation of integral functionals of a density. Annals of Statistics, 23(1):11-29, 1995.
[4] M. S. Birman and M. Z. Solomjak. Piecewise-polynomial approximations of functions of the classes $W_p^\alpha$. Math. USSR-Sbornik, 2(3):295-317, 1967.
[5] T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, New York, 1991.
[6] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar, 2:299-318, 1967.
[7] L. Györfi and E. C. van der Meulen. Density-free convergence properties of various estimators of entropy. Computational Statistics and Data Analysis, 5:425-436, 1987.
[8] P. Hall and S. Morton. On estimation of entropy. Ann. Inst. Statist. Math., 45(1):69-88, 1993.
[9] I. A. Ibragimov and R. Z. Khasminskii. On the nonparametric estimation of functionals. In Symposium in Asymptotic Statistics, pages 41-52, 1978.
[10] H. Joe. Estimation of entropy and other functionals of a multivariate density. Ann. Inst. Statist. Math., 41:683-697, 1989.
[11] T. Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. on Communication Technology, 15(1):52-60, 1967.
[12] B. Laurent. Efficient estimation of integral functionals of a density. Annals of Statistics, 24(2):659-681, 1996.
[13] B. Ya. Levit. Asymptotically efficient estimation of nonlinear functionals. Problems Inform. Transmission, 14:204-209, 1978.
[14] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On divergences, surrogate losses and decentralized detection. Technical Report 695, Department of Statistics, UC Berkeley, October 2005.
[15] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Technical report, Department of Statistics, UC Berkeley, January 2007.
[16] G. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.
[17] F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46:1602-1609, 2000.
[18] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
[19] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[20] A. W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, NY, 1996.
[21] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064-3074, 2005.