Nonparametric estimation of the likelihood ratio and divergence functionals

XuanLong Nguyen$^1$, Martin J. Wainwright$^{1,2}$, and Michael I. Jordan$^{1,2}$
$^1$Department of Electrical Engineering and Computer Science, $^2$Department of Statistics
University of California, Berkeley
{xuanlong,wainwrig,jordan}@eecs.berkeley.edu

Abstract— We develop and analyze a nonparametric method for estimating the class of f-divergence functionals, and the density ratio of two probability distributions. Our method is based on a non-asymptotic variational characterization of the f-divergence, which allows us to cast the problem of estimating divergences in terms of risk minimization. We thus obtain an M-estimator for divergences, based on a convex and differentiable optimization problem that can be solved efficiently. We analyze the consistency and convergence rates of this M-estimator given conditions only on the ratio of the densities.

I. INTRODUCTION

Given samples from two (multivariate) probability distributions P and Q, it is frequently of interest to estimate the values of functionals measuring the divergence between the unknown P and Q. Of particular interest is the Kullback-Leibler (KL) divergence, but the approach of this paper applies to the more general class of Ali-Silvey or f-divergences [1], [6]. An f-divergence, to be defined formally in the sequel, is of the form $D_\phi(P, Q) = \int \phi(dQ/dP) \, dP$, where $\phi$ is a convex function of the likelihood ratio.

These divergences play fundamental roles in statistics and information theory. In particular, divergences are often used as measures of discrimination in binary hypothesis testing and classification applications. Examples include signal selection [11] and decentralized detection [14], where f-divergences are used to solve experimental design problems. The Shannon mutual information (a particular type of KL divergence), in addition to its role in coding theorems, is often used as a measure of independence to be extremized in dimensionality reduction and feature selection. In all of these cases, if divergences are to be used as objective functionals, one has to be able to estimate them efficiently from data.

There are two ways in which divergences are typically characterized. The classical characterization is an asymptotic one; for example, the KL divergence emerges as the asymptotic rate of the probability of error in Neyman-Pearson binary hypothesis testing (a result known as Stein's lemma). But it is also possible to provide non-asymptotic characterizations of the divergences; in particular, Fano's lemma shows that the KL divergence provides a lower bound on the probability of a decoding error [5]. This paper is motivated by a non-asymptotic characterization of f-divergence in the spirit of Fano's lemma, first explicated in our earlier work [14]. This characterization states that there is a one-to-one correspondence between the family of f-divergences and the family of "surrogate loss functions", such that the (optimum) Bayes risk is equal to the negative of the divergence. In other words, any negative f-divergence can serve as a lower bound for a risk minimization problem.

This variational characterization of divergence, stated formally in Lemma 1, allows us to estimate a divergence $D_\phi(P, Q)$ by solving a binary decision problem. Not surprisingly, we show how the problem of estimating an f-divergence is intrinsically linked to that of estimating the likelihood ratio $g_0 = dP/dQ$. Overall, we obtain an M-estimator whose optimal value estimates the divergence and whose optimizing argument estimates the likelihood ratio.

Our estimator is nonparametric, in that it imposes no strong assumptions on the form of the densities of P and Q. We establish consistency of this estimator by exploiting analysis techniques for M-estimators in the setting of nonparametric density estimation and regression [18], [20]. At a high level, the key to the proof is suitable control on the modulus of continuity of the suprema of two empirical processes, one for each of P and Q, with respect to a metric defined over density ratios. This metric turns out to be a surrogate lower bound of a Bregman divergence defined on a pair of density ratios. In this way, we not only establish consistency of our estimator, but also obtain convergence rates. As one concrete example, when the likelihood ratio $g_0$ lies in a function class $\mathcal{G}$ of smoothness $\alpha$ with $\alpha > d/2$, where $d$ is the number of data dimensions, our estimator of the likelihood ratio achieves the optimal minimax rate $n^{-\alpha/(2\alpha + d)}$ in the Hellinger metric, while the divergence estimator also achieves the rate $n^{-\alpha/(2\alpha + d)}$.
In abstract terms, an f-divergence can be viewed as an integral functional of a pair of densities. While there is relatively little work focusing on integral functionals for pairs of densities (such as the f-divergences of interest here), there is an extensive literature on the estimation of integral functionals of the form $\int \phi(p) \, p \, d\mu$, where $p$ is the density of an unknown probability distribution. Work on this topic dates back to the 1970s [9], [13]; see also [2], [3], [12] and the references therein. There are also a number of papers that focus specifically on the entropy functional (see, e.g., [7], [10], [8]).

In a separate line of work, Wang et al. [21] proposed a histogram-based KL estimator, which is based on estimating the likelihood ratio by building partitions of equivalent (empirical) Q-measure. The estimator was empirically shown to outperform direct plug-in methods, but no theoretical convergence rate analysis was given. A concern with histogram-based methods is their possible inefficiency, in both statistical and computational terms, when applied to higher-dimensional data. Our preliminary empirical results [15] suggest that our estimator exhibits comparable or superior convergence rates in a number of examples.

The remainder of this paper is organized as follows. In Section II, we describe a general variational characterization of f-divergences, and derive an M-estimator for the KL divergence and the likelihood ratio. Section III is devoted to the analysis of consistency and convergence rates. In Section IV we briefly discuss how our analysis extends to general f-divergences. Additional results and complete proofs of all theorems can be found in [15].

II. M-ESTIMATOR FORMULATION

We begin by describing an M-estimator formulation for the KL divergence and the density ratio. Let $X_1, \ldots, X_n$ be $n$ i.i.d. random variables drawn from an unknown distribution P; similarly, let $Y_1, \ldots, Y_n$ be $n$ i.i.d. random variables drawn from an unknown distribution Q. Assume that both are absolutely continuous with respect to Lebesgue measure $\mu$, with positive densities $p_0$ and $q_0$, respectively, on some compact domain $\mathcal{X} \subset \mathbb{R}^d$. The KL divergence between P and Q is defined as:

D_K(P, Q) = \int p_0 \log(p_0/q_0) \, d\mu.

A. Variational characterization of f-divergence

The KL divergence is a special case of a broader class of divergences known as Ali-Silvey distances, or f-divergences [6], [1]:

D_\phi(P, Q) = \int p_0 \, \phi(q_0/p_0) \, d\mu,

where $\phi: \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is a convex function. Different choices of $\phi$ result in many divergences that play important roles in information theory and statistics, including the variational distance, the Hellinger distance, the KL divergence and so on (see, e.g., [17]).

Since $\phi$ is a convex function, by Legendre-Fenchel convex duality [16] we can write:

\phi(u) = \sup_{v \in \mathbb{R}} \big( uv - \phi^*(v) \big),

where $\phi^*$ is the convex conjugate of $\phi$. As a result,

D_\phi(P, Q) = \int p_0 \sup_{f} \big( f \, q_0/p_0 - \phi^*(f) \big) \, d\mu = \sup_{f} \Big( \int f \, dQ - \int \phi^*(f) \, dP \Big),

where the supremum is taken over all measurable functions $f: \mathcal{X} \to \mathbb{R}$, and $\int f \, dP$ denotes the expectation of $f$ under the distribution P. It is easy to see that equality in the supremum is attained for functions $f$ such that $q_0/p_0 \in \partial\phi^*(f)$, where $q_0$, $p_0$ and $f$ are evaluated at any $x \in \mathcal{X}$. By convex duality, this is true if $f \in \partial\phi(q_0/p_0)$ for any $x \in \mathcal{X}$. Thus, we have proved the following:

Lemma 1. Let $\mathcal{F}$ be any class of functions $\mathcal{X} \to \mathbb{R}$. Then:

D_\phi(P, Q) \geq \sup_{f \in \mathcal{F}} \int f \, dQ - \int \phi^*(f) \, dP.   (1)

Furthermore, equality holds if $\mathcal{F} \cap \partial\phi(q_0/p_0) \neq \emptyset$.
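For concreteness, the following standard choices of $\phi$ recover the divergences mentioned above under the representation $D_\phi(P, Q) = \int p_0 \, \phi(q_0/p_0) \, d\mu$ (we list these well-known instances only to fix conventions):

\phi(u) = -\log u: \quad D_\phi(P, Q) = \int p_0 \log(p_0/q_0) \, d\mu = D_K(P, Q) \quad \text{(KL divergence)},
\phi(u) = \tfrac{1}{2}|u - 1|: \quad D_\phi(P, Q) = \tfrac{1}{2} \int |q_0 - p_0| \, d\mu \quad \text{(variational distance)},
\phi(u) = (\sqrt{u} - 1)^2: \quad D_\phi(P, Q) = \int (\sqrt{q_0} - \sqrt{p_0})^2 \, d\mu \quad \text{(squared Hellinger distance)}.

Each of these $\phi$ is convex, so Lemma 1 applies to all three.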
B. An M-estimator of density ratio and KL divergence

Returning to the KL divergence, $\phi$ has the form $\phi(u) = -\log(u)$ for $u > 0$ and $+\infty$ for $u \leq 0$. The convex conjugate of $\phi$ is $\phi^*(v) = \sup_u (uv - \phi(u)) = -1 - \log(-v)$ if $v < 0$, and $+\infty$ otherwise. By Lemma 1,

D_K(P, Q) = \sup_{f < 0} \int f \, dQ - \int \big( -1 - \log(-f) \big) \, dP
          = \sup_{g > 0} \int \log g \, dP - \int g \, dQ + 1.   (2)

In addition, the supremum in (2) is attained at $g_0 = p_0/q_0$.
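For completeness, the conjugate computation behind (2) is elementary. For $v < 0$, the objective $u \mapsto uv + \log u$ is concave on $u > 0$ with derivative $v + 1/u$, which vanishes at $u = -1/v$, so

\phi^*(v) = \sup_{u > 0} \big( uv + \log u \big) = (-1/v) \, v + \log(-1/v) = -1 - \log(-v);

for $v \geq 0$ the objective is unbounded above, so $\phi^*(v) = +\infty$. Substituting $f = -g$ with $g > 0$ into the bound of Lemma 1 then yields the second line of (2).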

This motivates the following estimator of the KL divergence. Let $\mathcal{G}$ be a class of functions $\mathcal{X} \to \mathbb{R}_+$, let $\int \cdot \, dP_n$ and $\int \cdot \, dQ_n$ denote expectations under the empirical measures $P_n$ and $Q_n$, respectively, and consider the following optimization problem:

\hat{D}_K = \sup_{g \in \mathcal{G}} \int \log g \, dP_n - \int g \, dQ_n + 1.   (3)

In practice we generally choose $\mathcal{G}$ to be a convex function class, which turns (3) into a convex optimization problem that can be solved efficiently [15]. Suppose that the supremum is attained at $\hat{g}_n$. Then $\hat{g}_n$ is an M-estimator of the density ratio $g_0 = p_0/q_0$.

In the case of KL divergence estimation, we need to analyze the behavior of $|\hat{D}_K - D_K(P, Q)|$ as $n \to \infty$. In the case of density ratio estimation, we also need a performance measure. Since $g_0 = p_0/q_0$ can be viewed as a density function with respect to the measure Q, a natural metric is the Hellinger distance:

h_Q^2(g, g_0) := \frac{1}{2} \int \big( g^{1/2} - g_0^{1/2} \big)^2 \, dQ.   (4)

As we shall see, this distance measure is weaker than $|\hat{D}_K - D_K(P, Q)|$, with the advantage of allowing us to obtain guarantees under milder assumptions.
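To make (3) concrete, the following minimal sketch solves it for a simple log-linear class $\mathcal{G} = \{ g_\theta(x) = \exp(\theta^\top \psi(x)) \}$ built from a user-supplied feature map $\psi$. This parametric choice, and the function names below, are our own illustration and not the kernel-based procedure used in [15]; with this parameterization the empirical objective in (3) is concave in $\theta$, so a quasi-Newton solver suffices.

```python
import numpy as np
from scipy.optimize import minimize

def kl_and_ratio_estimate(X, Y, psi):
    """Illustrative sketch of the M-estimator (3) for the log-linear class
    G = { g_theta(x) = exp(theta^T psi(x)) }.

    X : (n, d) samples from P;  Y : (m, d) samples from Q.
    psi : feature map, psi(Z) -> (num_samples, k) array.
    Returns (D_hat, g_hat): D_hat estimates D_K(P, Q), and g_hat(x)
    estimates the likelihood ratio g_0 = p_0/q_0.
    """
    PX, PY = psi(X), psi(Y)          # features of the P- and Q-samples

    def neg_objective(theta):
        # negative of: mean log g(X_i) - mean g(Y_j) + 1, cf. equation (3)
        log_g_X = PX @ theta
        g_Y = np.exp(PY @ theta)
        return -(log_g_X.mean() - g_Y.mean() + 1.0)

    def grad(theta):
        g_Y = np.exp(PY @ theta)
        return -(PX.mean(axis=0) - (PY * g_Y[:, None]).mean(axis=0))

    theta0 = np.zeros(PX.shape[1])
    res = minimize(neg_objective, theta0, jac=grad, method="L-BFGS-B")
    D_hat = -res.fun
    g_hat = lambda x: np.exp(psi(x) @ res.x)
    return D_hat, g_hat

# Example usage: P = N(0, 1), Q = N(1, 1), so D_K(P, Q) = 0.5.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(2000, 1))
    Y = rng.normal(1.0, 1.0, size=(2000, 1))
    # log g_0(x) = 1/2 - x here, so linear features already suffice
    psi = lambda Z: np.hstack([np.ones((len(Z), 1)), Z, Z**2])
    D_hat, g_hat = kl_and_ratio_estimate(X, Y, psi)
    print(D_hat)   # should be close to the population value 0.5
```

The empirical means and the constant $+1$ mirror (3) exactly, and the returned optimizing argument plays the role of $\hat{g}_n$.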
III. CONSISTENCY AND CONVERGENCE RATES

In this section we present consistency results and obtain convergence rates for our estimators. Throughout the paper, we impose the following conditions on the distributions P, Q and the function class $\mathcal{G}$:

(i) $D_K(P, Q) < \infty$; and
(ii) $\mathcal{G}$ is sufficiently rich so that $g_0 \in \mathcal{G}$.

To analyze the overall error, we define the approximation error $E_0(\mathcal{G})$ and the estimation error $E_1(\mathcal{G})$ as follows:

E_0(\mathcal{G}) = D_K(P, Q) - \sup_{g \in \mathcal{G}} \Big( \int \log g \, dP - \int g \, dQ + 1 \Big),
E_1(\mathcal{G}) = \sup_{g \in \mathcal{G}} \Big| \int \log g \, d(P_n - P) - \int g \, d(Q_n - Q) \Big|.

Combining with (2) and (3), it is easy to see that:

-E_1(\mathcal{G}) - E_0(\mathcal{G}) \leq \hat{D}_K - D_K(P, Q) \leq E_1(\mathcal{G}).

Since we have imposed condition (ii), the approximation error $E_0(\mathcal{G})$ vanishes, so this paper focuses on the estimation error $E_1(\mathcal{G})$ only. If (ii) does not hold, we obtain instead a lower bound on the KL divergence.

A. Set-up and some basic inequalities

We begin by stating a few basic inequalities used throughout our analysis of consistency and convergence rates. We bound $E_1(\mathcal{G})$ in terms of the following empirical processes:

v_n(\mathcal{G}) = \sup_{g \in \mathcal{G}} \Big| \int \log \frac{g}{g_0} \, d(P_n - P) - \int (g - g_0) \, d(Q_n - Q) \Big|,
w_n(g_0) = \Big| \int \log g_0 \, d(P_n - P) - \int g_0 \, d(Q_n - Q) \Big|.

Note that by construction, we have:

E_1(\mathcal{G}) \leq v_n(\mathcal{G}) + w_n(g_0).   (5)

Our first lemma deals with the term $w_n$:

Lemma 2. We have the almost-sure convergence $w_n(g_0) \xrightarrow{a.s.} 0$.

Note that in this lemma and the other theorems, all "a.s. convergence" statements can be understood with respect to either P or Q, because of the mutual absolute continuity. Next, we relate $v_n(\mathcal{G})$ to the Hellinger distance. This link is made via an intermediate term that is also a (pseudo-)distance between $g_0$ and $g$:

d(g_0, g) = \int (g - g_0) \, dQ - \int \log \frac{g}{g_0} \, dP.   (6)

Lemma 3. (i) $d(g_0, g) \geq 2 h_Q^2(g, g_0)$ for any $g \in \mathcal{G}$. (ii) If $\hat{g}_n$ is the estimate of $g_0$, then $d(g_0, \hat{g}_n) \leq v_n(\mathcal{G})$.

Lemma 3 asserts that the Hellinger distance between $\hat{g}_n$ and $g_0$ is bounded by the supremum of the empirical process $v_n(\mathcal{G})$. One difficulty with the function class $\{\log(g/g_0)\}$ is that it can be unbounded when $g$ takes the value $0$ or $\infty$. The following lemma borrows an idea due to Birgé and Massart (cf. [18]), considering the functions $\log \frac{g_0 + g}{2 g_0}$, which are always bounded from below.

Lemma 4. If $\hat{g}_n$ is the estimate of $g_0$, then:

\frac{1}{8} h_Q^2(g_0, \hat{g}_n) \leq 2 h_Q^2\Big( g_0, \frac{g_0 + \hat{g}_n}{2} \Big)
  \leq -\int \frac{\hat{g}_n - g_0}{2} \, d(Q_n - Q) + \int \log \frac{\hat{g}_n + g_0}{2 g_0} \, d(P_n - P).
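For intuition, the pointwise inequality underlying Lemma 3(i) can be verified directly. Since $dP = g_0 \, dQ$,

d(g_0, g) - 2 h_Q^2(g, g_0) = \int \Big( g - g_0 - g_0 \log \frac{g}{g_0} \Big) dQ - \int \big( g + g_0 - 2\sqrt{g g_0} \big) \, dQ
  = 2 \int g_0 \big( t - 1 - \log t \big) \, dQ \geq 0, \qquad t := \sqrt{g/g_0},

where the final inequality is the elementary bound $\log t \leq t - 1$. Part (ii) follows from the basic inequality for M-estimators: since $\hat{g}_n$ maximizes the empirical objective in (3), the quantity $\int (\hat{g}_n - g_0) \, dQ_n - \int \log(\hat{g}_n/g_0) \, dP_n$ is nonpositive, and subtracting it from $d(g_0, \hat{g}_n)$ can only increase the latter while leaving exactly an empirical-process increment whose absolute value is at most $v_n(\mathcal{G})$.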

B. Consistency results

Our analysis relies on results from empirical process theory. We first introduce several standard notions of entropy of a function class [20]. For each $\delta > 0$, a $\delta$-covering of the function class $\mathcal{G}$ in the metric $L_r(Q)$ is a collection of functions whose $L_r(Q)$-balls of radius $\delta$ cover all of $\mathcal{G}$. Let $N_\delta(\mathcal{G}, L_r(Q))$ be the smallest cardinality of such a covering; then $H_\delta(\mathcal{G}, L_r(Q)) := \log N_\delta(\mathcal{G}, L_r(Q))$ is called the entropy of $\mathcal{G}$ in the $L_r(Q)$ metric. A related notion is entropy with bracketing. Let $N^B_\delta(\mathcal{G}, L_r(Q))$ be the smallest value of $N$ for which there exist $N$ pairs of functions $\{g_j^L, g_j^U\}$ such that $\|g_j^U - g_j^L\|_{L_r(Q)} \leq \delta$, and such that for each $g \in \mathcal{G}$ there is a $j$ such that $g_j^L \leq g \leq g_j^U$. Then $H^B_\delta(\mathcal{G}, L_r(Q)) := \log N^B_\delta(\mathcal{G}, L_r(Q))$ is called the entropy with bracketing of $\mathcal{G}$. Define the envelope functions:

G_0(x) = \sup_{g \in \mathcal{G}} |g(x)|; \qquad G_1(x) = \sup_{g \in \mathcal{G}} \Big| \log \frac{g(x)}{g_0(x)} \Big|.

Proposition 5. Assume the envelope conditions

(a) \ \int G_0 \, dQ < \infty, \quad \text{and} \quad (b) \ \int G_1 \, dP < \infty,   (7)

and suppose that for all $\delta > 0$ there holds:

\frac{1}{n} H_\delta(\mathcal{G} - g_0, L_1(Q_n)) \xrightarrow{Q} 0,   (8a)
\frac{1}{n} H_\delta(\log(\mathcal{G}/g_0), L_1(P_n)) \xrightarrow{P} 0.   (8b)

Then $v_n(\mathcal{G}) \xrightarrow{a.s.} 0$. As a result, $E_1(\mathcal{G}) \xrightarrow{a.s.} 0$ and $h_Q(g_0, \hat{g}_n) \xrightarrow{a.s.} 0$.

Envelope condition (7)(b) is quite severe, because it essentially requires all functions in $\mathcal{G}$ to be bounded from both above and below. To ensure the Hellinger consistency of the estimate of $g_0$, however, we can essentially drop envelope condition (7)(b) and replace entropy condition (8b) by a milder entropy condition.

Proposition 6. Assume that (7)(a) and (8a) hold, and that for all $\delta > 0$,

\frac{1}{n} H_\delta\Big( \log \frac{\mathcal{G} + g_0}{2 g_0}, L_1(P_n) \Big) \xrightarrow{P} 0.   (9)

Then $h_Q(g_0, \hat{g}_n) \xrightarrow{a.s.} 0$.

It can be shown that both entropy conditions (8a) and (9) can be deduced from a single condition—namely, that for all $\delta > 0$, the bracketing entropy is bounded as $H^B_\delta(\mathcal{G}, L_1(Q)) < \infty$. As a concrete illustration, let us consider:

Example (Sobolev classes $W_2^\alpha$): For $x \in \mathbb{R}^d$ and a $d$-dimensional multi-index $\kappa = (\kappa_1, \ldots, \kappa_d)$ (all $\kappa_i$ natural numbers), write $x^\kappa = \prod_{i=1}^d x_i^{\kappa_i}$ and $|\kappa| = \sum_{i=1}^d \kappa_i$. Let $D^\kappa$ denote the differential operator:

D^\kappa g(x) = \frac{\partial^{|\kappa|}}{\partial x_1^{\kappa_1} \cdots \partial x_d^{\kappa_d}} g(x_1, \ldots, x_d).

Let $W_2^\alpha(\mathcal{X})$ denote the Sobolev space of functions $f: \mathcal{X} \to \mathbb{R}$ for which $\sum_{|\kappa| = \alpha} \int |D^\kappa f(x)|^2 \, dx$ is bounded by a constant $M^2$. If $\mathcal{G}$ is restricted to a subspace of $W_2^\alpha(\mathcal{X})$ such that $\mathcal{G}$ is uniformly bounded from above, then all conditions of Prop. 6 can be shown to hold. If, in addition, all functions in $\mathcal{G}$ are also bounded from below, then all conditions of Prop. 5 hold.
C. Convergence rate of density ratio estimation

We can obtain the convergence rate of our estimator $\hat{g}_n$ in the Hellinger metric. Our result is based on Lemma 4, which bounds the Hellinger metric in terms of the supremum of empirical processes, and on the modulus of continuity of this supremum. We shall assume that

\sup_{g \in \mathcal{G}} \|g\|_\infty < K_2,   (10)

in addition to an entropy condition on the function class $\tilde{\mathcal{G}} := \{ ((g + g_0)/2)^{1/2} : g \in \mathcal{G} \}$. In particular, we assume that for some constant $0 < \gamma < 2$, there holds for any $\delta > 0$:

H^B_\delta(\tilde{\mathcal{G}}, L_2(Q)) = O(\delta^{-\gamma}).   (11)

Theorem 7. Under conditions (10) and (11), we have $h_Q(g_0, \hat{g}_n) = O_P(n^{-1/(\gamma + 2)})$, where $O_P$ is with respect to P.

Remarks: To follow up on our earlier example, if $\mathcal{G}$ is a Sobolev class with smoothness $\alpha$, and $g_0$ is bounded from below, then it is known [4] that $\gamma = d/\alpha$. In this particular case, we obtain the rate $n^{-\alpha/(2\alpha + d)}$. It is worthwhile comparing this to the optimal minimax rate with respect to the Hellinger metric. More precisely, the minimax rate is defined as:

r_n := \inf_{\hat{g}_n \in \mathcal{G}} \sup_{P, Q} \mathbb{E}_P \, h_Q(g_0, \hat{g}_n).

Here the infimum is taken with respect to all estimators $\hat{g}_n \in \mathcal{G}$, where $\mathcal{G}$ is a Sobolev class with smoothness $\alpha$. First, note that $r_n \geq \inf_{\hat{g}_n \in \mathcal{G}} \sup_{P} \mathbb{E} \, h_\mu(g_0, \hat{g}_n)$, where we have fixed $Q = \mu$ to be the Lebesgue measure on $\mathcal{X}$. Thus, we have lower bounded the minimax rate by that of a nonparametric density estimation problem.(1) In this way we obtain the following:

Proposition 8. When the likelihood ratio $g_0$ lies in the Sobolev class of smoothness $\alpha$, the optimal minimax rate $r_n = \Theta(n^{-\alpha/(2\alpha + d)})$ is achieved by our estimator.

(1) There is a small technical aspect here: the space $\mathcal{G}$ ranges over smooth functions that need not be valid probability densities. Nonetheless, a standard hypercube argument is still directly applicable (see [19], Sec. 24.3, for such an argument).

D. Convergence rates of divergence estimation

We now turn to the convergence rate of our estimator for the KL divergence. We assume that all functions in $\mathcal{G}$ are bounded from above and below:

0 < K_1 \leq g \leq K_2 \quad \text{for all } g \in \mathcal{G}.   (12)

Theorem 9. Under conditions (12) and (11), we have

|\hat{D}_K - D_K(P, Q)| = O_P(n^{-1/(\gamma + 2)}).
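As a check on the exponents in Theorems 7 and 9, note that in the Sobolev case $\gamma = d/\alpha$ (the setting of Proposition 8) we have

n^{-1/(\gamma + 2)} = n^{-1/(d/\alpha + 2)} = n^{-\alpha/(2\alpha + d)},

so both results recover the familiar nonparametric rate, and the restriction $\gamma < 2$ in (11) corresponds exactly to the smoothness condition $\alpha > d/2$ stated in the Introduction.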

Proof of Theorem 9: We provide only a sketch here. From equations (2) and (3), we can bound $|\hat{D}_K - D_K(P, Q)|$ from above by the sum $A + B + C$ of three terms, where

A := \Big| \int \log(\hat{g}_n/g_0) \, d(P_n - P) - \int (\hat{g}_n - g_0) \, d(Q_n - Q) \Big|,
B := \Big| \int \log(\hat{g}_n/g_0) \, dP - \int (\hat{g}_n - g_0) \, dQ \Big|,
C := \Big| \int \log g_0 \, d(P_n - P) - \int g_0 \, d(Q_n - Q) \Big|.

We have $C = O_P(n^{-1/2})$ by the central limit theorem. Using assumption (12),

B \leq \frac{K_2}{K_1} \int |\hat{g}_n - g_0| \, dQ + \int |\hat{g}_n - g_0| \, dQ
  \leq (K_2/K_1 + 1) \, \|\hat{g}_n - g_0\|_{L_2(Q)}
  \leq (K_2/K_1 + 1) \, K_2^{1/2} \, 4 \, h_Q(g_0, \hat{g}_n)
  \stackrel{(a)}{=} O_P(n^{-1/(\gamma + 2)}),

where equality (a) is due to Theorem 7.

Finally, to bound $A$, we apply a modulus of continuity result for the suprema of empirical processes with respect to the functions $(g - g_0)$ and $(\log g - \log g_0)$. From (12), the bracketing entropy of both function classes $\mathcal{G}$ and $\log \mathcal{G}$ has the same order as that of $\tilde{\mathcal{G}}$, as given in (11). Applying Lemma 5.13 from van de Geer [18], we obtain that for $\delta_n = n^{-1/(2 + \gamma)}$, there holds:

A = O_P\big( n^{-1/2} \, \|\hat{g}_n - g_0\|^{1 - \gamma/2}_{L_2(Q)} \vee \delta_n^2 \big) = O_P(n^{-2/(2 + \gamma)}).

Overall, we have established that the sum $A + B + C$ is upper bounded by $O_P(n^{-1/(2 + \gamma)})$.

IV. OTHER RESULTS

In [15], we provide several further results that cannot be presented here due to space limitations. At a high level, these results include the following:

M-estimation of $D_\phi$ and $\partial\phi(q_0/p_0)$: Although we have focused primarily here on the KL divergence, our approach is applicable to the estimation of any f-divergence $D_\phi$ and the subgradient $\partial\phi(q_0/p_0)$. In general, the estimator of $D_\phi$ takes the following form:

\hat{D}_\phi := \sup_{f \in \mathcal{F}} \int f \, dQ_n - \int \phi^*(f) \, dP_n,

where the supremum is attained at an estimate of $\partial\phi(q_0/p_0)$. The analysis hinges on the modulus of continuity of the suprema of certain empirical processes with respect to the following Bregman divergence (a special case of which was defined in Eq. (6)):

d_\phi(f_0, f) = \int \Big( \phi^*(f) - \phi^*(f_0) - \frac{\partial \phi^*}{\partial f}\Big|_{f_0} (f - f_0) \Big) \, dP.

Implementation and experimental results: In practice, we implement our estimator by taking $\mathcal{G}$ to be the reproducing kernel Hilbert space induced by a Gaussian kernel. Doing so yields a convex optimization problem that can be solved efficiently [15]. Our empirical results [15] demonstrate the practical viability of such estimators, which complements the theoretical analysis of consistency and convergence rates reported herein. A sketch of one such implementation is given below.
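The following sketch shows one way such a kernel-based implementation could look. The finite expansion $g(x) = \sum_j \alpha_j k(x, Y_j)$ over the Q-samples, the nonnegativity constraint on $\alpha$, the ridge penalty `lam`, and all function names are our own illustrative choices, not the exact procedure of [15]; the point is only that the resulting problem remains concave in $\alpha$, hence a tractable convex program, as claimed above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * sigma**2))

def kl_estimate_kernel(X, Y, sigma=1.0, lam=0.1):
    """Hedged sketch of a Gaussian-kernel instance of estimator (3),
    with the density-ratio model g(x) = sum_j alpha_j k(x, Y_j), alpha_j >= 0,
    and a small ridge penalty for numerical stability."""
    m = len(Y)
    K_XY = gaussian_kernel(X, Y, sigma)      # k(X_i, Y_j)
    K_YY = gaussian_kernel(Y, Y, sigma)      # k(Y_l, Y_j)
    mean_KY = K_YY.mean(axis=0)              # (1/m) sum_l k(Y_l, Y_j)

    def neg_obj(alpha):
        g_X = K_XY @ alpha                   # g evaluated at the P-samples
        fit = np.log(g_X + 1e-12).mean() - mean_KY @ alpha + 1.0
        return -(fit - 0.5 * lam * alpha @ K_YY @ alpha)

    def grad(alpha):
        g_X = K_XY @ alpha
        d = (K_XY / (g_X[:, None] + 1e-12)).mean(axis=0) - mean_KY - lam * (K_YY @ alpha)
        return -d

    alpha0 = np.full(m, 1.0 / m)
    res = minimize(neg_obj, alpha0, jac=grad, method="L-BFGS-B",
                   bounds=[(0.0, None)] * m)
    alpha = res.x
    D_hat = np.log(K_XY @ alpha + 1e-12).mean() - mean_KY @ alpha + 1.0
    g_hat = lambda x: gaussian_kernel(np.atleast_2d(x), Y, sigma) @ alpha
    return D_hat, g_hat
```

In practice the bandwidth `sigma` and penalty `lam` would be chosen by cross-validation, as is standard for kernel methods.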
REFERENCES

[1] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Royal Stat. Soc. Series B, 28:131-142, 1966.
[2] P. Bickel and Y. Ritov. Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhya Ser. A, 50:381-393, 1988.
[3] L. Birgé and P. Massart. Estimation of integral functionals of a density. Annals of Statistics, 23(1):11-29, 1995.
[4] M. S. Birman and M. Z. Solomjak. Piecewise-polynomial approximations of functions of the classes $W_p^\alpha$. Math. USSR-Sbornik, 2(3):295-317, 1967.
[5] T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, New York, 1991.
[6] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar, 2:299-318, 1967.
[7] L. Györfi and E. C. van der Meulen. Density-free convergence properties of various estimators of entropy. Computational Statistics and Data Analysis, 5:425-436, 1987.
[8] P. Hall and S. Morton. On estimation of entropy. Ann. Inst. Statist. Math., 45(1):69-88, 1993.
[9] I. A. Ibragimov and R. Z. Khasminskii. On the nonparametric estimation of functionals. In Symposium in Asymptotic Statistics, pages 41-52, 1978.
[10] H. Joe. Estimation of entropy and other functionals of a multivariate density. Ann. Inst. Statist. Math., 41:683-697, 1989.
[11] T. Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. on Communication Technology, 15(1):52-60, 1967.
[12] B. Laurent. Efficient estimation of integral functionals of a density. Annals of Statistics, 24(2):659-681, 1996.
[13] B. Ya. Levit. Asymptotically efficient estimation of nonlinear functionals. Problems Inform. Transmission, 14:204-209, 1978.
[14] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On divergences, surrogate losses and decentralized detection. Technical Report 695, Department of Statistics, UC Berkeley, October 2005.
[15] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Technical report, Department of Statistics, UC Berkeley, January 2007.
[16] G. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.
[17] F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46:1602-1609, 2000.
[18] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
[19] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[20] A. W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, NY, 1996.
[21] Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064-3074, 2005.