Minimax Estimation of Discrete Distributions under ℓ1 Loss
Yanjun Han, Student Member, IEEE, Jiantao Jiao, Student Member, IEEE, and Tsachy Weissman, Fellow, IEEE

Abstract—We consider the problem of discrete distribution estimation under ℓ1 loss. We provide tight upper and lower bounds on the maximum risk of the empirical distribution (the maximum likelihood estimator), and the minimax risk in regimes where the support size S may grow with the number of observations n. We show that among distributions with bounded entropy H, the asymptotic maximum risk for the empirical distribution is 2H/ln n, while the asymptotic minimax risk is H/ln n. Moreover, we show that a hard-thresholding estimator oblivious to the unknown upper bound H is essentially minimax. However, if we constrain the estimates to lie in the simplex of probability distributions, then the asymptotic minimax risk is again 2H/ln n. We draw connections between our work and the literature on density estimation, entropy estimation, total variation distance (ℓ1 divergence) estimation, joint distribution estimation in stochastic processes, normal mean estimation, and adaptive estimation.

Index Terms—Distribution estimation, entropy estimation, minimax risk, hard-thresholding, high dimensional statistics

The material in this paper was presented in part at the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA. This work was supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370. The authors are with the Department of Electrical Engineering, Stanford University, CA, USA. Email: {yjhan, jiantao, tsachy}@stanford.edu

I. INTRODUCTION AND MAIN RESULTS

Given n independent samples from an unknown discrete probability distribution P = (p_1, p_2, ..., p_S) with unknown support size S, we would like to estimate the distribution P under ℓ1 loss. Equivalently, the problem is to estimate P based on the Multinomial random vector (X_1, X_2, ..., X_S) ∼ Multi(n; p_1, p_2, ..., p_S).

A natural estimator of P is the Maximum Likelihood Estimator (MLE), which in this problem setting coincides with the empirical distribution P_n, where P_n(i) = X_i/n is the number of occurrences of symbol i in the sample divided by the sample size n. This paper is devoted to analyzing the performance of the MLE and of the minimax estimators in various regimes. Specifically, we focus on the following three regimes:

1) Classical asymptotics: the dimension S of the unknown parameter P remains fixed, while the number of observations n grows.
2) High dimensional asymptotics: we let the support size S and the number of observations n grow together, characterize the scaling under which consistent estimation is feasible, and obtain the minimax rates.
3) Infinite dimensional asymptotics: the distribution P may have infinite support size, but is constrained to have bounded entropy H(P) ≤ H, where the entropy [1] is defined as

$$H(P) \triangleq -\sum_{i=1}^{S} p_i \ln p_i. \tag{1}$$

We remark that results for the first regime follow from the well-developed theory of asymptotic statistics [2, Chap. 8], and we include them here for completeness and for comparison with the other regimes. One motivation for considering the high dimensional and infinite dimensional asymptotics is that the modern era of big data gives rise to situations in which we cannot assume that the number of observations is much larger than the dimension of the unknown parameter. This is particularly true for the distribution estimation problem: for example, the Wikipedia page on Chinese characters shows that the number of Chinese sinograms is at least 80,000. Meanwhile, for distributions with extremely large support sizes (such as the Chinese language), the number of frequent symbols is considerably smaller than the support size. This observation motivates the third regime, in which we focus on distributions with finite entropy but possibly extremely large support sizes. Another key result that motivates the problem of discrete distribution estimation under a bounded entropy constraint is that of Marton and Shields [3], who essentially showed that the entropy rate dictates the difficulty of estimating discrete distributions under ℓ1 loss in stochastic processes.
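As a concrete illustration of the two objects just introduced, the following minimal Python sketch (our own, not part of the paper; it assumes NumPy, and the helper names empirical_distribution and entropy_nats are ours) forms the empirical distribution P_n(i) = X_i/n from n i.i.d. samples and evaluates the entropy (1) in nats:

```python
import numpy as np

def empirical_distribution(samples, support_size):
    # MLE / empirical distribution: P_n(i) = X_i / n.
    counts = np.bincount(samples, minlength=support_size)
    return counts / counts.sum()

def entropy_nats(p):
    # Entropy H(P) = -sum_i p_i ln p_i from (1), with the convention 0 ln 0 = 0.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Example: n = 1000 draws from a truncated geometric distribution on {0, ..., 9}.
rng = np.random.default_rng(0)
true_p = 2.0 ** -(np.arange(10) + 1)
true_p /= true_p.sum()  # renormalize after truncation
samples = rng.choice(10, size=1000, p=true_p)
p_hat = empirical_distribution(samples, 10)
print(entropy_nats(true_p), entropy_nats(p_hat))
```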
We denote by M_S the set of all distributions of support size S. The ℓ1 loss for estimating P using Q is defined as

$$\|P - Q\|_1 \triangleq \sum_{i=1}^{S} |p_i - q_i|, \tag{2}$$

where Q is not necessarily a probability mass function. The risk function for an estimator P̂ in estimating P under ℓ1 loss is defined as

$$R(P; \hat{P}) \triangleq \mathbb{E}_P \|\hat{P} - P\|_1, \tag{3}$$

where the expectation is taken with respect to the measure P. The maximum ℓ1 risk of an estimator P̂, and the minimax risk in estimating P, are respectively defined as

$$R_{\mathrm{maximum}}(\mathcal{P}; \hat{P}) \triangleq \sup_{P \in \mathcal{P}} R(P; \hat{P}), \tag{4}$$

$$R_{\mathrm{minimax}}(\mathcal{P}) \triangleq \inf_{\hat{P}} \sup_{P \in \mathcal{P}} R(P; \hat{P}), \tag{5}$$

where $\mathcal{P}$ is a given collection of probability measures P, and the infimum is taken over all estimators P̂.

This paper is dedicated to investigating the maximum risk of the MLE, R_maximum(P; P_n), and the minimax risk R_minimax(P) for various collections P. There are good reasons for focusing on the ℓ1 loss, as we do in this paper. Other loss functions in distribution estimation, such as the squared error loss, have been extensively studied in a series of papers [4]–[7], while less is known for the ℓ1 loss. For the squared error loss, the minimax estimator is unique and depends on the support size S [8, Pg. 349]. Since the support size S is unknown in our setting, this estimator is highly impractical. This fact partially motivates our focus on the ℓ1 loss, which turns out to bridge our understanding of both parametric and nonparametric models. The ℓ1 loss, being proportional to the variational distance, is often a more natural measure of discrepancy between distributions than the ℓ2 loss. Moreover, the ℓ1 loss in discrete distribution estimation is compatible with, and is a degenerate case of, the L1 loss in density estimation, which is the only loss that satisfies certain natural properties [9].

All logarithms in this paper are assumed to be in the natural base.
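To make the risk (3) concrete, here is a small Monte Carlo sketch (again our own illustration, assuming NumPy; l1_risk_mle is a name we introduce) that estimates R(P; P_n) = E_P‖P_n − P‖_1 for the empirical distribution by repeatedly drawing Multinomial count vectors:

```python
import numpy as np

def l1_risk_mle(p, n, trials=2000, seed=0):
    # Monte Carlo estimate of R(P; P_n) = E_P ||P_n - P||_1 from (3).
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n, p, size=trials)  # shape (trials, S)
    p_hat = counts / n
    return float(np.abs(p_hat - p).sum(axis=1).mean())

# Uniform P on S = 100 symbols with n = 1000 observations; compare with the
# leading term sqrt(2(S-1)/(pi n)) of the bound (8) quoted below.
S, n = 100, 1000
print(l1_risk_mle(np.full(S, 1.0 / S), n))
print(np.sqrt(2 * (S - 1) / (np.pi * n)))
```

For these values of S and n the two printed numbers agree closely, which is consistent with the near-uniform distributions being the worst case for the MLE.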
A. Main results

We investigate the maximum risk of the MLE, R_maximum(P; P_n), and the minimax risk R_minimax(P) in the aforementioned three regimes separately.

Understanding R_maximum(P; P_n) = sup_{P∈P} R(P; P_n) follows from an understanding of R(P; P_n) = E_P‖P_n − P‖_1. This problem can be decomposed into analyzing the Binomial mean absolute deviation, defined as

$$\mathbb{E}\left|\frac{X}{n} - p\right|, \tag{6}$$

where X ∼ B(n, p) follows a Binomial distribution. Complicated as it may seem, De Moivre obtained an explicit expression for this quantity. Diaconis and Zabell provided a nice historical account of De Moivre's discovery in [10]. Berend and Kontorovich [11] provided tight upper and lower bounds on the Binomial mean absolute deviation, and we summarize some key results in Lemma 4 of the Appendix.

A well-known result to recall first is the following.

Theorem 1. The following upper bound on the maximum risk of the empirical distribution P_n has been obtained in [12]:

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \|P_n - P\|_1 \le \sqrt{\frac{2(S-1)}{\pi n}} + \frac{2S^{1/2}(S-1)^{1/4}}{n^{3/4}}. \tag{8}$$

In the present paper we show that the MLE is minimax rate-optimal in all three regimes we consider, but possibly suboptimal in terms of constants in the high dimensional and infinite dimensional settings.

1) Classical Asymptotics: The next corollary is an immediate result from Theorem 1.

Corollary 1. The empirical distribution P_n achieves the worst-case convergence rate O(n^{-1/2}). Specifically,

$$\limsup_{n \to \infty} \sqrt{n} \cdot \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \|P_n - P\|_1 \le \sqrt{S-1} < \infty. \tag{9}$$

Regarding the lower bound, the well-known Hájek-Le Cam local asymptotic minimax theorem [13] (Theorem 6 in Appendix A) and corresponding achievability theorems [2, Lemma 8.14] show that the MLE is optimal (even in constants) in classical asymptotics. Concretely, one corollary of Theorem 6 in the Appendix shows the following.

Corollary 2. Fixing S ≥ 2, we have

$$\liminf_{n \to \infty} \sqrt{n} \cdot \inf_{\hat{P}} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \|\hat{P} - P\|_1 \ge \sqrt{\frac{2(S-1)}{\pi}} > 0, \tag{10}$$

where the infimum is taken over all estimators.

Note that the combination of (8) and Corollary 2 actually yields that the MLE is asymptotically minimax, and

$$\lim_{n \to \infty} \sqrt{n} \cdot \inf_{\hat{P}} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \|\hat{P} - P\|_1 = \lim_{n \to \infty} \sqrt{n} \cdot \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \|P_n - P\|_1 = \sqrt{\frac{2(S-1)}{\pi}}, \tag{11}$$

where the infimum is taken over all estimators. This result was also proved in [12] via a different approach, which obtains the exact constant for the lower bound in the classical asymptotics setting.

2) High-dimensional Asymptotics: Theorem 1 also implies the following:

Corollary 3. For S = n/c, the empirical distribution P_n achieves the worst-case convergence rate O(c^{-1/2}), i.e.,

$$\limsup_{c \to \infty} \sqrt{c} \cdot \limsup_{n \to \infty} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \|P_n - P\|_1 \le 1 < \infty. \tag{12}$$
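The Binomial mean absolute deviation (6) underlying these bounds is easy to check numerically. The sketch below (our illustration, assuming NumPy and SciPy; binomial_mad is a name we introduce) computes it exactly from the Binomial pmf and compares it with the normal approximation √(2p(1−p)/(πn)), which follows from the central limit theorem together with E|Z| = σ√(2/π) for Z ∼ N(0, σ²), and is the per-symbol source of the √(2(S−1)/(πn)) terms appearing in (8), (10), and (11):

```python
import numpy as np
from scipy.stats import binom

def binomial_mad(n, p):
    # Exact E|X/n - p| for X ~ B(n, p) from (6), summed against the pmf.
    k = np.arange(n + 1)
    return float(np.sum(np.abs(k / n - p) * binom.pmf(k, n, p)))

p = 0.3
for n in (10, 100, 1000):
    print(n, binomial_mad(n, p), np.sqrt(2 * p * (1 - p) / (np.pi * n)))
```

Already for moderate n the exact value and the normal approximation agree closely, in line with the √n scaling appearing in Corollaries 1 and 2.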