arXiv:1411.1467v3 [cs.IT] 29 Dec 2015

Minimax Estimation of Discrete Distributions under ℓ1 Loss

Yanjun Han, Student Member, IEEE, Jiantao Jiao, Student Member, IEEE, and Tsachy Weissman, Fellow, IEEE

Abstract—We consider the problem of discrete distribution estimation under ℓ1 loss. We provide tight upper and lower bounds on the maximum risk of the empirical distribution (the maximum likelihood estimator), and the minimax risk, in regimes where the support size S may grow with the number of observations n. We show that among distributions with bounded entropy H, the asymptotic maximum risk for the empirical distribution is 2H/ln n, while the asymptotic minimax risk is H/ln n. Moreover, we show that a hard-thresholding estimator, oblivious to the unknown upper bound H, is essentially minimax. However, if we constrain the estimates to lie in the simplex of probability distributions, then the asymptotic minimax risk is again 2H/ln n. We draw connections between our work and the literature on density estimation, entropy estimation, divergence estimation (under total variation distance), joint distribution estimation in stochastic processes, normal mean estimation, and adaptive estimation.

Index Terms—Distribution estimation, entropy estimation, minimax risk, hard-thresholding, high dimensional statistics.

Manuscript received Month 00, 0000; revised Month 00, 0000. Date of current version Month 00, 0000. This work was supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370. This paper was presented in part at the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA. Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

Yanjun Han, Jiantao Jiao and Tsachy Weissman are with the Department of Electrical Engineering, Stanford University, CA, USA. Email: {yjhan, jiantao, tsachy}@stanford.edu.

I. INTRODUCTION AND MAIN RESULTS

Given n independent samples from an unknown discrete probability distribution P = (p_1, p_2, ..., p_S) with unknown support size S, we would like to estimate the distribution P under ℓ1 loss. Equivalently, the problem is to estimate P based on the Multinomial random vector Multi(n; P) = (X_1, X_2, ..., X_S), where X_i is the number of occurrences of symbol i in the sample. A natural estimator of P is the Maximum Likelihood Estimator (MLE), which in this problem setting coincides with the empirical distribution P_n, where P_n(i) = X_i/n is the number of occurrences of symbol i in the sample divided by the sample size n.

This paper is devoted to analyzing the performances of the MLE and the minimax risk in various regimes. Specifically, we focus on the following three regimes:

1) Classical asymptotics: the dimension S remains fixed, while the number of observations n grows.

2) High dimensional asymptotics: we let the support size S and the number of observations n grow together, characterize the scaling under which consistent estimation is feasible, and obtain the minimax rates.

3) Infinite dimensional asymptotics: the distribution P may have infinite support size, but is constrained to have bounded entropy H(P) ≤ H, where the entropy [1] is defined as

H(P) ≜ Σ_{i=1}^S −p_i ln p_i.   (1)

We remark that results for the first regime follow from the well-developed theory of asymptotic statistics [2, Chap. 8]; we include them here for completeness and for comparison with the other regimes. One motivation for considering the high dimensional and infinite dimensional asymptotics is that the big data era gives rise to situations in which the number of observations cannot be assumed to be much larger than the dimension of the unknown parameter. It is particularly true in the distribution estimation problem: e.g., the Wikipedia page on the Chinese characters showed that the number of Chinese sinograms is at least 80,000. Meanwhile, for distributions with extremely large support sizes (such as the Chinese language), the number of frequent symbols is considerably smaller than the support size. This observation motivates the third regime, in which we focus on distributions with possibly extremely large support sizes but bounded entropy. Another key motivation for the problem of discrete distribution estimation with a finite entropy constraint is a result of Marton and Shields [3], who essentially showed that the entropy rate dictates the difficulty of estimating the joint distributions of discrete stationary processes.

We denote by M_S the set of all distributions of support size S. The ℓ1 loss for estimating P using Q is defined as

‖P − Q‖_1 ≜ Σ_{i=1}^S |p_i − q_i|,   (2)

where Q is not necessarily a probability mass function. The ℓ1 risk of an estimator P̂ in estimating P is defined as

R(P; P̂) ≜ E_P ‖P̂ − P‖_1 = E_P Σ_{i=1}^S |P̂_i − p_i|,   (3)

where the expectation is taken with respect to the measure Multi(n; P).
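The following minimal Python sketch (ours, not part of the original paper; it assumes NumPy, and the parameter choices are arbitrary) makes the setup concrete: it draws the Multinomial vector Multi(n; P), forms the empirical distribution P_n = X/n, and estimates its ℓ1 risk E_P‖P_n − P‖_1 by Monte Carlo.

```python
import numpy as np

def empirical_distribution(counts, n):
    """MLE / empirical distribution: P_n(i) = X_i / n."""
    return np.asarray(counts, dtype=float) / n

def l1_risk_mle(p, n, num_trials=2000, seed=0):
    """Monte Carlo estimate of E_P ||P_n - P||_1 for the empirical distribution."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    total = 0.0
    for _ in range(num_trials):
        counts = rng.multinomial(n, p)              # X ~ Multi(n; P)
        total += np.abs(empirical_distribution(counts, n) - p).sum()
    return total / num_trials

if __name__ == "__main__":
    S, n = 100, 1000
    p_uniform = np.full(S, 1.0 / S)
    print("simulated l1 risk of the MLE:", l1_risk_mle(p_uniform, n))
    print("sqrt((S-1)/n) (cf. Theorem 1 below):", np.sqrt((S - 1) / n))
```

The simulated risk can be compared against the worst-case bounds developed in the remainder of this section.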

The maximum ℓ1 risk and the minimax ℓ1 risk in estimating P are respectively defined as

R_maximum(𝒫; P̂) ≜ sup_{P ∈ 𝒫} R(P; P̂),   (4)

R_minimax(𝒫) ≜ inf_{P̂} sup_{P ∈ 𝒫} R(P; P̂),   (5)

where 𝒫 is a given collection of probability measures P, and the infimum is taken over all estimators P̂.

This paper is dedicated to investigating the maximum risk of the MLE, R_maximum(𝒫; P_n), and the minimax risk R_minimax(𝒫) for various 𝒫. There are good reasons for focusing on the ℓ1 loss, as we do in this paper. Other loss functions in distribution estimation, such as the squared error loss, have been extensively studied in a series of papers [4]–[7], while less is known for the ℓ1 loss. For the squared error loss, the minimax estimator is unique and depends on the support size S [8, Pg. 349]. Since the support size S is unknown in our setting, this estimator is highly impractical. This fact partially motivates our focus on the ℓ1 loss, which turns out to bridge our understanding of both parametric and nonparametric models. The ℓ1 loss, being proportional to the variational distance, is often a more natural measure of discrepancy between distributions than the ℓ2 loss. Moreover, the ℓ1 loss in discrete distribution estimation is compatible with, and is a degenerate case of, the L1 loss in density estimation, which is the only loss that satisfies certain natural properties [9].

All logarithms in this paper are assumed to be in the natural base.

A. Main results

We investigate the maximum risk of the MLE, R_maximum(𝒫; P_n), and the minimax risk R_minimax(𝒫) in the aforementioned three different regimes separately.

Understanding R_maximum(𝒫; P_n) = sup_{P ∈ 𝒫} R(P; P_n) follows from an understanding of R(P; P_n) = E_P ‖P_n − P‖_1. This problem can be decomposed into analyzing the Binomial mean absolute deviation defined as

E |X/n − p|,   (6)

where X ∼ B(n, p) follows a Binomial distribution. Complicated as it may seem, De Moivre obtained an explicit expression for this quantity. Diaconis and Zabell provided a nice historical account of De Moivre's discovery in [10]. Berend and Kontorovich [11] provided tight upper and lower bounds on the Binomial mean absolute deviation, and we summarize some key results in Lemma 4 of the Appendix. A well-known result to recall first is the following.

Theorem 1. The maximum ℓ1 risk of the empirical distribution P_n satisfies

sup_{P ∈ M_S} E_P ‖P − P_n‖_1 ≤ √((S − 1)/n),   (7)

where M_S denotes the set of distributions with support size S.

In fact, a tighter upper bound on the worst-case ℓ1 risk of the empirical distribution P_n has been obtained in [12]:

sup_{P ∈ M_S} E_P ‖P_n − P‖_1 ≤ √(2(S − 1)/(πn)) + O_S(1/n),   (8)

where O_S(·) hides a constant depending only on the support size S.

In the present paper we show that the MLE is minimax rate-optimal in all the three regimes we consider, but possibly suboptimal in terms of constants in the high dimensional and infinite dimensional settings.

1) Classical Asymptotics: The next corollary is an immediate result from Theorem 1.

Corollary 1. The empirical distribution P_n achieves the worst-case convergence rate O(n^{−1/2}). Specifically,

lim sup_{n→∞} √n · sup_{P ∈ M_S} E_P ‖P_n − P‖_1 ≤ √(S − 1) < ∞.   (9)

Regarding the lower bound, the well-known Hájek–Le Cam local asymptotic minimax theorem [13] (Theorem 6 in Appendix A) and corresponding achievability theorems [2, Lemma 8.14] show that the MLE is optimal (even in constants) in classical asymptotics. Concretely, one corollary of Theorem 6 in the Appendix shows the following.

Corollary 2. Fixing S ≥ 2, we have

lim inf_{n→∞} √n · inf_{P̂} sup_{P ∈ M_S} E_P ‖P̂ − P‖_1 ≥ √(2(S − 1)/π) > 0,   (10)

where the infimum is taken over all estimators.

Note that the combination of (8) and Corollary 2 actually yields that the MLE is asymptotically minimax, and

lim_{n→∞} √n · inf_{P̂} sup_{P ∈ M_S} E_P ‖P̂ − P‖_1 = lim_{n→∞} √n · sup_{P ∈ M_S} E_P ‖P_n − P‖_1 = √(2(S − 1)/π),   (11)

where the infimum is taken over all estimators. This result was also proved in [12] via a different approach to obtain the exact constant for the lower bound in the classical asymptotics setting.
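As a quick sanity check on Corollaries 1 and 2 and on (11), the following sketch (ours, not from the paper; NumPy assumed) estimates √n · E_P‖P_n − P‖_1 by Monte Carlo at the uniform distribution, a natural candidate for the asymptotic worst case, and compares it with the constant √(2(S − 1)/π).

```python
import numpy as np

def scaled_l1_risk(p, n, num_trials=5000, seed=0):
    """Monte Carlo estimate of sqrt(n) * E_P ||P_n - P||_1."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    total = 0.0
    for _ in range(num_trials):
        p_hat = rng.multinomial(n, p) / n
        total += np.abs(p_hat - p).sum()
    return np.sqrt(n) * total / num_trials

if __name__ == "__main__":
    S = 10
    p = np.full(S, 1.0 / S)                   # uniform distribution on S symbols
    limit = np.sqrt(2 * (S - 1) / np.pi)      # asymptotic constant in (11)
    for n in (100, 1000, 10000):
        print(n, round(scaled_l1_risk(p, n), 4), round(limit, 4))
```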
2) High-dimensional Asymptotics: Theorem 1 also implies the following:

Corollary 3. For S = n/c, the empirical distribution P_n achieves the worst-case convergence rate O(c^{−1/2}), i.e.,

lim sup_{c→∞} √c · lim sup_{n→∞} sup_{P ∈ M_S} E_P ‖P_n − P‖_1 ≤ 1 < ∞.   (12)

Now we show that S = n/c is the critical scaling in high dimensional asymptotics. In other words, if n = o(S), then no estimator for the distribution P is consistent under ℓ1 loss. This phenomenon has been observed in several papers, such as [11] and [14], to name a few. The following theorem presents a non-asymptotic minimax lower bound.

Theorem 2. For any ζ ∈ (0, 1], we have

inf_{P̂} sup_{P ∈ M_S} E_P ‖P̂ − P‖_1 ≥ (1/8) √(eS/((1 + ζ)n)) · 1{(1 + ζ)n/S > e/16} + exp(−2(1 + ζ)n/S) · 1{(1 + ζ)n/S ≤ e/16} − exp(−ζ²n/24) − 12 exp(−ζ²S/(32(ln S)²)),   (13)

where the infimum is taken over all estimators.

Theorem 2 implies the following minimax lower bound in high dimensional asymptotics, if we take ζ → 0+.

Corollary 4. For any constant c > 0, if S = n/c, the convergence rate of the maximum ℓ1 risk is Ω(c^{−1/2}). Specifically,

lim inf_{c→∞} √c · lim inf_{n→∞} inf_{P̂} sup_{P ∈ M_S} E_P ‖P̂ − P‖_1 ≥ √e/8 > 0,   (14)

where the infimum is taken over all estimators.

Corollaries 3 and 4 imply that the MLE achieves the optimal convergence rate Θ(c^{−1/2}) in high dimensional linear scaling.

3) Infinite-dimensional Asymptotics: The performance of the MLE in the regime of bounded entropy is characterized in the following theorem.

Theorem 3. The empirical distribution P_n satisfies that, for any H > 0 and η > 1,

sup_{P: H(P) ≤ H} E_P ‖P_n − P‖_1 ≤ 2H/(ln n − 2η ln ln n) + 1/(ln n)^η.   (15)

Further, for any c ∈ (0, 1) and n > max{(1 − c)^{−1/(1−c)}, e^H},

sup_{P: H(P) ≤ H} E_P ‖P_n − P‖_1 ≥ (2cH/ln n) (1 − ((1 − c)n)^{−1/c}).   (16)

The next corollary follows from Theorem 3 after taking c → 1−.

Corollary 5. For any H > 0, the MLE P_n satisfies

lim_{n→∞} (ln n/H) · sup_{P: H(P) ≤ H} E_P ‖P_n − P‖_1 = 2.   (17)

It implies that we have not only obtained the Θ((ln n)^{−1}) convergence rate of the asymptotic ℓ1 risk of the MLE, but also shown that the multiplicative constant is exactly 2H. We note that this logarithmic convergence rate is really slow, implying that the sample size needs to be squared to reduce the maximum ℓ1 risk by a half. Also note that the maximum ℓ1 risk is proportional to the entropy H; thus the smaller the entropy of a distribution is known to be, the easier it is to estimate.

However, given this slow rate Θ((ln n)^{−1}), it is of utmost importance to obtain estimators such that the corresponding multiplicative constant is as small as possible. We show that the MLE does not achieve the optimal constant. In the following theorem, an essentially minimax estimator is explicitly constructed.

Theorem 4. For any η > 1, denote

∆_n ≜ (ln n)^{2η}/n,   (18)

then for the hard-thresholding estimator defined as P̂*(X) = (g_n(X_1), g_n(X_2), ..., g_n(X_S)) with

g_n(X_i) = (X_i/n) · 1{X_i/n > e²∆_n},   (19)

we have

sup_{P: H(P) ≤ H} E_P ‖P̂* − P‖_1 ≤ H/(ln n − ln(2e²) − 2η ln ln n) + (ln n)^{−η} + n^{1 − e²/4}.   (20)

Moreover, for any c ∈ (0, 1) and n ≥ e^H, we have

inf_{P̂} sup_{P: H(P) ≤ H} E_P ‖P̂ − P‖_1 ≥ (cH/ln n) (1 − n^{1 − 1/c}(1 − c)^{−1/c}),   (21)

where the infimum is taken over all estimators.

Theorem 4 presents both a non-asymptotic achievable maximum ℓ1 risk and a non-asymptotic lower bound on the minimax ℓ1 risk, and it is straightforward to verify that the upper bound and the lower bound coincide asymptotically by choosing c → 1−. As a result, the asymptotic minimax ℓ1 risk is characterized in the following corollary.

Corollary 6. For any H > 0, the asymptotic minimax risk is H/ln n, i.e.,

lim_{n→∞} (ln n/H) · inf_{P̂} sup_{P: H(P) ≤ H} E_P ‖P̂ − P‖_1 = 1,   (22)

and the estimator P̂* in Theorem 4 is asymptotically minimax.

In light of Corollaries 5 and 6, the asymptotic minimax ℓ1 risk for distribution estimation with bounded entropy H is exactly H/ln n, half of that obtained by the MLE. Since the convergence rate is Θ((ln n)^{−1}), the performance of the asymptotically minimax estimator with n samples in this problem is nearly that of the MLE with n² samples, which is a significant improvement.
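The following Python sketch (ours, not from the paper; NumPy assumed, and the choice of η and of the example distribution is arbitrary) implements the hard-thresholding estimator of (18)–(19). The guarantees of Theorem 4 are asymptotic and worst-case over an entropy ball; at moderate sample sizes the threshold e²(ln n)^{2η}/n is still fairly large, so the usage example below only illustrates the construction and does not by itself demonstrate the asymptotic advantage over the MLE.

```python
import numpy as np

def hard_threshold_estimator(counts, n, eta=1.1):
    """Hard-thresholding estimator of (18)-(19) (a sketch; eta > 1 is a free parameter).

    Coordinates whose empirical frequency X_i/n does not exceed e^2 * Delta_n,
    with Delta_n = (ln n)^(2*eta) / n, are set to zero; the remaining coordinates
    keep their empirical frequencies.  The output need not sum to one.
    """
    delta_n = np.log(n) ** (2 * eta) / n
    p_emp = np.asarray(counts, dtype=float) / n
    return np.where(p_emp > np.e ** 2 * delta_n, p_emp, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, n = 10_000, 100_000
    p = 1.0 / np.arange(1, S + 1) ** 2        # a heavy-tailed example distribution
    p /= p.sum()
    counts = rng.multinomial(n, p)
    p_ht = hard_threshold_estimator(counts, n)
    print("threshold e^2*Delta_n      :", np.e ** 2 * np.log(n) ** 2.2 / n)
    print("symbols kept by threshold  :", int((p_ht > 0).sum()))
    print("l1 error of the MLE        :", np.abs(counts / n - p).sum())
    print("l1 error after thresholding:", np.abs(p_ht - p).sum())
```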
Note that the asymptotically minimax estimator given in Theorem 4 is a hard-thresholding estimator which neglects all symbols with frequency less than e²∆_n = e²(ln n)^{2η}/n, and its risk depends on η through two factors. On one hand, there is a loss due to ignoring low frequency symbols, which increases with η and corresponds to the first term on the right-hand side of (20). On the other hand, the risk incurred by the involvement of high frequency symbols decreases with η and corresponds to the (ln n)^{−η} term in (20). Taking the derivative with respect to η yields that the optimal parameter η* to handle this trade-off takes the form η* = c_n ln n/ln ln n with coefficients c_n → 0 (but larger than Θ(ln ln n/ln n)).

Then the upper bound of the ℓ1 risk in (20) becomes

sup_{P: H(P) ≤ H} E_P ‖P̂* − P‖_1 ≤ H/((1 − 2c_n) ln n − ln(2e²)) + n^{−c_n} + n^{1 − e²/4}.   (23)

However, one may notice that the asymptotically minimax estimator in Theorem 4 outputs estimates that are not necessarily probability distributions. What if we constrain the estimator P̂ to output estimates confined to M_∞ ≜ ∪_{S=1}^∞ M_S, i.e., bona fide probabilities? In this case, the next theorem shows that the MLE is asymptotically minimax again.

Theorem 5. For any c ∈ (0, 1) and n ≥ e^H, we have

inf_{P̂ ∈ M_∞} sup_{P: H(P) ≤ H} E_P ‖P̂ − P‖_1 ≥ (2cH/ln n) (1 − n^{1 − 1/c}(1 − c)^{−1/c}),   (24)

where the infimum is taken over all estimators with outputs confined to the probability simplex.

The next corollary follows immediately from Corollary 5 upon taking c → 1−:

Corollary 7. For any H > 0, the asymptotic minimax risk is 2H/ln n when the estimates are restricted to the probability simplex:

lim_{n→∞} (ln n/H) · inf_{P̂ ∈ M_∞} sup_{P: H(P) ≤ H} E_P ‖P̂ − P‖_1 = 2,   (25)

and the MLE is asymptotically minimax.

Hence, the MLE still enjoys the asymptotically minimax property in the bounded entropy case if the estimates have to be probability mass functions. We can explain this peculiar phenomenon from the Bayes viewpoint as follows. By the minimax theorem [15], any minimax estimator can always be approached by a sequence of Bayes estimators under some priors. However, for general loss functions including the ℓ1 loss, the corresponding Bayes estimator (without any restrictions) may not belong to the probability simplex M_∞ even if the prior is supported on M_∞. Hence, the space of all Bayes estimators which form probability masses is strictly smaller than the space of all free Bayes estimators without any constraints. For example, it is well-known that under the ℓ1 loss, the Bayes response is actually the median vector of the posterior distribution, and thus does not necessarily form a probability distribution. We also remark that for some special loss functions such as the squared error loss, the Bayes estimator under any prior supported on M_∞ will always belong to M_∞. Broadly speaking, if the loss function is a Bregman divergence [16], the Bayes estimator under any prior will always be the conditional expectation, which is supported on the convex hull of the parameter space.

B. Discussion

Now we draw various connections between our results and the literature.

1) Density estimation under L1 loss: There is an extensive literature on density estimation under L1 loss, and we refer to the book by Devroye and Gyorfi [9] for an excellent overview. This problem is also very popular in theoretical computer science, e.g. [17].

The problem of discrete distribution estimation under ℓ1 loss has been the subject of recent interest in the theoretical computer science community. For example, Daskalakis, Diakonikolas and Servedio considered the problem of k-modal distributions [18], and a very recent talk by Diakonikolas [14] provided a literature survey. The conclusion that it is necessary and sufficient to use n = Θ(S) samples to consistently estimate an arbitrary discrete distribution with support size S has essentially appeared in the literature [14], but we did not find an explicit reference giving non-asymptotic results, and for completeness we have included proofs corresponding to the high dimensional asymptotics in this paper. We remark that a very detailed analysis of the discrete distribution estimation problem under ℓ1 loss may be insightful and instrumental in future breakthroughs in density estimation under L1 loss.

2) Entropy estimation: It was shown that the entropy is nearly Lipschitz continuous under the ℓ1 norm [19, Thm. 17.3.3] [20, Lemma 2.7], i.e., if ‖P − Q‖_1 ≤ 1/2, then

|H(P) − H(Q)| ≤ −‖P − Q‖_1 ln (‖P − Q‖_1 / S),   (26)

where S is the support size. At first glance, it seems to suggest that the estimation of entropy can be reduced to estimation of discrete distributions under ℓ1 loss. However, this question is far more complicated than it appears.

First, people have already noticed that this near-Lipschitz continuity result is valid only for finite alphabets [21]. The following result by Antos and Kontoyiannis [22] addresses entropy estimation over countably infinite support sizes.

Remark 1. Among all discrete sources with finite entropy and Var(−ln p(X)) < ∞, for any sequence {H_n} of estimators for the entropy, and for any sequence {a_n} of positive numbers converging to zero, there is a distribution P (supported on at most countably infinite symbols) with H = H(P) < ∞ such that

lim sup_{n→∞} E_P |H_n − H| / a_n = ∞.   (27)

Remark 1 shows that, among all sources with finite entropy and finite varentropy, no rate-of-convergence results can be obtained for any sequence of estimators. Indeed, if the entropy were still nearly-Lipschitz continuous with respect to the ℓ1 distance in the infinite alphabet setting, then Corollary 5 would immediately imply that the MLE plug-in estimator for entropy attains a universal convergence rate. That there is no universal convergence rate of entropy estimators for sources with bounded entropy is particularly interesting in light of the fact that the minimax rate of convergence of distribution estimation with bounded entropy is Θ((ln n)^{−1}).
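As a small numerical illustration of (26) (ours, not from the paper; NumPy assumed, and the perturbation used to build Q is an arbitrary choice), the following sketch evaluates both sides of the continuity bound for a pair of nearby distributions on a finite alphabet.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

def continuity_bound(p, q):
    """Return (|H(P)-H(Q)|, -||P-Q||_1 * ln(||P-Q||_1 / S)), cf. (26); valid when ||P-Q||_1 <= 1/2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    d = np.abs(p - q).sum()
    S = len(p)
    lhs = abs(entropy(p) - entropy(q))
    rhs = -d * np.log(d / S) if 0.0 < d <= 0.5 else float("inf")
    return lhs, rhs

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S = 50
    p = rng.dirichlet(np.ones(S))
    q = 0.9 * p + 0.1 / S                    # a nearby distribution on the same alphabet
    lhs, rhs = continuity_bound(p, q)
    print("|H(P)-H(Q)| =", lhs, " bound (26) =", rhs)
```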
Second, it is very interesting and deserves pondering that along high dimensional asymptotics, the minimax sample complexity for estimating the entropy is n = Θ(S/ln S) samples, a result first discovered by Valiant and Valiant [23], then recovered by Jiao et al. using a different approach in [24], and Wu and Yang in [25]. Since it is shown in Corollary 4 that we need n = Θ(S) samples to consistently estimate the distribution, this result shows that we can consistently estimate the entropy without being able to consistently estimate the underlying distribution under ℓ1 loss. Note that if the plug-in approach is used for entropy estimation, i.e., if we use H(P_n) to estimate H(P), it has been shown in [26], [27] that this estimator again requires n = Θ(S) samples. In fact, for a wide class of functionals of discrete distributions, it is shown in [24] that the MLE is strictly suboptimal, and the performance of the optimal estimators with n samples is essentially that of the MLE with n ln n samples, a phenomenon termed "effective sample size enlargement" in [24]. Jiao et al. [28] showed that the improved estimators introduced in [24] can lead to consistent and substantial performance boosts in various machine learning algorithms.

3) ℓ1 divergence estimation between two distributions: Now we turn to the estimation problem for the ℓ1 divergence ‖P − Q‖_1 between two discrete distributions P, Q with support size at most S. At first glance, by setting one of the distributions to be deterministic, the problem of ℓ1 divergence estimation seems a perfect dual to the distribution estimation problem under ℓ1 loss. However, compared to the required sample complexity n = Θ(S) in the distribution estimation problem under ℓ1 loss, the minimax sample complexity for estimating the ℓ1 divergence between two arbitrary distributions is n = Θ(S/ln S) samples, a result of Valiant and Valiant [29]. Hence, it is easier to estimate the ℓ1 divergence than to estimate the distribution with a vanishing ℓ1 risk. Note that for distribution estimation, for each symbol we need to obtain a good estimate for p_i in terms of the ℓ1 risk, while for ℓ1 divergence estimation it suffices to estimate the single scalar quantity Σ_i |p_i − q_i|, in which the estimation errors of individual symbols may average out.

4) Joint distribution estimation in stochastic processes: Marton and Shields [3] considered the estimation of the joint distribution of d consecutive observations of a stationary ergodic process with entropy rate H from n observed symbols, and showed that consistent estimation is possible when d ≤ (1 − ǫ) ln n / H, where ǫ > 0 is an arbitrary constant. Moreover, the empirical distribution is not consistent if d ≥ (1 + ǫ) ln n / H. Now we treat the joint d-tuple distribution as a single distribution with a large support size, and consider the corresponding estimation problem. To be precise, we assume without loss of generality that the original process consists of n observations and merge all disjoint blocks containing d = (1 − ǫ) ln n / H symbols into supersymbols. Consequently, we obtain a sample size of n/d from which we would like to learn a new distribution over an alphabet of size S^d and entropy nearly dH. Considering a special case where the stationary ergodic process is nearly i.i.d. and applying our result on the minimum sample complexity of the distribution estimation problem with bounded support size, we need Θ(S^d) = Θ(n^{(1−ǫ) ln S / H}) samples to estimate the new distribution, which cannot be achieved by n samples unless ln S = H, i.e., the distribution is uniform. Furthermore, under the regime of distribution estimation with bounded entropy, the asymptotic minimax risk of the MLE will be 2dH/ln n = 2(1 − ǫ), which does not vanish as n → ∞. Hence, we conclude that minimax distribution estimation under i.i.d. samples is indeed harder than joint d-tuple distribution estimation in a stationary ergodic process. To clarify the distinction, we remark that the ergodicity plays a crucial role in the latter case: for d → ∞, the Shannon-McMillan-Breiman theorem [30] guarantees that there are approximately e^{dH} typical sequences of length d, each of which occurs with probability about e^{−dH}. Then it turns out that we need only to estimate a uniform distribution with support size e^{dH}, and applying the previous conclusion n/d = Θ(e^{dH}) yields the desired result d ≈ ln n / H. On the other hand, for the distribution estimation problem with a large support size, the equipartition property does not necessarily hold, and our scheme focuses on the worst-case risk over all possible distributions. Hence, it is indeed harder to handle the distribution estimation problem under i.i.d. samples in the large alphabet regime, and it indicates that directly applying results from a large alphabet regime to stochastic processes with a large memory may not be a fruitful route. Before closing the discussion, we mention that Marton and Shields did not show that the d ∼ ln n / H scaling is minimax optimal. It remains an interesting question to investigate whether there is a better estimator to estimate the d-tuple joint distribution in stochastic processes.

5) Hard-thresholding estimation is asymptotically minimax: Corollary 6 shows that in the infinite dimensional asymptotics, the MLE is far from asymptotically minimax, and a hard-thresholding estimator achieves the asymptotic minimax risk. The phenomenon that thresholding methods are needed in order to obtain minimax estimators for high dimensional parameters in an ℓp ball under ℓq error dates back to the normal mean estimation literature; see Stein [32] and Donoho and Johnstone [31], [33]. In our problem the unknown distribution lies in an ℓ1 ball and the error is measured in ℓ1, so it is not surprising that hard-thresholding leads to an asymptotically minimax estimator. The asymptotically minimax estimators under other constraints on the distribution P remain to be explored.

6) Adaptive estimation: Note that in the infinite dimensional asymptotics, for a sequence of problems {H(P) ≤ H} with different upper bounds H, the asymptotically minimax estimator in Theorem 4 achieves the minimax risk over all "entropy balls" without knowing the "radius" H. It is very important in practice, since we do not know a priori an upper bound on the entropy of the distribution. This estimator belongs to a general collection of estimators called adaptive estimators. For details we refer to a survey paper by Cai [34].

The rest of this paper is organized as follows. Section II provides outlines of the proofs of the main theorems, and some useful auxiliary lemmas are listed in Appendix A. Complete proofs of some lemmas and corollaries are provided in Appendix B.
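To give a feel for the Θ(1/ln n) rate in Theorem 3 and Corollary 5, the following sketch (ours, not from the paper; NumPy and SciPy assumed) evaluates the exact ℓ1 risk of the MLE on the two-level distributions P = (δ/S′, ..., δ/S′, 1 − δ) used in the lower bound construction of Section II, with δ = cH/ln n; the printed risk should track the quantity 2cH/ln n from (16).

```python
import numpy as np
from scipy.stats import binom

def binom_mad(n, p):
    """Exact Binomial mean absolute deviation E|X/n - p| for X ~ B(n, p)."""
    k = np.arange(n + 1)
    return float(np.sum(np.abs(k / n - p) * binom.pmf(k, n, p)))

def mle_risk_two_level(n, H=2.0, c=0.5):
    """Exact E_P ||P_n - P||_1 for the two-level distribution P = (delta/S', ..., delta/S', 1 - delta),
    delta = c*H/ln(n), whose entropy is approximately H (up to rounding S' to an integer)."""
    delta = c * H / np.log(n)
    S1 = int(np.ceil(delta * np.exp(H / delta) * (1 - delta) ** ((1 - delta) / delta)))
    return S1 * binom_mad(n, delta / S1) + binom_mad(n, 1 - delta)

if __name__ == "__main__":
    H, c = 2.0, 0.5
    for n in (10**3, 10**4, 10**5, 10**6):
        risk = mle_risk_two_level(n, H=H, c=c)
        print(n, round(risk, 4), round(2 * c * H / np.log(n), 4))   # risk vs. 2cH/ln(n), cf. (16)
```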

II. OUTLINES OF PROOFS OF MAIN THEOREMS

A. Analysis of MLE

For the analysis of the performance of the MLE P_n, the key is to obtain a good approximation of E|X/n − p| with X ∼ B(n, p), i.e., the Binomial mean absolute deviation. Lemma 4 in the Appendix lists some sharp approximations, which together with the concavity of √(x(1 − x)) yield

E_P ‖P − P_n‖_1 = Σ_{i=1}^S E_P |p_i − X_i/n|   (28)
≤ Σ_{i=1}^S √(p_i(1 − p_i)/n)   (29)
≤ √((S − 1)/n),   (30)

which completes the proof of Theorem 1. For the upper bound in Theorem 3, we use Lemma 4 again and obtain

E_P ‖P − P_n‖_1 ≤ Σ_{i=1}^S min{√(p_i/n), 2p_i}   (31)
≤ Σ_{p_i ≤ (ln n)^{2η}/n} 2p_i + (1/√n) Σ_{p_i > (ln n)^{2η}/n} √p_i   (32)
≤ (2 Σ_{p_i ≤ (ln n)^{2η}/n} (−p_i ln p_i)) / (ln n − 2η ln ln n) + (1/√n) · √(|{i : p_i > (ln n)^{2η}/n}|)   (33)
≤ 2H/(ln n − 2η ln ln n) + 1/(ln n)^η.   (34)

For the lower bound, we consider the distribution P = (δ/S′, ..., δ/S′, 1 − δ) with entropy H; then for c ∈ (0, 1), δ ≤ c, due to the monotone decreasing property of (1 − x)^{1/x} with respect to x ∈ (0, 1),

S′ = δ exp(H/δ) (1 − δ)^{(1−δ)/δ}   (35)
≥ δ exp(H/δ) (1 − c)^{1/c}.   (36)

Note that S′ + 1 is the support size, and we assume without loss of generality that S′ is an integer. For c ∈ (0, 1), since n ≥ e^H, we choose δ = cH/ln n ≤ c; then since n > (1 − c)^{−1/(1−c)},

δ/S′ ≤ exp(−H/δ)(1 − c)^{−1/c} = ((1 − c)n)^{−1/c} < 1/n,   (37)

and the identity in Lemma 4 can be applied to obtain

E_P ‖P − P_n‖_1 ≥ Σ_{i=1}^{S′} E_P |p_i − X_i/n|   (38)
= 2δ (1 − δ/S′)^n   (39)
≥ (2cH/ln n) (1 − ((1 − c)n)^{−1/c}).   (40)

B. Analysis of the Estimator in Theorem 4

For the achievability result, we first establish some lemmas on the properties of g_n(X).

Lemma 1. If X ∼ B(n, p), p ≤ ∆_n, we have

E|g_n(X) − p| ≤ p + (p/(e∆_n))^{e² ln n}.   (41)

Proof: It is clear from the triangle inequality that

E|g_n(X) − p| ≤ p + E[(X/n) 1{X/n > e²∆_n}]   (42)
≤ p + P(X > e²n∆_n)   (43)
≤ p + (np/(en∆_n))^{e²(ln n)^{2η}},   (44)

where Lemma 5 in Appendix A is used in the last step. The proof is completed by noticing that p/(e∆_n) ≤ 1/e < 1.

Lemma 2. If X ∼ B(n, p), p ≥ 2e²∆_n, we have

E|g_n(X) − p| ≤ √(p/n) + n^{−e²/4}.   (45)

Proof: It follows from the identity

g_n(X) = X/n − (X/n) 1{X/n ≤ e²∆_n}   (46)

and the triangle inequality that

E|g_n(X) − p| ≤ E|X/n − p| + E[(X/n) 1{X/n ≤ e²∆_n}]   (47)
≤ E|X/n − p| + P(X ≤ e²n∆_n)   (48)
≤ √(p/n) + e^{−e²(ln n)^{2η}/4}.   (49)

Then the proof is completed by noticing that e^{−e²(ln n)^{2η}/4} ≤ e^{−e² ln n/4} = n^{−e²/4}.

Lemma 3. If X ∼ B(n, p), ∆_n < p < 2e²∆_n, we have

E|g_n(X) − p| ≤ √(p/n) + p.   (50)

Proof: It is clear that

E|g_n(X) − p| ≤ E[|X/n − p| 1{X/n > e²∆_n}] + E[|0 − p| 1{X/n ≤ e²∆_n}]   (51)
≤ E|X/n − p| + p   (52)
≤ √(p/n) + p.   (53)

Combining these lemmas, the upper bound of Theorem 4 is given on the bottom of this page.

C. Proof of the Lower Bounds in Theorems 2, 4 and 5

To obtain a lower bound for the minimax risk, an effective way is to use the Bayes risk to serve as the lower bound [8], where the prior can be arbitrarily chosen. Hence, our target is to find an unfavorable prior and compute the corresponding Bayes risk. For computational simplicity, in the proof we will assign the product of independent priors to the whole probability vector P based on the Poissonized model X_i ∼ Poi(np_i), and then use some concentration inequalities such as the Hoeffding bound to ensure that the vector P is close to a probability distribution with overwhelming probability. Then the relationship between the minimax risk of the Poissonized model and that of the Multinomial model needs to be established. The rigorous proofs are detailed as follows.

1) Lower Bound in Theorem 2: We denote the uniform distribution on the two points {(1−η)/S, (1+η)/S} by µ_0, with η ∈ (0, 1) to be specified later, and assign the product measure µ_0^S to the probability vector P. Under the Poissonized model X_i ∼ Poi(np_i), it is straightforward to see that all p_i (1 ≤ i ≤ S) are conditionally independent given X. Hence, the Bayes estimator P̂^B(X) under the prior µ_0^S can be decomposed into P̂^B(X) = (f(X_1), f(X_2), ..., f(X_S)), for some function f(·). Then the Bayes risk is shown on the bottom of this page, where d_TV(Poi(u), Poi(v)) is the variational distance between two Poisson distributions:

d_TV(Poi(u), Poi(v)) ≜ sup_{A ⊂ N} |P{Poi(u) ∈ A} − P{Poi(v) ∈ A}|.   (65)
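As a complement, the next sketch (ours, not from the paper; only the Python standard library is used, and the parameter values are arbitrary) evaluates the total variation distance in (65) between the two Poisson distributions appearing in this construction, by direct summation of the probability mass functions.

```python
import math

def poisson_pmf(k, lam):
    """P(Poi(lam) = k), computed in log-space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_poisson(u, v):
    """Total variation distance between Poi(u) and Poi(v), cf. (65):
    d_TV = (1/2) * sum_k |P(Poi(u)=k) - P(Poi(v)=k)|."""
    kmax = int(max(u, v) + 20.0 * math.sqrt(max(u, v) + 1.0)) + 20
    return 0.5 * sum(abs(poisson_pmf(k, u) - poisson_pmf(k, v)) for k in range(kmax + 1))

if __name__ == "__main__":
    n, S, eta = 10_000, 5_000, 0.3
    d = tv_poisson(n * (1.0 - eta) / S, n * (1.0 + eta) / S)
    print("d_TV =", round(d, 4))
    print("eta * (1 - d_TV) =", round(eta * (1.0 - d), 4))   # combination appearing in the Bayes risk bound
```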

e2 ln n 2 pi pi pi − e EP P Pˆ 1 pi + + + pi + + n 4 (54) k − k ≤ e∆n 2 n 2 n pi≤∆n ! ∆n∆n r pi≤∆n pi≥2e ∆n X X X   X −1 2 1 1 −e2 ln n n − e 1 ln H + + e + n 4 (56) ≤ 2e2∆ · √n∆ · (ln n)2η · 2e2∆  n  n n H 1 1 1 = + + 2 + 2 (57) ln n ln(2e2) 2η ln ln n (ln n)η ne −1(ln n)2η 2 e −1 2η − − 2e n 4 (ln n) 2 H −η 1− e + (ln n) + n 4 (58) ≤ ln n ln(2e2) 2η ln ln n − −

R (S,n,µS)= E P PˆB µ(dP ) (59) B 0 P k − k1 Z ∞ S (np )k = p f(k) e−npi i µ (dp ) (60) | i − | k! 0 i i=1 Z k=0 X X∞ S n(1 η) n(1 + η) p f(k) min P Poi − = k , P Poi = k µ (dp ) (61) ≥ | i − | { S } { S } 0 i i=1 Z k=0       X X∞ η n(1 η) n(1 + η) S min P Poi − = k , P Poi = k (62) ≥ · S { S } { S } k=0       ∞ X n(1 η) n(1 + η) = η min P Poi − = k , P Poi = k (63) · { S } { S } kX=0       n(1 η) n(1 + η) = η 1 dTV Poi − , Poi , (64) · − S S       8

Adell and Jodra [35] gives an upper bound for this distance: distribution µ on to the distribution vector P , and con- NS,H sider the Bayes estimator PˆB of P under the prior µ. We first dTV(Poi(t), Poi(t + x)) ≤ compute the posterior distribution of P given an observation 2 vector X. Denote by N(X) the number of different symbols min 1 e−x, (√t + x √t) , t,x 0 (66) (excluding the last symbol) appearing in the observation vector ( − r e − ) ≥ X, and we assume without loss of generality that the first then N(X) symbols appear in the sample. Then it is straightforward n(1 η) n(1 + η) to show that the posterior distribution of P on X is uniform dTV Poi − , Poi (67) S S in the following set X S,H : P X if and only if      δ N ⊂ N ∈ N δ pS = 1 δ, pi = , 1 i N(X),pj 0, ,N(X)+ n 2n S′ S′ min 1 exp 2η , ( 1+ η 1 η) 1 j −kS′, and ≤ ≤ ∈ { } ≤ ( − − · S reS − − ) ≤ ≤   δ p p (68) j : N(X)+1 j kS′,p = = S′ N(X). ≤ ≤ j S′ − n n   (69) (78) min 1 exp 2η , 2η . ≤ − − · S · eS    r  Hence, by setting B Hence, the Bayes estimator Pˆ (X) = (a1,a2, ,aS) 1 eS should minimize the posterior risk given X expressed··· as η = min 1, (70) ℓ1 4 n ( r ) N(X) X , δ we can obtain RB( ,H,n,µ) ai + 1 δ aS S′ − | − − | i=1 exp 2n , n e X S − S S ≤ 16 kS′ ′ ′ X RB(S,n,µ0 ) (71) (k 1) S aj S N( ) δ ≥  1 eS  n e + −′ | | + ′− ′ aj , 8 n , S > 16 . kS N(X) kS N(X) S −  j=N(X)+1  − −  q X The combination of Lemma 7 and 8 in Appendix A yields, (79)  ζ for any ζ (0, 1] and ǫ (0, 2(1+ζ) ], δ ∈ ∈ and the solution is a1 = = aN(X) = ,aN(X) = = ··· S′ ··· E ˆ S akS =0 and aS =1 δ, and inf sup P P P 1 RB(S, (1 + ζ)n,µ0 ) ′ ˆ k − k ≥ − P P ∈MS N(X) 2 ζ n RB(X,H,n,µ)= 1 δ. (80) exp( ) 6µS ( (ǫ)c) . (72) − S′ − − 24 − 0 MS   Setting ǫ = ζ , Lemma 6 in Appendix A yields 4ln S In light of this, the Bayes risk can be expressed as S S RB(H,n,µ) = E[RB (X,H,n,µ)], where the expectation S c S E µ0 ( S(ǫ) )= µ0 pi pi ǫ (73) is taken with respect to X, and by (80) we only need to M − ≥ ( i=1 "i=1 # ) E X X X compute N( ). Due to the symmetry of the Multinomial 2ǫ2 2exp (74) distribution, we can assume without loss of generality that ≤ −S (2/S)2 X Multi(n; δ , , δ , 0, , 0, 1 δ), then  ·  ∼ S′ ··· S′ ··· − ζ2S = 2exp . (75) S′ S′ −32(ln S)2 E E P N(X)= ½(X > 0) = (X > 0)    i  i The proof of Theorem 2 is completed by the combination of i=1 i=1 X X (81) (71), (72) and (73).  δ  n = S′ 1 1 nδ 2) Lower Bound in Theorem 4 and 5: For the proof of · − − S′ ≤ the lower bound in Theorem 4, we need a different prior.     Specifically, we fix some δ (0, 1) with its value to be which yields ′∈ specified later, and consider S with RB(H,n,µ)= E[RB (X,H,n,µ)] ′ EN(X) nδ (82) δ ln S δ ln δ (1 δ) ln(1 δ)= H. (76) = 1 δ = 1 δ. − − − − − S′ − S′ Now define S = kS′ + 1 with parameter k 2 to be     ≥ specified later, and consider the following collection S,H of -dimensional non-negative vectors: if andN only if cH S P S,H Fix c (0, 1), we set δ = ln n c, and δ ∈ N ′ ∈ ≤ pS =1 δ, pi 0, , 1 i S 1= kS , and S′ 1−δ − ∈{ } ≤ ≤ − H 1 ′ δ δ S = δ exp (1 δ) (83) i :1 i kS′,p = = S′. (77) δ − i ′     ≤ ≤ S cH 1 1   c c n (1 c) . (84) By the preceding two equalities, we can easily verify that ≥ ln n · · − P implies H(P ) = H. Now we assign the uniform Since the minimax risk is lower bounded by any Bayes risk, ∈ NS,H 9 we have III. ACKNOWLEDGEMENT We would like to thank I. Sason, the associate editor, and inf sup EP Pˆ P 1 RB(H,n,µ) (85) Pˆ P :H(P )≤H k − k ≥ three anonymous reviewers for their very helpful suggestions.

cH 1− 1 − 1 J. Jiao is partially supported by a Stanford Graduate Fellow- 1 n c (1 c) c , ≥ ln n · − − ship.  (86) which completes the proof of the lower bound in Theorem 4. APPENDIX A AUXILIARY LEMMAS For the lower bound in Theorem 5, note that we need to The following lemma gives a sharp estimate of the Binomial constrain that the Bayes estimator PˆB form a probability mass, mean absolute deviation. S i.e., i=1 ai =1 in the minimization process of (79). Defining B X Lemma 4. [11] For X (n,p), we have (k−2)S′+N( ) ∼ λ , X [0, 1], the derivations on the bottom of P kS′−N( ) ∈ this page show that the solution becomes a = = a X = X p(1 p) 1 N( ) E p min − , 2p . (99) δ δ(S′−N(X)) ··· ,a X = = a = and a = 1 δ. n − ≤ n S N( ) kS′ S (kS −N(X)) S (r ) ′ ··· ′ ′ X − Hence, the corresponding Bayes risk given is Moreover, for p< 1/n , there is an identity 2(k 1)S′ N(X) R (X,H,n,µ)= − 1 δ (93) X B kS′ N(X) − S′ E p =2p(1 p)n. (100) −   n − − 2(k 1)δ N(X) − 1 . (94) The following lemma gives some tail bounds for Poisson or ≥ k − S′   Binomial random variables. Lemma 5. [36, Exercise 4.7] If X Poi(λ) or X B(n, λ ), Applying the similar steps, we can show that the overall ∼ ∼ n Bayes risk is then for any 0 <δ< 1, we have λ E X eδ RB(H,n,µ)= RB( ,H,n,µ) (95) P (101) (X (1 + δ)λ) 1+δ , 2(k 1)cH 1 1 ≥ ≤ (1 + δ) 1− c − c   − 1 n (1 c) . (96) −δ λ ≥ k ln n · − − e 2 P(X (1 δ)λ) e−δ λ/2. (102) Since the minimax risk is lower bounded by any Bayes risk, ≤ − ≤ (1 δ)1−δ ≤  −  we have The following lemma presents the Hoeffding bound. E ˆ inf sup P P P 1 RB(H,n,µ) (97) Lemma 6. [37] For independent and identically distributed Pˆ∈M P :H(P )≤H k − k ≥ random variables X1, ,Xn with a Xi b for 1 i 2(k 1)cH 1− 1 − 1 n ··· ≤ ≤ ≤ ≤ − 1 n c (1 c) c , (98) n, denote Sn = i=1 Xi, we have for any t> 0, ≥ k ln n · − − 2 which completes the proof of the lower bound in Theorem 5 P 2t P Sn E[Sn] t 2exp . (103) by letting k . {| − |≥ }≤ −n(b a)2 → ∞  − 

N(X) δ kS′ (k 1)S′ S′ N(X) δ R (X,H,n,µ) , a + − 0 a + − a + 1 δ a (87) B S′ − i kS′ N(X) | − j| kS′ N(X) S′ − j | − − S| i=1 X X j=NX( )+1  − −  X N( ) kS′ δ (k 1)S′ S′ N(X) δ a + − a + − a + 1 δ a (88) ≥ S′ − i kS′ N(X) · j kS′ N(X) · S′ − j | − − S| i=1 X X j=NX( )+1 − −   X N( ) kS′ δ δ N(X) = a + λa + 1 + 1 δ a (89) S′ − i j kS′ N(X) · − S′ | − − S| i=1 j=N(X)+1 X X −   X N( ) kS′ δ δ N(X) λ a + λa + 1 + λ a 1+ δ (90) ≥ i − S′ j kS′ N(X) · − S′ | S − | i=1 j=N(X)+1   X X − N(X) δ kS′ δ N(X ) λ a + λa + 1 + λ(a 1+ δ) (91) ≥ i − S′ j kS′ N(X) · − S′ S − i=1   j=N(X)+1    X X − 2(k 1)S′ N(X) = − 1 δ, (92) kS′ N(X) − S′ −   10

A non-negative loss function l( ) on Rp is called bowl- APPENDIX B · shaped iff l(u)= l( u) for all u Rp and for any c 0, the PROOF OF COROLLARIES AND AUXILIARY LEMMAS sublevel set u : l(u−) c is convex.∈ The following≥ theorem { ≤ } is one of the key theorems in the definition of asymptotic A. Proof of Corollary 2 efficiency. Consider the discrete distribution P = (p ,p , ,p ) 1 2 ··· S Theorem 6. [2, Thm. 8.11] Let the experiment (P ,θ Θ) be with cardinality S, and we take θ = (p1,p2, ,pS−1) to θ ··· differentiable in quadratic mean at θ with nonsingular∈ Fisher be the free parameter. By definition, we know that the Fisher information matrix I . Let ψ( ) be differentiable at θ. Let T information matrix is θ · { n} be any estimator sequence in the experiments (P n,θ Rk). ∂2 θ ∈ E Then for any bowl-shaped loss function l, Ii,j (θ)= θ ln p(x θ) , 1 i, j S 1. −∂θi∂θj | ≤ ≤ −   (111) E h sup lim inf sup θ+ h l √n Tn ψ θ + n→∞ √n − √n I h∈I     It is straightforward to obtain that El(X), (104) ≥ ∂2 ′ −1 ′ T Ii,j (θ)= pi 2 ln pi δi,j where (X) = (0, ψ (θ)Iθ ψ (θ) ), and the first supre- −∂pi L N k   mum is taken over all finite subsets I R . S−1 S−1 ⊂ ∂2 + 1 pk ln 1 pk − −∂pi∂pj − The next lemma relates the minimax risk under the Pois- k=1 ! " k=1 !# X X sonized model of an approximate probability distribution and (112) that under the Multinomial model of a true probability distri- δi,j 1 = + . (113) bution, where the set of approximate probability distribution pi pS is defined by where δi,j equals one if i = j and zero otherwise. Hence, in S matrix form we have , S(ǫ) P = (p1,p2, ,pS): pi 0, pi 1 <ǫ . 1 T M ( ··· ≥ − ) I(θ)= Λ + 11 , (114) i=1 p X (105) S Λ , −1 −1 1 , T where diag(p1 , ,pS−1), and (1, 1, , 1) is a We define the minimax risk for Multinomial model with n (S 1) 1 column vector.··· According to the Woodbury··· matrix observations on support size S for estimating P as identity− ×

−1 −1 −1 −1 −1 −1 −1 R(S,n) = inf sup EMultinomial Pˆ P 1, (106) (A + UCV) = A A U(C + VA U) VA , ˆ k − k P P ∈MS − (115) and the corresponding minimax risk for Poissonized model for A Λ U 1 C −1 V 1T estimating an approximate distribution as we can take = , = , = pS , = to obtain

−1 1 T −1 RP (S,n,ǫ) = inf sup EPoissonized Pˆ P 1. (107) I(θ) = (Λ + 11 ) (116) ˆ P P ∈MS (ǫ) k − k pS −1 −1 −1 T −1 T −1 = Λ Λ 1(pS + 1Λ 1 ) 1 Λ (117) − T Lemma 7. The minimax risks under the Poissonized model = Λ−1 Λ−111 Λ−1. (118) and the Multinomial model are related via the following − inequality: for any ζ (0, 1] and 0 <ǫ ζ , we have After some algebra we can show that ∈ ≤ 2(1+ζ) 2 −1 ζ n [I(θ) ]i,j = pipj + piδi,j , (119) R(S,n) RP (S, (1 + ζ)n,ǫ) exp( ) ǫ. (108) − ≥ − − 24 − RS R X , S then by choosing l : + defined by l( ) i=1 Xi 7→ S−1| | The following lemma establishes the relationship of the and ψ((p1,p2, ,pS−1)) = (p1,p2, ,pS−1, 1 pi) ··· ··· −P i=1 R (S,n,ǫ) and the Bayes risk under some prior µ. in Theorem 6, for (X)= (0, ψ′(θ)I(θ)−1ψ′(θ)T ), P L N P 2 S Lemma 8. Assigning prior µ to a non-negative vector P , El(X)= p (1 p ). (120) π i − i denote the corresponding Bayes risk for estimating P under r i=1 X ℓ loss by R (S,n,µ). If there exists a constant A> 0 such p 1 B If we choose θ = (1/S, 1/S, , 1/S), then for any that ∞ ··· estimator sequence Tn n=1, Theorem 6 yields S { } µ P : pi A =1, (109) E h ≤ sup lim inf sup θ+ h √n Tn θ + ( i=1 ) n→∞ √n − √n X I h∈I   1 S then the following inequality holds: 2 2(S 1) pi(1 p i)= − (121) c ≥ π − π RP (S,n,ǫ) RB(S,n,µ) 3A µ ( S(ǫ) ) . (110) r i=1 r ≥ − · M X p 11 and the proof is completed by noticing that the Bayes risk under prior µ, applying Pˆ′ yields

lim inf √n inf sup EP Tn P E ˆ′ n→∞ T 1 RB(S,n,µ) P P P 1µ(dP ) (131) · n P ∈MS k − k ≤ k − k Z E h ′ sup lim inf sup θ+ h √n Tn θ + . = EP Pˆ P 1µ(dP ) ≥ I n→∞ h∈I √n − √n k − k   1 MS (ǫ) (122) Z E ˆ′ + P P P 1µ(dP ) (132) c k − k ZMS (ǫ) 1 E Pˆ′ P π(dP ) ≤ µ( (ǫ)) P k − k1 MS ZMS(ǫ) B. Proof of Lemma 7 + 2Aµ(dP ) (133) c MS (ǫ) ′ Z By the definition of the minimax risk under the Multinomial RB(S,n,π) = +2A (1 µ( S(ǫ))) . (134) model, for any δ > 0, there exists an estimator Pˆ (X,S,n) µ( S(ǫ)) − M M M such that Since the Bayes risk serves as a lower bound for the minimax ′ E ˆ X risk, i.e., RP (S,n,ǫ) R (S,n,π), we have sup P PM ( ,S,n) P 1 < R(S,n)+ δ, n. (123) ≥ B P ∈MS k − k ∀ c RP (S,n,ǫ) RB(S,n,µ) (2A + RB (S,n,µ))µ ( S(ǫ) ) . Now we construct a new estimator under the Poissonized ≥ − M (135) ′ ′ model, i.e., we set PˆP (X,S) , PˆM (X,S,n ) where n = S Poi S Then the proof is completed by noticing that RB(S,n,µ) A, i=1 Xi (n i=1 pi). Then we can obtain that for ∼ζ ˆ X 0 ≤ 0 <ǫ< and ζ (0, 1), for the risk of the null estimator P ( )= under prior µ does P 2(1+ζ) P∈ not exceed A. R (S,n,ǫ) sup E Pˆ (X,S) P (124) P ≤ P k P − k1 P ∈MS (ǫ) REFERENCES ∞ = sup E Pˆ (X,S,m) P P(n′ = m) [1] C. E. Shannon, “A mathematical theory of communication,” The Bell P k M − k1 · System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948. m=0 P ∈MS (ǫ) X [2] A. W. Van der Vaart, Asymptotic Statistics. Cambridge university press, (125) 2000, vol. 3. ∞ [3] K. Marton and P. C. Shields, “Entropy and the consistent estimation P sup E Pˆ (X,S,m) of joint distributions,” The Annals of Probability, vol. 22, no. 2, pp. ≤ P M − S 960–977, 1994. m=0 P ∈MS (ǫ) i=1 pi X 1 [4] H. Steinhaus, “The problem of estimation,” The Annals of Mathematical Statistics, vol. 28, pp. 633–648, 1957. P ′ P + P P(n = m) (126) [5] S. Trybula, “Some problems of simultaneous minimax estimation,” The S − · Annals of Mathematical Statistics, vol. 29, pp. 245–253, 1958. i=1 pi ! 1 [6] M. Rutkowska, “Minimax estimation of the parameters of the multivari- ∞ P ′ ate hypergeometric and multinomial distributions,” Zastos. Mat., vol. 16, ( R(S,m)+ δ + ǫ) P(n = m) (127) ≤ · pp. 9–21, 1977. m=0 [7] I. Olkin and M. Sobel, “Admissible and minimax estimation for the X ′ n n multinomial distribution and for independent binomial distributions,” P(n )+ R(S, )+ ǫ + δ (128) The Annals of Statistics, vol. 7, pp. 284–290, 1979. ≤ ≤ 1+ ζ 1+ ζ [8] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. ζ2n n New York, NY: Springer-Verlag, 1998. exp( )+ R(S, )+ ǫ + δ, (129) [9] L. Devroye and L. Gy¨orfi, Nonparametric Density Estimation: The L1 ≤ − 24 1+ ζ View. New York, NY: John Wiley, 1985. where we have used Lemma 5 in the last step. The desired [10] P. Diaconis and S. Zabell, “Closed form summation for classical distributions: Variations on a theme of De Moivre,” Statistical Science, result follows directly from (129) and the arbitrariness of δ. vol. 6, no. 3, pp. 284–302, 1991. [11] D. Berend and A. Kontorovich, “A sharp estimate of the binomial mean absolute deviation with applications,” Statistics & Probability Letters, vol. 83, no. 4, pp. 1254 – 1259, 2013. [12] S. Kamath, A. Orlitsky, V. Pichapati, and A. T. Suresh, “On learning distributions from their samples,” in Proceedings of The 28th Conference C. Proof of Lemma 8 on Learning Theory, 2015, pp. 1066–1100. [13] J. H´ajek, “Local asymptotic minimax and admissibility in estimation,” in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, vol. 
1, 1972, pp. 175–194. Denote the conditional prior π by [14] I. Diakonikolas, “Beyond histograms: Structure and distribution estima- µ(A (ǫ)) tion,” in Workshop of the 46th ACM Symposium on Theory of Computing, π(A)= ∩MS , (130) 2014. µ( S(ǫ)) [15] A. Wald, Statistical Decision Functions. Wiley, 1950. M [16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with we consider the Bayes estimator Pˆ′ under prior π and the bregman divergences,” The Journal of Machine Learning Research, ′ vol. 6, pp. 1705–1749, 2005. corresponding Bayes risk RB (S,n,π). Due to our construction X [17] S. Chan, I. Diakonikolas, R. A. Servedio, and X. Sun, “Efficient density of the prior, we know that for all , the sum of all entries of estimation via piecewise polynomial approximation,” in 46th ACM ′ Pˆ (X) will not exceed A almost surely. Since RB (S,n,µ) is Symposium on Theory of Computing, 2014, pp. 604–613. 12

[18] C. Daskalakis, I. Diakonikolas, and R. A. Servedio, "Learning k-modal distributions via testing," in Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, 2012, pp. 1371–1385.
[19] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience, 2006.
[20] I. Csiszar and J. Körner, "Information theory: Coding theorems for discrete memoryless channels," Budapest: Akadémiai Kiadó, 1981.
[21] J. F. Silva and P. A. Parada, "Shannon entropy convergence results in the countable infinite case," in Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, July 2012, pp. 155–159.
[22] A. Antos and I. Kontoyiannis, "Convergence properties of functional estimates for discrete distributions," Random Struct. Algorithms, vol. 19, no. 3-4, pp. 163–193, 2001.
[23] G. Valiant and P. Valiant, "Estimating the unseen: an n/log n-sample estimator for entropy and support size, shown optimal via new CLTs," in Proceedings of the 43rd ACM Symposium on Theory of Computing, 2011, pp. 685–694.
[24] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Minimax estimation of functionals of discrete distributions," Information Theory, IEEE Transactions on, vol. 61, no. 5, pp. 2835–2885, 2015.
[25] Y. Wu and P. Yang, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation," available on arXiv, 2014.
[26] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[27] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Maximum likelihood estimation of functionals of discrete distributions," submitted to IEEE Transactions on Information Theory, 2015.
[28] ——, "Beyond maximum likelihood: from theory to practice," available on arXiv, 2014.
[29] G. Valiant and P. Valiant, "The power of linear estimators," in IEEE 52nd Annual Symposium on Foundations of Computer Science, Oct 2011, pp. 403–412.
[30] P. H. Algoet and T. M. Cover, "A sandwich proof of the Shannon-McMillan-Breiman theorem," The Annals of Probability, vol. 16, no. 2, pp. 899–909, 1988.
[31] D. L. Donoho and J. M. Johnstone, "Minimax risk over ℓp-balls for ℓq-error," Probability Theory and Related Fields, vol. 99, pp. 277–303, 1994.
[32] C. Stein, "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution," in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 399, 1956, pp. 197–206.
[33] D. L. Donoho and J. M. Johnstone, "Ideal spatial adaptation by wavelet shrinkage," Biometrika, vol. 81, no. 3, pp. 425–455, 1994.
[34] T. T. Cai, "Minimax and adaptive inference in nonparametric function estimation," Statistical Science, vol. 27, no. 1, pp. 31–50, 2012.
[35] J. A. Adell and P. Jodra, "Exact Kolmogorov and total variation distances between some familiar discrete distributions," Journal of Inequalities and Applications, vol. 2006, no. 1, pp. 1–8, 2006.
[36] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[37] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.

Jiantao Jiao (S'13) received his B.Eng. degree with the highest honor in Electronic Engineering from Tsinghua University, Beijing, China in 2012, and a Master's degree in Electrical Engineering from Stanford University in 2014. He is currently working towards the Ph.D. degree in the Department of Electrical Engineering at Stanford University. He is a recipient of the Stanford Graduate Fellowship (SGF). His research interests include information theory and statistical signal processing, with applications in communication, control, computation, networking, data compression, and learning.

Tsachy Weissman (S'99-M'02-SM'07-F'13) graduated summa cum laude with a B.Sc. in electrical engineering from the Technion in 1997, and earned his Ph.D. at the same place in 2001. He then worked at Hewlett Packard Laboratories with the information theory group until 2003, when he joined Stanford University, where he is currently Professor of Electrical Engineering and incumbent of the STMicroelectronics chair in the School of Engineering. He has spent leaves at the Technion, and at ETH Zurich.

Tsachy's research is focused on information theory, compression, communication, statistical signal processing, the interplay between them, and their applications. He is recipient of several best paper awards, and prizes for excellence in research and teaching. He served on the editorial board of the IEEE TRANSACTIONS ON INFORMATION THEORY from Sept. 2010 to Aug. 2013, and currently serves on the editorial board of Foundations and Trends in Communications and Information Theory. He is Founding Director of the Stanford Compression Forum.

Yanjun Han (S’14) received his B.Eng. degree with the highest honor in Electronic Engineering from Tsinghua University, Beijing, China in 2015. He is currently working towards the Ph.D. degree in the Department of Electrical Engineering at Stanford University. His research interests include information theory and statistics, with applications in communications, data compression, and learning.