Minimax Estimation of Functionals of Discrete Distributions

Jiantao Jiao, Student Member, IEEE, Kartik Venkat, Student Member, IEEE, Yanjun Han, Student Member, IEEE, and Tsachy Weissman, Fellow, IEEE

Abstract—We propose a general methodology for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions, where the support size S is unknown and may be comparable with or even much larger than the number of observations n. We treat the respective regions where the functional is "nonsmooth" and "smooth" separately. In the "nonsmooth" regime, we apply an unbiased estimator for the best polynomial approximation of the functional, whereas in the "smooth" regime we apply a bias-corrected version of the Maximum Likelihood Estimator (MLE). We illustrate the merit of this approach by thoroughly analyzing the performance of the resulting schemes for estimating two important information measures: the entropy H(P) = \sum_{i=1}^{S} -p_i \ln p_i and F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha, \alpha > 0. We obtain the minimax L2 rates for estimating these functionals. In particular, we demonstrate that our estimator achieves the optimal sample complexity n ≍ S/ln S for entropy estimation. We also demonstrate that the sample complexity for estimating F_α(P), 0 < α < 1, is n ≍ S^{1/α}/ln S, which can be achieved by our estimator but not the MLE. For 1 < α < 3/2, we show that the minimax L2 rate for estimating F_α(P) is (n ln n)^{-2(α-1)} for infinite support size, while the maximum L2 rate for the MLE is n^{-2(α-1)}. For all of the above cases, the behavior of the minimax rate-optimal estimators with n samples is essentially that of the MLE (plug-in rule) with n ln n samples, which we term "effective sample size enlargement".

We highlight the practical advantages of our schemes for the estimation of entropy and mutual information. We compare our performance with various existing approaches, and demonstrate that our approach reduces running time and boosts the accuracy. Moreover, we show that the minimax rate-optimal mutual information estimator yielded by our framework leads to significant performance boosts over the Chow–Liu algorithm in learning graphical models. The wide use of information measure estimation suggests that the insights and estimators obtained in this work could be broadly applicable.

Index Terms—Entropy estimation, nonsmooth functional estimation, maximum likelihood, approximation theory, minimax lower bound, polynomial approximation, minimax-optimality, high dimensional statistics, Rényi entropy, Chow–Liu algorithm.

This work was supported in part by two Stanford Graduate Fellowships, and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370. Jiantao Jiao, Kartik Venkat, and Tsachy Weissman are with the Department of Electrical Engineering, Stanford University, CA, USA (email: {jiantao,kvenkat,tsachy}@stanford.edu). Yanjun Han is with the Department of Electronic Engineering, Tsinghua University, Beijing, China (email: [email protected]). Communicated by H. Permuter, Associate Editor for Shannon Theory. The Matlab code package of this paper can be downloaded from http://web.stanford.edu/~tsachy/software.html.

I. INTRODUCTION AND MAIN RESULTS

Given n independent samples from an unknown discrete probability distribution P = (p_1, p_2, ..., p_S), with unknown support size S, consider the problem of estimating a functional of the distribution of the form

  F(P) = \sum_{i=1}^{S} f(p_i),    (1)

where f : (0, 1] → R is a continuous function. Among the most fundamental of such functionals is the entropy [1],

  H(P) \triangleq \sum_{i=1}^{S} -p_i \ln p_i,    (2)

which plays significant roles in information theory [1]. Another information theoretic quantity closely related to the entropy is the mutual information, which for discrete random variables can be defined as

  I(X; Y) = I(P_{XY}) = H(P_X) + H(P_Y) - H(P_{XY}),    (3)

where P_X, P_Y, P_{XY} denote, respectively, the distributions of the random variables X, Y, and the pair (X, Y).

We are also interested in the family of information measures F_α(P):

  F_\alpha(P) \triangleq \sum_{i=1}^{S} p_i^\alpha,  \alpha > 0.    (4)
The significance of the functional F_α(P) can be seen via the connection H_α(P) = \ln F_α(P)/(1-α), where H_α(P) is the Rényi entropy [2], emerging in answering fundamental questions in information theory [3], [4], [5]. The functional 1 - F_2(P) is also called the Gini impurity, which is widely used in machine learning [6].

Over the years, the use of information theoretic measures, especially entropy and mutual information, has extended far beyond the information theory community, and is deeply imbued in fundamental concepts from various disciplines. In statistics, one of the popular criteria for objective Bayesian modeling [7] is to design a prior on the parameter that maximizes the mutual information between the parameter and the observations. In machine learning, the so-called infomax [8] criterion states that the function that maps a set of input values to a set of output values should be chosen or learned so as to maximize the mutual information between the input and output, subject to a set of specified constraints. This principle has been widely adopted in practice; for example, in decision tree based algorithms such as C4.5 [9], one tries to select the feature at each step of tree splitting that maximizes the mutual information (the information gain principle [10]) between the output and the feature, conditioned on previously chosen features. Other measures for feature selection have been proposed, such as the Gini impurity (used in CART [6]) and variance reduction [6], and many of them can be incorporated as special cases of what we study in this paper. We emphasize that in some applications, mutual information arises naturally as the only answer; for example, the well known Chow–Liu algorithm [11] for learning tree graphical models relies on estimation of mutual information, which is a natural consequence of maximum likelihood estimation. Recently, it was shown [12] that mutual information is the unique measure of relevance for inference in the presence of side information that satisfies a natural data processing property.

We also mention genetics [13], image processing [14], computer vision [15], secrecy [16], ecology [17], and physics [18] as fields in which information theoretic measures are widely used. There are other functionals that can be loosely categorized as information theoretic measures, such as association measures quantifying certain dependency relations between random variables [19], and divergence measures [20].

In most applications, the underlying distribution is unknown, so we cannot compute these information theoretic measures exactly. Hence, in nearly every problem that uses information theoretic measures, we need to estimate these quantities from data, which is what we study in this paper.

Our contributions are threefold. (i) We show that when the number of observations n is comparable to the parameter dimension (a relevant regime in the "big data" era), the prevailing approaches (such as plugging in the maximum likelihood estimator) can be highly sub-optimal. (ii) We propose new and computationally efficient algorithms that are essentially optimal in terms of the worst case squared error risk. That is, we both characterize the fundamental limits on estimation performance, and propose practical algorithms that essentially achieve them. Our results establish that for such functional estimation scenarios, replacing the plug-in (maximum likelihood) estimator by our practical and essentially minimax optimal estimators yields an effective enlargement of the sample size from n to n ln n, which can make a significant difference in practice. (iii) We demonstrate the efficacy of our schemes by a comparison with existing procedures in the literature, as well as illustrate performance boosts over traditional schemes on both real and simulated data.

Notation: We use the notation a_γ ≲ b_γ to denote that there exists a universal constant C such that sup_γ a_γ/b_γ ≤ C. The notation a_γ ≍ b_γ is equivalent to a_γ ≲ b_γ and b_γ ≲ a_γ. The notation a_γ ≫ b_γ means that lim inf_γ a_γ/b_γ = ∞, and a_γ ≪ b_γ is equivalent to b_γ ≫ a_γ. The sequences a_γ, b_γ are non-negative.

A. Our estimators

Our main goal in this work is to present a general approach to the construction of minimax rate-optimal estimators for functionals of the form (1) under L2 loss. To illustrate our approach, we describe and analyze explicit constructions for the specific cases of the entropy H(P) and F_α(P), from which the construction for any other functional of the form (1) will be clear. Our estimators for each of these two functionals are agnostic with respect to the support size S, and achieve the minimax L2 rates (i.e., the performance of our approaches when we do not know the support size S does not degrade compared with the case where the support size S is known).

Our approach is to tackle the estimation problem separately for the cases of "small p" and "large p" in H(P) and F_α(P) estimation, corresponding to treating the regions where the functional is nonsmooth and smooth in different ways. As we describe in detail in the sections to follow, where we give a full account of our estimators, in the nonsmooth region we rely on the best polynomial approximation of the function f, by employing an unbiased estimator for this approximation. The best polynomial approximation of a function f(x) on a domain A with order no more than K is defined as

  P_K^*(x) \triangleq \arg\min_{P \in \mathrm{poly}_K} \max_{x \in A} |f(x) - P(x)|,    (5)

where poly_K is the collection of polynomials of order at most K on A. The part pertaining to the smooth region is estimated by a bias-corrected maximum likelihood estimator. We apply this procedure coordinate-wise based on the empirical distribution of each observed symbol, and finally sum the respective estimates.

We now look at the specific cases of entropy and F_α(P) separately. For the entropy, after we obtain the empirical distribution P_n, for each coordinate P_n(i): if P_n(i) ≲ ln n/n, we (i) compute the best polynomial approximation of -p_i ln p_i in the regime 0 ≤ p_i ≲ ln n/n, (ii) use the unbiased estimators of the integer powers p_i^k to estimate the corresponding terms in the polynomial approximation of -p_i ln p_i up to order K_n ∼ ln n, and (iii) use that polynomial as an estimate of -p_i ln p_i. If P_n(i) ≳ ln n/n, we use the estimator -P_n(i) ln P_n(i) + 1/(2n) to estimate -p_i ln p_i. Then, we add the estimators corresponding to each coordinate. Our estimator for F_α(P) is very similar to that for the entropy, with the only difference that we conduct polynomial approximation of x^α with order K_n ∼ ln n, and use the estimator P_n^α(i) ( 1 + α(1-α)/(2n P_n(i)) ) when P_n(i) ≳ ln n/n.

We remark that our estimator is both conceptually and algorithmically simple, with complexity linear in the number of samples n. Indeed, the only non-trivial computation required is the best polynomial approximation of the relevant functions, which is data independent and can be done offline before obtaining any samples from the experiment. Moreover, the coefficients of the best polynomial approximations of different orders can be precomputed and stored in advance in the implementation of our approach. We demonstrate in Section V that the best polynomial approximation step can be performed efficiently using modern machinery from approximation theory and numerical analysis.
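To make the construction concrete, the following is a minimal sketch (in Python, not the authors' released Matlab package) of the entropy estimator just described. The constants C1 and C2, the single cutoff between regimes, and the use of a Chebyshev-node interpolant as a stand-in for the true best (minimax) polynomial approximation are illustrative simplifications, not the tuned choices analyzed in the paper.

```python
import numpy as np

def entropy_estimate(counts, C1=4.0, C2=1.0):
    """Sketch of the two-regime entropy estimator (in nats) from a vector of symbol counts."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    delta = C1 * np.log(n) / n            # right endpoint of the "nonsmooth" regime, ~ ln n / n
    K = max(1, int(C2 * np.log(n)))       # polynomial order K_n ~ ln n

    # Stand-in for the best polynomial approximation of f(p) = -p ln p on [0, delta]:
    # interpolate f at Chebyshev nodes in the rescaled variable t = p / delta.
    t = 0.5 * (1 + np.cos(np.pi * (2 * np.arange(K + 1) + 1) / (2 * (K + 1))))
    f = lambda p: -p * np.log(p)
    b = np.polynomial.polynomial.polyfit(t, f(delta * t), K)   # coefficients in t
    a = b / delta ** np.arange(K + 1)                          # coefficients of sum_k a_k p^k

    est = 0.0
    for x in counts:
        if x == 0:
            continue                      # unobserved symbols are ignored in this sketch
        if x / n < delta:
            # "Nonsmooth" regime: unbiased estimate of the approximating polynomial,
            # using falling-factorial estimators of p^k (cf. (21) later in the paper).
            for k in range(K + 1):
                fall = np.prod((x - np.arange(k)) / (n - np.arange(k)))  # unbiased for p^k
                est += a[k] * fall
        else:
            # "Smooth" regime: bias-corrected plug-in, -p_hat ln p_hat + 1/(2n).
            p_hat = x / n
            est += -p_hat * np.log(p_hat) + 1.0 / (2 * n)
    return est
```

For example, entropy_estimate(np.random.multinomial(n, p)) returns an entropy estimate in nats; the actual estimator differs in the exact thresholds, in how the constant term of the polynomial is handled, and in using the genuinely best (Remez/Chebfun) approximation rather than an interpolant.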
B. Main results

Simple as our estimators are to describe and implement, they can be shown to be near "optimal" in the strong sense we now describe. We adopt the conventional statistical decision theoretic framework [21]. Regarding the task of estimating the functional F(P), the L2 risk of an arbitrary estimator F̂ is defined as

  E_P ( F(P) - \hat{F} )^2,    (6)

where the expectation is taken with respect to the distribution P that generates the observations used by F̂. Apparently, the L2 risk is a function of both the unknown distribution P and the estimator F̂, and our goal is to minimize this risk. Since P is unknown, we cannot directly minimize it, but if we want to do well no matter what the true distribution P is, we may adopt the minimax criterion [21], [7], and try to minimize the maximum risk

  \sup_{P \in M_S} E_P ( F(P) - \hat{F} )^2,    (7)

where M_S denotes the set of all discrete distributions with support size S. The estimator that minimizes the maximum risk above is called the minimax estimator, and the corresponding risk is called the minimax risk. The exact computation of the minimax risk and the minimax estimator for general F(P) seems intractable. Although the maximum risk in (7) is a convex function of F̂ (a supremum of convex functions is convex), minimizing this function requires evaluating the objective via \sup_{P \in M_S}, which is a non-convex optimization problem. Moreover, even if we could compute it exactly, the minimax estimator would surely depend on the support size S, which is unknown to the statistician in many applications.

Hence, we slightly relax the requirement, and seek minimax rate-optimal estimators F̂*, whose maximum (worst-case) risk equals the minimax risk up to a multiplicative constant. In other words, we want to design an estimator F̂* such that there exist two universal positive constants 0 < C_1 ≤ C_2 < ∞ that do not depend on the problem configuration (such as the support size S and the sample size n), for which

  C_1 \cdot \inf_{\hat{F}} \sup_{P \in M_S} E_P ( F(P) - \hat{F} )^2 \le \sup_{P \in M_S} E_P ( F(P) - \hat{F}^* )^2 \le C_2 \cdot \inf_{\hat{F}} \sup_{P \in M_S} E_P ( F(P) - \hat{F} )^2.    (8)

As it turns out, it is possible to construct estimators F̂* for a wide class of functionals which do not rely on knowledge of the support size S. A brief description of the constructions is given in Section I-A. We find it intriguing that our estimators, which are minimax rate-optimal, are intimately connected to the problem of best (minimax) polynomial approximation, which is a convex optimization problem. In some sense, we have transformed the difficult-to-solve minimax and convex problem of minimizing the maximum risk in (7) into another, efficiently solvable, minimax and convex problem of minimizing the maximum deviation of a polynomial from a given function, which turns out to have been studied extensively in approximation theory for more than a century.

To ease the presentation, we consider the "Poissonized" observation model [22, Pg. 508], since we can show that the minimax risks under the Multinomial model and the Poisson model are essentially the same (cf. Lemma 16). Moreover, adopting the Poisson model significantly reduces the length of the proofs, and we emphasize that similar analysis also goes through for Multinomial settings, with more nuanced arguments. In the Poisson setting, we first draw a Poisson random number N ∼ Poi(n), and then conduct the sampling N times. Consequently, the observed numbers of occurrences of the different symbols are independent [23, Thm. 5.6].

We have the following characterization of the minimax risk for entropy estimation.

Theorem 1. Suppose n ≳ S/ln S. Then the minimax risk of estimating the entropy H(P) satisfies

  \inf_{\hat{H}} \sup_{P \in M_S} E_P ( \hat{H} - H(P) )^2 \asymp \frac{S^2}{(n \ln n)^2} + \frac{(\ln S)^2}{n}.    (9)

Our estimator achieves this bound without knowledge of the support size S under the Poisson model.

The following is an immediate consequence of Theorem 1.

Corollary 1. For our entropy estimator, the maximum L2 risk vanishes provided n ≫ S/ln S. Moreover, if n ≲ S/ln S, then the maximum risk of any estimator for entropy is bounded away from zero.

It was first shown in [24] that one must have n ≫ S/ln S to consistently estimate the entropy. However, the entropy estimators based on linear programming proposed in Valiant and Valiant [24], [25] have not been shown to achieve the minimax risk. Another estimator proposed by Valiant and Valiant [26] has only been shown to achieve the minimax risk in the restrictive regime S/ln S ≲ n ≲ S^{1.03}/ln S. Wu and Yang [27] independently applied the idea of best polynomial approximation to entropy estimation, and obtained its minimax L2 rates. The minimax lower bound part of Theorem 1 follows from Wu and Yang [27]. We also remark that, unlike the estimator we propose, the estimator in Wu and Yang [27] relies on knowledge of the support size S, which generally may not be known.

For the functional F_α(P), 0 < α < 1, we have the following.

Theorem 2. Suppose n ≳ S^{1/α}/ln S when we estimate F_α(P), 0 < α < 1. Then we have the following characterizations of the minimax risk.
1) 0 < α ≤ 1/2. If we also have ln n ≲ ln S, then

  \inf_{\hat{F}_\alpha} \sup_{P \in M_S} E_P ( \hat{F}_\alpha - F_\alpha(P) )^2 \asymp \frac{S^2}{(n \ln n)^{2\alpha}}.    (10)

2) 1/2 < α < 1.

  \inf_{\hat{F}_\alpha} \sup_{P \in M_S} E_P ( \hat{F}_\alpha - F_\alpha(P) )^2 \asymp \frac{S^2}{(n \ln n)^{2\alpha}} + \frac{S^{2-2\alpha}}{n}.    (11)

Our estimators F̂_α achieve these bounds without knowledge of the support size S under the Poisson model.
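The Poissonized model referenced in Theorems 1 and 2 is easy to simulate; below is a small sketch (an illustrative helper, not from the paper's code package) exploiting the equivalence between drawing N ∼ Poi(n) samples and drawing independent Poisson counts per symbol.

```python
import numpy as np

def poissonized_counts(p, n, rng=None):
    """Counts under the Poissonized model: draw N ~ Poi(n), then N i.i.d. samples from p.
    Equivalently, the count of symbol i is Poi(n * p[i]), independently across symbols."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.poisson(n * np.asarray(p, dtype=float))

# Unlike multinomial counts, these counts do not sum to exactly n,
# which is what decouples the per-symbol analyses in the proofs.
counts = poissonized_counts(np.full(1000, 1e-3), n=500)
```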

One immediate corollary of Theorem 2 is the following.

Corollary 2. For our estimators of F_α, 0 < α < 1, the maximum L2 risk vanishes provided n ≫ S^{1/α}/ln S. Moreover, if n ≲ S^{1/α}/ln S, then the maximum risk of any estimator for F_α is bounded away from zero.

The minimax lower bound we present in Theorem 2 significantly improves on Paninski's lower bound in [28], which states that if n ≲ S^{1/α - 1}, then the maximum L2 risk of any estimator for F_α(P), 0 < α < 1, is bounded away from zero. (In a previous version of this manuscript there was a \sqrt{\ln S} gap between our minimax lower bound and the achievability in Theorem 2; partially inspired by Wu and Yang [27], we modified the proof using an argument similar to that in the lower bound proof of [27], thereby closing the gap.)

The next two theorems correspond to the estimation of F_α(P), α > 1.

Theorem 3. Suppose 1 < α < 3/2. Under the Poissonized model, our estimator F̂_α satisfies

  \limsup_{n \to \infty} (n \ln n)^{2(\alpha - 1)} \cdot \sup_{S} \sup_{P \in M_S} E_P ( \hat{F}_\alpha - F_\alpha(P) )^2 < \infty.    (12)

In other words, our estimator F̂_α, 1 < α < 3/2, achieves an L2 convergence rate of (n ln n)^{-2(α-1)} regardless of the support size. This also turns out to be the minimax rate, as shown by the following result.

Theorem 4. Suppose 1 < α < 3/2. There exists a universal constant c_0 > 0 such that if S = c_0 n ln n, then

  \liminf_{n \to \infty} (n \ln n)^{2(\alpha - 1)} \cdot \inf_{\hat{F}} \sup_{P \in M_S} E_P ( \hat{F} - F_\alpha(P) )^2 > 0,    (13)

where the infimum is taken over all possible estimators F̂.

Table I summarizes the minimax L2 rates and the L2 convergence rates of the MLE in estimating F_α(P), α > 0, and H(P). When the L2 rates have two terms, the first and second terms represent respectively the contributions of the bias and the variance. When there is a single term, only the dominant term is retained. Conditions for these results are presented in parentheses.

From a sample complexity perspective (i.e., how the number of samples n should scale with the support size S to achieve consistent estimation), Table I implies the results in Table II.

                           MLE               Minimax rate-optimal
  H(P)                     n ≍ S             n ≍ S / ln S
  F_α(P), 0 < α < 1        n ≍ S^{1/α}       n ≍ S^{1/α} / ln S
  F_α(P), α > 1            n ≍ 1             n ≍ 1

TABLE II: The number of samples needed to achieve consistent estimation

Our work (including the companion paper [29]) is the first to obtain the minimax rates, minimax rate-optimal estimators, and the maximum risk of the MLE for estimating F_α(P), 0 < α < 3/2, and the entropy H(P) in the most comprehensive regime of (S, n) pairs. Evident from Table I is the fact that the MLE cannot achieve the minimax rates for estimating H(P) and F_α(P) when 0 < α < 3/2. In these cases, our estimators with n samples perform essentially as the MLE does with n ln n samples, and this is the best possible. In other words, the minimax rate-optimal schemes enlarge the "effective sample size" from n to n ln n. Furthermore, all the improvements we obtain are in the bias, which is the dominating factor in the risk. This observation suggests a simple way to obtain the minimax L2 rates from the L2 rates of the MLE: find the bias term in the expression for the MLE L2 rate, and replace n by n ln n. This simple rule is intimately connected to the rationale behind the construction and analysis of our estimators, on which we elaborate in Section II.

We also note that Table II is a "lossy compression" of Table I. Indeed, it does not reflect the important improvement of our estimator over the MLE in estimating F_α(P), 1 < α < 3/2. However, it is more transparent about the increase in the difficulty of estimation as α decreases. Indeed, as α → 0+, the sample complexity for estimating F_α(P), namely S^{1/α}/ln S, becomes super-polynomial in S, which implies that the problem has become extremely challenging. The limiting case α = 0 corresponds to estimating the support size of a discrete distribution, which has long been known to be impossible to do consistently without additional assumptions [30], [31].

C. Discussion of main results

Within the scope of estimating the entropy of discrete distributions from i.i.d. samples, the reader should be aware of different problem formulations, so as not to be confused by seemingly contradictory results. The problem we consider in this paper is to estimate the entropy H(P) for all possible distributions P supported on S elements, a setting for which we obtain Theorem 1. However, one may impose additional structure on the distribution P, and thus restrict attention to smaller uncertainty sets of distributions. One may expect different answers depending on the size and nature of the uncertainty sets. One popular alternative setting is to assume that P is a uniform distribution with unknown support size S. Note that this is a great simplification of the problem; indeed, there is only one parameter, S, to estimate. Correspondingly, we only need n ≍ √S samples to consistently estimate the support size, and the entropy, under this setting [32], which is much smaller than the n ≍ S/ln S samples required in our setting, cf. Theorem 1.

Some readers may be concerned that the minimax decision theoretic framework we adopt is too pessimistic. In some sense, it characterizes the worst-case performance over all possible distributions P ∈ M_S, and it would be disappointing if our estimator failed to behave reasonably for distributions lying in a strict subset of M_S not including the worst case distribution. Regarding this question, Brown [33], [34] argued that the minimax idea has been an essential foundation for advances in many areas of statistical research, including general asymptotic theory and methodology, hierarchical models, robust estimation, optimal design, and nonparametric function analysis. Second, the statistics community in general uses the adaptive estimation framework to alleviate the pessimism of minimaxity [35]. Specifically, one specifies a nested sequence of subsets of M_S, and tries to construct an estimator that simultaneously achieves the minimax rates over each of the subsets without knowing the subset to which the active parameter P actually belongs. It was shown recently in a related paper [36] that along the nested subsets M_S(H) = {P : H(P) ≤ H, P ∈ M_S}, our estimator (without knowing H nor S) simultaneously achieves the minimax rates over P ∈ M_S(H) for all H ≤ ln S. Most surprisingly, the maximum risk of our estimator over M_S(H) for every S and H with n samples is still essentially that of the MLE with n ln n samples, further reinforcing the effectiveness of our estimator.

                            Minimax L2 rates                                                            L2 rates of MLE
  H(P)                      S^2/(n ln n)^2 + (ln S)^2/n   (n ≳ S/ln S)   (Thm. 1, [27])                 S^2/n^2 + (ln S)^2/n   (n ≳ S)   [29]
  F_α(P), 0 < α ≤ 1/2       S^2/(n ln n)^{2α}   (n ≳ S^{1/α}/ln S, ln n ≲ ln S)   (Thm. 2)              S^2/n^{2α}   (n ≳ S^{1/α})   [29]
  F_α(P), 1/2 < α < 1       S^2/(n ln n)^{2α} + S^{2-2α}/n   (n ≳ S^{1/α}/ln S)   (Thm. 2)              S^2/n^{2α} + S^{2-2α}/n   (n ≳ S^{1/α})   [29]
  F_α(P), 1 < α < 3/2       (n ln n)^{-2(α-1)}   (S ≳ n ln n)   (Thm. 3, 4)                             n^{-2(α-1)}   (S ≳ n)   [29]
  F_α(P), α ≥ 3/2           n^{-1}   [29]                                                               n^{-1}

TABLE I: Summary of results in this paper and the companion [29]
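As a rough numerical reading of the entropy row of Table I (a sketch; the universal constants hidden by ≍ are ignored here):

```python
import numpy as np

# Constant-free versions of the two entropy-rate expressions in Table I.
minimax_rate = lambda S, n: S**2 / (n * np.log(n))**2 + np.log(S)**2 / n
mle_rate     = lambda S, n: S**2 / n**2 + np.log(S)**2 / n

S = 10**6
for n in (int(S / np.log(S)), S, 10 * S):
    print(f"n = {n:>8d}: minimax ~ {minimax_rate(S, n):.3g}, MLE ~ {mle_rate(S, n):.3g}")
# Around n ~ S / ln S the minimax (squared-bias) term is of constant order, i.e. the
# consistency threshold, while the MLE expression only drops to constant order once
# n reaches the order of S.
```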

It is instructive to consider our results in the context of the intriguing connections and differences between three important problems in information theory: entropy estimation, estimating a discrete distribution under relative entropy loss, and minimax redundancy in compressing i.i.d. sources. Table III summarizes the known results.

Table III conveys several important messages. First, in the asymptotic regime, there is a logarithmic factor between the redundancy of the compression problem and the risk of the distribution estimation problem. Indeed, since compression requires the use of a coding distribution Q that does not depend on the data, the redundancy of compression will definitely be larger than the risk under relative entropy in estimating the distribution. However, in the large alphabet setting, the problems are equally difficult: the phase transition of vanishing risk for both compression and distribution estimation happens when n is linear in the support size S.

Second, the large alphabet setting shows that estimation of entropy is considerably easier than both estimating the corresponding distribution and compression. This is somewhat surprising and enlightening, since there has been a well-received tradition of applying data compression techniques to estimate entropy, even beyond the information theory community, e.g. [43], [44], whereas one of the implications of Table III is that the approach of entropy estimation via compression can be highly sub-optimal.

If we plot the phase transitions of ln n/ln S for estimating F_α(P) using F_α(P_n) with respect to α, we obtain Figure 1.

[Fig. 1 is a phase-transition diagram, not reproduced here: the vertical axis is ln n/ln S and the horizontal axis is α (ticks at 1 and 2); the thick curve follows 1/α for 0 < α < 1; the region above the curve is labeled "consistent estimation achievable via both MLE and our scheme (resp. [29], Theorems 2, 3)", and the region below it, for 0 < α < 1, is labeled "consistent estimation not achievable (Theorem 2)".]

Fig. 1: For any fixed point above the thick curve, consistent estimation of F_α(P) is achieved using the MLE F_α(P_n) [29]. Our estimator slightly improves over the MLE to achieve the optimal n ≍ S^{1/α}/ln S sample complexity when 0 < α < 1. For the regime 0 < α < 1 below the thick curve, Theorem 2 shows that no estimator can have vanishing maximum L2 risk.

We observe a sharp phase transition at α = 1, as the sample size requirement shifts from n ≍ S^{1/α}/ln S to n ≍ 1, depending on whether α is in the left or right neighborhood of 1, respectively. Hence, α = 1 is a critical point, in that consistent estimation requires a number of measurements that is super-linear or constant in the size of the alphabet according to whether α < 1 or α > 1.

Combining Table III and Figure 1 leads to the interesting observation that, in high dimensional asymptotics, estimating a functional of a distribution could be easier (e.g., H(P) and F_α(P), α > 1) or harder (e.g., F_α(P), 0 < α < 1) than estimating the distribution itself. This observation taps into another interesting interpretation of the functional F_α(P). In information theory, the random variable ı(X) = ln(1/P(X)) is known as the information density, and plays important roles in characterizing higher order fundamental limits of coding problems [45], [46].

             entropy estimation                estimation of distribution                                           compression with blocklength n
  S fixed    MSE ∼ Var(-ln P(X))/n [22]        \inf_{\hat{P}} \sup_P E D(P_X ∥ \hat{P}_X) ∼ (S-1)/(2n) [37], [38]   \inf_Q \sup_P (1/n) D(P_{X^n} ∥ Q_{X^n}) ∼ ((S-1)/(2n)) ln n [39]
  large S    n ≍ S/ln S [24]                   n ≍ S [40]                                                           n ≍ S [41], [42]

TABLE III: Comparison of difficulties in entropy estimation, estimation of distribution, and data compression under classical asymptotics and high dimensional asymptotics

The functional F_α(P) can be interpreted as the moment generating function of the random variable ı(X), as

  F_\alpha(P) = E_P \left[ e^{(1-\alpha) \imath(X)} \right].    (14)

It is shown in Valiant and Valiant [24] that the distribution of ı(X) can be estimated using n ≍ S/ln S samples. Since moment generating functions can determine the distribution under some conditions, it is indeed plausible that the problem of estimating F_α(P), or the moment generating function of ı(X), is either easier or harder than estimating the distribution of ı(X) itself for various values of α.

We now briefly shift our focus towards estimation of the Rényi entropy H_α(P), which is closely related to the functional F_α(P) via H_α(P) = ln F_α(P)/(1-α). Acharya et al. [47] considered the estimation of H_α(P), and demonstrated that the sample complexity for estimating H_α(P) may exhibit a different behavior than that of estimating F_α(P) for certain values of α. It was also shown in [47] that for α > 1, α ∈ Z^+, the sample complexity is n ≍ S^{1-1/α}, which can be achieved by a bias-corrected MLE. Second, for non-integer α > 1, [47] showed that the sample complexity for estimating H_α(P) is between S^{1-η}, ∀η > 0, and S, and that it suffices to take n ≫ S samples for the MLE to be consistent. By a partial application of results from the present paper, [47] also showed that for 0 < α < 1, the sample complexity for estimating H_α(P) is between S^{1/α-η}, ∀η > 0, and S^{1/α}/ln S. However, certain questions remain unanswered. For example, it was not clear, for α > 1, α ∉ Z^+, whether the MLE indeed requires n ≍ S samples, and whether there exist estimators that can consistently estimate H_α(P) with n ≪ S samples. We provide partial answers to these questions below by focusing on the case 1 < α < 3/2. First, we show in Theorem 5 that simply plugging the novel estimator F̂_α from Theorem 3 into the definition of H_α(P) results in an estimator that needs at most S/ln S samples when 1 < α < 3/2.

Theorem 5. For any α ∈ (1, 3/2) and any δ > 0, ε ∈ (0, 1), there exists a constant c = c_α(δ, ε) > 0 such that

  \limsup_{S \to \infty} \sup_{P \in M_S,\, n \ge cS/\ln S} P\left( \left| \frac{\ln \hat{F}_\alpha}{1-\alpha} - H_\alpha(P) \right| \ge \delta \right) \le \epsilon,    (15)

where F̂_α is the estimator from Theorem 3.

In words, with high probability (ln F̂_α)/(1-α) is close to the Rényi entropy provided n ≳ S/ln S. In contrast, the MLE requires n ≳ S samples for estimating H_α(P), 1 < α < 3/2, as implied by the following theorem.

Theorem 6. For any α ∈ (1, 3/2) and any constant c > 0, there exists some δ = δ_α(c) > 0 such that the MLE H_α(P_n) satisfies

  \liminf_{n \to \infty} \inf_{S \ge n/c} \sup_{P \in M_S} P( |H_\alpha(P_n) - H_\alpha(P)| \ge \delta ) = 1,    (16)

where P_n is the MLE of P.

To conclude this discussion, we conjecture that plugging our minimax rate-optimal estimators for F_α(P) into the definition of H_α(P) results in minimax rate-optimal estimators for H_α(P) for all α > 0.

II. MOTIVATION, METHODOLOGY, AND RELATED WORK

A. Motivation

Existing theory proves inadequate for addressing the problem of estimating functionals of probability distributions. A natural estimator for functionals of the form (1) is the maximum likelihood estimator (MLE), or plug-in estimator, which simply evaluates F(P_n), where P_n is the empirical distribution of the data. How well does the MLE perform? Interestingly, if f ∈ C^1(0, 1] and we focus on n i.i.d. observations from a distribution with support size S, then estimating F(P) becomes a classical problem when S is fixed and the number of observations n → ∞. In that regime the maximum likelihood estimator is asymptotically efficient [48, Thm. 8.11, Lemma 8.14] in the sense of the Hájek convolution theorem [49] and the Hájek–Le Cam local asymptotic minimax theorem [50]. It is therefore not surprising to encounter the following quote from the introduction of Wyner and Foster [51], who considered entropy estimation:

"The plug-in estimate is universal and optimal not only for finite alphabet i.i.d. sources but also for finite alphabet, finite memory sources. On the other hand, practically as well as theoretically, these problems are of little interest."

In light of this, is it fair to say that the entropy estimation problem is solved in the finite alphabet setting? It was observed in Paninski [52] that the maximum of Var(-ln P(X)) over distributions with support size S is of order (ln S)^2 (a tight bound is also given by Lemma 15 in the appendix). Since classical asymptotics (with the Delta method [48, Chap. 3]) show that

  E_P ( H(P_n) - H(P) )^2 \sim \frac{\mathrm{Var}(-\ln P(X))}{n}, \quad n \gg 1,    (17)

a naive interpretation of (17) might be that it suffices to take n ≫ (ln S)^2 samples to guarantee the consistency of H(P_n). Such an interpretation could, however, be very misleading. It was already observed in Paninski [52] that if n ≲ S^{1-δ}, δ > 0, then the maximum L2 risk of any entropy estimator grows without bound as S grows.
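A quick simulation (a sketch, not one of the paper's experiments) makes the point concrete: for the uniform distribution with n equal to S, the plug-in estimate is already heavily biased even though its variance across repetitions is tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
S = n = 10_000
true_H = np.log(S)                                   # entropy of Unif(S), in nats

plugin = []
for _ in range(50):
    counts = rng.multinomial(n, np.full(S, 1.0 / S))
    p_hat = counts[counts > 0] / n
    plugin.append(-(p_hat * np.log(p_hat)).sum())

print(f"true H = {true_H:.3f}, plug-in mean = {np.mean(plugin):.3f}, std = {np.std(plugin):.3f}")
# The mean falls well below ln S (a bias of roughly half a nat in this configuration),
# while the spread across repetitions is on the order of 0.01: the error is essentially all bias.
```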

This apparent discrepancy shows that (17) is not valid when S might be growing with n, and it is of utmost importance to obtain risk bounds for estimators of entropy and other functionals of distributions in the latter regime. Indeed, in the modern era of high dimensional statistics, we often encounter situations where the support size is comparable to, or much larger than, the number of observations. For example, half of the words in the collected works of Shakespeare appeared only once [30].

It was shown in the companion paper [29] that for n ≳ S, the maximum risk of the MLE H(P_n) can be written as

  \sup_{P \in M_S} E_P ( H(P_n) - H(P) )^2 \asymp \frac{S^2}{n^2} + \frac{(\ln S)^2}{n},    (18)

where the first term corresponds to the squared bias (defined as (E_P H(P_n) - H(P))^2), and the second term corresponds to the variance (defined as E_P (H(P_n) - E_P H(P_n))^2). Then we can understand this mystery: when we fix S and let n → ∞, the variance dominates and we get the expression in (17). However, when n and S may grow together, the bias term will not vanish unless n ≫ S. Thus we conclude that in the large alphabet setting of entropy estimation, it is the bias that dominates the risk, and we have to reduce the bias to improve the estimation accuracy.

The fact that the bias dominates the risk in entropy estimation in the large alphabet setting has been known; see [52], [53]. However, a general recipe to overcome the bias has defied many attempts. We briefly review some of the approaches in the literature.

One of the earliest investigations on reducing the bias of the MLE in entropy estimation is due to Miller [53], who showed that, for any fixed distribution P supported on S elements, for each symbol i, 1 ≤ i ≤ S, we have

  E_P ( -P_n(i) \ln P_n(i) ) = -p_i \ln p_i - \frac{1 - p_i}{2n} + O\left(\frac{1}{n^2}\right).    (19)

Summing both sides shows that E H(P_n) is nearly H(P) - (S-1)/(2n), up to an error term of O(1/n^2). Hence, the so-called Miller–Madow bias-corrected entropy estimator is defined as H(P_n) + (S-1)/(2n); its maximum risk in estimating entropy was shown to be bounded away from zero if n ≲ S by [52], [29]. Thus, it still requires n ≫ S samples to consistently estimate the entropy, which is the same as the MLE.
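A minimal sketch of the Miller–Madow correction (assuming the number of observed distinct symbols is used as a stand-in for S when the support size is unknown, a common practical variant):

```python
import numpy as np

def plugin_entropy(counts):
    """Plug-in (MLE) entropy in nats from a vector of symbol counts."""
    counts = np.asarray(counts, dtype=float)
    p_hat = counts[counts > 0] / counts.sum()
    return -(p_hat * np.log(p_hat)).sum()

def miller_madow_entropy(counts):
    """Miller-Madow bias-corrected entropy: plug-in + (S - 1) / (2n).
    Here the number of observed distinct symbols stands in for S."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    S_observed = np.count_nonzero(counts)
    return plugin_entropy(counts) + (S_observed - 1) / (2 * n)
```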
Another popular approach to estimating entropy is based on Dirichlet prior smoothing. Dirichlet smoothing may carry two different meanings in terms of entropy estimation:
• [54], [55] One first obtains a Bayes estimate of the discrete distribution P, which we denote by P̂_B, and then plugs it into the entropy functional to obtain the entropy estimate H(P̂_B).
• [56], [57] One calculates the Bayes estimate of the entropy H(P) under a Dirichlet prior for squared error loss. The estimator is the conditional expectation E[H(P)|X], where X represents the samples, and the distribution P follows a certain Dirichlet prior.
It was shown in [58] that both approaches require at least n ≫ S samples to be consistent.

The jackknife [59] is another popular technique for reducing the bias. However, it was shown by Paninski [52] that the jackknifed MLE also requires n ≫ S samples to consistently estimate entropy.

Given the fact that n ≍ S/ln S is both necessary and sufficient, the approaches mentioned above are far from optimal. We note that the problem of estimating the entropy of discrete distributions from i.i.d. observations on large alphabets has been investigated in various disciplines by many authors, with many approaches difficult to analyze theoretically. Among them we mention the Miller–Madow bias-corrected estimator and its variants [53], [60], [61], the jackknifed estimator [62], the shrinkage estimator [63], the Bayes estimators under various priors [56], [64], the coverage adjusted estimator [65], the Best Upper Bound (BUB) estimator [52], the B-Splines estimator [66], and [67], [68], [69], [70], etc.

In what follows, we explain in detail our step by step approach to this problem, and arrive at minimax rate-optimal estimators for the entire range of functional estimation problems considered.

B. How did we come up with our scheme?

Existing literature implied that it is possible to come up with consistent entropy estimators that require a sublinear number n ≪ S of samples. The earliest indication to this effect appeared in Paninski [28], but only an existential proof based on the Stone–Weierstrass theorem was provided. It was therefore a breakthrough when Valiant and Valiant [24] introduced the first explicit entropy estimator requiring a sublinear number of samples. They [24] showed that n ≍ S/ln S samples are both necessary and sufficient to consistently estimate the entropy of a discrete distribution. However, the entropy estimators based on linear programming proposed in Valiant and Valiant [24], [25] have not been shown to achieve the minimax rate. Another estimator proposed by Valiant and Valiant [26] has only been shown to achieve the minimax rate in the restrictive regime S/ln S ≲ n ≲ S^{1.03}/ln S. Moreover, the scheme of [24] can only be applied to functionals that are Lipschitz continuous with respect to a Wasserstein metric, which can be roughly understood as those functionals that are equally "smooth" or "smoother" than entropy. Notably, this does not include the functional F_α, α < 1, and other interesting nonsmooth functionals of distributions. Also, it is not clear whether these techniques generally lead to minimax rate-optimal estimators. Readers are referred to Valiant's thesis [71] for more details.

Conceivably, there is a fundamental connection between the smoothness of a functional and the hardness of estimating it. The ideal solution to this problem would be systematic, and would capture this trade-off for nearly every functional. Such a comprehensive view of functional estimation has yet to be realized. George Pólya [72] commented that "the more general problem may be easier to solve than the special problem". This motivated our present work, in which we provide a general framework and procedure for minimax estimation of functionals with non-asymptotic performance guarantees. To make things transparent, let us now start from scratch and demonstrate how our solution has a natural construction.

Suppose we would like to propose a general method to construct minimax rate-optimal estimators for functionals of the form (1). What are the prerequisites that any such method must satisfy? Based on our analysis above, the following criteria appear natural:
1) Asymptotic efficiency. As modern asymptotic statistics [48] tells us, if the function f(p) in (1) is differentiable on (0, 1], then the MLE F(P_n) is asymptotically efficient. In other words, no matter how we adjust the MLE in finite sample settings, we have to ensure that when the number of samples n goes to infinity while S remains fixed, our estimator is very similar to the MLE.
2) Bias reduction. As our analysis of the MLE [29] indicates, the MLE usually has large bias and small variance in high dimensions. Hence, the general method has to reduce the bias in finite samples.

Let us attempt to understand an estimator's bias more carefully. In the simplest setting, consider a Binomial random variable X ∼ B(n, p), and suppose we wish to estimate the scalar f(p) based on X. Denote by g(X) an arbitrary estimator of f(p). The bias of g(X) can be written as

  \mathrm{Bias}_p(g(X)) = E_p g(X) - f(p) = \sum_{j=0}^{n} g(j) \binom{n}{j} p^j (1-p)^{n-j} - f(p).    (20)

Equation (20) conveys two important messages. First, the only form of f(p) that can be estimated without bias is a polynomial of order no more than n. Indeed, if g(X) is an unbiased estimator of f(p), then Bias_p(g(X)) = 0 for all p ∈ [0, 1]; thus f(p) is a polynomial in p of order no more than n, because E_p g(X) is such a polynomial. Conversely, any polynomial in p whose order is no more than n can be estimated without bias using X. Indeed, for all 0 ≤ r ≤ n,

  E_p \left[ \frac{X(X-1)\cdots(X-r+1)}{n(n-1)\cdots(n-r+1)} \right] = p^r.    (21)
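A quick numerical sanity check of (21) (a sketch, not part of the paper): the falling-factorial statistic averages to p^r under the Binomial distribution.

```python
import numpy as np

# Monte Carlo check of (21): the falling-factorial estimator is unbiased for p^r.
rng = np.random.default_rng(1)
n, p, r = 50, 0.07, 3

X = rng.binomial(n, p, size=2_000_000)
num = np.ones_like(X, dtype=float)
den = 1.0
for j in range(r):
    num *= (X - j)          # X (X-1) ... (X-r+1); zero whenever X < r
    den *= (n - j)          # n (n-1) ... (n-r+1)
estimate = (num / den).mean()

print(estimate, p**r)        # the two numbers agree up to Monte Carlo error
```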
Second, the bias Bias_p(g(X)), as a function of p, corresponds to a polynomial approximation error. In other words, it is the difference between the function f(p) and the polynomial E_p g(X). This viewpoint was first proposed by Paninski [52], who made the important connection between the analysis of bias and approximation theory. Starting from the seminal work of Chebyshev, a central problem in approximation theory [73], polynomial approximation, is targeted at designing polynomials that approximate any continuous function as well as possible. It then appears natural to choose the coefficients {g(j)}_{j=0}^{n} in a way that the resulting polynomial E_p g(X) approximates the function f(p) optimally.

One may initially be tempted to use the Taylor series to approximate f(p). However, a more careful inspection indicates that the Taylor polynomial is inappropriate for approximating general continuous functions. Even setting aside questions of convergence, Taylor polynomials are not defined for functions that are not infinitely differentiable. Even for functions that are analytic (such as e^x, x ∈ [-1, 1]), one can show that truncating the Taylor series at order n results in a maximum error of order 1/(n+1)! on [-1, 1], whereas there exists a polynomial of order n whose maximum approximation error is asymptotically 1/(2^n (n+1)!) [73]; this is the so-called best approximation polynomial. Best polynomial approximation is targeted at computing the polynomial that minimizes the maximum deviation of the polynomial from the function f(p). It is known that for any continuous function on a compact interval, there exists a unique best approximation polynomial of any given order. Adopting this rationale, we may try to solve the following problem:

  g^* = \arg\min_{g} \max_{p \in [0,1]} |\mathrm{Bias}_p(g(X))|,    (22)

where we seek the g^* minimizing the maximum value of |Bias_p(g(X))|. It gives us the best uniform control of the bias, since we do not know p a priori.

Applying advanced tools from approximation theory, Paninski [52] tried the idea mentioned above, which unfortunately did not result in improved estimators. It turns out that this idea, while improving the bias significantly, results in a blow-up of the variance term. Indeed, the squared bias of the estimator designed above can be shown to be S^2/n^4. Taking n ≍ √S, the bias term will vanish, but the variance term will diverge, because Paninski [52] already showed that if n ≲ S^{1-δ}, for any δ > 0, the maximum L2 risk of any estimator for entropy is bounded away from zero.

In fact, there is a simpler way to understand why the global scale polynomial approximation idea of the form (22) does not work. It is destined to fail because it violates the first prerequisite of any general method to improve the MLE in functional estimation: this scheme does not behave like the MLE even if n ≫ S.

This observation leads us to combine some core ideas that finally constitute our scheme. First, one needs to use approximation theory to reduce the bias. Second, one cannot do the approximation on a global scale (such as p ∈ [0, 1]), but can only approximate the function f(p) locally. Fortunately, the measure concentration phenomenon allows us to do approximation locally. For example, upon observing p̂ = X/n, X ∼ B(n, p), we have Var(p̂) = p(1-p)/n, which vanishes as n → ∞. Finally, where should we approximate? Intuitively, the bias is mainly due to the set of points where the function f(p) changes abruptly. For f(p) = -p ln p or p^α, α > 0, the most "nonsmooth" point is p = 0.

To sum up, we need to approximate locally around the "nonsmooth" points to reduce the bias. Natural as this statement may seem, there are some parameters to be carefully specified. For this subsection we only consider f(p) = -p ln p or p^α, α > 0. We detail the construction of our scheme by posing the following natural questions:
1) If we approximate the function f(p) in an interval [0, ∆_n], how should we choose ∆_n?
2) If we use a polynomial of order K_n to approximate f(p) in [0, ∆_n], how should we choose K_n? What should we do after obtaining the polynomial?
3) What should we do in the interval p ∈ [∆_n, 1]?

Let us now answer these questions in the order in which they were asked. The value ∆_n should always be chosen to be the smallest number such that we can still localize the parameter p. In other words, suppose we observe X ∼ B(n, p). Then ∆_n should be chosen to ensure that if X/n ≤ c∆_n, for a constant c > 0, then p ∈ [0, ∆_n] with high probability. Similarly, if X/n ≥ C∆_n, for a constant C > 0, then p ∈ [∆_n, 1] with high probability. It turns out that ∆_n ≍ ln n/n fulfills this goal (cf. Lemma 21).

Regarding the second question, the value K_n should always be chosen to be the largest number such that the increased variance does not exceed the bias. Indeed, if we use an approximation of order n, then we essentially go back to the idea (22), which increases the variance so much that the resulting estimator does not have vanishing risk. It turns out that for H(P) and F_α(P), K_n ≍ ln n is the correct order at which to conduct the best polynomial approximation. Suppose we have obtained the best polynomial approximation of order K_n for the function f(p) over the regime [0, ∆_n]. Noting that any polynomial of order no more than n can be estimated without bias using the estimator in (21), we use the corresponding unbiased estimator to estimate this K_n-order polynomial, thereby ensuring that the bias of this estimator when p ∈ [0, ∆_n] is exactly the polynomial approximation error in approximating f(p) over [0, ∆_n].

The third question refers to the scheme in the "smooth" regime. Interestingly, it was already observed in 1969 by Carlton [60] that Miller's bias correction formula (19) should only be applied when p_i ≫ 1/n; in other words, Miller's formula (19) is relatively accurate when p_i ≫ 1/n. In our case, since we have already chosen ∆_n ≍ ln n/n, in the smooth regime we have p_i ≳ ln n/n. We use a first order bias correction in this regime, inspired by Miller, whose rationale is the following. For a Binomial random variable X ∼ B(n, p), denote the empirical frequency by p̂ = X/n. Then it follows from Taylor's theorem that

  E f(\hat{p}) - f(p) = \frac{1}{2} f''(p) \mathrm{Var}_p(\hat{p}) + O\left(\frac{1}{n^2}\right)    (23)
                      = \frac{f''(p) p(1-p)}{2n} + O\left(\frac{1}{n^2}\right),    (24)

where f''(p) is the second derivative of f(p). We define the first order bias-corrected estimator of f(p̂) by

  f^c(\hat{p}) = f(\hat{p}) - \frac{f''(\hat{p}) \hat{p}(1-\hat{p})}{2n}.    (25)

Taking f(p) = -p ln p gives f^c(p̂) = -p̂ ln p̂ + (1-p̂)/(2n), which is exactly the Miller–Madow bias-corrected entropy estimator. Taking f(p) = p^α, we have the corresponding bias-corrected estimator

  f^c(\hat{p}) = \hat{p}^\alpha \left( 1 + \frac{\alpha(1-\alpha)(1-\hat{p})}{2n\hat{p}} \right).    (26)

Figure 2 demonstrates the estimators for H(P) and F_α(P) pictorially, where p̂_i = P_n(i) is the empirical frequency of the i-th symbol. An important observation is that our estimator naturally satisfies the first prerequisite of any improved method for functional estimation, as discussed above. Indeed, as n → ∞, all the observations will fall in the "smooth" regime, and in the smooth regime our estimators are very similar to the MLE, which naturally implies that they are also asymptotically efficient in the sense of Hájek and Le Cam [48].

[Fig. 2, not reproduced here: for each coordinate, the interval [0, 1] is split at p_i ≈ ln n/n into a "nonsmooth" and a "smooth" regime; in the nonsmooth regime f(p_i) is estimated by an unbiased estimate of its best polynomial approximation of order ≍ ln n, and in the smooth regime by the bias-corrected plug-in f(p̂_i) - f''(p̂_i) p̂_i(1-p̂_i)/(2n).]

Fig. 2: Pictorial explanation of our estimators.

The idea of approximation in the context of estimation has appeared before. Nemirovski [74] pioneered the use of approximation theory in functional estimation in the Gaussian white noise model (see Nemirovski [75] for a comprehensive treatment). Later, Lepski, Nemirovski, and Spokoiny [76] considered estimating the L_r, r ≥ 1, norm of a regression function, and utilized trigonometric approximation. Cai and Low [77] used best polynomial approximation to estimate the ℓ_1 norm of a Gaussian mean.

We conclude this subsection by comparing any minimax rate-optimal estimator with our estimator. If we consider entropy, Theorem 1 demonstrates that when n is not too large, the risk is dominated by the first term S^2/(n ln n)^2, which corresponds to the squared bias of our estimator in the "nonsmooth" regime. Further, it is shown in Wu and Yang [27] that in the worst case, the risk contributed by the "nonsmooth" regime (i.e., [0, ln n/n]) is at least of order S^2/(n ln n)^2. These observations together imply that any successful scheme should contribute squared bias of nearly S^2/(n ln n)^2. However, the bias always corresponds to a polynomial approximation error, and in the interval [0, ln n/n] it roughly corresponds to a polynomial of order ln n. The squared bias S^2/(n ln n)^2 corresponds to a polynomial whose error in approximating f(p) = -p ln p on [0, ln n/n] is nearly the same as that of the best approximation polynomial of order ln n. The theory of strong uniqueness in approximation theory [78] states that any polynomial whose approximation property is close to the best approximation must be close to the best approximation polynomial. Thus, we conclude that any successful scheme must inherently conduct near-best polynomial approximation in the "nonsmooth" regime, which is what we do in our scheme. Similar arguments also explain F_α(P) and Theorems 2, 3, and 4.
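A sketch of the first order bias corrections (25)–(26) used in the smooth regime (illustrative only; the full estimator also needs the polynomial-approximation branch for small p̂):

```python
import numpy as np

def entropy_term_corrected(p_hat, n):
    """Smooth-regime estimate of -p ln p from (25): -p_hat ln p_hat + (1 - p_hat)/(2n)."""
    return -p_hat * np.log(p_hat) + (1.0 - p_hat) / (2 * n)

def falpha_term_corrected(p_hat, n, alpha):
    """Smooth-regime estimate of p^alpha from (26):
    p_hat^alpha * (1 + alpha (1 - alpha) (1 - p_hat) / (2 n p_hat))."""
    return p_hat**alpha * (1.0 + alpha * (1.0 - alpha) * (1.0 - p_hat) / (2 * n * p_hat))
```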

C. Related work

The problem of estimating functionals of parameters is one that has been studied extensively in such fields as statistics, information theory, computer science, physics, neuroscience, psychology, and ecology, to name a few. Different communities have focused on different aspects of this general problem, and some seemingly different problems can be recast as functional estimation ones. Below we review some of the core ideas in various communities.

1) Statistics: Consider a sequence of independent and identically distributed (i.i.d.) random variables Z_1, Z_2, ..., Z_n taking values in Z, Z_i ∼ P_θ, θ ∈ Θ ⊂ R^p. We would like to obtain a good estimate of the functional ϕ(θ). In general, this problem differs from that of seeking a good estimate of the parameter θ. The most natural and ambitious aim towards this problem is to seek the "optimal" estimator given exactly n samples, which falls in the realm of finite sample theory in statistics [7]. There is no consensus on what criterion best evaluates how "good" an estimator is in a finite sample sense. Over the years, various criteria for goodness have been proposed and analyzed, including the uniform minimum variance unbiased estimator (UMVUE), the minimum risk equivariant estimator (MRE), and the minimax estimator, among others. However, it is generally difficult to obtain estimators of functionals satisfying any of the finite sample optimality criteria mentioned above. Further, even if we obtained an estimator θ̂_n that is, or is close to being, optimal for the parameter θ under a finite sample criterion, the plug-in approach ϕ(θ̂_n) need not result in an optimal estimator for ϕ(θ) under some finite sample criterion.

In light of these shortcomings, there seems to be a perception that estimation under finite sample optimality criteria is not amenable to a general mathematical theory [79]. Classical asymptotic theory is usually the refuge. The beautiful theory of Hájek and Le Cam [22], [49], [50] showed that, under mild conditions, there exist systematic methods to construct an asymptotically efficient estimator θ̂_n for the finite dimensional parameter θ, where if ϕ(θ) is differentiable at θ, then ϕ(θ̂_n) is also asymptotically efficient for estimating ϕ(θ) [48, Lemma 8.14]. Furthermore, it is also known that if the functional is non-differentiable, then it is nearly impossible to get an elegant mathematical theory [80].

The question of estimating functionals of finite dimensional parameters being satisfactorily answered under classical asymptotics, functional estimation in various nonparametric settings has been a strong area of focus since. There are several profound contributions in this area, of which we only mention a few. The most developed theory deals with linear functionals, for example, see [79], [81]–[96]. Another well studied situation deals with the case of "smooth" functionals, see [74], [97]–[108], among others. Estimation of non-smooth functionals is an extremely difficult problem, and is still largely open [76]. In particular, the problem of estimating the differential entropy \int -f \ln f, where f is a density, remains fertile ground for research, cf. [109]–[119]. Similar situations hold for estimating the entropy of a discrete distribution supported on a countably infinite alphabet, cf. [51], [120]–[123].

2) Information theory: In the information theory community, following the seminal work of Shannon [124], the focus has been on estimating entropy rates of general stationary ergodic processes with fixed (usually small) support (alphabet) sizes. Outside of the favored binary alphabet, printed English contributed the other interesting example, of support size 27 (including the "space"). Cover and King [125] gave an overview of the entropy rate estimation literature up to 1978. Soon after the appearance of the universal data compression algorithms proposed by Ziv and Lempel [126], [127], the information theory community started applying these ideas to entropy rate estimation, e.g. Wyner and Ziv [128], and Kontoyiannis et al. [129]. Verdú [130] provides an overview of universal estimation of information measures up to 2005. Jiao et al. [131] constructed a general framework for applying data compression algorithms to establish near-optimal estimators of information rates, with a focus on directed information.

3) Computer science, physics, neuroscience, psychology, ecology, etc.: Much of the effort in computer science, physics, neuroscience, psychology, ecology, and related fields has focused on some special functionals of particular interest. For example, the problem of estimating the Shannon entropy H(P) from a finite alphabet source with i.i.d. observations has been investigated extensively. Sections II-A and II-B summarize some of these efforts.

4) Modern era: high dimensions and non-asymptotics: The current era of "big data" abounds with applications in which we no longer operate in the asymptotic regime of large sample sizes. This sample scarcity regime necessitates going beyond classical asymptotic analysis and considering finitely many samples in high dimensions. Indeed, the recent successes of finite-blocklength analysis in information theory [45] and of compressed sensing in statistics [132], [133] have demonstrated the benefit of carefully analyzing practical sample sizes. There are also ample recent examples in statistics approaching classical questions from a high dimensional perspective, cf. [134]–[138]. The machine learning community has a tradition of favoring non-asymptotic analysis, and usually poses the question of the sample complexity for achieving ε accuracy with probability 1 - δ, cf. [139]–[141]. The information theoretic counterpart of high dimensional statistics might be the large alphabet setting, with exciting recent advances (cf. [41], [42], [142]–[145]).

With the above as context, our work revisits the framework of functional estimation for finite dimensional models, with a focus on high dimensional and non-asymptotic analysis.

D. General methodology for functional estimation

1) Review: general methods of estimation: We begin by reviewing the existing general approaches to estimation. Maximum likelihood is the most widely used statistical estimation technique, which emerged in modern form 90 years ago in a series of remarkable papers by Fisher [146]–[148]. As evidence of its ubiquity, the Google Scholar search query "Maximum Likelihood Estimation" yields approximately 2,570,000 articles, patents, and books.

[149] in 1980, Efron explains the popularity of maximum that it has a large bias. In particular, the statistician may prefer likelihood: the uniform minimum variance unbiased estimator (UMVUE), X− 1 θ “The appeal of maximum likelihood stems from its e 2 to estimate e . As we discussed in the presentation universal applicability, good mathematical properties, by of our main results, the bias is usually the dominating term which I refer to the standard asymptotic and exponential in estimation of functionals of high-dimensional parameters. family results, and generally good track record as a tool Notably, the two general improvements of the MLE, namely in applied statistics, a record accumulated over fifty years robust estimation and shrinkage estimation, are not designed of heavy usage. ” to handle functional estimation problems such as the one Over the years, the following folk theorem seems to have presented by Efron. Also, as Efron himself observed, the been tacitly accepted by applied scientists: statistician cannot always rely on the UMVUE to save the day, since these are generally very hard to compute, and may Theorem 7 (“Folk Theorem”). For a finite dimensional para- not always exist [155, Remark C, Sec. 7]. Thus, there is a need metric estimation problem, it is “good” to employ the MLE. to address, both in scope and methodology, the improvement From the perspective of mathematical statistics, however, over the MLE for problems where the bias is the leading term. maximum likelihood is by no means sacrosanct. As early as Such a solution could be considered the dual of the idea of in 1930, in his letters to Fisher, Hotelling raised the possibility shrinkage, since the trade-off between bias and variance is of the MLE performing poorly [150]. Subsequently, various now reversed, i.e., one might want to sacrifice the variance to examples showing that the performance of the MLE can be reduce the bias. significantly improved upon have been proposed in the litera- 2) Approximation: dual of shrinkage: Our main results in ture, cf. Le Cam [151] for an excellent overview. However, as this paper imply that Theorem 7 is far from true in high- Stigler [150, Sec. 12] discussed in his 2007 survey, while these dimensional non-asymptotic settings. Now, we aim to abstract early examples created a flurry of excitement, for the most part our scheme in estimating functionals of type (1), and distill a they were not seen as debilitating to the fundamental theory. general methodology for estimating functionals of parameters Perhaps because these examples did not provide a systematic of any finite dimensional parametric families. Consider estimating G(θ) of a parameter θ ∈ Θ ⊂ Rp for an methodology for improving the MLE. ˆ In 1956, Stein [152] observed that in the Gaussian location experiment {Pθ : θ ∈ Θ}, with a consistent estimator θn for θ, where n is the number of observations. Suppose the functional model X ∼ N (θ, Ip) (where Ip is the p × p identity matrix), G(θ) 2 θ ∈ Θ the MLE for θ, θˆMLE = X is inadmissible [7, Chap. 1] when is analytic everywhere except at 0. A natural G(θ) G(θˆ ) p ≥ 3. Later, James and Stein [153] showed that an estimator estimator for is n , and we know from classical that appropriately shrinks the MLE towards zero achieves asymptotics [48, Lemma 8.14] that if the model satisfies the uniformly lower L risk compared to the risk of the MLE. 
benign LAN (Local Asymptotic Normality) condition [48] 2 θˆ θ G(θˆ ) The shrinkage idea underlying the James–Stein estimator has and n is asymptotically efficient for , then n is also G(θ) θ∈ / Θ proven extremely fruitful for statistical methodology, and has asymptotically efficient for for 0. Note that motivated further milestone developments in statistics, such as this general framework naturally encompasses the family of Θ wavelet shrinkage [154], and compressed sensing [132], [133]. probability functionals as a special case. To see this, let be the S-dimensional probability simplex, where S denotes One interpretation of the shrinkage idea is that, when one the support size. For functionals of the form (1), if f is desires to estimate a high dimensional parameter, the MLE analytic on (0, 1], it is clear that Θ denotes the boundary may have a relatively small bias compared to the variance. 0 θˆ Shrinking the MLE introduces an additional bias, but reduces of the probability simplex. One natural candidate for n is the the overall risk by reducing the variance substantially. A empirical distribution, which is an unbiased estimator for any θ ∈ Θ natural question now arises: what about situations wherein the . bias is the dominating term? Does there exist an analogous We propose to conduct the following two-step procedure in G(θ) methodology for improving over the performance of the MLE estimating . ˆ in such scenarios? A precedent to this line of questioning can 1) Classify Regime: Compute θn, and declare that we are ˆ be found in the 1981 Wald Memorial Lecture by Efron [155] operating in the “nonsmooth” regime if θn is “close” entitled “Maximum Likelihood and Decision Theory”: enough to Θ0. Note that G(θ) is not analytic at any θ ∈ Θ . Otherwise declare we are in the “smooth” regime; “. . . the MLE can be non-optimal if the statistician 0 2) Estimate: has one specific estimation problem in mind. Arbitrarily ˆ bad counterexamples, along the line of estimating eθ • If θn falls in the “smooth” regime, use an estimator ˆ from X ∼ N (θ, 1), are easy to construct. Nevertheless “similar” to G(θn) to estimate G(θ); ˆ the MLE has a good reputation, acquired over 60 years • If θn falls in the “nonsmooth” regime, replace the of heavy use, for producing reasonable point estimates. functional G(θ) in the “nonsmooth” regime by an Useful general improvements on the MLE, such as ro- approximation Gappr(θ) (another functional) which bust estimation, and Stein estimation, are all the more can be estimated without bias, then apply an unbi- impressive for their rarity. ” ased estimator for the functional Gappr(θ).
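In code, the two-step procedure just described reduces to a small dispatcher. The sketch below is purely illustrative and makes the simplifying assumption that the functional fails to be analytic only at θ = 0, so that closeness to Θ_0 becomes a threshold on |θ̂|; the callables `nonsmooth_estimator` and `smooth_estimator` are hypothetical placeholders for the problem-specific constructions developed in the rest of the paper.

```python
def two_stage_estimate(theta_hat_1, theta_hat_2, threshold,
                       nonsmooth_estimator, smooth_estimator):
    """Sketch of the regime-splitting recipe.

    theta_hat_1, theta_hat_2: two independent consistent estimates of theta
        (e.g., obtained by sample splitting); the second copy is used only to
        classify the regime, the first only to estimate.
    threshold: half-width of the "nonsmooth" regime around the non-analytic
        point, taken here to be theta = 0 for illustration.
    nonsmooth_estimator: unbiased estimator of an approximating functional
        G_appr, e.g., built from a best polynomial approximation.
    smooth_estimator: bias-corrected plug-in estimator of G itself.
    """
    if abs(theta_hat_2) <= threshold:
        return nonsmooth_estimator(theta_hat_1)   # "nonsmooth" regime
    return smooth_estimator(theta_hat_1)          # "smooth" regime
```

Using an independent copy of the data for the classification step keeps the regime decision and the estimate decoupled, which is also how the concrete estimators of Section III are analyzed.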

For the aforementioned example, Efron [155] argued that 2 A function f is analytic at a point x0 if and only if its Taylor series about X θ the reason the MLE e may not be a good estimate for e , is x0 converges to f in some neighborhood of x0. 12
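Efron's example is easy to check numerically: when X ∼ N(θ, 1) we have E[e^X] = e^{θ+1/2}, so the MLE e^X overshoots e^θ by a factor e^{1/2} ≈ 1.65 on average, while the UMVUE e^{X−1/2} is exactly unbiased. The short Monte Carlo check below is ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                                   # the target is e**theta
x = rng.normal(theta, 1.0, size=1_000_000)    # one observation X ~ N(theta, 1) per replication

print("target e^theta            :", np.exp(theta))
print("average of MLE e^X        :", np.exp(x).mean())        # ~ e^(theta + 1/2): biased upward
print("average of UMVUE e^(X-1/2):", np.exp(x - 0.5).mean())   # ~ e^theta: unbiased
```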

3) Details of “Approximation”: While this general recipe where polyn is the collection of polynomials with order appears clean in its description, there are several problem- at most n on A, is a crucial object in approximation dependent features that one needs to design carefully – namely theory as well as our general methodology. Quantifying En[f]A and obtaining the polynomial that achieves it Question 1. How to determine “nonsmooth” regime? What turned out to be extremely challenging. Remez [160] is the size of it? in 1934 proposed an efficient algorithm for computing Question 2. What approximation should we choose to approx- the best polynomial approximation, and it was recently imate G(θ) in the “nonsmooth” regime? implemented and highly optimized in Matlab by the Chebfun team [161], [162]. Regarding the theoretical ˆ Question 3. What does “ ‘similar’ to G(θn)” mean precisely? understanding of En[f]A, de la Vallee-Poussin,´ Bern- What exactly do we do in the “smooth” regime? stein, Ibragimov, Markov, Kolmogorov and others have made significant contributions, and it is still an active The careful reader may have realized that Questions 1,2, research area. Among others, Bernstein [163], [164] and and 3 resemble the questions we asked in Section II-B. Ibragimov [165] showed various exact limiting results Answers to these questions draw on additional problems we for some important classes of functions like |x|p and investigated [156] beyond these in the present paper. |x|m ln |x|n. For example, we have 1) Question 1 Theorem 8. [164] The following limit exists for all We should always choose the “nonsmooth” regime to p > 0: p p be the smallest regime such that we can still localize lim n En[|x| ][−1,1] = µ(p), (28) ˆ n→∞ the parameter θ. In other words, when we observe θn in the “nonsmooth” regime, we should be able to infer where µ(p) is a constant bounded as with high probability that θ is also in the “nonsmooth” Γ(p) πp  1  Γ(p) πp regime. Similarly, we should also be able to localize sin 1 − ≤ µ(p) ≤ sin , π 2 p − 1 π 2 the parameter in the “smooth” regime. A concrete case would be the following. Say we observe X ∼ B(n, p), (29) and we would like to estimate a functional which is where Γ(·) denotes the Gamma function. Regarding bounds on E [f] for any finite n, Korneichuk not analytic at p0 = 0.2. How should we define n the “nonsmooth” regime? Noting that Var(X/n) = [166, Chap. 6] provides a comprehensive study. For p(1−p) a comprehensive treatment of modern approximation n , it turns out we can set the “nonsmooth” regime  q q  theory, DeVore and Lorentz [73], Ditzian and Totik p0(1−p0) ln n p0(1−p0) ln n to be p0 − n , p0 + n (cf. [167] provide excellent references. For the most up-to- Lemma 21). date review of polynomial approximation, we refer the 2) Question 2 readers to Bustamante [168]. We should always choose an approximation Gappr(θ) that We emphasize that the discussions above refer to ap- can be estimated without bias. This requirement leads us proximation in dimension one. The general multivariate to the general theory of unbiased estimation, which was case is extremely complicated. Rice [169] wrote: pioneered by Halmos [157] and Kolmogorov [158]. For a comprehensive survey the readers are referred to the “The theory of Chebyshev approximation (a.k.a. best monograph by Voinov and Nikulin [159]. approximation) for functions of one real variable has There is a delicate trade-off: the approximation Gappr(θ) been understood for some time and is quite elegant. 
For should be estimated without bias, but also should ap- about fifty years attempts have been made to general- proximate the functional G(θ) well, and at the same ize this theory to functions of several variables. These time not incur too much additional variance. These attempts have failed because of the lack of uniqueness three requirements yield a highly non-trivial interplay of best approximations to functions of more than one between approximation theory and statistics, of which variable. ” our understanding is as yet incomplete. Another related paper [156] showed that the non- For functionals in (1), the separability of each p essen- i uniqueness can cause serious trouble: some polynomial tially reduces the problem from multivariate to univari- that can achieve the best approximation error rate cannot ate, for which best polynomial approximation plays an be used in our general methodology in functional esti- important role in the optimal solution. Similar stories are mation. What if we relax the requirement of computing true for the Gaussian setting, e.g. estimating PS f(µ ) i=1 i the best approximation in multivariate case, and only where µ ∈ S is the mean of a normal vector. Mod- R want to analyze the best approximation rate (i.e., the best ern approximation theory provides mature machinery approximation error up to a multiplicative constant)? of polynomial approximation in one dimension, with That turns out also to be extremely difficult. Ditzian and various profound results developed over the last century. Totik [167, Chap. 12] obtained the error rate estimate on The best approximation error rate E [f] : n A simple polytopes3, balls, and spheres, and it remained

E_n[f]_A = \min_{P ∈ poly_n} \max_{x ∈ A} | f(x) − P(x) |,   (27)

Footnote 3: A simple polytope in R^d is a polytope such that each vertex has d edges.
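The paper computes the exact minimax polynomial offline with the Remez algorithm (the Chebfun implementation mentioned above). For a quick, self-contained illustration of E_n[f]_A, one can instead interpolate at Chebyshev nodes, whose uniform error exceeds the optimal one by at most a factor that grows only logarithmically in the degree. The sketch below is such a stand-in, not the authors' routine; it approximates x^{1/2} on [0, 1], for which the optimal error decays like 1/n (cf. Theorem 8 and the limit (124) quoted later).

```python
import numpy as np

def cheb_near_best(f, degree, a=0.0, b=1.0):
    """Near-best uniform approximation of f on [a, b]: interpolation at Chebyshev
    nodes, whose error is within a log(degree) factor of the optimal E_n[f]."""
    k = np.arange(degree + 1)
    nodes = 0.5 * (b - a) * (np.cos((2 * k + 1) * np.pi / (2 * degree + 2)) + 1.0) + a
    return np.polynomial.Chebyshev.fit(nodes, f(nodes), degree, domain=[a, b])

def f(x):
    return x ** 0.5          # the case alpha = 1/2 of F_alpha

grid = np.linspace(0.0, 1.0, 20001)
for deg in (4, 8, 16, 32):
    p = cheb_near_best(f, deg)
    err = np.max(np.abs(f(grid) - p(grid)))
    print(f"degree {deg:2d}: uniform error ~ {err:.3e}")   # decays roughly like 1/degree
```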

open until Totik [170] generalized the results to general polytopes. For results on balls and spheres, the readers are referred to Dai and Xu [171]. We still know little about regimes other than polytopes, balls, and spheres. We do not know whether in general polynomial approximation can achieve the minimax rates in general settings. Probably other approximation bases need to be chosen for certain problems.

3) Question 3: Note that we have assumed G(θ) is analytic in the “smooth” regime. For various statistical models (like Gaussian and Poisson), any analytic functional admits unbiased estimators. We propose to use Taylor series bias correction [172] in the “smooth” regime, where the order of the Taylor series may vary between problems.

E. Remaining content

The rest of the paper is organized as follows. Section III details the construction of our estimators Ĥ and F̂_α and their analysis. In Section IV we present our general approach for proving minimax lower bounds and apply it to establish Theorems 2 and 4. Section V presents a few experiments demonstrating the practical advantages of our estimators in entropy estimation, mutual information estimation, entropy rate estimation, and learning graphical models. Complete proofs of the remaining theorems and lemmas are provided in the appendices.

III. ESTIMATOR CONSTRUCTION AND ANALYSIS

Throughout our analysis, we utilize the Poisson sampling model, which is equivalent to having an S-dimensional random vector Z such that each component Z_i of Z has distribution Poi(np_i), and all coordinates of Z are independent. For simplicity of analysis, we conduct the classical “splitting” operation [173] on the Poisson random vector Z, and obtain two independent identically distributed random vectors X = [X_1, X_2, ..., X_S]^T, Y = [Y_1, Y_2, ..., Y_S]^T, such that each component X_i of X has distribution Poi(np_i/2), and all coordinates of X are independent. For each coordinate i, the splitting process generates a random variable T_i such that T_i | Z ∼ B(Z_i, 1/2), and assigns X_i = T_i, Y_i = Z_i − T_i. All the random variables {T_i : 1 ≤ i ≤ S} are conditionally independent given our observation Z.

For simplicity, we re-define n/2 as n, and denote

\hat{p}_{i,1} = X_i/n,   \hat{p}_{i,2} = Y_i/n,   ∆ = c_1 ln n / n,   K = c_2 ln n,   t = ∆/4,   (30)

where c_1, c_2 are positive constants to be specified later. For simplicity we assume K is always a non-negative integer. Note that ∆, K, t are functions of n, where we omit the subscript n for brevity. We remark that the “splitting” operation is used to simplify the analysis, and is not performed in the experiments.

We also note that for a random variable X such that nX ∼ Poi(np),

E \prod_{r=0}^{k−1} (X − r/n) = p^k,   (31)

for any k ∈ N_+. For a proof of this fact we refer the readers to Withers [172, Example 2.8].

A. Estimator construction

Our estimator F̂_α, α > 0, is constructed as follows:

F̂_α ≜ \sum_{i=1}^{S} [ L_α(\hat{p}_{i,1}) 1(\hat{p}_{i,2} ≤ 2∆) + U_α(\hat{p}_{i,1}) 1(\hat{p}_{i,2} > 2∆) ],   (32)

where

S_{K,α}(x) ≜ \sum_{k=1}^{K} g_{k,α} (4∆)^{−k+α} \prod_{r=0}^{k−1} (x − r/n)   (33)

L_α(x) ≜ \min { S_{K,α}(x), 1 }   (34)

U_α(x) ≜ I_n(x) ( 1 + α(1−α)/(2nx) ) x^α.   (35)

We explain each equation in detail as follows.

1) Equation (32): Note that \hat{p}_{i,1} and \hat{p}_{i,2} are i.i.d. random variables such that n\hat{p}_{i,1} ∼ Poi(np_i). We use \hat{p}_{i,2} to determine whether we are operating in the “nonsmooth” regime or not. If \hat{p}_{i,2} ≤ 2∆, we declare we are in the “nonsmooth” regime, and plug \hat{p}_{i,1} into the function L_α(·). If \hat{p}_{i,2} > 2∆, we declare we are in the “smooth” regime, and plug \hat{p}_{i,1} into U_α(·).

2) Equation (33): The coefficients g_{k,α}, 0 ≤ k ≤ K, are the coefficients of the best polynomial approximation of x^α over [0, 1] up to degree K, i.e.,

\sum_{k=0}^{K} g_{k,α} x^k = \mathrm{argmin}_{y(x) ∈ poly_K} \sup_{x ∈ [0,1]} | y(x) − x^α |,   (36)

where poly_K denotes the set of algebraic polynomials up to order K. Note that in general g_{k,α} depends on K, which we do not make explicit for brevity. Lemma 4 shows that for nX ∼ Poi(np),

E S_{K,α}(X) = \sum_{k=1}^{K} g_{k,α} (4∆)^{−k+α} p^k.   (37)

Thus, we can understand S_{K,α}(X), nX ∼ Poi(np), as a random variable whose expectation is nearly^4 the best approximation of the function x^α over [0, 4∆].

3) Equation (34): Any reasonable estimator for p_i^α should be upper bounded by the value one. We cut off S_{K,α}(x) by the upper bound 1, and define the function L_α(x), where “L” stands for the “lower part”.

4) Equation (35): The function U_α(x) (standing for the “upper part”) is nothing but a product of an interpolation function^5 I_n(x) and the bias-corrected MLE. The careful reader may

Footnote 4: Note that we have removed the constant term from the best polynomial approximation. It is to ensure that we assign zero to symbols we do not see.

Footnote 5: The usage of the interpolation function was partially inspired by Valiant and Valiant [26].

note that the bias-corrected MLE is not exactly the same as what we used before in (25). It is because here we are using the Poisson model instead of the Multinomial model. In the Poisson model, the bias correction formula should be modified to

f^c(\hat{p}) = f(\hat{p}) − f''(\hat{p}) \hat{p} / (2n).   (38)

The interpolation function I_n(x) is designed to make U_α(x) a smooth function on [0, 1]. Indeed, when 0 < α < 1, were it not for the interpolation function, U_α(x) would be unbounded for x close to zero. Note that L_α(x) and U_α(x) are dependent on n. We omit this dependence in notation for brevity. The interpolation function I_n(x) is defined as follows:

I_n(x) = 0 for x ≤ t;   I_n(x) = g(x − t; t) for t < x < 2t;   I_n(x) = 1 for x ≥ 2t.   (39)

The following lemma characterizes the properties of the function g(x; a) appearing in the definition of I_n(x). In particular, it shows that I_n(x) is four times continuously differentiable.

Lemma 1. For the function g(x; a) on [0, a] defined as follows,

g(x; a) ≜ 126 (x/a)^5 − 420 (x/a)^6 + 540 (x/a)^7 − 315 (x/a)^8 + 70 (x/a)^9,   (40)

we have the following properties:

g(0; a) = 0,   g^{(i)}(0; a) = 0,   1 ≤ i ≤ 4,   (41)
g(a; a) = 1,   g^{(i)}(a; a) = 0,   1 ≤ i ≤ 4.   (42)

The function g(x; 1) is depicted in Figure 3.
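Lemma 1 is straightforward to verify numerically. The sketch below (ours, for illustration only) implements g(x; a) from (40) and I_n(x) from (39) and checks the endpoint values; the vanishing of the first four derivatives at both endpoints, which is what makes I_n(x) four times continuously differentiable, can be confirmed the same way with finite differences.

```python
import numpy as np

def g(x, a):
    """g(x; a) of (40): a degree-9 polynomial rising smoothly from 0 to 1 on [0, a]."""
    u = np.asarray(x, dtype=float) / a
    return 126*u**5 - 420*u**6 + 540*u**7 - 315*u**8 + 70*u**9

def I_n(x, t):
    """Interpolation function of (39), with t = Delta / 4."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= t, 0.0, np.where(x >= 2*t, 1.0, g(x - t, t)))

a, h = 1.0, 1e-3
print("g(0; a) =", g(0.0, a), "  g(a; a) =", g(a, a))        # 0 and 1, as in (41)-(42)
print("g'(0) ~", (g(h, a) - g(0.0, a)) / h)                   # ~ 0
print("g'(a) ~", (g(a, a) - g(a - h, a)) / h)                 # ~ 0
t = 0.05
print("I_n at [t/2, 1.5t, 3t]:", I_n(np.array([t/2, 1.5*t, 3*t]), t))  # 0, between 0 and 1, 1
```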

Fig. 3: The function g(x; 1) over interval [0, 1].

Similarly, we define our estimator for entropy H(P) as

Ĥ ≜ \sum_{i=1}^{S} [ L_H(\hat{p}_{i,1}) 1(\hat{p}_{i,2} ≤ 2∆) + U_H(\hat{p}_{i,1}) 1(\hat{p}_{i,2} > 2∆) ],   (43)

where

S_{K,H}(x) ≜ \sum_{k=1}^{K} g_{k,H} (4∆)^{−k+1} \prod_{r=0}^{k−1} (x − r/n)   (44)

L_H(x) ≜ \min { S_{K,H}(x), 1 }   (45)

U_H(x) ≜ I_n(x) ( −x ln x + 1/(2n) ).   (46)

The coefficients {g_{k,H}}_{1 ≤ k ≤ K} are defined as follows. We first define

\sum_{k=0}^{K} r_{k,H} x^k = \mathrm{argmin}_{y(x) ∈ poly_K} \sup_{x ∈ [0,1]} | y(x) − (−x ln x) |   (47)

and then define

g_{k,H} = r_{k,H},  2 ≤ k ≤ K,   g_{1,H} = r_{1,H} − ln(4∆).   (48)

Lemma 20 shows that for nX ∼ Poi(np),

E S_{K,H}(X) = \sum_{k=1}^{K} g_{k,H} (4∆)^{−k+1} p^k   (49)

is a near-best polynomial approximation for −p ln p on [0, 4∆].

Figure 4 is designed to provide a pictorial explanation of our scheme in the “nonsmooth” regime for entropy estimation. The curve −p ln p is the functional we want to estimate, and the horizontal axis represents the possible values of p. We take n = 100, and use an order-3 best polynomial approximation of the function −p ln p over the regime [0, ln n / n] = [0, 0.0461]. It is evident from the curve that the expectation E[L_H(\hat{p})] is very close to the true function −p ln p, but the expectation of the MLE, E[−\hat{p} ln \hat{p}], and the function L_H(·) itself are both far from −p ln p. This demonstrates the nuanced nature of our scheme: we first construct a polynomial (equal to E[L_H(\hat{p})]) that approximates the function −p ln p very well, and then we design the function L_H(·) to make sure its expectation is this good approximation. Plugging \hat{p} into L_H(·) may seem less sensible than plugging \hat{p} into −p ln p at first glance, but Figure 4 vividly demonstrates that in fact plugging \hat{p} into L_H(·) is far more accurate in estimating −p ln p.

Fig. 4: Comparison of our scheme and the MLE (curves shown: −p ln p; E[L_H(\hat{p})] with n\hat{p} ∼ B(n, p); E[−\hat{p} ln \hat{p}] with n\hat{p} ∼ B(n, p); L_H(p)).
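The construction (43)-(48) translates almost directly into code. The sketch below is illustrative only: it substitutes Chebyshev interpolation for the Remez step (a near-best rather than best approximation), splits the samples into two halves instead of Poissonizing, and uses constants in the range reported for the experiments in Section V; it is not the authors' released implementation.

```python
import numpy as np

def neg_xlogx(x):
    """-x ln x with the convention 0 ln 0 = 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, -x * np.log(np.where(x > 0, x, 1.0)), 0.0)

def approx_coeffs(K):
    """Power-basis coefficients r_0..r_K of a near-best degree-K approximation
    of -x ln x on [0, 1] (Chebyshev interpolation standing in for Remez)."""
    k = np.arange(K + 1)
    nodes = 0.5 * (np.cos((2*k + 1) * np.pi / (2*(K + 1))) + 1.0)
    return np.polynomial.polynomial.polyfit(nodes, neg_xlogx(nodes), K)

def smooth_step(x, t):
    """I_n(x) of (39): 0 below t, 1 above 2t, degree-9 smooth blend in between."""
    u = np.clip((x - t) / t, 0.0, 1.0)
    return 126*u**5 - 420*u**6 + 540*u**7 - 315*u**8 + 70*u**9

def entropy_estimate(counts1, counts2, n, c1=0.2, c2=0.7):
    """Sketch of (43): counts1, counts2 hold per-symbol counts from two
    independent halves of the data, each of size n."""
    Delta = c1 * np.log(n) / n
    t = Delta / 4.0
    K = max(1, int(c2 * np.log(n)))
    r = approx_coeffs(K)
    gk = r.copy()
    gk[1] = r[1] - np.log(4 * Delta)          # coefficient shift of (48)

    est = 0.0
    for X, Y in zip(counts1, counts2):
        p1, p2 = X / n, Y / n
        if p2 <= 2 * Delta:                    # "nonsmooth" regime: (44)-(45)
            s, prod = 0.0, 1.0
            for k in range(1, K + 1):
                prod *= (X - (k - 1)) / n      # running prod_{r=0}^{k-1}(x - r/n)
                s += gk[k] * (4 * Delta) ** (1 - k) * prod
            est += min(s, 1.0)                 # constant term dropped: unseen symbols give 0
        else:                                  # "smooth" regime: (46)
            est += smooth_step(p1, t) * (neg_xlogx(p1) + 1.0 / (2 * n))
    return float(est)

# Toy comparison on a Zipf distribution (not one of the paper's experiments).
rng = np.random.default_rng(1)
S, n = 2000, 5000
p = 1.0 / np.arange(1, S + 1)
p /= p.sum()
samples = rng.choice(S, size=2 * n, p=p)
counts_a = np.bincount(samples[:n], minlength=S)
counts_b = np.bincount(samples[n:], minlength=S)
print("true H :", float(-(p * np.log(p)).sum()))
print("MLE    :", float(neg_xlogx(counts_a / n).sum()))
print("sketch :", entropy_estimate(counts_a, counts_b, n))
```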

c3 is the constant appearing in Lemma 19. If we also have c2 ≤ 4c1, then B. Estimator analysis (4c ln n)2+2α 2 8c2 ln 2 1 ESK,α(X) ≤ n . (57) We demonstrate our analysis techniques via the proof of n2α Theorem 2 and 3, and note that similar techniques allow us For the entropy, if p ≤ 4∆, we have to establish Theorem 1. The next two lemmas show that the estimators C |ESK,H (X) + p ln p| ≤ . (58) Uα(x),UH (x) have desirable bias and variance properties n ln n when the true probability p is not too small. 4c1ν1(2) When n is large enough, C can be taken to be 2 , which c2 Lemma 2. Suppose nX ∼ Poi(np), p ≥ ∆, c1 ln n ≥ 1. For is given in Lemma 20. If we also have c2 ≤ 4c1, then 0 < α < 3/2, we have 4 (4c1 ln n) S2 (X) ≤ n8c2 ln 2 . (59) 17 8310(1 + α) E K,H n2 | U (X) − pα| ≤ + pαn−c1/8. E α nα(c ln n)2−α α(2 − α) 1 1 (50) Lemma 5. If nX ∼ Poi(np), p ≤ n ln n , 1 < α < 3/2, then for c2 ≤ 4c1, For 0 < α ≤ 1/2,  4c α−1 24 576 α 1 2α −c1/8 |ESK,α(X) − p | ≤ D1 p (60) Var(Uα(X)) ≤ + p n c2n ln n n2α(c ln n)1−2α α 2 1 2α+2 (4c1 ln n) p 28800 2α −c /4 2 10c2 ln 2 + p n 1 . (51) ESK,α(X) ≤ n 2α−1 , (61) α2 n For 1/2 < α < 1, where D1 is a universal positive constant appearing in Lemma 17. 14p2α−1 576 2α −c1/8 Var(Uα(X)) ≤ + p n n α With the machinery established in Lemma 2, 3, 4, and 5, 28800 8 2α −c1/4 we are now ready to bound the bias and variance of each + 2 p n + 2α 2−2α . α n (c1 ln n) summand in our estimators. Define, (52) ξ = ξ(X,Y ) = Lα(X)1(Y ≤ 2∆) + Uα(X)1(Y > 2∆), For 1 < α < 3/2, (62) 202p 8 28800 where nX =D nY ∼ Poi(np), and X is independent of Y . Var(U (X)) ≤ + + p2αn−c1/4 α n n2 α2 Apparently, we have 120 + p2αn−c1/8. (53) S α X Fˆα = ξ(ˆpi,1, pˆi,2), (63) Lemma 3. If nX ∼ Poi(np), p ≥ ∆, i=1 3 2 and each of the S summands are independent. Hence, it |EUH (X) + p ln p| ≤ + 2 c1n ln n 3(c1 ln n) n suffices to analyze the bias and variance of ξ(X,Y ) thoroughly ˆ + 8024 (p ln(1/p) + 2p) n−c1/8 (54) for all values of p in order to obtain a risk bound for Fα. We break this into three different regimes. In the first case 2 when p ≤ ∆, we shall show that the estimator essentially Var(UH (X)) ≤ 2p(ln p − ln 2) /n behaves like Lα(X), which is a good estimator when p is 2 2 −c /8 + 54p 2(ln p) − 2 ln p + 3 n 1 small. In the second case when ∆ ≤ p ≤ 4∆, we show that 2  1  our estimator uses either Lα(X) or Uα(X), which are both + + 60 (p ln(1/p) + 2p) n−c1/8 n good estimators in this case. In the last case p ≥ 4∆, we show U (X)  1  that our estimator behaves essentially like α , which has + 2 p ln(1/p) + × good properties when p is not too small. 2n α  1  We denote B(ξ) , Eξ(X,Y ) − p as the bias of ξ. + 60 (p ln(1/p) + 2p) n−c1/8 . (55) n The following two lemmas characterize the performance of Lemma 6. Suppose 0 < α < 1, 0 < c1 = 16(α + δ), 0 < 8c2 ln 2 =  < α, δ > 0. Then, SK,α(X) and SK,H (X), nX ∼ Poi(np) when p is not too large. 1) when p ≤ ∆, Lemma 4. If nX ∼ Poi(np), p ≤ 4∆, α > 0, we have 1 α c3 |B(ξ)| . , (64) | SK,α(X) − p | ≤ , (56) (n ln n)α E (n ln n)α (ln n)2+2α 2µ(2α)cα 1 Var(ξ) . 2α− . (65) and for n large enough, we can take c3 = 2α , where n c2 16

2) when ∆ < p ≤ 4∆, risk bound, since when ln n . ln S, the first term dominates, 1 otherwise the third term dominates. |B(ξ)| , (66) . (n ln n)α  (ln n)2+2α The proof of Theorem 1 is essentially the same as that for  2α− 0 < α ≤ 1/2, Var(ξ) n (67) Theorem 2, with the only differences being replacing Lemma 2 . (ln n)2+2α p2α−1  n2α− + n 1/2 < α < 1. with Lemma 3, applying the entropy part of Lemma 4 and Lemma 15. The proof of Theorem 3 is slightly more involved, 3) when p > 4∆, and we need to split the analysis into four different regimes. 1 |B(ξ)| , (68) . nα(ln n)2−α  1 Lemma 7. Suppose 1 < α < 3/2. Setting c1 = 16(α+δ), 0 <  2α 1−2α 0 < α ≤ 1/2, n (ln n) 10c2 ln 2 =  < 2α − 2, δ > 0, we have the following bounds Var(ξ) . 2α−1 1 p on |B(ξ)| and Var(ξ).  n2α(ln n)1−2α + n 1/2 < α < 1. (69) Now the result of Theorem 2 follows easily from Lemma 6. We have 1 1) when p ≤ n ln n , S X p |Bias(Fˆ )| ≤ |B(ξ(ˆp , pˆ ))| (70) |B(ξ)| , (80) α i,1 i,2 . (n ln n)α−1 i=1 2α+2 S (ln n) p X 1 Var(ξ) . . (81) (71) n2α−1− . (n ln n)α i=1 2) when 1 < p ≤ ∆, S n ln n . , (72) 1 (n ln n)α |B(ξ)| , (82) . (n ln n)α and (ln n)2α+2 S Var(ξ) . (83) X . n2α− Var(Fˆα) = Var(ξ(ˆpi,1, pˆi,2)) (73) i=1 3) when p > ∆, S  (ln n)2+2α 1  2α− 0 < α ≤ 1/2 X n |B(ξ)| . α 2−α , (84) . 2+2α 2α−1 (74) n (ln n) (ln n) pi i=1  2α− + 1/2 < α < 1 1 p n n Var(ξ) + . 2+2α . (85)  S(ln n) n2 n  n2α− 0 < α ≤ 1/2 . 2+2α 2α−1 Now the result of Theorem 3 follows easily from Lemma S(ln n) PS pi  n2α− + i=1 n 1/2 < α < 1 7. First, the total bias can be bounded by (75) S  2+2α X S(ln n) |Bias(Fˆα)| ≤ |B(ξ(ˆpi,1, pˆi,2))| (86)  n2α− 0 < α ≤ 1/2 . (76) i=1 . S(ln n)2+2α S2−2α  2α− + 1/2 < α < 1 X n n = |B(ξ(ˆpi,1, pˆi,2))| i:p ≤ 1 Here we have used the fact that i n ln n X S + |B(ξ(ˆpi,1, pˆi,2))| X 2α−1 2α−1 2−2α sup p = S(1/S) = S , (77) 1 i i: n ln n ∆ Combining the bias and variance bounds, we have X p X 1 +  2 . (n ln n)α−1 (n ln n)α ˆ i:p ≤ 1 i: 1

∆  S2 S(ln n)2+2α 1 1  2α + 2α− 0 < α ≤ 1/2 ≤ + · (n ln n) (n ln n) n (n ln n)α−1 (n ln n)α . S2 S(ln n)2+2α S2−2α  (n ln n)2α + n2α− + n 1/2 < α < 1 1 n + α 2−α · (89) (79) n (ln n) c1 ln n where  > 0 is a constant that is arbitrarily small. Note that 1 . . (90) when 1/2 < α < 1, we can remove the middle term in the (n ln n)α−1 17

Second, the total variance is bounded by we have   S ˆ |T (θ1) − T (θ0)| ˆ X inf sup Pθ |T − T (θ)| ≥ Var(Fα) = Var(ξ(ˆpi,1, pˆi,2)) (91) Tˆ θ∈Θ 2 i=1 1 X ≥ exp (−D (Pθ1 kPθ0 )) . (100) = Var(ξ(ˆpi,1, pˆi,2)) 4 1 i:pi≤ n ln n The second lemma is the so-called method of two fuzzy X hypotheses presented in Tsybakov [174]. Suppose we observe + Var(ξ(ˆpi,1, pˆi,2)) 1 a random vector Z ∈ (Z, A) which has distribution Pθ where i: n ln n ∆ i is σi for i = 0, 1. Let Tˆ = Tˆ(Z) be an arbitrary estimator of X (ln n)2α+2p a function T (θ) based on Z. We have the following general . n2α−1− 1 minimax lower bound. i:pi≤ n ln n X (ln n)2α+2 Lemma 9. [174, Thm. 2.15] Given the setting above, suppose + n2α− there exist ζ ∈ R, s > 0, 0 ≤ β0, β1 < 1 such that 1 i: ∆ (ln n)2α+2 (ln n)2α+2 If V (F1,F0) ≤ η < 1, then ≤ + · (n ln n) n2α−1− n2α−  ˆ  1 − η − β0 − β1   inf sup Pθ |T − T (θ)| ≥ s ≥ , (103) 1 1 n Tˆ θ∈Θ 2 + + 2 · (94) n n c1 ln n where Fi, i = 0, 1 are the marginal distributions of Z when (ln n)2α+3 1 + (95) the priors are σi, i = 0, 1, respectively. . n2α−1− n 1 Here V (P,Q) is the total variation distance between two . (96) . (n ln n)2α−2 probability measures P,Q on the measurable space (Z, A). Concretely, we have Combining the bias and variance bounds, we have 1 Z  2  2 V (P,Q) , sup |P (A) − Q(A)| = |p − q|dν, (104) ˆ ˆ ˆ A∈A 2 sup EP Fα − Fα(P ) = Bias(Fα) + Var(Fα) (97) P dP dQ 1 3 where p = dν , q = dν , and ν is a dominating measure so , 1 < α < , P  ν, Q  ν . (n ln n)2α−2 2 that . (98) which completes the proof of Theorem 3. A. Minimax lower bound for Theorem 2 (Fα : 0 < α < 1) Note that the minimax lower bound in Theorem 2 consists of two parts when 1/2 < α < 1. Hence, for 1/2 < α < 1, it suffices to first show that IV. MINIMAXLOWERBOUNDSFORESTIMATING  2 S2−2α inf sup Fˆ − F (P ) , (105) Fα(P ), 0 < α < 3/2 EP α α & Fˆα P ∈MS n and then show There are two main lemmas that we employ towards the 2 2  ˆ  S proof of the minimax lower bounds in Theorem 2 and 4. The inf sup EP Fα − Fα(P ) & , (106) ˆ (n ln n)2α first lemma is the Le Cam two-point method. Suppose we Fα P ∈MS observe a random vector Z ∈ (Z, A) which has distribution in order to obtain the desired conclusion via the relation a+b Pθ where θ ∈ Θ. Let θ0 and θ1 be two elements of Θ. Let max{a, b} ≥ 2 . Tˆ = Tˆ(Z) be an arbitrary estimator of a function T (θ) based Regarding (105), we have the following theorem. Z on . Le Cam’s two-point method gives the following general Theorem 9. For 1 ≤ α < 1, we have minimax lower bound. 2 2 2−2α  ˆ  S Lemma 8. [174, Sec. 2.4.2] Denoting the Kullback-Leibler inf sup EP Fα − Fα(P ) & , (107) Fˆ P ∈M n divergence between P and Q by α S where the infimum is taken over all possible estimators Fˆ . (R  dP  α ln dQ dP, if P  Q, D(P kQ) , (99) Proof: Applying this lemma to our Poissonized model +∞, otherwise. npˆi ∼ Poi(npi), 1 ≤ i ≤ S, we know that for θ1 = 18

− 1 (p1, p2, ··· , pS), θ0 = (q1, q2, ··· , qS), Hence, by choosing  = n 2 , we know that  2 D (Pθ kPθ ) ˆ 1 0 inf sup EP F − Fα(P ) S Fˆ P ∈MS X 2 = D (Poi(npi)kPoi(nqi)) (108) α2  1 − α  ≥ (2(S − 1))1−α − 2−α − (2(S − 1))1−α + 2−α i=1 16en 2n S ∞   X X pi (122) = P (Poi(npi) = k) · k ln − n(pi − qi) qi i=1 k=0 under the Poissonized model. Applying Lemma 16, we know (109) that under the Multinomial model, the non-asymptotic mini- S S max lower bound is X pi X = npi ln − n (pi − qi) (110) 2  2 q α 1−α −α 1 − α 1−α −α i=1 i i=1 (2(S − 1)) − 2 − (2(S − 1)) + 2 32en 4n = nD(θ kθ ), (111) 1 0 S2−2α − e−n/4S2(1−α) . (123) then Markov’s inequality yields & n 2  ˆ  The proof is complete. inf sup EP F − Fα(P ) (112) Fˆ P ∈MS Now we start the proof of (106) in earnest. For 1/2 < α < 2−2α 2 2 S S |Fα(θ1) − Fα(θ0)| 1, (106) follows directly from (105) if n & (n ln n)2α , ≥ × 1− 1 4 or equivalently, S . n 2α ln n. Hence, we only need to   1− 1 ˆ |Fα(θ1) − Fα(θ0)| consider the case where S & n 2α ln n, which implies that inf sup P |F − Fα(P )| ≥ ˆ ln S ln n ln S ln n F P ∈MS 2 & . Since the condition & is also treated as (113) an assumption in Theorem 2 for 0 < α ≤ 1/2, we adopt it |F (θ ) − F (θ )|2 throughout the following proof. ≥ α 1 α 0 exp (−nD(θ kθ )) , (114) 16 1 0 We construct the two fuzzy hypotheses required by Lemma 9. Similar construction was applied in proving mini- where we are operating under the Poissonized model. max lower bounds in [76] and [77]. Lemma 10. For any given positive integer L > 0, there exist ∗ ∗ two probability measures ν0 and ν1 on [0, 1] that satisfy the following conditions: R l ∗ R l ∗ 1) t ν1 (dt) = t ν0 (dt), for l = 0, 1, 2,...,L; R α ∗ R α ∗ α 2) t ν1 (dt) − t ν0 (dt) = 2EL[x ][0,1], α Fix  ∈ (0, 1/2) to be specified later. Letting where EL[x ][0,1] is the distance in the uniform norm on [0, 1] α  1 1 1 from the function f(x) = x to the space polyL of polynomials θ = , ··· , , , (115) of no more than degree L. 1 2(S − 1) 2(S − 1) 2  1 +  1 +  1 −  The two probability measures ν∗ and ν∗ can be understood θ = , ··· , , , (116) 0 1 0 2(S − 1) 2(S − 1) 2 as the solution to the optimization problem of maximizing R α R α R l t ν1(dt) − t ν0(dt), with the constraint that t ν1(dt) = direct computation yields R l t ν0(dt), for l = 0, 1, 2,...,L, supp(νi) ⊂ [0, 1], i = 0, 1. 1 1 1 D(θ kθ ) = − ln(1 + ) + ln (117) Wu and Yang [27] gave an explicit construction of the mea- 1 0 ∗ ∗ 2 2 1 −  sures ν0 and ν1 from the solution of the best polynomial 1 = − ln(1 − 2) ≤ 2, (118) approximation problem for general functions on an interval. In ∗ ∗ 2 some sense, the two probability measures ν0 and ν1 are chosen and to be those that differ the most in terms of the expectations of the functions we care about (here is xα), with the same |F (θ ) − F (θ )| α 1 α 0 moments up to a certain order. Hence, they are difficult α 1−α −α α = [(1 + ) − 1] · (2(S − 1)) − 2 [1 − (1 − ) ] to distinguish via samples, but the corresponding functional (119) values are maximally apart from each other.  α(1 − α)2  According to Lemma 17, we have ≥ α − · (2(S − 1))1−α 2 2α α µ(2α) 2  α(1 − α)  lim L EL[x ][0,1] = 2α , (124) − 2−α α + (120) L→∞ 2 2 1/α Since we have assumed n S and ln S ln n, we ≥ α (2(S − 1))1−α − 2−α  & ln S & represent α(1 − α) 1−α −α 2 S1/α − (2(S − 1)) + 2  . (121) n = c , 1 c n1−δ, (125) 2 ln S . . 19

for some constant δ ∈ (0, 1), which implies that Lemma 11. The following bounds are true if d1 = 1, d2 = α 10e: α α α S ∼ n (ln n) . (126) α α c α µ(2α)d1 0 0 EµS Fα(P ) − EµS Fα(P ) = 2 2α (1 + o(1)) Note that 1 0 c (2d2) 1 S2 (137) 2α  2α . (127) c (n ln n) 1  , (138) Define cα  2 α 3α ln n 0 αd1 (ln n) M = d1 ,L = d2 ln n, S = S − 1, (128) VarµS0 (Fα(P )) ≤ (139) n j c nα n→∞ where d1, d2 are positive constants (not depending on n) that −→ 0, j = 0, 1, (140) will be determined later. Without loss of generality we assume ∞ 1 X that d2 ln n is always a positive integer. V (F1,M ,F0,M ) = |F1,M (y) − F0,M (y)| ∗ ∗ 2 For a given integer L, let ν0 and ν1 be the two probability y=0 measures possessing the properties given in Lemma 10. Let (141) g(x) = Mx and let µi be the measures on [0, 1] defined by 1 ∗ −1 ≤ . (142) µi(A) = νi (g (A)) for i = 0, 1. It follows from Lemma 10 n6 that: Setting R l R l 1) t µ1(dt) = t µ0(dt), for l = 0, 1, 2,...,L; 0 R α R α α α σ = µS , j = 0, 1, 2) t µ1(dt) − t µ0(dt) = 2M EL[x ][0,1]. j j 0 S0 S0 S0 QS θ = (p1, p2, . . . , pS−1), Let µ1 and µ0 be the product priors µi = j=1 µi. We 0 assign these priors to the length-S vector (p1, p2, . . . , pS0 ). T (θ) = Fα(P ), 0 0 Under µS or µS , we have almost surely 1 αα µ(2α)dα 0 1 s = 1 , 0 2 c (2d )2α S α α+1 2 X 0 α (ln n) pi ≤ S M ∼ d1  1, (129) ζ = EµS0 Fα(P ) + 2s c n1−α 0 i=1 in Lemma 9, it follows from Chebyshev’s inequality and c hence . n1−δ that  (ln n)α+1 α pα ≥ 1 − O ∼ 1, n → ∞. (130) β = σ (F (P ) > ζ − s) (143) S n1−α 0 0 α

= σ0(Fα(P ) − Eσ0 Fα(P ) > s) (144) We decompose Fα(P ) as Varσ0 (Fα(P )) α ≤ (145) Fα(P ) = Fα(P ) + pS, (131) s2 c(ln n)3 α where → 0, (146) S0 . n X F (P ) = pα. (132) α i and i=1 We argue that it suffices to show the minimax lower bound β1 = σ1(Fα(P ) < ζ + s) (147) in Theorem 2 holds when we replace F (P ) by F (P ). α α = σ1(Fα(P ) − Eσ1 Fα(P ) < −s) (148) Indeed, we just showed that Varσ1 (Fα(P )) 2+2α ≤ (149) 2 1 (ln n) s2 |F (P ) − F (P )| − 1 (133) α α α . c2α n2−2α c(ln n)3  1 . → 0. (150)  (134) n c2α Also, it follows from the general fact that S2  . (135) V (Qn P , Qn Q ) ≤ Pn V (P ,Q ) (which follows (n ln n)2α i=1 i i=1 i i=1 i i easily from a coupling argument [175]) that ˜ Hence, if there exists an estimator F that violates the minimax S0 ˜ η ≤ = O(n−5) → 0, (151) lower bound for Fα(P ) in Theorem 2, then F −1 will violates n6 the same minimax lower for estimating F (P ), which will α Applying Lemma 9, we have contradict what we show below.   1 For Y |p ∼ Poi(np), p ∼ µ0, we denote the marginal ˆ inf sup P |F − Fα| ≥ s ≥ , n → ∞. (152) distribution of Y by F0,M (y), whose pmf can be computed as Fˆ P ∈MS 2 Z e−np(np)y According to Markov’s inequality, we have F0,M (y) = µ0(dp). (136) y! 2 2  ˆ  1 2 1 S inf sup E F − Fα ≥ s  2α  2α . (153) We define F1,M (y) in a similar fashion. Fˆ P ∈MS 2 c (n ln n) 20

B. Minimax lower bound for Theorem 4 (Fα : 1 < α < 3/2) probability vectors ( S ) First we assume that S = n ln n. Similar to Lemma 10, we X 1 M (γ) P : p − d ≤ , (158) construct two measures as follows for α ∈ (1, 3/2). S , i 1 (ln n)γ i=1 Lemma 12. For any 0 < η < 1 and positive integer L > 0, with universal constant γ > 0 to be specified later, and further there exist two probability measures ν0 and ν1 on [η, 1] such define the minimax risk under the Poissonized model for that estimating Fα(P ) with P ∈ MS(γ) as 1) R tlν (dt) = R tlν (dt), for all l = 0, 1, 2, ··· ,L; ˆ 2 1 0 RP (S, n, γ) inf sup EP |F − Fα(P )| . (159) R α−1 R α−1 α−1 , ˆ 2) t ν1(dt) − t ν0(dt) = 2EL[x ][η,1], F P ∈MS (γ) β where EL[x ][η,1] is the distance in the uniform norm on β [η, 1] from the function f(x) = x to the space spanned by The equivalence of the minimax risk under the Multinomial {1, x, ··· , xL}. model R(S, n) (defined in (202)) and RP (S, n, γ) is estab- lished in the following lemma. Based on Lemma 12, two new measures ν˜0, ν˜1 can be constructed as follows: for i = 0, 1, the restriction of ν˜i on [η, 1] is absolutely continuous with respect to ν , with the i Lemma 14. For any S, n ∈ , γ > 0, 1 < α < 3/2, we have Radon-Nikodym derivative given by N 2α−2 d1n 1 1 d1n 4M dν˜i η R(S, ) ≥ R (S, n, γ) − exp(− ) − . (t) = ≤ 1, t ∈ [η, 1], (154) 2α P 2α 2α 2γ dν t 2 2d1 d1 8 d1 (ln n) i (160) and ν˜i({0}) = 1 − ν˜i([η, 1]) ≥ 0. Hence, ν˜0, ν˜1 are both probability measures on [0, 1], with the following properties R 1 R 1 In light of Lemma 14, it suffices to consider R (S, n, γ) to 1) t ν˜1(dt) = t ν˜0(dt) = η; P R l R l give a lower bound of R(S, n). Denote 2) t ν˜1(dt) = t ν˜0(dt), for all l = 2, ··· ,L + 1; R α R α α−1 3) t ν˜1(dt) − t ν˜0(dt) = 2ηEL[x ] . [η,1] χ S Fα(P ) − S Fα(P ) (161) , Eµ1 Eµ0 ν˜ , ν˜ α α−1 The construction of measures 0 1 are inspired by Wu and = 2ηM EL[x ][η,1] · S (162) Yang [27]. α−1 α−1 = 2d1M EL[x ][η,1], (163) The following lemma characterizes the properties of β EL[x ][η,1] using well-developed tools from approximation and theory [167]. Similar results can be found in Wu and Yang [27] \ n χo Ei , MS(γ) P : |Fα(P ) − EµS Fα(P )| ≤ , i = 0, 1. in which they treated the logarithmic function. i 4 (164) Lemma 13. For 0 < β < 1/2, there exists a universal positive Applying Chebyshev’s inequality and the union bound yields constant D such that that 2β β lim inf L EL[x ][(DL)−2,1] > 0. (155)  S  L→∞ S c S X 1 µi [(Ei) ] ≤ µi  pj − d1 >  Define (ln n)γ j=1 2 2 1 d1 d1d2D ln n S h χi L = d2 ln n, η = ,M = = , + µi |Fα(P ) − EµS Fα(P )| > (165) (DL)2 Sη n i 4 (156) S 2γ X 16 ≤ (ln n) VarµS [pj] + 2 VarµS [Fα(P )] with universal positive constants d1, d2 to be determined later. i χ i j=1 Without loss of generality we assume that d2 ln n is always a (166) positive integer. By the choice of η we know that 2γ 2 16 2α 2(α−1) α−1 ≤ (ln n) SM + 2 SM (167) lim inf(ln n) EL[x ][η,1] > 0. (157) χ n→∞ d2d4D4(ln n)2γ+3 4d4D4(ln n)3 = 1 2 + 2 Let g(x) = Mx and let µi be the measures on [0,M] cn cn(E [xα−1] )2 −1 L [η,1] defined by µi(A) =ν ˜i(g (A)) for i = 0, 1. It then follows (168) that → 0 as n → ∞, (169) R 1 R 1 1) t µ1(dt) = t µ0(dt) = d1/S; R l R l where (169) follows from (157). Denote by πi the conditional 2) t µ1(dt) = t µ0(dt), for all l = 2, ··· ,L + 1; R α R α α α−1 distribution defined as 3) t µ1(dt) − t µ0(dt) = 2ηM EL[x ][η,1]. µS(E ∩ A) µS µS i i Let 0 and 1 be product priors which we assign to the πi(A) = S , i = 0, 1. (170) µ (Ei) length-S vector P = (p1, p2, ··· , pS). 
Note that P may not be i a probability distribution, we consider the set of approximate Now consider π0, π1 as two priors and F0,F1 as the corre- 21 sponding marginal distributions. Setting Theorem 4 by noticing that under this scale, χ S m ln m 2 ζ = EµS Fα(P ) + , (171) lim = lim = . 0 2 (188) χ n→∞ n ln n m→∞ (d1m/2) ln(d1m/2) d1 s = , (172) 4 1 V. EXPERIMENTS d = , (173) 1 (10eD)2 As mentioned in the Introduction, the implementation of d = 10e, (174) 2 our algorithm is extremely efficient and has linear complexity γ = 2α, (175) with respect to the sample size n, independent of the support size. The only overhead that deserves special mention is the we have β0 = β1 = 0. The total variational distance is then upper bounded by computation of the best polynomial approximation, which is performed via the Remez algorithm [160] offline before obtain- V (F0,F1) ≤ V (F0,G0) + V (G0,G1) + V (G1,F1) (176) ing any samples. The Chebfun team [161] provides a highly S c S c optimized implementation of the Remez algorithm in Matlab ≤ µ0 [(E0) ] + V (G0,G1) + µ1 [(E1) ] (177) [162]. In numerical analysis, the convergence of an algorithm S c S S c ≤ µ0 [(E0) ] + 6 + µ1 [(E1) ] (178) is called quadratic if the error em after the m-th computation n 2m → 0, (179) satisfies em ≤ Cα for some C > 0 and 0 < α < 1. Under some assumptions about the function to approximate, S where Gi is the marginal probability under prior µi . Equa- one can prove [73, Pg. 96] the quadratic convergence of the tion (176) follows from the triangle inequality of the total Remez algorithm. Empirical experiments partially validate the variation distance, and (177) follows from the data processing efficiency of the Remez algorithm, which computes order 500 inequality satisfied by the total variation distance. Equation best polynomial approximation for −x ln x, x ∈ [0, 1] in a (178) is given by Lemma 11, and (179) follows from (169). fraction of a second on a Thinkpad X220 laptop. Consid- S The idea of converting approximate priors µi into priors πi ering the fact that the order of approximation we conduct via conditioning comes from Wu and Yang [27]. is logarithmic in n, in practice we do not need to perform this computation: we simply precompute the best polynomial It follows from Lemma 9 and Markov’s inequality that approximation coefficients for various orders (e.g., up to order 2  ˆ  200) and store them in the software. RP (S, n, γ) ≥ s inf sup P |F − Fα(θ)| ≥ s (180) ˆ c , c F P ∈MS (γ) We emphasize that although the value of constants 1 2 required in Lemma 6 lead to rather poor constants in the bias 1 − V (F0,F1) ≥ χ2 (181) and variance bounds, the practical performance could be much 32 2  2(α−1) better than what the theoretical bounds guarantee. It is due to d1(1 − V (F0,F1)) ln n = × the conservative nature of our worst-case (maximum L2 risk) 8 n formalism and the fact that we use upper bounds in the analysis α−1 2 (EL[x ][η,1]) . (182) that, while being optimal up to a multiplicative constant, are not absolutely tightest for a fixed S and n. Practically,

d1m experimentation shows that c1 ∈ [0.05, 0.2], c2 = 0.7 results Now we consider the scale (S, n) = (m ln m, 2 ), and it follows from (157) and Lemma 14 that for this scale, in very effective entropy estimation. In our experiments, we do not conduct “splitting” and lose half of the samples, and 2(α−1) ˆ 2 lim inf(n ln n) · inf sup EP |F − Fα(P )| (183) we evaluate our estimator on the multinomial rather than the n→∞ ˆ F P ∈MS Poisson sampling model required for the analysis. Moreover, 2(α−1) d m d m  d m we do not remove the constant term in the best polynomial = lim inf 1 ln( 1 ) · R(S, 1 ) (184) m→∞ 2 2 2 approximation in experiments, since one can show they do not 2(α−1) influence the achievability of the minimax rates, and it is easier d  ≥ 1 lim inf(m ln m)2(α−1)× to conduct best polynomial approximation with the constant 2 m→∞   term. h 1 1 d1m Given the extensive literature on entropy estimation, we 2α RP (S, m, γ) − 2α exp − 2d1 d1 8 demonstrate the efficacy of our methodology in functional 4 ln m2α−2 i estimation in estimating entropy. Specifically, we compare − (185) 2α 2γ our estimator with the following estimators proposed in the d1 (ln m) m 2 literature: h1 − V (F0,F1)  2(α−1) α−1  ≥ lim inf (ln m) EL[x ][η,1] m→∞ 22α+2 1) The MLE: the entropy of the empirical distribution 2(α−1) 4−2α ˆ MLE PS (m ln m)  d m 2 i H = −pˆi lnp ˆi. It has been shown in [29], [52] − exp − 1 − (186) i=1 2(α−1) 2 2 4 that this approach cannot achieve the minimax rates. 2 d1 8 d1(ln n) 2) The Miller-Madow bias-corrected estimator [53]: > 0. (187) ˆ MM ˆ MLE S−1 H = H + 2n . It has been shown [29], [52] Then the proof is completed by choosing any c0 > 2/d1 in that this approach cannot achieve the minimax rates. 22

3) The Jackknifed MLE [59]: Hˆ JK(Z) = nHˆ MLE(Z) − plug-in estimator of the Bayes estimate of the distri- n−1 Pn ˆ MLE −j −j n j=1 H (Z ), where Z is the remaining bution when the Dirichlet prior Dir(a) is imposed on sample by removing j-th observation. It has been shown P ∈ MS, i.e., in [52] that this approach cannot achieve the minimax S   X npˆi + a npˆi + a rates. Hˆ Dirichlet = − ln . (193) n + Sa n + Sa 4) The unseen estimator by Valiant and Valiant [25]: i=1 the estimator in [24] is the first estimator shown to We give the true support size S as its input, and we set achieve the optimal sample complexity n  S in √ ln S the parameter a = n/S to obtain a minimax estimator entropy estimation. Recently, Valiant and Valiant [25] of the distribution P under `2 loss [7, Example 5.4.5]. It provided a modification of [24] to estimate entropy, has been shown in [58] that this approach cannot achieve and demonstrated its superior empirical performance the minimax rates. via comparison with various existing algorithms, even 10) The Bayes estimator under Dirichlet prior [56]: this with the algorithm proposed in Valiant and Valiant [24]. estimator also uses the Dirichlet prior Dir(a) on the Hence, it is most informative to compare our algorithm distribution P ∈ MS, but then it is the Bayes estimator with that of [25]. In our experiments, we downloaded for H(P ) under this prior in lieu of the plug-in approach. and used the Matlab implementation of the estimator in Wolpert and Wolf [56] gives an explicit expression of [25], with default parameters. this estimator: 5) The coverage adjusted estimator (CAE) [65], [121]: an S estimator specifically designed to apply to settings in X npˆi + a Hˆ Bayes = · (ψ (n + Sa + 1) − ψ (npˆ + a + 1)) . which there is a significant component of the distribution n + Sa 0 0 i i=1 ˆ f1 that is unseen. Defining C = 1 − n , where f1 denotes (194) the number of symbols that only appear once in the sample [176], the CAE estimator is then given by It has been shown in [58] that this approach cannot achieve the minimax rates. We feed the algorithm with S ˆ ˆ √ X (−Cpˆi ln(Cpˆi)) true S and set a = n/S in our experiments. Hˆ CAE = . (189) n 11) The Nemenman–Shafee–Bialek estimator (NSB) [64], 1 − (1 − Cˆpˆi) i=1 [68]: instead of fixing some parameter a in the Dirich- 6) The best upper bound estimator (BUB) [52]: an estima- let prior, the NSB estimator uses an infinite Dirichlet tor proposed by Paninski which minimizes the sum of mixture for averaging so that the prior distribution of upper bounds of the squared bias and the variance, which H(P ) is near uniform. Then the NSB estimator is are given by approximation theory and the bounded- the Bayes estimator under this prior. There are some difference inequality [177], respectively. Note that this algorithmic stability issues due to the involvement of estimator requires the knowledge of S, and we give the several numerical algorithms, including the Newton- true support size as its input. Hence we are comparing Raphson iterative algorithm and numerical integration. our estimator with the best-case performance of the BUB estimator. A. What do we want to test? 7) The shrinkage estimator [63]: the plug-in estimator of We would like to clarify the aim of experimentation in eval- the shrinkage estimate of the distribution, which is given uating estimators for functionals, such as entropy, compared by to theoretical study. 
On the face of it, experimentation on sim- s 1 ulated data seems to be the holy grail: indeed, in simulations pˆi = λ · + (1 − λ) · pˆi (190) S we can compute the true expected squared error of certain 1 − PS pˆ2 estimator for certain distributions very accurately, and the λ = i=1 i . (191) PS 2 smaller the risk is, the better the estimator is. However, a close (n − 1)( i=1 pˆi − 1/S) inspection of this procedure demonstrates a severe limitation Hence the distribution estimate is shrunk towards the of the simulation approach: one can never simulate all the Hˆ shrinkage = PS −pˆs lnp ˆs uniform distribution, and i=1 i i . possible distributions in a not-too-small subset of the space We also give the true support size S as its input. of discrete distributions with support size S. For example, 8) The Grassberger estimator [67]: this estimator aims suppose we have chosen 10 distributions with support size S, to explicitly construct a function f with a small bias and conducted experiments on entropy estimation. The only |Ef(X) + p ln p| for X ∼ B(n, p). Define a sequence ∞ possible conclusion we may draw from these experiments is {Gn}n=0 with that the estimator that performs well on these 10 distributions Z 1 n−1 should be applied if the true distribution is one of the 10 n x Gn = ψ0(n) + (−1) dx, n ≥ 0 (192) distributions. We cannot draw conclusions about other dis- 0 x + 1 tributions with support size S that were not tested, but we d where ψ0(x) = dx [ln Γ(x)] is the digamma function. can never test all the distributions. The advantage of theory Then the estimator is given by Hˆ Grassberger = ln n − is to study the performance of estimators for all the possible PS i=1 pˆiGnpˆi . distributions. On the practical side, if the statistician is being 9) The Dirichlet-smoothed plug-in estimator [178]: the conservative, i.e., the statistician wants the scheme to perform 23 well no matter what distribution might be the true distribution, under which the Bayes risk is at least the same order of the then only theoretical study can resolve this issue, which is our minimax risk. starting point for this paper.
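Several of the baselines listed above have closed forms that are one-liners in code. The sketch below is ours, following the conventions stated in the text (the true S supplied where required and a = √n/S for the Dirichlet-based rules); it implements the MLE, the Miller-Madow correction, the Dirichlet-smoothed plug-in (193), the Bayes estimator under a Dirichlet prior (194), and the coverage-adjusted estimator (189).

```python
import numpy as np
from scipy.special import digamma

def entropy_baselines(counts, S=None, a=None):
    """counts: length-S array of per-symbol counts from n samples."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    if S is None:
        S = len(counts)
    if a is None:
        a = np.sqrt(n) / S                    # choice used in the experiments
    p_hat = counts / n

    nz = p_hat > 0
    h_mle = -(p_hat[nz] * np.log(p_hat[nz])).sum()
    h_mm = h_mle + (S - 1) / (2 * n)          # Miller-Madow bias correction

    q = (counts + a) / (n + S * a)            # Dirichlet(a)-smoothed distribution
    h_dir = -(q * np.log(q)).sum()            # plug-in of the smoothed distribution, (193)
    h_bayes = (q * (digamma(n + S * a + 1) - digamma(counts + a + 1))).sum()   # (194)

    f1 = (counts == 1).sum()                  # number of symbols seen exactly once
    C = max(1.0 - f1 / n, 1e-12)              # Good-Turing coverage estimate
    cp = C * p_hat[nz]
    h_cae = (-cp * np.log(cp) / (1.0 - (1.0 - cp) ** n)).sum()   # coverage-adjusted, (189)

    return dict(MLE=h_mle, MillerMadow=h_mm, Dirichlet=h_dir, Bayes=h_bayes, CAE=h_cae)

# Example: uniform distribution with S comparable to n, where the MLE is badly biased.
rng = np.random.default_rng(2)
S, n = 10_000, 10_000
counts = np.bincount(rng.integers(0, S, size=n), minlength=S)
print("true entropy:", np.log(S))
for name, value in entropy_baselines(counts, S=S).items():
    print(f"{name:12s}: {value:.4f}")
```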

One might argue that in practice, one may possess some knowledge of the underlying distribution. For example, we may know that the distribution is exactly uniform, without knowledge of its support size. In that case, entropy estimation has been shown to be considerably easier [32]: it is necessary and sufficient to take n ≍ √S samples to consistently estimate the entropy ln S. Comparing with the optimal sample complexity S/ln S under the assumption that the estimator is required

to estimate the entropy of any discrete distribution with S elements, the knowledge of uniformity reduces the difficulty of the problem considerably. We remark that if the statistician has convincing knowledge that the unknown distribution has certain structures (such as uniformity), then one should design schemes to exploit that prior knowledge. Recently, follow-up work [36] showed that the estimator in the present paper is also adaptive, i.e., in some sense it can automatically adjust itself to the unknown distribution to achieve higher accuracy. Recall Section I-C for more discussions on this feature.

[Fig. 5(a): Our estimator, MLE, Bias-corrected MLE, Valiant-Valiant, Jackknifed, CAE; axes: log S vs. mean squared error.]

The extensive comparative study we conduct below demonstrates that not only does our entropy estimator enjoy strong theoretical guarantees, but it also works well in practice for various distributions. Furthermore, the experiments test the numerical stability, space, and time efficiency of the implementation, which are of crucial importance in practical applications.

B. Convergence properties along n = c S/ln S

Since the optimal sample complexity in entropy estimation is n ≍ S/ln S, we investigate the performance of various estimators along the scaling n = c S/ln S, S → ∞.

[Fig. 5(b): BUB, Dirichlet + plug-in, Grassberger, Bayes, NSB, Shrinkage; axes: log S vs. mean squared error.]

In light of the proof of the lower bounds in Section IV,

−5 we can construct two priors on MS such that the entropies 10 2 4 6 8 10 12 14 corresponding to the priors are quite different, but these two log S priors are hard to distinguish based on observed samples. Hence for any estimator, the arithmetic mean of the expected (b) MSE based on distributions drawn from each prior should be lower bounded by the minimax risk. Moreover, for estimators Fig. 5: The total empirical MSE of all 12 estimators along sequence n = 10S/ ln S, where S is sampled equally spaced which cannot achieve the minimax risk up to a multiplicative 6 S logarithmically from 10 to 10 . The horizontal line is ln S, constant, this MSE will blow up eventually along n = c ln S as S → ∞. and the vertical line is the MSE in logarithmic scale. We choose c = 10, and sample 15 points equally spaced in a logarithmic scale from 10 to 106 as candidates for support size The experimental results are exhibited in Figure 5. We S. For each support size S, we construct two product priors remark that the vertical line is the MSE in logarithmic scale, S S α−1 µ0 , µ1 as in Section IV-B by replacing x with − ln x in which means that a small positive slope represents exponential Lemma 12. Wu and Yang [27] gave an explicit construction of growth in MSE. Bearing this in mind, Figure 5 suggests the priors and used them first in the proof of minimax lower that the following three estimators out of 12, namely our bounds for entropy estimation. Then for i = 0, 1, in every estimator, the estimator by Valiant and Valiant [25] and the S sample we obtain a non-negative random vector from µi , best upper bound (BUB) estimator, achieve the minimax rate. which is normalized into a probability distribution P ∈ MS. Some other estimators have already been shown not to achieve Then we take n = 10S/ ln S samples from distribution P and the optimal sample complexity, e.g., the Miller-Madow bias- obtain an estimate of H(P ). We repeat all the preceding steps corrected MLE [29], [52], the jackknifed estimator [52], and 20 times by Monte Carlo experiments to obtain the empirical the Dirichlet-smoothed plug-in estimator as well as the Bayes MSE under each prior, then we take the arithmetic mean to estimator under Dirichlet prior [58]. We remark that the shrink- form the total empirical MSE. We remark that the resulting age estimator only improves the MLE when the distribution prior from this approach is nearly a least favorable prior [21], is near-uniform, and this estimator performs poorly under 24 the Zipf distribution (verified by our experiments), thus it 0.03 attains neither the optimal minimum sample complexity nor Our estimator MLE the minimax rate. Valiant−−Valiant 0.025 Motivated by the preceding result, in our subsequent ex- BUB periments we only consider our estimator, the estimator in [25] and the BUB estimator, as well as the MLE used as a 0.02 benchmark. We also remark that all other estimators are still tested in our experiments, which consistently demonstrate the 0.015 superior empirical performance of our estimator, the estimator in [25] and the BUB estimator over others. 0.01 Root Mean Squared Error

C. Estimation of entropy 0.005 Now we examine the performance of our estimator, the 0 estimator in [25], the BUB estimator and the MLE in entropy 4.5 5 5.5 6 6.5 7 log S estimation for various distributions. We remark that in theory, our estimator is the only one among these estimators which (a) Uniform Distribution has been shown to achieve the minimax L2 rates, while the estimator in [25] has only been shown order-optimal in terms 0.04 of the sample complexity, and currently neither the maximum Our estimator MLE L2 risk nor the sample complexity is known for the BUB 0.035 Valiant−−Valiant estimator. Moreover, the BUB estimator requires an accurate BUB upper bound for the support size S, and it was remarked 0.03 in [121] that the performance of BUB degrades considerably when this bound is inaccurate. 0.025 We divide our experiments into two regimes. 1) Data rich regime: S  n: We first experiment in the 0.02 regime S  n, which is an “easy” regime where even the

Root Mean Squared Error 0.015 MLE is known to perform very well. However, the estimator in [25] exhibits peculiar behavior. We sample 8 points equally 0.01 spaced in a logarithmic scale from 102 to 103 as candidates for support size S, and for each S we conduct 20 Monte Carlo 0.005 4.5 5 5.5 6 6.5 7 simulations of estimation based on n = 50S observations from log S certain distribution over an alphabet of size S. We consider two special distributions, i.e., the uniform distribution with (b) Zipf Distribution p = 1/S and the Zipf distribution p = i−α/ PS j−α with i i j=1 Fig. 6: The empirical MSE of our estimator, the MLE, the order α = 1 for 1 ≤ i ≤ S. The empirical root MSE is estimator in [25] and the BUB estimator for the uniform exhibited in Figure 6. and Zipf distributions along sequence n = 50S, where S is It is quite clear that both our estimator and the BUB sampled equally spaced logarithmically from 102 to 103. The estimator perform quite well for these distributions in the data horizontal line is ln S, and the vertical line is the root MSE. rich regime, but the MLE and the estimator in [25] return the entropy estimate which is far from the true entropy. A close inspection shows that the variance dominates the squared bias S for our estimator and the BUB estimator, while there is a huge based on n = c ln S observations from certain distribution bias which constitutes the major part of the MSE for both the over an alphabet of size S, where c = 5 for the uniform MLE and the estimator in [25]. distribution and c = 15 for the Zipf distribution with α = 1. We remark that the estimator in [25] and the BUB estimator Note that when S = 20, 000, we have n = 10, 098 for c = 5 have substantially longer running time than ours in the data and n = 30, 293 for c = 15, so we are actually in the data rich regime. For the uniform distribution, the total running time sparse regime. The outputs of our estimator, the MLE, the of our estimator in 160 Monte Carlo simulations is 0.75s, with estimator in [25] and the BUB estimator for both distributions a small overhead over the MLE which requires 0.47s, whereas are exhibited in Figure 7. the one in [25] takes 186.72s and the BUB estimator takes Figure 7 shows that the MLE is far from the true entropy. 32.97s. Similar results hold for the Zipf distribution. Both our estimator and that of [25] perform quite well, but 2) Data sparse regime: S  n or S  n: This is the comparatively the BUB estimator exhibits a large bias for regime where the conventional approaches such as MLE fail. some distributions, e.g., the uniform distribution. Interestingly, We sample 8 points equally spaced in a logarithmic scale from with the same sample size n = 10000, the estimator in [25] 10, 000 to 40, 000 as candidates for support size S, and for and the BUB estimator run much faster than in the data each S we conduct 20 Monte Carlo simulations of estimation rich regime, with a total running time 9.28s and 2.35s for 25

undesirable in applications such as mutual information estimation and situations where the support size S is

unknown;
3) the BUB estimator performs quite well in the data rich regime S ≪ n, but performs worse than our estimator and the estimator in [25] for some distributions in the data sparse regime. Moreover, it requires the knowledge of S and does not have theoretical guarantees on its worst-case performance thus far, which is undesirable and not convincing in practice;
4) our estimator has stable performance, linear complexity, and high accuracy.

[Fig. 7(a): Uniform Distribution]

D. Estimation of mutual information

One functional of particular significance in various applications is the mutual information I(X; Y), but it cannot be written directly in the form of (1). Indeed, we have

I(X; Y) = \sum_{x,y} P_{XY}(x, y) ln [ P_{XY}(x, y) / ( P_X(x) P_Y(y) ) ]   (195)
        = \sum_{x,y} P_{XY}(x, y) ln [ P_{XY}(x, y) / ( (\sum_y P_{XY}(x, y)) (\sum_x P_{XY}(x, y)) ) ].   (196)

However, one can easily show that if X, Y both take values in alphabets of size S, then the sample complexity for estimating I(X; Y) is n ≍ S²/ln S, rather than the n ≍ S² required by the MLE [179]. Applying our entropy estimator in the following way results in an essentially minimax (rate-optimal) mutual information estimator. We represent

0 9.2 9.4 9.6 9.8 10 10.2 10.4 10.6 I(X; Y ) = H(X) + H(Y ) − H(X,Y ), (197) log S where H(X,Y ) is the entropy associated with the joint (b) Zipf Distribution distribution PXY , and use our entropy estimator to estimate each term. As was exhibited in previous experiments, in the Fig. 7: The empirical root MSE of our estimator, the MLE, data rich regime, the BUB estimator is better than the MLE and the estimator in [25] and the BUB estimator along sequence the estimator in [25], and in the data sparse regime, the worst- n = c S , where S is sampled equally spaced logarithmically ln S case performance of [25] is better than the BUB estimator from 10, 000 to 40, 000. The quantity c = 5 for the uniform and the MLE, and in both regimes our estimators are doing distribution and c = 15 for the Zipf distribution. The horizon- well uniformly. However, in mutual information estimation, tal line is ln S, and the vertical line is the root MSE. the estimators of H(X) and H(Y ) may be operating in the data rich regime, but that of H(X,Y ) in the data sparse regime. Conceivably, in this situation none of the MLE, [25] the uniform distribution. However, it is still slower than our and the BUB estimator would perform well, but our estimator estimator, which takes 0.71s, only 0.05s longer compared with is expected to have good performance. the MLE. In order to investigate this intuition, we fix n = 2, 500 and We have experimented with other distributions such as the sample 8 points equally spaced in a logarithmic scale from 100 Zipf with order α 6= 1, the mixture of the uniform and Zipf to 200 as candidates for support size S, and we generate two distributions, as well as randomly generated distributions, with random variables X,Y both with support size S as follows. similar results. In summary, we observe that We first randomly generate two marginal distributions PX (i) 1) the MLE usually concentrates at some point far away and PZ (i), 1 ≤ i ≤ S, where for each i we choose two from the true functional value, particularly when the independent random variables distributed as Beta(0.6, 0.5) for support size is comparable, or larger than the number PX (i) and PZ (i), and we normalize at the end to make them of observations; distributions. We pass X through a transition channel to obtain 2) the estimator in [25] performs quite well in the data Y , such that Y = (X + Z) mod S. Note that we are in the sparse regime S  n and S  n, but performs worse regime S  n  S2. We conduct 20 Monte Carlo simulations than the MLE in the data rich regime S  n, which is for each S, and the results are exhibited in Figure 8. 26
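Given any entropy estimator, the decomposition (197) immediately yields a mutual information estimator. The sketch below is illustrative: `plug_in_entropy` is only a placeholder (swapping in a minimax rate-optimal entropy estimator gives the estimator studied here), and the data generation reproduces one instance of the experimental setup described above, with S = 150 and Y = (X + Z) mod S.

```python
import numpy as np

def plug_in_entropy(labels, support_size):
    """Placeholder entropy estimator (empirical plug-in); replace with a minimax
    rate-optimal estimator to obtain the estimator studied in the paper."""
    p_hat = np.bincount(labels, minlength=support_size) / len(labels)
    nz = p_hat > 0
    return float(-(p_hat[nz] * np.log(p_hat[nz])).sum())

def mutual_information_estimate(x, y, S, entropy_estimator=plug_in_entropy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), with the pair (X,Y) encoded as one symbol."""
    joint = x * S + y                         # bijective encoding into {0, ..., S^2 - 1}
    return (entropy_estimator(x, S) + entropy_estimator(y, S)
            - entropy_estimator(joint, S * S))

# Experiment setup from the text: random marginals drawn from Beta(0.6, 0.5),
# then Y = (X + Z) mod S with Z independent of X.
rng = np.random.default_rng(3)
S, n = 150, 2500
p_x = rng.beta(0.6, 0.5, size=S); p_x /= p_x.sum()
p_z = rng.beta(0.6, 0.5, size=S); p_z /= p_z.sum()
x = rng.choice(S, size=n, p=p_x)
z = rng.choice(S, size=n, p=p_z)
y = (x + z) % S

def H(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# True value for reference: H(X,Y) = H(X) + H(Z) since (X, Y) <-> (X, Z).
p_y = np.array([sum(p_x[i] * p_z[(j - i) % S] for i in range(S)) for j in range(S)])
print("true I(X;Y)     :", H(p_x) + H(p_y) - (H(p_x) + H(p_z)))
print("estimated I(X;Y):", mutual_information_estimate(x, y, S))
```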

Fig. 8: The empirical MSE of our estimator, the MLE, the estimator in [25], and the BUB estimator, where n = 2,500 is fixed, and S is sampled equally spaced logarithmically from 100 to 200. The horizontal axis is ln S, and the vertical axis is the MSE on a logarithmic scale. The goal is to estimate the mutual information I(X;Y).

It is clear from Figure 8 that the MLE deviates from the true mutual information significantly, but our estimator is quite accurate, performing comparably to the estimator in [25] and the BUB estimator for most support sizes S. Note that the true mutual information is only about 0.4, so decreasing the MSE from 0.01 to half of that value is a significant improvement. At the same time, the estimator in [25] and the BUB estimator have considerably longer running times than our estimator: it takes 128.13s and 48.86s for the estimator in [25] and the BUB estimator, respectively, to complete the 160 simulations, whereas ours requires 1.98s.

E. Estimation of entropy rate

Another functional of particular significance is the entropy rate H = H(X_0 | X_{−∞}^{−1}) of a stationary ergodic stochastic process {X_n}_{n=−∞}^{∞}, of fundamental importance in information theory [1]. Consider a stationary ergodic Markov process {X_n}_{n=0}^{∞} with support size S and memory length D; then the entropy rate H can be expressed as

    H = H(X_{D+1} | X_1^D) = H(X_1^{D+1}) − H(X_1^D),                                                         (198)

where we have adopted the notation X_m^n = (X_m, X_{m+1}, ···, X_n) for m ≤ n. Hence, to estimate the entropy rate, it suffices to estimate the joint entropies H(X_1^{D+1}) and H(X_1^D) separately, which can be accomplished by any entropy estimator Ĥ. Specifically, to estimate H(X_1^D), we can construct the n − D + 1 supersymbols Y_i = X_i^{i+D−1} for 1 ≤ i ≤ n − D + 1 and then estimate H(Y) using the estimator for entropy. The size of the alphabet in which Y takes values is S^D, but we remark that the sample complexity n ≍ S^D/ln(S^D) need not be optimal in this setting of a stationary ergodic stochastic process, i.e., there may exist some estimator Ĥ with a vanishing worst-case L2 risk for this case even if n ≲ S^D/ln(S^D). The reasons are twofold: the supersymbols Y_i are no longer independent, and the process satisfies the asymptotic equipartition property (AEP) by the Shannon–McMillan–Breiman theorem [180]. To see the role played by ergodicity, it has been shown in [181] that minimax estimation of discrete distributions with support size S under ℓ1 loss requires n ≍ S samples, which for D-tuple distribution estimation in stochastic processes turns out to require S^D ≲ n, or equivalently, D ≲ ln n / ln S. However, [182] showed that it suffices to choose the memory length D ≤ (1 − ε) ln n / H, for any ε > 0, in a stationary ergodic stochastic process, where H is its entropy rate, to guarantee that the empirical joint distribution of D-tuples converges to the true joint distribution under ℓ1 loss. Noting that H ≤ ln S holds for any distribution, we conclude that the stochastic process is easier to handle than a single distribution with a large support size. Finally, we remark that minimax estimation of the entropy rate of stationary ergodic processes in the finite sample setting remains largely open.

Now we examine the performance of our estimator, the MLE, the estimator in [25], and the BUB estimator in the estimation of the entropy rate. We fix the memory length D = 4 and choose the support size S from 7 to 14. For each support size S, we construct a discrete distribution P_Z where P_Z(i) is independently drawn from Beta(0.6, 0.5) for 1 ≤ i ≤ S before normalization. Then, for the sample size n = 1.5 S^{D+1}/ln(S^{D+1}), we draw n i.i.d. samples Z_1, Z_2, ···, Z_n from P_Z and construct the stochastic process {X_k}_{k=1}^n with memory length D as follows: X_k = Z_k for 1 ≤ k ≤ D, and X_k = (Z_k + Σ_{j=k−D}^{k−1} X_j) mod S for k > D. We conduct 20 Monte Carlo simulations for each S, with results exhibited in Figure 9.
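The process construction and the supersymbol-based entropy rate estimate of (198) can be summarized in the following Python sketch (an illustration under the setup described above, not the original experiment code). The plug-in estimate `entropy_mle` again stands in for the entropy estimator Ĥ, and the helper names are ours.

```python
import numpy as np
from collections import Counter

def entropy_mle(samples):
    """Plug-in entropy estimate in nats; placeholder for any entropy estimator H-hat."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def simulate_process(S, D, n, rng):
    """X_k = Z_k for k <= D and X_k = (Z_k + X_{k-D} + ... + X_{k-1}) mod S otherwise,
    where Z_1, ..., Z_n are i.i.d. from a normalized Beta(0.6, 0.5) draw P_Z."""
    p_z = rng.beta(0.6, 0.5, size=S); p_z /= p_z.sum()
    z = rng.choice(S, size=n, p=p_z)
    x = np.zeros(n, dtype=np.int64)
    x[:D] = z[:D]
    for k in range(D, n):
        x[k] = (z[k] + x[k - D:k].sum()) % S
    return x

def supersymbols(x, L):
    """Overlapping L-tuples Y_i = (X_i, ..., X_{i+L-1})."""
    return [tuple(x[i:i + L]) for i in range(len(x) - L + 1)]

def entropy_rate_estimate(x, D, entropy_estimator=entropy_mle):
    """H-hat(X_1^{D+1}) - H-hat(X_1^D), cf. (198)."""
    return (entropy_estimator(supersymbols(x, D + 1))
            - entropy_estimator(supersymbols(x, D)))

S, D = 10, 4
n = int(1.5 * S ** (D + 1) / np.log(S ** (D + 1)))
rng = np.random.default_rng(0)
x = simulate_process(S, D, n, rng)
print(entropy_rate_estimate(x, D))
```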

Fig. 9: The empirical MSE of our estimator, the MLE, the estimator in [25], and the BUB estimator, where D = 4 is fixed, 7 ≤ S ≤ 14, and n = 1.5 S^{D+1}/ln(S^{D+1}). The horizontal axis is S, and the vertical axis is the MSE on a logarithmic scale. The goal is to estimate the entropy rate H(X_{D+1} | X_1^D).

It is clear from Figure 9 that our estimator performs most favorably in the estimation of the entropy rate for most S. As before, it takes our estimator less time (6.62s) to complete all 160 simulations compared with the estimator in [25] (24.65s) and the BUB estimator (13.91s). Due to its high accuracy and linear complexity, our estimator is an efficient tool when dealing with high dimensional data.

F. Application in learning graphical models

Given n i.i.d. samples of a random vector X = (X_1, X_2, ..., X_d), where X_i ∈ 𝒳, |𝒳| < ∞, we are interested in estimating the joint distribution of X. It was shown [181] that one needs to take n ≍ |𝒳|^d samples to consistently estimate the joint distribution [40], which blows up quickly with growing d. Practically, it is convenient and necessary to impose some structure on the joint distribution P_X to reduce the required sample complexity. Chow and Liu [11] considered this problem under the constraint that the joint distribution of X satisfies order-one dependence. To be precise, Chow and Liu assumed that P_X can be factorized as

    P_X = ∏_{i=1}^d P_{X_{m_i} | X_{m_{j(i)}}},   0 ≤ j(i) < i,                                               (199)

where (m_1, m_2, ..., m_d) represents an unknown permutation of the integers (1, 2, ..., d). This dependence structure can be written as a tree with the random variables as nodes.

Towards estimating P_X from n i.i.d. samples, Chow and Liu [11] considered solving for the MLE under the constraint that it factors as a tree. Interestingly, this optimization problem can be efficiently solved after being transformed into a Maximum Weight Spanning Tree (MWST) problem. Chow and Liu [11] showed that the MLE of the tree structure boils down to the following expression:

    E_ML = argmax_{E_Q : Q is a tree} Σ_{e ∈ E_Q} I(P̂_e),                                                     (200)

where I(P̂_e) is the mutual information associated with the empirical distribution of the two nodes connected via edge e, and E_Q is the set of edges of a distribution Q that factors as a tree. In words, it suffices to first compute the empirical mutual information between any two nodes (in total (d choose 2) pairs), and the maximum weight spanning tree is the tree structure that maximizes the likelihood. To obtain estimates of the distributions on each edge, Chow and Liu [11] simply assigned the empirical distribution.

The Chow–Liu algorithm is widely used in machine learning and statistics as a tool for dimensionality reduction and classification, and as a foundation for algorithm design in more complex dependence structures [183] in the theory of learning graphical models [184], [185]. It has also been widely adopted in applied research, and is particularly popular in systems biology. For example, the Chow–Liu algorithm is extensively used in the reverse engineering of transcription regulatory networks from gene expression data [186].

Considerable work has been dedicated to the theoretical properties of the CL algorithm. For example, Chow and Wagner [187] showed that the CL algorithm is consistent as n → ∞. Tan et al. [188] studied the large deviation properties of CL. However, no study has justified the use of CL in practical scenarios involving finitely many samples. Indeed, the fact that CL solves the MLE does not imply it is optimal: recall Section II-D1 for the discussion on the MLE. As we elaborate in what follows, this is no coincidence, as CL can be considerably improved on in practice.

To explain the insights underlying our improved algorithm, we revisit equation (200) and note that if we were to replace the empirical mutual information with the true mutual information, the output of the MWST would be the true edges of the tree. In light of this, the CL algorithm can be viewed as a "plug-in" estimator that replaces the true mutual information with an estimate of it, namely the empirical mutual information. Naturally, then, it is to be expected that a better estimate of the mutual information would lead to a smaller probability of error in identifying the tree. It is thus natural to suspect that using our estimator for mutual information in lieu of the empirical mutual information in the CL algorithm would lead to performance boosts. It is gratifying to find this intuition confirmed in all the experiments that we conducted.
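A minimal sketch of this modified procedure is given below (our illustration, not the authors' code): estimate all pairwise mutual informations with an entropy-based estimator and feed them to a maximum weight spanning tree routine, cf. (200). The plug-in entropy estimate stands in for the minimax rate-optimal estimator, and the helper names `pairwise_mi` and `max_spanning_tree` are ours.

```python
import numpy as np
from collections import Counter

def entropy_mle(samples):
    """Plug-in entropy estimate in nats; substituting the minimax rate-optimal
    estimator here is what distinguishes the modified CL from the original CL."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mi_estimate(x, y, entropy_estimator=entropy_mle):
    """I-hat(X;Y) = H-hat(X) + H-hat(Y) - H-hat(X,Y)."""
    return (entropy_estimator(x) + entropy_estimator(y)
            - entropy_estimator(list(zip(x, y))))

def pairwise_mi(data, entropy_estimator=entropy_mle):
    """data: (n, d) array of samples; returns the d x d matrix of MI estimates."""
    d = data.shape[1]
    w = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            w[i, j] = w[j, i] = mi_estimate(data[:, i], data[:, j], entropy_estimator)
    return w

def max_spanning_tree(w):
    """Prim-type maximum weight spanning tree on the weight matrix w, cf. (200)."""
    d = w.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                   key=lambda e: w[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Example: estimated tree edges from an (n, d) sample matrix `data`:
# edges = max_spanning_tree(pairwise_mi(data))
```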
In the following experiment, we fix d = 7, |𝒳| = 200, construct a star tree (i.e., all random variables are conditionally independent given X_1), and generate a random joint distribution by assigning independent Beta(0.5, 0.5)-distributed random variables to each entry of the marginal distribution P_{X_1} and of the transition probabilities P_{X_k|X_1}, 2 ≤ k ≤ d (with normalization). Then, we increase the sample size n from 10³ to 5.5 × 10⁴, and for each n we conduct 20 Monte Carlo simulations.

Note that the true tree has d − 1 = 6 edges, and any estimated set of edges will have at least one overlap with these 6 edges because the true tree is a star graph. We define the wrong-edges-ratio in this case as the number of edges different from the true set of edges divided by d − 2 = 5. Thus, if the wrong-edges-ratio equals one, the estimated tree is maximally different from the true tree and, at the other extreme, a ratio of zero corresponds to perfect reconstruction. We compute the expected wrong-edges-ratio over 20 Monte Carlo simulations for each n, and the results are exhibited in Figure 10.

Fig. 10: The expected wrong-edges-ratio of our modified algorithm and the original CL algorithm for sample sizes ranging from 10³ to 5.5 × 10⁴.

Figure 10 reveals intriguing phase transitions for both the modified and the original CL algorithm. When we have fewer than 3 × 10³ samples, both algorithms yield a wrong-edges-ratio of 1, but soon after the sample size exceeds 6 × 10³, the modified CL algorithm begins to reconstruct the network perfectly, while the original CL algorithm continues to fail maximally until the sample size exceeds 47 × 10³, 8 times the sample size required by the modified algorithm. The theoretical properties of these sharp phase transitions remain a topic for future work.

VI. CONCLUSIONS AND FUTURE WORK

The risk of any statistical estimator under L2 loss can be decomposed into squared bias and variance. Shrinkage [152], [154], [189] is particularly useful in reducing the variance by slightly sacrificing the bias. Approximation, which is introduced in this paper, turns out to be the counterpart of shrinkage: the methodology of approximation has demonstrated its efficacy in reducing the bias by slightly sacrificing the variance in constructing minimax rate-optimal estimators for H(P) and F_α(P) in this paper. We remark that in estimating functionals, especially low dimensional functionals of high dimensional parameters, the bias usually dominates the risk [29], and the methodology proposed in this paper has proved to be extremely effective in properly reducing the bias, and consequently achieving the minimax rates.

We show that the bias of a statistical estimator of a functional can be interpreted as the approximation error of a certain operator in approximating the function, which exhibits an intimate connection between statistics and approximation theory, the latter being a mature mathematical field studied for several centuries. Designing an estimator with smaller bias is equivalent to designing an approximation operator with smaller approximation error. However, the functional estimation problem is far more subtle than this connection alone suggests, since one has to control the bias and variance simultaneously. We have made significant efforts to understand this delicate trade-off, which is the foundation for our general methodology for functional estimation in Section II. However, it remains fertile ground for research.

This paper constitutes a first step towards a more comprehensive understanding of functional estimation. We have partially answered Paninski's question in [52, Section 8.3], since we demonstrated that the minimax rates are highly dependent on the properties of the functionals to be estimated. A general theory characterizing the minimax rates in functional estimation using approximation theoretic quantities is yet unavailable. An ambitious goal might be constructing the counterpart of what we know in nonparametric linear functional estimation [86], or nonparametric function estimation [190].

VII. ACKNOWLEDGMENTS

We are grateful to Gregory Valiant for introducing us to the entropy estimation problem, which motivated this work. We thank many approximation theorists for very helpful discussions, in particular, Dany Leviatan, Kirill Kopotun, Feng Dai, Volodymyr Andriyevskyy, Gancho Tachev, Radu Paltanea, Paul Nevai, Doron Lubinsky, Dingxuan Zhou, and Allan Pinkus. We thank Marcelo Weinberger for bringing up the question of comparing the difficulty between entropy estimation and data compression, and Thomas Courtade for proposing to view F_α(P) as moment generating functions of the information density random variable and for suggesting some tricks in the proof of Lemma 15. We thank Liam Paninski for interesting discussions related to Paninski [52]. We thank Martin Vinck for interesting discussions related to entropy estimation in physics and neuroscience. We thank Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi for stimulating discussions. We thank Yihong Wu for insightful discussions, in particular regarding a non-rigorous step in the proof of Lemma 16 in a previous version of the manuscript. We thank Maya Gupta for inspiring discussions, in particular for raising the question of whether Dirichlet prior smoothed plug-in entropy estimation can achieve the minimax rates, whose answer was shown to be negative in [58]. Finally, we thank Dmitri Sergeevich Pavlichin for translating articles from Bernstein's collected works [191], whose English versions are unavailable.

APPENDIX A
AUXILIARY LEMMAS

Lemma 15. If the support of distribution P is of size S, then

    Var(−ln P(X)) ≤ (ln S + 1)²  for S < 56,   and   Var(−ln P(X)) ≤ (3/4)(ln S)²  for S ≥ 56.                 (201)

The next lemma relates the minimax risk under the Poissonized model and that under the Multinomial model. We define the minimax risk for the Multinomial model with n observations on support size S for estimating the functional F as

    R(S, n) ≜ inf_{F̂} sup_{P ∈ M_S} E_Multinomial [ F̂ − F(P) ]²,                                              (202)

and the counterpart for the Poissonized model as

    R_P(S, n) ≜ inf_{F̂} sup_{P ∈ M_S} E_Poisson [ F̂ − F(P) ]².                                                (203)

The next lemma is an extension of Wu and Yang [27].

Lemma 16. The minimax risks under the Poissonized model and the Multinomial model are related via the following inequalities:

    R_P(S, 2n) − e^{−n/4} sup_{P ∈ M_S} |F(P)|² ≤ R(S, n) ≤ 2 R_P(S, n/2).                                     (204)

The following lemma characterizes the best polynomial approximation error of x^α over [0, 1] in a very precise sense. Concretely, denoting the best polynomial approximation error with order at most n for a function f as E_n[f], we have the following lemma.

Lemma 17. The following limit exists for any α > 0:

    lim_{n→∞} n^{2α} E_n[x^α]_{[0,1]} = μ(2α) / 2^{2α},                                                        (205)

where μ(p) ≜ lim_{n→∞} n^p E_n[|x|^p]_{[−1,1]}, p > 0, is the Bernstein function introduced by [164]. For a non-asymptotic

 2 α  2 α α π c1 4π c1 bound, denote the best polynomial approximation of x , α > 0 where c3 = 2 c2 for 0 < α < 1, and c3 = 3 c2 Pn k 2 2 to the n-th degree by k=0 gk,αx , and define Rn,α(x) , for 1 < α < 3/2. When n (or equivalently, K) is large enough, Pn k gk,αx as the best approximation polynomial without k=1 we could take α the constant term. Then for 0 < α < 1, we have the norm 2µ(2α)c1 c3 = 2α , (215) bound c2 2α α  π  where the function µ(·) is the Bernstein function introduced max |Rn,α(x) − x | ≤ 2 . (206) 0≤x≤1 2n in Theorem 8. For 1 < α < 3/2, we have the norm bound Lemma 20. For all x ∈ [0, 4∆], there exists a constant C > 0 2α α π  such that max |Rn,α(x) − x | ≤ 3 , (207) 0≤x≤1 n K C X −k+1 k and the pointwise bound gk,H (4∆) x + x ln x ≤ . (216) n ln n D x k=1 |R (x) − xα| ≤ 1 , (208) n,α n2(α−1) Moreover, when n (equivalently, K) is large enough, we could where D > 0 is a universal positive constant. take C to be 1 4c1ν1(2) 1.81c1 Furthermore, for any α > 0, we have C = 2 ≈ 2 , (217) c2 c2 3n 3n |gk,α| ≤ 2 , |gk,H | ≤ 2 , k = 1, 2, ··· , n, (209) where the function ν1(p) is introduced in Lemma 18. where the coefficients g are defined in (48). k,H 1.81c1 According to Lemma 18, the asymptotic result C ≈ 2 c2 Although the Bernstein function µ(p) seems hard to an- starts to become very accurate even from very small values of alyze, we can compute it fairly easily using well-developed K such as 5. The following lemma gives some tails bounds machinery in numerical analysis. For example, [192] showed for Poisson and Binomial random variables. the following bound on µ(1) using analytical methods: Lemma 21. [193, Exercise 4.7] If X ∼ Poi(λ), or X ∼ 0.2801685460... ≤ µ(1) ≤ 0.2801733791, (210) B(n, p), np = λ, then for any δ > 0, we have but we can easily obtain it numerically in the Chebfun system  δ λ [161] using polynomial approximation order roughly 100. e P(X ≥ (1 + δ)λ) ≤ 1+δ (218) We have the following result by Ibragimov [165]: (1 + δ)  −δ λ e 2 Lemma 18. The following limits exists: (X ≤ (1 − δ)λ) ≤ ≤ e−δ λ/2. (219) P (1 − δ)1−δ 2 ν1(2) 1 lim n En[−x ln x][0,1] = < . (211) n→∞ 2 2 Next lemma gives an upper bound on the k-th moment of a Poisson random variable. The function ν1(p) was introduced by Ibragimov [165] as the following limit for p positive even integer and m positive integer: Lemma 22. Let X ∼ Poi(λ), k be an positive integer. Taking p M = max{λ, k}, we have n p m lim En[|x| ln |x|][−1,1] = ν1(p). (212) k k n→∞ (ln n)m−1 EX ≤ (2M) . (220) This Lemma follows from Ibragimov [165, Thm. 9δ]. Note that Ibragimov [165] contained a small mistake where the limit The next two lemmas from Cai and Low [77] are simple 2 facts we will utilize in the analysis of our estimators. of n En[(1 − x) ln(1 − x)][−1,1] was wrongly computed to be 4ν (2), but it is supposed to be ν (2). Using numerical 1 1 1 computation provided by the Chebfun [161] toolbox, we obtain Lemma 23. [77, Lemma 4] Suppose (A) is an indicator that random variable independent of X and Y , then ν1(2) ≈ 0.453, (213) Var(X1(A) + Y 1(Ac)) c 2 c and this asymptotic result starts to be very accurate for small = Var(X)P(A) + Var(Y )P(A ) + (EX − EY ) P(A)P(A ). n such as 5. (221) The following two lemmas characterize the approximation error of xα and −x ln x when x is small. Lemma 24. [77, Lemma 5] For any two random variables X and Y , Lemma 19. For all x ∈ [0, 4∆], the following bound holds for 0 < α < 3/2, α 6= 1: Var(min{X,Y }) ≤ Var(X) + Var(Y ). 
(222) K In particular, for any random variable X and any constant C, X c3 g ∆−k+αxk − xα ≤ , (214) k,α (n ln n)α Var(min{X,C}) ≤ Var(X). (223) k=1 30

APPENDIX B Defining PROOFOF THEOREM 5, 6 ANDMAINLEMMAS S0   X cm + 1 N = 1 pˆ > , (235) A. Proof of Theorem 5 i n i=1 xα, α > 1 The convexity of yields then the random variable N follows a S  α N ∼ B(S0, p), and by the central limit theorem again we have X 1 1−α Fα(P ) ≥ = S , (224)   S S0 i=1 lim inf P N ≥ = 1. (236) n→∞ 6 hence for any δ > 0, ( ) ln Fˆ (Z) α Z : − Hα ≥ δ 1 − α n   o ˆ (1−α)δ ⊆ Z : Fα(Z) − Fα ≥ 1 − e Fα (225) n   o ˆ (1−α)δ 1−α ⊆ Z : Fα(Z) − Fα ≥ 1 − e S . (226) Given η , N/S0 ≥ 1/6, it follows from the convexity of xα, α > 1 that Theorem 3 implies that there exists a constant 0 < Cα < ∞ S such that X X X pˆα = pˆα + pˆα (237) 2 i i i   Cα ˆ i=1 1≤i≤S :ˆp > cm+1 1≤i≤S :ˆp ≤ cm+1 sup EP Fα − Fα(P ) ≤ 2α−2 , (227) 0 i n 0 i n P (n ln n) M α  1 − ηM α where the supremum is taken over all discrete distributions ≥ ηS0 · + (S0 − ηS0) · S0 S0 − ηS0 supported on countably infinite alphabet. Using this estimator (238) and applying Chebychev’s inequality, S1−α · f(η, M), (239) ! , 0 ln Fˆ α sup P − Hα ≥ δ where P 1 − α 1 S0 1 X   (1−α)δ 1−α ≥ ≥ M , S0 · pˆi (240) ≤ sup Fˆα − Fα ≥ 1 − e S (228) η N N P cm+1 P 1≤i≤S0:ˆpi> n 2 ˆ cm + 1 1 supP EP Fα − Fα ≥ S0 · ≥ 1 + . (241) ≤ (229) n cm (1−α)δ2 2−2α 1 − e S It can be easily checked that  2α−2 Cα S α α (1 − xy) ≤ 2 . (230) f(x, y) = xy + (242) 1 − e(1−α)δ n ln n (1 − x)α−1 α−1 The proof is finished by choosing ∂f α−1 αx(1 − xy) = αxy − α−1 (243) − 1 ∂y (1 − x) (1−α)δ2 ! 2α−2 αx  1 − e = · (y − xy)α−1 − (1 − xy)α−1 > 0, cα(δ, ) = . (231) (1 − x)α−1 Cα 0 < x < xy ≤ 1. (244) Hence, due to 0 < 1/6 ≤ η < 1 < M ≤ 1/η, we conclude B. Proof of Theorem 6 −1 that f(η, M) ≥ f(η, 1 + cm ) > f(η, 1) = 1. Since f(η, 1 + Since the central limit theorem claims that −1 cm ) is continuous with respect to η ∈ [1/6, cm/(cm +1)], we Poi(λ) − λ have √ N (0, 1) (232) λ ln f(η, M) Hˆα(Pn) ≤ ln S0 − (245) as λ → ∞, there exists λ0 > 0 such that α − 1 ln f(η, 1 + c−1) 1 ≤ ln S − m (Poi(λ) > λ + 1) ≥ , ∀λ ≥ λ . (233) 0 (246) P 3 0 α − 1 min ln f(η, 1 + c−1) n ≤ ln S − η∈[1/6,cm/(cm+1)] m Denoting cm , max{c, λ0}, we set S0 = d c e ≤ 0 n m α − 1 d c e ≤ S and consider the distribution P = (247) (1/S0, 1/S0,..., 1/S0, 0, 0,..., 0), then Hα(P ) = ln S0. < ln S0. (248) Under the Poissonized model npˆi ∼ Poi(npi), 1 ≤ i ≤ S, we have pˆi = 0 for i > S0, and Then the proof is completed by choosing  c + 1 1 −1 m minη∈[1/6,cm/(cm+1)] ln f(η, 1 + cm ) p , P pˆi > ≥ , ∀i = 1, 2, ··· ,S0. (234) δ (c) = > 0. (249) n 3 α α − 1 31

C. Proof of Lemma 2 this regime. For x ∈ [2t, 1], Uα(x) = f(x), hence (4) (4) |Uα (x)| = |f (x)| (260)

For p ≥ ∆, we do Taylor expansion of Uα(x) around x = p. α−4 We have = α(α − 1)(α − 2)(α − 3)x

1 00 2 Uα(x) = Uα(p) + Uα(p)(x − p) + Uα (p)(x − p) α(1 − α)(α − 1)(α − 2)(α − 3)(α − 4) 2 + xα−5 , 1 000 3 2n + Uα (p)(x − p) + R(x; p), (250) 6 (261) where the remainder term enjoys the following representations: which implies that for x ≥ 2t, Z x 1 3 (4) 12 R(x; p) = (x − u) Uα (u)du (251) sup |U (4)(z)| ≤ 6xα−4 + xα−5. 6 p α (262) z∈[x,1] n U (4)(ξ ) = α x (x − p)4, ξ ∈ [min{x, p}, max{x, p}]. 24 x (252) The first remainder is called the integral representation of Finally we consider x ∈ (t, 2t). Denoting y = x − t, the Taylor series remainders, and the second remainder is called derivatives of In(x) for x ∈ (t, 2t) are as follows: the Lagrange remainder. 630y4(t − y)4 I0 (x) = (263) Since p ≥ ∆, we know that n t9 2520y3(t − 2y)(t − y)3 α(1 − α) 00 0 α−1 α−2 In (x) = 9 (264) Uα(p) = αp + (α − 1)p (253) t 2n 2520y2(t − y)2(3t2 − 14ty + 14y2) α(1 − α)(α − 1)(α − 2) (3) 00 α−2 α−3 In (x) = 9 (265) Uα (p) = α(α − 1)p + p t 2n 2 2 (254) (4) 15120y(t − 2y)(t − y)(t − 7ty + 7y ) In (x) = 9 . (266) (3) α−3 t Uα (p) = α(α − 1)(α − 2)p α(1 − α)(α − 1)(α − 2)(α − 3) + pα−4 (255) 2n U (4)(p) = α(α − 1)(α − 2)(α − 3)pα−4 Considering the fact that y/t ∈ [0, 1], we can maximize α (i) α(1 − α)(α − 1)(α − 2)(α − 3)(α − 4) |In (x)| over x ∈ (t, 2t) for 1 ≤ i ≤ 4. With the help of + pα−5 2n Mathematica [194], we could show that for x ∈ (t, 2t), (256) 4 |I0 (x)| ≤ (267) n t Replacing x by random variable X in (250), where nX ∼ 20 |I00(x)| ≤ (268) Poi(np), p ≥ ∆, and taking expectations on both sides, we n t2 have 100 |I(3)(x)| ≤ (269) 1 p 1 p n 3 U (X) = U (p) + U 00(p) + U 000(p) + [R(X; p)] t E α α 2 α n 6 α n2 E 1000 |I(4)(x)| ≤ . (270) (257) n t4 α(α − 1)(α − 2)(5 − 3α) = pα + pα−2 12n2 α(1 − α)2(2 − α)(3 − α) − pα−3 + [R(X; p)] Plugging these upper bounds in (259), we know for x ∈ 12n3 E (258) (t, 2t) where we have used the fact that if nX ∼ Poi(np), then (4) 1000 α 4 × 100 α−1 20 α−2 |Uα (x)| ≤ t + t + 6 × t 2 p 3 p t4 t3 t2 E(X − p) = n , E(X − p) = n2 . 4 α−3 α−4 (4) + 4 × × 2t + 6t (271) Since the representation of R(x; p) involves Uα (ξx), it t (4) α−4 would be helpful to obtain some estimates of Uα (x) over ≤ 1558t (272) α [0, 1]. Denoting Uα(x) = In(x)f(x), where f(x) = x + ≤ 1558(x/2)α−4 (273) α(1−α) xα−1, we have 2n ≤ 24928xα−4. (274) (4) (4) (3) (1) (2) (2) (1) (3) (4) Uα (x) = In f + 4In f + 6In f + 4In f + Inf . (259) Hence, it suffices to bound each term in (259) separately. Now we proceed to upper bound |E[R(X; p)]|, p ≥ ∆. We For x ∈ [0, t], Uα(x) ≡ 0, so we do not need to consider consider the following two cases: 32

1) Case 1:x ≥ p/2. In this case, For the term B2, we have (4) U (ξ ) B2 = [|R(X; p)|1(X < p/2)] (292) α x 4 E |R(x; p)| = (x − p) (275) 24 8310(1 + α) α ≤ p P(nX < np/2). (293) 4 α(2 − α) (4) (x − p) ≤ sup |Uα (x)| (276) x∈[p/2,1] 24  12  (x − p)4 Applying Lemma 21, we have ≤ 6(p/2)α−4 + (p/2)α−5 . n 24 8310(1 + α) 8310(1 + α) B ≤ pαe−np/8 ≤ pαn−c1/8. (277) 2 α(2 − α) α(2 − α) (294) 2) Case 2: 0 ≤ x < p/2. In this case, denoting y = max{x, ∆/4}, Hence, we have |R(x; p)| Z p E[R(X; p)] ≤ E[|R(X; p)|] (295) 1 3 (4) ≤ (u − x) |Uα (u)|du (278) 1 1   p 3p2  6 y ≤ (p/2)α−4 + (p/2)α−5 + Z p 4 2n n3 n2 1 3 α−4 ≤ (u − x) 24928u du (279) 8310(1 + α) 6 y + pαn−c1/8. (296) Z p (u − x)3 α(2 − α) ≤ 4155 4−α du (280) y u Z p Plugging this into (258), we have for p ≥ ∆, = 4155 uα−1 − 3xuα−2 + 3x2uα−3 − x3uα−4 du α y |EUα(X) − p | (281) α(α − 1)(α − 2)(5 − 3α)  1 3x ≤ pα−2 = 4155 (pα − yα) − pα−1 − yα−1 12n2 α α − 1 1 1   p 3p2  3x2 x3  + (p/2)α−4 + (p/2)α−5 + + pα−2 − yα−2 − pα−3 − yα−3 4 2n n3 n2 α − 2 α − 3 8310(1 + α) (282) + pαn−c1/8 (297)  2  α(2 − α) 1 α α 3x α−2 α−2 ≤ 4155 (p − y ) + p − y 17pα−2 8310(1 + α) α α − 2 ≤ + pαn−c1/8 (298) (283) n2 α(2 − α)   2 2  17 8310(1 + α) α −c /8 1 α α 3 α x α x = + p n 1 . (299) = 4155 (p − y ) + p − y α 2−α α α − 2 p2 y2 n (c1 ln n) α(2 − α) (284)  1 3  ≤ 4155 pα + pα (285) For the upper bound on the variance Var(Uα(X)), denoting α 2 − α α α(1−α) α−1 f(p) = p + 2n p , for p ≥ ∆, we have 8310(1 + α) = pα. (286) Var(U (X)) = U 2(X) − ( U (X))2 (300) α(2 − α) α E α E α 2 2 2 2 = EUα(X) − f (p) + f (p) − (EUα(X)) Now we have (301) 1 2 2 E[|R(X; p)|] = E[|R(X; p)| (X ≥ p/2)] ≤ |EUα(X) − f (p)| 1 2 2 + E[|R(X; p)| (X < p/2)] (287) + |f (p) − (EUα(X) − f(p) + f(p)) | (302) , B1 + B2. (288) 2 2 2 = |EUα(X) − f (p)| + |(EUα(X) − f(p)) + 2f(p)(EUα(X) − f(p))| (303) 2 2 2 For the term B1, we have ≤ |EUα(X) − f (p)| + |EUα(X) − f(p)| 1 + 2f(p)|EUα(X) − f(p)|. (304) B1 = E[|R(X; p)| (X ≥ p/2)] (289)  12  ≤ 6(p/2)α−4 + (p/2)α−5 [(X − p)4]/24 (290) n E 2 2 Hence, it suffices to obtain bounds on |EUα(X) − f (p)| 1 1   p 3p2  and | U (X) − f(p)|. Denoting r(x) = U 2(x), we know that ≤ (p/2)α−4 + (p/2)α−5 + , (291) E α α 4 2n n3 n2 r(x) ∈ C4[0, 1], and it follows from Taylor’s formula and the integral representation of the remainder term that where we have used the fact that if nX ∼ Poi(np), then 4 2 2 4 2 0 E(X − p) = (np + 3n p )/n . r(X) = f (p) + r (p)(X − p) + R1(X; p), (305) 33

Z X 00 hence for x ∈ [t, 2t], R1(X; p) = (X − u)r (u)du (306) p 00 20 α α−2 4 α−1 1 00 2 |Uα (x)| ≤ 2 t + t + 2 t (324) = r (ηX )(X − p) , t t 2 ≤ 30tα−2 (325) ηX ∈ [min{X, p}, max{X, p}]. (307) ≤ 30(x/2)α−2 (326) ≤ 120xα−2. (327) Similarly, we have 0 Uα(X) = f(p) + f (p)(X − p) + R2(X; p), (308) Now we are in the position to bound |ER1(X; p)| and Z X | R2(X; p)|. 00 E R2(X; p) = (X − u)Uα (u)du (309) p 1 We have = U 00(ν )(X − p)2, 2 α X |ER1(X; p)| ≤ E|R1(X; p)| (328) ν ∈ [min{X, p}, max{X, p}]. (310) X 1 = E[|R1(X; p) (X ≥ p/2)|] + [R (X; p)1(X < p/2)] (329) Taking expectation on both sides with respect to X, where E 1 1  nX ∼ Poi(np), p ≥ ∆, we have ≤ 4(p/2)2α−2(X − p)2 E 2 2 2 |EUα(X) − f (p)| = |ER1(X; p)|. (311) 1 + E[R1(X; p) (X < p/2)] (330) Similarly, we have p2α−1 = 8 + sup |R (x; p)| (nX < np/2) n 1 P |EUα(X) − f(p)| = |ER2(X; p)|. (312) x≤p/2 (331) p2α−1 U (x) −c1/8 As we did for function α , now we give some upper ≤ 8 + sup |R1(x; p)|n , (332) estimates for |r00(x)| over [0, 1]. Over regime [0, t], r(x) ≡ 0, n x≤p/2 so we ignore this regime. Over regime [2t, 1], since Uα(x) = where in the last step we have applied Lemma 21. α α(1−α) α−1 f(x), f(x) = x + 2n x , we have 0 0 r (x) = 2ff (313) Regarding supx≤p/2 |R1(x; p)|, for any x ≤ p/2, denoting r00(x) = 2(f 0)2 + 2ff 00. (314) y = max{x, ∆/4}, we have Z p 00 Hence, for x ≥ 2t, R1(x; p) = (u − x)r (u)du (333) x sup |r00(z)| ≤ 4x2α−2. (315) Z p 2α−2 z∈[x,1] ≤ (u − x)432u du (334) y 00 α−2 p sup |Uα (z)| ≤ x . (316) Z z∈[x,1] ≤ 432 u2α−1du (335) y 432 Over regime [t, 2t], we have = (p2α − y2α) (336) 2α r0(x) = 2ff 0I2 + 2I I0 f 2 (317) 432 n n n ≤ p2α (337) 00  0 2 2 00 2 0 0 2α r (x) = 2 (f ) In + ff In + 2ff InIn 216 ≤ p2α. (338) 0 2 2 00 2 0 0  α + (In) f + InIn f + 2ff InIn . (318) Hence, we have for x ∈ [t, 2t], Hence, we have 2   2α−1 00  2α−2 2α−2 2α−1 4 4 2α 8p 216 |r (x)| ≤ 2 t + t + 2t + t | R (X; p)| ≤ + p2αn−c1/8. (339) t t E 1 n α 20 4 + t2α + 2t2α−1 (319) t2 t Analogously, we obtain the following bound for ≤ 108t2α−2 (320) |ER2(X; p)|: ≤ 108(x/2)2α−2 (321) pα−1 120 = 432x2α−2 (322) | R (X; p)| ≤ 2 + pαn−c1/8. (340) E 2 n α Also, over regime [t, 2t], 00 00 00 0 0 Uα (x) = In f + Inf + 2Inf , (323) Plugging these estimates of |ER1(X; p)| and |ER2(X; p)| 34

00 00 into (304), we have for p ≥ ∆, c1 ln n ≥ 1, some upper bounds on |r (x)| and |Uα (x)|. Over regime [0, t], 2 00 00 r(x) = Uα(x) ≡ 0, we have r (x) = Uα (x) = 0. Over regime Var(Uα(X)) [2t, 1], since Uα(x) = f(x), we have 8p2α−1 216 ≤ + p2αn−c1/8 00 0 2 00 n α |r (x)| = |2(f ) + 2ff | (350) α−1 2 α α−2  pα−1 120 2 ≤ |2(αx ) + 2x · αx | (351) + 2 + pαn−c1/8 n α ≤ 8x2α−2 (352)  pα−1 120  |U 00(x)| = |f 00(x)| ≤ αxα−2 ≤ 2xα−2. (353) + 2f(p) 2 + pαn−c1/8 . (341) α n α

We need to distinguish two cases: 0 < α ≤ 1/2, and 1/2 < Over regime [t, 2t], we have α < 1. 00 1) 0 < α ≤ 1/2: in this case, we have |r (x)| = 2|(f 0)2I2 + ff 00I2 + 2ff 0I I0 Var(U (X)) n n n n α 0 2 2 00 2 0 0 2α−1 + (In) f + InIn f + 2ff InIn| (354) 8p 216 2α −c /8 ≤ + p n 1  4 42 n α ≤ 2 α2x2α−2 + αx2α−2 + 2αx2α−1 · + x2α  α−1 2 t t p 120 α −c /8 + 2 + p n 1  n α 20 2α 2α−1 4 + 2 · x + 2αx · (355)  pα−1 120  t t + 2f(p) 2 + pαn−c1/8 (342) ≤ 400x2α−2, (356) n α 8 216 and 2α −c1/8 ≤ 2α 1−2α + p n n (c1 ln n) α 00 00 0 0 00 |U (x)| = |I f + 2I f + Inf | (357)  2α−2  α n n 4p 14400 2α −c /4 + 2 + p n 1 (343) 20 α α 4 α−2 2 2 ≤ · x + 2αx · + αx (358) n α t2 t    α−1  α−2 α 1 p 120 α −c /8 ≤ 120x , + 2p 1 + 2 + p n 1 (359) 8c1 ln n n α (344) where we have used the inequality 16 8 0 4 00 20 |In(x)| ≤ 1, |I (x)| ≤ , |I (x)| ≤ , ∀x ∈ [t, 2t]. ≤ 2α 1−2α + 2α 2−2α n n 2 n (c1 ln n) n (c1 ln n) t t 576 28800 (360) + p2αn−c1/8 + p2αn−c1/4 (345) α α2 24 576 00 2α −c1/8 Noting that we have obtained a norm bound for |r (x)| over ≤ 2α 1−2α + p n n (c1 ln n) α all regimes expressed as 28800 + p2αn−c1/4. (346) 00 2α−2 α2 |r (x)| ≤ 400x ≤ 400, ∀x ∈ [0, 1], (361) 2) 1/2 < α < 1: in this case, we have we have the upper bound

Var(Uα(X)) |ER1(X; p)| ≤ E|R1(X; p)| (362) 2α−1 1 8p 216 2α −c /8 00 2 ≤ + p n 1 = E|f (ηX )(X − p) | (363) n α 2 2  α−1 2 ≤ 200 |(X − p) | (364) p 120 α −c /8 E + 2 + p n 1 200p n α = . (365)  pα−1 120  n + 2f(p) 2 + pαn−c1/8 (347) n α For the upper bound of |ER2(X; p)|, we first consider the upper bound of |R (x; p)| when x ≤ p/2. Denoting y = 8p2α−1 216 8p2α−2 2 ≤ + p2αn−c1/8 + max{x, ∆/4}, we have n α n2  α−1  Z p 28800 2α −c /4 α p 120 α −c /8 00 1 1 R2(x; p) = (u − x)U (u)du (366) + 2 p n + 3p 2 + p n α α n α x (348) Z p ≤ (u − x)120uα−2du (367) 14p2α−1 576 28800 2α −c1/8 2α −c1/4 y ≤ + p n + 2 p n Z p n α α α−1 8 ≤ 120u du (368) y + 2α 2−2α . (349) n (c1 ln n) 120 ≤ pα, (369) For 1 < α < 3/2, following the same procedures, we obtain α 35 then Since p ≥ ∆, we know that 0 |ER2(X; p)| UH (p) = − ln p − 1 (381) 00 ≤ E|R2(X; p)| (370) UH (p) = −1/p (382) = |R (X; p)1(X ≥ p/2)| + |R (X; p)1(X < p/2)| (3) 2 E 2 E 2 UH (p) = 1/p (383) (371) U (4)(p) = −2/p3 (384)  α−2  H 1 p 2 ≤ E · 2 (X − p) 2 2 Replacing x by random variable X in (378), where nX ∼  np Poi(np), p ≥ ∆, and taking expectations on both sides, we + sup |R2(x; p)| · P nX < (372) x

(t, 2t) For the term B1, we have (4) 1 |UH (x)| B1 = E[|R(X; p)| (X ≥ p/2)] (416) 1000 4 × 100 20 2(X − p)4  ≤ t ln(1/t) + ln(1/t) + 6 × 1/t ≤ (417) t4 t3 t2 E 3p3 4 2 3 2 2 + 4 × × 1/t + 2/t (398) = + , (418) t 3p2n3 pn2 ln(1/t) ≤ 1538 3 (399) where we have used the fact that if nX ∼ Poi(np), then t 4 2 2 4 ln(2/x) E(X − p) = (np + 3n p )/n . ≤ 1538 3 (400) (x/2) For the term B2, we have 1 + ln(1/x) ≤ 12304 . (401) B = [|R(X; p)|1(X < p/2)] (419) x3 2 E ≤ 8024p (ln(1/p) + 2) P(nX < np/2). (420) Now we proceed to upper bound | [R(X; p)]|, p ≥ ∆. We E Applying Lemma 21, we have consider the following two cases: −np/8 B2 ≤ 8024 (p ln(1/p) + 2p) e (421) = 8024 (p ln(1/p) + 2p) n−c1/8. (422) 1) Case 1:x ≥ p/2. In this case,

U (4)(ξ ) Hence, we have H x 4 |R(x; p)| = (x − p) (402) 24 E[R(X; p)] ≤ E[|R(X; p)|] (423) 4 2 2 (4) (x − p) ≤ + + 8024 (p ln(1/p) + 2p) n−c1/8. ≤ sup |U (x)| (403) 2 3 2 H 24 3p n pn x∈[p/2,1] (424) 2 (x − p)4 ≤ (404) (p/2)3 24 Plugging this into (386), we have for p ≥ ∆, 2(x − p)4 | UH (X) + p ln p| = 3 . (405) E 3p 1 2 2 ≤ + + + 8024 (p ln(1/p) + 2p) n−c1/8 2) Case 2: 0 ≤ x < p/2. In this case, denoting y = 6pn2 3p2n3 pn2 max{x, ∆/4}, (425) 3 2 |R(x; p)| ≤ + + 8024 (p ln(1/p) + 2p) n−c1/8 (426) pn2 3p2n3 1 Z p ≤ (u − x)3|U (4)(u)|du (406) 3 2 6 H ≤ + + 8024 (p ln(1/p) + 2p) n−c1/8. y c n ln n 3(c ln n)2n Z p 1 1 1 3 1 + ln(1/u) (427) ≤ (u − x) 12304 3 du (407) 6 y u Z p (u − x)3(1 + ln(1/u)) For the upper bound on the variance Var(UH (X)), recalling 1 ≤ 2051 3 du (408) that f(p) = −x ln x + , for p ≥ ∆, we have y u 2n p 3 2 2 3 Z (u − 3xu + 3x u − x )(1 + ln(1/u)) Var(UH (X)) = 2051 3 du 2 2 y u = EUH (X) − (EUH (X)) (428) (409) 2 2 2 2 = EUH (X) − f (p) + f (p) − (EUH (X)) (429) p 3 2 Z (u + 3x u)(1 + ln(1/u)) 2 2 2 2 ≤ | U (X) − f (p)| + |f (p) − ( UH (X) − f(p) + f(p)) | ≤ 2051 3 du (410) E H E y u (430) p Z  3x2  2 2 2 = 2051 1 + (1 + ln(1/u)) du (411) = |EUH (X) − f (p)| + |(EUH (X) − f(p)) u2 y + 2f(p)( U (X) − f(p))| (431) Z p E H 2 2 2 ≤ 8204 (1 + ln(1/u)) du (412) ≤ |EUH (X) − f (p)| + |EUH (X) − f(p)| y + 2f(p)|EUH (X) − f(p)|. (432) ≤ 8024p (ln(1/p) + 2) . (413) 2 2 Now we have Hence, it suffices to obtain bounds on |EUH (X) − f (p)| and | U (X) − f(p)|. Denoting r(x) = U 2 (x), we know 1 E H H E[|R(X; p)|] = E[|R(X; p)| (X ≥ p/2)] that r(x) ∈ C4[0, 1], and it follows from Taylor’s formula and + E[|R(X; p)|1(X < p/2)] (414) the integral representation of the remainder term that

B1 + B2. (415) 2 0 , r(X) = f (p) + r (p)(X − p) + R1(X; p), (433) 37

Z X 00 hence for x ∈ [t, 2t], R1(X; p) = (X − u)r (u)du (434) p 00 20 1 4 1 00 2 |UH (x)| ≤ 2 (t ln(1/t)) + + 2 ln(1/t) (452) = r (ηX )(X − p) , t t t 2 30 ≤ ln(1/t) (453) ηX ∈ [min{X, p}, max{X, p}]. (435) t 30 Similarly, we have ≤ ln(2/x) (454) x/2 U (X) = f(p) + f 0(p)(X − p) + R (X; p), (436) 60 H 2 ≤ (ln(1/x) + 1). (455) x Z X 00 Now we are in the position to bound | R1(X; p)| and R2(X; p) = (X − u)UH (u)du (437) E p |ER2(X; p)|. 1 = U 00 (ν )(X − p)2, We have 2 H X | R1(X; p)| νX ∈ [min{X, p}, max{X, p}]. (438) E ≤ E|R1(X; p)| (456) Taking expectation on both sides with respect to X, where 1 1 = E[|R1(X; p) (X ≥ p/2)|] + E[R1(X; p) (X < p/2)] nX ∼ Poi(np), p ≥ ∆, we have (457) | U 2 (X) − f 2(p)| = | R (X; p)|. (439) 1  E H E 1 ≤ × 4(ln(p/2))2(X − p)2 + [R (X; p)1(X < p/2)] E 2 E 1 Similarly, we have (458) | U (X) − f(p)| = | R (X; p)|. (440) 2 E H E 2 = 2p(ln p − ln 2) /n + sup |R1(x; p)|P(nX < np/2) x≤p/2 As we did for function UH (x), now we give some upper (459) 00 estimates for |r (x)| over [0, 1]. Over regime [0, t], r(x) ≡ 0, 2 −c1/8 = 2p(ln p − ln 2) /n + sup |R1(x; p)|n , (460) so we ignore this regime. Over regime [2t, 1], since UH (x) = x≤p/2 f(x), we have where in the last step we have applied Lemma 21. 0 0 r (x) = 2ff (441) Regarding supx≤p/2 |R1(x; p)|, for any x ≤ p/2, denoting r00(x) = 2(f 0)2 + 2ff 00. (442) y = max{x, ∆/4}, we have Z p x ≥ 2t 00 Hence, for , R1(x; p) = (u − x)r (u)du (461) x sup |r00(z)| ≤ 4(ln x)2. (443) Z p z∈[x,1] ≤ (u − x)216 (ln u)2 + 1 du (462) 00 y sup |Uα (z)| ≤ 1/x. (444) Z p z∈[x,1] ≤ 216 u (ln u)2 + 1 du (463) y Over regime [t, 2t], we have ≤ 54p2 2(ln p)2 − 2 ln p + 3 . (464) r0(x) = 2ff 0I2 + 2I I0 f 2 (445) n n n Hence, we have 00  0 2 2 00 2 0 0 r (x) = 2 (f ) I + ff I + 2ff InI n n n 2 2 2 −c1/8 |ER1(X; p)| ≤ 2p(ln p−ln 2) /n+54p 2(ln p) − 2 ln p + 3 n . 0 2 2 00 2 0 0  + (In) f + InIn f + 2ff InIn . (446) (465) Analogously, we obtain the following bound for Hence, we have for x ∈ [t, 2t], |ER2(X; p)|: 00  2 4 1 |r (x)| ≤ 2 (ln t) + ln(1/t) + 2(t ln(1/t))(ln(1/t)) | R (X; p)| ≤ + 60 (p ln(1/p) + 2p) n−c1/8. (466) t E 2 n 20 + (4/t)2t2(ln t)2 + t2(ln t)2 t2 Plugging these estimates of |ER1(X; p)| and |ER2(X; p)| 4 into (432), we have for p ≥ ∆, + 2(t ln(1/t)) ln(1/t) (447) t Var(UH (X)) ≤ 108(ln t)2 (448) 2 2 2 −c /8 ≤ 2p(ln p − ln 2) /n + 54p 2(ln p) − 2 ln p + 3 n 1 ≤ 108(ln(x/2))2 (449)  2 2 1 = 216((ln x) + 1), (450) + + 60 (p ln(1/p) + 2p) n−c1/8 (467) n where we have used the fact that | ln 2| ≈ 0.69 < 1. Also,  1   1  + 2 p ln(1/p) + + 60 (p ln(1/p) + 2p) n−c1/8 . over regime [t, 2t], 2n n 00 00 00 0 0 (468) UH (x) = In f + Inf + 2Inf , (451) 38

E. Proof of Lemma 4 we know 2 K  k! X 8c1 ln n S2 (X) ≤ 26K (4∆)−k+α (480) E K,α n k=1 K !2 X ≤ 26K 22K (4∆)−k+α(4∆)k (481) k=1 K !2 We first bound the bias term. It follows from differentiating X ≤ 26K 22K (4∆)α (482) the moment generating function of the that if X ∼ Poi(λ), then k=1 = 28K K2(4∆)2α (483) r X(X − 1) ... (X − r + 1) = λ , (469) 8c2 ln 2 2 2α E ≤ n (c2 ln n) (4c1 ln n/n) (484) 2+2α for any r positive integer. (4c1 ln n) ≤ n8c2 ln 2 (485) n2α

Then, we know that for nX ∼ Poi(np), The proof for the SK,H (x) case is essentially the same as

K that for SK,α(x) via replacing α by 1 and applying Lemma 20 X −k+α k rather than Lemma 19. ESK,α(X) = gk,α(4∆) p . (470) k=1

Applying Lemma 19, we know that for all p ≤ 4∆, c | S (X) − pα| ≤ 3 . (471) F. Proof of Lemma 5 E K,α (n ln n)α

We first bound the bias term. It follows from the property Now we bound the second moment of SK,α(X). Denote of Poisson distribution that K k−1 X Y S (X) = g (4∆)−k+αpk. (486) Ek,n(x) = (x − r/n), (472) E K,α k,α r=0 k=1 we have It follows from a variation of the pointwise bound in Lemma 17 that K !2 2 X −k+α 2 1/2 α  α−1 S (X) ≤ |g |(4∆) E (X) D1(4∆) p 4c1 E K,α k,α E k,n | S (X) − pα| ≤ · = D p, E K,α 2(α−1) 1 2 k=1 K 4∆ c2n ln n (473) (487) K !2 6K X −k+α 2 1/2 which completes the proof of the first part of Lemma 5. For ≤ 2 (4∆) EEk,n(X) the variance, denote k=1 k−1 (474) Y  r  E (x) = x − , (488) k,n n Here we have used Lemma 17. r=0 we have Since K ≤ 4n∆, applying Lemma 22, k−1 2 1 Y 2 1 2k EEk,n(X) = E (nX − r) ≤ E[nX] (489) k−1 n2k n2k 2 1 Y 2 r=0 EEk,n(X) = 2k E (nX − r) (475) 2k ( ) n 1 X 2k r=0 = (np)i (490) k−1 n2k i 1 Y 2 i=1 ≤ 2k E (nX) (476) 2k n 1 X 2k r=0 ≤ i2k−i(np)i (491) 1 n2k i 2k i=1 = 2k E(nX) (477) n 2k   1 1 X 2k 2k−i ≤ (8c ln n)2k (478) ≤ (2k) np (492) 2k 1 n2k i n i=1  2k 8c1 ln n (2k + 1)2kp = , (479) ≤ , (493) n n2k−1 39

( ) k have where is the Stirling numbers of the second kind, and i |B(ξ)| we have used the inequality [195]

( )   = S (X) (Y ≤ 2∆) k k k−i E K,α P ≤ i . (494) i i 1 − [E(SK,α(X) − 1) (SK,α ≥ 1)] P(Y ≤ 2∆)

α Hence, we can bound the second moment of SK,α(X) as + EUα(X)P(Y > 2∆) − p (503)

K !2 X S2 (X) ≤ |g |(4∆)−k+α( E2 (X))1/2 α E K,α k,α E k,n = ESK,α(X) − p k=1 (495) 1 − [E(SK,α(X) − 1) (SK,α ≥ 1)] P(Y ≤ 2∆) K √ !2 X (2k + 1)k p ≤ 26K (4∆)−k+α (496) k− 1 + (EUα(X) − ESK,α(X)) P(Y > 2∆) (504) n 2 k=1 K √ !2 α  −k+α k ≤ |ESK,α(X) − p | 10K X 4c1 ln n K p ≤ 2 1 (497) 1 n k− 2 + E(SK,α(X) − 1) (SK,α ≥ 1) k=1 n 2 + (|EUα(X)| + |ESK,α(X)|) P(Y > 2∆) (505) 10K 2α K ! 2 (4c1 ln n) X c2 k ≡ B1 + B2 + B3. (506) ≤ 2α−1 ( ) p (498) n 4c1 k=1 2α 2 Now we bound B1,B2,B3 separately. It follows from (4c1 ln n) K ≤ 210c2 ln 2 p (499) Lemma 4 that n2α−1 2α+2 α c3 1 10c ln 2 (4c1 ln n) p B1 = |ESK,α(X) − p | ≤ . ≤ 2 2 , (500) (n ln n)α . (n ln n)α n2α−1 (507) given c2 < 4c1, where we have used Lemma 17. Now consider B2. Note that for any random variable Z and any constant λ > 0, −1 2 −1 2 E(Z1(Z ≥ λ)) ≤ λ E(Z 1(X ≥ λ)) ≤ λ EZ . (508) Hence, we have 1 B2 = E(SK,α(X) − 1) (SK,α ≥ 1) (509) 1 ≤ ESK,α (SK,α ≥ 1) (510) 2 ≤ ESK,α (511) 2+2α (4c1 ln n) ≤ n8c2 ln 2 (512) n2α G. Proof of Lemma 6 (ln n)2+2α , (513) . n2α− where we used Lemma 4 in the last step. Now we deal with B3. We have c | S (X)| ≤ pα + 3 (514) We apply Lemma 23 and Lemma 24 to calculate the bias E K,α (n ln n)α and variance of ξ. c ln nα c ≤ 1 + 3 (515) n (n ln n)α α(1 − α) E|Uα(X)| ≤ sup |Uα(x)| ≤ 1 + (516) 1) Case 1: p ≤ ∆ x∈[0,1] 2c1 ln n 1 Claim: when p ≤ ∆, we have ≤ 1 + (517) 8c ln n 1 1 |B(ξ)| (501) P(Y ≥ 2∆) = P(nY ≥ 2n∆) (518) . (n ln n)α ≤ (e/4)c1 ln n (519) (ln n)2+2α Var(ξ) . (502) −c1 ln(4/e) . n2α− = n , (520) Now we prove this claim. In this regime, we write where we have used Lemma 4 and Lemma 21. Thus, Lα(X) = SK,α − (SK,α(X) − 1)1(SK,α(X) ≥ 1). We 40

we have Regarding B3, applying Lemma 2, we have 17 8310(1 + α) B3 = (| SK,α(X)| + |Uα(X)|) (Y ≥ 2∆) (521) α −c1/8 E E P B3 ≤ + p n (539) −c ln(4/e) nα(c ln n)2−α α(2 − α) n 1 . (522) 1 . 1 + n−2α. (540) To sum up, we have the following bound on |B(ξ)|: . nα(ln n)2−α 2+2α 1 (ln n) −16α 1 To sum up, we have |B(ξ)| . α + 2α− + n . α . (n ln n) n (n ln n) 1 (ln n)2+2α 1 (523) −2α |B(ξ)| . α + 2α− + α 2−α + n We now consider the variance. It follows from (n ln n) n n (ln n) Lemma 23 and Lemma 24 that (541) 1 . (542) Var(ξ) . (n ln n)α ≤ Var(SK,α(X)) + Var(Uα(X))P(Y > 2∆) For the variance, we have 2 + (ELα(X) − EUα(X)) P(Y > 2∆) (524) 2 2  Var(ξ) ≤ Var(SK,α(X)) + Var(Uα(X)) ≤ ESK,α(X) + EUα(X) + 1 + 2|EUα(X)| P(Y > 2∆) 2 (525) + (ELα(X) − EUα(X)) . (543) 2+2α  2 (4c1 ln n) 1 Applying Lemma 4, we have ≤ n8c2 ln 2 + 1 + n−c1 ln(4/e) 2α 2+2α 2+2α n 8c1 ln n (4c1 ln n) (ln n) Var(S (X)) ≤ n8c2 ln 2 . (526) K,α n2α . n2α− (ln n)2+2α (544) + n−8α (527) . n2α− Lemma 2 implies that when 0 < α ≤ 1/2, 2+2α (ln n) Var(U (X)) . (528) α n2α− 24 576 28800 2α −c1/8 2α −c1/4 ≤ 2α 1−2α + p n + 2 p n , 2) Case 2: ∆ < p ≤ 4∆. n (c1 ln n) α α Claim: when ∆ < p ≤ 4∆, we have (545) 1 and when 1/2 < α < 1, |B(ξ)| (529) . (n ln n)α Var(Uα(X))  (ln n)2+2α  n2α− 0 < α ≤ 1/2 14p2α−1 576 28800 Var(ξ) (530) 2α −c1/8 2α −c1/4 . (ln n)2+2α p2α−1 ≤ + p n + 2 p n  2α− + 1/2 < α < 1 n α α n n 8 + Now we prove this claim. In this case, 2α 2−2α (546) n (c1 ln n) |B(ξ)| 2 Regarding (ELα(X) − EUα(X)) , we have

2 α ( L (X) − U (X)) = (ELα(X) − p )P(Y ≤ 2∆) E α E α h α 1 ≤ |ESK,α(X) − p | + E(SK,α(X) − 1) (SK,α ≥ 1)

α i2 + (EUα(X) − p )P(Y > 2∆) (531) α + |EUα(X) − p | (547) α α 2+2α ≤ | Lα(X) − p | + | Uα(X) − p | (532) h c3 (4c1 ln n) 17 E E ≤ + n8c2 ln 2 + α 1 α 2α α 2−α ≤ |ESK,α(X) − p | + E(SK,α(X) − 1) (SK,α ≥ 1) (n ln n) n n (c1 ln n) α 8310(1 + α) i2 + |EUα(X) − p | (533) + pαn−c1/8 (548) α(2 − α) ≡ B + B + B . (534) 1 2 3 1 . (549) It follows from Lemma 4 that . (n ln n)2α c 1 B = | S (X) − pα| ≤ 3 . To sum up, we have 1 E K,α (n ln n)α . (n ln n)α  2+2α (535) (ln n)  n2α− 0 < α ≤ 1/2 As in (513), we have Var(ξ) . 2+2α 2α−1 (550)  (ln n) + p 1/2 < α < 1 1 n2α− n B2 = E(SK,α(X) − 1) (SK,α ≥ 1) (536) 2+2α 3) Case 3: p > 4∆. (4c1 ln n) ≤ n8c2 ln 2 (537) n2α (ln n)2+2α . (538) . n2α− 41

Claim: when p > 4∆, we have When 1/2 < α < 1, we have 1 Var(ξ) |B(ξ)| . α 2−α (551) n (ln n) 14p2α−1 576 28800  2α −c1/8 2α −c1/4 1 ≤ + p n + 2 p n  2α 1−2α 0 < α ≤ 1/2 n α α Var(ξ) n (ln n) . 1 p2α−1 8  n2α(ln n)1−2α + n 1/2 < α < 1 + 2α 2−2α + 3P(Y ≤ 2∆) (563) n (c1 ln n) (552) 14p2α−1 576 28800 ≤ + p2αn−c1/8 + p2αn−c1/4 Now we prove this claim. In this case, n α α2 8 |B(ξ)| + + 3n−c1/2 (564) n2α(c ln n)2−2α α α 1 ≤ |EUα(X) − p | + (|ELα(X)| + p ) P(Y ≤ 2∆) 1 p2α−1 (553) + . (565) . n2α(ln n)1−2α n 17 8310(1 + α) α −c1/8 ≤ α 2−α + p n n (c1 ln n) α(2 − α) + 2P(Y ≤ 2∆) (554) 17 8310(1 + α) α −c1/8 ≤ α 2−α + p n n (c1 ln n) α(2 − α) + 2P(nY ≤ 2n∆) (555) 17 8310(1 + α) α −c1/8 H. Proof of Lemma 7 ≤ α 2−α + p n n (c1 ln n) α(2 − α) + 2e−c1/2 ln n (556) 17 8310(1 + α) α −c1/8 −c1/2 We use Lemma 2, Lemma 4 and Lemma 5 to compute the = α 2−α + p n + 2n n (c1 ln n) α(2 − α) bias and variance of ξ. We distinguish four cases. (557) 1 . (558) 1) Case 1: p ≤ 1 . . nα(ln n)2−α n ln n In this regime, we write Lα(X) = SK,α(X) − Regarding the variance, we have (SK,α(X) − 1)1(SK,α(X) ≥ 1) and bound the bias Var(ξ) as

≤ Var(Uα(X)) |B(ξ)|

+ Var(L (X)) + ( L (X) − U (X))2 (Y ≤ 2∆). α E α E α P = ESK,α(X)P(Y ≤ 2∆) (559) 1 − [E(SK,α(X) − 1) (SK,α(X) ≥ 1)]P(Y ≤ 2∆) When 0 < α ≤ 1/2, we have α + EUα(X)P(Y > 2∆) − p (566) Var(ξ) ≤ | S (X) − pα| 24 576 E K,α 2α −c1/8 ≤ 2α 1−2α + p n + (SK,α(X) − 1)1(SK,α(X) ≥ 1) n (c1 ln n) α E 28800 + (| Uα(X)| + | SK,α(X)|) (Y > 2∆) (567) 2α −c1/4 E E P + 2 p n + 3P(Y ≤ 2∆) (560) α ≡ B1 + B2 + B3. (568) 24 576 2α −c1/8 ≤ 2α 1−2α + p n Now we bound B1,B2,B3 separately. It follows from n (c1 ln n) α 28800 Lemma 5 that 2α −c1/4 −c1/2 + 2 p n + 3n (561) α α B1 = |ESK,α(X) − p | (569) 1 . (562)  4c α−1 . n2α(ln n)1−2α 1 ≤ D1 2 p (570) c2n ln n p , (571) . (n ln n)α−1 1 B2 = E(SK,α(X) − 1) (SK,α(X) ≥ 1) (572) 2 ≤ ESK,α (573) 2α+2 (4c1 ln n) p ≤ n10c2 ln 2 (574) n2α−1 (ln n)2α+2p . (575) . n2α−1− 42

1 Now we consider B3. Since where we have used the fact that when p ≤ n ln n , α 2 2α |EUα(X)| ≤ |EX | (576) |EUα(X)| ≤ |EX | (595) 1 1 ≤ | (nX)α| (577) ≤ | (nX)2α| (596) nα E n2α E 1 1 ≤ | (nX)2| (578) ≤ | (nX)3| (597) nα E n2α E np + (np)2 p (np)3 + 3(np)2 + np = (579) = (598) nα . nα−1 n2α α α p |ESK,α(X)| ≤ |ESK,α(X) − p | + p (580) . (599) . n2α−1  4c α−1 1 α 1 ≤ D1 p + p (581) 2 2) Case 2: n ln n < p ≤ ∆. c2n ln n p To bound the bias, we use the same definition of (582) . (n ln n)α−1 B1,B2,B3 as in Case 1. It follows from Lemma 4 that c 1 P(Y > 2∆) = P(nY > 2n∆) (583) B = | S (X) − pα| ≤ 3 , 1 E K,α (n ln n)α . (n ln n)α  enp 2c1 ln n ≤ (584) (600) 2c1 ln n 1 2 B2 = E(SK,α(X) − 1) (SK,α(X) ≥ 1) ≤ ESK,α  2c1 ln n e (601) ≤ 2 (585) 2c1(ln n) 2α+2 8c ln 2 (4c1 ln n) −4c ln ln n ≤ n 2 (602) . n 1 , (586) n2α (ln n)2α+2 where we have used Lemma 21 and the pointwise bound . 2α− . (603) in Lemma 17, thus n To deal with B3, we use B3 = (|EUα(X)| + |ESK,α(X)|)P(Y > 2∆) (587) −4c1 ln ln n |EUα(X)| ≤ sup |Uα(x)| ≤ 1 (604) . pn . (588) x∈[0,1] α α To sum up, we have the following bound on |B(ξ)|: |ESK,α(X)| ≤ |ESK,α(X) − p | + p (605)  α p (ln n)2α+2p c3 c1 ln n −4c1 ln ln n ≤ + |B(ξ)| + + pn α (606) . (n ln n)α−1 n2α−1− (n ln n) n α (589) ln n (607) p . n . α−1 (590) (n ln n) P(Y > 2∆) = P(nY > 2n∆) (608) As for variance, it follows from Lemma 23 and Lemma  e c1 ln n ≤ (609) 24 that 4 −c1 ln(4/e) Var(ξ) . n , (610)

≤ Var(SK,α(X)) + Var(Uα)P(Y > 2∆) to obtain that + ( L (X) − U (X))2 (Y > 2∆) (591) E α E α P B3 = (|EUα(X)| + |ESK,α(X)|)P(Y > 2∆) (611) 2 2 2 ≤ S (X) + 2( L (X) + U (X)) (Y > 2∆) −c1 ln(4/e) E K,α E α E α P . n . (612) (592) (ln n)2α+2p (ln n)2α+2p p  Hence, the bias is upper bounded by + 2 + n−4c1 ln ln n . n2α−1− n2α−1− n2α−1 1 (ln n)2α+2 |B(ξ)| + + n−c1 ln(4/e) (613) (593) . (n ln n)α n2α− (ln n)2α+2p 1 , (594) . . (614) . n2α−1− (n ln n)α Similar to the analysis in Case 1, the variance is upper 43

bounded by For the variance, we have Var(ξ) Var(ξ)

≤ Var(SK,α(X)) + Var(Uα(X))P(Y > 2∆) ≤ Var(SK,α(X)) + Var(Uα(X)) 2 2 + (ELα(X) − EUα(X)) P(Y > 2∆) (615) + (ELα(X) − EUα(X)) (635) 2 2 2 2 2 ≤ ESK,α(X) + 2(ELα(X) + EUα(X))P(Y > 2∆) ≤ ESK,α(X) + Var(Uα(X)) + (ELα(X) − EUα(X)) (616) (636) (ln n)2α+2 (ln n)2α+2  p 1  1 + 2 (1 + 1) n−c1 ln(4/e) (617) + + + (637) . n2α− . n2α− n n2 n2α(ln n)4−2α (ln n)2α+2 1 p . . (618) + , (638) n2α− . n2 n 3) Case 3: ∆ < p < 4∆. where we have used Lemma 2 and In this case, |ELα(X) − EUα(X)| |B(ξ)| = ( L (X) − pα) (Y ≤ 2∆) α E α P ≤ |ESK,α(X) − p |

α + (S (X) − 1)1(S (X) ≥ 1) + (EUα(X) − p )P(Y > 2∆) (619) E K,α K,α α α α + |EUα(X) − p | (639) ≤ |ELα(X) − p | + |EUα(X) − p | (620) = B + B + B (640) ≤ | S (X) − pα| 1 2 3 E K,α 1 + (S (X) − 1)1(S (X) ≥ 1) . (641) E K,α K,α . nα(ln n)2−α + | U (X) − pα| (621) E α 4) Case 4: p ≥ 4∆. ≡ B1 + B2 + B3. (622) In this case, the bias is upper bounded by It follows from Lemma 2 and Lemma 4 that |B(ξ)| α α α B1 = |ESK,α(X) − p | (623) ≤ |EUα(X) − p | + (|ELα(X)| + p )P(Y ≤ 2∆) c3 (642) ≤ (624) (n ln n)α 17 8310(1 + α) α −c1/8 1 ≤ α 2−α + p n , (625) n (c1 ln n) α(2 − α) . (n ln n)α + 2P(nY ≤ 2n∆) (643) 1 B2 = E(SK,α(X) − 1) (SK,α(X) ≥ 1) (626) 17 8310(1 + α) α −c1/8 −c1 ln n/2 2 ≤ α 2−α + p n + 2e ≤ ESK,α (627) n (c1 ln n) α(2 − α) 2α+2 (4c1 ln n) (644) ≤ n8c2 ln 2 (628) n2 17 8310(1 + α) = + pαn−c1/8 + 2n−c1/2 (ln n)2α+2 nα(c ln n)2−α α(2 − α) , (629) 1 . n2− (645) B = | U (X) − pα| (630) 1 3 E α , (646) 17 8310(1 + α) . nα(ln n)2−α α −c1/8 ≤ α 2−α + p n (631) n (c1 ln n) α(2 − α) where we have used Lemma 21 to bound P(nY ≤ 1 . (632) 2n∆). The variance is then upper bounded by . nα(ln n)2−α Var(ξ) To sum up, the total bias is upper bounded by ≤ Var(Uα(X)) 2α+2 1 (ln n) 1 2 + Var(Lα(X)) + ( Lα(X) − Uα(X)) (nY ≤ 2n∆) |B(ξ)| . α + 2− + α 2−α E E P (n ln n) n n (ln n) (647) (633) 202p 8 28800 2α −c1/4 1 ≤ + 2 + 2 p n . . (634) n n α nα(ln n)2−α 120 + p2αn−c1/8 + 2 (nY ≤ 2n∆) (648) α P 202p 8 28800 ≤ + + p2αn−c1/4 n n2 α2 120 + p2αn−c1/8 + 2n−c1/2 (649) α p 1 + , (650) . n n2 44

where we have used Lemma 2. ∞ X |F1,M (y) − F0,M (y)| y=0

d2 ln n I. Proof of Lemma 10 X2 X = |F1,M (y) − F0,M (y)| + |F1,M (y) − F0,M (y)| y=0 y> d2 ln n The existence of the two prior distributions ν0 and ν1 2 follows directly from a standard functional analysis argument (662) proposed by Lepski, Nemirovski, and Spokoiny [76], and , D1 + D2. (663) elaborated in best polynomial approximation by Cai and Low [77, Lemma 1]. It suffices to replace the interval [−1, 1] with [0, 1] and the function |x| with xα in the proof of Lemma 1 in [77]. Note that we take d1 = 1, d2 = 10e in the assumption. We bound D2 in the following way: Z −np y X e (np) D2 = (µ1(dp) − µ0(dp)) (664) J. Proof of Lemma 11 y! d2 ln n y> 2 Z  d ln n We compute the difference of the expectations as follows, ≤ Poi(np) > 2 (µ (dp) + µ (dp)) (665) P 2 1 0 S0 Fα(P ) − S0 Fα(P )   Eµ1 Eµ0 d2 ln n 0 ≤ 2 · P Poi(d1 ln n) > (666) S 2 X = ( pα − pα) (651) ln n Eµ1 i Eµ0 i  e5e−1  i=1 ≤ 2 · (667) (5e)5e = 2S0M αE [xα] (652) L [0,1] 2 2 αα d ln nα µ(2α) 1 = ≤ , (668) = 2 nα(ln n)α 1 (1 + o(1)) n1+5e ln 5 n22 c n 22α (d ln n)2α 2 where in the fourth step we have applied Lemma 21. (653) α α α µ(2α)d1 = 2 2α (1 + o(1)), (654) c (2d2) where µ(2α) is the constant given by Lemma 17. We bound D1 as follows:

d2 ln n We also have bounds for the variance: 2 2 X   D1 = |F1,M (y) − F0,M (y)| (669) Var S0 (Fα(P )) = E S0 Fα(P ) − E S0 Fα(P ) (655) µj µj µj y=0 S0 d ln n 2 ∞ X α α2 X2 1 X Z (−1)i(np)i+y = µj pi − µj pi (656) E E = (µ1(dp) − µ0(dp)) i=1 y! i! y=0 i=0 0 2α ≤ S Eµj pi (657) (670) 0 2α ≤ S M (658) d2 ln n 2 Z i i+y  2 α 3α X 1 X (−1) (np) αd1 (ln n) = (µ1(dp) − µ0(dp)) ≤ α → 0, j = 0, 1. y! i! c n y=0 i>d2 ln n−y (659) (671)

d2 ln n 2 i+y For any integer y ≥ 0, X 1 X (d1 ln n) ≤ (672) Z −np y y! i! e (np) y=0 i>d ln n−y F1,M (y) − F0,M (y) = (µ1(dp) − µ0(dp)) 2 y! d2 ln n 2 y i X (d1 ln n) X (d1 ln n) (660) = , (673) Z ∞ i i+y y! i! X (−1) (np) y=0 i>d2 ln n−y = (µ (dp) − µ (dp)) , i!y! 1 0 i=0 where in the third step we have used the fact that µ1 and µ0 (661) have matching moments up to order d2 ln n. The Lagrangian remainder for Taylor series of ex, x > 0 shows that where we have used the Taylor expansion of e−x. X xj xm+1 Now we proceed to bound the total variation distance = eξ , (674) j! (m + 1)! between the marginal distributions under two priors µ0, µ1. j>m 45 where 0 ≤ ξ ≤ x. Applying this result, we have shown in [167, Thm. 7.2.1, Thm. 7.2.4] that there exists two

d2 ln n universal positive constants M1,M2 such that 2 y i X (d1 ln n) X (d1 ln n) D ≤ (675) E [f ] ≤ M ω2 (f , n−1), (688) 1 y! i! n η [−1,1] 1 ϕ η y=0 i>d2 ln n−y n 1 X 2 −1 d ln n 2 2 (k + 1)Ek[fη][−1,1] ≥ M2ωϕ(fη, n ). (689) 2 y d2 ln n−y+1 n X (d1 ln n) (d1 ln n) ≤ ed1 ln n (676) k=0 y! (d ln n − y + 1)! y=0 2 Applying (688) and (689) and setting the approximation order

d2 ln n N = DL with a positive constant D > 1 to be specified 2 (d ln n)y (d ln n)d2 ln n−y+1ed2 ln n−y+1 2 X 1 d1 ln n 1 later, then given η = 1/N , the non-increasing property of ≤ e d ln n−y+1 y! (d ln n − y + 1) 2 En[fη] with respect to n yields that y=0 2 [−1,1] (677) EL[fη][−1,1] d ln n 2 +1 d2 ln n ! 2 2 y N d ln n+d ln n+1 d1 ln n X (d1 ln n) 1 X ≤ e 1 2 ≥ Ek[fη][−1,1] (690) d2 ln n y! N − L 2 + 1 y=0 k=L+1 (678) N 1 X d2 ln n +1 ≥ (k + 1)Ek[fη][−1,1] (691) ! 2 N 2 d1 ln n k=L+1 ≤ e2d1 ln n+d2 ln n+1 (679) d2 ln n E [f ] 2 + 1 2 −1 0 η [−1,1] ≥ M2ωϕ(fη,N ) − 2 n n N where in the third step we have used the fact that n! ≥ . L e 1 X Plugging d1 = 1, d2 = 10e in, we have − (k + 1)E [f ] (692) N 2 k η [−1,1] 1 k=1 D ≤ . 1 6 (680) β β β L 5n M2(2 · 5 − 4 − 6 ) 1 2M1 X 2 −1 ≥ − − ωϕ(fη, k ) Combining bounds on D1 and D2 together, we have (2N)2β N 2 DN k=1 ∞ 1 X (693) V (F ,F ) = |F (y) − F (y)| (681) 1,M 0,M 1,M 0,M β β β L 2 M2(2 · 5 − 4 − 6 ) 1 2M1 X 1 y=0 ≥ − − (694)   (2DL)2β (DL)2 DN k2β 1 k=1 ≤ min (D1 + D2), 1 (682) β β β Z L 2 M2(2 · 5 − 4 − 6 ) 1 2M1 dx 1 ≥ 2β − 2 2β − 2 2β ≤ . (683) (2DL) D L D L 0 x n6 (695) M (2 · 5β − 4β − 6β) 1 M  K. Proof of Lemma 13 = L−2β 2 − − 1 . (2D)2β D2 D2(1 − 2β) For η ∈ (0, 1/2), define (696) 1 − η 1 + η β Due to 0 < 2β < 1, for a sufficiently large universal constant f (x) = x + , (684) η 2 2 D we can obtain that √ β 2 2β β 2β then EL[fη] = EL[x ] . For ϕ(x) = 1 − x , denote lim inf L EL[x ][η,1] = lim inf L EL[fη][−1,1] > 0. [−1,1] [η,1] L→∞ L→∞ the second-order Ditzian-Totik modulus of smoothness by (697) n u + v  ω2 (f, t) sup f(u) + f(v) − 2f : ϕ , 2 u + v o u, v ∈ [−1, 1], |u − v| ≤ 2tϕ , (685) 2 L. Proof of Lemma 14 β −1 then it is straightforward to obtain that for all n ≤ (4η) 2 , Fix δ > 0. Let Fˆ(Z) be a near-minimax estimator of F (P )  2(1 − η)β  1 − η β α ω2 (f , n−1) = ηβ + η + − 2 η + . under the Multinomial model. The estimator Fˆ(Z) obtains the ϕ η 2 2 n + 1 n + 1 number of samples n from observation Z. By definition, we (686) have It follows directly from (686) that when n ≤ 2 sup Multinomial|Fˆ(Z) − Fα(P )| < R(S, n) + δ, (698) 1 β E √ 2 −1 P ∈M min{ η , (4η) }, S where R(S, n) is the minimax L risk under the Multinomial 2 · 5β − 4β − 6β 1 2 2 −1 model. Given P ∈ M (γ), let Z = [Z , ··· ,Z ]T with Z ∼ 2β ≤ ωϕ(fη, n ) ≤ 2β . (687) S 1 S i (2n) n 0 PS PS Poi(npi) and let n = i=1 Zi ∼ Poi(n i=1 pi), we use the 2 −1 α ˆ The relationship between ωϕ(fη, n ) and En[fη][−1,1] was estimator d1 · F (Z) to estimate Fα(P ). 46

L. Proof of Lemma 14

Fix $\delta>0$. Let $\hat{F}(Z)$ be a near-minimax estimator of $F_\alpha(P)$ under the Multinomial model. The estimator $\hat{F}(Z)$ obtains the number of samples $n$ from the observation $Z$. By definition, we have
$\sup_{P\in\mathcal{M}_S}\mathbb{E}_{\mathrm{Multinomial}}|\hat{F}(Z)-F_\alpha(P)|^2 < R(S,n)+\delta,$  (698)
where $R(S,n)$ is the minimax $L_2$ risk under the Multinomial model. Given $P\in\mathcal{M}_S(\gamma)$, let $Z=[Z_1,\cdots,Z_S]^T$ with $Z_i\sim\mathrm{Poi}(np_i)$ and let $n'=\sum_{i=1}^S Z_i\sim\mathrm{Poi}(n\sum_{i=1}^S p_i)$; we use the estimator $d_1^\alpha\cdot\hat{F}(Z)$ to estimate $F_\alpha(P)$.

The triangle inequality gives
$R_P(S,n,\gamma)^{\frac12} \le \left(\mathbb{E}_P|d_1^\alpha\cdot\hat{F}(Z)-F_\alpha(P)|^2\right)^{\frac12}$  (699)
$\le \left(\mathbb{E}_P\Big(d_1^\alpha\hat{F}(Z)-d_1^\alpha F_\alpha\Big(\tfrac{P}{\sum_{i=1}^S p_i}\Big)\Big)^2\right)^{\frac12} + \left(\Big(\big(\sum_{i=1}^S p_i\big)^\alpha-d_1^\alpha\Big)^2 F_\alpha^2\Big(\tfrac{P}{\sum_{i=1}^S p_i}\Big)\right)^{\frac12}$  (700)
$\le \left(d_1^{2\alpha}\sum_{m=0}^\infty \mathbb{E}_P\Big[\Big(\hat{F}(Z)-F_\alpha\Big(\tfrac{P}{\sum_{i=1}^S p_i}\Big)\Big)^2\,\Big|\,n'=m\Big]\mathbb{P}(n'=m)\right)^{\frac12} + \left(\frac{4d_1^{2\alpha-2}M^{2\alpha-2}}{(\ln n)^{2\gamma}}\right)^{\frac12}$  (701)
$\le \left(d_1^{2\alpha}\sum_{m=0}^\infty R(S,m)\,\mathbb{P}(n'=m)+\delta\right)^{\frac12} + \left(\frac{4M^{2\alpha-2}}{(\ln n)^{2\gamma}}\right)^{\frac12}$  (702)
$\le \left(d_1^{2\alpha}R\Big(S,\tfrac{d_1n}{2}\Big)\mathbb{P}\Big(n'\ge\tfrac{d_1n}{2}\Big)+\mathbb{P}\Big(n'\le\tfrac{d_1n}{2}\Big)+\delta\right)^{\frac12} + \left(\frac{4M^{2\alpha-2}}{(\ln n)^{2\gamma}}\right)^{\frac12}$  (703)
$\le \left(d_1^{2\alpha}R\Big(S,\tfrac{d_1n}{2}\Big)+\exp\Big(-\tfrac{d_1n}{8}\Big)+\delta\right)^{\frac12} + \left(\frac{4M^{2\alpha-2}}{(\ln n)^{2\gamma}}\right)^{\frac12},$  (704)
where we have used the fact that conditioned on $n'=m$, $Z\sim\mathrm{Multinomial}(m,P/\sum_i p_i)$, and the last step follows from Lemma 21. The proof is completed by the arbitrariness of $\delta$.

APPENDIX C
PROOF OF AUXILIARY LEMMAS

A. Proof of Lemma 1

We obtain the polynomial $g(x;a)$ via the Hermite interpolation formula. Concretely, the following WolframAlpha (http://www.wolframalpha.com/) command will give us $g(x;a)$: InterpolatingPolynomial[{{{0}, 0, 0, 0, 0, 0}, {{a}, 1, 0, 0, 0, 0}}, x].

B. Proof of Lemma 15

For brevity, denote $\mathrm{Var}(-\ln P(X))$ as $V(P)$. We have
$V(P) = \sum_{i=1}^S p_i(\ln p_i)^2 - \left(\sum_{i=1}^S p_i\ln p_i\right)^2.$  (705)
We note that
$V(P) \le \sum_{i=1}^S p_i(\ln p_i)^2 \le \sum_{i=1}^S p_i(\ln p_i-1)^2.$  (706)
The function $x(\ln x-1)^2$ on $[0,1]$ has second derivative $\frac{2\ln x}{x}$, hence is concave. Thus, the expression $\sum_{i=1}^S p_i(\ln p_i-1)^2$ attains its maximum when $P$ is the uniform distribution. In other words, we have shown that
$V(P) \le (\ln S+1)^2.$  (707)
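The crude bound (707) is easy to check numerically. The following sketch (illustrative only; the Dirichlet sampling is our own choice, not something used in the paper) evaluates $V(P)$ on random distributions and compares it with $(\ln S+1)^2$, and also with the refined bound $\tfrac34(\ln S)^2$ derived next.

```python
# Illustrative numerical check of the bounds in the proof of Lemma 15 (not part of the
# argument): V(P) = Var(-ln P(X)) is compared with the crude bound (ln S + 1)^2 of (707)
# and with the refined bound (3/4)(ln S)^2 of (729) derived next.  Sampling P from a
# Dirichlet prior is an illustrative choice, not something used in the paper.
import numpy as np

def varentropy(p):
    """V(P) = sum_i p_i (ln p_i)^2 - (sum_i p_i ln p_i)^2, ignoring zero entries."""
    p = p[p > 0]
    return np.sum(p * np.log(p) ** 2) - np.sum(p * np.log(p)) ** 2

rng = np.random.default_rng(0)
S = 1000
worst = max(varentropy(rng.dirichlet(0.05 * np.ones(S))) for _ in range(200))
print(f"S = {S}: max V(P) over 200 draws = {worst:.2f}, "
      f"(ln S + 1)^2 = {(np.log(S) + 1) ** 2:.2f}, (3/4)(ln S)^2 = {0.75 * np.log(S) ** 2:.2f}")
```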

A tighter bound, which gives a better constant, can be constructed as follows. We define the Lagrangian
$L = \sum_{i=1}^S p_i(\ln p_i)^2 - \left(\sum_{i=1}^S p_i\ln p_i\right)^2 + \lambda\left(\sum_{i=1}^S p_i-1\right).$  (708)
Taking derivatives with respect to $p_i$, we obtain
$\frac{\partial L}{\partial p_i} = (\ln p_i)^2 + p_i(2\ln p_i)\cdot\frac{1}{p_i} - 2\left(\sum_{i=1}^S p_i\ln p_i\right)(1+\ln p_i) + \lambda$  (709)
$= 0, \quad \text{for all } i.$  (710)
It is equivalent to
$(\ln p_i)^2 + 2\ln p_i + 2H(P)(1+\ln p_i) + \lambda = 0, \quad \forall i.$  (711)
Note that it is a quadratic equation in $\ln p_i$ with the same coefficients for every $i$. Solving for $\ln p_i$, we obtain that
$\ln p_i = -(1+H(P)) \pm \sqrt{1+H^2(P)-\lambda}.$  (712)
It implies that the components of the maximum achieving distribution can only take two values. Assume $p_i\in\{q_1,q_2\}$ for all $i$, and suppose $q_1$ appears $k$ times; then
$kq_1 + (S-k)q_2 = 1.$  (713)
Now we compute the functional
$V(P) = kq_1(\ln q_1)^2 + (S-k)q_2(\ln q_2)^2 - (kq_1\ln q_1+(S-k)q_2\ln q_2)^2$  (714)
$= kq_1(\ln q_1)^2 + (1-kq_1)(\ln q_2)^2 - k^2q_1^2(\ln q_1)^2 - (1-kq_1)^2(\ln q_2)^2 - 2kq_1(1-kq_1)(\ln q_1)(\ln q_2)$  (715)
$= kq_1(1-kq_1)\left(\ln\frac{q_2}{q_1}\right)^2.$  (716)
Since $q_2 = \frac{1-kq_1}{S-k}$, we have
$V(P) = kq_1(1-kq_1)\left(\ln\frac{1-kq_1}{Sq_1-kq_1}\right)^2.$  (717)
Denote $x=kq_1$ and $y=k/S$; then
$V(P) = x(1-x)\left(\ln\frac{1-x}{x}-\ln\frac{1-y}{y}\right)^2.$  (718)
Fixing $x$, we see $V(P)$ is a monotone function of $y$. Without loss of generality, by symmetry we assume $x\le 1/2$. Then, the maximum is achieved at $y=\frac{S-1}{S}$, and $V(P)$ as a function of $x$ is
$V(P) = x(1-x)\left(\ln\frac{1-x}{x}+\ln(S-1)\right)^2.$  (719)
Taking derivatives with respect to $x$ and ignoring the minimum achieving $x$, we obtain the following equation for the maximum achieving value of $x$, which is denoted as $x_1$:
$(1-2x_1)\left(\ln\Big(\frac{1}{x_1}-1\Big)+\ln(S-1)\right) = 2.$  (720)
Denoting $S-1$ by $m$, we have
$(1-2x_1)\ln\frac{m(1-x_1)}{x_1} = 2,$  (721)
which is equivalent to
$(2-2x_1)\ln\frac{m(1-x_1)}{x_1} = 2 + \ln\frac{m(1-x_1)}{x_1}.$  (722)
Multiplying both sides by $x_1\ln\frac{m(1-x_1)}{x_1}$, we obtain
$V_{\max} = x_1(1-x_1)\left(\ln\frac{m(1-x_1)}{x_1}\right)^2$  (723)
$= x_1\left(\ln\frac{m(1-x_1)}{x_1}+\frac{1}{2}\left(\ln\frac{m(1-x_1)}{x_1}\right)^2\right).$  (724)
Note that for $x\in(0,1/2]$,
$\ln\frac{m(1-x)}{x} \in [\ln m,\infty),$  (725)
and if $S\ge 4$, we have $\ln m = \ln(S-1) > 1$. Using the bound $z\le z^2$ for $z\ge 1$, we have
$V_{\max} \le \frac{3}{2}x_1\left(\ln\frac{m(1-x_1)}{x_1}\right)^2.$  (726)
Taking derivatives with respect to $z$ of $z\left(\ln\frac{m(1-z)}{z}\right)^2$, $z\in(0,1/2]$, we have
$\frac{d}{dz}\left(z\left(\ln\frac{m(1-z)}{z}\right)^2\right) = \ln\frac{m(1-z)}{z}\times\left(\ln\frac{m(1-z)}{z}+\frac{2}{z-1}\right)$  (727)
$\ge \ln\frac{m(1-z)}{z}\times(\ln m-4),$  (728)
which is always nonnegative if $m\ge e^4$, i.e., $S\ge 56$. Hence, we know that when $S\ge 56$, the function $z\left(\ln\frac{m(1-z)}{z}\right)^2$ is an increasing function of $z$ for $z\in(0,1/2]$, thus achieves its maximum at $z=1/2$. Then, we obtain
$V_{\max} \le \frac{3}{4}(\ln m)^2 \le \frac{3}{4}(\ln S)^2.$  (729)

C. Proof of Lemma 16

Denote by $\hat{F}_n^P$ and $\hat{F}_n$ the estimators of $F(P)$ under the Poissonized model and the Multinomial model with sample size $n$, respectively. By the minimax theorem [21], the minimax risk is the supremum of the Bayes risk over all priors, i.e.,
$R_P(S,n) = \sup_\pi\inf_{\hat{F}_n^P}\int \mathbb{E}_P|\hat{F}_n^P(Z)-F(P)|^2\,\pi(dP)$  (730)
$\triangleq \sup_\pi R_B^P(S,n,\pi),$  (731)
$R(S,n) = \sup_\pi\inf_{\hat{F}_n}\int \mathbb{E}_P|\hat{F}_n(Z)-F(P)|^2\,\pi(dP)$  (732)
$\triangleq \sup_\pi R_B(S,n,\pi),$  (733)
where the supremum is taken over all priors on $\mathcal{M}_S$, and we denote the Bayes risk under prior $\pi$ in the Poissonized and Multinomial models by $R_B^P(S,n,\pi)$ and $R_B(S,n,\pi)$, respectively. Using the property that independent $Z_i\sim\mathrm{Poi}(np_i)$, $1\le i\le S$, implies $(Z_1,\cdots,Z_S)|(\sum_{i=1}^S Z_i=n)\sim\mathrm{Multinomial}(n;p_1,\cdots,p_S)$, it is straightforward to show that for any prior $\pi$,
$R_B^P(S,n,\pi) = \sum_{m=0}^\infty R_B(S,m,\pi)\,\mathbb{P}(\mathrm{Poi}(n)=m).$  (734)
Since for $n>m$ the Bayes estimator with sample size $m$ can be used for estimation with sample size $n$ by neglecting the last $n-m$ samples, we conclude that the function $R_B(S,n,\pi)$ is non-increasing in $n$. Moreover, it is obvious that $R_B(S,n,\pi)\le\sup_{P\in\mathcal{M}_S}|F(P)|^2$ by considering the zero estimator. Then we have
$R_B^P(S,2n,\pi) = \sum_{m=0}^\infty R_B(S,m,\pi)\,\mathbb{P}(\mathrm{Poi}(2n)=m)$  (735)
$= \sum_{m=0}^{n-1}R_B(S,m,\pi)\,\mathbb{P}(\mathrm{Poi}(2n)=m) + \sum_{m=n}^\infty R_B(S,m,\pi)\,\mathbb{P}(\mathrm{Poi}(2n)=m)$  (736)
$\le \sup_{P\in\mathcal{M}_S}|F(P)|^2\cdot\mathbb{P}(\mathrm{Poi}(2n)<n) + R_B(S,n,\pi)\sum_{m=n}^\infty\mathbb{P}(\mathrm{Poi}(2n)=m)$  (737)
$\le \sup_{P\in\mathcal{M}_S}|F(P)|^2\cdot e^{-n/4} + R_B(S,n,\pi),$  (738)
and
$R_B^P(S,n/2,\pi) = \sum_{m=0}^\infty R_B(S,m,\pi)\,\mathbb{P}(\mathrm{Poi}(n/2)=m)$  (739)
$\ge \sum_{m=0}^{n}R_B(S,m,\pi)\,\mathbb{P}(\mathrm{Poi}(n/2)=m)$  (740)
$\ge R_B(S,n,\pi)\,\mathbb{P}(\mathrm{Poi}(n/2)\le n)$  (741)
$\ge \frac{1}{2}R_B(S,n,\pi),$  (742)
where we have applied Lemma 21 to show that $\mathbb{P}(\mathrm{Poi}(2n)<n)\le e^{-n/4}$ and the Markov inequality to show that $\mathbb{P}(\mathrm{Poi}(n/2)\le n)\ge 1/2$. Then we are done by taking the supremum over all priors $\pi$ on both sides of these two inequalities.
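The last step rests on two elementary Poisson facts: $\mathbb{P}(\mathrm{Poi}(2n)<n)\le e^{-n/4}$ (from Lemma 21) and $\mathbb{P}(\mathrm{Poi}(n/2)\le n)\ge 1/2$ (Markov's inequality). A small self-contained numerical check, offered for illustration only:

```python
# Illustrative check (not part of the proof) of the two elementary facts invoked above:
# P(Poi(2n) < n) <= exp(-n/4), attributed to Lemma 21 in the text, and
# P(Poi(n/2) <= n) >= 1/2, which follows from Markov's inequality.
import numpy as np

def poisson_cdf(k, lam):
    """P(Poi(lam) <= k), computed by summing pmf terms via p_{j+1} = p_j * lam / (j+1)."""
    p = total = np.exp(-lam)
    for j in range(int(k)):
        p *= lam / (j + 1)
        total += p
    return total

for n in [10, 40, 100]:
    print(f"n={n:4d}  P(Poi(2n)<n)={poisson_cdf(n - 1, 2 * n):.3e}  "
          f"exp(-n/4)={np.exp(-n / 4):.3e}  P(Poi(n/2)<=n)={poisson_cdf(n, n / 2):.4f}")
```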

D. Proof of Lemma 17

We first show the limiting result. Defining $y^2=x$, we know
$E_n[x^\alpha]_{[0,1]} = E_{2n}[y^{2\alpha}]_{[-1,1]}.$  (743)
Applying Theorem 8 to our settings, for any $\alpha>0$ we have
$\lim_{n\to\infty} n^{2\alpha}E_n[x^\alpha]_{[0,1]} = \lim_{n\to\infty} n^{2\alpha}E_{2n}[y^{2\alpha}]_{[-1,1]}$  (744)
$= \frac{1}{2^{2\alpha}}\lim_{n\to\infty}(2n)^{2\alpha}E_{2n}[y^{2\alpha}]_{[-1,1]}$  (745)
$= \frac{\mu(2\alpha)}{2^{2\alpha}}.$  (746)
Korneichuk [166, Sec. 6.2.5] showed the inequality
$E_n[f] \le \omega\left(f,\frac{\pi}{n+1}\right)$  (747)
for all $f\in C[-1,1]$, where $\omega(f,\delta)$ is the first order modulus of smoothness, defined as
$\omega(f,\delta) \triangleq \sup\{|f(x)-f(x+\delta)| : x\in[-1,1],\ x+\delta\in[-1,1]\}.$  (748)
Moreover, it has been shown in [73, Pg. 207] that
$E_{n+1}[f(x)] \le \frac{\pi}{2(n+1)}E_n[f'(x)],$  (749)
where $f\in C^1[-1,1]$.

For $0<\alpha\le 1/2$, $\omega(y^{2\alpha},\delta)\le\delta^{2\alpha}$ for $\delta\le 2$, hence we know
$E_n[x^\alpha]_{[0,1]} = E_{2n}[y^{2\alpha}]_{[-1,1]} \le \left(\frac{\pi}{2n}\right)^{2\alpha}.$  (750)
For $1/2<\alpha<1$, noting that
$\left((y^2)^\alpha\right)' = 2\alpha y(y^2)^{\alpha-1},$  (751)
and that
$\omega(2\alpha y(y^2)^{\alpha-1},\delta) \le 2\alpha\delta^{2\alpha-1}, \quad \delta\le 2,$  (752)
we know for $1/2<\alpha<1$,
$E_{2n}[y^{2\alpha}]_{[-1,1]} \le \frac{\pi}{2(2n)}E_{2n-1}[2\alpha y(y^2)^{\alpha-1}]_{[-1,1]}$  (753)
$\le 2\alpha\frac{\pi}{4n}\left(\frac{\pi}{2n}\right)^{2\alpha-1}$  (754)
$= \alpha\left(\frac{\pi}{2n}\right)^{2\alpha}$  (755)
$\le \left(\frac{\pi}{2n}\right)^{2\alpha}.$  (756)
Plugging in $x=0$ yields $|g_{0,\alpha}|<(\pi/2n)^{2\alpha}$, hence
$\max_{0\le x\le 1}|R_{n,\alpha}(x)-x^\alpha| \le E_n[x^\alpha]_{[0,1]} + |g_{0,\alpha}| \le 2\left(\frac{\pi}{2n}\right)^{2\alpha}.$  (757)
For $1<\alpha<3/2$, by defining $y^2=x$ we know that
$E_n[x^\alpha]_{[0,1]} = E_{2n}[y^{2\alpha}]_{[-1,1]}$  (758)
$\le \frac{\pi}{4n}E_{2n-1}[2\alpha y^{2\alpha-1}]_{[-1,1]}$  (759)
$\le \frac{\pi^2}{8n(2n-1)}E_{2n-2}[2\alpha(2\alpha-1)y^{2\alpha-2}]_{[-1,1]},$  (760)
and
$\omega\left(2\alpha(2\alpha-1)y^{2\alpha-2},\frac{\pi}{2n-1}\right) = \frac{2\alpha(2\alpha-1)\pi^{2\alpha-2}}{(2n-1)^{2\alpha-2}}.$  (761)
Hence we have
$E_n[x^\alpha]_{[0,1]} \le \frac{\pi^2}{8n(2n-1)}\cdot\frac{2\alpha(2\alpha-1)\pi^{2\alpha-2}}{(2n-1)^{2\alpha-2}}$  (762)
$< \frac{\alpha(2\alpha-1)}{2}\left(\frac{\pi}{2n-1}\right)^{2\alpha}$  (763)
$< \frac{3}{2}\left(\frac{\pi}{n}\right)^{2\alpha}.$  (764)
Plugging in $x=0$ yields $|g_{0,\alpha}|<\frac{3}{2}(\pi/n)^{2\alpha}$, hence
$\max_{0\le x\le 1}|R_{n,\alpha}(x)-x^\alpha| \le E_n[x^\alpha]_{[0,1]} + |g_{0,\alpha}| \le 3\left(\frac{\pi}{n}\right)^{2\alpha}.$  (765)
Bernstein [196, Pg. 171] showed
$\max_{0\le x\le 1}|R'_{n,\alpha}(x)-(x^\alpha)'| \le D\cdot E_n[(x^\alpha)']_{[0,1]} \le D\alpha\left(\frac{\pi}{2n}\right)^{2(\alpha-1)},$  (766)
where $D>0$ is a positive universal constant, and the last inequality follows directly from (750). Then integrating over $x$ yields the pointwise bound
$|R_{n,\alpha}(x)-x^\alpha| \le \int_0^x\left|R'_{n,\alpha}(t)-(t^\alpha)'\right|dt$  (767)
$\le \int_0^x D\alpha\left(\frac{\pi}{2n}\right)^{2(\alpha-1)}dt$  (768)
$= D\alpha\left(\frac{\pi}{2n}\right)^{2(\alpha-1)}x$  (769)
$\triangleq \frac{D_1 x}{n^{2(\alpha-1)}}.$  (770)
In order to bound the coefficients of best polynomial approximations, we need the following result by Qazi and Rahman [197, Thm. E] on the maximal coefficients of polynomials on a finite interval.

Lemma 25. Let $p_n(x)=\sum_{\nu=0}^n a_\nu x^\nu$ be a polynomial of degree at most $n$ such that $|p_n(x)|\le 1$ for $x\in[-1,1]$. Then, $|a_{n-2\mu}|$ is bounded above by the modulus of the corresponding coefficient of $T_n$ for $\mu=0,1,\ldots,\lfloor n/2\rfloor$, and $|a_{n-1-2\mu}|$ is bounded above by the modulus of the corresponding coefficient of $T_{n-1}$ for $\mu=0,1,\ldots,\lfloor(n-1)/2\rfloor$. Here $T_n(x)$ is the $n$-th Chebyshev polynomial of the first kind.

It is shown in Cai and Low [77, Lemma 2] that all of the coefficients of the Chebyshev polynomial $T_{2m}(x)$, $m\in\mathbb{Z}_+$, are upper bounded by $2^{3m}$. If we view the best polynomial approximation of $x^\alpha$ or $-x\ln x$ over $[0,1]$ as the best polynomial approximation of $y^{2\alpha}$ or $-y^2\ln y^2$, $y^2=x$, then we would obtain an even polynomial over the interval $[-1,1]$ represented as
$\sum_{k=0}^n g_{k,\alpha}y^{2k} \quad \text{or} \quad \sum_{k=0}^n r_{k,H}y^{2k}.$  (771)
Applying Lemma 25 and equation (48), we know that for all $k\le n$, we have
$|g_{k,\alpha}| \le 2^{3n}, \quad |g_{k,H}| \le 2^{3n}.$  (772)

E. Proof of Lemma 19

Define $x'=\frac{x}{4\Delta}\in[0,1]$. For $0<\alpha<1$, applying Lemma 17, we have
$\left|(x')^\alpha - \sum_{k=1}^K g_{k,\alpha}(x')^k\right| \le 2\left(\frac{\pi}{2K}\right)^{2\alpha}.$  (773)
Multiplying both sides by $(4\Delta)^\alpha$, we have
$\left|\sum_{k=1}^K g_{k,\alpha}(4\Delta)^{-k+\alpha}x^k - x^\alpha\right| \le 2\left(\frac{\pi}{2K}\right)^{2\alpha}(4\Delta)^\alpha = \frac{c_3}{(n\ln n)^\alpha}.$  (774)
For the case $1<\alpha<3/2$, similar results hold for
$c_3 = \frac{4\pi^2 c_1^\alpha}{3c_2^2}.$  (775)

F. Proof of Lemma 20

Define $x'=\frac{x}{4\Delta}$, hence for $x\in[0,4\Delta]$, $x'\in[0,1]$. It follows from the best polynomial approximation result for $-x\ln x$ on $[0,1]$ that there exists a constant $d>0$ such that, for all $x'\in[0,1]$,
$\left|\sum_{k=0}^K r_{k,H}(x')^k - (-x'\ln x')\right| \le \frac{d}{K^2}.$  (776)
When $n$ is sufficiently large, we could take $d=\nu_1(2)/2$. Taking $x'=0$, we have
$|r_{0,H}| \le \frac{d}{K^2},$  (777)
hence
$\left|\sum_{k=1}^K r_{k,H}(x')^k - (-x'\ln x')\right| \le \frac{2d}{K^2}.$  (778)
Now, multiplying both sides by $4\Delta$, we have
$\left|\sum_{k=1}^K r_{k,H}(4\Delta)^{-k+1}x^k + x(\ln x-\ln(4\Delta))\right| \le \frac{2d(4\Delta)}{K^2}.$  (779)
Since we have defined $g_{k,H}$ as
$g_{k,H} = r_{k,H},\ 2\le k\le K, \qquad g_{1,H} = r_{1,H}-\ln(4\Delta),$  (780)
we have
$\left|\sum_{k=1}^K g_{k,H}(4\Delta)^{-k+1}x^k + x\ln x\right| \le \frac{2d(4\Delta)}{K^2}$  (781)
$= \frac{8dc_1}{c_2^2 n\ln n}$  (782)
$= \frac{C}{n\ln n}.$  (783)
With $d=\nu_1(2)/2$ for $n$ sufficiently large, we obtain
$C = \frac{4c_1\nu_1(2)}{c_2^2}.$  (784)

G. Proof of Lemma 22

We know that if $X\sim\mathrm{Poi}(\lambda)$, then it follows from [198] that
$\mathbb{E}X^k = \sum_{i=1}^k {k\brace i}\lambda^i,$  (785)
where ${k\brace i}$ is the Stirling number of the second kind. Using (494), we have
$\mathbb{E}X^k = \sum_{i=1}^k {k\brace i}\lambda^i$  (786)
$\le \sum_{i=1}^k \lambda^i\binom{k}{i}i^{k-i}$  (787)
$\le \sum_{i=1}^k M^i\binom{k}{i}M^{k-i}$  (788)
$= M^k\sum_{i=1}^k\binom{k}{i}$  (789)
$\le M^k 2^k$  (790)
$= (2M)^k.$  (791)
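Before moving on, the moment bound (791) can be sanity-checked directly. The sketch below evaluates $\mathbb{E}X^k$ for $X\sim\mathrm{Poi}(\lambda)$ by summing the probability mass function in log space and compares it with $(2M)^k$; the particular values of $k$, $\lambda$ and $M$ are illustrative and chosen so that $\lambda\le M$ and $k\le M$, the regime in which the chain (787)–(788) applies.

```python
# Illustrative check of the conclusion (791): for X ~ Poi(lambda) with lambda <= M and
# k <= M (the regime assumed here so that the chain (787)-(788) applies), E[X^k] <= (2M)^k.
# The moment is evaluated by summing j^k * P(X = j) in log space; values are illustrative.
import math

def poisson_moment(k, lam, j_max=2000):
    """E[X^k] for X ~ Poi(lam), via a (numerically exhaustive) truncated sum over j >= 1."""
    return sum(math.exp(k * math.log(j) + j * math.log(lam) - lam - math.lgamma(j + 1))
               for j in range(1, j_max + 1))

M = 10.0
for k, lam in [(2, 5.0), (5, 10.0), (10, 10.0)]:
    print(f"k={k:2d} lam={lam:4.1f}  E[X^k]={poisson_moment(k, lam):.4e}  "
          f"(2M)^k={(2 * M) ** k:.4e}")
```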

REFERENCES
[1] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
[2] A. Rényi, "On measures of entropy and information," in Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961, pp. 547–561.
[3] I. Csiszár, "Generalized cutoff rates and Rényi's information measures," Information Theory, IEEE Transactions on, vol. 41, no. 1, pp. 26–34, 1995.
[4] T. A. Courtade and S. Verdú, "Cumulant generating function of codeword lengths in optimal lossless compression," in Proc. of IEEE International Symposium on Information Theory (ISIT), 2014.
[5] ——, "Variable-length lossy compression and channel coding: non-asymptotic converses via cumulant generating functions," in Proc. of IEEE International Symposium on Information Theory (ISIT), 2014.
[6] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees. CRC Press, 1984.
[7] E. L. Lehmann and G. Casella, Theory of Point Estimation. Springer, 1998, vol. 31.
[8] R. Linsker, "Self-organization in a perceptual network," Computer, vol. 21, no. 3, pp. 105–117, 1988.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993, vol. 1.
[10] S. Nowozin, "Improved information gain estimates for decision tree induction," in Proceedings of the 29th International Conference on Machine Learning (ICML-12), 2012, pp. 297–304.
[11] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," Information Theory, IEEE Transactions on, vol. 14, no. 3, pp. 462–467, 1968.
[12] J. Jiao, T. Courtade, K. Venkat, and T. Weissman, "Justification of logarithmic loss via the benefit of side information," arXiv preprint arXiv:1403.4679, 2014.
[13] C. Olsen, P. E. Meyer, and G. Bontempi, "On the impact of entropy estimation on transcriptional regulatory network inference based on mutual information," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2009, no. 1, p. 308959, 2009.
[14] J. P. Pluim, J. A. Maintz, and M. A. Viergever, "Mutual-information-based registration of medical images: a survey," Medical Imaging, IEEE Transactions on, vol. 22, no. 8, pp. 986–1004, 2003.
[15] P. Viola and W. M. Wells III, "Alignment by maximization of mutual information," International Journal of Computer Vision, vol. 24, no. 2, pp. 137–154, 1997.
[16] L. Batina, B. Gierlichs, E. Prouff, M. Rivain, F.-X. Standaert, and N. Veyrat-Charvillon, "Mutual information analysis: a comprehensive study," Journal of Cryptology, vol. 24, no. 2, pp. 269–291, 2011.
[17] M. O. Hill, "Diversity and evenness: a unifying notation and its consequences," Ecology, vol. 54, no. 2, pp. 427–432, 1973.
[18] F. Franchini, A. Its, and V. Korepin, "Rényi entropy of the XY spin chain," Journal of Physics A: Mathematical and Theoretical, vol. 41, no. 2, p. 025302, 2008.
[19] F. Schmid, R. Schmidt, T. Blumentritt, S. Gaißer, and M. Ruppert, "Copula-based measures of multivariate association," in Copula Theory and Its Applications. Springer, 2010, pp. 209–236.
[20] M. Basseville, "Divergence measures for statistical data processing: An annotated bibliography," Signal Processing, vol. 93, no. 4, pp. 621–633, 2013.
[21] A. Wald, Statistical Decision Functions. Wiley, 1950.
[22] L. Le Cam, Asymptotic Methods in Statistical Decision Theory. Springer, 1986.
[23] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[24] G. Valiant and P. Valiant, "Estimating the unseen: an n/log n-sample estimator for entropy and support size, shown optimal via new CLTs," in Proceedings of the 43rd Annual ACM Symposium on Theory of Computing. ACM, 2011, pp. 685–694.
[25] P. Valiant and G. Valiant, "Estimating the unseen: improved estimators for entropy and other properties," in Advances in Neural Information Processing Systems, 2013, pp. 2157–2165.
[26] G. Valiant and P. Valiant, "The power of linear estimators," in Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on. IEEE, 2011, pp. 403–412.
[27] Y. Wu and P. Yang, "Minimax rates of entropy estimation on large alphabets via best polynomial approximation," arXiv preprint arXiv:1407.0381, 2014.
[28] L. Paninski, "Estimating entropy on m bins given fewer than m samples," Information Theory, IEEE Transactions on, vol. 50, no. 9, pp. 2200–2203, 2004.
[29] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Maximum likelihood estimation of functionals of discrete distributions," submitted.
[30] B. Efron and R. Thisted, "Estimating the number of unseen species: How many words did Shakespeare know?" Biometrika, vol. 63, no. 3, pp. 435–447, 1976. [Online]. Available: http://www.jstor.org/stable/2335721
[31] J. Bunge and M. Fitzpatrick, "Estimating the number of species: a review," Journal of the American Statistical Association, vol. 88, no. 421, pp. 364–373, 1993.
[32] N. Santhanam, A. Orlitsky, and K. Viswanathan, "New tricks for old dogs: Large alphabet probability estimation," in Information Theory Workshop, 2007. ITW'07. IEEE. IEEE, 2007, pp. 638–643.
[33] L. D. Brown, "Minimaxity, more or less," in Statistical Decision Theory and Related Topics V. Springer, 1994, pp. 1–18.
[34] ——, "An essay on statistical decision theory," Journal of the American Statistical Association, vol. 95, no. 452, pp. 1277–1281, 2000.
[35] T. T. Cai, "Minimax and adaptive inference in nonparametric function estimation," Statistical Science, vol. 27, no. 1, pp. 31–50, 2012.
[36] Y. Han, J. Jiao, and T. Weissman, "Adaptive estimation of Shannon entropy," submitted.
[37] N. Chentsov, Statisticheskie reshayushchie pravila i optimalnye vyvody. Nauka, Moskva (Engl. transl.: 1982, Statistical Decision Rules and Optimal Inference, American Mathematical Society, Providence), 1972.
[38] D. Braess and H. Dette, "The asymptotic minimax risk for the estimation of constrained binomial and multinomial probabilities," Sankhyā: The Indian Journal of Statistics, pp. 707–732, 2004.
[39] J. Rissanen, "Stochastic complexity and modeling," The Annals of Statistics, pp. 1080–1100, 1986.
[40] L. Paninski, "Variational minimax estimation of discrete distributions under KL loss," in Advances in Neural Information Processing Systems, 2004, pp. 1033–1040.
[41] A. Orlitsky and N. P. Santhanam, "Speaking of infinity," Information Theory, IEEE Transactions on, vol. 50, no. 10, pp. 2215–2230, 2004.
[42] W. Szpankowski and M. J. Weinberger, "Minimax pointwise redundancy for memoryless models over large alphabets," Information Theory, IEEE Transactions on, vol. 58, no. 7, pp. 4094–4104, 2012.
[43] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, "Limits of predictability in human mobility," Science, vol. 327, no. 5968, pp. 1018–1021, 2010.
[44] T. Takaguchi, M. Nakamura, N. Sato, K. Yano, and N. Masuda, "Predictability of conversation partners," Physical Review X, vol. 1, no. 1, p. 011008, 2011.
[45] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," Information Theory, IEEE Transactions on, vol. 56, no. 5, pp. 2307–2359, 2010.
[46] I. Kontoyiannis and S. Verdú, "Optimal lossless compression: Source varentropy and dispersion," in Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on. IEEE, 2013, pp. 1739–1743.
[47] J. Acharya, A. Orlitsky, A. T. Suresh, and H. Tyagi, "The complexity of estimating Rényi entropy," available on arXiv, 2014.
[48] A. W. Van der Vaart, Asymptotic Statistics. Cambridge University Press, 2000, vol. 3.
[49] J. Hájek, "A characterization of limiting distributions of regular estimates," Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 14, no. 4, pp. 323–330, 1970.
[50] ——, "Local asymptotic minimax and admissibility in estimation," in Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1972, pp. 175–194.
[51] A. J. Wyner and D. Foster, "On the lower limits of entropy estimation," IEEE Transactions on Information Theory, submitted for publication, 2003.
[52] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[53] G. A. Miller, "Note on the bias of information estimates," Information Theory in Psychology: Problems and Methods, vol. 2, pp. 95–100, 1955.
[54] T. Schürmann and P. Grassberger, "Entropy estimation of symbol sequences," Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 6, no. 3, pp. 414–427, 1996.
[55] S. Schober, "Some worst-case bounds for Bayesian estimators of discrete distributions," in Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on. IEEE, 2013, pp. 2194–2198.
[56] D. H. Wolpert and D. R. Wolf, "Estimating functions of probability distributions from a finite set of samples," Physical Review E, vol. 52, no. 6, p. 6841, 1995.
[57] D. Holste, I. Grosse, and H. Herzel, "Bayes' estimators of generalized entropies," Journal of Physics A: Mathematical and General, vol. 31, no. 11, p. 2551, 1998.
[58] Y. Han, J. Jiao, and T. Weissman, "Does Dirichlet prior smoothing solve the Shannon entropy estimation problem?" submitted to IEEE Transactions on Information Theory.
[59] R. G. Miller, "The jackknife–a review," Biometrika, vol. 61, no. 1, pp. 1–15, 1974.
[60] A. Carlton, "On the bias of information estimates," Psychological Bulletin, vol. 71, no. 2, p. 108, 1969.
[61] P. Grassberger, "Finite sample corrections to entropy and dimension estimates," Physics Letters A, vol. 128, no. 6, pp. 369–373, 1988.
[62] S. Zahl, "Jackknifing an index of diversity," Ecology, vol. 58, no. 4, pp. 907–913, 1977.
[63] J. Hausser and K. Strimmer, "Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks," The Journal of Machine Learning Research, vol. 10, pp. 1469–1484, 2009.
[64] I. Nemenman, F. Shafee, and W. Bialek, "Entropy and inference, revisited," Advances in Neural Information Processing Systems, vol. 1, pp. 471–478, 2002.
[65] A. Chao and T.-J. Shen, "Nonparametric estimation of Shannon's index of diversity when there are unseen species in sample," Environmental and Ecological Statistics, vol. 10, no. 4, pp. 429–443, 2003.
[66] C. O. Daub, R. Steuer, J. Selbig, and S. Kloska, "Estimating mutual information using B-spline functions–an improved similarity measure for analysing gene expression data," BMC Bioinformatics, vol. 5, no. 1, p. 118, 2004.
[67] P. Grassberger, "Entropy estimates from insufficient samplings," arXiv preprint physics/0307138, 2008.
[68] I. Nemenman, "Coincidences and estimation of entropies of random variables with large cardinalities," Entropy, vol. 13, no. 12, pp. 2013–2023, 2011.
[69] I. Nemenman, W. Bialek, and R. d. R. van Steveninck, "Entropy and information in neural spike trains: Progress on the sampling problem," Physical Review E, vol. 69, no. 5, p. 056111, 2004.
[70] M. Vinck, F. P. Battaglia, V. B. Balakirsky, A. H. Vinck, and C. M. Pennartz, "Estimation of the entropy based on its polynomial representation," Physical Review E, vol. 85, no. 5, p. 051139, 2012.
[71] G. J. Valiant, "Algorithmic approaches to statistical questions," Ph.D. dissertation, University of California, 2012.

[72] G. Pólya, How to Solve It: A New Aspect of Mathematical Method. Princeton University Press, 2014.
[73] R. A. DeVore and G. G. Lorentz, Constructive Approximation. Springer, 1993, vol. 303.
[74] I. Ibragimov, A. Nemirovskii, and R. Khas'minskii, "Some problems on nonparametric estimation in Gaussian white noise," Theory of Probability & Its Applications, vol. 31, no. 3, pp. 391–406, 1987.
[75] A. Nemirovski, "Topics in non-parametric statistics. Lectures on probability theory and statistics, Saint Flour 1998, 1738," 2000.
[76] O. Lepski, A. Nemirovski, and V. Spokoiny, "On estimation of the Lr norm of a regression function," Probability Theory and Related Fields, vol. 113, no. 2, pp. 221–253, 1999.
[77] T. T. Cai and M. G. Low, "Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional," The Annals of Statistics, vol. 39, no. 2, pp. 1012–1041, 2011.
[78] A. Kroó and A. Pinkus, "Strong uniqueness," Surv. Approx. Theory, vol. 5, pp. 1–91, 2010.
[79] I. A. Ibragimov and R. Z. Has'minskii, Statistical Estimation: Asymptotic Theory. Springer-Verlag New York, 1981, vol. 2.
[80] K. Hirano and J. R. Porter, "Impossibility results for nondifferentiable functionals," Econometrica, vol. 80, no. 4, pp. 1769–1790, 2012.
[81] B. Y. Levit, "On optimality of some statistical estimates," in Proceedings of the Prague Symposium on Asymptotic Statistics, vol. 2, 1974, pp. 215–238.
[82] ——, "On efficiency of a class of non-parametric estimates," Teoriya Veroyatnostei i ee Primeneniya, vol. 20, no. 4, pp. 738–754, 1975.
[83] Y. A. Koshevnik and B. Y. Levit, "On a non-parametric analogue of the information matrix," Theory of Probability & Its Applications, vol. 21, no. 4, pp. 738–753, 1977.
[84] I. Ibragimov and R. Khas'minskii, "Estimation of linear functionals in Gaussian noise," Theory of Probability & Its Applications, vol. 32, no. 1, pp. 30–39, 1988.
[85] ——, "Estimation of linear functionals in Gaussian noise," Theory of Probability & Its Applications, vol. 32, no. 1, pp. 30–39, 1988.
[86] D. L. Donoho and R. C. Liu, "Geometrizing rates of convergence, II," The Annals of Statistics, pp. 633–667, 1991.
[87] ——, "Geometrizing rates of convergence, III," The Annals of Statistics, pp. 668–701, 1991.
[88] D. L. Donoho, "Statistical estimation and optimal recovery," The Annals of Statistics, pp. 238–270, 1994.
[89] A. Goldenshluger and S. V. Pereverzev, "Adaptive estimation of linear functionals in Hilbert scales from indirect white noise observations," Probability Theory and Related Fields, vol. 118, no. 2, pp. 169–186, 2000.
[90] J. Klemela and A. B. Tsybakov, "Sharp adaptive estimation of linear functionals," Annals of Statistics, pp. 1567–1600, 2001.
[91] T. T. Cai and M. G. Low, "A note on nonparametric estimation of linear functionals," Annals of Statistics, pp. 1140–1153, 2003.
[92] ——, "Minimax estimation of linear functionals over nonconvex parameter spaces," Annals of Statistics, pp. 552–576, 2004.
[93] ——, "On adaptive estimation of linear functionals," The Annals of Statistics, vol. 33, no. 5, pp. 2311–2343, 2005.
[94] ——, "Adaptive estimation of linear functionals under different performance measures," Bernoulli, vol. 11, no. 2, pp. 341–358, 2005.
[95] A. B. Juditsky and A. S. Nemirovski, "Nonparametric estimation by convex programming," The Annals of Statistics, pp. 2278–2300, 2009.
[96] J. Johannes and R. Schenk, "Adaptive estimation of linear functionals in functional linear models," Mathematical Methods of Statistics, vol. 21, no. 3, pp. 189–214, 2012.
[97] B. Y. Levit, "Asymptotically efficient estimation of nonlinear functionals," Problemy Peredachi Informatsii, vol. 14, no. 3, pp. 65–72, 1978.
[98] I. Ibragimov and R. Khasminskii, "On the nonparametric estimation of functionals," in Symposium in Asymptotic Statistics, 1978, pp. 41–52.
[99] P. Hall and J. S. Marron, "Estimation of integrated squared density derivatives," Statistics & Probability Letters, vol. 6, no. 2, pp. 109–115, 1987.
[100] P. J. Bickel and Y. Ritov, "Estimating integrated squared density derivatives: sharp best order of convergence estimates," Sankhyā: The Indian Journal of Statistics, Series A, pp. 381–393, 1988.
[101] D. L. Donoho and M. Nussbaum, "Minimax quadratic estimation of a quadratic functional," Journal of Complexity, vol. 6, no. 3, pp. 290–323, 1990.
[102] J. Fan, "On the estimation of quadratic functionals," The Annals of Statistics, pp. 1273–1294, 1991.
[103] S. Efromovich and M. Low, "On Bickel and Ritov's conjecture about adaptive estimation of the integral of the square of density derivative," The Annals of Statistics, vol. 24, no. 2, pp. 682–686, 1996.
[104] L. Birge and P. Massart, "Estimation of integral functionals of a density," The Annals of Statistics, pp. 11–29, 1995.
[105] L. Goldstein and R. Khas'minskii, "On efficient estimation of smooth functionals," Theory of Probability & Its Applications, vol. 40, no. 1, pp. 151–156, 1996.
[106] B. Laurent and P. Massart, "Adaptive estimation of a quadratic functional by model selection," Annals of Statistics, pp. 1302–1338, 2000.
[107] T. T. Cai and M. G. Low, "Nonquadratic estimators of a quadratic functional," The Annals of Statistics, pp. 2930–2956, 2005.
[108] ——, "Optimal adaptive estimation of a quadratic functional," The Annals of Statistics, vol. 34, no. 5, pp. 2298–2325, 2006.
[109] P. Hall and S. C. Morton, "On the estimation of entropy," Annals of the Institute of Statistical Mathematics, vol. 45, no. 1, pp. 69–88, 1993.
[110] A. B. Tsybakov and E. Van der Meulen, "Root-n consistent estimators of entropy for densities with unbounded support," Scandinavian Journal of Statistics, pp. 75–83, 1996.
[111] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van der Meulen, "Nonparametric entropy estimation: An overview," International Journal of Mathematical and Statistical Sciences, vol. 6, pp. 17–40, 1997.
[112] A. O. Hero III, B. Ma, O. J. Michel, and J. Gorman, "Applications of entropic spanning graphs," Signal Processing Magazine, IEEE, vol. 19, no. 5, pp. 85–95, 2002.
[113] J. D. Victor, "Binless strategies for estimation of information from neural data," Physical Review E, vol. 66, no. 5, p. 051903, 2002.
[114] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, p. 066138, 2004.
[115] L. Paninski and M. Yajima, "Undersmoothed kernel entropy estimators," Information Theory, IEEE Transactions on, vol. 54, no. 9, pp. 4384–4388, 2008.
[116] R. Mnatsakanov, N. Misra, S. Li, and E. Harner, "k_n-nearest neighbor estimators of entropy," Mathematical Methods of Statistics, vol. 17, no. 3, pp. 261–277, 2008.
[117] Q. Wang, S. R. Kulkarni, and S. Verdú, "Universal estimation of information measures for analog sources," Foundations and Trends in Communications and Information Theory, vol. 5, no. 3, pp. 265–353, 2009.
[118] S. Bouzebda and I. Elhattab, "Uniform-in-bandwidth consistency for kernel-type estimators of Shannon's entropy," Electronic Journal of Statistics, vol. 5, pp. 440–459, 2011.
[119] H. Liu, L. Wasserman, and J. D. Lafferty, "Exponential concentration for mutual information estimation with application to forests," in Advances in Neural Information Processing Systems, 2012, pp. 2537–2545.
[120] A. Antos and I. Kontoyiannis, "Convergence properties of functional estimates for discrete distributions," Random Structures & Algorithms, vol. 19, no. 3-4, pp. 163–193, 2001.
[121] V. Q. Vu, B. Yu, and R. E. Kass, "Coverage-adjusted entropy estimation," Statistics in Medicine, vol. 26, no. 21, pp. 4039–4060, 2007.
[122] Z. Zhang, "Entropy estimation in Turing's perspective," Neural Computation, vol. 24, no. 5, pp. 1368–1389, 2012.
[123] ——, "Asymptotic normality of an entropy estimator with exponentially decaying bias," Information Theory, IEEE Transactions on, vol. 59, no. 1, pp. 504–508, 2013.
[124] C. E. Shannon, "Prediction and entropy of printed English," Bell System Technical Journal, vol. 30, no. 1, pp. 50–64, 1951.
[125] T. M. Cover and R. King, "A convergent gambling estimate of the entropy of English," Information Theory, IEEE Transactions on, vol. 24, no. 4, pp. 413–421, 1978.
[126] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," Information Theory, IEEE Transactions on, vol. 23, no. 3, pp. 337–343, 1977.
[127] ——, "Compression of individual sequences via variable-rate coding," Information Theory, IEEE Transactions on, vol. 24, no. 5, pp. 530–536, 1978.
[128] A. D. Wyner and J. Ziv, "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression," IEEE Trans. Inf. Theory, vol. 35, no. 6, pp. 1250–1258, 1989.
[129] I. Kontoyiannis, P. H. Algoet, Y. M. Suhov, and A. Wyner, "Nonparametric entropy estimation for stationary processes and random fields, with applications to English text," Information Theory, IEEE Transactions on, vol. 44, no. 3, pp. 1319–1327, 1998.
[130] S. Verdú, "Universal estimation of information measures," in Proc. IEEE Inf. Theory Workshop, 2005.
[131] J. Jiao, H. Permuter, L. Zhao, Y.-H. Kim, and T. Weissman, "Universal estimation of directed information," Information Theory, IEEE Transactions on, vol. 59, no. 10, pp. 6220–6242, 2013.

[132] E. J. Candès and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?" Information Theory, IEEE Transactions on, vol. 52, no. 12, pp. 5406–5425, 2006.
[133] D. L. Donoho, "Compressed sensing," Information Theory, IEEE Transactions on, vol. 52, no. 4, pp. 1289–1306, 2006.
[134] X. He and Q.-M. Shao, "On parameters of increasing dimensions," Journal of Multivariate Analysis, vol. 73, no. 1, pp. 120–135, 2000.
[135] V. I. Serdobolskii, Multiparametric Statistics. Elsevier, 2007.
[136] S. Boucheron and E. Gassiat, "A Bernstein-von Mises theorem for discrete probability distributions," Electronic Journal of Statistics, vol. 3, pp. 114–148, 2009.
[137] V. Spokoiny, "Parametric estimation. Finite sample theory," The Annals of Statistics, vol. 40, no. 6, pp. 2877–2909, 2012.
[138] ——, "Bernstein-von Mises theorem for growing parameter dimension," arXiv preprint arXiv:1302.3430, 2013.
[139] L. G. Valiant, "A theory of the learnable," Communications of the ACM, vol. 27, no. 11, pp. 1134–1142, 1984.
[140] V. Vapnik and S. Kotz, Estimation of Dependences Based on Empirical Data. Springer, 2006.
[141] V. N. Vapnik and V. Vapnik, Statistical Learning Theory. Wiley New York, 1998, vol. 2.
[142] A. Orlitsky, N. P. Santhanam, and J. Zhang, "Universal compression of memoryless sources over unknown alphabets," Information Theory, IEEE Transactions on, vol. 50, no. 7, pp. 1469–1481, 2004.
[143] A. B. Wagner, P. Viswanath, and S. R. Kulkarni, "Probability estimation in the rare-events regime," Information Theory, IEEE Transactions on, vol. 57, no. 6, pp. 3207–3229, 2011.
[144] M. I. Ohannessian, V. Y. Tan, and M. A. Dahleh, "Canonical estimation in a rare-events regime," in Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on. IEEE, 2011, pp. 1840–1847.
[145] X. Yang and A. Barron, "Large alphabet coding and prediction through poissonization and tilting," in The Sixth Workshop on Information Theoretic Methods in Science and Engineering, Tokyo, 2013.
[146] R. A. Fisher, "On the mathematical foundations of theoretical statistics," Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, pp. 309–368, 1922.
[147] ——, "Theory of statistical estimation," in Mathematical Proceedings of the Cambridge Philosophical Society, vol. 22, no. 05. Cambridge Univ Press, 1925, pp. 700–725.
[148] ——, "Two new properties of mathematical likelihood," Proceedings of the Royal Society of London. Series A, vol. 144, no. 852, pp. 285–307, 1934.
[149] J. Berkson, "Minimum chi-square, not maximum likelihood!" The Annals of Statistics, pp. 457–487, 1980.
[150] S. M. Stigler, "The epic story of maximum likelihood," Statistical Science, vol. 22, no. 4, pp. 598–620, 2007.
[151] L. Le Cam, "Maximum likelihood: an introduction," Statistics Branch, Department of Mathematics, University of Maryland, 1979.
[152] C. Stein, "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution," in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 399, 1956, pp. 197–206.
[153] W. James and C. Stein, "Estimation with quadratic loss," in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 1961, 1961, pp. 361–379.
[154] D. L. Donoho and J. M. Johnstone, "Ideal spatial adaptation by wavelet shrinkage," Biometrika, vol. 81, no. 3, pp. 425–455, 1994.
[155] B. Efron, "Maximum likelihood and decision theory," The Annals of Statistics, pp. 340–356, 1982.
[156] J. Jiao, Y. Han, and T. Weissman, "Minimax estimation of divergence functions," in preparation.
[157] P. R. Halmos, "The theory of unbiased estimation," The Annals of Mathematical Statistics, vol. 17, no. 1, pp. 34–43, 1946.
[158] A. N. Kolmogorov, "Unbiased estimates," Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, vol. 14, no. 4, pp. 303–326, 1950.
[159] V. Voinov, "Unbiased estimators and their applications, vols. 1, 2."
[160] E. Y. Remez, "Sur la détermination des polynômes d'approximation de degré donnée," Comm. Soc. Math. Kharkov, vol. 10, pp. 41–63, 1934.
[161] L. N. Trefethen et al., Chebfun Version 5, The Chebfun Development Team, 2014, http://www.chebfun.org/.
[162] R. Pachón and L. N. Trefethen, "Barycentric-Remez algorithms for best polynomial approximation in the chebfun system," BIT Numerical Mathematics, vol. 49, no. 4, pp. 721–741, 2009.
[163] S. Bernstein, Extreme Properties of Polynomials and Best Approximation of Continuous Functions of One Real Variable (Russian), 1937.
[164] ——, "The best approximation |x|^p using polynomials of very high degree," Bulletin of the Russian Academy of Sciences (Russian), vol. 2, no. 2, pp. 169–190, 1938.
[165] I. I. Ibragimov, "Sur la valeur asymptotique de la meilleure approximation d'une fonction ayant un point singulier réel," Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, vol. 10, no. 5, pp. 429–460, 1946.
[166] N. Korneichuk, Exact Constants in Approximation Theory. Cambridge University Press, 1991, vol. 38.
[167] Z. Ditzian and V. Totik, Moduli of Smoothness. Springer, 1987.
[168] J. Bustamante, Algebraic Approximation: A Guide to Past and Current Solutions. Springer Science & Business Media, 2011.
[169] J. R. Rice, "Tchebycheff approximation in several variables," Transactions of the American Mathematical Society, pp. 444–466, 1963.
[170] V. Totik, Polynomial Approximation on Polytopes. Memoirs of the American Mathematical Society, 2013, vol. 232.
[171] F. Dai and Y. Xu, Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer, 2013.
[172] C. S. Withers, "Bias reduction by Taylor series," Communications in Statistics-Theory and Methods, vol. 16, no. 8, pp. 2369–2383, 1987.
[173] A. Tsybakov, "Aggregation and high-dimensional statistics," Lecture notes for the course given at the École d'été de Probabilités in Saint-Flour, URL http://www.crest.fr/ckfinder/userfiles/files/Pageperso/ATsybakov/Lecture notes SFlour.pdf, vol. 16, p. 20, 2013.
[174] ——, Introduction to Nonparametric Estimation. Springer-Verlag, 2008.
[175] T. Lindvall, Lectures on the Coupling Method. Courier Dover Publications, 2002.
[176] I. J. Good, "The population frequencies of species and the estimation of population parameters," Biometrika, vol. 40, no. 16, pp. 237–264, 1953.
[177] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[178] S. Schober, "Some worst-case bounds for Bayesian estimators of discrete distributions," in Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, July 2013, pp. 2194–2198.
[179] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Beyond maximum likelihood: from theory to practice," arXiv preprint arXiv:1409.7458, 2014.
[180] P. H. Algoet and T. M. Cover, "A sandwich proof of the Shannon-McMillan-Breiman theorem," The Annals of Probability, vol. 16, no. 2, pp. 899–909, 1988.
[181] Y. Han, J. Jiao, and T. Weissman, "Minimax estimation of distributions under ℓ1 loss," submitted to IEEE Transactions on Information Theory, 2014.
[182] K. Marton and P. C. Shields, "Entropy and the consistent estimation of joint distributions," The Annals of Probability, vol. 22, no. 2, pp. 960–977, 1994.
[183] Y. Zhou, "Structure learning of probabilistic graphical models: a comprehensive survey," arXiv preprint arXiv:1111.6925, 2011.
[184] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1–305, 2008.
[185] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[186] P. E. Meyer, K. Kontos, F. Lafitte, and G. Bontempi, "Information-theoretic inference of large transcriptional regulatory networks," EURASIP Journal on Bioinformatics and Systems Biology, vol. 2007, 2007.
[187] C. Chow and T. Wagner, "Consistency of an estimate of tree-dependent probability distributions (corresp.)," Information Theory, IEEE Transactions on, vol. 19, no. 3, pp. 369–371, 1973.
[188] V. Y. Tan, A. Anandkumar, L. Tong, and A. S. Willsky, "A large-deviation analysis of the maximum-likelihood learning of Markov tree structures," Information Theory, IEEE Transactions on, vol. 57, no. 3, pp. 1714–1735, 2011.
[189] D. L. Donoho and J. M. Johnstone, "Minimax risk over ℓp-balls for ℓq-error," Probability Theory and Related Fields, vol. 99, pp. 277–303, 1994.
[190] Y. Yang and A. Barron, "Information-theoretic determination of minimax rates of convergence," Annals of Statistics, pp. 1564–1599, 1999.
[191] S. Bernstein, "Collected works, vol. 2," Izdat. Akad. Nauk SSSR, Moscow, 1964.
[192] R. S. Varga and A. J. Carpenter, "On the Bernstein conjecture in approximation theory," Constructive Approximation, vol. 1, no. 1, pp. 333–348, 1985.

[193] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[194] W. R. Inc., "Mathematica," 2013.
[195] B. Rennie and A. Dobson, "On Stirling numbers of the second kind," Journal of Combinatorial Theory, vol. 7, no. 2, pp. 116–121, 1969.
[196] S. Bernstein, "Collected works: Vol. 1. Constructive theory of functions (1905–1930), English translation," Atomic Energy Commission, Springfield, Va, 1958.
[197] M. Qazi and Q. Rahman, "Some coefficient estimates for polynomials on the unit interval," Serdica Math. J., vol. 33, pp. 449–474, 2007.
[198] J. Riordan, "Moment recurrence relations for binomial, Poisson and hypergeometric frequency distributions," The Annals of Mathematical Statistics, vol. 8, no. 2, pp. 103–111, 1937.

Jiantao Jiao (S’13) received the B.Eng. degree with the highest honor in Electronic Engineering from Tsinghua University, Beijing, China in 2012, and a Master’s degree in Electrical Engineering from Stanford University in 2014. He is currently working towards the Ph.D. degree in the Department of Electrical Engineering at Stanford University. He is a recipient of the Stanford Graduate Fellowship (SGF). His research interests include information theory and statistical signal processing, with applications in communication, control, computation, networking, data compression, and learning.

Kartik Venkat (S'12) is a Ph.D. candidate in the Department of Electrical Engineering at Stanford University. His research interests include statistical inference, information theory, machine learning, and their applications in genomics, wireless networks, neuroscience, and quantitative finance. Kartik received a Bachelor's degree in Electrical Engineering from the Indian Institute of Technology, Kanpur in 2010, and a Master's degree in Electrical Engineering from Stanford University in 2012. His honors include a Stanford Graduate Fellowship for Engineering and Sciences, the Numerical Technologies Founders Prize, and a Jack Keil Wolf ISIT Student Paper Award at the 2012 International Symposium on Information Theory.

Yanjun Han (S'14) is currently working towards the B.Eng. degree in Electronic Engineering at Tsinghua University, Beijing, China. His research interests include information theory and statistics, with applications in communications, data compression, and learning.

Tsachy Weissman (S'99-M'02-SM'07-F'13) graduated summa cum laude with a B.Sc. in electrical engineering from the Technion in 1997, and earned his Ph.D. at the same place in 2001. He then worked at Hewlett-Packard Laboratories with the information theory group until 2003, when he joined Stanford University, where he is Associate Professor of Electrical Engineering and incumbent of the STMicroelectronics chair in the School of Engineering. He has spent leaves at the Technion and at ETH Zurich. Tsachy's research is focused on information theory, statistical signal processing, the interplay between them, and their applications. He is a recipient of several best paper awards and prizes for excellence in research. He served on the editorial board of the IEEE TRANSACTIONS ON INFORMATION THEORY from Sept. 2010 to Aug. 2013, and currently serves on the editorial board of Foundations and Trends in Communications and Information Theory.