arXiv:1207.5653v1 [stat.ME] 24 Jul 2012

Statistical Science
2012, Vol. 27, No. 2, 278–293
DOI: 10.1214/11-STS371
© Institute of Mathematical Statistics, 2012

Estimation in Discrete Parameter Models

Christine Choirat and Raffaello Seri

Abstract. In some estimation problems, especially in applications dealing with signal processing, information theory and biology, theory provides us with additional information allowing us to restrict the parameter space to a finite number of points. In this case, we speak of discrete parameter models. Even though the problem is quite old and has interesting connections with testing and model selection, asymptotic theory for these models has hardly ever been studied. Therefore, we discuss consistency, asymptotic distribution theory, information inequalities and their relations with efficiency and superefficiency for a general class of m-estimators.

Key words and phrases: Discrete parameter space, detection, large deviations, information inequalities, efficiency, superefficiency.

Christine Choirat is Associate Professor, Department of Economics, School of Economics and Business Management, Universidad de Navarra, Edificio Bibliotecas (Entrada Este), 31080 Pamplona, Spain (e-mail: [email protected]). Raffaello Seri is Assistant Professor, Dipartimento di Economia, Università degli Studi dell'Insubria, Via Monte Generoso 71, 21100 Varese, Italy (e-mail: raff[email protected]).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2012, Vol. 27, No. 2, 278–293. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

Sometimes, especially in applications dealing with signal processing and biology, information theory provides us with some additional information allowing us to restrict the parameter space to a finite number of points; in these cases, we speak of discrete parameter models. Statistical inference when the parameter space is reduced to a lattice was first considered by Hammersley [33] in a seminal paper. However, since the author was motivated by the case of the measurement of the weight of insulin, he focused mainly on a Gaussian distribution with known variance and unknown integer mean (see [33], page 192); this case was further developed by Khan [46–49]. The Poisson case was dealt with by Hammersley ([33], page 199) and met some attention in the literature as well [61, 75]. Previous works have shown that the rate of convergence of m-estimators is often exponential ([80], page 424, for the case of a translation parameter and quadratic loss; [46–49], for the case of the Gaussian integer mean; [3], for the case of discrimination between stationary Gaussian models). Other papers dealing with optimality in discrete parameter spaces are [27, 28, 78, 79]; general treatments of admissibility and related topics are in [11, 12, 29, 44, 62, 73]. Optimality of estimation under a discrete parameter space was also considered by Vajda [82, 83] in a nonorthodox setting inspired by Rényi's theory of random search; special cases have been dealt with in [81, 84]. Other aspects that have been studied are Bayesian encompassing [18, 24], the construction of confidence intervals ([19], pages 224–225), the comparison of statistical experiments ([77], Section 2.2), sufficiency and minimal sufficiency [54], and best prediction [76]. Moreover, in the estimation of complex statistical models and in the calculation of efficiency rates, approximating a general parameter space by a sequence of finite sets has proved to be a valuable tool (see, e.g., [31, 56]). A few papers showed the practical importance of discrete parameter models in automatic control and signal processing and derived some bounds on the performance of the estimators (see [4–6, 34–36, 52, 53, 58]). More recently, the topic has received new interest in the information theory literature (see [23, 43, 69]), in signal processing (see the

review paper [37]), in stochastic integer programming (see [25, 50, 86]), and in geodesy (see, e.g., [76], Section 5).

However, no general formula for the convergence rate has ever been obtained, no optimality proof under generic conditions has been provided and no general discussion of efficiency and superefficiency in discrete parameter models has appeared in the literature. In the present paper, we provide a full answer to these problems in the case of discrete parameter models for samples of i.i.d. (independent and identically distributed) random variables. Therefore, after introducing some examples of discrete parameter models in Section 2, in Section 3 we investigate the properties of a class of m-estimators. In particular, in Section 3.1, we derive some conditions for strong consistency; then, in Section 3.2, we calculate an asymptotic approximation of the distribution of the estimator and we establish its convergence rate. These results are specialized to the case of the maximum likelihood estimator (MLE) and extended to Bayes estimators in Section 3.3. In Section 4, we derive upper bounds for the convergence rate in the standard and in the minimax contexts, and we discuss the relations between information inequalities, efficiency and superefficiency. In particular, we prove that estimators of discrete parameters have uncommon efficiency properties. Indeed, under the zero–one loss, no estimator is efficient in the class of consistent estimators for any value of θ_0 ∈ Θ (θ_0 being here the true value of the parameter) and no estimator attains the information inequality we derive. But the MLE still has some appealing properties since it is minimax efficient and attains the minimax information inequality bound.

2. EXAMPLES OF DISCRETE PARAMETER MODELS

The following examples are intended to show the relevance of discrete parameter spaces in applied and theoretical statistics. In particular, they show that the results in the following sections solve some long-standing problems in statistics, optimization, information theory and signal processing.

We recall that a statistical model is a collection of probability measures P = {P_θ, θ ∈ Θ}, where Θ is the parameter space; Θ is a subset of a Euclidean or of a more abstract space.

Example 1 (Tumor transplantability). We consider tumor transplantability in mice. For a certain type of mating, the probability of a tumor "taking" when transplanted from the grandparents to the offspring is equal to (3/4)^θ, where θ is an integer equal to the number of genes determining transplantability. For another type of mating, the probability is (1/2)^θ. We aim at estimating θ knowing that n_0 transplants take out of n. The likelihood is given by

    ℓ_n(θ) = C(n, n_0) · k^{θ·n_0} · (1 − k^θ)^{n−n_0},   θ ∈ ℕ, k ∈ {1/2, 3/4},

where C(n, n_0) is the binomial coefficient. In this case the parameter space is discrete and the maximum likelihood estimator can be shown to be θ̂^n = ni[ln(n_0/n)/ln k], where ni[x] is the integer nearest to x (see [33], page 236).
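To make Example 1 concrete, the following minimal sketch (our own illustration; the function names mle_closed_form and mle_grid, the simulated data and the grid truncation theta_max are all assumptions of ours) compares the closed-form estimator ni[ln(n_0/n)/ln k] with a direct maximization of the binomial log-likelihood over a truncated integer grid:

    import numpy as np
    from scipy.stats import binom

    # Example 1 (sketch): MLE of the integer parameter theta from n0 "takes"
    # out of n transplants, with success probability k**theta.

    def mle_closed_form(n0, n, k):
        # Integer nearest to ln(n0/n)/ln(k); assumes 0 < n0 < n.
        return int(round(np.log(n0 / n) / np.log(k)))

    def mle_grid(n0, n, k, theta_max=50):
        # Cross-check: maximize the binomial log-likelihood over theta = 0..theta_max.
        thetas = np.arange(theta_max + 1)
        loglik = binom.logpmf(n0, n, k ** thetas)
        return int(thetas[np.argmax(loglik)])

    rng = np.random.default_rng(0)
    theta_true, k, n = 3, 0.75, 200
    n0 = rng.binomial(n, k ** theta_true)
    print(mle_closed_form(n0, n, k), mle_grid(n0, n, k))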
Example 2 (Exponential family restricted to a lattice). Consider a random variable X distributed according to an exponential family where the natural parameter θ is restricted to a lattice {θ_0 + ε·N, N ∈ ℕ}, for fixed θ_0 and ε (see [57], page 759). The case of a Gaussian distribution has been considered in [33] (page 192) and [46, 48], the Poisson case in [33] (page 199), [61, 75]. In particular, [33] uses the Gaussian model to estimate the molecular weight of insulin, assumed to be an integer (however, see the remarks of Tweedie in the discussion of the same paper).

Example 3 (Stochastic discrete optimization). We consider an optimization problem of the form min_{x∈S} g(x), where g(x) ≜ E G(x, W) is an integral functional, E is the mean under the probability P, G(x, w) is a real-valued function of two variables x and w, W is a random variable having probability distribution P and S is a finite set. We approximate this problem through the sample average function ĝ_n(x) ≜ (1/n) Σ_{i=1}^n G(x, W_i) and the associated problem min_{x∈S} ĝ_n(x). See [50] for some theoretical results and a discussion of the stochastic knapsack problem and [86] for an up-to-date bibliography.

Example 4 (Approximate inference). In many applied cases, the requirement that the true model generating the data corresponds to a point belonging to the parameter space appears to be too strong and unlikely. Moreover, the objective is often to recover a model reproducing some stylized facts from the original data. In these cases, approximation of a continuous parameter space with a finite number of points allows for obtaining such a model under weaker assumptions. This situation arises, for example, in signal processing and automatic control applications [4–6, 34–36] and is reminiscent of some related statistical techniques, such as the discretization device of Le Cam ([56], Section 6.3), or the sieve estimation of Grenander ([31]; see also [26], Remark 5).

Example 5 (M-ary hypotheses testing and related fields). In information theory, discrete parameter models are quite common, and their estimation is a generalization of binary hypothesis testing that goes under the names of M-ary hypotheses (or multihypothesis) testing, classification or detection (see the examples in [63]). Consider a received waveform r(t) described by the equation r(t) = m(t) + σ·n(t) for t ≥ 0, where m(t) is a deterministic signal, n(t) is an additive Gaussian white noise and σ is the noise intensity. The set of possible signals is restricted to a finite number of alternatives, say {m_0(t), ..., m_J(t)}: the chosen signal is usually the one that maximizes the log-likelihood of the sample, or an alternative criterion function. For example, if the log-likelihood of the process based on the observation window [0, T] is used, we have

    m̂_j(·) = arg max_{j=0,...,J} (1/σ²) [ ∫_0^T m_j(t)·r(t) dt − (1/2) ∫_0^T m_j²(t) dt ].

Much more complex cases can be dealt with; see [37] for an introduction.
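A discrete-time sketch of Example 5 may help; everything below (unit sampling step, the sinusoidal templates, the noise level) is our own assumption, not part of the original example. With a unit sampling step the integral criterion reduces to an inner product minus half the template energy:

    import numpy as np

    # Sketch of M-ary detection (Example 5) in discrete time:
    # score_j = (<m_j, r> - ||m_j||^2 / 2) / sigma^2, pick the argmax.

    rng = np.random.default_rng(1)
    T, J, sigma = 256, 3, 1.0
    t = np.arange(T)
    templates = np.array([np.sin(2 * np.pi * (j + 1) * t / T) for j in range(J + 1)])

    j_true = 2
    r = templates[j_true] + sigma * rng.standard_normal(T)  # r(t) = m(t) + sigma n(t)

    scores = (templates @ r - 0.5 * np.sum(templates ** 2, axis=1)) / sigma ** 2
    print("detected hypothesis:", int(np.argmax(scores)), "true:", j_true)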
3. M-ESTIMATORS IN DISCRETE PARAMETER MODELS

In this section, we consider an estimator obtained by maximizing an objective function of the form

    Q_n(θ) = (1/n) Σ_{i=1}^n ln q(y_i; θ);

in what follows, we allow for misspecification. Note that the expression m-estimator stands for maximum likelihood type estimator, in the spirit of Huber [39], and not for maximum (or extremum) estimator (see, e.g., [64], page 2114).

3.1 Consistency of m-Estimators

In the case of a discrete parameter space, uniform convergence reduces to pointwise convergence. Therefore, m-estimators are strongly consistent under less stringent conditions than in the standard case; in particular, no condition is needed on the continuity or differentiability of the objective function. The following assumption is used in order to prove consistency in the case of i.i.d. replications:

A1. The data (Y_i)_{i=1}^n are realizations of i.i.d. (Y, 𝒴)-valued random variables having probability measure P_0.

The estimator θ̂^n is obtained by maximizing over the set Θ = {θ_0, θ_1, ..., θ_J}, of finite cardinality, the objective function

    Q_n(θ) ≜ (1/n) Σ_{i=1}^n ln q(y_i; θ).

The function q is 𝒴-measurable for each θ ∈ Θ and satisfies the L¹-domination condition E_0 |ln q(Y; θ)| < +∞ for every θ ∈ Θ, where E_0 denotes the expectation taken under the true probability measure P_0. Moreover, θ_0 is the point of Θ maximizing E_0 ln q(Y; θ) and θ_0 is globally identified (see [64], Section 2.2).

Remark 1. (i) The assumption of a finite parameter space seems restrictive with respect to the more general assumption of Θ being countable (see, e.g., [33]). However, A1 is compatible with the convex hull of Θ being compact, as in standard asymptotic theory. Indeed, the cases analyzed in [33] have convex likelihood functions and this is a well-known substitute for compactness of Θ (see [64], page 2133; see [17], for consistency with neither convexity nor compactness). Moreover, the restriction to finite parameter spaces seems to be necessary to derive the asymptotic approximation to the distribution of m-estimators.

(ii) The relative position of the points of Θ is unimportant and the choice of θ_0 as the maximizer is arbitrary and is made only for practical purposes. Note that θ_0 has no link with P_0 apart from being the pseudo-true value of ln q with respect to P_0 on the parameter space Θ (see, e.g., [30], Volume 1, page 14).

Proposition 1. Under Assumption A1, the m-estimator θ̂^n is a P_0-strongly consistent estimator of θ_0 and is 𝒴^{⊗n}-measurable.

Remark 2. A similar result of consistency for discrete parameter spaces has been provided by [74] (page 446), by [13, 14] (pages 325–333), by [8] (pages 1293–1294) as an application of the Shannon–McMillan–Breiman Theorem of information theory, by [87] (Section 2.1) as a preliminary result of his work on partial likelihood, and by [60] (page 96, Section 7.1.6).
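Proposition 1 suggests a very simple computational recipe: with Θ finite, the m-estimator is an argmax over a fixed list of candidate average log-criteria. A minimal sketch (our own choice of q as a unit-variance Gaussian location density on an integer grid) is:

    import numpy as np
    from scipy.stats import norm

    # Sketch of an m-estimator over a finite parameter space (Section 3):
    # theta_hat maximizes Q_n(theta) = mean of ln q(y_i; theta) over Theta.

    def m_estimate(y, thetas):
        Qn = np.array([norm.logpdf(y, loc=th).mean() for th in thetas])
        return thetas[np.argmax(Qn)]

    rng = np.random.default_rng(2)
    thetas = np.arange(-5, 6)              # Theta = {-5, ..., 5}
    y = rng.normal(2.0, 1.0, size=500)     # true mean 2 belongs to Theta
    print("theta_hat =", m_estimate(y, thetas))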

3.2 Distribution of the m-Estimator

For a discrete parameter space, the finite sample distribution of the m-estimator θ̂^n is a discrete distribution converging to a Dirac mass concentrated at θ_0. Since the determination of an asymptotic approximation to this distribution is an interesting and open problem, we derive in this section upper and lower bounds and asymptotic estimates for probabilities of the form P_0(θ̂^n = θ_i).

To simplify the following discussion, we introduce the processes:

(1)  Q_n(θ_j) ≜ (1/n) Σ_{i=1}^n ln q(y_i; θ_j),
     X_k^{(i)} ≜ [ln q(Y_k; θ_i) − ln q(Y_k; θ_j)]_{j=0,...,J, j≠i},   i = 1,...,J,
     X_k ≜ X_k^{(0)} = [ln q(Y_k; θ_0) − ln q(Y_k; θ_j)]_{j=1,...,J}.

The probability of the estimator θ̂^n taking on the value θ_i can be written as

(2)  P_0(θ̂^n = θ_i) = P_0(Q_n(θ_i) > Q_n(θ_j), ∀j ≠ i) = P_0( Σ_{k=1}^n X_k^{(i)} ∈ int ℝ_+^J ).

The only approaches that have been successful in our experience are large deviations (in logarithmic and exact form) and saddlepoint approximations. Note that we could have defined the probability in (2) as P_0(Q_n(θ_i) ≥ Q_n(θ_j), ∀j ≠ i) or through any other combination of equality and inequality signs; this introduces some arbitrariness in the distribution of θ̂^n. However, we will give some conditions (see Proposition 2) under which this difference is asymptotically irrelevant.

Section 3.2.1 introduces definitions and assumptions and discusses a preliminary result. In Section 3.2.2 we derive some results on the asymptotic behavior of P_0(θ̂^n = θ_i) using large deviations principles (LDP). Then, we provide some refinements of the previous expressions using the theory of exact asymptotics for large deviations, with special reference to the case J = 1. At last, Section 3.2.3 derives saddlepoint approximations for probabilities of the form (2).

3.2.1 Definitions, assumptions and preliminary results. As concerns the distribution of the m-estimator θ̂^n, we shall need some concepts and functions derived from large deviations theory (see [21]); we recall that the processes Q_n(θ_j), X_k^{(i)} and X_k have been introduced in (1). Then, for i = 0,...,J, we define the moment generating functions

    M^{(i)}(λ) ≜ E_0[ exp( Σ_{j=0,...,J, j≠i} λ_j · [ln q(Y;θ_i) − ln q(Y;θ_j)] ) ] = E_0[e^{λᵀX^{(i)}}],

the logarithmic moment generating functions

    Λ^{(i)}(λ) ≜ ln M^{(i)}(λ) = ln E_0[e^{λᵀX^{(i)}}],

and the Cramér transforms

    Λ^{(i),*}(y) ≜ sup_{λ∈ℝ^J} [⟨y, λ⟩ − Λ^{(i)}(λ)],

where ⟨·,·⟩ is the scalar product. Note that, in what follows, M(λ), Λ(λ) and Λ*(y) are respectively shortcuts for M^{(0)}(λ), Λ^{(0)}(λ) and Λ^{(0),*}(y). Moreover, for a function f : E → ℝ, we will need the definition of the effective domain of f, D_f ≜ {x ∈ E : f(x) < ∞}.

The following assumptions will be used to approximate the distribution of θ̂^n.

A2. There exists a δ > 0 such that, for any η ∈ (−δ, δ), we have

    E_0 [ (q(Y; θ_j)/q(Y; θ_k))^η ] < +∞   ∀ j, k = 0,...,J.

Remark 3. In what follows, this assumption could be replaced by a condition as in [68] (Assumptions H1 and H2).

A3. Λ^{(i)}(λ) is steep, that is, lim_{n→∞} ‖∂Λ^{(i)}(x_n)/∂x‖ = ∞ whenever {x_n}_n is a sequence in int(D_{Λ^{(i)}}) converging to a boundary point of int(D_{Λ^{(i)}}).

Remark 4. Under Assumptions A1, A2 and A3, Λ^{(i)}(·) is essentially smooth (see, e.g., [21], page 44). A sufficient condition for A3 and essential smoothness is openness of D_{Λ^{(i)}} (see [66], page 905, and [40], pages 505–506).

A4. int(ℝ_+^J) ∩ S^{(i)} ≠ ∅, where S^{(i)} is the closure of the convex hull of the support of the law of X^{(i)}.
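These objects are easy to compute in simple models. As a worked illustration (our own case, anticipating Example 6 below): for the two-point Gaussian model Θ = {−α, α} with true mean α and variance σ², the log-likelihood ratio X = ln[q(Y;−α)/q(Y;α)] is N(−2α²/σ², 4α²/σ²) under P_0, so Λ(λ) has a closed form and the Cramér transform can be evaluated numerically:

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Log-MGF and Cramér transform for the two-point Gaussian model (sketch).
    alpha, sigma = 1.0, 2.0
    m, s2 = -2 * alpha**2 / sigma**2, 4 * alpha**2 / sigma**2

    def Lambda(lam):            # ln E0 exp(lam * X) for Gaussian X ~ N(m, s2)
        return lam * m + 0.5 * lam**2 * s2

    def Lambda_star(y):         # sup_lam [y*lam - Lambda(lam)]
        res = minimize_scalar(lambda lam: -(y * lam - Lambda(lam)))
        return -res.fun

    # Lambda_star(0) = alpha**2 / (2 * sigma**2): the Chernoff rate of
    # the misclassification probability, as found in Example 6.
    print(Lambda_star(0.0), alpha**2 / (2 * sigma**2))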

We will also need the following lemma showing the equivalence between Assumption A2 and the so-called Cramér condition 0 ∈ int(D_{Λ^{(i)}}), for any i = 0,...,J.

Lemma 1. Under Assumption A1, the following conditions are equivalent:
(i) Assumption A2 holds;
(ii) 0 ∈ int(D_{Λ^{(i)}}), for any i = 0,...,J.

As concerns the saddlepoint approximation of Section 3.2.3, we need the following assumption:

A5. The inequality

    | E_0 Π_{j=0,...,J, j≠i} (q(Y;θ_i)/q(Y;θ_j))^{u_j + ι·t_j} |
        < (1 − δ) · E_0 Π_{j=0,...,J, j≠i} (q(Y;θ_i)/q(Y;θ_j))^{u_j} < ∞

holds for u ∈ int(D_{Λ^{(i)}}), δ > 0 and c < |t| < C·n^{(s−3)/2} (ι denotes the imaginary unit).

3.2.2 Large deviations asymptotics. In this section we consider large deviations asymptotics. We note that, in what follows, int(ℝ_+^J)^c stands for [int(ℝ_+^J)]^c.

Proposition 2. (i) For i = 1,...,J, under Assumption A1, the following result holds:

    P_0(θ̂^n = θ_i) ≥ exp{ −n · inf_{y∈int(ℝ_+^J)} Λ^{(i),*}(y) + o_inf(n) },

where o_inf(n) is a function such that lim inf_{n→∞} o_inf(n)/n = 0.
(ii) Under Assumptions A1 and A2:

    P_0(θ̂^n = θ_i) ≤ exp{ −n · inf_{y∈ℝ_+^J} Λ^{(i),*}(y) − o_sup(n) },

where o_sup(n) is a function such that lim sup_{n→∞} o_sup(n)/n = 0.
(iii) Under Assumptions A1, A2, A3 and A4:

    P_0(θ̂^n = θ_i) = exp{ −(n + o(n)) · inf_{y∈int(ℝ_+^J)} Λ^{(i),*}(y) }
                    = exp{ −(n + o(n)) · inf_{y∈ℝ_+^J} Λ^{(i),*}(y) }.

Proposition 3. Under Assumption A1, the following inequality holds:

    P_0(θ̂^n ≠ θ_0) ≥ H · exp{ −n · inf_{y∈int(ℝ_+^J)^c} Λ*(y) + o_inf(n) },

where H is the finite cardinality of the set arg inf_{y∈int(ℝ_+^J)^c} Λ*(y) and o_inf(n) is a function such that lim inf_{n→∞} o_inf(n)/n = 0. Under Assumptions A1 and A2:

    P_0(θ̂^n ≠ θ_0) ≤ H · exp{ −n · inf_{y∈int(ℝ_+^J)^c} Λ*(y) − o_sup(n) },

where o_sup(n) is a function such that lim sup_{n→∞} o_sup(n)/n = 0.

Remark 5. The proposition allows us to obtain an upper bound on the bias of the m-estimator, Bias(θ̂^n) ≤ sup_{j≠0} |θ_j − θ_0| · P_0(θ̂^n ≠ θ_0).

A better description of the asymptotic behavior of the probability P_0(θ̂^n = θ_i) could be obtained, under some additional conditions, from the study of the neighborhood of the contact point between the set ℝ_+^J and the level sets of the Cramér transform Λ^{(i),*}(·). We leave the topic for future work. Here we just remark the following brackets on the convergence rate.

Proposition 4. Under Assumptions A1, A2, A3 and A4, for sufficiently large n, the following result holds:

    c_1 · e^{−n·inf_{y∈ℝ_+^J} Λ^{(i),*}(y)} / n^{J/2} ≤ P_0(θ̂^n = θ_i) ≤ c_2 · e^{−n·inf_{y∈ℝ_+^J} Λ^{(i),*}(y)} / n^{1/2},

for i = 1,...,J and for some 0 < c_1 ≤ c_2 < +∞.

When J = 1, a more precise convergence rate can be obtained under the following assumption:

A6. When J = 1, there is a positive value μ ∈ int(D_{Λ^{(1)}}) such that ∂Λ^{(1)}(λ)/∂λ |_{λ=μ} = 0. Moreover, the law of ln[q(Y;θ_1)/q(Y;θ_0)] is nonlattice (see [21], page 110).

Proposition 5. Under Assumptions A1, A2, A3, A4 and A6, with Θ = {θ_0, θ_1} and J = 1, we have

    P_0(θ̂^n = θ_1) = P_0(θ̂^n ≠ θ_0)
        = [ e^{n·Λ^{(1)}(μ)} / ( μ·√(Λ^{(1)}″(μ)·2πn) ) ] · (1 + o(1))
        = [ e^{−n·Λ^{(1),*}(0)} / (Λ^{(1),*})′(0) ] · √( (Λ^{(1),*})″(0) / (2πn) ) · (1 + o(1)).

Remark 6. A refinement of the previous asymptotic rates can be obtained using results in [2, 10].
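Proposition 5 can be checked by simulation in the two-point Gaussian model used above (our sketch; all numerical values are our own choices). There μ = 1/2 and Λ^{(1)}(μ) = −α²/(2σ²), so the predicted error probability is e^{−nα²/(2σ²)}·σ/(α√(2πn)):

    import numpy as np

    # Monte Carlo check (sketch) of Proposition 5 for Theta = {-alpha, alpha},
    # true mean alpha: the MLE picks -alpha iff the sample mean is negative.

    rng = np.random.default_rng(3)
    alpha, sigma, n, reps = 1.0, 2.0, 50, 200_000

    ybar = rng.normal(alpha, sigma / np.sqrt(n), size=reps)
    p_mc = np.mean(ybar < 0)

    rate = alpha**2 / (2 * sigma**2)
    p_asy = np.exp(-n * rate) * sigma / (alpha * np.sqrt(2 * np.pi * n))
    print(f"Monte Carlo: {p_mc:.5f}   asymptotic: {p_asy:.5f}")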

3.2.3 Saddlepoint approximation. In this section we consider a different kind of approximation of the probabilities P_0(θ̂^n = θ_i).

Theorem 1. Under Assumptions A1, A2 and A5, for i ≠ 0, it is possible to choose u such that, for every v ∈ [(int ℝ_+^J) ⊖ ∂Λ^{(i)}(u)/∂u], uᵀv ≥ 0 and

    P_0(θ̂^n = θ_i) = exp{ n · [Λ^{(i)}(u) − u·∂Λ^{(i)}(u)/∂u] }
        · [ e_{s−3}(u, int ℝ_+^J ⊖ E_0X^{(i)}) + δ(u, int ℝ_+^J ⊖ E_0X^{(i)}) ],

where

    e_{s−3}(u, int ℝ_+^J ⊖ E_0X^{(i)})
        = ∫_{int ℝ_+^J ⊖ ∂Λ^{(i)}(u)/∂u} [ exp(−n·u·y − n·‖y*‖²/2) / ( (2π/n)^{J/2}·∆^{1/2} ) ]
          · [ 1 + Σ_{i=1}^{s−3} n^{−i/2}·Q_{iu}(√n·y*) ] dy,

    Q_{ℓu}(x) = Σ_{m=1}^{ℓ} (1/m!) Σ* Σ** [ κ_{ν_1 n} ⋯ κ_{ν_m n} / (ν_1! ⋯ ν_m!) ] · H_{I_1}(x_1) ⋯ H_{I_d}(x_d),

    |δ(u, int ℝ_+^J ⊖ E_0X^{(i)})| ≤ C · n^{−(s−2)/2},

and V = ∂²Λ^{(i)}(u)/∂u², y* = V^{−1/2}y, ‖y*‖² = y*·y* = yᵀV^{−1}y, ∆ = |V|, H_m is the usual Hermite–Chebyshev polynomial of degree m, Σ* denotes the sum over all m-tuples of positive integers (j_1,...,j_m) satisfying j_1 + ⋯ + j_m = ℓ, Σ** denotes the sum over all m-tuples (ν_1,...,ν_m) with ν_i = (ν_{1i},...,ν_{di}) satisfying ν_{1i} + ⋯ + ν_{di} = j_i + 2, i = 1,...,m, and I_h = ν_{h1} + ⋯ + ν_{hm}, h = 1,...,d. Note that Q_{ℓu} depends on u through the cumulants calculated at u.

Remark 7. The main question that this theorem leaves open is the choice of the point u. Usually this point is chosen as a solution û of m(û) = x̂; this corresponds to a saddlepoint in κ(u). [20] (Section 6) and [59] (page 480) give some conditions for J = 1; [41] (page 23) and [7] (page 153) give conditions for general J. [42] suggests that the most common solution is to choose x̂ and û (x̂ belonging to the boundary of [int ℝ_+^J ⊖ E_0X^{(i)}] and û solving m(û) = x̂), such that for every v ∈ [int ℝ_+^J ⊖ ∂Λ^{(i)}(u)/∂u], ûᵀv ≥ 0. This is the same as a dominating point in [65–67]; therefore, A2, A3 and A4, for sufficiently large n, imply the existence of this point for any i.

3.3 The MLE and Bayes Estimators in Discrete Parameter Models

In this section, we show how the previous results can be applied to the MLE and Bayes estimators under the zero–one loss function. The MLE is defined by

    θ̂^n ≜ arg max_{θ∈Θ} Π_{i=1}^n f_{Y_i}(y_i; θ) = arg max_{θ∈Θ} { (1/n) Σ_{i=1}^n ln f_{Y_i}(y_i; θ) }.

This corresponds to the minimum-error-probability estimate of [69] and to the Bayesian estimator of [82, 83]. On the other hand, using the prior densities given by π(θ) for θ ∈ Θ, the posterior densities of the Bayesian estimator are given by

    P{θ_k | Y} = [ Π_{i=1}^n f_{Y_i}(y_i; θ_k) · π(θ_k) ] / [ Σ_{j=0}^J Π_{i=1}^n f_{Y_i}(y_i; θ_j) · π(θ_j) ].

The Bayes estimator relative to zero–one loss θ̌^n (see Section 4.3 for a definition) is the mode of the posterior distribution and is given by

(3)  θ̌^n ≜ arg max_{θ∈Θ} ln P{θ | Y} = arg max_{θ∈Θ} [ (1/n) Σ_{i=1}^n ln f_{Y_i}(y_i; θ) + ln π(θ)/n ].

Note that the MLE coincides with the Bayes estimator corresponding to the uniform distribution π(θ) = (J + 1)^{−1} for any θ ∈ Θ.

Assumption A1 can be replaced by the following ones (where Assumptions A8 and A9 entail that the likelihood function is asymptotically maximized at θ_0 only):

A7. The parametric statistical model is formed by a set P of probability measures on a measurable space (Ω, 𝒜) indexed by a parameter θ ranging over a parameter space Θ = {θ_0, θ_1, ..., θ_J}, of finite cardinality. Let (Y, 𝒴) be a measurable space and µ a positive σ-finite measure defined on (Y, 𝒴) such that, for every θ ∈ Θ, P_θ is equivalent to µ; the densities f_Y(Y; θ) are 𝒴-measurable for each θ ∈ Θ. The data (Y_i)_{i=1}^n are i.i.d. realizations from the probability measure P_0.

A8. The log density satisfies the L¹-domination condition E_0 |ln f_Y(Y; θ_i)| < +∞, for θ_i ∈ Θ, where E_0 denotes the expectation taken under the true probability measure P_0.
A9. θ_0 is the point of Θ maximizing E_0 ln f_Y(Y; θ) and is globally identified.
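The estimators of this section are one-liners on a finite Θ. A minimal sketch of the MLE and the posterior mode (3) (our own Gaussian choice for f_Y and an arbitrary prior, both assumptions of this illustration):

    import numpy as np
    from scipy.stats import norm

    # MLE and posterior-mode Bayes estimator (3) on a finite parameter space,
    # assuming unit-variance Gaussian densities f(y; theta).

    def mle_and_bayes(y, thetas, prior):
        loglik = np.array([norm.logpdf(y, loc=th).sum() for th in thetas])
        theta_mle = thetas[np.argmax(loglik)]
        theta_bayes = thetas[np.argmax(loglik + np.log(prior))]  # eq. (3)
        return theta_mle, theta_bayes

    rng = np.random.default_rng(4)
    thetas = np.array([-1.0, 0.0, 1.0])
    prior = np.array([0.1, 0.1, 0.8])       # non-uniform prior
    y = rng.normal(0.0, 1.0, size=20)
    print(mle_and_bayes(y, thetas, prior))  # with a flat prior the two coincide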

In order to obtain the consistency of Bayes estimators, we need the following assumption on the behavior of the prior distribution:

A10. The prior distribution verifies π(θ) > 0 for any θ ∈ Θ.

Proposition 1 holds for the MLE under Assumptions A7, A8 and A9, while for Bayes estimators A10 is required, too. Note that, under correct specification (i.e., when the true parameter value belongs to Θ), a standard Wald's argument (see, e.g., Lemma 2.2 in [64]) shows that E_0 ln f_Y(Y; θ) is maximized at the true value, so that θ_0 in A9 is indeed the true parameter value.

The quantities appearing in the previous propositions can be expressed through the Hellinger transform H_γ(θ_0,...,θ_J) ≜ ∫ Π_{j=0}^J f_Y(y; θ_j)^{γ_j} µ(dy). Moreover, due to its convexity, H_γ(θ_0,...,θ_J) is surely finite for γ belonging to the closed simplex in ℝ^{J+1}.

Proposition 4 holds if Assumption A1 is replaced by Assumptions A7, A8 and A9, and if A2 and A3 hold true. However, Assumption A4 is unnecessary; indeed, the fact that int(ℝ_+^J) ∩ S^{(i)} ≠ ∅ can be proved showing that 0 ∈ int(S^{(i)}). This is equivalent to the existence, for j = 1,...,J, j ≠ i, of two sets A_j* and A_j** of positive µ-measure and included in the support of Y such that, for y* ∈ A_j* and y** ∈ A_j**, f_Y(y*; θ_i) > f_Y(y*; θ_j) and f_Y(y**; θ_i) < f_Y(y**; θ_j).

4. INFORMATION INEQUALITIES AND EFFICIENCY

In this section, we derive lower bounds on the convergence rate of an arbitrary consistent estimator θ̃^n of θ_0, both in the pointwise and in the minimax sense. In the following, C_1 denotes the zero–one cost function C_1(θ̃^n, θ_0) = 1{θ̃^n ≠ θ_0} and R_1(θ̃^n, θ_0) = P_{θ_0}(θ̃^n ≠ θ_0) the associated risk, that is, the probability of error.

Proposition 6. Under Assumptions A7, A8 and A9, for any strongly consistent estimator θ̃^n of θ_0,

(5)  lim inf_{n→∞} (1/n) ln P_{θ_0}{θ̃^n ≠ θ_0} ≥ − inf_{θ∈Θ∖{θ_0}} E_θ ln [ f_Y(Y; θ) / f_Y(Y; θ_0) ].

Proposition 7. Under Assumptions A7, A8 and A9, for any estimator θ̃^n,

(7)  lim inf_{n→∞} (1/n) ln sup_{θ_0∈Θ} P_{θ_0}{θ̃^n ≠ θ_0}
        ≥ sup_{θ_0∈Θ} sup_{θ_1∈Θ∖{θ_0}} ln inf_{u>0} ∫ f_Y(y; θ_1)^u · f_Y(y; θ_0)^{1−u} µ(dy).

The bound in Proposition 6 is pointwise and holds for consistent estimators, while the bound in Proposition 7 is minimax; as will be shown below, the two bounds are not attained by the same estimators. When needed, we will refer to the former as Chapman–Robbins lower bound (and to the related efficiency concept as Chapman–Robbins efficiency) since it recalls the lower bound proposed by these authors in [16].
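As a worked check (our own computation, anticipating Example 6 below), take f_Y(·; θ) to be the N(θ, σ²) density with θ_1 = −α and θ_0 = α; both bounds are then explicit:

    % Kullback-Leibler divergence (Proposition 6) for two Gaussians:
    \[
      E_{\theta_1}\ln\frac{f_Y(Y;\theta_1)}{f_Y(Y;\theta_0)}
        = \frac{(\theta_1-\theta_0)^2}{2\sigma^2}
        = \frac{2\alpha^2}{\sigma^2},
    \]
    % Chernoff coefficient (Proposition 7), minimized at u = 1/2:
    \[
      \inf_{u>0}\int f_Y(y;\theta_1)^u\,f_Y(y;\theta_0)^{1-u}\,dy
        = \inf_{u\in(0,1)} e^{-u(1-u)(\theta_1-\theta_0)^2/(2\sigma^2)}
        = e^{-\alpha^2/(2\sigma^2)}.
    \]

So Proposition 6 gives the pointwise rate −2α²/σ² and Proposition 7 gives the minimax rate −α²/(2σ²), exactly the two values found in Example 6.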

Remark 9. (i) The previous proposition provides an expression for the minimax Bahadur risk (also called (minimax) rate of inaccuracy; see [1, 51]) analogous to Chernoff's Bound, thus providing a minimax version of Remark 8(ii).

(ii) Other methods to derive similar minimax inequalities are Fano's Inequality and Assouad's Lemma (see [56], page 220); however, in the present case they do not allow us to obtain tight bounds, since the usual application of these methods relies on the approximation of the parameter space with a finite set of points Θ whose cardinality increases with n. Clearly, this cannot be done in the present case.

(iii) Using Lemma 5.2 in [70], it is possible to show that the minimax bound is larger than the classical one.

(iv) Under Assumption A10, the Bayes risk r_1 under the risk function R_1 and the prior π respects the equality

(8)  lim_{n→∞} (1/n) ln r_1(θ̃^n, π) = lim_{n→∞} (1/n) ln max_{θ_0∈Θ} R_1(θ̃^n, θ_0).

Then, Proposition 7 holds also for the Bayes risk: clearly this bound is independent of the prior distribution π (provided it is strictly positive, i.e., A10 holds) and also holds for the probability of error P_e. This inequality can be seen as an asymptotic version of the van Trees inequality for a different risk function.

4.2 Optimality and Efficiency

In this section, we establish some optimality results for the MLE in discrete parameter models. The situation is much more intricate than in regular statistical models under the quadratic loss function, in which efficiency coincides with the attainment of the Cramér–Rao lower bound (despite superefficiency). Therefore, we propose the following definition. We denote by R = R(θ̄^n, θ_0) the risk function of the estimator θ̄^n evaluated at θ_0, and by Θ̃ a class of estimators.

Definition 1. The estimator θ̄^n is efficient with respect to (w.r.t.) Θ̃ and w.r.t. R at θ_0 if

(9)  R(θ̄^n, θ_0) ≤ R(θ̃^n, θ_0)   ∀ θ̃^n ∈ Θ̃.

The estimator θ̄^n is minimax efficient w.r.t. Θ̃ and w.r.t. R if

(10)  sup_{θ_0∈Θ} R(θ̄^n, θ_0) ≤ sup_{θ_0∈Θ} R(θ̃^n, θ_0)   ∀ θ̃^n ∈ Θ̃.

The estimator θ̄^n is superefficient w.r.t. Θ̃ and w.r.t. R if for every θ̃^n ∈ Θ̃:

    R(θ̄^n, θ_0) ≤ R(θ̃^n, θ_0)

for every θ_0 ∈ Θ and there exists at least a value θ_0* ∈ Θ such that the inequality is replaced by a strict inequality for θ_0 = θ_0*.

The estimator θ̄^n is asymptotically CR-efficient w.r.t. R at θ_0 if it attains the Chapman–Robbins lower bound of Proposition 6 at θ_0 [say CR_R(θ_0)] in the asymptotic form:

    lim inf_{n→∞} (1/n) ln R(θ̄^n, θ_0) = ln CR_R(θ_0).

The estimator θ̄^n is asymptotically minimax CR-efficient w.r.t. R if it attains the minimax Chapman–Robbins lower bound of Proposition 7 (say CR_{R,max}) in the asymptotic form:

    lim inf_{n→∞} (1/n) ln sup_{θ_0∈Θ} R(θ̄^n, θ_0) = ln CR_{R,max}.

The estimator θ̄^n is asymptotically CR-superefficient w.r.t. R if

    lim inf_{n→∞} (1/n) ln R(θ̄^n, θ_0) ≤ ln CR_R(θ_0)

for every θ_0 ∈ Θ and there exists at least a value θ_0* ∈ Θ such that the inequality is replaced by a strict inequality for θ_0 = θ_0*.

Remark 10. As in Remark 8(ii), it is easy to see that IR-optimality and CR-efficiency w.r.t. R_1 coincide.

The efficiency landscape offered by discrete parameter models will be illustrated by Example 6. This shows that, even in the simplest case, that is, the estimation of the integer mean of a Gaussian random variable with known variance, the MLE does not attain the lower bound on the misclassification probability but it attains the minimax lower bound. Moreover, simple estimators are built that outperform the MLE for certain values of the true parameter θ_0.
Example 6. Let us consider the estimation of the mean of a Gaussian distribution whose variance σ² is known: we suppose that the true mean is α, while the parameter space is {−α, α}, where α is known. The maximum likelihood estimator θ̂^n takes the value −α if the sample mean takes on its value in (−∞, 0) and α if it falls in [0, +∞) (the position of 0 is a convention). Therefore:

    P_{θ_0}(θ̂^n ≠ θ_0) = P_{θ_0}(θ̂^n = −α)
        = ∫_{−∞}^0 [ e^{−(ȳ−α)²/(2σ²/n)} / √(2πσ²/n) ] dȳ
        = ∫_{−∞}^{−√n·α/σ} [ e^{−t²/2} / √(2π) ] dt
        = Φ(−√n·α/σ)
        = [ e^{−nα²/(2σ²)} / √(2πn) ] · (σ/α) · [1 + O(1/n)],

where we have used Problem 1 on page 193 in [22]. Proposition 5 allows also for recovering the right convergence rate. Indeed, we have

    P_{θ_0}(θ̂^n = −α) = P_{θ_0}(θ̂^n ≠ α) = [ e^{−nα²/(2σ²)} / √(2πn) ] · (σ/α) · (1 + o(1)).

On the other hand, the lower bound of Proposition 6 yields

    lim_{n→∞} (1/n) ln P_{θ_0}(θ̂^n ≠ θ_0) ≥ −2α²/σ²,

and the lower bound of Proposition 7 yields

    lim inf_{n→∞} sup_{θ_0∈{−α,α}} (1/n) ln P_{θ_0}(θ̂^n ≠ θ_0) ≥ −α²/(2σ²).

Therefore, the MLE asymptotically attains the minimax lower bound but not the classical one.

In the following, we will show that estimators can be pointwise more efficient than the MLE; consider the estimator defined by

    θ̃^n(k) = θ_0 if L_n(θ_0) ≥ L_n(θ_1) + k·n,   θ̃^n(k) = θ_1 else.

When k = 0, θ̃^n(k) coincides with the MLE θ̂^n. Then, the behavior of the estimator is characterized by the probabilities:

    P_{θ_0}(θ̃^n(k) = θ_0) = Φ( (k·σ² + 2α²)·√n / (2ασ) ),
    P_{θ_1}(θ̃^n(k) = θ_0) = Φ( (k·σ² − 2α²)·√n / (2ασ) ).

We have (weak) consistency if

(11)  2(α/σ)² > k > −2(α/σ)².

The risk R_1(θ̃^n(k), θ_0) under θ_0 is then

    P_{θ_0}(θ̃^n(k) ≠ θ_0) = Φ( −(k·σ² + 2α²)·√n / (2ασ) );

this can be made smaller than the probability of error of the MLE simply taking k > 0, thus implying that the MLE is not pointwise efficient.
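The tradeoff behind θ̃^n(k) is easy to see by simulation (our sketch; the decision rule is implemented as a shifted sample-mean threshold so that it reproduces the probabilities displayed above, and the numerical values are our own choices within the consistency region (11)):

    import numpy as np

    # Example 6 (sketch): MLE (k = 0) versus the shifted-threshold estimator
    # theta_tilde(k): choose theta_0 = alpha iff ybar >= -k*sigma**2/(2*alpha).

    rng = np.random.default_rng(5)
    alpha, sigma, n, reps = 1.0, 2.0, 40, 100_000

    def error_rate(true_mean, k):
        ybar = rng.normal(true_mean, sigma / np.sqrt(n), size=reps)
        pick_alpha = ybar >= -k * sigma**2 / (2 * alpha)
        return np.mean(pick_alpha != (true_mean > 0))

    for k in (0.0, 0.2):     # |k| < 2*(alpha/sigma)**2 = 0.5, so both consistent
        e_plus, e_minus = error_rate(alpha, k), error_rate(-alpha, k)
        print(f"k={k}: error at +alpha = {e_plus:.5f}, at -alpha = {e_minus:.5f}")
    # k > 0 lowers the error at +alpha but raises it at -alpha, as in the text.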

Now, we show that this estimator cannot converge faster than the Chapman–Robbins lower bound without losing its consistency. Indeed, P_{θ_0}(θ̃^n(k) ≠ θ_0) decays faster than the Chapman–Robbins lower bound if

    k² + 4k·(α/σ)² − 12·(α/σ)⁴ ≥ 0,

and this is never true under (11). If this estimator is pointwise more efficient than the MLE under θ_0, then its risk under θ_1 is given by

    P_{θ_1}(θ̃^n(k) ≠ θ_1) = Φ( (k·σ² − 2α²)·√n / (2ασ) ),

and this is greater than for the MLE. This shows that a faster convergence rate can be obtained in some points, the price to pay being a worse convergence rate elsewhere in Θ.

4.2.1 Optimality w.r.t. classes of estimators. In the following section, we show some optimality properties of Bayes and ML estimators. We start with an important and well-known fact.

Proposition 8. Under A7, A8, A9 and A10, the Bayes risk r_1(θ̃^n, π) (under the zero–one loss function) associated with a prior distribution π is strictly minimized by the posterior mode corresponding to the prior π, for any finite n.

The following proposition shows that the MLE is admissible and minimax efficient under the zero–one loss and minimizes the average probability of error. It implies that estimators that are more efficient than the MLE at a certain point θ_0 ∈ Θ are less efficient in at least another point θ_1 ∈ Θ. As a result, estimators can be more efficient than minimax efficient ones only on portions of the parameter space, but are then strictly less efficient elsewhere.

Proposition 9. Under Assumptions A7, A8 and A9, the MLE is admissible and minimax efficient w.r.t. the class of all estimators and w.r.t. R_1 and minimizes the average probability of error P_e.

4.2.2 Optimality w.r.t. the information inequalities. In this subsection, we will show that the MLE does not attain the Chapman–Robbins lower bound of Proposition 6 but that it attains the minimax form of Proposition 7 and that efficiency and minimax efficiency are generally incompatible. Therefore, the situation described in Example 6 is general, for it is possible to show that the MLE is generally inefficient with respect to the lower bounds exposed in Proposition 6.

Proposition 10. Under Assumptions A7, A8 and A9:
(i) the MLE is not asymptotically CR-efficient w.r.t. R_1 at θ_0;
(ii) the MLE is asymptotically minimax CR-efficient w.r.t. R_1;
(iii) an estimator that is asymptotically CR-efficient w.r.t. R_1 at θ_0 is not asymptotically minimax CR-efficient w.r.t. R_1.

Remark 11. The assumption of homogeneity of the probability measures, necessary to derive (ii), can be removed in the proof of (i) along the lines of [45].

4.2.3 The evil of superefficiency. Ever since it was discovered by Hodges, the problem of superefficiency has been dealt with extensively in regular statistical problems (see, e.g., [55, 85]). However, these proofs do not transpose to discrete parameter estimation problems, since they are mostly based on the equivalence of prior probability measures with the Lebesgue measure and on properties of Bayes estimators that do not hold in this case. Moreover, the discussion of the previous sections has shown that, in discrete parameter problems, CR-efficiency and efficiency with respect to a class of estimators do not coincide. The following proposition yields a solution to the superefficiency problem.

Proposition 11. Under Assumptions A7, A8 and A9:
(i) no estimator θ̃^n is asymptotically CR-superefficient w.r.t. R_1 at θ_0 ∈ Θ;
(ii) no estimator θ̃^n is superefficient w.r.t. the MLE and R_1.

4.3 Alternative Risk Functions

Now we consider in what measure the previous results transpose when changing the risk function. Following [33], we first consider the quadratic cost function and the corresponding risk function:

    C_2(θ̃^n, θ_0) = (θ̃^n − θ_0)²,
    R_2(θ̃^n, θ_0) = MSE(θ̃^n).

The cost function C_1 has the drawback of weighting in the same way points of the parameter space that lie at different distances with respect to the true value θ_0. In many cases, a more general loss function can be considered, as suggested in [30] (Volume 1, page 51) for multiple tests:

    C_3(θ̃^n, θ_0) = 0 if θ̃^n = θ_0,   C_3(θ̃^n, θ_0) = a_j(θ_0) if θ̃^n = θ_j,

where a_j(θ_0) > 0 for j = 1,...,J can be tuned in order to give more or less weight to different points of the parameter space. The risk function is therefore given by the weighted probability of misclassification R_3(θ̃^n, θ_0) = Σ_{j=1}^J a_j(θ_0) · P_{θ_0}{θ̃^n = θ_j}. It is trivial to remark that

    lim_{n→∞} (1/n) ln R_2(θ̃^n, θ_0) = lim_{n→∞} (1/n) ln P_{θ_0}(θ̃^n ≠ θ_0),
    lim inf_{n→∞} (1/n) ln sup_{θ_0∈Θ} R_2(θ̃^n, θ_0) = lim inf_{n→∞} (1/n) ln sup_{θ_0∈Θ} P_{θ_0}(θ̃^n ≠ θ_0),

and the lower bounds of Propositions 6 and 7 hold also in this case. The same equalities hold also for R_3. As a result, Proposition 10 and Proposition 11(i) apply also to these risk functions. On the other hand, as concerns Proposition 9 and Proposition 11(ii), it is simple to show that with respect to the risk functions R_2(θ̃^n, θ_0) and R_3(θ̃^n, θ_0), the results hold only asymptotically (see [46], for asymptotic minimax efficiency of the estimator of the integer mean of a Gaussian sample with known variance).
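For a concrete feel of the three risk functions, here is a small Monte Carlo sketch (entirely our own illustration, reusing the MLE of Example 6; the weights a_j(θ_0) are arbitrary choices):

    import numpy as np

    # Risk functions of Section 4.3 for the MLE in the model of Example 6:
    # R1 = misclassification probability, R2 = MSE, R3 = weighted misclassification.

    rng = np.random.default_rng(6)
    alpha, sigma, n, reps = 1.0, 2.0, 40, 100_000
    theta0 = alpha

    ybar = rng.normal(theta0, sigma / np.sqrt(n), size=reps)
    theta_hat = np.where(ybar >= 0, alpha, -alpha)   # MLE of Example 6

    a_minus = 3.0                                    # weight a_j(theta_0), our choice
    R1 = np.mean(theta_hat != theta0)
    R2 = np.mean((theta_hat - theta0) ** 2)          # equals (2*alpha)**2 * R1 here
    R3 = a_minus * np.mean(theta_hat == -alpha)
    print(R1, R2, R3)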

5. PROOFS

Proof of Proposition 1. Under A1, Kolmogorov's SLLN implies that P_0-a.s. (1/n) Σ_{i=1}^n ln q(Y_i; θ_j) → E_0 ln q(Y; θ_j), and for P_0-a.s. any sequence of realizations, θ̂^n converges to θ_0. Measurability follows from the fact that the following set belongs to 𝒴^{⊗n}:

    { ω ∈ Ω : sup_{θ∈Θ} (1/n) Σ_{i=1}^n ln q(y_i; θ) ≤ t } = ⋂_{θ_j∈Θ} { ω ∈ Ω : (1/n) Σ_{i=1}^n ln q(y_i; θ_j) ≤ t }.

Proof of Lemma 1. Clearly (ii) implies A2 for a certain η > 0. On the other hand, suppose that A2 holds; then, applying recursively Hölder's inequality:

    Λ^{(i)}(λ) ≜ ln E_0 Π_{j=0,...,J, j≠i} (q(Y;θ_i)/q(Y;θ_j))^{λ_j}
              ≤ (1/J) · Σ_{j=0,...,J, j≠i} ln E_0 (q(Y;θ_i)/q(Y;θ_j))^{J·λ_j},

and choosing the λ_j's adequately, we get (ii).

Proof of Proposition 2. The first two results are straightforward applications of Cramér's Theorem in ℝ^d (see, e.g., [21], Corollary 6.1.6, page 253). Indeed, it is known that the lower bound holds without any supplementary assumption, while the upper bound requires a Cramér condition 0 ∈ int(D_{Λ^{(i)}}); indeed, from Lemma 1, this is equivalent to Assumption A2. Then, a full LDP holds:

    lim inf_{n→∞} (1/n) ln P_0(θ̂^n = θ_i) ≥ − inf_{y∈int ℝ_+^J} sup_{λ∈ℝ^J} { ⟨y, λ⟩ − Λ^{(i)}(λ) },
    lim sup_{n→∞} (1/n) ln P_0(θ̂^n = θ_i) ≤ − inf_{y∈ℝ_+^J} sup_{λ∈ℝ^J} { ⟨y, λ⟩ − Λ^{(i)}(λ) }.

In order to prove the final result, we have to show that ℝ_+^J is a Λ^{(i),*}-continuity set, that is, inf_{y∈int ℝ_+^J} Λ^{(i),*}(y) = inf_{y∈ℝ_+^J} Λ^{(i),*}(y). It is enough to apply part (ii) in the Lemma on page 903 of [66].

Proof of Proposition 3. First of all, we note that P_0(θ̂^n ≠ θ_0) = P_0( Σ_{k=1}^n X_k ∈ int(ℝ_+^J)^c ). Therefore, we can apply large deviations principles, with the candidate rate function Λ*(y); this is a strictly convex function on int D_{Λ*} globally minimized at y′ = [E_0(ln q(Y;θ_0) − ln q(Y;θ_j))]_{j=1,...,J}. By Assumption A1, y′ is finite and belongs to int ℝ_+^J. From the strict convexity of the level sets of Λ*(y), the set arg inf_{y∈int(ℝ_+^J)^c} Λ*(y) has at most finite cardinality H. Moreover, since large deviations theory allows us to ignore the part of int(ℝ_+^J)^c where Λ*(y) ≥ ε + inf_{y∈int(ℝ_+^J)^c} Λ*(y), we can replace (ℝ_+^J)^c with a collection of H disjoint sets, say Γ_h, h = 1,...,H, each of them containing in its interior one and only one of the points of arg inf_{y∈int(ℝ_+^J)^c} Λ*(y) (see [40], page 508):

(12)  P_0( Σ_{k=1}^n X_k ∈ int(ℝ_+^J)^c ) = (1 + o(1)) · P_0( Σ_{k=1}^n X_k ∈ int ⋃_{h=1}^H Γ_h )
          = (1 + o(1)) · Σ_{h=1}^H P_0( Σ_{k=1}^n X_k ∈ int Γ_h ).

As before, the bounds derive from Cramér's Theorem in ℝ^d. Noting that the contribution of any Γ_h is the same and recalling (12), we get the results.

Proof of Proposition 4. The assumptions of the theorem on page 904 of [66] are easily verified. This shows that a unique dominating point y^{(i)} exists and implies, through the Proposition on page 161 of [65] (according to the "Remarks on the hypotheses" in [66], page 905, the "lattice" conditions are not necessary), that the stated bracketing of P_0(θ̂^n = θ_i) holds.

Proof of Proposition 5. Under Assumptions A1, A2, A3 and A4, according to Proposition 2(iii) we have P_0{Q_n(θ_1) ≥ Q_n(θ_0)} = P_0{Q_n(θ_1) > Q_n(θ_0)} · (1 + o(1)) and we can study the behavior of

    P_0(θ̂^n ≠ θ_0) = P_0(θ̂^n = θ_1) = P_0{Q_n(θ_1) ≥ Q_n(θ_0)} = P_0{Q_n(θ_1) − Q_n(θ_0) ∈ [0, +∞)}.

Assumption A6 implies that the conditions of Theorem 3.7.4 in [21] (page 110) are verified, in particular the existence of a positive μ ∈ int(D_{Λ^{(1)}}) solution to the equation 0 = (Λ^{(1)})′(μ). From Lemma 2.2.5(c) in [21], this implies Λ^{(1)}(μ) = −Λ^{(1),*}(0), and the result follows.

Proof of Theorem 1. We note that the function κ(·) in [42] (page 1117) is given by

    κ(u) = ln E_0 exp[u · (X^{(i)} − E_0X^{(i)})] = ln E_0 exp[u · X^{(i)}] − u · E_0X^{(i)} = Λ^{(i)}(u) − u · E_0X^{(i)}.

Therefore, we write the mean m(u) and covariance matrix V(u) as

    m(u) = κ′(u) = ∂κ(u)/∂u = ∂Λ^{(i)}(u)/∂u − E_0X^{(i)},
    V(u) = κ″(u) = ∂²κ(u)/∂u² = ∂²Λ^{(i)}(u)/∂u².

From (2), we have

    P_0(θ̂^n = θ_i) = P_0( Σ_{k=1}^n X_k^{(i)} ∈ int(ℝ_+^J) )
        = P_0( (1/n) Σ_{k=1}^n (X_k^{(i)} − E_0X^{(i)}) ∈ int(ℝ_+^J) ⊖ E_0X^{(i)} ).

Now we verify Assumptions (S.1)–(S.4) of [42]. Assumption (S.1) is implied by A2. Assumptions (S.2) and (S.3) hold since the random vectors are i.i.d. and nontrivial. At last, (S.4) is implied by A5 (see, e.g., [72], page 735).

Since E_0X^{(i)} is strictly negative by A1, int ℝ_+^J ⊖ E_0X^{(i)} does not contain 0 and, according to Theorem 1 in [42] (page 1118), the result of the theorem follows.

Proof of Proposition 6. First of all, we prove (5). We suppose that

    ∫ ln [ f_Y(y; θ_1) / f_Y(y; θ_0) ] · f_Y(y; θ_1) µ(dy) < ∞;

otherwise the inequality is trivial. Then, for any θ_1 ∈ Θ∖{θ_0}, we apply Lemma 3.4.7 in [21] (page 94) with α_n = P_{θ_1}{θ̃^n ≠ θ_1} and β_n = P_{θ_0}{θ̃^n ≠ θ_0}; since θ̃^n is strongly consistent, α_n is ultimately less than any ε > 0 and the bound holds.

The second part can be proved as follows. Define the sets

    A_n(j) = { ω : θ̃^n = θ_j },
    B_n(j) = { ω : (1/n) ln [ f_Y(Y; θ_j) / f_Y(Y; θ_0) ] ≤ E_{θ_j} ln [ f_Y(Y; θ_j) / f_Y(Y; θ_0) ] + ε }.

Therefore, we have

    P_{θ_0}{θ̃^n ≠ θ_0} = E_{θ_0} 1{θ̃^n ≠ θ_0} = E_{θ_j} [ (f_Y(Y; θ_0)/f_Y(Y; θ_j)) · 1{θ̃^n ≠ θ_0} ]
        ≥ E_{θ_j} [ (f_Y(Y; θ_0)/f_Y(Y; θ_j)) · 1{A_n(j)} ]
        ≥ E_{θ_j} [ 1{A_n(j)} · 1{B_n(j)} ] · exp{ −n · ( E_{θ_j} ln [f_Y(Y;θ_j)/f_Y(Y;θ_0)] + ε ) }
        ≥ [ 1 − P_{θ_j}{A_n(j)^c} − P_{θ_j}{B_n(j)^c} ] · exp{ −n · ( E_{θ_j} ln [f_Y(Y;θ_j)/f_Y(Y;θ_0)] + ε ) }
        ≥ [ 1 − P_{θ_j}{θ̃^n ≠ θ_j} − P_{θ_j}{B_n(j)^c} ] · exp{ −n · ( E_{θ_j} ln [f_Y(Y;θ_j)/f_Y(Y;θ_0)] + ε ) }.

This implies:

    lim inf_{n→∞} (1/n) ln P_{θ_0}{θ̃^n ≠ θ_0}
        ≥ − E_{θ_j} ln [ f_Y(Y; θ_j) / f_Y(Y; θ_0) ] − ε
          + lim inf_{n→∞} (1/n) ln [ 1 − P_{θ_j}{θ̃^n ≠ θ_j} − P_{θ_j}{B_n(j)^c} ].

Now, since lim_{n→∞} P_{θ_j}{B_n(j)^c} = 0 and lim sup_{n→∞} P_{θ_j}{θ̃^n ≠ θ_j} < 1, the third term in the right-hand side goes to zero; since ε is arbitrary, the result follows.

Proof of Proposition 7. From the Neyman–Pearson Lemma, we have

    sup_{θ_0∈Θ} P_{θ_0}(θ̃^n ≠ θ_0) ≥ max{ P_{θ_0}(θ̃^n ≠ θ_0), P_{θ_1}(θ̃^n ≠ θ_1) }
        ≥ (1/2) · { P_{θ_0}(θ̃^n ≠ θ_0) + P_{θ_1}(θ̃^n ≠ θ_1) }
        ≥ (1/2) · { P_{θ_0}( L_n(θ_0)/L_n(θ_1) < 1 ) + P_{θ_1}( L_n(θ_0)/L_n(θ_1) ≥ 1 ) }

for an arbitrary couple of different alternatives θ_0 and θ_1 in Θ. Then we can use Chernoff's Bound ([21], page 93); the final expression derives from the equality Λ*(0) = −inf_{λ∈ℝ} Λ(λ).

Proof of Proposition 9. In order to prove that the MLE is admissible and minimax we use the Bayesian method. Using the prior densities given by π(θ_k) = (J + 1)^{−1}, the Bayes estimator relative to zero–one loss θ̌^n coincides with the MLE θ̂^n. Therefore, respectively from Lemma 2.10 and Proposition 6.3 in [71], θ̂^n is minimax and admissible. The fact that the MLE minimizes the average probability of error derives from Proposition 8.

Proof of Proposition 10. (i) In order to prove the first statement, we apply Lemma 2.4 in [45] (page 653). Clearly P is closed in total variation, since it is finite, and P is not exponentially convex; indeed, under Assumption A7, there exist θ_1, θ_2 ∈ Θ and α ∈ [0, 1], such that the probability measure P_{θ(α)} defined as

    P_{θ(α)}(dx) = [ (f_{θ_1}(x))^α · (f_{θ_2}(x))^{1−α} / ∫ (f_{θ_1}(x))^α · (f_{θ_2}(x))^{1−α} µ(dx) ] µ(dx)

does not belong to P. Therefore, from Lemma 2.4(iii) in [45], there exist θ_1′, θ_2′ ∈ Θ such that Equation (2.12) in [45] holds and, as a consequence of Lemma 2.4(i) in [45], the MLE fails to be an inaccuracy rate optimal estimator at least at one of the points θ_1′, θ_2′. This means that, say for θ_1′:

    lim inf_{n→∞} (1/n) ln P_{θ_1′}{ |θ̂^n − θ_1′| > ε } > sup_{θ∈Θ, |θ−θ_1′|>ε} E_θ ln [ f_Y(Y; θ_1′) / f_Y(Y; θ) ],

and this implies that the Chapman–Robbins bound is not attained at θ_1′.

(ii) The second statement follows easily from the results of [43] (Theorem 2) on lim_{n→∞} (1/n) ln r_1(θ̃^n, π), using Equation (8). Indeed, the MLE attains the lower bound (7) and is therefore asymptotically minimax efficient.

(iii) If the estimator is asymptotically CR-efficient w.r.t. R_1 at θ_0, this means that at θ_0 it is more efficient than the MLE and therefore it has to be less efficient elsewhere (since from Proposition 9 the MLE minimizes the probability of error). Therefore, it cannot be minimax CR-efficient.

Proof of Proposition 11. For (i) it is enough to follow the proof of Proposition 6 and to reason by contradiction, while (ii) is simply another way of stating Proposition 9.

ACKNOWLEDGMENTS

The authors would like to thank Lucien Birgé, Mehmet Caner, Jean-Pierre Florens, Christian Gouriéroux, Christian Hess, Marc Hoffmann, Pierre Jacob, Søren Johansen, Rasul A. Khan, Oliver B. Linton, Christian P. Robert, Keunkwan Ryu, Igor Vajda and the participants to seminars at Université Paris 9 Dauphine, CREST and Institut Henri Poincaré, to ESEM 2001 in Lausanne, to XXXIVèmes Journées de Statistique 2002 in Bruxelles, to BS/IMSC 2004 in Barcelona, and to ESEM 2004 in Madrid. All the remaining errors are our responsibility.

REFERENCES

[1] Bahadur, R. R. (1960). On the asymptotic efficiency of tests and estimates. Sankhyā 22 229–252. MR0293767
[2] Bahadur, R. R. and Ranga Rao, R. (1960). On deviations of the sample mean. Ann. Math. Statist. 31 1015–1027. MR0117775
[3] Baram, Y. (1978). A sufficient condition for consistent discrimination between stationary Gaussian models. IEEE Trans. Automat. Control 23 958–960.
[4] Baram, Y. and Sandell, N. R. Jr. (1977). An information theoretic approach to dynamical systems modeling and identification. In Proceedings of the 1977 IEEE Conference on Decision and Control (New Orleans, La., 1977), Vol. 1 1113–1118. Inst. Electrical Electron. Engrs., New York. MR0504256
[5] Baram, Y. and Sandell, N. R. Jr. (1978). Consistent estimation on finite parameter sets with application to linear systems identification. IEEE Trans. Automat. Control 23 451–454. MR0496912
[6] Baram, Y. and Sandell, N. R. Jr. (1978). An information theoretic approach to dynamical systems modeling and identification. IEEE Trans. Automat. Control AC-23 61–66. MR0490387
[7] Barndorff-Nielsen, O. (1978). Information and Exponential Families in Statistical Theory. Wiley, Chichester. MR0489333
[8] Barron, A. R. (1985). The strong ergodic theorem for densities: Generalized Shannon–McMillan–Breiman theorem. Ann. Probab. 13 1292–1303. MR0806226
[9] Berger, J. O. (1993). Statistical Decision Theory and Bayesian Analysis. Springer, New York. MR1234489
[10] Blackwell, D. and Hodges, J. L. Jr. (1959). The probability in the extreme tail of a convolution. Ann. Math. Statist. 30 1113–1120. MR0112197
[11] Blyth, C. R. (1974). Necessary and sufficient conditions for inequalities of Cramér–Rao type. Ann. Statist. 2 464–473. MR0356333
[12] Blyth, C. R. and Roberts, D. M. (1972). On inequalities of Cramér–Rao type and admissibility proofs. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. I: Theory of Statistics 17–30. Univ. California Press, Berkeley, CA. MR0415822
[13] Caines, P. E. (1975). A note on the consistency of maximum likelihood estimates for finite families of stochastic processes. Ann. Statist. 3 539–546. MR0368255
[14] Caines, P. E. (1988). Linear Stochastic Systems. Wiley, New York. MR0944080
[15] Chamberlain, G. (2000). Econometric applications of maxmin expected utility. J. Appl. Econometrics 15 625–644.
[16] Chapman, D. G. and Robbins, H. (1951). Minimum variance estimation without regularity assumptions. Ann. Math. Statist. 22 581–586. MR0044084
[17] Choirat, C., Hess, C. and Seri, R. (2003). A functional version of the Birkhoff ergodic theorem for a normal integrand: A variational approach. Ann. Probab. 31 63–92. MR1959786
[18] Clément, E. (1995). Modélisation statistique en finance et estimation de processus de diffusion. Ph.D. thesis, Université Paris 9 Dauphine.
[19] Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman & Hall, London. MR0370837
[20] Daniels, H. E. (1954). Saddlepoint approximations in statistics. Ann. Math. Statist. 25 631–650. MR0066602
[21] Dembo, A. and Zeitouni, O. (1998). Large Deviations Techniques and Applications, 2nd ed. Applications of Mathematics (New York) 38. Springer, New York. MR1619036
[22] Feller, W. (1968). An Introduction to Probability Theory, Vol. 1, 3rd ed. Wiley, New York.
[23] Finesso, L., Liu, C.-C. and Narayan, P. (1996). The optimal error exponent for Markov order estimation. IEEE Trans. Inform. Theory 42 1488–1497. MR1426225
[24] Florens, J. P. and Richard, J. F. (1989). Encompassing in finite parametric spaces. Discussion Paper 89-03, Institute of Statistics and Decision Sciences, Duke University.

[25] Futschik, A. and Pflug, G. (1995). Confidence sets for discrete stochastic optimization. Ann. Oper. Res. 56 95–108. MR1339787
[26] Geman, S. and Hwang, C.-R. (1982). Nonparametric maximum likelihood estimation by the method of sieves. Ann. Statist. 10 401–414. MR0653512
[27] Geršanov, A. M. (1979). Optimal estimation of a discrete parameter. Teor. Veroyatnost. i Primenen. 24 220–224. MR0522259
[28] Geršanov, A. M. and Šamroni, S. K. (1976). Randomized estimation in problems with a discrete parameter space. Teor. Verojatnost. i Primenen. 21 195–200. MR0411027
[29] Ghosh, M. and Meeden, G. (1978). Admissibility of the mle of the normal integer mean. Sankhyā Ser. B 40 1–10. MR0588734
[30] Gouriéroux, C. and Monfort, A. (1995). Statistics and Econometric Models. Cambridge Univ. Press, Cambridge.
[31] Grenander, U. (1981). Abstract Inference. Wiley, New York. MR0599175
[32] Hall, P. (1989). On convergence rates in nonparametric problems. International Statistical Review 57 45–58.
[33] Hammersley, J. M. (1950). On estimating restricted parameters (with discussion). J. Roy. Statist. Soc. Ser. B 12 192–240. MR0040631
[34] Hawkes, R. M. and Moore, J. B. (1976). Performance bounds for adaptive estimation. Proc. IEEE 64 1143–1150. MR0429280
[35] Hawkes, R. M. and Moore, J. B. (1976). Performance of Bayesian parameter estimators for linear signal models. IEEE Trans. Automat. Control AC-21 523–527. MR0429279
[36] Hawkes, R. M. and Moore, J. B. (1976). An upper bound on the mean-square error for Bayesian parameter estimators. IEEE Trans. Inform. Theory IT-22 610–615. MR0416715
[37] Hero, A. E. (1999). Signal detection and classification. In Digital Signal Processing Handbook (V. K. Madisetti and D. B. Williams, eds.) Chapter 13. CRC Press, Boca Raton, FL.
[38] Hsuan, F. C. (1979). A stepwise Bayesian procedure. Ann. Statist. 7 860–868. MR0532249
[39] Huber, P. J. (1972). The 1972 Wald lecture. Robust statistics: A review. Ann. Math. Statist. 43 1041–1067. MR0314180
[40] Iltis, M. (1995). Sharp asymptotics of large deviations in ℝ^d. J. Theoret. Probab. 8 501–522. MR1340824
[41] Jensen, J. L. (1995). Saddlepoint Approximations. Oxford Statistical Science Series 16. Oxford Univ. Press, New York. MR1354837
[42] Jing, B.-Y. and Robinson, J. (1994). Saddlepoint approximations for marginal and conditional probabilities of transformed variables. Ann. Statist. 22 1115–1132. MR1311967
[43] Kanaya, F. and Han, T. S. (1995). The asymptotics of posterior entropy and error probability for Bayesian estimation. IEEE Trans. Inform. Theory 41 1988–1992. MR1385590
[44] Karlin, S. (1958). Admissibility for estimation with quadratic loss. Ann. Math. Statist. 29 406–436. MR0124101
[45] Kester, A. D. M. and Kallenberg, W. C. M. (1986). Large deviations of estimators. Ann. Statist. 14 648–664. MR0840520
[46] Khan, R. A. (1973). On some properties of Hammersley's estimator of an integer mean. Ann. Statist. 1 756–762. MR0334350
[47] Khan, R. A. (1978). A note on the admissibility of Hammersley's estimator of an integer mean. Canad. J. Statist. 6 113–119. MR0521655
[48] Khan, R. A. (2000). A note on Hammersley's estimator of an integer mean. J. Statist. Plann. Inference 88 37–45. MR1767557
[49] Khan, R. A. (2003). A note on Hammersley's inequality for estimating the normal integer mean. Int. J. Math. Math. Sci. 34 2147–2156.
[50] Kleywegt, A. J., Shapiro, A. and Homem-de-Mello, T. (2001/02). The sample average approximation method for stochastic discrete optimization. SIAM J. Optim. 12 479–502. MR1885572
[51] Korostelev, A. P. and Leonov, S. L. (1996). Minimax efficiency in the sense of Bahadur for small confidence levels. Problemy Peredachi Informatsii 32 3–15. MR1441518
[52] Lainiotis, D. G. (1969). A class of upper bounds on probability of error for multi-hypothesis pattern recognition. IEEE Trans. Information Theory IT-15 730–731. MR0276006
[53] Lainiotis, D. G. (1969). On a general relationship between estimation, detection, and the Bhattacharyya coefficient. IEEE Trans. Inform. Theory IT-15 504–505. MR0246692
[54] LaMotte, L. R. (2008). Sufficiency in finite parameter and sample spaces. Amer. Statist. 62 211–215. MR2526138
[55] Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and related Bayes' estimates. Univ. California Publ. Statist. 1 277–329. MR0054913
[56] Le Cam, L. and Yang, G. L. (2000). Asymptotics in Statistics: Some Basic Concepts, 2nd ed. Springer, New York. MR1784901
[57] Lindsay, B. G. and Roeder, K. (1987). A unified treatment of integer parameter models. J. Amer. Statist. Assoc. 82 758–764. MR0909980
[58] Liporace, L. A. (1971). Variance of Bayes estimates. IEEE Trans. Inform. Theory IT-17 665–669. MR0339392
[59] Lugannani, R. and Rice, S. (1980). Saddle point approximation for the distribution of the sum of independent random variables. Adv. in Appl. Probab. 12 475–490. MR0569438
[60] Manski, C. F. (1988). Analog Estimation Methods in Econometrics. Chapman & Hall, New York. MR0996421
[61] McCabe, G. P. Jr. (1972). Sequential estimation of a Poisson integer mean. Ann. Math. Statist. 43 803–813. MR0301875

[62] Meeden, G. and Ghosh, M. (1981). Admissibility in finite problems. Ann. Statist. 9 846–852. MR0619287
[63] Nafie, M. and Tewfik, A. (1998). Reduced complexity M-ary hypotheses testing in wireless communications. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Seattle, Washington, 1998, Vol. 6 3209–3212. Inst. Electrical Electron. Engrs., New York.
[64] Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Handbook of Econometrics, Vol. IV. Handbooks in Econom. 2 2111–2245. North-Holland, Amsterdam. MR1315971
[65] Ney, P. (1983). Dominating points and the asymptotics of large deviations for random walk on ℝ^d. Ann. Probab. 11 158–167. MR0682806
[66] Ney, P. (1984). Convexity and large deviations. Ann. Probab. 12 903–906. MR0744245
[67] Ney, P. (1999). Notes on dominating points and large deviations. Resenhas 4 79–91. MR1712848
[68] Ney, P. E. and Robinson, S. M. (1995). Polyhedral approximation of convex sets with an application to large deviation probability theory. J. Convex Anal. 2 229–240. MR1363371
[69] Poor, H. V. and Verdú, S. (1995). A lower bound on the probability of error in multihypothesis testing. IEEE Trans. Inform. Theory 41 1992–1994. MR1385591
[70] Puhalskii, A. and Spokoiny, V. (1998). On large-deviation efficiency in statistical inference. Bernoulli 4 203–272. MR1632979
[71] Robert, C. P. (1994). The Bayesian Choice. Springer, New York. MR1313727
[72] Robinson, J., Höglund, T., Holst, L. and Quine, M. P. (1990). On approximating probabilities for small and large deviations in ℝ^d. Ann. Probab. 18 727–753. MR1055431
[73] Robson, D. S. (1958). Admissible and minimax integer-valued estimators of an integer-valued parameter. Ann. Math. Statist. 29 801–812. MR0096339
[74] Silvey, S. D. (1961). A note on maximum-likelihood in the case of dependent random variables. J. Roy. Statist. Soc. Ser. B 23 444–452. MR0138158
[75] Stark, A. E. (1975). Some estimators of the integer-valued parameter of a Poisson variate. J. Amer. Statist. Assoc. 70 685–689. MR0395009
[76] Teunissen, P. J. G. (2007). Best prediction in linear models with mixed integer/real unknowns: Theory and application. J. Geod. 81 759–780.
[77] Torgersen, E. N. (1970). Comparison of experiments when the parameter space is finite. Z. Wahrsch. Verw. Gebiete 16 219–249. MR0283909
[78] Vajda, I. (1967). On the statistical decision problems with discrete parameter space. Kybernetika (Prague) 3 110–126. MR0215428
[79] Vajda, I. (1967). On the statistical decision problems with finite parameter space. Kybernetika (Prague) 3 451–466. MR0223009
[80] Vajda, I. (1967). Rate of convergence of the information in a sample concerning a parameter. Czechoslovak Math. J. 17 (92) 225–231. MR0215435
[81] Vajda, I. (1968). On the convergence of information contained in a sequence of observations. In Proc. Colloquium on Information Theory (Debrecen, 1967), Vol. II 489–501. János Bolyai Math. Soc., Budapest. MR0258525
[82] Vajda, I. (1971). A discrete theory of search. I. Apl. Mat. 16 241–255. MR0294045
[83] Vajda, I. (1971). A discrete theory of search. II. Apl. Mat. 16 319–335. MR0295483
[84] Vajda, I. (1974). On the convergence of Bayes empirical decision functions. In Proceedings of the Prague Symposium on Asymptotic Statistics (Charles Univ., Prague, 1973), Vol. II 413–425. Charles Univ., Prague. MR0383596
[85] van der Vaart, A. W. (1997). Superefficiency. In Festschrift for Lucien Le Cam 397–410. Springer, New York. MR1462961
[86] van der Vlerk, M. H. (1996–2007). Stochastic integer programming bibliography. Available at http://www.eco.rug.nl/mally/biblio/sip.html.
[87] Wong, W. H. (1986). Theory of partial likelihood. Ann. Statist. 14 88–123. MR0829557