Information Theoretical Upper and Lower Bounds for Statistical Estimation
Tong Zhang, Yahoo Inc., New York City, USA

Abstract— We establish upper and lower bounds for some statistical estimation problems through concise information theoretical arguments. Our upper bound analysis is based on a simple yet general inequality which we call the information exponential inequality. We show that this inequality naturally leads to a general randomized estimation method, for which performance upper bounds can be obtained. The lower bounds, applicable for all statistical estimators, are obtained by original applications of some well known information theoretical inequalities, and approximately match the obtained upper bounds for various important problems. Moreover, our framework can be regarded as a natural generalization of the standard minimax framework, in that we allow the performance of the estimator to vary for different possible underlying distributions according to a pre-defined prior.

Index Terms— Gibbs algorithm, lower bound, minimax, PAC-Bayes, randomized estimation, statistical estimation

I. INTRODUCTION

The purpose of this paper is to develop upper and lower bounds for some prediction problems in statistical learning. The upper bound analysis is based on a simple yet very general inequality (the information exponential inequality), which can be applied to statistical learning problems. Our lower bounds are obtained from some novel applications of well-known information theoretical inequalities (specifically, data-processing theorems). We show that the upper bounds and lower bounds have very similar forms, and approximately match under various conditions.

We shall first present a relatively abstract framework, in which we develop the information exponential inequality our upper bound analysis is based on. The abstraction is motivated by the statistical prediction problems which this paper investigates. In statistical prediction, we have an input space X, an output space Y, and a space of predictors G. For any X ∈ X, Y ∈ Y, and any predictor θ ∈ G, we incur a loss L_θ(X, Y) = L_θ(Z), where Z = (X, Y) ∈ Z = X × Y. Consider a probability measure D on Z. Our goal is to find a parameter θ̂(Ẑ) from a random sample Ẑ from D, such that the loss E_Z L_{θ̂(Ẑ)}(Z) is small, where E_Z is the expectation with respect to D (and Z is independent of Ẑ).

In the standard learning theory, we consider n random samples instead of one sample. Although this appears at first to be more general, it can be handled using the one-sample formulation if our loss is additive over the samples. To see this, consider X = {X_1, ..., X_n} and Y = {Y_1, ..., Y_n}. Let L_θ(Z) = Σ_{i=1}^n L_{i,θ}(X_i, Y_i). If Z_i = (X_i, Y_i) are independent random variables, then it follows that E_Z L_θ(Z) = Σ_{i=1}^n E_{Z_i} L_{i,θ}(X_i, Y_i). Therefore any result on the one-sample formulation immediately implies a result on the n-sample formulation. We shall thus focus on the one-sample case without loss of generality.

In this paper, we consider both deterministic and randomized estimators. Randomized estimators are defined with respect to a prior π on G, which is a probability measure on G with the property that ∫_G dπ(θ) = 1. For a randomized estimation method, given a sample Ẑ from D, we select θ from G based on a sample-dependent probability measure dπ̂_Ẑ(θ) on G. In this paper, we shall call such a sample-dependent probability measure a posterior randomization measure (or simply a posterior). The word posterior in this paper is not necessarily the Bayesian posterior distribution in the traditional sense; whenever the latter appears, we shall always refer to it as the Bayesian posterior distribution to avoid confusion. For notational simplicity, we also use the symbol π̂ to denote π̂_Ẑ. The randomized estimator associated with a posterior randomization measure is thus completely determined by its posterior π̂. Its posterior averaging risk is the averaged risk of the randomized estimator drawn from this posterior randomization measure, which can be defined as (assuming that the order of integration can be exchanged)

    E_{θ∼π̂} E_Z L_θ(Z) = ∫_G E_Z L_θ(Z) dπ̂(θ).

In this paper, we are interested in estimating this average risk for an arbitrary posterior π̂. The statistical complexity of this randomized estimator π̂ will be measured by its KL-complexity with respect to the prior, which is defined as:

    D_KL(π̂||π) = ∫_G ln( dπ̂(θ)/dπ(θ) ) dπ̂(θ),    (1)

assuming it exists.

It should be pointed out that randomized estimators are often suboptimal. For example, in prediction problems, the mean of the posterior can be a better estimator (at least for convex loss functions). We focus on randomized estimation because the main techniques presented in this paper only apply to randomized estimation methods, due to the complexity measure used in (1). The prior π in our framework reflects the statistician's conjecture about where the optimal parameter is more likely to be. Our upper bound analysis reflects this belief: the bounds are good if the statistician puts a relatively large prior mass around the optimal parameter, and the bounds are poor if the statistician makes a wrong bet by putting a very small prior mass around the optimal parameter. In particular, the prior does not necessarily correspond to nature's prior in the Bayesian framework, in that nature does not have to pick the target parameter according to this prior.

With the above background in mind, we can now start to develop our main machinery, as well as its consequences in statistical learning theory. The remainder of this paper is organized as follows. Section II introduces the basic information exponential inequality and its direct consequence. In Section III, we present various generalizations of the basic inequality, and discuss their implications. Section IV applies our analysis to statistical estimation, and derives corresponding upper bounds. Examples for some important but more specific problems will be given in Section V. Lower bounds that have the same form as the upper bounds in Section IV will be developed in Section VI, implying that our analysis gives tight convergence rates. Moreover, in order to avoid clutter in the main text, we put all but very short proofs into Section VII. Implications of our analysis will be discussed in Section VIII.

Throughout the paper, we ignore measurability issues. Moreover, for any integration with respect to multiple variables, we assume that Fubini's theorem is applicable, so that the order of integration can be exchanged.

II. INFORMATION EXPONENTIAL INEQUALITY

The main technical result which forms the basis of our upper bound analysis is given by the following lemma, where we assume that π̂ = π̂_Ẑ is a posterior randomization measure on G that depends on the sample Ẑ. The lemma in this specific form appeared first in an earlier conference paper [1]. For completeness, the proof is included in Section VII-A. There were also a number of studies that employed similar ideas. For example, the early work of McAllester on PAC-Bayes learning [2] investigated the same framework, although his results were less general. Independent results that are quite similar to what we obtain here can also be found in [3]–[5].

Lemma 2.1 (Information Exponential Inequality): Consider a measurable real-valued function L_θ(Z) on G × Z. Then for an arbitrary sample-Ẑ-dependent randomized estimation method, with posterior π̂, the following inequality holds:

    E_Ẑ exp( E_{θ∼π̂} Δ_Ẑ(θ) − D_KL(π̂||π) ) ≤ 1,

where Δ_Ẑ(θ) = −L_θ(Ẑ) − ln E_Z exp(−L_θ(Z)).

The importance of this bound is that the left hand side is a quantity that involves an arbitrary randomized estimator, and the right hand side is a numerical constant independent of the estimator. Therefore the inequality is a result that can be applied to an arbitrary randomized estimator. The remaining issue is merely how to interpret this bound.

The following theorem gives a sample-dependent generalization bound for an arbitrary randomized estimator which is easier to interpret than Lemma 2.1. The proof, which uses Markov's and Jensen's inequalities, is left to Section VII-A.

Theorem 2.1 (Information Posterior Bound): Consider randomized estimation, where we select a posterior π̂ on G based on Ẑ, with π a prior. Consider a real-valued function L_θ(Z) on G × Z. Then ∀t, the following event holds with probability at least 1 − exp(−t):

    −E_{θ∼π̂} ln E_Z e^{−L_θ(Z)} ≤ E_{θ∼π̂} L_θ(Ẑ) + D_KL(π̂||π) + t.

Moreover, we have the following expected risk bound:

    −E_Ẑ E_{θ∼π̂} ln E_Z e^{−L_θ(Z)} ≤ E_Ẑ [ E_{θ∼π̂} L_θ(Ẑ) + D_KL(π̂||π) ].

Theorem 2.1 applies to unbounded loss functions. The right hand side of the theorem (see the first inequality) is the empirical loss plus complexity. The left-hand-side quantity is expressed as the minus logarithm of the true expected exponential of the negative loss; using Jensen's inequality, we can see that it is smaller than the true expected loss. Bounds in terms of the true expected loss can also be obtained (see Section IV and Section V).

In general, there are two ways to use Theorem 2.1. Given a fixed prior π, one can obtain an estimator by minimizing the right hand side of the first inequality,
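The KL-complexity (1) is the only complexity term the bounds above depend on. As a quick numerical illustration (a toy discrete example of our own, not taken from the paper), it can be computed as follows:

```python
import math

def kl_divergence(post, prior):
    """KL-complexity D_KL(posterior || prior) of eq. (1), for discrete measures."""
    return sum(q * math.log(q / p) for q, p in zip(post, prior) if q > 0)

prior = [0.25, 0.25, 0.25, 0.25]   # uniform prior on 4 parameters
posterior = [0.7, 0.1, 0.1, 0.1]   # a sample-dependent posterior
print(round(kl_divergence(posterior, prior), 4))  # → 0.4458
```

A posterior equal to the prior has zero complexity, and the complexity grows as the posterior concentrates away from the prior, which is exactly how the bound of Theorem 2.1 trades empirical loss against complexity.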

which is observation-dependent (and the minimizer is independent of t). The left-hand side (the generalization error) is bounded according to the theorem. This approach, which we may call Information Risk Minimization (IRM), is the focus of Section IV and Section V. Another possible application of Theorem 2.1 is to estimate the generalization performance of any pre-defined randomized (and some deterministic) estimator by choosing a prior π which optimizes the bound.

III. EXTENSIONS OF THE INFORMATION EXPONENTIAL INEQUALITY

The basic information exponential inequality and information posterior bound can be extended in various ways. Although seemingly more general at first, these extensions are essentially equivalent to the basic formulations.

A. Randomized model selection

One important problem we may encounter in practice is the issue of prior selection, which corresponds to model selection in deterministic statistical modeling. The purpose of this discussion is to illustrate that model selection can be naturally integrated into our framework without any additional analysis. Because of this, the main results of the paper can be applied to model selection, although we do not explicitly state them in such a context. Therefore, although this section is not needed for the technical development of the paper, we still include it here for completeness.

Mathematically, we may consider countably many models on G_k ⊂ G, which are indexed by k ∈ M = {1, 2, ...} (the countability assumption is more of a notational convenience than a requirement). We shall call M the model space. Each model k ∈ M is associated with a prior π_k on G_k ⊂ G, together with a belief μ_k ≥ 0 such that Σ_{k=1}^∞ μ_k = 1. The probability measure μ = {μ_k} can be regarded as a prior belief put on the models.

Now we may consider a randomized model-selection scheme where we randomly select a model k based on a data-dependent probability distribution μ̂ = {μ̂_k} = {μ̂_k^Ẑ} on the model space M, and then a random estimator π̂_k = π̂_k^Ẑ with respect to model k. In this framework, the resulting randomized estimator is equivalent to a posterior randomization measure on G of the following form:

    Σ_{k=1}^∞ μ̂_k dπ̂_k.    (2)

The posterior π̂_k is a randomization measure with respect to model k, with its complexity measured by D_KL(dπ̂_k||dπ_k). Letting {f^k(θ)} be a sequence of functions indexed by k ∈ M, we are interested in the quantity:

    Σ_{k=1}^∞ μ̂_k E_{π̂_k} f^k(θ).

Note that for notational simplicity, throughout this section, we shall drop the Ẑ dependence in the randomized estimators μ̂_k and π̂_k.

It can be seen immediately that this framework can be mapped into the framework without model selection simply by changing the parameter space from G to G × M. The prior sequence π_k becomes a prior π′ on G × M defined as dπ′(θ, k) = μ_k dπ_k(θ). This implies that mathematically, from the randomized estimation point of view, the model-selection framework considered above is not truly necessary. This is consistent with Bayesian statistics, where prior selection (which would be the counterpart of model selection in non-Bayesian statistical inference) is not necessary, since one can always introduce hidden parameters and use a prior mixture (with respect to the hidden variable). This is similar to what happens here. We believe that the ability of our framework to treat model selection and estimation under the same setting is very attractive and conceptually important. It also simplifies our theoretical analysis.

Based on this discussion, the statistical complexity of the randomized model-selection method in (2) is measured by the KL-complexity of the resulting randomized estimator on the G × M space, which can now be conveniently expressed as:

    D_KL(μ̂, π̂||μ, π) = Σ_{k=1}^∞ μ̂_k D_KL(dπ̂_k||dπ_k) + D_KL(μ̂||μ),    (3)

where

    D_KL(μ̂||μ) = Σ_{k=1}^∞ μ̂_k ln( μ̂_k / μ_k ).

It follows that any KL-complexity based bound for randomized estimation without model selection also holds for randomized estimation with model selection. We simply replace the KL-complexity definition in (1) by the complexity measure (3). The case of the information posterior bound is listed below.

Corollary 3.1 (Model Selection Bound): Consider a sequence of measurable real-valued functions L_θ^k(Z) on G × Z indexed by k. Then for an arbitrary Ẑ-dependent randomized estimation method (with randomized model selection) of the form (2), and for all t, the following
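The identity behind (3) — that the decomposed complexity equals the flat KL-complexity (1) on the product space G × M under the mixture prior dπ′(θ, k) = μ_k dπ_k(θ) — can be checked numerically. The sketch below uses hypothetical toy weights of our own choosing:

```python
import math

def kl(q, p):
    """Discrete KL divergence."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Two models with two parameters each (illustrative toy numbers).
mu_prior = [0.5, 0.5]                   # prior model belief mu_k
pi_prior = [[0.5, 0.5], [0.25, 0.75]]   # prior pi_k within each model
mu_hat = [0.8, 0.2]                     # data-dependent model weights
pi_hat = [[0.9, 0.1], [0.5, 0.5]]       # data-dependent posterior per model

# Decomposed complexity measure (3).
decomposed = sum(m * kl(ph, pp) for m, ph, pp in zip(mu_hat, pi_hat, pi_prior)) \
             + kl(mu_hat, mu_prior)

# Flat KL-complexity (1) on G x M under d pi'(theta, k) = mu_k d pi_k(theta).
joint_post = [m * w for m, ph in zip(mu_hat, pi_hat) for w in ph]
joint_prior = [m * w for m, pp in zip(mu_prior, pi_prior) for w in pp]

print(abs(decomposed - kl(joint_post, joint_prior)) < 1e-12)  # → True
```

The agreement is an exact chain-rule identity, not an approximation, which is why the model-selection framework adds no generality.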

event holds with probability at least 1 − exp(−t):

    −Σ_{k=1}^∞ μ̂_k E_{θ∼π̂_k} ln E_Z e^{−L_θ^k(Z)} ≤ Σ_{k=1}^∞ μ̂_k E_{θ∼π̂_k} L_θ^k(Ẑ) + D_KL(μ̂, π̂||μ, π) + t.

Moreover, we have the following expected risk bound:

    −E_Ẑ Σ_{k=1}^∞ μ̂_k E_{θ∼π̂_k} ln E_Z e^{−L_θ^k(Z)} ≤ E_Ẑ [ Σ_{k=1}^∞ μ̂_k E_{θ∼π̂_k} L_θ^k(Ẑ) + D_KL(μ̂, π̂||μ, π) ].

Although not more general in principle, the introduction of the hidden component k may be useful in various applications. For example, consider a decomposition of the posterior randomization measure of the form in (2) with μ̂_k = μ_k. The corresponding prior decomposition is dπ = Σ_{k=1}^∞ μ_k dπ_k. From (3), the statistical complexity of a posterior dπ̂_Ẑ = Σ_{k=1}^∞ μ_k dπ̂_k can be measured by Σ_{k=1}^∞ μ_k D_KL(dπ̂_k||dπ_k). With an appropriate choice of prior decomposition, we may obtain a more refined complexity estimate of a posterior dπ̂, which is better than the complexity measure D_KL(dπ̂||dπ). In fact, such a decomposition is necessary to obtain non-trivial results for deterministic estimators under a continuous prior π, where D_KL(dπ̂||dπ) = ∞ for a delta-function-like posterior dπ̂.

B. Prior transformation and localization

Since we are free to choose any prior in the complexity measure (1), a bound using this quantity can be turned into another bound with another complexity measure under a transformed prior. In particular, the following transformation is useful. The result can be verified easily by direct calculation.

Proposition 3.1: Given a real-valued function r(θ) on G, define

    dπ^{→r}(θ) = e^{r(θ)} dπ(θ) / E_{θ∼π} e^{r(θ)};

then

    D_KL(π̂||π^{→r}) = D_KL(π̂||π) − E_{θ∼π̂} r(θ) + ln E_{θ∼π} e^{r(θ)}.

In particular, we may choose r(θ) = α ln E_Z e^{−L_θ(Z)}, which leads to the following corollary.

Corollary 3.2 (Localized KL-complexity): Use the notation of Theorem 2.1 and consider α ∈ [0, 1]. Let

    c(α) = ln E_{θ∼π} ( E_Z e^{−L_θ(Z)} )^α;

then for all t, the following event holds with probability at least 1 − exp(−t):

    −(1 − α) E_{θ∼π̂} ln E_Z e^{−L_θ(Z)} ≤ E_{θ∼π̂} L_θ(Ẑ) + D_KL(π̂||π) + c(α) + t.

Moreover, we have the following expected risk bound:

    −(1 − α) E_Ẑ E_{θ∼π̂} ln E_Z e^{−L_θ(Z)} ≤ E_Ẑ [ E_{θ∼π̂} L_θ(Ẑ) + D_KL(π̂||π) ] + c(α).

Although the corollary is obtained merely through a change of prior in Theorem 2.1, it is important to understand that this prior change itself depends on the function to be estimated. In order to obtain a statistical estimation procedure (see Section IV), one wants to minimize the right hand side of Theorem 2.1. Hence the prior used in the estimation procedure (which is part of the estimator) cannot depend on the function to be estimated. Therefore the statistical consequences of Theorem 2.1 and Corollary 3.2, when applied to randomized estimation methods, are different.

As we shall see later, for some interesting estimation problems, the right hand side of Corollary 3.2 will be significantly better than that of Theorem 2.1. The c(α) term in Corollary 3.2 is quite useful for some parametric problems, where we would like to obtain a convergence rate of the order O(1/n). In such cases, the choice of α = 0 (as in Theorem 2.1) would lead to a rate of O(ln n/n), which is suboptimal. This localization technique has also appeared in [3]–[6].

IV. INFORMATION RISK MINIMIZATION

Given any prior π, by minimizing the right hand side of Theorem 2.1, we obtain an estimation procedure, which we call Information Risk Minimization (IRM). We shall consider the method for the case of n i.i.d. samples Ẑ = (Ẑ_1, ..., Ẑ_n) ∈ Z = Z_1 × ... × Z_n, where Z_1 = ... = Z_n. The loss function is L_θ(Ẑ) = ρ Σ_{i=1}^n ℓ_θ(Ẑ_i), where ℓ is a function on G × Z_1 and ρ > 0 is a constant. Given a set S of probability distributions on G, the IRM method finds a randomized estimator π̂ which minimizes the empirical information complexity:

    π̂_{ρ,S} = arg inf_{π̂∈S} [ E_{θ∼π̂} Σ_{i=1}^n ℓ_θ(Ẑ_i) + D_KL(π̂||π)/ρ ].    (4)

There are two cases of S that are most interesting. One is to take S as the set of all possible posterior randomization measures. In this case, the corresponding

estimator (often referred to as the Gibbs algorithm) can be solved in closed form, and becomes a randomized estimator which we simply denote as π̂_ρ:

    dπ̂_ρ = exp(−ρ Σ_{i=1}^n ℓ_θ(Ẑ_i)) dπ / E_{θ∼π} exp(−ρ Σ_{i=1}^n ℓ_θ(Ẑ_i)).    (5)

In order to see that dπ̂_ρ minimizes (4), we simply note that the right hand side of (4) can be rewritten as:

    E_{θ∼π̂} Σ_{i=1}^n ℓ_θ(Ẑ_i) + D_KL(π̂||π)/ρ = D_KL(π̂||π̂_ρ)/ρ + c,

where c is a quantity that is independent of π̂. It follows that the optimal solution is achieved at π̂ = π̂_ρ. The Gibbs method is closely related to the Bayesian method. Specifically, the Bayesian posterior distribution is given by the posterior randomization measure π̂_ρ of the Gibbs algorithm at ρ = 1 with the log-loss.

The other most interesting choice is when G is a discrete net which consists of countably many discrete parameters: G = {θ_1, θ_2, ...}. The prior π can be represented as π_k (k = 1, 2, ...). We consider the set S of distributions concentrated on a point k̂: π̂_k̂ = 1 and π̂_k = 0 when k ≠ k̂. IRM in this case becomes a deterministic estimator which estimates a parameter k̂_ρ as follows:

    k̂_ρ = arg inf_k [ ρ Σ_{i=1}^n ℓ_{θ_k}(Ẑ_i) + ln(1/π_k) ].    (6)

We then use the deterministic parameter θ_{k̂_ρ} as our estimator. This method minimizes a penalized empirical risk on a discrete net, which is closely related to the minimum description length method for density estimation. In general, estimators on a discrete net have been frequently used in theoretical studies, although they can be quite difficult to apply in practice. From a practical point of view, the Gibbs algorithm (5) with respect to a continuous prior might be easier to implement, since one can use Monte Carlo techniques to draw samples from the Gibbs posterior. Moreover, in our analysis, the Gibbs algorithm has a better generalization guarantee.

A. A general convergence bound

For the IRM method (4), we assume that ρ is a given constant. As is typical in statistical estimation, one may choose an optimal ρ by cross-validation. It is possible to allow data-dependent ρ in our analysis using the model selection idea discussed in Section III-A. In this case, we may define a discrete set of {ρ_k}, each with loss function L_θ^k(Z) = ρ_k L_θ(Z). In our analysis, we can then pick a specific model k̂ deterministically, which then leads to a data-dependent choice of ρ_k̂. However, for simplicity, in the following discussion, we focus on IRM with a pre-specified (possibly non-optimal) ρ. A more general extension of (4) is to allow a θ-dependent scaling factor ρ, which is also possible to study (see related discussions in Section IV-B).

The following theorem gives a general bound for the IRM method. The proof can be found in Section VII-B.

Theorem 4.1: Define the resolvability

    r_{ρ,S} = inf_{π′∈S} [ E_{θ∼π′} E_{Z_1} ℓ_θ(Z_1) + D_KL(π′||π)/(ρn) ].

Then ∀α ∈ [0, 1), the expected generalization performance of the IRM method (4) can be bounded as

    −(1/ρ) E_Ẑ E_{θ∼π̂_{ρ,S}} ln E_{Z_1} e^{−ρℓ_θ(Z_1)} ≤ (1/(1 − α)) [ r_{ρ,S} + (1/(ρn)) ln E_{θ∼π} ( E_{Z_1} e^{−ρℓ_θ(Z_1)} )^{αn} ].

Note that for the Gibbs estimator (5), where S consists of all possible posterior randomization measures, the corresponding resolvability is given by

    r_{ρ,S} = −(1/(ρn)) ln E_{θ∼π} e^{−ρn E_{Z_1} ℓ_θ(Z_1)}.    (7)

For estimation on the discrete net (6), the associated resolvability is

    r_{ρ,S} = inf_k [ E_{Z_1} ℓ_{θ_k}(Z_1) + (1/(ρn)) ln(1/π_k) ].    (8)

B. Refined convergence bounds

In order to apply Theorem 4.1 to statistical estimation problems, we need to write the left hand side in a more informative form. For some specialized problems, the left hand side may have intuitive interpretations. For example, for density estimation with log-loss, the right hand side (of the second inequality) involves the KL-distance between the true density and the estimated density, and the left hand side can be interpreted as a different distance between the true density and the estimated density. Analysis following this line can be found in [6].

For general statistical estimation problems, it is useful to replace the left hand side by the true generalization error with respect to the randomized estimator, which is the quantity we are interested in. In order to achieve this, one can use a small value of ρ and bounds for the logarithmic moment generating functions that are given in Appendix I. Before going into technical details more rigorously, we first outline the basic idea (non-rigorously). Based on Proposition 1.1 and the remark
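On a discrete parameter grid, both the Gibbs posterior (5) and the penalized estimator (6) are straightforward to compute exactly. The sketch below (toy data, grid, prior, and squared loss are our own illustrative choices, not the paper's) implements both:

```python
import math
import random

random.seed(0)
# Toy setup: y_i = 0.7 + Gaussian noise; each parameter theta is a constant
# predictor with loss l_theta(z) = (theta - y)^2.
data = [0.7 + random.gauss(0, 0.1) for _ in range(50)]
grid = [i / 20 for i in range(21)]        # discrete net G = {0.00, 0.05, ..., 1.00}
prior = [1.0 / len(grid)] * len(grid)     # uniform prior pi_k
rho = 1.0

def emp_loss(theta):
    return sum((theta - y) ** 2 for y in data)

# Gibbs posterior (5): d pi_hat proportional to exp(-rho * empirical loss) d pi.
w = [p * math.exp(-rho * emp_loss(t)) for p, t in zip(prior, grid)]
total = sum(w)
gibbs = [x / total for x in w]
gibbs_mean = sum(g * t for g, t in zip(gibbs, grid))

# Discrete-net IRM (6): minimize rho * empirical loss + ln(1 / pi_k).
k_hat = min(range(len(grid)),
            key=lambda k: rho * emp_loss(grid[k]) + math.log(1 / prior[k]))

print(grid[k_hat], round(gibbs_mean, 2))
```

With a uniform prior the penalty ln(1/π_k) is constant, so (6) reduces to empirical risk minimization on the net, while the Gibbs posterior concentrates around the same region; both land near the true value 0.7 here.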

thereafter, for small positive ρ, we have for each θ:

    −ln E_{Z_1} e^{−ρℓ_θ(Z_1)} = ρ E_{Z_1} ℓ_θ(Z_1) − (ρ²/2) Var_{Z_1} ℓ_θ(Z_1) + O(ρ³).

Now substituting into Theorem 4.1 with α = 0, we obtain

    E_Ẑ E_{θ∼π̂_{ρ,S}} E_{Z_1} ℓ_θ(Z_1) ≤ r_{ρ,S} + (ρ/2) E_Ẑ E_{θ∼π̂_{ρ,S}} Var_{Z_1} ℓ_θ(Z_1) + O(ρ²).

To further develop the idea (non-rigorously at the moment), we can ignore the higher order term in ρ, and obtain a bound for small ρ as long as we can replace Var_{Z_1} ℓ_θ(Z_1) by a bound of the form

    Var_{Z_1} ℓ_θ(Z_1) ≤ K E_{Z_1} ℓ_θ(Z_1) + v_θ,    (9)

where K > 0 is a constant and the v_θ are θ-dependent constants. Up to the leading order, this gives a bound of the form

    E_Ẑ E_{θ∼π̂_{ρ,S}} E_{Z_1} ℓ_θ(Z_1) ≤ (1/(1 − 0.5Kρ)) r_{ρ,S} + (ρ/(2 − Kρ)) E_Ẑ E_{θ∼π̂_{ρ,S}} v_θ + O(ρ²).

Interesting observations can be made from this bound. In order to optimize the bound on the right hand side, one may make a first-order adjustment to (4) by taking the variance into consideration:

    π̂_{ρ,S} = arg inf_{π̂∈S} [ E_{θ∼π̂} Σ_{i=1}^n ( ℓ_θ(Ẑ_i) + (ρ/2) v_θ ) + D_KL(π̂||π)/ρ ].

Note that for small ρ, the effect of the v_θ adjustment is also relatively small. If v_θ cannot be reliably estimated, then one may also consider using the empirical variance of ℓ_θ as a plug-in estimator. Practically, the variance adjustment is quite reasonable. This is because if ℓ_θ has a large variance, then its empirical risk is a less reliable estimate of its true risk, and thus more likely to overfit the data. It is therefore useful to penalize the unreliability according to the estimated variances. A related method is to allow a different scaling factor ρ for different θ, which can also be used to compensate for different variances at different θ. This can be achieved through a re-definition of the prior, weighted by a θ-dependent ρ and then normalized (also see [4]).

In the following examples, we pay special attention to the case of v_θ = 0 in (9). The following theorem is a direct consequence of Theorem 2.1, where the technical idea discussed above is rigorously applied. See Section VII-B for its proof.

Theorem 4.2: Consider the IRM method in (4). Assume there exists a positive constant K such that ∀θ:

    E_{Z_1} ℓ_θ(Z_1)² ≤ K E_{Z_1} ℓ_θ(Z_1).

Assume either of the following conditions:
• Bounded loss: ∃M ≥ 0 s.t. −inf_{θ,Z_1} ℓ_θ(Z_1) ≤ M; let β_ρ = 1 − K(e^{ρM} − ρM − 1)/(ρM²).
• Bernstein loss: ∃M, b > 0 s.t. E_{Z_1}(−ℓ_θ(Z_1))^m ≤ m! M^{m−2} Kb E_{Z_1} ℓ_θ(Z_1) for all θ ∈ G and m ≥ 3; let β_ρ = 1 − Kρ(1 − ρM + 2bρM)/(2 − 2ρM).

Assume we choose a sufficiently small ρ such that β_ρ > 0. Let the (true) expected loss of θ be R(θ) = E_{Z_1} ℓ_θ(Z_1); then the expected generalization performance of the IRM method is bounded by

    E_Ẑ E_{θ∼π̂_{ρ,S}} R(θ) ≤ (1/((1 − α)β_ρ)) [ r_{ρ,S} + (1/(ρn)) ln E_{θ∼π} e^{−αρβ_ρ n R(θ)} ],    (10)

where r_{ρ,S} = inf_{π′∈S} [ E_{θ∼π′} R(θ) + D_KL(π′||π)/(ρn) ].

Remark 4.1: Since (e^x − x − 1)/x → 0 as x → 0, we know that the first condition β_ρ > 0 can be satisfied as long as we pick a sufficiently small ρ. In fact, using the inequality (e^x − x − 1)/x ≤ 0.5xe^x (when x ≥ 0), we may also take β_ρ = 1 − 0.5ρKe^{ρM} in the first condition of Theorem 4.2.

C. Convergence bounds under various local prior decay conditions

We study consequences of (10) under some general conditions on the local prior structure π(ε) = π({θ : R(θ) ≤ ε}) around the best achievable parameter. For some specific forms of local prior conditions, convergence rates can be stated very explicitly.

For simplicity, we shall focus only on the Gibbs algorithm (5). Similar results can also be obtained for (6) under an appropriate discrete net. We first present a theorem which contains the general result, followed by more specific consequences. The proofs are left to Section VII-B. Related results can also be found in Chapter 3 of [5].

Theorem 4.3: Consider the Gibbs algorithm (5). If (10) holds with a non-negative function R(θ), then

    E_Ẑ E_{θ∼π̂_ρ} R(θ) ≤ Δ(αβ_ρ, ρn) / ((1 − α)β_ρ ρn),
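The second-order expansion of the logarithmic moment generating function used in the outline above can be checked numerically. The sketch below (with a toy loss distribution of our own choosing) confirms that the approximation error is of order ρ³:

```python
import math

# Toy loss distribution: l takes a few values with given probabilities.
vals = [0.0, 0.5, 1.0]
probs = [0.2, 0.5, 0.3]

mean = sum(p * v for p, v in zip(probs, vals))
var = sum(p * (v - mean) ** 2 for p, v in zip(probs, vals))

for rho in [0.1, 0.01]:
    # Exact value of -ln E e^{-rho * l} vs. rho*E l - (rho^2/2) Var l.
    exact = -math.log(sum(p * math.exp(-rho * v) for p, v in zip(probs, vals)))
    approx = rho * mean - 0.5 * rho ** 2 * var
    print(rho, abs(exact - approx) < rho ** 3)  # error is O(rho^3)
```

This is exactly why a small ρ lets the left hand side of Theorem 4.1 be replaced by the true expected loss, at the price of a variance term that conditions such as (9) then control.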

where

    Δ(a, b) = ln [ E_{θ∼π} e^{−abR(θ)} / E_{θ∼π} e^{−bR(θ)} ]
            ≤ ln inf_{u,v} [ sup_{ε≤u} ( max(0, π(ε/a) − v) / π(ε) ) + inf_ε ( (v + (1 − v) exp(−bu)) / (π(ε) e^{−bε}) ) ],

and π(ε) = π({θ : R(θ) ≤ ε}).

In the following, we give two simplified bounds. One uses global entropy, which gives the correct rate of convergence for non-parametric problems; the other bound is a refinement that uses localized entropy, useful for parametric problems.

Corollary 4.1 (Global Entropy Bound): Consider the Gibbs algorithm (5). If (10) holds, then

    E_Ẑ E_{θ∼π̂_ρ} R(θ) ≤ inf_ε [ ρnε − ln π(ε) ] / (β_ρ ρn) ≤ 2 ε̄_global / β_ρ,

where π(ε) = π({θ : R(θ) ≤ ε}) and ε̄_global = inf{ ε : ε ≥ (1/(ρn)) ln(1/π(ε)) }.

Corollary 4.2 (Local Entropy Bound): Consider the Gibbs algorithm (5). If (10) holds, then

    E_Ẑ E_{θ∼π̂_ρ} R(θ) ≤ (1/((1 − α)β_ρ ρn)) inf_{u_1≤u_2} ln [ sup_{ε∈[u_1,u_2]} ( π(ε/(αβ_ρ)) / π(ε) ) + exp( ρn u_1 / (αβ_ρ) ) + e^{−ρn u_2/2} / π(u_2/2) ]
    ≤ ε̄_local / ((1 − α)β_ρ),

where π(ε) = π({θ : R(θ) ≤ ε}), and

    ε̄_local = ln 2/(ρn) + inf_u { ε : ε ≥ (αβ_ρ/(ρn)) sup_{ε′∈[ε,2u]} ln( π(ε′/(αβ_ρ)) / π(ε′) ) + e^{−ρnu} / π(u) }.

Remark 4.2: By letting u → ∞ in the definition of ε̄_local, we can see easily that ε̄_local ≤ ln 2/(ρn) + ε̄_global/(αβ_ρ). Therefore using the localized complexity ε̄_local is always better (up to a constant) than using ε̄_global. If the ratio π(ε/(αβ_ρ))/π(ε) is much smaller than 1/π(ε), the localized complexity can be much better than the global complexity.

In the following, we consider three cases of local prior structures, and derive the corresponding rates of convergence. Comparable lower bounds are given in Section VI.

1) Non-parametric type local prior: It is well known that for standard nonparametric families such as smoothing splines, etc., the ε-entropy often grows at the order of O(ε^{−r}) for some r > 0. We shall not list detailed examples here, and simply refer the readers to [7]–[9] and references therein. Similarly, we assume that there exist constants C_1, C_2 and r such that the prior π(ε) satisfies the condition:

    C_1 ε^{−r} ≤ ln( 1/π(ε) ) ≤ C_2 ε^{−r}.

This implies that ε̄_global ≤ (C_2/(ρn))^{1/(1+r)}. It is easy to check that ε̄_local is of the same order as ε̄_global when C_1 > 0. Therefore, for a prior that behaves non-parametrically around the truth, it does not matter whether we use the global complexity or the local complexity.

2) Parametric type local prior: For standard parametric families, the prior π has a density with an underlying dimensionality d: π(ε) = O(ε^d). We may assume that the following condition holds:

    C_1 + d ln(1/ε) ≤ ln( 1/π(ε) ) ≤ C_2 + d ln(1/ε).

This implies that ε̄_global is of the order d ln n/n. However, we have

    ε̄_local ≤ ( ln 2 + C_2 − C_1 − d ln(αβ_ρ) ) / (ρn),

which is of the order O(d/n) for large d. In this case, we obtain a better rate of convergence using the localized complexity measure.

3) Singular local prior: It is possible to obtain a rate of convergence faster than O(1/n). This cannot be obtained with either ε̄_global or ε̄_local, which are of an order no better than n^{−1}. The phenomenon of a faster than O(1/n) convergence rate is related to super-efficiency, and hence can only appear at countably many isolated points.

To see that it is possible to obtain a faster than 1/n convergence rate (super-efficiency) in our framework, we only consider the simple case where there exists u > 0 such that

    π( 2u/(αβ_ρ) ) = π(0) > 0.

That is, we have a point-like prior mass at the truth with zero density around it. In this case, we can apply Corollary 4.2 with u_1 = −∞ and u_2 = 2u, and obtain

    E_Ẑ E_{θ∼π̂_ρ} R(θ) ≤ ln[ 1 + e^{−ρnu}/π(u) ] / ((1 − α)β_ρ ρn).

This gives an exponential rate of convergence. Clearly this example can be generalized to the case where a point is not completely isolated from its neighbors.
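With the definition of ε̄_global above, the fixed point can be located numerically for a parametric-type prior; the instance below (prior π(ε) = min(1, ε^d), and all numbers, are our own toy illustration) shows the d ln n/n scaling:

```python
import math

def eps_global(n, d, rho=1.0):
    """Smallest eps with eps >= ln(1/pi(eps))/(rho*n), for pi(eps) = min(1, eps**d)."""
    lo, hi = 1e-12, 1.0
    for _ in range(100):  # bisection; eps - d*ln(1/eps)/(rho*n) is increasing in eps
        mid = (lo + hi) / 2
        if mid >= d * math.log(1 / mid) / (rho * n):
            hi = mid
        else:
            lo = mid
    return hi

for n in [100, 10000]:
    e = eps_global(n, d=2)
    # the ratio eps * n / ln(n) stays bounded and slowly approaches d
    print(n, round(e * n / math.log(n), 3))
```

This matches the statement that ε̄_global is of the order d ln n/n for a parametric-type local prior.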

V. SOME EXAMPLES

We focus on consequences and applications of Theorem 4.2. Specifically, we give a few examples of important IRM formulations for which (10) holds with some positive constants α, ρ, and β_ρ. It follows that for these problems, the convergence rate analysis in Section IV-C applies.

A. Conditional density estimation

Conditional density estimation is very useful in practical applications. It includes the standard density estimation problem widely studied in statistics as a special case. Moreover, many classification algorithms (such as decision trees or logistic regression) can be regarded as conditional density estimators.

Let Z_1 = (X_1, Y_1), where X_1 is the input variable and Y_1 is the output variable. We are interested in estimating the conditional density p(Y_1|X_1). In this framework, we assume (with a slight abuse of notation) that each parameter θ corresponds to a conditional density function: θ(Z_1) = p(Y_1|θ, X_1). In density estimation, we consider the negative log-loss function −ln θ(Z_1).

Our goal is to find a randomized conditional density estimator θ from the data such that the expected log-loss −E_{Z_1} ln θ(Z_1) is as small as possible. In this case, the IRM estimator in (4) becomes

  π̂_{ρ,S} = arg inf_{π̂∈S} [ ρ E_{θ∼π̂} Σ_{i=1}^n ln(1/θ(Ẑ_i)) + D_KL(π̂||π) ].   (11)

If we let S be the set of all possible measures, then the optimal solution is dπ̂_{ρ,S} ∝ Π_{i=1}^n θ(Ẑ_i)^ρ dπ, which corresponds to the Bayesian posterior distribution when ρ = 1. For discrete S, the method is essentially a two-part-code MDL estimator. As mentioned earlier, Theorem 4.1 can be directly applied since its left-hand side can be interpreted as a (Hellinger-like) distance between distributions. This approach has been taken in [1], [6] (also see [10] for related results). However, in this section we are interested in using the log-loss on the left-hand side. This requires rewriting the left-hand-side logarithmic moment generating function of Theorem 4.1 using ideas illustrated at the beginning of the section.

We further assume that θ is defined on a domain G which is a closed convex density class. However, we do not assume that G contains the true conditional density. We also let θ_G be the optimal density in G with respect to the log-loss:

  E_{Z_1} ln(1/θ_G(Z_1)) = inf_{θ∈G} E_{Z_1} ln(1/θ(Z_1)).

In the following, we are interested in a bound which compares the performance of the randomized estimator (11) to the best possible predictor θ_G ∈ G, and thus define

  ℓ_θ(Z_1) = ln( θ_G(Z_1) / θ(Z_1) ).

In order to apply Theorem 4.1, we need the following variance bound, which is of the form (9) (see Section VII-C).

Proposition 5.1: If there exists a constant M_G ≥ 2/3 such that E_{Z_1} ℓ_θ(Z_1)^3 ≤ M_G E_{Z_1} ℓ_θ(Z_1)^2, then E_{Z_1} ℓ_θ(Z_1)^2 ≤ (8M_G/3) E_{Z_1} ℓ_θ(Z_1).

Using this result, we obtain the following theorem from Theorem 4.2.

Theorem 5.1: Consider the IRM estimator (11) for conditional density estimation (under log-loss). Then ∀α ∈ [0, 1), inequality (10) holds with R(θ) = E_{Z_1} ln( θ_G(Z_1) / θ(Z_1) ) under either of the following two conditions:
  • sup_{θ_1,θ_2∈G, Z_1} ln( θ_1(Z_1) / θ_2(Z_1) ) ≤ M_G: we pick ρ such that β_ρ = (11ρM_G + 8 − 8e^{ρM_G})/(3ρM_G) > 0.
  • There exist M_G, b > 0 such that M_G b ≥ 1/9, and ∀θ ∈ G and m ≥ 3, E_{Z_1} |ln( θ(Z_1)/θ_G(Z_1) )|^m ≤ m! M_G^{m−2} b E_{Z_1} (ln( θ(Z_1)/θ_G(Z_1) ))^2: we pick ρ such that β_ρ = 1 − 8bρM_G(1 − ρM_G + 2bρM_G)/(1 − ρM_G) > 0.

Proof: Under the first condition, using Proposition 5.1, we may take K = 8M_G/3 and M = M_G in Theorem 4.2 (bounded loss case). Under the second condition, using Proposition 5.1, we may take K = 16M_G b and M = M_G in Theorem 4.2 (Bernstein loss case).

Similar to the remark after Theorem 4.2, we may also let β_ρ = (3 − 4ρM_G e^{ρM_G})/3 under the first condition of Theorem 5.1. The second condition involves moment inequalities that need to be verified for specific problems. It applies to certain unbounded conditional density families such as conditional Gaussian models with bounded variance. We shall discuss a related scenario in the least squares regression case. Note that under Gaussian noise with identical variance, conditional density estimation using the log-loss is equivalent to the estimation of the conditional mean using least squares regression.

Since for the log-loss (10) holds under appropriate boundedness or moment assumptions on the density family, the consequences in Section IV-C apply to the Gibbs algorithm (which gives the Bayesian posterior distribution when ρ = 1). As we shall show in Section VI, similar lower bounds can be derived.
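For a finite candidate class, the unconstrained optimal solution dπ̂_{ρ,S} ∝ Π_i θ(Ẑ_i)^ρ dπ can be computed in closed form by exponential weighting of the log-likelihoods. The following minimal sketch illustrates this; the Bernoulli candidate models and all numerical settings are hypothetical, chosen only for illustration:

```python
import numpy as np

def gibbs_posterior(log_liks, prior, rho):
    """Gibbs posterior over a finite class: pi_hat(theta) proportional to
    exp(rho * log-likelihood(theta)) * pi(theta); rho = 1 gives the Bayesian posterior.

    log_liks: array (num_models,) with sum_i ln theta(Z_i) for each candidate.
    prior:    array (num_models,) of prior weights pi(theta).
    """
    log_w = rho * log_liks + np.log(prior)
    log_w -= log_w.max()              # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical toy class: three Bernoulli densities for Y, uniform prior.
rng = np.random.default_rng(0)
thetas = np.array([0.2, 0.5, 0.8])          # candidate values of P(Y = 1)
y = rng.binomial(1, 0.8, size=200)          # data generated by the third model
log_liks = np.array([np.sum(y * np.log(t) + (1 - y) * np.log(1 - t)) for t in thetas])
prior = np.full(3, 1 / 3)
post = gibbs_posterior(log_liks, prior, rho=1.0)
```

At ρ = 1 this is exactly the Bayesian posterior over the three candidates; ρ < 1 tempers the likelihood, and ρ = 0 returns the prior.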

B. Least squares regression

Let Z_1 = (X_1, Y_1), where X_1 is the input variable and Y_1 is the output variable. We are interested in predicting Y_1 based on X_1. We assume that each parameter θ corresponds to a predictor θ(X_1). The quality of the predictor is measured by the mean squared error E_{Z_1}(θ(X_1) − Y_1)^2. In this framework, the IRM estimator in (4) becomes

  π̂_{ρ,S} = arg inf_{π̂∈S} [ E_{θ∼π̂} Σ_{i=1}^n (θ(X_i) − Y_i)^2 + D_KL(π̂||π)/ρ ].   (12)

We further assume that θ is defined on a domain G which is a closed convex function class. Let θ_G be the optimal predictor in G with respect to the least squares loss:

  E_{Z_1}(θ_G(X_1) − Y_1)^2 = min_{θ∈G} E_{Z_1}(θ(X_1) − Y_1)^2.

In the following, we are interested in comparing the performance of the randomized estimator (12) to the best possible predictor θ_G ∈ G. Define

  ℓ_θ(Z_1) = (θ(X_1) − Y_1)^2 − (θ_G(X_1) − Y_1)^2.

We have the following proposition; the proof is in Section VII-C.

Proposition 5.2: Let M_G = sup_{X_1, θ∈G} E_{Y_1|X_1}(θ(X_1) − Y_1)^2. Then E_{Z_1} ℓ_θ(Z_1)^2 ≤ 4M_G E_{Z_1} ℓ_θ(Z_1).

The following refined bound estimates the moments of the loss in terms of the moments of the noise Y_1 − θ(X_1). The proof is left to Section VII-C.

Proposition 5.3: Let A_G = sup_{X_1, θ∈G} |θ(X_1) − θ_G(X_1)|, and assume sup_{X_1, θ∈G} E_{Y_1|X_1} |θ(X_1) − Y_1|^m ≤ m! B_G^{m−2} M_G for m ≥ 2. Then we have

  E_{Z_1}(−ℓ_θ(Z_1))^m ≤ m! (2A_G B_G)^{m−2} 4M_G E_{Z_1} ℓ_θ(Z_1).

Although Proposition 5.2 is a special case of Proposition 5.3, we include it separately for convenience. These moment estimates can be combined with Theorem 4.2, and we obtain the following theorem.

Theorem 5.2: Consider the IRM estimator (12) for least squares regression. Then ∀α ∈ [0, 1), inequality (10) holds with R(θ) = E_{Z_1}(θ(X_1) − Y_1)^2 − E_{Z_1}(θ_G(X_1) − Y_1)^2, under either of the following conditions:
  • sup_{θ∈G, Z_1} (θ(X_1) − Y_1)^2 ≤ M_G: we pick ρ such that β_ρ = (5ρM_G + 4 − 4e^{ρM_G})/(ρM_G) > 0.
  • Proposition 5.3 holds for all integers m ≥ 2: we pick a small ρ such that β_ρ = 1 − 4M_Gρ/(1 − 2A_G B_G ρ) > 0.

Proof: Under the first condition, using Proposition 5.2, we have M_G ≤ sup_{θ∈G, Z_1}(θ(X_1) − Y_1)^2. We may thus take K = 4M_G and M = M_G in Theorem 4.2 (bounded loss case). Under the second condition, using Proposition 5.3, we can let K = 8M_G, M = 2A_G B_G, and b = 1/2 in Theorem 4.2 (Bernstein loss case).

Remark 5.1: Similar to the remark after Theorem 4.2, we may also let β_ρ = 1 − 2ρM_G e^{ρM_G} under the first (bounded loss) condition.

The theorem applies to unbounded regression problems with exponentially decaying noise, such as Gaussian noise (also see [3]). For example, we have the following result (see Section VII-C for its proof).

Corollary 5.1: Assume that there exists a function y_0(X) such that
  • For all X_1, the random variable |Y_1 − y_0(X_1)|, conditioned on X_1, is dominated by the absolute value of a zero-mean Gaussian random variable with standard deviation σ (that is, conditioned on X_1, the moments of |Y_1 − y_0(X_1)| with respect to Y_1 are no larger than the corresponding moments of the dominating Gaussian random variable).
  • There exists a constant b > 0 such that sup_{X_1} |y_0(X_1) − θ(X_1)| ≤ b.
If we also choose A such that A ≥ sup_{X_1, θ∈G} |θ(X_1) − θ_G(X_1)|, then (10) holds with β_ρ = 1 − 4ρ(b + σ)^2/(1 − 2A(b + σ)ρ) > 0.

Although we have only discussed least squares regression, similar results can also be obtained for some other loss functions in regression and classification.

C. Non-negative loss functions

Consider a non-negative loss function ℓ_θ(Z_i) ≥ 0. We are interested in developing a bound under the assumption that the best error inf_{θ∈G} E_{Z_1} ℓ_θ(Z_1) can approach zero. A straightforward application of Theorem 4.2 leads to the following bound.

Theorem 5.3: Consider the IRM estimator (4). Assume ℓ_θ(Z_i) ≥ 0, and that there exists a positive constant K such that ∀θ:

  E_{Z_1} ℓ_θ(Z_1)^2 ≤ K E_{Z_1} ℓ_θ(Z_1).

Then ∀α ∈ [0, 1), inequality (10) holds with R(θ) = E_{Z_1} ℓ_θ(Z_1) and β_ρ = 1 − 0.5ρK.

Proof: We can apply Theorem 4.2 under the bounded loss condition −ℓ_θ(Z_1) ≤ M, which holds for any M > 0. Letting M → 0, we have β_ρ = 1 − K(e^{ρM} − ρM − 1)/(ρM^2) → 1 − 0.5Kρ.

The condition in Theorem 5.3 is automatically satisfied if ℓ_θ(Z_1) is uniformly bounded: 0 ≤ ℓ_θ(Z_1) ≤ K.
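On a finite class, the least squares IRM estimator (12) likewise reduces to exponential weighting of the empirical squared errors. A minimal numerical sketch follows; the constant-predictor class and all numbers are hypothetical, used only to illustrate the mechanics:

```python
import numpy as np

def gibbs_least_squares(preds, y, prior, rho):
    """Posterior weights in the spirit of (12) on a finite class:
    pi_hat(theta) proportional to exp(-rho * sum_i (theta(X_i) - Y_i)^2) * pi(theta).

    preds: array (num_models, n) of predictions theta(X_i) for each candidate.
    y:     array (n,) of observed responses.
    """
    sq_err = ((preds - y[None, :]) ** 2).sum(axis=1)
    log_w = -rho * sq_err + np.log(prior)
    log_w -= log_w.max()              # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# Hypothetical finite class: constant predictors theta(X) = c on a grid.
rng = np.random.default_rng(1)
grid = np.linspace(-1.0, 1.0, 21)                 # candidate constants
y = 0.3 + 0.1 * rng.standard_normal(100)          # truth: constant 0.3 plus noise
preds = np.repeat(grid[:, None], 100, axis=1)     # theta(X_i) = c for every i
post = gibbs_least_squares(preds, y, np.full(21, 1 / 21), rho=1.0)
c_hat = grid[np.argmax(post)]                     # posterior mode
```

The posterior concentrates on the grid point closest to the sample mean of y, i.e., on the empirical least squares fit within the class.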

VI. LOWER BOUNDS

In previous sections, we obtained performance bounds for the IRM method (4) using the information exponential inequality. The purpose of this section is to prove lower bounds which hold for arbitrary statistical estimators. Our goal is to match these lower bounds to the upper bounds proved earlier for the IRM method (at least for certain problems), which implies that the IRM method is near optimal.

The upper bounds we obtained in previous sections hold for every possible realization of the underlying distribution. It is not possible to obtain a lower bound for any specific realization, since we can always design an estimator that picks a parameter achieving the best possible performance under this particular distribution. However, such an estimator will not work well for a different distribution. Therefore, as far as lower bounds are concerned, we are interested in the performance averaged over a set of underlying distributions.

In order to obtain lower bounds, we associate each parameter θ with a probability distribution P_θ(x, y) so that we can take samples Z_i = (X_i, Y_i) from this distribution. In addition, we shall design the map in such a way that the optimal parameter under this distribution is θ. For (conditional) density estimation, the map is the density itself. For regression, we associate each predictor θ with a conditional Gaussian distribution with constant variance and with conditional mean given by the prediction θ(X_1) at each input X_1.

We consider the following scenario: we put a prior π on θ, which becomes a prior on the distributions P_θ(x, y). Assume that we are interested in estimating θ under a loss function ℓ_θ(Z_1); then the quantity

  R_θ(θ') = E_{Z_1∼P_θ} ℓ_{θ'}(Z_1)

is the true risk of an estimated parameter θ' under the true distribution parameter θ. Let Z be n independent samples Z = {(X_1, Y_1), ..., (X_n, Y_n)} from the underlying distribution P_θ. Denote by P_θ^n the product distribution dP_θ^n(Z) = Π_{i=1}^n dP_θ(Z_i). Let π̂(Z) be an arbitrary randomized estimator, and consider a random variable θ̂ drawn from π̂(Z). The average performance of θ̂ can thus be expressed as

  E_{θ∼π} E_{Z∼P_θ^n} E_{θ̂∼π̂(Z)} R_θ(θ̂).   (13)

In this section, we are mainly interested in obtaining a lower bound for any possible estimator, so that we can compare this lower bound to the upper bound for the IRM method developed earlier.

Note that (13) only gives one performance measure, while the upper bound for the IRM method is specific to every possible truth P_θ. It is thus useful to study the best local performance around any possible θ with respect to the underlying prior π. To address this issue, we observe that for every partition of the θ space into a union of disjoint small balls B_k, we may rewrite (13) as

  Σ_k π(B_k) E_{θ∼π_{B_k}} E_{Z∼P_θ^n} E_{θ̂∼π̂(Z)} R_θ(θ̂),

where for each small ball B_k, the localized prior is defined as

  π_{B_k}(A) = π(A ∩ B_k) / π(B_k).

Therefore, instead of bounding the optimal Bayes risk with respect to the global prior π in (13), we shall bound the optimal risk with respect to a local prior π_B for a small ball B around any specific parameter θ, which gives a more refined performance measure. In this framework, if for some small local ball B the IRM method has performance not much worse than the best possible estimator, then we can say that it is locally near optimal.

The main theorem in our lower bound analysis is presented below. The proof, which relies on a simple information theoretical argument, is given in Section VII-D.

Theorem 6.1: Consider an arbitrary randomized estimator π̂(Z) on B' ⊂ G, where Z consists of n independent samples Z = {(X_1, Y_1), ..., (X_n, Y_n)} from some underlying distribution P_θ. Then for every non-negative function R_θ(θ'), we have for all B that contain θ:

  E_{θ∼π_B} E_{Z∼P_θ^n} E_{θ̂∼π̂(Z)} R_θ(θ̂) ≥ 0.5 sup{ ε : inf_{θ'∈B'} ln( 1 / π_B({θ : R_θ(θ') < ε}) ) ≥ 2n∆_B + ln 4 },

where ∆_B = E_{θ∼π_B} E_{θ'∼π_B} D_KL(P_θ(Z_1)||P_{θ'}(Z_1)).

Theorem 6.1 has a form that resembles Corollary 4.2. In the following, we state a result which shows the relationship more explicitly.

Corollary 6.1 (Local Entropy Lower Bound): Under the notation of Theorem 6.1, consider a reference point θ_0 ∈ G and a set of balls B(θ_0, ε) ⊂ G containing θ_0 (by definition) and indexed by ε > 0, such that

  sup_{θ,θ'∈B(θ_0,ε)} D_KL(P_θ(Z_1)||P_{θ'}(Z_1)) ≤ ε.

Given u > 0, consider ε(θ_0, u) which satisfies

  ε(θ_0, u) = sup_{ε>0} { ε : inf_{θ'∈B'} ln[ π(B(θ_0, uε)) / π({θ : R_θ(θ') < ε} ∩ B(θ_0, uε)) ] ≥ 2nuε + ln 4 };

then locally around θ_0, we have

  E_{θ∼π_{B(θ_0, uε(θ_0,u))}} E_{Z∼P_θ^n} E_{θ̂∼π̂(Z)} R_θ(θ̂) ≥ 0.5 ε(θ_0, u).

Proof: The corollary is a simple application of Theorem 6.1. Just let B = B(θ_0, uε(θ_0, u)) and note that ∆_B ≤ uε(θ_0, u).

The definition of B(θ_0, ε) requires that within the B(θ_0, ε) ball, the distributions P_θ are nearly indistinguishable up to a scale of ε when measured by their KL-divergence. Corollary 6.1 implies that the local performance of an arbitrary statistical estimator cannot be better than ε(θ_0, u)/2. The bound in Corollary 6.1 will be good if the ball probability π(B(θ_0, ε)) is relatively large; that is, if there are many distributions that are statistically nearly indistinguishable (in KL-distance). Therefore the bound of Corollary 6.1 is similar to Corollary 4.2, but the localization is within a small ball which is statistically nearly indistinguishable (rather than the R(·) localization for the IRM estimator). From an information theoretical point of view, this difference is rather intuitive and clearly also necessary, since we allow arbitrary statistical estimators (which can simply estimate the specific underlying distribution P_θ if the distributions are distinguishable).

It follows that if we want the upper bound in Corollary 4.2 to match the lower bound in Corollary 6.1, we need to design a map θ → P_θ such that locally around θ_0, a ball with small R(·) risk is also small information theoretically, in terms of the KL-distance between P_θ and P_{θ'}. Consider the following two types of small R balls:

  B_1(θ, ε) = {θ' : R_θ(θ') < ε},
  B_2(θ, ε) = {θ' : R_{θ'}(θ) < ε}.

Now assume that we can find a map θ → P_θ such that locally around θ_0, P_θ within a small B_1-ball is small in the information theoretical sense (small KL-distance); that is, we have for some c > 0 that

  sup{ D_KL(P_θ(Z_1)||P_{θ'}(Z_1)) : θ, θ' ∈ B_1(θ_0, cε) } ≤ ε.   (14)

For problems such as the density estimation and regression problems studied in this paper, it is easy to design such a map (under mild conditions such as the boundedness of the loss). We shall not go into the details of verifying specific examples of (14). Now, under this condition, we can take

  ε(θ_0, u) = sup_{ε>0} { ε : inf_{θ'∈B'} ln[ π(B_1(θ_0, cuε)) / π(B_2(θ', ε) ∩ B_1(θ_0, cuε)) ] ≥ 2nuε + ln 4 }.

As a comparison, according to Corollary 4.2, the IRM method at θ_0 gives an upper bound of the following form (which we simplify to focus on the main term) with some constant u' ∈ (0, 1):

  ε̄_local ≤ 2/(ρn) + inf{ ε : ρnε ≥ sup_{ε'≥ε} ln[ π(B_1(θ_0, ε')) / π(B_1(θ_0, u'ε')) ] }.

Essentially, the local IRM upper bound is achieved at ε̄_local such that

  n ε̄_local ∼ sup_{ε'≥ε̄_local} ln[ π(B_1(θ_0, ε')) / π(B_1(θ_0, u'ε')) ],

where we use ∼ to denote approximately the same order, while the lower bound in Corollary 6.1 implies that (let u' = 1/(cu))

  nε ∼ inf_{θ'∈B'} ln[ π(B_1(θ_0, ε)) / π(B_2(θ', u'ε) ∩ B_1(θ_0, ε)) ].

From this, we see that our upper and lower bounds are very similar. There are two main differences, which we outline below.

  • In the lower bound, for technical reasons, B_2 appears in the definition of the local entropy. In order to argue that this difference does not matter, we need to assume that the prior probabilities of B_1 and B_2 balls are of the same order.
  • In the lower bound, we use the smallest local entropy in a neighborhood of θ_0, while in the upper bound, we use the largest local entropy at θ_0 across different scales. This difference is not surprising, since the lower bound is with respect to the average in a small neighborhood of θ_0.

Both differences are relatively mild and somewhat expected. In fact, at the conceptual level, it is very interesting that we are able to establish lower and upper bounds that are so similar. We consider two situations, which parallel Section IV-C.1 and Section IV-C.2.

A. Non-parametric type local prior

Similar to Section IV-C.1, we assume that for some sufficiently large constant v, there exist 0 < C_1 < C_2

such that

  C_2 ε^{−r} ≤ inf_{θ'∈B'} ln[ 1 / π(B_2(θ', ε) ∩ B_1(θ_0, vε)) ],
  ln[ 1 / π(B_1(θ_0, vε)) ] ≤ C_1 ε^{−r},

which measures the order of the global entropy around a small neighborhood of θ_0. Now, under condition (14) and with u = v/c, Corollary 6.1 implies that

  ε ≥ sup{ ε : 2unε + ln 4 ≤ (C_2 − C_1) ε^{−r} }.

This implies that ε is of the order n^{−1/(1+r)}, which matches the order of the IRM upper bound ε̄_global in Section IV-C.1.

B. Parametric type local prior

Similar to Section IV-C.2, we assume that for some sufficiently large constant v, there exist 0 < C_1 < C_2 such that

  C_2 + d ln(1/ε) ≤ inf_{θ'∈B'} ln[ 1 / π(B_2(θ', ε) ∩ B_1(θ_0, vε)) ],
  ln[ 1 / π(B_1(θ_0, vε)) ] ≤ C_1 + d ln(1/ε),

which measures the order of the global entropy around a small neighborhood of θ_0. Now, under condition (14) and with u = v/c, Corollary 6.1 implies that

  ε ≥ sup{ ε : 2c^{−1}nε ≤ C_2 − C_1 − ln 4 }.

That is, we have a convergence rate of the order 1/n, which matches the parametric upper bound ε̄_local for IRM in Section IV-C.2.

VII. PROOFS

In order to avoid clutter in the main text and improve readability, we put the relatively long proofs into this section. For convenience, we restate the theorems before the proofs.

A. Proofs for Section II

The key ingredient in our proof of the information exponential inequality is a well-known convex duality, which employs KL-complexity and which has already been used in some recent machine learning papers to study sample complexity bounds [2], [4], [5], [11], [12].

Proposition 7.1: Assume that f(θ) is a measurable real-valued function on G, and that π̂ is a probability measure. Then we have

  E_{θ∼π̂} f(θ) ≤ D_KL(π̂||π) + ln E_{θ∼π} exp(f(θ)).

Remark 7.1: The above convex duality also has a straightforward information theoretical interpretation: consider v(θ) = exp(f(θ))/E_{θ∼π} exp(f(θ)). Since E_{θ∼π} v(θ) = 1, we can regard v as a density with respect to π. Now let dπ' = v dπ; it is easy to verify that the inequality in Proposition 7.1 can be rewritten as D_KL(π̂||π') ≥ 0, which is a well-known information theoretical inequality that follows easily from Jensen's inequality.

Lemma 2.1: Consider a measurable real-valued function L_θ(Z) on G × Z. Then for an arbitrary sample-Ẑ-dependent randomized estimation method with posterior π̂, the following inequality holds:

  E_Ẑ exp( E_{θ∼π̂} ∆_Ẑ(θ) − D_KL(π̂||π) ) ≤ 1,

where ∆_Ẑ(θ) = −L_θ(Ẑ) − ln E_Z exp(−L_θ(Z)).

Proof: Using the convex duality inequality in Proposition 7.1 on the parameter space G, with prior π, posterior π̂, and f(θ) = ∆_Ẑ(θ), we obtain

  exp( E_{θ∼π̂} ∆_Ẑ(θ) − D_KL(π̂||π) ) ≤ E_{θ∼π} e^{∆_Ẑ(θ)}.

Taking expectations with respect to Ẑ, and noticing that E_Ẑ exp(∆_Ẑ(θ)) = 1, we obtain

  E_Ẑ exp( E_{θ∼π̂} ∆_Ẑ(θ) − D_KL(π̂||π) ) ≤ E_Ẑ E_{θ∼π} exp(∆_Ẑ(θ)) = E_{θ∼π} E_Ẑ exp(∆_Ẑ(θ)) = 1.

This proves the inequality.

Theorem 2.1: Consider randomized estimation, where we select a posterior π̂ on G based on Ẑ, with prior π. Consider a real-valued function L_θ(Z) on G × Z. Then for all t, the following event holds with probability at least 1 − exp(−t):

  −E_{θ∼π̂} ln E_Z e^{−L_θ(Z)} ≤ E_{θ∼π̂} L_θ(Ẑ) + D_KL(π̂||π) + t.

Moreover, we have the following expected risk bound:

  −E_Ẑ E_{θ∼π̂} ln E_Z e^{−L_θ(Z)} ≤ E_Ẑ [ E_{θ∼π̂} L_θ(Ẑ) + D_KL(π̂||π) ].

Proof: Let

  δ_Ẑ = E_{θ∼π̂} ∆_Ẑ(θ) − D_KL(π̂||π);

then from Lemma 2.1 we have E_Ẑ exp(δ_Ẑ) ≤ 1. This implies that P(δ_Ẑ ≥ t) exp(t) ≤ 1. Therefore, with probability at most exp(−t), δ_Ẑ ≥ t. Rewriting this expression gives the first inequality. In order to prove the second inequality, we note that E_Ẑ δ_Ẑ ≤ ln E_Ẑ exp(δ_Ẑ) ≤ ln 1 = 0. Again, rewriting this expression gives the second inequality.
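The convex duality of Proposition 7.1 is easy to verify numerically on a finite parameter space: the duality gap is non-negative for every posterior, and vanishes for the tilted distribution dπ' = v dπ of Remark 7.1. A small sketch with arbitrarily chosen numbers (the specific f, prior, and posterior below are illustrative, not from the paper):

```python
import numpy as np

def kl(p, q):
    """KL-divergence between finite discrete distributions (both strictly positive)."""
    return float(np.sum(p * np.log(p / q)))

def duality_gap(f, p_hat, p_prior):
    """Gap in Proposition 7.1:
    [KL(p_hat||p_prior) + ln E_prior e^f] - E_{p_hat} f, which should be >= 0."""
    lhs = float(np.dot(p_hat, f))
    rhs = kl(p_hat, p_prior) + float(np.log(np.dot(p_prior, np.exp(f))))
    return rhs - lhs

rng = np.random.default_rng(2)
f = rng.standard_normal(5)               # arbitrary measurable function on 5 points
prior = np.full(5, 0.2)                  # uniform prior
p_hat = rng.dirichlet(np.ones(5))        # an arbitrary posterior
gap = duality_gap(f, p_hat, prior)       # non-negative for any posterior

# Equality is attained by the tilted distribution dpi' = v dpi of Remark 7.1:
tilted = prior * np.exp(f)
tilted /= tilted.sum()
```

The equality case is exactly what makes the Gibbs-type posterior optimal in the IRM objective.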

B. Proofs for Section IV

Theorem 4.1: Define the resolvability

  r_{ρ,S} = inf_{π'∈S} [ E_{θ∼π'} E_{Z_1} ℓ_θ(Z_1) + D_KL(π'||π)/(ρn) ].

Then for all α ∈ [0, 1), the expected generalization performance of the IRM method (4) can be bounded as

  −(1/ρ) E_Ẑ E_{θ∼π̂_{ρ,S}} ln E_{Z_1} e^{−ρℓ_θ(Z_1)} ≤ (1/(1−α)) [ r_{ρ,S} + (1/(ρn)) ln E_{θ∼π} ( E_{Z_1} e^{−ρℓ_θ(Z_1)} )^{αn} ].

Proof: We obtain from (4)

  E_Ẑ [ ρ E_{θ∼π̂_{ρ,S}} Σ_{i=1}^n ℓ_θ(Ẑ_i) + D_KL(π̂_{ρ,S}||π) ]
  ≤ inf_{π'∈S} E_Ẑ [ ρ E_{θ∼π'} Σ_{i=1}^n ℓ_θ(Ẑ_i) + D_KL(π'||π) ]
  ≤ inf_{π'∈S} [ ρn E_{θ∼π'} E_Ẑ ℓ_θ(Ẑ_1) + D_KL(π'||π) ] = ρn r_{ρ,S}.

Let L_θ(Ẑ) = ρ Σ_{i=1}^n ℓ_θ(Ẑ_i). Substituting the above bound into the right-hand side of the second inequality of Corollary 3.2, and using the fact that E_Z e^{−L_θ(Z)} = E_{Z_1}^n e^{−ρℓ_θ(Z_1)}, we obtain the desired inequality.

Theorem 4.2: Consider the IRM method (4). Assume there exists a positive constant K such that for all θ:

  E_{Z_1} ℓ_θ(Z_1)^2 ≤ K E_{Z_1} ℓ_θ(Z_1).

Assume either of the following conditions:
  • Bounded loss: ∃M ≥ 0 such that −inf_{θ,Z_1} ℓ_θ(Z_1) ≤ M; let β_ρ = 1 − K(e^{ρM} − ρM − 1)/(ρM^2).
  • Bernstein loss: ∃M, b > 0 such that E_{Z_1}(−ℓ_θ(Z_1))^m ≤ m! M^{m−2} Kb E_{Z_1} ℓ_θ(Z_1) for all θ ∈ G and integers m ≥ 3; let β_ρ = 1 − Kρ(1 − ρM + 2bρM)/(2 − 2ρM).

Assume we choose a sufficiently small ρ such that β_ρ > 0. Let the (true) expected loss of θ be R(θ) = E_{Z_1} ℓ_θ(Z_1); then the expected generalization performance of the IRM method is bounded by

  E_Ẑ E_{θ∼π̂_{ρ,S}} R(θ) ≤ (1/((1−α)β_ρ)) [ r_{ρ,S} + (1/(ρn)) ln E_{θ∼π} e^{−αρβ_ρ nR(θ)} ],

where r_{ρ,S} = inf_{π'∈S} [ E_{θ∼π'} R(θ) + D_KL(π'||π)/(ρn) ].

Proof: Under the bounded-loss condition, using the last bound on the logarithmic moment generating function in Proposition 1.2, we obtain

  ln E_{Z_1} e^{−ρℓ_θ(Z_1)} ≤ −ρ E_{Z_1} ℓ_θ(Z_1) + ((e^{ρM} − ρM − 1)/M^2) E_{Z_1} ℓ_θ(Z_1)^2
  ≤ −( ρ − (K/M^2)(e^{ρM} − ρM − 1) ) E_{Z_1} ℓ_θ(Z_1)
  = −ρβ_ρ E_{Z_1} ℓ_θ(Z_1).

Now substitute this bound into Theorem 4.1 and simplify to obtain the desired result. Similarly, under the Bernstein-loss condition, the logarithmic moment generating function estimate in Proposition 1.2 implies that

  ln E_{Z_1} e^{−ρℓ_θ(Z_1)} ≤ −ρ E_{Z_1} ℓ_θ(Z_1) + (Kρ^2/2) E_{Z_1} ℓ_θ(Z_1) + (ρ^3 KbM/(1 − ρM)) E_{Z_1} ℓ_θ(Z_1)
  = −ρβ_ρ E_{Z_1} ℓ_θ(Z_1).

Now substitute this bound into Theorem 4.1 and simplify to obtain the desired result.

Theorem 4.3: Consider the Gibbs algorithm (5). If (10) holds with a non-negative function R(θ), then

  E_Ẑ E_{θ∼π̂_ρ} R(θ) ≤ ∆(αβ_ρ, ρn) / ((1−α)β_ρ ρn),

where

  ∆(a, b) = ln[ E_{θ∼π} e^{−abR(θ)} / E_{θ∼π} e^{−bR(θ)} ]
  ≤ ln inf_{u,v} [ sup_{ε≤u} ( max(0, π(ε/a) − v) / π(ε) ) + inf_ε ( (v + (1 − v) exp(−bu)) / (π(ε) e^{−bε}) ) ],

and π(ε) = π({θ : R(θ) ≤ ε}).

Proof: For the Gibbs algorithm, the resolvability is given by (7), which can be expressed as

  r_{ρ,S} = −(1/(ρn)) ln E_{θ∼π} e^{−ρnR(θ)} = −(1/(ρn)) ln ∫_0^∞ π(ε/(ρn)) e^{−ε} dε,

where we denote the integral by A. Similarly, the second term (the localization term) on the right-hand side of (10) is

  (1/(ρn)) ln E_{θ∼π} e^{−αρβ_ρ nR(θ)} = (1/(ρn)) ln ∫_0^∞ π(ε/(αβ_ρ ρn)) e^{−ε} dε
  ≤ (1/(ρn)) ln [ B + C ],

where

  B = ∫_0^{ρnu} ( π(ε/(αρβ_ρ n)) − v ) e^{−ε} dε,   C = v + (1 − v) e^{−ρnu}.

To finish the proof, we only need to show that (B + C)/A ≤ e^{∆(αβ_ρ, ρn)}. Consider arbitrary real numbers u and v. From the expressions, it is easy to see that B/A ≤ sup_{ε≤u} max(0, π(ε/(αβ_ρ)) − v)/π(ε). Moreover, since A ≥ sup_ε π(ε) e^{−ρnε}, we have C/A ≤ C / sup_ε π(ε) e^{−ρnε}. Combining these inequalities, we have (B + C)/A ≤ e^{∆(αβ_ρ, ρn)}. The desired bound is now a direct consequence of (10).

Corollary 4.2: Consider the Gibbs algorithm (5). If (10) holds, then

  E_Ẑ E_{θ∼π̂_ρ} R(θ)
  ≤ (1/((1−α)β_ρ ρn)) inf_{u_1≤u_2} ln[ sup_{ε∈[u_1,u_2]} π(ε/(αβ_ρ))/π(ε) + exp(ρnu_1/(αβ_ρ)) + e^{−ρnu_2/2}/π(u_2/2) ]
  ≤ ε̄_local / ((1−α)β_ρ),

where π(ε) = π({θ : R(θ) ≤ ε}), and

  ε̄_local = 2/(ρn) + inf{ ε : ε ≥ (αβ_ρ/(ρn)) [ sup_{ε'∈[ε,2u]} ln( π(ε'/(αβ_ρ))/π(ε') ) + ρnu e^{−ρnu}/π(u) ] }.

Proof: For the first inequality, we simply take u = u_2 and v = π(u_1/(αβ_ρ)) in Theorem 4.3, and use the following bounds:

  sup_{ε≤u} max(0, π(ε/(αβ_ρ)) − v)/π(ε) ≤ sup_{ε∈[u_1,u_2]} π(ε/(αβ_ρ))/π(ε),
  v / sup_ε(π(ε)e^{−ρnε}) ≤ v / (v e^{−ρnu_1/(αβ_ρ)}) = exp(ρnu_1/(αβ_ρ)),
  (1 − v) exp(−ρnu) / sup_ε(π(ε)e^{−ρnε}) ≤ e^{−ρnu_2} / (π(u_2/2) e^{−ρnu_2/2}) = e^{−ρnu_2/2}/π(u_2/2).

For the second inequality, we let u_2 = 2u and u_1/(αβ_ρ) = ε̄_local − ln 2/(ρn). Then by the definition of ε̄_local, we have

  sup_{ε∈[u_1,u_2]} π(ε/(αβ_ρ))/π(ε) + exp(ρnu_1/(αβ_ρ)) + e^{−ρnu_2/2}/π(u_2/2)
  ≤ 2 exp(ρnu_1/(αβ_ρ)) = exp(ρnε̄_local).

This gives the second inequality.

Corollary 4.1: Consider the Gibbs algorithm (5). If (10) holds, then

  E_Ẑ E_{θ∼π̂_ρ} R(θ) ≤ inf_ε [ρnε − ln π(ε)] / (β_ρ ρn) ≤ 2ε̄_global/β_ρ,

where π(ε) = π({θ : R(θ) ≤ ε}) and

  ε̄_global = inf{ ε : ε ≥ (1/(ρn)) ln(1/π(ε)) }.

Proof: For the first inequality, we take v = 1 in Theorem 4.3 and let α → 0. For the second inequality, we simply note from the definition of ε̄_global that inf_ε [ρnε − ln π(ε)] ≤ 2ρnε̄_global.

C. Proofs for Section V

Proposition 5.1: If there exists a constant M_G ≥ 2/3 such that E_{Z_1} ℓ_θ(Z_1)^3 ≤ M_G E_{Z_1} ℓ_θ(Z_1)^2, then E_{Z_1} ℓ_θ(Z_1)^2 ≤ (8M_G/3) E_{Z_1} ℓ_θ(Z_1).

Proof: Consider an arbitrary θ ∈ G. Note that as β → 0^+:

  E_{Z_1} ℓ_{θ_G + β(θ − θ_G)}(Z_1) = −β E_{Z_1} [ (θ(Z_1) − θ_G(Z_1)) / θ_G(Z_1) ] + O(β^2).

Letting β → 0, from the convexity of G and the optimality of θ_G, we have

  E_{Z_1} [ (θ(Z_1) − θ_G(Z_1)) / θ_G(Z_1) ] ≤ 0.

This implies that ∀ρ ∈ (0, 1):

  E_{Z_1} e^{−ρℓ_θ(Z_1)} ≤ ( E_{Z_1} e^{−ℓ_θ(Z_1)} )^ρ = ( E_{Z_1} θ(Z_1)/θ_G(Z_1) )^ρ ≤ 1.

We obtain from e^{−x} ≥ 1 − x + x^2/2 − x^3/6:

  1 ≥ E_{Z_1} e^{−ρℓ_θ(Z_1)}
  ≥ 1 − ρ E_{Z_1} ℓ_θ(Z_1) + (ρ^2/2) E_{Z_1} ℓ_θ(Z_1)^2 − (ρ^3 M_G/6) E_{Z_1} ℓ_θ(Z_1)^2.

Now let ρ = 3/(2M_G) and rearrange; we obtain the desired bound.

Proposition 5.2: Let M_G = sup_{X_1, θ∈G} E_{Y_1|X_1}(θ(X_1) − Y_1)^2. Then E_{Z_1} ℓ_θ(Z_1)^2 ≤ 4M_G E_{Z_1} ℓ_θ(Z_1).

Proof: Note that ∀β ∈ (0, 1):

  E_{Z_1} ℓ_{θ_G+β(θ−θ_G)}(Z_1) = β^2 E_{X_1}(θ(X_1) − θ_G(X_1))^2 + 2β E_{Z_1}(θ(X_1) − θ_G(X_1))(θ_G(X_1) − Y_1).

Letting β → 0, from the convexity of G and the optimality of θ_G, we have

  E_{Z_1}(θ(X_1) − θ_G(X_1))(θ_G(X_1) − Y_1) ≥ 0.

Now let β = 1; we obtain from the above inequality that

  E_{Z_1} ℓ_θ(Z_1) ≥ E_{X_1}(θ(X_1) − θ_G(X_1))^2.

Therefore we have

  E_{Z_1} ℓ_θ(Z_1)^2 = E_{Z_1} (θ(X_1) − θ_G(X_1))^2 [(θ(X_1) − Y_1) + (θ_G(X_1) − Y_1)]^2
  ≤ 4M_G E_{X_1}(θ(X_1) − θ_G(X_1))^2 ≤ 4M_G E_{Z_1} ℓ_θ(Z_1).

Proposition 5.3: Let A_G = sup_{X_1, θ∈G} |θ(X_1) − θ_G(X_1)|, and assume sup_{X_1, θ∈G} E_{Y_1|X_1} |θ(X_1) − Y_1|^m ≤ m! B_G^{m−2} M_G for m ≥ 2. Then we have

  E_{Z_1}(−ℓ_θ(Z_1))^m ≤ m! (2A_G B_G)^{m−2} 4M_G E_{Z_1} ℓ_θ(Z_1).

Proof: From the proof of Proposition 5.2, we know E_{Z_1} ℓ_θ(Z_1) ≥ E_{X_1}(θ(X_1) − θ_G(X_1))^2. Therefore ∀m ≥ 2, we have

  E_{Z_1}(−ℓ_θ(Z_1))^m
  ≤ E_{X_1} [ |θ(X_1) − θ_G(X_1)|^m E_{Y_1|X_1} |(θ(X_1) − Y_1) + (θ_G(X_1) − Y_1)|^m ]
  ≤ m!(2A_G B_G)^{m−2} 4M_G E_{X_1} |θ(X_1) − θ_G(X_1)|^2
  ≤ m!(2A_G B_G)^{m−2} 4M_G E_{Z_1} ℓ_θ(Z_1).

This gives the desired bound.

Corollary 5.1: Assume that there exists a function y_0(X) such that
  • For all X_1, the random variable |Y_1 − y_0(X_1)|, conditioned on X_1, is dominated by the absolute value of a zero-mean Gaussian random variable with standard deviation σ.
  • There exists a constant b > 0 such that sup_{X_1} |y_0(X_1) − θ(X_1)| ≤ b.
If we also choose A such that A ≥ sup_{X_1, θ∈G} |θ(X_1) − θ_G(X_1)|, then (10) holds with β_ρ = 1 − 4ρ(b + σ)^2/(1 − 2A(b + σ)ρ) > 0.

Proof: Let ξ be a zero-mean Gaussian random variable with unit variance. Using the simple recursion (integration by parts)

  ∫_0^∞ ξ^m e^{−ξ^2/2} dξ = [ −ξ^{m−1} e^{−ξ^2/2} ]_0^∞ + (m − 1) ∫_0^∞ ξ^{m−2} e^{−ξ^2/2} dξ,

it is not difficult to check that ∀m ≥ 2: E|ξ|^m ≤ m!. We thus have

  ( E_{Y_1|X_1} |Y_1 − θ(X_1)|^m )^{1/m}
  ≤ |y_0(X_1) − θ(X_1)| + ( E_{Y_1|X_1} |Y_1 − y_0(X_1)|^m )^{1/m}
  ≤ b + σ(m!)^{1/m} ≤ (b + σ)(m!)^{1/m}.

Therefore Proposition 5.3 holds with A_G = A, B_G = b + σ, and M_G = (b + σ)^2. Now we can just apply Theorem 5.2.

D. Proofs for Section VI

Theorem 6.1: Consider an arbitrary randomized estimator π̂(Z) on B' ⊂ G, where Z consists of n independent samples Z = {(X_1, Y_1), ..., (X_n, Y_n)} from some underlying distribution P_θ. Then for every non-negative function R_θ(θ'), we have for all B that contain θ:

  E_{θ∼π_B} E_{Z∼P_θ^n} E_{θ̂∼π̂(Z)} R_θ(θ̂) ≥ 0.5 sup{ ε : inf_{θ'∈B'} ln( 1 / π_B({θ : R_θ(θ') < ε}) ) ≥ 2n∆_B + ln 4 },

where ∆_B = E_{θ∼π_B} E_{θ'∼π_B} D_KL(P_θ(Z_1)||P_{θ'}(Z_1)).

Proof: The joint distribution of (θ, Z) is given by Π_{i=1}^n dP_θ(Z_i) dπ_B(θ). Denote by I(θ, Z) the mutual information between θ and Z. Now let Z' be a random variable independent of θ and with the same marginal as Z; then by definition, the mutual information can be regarded as the KL-divergence between the joint distributions of (θ, Z) and (θ, Z'), which we write (with a slight abuse of notation) as

  I(θ, Z) = D_KL((θ, Z)||(θ, Z')).

Now consider an arbitrary randomized estimator π̂(Z), and let θ̂ ∈ B' be a random variable drawn from the distribution π̂(Z). By the data-processing theorem for the KL-divergence (that is, processing does not increase KL-divergence), with input (θ, Z) ∈ G × Z and (random) binary output 1(R_θ(θ̂(Z)) ≤ ε), we obtain

  D_KL( 1(R_θ(θ̂(Z)) ≤ ε) || 1(R_θ(θ̂(Z')) ≤ ε) )
  ≤ D_KL((θ, Z)||(θ, Z')) = I(θ, Z)
  = E_{θ_1∼π_B} E_{Z∼P_{θ_1}^n} ln[ dP_{θ_1}^n(Z) / E_{θ_2∼π_B} dP_{θ_2}^n(Z) ]
  ≤ E_{θ_1∼π_B} E_{Z∼P_{θ_1}^n} E_{θ_2∼π_B} ln[ dP_{θ_1}^n(Z) / dP_{θ_2}^n(Z) ]
  = n E_{θ_1∼π_B} E_{θ_2∼π_B} D_KL(P_{θ_1}||P_{θ_2}) = n∆_B.

The second inequality is a consequence of Jensen's inequality and the concavity of the logarithm.

Now let p_1 = P(R_θ(θ̂(Z)) ≤ ε) and p_2 = P(R_θ(θ̂(Z')) ≤ ε); then the above inequality can be rewritten as

  D_KL(p_1||p_2) = p_1 ln(p_1/p_2) + (1 − p_1) ln((1 − p_1)/(1 − p_2)) ≤ n∆_B.

Since θ̂(Z') is independent of θ, we have

  p_2 ≤ sup_{θ'∈B'} π_B({θ : R_θ(θ') ≤ ε}).

Now consider any ε such that sup_{θ'∈B'} π_B({θ : R_θ(θ') ≤ ε}) ≤ 0.25 e^{−2n∆_B}. This implies that p_2 ≤ 0.25 e^{−2n∆_B} ≤ 0.25.

We now show that in this case p_1 ≤ 1/2. Since D_KL(p_1||p_2) is increasing in p_1 on [p_2, 1], we only need to show that D_KL(0.5||p_2) ≥ n∆_B. This easily follows from the inequality

  D_KL(0.5||p_2) ≥ 0.5 ln(0.5/p_2) + 0.5 ln(0.5/1) ≥ n∆_B.

Now we have shown that p_1 ≤ 0.5, which implies that P(R_θ(θ̂(Z)) ≥ ε) ≥ 0.5. Therefore we have

  E_{θ∼π_B} E_{Z∼P_θ^n} E_{θ̂∼π̂(Z)} R_θ(θ̂) ≥ 0.5ε.

Remark 7.2: The data-processing theorem (for divergence) states that for random variables A and B and a (possibly random) processing function h, the inequality D_KL(A||B) ≥ D_KL(h(A)||h(B)) holds. The proof is not difficult. For a random variable C, we use P_C to denote the corresponding probability measure. Then it is easy to check that

  D_KL((A_1, A_2)||(B_1, B_2)) = D_KL(P_{A_1}||P_{B_1}) + E_{ξ∼P_{A_1}} D_KL(P_{A_2}(·|ξ)||P_{B_2}(·|ξ)).

Now let A_1 = A, A_2 = h(A), B_1 = B, and B_2 = h(B); then P_{A_2}(·|ξ) = P_{B_2}(·|ξ), which implies that D_KL(A||B) = D_KL((A, h(A))||(B, h(B))). Similarly, we can let A_2 = A, A_1 = h(A), B_2 = B, and B_1 = h(B); then we obtain D_KL((h(A), A)||(h(B), B)) ≥ D_KL(h(A)||h(B)). Combining the two steps proves the claim.

Remark 7.3: The author learned the idea of deriving lower bounds using the data-processing theorem from [13], where a generalization of Fano's inequality was obtained. The authors of that paper attributed this idea back to Blahut [14]. In Theorem 6.1, this technique is used to directly obtain a general lower bound without going through the much more specialized Fano's inequality, as has typically been done in the minimax literature (see [9] and the discussions therein). This technique also enables us to obtain lower bounds with respect to an arbitrary prior (which is impossible using the Fano's inequality approach).
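The data-processing inequality of Remark 7.2, which drives the proof of Theorem 6.1, can be illustrated numerically: coarsening two distributions by the same deterministic map can only decrease their KL-divergence. A toy check (the two distributions and the merging map are chosen arbitrarily):

```python
import numpy as np

def kl(p, q):
    """KL-divergence between finite discrete distributions (both strictly positive)."""
    return float(np.sum(p * np.log(p / q)))

def push_forward(p, h, m):
    """Distribution of h(A) when A ~ p; h maps state i to h[i] in {0, ..., m-1}."""
    out = np.zeros(m)
    for i, pi in enumerate(p):
        out[h[i]] += pi
    return out

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])
h = [0, 0, 1, 1]                     # deterministic coarsening: merge states pairwise
kl_full = kl(p, q)                   # divergence before processing
kl_coarse = kl(push_forward(p, h, 2), push_forward(q, h, 2))   # after processing
```

In the proof of Theorem 6.1, the "processing" is the binary map (θ, Z) → 1(R_θ(θ̂(Z)) ≤ ε), so the binary divergence there is bounded by the mutual information I(θ, Z).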

upper bound and lower bound analyses are relatively tight.

Our work can be regarded as an extension of the standard minimax framework since we allow the performance of the estimator to vary for different underlying distributions, according to the pre-defined prior. Our analysis can also be applied in the minimax framework by employing a uniform prior on an ε-net of the underlying space. The framework we study here is closely related to the concept of adaptation in the statistical literature. At the conceptual level, both seek to find locally near optimal estimators around any possible true underlying distribution within the class. However, we believe our treatment is cleaner and more general since for many problems, the simple IRM estimator with an appropriately chosen prior is automatically locally optimal around any possible truth. The traditional approach to adaptation usually requires us to design specialized estimators that are much less general.

Our results also indicate that if we pick a reasonably well-behaved prior (locally relatively smooth around the truth), then the local statistical complexity is fully determined by the local prior structure. Specifically, the rate of convergence is determined by the decay rate of the local entropy (the local prior probability of a ball at some scale relative to a ball at a scale larger by a constant factor). It is possible to obtain a near optimal estimator using the IRM method so that the performance is no more than a constant factor worse than the best local estimator around every possible underlying truth.

Related to the optimality of IRM with respect to a general prior, it was pointed out in Section III-A that there is no need for model selection in our framework. Similar to the Bayesian method, which does not require prior selection (at least in principle), in our framework a prior mixture instead of prior selection can be used for the IRM method. We believe that the ability to incorporate model selection into prior knowledge is an important conceptual advantage of our analysis.

This paper also implies that for some problems, the IRM method can be better behaved than (possibly penalized) empirical risk minimization, which picks an estimator to minimize the (penalized) empirical risk. In particular, for density estimation and regression, the IRM method can achieve the best possible convergence rate under relatively mild assumptions on the prior structure. However, it is known that for some non-parametric problems, empirical risk minimization can lead to a sub-optimal convergence rate if the covering number grows too rapidly (or, in our case, the prior decays too rapidly) as ε (the size of the covering ball) decreases [15].

APPENDIX I
LOGARITHMIC MOMENT GENERATING FUNCTION

Let W be a random variable. In the following, we study properties of its logarithmic moment generating function Λ_W(ρ) = ln E exp(ρW) as a function of ρ > 0, so that we can gain additional insights into the behavior of the left-hand side of Theorem 4.1. Although the properties presented below are not necessarily new, it is useful to list them for reference. In the following, we also use VarW = E(W − EW)² to denote the variance of W.

Proposition 1.1: The following properties of Λ_W(ρ) hold for ρ > 0:
1) Λ_W(ρ) ≥ ρEW.
2) Λ_W(ρ)/ρ is an increasing function of ρ.
3) Λ′_W(ρ) = E[W e^{ρW − Λ_W(ρ)}] is non-decreasing.
4) Λ_W(ρ) ≤ ρΛ′_W(ρ).
5) Λ″_W(ρ) = E[(W − Λ′_W(ρ))² e^{ρW − Λ_W(ρ)}] ≤ E[(W − EW)² e^{ρW − Λ_W(ρ)}] ≤ E[(W − EW)² e^{ρ(W − EW)}].
6) Λ_W(ρ) ≤ ρEW + (ρ²/2)VarW + (ρ³/2)E_{W ≥ EW}[(W − EW)³ e^{ρ(W − EW)}].
7) Λ_{aW+b}(ρ) = Λ_W(aρ) + ρb.

Proof:
1) Using Jensen's inequality, we have Ee^{ρW} ≥ exp(ρEW).
2) Again using Jensen's inequality, for 0 < ρ′ < ρ, we have Ee^{ρW} ≥ (E exp(ρ′W))^{ρ/ρ′}.
3) This is the same as saying that Λ_W(ρ) is convex, or that Λ″_W(ρ) is non-negative (see below).
4) Using a Taylor expansion, we have Λ_W(ρ) = ρΛ′_W(ρ′), where ρ′ ∈ (0, ρ). Since Λ′_W(ρ′) ≤ Λ′_W(ρ), we obtain the inequality.
5) The first equality is by direct calculation. The second inequality follows from the calculation that E[(W − Λ′_W(ρ))² e^{ρW − Λ_W(ρ)}] = E[(W − EW)² e^{ρW − Λ_W(ρ)}] − (EW − Λ′_W(ρ))². The third inequality uses −Λ_W(ρ) ≤ −ρEW.
6) Using a Taylor expansion, we have Λ_W(ρ) = Λ_W(0) + ρΛ′_W(0) + (ρ²/2)Λ″_W(ρ′), where ρ′ ∈ (0, ρ), Λ_W(0) = 0, and Λ′_W(0) = EW. Now, we obtain the desired bound by using the following estimate: Λ″_W(ρ′) ≤ E[(W − EW)² e^{ρ′(W − EW)}] ≤ E[(W − EW)² max(e^{ρ(W − EW)}, 1)]. Splitting the last expectation over the events W < EW and W ≥ EW, and using e^x − 1 ≤ x e^x for x ≥ 0, it is bounded by VarW + ρE_{W ≥ EW}[(W − EW)³ e^{ρ(W − EW)}], which gives the stated bound.
7) Direct calculation.

Remark 1.1: The bound in 6) indicates that if Λ_W(ρ₀) < ∞ for some ρ₀ > 0, then when ρ → 0+, Λ_W(ρ) = ρEW + (ρ²/2)VarW + O(ρ³). In general, we have the property that Λ_W(ρ) = ρEW + O(ρ²), which is useful in our analysis.

Proposition 1.1 gives the behavior of Λ_W(ρ) for general random variables. In the following, for reference purposes, we also list some examples of Λ_W(ρ) for more concrete random variables.
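As a quick numerical sanity check (a sketch for illustration only, not part of the analysis above), properties 1) and 2) of Proposition 1.1 and the small-ρ expansion of Remark 1.1 can be verified for a Bernoulli variable, whose log-MGF has the closed form Λ_W(ρ) = ln(1 − p + p e^ρ); the parameter p = 0.3 and the grid of ρ values below are arbitrary choices:

```python
import math

p = 0.3                    # Bernoulli parameter (arbitrary test value)
EW, VarW = p, p * (1 - p)  # mean and variance of Bernoulli(p)

def log_mgf(rho):
    # closed-form Lambda_W(rho) = ln E exp(rho * W) for W ~ Bernoulli(p)
    return math.log(1 - p + p * math.exp(rho))

rhos = [0.01, 0.1, 0.5, 1.0, 2.0]

# Property 1): Lambda_W(rho) >= rho * EW (Jensen's inequality)
assert all(log_mgf(r) >= r * EW for r in rhos)

# Property 2): Lambda_W(rho) / rho is increasing in rho
vals = [log_mgf(r) / r for r in rhos]
assert all(a <= b for a, b in zip(vals, vals[1:]))

# Remark 1.1: Lambda_W(rho) = rho*EW + (rho^2/2)*VarW + O(rho^3) as rho -> 0+
rho = 1e-3
assert abs(log_mgf(rho) - (rho * EW + rho**2 / 2 * VarW)) < rho**3

print("Proposition 1.1 sanity checks passed")
```

Any other distribution with a finite log-MGF could be substituted for the Bernoulli case; this one is convenient because Λ_W(ρ) is available exactly.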

Proposition 1.2: Given a random variable W, we have the following bounds for its logarithmic moment generating function Λ_W(ρ):
1) If W is a Gaussian random variable, then Λ_W(ρ) = ρEW + (ρ²/2)VarW.
2) If W is a bounded variable in [0, 1], then Λ_W(ρ) ≤ ln(1 + (e^ρ − 1)EW) ≤ ρEW + ρ²/8.
3) If W satisfies the moment condition EW^m ≤ m!M^{m−2}b for integers m ≥ 3, then ∀ρ ∈ [0, 1/M): Λ_W(ρ) ≤ ρEW + (ρ²/2)EW² + ρ³bM(1 − ρM)^{−1}.
4) If W ≤ 1, then ∀ρ ≥ 0: Λ_W(ρ) ≤ ρEW + (e^ρ − ρ − 1)EW².

Proof:
1) Direct calculation.
2) The first inequality follows from Jensen's inequality e^{ρW} ≤ (1 − W)e^0 + W e^ρ, and is the basis of Hoeffding's inequality in [16]. The second inequality can be easily checked using a Taylor expansion (for example, see [16]). The first inequality holds with equality when W ∈ {0, 1} is Bernoulli.
3) This simple bound is the basis of a slightly more refined Bernstein's inequality under moment conditions. We have Ee^{ρW} = 1 + ρEW + (ρ²/2)EW² + Σ_{m=3}^∞ (ρ^m/m!)EW^m ≤ 1 + ρEW + (ρ²/2)EW² + ρ³bM(1 − ρM)^{−1}. Now, use the fact that ln x ≤ x − 1.
4) Using ln x ≤ x − 1, we have Λ_W(ρ) ≤ ρEW + ρ²E[((e^{ρW} − ρW − 1)/(ρW)²)W²] ≤ ρEW + ρ²E[((e^ρ − ρ − 1)/ρ²)W²], where the second inequality follows from the fact that the function (e^x − x − 1)/x² is non-decreasing and ρW ≤ ρ.

REFERENCES

[1] T. Zhang, "Learning bounds for a generalized family of Bayesian posterior distributions," in NIPS 03, 2004.
[2] D. McAllester, "PAC-Bayesian model averaging," in COLT'99, 1999, pp. 164–170.
[3] O. Catoni, Statistical Learning Theory and Stochastic Optimization, ser. Lecture Notes in Mathematics. Ecole d'Ete de Probabilites de Saint-Flour XXXI, 2001, vol. 1851.
[4] ——, "A PAC-Bayesian approach to adaptive classification," Laboratoire de Probabilites et Modeles Aleatoires, Tech. Rep., 2003, http://www.proba.jussieu.fr/users/catoni/homepage/classif.pdf.
[5] J.-Y. Audibert, "PAC-Bayesian statistical learning theory," Ph.D. dissertation, Université Paris VI, 2004.
[6] T. Zhang, "From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation," The Annals of Statistics, 2006, to appear.
[7] S. van de Geer, Empirical Processes in M-estimation. Cambridge University Press, 2000.
[8] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes, ser. Springer Series in Statistics. New York: Springer-Verlag, 1996.
[9] Y. Yang and A. Barron, "Information-theoretic determination of minimax rates of convergence," The Annals of Statistics, vol. 27, pp. 1564–1599, 1999.
[10] A. Barron and T. Cover, "Minimum complexity density estimation," IEEE Transactions on Information Theory, vol. 37, pp. 1034–1054, 1991.
[11] R. Meir and T. Zhang, "Generalization error bounds for Bayesian mixture algorithms," Journal of Machine Learning Research, vol. 4, pp. 839–860, 2003.
[12] M. Seeger, "PAC-Bayesian generalization error bounds for Gaussian process classification," Journal of Machine Learning Research, vol. 3, pp. 233–269, 2002.
[13] T. S. Han and S. Verdú, "Generalizing the Fano inequality," IEEE Transactions on Information Theory, vol. 40, pp. 1247–1251, 1994.
[14] R. Blahut, "Information bounds of the Fano-Kullback type," IEEE Transactions on Information Theory, vol. 22, pp. 410–421, 1976.
[15] L. Birgé and P. Massart, "Rates of convergence for minimum contrast estimators," Probab. Theory Related Fields, vol. 97, no. 1-2, pp. 113–150, 1993.
[16] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, March 1963.
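The bounds in Proposition 1.2 for bounded variables are also easy to check numerically. The following sketch (illustrative only; the finite-support test distributions below are arbitrary and not from the paper) verifies both inequalities of item 2), including the Bernoulli equality case:

```python
import math

def log_mgf(support, probs, rho):
    # exact Lambda_W(rho) = ln E exp(rho * W) for a finite-support W
    return math.log(sum(q * math.exp(rho * w) for w, q in zip(support, probs)))

# an arbitrary (non-Bernoulli) distribution supported on [0, 1]
support, probs = [0.0, 0.25, 0.5, 1.0], [0.1, 0.4, 0.3, 0.2]
EW = sum(q * w for w, q in zip(support, probs))  # mean of W

for rho in [0.1, 0.5, 1.0, 2.0, 5.0]:
    lam = log_mgf(support, probs, rho)
    hoeffding = math.log(1 + (math.exp(rho) - 1) * EW)
    assert lam <= hoeffding + 1e-12             # first inequality of item 2)
    assert hoeffding <= rho * EW + rho**2 / 8   # second inequality of item 2)

# the first inequality holds with equality when W is Bernoulli
p = 0.4
for rho in [0.1, 1.0, 3.0]:
    lam = log_mgf([0.0, 1.0], [1 - p, p], rho)
    assert abs(lam - math.log(1 + (math.exp(rho) - 1) * p)) < 1e-12

print("Proposition 1.2, item 2) checks passed")
```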

Tong Zhang received a B.A. in mathematics and computer science from Cornell University in 1994 and a Ph.D. in computer science from Stanford University in 1998. After being a research staff member at the IBM T.J. Watson Research Center in Yorktown Heights, New York, he joined Yahoo in 2005. His research interests include machine learning, numerical algorithms, and their applications in text processing.
