
S&DS 677: Topics in High-dim and Info Theory Spring 2021

Lecture 1: Basics of Statistical Decision Theory    Lecturer: Yihong Wu

{lec:stats-intro}

1.1 Basics of Statistical Decision Theory

We start by presenting the basic elements of statistical decision theory. We refer to the classics [Fer67, LC86, Str85] for a systematic treatment. A statistical model or experiment refers to a collection P of probability distributions over a common measurable space (X, F). Specifically, let us consider

P = {Pθ : θ ∈ Θ}, (1.1) {eq:parametric-model}

where each distribution is indexed by a parameter θ taking values in the parameter space Θ. In the decision-theoretic framework, we play the following game: Nature picks some parameter θ ∈ Θ and generates data X ∼ Pθ. A statistician observes the data X and wants to infer the parameter θ or certain attributes thereof. Specifically, consider some functional T : Θ → Y; the goal is to estimate T(θ) on the basis of the observation X. Here the estimand T(θ) may be the parameter θ itself, or some function thereof (e.g. T(θ) = 1{θ>0} or ‖θ‖). An estimator (decision rule) is a function T̂ : X → Ŷ. Note that the action space Ŷ need not be the same as Y (e.g. T̂ may be a confidence interval). Here T̂ can be either deterministic, i.e. T̂ = T̂(X), or randomized, i.e., T̂ obtained by passing X through a conditional probability

distribution (Markov transition kernel) P_T̂|X, or a channel in the language of Part ??. For all practical purposes, we can write T̂ = T̂(X, U), where U denotes external randomness uniform on [0, 1] and independent of X.

To measure the quality of an estimator T̂, we introduce a loss function ℓ : Y × Ŷ → R such that ℓ(T, T̂) is the loss of T̂ for estimating T. Since we are dealing with loss (as opposed to reward), all the negative (converse) results are lower bounds and all the positive (achievable) results are upper bounds. Note that X is a random variable, and so are T̂ and ℓ(T, T̂). Therefore, to make sense of “minimizing the loss”, we consider the average loss, known as the risk:

Rθ(T̂) = Eθ[ℓ(T, T̂)] = ∫ Pθ(dx) P_T̂|X(dt̂|x) ℓ(T(θ), t̂), (1.2) {eq:risk-def}

which we refer to as the risk of T̂ at θ. The subscript in Eθ indicates the distribution with respect to which the expectation is taken. Note that the risk depends on the estimator as well as the ground truth.

Remark 1.1. We note that the problems of hypothesis testing and inference can be encompassed as special cases of the estimation paradigm. There are three formulations for testing:

• Simple vs. simple hypotheses

H0 : θ = θ0 vs. H1 : θ = θ1, θ0 ≠ θ1

• Simple vs. composite hypotheses

H0 : θ = θ0 vs. H1 : θ ∈ Θ1, θ0 ∉ Θ1

• Composite vs. composite hypotheses

H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1, Θ0 ∩ Θ1 = ∅.

In each case one can introduce the appropriate parameter space and loss function. For example, in the last (most general) case, we may take

Θ = Θ0 ∪ Θ1, T(θ) = 0 if θ ∈ Θ0 and T(θ) = 1 if θ ∈ Θ1, T̂ ∈ {0, 1},

and use the zero-one loss ℓ(T, T̂) = 1{T ≠ T̂}, so that the risk Rθ(T̂) = Pθ{θ ∉ Θ_T̂} is the probability of error.

For the problem of inference, the goal is to output a confidence interval (or region) which covers the true parameter with high probability. In this case we can let T̂ be a subset of Θ and we may choose the loss function ℓ(θ, T̂) = 1{θ ∉ T̂} + width(T̂), in order to balance the coverage and the size of the confidence region.

Remark 1.2 (Randomized versus deterministic estimators). Although most of the estimators used in practice are deterministic, there are a number of reasons to consider randomized estimators:

• For certain formulations, such as minimizing the worst-case risk (minimax approach), deterministic estimators are suboptimal and it is necessary to randomize. On the other hand, if the objective is to minimize the average risk (Bayes approach), then it loses no generality to restrict to deterministic estimators.

• The space of randomized estimators (viewed as Markov kernels) is convex, and it is the convex hull of the deterministic estimators. This convexification is needed, for example, for the treatment of minimax theorems.

See Section 1.3 for a detailed discussion and examples.

A well-known fact is that for a convex loss function (i.e., T̂ ↦ ℓ(T, T̂) is convex), randomization does not help. Indeed, for any randomized estimator T̂, we can derandomize it by considering its conditional expectation E[T̂|X], which is a deterministic estimator whose risk is dominated by that of the original T̂ at every θ, namely, Rθ(T̂) = Eθ[ℓ(T, T̂)] ≥ Eθ[ℓ(T, E[T̂|X])], by Jensen’s inequality.
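This derandomization step can be checked numerically. The following is a minimal Monte Carlo sketch (not part of the notes; the randomized estimator X + U and all constants are illustrative assumptions): in the scalar GLM under quadratic loss, adding external noise U to the estimator inflates the risk, while its conditional expectation E[θ̂|X] = X recovers the smaller risk, as Jensen's inequality guarantees.

```python
import random

# Scalar GLM X = theta + Z, Z ~ N(0,1), quadratic loss.
# Hypothetical randomized estimator: theta_hat = X + U with external noise U.
# Its derandomization E[theta_hat | X] = X has pointwise smaller risk.
random.seed(0)
theta = 1.7          # any fixed ground truth works
n_mc = 200_000       # Monte Carlo repetitions

risk_rand, risk_det = 0.0, 0.0
for _ in range(n_mc):
    x = theta + random.gauss(0.0, 1.0)
    u = random.uniform(-1.0, 1.0)          # external randomness, E[u] = 0
    risk_rand += (x + u - theta) ** 2      # randomized: X + U
    risk_det += (x - theta) ** 2           # derandomized: E[theta_hat | X] = X
risk_rand /= n_mc
risk_det /= n_mc

# Exactly: E[(X+U-theta)^2] = 1 + Var(U) = 4/3, while E[(X-theta)^2] = 1.
print(risk_rand, risk_det)
```

The gap between the two estimates is precisely Var(U), the price of the external randomness under a convex loss.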

1.2 Gaussian Location Model (GLM) {sec:GLM}

Note that, without loss of generality, all statistical models can be expressed in the parametric form of (1.1) (since we can take θ to be the distribution itself). In the statistical literature, it is customary to refer to a model as parametric if θ takes values in a finite-dimensional Euclidean space (so that each distribution is specified by finitely many parameters), and nonparametric if θ takes values in some infinite-dimensional space (e.g. the sequence model). Perhaps the most basic parametric model is the Gaussian Location Model (GLM), also known as the Normal Mean Model (or the Gaussian channel in information-theoretic terms, cf. Example ??).

• The model is given by

P = {N(θ, σ²Id) : θ ∈ Θ},

where Id is the d-dimensional identity matrix and Θ ⊂ R^d. Equivalently, we can express the data as a noisy observation of the unknown parameter θ as

X = θ + Z, Z ∼ N(0, σ²Id).

The cases d = 1 and d > 1 are referred to as the scalar and vector case, respectively. We will also consider the matrix case, for which we can vectorize a d × d matrix θ as a d²-dimensional vector.

• The estimand can be T(θ) = θ, in which case the goal is to denoise the vector θ, or certain functionals such as T(θ) = ‖θ‖₂, θmax = max{θ1, . . . , θd}, or eigenvalues in the matrix case.

• Loss function: For estimating θ itself, it is customary to use a loss function defined by certain norms, such as ℓ(θ, θ̂) = ‖θ − θ̂‖p^α for some 1 ≤ p ≤ ∞ and α > 0 (with p = α = 2 corresponding to the quadratic loss), where ‖θ‖p ≜ (Σi |θi|^p)^{1/p}.

• Parameter space: for example

– Θ = R^d, in which case there is no assumption on θ (unstructured problem).

– Θ = ℓp-norm balls.

– Θ = {all k-sparse vectors} = {θ ∈ R^d : ‖θ‖₀ ≤ k}, where ‖θ‖₀ ≜ |{i : θi ≠ 0}| denotes the size of the support, known as the ℓ0-“norm”.

– In the matrix case: Θ = {θ ∈ R^{d×d} : rank(θ) ≤ r}, the set of low-rank matrices.

Note that by definition, more structure (a smaller parameter space) always makes the estimation task easier (smaller risk), but not necessarily so in terms of computation.

• Estimator: Some well-known estimators include the Maximum Likelihood Estimator (MLE)

θ̂ML = X (1.3) {eq:MLE}

and the James-Stein estimator based on shrinkage

θ̂JS = (1 − (d − 2)σ²/‖X‖₂²) X (1.4) {eq:JS}

The choice of the estimator depends on both the objective and the parameter space. For instance, if θ is known to be sparse, it makes sense to set the smaller entries of the observed X to zero in order to better denoise θ.

• The hypothesis testing problem in the GLM is well-studied. For example, one can consider detecting the presence of a signal by testing H0 : θ = 0 against H1 : ‖θ‖ ≥ ε, or testing the weak signal H0 : ‖θ‖ ≤ ε0 versus the strong signal H1 : ‖θ‖ ≥ ε1, with or without further sparsity assumptions on θ. There is a rich body of literature devoted to these problems (cf. the monograph [IS03]), which we will revisit in Lecture ??.
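The contrast between the MLE (1.3) and the James-Stein estimator (1.4) can be seen in a quick simulation. This is a sketch, not part of the notes; the dimension, the ground truth θ, and the Monte Carlo budget are arbitrary choices for illustration.

```python
import random

# Monte Carlo comparison of the MLE and James-Stein in X ~ N(theta, I_d), d >= 3.
random.seed(1)
d, sigma, n_mc = 10, 1.0, 20_000
theta = [0.5] * d                      # a fixed, illustrative ground truth

mse_mle, mse_js = 0.0, 0.0
for _ in range(n_mc):
    x = [t + random.gauss(0.0, sigma) for t in theta]
    norm2 = sum(xi * xi for xi in x)
    shrink = 1.0 - (d - 2) * sigma**2 / norm2   # James-Stein shrinkage factor
    mse_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    mse_js += sum((shrink * xi - t) ** 2 for xi, t in zip(x, theta))
mse_mle /= n_mc
mse_js /= n_mc

# The MLE risk is exactly d*sigma^2 = 10 at every theta; James-Stein is smaller.
print(mse_mle, mse_js)
```

Near the origin the shrinkage gain is large; as ‖θ‖ grows the two risks merge, as Fig. 1.2 below shows.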

1.3 Bayes risk, minimax risk, and the minimax theorem {sec:minimax-bayes}

One of our main objectives in this part of the book is to understand the fundamental limit of statistical estimation, that is, to determine the performance of the best estimator. As in (1.2), the risk Rθ(θ̂) of an estimator θ̂ for T(θ) depends on the ground truth θ. To compare the risk profiles of different estimators meaningfully requires some thought. As a toy example, Fig. 1.1 depicts the risk functions of three estimators. It is clear that θ̂1 is superior to θ̂2 in the sense that the risk of the former is pointwise lower than that of the latter. (In the statistical literature we say θ̂2 is inadmissible.) However, the comparison of θ̂1 and θ̂3 is less clear-cut. Although the peak risk value of θ̂3 is bigger than that of θ̂1, on average its risk (area under the curve) is smaller. In fact, both views are valid and meaningful, and they correspond to the worst-case (minimax) and average-case (Bayesian) approach, respectively. In the minimax formulation, we summarize the risk function into a scalar quantity, namely the worst-case risk, and seek the estimator that minimizes this objective. In the Bayesian formulation, the objective is the average risk. Below we discuss these two approaches and their connections. For notational simplicity, we consider the task of estimating T(θ) = θ.

Figure 1.1: Risk profiles of three estimators. {fig:risk-compare}

1.3.1 Bayes risk

The Bayesian approach is an average-case formulation in which the statistician acts as if the parameter θ is random with a known distribution. Concretely, let π be a probability distribution (prior) on Θ. Then the average risk (w.r.t. π) of an estimator θ̂ is defined as

Rπ(θ̂) = Eθ∼π[Rθ(θ̂)] = Eθ,X[ℓ(θ, θ̂)]. (1.5) {eq:risk-avg}

Given a prior π, its Bayes risk is the minimal average risk, namely

R*π = inf_{θ̂} Rπ(θ̂).

An estimator θ̂* is called a Bayes estimator if it attains the Bayes risk, namely, R*π = Eθ∼π[Rθ(θ̂*)]. {rmk:bayes-det}

Remark 1.3. The Bayes estimator is always deterministic – this fact holds for any loss function. To see this, note that for any randomized estimator, say θ̂ = θ̂(X, U), where U is some external randomness

independent of X and θ, its risk is lower bounded by

Rπ(θ̂) = Eθ,X,U[ℓ(θ, θ̂(X, U))] = EU[Rπ(θ̂(·, U))] ≥ inf_u Rπ(θ̂(·, u)).

Note that for any u, θ̂(·, u) is a deterministic estimator. This shows that we can find a deterministic estimator whose average risk is no worse than that of the randomized estimator.

An alternative way to understand this fact is the following: the average risk Rπ(θ̂) defined in (1.5) is an affine function of the randomized estimator (understood as a Markov kernel Pθ̂|X), whose minimum is achieved at extremal points. In this case the extremal points of the set of Markov kernels are simply delta measures, which correspond to deterministic estimators.

Example 1.1 (Quadratic loss and MMSE). Consider the problem of estimating θ ∈ R^d drawn from a prior π. Under the quadratic loss ℓ(θ, θ̂) = ‖θ − θ̂‖₂², the Bayes estimator is the conditional mean θ̂(X) = E[θ|X] and the Bayes risk is the minimum mean-square error (MMSE)

R*π = E‖θ − E[θ|X]‖₂² = E[Tr(Cov(θ|X))].

Example 1.2 (Gaussian Location Model). Consider the scalar case, where X = θ + Z and Z ∼ N(0, 1) is independent of θ. Consider a Gaussian prior: θ ∼ π = N(0, σ0²). One can verify that the posterior distribution Pθ|X=x is N(σ0²x/(1 + σ0²), σ0²/(1 + σ0²)). Thus the Bayes estimator is E[θ|X] = σ0²X/(1 + σ0²) and the Bayes risk is

R*π = σ0²/(σ0² + 1). (1.6) {eq:GLM-bayes}

Similarly, for the multivariate GLM X = θ + Z, Z ∼ N(0, Id), if θ ∼ π = N(0, σ0²Id), then we have

R*π = σ0²d/(σ0² + 1). (1.7) {eq:GLM-bayesp}
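The closed form (1.7) is easy to confirm by simulation. The sketch below (not from the notes; the dimension, prior width, and sample budget are illustrative) draws θ from the prior, forms X = θ + Z, and checks that the shrinkage estimator σ0²X/(1 + σ0²) attains the predicted average risk.

```python
import random

# Check the Bayes risk (1.7) for theta ~ N(0, sigma0^2 I_d), X = theta + Z.
random.seed(2)
d, sigma0, n_mc = 5, 2.0, 100_000
shrink = sigma0**2 / (1.0 + sigma0**2)   # Bayes estimator: shrink * X

avg_risk = 0.0
for _ in range(n_mc):
    theta = [random.gauss(0.0, sigma0) for _ in range(d)]
    x = [t + random.gauss(0.0, 1.0) for t in theta]
    avg_risk += sum((shrink * xi - t) ** 2 for xi, t in zip(x, theta))
avg_risk /= n_mc

exact = sigma0**2 * d / (sigma0**2 + 1.0)   # the Bayes risk in (1.7)
print(avg_risk, exact)
```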

1.3.2 Minimax risk

A common criticism of the Bayes approach concerns which prior to pick. A related framework not discussed here is the empirical Bayes approach [Rob56], where one “estimates” the prior from the data instead of choosing a prior a priori. Instead, we take a frequentist viewpoint by considering the worst-case situation. The minimax risk is defined as

R* = inf_{θ̂} sup_{θ∈Θ} Rθ(θ̂). (1.8) {eq:minimax}

If there exists θ̂ s.t. sup_{θ∈Θ} Rθ(θ̂) = R*, then the estimator θ̂ is minimax (minimax optimal). Finding the value of the minimax risk R* entails proving two things, namely,

• a minimax upper bound, by exhibiting an estimator θ̂* such that Rθ(θ̂*) ≤ R* + ε for all θ ∈ Θ;

• a minimax lower bound, by proving that for any estimator θ̂, there exists some θ ∈ Θ such that Rθ(θ̂) ≥ R* − ε,

where ε > 0 is arbitrary. This task is frequently difficult, especially in high dimensions. Instead of the exact minimax risk, it is often useful to find a constant-factor approximation Ψ, which we call the minimax rate, such that

R* ≍ Ψ, (1.9) {eq:minimax-rate}

that is, cΨ ≤ R* ≤ CΨ for some universal constants C ≥ c > 0. Establishing that Ψ is the minimax rate still entails proving minimax upper and lower bounds, albeit within multiplicative constant factors.

In practice, minimax lower bounds are rarely established according to the original definition. The next result shows that the Bayes risk is always at most the minimax risk. Throughout this book, all lower bound techniques essentially boil down to evaluating the Bayes risk with a sagaciously chosen prior. {thm:minimax-bayes}

Theorem 1.1. Let ∆(Θ) denote the collection of probability distributions on Θ. Then

R* ≥ R*Bayes ≜ sup_{π∈∆(Θ)} R*π. (1.10) {eq:minimax-bayes}

Proof. Two (equivalent) ways to prove this fact:

1. “max ≥ mean”: For any θ̂, Rπ(θ̂) = Eθ∼π[Rθ(θ̂)] ≤ sup_{θ∈Θ} Rθ(θ̂). Taking the infimum over θ̂ completes the proof;

2. “min max ≥ max min”:

R* = inf_{θ̂} sup_{θ∈Θ} Rθ(θ̂) = inf_{θ̂} sup_{π∈∆(Θ)} Rπ(θ̂) ≥ sup_{π∈∆(Θ)} inf_{θ̂} Rπ(θ̂) = sup_π R*π,

where the inequality follows from the generic fact that minx maxy f(x, y) ≥ maxy minx f(x, y).

Remark 1.4. Unlike Bayes estimators which, as shown in Remark 1.3, are always deterministic, to minimize the worst-case risk it is sometimes necessary to randomize. Consider a trivial experiment where θ ∈ {0, 1} and X is absent, so that we are forced to guess the value of θ under the zero-one loss ℓ(θ, θ̂) = 1{θ ≠ θ̂}. It is clear that in this case the minimax risk is 1/2, achieved by the random guess θ̂ ∼ Bern(1/2) but not by any deterministic θ̂.
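The guessing experiment in Remark 1.4 is small enough to enumerate. The following sketch (an illustration, not part of the notes) parametrizes randomized guesses θ̂ ∼ Bern(q) and compares their worst-case risks with those of the two deterministic guesses.

```python
# Remark 1.4: theta in {0,1}, no data, zero-one loss.
# A randomized guess theta_hat ~ Bern(q) has risk q at theta=0 and 1-q at theta=1.
qs = [i / 100 for i in range(101)]           # grid of mixing weights q
worst = {q: max(q, 1 - q) for q in qs}       # worst-case risk of Bern(q)

minimax = min(worst.values())                # attained at q = 1/2
det_worst = min(worst[0.0], worst[1.0])      # best deterministic guess

print(minimax, det_worst)
```

The randomized guess achieves worst-case risk 1/2, while either deterministic guess is wrong with certainty against an adversarial θ, so its worst-case risk is 1.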

As an application of Theorem 1.1, let us determine the minimax risk of the Gaussian location model under the quadratic loss function. {ex:GLM-minimax-quadratic}

Example 1.3 (Minimax quadratic risk of GLM). Consider the Gaussian location model without structural assumptions, where X ∼ N(θ, σ²Id) with θ ∈ R^d. We show that

R* ≡ inf_{θ̂} sup_{θ∈R^d} Eθ[‖θ̂(X) − θ‖₂²] = dσ². (1.11) {eq:GLM-minimax-Rd}

By scaling, it suffices to consider σ = 1. For the upper bound, we consider θ̂ML = X, which achieves Rθ(θ̂ML) = d for all θ. To get a minimax lower bound, we consider the prior θ ∼ N(0, σ0²Id). Using the Bayes risk previously computed in (1.7), we have R* ≥ R*π = σ0²d/(σ0² + 1). Sending σ0 → ∞ yields R* ≥ d.

{rmk:minimaxest-nonunique} Remark 1.5 (Non-uniqueness of minimax estimators). In general, estimators that achieve the minimax risk need not be unique. For instance, as shown in Example 1.3, the MLE θ̂ML = X is minimax for the unconstrained GLM in any dimension. On the other hand, it is known that whenever d ≥ 3, the risk of the James-Stein estimator (1.4) is smaller than that of the MLE everywhere (see Fig. 1.2) and thus it is also minimax. In fact, there exists a continuum of estimators that are minimax for (1.11) [LC98, Theorem 5.5].


Figure 1.2: Risk of the James-Stein estimator (1.4) in dimension d = 3 and σ = 1 as a function of kθk. {fig:JS}

For most statistical models, Theorem 1.1 in fact holds with equality; such a result is known as a minimax theorem. Before discussing this important topic, here is an example where the minimax risk is strictly bigger than the worst-case Bayes risk. {ex:minimaxfail}

Example 1.4. Let θ, θ̂ ∈ N ≜ {1, 2, ...} and ℓ(θ, θ̂) = 1{θ̂ < θ}, i.e., the statistician loses one dollar if Nature’s choice exceeds the statistician’s guess and loses nothing otherwise. Consider the extreme case of blind guessing (i.e., no data is available, say, X = 0). Then for any θ̂, possibly randomized, we have Rθ(θ̂) = P(θ̂ < θ). Thus R* ≥ lim_{θ→∞} P(θ̂ < θ) = 1, which is clearly achievable. On the other hand, for any prior π on N, Rπ(θ̂) = P(θ̂ < θ), which vanishes as the deterministic guess θ̂ → ∞. Therefore R*π = 0, and in this case R* = 1 > R*Bayes = 0.

Exercise 1.1. Show that the minimax quadratic risk of the GLM X ∼ N(θ, 1) with parameter space θ ≥ 0 is the same as in the unconstrained case. (This might be a bit surprising because the thresholded estimator X₊ = max(X, 0) achieves a better risk pointwise at every θ ≥ 0; nevertheless, just like the James-Stein estimator (cf. Fig. 1.2), in the worst case the gain is asymptotically diminishing.)
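The gap in Example 1.4 can be made concrete by a short computation (a sketch with an illustrative prior, not part of the notes): under any fixed prior, say Geometric(p), a sufficiently large constant guess drives the average risk toward zero, while every fixed guess has worst-case risk 1.

```python
# Example 1.4: blind guessing with loss 1{theta_hat < theta}.
p = 0.3
# Truncated Geometric(p) prior on {1, 2, ...}; the tail beyond 199 is negligible.
prior = [(1 - p) ** (k - 1) * p for k in range(1, 200)]

def avg_risk(guess):
    # R_pi(guess) = P(theta > guess) under the prior
    return sum(w for k, w in enumerate(prior, start=1) if k > guess)

bayes_risks = [avg_risk(g) for g in (1, 5, 20, 50)]
print(bayes_risks)   # decreasing toward 0, so R*_pi = 0 in the limit

# Worst-case risk of any fixed guess g: Nature picks theta = g + 1, giving loss 1.
worst_case = 1.0
```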

1.3.3 Minimax and Bayes risk: a duality perspective {sec:minimax-duality} Recall from Theorem 1.1 the inequality

R* ≥ R*Bayes.

This result can be interpreted from an optimization perspective. More precisely, R* is the value of a convex optimization problem (the primal) and R*Bayes is precisely the value of its dual program. Thus the inequality (1.10) is simply weak duality. If strong duality holds, then (1.10) is in fact an equality, in which case the minimax theorem holds. For simplicity, we consider the case where Θ is a finite set. Then

R* = min_{Pθ̂|X} max_{θ∈Θ} Eθ[ℓ(θ, θ̂)]. (1.12) {eq:minimax-dual1}

This is a convex optimization problem. Indeed, Pθ̂|X ↦ Eθ[ℓ(θ, θ̂)] is affine and the pointwise supremum of affine functions is convex. To write down its dual problem, first let us rewrite (1.12) in an augmented form

R* = min_{Pθ̂|X, t} t (1.13) {eq:minimax-dual2}

s.t. Eθ[ℓ(θ, θ̂)] ≤ t, ∀θ ∈ Θ.

Let πθ ≥ 0 denote the Lagrange multiplier (dual variable) for each inequality constraint. The Lagrangian of (1.13) is

L(Pθ̂|X, t, π) = t + Σ_{θ∈Θ} πθ (Eθ[ℓ(θ, θ̂)] − t) = (1 − Σ_{θ∈Θ} πθ) t + Σ_{θ∈Θ} πθ Eθ[ℓ(θ, θ̂)].

By definition, we have R* ≥ min_{Pθ̂|X, t} L(Pθ̂|X, t, π). Note that unless Σ_{θ∈Θ} πθ = 1, min_{t∈R} L(Pθ̂|X, t, π) is −∞. Thus π = (πθ : θ ∈ Θ) must be a probability measure and the dual problem is

max_π min_{Pθ̂|X, t} L(Pθ̂|X, t, π) = max_{π∈∆(Θ)} min_{Pθ̂|X} Rπ(θ̂) = max_{π∈∆(Θ)} R*π.

Hence, R* ≥ R*Bayes. In summary, the minimax risk and the worst-case Bayes risk are related by convex duality, where the primal variables are (randomized) estimators and the dual variables are priors. This view can in fact be operationalized. Later we will revisit this observation in Section ?? when we dualize certain minimax lower bounds for the purpose of constructing estimators.

1.3.4 Minimax theorem {sec:minimax-thm}

Next we state the minimax theorem which gives conditions ensuring that (1.10) holds with equality, namely, the minimax risk R* and the worst-case Bayes risk R*Bayes coincide. For simplicity, let us consider the case of estimating θ itself, where the estimator θ̂ takes values in the action space Θ̂ with a loss function ℓ : Θ × Θ̂ → R. A very general result (cf. [Str85, Theorem 46.6]) asserts that R* = R*Bayes, provided that the following conditions hold:

• The experiment is dominated, i.e., Pθ ≪ ν holds for all θ ∈ Θ for some measure ν on X.

• The action space Θ̂ is a locally compact topological space with a countable base (e.g. a Euclidean space).

• The loss function is level-compact, i.e., for each θ ∈ Θ, ℓ(θ, ·) is bounded from below and the sublevel set {θ̂ : ℓ(θ, θ̂) ≤ a} is compact for each a.

This result shows that for virtually all problems encountered in practice, the minimax risk coincides with the least favorable Bayes risk. At the heart of any minimax theorem lies an application of the separating hyperplane theorem. Below we give a proof of a special case illustrating this type of argument. {thm:minimxthm}

Theorem 1.2 (Minimax theorem). R* = R*Bayes in either of the following cases:

• Θ is a finite set and the data X takes values in a finite set X.

• Θ is a finite set and the loss function ℓ is bounded from below, i.e., inf_{θ,θ̂} ℓ(θ, θ̂) > −∞.

Proof. The first case directly follows from the duality interpretation in Section 1.3.3 and the fact that strong duality holds for finite-dimensional linear programming (see for example [Sch98, Sec. 7.4]).

For the second case, we start by showing that if R* = ∞, then R*Bayes = ∞. To see this, consider the uniform prior π on Θ. For any estimator θ̂, there exists θ ∈ Θ such that Rθ(θ̂) = ∞. Then Rπ(θ̂) ≥ (1/|Θ|) Rθ(θ̂) = ∞.

Next we assume that R* < ∞. Then R* ∈ R, since ℓ is bounded from below (say, by a) by assumption. Given an estimator θ̂, denote its risk vector R(θ̂) = (Rθ(θ̂))θ∈Θ. Then its average risk with respect to a prior π is given by the inner product ⟨R(θ̂), π⟩ = Σ_{θ∈Θ} πθ Rθ(θ̂). Define

S = {R(θ̂) ∈ R^Θ : θ̂ is a randomized estimator} = set of all possible risk vectors,

T = {t ∈ R^Θ : tθ < R* for all θ ∈ Θ}.

Note that both S and T are convex (why?) subsets of the Euclidean space R^Θ and S ∩ T = ∅ by the definition of R*. By the separating hyperplane theorem, there exists a non-zero π ∈ R^Θ and c ∈ R such that inf_{s∈S} ⟨π, s⟩ ≥ c ≥ sup_{t∈T} ⟨π, t⟩. Obviously, π must be componentwise nonnegative, for otherwise sup_{t∈T} ⟨π, t⟩ = ∞. Therefore by normalization we may assume that π is a probability vector, i.e., a prior on Θ. Then R*Bayes ≥ R*π = inf_{s∈S} ⟨π, s⟩ ≥ sup_{t∈T} ⟨π, t⟩ = R*, completing the proof.
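The first case of Theorem 1.2 can be verified on a toy experiment by brute force. The sketch below (an illustration, not from the notes; the specific channel Bern(0.2)/Bern(0.8) is an arbitrary choice) grid-searches over randomized estimators and over priors and confirms that the min-max and max-min values agree.

```python
# Tiny finite experiment: theta in {0,1}, X ~ Bern(0.2) under theta=0 and
# Bern(0.8) under theta=1, zero-one loss. A randomized estimator is a pair
# (q0, q1) with q_x = P(theta_hat = 1 | X = x); a prior is pi = P(theta = 1).
grid = [i / 100 for i in range(101)]

def risks(q0, q1):
    r0 = 0.8 * q0 + 0.2 * q1               # risk at theta = 0: P_0(theta_hat = 1)
    r1 = 0.2 * (1 - q0) + 0.8 * (1 - q1)   # risk at theta = 1: P_1(theta_hat = 0)
    return r0, r1

# Minimax risk: minimize the worst-case risk over randomized estimators.
minimax = min(max(risks(q0, q1)) for q0 in grid for q1 in grid)

# Worst-case Bayes risk: best average risk for each prior, then maximize over priors.
bayes = max(
    min((1 - pi) * r0 + pi * r1
        for q0 in grid for q1 in grid for r0, r1 in [risks(q0, q1)])
    for pi in grid
)

print(minimax, bayes)   # both values coincide, as Theorem 1.2 predicts
```

Here the common value is 0.2, attained by the natural rule θ̂ = X and the uniform prior; this is exactly the strong LP duality invoked in the first case of the proof.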

1.4 Multiple observations and sample complexity {sec:multiple-obs}

Given a statistical experiment {Pθ : θ ∈ Θ}, consider the experiment

Pn = {Pθ^⊗n : θ ∈ Θ}, n ≥ 1. (1.14) {eq:multipleobs}

We refer to this as the independent sampling model, in which we observe a sample X = (X1, . . . , Xn) drawn independently from Pθ for some θ ∈ Θ ⊂ R^d. Given a loss function ℓ : R^d × R^d → R₊, the minimax risk is denoted by

R*n(Θ) = inf_{θ̂} sup_{θ∈Θ} Eθ[ℓ(θ, θ̂)]. (1.15) {eq:minimax-Rn}

Clearly, n ↦ R*n(Θ) is non-increasing, since we can always discard observations. Typically, when Θ is a fixed subset of R^d, R*n(Θ) vanishes as n → ∞. Thus a natural question is at what rate R*n converges to zero. Equivalently, one can consider the sample complexity, namely, the minimum sample size needed to attain a prescribed error ε even in the worst case:

n*(ε) ≜ min {n ∈ N : R*n(Θ) ≤ ε}. (1.16) {eq:samplecomplexity-def}

In the classical large-sample asymptotic regime (Lecture ??), the rate of convergence for the quadratic risk is usually Θ(1/n), which is commonly referred to as the “parametric rate”. In comparison, our focus in this course is understanding the dependency on the dimension and other structural parameters nonasymptotically.

As a concrete example, let us revisit the GLM, where we observe X = (X1, . . . , Xn) drawn i.i.d. from N(θ, σ²Id) with θ ∈ R^d. In this case, the minimax quadratic risk is

R*n = dσ²/n. (1.17) {eq:rstarpn}

To see this, note that the sample mean X̄ = (X1 + . . . + Xn)/n is a sufficient statistic (cf. Section ??) of X for θ. Therefore the model reduces to X̄ ∼ N(θ, (σ²/n) Id) and (1.17) follows from the minimax risk in (1.11) for a single observation. From (1.17), we conclude that the sample complexity is n*(ε) = ⌈dσ²/ε⌉, which grows linearly with the dimension d. This is the common wisdom that “sample complexity scales proportionally to the number of parameters”, also known as “counting the degrees of freedom”. Indeed, in high dimensions we typically expect the sample complexity to grow with the ambient dimension; however, the exact dependency need not be linear, as it depends on the loss function and the objective of estimation. For example, consider the matrix case θ ∈ R^{d×d} with n independent observations in Gaussian noise. Let ε be a small constant. As we will show later,

• for the quadratic loss ‖θ̂ − θ‖²F, we have R*n ≍ d²/n and hence n*(ε) = Θ(d²);

• if the loss function is ‖θ̂ − θ‖²op, then we have R*n ≍ d/n and hence n*(ε) = Θ(d);

• if we only want to estimate the scalar functional ‖θ‖ℓ∞, then n*(ε) = Θ(√(log d)).

Exercise 1.2 (Nonparametric location model). Consider the class P of distributions (which need not have a density) on the real line with variance at most σ². Given the observations X = (X1, . . . , Xn) drawn i.i.d. from some P ∈ P, the objective is to estimate θ(P), the mean of the distribution P. Show that the minimax quadratic risk is given by σ²/n.
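The sufficiency reduction behind (1.17) can be checked by simulation. In the sketch below (an illustration, not from the notes; the dimension, noise level, and sample size are arbitrary), the sample mean is drawn directly from its reduced distribution N(θ, (σ²/n) Id), and its quadratic risk matches dσ²/n.

```python
import random

# Check (1.17): with n i.i.d. observations from N(theta, sigma^2 I_d),
# the sample mean attains quadratic risk d * sigma^2 / n.
random.seed(3)
d, sigma, n, n_mc = 4, 1.5, 10, 20_000
theta = [1.0] * d                      # a fixed, illustrative ground truth

mse = 0.0
for _ in range(n_mc):
    # By sufficiency, simulate Xbar ~ N(theta, (sigma^2/n) I_d) directly.
    xbar = [t + random.gauss(0.0, sigma / n**0.5) for t in theta]
    mse += sum((xb - t) ** 2 for xb, t in zip(xbar, theta))
mse /= n_mc

exact = d * sigma**2 / n   # the minimax risk in (1.17)
print(mse, exact)
```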

1.5 Tensor product of experiments {sec:tensorization-model}

Tensor product is a way to define a high-dimensional model from low-dimensional models. Given statistical experiments Pi = {Pθi : θi ∈ Θi} and the corresponding loss functions ℓi, for i ∈ [d], their tensor product refers to the following statistical experiment:

P = { Pθ = ∏_{i=1}^d Pθi : θ = (θ1, . . . , θd) ∈ Θ ≜ ∏_{i=1}^d Θi },

ℓ(θ, θ̂) ≜ Σ_{i=1}^d ℓi(θi, θ̂i), ∀θ, θ̂ ∈ Θ.

In this model, the observation X = (X1, . . . , Xd) consists of independent (but not identically distributed) coordinates Xi ∼ Pθi. This should be contrasted with the multiple-observation model in (1.14), in which n i.i.d. observations drawn from the same distribution are given.

The minimax risk of the tensorized experiment is related to the minimax risks R*(Pi) and the worst-case Bayes risks R*Bayes(Pi) ≜ sup_{πi∈∆(Θi)} R*πi(Pi) of the individual experiments as follows: {thm:minimax-tensor}

Theorem 1.3 (Minimax risk of tensor product).

Σ_{i=1}^d R*Bayes(Pi) ≤ R*(P) ≤ Σ_{i=1}^d R*(Pi). (1.18) {eq:minimax-tensor}

Consequently, if the minimax theorem holds for each experiment, i.e., R*(Pi) = R*Bayes(Pi), then it also holds for the product experiment and, in particular,

R*(P) = Σ_{i=1}^d R*(Pi). (1.19) {eq:minimax-tensoreq}

Proof. The right inequality of (1.18) simply follows by estimating each θi separately on the basis of Xi, namely, θ̂ = (θ̂1, . . . , θ̂d), where θ̂i depends only on Xi. For the left inequality, consider a product prior π = ∏_{i=1}^d πi, under which the θi’s are independent and so are the Xi’s. Consider any randomized estimator θ̂i = θ̂i(X, Ui) of θi based on X, where Ui is some auxiliary randomness independent of X. We can rewrite it as θ̂i = θ̂i(Xi, Ũi), where Ũi = (X\i, Ui) ⊥⊥ Xi. Thus θ̂i can be viewed as a randomized estimator based on Xi alone, and its average risk must satisfy Rπi(θ̂i) = E[ℓi(θi, θ̂i)] ≥ R*πi. Summing over i and taking the suprema over the priors πi yields the left inequality of (1.18).

As an example, we note that the unstructured d-dimensional GLM {N(θ, σ²Id) : θ ∈ R^d} with quadratic loss is simply the d-fold tensor product of the one-dimensional GLM. Since the minimax theorem holds for the GLM (cf. Section 1.3.4), Theorem 1.3 shows that the minimax risks sum up to σ²d, which agrees with Example 1.3. In general, however, it is possible that the minimax risk of the tensorized experiment is less than the sum of the individual minimax risks, and the right inequality of (1.18) can be strict. This might appear surprising, since Xi only carries information about θi and it makes sense intuitively to estimate θi based solely on Xi. Nevertheless, the following is a counterexample:

Remark 1.6. Consider X = θZ, where θ ∈ N and Z ∼ Bern(1/2). The estimator θ̂ takes values in N as well and the loss function is ℓ(θ, θ̂) = 1{θ̂ < θ}, i.e., whoever guesses the greater number wins. The minimax risk for this experiment is equal to P[Z = 0] = 1/2. To see this, note that if Z = 0, then all information about θ is erased. Therefore for any (randomized) estimator Pθ̂|X, the risk is lower bounded by Rθ(θ̂) = P[θ̂ < θ] ≥ P[θ̂ < θ, Z = 0] = (1/2) P[θ̂ < θ | X = 0]. Therefore sending θ → ∞ yields sup_θ Rθ(θ̂) ≥ 1/2. This is achievable by θ̂ = X. Clearly, this is a case where the minimax theorem does not hold, very similar to the previous Example 1.4.

Next consider the tensor product of two copies of this experiment with the loss function ℓ(θ, θ̂) = 1{θ̂1 < θ1} + 1{θ̂2 < θ2}. We show that the minimax risk is strictly less than one. For i = 1, 2, let

Xi = θiZi, where Z1, Z2 are i.i.d. Bern(1/2). Consider the following estimator:

θ̂1 = θ̂2 = X1 ∨ X2 if X1 > 0 or X2 > 0, and θ̂1 = θ̂2 = 1 otherwise.

Then for any θ1, θ2 ∈ N, averaging over Z1, Z2, we get

E[ℓ(θ, θ̂)] ≤ (1/4) (1{θ1<θ2} + 1{θ2<θ1} + 2) ≤ 3/4.
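The averaging over Z1, Z2 in Remark 1.6 can be done exhaustively. The following sketch (an illustration, not part of the notes; the search range over (θ1, θ2) is an arbitrary truncation) enumerates all four noise patterns for the joint estimator above and confirms that its average loss never exceeds 3/4, strictly below the sum 1/2 + 1/2 of the individual minimax risks.

```python
# Remark 1.6: tensor product of two guessing experiments, loss
# 1{theta_hat_1 < theta_1} + 1{theta_hat_2 < theta_2}.
def avg_loss(theta1, theta2):
    total = 0
    for z1 in (0, 1):
        for z2 in (0, 1):
            x1, x2 = theta1 * z1, theta2 * z2
            # joint estimator: X1 v X2 if either observation survives, else 1
            guess = max(x1, x2) if (x1 > 0 or x2 > 0) else 1
            total += (guess < theta1) + (guess < theta2)
    return total / 4.0   # (Z1, Z2) is uniform on {0,1}^2

worst = max(avg_loss(t1, t2) for t1 in range(1, 30) for t2 in range(1, 30))
print(worst)
```

The worst case over this range is 3/4, attained e.g. when one parameter survives and reveals a value smaller than the other.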

Bibliography

[Fer67] T. S. Ferguson. Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York, NY, 1967.

[IS03] Y. I. Ingster and I. A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models. Springer, New York, NY, 2003.

[LC86] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer-Verlag, New York, NY, 1986.

[LC98] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New York, NY, 2nd edition, 1998.

[Rin76] Yosef Rinott. On convexity of measures. Annals of Probability, 4(6):1020–1026, 1976.

[Rob56] Herbert Robbins. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1956.

[Sch98] Alexander Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.

[Str85] Helmut Strasser. Mathematical theory of statistics: Statistical experiments and asymptotic decision theory. Walter de Gruyter, Berlin, Germany, 1985.
