
S&DS 677: Topics in High-dim and Info Theory Spring 2021

Lecture 1: Basics of Statistical Decision Theory    Lecturer: Yihong Wu

{lec:stats-intro}

1.1 Basics of Statistical Decision Theory

We start by presenting the basic elements of statistical decision theory. We refer to the classics [Fer67, LC86, Str85] for a systematic treatment. A statistical model or experiment refers to a collection P of probability distributions over a common measurable space (X, F). Specifically, let us consider

P = {Pθ : θ ∈ Θ}, (1.1) {eq:parametric-model}

where each distribution is indexed by a parameter θ taking values in the parameter space Θ. In the decision-theoretic framework, we play the following game: Nature picks some parameter θ ∈ Θ and generates data X ∼ Pθ. A statistician observes the data X and wants to infer the parameter θ or certain attributes thereof. Specifically, consider some functional T : Θ → Y; the goal is to estimate T(θ) on the basis of the observation X. Here the estimand T(θ) may be the parameter θ itself, or some function thereof (e.g. T(θ) = 1{θ>0} or ‖θ‖). An estimator (decision rule) is a function T̂ : X → Ŷ. Note that the action space Ŷ need not be the same as Y (e.g. T̂ may be a confidence interval). Here T̂ can be either deterministic, i.e. T̂ = T̂(X), or randomized, i.e., T̂ obtained by passing X through a conditional probability

distribution (Markov transition kernel) P_T̂|X, or a channel in the language of Part ??. For all practical purposes, we can write T̂ = T̂(X, U), where U denotes external randomness uniform on [0, 1] and independent of X.

To measure the quality of an estimator T̂, we introduce a loss function ℓ : Y × Ŷ → R such that ℓ(T, T̂) is the loss of T̂ for estimating T. Since we are dealing with loss (as opposed to reward), all the negative (converse) results are lower bounds and all the positive (achievable) results are upper bounds. Note that X is a random variable, and so are T̂ and ℓ(T, T̂). Therefore, to make sense of “minimizing the loss”, we consider the average loss, known as the risk:

Rθ(T̂) = Eθ[ℓ(T, T̂)] = ∫ Pθ(dx) P_T̂|X(dt̂|x) ℓ(T(θ), t̂), (1.2) {eq:risk-def}

which we refer to as the risk of T̂ at θ. The subscript in Eθ indicates the distribution with respect to which the expectation is taken. Note that the risk depends on the estimator as well as the ground truth.

Remark 1.1. We note that the problems of hypothesis testing and inference can be encompassed as special cases of the estimation paradigm. There are three formulations for testing:

• Simple vs. simple hypotheses

H0 : θ = θ0 vs. H1 : θ = θ1, θ0 ≠ θ1

• Simple vs. composite hypotheses

H0 : θ = θ0 vs. H1 : θ ∈ Θ1, θ0 ∉ Θ1

• Composite vs. composite hypotheses

H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1, Θ0 ∩ Θ1 = ∅.

In each case one can introduce the appropriate parameter space and loss function. For example, in the last (most general) case, we may take

Θ = Θ0 ∪ Θ1, T(θ) = 0 if θ ∈ Θ0 and T(θ) = 1 if θ ∈ Θ1, T̂ ∈ {0, 1},

and use the zero-one loss ℓ(T, T̂) = 1{T ≠ T̂}, so that the risk Rθ(T̂) = Pθ{θ ∉ Θ_T̂} is the probability of error.

For the problem of inference, the goal is to output a confidence interval (or region) which covers the true parameter with high probability. In this case we can let T̂ be a subset of Θ and we may choose the loss function ℓ(θ, T̂) = 1{θ ∉ T̂} + width(T̂), in order to balance the coverage and the size of the confidence region.

Remark 1.2 (Randomized versus deterministic estimators). Although most of the estimators used in practice are deterministic, there are a number of reasons to consider randomized estimators:

• For certain formulations, such as minimizing the worst-case risk (minimax approach), deterministic estimators are suboptimal and it is necessary to randomize. On the other hand, if the objective is to minimize the average risk (Bayes approach), then it loses no generality to restrict to deterministic estimators.

• The space of randomized estimators (viewed as Markov kernels) is convex, and it is the convex hull of the deterministic estimators. This convexification is needed, for example, for the treatment of minimax theorems.

See Section 1.3 for a detailed discussion and examples.

A well-known fact is that for a convex loss function (i.e., T̂ ↦ ℓ(T, T̂) is convex), randomization does not help. Indeed, for any randomized estimator T̂, we can derandomize it by considering its conditional expectation E[T̂|X], which is a deterministic estimator whose risk is dominated by that of the original T̂ at every θ, namely, Rθ(T̂) = Eθ[ℓ(T, T̂)] ≥ Eθ[ℓ(T, E[T̂|X])], by Jensen’s inequality.
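This derandomization step can be checked numerically. The following is a minimal Monte Carlo sketch (not part of the notes; the randomized estimator X + U and all constants are illustrative assumptions): in the scalar GLM under quadratic loss, adding external noise U to the estimator inflates the risk, while its conditional expectation E[θ̂|X] = X recovers the smaller risk, as Jensen's inequality guarantees.

```python
import random

# Scalar GLM X = theta + Z, Z ~ N(0,1), quadratic loss.
# Hypothetical randomized estimator: theta_hat = X + U with external noise U.
# Its derandomization E[theta_hat | X] = X has pointwise smaller risk.
random.seed(0)
theta = 1.7          # any fixed ground truth works
n_mc = 200_000       # Monte Carlo repetitions

risk_rand, risk_det = 0.0, 0.0
for _ in range(n_mc):
    x = theta + random.gauss(0.0, 1.0)
    u = random.uniform(-1.0, 1.0)          # external randomness, E[u] = 0
    risk_rand += (x + u - theta) ** 2      # randomized: X + U
    risk_det += (x - theta) ** 2           # derandomized: E[theta_hat | X] = X
risk_rand /= n_mc
risk_det /= n_mc

# Exactly: E[(X+U-theta)^2] = 1 + Var(U) = 4/3, while E[(X-theta)^2] = 1.
print(risk_rand, risk_det)
```

The gap between the two estimates is precisely Var(U), the price of the external randomness under a convex loss.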

1.2 Gaussian Location Model (GLM) {sec:GLM}

Note that, without loss of generality, all statistical models can be expressed in the parametric form of (1.1) (since we can take θ to be the distribution itself). In the statistical literature, it is customary to refer to a model as parametric if θ takes values in a finite-dimensional Euclidean space (so that each distribution is specified by finitely many parameters), and nonparametric if θ takes values in some infinite-dimensional space (e.g. the sequence model). Perhaps the most basic parametric model is the Gaussian Location Model (GLM), also known as the Normal Mean Model (or the Gaussian channel in information-theoretic terms, cf. Example ??).

• The model is given by

P = {N(θ, σ²Id) : θ ∈ Θ},

where Id is the d-dimensional identity matrix and Θ ⊂ R^d. Equivalently, we can express the data as a noisy observation of the unknown parameter θ as

X = θ + Z, Z ∼ N(0, σ²Id).

The cases d = 1 and d > 1 are referred to as the scalar and vector case, respectively. We will also consider the matrix case, for which we can vectorize a d × d matrix θ as a d²-dimensional vector.

• The estimand can be T(θ) = θ, in which case the goal is to denoise the vector θ, or certain functionals such as T(θ) = ‖θ‖₂, θmax = max{θ1, . . . , θd}, or eigenvalues in the matrix case.

• Loss function: For estimating θ itself, it is customary to use a loss function defined by certain norms, such as ℓ(θ, θ̂) = ‖θ − θ̂‖p^α for some 1 ≤ p ≤ ∞ and α > 0 (with p = α = 2 corresponding to the quadratic loss), where ‖θ‖p ≜ (Σi |θi|^p)^{1/p}.

• Parameter space: for example

– Θ = R^d, in which case there is no assumption on θ (unstructured problem).

– Θ = ℓp-norm balls.

– Θ = {all k-sparse vectors} = {θ ∈ R^d : ‖θ‖₀ ≤ k}, where ‖θ‖₀ ≜ |{i : θi ≠ 0}| denotes the size of the support, known as the ℓ0-“norm”.

– In the matrix case: Θ = {θ ∈ R^{d×d} : rank(θ) ≤ r}, the set of low-rank matrices.

Note that by definition, more structure (a smaller parameter space) always makes the estimation task easier (smaller risk), but not necessarily so in terms of computation.

• Estimator: Some well-known estimators include the Maximum Likelihood Estimator (MLE)

θ̂ML = X (1.3) {eq:MLE}

and the James-Stein estimator based on shrinkage

θ̂JS = (1 − (d − 2)σ²/‖X‖₂²) X (1.4) {eq:JS}

The choice of the estimator depends on both the objective and the parameter space. For instance, if θ is known to be sparse, it makes sense to set the smaller entries of the observed X to zero in order to better denoise θ.

• The hypothesis testing problem in the GLM is well-studied. For example, one can consider detecting the presence of a signal by testing H0 : θ = 0 against H1 : ‖θ‖ ≥ ε, or testing the weak signal H0 : ‖θ‖ ≤ ε0 versus the strong signal H1 : ‖θ‖ ≥ ε1, with or without further sparsity assumptions on θ. There is a rich body of literature devoted to these problems (cf. the monograph [IS03]), which we will revisit in Lecture ??.
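The contrast between the MLE (1.3) and the James-Stein estimator (1.4) can be seen in a quick simulation. This is a sketch, not part of the notes; the dimension, the ground truth θ, and the Monte Carlo budget are arbitrary choices for illustration.

```python
import random

# Monte Carlo comparison of the MLE and James-Stein in X ~ N(theta, I_d), d >= 3.
random.seed(1)
d, sigma, n_mc = 10, 1.0, 20_000
theta = [0.5] * d                      # a fixed, illustrative ground truth

mse_mle, mse_js = 0.0, 0.0
for _ in range(n_mc):
    x = [t + random.gauss(0.0, sigma) for t in theta]
    norm2 = sum(xi * xi for xi in x)
    shrink = 1.0 - (d - 2) * sigma**2 / norm2   # James-Stein shrinkage factor
    mse_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    mse_js += sum((shrink * xi - t) ** 2 for xi, t in zip(x, theta))
mse_mle /= n_mc
mse_js /= n_mc

# The MLE risk is exactly d*sigma^2 = 10 at every theta; James-Stein is smaller.
print(mse_mle, mse_js)
```

Near the origin the shrinkage gain is large; as ‖θ‖ grows the two risks merge, as Fig. 1.2 below shows.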

1.3 Bayes risk, minimax risk, and the minimax theorem {sec:minimax-bayes}

One of our main objectives in this part of the book is to understand the fundamental limit of statistical estimation, that is, to determine the performance of the best estimator. As in (1.2), the risk Rθ(θ̂) of an estimator θ̂ for T(θ) depends on the ground truth θ. To compare the risk profiles of different estimators meaningfully requires some thought. As a toy example, Fig. 1.1 depicts the risk functions of three estimators. It is clear that θ̂1 is superior to θ̂2 in the sense that the risk of the former is pointwise lower than that of the latter. (In the statistical literature we say θ̂2 is inadmissible.) However, the comparison of θ̂1 and θ̂3 is less clear-cut. Although the peak risk value of θ̂3 is bigger than that of θ̂1, on average its risk (area under the curve) is smaller. In fact, both views are valid and meaningful, and they correspond to the worst-case (minimax) and average-case (Bayesian) approach, respectively. In the minimax formulation, we summarize the risk function into a scalar quantity, namely the worst-case risk, and seek the estimator that minimizes this objective. In the Bayesian formulation, the objective is the average risk. Below we discuss these two approaches and their connections. For notational simplicity, we consider the task of estimating T(θ) = θ.

Figure 1.1: Risk profiles of three estimators. {fig:risk-compare}

1.3.1 Bayes risk

The Bayesian approach is an average-case formulation in which the statistician acts as if the parameter θ is random with a known distribution. Concretely, let π be a probability distribution (prior) on Θ. Then the average risk (w.r.t. π) of an estimator θ̂ is defined as

Rπ(θ̂) = Eθ∼π[Rθ(θ̂)] = Eθ,X[ℓ(θ, θ̂)]. (1.5) {eq:risk-avg}

Given a prior π, its Bayes risk is the minimal average risk, namely

R*π = inf_{θ̂} Rπ(θ̂).

An estimator θ̂* is called a Bayes estimator if it attains the Bayes risk, namely, R*π = Eθ∼π[Rθ(θ̂*)]. {rmk:bayes-det}

Remark 1.3. The Bayes estimator is always deterministic – this fact holds for any loss function. To see this, note that for any randomized estimator, say θ̂ = θ̂(X, U), where U is some external randomness

independent of X and θ, its risk is lower bounded by

Rπ(θ̂) = Eθ,X,U[ℓ(θ, θ̂(X, U))] = EU[Rπ(θ̂(·, U))] ≥ inf_u Rπ(θ̂(·, u)).

Note that for any u, θ̂(·, u) is a deterministic estimator. This shows that we can find a deterministic estimator whose average risk is no worse than that of the randomized estimator.

An alternative way to understand this fact is the following: the average risk Rπ(θ̂) defined in (1.5) is an affine function of the randomized estimator (understood as a Markov kernel Pθ̂|X), whose minimum is achieved at extremal points. In this case the extremal points of the set of Markov kernels are simply delta measures, which correspond to deterministic estimators.

Example 1.1 (Quadratic loss and MMSE). Consider the problem of estimating θ ∈ R^d drawn from a prior π. Under the quadratic loss ℓ(θ, θ̂) = ‖θ − θ̂‖₂², the Bayes estimator is the conditional mean θ̂(X) = E[θ|X] and the Bayes risk is the minimum mean-square error (MMSE)

R*π = E‖θ − E[θ|X]‖₂² = E[Tr(Cov(θ|X))].

Example 1.2 (Gaussian Location Model). Consider the scalar case, where X = θ + Z and Z ∼ N(0, 1) is independent of θ. Consider a Gaussian prior: θ ∼ π = N(0, σ0²). One can verify that the posterior distribution Pθ|X=x is N(σ0²x/(1 + σ0²), σ0²/(1 + σ0²)). Thus the Bayes estimator is E[θ|X] = σ0²X/(1 + σ0²) and the Bayes risk is

R*π = σ0²/(σ0² + 1). (1.6) {eq:GLM-bayes}

Similarly, for the multivariate GLM X = θ + Z, Z ∼ N(0, Id), if θ ∼ π = N(0, σ0²Id), then we have

R*π = σ0²d/(σ0² + 1). (1.7) {eq:GLM-bayesp}
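The closed form (1.7) is easy to confirm by simulation. The sketch below (not from the notes; the dimension, prior width, and sample budget are illustrative) draws θ from the prior, forms X = θ + Z, and checks that the shrinkage estimator σ0²X/(1 + σ0²) attains the predicted average risk.

```python
import random

# Check the Bayes risk (1.7) for theta ~ N(0, sigma0^2 I_d), X = theta + Z.
random.seed(2)
d, sigma0, n_mc = 5, 2.0, 100_000
shrink = sigma0**2 / (1.0 + sigma0**2)   # Bayes estimator: shrink * X

avg_risk = 0.0
for _ in range(n_mc):
    theta = [random.gauss(0.0, sigma0) for _ in range(d)]
    x = [t + random.gauss(0.0, 1.0) for t in theta]
    avg_risk += sum((shrink * xi - t) ** 2 for xi, t in zip(x, theta))
avg_risk /= n_mc

exact = sigma0**2 * d / (sigma0**2 + 1.0)   # the Bayes risk in (1.7)
print(avg_risk, exact)
```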

1.3.2 Minimax risk

A common criticism of the Bayes approach concerns which prior to pick. A related framework not discussed here is the empirical Bayes approach [Rob56], where one “estimates” the prior from the data instead of choosing a prior a priori. Instead, we take a frequentist viewpoint by considering the worst-case situation. The minimax risk is defined as

R* = inf_{θ̂} sup_{θ∈Θ} Rθ(θ̂). (1.8) {eq:minimax}

If there exists θ̂ s.t. sup_{θ∈Θ} Rθ(θ̂) = R*, then the estimator θ̂ is minimax (minimax optimal). Finding the value of the minimax risk R* entails proving two things, namely,

• a minimax upper bound, by exhibiting an estimator θ̂* such that Rθ(θ̂*) ≤ R* + ε for all θ ∈ Θ;

• a minimax lower bound, by proving that for any estimator θ̂, there exists some θ ∈ Θ such that Rθ(θ̂) ≥ R* − ε,

where ε > 0 is arbitrary. This task is frequently difficult, especially in high dimensions. Instead of the exact minimax risk, it is often useful to find a constant-factor approximation Ψ, which we call the minimax rate, such that

R* ≍ Ψ, (1.9) {eq:minimax-rate}

that is, cΨ ≤ R* ≤ CΨ for some universal constants C ≥ c > 0. Establishing that Ψ is the minimax rate still entails proving minimax upper and lower bounds, albeit within multiplicative constant factors.

In practice, minimax lower bounds are rarely established according to the original definition. The next result shows that the Bayes risk is always at most the minimax risk. Throughout this book, all lower bound techniques essentially boil down to evaluating the Bayes risk with a sagaciously chosen prior. {thm:minimax-bayes}

Theorem 1.1. Let ∆(Θ) denote the collection of probability distributions on Θ. Then

R* ≥ R*Bayes ≜ sup_{π∈∆(Θ)} R*π. (1.10) {eq:minimax-bayes}

Proof. Two (equivalent) ways to prove this fact:

1. “max ≥ mean”: For any θ̂, Rπ(θ̂) = Eθ∼π[Rθ(θ̂)] ≤ sup_{θ∈Θ} Rθ(θ̂). Taking the infimum over θ̂ completes the proof;

2. “min max ≥ max min”:

R* = inf_{θ̂} sup_{θ∈Θ} Rθ(θ̂) = inf_{θ̂} sup_{π∈∆(Θ)} Rπ(θ̂) ≥ sup_{π∈∆(Θ)} inf_{θ̂} Rπ(θ̂) = sup_π R*π,

where the inequality follows from the generic fact that minx maxy f(x, y) ≥ maxy minx f(x, y).

Remark 1.4. Unlike Bayes estimators which, as shown in Remark 1.3, are always deterministic, to minimize the worst-case risk it is sometimes necessary to randomize. Consider a trivial experiment where θ ∈ {0, 1} and X is absent, so that we are forced to guess the value of θ under the zero-one loss ℓ(θ, θ̂) = 1{θ ≠ θ̂}. It is clear that in this case the minimax risk is 1/2, achieved by the random guess θ̂ ∼ Bern(1/2) but not by any deterministic θ̂.
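The guessing experiment in Remark 1.4 is small enough to enumerate. The following sketch (an illustration, not part of the notes) parametrizes randomized guesses θ̂ ∼ Bern(q) and compares their worst-case risks with those of the two deterministic guesses.

```python
# Remark 1.4: theta in {0,1}, no data, zero-one loss.
# A randomized guess theta_hat ~ Bern(q) has risk q at theta=0 and 1-q at theta=1.
qs = [i / 100 for i in range(101)]           # grid of mixing weights q
worst = {q: max(q, 1 - q) for q in qs}       # worst-case risk of Bern(q)

minimax = min(worst.values())                # attained at q = 1/2
det_worst = min(worst[0.0], worst[1.0])      # best deterministic guess

print(minimax, det_worst)
```

The randomized guess achieves worst-case risk 1/2, while either deterministic guess is wrong with certainty against an adversarial θ, so its worst-case risk is 1.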

As an application of Theorem 1.1, let us determine the minimax risk of the Gaussian location model under the quadratic loss function. {ex:GLM-minimax-quadratic}

Example 1.3 (Minimax quadratic risk of GLM). Consider the Gaussian location model without structural assumptions, where X ∼ N(θ, σ²Id) with θ ∈ R^d. We show that

R* ≡ inf_{θ̂} sup_{θ∈R^d} Eθ[‖θ̂(X) − θ‖₂²] = dσ². (1.11) {eq:GLM-minimax-Rd}

By scaling, it suffices to consider σ = 1. For the upper bound, we consider θ̂ML = X, which achieves Rθ(θ̂ML) = d for all θ. To get a minimax lower bound, we consider the prior θ ∼ N(0, σ0²Id). Using the Bayes risk previously computed in (1.7), we have R* ≥ R*π = σ0²d/(σ0² + 1). Sending σ0 → ∞ yields R* ≥ d.

{rmk:minimaxest-nonunique} Remark 1.5 (Non-uniqueness of minimax estimators). In general, estimators that achieve the minimax risk need not be unique. For instance, as shown in Example 1.3, the MLE θ̂ML = X is minimax for the unconstrained GLM in any dimension. On the other hand, it is known that whenever d ≥ 3, the risk of the James-Stein estimator (1.4) is smaller than that of the MLE everywhere (see Fig. 1.2) and thus it is also minimax. In fact, there exists a continuum of estimators that are minimax for (1.11) [LC98, Theorem 5.5].


Figure 1.2: Risk of the James-Stein estimator (1.4) in dimension d = 3 and σ = 1 as a function of kθk. {fig:JS}

For most statistical models, Theorem 1.1 in fact holds with equality; such a result is known as a minimax theorem. Before discussing this important topic, here is an example where the minimax risk is strictly bigger than the worst-case Bayes risk. {ex:minimaxfail}

Example 1.4. Let θ, θ̂ ∈ N ≜ {1, 2, ...} and ℓ(θ, θ̂) = 1{θ̂ < θ}, i.e., the statistician loses one dollar if Nature’s choice exceeds the statistician’s guess and loses nothing otherwise. Consider the extreme case of blind guessing (i.e., no data is available, say, X = 0). Then for any θ̂, possibly randomized, we have Rθ(θ̂) = P(θ̂ < θ). Thus R* ≥ lim_{θ→∞} P(θ̂ < θ) = 1, which is clearly achievable. On the other hand, for any prior π on N, Rπ(θ̂) = P(θ̂ < θ), which vanishes as the deterministic guess θ̂ → ∞. Therefore R*π = 0, and in this case R* = 1 > R*Bayes = 0.

Exercise 1.1. Show that the minimax quadratic risk of the GLM X ∼ N(θ, 1) with parameter space θ ≥ 0 is the same as in the unconstrained case. (This might be a bit surprising because the thresholded estimator X₊ = max(X, 0) achieves a better risk pointwise at every θ ≥ 0; nevertheless, just like the James-Stein estimator (cf. Fig. 1.2), in the worst case the gain is asymptotically diminishing.)
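The gap in Example 1.4 can be made concrete by a short computation (a sketch with an illustrative prior, not part of the notes): under any fixed prior, say Geometric(p), a sufficiently large constant guess drives the average risk toward zero, while every fixed guess has worst-case risk 1.

```python
# Example 1.4: blind guessing with loss 1{theta_hat < theta}.
p = 0.3
# Truncated Geometric(p) prior on {1, 2, ...}; the tail beyond 199 is negligible.
prior = [(1 - p) ** (k - 1) * p for k in range(1, 200)]

def avg_risk(guess):
    # R_pi(guess) = P(theta > guess) under the prior
    return sum(w for k, w in enumerate(prior, start=1) if k > guess)

bayes_risks = [avg_risk(g) for g in (1, 5, 20, 50)]
print(bayes_risks)   # decreasing toward 0, so R*_pi = 0 in the limit

# Worst-case risk of any fixed guess g: Nature picks theta = g + 1, giving loss 1.
worst_case = 1.0
```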

1.3.3 Minimax and Bayes risk: a duality perspective {sec:minimax-duality} Recall from Theorem 1.1 the inequality

R* ≥ R*Bayes.

This result can be interpreted from an optimization perspective. More precisely, R* is the value of a convex optimization problem (the primal) and R*Bayes is precisely the value of its dual program. Thus the inequality (1.10) is simply weak duality. If strong duality holds, then (1.10) is in fact an equality, in which case the minimax theorem holds. For simplicity, we consider the case where Θ is a finite set. Then

R* = min_{Pθ̂|X} max_{θ∈Θ} Eθ[ℓ(θ, θ̂)]. (1.12) {eq:minimax-dual1}

This is a convex optimization problem. Indeed, Pθ̂|X ↦ Eθ[ℓ(θ, θ̂)] is affine and the pointwise supremum of affine functions is convex. To write down its dual problem, first let us rewrite (1.12) in an augmented form

R* = min_{Pθ̂|X, t} t (1.13) {eq:minimax-dual2}

s.t. Eθ[ℓ(θ, θ̂)] ≤ t, ∀θ ∈ Θ.

Let πθ ≥ 0 denote the Lagrange multiplier (dual variable) for each inequality constraint. The Lagrangian of (1.13) is

L(Pθ̂|X, t, π) = t + Σ_{θ∈Θ} πθ (Eθ[ℓ(θ, θ̂)] − t) = (1 − Σ_{θ∈Θ} πθ) t + Σ_{θ∈Θ} πθ Eθ[ℓ(θ, θ̂)].

By definition, we have R* ≥ min_{Pθ̂|X, t} L(Pθ̂|X, t, π). Note that unless Σ_{θ∈Θ} πθ = 1, min_{t∈R} L(Pθ̂|X, t, π) is −∞. Thus π = (πθ : θ ∈ Θ) must be a probability measure and the dual problem is

max_π min_{Pθ̂|X, t} L(Pθ̂|X, t, π) = max_{π∈∆(Θ)} min_{Pθ̂|X} Rπ(θ̂) = max_{π∈∆(Θ)} R*π.

Hence, R* ≥ R*Bayes. In summary, the minimax risk and the worst-case Bayes risk are related by convex duality, where the primal variables are (randomized) estimators and the dual variables are priors. This view can in fact be operationalized. Later we will revisit this observation in Section ?? when we dualize certain minimax lower bounds for the purpose of constructing estimators.

1.3.4 Minimax theorem {sec:minimax-thm}

Next we state the minimax theorem which gives conditions ensuring that (1.10) holds with equality, namely, the minimax risk R* and the worst-case Bayes risk R*Bayes coincide. For simplicity, let us consider the case of estimating θ itself, where the estimator θ̂ takes values in the action space Θ̂ with a loss function ℓ : Θ × Θ̂ → R. A very general result (cf. [Str85, Theorem 46.6]) asserts that R* = R*Bayes, provided that the following conditions hold:

• The experiment is dominated, i.e., Pθ ≪ ν holds for all θ ∈ Θ for some measure ν on X.

• The action space Θ̂ is a locally compact topological space with a countable base (e.g. a Euclidean space).

• The loss function is level-compact, i.e., for each θ ∈ Θ, ℓ(θ, ·) is bounded from below and the sublevel set {θ̂ : ℓ(θ, θ̂) ≤ a} is compact for each a.

This result shows that for virtually all problems encountered in practice, the minimax risk coincides with the least favorable Bayes risk. At the heart of any minimax theorem lies an application of the separating hyperplane theorem. Below we give a proof of a special case illustrating this type of argument. {thm:minimxthm}

Theorem 1.2 (Minimax theorem). R* = R*Bayes in either of the following cases:

• Θ is a finite set and the data X takes values in a finite set X.

• Θ is a finite set and the loss function ℓ is bounded from below, i.e., inf_{θ,θ̂} ℓ(θ, θ̂) > −∞.

Proof. The first case directly follows from the duality interpretation in Section 1.3.3 and the fact that strong duality holds for finite-dimensional linear programming (see for example [Sch98, Sec. 7.4]).

For the second case, we start by showing that if R* = ∞, then R*Bayes = ∞. To see this, consider the uniform prior π on Θ. For any estimator θ̂, there exists θ ∈ Θ such that Rθ(θ̂) = ∞. Then Rπ(θ̂) ≥ (1/|Θ|) Rθ(θ̂) = ∞.

Next we assume that R* < ∞. Then R* ∈ R, since ℓ is bounded from below (say, by a) by assumption. Given an estimator θ̂, denote its risk vector R(θ̂) = (Rθ(θ̂))θ∈Θ. Then its average risk with respect to a prior π is given by the inner product ⟨R(θ̂), π⟩ = Σ_{θ∈Θ} πθ Rθ(θ̂). Define

S = {R(θ̂) ∈ R^Θ : θ̂ is a randomized estimator} = set of all possible risk vectors,

T = {t ∈ R^Θ : tθ < R* for all θ ∈ Θ}.

Note that both S and T are convex (why?) subsets of the Euclidean space R^Θ and S ∩ T = ∅ by the definition of R*. By the separating hyperplane theorem, there exists a non-zero π ∈ R^Θ and c ∈ R such that inf_{s∈S} ⟨π, s⟩ ≥ c ≥ sup_{t∈T} ⟨π, t⟩. Obviously, π must be componentwise nonnegative, for otherwise sup_{t∈T} ⟨π, t⟩ = ∞. Therefore by normalization we may assume that π is a probability vector, i.e., a prior on Θ. Then R*Bayes ≥ R*π = inf_{s∈S} ⟨π, s⟩ ≥ sup_{t∈T} ⟨π, t⟩ = R*, completing the proof.
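The first case of Theorem 1.2 can be verified on a toy experiment by brute force. The sketch below (an illustration, not from the notes; the specific channel Bern(0.2)/Bern(0.8) is an arbitrary choice) grid-searches over randomized estimators and over priors and confirms that the min-max and max-min values agree.

```python
# Tiny finite experiment: theta in {0,1}, X ~ Bern(0.2) under theta=0 and
# Bern(0.8) under theta=1, zero-one loss. A randomized estimator is a pair
# (q0, q1) with q_x = P(theta_hat = 1 | X = x); a prior is pi = P(theta = 1).
grid = [i / 100 for i in range(101)]

def risks(q0, q1):
    r0 = 0.8 * q0 + 0.2 * q1               # risk at theta = 0: P_0(theta_hat = 1)
    r1 = 0.2 * (1 - q0) + 0.8 * (1 - q1)   # risk at theta = 1: P_1(theta_hat = 0)
    return r0, r1

# Minimax risk: minimize the worst-case risk over randomized estimators.
minimax = min(max(risks(q0, q1)) for q0 in grid for q1 in grid)

# Worst-case Bayes risk: best average risk for each prior, then maximize over priors.
bayes = max(
    min((1 - pi) * r0 + pi * r1
        for q0 in grid for q1 in grid for r0, r1 in [risks(q0, q1)])
    for pi in grid
)

print(minimax, bayes)   # both values coincide, as Theorem 1.2 predicts
```

Here the common value is 0.2, attained by the natural rule θ̂ = X and the uniform prior; this is exactly the strong LP duality invoked in the first case of the proof.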

1.4 Multiple observations and sample complexity {sec:multiple-obs}

Given a statistical experiment {Pθ : θ ∈ Θ}, consider the experiment

Pn = {Pθ^⊗n : θ ∈ Θ}, n ≥ 1. (1.14) {eq:multipleobs}

We refer to this as the independent sampling model, in which we observe a sample X = (X1, . . . , Xn) drawn independently from Pθ for some θ ∈ Θ ⊂ R^d. Given a loss function ℓ : R^d × R^d → R₊, the minimax risk is denoted by

R*n(Θ) = inf_{θ̂} sup_{θ∈Θ} Eθ[ℓ(θ, θ̂)]. (1.15) {eq:minimax-Rn}

Clearly, n ↦ R*n(Θ) is non-increasing, since we can always discard observations. Typically, when Θ is a fixed subset of R^d, R*n(Θ) vanishes as n → ∞. Thus a natural question is at what rate R*n converges to zero. Equivalently, one can consider the sample complexity, namely, the minimum sample size needed to attain a prescribed error ε even in the worst case:

n*(ε) ≜ min {n ∈ N : R*n(Θ) ≤ ε}. (1.16) {eq:samplecomplexity-def}

In the classical large-sample asymptotic regime (Lecture ??), the rate of convergence for the quadratic risk is usually Θ(1/n), which is commonly referred to as the “parametric rate”. In comparison, our focus in this course is understanding the dependency on the dimension and other structural parameters nonasymptotically.

As a concrete example, let us revisit the GLM, where we observe X = (X1, . . . , Xn) drawn i.i.d. from N(θ, σ²Id) with θ ∈ R^d. In this case, the minimax quadratic risk is

R*n = dσ²/n. (1.17) {eq:rstarpn}

To see this, note that the sample mean X̄ = (X1 + . . . + Xn)/n is a sufficient statistic (cf. Section ??) of X for θ. Therefore the model reduces to X̄ ∼ N(θ, (σ²/n) Id) and (1.17) follows from the minimax risk in (1.11) for a single observation. From (1.17), we conclude that the sample complexity is n*(ε) = ⌈dσ²/ε⌉, which grows linearly with the dimension d. This is the common wisdom that “sample complexity scales proportionally to the number of parameters”, also known as “counting the degrees of freedom”. Indeed, in high dimensions we typically expect the sample complexity to grow with the ambient dimension; however, the exact dependency need not be linear, as it depends on the loss function and the objective of estimation. For example, consider the matrix case θ ∈ R^{d×d} with n independent observations in Gaussian noise. Let ε be a small constant. As we will show later,

• for the quadratic loss ‖θ̂ − θ‖²F, we have R*n ≍ d²/n and hence n*(ε) = Θ(d²);

• if the loss function is ‖θ̂ − θ‖²op, then we have R*n ≍ d/n and hence n*(ε) = Θ(d);

• if we only want to estimate the scalar functional ‖θ‖ℓ∞, then n*(ε) = Θ(√(log d)).

Exercise 1.2 (Nonparametric location model). Consider the class P of distributions (which need not have a density) on the real line with variance at most σ². Given the observations X = (X1, . . . , Xn) drawn i.i.d. from some P ∈ P, the objective is to estimate θ(P), the mean of the distribution P. Show that the minimax quadratic risk is given by σ²/n.
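The sufficiency reduction behind (1.17) can be checked by simulation. In the sketch below (an illustration, not from the notes; the dimension, noise level, and sample size are arbitrary), the sample mean is drawn directly from its reduced distribution N(θ, (σ²/n) Id), and its quadratic risk matches dσ²/n.

```python
import random

# Check (1.17): with n i.i.d. observations from N(theta, sigma^2 I_d),
# the sample mean attains quadratic risk d * sigma^2 / n.
random.seed(3)
d, sigma, n, n_mc = 4, 1.5, 10, 20_000
theta = [1.0] * d                      # a fixed, illustrative ground truth

mse = 0.0
for _ in range(n_mc):
    # By sufficiency, simulate Xbar ~ N(theta, (sigma^2/n) I_d) directly.
    xbar = [t + random.gauss(0.0, sigma / n**0.5) for t in theta]
    mse += sum((xb - t) ** 2 for xb, t in zip(xbar, theta))
mse /= n_mc

exact = d * sigma**2 / n   # the minimax risk in (1.17)
print(mse, exact)
```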

1.5 Tensor product of experiments {sec:tensorization-model}

Tensor product is a way to define a high-dimensional model from low-dimensional models. Given statistical experiments Pi = {Pθi : θi ∈ Θi} and the corresponding loss functions ℓi, for i ∈ [d], their tensor product refers to the following statistical experiment:

P = { Pθ = ∏_{i=1}^d Pθi : θ = (θ1, . . . , θd) ∈ Θ ≜ ∏_{i=1}^d Θi },

ℓ(θ, θ̂) ≜ Σ_{i=1}^d ℓi(θi, θ̂i), ∀θ, θ̂ ∈ Θ.

In this model, the observation X = (X1, . . . , Xd) consists of independent (but not identically distributed) coordinates Xi ∼ Pθi. This should be contrasted with the multiple-observation model in (1.14), in which n i.i.d. observations drawn from the same distribution are given.

The minimax risk of the tensorized experiment is related to the minimax risks R*(Pi) and the worst-case Bayes risks R*Bayes(Pi) ≜ sup_{πi∈∆(Θi)} R*πi(Pi) of the individual experiments as follows: {thm:minimax-tensor}

Theorem 1.3 (Minimax risk of tensor product).

Σ_{i=1}^d R*Bayes(Pi) ≤ R*(P) ≤ Σ_{i=1}^d R*(Pi). (1.18) {eq:minimax-tensor}

Consequently, if the minimax theorem holds for each experiment, i.e., R*(Pi) = R*Bayes(Pi), then it also holds for the product experiment and, in particular,

R*(P) = Σ_{i=1}^d R*(Pi). (1.19) {eq:minimax-tensoreq}

Proof. The right inequality of (1.18) simply follows by estimating each θi separately on the basis of Xi, namely, θ̂ = (θ̂1, . . . , θ̂d), where θ̂i depends only on Xi. For the left inequality, consider a product prior π = ∏_{i=1}^d πi, under which the θi’s are independent and so are the Xi’s. Consider any randomized estimator θ̂i = θ̂i(X, Ui) of θi based on X, where Ui is some auxiliary randomness independent of X. We can rewrite it as θ̂i = θ̂i(Xi, Ũi), where Ũi = (X\i, Ui) ⊥⊥ Xi. Thus θ̂i can be viewed as a randomized estimator based on Xi alone, and its average risk must satisfy Rπi(θ̂i) = E[ℓi(θi, θ̂i)] ≥ R*πi. Summing over i and taking the suprema over the priors πi yields the left inequality of (1.18).

As an example, we note that the unstructured d-dimensional GLM {N(θ, σ²Id) : θ ∈ R^d} with quadratic loss is simply the d-fold tensor product of the one-dimensional GLM. Since the minimax theorem holds for the GLM (cf. Section 1.3.4), Theorem 1.3 shows that the minimax risks sum up to σ²d, which agrees with Example 1.3. In general, however, it is possible that the minimax risk of the tensorized experiment is less than the sum of the individual minimax risks, and the right inequality of (1.18) can be strict. This might appear surprising, since Xi only carries information about θi and it makes sense intuitively to estimate θi based solely on Xi. Nevertheless, the following is a counterexample:

Remark 1.6. Consider X = θZ, where θ ∈ N and Z ∼ Bern(1/2). The estimator θ̂ takes values in N as well and the loss function is ℓ(θ, θ̂) = 1{θ̂ < θ}, i.e., whoever guesses the greater number wins. The minimax risk for this experiment is equal to P[Z = 0] = 1/2. To see this, note that if Z = 0, then all information about θ is erased. Therefore for any (randomized) estimator Pθ̂|X, the risk is lower bounded by Rθ(θ̂) = P[θ̂ < θ] ≥ P[θ̂ < θ, Z = 0] = (1/2) P[θ̂ < θ | X = 0]. Therefore sending θ → ∞ yields sup_θ Rθ(θ̂) ≥ 1/2. This is achievable by θ̂ = X. Clearly, this is a case where the minimax theorem does not hold, very similar to the previous Example 1.4.

Next consider the tensor product of two copies of this experiment with the loss function ℓ(θ, θ̂) = 1{θ̂1 < θ1} + 1{θ̂2 < θ2}. We show that the minimax risk is strictly less than one. For i = 1, 2, let

Xi = θiZi, where Z1, Z2 are i.i.d. Bern(1/2). Consider the following estimator:

θ̂1 = θ̂2 = X1 ∨ X2 if X1 > 0 or X2 > 0, and θ̂1 = θ̂2 = 1 otherwise.

Then for any θ1, θ2 ∈ N, averaging over Z1, Z2, we get

E[ℓ(θ, θ̂)] ≤ (1/4) (1{θ1<θ2} + 1{θ2<θ1} + 2) ≤ 3/4.
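The averaging over Z1, Z2 in Remark 1.6 can be done exhaustively. The following sketch (an illustration, not part of the notes; the search range over (θ1, θ2) is an arbitrary truncation) enumerates all four noise patterns for the joint estimator above and confirms that its average loss never exceeds 3/4, strictly below the sum 1/2 + 1/2 of the individual minimax risks.

```python
# Remark 1.6: tensor product of two guessing experiments, loss
# 1{theta_hat_1 < theta_1} + 1{theta_hat_2 < theta_2}.
def avg_loss(theta1, theta2):
    total = 0
    for z1 in (0, 1):
        for z2 in (0, 1):
            x1, x2 = theta1 * z1, theta2 * z2
            # joint estimator: X1 v X2 if either observation survives, else 1
            guess = max(x1, x2) if (x1 > 0 or x2 > 0) else 1
            total += (guess < theta1) + (guess < theta2)
    return total / 4.0   # (Z1, Z2) is uniform on {0,1}^2

worst = max(avg_loss(t1, t2) for t1 in range(1, 30) for t2 in range(1, 30))
print(worst)
```

The worst case over this range is 3/4, attained e.g. when one parameter survives and reveals a value smaller than the other.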

Bibliography

[Fer67] T. S. Ferguson. Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York, NY, 1967.

[IS03] Y. I. Ingster and I. A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models. Springer, New York, NY, 2003.

[LC86] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer-Verlag, New York, NY, 1986.

[LC98] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New York, NY, 2nd edition, 1998.

[Rin76] Yosef Rinott. On convexity of measures. Annals of Probability, 4(6):1020–1026, 1976.

[Rob56] Herbert Robbins. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1956.

[Sch98] Alexander Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.

[Str85] Helmut Strasser. Mathematical theory of statistics: Statistical experiments and asymptotic decision theory. Walter de Gruyter, Berlin, Germany, 1985.
