STAT 200B: Theoretical Statistics
Arash A. Amini
March 2, 2020

Statistical decision theory

• A probability model $\mathcal{P} = \{P_\theta : \theta \in \Omega\}$ for data $X \in \mathcal{X}$:
  $\Omega$: parameter space, $\mathcal{X}$: sample space.
• An action space $\mathcal{A}$: the set of available actions (decisions).
• A loss function, e.g.:
  0-1 loss: $L(\theta, a) = 1\{\theta \neq a\}$, with $\Omega = \mathcal{A} = \{0, 1\}$.
  Quadratic loss (squared error): $L(\theta, a) = \|\theta - a\|^2$, with $\Omega = \mathcal{A} = \mathbb{R}^d$.

Statistical inference as a game:
1. Nature picks the "true" parameter $\theta$ and draws $X \sim P_\theta$. Thus, $X$ is a random element of $\mathcal{X}$.
2. The statistician observes $X$ and makes a decision $\delta(X) \in \mathcal{A}$; $\delta : \mathcal{X} \to \mathcal{A}$ is called a decision rule.
3. The statistician incurs the loss $L(\theta, \delta(X))$.

The goal of the statistician is to minimize the expected loss, a.k.a. the risk:
$R(\theta, \delta) := E_\theta L(\theta, \delta(X)) = \int L(\theta, \delta(x))\, dP_\theta(x) = \int L(\theta, \delta(x))\, p_\theta(x)\, d\mu(x)$,
where the last expression holds when the family is dominated: $P_\theta = p_\theta\, d\mu$. We usually work with the family of densities $\{p_\theta(\cdot) : \theta \in \Omega\}$.

Example 1 (Bernoulli trials)
• A coin is flipped; we want to estimate the probability that it comes up heads.
• One possible model: $X = (X_1, \dots, X_n)$ with $X_i \overset{iid}{\sim} \mathrm{Ber}(\theta)$, for some $\theta \in [0, 1]$.
• Formally, $\mathcal{X} = \{0, 1\}^n$, $P_\theta = (\mathrm{Ber}(\theta))^{\otimes n}$, and $\Omega = [0, 1]$.
• PMF of $X_i$: $P(X_i = x) = \theta$ if $x = 1$, $1 - \theta$ if $x = 0$; equivalently $P(X_i = x) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$.
• Joint PMF: $p_\theta(x_1, \dots, x_n) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i}$.
• Action space: $\mathcal{A} = \Omega$.
• Quadratic loss: $L(\theta, \delta) = (\theta - \delta)^2$.

Comparing estimators via their risk
Bernoulli trials. Let us look at four estimators:
• Sample mean: $\delta_1(X) = \frac{1}{n}\sum_{i=1}^n X_i$, with $R(\theta, \delta_1) = \frac{\theta(1-\theta)}{n}$.
• Constant estimator: $\delta_2(X) = \frac{1}{2}$, with $R(\theta, \delta_2) = (\theta - \frac{1}{2})^2$.
• Strange looking: $\delta_3(X) = \frac{\sum_i X_i + 3}{n + 6}$, with $R(\theta, \delta_3) = \frac{n\theta(1-\theta) + (3 - 6\theta)^2}{(n+6)^2}$.
• Throw data out: $\delta_4(X) = X_1$, with $R(\theta, \delta_4) = \theta(1-\theta)$.

[Figure: the four risk functions $\theta \mapsto R(\theta, \delta_i)$ plotted over $[0, 1]$, for $n = 10$ and $n = 50$.]
The comparison depends on the choice of the loss; a different loss gives a different picture.

How to deal with the fact that the risks are functions?
• Summarize them by reducing to numbers:
  (Bayesian) take a weighted average: $\inf_\delta \int_\Omega R(\theta, \delta)\, d\pi(\theta)$.
  (Frequentist) take the maximum: $\inf_\delta \max_{\theta \in \Omega} R(\theta, \delta)$.
• Restrict to a class of estimators: unbiased (UMVU), equivariant, etc.
• Rule out estimators that are dominated by others (inadmissible).

Admissibility
Definition 1. Let $\delta$ and $\delta^*$ be decision rules. $\delta^*$ (strictly) dominates $\delta$ if
• $R(\theta, \delta^*) \le R(\theta, \delta)$ for all $\theta \in \Omega$, and
• $R(\theta, \delta^*) < R(\theta, \delta)$ for some $\theta \in \Omega$.
$\delta$ is inadmissible if there is a different $\delta^*$ that dominates it; otherwise $\delta$ is admissible.

An inadmissible rule can be uniformly "improved". $\delta_4$ in the Bernoulli example is inadmissible. We will see a non-trivial example soon (exponential distribution).

Bias
Definition 2. The bias of $\delta$ for estimating $g(\theta)$ is $B_\theta(\delta) := E_\theta(\delta) - g(\theta)$. The estimator is unbiased if $B_\theta(\delta) = 0$ for all $\theta \in \Omega$.

It is not always possible to find unbiased estimators; for example, $g(\theta) = \sin(\theta)$ in the binomial family (Keener, Example 4.2, p. 62).

Definition 3. $g$ is called U-estimable if there is an unbiased estimator $\delta$ for $g$. Usually $g(\theta) = \theta$.

Bias-variance decomposition
For the quadratic loss $L(\theta, a) = (\theta - a)^2$, the risk is the mean-squared error (MSE). In this case we have the decomposition
$\mathrm{MSE}_\theta(\delta) = [B_\theta(\delta)]^2 + \mathrm{var}_\theta(\delta)$.

Proof. Let $\mu_\theta := E_\theta(\delta)$. We have
$\mathrm{MSE}_\theta(\delta) = E_\theta(\theta - \delta)^2 = E_\theta(\theta - \mu_\theta + \mu_\theta - \delta)^2 = (\theta - \mu_\theta)^2 + 2(\theta - \mu_\theta)\, E_\theta[\mu_\theta - \delta] + E_\theta(\mu_\theta - \delta)^2$,
and the cross term vanishes since $E_\theta[\mu_\theta - \delta] = 0$.

The same goes for a general $g(\theta)$ or higher dimensions: $L(\theta, a) = \|g(\theta) - a\|_2^2$.
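The closed-form risks above are easy to confirm numerically. Below is a minimal Python sketch (assuming numpy is available; the estimator names and the Monte Carlo helper are illustrative, not from the slides) that estimates each $R(\theta, \delta_i)$ by simulation and compares it with the formula.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_risk(delta, theta, n, reps=100_000):
    """Monte Carlo estimate of the quadratic risk R(theta, delta)."""
    X = rng.binomial(1, theta, size=(reps, n))
    return np.mean((delta(X) - theta) ** 2)

n, theta = 10, 0.3
estimators = {
    "sample mean":  lambda X: X.mean(axis=1),                 # delta_1
    "constant 1/2": lambda X: np.full(X.shape[0], 0.5),       # delta_2
    "shrunk":       lambda X: (X.sum(axis=1) + 3) / (n + 6),  # delta_3
    "first obs":    lambda X: X[:, 0].astype(float),          # delta_4
}
exact = {
    "sample mean":  theta * (1 - theta) / n,
    "constant 1/2": (theta - 0.5) ** 2,
    "shrunk":       (n * theta * (1 - theta) + (3 - 6 * theta) ** 2) / (n + 6) ** 2,
    "first obs":    theta * (1 - theta),
}
for name, delta in estimators.items():
    print(f"{name:>12}: MC {mc_risk(delta, theta, n):.4f}  exact {exact[name]:.4f}")
```

Varying theta over a grid reproduces the risk curves in the figure: no estimator dominates all the others, except that $\delta_1$ always beats $\delta_4$.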
Example 2 (Berger)
• Let $X \sim N(\theta, 1)$.
• Consider the class of estimators of the form $\delta_c(X) = cX$, for $c \in \mathbb{R}$.
• $\mathrm{MSE}_\theta(\delta_c) = (\theta - c\theta)^2 + c^2 = (1 - c)^2\theta^2 + c^2$.
• For $c > 1$, we have $1 = \mathrm{MSE}_\theta(\delta_1) < \mathrm{MSE}_\theta(\delta_c)$ for all $\theta$.
• For $c \in [0, 1]$ the rules are incomparable.

Optimality depends on the loss
Example 3 (Poisson process)
• Let $X_1, \dots, X_n$ be the inter-arrival times of a Poisson process with rate $\lambda$, so $X_1, \dots, X_n \overset{iid}{\sim} \mathrm{Expo}(\lambda)$. The model has the density
$p_\lambda(x) = \prod_{i=1}^n p_\lambda(x_i) = \prod_i \lambda e^{-\lambda x_i} 1\{x_i > 0\} = \lambda^n e^{-\lambda \sum_i x_i}\, 1\{\min_i x_i > 0\}$.
• $\Omega = \mathcal{A} = (0, \infty)$.
• Let $S = \sum_i X_i$ and $\bar X = \frac{1}{n} S$.
• The MLE for $\lambda$ is $\hat\lambda = 1/\bar X = n/S$.
• $S := \sum_{i=1}^n X_i$ has the $\mathrm{Gamma}(n, \lambda)$ distribution.
• $1/S$ has the $\mathrm{Inv\text{-}Gamma}(n, \lambda)$ distribution with mean $\lambda/(n-1)$.
• Hence $E_\lambda[\hat\lambda] = n\lambda/(n-1)$: the MLE is biased for $\lambda$.
• Then $\tilde\lambda := (n-1)\hat\lambda/n$ is unbiased.
• We also have $\mathrm{var}_\lambda(\tilde\lambda) < \mathrm{var}_\lambda(\hat\lambda)$.
• It follows that $\mathrm{MSE}_\lambda(\tilde\lambda) < \mathrm{MSE}_\lambda(\hat\lambda)$ for all $\lambda$: the MLE $\hat\lambda$ is inadmissible for the quadratic loss.

Possible explanation: the quadratic loss penalizes over-estimation more than under-estimation for $\Omega = (0, \infty)$.

Alternative loss function (Itakura-Saito distance):
$L(\lambda, a) = \lambda/a - 1 - \log(\lambda/a)$, for $a, \lambda \in (0, \infty)$.
• With this loss function, $R(\lambda, \tilde\lambda) > R(\lambda, \hat\lambda)$ for all $\lambda$.
• That is, the MLE renders $\tilde\lambda$ inadmissible.

This is an example of a Bregman divergence with $\phi(x) = -\log x$. For a convex function $\phi : \mathbb{R}^d \to \mathbb{R}$, the Bregman divergence is defined as
$d_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle$,
the remainder of the first-order Taylor expansion of $\phi$ at $y$.

Details:
• Consider $\delta_\alpha(X) = \alpha/S$. Then we have
$R(\lambda, \delta_\alpha) - R(\lambda, \delta_\beta) = \left(\frac{n}{\alpha} - \frac{n}{\beta}\right) - \left(\log\frac{n}{\alpha} - \log\frac{n}{\beta}\right)$.
• Take $\alpha = n - 1$ and $\beta = n$.
• Use $\log x - \log y < x - y$ for $x > y \ge 1$, here with $x = n/(n-1)$ and $y = 1$; this gives $R(\lambda, \tilde\lambda) > R(\lambda, \hat\lambda)$ for all $\lambda$.
  (The inequality follows from the strict concavity of $f(x) = \log(x)$: $f(x) - f(y) < f'(y)(x - y)$ for $y \neq x$.)
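The opposite rankings under the two losses are easy to see in simulation. Here is a minimal Python sketch (numpy assumed; the helper name `risks` and the chosen values of $\lambda$ and $n$ are illustrative) that estimates both risks of $\hat\lambda = n/S$ and $\tilde\lambda = (n-1)/S$ by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)

def risks(lam, n, reps=200_000):
    """Monte Carlo risks of the MLE n/S and the unbiased (n-1)/S under two losses."""
    S = rng.gamma(shape=n, scale=1 / lam, size=reps)   # S = sum of n Expo(lam) draws
    mle, unbiased = n / S, (n - 1) / S
    quad = lambda a: np.mean((a - lam) ** 2)                 # quadratic loss
    isd = lambda a: np.mean(lam / a - 1 - np.log(lam / a))   # Itakura-Saito loss
    return {"quadratic": (quad(mle), quad(unbiased)),
            "itakura-saito": (isd(mle), isd(unbiased))}

for lam in (0.5, 2.0):
    # Quadratic loss favors (n-1)/S; Itakura-Saito loss favors the MLE n/S.
    print(lam, risks(lam, n=10))
```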
Sufficiency
Idea: separate the data into
• parts that are relevant for estimating $\theta$ (sufficient), and
• parts that are irrelevant (ancillary).
Benefits:
• data compression: efficient computation and storage;
• irrelevant parts can increase the risk (Rao-Blackwell).

Definition 4. Consider the model $\mathcal{P} = \{P_\theta : \theta \in \Omega\}$ for $X$. A statistic $T = T(X)$ is sufficient for $\mathcal{P}$ (or for $\theta$, or for $X$) if the conditional distribution of $X$ given $T$ does not depend on $\theta$. More precisely, we have
$P_\theta(X \in A \mid T = t) = Q_t(A)$ for all $t, A$,
for some Markov kernel $Q$. Making this fully precise requires measure theory. Intuitively, given $T$, we can simulate $X$ using an external source of randomness.

Example 4 (Coin tossing)
• $X_i \overset{iid}{\sim} \mathrm{Ber}(\theta)$, $i = 1, \dots, n$.
• Notation: $X = (X_1, \dots, X_n)$, $x = (x_1, \dots, x_n)$.
• We will show that $T = T(X) = \sum_i X_i$ is sufficient for $\theta$. (This should be intuitive.)

$P_\theta(X = x) = p_\theta(x) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{T(x)}(1-\theta)^{n - T(x)}$.
• Then
$P_\theta(X = x, T = t) = P_\theta(X = x)\, 1\{T(x) = t\} = \theta^t (1-\theta)^{n-t}\, 1\{T(x) = t\}$.
• Marginalizing,
$P_\theta(T = t) = \sum_{x \in \{0,1\}^n} \theta^t (1-\theta)^{n-t}\, 1\{T(x) = t\} = \binom{n}{t} \theta^t (1-\theta)^{n-t}$.
• Hence,
$P_\theta(X = x \mid T = t) = \dfrac{\theta^t(1-\theta)^{n-t}\, 1\{T(x) = t\}}{\binom{n}{t}\theta^t(1-\theta)^{n-t}} = \dfrac{1}{\binom{n}{t}}\, 1\{T(x) = t\}$.
What is the above (conditional) distribution?

Factorization Theorem
It is not convenient to check for sufficiency this way, hence:

Theorem 1 (Factorization (Fisher–Neyman)). Assume that $\mathcal{P} = \{P_\theta : \theta \in \Omega\}$ is dominated by $\mu$. A statistic $T$ is sufficient iff, for some functions $g_\theta, h \ge 0$,
$p_\theta(x) = g_\theta(T(x))\, h(x)$, for $\mu$-a.e. $x$.

The likelihood $\theta \mapsto p_\theta(X)$ then depends on $X$ only through $T(X)$. That the family is dominated (has a density) is important.
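In Example 4, the conditional distribution of $X$ given $T = t$ is the uniform distribution over binary sequences with exactly $t$ ones, which is free of $\theta$; and the factorization holds with $g_\theta(t) = \theta^t(1-\theta)^{n-t}$ and $h(x) = 1$. Below is a minimal Python sketch (numpy assumed; function names are illustrative, not from the slides) checking both points: two samples sharing the same value of $T$ have identical likelihood functions, and a draw of $X$ given $T = t$ can be produced without knowing $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_lik(theta, x):
    """Joint likelihood p_theta(x) = theta^T(x) * (1 - theta)^(n - T(x))."""
    t, n = x.sum(), x.size
    return theta ** t * (1 - theta) ** (n - t)

thetas = np.linspace(0.05, 0.95, 7)
x1 = np.array([1, 0, 1, 0, 0, 1])   # T(x1) = 3
x2 = np.array([0, 1, 1, 1, 0, 0])   # T(x2) = 3, different ordering

# Same T  =>  identical likelihood functions (here h(x) = 1), so the likelihood
# depends on the data only through T, as the factorization theorem predicts.
print(np.allclose(bernoulli_lik(thetas, x1), bernoulli_lik(thetas, x2)))  # True

def draw_x_given_t(t, n):
    """Simulate X | T = t: place t ones in uniformly random positions (no theta needed)."""
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=t, replace=False)] = 1
    return x

print(draw_x_given_t(t=3, n=6))
```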