STAT 200B: Theoretical Statistics

Arash A. Amini
March 2, 2020

Statistical decision theory

A probability model 𝒫 = {P_θ : θ ∈ Ω} for data X ∈ 𝒳:
  • Ω: parameter space, 𝒳: sample space.

An action space 𝒜: the set of available actions (decisions).

A loss function L(θ, a), for example:
  • 0–1 loss: L(θ, a) = 1{θ ≠ a}, with Ω = 𝒜 = {0, 1}.
  • Quadratic loss (squared error): L(θ, a) = ‖θ − a‖², with Ω = 𝒜 = R^d.

Statistical inference as a game:
  1. Nature picks the "true" parameter θ and draws X ~ P_θ. Thus, X is a random element of 𝒳.
  2. The statistician observes X and makes a decision δ(X) ∈ 𝒜. The map δ : 𝒳 → 𝒜 is called a decision rule.
  3. The statistician incurs the loss L(θ, δ(X)).

The goal of the statistician is to minimize the expected loss, a.k.a. the risk:

  R(θ, δ) := E_θ[L(θ, δ(X))] = ∫ L(θ, δ(x)) dP_θ(x) = ∫ L(θ, δ(x)) p_θ(x) dµ(x),

where the last expression holds when the family is dominated: P_θ = p_θ dµ. We usually work with the family of densities {p_θ(·) : θ ∈ Ω}.

Example 1 (Bernoulli trials)
  • A coin is being flipped; we want to estimate the probability of it coming up heads.
  • One possible model: X = (X_1, …, X_n), X_i iid ~ Ber(θ), for some θ ∈ [0, 1].
  • Formally, 𝒳 = {0, 1}^n, P_θ = (Ber(θ))^{⊗n} and Ω = [0, 1].
  • PMF of X_i: P(X_i = x) = θ for x = 1 and 1 − θ for x = 0, i.e., P(X_i = x) = θ^x (1 − θ)^{1−x}, x ∈ {0, 1}.
  • Joint PMF: p_θ(x_1, …, x_n) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i}.
  • Action space: 𝒜 = Ω.
  • Quadratic loss: L(θ, δ) = (θ − δ)².

Comparing estimators via their risk

Bernoulli trials. Let us look at four estimators:
  • Sample mean:         δ_1(X) = (1/n) Σ_{i=1}^n X_i      R(θ, δ_1) = θ(1 − θ)/n
  • Constant estimator:  δ_2(X) = 1/2                       R(θ, δ_2) = (θ − 1/2)²
  • Strange looking:     δ_3(X) = (Σ_i X_i + 3)/(n + 6)     R(θ, δ_3) = [nθ(1 − θ) + (3 − 6θ)²]/(n + 6)²
  • Throw the data out:  δ_4(X) = X_1                       R(θ, δ_4) = θ(1 − θ)

[Figure: the four risk functions plotted against θ ∈ [0, 1], for n = 10 (left) and n = 50 (right).]

The comparison depends on the choice of the loss; a different loss gives a different picture.

How do we deal with the fact that the risks are functions of θ?
  • Summarize them by reducing to numbers:
      – (Bayesian) Take a weighted average:  inf_δ ∫_Ω R(θ, δ) dπ(θ).
      – (Frequentist) Take the maximum:      inf_δ max_{θ ∈ Ω} R(θ, δ).
  • Restrict to a class of estimators: unbiased (UMVU), equivariant, etc.
  • Rule out estimators that are dominated by others (inadmissible).

Admissibility

Definition 1. Let δ and δ* be decision rules. δ* (strictly) dominates δ if
  • R(θ, δ*) ≤ R(θ, δ) for all θ ∈ Ω, and
  • R(θ, δ*) < R(θ, δ) for some θ ∈ Ω.
δ is inadmissible if there is a different δ* that dominates it; otherwise δ is admissible.

An inadmissible rule can be uniformly "improved". δ_4 in the Bernoulli example is inadmissible. We will see a non-trivial example soon (the exponential distribution).

Bias

Definition 2. The bias of δ for estimating g(θ) is B_θ(δ) := E_θ(δ) − g(θ). The estimator is unbiased if B_θ(δ) = 0 for all θ ∈ Ω.

It is not always possible to find unbiased estimators; for example, g(θ) = sin(θ) in the binomial family (Keener, Example 4.2, p. 62).

Definition 3. g is called U-estimable if there is an unbiased estimator δ for g. Usually g(θ) = θ.
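To tie Example 1's risk table to the bias definition, here is a minimal Monte Carlo sketch (not part of the original slides; it assumes numpy) that estimates the bias and risk of δ_1–δ_4 by simulation and compares the simulated risk with the closed-form expressions in the table above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000

# The four estimators from Example 1; each maps a (reps, n) sample array to reps estimates.
estimators = {
    "d1 sample mean":    lambda x: x.mean(axis=1),
    "d2 constant 1/2":   lambda x: np.full(x.shape[0], 0.5),
    "d3 (sum+3)/(n+6)":  lambda x: (x.sum(axis=1) + 3) / (n + 6),
    "d4 first obs only": lambda x: x[:, 0].astype(float),
}

# Closed-form risks from the table above, for comparison.
risks = {
    "d1 sample mean":    lambda t: t * (1 - t) / n,
    "d2 constant 1/2":   lambda t: (t - 0.5) ** 2,
    "d3 (sum+3)/(n+6)":  lambda t: (n * t * (1 - t) + (3 - 6 * t) ** 2) / (n + 6) ** 2,
    "d4 first obs only": lambda t: t * (1 - t),
}

for theta in (0.2, 0.5):
    x = rng.binomial(1, theta, size=(reps, n))      # X_i iid Ber(theta), repeated reps times
    for name, d in estimators.items():
        est = d(x)
        bias = est.mean() - theta                   # Monte Carlo estimate of B_theta(delta)
        mse = ((est - theta) ** 2).mean()           # Monte Carlo estimate of R(theta, delta)
        print(f"theta={theta} {name:18s} bias={bias:+.4f} MC risk={mse:.4f} exact={risks[name](theta):.4f}")
```

Which estimator has the smallest risk changes with θ, which is exactly why the risk functions need to be summarized or compared via dominance, as discussed above.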
Bias–variance decomposition

For the quadratic loss L(θ, a) = (θ − a)², the risk is the mean-squared error (MSE). In this case we have the decomposition

  MSE_θ(δ) = [B_θ(δ)]² + var_θ(δ).

Proof. Let µ_θ := E_θ(δ). We have

  MSE_θ(δ) = E_θ(θ − δ)² = E_θ(θ − µ_θ + µ_θ − δ)²
           = (θ − µ_θ)² + 2(θ − µ_θ) E_θ[µ_θ − δ] + E_θ(µ_θ − δ)².

The cross term vanishes since E_θ[µ_θ − δ] = 0, leaving the squared bias plus the variance.

The same goes for a general g(θ) or higher dimensions: L(θ, a) = ‖g(θ) − a‖²_2.

Example 2 (Berger)
  • Let X ~ N(θ, 1) and consider the class of estimators of the form δ_c(X) = cX, for c ∈ R.
  • MSE_θ(δ_c) = (θ − cθ)² + c² = (1 − c)²θ² + c².
  • For c > 1, we have 1 = MSE_θ(δ_1) < MSE_θ(δ_c) for all θ, so each such δ_c is dominated by δ_1.
  • For c ∈ [0, 1] the rules are incomparable.

Optimality depends on the loss

Example 3 (Poisson process)
  • Let X_1, …, X_n be the inter-arrival times of a Poisson process with rate λ, i.e., X_1, …, X_n iid ~ Expo(λ).
  • The model has the p.d.f.
      p_λ(x) = ∏_{i=1}^n p_λ(x_i) = ∏_i λ e^{−λ x_i} 1{x_i > 0} = λ^n e^{−λ Σ_i x_i} 1{min_i x_i > 0}.
  • Ω = 𝒜 = (0, ∞).
  • Let S = Σ_i X_i and X̄ = S/n.
  • The MLE for λ is λ̂ = 1/X̄ = n/S.
  • S = Σ_{i=1}^n X_i has the Gamma(n, λ) distribution, and 1/S has the Inv-Gamma(n, λ) distribution with mean λ/(n − 1).
  • Hence E_λ[λ̂] = nλ/(n − 1): the MLE is biased for λ.
  • Then λ̃ := (n − 1)λ̂/n = (n − 1)/S is unbiased.
  • We also have var_λ(λ̃) < var_λ(λ̂).
  • It follows that MSE_λ(λ̃) < MSE_λ(λ̂) for all λ. The MLE λ̂ is inadmissible under quadratic loss.

[Figure omitted.]

Possible explanation: quadratic loss penalizes over-estimation more than under-estimation when Ω = (0, ∞).

Alternative loss function (Itakura–Saito distance):

  L(λ, a) = λ/a − 1 − log(λ/a),   a, λ ∈ (0, ∞).

  • With this loss function, R(λ, λ̃) > R(λ, λ̂) for all λ.
  • That is, the MLE renders λ̃ inadmissible.
  • This is an example of a Bregman divergence, with φ(x) = −log x. For a convex function φ : R^d → R, the Bregman divergence is defined as
      d_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩,
    the remainder of the first-order Taylor expansion of φ at y.

Details:
  • Consider δ_α(X) = α/S. Then we have
      R(λ, δ_α) − R(λ, δ_β) = n/α − n/β − [log(n/α) − log(n/β)].
  • Take α = n − 1 and β = n, so that δ_α = λ̃ and δ_β = λ̂.
  • Use log x − log y < x − y for x > y ≥ 1, with x = n/(n − 1) and y = 1; this gives R(λ, λ̃) − R(λ, λ̂) > 0.
    (The inequality follows from strict concavity of f(x) = log x: f(x) − f(y) < f′(y)(x − y) for y ≠ x.)

Sufficiency

Idea: separate the data into
  • parts that are relevant for estimating θ (sufficient), and
  • parts that are irrelevant (ancillary).

Benefits:
  • Data compression: efficient computation and storage.
  • Irrelevant parts can increase the risk (Rao–Blackwell).

Definition 4. Consider the model 𝒫 = {P_θ : θ ∈ Ω} for X. A statistic T = T(X) is sufficient for 𝒫 (or for θ, or for X) if the conditional distribution of X given T does not depend on θ. More precisely,

  P_θ(X ∈ A | T = t) = Q_t(A)   for all t, A,

for some Markov kernel Q. Making this fully precise requires measure theory. Intuitively, given T, we can simulate X using an external source of randomness.

Example 4 (Coin tossing)
  • X_i iid ~ Ber(θ), i = 1, …, n.
  • Notation: X = (X_1, …, X_n), x = (x_1, …, x_n).
  • We will show that T = T(X) = Σ_i X_i is sufficient for θ. (This should be intuitive.)
  • The joint PMF is
      P_θ(X = x) = p_θ(x) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} = θ^{T(x)} (1 − θ)^{n−T(x)}.
  • Then
      P_θ(X = x, T = t) = P_θ(X = x) 1{T(x) = t} = θ^t (1 − θ)^{n−t} 1{T(x) = t}.
  • Marginalizing,
      P_θ(T = t) = Σ_{x ∈ {0,1}^n} θ^t (1 − θ)^{n−t} 1{T(x) = t} = (n choose t) θ^t (1 − θ)^{n−t}.
  • Hence,
      P_θ(X = x | T = t) = θ^t (1 − θ)^{n−t} 1{T(x) = t} / [(n choose t) θ^t (1 − θ)^{n−t}] = 1{T(x) = t} / (n choose t).
  • What is the above (conditional) distribution? It is the uniform distribution over the binary sequences of length n with exactly t ones, and it does not depend on θ.
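To illustrate the "simulate X from T using external randomness" remark, here is a minimal sketch (not part of the slides; it assumes numpy). Given T = t, drawing a uniformly random 0/1 sequence with exactly t ones reproduces the conditional law of X without using θ; composing this with T ~ Bin(n, θ) recovers the original Bernoulli model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x_given_t(n, t, rng):
    """Draw X | T = t: a uniformly random 0/1 sequence of length n with exactly t ones.
    No value of theta is needed -- this is what sufficiency of T means."""
    x = np.zeros(n, dtype=int)
    ones = rng.choice(n, size=t, replace=False)   # positions of the t ones
    x[ones] = 1
    return x

# Sanity check: the two-stage scheme  T ~ Bin(n, theta), then X | T as above,
# should reproduce the original model  X_i iid Ber(theta).
n, theta, reps = 5, 0.3, 100_000
direct = rng.binomial(1, theta, size=(reps, n))
two_stage = np.array([sample_x_given_t(n, rng.binomial(n, theta), rng) for _ in range(reps)])

# Compare, e.g., the probability of the specific sequence (1, 0, 1, 0, 0) under both schemes.
pattern = np.array([1, 0, 1, 0, 0])
print((direct == pattern).all(axis=1).mean(), (two_stage == pattern).all(axis=1).mean())
```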
Factorization Theorem

It is not convenient to check sufficiency this way, hence:

Theorem 1 (Factorization, Fisher–Neyman). Assume that 𝒫 = {P_θ : θ ∈ Ω} is dominated by µ. A statistic T is sufficient iff, for some functions g_θ, h ≥ 0,

  p_θ(x) = g_θ(T(x)) h(x),   for µ-a.e. x.

In particular, the likelihood θ ↦ p_θ(X) then depends on X only through T(X). The family being dominated (having a density) is important.
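As a quick numerical illustration of the factorization (a sketch, not part of the slides; it assumes numpy): in the Bernoulli model p_θ(x) = θ^{T(x)} (1 − θ)^{n−T(x)}, so we may take g_θ(t) = θ^t (1 − θ)^{n−t} and h ≡ 1. Consequently, any two samples with the same value of T have identical likelihood functions in θ.

```python
import numpy as np

def bernoulli_likelihood(x, theta):
    """Joint PMF p_theta(x) = prod_i theta^{x_i} (1 - theta)^{1 - x_i}."""
    x = np.asarray(x)
    return float(np.prod(theta ** x * (1 - theta) ** (1 - x)))

def g(t, n, theta):
    """The factor g_theta(t) = theta^t (1 - theta)^(n - t); here h(x) = 1."""
    return theta ** t * (1 - theta) ** (n - t)

x1 = [1, 1, 0, 0, 0, 1]   # T(x1) = 3
x2 = [0, 1, 0, 1, 1, 0]   # T(x2) = 3: a different sample with the same sufficient statistic
for theta in (0.2, 0.5, 0.8):
    p1, p2 = bernoulli_likelihood(x1, theta), bernoulli_likelihood(x2, theta)
    # p1 == p2 == g_theta(T(x)) for every theta, as the factorization predicts.
    print(theta, p1, p2, np.isclose(p1, p2), np.isclose(p1, g(3, 6, theta)))
```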

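Returning to Example 3, the two risk orderings claimed there (λ̃ beats λ̂ under quadratic loss, while λ̂ beats λ̃ under the Itakura–Saito loss) can be checked by a small Monte Carlo simulation. This is a sketch, not part of the slides; it assumes numpy and uses n = 10 and λ = 2 as arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, lam = 10, 200_000, 2.0

# S = sum of n iid Expo(lam) inter-arrival times, i.e. S ~ Gamma(n, lam), simulated reps times.
S = rng.gamma(shape=n, scale=1.0 / lam, size=reps)
mle = n / S              # lambda_hat = n / S
unbiased = (n - 1) / S   # lambda_tilde = (n - 1) / S

def quad_risk(est):
    """Monte Carlo estimate of the quadratic risk E[(est - lam)^2]."""
    return ((est - lam) ** 2).mean()

def is_risk(est):
    """Monte Carlo estimate of the Itakura-Saito risk E[lam/est - 1 - log(lam/est)]."""
    r = lam / est
    return (r - 1 - np.log(r)).mean()

print("quadratic:     MLE", quad_risk(mle), " unbiased", quad_risk(unbiased))  # slides: unbiased smaller
print("Itakura-Saito: MLE", is_risk(mle),   " unbiased", is_risk(unbiased))    # slides: MLE smaller
```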