
Lecture 5: Local Asymptotic Normality (LAN) and asymptotic theorems

Lecturer: Yanjun Han

April 12, 2021

Announcements

- Project proposal due April 18 (submit via Gradescope), only if you decide to do an original project
- all other students must do a literature review, with the paper chosen by the end of week 6

Today's plan

Asymptotic theorems:
- Fisher's program and Hodges' estimator
- three asymptotic theorems
- Gaussian location model, Anderson's lemma
- reduction of LAN models to a Gaussian location model
- examples of asymptotic lower bounds

Score function and Fisher information

Definition

A statistical model $(\mathcal{X}, (P_\theta)_{\theta \in \Theta \subseteq \mathbb{R}^d})$ is quadratic mean differentiable (QMD) at $\theta \in \Theta$ if there exists a score function $\dot\ell_\theta : \mathcal{X} \to \mathbb{R}^d$ such that

$$\int_{\mathcal{X}} \left( \sqrt{dP_{\theta+h}} - \sqrt{dP_\theta} - \frac{1}{2} h^\top \dot\ell_\theta \sqrt{dP_\theta} \right)^2 = o(\|h\|^2).$$

In this case, $I(\theta) = \mathbb{E}_{P_\theta}[\dot\ell_\theta \dot\ell_\theta^\top]$ exists and is the Fisher information at $\theta$.

Alternative definitions:
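Two standard examples for concreteness: in the Gaussian location model $P_\theta = \mathcal{N}(\theta, 1)$ the score is $\dot\ell_\theta(x) = x - \theta$ and $I(\theta) = 1$; in the Bernoulli model $P_\theta = \mathrm{Bern}(\theta)$ the score is $\dot\ell_\theta(x) = \frac{x - \theta}{\theta(1 - \theta)}$ and $I(\theta) = \frac{1}{\theta(1 - \theta)}$.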

Cramér–Rao lower bound

Cramér–Rao lower bound
For any unbiased estimator $T$, i.e. $\mathbb{E}_\theta[T] = \theta$ for every $\theta \in \Theta$, it holds that

$$\mathrm{Cov}_\theta(T) \succeq I(\theta)^{-1}, \quad \forall \theta \in \Theta.$$

Proof via the $\chi^2$-divergence:

$$\chi^2(P, Q) = \sup_h \frac{\big(\mathbb{E}_P[h] - \mathbb{E}_Q[h]\big)^2}{\mathrm{Var}_Q(h)}$$
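A sketch of how this variational formula gives the bound (assuming additionally that the local expansion $\chi^2(P_{\theta+u}, P_\theta) = u^\top I(\theta)\, u + o(\|u\|^2)$ holds, which requires a bit more regularity than QMD alone): take $P = P_{\theta+u}$, $Q = P_\theta$, and the test function $a^\top T$ for a fixed $a \in \mathbb{R}^d$; unbiasedness gives

$$(a^\top u)^2 = \big(\mathbb{E}_{\theta+u}[a^\top T] - \mathbb{E}_\theta[a^\top T]\big)^2 \le \chi^2(P_{\theta+u}, P_\theta) \cdot a^\top \mathrm{Cov}_\theta(T)\, a,$$

so writing $u = t v$ and letting $t \to 0$,

$$a^\top \mathrm{Cov}_\theta(T)\, a \ge \sup_{v \ne 0} \frac{(a^\top v)^2}{v^\top I(\theta)\, v} = a^\top I(\theta)^{-1} a,$$

which is the matrix inequality $\mathrm{Cov}_\theta(T) \succeq I(\theta)^{-1}$.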

Fisher's program

Setting throughout this lecture:

- $(P_\theta)_{\theta \in \Theta \subseteq \mathbb{R}^d}$: QMD statistical model with Fisher information $I(\theta)$
- $T_n$: estimator sequence under the product model $(P_\theta^{\otimes n})$
- $\psi(\theta)$: a generic real-valued differentiable function of $\theta$

Fisher’s program:

for any asymptotically normal $T_n$ with

$$\sqrt{n}\,(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma_\theta)$$

for every $\theta \in \Theta$, it holds that $\Sigma_\theta \succeq I(\theta)^{-1}$;
the MLE satisfies $\Sigma_\theta = I(\theta)^{-1}$ for every $\theta$.

Counterexample: Hodges' estimator

Gaussian location model: $X_1, \cdots, X_n \sim \mathcal{N}(\theta, 1)$, with $\theta \in \mathbb{R}$

Hodges' estimator:

$$\hat\theta_n = \begin{cases} \bar X & \text{if } |\bar X| \ge n^{-1/4}, \\ 0 & \text{if } |\bar X| < n^{-1/4}. \end{cases}$$

asymptotic normality:

$$\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} \begin{cases} 0 & \text{if } \theta = 0, \\ \mathcal{N}(0, 1) & \text{if } \theta \ne 0. \end{cases}$$

“superefficiency”
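A minimal simulation sketch of this phenomenon (illustrative only; the sample size, the grid of $\theta$ values, and the function name `hodges_risk` are arbitrary choices, not from the lecture): the rescaled risk $n \cdot \mathbb{E}_\theta(\hat\theta_n - \theta)^2$ collapses at $\theta = 0$ and matches the $\mathcal{N}(0,1)$ limit away from $0$, but blows up at local alternatives of order $n^{-1/4}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def hodges_risk(theta, n, reps=200_000):
    """Monte Carlo estimate of n * E_theta (theta_hat - theta)^2 for Hodges' estimator."""
    # the sample mean of n i.i.d. N(theta, 1) observations is distributed as N(theta, 1/n)
    x_bar = theta + rng.standard_normal(reps) / np.sqrt(n)
    theta_hat = np.where(np.abs(x_bar) >= n ** (-0.25), x_bar, 0.0)
    return n * np.mean((theta_hat - theta) ** 2)

n = 10_000
print(hodges_risk(0.0, n))                 # ~0: "superefficient" at theta = 0
print(hodges_risk(1.0, n))                 # ~1: agrees with the N(0, 1) limit for theta != 0
print(hodges_risk(0.5 * n ** (-0.25), n))  # >> 1: large risk at a local alternative near 0
```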

First fix: almost everywhere convolution theorem

First fix: show that Fisher's program is true almost everywhere.

Almost everywhere convolution theorem
If $\sqrt{n}(T_n - \psi(\theta))$ converges in distribution to some measure $L_\theta$ for every $\theta$, then there exists some probability measure $M_\theta$ such that

$$L_\theta = \mathcal{N}\big(0, \dot\psi(\theta)^\top I(\theta)^{-1} \dot\psi(\theta)\big) * M_\theta$$

for Lebesgue almost every $\theta$, where $*$ denotes convolution.

Second fix: convolution theorem

Second fix: restrict to a family of regular estimators.

Convolution theorem
If $\sqrt{n}(T_n - \psi(\theta))$ converges in distribution to some probability measure $L_\theta$ for every $\theta$, and $T_n$ is regular in the sense that

$$\sqrt{n}\left(T_n - \psi\left(\theta + \frac{h}{\sqrt{n}}\right)\right) \xrightarrow{d} L_\theta$$

for any $h \in \mathbb{R}^d$ under $P_{\theta + h/\sqrt{n}}^{\otimes n}$, then there exists some probability measure $M_\theta$ such that

$$L_\theta = \mathcal{N}\big(0, \dot\psi(\theta)^\top I(\theta)^{-1} \dot\psi(\theta)\big) * M_\theta$$

for every $\theta$, where $*$ denotes convolution.

Third fix: local asymptotic minimax theorem

Third fix: local minimax instead of a single point.

Local asymptotic minimax theorem
For every bowl-shaped loss $\ell(\cdot)$, i.e. $\ell$ symmetric and quasi-convex, and for any sequence of estimators $(T_n)$,

$$\lim_{c \to \infty} \liminf_{n \to \infty} \sup_{\|h\| \le c} \mathbb{E}_{\theta + h/\sqrt{n}}\left[\ell\left(\sqrt{n}\left(T_n - \psi\left(\theta + \frac{h}{\sqrt{n}}\right)\right)\right)\right] \ge \mathbb{E}[\ell(Z)],$$

with $Z \sim \mathcal{N}(0, \dot\psi_\theta^\top I(\theta)^{-1} \dot\psi_\theta)$.

Gaussian location model

model: $X \sim \mathcal{N}(\theta, \Sigma)$, unknown mean $\theta \in \mathbb{R}^d$, known $\Sigma$

Theorem (Anderson)
For any bowl-shaped loss function $\ell$, it holds that

$$\inf_{\hat\theta} \sup_{\theta \in \mathbb{R}^d} \mathbb{E}_\theta[\ell(\hat\theta - \theta)] = \mathbb{E}[\ell(Z)]$$

with $Z \sim \mathcal{N}(0, \Sigma)$.

Proof of theorem

consider a Gaussian prior $\theta \sim \mathcal{N}(0, \Sigma_0)$; then

$$\theta \mid X = x \sim \mathcal{N}\big((\Sigma_0^{-1} + \Sigma^{-1})^{-1} \Sigma^{-1} x,\ (\Sigma_0^{-1} + \Sigma^{-1})^{-1}\big)$$

Bayes estimator:

$$\hat\theta(x) = \arg\min_{z \in \mathbb{R}^d} \mathbb{E}_{p(\theta \mid X = x)}[\ell(z - \theta)]$$

use Anderson's lemma and choose $\Sigma_0 = \lambda I$ with $\lambda \to \infty$

Anderson’s lemma Let Z ∼ N (0, Σ) and ` be bowl-shaped. Then

min E[`(Z + x)] = E[`(Z)]. x∈Rd
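A quick Monte Carlo sanity check of the lemma (a sketch; the covariance matrix, the shift points, and the bowl-shaped loss $\ell(z) = \|z\|$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                      # any positive definite covariance
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=500_000)

def shifted_risk(x):
    """Monte Carlo estimate of E[ l(Z + x) ] for the bowl-shaped loss l(z) = ||z||."""
    return np.linalg.norm(Z + x, axis=1).mean()

print(shifted_risk(np.zeros(2)))           # smallest value, as Anderson's lemma predicts
print(shifted_risk(np.array([1.0, 0.0])))
print(shifted_risk(np.array([0.0, -2.0])))
```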

Proof of Anderson's lemma

Theorem (Prékopa–Leindler)
Let $\lambda \in (0, 1)$, and $f, g, h$ be three non-negative real-valued functions on $\mathbb{R}^d$. If

$$h(\lambda x + (1 - \lambda) y) \ge f(x)^\lambda g(y)^{1 - \lambda}, \quad \forall x, y \in \mathbb{R}^d,$$

then

$$\int_{\mathbb{R}^d} h(x)\,dx \ge \left(\int_{\mathbb{R}^d} f(x)\,dx\right)^\lambda \left(\int_{\mathbb{R}^d} g(x)\,dx\right)^{1 - \lambda}.$$

let $K$ be any symmetric convex set, and $\varphi$ be the pdf of $\mathcal{N}(0, \Sigma)$; apply the above theorem with $\lambda = 1/2$ to

$$f(z) = \varphi(z) \cdot 1_{K - x}(z), \quad g(z) = \varphi(z) \cdot 1_{K + x}(z), \quad h(z) = \varphi(z) \cdot 1_K(z)$$

(the hypothesis holds since $\frac{z_1 + z_2}{2} \in K$ whenever $z_1 \in K - x$ and $z_2 \in K + x$, and $\varphi$ is log-concave)

consequently, $\mathbb{P}(Z \in K - x) \le \mathbb{P}(Z \in K)$ for any $x \in \mathbb{R}^d$

final step: $\mathbb{E}[\ell(Z + x)] = \int_0^\infty \big(1 - \mathbb{P}(Z + x \in \{z : \ell(z) \le t\})\big)\,dt$

Local asymptotic normality (LAN)

Definition (Local asymptotic normality)
A sequence of models $(\mathcal{X}_n, (P_{n,h})_{h \in \Theta_n})$ with $\Theta_n \uparrow \mathbb{R}^d$ is called locally asymptotically normal (LAN) if

$$\log \frac{dP_{n,h}}{dP_{n,0}} = h^\top Z_n - \frac{1}{2} h^\top I_0 h + o_{P_{n,0}}(1),$$

with central sequence $Z_n \xrightarrow{d} \mathcal{N}(0, I_0)$ under $P_{n,0}$.

Comparison with the Gaussian location model:

$$\log \frac{f_{(h, I_0^{-1})}(x)}{f_{(0, I_0^{-1})}(x)} = h^\top I_0 x - \frac{1}{2} h^\top I_0 h,$$

where $f_{(\mu, \Sigma)}$ denotes the density of $\mathcal{N}(\mu, \Sigma)$, and $I_0 x \sim \mathcal{N}(0, I_0)$ under $x \sim \mathcal{N}(0, I_0^{-1})$.
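As an illustration of the LAN expansion (a sketch for the Bernoulli model at an arbitrary $\theta_0$ and local parameter $h$; not part of the slides), the exact log-likelihood ratio of the local product model can be compared numerically with $h Z_n - \frac{1}{2} h^2 I_0$, where $Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot\ell_{\theta_0}(X_i)$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, h, n = 0.3, 1.5, 100_000
I0 = 1.0 / (theta0 * (1.0 - theta0))             # Fisher information of Bern(theta0)

x = rng.binomial(1, theta0, size=n)              # a sample from P_{n,0} = Bern(theta0)^n
theta1 = theta0 + h / np.sqrt(n)                 # local alternative theta0 + h / sqrt(n)

# exact log-likelihood ratio log dP_{n,h} / dP_{n,0}
log_lr = np.sum(x * np.log(theta1 / theta0) + (1 - x) * np.log((1 - theta1) / (1 - theta0)))

# LAN prediction h * Z_n - h^2 * I_0 / 2, with the central sequence Z_n built from the score
score = (x - theta0) / (theta0 * (1.0 - theta0))
Zn = score.sum() / np.sqrt(n)
print(log_lr, h * Zn - 0.5 * h ** 2 * I0)        # the two numbers should nearly coincide
```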

Likelihood ratio criterion

Theorem

Let $\mathcal{M}_n = (\mathcal{X}_n, \{P_{n,0}, \cdots, P_{n,m}\})$ and $\mathcal{M} = (\mathcal{X}, \{P_0, \cdots, P_m\})$ be finite statistical models. Define the likelihood ratios

$$L_{n,i}(x_n) = \frac{dP_{n,i}}{dP_{n,0}}(x_n), \quad L_i(x) = \frac{dP_i}{dP_0}(x), \quad i \in [m].$$

If $\mathcal{M}$ is further homogeneous (all $P_i$ mutually absolutely continuous), and $(L_{n,1}, \cdots, L_{n,m})$ under $x_n \sim P_{n,0}$ converges in distribution to $(L_1, \cdots, L_m)$ under $x \sim P_0$, then

$$\lim_{n \to \infty} \Delta(\mathcal{M}_n, \mathcal{M}) = 0,$$

where $\Delta$ denotes Le Cam's distance.

weak convergence of likelihood ratios =⇒ convergence of statistical models

High-level idea: standard model

$$(\mathcal{X}, \{P_1, \cdots, P_m\}) \iff (\mathcal{S}_m, \{Q_1, \cdots, Q_m\})$$
(finite model)  (standard model)

- $\mathcal{S}_m = \{(t_1, \cdots, t_m) \in \mathbb{R}_+^m : \sum_{i=1}^m t_i = m\}$
- $Q_i(dt) = t_i\, \mu(dt)$
- with the correspondence $(t_1, \cdots, t_m) = \big(\frac{dP_1}{d\bar P}(x), \cdots, \frac{dP_m}{d\bar P}(x)\big)$
- $\mu$ is the distribution of $\big(\frac{dP_1}{d\bar P}(x), \cdots, \frac{dP_m}{d\bar P}(x)\big)$ under $x \sim \bar P$, where $\bar P = \frac{1}{m} \sum_{i=1}^m P_i$

Local QMD model is LAN

Theorem

Let $(\mathcal{X}, (P_\theta)_{\theta \in \Theta \subseteq \mathbb{R}^d})$ be QMD with Fisher information $I(\theta)$. Then the localized product model $(\mathcal{X}^n, (P_{\theta + h/\sqrt{n}}^{\otimes n})_{h \in \Theta_n})$ with

$$\Theta_n = \left\{h \in \mathbb{R}^d : \theta + \frac{h}{\sqrt{n}} \in \Theta\right\} \uparrow \mathbb{R}^d$$

satisfies the LAN condition with Fisher information $I(\theta)$.

Corollary: the above model converges to a Gaussian location model under Le Cam’s distance.

Proof of local asymptotic minimax (LAM) theorem

- product model: aims to estimate $\psi(\theta + h/\sqrt{n})$ under $(P_{\theta + h/\sqrt{n}}^{\otimes n})_{h \in \mathbb{R}^d}$
- limiting Gaussian model: aims to estimate

$$\psi\left(\theta + \frac{h}{\sqrt{n}}\right) = \psi(\theta) + \frac{1}{\sqrt{n}} \dot\psi_\theta^\top h + o(n^{-1/2})$$

under $X \sim \mathcal{N}(h, I(\theta)^{-1})$
- Anderson's lemma then gives the local asymptotic minimax theorem

Diagram

local QMD model
→ (local expansion of the likelihood ratio)
LAN condition
→ (weak convergence of the likelihood ratio)
Gaussian location model
→ (Anderson's lemma)
local asymptotic minimax theorem

Example I: bias of the coin

Problem: $X_1, \cdots, X_n \sim \mathrm{Bern}(p)$, $p \in [0, 1]$, $L(p, T) = (T - \sqrt{p})^2$

correspondence:

$$I(p) = \frac{1}{p(1 - p)}, \quad \psi(p) = \sqrt{p}, \quad \ell(z) = z^2$$

LAM: for every $p_0 \in [0, 1]$,

$$\lim_{h \to \infty} \liminf_{n \to \infty} \sup_{|p - p_0| \le h/\sqrt{n}} n \cdot \mathbb{E}_p(T_n - \sqrt{p})^2 \ge \frac{p_0(1 - p_0)}{(2\sqrt{p_0})^2} = \frac{1 - p_0}{4}$$

translate into a global minimax lower bound:

$$\inf_{T_n} \sup_{p \in [0, 1]} \mathbb{E}_p(T_n - \sqrt{p})^2 \ge \frac{1 + o_n(1)}{4n}$$
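A small numerical sketch (illustrative; the sample size, grid of $p$ values, and number of replications are arbitrary choices) comparing the rescaled risk $n \cdot \mathbb{E}_p(T_n - \sqrt{p})^2$ of the plug-in estimator $T_n = \sqrt{\bar X}$ with the pointwise benchmark $(1 - p)/4$; the global bound above comes from the worst case near $p = 0$:

```python
import numpy as np

rng = np.random.default_rng(3)

def plugin_risk(p, n, reps=200_000):
    """Monte Carlo estimate of n * E_p (sqrt(X_bar) - sqrt(p))^2 for the plug-in estimator."""
    x_bar = rng.binomial(n, p, size=reps) / n
    return n * np.mean((np.sqrt(x_bar) - np.sqrt(p)) ** 2)

n = 5_000
for p in (0.1, 0.25, 0.5, 0.9):
    # the plug-in estimator essentially attains the local benchmark (1 - p) / 4 for fixed p > 0
    print(p, plugin_risk(p, n), (1 - p) / 4)
```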

Example II: entropy estimation

(differential) entropy:

$$h(f) = -\int_{\mathbb{R}^d} f(x) \log f(x)\,dx$$

Problem: $X_1, \cdots, X_n \sim f$, $L(f, T) = (T - h(f))^2$

- one-dimensional submodel: restrict to $f = f_0 + t g$, with $t \in [-\varepsilon, \varepsilon]$
- constraint on $g$: $\int_{\mathbb{R}^d} g(x)\,dx = 0$
- correspondence:

$$I(0) = \int_{\mathbb{R}^d} \frac{g(x)^2}{f_0(x)}\,dx, \quad \psi'(0) = \int_{\mathbb{R}^d} g(x)(1 - \log f_0(x))\,dx$$

LAM:

$$\inf_{\hat T} \sup_f \mathbb{E}_f(\hat T - h(f))^2 \ge \frac{1 + o_n(1)}{n} \sup_g \frac{\psi'(0)^2}{I(0)}$$

Least favorable one-dimensional submodel

choose g to maximize

$$V(g) \triangleq \left(\int_{\mathbb{R}^d} \frac{g(x)^2}{f_0(x)}\,dx\right)^{-1} \left(\int_{\mathbb{R}^d} g(x)(1 - \log f_0(x))\,dx\right)^2$$

by Cauchy–Schwarz:

$$\begin{aligned}
V(g) &= \inf_c \left(\int_{\mathbb{R}^d} \frac{g(x)^2}{f_0(x)}\,dx\right)^{-1} \left(\int_{\mathbb{R}^d} g(x)(c - \log f_0(x))\,dx\right)^2 \\
&\le \inf_c \int_{\mathbb{R}^d} f_0(x)(c - \log f_0(x))^2\,dx \\
&= \int_{\mathbb{R}^d} f_0(x) \log^2 f_0(x)\,dx - h(f_0)^2
\end{aligned}$$

(the first equality uses $\int g = 0$, the inequality is Cauchy–Schwarz applied to $g/\sqrt{f_0}$ and $\sqrt{f_0}\,(c - \log f_0)$, and the last infimum is attained at $c = -h(f_0)$)

equality is attained at $g(x) = f_0(x)(-\log f_0(x) - h(f_0))$
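A numeric sanity check of this choice of $g$ (a sketch for the concrete case $f_0 = \mathrm{Exp}(1)$ on $[0, \infty)$, where $h(f_0) = 1$ and $\int f_0 \log^2 f_0 = 2$, so both sides should equal $1$; the grid and cutoff are arbitrary):

```python
import numpy as np

# f0 = Exp(1) density on [0, inf); truncate the integrals at a large cutoff
x = np.linspace(1e-8, 40.0, 400_001)
dx = x[1] - x[0]

def integrate(y):
    """Riemann-sum approximation of the integral of y over the grid x."""
    return np.sum(y) * dx

f0 = np.exp(-x)
log_f0 = -x

h_f0 = integrate(-f0 * log_f0)                    # differential entropy, = 1 for Exp(1)
g = f0 * (-log_f0 - h_f0)                         # candidate least favorable direction

I0 = integrate(g ** 2 / f0)                       # Fisher information of the submodel at t = 0
dpsi = integrate(g * (1.0 - log_f0))              # derivative of the entropy functional at t = 0
V = dpsi ** 2 / I0

upper = integrate(f0 * log_f0 ** 2) - h_f0 ** 2   # the Cauchy-Schwarz upper bound above
print(V, upper)                                   # both should be close to 1.0
```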

Limitations of classical asymptotics

- asymptotic vs. non-asymptotic
- parametric vs. non-parametric
- local vs. global

References

Lucien M. Le Cam, “Limits of experiments.” Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972.
Lucien M. Le Cam, “Asymptotic methods in statistical decision theory.” Springer, New York, 1986.
Aad W. van der Vaart, “Asymptotic statistics.” Vol. 3. Cambridge University Press, 2000.
Thomas B. Berrett, Richard J. Samworth, and Ming Yuan, “Efficient multivariate entropy estimation via k-nearest neighbour distances.” The Annals of Statistics 47.1 (2019): 288-318.

Next lecture: statistical-computational tradeoff (Jay)
