Lecture 5: Local Asymptotic Normality (LAN) and asymptotic theorems
Lecturer: Yanjun Han
April 12, 2021

Announcements
- Project proposal due April 18; submit via Gradescope. This applies only if you decide to do an original project.
- All other students must do a literature review, with the paper chosen by the end of week 6.
Today's plan

- Asymptotic theorems:
  - Fisher information
  - Fisher's program and Hodges' estimator
  - three asymptotic theorems
- Gaussian location model:
  - Anderson's lemma
  - reduction of LAN models to a Gaussian location model
  - examples of asymptotic lower bounds
Score function and Fisher information
Definition
A statistical model $(\mathcal{X}, (P_\theta)_{\theta \in \Theta \subseteq \mathbb{R}^d})$ is quadratic mean differentiable (QMD) at $\theta \in \Theta$ if there exists a score function $\dot\ell_\theta : \mathcal{X} \to \mathbb{R}^d$ such that
$$
\int_{\mathcal{X}} \left( \sqrt{dP_{\theta+h}} - \sqrt{dP_\theta} - \frac{1}{2} h^\top \dot\ell_\theta \sqrt{dP_\theta} \right)^2 = o(\|h\|^2).
$$
In this case, $I(\theta) = \mathbb{E}_{P_\theta}[\dot\ell_\theta \dot\ell_\theta^\top]$ exists and is the Fisher information at $\theta$.

Alternative definitions:
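As a quick numerical sanity check of the definition $I(\theta) = \mathbb{E}_{P_\theta}[\dot\ell_\theta \dot\ell_\theta^\top]$ (the $\mathcal{N}(\theta, 1)$ model here is our illustrative choice, not part of the slides), the score of the one-dimensional Gaussian location model is $\dot\ell_\theta(x) = x - \theta$, and averaging $\dot\ell_\theta(X)^2$ over Monte Carlo draws recovers $I(\theta) = 1$:

```python
import numpy as np

# Monte Carlo check of I(theta) = E[score^2] for the N(theta, 1) model,
# whose score function is score(x) = x - theta (an illustrative example).
rng = np.random.default_rng(0)
theta = 2.0
x = rng.normal(theta, 1.0, size=500_000)

score = x - theta                 # score function of N(theta, 1)
fisher_mc = np.mean(score ** 2)   # estimates I(theta) = E[score^2]

print(round(fisher_mc, 2))
```

The same recipe applies to any QMD model once the score function is known in closed form.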
Cramér-Rao lower bound
Cramér-Rao lower bound
For any unbiased estimator $T$, i.e. $\mathbb{E}_\theta[T] = \theta$ for every $\theta \in \Theta$, it holds that
$$
\mathrm{Cov}_\theta(T) \succeq I(\theta)^{-1}, \quad \forall \theta \in \Theta.
$$
Proof via $\chi^2$-divergence:
$$
\chi^2(P, Q) = \sup_h \frac{(\mathbb{E}_P[h] - \mathbb{E}_Q[h])^2}{\mathrm{Var}_Q(h)}
$$
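As a numerical illustration of the bound (the $\mathcal{N}(\theta, 1)$ model is again our choice), the sample mean of $n$ i.i.d. $\mathcal{N}(\theta, 1)$ draws is unbiased with variance exactly $1/n = (n\,I(\theta))^{-1}$, so it attains the Cramér-Rao bound:

```python
import numpy as np

# The sample mean in the N(theta, 1) model attains the Cramér-Rao bound:
# it is unbiased and its variance equals 1/(n * I(theta)) = 1/n.
rng = np.random.default_rng(0)
n, reps = 50, 200_000
theta = 1.5

xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)

var_mc = xbar.var()   # Monte Carlo variance of the estimator
crlb = 1.0 / n        # per-sample I(theta) = 1, so the bound is 1/n

print(round(var_mc, 3), crlb)
```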
Fisher's program
Setting throughout this lecture:
- $(P_\theta)_{\theta \in \Theta \subseteq \mathbb{R}^d}$: QMD statistical model with Fisher information $I(\theta)$
- $T_n$: estimator sequence under the product model $(P_\theta^{\otimes n})$
- $\psi(\theta)$: a generic real-valued differentiable function of $\theta$
Fisher’s program:
- for any asymptotically normal estimators $T_n$ with
$$
\sqrt{n}(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma_\theta)
$$
for every $\theta \in \Theta$, it holds that $\Sigma_\theta \succeq I(\theta)^{-1}$;
- the MLE satisfies $\Sigma_\theta = I(\theta)^{-1}$ for every $\theta$.
Counterexample: Hodges' estimator
Gaussian location model: $X_1, \ldots, X_n \sim \mathcal{N}(\theta, 1)$, with $\theta \in \mathbb{R}$.
Hodges' estimator:
$$
\hat\theta_n = \begin{cases} \bar X & \text{if } |\bar X| \ge n^{-1/4}, \\ 0 & \text{if } |\bar X| < n^{-1/4}. \end{cases}
$$
asymptotic normality:
$$
\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} \begin{cases} 0 & \text{if } \theta = 0, \\ \mathcal{N}(0, 1) & \text{if } \theta \neq 0. \end{cases}
$$
“superefficiency”
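The superefficiency at $\theta = 0$, and the price paid near the threshold, can be seen in a short simulation (sample size, seed, and the evaluation points are our choices). Sampling $\bar X \sim \mathcal{N}(\theta, 1/n)$ directly, we compare the normalized risk $n\,\mathbb{E}[(\hat\theta_n - \theta)^2]$ at $\theta = 0$, at $\theta = n^{-1/4}$, and at $\theta = 1$:

```python
import numpy as np

def hodges(xbar, n):
    """Hodges' estimator: keep the sample mean unless it falls below n^{-1/4}."""
    return np.where(np.abs(xbar) >= n ** -0.25, xbar, 0.0)

rng = np.random.default_rng(0)
n, reps = 10_000, 100_000

risks = {}
for theta in (0.0, n ** -0.25, 1.0):
    # X-bar is N(theta, 1/n), so sample it directly instead of n draws each
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
    risks[theta] = n * np.mean((hodges(xbar, n) - theta) ** 2)  # normalized risk

print({t: round(r, 2) for t, r in risks.items()})
```

The risk is essentially 0 at $\theta = 0$ and about 1 at $\theta = 1$ (matching the displayed limits), but it blows up near $\theta = n^{-1/4}$, which is exactly the failure mode that the local asymptotic minimax theorem below rules out.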
First fix: almost everywhere convolution theorem

First fix: show that Fisher's program is true almost everywhere.

Almost everywhere convolution theorem
If $\sqrt{n}(T_n - \psi(\theta))$ converges in distribution to some probability measure $L_\theta$ for every $\theta$, then there exists some probability measure $M_\theta$ such that
$$
L_\theta = \mathcal{N}(0, \dot\psi(\theta)^\top I(\theta)^{-1} \dot\psi(\theta)) * M_\theta
$$
for Lebesgue almost every $\theta$, where $*$ denotes convolution.
Second fix: convolution theorem

Second fix: restrict to a family of regular estimators.

Convolution theorem
If $\sqrt{n}(T_n - \psi(\theta))$ converges in distribution to some probability measure $L_\theta$ for every $\theta$, and $T_n$ is regular in the sense that
$$
\sqrt{n}\left( T_n - \psi\left(\theta + \frac{h}{\sqrt{n}}\right) \right) \xrightarrow{d} L_\theta
$$
for any $h \in \mathbb{R}^d$ under $P_{\theta + h/\sqrt{n}}^{\otimes n}$, then there exists some probability measure $M_\theta$ such that
$$
L_\theta = \mathcal{N}(0, \dot\psi(\theta)^\top I(\theta)^{-1} \dot\psi(\theta)) * M_\theta
$$
for every $\theta$, where $*$ denotes convolution.
Third fix: local asymptotic minimax theorem

Third fix: local minimax instead of a single point.

Local asymptotic minimax theorem
For every bowl-shaped loss function $\ell(\cdot)$, i.e. $\ell$ symmetric and quasi-convex, and any sequence of estimators $(T_n)$,
$$
\lim_{c \to \infty} \liminf_{n \to \infty} \sup_{\|h\| \le c} \mathbb{E}_{\theta + h/\sqrt{n}}\left[ \ell\left( \sqrt{n}\left( T_n - \psi\left(\theta + \frac{h}{\sqrt{n}}\right) \right) \right) \right] \ge \mathbb{E}[\ell(Z)],
$$
with $Z \sim \mathcal{N}(0, \dot\psi(\theta)^\top I(\theta)^{-1} \dot\psi(\theta))$.
Gaussian location model

- model: $X \sim \mathcal{N}(\theta, \Sigma)$
- unknown mean $\theta \in \mathbb{R}^d$
- known covariance $\Sigma$
Theorem (Anderson)
For any bowl-shaped loss function $\ell$, it holds that
$$
\inf_{\hat\theta} \sup_{\theta \in \mathbb{R}^d} \mathbb{E}_\theta[\ell(\hat\theta - \theta)] = \mathbb{E}[\ell(Z)]
$$
with $Z \sim \mathcal{N}(0, \Sigma)$.
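A small simulation illustrates one side of the theorem (dimension, covariance, and the choice of squared-error loss are ours). Taking the bowl-shaped loss $\ell(z) = \|z\|^2$, for which $\mathbb{E}[\ell(Z)] = \mathrm{tr}(\Sigma)$, the natural estimator $\hat\theta = X$ has constant risk equal to this value at every $\theta$:

```python
import numpy as np

# Check that theta_hat = X has risk E[l(Z)] = tr(Sigma) under l(z) = ||z||^2.
# Dimension, Sigma, and theta below are illustrative choices, not from the slides.
rng = np.random.default_rng(0)
d, reps = 3, 200_000
Sigma = np.diag([1.0, 2.0, 0.5])
theta = np.array([0.3, -1.0, 2.0])   # the risk of X is the same for every theta

X = rng.multivariate_normal(theta, Sigma, size=reps)
risk = np.mean(np.sum((X - theta) ** 2, axis=1))  # E_theta ||X - theta||^2

print(round(risk, 2))
```

Anderson's lemma supplies the matching lower bound: no estimator can beat this constant risk uniformly.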
Proof of theorem

consider a Gaussian prior $\theta \sim \mathcal{N}(0, \Sigma_0)$; then