Stat210B: Theoretical Statistics Lecture Date: April 19, 2007

Lecture 27: Optimal Estimators and Functional Delta Method

Lecturer: Michael I. Jordan Scribe: Guilherme V. Rocha

1 Achieving Optimal Estimators

From the last class, we know that the best limiting distribution we can hope for when estimating a parameter $\psi(\theta)$ is $N(0, \dot\psi_\theta I_\theta^{-1} \dot\psi_\theta^T)$. The next question to ask is whether this bound can be achieved. This is the theme of our next result:

Lemma 1. (Lemma 8.14 in van der Vaart, 1998) Assume that the experiment $(P_\theta : \theta \in \Theta)$ is differentiable in quadratic mean at $\theta_0$ with non-singular Fisher information matrix $I_\theta$. Let $\psi$ be differentiable at $\theta_0$. Let $T_n$ be a sequence in the experiments $(P_\theta^n : \theta \in \mathbb{R}^k)$ such that
$$\sqrt{n}\,(T_n - \psi(\theta)) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot\psi_\theta I_\theta^{-1} \dot\ell_\theta(X_i) + o_{P_\theta}(1);$$
then $T_n$ is the best regular estimator for $\psi(\theta)$ at $\theta$. Conversely, every best regular estimator sequence satisfies this expansion.

Proof. Let
$$\Delta_{n,\theta} := \frac{1}{\sqrt{n}} \sum_{i=1}^n \dot\ell_\theta(X_i).$$
We know that $\Delta_{n,\theta}$ converges in distribution to a $\Delta_\theta$ with a $N(0, I_\theta)$ distribution. From Theorem 7.2 in van der Vaart (1998) we know that:
$$\log \frac{dP^n_{\theta + h/\sqrt{n}}}{dP^n_\theta} = h^T \Delta_{n,\theta} - \frac{1}{2} h^T I_\theta h + o_{P_\theta}(1).$$

Using Slutsky’s theorem, we get:

$$\begin{pmatrix} \sqrt{n}\,(T_n - \psi(\theta)) \\ \log \dfrac{dP^n_{\theta + h/\sqrt{n}}}{dP^n_\theta} \end{pmatrix} \rightsquigarrow \begin{pmatrix} \dot\psi_\theta I_\theta^{-1} \Delta_\theta \\ h^T \Delta_\theta - \tfrac{1}{2} h^T I_\theta h \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ -\tfrac{1}{2} h^T I_\theta h \end{pmatrix}, \begin{pmatrix} \dot\psi_\theta I_\theta^{-1} \dot\psi_\theta^T & \dot\psi_\theta h \\ h^T \dot\psi_\theta^T & h^T I_\theta h \end{pmatrix} \right).$$

Using Le Cam's third lemma we can conclude that, under $\theta + \frac{h}{\sqrt{n}}$, the sequence $\sqrt{n}(T_n - \psi(\theta))$ converges in distribution to $N(\dot\psi_\theta h, \dot\psi_\theta I_\theta^{-1} \dot\psi_\theta^T)$. Since $\psi$ is differentiable, we have that $\sqrt{n}\left(\psi(\theta + \frac{h}{\sqrt{n}}) - \psi(\theta)\right) \to \dot\psi_\theta h$ as $n \to \infty$. We conclude that, under $\theta + \frac{h}{\sqrt{n}}$, the limiting distribution of $\sqrt{n}\left(T_n - \psi(\theta + \frac{h}{\sqrt{n}})\right)$ does not involve $h$; that is, $T_n$ is regular.


To prove the converse, let $T_n$ and $S_n$ be two best regular estimator sequences. Along subsequences, it can be shown that:

$$\begin{pmatrix} \sqrt{n}\left(S_n - \psi\left(\theta + \frac{h}{\sqrt{n}}\right)\right) \\ \sqrt{n}\left(T_n - \psi\left(\theta + \frac{h}{\sqrt{n}}\right)\right) \end{pmatrix} \xrightarrow{\;\theta + h/\sqrt{n}\;} \begin{pmatrix} S - \dot\psi_\theta h \\ T - \dot\psi_\theta h \end{pmatrix}$$

for a randomized estimator $(S, T)$ in the limiting experiment. Because $S_n$ and $T_n$ are best regular, $S$ and $T$ are best equivariant-in-law. Thus $S = T = \dot\psi_\theta X$ almost surely and, as a result, $\sqrt{n}(S_n - T_n)$ converges in distribution to $S - T = 0$. As a result, every two best regular estimator sequences are asymptotically equivalent. To get the result, apply this conclusion to $T_n$ and:

$$S_n = \psi(\theta) + \frac{1}{\sqrt{n}} \dot\psi_\theta I_\theta^{-1} \Delta_{n,\theta}.$$

Remarks on Lemma 8.14:

• From Theorem 5.39, a Maximum Likelihood Estimator $\hat\theta_n$ satisfies
$$\sqrt{n}\left(\hat\theta_n - \theta\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^n I_\theta^{-1} \dot\ell_\theta(X_i) + o_{P_\theta}(1)$$
under regularity conditions. It follows that MLEs are asymptotically efficient (a worked example follows this list). This result can be extended to transforms $\psi(\hat\theta_n)$ of the MLE for a differentiable $\psi$ by using the delta method and observing that $\psi(\hat\theta_n)$ satisfies the expansion in Lemma 8.14.

• Lemma 8.14 suggests that tests constructed from Rao score functions are asymptotically efficient.
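As a concrete illustration (not spelled out in the lecture, but immediate from the definitions), consider the normal location model:
$$X_1, \ldots, X_n \overset{iid}{\sim} N(\theta, 1), \qquad \psi(\theta) = \theta.$$
Here $\dot\ell_\theta(x) = x - \theta$, $I_\theta = 1$, and $\dot\psi_\theta = 1$, so the expansion of Lemma 8.14 reads
$$\sqrt{n}\,(T_n - \theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - \theta) + o_{P_\theta}(1),$$
which the MLE $T_n = \bar{X}_n$ satisfies exactly, with zero remainder. Hence the sample mean is the best regular estimator of $\theta$ in this model.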

2 Functional Delta Method

The functional delta method aims at extending the delta method to a nonparametric context. The high-level idea is to interpret a statistic as a functional $\phi$ mapping from the space $D$ of probability distributions to the real line $\mathbb{R}$ and use a notion of derivative of this functional to obtain the limiting distribution of $\phi(\hat{F}_n)$. We have proven before that:
$$\sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{p} 0, \qquad \text{(Glivenko-Cantelli Theorem)}$$
$$\sqrt{n}\left(\hat{F}_n - F\right) \rightsquigarrow \mathbb{G}_F, \qquad \text{(Donsker Theorem)}$$
where $\hat{F}_n$ is the empirical distribution function based on $n$ samples and $\mathbb{G}_F$ is the Brownian bridge. Our goal now is to find conditions on the functionals $\phi$ so we can extend the above modes of convergence to $\phi(\hat{F}_n)$. As we will see in detail below, consistency of the sequence $\phi(\hat{F}_n)$ follows easily from assuming $\phi$ to be continuous with respect to the supremum norm: this is a natural extension of the continuous mapping theorem. The generalization of the delta method to functionals is more involved, as different notions of differentiability of functionals exist. Before we jump into the continuous mapping theorem and the functional delta method, a few examples of statistical functionals are in order.
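A quick numerical illustration of these two results (my own sketch, not part of the original notes): the snippet below estimates $\|\hat F_n - F\|_\infty$ for increasing $n$ when $F$ is standard normal. The distance vanishes (Glivenko-Cantelli), while $\sqrt{n}\,\|\hat F_n - F\|_\infty$ remains stochastically bounded, consistent with the Donsker theorem.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sup_norm_distance(n):
    """||F_hat_n - F||_inf for n standard-normal draws.

    The sup over all x is attained at the jump points of the empirical
    CDF, so it suffices to check both sides of each jump."""
    x = np.sort(rng.standard_normal(n))
    ecdf = np.arange(1, n + 1) / n          # ECDF just after each jump
    return max(np.max(np.abs(ecdf - norm.cdf(x))),
               np.max(np.abs(ecdf - 1.0 / n - norm.cdf(x))))

for n in [100, 1_000, 10_000, 100_000]:
    d = sup_norm_distance(n)
    print(f"n={n:>6}  ||F_n - F||_inf = {d:.4f}  sqrt(n)*d = {np.sqrt(n) * d:.3f}")
```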

2.1 Examples of statistical functionals

• The mean: $\mu(F) = \int x \, dF(x)$;

• The variance: $\mathrm{Var}(F) = \int (x - \mu(F))^2 \, dF(x)$;

• Higher-order moments: $\mu_k(F) = \int (x - \mu(F))^k \, dF(x)$;

• The Kolmogorov-Smirnov statistic: $K(F) = \sup_x |F(x) - F_0(x)|$, where $F_0$ is a fixed hypothesized distribution;

• The Cramér-von Mises statistic: $C(F) = \int (F(x) - F_0(x))^2 \, dF_0(x)$, where $F_0$ is a fixed hypothesized distribution;

• V-statistics: $\phi(F) = E_F(T(X_1, X_2, \ldots, X_p))$, where $X_1, X_2, \ldots, X_p$ are independent copies of $F$-distributed random variables;

• Quantile functional: $\phi(F) = F^{-1}(p) := \inf_x \{x : F(x) \geq p\}$.

Plug-in versions of several of these functionals are sketched below.
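To make the definitions concrete, here is a minimal sketch (my own illustration; the helper names are hypothetical) of the plug-in principle: each functional is evaluated at the empirical distribution $\hat F_n$, which turns integrals $\int \cdot \, dF$ into sample averages.

```python
import numpy as np
from scipy.stats import norm

def mean_functional(x):
    """mu(F_hat): integral of x dF_hat reduces to the sample average."""
    return np.mean(x)

def variance_functional(x):
    """Var(F_hat): integral of (x - mu(F_hat))^2 dF_hat."""
    return np.mean((x - np.mean(x)) ** 2)

def ks_functional(x, F0=norm.cdf):
    """K(F_hat) = sup_x |F_hat(x) - F0(x)|, checked at the ECDF jumps."""
    xs = np.sort(x)
    n = len(xs)
    hi = np.arange(1, n + 1) / n   # ECDF just after each jump
    lo = np.arange(0, n) / n       # ECDF just before each jump
    return max(np.max(np.abs(hi - F0(xs))), np.max(np.abs(lo - F0(xs))))

def quantile_functional(x, p):
    """F_hat^{-1}(p) = inf{x : F_hat(x) >= p}: the ceil(n*p)-th order statistic."""
    xs = np.sort(x)
    return xs[int(np.ceil(p * len(xs))) - 1]

x = np.random.default_rng(1).standard_normal(10_000)
print(mean_functional(x), variance_functional(x),
      ks_functional(x), quantile_functional(x, 0.5))
```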

2.2 Consistency of statistical functionals

One possible assumption ensuring that $\phi(\hat{F}_n)$ converges to $\phi(F)$ is continuity of $\phi$ with respect to the supremum norm. Formally, this is defined as:

Definition 2. (Continuity of a functional) Let $D$ be the space of distributions and $\phi : D \to \mathbb{R}$. We say $\phi$ is continuous (with respect to the supremum norm) at $F$ if:

$$\sup_x |F_n(x) - F(x)| \to 0 \;\Rightarrow\; \phi(F_n) \to \phi(F).$$

The next result is an extension of the continuous mapping theorem to functionals and can be used to establish the consistency of statistical functionals.

Theorem 3. (Continuous Mapping Theorem for Statistical Functionals) Let $\phi : D \to \mathbb{R}$ be a continuous functional at $F$. It follows that:

$$\text{If } \|F_n - F\|_\infty \xrightarrow{p} 0, \text{ then } \phi(F_n) - \phi(F) \xrightarrow{p} 0.$$

Proof. From continuity of $\phi$ (with respect to the sup norm), we have that for every $\varepsilon > 0$, there exists $\delta > 0$ such that:

$$\|F_n - F\|_\infty \leq \delta \;\Rightarrow\; |\phi(F_n) - \phi(F)| \leq \varepsilon.$$
Hence,

$$0 \leq P\left(|\phi(F_n) - \phi(F)| > \varepsilon\right) \leq P\left(\|F_n - F\|_\infty > \delta(\varepsilon)\right) \to 0,$$
where the last convergence holds because $\|F_n - F\|_\infty \xrightarrow{p} 0$ by hypothesis.

2.2.1 Examples of continuous statistical functionals

The following two functionals are continuous with respect to the supremum norm:

• $\phi(F) = F(a)$: the distribution function evaluated at a point.
To establish the continuity of this functional, notice that for a sequence of distributions such that $\|F_n - F\|_\infty \to 0$:

$$0 \leq |F_n(a) - F(a)| \leq \sup_x |F_n(x) - F(x)| \to 0.$$

• $\phi(F) = \int (F(x) - F_0(x))^2 \, dF_0(x)$: the Cramér-von Mises functional.
Again, take a sequence of distributions such that $\|F_n - F\|_\infty \to 0$. We have:

$$0 \leq |\phi(F_n) - \phi(F)| = \left| \int \left( F_n^2 - F^2 + 2F_0(F - F_n) \right) dF_0 \right| \leq \int |F_n - F| \underbrace{|F_n + F - 2F_0|}_{\leq\, 2} \, dF_0 \leq 2 \sup_x |F_n(x) - F(x)| \int dF_0 \to 0.$$

A numerical check of this continuity appears below.
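Since the Cramér-von Mises functional is continuous and $\|\hat F_n - F\|_\infty \to 0$, the plug-in value $\phi(\hat F_n)$ should tend to $\phi(F) = 0$ when $F_0 = F$. A quick Monte Carlo check (my own sketch; the grid size and seeds are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def cramer_von_mises(sample, F0=norm.cdf, n_grid=100_000):
    """Monte Carlo approximation of integral (F_hat - F0)^2 dF0:
    draw points from F0 and average the squared discrepancy."""
    grid = rng.standard_normal(n_grid)              # draws from F0 = N(0,1)
    xs = np.sort(sample)
    ecdf = np.searchsorted(xs, grid, side="right") / len(xs)
    return np.mean((ecdf - F0(grid)) ** 2)

for n in [50, 500, 5_000, 50_000]:
    phi_n = cramer_von_mises(rng.standard_normal(n))
    # phi(F_hat_n) -> 0; in fact n * C(F_hat_n) stays stochastically bounded.
    print(f"n={n:>6}  C(F_hat_n) = {phi_n:.2e}  n*C = {n * phi_n:.3f}")
```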

2.3 Limiting distribution of statistical functionals

Recall that our goal is to determine the limiting distribution of $\phi(\hat{F}_n)$. Heuristically, we might hope to derive it if we can find a linear map $\phi'_F$ resulting in an approximation of the sort:

$$\phi(\hat{F}_n) - \phi(F) = \phi'_F(\hat{F}_n - F) + \text{some residual} = \frac{1}{\sqrt{n}} \phi'_F(\hat{\mathbb{G}}_n) + \text{some residual},$$
where $\hat{\mathbb{G}}_n = \sqrt{n}(\hat{F}_n - F)$.

As before, we must keep track of the behavior of the residual term as $n$ grows. The fact that $\phi$ operates on the infinite-dimensional space of distribution functions will require us to be more careful and resort to some concepts of functional analysis. Namely, we will be looking at the notions of Gateaux and Hadamard derivatives.

We start by considering Gateaux derivatives. To see how they can be used, notice that from linearity of $\phi'_F$, we have that:
$$\phi'_F(\hat{\mathbb{G}}_n) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \phi'_F(\delta_{X_i} - F),$$
where $\delta_{X_i}$ is the distribution function concentrating all mass at $X_i$. For each of the terms in the sum, we can consider a "directional" derivative of $\phi$ in the direction $\delta_{X_i} - F$. That is the Gateaux derivative, which in this case is defined as:
$$\phi'_F(\delta_{X_i} - F) = \left. \frac{d}{dt} \Big[ \phi\big((1-t)F + t\delta_{X_i}\big) \Big] \right|_{t=0}.$$

The expression for $\phi'_F(\delta_{X_i} - F)$ above is known in the robust statistics literature as the influence function:
$$\mathrm{IF}_{\phi,F}(x) = \left. \frac{d}{dt} \Big[ \phi\big((1-t)F + t\delta_x\big) \Big] \right|_{t=0}.$$
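For the mean functional, $\mu((1-t)F + t\delta_x) = (1-t)\mu(F) + tx$, so differentiating at $t = 0$ gives $\mathrm{IF}_{\mu,F}(x) = x - \mu(F)$. The sketch below (my own illustration; the finite-difference step `t` is an arbitrary choice) checks this numerically against the definition:

```python
import numpy as np

def mean_functional(weights, atoms):
    """phi(F) = integral x dF for a discrete F with given atoms/weights."""
    return np.sum(weights * atoms)

def influence_fn(phi, atoms, weights, x, t=1e-6):
    """Finite-difference Gateaux derivative of phi at F in the direction
    delta_x - F:  [phi((1-t)F + t*delta_x) - phi(F)] / t."""
    atoms_c = np.append(atoms, x)
    weights_c = np.append((1 - t) * weights, t)   # contaminated mixture
    return (phi(weights_c, atoms_c) - phi(weights, atoms)) / t

# Represent F by the empirical distribution of a large normal sample.
rng = np.random.default_rng(3)
atoms = rng.standard_normal(100_000)
weights = np.full(atoms.size, 1.0 / atoms.size)

mu = mean_functional(weights, atoms)
for x in [-2.0, 0.0, 3.0]:
    num = influence_fn(mean_functional, atoms, weights, x)
    print(f"x={x:+.1f}  numeric IF = {num:+.4f}  closed form x - mu = {x - mu:+.4f}")
```

Note that the influence function of the mean grows linearly in $x$; this is exactly what the gross-error sensitivity defined next quantifies.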


To some extent, it measures how much the statistic φ(F ) is affected by adding a new observation at x to the sample. A related concept is the gross-error sensitivity defined as:

$$\gamma^* = \sup_x \left| \mathrm{IF}_{\phi,F}(x) \right|.$$

For robustness, $\gamma^*$ must be bounded.
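Under $F = N(0,1)$, the mean has $\mathrm{IF}(x) = x$ and hence unbounded gross-error sensitivity, whereas the median has $\mathrm{IF}(x) = \mathrm{sign}(x)/(2f(0)) \approx \pm 1.2533$, which is bounded. The following sketch (my own illustration; the bracket $[-10, 10]$ and step `t` are arbitrary choices) verifies this by finite differences, finding the median of the contaminated mixture by root-finding:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def contaminated_median(x, t):
    """Median of (1-t) N(0,1) + t delta_x: the root m of
    (1-t) Phi(m) + t 1[m >= x] = 1/2."""
    return brentq(lambda m: (1 - t) * norm.cdf(m) + t * (m >= x) - 0.5,
                  -10.0, 10.0)

t = 1e-6
for x in [0.5, 3.0, 30.0]:
    if_mean = x - 0.0                                   # IF of the mean: x - mu
    if_median = (contaminated_median(x, t) - 0.0) / t   # finite difference
    print(f"x={x:>5}: IF_mean = {if_mean:+7.2f}   IF_median = {if_median:+8.4f}")
# IF_median plateaus near 1/(2 phi(0)) ~ 1.2533 no matter how large x is,
# so the median has bounded gross-error sensitivity; the mean does not.
```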

Going back to the approximation of $\phi(\hat{F}_n) - \phi(F)$, we now write:
$$\phi(\hat{F}_n) - \phi(F) = \frac{1}{n} \sum_{i=1}^n \mathrm{IF}_{\phi,F}(X_i) + R_n.$$
We now have:

$$E_F\left(\phi'_F(\delta_{X_i} - F)\right) = \int \phi'_F(\delta_x - F) \, dF(x) = 0.$$
If we assume:

$$\mathrm{Var}_F\left(\phi'_F(\delta_{X_i} - F)\right) = \int \left(\mathrm{IF}_{\phi,F}(x)\right)^2 dF(x) =: \lambda^2 < \infty,$$
and the residual term $R_n$ can be controlled somehow (more on this in later classes), a central limit theorem will hold for the (random variables) $\phi'_F(\delta_{X_i} - F)$ and we can expect that:
$$\sqrt{n}\left(\phi(\hat{F}_n) - \phi(F)\right) \rightsquigarrow N(0, \lambda^2).$$
As is the case in multivariate calculus, the existence of the Gateaux (directional) derivative does not ensure that the residual of the approximation is well behaved (differentiability). In the next classes, we will study the residual term more closely.
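To close the loop, consider the variance functional $\phi(F) = \mathrm{Var}(F)$, whose influence function works out to $\mathrm{IF}_{\phi,F}(x) = (x - \mu(F))^2 - \phi(F)$; under $F = N(0,1)$ this gives $\lambda^2 = E(X^2 - 1)^2 = 2$. The Monte Carlo sketch below (my own illustration; sample sizes and seeds are arbitrary) checks that the spread of $\sqrt{n}(\phi(\hat F_n) - \phi(F))$ matches this $\lambda^2$:

```python
import numpy as np

rng = np.random.default_rng(4)

def variance_functional(x):
    """Plug-in variance: phi(F_hat) = mean((x - xbar)^2)."""
    return np.mean((x - np.mean(x)) ** 2)

n, reps = 2_000, 5_000
true_phi = 1.0                       # Var(X) under N(0,1)
errors = np.empty(reps)
for r in range(reps):
    sample = rng.standard_normal(n)
    errors[r] = np.sqrt(n) * (variance_functional(sample) - true_phi)

# The heuristic predicts N(0, lambda^2) with lambda^2 = E(X^2 - 1)^2 = 2.
print(f"empirical Var of sqrt(n)*(phi_hat - phi): {errors.var():.3f}  (theory: 2)")
```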

Coming up next

In the next classes, we will:

• make this heuristic of the functional delta method more precise;

• look more closely at the notion of the Hadamard derivative;

• use the Hadamard derivative to determine conditions that ensure the functional delta method works.

References

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.