STAT 200B: Theoretical Statistics

Arash A. Amini

March 2, 2020

Statistical decision theory

A probability model $\mathcal{P} = \{P_\theta : \theta \in \Omega\}$ for data $X \in \mathcal{X}$:
- $\Omega$: parameter space, $\mathcal{X}$: sample space.
An action space $\mathcal{A}$: set of available actions (decisions).
A loss function $L(\theta, a)$:
- 0-1 loss $L(\theta, a) = 1\{\theta \ne a\}$, with $\Omega = \mathcal{A} = \{0, 1\}$.
- Quadratic loss (squared error) $L(\theta, a) = \|\theta - a\|^2$, with $\Omega = \mathcal{A} = \mathbb{R}^d$.

Statistical inference as a game:

1. Nature picks the "true" parameter $\theta$, and draws $X \sim P_\theta$. Thus, X is a random element of $\mathcal{X}$.
2. The statistician observes X and makes a decision $\delta(X) \in \mathcal{A}$. The map $\delta : \mathcal{X} \to \mathcal{A}$ is called a decision rule.
3. The statistician incurs the loss $L(\theta, \delta(X))$.

The goal of the statistician is to minimize the expected loss, a.k.a. the risk:
$$R(\theta, \delta) := E_\theta L(\theta, \delta(X)) = \int L(\theta, \delta(x))\, dP_\theta(x) = \int L(\theta, \delta(x))\, p_\theta(x)\, d\mu(x)$$
where the last equality holds when the family is dominated: $dP_\theta = p_\theta\, d\mu$.
- We usually work with the family of densities $\{p_\theta(\cdot) : \theta \in \Omega\}$.

Example 1 (Bernoulli trials)

- A coin is being flipped; we want to estimate the probability of coming up heads.
- One possible model: $X = (X_1,\dots,X_n)$, $X_i \stackrel{iid}{\sim} \mathrm{Ber}(\theta)$, for some $\theta \in [0,1]$.
- Formally, $\mathcal{X} = \{0,1\}^n$, $P_\theta = (\mathrm{Ber}(\theta))^{\otimes n}$ and $\Omega = [0,1]$.
- PMF of $X_i$:
$$P(X_i = x) = \begin{cases} \theta & x = 1 \\ 1-\theta & x = 0 \end{cases} \;=\; \theta^x (1-\theta)^{1-x}, \quad x \in \{0,1\}$$
- Joint PMF: $p_\theta(x_1,\dots,x_n) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i}$.
- Action space: $\mathcal{A} = \Omega$.
- Quadratic loss: $L(\theta,\delta) = (\theta - \delta)^2$.

Comparing estimators via their risk

Bernoulli trials. Let us look at four estimators:
- Sample mean: $\delta_1(X) = \frac{1}{n}\sum_{i=1}^n X_i$, with $R(\theta,\delta_1) = \frac{\theta(1-\theta)}{n}$.
- Constant estimator: $\delta_2(X) = \frac12$, with $R(\theta,\delta_2) = (\theta - \frac12)^2$.
- Strange looking: $\delta_3(X) = \frac{\sum_i X_i + 3}{n+6}$, with $R(\theta,\delta_3) = \frac{n\theta(1-\theta) + (3-6\theta)^2}{(n+6)^2}$.
- Throw the data out: $\delta_4(X) = X_1$, with $R(\theta,\delta_4) = \theta(1-\theta)$.

Comparing estimators via their risk

[Figure: risk functions $\theta \mapsto R(\theta,\delta_k)$ of $\delta_1,\dots,\delta_4$ over $\theta \in [0,1]$, for $n = 10$ and $n = 50$.]

Comparison depends on the choice of the loss. A different loss gives a different picture.
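As a numerical check, here is a minimal sketch (not part of the slides) that reproduces the risk curves above, both from the closed-form expressions and by Monte Carlo; all names and parameter values are illustrative.

```python
import numpy as np

def risks_closed_form(theta, n):
    r1 = theta * (1 - theta) / n                                   # sample mean
    r2 = (theta - 0.5) ** 2                                        # constant 1/2
    r3 = (n * theta * (1 - theta) + (3 - 6 * theta) ** 2) / (n + 6) ** 2
    r4 = theta * (1 - theta)                                       # delta_4 = X_1
    return r1, r2, r3, r4

def risks_monte_carlo(theta, n, reps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.binomial(1, theta, size=(reps, n))
    d1, d2 = X.mean(axis=1), np.full(reps, 0.5)
    d3, d4 = (X.sum(axis=1) + 3) / (n + 6), X[:, 0]
    return tuple(np.mean((d - theta) ** 2) for d in (d1, d2, d3, d4))

print(risks_closed_form(0.3, 10))
print(risks_monte_carlo(0.3, 10))   # agrees up to simulation error
```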

Comparing estimators via their risk

How to deal with the fact that the risks are functions? Summarize them by reducing to numbers:
- (Bayesian) Take a weighted average: $\inf_\delta \int_\Omega R(\theta,\delta)\, d\pi(\theta)$.
- (Frequentist) Take the maximum: $\inf_\delta \max_{\theta\in\Omega} R(\theta,\delta)$.
Other options:
- Restrict to a class of estimators: unbiased (UMVU), equivariant, etc.
- Rule out estimators that are dominated by others (inadmissible).

Admissibility

Definition 1

Let $\delta$ and $\delta^*$ be decision rules. $\delta^*$ (strictly) dominates $\delta$ if
- $R(\theta,\delta^*) \le R(\theta,\delta)$ for all $\theta\in\Omega$, and
- $R(\theta,\delta^*) < R(\theta,\delta)$ for some $\theta\in\Omega$.
$\delta$ is inadmissible if there is a different $\delta^*$ that dominates it; otherwise $\delta$ is admissible.

An inadmissible rule can be uniformly “improved”.

δ4 in the Bernoulli example is inadmissible. We will see a non-trivial example soon (Exponential Distribution).

Bias

Definition 2

The bias of $\delta$ for estimating $g(\theta)$ is $B_\theta(\delta) := E_\theta(\delta) - g(\theta)$. The estimator is unbiased if $B_\theta(\delta) = 0$ for all $\theta\in\Omega$.

It is not always possible to find unbiased estimators. Example: $g(\theta) = \sin(\theta)$ in the binomial family. (Keener Example 4.2, p. 62)

Definition 3
g is called U-estimable if there is an unbiased estimator $\delta$ for g.

Usually g(θ) = θ.

Bias-variance decomposition

For the quadratic loss $L(\theta,a) = (\theta-a)^2$, the risk is the mean-squared error (MSE). In this case we have the following decomposition:
$$\mathrm{MSE}_\theta(\delta) = [B_\theta(\delta)]^2 + \mathrm{var}_\theta(\delta)$$

Proof.

Let $\mu_\theta := E_\theta(\delta)$. We have
$$\mathrm{MSE}_\theta(\delta) = E_\theta(\theta-\delta)^2 = E_\theta(\theta-\mu_\theta + \mu_\theta-\delta)^2 = (\theta-\mu_\theta)^2 + 2(\theta-\mu_\theta)E_\theta[\mu_\theta-\delta] + E_\theta(\mu_\theta-\delta)^2,$$
and the cross term vanishes since $E_\theta[\mu_\theta - \delta] = 0$.

The same goes for a general $g(\theta)$ or higher dimensions: $L(\theta,a) = \|g(\theta)-a\|_2^2$.
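A quick Monte Carlo check of the decomposition (a sketch, not from the slides), using the biased estimator $\delta_3$ from the Bernoulli example as a test case:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.3, 10, 200_000
X = rng.binomial(1, theta, size=(reps, n))
delta = (X.sum(axis=1) + 3) / (n + 6)

mse = np.mean((delta - theta) ** 2)
bias = delta.mean() - theta
var = delta.var()
print(mse, bias**2 + var)   # the two numbers agree up to Monte Carlo error
```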

Example 2 (Berger)
Let $X \sim N(\theta,1)$. Consider the class of estimators of the form $\delta_c(X) = cX$, for $c\in\mathbb{R}$.
$$\mathrm{MSE}_\theta(\delta_c) = (\theta - c\theta)^2 + c^2 = (1-c)^2\theta^2 + c^2$$
For $c > 1$, we have $1 = \mathrm{MSE}_\theta(\delta_1) < \mathrm{MSE}_\theta(\delta_c)$ for all $\theta$. For $c\in[0,1]$ the rules are incomparable.

Optimality depends on the loss

Example 3 (Poisson process)

Let $X_1,\dots,X_n$ be the inter-arrival times of a Poisson process with rate $\lambda$, i.e., $X_1,\dots,X_n \stackrel{iid}{\sim} \mathrm{Expo}(\lambda)$. The model has the following p.d.f.:
$$p_\lambda(x) = \prod_{i=1}^n p_\lambda(x_i) = \prod_i \lambda e^{-\lambda x_i} 1\{x_i > 0\} = \lambda^n e^{-\lambda\sum_i x_i}\, 1\{\min_i x_i > 0\}$$
- $\Omega = \mathcal{A} = (0,\infty)$.
- Let $S = \sum_i X_i$ and $\bar X = \frac1n S$.
- The MLE for $\lambda$ is $\hat\lambda = 1/\bar X = n/S$.
- $S := \sum_{i=1}^n X_i$ has the $\mathrm{Gamma}(n,\lambda)$ distribution.
- $1/S$ has the $\mathrm{Inv\text{-}Gamma}(n,\lambda)$ distribution with mean $\lambda/(n-1)$.
- $E_\lambda[\hat\lambda] = n\lambda/(n-1)$: the MLE is biased for $\lambda$.
- Then $\tilde\lambda := (n-1)\hat\lambda/n$ is unbiased.
- We also have $\mathrm{var}_\lambda(\tilde\lambda) < \mathrm{var}_\lambda(\hat\lambda)$.
- It follows that $\mathrm{MSE}_\lambda(\tilde\lambda) < \mathrm{MSE}_\lambda(\hat\lambda)$ for all $\lambda$, so the MLE $\hat\lambda$ is inadmissible for quadratic loss.
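A small simulation (sketch, assuming the rate parametrization of the exponential as on the slide) illustrating that $\tilde\lambda$ dominates the MLE under quadratic loss:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 2.0, 10, 200_000
X = rng.exponential(scale=1 / lam, size=(reps, n))   # Expo(lambda) has mean 1/lambda
S = X.sum(axis=1)
mle, unbiased = n / S, (n - 1) / S
print("MSE(MLE)      =", np.mean((mle - lam) ** 2))
print("MSE(unbiased) =", np.mean((unbiased - lam) ** 2))   # strictly smaller
```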

[Figure: the quadratic loss $a \mapsto (\lambda - a)^2$ over $(0,\infty)$.]

Possible explanation: quadratic loss penalizes over-estimation more than under-estimation for $\Omega = (0,\infty)$.

Alternative loss function (Itakura–Saito distance):
$$L(\lambda, a) = \lambda/a - 1 - \log(\lambda/a), \quad a,\lambda\in(0,\infty)$$
- With this loss function, $R(\lambda,\tilde\lambda) > R(\lambda,\hat\lambda)$ for all $\lambda$.
- That is, the MLE renders $\tilde\lambda$ inadmissible.
- An example of a Bregman divergence for $\phi(x) = -\log x$. For a convex function $\phi:\mathbb{R}^d\to\mathbb{R}$, the Bregman divergence is defined as
$$d_\phi(x,y) = \phi(x) - \phi(y) - \langle\nabla\phi(y),\, x - y\rangle,$$
the remainder of the first-order Taylor expansion of $\phi$ at y.

Details: Consider $\delta_\alpha(X) = \alpha/S$. Then, we have
$$R(\lambda,\delta_\alpha) - R(\lambda,\delta_\beta) = \frac{n}{\alpha} - \frac{n}{\beta} - \Big(\log\frac{n}{\alpha} - \log\frac{n}{\beta}\Big)$$
- Take $\alpha = n-1$ and $\beta = n$.
- Use $\log x - \log y < x - y$ for $x > y \ge 1$.
- (Follows from strict concavity of $f(x) = \log x$: $f(x) - f(y) < f'(y)(x-y)$ for $y \ne x$.)

Sufficiency

Idea: Separate the data into
- parts that are relevant for estimating $\theta$ (sufficient), and
- parts that are irrelevant (ancillary).
Benefits:
- Achieve data compression: efficient computation and storage.
- Irrelevant parts can increase the risk (Rao–Blackwell).

Definition 4

Consider the model $\mathcal{P} = \{P_\theta : \theta\in\Omega\}$ for X. A statistic $T = T(X)$ is sufficient for $\mathcal{P}$ (or for $\theta$, or for X) if the conditional distribution of X given T does not depend on $\theta$.

More precisely, we have
$$P_\theta(X\in A \mid T = t) = Q_t(A), \quad \forall t, A,$$
for some Markov kernel Q. Making this more precise requires measure theory. Intuitively, given T, we can simulate X by an external source of randomness.

Sufficiency

Example 4 (Coin tossing)

- $X_i \stackrel{iid}{\sim} \mathrm{Ber}(\theta)$, $i = 1,\dots,n$.
- Notation: $X = (X_1,\dots,X_n)$, $x = (x_1,\dots,x_n)$.
- We will show that $T = T(X) = \sum_i X_i$ is sufficient for $\theta$. (This should be intuitive.)
$$P_\theta(X = x) = p_\theta(x) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{T(x)}(1-\theta)^{n-T(x)}$$
- Then
$$P_\theta(X = x, T = t) = P_\theta(X = x)\,1\{T(x)=t\} = \theta^t(1-\theta)^{n-t}\,1\{T(x)=t\}.$$
- Marginalizing,
$$P_\theta(T = t) = \sum_{x\in\{0,1\}^n} \theta^t(1-\theta)^{n-t}\,1\{T(x)=t\} = \binom{n}{t}\theta^t(1-\theta)^{n-t}.$$
- Hence,
$$P_\theta(X = x \mid T = t) = \frac{\theta^t(1-\theta)^{n-t}\,1\{T(x)=t\}}{\binom{n}{t}\theta^t(1-\theta)^{n-t}} = \frac{1}{\binom{n}{t}}\,1\{T(x)=t\}.$$
- What is the above (conditional) distribution?
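A quick simulation (sketch, not from the slides) that the conditional distribution of X given $T = t$ does not depend on $\theta$: for two values of $\theta$, the sequences with t ones all appear with (roughly) equal frequency.

```python
import numpy as np
from collections import Counter

def conditional_counts(theta, n=4, t=2, reps=200_000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.binomial(1, theta, size=(reps, n))
    keep = X[X.sum(axis=1) == t]          # condition on T = t
    return Counter(map(tuple, keep))

for theta in (0.2, 0.7):
    counts = conditional_counts(theta)
    total = sum(counts.values())
    print(theta, {k: round(v / total, 3) for k, v in sorted(counts.items())})
# Both runs give frequencies close to 1/6 for each of the 6 sequences with two ones.
```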

Factorization Theorem

It is not convenient to check for sufficiency this way, hence:

Theorem 1 (Factorization (Fisher–Neyman))

Assume that $\mathcal{P} = \{P_\theta : \theta\in\Omega\}$ is dominated by $\mu$. A statistic T is sufficient iff for some functions $g_\theta, h \ge 0$,
$$p_\theta(x) = g_\theta(T(x))\, h(x), \quad \text{for } \mu\text{-a.e. } x$$

- The likelihood $\theta \mapsto p_\theta(X)$ only depends on X through T(X).
- The family being dominated (having a density) is important.

Proof (discrete case only): Assume $T = T(X)$ is sufficient. Fix x, and let $t = T(x)$. Then,
$$P_\theta(X = x) = P_\theta(X = x, T = t) = P_\theta(X = x \mid T = t)\, P_\theta(T = t) = Q_t(X = x)\, g_\theta(t).$$

Factorization Theorem

- Now assume the factorization holds. Then,
$$P_\theta(X = x, T = t) = P_\theta(X = x)\,1\{T(x)=t\} = g_\theta(t)\, h(x)\,1\{T(x)=t\}.$$
- It follows that
$$P_\theta(T = t) = g_\theta(t) \sum_{x'} h(x')\,1\{T(x')=t\},$$
- hence
$$P_\theta(X = x \mid T = t) = \frac{g_\theta(t)\, h(x)\,1\{T(x)=t\}}{g_\theta(t)\sum_{x'} h(x')\,1\{T(x')=t\}} = \frac{h(x)\,1\{T(x)=t\}}{\sum_{x'} h(x')\,1\{T(x')=t\}},$$
which does not depend on $\theta$.

Example 5 (Uniform)

- Let $X_1,\dots,X_n \stackrel{iid}{\sim} U[0,\theta]$.
- The family is dominated by Lebesgue measure.
- $X_{(n)} = \max\{X_1,\dots,X_n\}$ is sufficient by the factorization theorem:
$$p_\theta(x) = \prod_{i=1}^n \frac1\theta\,1\{0\le x_i\le\theta\} = \frac{1}{\theta^n}\,1\{0\le x_i\le\theta,\ \forall i\} = \frac{1}{\theta^n}\,1\{0\le\min_i x_i\}\,1\{\max_i x_i\le\theta\}$$
- Useful fact: $\prod_i 1_{A_i} = 1_{\cap_i A_i}$.

- The entire data $(X_1,\dots,X_n)$ is always sufficient.
- For i.i.d. data there is always more reduction.

Example 6 (IID data)
- Let $X_1,\dots,X_n \stackrel{iid}{\sim} p_\theta$.
- Then the order statistics $(X_{(1)},\dots,X_{(n)})$ are sufficient:
- order the data so that $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$, and discard the ranks.

Minimal sufficiency

- There is a hierarchy among sufficient statistics in terms of data reduction.
- This can be made precise by using "functions" as "reduction mechanisms".

Lemma 1
If T is sufficient for $\mathcal{P}$ and $T = f(S)$ a.e. $\mathcal{P}$, then S is sufficient.

We write $T \le_s S$ if such a functional relation exists.
- An easy consequence of the factorization theorem.
- Examples:
  - T sufficient $\not\Rightarrow$ $T^2$ sufficient. ($T \not\le_s T^2$; missing sign information)
  - $T^2$ sufficient $\Rightarrow$ T sufficient. ($T^2 \le_s T$)
  - T sufficient $\iff$ $T^3$ sufficient. (bijection)

- $T \le_s S$ is not standard notation, but a useful shorthand for: $\exists$ a function f such that $T = f(S)$ a.e. $\mathcal{P}$.
- We want to obtain a sufficient statistic that achieves the greatest reduction, i.e., sits at the bottom of that hierarchy.

Definition 5
A statistic T is minimal sufficient if
- T is sufficient, and
- $T \le_s S$ for any sufficient statistic S.

- Minimal sufficient statistics exist under mild conditions.
- The minimal sufficient statistic is essentially unique modulo bijections.

Example 7 (Location family)
Consider $X_1,\dots,X_n \stackrel{iid}{\sim} p_\theta$, that is, they have density $p_\theta(x) = f(x-\theta)$. For example, consider $f(x) = C\exp(-\beta|x|^\alpha)$.
- The case $\alpha = 2$ corresponds to the normal location family $X_1,\dots,X_n \stackrel{iid}{\sim} N(\theta,1)$. Then $T = \frac1n\sum_i X_i$ is sufficient by factorization. We will show later that it is minimal sufficient.
- In the case $\alpha = 1$ (Laplace or double exponential), no further reduction beyond the order statistics is possible.

- A family $\mathcal{P}$ is DCS if it is dominated with densities having common support.

- Goal: To show that the likelihood (ratio) function is minimal sufficient.
- General idea: For any fixed $\theta$ and $\theta_0$,
$$\frac{p_\theta(X)}{p_{\theta_0}(X)}$$
will always be a function of any sufficient statistic (by the factorization theorem).
- We just have to collect enough of them,
$$\Big(\frac{p_{\theta_1}(X)}{p_{\theta_0}(X)},\ \frac{p_{\theta_2}(X)}{p_{\theta_0}(X)},\ \frac{p_{\theta_3}(X)}{p_{\theta_0}(X)},\ \dots\Big)$$
so that collectively they are sufficient.

A useful lemma

Let us fix $\theta_0\in\Omega$ and define
$$L_\theta := L_\theta(X) := \frac{p_\theta(X)}{p_{\theta_0}(X)}.$$

Lemma 2
In a DCS family, the following are equivalent:
(a) U is sufficient for $\mathcal{P}$.
(b) $L_\theta \le_s U$, $\forall\theta\in\Omega$.

- When DCS fails, (a) still implies (b), but not necessarily vice versa.

Proof of Lemma 2

- Working on the common support, the densities can be assumed positive.
- (a) $\Rightarrow$ (b): U sufficient implies $p_\theta(x) = g_\theta(U(x))h(x)$ (factorization theorem):
$$L_\theta(x) = \frac{p_\theta(x)}{p_{\theta_0}(x)} = \frac{g_\theta(U(x))}{g_{\theta_0}(U(x))} = f_{\theta,\theta_0}(U(x))$$
- (b) $\Rightarrow$ (a): $\exists f_{\theta,\theta_0}$ such that $p_\theta(x) = p_{\theta_0}(x)\, f_{\theta,\theta_0}(U(x))$, which is a factorization. Q.E.D.

A useful lemma

- Let $L := (L_\theta)_{\theta\in\Omega}$.
- Since $R \le_s U$ and $S \le_s U$ $\iff$ $(R,S) \le_s U$, we have

Lemma 3
In a DCS family, the following are equivalent:
(a) U is sufficient for $\mathcal{P}$.
(b) $L \le_s U$.

- The argument is correct when $\Omega$ is finite.
- More care is needed in dealing with "a.e. $\mathcal{P}$" when $\Omega$ is infinite.
- From Lemma 3 it follows that L is itself sufficient. (Why?)

Likelihood is minimal sufficient

Proposition 1

In a DCS family, $L := (L_\theta)_{\theta\in\Omega}$ is minimal sufficient.

Proof:
- Let U be a sufficient statistic.
- Lemma 3, (a) $\Rightarrow$ (b), gives $L \le_s U$ (i.e., L is a function of any sufficient U).
- We also know that L is itself sufficient.
- The proof is complete.

A consequence of Proposition 1 is

Corollary 1
A statistic T is minimal sufficient if
- T is sufficient, and
- $T \le_s L$.

That is, it is enough to show that
- T is sufficient, and
- T can be written as a function of L.

$T \le_s L$ is equivalent to either of the following:
- $L(x) = L(y) \Rightarrow T(x) = T(y)$.
- Level sets of L are "included" in level sets of T, i.e., level sets of T are coarser than level sets of L.

Corollary 2

T is minimal sufficient for a DCS family $\mathcal{P}$ iff
(a) T is sufficient, and
(b) $L(x) = L(y) \Rightarrow T(x) = T(y)$.

Corollary 3

T is minimal sufficient for a DCS family $\mathcal{P}$ iff
$$L(x) = L(y) \iff T(x) = T(y)$$

- We can replace $L(x) = L(y)$ with $\ell_x(\theta) \propto \ell_y(\theta)$, where $\ell_x(\theta) = p_\theta(x)$ is the likelihood function. (Theorem 3.11 in Keener.)
- Corollary 3: T is minimal sufficient if it has the same level sets as L.
- Informally, the shape of the likelihood is minimal sufficient.

Example 8 (Gaussian location family)

Consider $X_1,\dots,X_n \stackrel{iid}{\sim} p_\theta(x) = f(x-\theta)$ with $f(x) = C\exp(-\beta x^2)$.
- $\ell_X(\theta) \propto \exp(-\beta\sum_i (X_i-\theta)^2)$.
- The shape of $\ell_X(\cdot)$ is uniquely determined by $\theta \mapsto \sum_i (X_i-\theta)^2$,
- alternatively by $\theta \mapsto -2(\sum_i X_i)\theta + n\theta^2$,
- alternatively by $\sum_i X_i$.

Example 9 (Laplace location family)

Consider $X_1,\dots,X_n \stackrel{iid}{\sim} p_\theta(x) = f(x-\theta)$ with $f(x) = C\exp(-\beta|x|)$.
- $\ell_X(\theta) \propto \exp(-\beta\sum_i |X_i-\theta|)$.
- The shape of $\ell_X(\cdot)$ is uniquely determined by the breakpoints of the piecewise linear function $\theta \mapsto \sum_i |X_i-\theta|$,
- which are in one-to-one correspondence with the order statistics $(X_{(1)},\dots,X_{(n)})$.

[Figure: the piecewise linear function $\theta \mapsto \sum_i |X_i-\theta|$ with breakpoints at $X_{(1)}, X_{(2)}, X_{(3)}$.]

- The shape of the likelihood for the Laplace location family is determined by the order statistics.

Example with no common support

$\mathcal{P} = \{P_0, P_1, P_2\}$ where
$$P_0 = U[-1,0], \quad P_1 = U[0,1], \quad p_2(x) = 2x\,1\{x\in(0,1)\}.$$
$$\frac{p_1}{p_0} = \frac{p_2}{p_0} = \begin{cases} 0 & \text{on } (-1,0) \\ \infty & \text{on } (0,1) \end{cases}$$
- We cannot tell $P_1$ and $P_2$ apart based on $(\frac{p_1}{p_0}, \frac{p_2}{p_0})$.
- However, there is information in the original model to statistically tell these two apart to some extent.
- Consider in addition $\frac{p_2(x)}{p_1(x)} = 2x\,1\{x\in(0,1)\}$.
- A minimal sufficient statistic: $(1\{X<0\},\, X_+)$.
- We could just take $X_+$, since $1\{X<0\}$ is a function of $X_+$ a.e. (namely $1\{X_+=0\}$).

Empirical distribution

- We saw that for IID data, $X_1,\dots,X_n \stackrel{iid}{\sim} P_\theta$, the order statistics $X_{(1)}\le X_{(2)}\le\cdots\le X_{(n)}$ are sufficient.
- Another way: The empirical distribution $\mathbb{P}_n$ is sufficient,
$$\mathbb{P}_n := \frac1n\sum_{i=1}^n \delta_{X_i}, \quad (\delta_x: \text{unit point mass at } x)$$
- $\delta_x$ is the measure defined by $\delta_x(A) := 1\{x\in A\}$.
- Example: $X = (0, 1, -1, 0, 4, 4, 0) \Rightarrow \mathbb{P}_n = \frac17(\delta_{-1} + 3\delta_0 + \delta_1 + 2\delta_4)$.

[Figure: point masses of $\mathbb{P}_n$ at $-1, 0, 1, 4$.]
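A minimal sketch (not from the slides) computing the empirical distribution of the example sample as a dictionary of point-mass weights:

```python
from collections import Counter
from fractions import Fraction

X = [0, 1, -1, 0, 4, 4, 0]
n = len(X)
P_n = {x: Fraction(c, n) for x, c in sorted(Counter(X).items())}
print(P_n)   # {-1: 1/7, 0: 3/7, 1: 1/7, 4: 2/7}
```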

Example 10 (Empirical distribution on a finite alphabet (IID data))

- Suppose the sample space is finite, $\mathcal{X} = \{a_1,\dots,a_K\}$.
- Let $\mathcal{P}$ = the collection of all probability measures P on $\mathcal{X}$.
- $\mathcal{P}$ can be parametrized by $\theta = (\theta_1,\dots,\theta_K)$ where $\theta_j = P(\{a_j\})$.
- The empirical measure reduces to $\mathbb{P}_n = \sum_j \pi_j(X)\,\delta_{a_j}$ where
$$\pi_j(X) := \frac1n\,\#\{i : X_i = a_j\}$$
- Exercise: Show that $\mathbb{P}_n$, or equivalently $(\pi_1(X),\dots,\pi_K(X))$, is minimal sufficient.

Statements from Theory of Point Estimation (TPE)

Proposition 2 (TPE)

Consider a finite DCS family $\mathcal{P} = \{P_\theta, \theta\in\Omega\}$, i.e., $\Omega = \{\theta_0,\theta_1,\dots,\theta_K\}$. Then the following is minimal sufficient:
$$L(X) = \Big(\frac{p_{\theta_1}(X)}{p_{\theta_0}(X)}, \dots, \frac{p_{\theta_K}(X)}{p_{\theta_0}(X)}\Big).$$

Proposition 3 (TPE)

Assume $\mathcal{P}$ is DCS and $\mathcal{P}_0\subset\mathcal{P}$. Assume that T is
- sufficient for $\mathcal{P}$, and
- minimal sufficient for $\mathcal{P}_0$.
Then T is minimal sufficient for $\mathcal{P}$.

Proof:
- Having the same support gives "a.e. $\mathcal{P}_0$ $\Rightarrow$ a.e. $\mathcal{P}$".
- S sufficient for $\mathcal{P}$ $\Rightarrow$ S sufficient for $\mathcal{P}_0$.
- T minimal sufficient for $\mathcal{P}_0$ $\Rightarrow$ $T = f(S)$ a.e. $\mathcal{P}_0$, and hence a.e. $\mathcal{P}$. Q.E.D.

Example 11 (Gaussian location family)

- Consider $X_1,\dots,X_n \stackrel{iid}{\sim} N(\theta,1)$.
- Look at the sub-family $\mathcal{P}_0 = \{N(\theta_0,1), N(\theta_1,1)\}$. Let $T(X) = \sum_i X_i$.
- The following is minimal sufficient for $\mathcal{P}_0$ by Proposition 2:
$$\log L_{\theta_1} := \log\frac{p_{\theta_1}(x)}{p_{\theta_0}(x)} = \frac12\sum_i (x_i-\theta_0)^2 - \frac12\sum_i (x_i-\theta_1)^2 = (\theta_1-\theta_0)\,T(x) + \frac{n}{2}(\theta_0^2-\theta_1^2),$$
showing that T(X) is minimal sufficient for $\mathcal{P}_0$.
- Since T(X) is also sufficient for $\mathcal{P}$ (exercise), Proposition 3 implies that it is minimal sufficient for $\mathcal{P}$.

Completeness and ancillarity

- We can compress even more!

Definition 6
- $V = V(X)$ is ancillary if its distribution does not depend on $\theta$.
- It is first-order ancillary if its expectation does not depend on $\theta$ ($E_\theta V = c$, $\forall\theta$). The latter is a weaker notion.

Example 12 (Location family)
- The following statistics are all ancillary:
$$X_i - X_j, \quad X_i - X_{(j)}, \quad X_{(i)} - X_{(j)}, \quad X_{(i)} - \bar X$$
- Hint: We can write $X_i = \theta + \varepsilon_i$, where $\varepsilon_i \stackrel{iid}{\sim} f$.

- A minimal sufficient statistic can still contain ancillary information, for example $X_{(n)} - X_{(1)}$ in the Laplace location family.

- A notion stronger than minimal sufficiency is completeness:

Definition 7
A statistic T is complete if
$$E_\theta[f(T)] = c,\ \forall\theta \quad\Rightarrow\quad f(t) = c,\ \forall t. \quad (\text{Actually } \mathcal{P}\text{-a.e. } t.)$$
- T is complete if no nonconstant function of it is first-order ancillary.
- A minimal sufficient statistic need not be complete:

Example 13 (Laplace location family)
- $X_{(n)} - X_{(1)}$ is ancillary, hence first-order ancillary.
- $f(X_{(1)},\dots,X_{(n)})$ is ancillary for the nonconstant function $f(z_1,\dots,z_n) = z_1 - z_n$, so the (minimal sufficient) order statistics are not complete.
- The converse is however true (a complete sufficient statistic is minimal sufficient; Proposition 4 below).

- Showing completeness is not easy.
- We will see a general result for exponential families.
- Here is another example:

Example 14
- $X_1,\dots,X_n \stackrel{iid}{\sim} U[0,\theta]$.
- We will show that $T = \max\{X_1,\dots,X_n\}$ is complete.
- The CDF of T is $F_T(t) = (t/\theta)^n\,1\{t\in(0,\theta)\} + 1\{t\ge\theta\}$.
- Density: $t \mapsto n\theta^{-n} t^{n-1}\,1\{t\in(0,\theta)\}$.
- Suppose that $E_\theta f(T) = 0$ for all $\theta > 0$. Then,
$$0 = E_\theta f(T) = n\theta^{-n}\int_0^\theta f(t)\,t^{n-1}\,dt, \quad \forall\theta > 0.$$
- The fundamental theorem of calculus implies $f(t)t^{n-1} = 0$ for a.e. $t > 0$,
- hence $f(t) = 0$ a.e. $t > 0$. Conclude that T is complete.

Detour: Conditional expectation as L2 projection

- The $L^2$ space of random variables:
$$L^2 := L^2(P) := \{X : E[X^2] < \infty\}$$
- We can define an inner product on this space:
$$\langle X, Y\rangle := E[XY], \quad X, Y\in L^2$$
- The inner product induces a norm, called the $L^2$ norm:
$$\|X\|_2 := \sqrt{\langle X, X\rangle} = \sqrt{E[X^2]}$$
- The norm induces a distance $\|X - Y\|_2$.
- The squared distance $\|X-Y\|_2^2 = E(X-Y)^2$ is the same as the MSE.
- Orthogonality: $X \perp Y$ if $\langle X, Y\rangle = 0$, i.e., $E[XY] = 0$.

Detour: Conditional expectation as L2 projection

- Assume $EX^2 < \infty$ and $EY^2 < \infty$ (i.e., $X, Y\in L^2$).
- Consider the linear space
$$\mathcal{L} := \{g(X) \mid g \text{ is a (measurable) function with } E[g(X)^2] < \infty\},$$
i.e., the space of all (measurable) functions of X.
- There is an essentially unique $L^2$ projection of Y onto $\mathcal{L}$:
$$\widehat{Y} := \operatorname*{argmin}_{Z\in\mathcal{L}} \|Y - Z\|_2$$
- Alternatively, an essentially unique function $\hat g$ such that
$$\min_g E(Y - g(X))^2 = E(Y - \hat g(X))^2$$
- We define $E[Y \mid X] := \hat g(X)$.

Detour: Conditional expectation as L2 projection

- There is an essentially unique function $\hat g$ such that
$$\min_g E(Y - g(X))^2 = E(Y - \hat g(X))^2$$
- We define $E[Y \mid X] := \hat g(X)$.
- $E[Y \mid X]$ is the best prediction of Y given X, in the MSE sense.
- From this definition, we get the following characterization of $\hat g$:
$$E[(Y - \hat g(X))\, g(X)] = 0, \quad \forall g,$$
saying that the optimal prediction error $Y - \hat g(X)$ is orthogonal to $\mathcal{L}$.
- Applied to the constant function $g(X)\equiv 1$, we get
$$E[Y] = E[\hat g(X)] = E[E[Y\mid X]],$$
the smoothing or averaging property of conditional expectation.

Proposition 4
A complete sufficient statistic is minimal sufficient.

Proof. Let T be complete sufficient, and U minimal sufficient. The idea is to show that T is a function of U.
- $U = g(T)$. (By minimal sufficiency of U.)
- Let $h(U) := E_\theta[T \mid U]$, well defined by sufficiency of U.
- $E_\theta[T - h(U)] = 0$, $\forall\theta\in\Omega$. (By smoothing.)
- $T = h(U)$. (By completeness of T.)
Hints:
- We took $f(t) := t - h(g(t))$ in the definition of completeness.
- Equalities are a.e. $\mathcal{P}$.

Proposition 5 (Basu)
Let T be complete sufficient and V ancillary. Then T and V are independent.

Proof. Let A be an event.
- $q_A := P_\theta(V\in A)$ is well-defined. (By ancillarity of V.)
- $f_A(T) := P_\theta(V\in A \mid T)$ is well-defined. (By sufficiency of T.)
- $E_\theta[q_A - f_A(T)] = 0$, $\forall\theta$.
- $q_A = f_A(T)$. (By completeness of T.)
- Equalities are a.e. $\mathcal{P}$.

Application of Basu:

Example 15 (Gaussian location family)
- $X_1,\dots,X_n \stackrel{iid}{\sim} N(\theta,\sigma^2)$, $\theta$ is unknown, $\sigma^2$ is known.
- $\bar X := \frac1n\sum_i X_i$ is complete sufficient. (cf. exponential families)
- $(X_i - \bar X,\ i = 1,\dots,n)$ is ancillary.
- Hence, the sample variance $S^2 := \frac{1}{n-1}\sum_i (X_i - \bar X)^2$ is ancillary.
- Hence, $\bar X$ and $S^2$ are independent.
- Had we taken $(\theta,\sigma^2)$ as the parameter, then $S^2$ would not be ancillary.

Rao–Blackwell

- The Rao–Blackwell theorem ties risk minimization to sufficiency.
- It is a statement about convex loss functions.

Definition 8
A function $f:\mathbb{R}^p\to\mathbb{R}$ is convex if for all x, y
$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y), \quad \forall\lambda\in[0,1], \qquad (1)$$
and strictly convex if the inequality is strict for $\lambda\in(0,1)$ and $x\ne y$.

Example 16 ($\ell_p$ loss)
- Loss function $a \mapsto L(\theta,a) = |\theta-a|^p$ on $\mathbb{R}$.
- Convex for $p\ge 1$ and nonconvex for $p\in(0,1)$.
- Strictly convex when $p > 1$.
- In particular, the $\ell_1$ loss ($p = 1$) is convex but not strictly convex.

Jensen inequality

By induction, (1) leads to $f(\sum_i \alpha_i x_i) \le \sum_i \alpha_i f(x_i)$, for $\alpha_i\ge 0$ and $\sum_i \alpha_i = 1$. A sweeping generalization is the following:

Proposition 6 (Jensen inequality)
Assume that $f:\mathcal{S}\to\mathbb{R}$ is convex and consider a random variable X concentrated on $\mathcal{S}$ (i.e., $P(X\in\mathcal{S}) = 1$), with $E|X| < \infty$. Then,
$$Ef(X) \ge f(EX).$$
If f is strictly convex, equality holds iff $X\equiv EX$ a.s. (that is, X is constant).

Proof. Relies on the existence of supporting hyperplanes to f (i.e., affine minorants that touch the function).

- Let $x_0 := EX$.
- Let $A(x) = \langle a, x - x_0\rangle + f(x_0)$ be a supporting hyperplane to f at $x_0$:
$$f(x) \ge A(x),\ \forall x\in\mathcal{S}, \quad\text{and}\quad A(x_0) = f(x_0).$$
- Then, we have
$$f(X)\ge A(X) \;\Rightarrow\; E[f(X)] \ge E[A(X)] = \langle a, E[X - x_0]\rangle + f(x_0) = f(x_0),$$
using monotonicity and linearity of E.
- Strict convexity implies $f(x) > A(x)$ for $x\ne x_0$.
- If $X\ne x_0$ with positive probability, we have $f(X) > A(X)$ with positive probability,
- hence $Ef(X) > EA(X)$, and the rest follows. Q.E.D.

Recall the decision-theoretic setup: loss function $L(\theta,a)$, decision rule $\delta = \delta(X)$, risk $R(\theta,\delta) = E_\theta L(\theta,\delta)$.

Theorem 2 (Rao–Blackwell)
Let us assume the following:
- T is sufficient for the family $\mathcal{P}$,
- $\delta = \delta(X)$ is a possibly randomized decision rule,
- $a\mapsto L(\theta,a)$ is convex, for all $\theta\in\Omega$.
Define the estimator $\eta(T) := E_\theta[\delta\mid T]$. Then $\eta$ dominates $\delta$, that is,
$$R(\theta,\eta) \le R(\theta,\delta) \quad \text{for all } \theta\in\Omega.$$
The inequality is strict when the loss is strictly convex, unless $\eta = \delta$ a.e. $\mathcal{P}$.

Consequence: for convex loss functions, randomization does not help.

Proof:
- $\eta$ is well-defined (does not depend on $\theta$). (By sufficiency of T.)
- $E_\theta[L(\theta,\delta)\mid T] \ge L(\theta, E_\theta[\delta\mid T]) = L(\theta,\eta)$. (By the conditional Jensen inequality.)
- Take expectations and use monotonicity and smoothing. Q.E.D.

Example 17
- $X_i \stackrel{iid}{\sim} U[0,\theta]$, $i = 1,\dots,n$.
- $T = \max\{X_1,\dots,X_n\}$ is sufficient. Take $\delta = \frac1n\sum_i X_i$.
- Rao–Blackwell: $\eta = E_\theta[\delta\mid T]$ strictly dominates $\delta$, for any strictly convex loss. Let us verify this.
- Conditional distribution of $X_i$ given T: a mixture of a point mass at T and the uniform distribution on [0, T] (why?):
$$P_\theta(X_i\in A\mid T) = \frac1n\,\delta_T(A) + \Big(1-\frac1n\Big)\int_0^T \frac1T\,1_A(x)\,dx$$
- Compactly, $X_i\mid T \sim \frac1n\,\delta_T + (1-\frac1n)\,\mathrm{Unif}(0,T)$.
- It follows that
$$E_\theta(X_i\mid T) = \frac1n T + \Big(1-\frac1n\Big)\frac{T}{2} = \frac{n+1}{2n}\,T.$$
- The same expression holds for $\eta$ by symmetry.

- That is, $\eta = E_\theta[\delta\mid T] = \frac{n+1}{2n}\,T$.
- Consider the quadratic loss:
- $\eta$ has the same bias as $\delta$ (by smoothing).
- How about the variances?
$$\mathrm{var}_\theta(\delta) = \frac1n\cdot\frac{\theta^2}{12}$$
- Since $T/\theta$ is $\mathrm{Beta}(n,1)$ distributed, we have
$$\mathrm{var}_\theta(\eta) = \Big(\frac{n+1}{2n}\Big)^2 \frac{n\,\theta^2}{(n+1)^2(n+2)} = \frac{\theta^2}{4n(n+2)}. \qquad (2)$$
- (Note: $\delta$ is biased; $2\delta$ is unbiased. It is better to compare $2\eta$ and $2\delta$.)
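A small simulation (sketch, not from the slides) matching the two variance formulas and showing the Rao–Blackwell improvement:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 3.0, 10, 200_000
X = rng.uniform(0, theta, size=(reps, n))
delta = X.mean(axis=1)                      # sample mean
eta = (n + 1) / (2 * n) * X.max(axis=1)     # its Rao-Blackwellization
print(delta.var(), theta**2 / (12 * n))         # var(delta) vs theta^2/(12n)
print(eta.var(), theta**2 / (4 * n * (n + 2)))  # var(eta)   vs theta^2/(4n(n+2))
```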

Another result about strictly convex loss functions:
an admissible decision rule is uniquely determined by its risk function.

Proposition 7

Assume $a\mapsto L(\theta,a)$ is strictly convex, for all $\theta$. ($R(\theta,\delta) = E_\theta[L(\theta,\delta)]$.)
Then the map $\delta \mapsto R(\cdot,\delta)$ is injective over the class of admissible decision rules.

We are identifying decision rules that are the same a.e. $\mathcal{P}$.

Proof.
- Let $\delta$ be admissible.
- Let $\delta' \ne \delta$ be such that $R(\theta,\delta) = R(\theta,\delta')$, $\forall\theta$.
- Take $\delta^* = \frac12(\delta+\delta')$. Then, by strict convexity of the loss,
$$L(\theta,\delta^*) < \frac12\big(L(\theta,\delta) + L(\theta,\delta')\big), \quad \forall\theta.$$
- Taking expectations (note: $X > 0$ a.s. $\Rightarrow$ $E[X] > 0$),
$$R(\theta,\delta^*) < \frac12\big(R(\theta,\delta) + R(\theta,\delta')\big) = R(\theta,\delta), \quad \forall\theta.$$
- $\delta'\ne\delta$ implies $\delta^*\ne\delta$.
- $\delta^*$ strictly dominates $\delta$, contradicting the admissibility of $\delta$.

Rao–Blackwell can fail for non-convex loss. Here is an example:

Example 18
- $X \sim \mathrm{Bin}(n,\theta)$. $\Omega = \mathcal{A} = [0,1]$.
- $\varepsilon$-sensitive loss function $L_\varepsilon(\theta,a) = 1\{|\theta-a|\ge\varepsilon\}$.
- Consider a general deterministic estimator $\delta = \delta(X)$.
- $\delta$ takes at most n+1 values $\{\delta(0),\delta(1),\dots,\delta(n)\}\subset[0,1]$.
- Divide [0,1] into bins of length $2\varepsilon$.
- Assume that $N := 1/(2\varepsilon) \ge n+2$ and that N is an integer (for simplicity).
- At least one of the N bins contains no $\delta(i)$, $i = 0,\dots,n$, and the midpoint of that bin is at distance $\ge\varepsilon$ from every $\delta(i)$. Hence
$$\sup_{\theta\in[0,1]} R(\theta,\delta) = 1$$
for any nonrandomized rule (assuming $\varepsilon \le 1/[2(n+2)]$).
- Consider the randomized estimator $\delta' = U \sim U[0,1]$, independent of X:
$$R(\theta,\delta') = P(|U-\theta|\ge\varepsilon) \le 1-\varepsilon \quad (\text{worst case at } \theta = 0, 1).$$
- $\sup_{\theta\in[0,1]} R(\theta,\delta') < 1$, strictly better than that of $\delta$.

Uniformly minimum variance unbiased (UMVU) criterion

- Comparing estimators based on their whole risk functions is problematic.
- One way to mitigate this: restrict the class of estimators.
- Focus on quadratic loss and restrict to unbiased estimators.
- By the bias-variance decomposition, minimizing the risk then amounts to minimizing the variance.
- Let $\mathcal{U}_g$ be the class of unbiased estimators of $g(\theta)$, that is,
$$\mathcal{U}_g = \{\delta : E_\theta[\delta] = g(\theta),\ \forall\theta\}.$$

Definition 9
An estimator $\delta$ is UMVU for estimating $g(\theta)$ if
- $\delta\in\mathcal{U}_g$, and
- $\mathrm{var}_\theta(\delta) \le \mathrm{var}_\theta(\delta')$, for all $\theta\in\Omega$ and for all $\delta'\in\mathcal{U}_g$.

Theorem 3 (Lehmann–Scheffé)
Consider the family $\mathcal{P}$ and assume that
- $\mathcal{U}_g$ is nonempty (i.e., g is U-estimable), and
- there is a complete sufficient statistic T for $\mathcal{P}$.
Then there is an essentially unique UMVU for $g(\theta)$, of the form $h(T)$.

Proof.

- Pick $\delta\in\mathcal{U}_g$. (Valid by non-emptiness.)
- Let $\eta = E_\theta[\delta\mid T]$ be an estimator. (Well-defined by sufficiency of T.)
- Claim: $\eta$ is the essentially unique UMVU.
- Pick any $\delta'\in\mathcal{U}_g$ and let $\eta' = E_\theta[\delta'\mid T]$.
- $E_\theta[\eta - \eta'] = g(\theta) - g(\theta) = 0$, $\forall\theta$. (By smoothing and unbiasedness.)
- (a) $\eta - \eta' = 0$ a.e. $\mathcal{P}$. (By completeness of T.)
- By Rao–Blackwell for the quadratic loss $a\mapsto(g(\theta)-a)^2$ and unbiasedness,
$$\mathrm{var}_\theta(\eta) = R(\theta,\eta) \stackrel{(a)}{=} R(\theta,\eta') \le R(\theta,\delta') = \mathrm{var}_\theta(\delta').$$
- Since $\delta'$ was an arbitrary element of the class $\mathcal{U}_g$, we are done. Q.E.D.

Remark 1
Note that we have also shown the uniqueness:
- If $\delta'\in\mathcal{U}_g$ is UMVU and not a function of T, then it is strictly dominated by $\eta'$ (by Rao–Blackwell and strict convexity of the quadratic loss), i.e., $R(\theta,\eta') < R(\theta,\delta')$.
- Otherwise, it is equal to $\eta'$, which is equal to $\eta$ a.e. $\mathcal{P}$.

Lehmann–Scheffé suggests a way of constructing UMVUs.

Example 19 (Coin tossing)
- $X_1,\dots,X_n \stackrel{iid}{\sim} \mathrm{Ber}(\theta)$; we want to estimate $g(\theta) = \theta^2$.
- $T = \sum_i X_i$ is complete and sufficient. (General result for exponential families.)
- Take $U = X_1 X_2$.
- U is unbiased for $\theta^2$: $E_\theta[U] = E_\theta[X_1]E_\theta[X_2] = \theta^2$ by independence.
- By Lehmann–Scheffé,
$$E[U\mid T = t] = P(X_1 + X_2 = 2\mid T = t) = \begin{cases}\binom{n-2}{t-2}\big/\binom{n}{t}, & t\ge 2\\ 0, & \text{otherwise}\end{cases} = \frac{t(t-1)}{n(n-1)}$$
is the UMVU estimator for $\theta^2$.
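A quick simulation (sketch, not from the slides) checking that both $U = X_1X_2$ and its Rao–Blackwellization $T(T-1)/(n(n-1))$ are unbiased for $\theta^2$, and that the latter has much smaller variance:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 0.4, 20, 300_000
X = rng.binomial(1, theta, size=(reps, n))
U = X[:, 0] * X[:, 1]
T = X.sum(axis=1)
umvu = T * (T - 1) / (n * (n - 1))
print(U.mean(), umvu.mean(), theta**2)   # both means close to theta^2
print(U.var(), umvu.var())               # var(UMVU) << var(U)
```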

Approach 2 to obtaining UMVUs:

Example 20
- $X_1,\dots,X_n \stackrel{iid}{\sim} U[0,\theta]$.
- $T = X_{(n)} = \max\{X_1,\dots,X_n\}$ is complete sufficient.
- The UMVU for $g(\theta)$ is given by some $h(X_{(n)})$.
- h is the solution of the following integral equation:
$$g(\theta) = E_\theta[h(X_{(n)})] = n\theta^{-n}\int_0^\theta t^{n-1}h(t)\,dt.$$
- For $g(\theta) = \theta$, $\delta_1 = \frac{n+1}{n}T$ is unbiased, hence UMVU by Lehmann–Scheffé.
- $\mathrm{MSE}_\theta(\delta_1) = \mathrm{var}_\theta(\delta_1) = \frac{\theta^2}{n(n+2)}$.
- On the other hand, among estimators of the form $\delta_a = aT$, $a = (n+2)/(n+1)$ gives the lowest MSE.
- This biased estimator has slightly better $\mathrm{MSE} = \theta^2/(n+1)^2$.
- A little bit of bias is not bad.

Exponential family

Definition 10
$\mathcal{X}$: a general sample space, $\Omega$: a general parameter space,
- A function $T:\mathcal{X}\to\mathbb{R}^d$, $T(x) = (T_1(x),\dots,T_d(x))$.
- A function $\eta:\Omega\to\mathbb{R}^d$, $\eta(\theta) = (\eta_1(\theta),\dots,\eta_d(\theta))$.
- A measure $\nu$ on $\mathcal{X}$ (e.g., Lebesgue or counting), and a function $h:\mathcal{X}\to\mathbb{R}_+$.
The exponential family with sufficient statistic T and parametrization $\eta$, relative to $h\cdot\nu$, is the dominated family of distributions given by the following densities w.r.t. $\nu$:
$$p_\theta(x) = \exp\big(\langle\eta(\theta), T(x)\rangle - A(\theta)\big)\,h(x), \quad x\in\mathcal{X},$$
where $\langle\eta(\theta),T(x)\rangle = \sum_{i=1}^d \eta_i(\theta)T_i(x)$ is the Euclidean inner product.

- $T:\mathcal{X}\to\mathbb{R}^d$, $\eta:\Omega\to\mathbb{R}^d$.
$$p_\theta(x) = \exp\big(\langle\eta(\theta), T(x)\rangle - A(\theta)\big)\,h(x), \quad x\in\mathcal{X}.$$
- $A(\theta)$ is determined by the other ingredients, via the normalization constraint $\int p_\theta(x)\,d\nu(x) = 1$:
$$A(\theta) = \log\int e^{\langle\eta(\theta),T(x)\rangle}\,d\tilde\nu(x), \quad \text{where } d\tilde\nu(x) := h(x)\,d\nu(x).$$
- A is called the log partition function or cumulant generating function.
- The actual parameter space is
$$\Omega_0 = \{\theta\in\Omega : A(\theta) < \infty\}.$$
- By the factorization theorem, T(X) is indeed sufficient.
- The representation of the exponential family is not unique.

Here are some examples:

Example 21
$X \sim \mathrm{Ber}(\theta)$:
$$p_\theta(x) = \theta^x(1-\theta)^{1-x} = \exp\Big[x\log\frac{\theta}{1-\theta} + \log(1-\theta)\Big].$$
Here $h(x) = 1\{x\in\{0,1\}\}$,
$$\eta(\theta) = \log\Big(\frac{\theta}{1-\theta}\Big), \quad T(x) = x, \quad A(\theta) = -\log(1-\theta), \quad \Omega_0 = (0,1).$$
- We need to take $\Omega = (0,1)$, otherwise $\eta$ is not well-defined.

Example 22
$X \sim N(\mu,\sigma^2)$: Let $\theta = (\mu,\sigma^2)$,
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big[-\frac{1}{2\sigma^2}(x-\mu)^2\Big] = \exp\Big[-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \Big(\frac{\mu^2}{2\sigma^2} + \frac12\log(2\pi\sigma^2)\Big)\Big].$$
Here $h(x) = 1$,
$$\eta(\theta) = \Big(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\Big), \quad T(x) = (x, x^2),$$
$$A(\theta) = \frac{\mu^2}{2\sigma^2} + \frac12\log(2\pi\sigma^2), \quad \Omega_0 = \{(\mu,\sigma^2) : \sigma^2 > 0\}.$$
- We could have taken $h(x) = \frac{1}{\sqrt{2\pi}}$ and $A(\theta) = \frac{\mu^2}{2\sigma^2} + \frac12\log\sigma^2$.

Example 23
- $X \sim U[0,\theta]$,
- $p_\theta(x) = \theta^{-1}\,1\{x\in(0,\theta)\}$.
- Not an exponential family, since the support depends on the parameter.

Consider the following conditions:

(E1) η(Ω0) has non-empty interior.

(E2) $\{T_1,\dots,T_d, 1\}$ are linearly independent $\tilde\nu$-a.e. That is,
$$\nexists\, a\in\mathbb{R}^d\setminus\{0\},\ c\in\mathbb{R} \ \text{ such that } \ \langle a, T(x)\rangle = c, \ \tilde\nu\text{-a.e. } x.$$
(E1') $\eta(\Omega_0)$ is open.

Definition 11
- A family satisfying (E1) and (E2) is called full-rank.
- One that satisfies (E1') is regular.
- One that satisfies (E2) is minimal.

- Condition (E1) prevents the $\{\eta_i\}$ from satisfying a constraint.
- Condition (E2) prevents unidentifiability.

Example 24

- A Bernoulli model: $p_\theta(x) \propto \exp(\theta_0(1-x) + \theta_1 x)$.
- $x + (1-x) = 1$, $\forall x$. Hence, the family is not full-rank.

Example 25
- A continuous model with $\Omega = \mathbb{R}$ and $\eta_1(\theta) = \theta$, $\eta_2(\theta) = \theta^2$.
- The interior of $\eta(\Omega)$ is empty, hence the model is not full-rank.

Theorem 4
In a full-rank exponential family, T is complete.

- We just show that T is minimal sufficient.
- Completeness is more technical, but follows from Laplace transform arguments.

Theorem 5
In a full-rank exponential family, T is complete.

Proof. (Minimal sufficiency.)
- By the factorization theorem, T is sufficient for $\mathcal{P}$ (the whole family).
- Choose $\{\theta_0,\theta_1,\dots,\theta_d\}\subset\Omega$, with $\eta_i := \eta(\theta_i)$, such that
$$\{\eta_1-\eta_0,\ \eta_2-\eta_0,\ \dots,\ \eta_d-\eta_0\} \ \text{are linearly independent.} \quad \text{(Possible by (E1).)}$$
- The matrix $A = (\eta_1-\eta_0,\dots,\eta_d-\eta_0)^T\in\mathbb{R}^{d\times d}$ is full-rank.
- Let $\mathcal{P}_0 = \{p_{\theta_i} : i = 0,1,\dots,d\}$. Then, with $T = T(X)$,
$$\Big(\log\frac{p_{\theta_1}(X)}{p_{\theta_0}(X)}, \dots, \log\frac{p_{\theta_d}(X)}{p_{\theta_0}(X)}\Big) = \big(\langle\eta_1-\eta_0, T\rangle, \dots, \langle\eta_d-\eta_0, T\rangle\big) = AT$$
is minimal sufficient for $\mathcal{P}_0$.
- It follows that T is so, since A is invertible.
- Since $\mathcal{P}_0$ and $\mathcal{P}$ have common support, T is also minimal for $\mathcal{P}$.

$$p_\theta(x) = \exp\big(\langle\eta(\theta), T(x)\rangle - A(\theta)\big)\,h(x), \quad x\in\mathcal{X}.$$

Definition 12
An exponential family is in canonical (or natural) form if $\eta(\theta) = \theta$.

In this case:
- $\eta = \theta$ is called the natural parameter.
- $\Omega_0 := \{\theta\in\mathbb{R}^d : A(\theta) < \infty\}$ is called the natural parameter space.
- $\Omega_0\subset\mathbb{R}^d$.
- The family is determined by the choice of $\mathcal{X}$, T(x) and $\tilde\nu = h\cdot\nu$.

Example 26 (Two-parameter Gaussian)
- $\mathcal{X} = \mathbb{R}$, $T(x) = (x, x^2)$.
- $p_\theta(x) = \exp(\theta_1 x + \theta_2 x^2 - A(\theta))$, $\forall x\in\mathcal{X}$.
- $A(\theta) = \log\int e^{\theta_1 x + \theta_2 x^2}\,dx$.
- $A(\theta) < \infty$ iff $\theta_2 < 0$. Natural parameter space: $\Omega_0 = \{(\theta_1,\theta_2) : \theta_2 < 0\}$.
- Note: $\theta_1 = \frac{\mu}{\sigma^2}$ and $\theta_2 = -\frac{1}{2\sigma^2}$ (in the original parametrization $(\mu,\sigma^2)$).

Recall: $\|x\|_1 = \sum_{i=1}^d |x_i|$.

Example 27 (Multinomial)

- $\mathcal{X} = \mathbb{Z}_+^d = \{x = (x_1,\dots,x_d) : x_i \text{ integer},\ x_i\ge 0\}$.
- $T(x) = x$.
- $h(x) = \binom{n}{x_1,x_2,\dots,x_d}\,1\{\|x\|_1 = n\}$, $\nu$ = counting measure.
- Canonical family: $p_\theta(x) = \exp\big(\sum_{i=1}^d \theta_i x_i - A(\theta)\big)\,h(x)$.
- One can show that $A(\theta) = n\log\big(\sum_{i=1}^d e^{\theta_i}\big)$, finite everywhere.
- Hence $\Omega_0 = \mathbb{R}^d$.
- Not full-rank: violates (E2).

The multinomial distribution with the usual parameter,
$$q_\pi(x) = \binom{n}{x_1,x_2,\dots,x_d}\prod_{i=1}^d \pi_i^{x_i},$$
looks like a subfamily:
- It corresponds to the following subset of the natural parameter space $\Omega_0$:
$$\Big\{(\log\pi_i) : \pi_i > 0,\ \sum_{i=1}^d \pi_i = 1\Big\} = \Big\{\theta\in\mathbb{R}^d : \sum_{i=1}^d e^{\theta_i} = 1\Big\}.$$
- This family is also not full-rank (violates (E1)).
- It is actually not a sub-family of Example 27, since $p_\theta = p_{\theta + a\mathbf{1}}$ for any $a\in\mathbb{R}$.
- That is, the $\theta$ parametrization is non-identifiable in Example 27.

Example 28 (Multivariate Gaussian)
- $\mathcal{X} = \mathbb{R}^p$. $\nu$ = Lebesgue measure on $\mathcal{X}$.
- $T(x) = (x_i,\ i = 1,\dots,p \mid x_i x_j,\ 1\le i<j\le p \mid x_i^2,\ i = 1,\dots,p)$.
- Parameter = $(\theta_i,\ i = 1,\dots,p \mid 2\Theta_{ij},\ 1\le i<j\le p \mid \Theta_{ii},\ i = 1,\dots,p)$.
- Corresponding canonical exponential family:
$$p_{\theta,\Theta}(x) = \exp\Big(\sum_i \theta_i x_i + 2\sum_{i<j}\Theta_{ij}x_i x_j + \sum_i \Theta_{ii}x_i^2 - A(\theta,\Theta)\Big)$$
- Compactly, treating $\Theta$ as a symmetric matrix,
$$p_{\theta,\Theta}(x) = \exp\big(\langle\theta,x\rangle + \langle\Theta, xx^T\rangle - A(\theta,\Theta)\big),$$
where $\langle\Theta, xx^T\rangle := \mathrm{tr}(\Theta xx^T) = x^T\Theta x$.
- Dimension (or rank) of the family: $d = p + p(p+1)/2$.

Density of the multivariate Gaussian $N(\mu,\Sigma)$:
$$p_{\mu,\Sigma}(x) \propto \frac{1}{|\Sigma|^{1/2}}\exp\Big[-\frac12(x-\mu)^T\Sigma^{-1}(x-\mu)\Big] = \exp\Big[-\frac12 x^T\Sigma^{-1}x + x^T\Sigma^{-1}\mu - \frac12\mu^T\Sigma^{-1}\mu - \frac12\log|\Sigma|\Big]$$
- It can be written as a canonical exponential family
$$p_{\theta,\Theta}(x) = \exp\big(\langle\theta,x\rangle + \langle\Theta, xx^T\rangle - A(\theta,\Theta)\big),$$
where $\langle\Theta, xx^T\rangle := \mathrm{tr}(\Theta xx^T) = x^T\Theta x$.
- Correspondence with the original parameters:
  - $\theta = \Sigma^{-1}\mu$ and $\Theta = -\frac12\Sigma^{-1}$
  - $A(\theta,\Theta) = \frac12(\mu^T\Sigma^{-1}\mu + \log|\Sigma|) = -\frac14\theta^T\Theta^{-1}\theta - \frac12\log|{-2\Theta}| + \text{const}$.
- Sometimes called a Gaussian Markov Random Field (GMRF), especially when $\Theta_{ij} = 0$ for $(i,j)\notin E$, where E is the edge set of a graph.

Example 29 (Ising model)
- Both a graphical model and an exponential family.
- Used in statistical physics. Allows for complex correlations among discrete variables. Discrete counterpart of the GMRF.
- Ingredients:
  - A given graph $G = (V, E)$. $V := \{1,\dots,n\}$: vertex set. $E\subset V^2$: edge set.
  - Random variables attached to vertices, $X = (X_i : i\in V)$.
  - Each $x_i\in\{-1,+1\}$, the spin of node i.
- Take $\mathcal{X} = \{-1,+1\}^V \simeq \{-1,+1\}^n$.
- $T(X) = (X_i,\ i\in V,\ X_iX_j,\ (i,j)\in E)$.
- The underlying measure is counting (and $h(x)\equiv 1$).
$$p_\theta(x) = \exp\Big(\sum_{i\in V}\theta_i x_i + \sum_{(i,j)\in E}\theta_{ij}x_ix_j - A(\theta)\Big)$$
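For very small graphs the log partition function can be computed by brute force; the following sketch (not from the slides, with arbitrary illustrative parameter values) enumerates all spin configurations on a triangle.

```python
import itertools
import numpy as np

V = [0, 1, 2]
E = [(0, 1), (1, 2), (0, 2)]
theta_node = {i: 0.2 for i in V}
theta_edge = {e: 0.5 for e in E}

def inner(x):
    # <theta, T(x)> for the Ising sufficient statistic
    return (sum(theta_node[i] * x[i] for i in V)
            + sum(theta_edge[(i, j)] * x[i] * x[j] for (i, j) in E))

configs = list(itertools.product([-1, +1], repeat=len(V)))
A = np.log(sum(np.exp(inner(x)) for x in configs))   # log partition function
probs = {x: np.exp(inner(x) - A) for x in configs}
print(A, sum(probs.values()))                        # probabilities sum to 1
```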

Example 30 (Exponential random graph model (ERGM))
- A parametric family of probability distributions on graphs.
- Let $\mathcal{X}$ = the space of graphs on n nodes.
- Let $T_i(G)$ be functions on the space of graphs, for $i = 1,\dots,k$.
- Usually subgraph counts:
$$T_1(G) = \text{number of edges}$$
$$T_2(G) = \text{number of triangles}$$
$$\vdots$$
$$T_j(G) = \text{number of } r\text{-stars (for a given } r)$$
$$\vdots$$
- Underlying measure: counting (on graphs),
$$p_\theta(G) = \exp\Big(\sum_{i=1}^k \theta_i T_i(G) - A(\theta)\Big).$$

Focus: full-rank canonical (FRC) exponential families.

Proposition 8
In a canonical exponential family,
(a) A is convex on its domain $\Omega_0$,
(b) $\Omega_0$ is a convex set.

Proof. (It is enough to show (a); convexity of $\Omega_0$ follows from convexity of A.)
- Apply the Hölder inequality with $1/p = \alpha$ and $1/q = 1-\alpha$. (Exercise) Q.E.D.
- Hölder inequality: For $X, Y\ge 0$ a.s.,
$$E[X^\alpha Y^{1-\alpha}] \le (EX)^\alpha(EY)^{1-\alpha}, \quad \forall\alpha\in[0,1].$$
- Expectation can be replaced with an integral w.r.t. a general measure: for $f, g\ge 0$ a.e. $\tilde\nu$,
$$\int f^\alpha g^{1-\alpha}\,d\tilde\nu \le \Big(\int f\,d\tilde\nu\Big)^\alpha\Big(\int g\,d\tilde\nu\Big)^{1-\alpha}, \quad \forall\alpha\in[0,1].$$

Proposition 9

In a FRC exponential family, A is $C^\infty$ on $\mathrm{int}(\Omega_0)$ and moreover
$$E_\theta[T] = \nabla A(\theta), \qquad \mathrm{cov}_\theta[T] = \nabla^2 A(\theta).$$
That is,
$$\frac{\partial A}{\partial\theta_i} = E_\theta[T_i(X)], \qquad \frac{\partial^2 A}{\partial\theta_i\partial\theta_j} = \mathrm{cov}_\theta[T_i(X), T_j(X)].$$

Proof sketch.
- The moment generating function (mgf) of T is
$$M_T(u) := M_T(u;\theta) := E_\theta[e^{\langle u,T\rangle}] = \int e^{\langle u,T(x)\rangle} e^{\langle\theta,T(x)\rangle - A(\theta)}\,d\tilde\nu(x) = e^{A(u+\theta) - A(\theta)}.$$
- If $\theta\in\mathrm{int}\,\Omega_0$, then $M_T$ is finite in a neighborhood of zero: $M_T(u) < \infty$ for $\|u\|_2\le\varepsilon$.
- The DCT implies $M_T$ is $C^\infty$ in a neighborhood of 0, and we can interchange the order of differentiation and integration.

- Moment generating function:
$$M_T(u) = E_\theta[e^{\langle u,T\rangle}] = e^{A(u+\theta)-A(\theta)}$$

- We get (fixing $\theta$)
$$\frac{\partial A}{\partial u_i}(u+\theta)\,M_T(u) = \frac{\partial}{\partial u_i}M_T(u) = E_\theta\Big[\frac{\partial}{\partial u_i}e^{\langle u,T\rangle}\Big] = E_\theta[T_i\, e^{\langle u,T\rangle}],$$
valid in a neighborhood of 0.
- Evaluating at u = 0 gives the result for the mean. ($M_T(0) = 1$.)
- Getting the covariance is similar. (Exercise)

Remark 2
- The covariance is positive semidefinite, hence $\nabla^2 A(\theta) \succeq 0$.
- This gives another proof of the convexity of A.

Example 31
$X \sim N(\theta,1)$:
$$p_\theta(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\exp\Big(\theta x - \frac12\theta^2\Big)$$
- $A(\theta) = \frac12\theta^2$. Hence:
$$E_\theta[X] = A'(\theta) = \theta, \qquad \mathrm{var}_\theta(X) = A''(\theta) = 1.$$

Exponential family • dPθ(x) = exp[ θ, T (x) A(θ)] dν(x) h i − Alternative parametrization in terms of mean parameter µ: • µ := µ(θ) = Eθ[T (X )]

Mean parameters are easy to estimate: µ = 1 n T (X (i)). • n i=1 P Example 32 (Two-parameter Gaussian)b 2 2 •X = R, T (x) = (x, x ), pθ(x) = exp(θ1x + θ2x − A(θ)).

• Natural parameter space: Ω0 = {(θ1, θ2): θ2 < 0}. m 1 2 • θ1 = σ2 and θ2 = − 2σ2 in the original parametrization N(m, σ ). • Mean parameters:

      " − θ1 # µ1 E[X ] m 2θ2 µ = = = = 2 µ [X 2] m2 + σ2 θ1 1 2 E 2 − 2θ 4θ2 2

82 / 218 Realizable means

Interesting general set: • d := µ R µ = Ep[T (X )] for some density p w.r.t. ν , M { ∈ | } the set of realizable mean parameters by any distribution (absolutely continuous w.r.t. ν). 1 is essentially the convex hull of the support of ν#T = ν T − •M ◦ More precisely, • int( ) = int(co(supp(ν#T ))). M

83 / 218 d := µ R µ = Ep[T (X )] for some density p w.r.t. ν , M { ∈ | }

Example 33 T (X ) = (X , X 2) R2 and ν the Lebesgue measure: • ∈ 2 (µ1, µ2) = (Ep[X ], Ep[X ]) By nonnegativity of the variance we need to have • 2 0 := (µ1, µ2): µ2 µ M ⊂ M { ≥ 1}

2 Any (µ1, µ2) int 0 can be realized by a N(µ1, µ2 µ1). • ∈ M 2 − bd 0 := (µ1, µ2): µ2 = µ cannot be achieved by a density (why?). • M { 1} bd 0 can be approached arbitrarily closely by densities. • M We have = int 0 and = 0. • M M M M

84 / 218 Example 34 (Multivariate Gaussian) T (X ) = (X , XX T ). • T Let µ = Ep[X ] and Λ = Ep[XX ] for some density p w.r.t. Lebesgue • measure. Covariance matrix is PSD, hence Λ µµT 0. • −  Closure of (sef of realizable means) is • M := (µ, Λ) Λ µµT M { |  } int( ) = = (µ, Λ) Λ µµT can be realized by non-degenerate • GaussianM distributionsM { N|(µ, Λ µµ}T ), a full-rank exponential family. −

85 / 218 A remarkable result: Anything int can be realized by an exponential family. Let ∈ M d Ω := dom A := θ R : A(θ) < { ∈ ∞}

Theorem 6 In a FRC exponential family, assuming A is essentially smooth, A : int Ω int is one-to-one and onto. •∇ → M In other words, A establishes a bijection between int Ω and int . ∇ M Recall as part of FRC assumption, int Ω = . • 6 ∅ 1 WLOG, we can assume T (x) = x (absorb T into the measure, ν T − ), • ◦ That is, we work with the standard family •

dPθ(x) = exp( θ, x A(θ)) dν(x) h i −

By Proposition 9, Eθ(X ) = A(θ). • ∇ The proof is a tour de France of convex/real analysis. •

86 / 218 Proof sketch

A : int Ω int is one-to-one (injective) and onto (surjective) ∇ → M Let Φ := A. ∇ 2 d d 1. Φ is regular on int Ω: DΦ = A R × is a full-rank matrix. True since condition (E2) implies∇ ∈2A 0. ∇ 2. Φ is injective: Since condition (E2) implies 2A 0, we conclude that A is strictly convex. This in turns implies∇ that A is a strictly monotone operator: ∇ A(θ) A(θ0), θ θ0 > 0 for θ = θ0. h∇ − ∇ − i 6 3. Φ is an open mapping: (maps open sets to open sets) (Corollary 3.27 of ?) U Rd open, f C 1(U, Rd ) regular on U = f open mapping. ⊂ ∈ ⇒ 4. By Proposition 9, we have A(int Ω) . ∇ ⊂ M 5. But why A(int Ω) int ? Follows from A being an open map1 ∇ ⊂ M ∇

1A continuous map is not necessarily open: x 7→ sin(x) maps (0, 4π) to [−1, 1]. 87 / 218 Proof sketch

A : int Ω int is one-to-one (injective) and onto (surjective) ∇ → M It remains to show that int A(int Ω): M ⊂ ∇ For any µ int , need to show θ int Ω such that A(θ) = µ. ∈ M ∈ ∇ 6. By applying a shift to ν, WLOG enough to show it for µ = 0 int . ∈ M 7. WTS: 0 int = θ int Ω s.t. A(θ) = 0. ∈ M ⇒ ∃ ∈ ∇

In general 0 ∈/ int M. So, without employing a shift, all the arguments are applied to θ 7→ A(θ) − hµ, θi.

88 / 218 Proof sketch

0 int = θ int Ω s.t. A(θ) = 0. ∈ M ⇒ ∃ ∈ ∇

d 8. A is lower semi-continuous (lsc) on R : lim inf A(θ) A(θ0). θ θ0 ≥ Follows from Fatou’s Lemma. → (lsc only matters at bd Ω since A is continuous on int Ω.) d d 9. LetΓ 0(R ) := f : R ( , ] f is proper, convex, lsc . (proper means{ not identically→ −∞ ∞.) | } ∞ d 10. A Γ0(R ). ∈ 11. A is coersive: lim A(θ) = . To be shown. θ ∞ k k→∞ 12. A is essentially smooth: by assumption. d A function f Γ0(R ) is essentially smooth (a.k.a. steep) if ∈ (a) f is differentiable on int dom f 6= ∅ and (b) k∇f (xn)k → ∞ whenever xn → x ∈ bd dom f . We in fact only need this for x ∈ (dom f ) ∩ (bd dom f ), parts of the boundary that are in the domain. In particular, (b) is not needed if dom f is itself open.

89 / 218 0 int = θ int Ω s.t. A(θ) = 0. ∈ M ⇒ ∃ ∈ ∇

13. A coercive lsc function attains its minimum (over Rd ). d 14. f Γ0(R ) and essentially smooth = minimum cannot be attained at bd∈ dom f . ⇒ 15. If in addition f is strictly convex on int dom f , the minimum is unique. Lemma 4 d Assume that f Γ0(R ) is coersive, essentially smooth, and strictly convex on int dom f . Then,∈ f attains its unique minimum at some x¯ int dom f . ∈ A is coersive, essentially smooth, and strictly convex on int dom A = int Ω 16. Conclude that A attains its minimum at a unique point θ int Ω. ∈ 17. Necessary 1st order optimality condition is A(θ) = 0. ∇ 18. Done if we show the only remaining piece: coersivity.

90 / 218 A is coersive.

d 19. For every ε, let Hu,ε := x R : x, u ε and let S = u : u 2 = 1 . { ∈ h i ≥ } { k k } 20.0 int (and full-rank assumption) implies that ε > 0 such that ∈ M ∃

inf ν(Hu,ε) > 0. u S ∈ c i.e., ε > 0 and c R such that ν(Hu,ε) e for all u S. ∃ ∈ ≥ ∈ 21. Then, for any ρ > 0,

ρu,x ρu,x ρε ρε+c eh iν(dx) eh iν(dx) e ν(Hu ) e ≥ ≥ ,ε ≥ Z ZHu,ε That is, A(ρu) ρε + c. ≥ 22. For any θ = 0, taking u = θ/ θ and ρ = θ , we obtain 6 k k k k d A(θ) θ ε + c, θ R 0 ≥ k k ∀ ∈ \{ } showing that A is coersive.

91 / 218 Side Note: In fact, A : int Ω int is a C 1 diffeomorphism (i.e., a bijection which is C 1 in both∇ directions.)→ M This follows from Theorem 3.2.8 of ?: Theorem 7 (Global inverse function theorem) Let U Rd be open and Φ C 1(U, Rd ). The following are equivalent: ⊂ ∈ V = Φ(U) is open and Φ: U V is a C 1 diffeomorphism. • → Φ is injective and regular on U. • Φ = A is injective and regular on int Ω hence a C 1 diffeomorphism. ∇

92 / 218 MLE in exponential family

Significance of Theorem 6 for statistical inference: • Assume X1,..., Xn are i.i.d. draws from • p (x) = exp( θ, T (x) A(θ))h(x). θ h i − The likelihood is •

LX (θ) = p (Xi ) exp θ, T (Xi ) nA(θ) . θ ∝ h i − i i Y X  1 n Letting µ = T (Xi ) the log-likelihood is • n i=1

P `X (θ) = n θ, µ A(θ) + const.. b h i − If µ int , there exists a unique MLE, solution of A(θ) = µ. • ∈ M b ∇ 1 That is θMLE = A− (µ) = A∗(µ). • b ∇ ∇ b A∗ is the Fenchel–Legandre conjugate of A. • b If µ / , then the MLEb does notb exist. What happens at the boundary • can be∈ M determined on a case by case basis. b 93 / 218 Technical remarks

A is always lower semi-continuous (lsc). • If Ω = dom A is open, lower semicontinuity implies that A(θ) as θ • approaches the boundary. → ∞

(Pick θ0 bd Ω, then lim infθ θ0 A(θ) A(θ0) = .) ∈ → ≥ ∞ In other words, if Ω is open, A is automatically essentially smooth. •

94 / 218 Example 35 (Two-parameter Gaussian) 2 2 = R, T (x) = (x, x ). pθ(x) = exp(θ1x + θ2x A(θ)), x . •X − ∀ ∈ X 2 θ1x+θ2x A(θ) = log e dx. A(θ) < iff θ2 < 0. • ∞ Natural parameter space: Ω = dom A = (θ1, θ2): θ2 < 0 . • R { } m 1 2 θ1 = 2 and θ2 = 2 in original parametrization (m, σ ). • σ − 2σ µ = θ1 and σ2 = 1 . 2θ2 2θ2 • − − 2 θ1 θ1 1 Mean parametrization: µ1 = , µ2 = + 2θ2 2θ2 2θ2 • − − − 2 2 µ 1 2 θ1 1 π  A(θ) = 2 + log(2πσ ) = + log . 2σ 2 4θ2 2 θ2 • − − Easy to verify that A(θ) = (µ1, µ2) • and it establishes a∇ bijection between 2 (θ1, θ2): θ2 < 0 = int Ω int = (µ1, µ2): µ > µ2 { } ↔ M { 1 } Note that since A is lsc, µ(θ) = A(θ) as θ approaches the boundary. • ∇ → ∞ Show a picture of θ A(θ) θ, µ for µ = (0, 1). • 7→ − h i

95 / 218 Maximum entropy characterization of exponential family

Not only exponential families realize any mean, they achieve it with • maximum entropy: Solution to

max Ep[ log p(X )] s.t. Ep[T (X )] = µ, p − is given by a density of the form p(x) exp( θ, T (x) ). ∝ h i Discrete case, easy to verify by introducing Lagrange multipliers: • = x1,..., xK •X { } ν = counting meas. and pi = p(xi ), let p = (p1,..., pK ), and ti = T (xi ), •

maxp pi log pi − i s.t. p t = µ, Pi i i pi 0, pi = 1 P≥ i P Without pi ti = µ, uniform distribution maximizes the entropy. • i P 96 / 218 Information inequality (Cramer–Rao)

- How small can the variance of an unbiased estimator be? How well can the UMVU do?
- The bound also plays a role in asymptotics.
- Idea: Use Cauchy–Schwarz (CS), also called the covariance inequality in this context:

2 2 2 2 (EXY ) (EX )(EY ), or [cov(X , Y )] var(X ) var(Y ) ≤ ≤ Running assumption: every RV/estimator has finite second moment. • δ unbiased for some g(θ), ψ any other estimator • 2 [covθ(δ, ψ)] varθ(δ) (3) ≥ varθ(ψ) Need to get rid of δ on the RHS. • By cleverly choosing ψ, we can obtain good bounds. •

97 / 218 Assume Pθ+h Pθ: pθ+h(x) = 0 whenever pθ(x) = 0. •  (Local) likelihood ratio is well-defined: (Can define it to be 1 for 0/0.) •

Lθ,h(X ) = pθ+h(X )/pθ(X )

- ($= dP_{\theta+h}/dP_\theta$, the Radon–Nikodym derivative of $P_{\theta+h}$ w.r.t. $P_\theta$.)
- Change of measure by integrating against the likelihood ratio:

Eθ[δLθ,h] = δ Lθ,h pθ dµ = δ pθ+h dµ = Eθ+h[δ] (4) Zpθ >0 Zpθ >0

Note that p +h is concentrated on x : p (x) > 0 . θ { θ }

Lemma 5 (Hammersley–Chapman–Robbins (HCR))

Assume Pθ+h Pθ, and let δ be unbiased for g(θ). Then,  [g(θ + h) g(θ)]2 varθ(δ) − 2 ≥ Eθ(Lθ,h 1) − Proof.

Idea: Apply CS inequality (3) to ψ = L h 1. • θ, − Eθ[ψ] = Eθ[Lθ,h] 1 = 0. (By an application of (4) to δ = 1.) • − Another application of (4) gives2 •

covθ(δ, ψ) = Eθ[δψ] = Eθ[δLθ,h] Eθ[δ] = g(θ + h) g(θ). − −

2ψ is not an unbiased estimator of 0, since it depends on θ. It is not a proper estimator. Not a contradiction with “UMVU is uncorrelated from any unbiased estimator of 0”. 99 / 218 Assume that θ (and hence h) is a scalar. • Likelihood ratio approaches 1 as h 0: • →

1 [pθ+h(X ) pθ(X )]/h lim [Lθ,h(X ) 1] = lim − h 0 h − h 0 p (X ) → → θ ∂θ[pθ(X )] = = ∂θ[log pθ(X )] pθ(X ) called the score function. Divide HCR by h, and let h 0: • → [g(θ + h) g(θ)]2/h2 varθ(δ) lim − 2 2 ≥ h 0 Eθ(Lθ,h 1) /h → −

Numerator goes to g 0(θ). If justified in exchanging limit and expectation, • 2 [g 0(θ)] varθ[δ(X )] 2 ≥ Eθ[∂θ log pθ(X )]

Cramer–Rao (formal statement)

log-likelihood: `θ(X ) := log pθ(X ), • ˙ d Score function: `θ(X ) := θ`θ(X ) = θ log pθ(X ) R • ∇ ∇ ∈ Theorem 8 (Cramer–Rao lower bound) Let be dominated family with densities having common support S on some P open parameter space Ω Rd . Assume: ⊂ (a) δ is an unbiased estimator for g(θ) R. ∈ d (b) g is differentiable over Ω, with gradient g˙ = θg R , ∇ ∈ (c) `˙ (x) exists for x S and θ Ω, θ ∈ ∈ (d) At least for ξ = 1 and ξ = δ and θ Ω, ∀ ∈ ∂ ∂ ξ(x)p (x) dµ(x) = ξ(x) p (x) dµ(x), i (5) ∂θ θ ∂θ θ ∀ i ZS ZS i Then, T 1 var (δ) g˙ (θ) [I (θ)]− g˙ (θ) θ ≥ ˙ ˙T d d where I (θ)= Eθ[`θ` ] R × is the Fisher information matrix. θ ∈ 101 / 218 Let us rewrite the assumption: • ∂ ∂ ξ(x)p (x) dµ(x) = ξ(x) p (x) dµ(x), i (6) ∂θ θ ∂θ θ ∀ i ZS ZS i Note that the right-hand side is: • ∂ log pθ(x) ˙ RHS = ξ(x) pθ(x) dµ(x) = Eθ ξ(X )[`θ(X )]i ∂θ ZS i  Putting the pieces together • ˙ θEθ[ξ] = Eθ[ξ`θ] (7) ∇ which is the differential form of the change of measure formula: •

Eθ+h[ξ] = Eθ[ξLθ,h]

102 / 218 Proof. ˙ Score function has zero mean, Eθ[`θ] = 0. (Apply (7) with ξ = 1.) • ˙ g˙ (θ) = Eθ[δ`θ] (Apply (7) with ξ = δ.) • d T ˙ Fix some a R . Will apply CS inequality (3) with ψ = a `θ. • ∈ Since aT `˙ is zero mean: • θ T T ˙ T ˙ a g˙ (θ) = Eθ[δ a `θ] = covθ(δ, a `θ). Similarly, • T ˙ T ˙ ˙T T varθ(a `θ) = Eθ[a `θ`θ a] = a I (θ)a

CS inequality (3) with ψ = aT `˙ gives: • θ [cov (δ, aT `˙ )]2 (aT g˙ (θ))2 var (δ) θ θ = . θ T ˙ T ≥ varθ(a `θ) a I (θ)a Almost done. Problem reduces to (Exercise) • T 2 (a v) T 1 sup T = v B− v. a=0 a Ba 6

Hint: Since B 0, B−1/2 is well-defined; take z = B1/2a. Q.E.D.

103 / 218 Regularity conditions for interchanging the integral and derivative are key, • so is the unbiasedness. • Under same assumptions (recall ` = log p (X )) • θ θ ¨ 2 I (θ) = Eθ[ `θ] = Eθ[ `θ] − −∇θ I (θ) measures expected local curvature of the likelihood. • Attainment of CRB is related to attainment of Cauchy–Schwarz: Wijsman • (1973) shows that it happens if and only if we are in the exponential family. Fisher info. is not invariant to reparametrization: • 2 θ = θ(µ) = I (µ) = [θ0(µ)] I (θ) ⇒ CRB is invariant to reparametrization. (Exercise.) • e Fisher info. is additive over independent sampling. •

104 / 218 Multiple parameters

What if g :Ω Rm where Ω Rd ? → m ⊂d Let Jθ = (∂gi /∂θj ) R × be the Jacobian of g. • ∈ Then, under similar assumptions (notation: I = I (θ)): • θ 1 T cov (δ) J I − J θ  θ θ θ for any δ unbiased for g(θ). A B means A B 0, i.e., A B is positive semidefinite (PSD). •  −  − Proof: Fix u Rm and apply the 1-D theorem to uT δ. (Exercise) • ∈

105 / 218 Example 36

iid 2 Xi N(θ, σ ), i = 1,..., n, • ∼ σ2 is fixed, g(θ) = θ. • n n 1 2 ` (X ) = log p (X ) = log p (Xi ) = (Xi θ) + const. θ θ θ −2σ2 − i=1 i=1 X X Differentiating, we get the score function • n ∂ 1 n `˙ (X ) = log p (X ) = (Xi θ) = `¨ (X ) = . θ ∂θ θ σ2 − ⇒ θ −σ2 i=1 X whence I (θ) = n/σ2. • CRB is var (δ) σ2/n and is achieved by sample mean. • θ ≥

106 / 218 Example 37 (Exponential families)

Xi p (xi ) = h(xi ) exp( θ, T (xi ) A(θ)), i = 1,..., n. • ∼ θ h i −

` (X ) = log p (X1,..., Xn) = θ, T (Xi ) nA(θ) + const. θ θ h i − i ¨ 2 X whence I (θ) = Eθ[ `θ(X )] = n A(θ) = n covθ[T ]. • − ∇ Consider 1-D case and n = 1. • Want unbiased estimate of the mean parameter: µ(θ) = Eθ[T ] = A0(θ). • CRB is • 2 2 [µ0(θ)] [A00(θ)] = = A00(θ) = varθ(T ) I (θ) A00(θ) i.e., it is attained by T . 1 n General case: T¯ := T (Xi ) attains CRB for the mean parameter: • n i=1 P ¯ ¯ covθ(δ) covθ(T ), δ s.t. Eθ(δ) = Eθ(T ).  ∀

107 / 218 Example 38 iid Xi Poi(λ), i = 1,..., n. • ∼ Exponential family with T (X ) = X and mean parameter λ, • 1 hence, sample mean δ(X ) = Xi achieves the CRB for λ. • n i What if we want an unbiased estimate of g(λ) = λ2? • P Since I (λ) = n/ var [X1] = n/λ (why?), • λ the CRB = [2λ]2/(n/λ) = 4λ3/n. • 1 n 2 The estimator T1 = Xi (Xi 1) is unbiased for λ and • n i=1 − P 3 2 varλ(T1) = 4λ /n + 2λ /n > CRB

S = i Xi is complete sufficient, hence • 2 Rao–Blackwellized estimator T2 = E[T1 S] = S(S 1)/n is UMVU. • P | − CRB is not attained since (exercise) • 3 2 2 varλ(T2) = 4λ /n + 2λ /n > CRB,

A vector of independent Poisson variables, conditioned on their sum has a 1 1 multinomial distribution. In this case, Mult(S, ( n ,..., n )). 108 / 218 Average vs. maximum risk optimality

Bayesian methods:
- Trouble comparing estimators based on whole risk functions $\theta\mapsto R(\theta,\delta)$.
- The Bayesian approach: reduce to a (weighted) average risk.
- It assumes that the parameter is a random variable $\Theta$ with some distribution $\Lambda$, called the prior, having density $\pi(\theta)$ (w.r.t., say, Lebesgue).
- The choice of the prior is important in the Bayesian framework.
- Frequentist perspective: Bayes estimators have desirable properties.

109 / 218 Recall the decision-theoretic framework: • Family of distributions = Pθ : θ Ω . • P { ∈ } Bayesian framework: • interpret Pθ as conditional distribution of X given Θ = θ,

Pθ(A) = P(X A Θ = θ) ∈ | Together with the marginal (prior) distribution of Θ, we have the joint • distribution of (Θ, X ). Recall the risk defined as •

R(θ, δ) = Eθ[L(θ, δ(X ))] = E[L(θ, δ(X )) Θ = θ] | or in other words, R(Θ, δ) = E[L(Θ, δ(X )) Θ]. | The Bayes risk is • r(Λ, δ)= E[R(Θ, δ)]= E[L(Θ, δ(X ))].

110 / 218 Write p(x θ) = pθ(x) for density of Pθ. • | Recall that • R(θ, δ) = L(θ, δ(x))p(x θ)dx. | Z Then,

r(Λ, δ) = π(θ)R(θ, δ)dθ = π(θ) L(θ, δ(x))p(x θ)dx dθ. | Z Z h Z i We rarely used this explicit form. •

111 / 218 A Bayes rule or estimator w.r.t. Λ, denoted as δΛ, is a minimizer of the • Bayes risk:

r(Λ, δΛ) = min r(Λ, δ) δ Depends both on the prior Λ and the loss L. • Theorem 9 (Existence of Bayes estimators) Assume that

(a) δ0 with r(Λ, δ0) < ∃ ∞ (b) Posterior risk has a minimizer for µ-almost all x, that is,

δΛ(x) := argmin E[L(Θ, a) X = x] a | ∈A is well-defined for µ-almost all x. (Measurable selection.)

Then, δΛ is a Bayes rule.

Proof. Condition (a) is to guarantee that we can use Fubini theorem.

By definition of δΛ, for any δ we have E[L(Θ, δ) X ] E[L(Θ, δΛ) X ]. • | ≥ | Taking expectation and using smoothing finishes the proof. • 112 / 218 Posterior risk can be computed based on the posterior distribution of Θ • given X = x. Bayes rule gives

p(x θ)π(θ) π(θ x)= | p(x θ)π(θ) | m(x) ∝ |

where m(x) = π(θ)p(x θ)dθ is the marginal distribution of X . | Posterior is proportional to prior times the likelihood. • R Example 39 Bayes estimators for two simple loss functions: 2 Quadratic (or `2) loss: L(θ, a) = (g(θ) a) : • − 2 δΛ(x) = min E[(g(Θ) a) X = x] = E[g(Θ) X = x]. a − | | For g(θ) = θ reduces to the posterior mean.

`1 loss: L(θ, a) = θ a : Here, δΛ(x) = median(Θ X = x) is one possible • . (Not| − unique| in this case.) |

113 / 218 Example 40 (Binomial) X Bin(n, θ). • ∼ n x n x PMF is p(x θ) = θ (1 θ) − . • | x − Put a Beta prior on Θ, with α, β > 0, • 

Γ(α + β) α 1 β 1 α 1 β 1 π(θ) = θ − (1 θ) − θ − (1 θ) − Γ(α)Γ(β) − ∝ −

x+α 1 n x+β 1 We have π(θ x) p (x)π(θ) θ − (1 θ) − − • | ∝ θ ∝ − showing that Θ X = x Beta(α + x, n x + β), whence • | ∼ − x + α x α δΛ(x) := E[Θ X = x] = = (1 λ) + λ | n + α + β − n α + β

α+β where λ = n+α+β . Note: α/(α + β) is the prior mean, and x/n is the MLE (or unbiased • estimator of the mean parameter). No coincidence, happens in a general exponential family. •
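A small check of this formula (a sketch, not from the slides): the posterior mean written as the convex combination of the prior mean and the MLE agrees with the direct expression $(x+\alpha)/(n+\alpha+\beta)$.

```python
def posterior_mean(x, n, alpha, beta):
    lam = (alpha + beta) / (n + alpha + beta)
    return (1 - lam) * (x / n) + lam * (alpha / (alpha + beta))

x, n, alpha, beta = 7, 20, 2.0, 3.0
print(posterior_mean(x, n, alpha, beta), (x + alpha) / (n + alpha + beta))
```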

114 / 218 Example 41 (Normal location family) 2 Assume that Xi Θ = θ N(θ, σ ). • | ∼ Put a Gaussian prior on Θ N(µ, b2). • ∼ The model is equivalent to • iid 2 Xi = Θ + wi , wi N(0, σ ), for i = 1,..., n ∼ Reparametrize in terms of precisions τ 2 = 1/b2 and γ2 = 1/σ2. • (Θ, X1,..., Xn) is jointly Gaussian, and the posterior is • 2 Θ X = x N (1 λn)¯x + λnµ, 1/τ | ∼ − n δΛ(x)  e where | {z } 1 2 2 2 2 2 x¯ = xi , τ = nγ + τ , λn = τ /τ [0, 1] n n n ∈ Continued ... X • e e

115 / 218 With • 1 2 2 2 2 2 x¯ = xi , τ = nγ + τ , λn = τ /τ [0, 1] n n n ∈ We have X • e e 2 Θ X = x N δΛ(x), 1/τ | ∼ n

Posterior mean δΛ(x), i.e., the Bayes rule for `2 loss, is • e

δΛ(x) := (1 λn)¯x + λnµ − which is a convex combination ofx ¯ and µ and we have 2 2 δΛ(x) x¯ if n or SNR = γ /τ . • → → ∞ 2 2 → ∞ δΛ(x) µ if SNR = γ /τ 0. • → →

116 / 218 Conjugate priors

The two examples above are examples of conjugacy. • A family = π( ) of priors is conjugate to a family of likelihoods • = p( Qθ) {if the· } corresponding posteriors also belong to . P { · | } Q Example of conjugate families • normal beta Dirichlet Q normal binomial multinomial P Example 42 (Exponential families) We have the following conjugate pairs

p(x θ) = exp η(θ), T (x) ) A(θ) | h i − } qa b(θ) = exp a, η(θ) + bA(θ) B(a, b) , h i − 

117 / 218 Example 43 (Improper priors) 2 Xi N(θ, σ ), i = 1,..., n. • ∼ 1 Is δ(x) = xi , a Bayes estimator w.r.t. some prior? • n Not if we require proper priors (finite measures): ( )d , in which P π θ θ < • case π can be normalized to integrate to 1. ∞ R Need a uniform (proper) prior on the whole R which does not exist. • An improper prior can still be used if the posterior is well-defined. • (Generalized Bayes.) Alternatively, δ(x) is the limit of Bayes rules for a sequence of proper • priors. (see also the Beta-Binomial example.)

118 / 218 Comment on the uniqueness of the Bayes estimator. Theorem 10 (TPE 4.1.4)

Let Q be the marginal distribution of X , that is, Q(A) = Pθ(X A)dΛ(θ). ∈ Recall that δΛ is (a) Bayes estimator. Assume that R The loss function is strictly convex, • r(Λ, δΛ) < , • ∞ Q a.e = a.e. . Equivalently, Pθ Q for all θ Ω. • ⇒ P  ∈ Then, there is a unique Bayes estimator.

119 / 218

Minimax criterion

• Instead of averaging the risk, look at the worst-case or maximum risk:

    R̄(δ) := sup_{θ ∈ Ω} R(θ, δ).

• More in accord with an adversarial nature. (A zero-sum game.)

Definition 13
An estimator δ* ∈ D is minimax if min_{δ ∈ D} R̄(δ) = R̄(δ*).

• An effective strategy for finding minimax estimators is to look among the Bayes estimators.
• The minimax problem is: inf_δ sup_θ R(θ, δ).
• We generalize this to: inf_δ sup_Λ r(Λ, δ).

120 / 218

• Recall: δΛ is a Bayes estimator for prior Λ, with Bayes risk

    rΛ = inf_δ r(Λ, δ) = r(Λ, δΛ).

  (Last equality: assume rΛ is finite and achieved.)
• We can order priors based on their Bayes risk:

Definition 14
Λ* is a least favorable prior if rΛ* ≥ rΛ for any prior Λ.

• For a least favorable prior, we have

    rΛ* = sup_Λ rΛ = sup_Λ inf_δ r(Λ, δ) ≤ inf_δ sup_Λ r(Λ, δ) =: inf_δ r̄(δ),

  where r̄(δ) = sup_Λ r(Λ, δ) is a generalization of the maximum risk R̄(δ).
• We are interested in situations where equality holds.

121 / 218

Characterization of minimax estimators

Theorem 11 (TPE 5.1.4)
Assume that δΛ is Bayes for Λ, and r(Λ, δΛ) = R̄(δΛ). Then,
• δΛ is minimax.
• Λ is least favorable.
• If δΛ is the unique Bayes estimator (a.e. P), then it is the unique minimax estimator.

Proof of minimaxity of δΛ:
• The maximum risk is always lower-bounded by the Bayes risk,

    R̄(δ) = sup_{θ ∈ Ω} R(θ, δ) ≥ ∫ R(θ, δ) dΛ(θ) = r(Λ, δ),   ∀ δ.

• Hence R̄(δ) ≥ r(Λ, δ) ≥ rΛ = R̄(δΛ). (Last equality by assumption.)

122 / 218

Rest of the proof:
• R̄(δ) ≥ r(Λ, δ) ≥ rΛ = R̄(δΛ). (Last equality by assumption.)
• Uniqueness of the Bayes rule makes the second inequality strict for δ ≠ δΛ, showing the uniqueness of the minimax rule.
• On the other hand,

    rΛ' ≤ r(Λ', δΛ) ≤ R̄(δΛ) = rΛ,

  showing that Λ is least favorable.

123 / 218

• A decision rule δ is called an equalizer if it has constant risk:

    R(θ', δ) = R(θ, δ),   for all θ, θ' ∈ Ω.

• Let ω(δ) := {θ : R(θ, δ) = R̄(δ)} = argmax_θ R(θ, δ).
• (δ is an equalizer iff ω(δ) = Ω.)

Corollary 4 (TPE 5.1.5–6)
(a) A Bayes estimator with constant risk (i.e., an equalizer) is minimax.
(b) A Bayes estimator δΛ is minimax if Λ(ω(δΛ)) = 1.

• Both of these conditions are sufficient, not necessary.
• Condition (b) is weaker than (a).
• Strategy: find a prior Λ whose support is contained in argmax_θ R(θ, δΛ).

124 / 218

Example 44 (Bernoulli, continuous parameter space)
• X ~ Ber(θ) with quadratic loss, and Ω = [0, 1].
• Given a prior Λ on [0, 1], let m1 = E[Θ] and m2 = E[Θ²].
• Frequentist risk:

    R(θ, δ) = (δ0 − θ)²(1 − θ) + (δ1 − θ)²θ
            = θ²[1 + 2(δ0 − δ1)] + θ(δ1² − δ0² − 2δ0) + δ0².

• Bayes risk:

    r(Λ, δ) = E[R(Θ, δ)] = m2[1 + 2(δ0 − δ1)] + m1(δ1² − δ0² − 2δ0) + δ0².

• The Bayes decision rule is found by minimizing r(Λ, δ) w.r.t. δ0, δ1:

    δ1* = m2/m1,   δ0* = (m1 − m2)/(1 − m1).

125 / 218

• Aside: since p(θ | x) = C θ^x (1 − θ)^{1−x} π(θ), check that δx* = E[Θ | X = x], x = 0, 1, as it should be:

    δx* = ∫ θ^{x+1}(1 − θ)^{1−x} π(θ) dθ / ∫ θ^x (1 − θ)^{1−x} π(θ) dθ.

• A general rule δ is an equalizer, i.e., R(θ, δ) does not depend on θ, iff

    δ1 − δ0 = 1/2   and   δ1² − δ0² − 2δ0 = 0.

• These equations have a single solution: δ0 = 1/4 and δ1 = 3/4. (There is a unique equalizer rule.)
• Equalizer Bayes rule: need 3/4 = m2/m1 and 1/4 = (m1 − m2)/(1 − m1).
• Solving: m1* = 1/2 and m2* = 3/8.
• We need a prior Λ with these moments; Λ = Beta(1/2, 1/2) fits the bill.
• This is a least favorable prior.
• The corresponding Bayes, hence minimax, risk is 1/16.

126 / 218

• The above can be generalized to an i.i.d. sample of size n: for X1, ..., Xn ~iid Ber(θ), the prior Beta(√n/2, √n/2) is least favorable and the associated minimax risk is 1/(4(√n + 1)²). (A numerical check is sketched below.)
• Compare with the risk of the sample mean, R(θ, X̄) = θ(1 − θ)/n.
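A sketch (our own, not from the slides) comparing the risk function of the sample mean with the Bayes rule under Beta(√n/2, √n/2), which is the equalizer, hence minimax, rule δ(S) = (S + √n/2)/(n + √n):

```python
import numpy as np

n = 25
theta = np.linspace(0.0, 1.0, 201)

risk_mean = theta * (1 - theta) / n                        # R(theta, Xbar)
a = np.sqrt(n) / 2
# MSE of the linear rule (S + a)/(n + 2a): variance + bias^2
risk_minimax = (n * theta * (1 - theta) + (a - 2 * a * theta) ** 2) / (n + 2 * a) ** 2

print(risk_minimax.max(), 1 / (4 * (np.sqrt(n) + 1) ** 2))  # constant risk, matches formula
print(risk_mean.max())                                      # 1/(4n), attained at theta = 1/2
```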

127 / 218

Example 45 (Bernoulli, discrete parameter space)
• Let X ~ Ber(θ) and Ω = {1/3, 2/3} =: {a, b}.
• Take L(θ, δ) = (θ − δ)².
• Any (nonrandomized) decision rule is specified by a pair of numbers (δ0, δ1).
• Any prior π is specified by a single number πa = P(Θ = a) ∈ [0, 1].
• Frequentist risk:

    R(θ, δ) = (δ0 − θ)²(1 − θ) + (δ1 − θ)²θ.

• Bayes risk r(π, δ) = E[R(Θ, δ)]:

    r(π, δ) = πa R(a, δ) + (1 − πa) R(b, δ).

128 / 218

• Take derivatives w.r.t. δ0, δ1, set them to zero, and find the Bayes rule:

    δ0* = [a πa (1 − a) + b(1 − b)(1 − πa)] / [(1 − a)πa + (1 − b)(1 − πa)],
    δ1* = [a² πa + b²(1 − πa)] / [a πa + b(1 − πa)].

• For a = 1/3 = 1 − b,

    δ0* = 2/(3(πa + 1))   and   δ1* = (4 − 3πa)/(6 − 3πa).

• Equalizer rule, i.e., one with R(a, δ) = R(b, δ):

    (a + b)[2(δ0 − δ1) + 1] + δ1² − δ0² − 2δ0 = 0.

• A Bayes rule that is also an equalizer occurs for πa* = 1/2.
• This is the least favorable prior.
• The corresponding rule (δ0*, δ1*) = (4/9, 5/9) is minimax.

129 / 218

Geometry of Bayes and Minimax

• Risk body for the Bernoulli problem with two-point parameter space Ω = {1/3, 2/3}.
• Deterministic rules, Bayes rules, minimax rule.

[Figure: two risk-set plots showing the risk vectors of the deterministic rules, the Bayes rules on the lower boundary, and the minimax rule.]

130 / 218

Geometry of Bayes and minimax for finite Ω

• Assume Ω = {θ1, ..., θk} is finite, and consider the risk set (or body)

    S = { (y1, ..., yk) : yi = R(θi, δ) for some δ ∈ D } ⊂ R^k.

• Alternatively, define ρ : D → R^k by

    ρ(δ) = (R(θ1, δ), ..., R(θk, δ)),

  where D is the set of randomized decision rules.
• S is the image of D under ρ, i.e., S = ρ(D).

Lemma 6
S is a convex set (with randomized estimators).

Proof. For δ, δ' ∈ D and a ∈ [0, 1], we can form a randomized decision rule δa ∈ D such that R(θ, δa) = aR(θ, δ) + (1 − a)R(θ, δ'). (Exercise.)

131 / 218

• Every prior Λ corresponds to a vector λ = (λ1, ..., λk) ∈ R^k via Λ({θi}) = λi. Note that λ lies in the (k − 1)-simplex,

    ∆ := { (λ1, ..., λk) ∈ R+^k : Σ_{i=1}^k λi = 1 }.

• The Bayes risk is

    r(Λ, δ) = E[R(Θ, δ)] = Σ_{i=1}^k λi R(θi, δ) = λ^T ρ(δ).

• Hence finding the Bayes rule is equivalent to

    inf_{δ ∈ D} r(Λ, δ) = inf_{δ ∈ D} λ^T ρ(δ) = inf_{y ∈ S} λ^T y,

  a convex problem in R^k.
• The minimax problem is

    inf_{δ ∈ D} ||ρ(δ)||∞ = inf_{y ∈ S} ||y||∞.

• Finding the least favorable prior corresponds to sup_{λ ∈ ∆} [ inf_{y ∈ S} λ^T y ].

132 / 218

Admissibility of Bayes rules

• In general, a unique (a.e. P) Bayes rule is admissible (TPE 5.2.4).
• For finite parameter spaces we have a complete answer to the admissibility question.

Proposition 10
Assume Ω = {θ1, ..., θk} and that δλ is a Bayes rule for λ. If λi > 0 for all i, then δλ is admissible.

Proof. If δλ is inadmissible, there is δ such that

    R(θi, δ) ≤ R(θi, δλ),   ∀ i,

with strict inequality for some j. Then, since λj > 0,

    Σ_i λi R(θi, δ) < Σ_i λi R(θi, δλ),

contradicting the fact that δλ minimizes the Bayes risk. Q.E.D.

133 / 218

Proposition 11
Assume Ω = {θ1, ..., θk} and δ admissible. Then δ is Bayes w.r.t. some prior λ.

Proof.
• Let x := ρ(δ), the risk vector of δ, and Qx := { y ∈ R^k : yi ≤ xi ∀ i } \ {x}.
• Qx is convex. (Removing an extreme point from a convex set preserves convexity.)
• Admissibility means Qx ∩ S = ∅.
• Two non-empty disjoint convex sets in R^k can be separated by a hyperplane:

    ∃ u ≠ 0 s.t. u^T z ≤ u^T y,   for all z ∈ Qx and y ∈ S.

• We can choose u to have nonnegative coordinates. (Proof by contradiction. (Exercise.))

134 / 218

• Since u ≠ 0, we can set λ = u/(Σ_i ui) ∈ ∆^{k−1}.
• λ^T z ≤ inf_{y ∈ S} λ^T y = rλ, for all z ∈ Qx.
• Taking {zn} ⊂ Qx such that zn → x, we obtain λ^T x ≤ rλ.
• But, by definition of the optimal Bayes risk, rλ ≤ λ^T x, hence rλ = λ^T x, i.e., δ is Bayes w.r.t. λ.

135 / 218

M-estimation

• Setup: an i.i.d. sample of size n from a model P = {Pθ : θ ∈ Ω} on sample space X, i.e.,

    X1, ..., Xn ~iid Pθ.

• The full model is actually P^(n) = {Pθ^{⊗n} : θ ∈ Ω}, with sample space X^n.
• M-estimators: estimators obtained as solutions of optimization problems.

Definition 15
Given a family of functions mθ : X → R, for θ ∈ Ω, the corresponding M-estimator based on X1, ..., Xn is

    θ̂n := θ̂n(X1, ..., Xn) := argmax_{θ ∈ Ω} (1/n) Σ_{i=1}^n mθ(Xi).

• Often we write Mn(θ) := (1/n) Σ_{i=1}^n mθ(Xi), a random function.

136 / 218

• An alternative approach is to specify θ̂ as a Z-estimator, i.e., the solution of a set of estimating equations

    Ψn(θ) := (1/n) Σ_{i=1}^n ψ(Xi, θ) = 0.

• Often the first-order optimality conditions for an M-estimator produce a set of estimating equations. (Simplistic in general, ignoring the possibility of constraints imposed by Ω.)

137 / 218

Example 46
1. mθ(x) = −(x − θ)², then Mn(θ) = −(1/n) Σ_{i=1}^n (Xi − θ)², giving θ̂ = X̄.
2. mθ(x) = −|x − θ|, then Mn(θ) = −(1/n) Σ_{i=1}^n |Xi − θ|, giving θ̂ = median(X1, ..., Xn).
3. mθ(x) = log pθ(x), then Mn(θ) = (1/n) Σ_{i=1}^n log pθ(Xi), giving the maximum likelihood estimator (MLE).
• In a location family with pθ(x) = C exp(−β|x − θ|^p), the MLE is equivalent to an M-estimator with mθ(x) = −|x − θ|^p.
• p = 2: Gaussian distribution. (Case 1.)
• p = 1: Laplace distribution. (Case 2.)
• The corresponding Z-estimator forms of 1. and 2. are obtained with ψθ(x) = x − θ and ψθ(x) = sign(x − θ), by differentiation (or sub-differentiation) of mθ. (A small numerical illustration follows.)
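A sketch (our own) of the M-estimators in cases 1 and 2 of Example 46, obtained by numerically maximizing Mn(θ); the helper name `m_estimate` and the Laplace data are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.laplace(loc=2.0, scale=1.0, size=500)

def m_estimate(x, m):
    # maximize Mn(theta) = mean of m(x_i, theta)  <=>  minimize -Mn(theta)
    res = minimize_scalar(lambda th: -np.mean(m(x, th)))
    return res.x

theta_sq  = m_estimate(x, lambda x, th: -(x - th) ** 2)    # recovers the sample mean
theta_abs = m_estimate(x, lambda x, th: -np.abs(x - th))   # recovers the sample median
print(theta_sq, np.mean(x))
print(theta_abs, np.median(x))
```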

138 / 218

Example 47 (Method of Moments (MOM))
• Find θ̂ by matching empirical and population (true) moments:

    Eθ[X1^k] = (1/n) Σ_{i=1}^n Xi^k,   k = 1, 2, ..., d.

• Usually d is the dimension of the parameter θ (d equations in d unknowns).
• This is a set of estimating equations with ψθ(x) = (x^k − Eθ[X1^k])_{k=1}^d.
• A generalized version of MOM solves

    Eθ[ϕk(X1)] = (1/n) Σ_{i=1}^n ϕk(Xi),   k = 1, 2, ..., d,

  for some collection of functions {ϕk}, corresponding to a Z-estimator with ψθ(x) = (ϕk(x) − Eθ[ϕk(X1)])_{k=1}^d.

139 / 218

• In canonical exponential families,

    Xi ~iid pθ(x) ∝ exp{ ⟨θ, T(x)⟩ − A(θ) },

  ML and MOM are equivalent.
• The MLE is the M-estimator associated with

    mθ(x) = log pθ(x) = ⟨θ, T(x)⟩ − A(θ) + const,

  hence

    Mn(θ) = (1/n) Σ_i [ ⟨θ, T(Xi)⟩ − A(θ) ] = ⟨θ, T̂⟩ − A(θ),

  where T̂ = (1/n) Σ_i T(Xi) is the empirical mean of the sufficient statistic.
• The MLE is

    θ̂mle = argmax_{θ ∈ Ω} { ⟨θ, T̂⟩ − A(θ) }.

• Setting derivatives to zero gives T̂ = ∇A(θ̂mle). (First-order optimality.)
• Since Eθ[T] = ∇A(θ), the MLE is the solution of Eθ[T] = T̂ for θ, which is a MOM estimator. (If you will, Eθ[T(X1)] = (1/n) Σ_i T(Xi).)

140 / 218

Sidenote

• Recall that µ = ∇A(θ) is the mean parameterization.
• The inverse of this map is θ = ∇A*(µ), where

    A*(µ) = sup_{θ ∈ Ω} { ⟨θ, µ⟩ − A(θ) }

  is the conjugate dual of A. (Exercise.)
• So θ̂mle = ∇A*(T̂), assuming that T̂ ∈ int(dom(A*)). (A quick numerical illustration is sketched below.)
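A sketch (our own) of the ML = MOM identity for one concrete canonical family: for Poisson data written in canonical form, pθ(x) ∝ exp(θx − e^θ), so T(x) = x and A(θ) = e^θ, and the MLE solves ∇A(θ) = T̂:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
x = rng.poisson(lam=3.5, size=1000)
T_bar = x.mean()

# solve the moment-matching equation grad A(theta) = e^theta = T_bar
theta_mle = brentq(lambda th: np.exp(th) - T_bar, -10, 10)
print(theta_mle, np.log(T_bar))   # agree: theta_mle = log(x_bar)
```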

141 / 218

Asymptotics or large-sample theory: zeroth order (consistency)

• Statistical behavior of estimators, in particular M-estimators, as n → ∞.
• For concreteness, consider the sequence

    θ̂n = θ̂n(X1, ..., Xn) = argmax_{θ ∈ Ω} (1/n) Σ_{i=1}^n mθ(Xi).

Definition 16
Let X1, ..., Xn ~iid Pθ0. We say that {θ̂n} is consistent if θ̂n →p θ0.

• Equivalently, ∀ ε > 0, P(d(θ̂n, θ0) > ε) → 0 as n → ∞.
• Usually d(θ̂n, θ0) = ||θ̂n − θ0|| for Euclidean parameter spaces Ω ⊂ R^d.
• For d = 1, d(θ̂n, θ0) = |θ̂n − θ0|.

142 / 218

• We write Zn = op(1) if Zn →p 0.
• By the WLLN, for any fixed θ (assuming Eθ0|mθ(X1)| < ∞),

    (1/n) Σ_{i=1}^n mθ(Xi) →p Eθ0[mθ(X1)].

• Letting M(θ) := Eθ0[mθ(X1)], for any fixed θ, Mn(θ) →p M(θ).
• If θ0 is the maximizer of M over Ω, we hope that θ̂n, the maximizer of Mn over Ω, approaches it.
• However, pointwise convergence of Mn to M is not enough; we need uniform convergence, i.e.,

    ||Mn − M||∞ := sup_{θ ∈ Ω} |Mn(θ) − M(θ)|

  to go to zero in probability.

143 / 218

Why uniform convergence?

• Even a nonrandom example is enough.
• Here Mn(t) → M(t) pointwise, but the maximizer tn of Mn satisfies Mn(tn) = 1 with tn = 1/n → 0, while M attains its maximum M(t0) = 1/2 at t0 = 1:

    Mn(t) = 1 − n|t − 1/n|   if |t − 1/n| < 1/n,
            1/2 − |t − 1|     if 1/2 < t < 3/2,
            0                 otherwise;

    M(t) =  1/2 − |t − 1|     if 1/2 < t < 3/2,
            0                 otherwise.

[Figure: plot of Mn (a narrow spike of height 1 near t = 1/n on top of the triangular bump of M) and M on [0, 1.5].]

144 / 218

Theorem 12 (AS 5.7, modified)
Let Mn be random functions, and let M be a fixed function of θ. Let

    θ̂n ∈ argmax_{θ ∈ Ω} Mn(θ)   (cond-M)

be well-defined. Assume:
(a) ||Mn − M||∞ →p 0. (Uniform convergence.)
(b) (∀ ε > 0) sup_{θ : d(θ, θ0) ≥ ε} M(θ) < M(θ0). (M has a well-separated maximum.)
Then θ̂n is consistent, i.e., θ̂n →p θ0.

• By optimality of θ̂n for Mn, we have Mn(θ0) ≤ Mn(θ̂n), or

    0 ≤ Mn(θ̂n) − Mn(θ0).   (Basic inequality)

• By adding and subtracting, we get

    M(θ0) − M(θ̂n) ≤ [Mn(θ̂n) − M(θ̂n)] − [Mn(θ0) − M(θ0)] ≤ 2||Mn − M||∞.

  (We are keeping random deviations from the mean on one side and fixed functions on the other side.)

145 / 218

• Fix some ε > 0, and let

    η(ε) := M(θ0) − sup_{d(θ, θ0) ≥ ε} M(θ) = inf_{d(θ, θ0) ≥ ε} [M(θ0) − M(θ)].

• By assumption (b), η(ε) > 0.
• Since d(θ, θ0) ≥ ε implies M(θ0) − M(θ) ≥ η(ε), we have

    P( d(θ̂n, θ0) ≥ ε ) ≤ P( M(θ0) − M(θ̂n) ≥ η(ε) ) ≤ P( 2||Mn − M||∞ ≥ η(ε) ) → 0

  by assumption (a). Q.E.D.

Remark 3
A key step is bounding Mn(θ̂n) − M(θ̂n) by ||Mn − M||∞.

Exercise: Condition (cond-M) can be replaced with Mn(θ̂n) ≥ Mn(θ0) − op(1).

146 / 218

• Sufficient conditions for uniform convergence can be found in Keener, Chapter 9, Theorem 9.2.
• For example, (a) holds if
  • Ω is compact,
  • θ ↦ mθ(x) is continuous (for a.e. x), and
  • E||m*(X1)||∞ < ∞, where ||m*(X1)||∞ = sup_{θ ∈ Ω} |mθ(X1)|.
• For example, (b) holds if
  • Ω is compact,
  • M is continuous, and
  • M has a unique maximizer over Ω.
• In general, the key factor in whether uniform convergence holds is the size of the parameter space Ω.

147 / 218

Side note: why do we have (b) if Ω is compact, M is continuous, and M has a unique maximizer over Ω?

• Since Ω is compact and M is continuous, M attains its maximum over

    Ω \ B(θ0; ε) := { θ ∈ Ω : d(θ, θ0) ≥ ε },

  where B(θ0; ε) is the open ball of radius ε centered at θ0.
• Let θε be a maximizer of M over Ω \ B(θ0; ε). Then,

    sup_{θ : d(θ, θ0) ≥ ε} M(θ) = M(θε) < M(θ0).

• The strict inequality is due to the uniqueness of the maximizer of M over Ω.
• Compactness is key; otherwise uniqueness of the global maximizer does not imply this inequality.

148 / 218

Example 48
• The MLE can be obtained as an M-estimator with mθ(x) = log[ pθ(x) / pθ0(x) ].
• Adding −log pθ0(x) does not change the maximizer of Mn(θ).
• Here

    M(θ) = Eθ0[mθ(X1)] = −∫ pθ0(x) log[ pθ0(x) / pθ(x) ] dx = −D(pθ0 || pθ).

• D(p || q) is the Kullback–Leibler (KL) divergence between p and q.
• A form of (squared) distance among distributions.
• Does not satisfy the triangle inequality or symmetry.
• D(p || q) ≥ 0, with equality iff p = q; hence θ0 maximizes M.
• Condition (b) is a bit stronger.
• Often, we can show (strong identifiability)

    γ( d(θ0, θ) ) ≤ D(pθ0 || pθ)

  for some strictly increasing function γ : [0, ∞) → [0, ∞), in a neighborhood of θ0.

149 / 218

• Example: exponential distribution with pλ(x) = λ e^{−λx} 1{x > 0}:

    D(pλ0 || pλ) = Eλ0[ log( λ0 e^{−λ0 X1} / (λ e^{−λ X1}) ) ]
                 = log(λ0/λ) + Eλ0[(λ − λ0) X1]
                 = −log(λ/λ0) + λ/λ0 − 1.

• This is the Itakura–Saito distance, i.e., the Bregman divergence for φ(x) = −log x, from an earlier lecture.
• f(x) = −log x + x − 1 is strictly convex on (0, ∞) with a unique minimum at x = 1.

150 / 218

First-order asymptotics (asymptotic normality)

• A more refined understanding, obtained by looking at scaled (magnified) deviations of consistent estimators.
• For an i.i.d. sequence X1, X2, ... with mean µ = E[X1] and Σ = cov(X1):

    WLLN:  X̄n →p µ,                    i.e., X̄n is consistent for µ.
    CLT:   √n(X̄n − µ) →d N(0, Σ),      characterizes the fluctuations of X̄n − µ.

• Fluctuations are of order n^{−1/2} and, after normalization, have an approximately Gaussian distribution.

151 / 218

• First, let us look at how modes of convergence interact.

Proposition 12
(a) Xn →p X implies Xn →d X, but not vice versa.
(b) Xn →p c is equivalent to Xn →d c. (c is a constant.)
(c) Continuous mapping (CM): Xn → X and f continuous implies f(Xn) → f(X). Holds for both →d and →p.
(d) Slutsky: Xn →d X and Yn →d c implies (Xn, Yn) →d (X, c).
(e) Xn →p X and Yn →p Y implies (Xn, Yn) →p (X, Y).
(f) Xn →d X and d(Xn, Yn) →p 0 implies Yn →d X.

• For (c), f only needs to be continuous on a set C with P(X ∈ C) = 1.
• (d) does not hold in general if c is replaced by some random variable Y.

152 / 218

• Slutsky's lemma as usually stated is not quite what we mentioned.
• It is in fact a special application of (c) and (d), to the functions (x, y) ↦ x + y, (x, y) ↦ xy and (x, y) ↦ y^{−1}x.

Corollary 5 (Slutsky's lemma)
Let Xn, Yn and X be random variables (or vectors or matrices), and c a constant. Assume that Xn →d X and Yn →d c. Then,

    Xn + Yn →d X + c,    Yn Xn →d cX,    Yn^{−1} Xn →d c^{−1} X,

assuming c is invertible for the latter. More generally, f(Xn, Yn) →d f(X, c) for any continuous function f.

• E.g., op(1) + op(1) = op(1).

153 / 218

• Simple examples:
  (a) Xn →d Z ~ N(0, 1) implies Xn² →d Z² ~ χ²₁.

Example 49 (Counterexample)
• Xn = X ~ U(0, 1) for all n, and

    Yn = Xn 1{n odd} + (1 − Xn) 1{n even}.

• Xn →d X and Yn →d X, but (Xn, Yn) does not converge in distribution.
• Why?
• Let C1 = {(x, y) ∈ [0, 1]² : x = y} and C2 = {(x, y) ∈ [0, 1]² : x + y = 1}.
• Let U(Ci) be the uniform distribution on Ci. Then,

    (Xn, Yn) ~ U(C1) for n odd,   (Xn, Yn) ~ U(C2) for n even.

154 / 218

Example 50 (t-statistic)
• IID sequence {Xi}, with E[Xi] = µ and var(Xi) = σ².
• Let X̄n = (1/n) Σ Xi and Sn² = (1/n) Σ (Xi − X̄n)² = (1/n) Σ Xi² − (X̄n)². Then,

    tn−1 := (X̄n − µ) / (Sn/√n) →d N(0, 1).

• Why? (1/n) Σ Xi² →p E[X1²] = σ² + µ² and (X̄n)² →p µ².
• These imply Sn →p √(σ² + µ² − µ²) = σ.
• It follows that

    tn−1 = √n(X̄n − µ) / Sn →d N(0, σ²)/σ = N(0, 1).

• Distribution-free result: we are not assuming that the {Xi} are Gaussian.

155 / 218

• We also need the concept of uniform tightness, or boundedness in probability.
• A collection of random vectors {Xn} is uniformly tight if

    ∀ ε > 0, ∃ M such that sup_n P(||Xn|| > M) < ε.

• We write Xn = Op(1) in this case.

Proposition 13 (Uniform tightness)
(a) If Xn →p 0 and {Yn} is uniformly tight, then XnYn →p 0.
(b) If Xn →d X, then {Xn} is uniformly tight.

• (a) can be written compactly as op(1)Op(1) = op(1).

156 / 218

Simplified notation: E[ṁθ0] in place of E[ṁθ0(X1)].

Theorem 13 (Asymptotic normality of M-estimators)
Assume the following:
(a) ṁθ0(X1) has up to second moments, with
    • E[ṁθ0] = 0, and
    • well-defined covariance matrix Sθ0 := E[ṁθ0 ṁθ0^T].
(b) The Hessian m̈θ0(X1) is integrable with Vθ0 := E[m̈θ0] ≺ 0.
(c) {θ̂n} is consistent for θ0.
(d) ∃ ε > 0 such that sup_{||θ − θ0|| ≤ ε} ||M̈n(θ) − M̈(θ0)|| →p 0.
Let ∆n,θ := (1/√n) Σ_{i=1}^n ṁθ(Xi). Then,

    √n(θ̂n − θ0) = −Vθ0^{−1} ∆n,θ0 + op(1),   and   ∆n,θ0 →d N(0, Sθ0).

In particular, √n(θ̂n − θ0) →d N(0, Vθ0^{−1} Sθ0 Vθ0^{−1}).

• In (b), we only need the Hessian to be nonsingular.
• (d) is a (local) uniform convergence (UC) condition.

157 / 218

Proof of AN

1. θ̂n is a maximizer of Mn, hence
2. Ṁn(θ̂n) = 0. (First-order optimality condition.)
3. Taylor-expand Ṁn around θ0:

    Ṁn(θ̂n) − Ṁn(θ0) = M̈n(θ̃n)[θ̂n − θ0]

   for some θ̃n on the line segment [θ̂n, θ0]. (Mean-value theorem, assuming continuity of M̈n.)
4. θ̃n = θ0 + op(1). (By consistency of θ̂n.)
5. M̈n(θ̃n) = M̈n(θ0) + op(1). (By (d): UC.)
6. Note that M̈n(θ0) = (1/n) Σ_i m̈θ0(Xi) is an average.
7. M̈n(θ0) = Eθ0[m̈θ0(X1)] + op(1) = Vθ0 + op(1). (By (b) and WLLN.)
8. M̈n(θ̃n) = Vθ0 + op(1). (Combine 5. and 7. + CM.)
9. By CM applied with f(X) = X^{−1}, and invertibility of Vθ0:

    [M̈n(θ̃n)]^{−1} = [Vθ0 + op(1)]^{−1} = Vθ0^{−1} + op(1).

158 / 218

10. Combine 2., 3. and 9.:

    θ̂n − θ0 = [M̈n(θ̃n)]^{−1} [Ṁn(θ̂n) − Ṁn(θ0)] = [Vθ0^{−1} + op(1)] [0 − Ṁn(θ0)].

11. Expand the RHS and multiply by √n:

    √n(θ̂n − θ0) = −Vθ0^{−1}[√n Ṁn(θ0)] − op(1)[√n Ṁn(θ0)].   (8)

12. Ṁn(θ0) is an average of zero-mean terms with covariance Sθ0, by (a).
13. √n Ṁn(θ0) →d N(0, Sθ0). (CLT and (a).)
14. √n Ṁn(θ0) = Op(1). (By Prop. 13(b) and 13.)
15. Applying op(1)Op(1) = op(1) to (8) (Prop. 13(a) and 11.),

    √n(θ̂n − θ0) = −Vθ0^{−1}[√n Ṁn(θ0)] + op(1).

16. Note that √n Ṁn(θ0) = ∆n,θ0 by definition.
17. Second part: apply CM with f(x) = −Vθ0^{−1} x. (Exercise.)

159 / 218

Example 51 (AN of MLE)
• For the MLE, mθ(x) = ℓθ(x) = log pθ(x).
• ṁθ = ℓ̇θ, the score function, which is zero-mean under regularity conditions.
• Sθ = Eθ[ℓ̇θ ℓ̇θ^T] = I(θ).
• Vθ = Eθ[ℓ̈θ] = −I(θ).
• Asymptotic covariance of the MLE: [−I(θ)]^{−1} I(θ) [−I(θ)]^{−1} = [I(θ)]^{−1}.
• It follows (assuming (c) and (d) hold) that

    √n(θ̂mle − θ0) →d N(0, [I(θ0)]^{−1}).

• Often interpreted as "the MLE is asymptotically efficient",
• i.e., it achieves the Cramér–Rao bound in the limit. (A simulation check is sketched below.)

160 / 218

Hodges' superefficient example

• If √n(δn − θ) →d N(0, σ²(θ)), one might think that σ²(θ) ≥ 1/I(θ) by the CRB.
• If so, any estimator with asymptotic variance 1/I(θ) could be called asymptotically efficient.
• Unfortunately this is not true. (Convergence in distribution is too weak to guarantee this.)
• Here is a counterexample:

Example 52
• Consider the shrinkage estimator

    δn' = a δn   if |δn| ≤ n^{−1/4},
    δn' = δn     otherwise.

• δn' has the same asymptotic behavior as δn for θ ≠ 0.
• The asymptotic behavior of δn' at θ = 0 is the same as that of aδn, which has asymptotic variance a²σ²(θ); this can be made arbitrarily small by choosing a sufficiently small.

161 / 218

Delta method

• The delta method is a powerful extension of the CLT.
• Assume that f : Ω → R^k, with Ω ⊂ R^d, is differentiable and θ ∈ Ω.
• Let Jθ = (∂fi/∂xj)|_{x=θ} be the Jacobian of f at θ. Note: Jθ ∈ R^{k×d}.

Proposition 14
Under the above assumptions: if an(Xn − θ) →d Z, with an → ∞, then

    an[f(Xn) − f(θ)] →d Jθ Z.

• If f is differentiable, then it is partially differentiable and its total derivative can be represented (or identified) with the Jacobian matrix Jθ.
• Simplest case, k = d = 1: an[f(Xn) − f(θ)] →d f'(θ) Z.

162 / 218

Proof of delta method

• an(Xn − θ) = Op(1) and, since an → ∞, we have Xn − θ = op(1).
• By differentiability (first-order Taylor expansion),

    f(θ + h) = f(θ) + Jθ h + R(h)||h||,

  where R(h) = o(1) as h → 0. Define R(0) = 0 so that R is continuous at 0.
• Applying this with h = Xn − θ, we have

    f(Xn) = f(θ) + Jθ(Xn − θ) + R(Xn − θ)||Xn − θ||.

• Multiplying by an, we get

    an[f(Xn) − f(θ)] = Jθ[an(Xn − θ)] + R(Xn − θ) · an||Xn − θ||.

• an||Xn − θ|| = Op(1).

163 / 218

• In the display above, R(Xn − θ) = op(1) and an||Xn − θ|| = Op(1).
• Also, Jθ[an(Xn − θ)] →d Jθ Z; both facts follow from CM.
• The result follows from op(1)Op(1) = op(1) and Prop. 12(f).

164 / 218

Example 53
• Let Xi be i.i.d. with µ = E[Xi] and σ² = var[Xi].
• By the CLT, √n(X̄n − µ) →d N(0, σ²).
• Consider the function f(t) = t². Then, by the delta method,

    √n( f(X̄n) − f(µ) ) →d f'(µ) N(0, σ²),

  that is,

    √n[ (X̄n)² − µ² ] →d N(0, σ²(2µ)²).

• For µ = 0, we get the degenerate result that √n(X̄n)² →d 0.
• In this case, we need to scale the error further, i.e.,
• n(X̄n)² →d σ²χ²₁, which follows from the CLT, √n X̄n →d σ N(0, 1), and CM.

165 / 218

Example 54
• Xi ~iid Ber(p).
• By the CLT, √n(X̄n − p) →d N(0, p(1 − p)).
• Let f(p) = p(1 − p).
• f(X̄n) is a plug-in estimator of the variance, and

    √n( f(X̄n) − f(p) ) →d N(0, (1 − 2p)² p(1 − p)),

  since f'(x) = 1 − 2x.
• Again, at p = 1/2 this is degenerate and the convergence happens at a faster rate.

166 / 218

These examples can be dealt with using the following extension.

Proposition 15
Consider the scalar case k = d = 1. If √n(Xn − θ) →d N(0, σ²) and f is twice differentiable with f'(θ) = 0, then

    n[ f(Xn) − f(θ) ] →d (1/2) f''(θ) σ² χ²₁.

Informal derivation:
• f(Xn) − f(θ) = (1/2) f''(θ)(Xn − θ)² + o((Xn − θ)²).
• Since n(Xn − θ)² →d (σZ)², where Z ~ N(0, 1), we get the result.

167 / 218

Example 55 (Multivariate delta method)
• Let Sn² = (1/n) Σ_i Xi² − (X̄n)², and

    Zn := (1/n) Σ_{i=1}^n (Xi, Xi²),   θ = (µ, µ² + σ²),   Σ = cov( (X1, X1²) ).

• By the (multivariate) CLT, we have √n(Zn − θ) →d N(0, Σ).
• Letting f(x, y) = (x, y − x²), we have

    √n[ (X̄n, Sn²) − (µ, σ²) ] →d Jθ N(0, Σ) = N(0, Jθ Σ Jθ^T).

• Exercise: evaluate the asymptotic covariance Jθ Σ Jθ^T.

168 / 218

What are asymptotic normality results useful for?

• They simplify comparison of estimators: we can use asymptotic variances. (ARE)
• We can build asymptotic confidence intervals.

169 / 218

Asymptotic relative efficiency (ARE)

• We can compare estimators based on their asymptotic variances.
• Assume that for two estimators θ̂1,n and θ̂2,n, we have

    √n(θ̂i,n − µ(θ)) →d N(0, σi²(θ)),   i = 1, 2.

• For large n, the variance of θ̂i,n is ≈ σi²(θ)/n.
• The relative efficiency of θ̂1,n with respect to θ̂2,n can be measured by the ratio of the numbers of samples required to achieve the same asymptotic variance (i.e., error):

    σ1²(θ)/n1 = σ2²(θ)/n2   ⟹   ARE_θ(θ̂1, θ̂2) := n2/n1 = σ2²(θ)/σ1²(θ).

• If the above ARE > 1, then we prefer θ̂1 over θ̂2.

170 / 218

Example 56
• Xi ~iid fX with mean = (unique) median = θ, and variance 1.
• By the CLT, we have √n(X̄n − θ) →d N(0, 1).
• Sample median: Zn = median(X1, ..., Xn) = X_{(n/2)}.
• One can show that

    √n(Zn − θ) →d N(0, 1/(4[fX(θ)]²)).

• Consider the normal location family: Xi ~iid N(θ, 1).
• fX(θ) = φ(0) = 1/√(2π), where φ is the density of the standard normal.
• Hence σ²_Zn(θ) = π/2.
• ARE of the sample mean relative to the median:

    σ²_Zn(θ) / σ²_X̄n(θ) = π/2 ≈ 1.57.

• In the normal family we prefer the mean, since the median requires roughly 1.57 times as many samples to achieve the same accuracy. (A quick simulation is sketched below.)

171 / 218

Confidence intervals

An alternative to point estimators which provides a measure of our uncertainty or confidence. Recall X1, ..., Xn ~iid Pθ0.

Definition 17
A (1 − α)-confidence set for θ0 is a random set S = S(X1, ..., Xn) such that

    Pθ0(θ0 ∈ S) ≥ 1 − α.

• There is a trade-off between the size of the set S and its coverage probability Pθ0(θ0 ∈ S).
• We want to minimize the size while maintaining a lower bound on the coverage probability.
• Usually CIs are built based on pivots:
• functions of the data and the parameter whose distribution is independent of the parameter.

Example 57 (Normal family, known variance)
• Xi ~iid N(µ, σ²); then Z = (X̄n − µ)/(σ/√n) ~ N(0, 1).
• Let zα/2 be such that P(Z ≥ zα/2) = α/2. Then

    P( |√n(X̄n − µ)/σ| ≤ zα/2 ) = 1 − α   ⟺   P( µ ∈ [X̄n ± zα/2 σ/√n] ) = 1 − α.

172 / 218

Example 58 (Normal family, unknown variance)
• Xi ~iid N(µ, σ²).
• Z = (X̄n − µ)/(σ/√n) ~ N(0, 1).
• V = (n − 1)Sn²/σ² ~ χ²_{n−1}, where Sn² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².
• Hence T := Z/√(V/(n − 1)) ~ t_{n−1} (Student's t distribution).
• Let t_{n−1}(α/2) be such that P(|T| ≥ t_{n−1}(α/2)) = α.
• X̄n ± t_{n−1}(α/2) Sn/√n is an exact (1 − α) confidence interval.

173 / 218

Asymptotic CIs

Definition 18
An asymptotic (1 − α)-confidence set for θ0 is a random set S = S(X1, ..., Xn) such that Pθ0(θ0 ∈ S) → 1 − α as n → ∞.

Example 59
• If √n(Tn − θ0) →d N(0, σ²(θ0)), then, assuming σ(·) is continuous,

    Tn ± zα/2 √(σ²(Tn)/n)

  is an asymptotic C.I. at level 1 − α.
• Why? Since Tn →p θ0, by the CM theorem, σ(θ0)/σ(Tn) →p 1. By Slutsky's lemma,

    √(n/σ²(Tn)) (Tn − θ0) = √(σ²(θ0)/σ²(Tn)) · √(n/σ²(θ0)) (Tn − θ0) →d N(0, 1).

174 / 218

Asymptotic CI for MLE

Example 60 (Asymptotic CI for the MLE based on Fisher info)
• Recall that under regularity conditions √n(θ̂n − θ0) →d N(0, 1/I(θ0)), or

    √(n I(θ0)) (θ̂n − θ0) →d N(0, 1).

• Assuming I(·) is continuous, applying the previous example,

    √(n I(θ̂n)) (θ̂n − θ0) →d N(0, 1).

• Hence, the following is an asymptotic (1 − α)-CI for θ0:

    θ̂n ± zα/2 / √(n I(θ̂n)).

175 / 218

Example 61 (Asymptotic CI based on the empirical Fisher information)
• Let ℓn(θ) = Σ_{i=1}^n log pθ(Xi) and I(θ) = E[−∂²/∂θ² log pθ(X1)].
• One can consider −(1/n) ℓ̈n(θ) as the empirical version of I(θ).
• (It is an unbiased and consistent estimate; I(θ) is the Fisher information based on a sample of size 1.)
• By the same argument as in the AN theorem, −(1/n) ℓ̈n(θ̂n) →p I(θ0).
• It follows that −ℓ̈n(θ̂n)/(n I(θ0)) →p 1.
• By Slutsky's lemma,

    √( −ℓ̈n(θ̂n)/(n I(θ0)) ) · √(n I(θ0)) (θ̂n − θ0) →d N(0, 1).

• In other words, √(−ℓ̈n(θ̂n)) (θ̂n − θ0) →d N(0, 1).
• Hence, the following is an asymptotic (1 − α)-CI for θ0:

    θ̂n ± zα/2 / √(−ℓ̈n(θ̂n)).

176 / 218

Variance-stabilizing transform

• Assume √n(Tn − θ) →d N(0, σ²(θ)).
• By the delta method, √n( f(Tn) − f(θ) ) →d N(0, [f'(θ)]² σ²(θ)).
• We can choose f so that [f'(θ)]² σ²(θ) = C, a constant.
• Good for building asymptotic pivots.

Example 62
• Xi ~iid Poi(θ). Note Eθ[Xi] = varθ[Xi] = θ.
• By the CLT, √n(X̄n − θ) →d N(0, θ).
• Take f'(θ) = 1/√θ. This can be realized by f(θ) = 2√θ, hence

    2√n( √X̄n − √θ ) →d N(0, 1).

• Asymptotic CI for √θ of level 1 − α: √X̄n ± zα/2/(2√n). (A coverage check is sketched below.)
• Compare with the standard asymptotic CI for θ: X̄n ± zα/2 √(X̄n/n).
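A sketch (our own) comparing the empirical coverage of the variance-stabilized interval (squared back to an interval for θ) with the standard Wald interval for Poisson data; the constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps, z = 1.0, 30, 20000, 1.96

x = rng.poisson(theta, size=(reps, n))
xbar = x.mean(axis=1)

# standard Wald CI for theta
lo1, hi1 = xbar - z * np.sqrt(xbar / n), xbar + z * np.sqrt(xbar / n)
# variance-stabilized CI: sqrt(xbar) +/- z/(2 sqrt(n)), then squared
lo2 = np.maximum(np.sqrt(xbar) - z / (2 * np.sqrt(n)), 0) ** 2
hi2 = (np.sqrt(xbar) + z / (2 * np.sqrt(n))) ** 2

print(np.mean((lo1 <= theta) & (theta <= hi1)))  # coverage of the Wald CI
print(np.mean((lo2 <= theta) & (theta <= hi2)))  # coverage of the stabilized CI
```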

177 / 218

Hypothesis testing

• Recall the decision theory framework:
• Probabilistic model {Pθ : θ ∈ Ω} for X ∈ X.
• Special case: Ω is partitioned into two disjoint sets Ω0 and Ω1.
• We want to decide to which piece θ belongs.
• We can form an estimate θ̂ for θ and output 1{θ̂ ∈ Ω1}.
• A general principle:
  Do not estimate more than what you care about. The more complex the model, the more potential for fitting to noise.

178 / 218

• We want to test

    H0 : θ ∈ Ω0   (null)
    H1 : θ ∈ Ω1   (alternative)

• A non-randomized test can be specified by a critical region S ⊂ X as

    δ(X) = 1{X ∈ S}.

• When δ(X) = 1, we have accepted H1, or "rejected H0".
• The power function of the test is given by

    β(θ) = Pθ(X ∈ S) = Eθ[δ(X)].

• We would like β(θ) ≈ 1{θ ∈ Ω1}.
• This cannot be achieved, so we settle for a trade-off. Define

    significance level  α = sup_{θ ∈ Ω0} β(θ),
    power of the test   β = inf_{θ ∈ Ω1} β(θ).

• Neyman–Pearson framework: maximize β subject to a fixed α.

179 / 218

• Often we need to consider a randomized test, in which case we interpret

    δ(x) = P(accept H1 | X = x).

• The power function β(θ) = Eθ[δ(X)] still gives the probability of accepting H1, by the smoothing property.

180 / 218

Simple hypothesis test

• Ω0 = {θ0} and Ω1 = {θ1}.
• The Neyman–Pearson criterion reads: fix α and solve

    sup_δ Eθ1[δ(X)]   s.t.   Eθ0[δ(X)] ≤ α.

• The solution is the most powerful (MP) test of significance level at most α.
• Neyman–Pearson lemma:
• Most power is achieved by a likelihood ratio test (LRT),

    δ(x) = 1{L(x) > τ} + γ 1{L(x) = τ},   L(x) := pθ1(x)/pθ0(x).

• Sometimes we write 1{pθ1(x) ≥ τ pθ0(x)} to avoid division by zero.
• For simplicity write p0 = pθ0 and p1 = pθ1, so that L(x) = p1(x)/p0(x), for example.

181 / 218

Informal proof

• For simplicity drop the dependence on X: δ = δ(X) and L = L(X).
• Introduce a Lagrange multiplier and solve the unconstrained problem:

    δ* = argmax_δ [ E1(δ) + λ(α − E0(δ)) ] = argmax_δ [ E1(δ) − λ E0(δ) ].

• Recall the change-of-measure formula (note L = p1/p0):

    E1[δ] = ∫ δ p1 dµ = ∫ δ L p0 dµ = E0[δ L].

• The problem reduces to

    δ* = argmax_δ E0[ δ(L − λ) ].

• The optimal solution is

    δ* = 1 if L > λ,   δ* = 0 if L < λ,

  which is a likelihood ratio test.

182 / 218

Theorem 14 (Neyman–Pearson Lemma)
Consider the family of (randomized) likelihood ratio tests

    δ_{t,γ}(x) = 1   if p1(x) > t p0(x),
                 γ   if p1(x) = t p0(x),
                 0   if p1(x) < t p0(x).

The following hold:
(a) For every α ∈ [0, 1], there are t, γ such that E0[δ_{t,γ}(X)] = α.
(b) If an LRT satisfies E0[δ_{t,γ}(X)] = α, then it is most powerful (MP) at level α.
(c) Any MP test at level α can be written as an LRT.

• Part (a) follows by looking at g(t) = P0(L(X) > t) = 1 − FZ(t), where Z = L(X); g is non-increasing and right-continuous, etc. (Draw a picture.)

183 / 218

Proof of Neyman–Pearson Lemma

• For part (b), let δ* be the LRT with significance level α.
• Let δ be any other rule satisfying E0[δ(X)] ≤ α = E0[δ*(X)].
• For all x (consider the three possibilities),

    δ(x) [p1(x) − t p0(x)] ≤ δ*(x) [p1(x) − t p0(x)].

• Integrate w.r.t. x:

    E1[δ(X)] − t E0[δ(X)] ≤ E1[δ*(X)] − t E0[δ*(X)],

  or

    E1[δ(X)] − E1[δ*(X)] ≤ t ( E0[δ(X)] − E0[δ*(X)] ) ≤ 0.

• Conclude that E1[δ(X)] ≤ E1[δ*(X)].
• Part (c) is left as an exercise.

184 / 218

Example 63
• Consider X ~ N(θ, 1) and the two hypotheses

    H0 : θ = θ0   versus   H1 : θ = θ1.

• The likelihood ratio is

    L(x) = p1(x)/p0(x) = exp[−(x − θ1)²/2] / exp[−(x − θ0)²/2].

• The LRT rejects H0 if L(x) > t. Equivalently,

    log L(x) > log t
      ⟺  −(x − θ1)²/2 + (x − θ0)²/2 > log t
      ⟺  x(θ1 − θ0) + (θ0² − θ1²)/2 > log t
      ⟺  x · sign(θ1 − θ0) > [log t − (θ0² − θ1²)/2] / |θ1 − θ0| =: τ.

• Assume θ1 > θ0. Then the test is equivalent to x > τ.
• We set τ by requiring P0(X > τ) = α. This gives τ = θ0 + Q^{−1}(α).

185 / 218

Power calculation (previous example continued)

• Q(x) = 1 − Φ(x), where Φ is the CDF of the standard normal distribution.
• The power is (since X − θ1 ~ N(0, 1) under P1)

    β = P1(X > τ) = P1(X − θ1 > τ − θ1) = Q(τ − θ1).

• Plugging in τ, we have β = Q(−δ + Q^{−1}(α)), where δ = θ1 − θ0.
• The plot of β versus α is the ROC curve of the test. (A short computation of this curve is sketched below.)
• ROC = Receiver Operating Characteristic.
• See the next slide.
• Alternatively, one can plot the parametric curve β = Q(τ − θ1), α = Q(τ − θ0), where the parameter τ varies in R.
• The ROC of no test can go above this curve (by the Neyman–Pearson lemma).

186 / 218

ROC curve

[Figure: ROC curve, β versus α, for the normal shift testing problem.]

187 / 218

Composite hypothesis testing

• Often we want to test H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1, where Ω0 ∩ Ω1 = ∅.

Example 64
Testing whether a coin is fair or not. Here Ω0 = {1/2} and Ω1 = [0, 1] \ {1/2}.

Definition 19
A test δ of size α is uniformly most powerful (UMP) at level α if

    ∀ tests φ of level ≤ α,  ∀ θ ∈ Ω1,   βδ(θ) ≥ βφ(θ).

• UMP tests do not always exist (in fact, they often don't).

188 / 218

Example 65 (Coin flipping continued)
• Observe X ~ Bin(n, θ).
• Consider testing H0 : θ = 1/2 versus H1 : θ = θ1 based on X.
• The most powerful test is an LRT, based on

    T(x) = θ1^x (1 − θ1)^{n−x} / [ (1/2)^x (1/2)^{n−x} ] = [θ1/(1 − θ1)]^x [(1 − θ1)/(1/2)]^n.

• The nature of the test changes based on whether θ1 < 1/2 or θ1 > 1/2:

    θ1 < 1/2  ⟹  log[θ1/(1 − θ1)] < 0  ⟹  reject H0 when x < τ,
    θ1 > 1/2  ⟹  log[θ1/(1 − θ1)] > 0  ⟹  reject H0 when x > τ.

• This suggests that a special structure is needed for the existence of a UMP test.

189 / 218

Definition 20
A family P = {pθ(x) : θ ∈ Ω} of densities has a monotone likelihood ratio (MLR) in T(x) if for θ0 < θ1, the LR L(x) = pθ1(x)/pθ0(x) is a non-decreasing function of T(x).

• For example, in the coin flipping problem, the model has MLR in T(X) = X.

Example 66 (1-D exponential family)
• Consider pθ(x) = h(x) exp[ η(θ)T(x) − A(θ) ]. The LR is

    L(x) = pθ1(x)/pθ0(x) = exp{ (η(θ1) − η(θ0)) T(x) − A(θ1) + A(θ0) }.

• If η is monotone (e.g., θ0 ≤ θ1 ⟹ η(θ0) ≤ η(θ1)), then the family has MLR in T(x) or −T(x).
• This includes the Bernoulli (or binomial) example before, with η(θ) = log(θ/(1 − θ)).
• Other cases: normal location family, Poisson and exponential.

190 / 218

Example 67 (Non-exponential family)
• Xi ~iid U[0, θ], i = 1, ..., n.
• p(x) = θ^{−n} 1{x_{(n)} ≤ θ} 1{x_{(1)} ≥ 0}, and, for θ1 > θ0,

    L(x) = (θ1^{−n} 1{x_{(n)} ≤ θ1}) / (θ0^{−n} 1{x_{(n)} ≤ θ0})
         = (θ0/θ1)^n   for x_{(n)} ∈ [0, θ0),
         = ∞           for x_{(n)} ∈ [θ0, θ1].

• For x_{(n)} ∈ (0, θ1), L(x) is non-decreasing in x_{(n)} (it jumps from the constant (θ0/θ1)^n to ∞ at θ0), so the family has MLR in x_{(n)}.

191 / 218

Theorem 15 (UMP for one-sided problems)
• Let P be a family with MLR in T(x).
• Consider the one-sided problem H0 : θ ≤ θ0 versus H1 : θ > θ0.
• Then δ(x) = 1{T(x) > C} + γ 1{T(x) = C}, with γ, C chosen such that βδ(θ0) = α, is UMP of size α.

Proof sketch:
• Take θ1 > θ0 and let Lθ1,θ0(x) = pθ1(x)/pθ0(x) be the corresponding LR.
• By the MLR property, the given test is an LR test, i.e.,

    δ(x) = 1{Lθ1,θ0(x) > C'} + γ 1{Lθ1,θ0(x) = C'}

  for some constant C' = C'(θ1, θ0).
• Since βδ(θ0) = α, by the Neyman–Pearson lemma, δ is MP for testing

    H0 : θ = θ0   versus   H1 : θ = θ1.

• Since θ1 > θ0 was arbitrary, δ is UMP for θ = θ0 versus θ > θ0.
• The last piece to check is βδ(θ) ≤ α for θ < θ0. (Exercise.)

192 / 218

Example 68 (Non-exponential family, continued)
• Xi ~iid U[0, θ], i = 1, ..., n.
• The family has MLR in X_{(n)}.
• δ(X) = 1{X_{(n)} ≥ t} is UMP for the one-sided problem of θ > θ0 against θ ≤ θ0.
• To set the threshold,

    g(t) = 1 − Pθ0(X_{(n)} ≤ t) = 1 − Π_i Pθ0(Xi ≤ t) = 1 − (t/θ0)^n for t ≤ θ0,  and 0 for t > θ0,

  which is a continuous function.
• Solving g(t) = α gives t = (1 − α)^{1/n} θ0.
• Similarly, the power function is

    β(θ) = Pθ(X_{(n)} > t) = [1 − (t/θ)^n]+,

  which holds for all θ > 0.
• For the UMP test, we have β(θ) = [1 − (1 − α)(θ0/θ)^n]+. (A quick numerical check follows.)
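A Monte Carlo sketch (our own, illustrative constants) of the UMP test for U[0, θ]: reject when max(Xi) ≥ t with t = (1 − α)^{1/n} θ0, and compare the empirical rejection rate with the power formula:

```python
import numpy as np

rng = np.random.default_rng(7)
theta0, alpha, n, reps = 1.0, 0.2, 10, 50000
t = (1 - alpha) ** (1 / n) * theta0

def reject_rate(theta):
    xmax = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
    return np.mean(xmax >= t)

print(reject_rate(theta0), alpha)                    # size ~ alpha
theta = 1.3
print(reject_rate(theta),
      max(1 - (1 - alpha) * (theta0 / theta) ** n, 0))   # matches [1 - (1-a)(theta0/theta)^n]+
```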

193 / 218

• Plots of the power function for the U[0, θ] example for various n (θ0 = 1, α = 0.2).

[Figure: power functions β(θ) for n = 2, 5, 10, 20, 100.]

194 / 218

• Plots of the power function for the U[0, θ] example for various n (θ0 = 1, α = 0.05).

[Figure: power functions β(θ) for n = 2, 5, 10, 20, 100.]

195 / 218

• ROC plots for the U[0, θ] example for various n (θ0 = 1, θ = 1.1).

[Figure: ROC curves for n = 2, 5, 10 and the random (chance) test.]

196 / 218

• ROC plots for the U[0, θ] example for various n (θ0 = 1, θ = 2).

[Figure: ROC curves for n = 2, 5, 10 and the random (chance) test.]

197 / 218

Generalized likelihood ratio test (GLRT)

• Consider testing H0 : θ ∈ Ω0 versus H1 : θ ∈ Ω1.
• In the absence of UMP tests, a natural extension of the LRT is the following generalized LRT:

    L(x) = sup_{θ ∈ Ω} pθ(x) / sup_{θ ∈ Ω0} pθ(x) = p_{θ̂}(x) / p_{θ̂0}(x) ∈ [1, ∞],

  where Ω = Ω0 ∪ Ω1.
• θ̂ is the unconstrained MLE, whereas θ̂0 is the constrained MLE, the maximizer of θ ↦ pθ(x) over θ ∈ Ω0.
• A GLRT rejects H0 if L(x) > λ for some threshold λ.
• Alternatively, the GLRT can be written as

    δ(x) = 1{Λn(x) ≥ τ} + γ 1{Λn(x) = τ},   Λn(x) = 2 log L(x).

• The threshold τ is set as usual by solving sup_{θ ∈ Ω0} Eθ[δ(X)] = α.

198 / 218

Why is the above a reasonable test?

• Assume γ = 0: no randomization is needed.
• When θ ∈ Ω0, both θ̂ and θ̂0 approach θ as n → ∞.
• Hence, L(x) ≈ 1 when θ ∈ Ω0.
• However, when θ ∈ Ω1, the unconstrained MLE θ̂ approaches θ as n → ∞, while θ̂0 does not. This is because θ̂0 ∈ Ω0 and θ ∈ Ω1, and Ω0 and Ω1 are disjoint.
• It follows that L(x) > 1 when θ ∈ Ω1 (in fact, usually L(x) ≫ 1).
• By thresholding L(x) at some λ > 1, we can tell the two hypotheses apart.

199 / 218

Example 69
• Xi ~iid N(µ, σ²), with both µ and σ² unknown. Let θ = (µ, σ²). Take

    Ω = { (µ, σ²) : µ ∈ R, σ² > 0 },   Ω0 = {θ0},   for θ0 = (µ0, σ0²) = (0, 1).

• We want to test Ω0 against Ω1 = Ω \ Ω0.

    sup_{θ ∈ Ω0} pθ(x) = pθ0(x) = (2π)^{−n/2} exp( −½ Σ_i xi² ),

    sup_{θ ∈ Ω} pθ(x) = p_{θ̂}(x) = (2πσ̂²)^{−n/2} exp( −(1/(2σ̂²)) Σ_i (xi − µ̂)² ),

  where θ̂ = (µ̂, σ̂²), with µ̂ = (1/n) Σ_i xi and σ̂² = (1/n) Σ_i (xi − µ̂)², is the MLE.

200 / 218

Example continued
• The GLR is

    L(x) = p_{θ̂}(x) / pθ0(x) = (σ̂²)^{−n/2} exp[ ½ Σ_i xi² − (1/(2σ̂²)) Σ_i (xi − µ̂)² ],

  and the second term in the exponent equals n/2.
• The GLRT rejects H0 when L(x) > tα, where Pθ0(L(x) > tα) = α.
• Alternatively, threshold

    Λn(x) = 2 log L(x) = −n log σ̂² + Σ_i xi² − n.

• We will see that Λn in general has an asymptotic χ²_r distribution, where r is the difference between the dimensions of the full (Ω) and null (Ω0) parameter spaces.
• Thus, we expect Λn in this problem to have an asymptotic χ²₂ distribution under the null θ = θ0. (It is instructive to try to show this directly.)

201 / 218

Asymptotics of GLRT

• Consider Ω ⊂ R^d, open, and let r ≤ d. Take the null hypothesis to be

    Ω0 = { θ ∈ Ω : θ1 = θ2 = ... = θr = 0 }
       = { θ ∈ Ω : θ = (0, ..., 0, θ_{r+1}, ..., θ_d) }.

• Note that Ω0 is a (d − r)-dimensional subset of Ω.

Theorem 16
Under the same assumptions guaranteeing asymptotic normality of MLEs,

    Λn = 2 log L(X) →d χ²_r,   under H0.

Degrees of freedom r = d − (d − r), that is, the difference in the local dimensions of the full and null parameter sets.

202 / 218

• Recall ℓθ(X) = log pθ(X) and let Mn(θ) = (1/n) Σ_i ℓθ(Xi).
• We have Λn = −2n[Mn(θ̂0,n) − Mn(θ̂n)].
• By Taylor expansion around θ̂n (the unrestricted MLE), for some θ̃n ∈ [θ̂n, θ̂0,n],

    Mn(θ̂0,n) − Mn(θ̂n) = [Ṁn(θ̂n)]^T (θ̂0,n − θ̂n) + ½ (θ̂0,n − θ̂n)^T M̈n(θ̃n) (θ̂0,n − θ̂n).

• Since θ̂n is the MLE, we have Ṁn(θ̂n) = 0, assuming θ̂n ∈ int(Ω).
• By the same uniform-convergence arguments, θ̃n = θ0 + op(1) implies
• M̈n(θ̃n) = M̈n(θ0) + op(1) = −Iθ0 + op(1),
• where the last equality holds because M̈n(θ0) →p Eθ0[ℓ̈θ0] = −Iθ0 by the WLLN.

203 / 218

• Assuming that √n(θ̂0,n − θ̂n) = Op(1), we obtain

    Λn = [√n(θ̂0,n − θ̂n)]^T Iθ0 [√n(θ̂0,n − θ̂n)] + op(1).

• Asymptotically, the GLR measures a particular (squared) distance between θ̂0,n and θ̂n, one which weighs different directions differently, according to the eigenvectors of I(θ0).
• More specifically, let ||z||_Q := √(z^T Q z) = ||Q^{1/2} z||₂, which defines a norm when Q ≻ 0. Then,

    Λn = ||√n(θ̂0,n − θ̂n)||²_{Iθ0} + op(1).

• In the simple case where Ω0 = {θ0}, we have √n(θ̂n − θ0) →d N(0, Iθ0^{−1}).
• Equivalently, √n(θ̂n − θ0) →d Iθ0^{−1/2} Z, where Z ~ N(0, I_d).
• It follows from the CM theorem (since z ↦ ||z||_Q is continuous) that

    Λn →d ||Iθ0^{−1/2} Z||²_{Iθ0} = ||Iθ0^{1/2} Iθ0^{−1/2} Z||₂² = ||Z||₂².

• Since ||Z||₂² = Σ_{i=1}^d Zi² ~ χ²_d, we have the proof for the case Ω0 = {θ0}.
• The proof of the general case is more complicated and is omitted.

204 / 218

Example 70 (Multinomial: testing uniformity)

• (X1, ..., Xd) ~ Multinomial(n, θ), where θ = (θ1, ..., θd) is a probability vector, that is,

    θ ∈ Ω = { θ ∈ R^d : θi ≥ 0, Σ_i θi = 1 }.

• Xj counts how many of the n objects fall into category j.

    pθ(x) = (n choose x1, ..., xd) Π_{i=1}^d θi^{xi} ∝ Π_{i=1}^d θi^{xi}.

• We would like to test Ω0 = {θ0}, where θ0 = (1/d, ..., 1/d), versus Ω1 = Ω \ Ω0.
• The MLE over Ω is given by θ̂i = xi/n. This requires techniques for constrained optimization, such as Lagrange multipliers, since Ω itself is constrained. (Exercise.)

205 / 218

• We obtain

    Λn = 2 log[ p_{θ̂}(x) / pθ0(x) ] = 2 Σ_{i=1}^d xi log( θ̂i / (θ0)i )
       = 2n Σ_i θ̂i log( θ̂i / (θ0)i ) = 2n D(θ̂ || θ0).

• Both θ̂ and θ0 are probability vectors; D(θ̂ || θ0) is their KL divergence.
• The GLRT does a sensible thing: reject the null if θ̂ is far from θ0 in KL divergence. (A small simulation is sketched below.)
• Our asymptotic theory implies Λn →d χ²_{d−1} under the null, i.e., θ = θ0, since Ω is (d − 1)-dimensional and Ω0 is 0-dimensional. This is a fairly non-trivial result.
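A sketch (our own, illustrative constants) of the multinomial GLRT statistic 2n·D(θ̂ || θ0) for testing uniformity, compared against its asymptotic χ²_{d−1} quantile:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
d, n, reps = 5, 200, 10000
theta0 = np.full(d, 1.0 / d)

counts = rng.multinomial(n, theta0, size=reps)
theta_hat = counts / n
with np.errstate(divide="ignore", invalid="ignore"):
    # convention: 0 * log(0) = 0 in the KL divergence
    terms = np.where(theta_hat > 0, theta_hat * np.log(theta_hat / theta0), 0.0)
lam = 2 * n * terms.sum(axis=1)

print(np.mean(lam > chi2.ppf(0.95, df=d - 1)))   # ~ 0.05 under the null
```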

206 / 218

p-values

• Consider a family of tests {δα, α ∈ (0, 1)} indexed by their level α.
• Assume: α ↦ δα(x) is nondecreasing and right-continuous.
• E.g., if δα(x) = 1{x ∈ C(α)}, then C(α1) ⊆ C(α2) if α1 ≤ α2.
• The p-value, or attained significance, for an observed x is defined as

    p(x) := inf{ α : δα(x) = 1 }.

• Note that p = p(X) is a random variable. We have

    p(X) ≤ α   ⟺   δα(X) = 1.

• p ≤ α implies 1 = δp ≤ δα, since the infimum is attained by the assumptions on δα, hence δα = 1. The other direction follows from the definition of inf.
• This implies P0(p(X) ≤ α) = P0(δα(X) = 1) = α.
• That is, p = p(X) ~ U[0, 1] under the null.

207 / 218

Example 71 (Normal example continued)
• Consider X ~ N(θ, 1) and H0 : θ = θ0 versus H1 : θ = θ1.
• The MP test at level α is δα(X) = 1{X − θ0 ≥ Q^{−1}(α)}.
• Alternatively, δα(X) = 1{Q(X − θ0) ≤ α}, since Q is decreasing.
• The p-value is

    p(X) = inf{ α : Q(X − θ0) ≤ α } = Q(X − θ0).   (9)

• Under the null, X − θ0 ~ N(0, 1), hence Φ(X − θ0) ~ U[0, 1] (why?).
• Then p(X) = 1 − Φ(X − θ0) ~ U[0, 1], as expected.
• Exercise: verify that under H1 : θ = θ1, the CDF of p(X) is

    P1(p(X) ≤ t) = Q(−δ + Q^{−1}(t)),

  where δ = θ1 − θ0. Note that this curve is the same as the ROC.
• Recall Q(t) = 1 − Φ(t), where Φ is the CDF of N(0, 1).

208 / 218

• One can verify that definition (9) produces the "common" definition of p-values, say when δα(X) = 1{T ≥ τα} or δα(X) = 1{|T| ≥ τα}.

Example 72
• Consider the two-sided test and let G(t) = P0(|T| ≥ t).
• Assume that G is continuous, hence invertible (G and its inverse are both decreasing).
• Requiring level α: G(τα) = P0(|T| ≥ τα) = α ⟹ τα = G^{−1}(α).
• This gives δα(X) = 1{|T| ≥ G^{−1}(α)} = 1{G(|T|) ≤ α}.
• By definition (9), p = G(|T|), which is the common definition.

209 / 218

Multiple hypothesis testing

• We have a collection of null hypotheses H0,i, i = 1, ..., n.

Example 73 (Basic example)
• Testing in the normal means model

    yi ~ N(µi, 1),   i = 1, ..., n,

  with H0,i : µi = 0.
• yi could be the expression level (or state) of gene i.
• H0,i means that gene i has no effect on the disease under consideration.

• Suppose that for each H0,i we have a test, hence a p-value pi.
• Assume that under H0,i, pi ~ U[0, 1].
• A test that rejects H0,i when pi ≤ α is of size α under the ith null.

210 / 218

Testing the global null

• The global null is H0 = ∩_{i=1}^n H0,i.
• We want to combine p1, ..., pn to build a test of level α for H0.
• We could use 1{pi ≤ α} for a fixed i, but we get better power if we use all of them.
• Bonferroni's test for the global null:

    Reject H0 if min_i pi ≤ α/n.

• By the union bound (no independence needed),

    P_{H0}(reject H0) = P_{H0}( ∪_{i=1}^n {pi ≤ α/n} ) ≤ Σ_{i=1}^n P_{H0}(pi ≤ α/n) = α.

• Exercise: assuming the pi are independent under H0, the exact size of Bonferroni's test is 1 − (1 − α/n)^n → 1 − e^{−α} as n → ∞. Thus, for large n and small α, the size is ≈ 1 − e^{−α} ≈ α, hence the union bound is not bad in this case.

211 / 218

Fisher test for the global null

• Fisher combination test:

    Reject H0 if Tn := −2 Σ_{i=1}^n log pi > χ²_{2n}(1 − α).

Lemma 7
If p1, ..., pn are independent, then Tn ~ χ²_{2n}.

• Thus, assuming independence under H0, the Fisher test has the correct size.

212 / 218

Simes test for the global null

• Simes test:

    Reject H0 if Tn := min_i { p_(i) · n/i } ≤ α,

  where p_(1) ≤ p_(2) ≤ ... ≤ p_(n) are the order statistics of p1, ..., pn.

Lemma 8
If p1, ..., pn are independent, then Tn ~ U[0, 1].

• (Independence can be relaxed.)
• Thus, assuming independence under H0, the Simes test has the correct size.
• Equivalent form of the Simes test:

    Reject H0 if p_(i) ≤ (i/n) α for some i.

• Less conservative than Bonferroni's test, which rejects if p_(1) ≤ α/n. (A simulation comparing these global tests is sketched below.)
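A sketch (our own, illustrative constants) checking the size of the Bonferroni, Fisher and Simes global-null tests when the p-values are independent U[0, 1], i.e., under the global null:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
n, alpha, reps = 50, 0.05, 20000
p = rng.uniform(size=(reps, n))

bonf = p.min(axis=1) <= alpha / n
fisher = -2 * np.log(p).sum(axis=1) > chi2.ppf(1 - alpha, df=2 * n)
p_sorted = np.sort(p, axis=1)
simes = (p_sorted * n / np.arange(1, n + 1)).min(axis=1) <= alpha

print(bonf.mean(), fisher.mean(), simes.mean())   # each ~ alpha = 0.05
```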

213 / 218

Testing individual hypotheses

• In the gene expression example, we care about which genes are null/not null. We would like to test H0,i : µi = 0 versus H1,i : µi = 1 for all i.
• The problem has a resemblance to classification.
• We are interested in how we are doing in an aggregate sense.
• We can think of having a decision problem

    p ~ Pθ,   where θ ∈ {0, 1}^n.   (10)

• θi = 1 iff H0,i is true.
• The global null corresponds to θ = (1, ..., 1), the all-ones vector.
• Minimal assumption: when θi = 1, we have pi ~ U[0, 1], i.e., the ith marginal of Pθ is uniform.

214 / 218

Terminology (shared with classification)

• Confusion matrix: count how many of each combination we have.
• For example, if θ̂ ∈ {0, 1}^n is our guess for the hypotheses,

    TP = (1/n) Σ_{i=1}^n 1{θi = 1, θ̂i = 1}.

                   positive (1)   negative (0)   total
                   accepted       rejected
    true (1)       TP             TN             T
    false (0)      FP             FN             F
    total          P              N

• "True" here means H0,i is true.

215 / 218

Alternative notation:

                   positive (1)   negative (0)   total
                   accepted       rejected
    true (1)       U              V              n0
    false (0)      T              S              n − n0
    total          n − R          R              n

• Familywise error rate (FWER):

    FWERθ = Pθ(V ≥ 1).

• A much less stringent criterion is the False Discovery Rate (FDR).
• Consider the false discovery proportion (a random variable)

    FDP = V / max(R, 1) = (V/R) 1{R > 0}.

• FDR is the expectation of FDP:

    FDRθ = Eθ[FDP].

216 / 218

• Controlling FWER in a strong sense: control for all θ ∈ {0, 1}^n.

Theorem 17
Bonferroni's approach, where we test each hypothesis at level α/n, controls FWER at level α in a strong sense. In fact,

    Eθ[V] ≤ (n0/n) α,   ∀ θ ∈ {0, 1}^n.

• E[V] = Σ_{i=1}^n P(V ≥ i), which holds for any nonnegative integer-valued variable.
• Hence, E[V] ≥ P(V ≥ 1).
• Let Vi = 1{H0,i is true but rejected} = 1{θi = 1, θ̂i = 0}, so that

    Eθ[Vi] = 1{θi = 1} Pθ(θ̂i = 0).

• Since V = Σ_i Vi,

    Eθ[V] = Σ_i Eθ[Vi] = Σ_{i: θi = 1} Pθ(θ̂i = 0) ≤ Σ_{i: θi = 1} α/n = (n0/n) α.

  (Here θ̂i is based only on pi.)

217 / 218

• Benjamini–Hochberg (BH) procedure: let i0 be the largest i such that

    p_(i) ≤ (i/n) q.

• Reject all hypotheses whose p-values are among the i0 smallest, i.e., reject H0,(i) for i ≤ i0. (A short reference implementation is sketched below.)

Theorem 18
Under independence of the null hypotheses from each other and from the non-nulls, the BH procedure (uniformly) controls the FDR at level q. In fact,

    FDRθ(θ̂BH) = (n0/n) q,   ∀ θ ∈ {0, 1}^n.
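A sketch (our own) of the BH step-up procedure; the function name and the toy data are ours, chosen for illustration:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """Return a boolean array: True where H0,i is rejected by BH at level q."""
    p = np.asarray(pvals)
    n = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, n + 1) / n          # (i/n) * q
    below = p[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        i0 = np.max(np.nonzero(below)[0])         # largest i with p_(i) <= (i/n) q
        reject[order[: i0 + 1]] = True            # reject the i0 smallest p-values
    return reject

# toy example: 90 null p-values (uniform) and 10 signals (very small p-values)
rng = np.random.default_rng(10)
p = np.concatenate([rng.uniform(size=90), rng.uniform(0, 1e-3, size=10)])
print(benjamini_hochberg(p, q=0.1).sum())         # number of rejections
```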

218 / 218