Statistical Theory: Introduction

Statistical Theory
Victor Panaretos
École polytechnique fédérale de Lausanne

Outline:
1 What is This Course About?
2 Elements of a Statistical Model
3 Parameters and Parametrizations
4 Some Motivating Examples

Statistical Theory (Week 1) Introduction 1 / 16

What is This Course About?

Statistics: Extracting Information from Data
- Age of Universe (Astrophysics)
- Random Networks (Internet)
- Microarrays (Genetics)
- Inflation (Economics)
- Stock Markets (Finance)
- Phylogenetics (Evolution)
- Pattern Recognition (Artificial Intelligence)
- Molecular Structure (Structural Biology)
- Climate Reconstruction (Paleoclimatology)
- Seal Tracking (Marine Biology)
- Quality Control (Mass Production)
- Disease Transmission (Epidemics)

"We may at once admit that any inference from the particular to the general must be attended with some degree of uncertainty, but this is not the same as to admit that such inference cannot be absolutely rigorous, for the nature and degree of the uncertainty may itself be capable of rigorous expression." — Ronald A. Fisher

"The object of rigor is to sanction and legitimize the conquests of intuition, and there was never any other object for it." — Jacques Hadamard

- The variety of different forms of data is bewildering...
- ...but the concepts involved in their analysis show fundamental similarities.
- Embed them in a framework and study them rigorously.
- Is there a unified mathematical theory?

What is This Course About?

Statistical Theory: What and How?
- What? The rigorous study of the procedure of extracting information from data, using the formalism and machinery of mathematics.
- How? By thinking of data as outcomes of probability experiments:
  - Probability offers a natural language to describe uncertainty or partial knowledge
  - Deep connections between probability and formal logic
  - Can break down a phenomenon into systematic and random parts

The Job of the Probabilist
Given a probability model P on a measurable space (Ω, F), find the probability P[A] that the outcome of the experiment is A ∈ F.

The Job of the Statistician
Given an outcome A ∈ F (the data) of a probability experiment on (Ω, F), tell me something interesting* about the (unknown) probability model P that generated the outcome.
(*something in addition to what I knew before observing the outcome A)

What can Data be?
To do probability we simply need a measurable space (Ω, F). Hence, almost anything that can be mathematically expressed can be thought of as data (numbers, functions, graphs, shapes, ...).

Such questions can be:
1 Are the data consistent with a certain model?
2 Given a family of models, can we determine which model generated the data?
These give birth to more questions: how can we answer 1 and 2? is there a best way? how much "off" is our answer?

A Probabilist and a Statistician Flip a Coin

Example
Let X1, ..., X10 denote the results of flipping a coin ten times, with
Xi = 0 if heads, 1 if tails, i = 1, ..., 10.
A plausible model is Xi ~iid Bernoulli(θ). We record the outcome
X = (0, 0, 0, 1, 0, 1, 1, 1, 1, 1).

The Probabilist asks:
- Probability of the outcome as a function of θ?
- Probability of a k-long run?
- If we keep tossing, how many k-long runs? How long until a k-long run?

Example (cont'd)
The Statistician asks:
- Is the coin fair?
- What is the true value of θ given X?
- How much error do we make when trying to decide the above from X?
- How does our answer change if X is perturbed?
- Is there a "best" solution to the above problems?
- How sensitive are our answers to departures from Xi ~iid Bernoulli(θ)?
- How do our "answers" behave as #tosses → ∞?
- How many tosses would we need until we can get "accurate answers"?
- Does our model agree with the data?
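The probabilist's first question — the probability of the recorded outcome as a function of θ — can be sketched numerically. A minimal illustration (the grid search and variable names are our own, not part of the slides):

```python
# Recorded outcome from the example: X = (0,0,0,1,0,1,1,1,1,1).
x = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

def likelihood(theta, xs):
    """P[X = x] under the iid Bernoulli(theta) model."""
    t, n = sum(xs), len(xs)
    return theta**t * (1 - theta)**(n - t)

# The probability of the outcome as a function of theta, on a grid;
# it is maximised at the sample proportion 6/10.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda th: likelihood(th, x))
print(best)  # 0.6
```

The maximiser 0.6 is a first, informal answer to the statistician's question "what is the true value of θ given X?" — a theme developed later in the course.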

The Basic Setup

Elements of a Statistical Model:
- Have a random experiment with sample space Ω.
- X : Ω → R^n is a random variable, X = (X1, ..., Xn), defined on Ω.
- When the outcome of the experiment is ω ∈ Ω, we observe X(ω) and call it the data (usually ω is omitted).
- The probability experiment of observing a realisation of X is completely determined by the distribution F of X.
- F is assumed to be a member of a family F of distributions on R^n.

Goal: Learn about F ∈ F given the data X.

The Basic Setup: An Illustration

Example (Coin Tossing)
Consider the following probability space:
- Ω = [0, 1]^n with elements ω = (ω1, ..., ωn) ∈ Ω
- F are the Borel subsets of Ω (the product σ-algebra)
- P is the uniform probability measure (Lebesgue measure) on [0, 1]^n
Now we can define the experiment of n coin tosses as follows:
- Let θ ∈ (0, 1) be a constant
- For i = 1, ..., n let Xi = 1{ωi > θ}
- Let X = (X1, ..., Xn), so that X : Ω → {0, 1}^n
Then
F_{Xi}(xi) = P[Xi ≤ xi] = 0 if xi ∈ (−∞, 0); θ if xi ∈ [0, 1); 1 if xi ∈ [1, +∞),
and F_X(x) = ∏_{i=1}^n F_{Xi}(xi).

Describing Families of Distributions: Parametric Models
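The coin-tossing construction above can be simulated directly: draw a uniform point ω of [0,1]^n and set Xi = 1{ωi > θ}. A small sketch (seed, sample size, and the value of θ are arbitrary choices, not from the slides):

```python
import random

random.seed(0)
theta, n = 0.3, 100_000

# A point omega of Omega = [0,1]^n under the uniform (Lebesgue) measure:
omega = [random.random() for _ in range(n)]
# The data X(omega), with X_i = 1{omega_i > theta}:
x = [1 if w > theta else 0 for w in omega]

# P[X_i = 0] = P[omega_i <= theta] = theta, matching F_{X_i}(0) = theta.
frac_zeros = 1 - sum(x) / n
print(frac_zeros)  # should be close to theta = 0.3
```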

Definition (Parametrization)
Let Θ be a set, F be a family of distributions and g : Θ → F an onto mapping. The pair (Θ, g) is called a parametrization of F.

Definition (Parametric Model)
A parametric model with parameter space Θ ⊆ R^d is a family of probability models F parametrized by Θ, F = {F_θ : θ ∈ Θ}.

Example (IID Normal Model)
F = { ∏_{i=1}^n ∫_{−∞}^{x_i} (1/(σ√(2π))) e^{−(y_i − µ)²/(2σ²)} dy_i : (µ, σ²) ∈ R × R_+ }

Example (Geometric Distribution)
Let X1, ..., Xn be iid geometric(p) distributed: P[Xi = k] = p(1 − p)^k, k ∈ N ∪ {0}. Two possible parametrizations are:
1 [0, 1] ∋ p ↦ geometric(p)
2 [0, ∞) ∋ µ ↦ geometric with mean µ

Example (Poisson Distribution)
Let X1, ..., Xn be Poisson(λ) distributed: P[Xi = k] = e^{−λ} λ^k / k!, k ∈ N ∪ {0}. Three possible parametrizations are:
1 [0, ∞) ∋ λ ↦ Poisson(λ)
2 [0, ∞) ∋ µ ↦ Poisson with mean µ
3 [0, ∞) ∋ σ² ↦ Poisson with variance σ²

- When Θ is not Euclidean, we call F non-parametric.
- When Θ is a product of a Euclidean and a non-Euclidean space, we call F semi-parametric.

Identifiability

- A parametrization is often suggested by the phenomenon we are modelling.
- But any set Θ and surjection g : Θ → F give a parametrization. Many parametrizations are possible! Is any parametrization sensible?

Definition (Identifiability)
A parametrization (Θ, g) of a family of models F is called identifiable if g : Θ → F is a bijection (i.e. if g is injective on top of being surjective).

When a parametrization is not identifiable:
- We have θ1 ≠ θ2 but F_{θ1} = F_{θ2}.
- Even with ∞ amounts of data we could not distinguish θ1 from θ2.

Definition (Parameter)
A parameter is a function ν : F → N, where N is arbitrary.
- A parameter is a feature of the distribution F_θ.
- When θ ↦ F_θ is identifiable, then ν(F_θ) = q(θ) for some q : Θ → N.

Example (Binomial Thinning)
Let {B_{i,j}} be an infinite iid array of Bernoulli(ψ) variables and ξ1, ..., ξn be an iid sequence of geometric(p) random variables with probability mass function P[ξi = k] = p(1 − p)^k, k ∈ N ∪ {0}. Let X1, ..., Xn be iid random variables defined by
X_j = Σ_{i=1}^{ξ_j} B_{i,j},  j = 1, ..., n.
Any F_X ∈ F is completely determined by (ψ, p), so [0, 1]² ∋ (ψ, p) ↦ F_X is a parametrization of F. One can show (how?) that
X ~ geometric( p / (ψ(1 − p) + p) ).
However, (ψ, p) is not identifiable (why?).
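The thinning identity and the failure of identifiability can both be checked by simulation. A sketch: the two parameter pairs (ψ, p) = (1/2, 1/2) and (1/3, 2/5) are our own choices, picked because both induce the same q = p/(ψ(1 − p) + p) = 2/3, hence the same distribution for X (with mean (1 − q)/q = 1/2):

```python
import random

def thinned_mean(psi, p, reps, rng):
    """Average of X = sum_{i=1}^{xi} B_i, xi ~ geometric(p), B_i ~ Bernoulli(psi)."""
    total = 0
    for _ in range(reps):
        xi = 0
        while rng.random() >= p:      # P[xi = k] = p(1-p)^k, k = 0, 1, 2, ...
            xi += 1
        total += sum(rng.random() < psi for _ in range(xi))
    return total / reps

rng = random.Random(1)
reps = 200_000
m1 = thinned_mean(1/2, 1/2, reps, rng)
m2 = thinned_mean(1/3, 2/5, reps, rng)

# Indistinguishable samples from two different parameter values (psi, p):
print(m1, m2)  # both close to (1-q)/q = 0.5
```

The two empirical means agree (up to Monte Carlo error), illustrating why (ψ, p) cannot be recovered from data.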

Parametric Inference for Regular Models
- We will focus on parametric families F.
- The aspects we will wish to learn about will be parameters of F ∈ F.

Regular Models
Assume from now on that in any parametric model we consider, either:
1 All of the F_θ are continuous with densities f(x, θ); or
2 All of the F_θ are discrete with frequency functions p(x, θ), and there exists a countable set A that is independent of θ such that Σ_{x ∈ A} p(x, θ) = 1 for all θ ∈ Θ.

We will be considering the mathematical aspects of problems such as:
1 Estimating which θ ∈ Θ (i.e. which F_θ ∈ F) generated X
2 Deciding whether some hypothesized values of θ are consistent with X
3 The performance of methods and the existence of optimal methods
4 What happens when our model is wrong?

Examples

Example (Five Examples)
- Sampling Inspection (Hypergeometric Distribution)
- Problem of Location (Location-Scale Families)
- Regression Models (Non-identically distributed data)
- Autoregressive Measurement Error Model (Dependent data)
- Random Projections of Triangles (Shape Theory)

Statistical Theory
Victor Panaretos
École Polytechnique Fédérale de Lausanne

Outline:
1 Motivation: Functions of Random Variables
2 Stochastic Convergence
  - Overview of Stochastic Convergence
  - How does a R.V. "Converge"?
  - Convergence in Probability and in Distribution
3 Useful Theorems
  - Weak Convergence of Random Vectors
4 Stronger Notions of Convergence
5 The Two "Big" Theorems

Statistical Theory (Week 2) Stochastic Convergence 1 / 21

Functions of Random Variables

Let X1, ..., Xn be i.i.d. with EXi = µ and var[Xi] = σ². Consider:
X̄n = (1/n) Σ_{i=1}^n Xi.
- If Xi ~ N(µ, σ²) or Xi ~ exp(1/µ), then we know dist[X̄n].
- But the Xi may come from some more general distribution.
- The joint distribution of the Xi may not even be completely understood.
- We would like to be able to say something about X̄n even in those cases!
We want to understand the distribution of Y = g(X1, ..., Xn) for some general g:
- Often intractable.
- Resort to asymptotic approximations to understand the behaviour of Y.
- Perhaps this is not easy for fixed n, but what about letting n → ∞? (a very common approach in mathematics)

Once we assume that n → ∞ we start understanding dist[X̄n] better:
- At a crude level, X̄n becomes concentrated around µ:
  P[|X̄n − µ| < ε] ≈ 1, ∀ε > 0, as n → ∞.
- Perhaps more informative is to look at the "magnified difference":
  P[√n(X̄n − µ) ≤ x] → ? as n → ∞, which could yield an approximation to P[X̄n ≤ x].
- Warning: while a lot is known about asymptotics, they are often misused (n small!).
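The two displays above can be illustrated by simulation: the spread of X̄n shrinks like n^{-1/2}, while the "magnified difference" √n(X̄n − µ) keeps a stable spread. A sketch using exponential data with µ = σ = 1 (the distribution, seed, and sample sizes are our own choices):

```python
import random, statistics

rng = random.Random(42)
reps = 2000
results = {}

for n in (10, 100, 1000):
    # reps realisations of the sample mean of n iid exp(1) variables:
    bars = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
    sd_raw = statistics.pstdev(bars)                                # ~ sigma/sqrt(n)
    sd_mag = statistics.pstdev([(b - 1.0) * n**0.5 for b in bars])  # stabilises near sigma = 1
    results[n] = (sd_raw, sd_mag)
    print(n, round(sd_raw, 3), round(sd_mag, 3))
```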

Convergence of Random Variables

Need to make precise what we mean by:
- "Yn is concentrated around µ as n → ∞"
- More generally, what "Yn behaves like Y for large n" means
- dist[g(X1, ..., Xn)] ≈ ? as n → ∞
We need appropriate notions of convergence for random variables.
Recall: random variables are functions between measurable spaces.
⇒ Convergence of random variables can be defined in various ways:
- Convergence in probability (convergence in measure)
- Convergence in distribution (weak convergence)
- Convergence with probability 1 (almost sure convergence)
- Convergence in L^p (convergence in the p-th moment)
Each of these is qualitatively different - some notions are stronger than others.

Convergence in Probability

Definition (Convergence in Probability)
Let {Xn}_{n≥1} and X be random variables defined on the same probability space. We say that Xn converges in probability to X as n → ∞ (and write Xn →p X) if for any ε > 0,
P[|Xn − X| > ε] → 0 as n → ∞.
Intuitively, if Xn →p X, then with high probability Xn ≈ X for large n.

Example
Let X1, ..., Xn ~iid U[0, 1], and define Mn = max{X1, ..., Xn}. Then,
F_{Mn}(x) = x^n ⇒ P[|Mn − 1| > ε] = P[Mn < 1 − ε] = (1 − ε)^n → 0 as n → ∞,
for any 0 < ε < 1. Hence Mn →p 1.

Convergence in Distribution

Definition (Convergence in Distribution)
Let {Xn} and X be random variables (not necessarily defined on the same probability space). We say that Xn converges in distribution to X as n → ∞ (and write Xn →d X) if
P[Xn ≤ x] → P[X ≤ x] as n → ∞,
at every continuity point of F_X(x) = P[X ≤ x].

Example
Let X1, ..., Xn ~iid U[0, 1], Mn = max{X1, ..., Xn}, and Qn = n(1 − Mn). Then
P[Qn ≤ x] = P[Mn ≥ 1 − x/n] = 1 − (1 − x/n)^n → 1 − e^{−x} as n → ∞,
for all x ≥ 0. Hence Qn →d Q, with Q ~ exp(1).

Equivalent definition: Xn →d X ⇔ Ef(Xn) → Ef(X) for all continuous and bounded f.

Some Comments on "→p" and "→d"
- Convergence in probability implies convergence in distribution.
- Convergence in distribution does NOT imply convergence in probability: for X ~ N(0, 1), the sequence Xn := −X + 1/n satisfies Xn →d X but Xn →p −X ≠ X.
- "→d" relates distribution functions; it can be used to approximate distributions (approximation error?).
- Both notions of convergence are metrizable, i.e. there exist metrics on the space of random variables and on distribution functions that are compatible with the notion of convergence. Hence we can use things such as the triangle inequality.
- "→d" is also known as "weak convergence" (we will see why).

Some Basic Results
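The limit in the Qn example can be checked numerically from the exact distribution function P[Qn ≤ x] = 1 − (1 − x/n)^n. A deterministic sketch (the grid of x-values is an arbitrary choice):

```python
import math

def cdf_Qn(x, n):
    # Q_n = n(1 - M_n), with F_{M_n}(x) = x^n on [0, 1]
    return 1 - (1 - x / n) ** n

def cdf_exp1(x):
    return 1 - math.exp(-x)

xs = [0.1 * k for k in range(1, 50)]
gaps = []
for n in (10, 100, 1000):
    gaps.append(max(abs(cdf_Qn(x, n) - cdf_exp1(x)) for x in xs))
print(gaps)  # shrinking towards 0 as n grows
```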

Theorem
(a) Xn →p X ⇒ Xn →d X
(b) Xn →d c ⇒ Xn →p c, for a constant c ∈ R.

Proof.
(a) Let x be a continuity point of F_X and ε > 0. Then,
P[Xn ≤ x] = P[Xn ≤ x, |Xn − X| ≤ ε] + P[Xn ≤ x, |Xn − X| > ε]
          ≤ P[X ≤ x + ε] + P[|Xn − X| > ε],
since {X ≤ x + ε} contains {Xn ≤ x, |Xn − X| ≤ ε}. Similarly,
P[X ≤ x − ε] = P[X ≤ x − ε, |Xn − X| ≤ ε] + P[X ≤ x − ε, |Xn − X| > ε]
            ≤ P[Xn ≤ x] + P[|Xn − X| > ε],
which yields P[X ≤ x − ε] − P[|Xn − X| > ε] ≤ P[Xn ≤ x].
Combining the two inequalities and "sandwiching" yields the result.

(b) Let F be the distribution function of the constant r.v. c:
F(x) = P[c ≤ x] = 1 if x ≥ c; 0 if x < c.
Then
P[|Xn − c| > ε] = P[{Xn − c > ε} ∪ {c − Xn > ε}]
               = P[Xn > c + ε] + P[Xn < c − ε]
               ≤ 1 − P[Xn ≤ c + ε] + P[Xn ≤ c − ε]
               → 1 − F(c + ε) + F(c − ε) = 0 as n → ∞,
since Xn →d c. ∎

Theorem (Continuous Mapping Theorem)
Let g : R → R be a continuous function. Then,
(a) Xn →p X ⇒ g(Xn) →p g(X)
(b) Yn →d Y ⇒ g(Yn) →d g(Y)

Exercise
Prove part (a). You may assume without proof the Subsequence Lemma:
Xn →p X if and only if every subsequence X_{n_m} of X_n has a further subsequence X_{n_{m(k)}} such that P[X_{n_{m(k)}} → X] = 1.

Theorem (Slutsky's Theorem)
Let Xn →d X and Yn →d c ∈ R. Then
(a) Xn + Yn →d X + c
(b) Xn Yn →d cX

Proof of Slutsky's Theorem.
(a) We may assume c = 0. Let x be a continuity point of F_X. We have
P[Xn + Yn ≤ x] = P[Xn + Yn ≤ x, |Yn| ≤ ε] + P[Xn + Yn ≤ x, |Yn| > ε]
              ≤ P[Xn ≤ x + ε] + P[|Yn| > ε].
Similarly, P[Xn ≤ x − ε] ≤ P[Xn + Yn ≤ x] + P[|Yn| > ε]. Therefore,
P[Xn ≤ x − ε] − P[|Yn| > ε] ≤ P[Xn + Yn ≤ x] ≤ P[Xn ≤ x + ε] + P[|Yn| > ε].
Taking n → ∞, and then ε → 0, proves (a).
(b) By (a) we may assume that c = 0 (check). Let ε, M > 0:
P[|Xn Yn| > ε] ≤ P[|Xn Yn| > ε, |Yn| ≤ 1/M] + P[|Yn| ≥ 1/M]
             ≤ P[|Xn| > εM] + P[|Yn| ≥ 1/M]
             → P[|X| > εM] + 0 as n → ∞.
The first term can be made arbitrarily small by letting M → ∞. ∎

Statistical Theory (Week 2) Stochastic Convergence 11 / 21 Statistical Theory (Week 2) Stochastic Convergence 12 / 21 Theorem (General Version of Slutsky’s Theorem) Theorem (The Delta Method) d d Let g : R R R be continuous and suppose that Xn X and × → → Let Zn := an(Xn θ) Z where an, θ R for all n and an . Let g( ) d − → ∈ ↑d ∞ · Yn c R. Then, g(Xn, Yn) g(X , c) as n . be continuously differentiable at θ. Then, a (g(X ) g(θ)) g (θ)Z. → ∈ → → ∞ n n − → 0 , Notice that the general version of Slutsky’s theorem does not follow → immediately from the continuous mapping theorem. Proof Taylor expanding around θ gives: The continuous mapping theorem would be applicable if (Xn, Yn) weakly converged jointly (i.e. their joint distribution) to (X , c). g(Xn) = g(θ) + g 0(θ∗)(Xn θ), θ∗ between Xn, θ. d n − n But here we assume only marginal convergence (i.e. Xn X and d → 1 1 p Yn c separately, but their joint behaviour is unspecified). Thus θn∗ θ < Xn θ = an− an(Xn θ) = an− Zn 0 [by Slutsky] → | − | p | − | · | − | → p d Therefore, θn∗ θ. By the continuous mapping theorem g 0(θn∗) g 0(θ). The key of the proof is that in the special case where Yn c where c → → → is a constant, then marginal convergence joint convergence. ⇐⇒ Thus an(g(Xn) g(θ)) = an(g(θ) + g 0(θn∗)(Xn θ) g(θ)) d d − − − However if Xn X where X is non-degenerate, and Yn Y where d → → = g 0(θn∗)an(X θ) g 0(θ)Z. Y is non-degenerate, then the theorem fails. − → Notice that even the special cases (addition and multiplication) of The delta method actually applies even when g 0(θ) is not continuous Slutsky’s theorem fail of both X and Y are non-degenerate. (proof uses Skorokhod representation).


Theorem (Convergence of Expectations)
If |Xn| < M < ∞ and Xn →d X, then EX exists and EXn → EX as n → ∞.

Proof.
Assume first that the Xn are non-negative ∀n. Then,
|EXn − EX| = |∫_0^M (P[Xn > x] − P[X > x]) dx| ≤ ∫_0^M |P[Xn > x] − P[X > x]| dx → 0 as n → ∞,
since M < ∞ and the integration domain is bounded. ∎
Exercise: Generalise the proof to arbitrary random variables.
Exercise: Give a counterexample to show that neither Xn →p X nor Xn →d X alone ensures that EXn → EX as n → ∞.

Remarks on Weak Convergence
- Often difficult to establish weak convergence directly (from the definition).
- Indeed, if Fn is known, establishing weak convergence is "useless".
- Need other, more "handy", sufficient conditions.

Scheffé's Theorem
Let Xn have density functions (or probability functions) fn, and let X have density function (or probability function) f. Then
fn → f (a.e.) as n → ∞ ⇒ Xn →d X.
- The converse to Scheffé's theorem is NOT true (why?).

Continuity Theorem
Let Xn and X have characteristic functions φn(t) = E[e^{itXn}] and φ(t) = E[e^{itX}], respectively. Then,
(a) Xn →d X ⇔ φn → φ pointwise.
(b) If φn(t) converges pointwise to some limit function ψ(t) that is continuous at zero, then:
(i) ∃ a measure ν with c.f. ψ;
(ii) F_{Xn} →w ν.

Weak Convergence of Random Vectors

Definition
Let {Xn} be a sequence of random vectors of R^d, and X a random vector of R^d, with Xn = (Xn^(1), ..., Xn^(d))^T and X = (X^(1), ..., X^(d))^T. Define the distribution functions F_{Xn}(x) = P[Xn^(1) ≤ x^(1), ..., Xn^(d) ≤ x^(d)] and F_X(x) = P[X^(1) ≤ x^(1), ..., X^(d) ≤ x^(d)], for x = (x^(1), ..., x^(d))^T ∈ R^d. We say that Xn converges in distribution to X as n → ∞ (and write Xn →d X) if for every continuity point x of F_X we have
F_{Xn}(x) → F_X(x) as n → ∞.

There is a link between univariate and multivariate weak convergence:

Theorem (Cramér-Wold Device)
Let {Xn} be a sequence of random vectors of R^d, and X a random vector of R^d. Then,
Xn →d X ⇔ θ^T Xn →d θ^T X, ∀ θ ∈ R^d.

Almost Sure Convergence and Convergence in L^p

There are also two stronger convergence concepts (which do not compare with one another):

Definition (Almost Sure Convergence)
Let {Xn}_{n≥1} and X be random variables defined on the same probability space (Ω, F, P). Let A := {ω ∈ Ω : Xn(ω) → X(ω) as n → ∞}. We say that Xn converges almost surely to X as n → ∞ (and write Xn →a.s. X) if P[A] = 1.
More plainly, we say Xn →a.s. X if P[Xn → X] = 1.

Definition (Convergence in L^p)
Let {Xn}_{n≥1} and X be random variables defined on the same probability space. We say that Xn converges to X in L^p as n → ∞ (and write Xn →Lp X) if
E|Xn − X|^p → 0 as n → ∞.

Note that ‖X‖_{L^p} := (E|X|^p)^{1/p} defines a complete norm (when finite).

Relationship Between Different Types of Convergence
- Xn →a.s. X ⇒ Xn →p X ⇒ Xn →d X
- Xn →Lp X, for p > 0 ⇒ Xn →p X ⇒ Xn →d X
- Xn →Lp X ⇒ Xn →Lq X, for p ≥ q
- There is no implicative relationship between "→a.s." and "→Lp"

Theorem (Skorokhod's Representation Theorem)
Let {Xn}_{n≥1}, X be random variables defined on a probability space (Ω, F, P) with Xn →d X. Then there exist random variables {Yn}_{n≥1}, Y defined on some probability space (Ω′, G, Q) such that:
(i) Y =d X and Yn =d Xn, ∀n ≥ 1,
(ii) Yn →a.s. Y.

Exercise
Prove part (b) of the continuous mapping theorem.

Recalling two basic Theorems

For multivariate random variables, "→a.s." and "→p" are defined coordinatewise.

Theorem (Strong Law of Large Numbers)
Let {Xn} be pairwise iid random variables with EXk = µ and E|Xk| < ∞, for all k ≥ 1. Then,
(1/n) Σ_{k=1}^n Xk →a.s. µ.
- "Strong" is as opposed to the "weak" law, which requires EXk² < ∞ instead of E|Xk| < ∞ and gives "→p" instead of "→a.s.".

Theorem (Central Limit Theorem)
Let {Xn} be an iid sequence of random vectors in R^d with mean µ and covariance Σ, and define X̄n := Σ_{m=1}^n Xm / n. Then,
√n Σ^{−1/2} (X̄n − µ) →d Z ~ N_d(0, I_d).

Convergence Rates
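The (univariate) CLT above can be checked by simulation: a standardised sum of U[0,1] variables should land in [−1.96, 1.96] about 95% of the time. A sketch (the summand distribution, seed, and the 95% interval check are our own choices):

```python
import random

rng = random.Random(3)
mu, sigma = 0.5, (1 / 12) ** 0.5     # mean and sd of U[0,1]
n, reps = 1000, 4000

inside = 0
for _ in range(reps):
    s = sum(rng.random() for _ in range(n))
    z = (s - n * mu) / (sigma * n**0.5)   # approximately N(0,1) by the CLT
    inside += (-1.96 <= z <= 1.96)

cover = inside / reps
print(cover)  # close to the standard normal probability 0.95
```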

- Often convergence is not enough → how fast? [quality of approximation]
- Law of Large Numbers: assuming finite variance, L² rate of n^{−1/2}.
- What about the Central Limit Theorem?

Theorem (Berry-Esseen)
Let X1, ..., Xn be iid random vectors taking values in R^d and such that E[Xi] = 0, cov[Xi] = I_d. Define
Sn = (X1 + ... + Xn)/√n.
If A denotes the class of convex subsets of R^d, then for Z ~ N_d(0, I_d),
sup_{A ∈ A} |P[Sn ∈ A] − P[Z ∈ A]| ≤ C d^{1/4} E‖Xi‖³ / √n.
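On R the convex sets are intervals, so the n^{−1/2} rate can be glimpsed through the Kolmogorov distance between a standardised binomial sum and the normal. A deterministic sketch (the choice Binomial(n, 0.3) and the sample sizes are ours):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def kolmogorov_gap(n, p=0.3):
    """max_k |P[Binomial(n,p) <= k] - Phi((k - np)/sqrt(np(1-p)))|."""
    mu, sd = n * p, math.sqrt(n * p * (1 - p))
    cdf, gap = 0.0, 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * p**k * (1 - p)**(n - k)
        gap = max(gap, abs(cdf - phi((k - mu) / sd)))
    return gap

gaps = [kolmogorov_gap(n) for n in (25, 100, 400)]
print(gaps)  # decays roughly like 1/sqrt(n)
```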

Principles of Data Reduction

Statistical Theory
Victor Panaretos
École Polytechnique Fédérale de Lausanne

Outline:
1 Statistics
2 Ancillarity
3 Sufficiency
  - Sufficient Statistics
  - Establishing Sufficiency
4 Minimal Sufficiency
  - Establishing Minimal Sufficiency
5 Completeness
  - Relationship between Ancillary and Sufficient Statistics
  - Relationship between completeness and minimal sufficiency

Statistical Theory (Week 3) Data Reduction 1 / 19

Statistical Models and The Problem of Inference

Recall our setup:
- Collection of r.v.'s (a random vector) X = (X1, ..., Xn)
- X ~ F_θ ∈ F
- F a parametric class with parameter θ ∈ Θ ⊆ R^d

The Problem of Inference:
1 Assume that F_θ is known up to the parameter θ, which is unknown.
2 Let (x1, ..., xn) be a realization of X ~ F_θ which is available to us.
3 Estimate the value of θ that generated the sample, given (x1, ..., xn).

The only guide (apart from knowledge of F) at hand is the data:
- Anything we "do" will be a function of the data, g(x1, ..., xn).
- Need to study the properties of such functions and the information loss incurred (any function of (x1, ..., xn) will carry at most the same information as the data, but usually less).

Statistics

Definition (Statistic)
Let X be a random sample from F_θ. A statistic T is a (measurable) function that maps X into R^d and does not depend on θ.
- Intuitively, any function of the sample alone is a statistic.
- Any statistic is itself a r.v. with its own distribution.

Example
t(X) = n^{−1} Σ_{i=1}^n Xi is a statistic (since n, the sample size, is known).

Example
T(X) = (X_{(1)}, ..., X_{(n)}), where X_{(1)} ≤ X_{(2)} ≤ ... ≤ X_{(n)} are the order statistics of X. Since T depends only on the values of X, T is a statistic.

Example
Let T(X) = c, where c is a known constant. Then T is a statistic.

Statistics and Information about θ

Evident from the previous examples:
- Some statistics are more informative and others less informative regarding the true value of θ.
- Any T(X) that is not "1-1" carries less information about θ than X.
- Which are "good" and which are "bad" statistics?

Definition (Ancillary Statistic)
A statistic T is an ancillary statistic (for θ) if its distribution does not functionally depend on θ.
- So an ancillary statistic has the same distribution ∀ θ ∈ Θ.

Example
Suppose that X1, ..., Xn ~iid N(µ, σ²) (where µ is unknown but σ² is known). Let T(X1, ..., Xn) = X1 − X2; then T has a normal distribution with mean 0 and variance 2σ². Thus T is ancillary for the unknown parameter µ. If both µ and σ² were unknown, T would not be ancillary for θ = (µ, σ²).

- If T is ancillary for θ, then T contains no information about θ.
- In order to contain any useful information about θ, the distribution of T must depend explicitly on θ.
- Intuitively, the amount of information T gives on θ increases as the dependence of dist(T) on θ increases.

Example
Let X1, ..., Xn ~iid U[0, θ], S = min(X1, ..., Xn) and T = max(X1, ..., Xn). Then
f_S(x; θ) = (n/θ)(1 − x/θ)^{n−1}, 0 ≤ x ≤ θ,
f_T(x; θ) = (n/θ)(x/θ)^{n−1}, 0 ≤ x ≤ θ.
- Neither S nor T is ancillary for θ.
- As n ↑ ∞, f_S becomes concentrated around 0.
- As n ↑ ∞, f_T becomes concentrated around θ.
- This indicates that T provides more information about θ than does S.
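The concentration of S near 0 and of T near θ can be read off the exact distribution functions P[T ≤ x] = (x/θ)^n and P[S > x] = (1 − x/θ)^n. A deterministic sketch tracking the medians (θ = 1 and the sample sizes are our own choices):

```python
theta = 1.0

meds = {}
for n in (5, 50, 500):
    # Medians solved from P[T <= x] = (x/theta)^n and P[S > x] = (1 - x/theta)^n:
    med_T = theta * 0.5 ** (1 / n)         # median of the maximum -> theta
    med_S = theta * (1 - 0.5 ** (1 / n))   # median of the minimum -> 0
    meds[n] = (med_S, med_T)
    print(n, round(med_S, 4), round(med_T, 4))
```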

Let X = (X1, ..., Xn) ~iid F_θ and let T(X) be a statistic. The fibres (or level sets, or contours) of T are the sets
At = {x ∈ R^n : T(x) = t}.
- T is constant when restricted to a fibre.
- Any realization of X that falls in a given fibre is equivalent as far as T is concerned.

Look at the distribution of X on a fibre At: f_{X|T=t}(x).
- Suppose f_{X|T=t} is independent of θ.
- Then X contains no information about θ on the set At.
- In other words, X is ancillary for θ on At.
- If this is true for each t ∈ Range(T), then T(X) contains the same information about θ as X does.
- It does not matter whether we observe X = (X1, ..., Xn) or just T(X).
- Knowing the exact value of X in addition to knowing T(X) does not give us any additional information - X is irrelevant if we already know T(X).

Definition (Sufficient Statistic)
A statistic T = T(X) is said to be sufficient for the parameter θ if for all (Borel) sets B the probability P[X ∈ B | T(X) = t] does not depend on θ.

Sufficient Statistics

Example (Bernoulli Trials)
Let X1, ..., Xn ~iid Bernoulli(θ) and T(X) = Σ_{i=1}^n Xi. Given x ∈ {0, 1}^n,
P[X = x | T = t] = P[X = x, T = t]/P[T = t] = (P[X = x]/P[T = t]) 1{Σ_{i=1}^n xi = t}
= [θ^{Σ xi}(1 − θ)^{n − Σ xi} / ((n choose t) θ^t (1 − θ)^{n−t})] 1{Σ_{i=1}^n xi = t}
= 1 / (n choose t).
→ T is sufficient for θ: given the number of tosses that came up heads, knowing exactly which tosses came up heads is irrelevant in deciding whether the coin is fair:
0 0 1 1 1 0 1 VS 1 0 0 0 1 1 1 VS 1 0 1 0 1 0 1

- The definition is hard to verify (especially for continuous variables).
- The definition does not allow easy identification of sufficient statistics.

Theorem (Fisher-Neyman Factorization Theorem)
Suppose that X = (X1, ..., Xn) has a joint density or frequency function f(x; θ), θ ∈ Θ. A statistic T = T(X) is sufficient for θ if and only if
f(x; θ) = g(T(x), θ) h(x).

Example
Let X1, ..., Xn ~iid U[0, θ] with pdf f(x; θ) = 1{x ∈ [0, θ]}/θ. Then,
f_X(x) = (1/θ^n) 1{x ∈ [0, θ]^n} = (1/θ^n) 1{max[x1, ..., xn] ≤ θ} 1{min[x1, ..., xn] ≥ 0}.
Therefore T(X) = X_{(n)} = max[X1, ..., Xn] is sufficient for θ.
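The Bernoulli computation above can be verified exhaustively for a small n: the conditional law of X given T is uniform on the fibre and free of θ. An illustrative enumeration (the choice n = 5 and the two θ values are ours):

```python
from itertools import product
from math import comb

def cond_prob(x, theta):
    """P[X = x | T = sum(x)] under iid Bernoulli(theta)."""
    n, t = len(x), sum(x)
    px = theta**t * (1 - theta)**(n - t)                 # P[X = x]
    pt = comb(n, t) * theta**t * (1 - theta)**(n - t)    # P[T = t]
    return px / pt

n = 5
for x in product((0, 1), repeat=n):
    p1, p2 = cond_prob(x, 0.2), cond_prob(x, 0.9)
    assert abs(p1 - p2) < 1e-12                    # free of theta
    assert abs(p1 - 1 / comb(n, sum(x))) < 1e-12   # uniform on the fibre
print("conditional distribution given T does not depend on theta")
```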

Proof of Neyman-Fisher Theorem - Discrete Case.
Suppose first that T is sufficient. Then
f(x; θ) = P[X = x] = Σ_t P[X = x, T = t] = P[X = x, T = T(x)] = P[T = T(x)] P[X = x | T = T(x)].
Since T is sufficient, P[X = x | T = T(x)] is independent of θ, and so f(x; θ) = g(T(x); θ) h(x).
Now suppose that f(x; θ) = g(T(x); θ) h(x). Then, if T(x) = t,
P[X = x | T = t] = P[X = x, T = t]/P[T = t] = (P[X = x]/P[T = t]) 1{T(x) = t}
= g(T(x); θ) h(x) 1{T(x) = t} / Σ_{y: T(y) = t} g(T(y); θ) h(y)
= h(x) 1{T(x) = t} / Σ_{y: T(y) = t} h(y),
which does not depend on θ. ∎

Minimally Sufficient Statistics

- We saw that a sufficient statistic keeps what is important and leaves out irrelevant information.
- How much information can we throw away? Is there a "necessary" statistic?

Definition (Minimally Sufficient Statistic)
A statistic T = T(X) is said to be minimally sufficient for the parameter θ if for any sufficient statistic S = S(X) there exists a function g(·) with T(X) = g(S(X)).

Lemma
If T and S are minimally sufficient statistics for a parameter θ, then there exist injective functions g and h such that S = g(T) and T = h(S).

Theorem
Let X = (X1, ..., Xn) have joint density or frequency function f(x; θ) and let T = T(X) be a statistic. Suppose that f(x; θ)/f(y; θ) is independent of θ if and only if T(x) = T(y). Then T is minimally sufficient for θ.

Proof.
Assume for simplicity that f(x; θ) > 0 for all x ∈ R^n and θ ∈ Θ.
[sufficiency part] Let T = {T(y) : y ∈ R^n} be the image of R^n under T and let {At} be the level sets of T. For each t, choose an element yt ∈ At. Notice that for any x, y_{T(x)} is in the same level set as x, so that
f(x; θ)/f(y_{T(x)}; θ)
does not depend on θ by assumption. Let g(t; θ) := f(yt; θ) and notice that
f(x; θ) = [f(x; θ)/f(y_{T(x)}; θ)] · f(y_{T(x)}; θ) = h(x) g(T(x), θ),
and the claim follows from the factorization theorem.
[minimality part] Suppose that T′ is another sufficient statistic. By the factorization theorem, ∃ g′, h′ such that f(x; θ) = g′(T′(x); θ) h′(x). Let x, y be such that T′(x) = T′(y). Then
f(x; θ)/f(y; θ) = g′(T′(x); θ) h′(x) / (g′(T′(y); θ) h′(y)) = h′(x)/h′(y).
Since the ratio does not depend on θ, we have by assumption that T(x) = T(y). Hence T is a function of T′, and so T is minimal by the arbitrary choice of T′. ∎

Example (Bernoulli Trials)
Let X1, ..., Xn ~iid Bernoulli(θ), and let x, y ∈ {0, 1}^n be two possible outcomes. Then
f(x; θ)/f(y; θ) = θ^{Σ xi}(1 − θ)^{n − Σ xi} / (θ^{Σ yi}(1 − θ)^{n − Σ yi}),
which is constant in θ if and only if T(x) = Σ xi = Σ yi = T(y), so that T is minimally sufficient.
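The ratio criterion in the Bernoulli example can be checked directly: the likelihood ratio is constant in θ exactly when the sums agree. A small sketch (the particular outcome vectors and the θ-grid are our own choices):

```python
def ratio(x, y, theta):
    """f(x; theta) / f(y; theta) for the iid Bernoulli(theta) model."""
    n = len(x)
    f = lambda z: theta**sum(z) * (1 - theta)**(n - sum(z))
    return f(x) / f(y)

thetas = (0.1, 0.3, 0.7)
x, y, z = (1, 1, 0, 0), (0, 1, 0, 1), (1, 1, 1, 0)

same_sum = [ratio(x, y, t) for t in thetas]  # T(x) = T(y) = 2: constant in theta
diff_sum = [ratio(x, z, t) for t in thetas]  # T(x) = 2, T(z) = 3: varies with theta
print(same_sum, diff_sum)
```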

Complete Statistics

- Ancillary Statistic → contains no information on θ.
- Minimally Sufficient Statistic → contains all relevant information, and as little irrelevant information as possible.
- Should they be mutually independent?

Definition (Complete Statistic)
Let {g(t; θ) : θ ∈ Θ} be a family of densities or frequency functions corresponding to a statistic T(X). The statistic T is called complete if
∫ h(t) g(t; θ) dt = 0 ∀ θ ∈ Θ ⇒ P[h(T) = 0] = 1 ∀ θ ∈ Θ.

Example
If θ̂ = T(X) is an unbiased estimator of θ (i.e. Eθ̂ = θ) which can be written as a function of a complete sufficient statistic, then it is the unique such estimator.

Theorem (Basu's Theorem)
A complete sufficient statistic is independent of every ancillary statistic.

Proof.
We consider the discrete case only. It suffices to show that
P[S(X) = s | T(X) = t] = P[S(X) = s].
Define h(t) = P[S(X) = s | T(X) = t] − P[S(X) = s] and observe that:
1 P[S(X) = s] does not depend on θ (ancillarity)
2 P[S(X) = s | T(X) = t] = P[X ∈ {x : S(x) = s} | T = t] does not depend on θ (sufficiency)
and so h does not depend on θ.

(proof cont'd). Therefore, for any θ ∈ Θ,
Eh(T) = Σ_t (P[S(X) = s | T(X) = t] − P[S(X) = s]) P[T(X) = t]
      = Σ_t P[S(X) = s | T(X) = t] P[T(X) = t] − P[S(X) = s] Σ_t P[T(X) = t]
      = P[S(X) = s] − P[S(X) = s] = 0.
But T is complete, so it follows that h(t) = 0 for all t. QED.

Basu's Theorem is useful for deducing the independence of two statistics:
- No need to determine their joint distribution.
- It requires showing completeness (usually a hard analytical problem).
- We will see models in which completeness is easy to check.

Completeness and Minimal Sufficiency

Theorem (Lehmann-Scheffé)
Let X have density f(x; θ). If T(X) is sufficient and complete for θ, then T is minimally sufficient.

Proof.
First of all, we show that a minimally sufficient statistic exists. Define an equivalence relation by x ≡ x′ if and only if f(x; θ)/f(x′; θ) is independent of θ. If S is any function such that S = c on these equivalence classes, then S is minimally sufficient, establishing existence (rigorous proof by Lehmann-Scheffé (1950)). Therefore, it must be the case that S = g1(T), for some g1. Let g2(S) = E[T | S] (this does not depend on θ since S is sufficient). Consider:
g(T) = T − g2(S).
Write E[g(T)] = E[T] − E{E[T | S]} = ET − ET = 0 for all θ.
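A classic application of Basu's theorem: for an iid N(µ, 1) sample, X̄ is complete sufficient for µ while the sample variance S² (a function of the residuals Xi − X̄) is ancillary, so the two are independent. A simulation sketch checking that their empirical correlation is near zero (sample size, seed, and replication count are our own choices):

```python
import random, statistics

rng = random.Random(11)
n, reps = 20, 4000

means, vars_ = [], []
for _ in range(reps):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(xs) / n
    means.append(m)
    vars_.append(sum((v - m)**2 for v in xs) / (n - 1))   # sample variance S^2

mx, mv = sum(means) / reps, sum(vars_) / reps
cov = sum((a - mx) * (b - mv) for a, b in zip(means, vars_)) / reps
corr = cov / (statistics.pstdev(means) * statistics.pstdev(vars_))
print(round(corr, 3))  # near 0, consistent with independence of Xbar and S^2
```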

(proof cont'd).

By completeness of T, it follows that g₂(S) = T a.s. In fact, g₂ has to be injective, for otherwise we would contradict the minimal sufficiency of S. But then T is a 1-1 function of S and S is a 1-1 function of T. Invoking our previous lemma proves that T is minimally sufficient.

One can also prove:

Theorem
If a minimal sufficient statistic exists, then any complete sufficient statistic is also a minimal sufficient statistic.

Special Families of Models

Statistical Theory
Victor Panaretos
Ecole Polytechnique Fédérale de Lausanne

1 Focus on Parametric Families
2 The Exponential Family of Distributions
3 Transformation Families

Focus on Parametric Families

Recall our setup:
- Collection of r.v.'s (a random vector) X = (X₁, ..., Xₙ)
- X ∼ F_θ ∈ F
- F a parametric class with parameter θ ∈ Θ ⊆ ℝ^d

The Problem of Point Estimation
1. Assume that F_θ is known up to the parameter θ, which is unknown.
2. Let (x₁, ..., xₙ) be a realization of X ∼ F_θ which is available to us.
3. Estimate the value of θ that generated the sample, given (x₁, ..., xₙ).

The only guide (apart from knowledge of F) at hand is the data:
→ anything we "do" will be a function of the data g(x₁, ..., xₙ).

We describe F by a parametrization Θ ∋ θ ↦ F_θ:

Definition (Parametrization)
Let Θ be a set, F be a family of distributions and g : Θ → F an onto mapping. The pair (Θ, g) is called a parametrization of F; it assigns a label θ ∈ Θ to each member of F.

Definition (Parametric Model)
A parametric model with parameter space Θ ⊆ ℝ^d is a family of probability models F parametrized by Θ, F = {F_θ : θ ∈ Θ}.

So far we have seen a number of examples of distributions and have worked out certain properties individually; we have concentrated on aspects of the data: approximate distributions, data reduction...

Question: But what about F? Are there more general families that contain the standard ones as special cases, and for which a general and abstract study can be pursued?

The Exponential Family of Distributions

Definition (Exponential Family)
Let X = (X₁, ..., Xₙ) have joint distribution F_θ with parameter θ ∈ ℝ^p. We say that the family of distributions {F_θ} is a k-parameter exponential family if the joint density or joint frequency function of (X₁, ..., Xₙ) admits the form
f(x; θ) = exp[ Σ_{i=1}^k c_i(θ)T_i(x) − d(θ) + S(x) ], x ∈ X, θ ∈ Θ,
with supp{f(·; θ)} = X independent of θ.

- k need not be equal to p, although they often coincide.
- The value of k may be reduced if the c_i or T_i satisfy linear constraints; we will assume that the representation above is minimal.
- Can re-parametrize via φ_i = c_i(θ), the natural parameters.

Motivation: Maximum Entropy Under Constraints

Consider the following variational problem: determine the probability distribution f supported on X with maximum entropy
H(f) = −∫_X f(x) log f(x) dx
subject to the linear constraints
∫_X T_i(x) f(x) dx = α_i, i = 1, ..., k.

Philosophy: how to choose a probability model for a given situation? The maximum entropy approach: in any given situation, choose the distribution that gives the highest uncertainty while satisfying the situation-specific required constraints.

Proposition.
The unique solution of the constrained optimisation problem has the form
f(x) = Q(λ₁, ..., λ_k) exp[ Σ_{i=1}^k λ_i T_i(x) ].

Proof.
Let g(x) be a density also satisfying the constraints. Then
H(g) = −∫_X g(x) log g(x) dx = −∫_X g(x) log( (g(x)/f(x)) f(x) ) dx
     = −KL(g‖f) − ∫_X g(x) log f(x) dx
     ≤ −log Q ∫_X g(x) dx − ∫_X g(x) ( Σ_{i=1}^k λ_i T_i(x) ) dx,
where ∫_X g(x) dx = 1. But g also satisfies the moment constraints, so the last expression equals
−log Q − ∫_X f(x) ( Σ_{i=1}^k λ_i T_i(x) ) dx = −∫_X f(x) log f(x) dx = H(f).
Uniqueness of the solution follows from the fact that equality can only hold when KL(g‖f) = 0, which happens if and only if g = f.

- The λ's are the Lagrange multipliers arising from the Lagrange form of the optimisation problem; they are determined so that the constraints are satisfied.
- They give us the c_i(θ) in our definition of exponential families.
- Note that the presence of S(x) in our definition is compatible: S(x) = c_{k+1}T_{k+1}(x), where c_{k+1} does not depend on θ (provision for a multiplier that may not depend on the parameter).

Example (Binomial Distribution)
Let X ∼ Binomial(n, θ) with n known. Then
f(x; θ) = C(n, x) θ^x (1 − θ)^{n−x} = exp[ x log(θ/(1−θ)) + n log(1−θ) + log C(n, x) ].

Example (Gamma Distribution)
Let X₁, ..., Xₙ be iid Gamma with unknown shape parameter α and unknown scale parameter λ. Then
f_X(x; α, λ) = ∏_{i=1}^n λ^α x_i^{α−1} exp(−λx_i) / Γ(α)
             = exp[ (α−1) Σ_{i=1}^n log x_i − λ Σ_{i=1}^n x_i + nα log λ − n log Γ(α) ].

Example (Heteroskedastic Gaussian Distribution)
Let X₁, ..., Xₙ iid N(θ, θ²). Then
f_X(x; θ) = ∏_{i=1}^n (θ√(2π))⁻¹ exp( −(x_i − θ)²/(2θ²) )
          = exp[ −(1/(2θ²)) Σ_{i=1}^n x_i² + (1/θ) Σ_{i=1}^n x_i − (n/2)(1 + 2 log θ) − (n/2) log(2π) ].
Notice that even though k = 2 here, the dimension of the parameter space is 1. This is an example of a curved exponential family.

Example (Uniform Distribution)
Let X ∼ U[0, θ]. Then f_X(x; θ) = 1{x ∈ [0, θ]}/θ. Since the support of f_X depends on θ, we do not have an exponential family.
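The algebra in the binomial example can be double-checked numerically. This sketch (illustrative only) compares the standard Binomial(n, θ) pmf with its exponential-family form, using c(θ) = log(θ/(1−θ)), d(θ) = −n log(1−θ) and S(x) = log C(n, x):

```python
import math

# Check (illustrative) that the Binomial(n, theta) pmf equals its
# exponential-family representation
#   f(x; theta) = exp{ c(theta)*x - d(theta) + S(x) }.

def binom_pmf(x, n, theta):
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def expfam_pmf(x, n, theta):
    c = math.log(theta / (1 - theta))   # natural parameter c(theta)
    d = -n * math.log(1 - theta)        # d(theta)
    s = math.log(math.comb(n, x))       # S(x), free of theta
    return math.exp(c * x - d + s)

n, theta = 10, 0.3
ok = all(abs(binom_pmf(x, n, theta) - expfam_pmf(x, n, theta)) < 1e-12
         for x in range(n + 1))
print(ok)
```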

The Exponential Family of Distributions

Proposition
Suppose that X = (X₁, ..., Xₙ) has a one-parameter exponential family distribution with density or frequency function
f(x; θ) = exp[ c(θ)T(x) − d(θ) + S(x) ]
for x ∈ X, where
(a) the parameter space Θ is open,
(b) c(θ) is a one-to-one function on Θ,
(c) c(θ), c⁻¹(θ), d(θ) are twice differentiable functions on Θ.
Then
E T(X) = d′(θ)/c′(θ)  and  Var[T(X)] = [d″(θ)c′(θ) − d′(θ)c″(θ)] / [c′(θ)]³.

Proof.
Define φ = c(θ), the natural parameter of the exponential family. Let d₀(φ) = d(c⁻¹(φ)), where c⁻¹ is well-defined since c is 1-1. Since c is a homeomorphism, Φ = c(Θ) is open. Choose s sufficiently small so that φ + s ∈ Φ, and observe that the m.g.f. of T is
E exp[sT(X)] = ∫ e^{sT(x)} e^{φT(x) − d₀(φ) + S(x)} dx
             = e^{d₀(φ+s) − d₀(φ)} ∫ e^{(φ+s)T(x) − d₀(φ+s) + S(x)} dx
             = exp[ d₀(φ+s) − d₀(φ) ],
since the last integral equals 1. By our assumptions we may differentiate w.r.t. s and, setting s = 0, we get E[T(X)] = d₀′(φ) and Var[T(X)] = d₀″(φ). But
d₀′(φ) = d′(θ)/c′(θ)  and  d₀″(φ) = [d″(θ)c′(θ) − d′(θ)c″(θ)] / [c′(θ)]³,
and so the conclusion follows.

Exponential Families and Sufficiency

Exercise
Extend the result to the means and covariances of the random variables T₁(X), ..., T_k(X) in a k-parameter exponential family.

Lemma
Suppose that X = (X₁, ..., Xₙ) has a k-parameter exponential family distribution with density or frequency function
f(x; θ) = exp[ Σ_{i=1}^k c_i(θ)T_i(x) − d(θ) + S(x) ]
for x ∈ X. Then the statistic (T₁(X), ..., T_k(X)) is sufficient for θ.

Proof.
Set g(T(x); θ) = exp{ Σ_i c_i(θ)T_i(x) − d(θ) } and h(x) = e^{S(x)} 1{x ∈ X}, and apply the factorization theorem.

Exponential Families and Completeness

Theorem
Suppose that X = (X₁, ..., Xₙ) has a k-parameter exponential family distribution with density or frequency function
f(x; θ) = exp[ Σ_{i=1}^k c_i(θ)T_i(x) − d(θ) + S(x) ]
for x ∈ X. Define C = {(c₁(θ), ..., c_k(θ)) : θ ∈ Θ}. If the set C contains an open set (rectangle) of the form (a₁, b₁) × ... × (a_k, b_k), then the statistic (T₁(X), ..., T_k(X)) is complete for θ, and so minimally sufficient.

- The result is essentially a consequence of the uniqueness of characteristic functions.
- Intuitively, the result says that a k-dimensional sufficient statistic in a k-parameter exponential family will also be complete, provided that the effective dimension of the natural parameter space is k.

Sampling Exponential Families

The families of distributions obtained by sampling from exponential families are themselves exponential families. Let X₁, ..., Xₙ be iid, distributed according to a k-parameter exponential family, and consider the density (or frequency function) of X = (X₁, ..., Xₙ):
f(x; θ) = ∏_{j=1}^n exp[ Σ_{i=1}^k c_i(θ)T_i(x_j) − d(θ) + S(x_j) ]
        = exp[ Σ_{i=1}^k c_i(θ)τ_i(x) − n d(θ) + Σ_{j=1}^n S(x_j) ]
for τ_i(X) = Σ_{j=1}^n T_i(X_j), the natural statistics, i = 1, ..., k.

Note that the natural sufficient statistic is k-dimensional for all n. What about the distribution of τ = (τ₁(X), ..., τ_k(X))?

The Natural Statistics

Lemma
The joint distribution of τ = (τ₁(X), ..., τ_k(X)) is of exponential family form, with natural parameters c₁(θ), ..., c_k(θ).

Proof (discrete case).
Let T_y = {x : τ₁(x) = y₁, ..., τ_k(x) = y_k} be the level set of y ∈ ℝ^k. Then
P[τ(X) = y] = Σ_{x ∈ T_y} P[X = x] = Σ_{x ∈ T_y} δ(θ) exp[ Σ_{i=1}^k c_i(θ)τ_i(x) + Σ_{j=1}^n S(x_j) ]
            = δ(θ) S(y) exp[ Σ_{i=1}^k c_i(θ)y_i ],
where S(y) := Σ_{x ∈ T_y} exp( Σ_{j=1}^n S(x_j) ).

The Natural Statistics and Sufficiency

Lemma
For any A ⊆ {1, ..., k}, the joint distribution of {τ_i(X); i ∈ A} conditional on {τ_i(X); i ∈ A^c} is of exponential family form, and depends only on {c_i(θ); i ∈ A}.

Proof (discrete case).
Let T_i = τ_i(X). We have P[T = y] = δ(θ)S(y) exp[ Σ_{i=1}^k c_i(θ)y_i ], so that
P[T_A = y_A | T_{A^c} = y_{A^c}] = P[T_A = y_A, T_{A^c} = y_{A^c}] / Σ_w P[T_A = w, T_{A^c} = y_{A^c}]
= δ(θ) S((y_A, y_{A^c})) exp( Σ_{i∈A} c_i(θ)y_i ) exp( Σ_{i∈A^c} c_i(θ)y_i )
  / [ δ(θ) exp( Σ_{i∈A^c} c_i(θ)y_i ) Σ_w S((w, y_{A^c})) exp( Σ_{i∈A} c_i(θ)w_i ) ]
= ∆({c_i(θ) : i ∈ A}) h(y_A) exp( Σ_{i∈A} c_i(θ)y_i ).

Look at the previous results through the prism of the canonical parametrisation:
- We already know that τ is sufficient for φ = c(θ).
- But the result tells us something even stronger: each τ_i is sufficient for φ_i = c_i(θ).
- In fact any τ_A is sufficient for φ_A, ∀A ⊆ {1, ..., k}.
- Therefore, each natural statistic contains the relevant information for each natural parameter: a useful result that is by no means true for any distribution.

Groups Acting on the Sample Space

Basic Idea
Often we can generate a family of distributions of the same form (but with different parameters) by letting a group act on our data space X.

Recall: a group is a set G along with a binary operator ∘ such that:
1. g, g′ ∈ G ⟹ g ∘ g′ ∈ G;
2. (g ∘ g′) ∘ g″ = g ∘ (g′ ∘ g″), ∀g, g′, g″ ∈ G;
3. ∃e ∈ G : e ∘ g = g ∘ e = g, ∀g ∈ G;
4. ∀g ∈ G, ∃g⁻¹ ∈ G : g ∘ g⁻¹ = g⁻¹ ∘ g = e.

Often groups are sets of transformations, and the binary operator is the composition operator (e.g. SO(2), the group of rotations of ℝ²):

( cos φ  −sin φ ) ( cos ψ  −sin ψ )   ( cos(φ+ψ)  −sin(φ+ψ) )
( sin φ   cos φ ) ( sin ψ   cos ψ ) = ( sin(φ+ψ)   cos(φ+ψ) )

We have a group of transformations G, with G ∋ g : X → X,
gX := g(X)  and  (g₂ ∘ g₁)X := g₂(g₁(X)).
Obviously dist(gX) changes as g ranges over G. Is this change completely arbitrary, or are there situations where it has a simple structure?

Definition (Transformation Family)
Let G be a group of transformations acting on X and let {f_θ(x); θ ∈ Θ} be a parametric family of densities on X. If there exists a bijection h : G → Θ, then the family {f_θ}_{θ∈Θ} will be called a (group) transformation family.

Hence Θ admits a group structure G̅ := (Θ, ∗) via
θ₁ ∗ θ₂ := h( h⁻¹(θ₁) ∘ h⁻¹(θ₂) ).
Usually we write g_θ = h⁻¹(θ), so that g_θ ∘ g_{θ′} = g_{θ∗θ′}.
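The SO(2) composition rule quoted above is easy to verify numerically; a minimal sketch:

```python
import math

# Illustrative check of the SO(2) composition rule R(phi) R(psi) = R(phi + psi),
# where R(a) is the 2x2 rotation matrix by angle a.

def rot(a):
    return [[math.cos(a), -math.sin(a)],
            [math.sin(a),  math.cos(a)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

phi, psi = 0.7, 1.9
lhs, rhs = matmul(rot(phi), rot(psi)), rot(phi + psi)
ok = all(abs(lhs[i][j] - rhs[i][j]) < 1e-12 for i in range(2) for j in range(2))
print(ok)
```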

Invariance and Equivariance

Define an equivalence relation on X via G:
x ≡_G x′ ⟺ ∃g ∈ G : x′ = g(x).
This partitions X into equivalence classes, called the orbits of X under G.

Definition (Invariant Statistic)
A statistic T that is constant on the orbits of X under G is called an invariant statistic. That is, T is invariant with respect to G if, for any x ∈ X, we have T(x) = T(gx) ∀g ∈ G.

Notice that it may be that T(x) = T(y) while x, y are not in the same orbit; i.e., in general the orbits under G are subsets of the level sets of an invariant statistic T. When orbits and level sets coincide, we have:

Definition (Maximal Invariant)
A statistic T will be called a maximal invariant for G when
T(x) = T(y) ⟺ x ≡_G y.

Intuitively, a maximal invariant is a reduced version of the data that represents it as closely as possible, under the requirement of remaining invariant with respect to G.

If T is an invariant statistic with respect to the group defining a transformation family, then it is ancillary.

Definition (Equivariance)
A statistic S : X → Θ will be called equivariant for a transformation family if
S(g_θ x) = θ ∗ S(x), ∀g_θ ∈ G, x ∈ X.

Equivariance may be a natural property to require if S is used as an estimator of the true parameter θ ∈ Θ, as it suggests that a transformation of a sample by g_ψ would yield an estimator that is the original one transformed by ψ.

Lemma (Constructing Maximal Invariants)
Let S : X → Θ be an equivariant statistic for a transformation family with parameter space Θ and transformation group G. Then T(X) = g⁻¹_{S(X)} X defines a maximal invariant statistic.

Proof.
By definition and then equivariance,
T(g_θ x) = ( g⁻¹_{S(g_θ x)} ∘ g_θ ) x = ( g⁻¹_{θ∗S(x)} ∘ g_θ ) x = [ ( g⁻¹_{S(x)} ∘ g⁻¹_θ ) ∘ g_θ ] x = T(x),
so that T is invariant. To show maximality, notice that
T(x) = T(y) ⟹ g⁻¹_{S(x)} x = g⁻¹_{S(y)} y ⟹ y = ( g_{S(y)} ∘ g⁻¹_{S(x)} ) x,
and g_{S(y)} ∘ g⁻¹_{S(x)} ∈ G, so that ∃g ∈ G with y = gx, which completes the proof.

Location-Scale Families

An important transformation family is the location-scale model:
- Let X = η + τε, with ε ∼ f₁ completely known.
- The parameter is θ = (η, τ) ∈ Θ = ℝ × ℝ₊.
- Define a set of transformations on X by g_θ x = g_{(η,τ)} x = η + τx, so that:
  g_{(η,τ)} ∘ g_{(μ,σ)} x = η + τμ + τσx = g_{(η+τμ, τσ)} x (the set of transformations is closed under composition);
  g_{(0,1)} ∘ g_{(η,τ)} = g_{(η,τ)} ∘ g_{(0,1)} = g_{(η,τ)} (∃ identity);
  g_{(−η/τ, τ⁻¹)} ∘ g_{(η,τ)} = g_{(η,τ)} ∘ g_{(−η/τ, τ⁻¹)} = g_{(0,1)} (∃ inverse).
- Hence G = {g_θ : θ ∈ ℝ × ℝ₊} is a group under ∘.
- The action of G on a random sample X = {X_i}_{i=1}^n is g_{(η,τ)}X = η1ₙ + τX.
- The induced group action on Θ is (η, τ) ∗ (μ, σ) = (η + τμ, τσ).

Location-Scale Families (cont'd)

The sample mean and sample variance are equivariant: with S(X) = (X̄, V^{1/2}), where V = (n−1)⁻¹ Σ_j (X_j − X̄)²,
S(g_{(η,τ)}X) = ( η + τX̄, [ (n−1)⁻¹ Σ_j (η + τX_j − η − τX̄)² ]^{1/2} )
             = ( η + τX̄, τV^{1/2} ) = (η, τ) ∗ S(X).

A maximal invariant is given by A = g⁻¹_{S(X)}X, the corresponding parameter being (−X̄/V^{1/2}, V^{−1/2}). Hence the vector of residuals is a maximal invariant:
A = (X − X̄1ₙ)/V^{1/2} = ( (X₁ − X̄)/V^{1/2}, ..., (Xₙ − X̄)/V^{1/2} ).

Transformation Families

Example (The Multivariate Gaussian Distribution)
Let Z ∼ N_d(0, I) and consider X = μ + ΩZ ∼ N(μ, ΩΩᵀ).
- The parameter is (μ, Ω) ∈ ℝ^d × GL(d).
- The set of transformations g_{(μ,Ω)}x = μ + Ωx is closed under ∘.
- g_{(0,I)} ∘ g_{(μ,Ω)} = g_{(μ,Ω)} ∘ g_{(0,I)} = g_{(μ,Ω)} (∃ identity).
- g_{(−Ω⁻¹μ, Ω⁻¹)} ∘ g_{(μ,Ω)} = g_{(μ,Ω)} ∘ g_{(−Ω⁻¹μ, Ω⁻¹)} = g_{(0,I)} (∃ inverse).
- Hence G = {g_θ : θ ∈ ℝ^d × GL(d)} is a group under ∘ (the affine group).
- The action of G on X is g_{(μ,Ω)}X = μ + ΩX.
- The induced group action on Θ is (μ, Ω) ∗ (ν, Ψ) = (ν + Ψμ, ΨΩ).
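Both facts, the equivariance of S(X) = (X̄, V^{1/2}) and the invariance of the residual vector, can be checked in a few lines (an assumed toy example):

```python
import random, statistics

# Illustrative check (assumed example) of equivariance of S(X) = (mean, sd)
# under the location-scale action g_(eta,tau) x = eta + tau*x, and of the
# invariance of the residual vector (X - mean)/sd under that action.

random.seed(1)
x = [random.gauss(0, 1) for _ in range(50)]
eta, tau = 3.0, 2.5
gx = [eta + tau * xi for xi in x]

def S(v):
    return statistics.fmean(v), statistics.stdev(v)

m, s = S(x)
gm, gs = S(gx)
# equivariance: S(g_(eta,tau) X) = (eta + tau*mean, tau*sd)
print(abs(gm - (eta + tau * m)) < 1e-9 and abs(gs - tau * s) < 1e-9)

# invariance of the residuals A = (X - mean)/sd
res_x = [(xi - m) / s for xi in x]
res_gx = [(gi - gm) / gs for gi in gx]
print(max(abs(a - b) for a, b in zip(res_x, res_gx)) < 1e-9)
```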

Basic Principles of Point Estimation

Statistical Theory
Victor Panaretos
Ecole Polytechnique Fédérale de Lausanne

1 The Problem of Point Estimation
2 Bias, Variance and Mean Squared Error
3 The Plug-In Principle
4 The Moment Principle
5 The Likelihood Principle

Point Estimation for Parametric Families

Recall our setup:
- Collection of r.v.'s (a random vector) X = (X₁, ..., Xₙ)
- X ∼ F_θ ∈ F
- F a parametric class with parameter θ ∈ Θ ⊆ ℝ^d

The Problem of Point Estimation
1. Assume that F_θ is known up to the parameter θ, which is unknown.
2. Let (x₁, ..., xₙ) be a realization of X ∼ F_θ which is available to us.
3. Estimate the value of θ that generated the sample, given (x₁, ..., xₙ).

So far we have considered aspects related to point estimation:
- approximate distributions of g(X₁, ..., Xₙ) as n ↑ ∞;
- the information carried by g(X₁, ..., Xₙ) w.r.t. θ;
- general parametric models.

Today: How do we estimate θ in general? Some general recipes?

Point Estimators

Definition (Point Estimator)
Let {F_θ} be a parametric model with parameter space Θ ⊆ ℝ^d, and let X = (X₁, ..., Xₙ) ∼ F_{θ₀} for some θ₀ ∈ Θ. A point estimator θ̂ of θ₀ is a statistic T : ℝⁿ → Θ whose primary purpose is to estimate θ₀.

Therefore any statistic T : ℝⁿ → Θ is a candidate estimator!
→ It is harder to answer what a good estimator is. Any estimator is of course a random variable; hence, as a general principle, "good" should mean: dist(θ̂) concentrated around θ.
→ That is an ∞-dimensional description of quality. Look at some simpler measures of quality?

Concentration around a Parameter

Definition (Bias)
The bias of an estimator θ̂ of θ ∈ Θ is defined to be
bias(θ̂) = E_θ[θ̂] − θ.
It describes how "off" we are from the target on average when employing θ̂.

Definition (Unbiasedness)
An estimator θ̂ of θ ∈ Θ is unbiased if E_θ[θ̂] = θ, i.e. bias(θ̂) = 0.
We will see that not too much weight should be placed on unbiasedness.

Definition (Mean Squared Error)
The mean squared error of an estimator θ̂ of θ ∈ Θ ⊆ ℝ is defined to be
MSE(θ̂) = E_θ[ (θ̂ − θ)² ].

[Figure: sampling densities of two estimators of θ, one more concentrated around θ than the other]

Bias and Mean Squared Error

Bias and MSE combined provide a coarse but simple description of concentration around θ:
- Bias gives us an indication of the location of dist(θ̂) relative to θ (this somehow assumes the mean is a good measure of location).
- MSE gives us a measure of spread/dispersion of dist(θ̂) around θ.
- If θ̂ is unbiased for θ ∈ ℝ then Var(θ̂) = MSE(θ̂).
- For Θ ⊆ ℝ^d we have MSE(θ̂) := E‖θ̂ − θ‖².

Example
Let X₁, ..., Xₙ iid N(μ, σ²) and let μ̂ := X̄. Then
E μ̂ = μ  and  MSE(μ̂) = Var(μ̂) = σ²/n.
In this case bias and MSE give us a complete description of the concentration of dist(μ̂) around μ, since μ̂ is Gaussian and so completely determined by its mean and variance.

The Bias-Variance Decomposition of MSE

E[(θ̂ − θ)²] = E[(θ̂ − Eθ̂ + Eθ̂ − θ)²]
            = E{ (θ̂ − Eθ̂)² + (Eθ̂ − θ)² + 2(θ̂ − Eθ̂)(Eθ̂ − θ) }
            = E(θ̂ − Eθ̂)² + (Eθ̂ − θ)².

Bias-Variance Decomposition for Θ ⊆ ℝ
MSE(θ̂) = Var(θ̂) + bias²(θ̂)

- A simple yet fundamental relationship.
- Requiring a small MSE does not necessarily require unbiasedness.
- Unbiasedness is a sensible property, but sometimes biased estimators perform better than unbiased ones.
- Sometimes we have a bias/variance tradeoff (e.g. nonparametric regression).

Bias–Variance Tradeoff

[Figure]
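The decomposition MSE = Var + bias² can be verified by simulation. The sketch below (an assumed example) uses the biased variance estimator (1/n)Σ(X_i − X̄)², whose bias is −σ²/n, and checks the identity on the Monte Carlo sample:

```python
import random, statistics

# Simulation (assumed toy example) checking MSE = Var + bias^2 for the biased
# variance estimator (1/n) * sum (X_i - Xbar)^2 under N(mu, sigma^2) sampling.

random.seed(2)
mu, sigma, n, reps = 0.0, 2.0, 8, 20000
true_var = sigma**2

ests = []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(xs)
    ests.append(sum((x - xbar) ** 2 for x in xs) / n)  # divides by n: biased

mse = statistics.fmean([(e - true_var) ** 2 for e in ests])
bias = statistics.fmean(ests) - true_var   # theoretical bias: -sigma^2/n = -0.5
var = statistics.pvariance(ests)
print(abs(mse - (var + bias**2)) < 1e-8)   # the identity holds exactly in-sample
```

Note that the in-sample version of the identity is an algebraic fact, so it holds up to floating-point error regardless of the number of replications.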

Consistency

One can also consider the quality of an estimator not for a given sample size, but as the sample size increases.

Definition (Consistency)
A sequence of estimators {θ̂ₙ}_{n≥1} of θ ∈ Θ is said to be consistent if
θ̂ₙ →^P θ.

- A consistent estimator becomes increasingly concentrated around the true value θ as the sample size grows (usually θ̂ₙ is an estimator based on n iid values).
- Often considered a "must have" property, but...
- A more detailed understanding of the "asymptotic quality" of θ̂ₙ requires the study of dist[θ̂ₙ] as n ↑ ∞.

[Figure: consistency illustrated with X₁, ..., Xₙ ∼ N(0, 1); the running mean X̄ₙ plotted for n = 1, 2, ..., 100, settling near 0]

Plug-In Estimators

We want to find general procedures for constructing estimators.
→ An idea: θ ↦ F_θ is a bijection under identifiability.
Recall that, more generally, a parameter is a function ν : F → N. Under identifiability, ν(F_θ) = q(θ) for some q.

The Plug-In Principle
Let ν = q(θ) = ν(F_θ) be a parameter of interest for a parametric model {F_θ}_{θ∈Θ}. If we can construct an estimate F̂_θ of F_θ on the basis of our sample X, then we can use ν(F̂_θ) as an estimator of ν(F_θ). Such an estimator is called a plug-in estimator.

Essentially we are "flipping" our point of view: viewing θ as a function of F_θ instead of F_θ as a function of θ. Note here that θ = θ(F_θ) if q is taken to be the identity. In practice such a principle is useful when we can explicitly describe the mapping F_θ ↦ ν(F_θ).

Parameters as Functionals of F

Examples of "functional parameters":
- The mean: μ(F) := ∫_{−∞}^{+∞} x dF(x)
- The variance: σ²(F) := ∫_{−∞}^{+∞} [x − μ(F)]² dF(x)
- The median: med(F) := inf{x : F(x) ≥ 1/2}
- An indirectly defined parameter θ(F) such that ∫_{−∞}^{+∞} ψ(x − θ(F)) dF(x) = 0
- The density (when it exists) at x₀: θ(F) := (d/dx)F(x)|_{x=x₀}

The Empirical Distribution Function

The plug-in principle converts the problem of estimating θ into the problem of estimating F. But how? Consider the case when X = (X₁, ..., Xₙ) has iid coordinates. We may define the empirical version of the distribution function F(·) as
F̂ₙ(y) = (1/n) Σ_{i=1}^n 1{X_i ≤ y}.

- Places mass 1/n on each observation.
- SLLN ⟹ F̂ₙ(y) →^{a.s.} F(y) ∀y ∈ ℝ, since the 1{X_i ≤ y} are iid Bernoulli(F(y)) random variables.
- Suggests using ν(F̂ₙ) as an estimator of ν(F).

The Empirical Distribution Function (cont'd)

[Figure: empirical distribution functions F̂ₙ for samples of size n = 10, 50, 100, 500, plotted against the true cdf]

It seems that we're actually doing better than just pointwise convergence...

Theorem (Glivenko-Cantelli)
Let X₁, ..., Xₙ be independent random variables, distributed according to F. Then F̂ₙ(y) = n⁻¹ Σ_{i=1}^n 1{X_i ≤ y} converges uniformly to F with probability 1, i.e.
sup_{x∈ℝ} |F̂ₙ(x) − F(x)| →^{a.s.} 0.

Proof.
Assume first that F(y) = y 1{y ∈ [0, 1]}. Fix a regular finite partition 0 = x₁ ≤ x₂ ≤ ... ≤ x_m = 1 of [0, 1] (so x_{k+1} − x_k = (m−1)⁻¹). By monotonicity of F and F̂ₙ,
sup_x |F̂ₙ(x) − F(x)| < max_k |F̂ₙ(x_k) − F(x_{k+1})| + max_k |F̂ₙ(x_k) − F(x_{k−1})|.
Adding and subtracting F(x_k) within each term, we can bound this above by
2 max_k |F̂ₙ(x_k) − F(x_k)| + max_k |F(x_k) − F(x_{k+1})| + max_k |F(x_k) − F(x_{k−1})|,
by an application of the triangle inequality to each term; the last two maxima equal max_k |x_k − x_{k+1}| + max_k |x_k − x_{k−1}| = 2(m−1)⁻¹. Letting n ↑ ∞, the SLLN implies that the first term vanishes almost surely. Since m is arbitrary we have proven that, given any ε > 0,
limsup_{n→∞} sup_x |F̂ₙ(x) − F(x)| < ε a.s.,
which gives the result when the cdf F is uniform.

For a general cdf F, we let U₁, U₂, ... iid U[0, 1] and define
W_i := F⁻¹(U_i) = inf{x : F(x) ≥ U_i}.
Observe that
W_i ≤ x ⟺ U_i ≤ F(x),
so that W_i =^d X_i. By Skorokhod's representation theorem, we may thus assume that W_i = X_i a.s. Letting Ĝₙ be the edf of (U₁, ..., Uₙ), we note that
F̂ₙ(y) = n⁻¹ Σ_{i=1}^n 1{W_i ≤ y} = n⁻¹ Σ_{i=1}^n 1{U_i ≤ F(y)} = Ĝₙ(F(y)), a.s.,
in other words F̂ₙ = Ĝₙ ∘ F, a.s. Now let A = F(ℝ) ⊆ [0, 1], so that from the first part of the proof
sup_{x∈ℝ} |F̂ₙ(x) − F(x)| = sup_{t∈A} |Ĝₙ(t) − t| ≤ sup_{t∈[0,1]} |Ĝₙ(t) − t| →^{a.s.} 0,
since obviously A ⊆ [0, 1].
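Glivenko-Cantelli can be watched in action: for U[0, 1] data (where F(x) = x), the uniform deviation sup_x |F̂ₙ(x) − F(x)| is attained at the jump points of F̂ₙ and shrinks as n grows. An illustrative sketch:

```python
import random

# Illustrative simulation (assumed example) of Glivenko-Cantelli: the uniform
# deviation sup_x |F_n(x) - F(x)| for U[0,1] samples shrinks as n grows.

random.seed(3)

def sup_deviation(n):
    xs = sorted(random.random() for _ in range(n))
    # For F(x) = x on [0,1] the sup is attained at a jump point of the edf:
    # at x_(i) the edf jumps from i/n to (i+1)/n (0-indexed i).
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

d_small, d_large = sup_deviation(100), sup_deviation(10000)
print(d_large < d_small)  # the deviation decreases with the sample size
```

The deviations are of order n^{−1/2}, so the comparison between n = 100 and n = 10000 is clear-cut.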

Example (Mean of a function)
Consider θ(F) = ∫_{−∞}^{+∞} h(x) dF(x). A plug-in estimator based on the edf is
θ̂ := θ(F̂ₙ) = ∫_{−∞}^{+∞} h(x) dF̂ₙ(x) = (1/n) Σ_{i=1}^n h(X_i).

Example (Variance)
Consider now σ²(F) = ∫_{−∞}^{+∞} (x − μ(F))² dF(x). Plugging in F̂ₙ gives
σ²(F̂ₙ) = ∫ x² dF̂ₙ(x) − ( ∫ x dF̂ₙ(x) )² = (1/n) Σ_{i=1}^n X_i² − ( (1/n) Σ_{i=1}^n X_i )².

Exercise
Show that σ²(F̂ₙ) is a biased but consistent estimator for any F.

Example (Density Estimation)
Let θ(F) = f(x₀), where f is the density of F, F(t) = ∫_{−∞}^t f(x) dx. If we tried to plug in F̂ₙ, then our estimator would require differentiation of F̂ₙ at x₀. Clearly, the edf plug-in estimator does not exist, since F̂ₙ is a step function. We will need a "smoother" estimate of F to plug in, e.g.
F̃ₙ(x) := ∫ G(x − y) dF̂ₙ(y) = (1/n) Σ_{i=1}^n G(x − X_i)
for some continuous G concentrated at 0.

- We saw that plug-in estimates are usually easy to obtain via F̂ₙ.
- But such estimates are not necessarily as "innocent" as they seem.
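The edf plug-in recipes above reduce to sample averages; a quick sketch (assumed example) checking the mean and variance plug-ins against the usual formulas:

```python
import random, statistics

# Plug-in estimators from the empirical distribution function (toy example):
# the mean and variance functionals evaluated at F_n reduce to sample averages.

random.seed(4)
xs = [random.gauss(5.0, 3.0) for _ in range(5000)]

n = len(xs)
mean_plugin = sum(xs) / n
var_plugin = sum(x**2 for x in xs) / n - mean_plugin**2  # biased by factor (n-1)/n

# agreement with the standard library formulas
print(abs(mean_plugin - statistics.fmean(xs)) < 1e-9)
print(abs(var_plugin - statistics.pvariance(xs)) < 1e-6)
```

Note that `statistics.pvariance` is exactly the plug-in (population) variance, i.e. the biased version with divisor n.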

The Method of Moments

Perhaps the oldest estimation method (Karl Pearson, late 1800's).

Method of Moments
Let X₁, ..., Xₙ be an iid sample from F_θ, θ ∈ ℝ^p. The method of moments estimator θ̂ of θ is the solution w.r.t. θ of the p random equations
∫_{−∞}^{+∞} x^{k_j} dF̂ₙ(x) = ∫_{−∞}^{+∞} x^{k_j} dF_θ(x), {k_j}_{j=1}^p ⊂ ℕ.

- In some sense this is a plug-in estimator: we estimate the theoretical moments by the sample moments in order to then estimate θ.
- Useful when the exact functional form of θ(F) is unavailable.
- While the method was introduced by equating moments, it may be generalized by equating p theoretical functionals to their empirical analogues.
- → The choice of equations can be important.

Motivational Diversion: The Moment Problem

Theorem
Suppose that F is a distribution determined by its moments. Let {Fₙ} be a sequence of distributions such that ∫ x^k dFₙ(x) < ∞ for all n and k. Then
limₙ ∫ x^k dFₙ(x) = ∫ x^k dF(x) ∀k ≥ 1  ⟹  Fₙ →_w F.

BUT: not all distributions are determined by their moments!

Lemma
The distribution of X is determined by its moments, provided that there exists an open neighbourhood A containing zero such that
M_X(u) = E e^{⟨u,X⟩} < ∞, ∀u ∈ A.

Example (Exponential Distribution)
Suppose X₁, ..., Xₙ iid Exp(λ). Then E[X_i^r] = λ^{−r} Γ(r + 1). Hence we may define a class of estimators of λ depending on r,
λ̂ = [ (1/(nΓ(r + 1))) Σ_{i=1}^n X_i^r ]^{−1/r}.
Tune the value of r so as to get a "best estimator" (will see later...).

Example (Discrete Uniform Distribution)
Let X₁, ..., Xₙ iid U{1, 2, ..., θ}, for θ ∈ ℕ. Using the first moment of the distribution we obtain the equation
X̄ = (θ + 1)/2,
yielding the MoM estimator θ̂ = 2X̄ − 1.

Example (Gamma Distribution)
Let X₁, ..., Xₙ iid Gamma(α, λ). The first two moment equations are
α/λ = (1/n) Σ_{i=1}^n X_i = X̄  and  α/λ² = (1/n) Σ_{i=1}^n (X_i − X̄)² = σ̂²,
yielding the estimates α̂ = X̄²/σ̂² and λ̂ = X̄/σ̂².

A nice feature of MoM estimators is that they generalize to non-iid data:
→ if X = (X₁, ..., Xₙ) has distribution depending on θ ∈ ℝ^p, one can choose statistics T₁, ..., T_p whose expectations depend on θ, E_θ T_k = g_k(θ), and then equate
T_k(X) = g_k(θ), k = 1, ..., p.
It is important here that T_k is a reasonable estimator of E T_k (motivation).

Comments on Plug-In and MoM Estimators
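The Gamma MoM formulas α̂ = X̄²/σ̂², λ̂ = X̄/σ̂² can be sanity-checked on simulated data (note: `random.gammavariate` takes shape and scale, so scale 1/λ matches the rate parametrization used here):

```python
import random

# Method-of-moments estimates for the Gamma(alpha, lambda) model (rate
# parametrization, mean alpha/lambda) on simulated data; an assumed example.

random.seed(5)
alpha, lam = 3.0, 2.0
xs = [random.gammavariate(alpha, 1.0 / lam) for _ in range(100000)]

n = len(xs)
xbar = sum(xs) / n
s2 = sum((x - xbar) ** 2 for x in xs) / n

alpha_hat, lam_hat = xbar**2 / s2, xbar / s2
print(abs(alpha_hat - alpha) < 0.15 and abs(lam_hat - lam) < 0.15)
```

With 100 000 draws the Monte Carlo error of both estimates is a few hundredths, so the 0.15 tolerance is comfortable.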

- Usually easy to compute, and can be valuable as preliminary estimates for algorithms that attempt to compute more efficient (but not easily computable) estimates.
- Can give a starting point for a search for better estimators in situations where simple intuitive estimators are not available.
- Often these estimators are consistent, so they are likely to be close to the true parameter value for large sample sizes.
  → Use empirical process theory for plug-ins, and estimating equation theory for MoM's.
- Can lead to biased estimators, or even completely ridiculous estimators (will see later).
- The estimate provided by an MoM estimator may lie outside Θ! (Exercise: show that this can happen with the binomial distribution, both n and p unknown.)
- Will later discuss optimality in estimation, and appropriateness (or inappropriateness) will become clearer.
- Observation: many of these estimators do not depend solely on sufficient statistics.
  → Sufficiency seems to play an important role in optimality, and indeed it does (more later).
- Will now see a method where the estimator depends only on a sufficient statistic, when such a statistic exists.

Maximum Likelihood Estimators

A central theme in statistics. Introduced by Ronald Fisher.

Definition (The Likelihood Function)
Let X = (X₁, ..., Xₙ) be random variables with joint density (or frequency function) f(x; θ), θ ∈ Θ ⊆ ℝ^p. The likelihood function L(θ) is the random function
L(θ) = f(X; θ).
→ Notice that we consider L as a function of θ, NOT of X.

Interpretation: most easily interpreted in the discrete case: how likely does the value θ make what we observed? (The interpretation extends to the continuous case by thinking of L(θ) as how likely θ makes something in a small neighbourhood of what we observed.)

When X has iid coordinates with density f(·; θ), the likelihood is
L(θ) = ∏_{i=1}^n f(X_i; θ).

Definition (Maximum Likelihood Estimators)
Let X = (X₁, ..., Xₙ) be a random sample from F_θ, and suppose that θ̂ is such that
L(θ̂) ≥ L(θ), ∀θ ∈ Θ.
Then θ̂ is called a maximum likelihood estimator of θ. We call θ̂ the maximum likelihood estimator when it is the unique maximum of L(θ),
θ̂ = arg max_{θ∈Θ} L(θ).

Intuitively, a maximum likelihood estimator chooses the value of θ that is most compatible with our observation, in the sense that it makes what we observed most probable. In not-so-mathematical terms, θ̂ is the value of θ that is most likely to have produced the data.

Comments on MLE's

- Saw that MoM and plug-in estimators often do not depend only on sufficient statistics, i.e. they also use "irrelevant" information.
- If T is a sufficient statistic for θ, then the Factorization theorem implies that
  L(θ) = g(T(X); θ) h(X) ∝ g(T(X); θ),
  i.e. any MLE depends on the data ONLY through the sufficient statistic.
- MLE's are also invariant: if g : Θ → Θ′ is a bijection, and if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ).
- For a very broad class of statistical models, the likelihood can be maximized via differential calculus. If Θ is open, the support of the distribution does not depend on θ, and the likelihood is differentiable, then the MLE satisfies the log-likelihood equations
  ∇_θ log L(θ) = 0.
- Notice that maximizing log L(θ) is equivalent to maximizing L(θ).
- When the support of a distribution depends on a parameter, maximization is usually carried out by direct inspection.
- When Θ is not open, the likelihood equations can be used, provided that we verify that the maximum does not occur on the boundary of Θ.

Statistical Theory (Week 5) Point Estimation 29 / 31 Statistical Theory (Week 5) Point Estimation 30 / 31 Example (Uniform Distribution)

Let X_1, ..., X_n iid ∼ U[0, θ]. The likelihood is

    L(θ) = θ^{-n} ∏_{i=1}^n 1{0 ≤ X_i ≤ θ} = θ^{-n} 1{θ ≥ X_{(n)}}.

Hence if θ < X_{(n)} the likelihood is zero. On the domain [X_{(n)}, ∞), the likelihood is a decreasing function of θ. Hence θ̂ = X_{(n)}.
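As a quick numerical check of this direct-inspection argument (the sample and the value θ = 3 below are made-up illustrations, not part of the notes), no grid value of θ beats θ̂ = X_(n):

```python
import random

random.seed(0)
theta_true = 3.0
x = [random.uniform(0, theta_true) for _ in range(50)]

def likelihood(theta, data):
    """L(theta) = theta^(-n) * 1{theta >= max(data)} for U[0, theta] samples."""
    n = len(data)
    return theta ** (-n) if theta >= max(data) else 0.0

mle = max(x)  # theta_hat = X_(n)

# The likelihood vanishes below max(x) and decreases above it,
# so no grid point can beat theta = max(x).
grid = [mle + k * 0.01 for k in range(500)]
assert all(likelihood(t, x) <= likelihood(mle, x) for t in grid)
assert likelihood(mle - 0.01, x) == 0.0
```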

Example (Poisson Distribution)

Let X_1, ..., X_n iid ∼ Poisson(λ). Then

    L(λ) = ∏_{i=1}^n (λ^{x_i} e^{-λ}) / x_i!  ⟹  log L(λ) = -nλ + log λ ∑_{i=1}^n x_i - ∑_{i=1}^n log(x_i!).

Setting ∇_λ log L(λ) = -n + λ^{-1} ∑_i x_i = 0 we obtain λ̂ = x̄, since ∇²_λ log L(λ) = -λ^{-2} ∑_i x_i < 0.
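The calculation above can be sanity-checked numerically; here is a small sketch (the inversion-based Poisson sampler and the parameter values are illustrative assumptions, not part of the notes):

```python
import math
import random

random.seed(1)
lam_true = 2.5

def poisson(lam):
    """Poisson variate via CDF inversion (the stdlib has no Poisson sampler)."""
    u, k, p = random.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

x = [poisson(lam_true) for _ in range(200)]

def log_lik(lam, data):
    """log L(lambda) = -n*lambda + log(lambda)*sum(x_i) - sum(log(x_i!))."""
    n = len(data)
    return (-n * lam + math.log(lam) * sum(data)
            - sum(math.lgamma(xi + 1) for xi in data))

mle = sum(x) / len(x)  # lambda_hat = sample mean

# log L is concave in lambda with its maximum exactly at the sample mean,
# so the sample mean beats every other grid value of lambda.
grid = [0.1 + 0.01 * k for k in range(500)]
assert all(log_lik(l, x) <= log_lik(mle, x) for l in grid)
```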

Maximum Likelihood Estimation

Statistical Theory

1 The Problem of Point Estimation
2 Maximum Likelihood Estimators

Victor Panaretos, École Polytechnique Fédérale de Lausanne

3 Relationship with Kullback-Leibler Divergence

4 Asymptotic Properties of the MLE

Point Estimation for Parametric Families

Collection of r.v.'s (a random vector) X = (X_1, ..., X_n), with X ∼ F_θ ∈ F, a parametric class with parameter θ ∈ Θ ⊆ R^d.

The Problem of Point Estimation
1 Assume that F_θ is known up to the parameter θ, which is unknown.
2 Let (x_1, ..., x_n) be a realization of X ∼ F_θ which is available to us.
3 Estimate the value of θ that generated the sample, given (x_1, ..., x_n).

Last week, we saw three estimation methods: the plug-in method, the method of moments, and maximum likelihood. Today: focus on maximum likelihood. Why does it make sense? What are its properties?

Maximum Likelihood Estimators

Recall our definition of a maximum likelihood estimator:

Definition (Maximum Likelihood Estimators)
Let X = (X_1, ..., X_n) be a random sample from F_θ, and suppose that θ̂ is such that L(θ̂) ≥ L(θ), ∀ θ ∈ Θ. Then θ̂ is called a maximum likelihood estimator of θ.

We call θ̂ the maximum likelihood estimator when it is the unique maximum of L(θ), θ̂ = argmax_{θ ∈ Θ} L(θ).

θ̂ makes what we observed most probable, most likely. This makes sense intuitively. But why should it make sense mathematically?

Definition (Kullback-Leibler Divergence)
Let p(x) and q(x) be two probability density (frequency) functions on R. The Kullback-Leibler divergence of q with respect to p is defined as:

    KL(q ‖ p) := ∫_{-∞}^{+∞} p(x) log( p(x)/q(x) ) dx.

We have KL(p ‖ p) = ∫ p(x) log(1) dx = 0. By Jensen's inequality, for X ∼ p(·) we have

    KL(q ‖ p) = E{ -log[ q(X)/p(X) ] } ≥ -log E[ q(X)/p(X) ] = 0,

since q integrates to 1. Moreover, p ≠ q implies that KL(q ‖ p) > 0.

KL is, in a sense, a distance between probability distributions. But KL is not a metric: no symmetry and no triangle inequality!

Lemma (Maximum Likelihood as Minimum KL-Divergence)
An estimator θ̂ based on an iid sample X_1, ..., X_n is a maximum likelihood estimator if and only if

    KL(F(x; θ̂) ‖ F̂_n(x)) ≤ KL(F(x; θ) ‖ F̂_n(x)),  ∀ θ ∈ Θ.

Proof (discrete case).
We recall that ∫ h(x) dF̂_n(x) = n^{-1} ∑_i h(X_i), so that

    KL(F_θ ‖ F̂_n) = ∫ log( n^{-1} ∑_{i=1}^n δ_{X_i}(x) / f(x; θ) ) dF̂_n(x)
                  = (1/n) ∑_{i=1}^n log( n^{-1} / f(X_i; θ) )
                  = -log n - (1/n) ∑_{i=1}^n log f(X_i; θ)
                  = -log n - (1/n) log L(θ),

which is minimized w.r.t. θ iff L(θ) is maximized w.r.t. θ.
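This equivalence can be illustrated numerically for a Bernoulli model (the data and the grid below are made-up illustrations). With ties in the sample the constant differs from -log n, but the empirical KL-divergence still differs from -n^{-1} log L(θ) only by a constant in θ, so the KL minimizer and the MLE coincide:

```python
import math
import random

random.seed(2)
theta_true = 0.3
x = [1 if random.random() < theta_true else 0 for _ in range(100)]
n = len(x)
p_hat = {1: sum(x) / n, 0: 1 - sum(x) / n}  # empirical frequencies

def log_lik(theta):
    return sum(math.log(theta if xi == 1 else 1 - theta) for xi in x)

def kl_emp(theta):
    """KL of the model from the empirical distribution:
    sum_x p_hat(x) * log(p_hat(x) / f(x; theta))."""
    f = {1: theta, 0: 1 - theta}
    return sum(p * math.log(p / f[v]) for v, p in p_hat.items() if p > 0)

grid = [0.01 * k for k in range(1, 100)]
mle = max(grid, key=log_lik)
kl_min = min(grid, key=kl_emp)
assert mle == kl_min  # maximizing L(theta) == minimizing empirical KL
```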

Therefore, maximizing the likelihood is equivalent to choosing the element of the parametric family {f_θ}_{θ ∈ Θ} that minimizes the KL-divergence with the empirical distribution function. Intuition:
- F̂_n is (with probability 1) a uniformly good approximation of F_{θ_0}, θ_0 the true parameter (large n).
- So F_{θ_0} will be "very close" to F̂_n (for large n).
- So set the "projection" of F̂_n onto {F_θ}_{θ ∈ Θ} as the estimator of F_{θ_0} ("projection" with respect to KL-divergence).

Final comments on KL-divergence:
- KL(p ‖ q) measures how likely it would be to distinguish whether an observation X came from q or from p, given that it came from p.
- A related quantity is the entropy of p, defined as -∫ log(p(x)) p(x) dx, which measures the "inherent randomness" of p (how "surprising" an outcome from p is on average).

Asymptotics for MLE's
- Under what conditions is an MLE consistent?
- How does the distribution of θ̂_MLE concentrate around θ_0 as n → ∞?
- Often, when the MLE coincides with a MoM estimator, this can be seen directly.

Example (Geometric distribution)
Let X_1, ..., X_n be iid Geometric random variables with frequency function

    f(x; θ) = θ(1-θ)^x,  x = 0, 1, 2, ...

The MLE of θ is

    θ̂_n = 1/(X̄ + 1).

By the central limit theorem, √n(X̄ - (θ^{-1} - 1)) →d N(0, θ^{-2}(1-θ)).

Example (Geometric distribution, cont'd)
Now apply the delta method with g(x) = 1/(1+x), so that g'(x) = -1/(1+x)²:

    √n(θ̂_n - θ) = √n(g(X̄_n) - g(θ^{-1} - 1)) →d N(0, θ²(1-θ)).

Example (Uniform distribution)
Suppose that X_1, ..., X_n iid ∼ U[0, θ]. The MLE of θ is

    θ̂_n = X_{(n)} = max{X_1, ..., X_n},

with distribution function P[θ̂_n ≤ x] = (x/θ)^n 1{x ∈ [0, θ]}. Thus for ε > 0,

    P[|θ̂_n - θ| > ε] = P[θ̂_n < θ - ε] = ((θ-ε)/θ)^n → 0 as n → ∞,

so that the MLE is a consistent estimator. To determine the asymptotic concentration of dist(θ̂_n) around θ:

    P[n(θ - θ̂_n) ≤ x] = P[θ̂_n ≥ θ - x/n] = 1 - (1 - x/(θn))^n → 1 - exp(-x/θ) as n → ∞,

so that n(θ - θ̂_n) converges weakly to an exponential random variable. Thus we understand the concentration of dist(θ̂_n - θ) around zero for large n as that of an exponential distribution with variance θ²/n².

From now on assume that X_1, ..., X_n are iid with density (frequency) f(x; θ), θ ∈ R. Notation:

    ℓ(x; θ) = log f(x; θ),

and ℓ'(x; θ), ℓ''(x; θ) and ℓ'''(x; θ) are partial derivatives w.r.t. θ.
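The exponential limit for the uniform MLE can be checked by simulation; a minimal sketch (θ = 2, n = 500 and the replication count are made-up illustrations):

```python
import math
import random
import statistics

random.seed(3)
theta, n, reps = 2.0, 500, 4000

# Collect n * (theta - theta_hat_n) over many replications of the experiment.
scaled_errors = []
for _ in range(reps):
    mle = max(random.uniform(0, theta) for _ in range(n))
    scaled_errors.append(n * (theta - mle))

# The limit is Exponential with mean theta: check the mean and the CDF at x = theta.
m = statistics.mean(scaled_errors)
frac = sum(e <= theta for e in scaled_errors) / reps
assert abs(m - theta) < 0.15                  # E[Exp] = theta = 2
assert abs(frac - (1 - math.exp(-1))) < 0.05  # P[Exp <= theta] = 1 - 1/e
```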

Asymptotics for the MLE

Regularity Conditions
(A1) Θ is an open subset of R.
(A2) The support of f, supp f, is independent of θ.
(A3) f is thrice continuously differentiable w.r.t. θ for all x ∈ supp f.
(A4) E_θ[ℓ'(X_i; θ)] = 0 ∀θ, and Var_θ[ℓ'(X_i; θ)] = I(θ) ∈ (0, ∞) ∀θ.
(A5) -E_θ[ℓ''(X_i; θ)] = J(θ) ∈ (0, ∞) ∀θ.
(A6) ∃ M(x) > 0 and δ > 0 such that E_{θ_0}[M(X_i)] < ∞ and |θ - θ_0| < δ ⟹ |ℓ'''(x; θ)| ≤ M(x).

Let's take a closer look at these conditions. Under condition (A2) we have (d/dθ) ∫_{supp f} f(x; θ) dx = 0 for all θ ∈ Θ, so that, if we can interchange integration and differentiation,

    0 = (d/dθ) ∫ f(x; θ) dx = ∫ ℓ'(x; θ) f(x; θ) dx = E_θ[ℓ'(X_i; θ)].

So, in the presence of (A2), (A4) is essentially a condition that enables differentiation under the integral, and asks that the r.v. ℓ' have a finite second moment for all θ. Similarly, (A5) requires that ℓ'' have a first moment for all θ.

Conditions (A2) and (A6) are smoothness conditions that will allow us to "linearize" the problem, while the other conditions will allow us to "control" the random linearization.

Furthermore, if we can differentiate twice under the integral sign,

    0 = (d/dθ) ∫ ℓ'(x; θ) f(x; θ) dx = ∫ ℓ''(x; θ) f(x; θ) dx + ∫ (ℓ'(x; θ))² f(x; θ) dx,

so that I(θ) = J(θ).

If Θ is open, then for θ_0 the true parameter, it always makes sense for an estimator θ̂ to have a symmetric distribution around θ_0 (e.g. Gaussian).

Example (Exponential Family)

Let X_1, ..., X_n be iid random variables distributed according to a one-parameter exponential family

    f(x; θ) = exp{ c(θ) T(x) - d(θ) + S(x) },  x ∈ supp f.

It follows that

    ℓ'(x; θ) = c'(θ) T(x) - d'(θ),
    ℓ''(x; θ) = c''(θ) T(x) - d''(θ).

On the other hand,

    E[T(X_i)] = d'(θ)/c'(θ),
    Var[T(X_i)] = (1/[c'(θ)]²) ( d''(θ) - c''(θ) d'(θ)/c'(θ) ).

Furthermore,

    I(θ) = [c'(θ)]² Var[T(X_i)] = d''(θ) - c''(θ) d'(θ)/c'(θ)

and

    J(θ) = d''(θ) - c''(θ) E[T(X_i)] = d''(θ) - c''(θ) d'(θ)/c'(θ),

so that I(θ) = J(θ).

Hence E[ℓ'(X_i; θ)] = c'(θ) E[T(X_i)] - d'(θ) = 0.
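The identity I(θ) = J(θ) can be checked by Monte Carlo for a concrete exponential family member. For Poisson(θ), c(θ) = log θ, T(x) = x, d(θ) = θ, so ℓ'(x; θ) = x/θ - 1 and ℓ''(x; θ) = -x/θ², and both I(θ) and J(θ) should equal 1/θ. A sketch (the sampler and θ = 2 are illustrative assumptions):

```python
import math
import random
import statistics

random.seed(4)
theta = 2.0

def poisson(lam):
    """Poisson variate via CDF inversion (the stdlib has no Poisson sampler)."""
    u, k, p = random.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

x = [poisson(theta) for _ in range(100_000)]
score = [xi / theta - 1 for xi in x]      # l'(x; theta)
curv = [-xi / theta ** 2 for xi in x]     # l''(x; theta)

I_hat = statistics.pvariance(score)  # Var[l']  -> I(theta)
J_hat = -statistics.mean(curv)       # -E[l''] -> J(theta)
assert abs(I_hat - 1 / theta) < 0.02  # both ~ 1/theta = 0.5
assert abs(J_hat - 1 / theta) < 0.02
assert abs(I_hat - J_hat) < 0.03
```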

Asymptotic Normality of the MLE

Theorem (Asymptotic Distribution of the MLE)
Let X_1, ..., X_n be iid random variables with density (frequency) f(x; θ) satisfying conditions (A1)-(A6). Suppose that the sequence of MLE's θ̂_n satisfies θ̂_n →p θ, where

    ∑_{i=1}^n ℓ'(X_i; θ̂_n) = 0,  n = 1, 2, ...

Then

    √n(θ̂_n - θ) →d N(0, I(θ)/J²(θ)).

When I(θ) = J(θ), we have of course √n(θ̂_n - θ) →d N(0, 1/I(θ)).

Proof.
Under conditions (A1)-(A3), if θ̂_n maximizes the likelihood, we have

    ∑_{i=1}^n ℓ'(X_i; θ̂_n) = 0.

Expanding this equation in a Taylor series, we get

    0 = ∑_{i=1}^n ℓ'(X_i; θ̂_n) = ∑_{i=1}^n ℓ'(X_i; θ) + (θ̂_n - θ) ∑_{i=1}^n ℓ''(X_i; θ) + ½ (θ̂_n - θ)² ∑_{i=1}^n ℓ'''(X_i; θ*_n),

with θ*_n lying between θ and θ̂_n.

Dividing across by √n yields

    0 = (1/√n) ∑_{i=1}^n ℓ'(X_i; θ) + √n(θ̂_n - θ) · (1/n) ∑_{i=1}^n ℓ''(X_i; θ) + ½ √n(θ̂_n - θ)² · (1/n) ∑_{i=1}^n ℓ'''(X_i; θ*_n),

which suggests that √n(θ̂_n - θ) equals

    [ -n^{-1/2} ∑_{i=1}^n ℓ'(X_i; θ) ] / [ n^{-1} ∑_{i=1}^n ℓ''(X_i; θ) + (θ̂_n - θ)(2n)^{-1} ∑_{i=1}^n ℓ'''(X_i; θ*_n) ].

Now, from the central limit theorem and condition (A4), it follows that

    (1/√n) ∑_{i=1}^n ℓ'(X_i; θ) →d N(0, I(θ)).

Next, the weak law of large numbers along with condition (A5) implies

    (1/n) ∑_{i=1}^n ℓ''(X_i; θ) →p -J(θ).

Now we turn to show that the remainder vanishes in probability,

    R_n = (θ̂_n - θ) (2n)^{-1} ∑_{i=1}^n ℓ'''(X_i; θ*_n) →p 0.

We have that, for any ε > 0,

    P[|R_n| > ε] = P[|R_n| > ε, |θ̂_n - θ| > δ] + P[|R_n| > ε, |θ̂_n - θ| ≤ δ],

and

    P[|R_n| > ε, |θ̂_n - θ| > δ] ≤ P[|θ̂_n - θ| > δ] → 0.

If |θ̂_n - θ| < δ, (A6) gives |R_n| ≤ (δ/(2n)) ∑_{i=1}^n M(X_i). Since the weak law of large numbers implies that

    (1/n) ∑_{i=1}^n M(X_i) →p E[M(X_1)] < ∞,

the quantity P[|R_n| > ε, |θ̂_n - θ| ≤ δ] can be made arbitrarily small (for large n) by taking δ sufficiently small. Thus R_n →p 0, and applying Slutsky's theorem we may conclude that

    √n(θ̂_n - θ) →d N(0, I(θ)/J²(θ)).

Notice that for our proof we assumed that the sequence of MLE's was consistent. Proving consistency of an MLE can be subtle.

Consistency of the MLE

Consider the random function

    φ_n(t) = (1/n) ∑_{i=1}^n [ log f(X_i; t) - log f(X_i; θ) ],

which is maximized at t = θ̂_n. By the WLLN, for each t ∈ Θ,

    φ_n(t) →p φ(t) = E[ log( f(X_i; t) / f(X_i; θ) ) ],

which is minus the KL-divergence. The latter is minimized when t = θ, and so φ(t) is maximized at t = θ. Moreover, unless f(x; t) = f(x; θ) for all x ∈ supp f, we have φ(t) < 0. Since we are assuming identifiability, it follows that φ is uniquely maximized at θ.

Does the fact that φ_n(t) →p φ(t) ∀t, with φ uniquely maximized at θ, imply that θ̂_n →p θ? Unfortunately, the answer is in general no.

Example (A Deterministic Example)
Define

    φ_n(t) = 1 - n|t - n^{-1}|  for 0 ≤ t ≤ 2/n,
             1/2 - |t - 2|       for 3/2 ≤ t ≤ 5/2,
             0                   otherwise.

It is easy to see that φ_n → φ pointwise, with

    φ(t) = (1/2 - |t - 2|) 1{3/2 ≤ t ≤ 5/2}.

But now note that φ_n is maximized at t_n = n^{-1} with φ_n(t_n) = 1 for all n. On the other hand, φ is maximized at t_0 = 2.

More assumptions are needed on the φ_n(t).

Theorem
Suppose that φ_n(t) and φ(t) are real-valued random functions defined on the real line. Suppose that
1 for each M > 0, sup_{|t| ≤ M} |φ_n(t) - φ(t)| →p 0;
2 T_n maximizes φ_n(t) and T_0 is the unique maximizer of φ(t);
3 ∀ ε > 0, there exists M such that P[|T_n| > M] < ε ∀n.
Then T_n →p T_0.

If the φ_n are concave, we can weaken the assumptions:

Theorem
Suppose that φ_n(t) and φ(t) are random concave functions defined on the real line. Suppose that
1 φ_n(t) →p φ(t) for all t;
2 T_n maximizes φ_n and T_0 is the unique maximizer of φ.
Then T_n →p T_0.
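The deterministic counterexample can be verified directly in a few lines (the grid is an illustrative assumption; it contains the relevant maximizers exactly):

```python
def phi_n(t, n):
    """The spike-plus-triangle functions from the counterexample."""
    if 0 <= t <= 2 / n:
        return 1 - n * abs(t - 1 / n)
    if 1.5 <= t <= 2.5:
        return 0.5 - abs(t - 2)
    return 0.0

def phi(t):
    """Pointwise limit: only the triangle around t = 2 survives."""
    return (0.5 - abs(t - 2)) if 1.5 <= t <= 2.5 else 0.0

grid = [k / 1000 for k in range(3001)]
for n in (10, 100, 1000):
    t_n = max(grid, key=lambda t: phi_n(t, n))
    assert abs(t_n - 1 / n) < 1e-9   # maximizer of phi_n is the spike at 1/n
t_0 = max(grid, key=phi)
assert abs(t_0 - 2.0) < 1e-9         # but the limit phi is maximized at 2
```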

Example (Exponential Families)
Let X_1, ..., X_n be iid random variables from a one-parameter exponential family

    f(x; θ) = exp{ c(θ) T(x) - d(θ) + S(x) },  x ∈ supp f.

The MLE of θ maximizes

    φ_n(t) = (1/n) ∑_{i=1}^n [ c(t) T(X_i) - d(t) ].

If c(·) is continuous and 1-1 with inverse c^{-1}(·), we may define u = c(t) and consider

    φ*_n(u) = (1/n) ∑_{i=1}^n [ u T(X_i) - d_0(u) ],

with d_0(u) = d(c^{-1}(u)). It follows that φ*_n is a concave function, since its second derivative, (φ*_n)''(u) = -d_0''(u), is negative (d_0''(u) = Var T(X_i)).

Now, by the weak law of large numbers, for each u we have

    φ*_n(u) →p u E[T(X_1)] - d_0(u) = φ*(u).

Furthermore, φ*(u) is maximized when d_0'(u) = E[T(X_1)]. But since E[T(X_1)] = d_0'(c(θ)), we must have that φ* is maximized when

    d_0'(u) = d_0'(c(θ)).

The condition holds if we set u = c(θ), so c(θ) is a maximizer of φ*. By concavity, it is the unique maximizer. It follows from our theorem that if û_n = c(θ̂_n), then û_n = c(θ̂_n) →p c(θ). But c is 1-1 and continuous, so the continuous mapping theorem implies

    θ̂_n →p θ.

More on Maximum Likelihood Estimation

1 Consistent Roots of the Likelihood Equations

Statistical Theory

2 Approximate Solution of the Likelihood Equations

Victor Panaretos, École Polytechnique Fédérale de Lausanne

3 The Multiparameter Case

4 Misspecified Models and Likelihood

Maximum Likelihood Estimators

Recall our definition of a maximum likelihood estimator:

Definition (Maximum Likelihood Estimators)
Let X = (X_1, ..., X_n) be a random sample from F_θ, and suppose that θ̂ is such that L(θ̂) ≥ L(θ), ∀ θ ∈ Θ. Then θ̂ is called a maximum likelihood estimator of θ.

We saw that, under regularity conditions, the distribution of a consistent sequence of MLEs converges weakly to the normal distribution centred around the true parameter value when this is real. Today:
- Consistent likelihood equation roots
- Newton-Raphson and "one-step" estimators
- The multivariate parameter case
- What happens if the model has been mis-specified?

Consistent Likelihood Roots

Theorem
Let {f(·; θ)}_{θ ∈ R} be an identifiable parametric class of densities (frequencies), and let X_1, ..., X_n be iid random variables each having density f(x; θ_0). If the support of f(·; θ) is independent of θ, then

    P[L(θ_0 | X_1, ..., X_n) > L(θ | X_1, ..., X_n)] → 1 as n → ∞

for any fixed θ ≠ θ_0.

Therefore, with high probability, the likelihood of the true parameter exceeds the likelihood of any other fixed choice of parameter, provided that the sample size is large. This hints that extrema of L(θ; X) should have something to do with θ_0 (even though we saw that, without further assumptions, a maximizer of L is not necessarily consistent).

Proof.
Notice that

    L(θ_0 | X_n) > L(θ | X_n) ⟺ (1/n) ∑_{i=1}^n log( f(X_i; θ) / f(X_i; θ_0) ) < 0.

By the WLLN,

    (1/n) ∑_{i=1}^n log( f(X_i; θ) / f(X_i; θ_0) ) →p E[ log( f(X; θ) / f(X; θ_0) ) ] = -KL(f_θ ‖ f_{θ_0}).

But we have seen that the KL-divergence is zero only at θ_0 and positive everywhere else.

Consistent Sequences of Likelihood Roots

Theorem (Cramér)
Let {f(·; θ)}_{θ ∈ R} be an identifiable parametric class of densities (frequencies), and let X_1, ..., X_n be iid random variables each having density f(x; θ_0). Assume that the support of f(·; θ) is independent of θ, and that f(x; θ) is differentiable with respect to θ for (almost) all x. Then, given any ε > 0, with probability tending to 1 as n → ∞, the likelihood equation

    (∂/∂θ) ℓ(θ; X_1, ..., X_n) = 0

has a root θ̂_n(X_1, ..., X_n) such that |θ̂_n(X_1, ..., X_n) - θ_0| < ε.

- Does not tell us which root to choose, so not useful in practice.
- Actually, the consistent sequence is essentially unique.


Corollary (Consistency of Unique Solutions)
Under the assumptions of the previous theorem, if the likelihood equation has a unique root δ_n for each n and all x, then {δ_n} is a consistent sequence of estimators for θ_0.

- The statement remains true if the uniqueness requirement is substituted with the requirement that the probability of multiple roots tends to zero as n → ∞.
- Notice that the statement does not claim that the root corresponds to a maximum: it merely requires that we have a root.
- On the other hand, even when the root is unique, the corollary says nothing about its properties for finite n.

Example (Minimum Likelihood Estimation)
Let X take the values 0, 1, 2 with probabilities 6θ² - 4θ + 1, θ - 2θ² and 3θ - 4θ² (θ ∈ (0, 1/2)). Then the likelihood equation has a unique root for all x, which is a minimum for x = 0 and a maximum for x = 1, 2.

Consistent Sequences of Likelihood Roots

Fortunately, if some "good" estimator is already available, then...

Lemma
Let α_n be any consistent sequence of estimators for the parameter θ. For each n, let θ*_n denote the root of the likelihood equations that is closest to α_n. Then, under the assumptions of Cramér's theorem, θ*_n →p θ.

Therefore, when the likelihood equations do not have a single root, we may still choose a root based on some estimator that is readily available:
- We only require that the estimator used be consistent.
- This is often the case with Plug-In or MoM estimators.

Very often, the roots will not be available in closed form. In these cases, an iterative approach will be required to approximate the roots.

The Newton-Raphson Algorithm

We wish to solve the equation

    ℓ'(θ) = 0.

Supposing that θ̃ is close to a root (perhaps θ̃ is a consistent estimator), a second-order Taylor expansion gives

    0 = ℓ'(θ̂) ≈ ℓ'(θ̃) + (θ̂ - θ̃) ℓ''(θ̃).

This suggests

    θ̂ ≈ θ̃ - ℓ'(θ̃)/ℓ''(θ̃).

The procedure can then be iterated by replacing θ̃ by the right-hand side of the above relation. There are many issues regarding convergence, speed of convergence, etc. (numerical analysis course).

Construction of Asymptotically MLE-like Estimators

Theorem
Suppose that assumptions (A1)-(A6) hold, and let θ̃_n be a consistent estimator of θ_0 such that √n(θ̃_n - θ_0) is bounded in probability. Then the sequence of estimators

    δ_n = θ̃_n - ℓ'(θ̃_n)/ℓ''(θ̃_n)

satisfies

    √n(δ_n - θ_0) →d N(0, I(θ)/J(θ)²).

Therefore, with a single Newton-Raphson step, we may obtain an estimator that, asymptotically, behaves like a consistent MLE, provided that we have a √n-consistent estimator! The "one-step" estimator does not necessarily behave like an MLE for finite n!
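As an illustration (the choice of model and all numbers are assumptions, not from the notes): for the Cauchy location family the likelihood equations can have multiple roots, but the sample median is a √n-consistent starting point, and a single Newton-Raphson step then gives an asymptotically MLE-like estimator:

```python
import math
import random
import statistics

random.seed(5)
theta0, n = 1.0, 2000
# Cauchy(theta0, 1) sample via inverse-CDF sampling
x = [theta0 + math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

def score(t):
    # l'(t) = sum_i 2*(x_i - t) / (1 + (x_i - t)^2)
    return sum(2 * (xi - t) / (1 + (xi - t) ** 2) for xi in x)

def score_deriv(t):
    # l''(t) = sum_i 2*((x_i - t)^2 - 1) / (1 + (x_i - t)^2)^2
    return sum(2 * ((xi - t) ** 2 - 1) / (1 + (xi - t) ** 2) ** 2 for xi in x)

start = statistics.median(x)   # sqrt(n)-consistent starting estimator
one_step = start - score(start) / score_deriv(start)  # single Newton step

assert score_deriv(start) < 0  # log-likelihood locally concave near the truth
assert abs(one_step - theta0) < 0.15
```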


Proof.
We Taylor expand around the true value θ_0:

    ℓ'(θ̃_n) = ℓ'(θ_0) + (θ̃_n - θ_0) ℓ''(θ_0) + ½ (θ̃_n - θ_0)² ℓ'''(θ*_n),

with θ*_n between θ_0 and θ̃_n. Substituting this expression into the definition of δ_n yields

    √n(δ_n - θ_0) = -[(1/√n) ℓ'(θ_0)] / [(1/n) ℓ''(θ̃_n)] + √n(θ̃_n - θ_0) × [ 1 - ℓ''(θ_0)/ℓ''(θ̃_n) - ½ (θ̃_n - θ_0) ℓ'''(θ*_n)/ℓ''(θ̃_n) ].

Exercise: use the central limit theorem and the law of large numbers to complete the proof.

The Multiparameter Case

Extension of the asymptotic results to multiparameter models is easy under similar assumptions, but notationally cumbersome. Same ideas: the MLE will be a zero of the likelihood equations

    ∑_{i=1}^n ∇ℓ(X_i; θ) = 0.

A Taylor expansion can be formed:

    0 = (1/√n) ∑_{i=1}^n ∇ℓ(X_i; θ) + ( (1/n) ∑_{i=1}^n ∇²ℓ(X_i; θ*_n) ) √n(θ̂_n - θ).

Under regularity conditions we should have:

    (1/√n) ∑_{i=1}^n ∇ℓ(X_i; θ) →d N_p(0, Cov[∇ℓ(X_i; θ)]),
    (1/n) ∑_{i=1}^n ∇²ℓ(X_i; θ*_n) →p E[∇²ℓ(X_i; θ)].

Regularity Conditions
(B1) The parameter space Θ ⊆ R^p is open.
(B2) The support of f(·|θ), supp f(·|θ), is independent of θ.
(B3) All mixed partial derivatives of ℓ w.r.t. θ up to degree 3 exist and are continuous.
(B4) E[∇ℓ(X_i; θ)] = 0 ∀θ, and Cov[∇ℓ(X_i; θ)] =: I(θ) ≻ 0 ∀θ.
(B5) -E[∇²ℓ(X_i; θ)] =: J(θ) ≻ 0 ∀θ.
(B6) ∃ δ > 0 s.t. ∀ θ ∈ Θ and for all 1 ≤ j, k, l ≤ p,

    | ∂³ℓ(x; u) / (∂θ_j ∂θ_k ∂θ_l) | ≤ M_jkl(x)  for ‖θ - u‖ ≤ δ,

with M_jkl such that E[M_jkl(X_i)] < ∞.

The interpretation of the conditions is the same as in the one-dimensional case.

Theorem (Asymptotic Normality of the MLE)
Let X_1, ..., X_n be iid random variables with density (frequency) f(x; θ) satisfying conditions (B1)-(B6). If θ̂_n = θ̂(X_1, ..., X_n) is a consistent sequence of MLE estimators, then

    √n(θ̂_n - θ) →d N_p(0, J^{-1}(θ) I(θ) J^{-1}(θ)).

- The theorem remains true if each X_i is a random vector.
- The proof mimics that of the one-dimensional case.

Misspecification of Models

- Statistical models are typically mere approximations to reality.
- George E. P. Box: "all models are wrong, but some are useful".
- As worrying as this may seem, it may not be a problem in practice.
- Often the model is wrong, but is "close enough" to the true situation.
- Even if the model is wrong, the parameters often admit a fruitful interpretation in the context of the problem.

Example
Let X_1, ..., X_n be iid Exponential(λ) r.v.'s, but we have modelled them as having the following two-parameter density:

    f(x | α, θ) = (α/θ) (1 + x/θ)^{-(α+1)},  x > 0,

with α and θ positive unknown parameters to be estimated.

Example (cont'd)
Notice that the exponential distribution is not a member of this parametric family. However, letting α, θ → ∞ at rates such that α/θ → λ, we have

    f(x | α, θ) → λ exp(-λx).

Thus, we may approximate the true model from within this class: reasonable estimates α̂ and θ̂ will yield a density "close" to the true density.

Example
Let X_1, ..., X_n be independent random variables with variance σ² and mean

    E[X_i] = α + β t_i.

If we assume that the X_i are normal when they are in fact not, the MLEs of the parameters α, β, σ² remain good (in fact optimal in a sense) for the true parameters (Gauss-Markov theorem).

Misspecified Models and Likelihood

The Framework
- X_1, ..., X_n are iid r.v.'s with distribution F.
- We have assumed that the X_i admit a density in {f(x; θ)}_{θ ∈ Θ}.
- The true distribution F does not correspond to any of the {f_θ}.
- Let θ̂_n be a root of the likelihood equation,

    ∑_{i=1}^n ℓ'(X_i; θ̂_n) = 0,

where the log-likelihood ℓ(θ) is w.r.t. f(·|θ). What exactly is θ̂_n estimating?

Consider the functional parameter θ(F) defined by

    ∫_{-∞}^{+∞} ℓ'(x; θ(F)) dF(x) = 0.

Then, the plug-in estimator of θ(F), when using the edf F̂_n as an estimator of F, is given by solving

    ∫_{-∞}^{+∞} ℓ'(x; θ(F̂_n)) dF̂_n(x) = 0 ⟺ ∑_{i=1}^n ℓ'(X_i; θ̂_n) = 0,

so that the MLE is a plug-in estimator of θ(F).

What is the behaviour of the sequence {θ̂_n}_{n ≥ 1} as n → ∞?

Model Misspecification and the Likelihood

Theorem
Let X_1, ..., X_n iid ∼ F, and let θ̂_n be a random variable solving the equations ∑_{i=1}^n ℓ'(X_i; θ) = 0 for θ in the open set Θ. If
(a) ℓ' is a strictly monotone function on Θ for each x;
(b) ∫_{-∞}^{+∞} ℓ'(x; θ(F)) dF(x) = 0 has a unique solution θ = θ(F) on Θ;
(c) I(F) := ∫_{-∞}^{+∞} [ℓ'(x; θ(F))]² dF(x) < ∞;
(d) J(F) := -∫_{-∞}^{+∞} ℓ''(x; θ(F)) dF(x) < ∞;
(e) |ℓ'''(x; t)| ≤ M(x) for t ∈ (θ(F) - δ, θ(F) + δ), some δ > 0, and ∫_{-∞}^{+∞} M(x) dF(x) < ∞;
then

    θ̂_n →p θ(F)

and

    √n(θ̂_n - θ(F)) →d N(0, I(F)/J²(F)).

Proof.
Assume without loss of generality that ℓ'(x; θ) is strictly decreasing in θ. Let ε > 0 and observe that

    P[|θ̂_n - θ(F)| > ε] = P[ {θ̂_n - θ(F) > ε} ∪ {θ(F) - θ̂_n > ε} ]
                        ≤ P[θ̂_n - θ(F) > ε] + P[θ(F) - θ̂_n > ε].

By our monotonicity assumption, we have

    {θ̂_n - θ(F) > ε} = {θ̂_n > θ(F) + ε} ⟹ (1/n) ∑_{i=1}^n ℓ'(X_i; θ(F) + ε) > 0,

because θ̂_n is the solution to the equation (1/n) ∑_{i=1}^n ℓ'(X_i; θ) = 0. Similarly, we also obtain

    {θ(F) - θ̂_n > ε} = {θ(F) - ε > θ̂_n} ⟹ (1/n) ∑_{i=1}^n ℓ'(X_i; θ(F) - ε) < 0.

Hence,

    P[|θ̂_n - θ(F)| > ε] ≤ P[ (1/n) ∑_{i=1}^n ℓ'(X_i; θ(F) + ε) > 0 ] + P[ (1/n) ∑_{i=1}^n ℓ'(X_i; θ(F) - ε) < 0 ].

We may re-write the first term on the right-hand side as

    P[ (1/n) ∑_{i=1}^n ℓ'(X_i; θ(F) + ε) - ∫_{-∞}^{+∞} ℓ'(x; θ(F) + ε) dF(x) > -∫_{-∞}^{+∞} ℓ'(x; θ(F) + ε) dF(x) ].

This converges to zero because the monotonicity assumption implies that -∫ ℓ'(x; θ(F) + ε) dF(x) > 0, and the law of large numbers implies that

    (1/n) ∑_{i=1}^n ℓ'(X_i; θ(F) + ε) →p ∫_{-∞}^{+∞} ℓ'(x; θ(F) + ε) dF(x).

Similar arguments give

    P[ (1/n) ∑_{i=1}^n ℓ'(X_i; θ(F) - ε) < 0 ] → 0,

and thus θ̂_n →p θ(F).

Expanding the equation that defines the estimator in a Taylor series gives

    0 = (1/√n) ∑_{i=1}^n ℓ'(X_i; θ̂_n)
      = (1/√n) ∑_{i=1}^n ℓ'(X_i; θ(F)) + √n(θ̂_n - θ(F)) · (1/n) ∑_{i=1}^n ℓ''(X_i; θ(F)) + ½ √n(θ̂_n - θ(F))² · (1/n) ∑_{i=1}^n ℓ'''(X_i; θ*_n).

Here, θ*_n lies between θ(F) and θ̂_n. Exercise: complete the proof by mimicking the proof of asymptotic normality of MLEs.

Model Misspecification and the Likelihood

What is the interpretation of the parameter θ(F) in the misspecified setup? Suppose that F has density (frequency) g, and assume that integration/differentiation may be interchanged:

    ∫_{-∞}^{+∞} (d/dθ) log f(x; θ) dF(x) = 0
    ⟺ (d/dθ) ∫_{-∞}^{+∞} log f(x; θ) dF(x) = 0
    ⟺ (d/dθ) [ ∫_{-∞}^{+∞} log f(x; θ) dF(x) - ∫_{-∞}^{+∞} log g(x) dF(x) ] = 0
    ⟺ (d/dθ) KL(f(x; θ) ‖ g(x)) = 0.

When the equation is assumed to have a unique solution, then this is to be thought of as a minimum of the KL-distance. Hence we may intuitively think of θ(F) as the element of Θ for which f_θ is "closest" to F in the KL-sense.

Remarks:
- The result extends immediately to the multivariate parameter case.
- Notice that the proof is essentially identical to the proof of the MLE asymptotics.
- The difference is the first part, where we show consistency. This is where assumptions (a) and (b) come in; these can be replaced by any set of assumptions yielding consistency.
- This indicates the subtleties that are involved when proving convergence for indirectly defined estimators.
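A small simulation sketch of this interpretation (the Gamma truth and the Exponential working model are illustrative assumptions, not from the notes): for the Exponential(λ) working model, ℓ'(x; λ) = 1/λ - x, so θ(F) solves E_F[1/λ - X] = 0, i.e. λ = 1/E_F[X], whatever F is. The MLE then converges to this pseudo-true parameter even though the model is wrong:

```python
import random

random.seed(6)
# True F: Gamma(shape=2, scale=1), mean 2 -- NOT in the exponential model family.
x = [random.gammavariate(2.0, 1.0) for _ in range(50_000)]

# Working model: Exponential(lam), l(x; lam) = log(lam) - lam*x,
# so l'(x; lam) = 1/lam - x and the likelihood equation gives lam_hat = 1/mean.
lam_hat = len(x) / sum(x)

# Pseudo-true parameter theta(F): E_F[l'(X; lam)] = 0  =>  lam = 1/E_F[X] = 0.5
lam_pseudo_true = 0.5
assert abs(lam_hat - lam_pseudo_true) < 0.01
```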

1 Statistics as a Random Game
2 Risk (Expected Loss)

The Framework

Statistical Theory

3 Admissibility and Inadmissibility

Victor Panaretos, École Polytechnique Fédérale de Lausanne

4 Minimax Rules

5 Bayes Rules

6 Randomised Rules

Statistics as a Random Game?

Nature and a statistician decide to play a game.

How the game is played. First we agree on the rules:
1 Fix a parametric family {F_θ}_{θ ∈ Θ}.
2 Fix an action space A.
3 Fix a loss function L.

Then we play:
1 Nature selects (plays) θ_0 ∈ Θ.
2 The statistician observes X ∼ F_{θ_0}.
3 The statistician plays α ∈ A in response.
4 The statistician has to pay Nature L(θ_0, α).

What's in the box?
- A family of distributions F = {F_θ}_{θ ∈ Θ}, usually assumed to admit densities (frequencies). This is the variant of the game we decide to play.
- A parameter space Θ ⊆ R^p which parametrizes the family. This represents the space of possible plays/moves available to Nature.
- A data space X, on which the parametric family is supported. This represents the space of possible outcomes following a play by Nature.
- An action space A, which represents the space of possible actions or decisions or plays/moves available to the statistician.
- A loss function L : Θ × A → R_+. This represents how much the statistician has to pay Nature when losing.
- A set D of decision rules. Any δ ∈ D is a (measurable) function δ : X → A. These represent the possible strategies available to the statistician.

Framework proposed by A. Wald in 1939. It encompasses three basic statistical problems:
- Point estimation
- Hypothesis testing
- Interval estimation

Point Estimation as a Game

In the problem of point estimation we have:
1 a fixed parametric family {F_θ}_{θ ∈ Θ};
2 a fixed action space A = Θ;
3 a fixed loss function L(θ, α) (e.g. ‖θ - α‖²).

The game now evolves simply as:
1 Nature picks θ_0 ∈ Θ.
2 The statistician observes X ∼ F_{θ_0}.
3 The statistician plays δ(X) ∈ A = Θ.
4 The statistician loses L(θ_0, δ(X)).

Notice that in this setup δ is an estimator (it is a statistic X → Θ). The statistician always loses.
- Is there a good strategy δ ∈ D for the statistician to restrict his losses?
- Is there an optimal strategy?

Risk of a Decision Rule

The statistician would like to pick the strategy δ so as to minimize his losses. But losses are random, as they depend on X.

Definition (Risk)
Given a parameter θ ∈ Θ, the risk of a decision rule δ : X → A is the expected loss incurred when employing δ:

    R(θ, δ) = E_θ[L(θ, δ(X))].

This is the key notion of decision theory: decision rules should be compared by comparing their risk functions.

Example (Mean Squared Error)
In point estimation, the mean squared error

    MSE(δ(X)) = E_θ[‖θ - δ(X)‖²]

is the risk corresponding to a squared error loss function.

Coin Tossing Revisited

Consider the "coin tossing game" with quadratic loss:
- Nature picks θ ∈ [0, 1].
- We observe n iid variables X_i ∼ Bernoulli(θ).
- The action space is A = [0, 1].
- The loss function is L(θ, α) = (θ - α)².

Consider 3 different decision procedures {δ_j}_{j=1}^3:
1 δ_1(X) = (1/n) ∑_{i=1}^n X_i
2 δ_2(X) = X_1
3 δ_3(X) = 1/2

The risks associated with these decision rules, R_j(θ) = R(θ, δ_j(X)) = E_θ[(θ - δ_j(X))²], are:

    R_1(θ) = θ(1-θ)/n,
    R_2(θ) = θ(1-θ),
    R_3(θ) = (θ - 1/2)².

Let us compare these using their associated risks as benchmarks.
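The three risk functions above are explicit, so the comparison can be carried out in a few lines (the grid and n = 10 are illustrative choices):

```python
def R1(theta, n):  # risk of the sample mean: Var = theta*(1-theta)/n
    return theta * (1 - theta) / n

def R2(theta):     # risk of delta_2 = X_1
    return theta * (1 - theta)

def R3(theta):     # risk of the constant rule delta_3 = 1/2
    return (theta - 0.5) ** 2

n = 10
grid = [k / 100 for k in range(101)]
# delta_1 strictly dominates delta_2 (equal risk only at theta in {0, 1}) ...
assert all(R1(t, n) <= R2(t) for t in grid)
assert any(R1(t, n) < R2(t) for t in grid)
# ... but neither delta_1 nor delta_3 dominates the other:
assert R3(0.5) < R1(0.5, n) and R1(0.0, n) < R3(0.0)
```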

Coin Tossing Revisited

We saw that one decision rule may strictly dominate another rule (here R_2(θ) > R_1(θ) for every θ ∈ (0, 1), so δ_1 dominates δ_2).

[Figure: the risk functions R_1(θ), R_2(θ), R_3(θ) plotted against θ ∈ [0, 1].]

Risk of a Decision Rule

Definition (Inadmissible Decision Rule)
Let δ be a decision rule for the experiment ({F_θ}_{θ ∈ Θ}, L). If there exists a decision rule δ* that strictly dominates δ, i.e.

    R(θ, δ*) ≤ R(θ, δ) ∀ θ ∈ Θ, and ∃ θ' ∈ Θ : R(θ', δ*) < R(θ', δ),

then δ is called an inadmissible decision rule.

- An inadmissible decision rule is a "silly" strategy, since we can find a strategy that always does at least as well and sometimes better.
- However, "silly" is with respect to L and Θ (it may be that our choice of L is "silly"!!!).
- If we change the rules of the game (i.e. a different loss or a different parameter space), then the domination may break down.

Example (Exponential Distribution)
Let X_1, ..., X_n iid ∼ Exponential(λ), n ≥ 2. The MLE of λ is

    λ̂ = 1/X̄,

with X̄ the empirical mean. Observe that

    E_λ[λ̂] = nλ/(n-1).

It follows that λ̃ = (n-1)λ̂/n is an unbiased estimator of λ. Observe now that

    MSE_λ(λ̃) < MSE_λ(λ̂),

since λ̃ is unbiased and Var_λ(λ̃) < Var_λ(λ̂). Hence the MLE is an inadmissible rule for quadratic loss.

Notice that the parameter space in this example is (0, ∞). In such cases, quadratic loss tends to penalize over-estimation more heavily than under-estimation (the maximum possible under-estimation is bounded!). Considering a different loss function gives the opposite result! Let

    L(a, b) = a/b - 1 - log(a/b),

where, for each fixed a, lim_{b→0} L(a, b) = lim_{b→∞} L(a, b) = ∞. Now,

    R(λ, λ̃) = E_λ[ nλX̄/(n-1) - 1 - log( nλX̄/(n-1) ) ]
             = E_λ[ λX̄ - 1 - log(λX̄) ] + E_λ(λX̄)/(n-1) + log( (n-1)/n )
             > E_λ[ λX̄ - 1 - log(λX̄) ] = R(λ, λ̂),

where the last inequality holds because E_λ(λX̄) = 1 and 1/(n-1) > log(n/(n-1)).
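The quadratic-loss part of this example is easy to check by simulation (λ = 1, n = 10 and the replication count are illustrative choices; for n = 10 the exact values are MSE(λ̃) = 1/(n-2) = 0.125 and MSE(λ̂) = 1/6):

```python
import random

random.seed(7)
lam, n, reps = 1.0, 10, 20_000

se_hat = se_tilde = 0.0
for _ in range(reps):
    xbar = sum(random.expovariate(lam) for _ in range(n)) / n
    mle = 1 / xbar                  # lambda_hat = 1 / Xbar
    unb = (n - 1) / n * mle         # bias-corrected estimator lambda_tilde
    se_hat += (mle - lam) ** 2
    se_tilde += (unb - lam) ** 2

mse_hat, mse_tilde = se_hat / reps, se_tilde / reps
assert mse_tilde < mse_hat               # MLE inadmissible under quadratic loss
assert abs(mse_tilde - 1 / (n - 2)) < 0.02
```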

Criteria for Choosing Decision Rules

Definition (Admissible Decision Rule)
A decision rule δ is admissible for the experiment ({Fθ}θ∈Θ, L) if it is not strictly dominated by any other decision rule.

In non-trivial problems, it may not be easy at all to decide whether a given decision rule is admissible.
→ Stein’s paradox (“one of the most striking post-war results in mathematical statistics”, Brad Efron)

Admissibility is a minimal requirement; what about the opposite end (optimality)?

In almost any non-trivial experiment, there will be no decision rule that makes risk uniformly smallest over θ.

Narrow down the class of possible decision rules by unbiasedness/symmetry/... considerations, and try to find rules that uniformly dominate all other rules (next week!).

Minimax Decision Rules

Another approach to good procedures is to use global rather than local criteria (with respect to θ).
Rather than look at risk at every θ ↔ concentrate on maximum risk.

Definition (Minimax Decision Rule)
Let D be a class of decision rules for an experiment ({Fθ}θ∈Θ, L). If δ ∈ D is such that

    sup_{θ∈Θ} R(θ, δ) ≤ sup_{θ∈Θ} R(θ, δ′) ∀δ′ ∈ D,

then δ is called a minimax decision rule.

A minimax rule δ satisfies sup_{θ∈Θ} R(θ, δ) = inf_{κ∈D} sup_{θ∈Θ} R(θ, κ).
In the minimax setup, a rule is preferable to another if it has smaller maximum risk.


A few comments on minimaxity:

Motivated as follows: we do not know anything about θ, so let us insure ourselves against the worst thing that can happen.

Makes sense if you are in a zero-sum game: if your opponent chooses θ to maximize L, then one should look for minimax rules. But is nature really an opponent?

If there is no reason to believe that nature is trying to “do her worst”, then the minimax principle is overly conservative: it places all its emphasis on the “bad θ”.

Minimax rules may not be unique, and may not even be admissible. A minimax rule may very well dominate another minimax rule.
A unique minimax rule is (obviously) admissible.

Minimaxity can lead to counterintuitive results. A rule may dominate another rule except on a small region of Θ, where the other rule achieves a smaller supremum risk.

[Figures: risk curves over θ ∈ [−3, 3] illustrating an inadmissible minimax rule and a counterintuitive minimax rule.]


Bayes Decision Rules

Wanted to compare decision procedures using global rather than local criteria (with respect to θ).
We arrived at the minimax principle by assuming we have no idea about the true value of θ.
Suppose we have some prior belief about the value of θ. How can this be factored into our risk-based considerations?
Rather than look at risk at every θ ↔ concentrate on average risk.

Definition (Bayes Risk)
Let π(θ) be a probability density (frequency) on Θ and let δ be a decision rule for the experiment ({Fθ}θ∈Θ, L). The π-Bayes risk of δ is defined as

    r(π, δ) = ∫_Θ R(θ, δ) π(θ) dθ = ∫_Θ ∫_X L(θ, δ(x)) Fθ[dx] π(θ) dθ.

The prior π(θ) places different emphasis on different values of θ, based on our prior belief/knowledge.

Bayes principle: a decision rule is preferable to another if it has smaller Bayes risk (which depends on the prior π(θ)!).

Definition (Bayes Decision Rule)
Let D be a class of decision rules for an experiment ({Fθ}θ∈Θ, L) and let π(·) be a probability density (frequency) on Θ. If δ ∈ D is such that

    r(π, δ) ≤ r(π, δ′) ∀δ′ ∈ D,

then δ is called a Bayes decision rule with respect to π.

The minimax principle aims to minimize the maximum risk.
The Bayes principle aims to minimize the average risk.
Sometimes no Bayes rule exists, because the infimum may not be attained for any δ ∈ D. However, in such cases ∀ε > 0 ∃δε ∈ D : r(π, δε) < inf_{δ∈D} r(π, δ) + ε.
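As a sketch of how a Bayes risk comparison looks in practice (the Binomial experiment and uniform prior are illustrative assumptions, not an example from the slides), r(π, δ) can be approximated by a midpoint rule over Θ = (0, 1). Under quadratic loss and the uniform prior, the posterior mean (X+1)/(n+2) should come out with a smaller Bayes risk than the MLE X/n.

```python
import math

def risk_sq(n, theta, rule):
    # frequentist risk under quadratic loss, X ~ Binomial(n, theta)
    return sum(math.comb(n, x) * theta**x * (1 - theta)**(n - x) * (rule(x) - theta)**2
               for x in range(n + 1))

def bayes_risk(n, rule, grid=2000):
    # r(pi, delta) = integral of R(theta, delta) * pi(theta) over Theta,
    # with pi = Uniform(0, 1), approximated by the midpoint rule
    return sum(risk_sq(n, (i + 0.5) / grid, rule) for i in range(grid)) / grid

n = 10
mle = lambda x: x / n
post_mean = lambda x: (x + 1) / (n + 2)  # posterior mean under the uniform prior

r_mle, r_pm = bayes_risk(n, mle), bayes_risk(n, post_mean)
```

For this setup the exact values are r(π, MLE) = 1/(6n) and r(π, posterior mean) = (n/6 + 1/3)/(n+2)², so the ordering can be checked against closed forms.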

Admissibility of Bayes Rules

Rule of thumb: Bayes rules are nearly always admissible.

Theorem (Discrete Case Admissibility)
Assume that Θ = {θ1, ..., θt} is a finite space and that the prior satisfies π(θi) > 0, i = 1, ..., t. Then a Bayes rule with respect to π is admissible.

Proof.
Let δ be a Bayes rule, and suppose that κ strictly dominates δ. Then

    R(θj, κ) ≤ R(θj, δ) ∀j
    ⇒ R(θj, κ)π(θj) ≤ R(θj, δ)π(θj) ∀j
    ⇒ Σ_j R(θj, κ)π(θj) < Σ_j R(θj, δ)π(θj),

which is a contradiction (the strict inequality follows by strict domination and the fact that π(θj) is always positive).

Theorem (Uniqueness and Admissibility)
If a Bayes rule is unique, it is admissible.

Proof.
Suppose that δ is a unique Bayes rule and assume that κ strictly dominates it. Then

    ∫_Θ R(θ, κ)π(θ)dθ ≤ ∫_Θ R(θ, δ)π(θ)dθ,

as a result of strict domination and π(θ) being non-negative. This implies that κ either improves upon δ, or κ is itself a Bayes rule. Either possibility contradicts our assumptions.


Theorem (Continuous Case Admissibility)
Let Θ ⊂ R^d. Assume that the risk functions R(θ, δ) are continuous in θ for all decision rules δ ∈ D. Suppose that π places positive mass on any open subset of Θ. Then a Bayes rule with respect to π is admissible.

Proof.
Let δ be a Bayes rule with respect to π, and let κ be a decision rule that strictly dominates δ. Let Θ0 be the set on which R(θ, κ) < R(θ, δ). Given a θ0 ∈ Θ0, we have R(θ0, κ) < R(θ0, δ). By continuity, there must exist an ε > 0 such that R(θ, κ) < R(θ, δ) for all θ satisfying ‖θ − θ0‖ < ε. It follows that Θ0 is open and hence, by our assumption, π[Θ0] > 0. Therefore, it must be that

    ∫_{Θ0} R(θ, κ)π(θ)dθ < ∫_{Θ0} R(θ, δ)π(θ)dθ.

Observe now that

    r(π, κ) = ∫_Θ R(θ, κ)π(θ)dθ
            = ∫_{Θ0} R(θ, κ)π(θ)dθ + ∫_{Θ0^c} R(θ, κ)π(θ)dθ
            < ∫_{Θ0} R(θ, δ)π(θ)dθ + ∫_{Θ0^c} R(θ, δ)π(θ)dθ
            = r(π, δ),

since ∫_{Θ0^c} R(θ, κ)π(θ)dθ ≤ ∫_{Θ0^c} R(θ, δ)π(θ)dθ, while we have strict inequality on Θ0. This contradicts our assumption that δ is a Bayes rule.

The continuity assumption and the assumption on π ensure that Θ0 is not an isolated set and has positive measure, so that it “contributes” to the integral.

Randomised Decision Rules

Given
decision rules δ1, ..., δk,
probabilities pi ≥ 0, Σ_{i=1}^k pi = 1,
we may define a new decision rule

    δ∗ = Σ_{i=1}^k pi δi,

called a randomised decision rule. Interpretation:

Given data X, choose a δi randomly according to p, but independently of X. If δj is the outcome (1 ≤ j ≤ k), then take action δj(X).

→ Risk of δ∗ is the average risk: R(θ, δ∗) = Σ_{i=1}^k pi R(θ, δi)
Appears artificial, but often minimax rules are randomised.
There are examples of randomised rules with supθ R(θ, δ∗) < supθ R(θ, δi) ∀i.

Minimum Variance Unbiased Estimation

Statistical Theory

Victor Panaretos
Ecole Polytechnique Fédérale de Lausanne

1 Optimality in the Decision Theory Framework
2 Uniform Optimality in Unbiased Quadratic Estimation
3 The role of sufficiency and “Rao-Blackwellization”
4 The role of completeness in Uniform Optimality
5 Lower Bounds for the Risk and Achieving them

Statistical Theory () MVUE 1 / 24

Decision Theory Framework

The decision theory framework includes:
A family of distributions F, usually assumed to admit densities (frequencies), and a parameter space Θ ⊆ R^p which parametrizes the family, F = {Fθ}θ∈Θ.
A data space X, on which the parametric family is supported.
An action space A, which represents the space of possible actions available to the statistician. In point estimation, Θ ≡ A.
A loss function L : Θ × A → R+. This represents the loss incurred when estimating θ ∈ Θ by α ∈ A.
A set D of decision rules. Any δ ∈ D is a (measurable) function δ : X → A. In point estimation, decision rules are simply estimators.

Performance of decision rules was to be judged by the risk they induce:

    R(θ, δ) = Eθ[L(θ, δ(X))], θ ∈ Θ, X ∼ Fθ, δ ∈ D.

Optimality in Point Estimation

Saw how point estimation can be seen as a game: Nature VS Statistician. An optimal decision rule would be one that uniformly minimizes risk:

    R(θ, δOPTIMAL) ≤ R(θ, δ), ∀θ ∈ Θ & ∀δ ∈ D.

But such rules can very rarely be determined.
→ optimality becomes a vague concept
→ it can be made precise in many ways...

Avenues to studying optimal decision rules include:
Restricting attention to global rather than local risk criteria → Bayes and minimax risk.
Focusing on restricted classes of rules → e.g. Minimum Variance Unbiased Estimation.
Studying risk behaviour asymptotically (n → ∞) → e.g. Asymptotic Relative Efficiency.


Unbiased Estimators under Quadratic Loss

Focus on Point Estimation
1 Assume that Fθ is known up to the parameter θ, which is unknown
2 Let (x1, ..., xn) be a realization of X ∼ Fθ which is available to us
3 Estimate the value of θ that generated the sample, given (x1, ..., xn)

Focus on Quadratic Loss
The error incurred when estimating θ by θ̂ = δ(X) is

    L(θ, θ̂) = ‖θ − θ̂‖²,

giving the MSE as risk: R(θ, θ̂) = Eθ‖θ − θ̂‖² = Variance + Bias².

RESTRICT the class of estimators (= decision rules): consider ONLY unbiased estimators,

    D := {δ : X → Θ | Eθ[δ(X)] = θ}.

Comments on Unbiasedness

The unbiasedness requirement is one means of reducing the class of rules/estimators we are considering.
→ Other requirements could be invariance or equivariance, e.g. δ(X + c) = δ(X) + c.
The risk reduces to the variance, since the bias is zero.
Not necessarily a sensible requirement → e.g. it violates the “likelihood principle”.
Unbiased estimators may not exist in a particular problem.
Unbiased estimators may be silly for a particular problem.
However, unbiasedness can be a reasonable/natural requirement in a wide class of point estimation problems.
Unbiasedness can be defined for more general loss functions, but is not as conceptually clear (and with tractable theory) as for quadratic loss.
→ δ is unbiased under L if Eθ[L(θ′, δ)] ≥ Eθ[L(θ, δ)] ∀θ, θ′ ∈ Θ.
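The Variance + Bias² decomposition of the MSE can be checked by direct enumeration; the Binomial setup and the shrinkage rule below are illustrative assumptions, not from the slides.

```python
import math

def mse_parts(n, theta, rule):
    # returns (MSE, variance, squared bias) for X ~ Binomial(n, theta)
    pmf = [math.comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)]
    mean = sum(p * rule(x) for x, p in enumerate(pmf))
    var = sum(p * (rule(x) - mean)**2 for x, p in enumerate(pmf))
    mse = sum(p * (rule(x) - theta)**2 for x, p in enumerate(pmf))
    return mse, var, (mean - theta)**2

n = 20
# a deliberately biased rule, so both terms of the decomposition are visible
mse, var, bias2 = mse_parts(n, 0.3, lambda x: (x + 1) / (n + 2))
```

For an unbiased rule the bias term vanishes and the risk is exactly the variance, which is why restricting to unbiased estimators turns risk comparison into variance comparison.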

Comments on Unbiased Estimators

Example (Unbiased Estimators Need not Exist)
Let X ∼ Binomial(n, θ), with θ unknown but n known. We wish to estimate

    ψ = sin θ.

We require that our estimator δ(X) be unbiased, Eθ[δ] = ψ = sin θ. Such an estimator satisfies

    Σ_{x=0}^n δ(x) C(n, x) θ^x (1 − θ)^{n−x} = sin θ,

but this cannot hold for all θ, since the sine function cannot be represented as a finite polynomial. The class of unbiased estimators in this case is empty.

Example (Unbiased Estimators May Be “Silly”)
Let X ∼ Poisson(λ). We wish to estimate the parameter

    ψ = e^{−2λ}.

If δ(X) is an unbiased estimator of ψ, then

    Σ_{x=0}^∞ δ(x) e^{−λ} λ^x/x! = e^{−2λ}
    ⇒ Σ_{x=0}^∞ δ(x) λ^x/x! = e^{−λ} = Σ_{x=0}^∞ (−1)^x λ^x/x!,

so that δ(X) = (−1)^X is the only unbiased estimator of ψ. But 0 < ψ < 1 for λ > 0, so this is clearly a silly estimator.
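The Poisson example can also be checked numerically: truncating the Poisson series far out in the tail, the estimator (−1)^X has expectation e^{−2λ}, so it really is unbiased for ψ, despite only ever taking the values ±1 (the λ values used below are illustrative).

```python
import math

def expect_sign(lam, terms=120):
    # E_lambda[(-1)^X] for X ~ Poisson(lam); the series is truncated at `terms`,
    # far beyond where the Poisson mass is negligible for small lam
    return sum((-1)**x * math.exp(-lam) * lam**x / math.factorial(x)
               for x in range(terms))
```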


Example (A Non-Trivial Example)
Let X1, ..., Xn be iid random variables with density

    f(x; µ) = e^{−(x−µ)}, x ≥ µ ∈ R.

Two possible unbiased estimators are

    µ̂ = X(1) − 1/n  &  µ̃ = X̄ − 1.

In fact, tµ̂ + (1 − t)µ̃ is unbiased for any t. Simple calculations reveal

    R(µ, µ̂) = Var(µ̂) = 1/n²  &  R(µ, µ̃) = Var(µ̃) = 1/n,

so that µ̂ dominates µ̃. Will it dominate any other unbiased estimator?

Unbiased Estimation and Sufficiency

Theorem (Rao-Blackwell Theorem)
Let X be distributed according to a distribution depending on an unknown parameter θ and let T be a sufficient statistic for θ. Let δ be a decision rule such that
1 Eθ[δ(X)] = g(θ) for all θ
2 Varθ(δ(X)) < ∞ for all θ.
Then δ∗ := E[δ | T] is an unbiased estimator of g(θ) that dominates δ, i.e.
1 Eθ[δ∗(X)] = g(θ) for all θ.
2 Varθ(δ∗(X)) ≤ Varθ(δ(X)) for all θ.
Moreover, the inequality is replaced by equality if and only if Pθ[δ∗ = δ] = 1.

The theorem indicates that any candidate minimum variance unbiased estimator should be a function of the sufficient statistic.

(Note that µ̂ depends only on the one-dimensional sufficient statistic X(1).)

Intuitively, an estimator that takes into account aspects of the sample that are irrelevant with respect to θ can always be improved (throwing away some irrelevant information improves risk).

Proof.
Since T is sufficient for θ, E[δ | T = t] = h(t) is independent of θ, so that δ∗ is well-defined as a statistic (it depends only on X). Then,

    Eθ[δ∗(X)] = Eθ[E[δ(X) | T(X)]] = Eθ[δ(X)] = g(θ).

Furthermore, we have

    Varθ(δ) = Varθ[E(δ | T)] + Eθ[Var(δ | T)] ≥ Varθ[E(δ | T)] = Varθ(δ∗).

In addition, note that

    Var(δ | T) := E[(δ − E[δ | T])² | T] = E[(δ − δ∗)² | T],

so that Eθ[Var(δ | T)] = Eθ(δ − δ∗)² > 0 unless Pθ(δ∗ = δ) = 1.

Exercise
Show that Var(Y) = E[Var(Y | X)] + Var[E(Y | X)] when Var(Y) < ∞.

Unbiasedness and Sufficiency

Any admissible unbiased estimator should be a function of a sufficient statistic.
→ If not, we can dominate it by its conditional expectation given a sufficient statistic.
But is any function of a sufficient statistic admissible (provided that it is unbiased)?

Suppose that δ is an unbiased estimator of g(θ) and T, S are θ-sufficient. What is the relationship between Varθ(E[δ | T]) =: Varθ(δ∗_T) and Varθ(E[δ | S]) =: Varθ(δ∗_S)?

Intuition suggests that whichever of T, S carries the least irrelevant information (in addition to the relevant information) should “win”.
→ More formally, if T = h(S) then we should expect that δ∗_T dominates δ∗_S.

Proposition
Let δ be an unbiased estimator of g(θ) and, for T, S two θ-sufficient statistics, define δ∗_T := E[δ | T] & δ∗_S := E[δ | S]. Then the following implication holds:

    T = h(S) ⇒ Varθ(δ∗_T) ≤ Varθ(δ∗_S).

1 Essentially this means that the best possible “Rao-Blackwellization” is achieved by conditioning on a minimal sufficient statistic.
2 This does not necessarily imply that for T minimally sufficient and δ unbiased, E[δ | T] has minimum variance.
→ In fact, it does not even imply that E[δ | T] is admissible.

Proof.
Recall the tower property of conditional expectation: if Y = f(X), then E[Z | Y] = E{E(Z | X) | Y}. Since T = h(S) we have

    δ∗_T = E[δ | T] = E[E(δ | S) | T] = E[δ∗_S | T].

The conclusion now follows from the Rao-Blackwell theorem.

A mathematical remark
To better understand the tower property intuitively, recall that E[Z | Y] is the minimizer of E[(Z − φ(Y))²] over all (measurable) functions φ of Y. You can combine that with the fact that √(E[(X − Y)²]) defines a Hilbert norm on random variables with finite variance to get geometric intuition.
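Rao-Blackwellization can be watched in action by full enumeration for small Bernoulli samples (n = 5 and θ = 0.3 are illustrative choices): δ(X) = X1 is unbiased for θ but wasteful, and conditioning on the sufficient statistic T = ΣXi gives E[X1 | T] = T/n by symmetry.

```python
import itertools

n, theta = 5, 0.3
samples = list(itertools.product([0, 1], repeat=n))
prob = lambda x: theta**sum(x) * (1 - theta)**(n - sum(x))

delta = lambda x: x[0]             # unbiased, but uses only the first observation
delta_star = lambda x: sum(x) / n  # E[delta | T] with T = sum(X), by symmetry

def mean_var(rule):
    m = sum(rule(x) * prob(x) for x in samples)
    v = sum((rule(x) - m)**2 * prob(x) for x in samples)
    return m, v

m0, v0 = mean_var(delta)       # (theta, theta*(1-theta))
m1, v1 = mean_var(delta_star)  # (theta, theta*(1-theta)/n)
```

Both rules are unbiased, but conditioning cuts the variance by the factor n, as the theorem guarantees.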

Completeness, Sufficiency, Unbiasedness, and Optimality

Theorem (Lehmann-Scheffé Theorem)
Let T be a complete sufficient statistic for θ and let δ be a statistic such that Eθ[δ] = g(θ) and Varθ(δ) < ∞, ∀θ ∈ Θ. If δ∗ := E[δ | T] and V is any other unbiased estimator of g(θ), then
1 Varθ(δ∗) ≤ Varθ(V), ∀θ ∈ Θ
2 Varθ(δ∗) = Varθ(V) ⇒ Pθ[δ∗ = V] = 1.
That is, δ∗ := E[δ | T] is the unique Uniformly Minimum Variance Unbiased Estimator of g(θ).

The theorem says that if a complete sufficient statistic T exists, then the MVUE of g(θ) (if it exists) must be a function of T.
Moreover, it establishes that whenever a UMVUE exists, it is unique.
It can be used to examine whether unbiased estimators exist at all: if a complete sufficient statistic T exists, but there exists no function h with E[h(T)] = g(θ), then no unbiased estimator of g(θ) exists.

Proof.
To prove (1) we go through the following steps:
Take V to be any unbiased estimator with finite variance.
Define its “Rao-Blackwellized” version V∗ := E[V | T].
By unbiasedness of both estimators,

    0 = Eθ[V∗ − δ∗] = Eθ[E[V | T] − E[δ | T]] = Eθ[h(T)], ∀θ ∈ Θ.

By completeness of T we conclude Pθ[h(T) = 0] = 1 for all θ. In other words, Pθ[V∗ = δ∗] = 1 for all θ.
But V∗ dominates V by the Rao-Blackwell theorem.
Hence Varθ(δ∗) = Varθ(V∗) ≤ Varθ(V).

For part (2) (the uniqueness part), notice that from our reasoning above

    Varθ(V) = Varθ(δ∗) ⇒ Varθ(V) = Varθ(V∗).

But the Rao-Blackwell theorem says Varθ(V) = Varθ(V∗) ⟺ Pθ[V = V∗] = 1.

Statistical Theory () MVUE 15 / 24 Statistical Theory () MVUE 16 / 24 Completeness, Sufficiency, Unbiasedness, and Optimality Example (Bernoulli Trials) First suppose that n = 2. If a UMVUE exists, it must be of the form h(T ) Taken together, the Rao-Blackwell and Lehmann-Scheff´etheorems also with h satisfying suggest approaches to finding UMVUE estimators when a complete 2 2 2 k 2 k sufficient statistic T exists: θ = h(k) θ (1 θ) − k − 1 Find a function h such that Eθ[h(T )] = g(θ). If Varθ[h(T )] < for k=0   ∞ X all θ, then δ = h(T ) is the unique UMVUE of g(θ). It is easy to see that h(0) = h(1) = 0 while h(2) = 1. Thus, for n = 2, , The function h can be found by solving the equation Eθ[h(T )] = g(θ) 2 → h(T ) = T (T 1)/2 is the unique UMVUE of θ . or by an educated guess. − For n > 2, set δ = 1 X1 + X2 = 2 and note that this is an unbiased 2 Given an unbiased estimator δ of g(θ), we may obtain the UMVUE by 2 { } estimator of θ . By the Lehmann-Scheff´etheorem, δ∗ = E[δ T ] is the | “Rao-Blackwellizing” with respect to the complete sufficient statistic: unique UMVUE estimator of θ2. We have Example (Bernoulli Trials) E[S T = t] = P[X1 + X2 = 2 T = t] | | Let X , ..., X iidBernoulli(θ). What is the UMVUE of θ2? Pθ[X1 + X2 = 2, X3 + ... + Xn = t 2] 1 n ∼ = − Pθ[T = t] By the Neyman factorization theorem T = X1 + ... + Xn is sufficient, 0 if t 1 t(t 1) Since the distribution of (X1, ..., Xn) is a 1-parameter exponential = ≤ = − . n 2 n n(n 1) family, T is also complete. ( t−2 / t if t 2) − ≥ −

Statistical Theory () MVUE 17 / 24 Statistical Theory ()  MVUE 18 / 24 Variance Lower Bounds for Unbiased Estimators Cuachy-Schwarz Bounds

Variance Lower Bounds for Unbiased Estimators

Often a minimal sufficient statistic exists but is not complete.
→ Cannot appeal to the Lehmann-Scheffé theorem in search of a UMVUE.
However, if we could establish a lower bound for the variance as a function of θ, then an estimator achieving this bound would be the unique UMVUE.

The Aim
For iid X1, ..., Xn with density (frequency) depending on θ unknown, we want to establish conditions under which

    Varθ[δ] ≥ φ(θ), ∀θ,

for any unbiased estimator δ. We also wish to determine φ(θ).

Cauchy-Schwarz Bounds

Theorem (Cauchy-Schwarz Inequality)
Let U, V be random variables with finite variance. Then,

    Cov(U, V) ≤ √(Var(U) Var(V)).

The theorem yields an immediate lower bound for the variance of an unbiased estimator δ0:

    Varθ(δ0) ≥ Covθ(δ0, U)² / Varθ(U),

which is valid for any random variable U with Varθ(U) < ∞ for all θ.

The bound can be made tight by choosing a suitable U. However, this is still not very useful as it falls short of our aim: the bound is specific to δ0, while we want a bound that holds for any unbiased estimator δ. Is there a smart choice of U for which Covθ(δ0, U) depends on g(θ) = Eθ(δ0) only (and so is not specific to δ0)? Let's take a closer look at this...

Optimizing the Cauchy-Schwarz Bound

Assume that θ is real and the following regularity conditions hold:

Regularity Conditions
(C1) The support A := {x : f(x; θ) > 0} is independent of θ
(C2) f(x; θ) is differentiable w.r.t. θ, ∀θ ∈ Θ
(C3) Eθ[∂/∂θ log f(X; θ)] = 0
(C4) For a statistic T = T(X) with Eθ|T| < ∞ and g(θ) = EθT differentiable,

    g′(θ) = Eθ[T · ∂/∂θ log f(X; θ)], ∀θ.

To make sense of (C3) and (C4), suppose that f(·; θ) is a density. Then

    d/dθ ∫ S(x) f(x; θ) dx = ∫ S(x) (d/dθ) f(x; θ) dx = ∫ S(x) f(x; θ) (d/dθ) log f(x; θ) dx,

provided integration/differentiation can be interchanged.

The Cramér-Rao Lower Bound

Theorem
Let X = (X1, ..., Xn) have joint density (frequency) f(x; θ) satisfying conditions (C1), (C2) and (C3). If the statistic T satisfies condition (C4), then

    Varθ(T) ≥ [g′(θ)]² / I(θ),

with I(θ) = Eθ[ (∂/∂θ log f(X; θ))² ].

Proof.
By the Cauchy-Schwarz inequality with U = ∂/∂θ log f(X; θ),

    Varθ(T) ≥ Covθ(T, ∂/∂θ log f(X; θ))² / Varθ(∂/∂θ log f(X; θ)).

Since Eθ[∂/∂θ log f(X; θ)] = 0, we have Varθ(∂/∂θ log f(X; θ)) = I(θ).

Also, observe that

    Covθ(T, ∂/∂θ log f(X; θ)) = Eθ[T · ∂/∂θ log f(X; θ)] − Eθ[T] Eθ[∂/∂θ log f(X; θ)]
                              = Eθ[T · ∂/∂θ log f(X; θ)]
                              = (d/dθ) Eθ[T]
                              = g′(θ),

which completes the proof.

When is the Cramér-Rao lower bound achieved? If

    Varθ[T] = [g′(θ)]² / I(θ),

then

    Varθ[T] = Covθ(T, ∂/∂θ log f(X; θ))² / Varθ(∂/∂θ log f(X; θ)),

which occurs if and only if ∂/∂θ log f(X; θ) is a linear function of T (correlation ±1). That is, w.p. 1:

    ∂/∂θ log f(X; θ) = A(θ) T(x) + B(θ).

Solving this differential equation yields, for all x,

    log f(x; θ) = A∗(θ) T(x) + B∗(θ) + S(x),

so that Varθ(T) attains the lower bound if and only if the density (frequency) of X is a one-parameter exponential family as above.
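As a numerical footnote (continuing the earlier exponential example; the variance formula is re-derived from Gamma moments, not quoted from the slides): for estimating λ from n iid Exponential(λ) observations, the per-observation Fisher information is 1/λ², so the Cramér-Rao bound is λ²/n. The unbiased estimator (n−1)/(nX̄) has variance λ²/(n−2), which exceeds the bound for every finite n but approaches it as n grows.

```python
def crlb(n, lam):
    # Cramer-Rao lower bound for unbiased estimators of lam from n iid Exp(lam) draws
    return lam**2 / n

def var_unbiased(n, lam):
    # exact variance of the unbiased estimator (n-1)/(n*Xbar), valid for n > 2
    return lam**2 / (n - 2)

ratios = {n: var_unbiased(n, 1.0) / crlb(n, 1.0) for n in (3, 10, 100)}
```

Note that the score here is linear in ΣXi, so by the attainment criterion the bound is achieved exactly when estimating g(λ) = 1/λ by X̄, but not for λ itself.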

Testing Statistical Hypotheses

Statistical Theory

Victor Panaretos
Ecole Polytechnique Fédérale de Lausanne

1 Contrasting Theories With Experimental Evidence
2 Hypothesis Testing Setup
3 Type I vs Type II Error
4 The Neyman-Pearson Setup
5 Optimality in the Neyman-Pearson Setup

Statistical Theory (Week 10) Hypothesis Testing 1 / 19

Using Data to Evaluate Theories/Assertions

Scientific theories lead to assertions that are testable using empirical data.
Data may discredit the theory (call it a hypothesis) or not
→ i.e. are the empirical findings reasonable under the hypothesis?
Example: Large Hadron Collider at CERN, Genève. Does the Higgs boson exist? Study if particle trajectories are consistent with what theory predicts.
Example: the theory of “luminiferous aether” in the late 19th century, to explain light travelling in vacuum. Discredited by the Michelson-Morley experiment.
Similarities with the logical/mathematical concept of a necessary condition.
Formal statistical framework?

Statistical Framework for Testing Hypotheses

The Problem of Hypothesis Testing
X = (X1, ..., Xn) random variables with joint density/frequency f(x; θ), θ ∈ Θ, where Θ = Θ0 ∪ Θ1 and Θ0 ∩ Θ1 = ∅ (or Λ(Θ0 ∩ Θ1) = 0)
Observe realization x = (x1, ..., xn) of X ∼ fθ
Decide on the basis of x whether θ ∈ Θ0 or θ ∈ Θ1
→ Often dim(Θ0) < dim(Θ), so θ ∈ Θ0 represents a simplified model.

Example
Let X1, ..., Xn ~ iid N(µ, 1) and Y1, ..., Yn ~ iid N(ν, 1). Have θ = (µ, ν) and

    Θ = {(µ, ν) : µ ∈ R, ν ∈ R} = R².

May be interested to see if X and Y have the same distribution, even though they may be measurements on characteristics of different groups. In this case Θ0 = {(µ, ν) ∈ R² : µ = ν}.

Decision Theory Perspective on Hypothesis Testing

Given X we need to decide between two hypotheses:
H0: θ ∈ Θ0 (the NULL HYPOTHESIS)
H1: θ ∈ Θ1 (the ALTERNATIVE HYPOTHESIS)

Want a decision rule δ : X → A = {0, 1} (chooses between H0 and H1).
In hypothesis testing, δ is called a test function.
Often δ depends on X only through some real-valued statistic T = T(X), called a test statistic.

It is unlikely that a test function is perfect. Possible errors to be made?

    Action \ Truth    H0              H1
    0                 correct         Type II Error
    1                 Type I Error    correct

Potential asymmetry of errors in practice: false positive VS false negative (e.g. spam filters for e-mail).

Typically the loss function is the “0-1” loss, i.e.

    L(θ, a) = 1 if θ ∈ Θ0 & a = 1 (Type I Error)
              1 if θ ∈ Θ1 & a = 0 (Type II Error)
              0 otherwise (No Error),

i.e. we lose 1 unit whenever committing a type I or type II error. This leads to the following risk function:

    R(θ, δ) = Pθ[δ = 1] if θ ∈ Θ0 (prob. of type I error)
              Pθ[δ = 0] if θ ∈ Θ1 (prob. of type II error).

In short,

    R(θ, δ) = Pθ[δ = 1] 1{θ ∈ Θ0} + Pθ[δ = 0] 1{θ ∈ Θ1}
            = “Pθ[choose H1 | H0 is true]” OR “Pθ[choose H0 | H1 is true]”.


Optimal Testing?

As with point estimation, we may wish to find optimal test functions
→ find test functions that uniformly minimize risk?
Possible to do in some problems, but intractable in general
→ as in point estimation, optimality becomes a vague concept.
How to relax the problem in this case? Minimize the type I and type II error probabilities separately?
In general there is a trade-off between the two error probabilities. For example, consider two test functions δ1 and δ2 and let

    R1 = {x : δ1(x) = 1}  &  R2 = {x : δ2(x) = 1}.

Assume that R1 ⊂ R2. Then, for all θ ∈ Θ,

    Pθ[δ1(X) = 1] ≤ Pθ[δ2(X) = 1]
    Pθ[δ1(X) = 0] = 1 − Pθ[δ1(X) = 1] ≥ 1 − Pθ[δ2(X) = 1] = Pθ[δ2(X) = 0],

so by attempting to reduce the probability of error when θ ∈ Θ0 we may increase the probability of error when θ ∈ Θ1!

The Neyman-Pearson Setup

Classical approach: restrict the class of test functions by “minimax reasoning”
1 We fix an α ∈ (0, 1), usually small (called the significance level)
2 We declare that we will only consider test functions δ such that

    Pθ[δ = 1] ≤ α ∀θ ∈ Θ0,  i.e.  sup_{θ∈Θ0} Pθ[δ = 1] ≤ α,

i.e. rules for which the probability of type I error is bounded above by α
→ Jargon: we fix a significance level for our test
3 Within this restricted class of rules, choose δ to minimize the probability of type II error, Pθ[δ(X) = 0] = 1 − Pθ[δ(X) = 1], uniformly on Θ1
4 Equivalently, maximize the power uniformly over Θ1:

    β(θ, δ) = Pθ[δ(X) = 1] = Eθ[1{δ(X) = 1}] = Eθ[δ(X)], θ ∈ Θ1

(since δ = 1 ⟺ 1{δ = 1} = 1 and δ = 0 ⟺ 1{δ = 1} = 0).

Intuitive rationale of the approach:
Want to test H0 against H1 at significance level α.
Suppose we observe δ(X) = 1 (so we take action 1).
α is usually small, so that if H0 is indeed true, we have observed something rare or unusual
→ since δ = 1 has probability at most α under H0.
This is evidence that H0 is false (i.e. in favour of H1), so taking action 1 is a highly reasonable decision.
But what if we observe δ(X) = 0? (so we take action 0)
Our significance level does not guarantee that our decision is necessarily reasonable.
Our decision would have been reasonable if δ was such that the type II error was also low (given the significance level).
If we had maximized the power β at level α, though, then we would be reassured of our decision.

The Neyman-Pearson setup naturally exploits any asymmetric structure.
But, if natural asymmetry is absent, we need a judicious choice of H0.
Example: Obama VS McCain 2008. Pollsters gather an iid sample X from Ohio with Xi = 1{vote Obama}. Which pair of hypotheses to test?

    H0: Obama wins Ohio     OR    H0: McCain wins Ohio
    H1: McCain wins Ohio          H1: Obama wins Ohio

Which pair to choose to make a prediction? (confidence intervals?)
If Obama is conducting the poll to decide whether he'll spend more money to campaign in Ohio, then his possible losses due to errors are:
(a) Spend more $'s to campaign in Ohio even though he would win anyway: lose $'s
(b) Lose Ohio to McCain because he thought he would win without any extra effort.
(b) is much worse than (a) (especially since Obama had lots of $'s). Hence Obama would pick H0 = {McCain wins Ohio} as his null.

Finding Good Test Functions

Consider the simplest situation:
Have (X1, ..., Xn) ∼ f(·; θ) with Θ = {θ0, θ1}.
Want to test H0 : θ = θ0 vs H1 : θ = θ1.

The Neyman-Pearson Lemma
Let X = (X1, ..., Xn) have joint density (frequency) function f ∈ {f0, f1} and suppose we wish to test

    H0 : f = f0 VS H1 : f = f1.

Then, the test whose test function is given by

    δ(X) = 1 if f1(X) ≥ k·f0(X),  0 otherwise,

for some k ∈ (0, ∞), is a most powerful (MP) test of H0 versus H1 at significance level α = P0[δ(X) = 1] (= E0[δ(X)] = P0[f1(X) ≥ k·f0(X)]).

Proof.
Use the obvious notation E0, E1, P0, P1 corresponding to H0 or H1. It suffices to prove that if ψ is any function with ψ(x) ∈ {0, 1}, then

    E0[ψ(X)] ≤ E0[δ(X)] (= α by definition)  ⇒  E1[ψ(X)] (= β1(ψ)) ≤ E1[δ(X)] (= β1(δ))

(recall that β1(δ) = 1 − P1[δ = 0] = P1[δ = 1] = E1[δ]). WLOG assume that f0 and f1 are density functions. Note that

    f1(x) − k·f0(x) ≥ 0 if δ(x) = 1  &  f1(x) − k·f0(x) < 0 if δ(x) = 0.

Since ψ can only take the values 0 or 1,

    ψ(x)(f1(x) − k·f0(x)) ≤ δ(x)(f1(x) − k·f0(x))
    ⇒ ∫_{R^n} ψ(x)(f1(x) − k·f0(x)) dx ≤ ∫_{R^n} δ(x)(f1(x) − k·f0(x)) dx.

Rearranging the terms yields

    ∫_{R^n} (ψ(x) − δ(x)) f1(x) dx ≤ k ∫_{R^n} (ψ(x) − δ(x)) f0(x) dx
    ⇒ E1[ψ(X)] − E1[δ(X)] ≤ k (E0[ψ(X)] − E0[δ(X)]).

So when E0[ψ(X)] ≤ E0[δ(X)], the RHS is non-positive, i.e. δ is an MP test of H0 vs H1 at level α.

Essentially the result says that the optimal test statistic for simple hypotheses vs simple alternatives is T(X) = f1(X)/f0(X).
The optimal test function would then reject the null whenever T ≥ k.
k is chosen so that the test has the desirable level α.
The result does not guarantee existence of an MP test.
The result does not guarantee uniqueness when an MP test exists.

The general version of the Neyman-Pearson lemma considers the relaxed problem:

    maximize E1[δ] subject to E0[δ] = α & 0 ≤ δ(X) ≤ 1 a.s.

It is then proven that an optimal δ exists and is given by

    δ(X) = 1 if f1(X) > k f0(X),
           c if f1(X) = k f0(X),
           0 if f1(X) < k f0(X),

where k and c ∈ [0, 1] are such that the conditions are satisfied.
The optimum need not be a test function (relaxation ≡ randomization).
→ But when the test statistic T = f1/f0 is a continuous RV, then δ can be taken to have range {0, 1}, i.e. be a test function.
→ In this case an MP test function of H0 : f = f0 against H1 : f = f1 exists for any significance level α > 0.
→ When T is discrete, the optimum need not be a test function for certain levels α, unless we consider randomized tests as well.

Example (Exponential Distribution)
Let X1, ..., Xn ~ iid Exp(λ) and λ ∈ {λ0, λ1}, with λ1 > λ0 (say). Consider

    H0 : λ = λ0 vs H1 : λ = λ1.

Have

    f(x; λ) = Π_{i=1}^n λ e^{−λxi} = λ^n e^{−λ Σ_{i=1}^n xi}.

So Neyman-Pearson says we must base our test on the statistic

    T = f(X; λ1)/f(X; λ0) = (λ1/λ0)^n exp( (λ0 − λ1) Σ_{i=1}^n Xi ),

rejecting the null if T ≥ k, for k such that the level is α.

Example (cont’d)
To determine k we note that T is a decreasing function of S = Σ_{i=1}^n Xi (since λ0 < λ1). Therefore

    T ≥ k ⟺ S ≤ K

for some K, so that

    α = Pλ0[T ≥ k] ⟺ α = Pλ0[ Σ_{i=1}^n Xi ≤ K ].

For given values of λ0 and α it is entirely feasible to find the appropriate K: under the null hypothesis, S has a gamma distribution with parameters n and λ0. Hence we reject H0 at level α if S falls below the α-quantile of a Gamma(n, λ0) distribution.

Example (Uniform Distribution)
Let X1, ..., Xn ~ iid U[0, θ] with θ ∈ {θ0, θ1}, where θ0 > θ1. Consider

    H0 : θ = θ0 vs H1 : θ = θ1.

Recall that

    f(x; θ) = (1/θ^n) 1{ max_{1≤i≤n} Xi ≤ θ },

so an MP test of H0 vs H1 should be based on the discrete test statistic

    T = f(X; θ1)/f(X; θ0) = (θ0/θ1)^n 1{ X(n) ≤ θ1 }.

So if the test rejects H0 when X(n) ≤ θ1, then it is MP for H0 vs H1 at level

    α = Pθ0[X(n) ≤ θ1] = (θ1/θ0)^n,

with power Pθ1[X(n) ≤ θ1] = 1. What about smaller values of α?

Example (cont’d)
→ What about finding an MP test for α < (θ1/θ0)^n?
An intuitive test statistic is the sufficient statistic X(n), giving the test

    reject H0 iff X(n) ≤ k,

with k solving the equation

    Pθ0[X(n) ≤ k] = (k/θ0)^n = α,

i.e. with k = θ0 α^{1/n}, with power

    Pθ1[X(n) ≤ θ0 α^{1/n}] = (θ0 α^{1/n}/θ1)^n = α (θ0/θ1)^n.

Is this the MP test at level α < (θ1/θ0)^n though?
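The critical value K in the exponential example can be computed without special libraries: for integer n, the Gamma(n, λ0) (Erlang) CDF is a finite sum, and K solves Pλ0[S ≤ K] = α by bisection. The particular numbers n = 10, λ0 = 1, λ1 = 2, α = 0.05 below are illustrative assumptions.

```python
import math

def erlang_cdf(s, n, lam):
    # P[S <= s] for S ~ Gamma(shape n, rate lam), n a positive integer
    return 1.0 - sum(math.exp(-lam * s) * (lam * s)**k / math.factorial(k)
                     for k in range(n))

def critical_value(n, lam0, alpha, iters=200):
    # bisection for K with P_{lam0}[S <= K] = alpha; the MP test rejects when S <= K
    lo, hi = 0.0, 100.0 * n / lam0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, n, lam0) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n, lam0, lam1, alpha = 10, 1.0, 2.0, 0.05
K = critical_value(n, lam0, alpha)
power = erlang_cdf(K, n, lam1)  # P_{lam1}[S <= K], the power at the alternative
```

Since λ1 > λ0 makes S stochastically smaller under H1, the power exceeds the level, as it should for a reasonable test.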

Example (cont'd)
What about finding an MP test for α < (θ1/θ0)^n? An intuitive test statistic is the sufficient statistic X_(n), giving the test

reject H0 iff X_(n) ≤ k,

with k solving the equation

P_{θ0}[X_(n) ≤ k] = (k/θ0)^n = α,

i.e. with k = θ0 α^{1/n}, with power

P_{θ1}[X_(n) ≤ θ0 α^{1/n}] = (θ0 α^{1/n}/θ1)^n = α (θ0/θ1)^n.

Is this the MP test at level α < (θ1/θ0)^n, though?

Example (cont'd)
Use the general form of the Neyman-Pearson lemma to solve the relaxed problem:

maximize E1[δ(X)] subject to E_{θ0}[δ(X)] = α < (θ1/θ0)^n and 0 ≤ δ(x) ≤ 1.

One solution to this problem is given by

δ(X) = α(θ0/θ1)^n if X_(n) ≤ θ1, and 0 otherwise,

which is not a test function. However, we see that its power is

E_{θ1}[δ(X)] = α(θ0/θ1)^n = P_{θ1}[X_(n) ≤ θ0 α^{1/n}],

which is the power of the test we proposed. Hence the test that rejects H0 if X_(n) ≤ θ0 α^{1/n} is an MP test for all levels α < (θ1/θ0)^n.
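The closed-form size and power of the uniform test can be verified directly; the snippet below is an illustrative check (parameter values are hypothetical, chosen so that α < (θ1/θ0)^n):

```python
# Closed-form check of the uniform example: for the test rejecting when
# X_(n) <= theta0 * alpha**(1/n), the size is exactly alpha and the power is
# alpha * (theta0/theta1)**n, matching the randomized solution of the relaxed problem.
n, theta0, theta1, alpha = 5, 1.0, 0.8, 0.01   # alpha < (theta1/theta0)**n ~ 0.328

k = theta0 * alpha ** (1 / n)   # critical value; here k < theta1, as required
size = (k / theta0) ** n        # P_theta0[X_(n) <= k] = (k/theta0)^n
power = (k / theta1) ** n       # P_theta1[X_(n) <= k] = (k/theta1)^n
```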

Statistical Theory (Week 11) — Hypothesis Testing

Testing Statistical Hypotheses II
Statistical Theory
Victor Panaretos
Ecole Polytechnique Fédérale de Lausanne

1 Contrasting Theories With Experimental Evidence
2 Hypothesis Testing Setup
3 Type I vs Type II Error
4 The Neyman-Pearson Setup
5 Optimality in the Neyman-Pearson Setup

Neyman-Pearson Framework for Testing Hypotheses

The Problem of Hypothesis Testing
- X = (X1, ..., Xn) random variables with joint density/frequency f(x; θ), θ ∈ Θ, where Θ = Θ0 ∪ Θ1 and Θ0 ∩ Θ1 = ∅.
- Observe a realization x = (x1, ..., xn) of X ∼ f_θ.
- Decide on the basis of x whether θ ∈ Θ0 (H0) or θ ∈ Θ1 (H1).

Neyman-Pearson Framework:
1 Fix a significance level α for the test.
2 Among all rules respecting the significance level, pick the one that uniformly maximizes power.

When H0/H1 are both simple, the Neyman-Pearson lemma settles the problem. What about more general structures of Θ0, Θ1?

Uniformly Most Powerful Tests

A uniformly most powerful (UMP) test of H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1 at level α:
1 Respects the level for all θ ∈ Θ0 (i.e. for all possible simple nulls):

E_θ[δ] ≤ α for all θ ∈ Θ0.

2 Is most powerful for all θ ∈ Θ1 (i.e. for all possible simple alternatives):

E_θ[δ] ≥ E_θ[δ′] for all θ ∈ Θ1 and all δ′ respecting level α.

Unfortunately, UMP tests rarely exist. Why? Consider H0 : θ = θ0 vs H1 : θ ≠ θ0. A UMP test must be an MP test for any θ1 ≠ θ0. But the form of the MP test typically differs for θ1 > θ0 and θ1 < θ0!

Example (No UMP test exists)
Let X ∼ Binom(n, θ) and suppose we want to test

H0 : θ = θ0 vs H1 : θ ≠ θ0

at some level α. To this aim, consider first

H0′ : θ = θ0 vs H1′ : θ = θ1.

The Neyman-Pearson lemma gives the test statistic

T = f(X; θ1)/f(X; θ0) = [(1 − θ1)/(1 − θ0)]^n · [θ1(1 − θ0)/(θ0(1 − θ1))]^X.

- If θ1 > θ0 then T is increasing in X, so the MP test would reject for large values of X.
- If θ1 < θ0 then T is decreasing in X, so the MP test would reject for small values of X.

Example (A UMP test exists)
Let X1, ..., Xn iid Exp(λ) and suppose we wish to test

H0 : λ ≤ λ0 vs H1 : λ > λ0

at some level α. To this aim, consider first the pair

H0′ : λ = λ0 vs H1′ : λ = λ1

with λ1 > λ0, which we saw last time to admit an MP test for all λ1 > λ0:

Reject H0 for ∑_{i=1}^n X_i ≤ k, with k such that P_{λ0}[∑_{i=1}^n X_i ≤ k] = α.

But for λ < λ0, P_{λ0}[∑_{i=1}^n X_i ≤ k] = α implies P_λ[∑_{i=1}^n X_i ≤ k] < α. So the same test respects level α for all singletons under the null: the test is UMP for H0 vs H1.

When do UMP tests exist?

The examples give insight on which composite pairs typically admit UMP tests:
1 The hypothesis pair concerns a single real-valued parameter.
2 The hypothesis pair is "one-sided".

However, the existence of a UMP test does not depend only on the hypothesis structure, as was the case with simple vs simple; it also depends on the specific model. A sufficient condition?

Definition (Monotone Likelihood Ratio Property)
A family of density (frequency) functions {f(x; θ) : θ ∈ Θ}, with Θ ⊆ R, is said to have monotone likelihood ratio if there exists a real-valued function T(x) such that, for any θ1 < θ2, the function

f(x; θ2)/f(x; θ1)

is a non-decreasing function of T(x).

Proposition
Let X = (X1, ..., Xn) have a joint distribution of monotone likelihood ratio with respect to a statistic T, depending on θ ∈ R. Further assume that T is a continuous random variable. Then the test function given by

δ(X) = 1 if T(X) > k;  0 if T(X) ≤ k

is UMP among all tests with type I error bounded above by E_{θ0}[δ(X)], for the hypothesis pair

H0 : θ ≤ θ0 vs H1 : θ > θ0.

[The assumption of continuity of the random variable T can be removed by considering randomized tests as well, similarly as before.]
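As a small numerical sanity check (not part of the slides), one can verify the MLR property for the Binomial(n, θ) family, where the monotone statistic is T(x) = x: the ratio f(x; θ2)/f(x; θ1) should be non-decreasing in x whenever θ1 < θ2.

```python
# Illustrative check of monotone likelihood ratio for Binomial(n, theta).
from math import comb

def binom_pmf(x, n, theta):
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

n, th1, th2 = 10, 0.3, 0.6   # th1 < th2 (hypothetical values)
ratios = [binom_pmf(x, n, th2) / binom_pmf(x, n, th1) for x in range(n + 1)]

# The ratio should be non-decreasing in x.
is_monotone = all(a <= b for a, b in zip(ratios, ratios[1:]))
```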

A statistic T yielding a monotone likelihood ratio is necessarily a sufficient statistic.

Example (One-Parameter Exponential Family)
Let X = (X1, ..., Xn) have joint density (frequency)

f(x; θ) = exp[c(θ)T(x) − b(θ) + S(x)]

and assume WLOG that c(θ) is strictly increasing. For θ1 < θ2,

f(x; θ2)/f(x; θ1) = exp{[c(θ2) − c(θ1)]T(x) + b(θ1) − b(θ2)}

is increasing in T by monotonicity of c(·), since c(θ2) − c(θ1) > 0. Hence the UMP test of H0 : θ ≤ θ0 vs H1 : θ > θ0 would reject iff T(x) ≥ k, with α = P_{θ0}[T ≥ k].

Locally Most Powerful Tests

What if the MLR property fails to be satisfied? Can optimality be "saved"?

Consider θ ∈ R and the pair H0 : θ ≤ θ0 vs H1 : θ > θ0.
- Intuition: if the true θ is far from θ0, any reasonable test is powerful.
- So focus on maximizing power in a small neighbourhood of θ0.
- Consider the power function β(θ) = E_θ[δ(X)] of some δ.
- Require β(θ0) = α (a boundary condition, similar to the MLR setup).
- Assume that β(θ) is differentiable, so that for θ close to θ0,

β(θ) ≈ β(θ0) + β′(θ0)(θ − θ0) = α + β′(θ0)(θ − θ0).

Since Θ1 = (θ0, ∞), this suggests an approach for a locally most powerful (LMP) test:

Choose δ to maximize β′(θ0), subject to β(θ0) = α.

How do we solve this constrained optimization problem? The solution is similar to the Neyman-Pearson lemma. Supposing that X = (X1, ..., Xn) has density f(x; θ), then

β(θ) = ∫_{R^n} δ(x) f(x; θ) dx,
β′(θ) = ∫_{R^n} δ(x) (∂/∂θ) f(x; θ) dx   [provided the interchange of differentiation and integration is possible]
      = ∫_{R^n} δ(x) [(∂/∂θ) f(x; θ) / f(x; θ)] f(x; θ) dx
      = ∫_{R^n} δ(x) [(∂/∂θ) log f(x; θ)] f(x; θ) dx
      = E_θ[δ(X) S(X; θ)],

where S(X; θ) = (∂/∂θ) log f(X; θ) is the score.

Theorem
Let X = (X1, ..., Xn) have joint density (frequency) f(x; θ) and define the test function

δ(X) = 1 if S(X; θ0) ≥ k;  0 otherwise,

where k is such that E_{θ0}[δ(X)] = α. Then δ maximizes

E_{θ0}[ψ(X) S(X; θ0)]

over all test functions ψ satisfying the constraint E_{θ0}[ψ(X)] = α.

- This gives a recipe for constructing the LMP test.
- We were concerned about power only locally around θ0.
- It may not even give a level-α test for some θ < θ0.

Proof.
Consider ψ with ψ(x) ∈ {0, 1} for all x and E_{θ0}[ψ(X)] = α. Then

δ(x) − ψ(x) ≥ 0 if S(x; θ0) ≥ k,
δ(x) − ψ(x) ≤ 0 if S(x; θ0) ≤ k.

Therefore

E_{θ0}[(δ(X) − ψ(X))(S(X; θ0) − k)] ≥ 0.

Since E_{θ0}[δ(X) − ψ(X)] = 0, it must be that

E_{θ0}[δ(X) S(X; θ0)] ≥ E_{θ0}[ψ(X) S(X; θ0)].  ∎

How is the critical value k evaluated in practice?
- When the {X_i} are iid, S(X; θ) = ∑_{i=1}^n ℓ′(X_i; θ).
- Under regularity conditions, this is a sum of iid random variables with mean zero and variance I(θ).
- So, for θ = θ0 and large n, S(X; θ) ≈ N(0, n I(θ)) in distribution.

Example (Cauchy distribution)
Let X1, ..., Xn iid Cauchy(θ), with density

f(x; θ) = 1/[π(1 + (x − θ)²)],

and consider the hypothesis pair H0 : θ ≤ 0 vs H1 : θ > 0. We have

S(X; 0) = ∑_{i=1}^n 2X_i/(1 + X_i²),

so that the LMP test at level α rejects the null if S(X; 0) ≥ k, where k is chosen (obviously) to give level α: P0[S(X; 0) ≥ k] = α. While the exact distribution is difficult to obtain, for large n, S(X; 0) ≈ N(0, n/2).

Likelihood Ratio Tests

So far we have seen tests for Θ ⊆ R: simple vs simple, one-sided vs one-sided.
- Extension to the multiparameter case θ ∈ R^p? General Θ0, Θ1?
- Unfortunately, optimality theory breaks down in higher dimensions.
- Is there a general method for constructing reasonable tests?
- The idea: combine the Neyman-Pearson paradigm with maximum likelihood.

Definition (Likelihood Ratio)
The likelihood ratio statistic corresponding to the pair of hypotheses H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1 is defined to be

Λ(X) = sup_{θ∈Θ} f(X; θ) / sup_{θ∈Θ0} f(X; θ) = sup_{θ∈Θ} L(θ) / sup_{θ∈Θ0} L(θ).

"Neyman-Pearson"-esque approach: reject H0 for large Λ. Intuition: choose the "most favourable" θ ∈ Θ0 (in favour of H0) and compare it against the "most favourable" θ ∈ Θ (in favour of H1) in a simple vs simple setting (applying the NP lemma).

Example
Let X1, ..., Xn iid N(µ, σ²) where both µ and σ² are unknown. Consider H0 : µ = µ0 vs H1 : µ ≠ µ0. Then

Λ(X) = sup_{(µ,σ²)∈R×R+} f(X; µ, σ²) / sup_{(µ,σ²)∈{µ0}×R+} f(X; µ, σ²) = (σ̂0²/σ̂²)^{n/2} = [∑_{i=1}^n (X_i − µ0)² / ∑_{i=1}^n (X_i − X̄)²]^{n/2}.

So reject when Λ ≥ k, where k is such that P0[Λ ≥ k] = α. Distribution of Λ? By monotonicity, look only at

∑_{i=1}^n (X_i − µ0)² / ∑_{i=1}^n (X_i − X̄)² = 1 + n(X̄ − µ0)²/∑_{i=1}^n (X_i − X̄)² = 1 + (1/(n−1)) · n(X̄ − µ0)²/S² = 1 + T²/(n − 1),

with S² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)² and T = √n (X̄ − µ0)/S ∼ t_{n−1} under H0. So T² ∼ F_{1,n−1} under H0, and k may be chosen appropriately.

Distribution of the Likelihood Ratio?

More often than not, dist(Λ) is intractable (and there is no simple dependence on a statistic T with a tractable distribution either). Consider asymptotic approximations?

Setup:
- Θ an open subset of R^p.
- Either Θ0 = {θ0}, or Θ0 an open subset of R^s, where s < p.
- Concentrate on X = (X1, ..., Xn) with iid components.
- Initially restrict attention to H0 : θ = θ0 vs H1 : θ ≠ θ0. The LR becomes

Λn(X) = ∏_{i=1}^n f(X_i; θ̂n) / f(X_i; θ0),

where θ̂n is the MLE of θ.
- Impose the regularity conditions from MLE asymptotics.

Example
Let X1, ..., Xm iid Exp(λ) and Y1, ..., Yn iid Exp(θ), with X independent of Y. Consider H0 : θ = λ vs H1 : θ ≠ λ.

Unrestricted MLEs: λ̂ = 1/X̄ and θ̂ = 1/Ȳ. Restricted MLEs: λ̂0 = θ̂0 = (m + n)/(mX̄ + nȲ). The LR becomes

Λ = sup_{(λ,θ)∈R+²} f(X, Y; λ, θ) / sup_{(λ,θ)∈{(x,y)∈R+² : x=y}} f(X, Y; λ, θ)
  = [m/(m + n) + (n/(m + n))·(Ȳ/X̄)]^m · [n/(n + m) + (m/(m + n))·(X̄/Ȳ)]^n.

Λ depends on T = X̄/Ȳ and can be made large/small by varying T. But T ∼ F_{2m,2n} under H0, so given α we may find the critical value k.
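The algebraic identity behind the normal-mean LRT, Λ^{2/n} = 1 + T²/(n − 1), can be confirmed numerically; the snippet below is an illustrative check on simulated data (the sample values are arbitrary):

```python
# Numerical check of the normal-mean LRT identity Lambda^(2/n) = 1 + T^2/(n-1),
# with T = sqrt(n)*(Xbar - mu0)/S and S^2 the unbiased variance estimate.
import math, random

random.seed(1)
mu0 = 0.0
x = [random.gauss(0.5, 2.0) for _ in range(20)]   # hypothetical sample
n = len(x)
xbar = sum(x) / n
ss_mu0 = sum((xi - mu0) ** 2 for xi in x)          # restricted sum of squares
ss_xbar = sum((xi - xbar) ** 2 for xi in x)        # unrestricted sum of squares

lam = (ss_mu0 / ss_xbar) ** (n / 2)                # likelihood ratio statistic
S2 = ss_xbar / (n - 1)
T = math.sqrt(n) * (xbar - mu0) / math.sqrt(S2)

lhs = lam ** (2 / n)
rhs = 1 + T ** 2 / (n - 1)
```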

Asymptotic Distribution of the Likelihood Ratio

Theorem (Wilks' Theorem, case p = 1)
Let X1, ..., Xn be iid random variables with density (frequency) depending on θ ∈ R and satisfying conditions (A1)-(A6), with I(θ) = J(θ). If the MLE sequence θ̂n is consistent for θ, then the likelihood ratio statistic Λn for H0 : θ = θ0 satisfies

2 log Λn →d V ∼ χ²₁

when H0 is true.

- Obviously, knowing the approximate distribution of 2 log Λn is as good as knowing the approximate distribution of Λn for the purposes of testing (by monotonicity and the rejection method).
- The theorem extends immediately and trivially to the case of general p, for a hypothesis pair H0 : θ = θ0 vs H1 : θ ≠ θ0 (i.e. when the null hypothesis is simple).

Proof.
Under the conditions of the theorem, and when H0 is true,

√n(θ̂n − θ0) →d N(0, I⁻¹(θ0)).

Now take logarithms and expand in a Taylor series:

log Λn = ∑_{i=1}^n [ℓ(X_i; θ̂n) − ℓ(X_i; θ0)]
       = (θ0 − θ̂n) ∑_{i=1}^n ℓ′(X_i; θ̂n) − ½ (θ̂n − θ0)² ∑_{i=1}^n ℓ″(X_i; θ*_n)
       = −½ n(θ̂n − θ0)² · (1/n) ∑_{i=1}^n ℓ″(X_i; θ*_n),

where θ*_n lies between θ̂n and θ0 (the first-order term vanishes because θ̂n solves the likelihood equation). But under assumptions (A1)-(A6), when H0 is true,

(1/n) ∑_{i=1}^n ℓ″(X_i; θ*_n) →p E_{θ0}[ℓ″(X_i; θ0)] = −I(θ0).

On the other hand, by the continuous mapping theorem,

n(θ̂n − θ0)² →d V/I(θ0), with V ∼ χ²₁.

Applying Slutsky's theorem now yields the result.  ∎

Theorem (Wilks' theorem, general p, general s ≤ p)
Let X1, ..., Xn be iid random variables with density (frequency) depending on θ ∈ R^p and satisfying conditions (B1)-(B6), with I(θ) = J(θ). If the MLE sequence θ̂n is consistent for θ, then the likelihood ratio statistic Λn for H0 : {θ_j = θ_{j,0}}_{j=1}^s satisfies

2 log Λn →d V ∼ χ²_s

when H0 is true.

Exercise: prove Wilks' theorem. Note that it may potentially be that s < p.

Hypotheses of the form H0 : {g_j(θ) = a_j}_{j=1}^s, for g_j differentiable real functions, can also be handled by Wilks' theorem:
- Define (φ1, ..., φp) = g(θ) = (g1(θ), ..., gp(θ)).
- g_{s+1}, ..., g_p are defined so that θ ↦ g(θ) is 1-1.
- Apply the theorem with parameter φ.
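Wilks' theorem can be illustrated by simulation. The sketch below (parameter values hypothetical) simulates 2 log Λn for the Exp(λ) model with H0 : λ = 1, using the closed form 2 log Λn = 2n(log(λ̂/λ0) − 1 + λ0 X̄) with λ̂ = 1/X̄, and checks that the Monte Carlo mean is near E[χ²₁] = 1:

```python
# Monte Carlo sketch of Wilks' theorem for Exp(lambda), H0: lambda = 1.
# Under H0, 2*log(Lambda_n) is approximately chi-square(1), whose mean is 1.
import math, random

random.seed(2)
n, lam0, reps = 200, 1.0, 4000

def two_log_lambda():
    xbar = sum(random.expovariate(lam0) for _ in range(n)) / n
    lam_hat = 1 / xbar   # MLE of lambda
    # 2*log(Lambda_n) = 2n*[log(lam_hat/lam0) - 1 + lam0*xbar]
    return 2 * n * (math.log(lam_hat / lam0) - 1 + lam0 * xbar)

stats = [two_log_lambda() for _ in range(reps)]
mean_stat = sum(stats) / reps   # should be close to 1
```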

Other Tests?

Many other tests become possible once we "liberate" ourselves from strict optimality criteria. For example:

Wald's test
For a simple null, one may compare the unrestricted MLE with the MLE under the null. Large deviations indicate evidence against the null hypothesis. Distributions are approximated for large n via the asymptotic normality of MLEs.

Score test
For a simple null: if the null hypothesis is false, then the log-likelihood gradient (score) at the null should not be close to zero, at least when n is reasonably large — so measure its deviations from zero. Use asymptotics for the distributions (under conditions we end up with a χ²).
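To make the two recipes concrete, here is a hedged sketch of Wald and score statistics for H0 : θ = θ0 in a Bernoulli(θ) model (the data values and the 3.841 quantile are illustrative assumptions, not from the slides). Both statistics are compared to a χ²₁ quantile for large n:

```python
# Wald and score statistics for H0: theta = theta0, Bernoulli(theta) model.
n, y, theta0 = 100, 62, 0.5   # y = number of successes (hypothetical data)
theta_hat = y / n             # unrestricted MLE

# Wald: squared standardized distance of the MLE from theta0,
# with the variance estimated at the MLE.
wald = (theta_hat - theta0) ** 2 / (theta_hat * (1 - theta_hat) / n)

# Score: squared score at theta0, normalized by the Fisher information at theta0.
score = n * (theta_hat - theta0) ** 2 / (theta0 * (1 - theta0))

chi2_95 = 3.841   # 95% quantile of chi-square(1)
reject_wald, reject_score = wald > chi2_95, score > chi2_95
```

Note that the two statistics differ only in where the variance is evaluated (at θ̂ vs at θ0); asymptotically they agree under H0.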

Statistical Theory (Week 12) — Confidence Regions

From Hypothesis Tests to Confidence Regions
Statistical Theory
Victor Panaretos
Ecole Polytechnique Fédérale de Lausanne

1 p-values
2 Confidence Intervals
3 The Pivoting Method
4 Extension to Confidence Regions
5 Inverting Hypothesis Tests

Beyond Neyman-Pearson?

So far we have been restricted to the Neyman-Pearson framework:
1 Fix a significance level α for the test.
2 Consider rules δ respecting this significance level; choose one of those rules, δ*, based on power considerations.
3 Reject at level α if δ*(x) = 1.

This is useful for attempting to determine optimal test statistics. But what if we already have a given form of test statistic in mind (e.g. the LRT)? A different perspective on testing (used more in practice) says: rather than consider a family of test functions respecting level α, consider a family of test functions indexed by α:
1 Fix a family {δ_α}_{α∈(0,1)} of decision rules, with δ_α having level α. For a given x, some of these rules reject the null, while others do not.
2 Which is the smallest α for which H0 is rejected, given x?

Observed Significance Level

Definition (p-Value)
Let {δ_α}_{α∈(0,1)} be a family of test functions satisfying

α1 < α2 ⟹ {x ∈ X : δ_{α1}(x) = 1} ⊆ {x ∈ X : δ_{α2}(x) = 1}.

The p-value (or observed significance level) of the family {δ_α} is

p(x) = inf{α : δ_α(x) = 1}.

The p-value is the smallest value of α for which the null would be rejected at level α, given X = x.

Most usual setup: we have a single test statistic T.
1 Construct the family δ_α(x) = 1{T(x) > k_α}.
2 If P_{H0}[T ≤ t] = G(t), then p(x) = P_{H0}[T(X) ≥ T(x)] = 1 − G(T(x)).
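The formula p(x) = 1 − G(T(x)) is easy to apply whenever the null distribution G is available. As a generic illustration (not tied to the slides' examples), here is the p-value of a one-sided z-test computed via the standard normal survival function:

```python
# p-value p(x) = P[T >= T(x)] under H0 for a one-sided z-test,
# using the standard normal survival function via math.erfc.
import math

def z_test_p_value(t_obs):
    # P[Z >= t_obs] for Z ~ N(0, 1)
    return 0.5 * math.erfc(t_obs / math.sqrt(2))

p = z_test_p_value(1.6449)   # 1.6449 is roughly the 95% quantile of N(0,1)
```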

Observed Significance Level (cont'd)

Notice: contrary to the NP framework, we did not make an explicit decision! We simply reported a p-value.
- The p-value is used as a measure of evidence against H0.
- A small p-value provides evidence against H0.
- A large p-value provides no evidence against H0.
How small does "small" mean? That depends on the specific problem...

Intuition:
- Recall that extreme values of test statistics are those that are "inconsistent" with the null (NP framework).
- The p-value is the probability of observing a value of the test statistic as extreme as, or more extreme than, the one we observed, under the null.
- If this probability is small, then we have witnessed something quite unusual under the null hypothesis — which gives evidence against the null hypothesis.

Example (Normal Mean)
Let X1, ..., Xn iid N(µ, σ²) where both µ and σ² are unknown. Consider

H0 : µ = 0 vs H1 : µ ≠ 0.

The likelihood ratio test rejects when T² is large, where T = √n X̄/S ∼ t_{n−1} under H0. Since T² ∼ F_{1,n−1} under H0, the p-value is

p(x) = P_{H0}[T²(X) ≥ T²(x)] = 1 − G_{F_{1,n−1}}(T²(x)).

Consider two samples (datasets),

x = (0.66, −0.28, 0.99, 0.007, −0.29, −1.88, 1.24, 0.94, −0.53, −1.2),
y = (1.4, 0.48, 2.86, 1.02, −1.38, 1.42, 2.11, 2.77, 1.02, 1.87).

We obtain p(x) = 0.32, while p(y) = 0.006.

Significance vs Decision

Reporting a p-value does not necessarily mean making a decision. A small p-value can simply reflect our "confidence" in rejecting a null: it reflects how statistically significant the alternative statement is. But how is it to be interpreted?

Recall the example: statisticians working for Obama gather an iid sample X from Ohio, with X_i = 1{vote Obama}. The Obama team wants to test

H0 : McCain wins Ohio vs H1 : Obama wins Ohio.

Will the statisticians decide for Obama? Perhaps it is better to report the p-value to him and let him decide... And what if the statisticians work for a newspaper, not for Obama? Is there something easier to interpret than a test or a p-value?

A Glance Back at Point Estimation

Let X1, ..., Xn be iid random variables with density (frequency) f(·; θ).
- A problem with point estimation: P_θ[θ̂ = θ] is typically small (if not zero), so we always attach an estimator of variability, e.g. a standard error. But how is that interpreted?
- Hypothesis tests may provide a way to interpret an estimator's variability within the setup of a particular problem: e.g. if we observe P̂[Obama wins] = 0.52, we can actually see what p-value we get when testing H0 : P[Obama wins] ≥ 1/2.

Something more directly interpretable? Back to our example: what do pollsters do in newspapers?
- They announce their point estimate (e.g. 0.52).
- They give upper and lower confidence limits.
What are these, and how are they interpreted?

Interval Estimation

Simple underlying idea: instead of estimating θ by a single value, present a whole range of values for θ that are consistent with the data — in the sense that they could have produced the data.

Definition (Confidence Interval)
Let X = (X1, ..., Xn) be random variables with joint distribution depending on θ ∈ R, and let L(X) and U(X) be two statistics with L(X) < U(X) a.s. Then the random interval [L(X), U(X)] is called a 100(1 − α)% confidence interval for θ if

P_θ[L(X) ≤ θ ≤ U(X)] ≥ 1 − α

for all θ ∈ Θ, with equality for at least one value of θ. 1 − α is called the coverage probability or confidence level.

Interpretation — beware!
- The probability statement is NOT made about θ, which is constant.
- The statement is about the interval: the probability that the interval contains the true value is at least 1 − α.
- Given any realization X = x, the interval [L(x), U(x)] will either contain or not contain θ.
- If we construct intervals with this method, then we expect that 100(1 − α)% of the time our intervals will engulf the true value.

Example
Let X1, ..., Xn iid N(µ, 1). Then √n(X̄ − µ) ∼ N(0, 1), so that

P_µ[−1.96 ≤ √n(X̄ − µ) ≤ 1.96] = 0.95,

and since

−1.96 ≤ √n(X̄ − µ) ≤ 1.96 ⟺ X̄ − 1.96/√n ≤ µ ≤ X̄ + 1.96/√n,

we obviously have

P_µ[X̄ − 1.96/√n ≤ µ ≤ X̄ + 1.96/√n] = 0.95.

So the random interval [L(X), U(X)] = [X̄ − 1.96/√n, X̄ + 1.96/√n] is a 95% confidence interval for µ. By the central limit theorem, the same argument yields an approximate 95% CI when X1, ..., Xn are iid with EX_i = µ and Var(X_i) = 1, regardless of their distribution.

Pivotal Quantities

What can we learn from the previous example?

Definition (Pivot)
A random function g(X, θ) is said to be a pivotal quantity (or simply a pivot) if it is a function of both X and θ whose distribution does not depend on θ.

In the previous example, √n(X̄ − µ) ∼ N(0, 1) is a pivot. Why is a pivot useful? For all α ∈ (0, 1) we can find constants a < b, independent of θ, such that

P_θ[a ≤ g(X, θ) ≤ b] = 1 − α for all θ ∈ Θ.

If g(X, θ) can be manipulated, then the above yields a CI.

Example
Let X1, ..., Xn iid U[0, θ]. Recall that the MLE is θ̂ = X_(n), with distribution

P_θ[X_(n) ≤ x] = F_{X_(n)}(x) = (x/θ)^n ⟹ P_θ[X_(n)/θ ≤ y] = y^n.

Hence X_(n)/θ is a pivot for θ. We can now choose a < b such that

P_θ[a ≤ X_(n)/θ ≤ b] = 1 − α.

But there are ∞-many such choices! Idea: choose the pair (a, b) that minimizes the interval's length. The solution can be seen to be a = α^{1/n} and b = 1, yielding the interval

[X_(n), X_(n)/α^{1/n}].

Comments on Pivotal Quantities

The pivotal method extends to the construction of a CI for θ_k when θ = (θ1, ..., θ_k, ..., θ_p) ∈ R^p and the remaining coordinates are also unknown. The pivotal quantity should now be a function g(X; θ_k) which
1 depends on X and θ_k, but on no other parameters;
2 has a distribution independent of any of the parameters.
E.g.: a CI for a normal mean when the variance is unknown.

Main difficulties with the pivotal method:
- Exact pivots are hard to find in general problems.
- Exact distributions may be intractable.
One then resorts to asymptotic approximations; the most classic example is when a_n(θ̂_n − θ) →d N(0, σ²(θ)).
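The defining property of a pivot-based interval — that its coverage does not depend on the unknown parameter — can be checked by simulation. Below is an illustrative Monte Carlo sketch for the normal-mean pivot with known variance 1 (the value of µ is an arbitrary assumption):

```python
# Monte Carlo coverage check for the pivot-based 95% CI: Xbar +/- 1.96/sqrt(n).
import math, random

random.seed(3)
mu, n, reps = 2.0, 25, 10000   # true mean is hypothetical
half_width = 1.96 / math.sqrt(n)

covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1
coverage = covered / reps   # should be close to 0.95, whatever mu is
```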

Confidence Regions

What about higher-dimensional parameters?

Definition (Confidence Region)
Let X = (X1, ..., Xn) be random variables with joint distribution depending on θ ∈ Θ ⊆ R^p. A random subset R(X) of Θ, depending on X, is called a 100(1 − α)% confidence region for θ if

P_θ[R(X) ∋ θ] ≥ 1 − α

for all θ ∈ Θ, with equality for at least one value of θ.

- There is no restriction requiring R(X) to be convex or even connected; so when p = 1 we get a more general notion than a CI.
- Nevertheless, many notions extend immediately to the CR case, e.g. the notion of a pivotal quantity.

Pivots for Confidence Regions

Let g : X × Θ → R be a function such that dist[g(X, θ)] is independent of θ. Since the image space is the real line, we can find a < b such that

P_θ[a ≤ g(X, θ) ≤ b] = 1 − α ⟹ P_θ[R(X) ∋ θ] = 1 − α,

where R(x) = {θ ∈ Θ : g(x, θ) ∈ [a, b]}. Notice that the region can be "wild", since it is a random fibre of g.

Example
Let X1, ..., Xn iid N_k(µ, Σ). Two unbiased estimators of µ and Σ are

µ̂ = (1/n) ∑_{i=1}^n X_i,  Σ̂ = (1/(n−1)) ∑_{i=1}^n (X_i − µ̂)(X_i − µ̂)ᵀ.

Consider the random variable

g({X_i}_{i=1}^n, µ) := [n(n − k)/(k(n − 1))] (µ̂ − µ)ᵀ Σ̂⁻¹ (µ̂ − µ) ∼ F-distribution with k and n − k degrees of freedom.

A pivot! If f_q is the q-quantile of this distribution, then we get a 100q% CR as

R({X_i}_{i=1}^n) = {θ ∈ R^k : [n(n − k)/(k(n − 1))] (µ̂ − θ)ᵀ Σ̂⁻¹ (µ̂ − θ) ≤ f_q},

an ellipsoid in R^k:
- centred at µ̂;
- principal axis lengths governed by the eigenvalues of Σ̂;
- orientation given by the eigenvectors of Σ̂⁻¹.

Getting Confidence Regions from Confidence Intervals

Visualisation of high-dimensional CRs can be hard. When these are ellipsoids, the spectral decomposition helps — but more generally? Things are especially easy when dealing with rectangles, but rectangles rarely occur! What if we construct a CR as a Cartesian product of CIs?

Let [L_i(X), U_i(X)] be 100q_i% CIs for θ_i, i = 1, ..., p, and define

R(X) = [L1(X), U1(X)] × ... × [L_p(X), U_p(X)].

Bonferroni's inequality implies that

P_θ[R(X) ∋ θ] ≥ 1 − ∑_{i=1}^p P[θ_i ∉ [L_i(X), U_i(X)]] = 1 − ∑_{i=1}^p (1 − q_i).

So pick the q_i such that ∑_{i=1}^p (1 − q_i) = α (this can be conservative...).
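The Bonferroni product construction can be illustrated by simulation. The sketch below (all parameter values hypothetical) uses p = 2 independent standard-normal coordinates with each coordinate interval at level 1 − α/p = 0.975, so the rectangle should cover the true mean vector with probability (0.975)² ≈ 0.9506 ≥ 0.95:

```python
# Monte Carlo sketch of a Bonferroni product confidence region, p = 2.
import math, random

random.seed(4)
p_dim, alpha, n, reps = 2, 0.05, 25, 8000
z = 2.2414                    # ~ upper 1.25% quantile of N(0,1), per-coordinate level 0.975
hw = z / math.sqrt(n)         # half-width of each coordinate interval

covered = 0
for _ in range(reps):
    ok = True
    for _ in range(p_dim):
        xbar = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
        ok = ok and (abs(xbar) <= hw)   # true coordinate mean is 0
    covered += ok
coverage = covered / reps     # expected ~ 0.9506, i.e. slightly conservative
```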

Confidence Intervals and Hypothesis Tests

The discussion on CRs gives no guidance for choosing "good" regions. But there exists a close relationship between CRs and hypothesis tests, which we can exploit to transform good testing properties into good CR properties.

Suppose R(X) is an exact 100q% = 100(1 − α)% CR for θ. Consider

H0 : θ = θ0 vs H1 : θ ≠ θ0

and define the test function

δ(X) = 1 if θ0 ∉ R(X);  0 if θ0 ∈ R(X).

Then E_{θ0}[δ(X)] = 1 − P_{θ0}[θ0 ∈ R(X)] ≤ α. So we can use a CR to construct a test with significance level α!

Going the other way around, we can invert tests to get CRs. Suppose we have tests at level α for any choice of simple null θ0 ∈ Θ; say δ(X; θ0) is the appropriate test function for H0 : θ = θ0. Define

R*(X) = {θ0 : δ(X; θ0) = 0}.

The coverage probability of R*(X) is

P_θ[θ ∈ R*(X)] = P_θ[δ(X; θ) = 0] ≥ 1 − α.

We obtain a 100(1 − α)% confidence region by choosing all the θ for which the null would not be rejected given our data X. If the test being inverted is powerful, then we get a "small" region for a given 1 − α.

Multiple Testing

Modern example: looking for signals in noise.
- We are interested in detecting the presence of a signal µ(x_t), t = 1, ..., T, over a discretised domain {x_1, ..., x_T}, on the basis of noisy measurements.
- This is to be detected against some known background, say 0.
- We may or may not be specifically interested in detecting the presence of the signal at some particular location x_t; we may rather want to detect whether a signal is present anywhere in the domain.
- We may also be interested in the specific locations at which the signal deviates from zero.

More formally: we observe

Y_t = µ(x_t) + ε_t, t = 1, ..., T,

and wish to test, at some significance level α,

H0 : µ(x_t) = 0 for all t ∈ {1, ..., T}
HA : µ(x_t) ≠ 0 for some t ∈ {1, ..., T}.

That is: does there exist a t ∈ {1, ..., T} such that µ(x_t) ≠ 0? And, possibly, for which t's is µ(x_t) ≠ 0?

More generally: we may have T hypotheses to test simultaneously at level α (they may be related or totally unrelated). Suppose we have a test statistic for each individual hypothesis H_{0,t}, yielding a p-value p_t.

Bonferroni Method

If we test each hypothesis individually at level α, we will not maintain the overall level! Can we maintain the level α? Idea: use the same trick as for confidence regions!

Bonferroni:
1 Test the individual hypotheses separately at level α_t = α/T.
2 Reject H0 if at least one of the H_{0,t} is rejected.

The global level is bounded as follows:

P[reject H0 | H0] = P[∪_{t=1}^T {reject H_{0,t}} | H0] ≤ ∑_{t=1}^T P[reject H_{0,t} | H0] = T · (α/T) = α.

- Advantage: works for any (discrete domain) setup!
- Disadvantage: too conservative when T is large.

Holm-Bonferroni Method

Holm's modification increases the average number of hypotheses rejected at level α (but does not increase the power for the overall rejection of H0 = ∩_{t≤T} H_{0,t}).

Holm's procedure:
1 We reject H_{0,t} for small values of the corresponding p-value p_t.
2 Order the p-values from most to least significant: p_(1) ≤ ... ≤ p_(T).
3 Starting from t = 1 and going up, reject H_{0,(t)} as long as p_(t) is significant at level α/(T − t + 1); stop rejecting at the first insignificant p_(t).

For the most significant p-value this threshold is α/T, so both Holm and Bonferroni reject the global H0 if and only if inf_t p_t is significant at level α/T. Holm is a genuine improvement over Bonferroni if we want to detect as many signals as possible, not just the existence of some signal.
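Holm's step-down procedure can be written in a few lines; the implementation below is a minimal sketch of the rule described above (the p-values are hypothetical):

```python
# Holm-Bonferroni step-down: reject the t-th most significant hypothesis while
# its p-value is at most alpha/(T - t + 1); stop at the first failure.
def holm(pvalues, alpha=0.05):
    T = len(pvalues)
    order = sorted(range(T), key=lambda i: pvalues[i])   # most significant first
    reject = [False] * T
    for rank, i in enumerate(order):                     # rank 0 = smallest p-value
        if pvalues[i] <= alpha / (T - rank):
            reject[i] = True
        else:
            break                                        # stop at first insignificant p
    return reject

pvals = [0.001, 0.040, 0.020, 0.300]   # hypothetical p-values
decisions = holm(pvals, alpha=0.05)
```

Here only the first hypothesis is rejected: 0.001 ≤ 0.05/4, but the next ordered p-value, 0.020, exceeds 0.05/3.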

Taking Advantage of Structure: Independence

In the (special) case where the individual test statistics are independent, one may use Simes' (in)equality:

P_{H0}[p_(j) ≥ jα/T for all j = 1, ..., T] ≥ 1 − α

(equality requires continuous test statistics; otherwise the type I error is at most α).

This yields Simes' procedure (assuming independence):
1 Suppose we reject H_{0,j} for small values of p_j.
2 Order the p-values from most to least significant: p_(1) ≤ ... ≤ p_(T).
3 If, for some j = 1, ..., T, the p-value p_(j) is significant at level jα/T, then reject the global H0.

This provides a test of the global hypothesis H0, but does not "localise" the signal at a particular x_t. One can, however, devise a sequential procedure to "localise" Simes' procedure, at the expense of lower power for the global hypothesis H0:

Hochberg's procedure (assuming independence):
1 Suppose we reject H_{0,j} for small values of p_j.
2 Order the p-values from most to least significant: p_(1) ≤ ... ≤ p_(T).
3 Starting from j = T, T − 1, ... and going down, accept H_{0,(j)} while p_(j) is insignificant at level α/(T − j + 1).
4 Stop accepting at the first j such that p_(j) is significant at level α/(T − j + 1), and reject all the remaining ordered hypotheses H_{0,(1)}, ..., H_{0,(j)}.

Hochberg's procedure is a genuine improvement over Holm-Bonferroni, both overall (for H0) and in terms of signal localisation:
1 It rejects "more" individual hypotheses than Holm-Bonferroni.
2 Its power for the overall H0 is "weaker" than Simes' (for T > 2), but much "stronger" than Holm's (for T > 1).
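Hochberg's step-up rule can likewise be sketched in a few lines (a minimal implementation of the procedure above; the p-values are hypothetical):

```python
# Hochberg step-up: scan the ordered p-values from least to most significant and
# find the largest rank j with p_(j) <= alpha/(T - j + 1); reject that hypothesis
# and all more significant ones.
def hochberg(pvalues, alpha=0.05):
    T = len(pvalues)
    order = sorted(range(T), key=lambda i: pvalues[i])   # most significant first
    reject = [False] * T
    for rank in range(T - 1, -1, -1):                    # start from the largest p-value
        if pvalues[order[rank]] <= alpha / (T - rank):
            for i in order[: rank + 1]:
                reject[i] = True
            break
    return reject

pvals = [0.040, 0.045]   # hypothetical p-values, T = 2
decisions = hochberg(pvals, alpha=0.05)
```

On this example Hochberg rejects both hypotheses (0.045 ≤ 0.05/1), whereas Holm would compare 0.040 to 0.05/2 = 0.025 and reject neither — illustrating the "rejects more" claim.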

[Figure: "Bonferroni, Hochberg, Simes" — rejection level (y-axis, from about 0.01 = α/T up to 0.05 = α) plotted against the index of the ordered p-values, from minimal to maximal significance (x-axis, 1 to 25), comparing the constant Bonferroni threshold with the increasing Hochberg and Simes thresholds.]

Statistical Theory (Week 13) — Bayesian Inference

Some Elements of Bayesian Inference
Statistical Theory
Victor Panaretos
Ecole Polytechnique Fédérale de Lausanne

1 θ as a random variable
2 Using Bayes' Rule to Update a Prior
3 Empirical Bayes
4 Choice of Prior: Informative, Conjugate, Uninformative
5 Inference in the Bayesian Framework

Back to Basics!

Classical perspective:
1 θ ∈ Θ ⊆ R^p is an unknown fixed constant.
2 The data are a realization of a random X ∼ f(x; θ).
3 Use the data to learn about the value of θ (estimation, testing, ...).

This somehow assumes complete ignorance about θ. What if we have some prior knowledge/belief about plausible values? We did this when defining the Bayes risk: we placed a prior π(·) on θ.

Bayesian perspective:
1 θ is a RANDOM VARIABLE with prior distribution π(·).
2 The data are a realization of a random X such that X | θ ∼ f(x | θ).
3 Use the data X = x to UPDATE the distribution of θ.

Updating a Prior

1 We have knowledge/belief about θ, expressed via π(θ).
2 We observe data X = x.

How do we readjust our belief to incorporate this new evidence? Answer: use Bayes' rule to obtain a posterior for θ:

π(θ | x) = f(x | θ)π(θ) / ∫_Θ f(x | θ)π(θ) dθ;

since the denominator is constant,

π(θ | x) ∝ f(x | θ) · π(θ)   (posterior ∝ likelihood × prior).

Bayesian principle: everything we want to learn about θ from the data is contained in the posterior.

Statistical Theory (Week 13) Bayesian Inference 3 / 19 Statistical Theory (Week 13) Bayesian Inference 4 / 19 A Few Remarks Example (Coin Flips) iid Let (X1, ..., Xn) θ Bernoulli(θ) and consider a Beta prior density on θ: Some strengths of the Bayesian approach: | ∼ (provided we accept viewing θ as random) Γ(α + β) α 1 β 1 π(θ) = θ − (1 θ) − , θ (0, 1) Γ(α)Γ(β) − ∈ I Inferences on θ take into account only observed data and prior , Contrast to classical approach which worries about all samples that Given X = x , ..., X = x with y = x + ... + x , have → 1 1 n n 1 n could have but did not occur (risk) f (x1, ..., xn θ)π(θ) π(θ x) = 1 | Provides unified approach for solving (almost) any inference problem | f (x1, ..., xn θ)π(θ)dθ I 0 | y+α 1 n y+β 1 R θ − (1 θ) − − Allows natural way to incorporate prior information in inference = − I 1 θy+α 1(1 θ)n y+β 1dθ 0 − − − − Posterior can become your prior for next exepriment! Γ(α + β + n) y+α 1 n y+β 1 I = R θ − (1 θ) − − (updating process: can unify a series of experiments naturally) Γ(α + y)Γ(β + n y) − − But... one basic weaknesses: Recall that Γ(k) = (k 1)! for k Z+ Choice of prior? Objective priors typically NOT available... − ∈ I , our choice of prior includes great variety of densities, including uniform → Statistical Theory (Week 13) Bayesian Inference 5 / 19 distributionStatistical Theory (Week 13) Bayesian Inference 6 / 19

Coin Flipping Example: n = 10, Σx_i = 2, prior vs posterior

[Figure: four prior-vs-posterior density panels on (0, 1): Prior = Beta(5, 1), Posterior = Beta(7, 9); Prior = Beta(1/2, 1/2), Posterior = Beta(2.5, 8.5); Prior = Beta(2, 5), Posterior = Beta(4, 13); Prior = Beta(1, 1), Posterior = Beta(3, 9).]

Coin Flipping Example: n = 15, Σx_i = 3, prior vs posterior

[Figure: four prior-vs-posterior density panels on (0, 1): Prior = Beta(5, 1), Posterior = Beta(8, 13); Prior = Beta(1/2, 1/2), Posterior = Beta(3.5, 12.5); Prior = Beta(2, 5), Posterior = Beta(5, 17); Prior = Beta(1, 1), Posterior = Beta(4, 13).]

Coin Flipping Example: n = 100, Σx_i = 20, prior vs posterior

Empirical Bayes

[Figure: four prior-vs-posterior density panels on (0, 1): Prior = Beta(5, 1), Posterior = Beta(25, 82); Prior = Beta(1/2, 1/2), Posterior = Beta(20.5, 80.5); Prior = Beta(2, 5), Posterior = Beta(22, 85); Prior = Beta(1, 1), Posterior = Beta(21, 81).]

→ Typically the prior itself depends on parameters (= hyperparameters)
→ These are tuned to reflect prior knowledge/belief

- "Orthodox" Bayesians:
  1. Accept the role of subjective probability / a priori beliefs
  2. Hyperparameters should be specified independently of the data
- Empirical Bayes approach:
  1. Not willing to specify the hyperparameters a priori
  2. Tune the prior to the observed data (estimate the hyperparameters from the data)
  (essentially a non-Bayesian approach, since the prior is tuned to the data)

Empirical Bayes

Prior π(θ; α) depending on a hyperparameter α. Write

    g(x; α) = ∫_Θ f(x | θ) π(θ; α) dθ
    [the joint density of θ and x, integrated: the marginal of x]

The marginal of x depends on α
→ a classical point estimation setting: estimate α!
  - Maximum likelihood
  - Plug-in principle
→ Then plug α̂ into the prior and obtain the posterior

    π(θ | x; α̂) = f(x | θ) π(θ; α̂) / ∫_Θ f(x | θ) π(θ; α̂) dθ = f(x | θ) π(θ; α̂) / g(x; α̂)

Example
Let X_1, ..., X_n | λ ~ iid Poisson(λ) with a gamma prior on λ:

    π(λ; α, β) = [α^β / Γ(β)] λ^{β−1} exp(−αλ)

Letting y = x_1 + ... + x_n, observe that the marginal for x is

    g(x; α, β) = ∫_0^∞ [exp(−nλ) λ^y / (x_1! ... x_n!)] π(λ; α, β) dλ
               = (α / (n + α))^β (1 / (n + α))^y Γ(y + β) / [Γ(β) x_1! ... x_n!]

So use ML estimation with L(α, β) = g(x; α, β). Alternatively, MoM:

    E[X_i] = E[E(X_i | λ)] = ∫_0^∞ λ π(λ; α, β) dλ = β/α
    Var[X_i] = E[Var(X_i | λ)] + Var[E(X_i | λ)] = β/α + β/α²
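A minimal sketch of the MoM variant above: match the sample mean and variance to β/α and β/α + β/α², then plug the estimates into the gamma prior and read off the conjugate posterior. The count data below are hypothetical; note that MoM requires the sample variance to exceed the mean (overdispersion), otherwise the estimates are not positive.

```python
# Empirical Bayes by method of moments for the Poisson-gamma model:
# E[X] = beta/alpha and Var[X] = beta/alpha + beta/alpha^2, so
# alpha = m / (v - m) and beta = m * alpha (valid only when v > m).
def empirical_bayes_gamma(xs):
    n = len(xs)
    m = sum(xs) / n                              # sample mean
    v = sum((x - m) ** 2 for x in xs) / n        # sample variance
    alpha_hat = m / (v - m)
    beta_hat = m * alpha_hat
    y = sum(xs)
    # conjugacy: posterior for lambda is Gamma(shape y + beta, rate n + alpha)
    return alpha_hat, beta_hat, (y + beta_hat, n + alpha_hat)

xs = [3, 0, 2, 5, 1, 4, 0, 2, 6, 2]              # hypothetical counts
alpha_hat, beta_hat, (shape, rate) = empirical_bayes_gamma(xs)
print(alpha_hat, beta_hat)
print(shape / rate)   # posterior mean; with MoM this equals the sample mean
```

With β̂ = m·α̂ the posterior mean (y + β̂)/(n + α̂) collapses exactly to the sample mean m, so the empirical-Bayes point estimate here reproduces the classical one; what the tuned prior adds is the full posterior distribution for λ.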

Informative, Conjugate and Ignorance Priors

The "catch" with Bayesian inference is picking a prior. This can be done:
1. By expressing prior knowledge/opinion (informative)
2. For convenience (conjugate)
3. As objectively as possible (ignorance)

We focus on (2) and (3). The most convenient priors to work with are conjugate families.

Definition (Conjugate Family)
A parametric family P = {π(·; α)}_{α ∈ A} on Θ is called a conjugate family for a family of distributions F = {f(·; θ)}_{θ ∈ Θ} on X if

    f(x | θ) π(θ; α) / ∫_Θ f(x | θ) π(θ; α) dθ = π(θ | x; α) ∈ P,   ∀ α ∈ A & x ∈ X.

Conjugate families: the posterior is immediately available
→ a great simplification of Bayesian inference

Example (Exponential Family)
Let (X_1, ..., X_n) | θ follow a one-parameter exponential family,

    f(x | θ) = exp[c(θ)T(x) − d(θ) + S(x)]

Consider the prior π(θ) = K(α, β) exp[αc(θ) − βd(θ)]. The posterior satisfies

    π(θ | x) ∝ π(θ) f(x; θ) ∝ exp[(T(x) + α)c(θ) − (β + 1)d(θ)]

so we obtain a posterior in the same family as the prior,

    π(θ | x) = K(T(x) + α, β + 1) exp[(T(x) + α)c(θ) − (β + 1)d(θ)]
             = K(α′, β′) exp[α′c(θ) − β′d(θ)],   with α′ = T(x) + α, β′ = β + 1.

Informative, Conjugate and Ignorance Priors

Bayesian inference is often perceived as not objective.
↓
Can we find priors that express indifference about the values of θ?
- If Θ is finite: easy, place mass 1/|Θ| on each point
- Infinite case?

Consider initially Θ = [a, b]. A natural uninformative prior is U[a, b]
→ Uninformative for θ. But what about g(θ)?
→ If g(·) is non-linear, we are being informative about g(θ)

What if Θ is not bounded? Then no uniform probability distribution exists:

    ∫_Θ k dθ = ∞   ∀ k > 0

Some "improper" priors (i.e. of infinite total mass) nevertheless yield valid posterior densities.

Example (Lebesgue Prior for Normal Distribution)
Let X_1, ..., X_n | µ ~ N(µ, 1). Assume the prior π is "uniform" on R,

    π[a, b] = b − a = ∫_a^b dx,   ∀ a < b

(density 1 with respect to Lebesgue measure). We obtain the posterior

    π(µ | x) = k(x) exp[−(1/2) Σ_{i=1}^n (x_i − µ)²]

with

    k(x) = ( ∫_{−∞}^∞ exp[−(1/2) Σ_{i=1}^n (x_i − τ)²] dτ )^{−1}
         = √(n/2π) exp[(1/2) Σ_{i=1}^n (x_i − x̄)²],

so the posterior is N(x̄, 1/n).
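The calculation above can be checked numerically: with a flat (improper) prior the posterior is proportional to the likelihood, and normalizing the likelihood over a fine grid should reproduce the N(x̄, 1/n) density. The data, grid range, and step size below are arbitrary illustrative choices.

```python
# Numerical check of the flat-prior posterior for the N(mu, 1) model.
import math

xs = [0.3, -0.1, 1.2, 0.8, 0.4]                  # hypothetical data
n, xbar = len(xs), sum(xs) / len(xs)

def unnorm_post(mu):                 # likelihood = posterior up to a constant
    return math.exp(-0.5 * sum((x - mu) ** 2 for x in xs))

def closed_form(mu):                 # N(xbar, 1/n) density
    return math.sqrt(n / (2 * math.pi)) * math.exp(-0.5 * n * (mu - xbar) ** 2)

step = 0.001
mus = [i * step for i in range(-3000, 3001)]     # grid on [-3, 3]
Z = sum(unnorm_post(m) for m in mus) * step      # Riemann sum normalizer
gaps = [abs(unnorm_post(mu) / Z - closed_form(mu)) for mu in (0.0, xbar, 1.0)]
print(gaps)   # all gaps are very small
```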

Jeffreys' Prior

The invariance problem remains even with improper priors.
→ Jeffreys (1961) proposed the following approach. Assume X | θ ~ f(· | θ).
1. Let g be monotone on Θ and define ν = g(θ)
2. Define π(θ) ∝ |I(θ)|^{1/2} (Fisher information)
3. The Fisher information for ν is

    I(ν) = Var_ν[ (d/dν) log f(X; g^{−1}(ν)) ]
         = ((d/dν) g^{−1}(ν))² Var_θ[ (d/dθ) log f(X; θ) ] = I(θ) × (dθ/dν)²

4. Thus |I(θ)|^{1/2} dθ = |I(ν)|^{1/2} dν
→ Gives a widely-accepted solution to some standard problems.

Computational Issues

When the prior is conjugate: the posterior is easy to obtain.
For a general prior, however, the posterior is not easy to obtain.
→ Problems arise especially with the evaluation of ∫_Θ f(x | θ) π(θ) dθ
Explicit integration is often infeasible; alternatives:
- Numerical integration
- Monte Carlo methods
  → Monte Carlo integration

Note that this prior distribution can be improper.
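A minimal sketch of the Monte Carlo integration idea for the troublesome integral ∫_Θ f(x | θ)π(θ)dθ: draw θ's from the prior and average the likelihood. It is checked here on the coin-flip model with a uniform prior, where the marginal is a known Beta integral; the data and the number of draws are illustrative choices.

```python
# Monte Carlo integration of the marginal m(x) = ∫ f(x|θ) π(θ) dθ:
# with θ_1, ..., θ_M drawn from the prior, m(x) ≈ (1/M) Σ_j f(x|θ_j).
import math
import random

random.seed(0)
n, y = 10, 2                                   # 2 successes in 10 trials
lik = lambda t: t**y * (1 - t)**(n - y)        # Bernoulli likelihood

M = 100_000
mc = sum(lik(random.betavariate(1, 1)) for _ in range(M)) / M  # uniform prior

# Exact value: ∫_0^1 θ^y (1-θ)^{n-y} dθ = Γ(y+1)Γ(n-y+1)/Γ(n+2)
exact = math.gamma(y + 1) * math.gamma(n - y + 1) / math.gamma(n + 2)
print(mc, exact)   # the two agree to a few decimal places
```

The same averaging idea works for any prior one can sample from, which is what makes it attractive when the integral has no closed form.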

Inference in the Bayesian Framework?

In principle, the posterior contains everything you want to know.

→ So any inference really is just a descriptive measure of the posterior
→ Any descriptive measure contains less information than the posterior

- Point estimators: posterior mean, mode, median, ...
  → Relate to Bayes decision rules
- Highest Posterior Density (HPD) regions (vs confidence regions)
  → Different interpretation from CRs!
- Hypothesis testing: Bayes factors
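The point estimators listed above are all descriptive measures of one and the same posterior. A sketch for the Beta(3, 9) posterior from the coin-flip figures: mean and mode are closed-form, while the median is approximated by sampling from the posterior (the sample size is an arbitrary choice).

```python
# Point estimates as summaries of a Beta(a, b) posterior.
import random
random.seed(1)

a, b = 3, 9                              # posterior from n=10, y=2, Beta(1,1) prior
post_mean = a / (a + b)                  # = 0.25
post_mode = (a - 1) / (a + b - 2)        # = 0.2 (well-defined since a, b > 1)

# Median via posterior sampling (no closed form needed)
draws = sorted(random.betavariate(a, b) for _ in range(100_000))
post_median = draws[len(draws) // 2]     # ~ 0.235, between mode and mean here

print(post_mean, post_mode, post_median)
```

Each summary arises as the Bayes rule under a different loss (squared error gives the mean, absolute error the median, 0-1 loss the mode), which is the link to Bayes decision rules noted above.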
