Stats 512 513 Review ♥

Eileen Burns, FSA, MAAA

June 16, 2009

Contents

1 Basic Probability
  1.1 Experiment, Sample Space, RV, and Probability
  1.2 Density, Value Number Urn Model, and Dice
  1.3 Expected Value Descriptive Parameters
  1.4 More About Normal Distributions
  1.5 Independent Trials and a Pictorial CLT
  1.6 The Population, the Sample, and Data
  1.7 Elementary Probability, Stressing Independence
  1.8 Expectation, Variance, and the CLT
  1.9 Applications of the CLT

2 Introduction to Statistics
  2.1 Presentation of Data
  2.2 Estimation of µ and σ²
  2.3 Elementary Classical Statistics
  2.4 Elementary Statistical Applications

3 Probability Models
  3.1 Math Facts
  3.2 Combinatorics and Hypergeometric RVs
  3.3 Independent Bernoulli Trials
  3.4 The Poisson Distribution
  3.5 The Poisson Process N
  3.6 The Failure Rate Function λ(·)
  3.7 Min, Max, Median, and Order Statistics
  3.8 Multinomial Distributions
  3.9 Sampling from a Finite Population, with a CLT

4 Dependent Random Variables
  4.1 Two-Dimensional Discrete RVs
  4.2 Two-Dimensional Continuous RVs
  4.3 Conditional Expectation
  4.4 Prediction
  4.5 Covariance
  4.6 Bivariate Normal Distributions

5 Distribution Theory
  5.1 Transformations of RVs
  5.2 Linear Transformations in Higher Dimensions
  5.3 General Transformations in Higher Dimensions
  5.4 Asymptotics
  5.5 Moment Generating Functions


6 Classical Statistics
  6.1 Estimation
  6.2 The One-Sample Normal Model
  6.3 The Two-Sample Normal Model
  6.4 Other Models

7 Estimation, Principles and Approaches
  7.1 Sufficiency, and UMVUE Estimators
  7.2 Completeness, and Ancillary Statistics
  7.3 Exponential Families, and Completeness
  7.4 The Cramér-Rao Bound
  7.5 Maximum Likelihood Estimators
  7.6 Procedures for Location and Scale Models
  7.7 Bootstrap Methods
  7.8 Bayesian Methods

8 Hypothesis Tests
  8.1 The Neyman-Pearson Lemma and UMP Tests
  8.2 The Likelihood Ratio Test

9 Regression, Anova, and Categorical Data
  9.1 Simple Linear Regression
  9.2 One Way Analysis of Variance
  9.3 Categorical Data

A Distributions
  A.1 Important Transformations

Chapter 1

Basic Probability

1.1 Experiment, Sample Space, RV, and Probability

Bell Curve (6)
Probability an observation falls within x standard deviations of the mean of a normal distribution:
x = 1: P(−1 ≤ Z ≤ 1) = .682
x = 2: P(−2 ≤ Z ≤ 2) = .954
x = 3: P(−3 ≤ Z ≤ 3) = .9974
Note: this is true for any X ∼ N(µ, σ²) with Z = (X − µ)/σ.
From a quantile perspective,

p      .500   .600   .700   .750   .800   .850   .900   .950   .975   .990   .999
z_p     0     .254   .524   .672   .840   1.04   1.28   1.645  1.96   2.33   3.09
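These coverage probabilities and quantiles are easy to reproduce numerically; a minimal sketch using scipy.stats (the package choice is an assumption of this note, not part of the original):

```python
from scipy.stats import norm

# P(-x <= Z <= x) for x = 1, 2, 3
for x in (1, 2, 3):
    print(x, norm.cdf(x) - norm.cdf(-x))      # .6827, .9545, .9973

# standard normal quantiles z_p
for p in (.5, .6, .7, .75, .8, .85, .9, .95, .975, .99, .999):
    print(p, round(norm.ppf(p), 3))
```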

Basic probability facts (8)
P(A^c) = 1 − P(A)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Union-Intersection Principle (§1.7) (73)

P(∪_{i=1}^n A_i) = Σ_i P(A_i) − Σ_{i<j} P(A_iA_j) + Σ_{i<j<k} P(A_iA_jA_k) − Σ_{i<j<k<l} P(A_iA_jA_kA_l) + ···

Partition of an Event (§1.7) (67)
For a given partition of the event A, P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) when all pairs of events A_i, A_j are disjoint.

Density Functions and Distribution Functions (10)
For density function f and distribution function F,
P(a < X ≤ b) = ∫_a^b f(v) dv = ∫_{−∞}^b f(v) dv − ∫_{−∞}^a f(v) dv = F(b) − F(a)

1.2 Density, Value Number Urn Model, and Dice

1.3 Expected Value Descriptive Parameters

Expected Value (26) (31)
  discrete: µ_X = E[X] = Σ_{all v} v·p_X(v)            continuous: µ_X = ∫_{−∞}^{∞} v f(v) dv
  discrete: E[g(X)] = Σ_{all v} g(v)·p_X(v)            continuous: E[g(X)] = ∫_{−∞}^{∞} g(v) f(v) dv

Mean deviation (27) (31)
  discrete: τ_X = E[|X − µ_X|] = Σ_{all v} |v − µ_X|·p_X(v)            continuous: τ_X = ∫_{−∞}^{∞} |v − µ_X| f(v) dv

Variance
  discrete: σ²_X = E[X²] − (E[X])² = Σ_{all v} (v − µ_X)²·p_X(v)            continuous: σ²_X = ∫_{−∞}^{∞} (v − µ_X)² f(v) dv

1.4 More About Normal Distributions

1.5 Independent Trials and a Pictorial CLT (46)
Let X_i for 1 ≤ i ≤ n denote iid rvs. Let T_n = X_1 + X_2 + ··· + X_n. Let Z denote a N(0,1) rv.
Then for all −∞ ≤ a < b ≤ ∞,  P(a ≤ (T_n − nµ_X)/(√n σ_X) ≤ b) → P(a ≤ Z ≤ b) as n → ∞.

1.6 The Population, the Sample, and Data

1.7 Elementary Probability, Stressing Independence

Conditional Probability (58)

P(A | B) = P(A ∩ B)/P(B)        P(A ∩ B) = P(A | B)·P(B)

Independent Events (62)
P(A | B) = P(A)        P(A ∩ B) = P(A)·P(B)

General Independence of Random Variables (64)

1. X1 and X2 are independent rvs

2. P(X_1 ∈ A and X_2 ∈ B) = P(X_1 ∈ A)·P(X_2 ∈ B) for all (one-dimensional) events A and B.

3. Discrete: P(X_1 = v_i and X_2 = w_j) = P(X_1 = v_i)·P(X_2 = w_j) for all v_i, w_j.

4. Continuous: f_{X_1,X_2}(v, w) = f_{X_1}(v)·f_{X_2}(w) for all v and w in the sample space.

5. Discrete: Repetitions X_1, X_2 of a value number X-experiment are independent provided sampling is done with replacement.

6. These statements generalize to n dimensions for both discrete and continuous distributions.

Law of Total Probability (67)

Bayes Theorem (67)
P(B_i | A) = P(A ∩ B_i)/P(A) = P(A | B_i)·P(B_i)/P(A) = P(A | B_i)P(B_i) / [P(A | B_1)P(B_1) + P(A | B_2)P(B_2) + ··· + P(A | B_n)P(B_n)]

Marginals (67)
p_X(v) = P(X = v) = Σ_{all w} P(X = v and Y = w) = Σ_{all w} p_{X,Y}(v, w).

Example: Maximums and minimums of independent rvs (74)

1.8 Expectation, Variance, and the CLT

Sums of Arbitrary RVs (83)
µ_{aX+b} = aµ_X + b        µ_{aX+bY} = aµ_X + bµ_Y        σ²_{aX+b} = a²σ²_X

Sums of Independent RVs (83)
E(XY) = E(X)·E(Y) = µ_X·µ_Y        Var(X + Y) = Var(X) + Var(Y) = σ²_X + σ²_Y

Mean and Variance of a Sum (83)
For independent repetitions of a basic X-experiment X_1, ..., X_n, let T_n = X_1 + ··· + X_n. Then
E(T_n) = nµ_X        Var(T_n) = nσ²_X

Z_n = (T_n − nµ_X)/(√n σ_X) = (X̄_n − µ_X)/(σ_X/√n) has mean 0 and variance 1.
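The pictorial CLT of §1.5 can be seen directly by simulating Z_n for a decidedly non-normal population; a sketch in which the Exponential population and the sample sizes are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0                       # mean and sd of the Exponential(1) population

for n in (2, 10, 50):
    X = rng.exponential(scale=1.0, size=(100_000, n))
    Zn = (X.sum(axis=1) - n * mu) / (np.sqrt(n) * sigma)
    # compare a tail probability of Z_n with the N(0,1) value
    print(n, (Zn <= 1).mean(), norm.cdf(1))
```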

Covariance (86)
Cov(X,Y) = E[(X − µ_X)(Y − µ_Y)] = E(XY) − µ_X µ_Y

Correlation (90)

Corr[X,Y] = Cov[X,Y]/(σ_X σ_Y)

Sample Moments (87)
µ̂ = X̄ = (1/n) Σ_{i=1}^n X_i

σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄)²
s² = 1/(n−1) Σ_{i=1}^n (X_i − X̄_n)²
σ̂_{X,Y} = (1/n) Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ)

Handy formulas (89)
σ² = E[X(X − 1)] + µ − µ²
σ² = E[X(X + 1)] − µ − µ²

1.9 Applications of the CLT

Continuity correction (93)
Pretty graphs (99)

Chapter 2

Introduction to Statistics

2.1 Presentation of Data

See Appendix B for examples of plot types.

2.2 Estimation of µ and σ2

Estimator                                                  Expected Value         Variance
Sample mean  µ̂ = X̄ = (1/n) Σ_{i=1}^n X_i                   E(X̄_n) = µ             Var[X̄_n] = σ²/n   (114)
Sample variance  S² = 1/(n−1) Σ_{i=1}^n (X_i − X̄_n)²        E[S²_n] = σ²           Var[S²_n] = (1/n)(µ_4 − ((n−3)/(n−1)) σ⁴)   (115)

2.3 Elementary Classical Statistics

Actual accuracy ratio of X¯n(Tn) (121)

Z_n = (X̄_n − µ)/(σ/√n) is approximately N(0,1)

CLT of Probability (121)
No matter what the shape of the distribution of the population of possible outcomes of an original X-experiment that has mean µ and standard deviation σ, the following result will hold true: as n gets larger, the distribution of the standardized Z_n form of the X̄_n-experiment gets closer to the distribution given by the standardized bell-shaped Normal curve.

Estimated accuracy ratio of X¯n(Tn) (122)

T_n = (X̄_n − µ)/(S_n/√n) ≈ T_{n−1}

CLT of Statistics (122)

The distribution of the ratio T_n is given (not by the standardized bell curve, but rather) by the Student T_{n−1} density. This is an approximation that gets better as the sample size n gets larger.

Random confidence intervals vs. hypothesis testing vs. p-values (123)
Random confidence intervals: a 95% CI for µ consists of exactly those values of µ that are .95-plausible.
Hypothesis testing: a .05-level test of H_o: µ = µ_o rejects H_o at the .05 level if µ_o is not .95-plausible.
p-values: the 2-sided p-value associated with µ_o is just exactly that level at which µ_o first becomes plausible.
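The equivalence of the interval, the test, and the p-value can be checked directly; a minimal sketch using scipy.stats, with made-up data for illustration:

```python
import numpy as np
from scipy.stats import t

x = np.array([4.2, 5.1, 3.8, 4.9, 5.6, 4.4, 4.7, 5.0])   # hypothetical sample
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

tstar = t.ppf(0.975, df=n - 1)
ci = (xbar - tstar * s / np.sqrt(n), xbar + tstar * s / np.sqrt(n))

mu0 = 4.0
T = (xbar - mu0) / (s / np.sqrt(n))
pval = 2 * t.sf(abs(T), df=n - 1)

# mu0 lies outside the 95% CI exactly when |T| > t*, exactly when pval < .05
print(ci, T, pval)
```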


Theorem 2.3.1 – Student T (124)
If X_1, X_2, ..., X_n are independent rvs that all have exactly a N(µ, σ²) distribution, then Student's estimated accuracy ratio has exactly the T_{n−1} distribution.

Metatheorem 2.3.1 (126)
If X_1, X_2, ..., X_n are independent rvs that all have approximately a N(µ, σ²) distribution, then Student's estimated accuracy ratio has approximately the T_{n−1} distribution.

Metatheorem 2.3.2 (126)

If X_1, X_2, ..., X_n are nearly independent rvs that have distributions that are not too skewed and do not produce severe outliers, then the following estimated accuracy ratio is approximately Normal, or some Student T: T = (W̄ − µ)/S.

Student T_n density (128)
f_n(t) = [Γ((n+1)/2)/(√(πn) Γ(n/2))] · 1/(1 + t²/n)^{(n+1)/2}  →  (1/√(2π)) e^{−t²/2}.

2.4 Elementary Statistical Applications

Type I error (130)

Rejecting the null hypothesis Ho when true

Type II error (130)

Not rejecting Ho when it is false

Test statistic Tn (130)

T_n(µ_o) = (X̄_n − µ_o)/(s_n/√n) ≈ T_{n−1}

Chapter 3

Probability Models

3.1 Math Facts

Geometric Series (135)

S_n = Σ_{k=0}^n ar^k = (a/(1 − r))(1 − r^{n+1})        S_∞ = Σ_{k=0}^∞ ar^k = a/(1 − r)

Infinite Series (135)

f(x) = Σ_{k=0}^∞ a_k x^k        f′(x) = Σ_{k=0}^∞ k a_k x^{k−1}

Exponential and Logarithmic Series (135)

e^x = Σ_{k=0}^∞ x^k/k!        log(1 + x) = Σ_{k=1}^∞ (−1)^{k+1} x^k/k   for |x| < 1

Fundamental Exponential Limit (136)

(1 + a_n/n)^n → e^a as n → ∞, whenever a_n → a as n → ∞.

Mean Value Theorem (136)
f(b) − f(a) = (b − a)f′(c) for some a < c < b

L’Hospital’s Rule (136)

Suppose f(a) = g(a) = 0, and that f′(a) and g′(a) ≠ 0 exist. Then
lim_{x→a} f(x)/g(x) = f′(a)/g′(a) = lim_{x→a} f′(x)/g′(x).

Taylor Series Expansion (136)

f(x) = f(a) + ((x − a)/1!) f′(a) + ··· + ((x − a)^k/k!) f^{(k)}(a) + R_k(x)

Integration by Parts (137)
∫ u dv = uv − ∫ v du

Gamma Function & Properties (138)

Γ(r) = ∫_0^∞ y^{r−1} e^{−y} dy        Γ(r) = (r − 1)Γ(r − 1)        Γ(r) = (r − 1)! for integer r        Γ(1/2) = √π

Standard Normal Density Function & Moments (139)

f(x) = (1/√(2π)) e^{−x²/2}        µ_X = 0        σ²_X = 1

Gamma Density & Moments (140)
f(w) = (1/(Γ(r) θ^r)) w^{r−1} e^{−w/θ}        µ_W = rθ        σ²_W = rθ²

3.2 Combinatorics and Hypergeometric RVs

Permutations & Combinations (148)

P_n^k = n!/(n − k)!        C_n^k = (n choose k) = n!/((n − k)! k!)

Sampling with replacement (Binomial) (150)
Sampling without replacement (Hypergeometric(R, W, n)) (151)

P(X = k) = (R choose k)(W choose n−k) / (N choose n)        µ = nR/N        σ² = n·(R/N)·((N − R)/N)·((N − n)/(N − 1))

Mathematics via revisualization (157)

(N choose k) = (N choose N−k)        (N choose k) = (N−1 choose k) + (N−1 choose k−1)        (m choose n) = (n−1 choose n−1) + (n choose n−1) + (n+1 choose n−1) + ··· + (m−1 choose n−1)

Bernoulli(p) – Xi = independent trials (171)

P(X = 1) = p        P(X = 0) = 1 − p = q        µ_{X_i} = p        σ²_{X_i} = pq

Binomial(n, p) – Tn = sum of independent trials (173)

P(T_n = k) = (n choose k) p^k q^{n−k}        µ_{T_n} = np        σ²_{T_n} = npq

Geometric Turns – GeoT(p) – Y_i = interarrival times (174)

P(Y = k) = q^{k−1} p        µ_{Y_i} = 1/p        σ²_{Y_i} = q/p²        P(Y > k) = q^k

GeoT(p) lack of memory property (174)
P(Y > i + k | Y > k) = P(Y > i) = q^i

Negative Binomial Turns – NegBiT(r, p) – Wr = waiting times (176)

P(W_r = k) = (k−1 choose r−1) p^r q^{k−r}        µ_{W_r} = r/p        σ²_{W_r} = rq/p²

Event Identity (176)

[W_r > n] = [T_n < r]        [W_r = k] = [T_{k−1} = r − 1] ∩ [X_k = 1]

Sums of NegBiT rvs (177) NegBiT(r, p)+NegBiT(s,p) =NegBiT(r + s,p)

Geometric Failures – GeoF(p) – Y ′ = failures before first success (177)

P(Y′ = k) = q^k p        Y′ + 1 = Y        µ_{Y′} = q/p        σ²_{Y′} = q/p²

Negative Binomial Failures – NegBiF(r, p) – W′_r = failures before the rth success (177)

P(W′_r = k) = (k+r−1 choose r−1) p^r q^k        W′_r + r = W_r        µ_{W′_r} = rq/p        σ²_{W′_r} = rq/p²

Examples

Comparing two success probabilities p_1 and p_2 (185)
Gambler's Ruin (194)

3.4 The

Poisson(λ) – N_t = count by time t (208, 219)
P(T = k) = e^{−λ} λ^k/k!        µ_T = λ        σ²_T = λ

Poisson Limit Theorem (208)

T_n ∼ Binomial(n, p_n) with λ_n = np_n → λ as n → ∞   ⟹   P(T_n = k) → P(T = k) = e^{−λ} λ^k/k!
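A numerical illustration of the Poisson limit; a sketch in which the particular n, p, and λ values are chosen only for illustration:

```python
from scipy.stats import binom, poisson

lam = 3.0
for n in (10, 100, 1000):
    p = lam / n
    # compare P(T_n = k) with the Poisson(lam) pmf at a few k
    print(n, [round(binom.pmf(k, n, p), 4) for k in range(5)])
print("poisson", [round(poisson.pmf(k, lam), 4) for k in range(5)])
```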

Sums of Poissons are Poisson (210)

3.5 The Poisson Process N

Poisson(νt) – N_t = count by time t (208, 219)
P(N_t = k) = e^{−νt} (νt)^k/k!        µ = νt = λ        σ² = νt = λ

Exponential (θ) – Yi = interarrival times (220)

f_Y(t) = νe^{−νt} = (1/θ) e^{−t/θ}        µ_{Y_i} = 1/ν = θ        σ²_{Y_i} = (1/ν)² = θ²

Event Identity (220)

[Y >t] = [Nt = 0] = [Nt < 1]

Exponential (θ) lack of memory (220)

P(Y > s + t | Y > s) = P(Y > t) = e^{−νt} = e^{−t/θ}

Gamma(r, "rate" = ν) – W_r = waiting times (221)
f_{W_r}(t) = (1/Γ(r)) ν^r t^{r−1} e^{−νt} = (1/(Γ(r) θ^r)) t^{r−1} e^{−t/θ}        µ_{W_r} = r/ν = rθ        σ²_{W_r} = r/ν² = rθ²

Event Identity (221)

[Wr >t] = [N(t) < r]

Sums of Gamma RVs (222) Gamma(r, ν) + Gamma(s,ν) = Gamma(r + s,ν)

Random location of the times of event occurrences is Binomial (222, 214)

P(T_1 = k | T_1 + T_2 = n) = (n choose k) p^k (1 − p)^{n−k}  where p = λ_1/(λ_1 + λ_2)

Conditioned Poisson sums have Binomial Distributions (222, 214)
(N_s | N_t = m) ∼ Binomial(m, p = s/t)

Example Comparison of bulb lifetimes (230)

3.6 The Failure Rate Function λ(·)

Failure Rate Function (241)
With a constant failure rate, λ(t) = f(t)/(1 − F(t)) = F′(t)/[1 − F(t)] = λ for all t ≥ 0.
For a nonconstant failure rate, this generalizes to −(d/dt) log[1 − F(t)] = λ(t).

3.7 Min, Max, Median, and Order Statistics

Uniform(0,1) order statistics (252)

Let U_1, ..., U_n be independent Uniform(0,1) rvs. Let 0 < t < 1 and let B_n(t) = #{i: U_i ≤ t} ∼ Binomial(n, t).

[Un:k >t] = [Bn(t) < k]

Distribution of Uniform Order Statistics (252)

U_{n:k} has the Beta(k, n − k + 1) density:
f_{U_{n:k}}(t) = (n choose k−1, 1, n−k) t^{k−1} (1 − t)^{n−k}   for 0 ≤ t ≤ 1.

Asymptotic Distribution of the minimum of Uniform(0,1) rvs (258)

P(nU_{n:1} > t) = [P(U > t/n)]^n = (1 − t/n)^n → e^{−t}   for all 0 ≤ t < n.
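Both order-statistic facts above are easy to check by simulation; a sketch in which the sample sizes and checked points are arbitrary choices:

```python
import numpy as np
from scipy.stats import beta, expon

rng = np.random.default_rng(1)
n, k, reps = 50, 3, 100_000
U = np.sort(rng.uniform(size=(reps, n)), axis=1)

# U_{n:k} should follow Beta(k, n-k+1)
print((U[:, k - 1] <= 0.05).mean(), beta.cdf(0.05, k, n - k + 1))

# n * U_{n:1} should be approximately Exponential(1)
print((n * U[:, 0] > 1.0).mean(), expon.sf(1.0))
```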

Joint Distribution of Uniform Order Statistics (253)
f_{U_{n:i}, U_{n:k}}(s, t) = (n choose i−1, 1, k−i−1, 1, n−k) s^{i−1} (t − s)^{k−i−1} (1 − t)^{n−k}   for 0 ≤ s < t ≤ 1.

3.8 Multinomial Distributions

Multinomial(n; p) Distribution of N (263)
P(N = m) = (n choose m_1, m_2, ..., m_I) p_1^{m_1} p_2^{m_2} ··· p_I^{m_I}   for each possible outcome m ≥ 0 having Σ_1^I m_i = n.

Marginal Distributions (264)

N_i ∼ Binomial(n, p_i).
Cov[X_{ki}, X_{kj}] = p_i(1 − p_i) for i = j,  = −p_i p_j for i ≠ j   (indicators from the same trial k)
Cov[N_i, N_j] = np_i(1 − p_i) for i = j,  = −np_i p_j for i ≠ j
Cov[p̂_i, p̂_j] = (1/n) p_i(1 − p_i) for i = j,  = −(1/n) p_i p_j for i ≠ j

Conditional Distributions (265)
Given that T = N_{I_1+1} + ··· + N_I = t,
((N_1, ..., N_{I_1}) | T = t) ∼ Multinomial(n − t; p′_1, ..., p′_{I_1})

3.9 Sampling from a Finite Population, with a CLT

3.A Distribution Relationships – Casella Berger

3.B Distribution Relationships – Galen Shorack

Chapter 4

Dependent Random Variables

4.1 Two-Dimensional Discrete RVs

Joint distribution (288)

PX,Y (x,y) = P (X = x,Y = y)

Marginal distributions of X and Y (288)

P_X(x) = Σ_{∀y} P_{X,Y}(x,y)        P_Y(y) = Σ_{∀x} P_{X,Y}(x,y)

Independence (288)
P(X ∈ A, Y ∈ B) = P(X ∈ A)·P(Y ∈ B)        P_{X,Y}(v, w) = P_X(v)·P_Y(w)  ∀ v, w

Convolution formula (for Z = X + Y) (290)

P_Z(k) = P(Z = k) = P(X + Y = k) = Σ_{i=−∞}^{∞} P(X = i, Y = k − i)
       = Σ_{i=−∞}^{∞} P_X(i)·P_Y(k − i)
       = Σ_{i=0}^{k} P_X(i)·P_Y(k − i)   (for nonnegative integer-valued X and Y)

Conditional distribution of Y given X = x (292)

p_{Y|X=x}(y) = P(Y = y | X = x) = P(Y = y, X = x)/P(X = x) = p_{X,Y}(x,y)/p_X(x)
p_{X,Y}(x,y) = p_X(x)·p_{Y|X=x}(y) = p_Y(y)·p_{X|Y=y}(x)

Expected value of two-dimensional discrete RVs (294)

E[g(X,Y)] = Σ_{∀v} Σ_{∀w} g(v, w)·P((X,Y) = (v, w))
E[g(X)] = Σ_{∀v} Σ_{∀w} g(v) P_{X,Y}(v, w) = Σ_{∀v} g(v) P_X(v)

Theorem of the Unconscious Statistician (295)

If V = g(X,Y), then E(V) = Σ_{∀v} v p_V(v) = E[g(X,Y)] = Σ_{∀x} Σ_{∀y} g(x,y) p(x,y).

Independence of Partitions (297)

P(A_i, B_j) = p_{X,Y}(a_i, b_j) = p_X(a_i)·p_{Y|X=a_i}(b_j) = P(A_i)·P(B_j | A_i)   ∀ i, j
            = p_X(a_i)·p_Y(b_j) = P(A_i)·P(B_j)   (under independence)

4.2 Two-Dimensional Continuous RVs

Joint density function (304)
P((X,Y) ∈ A) = ∫∫_A f(x,y) dx dy

Joint distribution function (304)
F(x,y) = F_{X,Y}(x,y) = P(X ≤ x, Y ≤ y)

Marginal densities (304)

f_X(x) = ∫_{−∞}^{∞} f(x,y) dy        f_Y(y) = ∫_{−∞}^{∞} f(x,y) dx

Conditional densities (305)
f_{Y|X=x}(y) = f(x,y)/f_X(x)   ⟹   f(x,y) = f_X(x)·f_{Y|X=x}(y)
Independence   ⟹   f(x,y) = f_X(x)·f_Y(y)

Expected value of two-dimensional continuous RVs (305)

E[g(X,Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x,y) f(x,y) dx dy

Convolution formula for the sum of continuous RVs (Z = X + Y) (306)

F_Z(z) = P(Z ≤ z) = P(X + Y ≤ z) = ∫∫_{[u+v ≤ z]} f(u,v) du dv
       = ∫_{−∞}^{∞} ∫_{−∞}^{z−u} f(u,v) dv du          for any joint density f
       = ∫_{−∞}^{∞} ∫_{−∞}^{z} f(u, w − u) dw du        letting w = v + u
       = ∫_{−∞}^{z} ∫_{−∞}^{∞} f(u, w − u) du dw        always OK when f ≥ 0
       = ∫_{−∞}^{z} [ ∫_{−∞}^{∞} f_X(u) f_Y(w − u) du ] dw   when X and Y are independent.

Theorem of the Unconscious Statistician (307)

If Z = g(X,Y), then E(Z) = ∫_{−∞}^{∞} z f_Z(z) dz = E[g(X,Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x,y) f(x,y) dx dy.

Sums of Independent rvs (308)
Sums of independent Normals are Normal
Sums of independent Cauchy rvs are Cauchy
Sums of independent Gammas are Gamma

Sums of two independent Uniform rvs have a Triangular density (see Appendix C)

Factoring (74,309) Random variables are independent if and only if their distribution factors

Example: Density of a quotient (309)

The quotient R = U/V has density f_R(r) = ∫_{−∞}^{∞} |v| f(rv, v) dv for all r.

4.3 Conditional Expectation

Conditional probability distribution of Y given X = x (312)
p_{Y|X=x}(y) = P(Y = y | X = x) = p(x,y)/p_X(x)

Conditional mean (regression function) (312, 328)
m(x) = E(Y | X = x) = Σ_{∀y} y p_{Y|X=x}(y)
µ_Y = E(Y) = E[E(Y | X)] = E[m(X)]

Conditional variance (312)

v²(x) = Var(Y | X = x) = Σ_{∀y} (y − m(x))² p_{Y|X=x}(y)
σ²_Y = Var(Y) = E[Var(Y | X)] + Var[E(Y | X)] = E[v²(X)] + Var[m(X)]

Conditional expectation of g(X,Y) given X = x (312)

E[g(X,Y) | X = x] = Σ_{∀y} g(x,y) p_{Y|X=x}(y)
E[g(X,Y)] = E[ E[g(X,Y) | X = x] ]

A Factoring Example (313)
For 0 ≤ y ≤ x ≤ 1,
f(x,y) = 1/x = (1/x)·1_{[0 ≤ y ≤ x ≤ 1]} = 1_{[0,1]}(x) · [ (1/x)·1_{[0,x]}(y) ]
       = f_X(x) · f_{Y|X=x}(y)

Summing a random number of RVs (315)
Let X_1, X_2, ... be iid with mean µ_X and variance σ²_X, independent of the nonnegative integer-valued rv N.
Then for T = Σ_{i=1}^N X_i,        E(T) = µ_X µ_N        Var(T) = µ_N σ²_X + µ²_X σ²_N

Example: Conditioning via regeneration (317)
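A quick simulation check of the random-sum moment formulas above; a sketch in which the Poisson count and Exponential summands are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_X, var_X = 2.0, 4.0        # Exponential(scale=2): mean 2, variance 4
mu_N, var_N = 3.0, 3.0        # Poisson(3): mean 3, variance 3

N = rng.poisson(mu_N, size=50_000)
T = np.array([rng.exponential(2.0, n).sum() for n in N])

print(T.mean(), mu_X * mu_N)                        # both near 6
print(T.var(), mu_N * var_X + mu_X**2 * var_N)      # both near 24
```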

4.4 Prediction

Mean squared error (326)
E[(Y − a)²]

Least squares predictor of Y (= µ_F) (326)
ā_F such that E[(Y − ā_F)²] = min_a E[(Y − a)²]

Mean absolute deviation (326)
E[|Y − a|]

Least absolute deviation predictor of Y (ä_F) (326)
E[|Y − ä_F|] = min_a E[|Y − a|]

Bias = systematic error (326)
b_F = a_F − µ_F = difference between predictor and mean
E[(Y − a_F)²] = σ²_F + b²_F :  mean squared error is the variance plus bias squared

Median (m̈_F) (327)

∫_{−∞}^{m̈_F} f_Y(y) dy = 0.5

Analogs in Samples (327)

LS predictor of a future observation Y: Ȳ
LAD predictor of a future observation Y: Ÿ = median(Y_1, ..., Y_n)

Other moments (329)
E[X^k]           kth moment
E[(X − µ)^k]     kth central moment
E[|X|^r]         rth absolute moment
E[|X − µ̈|]       mean deviation about the median (µ̈ is a median of X)

Least squares predictor (330) 4.5. COVARIANCE 21

Best linear predictor (330)

4.5 Covariance

Covariance (333)
σ_{X,Y} = Cov(X,Y) = E[(X − µ_X)(Y − µ_Y)] = Cov(X − µ_X, Y − µ_Y) = E(XY) − E(X)E(Y)

Correlation (333)
ρ_{X,Y} = Corr(X,Y) = Cov(X,Y)/(σ_X σ_Y) = Cov( (X − µ_X)/σ_X , (Y − µ_Y)/σ_Y )

Correlation inequality (335)
−1 ≤ ρ ≤ 1
ρ = ±1 iff (Y − µ_Y) = ±(σ_Y/σ_X)(X − µ_X)

Sample variance, population variance (336)

s²_X = s² = 1/(n−1) Σ_{i=1}^n (X_i − X̄_n)² = SS_XX/(n−1)        σ̂²_X = σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄_n)² = SS_XX/n
E[S²] = σ²
Var[S²] = (1/n)(µ_4 − ((n−3)/(n−1)) σ⁴) = (µ_4 − σ⁴)/n + 2σ⁴/(n(n−1)) ≈ (µ_4 − σ⁴)/n

Sample covariance (336)
s_{X,Y} = SS_XY/(n−1)        σ̂_{X,Y} = SS_XY/n = (1/n) Σ_{i=1}^n X_iY_i − X̄ Ȳ        where SS_XY = Σ (X_i − X̄_n)(Y_i − Ȳ_n)

Sample correlation coefficient, population correlation coefficient (336)

ρ̂ = ρ̂_{X,Y} = s_{X,Y}/(s_X s_Y) = σ̂_{X,Y}/(σ̂_X σ̂_Y) = SS_XY/√(SS_XX SS_YY)

Empirical distribution (distribution of the sample) (337)

E(U) = X̄        E(V) = Ȳ        Cov(U, V) = σ̂_{X,Y}
Var(U) = σ̂²_X    Var(V) = σ̂²_Y    Corr(U, V) = ρ̂_{X,Y}

Covariance of linear combinations (337) Cov[X + Y,Y + Z] = Cov[X,Y ] + Cov[X,Z] + Cov[Y,Y ] + Cov[Y,Z] = Cov[X,Y ] + Cov[X,Z] + V ar(Y ) + Cov[Y,Z]

Covariance of general linear combinations (338)
Union/intersection formula (339)
Poisson Limit Theorem (redux) (339)

One-sample Covariance structure (341)
Cov[X_i, X̄] = σ²/n
Var[X_i − X̄] = ((n−1)/n) σ²
Cov[X_i − X̄, X_j − X̄] = −σ²/n
Corr[X_i − X̄, X_j − X̄] = −1/(n−1)

4.6 Bivariate Normal Distributions

Bivariate (348) X and Y have a Bivariate Normal Distribution:

(X, Y) ∼ N( (µ_X, µ_Y),  [ σ²_X  σ_{X,Y} ; σ_{X,Y}  σ²_Y ] ),   where σ_{X,Y} = ρσ_Xσ_Y,
with
f(x,y) = 1/(2πσ_Xσ_Y √(1−ρ²)) · exp( −Q(x,y)/(2(1−ρ²)) )
where Q(x,y) = ((x−µ_X)/σ_X)² − 2ρ((x−µ_X)/σ_X)((y−µ_Y)/σ_Y) + ((y−µ_Y)/σ_Y)²

Marginal distribution, Conditional distribution (348)
f(x,y) = f_X(x) f_{Y|X=x}(y), where
f_X(x) is the N(µ_X, σ²_X) density
f_{Y|X=x}(y) = 1/(√(2π) σ_Y√(1−ρ²)) · exp( −[y − (µ_Y + ρ(σ_Y/σ_X)(x − µ_X))]² / (2σ²_Y(1−ρ²)) ),  i.e.  (Y | X = x) ∼ N(m(x), v²(x))

Regression function & parameters (349)

m(x) = E[Y | X = x] = µ_Y + ρ(σ_Y/σ_X)(x − µ_X)
     = α + β(x − µ_X)   where α = µ_Y, β = ρσ_Y/σ_X
     = γ + βx            where γ = µ_Y − βµ_X
v²(x) = Var(Y | X = x) = σ²_Y(1 − ρ²) = σ²_ε   where σ_ε = σ_Y√(1 − ρ²)

Independence (349)
Bivariate Normal rvs (X,Y) are independent if and only if ρ = 0.
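The regression-function formulas above can be checked by simulating from a bivariate normal and conditioning on a narrow slice of x values; a sketch in which all numerical choices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_X, mu_Y, s_X, s_Y, rho = 1.0, 2.0, 1.5, 0.5, 0.6
cov = [[s_X**2, rho * s_X * s_Y], [rho * s_X * s_Y, s_Y**2]]
X, Y = rng.multivariate_normal([mu_X, mu_Y], cov, size=500_000).T

x0 = 2.0
slab = np.abs(X - x0) < 0.02                 # crude conditioning on X близ x0 (X ≈ x0)
m_theory = mu_Y + rho * (s_Y / s_X) * (x0 - mu_X)
v_theory = s_Y**2 * (1 - rho**2)
print(Y[slab].mean(), m_theory)              # conditional mean vs m(x0)
print(Y[slab].var(), v_theory)               # conditional variance vs v^2(x0)
```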

Factorization (352)

Any factorization of the type g(x)h_x(y) (in which all h_x(y) are densities) is a unique factorization, in general.

Regression line (m(x)) (355)
y = ρx for standard normals
y = µ_Y + ρ(σ_Y/σ_X)(x − µ_X) for general normals

Standard deviation line / line of symmetry for contour ellipses (355)
y = x for standard normals
y = µ_Y + (σ_Y/σ_X)(x − µ_X) for general normals

Extra Facts (356)
Linear functions of Bivariate Normals are Bivariate Normal (see section 5.2)
Elliptical contours (357)
Contour lines (orthogonal transformations) (357)

Chapter 5

Distribution Theory

5.1 Transformations of RVs

Linear transformations (365)
Y = aX + b

Distribution function method (365)
Note that F_Y(y) = P(Y ≤ y) = P(aX + b ≤ y) = P(X ≤ (y − b)/a) = F_X((y − b)/a) (for a > 0).
Differentiating both sides yields f_Y(y) = (d/dy) F_Y(y) = (d/dy) F_X((y − b)/a) = (1/a) f_X((y − b)/a).

Linear transformations of Normal RVs (365)
Z ∼ N(0,1)
X = µ + σZ ∼ N(µ, σ²)
Y = aX + b = (aµ + b) + (aσ)Z ∼ N(aµ + b, a²σ²)
Z² = (N(0,1))² ∼ Gamma(1/2, "mean" = 2) ∼ χ²_1
Σ_{i=1}^n Z²_i ∼ sum of Gamma(1/2, "mean" = 2) rvs ∼ χ²_n

Scale transformations of Gamma RVs (365)
Y ∼ Gamma(r, θ)
Y/θ ∼ Gamma(r, "mean" = 1)
2Y/θ ∼ Gamma(r, "mean" = 2) ∼ χ²_{2r}

Distribution function method for monotone transformations (365)
For monotone transformations (strictly increasing or decreasing functions Y = g(X)),
F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y))   (for increasing g).
Therefore f_Y(y) = (d/dy) F_Y(y) = f_X(g^{−1}(y)) · |(d/dy) g^{−1}(y)| = f_X(g^{−1}(y)) · 1/|g′(g^{−1}(y))|.

Change of variable formula (365)
f_Y(y) = f_X(x)·|dx/dy| = f_X(x)/|dy/dx|     OR     f_New(y) = f_Old(x)·|dOld/dNew|

Distribution function method for multiple valued transformations (366)
For Y = X²,
F_Y(y) = P(Y ≤ y) = P(−√y ≤ X ≤ √y) = F_X(√y) − F_X(−√y − 0)
f_Y(y) = [f_X(√y) + f_X(−√y)] / (2√y)

Probability integral transformation (366)
U = F(X) ∼ Uniform(0,1):
F_U(u) = P(U ≤ u) = P(F(X) ≤ u) = P(X ≤ F^{−1}(u)) = F(F^{−1}(u)) = u

Inverse transformation (367)

U ∼ Uniform(0,1);  X = F^{−1}(U)
F_X(x) = P(X ≤ x) = P(F^{−1}(U) ≤ x) = P(U ≤ F(x)) = F(x)

Transformation by integration (367)

F_W(w) = P(W ≤ w) = ∫···∫_{[x: H(x) ≤ w]} f_{X_1,...,X_n}(x_1, ..., x_n) dx_1 ··· dx_n

Student-t, F, Beta (369)
For Z ∼ N(0,1), U ∼ χ²_m, V ∼ χ²_n (independent),
T = Z/√(V/n) ∼ T_n
W = (U/m)/(V/n) ∼ F_{m,n}
B = U/(U + V) ∼ Beta(m/2, n/2)
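These three constructions are easy to verify by simulation; a minimal sketch in which the degrees of freedom are arbitrary choices:

```python
import numpy as np
from scipy.stats import t, f, beta

rng = np.random.default_rng(4)
m, n, reps = 4, 7, 200_000
Z = rng.standard_normal(reps)
U = rng.chisquare(m, reps)
V = rng.chisquare(n, reps)

print((Z / np.sqrt(V / n) <= 1.0).mean(), t.cdf(1.0, n))          # Student-t_n
print(((U / m) / (V / n) <= 2.0).mean(), f.cdf(2.0, m, n))        # F_{m,n}
print((U / (U + V) <= 0.3).mean(), beta.cdf(0.3, m / 2, n / 2))   # Beta(m/2, n/2)
```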

Non-central T-distribution, χ², F (370)
T_δ = (Z + δ)/√(V/n) ∼ T_n(δ)
Σ_{i=1}^m (Z_i + δ_i)² = (Z_1 + δ)² + Σ_{i=2}^m Z²_i ∼ χ²_m(δ)
[χ²_m(δ)/m] / [χ²_n/n] ∼ F_{m,n}(δ)

5.2 Linear Transformations in Higher Dimensions

Definitions
orthogonal (382) – A 2×2 matrix A is orthogonal if its rows are two perpendicular vectors of length 1.
mean vector (384) – The mean vector of the random vector X = (X_1, ..., X_n)′ is the n × 1 vector µ = µ_X = (E(X_1), ..., E(X_n))′.
covariance matrix (384) – The covariance matrix for the same vector is the n × n matrix Σ = Σ_X = ||σ_ij||.
non-negative definite (384) – Every covariance matrix Σ is non-negative definite, i.e. Σ ≥ 0.
positive definite (384) – Σ > 0.
orthogonal unit vectors (386) – γ_1, ..., γ_n (with γ_i = (γ_i1, ..., γ_in)′) such that γ_i′γ_i = 1 for all 1 ≤ i ≤ n, but γ_i′γ_j = 0 for all i ≠ j.

Matrix Identities (382)
A = [ a  b ; c  d ]

A^{−1} = (1/(ad − bc)) [ d  −b ; −c  a ]        |A| = ad − bc        |A||A^{−1}| = |AA^{−1}| = |I| = 1

Extended Identities (382)
Y = (Y_1, Y_2)′ = (a_11 X_1 + a_12 X_2,  a_21 X_1 + a_22 X_2)′ = [ a_11  a_12 ; a_21  a_22 ] (X_1, X_2)′ = AX

Jacobian = |det(∂x_i/∂y_j)| = |det A^{−1}| = 1/|det A| = 1/|det(∂y_i/∂x_j)|

Density of a linearly transformed variable (382)

If X = A^{−1}Y,   f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(x_1, x_2)·|det A^{−1}| = f_{X_1,X_2}(x_1, x_2)/|det A|

Mean vectors and covariance matrices of linear transformations (384)

Y = AX + c has mean vector µY = AµX + c and covariance matrix ΣY = AΣXA′

Multivariate Normal distribution (385)

For Z1,...,Zn independent N(0, 1) rvs,

Y = AZ + µ,  i.e. Y_i = a_i′Z + µ_i for the rows a_i′ of A,  so Y ∼ N(µ, Σ).
Y has the Multivariate Normal(µ, Σ) density, determined solely by µ_Y = µ and Σ_Y = AA′:

f_Y(y) = 1/( |Σ|^{1/2} (2π)^{n/2} ) · exp( −(1/2)(y − µ)′ Σ^{−1} (y − µ) )

Orthogonal transformations (386)
z = Gy where G = (γ_1, ..., γ_n)′ = ||γ_ij|| is an orthogonal matrix as defined above (note G′G = I = GG′).

Partitioning a multivariate Normal (389)
Let Y = AZ + µ as above, and partition Y, µ, and Σ so that

Σ = AA′ = [ A_1 ; A_2 ] [ A_1′  A_2′ ] = [ A_1A_1′  A_1A_2′ ; A_2A_1′  A_2A_2′ ] = [ Σ_11  Σ_12 ; Σ_21  Σ_22 ]
Then Y_1 ∼ N(µ_1, Σ_11) and Y_2 ∼ N(µ_2, Σ_22).

Y_1 and Y_2 are uncorrelated if and only if Σ_12 = 0.

5.3 General Transformations in Higher Dimensions

5.4 Asymptotics

Definitions

convergence in probability (408) – U_n →_p U when P(|U_n − U| ≥ ε) → 0 as n → ∞, for every ε > 0.
convergence in distribution (408) – W_n →_d W when the df F_n of W_n converges to the df F of W at each point x where F is continuous. We also say that W_n is asymptotically distributed as W.

Classical CLT of Probability (408)

Z_n = √n(X̄_n − µ)/σ →_d Z ∼ N(0,1)

Finite Sampling CLT (409)

Z_n = √n(X̄_n − µ) / ( σ √(1 − (n−1)/(N−1)) ) →_d Z ∼ N(0,1)

PLT: T_n ∼ Binomial(n, p_n) →_d T ∼ Poisson(λ) when λ_n = np_n → λ.
Minimum of Uniforms: nU_{n:1} →_d Exponential(1).
Weakest Link Limit Theorem: r_n Y_{n:1} →_d Weibull(0, α, β).
Discrete Uniform: For X_n with Discrete Uniform(0, n) distribution, X_n/n →_d Uniform(0,1).

Markov Inequality (409)
For any rv X, P(|X| ≥ t) ≤ E[|X|]/t for all t > 0

Chebyshev Inequality (409)
For X with finite mean µ and variance σ²,
P(|X − µ| ≥ t) ≤ σ²/t² for all t > 0.

Weak Law of Large Numbers (WLLN) (410)

Let X_1, X_2, ..., X_n be iid with finite mean µ. Then X̄_n →_p µ.
If, additionally, the X_i have common finite variance σ², the WLLN gives
S²_n →_p σ²,   σ̂²_n →_p σ²,   and σ̂²_{0n} →_p σ².

Slutsky’s Theorem (411)

Suppose W_n →_d W, U_n →_p a, V_n →_p b, and h is continuous at a. Then
U_nW_n + V_n →_d aW + b   and   h(U_n) →_p h(a).
The CLT of Statistics thereby follows from the CLT of Probability: √n(X̄_n − µ)/S_n →_d N(0,1), since √n(X̄_n − µ)/σ →_d N(0,1).

Mann-Wald Theorem (Continuous Mapping Theorem) (412)
For continuous functions h, W_n →_d W implies h(W_n) →_d h(W).
e.g. Z²_n = [√n(X̄_n − µ)/σ]² →_d N²(0,1) ∼ χ²_1 follows from Mann-Wald with h(t) = t².

Delta Method (413)
If Z_n = √n(T_n − ν) →_d Z = N(0, τ²) and g is differentiable at ν, then
√n[g(T_n) − g(ν)] =_a g′(ν) × Z_n →_d g′(ν) × Z = N(0, [g′(ν)]² τ²)
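A small simulation of the delta method for g(t) = log t applied to a sample mean; a sketch with illustrative choices of population and sample size:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20_000
nu, tau = 2.0, 2.0                       # Exponential(scale=2): mean 2, sd 2

Tn = rng.exponential(2.0, size=(reps, n)).mean(axis=1)
lhs = np.sqrt(n) * (np.log(Tn) - np.log(nu))

# the delta method predicts sd = |g'(nu)| * tau = tau / nu
print(lhs.std(), tau / nu)
```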

Bivariate CLT of Probability (413)
(Z̄^X_n, Z̄^Y_n) = ( √n(X̄_n − µ_X), √n(Ȳ_n − µ_Y) ) →_d (Z_X, Z_Y) ∼ N( (0,0), [ σ²_X  σ_XY ; σ_XY  σ²_Y ] )

Bivariate Mann-Wald (414)
For continuous functions h(·,·), and Z̄^X_n and Z̄^Y_n as defined above, h(Z̄^X_n, Z̄^Y_n) →_d h(Z_X, Z_Y) as n → ∞.
e.g. For Y_i = (X_i − µ)², having mean σ² and variance µ_4 − σ⁴,
√n( X̄_n − µ_X, S²_n − σ² ) =_a √n( X̄_n − µ_X, Ȳ_n − σ² ) →_d N( (0,0), [ σ²_X  µ_3 ; µ_3  µ_4 − σ⁴ ] )

Two-Dimensional Delta Method (415)

√n[ g(X̄_n, Ȳ_n) − g(µ_X, µ_Y) ] =_a ∇g(µ)′ (Z_X, Z_Y)′ = c_1Z_X + c_2Z_Y
    = c′(Z_X, Z_Y)′ ∼ N(0, c′Σc) = N(0, c²_1σ²_X + 2c_1c_2σ_XY + c²_2σ²_Y).

More General CLTs (416)
Weighted average CLT
Liapunov CLT
Lindeberg-Feller CLT
Multivariate CLT

5.5 Moment Generating Functions

Moment Generating Function (mgf) (421)

tX M(t) = MX (t) = E[e ]

Characteristic Function (chf) (421)

itX φ(t) = φX (t) = E[e ]

Cumulant Generating Function (cgf) (421)

K(t) = K_X(t) = log φ(t)

Math Facts
e^{itX} = cos(tX) + i sin(tX)
|e^{itX}| ≤ 1 for all possible values of X and t
e^{tX} = 1 + (tX)/1! + ··· + (tX)^k/k! + ···

Fundamental Properties of Transforms (422)
Linear changes in X
M_{aX+b}(t) = E[e^{t(aX+b)}] = e^{tb} E[e^{(at)X}] = e^{tb} M_X(at)
φ_{aX+b}(t) = E[e^{it(aX+b)}] = e^{itb} E[e^{i(at)X}] = e^{itb} φ_X(at)

MGF of a sum of independent rvs
M_{X+Y}(t) = M_X(t)M_Y(t)        φ_{X+Y}(t) = φ_X(t)φ_Y(t)        K_{X+Y}(t) = log φ_{X+Y}(t) = log(φ_X(t)φ_Y(t)) = K_X(t) + K_Y(t)

Uniqueness Theorem
M_X(·) completely determines P_X, and vice versa, with qualifications.
φ_X(·) completely determines P_X, and vice versa.

Cramér-Levy Continuity Theorem

If M_{Z_n}(t) → M_Z(t), then Z_n →_d Z, with qualifications.
φ_{Z_n}(t) → φ_Z(t) for all t if and only if Z_n →_d Z.

Getting moments from the MGF (423)
M^{(k)}(0) = E[X^k]

Getting moments from the CGF (424)
K^{(j)}(0) = i^j κ_j
For any rv X with µ_4 < ∞:  κ_1 = µ,  κ_2 = σ²,  κ_3 = µ_3,  κ_4 = µ_4 − 3µ_2² = µ_4 − 3σ⁴
For standardized rvs, κ_3 = γ_1 (skewness) and κ_4 = γ_2 (kurtosis).

Multivariate Moment Generating Functions (429)

M_X(t) = E[e^{t′X}] = E[e^{t_1X_1 + ··· + t_nX_n}]
φ_X(t) = E[e^{it′X}] = E[e^{i(t_1X_1 + ··· + t_nX_n)}]
Analogs of the four fundamental properties listed above exist for rvs in n dimensions.

Multivariate Normal MGF (429)

Chapter 6

Classical Statistics

6.1 Estimation

Definitions
parameter space (435) – The set Θ of all possible values of the parameter θ of the model
statistic (435) – any function of the data
unbiased (435) – If E_θ[T(X)] = τ(θ) for each θ, then T is an unbiased estimator for τ(θ)
consistent (435) – If T_n(X) →_p τ(θ) for each θ, then T is a consistent estimator for τ(θ)
bias (436) – E_θ[T] − τ(θ)
mean squared error (MSE) (436) – Mse(θ) = E_θ[(T − τ(θ))²]
method of moments estimators (437) – Parameter estimators resulting from equating the population moments with the sample moments, and solving for the parameters

Main Elementary Facts (436)

Mse(θ) = Var_θ(T) + Bias²(θ)

T_n is consistent whenever Mse_{T_n}(θ) → 0 as n → ∞, or whenever Var_θ(T_n) → 0 and Bias_{T_n}(θ) → 0.
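The Mse = variance + bias² decomposition can be seen directly by simulating a biased estimator; a sketch in which the Normal population and the estimator (1/n)Σ(X_i − X̄)² of σ² are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
sigma2, n, reps = 4.0, 10, 200_000
X = rng.normal(0.0, 2.0, size=(reps, n))

T = X.var(axis=1, ddof=0)                # biased estimator of sigma^2
mse = ((T - sigma2) ** 2).mean()
var = T.var()
bias2 = (T.mean() - sigma2) ** 2
print(mse, var + bias2)                  # the two agree up to simulation error
```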

6.2 The One-Sample Normal Model

Definitions

confidence intervals (441) – X̄_n ± t*S_n/√n provides a (1 − α) confidence interval for µ.
random theoretical confidence interval (441) – with probability 1 − α, will contain the fixed unknown value µ. This probability is exact for a Normally distributed experiment X, and approximately correct for any experiment X having finite mean and variance.
numerical confidence interval (441) – We have 1 − α degree of belief that the non-random numerical value x̄_n did estimate µ to within an estimated margin for error of t*s_n/√n.
level-α test (442) – Reject H_o: µ = µ_o in favor of H_a: µ ≠ µ_o whenever |T_n(µ_o)| > t_{n−1}^{(α/2)}.
power (442) – Power(µ, σ) = P_{µ,σ}(H_o is rejected) = P(|T_{n−1}(δ)| > t_{n−1}^{(α/2)}).
p-value (444) – chance of the statistic taking on a value at least as extreme as its observed value; numerical evaluation of the strength of evidence against the null hypothesis.
model equation (445) – X_i = µ + ε_i, for 1 ≤ i ≤ n, with independent N(0, σ²) rvs ε_i.
location parameter (445) – µ in the model equation.
scale parameter (445) – σ in the model equation.
errors, residuals (445) – ε̂_i = X_i − µ̂ = Observation_i − Fitted Value_i.
sum of squares (445) – SS(µ̂) = Σ_{i=1}^n (X_i − X̄_n)² = SS_XX
fitted value, least squares estimator (445) – µ̂ (= X̄_n), for which the sum of squares is minimized.
pth quantile (446) – z_p such that P(Z ≤ z_p) = p


Theorem 6.2.1 – One-Sample Normal Theory Building Blocks (441) 2 For an independent random sample X1,...,Xn from the N(µ, σ ) distribution,

Z_n = √n(X̄_n − µ)/σ ∼ N(0,1)

S²_n/σ² = (1/σ²)·SS_XX/(n − 1) ∼ χ²_{n−1}/(n − 1)

X̄_n and S²_n are independent rvs.

T_n = √n(X̄_n − µ)/S_n = √n(X̄_n − µ)/√(SS_XX/(n−1)) ∼ T_{n−1}.

Kantorovich Inequality (449)
If P(a ≤ X ≤ b) = 1 for some a > 0, then 1 ≤ E[X]·E[1/X] ≤ (a + b)²/(4ab).

Jensen Inequality (449)
g(µ) ≤ E[g(X)], for convex g on the interval I, where X has mean µ in the interior of the interval I.
e.g. For g(X) = X², we obtain E[X²] − µ² = σ² ≥ 0.

6.3 The Two-Sample Normal Model

Definitions
variance ratio (457) – ∆² = σ²_2/σ²_1.
paired differences (458) – For an independent sample (X_1,Y_1), ..., (X_n,Y_n) from a basic (X,Y)-experiment with joint Bivariate Normal distribution, D_i = X_i − Y_i is the paired difference.
efficiency of estimators (461) – E(µ̄_n, µ̂_n) = Var[µ̂_n]/Var[µ̄_n] → E(µ̄, µ̂).

Equal variances (456)
For independent samples X_1, ..., X_m and Y_1, ..., Y_n from N(µ, σ²) and N(ν, σ²), and θ = µ − ν,
θ̂ = X̄_m − Ȳ_n ∼ N(θ, ((m+n)/(mn)) σ²).
S² = ((m−1)/(m+n−2)) S²_X + ((n−1)/(m+n−2)) S²_Y = (SS_XX + SS_YY)/(m+n−2) ∼ (σ²/(m+n−2)) χ²_{m+n−2}.
It follows from the independence of θ̂ and S² that
T = √(mn/(m+n)) · [(X̄_m − Ȳ_n) − (µ − ν)]/S_{m,n} = √(mn/(m+n)) · (θ̂ − θ)/S_{m,n} ∼ Student T_{m+n−2}.

Unequal variances (456)
For independent samples X_1, ..., X_m and Y_1, ..., Y_n from N(µ, σ²_1) and N(ν, σ²_2), and θ = µ − ν,
θ̂ = X̄_m − Ȳ_n ∼ N(θ, ((m+n)/(mn)) σ̄²), where σ̄² = (nσ²_1 + mσ²_2)/(m+n).
S̄² = (nS²_X + mS²_Y)/(m+n).

By the CLT (provided only that the basic X and Y have finite variances),
T̄ = √(mn/(m+n)) · [(X̄_m − Ȳ_n) − (µ − ν)]/S̄_{m,n} = [(X̄_m − Ȳ_n) − (µ − ν)] / √( S²_X/m + S²_Y/n ) →_d N(0,1).

Known Variance Ratio (∆2) (457)

The pooled estimator above now satisfies S² ∼ (σ²_1/(m+n−2)) (χ²_{m−1} + ∆² χ²_{n−1}), but a more helpful statistic is
S²(∆) = ((m−1)/(m+n−2)) S²_X + ((n−1)/(m+n−2)) (1/∆²) S²_Y ∼ (σ²_1/(m+n−2)) χ²_{m+n−2}.
The corresponding T-statistic is
T(∆_o) = √( mn/(n + m∆²_o) ) · [(X̄_m − Ȳ_n) − (µ − ν)]/S_{m,n}(∆_o) ∼ T_{m+n−2}.

Matched Pairs T-test (458)
As defined above, D = X − Y ∼ N(µ_D, σ²_D) = N(θ, σ²_D), where the variance is given by σ²_D = σ²_1 + σ²_2 − 2σ_12.
From Theorem 6.1.2, we have
Z_D = √n(D̄_n − θ)/σ_D ∼ N(0,1),
S²_D = 1/(n−1) Σ_{i=1}^n (D_i − D̄)² ∼ (σ²_D/(n−1)) χ²_{n−1}.
It follows from the independence of D̄ and S²_D that
T_D = √n(D̄_n − θ)/S_{D,n} ∼ T_{n−1}.

Normal Theory Building Blocks (461)
The optimal choice for weighting two samples is that which minimizes the variance of µ̂. The natural estimator of the variance employs the same weighting used for µ̂.

6.4 Other Models

Examples
Bernoulli (466)
Poisson (467)
NegBiT (469)

Chapter 7

Estimation, Principles and Approaches

7.1 Sufficiency, and UMVUE Estimators

Definitions

sufficient statistic (473) – T = T(X) is a sufficient statistic for θ (or for the family P = {P_θ: θ ∈ Θ}) when the conditional distribution P_{X|T=t}(·) does not depend on θ.
ancillary statistic (474, 486) – A statistic V = V(X) is called ancillary if the distribution of V does not depend on θ.
zero function (476) – h(t) = 0 for all t.
complete family of distributions (476, 484) – An arbitrary family P_T of the possible distributions (of some rv T) is called complete if the condition E_P[h(T)] = 0 for every possible distribution P of T in the family P_T implies that h can only be the zero function.
uniform minimum variance unbiased estimator (UMVUE) (476) – Suppose T is an unbiased estimator of τ(θ) for which Var_θ[T] ≤ Var_θ[U] for all θ, for any other unbiased estimator U of τ(θ). Then T is a UMVUE of τ(θ).
minimally sufficient statistic (478) – In general, the statistic T(X) is said to be minimally sufficient in some model if any other sufficient statistic T̃ = T̃(X) for this model necessarily satisfies a relationship of the form T = h̃(T̃) for some function h̃.
common support (480) – The support of P_θ is {x: p_θ(x) > 0}. A family of densities for which the support of each density is the same is said to have common support.

Sufficiency Principle (474)

If T = T (X) is a sufficient statistic for θ in the model, then to estimate any parameter of the form τ(θ) one should only use estimators that are functions of T .

Ancillarity Principle (474)

Basing an analysis on the conditional distribution given the ancillaries might well access useful information about the underlying distribution. (Often useful, it is less than compelling; ancillaries are not unique.)

Fisher-Neyman Factorization Theorem (476)

T(X) is sufficient for θ if and only if the model distributions can be factored as
p_θ(x) = g_θ(t) × h_t(x) = g_θ(t) × h(x)
f_θ(x) = g_θ(t) × h_t(x) = g_θ(t) × h(x)

Rao-Blackwell (476)
Let U = U(X) be unbiased for τ(θ), and suppose T(X) is sufficient for θ. Then V = V(T) = E(U | T)
a) is a function of the sufficient statistic,
b) is unbiased for τ(θ), since E_θ(V) = E_θ(m(T)) = E_θ(E(U | T)) = E_θ(U) = τ(θ), and
c) is at least as good as U, since
   Var_θ(V) = Var_θ(m(T)) ≤ Var_θ(E(U | T)) + E_θ(Var(U | T)) = Var_θ(U).

Lehmann-Scheff´e(unique UMVUEs) (476) Let T be sufficient for θ and complete, and V = V (T ) be any unbiased estimator of τ(θ) with finite variance. Then

V is the unique UMVUE of τ(θ). (Often a computation of the form V = E(U | T) will be useful.)

Likelihood Ratios and Sufficiency (481) Within the common support model, T (X) is sufficient for θ if and only if each likelihood ratio is a function of this T .

Examples Sufficiency geometrically (normal spheres)(477) Condition on Ancillary Statistics (481)

7.2 Completeness, and Ancillary Statistics

Math Fact 1 (infinite series) (484)

h(x) = Σ_{k=0}^∞ a_k x^k = 0 for all x in some interval iff a_k = 0 for all k.

∫_{(−∞, x]} h(t) dt = 0 for all x iff h(x) = 0 for all x.

Basu (487)
If T = T(X) is sufficient and complete for θ, then every ancillary statistic V(X) is independent of T.

Sufficiency and Completeness Relative to Inclusion (487)

If T = T(X) is sufficient for P, then T is sufficient for the smaller P_o.
If the family P_o^T is complete, then the bigger family P^T is complete.

Examples Uniform(0,θ), Uniform (θ,θ) (485) – complete and sufficient statistics Uniform(0,θ) (486) – ancillary statistics Exponential(θ) (486) – ancillary statistics Poisson UMVUEs (487) 7.3. EXPONENTIAL FAMILIES, AND COMPLETENESS 35

7.3 Exponential Families, and Completeness

Definitions
exponential family (492) – A family P_o = {P_θ: θ ∈ Θ_o} of distributions that can be written in the format c(θ) e^{Σ_{i=1}^κ a_i(θ)T_i(x)} h(x) 1_D(x), where c(θ) depends only on θ and h(x) > 0 iff x ∈ D. The family has common support D.
Order of P_o (492) – the minimum κ for which such an expression of an exponential family is possible.
affinely independent over D (492) – means that Σ_1^κ c_i T_i(x) = c_o for a.e. x ∈ D implies that all c_i = 0
minimal over D (492) – T that satisfies affine independence
affinely independent over Θ_o (492) – means that Σ_1^κ c_i a_i(θ) = c_o for θ ∈ Θ_o implies that all c_i = 0
minimal over Θ_o (492) – a that satisfies affine independence
natural parameter space (η) (492) – For P_o with order κ and T and a both minimal, replace a(θ) with η = (η_1, ..., η_κ)′. Rewrite the density as e^{−d(η) + Σ_{i=1}^κ η_i T_i(x)} h(x) 1_D(x), where d(η) = log( ∫_D e^{Σ_{i=1}^κ η_i T_i(x)} h(x) dx ). N = {η: d(η) < ∞} is the natural parameter space.
full exponential family (492) – P_N = {P_η: η ∈ N} for N as just defined.
curved exponential family (495) – situations where the parameters in the natural parameter space are subject to specified relationships (e.g. Normal(θ, θ) or Normal(θ, θ²)).

Natural Parameter Space Properties (492)
1) N is a convex set and d(·) is convex on N.
If E_η[|g(X)|] < ∞ for all η ∈ N, then
2) all derivatives of φ(·) exist and
3) φ(η) = E_η[g(X)] can be differentiated "under the integral sign" on the interior of N.

Theorem 7.3.1: Sufficient and complete statistic for exponential families (493)

Let X1, ..., Xn be iid, with each distribution in a family of order κ given by

c(θ) e^{Σ_{i=1}^κ a_i(θ)T_i(x)} h(x) 1_D(x)

(A) For 1 ≤ i ≤ κ, define T_i(X) = Σ_{j=1}^n T_i(X_j), and then define the statistic T = T(X). This T is sufficient for P.

(B) If N_o = {(a_1(θ), ..., a_κ(θ)): θ ∈ Θ_o} contains a κ-dimensional subset H, then T is both minimally sufficient and complete for P_N ⊇ P.

(C) The sufficient statistic T is distributed as an exponential family with the same a_i(θ)'s as X.

(D) Suppose κ is fixed. The conditional distribution of (T_1 | T_2 = t) is again an exponential family.

(E) The conditional distribution P_{X|T=t} takes the form
[ c^n(θ) e^{a(θ)t} × Π_{j=1}^n h(x_j) ] / [ c^n(θ) e^{a(θ)t} h_*(t) ] = Π_{j=1}^n h(x_j) / h_*(t)

Theorem 7.3.2: Sufficiency and completeness of one-to-one transformations (494)
The outcome of some X-experiment is modeled as being distributed according to one of the distributions P_θ in some family P = {P_θ: θ ∈ Θ}. Suppose T = T(X) is sufficient and complete for θ. Let U = U(T) denote any one-to-one function of T. Then U is sufficient and complete for θ. (Also, T, θ, and U may all denote vectors.)

7.4 The Cram´er-Rao Bound

Definitions

likelihood of θ (503) – L(θ; x) = p_θ(x) OR f_θ(x). In the case of independent rvs X_k, L(θ; x) = Π_{k=1}^n L(θ; x_k).
maximum likelihood estimator (503) – the value of θ in the parameter set Θ that maximizes the likelihood L(θ; x).
log-likelihood (503) – ℓ(θ; x) = log L(θ; x), which for iid data equals Σ_{k=1}^n ℓ(θ; x_k)
score function (503) – ℓ̇(θ; x) = (∂/∂θ) log p_θ(x) OR (∂/∂θ) log f_θ(x)
likelihood equation (503) – ℓ̇(θ; x) = 0; the MLE is commonly found by solving this equation.
Fisher information (504) – I_n(θ) = E_θ{[ℓ̇_n(θ; X)]²}
alternative Fisher information (504) – Ī_n(θ) = E_θ{−ℓ̈_n(θ; X)}; this is equal to the Fisher information for "regular" distributions
information per observation (504) – the Fisher information as calculated from one observation x_k of the data
theoretical score function (504) – ℓ̇_n(θ; X)
numerical score function (504) – ℓ̇_n(θ; x)
relative efficiency (506) – E_θ(T, T̃) = Mse_θ[T]/Mse_θ[T̃]
absolute efficiency (506) – E_θ(T̃) = CRB_n(θ)/Mse_θ[T̃]
asymptotic relative efficiency (506) – the limit of absolute efficiency as n → ∞

Elementary Fact (505)
Cov(U, V) = E[(U − 0)(V − µ_V)] when E[U] = 0.

Cram´er-Rao, Information Inequality, Equality Corollary (504)

(A) Let the model distributions = Pθ : θ Θ have common support. If the equation 1 = Eθ(1) can be differentiated under the integralP sign,{ then∈ the} score function has mean 0; that is Eθ[ℓ˙(θ, X)] = 0.

(B) If the equation 1 = E_θ(1) can be differentiated under the integral sign twice, then the score function has variance Ī_n(θ). That is (using (A)),
I_n(θ) = E_θ[(ℓ̇_n(θ, X))²] = Var_θ[ℓ̇_n(θ, X)] = Ī_n(θ) = E_θ[−ℓ̈_n(θ, X)]

(C) (Information Inequality) Mainly, let T = T(X) denote any statistic of interest for which E_θ[|T(X)|] < ∞, and define τ(θ) = E_θ[T(X)]. Suppose τ(θ) has a derivative τ′(θ) on the interior of Θ, and both τ(θ) and 1 = E_θ(1) can be differentiated once under the integral.

Then a bound on the variance of any such unbiased estimator T(X) of τ(θ) is
Var_θ[T(X)] ≥ [τ′(θ)]²/I_n(θ)
Thus, we have a standard which no unbiased estimator can surpass.

(D) (Equality Corollary) Equality is achieved in (C) for a particular estimator T(X) of τ_o(θ) only if
ℓ̇(θ, X) = A_n(θ)[T(X) − τ_o(θ)] for some constant A_n(θ),
and in this case the CR-bound of (C) becomes
Var_θ[T(X)] = τ′_o(θ)/A_n(θ).

Moreover, when this relationship holds, I_n(θ) = |τ′_o(θ) A_n(θ)|.
This format can hold for at most one function τ_o(·) (and linear variations on it).

Theorem 7.4.2: UMVU Estimators of τ(θ) (506)
X is distributed according to one of the distributions P_θ in P = {P_θ: θ ∈ Θ}.
(a) Let V be a UMVUE of τ(θ) that has finite variance for all θ. Then V is unique.

(b) Let V be unbiased for τ(θ) with finite variance within the model. Let U denote any statistic having finite variance within the model. Then V is UMVUE of τ(θ) iff Cov_θ[U, V] = 0 for all U with E_θ[U] = 0 for all θ.

Example

Poisson(ν·a_i) regression (508)
Truncation family (509)
Power family (509)
Non-regularity of Uniform(0, θ) (510)

7.5 Maximum Likelihood Estimators Definitions

maximum likelihood estimator (513) – θ̂_n(x) is the value of θ in the parameter set Θ that maximizes the likelihood L(θ; x).
theoretical maximum likelihood estimator (513) – θ̂_n(X), the MLE when evaluated at the random unperformed experimental outcome X.
sufficiency (513) – If T(X) is sufficient for θ, then any MLE θ̂_n is a function of T.
invariance (513) – If θ̂_n is the MLE of θ, then the MLE of τ(θ) is τ(θ̂_n).
general asymptotic normality and a_n-consistency (529) – a_n(θ̃_n − b_n(θ)) →_d N(0, v²_θ)
asymptotic normality and √n-consistency (513) – If an iid model has nice smoothness properties, then √n(θ̂_n − θ) →_d N(0, I^{−1}(θ)) for I(θ) the Fisher information in one X.
Fisher information for a vector (516) –
Ī_n(θ) = [ −E_θ[(∂²/∂α²) ℓ_n(θ; X)]        −E_θ[(∂²/∂β∂α) ℓ_n(θ; X)] ;
           −E_θ[(∂²/∂α∂β) ℓ_n(θ; X)]       −E_θ[(∂²/∂β²) ℓ_n(θ; X)] ]
(alternative form) (516) –
I_n(θ) = [ E_θ[((∂/∂α) ℓ_n(θ; X))²]                          E_θ[((∂/∂α) ℓ_n(θ; X))((∂/∂β) ℓ_n(θ; X))] ;
           E_θ[((∂/∂β) ℓ_n(θ; X))((∂/∂α) ℓ_n(θ; X))]         E_θ[((∂/∂β) ℓ_n(θ; X))²] ]
regularity conditions (517) – The previous alternative forms of the Fisher information matrix are equivalent under certain regularity conditions (see Theorem 7.5.1). Under these conditions, the theoretical score function has mean 0 and variance I(θ).
expected information (517) – I(θ)
information random variable (517) – I_n(θ) = −(1/n) ℓ̈_n(θ; X) = V̄ →_p µ_V = Ī
observed information (517) – Î_n = I_n(θ̂_n) = −(1/n) Σ_{k=1}^n ℓ̈(θ̂_n; X_k)
estimated information (518) – I(θ̂_n)
Newton-Raphson iteration method (522)

Theorem 7.5.1: Asymptotic properties of MLEs (518)
Regularity Conditions (519)
Assumptions that suffice (even for an r-dimensional θ) are:
(i) Z_n(θ_o) →_d N(0, I(θ_o)) and I_n(θ_o) →_p Ī(θ_o) = I(θ_o) > 0 is finite.
(ii) θ̂_n satisfies the likelihood equation.
(iii) θ̂_n →_p θ_o.
(iv) The mean value theorem is applicable.
(v) |D_n| = |I_N(θ̂*_n) − I_n(θ_o)| ≤ sup[ |I_N(θ′) − I_n(θ_o)| : |θ′ − θ_o| ≤ |θ̂_n − θ_o| ] →_p 0.
(vi) The distributions P_θ have common support.

Conclusions (518)
Consider an iid model that satisfies these regularity conditions. Then θ̂_n and Î_n = I_n(θ̂_n) satisfy:
(i) √n(θ̂_n − θ) =_a I^{−1}(θ_o) Z_n(θ_o) = I^{−1}(θ_o) [ (1/√n) Σ_{k=1}^n ℓ̇(θ_o; X_k) ]
(ii) I^{−1}(θ_o) Z(θ_o) ∼ I^{−1}(θ_o) N(0, I(θ_o)) ∼ N(0, I^{−1}(θ_o))
(iii) Î_n = I_n(θ̂_n) →_p I(θ_o) > 0 for the observed information Î_n = I_n(θ̂_n).
(iv) √n(θ̂_n − θ) · Î_n^{1/2} →_d N(0, I_m) (where I_m is the m × m identity matrix).
(v) θ̂_n ± z^{(α/2)} (1/√n) Î_n^{−1/2} (for an m = 1 dimensional parameter θ).
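As a concrete illustration of conclusion (v), here is a sketch that builds the interval θ̂_n ± z^{(α/2)} Î_n^{−1/2}/√n for the rate θ of an Exponential model with per-observation log-likelihood ℓ(θ; x) = log θ − θx; the data are simulated and the model choice is only for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
theta_true, n = 0.5, 400
x = rng.exponential(scale=1 / theta_true, size=n)

theta_hat = 1 / x.mean()                      # MLE, solving the likelihood equation
I_hat = 1 / theta_hat**2                      # observed information per observation: -l''(theta) = 1/theta^2
z = norm.ppf(0.975)
half = z / np.sqrt(n * I_hat)
print(theta_hat, (theta_hat - half, theta_hat + half))
```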

Theorem 7.5.2: Asymptotic properties of MLE vectors (519) Regularity Conditions (519) Assuming the following regularity conditions: (N0)

(i) Observations X_1, ..., X_n are iid as one of the distinct distributions P_{θ_o} in P = {P_θ: θ ∈ Θ}, defined by densities f_θ(·) or mass functions p_θ(·).
(ii) The parameter vector θ is m-dimensional with Θ a subset of R^m.
(iii) The support ({x: f_θ(x) > 0} or {x: p_θ(x) > 0}) does not depend on θ.

(N1)
(i) All partials have mean 0 under θ_o; thus, E_{θ_o}[ℓ̇_i(θ_o; X)] = 0 for all i.
(ii) All partials have finite variances under θ_o; thus E_{θ_o}{[ℓ̇_i(θ_o; X)]²} < ∞ for all i.
(iii) Ī(θ_o) = I(θ_o) > 0.

Also: Θ contains an open neighborhood H of the true parameter value θ_o in which:
(N2) For all x, all second order partial derivatives of ℓ(θ; x) are continuous in θ and the magnitude of each is bounded by a function M(x) for which E_{θ_o}[M(X)] < ∞.
(N3) For all x, all third order partial derivatives of ℓ(θ; x) are continuous in θ and the magnitude of each is bounded by a function M(x) for which E_{θ_o}[M(X)] < ∞.

Conclusions (519)
Let θ̂_n be a solution of the likelihood equation for which θ̂_n →_p θ_o, as required. Suppose that either all of the regularity conditions hold, or that the assumptions for the vector version hold. Then the conclusions of Theorem 7.5.1 still hold.

Newton-Raphson Iteration Method (522)
Newton-Raphson provides iterated solutions to the likelihood equations. The method works extremely well if H′(θ) > 0 (or H′(θ) < 0) over the entire region in question. The solution is found through iteration of a one-term Taylor series approximation.
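A sketch of the iteration applied to the score function of a Cauchy location model (the model, data, and starting value are illustrative assumptions): each step replaces θ by θ − ℓ̇(θ)/ℓ̈(θ).

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.standard_cauchy(200) + 3.0         # Cauchy location sample, true theta = 3

def score(theta):                          # l'(theta) for the Cauchy location model
    d = x - theta
    return np.sum(2 * d / (1 + d**2))

def score_deriv(theta):                    # l''(theta)
    d = x - theta
    return np.sum(2 * (d**2 - 1) / (1 + d**2) ** 2)

theta = np.median(x)                       # consistent starting value
for _ in range(20):
    theta = theta - score(theta) / score_deriv(theta)
print(theta)
```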

7.6 Procedures for Location and Scale Models

Definitions
location and scale model equation (545) – Y_k = θ + τ·W_k for 1 ≤ k ≤ n, where W_1, ..., W_n are independent with a fixed and known df F_o(·).
location invariance (545) – C(Y) = C(θ + τW) = θ + τC(W)
scale invariance (545) – V(Y) = V(θ + τW) = τV(W)
pivot (545) – T = T(Y) = √n (C(Y) − θ)/V(Y) = ··· = √n C(W)/V(W) = T(W) is a pivot for θ on which confidence intervals can be based. Likewise, U = U(Y) = V/τ = V(W) is a pivot for τ.

Natural Estimators of Location (545)
"The examples of location estimators listed below are seen to behave in a very pleasing fashion in the context of a location-scale model. These estimators of location (or center) include"

Ȳ_n = (1/n) Σ_{k=1}^n Y_k = (1/n) Σ_{k=1}^n (θ + τW_k) = θ + τ·W̄_n

Ÿ_n = median_{1≤k≤n} Y_k = median_{1≤k≤n} (θ + τW_k) = θ + τ·Ẅ_n

(1/2)(Y_{n:1} + Y_{n:n}) = θ + τ·(1/2)(W_{n:1} + W_{n:n})

Natural Estimators of Scale (545) “The examples of scale estimators listed below are seen to behave in a very pleasing fashion in the context of a location-scale model. These estimators of scale (or dispersion, or variation) include”

D̄_n = (1/n) Σ_{k=1}^n |Y_k − Ȳ_n| = (1/n) Σ_{k=1}^n |(θ + τW_k) − (θ + τW̄_n)| = τ·(1/n) Σ_{k=1}^n |W_k − W̄_n|

S_n = [ 1/(n−1) Σ_{k=1}^n |Y_k − Ȳ_n|² ]^{1/2} = [ 1/(n−1) Σ_{k=1}^n |(θ + τW_k) − (θ + τW̄_n)|² ]^{1/2} = τ·[ 1/(n−1) Σ_{k=1}^n |W_k − W̄_n|² ]^{1/2}

Y_{n:n} − Y_{n:1} = (θ + τW_{n:n}) − (θ + τW_{n:1}) = τ·(W_{n:n} − W_{n:1})

7.7 Bootstrap Methods

7.8 Bayesian Methods

Distributions
prior distribution (571) – π(θ) – distribution of the model parameter according to current knowledge.
posterior distribution (571) – π(θ | x) – distribution upon which all Bayesian statistical procedures will be based:
f_{θ|X=x}(θ) = f(x | θ)π(θ)/f_X(x) = f(x | θ)π(θ) / ∫_Θ f(x | θ′)π(θ′) dθ′
predictive/marginal distribution (571) – f_X(x) = ∫_Θ f(x | θ′)π(θ′) dθ′

Moments
posterior mean (571) – E(θ | x) = m(x) = mean of the posterior distribution π(θ | x); often can be expressed as a weighted average of the prior mean and the data mean.
posterior mean of τ(θ) (581) – τ̂_B = τ̂_π(x) = E_π(τ(θ) | x); minimizes E_π(L(τ̂_B, τ) | x) = posterior risk with respect to π(·) (generally the loss function used is the squared error loss, making this equivalent to minimizing the posterior variance).
posterior variance (571) – Var(θ | x) = v²(x) = variance of the posterior distribution π(θ | x).

Risk

posterior risk for π (572) – Var_π(τ(θ) | x), evaluated at τ̂(x) = E_π[τ(θ) | x], which minimizes E_π[(τ(θ) − τ̂(x))² | x] among all τ̂(x).
risk function (581) – R(τ̂, θ) = E_θ[L(τ̂, τ(θ))]
π-averaged risk of τ̂ (581) – R_π(τ̂) = E_π[R(τ̂, θ)] = ∫ R(τ̂, θ)π(θ) dθ

Extra Definitions

1 − α credible set (574) – B_x s.t. the probability that τ(θ) is in B_x is 1 − α under the posterior distribution. When B_x is an interval, call this a 1 − α credible interval for τ(θ).
1 − α credible interval for θ (574) – I_x = τ^{−1}(B_x)
highest probability density intervals (HPD intervals) (574) – shortest intervals I_x; these are where the density takes its highest values.
Bayes estimator of τ(θ) (571) – see posterior mean of τ(θ).
Bayesian interval for θ (571) – see 1 − α credible interval.

Bayesian Hypothesis Testing (572)
The level α Bayesian test of Θ_o vs. Θ_o^c is based on the following hypothesis testing decision rule:
Reject Θ_o if P_π(Θ_o | X = x) ≤ α
Under this rule, we can define a Bayesian p-value as the posterior probability of the null hypothesis: P_π(Θ_o | X = x).
Likewise, we can define Bayesian power via the posterior probability of Θ_o computed under an alternative prior π̃: P_π̃(Θ_o | X = x).

Bayesian Sufficiency (572)
If π(θ | x) depends on x only through t = T(x), then T is Bayesian sufficient for θ (and we can replace X with T in π(θ | x) above).
If T is sufficient for P, then T is Bayesian sufficient for θ.

Conjugate priors
Binomial-Beta (573) – Y ∼ Binomial(n, θ) and π ∼ Beta(α, β) have posterior distribution Beta(α*, β*), where α* = y + α, β* = n + β − y.
Gamma-Gamma (576) – Y_i ∼ Gamma(r, ν) and π ∼ Gamma(r_o, c) have posterior Gamma(r_o*, c*), where r_o* = nr + r_o, c* = t + c = Σ y_i + c.
Poisson-Gamma (586) – Y_i ∼ Poisson(λ) and π ∼ Gamma(r, ν) have posterior distribution Gamma(r*, ν*), where r* = t + r = Σ y_i + r, ν* = n + ν.
Normal-Normal (575) – Y_i ∼ Normal(θ, σ_o²) and π ∼ Normal(µ_o, τ_o²) have posterior Normal(µ_o*, σ_o*²), where µ_o* = (µ_oσ_o² + nτ_o²Ȳ_n)/(σ_o² + nτ_o²), σ_o*² = σ_o²τ_o²/(σ_o² + nτ_o²).
Normal-Gamma conjugate prior for N(θ, σ²) with both unknown (578)
Exponential family conjugate priors (590)
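A minimal sketch of the Binomial-Beta update above, using scipy's Beta distribution for the posterior summaries; the prior parameters and data are illustrative assumptions:

```python
from scipy.stats import beta

a, b = 2.0, 3.0               # Beta(2, 3) prior for theta
n, y = 20, 14                 # observed 14 successes in 20 Bernoulli(theta) trials

a_star, b_star = y + a, n + b - y
posterior = beta(a_star, b_star)

print(posterior.mean())                       # posterior mean of theta
print(posterior.ppf([0.025, 0.975]))          # a 95% (equal-tailed) credible interval
```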

Loss Functions L(τ̂, τ)
squared error loss function (581) – [τ̂ − τ(θ)]²
absolute error loss function (581) – |τ̂ − τ|
relative squared error loss function (581) – [(τ̂ − τ)/τ]²
Stein loss function (581) – (τ̂ − τ)/τ − log(τ̂/τ) = (τ̂ − τ)/τ − log(1 + (τ̂ − τ)/τ)

Chapter 8

Hypothesis Tests

8.1 The Neyman-Pearson Lemma and UMP Tests Review

null hypothesis (598) – H_o
alternative hypothesis (598) – H_a
decision rule (598) – φ – a rule defining when to reject H_o (e.g. Reject H_o if X ∈ C)
test/critical function (598) – φ: S → [0, 1], where φ(x) is the conditional probability that H_o is rejected given that X has the observed value x.
rejection region, critical region C (598) – typically consists of values X in the sample space for which some statistic T takes on values implausible under H_o.

Bayesian Approach:
risk (604) – R_π(φ) is the expected loss associated with the decision rule φ
Bayes rule (604) – for the prior π(·), any rule φ_π that minimizes the risk R_π(φ).
Bayes risk (604) – the risk as defined by such a rule φ_π.
conditional/posterior risk (604) – R_π(φ | X = x), the expected loss associated with the decision rule, given observations x

Definitions

simple hypothesis (598) – H_o: θ = θ_o (one value)
composite hypothesis (598) – more than one value in H_o (Θ_o)
level (598) – aimed-for size, can be less than or equal to size; may not be achieved in discrete distributions

size (598) – α = sup_{Θ_o} β(θ) = actual level of the test
randomized test (598) – determined by the test function, and used for discrete distributions to get level equal to α
power (599) – β(θ) = E_θ[P_θ(H_o is rejected | X)] = E_θ[φ(X)]
no data test (599) – φ(x) = α for all x, always rejecting at random
non-randomized test (599) – φ(X) = 1_C(X) for some rejection set C.
Neyman-Pearson likelihood ratio statistic (LR-statistic) (599) – Λ(x) = f_a(x)/f_o(x)

Theorem 8.1.1 – Neyman-Pearson Lemma (599)

Fix 0 ≤ α ≤ 1. Suppose P_o and P_a have distributions f_o and f_a.
(i) There exists a constant k and a function γ(x) for which the test
(a) φ(x) = 1 if Λ(x) = f_a(x)/f_o(x) > k,  = γ(x) if Λ(x) = k,  = 0 if Λ(x) < k
has (b) E_o[φ(X)] = α.
(ii) Any test of this type is a most powerful test of P_o vs. P_a of level α.


Corollary 1
If Λ(x) = f_a(x)/f_o(x) = g(t) for some increasing function g(·) of t = T(x), then the test in (i) above may be replaced (for appropriate constants c and γ that do exist) by
(a) φ(x) = 1 if t > c,  = γ if t = c,  = 0 if t < c,   where (b) P_o(T > c) + γP_o(T = c) = α.

Theorem 8.1.2 – Uniformly Most Powerful (UMP) tests (601)
If φ_o(·) is a fixed test that gives a most powerful test for the simple hypothesis θ_o versus the simple alternative θ_a, and this same φ_o test is most powerful against each fixed alternative θ_a ∈ Θ_a, then φ_o is a UMP test of H_o: θ_o vs H_a: θ ∈ Θ_a.

If this same test is a level α test for the composite hypothesis θ ∈ Θ_o (that is, sup_{θ ∈ Θ_o} β_{φ_o}(θ) ≤ α), then φ_o is a UMP test of H_o: θ ∈ Θ_o vs H_a: θ ∈ Θ_a.

Role of sufficient statistics (601)
The Neyman-Pearson test may be based directly on any sufficient statistic rather than the full likelihood, by application of the factorization theorem (that f_θ(x) = g_θ(t)h(x)):
Λ^{NP}(x) = f_{θ_a}(x)/f_{θ_o}(x) = g_{θ_a}(t)/g_{θ_o}(t) = Λ^T(t).

Theorem 8.1.3 – Karlin-Rubin (601)

Define monotone likelihood ratio (MLR) for a family of distributions {f_θ: θ ∈ Θ}: for θ < θ′, f_{θ′}(x)/f_θ(x) is an increasing (or decreasing) function of t = T(x).

For a family with MLR in T, the test φ defined in the Neyman-Pearson Corollary is UMP for testing H_o : θ ≤ θ_o vs H_a : θ > θ_o.

Exponential families with MLR (602)
Model the X-experiment as having one of the distributions in the exponential family c(θ)e^{η(θ)T(x)}h(x), with natural parameter η(θ) that is increasing in θ. Then the family of distributions of T has MLR in T. (Likewise if η(θ) is decreasing in θ.)
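A sketch (Python) of the randomized UMP one-sided test built from the Corollary, using the Binomial(n, θ) family, which has MLR in T = number of successes; the values of n, θ_o, and α are made up for illustration.

    from scipy.stats import binom

    # Binomial(n, theta) has MLR in T = X, so the one-sided UMP test rejects for large T
    n, theta0, alpha = 10, 0.5, 0.05

    # Smallest c with P_{theta0}(T > c) <= alpha, then randomize on {T = c}
    c = 0
    while binom.sf(c, n, theta0) > alpha:
        c += 1
    gamma = (alpha - binom.sf(c, n, theta0)) / binom.pmf(c, n, theta0)

    # phi(t): reject w.p. 1 if t > c, gamma if t == c, 0 if t < c
    size = binom.sf(c, n, theta0) + gamma * binom.pmf(c, n, theta0)
    print(c, gamma, size)   # size equals alpha exactly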

8.2 The Likelihood Ratio Test

Definitions

Likelihood Ratio Test (610) – Λ^{LR} = sup_Θ L(x, θ) / sup_{Θ_o} L(x, θ) = L(θ̂_n; x) / L(θ̂_n^o; x)
Kullback-Leibler information number (626) – H_{θ,θ_o} = K(P_a, P_o) = E_{P_a}[log(p_a(X)/p_o(X))] ≥ 0; compares P_a and P_o, and is a well defined number for every possible pair P_a and P_o of probability distributions.
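A small numeric check (Python) of the Kullback-Leibler number for two discrete distributions; the probability vectors are arbitrary illustrative choices.

    import numpy as np

    # Arbitrary illustrative pmfs p_a and p_o on the same finite support
    p_a = np.array([0.5, 0.3, 0.2])
    p_o = np.array([0.25, 0.25, 0.5])

    # K(P_a, P_o) = E_{P_a}[log(p_a(X)/p_o(X))]
    K = np.sum(p_a * np.log(p_a / p_o))
    print(K)   # nonnegative; zero only when the two pmfs coincide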

Theorem 8.2.1 – Asymptotic null distribution of the LR-statistic Λ_n (610)

Let r denote the dimension of the vector parameter θ = (θ_1, ..., θ_r)′, and express H_o : θ_{q+1} = ··· = θ_r = 0. Then

2 log Λ_n(X) →_d χ²_{r−q} when H_o is true.

Asymptotically, the level α LR-test rejects H_o in case 2 log Λ_n exceeds χ²_{r−q}^{(α)}.
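A sketch (Python) of the asymptotic LR-test for a Poisson mean, H_o : λ = λ_o against the unrestricted alternative, so that r − q = 1; the Poisson model, sample size, and seed are illustrative choices, not the document's example.

    import numpy as np
    from scipy.stats import poisson, chi2

    # Illustrative setup: X_1..X_n iid Poisson(lambda), testing H_o: lambda = lam0
    rng = np.random.default_rng(1)
    n, lam0 = 50, 2.0
    x = rng.poisson(lam=2.5, size=n)   # generated away from H_o here

    lam_hat = x.mean()                 # unrestricted MLE

    # 2 log Lambda_n = 2[ sup_Theta log L - sup_Theta_o log L ] ~ chi^2_1 under H_o
    stat = 2 * (poisson.logpmf(x, lam_hat).sum() - poisson.logpmf(x, lam0).sum())
    reject = stat > chi2.ppf(0.95, df=1)
    print(stat, reject)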

Improvements on Asymptotic LR-test for Specific Cases (610)
(i) If Λ_n is an increasing function of some simpler statistic T = T(X) whose distribution is exactly known under H_o, the LR-test rejects H_o if T is "too big".

(ii) If Λ_n is a U-shaped function of such a statistic T, the LR-test rejects H_o for both extremely large and extremely small values of T.

Theorem 8.2.2 – Asymptotic behavior of Λ_n under alternatives (610)

Let Θo = θo. Then

(1/n) 2 log Λ_n(X) →_p 2 × (some number H_{θ,θ_o}) > 0 when θ is true.
Thus, the power of the LR-test converges to 1 for every θ ∈ H_a.

Math Fact (expanding log(1 + z))
log(1 + z) = z − z²/2 + z²g(z), where g(z) → 0 as z → 0.

Examples

General result for Normal Theory (614)
Theorem 8.2.3 – Asymptotic null distribution of various statistics (618)
(a) Wald (b) Rao (c) LR-Test (d) Consistency of Observed Information (e) m-Dimensions
Maxwell(θ) practice (619)

Chapter 9

Regression, Anova, and Categorical Data

9.1 Simple Linear Regression

Definitions

simple linear regression model (634) – Y_i = α + β(x_i − x̄) + ε_i, where ε_i ∼ N(0, σ²). An alternative parameterization: Y_i = γ + βx_i + ε_i
α (634) – value of the fitted regression line at x = x̄
γ (634) – intercept of the fitted regression line at x = 0
β (634) – slope; mean increase in the output per unit increase in the input x.
least squares estimators (LSEs) (635) – estimators α̂, β̂ which minimize the sum of squares Σ_{i=1}^n [Observation_i − Expectation_i]² = Σ_{i=1}^n [Y_i − (a + b(x_i − x̄))]². They solve the normal equations.
normal equations (634) – 0 = n[Ȳ − a]; 0 = Σ_{i=1}^n (x_i − x̄)Y_i − b Σ_{i=1}^n (x_i − x̄)²
fitted values (635) – Ŷ_i = α̂ + β̂(x_i − x̄)
residuals (635) – ε̂_i = Y_i − Ŷ_i
studentized residuals (635) – Ẑ_i = ε̂_i/S_n; these form ancillary statistics
coefficient of determination (639) – R² = SS_xY² / (SS_xx SS_YY) = (SS_YY − SSE)/SS_YY

Estimators (635)

The least squares estimators as defined above give:
α̂ = Ȳ
β̂ = (1/SS_xx) Σ_{i=1}^n (x_i − x̄)Y_i = (1/SS_xx) Σ_{i=1}^n (x_i − x̄)(Y_i − Ȳ) = SS_xY/SS_xx
γ̂ = Ȳ − β̂x̄

An estimate of σ² is:
S_n² = SSE/(n − 2) = (1/(n − 2)) Σ_{i=1}^n ε̂_i² = (1/(n − 2)) Σ_{i=1}^n (Y_i − α̂ − β̂(x_i − x̄))²

α̂, β̂, and S_n² are independent, sufficient, complete, and independent of the studentized residuals (which are therefore ancillary).
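A direct numerical sketch (Python) of the centered-model least squares formulas above; the x and Y values and all variable names are made up for illustration.

    import numpy as np

    # Made-up data for the model Y_i = alpha + beta (x_i - xbar) + eps_i
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n = len(x)

    xbar, Ybar = x.mean(), Y.mean()
    SSxx = np.sum((x - xbar) ** 2)
    SSxY = np.sum((x - xbar) * (Y - Ybar))
    SSYY = np.sum((Y - Ybar) ** 2)

    alpha_hat = Ybar
    beta_hat = SSxY / SSxx
    gamma_hat = Ybar - beta_hat * xbar

    resid = Y - (alpha_hat + beta_hat * (x - xbar))
    S2 = np.sum(resid ** 2) / (n - 2)        # estimate of sigma^2
    R2 = SSxY ** 2 / (SSxx * SSYY)           # coefficient of determination
    print(alpha_hat, beta_hat, gamma_hat, S2, R2)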

Behavior of S_n² (637)

S_n² = SSE/(n − 2),  E[S_n²] = σ²,  S_n² = (1/(n − 2)) Σ_{i=1}^n ε̂_i² ∼ (σ²/(n − 2)) χ²_{n−2}
This knowledge is useful in characterizing the behavior of α̂ and β̂.


Behavior of α̂ (635,637)

α̂ ∼ N(α, σ²/n),  √n(α̂ − α) = √n ε̄ ∼ N(0, σ²),  (α̂ − α)/(S_n/√n) ∼ T_{n−2}
α̂ ± t_{n−2}^{(α/2)} S_n/√n yields a 1 − α confidence interval for α.
Reject H_o : α = α_o vs H_a : α > α_o when T_n(α_o) > t_{n−2}^{(α)}.

Power(α, σ) = P_{α,σ}[ (Z_α + δ_α)/√(χ²_{n−2}/(n − 2)) > t_{n−2}^{(α)} ] = P( T_{n−2}(δ_α) > t_{n−2}^{(α)} )

Behavior of β̂ (637)

β̂ ∼ N(β, σ²/SS_xx),  √SS_xx (β̂ − β) = Σ_{i=1}^n d_{ni} ε_i ∼ N(0, σ²),  (β̂ − β)/(S_n/√SS_xx) ∼ T_{n−2}
β̂ ± t_{n−2}^{(α/2)} S_n/√SS_xx yields a 1 − α confidence interval for β.
Reject H_o : β = β_o vs H_a : β > β_o when T_n(β_o) > t_{n−2}^{(α)}.
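A self-contained Python sketch of the t-based confidence interval and one-sided test for β above, reusing the same made-up data as the least squares sketch; the cutoff values and names are illustrative.

    import numpy as np
    from scipy.stats import t

    # Same made-up data as in the least squares sketch above
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    Y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n, xbar = len(x), x.mean()
    SSxx = np.sum((x - xbar) ** 2)
    beta_hat = np.sum((x - xbar) * (Y - Y.mean())) / SSxx
    resid = Y - (Y.mean() + beta_hat * (x - xbar))
    Sn = np.sqrt(np.sum(resid ** 2) / (n - 2))

    alpha_level, beta_o = 0.05, 0.0
    se = Sn / np.sqrt(SSxx)

    # 1 - alpha CI: beta_hat +/- t_{n-2}^{(alpha/2)} * S_n / sqrt(SSxx)
    ci = (beta_hat - t.ppf(1 - alpha_level / 2, n - 2) * se,
          beta_hat + t.ppf(1 - alpha_level / 2, n - 2) * se)

    # One-sided test of H_o: beta = beta_o vs H_a: beta > beta_o
    T_n = (beta_hat - beta_o) / se
    reject = T_n > t.ppf(1 - alpha_level, n - 2)
    print(ci, T_n, reject)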

Power(β, σ) = P_{β,σ}[ (Z_β + δ_β)/√(χ²_{n−2}/(n − 2)) > t_{n−2}^{(α)} ] = P( T_{n−2}(δ_β) > t_{n−2}^{(α)} )

Extensions

Behavior of l(x_o) = α + (x_o − x̄)β at another x_o (638)
Prediction Intervals (638)
Simultaneous Confidence Intervals for α, β, and the line l(·) (638)
Exact Confidence Intervals without the Normal RVs Assumption (639)
Maximum Likelihood Estimates (MLEs), Likelihood Ratio Tests (640)

9.2 One Way Analysis of Variance

9.3 Categorical Data

Definitions

observed cell counts (654) – (N_1, ..., N_κ)
expected cell counts (654) – E_1 = n p_{o1}, ..., E_κ = n p_{oκ}
normed differences (ndiffs) (654) – D_i = NDiff_i = (N_i − E_i)/√E_i = (N_i − n p_{oi})/√(n p_{oi}) = √n(p̂_i − p_{oi})/√p_{oi}

Theorem 9.3.1 – Chisquare tests of fit for Multinomial observations (655)

For (N_1, ..., N_κ) ∼ Multinomial(n; p_1, ..., p_κ), define the null hypothesis H_o : p_1 = p_{o1}, ..., p_κ = p_{oκ}.

χ̂²_n = Σ_{i=1}^κ [√n(p̂_i − p_{oi})/√p_{oi}]² →_d χ²_{κ−1} when H_o is true.
(1/n) χ̂²_n →_p d_p = Σ_{i=1}^κ (p_i − p_{oi})²/p_{oi}, so χ̂²_n →_p ∞ whenever H_o is false.

These results lead to a level-α test: Reject H_o if χ̂²_n > χ²_{κ−1}^{(α)}, with asymptotic power function Power(p) = P_p(χ̂²_n > χ²_{κ−1}^{(α)}) → 1 as n → ∞.
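A numerical sketch (Python) of the multinomial goodness-of-fit statistic above; the observed counts and null cell probabilities are made up for illustration.

    import numpy as np
    from scipy.stats import chi2

    # Illustrative observed counts and null cell probabilities p_o
    N = np.array([18, 55, 27])            # observed cell counts, n = 100
    p_o = np.array([0.25, 0.50, 0.25])    # H_o: p = p_o
    n, kappa = N.sum(), len(N)

    E = n * p_o                           # expected cell counts under H_o
    chi_hat = np.sum((N - E) ** 2 / E)    # sum of squared normed differences D_i
    reject = chi_hat > chi2.ppf(0.95, df=kappa - 1)
    print(chi_hat, reject)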

Theorem 9.3.2 – Chisquare test for independence (659)

For the IJ rvs N_{ij} ∼ Multinomial(n; P), define the null hypothesis H_o : P_o.

χ̂²_n = Σ_{i=1}^I Σ_{j=1}^J (N_{ij} − N_{i·}N_{·j}/n)² / (N_{i·}N_{·j}/n) →_d χ²_{(I−1)(J−1)} when H_o : P_o is true.
(1/n) χ̂²_n →_p d_P = Σ_{i=1}^I Σ_{j=1}^J (p_{ij} − p_{i·}p_{·j})² / (p_{i·}p_{·j}), so χ̂²_n →_p ∞ whenever H_o is false.

These results lead to a level-α test: Reject H_o if χ̂²_n > χ²_{(I−1)(J−1)}^{(α)}, with asymptotic power function Power(P) = P_P(χ̂²_n > χ²_{(I−1)(J−1)}^{(α)}) → 1 as n → ∞.

Theorem 9.3.3 – Chisquare test for independence via the conditionals (663)

For (N_{i1}, ..., N_{iJ}) ∼ Multinomial(n_{i·}; p_{1|i}, ..., p_{J|i}), define the null hypothesis H_o : P_o.

χ̂²_n = Σ_{i=1}^I Σ_{j=1}^J (N_{ij} − n_{i·}N_{·j}/n)² / (n_{i·}N_{·j}/n) →_d χ²_{(I−1)(J−1)} when H_o : P_o is true.
(1/n) χ̂²_n →_p d_P = Σ_{i=1}^I Σ_{j=1}^J n_{i·}(p_{j|i} − p̄_j)² / (n p̄_j), so χ̂²_n →_p ∞ whenever H_o is false.

These results lead to a level-α test: Reject H_o if χ̂²_n > χ²_{(I−1)(J−1)}^{(α)}, with asymptotic power function Power(P) = P_P(χ̂²_n > χ²_{(I−1)(J−1)}^{(α)}) → 1 as n → ∞.

Theorem 9.3.4 – Chisquare test for independence with marginals fixed (665)

χ̂²_n = Σ_{i=1}^I Σ_{j=1}^J (N_{ij} − n_{i·}n_{·j}/n)² / (n_{i·}n_{·j}/n) →_d χ²_{(I−1)(J−1)} when H_o is true.

A level-α test is: Reject H_o if χ̂²_n > χ²_{(I−1)(J−1)}^{(α)}.

Degrees of Freedom Rule (666)

When H_o is true, df = [number of cells] − [number of parameters estimated] − [number of cell count restrictions].

Chisquare test of fit: df = [κ] − [0] − [1] = κ − 1
Testing for independence: df = [IJ] − [(I − 1) + (J − 1)] − [1] = (I − 1)(J − 1)
Testing for independence via the conditional distributions: df = [IJ] − [(J − 1)] − [I] = (I − 1)(J − 1)
Testing for independence with all marginals fixed: df = [IJ] − [0] − [I + J − 1] = (I − 1)(J − 1)
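A sketch (Python) of the independence test for an I × J table using the estimated expected counts N_{i·}N_{·j}/n and the degrees of freedom rule above; the table is made up. For reference, scipy.stats.chi2_contingency computes the same Pearson statistic (with a continuity correction by default for 2 × 2 tables).

    import numpy as np
    from scipy.stats import chi2

    # Made-up 2 x 3 contingency table of counts N_ij
    N = np.array([[20, 30, 10],
                  [25, 15, 20]])
    n = N.sum()
    row = N.sum(axis=1, keepdims=True)    # N_i.
    col = N.sum(axis=0, keepdims=True)    # N_.j

    E = row @ col / n                     # estimated expected counts N_i. N_.j / n
    chi_hat = np.sum((N - E) ** 2 / E)

    I, J = N.shape
    df = (I - 1) * (J - 1)                # df rule: IJ - [(I-1)+(J-1)] - 1
    reject = chi_hat > chi2.ppf(0.95, df=df)
    print(chi_hat, df, reject)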

Proofs that p̂_i is the MLE and unique UMVUE of p_i (656)
Lagrange Multipliers (656)
Jensen's Inequality (671)
Guess and verify (671)
Maximizing a function of κ − 1 variables (671)

Appendix A

Distributions

A.1 Important Transformations

Cauchy(0,1) (142)
f(x) = 1/[π(1 + x²)] on −∞ < x < ∞

Cauchy(α, β) (373)
f_{α,β}(x) = (1/(πβ)) · 1/[1 + ((x − α)/β)²] on −∞ < x < ∞

Inverted Gamma(r, θ) (142)
Let M = 1/W, where W has the Gamma(r, θ) distribution. Then M has the Inverted Gamma(r, θ) distribution:
f_M(m) = [1/(Γ(r) θ^r)] (1/m^{r+1}) e^{−1/(mθ)}

Rayleigh(θ) distribution (417)

f_θ(x) = (x/θ²) e^{−x²/(2θ²)} for x ≥ 0

Maxwell(θ) distribution (417)

f_θ(x) = √(2/π) (x²/θ³) e^{−x²/(2θ²)} for x ≥ 0

Exponential(α, ν) (373)

f_{α,ν}(y) = ν e^{−ν(y−α)} on α ≤ y < ∞

Logistic(0,1) distribution (426)

f(x) = e^{−x}/(1 + e^{−x})² on −∞ < x < ∞

Logistic(α, β) (373)
f_{α,β}(x) = exp(−(x − α)/β) / (β[1 + exp(−(x − α)/β)]²) on −∞ < x < ∞

Double Exponential(µ, θ) (33)
f_X(x) = (1/(2θ)) exp(−|x − µ|/θ) on −∞ < x < ∞

Pareto2(1, ν) (531)
f_ν(x) = ν/(1 + x)^{1+ν} for x > 0

Truncation families (490,509,521)
f_θ(x) = c(θ) 1_{[0,θ]}(x) h(x) for x ≥ 0

Power family (509,531)
f_{β,γ}(x) = (γ/β)(x/β)^{γ−1} 1_{[0,β]}(x)
f_γ(x) = (γ − 1)/x^γ for x > 1

Chisquare (140)

f_k(y) = [1/(Γ(k/2) 2^{k/2})] y^{k/2 − 1} e^{−y/2} for all y ≥ 0.

Logarithmic Series (144,530)
p_θ(k) = (1 − θ)^k / (k log(1/θ)) for k = 1, 2, ...

Additional Distributions
Uniform(a, b) (33)
Discrete Uniform(1, N) (33)
Discrete Uniform(0, N) (33)
Triangular(c, a) (33)
Student Correlation Coefficient (145)
Empirical (146)
Truncated Poisson (301)
Generalized Binomial (319)
Studentized distribution (371)
LogNormal(µ, σ²) (373)
LogGamma(κ, 0, 1) distribution (374,427)
LogGamma(1,0,1) = ExtValueMin(0,1) = Gompertz distribution (374,427,429)
Multivariate Normal(µ, Σ) distribution (385)
Slash distribution (403)
Inverted Gamma(a, b) (577)
Dirichlet distribution (587)