Mathematical Statistics (NYU, Spring 2003) Summary (answers to potential exam questions), by Rebecca Sela

1 Sufficient statistic theorem (1)

Let $X_1, \dots, X_n$ be a sample from the distribution $f(x,\theta)$. Let $T(X_1,\dots,X_n)$ be a sufficient statistic for $\theta$ with continuous factor function $F(T(X_1,\dots,X_n),\theta)$. Then,

$$P(X \in A \mid T(X) = t) = \lim_{h \to 0} P(X \in A \mid |T(X) - t| \le h) = \lim_{h \to 0} \frac{P(X \in A,\ |T(X) - t| \le h)/h}{P(|T(X) - t| \le h)/h} = \frac{\frac{d}{dt} P(X \in A,\ T(X) \le t)}{\frac{d}{dt} P(T(X) \le t)}$$

Consider first the numerator:

$$\frac{d}{dt} P(X \in A,\ T(X) \le t) = \frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} f(x_1,\theta) \cdots f(x_n,\theta)\, dx_1 \cdots dx_n = \frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} F(T(x),\theta)\, h(x)\, dx_1 \cdots dx_n$$
$$= \lim_{h \to 0} \frac{1}{h} \int_{A \cap \{x : |T(x) - t| \le h\}} F(T(x),\theta)\, h(x)\, dx_1 \cdots dx_n$$

Since $\min_{s \in [t-h,\,t+h]} F(s,\theta) \le F(T(x),\theta) \le \max_{s \in [t-h,\,t+h]} F(s,\theta)$ whenever $|T(x) - t| \le h$, we find:

$$\lim_{h \to 0} \Big(\min_{s \in [t-h,t+h]} F(s,\theta)\Big) \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} h(x)\, dx \le \lim_{h \to 0} \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} F(T(x),\theta)\, h(x)\, dx \le \lim_{h \to 0} \Big(\max_{s \in [t-h,t+h]} F(s,\theta)\Big) \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} h(x)\, dx$$

By the continuity of $F(t,\theta)$, both the min and the max converge to $F(t,\theta)$ as $h \to 0$. Thus,

$$\lim_{h \to 0} \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} F(T(x),\theta)\, h(x)\, dx_1 \cdots dx_n = F(t,\theta) \lim_{h \to 0} \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} h(x)\, dx = F(t,\theta)\, \frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} h(x)\, dx$$

If we let $A$ be all of $\mathbb{R}^n$, then we have the case of the denominator. Thus, we find:

$$P(X \in A \mid T(X) = t) = \frac{F(t,\theta)\, \frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} h(x)\, dx}{F(t,\theta)\, \frac{d}{dt} \int_{\{x : T(x) \le t\}} h(x)\, dx} = \frac{\frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} h(x)\, dx}{\frac{d}{dt} \int_{\{x : T(x) \le t\}} h(x)\, dx}$$

which is not a function of $\theta$. Thus, $P(X \in A \mid T(X) = t)$ does not depend on $\theta$ when $T(X)$ is a sufficient statistic.

2 Examples of sufficient statistics (2)

2.1 Uniform

Suppose $f(x,\theta) = \frac{1}{\theta}I_{(0,\theta)}(x)$. Then,

$$\prod f(x_i,\theta) = \frac{1}{\theta^n}\prod I_{(0,\theta)}(x_i) = \frac{1}{\theta^n}I_{(-\infty,\theta)}(\max X_i)\,I_{(0,\infty)}(\min X_i)$$

Let $F(\max X_i,\theta) = \frac{1}{\theta^n}I_{(-\infty,\theta)}(\max X_i)$ and $h(X_1,\dots,X_n) = I_{(0,\infty)}(\min X_i)$. This is a factorization of $\prod f(x_i,\theta)$, so $\max X_i$ is a sufficient statistic for the uniform distribution.

2.2 Binomial

Suppose $f(x,\theta) = \theta^x(1-\theta)^{1-x}$, $x = 0,1$. Then, $\prod f(x_i,\theta) = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i}$. Let $T(x_1,\dots,x_n) = \sum X_i$, $F(t,\theta) = \theta^t(1-\theta)^{n-t}$, and $h(x_1,\dots,x_n) = 1$. This is a factorization of $\prod f(x_i,\theta)$, which shows that $T(x_1,\dots,x_n) = \sum X_i$ is a sufficient statistic.

2.3 Normal

Suppose $f(x,\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$. Then,

$$\prod f(x_i,\mu,\sigma^2) = (2\pi\sigma^2)^{-n/2}e^{-\frac{1}{2\sigma^2}\sum(x_i-\mu)^2} = (2\pi\sigma^2)^{-n/2}e^{-\frac{1}{2\sigma^2}\sum(x_i-\bar{x})^2}e^{-\frac{n}{2\sigma^2}(\bar{x}-\mu)^2}$$

since

$$\sum(x_i-\bar{x})^2+n(\bar{x}-\mu)^2 = \sum(x_i^2-2x_i\bar{x}+\bar{x}^2)+n(\bar{x}^2-2\mu\bar{x}+\mu^2) = \sum x_i^2-2\bar{x}(n\bar{x})+n\bar{x}^2+n\bar{x}^2-2n\mu\bar{x}+n\mu^2$$
$$= \sum x_i^2-2\mu\sum x_i+n\mu^2 = \sum(x_i^2-2\mu x_i+\mu^2) = \sum(x_i-\mu)^2$$
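This identity is easy to sanity-check numerically; the following is a minimal sketch in pure Python, where the data values and $\mu$ are arbitrary illustrative choices, not from the notes:

```python
# Check sum((x_i - mu)^2) = sum((x_i - xbar)^2) + n*(xbar - mu)^2
# on arbitrary sample values.
x = [2.0, 3.5, 1.2, 4.8, 2.9]   # illustrative data
mu = 3.0
n = len(x)
xbar = sum(x) / n

lhs = sum((xi - mu) ** 2 for xi in x)
rhs = sum((xi - xbar) ** 2 for xi in x) + n * (xbar - mu) ** 2
assert abs(lhs - rhs) < 1e-12
```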

Case 1: $\sigma^2$ unknown, $\mu$ known. Let $T(x_1,\dots,x_n) = \sum(x_i-\mu)^2$, $F(t,\sigma^2) = (2\pi\sigma^2)^{-n/2}e^{-\frac{t}{2\sigma^2}}$, and $h(x_1,\dots,x_n) = 1$. This is a factorization of $\prod f(x_i,\sigma^2)$.

Case 2: $\sigma^2$ known, $\mu$ unknown. Let $T(x_1,\dots,x_n) = \bar{x}$, $F(t,\mu) = e^{-\frac{n}{2\sigma^2}(t-\mu)^2}$, and $h(x_1,\dots,x_n) = (2\pi\sigma^2)^{-n/2}e^{-\frac{1}{2\sigma^2}\sum(x_i-\bar{x})^2}$. This is a factorization of $\prod f(x_i,\mu)$.

Case 3: $\mu$ unknown, $\sigma^2$ unknown. Let $T_1(x_1,\dots,x_n) = \bar{x}$, $T_2(x_1,\dots,x_n) = \sum(x_i-\bar{x})^2$, $F(t_1,t_2,\mu,\sigma^2) = (2\pi\sigma^2)^{-n/2}e^{-\frac{t_2}{2\sigma^2}}e^{-\frac{n}{2\sigma^2}(t_1-\mu)^2}$, and $h(x_1,\dots,x_n) = 1$. This is a factorization.

3 Rao-Blackwell Theorem (3)

Let $X_1,\dots,X_n$ be a sample from the distribution $f(x,\theta)$. Let $Y = Y(X_1,\dots,X_n)$ be an unbiased estimator of $\theta$. Let $T = T(X_1,\dots,X_n)$ be a sufficient statistic for $\theta$. Let $\varphi(t) = E(Y \mid T = t)$.

Lemma 1: $E(E(g(Y)\mid T)) = E(g(Y))$, for all functions $g$.

Proof.

$$E(E(g(Y)\mid T)) = \int_{-\infty}^{\infty}E(g(Y)\mid T=t)f(t)\,dt = \int_{-\infty}^{\infty}\Big(\int_{-\infty}^{\infty}g(y)f(y\mid t)\,dy\Big)f(t)\,dt = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}g(y)f(y,t)\,dy\,dt$$
$$= \int_{-\infty}^{\infty}g(y)\Big(\int_{-\infty}^{\infty}f(y,t)\,dt\Big)dy = \int_{-\infty}^{\infty}g(y)f(y)\,dy = E(g(Y))$$

Step 1: $\varphi(t)$ does not depend on $\theta$.

$$\varphi(t) = \int_{\mathbb{R}^n}y(x_1,\dots,x_n)f(x_1,\dots,x_n\mid T(x_1,\dots,x_n)=t)\,dx_1\cdots dx_n.$$

Since $T(x_1,\dots,x_n)$ is a sufficient statistic, $f(x_1,\dots,x_n\mid T(x_1,\dots,x_n)=t)$ does not depend on $\theta$. Since $y(x_1,\dots,x_n)$ is an estimator, it is not a function of $\theta$. Thus, the integral of their product over $\mathbb{R}^n$ does not depend on $\theta$.

Step 2: $\varphi(t)$ is unbiased.

$$E(\varphi(T)) = E(E(Y\mid T)) = E(Y) = \theta$$

by the lemma above.

Step 3: $\operatorname{Var}(\varphi(T)) \le \operatorname{Var}(Y)$.

$$\operatorname{Var}(\varphi(T)) = E(E(Y\mid T)^2)-E(E(Y\mid T))^2 \le E(E(Y^2\mid T))-E(Y)^2 = E(Y^2)-E(Y)^2 = \operatorname{Var}(Y)$$

Thus, conditioning an unbiased estimator on the sufficient statistic gives a new unbiased estimator with variance at most that of the old estimator.
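For a concrete Bernoulli illustration of this improvement: take the crude unbiased estimator $Y = X_1$ and its conditioned version $\varphi(T) = E(X_1 \mid \sum X_i) = \bar{X}$. A Monte Carlo sketch in pure Python; the seed, $\theta$, sample size, and replication count are all arbitrary choices:

```python
import random

random.seed(0)
theta, n, reps = 0.4, 10, 20000

ys, phis = [], []
for _ in range(reps):
    xs = [1 if random.random() < theta else 0 for _ in range(n)]
    ys.append(xs[0])          # Y = X_1: unbiased but crude
    phis.append(sum(xs) / n)  # phi(T) = E(X_1 | sum X_i) = Xbar

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# Both estimators are unbiased, but conditioning on the sufficient
# statistic shrinks the variance (here by roughly a factor of n).
assert abs(sum(phis) / reps - theta) < 0.01
assert var(phis) < var(ys)
```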

4 Some properties of the derivative of the log (4)

Let $X$ have the distribution function $f(x,\theta_0)$. Let $Y = \frac{\partial}{\partial\theta}\log f(X,\theta)\big|_{\theta=\theta_0}$. Notice that, by the chain rule, $\frac{\partial}{\partial\theta}\log f(x,\theta) = \frac{1}{f(x,\theta)}\frac{\partial}{\partial\theta}f(x,\theta)$. Using this fact, we find:

$$E(Y) = E\Big(\frac{\partial}{\partial\theta}\log f(X,\theta)\Big|_{\theta=\theta_0}\Big) = \int_{-\infty}^{\infty}\frac{\partial}{\partial\theta}\log f(x,\theta)\Big|_{\theta=\theta_0}f(x,\theta_0)\,dx = \int_{-\infty}^{\infty}\frac{1}{f(x,\theta_0)}\Big(\frac{\partial}{\partial\theta}f(x,\theta)\Big|_{\theta=\theta_0}\Big)f(x,\theta_0)\,dx$$
$$= \int_{-\infty}^{\infty}\frac{\partial}{\partial\theta}f(x,\theta)\Big|_{\theta=\theta_0}\,dx = \frac{\partial}{\partial\theta}\Big(\int_{-\infty}^{\infty}f(x,\theta)\,dx\Big)\Big|_{\theta=\theta_0} = \frac{\partial}{\partial\theta}(1)\Big|_{\theta=\theta_0} = 0$$

Differentiating the log once more,

$$\frac{\partial^2}{\partial\theta^2}\log f(x,\theta) = \frac{\partial}{\partial\theta}\Big(\frac{1}{f(x,\theta)}\frac{\partial}{\partial\theta}f(x,\theta)\Big) = \frac{1}{f(x,\theta)^2}\Big(f(x,\theta)\frac{\partial^2}{\partial\theta^2}f(x,\theta)-\Big(\frac{\partial}{\partial\theta}f(x,\theta)\Big)^2\Big)$$
$$= \frac{1}{f(x,\theta)}\frac{\partial^2}{\partial\theta^2}f(x,\theta)-\Big(\frac{1}{f(x,\theta)}\frac{\partial}{\partial\theta}f(x,\theta)\Big)^2 = \frac{1}{f(x,\theta)}\frac{\partial^2}{\partial\theta^2}f(x,\theta)-\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)^2$$

Taking expectations with respect to $f(x,\theta_0)$,

$$E\Big(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\Big|_{\theta=\theta_0}\Big) = \int_{-\infty}^{\infty}\frac{\partial^2}{\partial\theta^2}f(x,\theta)\Big|_{\theta=\theta_0}\,dx - E\Big(\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big|_{\theta=\theta_0}\Big)^2\Big)$$
$$= \frac{\partial^2}{\partial\theta^2}\Big(\int_{-\infty}^{\infty}f(x,\theta)\,dx\Big)\Big|_{\theta=\theta_0} - E\Big(\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big|_{\theta=\theta_0}\Big)^2\Big) = -E\Big(\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big|_{\theta=\theta_0}\Big)^2\Big)$$

Thus, the expected value of $Y$ is zero, and the variance of $Y$ is $-E\big(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\big|_{\theta=\theta_0}\big)$, which is defined as the information function, $I(\theta)$.

5 The Cramer-Rao lower bound (5)

Let $T$ be an unbiased estimator based on a sample $X$ from the distribution $f(x,\theta)$. Then, $E(T) = \theta$. We take the derivative of this equation to find:

$$1 = \frac{\partial}{\partial\theta}E(T) = \frac{\partial}{\partial\theta}\int_{\mathbb{R}^n}T(x)f(x,\theta)\,dx = \int_{\mathbb{R}^n}T(x)\frac{\partial}{\partial\theta}f(x,\theta)\,dx = \int_{\mathbb{R}^n}T(x)\Big(\frac{\partial}{\partial\theta}\log f(x,\theta)\Big)f(x,\theta)\,dx = E\Big(T(X)\frac{\partial}{\partial\theta}\log f(X,\theta)\Big)$$
$$= E\Big(T(X)\frac{\partial}{\partial\theta}\log f(X,\theta)\Big)-cE\Big(\frac{\partial}{\partial\theta}\log f(X,\theta)\Big) = E\Big((T(X)-c)\frac{\partial}{\partial\theta}\log f(X,\theta)\Big) = E\Big((T(X)-\theta)\frac{\partial}{\partial\theta}\log f(X,\theta)\Big)$$

(Subtracting $cE(\frac{\partial}{\partial\theta}\log f(X,\theta)) = 0$ changes nothing, so we may take $c = \theta$.) By the Cauchy-Schwarz inequality, $E(AB)^2 \le E(A^2)E(B^2)$. Squaring both sides of the equation above and applying this, we find:

$$1 = E\Big((T(X)-\theta)\frac{\partial}{\partial\theta}\log f(X,\theta)\Big)^2 \le E\big((T(X)-\theta)^2\big)\,E\Big(\Big(\frac{\partial}{\partial\theta}\log f(X,\theta)\Big)^2\Big) = \operatorname{Var}(T)\,E\Big(\Big(\frac{\partial}{\partial\theta}\log f(X,\theta)\Big)^2\Big)$$

Since the sample is independent and identically distributed,

$$\Big(\frac{\partial}{\partial\theta}\log f(X,\theta)\Big)^2 = \Big(\frac{\partial}{\partial\theta}\log\prod_{i=1}^n f(x_i,\theta)\Big)^2 = \Big(\frac{\partial}{\partial\theta}\sum_{i=1}^n\log f(x_i,\theta)\Big)^2 = \Big(\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big)^2$$
$$E\Big(\Big(\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big)^2\Big) = \operatorname{Var}\Big(\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big) = \sum_{i=1}^n\operatorname{Var}\Big(\frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big) = n\,E\Big(\Big(\frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big)^2\Big) = nI(\theta)$$

Thus, $1 \le \operatorname{Var}(T)\,nI(\theta)$, and the variance of an unbiased estimator is at least $\frac{1}{nI(\theta)}$.
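As a worked instance: for a Bernoulli($\theta$) observation, $\frac{\partial}{\partial\theta}\log f(x,\theta) = \frac{x-\theta}{\theta(1-\theta)}$, so $I(\theta) = \frac{1}{\theta(1-\theta)}$, and the sample mean attains the bound exactly. A minimal check in Python (the parameter values are arbitrary):

```python
# For Bernoulli(theta), I(theta) = 1/(theta*(1-theta)); the sample mean
# attains the Cramer-Rao bound: Var(Xbar) = theta*(1-theta)/n = 1/(n*I(theta)).
theta, n = 0.3, 25
info = 1.0 / (theta * (1 - theta))
var_xbar = theta * (1 - theta) / n
cr_bound = 1.0 / (n * info)
assert abs(var_xbar - cr_bound) < 1e-12
```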

6 Where the Cramer-Rao lower bound fails to hold (6)

Let $X$ be distributed uniform on $(0,\theta)$. That is, $f(x,\theta) = \frac{1}{\theta}I_{(0,\theta)}(x)$. Let $Y = \max X_i$. Then, for $0 \le y \le \theta$,

$$P(Y\le y) = P(X_1,\dots,X_n\le y) = \prod P(X_i\le y) = \Big(\frac{y}{\theta}\Big)^n$$
$$f(y) = \frac{d}{dy}P(Y\le y) = n\frac{y^{n-1}}{\theta^n}I_{(0,\theta)}(y)$$
$$E(Y) = \int_{-\infty}^{\infty}yf(y)\,dy = \int_0^\theta n\frac{y^n}{\theta^n}\,dy = \frac{n}{n+1}\theta$$
$$E(Y^2) = \int_{-\infty}^{\infty}y^2f(y)\,dy = \int_0^\theta n\frac{y^{n+1}}{\theta^n}\,dy = \frac{n}{n+2}\theta^2$$
$$\operatorname{Var}(Y) = E(Y^2)-E(Y)^2 = \frac{n}{n+2}\theta^2-\Big(\frac{n}{n+1}\theta\Big)^2 = \theta^2\,\frac{n^3+2n^2+n-n^3-2n^2}{(n+2)(n+1)^2} = \theta^2\,\frac{n}{(n+2)(n+1)^2}$$

Since $E(Y) = \frac{n}{n+1}\theta$, $E\big(\frac{n+1}{n}Y\big) = \theta$, and $\frac{n+1}{n}\max X_i$ is an unbiased estimator of $\theta$. The variance of this estimator is given by

$$\operatorname{Var}\Big(\frac{n+1}{n}Y\Big) = \Big(\frac{n+1}{n}\Big)^2\,\theta^2\,\frac{n}{(n+2)(n+1)^2} = \theta^2\,\frac{1}{n(n+2)}$$

which is of order $\frac{1}{n^2}$ (and would therefore violate the Cramer-Rao lower bound if it applied).
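A quick Monte Carlo check of the unbiasedness and the $\theta^2/(n(n+2))$ variance of $\frac{n+1}{n}\max X_i$ (pure Python; the seed, $\theta$, $n$, and replication count are arbitrary choices):

```python
import random

random.seed(1)
theta, n, reps = 2.0, 5, 200000
est = [(n + 1) / n * max(random.uniform(0, theta) for _ in range(n))
       for _ in range(reps)]

mean = sum(est) / reps
var = sum((e - mean) ** 2 for e in est) / reps
# Theory: unbiased, with variance theta^2/(n*(n+2)) -- order 1/n^2.
assert abs(mean - theta) < 0.01
assert abs(var - theta ** 2 / (n * (n + 2))) < 0.01
```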

7 Maximum likelihood estimators for various distributions (7)

7.1 Normal

$$f(x_1,\dots,x_n,\mu,\sigma^2) = \Big(\frac{1}{2\pi\sigma^2}\Big)^{n/2}e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2}$$
$$\log f(x_1,\dots,x_n,\mu,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2$$
$$\frac{\partial}{\partial\mu}\log f(x_1,\dots,x_n,\mu,\sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) = \frac{n}{\sigma^2}(\bar{x}-\mu)$$
$$\frac{\partial}{\partial\sigma^2}\log f(x_1,\dots,x_n,\mu,\sigma^2) = -\frac{n}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\mu)^2 = \frac{1}{2(\sigma^2)^2}\Big(\sum_{i=1}^n(x_i-\mu)^2-n\sigma^2\Big)$$

If $\mu$ is unknown, then we set $\frac{\partial}{\partial\mu}\log f = 0$ to find that $\bar{x}$ is the maximum likelihood estimator of $\mu$. If $\sigma^2$ is unknown and $\mu$ is known, then we set $\frac{\partial}{\partial\sigma^2}\log f = 0$ and solve to find that the maximum likelihood estimator of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^n(x_i-\mu)^2$. If both $\mu$ and $\sigma^2$ are unknown, then we estimate $\mu$ using its maximum likelihood estimator $\bar{x}$ (which does not depend on $\sigma^2$). We use this estimate of $\mu$ to maximize with respect to $\sigma^2$ and find that the maximum likelihood estimator of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$.

7.2 Bernoulli

$$f(x,\theta) = \theta^x(1-\theta)^{1-x},\quad x = 0,1$$
$$f(x_1,\dots,x_n,\theta) = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i}$$
$$\log f(x_1,\dots,x_n,\theta) = \Big(\sum x_i\Big)\log\theta+\Big(n-\sum x_i\Big)\log(1-\theta)$$
$$\frac{\partial}{\partial\theta}\log f(x_1,\dots,x_n,\theta) = \frac{\sum x_i}{\theta}-\frac{n-\sum x_i}{1-\theta} = \frac{(1-\theta)\sum x_i-\theta(n-\sum x_i)}{\theta(1-\theta)} = \frac{n(\bar{x}-\theta)}{\theta(1-\theta)}$$

Setting the derivative equal to zero, we find that the maximum likelihood estimator of $\theta$ is $\bar{X}$.

7.3 Poisson

$$f(x,\theta) = \frac{1}{x!}\theta^x e^{-\theta},\quad x = 0,1,2,\dots$$
$$f(x_1,\dots,x_n,\theta) = e^{-n\theta}\theta^{\sum x_i}\prod\frac{1}{x_i!}$$
$$\log f(x_1,\dots,x_n,\theta) = -n\theta+\sum x_i\log\theta-\sum\log(x_i!)$$
$$\frac{\partial}{\partial\theta}\log f(x_1,\dots,x_n,\theta) = -n+\frac{1}{\theta}\sum x_i = \frac{n}{\theta}(\bar{x}-\theta)$$

Thus, the maximum likelihood estimator of $\theta$ is $\bar{x}$.

7.4 Uniform

$$f(x,\theta) = \frac{1}{\theta}I_{(0,\theta)}(x)$$
$$f(x_1,\dots,x_n,\theta) = \frac{1}{\theta^n}\prod_{i=1}^n I_{(0,\theta)}(x_i) = \frac{1}{\theta^n}I_{(0,\theta)}(\max x_i)$$

Since $\frac{1}{\theta^n}$ is strictly decreasing in $\theta$, we maximize it by choosing the smallest $\theta$ such that $\max X_i \le \theta$. That is, the maximum likelihood estimator of $\theta$ is $\max X_i$.

7.5 Gamma (with β unknown)

$$f(x,\alpha,\beta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}x^{\alpha-1}e^{-x/\beta}$$
$$f(x_1,\dots,x_n,\alpha,\beta) = \Big(\frac{1}{\beta^\alpha\Gamma(\alpha)}\Big)^n\Big(\prod_{i=1}^n x_i\Big)^{\alpha-1}e^{-\frac{\sum x_i}{\beta}}$$
$$\log f(x_1,\dots,x_n,\alpha,\beta) = -n\alpha\log\beta-n\log\Gamma(\alpha)+(\alpha-1)\sum\log x_i-\frac{\sum x_i}{\beta}$$
$$\frac{\partial}{\partial\beta}\log f(x_1,\dots,x_n,\alpha,\beta) = -\frac{n\alpha}{\beta}+\frac{1}{\beta^2}\sum x_i = \frac{1}{\beta^2}\Big(\sum x_i-n\alpha\beta\Big)$$

The derivative is 0 for $\beta = \frac{\bar{x}}{\alpha}$. Thus, $\bar{x}/\alpha$ is the maximum likelihood estimator.
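The closed-form gamma result can be checked against a brute-force grid search over the log-likelihood; a sketch in pure Python, where the data, $\alpha$, and the grid are arbitrary illustrative choices:

```python
import math

# Fixed illustrative data; alpha is assumed known.
x = [1.2, 0.7, 2.4, 1.9, 0.5, 3.1]
alpha = 2.0
n, xbar = len(x), sum(x) / len(x)

def loglik(beta):
    # Gamma log-likelihood as a function of beta only.
    return (-n * alpha * math.log(beta) - n * math.lgamma(alpha)
            + (alpha - 1) * sum(math.log(xi) for xi in x) - sum(x) / beta)

grid = [0.05 + 0.001 * i for i in range(3000)]
best = max(grid, key=loglik)
# The grid maximizer should sit next to the closed form xbar/alpha.
assert abs(best - xbar / alpha) < 0.0015
```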

8 The likelihood equation will have a solution (8)

Let $x_1,\dots,x_n$ be a sample from the distribution $f(x,\theta_0)$. By the Law of Large Numbers, the sequence $\{\frac{1}{n}\sum_{i=1}^n Y_i\}$ converges to $E(Y_i)$ with probability 1. In particular, the sequence $\{\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)|_{\theta=\theta_0}\}$ converges to $E(\frac{\partial}{\partial\theta}\log f(x,\theta)|_{\theta=\theta_0}) = 0$, and the sequence $\{\frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial\theta^2}\log f(x_i,\theta)|_{\theta=\theta_0}\}$ converges to $E(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)|_{\theta=\theta_0}) = -I(\theta_0)$, with probability 1. Let $g_n(\theta) = \frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)$. Then, $\lim_{n\to\infty}g_n(\theta_0) = 0$ and $\lim_{n\to\infty}g_n'(\theta_0) = -I(\theta_0)\ne 0$.

Let $\varepsilon > 0$ be given. By the Taylor expansion,

$$g_n(\theta)\approx g_n(\theta_0)+g_n'(\theta_0)(\theta-\theta_0),\qquad|\theta-\theta_0|<\varepsilon$$

For sufficiently large $n$, then,

$$g_n(\theta)\approx -I(\theta_0)(\theta-\theta_0)$$

Since $-I(\theta_0) < 0$ while $\theta-\theta_0$ is both positive and negative in the interval $(\theta_0-\varepsilon,\theta_0+\varepsilon)$, $g_n(\theta)$ must change sign in this interval as well. Since $g_n(\theta)$ is continuous, this means that $\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta) = g_n(\theta) = 0$ somewhere in this interval, and the likelihood equation has a solution in $(\theta_0-\varepsilon,\theta_0+\varepsilon)$ with probability one for large $n$.

9 The limiting distribution of the maximum likelihood estimators (9)

Let $x_1,\dots,x_n$ be a sample from the distribution $f(x,\theta_0)$. Recall that $E(\frac{\partial}{\partial\theta}\log f(x,\theta)|_{\theta=\theta_0}) = 0$ and that $\operatorname{Var}(\frac{\partial}{\partial\theta}\log f(x,\theta)|_{\theta=\theta_0}) = E\big(\big(\frac{\partial}{\partial\theta}\log f(x,\theta)|_{\theta=\theta_0}\big)^2\big) = I(\theta_0)$. Thus, by the Central Limit Theorem, $\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)|_{\theta=\theta_0}$ has a normal limiting distribution with mean 0 and variance $I(\theta_0)$. By the Law of Large Numbers, since $E(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)|_{\theta=\theta_0}) = -I(\theta_0)$, $\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial\theta^2}\log f(x_i,\theta)|_{\theta=\theta_0} = -I(\theta_0)$ with probability one.

Consider $\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)$. Using the Taylor expansion about $\theta_0$, we find:

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta) \approx \frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big|_{\theta=\theta_0}+\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial^2}{\partial\theta^2}\log f(x_i,\theta)\Big|_{\theta=\theta_0}(\theta-\theta_0)$$

in a neighborhood of $\theta_0$. For large $n$, there is a solution to the likelihood equation in this neighborhood with probability one. Let $\hat{\theta}_n$ be this solution. Then, we have:

$$0 = \frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big|_{\theta=\hat{\theta}_n} \approx \frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big|_{\theta=\theta_0}+\sqrt{n}(\hat{\theta}_n-\theta_0)\Big(\frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial\theta^2}\log f(x_i,\theta)\Big|_{\theta=\theta_0}\Big)$$

Substituting in the limit $\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial\theta^2}\log f(x_i,\theta)|_{\theta=\theta_0} = -I(\theta_0)$ and solving for $\sqrt{n}(\hat{\theta}_n-\theta_0)$, we find:

$$\sqrt{n}(\hat{\theta}_n-\theta_0) \approx \frac{1}{I(\theta_0)}\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big|_{\theta=\theta_0}\Big)$$

The right hand side is the product of $\frac{1}{I(\theta_0)}$ and a normal variable with mean 0 and variance $I(\theta_0)$; that is, a normal variable with mean 0 and variance $\frac{1}{I(\theta_0)^2}I(\theta_0) = \frac{1}{I(\theta_0)}$. Thus, $\sqrt{n}(\hat{\theta}_n-\theta_0)\sim N\big(0,\frac{1}{I(\theta_0)}\big)$.

10 Confidence Intervals for Various Distributions (10)

10.1 Normal, σ² known:

Let $X\sim\mathrm{Normal}(\mu,\sigma^2)$, with $\sigma^2$ known. Let $\alpha < 1$ be given (this determines the confidence level). Set $Z = \frac{X-\mu}{\sigma}$. Then, $Z\sim\mathrm{Normal}(0,1)$. Choose $z_{\alpha/2}$ such that $\Phi(z_{\alpha/2}) = \int_{-\infty}^{z_{\alpha/2}}\phi(z)\,dz = 1-\frac{\alpha}{2}$. Then, by symmetry:

$$1-\alpha = P(-z_{\alpha/2}\le Z\le z_{\alpha/2}) = P\Big(-z_{\alpha/2}\le\frac{X-\mu}{\sigma}\le z_{\alpha/2}\Big) = P(-\sigma z_{\alpha/2}\le X-\mu\le\sigma z_{\alpha/2}) = P(X-z_{\alpha/2}\sigma\le\mu\le X+z_{\alpha/2}\sigma)$$

and $(X-z_{\alpha/2}\sigma,\ X+z_{\alpha/2}\sigma)$ is a $1-\alpha$ confidence interval for $\mu$.

If we choose a sample of size $n$, then $\bar{X}\sim\mathrm{Normal}(\mu,\frac{\sigma^2}{n})$, and then

$$1-\alpha = P\Big(-z_{\alpha/2}\le\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\le z_{\alpha/2}\Big) = P\Big(\bar{X}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\le\mu\le\bar{X}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big)$$

so that $(\bar{X}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}})$ is a $1-\alpha$ confidence interval for $\mu$.
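The coverage claim can be checked by simulation: generate many samples, form the interval, and count how often it contains $\mu$. A sketch in pure Python, where all numerical choices are illustrative and 1.96 is the familiar $z_{0.025}$:

```python
import random

random.seed(2)
mu, sigma, n, z, reps = 5.0, 2.0, 16, 1.96, 20000
cover = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    half = z * sigma / n ** 0.5          # half-width of the interval
    if xbar - half <= mu <= xbar + half:
        cover += 1
rate = cover / reps
# Empirical coverage should be close to 1 - alpha = 0.95.
assert abs(rate - 0.95) < 0.01
```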

10.2 Normal, σ² unknown:

Let $X_1,\dots,X_n$ be a sample from the distribution $\mathrm{Normal}(\mu,\sigma^2)$. Then, $\bar{X}\sim\mathrm{Normal}(\mu,\frac{\sigma^2}{n})$ and, with $s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2$, the quantity $(n-1)\frac{s^2}{\sigma^2}$ is distributed $\chi^2$ with $n-1$ degrees of freedom, so that

$$T = \frac{\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\big((n-1)\frac{s^2}{\sigma^2}\big)/(n-1)}} = \frac{\bar{X}-\mu}{s/\sqrt{n}}$$

has a t-distribution with $n-1$ degrees of freedom. Choose $t_{\alpha/2,n-1}$ such that $P(T\le t_{\alpha/2,n-1}) = 1-\frac{\alpha}{2}$. Then, by symmetry,

$$1-\alpha = P\Big(-t_{\alpha/2,n-1}\le\frac{\bar{X}-\mu}{s/\sqrt{n}}\le t_{\alpha/2,n-1}\Big) = P\Big(\bar{X}-t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}\le\mu\le\bar{X}+t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}\Big)$$

and $(\bar{X}-t_{\alpha/2,n-1}\frac{s}{\sqrt{n}},\ \bar{X}+t_{\alpha/2,n-1}\frac{s}{\sqrt{n}})$ is a $1-\alpha$ confidence interval for $\mu$. If $n$ is sufficiently large, then the t-distribution with $n-1$ degrees of freedom is approximately the standard normal distribution, and $t_{\alpha/2,n-1}\approx z_{\alpha/2}$.

10.3 Binomial:

If $X$ is Bernoulli (with probability $\theta$ of success) then, by the Central Limit Theorem, $\bar{X}$ is distributed approximately normal with mean $\theta$ and variance $\frac{\theta(1-\theta)}{n}$. Since $\theta$ is unknown, we must approximate the variance. Note that $\theta(1-\theta)\le 0.5(1-0.5) = 0.25$ for all values of $\theta$. Thus, $0.25$ is a conservative estimate of $\theta(1-\theta)$, and we may use normal confidence intervals with this estimate of the variance.

10.4 Non-normal, large sample:

We may use the Central Limit Theorem and the fact that the maximum likelihood estimator is approximately normally distributed with mean $\theta$ and variance $\frac{1}{nI(\theta)}$ to construct an approximate confidence interval using the methods above. (We should estimate $I(\theta)$ by $I(\hat{\theta})$ or by some bound, as in the binomial case.)

11 The sum of normals squared is chi-squared (11)

Let $X_1,\dots,X_n$ be a sample from a standard normal distribution. Consider the cumulative distribution function of $\sum_{i=1}^n X_i^2$:

$$P\Big(\sum_{i=1}^n X_i^2\le y\Big) = \int_{\{x_1,\dots,x_n:\sum x_i^2\le y\}}f(x_1,\dots,x_n)\,dx_1\cdots dx_n = \int_{\{x_1,\dots,x_n:\sum x_i^2\le y\}}(2\pi)^{-n/2}e^{-\frac{1}{2}\sum x_i^2}\,dx_1\cdots dx_n$$

Taking the derivative with respect to $y$, we find:

$$f(y) = \frac{d}{dy}P\Big(\sum_{i=1}^n X_i^2\le y\Big) = \frac{d}{dy}\int_{\{x:\sum x_i^2\le y\}}(2\pi)^{-n/2}e^{-\frac{1}{2}\sum x_i^2}\,dx_1\cdots dx_n$$

Recall that $\frac{d}{dt}\int_{\{x:T(x)\le t\}}F(T(x),\theta)h(x)\,dx_1\cdots dx_n = F(t,\theta)\frac{d}{dt}\int_{\{x:T(x)\le t\}}h(x)\,dx_1\cdots dx_n$. Thus, we find:

$$f(y) = (2\pi)^{-n/2}e^{-y/2}\frac{d}{dy}\int_{\{x:\sum x_i^2\le y\}}dx_1\cdots dx_n$$

Notice that $\int_{\{x:\sum x_i^2\le y\}}dx_1\cdots dx_n$ is the volume of an $n$-ball of radius $\sqrt{y}$, which is proportional to $y^{n/2}$. Thus, $\frac{d}{dy}\int_{\{x:\sum x_i^2\le y\}}dx_1\cdots dx_n$ is proportional to $\frac{n}{2}y^{\frac{n}{2}-1}$. Thus, $f(y)$ is proportional to $e^{-y/2}y^{\frac{n}{2}-1}$, which is the $\mathrm{Gamma}(\frac{n}{2},2) = \chi^2_n$ density. Thus, $\sum_{i=1}^n X_i^2$ has a chi-squared distribution with $n$ degrees of freedom.

12 Independence of the estimated mean and standard deviation (12)

Let $X_1,\dots,X_n$ be a sample from a $\mathrm{Normal}(\mu,\sigma^2)$ distribution. Set $Z_i = \frac{X_i-\mu}{\sigma}$. Then, $Z_1,\dots,Z_n$ are distributed $\mathrm{Normal}(0,1)$. Let $U = \sqrt{n}\bar{Z}$ and $V = (n-1)s_Z^2 = \sum(Z_i-\bar{Z})^2$. The joint distribution function of these two variables is given by:

$$F(u,v) = \int_{\{Z_1,\dots,Z_n:\sqrt{n}\bar{Z}\le u,\ \sum(Z_i-\bar{Z})^2\le v\}}(2\pi)^{-n/2}e^{-\frac{1}{2}\sum Z_i^2}\,dZ_1\cdots dZ_n$$

Let $P$ be any real orthonormal matrix with first row $(\frac{1}{\sqrt{n}},\dots,\frac{1}{\sqrt{n}})$; such a matrix exists because this vector is of length one and we may apply the Gram-Schmidt orthogonalization. Set $Y = PZ$. Then, $Y_1 = \frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i = U$. Since orthogonal matrices preserve inner products, $\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n Z_i^2$, and $V = \sum_{i=1}^n(Z_i-\bar{Z})^2 = \sum_{i=1}^n Z_i^2-n\bar{Z}^2 = \sum_{i=1}^n Y_i^2-Y_1^2 = \sum_{i=2}^n Y_i^2$. Substituting $Y_1,\dots,Y_n$ for $Z_1,\dots,Z_n$ in the joint distribution function above, we find:

$$F(u,v) = \int_{\{Y_1\le u,\ \sum_{i=2}^n Y_i^2\le v\}}(2\pi)^{-n/2}e^{-\frac{1}{2}\sum_{i=1}^n Y_i^2}\,dY_1\cdots dY_n = (2\pi)^{-n/2}\Big(\int_{-\infty}^u e^{-\frac{1}{2}Y_1^2}\,dY_1\Big)\Big(\int_{\{Y_2,\dots,Y_n:\sum_{i=2}^n Y_i^2\le v\}}e^{-\frac{1}{2}\sum_{i=2}^n Y_i^2}\,dY_2\cdots dY_n\Big)$$

Thus, we see that the density function factors, meaning that $U$ and $V$ are independent. Furthermore, the factor containing $U = \sqrt{n}\bar{Z}$ is of the form of a standard normal random variable, and the factor containing $V = (n-1)s_Z^2$ is of the form of a $\chi^2$ random variable with $n-1$ degrees of freedom. Since $U = \sqrt{n}\bar{Z} = \sqrt{n}\frac{\bar{x}-\mu}{\sigma}$ and $V = \sum\big(\frac{X_i-\mu}{\sigma}-\frac{\bar{X}-\mu}{\sigma}\big)^2 = \frac{\sum(X_i-\bar{X})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}$, we have shown that $\sqrt{n}\frac{\bar{x}-\mu}{\sigma}$ and $\frac{(n-1)s^2}{\sigma^2}$ are independent, with the former normally distributed and the latter distributed chi-square.
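This independence can be illustrated by simulation: across many normal samples, the empirical correlation between $\bar{X}$ and $s^2$ should be near zero. A sketch in pure Python (the seed and sizes are arbitrary choices):

```python
import random

random.seed(3)
mu, sigma, n, reps = 0.0, 1.0, 8, 20000
means, vars_ = [], []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    m = sum(xs) / n
    means.append(m)
    vars_.append(sum((x - m) ** 2 for x in xs) / (n - 1))

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    sa = (sum((x - ma) ** 2 for x in a) / len(a)) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / len(b)) ** 0.5
    return cov / (sa * sb)

# For normal data, Xbar and s^2 are independent, so their sample
# correlation should be near zero.
assert abs(corr(means, vars_)) < 0.03
```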

13 The t-distribution (13)

Let $X$ be a standard normal random variable and $Y$ an independently distributed chi-square variable with $n$ degrees of freedom. Let $T = \frac{X}{\sqrt{Y/n}}$. Then $T$ is distributed with a t-distribution with $n$ degrees of freedom.

If $X_1,\dots,X_n$ are independently distributed normal random variables, then $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ and $\frac{(n-1)s^2}{\sigma^2}$ are independently distributed standard normal and chi-square (with $n-1$ degrees of freedom) random variables respectively. Thus,

$$T = \frac{\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{1}{n-1}\cdot\frac{(n-1)s^2}{\sigma^2}}} = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}\cdot\frac{\sigma}{s} = \frac{\bar{X}-\mu}{s/\sqrt{n}}$$

has a t-distribution with $n-1$ degrees of freedom.

14 Some facts about hypothesis tests, levels of significance, and power (14)

Let the parameter space be the disjoint union of $H_0$ and $H_1$. A hypothesis test is a mapping from a sample $X_1,\dots,X_n$ to the set $\{H_0,H_1\}$. The inverse image of $H_1$ is called the critical region, $C$. $P_\theta(C) = P(C\mid\theta)$ is the probability of rejecting $H_0$ if $\theta$ is the true value; as a function of $\theta$ this is called the power function. The level of significance, which is the maximum probability of a false rejection, is $\sup_{\theta\in H_0}P_\theta(C)$.

14.1 A simple normal hypothesis test

Suppose $X$ is distributed $\mathrm{Normal}(\mu,\sigma^2)$, with $\sigma^2$ known. Let $H_0 = \{\mu:\mu\le\mu_0\}$ and $H_1 = \{\mu:\mu>\mu_0\}$. Let $C_1 = \{x:x>\mu_0\}$. Then,

$$P_\mu(C_1) = P_\mu(X>\mu_0) = P_\mu\Big(\frac{X-\mu}{\sigma}>\frac{\mu_0-\mu}{\sigma}\Big) = 1-\Phi\Big(\frac{\mu_0-\mu}{\sigma}\Big)$$

Since $\Phi$ is an increasing function, the expression above is maximized by choosing the largest possible $\mu$ in $H_0$. Therefore, the level of significance is:

$$\sup_{\mu\le\mu_0}P_\mu(C_1) = 1-\Phi\Big(\frac{\mu_0-\mu_0}{\sigma}\Big) = 1-\frac{1}{2} = \frac{1}{2}$$

Consider a second critical region: $C_2 = \{\bar{X}:\bar{X}>\mu_0+z_\alpha\frac{\sigma}{\sqrt{n}}\}$. Then,

$$P_\mu(C_2) = P_\mu\Big(\bar{X}>\mu_0+z_\alpha\frac{\sigma}{\sqrt{n}}\Big) = P_\mu\Big(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\ge\frac{\mu_0+z_\alpha\frac{\sigma}{\sqrt{n}}-\mu}{\sigma/\sqrt{n}}\Big) = 1-\Phi\Big(\frac{\mu_0-\mu}{\sigma/\sqrt{n}}+z_\alpha\Big)$$

Then the level of significance is:

$$\sup_{\mu\le\mu_0}P_\mu(C_2) = \sup_{\mu\le\mu_0}\Big[1-\Phi\Big(\frac{\mu_0-\mu}{\sigma/\sqrt{n}}+z_\alpha\Big)\Big] = 1-\Phi(z_\alpha) = \alpha$$

(with $z_\alpha$ chosen so that $\Phi(z_\alpha) = 1-\alpha$).

15 The Neyman-Pearson Lemma (15)

Theorem 2: Let $X$ be a sample from the distribution $f(x;\theta)$. Let $H_0$ and $H_1$ be the simple hypotheses $H_0:\theta=\theta_0$ and $H_1:\theta=\theta_1$. Define $R(X) = \frac{\prod_{i=1}^n f(x_i,\theta_1)}{\prod_{i=1}^n f(x_i,\theta_0)}$. Define a critical region $C\subset\mathbb{R}^n$ for a given $\lambda\in\mathbb{R}$ by $C = \{x\in\mathbb{R}^n:R(x)>\lambda\}$; that is, reject when $R(x)>\lambda$. If $D$ is the critical region for any test such that $P_{\theta_0}(D)\le P_{\theta_0}(C)$, then $P_{\theta_1}(D)\le P_{\theta_1}(C)$. (The test based on the likelihood ratio is the most powerful for a given level of significance.)

Proof. For any $A\subset\mathbb{R}^n$, define $P_0(A) = \int_A\prod_{i=1}^n f(x_i,\theta_0)\,dx_1\cdots dx_n$ and $P_1(A) = \int_A\prod_{i=1}^n f(x_i,\theta_1)\,dx_1\cdots dx_n$. Thus, we want to show that $P_0(D)\le P_0(C)$ implies that $P_1(D)\le P_1(C)$. Let $D$ be given. Then:

$$P_1(D\cap C^c) = \int_{D\cap\{x:R(x)\le\lambda\}}\prod_{i=1}^n f(x_i,\theta_1)\,dx = \int_{D\cap\{x:R(x)\le\lambda\}}R(x)\prod_{i=1}^n f(x_i,\theta_0)\,dx \le \lambda\int_{D\cap\{x:R(x)\le\lambda\}}\prod_{i=1}^n f(x_i,\theta_0)\,dx = \lambda P_0(D\cap C^c)$$
$$P_1(D^c\cap C) = \int_{D^c\cap\{x:R(x)>\lambda\}}\prod_{i=1}^n f(x_i,\theta_1)\,dx = \int_{D^c\cap\{x:R(x)>\lambda\}}R(x)\prod_{i=1}^n f(x_i,\theta_0)\,dx \ge \lambda\int_{D^c\cap\{x:R(x)>\lambda\}}\prod_{i=1}^n f(x_i,\theta_0)\,dx = \lambda P_0(D^c\cap C)$$

Combining these facts, we find:

$$P_1(D) = P_1(D\cap C)+P_1(D\cap C^c) \le P_1(D\cap C)+\lambda P_0(D\cap C^c) = P_1(D\cap C)+\lambda(P_0(D)-P_0(D\cap C))$$
$$\le P_1(D\cap C)+\lambda(P_0(C)-P_0(D\cap C)) = P_1(D\cap C)+\lambda P_0(C\cap D^c) \le P_1(D\cap C)+P_1(C\cap D^c) = P_1(C)$$

16 The likelihood ratio test for the normal distribution (16)

Suppose $X$ is a sample from the distribution $\mathrm{Normal}(\mu,\sigma^2)$. Let the null hypothesis be $H_0:\mu=\mu_0$, $\sigma^2$ unknown. Let the alternative hypothesis be $H_1:\mu\ne\mu_0$, $\sigma^2$ unknown. The likelihood function of this distribution is:

$$L(\mu,\sigma^2,X) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2}(X_i-\mu)^2} = (2\pi\sigma^2)^{-n/2}e^{-\frac{1}{2\sigma^2}\sum(X_i-\mu)^2}$$

Under the null hypothesis, $\hat{\mu} = \mu_0$, so we maximize the likelihood function with respect to $\sigma^2$:

$$\log L(\mu_0,\sigma^2,X) = -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum(X_i-\mu_0)^2$$
$$\frac{\partial}{\partial\sigma^2}\log L(\mu_0,\sigma^2,X) = -\frac{n}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}\sum(X_i-\mu_0)^2 = 0$$
$$\hat{\sigma}_0^2 = \frac{1}{n}\sum(X_i-\mu_0)^2$$

Under the alternative hypothesis, we have the maximum likelihood estimates: $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i-\bar{X})^2$. Then, the likelihood ratio of the null and alternative hypotheses is:

$$\lambda = \frac{L(\mu_0,\hat{\sigma}_0^2)}{L(\hat{\mu},\hat{\sigma}^2)} = \frac{(2\pi\hat{\sigma}_0^2)^{-n/2}e^{-\frac{1}{2\hat{\sigma}_0^2}\sum(X_i-\mu_0)^2}}{(2\pi\hat{\sigma}^2)^{-n/2}e^{-\frac{1}{2\hat{\sigma}^2}\sum(X_i-\bar{X})^2}} = \Big(\frac{\hat{\sigma}^2}{\hat{\sigma}_0^2}\Big)^{n/2}\cdot\frac{e^{-n/2}}{e^{-n/2}} = \Big(\frac{\sum(X_i-\bar{X})^2}{\sum(X_i-\mu_0)^2}\Big)^{n/2}$$

Recall that $\sum(X_i-\mu_0)^2 = \sum(X_i-\bar{X})^2+n(\bar{X}-\mu_0)^2$, so

$$\lambda = \Big(\frac{\sum(X_i-\bar{X})^2}{\sum(X_i-\bar{X})^2+n(\bar{X}-\mu_0)^2}\Big)^{n/2} = \Big(\frac{1}{1+\frac{n(\bar{X}-\mu_0)^2}{\sum(X_i-\bar{X})^2}}\Big)^{n/2}$$

This is a decreasing function of $\frac{n(\bar{X}-\mu_0)^2}{\sum(X_i-\bar{X})^2} = \frac{1}{n-1}\Big(\frac{|\bar{X}-\mu_0|}{s/\sqrt{n}}\Big)^2 = \frac{t^2}{n-1}$, which is an increasing function of the absolute t-statistic. Thus, $\lambda$ is minimized if and only if the t-statistic is large in absolute value, and a t-test is a likelihood ratio test.
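The monotone relationship between $\lambda$ and the t-statistic can be verified numerically: the derivation implies $\lambda = (1+t^2/(n-1))^{-n/2}$. A check in pure Python on arbitrary illustrative data:

```python
# Check lambda = (1 + t^2/(n-1))^(-n/2) on arbitrary data.
x = [4.1, 5.3, 3.8, 6.0, 4.9, 5.5]   # illustrative sample
mu0 = 4.0
n = len(x)
xbar = sum(x) / n
ss = sum((xi - xbar) ** 2 for xi in x)   # sum((X_i - Xbar)^2)
s2 = ss / (n - 1)

lam = (ss / sum((xi - mu0) ** 2 for xi in x)) ** (n / 2)
t = (xbar - mu0) / (s2 / n) ** 0.5
assert abs(lam - (1 + t ** 2 / (n - 1)) ** (-n / 2)) < 1e-12
```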

17 Linear combinations of multivariate normals (and MGF's) (17)

17.1 Case 1: Standard Normal

The moment generating function for a standard normal vector, $Z$, is given by:

$$E(e^{t'Z}) = E(e^{\sum t_jz_j}) = \prod_{j=1}^m E(e^{t_jz_j}) = \prod_{j=1}^m e^{\frac{1}{2}t_j^2} = e^{\frac{1}{2}\sum_j t_j^2} = e^{\frac{1}{2}\|t\|^2}$$

17.2 Case 2: General normal

Let $X$ be a multivariate normal random vector with distribution $\mathrm{Normal}_m(\mu,\Sigma)$. Then, we may write $X = \Sigma^{1/2}Z+\mu$, where $Z$ is distributed $\mathrm{Normal}_m(0,I)$. Then, the moment generating function is:

$$E(e^{t'X}) = E(e^{t'(\Sigma^{1/2}Z+\mu)}) = e^{t'\mu}E(e^{t'\Sigma^{1/2}Z}) = e^{t'\mu}E(e^{(\Sigma^{1/2}t)'Z}) = e^{t'\mu}e^{\frac{1}{2}\|\Sigma^{1/2}t\|^2} = e^{t'\mu}e^{\frac{1}{2}t'\Sigma t} = e^{t'\mu+\frac{1}{2}t'\Sigma t}$$

17.3 Case 3: Linear combinations of normal variables

Let $Y = BX$, where $B$ is not necessarily symmetric, invertible, or square. Then, $Y = BX = B\Sigma^{1/2}Z+B\mu$, and its moment generating function is:

$$E(e^{t'Y}) = E(e^{t'(B\Sigma^{1/2}Z+B\mu)}) = e^{t'B\mu}E(e^{t'B\Sigma^{1/2}Z}) = e^{t'B\mu}e^{\frac{1}{2}\|\Sigma^{1/2}B't\|^2} = e^{t'B\mu}e^{\frac{1}{2}(\Sigma^{1/2}B't)'(\Sigma^{1/2}B't)} = e^{t'B\mu+\frac{1}{2}t'B\Sigma B't}$$

Thus, $Y = BX$ is also multivariate normal, with mean $B\mu$ and covariance matrix $B\Sigma B'$.

18 The estimated regression coefficients minimize the estimated errors (18)

Let $X$ be a vector of length $n$ and $A$ be an $n\times k$ matrix of rank $k$. Let $\hat{\theta} = (A'A)^{-1}A'X$. Then,

$$\|X-A\theta\|^2 = \|X-A\hat{\theta}+A\hat{\theta}-A\theta\|^2 = \|X-A\hat{\theta}\|^2+\|A\hat{\theta}-A\theta\|^2+2(X-A\hat{\theta})'(A\hat{\theta}-A\theta)$$

The last term is zero:

$$(X-A\hat{\theta})'(A\hat{\theta}-A\theta) = X'A\hat{\theta}-X'A\theta-\hat{\theta}'A'A\hat{\theta}+\hat{\theta}'A'A\theta$$
$$= X'A(A'A)^{-1}A'X-X'A\theta-X'A(A'A)^{-1}A'A(A'A)^{-1}A'X+X'A(A'A)^{-1}A'A\theta$$
$$= X'A(A'A)^{-1}A'X-X'A\theta-X'A(A'A)^{-1}A'X+X'A\theta = 0$$

Thus,

$$\|X-A\theta\|^2 = \|X-A\hat{\theta}\|^2+\|A\hat{\theta}-A\theta\|^2$$

Since $X$, $A$, and therefore $\hat{\theta}$ are all fixed, we minimize the expression above by choosing $\theta$ in the second term. Since $\|A\hat{\theta}-A\theta\|^2\ge 0$ for all values of $\theta$, we choose $\theta = \hat{\theta}$, so that $\|A\hat{\theta}-A\theta\|^2 = 0$ and $\|X-A\theta\|^2$ is minimized.

19 The properties of the least squares coefficients (19)

Suppose $X$ is a random vector with mean $A\theta$ and covariance matrix $\sigma^2I$. Let $\hat{\theta} = (A'A)^{-1}A'X$. Then,

$$E(\hat{\theta}) = E((A'A)^{-1}A'X) = (A'A)^{-1}A'E(X) = (A'A)^{-1}A'A\theta = \theta$$
$$\operatorname{Cov}(\hat{\theta}) = \operatorname{Cov}((A'A)^{-1}A'X) = (A'A)^{-1}A'(\sigma^2I)A(A'A)^{-1} = \sigma^2(A'A)^{-1}$$

If $X$ is normally distributed, then $\hat{\theta}$ is a linear transformation of $X$ and thus is normally distributed as well.
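A minimal numerical sketch of $\hat{\theta} = (A'A)^{-1}A'X$ for a two-column design matrix, solved by hand via the normal equations (pure Python; the data are arbitrary illustrative values). It also checks the orthogonality $A'(X-A\hat{\theta}) = 0$ that makes the cross term in section 18 vanish:

```python
# Tiny least-squares sketch: A is n x 2 (intercept and slope).
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
X = [1.1, 1.9, 3.2, 3.9]

# Form A'A (2x2) and A'X (2-vector).
ata = [[sum(a[i] * a[j] for a in A) for j in range(2)] for i in range(2)]
atx = [sum(a[i] * x for a, x in zip(A, X)) for i in range(2)]

# Invert the 2x2 matrix and compute theta_hat = (A'A)^{-1} A'X.
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
inv = [[ata[1][1] / det, -ata[0][1] / det],
       [-ata[1][0] / det, ata[0][0] / det]]
theta = [sum(inv[i][j] * atx[j] for j in range(2)) for i in range(2)]

# Residual X - A theta_hat is orthogonal to every column of A.
resid = [x - sum(a[j] * theta[j] for j in range(2)) for a, x in zip(A, X)]
assert abs(sum(a[0] * r for a, r in zip(A, resid))) < 1e-9
assert abs(sum(a[1] * r for a, r in zip(A, resid))) < 1e-9
```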

20 Estimating the variance of the coefficients (20)

Suppose $X$ is a multivariate normal random vector with mean $A\theta$ and covariance matrix $\sigma^2I$. Let $\hat{\theta} = (A'A)^{-1}A'X$. Let $V\in\mathbb{R}^k$ be any vector. Then,

$$\|(X+AV)-A(A'A)^{-1}A'(X+AV)\|^2 = \|X+AV-A(A'A)^{-1}A'X-A(A'A)^{-1}A'AV\|^2 = \|X+AV-A\hat{\theta}-AV\|^2 = \|X-A\hat{\theta}\|^2$$

Thus, we may replace $X$ by $X+AV$ and re-estimate $\hat{\theta}$ for this new vector without changing the distance between the observed vector and the predicted vector $A\hat{\theta}$. In particular, we may choose $V = -\theta$ and set $Y = X-A\theta$, so that $E(Y) = E(X-A\theta) = E(X)-A\theta = 0$. Since we are adding a fixed vector to $X$ to give $Y$, the covariance matrix does not change, and $\operatorname{Cov}(Y) = \sigma^2I$.

Let $e_1,\dots,e_k$ be an orthonormal basis for the column space of $A$ (such a basis exists by the Gram-Schmidt algorithm and the rank of $A$). Choose $e_{k+1},\dots,e_n$ such that $\{e_1,\dots,e_n\}$ is an orthonormal basis for $\mathbb{R}^n$. Since $Y\in\mathbb{R}^n$, we may write $Y = \sum_{j=1}^n(Y'e_j)e_j$. This is a linear combination of fixed basis vectors with random coefficients, $\{Y'e_1,\dots,Y'e_n\}$. The moments of these coefficients are:

$$E(Y'e_j) = E(e_j'Y) = e_j'E(Y) = e_j'(0) = 0$$
$$\operatorname{Cov}(Y'e_j,Y'e_h) = E(e_j'YY'e_h)-E(e_j'Y)E(Y'e_h) = e_j'E(YY')e_h = e_j'(\sigma^2I)e_h = \sigma^2(e_j'e_h) = \sigma^2\text{ if }j=h,\ 0\text{ otherwise}$$

Thus, the coefficients $\{Y'e_1,\dots,Y'e_n\}$ have mean zero and covariance matrix $\sigma^2I$. Since $A\hat{\theta}$ is the projection of $Y$ onto the column space of $A$, that is, the span of $\{e_1,\dots,e_k\}$, we have $A\hat{\theta} = \sum_{j=1}^k(Y'e_j)e_j$, so that $Y-A\hat{\theta} = \sum_{j=k+1}^n(Y'e_j)e_j$ and $\|Y-A\hat{\theta}\|^2 = \sum_{j=k+1}^n(Y'e_j)^2$. Then, $E(\|Y-A\hat{\theta}\|^2) = E\big(\sum_{j=k+1}^n(Y'e_j)^2\big) = \sum_{j=k+1}^n E((Y'e_j)^2) = \sum_{j=k+1}^n\sigma^2 = (n-k)\sigma^2$, since the coefficients have zero mean and zero covariance.

21 The joint distribution of the least squares estimates (21)

Suppose $X$ is a multivariate normal random vector with mean $A\theta$ and covariance matrix $\sigma^2I$. Let $\hat{\theta} = (A'A)^{-1}A'X$. In this case, recall that $\hat{\theta}$ is normally distributed with mean $\theta$ and covariance matrix $\sigma^2(A'A)^{-1}$.

In the previous theorem, we showed that we may normalize $X$ by subtracting its expected value; thus, we assume that $X$ has zero mean. Then, we may write $X = \sum_{j=1}^n(X'e_j)e_j$, $A\hat{\theta} = \sum_{j=1}^k(X'e_j)e_j$, and $\|X-A\hat{\theta}\|^2 = \sum_{j=k+1}^n(X'e_j)^2$, where the coefficients $\{X'e_1,\dots,X'e_n\}$ have mean zero and covariance matrix $\sigma^2I$. Then, $\frac{\|X-A\hat{\theta}\|^2}{\sigma^2} = \sum_{j=k+1}^n\big(\frac{X'e_j}{\sigma}\big)^2$ is the sum of $n-k$ squared independent standard normal random variables, which has a chi-squared distribution with $n-k$ degrees of freedom.

Since $\hat{\theta} = (A'A)^{-1}A'\big(\sum_{j=1}^n(X'e_j)e_j\big) = \sum_{j=1}^n(X'e_j)(A'A)^{-1}A'e_j$ and $A'e_j = 0$ when $j>k$ (since $e_j$ is orthogonal to the column space of $A$ in this case), $\hat{\theta}$ depends only on $\{X'e_1,\dots,X'e_k\}$, while $\frac{\|X-A\hat{\theta}\|^2}{\sigma^2}$ depends only on $\{X'e_{k+1},\dots,X'e_n\}$. Since all the $X'e_j$ are independent, these two sets are independent, and $\hat{\theta}$ and $\frac{\|X-A\hat{\theta}\|^2}{\sigma^2}$ are independent.

22 22 Prediction error (22)

Suppose $X$ is a multivariate normal random vector with mean $A\theta$ and covariance matrix $\sigma^2I$. Let $\hat{\theta} = (A'A)^{-1}A'X$. Let $s^2 = \frac{1}{n-k}\|X-A\hat{\theta}\|^2$. Let $X_{n+1}$ be a new observation with associated inputs $\alpha$, which is independent of all previous observations. Then, the predicted value of $X_{n+1}$ is $\hat{X}_{n+1} = \alpha'\hat{\theta}$, and the error of prediction is $\alpha'\hat{\theta}-X_{n+1} = \alpha'(A'A)^{-1}A'X-X_{n+1}$. The moments of the error of prediction are:

$$E(\alpha'(A'A)^{-1}A'X-X_{n+1}) = \alpha'(A'A)^{-1}A'E(X)-E(X_{n+1}) = \alpha'\theta-\alpha'\theta = 0$$
$$\operatorname{Var}(\alpha'\hat{\theta}-X_{n+1}) = \operatorname{Var}(\alpha'\hat{\theta})+\operatorname{Var}(X_{n+1}) = \alpha'(\sigma^2(A'A)^{-1})\alpha+\sigma^2 = \sigma^2(1+\alpha'(A'A)^{-1}\alpha)$$

The estimated standard deviation of the error of prediction is found by replacing $\sigma^2$ by $s^2$, which gives an estimated standard deviation of $s\sqrt{1+\alpha'(A'A)^{-1}\alpha}$. Notice that $X_{n+1}$ is independent of both $s^2$ and $\hat{X}_{n+1}$ because they depend only on $X_1,\dots,X_n$. In addition, $s^2$ is independent of $\hat{X}_{n+1}$ because $s^2$ and $\hat{\theta}$ are independent. Thus, we may consider the following ratio, which has a t-distribution with $n-k$ degrees of freedom, since the numerator is a standard normal random variable and the denominator is the square root of an independent chi-square random variable with $n-k$ degrees of freedom divided by its degrees of freedom:

$$\frac{\frac{\hat{X}_{n+1}-X_{n+1}}{\sigma\sqrt{1+\alpha'(A'A)^{-1}\alpha}}}{\sqrt{\frac{(n-k)s^2/\sigma^2}{n-k}}} = \frac{\hat{X}_{n+1}-X_{n+1}}{s\sqrt{1+\alpha'(A'A)^{-1}\alpha}}$$

23 Form of the ANOVA Test (23)

Let $X_{ij}\sim\mathrm{Normal}(\mu_i,\sigma^2)$, for $j = 1,\dots,n$, $i = 1,\dots,k$, with all the $X_{ij}$ independent. Let the null hypothesis be $H_0:\mu_1=\dots=\mu_k=\mu$. Let the alternative hypothesis be $H_1:\mu_i\ne\mu_{i'}$ for some $i\ne i'$. We find the likelihood ratio test statistic.

Under the null hypothesis, $X_{ij}\sim\mathrm{Normal}(\mu,\sigma^2)$, and this is a sample of size $nk$ from the population $\mathrm{Normal}(\mu,\sigma^2)$. The maximum likelihood estimators for $\mu$ and $\sigma^2$ are then:

$$\hat{\mu} = \frac{1}{nk}\sum_{i=1}^k\sum_{j=1}^n X_{ij} = \bar{X},\qquad\hat{\sigma}_0^2 = \frac{1}{nk}\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\bar{X})^2$$

Substituting these estimators into the likelihood function gives the restricted maximum likelihood:

$$\max_{\theta\in H_0}L(X,\theta) = (2\pi\hat{\sigma}_0^2)^{-nk/2}e^{-\frac{1}{2\hat{\sigma}_0^2}\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\bar{X})^2} = (2\pi\hat{\sigma}_0^2)^{-nk/2}e^{-\frac{nk\hat{\sigma}_0^2}{2\hat{\sigma}_0^2}} = (2\pi\hat{\sigma}_0^2e)^{-nk/2}$$

Without this restriction, we instead have $k$ samples of size $n$ with a common variance. Since the means are unrelated, we have

$$\hat{\mu}_i = \frac{1}{n}\sum_{j=1}^n X_{ij} = \bar{X}_i$$

We find the maximum likelihood estimator for $\sigma^2$ by looking at the log-likelihood:

$$\log L(\mu_1,\dots,\mu_k,\sigma^2) = \log\Big((2\pi\sigma^2)^{-nk/2}e^{-\frac{1}{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\mu_i)^2}\Big) = -\frac{nk}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\mu_i)^2$$
$$\frac{\partial}{\partial\sigma^2}\log L(\mu_1,\dots,\mu_k,\sigma^2) = -\frac{nk}{2\sigma^2}+\frac{1}{2(\sigma^2)^2}\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\mu_i)^2 = \Big(\frac{1}{nk}\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\mu_i)^2-\sigma^2\Big)\frac{nk}{2(\sigma^2)^2}$$

Solving for $\sigma^2$ and substituting the maximum likelihood estimators for the $\mu_i$, we find:

$$\hat{\sigma}_1^2 = \frac{1}{nk}\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\bar{X}_i)^2$$

Substituting these estimators gives the unrestricted maximum likelihood:

$$\max_{\theta\in H_1}L(X,\theta) = (2\pi\hat{\sigma}_1^2)^{-nk/2}e^{-\frac{1}{2\hat{\sigma}_1^2}\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\bar{X}_i)^2} = (2\pi\hat{\sigma}_1^2e)^{-nk/2}$$

We find the likelihood ratio by looking at the ratio of the restricted and unrestricted maximum likelihoods:

$$\lambda(X) = \frac{(2\pi\hat{\sigma}_0^2e)^{-nk/2}}{(2\pi\hat{\sigma}_1^2e)^{-nk/2}} = \Big(\frac{\hat{\sigma}_1^2}{\hat{\sigma}_0^2}\Big)^{nk/2} = \Big(\frac{\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\bar{X}_i)^2}{\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\bar{X})^2}\Big)^{nk/2}$$

Notice that

$$\sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\bar{X})^2 = \sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\bar{X}_i+\bar{X}_i-\bar{X})^2 = \sum_{i=1}^k\Big(\sum_{j=1}^n(X_{ij}-\bar{X}_i)^2+n(\bar{X}_i-\bar{X})^2\Big) = \sum_{i=1}^k\sum_{j=1}^n(X_{ij}-\bar{X}_i)^2+n\sum_{i=1}^k(\bar{X}_i-\bar{X})^2$$

The first term in this equation is defined as the sum of squares within (SSW); the second term is defined as the sum of squares between (SSB). Using these definitions, we may rewrite the likelihood ratio as:

$$\lambda(X) = \Big(\frac{SSW}{SSW+SSB}\Big)^{nk/2} = \Big(\frac{1}{1+\frac{SSB}{SSW}}\Big)^{nk/2}$$

Thus, we see that $\lambda(X)$ is a decreasing function of $\frac{SSB}{SSW}$.
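The decomposition of the total sum of squares into SSW plus SSB can be checked directly; a sketch in pure Python on an arbitrary illustrative layout with $k = 3$ groups of $n = 4$ observations:

```python
# Check SST = SSW + SSB on a small illustrative layout.
data = [[5.1, 4.9, 5.6, 5.0],
        [6.2, 6.0, 5.8, 6.4],
        [4.3, 4.7, 4.1, 4.5]]
k, n = len(data), len(data[0])

grand = sum(sum(row) for row in data) / (k * n)   # overall mean Xbar
group = [sum(row) / n for row in data]            # group means Xbar_i

sst = sum((x - grand) ** 2 for row in data for x in row)
ssw = sum((x - gm) ** 2 for row, gm in zip(data, group) for x in row)
ssb = n * sum((gm - grand) ** 2 for gm in group)
assert abs(sst - (ssw + ssb)) < 1e-9
```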

24 The ANOVA test statistic has an F-distribution (24)

k 2 1 k 1 n Recall that SSB = n (X¯i X¯) .SincewemaywriteX¯ = ( Xij)= i=1 − k i=1 n i=1 1 k X¯ , SSB can be written as a function of only X¯ , ..., X¯ .SinceX¯ , ..., X¯ k i=1 i P 1 k P 1 P k are the means of disjoint independent samples of size n from Normal(µ, σ2), ¯P ¯ σ2 k ¯ X1,...,Xk are independent and distributed Normal(µ, n ).Thus, i=1(Xi X¯)2 is the sum of squares about the average of a sample of size k from a distribu-− 2 k ¯ ¯ 2 σ (Xi X) SSB P i=1 − tion with variance n ,whichmeansthat σ2/n = σ2 has a chi-squared distribution with k 1 degrees of freedom.P − k n ¯ 2 2 1 n ¯ 2 Recall that SSW = i=1 j=1(Xij Xi) .Letsi = n 1 j=1(Xij Xi) − − − P P P

25 for each i =1, ...k.Thenwemayrewrite

k n 2 SSW = (Xij X¯i) − i=1 j=1 X X k = (n 1)s2 − i i=1 X 2 2 Because s1, ..., sk are the sample of samples of size n from normal pop- 2 2 2 (n 1)s1 (n 1)sk ulations with variance σ , −σ2 , ..., −σ2 are each distributed chi-squared 2 2 with n 1 degrees of freedom. Since s1, ..., sk depend on disjoint independent k 2 − (n 1)s SSW i=1 − i samples, they are independent. Thus, σ2 = σ2 is the sum of k inde- pendent chi-squared variables with n P1 degrees of freedom. By the Addition Theorem for Chi-Squares, this sum has− a chi-squared distribution with k(n 1) degrees of freedom. − 2 Consider X¯i and s .Ifi = i , then these are the mean and sample variance i0 0 from a sample from a normal population. These two statistics are independent. If i = i0,thenthesetwostatisticsdependondifferent independent samples, and 6 ¯ ¯ 2 2 are therefore independent. Thus, X1, ..., Xk and s1, ..., sk are independent sets. Since SSB and SSW depend{ on these} two{ sets, SSB} and SSW are independent. Therefore, the following expression is the ratio of independent chi- square random variables with k 1 and k(n 1) degrees of freedom respectively, divided by their degrees of freedom:− −

\[
F = \frac{\frac{SSB}{\sigma^2}/(k-1)}{\frac{SSW}{\sigma^2}/k(n-1)} = \frac{SSB/(k-1)}{SSW/k(n-1)}
\]

Hence, the test statistic for ANOVA, $\frac{SSB/(k-1)}{SSW/k(n-1)}$, has an F-distribution with $k-1$ and $k(n-1)$ degrees of freedom.

25 Multivariate Central Limit Theorem (25)

Theorem 3 Let $X_1, \ldots, X_n$ be a sample of independent identically distributed random vectors in $\mathbb{R}^m$. Let $E(X_i) = \mu$ and $\mathrm{Cov}(X_i) = \Sigma$. Then, as $n \to \infty$, $\frac{1}{\sqrt{n}}\sum_{j=1}^{n}(X_j - \mu)$ has the limiting distribution Normal$_m(0, \Sigma)$.

Proof. Let $t \in \mathbb{R}^m$ be any fixed vector. Define $\xi_n = t'\left(\frac{1}{\sqrt{n}}\sum_{j=1}^{n}(X_j - \mu)\right) = \frac{1}{\sqrt{n}}\sum_{j=1}^{n} t'(X_j - \mu)$. Each $t'(X_j - \mu)$ is an independent, identically distributed random variable with:

\[
E(t'(X_j - \mu)) = t'E(X_j - \mu) = t'E(X_j) - t'\mu = t'\mu - t'\mu = 0
\]

\[
\mathrm{Var}(t'(X_j - \mu)) = t'\mathrm{Cov}(X_j - \mu)t = t'\mathrm{Cov}(X_j)t = t'\Sigma t
\]

By the one-variable central limit theorem, $\frac{1}{\sqrt{n}}\sum_{j=1}^{n} t'(X_j - \mu)$ has a limiting normal distribution with mean $0$ and variance $t'\Sigma t$.

By a theorem on the convergence of characteristic functions, the convergence of this distribution to a normal distribution implies the convergence of its characteristic function to the characteristic function of the same normal distribution. That is,

\[
\lim_{n \to \infty} E\left(e^{iu \frac{1}{\sqrt{n}}\sum_{j=1}^{n} t'(X_j - \mu)}\right) = e^{-\frac{1}{2}u^2 t'\Sigma t}
\]

for all $u \in \mathbb{R}$. This is true for all $t \in \mathbb{R}^m$. In particular, set $u = 1$. Then, we find

\[
\lim_{n \to \infty} E\left(e^{i t'\left(\frac{1}{\sqrt{n}}\sum_{j=1}^{n}(X_j - \mu)\right)}\right) = e^{-\frac{1}{2}t'\Sigma t}
\]

for all $t \in \mathbb{R}^m$, and every linear combination of the elements of $\frac{1}{\sqrt{n}}\sum_{j=1}^{n}(X_j - \mu)$ converges to a normal distribution. By the Multivariate Continuity Theorem, the convergence of every linear combination of the elements of a random vector to a normal distribution implies the convergence of the random vector to a multivariate normal distribution. Thus, $\frac{1}{\sqrt{n}}\sum_{j=1}^{n}(X_j - \mu)$ converges to a Normal$(0, \Sigma)$ distribution.

26 Random vectors describing multinomial distributions (26)

Let $Z = (Z_1, \ldots, Z_k)'$ be a random vector with $P(Z_i = 1) = p_i$, $\sum_{i=1}^{k} Z_i = 1$, and $\sum_{i=1}^{k} p_i = 1$. Because each $Z_i$ is a Bernoulli random variable with probability $p_i$ of success, $E(Z_i) = p_i$ and $\mathrm{Var}(Z_i) = p_i(1 - p_i)$. Because exactly one of $Z_1, \ldots, Z_k$ is one, $Z_iZ_j = 0$ when $i \neq j$. Therefore, we find that $\mathrm{Cov}(Z_i, Z_j) = E(Z_iZ_j) - E(Z_i)E(Z_j) = 0 - p_ip_j = -p_ip_j$. This means that the covariance matrix of $Z$ is:

\[
\mathrm{Cov}(Z) =
\begin{pmatrix}
p_1(1-p_1) & \cdots & -p_ip_j \\
\vdots & \ddots & \vdots \\
-p_ip_j & \cdots & p_k(1-p_k)
\end{pmatrix}
=
\begin{pmatrix}
p_1 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & p_k
\end{pmatrix}
-
\begin{pmatrix}
p_1 \\ \vdots \\ p_k
\end{pmatrix}
\begin{pmatrix}
p_1 & \cdots & p_k
\end{pmatrix}
\]
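This factorization of the covariance matrix, $\mathrm{Cov}(Z) = \mathrm{diag}(p) - pp'$, can be verified numerically. A sketch (the probability vector below is an arbitrary illustration, not from the notes):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])   # arbitrary class probabilities summing to one
k = len(p)

# Cov(Z) written entrywise: p_i(1 - p_i) on the diagonal, -p_i p_j off it.
entrywise = -np.outer(p, p)
np.fill_diagonal(entrywise, p * (1 - p))

# The factored form diag(p) - p p' from the expansion above.
factored = np.diag(p) - np.outer(p, p)
assert np.allclose(entrywise, factored)

# Empirical check: covariance of simulated one-hot indicator vectors.
rng = np.random.default_rng(0)
draws = rng.choice(k, size=100_000, p=p)
Z = np.eye(k)[draws]            # each row is one realization of Z
empirical = np.cov(Z, rowvar=False)
assert np.allclose(empirical, factored, atol=0.01)
```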

Let $P$ be the first matrix in this expansion; $P$ is a diagonal matrix with the $i$th diagonal entry equal to $p_i$. Then, $P^{-1/2}$ is a diagonal matrix with entries $\frac{1}{\sqrt{p_i}}$ along the diagonal, and:

\[
\mathrm{Cov}(P^{-1/2}Z) = P^{-1/2}\,\mathrm{Cov}(Z)\,P^{-1/2}
= P^{-1/2}\left(P - \begin{pmatrix} p_1 \\ \vdots \\ p_k \end{pmatrix}\begin{pmatrix} p_1 & \cdots & p_k \end{pmatrix}\right)P^{-1/2}
= I - \begin{pmatrix} \sqrt{p_1} \\ \vdots \\ \sqrt{p_k} \end{pmatrix}\begin{pmatrix} \sqrt{p_1} & \cdots & \sqrt{p_k} \end{pmatrix}
\]

27 The limiting distribution of the chi-square test statistic (27)

Consider a sample of size $n$ in which each observation falls into exactly one of $k$ classes, with probability $p_i$ of being in class $i$, $\sum_{i=1}^{k} p_i = 1$. Let $f_i$ be the sample frequency of class $i$, so that $\sum_{i=1}^{k} f_i = n$. Define $\Xi = \sum_{i=1}^{k}\left(\frac{f_i - np_i}{\sqrt{np_i}}\right)^2$.

Let $P$ be the diagonal matrix with diagonal entries $p_i$. Let $Z_j$ be a vector with $i$th entry equal to $1$ if the $j$th observation falls into the $i$th category and $0$ otherwise. Then, $\sum_{j=1}^{n} Z_j = (f_1, \ldots, f_k)'$ and $E\left(\sum_{j=1}^{n} Z_j\right) = \sum_{j=1}^{n} E(Z_j) = \sum_{j=1}^{n}(p_1, \ldots, p_k)' = (np_1, \ldots, np_k)'$. Then, we find:

\[
\begin{aligned}
\Xi &= \sum_{i=1}^{k}\left(\frac{f_i - np_i}{\sqrt{np_i}}\right)^2
= \left\| \begin{pmatrix} \frac{f_1 - np_1}{\sqrt{np_1}} \\ \vdots \\ \frac{f_k - np_k}{\sqrt{np_k}} \end{pmatrix} \right\|^2
= \left\| \frac{1}{\sqrt{n}}\begin{pmatrix} \frac{f_1 - np_1}{\sqrt{p_1}} \\ \vdots \\ \frac{f_k - np_k}{\sqrt{p_k}} \end{pmatrix} \right\|^2 \\
&= \left\| \frac{1}{\sqrt{n}} P^{-1/2}\begin{pmatrix} f_1 - np_1 \\ \vdots \\ f_k - np_k \end{pmatrix} \right\|^2
= \left\| \frac{1}{\sqrt{n}} P^{-1/2}\left(\begin{pmatrix} f_1 \\ \vdots \\ f_k \end{pmatrix} - \begin{pmatrix} np_1 \\ \vdots \\ np_k \end{pmatrix}\right) \right\|^2 \\
&= \left\| \frac{1}{\sqrt{n}} P^{-1/2}\left(\sum_{j=1}^{n} Z_j - \sum_{j=1}^{n} E(Z_j)\right) \right\|^2
= \left\| P^{-1/2}\,\frac{1}{\sqrt{n}}\sum_{j=1}^{n}(Z_j - E(Z_j)) \right\|^2
\end{aligned}
\]

By the multivariate central limit theorem, $\frac{1}{\sqrt{n}}\sum_{j=1}^{n}(Z_j - E(Z_j))$ has a limiting normal distribution with mean $0$ and covariance matrix $\mathrm{Cov}(Z)$. Thus, $P^{-1/2}\frac{1}{\sqrt{n}}\sum_{j=1}^{n}(Z_j - E(Z_j))$ has the limiting distribution Normal$(0, P^{-1/2}\mathrm{Cov}(Z)P^{-1/2}) = {}$Normal$\left(0, I - \begin{pmatrix} \sqrt{p_1} \\ \vdots \\ \sqrt{p_k}\end{pmatrix}\begin{pmatrix}\sqrt{p_1} & \cdots & \sqrt{p_k}\end{pmatrix}\right)$. Let $Y = P^{-1/2}\frac{1}{\sqrt{n}}\sum_{j=1}^{n}(Z_j - E(Z_j))$.

Let $Q$ be an orthogonal matrix with first row $(\sqrt{p_1}, \ldots, \sqrt{p_k})$. Since the norm is invariant under orthogonal transformations, $\|Y\|^2 = \|QY\|^2$. Also, $QY$ is distributed approximately Normal$(0, Q\,\mathrm{Cov}(Y)\,Q')$, so that the covariance

matrix is:

\[
\begin{aligned}
Q\,\mathrm{Cov}(Y)\,Q' &= Q\left(I - \begin{pmatrix}\sqrt{p_1} \\ \vdots \\ \sqrt{p_k}\end{pmatrix}\begin{pmatrix}\sqrt{p_1} & \cdots & \sqrt{p_k}\end{pmatrix}\right)Q' \\
&= QQ' - \left(Q\begin{pmatrix}\sqrt{p_1} \\ \vdots \\ \sqrt{p_k}\end{pmatrix}\right)\left(Q\begin{pmatrix}\sqrt{p_1} \\ \vdots \\ \sqrt{p_k}\end{pmatrix}\right)' \\
&= I - \begin{pmatrix}1 \\ 0 \\ \vdots \\ 0\end{pmatrix}\begin{pmatrix}1 & 0 & \cdots & 0\end{pmatrix}
= \begin{pmatrix} 0 & 0 \\ 0 & I_{k-1} \end{pmatrix}
\end{aligned}
\]

Thus, we see that $QY$ can be written $(0, X_2, \ldots, X_k)'$, where the $X_i$ are independent, approximately standard normal variables. Thus, $\|Y\|^2 = \|QY\|^2 = \sum_{i=2}^{k} X_i^2$ is (approximately) the sum of the squares of $k-1$ independent standard normal random variables and therefore is (approximately) $\chi^2$ with $k-1$ degrees of freedom. Thus, the chi-square statistic has a limiting $\chi^2$ distribution with $k-1$ degrees of freedom.

28 The limiting normal distribution of quantile statistics (28)

Let $F(x)$ be a cumulative distribution function, and let $F'(x)$ exist. Let $\xi_p$ be the $p$th quantile of $F(x)$ (that is, $F(\xi_p) = p$). Assume $F'(\xi_p) > 0$. Let $r = r(n)$ be the index of an order statistic, $X_{(r)}$, such that $\lim_{n\to\infty}\sqrt{n}\left(\frac{r(n)}{n} - p\right) = 0$. We will show that $\sqrt{n}(X_{(r)} - \xi_p)$ has a limiting distribution which is normal with mean $0$ and variance $\frac{p(1-p)}{(F'(\xi_p))^2}$.

Consider the cumulative distribution of $\sqrt{n}(X_{(r)} - \xi_p)$:

\[
\begin{aligned}
P(\sqrt{n}(X_{(r)} - \xi_p) \le y) &= P\left(X_{(r)} \le \frac{y}{\sqrt{n}} + \xi_p\right) \\
&= \sum_{j=r}^{n} \binom{n}{j}\left(F\left(\frac{y}{\sqrt{n}} + \xi_p\right)\right)^j\left(1 - F\left(\frac{y}{\sqrt{n}} + \xi_p\right)\right)^{n-j} \\
&= P\left(\text{at least } r \text{ successes in } n \text{ trials with probability } F\left(\tfrac{y}{\sqrt{n}} + \xi_p\right) \text{ of success}\right) \\
&= 1 - P(\text{fewer than } r \text{ successes})
\end{aligned}
\]
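The key identity in this display — that $\{X_{(r)} \le t\}$ is exactly the event of at least $r$ successes in $n$ Bernoulli trials with success probability $F(t)$ — can be checked by simulation. A sketch for uniform observations, where $F(t) = t$ (the particular $n$, $r$, and $t$ below are arbitrary choices):

```python
import math
import numpy as np

n, r, t = 9, 5, 0.6          # arbitrary: sample size, order-statistic index, threshold
F_t = t                      # Uniform(0,1) cdf: F(t) = t

# Exact binomial tail: P(at least r of the n observations are <= t).
tail = sum(math.comb(n, j) * F_t**j * (1 - F_t)**(n - j)
           for j in range(r, n + 1))

# Monte Carlo estimate of P(X_(r) <= t) from sorted uniform samples.
rng = np.random.default_rng(1)
samples = np.sort(rng.uniform(size=(100_000, n)), axis=1)
mc = np.mean(samples[:, r - 1] <= t)   # X_(r) is the r-th smallest (index r-1)

assert abs(tail - mc) < 0.01
```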

According to the normal approximation to the binomial distribution, a binomial distribution with $n$ trials and probability $\theta$ of success is approximately normal with mean $n\theta$ and variance $n\theta(1-\theta)$. In this case, $\theta = F(\frac{y}{\sqrt{n}} + \xi_p)$ and we find:

\[
\lim_{n\to\infty} P(\sqrt{n}(X_{(r)} - \xi_p) \le y) = 1 - \Phi\left(\lim_{n\to\infty}\frac{r - nF(\frac{y}{\sqrt{n}} + \xi_p)}{\sqrt{nF(\frac{y}{\sqrt{n}} + \xi_p)\left(1 - F(\frac{y}{\sqrt{n}} + \xi_p)\right)}}\right)
\]

However, we must show that $\lim_{n\to\infty}\frac{r - nF(\frac{y}{\sqrt{n}} + \xi_p)}{\sqrt{nF(\frac{y}{\sqrt{n}} + \xi_p)(1 - F(\frac{y}{\sqrt{n}} + \xi_p))}}$ exists.

Case 1: Uniform distribution.

In the case of the uniform distribution:

\[
F(x) = x, \quad x \in [0,1], \qquad \xi_p = p, \qquad F\left(\xi_p + \frac{y}{\sqrt{n}}\right) = \xi_p + \frac{y}{\sqrt{n}} = p + \frac{y}{\sqrt{n}}, \qquad F'(\xi_p) = 1
\]

Substituting these values into the limit, we find:

\[
\begin{aligned}
\lim_{n\to\infty}\frac{r - nF(\frac{y}{\sqrt{n}} + \xi_p)}{\sqrt{nF(\frac{y}{\sqrt{n}} + \xi_p)(1 - F(\frac{y}{\sqrt{n}} + \xi_p))}}
&= \lim_{n\to\infty}\frac{r - n(p + \frac{y}{\sqrt{n}})}{\sqrt{n(p + \frac{y}{\sqrt{n}})(1 - p - \frac{y}{\sqrt{n}})}} \\
&= \lim_{n\to\infty}\frac{r - np - y\sqrt{n}}{\sqrt{np(1-p)}} \\
&= \frac{1}{\sqrt{p(1-p)}}\lim_{n\to\infty}\left(\frac{r - np}{\sqrt{n}} - y\right) \\
&= \frac{1}{\sqrt{p(1-p)}}\lim_{n\to\infty}\left(\sqrt{n}\left(\frac{r}{n} - p\right) - y\right) \\
&= \frac{-y}{\sqrt{p(1-p)}}
\end{aligned}
\]

(Note: $(p + \frac{y}{\sqrt{n}})(1 - p - \frac{y}{\sqrt{n}})$ converges to $p(1-p)$, so the denominator may be replaced by $\sqrt{np(1-p)}$ in the limit; the final step uses the assumption that $\sqrt{n}(\frac{r}{n} - p) \to 0$.)

Applying the normal approximation, we find:

\[
P(\sqrt{n}(X_{(r)} - \xi_p) \le y) = 1 - \Phi\left(\frac{-y}{\sqrt{p(1-p)}}\right) = \Phi\left(\frac{y}{\sqrt{p(1-p)}}\right)
\]

Case 2: The general case.

By the Probability Integral Transformation, if $F(x)$ is a continuous cumulative distribution function and $U$ is a random variable uniformly distributed on $[0,1]$, then the random variable $F^{-1}(U)$ has cumulative distribution function $F(x)$. Conversely, if $X$ is a random variable with cumulative distribution function $F(x)$, then the random variable $F(X)$ is uniformly distributed on $[0,1]$. Since $F(x)$ is a non-decreasing function (and is increasing in a neighborhood of $\xi_p$),

\[
P\left(X_{(r)} \le \xi_p + \frac{y}{\sqrt{n}}\right) = P\left(F(X_{(r)}) \le F\left(\xi_p + \frac{y}{\sqrt{n}}\right)\right)
\]

Applying a Taylor expansion about $\xi_p$, we find:

\[
P\left(F(X_{(r)}) \le F\left(\xi_p + \frac{y}{\sqrt{n}}\right)\right) = P\left(F(X_{(r)}) \le F(\xi_p) + F'(\xi_p)\frac{y}{\sqrt{n}} + \varepsilon(y)\right)
\]

\[
= P\left(\sqrt{n}(F(X_{(r)}) - F(\xi_p)) \le F'(\xi_p)y + \varepsilon(y)\sqrt{n}\right)
\]

($\varepsilon(y)$ is a smaller-order function containing the rest of the terms in the Taylor expansion; it is negligible.) Notice that $F(X_{(r)}) = U_{(r)}$, where $U_{(r)}$ is the $r$th order statistic of a sample of $n$ uniform random variables on $[0,1]$. Also, recall that $F(\xi_p) = p$. Substituting these facts, applying Case 1, and taking the limit, we find:

\[
P\left(X_{(r)} \le \xi_p + \frac{y}{\sqrt{n}}\right) = P\left(\sqrt{n}(U_{(r)} - p) \le F'(\xi_p)y + \varepsilon(y)\sqrt{n}\right) = \Phi\left(\frac{F'(\xi_p)y}{\sqrt{p(1-p)}}\right)
\]

Thus,

\[
P(\sqrt{n}(X_{(r)} - \xi_p) \le y) = P\left(X_{(r)} \le \xi_p + \frac{y}{\sqrt{n}}\right) = \Phi\left(\frac{F'(\xi_p)y}{\sqrt{p(1-p)}}\right)
\]

which is equivalent to $\sqrt{n}(X_{(r)} - \xi_p) \sim \text{Normal}\left(0, \frac{p(1-p)}{(F'(\xi_p))^2}\right)$.

29 Propagation of Errors (29)

Theorem 4 Let $\{X_n\}$ be a sequence of random variables. Let $\{a_n\}$ be a sequence of constants such that $a_n > 0$ for all $n$ and $\lim_{n\to\infty} a_n = 0$. Let $\mu$ be fixed. If $\frac{X_n - \mu}{a_n}$ has a limiting Normal$(0, \sigma^2)$ distribution, then for any continuously differentiable function $f: \mathbb{R} \to \mathbb{R}$, $\frac{f(X_n) - f(\mu)}{a_n}$ has a limiting Normal$(0, \sigma^2(f'(\mu))^2)$ distribution.

Proof. For every $\varepsilon > 0$, $P(|X_n - \mu| > \varepsilon) = P\left(\left|\frac{X_n - \mu}{a_n}\right| > \frac{\varepsilon}{a_n}\right)$. Since $\varepsilon$ is fixed, $\lim_{n\to\infty}\frac{\varepsilon}{a_n} = \infty$, so that $\lim_{n\to\infty} P(|X_n - \mu| > \varepsilon) = \lim_{n\to\infty} P\left(\left|\frac{X_n - \mu}{a_n}\right| > \frac{\varepsilon}{a_n}\right) = 0$, since $\frac{\varepsilon}{a_n}$ diverges while $\frac{X_n - \mu}{a_n}$ converges in distribution. Thus, $X_n$ converges in probability to $\mu$. By the Taylor expansion, $f(X_n) - f(\mu) \approx f'(\mu)(X_n - \mu)$ within $\varepsilon$ of $\mu$. Since $\frac{X_n - \mu}{a_n}$ has a limiting Normal$(0, \sigma^2)$ distribution, $f'(\mu)\frac{X_n - \mu}{a_n} = \frac{f(X_n) - f(\mu)}{a_n}$ has a normal limiting distribution with mean $f'(\mu) \cdot 0 = 0$ and variance $(f'(\mu))^2\sigma^2$.
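A simulation sketch of Theorem 4 (the choices below — Exponential(1) data, $a_n = 1/\sqrt{n}$, $f = \log$ — are illustrative assumptions, not from the notes): here $\mu = \sigma^2 = 1$ and $f'(\mu) = 1$, so $\sqrt{n}(\log\bar{X}_n - \log\mu)$ should have variance near $\sigma^2(f'(\mu))^2 = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1_000, 5_000

# X_n = mean of n Exponential(1) draws, so mu = 1 and sigma^2 = 1.
xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# sqrt(n) * (f(X_n) - f(mu)) with f = log and f(mu) = log(1) = 0.
transformed = np.sqrt(n) * np.log(xbar)

# Limiting variance should be sigma^2 * (f'(mu))^2 = 1.
assert 0.9 < transformed.var() < 1.1
assert abs(transformed.mean()) < 0.1
```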

Theorem 5 Let $\{X_n\}$ be a sequence of random vectors. Let $\{a_n\}$ be a sequence of constants such that $a_n > 0$ for all $n$ and $\lim_{n\to\infty} a_n = 0$. Let $\mu$ be a fixed vector. If $\frac{1}{a_n}(X_n - \mu)$ has a limiting Normal$(0, \Sigma)$ distribution, then for any smooth function $f: \mathbb{R}^k \to \mathbb{R}$, $\frac{1}{a_n}(f(X_n) - f(\mu))$ has the limiting distribution Normal$(0, ((\nabla f)(\mu))'\Sigma((\nabla f)(\mu)))$.

Proof. Because $\frac{1}{a_n}(X_n - \mu)$ has a limiting distribution, $X_n$ converges in probability to $\mu$. By the vector form of Taylor's expansion, $f(X_n) - f(\mu) \approx ((\nabla f)(\mu))'(X_n - \mu)$. Since $\frac{1}{a_n}(X_n - \mu)$ has a limiting Normal$(0, \Sigma)$ distribution, $\frac{1}{a_n}((\nabla f)(\mu))'(X_n - \mu) = \frac{1}{a_n}(f(X_n) - f(\mu))$ has a limiting normal distribution with mean $((\nabla f)(\mu))' 0 = 0$ and variance $((\nabla f)(\mu))'\Sigma((\nabla f)(\mu))$.

30 Distribution of the sample correlation coefficient (30)

Definition 6 Let $X$ and $Y$ be bivariate normal random variables with correlation coefficient $\rho$. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample of pairs from this distribution. Let

\[
r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\left(\sum_{i=1}^{n}(X_i - \bar{X})^2\right)\left(\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right)}}
\]

We call $r$ the sample correlation coefficient.

Lemma 7 $\rho$ and $r$ are invariant under positive linear transformations, $U = aX + b$ and $V = cY + d$, with $a > 0$, $c > 0$.

Proof. Let $\rho(U, V)$ be the correlation coefficient of $U$ and $V$ and $\rho(X, Y)$ be the correlation coefficient of $X$ and $Y$. Then,

\[
\rho(X, Y) = E\left(\frac{X - E(X)}{\sqrt{\mathrm{Var}(X)}} \cdot \frac{Y - E(Y)}{\sqrt{\mathrm{Var}(Y)}}\right)
\]

\[
E(U) = aE(X) + b, \qquad E(V) = cE(Y) + d, \qquad \mathrm{Var}(U) = a^2\mathrm{Var}(X), \qquad \mathrm{Var}(V) = c^2\mathrm{Var}(Y)
\]

\[
\begin{aligned}
\rho(U, V) &= E\left(\frac{U - E(U)}{\sqrt{\mathrm{Var}(U)}} \cdot \frac{V - E(V)}{\sqrt{\mathrm{Var}(V)}}\right) \\
&= E\left(\frac{aX + b - (aE(X) + b)}{\sqrt{a^2\mathrm{Var}(X)}} \cdot \frac{cY + d - (cE(Y) + d)}{\sqrt{c^2\mathrm{Var}(Y)}}\right) \\
&= E\left(\frac{X - E(X)}{\sqrt{\mathrm{Var}(X)}} \cdot \frac{Y - E(Y)}{\sqrt{\mathrm{Var}(Y)}}\right) \\
&= \rho(X, Y)
\end{aligned}
\]

We do the same for $r(X, Y)$ and $r(U, V)$:

\[
r(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\left(\sum_{i=1}^{n}(X_i - \bar{X})^2\right)\left(\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right)}}
\]

\[
\bar{U} = a\bar{X} + b, \qquad \bar{V} = c\bar{Y} + d
\]

\[
\sum_{i=1}^{n}(U_i - \bar{U})^2 = \sum_{i=1}^{n}(aX_i + b - (a\bar{X} + b))^2 = a^2\sum_{i=1}^{n}(X_i - \bar{X})^2
\]

\[
\sum_{i=1}^{n}(V_i - \bar{V})^2 = \sum_{i=1}^{n}(cY_i + d - (c\bar{Y} + d))^2 = c^2\sum_{i=1}^{n}(Y_i - \bar{Y})^2
\]

\[
\sum_{i=1}^{n}(U_i - \bar{U})(V_i - \bar{V}) = \sum_{i=1}^{n}(aX_i + b - (a\bar{X} + b))(cY_i + d - (c\bar{Y} + d)) = ac\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})
\]

\[
\begin{aligned}
r(U, V) &= \frac{\sum_{i=1}^{n}(U_i - \bar{U})(V_i - \bar{V})}{\sqrt{\left(\sum_{i=1}^{n}(U_i - \bar{U})^2\right)\left(\sum_{i=1}^{n}(V_i - \bar{V})^2\right)}} \\
&= \frac{ac\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\left(a^2\sum_{i=1}^{n}(X_i - \bar{X})^2\right)\left(c^2\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right)}} \\
&= \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\left(\sum_{i=1}^{n}(X_i - \bar{X})^2\right)\left(\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right)}} \\
&= r(X, Y)
\end{aligned}
\]
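Lemma 7 is easy to confirm numerically; a sketch with an arbitrary made-up sample and arbitrary positive linear transformations:

```python
import numpy as np

def sample_corr(x, y):
    """The sample correlation coefficient r defined above."""
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# Arbitrary made-up sample.
x = np.array([1.0, 2.0, 4.0, 3.5, 5.0, 0.5])
y = np.array([2.1, 2.9, 5.2, 4.0, 6.1, 1.2])

# Positive linear transformations U = aX + b, V = cY + d with a, c > 0.
u, v = 2.0 * x + 3.0, 0.5 * y - 1.0

assert np.isclose(sample_corr(x, y), sample_corr(u, v))
```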

Theorem 8 As $n \to \infty$, $\sqrt{n}(r - \rho)$ has a limiting Normal$(0, (1 - \rho^2)^2)$ distribution.

Proof. Because $\rho$ and the distribution of $r$ are invariant under positive linear transformations, given any bivariate normal random variables, we may subtract their means and divide by their standard deviations without affecting $\rho$ or $r$. Thus, without loss of generality, we may prove this theorem for standard bivariate normal random variables, $X$ and $Y$, with correlation coefficient $\rho$.

Lemma 10 For standard bivariate normals, $r$ has the same distribution as $\frac{\sum_{i=2}^{n} V_iW_i}{\sqrt{\left(\sum_{i=2}^{n} V_i^2\right)\left(\sum_{i=2}^{n} W_i^2\right)}}$, where $V$ and $W$ are standard bivariate normals with the same correlation coefficient.

Proof. Let $Q$ be the orthonormal transformation with first row $\left(\frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}}\right)$. Then, the first elements of $V = QX$ and $W = QY$ are $\sqrt{n}\bar{X}$ and $\sqrt{n}\bar{Y}$ respectively. Because orthonormal transformations preserve inner products,

\[
\sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n}X_i^2 - n\bar{X}^2 = \|V\|^2 - V_1^2 = \sum_{i=2}^{n}V_i^2,
\qquad
\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=2}^{n}W_i^2
\]

\[
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n}X_iY_i - n\bar{X}\bar{Y} = \langle V, W\rangle - V_1W_1 = \sum_{i=2}^{n}V_iW_i
\]

Because $Q$ is an orthogonal transformation, it is a rotation of $\mathbb{R}^n$. Therefore, $V$ and $W$ have the same correlation as $X$ and $Y$.

Define $f(u_1, u_2, u_3) = \frac{u_3}{\sqrt{u_1u_2}}$. Then, $r = f\left(\sum_{i=1}^{n}(X_i - \bar{X})^2, \sum_{i=1}^{n}(Y_i - \bar{Y})^2, \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})\right) = f\left(\sum_{i=2}^{n}V_i^2, \sum_{i=2}^{n}W_i^2, \sum_{i=2}^{n}V_iW_i\right)$. Define independent random vectors $Z_i = (V_i^2, W_i^2, V_iW_i)'$. Since $V_i$ and $W_i$ are standard bivariate normal, $E(V_i^2) = E(W_i^2) = 1$ and $E(V_iW_i) = \rho$. Thus, $E(Z_i) = (1, 1, \rho)'$. It can be shown that

\[
\mathrm{Cov}(Z_i) = 2\begin{pmatrix} 1 & \rho^2 & \rho \\ \rho^2 & 1 & \rho \\ \rho & \rho & \frac{1+\rho^2}{2} \end{pmatrix}
\]

Thus, by the Multivariate Central Limit Theorem, $\frac{1}{\sqrt{n-1}}\sum_{i=2}^{n}\left(Z_i - (1, 1, \rho)'\right)$ has a limiting Normal$(0, \mathrm{Cov}(Z_i))$ distribution. By the Propagation of Errors Theorem, $\sqrt{n-1}\left(f\left(\frac{1}{n-1}\sum_{i=2}^{n}V_i^2, \frac{1}{n-1}\sum_{i=2}^{n}W_i^2, \frac{1}{n-1}\sum_{i=2}^{n}V_iW_i\right) - f(1, 1, \rho)\right)$ has a limiting Normal$(0, ((\nabla f)(1,1,\rho))'\,\mathrm{Cov}(Z_i)\,((\nabla f)(1,1,\rho)))$ distribution. We calculate:

\[
\nabla f(u_1, u_2, u_3) = \begin{pmatrix} -\frac{1}{2}\frac{u_3}{\sqrt{u_2}}u_1^{-3/2} \\[2pt] -\frac{1}{2}\frac{u_3}{\sqrt{u_1}}u_2^{-3/2} \\[2pt] \frac{1}{\sqrt{u_1u_2}} \end{pmatrix},
\qquad
(\nabla f)(1, 1, \rho) = \begin{pmatrix} -\frac{1}{2}\rho \\ -\frac{1}{2}\rho \\ 1 \end{pmatrix}
\]

\[
\begin{aligned}
((\nabla f)(1,1,\rho))'\,\mathrm{Cov}(Z_i)\,((\nabla f)(1,1,\rho))
&= 2\begin{pmatrix} -\frac{1}{2}\rho & -\frac{1}{2}\rho & 1 \end{pmatrix}
\begin{pmatrix} 1 & \rho^2 & \rho \\ \rho^2 & 1 & \rho \\ \rho & \rho & \frac{1+\rho^2}{2} \end{pmatrix}
\begin{pmatrix} -\frac{1}{2}\rho \\ -\frac{1}{2}\rho \\ 1 \end{pmatrix} \\
&= 2\begin{pmatrix} -\frac{1}{2}\rho & -\frac{1}{2}\rho & 1 \end{pmatrix}
\begin{pmatrix} \frac{1}{2}\rho - \frac{1}{2}\rho^3 \\[2pt] \frac{1}{2}\rho - \frac{1}{2}\rho^3 \\[2pt] \frac{1-\rho^2}{2} \end{pmatrix} \\
&= \rho^4 - 2\rho^2 + 1 = (1 - \rho^2)^2
\end{aligned}
\]

Since $f$ is invariant under a common rescaling of its arguments, $f\left(\frac{1}{n-1}\sum_{i=2}^{n}V_i^2, \frac{1}{n-1}\sum_{i=2}^{n}W_i^2, \frac{1}{n-1}\sum_{i=2}^{n}V_iW_i\right) = r$, and $\sqrt{n}$ and $\sqrt{n-1}$ are asymptotically equivalent. Thus, $\sqrt{n}(r - \rho)$ has a limiting distribution which is Normal$(0, (1 - \rho^2)^2)$.

31 A variance-stabilizing transformation (31)

Recall that $\sqrt{n}(r - \rho)$ has a limiting Normal$(0, (1 - \rho^2)^2)$ distribution. Consider $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$. Then, $\tanh^{-1}(x) = \frac{1}{2}\ln\left(\frac{1+x}{1-x}\right) = \frac{1}{2}(\ln(1+x) - \ln(1-x))$. Set $f(x) = \tanh^{-1}(x)$. Then, $f'(x) = \frac{1}{2}\left(\frac{1}{1+x} + \frac{1}{1-x}\right) = \frac{1}{1-x^2}$. Applying the Propagation of Errors Theorem, we find that $\sqrt{n}(f(r) - f(\rho))$ has a limiting normal distribution with mean $0$ and variance $(1 - \rho^2)^2\left(\frac{1}{1-\rho^2}\right)^2 = 1$. Thus, $\sqrt{n}(\tanh^{-1}(r) - \tanh^{-1}(\rho))$ has a limiting standard normal distribution. (This allows us to construct confidence intervals for $\tanh^{-1}(\rho)$ and then take the hyperbolic tangent of the endpoints to find confidence intervals for $\rho$.)
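Both variance computations — the quadratic form $((\nabla f)(1,1,\rho))'\,\mathrm{Cov}(Z_i)\,((\nabla f)(1,1,\rho)) = (1-\rho^2)^2$ from the previous section and its stabilization to $1$ by $\tanh^{-1}$ — can be checked numerically over a grid of $\rho$ (a sketch):

```python
import numpy as np

for rho in np.linspace(-0.9, 0.9, 19):
    # Cov(Z_i) as stated in the sample-correlation proof.
    cov = 2.0 * np.array([[1.0,    rho**2, rho],
                          [rho**2, 1.0,    rho],
                          [rho,    rho,    (1 + rho**2) / 2.0]])
    grad = np.array([-rho / 2.0, -rho / 2.0, 1.0])   # (grad f)(1, 1, rho)

    # Limiting variance of sqrt(n)(r - rho).
    var_r = grad @ cov @ grad
    assert np.isclose(var_r, (1 - rho**2) ** 2)

    # d/dx tanh^{-1}(x) = 1/(1 - x^2), so the propagated variance is 1.
    fprime = 1.0 / (1 - rho**2)
    assert np.isclose(var_r * fprime**2, 1.0)

# arctanh agrees with the logarithmic form used above.
x = 0.37
assert np.isclose(np.arctanh(x), 0.5 * np.log((1 + x) / (1 - x)))
```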
