Mathematical Statistics (NYU, Spring 2003) Summary (answers to the professor's potential exam questions) By Rebecca Sela
1 Sufficient statistic theorem (1)
Let $X_1, \ldots, X_n$ be a sample from the distribution $f(x,\theta)$. Let $T(X_1,\ldots,X_n)$ be a sufficient statistic for $\theta$ with continuous factor function $F(T(X_1,\ldots,X_n),\theta)$. Then,
$$
\begin{aligned}
P(X \in A \mid T(X) = t) &= \lim_{h \to 0} P(X \in A \mid |T(X) - t| \le h) \\
&= \lim_{h \to 0} \frac{P(X \in A,\, |T(X) - t| \le h)/h}{P(|T(X) - t| \le h)/h} \\
&= \frac{\frac{d}{dt} P(X \in A,\, T(X) \le t)}{\frac{d}{dt} P(T(X) \le t)}
\end{aligned}
$$
Consider first the numerator:
$$
\begin{aligned}
\frac{d}{dt} P(X \in A,\, T(X) \le t) &= \frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} f(x_1,\theta) \cdots f(x_n,\theta)\, dx_1 \ldots dx_n \\
&= \frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} F(T(x),\theta)\, h(x)\, dx_1 \ldots dx_n \\
&= \lim_{h \to 0} \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} F(T(x),\theta)\, h(x)\, dx_1 \ldots dx_n
\end{aligned}
$$
Since $\min_{s \in [t,t+h]} F(s,\theta) \le F(T(x),\theta) \le \max_{s \in [t,t+h]} F(s,\theta)$ on the interval $[t, t+h]$, we find:
$$
\begin{aligned}
\lim_{h \to 0} \Bigl(\min_{s \in [t,t+h]} F(s,\theta)\Bigr) \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} h(x)\, dx
&\le \lim_{h \to 0} \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} F(T(x),\theta)\, h(x)\, dx \\
&\le \lim_{h \to 0} \Bigl(\max_{s \in [t,t+h]} F(s,\theta)\Bigr) \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} h(x)\, dx
\end{aligned}
$$
By the continuity of $F(t,\theta)$, $\lim_{h \to 0} \min_{s \in [t,t+h]} F(s,\theta) = \lim_{h \to 0} \max_{s \in [t,t+h]} F(s,\theta) = F(t,\theta)$. Thus,
$$
\begin{aligned}
\lim_{h \to 0} \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} F(T(x),\theta)\, h(x)\, dx_1 \ldots dx_n
&= F(t,\theta) \lim_{h \to 0} \frac{1}{h} \int_{A \cap \{x : |T(x)-t| \le h\}} h(x)\, dx \\
&= F(t,\theta)\, \frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} h(x)\, dx
\end{aligned}
$$
If we let $A$ be all of $\mathbb{R}^n$, then we have the case of the denominator. Thus, we find:
$$
P(X \in A \mid T(X) = t) = \frac{F(t,\theta)\, \frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} h(x)\, dx}{F(t,\theta)\, \frac{d}{dt} \int_{\{x : T(x) \le t\}} h(x)\, dx} = \frac{\frac{d}{dt} \int_{A \cap \{x : T(x) \le t\}} h(x)\, dx}{\frac{d}{dt} \int_{\{x : T(x) \le t\}} h(x)\, dx}
$$
which is not a function of $\theta$. Thus, $P(X \in A \mid T(X) = t)$ does not depend on $\theta$ when $T(X)$ is a sufficient statistic.
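This defining property is easy to see numerically. The following is a minimal simulation sketch (the Bernoulli example and all numbers are our choice, not from the notes): for a Bernoulli sample, $T = \sum X_i$ is sufficient, so the conditional law of the sample given $T = t$ should not depend on $\theta$.

```python
# Sketch: conditional on T = sum(X_i) = t, the sample's distribution should
# be the same for every theta.  Here P(X_1 = 1 | T = t) should equal t/n.
import numpy as np

rng = np.random.default_rng(0)
n, t = 5, 2  # sample size and conditioning value T = t

for theta in (0.3, 0.7):
    samples = rng.binomial(1, theta, size=(200_000, n))
    conditioned = samples[samples.sum(axis=1) == t]
    # both runs should print approximately t/n = 0.4
    print(theta, conditioned[:, 0].mean())
```

Both values of $\theta$ give (up to simulation error) the same conditional probability $t/n$, as the theorem predicts.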
2 Examples of sufficient statistics (2)

2.1 Uniform

Suppose $f(x,\theta) = \frac{1}{\theta} I_{(0,\theta)}(x)$. Then,
$$
\prod f(x_i,\theta) = \frac{1}{\theta^n} \prod I_{(0,\theta)}(x_i) = \frac{1}{\theta^n} I_{(-\infty,\theta)}(\max x_i)\, I_{(0,\infty)}(\min x_i)
$$
Let $F(\max x_i,\theta) = \frac{1}{\theta^n} I_{(-\infty,\theta)}(\max x_i)$ and $h(x_1,\ldots,x_n) = I_{(0,\infty)}(\min x_i)$. This is a factorization of $\prod f(x_i,\theta)$, so $\max X_i$ is a sufficient statistic for the uniform distribution.

2.2 Binomial

Suppose $f(x,\theta) = \theta^x (1-\theta)^{1-x}$, $x = 0, 1$. Then, $\prod f(x_i,\theta) = \theta^{\sum x_i} (1-\theta)^{n - \sum x_i}$. Let $T(x_1,\ldots,x_n) = \sum x_i$, $F(t,\theta) = \theta^t (1-\theta)^{n-t}$, and $h(x_1,\ldots,x_n) = 1$. This is a factorization of $\prod f(x_i,\theta)$, which shows that $T(x_1,\ldots,x_n) = \sum x_i$ is a sufficient statistic.

2.3 Normal

Suppose $f(x,\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$. Then,
$$
\prod f(x_i,\mu,\sigma^2) = (2\pi\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}\sum (x_i-\mu)^2} = (2\pi\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}\sum (x_i-\bar{x})^2}\, e^{-\frac{n}{2\sigma^2}(\bar{x}-\mu)^2}
$$
since
$$
\begin{aligned}
\sum (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2 &= \sum (x_i^2 - 2 x_i \bar{x} + \bar{x}^2) + n(\bar{x}^2 - 2\mu\bar{x} + \mu^2) \\
&= \sum x_i^2 - 2\bar{x}(n\bar{x}) + n\bar{x}^2 + n\bar{x}^2 - 2n\mu\bar{x} + n\mu^2 \\
&= \sum x_i^2 - 2\mu \sum x_i + n\mu^2 \\
&= \sum (x_i^2 - 2\mu x_i + \mu^2) \\
&= \sum (x_i - \mu)^2
\end{aligned}
$$
Case 1: $\sigma^2$ unknown, $\mu$ known. Let $T(x_1,\ldots,x_n) = \sum (x_i-\mu)^2$, $F(t,\sigma^2) = (2\pi\sigma^2)^{-n/2} e^{-\frac{t}{2\sigma^2}}$, and $h(x_1,\ldots,x_n) = 1$. This is a factorization of $\prod f(x_i,\sigma^2)$.

Case 2: $\sigma^2$ known, $\mu$ unknown. Let $T(x_1,\ldots,x_n) = \bar{x}$, $F(t,\mu) = e^{-\frac{n}{2\sigma^2}(t-\mu)^2}$, and $h(x_1,\ldots,x_n) = (2\pi\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}\sum (x_i-\bar{x})^2}$. This is a factorization of $\prod f(x_i,\mu)$.

Case 3: $\mu$ unknown, $\sigma^2$ unknown. Let $T_1(x_1,\ldots,x_n) = \bar{x}$, $T_2(x_1,\ldots,x_n) = \sum (x_i-\bar{x})^2$, $F(t_1,t_2,\mu,\sigma^2) = (2\pi\sigma^2)^{-n/2} e^{-\frac{t_2}{2\sigma^2}} e^{-\frac{n}{2\sigma^2}(t_1-\mu)^2}$, and $h(x_1,\ldots,x_n) = 1$. This is a factorization.

3 Rao-Blackwell Theorem (3)
Let $X_1,\ldots,X_n$ be a sample from the distribution $f(x,\theta)$. Let $Y = Y(X_1,\ldots,X_n)$ be an unbiased estimator of $\theta$. Let $T = T(X_1,\ldots,X_n)$ be a sufficient statistic for $\theta$. Let $\varphi(t) = E(Y \mid T = t)$.

Lemma 1. $E(E(g(Y) \mid T)) = E(g(Y))$, for all functions $g$.

Proof.
$$
\begin{aligned}
E(E(g(Y) \mid T)) &= \int_{-\infty}^{\infty} E(g(Y) \mid T = t)\, f(t)\, dt \\
&= \int_{-\infty}^{\infty} \Bigl( \int_{-\infty}^{\infty} g(y)\, f(y \mid t)\, dy \Bigr) f(t)\, dt \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(y)\, f(y,t)\, dy\, dt \\
&= \int_{-\infty}^{\infty} g(y) \Bigl( \int_{-\infty}^{\infty} f(y,t)\, dt \Bigr) dy \\
&= \int_{-\infty}^{\infty} g(y)\, f(y)\, dy \\
&= E(g(Y))
\end{aligned}
$$
Step 1: $\varphi(t)$ does not depend on $\theta$.
$$
\varphi(t) = \int_{\mathbb{R}^n} y(x_1,\ldots,x_n)\, f(x_1,\ldots,x_n \mid T(x_1,\ldots,x_n) = t)\, dx_1 \ldots dx_n.
$$
Since $T(x_1,\ldots,x_n)$ is a sufficient statistic, $f(x_1,\ldots,x_n \mid T(x_1,\ldots,x_n) = t)$ does not depend on $\theta$. Since $y(x_1,\ldots,x_n)$ is an estimator, it is not a function of $\theta$. Thus, the integral of their product over $\mathbb{R}^n$ does not depend on $\theta$.

Step 2: $\varphi(t)$ is unbiased.
$$
E(\varphi(T)) = E(E(Y \mid T)) = E(Y) = \theta
$$
by the lemma above.

Step 3: $\mathrm{Var}(\varphi(T)) \le \mathrm{Var}(Y)$.
$$
\begin{aligned}
\mathrm{Var}(\varphi(T)) &= E(E(Y \mid T)^2) - E(E(Y \mid T))^2 \\
&\le E(E(Y^2 \mid T)) - E(Y)^2 \quad \text{(since $E(Y \mid T)^2 \le E(Y^2 \mid T)$ by Jensen's inequality)} \\
&= E(Y^2) - E(Y)^2 \\
&= \mathrm{Var}(Y)
\end{aligned}
$$
Thus, conditioning an unbiased estimator on the sufficient statistic gives a new unbiased estimator with variance at most that of the old estimator.
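A quick numerical sketch of this improvement (the Bernoulli setup and all numbers are our illustration, not part of the notes): for a Bernoulli($\theta$) sample, $Y = X_1$ is unbiased for $\theta$, $T = \sum X_i$ is sufficient, and $\varphi(T) = E(X_1 \mid T) = T/n = \bar{X}$.

```python
# Rao-Blackwell in action: Var(phi(T)) should be far below Var(Y).
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 10, 100_000
x = rng.binomial(1, theta, size=(reps, n))

y = x[:, 0]             # crude unbiased estimator Y = X_1
phi = x.mean(axis=1)    # conditioned on the sufficient statistic: T/n

print("Var(Y)      ", y.var())     # ~ theta(1-theta)   = 0.21
print("Var(phi(T)) ", phi.var())   # ~ theta(1-theta)/n = 0.021
```

Both estimators have mean $\theta$, but conditioning on $T$ cuts the variance by a factor of $n$.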
4 Some properties of the derivative of the log (4)
Let $X$ have the distribution function $f(x,\theta_0)$. Let $Y = \frac{\partial}{\partial\theta} \log f(X,\theta)\big|_{\theta=\theta_0}$. Notice that, by the chain rule, $\frac{\partial}{\partial\theta} \log f(x,\theta) = \frac{1}{f(x,\theta)} \bigl(\frac{\partial}{\partial\theta} f(x,\theta)\bigr)$. Using this fact, we find:
$$
\begin{aligned}
E(Y) &= E\Bigl(\frac{\partial}{\partial\theta} \log f(X,\theta)\Big|_{\theta=\theta_0}\Bigr) \\
&= \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} \log f(x,\theta)\Big|_{\theta=\theta_0}\, f(x,\theta_0)\, dx \\
&= \int_{-\infty}^{\infty} \frac{1}{f(x,\theta_0)} \Bigl(\frac{\partial}{\partial\theta} f(x,\theta)\Big|_{\theta=\theta_0}\Bigr) f(x,\theta_0)\, dx \\
&= \int_{-\infty}^{\infty} \frac{\partial}{\partial\theta} f(x,\theta)\Big|_{\theta=\theta_0}\, dx \\
&= \frac{\partial}{\partial\theta} \Bigl( \int_{-\infty}^{\infty} f(x,\theta)\, dx \Bigr)\Big|_{\theta=\theta_0} \\
&= \frac{\partial}{\partial\theta} (1)\Big|_{\theta=\theta_0} = 0
\end{aligned}
$$
$$
\begin{aligned}
\frac{\partial^2}{\partial\theta^2} \log f(x,\theta) &= \frac{\partial}{\partial\theta} \Bigl( \frac{1}{f(x,\theta)} \frac{\partial}{\partial\theta} f(x,\theta) \Bigr) \\
&= \frac{1}{f(x,\theta)^2} \Bigl( f(x,\theta)\, \frac{\partial^2}{\partial\theta^2} f(x,\theta) - \Bigl(\frac{\partial}{\partial\theta} f(x,\theta)\Bigr)^2 \Bigr) \\
&= \frac{1}{f(x,\theta)}\, \frac{\partial^2}{\partial\theta^2} f(x,\theta) - \frac{1}{f(x,\theta)^2} \Bigl(\frac{\partial}{\partial\theta} f(x,\theta)\Bigr)^2 \\
&= \frac{1}{f(x,\theta)}\, \frac{\partial^2}{\partial\theta^2} f(x,\theta) - \Bigl(\frac{\partial}{\partial\theta} \log f(x,\theta)\Bigr)^2
\end{aligned}
$$
$$
\begin{aligned}
E\Bigl(\frac{\partial^2}{\partial\theta^2} \log f(x,\theta)\Big|_{\theta=\theta_0}\Bigr)
&= \int_{-\infty}^{\infty} \frac{\partial^2}{\partial\theta^2} f(x,\theta)\Big|_{\theta=\theta_0}\, dx - E\Bigl(\Bigl(\frac{\partial}{\partial\theta} \log f(x,\theta)\Big|_{\theta=\theta_0}\Bigr)^2\Bigr) \\
&= \frac{\partial^2}{\partial\theta^2} \Bigl( \int_{-\infty}^{\infty} f(x,\theta)\, dx \Bigr)\Big|_{\theta=\theta_0} - E\Bigl(\Bigl(\frac{\partial}{\partial\theta} \log f(x,\theta)\Big|_{\theta=\theta_0}\Bigr)^2\Bigr) \\
&= \frac{\partial^2}{\partial\theta^2} (1)\Big|_{\theta=\theta_0} - E\Bigl(\Bigl(\frac{\partial}{\partial\theta} \log f(x,\theta)\Big|_{\theta=\theta_0}\Bigr)^2\Bigr) \\
&= -E\Bigl(\Bigl(\frac{\partial}{\partial\theta} \log f(x,\theta)\Big|_{\theta=\theta_0}\Bigr)^2\Bigr)
\end{aligned}
$$
Thus, the expected value of $Y$ is zero, and the variance of $Y$ is $E(Y^2) = -E\bigl(\frac{\partial^2}{\partial\theta^2} \log f(x,\theta)\big|_{\theta=\theta_0}\bigr)$, which is defined as the information function, $I(\theta_0)$.
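These two properties are easy to confirm by simulation. A hedged sketch (the Bernoulli model and numbers are ours): for Bernoulli($\theta$), the score is $x/\theta - (1-x)/(1-\theta)$ and $I(\theta) = 1/(\theta(1-\theta))$.

```python
# Check E(Y) = 0 and Var(Y) = I(theta) for the Bernoulli score function.
import numpy as np

rng = np.random.default_rng(0)
theta0 = 0.3
x = rng.binomial(1, theta0, size=1_000_000)

score = x / theta0 - (1 - x) / (1 - theta0)
print("mean of score:", score.mean())              # ~ 0
print("var of score: ", score.var())               # ~ 4.76
print("I(theta0):    ", 1 / (theta0 * (1 - theta0)))  # 4.76...
```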
5 The Cramer-Rao lower bound (5)
Let $T$ be an unbiased estimator of $\theta$ based on a sample $X = (X_1,\ldots,X_n)$ from the distribution $f(x,\theta)$. Then, $E(T) = \theta$. We take the derivative of this equation to find:
$$
\begin{aligned}
1 &= \frac{\partial}{\partial\theta} E(T) \\
&= \frac{\partial}{\partial\theta} \int_{\mathbb{R}^n} T(x)\, f(x,\theta)\, dx \\
&= \int_{\mathbb{R}^n} T(x)\, \frac{\partial}{\partial\theta} f(x,\theta)\, dx \\
&= \int_{\mathbb{R}^n} T(x) \Bigl(\frac{\partial}{\partial\theta} \log f(x,\theta)\Bigr) f(x,\theta)\, dx \\
&= E\Bigl(T(X)\, \frac{\partial}{\partial\theta} \log f(X,\theta)\Bigr) \\
&= E\Bigl(T(X)\, \frac{\partial}{\partial\theta} \log f(X,\theta)\Bigr) - c\, E\Bigl(\frac{\partial}{\partial\theta} \log f(X,\theta)\Bigr) \quad \text{(for any constant $c$, since the subtracted term is zero)} \\
&= E\Bigl((T(X) - c)\, \frac{\partial}{\partial\theta} \log f(X,\theta)\Bigr) \\
&= E\Bigl((T(X) - \theta)\, \frac{\partial}{\partial\theta} \log f(X,\theta)\Bigr) \quad \text{(taking $c = \theta$)}
\end{aligned}
$$
By the Cauchy-Schwarz inequality, $E(AB)^2 \le E(A^2)E(B^2)$. Squaring both sides of the equation above and applying this, we find:
$$
\begin{aligned}
1 &= E\Bigl((T(X) - \theta)\, \frac{\partial}{\partial\theta} \log f(X,\theta)\Bigr)^2 \\
&\le E\bigl((T(X) - \theta)^2\bigr)\, E\Bigl(\Bigl(\frac{\partial}{\partial\theta} \log f(X,\theta)\Bigr)^2\Bigr) \\
&= \mathrm{Var}(T)\, E\Bigl(\Bigl(\frac{\partial}{\partial\theta} \log f(X,\theta)\Bigr)^2\Bigr)
\end{aligned}
$$
Since the sample is independent and identically distributed,
$$
\Bigl(\frac{\partial}{\partial\theta} \log f(X,\theta)\Bigr)^2 = \Bigl(\frac{\partial}{\partial\theta} \log \prod_{i=1}^n f(x_i,\theta)\Bigr)^2 = \Bigl(\frac{\partial}{\partial\theta} \sum_{i=1}^n \log f(x_i,\theta)\Bigr)^2 = \Bigl(\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(x_i,\theta)\Bigr)^2
$$
$$
\begin{aligned}
E\Bigl(\Bigl(\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(x_i,\theta)\Bigr)^2\Bigr) &= \mathrm{Var}\Bigl(\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(x_i,\theta)\Bigr) \quad \text{(since the sum has mean zero)} \\
&= \sum_{i=1}^n \mathrm{Var}\Bigl(\frac{\partial}{\partial\theta} \log f(x_i,\theta)\Bigr) \\
&= n \cdot \mathrm{Var}\Bigl(\frac{\partial}{\partial\theta} \log f(x_i,\theta)\Bigr) \\
&= n \cdot E\Bigl(\Bigl(\frac{\partial}{\partial\theta} \log f(x_i,\theta)\Bigr)^2\Bigr) \\
&= n I(\theta)
\end{aligned}
$$
Thus, $1 \le \mathrm{Var}(T)\, (n I(\theta))$, and the variance of an unbiased estimator is at least $\frac{1}{n I(\theta)}$.
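The bound is attained in some models. A minimal numerical sketch (our example, not from the notes): for Poisson($\theta$), $I(\theta) = 1/\theta$, so the Cramer-Rao bound for unbiased estimators is $\theta/n$, and the sample mean attains it.

```python
# The sample mean of Poisson data attains the Cramer-Rao bound theta/n.
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 4.0, 25, 200_000
xbar = rng.poisson(theta, size=(reps, n)).mean(axis=1)

print("Var(xbar):         ", xbar.var())   # ~ 0.16
print("CR bound theta/n:  ", theta / n)    # 0.16
```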
6 Where the Cramer-Rao lower bound fails to hold (6)
Let $X_1,\ldots,X_n$ be a sample from the uniform distribution on $(0,\theta)$; that is, $f(x,\theta) = \frac{1}{\theta} I_{(0,\theta)}(x)$. Let $Y = \max X_i$. Then,
$$
P(Y \le y) = P(X_1 \le y, \ldots, X_n \le y) = \prod_{i=1}^n P(X_i \le y) = \frac{y^n}{\theta^n}\, I_{(0,\theta)}(y)
$$
$$
f(y) = \frac{d}{dy} P(Y \le y) = n\, \frac{y^{n-1}}{\theta^n}\, I_{(0,\theta)}(y)
$$
$$
E(Y) = \int_{-\infty}^{\infty} y f(y)\, dy = \int_0^{\theta} n\, \frac{y^n}{\theta^n}\, dy = \frac{n}{n+1}\,\theta
$$
$$
E(Y^2) = \int_{-\infty}^{\infty} y^2 f(y)\, dy = \int_0^{\theta} n\, \frac{y^{n+1}}{\theta^n}\, dy = \frac{n}{n+2}\,\theta^2
$$
$$
\mathrm{Var}(Y) = E(Y^2) - E(Y)^2 = \frac{n}{n+2}\,\theta^2 - \Bigl(\frac{n}{n+1}\,\theta\Bigr)^2 = \theta^2\, \frac{n^3 + 2n^2 + n - n^3 - 2n^2}{(n+2)(n+1)^2} = \theta^2\, \frac{n}{(n+2)(n+1)^2}
$$
Since $E(Y) = \frac{n}{n+1}\theta$, $E\bigl(\frac{n+1}{n} Y\bigr) = \theta$, and $\frac{n+1}{n} \max X_i$ is an unbiased estimator of $\theta$. The variance of this estimator is given by
$$
\mathrm{Var}\Bigl(\frac{n+1}{n}\, Y\Bigr) = \Bigl(\frac{n+1}{n}\Bigr)^2 \theta^2\, \frac{n}{(n+2)(n+1)^2} = \frac{\theta^2}{n(n+2)}
$$
which is of order $\frac{1}{n^2}$ (and would therefore violate the Cramer-Rao lower bound if it applied; the bound does not apply here because the support of $f(x,\theta)$ depends on $\theta$).
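A simulation sketch of this super-efficient rate (our numbers): the variance of $\frac{n+1}{n}\max X_i$ should match $\theta^2/(n(n+2))$ and shrink like $1/n^2$.

```python
# Variance of the rescaled maximum for Uniform(0, theta) data, versus the
# exact formula theta^2 / (n(n+2)).
import numpy as np

rng = np.random.default_rng(0)
theta, reps = 2.0, 200_000
for n in (10, 40):
    est = (n + 1) / n * rng.uniform(0, theta, size=(reps, n)).max(axis=1)
    print(n, est.var(), theta**2 / (n * (n + 2)))
```

Quadrupling $n$ cuts the variance by roughly a factor of 16, the $1/n^2$ signature.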
7 Maximum likelihood estimators for various distributions (7)

7.1 Normal

$$
\prod_{i=1}^n f(x_i,\mu,\sigma^2) = \Bigl(\frac{1}{2\pi\sigma^2}\Bigr)^{n/2} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2}
$$
$$
\log f(x_1,\ldots,x_n,\mu,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2
$$
$$
\frac{\partial}{\partial\mu} \log f(x_1,\ldots,x_n,\mu,\sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = \frac{n}{\sigma^2}(\bar{x}-\mu)
$$
$$
\frac{\partial}{\partial\sigma^2} \log f(x_1,\ldots,x_n,\mu,\sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i-\mu)^2 = \frac{1}{2(\sigma^2)^2}\Bigl(\sum_{i=1}^n (x_i-\mu)^2 - n\sigma^2\Bigr)
$$
If $\mu$ is unknown, then we set $\frac{\partial}{\partial\mu} \log f(x_1,\ldots,x_n,\mu,\sigma^2) = 0$ to find that $\bar{x}$ is the maximum likelihood estimator of $\mu$.

If $\sigma^2$ is unknown and $\mu$ is known, then we set $\frac{\partial}{\partial\sigma^2} \log f(x_1,\ldots,x_n,\mu,\sigma^2) = 0$ and solve to find that the maximum likelihood estimator of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2$.

If both $\mu$ and $\sigma^2$ are unknown, then we estimate $\mu$ using its maximum likelihood estimator $\bar{x}$ (which does not depend on $\sigma^2$). We use this estimate of $\mu$ to maximize with respect to $\sigma^2$ and find that the maximum likelihood estimator of $\sigma^2$ is $\frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})^2$.

7.2 Bernoulli

$$
f(x,\theta) = \theta^x (1-\theta)^{1-x}, \quad x = 0, 1
$$
$$
f(x_1,\ldots,x_n,\theta) = \theta^{\sum x_i} (1-\theta)^{n-\sum x_i}
$$
$$
\log f(x_1,\ldots,x_n,\theta) = \Bigl(\sum x_i\Bigr)\log\theta + \Bigl(n - \sum x_i\Bigr)\log(1-\theta)
$$
$$
\begin{aligned}
\frac{\partial}{\partial\theta} \log f(x_1,\ldots,x_n,\theta) &= \frac{\sum x_i}{\theta} - \frac{n - \sum x_i}{1-\theta} \\
&= \frac{(1-\theta)\sum x_i - \theta\bigl(n - \sum x_i\bigr)}{\theta(1-\theta)} \\
&= \frac{n(\bar{x}-\theta)}{\theta(1-\theta)}
\end{aligned}
$$
Setting the derivative equal to zero, we find that the maximum likelihood estimator of $\theta$ is $\bar{X}$.
7.3 Poisson

$$
f(x,\theta) = \frac{1}{x!}\,\theta^x e^{-\theta}, \quad x = 0, 1, 2, \ldots
$$
$$
f(x_1,\ldots,x_n,\theta) = e^{-n\theta}\, \theta^{\sum x_i} \prod \frac{1}{x_i!}
$$
$$
\log f(x_1,\ldots,x_n,\theta) = -n\theta + \sum x_i \log\theta - \sum \log(x_i!)
$$
$$
\frac{\partial}{\partial\theta} \log f(x_1,\ldots,x_n,\theta) = -n + \frac{1}{\theta}\sum x_i = \frac{n}{\theta}(\bar{x}-\theta)
$$
Thus, the maximum likelihood estimator of $\theta$ is $\bar{x}$.
7.4 Uniform

$$
f(x,\theta) = \frac{1}{\theta}\, I_{(0,\theta)}(x)
$$
$$
f(x_1,\ldots,x_n,\theta) = \frac{1}{\theta^n} \prod_{i=1}^n I_{(0,\theta)}(x_i) = \frac{1}{\theta^n}\, I_{(0,\theta)}(\max x_i)
$$
Since $\frac{1}{\theta^n}$ is strictly decreasing in $\theta$, we maximize it by choosing the smallest $\theta$ such that $\max X_i \le \theta$. That is, the maximum likelihood estimator of $\theta$ is $\max X_i$.
7.5 Gamma (with β unknown)

$$
f(x,\alpha,\beta) = \frac{1}{\beta^\alpha \Gamma(\alpha)}\, x^{\alpha-1} e^{-x/\beta}
$$
$$
f(x_1,\ldots,x_n,\alpha,\beta) = \Bigl(\frac{1}{\beta^\alpha \Gamma(\alpha)}\Bigr)^n \Bigl(\prod x_i\Bigr)^{\alpha-1} e^{-\sum x_i/\beta}
$$
$$
\log f(x_1,\ldots,x_n,\alpha,\beta) = -n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum \log x_i - \frac{\sum x_i}{\beta}
$$
$$
\frac{\partial}{\partial\beta} \log f(x_1,\ldots,x_n,\alpha,\beta) = -\frac{n\alpha}{\beta} + \frac{1}{\beta^2}\sum x_i = \frac{1}{\beta^2}\Bigl(\sum x_i - n\alpha\beta\Bigr)
$$
The derivative is $0$ for $\beta = \frac{\bar{x}}{\alpha}$. Thus, $\bar{x}/\alpha$ is the maximum likelihood estimator.
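A quick numerical check of the last closed form (our numbers; NumPy's gamma generator uses the same shape/scale convention as the density above):

```python
# With alpha known, the MLE of beta for Gamma(alpha, beta) data is xbar/alpha.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n = 3.0, 2.0, 100_000
x = rng.gamma(shape=alpha, scale=beta, size=n)

print("MLE xbar/alpha:", x.mean() / alpha)   # ~ 2.0 = beta
```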
8 The likelihood equation will have a solution (8)
Let $x_1,\ldots,x_n$ be a sample from the distribution $f(x,\theta_0)$. By the Law of Large Numbers, the sequence $\{\frac{1}{n}\sum_{i=1}^n Y_i\}$ converges to $E(Y_i)$ with probability 1. In particular, the sequence $\{\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(x_i,\theta)\big|_{\theta=\theta_0}\}$ converges to $E(\frac{\partial}{\partial\theta} \log f(x,\theta)\big|_{\theta=\theta_0}) = 0$, and the sequence $\{\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} \log f(x_i,\theta)\big|_{\theta=\theta_0}\}$ converges to $E(\frac{\partial^2}{\partial\theta^2} \log f(x,\theta)\big|_{\theta=\theta_0}) = -I(\theta_0)$, with probability 1. Let $g_n(\theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(x_i,\theta)$. Then, $\lim_{n\to\infty} g_n(\theta_0) = 0$ and $\lim_{n\to\infty} g_n'(\theta_0) = -I(\theta_0) \ne 0$.

Let $\varepsilon > 0$ be given. By the Taylor expansion,
$$
g_n(\theta) \approx g_n(\theta_0) + g_n'(\theta_0)(\theta - \theta_0), \quad |\theta - \theta_0| < \varepsilon
$$
For sufficiently large $n$, then,
$$
g_n(\theta) \approx -I(\theta_0)(\theta - \theta_0)
$$
Since $-I(\theta_0) < 0$ while $\theta - \theta_0$ is both positive and negative in the interval $(\theta_0 - \varepsilon, \theta_0 + \varepsilon)$, $g_n(\theta)$ must change sign in this interval as well. Since $g_n(\theta)$ is continuous, this means $\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(x_i,\theta) = g_n(\theta) = 0$ somewhere in this interval, and the likelihood equation has a solution in $(\theta_0 - \varepsilon, \theta_0 + \varepsilon)$ with probability one for large $n$.
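A sketch of this in practice (our illustration; the Cauchy location model is a standard case with no closed-form MLE, and we assume the root lies in the chosen bracket, which holds with high probability at this sample size):

```python
# For a Cauchy location model, g_n(theta) has no closed-form root, but for
# large n a root of the likelihood equation appears near theta_0.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
theta0, n = 1.0, 2_000
x = theta0 + rng.standard_cauchy(n)

def g_n(theta):
    # average score: (1/n) sum_i d/dtheta log f(x_i, theta)
    return np.mean(2 * (x - theta) / (1 + (x - theta) ** 2))

eps = 0.5
root = brentq(g_n, theta0 - eps, theta0 + eps)  # assumes a sign change in the bracket
print("root of likelihood equation:", root)     # close to theta0 = 1.0
```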
9 The limiting distribution of the maximum likelihood estimators (9)
Let $x_1,\ldots,x_n$ be a sample from the distribution $f(x,\theta_0)$. Recall that $E(\frac{\partial}{\partial\theta}\log f(x,\theta)\big|_{\theta=\theta_0}) = 0$ and that $\mathrm{Var}(\frac{\partial}{\partial\theta}\log f(x,\theta)\big|_{\theta=\theta_0}) = E((\frac{\partial}{\partial\theta}\log f(x,\theta)\big|_{\theta=\theta_0})^2) = I(\theta_0)$. Thus, by the Central Limit Theorem, $\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i,\theta)\big|_{\theta=\theta_0}$ has a normal limiting distribution with mean 0 and variance $I(\theta_0)$. By the Law of Large Numbers, since $E(\frac{\partial^2}{\partial\theta^2}\log f(x,\theta)\big|_{\theta=\theta_0}) = -I(\theta_0)$, $\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(x_i,\theta)\big|_{\theta=\theta_0} = -I(\theta_0)$ with probability one.

Consider $\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i,\theta)$. Using the Taylor expansion about $\theta_0$, we find:
$$
\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i,\theta) \approx \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big|_{\theta=\theta_0} + \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(x_i,\theta)\Big|_{\theta=\theta_0}\, (\theta - \theta_0)
$$
in a neighborhood of $\theta_0$. For large $n$, there is a solution to the likelihood equation in this neighborhood with probability one. Let $\hat\theta_n$ be this solution. Then, we have:
$$
\begin{aligned}
0 &= \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big|_{\theta=\hat\theta_n} \\
&\approx \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big|_{\theta=\theta_0} + \sqrt{n}\,(\hat\theta_n - \theta_0)\Bigl(\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(x_i,\theta)\Big|_{\theta=\theta_0}\Bigr)
\end{aligned}
$$
Substituting in the limit $\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(x_i,\theta)\big|_{\theta=\theta_0} = -I(\theta_0)$ and solving for $\sqrt{n}(\hat\theta_n - \theta_0)$, we find:
$$
\sqrt{n}\,(\hat\theta_n - \theta_0) \approx \frac{1}{I(\theta_0)}\Bigl(\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(x_i,\theta)\Big|_{\theta=\theta_0}\Bigr)
$$
The right-hand side is the product of $\frac{1}{I(\theta_0)}$ and a normal variable with mean 0 and variance $I(\theta_0)$; that is, a normal variable with mean 0 and variance $\frac{1}{I(\theta_0)^2}\, I(\theta_0) = \frac{1}{I(\theta_0)}$. Thus, $\sqrt{n}(\hat\theta_n - \theta_0) \sim N\bigl(0, \frac{1}{I(\theta_0)}\bigr)$.

10 Confidence Intervals for Various Distributions (10)

10.1 Normal, σ² known:

Let $X \sim \mathrm{Normal}(\mu,\sigma^2)$, with $\sigma^2$ known. Let $\alpha < 1$ be given (so that $1-\alpha$ is the confidence level). Set $Z = \frac{X-\mu}{\sigma}$. Then, $Z \sim \mathrm{Normal}(0,1)$. Choose $z_{\alpha/2}$ such that $\Phi(z_{\alpha/2}) = \int_{-\infty}^{z_{\alpha/2}} \phi(z)\, dz = 1 - \frac{\alpha}{2}$. Then, by symmetry:
$$
\begin{aligned}
1 - \alpha &= P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) \\
&= P\Bigl(-z_{\alpha/2} \le \frac{X-\mu}{\sigma} \le z_{\alpha/2}\Bigr) \\
&= P(-\sigma z_{\alpha/2} \le X - \mu \le \sigma z_{\alpha/2}) \\
&= P(X - z_{\alpha/2}\sigma \le \mu \le X + z_{\alpha/2}\sigma)
\end{aligned}
$$
and $(X - z_{\alpha/2}\sigma,\, X + z_{\alpha/2}\sigma)$ is a $1-\alpha$ confidence interval for $\mu$.

If we choose a sample of size $n$, then $\bar{X} \sim \mathrm{Normal}(\mu, \frac{\sigma^2}{n})$, and then
$$
\begin{aligned}
1 - \alpha &= P\Bigl(-z_{\alpha/2} \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\Bigr) \\
&= P\Bigl(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Bigr)
\end{aligned}
$$
so that $(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\, \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}})$ is a $1-\alpha$ confidence interval for $\mu$.
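A coverage check of this interval by simulation (our choice of numbers): with $\alpha = 0.05$, the interval should contain $\mu$ in about 95% of repeated samples.

```python
# Coverage of the known-sigma normal confidence interval.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 30, 100_000
z = norm.ppf(0.975)                      # z_{alpha/2} for alpha = 0.05

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
half = z * sigma / np.sqrt(n)
print("coverage:", np.mean((xbar - half <= mu) & (mu <= xbar + half)))  # ~ 0.95
```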
10.2 Normal, σ² unknown:

Let $X_1,\ldots,X_n$ be a sample from the $\mathrm{Normal}(\mu,\sigma^2)$ distribution. Then, $\bar{X} \sim \mathrm{Normal}(\mu, \frac{\sigma^2}{n})$ and, with $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$, the quantity $(n-1)\frac{s^2}{\sigma^2}$ is distributed $\chi^2$ with $n-1$ degrees of freedom, so that
$$
T = \frac{\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\Bigl((n-1)\dfrac{s^2}{\sigma^2}\Bigr)\big/(n-1)}} = \frac{\bar{X}-\mu}{s/\sqrt{n}}
$$
has a $t$-distribution with $n-1$ degrees of freedom. Choose $t_{\alpha/2,n-1}$ such that $P(T \le t_{\alpha/2,n-1}) = 1 - \frac{\alpha}{2}$. Then, by symmetry,
$$
\begin{aligned}
1 - \alpha &= P\Bigl(-t_{\alpha/2,n-1} \le \frac{\bar{X}-\mu}{s/\sqrt{n}} \le t_{\alpha/2,n-1}\Bigr) \\
&= P\Bigl(\bar{X} - t_{\alpha/2,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2,n-1}\frac{s}{\sqrt{n}}\Bigr)
\end{aligned}
$$
and $(\bar{X} - t_{\alpha/2,n-1}\frac{s}{\sqrt{n}},\, \bar{X} + t_{\alpha/2,n-1}\frac{s}{\sqrt{n}})$ is a $1-\alpha$ confidence interval for $\mu$. If $n$ is sufficiently large, then the $t$-distribution with $n-1$ degrees of freedom is approximately the standard normal distribution, and $t_{\alpha/2,n-1} \approx z_{\alpha/2}$.

10.3 Binomial:

If $X$ is a Bernoulli random variable (with probability $\theta$ of success) then, by the Central Limit Theorem, $\bar{X}$ is distributed approximately normal with mean $\theta$ and variance $\frac{\theta(1-\theta)}{n}$. Since $\theta$ is unknown, we must approximate the variance. Note that $\theta(1-\theta) \le 0.5(1-0.5) = 0.25$ for all values of $\theta$. Thus, $0.25/n$ is a conservative estimate of the variance, and we may use normal confidence intervals with this estimate of the variance.
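A hedged sketch of the conservative binomial interval (our numbers): bounding $\theta(1-\theta)$ by $0.25$ gives $\bar{X} \pm z_{\alpha/2} \cdot 0.5/\sqrt{n}$, whose coverage is at least $1-\alpha$.

```python
# Coverage of the conservative binomial confidence interval.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 100, 100_000
z = norm.ppf(0.975)

xbar = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)
half = z * 0.5 / np.sqrt(n)              # conservative: Var(xbar) <= 0.25/n
print("coverage:", np.mean((xbar - half <= theta) & (theta <= xbar + half)))
```

The printed coverage exceeds 0.95, as expected: the bound $0.25$ overstates the true variance whenever $\theta \ne 0.5$.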
10.4 Non-normal, large sample:

We may use the Central Limit Theorem and the fact that the maximum likelihood estimator is approximately normally distributed with mean $\theta$ and variance $\frac{1}{nI(\theta)}$ to construct an approximate confidence interval using the methods above. (We should estimate $I(\theta)$ by $I(\hat\theta)$ or use a conservative bound, as in the binomial case.)
11 The sum of normals squared is chi-squared (11)
Let $X_1,\ldots,X_n$ be a sample from a standard normal distribution. Consider the cumulative distribution function of $\sum_{i=1}^n X_i^2$:
$$
P\Bigl(\sum_{i=1}^n X_i^2 \le y\Bigr) = \int_{\{x_1,\ldots,x_n : \sum x_i^2 \le y\}} f(x_1,\ldots,x_n)\, dx_1 \ldots dx_n = \int_{\{x_1,\ldots,x_n : \sum x_i^2 \le y\}} (2\pi)^{-n/2} e^{-\frac{1}{2}\sum x_i^2}\, dx_1 \ldots dx_n
$$
Taking the derivative with respect to $y$, we find:
$$
f(y) = \frac{d}{dy} P\Bigl(\sum_{i=1}^n X_i^2 \le y\Bigr) = \frac{d}{dy} \int_{\{x : \sum x_i^2 \le y\}} (2\pi)^{-n/2} e^{-\frac{1}{2}\sum x_i^2}\, dx_1 \ldots dx_n
$$
Recall that $\frac{d}{dt} \int_{\{x : T(x) \le t\}} F(T(x),\theta)\, h(x)\, dx_1 \ldots dx_n = F(t,\theta)\, \frac{d}{dt} \int_{\{x : T(x) \le t\}} h(x)\, dx_1 \ldots dx_n$. Thus, we find:
$$
f(y) = (2\pi)^{-n/2} e^{-y/2}\, \frac{d}{dy} \int_{\{x : \sum x_i^2 \le y\}} dx_1 \ldots dx_n
$$
Notice that $\int_{\{x : \sum x_i^2 \le y\}} dx_1 \ldots dx_n$ is the volume of an $n$-ball of radius $\sqrt{y}$, which is proportional to $y^{n/2}$. Thus, $\frac{d}{dy} \int_{\{x : \sum x_i^2 \le y\}} dx_1 \ldots dx_n$ is proportional to $\frac{n}{2}\, y^{\frac{n}{2}-1}$. Thus, $f(y)$ is proportional to $e^{-y/2}\, y^{\frac{n}{2}-1}$, which is the $\mathrm{Gamma}(\frac{n}{2}, 2) = \chi^2_n$ density. Thus, $\sum_{i=1}^n X_i^2$ has a chi-squared distribution with $n$ degrees of freedom.
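A simulation sketch of this fact (our numbers): the sum of $n$ squared standard normals should match the chi-squared distribution with $n$ degrees of freedom.

```python
# Compare the simulated sum of squared normals to the chi2(n) distribution.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, reps = 5, 200_000
s = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)

print("simulated mean, var:", s.mean(), s.var())   # ~ n, 2n = 5, 10
print("chi2(n) mean, var:  ", chi2.mean(n), chi2.var(n))
```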
12 Independence of the estimated mean and standard deviation (12)

Let $X_1,\ldots,X_n$ be a sample from a $\mathrm{Normal}(\mu,\sigma^2)$ distribution. Set $Z_i = \frac{X_i - \mu}{\sigma}$. Then, $Z_1,\ldots,Z_n$ are distributed $\mathrm{Normal}(0,1)$. Let $U = \sqrt{n}\,\bar{Z}$ and $V = (n-1)s_Z^2 = \sum (Z_i - \bar{Z})^2$. The joint distribution function of these two variables is given by:
$$
F(u,v) = \int_{\{z_1,\ldots,z_n : \sqrt{n}\bar{z} \le u,\ \sum (z_i-\bar{z})^2 \le v\}} (2\pi)^{-n/2} e^{-\frac{1}{2}\sum z_i^2}\, dz_1 \ldots dz_n
$$
Let $P$ be any real orthonormal matrix with first row $(\frac{1}{\sqrt{n}},\ldots,\frac{1}{\sqrt{n}})$; such a matrix exists because this vector is of length one and we may apply the Gram-Schmidt orthogonalization. Set $Y = PZ$. Then, $Y_1 = \frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i = U$. Since orthogonal matrices preserve inner products, $\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n Z_i^2$, and
$$
V = \sum_{i=1}^n (Z_i - \bar{Z})^2 = \sum_{i=1}^n Z_i^2 - n\bar{Z}^2 = \sum_{i=1}^n Y_i^2 - Y_1^2 = \sum_{i=2}^n Y_i^2.
$$
Substituting $Y_1,\ldots,Y_n$ for $Z_1,\ldots,Z_n$ in the joint distribution function above, we find:
$$
\begin{aligned}
F(u,v) &= \int_{\{y_1,\ldots,y_n : y_1 \le u,\ \sum_{i=2}^n y_i^2 \le v\}} (2\pi)^{-n/2} e^{-\frac{1}{2}\sum_{i=1}^n y_i^2}\, dy_1 \ldots dy_n \\
&= (2\pi)^{-1/2}\Bigl(\int_{-\infty}^u e^{-\frac{1}{2}y_1^2}\, dy_1\Bigr) \cdot (2\pi)^{-(n-1)/2}\Bigl(\int_{\{y_2,\ldots,y_n : \sum_{i=2}^n y_i^2 \le v\}} e^{-\frac{1}{2}\sum_{i=2}^n y_i^2}\, dy_2 \ldots dy_n\Bigr)
\end{aligned}
$$
Thus, we see that the joint distribution function factors, meaning that $U$ and $V$ are independent. Furthermore, the factor containing $U = \sqrt{n}\bar{Z}$ is of the form of a standard normal random variable, and the factor containing $V = (n-1)s_Z^2$ is of the form of a $\chi^2$ random variable with $n-1$ degrees of freedom. Since
$$
U = \sqrt{n}\,\bar{Z} = \sqrt{n}\,\frac{\bar{x}-\mu}{\sigma} \quad \text{and} \quad V = (n-1)s_Z^2 = \sum\Bigl(\frac{X_i-\mu}{\sigma} - \frac{\bar{X}-\mu}{\sigma}\Bigr)^2 = \frac{\sum (X_i-\bar{X})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2},
$$
we have shown that $\sqrt{n}\,\frac{\bar{x}-\mu}{\sigma}$ and $\frac{(n-1)s^2}{\sigma^2}$ are independent, with the former normally distributed and the latter distributed chi-square.
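A numerical sketch of this normal-specific fact (our choice of distributions and numbers): for normal data, $\bar{X}$ and $s^2$ are independent, hence uncorrelated; for a skewed distribution they are not.

```python
# Correlation of the sample mean and sample variance: ~0 for normal data,
# clearly positive for skewed (exponential) data.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000

normal = rng.normal(0.0, 1.0, size=(reps, n))
print("normal corr:", np.corrcoef(normal.mean(1), normal.var(1, ddof=1))[0, 1])

skewed = rng.exponential(1.0, size=(reps, n))
print("expon. corr:", np.corrcoef(skewed.mean(1), skewed.var(1, ddof=1))[0, 1])
```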
13 The t-distribution (13)
Let $X$ be a standard normal random variable and $Y$ an independently distributed chi-square variable with $n$ degrees of freedom. Let $T = \frac{X}{\sqrt{Y/n}}$. Then $T$ is distributed with a $t$-distribution with $n$ degrees of freedom.

If $X_1,\ldots,X_n$ are independently distributed normal random variables, then $\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}$ and $\frac{(n-1)s^2}{\sigma^2}$ are independently distributed standard normal and chi-square with $n-1$ degrees of freedom random variables, respectively. Thus,
$$
T = \frac{\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{1}{n-1}\Bigl(\dfrac{(n-1)s^2}{\sigma^2}\Bigr)}} = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \cdot \frac{\sigma}{s} = \frac{\bar{X}-\mu}{s/\sqrt{n}}
$$
has a $t$-distribution with $n-1$ degrees of freedom.
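A quick simulation sketch (our numbers): the studentized mean should match the $t$-distribution with $n-1$ degrees of freedom; here we compare tail quantiles.

```python
# The studentized mean versus the t(n-1) distribution.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
mu, n, reps = 0.0, 8, 200_000
x = rng.normal(mu, 3.0, size=(reps, n))

T = (x.mean(1) - mu) / (x.std(1, ddof=1) / np.sqrt(n))
print("simulated 97.5% quantile:", np.quantile(T, 0.975))
print("t(n-1)    97.5% quantile:", t.ppf(0.975, n - 1))
```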
14 Some facts about hypothesis tests, levels of significance, and power (14)

Let the parameter space be the disjoint union of $H_0$ and $H_1$. A hypothesis test is a mapping from a sample $X_1,\ldots,X_n$ to the set $\{H_0, H_1\}$. The inverse image of $H_1$ is called the critical region, $C$. $P_\theta(C) = P(C \mid \theta)$ is the probability of rejecting $H_0$ if $\theta$ is the true parameter value; as a function of $\theta$, this is called the power function. The level of significance, which is the maximum probability of a false rejection, is $\sup_{\theta \in H_0} P_\theta(C)$.
14.1 A simple normal hypothesis test

Suppose $X$ is distributed $\mathrm{Normal}(\mu,\sigma^2)$, with $\sigma^2$ known. Let $H_0 = \{\mu : \mu \le \mu_0\}$ and $H_1 = \{\mu : \mu > \mu_0\}$. Let $C_1 = \{x : x > \mu_0\}$. Then,
$$
P_\mu(C_1) = P_\mu(X > \mu_0) = P_\mu\Bigl(\frac{X-\mu}{\sigma} > \frac{\mu_0-\mu}{\sigma}\Bigr) = 1 - \Phi\Bigl(\frac{\mu_0-\mu}{\sigma}\Bigr)
$$
Since $\Phi$ is an increasing function, the expression above is maximized by choosing the largest possible $\mu$ in $H_0$. Therefore, the level of significance is:
$$
\sup_{\mu \le \mu_0} P_\mu(C_1) = 1 - \Phi\Bigl(\frac{\mu_0-\mu_0}{\sigma}\Bigr) = 1 - \Phi(0) = \frac{1}{2}
$$
Consider a second critical region: $C_2 = \{\bar{x} : \bar{x} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}\}$. Then,
$$
\begin{aligned}
P_\mu(C_2) &= P_\mu\Bigl(\bar{X} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}}\Bigr) \\
&= P_\mu\Bigl(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \ge \frac{\mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}} - \mu}{\sigma/\sqrt{n}}\Bigr) \\
&= 1 - \Phi\Bigl(\frac{\mu_0-\mu}{\sigma/\sqrt{n}} + z_\alpha\Bigr)
\end{aligned}
$$
Then the level of significance is:
$$
\sup_{\mu \le \mu_0} P_\mu(C_2) = \sup_{\mu \le \mu_0} \Bigl(1 - \Phi\Bigl(\frac{\mu_0-\mu}{\sigma/\sqrt{n}} + z_\alpha\Bigr)\Bigr) = 1 - \Phi(z_\alpha) = \alpha
$$
since $\Phi(z_\alpha) = 1 - \alpha$.
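The power function of the second test is easy to tabulate. A minimal sketch (our numbers), computing $P_\mu(C_2) = 1 - \Phi\bigl(\frac{\mu_0-\mu}{\sigma/\sqrt{n}} + z_\alpha\bigr)$ directly:

```python
# Power function of the test "reject when xbar > mu0 + z_alpha * sigma/sqrt(n)".
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
z = norm.ppf(1 - alpha)

def power(mu):
    return 1 - norm.cdf((mu0 - mu) / (sigma / np.sqrt(n)) + z)

for mu in (-0.2, 0.0, 0.2, 0.5):
    print(mu, power(mu))  # equals alpha at mu = mu0, rises toward 1 for mu > mu0
```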
15 The Neyman-Pearson Lemma (15)

Theorem 2. Let $X$ be a sample from the distribution $f(x;\theta)$. Let $H_0$ and $H_1$ be the simple hypotheses $H_0 : \theta = \theta_0$ and $H_1 : \theta = \theta_1$. Define
$$
R(x) = \frac{\prod_{i=1}^n f(x_i,\theta_1)}{\prod_{i=1}^n f(x_i,\theta_0)}.
$$
Define a critical region $C \subset \mathbb{R}^n$ for a given $\lambda \in \mathbb{R}$ by $C = \{x \in \mathbb{R}^n : R(x) > \lambda\}$; that is, reject when $R(x) > \lambda$. If $D$ is the critical region for any test such that
$P_{\theta_0}(D) \le P_{\theta_0}(C)$, then $P_{\theta_1}(D) \le P_{\theta_1}(C)$. (The test based on the likelihood ratio is the most powerful for a given level of significance.)

Proof. For any $A \subset \mathbb{R}^n$, define $P_0(A) = \int_A \prod_{i=1}^n f(x_i,\theta_0)\, dx_1 \ldots dx_n$ and $P_1(A) = \int_A \prod_{i=1}^n f(x_i,\theta_1)\, dx_1 \ldots dx_n$. Thus, we want to show that $P_0(D) \le P_0(C)$ implies that $P_1(D) \le P_1(C)$. Let $D$ be given. Then:
$$
\begin{aligned}
P_1(D \cap C^c) &= \int_{D \cap \{x : R(x) \le \lambda\}} \prod_{i=1}^n f(x_i,\theta_1)\, dx_1 \ldots dx_n \\
&= \int_{D \cap \{x : R(x) \le \lambda\}} R(x) \prod_{i=1}^n f(x_i,\theta_0)\, dx_1 \ldots dx_n \\
&\le \lambda \int_{D \cap \{x : R(x) \le \lambda\}} \prod_{i=1}^n f(x_i,\theta_0)\, dx_1 \ldots dx_n \\
&= \lambda P_0(D \cap C^c)
\end{aligned}
$$
$$
\begin{aligned}
P_1(D^c \cap C) &= \int_{D^c \cap \{x : R(x) > \lambda\}} \prod_{i=1}^n f(x_i,\theta_1)\, dx_1 \ldots dx_n \\
&= \int_{D^c \cap \{x : R(x) > \lambda\}} R(x) \prod_{i=1}^n f(x_i,\theta_0)\, dx_1 \ldots dx_n \\
&\ge \lambda \int_{D^c \cap \{x : R(x) > \lambda\}} \prod_{i=1}^n f(x_i,\theta_0)\, dx_1 \ldots dx_n \\
&= \lambda P_0(D^c \cap C)
\end{aligned}
$$
Combining these facts, we find:
$$
\begin{aligned}
P_1(D) &= P_1(D \cap C) + P_1(D \cap C^c) \\
&\le P_1(D \cap C) + \lambda P_0(D \cap C^c) \\
&= P_1(D \cap C) + \lambda\bigl(P_0(D) - P_0(D \cap C)\bigr) \\
&\le P_1(D \cap C) + \lambda\bigl(P_0(C) - P_0(D \cap C)\bigr) \\
&= P_1(D \cap C) + \lambda P_0(C \cap D^c) \\
&\le P_1(D \cap C) + P_1(C \cap D^c) \\
&= P_1(C)
\end{aligned}
$$
16 The likelihood ratio test for the normal distribution (16)
Suppose $X_1,\ldots,X_n$ is a sample from the distribution $\mathrm{Normal}(\mu,\sigma^2)$. Let the null hypothesis be $H_0 : \mu = \mu_0$, $\sigma^2$ unknown. Let the alternative hypothesis be $H_1 : \mu \ne \mu_0$, $\sigma^2$ unknown. The likelihood function of this distribution is:
$$
L(\mu,\sigma^2, X) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(X_i-\mu)^2} = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{1}{2\sigma^2}\sum (X_i-\mu)^2}
$$
Under the null hypothesis, $\hat\mu = \mu_0$, so we maximize the likelihood function with respect to $\sigma^2$:
$$
\log L(\mu_0,\sigma^2, X) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (X_i-\mu_0)^2
$$
$$
\frac{\partial}{\partial\sigma^2} \log L(\mu_0,\sigma^2, X) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum (X_i-\mu_0)^2 = 0
$$
$$
\hat\sigma_0^2 = \frac{1}{n}\sum (X_i-\mu_0)^2
$$
Under the alternative hypothesis, we have the maximum likelihood estimates $\hat\mu = \bar{X}$ and $\hat\sigma_1^2 = \frac{1}{n}\sum (X_i-\bar{X})^2$.

Then, the likelihood ratio of the null and alternative hypotheses is:
$$
\begin{aligned}
\lambda = \frac{L(\mu_0, \hat\sigma_0^2)}{L(\hat\mu, \hat\sigma_1^2)}
&= \frac{(2\pi\hat\sigma_0^2)^{-n/2}\, e^{-\frac{1}{2\hat\sigma_0^2}\sum (X_i-\mu_0)^2}}{(2\pi\hat\sigma_1^2)^{-n/2}\, e^{-\frac{1}{2\hat\sigma_1^2}\sum (X_i-\bar{X})^2}} \\
&= \Bigl(\frac{\hat\sigma_1^2}{\hat\sigma_0^2}\Bigr)^{n/2}\, e^{-\frac{n}{2} + \frac{n}{2}} \\
&= \Bigl(\frac{\sum (X_i-\bar{X})^2}{\sum (X_i-\mu_0)^2}\Bigr)^{n/2}
\end{aligned}
$$
Recall that $\sum (X_i-\mu_0)^2 = \sum (X_i-\bar{X})^2 + n(\bar{X}-\mu_0)^2$, so
$$
\lambda = \Bigl(\frac{\sum (X_i-\bar{X})^2}{\sum (X_i-\bar{X})^2 + n(\bar{X}-\mu_0)^2}\Bigr)^{n/2} = \Biggl(\frac{1}{1 + \dfrac{n(\bar{X}-\mu_0)^2}{\sum (X_i-\bar{X})^2}}\Biggr)^{n/2}
$$
This is a decreasing function of
$$
\frac{n(\bar{X}-\mu_0)^2}{\sum (X_i-\bar{X})^2} = \frac{1}{n-1}\Bigl(\frac{\bar{X}-\mu_0}{s/\sqrt{n}}\Bigr)^2 = \frac{t^2}{n-1},
$$
which is an increasing function of the absolute value of the $t$-statistic. Thus, $\lambda$ is small if and only if the $t$-statistic is large in absolute value, and a $t$-test is a likelihood ratio test.
17 Linear combinations of multivariate normals (and MGF's) (17)

17.1 Case 1: Standard Normal

The moment generating function for a standard normal variable, $Z$, is given by: