Part IB — Statistics

Based on lectures by D. Spiegelhalter
Notes taken by Dexter Chua
Lent 2015

These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures. They are nowhere near accurate representations of what was actually lectured, and in particular, all errors are almost surely mine.

Estimation
Review of distribution and density functions, parametric families. Examples: binomial, Poisson, gamma. Sufficiency, minimal sufficiency, the Rao-Blackwell theorem. Maximum likelihood estimation. Confidence intervals. Use of prior distributions and Bayesian inference. [5]

Hypothesis testing
Simple examples of hypothesis testing, null and alternative hypotheses, critical region, size, power, type I and type II errors, Neyman-Pearson lemma. Significance level of outcome. Uniformly most powerful tests. Likelihood ratio, and use of generalised likelihood ratio to construct test statistics for composite hypotheses. Examples, including t-tests and F-tests. Relationship with confidence intervals. Goodness-of-fit tests and contingency tables. [4]

Linear models
Derivation and joint distribution of maximum likelihood estimators, least squares, Gauss-Markov theorem. Testing hypotheses, geometric interpretation. Examples, including simple linear regression and one-way analysis of variance. Use of software. [7]


Contents

0 Introduction

1 Estimation
  1.1 Estimators
  1.2 Mean squared error
  1.3 Sufficiency
  1.4 Likelihood
  1.5 Confidence intervals
  1.6 Bayesian estimation

2 Hypothesis testing
  2.1 Simple hypotheses
  2.2 Composite hypotheses
  2.3 Tests of goodness-of-fit and independence
    2.3.1 Goodness-of-fit of a fully-specified null distribution
    2.3.2 Pearson's chi-squared test
    2.3.3 Testing independence in contingency tables
  2.4 Tests of homogeneity, and connections to confidence intervals
    2.4.1 Tests of homogeneity
    2.4.2 Confidence intervals and hypothesis tests
  2.5 Multivariate normal theory
    2.5.1 Multivariate normal distribution
    2.5.2 Normal random samples
  2.6 Student's t-distribution

3 Linear models
  3.1 Linear models
  3.2
  3.3 Linear models with normal assumptions
  3.4 The F distribution
  3.5 Inference for β
  3.6 Simple linear regression
  3.7 Expected response at x∗
  3.8 Hypothesis testing
    3.8.1 Hypothesis testing
    3.8.2 Simple linear regression
    3.8.3 One way analysis of variance with equal numbers in each group


0 Introduction

Statistics is a set of principles and procedures for gaining and processing quantitative evidence in order to help us make judgements and decisions. In this course, we focus on formal statistical inference. We assume that we have some data generated from some unknown probability model, and we aim to use the data to learn about certain properties of the underlying probability model.

In particular, we perform parametric inference. We assume that we have a random variable X that follows a particular known family of distributions (e.g. Poisson). However, we do not know the parameters of the distribution. We then attempt to estimate the parameters from the data given. For example, we might know that X ∼ Poisson(µ) for some µ, and we want to figure out what µ is.

Usually we repeat the experiment (or observation) many times. Hence we will have X1, X2, ··· , Xn being iid with the same distribution as X. We call the set X = (X1, X2, ··· , Xn) a simple random sample. This is the data we have. We will use the observed X = x to make inferences about the parameter θ, such as

– giving an estimate θ̂(x) of the true value of θ;

– giving an interval estimate (θ̂1(x), θ̂2(x)) for θ;

– testing a hypothesis about θ, e.g. whether θ = 0.


1 Estimation

1.1 Estimators

The goal of estimation is as follows: we are given iid X1, ··· , Xn, and we know that their probability density/mass function is fX(x; θ) for some unknown θ. We know fX but not θ. For example, we might know that they follow a Poisson distribution, but we do not know what the mean is. The objective is to estimate the value of θ.

Definition (Estimator). A statistic is an estimate of θ. It is a function T of the data. If we write the data as x = (x1, ··· , xn), then our estimate is written as θ̂ = T(x). T(X) is an estimator of θ. The distribution of T = T(X) is the sampling distribution of the statistic.

Note that we adopt the convention where capital X denotes a random variable and x is an observed value. So T(X) is a random variable and T(x) is a particular value we obtain after sampling.

Example. Let X1, ··· , Xn be iid N(µ, 1). A possible estimator for µ is

    T(X) = (1/n) Σ Xi.

Then for any particular observed sample x, our estimate is

    T(x) = (1/n) Σ xi.

What is the sampling distribution of T? Recall from IA Probability that, in general, if Xi ∼ N(µi, σi²) independently, then Σ Xi ∼ N(Σ µi, Σ σi²), which is something we can prove by considering moment-generating functions. So we have T(X) ∼ N(µ, 1/n). Note that by the Central Limit Theorem, even if the Xi were not normal, we would still have approximately T(X) ∼ N(µ, 1/n) for large values of n, but here we get exactly the normal distribution even for small values of n.

The estimator (1/n) Σ Xi we had above is a rather sensible estimator. Of course, we can also have silly estimators such as T(X) = X1, or even T(X) = 0.32 always. One way to decide if an estimator is silly is to look at its bias.

Definition (Bias). Let θ̂ = T(X) be an estimator of θ. The bias of θ̂ is the difference between its expected value and the true value:

    bias(θ̂) = Eθ(θ̂) − θ.

Note that the subscript θ does not represent the random variable, but the thing we want to estimate. This is inconsistent with the use for, say, the probability mass function. ˆ An estimator is unbiased if it has no bias, i.e. Eθ(θ) = θ.

To find out Eθ(T ), we can either find the distribution of T and find its expected value, or evaluate T as a function of X directly, and find its expected value.

Example. In the above example, Eµ(T ) = µ. So T is unbiased for µ.
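As a quick illustration (not part of the original notes), here is a minimal simulation sketch, assuming Python with numpy is available, that draws many samples of size n from N(µ, 1) and checks that the sample mean behaves like an N(µ, 1/n) random variable and is unbiased.

import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 2.0, 25, 100_000

# Each row is one sample of size n; T is the sample mean of each row.
samples = rng.normal(loc=mu, scale=1.0, size=(reps, n))
T = samples.mean(axis=1)

print(T.mean())   # close to mu, so the bias is approximately 0
print(T.var())    # close to 1/n = 0.04, the variance of N(mu, 1/n)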


1.2 Mean squared error

Given an estimator, we want to know how good the estimator is. We have just come up with the concept of the bias above. However, this is generally not a good measure of how good the estimator is.

For example, if we do 1000 random trials X1, ··· , X1000, we can pick our estimator as T(X) = X1. This is an unbiased estimator, but is really bad because we have just wasted the data from the other 999 trials. On the other hand, T′(X) = 0.01 + (1/1000) Σ Xi is biased (with a bias of 0.01), but is in general much more trustworthy than T. In fact, at the end of the section, we will construct cases where the only possible unbiased estimator is a completely silly estimator to use. Instead, a commonly used measure is the mean squared error.

Definition (Mean squared error). The mean squared error of an estimator θ̂ is Eθ[(θ̂ − θ)²]. Sometimes, we use the root mean squared error, which is the square root of the above.

We can express the mean squared error in terms of the variance and bias:

    Eθ[(θ̂ − θ)²] = Eθ[(θ̂ − Eθ(θ̂) + Eθ(θ̂) − θ)²]
                 = Eθ[(θ̂ − Eθ(θ̂))²] + [Eθ(θ̂) − θ]² + 2[Eθ(θ̂) − θ] Eθ[θ̂ − Eθ(θ̂)]
                 = var(θ̂) + bias²(θ̂),

since the cross term vanishes because Eθ[θ̂ − Eθ(θ̂)] = 0.

If we are aiming for a low mean squared error, sometimes it could be preferable to have a biased estimator with a lower variance. This is known as the "bias-variance trade-off".

For example, suppose X ∼ binomial(n, θ), where n is given and θ is to be determined. The standard estimator is TU = X/n, which is unbiased. TU has variance

    varθ(TU) = varθ(X)/n² = θ(1 − θ)/n.

Hence the mean squared error of the usual estimator is given by

    mse(TU) = varθ(TU) + bias²(TU) = θ(1 − θ)/n.

Consider an alternative estimator

    TB = (X + 1)/(n + 2) = w·(X/n) + (1 − w)·(1/2),

where w = n/(n + 2). This can be interpreted as a weighted average (by the sample size) of the sample mean and 1/2. We have

    Eθ(TB) − θ = (nθ + 1)/(n + 2) − θ = (1 − w)(1/2 − θ),

so TB is biased. The variance is given by

    varθ(TB) = varθ(X)/(n + 2)² = w²·θ(1 − θ)/n.


Hence the mean squared error is

    mse(TB) = varθ(TB) + bias²(TB) = w²·θ(1 − θ)/n + (1 − w)²(1/2 − θ)².

We can plot the mean squared error of each estimator for possible values of θ. Here we plot the case where n = 10.

[Plot: mse against θ for 0 ≤ θ ≤ 1 with n = 10, comparing the unbiased estimator and the biased estimator.]
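The plot can be reproduced with a short script; this is a sketch rather than the original figure, and it assumes Python with numpy and matplotlib installed.

import numpy as np
import matplotlib.pyplot as plt

n = 10
theta = np.linspace(0, 1, 200)
w = n / (n + 2)

mse_unbiased = theta * (1 - theta) / n
mse_biased = w**2 * theta * (1 - theta) / n + (1 - w)**2 * (0.5 - theta)**2

plt.plot(theta, mse_unbiased, label="unbiased estimator T_U = X/n")
plt.plot(theta, mse_biased, label="biased estimator T_B = (X+1)/(n+2)")
plt.xlabel("theta")
plt.ylabel("mse")
plt.legend()
plt.show()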

The biased estimator has smaller MSE unless θ takes extreme values. We see that sometimes biased estimators could give better mean squared errors.

In some cases, not only can unbiased estimators be worse, they can be complete nonsense. Suppose X ∼ Poisson(λ), and for whatever reason, we want to estimate θ = [P(X = 0)]² = e^{−2λ}. Then any unbiased estimator T(X) must satisfy Eθ(T(X)) = θ, or equivalently,

    Eλ(T(X)) = e^{−λ} Σ_{x=0}^{∞} T(x) λ^x/x! = e^{−2λ}.

The only function T that can satisfy this equation is T (X) = (−1)X . Thus the unbiased estimator would estimate e−2λ to be 1 if X is even, -1 if X is odd. This is clearly nonsense.

1.3 Sufficiency

Often, we do experiments just to find out the value of θ. For example, we might want to estimate what proportion of the population supports some political candidate. We are seldom interested in the data points themselves, and just want to learn about the big picture. This leads us to the concept of a sufficient statistic. This is a statistic T(X) that contains all the information we have about θ in the sample.

Example. Let X1, ··· Xn be iid Bernoulli(θ), so that P(Xi = 1) = 1 − P(Xi = 0) = θ for some 0 < θ < 1. So

    fX(x | θ) = Π_{i=1}^{n} θ^{xi}(1 − θ)^{1−xi} = θ^{Σxi}(1 − θ)^{n−Σxi}.

This depends on the data only through T(X) = Σ xi, the total number of ones.


Suppose we are now given that T (X) = t. Then what is the distribution of X? We have

    fX|T=t(x) = Pθ(X = x, T = t)/Pθ(T = t) = Pθ(X = x)/Pθ(T = t),

where the last equality holds because if X = x, then T is necessarily equal to t. This is equal to

    θ^{Σxi}(1 − θ)^{n−Σxi} / [ (n choose t) θ^t(1 − θ)^{n−t} ] = (n choose t)^{−1}.

So the conditional distribution of X given T = t does not depend on θ. So if we know T, then additional knowledge of x does not give more information about θ.

Definition (Sufficient statistic). A statistic T is sufficient for θ if the conditional distribution of X given T does not depend on θ.

There is a convenient theorem that allows us to find sufficient statistics.

Theorem (The factorization criterion). T is sufficient for θ if and only if

    fX(x | θ) = g(T(x), θ) h(x)

for some functions g and h.

Proof. We first prove the discrete case. Suppose fX(x | θ) = g(T(x), θ)h(x). If T(x) = t, then

    fX|T=t(x) = Pθ(X = x, T(X) = t) / Pθ(T = t)
              = g(T(x), θ)h(x) / Σ_{y: T(y)=t} g(T(y), θ)h(y)
              = g(t, θ)h(x) / [ g(t, θ) Σ h(y) ]
              = h(x) / Σ h(y),

which doesn't depend on θ. So T is sufficient.

The continuous case is similar. If fX(x | θ) = g(T(x), θ)h(x) and T(x) = t, then

    fX|T=t(x) = g(T(x), θ)h(x) / ∫_{y: T(y)=t} g(T(y), θ)h(y) dy
              = g(t, θ)h(x) / [ g(t, θ) ∫ h(y) dy ]
              = h(x) / ∫ h(y) dy,

which does not depend on θ.

Now suppose T is sufficient, so that the conditional distribution of X | T = t does not depend on θ. Then

    Pθ(X = x) = Pθ(X = x, T = T(x)) = Pθ(X = x | T = T(x)) Pθ(T = T(x)).

The first factor does not depend on θ by assumption; call it h(x). Let the second factor be g(t, θ), and so we have the required factorisation.


Example. Continuing the above example, P P xi n− xi fX(x | θ) = θ (1 − θ) .

t n−t P Take g(t, θ) = θ (1 − θ) and h(x) = 1 to see that T (X) = Xi is sufficient for θ.

Example. Let X1, ··· , Xn be iid U[0, θ]. Write 1[A] for the indicator function of an arbitrary event A. We have

    fX(x | θ) = Π_{i=1}^{n} (1/θ) 1[0 ≤ xi ≤ θ] = (1/θ^n) 1[max_i xi ≤ θ] 1[min_i xi ≥ 0].

If we let T = max_i xi, then we have

    fX(x | θ) = (1/θ^n) 1[t ≤ θ] · 1[min_i xi ≥ 0],

where we take g(t, θ) = (1/θ^n) 1[t ≤ θ] and h(x) = 1[min_i xi ≥ 0].

So T = max_i xi is sufficient.

Note that sufficient statistics are not unique. If T is sufficient for θ, then so is any 1-1 function of T. X itself is always sufficient for θ as well, but it is not of much use. How can we decide if a sufficient statistic is "good"?

Given any statistic T, we can partition the sample space X^n into sets {x ∈ X^n : T(x) = t}. Then after an experiment, instead of recording the actual value of x, we can simply record the partition x falls into. If there are fewer partitions than possible values of x, then effectively there is less information we have to store. If T is sufficient, then this data reduction does not lose any information about θ. The "best" sufficient statistic would be one in which we achieve the maximum possible reduction. This is known as the minimal sufficient statistic. The formal definition we take is the following:

Definition (Minimal sufficiency). A sufficient statistic T(X) is minimal if it is a function of every other sufficient statistic, i.e. if T′(X) is also sufficient, then T′(X) = T′(Y) ⇒ T(X) = T(Y).

Again, we have a handy theorem to find minimal sufficient statistics:

Theorem. Suppose T = T(X) is a statistic that satisfies

    fX(x; θ)/fX(y; θ) does not depend on θ  if and only if  T(x) = T(y).

Then T is minimal sufficient for θ.

Proof. First we have to show sufficiency. We will use the factorization criterion to do so.

Firstly, for each possible t, pick a favorite x_t such that T(x_t) = t. Now let x ∈ X^n and let T(x) = t. So T(x) = T(x_t). By the hypothesis, fX(x; θ)/fX(x_t; θ) does not depend on θ. Let this be h(x). Let g(t, θ) = fX(x_t; θ). Then

    fX(x; θ) = fX(x_t; θ) · [ fX(x; θ)/fX(x_t; θ) ] = g(t, θ) h(x).


So T is sufficient for θ. To show that this is minimal, suppose that S(X) is also sufficient. By the factorization criterion, there exist functions gS and hS such that

    fX(x; θ) = gS(S(x), θ) hS(x).

Now suppose that S(x) = S(y). Then

    fX(x; θ)/fX(y; θ) = gS(S(x), θ)hS(x) / [ gS(S(y), θ)hS(y) ] = hS(x)/hS(y).

This means that the ratio fX(x; θ)/fX(y; θ) does not depend on θ. By the hypothesis, this implies that T(x) = T(y). So we know that S(x) = S(y) implies T(x) = T(y). So T is a function of S. So T is minimal sufficient.

2 Example. Suppose X1, ··· ,Xn are iid N(µ, σ ). Then

    fX(x | µ, σ²)/fX(y | µ, σ²)
      = [ (2πσ²)^{−n/2} exp( −(1/2σ²) Σi (xi − µ)² ) ] / [ (2πσ²)^{−n/2} exp( −(1/2σ²) Σi (yi − µ)² ) ]
      = exp{ −(1/2σ²)( Σi xi² − Σi yi² ) + (µ/σ²)( Σi xi − Σi yi ) }.

This is a constant function of (µ, σ²) iff Σi xi² = Σi yi² and Σi xi = Σi yi. So T(X) = (Σi Xi², Σi Xi) is minimal sufficient for (µ, σ²).

As mentioned, minimal sufficient statistics allow us to store the results of our experiments in the most efficient way. It turns out that if we have a minimal sufficient statistic, then we can use it to improve any estimator.

Theorem (Rao-Blackwell Theorem). Let T be a sufficient statistic for θ and let θ̃ be an estimator for θ with E(θ̃²) < ∞ for all θ. Let θ̂(x) = E[θ̃(X) | T(X) = T(x)]. Then for all θ,

    E[(θ̂ − θ)²] ≤ E[(θ̃ − θ)²].

The inequality is strict unless θ̃ is a function of T.

Here we have to be careful with our definition of θ̂. It is defined as a conditional expected value of θ̃(X), and this could potentially depend on the actual value of θ. Fortunately, since T is sufficient for θ, the conditional distribution of X given T = t does not depend on θ. Hence θ̂ = E[θ̃(X) | T] does not depend on θ, and so is a genuine estimator. In fact, the proof does not use that T is sufficient; we only require it to be sufficient so that we can compute θ̂ without knowing θ.

Using this theorem, given any estimator, we can find one that is a function of a sufficient statistic and is at least as good in terms of mean squared error of estimation. Moreover, if the original estimator θ̃ is unbiased, so is the new θ̂. Also, if θ̃ is already a function of T, then θ̂ = θ̃.

Proof. By the conditional expectation formula, we have E(θˆ) = E[E(θ˜ | T )] = E(θ˜). So they have the same bias. By the conditional variance formula,

var(θ˜) = E[var(θ˜ | T )] + var[E(θ˜ | T )] = E[var(θ˜ | T )] + var(θˆ). Hence var(θ˜) ≥ var(θˆ). So mse(θ˜) ≥ mse(θˆ), with equality only if var(θ˜ | T ) = 0.


Example. Suppose X1, ··· , Xn are iid Poisson(λ), and let θ = e^{−λ}, which is the probability that X1 = 0. Then

    pX(x | λ) = e^{−nλ} λ^{Σxi} / Π xi!.

So

    pX(x | θ) = θ^n (−log θ)^{Σxi} / Π xi!.

We see that T = Σ Xi is sufficient for θ, and Σ Xi ∼ Poisson(nλ).

We start with an easy estimator θ̃ = 1[X1 = 0], which is unbiased (i.e. if we observe nothing in the first observation period, we assume the event is impossible). Then

    E[θ̃ | T = t] = P( X1 = 0 | Σ_{i=1}^{n} Xi = t )
                 = P(X1 = 0) P( Σ_{i=2}^{n} Xi = t ) / P( Σ_{i=1}^{n} Xi = t )
                 = ((n − 1)/n)^t.

So θ̂ = (1 − 1/n)^{Σxi}. This is approximately (1 − 1/n)^{n x̄} ≈ e^{−x̄} = e^{−λ̂}, which makes sense.
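To see the improvement numerically, here is a small simulation sketch (not from the notes, and assuming numpy is available) comparing the mean squared errors of the crude estimator 1[X1 = 0] and its Rao-Blackwellised version (1 − 1/n)^{ΣXi}.

import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 1.5, 10, 200_000
theta = np.exp(-lam)                    # the quantity we want to estimate

X = rng.poisson(lam, size=(reps, n))
crude = (X[:, 0] == 0).astype(float)    # theta_tilde = 1[X1 = 0]
rb = (1 - 1/n) ** X.sum(axis=1)         # theta_hat = E[theta_tilde | T]

print(np.mean((crude - theta)**2))      # larger mse
print(np.mean((rb - theta)**2))         # smaller mse, with the same (zero) bias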

Example. Let X1, ··· , Xn be iid U[0, θ], and suppose that we want to estimate θ. We have shown above that T = max Xi is sufficient for θ. Let θ̃ = 2X1, an unbiased estimator. Then

    E[θ̃ | T = t] = 2E[X1 | max Xi = t]
                 = 2E[X1 | max Xi = t, X1 = max Xi] P(X1 = max Xi)
                   + 2E[X1 | max Xi = t, X1 ≠ max Xi] P(X1 ≠ max Xi)
                 = 2( t·(1/n) + (t/2)·(n − 1)/n )
                 = ((n + 1)/n) t.

So θ̂ = ((n + 1)/n) max Xi is our new estimator.

1.4 Likelihood

There are many different estimators we can pick, and we have just come up with some criteria to determine whether an estimator is "good". However, these do not give us a systematic way of coming up with an estimator to actually use. In practice, we often use the maximum likelihood estimator.

Let X1, ··· , Xn be random variables with joint pdf/pmf fX(x | θ). We observe X = x.

Definition (Likelihood). For any given x, the likelihood of θ is like(θ) = fX(x | θ), regarded as a function of θ. The maximum likelihood estimator (mle) of θ is an estimator that picks the value of θ that maximizes like(θ).


Often there is no closed form for the mle, and we have to find θˆ numerically. When we can find the mle explicitly, in practice, we often maximize the log-likelihood instead of the likelihood. In particular, if X1, ··· ,Xn are iid, each with pdf/pmf fX (x | θ), then

    like(θ) = Π_{i=1}^{n} fX(xi | θ),
    log like(θ) = Σ_{i=1}^{n} log fX(xi | θ).

Example. Let X1, ··· , Xn be iid Bernoulli(p). Then

    l(p) = log like(p) = (Σ xi) log p + (n − Σ xi) log(1 − p).

Thus

    dl/dp = (Σ xi)/p − (n − Σ xi)/(1 − p).

This is zero when p = Σ xi/n. So this is the maximum likelihood estimator (and is unbiased).

Example. Let X1, ··· , Xn be iid N(µ, σ²), and we want to estimate θ = (µ, σ²). Then

    l(µ, σ²) = log like(µ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/2σ²) Σ (xi − µ)².

This is maximized when

    ∂l/∂µ = ∂l/∂σ² = 0.

We have

    ∂l/∂µ = (1/σ²) Σ (xi − µ),    ∂l/∂σ² = −n/(2σ²) + (1/2σ⁴) Σ (xi − µ)².

So the solution, and hence the maximum likelihood estimator, is (µ̂, σ̂²) = (x̄, Sxx/n), where x̄ = (1/n) Σ xi and Sxx = Σ (xi − x̄)².

We shall see later that SXX/σ² = nσ̂²/σ² ∼ χ²_{n−1}, and so E(σ̂²) = (n − 1)σ²/n, i.e. σ̂² is biased.

Example (German tank problem). Suppose the American army discovers some German tanks that are sequentially numbered, i.e. the first tank is numbered 1, the second is numbered 2, etc. So if θ tanks are produced, the distribution of the number on a discovered tank is modelled as U(0, θ). Suppose we have discovered n tanks whose numbers are x1, x2, ··· , xn, and we want to estimate θ, the total number of tanks produced. We want to find the maximum likelihood estimator. Then

    like(θ) = (1/θ^n) 1[max xi ≤ θ] 1[min xi ≥ 0].

So for θ ≥ max xi, like(θ) = 1/θ^n, which is decreasing as θ increases, while for θ < max xi, like(θ) = 0. Hence the value θ̂ = max xi maximizes the likelihood.


Is θ̂ unbiased? First we need to find the distribution of θ̂. For 0 ≤ t ≤ θ, the cumulative distribution function of θ̂ is

    F_θ̂(t) = P(θ̂ ≤ t) = P(Xi ≤ t for all i) = (P(X1 ≤ t))^n = (t/θ)^n.

Differentiating with respect to t, we find the pdf f_θ̂(t) = n t^{n−1}/θ^n. Hence

    E(θ̂) = ∫₀^θ t · n t^{n−1}/θ^n dt = nθ/(n + 1).

So θ̂ is biased, but asymptotically unbiased.

Example. Smarties come in k equally frequent colours, but suppose we do not know k. Assume that we sample with replacement (although this is unhygienic). Our first Smarties are Red, Purple, Red, Yellow. Then

    like(k) = Pk(1st is a new colour) Pk(2nd is a new colour) Pk(3rd matches 1st) Pk(4th is a new colour)
            = 1 × (k − 1)/k × 1/k × (k − 2)/k
            = (k − 1)(k − 2)/k³.

The maximum likelihood estimate is k̂ = 5 (found by trial and error), even though it is not much likelier than the neighbouring values.

How does the mle relate to sufficient statistics? Suppose that T is sufficient for θ. Then the likelihood is g(T(x), θ)h(x), which depends on θ only through T(x). To maximise this as a function of θ, we only need to maximize g. So the mle θ̂ is a function of the sufficient statistic.

Note that if φ = h(θ) with h injective, then the mle of φ is given by h(θ̂). For example, if the mle of the standard deviation σ is σ̂, then the mle of the variance σ² is σ̂². This is rather useful in practice, since we can use it to simplify a lot of computations.
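A quick way to do the "trial and error" in the Smarties example is to evaluate the likelihood over a range of k. The sketch below assumes numpy and is illustrative only, not part of the notes.

import numpy as np

k = np.arange(3, 21)                # need at least 3 colours, since 3 distinct ones were seen
like = (k - 1) * (k - 2) / k**3     # likelihood from the Red, Purple, Red, Yellow data

print(k[np.argmax(like)])           # 5, the maximum likelihood estimate
print(like[:6].round(4))            # likelihoods for k = 3, ..., 8: the maximum is not sharp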

1.5 Confidence intervals

Definition. A 100γ% (0 < γ < 1) confidence interval for θ is a random interval (A(X), B(X)) such that P(A(X) < θ < B(X)) = γ, no matter what the true value of θ may be.

It is also possible to have confidence intervals for vector parameters.

Notice that it is the endpoints of the interval that are random quantities, while θ is a fixed constant we want to find out. We can interpret this in terms of repeat sampling: if we calculate (A(x), B(x)) for a large number of samples x, then approximately 100γ% of them will cover the true value of θ.

It is important to know that, having observed some data x and calculated a 95% confidence interval, we cannot say that θ has a 95% chance of being within the interval. Apart from the standard objection that θ is a fixed value which either is or is not in the interval, so that we cannot assign probabilities to this event, we will later construct an example where even though we have a 50% confidence interval, we are 100% sure that θ lies in that interval.


Example. Suppose X1, ··· , Xn are iid N(θ, 1). Find a 95% confidence interval for θ.

We know X̄ ∼ N(θ, 1/n), so that √n(X̄ − θ) ∼ N(0, 1).

Let z1, z2 be such that Φ(z2) − Φ(z1) = 0.95, where Φ is the standard normal distribution function.

We have P[z1 < √n(X̄ − θ) < z2] = 0.95, which can be rearranged to give

    P[ X̄ − z2/√n < θ < X̄ − z1/√n ] = 0.95,

so we obtain the following 95% confidence interval:

    ( X̄ − z2/√n, X̄ − z1/√n ).

There are many possible choices for z1 and z2. Since N(0, 1) density is symmetric, the shortest such interval is obtained by z2 = z0.025 = −z1. We can also choose other values such as z1 = −∞, z2 = 1.64, but we usually choose symmetric end points. The above example illustrates a common procedure for finding confidence intervals:

– Find a quantity R(X, θ) such that the Pθ-distribution of R(X, θ) does not depend on θ. This is called a pivot. In our example, R(X, θ) = √n(X̄ − θ).

– Write down a probability statement of the form Pθ(c1 < R(X, θ) < c2) = γ. – Rearrange the inequalities inside P(...) to find the interval.

Usually c1, c2 are percentage points from a known standardised distribution, often equitailed. For example, we pick 2.5% and 97.5% points for a 95% confidence interval. We could also use, say 0% and 95%, but this generally results in a wider interval. Note that if (A(X),B(X)) is a 100γ% confidence interval for θ, and T (θ) is a monotone increasing function of θ, then (T (A(X)),T (B(X))) is a 100γ% confidence interval for T (θ).
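The repeat-sampling interpretation can be checked empirically. The following sketch (assuming numpy and scipy; not part of the original notes) simulates many N(θ, 1) samples and counts how often the symmetric 95% interval covers the true θ.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta, n, reps = 1.0, 20, 50_000
z = norm.ppf(0.975)                      # upper 2.5% point, about 1.96

xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
lower, upper = xbar - z / np.sqrt(n), xbar + z / np.sqrt(n)

print(np.mean((lower < theta) & (theta < upper)))   # approximately 0.95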

Example. Suppose X1, ··· , X50 are iid N(0, σ²). Find a 99% confidence interval for σ².

We know that Xi/σ ∼ N(0, 1), so

    (1/σ²) Σ_{i=1}^{50} Xi² ∼ χ²₅₀.

So R(X, σ²) = Σ_{i=1}^{50} Xi²/σ² is a pivot.

Recall that χ²ₙ(α) is the upper 100α% point of χ²ₙ, i.e.

    P(χ²ₙ ≤ χ²ₙ(α)) = 1 − α.

So we have c1 = χ²₅₀(0.995) = 27.99 and c2 = χ²₅₀(0.005) = 79.49. So

    P( c1 < Σ Xi²/σ² < c2 ) = 0.99,

and hence

    P( Σ Xi²/c2 < σ² < Σ Xi²/c1 ) = 0.99.


Using the remark above, we know that a 99% confidence interval for σ is

    ( √(Σ Xi²/c2), √(Σ Xi²/c1) ).
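As a computational aside (not in the notes; scipy and numpy assumed), the χ² points and the resulting interval can be obtained directly:

import numpy as np
from scipy.stats import chi2

x = np.random.default_rng(3).normal(0, 2.0, size=50)   # fake data with true sigma = 2

c1 = chi2.ppf(0.005, df=50)    # 27.99, the notes' chi^2_50(0.995)
c2 = chi2.ppf(0.995, df=50)    # 79.49, the notes' chi^2_50(0.005)

s = np.sum(x**2)
print((s / c2, s / c1))                       # 99% confidence interval for sigma^2
print((np.sqrt(s / c2), np.sqrt(s / c1)))     # and for sigma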

Example. Suppose X1, ··· , Xn are iid Bernoulli(p). Find an approximate confidence interval for p.

The mle of p is p̂ = Σ Xi/n. By the Central Limit Theorem, p̂ is approximately N(p, p(1 − p)/n) for large n. So √n(p̂ − p)/√(p(1 − p)) is approximately N(0, 1) for large n. So letting z_{(1−γ)/2} be the solution to Φ(z_{(1−γ)/2}) − Φ(−z_{(1−γ)/2}) = γ, we have

    P( p̂ − z_{(1−γ)/2} √(p(1 − p)/n) < p < p̂ + z_{(1−γ)/2} √(p(1 − p)/n) ) ≈ γ.

But p is unknown! So we approximate it by p̂ to get a confidence interval for p when n is large:

    P( p̂ − z_{(1−γ)/2} √(p̂(1 − p̂)/n) < p < p̂ + z_{(1−γ)/2} √(p̂(1 − p̂)/n) ) ≈ γ.

Note that we have made a lot of approximations here, but it would be difficult to do better than this.

Example. Suppose an opinion poll says 20% of the people are going to vote UKIP, based on a random sample of 1,000 people. What might the true proportion be?

We assume we have an observation of x = 200 from a binomial(n, p) distribution with n = 1,000. Then p̂ = x/n = 0.2 is an unbiased estimate, and also the mle.

Now var(X/n) = p(1 − p)/n ≈ p̂(1 − p̂)/n = 0.00016. So a 95% confidence interval is

    ( p̂ − 1.96 √(p̂(1 − p̂)/n), p̂ + 1.96 √(p̂(1 − p̂)/n) ) = 0.20 ± 1.96 × 0.013 = (0.175, 0.225).

If we don't want to make that many approximations, we can note that p(1 − p) ≤ 1/4 for all 0 ≤ p ≤ 1. So a conservative 95% interval is p̂ ± 1.96 √(1/(4n)) ≈ p̂ ± 1/√n. So whatever proportion is reported, it will be 'accurate' to ±1/√n.
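A short numerical check of this example (assuming scipy and numpy; not part of the notes):

import numpy as np
from scipy.stats import norm

n, x = 1_000, 200
phat = x / n
z = norm.ppf(0.975)                          # 1.96

se = np.sqrt(phat * (1 - phat) / n)          # about 0.0126
print((phat - z * se, phat + z * se))        # approximately (0.175, 0.225)
print((phat - 1 / np.sqrt(n), phat + 1 / np.sqrt(n)))   # conservative interval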

Example. Suppose X1, X2 are iid from U(θ − 1/2, θ + 1/2). What is a sensible 50% confidence interval for θ?

We know that each Xi is equally likely to be less than θ or greater than θ. So there is a 50% chance that we get one observation on each side, i.e.

    Pθ( min(X1, X2) ≤ θ ≤ max(X1, X2) ) = 1/2.

So (min(X1, X2), max(X1, X2)) is a 50% confidence interval for θ.

But suppose after the experiment we obtain |x1 − x2| ≥ 1/2. For example, we might get x1 = 0.2, x2 = 0.9. Then we know that, in this particular case, θ must lie in (min(x1, x2), max(x1, x2)), and we don't have just 50% "confidence" that it does!
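A simulation sketch (numpy assumed; illustrative only) makes both points: the overall coverage is 50%, but conditional on |x1 − x2| ≥ 1/2 the interval always contains θ.

import numpy as np

rng = np.random.default_rng(4)
theta, reps = 3.0, 200_000

X = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, 2))
lo, hi = X.min(axis=1), X.max(axis=1)
covered = (lo <= theta) & (theta <= hi)

print(covered.mean())                 # about 0.5: the unconditional coverage
wide = (hi - lo) >= 0.5
print(covered[wide].mean())           # exactly 1.0: coverage given a wide interval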


This is why, after we calculate a confidence interval, we should not say "there is 100(1 − α)% chance that θ lies in here". The confidence interval just says that "if we keep making these intervals, 100(1 − α)% of them will contain θ". But if we have calculated a particular confidence interval, the probability that that particular interval contains θ is not 100(1 − α)%.

1.6 Bayesian estimation

So far we have seen the frequentist approach to statistical inference, i.e. inferential statements about θ are interpreted in terms of repeat sampling. For example, the percentage confidence in a confidence interval is the probability that the interval will contain θ, not the probability that θ lies in the interval.

In contrast, the Bayesian approach treats θ as a random variable taking values in Θ. The investigator's information and beliefs about the possible values of θ before any observation of data are summarised by a prior distribution π(θ). When X = x is observed, the extra information about θ is combined with the prior to obtain the posterior distribution π(θ | x) for θ given X = x.

There has been a long-running argument between the two approaches. Recently, things have settled down, and Bayesian methods are seen to be appropriate in huge numbers of applications where one seeks to assess a probability about a "state of the world". For example, spam filters will assess the probability that a specific email is spam, even though from a frequentist's point of view, this is nonsense, because the email either is or is not spam, and it makes no sense to assign a probability to the email's being spam.

In Bayesian inference, we usually have some prior knowledge about the distribution of θ (e.g. that it lies between 0 and 1). After collecting some data, we find the posterior distribution of θ given X = x.

Definition (Prior and posterior distribution). The prior distribution of θ is the probability distribution of the value of θ before conducting the experiment. We usually write it as π(θ). The posterior distribution of θ is the probability distribution of the value of θ given an outcome of the experiment x. We write it as π(θ | x).

By Bayes' theorem, the distributions are related by

    π(θ | x) = fX(x | θ) π(θ) / fX(x).

Thus

    π(θ | x) ∝ fX(x | θ) π(θ),
    posterior ∝ likelihood × prior,

where the constant of proportionality is chosen to make the total mass of the posterior distribution equal to one. Usually, we use this form, instead of attempting to calculate fX(x).

It should be clear that the data enter through the likelihood, so the inference is automatically based on any sufficient statistic.

Example. Suppose I have 3 coins in my pocket. One is 3 : 1 in favour of tails, one is a fair coin, and one is 3 : 1 in favour of heads.


I randomly select one coin and flip it once, observing a head. What is the probability that I have chosen coin 3? Let X = 1 denote the event that I observe a head, X = 0 if a tail. Let θ denote the probability of a head. So θ is either 0.25, 0.5 or 0.75. Our prior distribution is π(θ = 0.25) = π(θ = 0.5) = π(θ = 0.75) = 1/3. x 1−x The probability mass function fX (x | θ) = θ (1 − θ) . So we have the following results:

    θ      π(θ)    fX(x = 1 | θ)    fX(x = 1 | θ) π(θ)    π(θ | x = 1)
    0.25   0.33    0.25             0.0825                0.167
    0.50   0.33    0.50             0.1650                0.333
    0.75   0.33    0.75             0.2475                0.500
    Sum    1.00    1.50             0.4950                1.000

So if we observe a head, then there is now a 50% chance that we have picked the third coin.
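The same table can be generated in a couple of lines; this sketch assumes numpy and is merely a check of the arithmetic.

import numpy as np

theta = np.array([0.25, 0.50, 0.75])
prior = np.array([1/3, 1/3, 1/3])

likelihood = theta                      # f(x = 1 | theta) = theta for a single head
unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()

print(posterior)                        # [0.167 0.333 0.5]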

Example. Suppose we are interested in the true mortality risk θ in a hospital H which is about to try a new operation. On average in the country, around 10% of the people die, but mortality rates in different hospitals vary from around 3% to around 20%. Hospital H has no deaths in their first 10 operations. What should we believe about θ?

Let Xi = 1 if the ith patient in H dies. Then

    fX(x | θ) = θ^{Σxi}(1 − θ)^{n−Σxi}.

Suppose a priori that θ ∼ beta(a, b) for some unknown a > 0, b > 0, so that

π(θ) ∝ θa−1(1 − θ)b−1.

Then the posterior is

    π(θ | x) ∝ fX(x | θ) π(θ) ∝ θ^{Σxi + a − 1}(1 − θ)^{n − Σxi + b − 1}.

We recognize this as beta(Σxi + a, n − Σxi + b). So

    π(θ | x) = θ^{Σxi + a − 1}(1 − θ)^{n − Σxi + b − 1} / B(Σxi + a, n − Σxi + b).

In practice, we need to find a beta prior distribution that matches our information from other hospitals. It turns out that the beta(a = 3, b = 27) prior distribution has mean 0.1 and satisfies P(0.03 < θ < 0.20) ≈ 0.9.

We then observe data with Σ xi = 0 and n = 10. So the posterior is beta(Σxi + a, n − Σxi + b) = beta(3, 37). This has a mean of 3/40 = 0.075.

This leads to a different conclusion than a frequentist analysis. Since nobody has died so far, the mle is 0, which does not seem plausible. Using a Bayesian approach, we obtain a mean higher than 0 because we take into account the data from other hospitals.

For this problem, a beta prior leads to a beta posterior. We say that the beta family is a conjugate family of prior distributions for Bernoulli samples.
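For the hospital example, the prior and posterior can be examined with scipy's beta distribution (an illustrative sketch, not from the notes):

from scipy.stats import beta

a, b = 3, 27                      # prior chosen to match information from other hospitals
n, deaths = 10, 0                 # hospital H: no deaths in the first 10 operations

prior = beta(a, b)
posterior = beta(a + deaths, b + n - deaths)     # beta(3, 37)

print(prior.mean(), prior.cdf(0.20) - prior.cdf(0.03))   # 0.1 and roughly 0.9
print(posterior.mean())                                   # 3/40 = 0.075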


Suppose that a = b = 1, so that π(θ) = 1 for 0 < θ < 1, the uniform distribution. Then the posterior is beta(Σxi + 1, n − Σxi + 1), with the following properties:

               mean                  mode         variance
    prior      1/2                   non-unique   1/12
    posterior  (Σxi + 1)/(n + 2)     Σxi/n        (Σxi + 1)(n − Σxi + 1)/[(n + 2)²(n + 3)]

Note that the mode of the posterior is the mle.

The posterior mean estimator, (Σxi + 1)/(n + 2), was discussed in Lecture 2, where we showed that this estimator has smaller mse than the mle for non-extreme values of θ. This is known as Laplace's estimator.

The posterior variance is bounded above by 1/(4(n + 3)); this is smaller than the prior variance, and is smaller for larger n. Again, note that the posterior automatically depends on the data only through the sufficient statistic.

After we come up with the posterior distribution, we have to decide what estimator to use. In the case above, we used the posterior mean, but this might not be the best estimator. To determine what is the "best" estimator, we first need a loss function. Let L(θ, a) be the loss incurred in estimating the value of a parameter to be a when the true value is θ. Common loss functions are quadratic loss L(θ, a) = (θ − a)² and absolute error loss L(θ, a) = |θ − a|, but we can have others.

When our estimate is a, the expected posterior loss is

    h(a) = ∫ L(θ, a) π(θ | x) dθ.

Definition (). The Bayes estimator θˆ is the estimator that minimises the expected posterior loss. For quadratic loss, Z h(a) = (a − θ)2π(θ | x) dθ.

Then h′(a) = 0 if

    ∫ (a − θ) π(θ | x) dθ = 0,

i.e. if

    a ∫ π(θ | x) dθ = ∫ θ π(θ | x) dθ.

Since ∫ π(θ | x) dθ = 1, the Bayes estimator is θ̂ = ∫ θ π(θ | x) dθ, the posterior mean.

For absolute error loss,

    h(a) = ∫ |θ − a| π(θ | x) dθ
         = ∫_{−∞}^{a} (a − θ) π(θ | x) dθ + ∫_{a}^{∞} (θ − a) π(θ | x) dθ
         = a ∫_{−∞}^{a} π(θ | x) dθ − ∫_{−∞}^{a} θ π(θ | x) dθ
           + ∫_{a}^{∞} θ π(θ | x) dθ − a ∫_{a}^{∞} π(θ | x) dθ.


Now h′(a) = 0 if

    ∫_{−∞}^{a} π(θ | x) dθ = ∫_{a}^{∞} π(θ | x) dθ.

This occurs when each side is 1/2. So θ̂ is the posterior median.

Example. Suppose that X1, ··· , Xn are iid N(µ, 1), and that a priori µ ∼ N(0, τ⁻²) for some τ. So τ is the certainty of our prior knowledge. The posterior is given by

    π(µ | x) ∝ fX(x | µ) π(µ)
             ∝ exp( −(1/2) Σ (xi − µ)² ) exp( −µ²τ²/2 )
             ∝ exp( −(1/2)(n + τ²) ( µ − Σxi/(n + τ²) )² ),

since we can regard n, τ and all the xi as constants in the normalisation term, and then complete the square with respect to µ. So the posterior distribution of µ given x is a normal distribution with mean Σxi/(n + τ²) and variance 1/(n + τ²).

The normal density is symmetric, and so the posterior mean and the posterior median have the same value Σxi/(n + τ²). This is the optimal estimator for both quadratic and absolute error loss.

Example. Suppose that X1, ··· , Xn are iid Poisson(λ) random variables, and λ has an exponential distribution with mean 1, so that π(λ) = e^{−λ}. The posterior distribution is given by

    π(λ | x) ∝ e^{−nλ} λ^{Σxi} · e^{−λ} = λ^{Σxi} e^{−(n+1)λ},   λ > 0,

which is gamma(Σxi + 1, n + 1). Hence under quadratic loss, our estimator is

    λ̂ = (Σxi + 1)/(n + 1),

the posterior mean. Under absolute error loss, λ̂ solves

    ∫₀^{λ̂} (n + 1)^{Σxi + 1} λ^{Σxi} e^{−(n+1)λ} / (Σxi)! dλ = 1/2.
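Both Bayes estimators for this example are easy to compute numerically; the sketch below assumes numpy and scipy and uses simulated data, so it is illustrative rather than part of the notes.

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(5)
lam_true, n = 2.0, 30
x = rng.poisson(lam_true, size=n)

shape, rate = x.sum() + 1, n + 1          # posterior is gamma(sum(x) + 1, n + 1)
posterior = gamma(a=shape, scale=1 / rate)

print(posterior.mean())                   # Bayes estimator under quadratic loss
print(posterior.ppf(0.5))                 # posterior median: Bayes estimator under absolute error loss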


2 Hypothesis testing

Often in statistics, we have some hypothesis to test. For example, we want to test whether a drug can lower the chance of a heart attack. Often, we will have two hypotheses to compare: the null hypothesis states that the drug is useless, while the alternative hypothesis states that the drug is useful. Quantitatively, suppose that the chance of heart attack without the drug is θ0 and the chance with the drug is θ. Then the null hypothesis is H0 : θ = θ0, while the alternative hypothesis is H1 : θ 6= θ0. It is important to note that the null hypothesis and alternative hypothesis are not on equal footing. By default, we assume the null hypothesis is true. For us to reject the null hypothesis, we need a lot of evidence to prove that. This is since we consider incorrectly rejecting the null hypothesis to be a much more serious problem than accepting it when we should not. For example, it is relatively okay to reject a drug when it is actually useful, but it is terrible to distribute drugs to patients when the drugs are actually useless. Alternatively, it is more serious to deem an innocent person guilty than to say a guilty person is innocent. In general, let X1, ··· ,Xn be iid, each taking values in X , each with unknown pdf/pmf f. We have two hypotheses, H0 and H1, about f. On the basis of data X = x, we make a choice between the two hypotheses. Example.

– A coin has P(Heads) = θ, and is thrown independently n times. We could 1 3 have H0 : θ = 2 versus H1 : θ = 4 .

– Suppose X1, ··· ,Xn are iid discrete random variables. We could have H0 : the distribution is Poisson with unknown mean, and H1 : the distribution is not Poisson.

– General parametric cases: Let X1, ··· ,Xn be iid with density f(x | θ). f is known while θ is unknown. Then our hypotheses are H0 : θ ∈ Θ0 and H1 : θ ∈ Θ1, with Θ0 ∩ Θ1 = ∅.

– We could have H0 : f = f0 and H1 : f = f1, where f0 and f1 are densities that are completely specified but do not come form the same . Definition (Simple and composite hypotheses). A simple hypothesis H specifies 1 f completely (e.g. H0 : θ = 2 ). Otherwise, H is a composite hypothesis.

2.1 Simple hypotheses

Definition (Critical region). For testing H0 against an alternative hypothesis n H1, a test procedure has to partition X into two disjoint exhaustive regions C and C¯, such that if x ∈ C, then H0 is rejected, and if x ∈ C¯, then H0 is not rejected. C is the critical region. When performing a test, we may either arrive at a correct conclusion, or make one of the two types of error: Definition (Type I and II error).


(i) Type I error: reject H0 when H0 is true.

(ii) Type II error: not rejecting H0 when H0 is false.

Definition (Size and power). When H0 and H1 are both simple, let

α = P(Type I error) = P(X ∈ C | H0 is true).

β = P(Type II error) = P(X 6∈ C | H1 is true).

The size of the test is α, and 1 − β is the power of the test to detect H1. If we have two simple hypotheses, a relatively straightforward test is the likelihood ratio test. Definition (Likelihood). The likelihood of a simple hypothesis H : θ = θ∗ given data x is ∗ Lx(H) = fX(x | θ = θ ).

The likelihood ratio of two simple hypotheses H0,H1 given data x is

Lx(H1) Λx(H0; H1) = . Lx(H0)

A likelihood ratio test (LR test) is one where the critical region C is of the form

C = {x :Λx(H0; H1) > k}

for some k. It turns out this rather simple test is “the best” in the following sense:

Lemma (Neyman-Pearson lemma). Suppose H0 : f = f0, H1 : f = f1, where f0 and f1 are continuous densities that are nonzero on the same regions. Then among all tests of size less than or equal to α, the test with the largest power is the likelihood ratio test of size α.

Proof. Under the likelihood ratio test, our critical region is

    C = { x : f1(x)/f0(x) > k },

where k is chosen such that α = P(reject H0 | H0) = P(X ∈ C | H0) = ∫_C f0(x) dx.

The probability of a Type II error is given by

    β = P(X ∉ C | f1) = ∫_{C̄} f1(x) dx.

Let C∗ be the critical region of any other test with size less than or equal to α. Let α∗ = P(X ∈ C∗ | f0) and β∗ = P(X ∉ C∗ | f1). We want to show β ≤ β∗.

We know α∗ ≤ α, i.e.

    ∫_{C∗} f0(x) dx ≤ ∫_C f0(x) dx.


Also, on C we have f1(x) > k f0(x), while on C̄ we have f1(x) ≤ k f0(x). So

    ∫_{C̄∗ ∩ C} f1(x) dx ≥ k ∫_{C̄∗ ∩ C} f0(x) dx,
    ∫_{C̄ ∩ C∗} f1(x) dx ≤ k ∫_{C̄ ∩ C∗} f0(x) dx.

Hence

    β − β∗ = ∫_{C̄} f1(x) dx − ∫_{C̄∗} f1(x) dx
           = ( ∫_{C̄ ∩ C∗} f1 dx + ∫_{C̄ ∩ C̄∗} f1 dx ) − ( ∫_{C̄∗ ∩ C} f1 dx + ∫_{C̄ ∩ C̄∗} f1 dx )
           = ∫_{C̄ ∩ C∗} f1(x) dx − ∫_{C̄∗ ∩ C} f1(x) dx
           ≤ k ∫_{C̄ ∩ C∗} f0(x) dx − k ∫_{C̄∗ ∩ C} f0(x) dx
           = k ( ∫_{C̄ ∩ C∗} f0 dx + ∫_{C ∩ C∗} f0 dx ) − k ( ∫_{C̄∗ ∩ C} f0 dx + ∫_{C ∩ C∗} f0 dx )
           = k(α∗ − α) ≤ 0.

[Diagram: the sample space split into C and C̄, overlaid with C∗ and C̄∗, indicating which intersections contribute to α, α∗ (under f0) and β, β∗ (under f1).]

Here we assumed the f0 and f1 are continuous densities. However, this assumption is only needed to ensure that the likelihood ratio test of exactly size α exists. Even with non-continuous distributions, the likelihood ratio test is still a good idea. In fact, you will show in the example sheets that for a discrete distribution, as long as a likelihood ratio test of exactly size α exists, the same result holds.

Example. Suppose X1, ··· , Xn are iid N(µ, σ0²), where σ0² is known. We want to find the best size α test of H0 : µ = µ0 against H1 : µ = µ1, where µ0 and µ1


are known fixed values with µ1 > µ0. Then

    Λx(H0; H1) = [ (2πσ0²)^{−n/2} exp( −(1/2σ0²) Σ(xi − µ1)² ) ] / [ (2πσ0²)^{−n/2} exp( −(1/2σ0²) Σ(xi − µ0)² ) ]
               = exp( (µ1 − µ0) n x̄/σ0² + n(µ0² − µ1²)/(2σ0²) ).

This is an increasing function of x̄, so for any k, Λx > k ⇔ x̄ > c for some c. Hence we reject H0 if x̄ > c, where c is chosen such that P(X̄ > c | H0) = α.

Under H0, X̄ ∼ N(µ0, σ0²/n), so Z = √n(X̄ − µ0)/σ0 ∼ N(0, 1). Since x̄ > c ⇔ z > c′ for some c′, the size α test rejects H0 if

    z = √n(x̄ − µ0)/σ0 > zα.

For example, suppose µ0 = 5, µ1 = 6, σ0 = 1, α = 0.05, n = 4 and x = (5.1, 5.5, 4.9, 5.3). So x̄ = 5.2. From tables, z0.05 = 1.645. We have z = √4(5.2 − 5)/1 = 0.4, and this is less than 1.645. So x is not in the rejection region.

We do not reject H0 at the 5% level and say that the data are consistent with H0. Note that this does not mean that we accept H0. While we don't have sufficient reason to believe it is false, we also don't have sufficient reason to believe it is true. This is called a z-test.

In this example, LR tests reject H0 if z > k for some constant k. The size of such a test is α = P(Z > k | H0) = 1 − Φ(k), and is decreasing as k increases. Our observed value z will be in the rejection region iff z > k ⇔ α > p∗ = P(Z > z | H0).

Definition (p-value). The quantity p∗ is called the p-value of our observed data x. For the example above, z = 0.4 and so p∗ = 1 − Φ(0.4) = 0.3446.

In general, the p-value is sometimes called the "observed significance level" of x. This is the probability under H0 of seeing data that is "more extreme" than our observed data x. Extreme observations are viewed as providing evidence against H0.
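The numbers in this z-test example are easy to reproduce; the following sketch assumes numpy and scipy and is only a check, not part of the notes.

import numpy as np
from scipy.stats import norm

x = np.array([5.1, 5.5, 4.9, 5.3])
mu0, sigma0, alpha = 5.0, 1.0, 0.05

z = np.sqrt(len(x)) * (x.mean() - mu0) / sigma0
print(z)                        # 0.4
print(norm.ppf(1 - alpha))      # 1.645: the critical value z_alpha
print(norm.sf(z))               # 0.3446: the p-value 1 - Phi(0.4)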

2.2 Composite hypotheses

For composite hypotheses like H : θ ≥ 0, the error probabilities do not have a single value. We define:

Definition (Power function). The power function is

W (θ) = P(X ∈ C | θ) = P(reject H0 | θ).

We want W (θ) to be small on H0 and large on H1. Definition (Size). The size of the test is

    α = sup_{θ∈Θ0} W(θ).


This is the worst possible size we can get. For θ ∈ Θ1, 1 − W(θ) = P(Type II error | θ).

Sometimes the Neyman-Pearson theory can be extended to one-sided alternatives. For example, in the previous example, we have shown that the most powerful size α test of H0 : µ = µ0 versus H1 : µ = µ1 (where µ1 > µ0) is given by

    C = { x : √n(x̄ − µ0)/σ0 > zα }.

The critical region depends on µ0, n, σ0 and α, and on the fact that µ1 > µ0. It does not depend on the particular value of µ1. This test is then uniformly most powerful of size α for testing H0 : µ = µ0 against H1 : µ > µ0.

Definition (Uniformly most powerful test). A test specified by a critical region C is a uniformly most powerful (UMP) size α test for testing H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 if

(i) sup_{θ∈Θ0} W(θ) = α.

(ii) For any other test C∗ with size ≤ α and with power function W∗, we have W(θ) ≥ W∗(θ) for all θ ∈ Θ1.

Note that these may not exist. However, the likelihood ratio test often works.

Example. Suppose X1, ··· , Xn are iid N(µ, σ0²) where σ0² is known, and we wish to test H0 : µ ≤ µ0 against H1 : µ > µ0.

First consider testing H0′ : µ = µ0 against H1′ : µ = µ1, where µ1 > µ0. The Neyman-Pearson test of size α of H0′ against H1′ has

    C = { x : √n(x̄ − µ0)/σ0 > zα }.

We show that C is in fact UMP for the composite hypotheses H0 against H1. For µ ∈ R, the power function is

    W(µ) = Pµ(reject H0)
         = Pµ( √n(X̄ − µ0)/σ0 > zα )
         = Pµ( √n(X̄ − µ)/σ0 > zα + √n(µ0 − µ)/σ0 )
         = 1 − Φ( zα + √n(µ0 − µ)/σ0 ).

To show this is UMP, first note that W(µ0) = α (by plugging in), and W(µ) is an increasing function of µ. So

    sup_{µ≤µ0} W(µ) = α,

and the first condition is satisfied.

For the second condition, observe that for any µ1 > µ0, the Neyman-Pearson size α test of H0′ vs H1′ has critical region C. Let C∗ and W∗ belong to any other test of H0 vs H1 of size ≤ α. Then C∗ can be regarded as a test of H0′ vs H1′ of size ≤ α, and the Neyman-Pearson lemma says that W∗(µ1) ≤ W(µ1). This holds for all µ1 > µ0. So the condition is satisfied and the test is UMP.


We now consider likelihood ratio tests for more general situations. Definition (Likelihood of a composite hypothesis). The likelihood of a composite hypothesis H : θ ∈ Θ given data x to be

    Lx(H) = sup_{θ∈Θ} f(x | θ).

So far we have considered disjoint hypotheses Θ0, Θ1, but we are not interested in any specific alternative. So it is easier to take Θ1 = Θ rather than Θ \ Θ0. Then

    Λx(H0; H1) = Lx(H1)/Lx(H0) = sup_{θ∈Θ1} f(x | θ) / sup_{θ∈Θ0} f(x | θ) ≥ 1,

with large values of Λ indicating departure from H0.

Example. Suppose that X1, ··· , Xn are iid N(µ, σ0²), with σ0² known, and we wish to test H0 : µ = µ0 against H1 : µ ≠ µ0 (for a given constant µ0). Here Θ0 = {µ0} and Θ = R. For the numerator, we have sup_Θ f(x | µ) = f(x | µ̂), where µ̂ is the mle. We know that µ̂ = x̄. Hence

    Λx(H0; H1) = [ (2πσ0²)^{−n/2} exp( −(1/2σ0²) Σ(xi − x̄)² ) ] / [ (2πσ0²)^{−n/2} exp( −(1/2σ0²) Σ(xi − µ0)² ) ].

Then H0 is rejected if Λx is large. To make our lives easier, we can use the logarithm instead:

    2 log Λ(H0; H1) = (1/σ0²) [ Σ(xi − µ0)² − Σ(xi − x̄)² ] = (n/σ0²)(x̄ − µ0)².

So we can reject H0 if we have

    √n |x̄ − µ0| / σ0 > c

for some c. We know that under H0, Z = √n(X̄ − µ0)/σ0 ∼ N(0, 1). So the size α generalised likelihood ratio test rejects H0 if

    √n |x̄ − µ0| / σ0 > z_{α/2}.

Alternatively, since n(X̄ − µ0)²/σ0² ∼ χ²₁ under H0, we reject H0 if

    n(x̄ − µ0)²/σ0² > χ²₁(α)

(check that z²_{α/2} = χ²₁(α)). Note that this is a two-tailed test, i.e. we reject H0 both for high and low values of x̄.
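The claimed identity z²_{α/2} = χ²₁(α) can be checked in a couple of lines with scipy (an aside, not from the notes):

from scipy.stats import norm, chi2

alpha = 0.05
print(norm.ppf(1 - alpha / 2) ** 2)   # 1.96...^2 = 3.8415
print(chi2.ppf(1 - alpha, df=1))      # 3.8415, the same number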


The next theorem allows us to use likelihood ratio tests even when we cannot find the exact relevant null distribution. First consider the “size” or “dimension” of our hypotheses: suppose that H0 imposes p independent restrictions on Θ. So for example, if Θ = {θ : θ = (θ1, ··· , θk)}, and we have

– H0 : θi1 = a1, θi2 = a2, ··· , θip = ap; or

– H0 : Aθ = b (with A p × k, b p × 1 given); or

– H0 : θi = fi(ϕ), i = 1, ··· , k for some ϕ = (ϕ1, ··· , ϕk−p).

We say Θ has k free parameters and Θ0 has k − p free parameters. We write |Θ0| = k − p and |Θ| = k.

Theorem (Generalized likelihood ratio theorem). Suppose Θ0 ⊆ Θ1 and |Θ1| − |Θ0| = p. Let X = (X1, ··· ,Xn) with all Xi iid. If H0 is true, then as n → ∞,

    2 log ΛX(H0; H1) ∼ χ²_p.

If H0 is not true, then 2 log Λ tends to be larger. We reject H0 if 2 log Λ > c, 2 where c = χp(α) for a test of approximately size α.

We will not prove this result here. In our example above, |Θ1| − |Θ0| = 1, 2 and in this case, we saw that under H0, 2 log Λ ∼ χ1 exactly for all n in that particular case, rather than just approximately.

2.3 Tests of goodness-of-fit and independence

2.3.1 Goodness-of-fit of a fully-specified null distribution

So far, we have considered relatively simple cases where we are attempting to figure out, say, the mean. However, in reality, more complicated scenarios arise. For example, we might want to know whether a die is fair, i.e. whether the probability of getting each number is exactly 1/6. Our null hypothesis would be that p1 = p2 = ··· = p6 = 1/6, while the alternative hypothesis allows any possible values of the pi.

In general, suppose the observation space X is partitioned into k sets, and let pi be the probability that an observation is in set i, for i = 1, ··· , k. We want to test "H0: the pi's arise from a fully specified model" against "H1: the pi's are unrestricted (apart from the obvious pi ≥ 0, Σ pi = 1)".

Example. The following table lists the birth months of admissions to Oxford and Cambridge in 2012.

Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug 470 515 470 457 473 381 466 457 437 396 384 394

Is this compatible with a uniform distribution over the year? Out of n independent observations, let Ni be the number of observations in ith set. So (N1, ··· ,Nk) ∼ multinomial(k; p1, ··· , pk). For a generalized likelihood ratio test of H0, we need to find the maximised likelihood under H0 and H1.


Under H1, like(p1, ··· , pk) ∝ p1^{n1} ··· pk^{nk}. So the log likelihood is l = constant + Σ ni log pi. We want to maximise this subject to Σ pi = 1. Using a Lagrange multiplier, we find that the mle is p̂i = ni/n. Also |Θ1| = k − 1 (not k, since the pi must sum to 1).

Under H0, the values of pi are specified completely, say pi = p̃i. So |Θ0| = 0. Using our formula for p̂i, we find that

    2 log Λ = 2 log( p̂1^{n1} ··· p̂k^{nk} / (p̃1^{n1} ··· p̃k^{nk}) ) = 2 Σ ni log( ni/(n p̃i) ).    (1)

Here |Θ1| − |Θ0| = k − 1. So we reject H0 if 2 log Λ > χ²_{k−1}(α) for an approximate size α test.

Under H0 (no effect of month of birth), p̃i is the proportion of births in month i in 1993/1994 in the whole population; this is not simply proportional to the number of days in each month (or even worse, taken to be 1/12), as there is for example an excess of September births (the "Christmas effect"). Then

    2 log Λ = 2 Σ ni log( ni/(n p̃i) ) = 44.86.

P(χ²₁₁ > 44.86) = 3 × 10⁻⁹, which is our p-value. Since this is certainly less than 0.001, we can reject H0 at the 0.1% level, or can say the result is "significant at the 0.1% level".

The traditional levels for comparison are α = 0.05, 0.01, 0.001, roughly corresponding to "evidence", "strong evidence" and "very strong evidence".
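The statistic in (1) is straightforward to compute once the null proportions p̃i are known. The notes use national birth proportions that are not reproduced here, so the sketch below (numpy and scipy assumed) uses proportions based on days per month purely as a placeholder to show the mechanics; it will not reproduce the quoted 44.86.

import numpy as np
from scipy.stats import chi2

# Observed admissions by birth month, Sep through Aug.
n_i = np.array([470, 515, 470, 457, 473, 381, 466, 457, 437, 396, 384, 394])

# Placeholder null proportions: days per month (the lecture used national birth data instead).
days = np.array([30, 31, 30, 31, 31, 28, 31, 30, 31, 30, 31, 31])
p_tilde = days / days.sum()

n = n_i.sum()
two_log_lambda = 2 * np.sum(n_i * np.log(n_i / (n * p_tilde)))

print(two_log_lambda)
print(chi2.sf(two_log_lambda, df=len(n_i) - 1))   # p-value against chi^2_11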

A similar common situation has H0 : pi = pi(θ) for some parameter θ and H1 as before. Now |Θ0| is the number of independent parameters to be estimated under H0. ˆ Under H0, we find mle θ by maximizing ni log pi(θ), and then

    2 log Λ = 2 log( p̂1^{n1} ··· p̂k^{nk} / (p1(θ̂)^{n1} ··· pk(θ̂)^{nk}) ) = 2 Σ ni log( ni/(n pi(θ̂)) ).    (2)

The degrees of freedom are k − 1 − |Θ0|.

2.3.2 Pearson’s chi-squared test

Notice that the two log likelihood ratios are of the same form. In general, let oi = ni (the observed number) and let ei = n p̃i or n pi(θ̂) (the expected number). Let δi = oi − ei. Then

    2 log Λ = 2 Σ oi log(oi/ei)
            = 2 Σ (ei + δi) log(1 + δi/ei)
            = 2 Σ (ei + δi) ( δi/ei − δi²/(2ei²) + O(δi³) )
            = 2 Σ ( δi + δi²/ei − δi²/(2ei) + O(δi³) ).


We know that Σ δi = 0, since Σ ei = Σ oi. So

    2 log Λ ≈ Σ δi²/ei = Σ (oi − ei)²/ei.

This is known as Pearson's chi-squared test.

Example. Mendel crossed 556 smooth yellow male peas with wrinkled green peas. From the progeny, let

(i) N1 be the number of smooth yellow peas,

(ii) N2 be the number of smooth green peas,

(iii) N3 be the number of wrinkled yellow peas,

(iv) N4 be the number of wrinkled green peas. We wish to test the goodness of fit of the model

    H0 : (p1, p2, p3, p4) = (9/16, 3/16, 3/16, 1/16).

Suppose we observe (n1, n2, n3, n4) = (315, 108, 102, 31). We find the expected values (e1, e2, e3, e4) = (312.75, 104.25, 104.25, 34.75). The actual 2 log Λ = 0.618, and the approximation Σ (oi − ei)²/ei = 0.604.

Here |Θ0| = 0 and |Θ1| = 4 − 1 = 3. So we refer the test statistics to χ²₃. Since χ²₃(0.05) = 7.815, we see that neither value is significant at 5%. So there is no evidence against Mendel's theory. In fact, the p-value is approximately P(χ²₃ > 0.6) ≈ 0.90. This is a really good fit, so good that people suspect the numbers were not genuine.

Example. In a genetics problem, each individual has one of three possible genotypes, with probabilities p1, p2, p3. Suppose we wish to test H0 : pi = pi(θ), where

    p1(θ) = θ²,   p2(θ) = 2θ(1 − θ),   p3(θ) = (1 − θ)²,

for some θ ∈ (0, 1). We observe Ni = ni. Under H0, the mle θ̂ is found by maximising

    Σ ni log pi(θ) = 2n1 log θ + n2 log(2θ(1 − θ)) + 2n3 log(1 − θ).

We find that θ̂ = (2n1 + n2)/(2n). Also, |Θ0| = 1 and |Θ1| = 2. After conducting an experiment, we can substitute pi(θ̂) into (2), or find the corresponding Pearson's chi-squared statistic, and refer it to χ²₁.
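Returning to the Mendel example above, the quoted values 0.618 and 0.604 can be verified with a few lines (numpy and scipy assumed; this check is not part of the notes):

import numpy as np
from scipy.stats import chi2

o = np.array([315, 108, 102, 31])
p0 = np.array([9, 3, 3, 1]) / 16
e = o.sum() * p0                          # (312.75, 104.25, 104.25, 34.75)

two_log_lambda = 2 * np.sum(o * np.log(o / e))
pearson = np.sum((o - e) ** 2 / e)

print(two_log_lambda, pearson)            # 0.618 and 0.604
print(chi2.sf(pearson, df=3))             # p-value, roughly 0.90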

2.3.3 Testing independence in contingency tables

Definition (Contingency table). A contingency table is a table in which observations or individuals are classified according to one or more criteria.

Example. 500 people with recent car changes were asked about their previous and new cars. The results are as follows:


                             New car
                     Large   Medium   Small
    Previous  Large     56       52      42
    car       Medium    50       83      67
              Small     18       51      81

This is a two-way contingency table: each person is classified according to previous car size and new car size.

Consider a two-way contingency table with r rows and c columns. For i = 1, ··· , r and j = 1, ··· , c, let pij be the probability that an individual selected from the population under consideration is classified in row i and column j (i.e. in the (i, j) cell of the table).

Let pi+ = P(in row i) and p+j = P(in column j). Then we must have p++ = Σi Σj pij = 1.

Suppose a random sample of n individuals is taken, and let nij be the number of these classified in the (i, j) cell of the table. Let ni+ = Σj nij and n+j = Σi nij, so n++ = n. We have

    (N11, ··· , N1c, N21, ··· , Nrc) ∼ multinomial(n; p11, ··· , p1c, p21, ··· , prc).

We may be interested in testing the null hypothesis that the two classifications are independent. So we test

– H0: pij = pi+p+j for all i, j, i.e. independence of columns and rows.

– H1: pij are unrestricted.

Of course we have the usual restrictions, like p++ = 1 and pij ≥ 0.

Under H1, the mles are p̂ij = nij/n.

Under H0, the mles are p̂i+ = ni+/n and p̂+j = n+j/n.

Write oij = nij and eij = n p̂i+ p̂+j = ni+ n+j/n. Then

    2 log Λ = 2 Σi Σj oij log(oij/eij) ≈ Σi Σj (oij − eij)²/eij,

using the same approximating steps as for Pearson's chi-squared test.

We have |Θ1| = rc − 1, because under H1 the pij's must sum to one. Also, |Θ0| = (r − 1) + (c − 1), because p1+, ··· , pr+ must satisfy Σi pi+ = 1 and p+1, ··· , p+c must satisfy Σj p+j = 1. So

|Θ1| − |Θ0| = rc − 1 − (r − 1) − (c − 1) = (r − 1)(c − 1).

Example. In our previous example, we wish to test H0: the new and previous car sizes are independent. The observed data are:

                             New car
                     Large   Medium   Small   Total
    Previous  Large     56       52      42     150
    car       Medium    50       83      67     200
              Small     18       51      81     150
              Total    124      186     190     500


while the expected values under H0 are

                             New car
                     Large   Medium   Small   Total
    Previous  Large    37.2     55.8    57.0     150
    car       Medium   49.6     74.4    76.0     200
              Small    37.2     55.8    57.0     150
              Total   124      186     190      500

Note that the margins are the same. It is quite clear that the observed and expected values do not match well, but we can find the p-value to be sure. We have

    Σi Σj (oij − eij)²/eij = 36.20,

and the degrees of freedom are (3 − 1)(3 − 1) = 4. From the tables, χ²₄(0.05) = 9.488 and χ²₄(0.01) = 13.28. So our observed value of 36.20 is significant at the 1% level, i.e. there is strong evidence against H0. So we conclude that the new and previous car sizes are not independent.
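The whole calculation is available in scipy as a chi-squared test of independence; the sketch below (not part of the notes) reproduces the expected table and the statistic 36.20.

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[56, 52, 42],
                     [50, 83, 67],
                     [18, 51, 81]])

stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(expected)        # matches the expected table above
print(stat, dof)       # 36.20 with 4 degrees of freedom
print(p_value)         # far below 0.01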

2.4 Tests of homogeneity, and connections to confidence intervals

2.4.1 Tests of homogeneity

Example. 150 patients were randomly allocated to three groups of 50 patients each. Two groups were given a new drug at different dosage levels, and the third group received a placebo. The responses were as shown in the table below.

                 Improved   No difference   Worse   Total
    Placebo            18              17      15      50
    Half dose          20              10      20      50
    Full dose          25              13      12      50
    Total              63              40      47     150

Here the row totals are fixed in advance, in contrast to our last section, where the row totals are random variables. For the above, we may be interested in testing H0 : the probability of “improved” is the same for each of the three treatment groups, and so are the probabilities of “no difference” and “worse”, i.e. H0 says that we have homogeneity down the rows. In general, we have independent observations from r multinomial distributions, each of which has c categories, i.e. we observe an r ×c table (nij), for i = 1, ··· , r and j = 1, ··· , c, where

(Ni1, ··· ,Nic) ∼ multinomial(ni+, pi1, ··· , pic)

independently for each i = 1, ··· , r. We want to test

H0 : p1j = p2j = ··· = prj = pj,


for j = 1, ··· , c, and H1 : pij are unrestricted.

Using H1, for any matrix of probabilities (pij),

    like((pij)) = Π_{i=1}^{r} [ ni+! / (ni1! ··· nic!) ] pi1^{ni1} ··· pic^{nic},

and

    log like = constant + Σi Σj nij log pij.

Using Lagrangian methods, we find that p̂ij = nij/ni+.

Under H0,

    log like = constant + Σj n+j log pj.

By Lagrangian methods, we have p̂j = n+j/n++. Hence

    2 log Λ = 2 Σi Σj nij log( p̂ij/p̂j ) = 2 Σi Σj nij log( nij/(ni+ n+j/n++) ),

which is the same as what we had last time, when the row totals were unrestricted!

We have |Θ1| = r(c − 1) and |Θ0| = c − 1. So the degrees of freedom are r(c − 1) − (c − 1) = (r − 1)(c − 1), and under H0, 2 log Λ is approximately χ²_{(r−1)(c−1)}. Again, it is exactly the same as what we had last time!

We reject H0 if 2 log Λ > χ²_{(r−1)(c−1)}(α) for an approximate size α test.

If we let oij = nij, eij = ni+ n+j/n++, and δij = oij − eij, then using the same approximating steps as for Pearson's chi-squared test, we obtain

    2 log Λ ≈ Σi Σj (oij − eij)²/eij.

Example. Continuing our previous example, our data is

                 Improved   No difference   Worse   Total
    Placebo            18              17      15      50
    Half dose          20              10      20      50
    Full dose          25              13      12      50
    Total              63              40      47     150

The expected values under H0 are

                 Improved   No difference   Worse   Total
    Placebo            21            13.3    15.7      50
    Half dose          21            13.3    15.7      50
    Full dose          21            13.3    15.7      50
    Total              63            40      47       150


We find 2 log Λ = 5.129, and we refer this to χ²₄. Clearly this is not significant, as the mean of χ²₄ is 4, and 5.129 is the sort of value we would expect to arise solely by chance.

We can also look up the p-value: from tables, χ²₄(0.05) = 9.488, so our observed value is not significant at 5%, and the data are consistent with H0. We conclude that there is no evidence for a difference between the drug at the given doses and the placebo.

For interest,

    Σi Σj (oij − eij)²/eij = 5.173,

giving the same conclusion.
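The two statistics quoted here can be reproduced with numpy (a check, not from the notes); note that the arithmetic is identical to the independence test, even though the sampling model here fixes the row totals.

import numpy as np

observed = np.array([[18, 17, 15],
                     [20, 10, 20],
                     [25, 13, 12]])

row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row * col / observed.sum()

two_log_lambda = 2 * np.sum(observed * np.log(observed / expected))
pearson = np.sum((observed - expected) ** 2 / expected)

print(two_log_lambda, pearson)   # 5.129 and 5.173, each referred to chi^2_4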

2.4.2 Confidence intervals and hypothesis tests

Confidence intervals or sets can be obtained by inverting hypothesis tests, and vice versa.

Definition (Acceptance region). The acceptance region $A$ of a test is the complement of the critical region $C$.

Note that when we say "acceptance", we really mean "non-rejection"! The name is purely for historical reasons.

Theorem (Duality of hypothesis tests and confidence intervals). Suppose $X_1, \cdots, X_n$ have joint pdf $f_{\mathbf{X}}(\mathbf{x} \mid \theta)$ for $\theta \in \Theta$.

(i) Suppose that for every θ0 ∈ Θ there is a size α test of H0 : θ = θ0. Denote the acceptance region by A(θ0). Then the set I(X) = {θ : X ∈ A(θ)} is a 100(1 − α)% confidence set for θ.

(ii) Suppose $I(\mathbf{X})$ is a $100(1-\alpha)\%$ confidence set for $\theta$. Then $A(\theta_0) = \{\mathbf{X} : \theta_0 \in I(\mathbf{X})\}$ is an acceptance region for a size $\alpha$ test of $H_0: \theta = \theta_0$.

Intuitively, this says that "confidence intervals" and "hypothesis acceptance/rejection" are the same thing. After gathering some data $\mathbf{X}$, we can produce a, say, 95% confidence interval $(a, b)$. Then if we want to test the hypothesis $H_0: \theta = \theta_0$, we simply have to check whether $\theta_0 \in (a, b)$. On the other hand, if we have a test for $H_0: \theta = \theta_0$, then the confidence interval is all $\theta_0$ for which we would accept $H_0: \theta = \theta_0$.

Proof. First note that θ0 ∈ I(X) iff X ∈ A(θ0). For (i), since the test is size α, we have

P(accept H0 | H0 is true) = P(X ∈ A(θ0) | θ = θ0) = 1 − α. And so P(θ0 ∈ I(X) | θ = θ0) = P(X ∈ A(θ0) | θ = θ0) = 1 − α. For (ii), since I(X) is a 100(1 − α)% confidence set, we have

P (θ0 ∈ I(X) | θ = θ0) = 1 − α.

So $P(\mathbf{X} \in A(\theta_0) \mid \theta = \theta_0) = P(\theta_0 \in I(\mathbf{X}) \mid \theta = \theta_0) = 1 - \alpha$.


Example. Suppose $X_1, \cdots, X_n$ are iid $N(\mu, 1)$ random variables and we want a 95% confidence set for $\mu$.

One way is to use the theorem and find the confidence set that belongs to the hypothesis test that we found in the previous example. We find a test of size 0.05 of $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$ that rejects $H_0$ when $|\sqrt{n}(\bar{x} - \mu_0)| > 1.96$ (where 1.96 is the upper 2.5% point of $N(0, 1)$).

Then $I(\mathbf{X}) = \{\mu : \mathbf{X} \in A(\mu)\} = \{\mu : |\sqrt{n}(\bar{X} - \mu)| < 1.96\}$. So a 95% confidence set for $\mu$ is $(\bar{X} - 1.96/\sqrt{n}, \bar{X} + 1.96/\sqrt{n})$.
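As an illustration of the duality (my own sketch, not from the notes; it assumes numpy and scipy), inverting the size-0.05 z-test over a grid of candidate values of $\mu_0$ recovers the same interval as the closed-form expression:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, mu_true = 25, 3.0
x = rng.normal(mu_true, 1.0, size=n)      # iid N(mu, 1) sample
xbar = x.mean()
z = norm.ppf(0.975)                       # 1.96

def accept(mu0):
    # size-0.05 test of H0: mu = mu0; accept iff |sqrt(n)(xbar - mu0)| <= 1.96
    return abs(np.sqrt(n) * (xbar - mu0)) <= z

grid = np.linspace(xbar - 2.0, xbar + 2.0, 100001)
mask = np.array([accept(m) for m in grid])
inverted = grid[mask]                     # confidence set by inverting the test
print(inverted.min(), inverted.max())
print(xbar - z / np.sqrt(n), xbar + z / np.sqrt(n))   # closed-form 95% interval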

2.5 Multivariate normal theory

2.5.1 Multivariate normal distribution

So far, we have only worked with scalar random variables or a vector of iid random variables. In general, we can have a random (column) vector $\mathbf{X} = (X_1, \cdots, X_n)^T$, where the $X_i$ are correlated. The mean of this vector is given by
\[
\mu = E[\mathbf{X}] = (E(X_1), \cdots, E(X_n))^T = (\mu_1, \cdots, \mu_n)^T.
\]
Instead of just the variance, we have the covariance matrix
\[
\mathrm{cov}(\mathbf{X}) = E[(\mathbf{X} - \mu)(\mathbf{X} - \mu)^T] = (\mathrm{cov}(X_i, X_j))_{ij},
\]
provided they exist, of course.

We can multiply the vector $\mathbf{X}$ by an $m \times n$ matrix $A$. Then we have
\[
E[A\mathbf{X}] = A\mu, \quad\text{and}\quad \mathrm{cov}(A\mathbf{X}) = A\,\mathrm{cov}(\mathbf{X})\,A^T. \tag{$*$}
\]
The last one comes from
\begin{align*}
\mathrm{cov}(A\mathbf{X}) &= E[(A\mathbf{X} - E[A\mathbf{X}])(A\mathbf{X} - E[A\mathbf{X}])^T]\\
&= E[A(\mathbf{X} - E\mathbf{X})(\mathbf{X} - E\mathbf{X})^T A^T]\\
&= A\,E[(\mathbf{X} - E\mathbf{X})(\mathbf{X} - E\mathbf{X})^T]\,A^T = A\,\mathrm{cov}(\mathbf{X})\,A^T.
\end{align*}
If we have two random vectors $\mathbf{V}, \mathbf{W}$, we can define the covariance $\mathrm{cov}(\mathbf{V}, \mathbf{W})$ to be a matrix with $(i, j)$th element $\mathrm{cov}(V_i, W_j)$. Then $\mathrm{cov}(A\mathbf{X}, B\mathbf{X}) = A\,\mathrm{cov}(\mathbf{X})\,B^T$.

An important distribution is the multivariate normal distribution.

Definition (Multivariate normal distribution). $\mathbf{X}$ has a multivariate normal distribution if, for every $\mathbf{t} \in \mathbb{R}^n$, the random variable $\mathbf{t}^T\mathbf{X}$ (i.e. $\mathbf{t} \cdot \mathbf{X}$) has a normal distribution. If $E[\mathbf{X}] = \mu$ and $\mathrm{cov}(\mathbf{X}) = \Sigma$, we write $\mathbf{X} \sim N_n(\mu, \Sigma)$.

Note that $\Sigma$ is symmetric, and it is positive semi-definite because by $(*)$,
\[
\mathbf{t}^T \Sigma \mathbf{t} = \mathrm{var}(\mathbf{t}^T\mathbf{X}) \geq 0.
\]

So what is the pdf of a multivariate normal? And what is the moment generating function? Recall that a (univariate) normal X ∼ N(µ, σ2) has density

\[
f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right),
\]
with moment generating function
\[
M_X(s) = E[e^{sX}] = \exp\left(\mu s + \frac{1}{2}\sigma^2 s^2\right).
\]
Hence for any $\mathbf{t}$, the moment generating function of $\mathbf{t}^T\mathbf{X}$ is given by
\[
M_{\mathbf{t}^T\mathbf{X}}(s) = E[e^{s\mathbf{t}^T\mathbf{X}}] = \exp\left(\mathbf{t}^T\mu\, s + \frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t}\, s^2\right).
\]
Hence $\mathbf{X}$ has mgf
\[
M_{\mathbf{X}}(\mathbf{t}) = E[e^{\mathbf{t}^T\mathbf{X}}] = M_{\mathbf{t}^T\mathbf{X}}(1) = \exp\left(\mathbf{t}^T\mu + \frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t}\right). \tag{$\dagger$}
\]
Proposition.

(i) If $\mathbf{X} \sim N_n(\mu, \Sigma)$, and $A$ is an $m \times n$ matrix, then $A\mathbf{X} \sim N_m(A\mu, A\Sigma A^T)$.

(ii) If $\mathbf{X} \sim N_n(\mathbf{0}, \sigma^2 I)$, then
\[
\frac{|\mathbf{X}|^2}{\sigma^2} = \frac{\mathbf{X}^T\mathbf{X}}{\sigma^2} = \sum \frac{X_i^2}{\sigma^2} \sim \chi^2_n.
\]
Instead of writing $|\mathbf{X}|^2/\sigma^2 \sim \chi^2_n$, we often just say $|\mathbf{X}|^2 \sim \sigma^2\chi^2_n$.

Proof.

(i) See example sheet 3.

(ii) Immediate from the definition of $\chi^2_n$.

Proposition. Let $\mathbf{X} \sim N_n(\mu, \Sigma)$. We split $\mathbf{X}$ up into two parts: $\mathbf{X} = \begin{pmatrix}\mathbf{X}_1\\ \mathbf{X}_2\end{pmatrix}$, where $\mathbf{X}_i$ is an $n_i \times 1$ column vector and $n_1 + n_2 = n$. Similarly write
\[
\mu = \begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}, \qquad \Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix},
\]
where $\Sigma_{ij}$ is an $n_i \times n_j$ matrix. Then

(i) $\mathbf{X}_i \sim N_{n_i}(\mu_i, \Sigma_{ii})$

(ii) X1 and X2 are independent iff Σ12 = 0. Proof. (i) See example sheet 3.

(ii) Note that by symmetry of $\Sigma$, $\Sigma_{12} = 0$ if and only if $\Sigma_{21} = 0$.

From $(\dagger)$, $M_{\mathbf{X}}(\mathbf{t}) = \exp(\mathbf{t}^T\mu + \frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t})$ for each $\mathbf{t} \in \mathbb{R}^n$. We write $\mathbf{t} = \begin{pmatrix}\mathbf{t}_1\\ \mathbf{t}_2\end{pmatrix}$. Then the mgf is equal to
\[
M_{\mathbf{X}}(\mathbf{t}) = \exp\left(\mathbf{t}^T\mu + \frac{1}{2}\mathbf{t}_1^T\Sigma_{11}\mathbf{t}_1 + \frac{1}{2}\mathbf{t}_2^T\Sigma_{22}\mathbf{t}_2 + \frac{1}{2}\mathbf{t}_1^T\Sigma_{12}\mathbf{t}_2 + \frac{1}{2}\mathbf{t}_2^T\Sigma_{21}\mathbf{t}_1\right).
\]
From (i), we know that $M_{\mathbf{X}_i}(\mathbf{t}_i) = \exp(\mathbf{t}_i^T\mu_i + \frac{1}{2}\mathbf{t}_i^T\Sigma_{ii}\mathbf{t}_i)$. So $M_{\mathbf{X}}(\mathbf{t}) = M_{\mathbf{X}_1}(\mathbf{t}_1)M_{\mathbf{X}_2}(\mathbf{t}_2)$ for all $\mathbf{t}$ if and only if $\Sigma_{12} = 0$.


Proposition. When $\Sigma$ is positive definite, then $\mathbf{X}$ has pdf
\[
f_{\mathbf{X}}(\mathbf{x}; \mu, \Sigma) = \frac{1}{\sqrt{|\Sigma|}}\left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^T\Sigma^{-1}(\mathbf{x} - \mu)\right).
\]
Note that $\Sigma$ is always positive semi-definite. The condition just forbids the case $|\Sigma| = 0$, since this would lead to dividing by zero.
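A small simulation (my own sketch, assuming numpy) of property $(*)$: for $\mathbf{X} \sim N_n(\mu, \Sigma)$, the empirical mean and covariance of $A\mathbf{X}$ match $A\mu$ and $A\Sigma A^T$.

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])       # symmetric, positive definite
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 2.0, -1.0]])          # an arbitrary 2 x 3 matrix

X = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are samples of X
AX = X @ A.T                                          # samples of AX

print(AX.mean(axis=0), A @ mu)            # empirical mean of AX vs A mu
print(np.cov(AX, rowvar=False))           # empirical covariance of AX
print(A @ Sigma @ A.T)                    # theoretical A Sigma A^T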

2.5.2 Normal random samples

We wish to use our knowledge about multivariate normals to study univariate normal data. In particular, we want to prove the following:

Theorem (Joint distribution of $\bar{X}$ and $S_{XX}$). Suppose $X_1, \cdots, X_n$ are iid $N(\mu, \sigma^2)$, $\bar{X} = \frac{1}{n}\sum X_i$, and $S_{XX} = \sum(X_i - \bar{X})^2$. Then

(i) $\bar{X} \sim N(\mu, \sigma^2/n)$

(ii) $S_{XX}/\sigma^2 \sim \chi^2_{n-1}$.

(iii) X¯ and SXX are independent.

Proof. We can write the joint distribution as $\mathbf{X} \sim N_n(\mu, \sigma^2 I)$, where $\mu = (\mu, \mu, \cdots, \mu)^T$.

Let $A$ be an $n \times n$ orthogonal matrix with the first row all $1/\sqrt{n}$ (the other rows are not important). One possible such matrix is
\[
A = \begin{pmatrix}
\frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \cdots & \frac{1}{\sqrt{n}} \\
\frac{1}{\sqrt{2\times 1}} & \frac{-1}{\sqrt{2\times 1}} & 0 & \cdots & 0 \\
\frac{1}{\sqrt{3\times 2}} & \frac{1}{\sqrt{3\times 2}} & \frac{-2}{\sqrt{3\times 2}} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \cdots & \frac{-(n-1)}{\sqrt{n(n-1)}}
\end{pmatrix}
\]

Now define Y = AX. Then

\[
\mathbf{Y} \sim N_n(A\mu, A\sigma^2 I A^T) = N_n(A\mu, \sigma^2 I).
\]
We have
\[
A\mu = (\sqrt{n}\mu, 0, \cdots, 0)^T.
\]
So $Y_1 \sim N(\sqrt{n}\mu, \sigma^2)$ and $Y_i \sim N(0, \sigma^2)$ for $i = 2, \cdots, n$. Also, $Y_1, \cdots, Y_n$ are independent, since every non-diagonal term of the covariance matrix is 0.

But from the definition of $A$, we have
\[
Y_1 = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i = \sqrt{n}\bar{X}.
\]


So $\sqrt{n}\bar{X} \sim N(\sqrt{n}\mu, \sigma^2)$, or $\bar{X} \sim N(\mu, \sigma^2/n)$. Also,
\begin{align*}
Y_2^2 + \cdots + Y_n^2 &= \mathbf{Y}^T\mathbf{Y} - Y_1^2\\
&= \mathbf{X}^T A^T A\mathbf{X} - Y_1^2\\
&= \mathbf{X}^T\mathbf{X} - n\bar{X}^2\\
&= \sum_{i=1}^n X_i^2 - n\bar{X}^2\\
&= \sum_{i=1}^n (X_i - \bar{X})^2\\
&= S_{XX}.
\end{align*}
So $S_{XX} = Y_2^2 + \cdots + Y_n^2 \sim \sigma^2\chi^2_{n-1}$. Finally, since $Y_1$ and $Y_2, \cdots, Y_n$ are independent, so are $\bar{X}$ and $S_{XX}$.
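The construction in this proof is easy to check numerically. The sketch below (my own, assuming numpy) builds the orthogonal matrix $A$ displayed above and verifies that $Y_1 = \sqrt{n}\bar{X}$ and that $Y_2^2 + \cdots + Y_n^2$ equals $S_{XX}$.

import numpy as np

rng = np.random.default_rng(2)
n = 8
x = rng.normal(5.0, 2.0, size=n)          # a sample X_1, ..., X_n

# orthogonal matrix with first row all 1/sqrt(n), as in the proof
A = np.zeros((n, n))
A[0, :] = 1.0 / np.sqrt(n)
for k in range(1, n):
    A[k, :k] = 1.0 / np.sqrt(k * (k + 1))
    A[k, k] = -k / np.sqrt(k * (k + 1))

print(np.allclose(A @ A.T, np.eye(n)))    # True: A is orthogonal

y = A @ x
xbar = x.mean()
print(y[0], np.sqrt(n) * xbar)                        # Y_1 = sqrt(n) * Xbar
print(np.sum(y[1:] ** 2), np.sum((x - xbar) ** 2))    # Y_2^2+...+Y_n^2 = S_XX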

2.6 Student's t-distribution

Definition (t-distribution). Suppose that $Z$ and $Y$ are independent, $Z \sim N(0, 1)$ and $Y \sim \chi^2_k$. Then
\[
T = \frac{Z}{\sqrt{Y/k}}
\]
is said to have a t-distribution on $k$ degrees of freedom, and we write $T \sim t_k$.

The density of $t_k$ turns out to be
\[
f_T(t) = \frac{\Gamma((k+1)/2)}{\Gamma(k/2)}\frac{1}{\sqrt{\pi k}}\left(1 + \frac{t^2}{k}\right)^{-(k+1)/2}.
\]
This density is symmetric, bell-shaped, and has a maximum at $t = 0$, which is rather like the standard normal density. However, it can be shown that $P(T > t) > P(Z > t)$, i.e. the t-distribution has a "fatter" tail. Also, as $k \to \infty$, $t_k$ approaches a normal distribution.

Proposition. If $k > 1$, then $E_k(T) = 0$. If $k > 2$, then $\mathrm{var}_k(T) = \frac{k}{k-2}$. If $k = 2$, then $\mathrm{var}_k(T) = \infty$. In all other cases, the values are undefined. In particular, the $k = 1$ case has undefined mean and variance. This is known as the Cauchy distribution.

Notation. We write $t_k(\alpha)$ for the upper $100\alpha\%$ point of the $t_k$ distribution, so that $P(T > t_k(\alpha)) = \alpha$.

Why would we define such a weird distribution? The typical application is to study random samples with unknown mean and unknown variance.

Let $X_1, \cdots, X_n$ be iid $N(\mu, \sigma^2)$. Then $\bar{X} \sim N(\mu, \sigma^2/n)$. So $Z = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1)$.

Also, $S_{XX}/\sigma^2 \sim \chi^2_{n-1}$ and is independent of $\bar{X}$, and hence of $Z$. So
\[
\frac{\sqrt{n}(\bar{X} - \mu)/\sigma}{\sqrt{S_{XX}/((n-1)\sigma^2)}} \sim t_{n-1},
\]
or
\[
\frac{\sqrt{n}(\bar{X} - \mu)}{\sqrt{S_{XX}/(n-1)}} \sim t_{n-1}.
\]
We write $\tilde{\sigma}^2 = \frac{S_{XX}}{n-1}$ (note that this is the unbiased estimator). Then a $100(1-\alpha)\%$ confidence interval for $\mu$ is found from
\[
1 - \alpha = P\left(-t_{n-1}\left(\frac{\alpha}{2}\right) \leq \frac{\sqrt{n}(\bar{X} - \mu)}{\tilde{\sigma}} \leq t_{n-1}\left(\frac{\alpha}{2}\right)\right).
\]
This has endpoints
\[
\bar{X} \pm \frac{\tilde{\sigma}}{\sqrt{n}}\, t_{n-1}\left(\frac{\alpha}{2}\right).
\]
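A minimal sketch of this interval (my own, assuming numpy and scipy) for a simulated normal sample with both parameters unknown:

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
n = 12
x = rng.normal(10.0, 2.0, size=n)         # iid N(mu, sigma^2), both "unknown"

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
sigma_tilde = np.sqrt(Sxx / (n - 1))      # square root of the unbiased estimator
alpha = 0.05
tcrit = t.ppf(1 - alpha / 2, df=n - 1)    # t_{n-1}(alpha/2)

half_width = sigma_tilde / np.sqrt(n) * tcrit
print(xbar - half_width, xbar + half_width)   # 95% confidence interval for mu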


3 Linear models

3.1 Linear models

Linear models can be used to explain or model the relationship between a response (or dependent) variable, and one or more explanatory variables (or covariates or predictors). As the name suggests, we assume the relationship is linear.

Example. How do motor insurance claim rates (response) depend on the age and sex of the driver, and where they live (explanatory variables)?

It is important to note that (unless otherwise specified), we do not assume normality in our calculations here.

Suppose we have $p$ covariates $x_j$, and we have $n$ observations $Y_i$. We assume $n > p$, or else we can pick the parameters to fit our data exactly. Then each observation can be written as
\[
Y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i \tag{$*$}
\]

for i = 1, ··· , n. Here

– β1, ··· , βp are unknown, fixed parameters we wish to work out (with n > p)

– xi1, ··· , xip are the values of the p covariates for the ith response (which are all known).

– ε1, ··· , εn are independent (or possibly just uncorrelated) random variables with mean 0 and variance σ2.

We think of the $\beta_j x_{ij}$ terms as the causal effects of the $x_{ij}$, and of $\varepsilon_i$ as a random fluctuation (error term). Then we clearly have

– $E(Y_i) = \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$.

– $\mathrm{var}(Y_i) = \mathrm{var}(\varepsilon_i) = \sigma^2$.

– Y1, ··· ,Yn are independent.

Note that $(*)$ is linear in the parameters $\beta_1, \cdots, \beta_p$. Obviously the real world can be much more complicated. But this is much easier to work with.

Example. For each of 24 males, the maximum volume of oxygen uptake in the blood and the time taken to run 2 miles (in seconds) were measured. We want to know how the time taken depends on oxygen uptake. We might get the results

Oxygen:  42.3  53.1  42.1  50.1  42.5  42.5  47.8  49.9
Time:     918   805   892   962   968   907   770   743

Oxygen:  36.2  49.7  41.5  46.2  48.2  43.2  51.8  53.3
Time:    1045   810   927   813   858   860   760   747

Oxygen:  53.3  47.2  56.9  47.8  48.7  53.7  60.6  56.7
Time:     743   803   683   844   755   700   748   775


For each individual $i$, we let $Y_i$ be the time to run 2 miles, and $x_i$ be the maximum volume of oxygen uptake, $i = 1, \cdots, 24$. We might want to fit a straight line to it. So a possible model is
\[
Y_i = a + bx_i + \varepsilon_i,
\]
where the $\varepsilon_i$ are independent random variables with variance $\sigma^2$, and $a$ and $b$ are constants.

The subscripts in the equation make it tempting to write everything as matrices:
\[
\mathbf{Y} = \begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix},\quad
X = \begin{pmatrix}x_{11} & \cdots & x_{1p}\\ \vdots & \ddots & \vdots\\ x_{n1} & \cdots & x_{np}\end{pmatrix},\quad
\beta = \begin{pmatrix}\beta_1\\ \vdots\\ \beta_p\end{pmatrix},\quad
\varepsilon = \begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix}.
\]
Then the equation becomes
\[
\mathbf{Y} = X\beta + \varepsilon. \tag{2}
\]
We also have

– $E(\varepsilon) = \mathbf{0}$.

– $\mathrm{cov}(\mathbf{Y}) = \sigma^2 I$.

We assume throughout that $X$ has full rank $p$, i.e. the columns are linearly independent, and that the error variance is the same for each observation. We say this is the homoscedastic case, as opposed to heteroscedastic.

Example. Continuing our example, we have, in matrix form,
\[
\mathbf{Y} = \begin{pmatrix}Y_1\\ \vdots\\ Y_{24}\end{pmatrix},\quad
X = \begin{pmatrix}1 & x_1\\ \vdots & \vdots\\ 1 & x_{24}\end{pmatrix},\quad
\beta = \begin{pmatrix}a\\ b\end{pmatrix},\quad
\varepsilon = \begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_{24}\end{pmatrix}.
\]
Then $\mathbf{Y} = X\beta + \varepsilon$.

Definition (Least squares estimator). In a linear model $\mathbf{Y} = X\beta + \varepsilon$, the least squares estimator $\hat{\beta}$ of $\beta$ minimizes
\[
S(\beta) = \|\mathbf{Y} - X\beta\|^2 = (\mathbf{Y} - X\beta)^T(\mathbf{Y} - X\beta) = \sum_{i=1}^n (Y_i - x_{ij}\beta_j)^2,
\]
with implicit summation over $j$.

If we plot the points on a graph, then the least squares estimator minimizes the (square of the) vertical distance between the points and the line.

To minimize it, we want
\[
\left.\frac{\partial S}{\partial \beta_k}\right|_{\beta = \hat{\beta}} = 0
\]
for all $k$. So
\[
-2 x_{ik}(Y_i - x_{ij}\hat{\beta}_j) = 0
\]


for each $k$ (with implicit summation over $i$ and $j$), i.e.
\[
x_{ik}x_{ij}\hat{\beta}_j = x_{ik}Y_i
\]
for all $k$. Putting this back in matrix form, we have

Proposition. The least squares estimator satisfies
\[
X^T X\hat{\beta} = X^T\mathbf{Y}. \tag{3}
\]
We could also have derived this by completing the square of $(\mathbf{Y} - X\beta)^T(\mathbf{Y} - X\beta)$, but that would be more complicated.

In order to find $\hat{\beta}$, our life would be much easier if $X^TX$ has an inverse. Fortunately, it always does. We assumed that $X$ is of full rank $p$. Then
\[
\mathbf{t}^T X^T X\mathbf{t} = (X\mathbf{t})^T(X\mathbf{t}) = \|X\mathbf{t}\|^2 > 0
\]
for $\mathbf{t} \neq \mathbf{0}$ in $\mathbb{R}^p$ (the last inequality holds since if there were a $\mathbf{t}$ with $\|X\mathbf{t}\| = 0$, then we would have produced a linear combination of the columns of $X$ that gives $\mathbf{0}$). So $X^TX$ is positive definite, and hence has an inverse. So
\[
\hat{\beta} = (X^TX)^{-1}X^T\mathbf{Y}, \tag{4}
\]
which is linear in $\mathbf{Y}$.

We have
\[
E(\hat{\beta}) = (X^TX)^{-1}X^T E[\mathbf{Y}] = (X^TX)^{-1}X^TX\beta = \beta.
\]
So $\hat{\beta}$ is an unbiased estimator for $\beta$. Also,
\[
\mathrm{cov}(\hat{\beta}) = (X^TX)^{-1}X^T \mathrm{cov}(\mathbf{Y})\, X(X^TX)^{-1} = \sigma^2(X^TX)^{-1}, \tag{5}
\]

since cov Y = σ2I.
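In practice $\hat{\beta}$ is obtained by solving the normal equations (3) numerically rather than by forming $(X^TX)^{-1}$ explicitly. A sketch with simulated data (my own, assuming numpy):

import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix
beta_true = np.array([2.0, -1.0, 0.5])
sigma = 1.5
Y = X @ beta_true + rng.normal(0.0, sigma, size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)          # solves equation (3)
cov_beta_hat = sigma ** 2 * np.linalg.inv(XtX)    # equation (5), sigma known here

print(beta_hat)        # close to beta_true
print(cov_beta_hat)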

3.2 Simple linear regression

What we did above was rather complicated. If we have a simple linear regression model
\[
Y_i = a + bx_i + \varepsilon_i,
\]
then we can reparameterise it to
\[
Y_i = a' + b(x_i - \bar{x}) + \varepsilon_i, \tag{6}
\]
where $\bar{x} = \sum x_i/n$ and $a' = a + b\bar{x}$. Since $\sum(x_i - \bar{x}) = 0$, this leads to simplified calculations.

In matrix form,
\[
X = \begin{pmatrix}1 & (x_1 - \bar{x})\\ \vdots & \vdots\\ 1 & (x_{24} - \bar{x})\end{pmatrix}.
\]
Since $\sum(x_i - \bar{x}) = 0$, the off-diagonal entries of $X^TX$ are all 0, and we have
\[
X^TX = \begin{pmatrix}n & 0\\ 0 & S_{xx}\end{pmatrix},
\]


where $S_{xx} = \sum(x_i - \bar{x})^2$. Hence
\[
(X^TX)^{-1} = \begin{pmatrix}\frac{1}{n} & 0\\ 0 & \frac{1}{S_{xx}}\end{pmatrix}.
\]
So
\[
\hat{\beta} = (X^TX)^{-1}X^T\mathbf{Y} = \begin{pmatrix}\bar{Y}\\ \frac{S_{xY}}{S_{xx}}\end{pmatrix},
\]
where $S_{xY} = \sum Y_i(x_i - \bar{x})$. Hence the estimated intercept is $\hat{a}' = \bar{y}$, and the estimated gradient is
\begin{align*}
\hat{b} = \frac{S_{xy}}{S_{xx}} &= \frac{\sum_i y_i(x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2}\\
&= \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_i(x_i - \bar{x})^2\sum_i(y_i - \bar{y})^2}} \times \sqrt{\frac{S_{yy}}{S_{xx}}} \tag{$*$}\\
&= r \times \sqrt{\frac{S_{yy}}{S_{xx}}}.
\end{align*}
We have $(*)$ since $\sum\bar{y}(x_i - \bar{x}) = 0$, so we can replace $y_i$ by $y_i - \bar{y}$ in the numerator; the remaining square roots just come from multiplying and dividing by the same quantity. So the gradient is the Pearson product-moment correlation coefficient $r$ times the ratio of the empirical standard deviations of the $y$'s and $x$'s (note that the gradient is the same whether or not the $x$'s are standardised to have mean 0).

Hence we get $\mathrm{cov}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$, and so from our expression for $(X^TX)^{-1}$,
\[
\mathrm{var}(\hat{a}') = \mathrm{var}(\bar{Y}) = \frac{\sigma^2}{n}, \qquad \mathrm{var}(\hat{b}) = \frac{\sigma^2}{S_{xx}}.
\]
Note that these estimators are uncorrelated. Note also that these results are obtained without any explicit distributional assumptions.

Example. Continuing our previous oxygen/time example, we have $\bar{y} = 826.5$, $S_{xx} = 783.5 = 28.0^2$, $S_{xy} = -10077$, $S_{yy} = 444^2$, $r = -0.81$, $\hat{b} = -12.9$.

Theorem (Gauss–Markov theorem). In a full rank linear model, let $\hat{\beta}$ be the least squares estimator of $\beta$ and let $\beta^*$ be any other unbiased estimator for $\beta$ which is linear in the $Y_i$'s. Then
\[
\mathrm{var}(\mathbf{t}^T\hat{\beta}) \leq \mathrm{var}(\mathbf{t}^T\beta^*)
\]
for all $\mathbf{t} \in \mathbb{R}^p$. We say that $\hat{\beta}$ is the best linear unbiased estimator of $\beta$ (BLUE).

Proof. Since $\beta^*$ is linear in the $Y_i$'s, $\beta^* = A\mathbf{Y}$ for some $p \times n$ matrix $A$.

Since $\beta^*$ is an unbiased estimator, we must have $E[\beta^*] = \beta$. However, since $\beta^* = A\mathbf{Y}$, $E[\beta^*] = AE[\mathbf{Y}] = AX\beta$. So we must have $\beta = AX\beta$. Since this holds for any $\beta$, we must have $AX = I_p$. Now
\begin{align*}
\mathrm{cov}(\beta^*) &= E[(\beta^* - \beta)(\beta^* - \beta)^T]\\
&= E[(A\mathbf{Y} - \beta)(A\mathbf{Y} - \beta)^T]\\
&= E[(AX\beta + A\varepsilon - \beta)(AX\beta + A\varepsilon - \beta)^T].
\end{align*}


Since $AX\beta = \beta$, this is equal to
\[
E[A\varepsilon(A\varepsilon)^T] = A(\sigma^2 I)A^T = \sigma^2 AA^T.
\]
Now let $\beta^* - \hat{\beta} = (A - (X^TX)^{-1}X^T)\mathbf{Y} = B\mathbf{Y}$, for some $B$. Then
\[
BX = AX - (X^TX)^{-1}X^TX = I_p - I_p = 0.
\]
By definition, we have $A\mathbf{Y} = B\mathbf{Y} + (X^TX)^{-1}X^T\mathbf{Y}$, and this is true for all $\mathbf{Y}$. So $A = B + (X^TX)^{-1}X^T$. Hence
\begin{align*}
\mathrm{cov}(\beta^*) &= \sigma^2 AA^T\\
&= \sigma^2(B + (X^TX)^{-1}X^T)(B + (X^TX)^{-1}X^T)^T\\
&= \sigma^2(BB^T + (X^TX)^{-1})\\
&= \sigma^2 BB^T + \mathrm{cov}(\hat{\beta}).
\end{align*}
Note that in the second line, the cross-terms disappear since $BX = 0$. So for any $\mathbf{t} \in \mathbb{R}^p$, we have
\begin{align*}
\mathrm{var}(\mathbf{t}^T\beta^*) = \mathbf{t}^T\mathrm{cov}(\beta^*)\mathbf{t} &= \mathbf{t}^T\mathrm{cov}(\hat{\beta})\mathbf{t} + \mathbf{t}^TBB^T\mathbf{t}\,\sigma^2\\
&= \mathrm{var}(\mathbf{t}^T\hat{\beta}) + \sigma^2\|B^T\mathbf{t}\|^2\\
&\geq \mathrm{var}(\mathbf{t}^T\hat{\beta}).
\end{align*}
Taking $\mathbf{t} = (0, \cdots, 1, 0, \cdots, 0)^T$ with a 1 in the $i$th position, we have
\[
\mathrm{var}(\hat{\beta}_i) \leq \mathrm{var}(\beta_i^*).
\]

Definition (Fitted values and residuals). Yˆ = Xβˆ is the vector of fitted values. These are what our model says Y should be. R = Y − Yˆ is the vector of residuals. These are the deviations of our model from reality. The residual sum of squares is

\[
\mathrm{RSS} = \|\mathbf{R}\|^2 = \mathbf{R}^T\mathbf{R} = (\mathbf{Y} - X\hat{\beta})^T(\mathbf{Y} - X\hat{\beta}).
\]
We can give these a geometric interpretation. Note that
\[
X^T\mathbf{R} = X^T(\mathbf{Y} - \hat{\mathbf{Y}}) = X^T\mathbf{Y} - X^TX\hat{\beta} = 0
\]
by our formula for $\hat{\beta}$. So $\mathbf{R}$ is orthogonal to the column space of $X$.

Write $\hat{\mathbf{Y}} = X\hat{\beta} = X(X^TX)^{-1}X^T\mathbf{Y} = P\mathbf{Y}$, where $P = X(X^TX)^{-1}X^T$. Then $P$ represents an orthogonal projection of $\mathbb{R}^n$ onto the space spanned by the columns of $X$, i.e. it projects the actual data $\mathbf{Y}$ to the fitted values $\hat{\mathbf{Y}}$. Then $\mathbf{R}$ is the part of $\mathbf{Y}$ orthogonal to the column space of $X$.

The projection matrix $P$ is idempotent and symmetric, i.e. $P^2 = P$ and $P^T = P$.
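The geometric picture can be verified directly. The following sketch (my own, assuming numpy) checks that $P$ is symmetric and idempotent, that the residuals are orthogonal to the columns of $X$, and that $\mathrm{tr}(P) = p$.

import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T      # projection (hat) matrix
Y_fit = P @ Y                             # fitted values
R = Y - Y_fit                             # residuals

print(np.allclose(P @ P, P), np.allclose(P, P.T))  # idempotent and symmetric
print(np.allclose(X.T @ R, 0.0))                   # R orthogonal to col(X)
print(np.trace(P))                                 # equals rank(X) = p = 2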


3.3 Linear models with normal assumptions

So far, we have not assumed any particular distribution for our variables. In particular, we have not assumed that they are normal. So what further information can we obtain by assuming normality?

Example. Suppose we want to measure the resistivity of silicon wafers. We have five instruments, and five wafers were measured by each instrument (so we have 25 wafers in total). We assume that the silicon wafers are all the same, and want to see whether the instruments are consistent with each other. The results are as follows:

                Wafer 1   Wafer 2   Wafer 3   Wafer 4   Wafer 5
Instrument 1     130.5     112.4     118.9     125.7     134.0
Instrument 2     130.4     138.2     116.7     132.6     104.2
Instrument 3     113.0     120.5     128.9     103.4     118.1
Instrument 4     128.0     117.5     114.9     114.9      98.8
Instrument 5     121.2     110.5     118.5     100.5     120.9

Let Yi,j be the resistivity of the jth wafer measured by instrument i, where i, j = 1, ··· , 5. A possible model is

\[
Y_{i,j} = \mu_i + \varepsilon_{i,j},
\]
where the $\varepsilon_{ij}$ are independent random variables such that $E[\varepsilon_{ij}] = 0$ and $\mathrm{var}(\varepsilon_{ij}) = \sigma^2$, and the $\mu_i$'s are unknown constants.

This can be written in matrix form, with
\[
\mathbf{Y} = \begin{pmatrix}Y_{1,1}\\ \vdots\\ Y_{1,5}\\ Y_{2,1}\\ \vdots\\ Y_{2,5}\\ \vdots\\ Y_{5,1}\\ \vdots\\ Y_{5,5}\end{pmatrix},\quad
X = \begin{pmatrix}
1 & 0 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
1 & 0 & \cdots & 0\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 1 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 1\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & 1
\end{pmatrix},\quad
\beta = \begin{pmatrix}\mu_1\\ \mu_2\\ \mu_3\\ \mu_4\\ \mu_5\end{pmatrix},\quad
\varepsilon = \begin{pmatrix}\varepsilon_{1,1}\\ \vdots\\ \varepsilon_{1,5}\\ \varepsilon_{2,1}\\ \vdots\\ \varepsilon_{2,5}\\ \vdots\\ \varepsilon_{5,1}\\ \vdots\\ \varepsilon_{5,5}\end{pmatrix}.
\]
Then $\mathbf{Y} = X\beta + \varepsilon$. We have
\[
X^TX = \begin{pmatrix}5 & 0 & \cdots & 0\\ 0 & 5 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 5\end{pmatrix}.
\]


Hence
\[
(X^TX)^{-1} = \begin{pmatrix}\frac{1}{5} & 0 & \cdots & 0\\ 0 & \frac{1}{5} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \frac{1}{5}\end{pmatrix}.
\]
So we have
\[
\hat{\mu} = (X^TX)^{-1}X^T\mathbf{Y} = \begin{pmatrix}\bar{Y}_1\\ \vdots\\ \bar{Y}_5\end{pmatrix}.
\]
The residual sum of squares is
\[
\mathrm{RSS} = \sum_{i=1}^5\sum_{j=1}^5 (Y_{i,j} - \hat{\mu}_i)^2 = \sum_{i=1}^5\sum_{j=1}^5 (Y_{i,j} - \bar{Y}_i)^2 = 2170.
\]
This has $n - p = 25 - 5 = 20$ degrees of freedom. We will later see that $\tilde{\sigma} = \sqrt{\mathrm{RSS}/(n-p)} = 10.4$.

Note that we still haven't used normality! We now make a normal assumption:
\[
\mathbf{Y} = X\beta + \varepsilon,\quad \varepsilon \sim N_n(\mathbf{0}, \sigma^2 I),\quad \mathrm{rank}(X) = p < n.
\]
This is a special case of the linear model we just had, so all previous results hold.

Since $\mathbf{Y} \sim N_n(X\beta, \sigma^2 I)$, the log-likelihood is
\[
l(\beta, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}S(\beta),
\]
where $S(\beta) = (\mathbf{Y} - X\beta)^T(\mathbf{Y} - X\beta)$. If we want to maximize $l$ with respect to $\beta$, we have to maximize the only term containing $\beta$, i.e. $-\frac{1}{2\sigma^2}S(\beta)$; equivalently, we minimize $S(\beta)$. So

Proposition. Under normal assumptions the maximum likelihood estimator for a linear model is
\[
\hat{\beta} = (X^TX)^{-1}X^T\mathbf{Y},
\]
which is the same as the least squares estimator.

This isn't a coincidence! Historically, when Gauss devised the normal distribution, he designed it so that the least squares estimator is the same as the maximum likelihood estimator.

To obtain the MLE for $\sigma^2$, we require
\[
\left.\frac{\partial l}{\partial \sigma^2}\right|_{\hat{\beta}, \hat{\sigma}^2} = 0,
\]
i.e.
\[
-\frac{n}{2\hat{\sigma}^2} + \frac{S(\hat{\beta})}{2\hat{\sigma}^4} = 0.
\]


So
\[
\hat{\sigma}^2 = \frac{1}{n}S(\hat{\beta}) = \frac{1}{n}(\mathbf{Y} - X\hat{\beta})^T(\mathbf{Y} - X\hat{\beta}) = \frac{1}{n}\mathrm{RSS}.
\]
Our ultimate goal now is to show that $\hat{\beta}$ and $\hat{\sigma}^2$ are independent. Then we can apply our other standard results such as the t-distribution.

First recall that the matrix $P = X(X^TX)^{-1}X^T$ that projects $\mathbf{Y}$ to $\hat{\mathbf{Y}}$ is idempotent and symmetric. We will prove the following properties of it:

Lemma.

(i) If $\mathbf{Z} \sim N_n(\mathbf{0}, \sigma^2 I)$ and $A$ is $n \times n$, symmetric, idempotent with rank $r$, then $\mathbf{Z}^TA\mathbf{Z} \sim \sigma^2\chi^2_r$.

(ii) For a symmetric idempotent matrix $A$, $\mathrm{rank}(A) = \mathrm{tr}(A)$.

Proof.

(i) Since $A$ is idempotent, $A^2 = A$ by definition. So the eigenvalues of $A$ are either 0 or 1 (since $\lambda\mathbf{x} = A\mathbf{x} = A^2\mathbf{x} = \lambda^2\mathbf{x}$). Since $A$ is also symmetric, it is diagonalizable. So there exists an orthogonal $Q$ such that
\[
\Lambda = Q^TAQ = \mathrm{diag}(\lambda_1, \cdots, \lambda_n) = \mathrm{diag}(1, \cdots, 1, 0, \cdots, 0)
\]
with $r$ copies of 1 and $n - r$ copies of 0.

Let $\mathbf{W} = Q^T\mathbf{Z}$. So $\mathbf{Z} = Q\mathbf{W}$. Then $\mathbf{W} \sim N_n(\mathbf{0}, \sigma^2 I)$, since $\mathrm{cov}(\mathbf{W}) = Q^T\sigma^2 IQ = \sigma^2 I$. Then
\[
\mathbf{Z}^TA\mathbf{Z} = \mathbf{W}^TQ^TAQ\mathbf{W} = \mathbf{W}^T\Lambda\mathbf{W} = \sum_{i=1}^r W_i^2 \sim \sigma^2\chi^2_r.
\]

(ii)

rank(A) = rank(Λ) = tr(Λ) = tr(QT AQ) = tr(AQT Q) = tr A

Theorem. For the normal linear model $\mathbf{Y} \sim N_n(X\beta, \sigma^2 I)$,

(i) $\hat{\beta} \sim N_p(\beta, \sigma^2(X^TX)^{-1})$

(ii) $\mathrm{RSS} \sim \sigma^2\chi^2_{n-p}$, and so $\hat{\sigma}^2 \sim \frac{\sigma^2}{n}\chi^2_{n-p}$.

(iii) $\hat{\beta}$ and $\hat{\sigma}^2$ are independent.

The proof is not particularly elegant — it is just a whole lot of linear algebra!

Proof.


– We have βˆ = (XT X)−1XT Y. Call this CY for later use. Then βˆ has a normal distribution with mean

(XT X)−1XT (Xβ) = β

and covariance

(XT X)−1XT (σ2I)[(XT X)−1XT ]T = σ2(XT X)−1.

So
\[
\hat{\beta} \sim N_p(\beta, \sigma^2(X^TX)^{-1}).
\]
– Our previous lemma says that $\mathbf{Z}^TA\mathbf{Z} \sim \sigma^2\chi^2_r$. So we pick our $\mathbf{Z}$ and $A$ so that $\mathbf{Z}^TA\mathbf{Z} = \mathrm{RSS}$, and $r$, the degrees of freedom of $A$, is $n - p$.

Let $\mathbf{Z} = \mathbf{Y} - X\beta$ and $A = (I_n - P)$, where $P = X(X^TX)^{-1}X^T$. We first check that the conditions of the lemma hold: since $\mathbf{Y} \sim N_n(X\beta, \sigma^2 I)$, we have $\mathbf{Z} = \mathbf{Y} - X\beta \sim N_n(\mathbf{0}, \sigma^2 I)$.

Since P is idempotent, In − P also is (check!). We also have

rank(In − P ) = tr(In − P ) = n − p. Therefore the conditions of the lemma hold. To get the final useful result, we want to show that the RSS is indeed ZT AZ. We simplify the expressions of RSS and ZT AZ and show that they are equal:

\[
\mathbf{Z}^TA\mathbf{Z} = (\mathbf{Y} - X\beta)^T(I_n - P)(\mathbf{Y} - X\beta) = \mathbf{Y}^T(I_n - P)\mathbf{Y},
\]
noting the fact that $(I_n - P)X = 0$.

Writing R = Y − Yˆ = (In − P )Y, we have

\[
\mathrm{RSS} = \mathbf{R}^T\mathbf{R} = \mathbf{Y}^T(I_n - P)\mathbf{Y},
\]
using the symmetry and idempotence of $I_n - P$.

Hence $\mathrm{RSS} = \mathbf{Z}^TA\mathbf{Z} \sim \sigma^2\chi^2_{n-p}$. Then
\[
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n} \sim \frac{\sigma^2}{n}\chi^2_{n-p}.
\]
– Let $V = \begin{pmatrix}\hat{\beta}\\ \mathbf{R}\end{pmatrix} = D\mathbf{Y}$, where $D = \begin{pmatrix}C\\ I_n - P\end{pmatrix}$ is a $(p + n) \times n$ matrix. Since $\mathbf{Y}$ is multivariate normal, $V$ is multivariate normal with
\begin{align*}
\mathrm{cov}(V) = D\sigma^2 ID^T &= \sigma^2\begin{pmatrix}CC^T & C(I_n - P)^T\\ (I_n - P)C^T & (I_n - P)(I_n - P)^T\end{pmatrix}\\
&= \sigma^2\begin{pmatrix}CC^T & C(I_n - P)\\ (I_n - P)C^T & (I_n - P)\end{pmatrix}\\
&= \sigma^2\begin{pmatrix}CC^T & 0\\ 0 & I_n - P\end{pmatrix},
\end{align*}


using $C(I_n - P) = 0$ (indeed $C(I_n - P) = (X^TX)^{-1}X^T(I_n - P) = 0$ since $(I_n - P)X = 0$ — check!). Hence $\hat{\beta}$ and $\mathbf{R}$ are independent, since the off-diagonal covariance terms are 0. So $\hat{\beta}$ and $\mathrm{RSS} = \mathbf{R}^T\mathbf{R}$ are independent. So $\hat{\beta}$ and $\hat{\sigma}^2$ are independent.

From (ii), $E(\mathrm{RSS}) = \sigma^2(n - p)$. So $\tilde{\sigma}^2 = \frac{\mathrm{RSS}}{n-p}$ is an unbiased estimator of $\sigma^2$. $\tilde{\sigma}$ is often known as the residual standard error on $n - p$ degrees of freedom.

3.4 The F distribution

Definition (F distribution). Suppose $U$ and $V$ are independent with $U \sim \chi^2_m$ and $V \sim \chi^2_n$. Then $X = \dfrac{U/m}{V/n}$ is said to have an $F$-distribution on $m$ and $n$ degrees of freedom, and we write $X \sim F_{m,n}$.

Since $U$ and $V$ have means $m$ and $n$ respectively, $U/m$ and $V/n$ are approximately 1. So $F$ is often approximately 1.

It should be very clear from the definition that

Proposition. If X ∼ Fm,n, then 1/X ∼ Fn,m.

We write $F_{m,n}(\alpha)$ for the upper $100\alpha\%$ point of the $F_{m,n}$-distribution, so that if $X \sim F_{m,n}$, then $P(X > F_{m,n}(\alpha)) = \alpha$.

Suppose that we have tabulated the upper 5% points of the $F_{n,m}$ distributions. Using this information, it is easy to find the lower 5% point of $F_{m,n}$, since $P(F_{m,n} < 1/x) = P(F_{n,m} > x)$, which is where the above proposition comes in useful.

Note that it is immediate from the definitions of $t_n$ and $F_{1,n}$ that if $Y \sim t_n$, then $Y^2 \sim F_{1,n}$, i.e. it is a ratio of independent $\chi^2_1$ and $\chi^2_n$ variables (each divided by its degrees of freedom).
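These relationships are easy to check with scipy (my own sketch; the specific m, n are arbitrary):

from scipy.stats import f, t

m, n = 3, 10
alpha = 0.05

upper = f.ppf(1 - alpha, m, n)                  # F_{m,n}(alpha), the upper 5% point
lower_via_swap = 1.0 / f.ppf(1 - alpha, n, m)   # lower 5% point via 1/X ~ F_{n,m}
print(upper, lower_via_swap, f.ppf(alpha, m, n))  # last two values agree

# if Y ~ t_n then Y^2 ~ F_{1,n}: the corresponding upper points match
print(t.ppf(1 - alpha / 2, n) ** 2, f.ppf(1 - alpha, 1, n))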

3.5 Inference for β

We know that $\hat{\beta} \sim N_p(\beta, \sigma^2(X^TX)^{-1})$. So
\[
\hat{\beta}_j \sim N(\beta_j, \sigma^2(X^TX)^{-1}_{jj}).
\]
The standard error of $\hat{\beta}_j$ is defined to be
\[
\mathrm{SE}(\hat{\beta}_j) = \sqrt{\tilde{\sigma}^2(X^TX)^{-1}_{jj}},
\]
where $\tilde{\sigma}^2 = \mathrm{RSS}/(n - p)$. Unlike the actual variance $\sigma^2(X^TX)^{-1}_{jj}$, the standard error is calculable from our data. Then
\[
\frac{\hat{\beta}_j - \beta_j}{\mathrm{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j - \beta_j}{\sqrt{\tilde{\sigma}^2(X^TX)^{-1}_{jj}}} = \frac{(\hat{\beta}_j - \beta_j)\big/\sqrt{\sigma^2(X^TX)^{-1}_{jj}}}{\sqrt{\mathrm{RSS}/((n - p)\sigma^2)}}.
\]
By writing it in this somewhat weird form, we now recognize both the numerator and the denominator.


The numerator is a standard normal $N(0, 1)$, and the denominator is an independent $\sqrt{\chi^2_{n-p}/(n - p)}$, as we have previously shown. But a standard normal divided by an independent $\sqrt{\chi^2_k/k}$ is, by definition, a $t_k$ random variable. So
\[
\frac{\hat{\beta}_j - \beta_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t_{n-p}.
\]
So a $100(1-\alpha)\%$ confidence interval for $\beta_j$ has end points $\hat{\beta}_j \pm \mathrm{SE}(\hat{\beta}_j)\, t_{n-p}\left(\frac{\alpha}{2}\right)$.

In particular, if we want to test $H_0: \beta_j = 0$, we use the fact that under $H_0$, $\dfrac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t_{n-p}$.
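A sketch (my own, assuming numpy and scipy) of this t-based inference for a single coefficient of a simulated normal linear model:

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(6)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
RSS = np.sum((Y - X @ beta_hat) ** 2)
sigma_tilde2 = RSS / (n - p)

j = 1                                     # test H0: beta_j = 0
se = np.sqrt(sigma_tilde2 * XtX_inv[j, j])
t_stat = beta_hat[j] / se
tcrit = t.ppf(0.975, n - p)
print(t_stat, abs(t_stat) > tcrit)        # reject H0 at the 5% level?
print(beta_hat[j] - se * tcrit, beta_hat[j] + se * tcrit)  # 95% CI for beta_j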

3.6 Simple linear regression

We can apply our results to the case of simple linear regression. We have
\[
Y_i = a' + b(x_i - \bar{x}) + \varepsilon_i,
\]
where $\bar{x} = \sum x_i/n$ and the $\varepsilon_i$ are iid $N(0, \sigma^2)$ for $i = 1, \cdots, n$. Then we have
\begin{align*}
\hat{a}' &= \bar{Y} \sim N\left(a', \frac{\sigma^2}{n}\right)\\
\hat{b} &= \frac{S_{xY}}{S_{xx}} \sim N\left(b, \frac{\sigma^2}{S_{xx}}\right)\\
\hat{Y}_i &= \hat{a}' + \hat{b}(x_i - \bar{x})\\
\mathrm{RSS} &= \sum_i (Y_i - \hat{Y}_i)^2 \sim \sigma^2\chi^2_{n-2},
\end{align*}
and $(\hat{a}', \hat{b})$ and $\hat{\sigma}^2 = \mathrm{RSS}/n$ are independent, as we have previously shown.

Note that $\hat{\sigma}^2$ is obtained by dividing RSS by $n$, and is the maximum likelihood estimator. On the other hand, $\tilde{\sigma}^2$ is obtained by dividing RSS by $n - p$, and is an unbiased estimator.

Example. Using the oxygen/time example, we have seen that
\[
\tilde{\sigma}^2 = \frac{\mathrm{RSS}}{n - p} = \frac{67968}{24 - 2} = 3089 = 55.6^2.
\]

So the standard error of $\hat{b}$ is
\[
\mathrm{SE}(\hat{b}) = \sqrt{\tilde{\sigma}^2(X^TX)^{-1}_{22}} = \sqrt{\frac{3089}{S_{xx}}} = \frac{55.6}{28.0} = 1.99.
\]
So a 95% interval for $b$ has end points
\[
\hat{b} \pm \mathrm{SE}(\hat{b})\times t_{n-p}(0.025) = -12.9 \pm 1.99 \times t_{22}(0.025) = (-17.0, -8.8),
\]
using the fact that $t_{22}(0.025) = 2.07$.

Note that this interval does not contain 0. So if we want to carry out a size 0.05 test of $H_0: b = 0$ (they are uncorrelated) vs $H_1: b \neq 0$ (they are correlated), the test statistic would be $\dfrac{\hat{b}}{\mathrm{SE}(\hat{b})} = \dfrac{-12.9}{1.99} = -6.48$. Then we reject $H_0$, because this is less than $-t_{22}(0.025) = -2.07$.


3.7 Expected response at x∗

After performing the linear regression, we can now make predictions from it. Suppose that $\mathbf{x}^*$ is a new vector of values for the explanatory variables. The expected response at $\mathbf{x}^*$ is $E[Y \mid \mathbf{x}^*] = \mathbf{x}^{*T}\beta$. We estimate this by $\mathbf{x}^{*T}\hat{\beta}$. Then we have
\[
\mathbf{x}^{*T}(\hat{\beta} - \beta) \sim N(0, \mathbf{x}^{*T}\mathrm{cov}(\hat{\beta})\mathbf{x}^*) = N(0, \sigma^2\mathbf{x}^{*T}(X^TX)^{-1}\mathbf{x}^*).
\]
Let $\tau^2 = \mathbf{x}^{*T}(X^TX)^{-1}\mathbf{x}^*$. Then
\[
\frac{\mathbf{x}^{*T}(\hat{\beta} - \beta)}{\tilde{\sigma}\tau} \sim t_{n-p}.
\]
Then a confidence interval for the expected response $\mathbf{x}^{*T}\beta$ has end points
\[
\mathbf{x}^{*T}\hat{\beta} \pm \tilde{\sigma}\tau\, t_{n-p}\left(\frac{\alpha}{2}\right).
\]
Example. Continuing the previous example, suppose we wish to estimate the time to run 2 miles for a man with an oxygen take-up measurement of 50. Here $\mathbf{x}^{*T} = (1, 50 - \bar{x})$, where $\bar{x} = 48.6$. The estimated expected response at $\mathbf{x}^*$ is
\[
\mathbf{x}^{*T}\hat{\beta} = \hat{a}' + (50 - 48.6)\times\hat{b} = 826.5 - 1.4\times 12.9 = 808.5,
\]
which is obtained by plugging $\mathbf{x}^*$ into our fitted line. We find
\[
\tau^2 = \mathbf{x}^{*T}(X^TX)^{-1}\mathbf{x}^* = \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} = \frac{1}{24} + \frac{1.4^2}{783.5} = 0.044 = 0.21^2.
\]
So a 95% confidence interval for $E[Y \mid \mathbf{x}^*]$ is
\[
\mathbf{x}^{*T}\hat{\beta} \pm \tilde{\sigma}\tau\, t_{n-p}\left(\frac{\alpha}{2}\right) = 808.5 \pm 55.6\times 0.21\times 2.07 = (783.6, 832.2).
\]
Note that this is the confidence interval for the predicted expected value, NOT a confidence interval for the actual observed value.

The predicted response at $\mathbf{x}^*$ is $Y^* = \mathbf{x}^{*T}\beta + \varepsilon^*$, where $\varepsilon^* \sim N(0, \sigma^2)$, and $Y^*$ is independent of $Y_1, \cdots, Y_n$. Here we have more uncertainty in our prediction: both $\beta$ and $\varepsilon^*$ are unknown.

A $100(1-\alpha)\%$ prediction interval for $Y^*$ is an interval $I(\mathbf{Y})$ such that $P(Y^* \in I(\mathbf{Y})) = 1 - \alpha$, where the probability is over the joint distribution of $Y^*, Y_1, \cdots, Y_n$. So $I$ is a random function of the past data $\mathbf{Y}$ that outputs an interval.

First of all, as above, the predicted expected response is $\hat{Y}^* = \mathbf{x}^{*T}\hat{\beta}$. This is an unbiased estimator of $Y^*$, since $\hat{Y}^* - Y^* = \mathbf{x}^{*T}(\hat{\beta} - \beta) - \varepsilon^*$, and hence
\[
E[\hat{Y}^* - Y^*] = \mathbf{x}^{*T}(\beta - \beta) = 0.
\]
To find the variance, we use the fact that $\mathbf{x}^{*T}(\hat{\beta} - \beta)$ and $\varepsilon^*$ are independent, and that the variance of a sum of independent variables is the sum of the variances. So
\[
\mathrm{var}(\hat{Y}^* - Y^*) = \mathrm{var}(\mathbf{x}^{*T}\hat{\beta}) + \mathrm{var}(\varepsilon^*) = \sigma^2\mathbf{x}^{*T}(X^TX)^{-1}\mathbf{x}^* + \sigma^2 = \sigma^2(\tau^2 + 1).
\]


We can see this as the uncertainty in the regression line, $\sigma^2\tau^2$, plus the wobble about the regression line, $\sigma^2$. So $\hat{Y}^* - Y^* \sim N(0, \sigma^2(\tau^2 + 1))$. We therefore find that
\[
\frac{\hat{Y}^* - Y^*}{\tilde{\sigma}\sqrt{\tau^2 + 1}} \sim t_{n-p}.
\]
So the interval with endpoints
\[
\mathbf{x}^{*T}\hat{\beta} \pm \tilde{\sigma}\sqrt{\tau^2 + 1}\, t_{n-p}\left(\frac{\alpha}{2}\right)
\]
is a $100(1-\alpha)\%$ prediction interval for $Y^*$. We don't call this a confidence interval — confidence intervals are about finding parameters of the distribution, while the prediction interval is about our predictions.

Example. A 95% prediction interval for $Y^*$ at $\mathbf{x}^{*T} = (1, (50 - \bar{x}))$ is
\[
\mathbf{x}^{*T}\hat{\beta} \pm \tilde{\sigma}\sqrt{\tau^2 + 1}\, t_{n-p}\left(\frac{\alpha}{2}\right) = 808.5 \pm 55.6\times 1.02\times 2.07 = (691.1, 925.8).
\]
Note that this is much wider than the interval for the expected response! This is because there are three sources of uncertainty: we don't know what $\sigma$ is, we don't know what $b$ is, and there is the random fluctuation $\varepsilon^*$.

Example. Wafer example continued: suppose we wish to estimate the expected resistivity of a new wafer measured by the first instrument. Here $\mathbf{x}^{*T} = (1, 0, \cdots, 0)$ (recall that $\mathbf{x}$ is an indicator vector to indicate which instrument is used). The estimated response at $\mathbf{x}^*$ is
\[
\mathbf{x}^{*T}\hat{\mu} = \hat{\mu}_1 = \bar{y}_1 = 124.3.
\]
We find
\[
\tau^2 = \mathbf{x}^{*T}(X^TX)^{-1}\mathbf{x}^* = \frac{1}{5}.
\]
So a 95% confidence interval for $E[Y_1^*]$ is
\[
\mathbf{x}^{*T}\hat{\mu} \pm \tilde{\sigma}\tau\, t_{n-p}\left(\frac{\alpha}{2}\right) = 124.3 \pm \frac{10.4}{\sqrt{5}}\times 2.09 = (114.6, 134.0).
\]
Note that we are using an estimate of $\sigma$ obtained from all five instruments. If we had only used the data from the first instrument, $\sigma$ would be estimated as
\[
\tilde{\sigma}_1 = \sqrt{\frac{\sum_{j=1}^5 (y_{1,j} - \bar{y}_1)^2}{5 - 1}} = 8.74.
\]
The observed 95% confidence interval for $\mu_1$ would have been
\[
\bar{y}_1 \pm \frac{\tilde{\sigma}_1}{\sqrt{5}}\, t_4\left(\frac{\alpha}{2}\right) = 124.3 \pm 3.91\times 2.78 = (113.5, 135.1),
\]
which is slightly wider. Usually it is much wider, but in this special case we only get a small difference, since the data from the first instrument happen to be relatively tight compared to the others.

A 95% prediction interval for $Y_1^*$ at $\mathbf{x}^{*T} = (1, 0, \cdots, 0)$ is
\[
\mathbf{x}^{*T}\hat{\mu} \pm \tilde{\sigma}\sqrt{\tau^2 + 1}\, t_{n-p}\left(\frac{\alpha}{2}\right) = 124.3 \pm 10.4\times 1.1\times 2.09 = (100.5, 148.1).
\]
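A sketch (my own, assuming numpy and scipy) contrasting the confidence interval for the expected response with the wider prediction interval at a new point x*, for simulated data resembling the oxygen example:

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
n = 24
x = rng.normal(48.0, 5.0, size=n)                     # covariate
X = np.column_stack([np.ones(n), x - x.mean()])       # centred design matrix
Y = 830.0 - 13.0 * (x - x.mean()) + rng.normal(0.0, 50.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
RSS = np.sum((Y - X @ beta_hat) ** 2)
sigma_tilde = np.sqrt(RSS / (n - 2))
tcrit = t.ppf(0.975, n - 2)

x_star = np.array([1.0, 50.0 - x.mean()])             # new point
fit = x_star @ beta_hat
tau2 = x_star @ XtX_inv @ x_star

half_ci = sigma_tilde * np.sqrt(tau2) * tcrit         # for E[Y | x*]
half_pi = sigma_tilde * np.sqrt(tau2 + 1.0) * tcrit   # for Y* itself
print(fit - half_ci, fit + half_ci)
print(fit - half_pi, fit + half_pi)                   # noticeably wider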


3.8 Hypothesis testing

3.8.1 Hypothesis testing

In hypothesis testing, we want to know whether certain variables influence the result. If, say, the variable $x_1$ does not influence $Y$, then we must have $\beta_1 = 0$. So the goal is to test the hypothesis $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$. We will tackle a more general case, where $\beta$ can be split into two vectors $\beta_0$ and $\beta_1$, and we test whether $\beta_1$ is zero.

We start with an obscure lemma, which might seem pointless at first, but will prove itself useful very soon.

Lemma. Suppose $\mathbf{Z} \sim N_n(\mathbf{0}, \sigma^2 I_n)$, and $A_1$ and $A_2$ are symmetric, idempotent $n \times n$ matrices with $A_1A_2 = 0$ (i.e. they are orthogonal). Then $\mathbf{Z}^TA_1\mathbf{Z}$ and $\mathbf{Z}^TA_2\mathbf{Z}$ are independent.

This is geometrically intuitive, because A1 and A2 being orthogonal means they are concerned about different parts of the vector Z.

Proof. Let $\mathbf{W}_i = A_i\mathbf{Z}$ for $i = 1, 2$, and
\[
\mathbf{W} = \begin{pmatrix}\mathbf{W}_1\\ \mathbf{W}_2\end{pmatrix} = \begin{pmatrix}A_1\\ A_2\end{pmatrix}\mathbf{Z}.
\]
Then
\[
\mathbf{W} \sim N_{2n}\left(\begin{pmatrix}\mathbf{0}\\ \mathbf{0}\end{pmatrix}, \sigma^2\begin{pmatrix}A_1 & 0\\ 0 & A_2\end{pmatrix}\right),
\]
since the off-diagonal blocks are $\sigma^2 A_1A_2^T = \sigma^2 A_1A_2 = 0$. So $\mathbf{W}_1$ and $\mathbf{W}_2$ are independent, which implies that
\[
\mathbf{W}_1^T\mathbf{W}_1 = \mathbf{Z}^TA_1^TA_1\mathbf{Z} = \mathbf{Z}^TA_1A_1\mathbf{Z} = \mathbf{Z}^TA_1\mathbf{Z}
\]
and
\[
\mathbf{W}_2^T\mathbf{W}_2 = \mathbf{Z}^TA_2^TA_2\mathbf{Z} = \mathbf{Z}^TA_2A_2\mathbf{Z} = \mathbf{Z}^TA_2\mathbf{Z}
\]
are independent.

Now we go to hypothesis testing in general linear models. Suppose
\[
\underset{n\times p}{X} = \Big(\ \underset{n\times p_0}{X_0}\quad \underset{n\times(p-p_0)}{X_1}\ \Big)
\quad\text{and}\quad
\beta = \begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix},
\]
where $\mathrm{rank}(X) = p$ and $\mathrm{rank}(X_0) = p_0$. We want to test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$. Under $H_0$, $X_1\beta_1$ vanishes and $\mathbf{Y} = X_0\beta_0 + \varepsilon$.

Under $H_0$, the mle of $\beta_0$ and $\sigma^2$ are
\begin{align*}
\hat{\hat{\beta}}_0 &= (X_0^TX_0)^{-1}X_0^T\mathbf{Y}\\
\hat{\hat{\sigma}}^2 &= \frac{\mathrm{RSS}_0}{n} = \frac{1}{n}(\mathbf{Y} - X_0\hat{\hat{\beta}}_0)^T(\mathbf{Y} - X_0\hat{\hat{\beta}}_0),
\end{align*}
and we have previously shown these are independent.


Note that our poor estimators wear two hats instead of one. We adopt the convention that the estimators of the null hypothesis have two hats, while those of the alternative hypothesis have one. So the fitted values under H0 are

\[
\hat{\hat{\mathbf{Y}}} = X_0(X_0^TX_0)^{-1}X_0^T\mathbf{Y} = P_0\mathbf{Y},
\]
where $P_0 = X_0(X_0^TX_0)^{-1}X_0^T$.

The generalised likelihood ratio test of $H_0$ against $H_1$ uses
\begin{align*}
\Lambda_{\mathbf{Y}}(H_0, H_1) &= \frac{\left(\frac{1}{\sqrt{2\pi\hat{\sigma}^2}}\right)^n \exp\left(-\frac{1}{2\hat{\sigma}^2}(\mathbf{Y} - X\hat{\beta})^T(\mathbf{Y} - X\hat{\beta})\right)}{\left(\frac{1}{\sqrt{2\pi\hat{\hat{\sigma}}^2}}\right)^n \exp\left(-\frac{1}{2\hat{\hat{\sigma}}^2}(\mathbf{Y} - X_0\hat{\hat{\beta}}_0)^T(\mathbf{Y} - X_0\hat{\hat{\beta}}_0)\right)}\\
&= \left(\frac{\hat{\hat{\sigma}}^2}{\hat{\sigma}^2}\right)^{n/2} = \left(\frac{\mathrm{RSS}_0}{\mathrm{RSS}}\right)^{n/2} = \left(1 + \frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}}\right)^{n/2}.
\end{align*}
We reject $H_0$ when $2\log\Lambda$ is large, equivalently when $\frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}}$ is large. Using the results in Lecture 8, under $H_0$ we have
\[
2\log\Lambda = n\log\left(1 + \frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}}\right),
\]
which is approximately a $\chi^2_{p - p_0}$ random variable.

This is a good approximation. But we can in fact get an exact null distribution, and hence an exact test. We have previously shown that $\mathrm{RSS} = \mathbf{Y}^T(I_n - P)\mathbf{Y}$, and so
\[
\mathrm{RSS}_0 - \mathrm{RSS} = \mathbf{Y}^T(I_n - P_0)\mathbf{Y} - \mathbf{Y}^T(I_n - P)\mathbf{Y} = \mathbf{Y}^T(P - P_0)\mathbf{Y}.
\]

Now both In − P and P − P0 are symmetric and idempotent, and therefore rank(In − P ) = n − p and

rank(P − P0) = tr(P − P0) = tr(P ) − tr(P0) = rank(P ) − rank(P0) = p − p0.

Also,

\[
(I_n - P)(P - P_0) = (I_n - P)P - (I_n - P)P_0 = (P - P^2) - (P_0 - PP_0) = 0
\]
(we have $P^2 = P$ by idempotence, and $PP_0 = P_0$ since after projecting with $P_0$, we are already in the space of $P$, and applying $P$ has no effect). Finally,
\begin{align*}
\mathbf{Y}^T(I_n - P)\mathbf{Y} &= (\mathbf{Y} - X_0\beta_0)^T(I_n - P)(\mathbf{Y} - X_0\beta_0)\\
\mathbf{Y}^T(P - P_0)\mathbf{Y} &= (\mathbf{Y} - X_0\beta_0)^T(P - P_0)(\mathbf{Y} - X_0\beta_0),
\end{align*}


since $(I_n - P)X_0 = (P - P_0)X_0 = 0$.

If we let $\mathbf{Z} = \mathbf{Y} - X_0\beta_0$, $A_1 = I_n - P$, $A_2 = P - P_0$, and apply our previous lemma together with the fact that $\mathbf{Z}^TA_i\mathbf{Z} \sim \sigma^2\chi^2_r$, then
\begin{align*}
\mathrm{RSS} &= \mathbf{Y}^T(I_n - P)\mathbf{Y} \sim \sigma^2\chi^2_{n-p},\\
\mathrm{RSS}_0 - \mathrm{RSS} &= \mathbf{Y}^T(P - P_0)\mathbf{Y} \sim \sigma^2\chi^2_{p-p_0},
\end{align*}
and these random variables are independent. So under $H_0$,
\[
F = \frac{\mathbf{Y}^T(P - P_0)\mathbf{Y}/(p - p_0)}{\mathbf{Y}^T(I_n - P)\mathbf{Y}/(n - p)} = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/(p - p_0)}{\mathrm{RSS}/(n - p)} \sim F_{p-p_0,\, n-p}.
\]
Hence we reject $H_0$ if $F > F_{p-p_0, n-p}(\alpha)$. Here $\mathrm{RSS}_0 - \mathrm{RSS}$ is the reduction in the sum of squares due to fitting $\beta_1$ in addition to $\beta_0$. This is summarised in an analysis of variance table:

Source of var.   d.f.      Sum of squares   Mean square              F statistic
Fitted model     p − p0    RSS0 − RSS       (RSS0 − RSS)/(p − p0)    F = [(RSS0 − RSS)/(p − p0)] / [RSS/(n − p)]
Residual         n − p     RSS              RSS/(n − p)
Total            n − p0    RSS0

The ratio $\frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}_0}$ is sometimes known as the proportion of variance explained by $\beta_1$, and denoted $R^2$.
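A sketch (my own, assuming numpy and scipy) of this F test for a nested pair of models, computed directly from RSS0 and RSS:

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(8)
n, p0, p = 60, 2, 4
X0 = np.column_stack([np.ones(n), rng.normal(size=(n, p0 - 1))])
X1 = rng.normal(size=(n, p - p0))
X = np.column_stack([X0, X1])
Y = X0 @ np.array([1.0, 2.0]) + rng.normal(size=n)    # data generated under H0

def rss(design, y):
    # residual sum of squares after a least squares fit
    beta = np.linalg.solve(design.T @ design, design.T @ y)
    return np.sum((y - design @ beta) ** 2)

RSS0, RSS = rss(X0, Y), rss(X, Y)
F = ((RSS0 - RSS) / (p - p0)) / (RSS / (n - p))
print(F, f.ppf(0.95, p - p0, n - p))                  # reject H0 if F exceeds this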

3.8.2 Simple linear regression

We assume that
\[
Y_i = a' + b(x_i - \bar{x}) + \varepsilon_i,
\]
where $\bar{x} = \sum x_i/n$ and the $\varepsilon_i$ are iid $N(0, \sigma^2)$.

Suppose we want to test the hypothesis $H_0: b = 0$, i.e. no linear relationship. We have previously seen how to construct a confidence interval, and so we could simply see if it included 0.

Alternatively, under $H_0$, the model is $Y_i \sim N(a', \sigma^2)$, and so $\hat{a}' = \bar{Y}$, and the fitted values are $\hat{Y}_i = \bar{Y}$. The observed $\mathrm{RSS}_0$ is therefore
\[
\mathrm{RSS}_0 = \sum_i (y_i - \bar{y})^2 = S_{yy}.
\]
The fitted sum of squares is therefore
\[
\mathrm{RSS}_0 - \mathrm{RSS} = \sum_i (y_i - \bar{y})^2 - \sum_i (y_i - \bar{y} - \hat{b}(x_i - \bar{x}))^2 = \hat{b}^2\sum_i (x_i - \bar{x})^2 = \hat{b}^2 S_{xx}.
\]
Source of var.   d.f.    Sum of squares               Mean square    F statistic
Fitted model     1       RSS0 − RSS = b̂²Sxx           b̂²Sxx          F = b̂²Sxx/σ̃²
Residual         n − 2   RSS = Σᵢ(yᵢ − ŷᵢ)²           σ̃²
Total            n − 1   RSS0 = Σᵢ(yᵢ − ȳ)²


Note that the proportion of variance explained is $\hat{b}^2 S_{xx}/S_{yy} = \dfrac{S_{xy}^2}{S_{xx}S_{yy}} = r^2$, where $r$ is the Pearson product-moment correlation coefficient
\[
r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}.
\]
We have previously seen that under $H_0$, $\dfrac{\hat{b}}{\mathrm{SE}(\hat{b})} \sim t_{n-2}$, where $\mathrm{SE}(\hat{b}) = \tilde{\sigma}/\sqrt{S_{xx}}$. So we let
\[
t = \frac{\hat{b}}{\mathrm{SE}(\hat{b})} = \frac{\hat{b}\sqrt{S_{xx}}}{\tilde{\sigma}}.
\]
Checking whether $|t| > t_{n-2}\left(\frac{\alpha}{2}\right)$ is precisely the same as checking whether $t^2 = F > F_{1,n-2}(\alpha)$, since an $F_{1,n-2}$ variable is a $t_{n-2}^2$ variable. Hence the same conclusion is reached regardless of whether we use the t-distribution or the F statistic derived from the analysis of variance table.

3.8.3 One way analysis of variance with equal numbers in each group

Recall that in our wafer example, we made measurements in groups, and want to know if there is a difference between groups. In general, suppose $J$ measurements are taken in each of $I$ groups, and that
\[
Y_{ij} = \mu_i + \varepsilon_{ij},
\]
where the $\varepsilon_{ij}$ are independent $N(0, \sigma^2)$ random variables, and the $\mu_i$ are unknown constants.

Fitting this model gives
\[
\mathrm{RSS} = \sum_{i=1}^I\sum_{j=1}^J (Y_{ij} - \hat{\mu}_i)^2 = \sum_{i=1}^I\sum_{j=1}^J (Y_{ij} - \bar{Y}_{i.})^2
\]
on $n - I$ degrees of freedom.

Suppose we want to test the hypothesis $H_0: \mu_i = \mu$, i.e. no difference between groups.

Under $H_0$, the model is $Y_{ij} \sim N(\mu, \sigma^2)$, and so $\hat{\mu} = \bar{Y}_{..}$, and the fitted values are $\hat{Y}_{ij} = \bar{Y}_{..}$. The observed $\mathrm{RSS}_0$ is therefore
\[
\mathrm{RSS}_0 = \sum_{i,j} (y_{ij} - \bar{y}_{..})^2.
\]
The fitted sum of squares is therefore
\[
\mathrm{RSS}_0 - \mathrm{RSS} = \sum_i\sum_j \left[(y_{ij} - \bar{y}_{..})^2 - (y_{ij} - \bar{y}_{i.})^2\right] = J\sum_i (\bar{y}_{i.} - \bar{y}_{..})^2.
\]
Source of var.   d.f.    Sum of squares            Mean square                    F statistic
Fitted model     I − 1   J Σᵢ(ȳᵢ. − ȳ..)²          J Σᵢ(ȳᵢ. − ȳ..)²/(I − 1)        F = J Σᵢ(ȳᵢ. − ȳ..)²/((I − 1)σ̃²)
Residual         n − I   Σᵢ Σⱼ(yᵢⱼ − ȳᵢ.)²         σ̃²
Total            n − 1   Σᵢ Σⱼ(yᵢⱼ − ȳ..)²
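Applied to the wafer data of Section 3.3, these quantities can be computed directly. The sketch below (my own, assuming numpy and scipy) gives an RSS close to the 2170 quoted earlier (the difference is rounding in the notes) and the F statistic for H0: no difference between instruments.

import numpy as np
from scipy.stats import f

# rows = instruments 1..5, columns = wafers 1..5 (data from Section 3.3)
y = np.array([[130.5, 112.4, 118.9, 125.7, 134.0],
              [130.4, 138.2, 116.7, 132.6, 104.2],
              [113.0, 120.5, 128.9, 103.4, 118.1],
              [128.0, 117.5, 114.9, 114.9,  98.8],
              [121.2, 110.5, 118.5, 100.5, 120.9]])

I, J = y.shape
n = I * J
group_means = y.mean(axis=1)                          # ybar_{i.}
grand_mean = y.mean()                                 # ybar_{..}

RSS = np.sum((y - group_means[:, None]) ** 2)         # within-group sum of squares
fitted_SS = J * np.sum((group_means - grand_mean) ** 2)   # RSS0 - RSS
sigma_tilde2 = RSS / (n - I)

F = (fitted_SS / (I - 1)) / sigma_tilde2
print(RSS, fitted_SS)
print(F, f.ppf(0.95, I - 1, n - I))                   # compare with F_{4,20}(0.05)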
