CHAPTER 1 - SOME REVISION AND USEFUL DEFINITIONS

§1.1 Some standard distributions revisited

BERNOULLI
A basic building block of discrete random variables; the state space is $\{0, 1\}$ and the parameter space is $(0, 1)$:

\[
X \sim \mathrm{Bern}(p), \qquad f_p(x) = p^x (1-p)^{1-x}
\]
where $p$ represents the probability of the event $(X = 1)$.

BINOMIAL
What happens if we add $n$ iid $\mathrm{Bern}(p)$ random variables together (note the common parameter $p$)?

\[
X_1, \ldots, X_n \ \text{iid } \mathrm{Bern}(p), \qquad Y = \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p), \qquad f_p(y) = \binom{n}{y} p^y (1-p)^{n-y}
\]

where the state space is $\{0, 1, \ldots, n\}$ and the parameter space is $p \in (0, 1)$.

GEOMETRIC
The Geometric is a discrete-time waiting distribution. We run a sequence of iid $\mathrm{Bern}(p)$ random variables and wait for the first occurrence of a 1:

\[
X \sim \mathrm{Geo}(p), \qquad f_p(x) = p (1-p)^{x-1}
\]
where the state space is $\{1, 2, \ldots\}$ and the parameter space is $p \in (0, 1)$.

There is an alternative parameterisation whereby the variable is defined as the number of 0's before the first 1. In that case:

\[
Z \sim \mathrm{Geo}(p), \qquad f_p(z) = p (1-p)^{z}
\]
where the state space is $\{0, 1, \ldots\}$ and the parameter space is $p \in (0, 1)$.

NEGATIVE BINOMIAL
A generalisation of the Geometric distribution, it is the waiting time for the $r$th 1 in a sequence of iid $\mathrm{Bern}(p)$ random variables:

\[
X \sim \mathrm{NegBin}(r, p), \qquad f_p(x) = \binom{x-1}{r-1} p^r (1-p)^{x-r}
\]
where the state space is $\{r, r+1, \ldots\}$ and the parameter space is $p \in (0, 1)$.

Again, there is an alternative parameterisation whereby the variable is defined as the number of 0's before the $r$th 1. In that case:

\[
Z \sim \mathrm{NegBin}(r, p), \qquad f_p(z) = \binom{z+r-1}{r-1} p^r (1-p)^{z}
\]
where the state space is $\{0, 1, \ldots\}$ and the parameter space is $p \in (0, 1)$.

POISSON
The Poisson can be thought of as a limiting case of the Binomial as $n \to \infty$ and $p \to 0$ in such a way that the expectation of the Binomial, $np$, tends to some finite, non-zero constant $\lambda$:

\[
X \sim \mathrm{Pois}(\lambda), \qquad f_\lambda(x) = \frac{\lambda^x e^{-\lambda}}{x!}
\]
where the state space is $\{0, 1, \ldots\}$ and the parameter space is $\lambda > 0$.
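As a rough numerical illustration of this limit, the following sketch, assuming NumPy and SciPy are available, compares the $\mathrm{Bin}(n, \lambda/n)$ pmf with the $\mathrm{Pois}(\lambda)$ pmf as $n$ grows; all names and values are illustrative.

import numpy as np
from scipy import stats

lam = 3.0                      # target Poisson mean
x = np.arange(0, 15)           # values at which to compare the two pmfs

for n in (10, 100, 1000):
    p = lam / n                # keep the Binomial mean np fixed at lambda
    binom_pmf = stats.binom.pmf(x, n, p)
    pois_pmf = stats.poisson.pmf(x, lam)
    max_diff = np.max(np.abs(binom_pmf - pois_pmf))
    print(f"n = {n:5d}: max |Bin(n, lam/n) - Pois(lam)| = {max_diff:.5f}")

The maximum discrepancy shrinks as $n$ increases, in line with the limiting argument.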

UNIFORM
The Uniform is perhaps the simplest continuous distribution:
\[
X \sim U(\theta_1, \theta_2), \qquad f_{\theta_1, \theta_2}(x) = \frac{1}{\theta_2 - \theta_1}
\]
where the state space is $(\theta_1, \theta_2)$ and the parameter space is $-\infty < \theta_1 < \theta_2 < \infty$.

EXPONENTIAL
The Exponential is the continuous-time equivalent of the Geometric distribution:

\[
X \sim \exp(\lambda), \qquad f_\lambda(x) = \lambda e^{-\lambda x}
\]

where the state space is $X > 0$ and the parameter space is $\lambda > 0$. In this parameterisation, the parameter $\lambda$ is referred to as the rate parameter, and the expectation of $X$ is $\lambda^{-1}$. The alternative parameterisation in terms of the mean, $\mu$, is also frequently used:

\[
X \sim \exp(\mu), \qquad f_\mu(x) = \frac{1}{\mu} e^{-x/\mu}
\]

where the state space is still $X > 0$ and the parameter space is $\mu > 0$. Clearly $\mu = 1/\lambda$.

GAMMA
A generalisation of the Exponential distribution:

\[
X \sim \mathrm{Gam}(\lambda, t), \qquad f_{\lambda, t}(x) = \frac{\lambda^t}{\Gamma(t)} x^{t-1} e^{-\lambda x}, \qquad \text{where } \Gamma(t) = \int_0^{\infty} y^{t-1} e^{-y} \, dy
\]
where the state space is $X > 0$ and the parameter space is $\lambda > 0$, $t > 0$.

One way in which Gamma distributions arise is as the sum of iid $\exp(\lambda)$ random variables:

\[
X_1, \ldots, X_t \ \text{iid } \exp(\lambda), \qquad Y = \sum_{i=1}^{t} X_i \sim \mathrm{Gam}(\lambda, t)
\]
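This relationship can be checked empirically. The sketch below, assuming NumPy and SciPy, sums iid Exponential draws and compares the resulting sample with the corresponding Gamma distribution via a Kolmogorov-Smirnov test; all names and parameter values are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam, t, n_reps = 2.0, 5, 10_000

# Each replicate is the sum of t iid exp(lam) variables (rate parameterisation).
exp_sums = rng.exponential(scale=1 / lam, size=(n_reps, t)).sum(axis=1)

# Compare with Gam(lam, t); scipy's gamma uses shape t and scale 1/lam.
ks_stat, p_value = stats.kstest(exp_sums, stats.gamma(a=t, scale=1 / lam).cdf)
print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.3f}")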

NORMAL
The Normal arises in a number of ways, for example as the limiting case of a Binomial as $n \to \infty$, or as a result of the Central Limit Theorem:
\[
X \sim N(\mu, \sigma^2), \qquad f_{\mu, \sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}
\]
where the state space is the real line and the parameter space is $\mu \in (-\infty, \infty)$, $\sigma^2 > 0$.

BETA
The Beta distribution will be a useful one for those going on to study MA40189:

\[
X \sim \mathrm{Beta}(a, b), \qquad f_{a, b}(x) = \frac{1}{B(a, b)} x^{a-1} (1-x)^{b-1}, \qquad \text{where } B(a, b) = \int_0^{1} y^{a-1} (1-y)^{b-1} \, dy
\]
where the state space is $(0, 1)$ and the parameter space is $a > 0$, $b > 0$.

§1.2 Exponential families

It will be useful, and elegant, if we can work with a definition which includes many of the distributions in the previous section and which has good theoretical properties.

Definition 1.1
A $k$-parameter exponential family is one whose pmf/pdf can be written in the form

\[
f_\theta(x) = \begin{cases} c(\theta) \, h(x) \exp\left( \sum_{j=1}^{k} a_j(\theta) b_j(x) \right) & x \in \Omega \\ 0 & \text{otherwise} \end{cases}
\]
and where the state space $\Omega$ does not involve the parameter $\theta$.

Examples of testing whether a distribution is a member of an exponential family

1. Exponential

\[
f_\lambda(x) = \lambda e^{-\lambda x}
\]

There is one parameter, so $k = 1$. We could parameterise $f_\lambda(x)$ using $c(\lambda) = \lambda$, $h(x) = 1$, $a(\lambda) = -\lambda$ and $b(x) = x$. The state space is $X > 0$, which is independent of the parameter $\lambda$. Therefore the Exponential distribution is a member of the one-parameter exponential family.

2. Normal
\[
f_{\mu, \sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}
\]
There are two parameters, so $k = 2$. We could parameterise $f_{\mu, \sigma^2}(x)$ using $c(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{\mu^2}{2\sigma^2}}$, $h(x) = 1$, $a_1(\mu, \sigma^2) = -1/2\sigma^2$ with $b_1(x) = x^2$, and $a_2(\mu, \sigma^2) = \mu/\sigma^2$ with $b_2(x) = x$. The state space is the real line, which is independent of the parameters $\mu$ and $\sigma^2$. Therefore the Normal distribution is a member of the two-parameter exponential family. (A numerical check of this decomposition is sketched after these examples.)

3. Uniform
The Uniform distribution with parameters $\theta_1$ and $\theta_2$ has state space $(\theta_1, \theta_2)$. Therefore the Uniform distribution is not a member of an exponential family, as its state space depends on the parameters.
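As a sanity check of the Normal decomposition in example 2, one can evaluate $c(\theta) h(x) \exp(a_1(\theta) b_1(x) + a_2(\theta) b_2(x))$ on a grid and compare it with the Normal pdf. A minimal sketch, assuming NumPy, with illustrative names and values:

import numpy as np

mu, sigma2 = 1.5, 2.0
x = np.linspace(-5, 8, 200)

# Direct evaluation of the N(mu, sigma^2) density.
pdf = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Exponential family form: c(theta) * h(x) * exp(a1*b1(x) + a2*b2(x)).
c = np.exp(-mu ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
h = 1.0
a1, b1 = -1 / (2 * sigma2), x ** 2
a2, b2 = mu / sigma2, x
ef_form = c * h * np.exp(a1 * b1 + a2 * b2)

print("max abs difference:", np.max(np.abs(pdf - ef_form)))  # ~1e-16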

§1.3 Sufficiency

Another useful concept will be that of sufficiency.

Definition 1.2
A sufficient statistic is a statistic which exhausts the information in a data set regarding the parameter $\theta$, in the sense that the conditional distribution of $X_1, \ldots, X_n$ given the value of the sufficient statistic $T(x) = t$ does not involve the parameter $\theta$.

An example of sufficiency

Suppose $X_1, \ldots, X_n$ are iid $\mathrm{Bern}(p)$ random variables and that we want to see whether $T(X_1, \ldots, X_n) = \sum_{i=1}^{n} X_i$ is a sufficient statistic. We will need to find the conditional distribution of $X_1, \ldots, X_n \mid T = t$ and note whether or not this distribution involves the parameter $p$.

• What is the distribution of $T$? As $T$ is the sum of iid $\mathrm{Bern}(p)$ random variables, $T \sim \mathrm{Bin}(n, p)$:
\[
P(T = t) = \binom{n}{t} p^t (1-p)^{n-t}, \qquad t = 0, 1, \ldots, n
\]

• What is the joint distribution of $X_1, \ldots, X_n$? As they are independent, the joint distribution is the product of the marginals:
\[
P(X = x) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}, \qquad x_i \in \{0, 1\}
\]

• What is the conditional distribution of $X$ conditional on $T$ taking the value $t$?
\begin{align*}
P(X = x \mid T = t) &= \frac{P(X = x \cap T = t)}{P(T = t)} \\
&= \frac{P(X = x)}{P(T = t)} \, I_{[\sum_{i=1}^{n} x_i = t]} \\
&= \frac{1}{\binom{n}{t}} \, I_{[\sum_{i=1}^{n} x_i = t]}
\end{align*}

As this conditional distribution does not involve the parameter $p$, we can say that $T(X_1, \ldots, X_n) = \sum_{i=1}^{n} X_i$ is a sufficient statistic for the $\mathrm{Bern}(p)$ distribution.
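This conclusion can also be checked by simulation. A minimal sketch, assuming NumPy (names and values are illustrative): simulate Bernoulli samples under two different values of $p$, keep only those with total $t$, and observe that the empirical conditional probability of a particular arrangement is close to $1/\binom{n}{t}$ in both cases.

import numpy as np
from math import comb

rng = np.random.default_rng(1)
n, t, n_reps = 5, 2, 200_000

for p in (0.2, 0.7):
    samples = rng.binomial(1, p, size=(n_reps, n))   # iid Bern(p) rows
    keep = samples[samples.sum(axis=1) == t]         # condition on T = t
    # Empirical probability of one particular arrangement, e.g. (1,1,0,0,0).
    target = np.array([1, 1, 0, 0, 0])
    prob = np.mean(np.all(keep == target, axis=1))
    print(f"p = {p}: P(X = (1,1,0,0,0) | T = {t}) = {prob:.4f}"
          f"  (theory: {1 / comb(n, t):.4f})")

The printed probabilities are close to 0.1 for both values of $p$, as the theory predicts.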

Rather than guessing at a possible sufficient statistic for any particular distribution, and then finding the conditional distribution to check whether the guess is correct, the following theorem enormously simplifies the process in the one-parameter case:

Theorem 1.1
The Factorisation theorem states that $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(T(x), \theta)$ and $h(x)$ such that the joint distribution can be factorised

\[
f_\theta(x) = g(T(x), \theta) \, h(x)
\]
Proof of the Factorisation Theorem in the discrete case:

Suppose $f_\theta(x) = g(T(x), \theta) h(x)$.

\begin{align*}
P_\theta(T(X) = t) &= \sum_{x : T(x) = t} P_\theta(X = x) \\
&= \sum_{x : T(x) = t} f_\theta(x) \\
&= \sum_{x : T(x) = t} g(T(x), \theta) h(x) \\
&= g(t, \theta) \sum_{x : T(x) = t} h(x)
\end{align*}
Now consider a particular $y$ such that $T(y) = t$:

\begin{align*}
P_\theta(X = y \mid T(X) = t) &= \frac{P_\theta(X = y \cap T(X) = t)}{P_\theta(T(X) = t)} \\
&= \frac{P_\theta(X = y)}{P_\theta(T(X) = t)} \quad \text{since } T(y) = t \\
&= \frac{g(T(y), \theta) h(y)}{g(t, \theta) \sum_{x : T(x) = t} h(x)} \\
&= \frac{h(y)}{\sum_{x : T(x) = t} h(x)}
\end{align*}
Since this conditional probability does not involve $\theta$, $T(X)$ is sufficient for $\theta$.

Now suppose $T(X)$ is sufficient.

\begin{align*}
f_\theta(x) &= P_\theta(X = x) \\
&= \sum_{t^*} P_\theta(X = x \cap T(X) = t^*) \quad \text{where the } t^* \text{ form a partition} \\
&= P_\theta(X = x \cap T(X) = t(x)) \quad \text{since all other joint probabilities are zero} \\
&= P_\theta(X = x \mid T(X) = t(x)) \, P_\theta(T(X) = t(x))
\end{align*}
but this first term is independent of $\theta$ by assumption, so denote it by $h(x)$, while the second is a function of $T(x)$ and $\theta$, so denote it by $g(T(x), \theta)$, i.e.

\[
f_\theta(x) = g(T(x), \theta) \, h(x)
\]

Examples of finding sufficient statistics using the Factorisation theorem

1. Bernoulli
$X_1, \ldots, X_n$ iid $\mathrm{Bern}(p)$ random variables:

\begin{align*}
f_p(x) &= \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} \\
&= p^{\sum_{i=1}^{n} x_i} (1-p)^{n - \sum_{i=1}^{n} x_i}
\end{align*}
Take $h(x) = 1$, $T(x) = \sum_{i=1}^{n} x_i$ and $g(T(x), p) = p^{T(x)} (1-p)^{n - T(x)}$; then by the Factorisation theorem, $\sum_{i=1}^{n} X_i$ is a sufficient statistic for $p$.

2. The one-parameter exponential family
$X_1, \ldots, X_n$ iid from the one-parameter exponential family:

\begin{align*}
f_\theta(x) &= c(\theta) h(x) \exp(a(\theta) b(x)), \qquad x \in \Omega \\
f_\theta(x) &= c(\theta)^n \prod_{i=1}^{n} h(x_i) \, \exp\left( a(\theta) \sum_{i=1}^{n} b(x_i) \right)
\end{align*}
Take $h(x) = \prod_{i=1}^{n} h(x_i)$, $T(x) = \sum_{i=1}^{n} b(x_i)$ and $g(T(x), \theta) = c(\theta)^n \exp(a(\theta) T(x))$; then by the Factorisation theorem, $\sum_{i=1}^{n} b(X_i)$ is a sufficient statistic for $\theta$.

The concept of sufficiency extends to joint sufficiency, with a corresponding generalised factorisation theorem (which we will not prove).

Definition 1.3
Suppose $X_1, \ldots, X_n$ have a joint distribution which depends on parameters $\{\theta_1, \ldots, \theta_k\}$. The statistics $\{T_1(X), \ldots, T_r(X)\}$ are jointly sufficient for $\{\theta_1, \ldots, \theta_k\}$ if the conditional distribution of $X_1, \ldots, X_n$ given the values of $\{T_1(X), \ldots, T_r(X)\}$ does not depend on any of the $\{\theta_1, \ldots, \theta_k\}$.

Theorem 1.2
The Generalised Factorisation theorem states that $\{T_1(X), \ldots, T_r(X)\}$ are jointly sufficient for $\{\theta_1, \ldots, \theta_k\}$ if and only if there exist functions $g(T_1(X), \ldots, T_r(X), \theta_1, \ldots, \theta_k)$ and $h(x)$ such that

\[
f_{\theta_1, \ldots, \theta_k}(x) = g(T_1(X), \ldots, T_r(X), \theta_1, \ldots, \theta_k) \, h(x)
\]

Examples of finding joint sufficient statistics using the Generalised Factorisation theorem

1. Normal
$X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$ random variables:

\begin{align*}
f_{\mu, \sigma^2}(x) &= \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{n} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right) \\
&= (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2} \left[ \frac{\sum_{i=1}^{n} x_i^2}{\sigma^2} - \frac{2\mu \sum_{i=1}^{n} x_i}{\sigma^2} + \frac{n\mu^2}{\sigma^2} \right] \right)
\end{align*}
Take $h(x) = 1$, $T_1(x) = \sum_{i=1}^{n} x_i^2$, $T_2(x) = \sum_{i=1}^{n} x_i$ and
\[
g(T_1(x), T_2(x), \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2} \left[ \frac{T_1(x)}{\sigma^2} - \frac{2\mu T_2(x)}{\sigma^2} + \frac{n\mu^2}{\sigma^2} \right] \right);
\]
then by the Generalised Factorisation theorem, $\sum_{i=1}^{n} X_i^2$ and $\sum_{i=1}^{n} X_i$ are jointly sufficient statistics for $\mu, \sigma^2$ (a numerical check of this is sketched after these examples).

2. Uniform
$X_1, \ldots, X_n$ iid $U(\theta, 2\theta)$ random variables:
\[
f_\theta(x) = \frac{1}{\theta^n} \, I_{[\theta < \min x_i]} \, I_{[\max x_i < 2\theta]}
\]
Take $h(x) = 1$, $T_1(x) = \min x_i$, $T_2(x) = \max x_i$ and $g(T_1(x), T_2(x), \theta) = \theta^{-n} \, I_{[\theta < T_1(x)]} \, I_{[T_2(x) < 2\theta]}$; then by the Generalised Factorisation theorem, $\min X_i$ and $\max X_i$ are jointly sufficient statistics for $\theta$.
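Returning to the Normal example, the practical meaning of joint sufficiency is that the likelihood depends on the data only through $\sum x_i^2$ and $\sum x_i$. A minimal check, assuming NumPy, with illustrative names and values:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=50)   # a sample from N(1, 4)
n = len(x)
mu, sigma2 = 0.5, 3.0                          # an arbitrary parameter value

# Log-likelihood from the raw data.
loglik_raw = -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# Log-likelihood using only T1 = sum(x_i^2) and T2 = sum(x_i).
t1, t2 = np.sum(x ** 2), np.sum(x)
loglik_suff = (-0.5 * n * np.log(2 * np.pi * sigma2)
               - 0.5 * (t1 - 2 * mu * t2 + n * mu ** 2) / sigma2)

print(loglik_raw, loglik_suff)   # identical up to floating point error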

§1.4 Revision of properties of maximum likelihood estimation

Recall that the likelihood is the joint mass/density function evaluated at the observed $x_1, \ldots, x_n$ and regarded as a function of the unknown parameters:

\[
L(\theta) = f_\theta(x_1, \ldots, x_n)
\]

The maximum likelihood estimate (MLE) is the value of $\theta$ (not necessarily unique) in the parameter space $\Theta$ which makes $L(\theta)$ as large as possible.

1. Working with the log-likelihood
In practice, we often work with the log-likelihood, $\ell(\theta) = \ln L(\theta)$.

Example

$X_1, \ldots, X_n$ iid $\mathrm{Bern}(p)$ random variables, $\Theta = (0, 1)$:

\begin{align*}
f_p(x) &= \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} \\
L(p) &= p^{\sum_{i=1}^{n} x_i} (1-p)^{n - \sum_{i=1}^{n} x_i} \\
\ell(p) &= \left( \sum_{i=1}^{n} x_i \right) \ln p + \left( n - \sum_{i=1}^{n} x_i \right) \ln(1-p) \\
\frac{d\ell(p)}{dp} &= \frac{\sum_{i=1}^{n} x_i}{p} - \frac{n - \sum_{i=1}^{n} x_i}{1-p} \\
\frac{d^2\ell(p)}{dp^2} &= -\frac{\sum_{i=1}^{n} x_i}{p^2} - \frac{n - \sum_{i=1}^{n} x_i}{(1-p)^2}
\end{align*}

Setting the first derivative equal to zero over the parameter space $(0, 1)$ gives the turning point $\hat{p} = \sum_{i=1}^{n} x_i / n$, at which point the second derivative is negative, confirming that this $\hat{p}$ is a maximum. Thus the Maximum Likelihood Estimator is $\hat{p} = \sum_{i=1}^{n} X_i / n$.
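As a quick numerical check of this result, the following sketch, assuming NumPy and SciPy, maximises the Bernoulli log-likelihood with a bounded one-dimensional optimiser and compares the answer with the sample mean; names and values are illustrative.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.3, size=100)   # iid Bern(0.3) sample

def neg_loglik(p):
    # Negative Bernoulli log-likelihood, to be minimised over (0, 1).
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

result = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, x.mean())   # numerical maximiser agrees with the sample mean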

2. Times when calculus does not help
If the MLE occurs on the boundary of the parameter space, then calculus may not help to find the maximiser of the likelihood.

Example

$X_1, \ldots, X_n$ iid $U(0, \theta)$ random variables:
\[
f_\theta(x) = \frac{1}{\theta^n} \, I_{[0 < x_1, \ldots, x_n < \theta]}, \qquad L(\theta) = \theta^{-n}, \quad \theta > x_1, \ldots, x_n
\]

\[
\frac{dL(\theta)}{d\theta} = -n\theta^{-(n+1)}, \qquad \theta > x_1, \ldots, x_n
\]

The first derivative is negative, i.e. the likelihood is decreasing, and so to maximise $L$ we should pick the smallest possible value of $\theta$. In this case, the smallest possible value that $\theta$ can take is $\max x_i$; that is, the Maximum Likelihood Estimator is $\max X_i$.
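The boundary behaviour can be seen by evaluating the likelihood over a grid of candidate values of $\theta$. A minimal sketch, assuming NumPy (names and values are illustrative): $L(\theta)$ is zero below $\max x_i$ and strictly decreasing above it, so the grid maximiser sits essentially at $\max x_i$.

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5.0, size=20)      # iid U(0, 5) sample; true theta = 5
n = len(x)

def likelihood(theta):
    # L(theta) = theta^{-n} if theta exceeds every observation, else 0.
    return theta ** (-n) if theta > x.max() else 0.0

grid = np.linspace(0.1, 10, 2000)
values = np.array([likelihood(t) for t in grid])
print("grid maximiser:", grid[np.argmax(values)], " max(x):", x.max())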

3. MLEs and sufficient statistics
If there is a unique MLE $\hat{\theta}$, then it is a function of the sufficient statistics. To see why this is the case, consider writing the distribution $f_\theta(x)$ in the form of the Generalised Factorisation theorem:
\[
L(\theta) = f_\theta(x) = g(T_1(X), \ldots, T_r(X), \theta) \, h(x)
\]

Maximising $L(\theta)$ over $\theta$ for fixed $x$ is equivalent to maximising $g(T_1(X), \ldots, T_r(X), \theta)$ over $\theta$, and hence the maximiser, i.e. $\hat{\theta}$, will be a function of the sufficient statistics $T_1(X), \ldots, T_r(X)$.

Examples

Consider the previous two examples, the Bernoulli and the Uniform: the Bernoulli MLE $\hat{p} = \sum_{i=1}^{n} X_i / n$ is a function of the sufficient statistic $\sum_{i=1}^{n} X_i$, and the Uniform MLE $\max X_i$ is itself a sufficient statistic for $\theta$ in the $U(0, \theta)$ model (factorise $f_\theta(x) = \theta^{-n} I_{[\max x_i < \theta]} \cdot I_{[\min x_i > 0]}$).

4. Multiparameter problems
In multiparameter problems, we have an optimisation problem in more than one dimension. In some cases this will be analytically tractable, but in other cases numerical methods might be required. Either way, conditions for any turning point to be a maximum should be checked.

Example

$X_1, \ldots, X_n$ iid $\mathrm{Gam}(\lambda, t)$ random variables:
\begin{align*}
f_{\lambda, t}(x) &= \prod_{i=1}^{n} \frac{\lambda^t}{\Gamma(t)} x_i^{t-1} \exp(-\lambda x_i) \\
L(\lambda, t) &= \frac{\lambda^{nt}}{\Gamma(t)^n} \left( \prod_{i=1}^{n} x_i \right)^{t-1} \exp\left( -\lambda \sum_{i=1}^{n} x_i \right) \\
\ell(\lambda, t) &= nt \ln \lambda - n \ln \Gamma(t) + (t-1) \sum_{i=1}^{n} \ln x_i - \lambda \sum_{i=1}^{n} x_i \\
\frac{\partial \ell(\lambda, t)}{\partial \lambda} &= \frac{nt}{\lambda} - \sum_{i=1}^{n} x_i \\
\frac{\partial \ell(\lambda, t)}{\partial t} &= n \ln \lambda - n \frac{\partial \ln \Gamma(t)}{\partial t} + \sum_{i=1}^{n} \ln x_i
\end{align*}

We need to solve simultaneously $\frac{\partial \ell(\lambda, t)}{\partial \lambda} = 0$ and $\frac{\partial \ell(\lambda, t)}{\partial t} = 0$. We can reduce this to a one-dimensional problem by noting from the first equation that we need $\lambda = \frac{nt}{\sum_{i=1}^{n} x_i}$ and substituting this into the second equation, but we are still left with a problem which must be solved numerically.

Once we have a turning point of $\ell(\lambda, t)$, the conditions on the second derivatives to be checked to ensure we have a maximum are:
\[
\frac{\partial^2 \ell(\lambda, t)}{\partial \lambda^2} < 0, \qquad \frac{\partial^2 \ell(\lambda, t)}{\partial t^2} < 0, \qquad \frac{\partial^2 \ell(\lambda, t)}{\partial t^2} \, \frac{\partial^2 \ell(\lambda, t)}{\partial \lambda^2} > \left( \frac{\partial^2 \ell(\lambda, t)}{\partial t \, \partial \lambda} \right)^2
\]
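The remaining one-dimensional problem can be handed to a standard numerical optimiser. A minimal sketch, assuming NumPy and SciPy (names and values are illustrative), which profiles out $\lambda = nt / \sum_{i=1}^{n} x_i$ and maximises the resulting log-likelihood over $t$:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(5)
x = rng.gamma(shape=3.0, scale=1 / 2.0, size=500)   # Gam(lambda=2, t=3) sample
n, sum_x, sum_logx = len(x), x.sum(), np.log(x).sum()

def neg_profile_loglik(t):
    # Substitute lambda = n t / sum(x) into the log-likelihood and negate.
    lam = n * t / sum_x
    return -(n * t * np.log(lam) - n * gammaln(t)
             + (t - 1) * sum_logx - lam * sum_x)

res = minimize_scalar(neg_profile_loglik, bounds=(1e-3, 50), method="bounded")
t_hat = res.x
lam_hat = n * t_hat / sum_x
print(f"t_hat = {t_hat:.3f}, lambda_hat = {lam_hat:.3f}")   # close to (3, 2)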

5. Maximum likelihood estimates are functionally invariant
If $\hat{\theta}$ is the MLE of $\theta$, then for any function $g$ of $\theta$, the MLE of $g(\theta)$ is $g(\hat{\theta})$, i.e. we can simply plug the MLE into the function. To see why this works, let $\phi = g(\theta)$. Denote the likelihood function for $\theta$ by $L(\theta)$ and the likelihood function for $\phi$ by $\tilde{L}(\phi)$. Consider the two cases separately: $g$ invertible and $g$ not invertible.

g is invertible
In this case the likelihood $\tilde{L}(\phi)$ is easy to define, since we know exactly which value of $\theta$ corresponds to any $\phi$:
\[
\tilde{L}(\phi) = L(\theta) \quad \text{where } \theta = g^{-1}(\phi)
\]
As a result, since $L(\hat{\theta}) \geq L(\theta)$ by the definition of maximum likelihood,
\[
\tilde{L}(g(\hat{\theta})) = L(\hat{\theta}) \geq L(\theta) = \tilde{L}(g(\theta)),
\]
that is,
\[
\tilde{L}(g(\hat{\theta})) \geq \tilde{L}(\phi) \quad \forall \phi,
\]
so $g(\hat{\theta})$ maximises the likelihood $\tilde{L}$ and so is the maximum likelihood estimate of $\phi$.

g is not invertible
Note: this part of the proof is not examinable; it is purely for those who are interested. In this case there is not a unique $\theta$ corresponding to each $\phi$, and to define $\tilde{L}$ we need to make a choice as to which $\theta$ to use for each $\phi$. Define

\[
\tilde{L}(\phi) = \max_{\theta : g(\theta) = \phi} L(\theta);
\]

then again we can see that the largest value of $\tilde{L}$ occurs at $g(\hat{\theta})$, since this $\hat{\theta}$ maximises $L$.

Example of functional invariance

Suppose $X_1, \ldots, X_n$ are iid Exponential random variables, parameterised either by the rate parameter $\lambda$ or by the mean parameter $\mu$. What are the MLEs of the two parameters?

The rate parameter λ

\[
f_\lambda(x) = \prod_{i=1}^{n} \lambda \exp(-\lambda x_i)
\]

\begin{align*}
L(\lambda) &= \lambda^n \exp\left( -\lambda \sum_{i=1}^{n} x_i \right) \\
\ell(\lambda) &= n \ln \lambda - \lambda \sum_{i=1}^{n} x_i \\
\frac{d\ell(\lambda)}{d\lambda} &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i \\
\frac{d^2\ell(\lambda)}{d\lambda^2} &= -\frac{n}{\lambda^2}
\end{align*}

Solving $\frac{d\ell(\lambda)}{d\lambda} = 0$ and checking that at this point $\frac{d^2\ell(\lambda)}{d\lambda^2} < 0$ tells us that $\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i}$.

The mean parameter µ
\begin{align*}
f_\mu(x) &= \prod_{i=1}^{n} \frac{1}{\mu} \exp(-x_i / \mu) \\
L(\mu) &= \mu^{-n} \exp\left( -\sum_{i=1}^{n} x_i / \mu \right) \\
\ell(\mu) &= -n \ln \mu - \frac{1}{\mu} \sum_{i=1}^{n} x_i \\
\frac{d\ell(\mu)}{d\mu} &= -\frac{n}{\mu} + \frac{\sum_{i=1}^{n} x_i}{\mu^2} \\
\frac{d^2\ell(\mu)}{d\mu^2} &= \frac{n}{\mu^2} - \frac{2 \sum_{i=1}^{n} x_i}{\mu^3}
\end{align*}

Solving $\frac{d\ell(\mu)}{d\mu} = 0$ and checking that at this point $\frac{d^2\ell(\mu)}{d\mu^2} < 0$ tells us that $\hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{n}$.

The point of this example is to notice that $\mu = \lambda^{-1}$ and $\hat{\mu} = \hat{\lambda}^{-1}$ (so, once we had found $\hat{\lambda}$, we could write down $\hat{\mu}$ without resorting to calculus).
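A short numerical confirmation of this invariance, assuming NumPy and SciPy (names and values are illustrative), maximises the two log-likelihoods separately and checks that the maximisers satisfy $\hat{\mu} = 1/\hat{\lambda}$:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
x = rng.exponential(scale=1 / 1.5, size=200)   # iid exp(lambda = 1.5) sample
n = len(x)

def neg_loglik_rate(lam):
    # Negated log-likelihood in the rate parameterisation.
    return -(n * np.log(lam) - lam * x.sum())

def neg_loglik_mean(mu):
    # Negated log-likelihood in the mean parameterisation.
    return -(-n * np.log(mu) - x.sum() / mu)

lam_hat = minimize_scalar(neg_loglik_rate, bounds=(1e-6, 100), method="bounded").x
mu_hat = minimize_scalar(neg_loglik_mean, bounds=(1e-6, 100), method="bounded").x
print(lam_hat, mu_hat, 1 / lam_hat)   # mu_hat agrees with 1 / lam_hat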
