<<

Notes on and

Antonis Demos (Athens University of Economics and Business)

October 2002 2 Part I

Probability Theory

3

Chapter 1

INTRODUCTION

1.1 Digression

A set is defined as any collection of objects, which are called points or elements. The biggest possible collection of points under consideration is called the , universe,oruniversal set. For Probability Theory the space is called the . AsetA is called a of B (we write A B or B A) if every element ⊆ ⊇ of A is also an element of B. A is called a proper subset of B (we write A B or ⊂ B A) if every element of A is also an element of B and there is at least one element ⊃ of B which does not belong to A. Two sets A and B are called equivalent sets or equal sets (we write A = B) if A B and B A. ⊆ ⊆ If a set has no points, it will be called the empty or null set and denoted by φ. The complement of a set A with respect to the space Ω, denoted by A¯, Ac, or Ω A, is the set of all points that are in but not in A. − The intersection of two sets A and B is a set that consists of the common elements of the two sets and it is denoted by A B or AB. ∩ The union of two sets A and B is a set that consists of all points that are in A or B or both (but only once) and it is denoted by A B. ∪ The set difference of two sets A and B is a set that consists of all points in 6 Introduction

A that are not in B and it is denoted by A B. − Properties of Set Operations Commutative: A B = B A and A B = B A. ∪ ∪ ∩ ∩ Associative: A (B C)=(A B) C and A (B C)=(A B) C. ∪ ∪ ∪ ∪ ∩ ∩ ∩ ∩ Distributive: A (B C)=(A B) (A C) and A (B C)= ∩ ∪ ∩ ∪ ∩ ∪ ∩ (A B) (A C). ∪ ∩ ∪ (Ac)c = A¯ = A i.e. the complement of the A-complement is A. If A subset¡ ¢ of Ω (the space) then: A Ω = A, A Ω = Ω, A φ = φ, ∩ ∪ ∩ A φ = A, A A = φ, A A = Ω, A A = A,andA A = A. ∪ ∩ ∪ ∩ ∪ De Morgan Law: (A B)=A B,and(A B)=A B. ∪ ∩ ∩ ∪ Disjoint or mutually exclusive sets are the sets that their intersection is the empty set, i.e. A and B are mutually exclusive if A B = φ. A1,A2, .... ∩ are mutually exclusive if Ai Aj = φ for any i = j. ∩ 6

Uncertainty or variability are prevalent in many situations and it is the purpose of the probability theory to understand and quantify this notion. The basic situation is an experiment whose is unknown before it takes place e.g., a) coin tossing, b)throwingadie,c)choosingatrandomanumberfromN,d)choosingatrandoma number from (0, 1). The sample space is the collection or totality of all possible outcomes of a conceptual experiment. An event is a subset of the sample space. The class of all events associated with a given experiment is defined to be the event space. Let us describe the sample space S, i.e. the set of all possible relevant outcomes of the above experiments, e.g., S = H, T ,S = 1, 2, 3, 4, 5, 6 . In both of these { } { } examples we have a finite sample space. In example c) the sample space is a countable infinity whereas in d) it is an uncountable infinity. Classical or a priori Probability: If a random experiment can result in N mutually exclusive and equally likely outcomes and if N(A) of these outcomes have an attribute A,thentheprobability of A is the fraction N(A)/N i.e. P (A)=N(A)/N , Set Theory Digression 7 where N = N(A)+N(A). Example: Consider the drawing an ace (event A) from a deck of 52 cards. What is P (A)? We have that N(A)=4and N(A)=48.ThenN = N(A)+N(A)=4+48=

N(A) 4 52 and P (A)= N = 52 Frequency or a posteriori Probability: Is the ratio of the number α that an event A has occurred out of n trials, i.e. P (A)=α/n. Example: Assume that we flip a coin 1000 times and we observe 450 heads. Then the a posteriori probability is P (A)=α/n =450/1000 = 0.45 (this is also the relative frequency). Notice that the a priori probability is in this case 0.5. Subjective Probability: This is based on intuition or judgment. We shall be concerned with a priori . These probabilities involve, many times, the counting of possible outcomes.

1.1.1 Some Counting Problems

Some more sophisticated discrete problems require counting techniques. For example:

a) What is the probability of getting four of a kind in a five card poker? b) What is the probability that two people in a classroom have the same birthday?

The sample space in both cases, although discrete, can be quite large and it not feasible to write out all possible outcomes. 1. Duplication is permissible and Order is important (Multiple Choice Arrangement), i.e. the element AA is permitted and AB is a different element from BA. In this case where we want to arrange n objects in x places the

n x possibleoutcomesisgivenfrom:Mx = n . Example: Find all possible combinations of the letters A, B, C, and D when duplication is allowed and order is important. Theresultaccordingtotheformulais: n =4,andx =2, consequently the 8 Introduction

4 2 possible number of combinations is M2 =4 =16.Tofind the result we can also use atreediagram. 2. Duplication is not permissible and Order is important (Per- mutation Arrangement), i.e. the element AA is not permitted and AB is a different element from BA. In this case where we want to permute n objects in x places the possible outcomes is given from: n! P n or P (n, x)=n (n 1) ..(n x +1)= . x × − × − (n x)! − Example: Find all possible permutations of the letters A, B, C, and D when duplication is not allowed and order is important. Theresultaccordingtotheformulais: n =4,andx =2, consequently the

4 4! 2 3 4 possible number of combinations is P2 = (4 2)! = ∗2∗ =12. − 3. Duplication is not permissible and Order is not important (Combination Arrangement), i.e. the element AA is not permitted and AB is not adifferent element from BA. In this case where we want the combinations of n objects in x places the possible outcomes is given from: P (n, x) n! n Cn or C(n, x)= = = x x! (n x)!x! x − µ ¶ Example: Find all possible combinations of the letters A, B, C, and D when duplication is not allowed and order is not important. Theresultaccordingtotheformulais: n =4,andx =2, consequently the

4 4! 2 3 4 possible number of combinations is C2 = 2! (4 2)! = ∗2 ∗2 =6. ∗ − ∗ Let us now define probability rigorously.

1.1.2 Definition of Probability

Consider a collection of sets Aα with index α Γ, which is denoted by Aα : α Γ . ∈ { ∈ } We can define for an index Γ of arbitrary cardinality (the cardinal number of a set is the number of elements of this set):

Aα = x S : x Aα for some α Γ α Γ ∪∈ { ∈ ∈ ∈ } Set Theory Digression 9

Aα = x S : x Aα for all α Γ α Γ ∩∈ { ∈ ∈ ∈ }

A collection is exhaustive if α Γ Aα = S (partition), and is pairwise exclusive ∪ ∈ or disjoint if Aα Aβ = φ, α = β. ∩ 6 To define probabilities we need some further structure. This is because in uncountable cases we can not just define probability for all subsets of S,asthere are some sets on the real line whose probability can not be determined, i.e., they are unmeasurable. We shall define probability on a family of subsets of S,ofwhichwe require the following structure.

Definition 1 Let be a non-empty class of subsets of S. is an if A A 1. Ac , whenever A ∈ A ∈ A 2. A1 A2 ,wheneverA1,A2 . ∪ ∈ A ∈ A is a σ-algebra if also A / 2 . ∞ An , whenever An , n=1,2,3,... ∪n=1 ∈ A ∈ A

Note that since is non-empty, (1) and (2) φ and S .Notealso A ⇒ ∈ A ∈ A that ∞ An A.Thelargestσ-algebra is the set of all subsets of S, denoted by ∩n=1 ∈ (S), and the smallest is φ, S .Wecangenerateaσ-algebra from any collection P { } of subsets by adding to the set the complements and the unions of its elements. For example let S = R,and

= [a, b] , (a, b], [a, b), (a, b),a,b R , B { ∈ } and let = σ ( ) consists of all intervals and countable unions of intervals and A B complements thereof. This is called the Borel σ-algebra and is the usual σ-algebra we work when S = R.Theσ-algebra (R), i.e., there are sets in (R) not A ⊂ P P in . These are some pretty nasty ones like the Cantor set. We can alternatively A construct the Borel σ-algebra by considering the set of all intervals of the form J ( ,x], x R. We can prove that σ ( )=σ ( ). Wecannowgivethedefinition −∞ ∈ J B of probability which is due to Kolmogorov. 10 Introduction

Definition 2 Given a sample space S and a σ-algebra (S, ), a A is a mapping from R such that A −→ 1. P (A) 0 for all A ≥ ∈ A 2. P (S)=1

3. if A1,A2, ... are pairwise disjoint, i.e., Ai Aj = φ for all i = j,then ∩ 6 ∞ ∞ P Ai = P (Ai) Ãi=1 ! i=1 [ X Insuchawaywehaveaprobabilityspace(S, ,P).WhenS is discrete we A usually take = (S).WhenS = R or some subinterval thereof, we take = σ ( ). A P A B P is a matter of choice and will depend on the problem. In many discrete cases, the problem can usually be written such that outcomes are equally likely.

P ( x )=1/n, n = (S) . { } In continuous cases, P is usually like , i.e.,

P ((a, b)) b a. ∝ − Properties of P 1. P (φ)=0 2. P (A) 1 ≤ 3. P (Ac)=1 P (A) − 4. P (B Ac)=P (B) P (B A) ∩ − ∩ 5. If A B P (A) P (B) ⊂ ⇒ ≤ 6. P(B A)=+P (A)+P (B) P (A B) More generally, for events ∪ − ∩ A1,A2, ...An we have: ∈ A n n n+1 P Ai = P[Ai] P [AiAj]+ P [AiAjAk] ..+( 1) P [A1..An]. − − − "i=1 # i=1 i

P A1 A2 A3 = P [A1]+P [A2]+P [A3] P[A1A2] P [A1A3] P[A2A3]+P [A1A2A2]. − − − h [ [ i and Independence 11

7. P ( ∞ Ai) ∞ P (Ai) ∪i=1 ≤ i=1 Proofs involve manipulatingP sets to obtain disjoint sets and then apply the .

1.2 Conditional Probability and Independence

In many statistical applications we have variables X and Y (or events A and B) and want to explain or predict Y or A from X or B, we are interested not only in marginal probabilities but in conditional ones as well, i.e., we want to incorporate some in our predictions. Let A and B be two events in and a probability A P (.).Theconditional probability of A given event B,isdenotedby P [A B] and is defined as follows: |

Definition 3 The probability of an event A givenaneventB, denoted by P (A B), | is given by P (A B) P ([A B)= ∩ if P (B) > 0 | P (B) and is left undefined if P (B)=0.

From the above formula is evident P [AB]=P [A B]P [B]=P [B A]P [A] if | | both P [A] and P [B] are nonzero. Notice that when speaking of conditional proba- bilities we are conditioning on some given event B; that is, we are assuming that the experiment has resulted in some outcome in B. B,ineffect then becomes our ”new” sample space. All probability properties of the previous section apply to conditional probabilities as well, i.e. P ( B) is a probability measure. In particular: ·| 1. P(A B) 0 | ≥ 2. P(S B)=1 | 3. P( ∞ Ai B)= ∞ P(Ai B) for any pairwise disjoint events Ai ∞ . ∪i=1 | i=1 | { }i=1 Note that if A andPB are mutually exclusive events, P (A B)=0.When | A B, P(A B)= P (A) P (A) with strict inequality unless P (B)=1.When ⊆ | P (B) ≥ B A, P (A B)=1. ⊆ | 12 Introduction

However, there is an additional property (Law) called the Law of Total Probabilities which states that: :

P (A)=P(A B)+P (A Bc) ∩ ∩

For a given (Ω, ,P[.]),ifB1,B2, ..., Bn is a collection of mutually nA exclusive events in satisfying Bi = Ω and P [Bi] > 0 for i =1, 2, ..., n then for A i=1 every A , S ∈ A

n

P [A]= P [A Bi]P[Bi] i=1 | X Another important in probability is the so called Bayes’ Theorem which states:

BAYES RULE: Given a probability space (Ω, ,P[.]),ifB1,B2, ..., Bn is a An collection of mutually exclusive events in satisfying Bi = Ω and P[Bi] > 0 for A i=1 i =1, 2,...,n then for every A for which P [A] > 0 weS have: ∈ A

P [A Bj]P [Bj] P[Bj A]= | | n P [A Bi]P [Bi] i=1 | P Notice that for events A and B which satisfy P [A] > 0 and P [B] > 0 we have: ∈ A P (A B)P (B) P (B A)= | . | P (A B)P (B)+P (A Bc)P (Bc) | | This follows from the definition of and the law of total probability. The probability P (B) is a prior probability and P (A B) frequently is a | likelihood, while P (B A) is the posterior. | Finally the Multiplication Rule states:

Given a probability space (Ω, ,P[.]),ifA1,A2,...,An are events in for A A which P [A1A2...... An 1] > 0 then: − Conditional Probability and Independence 13

P [A1A2...... An]=P[A1]P [A2 A1]P [A3 A1A2].....P [An A1A2....An 1] | | | − Example: A plant has two machines. Machine A produces 60% of the total output with a fraction defective of 0.02. Machine B the rest output with a fraction defective of 0.04. If a single unit of output is observed to be defective, what is the probability that this unit was produced by machine A? If A is the event that the unit was produced by machine A, B the event that the unit was produced by machine B and D the event that the unit is defective. Then we ask what is P[A D].ButP [A D]= P [AD] .NowP [AD]=P [D A]P [A]=0.02 | | P [D] | ∗ 0.6=0.012.AlsoP [D]=P [D A]P [A]+P [D B]P [B]=0.012 + 0.04 0.4=0.028. | | ∗ Consequently, P [A D]=0.571.NoticethatP [B D]=1 P [A D]=0.429.Wecan | | − | also use a tree diagram to evaluate P [AD] and P [BD]. Example: A marketing manager believes the market demand potential of a new product to be high with a probability of 0.30, or with probability of 0.50, or to be low with a probability of 0.20. From a sample of 20 employees, 14 indicated a very favorable reception to the new product. In the past such an employee response (14 out of 20 favorable) has occurred with the following probabilities: if the actual demand is high, the probability of favorable reception is 0.80; if the actual demand is average, the probability of favorable reception is 0.55; and if the actual demand is low, the probability of the favorable reception is 0.30. Thus given a favorable reception, what is the probability of actual high demand? AgainwhatweaskisP [H F ]= P [HF] .NowP [F ]=P [H]P [F H]+P [A]P [F A]+ | P [F ] | | P [L]P [F L]=0.24+0.275+0.06 = 0.575.AlsoP [HF]=P [F H]P [H]=0.24. Hence | | P [H F]= 0.24 =0.4174 | 0.575 Example: There are five boxes and they are numbered 1 to 5. Each box contains 10 balls. Box i has i defective balls and 10 i non-defective balls, i =1, 2, .., 5. − Consider the following random experiment: First a box is selected at random, and then a ball is selected at random from the selected box. 1) What is the probability 14 Introduction that a defective ball will be selected? 2) If we have already selected the ball and notedthatisdefective,whatistheprobabilitythatitcamefromthebox5?

Let A denote the event that a defective ball is selected and Bi the event that box i is selected, i =1, 2, .., 5.NotethatP [Bi]=1/5,fori =1, 2,..,5,and

P [A Bi]=i/10.Question1)iswhatisP [A]? Using the theorem of total probabilities | we have: 5 5 i 1 P [A]= P [A Bi]P [Bi]= 5 5 =3/10. Noticethatthetotalnumberof i=1 | i=1 15 defectiveballsis15outof50.HenceinthiscasewecansaythatP P P[A]= 50 =3/10. This is true as the probabilities of choosing each of the 5 boxes is the same. Question

2) asks what is P [B5 A]. Since box 5 contains more defective balls than box 4, which | contains more defective balls than box 3 and so on, we expect to find that P [B5 A] > | P [B4 A] >P[B3 A] >P[B2 A] >P[B1 A]. We apply Bayes’ theorem: | | | |

P [A B ]P [B ] 1 1 1 P [B A]= 5 5 = 2 5 = 5 5 | 3 | 10 3 P [A Bi]P[Bi] i=1 |

P j 1 P [A Bj ]P [Bj ] j | 10 5 Similarly P [Bj A]= 5 = 3 = 15 for j =1, 2, ..., 5. Noticethatuncon- 10 | P [A Bi]P [Bi] i=1 | P ditionally all Bi0s were equally likely.

Let A and B be two events in and a probability function P (.).Events A A and B are defined independent if and only if one of the following conditions is satisfied: (i) P [AB]=P [A]P [B]. (ii) P [A B]=P [A] if P [B] > 0. | (iii) P [B A]=P [B] if P [A] > 0. | These are equivalent definitions except that (i) does not really require P (A), P (B) > 0. Notice that the property of two events A and B and the property that A and B are mutually exclusive are distinct, though related properties. We know that if A and B are mutually exclusive then P [AB]=0. Now if these events are Conditional Probability and Independence 15 also independent then P [AB]=P [A]P [B], and consequently P [A]P[B]=0,which that either P[A]=0or P [B]=0. Hence two mutually exclusive events are independent if P [A]=0or P [B]=0. On the other hand if P [A] =0and P [B] =0, 6 6 then if A and B are independent can not be mutually exclusive and oppositely if they are mutually exclusive can not be independent. Also notice that independence is not transitive, i.e., A independent of B and B independent of C does not imply that A is independent of C. Example: Consider tossing two . Let A denote the event of an odd total, B the event of an ace on the first die, and C the event of a total of seven. We ask the following: (i) Are A and B independent? (ii) Are A and C independent? (iii) Are B and C independent? (i) P [A B]=1/2, P [A]=1/2 hence P [A B]=P [A] and consequently A and | | B are independent. (ii) P [A C]=1= P [A]=1/2 hence A and C are not independent. | 6 (iii) P [C B]=1/6=P [C]=1/6 hence B and C are independent. | Notice that although A and B are independent and C and B are independent A and C are not independent. Let us extend the independence of two events to several ones:

For a given probability space (Ω, ,P[.]),letA1,A2, ..., An be n events in . A A Events A1,A2,...,An are defined to be independent if and only if:

P [AiAj]=P [Ai]P [Aj] for i = j 6 P [AiAjAk]=P [Ai]P [Aj]P [Ak] for i = j, i = k, k = j 6 6 6 and so on n n P [ Ai]= P [Ai] i=1 i=1 Notice thatT pairwiseQ independence does not imply independence, as the fol- lowing example shows. 16 Introduction

Example: Consider tossing two dice. Let A1 denote the event of an odd face in the first die, A2 the event of an odd face in the second die, and A3 the event of 1 1 1 1 an odd total. Then we have: P [A1]P [A2]= 2 2 = P [A1A2],P[A1]P[A3]= 2 2 = 1 P [A3 A1]P [A1]=P [A1A3], and P [A2A3]= = P [A2]P [A3] hence A1,A2,A3 are | 4 1 pairwise independent. However notice that P [A1A2A3]=0= = P [A1]P [A2]P[A3]. 6 8 Hence A1,A2,A3 are not independent. Chapter 2

RANDOM VARIABLES, DISTRIBUTION FUNCTIONS, AND DENSITIES

The probability space (S, ,P) isnotparticularlyeasytoworkwith.Inpractice,we A often need to work with spaces with some structure (metric spaces). It is convenient therefore to work with a cardinalization of S by using the notion of random . Formally, a X is just a mapping from the sample space to the real line, i.e.,

X : S R, −→ with a certain property, it is a measurable mapping, i.e.

1 X = A S : X(A) = X− (B):B , A { ⊂ ∈ B} ∈ B ⊆ A © ª where is a -algebra on R, for any B in the inverse image belongs to .The B B A probability measure PX canthenbedefined by

1 PX (X B)=P X− (B) . ∈ ¡ ¢ It is straightforward to show that X is a σ-algebra whenever is. Therefore, PX A B is a probability measure obeying Kolmogorov’saxioms.Hencewehavetransferred

(S, ,P) (R, ,PX ),where is the Borel σ-algebra when X(S)=R or any A −→ B B uncountable set, and is (X (S)) when X (S) is finite. The function X(.) must be B P such that the set Ar,defined by Ar = ω : X(ω) r , belongs to for every real { ≤ } A number r,aselementsof are left-closed intervals of R. B 18 Random Variables, Distribution Functions, and Densities

The important part of the definition is that in terms of a random experiment, S is the totality of outcomes of that random experiment, and the function, or random variable, X(.) with domain S makes some correspond to each outcome of the experiment. The fact that we also require the collection of ω s for which 0 X(ω) r to be an event (i.e. an element of ) for each real number r is not much of ≤ A a restriction since the use of random variables is, in our case, to describe only events. Example: Consider the experiment of tossing a single coin. Let the random variable X denote the number of heads. In this case S = head, tail ,andX(ω)=1 { } if ω = head,andX(ω)=0if ω = tail. So the random variable X associates a real number with each outcome of the experiment. To show that X satisfies the definition we should show that ω : X(ω) r , belongs to for every real number { ≤ } A r. = φ, head , tail ,S .Nowifr<0, ω : X(ω) r = φ,if0 r<1 then A { { } { } } { ≤ } ≤ ω : X(ω) r = tail ,andifr 1 then ω : X(ω) r = head, tail = S. { ≤ } { } ≥ { ≤ } { } Hence, for each r the set ω : X(ω) r belongs to and consequently X(.) is a { ≤ } A random variable. In the above example the random variable is described in terms of the random experiment as opposed to its functional form, which is the usual case.

We can now work with (R, ,PX ), which has metric structure and algebra. B Forexample,wetosstwodieinwhichcasethesamplespaceis

S = (1, 1) , (1, 2) ,...,(6, 6) . { } We can define two random variables: the Sum and Product:

X (S)= 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 { } X (S)= 1, 2, 3, 4, 5, 6, 8, 9, 10, ..., 36 { }

The simplest form of random variables are the indicators IA

1 if s A IA (s)= ∈ ⎧ 0 if s / A ⎨ ∈ ⎩ Random Variables, Distribution Functions, and Densities 19

This has associated sigma algebra in S

φ, S, A, Ac { } Finally,wegiveformaldefinition of a continuous real-valued random variable.

Definition 4 A random variable is continuous if its probability measure PX is ab- solutely continuous with respect to Lebesgue measure, i.e., PX (A)=0whenever λ(A)=0.

2.0.1 Distribution Functions

Associated with each random variable there is the distribution function

FX (x)=PX (X x) ≤ defined for all x R. This function effectively replaces PX . Notethatwecan ∈ reconstruct PX from FX . EXAMPLE. S = H, T ,X(H)=1,X(T)=0, (p =1/2). { } If x<0, FX (x)=0

If 0 x<1, FX(x)=1/2 ≤ If x 1, FX(x)=1. ≥ EXAMPLE. The logit c.d.f. is

1 FX(x)= x 1+e− It is continuous everywhere and asymptotes to 0 and 1 at respectively. Strictly ±∞ increasing.

Note that the distribution function FX (x) of a continuous random variable is a . The distribution function of a discrete random variable is a step function.

Theorem 5 AfunctionF ( ) is a c.d.f. of a random variable X if and only if the · following three conditions hold 20 Random Variables, Distribution Functions, and Densities

1. limx F (x)=0and limx F(x)=1 →−∞ →∞ 2. F is a nondecreasing function in x

3. F is right-continuous, i.e., for all x0, limx x+ F (x)=F(x0) → 0 4. F is continuous except at a set of points of Lebesgue measure zero.

2.0.2 Discrete Random Variables.

As we have already said, a random variable X will be defined to be discrete if the range of X is countable. If a random variable X is discrete, then its corresponding cumulative distribution function FX(.) will be defined to be discrete, i.e. a step function. By the range of X being countable we that there exists a finite or denumerable set of real , say x1,x2, ...xn, ...,suchthatX takes on values only in that set. If X is discrete with distinct values x1,x2, ...xn, ...,thenS = ω : n { X(ω)=xn , and X = xi X = xj = φ for i = j. Hence 1=P[S]= PS[X = } { } ∩ { } 6 n xn] by the third of probability. P

If X is a discrete random variable with distinct values x1,x2,...xn, ..., then the function, denoted by fX (.) and defined by

P [X = x] if x = xj,j=1, 2, ..., n, ... f (x)= X ⎧ 0 if x = xj ⎨ 6 is defined to be the discrete density function of X. ⎩ Notice that the discrete density function tell us how likely or probable each of the values of a discrete random variable is. It also enables one to calculate the probability of events described in terms of the discrete random variable. Also notice that for any discrete random variable X, FX (.) can be obtained from fX (.),andvice versa Example: Consider the experiment of tossing a single die. Let X denote the number of spots on the upper face. Then for this case we have: X takes any value from the set 1, 2, 3, 4, 5, 6 .SoX is a discrete ran- { } dom variable. The density function of X is: fX (x)=P[X = x]=1/6 for any Random Variables, Distribution Functions, and Densities 21 x 1, 2, 3, 4, 5, 6 and 0 otherwise. The cumulative distribution function of X is: ∈ { } [x] FX(x)=P [X x]= P [X = n] where [x] denotes the integer part of x.. Notice ≤ n=1 that x can be any realP number. However, the points of interest are the elements of 1, 2, 3, 4, 5, 6 . NoticealsothatinthiscaseΩ = 1, 2, 3, 4, 5, 6 as well, and we do { } { } not need any reference to . A Example: Consider the experiment of tossing two dice. Let X denote the total of the upturned faces. Then for this case we have: Ω = (1, 1), (1, 2),...(1, 6), (2, 1), (2, 2), ....(2, 6), (3, 1), ....., (6, 6) a total of (us- { } ing the Multiplication rule) 36 = 62 elements. X takesvaluesfromtheset 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 { The density function is: 1/36 for x =2 or x =12 ⎧ 2/36 for x =3 or x =11 ⎪ ⎪ ⎪ 3/36 for x =4 or x =10 ⎪ ⎪ fX (x)=P [X = x]=⎪ 4/36 for x =5 or x =9 ⎪ ⎪ ⎨ 5/36 for x =6 or x =8 ⎪ 1/36 for x =7 ⎪ ⎪ ⎪ 0 foranyotherx ⎪ ⎪ The cumulative distribution⎪ function is: ⎩⎪ 0 for x < 2 ⎧ 1 for 2 x<3 ⎪ 36 ⎪ ≤ ⎪ 3 for 3 x<4 ⎪ 36 ⎪ ≤ [x] ⎪ 6 for 4 x<5 ⎪ 36 FX (x)=P [X x]= P [X = n]=⎪ ≤ ≤ n=1 ⎪ 10 ⎪ 36 for 5 x<6 P ⎨ ≤ ...... ⎪ ⎪ ⎪ 35 for 11 x<12 ⎪ 36 ⎪ ≤ ⎪ 1 for 12 x ⎪ ⎪ ≤ Notice that, again, we do not need any⎪ reference to . ⎩⎪ A In fact we can speak of discrete density functions without reference to some random variable at all. 22 Random Variables, Distribution Functions, and Densities

Any function f(.) with domain the real line and counterdomain [0, 1] is defined to be a discrete density function if for some x1,x2,...xn, .... has the following properties:

i) f(xj) > 0 for j =1, 2, ...

ii) f(x)=0for x = xj; j =1, 2,... 6 iii) f(xj)=1, where the summation is over the points x1,x2, ...xn, .... 2.0.3 ContinuousP Random Variables

A random variable X is called continuous if there exist a function fX (.) such that x FX(x)= fX (u)du for every real number x. InsuchacaseFX (x) is the cumulative −∞ distributionR and the function fX (.) is the density function. Notice that according to the above definition the density function is not uniquely determined. The idea is that if the a function change value if a few points its is unchanged. Furthermore, notice that fX (x)=dFX (x)/dx. The notations for discrete and continuous density functions are the same, yettheyhavedifferent interpretations. We know that for discrete random variables fX(x)=P [X = x], which is not true for continuous random variables. Further- more, for discrete random variables fX (.) is a function with domain the real line and counterdomain the interval [0, 1], whereas, for continuous random variables fX(.) is a function with domain the real line and counterdomain the interval [0, ).Notethat ∞ for a continuous r.v.

P (X = x) P (x ε X x)=FX (x) FX (x ε) 0 ≤ − ≤ ≤ − − → as ε 0, by the continuity of FX (x). The set X = x is an example of a set of → { } measure (in this case the measure is P or PX ) zero. In fact, any countable set is of measure zero under a distribution which is absolutely continuous with respect to Lebesgue measure. Because the probability of a is zero

P (a X b)=P (a X

Example: Let X be the random variable representing the length of a tele- phone conversation. One could model this experiment by assuming that the distrib-

λx ution of X is given by FX (x)=(1 e− ) where λ is some positive number and the − random variable can take values only from the interval [0, ). The density function ∞ λx is dFX (x)/dx = fX(x)=λe− . If we assume that telephone conversations are mea- 10 10 λx 5λ 10λ sured in minutes, P [5

EXPECTATIONS AND MOMENTS OF RANDOM VARIABLES

An extremely useful concept in problems involving random variables or distributions is that of expectation.

3.0.4 Mean or Expectation

Let X be a random variable. The mean or the of X,denotedby

E[X] or μX ,isdefined by:

(i) E[X]= xjP [X = xj]= xjfX (xj) if X is a discreteP random variableP with counterdomain the countable set

x1,...,xj,.. { } (ii) E[X]= ∞ xfX (x)dx −∞ if X is a continuousR random variable with density function fX (x) and if either 0 ∞ 0 xfX (x)dx < or xfX (x)dx < or both. ∞ −∞ ∞ 0 ¯ ∞ ¯ ¯R (iii)¯ E[X]=¯R0 [1 FX (x)]¯dx FX (x)dx ¯ ¯ ¯ − ¯ − −∞ for an arbitrary randomR variable X. R (i) and (ii) are used in practice to find the mean for discrete and continuous random variables, respectively. (iii) is used for the mean of a random variable that is neither discrete nor continuous. Notice that in the above definition we assume that the sum and the exist. Also that the summation in (i) runs over the possible values of j and the jth term is the value of the random variable multiplied by the probability that the random variable takes this value. Hence E[X] is an average of the values that the 26 Expectations and Moments of Random Variables random variable takes on, where each value is weighted by the probability that the random variable takes this value. Values that are more probable receive more weight. Thesameistrueintheintegralformin(ii).Therethevaluex is multiplied by the approximate probability that X equals the value x, i.e. fX(x)dx, and then integrated over all values. Notice that in the definition of a mean of a random variable, only density functions or cumulative distributions were used. Hence we have really defined the mean for these functions without reference to random variables. We then call the defined mean the mean of the cumulative distribution or the appropriate density function. Hence, we can speak of the mean of a distribution or density function as well as the mean of a random variable. Notice that E[X] is the center of gravity (or centroid) of the unit mass that is determined by the density function of X.SothemeanofX is a measure of where the values of the random variable are centered or located i.e. is a measure of central location. Example: Consider the experiment of tossing two dice. Let X denote the total of the upturned faces. Then for this case we have: 12 E[X]= ifX (i)=7 i=2 Example:P Consider a X that can take only to possible values, 1 and -1, each with probability 0.5. Then the mean of X is: E[X]=1 0.5+( 1) 0.5=0 ∗ − ∗ NoticethatthemeaninthiscaseisnotoneofthepossiblevaluesofX. Example: Consider a continuous random variable X with density function λx fX(x)=λe− for x [0, ).Then ∈ ∞ ∞ ∞ λx E[X]= xfX (x)dx = xλe− dx =1/λ 0 Example:−∞R Consider a continuousR random variable X with density function 2 fX(x)=x− for x [1, ).Then ∈ ∞ ∞ ∞ 2 E[X]= xfX (x)dx = xx− dx =limlog b = b 1 →∞ ∞ −∞R R Expectations and Moments of Random Variables 27

so we say that the mean does not exist, or that it is infinite.

Median of X:WhenFX is continuous and strictly increasing, we can define the of X,denotedm(X),asbeingtheuniquesolutionto

1 F (m)= . X 2

1 1 1 Since in this case, F − ( ) exists, we can alternatively write m = F − ( ). For discrete X · X 2 r.v., there may be many m that satisfy this or may none. Suppose

01/3 X = ⎧ 11/3 , ⎪ ⎪ ⎨ 21/3 ⎪ ⎪ 1 then there does not exist an m with FX⎩(m)= 2 . Also, if

01/4 ⎧ 11/4 ⎪ X = ⎪ , ⎪ 21/4 ⎨⎪ 31/4 ⎪ ⎪ ⎪ then any 1 m 2 is an adequate median.⎩⎪ ≤ ≤ n n 1 Note that if E (X ) exists, then so does E (X − ) but not vice versa (n>0). Also when the is infinite, the expectation does not necessarily exist. 0 ∞ If 0 xfX(x)dx = but xfX (x)dx > ,thenE (X)= ∞ −∞ −∞ ∞ 0 ∞ If R0 xfX(x)dx = and R xfX (x)dx = ,thenE (X)is not defined. ∞ −∞ −∞ 1 1 Example:R [Cauchy] fX(xR)= π 1+x2 . This density function is symmetric about zero, and one is temted to say that E (X)=0.But ∞ xfX (x)dx = and 0 ∞ 0 xfX (x)dx = ,soE(X) does not exist according to theR above definition. −∞ −∞ R Now consider Y = g(X), where g is a (piecewise) monotonic continuous func- tion. Then

∞ ∞ E (Y )= yfY (y)dy = g(x)fX (x)dx = E (g(x)) Z−∞ Z−∞ 28 Expectations and Moments of Random Variables

Theorem 6 Expectation has the following properties:

1. [Linearity] E (a1g1(X)+a2g2(X)+a3)=a1E (g1(X)) + a2E (g2(X)) + a3

2. [Monotonicity] If g1(x) g2(x) E (g1(X)) E (g2(X)) ≥ ⇒ ≥ 3. Jensen’s inequality. If g(x) is a weakly , i.e., g (λx +(1 λ) y) − ≤ λg (x)+(1 λ) g (y) for all x, y, and all with 0 λ 1, then − ≤ ≤

E (g(X)) g (E (X)) . ≥

An Interpretation of Expectation

We claim that E (X) is the unique minimizer of E (X θ)2 with respect to θ, assum- − ing that the second moment of X is finite.

Theorem 7 Suppose that E (X2) exists and is finite, then E (X) is the unique min- imizer of E (X θ)2 with respect to θ. −

This Theorem says that the Expectation is the closest to θ,inmean error.

3.0.5

2 Let X be a random variable and μX be E[X].Thevariance of X, denoted by σX or var[X],isdefined by:

2 2 (i) var[X]= (xj μ ) P [X = xj]= (xj μ ) fX(xj) − X − X if X is a discreteP random variable with counterdomainP the countable set

x1,...,xj,.. { } 2 (ii) var[X]= ∞ (xj μX ) fX(x)dx −∞ − if X is a continuousR random variable with density function fX(x). 2 (iii) var[X]= ∞ 2x[1 FX(x)+FX( x)]dx μ 0 − − − X for an arbitrary randomR variable X. Thevariancesaredefined only if the series in (i) is convergent or if the integrals in (ii) or (iii) exist. Again, the variance of a random variable is definedintermsof Expectations and Moments of Random Variables 29 the density function or cumulative distribution function of the random variable and consequently, variance can be defined in terms of these functions without reference to a random variable. Notice that variance is a measure of spread since if the values of the random variable X tend to be far from their mean, the variance of X will be larger than the variance of a comparable random variable whose values tend to be near their mean. It is clear from (i), (ii) and (iii) that the variance is a nonnegative number.

2 If X is a random variable with variance σX , then the standard deviation of

X,denotedbyσX ,isdefined as var(X) The standard deviation ofp a random variable, like the variance, is a measure of spread or dispersion of the values of a random variable. In many applications it is preferable to the variance since it will have the same measurement units as the random variable itself. Example: Consider the experiment of tossing two dice. Let X denote the total of the upturned faces. Then for this case we have (μX =7): 12 2 var[X]= (i μX) fX (i)=210/36 i=2 − Example:PConsider a X that can take only to possible values, 1 and -1, each with probability 0.5. Then the variance of X is (μX =0): var[X]=0.5 12 +0.5 ( 1)2 =1 ∗ ∗ − Example: Consider a X that can take only to possible values, 10 and -10, each with probability 0.5. Then we have: μ = E[X]=10 0.5+( 10) 0.5=0 X ∗ − ∗ var[X]=0.5 102 +0.5 ( 10)2 =100 ∗ ∗ − Notice that in examples 2 and 3 the two random variables have the same mean but different variance, larger being the variance of the random variable with values further away from the mean. Example: Consider a continuous random variable X with density function λx fX(x)=λe− for x [0, ).Then(μ =1/λ): ∈ ∞ X 30 Expectations and Moments of Random Variables

∞ 2 ∞ 2 λx 1 var[X]= (x μX ) fX (x)dx = (x 1/λ) λe− dx = λ2 − 0 − Example:−∞R Consider a continuousR random variable X with density function 2 fX(x)=x− for x [1, ). Then we know that the mean of X does not exist. ∈ ∞ Consequently, we can not define the variance. Notice that

Var(X)=E (X E(X))2 = E X2 E2 (X) − − £ ¤ ¡ ¢ and that

Var(aX + b)=a2Var(X) ,SD= √Var, SD(aX + b)= a SD(X), | | i.e., SD(X) changes proportionally. Variance/standard deviation measures disper- 1 1 sion, higher variance more spread out. Interquartile range: F − (3/4) F − (1/4),the X − X range of middle half always exists and is an alternative measure of dispersion.

3.0.6 Higher Moments of a Random Variable

th / If X is a random variable, the r raw moment of X, denoted by μr,isdefined as:

/ r μr = E[X ]

/ / if this expectation exists. Notice that μr = E[X]=μ1 = μX , the mean of X. If X is a random variable, the rth of X about α is defined as E[(X α)r]. If α = μ ,wehavetherth central moment of X about μ ,denoted − X X by μr,whichis: μ = E[(X μ )r] r − X We have measures defined in terms of quantiles to describe some of the char- acteristics of random variables or density functions. The qth quantile of a random variable X or of its corresponding distribution is denoted by ξq and is defined as the smallest number ξ satisfying FX (ξ) q.IfX is a continuous random variable, then ≥ th the q quantile of X is given as the smallest number ξ satisfying FX(ξ) q. ≥ Expectations and Moments of Random Variables 31

The median of a random variable X, denoted by medX or med(X), or ξq,is the 0.5th quantile. Notice that if X a continuous random variable the median of X satisfies:

med(X) 1 ∞ fX (x)dx = = fX (x)dx 2 med(X) Z−∞ Z so the median of X is any number that has half the mass of X to its right and the other half to its left. The median and the mean are measures of central location.

The third moment about the mean μ = E (X E (X))3 is called a measure 3 − of asymmetry, or . Symmetrical distributions can be shown to have μ3 =0. Distributions can be skewed to the left or to the right. However, knowledge of the third moment gives no clue as to the shape of the distribution, i.e. it could be the

μ3 case that μ3 =0but the distribution to be far from symmetrical. The ratio σ3 is unitless and is call the coefficient of skewness. An alternative measure of skewness is provided by the ratio: (mean-median)/(standard deviation)

The fourth moment about the mean μ = E (X E (X))4 is used as a measure 4 − of ,whichisadegreeofflatness of a density near the center. The coefficient of kurtosis is defined as μ4 3 and positive values are sometimes used to indicate σ4 − that a density function is more peaked around its center than the normal (leptokur- tic distributions). A positive value of the coefficient of kurtosis is indicative for a distribution which is flatter around its center than the standard normal (platykurtic distributions). This measure suffers from the same failing as the measure of skewness i.e. it does not always measure what it supposed to.

While a particular moment or a few of the moments may give little information about a distribution the entire set of moments will determine the distribution exactly. In applied statistics the firsttwomomentsareofgreatimportance,butthethirdand forth are also useful. 32 Expectations and Moments of Random Variables

3.0.7 Moment Generating Functions

Finally we turn to the moment generating function (mgf) and (cf). The mgf is defined as

tX ∞ tx MX (t)=E e = e fX(x)dx Z−∞ ¡ ¢ for any real t, provided this integral exists in some neighborhood of 0. It is the

Laplace transform of the function fX ( ) with argument t. We have the useful · − inversion formula ∞ tx fX (x)= MX (t) e− dt Z−∞ The mgf is of limited use, since it does not exist for many r.v. the cf is applicable more generally, since it always exists:

itX ∞ itx ∞ ∞ ϕX (t)=E e = e fX (x)dx = cos (tx) fX (x)dx+i sin (tx) fX (x)dx Z−∞ Z−∞ Z−∞ ¡ ¢ This essentially is the of the function fX ( ) and there is a well · defined inversion formula

1 ∞ itx fX(x)= e− ϕ (t) dt √2π X Z−∞ If X is symmetric about zero, the complex part of cf is zero. Also,

r d r r itX r r ϕ (0) = E i X e t=0= i E (X ) ,r=1, 2, 3,.. dtr X ↓ ¡ ¢ Thus the moments of X are related to the derivative of the cf at the origin. If c (t)= ∞ exp (itx) dF (x) Z−∞ notice that drc (t) = ∞ (ix)r exp (itx) dF (x) dtr Z−∞ and r r d c (t) ∞ r r / / r d c (t) r = (ix) dF (x)=(i) μr μr =( i) r dt t=0 ⇒ − dt t=0 ¯ Z−∞ ¯ ¯ ¯ ¯ ¯ ¯ ¯ Expectations and Moments of Random Variables 33 the rth uncenterd moment. Now expanding c (t) in powers of t we get

drc (t) drc (t) (t)r (it)r c (t)=c (0) + t + ... + + ... =1+μ/ (it)+... + μ/ + ... dtr dtr r! 1 r r! ¯t=0 ¯t=0 ¯ ¯ ¯ ¯ The cummulants are¯ defined as the coeffi¯ cients κ1, κ2, ..., κr of the identity in it 2 r r (it) (it) / / (it) exp κ1 (it)+κ2 + ... + κr + ... =1+μ1 (it)+... + μr + ... Ã 2! r! ! r! = c (t)= ∞ exp (itx) dF (x) Z−∞ The -moment connection:

Suppose X is a random variable with n moments a1, ...an. Then X has n k1,...kn and

r r ar+1 = ajkr+1 j for r =0, ..., n 1. − j=0 ⎛ j ⎞ − X ⎝ ⎠ Writing out for r =0, ...3 produces:

a1 = k1

a2 = k2 + a1k1

a3 = k3 +2a1k2 + a2k1

a4 = k4 +3a1k3 +3a2k2 + a3k1.

These recursive formulas can be used to calculate the a s efficiently from the 0 k s, and vice versa. When X has mean 0, that is, when a1 =0=k1,aj becomes 0 μ = E((X E(X))j), j − so the above formulas simplify to:

μ2 = k2

μ3 = k3

2 μ4 = k4 +3k2. 34 Expectations and Moments of Random Variables

3.0.8 Expectations of Functions of Random Variablers

Product and Quotient

X X Let f (X, Y )= Y , E (X)=μX and E (Y )=μY . Then, expanding f (X, Y )= Y around (μX,μY ),wehave

μX 1 μX μX 2 1 f (X, Y )= + (X μX ) 2 (Y μY )+ 3 (Y μY ) 2 (X μX )(Y μY ) μY μY − −(μY ) − (μY ) − −(μY ) − −

2 2 2 2 as ∂f = 1 , ∂f = X , ∂ f =0, ∂ f = ∂ f = 1 ,and ∂ f =2X .Taking ∂X Y ∂Y − Y 2 ∂X2 ∂X∂Y ∂Y ∂X − Y 2 ∂Y 2 Y 3 expectations we have

X μ μ 1 E = X + X Var(Y ) Cov (X, Y ) . Y μ (μ )3 − (μ )2 µ ¶ Y Y Y For the variance, take again the variance of the Taylor expansion and keeping only terms up to order 2 we have:

X (μ )2 Var(X) Var(Y ) Cov(X, Y ) Var = X + 2 . Y (μ )2 (μ )2 (μ )2 − μ μ µ ¶ Y ∙ X Y X Y ¸ Chapter 4

EXAMPLES OF PARAMETRIC UNIVARIATE DISTRIBUTIONS

A parametric family of density functions is a collection of density functions that are

λx indexed by a quantity called , e.g. let f(x; λ)=λe− for x>0 and some λ>0. λ is the parameter, and as λ ranges over the positive numbers, the collection f(.; λ):λ>0 is a parametric family of density functions. { } 4.0.9 Discrete Distributions

UNIFORM:

Suppose that for j =1, 2, 3, ...., n

1 P (X = xj )= |X n where x1,x2, ...xn = is the support. Then { } X 2 1 n 1 n 1 n E (X)= x ,Var(X)= x2 x . n j n j n j j=1 j=1 − Ã j=1 ! X X X The c.d.f. here is 1 n P (X x)= 1(x x) n j ≤ j=1 ≤ X Bernoulli

A random variable whose outcome have been classified into two categories, called “success” and “failure”, represented by the letters s and f, respectively, is called a . If a random variable X is defined as 1 if a Bernoulli trial results in 36 Examples of Parametric Univariate Distributions success and 0 if the same Bernoulli trial results in failure, then X has a with parameter p = P [success].Thedefinition of this distribution is: A random variable X has a Bernoulli distribution if the discrete density of X is given by:

x 1 x p (1 p) − for x =0, 1 fX (x)=fX (x; p)= − ⎧ 0 otherwise ⎨ where p = P [X =1].Fortheabovede⎩ fined random variable X we have that:

E[X]=pandvar[X]=p(1 p) −

BINOMIAL:

Consider a random experiment consisting of n repeated independent Bernoulli trials with p the probability of success at each individual trial. Let the random variable X represent the number of successes in the n repeated trials. Then X follows a . The definition of this distribution is:

A random variable X has a binomial distribution, X ∼ Binomial(n, p),if the discrete density of X is given by:

n x n x p (1 p) − for x =0, 1,...,n fX (x)=fX (x; n, p)=⎧ ⎛ x ⎞ − ⎪ ⎪ ⎨ ⎝ ⎠ 0 otherwise ⎪ where p = P [X =1]i.e. the⎩⎪ probability of success in each independent Bernoulli trial and n is the total number of trials. For the above defined random variable X we have that: E[X]=np and var[X]=np(1 p) − Mgf t n MX (t)= pe +(1 p) . − Example: Consider a stock with£ value S =50¤ . Each period the stock moves up or down, independently, in discrete steps of 5. The probability of going up is Examples of Parametric Univariate Distributions 37 p =0.7 and down 1 p =0.3. Whatistheexpectedvalueandthevarianceofthe − value of the stock after 3 period?

If we call X the random variable which is a success if the stock moves up and failure if the stock moves down. Then P [X = success]=P[X =1]=0.7,and X˜Binomial(3,p).NowX can take the values 0, 1, 2, 3 i.e. no success, 1 success and 2 failures, etc.. The value of the stock in each case and the probabilities are:

3 0 3 0 3 S =35,andfX (0) = p (1 p) − =1 0.3 =0.027, ⎛ 0 ⎞ − ∗ ⎝ ⎠ 3 1 3 1 2 S =45,andfX (1) = p (1 p) − =3 0.7 0.3 =0.189, ⎛ 1 ⎞ − ∗ ∗ ⎝ ⎠ 3 2 3 2 2 S =55,andfX (2) = p (1 p) − =3 0.7 0.3=0.441, ⎛ 2 ⎞ − ∗ ∗ ⎝ ⎠ 3 3 3 3 3 S =65and fX(3) = p (1 p) − =1 0.7 =0.343. ⎛ 3 ⎞ − ∗ Hence the expected stock⎝ ⎠ value is:

E[S]=35 0.027 + 45 0.189 + 55 0.441 + 65 0.343 = 56,andvar[S]= ∗ ∗ ∗ ∗ (35 56)2 0.027 + ( 11)2 0.189 + ( 1)2 0.441 + (9)2 0.343. − ∗ − ∗ − ∗ ∗

Hypergeometric

Let X denote the number of defective balls in a sample of size n when is done without replacement from a box containing M balls out of which K are defective. The X has a hypergeometric distribution. The definition of this distribution is:

A random variable X has a hypergeometric distribution if the discrete den- 38 Examples of Parametric Univariate Distributions sity of X is given by:

K M K ⎛ ⎞⎛ − ⎞ ⎧ ⎜ ⎟⎜ ⎟ ⎜ x ⎟⎜ n x ⎟ ⎪ ⎜ ⎟⎜ − ⎟ for x =0, 1, ..., n ⎪ ⎝ ⎠⎝ ⎠ fX(x)=fX (x; M,K,n)=⎪ M ⎪ ⎪ ⎛ ⎞ ⎨ ⎜ ⎟ ⎜ n ⎟ ⎜ ⎟ ⎪ ⎝ 0 ⎠ otherwise ⎪ ⎪ ⎪ where M is a positive integer, K⎩⎪ is a nonnegative that is at most M,andn is a positive integer that is at most M. For this distribution we have that:

K K M K M n E[X]=n and var[X]=n − − M M M M 1 − Notice the difference of the binomial and the hypergeometric i.e. for the binomial distribution we have Bernoulli trials i.e. independent trials with fixed probability of success or failure, whereas in the hypergeometric in each trial the probability of success or failure changes depending on the result.

Geometric

Consider a sequence of independent Bernoulli trials with p equal the probability of success on an individual trial. Let the random variable X represent the number of trials required before the first success. Then X has a . The definition of this distribution is: A random variable X has a geometric distribution,

X ∼ geometric(p) , if the discrete density of X is given by:

p(1 p)x for x =0, 1, ..., n fX (x)=fX (x; p)= − ⎧ 0 otherwise ⎨ where p is the probability of success⎩ in each Bernoulli trial. For this distribution we have that: 1 p 1 p E[X]= − and var[X]= − p p2 Examples of Parametric Univariate Distributions 39

It is worth noticing that the Binomial distribution Binomial(n, p) can be approximated by a P oisson(np) (see below). The approximation is more valid as n ,p 0, in such a way so that np = constant. →∞ →

POISSON:

ArandomvariableX has a , X ∼ P oisson(λ), if the discrete density of X is given by:

e λλx P (X = x λ)= − x =0, 1, 2, 3,... | x!

In calculations with the Poisson distribution we may use the fact that

∞ tj et = for any t. j! j=0 X Employing the above we can prove that

E (X)=λ, E (X (X 1)) = λ2,Var(X)=λ. −

The Poisson distribution provides a realistic model for many random phenomena. Since the values of a Poisson random variable are nonnegative integers, any random phenomenon for which a count of some sort is of interest is a candidate for modeling in assuming a Poisson distribution. Such a count might be the number of fatal traffic accidents per week in a given place, the number of telephone calls per hour, arriving in a switchboard of a company, the number of pieces of information arriving per hour, etc. Example: It is known that the average number of daily changes in excess of 1%, for a specific stock Index, occurring in each six-month period is 5. What is the probability of having one such a change within the next 6 months? What is the probability of at least 3 changes within the same period? We model the number of in excess of 1% changes, X, within the next 6 months

λ x 5 x e− λ e− 5 as a Poisson process. We know that E[X]=λ =5. Hence fX(x)= x! = x! , 40 Examples of Parametric Univariate Distributions

5 1 e− 5 for x =0, 1, 2,,... Then P[X =1]=fX (1) = =0.0337.AlsoP [X 3] = 1! ≥ 1 P [X<3] = − =1 P [X =0] P [X =1] P [X =2]= − − − 5 0 5 1 5 2 =1 e− 5 e− 5 e− 5 =0.875. − 0! − 1! − 2! We can approximate the Binomial with Poisson. The approximation is better the smaller the p and the larger the n.

4.0.10 Continuous Distributions

UNIFORM ON [a, b].

A very simple distribution for a continuous random variable is the uniform distribu- tion. Its density function is:

1 b a if x [a, b] f (x a, b)= − ∈ , | ⎧ 0 otherwise ⎨ and ⎩ x x a F (x a, b)= f (z a, b) dz = − , | | b a Za − where

a + b a + b (b a)2 E (X)= , median = ,Var(X)= − . 2 2 12

X a If X v U [a, b]= X a v U [0,b a]= b −a v U [0,b a].Noticethatif ⇒ − − ⇒ − − a random variable is uniformly distributed over one of the following intervals [a, b), (a, b], (a, b) the density function, expected value and variance does not change.

Exponential Distribution

If a random variable X has a density function given by:

λx fX (x)=fX (x; λ)=λe− for 0 x< ≤ ∞ Examples of Parametric Univariate Distributions 41 where λ>0 then X is defined to have an (negative) . Now this random variable X we have 1 1 E[X]= and var[X]= λ λ2

Pareto-Levy or Stable Distributions

The stable distributions are a natural generalization of the normal in that, as their name suggests, they are stable under addition, i.e. a sum of stable random variables is also a random variable of the same type. However, nonnormal stable distributions have more probability mass in the tail areas than the normal. In fact, the nonnormal stable distributions are so fat-tailed that their variance and all higher moments are infinite. Closed form expressions for the density functions of stable random variables are available for only the cases of normal and Cauchy. If a random variable X has a density function given by:

1 γ fX(x)=fX (x; γ,δ)= for

Normal or Gaussian:

2 We say that X v N [μ, σ ] then

(x μ)2 2 1 − f x μ, σ = e− 2σ2 ,

The distribution is symmetric about μ, it is also unimodal and positive everywhere. Notice X μ − = Z N [0, 1] σ v is the standard . 42 Examples of Parametric Univariate Distributions

Lognormal Distribution

Let X be a positive random variable, and let a new random variable Y be defined as Y =logX.IfY has a normal distribution, then X is said to have a lognormal distribution. The density function of a lognormal distribution is given by

1 (log x μ)2 2 −2 fX(x; μ, σ )= e− 2σ for 0 0.Wehaven −∞ ∞ μ+ 1 σ2 2μ+2σ2 2μ+σ2 E[X]=e 2 and var[X]=e e − Notice that if X is lognormally distributed then

E[log X]=μandvar[log X]=σ2

Gamma-χ2

1 α 1 x f (x α, β)= x − e− β , 0 0 | Γ (α) βα ∞ ∞ α 1 t α is , β is a . Here Γ (α)= 0 t − e− dt is the 2 , Γ (n)=n!.Theχk is when α = k,andβ =1. R

Notice that we can approximate the Poisson and Binomial functions by the normal, in the sense that if a random variable X is distributed as Poisson with parameter λ,thenX λ is distributed approximately as standard normal. On the √−λ Y np other hand if Y ∼ Binomial(n, p) then − ∼ N(0, 1). √np(1 p) − The standard normal is an important distribution for another reason, as well.

Assume that we have a sample of n independent random variables, x1,x2,...,xn,which are coming from the same distribution with mean m and variance s2,thenwehave the following:

n 1 xi m − ∼ N(0, 1) √n s i=1 X This is the well known for independent observa- tions. Multivariate Random Variables 43

4.1 Multivariate Random Variables

We now consider the extension to multiple r.v., i.e.,

k X =(X1,X2, .., Xk) R ∈

The joint pmf, fX (x), is a function with

P (X A)= fX (x) ∈ x A X∈ The joint pdf, fX(x), is a function with

P (X A)= fX(x)dx ∈ xZA ∈ This is a multivariate integral, and in general difficult to compute. If A is a rectangle

A =[a1,b1] ... [ak,bk],then × × bk b1

fX (x)dx = ... fX (x)dx1..dxk

xZA aZk aZ1 ∈ The joint c.d.f. is defined similarly

FX (x)= fX (z1,z2,...,zk)

z1 x1,...,zk xk ≤ X ≤ x1 xn

FX (x)=P (X1 x1, ..., Xk xk)= ... fX (z1,z2, ..., zk)dz1..dzk ≤ ≤ Z Z −∞ −∞ The multivariate c.d.f. has similar coordinate-wise properties to a univariate c.d.f. For continuously differentiable c.d.f.’s

k ∂ FX (x) fX(x)= ∂x1∂x2..∂xk 4.1.1 Conditional Distributions and Independence

We defined conditional probability P (A B)=P (A B)/P(B) for events with P (B) = | ∩ 6 0.Wenowwanttodefine conditional distributions of Y X. In the discrete case there | is no problem f(y, x) fY X(y x)=P (Y = y X = x)= | | | fX (x) 44 Examples of Parametric Univariate Distributions when the event X = x has nonzero probability. Likewise we can define { }

Y y f(y,x) FY X(y x)=P (Y y X = x)= ≤ | f (x) | ≤ | P X

Note that fY X (y x) is a density function and FY X (y x) is a c.d.f. | | | | 1) fY X (y x) 0 for all y | | ≥ y f(y,x) fX (x) 2) y fY X (y x)= f (x) = f (x) =1 | | P X X InP the continuous case, it appears a bit anomalous to talk about the P (y ∈ A X = x),since X = x itself has zero probability of occurring. Still, we define the | { } conditional density function

f(y, x) fY X (y x)= | | fX(x) in terms of the joint and marginal densities. It turns out that fY X(y x) has the | | properties of p.d.f.

1) fY X (y x) 0 | | ≥ ∞ f(y,x)dy fX (x) 2) ∞ fY X(y x)dy = −∞ = =1. fX (x) fX (x) −∞ | | R WeR can define Expectations within the conditional distribution

∞ ∞ yf(y,x)dy E(Y X = x)= yfY X (y x)dy = −∞ | | | ∞ f(y, x)dy Z−∞ R −∞ and higher moments of the conditional distribution R

4.1.2 Independence

We say that Y and X are independent (denoted by )if ⊥⊥

P (Y A, X B)=P (Y A)P (X B) ∈ ∈ ∈ ∈ for all events A, B, in the relevant sigma-. This is equivalent to the cdf’s version which is simpler to state and apply.

FYX(y, x)=F(y)F (x) Multivariate Random Variables 45

In fact, we also work with the equivalent density version

f(y, x)=f(y)f(x) for all y,x

fY X(y x)=f(y) for all y | |

fX Y (x y)=f(x) for all x | |

If Y X,theng(X) h(Y ) for any measurable functions g,andh. ⊥⊥ ⊥⊥ We can generalise the notion of independence to multiple random variables. Thus Y , X,andZ are mutually independent if:

f(y, x, z)=f(y)f(x)f(z)

f(y, x)=f(y)f(x) for all y,x

f(x, z)=f(x)f(z) for all x,z

f(y, z)=f(y)f(z) for all y,z for all y,x,z.

4.1.3 Examples of Multivariate Distributions

Multivariate Normal

We say that X (X1,X2, ..., Xk) v MVNk (μ, Σ) , when

1 1 / 1 fX (x μ, Σ)= exp (x μ) Σ− (x μ) | k/2 1/2 −2 − − (2π) [det (Σ)] µ ¶ where Σ is a k k ×

σ11 σ12 ... σ1k . . ⎛ .. . ⎞ Σ = ⎜ .. . ⎟ ⎜ . . ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ σkk ⎟ ⎜ ⎟ ⎝ ⎠ and det (Σ) is the determinant of Σ. 46 Examples of Parametric Univariate Distributions

Theorem 8 (a) If X v MVNk (μ, Σ) then Xi v N (μi,σii) (this is shown by inte- gration of the joint density with respect to the other variables).

(b) The conditional distributions X =(X1,X2) are Normal too

fX1 X2 (x1 x2) v N μX1 X2 , ΣX1 X2 | | | | ¡ ¢ where 1 1 μX1 X2 = μ1 + Σ12Σ22− (x2 μ2) , ΣX1 X2 = Σ11 Σ12Σ22− Σ21. | − | −

(c) Iff Σ diagonal then X1,X2, .., Xk are mutually independent. In this case

det (Σ)=σ11σ22..σkk k 2 1 / 1 1 xj μj (x μ) Σ− (x μ)= − −2 − − −2 σjj j=1 ¡ ¢ X so that k 2 1 1 xj μj fX (x μ, Σ)= exp − | 2πσ −2 σ j=1 jj à ¡ jj ¢ ! Y 4.1.4 More on Conditional Distributionsp

We now consider the relationship between two, or more, r.v. when they are not inde- pendent. In this case, conditional density fY X and c.d.f. FY X is in general varying | | with the conditioning point x. Likewise for conditional mean E (Y X), conditional | median M (Y X), conditional variance V (Y X), conditional cf E eitY X ,andother | | | functionals, all of which characterize the relationship between Y ¡and X¢.Notethat this is a directional concept, unlike covariance, and so for example E (Y X) can be | very different from E (X Y ). |

Regression Models:

We start with random variable (Y,X). We can write for any such random variable

m(X) ε Y = E (Y X) + Y E (Y X) | − | systematicz }| { part zrand om}| part{ | {z } | {z } Multivariate Random Variables 47

By construction ε satisfies E (ε X)=0,butε is not necessarily independent of X. | For example, Var(ε X)=Var(Y E (Y X) X)=Var(Y X)=σ2 (X) can be | − | | | expected to vary with X as much as m (X)=E (Y X) . A convenient and popular | simplification is to assume that

E (Y X)=α + βX | Var(Y X)=σ2 | For example, in the bivariate normal distribution Y X has | σY E (Y X)=μY + ρYX (X μX ) | σX − Var(Y X)=σ2 1 ρ2 | Y − YX and in fact ε X. ¡ ¢ ⊥⊥ We have the following result about conditional expectations

Theorem 9 (1) E (Y )=E [E (Y X)] | (2) E (Y X) minimizes E (Y g (X))2 over all measurable functions g ( ) | − · (3) Var(Y )=E [Var(Y X£ )] + Var[E (¤Y X)] | |

Proof. (1) Write fYX (y, x)=fY X (y x) fX (x) then we have E (Y )= yfY (y)dy = | | y fYX(y,x)dx dy = y fY X (y x) fX (x) dx dy = R | | R ¡R = yfY ¢X (y x)Rdy ¡fRX (x) dx = [E(Y X¢= x] fX (x) dx = E (E (Y X)) | | | | 2 2 (2)RE¡R (Y g (X)) ¢= E [Y E (RY X)+E (Y X) g (X)] − − | | − 2 2 = E [Y£ E (Y X)] ¤+2E [[£Y E (Y X)] [E (Y X) g (X)]]+E¤ [E (Y X) g (X)] − | − | | − | − as now E (YE(Y X)) = E (E (Y X))2 ,andE (Yg(X)) = E (E (Y X) g (X)) we | | | 2 2 2 2 get that E (Y g (X)) = £E [Y E (Y¤X)] +E [E (Y X) g (X)] E [Y E (Y X)] . − − | | − ≥ − | 2 2 2 (3) £Var(Y )=E [¤Y E (Y )] = E [Y E (Y X)] + E [E (Y X) E (Y )] − − | | − +2E [[Y E (Y X)] [E (Y X) E (Y )]] − | | − The first term is E [Y E (Y X)]2 = E E [Y E (Y X)]2 X = E [Var(Y X)] − | { − | | } | 2 The second term is E [E (Y X) E (Y )] £= Var[E (Y X)] ¤ | − | The third term is zero as ε = Y E (Y X) is such that E (ε X)=0,and − | | E (Y X) E (Y ) is measurable with respect to X. | − 48 Examples of Parametric Univariate Distributions

Covariance

Cov (X, Y )=E [X E (X)] E [Y E (Y )] = E (XY ) E (X) E (Y ) − − − Note that if X or Y is a constant then Cov (X, Y )=0.Also

Cov(aX + b, cY + d)=acCov (X, Y )

An alternative measure of association is given by the correlation coefficient

Cov (X, Y ) ρXY = σXσY

Note that ρ = sign (a) sign (c) ρ aX+b,cY +d × × XY If E (Y X)=a = E (Y ) , then Cov(X, Y )=0. Also if X and Y are | independent r.v. then Cov(X, Y )=0. Both the covariance and the correlation of random variables X and Y are measures of a linear relationship of X and Y in the following sense. cov[X, Y ] will be positive when (X μ ) and (Y μ ) tend to have the same sign with high − X − Y probability, and cov[X, Y ] will be negative when (X μ ) and (Y μ ) tend to have − X − Y opposite signs with high probability. The actual magnitude of the cov[X, Y ] does not much meaning of how strong the linear relationship between X and Y is. This is because the variability of X and Y is also important. The correlation coefficient does nothavethisproblem,aswedividethecovariancebytheproductofthestandard deviations. Furthermore, the correlation is unitless and 1 ρ 1. − ≤ ≤

The properties are very useful for evaluating the expected return and stan- dard deviation of a portfolio. Assume ra and rb are the returns on assets A and B, 2 2 and their are σa and σb, respectively. Assume that we form a portfolio of the two assets with weights wa and wb, respectively. If the correlation of the returns of these assets is ρ, find the expected return and standard deviation of the portfolio. Inequalities 49

If Rp is the return of the portfolio then Rp = wara + wbrb. The expected portfolio return is E[Rp]=waE[ra]+wbE[rb]. The variance of the portfolio is 2 2 var[Rp]=var[wara + wbrb]=E[(wara + wbrb) ] (E[wara + wbrb]) = − 2 2 2 2 = waE[ra]+wb E[rb ]+2wawbE[rarb] 2 2 2 2 w (E[ra]) w (E[rb]) 2wawbE[ra]E[rb]= − a − b − 2 2 2 2 2 2 = w E[r ] (E[ra]) +w E[r ] (E[rb]) +2wawb E[rarb] E[ra]E[rb] a { a − } b { b − } { − } 2 2 2 2 2 2 = wavar[ra]+wb var[rb]+2wawbcov[ra,rb] or = waσa + wb σb +2wawbρσaσb In a vector format we have:

E[ra] E[Rp]= wa wb and ⎛ E[r ] ⎞ ³ ´ b ⎝ 2 ⎠ σa ρσaσb wa var[Rp]= wa wb ⎛ ρσ σ σ2 ⎞ ⎛ w ⎞ ³ ´ a b b b ⎝ ⎠ ⎝ ⎠ From the above example we can see that var[aX+bY ]=a2var[X]+b2var[Y ]+ 2abcov[X, Y ] for random variables X and Y and constants a and b. Infactwecan generalize the formula above for several random variables X1,X2, ..., Xn and constants n n 2 a1,a2,a3, ..., an i.e. var[a1X1 + a2X2 + ...anXn]= ai var[Xi]+2 aiajcov[Xi,Xj] i=1 i

This section gives some inequalities that are useful in establishing a variety of prob- abilistic results.

4.2.1 Markov

Let Y be a random variable and consider a function g (.) such that g (y) 0 for all ≥ y R. Assume that E [g (Y )] exists. Then ∈

1 P [g (Y ) c] c− E [g (Y )] ,forallc>0. ≥ ≤

Proof: Assume that Y is continuous random variable (the discrete case follows anal- 50 Examples of Parametric Univariate Distributions ogously) with p.d.f. f (.).Define A1 = y g (y) c and A2 = y g (y)

E [g (Y )] = g (y) f (y) dy + g (y) f (y) dy ZA1 ZA2 g (y) f (y) dy cf (y) dy = cP [g (Y ) c] . ≥ ≥ ≥ ZA1 ZA1 ¥ 4.2.2 Chebychev’s Inequality Var(X) P [ X E (X) η] | − | ≥ ≤ η2 or alternatively 1 P X E (X) r Var(X) | − | ≥ ≤ r2 h p i Proof: To prove the above, assume that E (X)=0and compare 1(X η) with X2 . | | ≥ η2 2 E(X2) Clearly 1(X η) X and it follows that E [1 ( X η)] P [ X η] | | ≥ ≤ η2 | | ≥ ≤ η2 ⇒ | | ≥ ≤ Var(X) . Alternatively, apply Markov’s inequality by setting g (y)=[x E (X)]2 and η2 − 2 c = r Var(X).¥ 4.2.3 Minkowski

Let Y and Z be random variables such that [E ( Y α)] < and [E ( Z α)] < for | | ∞ | | ∞ some 1 α< .Then ≤ ∞

[E ( Y + Z α)]1/α [E ( Y α)]1/α +[E ( Z α)]1/α | | ≤ | | | |

For α =1we have the triangular inequality

4.2.4 Triangle

E X + Y E X + E Y . | | ≤ | | | | 4.2.5 Cauchy-Schwarz

E2 (XY ) E (X)2 E (Y )2 ≤ 2 2 2 ( ajbj) a b ≤ j j P ¡P ¢¡P ¢ Inequalities 51

Proof: Let 0 h (t)=E (tX Y )2 = t2E (X2)+E (Y 2) 2tE (XY ). Then the ≤ − − function h (t) is a quadratic£ function¤ in t which is increasing as t .Ithasa → ±∞ unique minimum at h/ (t)=0 2tE (X2) 2E (XY )=0 t = E(XY ) . Hence ⇒ − ⇒ E(X2) 0 h E(XY ) E2 (XY ) E (X)2 E (Y )2. ≤ E(X2) ⇒ ≤ ¥ 4.2.6³ Hölder’s´ Inequality

1 1 For any p, q satisfying p + q =1we have

p 1 q 1 E XY (E X ) p (E Y ) q | | ≤ | | | |

In fact the Cauchy-Schwarz inequality corresponds for p = q =2.

4.2.7 Jensen Inequality

Let X be a random variable with mean E[X],andletg(.) be a convex function. Then

E[g(X)] g(E[X]). ≥

Now a continuous function g(.) with domain and counterdomain the real line is called convex if for any x0 on the real line, there exist a line which goes through the point // (x0,g(x0)) and lies on or under the graph of the function g(.).Alsoifg (x0) 0 ≥ then g(.) is convex. 52 Examples of Parametric Univariate Distributions Part II

Statistical Inference

53

Chapter 5

SAMPLING THEORY

To proceed we shall recall the following definitions.

Let X1,X2,...,Xk be k random variables all definedonthesameprobability space (Ω, ,P[.]).Thejoint cumulative distribution function of X1,X2, ..., Xk, A denoted by FX ,X ,...X ( , , ..., ),isdefined as 1 2 n • • •

FX ,X ,...X (x1,x2, ..., xk)=P[X1 x1; X2 x2; ...; Xk xk] 1 2 k ≤ ≤ ≤ for all (x1,x2, ..., xk).

Let X1,X2, ..., Xk be k discrete random variables, then the joint discrete density function of these, denoted by fX ,X ,...X ( , ,..., ),isdefined to be 1 2 k • • •

fX1,X2,...Xk (x1,x2, ..., xk)=P [X1 = x1; X2 = x2; ...; Xk = xk]

for (x1,x2, ..., xk),avalueof(X1,X2, ..., Xk) and is 0 otherwise.

Let X1,X2, ..., Xk be k continuous random variables, then the joint contin- uous density function of these, denoted by fX ,X ,...X ( , ,..., ),isdefined to be 1 2 k • • • a function such that

xk x1

FX1,X2,...Xk (x1,x2, ..., xk)= ... fX1,X2,...Xk (u1,u2, ..., uk)du1..duk Z Z −∞ −∞ for all (x1,x2, ..., xk). The totality of elements which are under discussion and about which infor- mation is desired will be called the target population. The statistical problem is 56 Sampling Theory to find out something about a certain target population. It is generally impossible or impractical to examine the entire population, but one may examine a part of it (a sample from it) and, on the basis of this limited investigation, make inferences regarding the entire target population. The problem immediately arises as to how the sample of the population should be selected. Of practical importance is the case of a simple random sample, usually called a random sample, which can be defined as follows:

Let the random variables X1,X2, ..., Xn have a joint density fX1,X2,...Xn (x1,x2, ..., xn) that factors as follows:

fX1,X2,...Xn (x1,x2, ..., xn)=f(x1)f(x2)...f(xn)

where f(.) is the common density of each Xi.ThenX1,X2, ..., Xn is defined to be a random sample of size n from a population with density f(.). Notethatidentical distribution can be weakened - could have different population for each j -reflecting heterogeneous individuals. Also, in time series we might want to allow dependence, i.e., Xj and Xk are dependent. When we are dealing with finite population, sampling without replacement causes some heterogeneity since if X1 = x1, then the distribution of X2 must be affected.

5.1 Sample Statistics

A sample statistic is a function of observable random variables, which is itself an observable random variable, which does not contain any unknown , i.e. a sample statistic is any quantity we can write as a , T(X1,...,Xn).

For example, let X1,X2, ..., Xk be a random sample from the density f(.). Then the th / r sample moment, denoted by Mr ,isdefined as:

1 n M / = Xr. r n i i=1 X Means and Variances 57

In particular, if r =1, we get the sample mean, which is usually denoted by X orXn; that is: 1 n X = X n n i i=1 X th Also the r sample central moment (about Xn), denoted by Mr,isdefined as: n 1 r M = X X . r n i n i=1 − X ¡ ¢ In particular, if r =2, we get the sample variance, and the sample standard deviation

n 1 2 s2 = X X ,s= √s2 n i i=1 − X ¡ ¢ or maybe another sample statistic for the variance,

n 2 1 2 s = Xi X . ∗ n 1 − − i=1 X ¡ ¢ We can also get the sample Median,

X(r) if n =2r 1 M = median X1, ..., Xn = − { } ⎧ 1 X + X if n =2r ⎨ 2 (r) (r+1) the empirical cumulative distribution function⎩ £ ¤

1 n F (x)= 1(X x) n n i i=1 ≤ X 1 n 1 n 1 n ϕ (t)= eitXi = sin (tX )+i cos (tX ) n n n i n i i=1 i=1 i=1 X X X These are analogous of corresponding population characteristics and will be shown to be similar to them when n is large. We calculate the properties of these variables: (1) Exact properties; (2) Asymptotic.

5.2 Means and Variances

We can prove the following : 58 Sampling Theory

Theorem 10 Let X1,X2, ..., Xk be a random sample from the density f(.).The expected value of the rth sample moment is equal to the rth population moment, i.e. the rth sample moment is an unbiased estimator of the rth population moment (Proof omitted).

Theorem 11 Let X1,X2, ..., Xn be a random sample from a density f(.),andlet n 1 Xn = n Xi be the sample mean. Then i=1 P 1 E[X ]=μandvar[X ]= σ2 n n n where μ and σ2 are the mean and variance of f(.), respectively. Notice that this is true for any distribution f(.), provided that is not infinite.

Proof n n n 1 1 1 1 E[Xn]=E[ n Xi]= n E[Xi]= n μ = n nμ = μ.Also i=1 i=1 i=1 n n n P1 P1 P 1 2 1 2 1 2 var[Xn]=var[ n Xi]= n2 var[Xi]= n2 σ = n2 nσ = n σ i=1 i=1 i=1 P P P 2 Theorem 12 Let X1,X2, ..., Xn be a random sample from a density f(.),andlets ∗ defined as above. Then

2 2 2 1 n 3 4 E[s ]=σ and var[s ]= μ4 − σ ∗ ∗ n − n 1 µ − ¶ 2 th where σ and μ4 are the variance and the 4 central moment of f(.), respectively.

Notice that this is true for any distribution f(.), provided that μ4 is not infinite.

Proof We shall prove first the following identity, which will be used latter: n n 2 2 2 (Xi μ) = Xi Xn + n Xn μ i=1 − i=1 − − 2 2 2 P(Xi μ) = P ¡Xi Xn ¢+ Xn ¡ μ = ¢ Xi Xn + Xn μ = − − − − − 2 2 P= Xi XnP ¡+2 Xi Xn Xn ¢ μ P+ £¡Xn μ ¢= ¡ ¢¤ − − − − 2 2 = P hX¡ i Xn ¢ +2 X¡ n μ ¢¡Xi X¢n +¡ n Xn ¢ iμ = − − − − P ¡ ¢ ¡ ¢ P ¡ ¢ ¡ ¢ Sampling from the Normal Distribution 59

2 2 = Xi Xn + n Xn μ − − UsingP ¡ the above¢ identity¡ we obtain:¢ n n 2 1 2 1 2 2 E[Sn]=E n 1 (Xi Xn) = n 1 E (Xi μ) n Xn μ = − i=1 − − i=1 − − − n ∙ ¸ ∙ n ¸ 1 P 2 2 P 1 2 ¡ ¢ = n 1 E (Xi μ) nE Xn μ = n 1 σ nvar Xn = − i=1 − − − − i=1 − 1 ∙ 2 1 2 2 ¸ ∙ ¸ = n 1 nσP n n σ = σ ¡ ¢ P ¡ ¢ − − 2 The derivation£ of the¤ variance of Sn is omitted.

Theorem 13 Let X1, ..., Xn be a random sample from a population with mean μ, 2 variance σ , skewness κ3,andkurtosisκ4.Then,

(3) E (Fn(x)) = F (x), and Var(Fn(x)) = F(x)(1 F(x)) /n − n (4) Characteristic Function of X, ϕX (t)=[ϕX (t/n)] .

Proof n 1 E (Fn(x)) = E n 1(Xi x) = E (1 (Xi x)) = F (x) . Also Var(Fn(x)) = i=1 ≤ ≤ µ n ¶ 2 2 P 1 E [Fn(x) F (x)] = E 1[Xi x] F (x) = − n ≤ − ½ i=1 ¾ n P n 1 2 1 = E n2 1[Xi x] F (x) + n2 1[Xi x] F (x) 1[Xj x] F (x) i=1 { ≤ − } i=j { ≤ − }{ ≤ − } ( 6 ) 1 P 2 1 P 2 1 2 = E 1[Xi x] F (x) = E 1[Xi x] F (x) = F (x)[1 F (x)] n { ≤ − } n { { ≤ − }} n − © ª

5.3 Sampling from the Normal Distribution

Theorem 14 Let denote Xn the sample mean of a random sample of size n from a 2 2 normal distribution with mean μ and variance σ . Then (1) X v N(μ, σ /n). (2) X and s2 are independent.

(n 1)s2 2 (3) −σ2 ∗ v χn 1 − X μ (4) s /−√n v tn 1. ∗ −

Proof 60 Sampling Theory

(1) From a Theorem above we have that

n ϕX (t)=[ϕX (t/n)] .

n Now ϕ (t/n)=expiμt 1 σ2t2 . Hence ϕ (t)= exp iμ t 1 σ2 t 2 = X − 2 X n − 2 n 2 exp iμt 1 σ t2 ,whichisthe¡ ¢ cf of a normal distributionh ³ with mean¡ ¢ ´iμ and − 2 n ³ σ2 ³ ´ ´ variance n .

X1+X2 (2) For n =2we have that if X1 v N(0, 1) and X2 v N(0, 1) then X = 2 2 2 (X1 X2) X1+X2 X1 X2 and s = −4 .Define Z1 = 2 and Z2 = −2 .ThenZ1 and Z2 are uncorrelated and by normality independent.

5.3.1 The Gamma Function

The gamma function is defined as:

∞ t 1 x Γ(t)= x − e− dx for t > 0

Z0 Notice that Γ(t +1)=tΓ(t),as

∞ ∞ ∞ t x t x t x t 1 x Γ(t +1)= x e− dx = x de− = x e− ∞ + t x − de− = tΓ(t) − − 0 Z0 Z0 Z0 ¯ ¯ 1 and if t is an integer then Γ(t +1)=t!.Alsoift is again an integer then Γ(t + 2 )= 1 3 5 ...(2t 1) 1 ∗ ∗ ∗2t − √π. Finally Γ( 2 )=√π. Recall that if X is a random variable with density

k/2 1 1 k 1 1 x fX (x)= x 2 − e− 2 for 0

E[X]=kandvar[X]=2k

We can prove the following theorem Sampling from the Normal Distribution 61

Theorem 15 If the random variables Xi,i=1, 2, .., k are normally and indepen- 2 dently distributed with means μi and variances σi then

k 2 Xi μ U = − i σ i=1 i X µ ¶ has a chi-square distribution with k degrees of freedom. Proof omitted.

Furthermore,

Theorem 16 If the random variables Xi,i=1, 2, .., k are normally and indepen- n 2 2 1 2 dently distributed with mean μ and variance σ ,andletS = n 1 (Xi Xn) − i=1 − then P 2 (n 1)S 2 U = − 2 v χn 1 σ − 2 where χn 1 is the chi-square distribution with n 1 degrees of freedom. Proof omitted. − −

5.3.2 The F Distribution

If X is a random variable with density

m m/2 1 Γ[(m + n)/2] m x 2 − fX (x)= for 0

n 2n2(m + n 2) E[X]= and var[X]= − n 2 m(n 2)2(n 4) − − − Theorem 17 If the random variables U and V are independently distributed as chi-

2 2 square with m and n degrees of freedom, respectively i.e. U v χm and V v χn independently, then U/m = X F V/n v m,n where Fm,n is the F distribution with m, n degrees of freedom. Proof omitted. 62 Sampling Theory

5.3.3 The Student-t Distribution

If X is a random variable with density

Γ[(k +1)/2] 1 1 fX (x)= for

k E[X]=0 and var[X]= k 2 − Theorem 18 If the random variables Z and V are independently distributed as stan-

2 dard normal and chi-square with k, respectively i.e. Z v (N(0, 1) and V v χk inde- pendently, then Z = X v tk V/k where tk is the t distribution with pk degrees of freedom. Proof omitted.

The above Theorems are very useful especially to get the distribution of various tests and construct confidence intervals. Chapter 6

POINT AND INTERVAL ESTIMATION

The problem of estimation is defined as follows. Assume that some characteristic of the elements in a population can be represented by a random variable X whose den- sity is fX(.; θ)=f(.; θ), where the form of the density is assumed known except that it contains an unknown parameter θ (if θ were known, the density function would be completely specified, and there would be no need to make inferences about it.

Further assume that the values x1,x2,...,xn of a random sample X1,X2, ...., Xn from f(.; θ) can be observed. On the basis of the observed sample values x1,x2, ..., xn it is desired to estimate the value of the unknown parameter θ or the value of some function, say τ(θ), of the unknown parameter. The estimation can be made in two ways. The first, called point estimation, is to let the value of some statistic, say t(X1,X2, ...., Xn), represent or estimate, the unknown τ(θ). Such a statistic is called the point estimator. The second, called interval estimation,istodefine two statistics, say t1(X1,X2, ...., Xn) and t2(X1,X2, ...., Xn),wheret1(X1,X2, ...., Xn) < t2(X1,X2, ...., Xn),sothat(t1(X1,X2, ...., Xn),t2(X1,X2, ...., Xn)) constitutes an in- terval for which the probability can be determined that it contains the unknown τ(θ).

6.1 Parametric Point Estimation

The point estimation admits two problems. The firstistodevisesomemeansofob- taining a statistic to use as an estimator. The second, to select criteria and techniques 64 Point and Interval Estimation to define and find a “best” estimator among many possible estimators.

6.1.1 Methods of Finding Estimators

Any statistic (known function of observable random variables that is itself a random variable) whose values are used to estimate τ(θ),whereτ(.) is some function of the parameter θ,isdefined to be an estimator of τ(θ). Notice that for specific values of the realized random sample the estimator takes a specific value called estimate.

6.1.2 Method of Moments

Let f(.; θ1,θ2, ..., θk) beadensityofarandomvariableX which has k parameters / th r / θ1,θ2, ..., θk. As before let μr denote the r moment i.e. = E[X ]. In general μr will be a known function of the k parameters θ1,θ2, ..., θk.Denotethisbywrit- / / ing μr = μr(θ1,θ2, ..., θk).LetX1,X2, ..., Xn be a random sample from the density n / th / 1 j f(.; θ1,θ2, ..., θk), and, as before, let Mj be the j sample moment, i.e. Mj = n Xi . i=1 Then equating sample moments to population ones we get k equations withPk un- knowns, i.e.

/ / Mj = μj (θ1,θ2, ..., θk) for j =1, 2, ..., k

Let the solution to these equations be θ1, θ2, ..., θk. We say that these k estimators are the estimators of θ ,θ , ..., θ obtained by the method of moments. 1 2 k b b b Example: Let X1,X2, ..., Xn be a random sample from a normal distribution 2 2 with mean μ and variance σ .Let(θ1,θ2)=(μ, σ ). Estimate the parameters μ and σ by the method of moments.. Recall that σ2 = μ/ (μ/)2 and μ = μ/.Themethod 2 − 1 1 of moment equations become: n 1 / / / 2 n Xi = X = M1 = μ1 = μ1(μ, σ )=μ i=1 n 1 P 2 / / / 2 2 2 n Xi = M2 = μ2 = μ2(μ, σ )=σ + μ i=1 SolvingP the two equations for μ and σ we get: n 1 μ = X, and σ = (Xi X) which are the M-M estimators of μ and σ. n − r i=1 P b b Parametric Point Estimation 65

Example: Let X1,X2,...,Xn be a random sample from a Poisson distribution with parameter λ. There is only one parameter, hence only one equation, which is: n 1 / / / n Xi = X = M1 = μ1 = μ1(λ)=λ i=1 HenceP the M-M estimator of λ is λ = X.

b 6.1.3 Maximum Likelihood

Consider the following estimation problem. Suppose that a box contains a number of black and a number of white balls, and suppose that it is known that the ratio of the number is 3/1 but it is not known whether the black or the white are more numerous, i.e. the number of drawing a black ball is either 1/4 or 3/4. If n balls are drawn with replacement from the box, the distribution of X, the number of black balls, is given by the binomial distribution

n x n x f(x; p)= p (1 p) − for x =0, 1, 2, ..., n ⎛ x ⎞ − ⎝ ⎠ where p is the probability of drawing a black ball. Here p =1/4 or p =3/4.Weshall draw a sample of three balls, i.e. n =3, with replacement and attempt to estimate the unknown parameter p of the distribution. the estimation is simple in this case as we have to choose only between the two numbers 1/4=0.25 and 3/4=0.75.The possible outcomes and their probabilities are given below:

outcome : x 0 1 2 3 f(x;0.75) 1/64 9/64 27/64 27/64 f(x;0.25) 27/64 27/64 9/64 1/64

In the present example, if we found x =0in a sample of 3, the estimate 0.25 for p wouldbepreferredover0.75 because the probability 27/64 is greater than 1/64, i.e. 66 Point and Interval Estimation

And in general we should estimate p by 0.25 when x =0or 1 and by 0.75 when x =2or 3. The estimator may be defined as

0.25 for x =0, 1 p = p(x)= ⎧ 0.75 for x =2, 3 ⎨ b b The estimator thus selects fro every⎩ possible x the value of p,sayp, such that

/ f(x; p) >f(x; p ) b where p/ is the other value of p. b More generally, if several values of p were possible, we might reasonably pro- ceed in the same manner. Thus if we found x =2in a sample of 3 from a binomial population, we should substitute all possible values of p in the expression

3 f(2; p)= p2(1 p) for 0 p 1 ⎛ 2 ⎞ − ≤ ≤ ⎝ ⎠ and choose as our estimate that value of p which maximizes f(2; p).Theposition of the maximum of the function above is found by setting equal to zero the first derivative with respect to p,i.e. d f(2; p)=6p 9p2 =3p(2 3p)=0 p =0or dp − − ⇒ p =2/3. The second derivative is: d2 f(2; p)=6 18p. Hence, d2 f(2; 0) = 6 and dp2 − dp2 the value of p =0represents a minimum, whereas d2 f(2; 2 )= 6 and consequently dp2 3 − 2 2 p = 3 represents the maximum. Hence p = 3 is our estimate which has the property

/ f(x; p)b>f(x; p ) where p/ is any other value in the intervalb 0 p 1. ≤ ≤ The of n random variables X1,X2,...,Xn is defined to be the joint density of the n random variables, say fX1,X2,...,Xn (x1,x2,...,xn; θ),which is considered to be a function of θ.Inparticular,ifX1,X2, ..., Xn is a random sample from the density f(x; θ), then the likelihood function is f(x1; θ)f(x2; θ).....f(xn; θ). To think of the likelihood function as a function of θ, we shall use the notation

L(θ; x1,x2, ..., xn) or L( ; x1,x2, ..., xn) for the likelihood function in general. • Parametric Point Estimation 67

The likelihood is a value of a density function. Consequently, for discrete random variables it is a probability. Suppose for the moment that θ is known, denoted by θ0. The particular value of the random variables which is “most likely to occur” / / / is that value x1,x2, ..., xn such that fX1,X2,...,Xn (x1,x2, ..., xn; θ0) is a maximum. for example, for simplicity let us assume that n =1and X1 has the normal density with mean 0 and variance 1. Then the value of the random variable which is most / likely to occur is X1 =0. By “most likely to occur” we mean the value x1 of X1 / such that φ0,1(x1) >φ0,1(x1). Now let us suppose that the joint density of n random variables is fX1,X2,...,Xn (x1,x2, ..., xn; θ),whereθ is known. Let the particular values / / / which are observed be represented by x1,x2, ..., xn. Wewanttoknowfromwhich density is this particular set of values most likely to have come. We want to know / / / from which density (what value of θ)isthelikelihoodlargestthatthesetx1,x2, ..., xn was obtained. in other words, we want to find the value of θ in the admissible set, / / / denoted by θ, which maximizes the likelihood function L(θ; x1,x2, ..., xn).Thevalue θ which maximizes the likelihood function is, in general, a function of x ,x ,...,x , b 1 2 n say θ = θ(x ,x , ..., x ). Hence we have the following definition: b 1 2 n Let L(θ)=L(θ; x ,x , ..., x ) be the likelihood function for the random vari- b b 1 2 n ables X1,X2, ..., Xn.Ifθ [where θ = θ(x1,x2, ..., xn) is a function of the observations x ,x , ..., x ]isthevalueofθ in the admissible range which maximizes L(θ),thenΘ = 1 2 n b b b θ(X ,X , ..., X ) is the maximum likelihood estimator of θ. θ = θ(x ,x , ..., x ) 1 2 n 1 2 b n is the maximum likelihood estimate of θ for the sample x ,x , ..., x . b 1 2b b n The most important cases which we shall consider are those in which X1,X2, ..., Xn is a random sample from some density function f(x; θ), so that the likelihood function is

L(θ)=f(x1; θ)f(x2; θ).....f(xn; θ)

Many likelihood functions satisfy regularity conditions so the maximum likelihood estimator is the solution of the equation dL(θ) =0 dθ 68 Point and Interval Estimation

Also L(θ) and logL(θ) have their maxima at the same value of θ, and it is sometimes easier to find the maximum of the of the likelihood. Notice also that if the likelihood function contains k parameters then we find the estimator from the solution of the k first order conditions. Example: Let a random sample of size n is drawn from the Bernoulli distri- bution

x 1 x f(x; p)=p (1 p) − − where 0 p 1. The sample values x1,x2, ..., xn will be a sequence of 0s and 1s, and ≤ ≤ the likelihood function is

n xi 1 xi xi n xi L(p)= p (1 p) − = p (1 p) − i=1 − P − P Y

Let y = xi we obtain that P logL(p)=y log p +(n y)log(1 p) − − and d log L(p) y n y = − dp p − 1 p − Setting this expression equal to zero we get

y 1 p = = x = x n n i X which is intuitively what the estimateb for this parameter should be.

Example: Let a random sample of size n is drawn from the normal distrib- ution with density

(x μ)2 2 1 − f(x; μ, σ )= e− 2σ2 √2πσ2 The likelihood function is

n 2 n/2 n 1 (xi μ) 1 1 2 −2 2 L(μ, σ )= e− 2σ = exp (xi μ) √ 2 2πσ2 2σ2 i=1 2πσ "− i=1 − # Y µ ¶ X Interval Estimation 69 the logarithm of the likelihood function is

n n 1 n log L(μ, σ2)= log 2π log σ2 (x μ)2 2 2 2σ2 i − − − i=1 − X To find the maximum with respect to μ and σ2 we compute

∂ log L 1 n = (x μ) ∂μ σ2 i i=1 − X and ∂ log L n 1 1 n = + (x μ)2 ∂σ2 2 σ2 2σ4 i − i=1 − X and putting these derivatives equal to 0 and solving the resulting equations we find the estimates 1 n μ = x = x n i i=1 X and b 1 n σ2 = (x x)2 n i i=1 − X which turn out to be the sampleb moments corresponding to μ and σ2.

6.1.4 Properties of Point Estimators

One needs to define criteria so that various estimators can be compared. One of these is the unbiasedness. An estimator T = t(X1,X2, ..., Xn) is defined to be an unbiased estimator of τ(θ) if and only if

Eθ[T ]=Eθ[t(X1,X2, ..., Xn)] = τ(θ) for all θ in the admissible space. Other criteria are consistency, mean square error etc.

6.2 Interval Estimation

In practice estimates are often given in the form of the estimate plus or minus a certain amount e.g. the cost per volume of a book could be 83 4.5 per cent which ± 70 Point and Interval Estimation means that the actual cost will lie somewhere between 78.5% and 87.5% with high probability. Let us consider a particular example. Suppose that a random sample (1.2, 3.4,.6, 5.6) of four observations is drawn from a normal population with unknown mean μ andaknownvariance9. The maximum likelihood estimate of μ is the sample mean of the observations: x =2.7 We wish to determine upper and lower limits which are rather certain to contain the true unknown parameter value between them. We know that the sample

2 mean, x, is distributed as normal with mean μ and variance 9/n i.e. x v N(μ, σ /n). Hence we have X μ Z = −3 v N(0, 1) 2 Hence Z is standard normal. Consequently we can find the probability that Z will be between two arbitrary values. For example we have that

1.96 P [ 1.96 μ>X 1.96 2 − 2 and for the specificvalueofthesamplemeanwehavethat5.64 >μ> .24 i.e. − P [5.64 >μ> .24] = .95. This leads us to the following definition for the confidence − interval.

Let X1,X2, ...., Xn be a random sample from the density f( ; θ).LetT1 = • t1(X1,X2, ...., Xn) and T2 = t2(X1,X2, ...., Xn) be two statistics satisfying T1 T2 ≤ for which Pθ[T1 <τ(θ)

Let X1,X2, ...., Xn be a random sample from the density f( ; θ).LetT1 = • t1(X1,X2, ...., Xn) be a statistic for which Pθ[T1 <τ(θ)] = γ.ThenT1 is called a one- sided lower confidence interval for τ(θ). Similarly, let T2 = t2(X1,X2, ...., Xn) be a statistic for which Pθ[τ(θ)

Example: Let X1,X2, ...., Xn be a random sample from the density f(x; θ)=

φ (x).SetT1 = t1(X1,X2, ...., Xn)=X 6/√n and T2 = t2(X1,X2, ...., Xn)= θ,9 − X +6/√n.Then(T1,T2) constitutes a random interval and is a confidence interval for τ(θ)=θ,withconfidence coefficient γ = P [X 6/√n<θ

Let X1,X2,...,Xn be a random sample from the normal distribution with mean μ and variance σ2.Ifσ2 is unknown then θ =(μ, σ2), the unknown parameters and τ(θ)=μ, the parameter we want to estimate by interval estimation. We know that X μ −σ v N(0, 1) √n However, the problem with this statistic is that we have two parameters. Conse- quentlywecannotaninterval.hencewelookforastatisticthatinvolvesonlythe parameter we want to estimate, i.e. μ.Noticethat σ (X μ)/ √n X μ − = − v tn 1 2 2 S/√n − (Xi X) /(n 1)σ − − This statistic involvesq onlyP the parameter we want to estimate. Hence we have X μ q1 < −

Hence the interval X q2(S/√n), X q1(S/√n) is the 100γ percent confidence − − interval for μ. It can¡ be proved that if q1,q2 are¢ symmetrical around 0, then the length is the interval is minimized. Alternatively, if we want to find a confidence interval for σ2,whenμ is un- known, then we use the statistic

2 2 (Xi X) (n 1)S 2 −2 = − 2 v χn 1 σ σ − P Hence we have

2 2 2 (n 1)S (n 1)S 2 (n 1)S q1 < −

(n 1)S2 P q1 < −

(n 1)S2 (n 1)S2 2 So the interval − , − is a 100γ percent confidence interval for σ .Theq1, q2 q1 (n 1)S2 (n 1)S2 q2 are often selected³ so that P ´q2 < − 2 = P − 2

HYPOTHESIS TESTING

A statistical hypothesis is an assertion or conjecture, denoted by ,abouta H distribution of one or more random variables. If the statistical hypothesis completely specifies the distribution is simple, otherwise is composite.

Example Let X1,X2, ..., Xn be a random sample from f(x; θ)=φμ,25(x). The statistical hypothesis that the mean of the normal population is less or equal to 17 is denoted by: : θ 17. Such a hypothesis is composite, as it does not H ≤ completely specify the distribution. On the other hand, the hypothesis : θ 17 is H ≤ simple since it completely specifies the distribution.

A test of statistical hypothesis is a rule or procedure for deciding whether H to reject . H

Example Let X1,X2, ..., Xn be a random sample from f(x; θ)=φμ,25(x). Consider : θ 17. One possible test is as follows: Reject if and only if H ≤ Y H X>17 + 5/√n. In many hypotheses-testing problems two hypotheses are discussed. The first, the hypothesis being testing, is called the null hypothesis, denoted by 0,andthe H second is called the alternative hypothesis denoted by 1.Wesaythat 0 is H H tested against or versus 1. The thinking is that if the null hypothesis is wrong the H alternativehypothesisistrue,andviceversa.Wecanmaketwotypesoferrors:

Rejection of 0 when 0 is true is called a Type I error, and acceptance of H H 0 when 0 is false is called a Type II error.Thesize of Type I error is defined H H 74 Hypothesis testing to be the probability that a Type I error is made, and similarly the size of a Type II error is defined to be the probability that a Type II error is made. Significance level or size of a test,denotedbyα, is the supremum of the probability of rejecting 0 when 0 is correct, i.e. it is the supremum of the Type I H H error. In general to perform a test we fix the size to a prespecified value in general 10%, 5% or 1%.

Example Let X1,X2, ..., Xn be a random sample from f(x; θ)=φμ,25(x).

Consider 0 : θ 17 and the test :Reject 0 if and only if X>17 + 5/√n.Then H ≤ Y H of the test is

X θ 17+5/√n θ − sup P [X>17 + 5/√n]=supP 5/√−n > 5/√n = θ 17 θ 17 ≤ ≤ X θ 17+5/√n θh i 17+5/√n θ − − =sup 1 P 5/√−n 5/√n =sup 1 P Z 5/√n = θ 17 − ≤ θ 17 − ≤ ≤ ≤ n h17+5/√n θ io n h io − =sup 1 Φ 5/√n =1 Φ(1) = 0.159 θ 17 − − ≤ n h io 7.1 Testing Procedure

Let us establish a test procedure via an example. Assume that n =64, X =9.8 and σ2 =0.04. We would like to test the hypothesis that μ =10. 1. Formulate the null hypothesis:

0 : μ =10 H 2. Formulate the alternative:

1 : μ =10 H 6 3. select the level of significance: α =0.01

From tables find the critical values for Z,denotedbycZ =2.58. 4. Establish the rejection limits:

Reject 0 if Z< 2.58 or Z>2.58. H − 5. Calculate Z:

X μ0 9.8 10 Z = −σ = = 8 0.2/−√64 √n − 6. Make the decision: Testing Proportions 75

Since Z is less than 2.58,reject 0. − H To findtheappropriatetestforthemeanwehavetoconsiderthefollowing cases: 1. Normal population and known population variance (or standard deviation). In this case the statistic we use is:

X μ0 Z = −σ v N(0, 1) √n 2. Large samples in order to use the central limit theorem. In this case the statistic we use is:

X μ0 Z = −S v N(0, 1) √n

3. Small samples from a normal population where the population variance (or standard deviation) is unknown. In this case the statistic we use is:

X μ0 t = − tn 1 S v − √n

7.2 Testing Proportions

Thenullhypothesiswillbeoftheform:

0 : π = π0 H an the three possible alternatives are:

(1) 1 : π = π0 two sided test, (2) 1 : π<π0 one sided, (3) 1 : π>π0 one H 6 H H sided. The appropriate statistic is based on the central limit theorem and is:

p π0 2 Z = −S v N(0, 1) where S = π0(1 π0) √n −

Example: Mr. X believes that he will get more 60% of the votes. However, in a sample of 400 voters 252 indicate that they will vote for X. At a significance level of 5% test Mr. X belief. 76 Hypothesis testing

252 2 p = =0.63, S =0.6(1 0.6) = 0.24.The 0 : π = π0 and the alternative 400 − H p π0 0.63 0.6 is : π>π. The critical value is 1.64.NowZ = − = =1.22. 1 0 S 0.489/−√400 H √n Consequently, the null is not rejected as Z<1.64.ThusMr.Xbeliefiswrong.

If fact we have the following possible outcomes when testing hypotheses:

0 is accepted 1 is accepted H H 0 is correct Correct decision (1 α) Type I error (α) H − 1 is correct Type II error (β) Correct decision (1 β) H − An operating characteristic curve presents the probability of accepting a null hypothesis for various values of the population parameter at a given significance level α using a particular sample size. The power of the test is the inverse function of the operating characteristic curve, i.e. it is the probability of rejecting the null hypothesis for various possible values of the population parameter. Part III

Asymptotic Theory

77

Chapter 8

MODES OF CONVERGENCE

We have a statistic T which is a measurable function of the data

Tn = T (X1, ..., Xn) ,

and we would like to know what happens to Tn as n . It turns out that the limit →∞ is easier to work with than Tn itself. The plan is to use the limit as approximation device. We think of a sequence T1,T2, ... which have distribution functions F1,F2,....

Definition A sequence of random variables T1,T2, ... converges in probability p to a random variable T (denoted by Tn T )if,foreveryε>0, −→

lim P [ Tn T >ε]=0. n →∞ | − |

Definition A sequence of random variables T1,T2,...converges in mean square ms to a random variable T (denoted by Tn T )if, −→

2 lim E (Tn T ) =0 n →∞ − which is equivalent to (a) Var(Tn) 0 and (b) E (Tn) E (T) 0 because of the → − → triangle inequality.

Theorem 19 Convergence in mean square implies convergence in probability.

Proof. By the Markov/Chebychev inequality,

2 E Tn T P [ Tn T ε] | − | 0. | − | ≥ ≤ ε2 → 80 Modes of Convergence

Note that if Tn > 0,

2 E (Tn) E (Tn) P [Tn ε] ≥ ≤ ε2 ≤ ε

E(Tn) p so that if 0,thisissufficient for Tn 0. ε → −→ But the converse of the theorem is not necessarily true. To see this consider the following random variable

1 n with probability n Tn = ,T=0. ⎧ 0 with probability 1 1 ⎨ − n

1 1 2 2 1 Then P [Tn ε]= for⎩ any ε>0, and P [Tn ε]= 0. But E (Tn) = n = ≥ n ≥ n → n n . →∞ A famous consequence of the theorem is the (Weak)

Theorem 20 WEAK LAW of LARGE NUMBERS Let X1, ..., Xn be i.i.d. with 2 E (Xi)=μ and Var(Xi)=σ < ,andTn = X. Then for all ε>0, ∞

p lim P [ Tn μ >ε]=0,i.e.,Tn μ. n →∞ | − | −→ The proof is easy because

2 2 σ E[(Tn μ) ]= 0. − n → as we have shown. In fact, the result can be proved with only the hypothesis that E X < by using a truncation argument. | | ∞ Another application of the previous theorem is to the empirical distribution function, i.e., n 1 p F (x)= 1(X x) F (x) . n n i i=1 ≤ −→ X The next result is very important for applying the Law of Large Numbers beyond the simple sum of iid’s. Modes of Convergence 81

p Theorem 21 CONTINUOUS MAPPING THEOREM If Tn μ aconstantand −→ g(.) is a continuous function at μ,then

p g (Tn) g (μ) . −→

Proof. Let ε>0. By the continuity of g at μ, η>0 such that if ∃

x μ <η g (x) g (μ) <ε | − | ⇒ | − |

Let An = Tn μ <η and Bn = g (Tn) g (μ) <ε .ButwhenAn is true so is {| − | } {| − | } Bn, i.e., An Bn. Since P (An) 1, we must have that P (Bn) 1. ⊂ → →

Now we look at the sample variance

2 1 n 1 n s2 = X2 X n i n i i=1 − Ã i=1 ! X X We know that: n 1 p X2 E X2 = σ2 + μ2 n i i i=1 −→ X ¡ ¢ and n n 2 1 p 1 p X μ X μ2 n i n i i=1 −→ ⇒ Ã i=1 ! −→ X X by the continuous mapping theorem. Combining these two results we get

p s2 σ2. −→

/ Finally, notice that when dealing with a vector Tn =(Tn1, ..., Tnk) ,wehave that p Tn T 0, k − k −→ where x = x/x 1/2 is the Euclidean , if and only if k k

¡ ¢ p Tnj Tj 0 | − | −→ 82 Modes of Convergence for all j =1, ..., k. Theifpartisnosurpriseandfollowsfromthecontinuousmapping theorem.Theonlyifpartfollowsasif Tn T <εthen Tnj Tj <ηfor each j k − k | − | and some η>0.

Definition A sequence of random variables T1,T2, ... converges almost surely as to a random variable T (denoted by Tn T )if,foreveryε>0, −→

P [ lim Tn T <ε]=1. n →∞ | − | This result is generally harder to establish than convergence in probability, i.e., therearenotsimplesufficient conditions based on mean and variance. Almost sure convergence implies convergence in probability but not vice versa. Note again that vector convergence is equivalent to componentwise convergence. Continuous mapping theorem is obvious. Let

A = ω : Tn (ω) T (ω) ,P(A)=1 { → }

On this set A,wehave

g [Tn (ω)] g [T (ω)] , → by ordinary continuity.

Theorem 22 STRONG LAW of LARGE NUMBERS. If E X < ,then | | ∞

as Tn (ω) E (X) . −→

We can have Strong Law of Large Numbers applied to empirical distribution functions and to sample variances (from the continuous mapping theorem) etc. Chapter 9

ASYMPTOTIC THEORY 2

We can now establish the convergence in distribution and the central limit theorem, which is of great importance.

Definition A sequence of random variables T1,T2, ... converges in distribution D to a random variable T (denoted by Tn T )if −→

lim P [Tn x]=P [T x] n →∞ ≤ ≤ at all points of continuity of FT (x)=P [T x]. ≤ Convergence in distribution is weaker than in probability, i.e.,

P D Tn T Tn T −→ ⇒ −→ but not vice versa, except when the limit is nonrandom, i.e.,

D P Tn α Tn α. −→ ⇒ −→

D Theorem 23 A sequence of random vectors (Tn1,Tn2, ..., Tnk) (T1,T2, ..., Tk) iff −→

/ D / c Tn c T −→

/ for any c =(c1,c2, ..., ck) =0. 6

This is known as the Cram´er Wold device. The main result is the following: − 84 Asymptotic Theory 2

Theorem 24 Central Limit Theorem of Lindemberg-Lévy. Let X1,X2,...,Xn be i.i.d. 2 with E (Xi)=μ, Var(Xi)=σ < . Then ∞

√n X μ D N 0,σ2 − −→ √n X μ T = ¡ − ¢ D N¡(0, 1)¢ n σ ¡ ¢ −→ / The vector version: Let X1,X2,...,Xn be i.i.d. with E (Xi)=μ, E (Xi μ)(Xi μ) = − − Σ where 0 < Σ < . Then h i ∞

D Tn = √n X μ N (0, Σ) . − −→ ¡ ¢ A modern proof of the result is based on characteristic functions.

2 Example. If Xi v N (μ, σ ) , then

√n X μ T = − D N (0, 1) n σ ¡ ¢ −→ for all n a result which is trivial. Suppose instead that

10.5 Xi = ⎧ 10.5 ⎨ −

E (Xi)=0,Var(Xi)=1. ⎩ D We know that Tn N (0, 1) . −→ 2/√21/4 X1 + X2 n =2: = ⎧ 01/2 √2 ⎪ ⎪ ⎨ 2/√21/4 − ⎪ ⎩⎪ 3/√31/8 X + X + X ⎧ 1/√33/8 1 2 3 ⎪ n =3: = ⎪ √3 ⎪ 1/√33/8 ⎨⎪ − 3/√31/8 ⎪ ⎪ − ⎪ etc. The Binomial distribution gets closer and closer⎩⎪ to normal. Asymptotic Theory 2 85

We can now approximately calculate for example

√n X μ √n (10 μ) P X>10 = P − > − σ σ Ã ¡ ¢ ! ¡ ¢ √n (10 μ) = P Z> − ∼ σ µ ¶ √n (10 μ) √n (10 μ) =1 P − Z =1 Φ − − σ ≤ − σ µ ¶ µ ¶ CLT for non-identically distributed random variables.

Theorem 25 (Lyapunov) Suppose that X1,X2, ..., Xn are independent random vari- ables with 2 3 E (Xi)=μ, V ar (Xi)=σ ,EXi μ = m3i i | − i| and additionally 1/2 1/3 n − n 2 σi m3i 0 Ã i=1 ! Ã i=1 ! → X X e.g. if 1 n 1 n σ2 σ2 m m n i n 3i 3 i=1 → i=1 → X X then

√n X μ D N 0,σ2 − −→ X ¡ E X ¢ D ¡ ¢ Tn = − N (0, 1) Var¡X ¢ −→ q The Lindeberg-Feller CLT is even¡ weaker.¢

Theorem 26 (Lindeberg-Feller) Let xi be independent with mean μi and variance 2 2 n 2 σi , and distribution functions Fi.SupposethatBn = i=1 σi satisfies 2 σn 2 P 2 0,Bn ,asn . Bn → →∞ →∞ Then 1 n n ( xi μ ) n i=1 − i=1 i D N (0, 1) 1 2 1/2 −→ P n2 Bn P £ ¤ 86 Asymptotic Theory 2 if and only if the Lindeberg condition

n (t μ )2 dF (t) i=1 t μ >εBn i i | − i| − 2 0,n ,eachε>0 P R Bn → →∞ is satisfied.

The key condition for the above CLT is the Lindeberg condition. Which basically ensures that no one term is so relatively large as to dominate the entire sample, in the limit. The following CLT gives some more transparent conditions that are sufficient for the Lindeberg condition to hold.

2 2 − Theorem 27 Let xi be independent with mean μi and variance σi ,andletσn = 1 2 n σi .If

X 1 2+δ 2+δ max1 i n E xi ≤ ≤ | | B< δ>0, n 1 ³ ´ σ−n ≤ ∞ ∀ ≥ then 1 n 1 n √n xi μ n i=1 − n i=1 i d N (0, 1) → ¡ P σ−n P ¢ The above condition although sufficient is not necessary. To see this assume

1 1 that yt =0with probability 1 2 , 0 with the same probability and t with 2 − t 1 probability t2 .Inthiscaseyt tends¡ to¢ a Bernoulli random variable, and the CLT certainly applies in this case. Yet the condition of the above theorem is not satisfied.

1 Furthermore, let assume that yt =0with probability 1 t2 and t with prob- − 2 1 1 − ability 2 .ThenE (yt)=1/t 0,andVar(yt)=1 2 1. Hence σ 1. t → − t → n → Despite this, it is clear that yt is converging to a degenerate random variable which takes the value of 0 with probability 1 (in fact is xt = yt E(yt) that is degenerate). 1 − 2+δ 2+δ δ However, it is verified that E xi = O t δ+2 and consequently for any δ>0 | | the condition of the above³ theorem must´ fail for³ n large´ enough. Dependent random variables CLT’s are available too. Combination Properties 87

9.1 Combination Properties

D P Theorem 28 (Slutsky’s) Suppose that andXn X and Yn c.Then −→ −→

D YnXn cX −→ D Yn + Xn c + X −→ D XnYn/ X/c if c nonzero −→

Application: Suppose that we look at

√n X μ − , s ¡ X ¢ when sX isthesamplevariance.TheCLTtellusthat

√n X μ D N 0,σ2 − −→ ¡ ¢ ¡ ¢ The LLN and CMT say that p sX σ. −→ Therefore, √n X μ N (0,σ2) − D = N (0, 1) s σ ¡ X ¢ −→ Theorem 29 Continuous Mapping Theorem II. Suppose that

D Tn =(Tn1,Tn2,..,Tnk) T =(T1,T2, .., Tk) −→ and g : Rk Rq. Then → D g (Tn) g (T ) −→

EXAMPLE.

D Yn + Xn X + Y −→ D YnXn YX −→ but notice that the assumption requires the joint convergence of (Yn,Xn) . 88 Asymptotic Theory 2

d Theorem 30 (Cramer) Assume that Xn N (μ, Σ),andAn is a conformable ma- → d / trix with plimAn = A.ThenAnXn N μ, AΣA → ¡ ¢ Notice that if √n X μ − D N (0, 1) s ¡ X ¢ −→ then 2 n X μ − D χ2 s2 1 ¡ X ¢ −→

9.2 Delta Method.

Suppose that

∧ D √n θ θ0 X − −→ µ ¶ where X has a cdf F and θ is a p 1 vector. Suppose that g : Rp Rq. Then × →

∧ D ∂g √n g θ g (θ0) (θ0) X − −→ ∂θ/ · p 1 µ µ ¶ ¶ × q p × The proof is by the mean value theorem, i.e.,| {z }

∧ ∂g − ∧ g θ = g (θ0)+ θ θ θ0 ∂θ/ − µ ¶ µ ¶µ ¶ where −θ lies between θ and θ∧.Nowfor ∂g continuous at θ ,wehave 0 ∂θ/ 0

− P ∂g − P ∂g θ θ0 θ (θ0) . −→ ⇒ ∂θ/ −→ ∂θ/ µ ¶ Therefore,

∧ D ∂g √n g θ g (θ0) (θ0) X − −→ ∂θ/ · µ µ ¶ ¶ as required. D For example, sin X when μ =0.Now(sin x)/ =cosx. Hence √n sin X −→ N (0, 1) . In fact we can state the following theorem. Delta Method. 89

2 Theorem 31 Suppose that Xn is asymptotically distributed as N (μ, σ ), with σn n → 0.Letg be a real valued function differentiable m (m 1) times at x = μ,with ≥ gm (μ) =0but gj (μ)=0for j

2 For example, let Xn be asymptotically N (0,σ ),withσn 0.Then n → 2 log (1 + Xn) d 2 2 χ1. σn → Toseethisapplytheabovetheoremwithg (x)=log2(1 + x), m =2and μ =0. 90 Asymptotic Theory 2 Chapter 10

ASYMPTOTIC ESTIMATION THEORY

∧ Let θn (p 1) be an estimator, applied to a sample of size n, of a vector parameter × θ0. Both and must be elements of the set Θ of all admissible values of the para- meters, called the parameter space, which can in principle be defined to be Rp,or p-dimensional Euclidian space. For technical reasons, Θ must be a compact subset

p of R i.e. bounded and closed, i.e. it contains its boundary points. Furthermore, θ0 must be an interior point of Θ. Thisistosaythatθ0 int (Θ) if there exist a real ∈ number δ>0 such that θ Θ whenever θ θ0 <δ.Thisexcludesθ0 being at the ∈ k − k boundary of the set.

∧ ∧ Definition 32 θn is a consistent estimator of θ0 if plimθn = θ0.

Consistency might be a minimum requirement for a useful estimator. Proofs of consistency play an important role in econometric theory. The forms that these

∧ proofs is that if limn E θn = θ0, i.e. the estimator is asymptotically unbiased, →∞ µ ¶ ∧ and limn Var θn =0suffices for the consistency of the estimator. →∞ µ ¶ ∧ k ∧ Now suppose θn is consistent, and n θn θ0 = Op (1) for some k>0,and − has a non-degenerate limit distribution as nµ .¶ This distribution is called the →∞ ∧ asymptotic distribution of θn.

∧ ∧ Definition 33 θn is said to be consistent and asymptotically normal (CAN) for θn ∈ k ∧ d int (Θ) if there exist k>0 such that n θn θ0 N (0,V),whereV is a finite − → variance-. µ ¶ 92 Asymptotic Estimation Theory

In most applications k =1/2, although it can be larger than this for models containing determinist trend terms. There are also case where k>1/2 but the limiting distribution is not normal, e.g. when there are trends. Asymptotic normality is an important property to establish for an estimator, as it is often the only basis for constructing interval estimates and tests of hypotheses.

∧ Let denote by the class of CAN estimators of θ0,andwriteθn to denote C ∈ C that the estimator belongs to this class.

∧ Definition 34 θn is said to best asymptotically normal for θ0 (BAN) in the class ∈ C v ∧ v if AV ar θn AV ar θn is positive semi-definite for every θn . − ∈ C ³ ´ µ ¶ This property is also called asymptotic efficiency. BAN can be seen as an asymptotic counterpart of the BLUE property.

10.1 Asymptotics of the Stochastic Regressor Model

Assume the regression model:

/ yt = xt β + ut t =1, 2,...,n

Let the following assumptions hold for all n>k

E (u)=0 a.s.

/ 2 E uu = σ In a.s. rank¡ (X¢)=ka.s. where u is the (n 1) vector of the errors, X is the (n k) matrix of the explana- × × tory variables and In is the identity matrix of n. These are the usual assumptions. Furthermore, assume that n n 1 / 1 / p lim xtxt = lim E xtxt = Mxx < (p.d.) n n n n →∞ t=1 →∞ t=1 ∞ X X ³ ´ and 2+δ / E λ xtut B< δ>0, fixed λ ≤ ∞ ∀ ¯ ¯ ¯ ¯ ¯ ¯ Asymptotics of the Stochastic Regressor Model 93

1 / The first additional condition can be written as plimn− X X = Mxx has two com- ponents. The weak law of large numbers must apply to the squares and the cross- products of the elements of xt,andMxx must have full rank. The latter can fail even if the matrix X has rank k for every finite n. To see this take the fairly trivial

n 1 π2 1 n 1 example xt =1/t.Thenlimn t=1 2 = . Hence limn n− t=1 2 =0. →∞ t 6 →∞ t The least squared estimatorP can be written as P

1 1 n − n n − n ∧ / / β = xtxt xtyt = β + xtxt xtut à t=1 ! t=1 à t=1 ! t=1 X X X X 2 2 Consider now the k 1 vector xtut.SinceE(ut xt)=0and E(u xt)=σ the Law × | t | of Iterated Expectations gives

2 / 2 / Var(xtut)=E E u xtx xt = σ E xtx < . t t | t ∞ h ³ ´i ³ ´ Furthermore, the ut s are independent hence, the Weak Law of Large Numbers can 0 be applied on xtut, i.e. 1 n p lim x u =0 n t t t=1 X 1 / which is written as p lim n X u =0. Then by the Continuous Mapping Theorem we have that 1 ∧ 1 / − 1 / 1 p lim β β = p lim X X p lim X u =M − 0=0 − n n xx µ ¶ which is the consistency result. / Let us now consider the sequence λ xtut.Wehavethat

n 1 / 2 / lim Var λ xtut = σ λ Mxxλ. n n →∞ t=1 X ³ ´ 2 / Since now Mxx is positive definite 0 <σ λ Mxxλ< . Hence the denominator of the ∞ 2+δ / condition of Theorem 20 is bounded and bigger than 0.Furthermore,E λ xtut ≤ ¯ ¯ B ensures the condition of the same Theorem and consequently we have¯ that ¯ ¯ ¯ 1 n λ/x u d N 0,σ2λ/M λ √n t t xx t=1 → X ³ ´ 94 Asymptotic Estimation Theory for each specific λ. But this is equivalent to

1 / d 2 X u N 0,σ Mxx . √n → ¡ ¢ Finally 1 ∧ 1 / − 1 / d 2 1 √n β β = X X X u N 0,σ M − . − n √n → xx µ ¶ µ ¶ ¡ ¢ Part IV

Likelihood Function

95

Chapter 11

MAXIMUM LIKELIHOOD ESTIMATION

Let the observations be x =(x1,x2, ..., xn), and the Likelihood Function be denoted by L (θ)=d (x; θ), θ Θ k. Then the Maximum Likelihood Estimator (MLE) is: ∈ ⊂ <

θ∧ =argmaxp (x; θ) d x; θ∧ d (x; θ) θ Θ. θ Θ ⇔ ≥ ∀ ∈ ∈ µ ¶

If θ∧ is unique, then the model is identified. Let (θ)=lnd (x; θ),thenforlocal identification we have the following Lemma.

Lemma 35 ThemodelisLocallyIdentified iff the Hessian is negative definite with probability 1, i.e. ∂2 θ∧ Pr ⎡H θ∧ = µ ¶ < 0⎤ =1. ∂θ∂θ/ µ ¶ ⎢ ⎥ ⎢ ⎥ ⎣ ⎦ Assume that the log-Likelihood Function can be written in the following form:

n

(θ)= ln d (xi; θ) . i=1 X Then we usually make the following assumptions:

Assumption A1. The range of the random variable x,sayC, i.e.

C = x n : d (x; θ) > 0 , { ∈ < }

be independent of the parameter θ. 98 Maximum Likelihood Estimation

Assumption A2. The Log-Likelihood Function ln d (x; θ) has partial derivatives with respect of θ up to third order, which are bounded and integrable with respect to x.

The vector ∂(x; θ) s (x; θ)= ∂θ which is k 1, is called the score vector. We can state the following Lemma. ×

Lemma 36 Under the assumptions A1. and A2. we have that

∂2 E (s)=0,Ess/ = E . − ∂θ∂θ/ ∙ ¸ ¡ ¢

Proof: As L (x, θ) is a density function it follows that

L (x, θ) dx =1

ZC where C = x n : L (x, θ) > 0 n.UnderA1.C is independent of θ. Conse- { ∈ < } ⊂ < quently, the derivative, with respect to θ, of the integral is equal to the integral of the derivative.∗ Consequently taking derivatives of the above integral we have:

∂ ∂L(x, θ) ∂L 1 ∂ ln L L (x, θ) dx =0 dx =0 Ldx =0 Ldx =0 ∂θ ⇒ ∂θ ⇒ ∂θ L ⇒ ∂θ ZC ZC ZC ZC ∂ Ldx =0 sLdx =0 E (s)=0. ⇒ ∂θ ⇒ ⇒ ZC ZC

∗In case that C was dependent on θ, we would have to apply the Second Fundamental Theorm of Analysis to find the derivative of the integral. Maximum Likelihood Estimation 99

Hence, we also have that E s/ =0and taking derivatives with respect to θ we have ¡ ¢

∂ ∂ ∂ ∂ ∂ s/Ldx =0 Ldx =0 L dx =0 ∂θ ⇒ ∂θ ∂θ/ ⇒ ∂θ ∂θ/ ZC ZC ZC µ ¶ ∂ ∂ ∂L ∂ L + dx =0 ⇒ ∂θ ∂θ/ ∂θ ∂θ/ ZC ∙µ ¶ µ ¶¸ ∂2 ∂L 1 ∂ L + L dx =0 ⇒ ∂θ∂θ/ ∂θ L ∂θ/ ZC ∙ µ ¶¸ ∂2 ∂ ∂ L + L dx =0 ⇒ ∂θ∂θ/ ∂θ ∂θ/ ZC ∙ ¸ ∂ ∂ ∂2 Ldx = Ldx ⇒ ∂θ ∂θ/ − ∂θ∂θ/ ZC ZC ∂2 ∂2 ss/Ldx = Ldx E ss/ = E ⇒ − ∂θ∂θ/ ⇒ − ∂θ∂θ/ ZC ZC µ ¶ ¡ ¢ which is the second result. ¥ The matrix

∂ ∂ ∂2 J (θ)=E ss/ = E = E ∂θ ∂θ/ − ∂θ∂θ/ µ ¶ µ ¶ ¡ ¢ is called (Fisher) Information Matrix and is a measure of the information that the sample x contains about the parameters in θ.Incasethat (x, θ) canbewrittenas n (x, θ)= i=1 (xi,θ) we have that

P 2 n n 2 2 ∂ ∂ (xi,θ) ∂ (xi,θ) J (θ)= E / (xi,θ) = E / = nE / . − Ã∂θ∂θ i=1 ! − i=1 ∂θ∂θ − ∂θ∂θ X X µ ¶ µ ¶ Consequently the Information matrix is proportional to the sample size. Furthermore, 2 by Assumption A2. E ∂ (xi,θ) is bounded. Hence ∂θ∂θ/ ³ ´ J (θ)=O (n) .

Now we can state the following Lemma that will be needed in the sequel. 100 Maximum Likelihood Estimation

Lemma 37 Under assumptions A1. and A2. we have that for any unbiased estima- tor of θ,sayθ, E θs/ = I , e k ³ ´ where Ik is the identity matrix of orderek.

Proof: As θ is an unbiased estimator of θ,wehavethat

e E θ = θ θL (x; θ) dx = θ. ⇒ C ³ ´ Z Taking derivatives, with respecte to θ/,wehavethate

∂ ∂L(x; θ) 1 θ (x) L (x; θ) dx = Ik θ (x) L (x; θ) dx = Ik ∂θ/ ⇒ ∂θ/ L (x; θ) ZC ZC ∂(x, θ) / e θ (x) e L (x; θ) dx = Ik θ (x) s L (x; θ) dx = Ik ⇒ ∂θ/ ⇒ ZC ZC / E eθs = Ik e ⇒ ³ ´ e which is the result. ¥ We can now prove the Cramer-Rao Theorem.

Theorem 38 Under the regularity assumptions A1. and A2. we have that for any unbiased estimator θ we have that

e 1 J − (θ) V θ . ≤ ³ ´ e / Proof: Let us define the following 2k 1 vector ξ/ = δ/,s/ = θ θ/,s/ . × − Now we have that ³ ´ ³ ´ e

/ / δδ δs V θ Ik V (ξ)=E ξξ/ = = . ⎡ sδ/ ss/ ⎤ ⎡ I³ ´ J (θ) ⎤ ³ ´ ke ⎣ ⎦ ⎣ ⎦ For the above result we took into consideration that θ is unbiased, i.e. E θ = θ, the above Lemma, i.e. E θs/ = I , E (s)=0,andE ss/ = J (θ).Itisknown³ ´ k e e ³ ´ ¡ ¢ e Maximum Likelihood Estimation 101 that all variance-covariance matrices are positive semi-definite. Hence V (ξ) 0.Let ≥ us define the following matrix

1 B = Ik J− (θ) . − h i The matrix B is k 2k and of rank k. Consequently, as V (ξ) 0,wehavethat × ≥ / 1 BV (ξ) B 0. Hence, as Ik and J− (θ) are symmetric we have ≥

V θ Ik Ik / 1 BV (ξ) B = Ik J− (θ) 1 − ⎡ I³ ´ J (θ) ⎤ ⎡ J− (θ) ⎤ h i ke − ⎣ ⎦ ⎣ ⎦ Ik 1 1 = V θ J− (θ)0 = V θ J− (θ) 0. 1 − ⎡ J− (θ) ⎤ − ≥ h ³ ´ i − ³ ´ e ⎣ ⎦ e 1 1 Consequently, V θ J− (θ), in the sense that V θ J− (θ) is a positive semi- ≥ − definite matrix. ³ ´ ³ ´ ¥e e 1 The matrix J− (θ) is called the Cramer-Rao Lower Bound as it is the lower bound for any unbiased estimator (either linear or non-linear).

Let, now, θ0 denote the true parameter values of θ and d (x; θ0) be the likeli- hood function evaluated at the true parameter values. Then for any function f (x) we define

E0 [f (x)] = f (x) d (x; θ0) dx. Z Let (θ) /n be the average log-likelihood and define a function z : Θ as → < 1 z (θ)=E ( (θ) /n)= (θ) d (x; θ ) dx. 0 n 0 Z Then we can state the following Lemma:

Lemma 39 θ Θ we have that ∀ ∈

z (θ) z (θ0) ≤ with strict inequality if

Pr [x S : d (x; θ) = d (x; θ0)] > 0 ∈ 6 102 Maximum Likelihood Estimation

Proof: From the definition of z (θ) we have that

d (x; θ) d (x; θ) n [z (θ) z (θ0)] = E0 [ (θ) (θ0)] = E0 ln ln E0 − − d (x; θ ) ≤ d (x; θ ) ∙ 0 ¸ ∙ 0 ¸ d (x; θ) =ln d (x; θ ) dx =ln d (x; θ) dx =ln1=0 d (x; θ ) 0 Z 0 Z where the inequality is due to Jensen. The inequality is strict when the ratio d(x;θ) d(x;θ0) is non-constant with probability greater than 0. ¥

When the observations x =(x1,x2, ..., xn) arerandomlysampled,wehavethat

n

(θ)= zi where zi =lnd (xi; θ) i=1 X and the zi random variables are independent and have the same distribution with mean E (zi)=z (θ). Then from the Weak Law of Large Numbers we have that

1 n 1 p lim z = p lim (θ)=z (θ) . n i n i=1 X However, the above is true under weaker assumptions, e.g. for dependent observations or for non-identically distributed random variables etc. To avoid a lengthy exhibition of various cases we make the following assumption:

Assumption A3. Θ is a compact subset of k,and < 1 p lim (θ)=z (θ) θ Θ. n ∀ ∈

Recall that a closed and bounded subset of k is compact. <

Theorem 40 Under the above assumption and if the statistical model is identified we have that

∧ p θ θ0 →

∧ where θ is the MLE and θ0 the true parameter values. Maximum Likelihood Estimation 103

Proof: Let N is an open sphere with centre θ0 and radius ε,i.e. N =

θ Θ : θ θ0 <ε .ThenN is closed and consequently A = N Θ is closed { ∈ k − k } ∩ and bounded, i.e. compact. Hence

max z (θ) θ A ∈ exist and we can define

δ = z (θ0) max z (θ) (11.1) θ A − ∈ Let Tδ S the event (a subset of the sample space) which defined by ⊂ 1 δ θ Θ (θ) z (θ) < . (11.2) ∀ ∈ n − 2 ¯ ¯ ¯ ¯ ¯ ¯ Hence (11.2) applies for θ = θ∧ as well.¯ Hence ¯

∧ 1 ∧ δ for Tδ z θ > θ . ⇒ n − 2 µ ¶ µ ¶

∧ Given now that θ (θ0) we have that ≥ µ ¶

∧ 1 δ for Tδ z θ > (θ0) . ⇒ n − 2 µ ¶ Furthermore, as θ0 Θ,wehavethat ∈ 1 δ for Tδ (θ0) >z(θ0) . ⇒ n − 2 from the relationship that if x d

∧ for Tδ z θ >z(θ0) δ. ⇒ − µ ¶ Substituting out δ, employing (11.1) we get

∧ for Tδ z θ > max z (θ) . ⇒ θ A µ ¶ ∈ Hence θ/∧ A θ/∧ N Θ θ∧ N Θ θ∧ N. ∈ ⇒ ∈ ∩ ⇒ ∈ ∩ ⇒ ∈ 104 Maximum Likelihood Estimation

∧ Consequently we have shown that when Tδ is true then θ N.Thisimpliesthat ∈

∧ Pr θ N Pr (Tδ) ∈ ≥ µ ¶ and taking limits, as n we have →∞

lim Pr ( θ θ0 <ε) lim Pr (Tδ)=1 n n →∞ k − k ≥ →∞ by the definition of N and by assumption A3. Hence, as ε is any small positive number, we have

ε>0limPr ( θ θ0 <ε)=1 n ∀ →∞ k − k which is the definition of probability limit. ¥

When the observations x =(x1,x2,...,xn) are randomly sampled, we have :

n

(θ)= ln d (xi; θ) i=1 X ∂(θ) n ∂ ln d (x ; θ) s (θ)= = i ∂θ ∂θ i=1 2 Xn 2 ∂ (θ) ∂ ln d (xi; θ) H (θ)= / = / . ∂θ∂θ i=1 ∂θ∂θ X Given now that the observations xi are independent and have the same distribution,

∂ ln d(xi;θ) with density d (xi; θ), the same is true for the vectors ∂θ and the matrices 2 ∂ ln d(xi;θ) . Consequently we can apply a Central Limit Theorem to get: ∂θ∂θ/

n ∂ ln d (x ; θ) _ n 1/2s (θ)=n 1/2 i d N 0, J (θ) − − ∂θ i=1 → X ³ ´ and from the Law of Large Numbers

n 2 _ 1 1 ∂ ln d (xi; θ) p n− H (θ)=n− / J (θ) i=1 ∂θ∂θ →− X where _ 2 1 1 1 ∂ (θ) J (θ)=n− J (θ)=n− J (θ)= n− E , − ∂θ∂θ/ µ ¶ Maximum Likelihood Estimation 105 i.e., the average Information matrix. However, the two asymptotic results apply even if the observations are dependent or identically distributed. To avoid a lengthy exhibition of various cases we make the following assumptions. As n we have: →∞

Assumption A4. _ 1/2 d n− s (θ) N 0, J (θ) → ³ ´ and

Assumption A5. _ 1 p n− H (θ) J (θ) →− where _ J (θ)=J (θ) /n = E ss/ /n = E (H) /n. − ¡ ¢ We can now state the following Theorem

Theorem 41 Under assumptions A2. and A3.the above two assumptions and iden- tification we have: _ 1 ∧ d − √n θ θ0 N 0, J (θ0) − → µ ¶ ∙ ³ ´ ¸ ∧ where θ is the MLE and θ0 the true parameter values.

Proof: As θ∧ maximises the Likelihood Function we have from the first order conditions that ∂ θ∧ s θ∧ = µ ¶ =0. ∂θ µ ¶ From the Mean Value Theorem, around θ0,wehavethat

∧ s (θ0)+H (θ ) θ θ0 =0 (11.3) ∗ − µ ¶

∧ ∧ where θ θ, θ0 , i.e. θ θ0 θ θ0 . ∗ ∈ k ∗ − k ≤ − ∙ ¸ ° ° ° ° ° ° ° ° 106 Maximum Likelihood Estimation

Now from the consistency of the MLE we have that

∧ θ = θ0 + op (1)

whereop (1) is a random variable that goes to 0 in probability as n .Asnow →∞ ∧ θ θ, θ0 ,wehavethat ∗ ∈ ∙ ¸ θ = θ0 + op (1) ∗ as well. Hence from 11.3 we have that

∧ 1 √n θ θ0 = [H (θ ) /n]− s (θ0) /√n. − − ∗ µ ¶

As now θ = θ0 + op (1) and under the second assumption we have that ∗ _ 1 ∧ − √n θ θ0 = J (θ0) s (θ0) /√n + op (1) . − µ ¶ h i

∧ d Under now the first assumption the above equation implies that √n θ θ0 − → _ 1 µ ¶ − N 0, J (θ0) . ¥ ∙ ³ ´ ¸ 2 Example: Let yt v N (μ, σ ) i.i.d for t =1,...,T.Then

T T 1 T (θ)= ln (2π) ln σ2 (y μ)2 2 2 2σ2 t − − − t=1 − ¡ ¢ X μ where θ = .Now ⎛ σ2 ⎞ ⎝ ⎠ ∂ 1 T = (y μ) ∂μ σ2 t t=1 − X and T ∂ T 1 2 = − + (yt μ) . ∂σ2 2σ2 2σ4 t=1 − X Now ∂ θ∧ ∂ θ∧ µ ¶ = µ ¶ =0. ∂μ ∂σ2 Maximum Likelihood Estimation 107

Hence T T 2 1 ∧ 1 μ∧ = y and σ2 = y μ∧ . T t T t t=1 t=1 − X X ³ ´ Now

2 T 1 T ∂ (θ) 2 4 (yt μ) H (θ)= = − σ − σ t=1 − , ∂θ∂θ/ ⎡ 1 T T 1 T 2 ⎤ 4 t=1 (yt μ) 4 P6 t=1 (yt μ) − σ − 2σ − σ − ⎣ P P ⎦ and consequently evaluating H (θ) at θ∧ we have

T 0 ∧ − σ∧2 H θ = ⎡ ⎤ 0 T µ ¶ ∧ ⎢ − 2σ4 ⎥ ⎣ ⎦ which is clearly negative definite. Now the Information matrix is:

T 1 T 2 4 (yt μ) J (θ)= E (H (θ)) = E σ σ t=1 − − ⎛⎡ 1 T T 1 T 2 ⎤⎞ 4 (yt μ) 4 +P6 (yt μ) σ t=1 − − 2σ σ t=1 − ⎝⎣ ⎦⎠ T 0 P P = σ2 ⎡ T ⎤ 0 2σ4 ⎣ ⎦ 108 Maximum Likelihood Estimation Chapter 12

RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION

Let us assume that the k 1 vector of parametersθ0 satisfy r constrains, i.e. ×

ϕ (θ)=0 where ϕ (θ) and 0 are r 1 vectors. Let us also assume that × ∂ϕ(θ) F (θ)= ∂θ/ the r k matrix of derivatives has rank r, i.e. there no redundant constrain. × Under these conditions, MLE is still consistent but not is not asymptotically efficient, as it ignores the information in the constrains. To get an asymptotically efficient estimator we have to take into consideration the information of the constrains. Hence we form the Lagrangian:

L (θ, λ)= (θ)+λ/ϕ (θ)= (θ)+ϕ/ (θ) λ where λ is the r 1 vector of Lagrange Multipliers. The first order conditions are: × ∂L ∂ = + F / (θ) λ = s (θ)+F / (θ) λ =0 ∂θ ∂θ ∂L ∂λ/ = ϕ (θ)=I ϕ (θ)=ϕ (θ)=0. ∂λ ∂λ k

Let θ and λ the constrained ML estimators, i.e. the solution of the above first order conditions, i.e. e e s θ + F / θ λ =0 (12.1) ³ ´ ³ ´ e e e 110 Restricted Maximum Likelihood Estimation

ϕ θ =0. (12.2) ³ ´ Applying the Mean Value Theorem arounde θ0 we have to s θ and ϕ θ we get ³ ´ ³ ´ ∂s(θ ) e e s θ = s (θ0)+ ∗ θ θ0 = s (θ0)+H (θ ) θ θ0 (12.3) ∂θ/ − ∗ − ³ ´ ∙ ¸ ³ ´ ³ ´ e ∂ϕ(θ ) e e ϕ θ = ϕ (θ0)+ ∗∗ θ θ0 = ϕ (θ0)+F (θ ) θ θ0 (12.4) ∂θ/ − ∗∗ − ³ ´ ∙ ¸ ³ ´ ³ ´ where θ ande θ are defined as e e ∗ ∗∗

θ θ0 θ θ0 k ∗ − k ≤ − ° ° θ θ0 °θ θ0° . k ∗∗ − k ≤ °e − ° ° ° ° ° Notice that θ and θ are not necessarily the° same.e ° ∗ ∗∗ Substituting (12.3) into (12.1), (12.4) into (12.2) and taking into account that under the null ϕ (θ0)=0we get

/ s (θ0)+H (θ ) θ θ0 + F θ λ =0 ∗ − ³ ´ ³ ´ and e e e

F (θ ) θ θ0 =0. ∗∗ − ³ ´ Hence we get e

1 1 1 / s (θ0)+ H (θ ) √n θ θ0 + F θ λ =0 (12.5) √n n ∗ − √n ³ ´ ³ ´ and e e e

√nF (θ ) θ θ0 =0. (12.6) ∗∗ − ³ ´ Now θ is consistent and so are θ ande θ ,i.e. ∗ ∗∗

θe= θ0 + op (1) ,θ= θ0 + op (1) ,andθ= θ0 + op (1) . ∗ ∗∗

Furthermore,e according to our assumptions we have that

1 _ H (θ )= J (θ0)+op (1) n ∗ − Restricted Maximum Likelihood Estimation 111

_ where J (θ0) the average information matrix. Hence, equations (12.5) and (12.6) become:

_ s (θ0) / λ J (θ0) √n θ θ0 + F (θ0) = op (1) (12.7) √n − − √n ³ ´ e and e

F (θ0) √n θ θ0 = op (1) . (12.8) − ³ ´ Let us now define the matrix e _ 1 − / P (θ0)=F (θ0) J (θ0) F (θ0) > 0

_ h i _ 1 − as J (θ0) > 0 and F (θ0) has rank r. Now multiplying (12.7) by F (θ0) J (θ0) and add (12.8) we get: h i

_ 1 s (θ ) λ F (θ ) J (θ ) − 0 + P (θ ) = o (1) . 0 0 √n 0 √n p h i e Hence, as P (θ0) > 0 we have that

_ 1 λ 1 − s (θ0) = [P (θ0)]− F (θ0) J (θ0) + op (1) . (12.9) √n − √n e h i Furthermore, substituting into (12.7) we get

_ 1 _ 1 − / 1 − s (θ0) √n θ θ0 = Ik J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) + op (1) . − − √n ³ ´ ½ h i ¾ h i (12.10) e NowwecanstatethefollowingTheorem:

Theorem 42 Under the assumptions A4., A5., model identification and that the true parameter values satisfy the constrains, we have

λ d 1 N 0, [P (θ0)]− √n → e and ¡ ¢ _ 1 d − √n θ θ0 N 0, J (θ0) A − → − ³ ´ µ h i ¶ where e _ 1 _ 1 − / 1 − A = J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) > 0. h i h i 112 Restricted Maximum Likelihood Estimation

Proof: From assumption A4. we have that

_ _ 1/2 1/2 d − s (θ0) d n− s (θ) N 0, J (θ) J (θ0) N (0,Ik) . → ⇒ √n → ³ ´ h i Hence from (12.9) we have

_ 1/2 _ 1/2 λ 1 − − s (θ0) = [P (θ0)]− F (θ0) J (θ0) J (θ0) + op (1) . √n − √n e h i h i Hence λ d N (0, Ω1) √n → where e

_ 1/2 _ 1/2 / 1 − 1 − 1 Ω1 = [P (θ0)]− F (θ0) J (θ0) [P (θ0)]− F (θ0) J (θ0) =[P (θ0)]− . − − ½ h i ¾½ h i ¾ Furthermore from (12.10) we have

_ 1 _ 1/2 − / 1 − √n θ θ0 = Ik J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) − − ½ ¾ ³ ´ _ h 1/2i h i − s (θ0) e J (θ0) + op (1) . × √n h i Hence d √n θ θ0 N (0, Ω2) − → ³ ´ where e _ 1 _ 1/2 − / 1 − Ω2 = Ik J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) − ½½ h i ¾ h i ¾ _ 1 _ 1/2 / − / 1 − Ik J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) × − _½½ 1 h _ i 1 ¾ h _ i 1 ¾ − − / 1 − = J (θ0) J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) − _ 1 _ 1 h i − h / i 1 − h i J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) − _ 1 _ 1 _ 1 h −i / 1 h −i / 1 − J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) _ 1 _ 1 _ 1 h i− − / h 1 i − h i = J (θ0) J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) − h i h i h i which is the result. ¥ Hence we can reach the following Conclusion: Restricted Maximum Likelihood Estimation 113

Corollary 43 The Restricted MLE is at least as efficient as the MLE, i.e.

AsyV ar θ AsyV ar θ∧ . ≤ ³ ´ µ ¶ e 114 Restricted Maximum Likelihood Estimation Part V

Neyman or Ratio of the Likelihoods Tests

115

117

Let x =(x1,x2, ..., xn) bearandomsamplehavingadensityfunctiond (x, θ), where θ Θ Rk.Wewouldliketotestasimplehypothesis ∈ ⊂

H0 : θ = θ0 against a simple alternative

H1 : θ = θ1 Θ θ0 . ∈ − { } To simplify the notation we write

d0 (x)=d (x, θ0) ,d1 (x)=d (x, θ1) .

The Neyman Ratio is the statistic d (x) d (x, θ ) λ (x)= 1 = 1 . d0 (x) d (x, θ0) If S is the sample space of x,thentheNeymanRatioTestisdefined by the Rejection Region

R = x S : λ (x) cα { ∈ ≥ } where cα > 0 is a constant such that the Type I Error (Size) of the test is equal to α (0, 1/2), i.e. ∈ P (R θ = θ0)= d0 (x) dx = α, | ZR i.e. the probability of rejecting the null when it is correct. Notice that the Power of the above test, πR (α) is

πR (α)=P (R θ = θ1)= d1 (x) dx, | ZR i.e. the probability of rejecting the null when the alternative is correct.

Lemma 44 (Neyman-Pearson) Let A be the Rejection Region of the test of H0 with size less or equal to α,i.e.

P (A θ = θ0)= d0 (x) dx α. | ≤ ZA Then the Power of this test is less or equal to the Power of the Neyman Test, i.e.

P (A θ = θ1) P (R θ = θ1) . | ≤ | 118

Proof: The difference in the Powers of the 2 tests is

πR (α) πA (α)=P (R θ = θ1) P (A θ = θ1)= − | − |

d1 (x) dx d1 (x) dx = − ZR ZA

d1 (x) dx + d1 (x) dx d1 (x) dx d1 (x) dx = A R A R − R A − R A Z ∩ Z ∩ Z ∩ Z ∩

d1 (x) dx d1 (x) dx. A R − R A Z ∩ Z ∩ Form the definition of the Rejection Region R we have that

x Rd1 (x) cαd0 (x) ,and x Rd1 (x)

πR (α) πA (α) cαd0 (x) dx cαd0 (x) dx = − ≥ A R − R A Z ∩ Z ∩

cα d0 (x) dx + d0 (x) dx d0 (x) dx d0 (x) dx = A R A R − R A − A R µZ ∩ Z ∩ Z ∩ Z ∩ ¶

cα d0 (x) dx d0 (x) dx cα (α α)=0. − ≥ − µZR ZA ¶ Hence πR (α) πA (α). ≥ The above Lemma says that when we test a simple null versous a simple alternative the Neyman Test is the Most Powerful Test of all tests that have the same or smaller size.

However, the usual tests are of simple H0 versous composite alternative. In such cases the power of the test is, in general, a function of the parameters in the alternative, i.e. πA (α, θ) the power of the test is a function of the parameters as well.

Definition 45 Let A be the rejection region of a test. Then if πA (α, θ) α for all ≥ θ Θ the test is unbiased. ∈ The comparison of unbiased tests is very difficult as the one is more powerful in one region of the parametric space and the other in another. 119

Definition 46 If for any test, say B,wehavethatπA (α, θ) πB (α, θ) for all θ Θ ≥ ∈ the test A is called uniformly most powerful test.

Thereisonewaytofind uniformly most powerful tests or alternatively to compare the power of any given test. Assume that we have to test the null H0 : θ =0 versous the alternative H1 : θ =0.WecancalculatethepoweroftheNeymanRatio 6 TestofthesimplenullH0 : θ =0versous the simple alternative H1 : θ =1.and graphthepointinaπR (α, θ) θ diagram. We repeat for the simple alternatives − H1 : θ = 1, H1 : θ =2, H1 : θ = 2, etc. The line we get by joining all these points − − together is called the Envelope Power Function. As an immediate consequence of the Neyman-Pearson Lemma we have:

Theorem 47 Let πA (α, θ) be the power function of a test of size α of the null hy- pothesis H0 : θ = θ0 versous the alternative H1 : θ Θ θ0 .Ifeα (θ) is the ∈ − { } Envelope Power Function we have that

θ Θ πA (α, θ) eα (θ) . ∀ ∈ ≤ For hypothesis testing the Envelope Power Function is the analogue of the Cramer-Rao bound in estimation. It is obvious that if the power function of a test is identical to the Envelope Power Function then this test is uniformly most powerful in the class of unbiased tests. Hence we have the following Corollary

Corollary 48 If the Rejection Region of a Neyman Ratio test of a simple null H0 :

θ = θ0 versous the alternative H1 : θ Θ1 does not depend on θ1,arandompointis ∈ Θ1, then the Neyman test is uniformly most powerful test.

The proof is based on the fact that if the rejection region R does not depend on θ1, the power of the test is identical to the Envelope Power Function.

2 Example 49 Let x =(x1,x2, ..., xn) be a random sample from a N (μ, σ ) with un- 2 known μ and known σ . WewanttotestthehypothesisH0 : μ = μ0 versous the 120

alternative H1 : μ>μ0. For any μ1 >μthe Likelihood Functions for μj (j =0, 1) is given by n n/2 1 2 d (θ)= 2πσ2 − exp x μ j 2σ2 i j "− i=1 − # X and the ratio of the likelihoods¡ is ¢ ¡ ¢

1 n n λ (x)=d (x) /d (x)=exp (μ μ ) x μ2 μ2 . 1 0 σ2 1 0 i 2σ2 1 0 " − i=1 − − # X ¡ ¢ Hence λ (x) cα is equivalent to ≥ 1 n n (μ μ ) x μ2 μ2 ln c σ2 1 0 i 2σ2 1 0 α − i=1 − − ≥ ⇔ X ¡ ¢ n σ2 n (μ + μ ) σ2 ln c μ + μ x ln c + 1 0 x c/ = α + 1 0 i μ μ α 2 α n (μ μ ) 2 i=1 ≥ 1 0 ⇔ ≥ 1 0 X − − / where the constants cα and cα are determined by the size of the test, which should be α, i.e. P x c/ μ = μ = α. ≥ α| 0 2 ¡ σ ¢x μ0 Under H0, we have that x v N μ0, n and √n −σ v N (0, 1). The above probabil- ity is equivalent to ³ ´

/ x μ0 cα μ0 P √n − √n − μ = μ0 = α. Ã σ ≥ σ | !

Let zα be the critical value that leaves α% in the tail of a N (0, 1) distribution. Hence

x μ0 P √n − zα μ = μ = α. σ ≥ | 0 µ ¶ Hencewehaveshownthat

x μ0 λ (x) cα √n − zα. ≥ ⇔ σ ≥ The Rejection Region of the Ratio of the Neyman Test is

x μ0 R = x S : λ (x) cα = x S : √n − zα { ∈ ≥ } ∈ σ ≥ ½ ¾ which is independent of μ1. Consequently, the test is Uniformly Most Powerful. 121

The Neyman Ratio Test is generalised and for the testing of a composite null versous composite alternative. Let x =(x1,x2, ..., xn) be a sample with Likelihood Function

k d (x; θ) ,θΘ R , ∈ ⊂ and Ω Rν (ν

H0 : θ Ω ∈ versous the alternative

H1 : θ Θ Ω. ∈ − The Neyman Ratio is the function

supθ Θ d (x; θ) λ (x)= ∈ supθ Ω d (x; θ) ∈ and the Neyman Ratio Test is definedbytheRejectionRegionofH0

R = x S : λ (x) cα , { ∈ ≥ } where cα > 0 is a constant such that the probability of Type I Error is α, i.e.

sup P (R θ Ω) =sup d (x; θ) dx = α. θ Ω { | ∈ } θ Ω R ∈ ∈ Z 2 Example 50 Let x =(x1,x2, ..., xn) be a random sample from a N (μ, σ ) with un- 2 known μ and σ . We want to test the hypothesis H0 : μ = μ0 versous the alternative

H1 : μ = μ . The vector of parameters is 6 0

/ 2 θ = μ, σ Θ = R R+. ∈ × ¡ ¢ Under the null the parameter sample is Ω = μ0 R+. Consequently we have that × H0 : θ Ω versous H1 : θ Θ Ω.Foranyθ the likelihood function is ∈ ∈ − n n/2 1 2 d (x; θ)= 2πσ2 − exp (x μ) 2σ2 i "− i=1 − # ¡ ¢ X 122

2 However, the maximum of supθ Ω [d (x; θ)] = supσ2 [d (x; μ0,σ )].Themaximumof ∈ 2 d (x; μ0,σ ) is at 1 n σ2 = s2 = (x μ )2 0 n i 0 i=1 − X Hence 2 n/2 n sup [d (x; θ)] = 2πs0 − exp . θ Ω − 2 ∈ h i Withthesamereasoningwefind that ¡ ¢ n 2 n/2 n 2 1 2 sup [d (x; θ)] = 2πs − exp ,s= (xi x) . θ Θ − 2 n − ∈ i=1 ¡ ¢ h i X Hence the Ratio of the Likelihoods is

supθ Θ Ω [d (x; θ)] 2 2 n/2 λ (x)= ∈ − = s /s0 − , supθ Ω [d (x; θ)] ∈ ¡ ¢ and the Rejection Region of the test is

2 2 2 2 n/2 s0 s 2/n R = x S : s /s cα = x S : − c 1 . ∈ 0 ≥ ∈ s2 ≥ α − ½ ¾ n ¡ ¢ o Notice that

2 2 n 2 n 2 2 s0 s i=1 (xi μ0) i=1 (xi x) n (x μ0) −2 = − n − 2 − = n − 2 => s (xi x) (xi x) P i=1 −P i=1 − 2 2 2 n s0 s n (x μ0) ∧2 1 2 − = − P,whereσ = P(xi x) s2 n 1 − (n 1) σ∧2 − i=1 − X Hence n (x μ )2 R = x S : − 0 (n 1) c2/n 1 . ∈ ≥ − α − ( σ∧2 ) ¡ ¢ It is easy to show that under H0 the ratio 2 n (x μ0) F = − F1,n 1. v − σ∧2 Hence the null is rejected when 2 n (x μ0) F = − fα ≥ σ∧2 where fα the α% critical value of an F1,n 1. − Chapter 13

χ-SQUARE TESTS

Let θ Θ Rk be a vector of parameters and Ω Rν (ν

√n θ∧ θ d N (0,V) , − −→ µ ¶ where V is a known or consistently estimated variance matrix.

We want to test H0 : ϕ (θ)=0versous H1 : ϕ (θ) =0. Assume that the r k 6 × matrix ∂ϕ ∂ϕi F (θ)= / = ,i=1, ..., r, j =1, 2, ..k ∂θ ∂θj ½ ¾ has full row rank, i.e. rank [F (θ)] = r.

This is fulfilled if there are no redundant restrictions.

Theorem 51 For a Consistent Asymptotically Normal (CAN) estimator θ∧,assuming that rank [F (θ)] = r and under H0 : ϕ (θ)=0,wehavethat

/ 1 d nϕ θ∧ FVF/ − ϕ θ∧ χ2 −→ r µ ¶ µ ¶ ¡ ¢ 124 χ-Square Tests

d Proof. From the Delta Method we have that if √n θ∧ θ X then − −→ µ ¶ ∧ d ∧ d √n ϕ θ ϕ (θ) F (θ) X. But under H0 ϕ (θ)=0. Hence √nϕ θ − −→ · −→ µ µ ¶ ¶ / µ ¶ d 1 d F (θ) X √nϕ θ∧ N 0,FVF/ . It follows that nϕ θ∧ FVF/ − ϕ θ∧ · ⇒ −→ −→ 2 µ ¶ / µ ¶ µ ¶ χr where r = rank FVF ¡= r. In case¢ that the matrices F and¡ V are¢ functions of the unknown parameter¡ vector¢ θ∧, can be substituted by consistent estimators, e.g. F θ∧ and V θ∧ . µ ¶The χ2µtest¶ does not necessarily have the optimal properties of the test of the likelihood ratio, however, it has the great advantage that we do not have to know the exact distribution of the sample. What is necessary is a CAN estimator. Furthermore, it is fairly easy to construct and evaluate such a test. Consequently, it is not a surprise that χ2 tests are the most popular tests in statistics. Chapter 14

THE CLASSICAL TESTS

Let the null hypothesis be represented by

Ω = θ Θ : ϕ (θ)=0 { ∈ } where θ is the vector of parameters and ϕ (θ)=0are the restrictions. Consequently the Neyman ratio test is given by:

L θ∧ supθ Θ L (θ) λ (x)= ∈ = µ ¶ supθ Ω L (θ) ∈ L θ ³ ´ where L (θ) is the Likelihood function. As now the ln ( ) is a monotonic, strictly e · increasing, an equivalent test can be based on

LR =2ln(λ (x)) = 2 θ∧ θ − ∙ µ ¶ ³ ´¸ where where LR is the well known Likelihood Ratio test ande (θ) is the log-likelihood function. Using a Taylor expansion of θ around θ∧ andemployingtheMeanValue Theorem we get: ³ ´ e ∧ ∂ θ / 2 ∧ ∧ 1 ∧ ∂ (θ ) ∧ θ = θ + µ ¶ θ θ + θ θ ∗ θ θ ∂θ/ − 2 − ∂θ∂θ/ − ³ ´ µ ¶ µ ¶ µ ¶ µ ¶ e ∂e ∧θ e e ∧ ∧ ∧ where θ θ θ θ .Now, µ /¶ = s θ =0due to the fact the the first ∗ − ≤ − ∂θ ° ° ° ° µ ¶ ° ° ° ° order conditions° ° are°e satis°fied by the ML estimator θ∧. Consequently, the LR test is ° ° ° ° 126 The Classical Tests given by:

/ 2 ∧ ∧ ∂ (θ ) ∧ LR =2 θ θ = θ θ ∗ θ θ . − − − ∂θ∂θ/ − ∙ µ ¶ ³ ´¸ µ ¶ µ ¶ Nowweknowthat e e e

_ 1 _ 1 − / 1 − s (θ0) √n θ θ0 = Ik J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) + op (1) − − √n ³ ´ ½ h i ¾ h i and e _ 1 ∧ − s (θ0) √n θ θ0 = J (θ0) + op (1) . − √n µ ¶ h i Hence

_ 1 _ 1 ∧ − / 1 − s (θ0) √n θ θ = J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) + op (1) − − √n µ ¶ h i h i and consequentlye

_ 1 _ 1 / 2 − / 1 − s (θ0) 1 ∂ (θ ) LR = J (θ0) F (θ0)[P (θ0)]− F (θ0) J (θ0) ∗ √n −n ∂θ∂θ/ µ ¶ µ ¶ _h i1 _h i1 − / 1 − s (θ0) J (θ ) F (θ )[P (θ )]− F (θ ) J (θ ) + o (1) . 0 0 0 0 0 √n p h i h i Now from assumption A5. we have

_ 1 n− H (θ)= J (θ)+op (1) , − _ 1 − / θ = θ0 + op (1) and P (θ0)=F (θ0) J (θ0) F (θ0) . ∗ h i Hence

/ _ 1 _ 1 s (θ0) − / 1 − s (θ0) LR = J (θ ) F (θ )[P (θ )]− F (θ ) J (θ ) + o (1) . √n 0 0 0 0 0 √n p µ ¶ h i h i We can now state the following Theorem:

Theorem 52 Under the usual assumptions and under the null Hypothesis we have that LR =2 θ∧ θ d χ2. − → r ∙ µ ¶ ³ ´¸ e The Classical Tests 127

Proof: The Likelihood Ratio is written as

1 / / − / LR =(ξ0) Z0 Z0 Z0 Z0 ξ0 + op (1) h i where _ 1/2 s (θ ) _ 1/2 J (θ ) − 0 = ξ ,andZ= J (θ ) − F / (θ ) . 0 √n 0 0 0 0 h i h i Now _ 1/2 − s (θ0) d J (θ0) N (0,Ik) √n → 1 h i / − / and Z0 Z0 Z0 Z0 is symmetric idempotent. Hence h i 1 1 1 / − / / − / / − / r Z0 Z0 Z0 Z0 = tr Z0 Z0 Z0 Z0 = tr Z0 Z0 Z0 Z0 = tr (Ir)=r. µ h i ¶ µ h i ¶ µh i ¶ Consequently, we get the result. ¥ The Wald test is based on the idea that if the restrictions are correct the vector ϕ θ∧ should be close to zero. µ ¶ ∧ Expanding ϕ θ around ϕ (θ0) we get: µ ¶

∧ ∂ϕ(θ ) ∧ ∧ ϕ θ = ϕ (θ0)+ ∗ θ θ0 = F (θ ) θ θ0 ∂θ/ − ∗ − µ ¶ µ ¶ µ ¶ as under the null ϕ (θ0)=0. Hence

∧ ∧ √nϕ θ = √nF (θ ) θ θ0 ∗ − µ ¶ µ ¶ and consequently

∧ ∧ √nϕ θ = √nF (θ0) θ θ0 + op (1) . − µ ¶ µ ¶ Furthermore recall that

_ 1 ∧ d − √n θ θ0 N 0, J (θ0) . − → µ ¶ ∙ ³ ´ ¸ Hence, _ 1 ∧ d − / √nϕ θ N 0,F (θ0) J (θ0) F (θ0) . → µ ¶ ∙ ³ ´ ¸ 128 The Classical Tests

Let us now consider the following quadratic:

/ _ 1 1 ∧ − / − ∧ n ϕ θ F (θ0) J (θ0) F (θ0) ϕ θ , ∙ µ ¶¸ ∙ ³ ´ ¸ µ ¶ which is the square of the Mahalanobis distance of the √nϕ θ∧ vector. However the abovequantitycannotbeconsideredasastatisticasitisafunctionoftheunknownµ ¶ parameter θ0. The Wald test is given by the above quantity if the unknown vector of ∧ parameters θ0 is substituted by the ML estimator θ,i.e.

/ 1 1 _ − − W = ϕ θ∧ F θ∧ nJ θ∧ F / θ∧ ϕ θ∧ ∙ µ ¶¸ " µ ¶µ µ ¶¶ µ ¶# µ ¶ / 1 1 − − = ϕ θ∧ F θ∧ J θ∧ F / θ∧ ϕ θ∧ , ∙ µ ¶¸ " µ ¶µ µ ¶¶ µ ¶# µ ¶ where J θ∧ is the estimated information matrix. In case that J θ∧ does not have an explicitµ formula¶ it can be substituted by a consistent estimator,µ e.g.¶ by

2 ∧ n ∂ θ ∧ J = µ /¶ − i=1 ∂θ∂θ X or by the asymptotically equivalent

∧ ∧ n ∂ θ ∂ θ J∧ = µ ¶ µ ¶. ∂θ / i=1 ∂θ X HencetheWaldstatisticisgivenby

/ 1 1 − − W = ϕ θ∧ F θ∧ J∧ F / θ∧ ϕ θ∧ . ∙ µ ¶¸ " µ ¶µ ¶ µ ¶# µ ¶ NowwecanprovethefollowingTheorem:

Theorem 53 Under the usual regularity assumptions and the null hypothesis we that

/ 1 1 − − W = ϕ θ∧ F θ∧ J∧ F / θ∧ ϕ θ∧ d χ2. → r ∙ µ ¶¸ " µ ¶µ ¶ µ ¶# µ ¶ The Classical Tests 129

Proof: For any consistent estimator of θ0 we have that

1 _ 1 ∧ ∧ − / ∧ − / F θ J F θ = F (θ0) nJ (θ0) F (θ0)+op (1) . µ ¶µ ¶ µ ¶ ³ ´ Hence / _ 1 1 ∧ − / − ∧ W = n ϕ θ F (θ0) J (θ0) F (θ0) ϕ θ + op (1) . ∙ µ ¶¸ ∙ ³ ´ ¸ µ ¶ Furthermore, _ 1 ∧ d − / √nϕ θ N 0,F (θ0) J (θ0) F (θ0) , → µ ¶ ∙ ³ ´ ¸ and the result follows. ¥ TheLagrangeMultiplier(LM) test considers the distance from zero of the estimated Lagrange Multipliers. Recall that

λ d 1 N 0, [P (θ0)]− . √n → e ¡ ¢ Consequently, the square Mahalanobis distance is

/ / _ 1 λ λ − / P (θ0) = λ F (θ0) nJ (θ0) F (θ0) λ . Ã√n! Ã√n! e e ³ ´ h i ³ ´ e e Again, the above quantity is not a statistic as it is a function of the unknown parameters θ0. However, we can employ the restricted ML estimates of θ0 to find the the unknown , i.e. F = F θ and J = J θ . Hence we can prove the following: ³ ´ ³ ´ e e e e

Theorem 54 Under the usual regularity assumptions and the null hypothesis we have

/ 1 LM = λ F J − F / λ d χ2. → r ³ ´ h i ³ ´ e e e e e Proof: Again we have that for any consistent estimator of θ0, as is the restricted MLE θ,wehavethat

/ / 1 e − / λ λ LM = λ F J F λ = P (θ0) + op (1) Ã√n! Ã√n! ³ ´ h i ³ ´ e e e e e e e 130 The Classical Tests and by the asymptotic distribution of the Lagrange Multipliers we get the result. ¥

NowwehavethattheRestrictedMLEsatisfythefirst order conditions of the Lagrangian, i.e. s θ + F / θ λ =0. ³ ´ ³ ´ Consequently the LM test can bee expressede as:e

/ 1 LM = s θ J − s θ . ³ ³ ´´ h i ³ ´ Now Rao has suggested to find the scoree vectore ande the information matrix of the unrestricted model and evaluate them at the restricted MLE. Under this form the LM statistic is called efficient score statistic as it measures the distance of the score vector, evaluated at the restricted MLE, from zero.

14.1 The Linear Regression

Let us consider the classical linear regression model:

2 y = Xβ + u, u X N 0,σ In | v ¡ ¢ where y is the n 1 vector of endogenous variables, X is the n k matrix of weakly × × exogenous explanatory variables, β is the k 1 vector of mean parameters and u is × the n 1 vector of errors. Let us call the vector of parameters θ, i.e. θ/ = β/,σ2 × a (k +1) 1 vector. The log-likelihood function is: ³ ´ × n n 1 (y Xβ)/ (y Xβ) (θ)= ln (2π) ln σ2 − − . − 2 − 2 − 2 σ2 ¡ ¢ The first order conditions are:

∂(θ) X/ (y Xβ) = − =0 ∂β σ2 and ∂(θ) n 1 (y Xβ)/ (y Xβ) = + − − =0. ∂σ2 −2σ2 2 σ4 The Linear Regression 131

Solving the equations we get:

1 β∧ = X/X − X/y / ∧ ¡u∧ u∧ ¢ σ2 = , u∧ = y Xβ.∧ n − Notice that the MLE of β is the same as OLS estimator. Something which is not true for the MLE of σ2. The Hessian is

2 2 2 ∂ (θ) ∂ (θ) 1 / 1 / ∂ (θ) / 2 2 X X 4 X u H (θ)= = ∂β∂β ∂β∂σ = − σ − 2σ . / 2 2 / ∂θ∂θ ⎛ ∂ (θ) ∂ (θ) ⎞ ⎛ 1 u/X n u u ⎞ ∂σ2∂β/ ∂(σ2)2 − 2σ4 2σ4 − σ6 ⎝ ⎠ ⎝ ⎠ Hence the Information matrix is

1 X/X 0 J (θ)=E [ H (θ)] = σ2 , − ⎛ n ⎞ 0 2σ4 ⎝ ⎠ and the Cramer-Rao limit

1 2 / − 1 σ X X 0 J− (θ)= . ⎛ 2σ4 ⎞ ¡ 0 ¢ n ⎝ ⎠ Notice that under normality, of the errors, the OLS estimator is asymptotically effi- cient. Let us now consider r linear constrains on the parameter vector β, i.e.

ϕ (β)=Qβ q =0 (14.1) − where Q is the r k matrix of the restrictions (with r

L = (θ)+λ/ϕ (β)= (θ)+ϕ/ (β) λ = (θ)+(Qβ q)/ λ, − where λ is the vector of the r Lagrange Multipliers. The first order conditions are:

∂L ∂(θ) X/ (y Xβ) = + Q/λ = − + Q/λ =0 (14.2) ∂β ∂β σ2 132 The Classical Tests

∂L ∂(θ) n 1 (y Xβ)/ (y Xβ) = = + − − =0 (14.3) ∂σ2 ∂σ2 −2σ2 2 σ4 and ∂L = Qβ q =0. (14.4) ∂λ − Nowfrom(14.2)wehavethat

X/y = X/Xβ σ2Q/λ (14.5) − and it follows that

1 1 1 Q X/X − X/y = Q X/X − X/Xβ σ2Q X/X − Q/λ. − ¡ ¢ ¡ ¢ ¡ ¢ Hence 1 1 Q X/X − X/y = Qβ σ2Q X/X − Q/λ. − It follows that ¡ ¢ ¡ ¢ Qβ QV Q/λ = Qβ,∧ − where 1 1 β∧ = X/X − X/yandV= σ2 X/X − .

Nowfrom(14.4)wehavethat¡ ¢Qβ = q. Hence we get ¡ ¢

1 λ = QV Q/ − Qβ∧ q . (14.6) − − µ ¶ £ ¤ Substituting out λ from (14.5) employing the above and solving for β we get:

1 1 1 β = β∧ X/X − Q/ Q X/X − Q/ − Qβ∧ q . − − µ ¶ ¡ ¢ h ¡ ¢ i Solving (14.3) wee get that

(u)/ u σ2 = , u = y Xβ, n − e e and from (14.6) we get: e e e

1 1 λ = QVQ/ − Qβ∧ q , V = σ2 X/X − . − − µ ¶ h i ¡ ¢ e e e e The Linear Regression 133

Theabove3formulaegivetherestrictedMLEs. Now the Wald test for the linear restrictions in (14.1) is given by

/ 1 − W = Qβ∧ q QVQ∧ / Qβ∧ q . − − µ ¶ ∙ ¸ µ ¶ The restricted and unrestricted residuals are given by

u = y Xβ, and u∧ = y Xβ.∧ − −

Hence e e

u = u∧ + X β∧ β − µ ¶ and consequently, if X/u∧ =0, i.e.e the regression hase a constant we have that

/ / u/u = u∧ u∧ + β∧ β X/X β∧ β . − − µ ¶ µ ¶ It follows that e e e e

/ / 1 1 u/u u∧ u∧ = Qβ∧ q Q X/X − Q/ − Qβ∧ q . − − − µ ¶ µ ¶ h ¡ ¢ i Hence the Walde e test is given by

/ u/u u∧ u∧ W = n −/ . u∧ u∧ e e The LR test is given by

/ ∧ u u LR =2 θ θ = n ln / − Ã ∧ ∧ ! ∙ µ ¶ ³ ´¸ u u e e e and the LM test is / u/u u∧ u∧ LM = n − u/u as e e / 1 e e / 1 LM = λ F J − F / λ = Qβ∧ q QVQ/ − Qβ∧ q . − − ³ ´ h i ³ ´ µ ¶ h i µ ¶ We can nowe statee e a welle knowne result. e 134 The Classical Tests

Theorem 55 Under the classical assumptions of the Linear Regression Model we have that W LR LM. ≥ ≥

Proof: The three test can be written as

1 W = n (r 1) ,LR= n ln (r) ,LM= n 1 , − − r µ ¶

u/u x 1 where r = / 1. Now we know that ln (x) −x and the result follows by ∧u ∧u ≥ ≥ considering xe =e r and x =1/r.

14.2 Autocorrelation

Apply the LM test to test the hypothesis that ρ =0in the following model

/ i.i.d. 2 yt = xt β + ut,ut = ρut 1 + εt,εt N 0,σ . − v ¡ ¢ Discuss the advantages of this LM test over the Wald and LR tests of this hypothesis.

Firstnoticethatfromut = ρut 1 + εt we get that −

E (ut)=ρE (ut 1)+E (εt)=ρE (ut 1) − − as E (εt)=0and for ρ < 1 we get that | |

E (ut) ρE (ut 1)=0 E (ut)=0 − − ⇒ as E (ut)=E (ut 1) independent of t.Furthermore −

2 2 2 2 2 2 2 Var(ut)=E ut = ρ E ut 1 + E εt +2ρE (ut 1εt)=ρ E ut 1 + σ − − − ¡ ¢ ¡ ¢ ¡ ¢ ¡ ¢ as the first equality follows from the fact that E (ut)=0, and the last from the fact that

E (ut 1εt)=E [ut 1E (εt It 1)] = E [ut 10] = 0 − − | − − Autocorrelation 135 where It 1 the information set at time t 1, i.e. the sigma-field generated by − − εt 1,εt 2, ... . Hence { − − } 2 2 2 2 2 2 σ E ut ρ E ut 1 = σ E ut = 2 − − ⇒ 1 ρ − 2 2 ¡ ¢ ¡ ¢ ¡ ¢ as E (ut )=E ut 1 independent of t. − Substituting¡ ¢ out ut we get

/ yt = xt β + ρut 1 + εt, −

/ and observing that ut 1 = yt 1 xt 1β we get − − − −

/ / / / yt = xt β + ρ yt 1 xt 1β + εt εt = yt xt β ρyt 1 + xt 1βρ − − − ⇒ − − − − ³ ´ where by assumption the εt0 s are i.i.d. Hence the log-likelihood function is

2 T / / T T yt xt β ρyt 1 + xt 1βρ l (θ)= ln (2π) ln σ2 − − − − , 2 2 ³ 2σ2 ´ − − − t=1 ¡ ¢ X whereweassumethaty 1 =0,andx 1 =0. as we do not have any observations − − for t = 1. Inanycase,giventhat ρ < 1,thefirstobservationwillnotaffect the − | | distribution LM test, as it is based in asymptotic theory, i.e. T .Thefirst order →∞ conditions are:

T / / ∂l yt xt β ρyt 1 + xt 1βρ (xt xt 1ρ) = − − − − − − ∂β ³ σ2 ´ t=1 X T / / / T yt xt β ρyt 1 + xt 1βρ yt 1 xt 1β ∂l − − εtut 1 = − − − − − = − , ∂ρ ³ σ2 ´³ ´ σ2 t=1 t=1 X X 2 T / / T ∂l T yt xt β ρyt 1 + xt 1βρ T ε2 = + − − − − = + t . ∂σ2 2σ2 ³ 2σ4 ´ 2σ2 2σ4 − t=1 − t=1 X X The second derivatives are:

T / / ∂2l (xt xt 1ρ) xt xt 1ρ = − − − − / σ³2 ´ ∂β∂β − t=1 X 136 The Classical Tests

2 T / T 2 yt 1 xt 1β u2 ∂ l − − t 1 = − = − , ∂ρ2 ³ σ2 ´ σ2 − t=1 − t=1 X X 2 T / / T ∂2l T yt xt β ρyt 1 + xt 1βρ T ε2 = − − − − = t . 2 2 2σ4 ³ σ6 ´ 2σ4 σ6 ∂ (σ ) − t=1 − t=1 X X

T / / / / / / ∂2l yt 1 xt 1β xt xt 1ρ + yt xt β ρyt 1 + xt 1βρ xt 1 = − − − − − − − − − − = ∂β∂ρ ³ ´³ ´ ³σ2 ´³ ´ − t=1 X T / / / ut 1 xt xt 1ρ + εt xt 1 = − − − − ³ σ2 ´ ³ ´ − t=1 X

T / / / T 2 yt xt β ρyt 1 + xt 1βρ yt 1 xt 1β ∂ l − − εtut 1 = − − − − − = − , ∂ρ∂σ2 ³ σ4 ´³ ´ σ4 − t=1 − t=1 X X T / / / / T / / ∂2l yt xt β ρyt 1 + xt 1βρ xt xt 1ρ εt xt xt 1ρ = − − − − − − = − − ∂β∂σ2 ³ σ4 ´³ ´ ³ σ4 ´ − t=1 − t=1 X X Notice now that the Information Matrix J is

/ / T (xt xt 1ρ) xt xt 1ρ − − − − t=1 σ³2 ´ 00 J (θ)= E [H (θ)] = ⎡ T ⎤ P 0 1 ρ2 0 − − ⎢ T ⎥ ⎢ 004 ⎥ ⎢ 2σ ⎥ ⎣ ⎦ / / / / / ut 1 xt xt 1ρ +εt xt 1 εt xt xt 1ρ T ε u ε2 − − − − − − t t 1 t as E ³ σ2 ´ ³ ´ =0, E ³ σ4 ´ =0, E t=1 σ4− =0, E σ6 =

∙ 2 2 ¸ ∙ ¸ 1 ut 1 E(ut 1) 1 h i h i2 − P σ4 , E σ−2 = σ2 = 1 ρ2 , i.e. the matrix is block diagonal between β, ρ,andσ . − Consequentlyh i the LM test has the form

2 / 1 (sρ) LM = sρJρρ− sρ = Jρρ

T εtut 1 T as sρ = t=1 σ2− , Jρρ = 1 ρ2 . All these quantities evaluated under the null. − HenceP under H0 : ρ =0we have that

Jρρ = T, and ut = εt Autocorrelation 137 i.e. there is no autocorrelation. Consequently, we can estimate β by simple OLS, as OLS and ML result in the same estimators and σ2 by the ML estimator, i.e.

1 T − T u/u T u2 β = x x/ x y ,andσ2 = = t=1 t , t t t t T T Ã t=1 ! t=1 X X P e e e e e

/ where ut = yt x β = εt the OLS residuals. Hence − t

e e e

2 T utut 1 T 2 T 2 t=1 − − σ2 2 LM = = T utut 1 ut . ³ Te e ´ − P f à t=1 ! à t=1 ! X X e e e 138 The Classical Tests Book References 1. T. Amemiya: Advanced Econometrics. 2. E. Berndt: The Practice of Econometrics: Classic and Cotemporary 3. G. Box and G. Jenkins (1976) TimeSeries Analysis forecasting and Control. K. Cuthbertson, S.G. Hall and M. P. Taylor: Applied Econometric Techniques. 4. R. Davidson and J. MacKinnon: Econometric Theory and Methods. 5. C. Gourieroux and A. Monfort: Statistics and Econometric Models, Vol I and II. 6. W.H. Greene: Econometric Analysis. 7. J. Hamilton Time Series Analysis 8. A. Harvey: The Econometric Analysis of Time Series. 9. A. Harvey: Time Series Models. 10. J. Johnston: Econometric Methods. 11. G. Judge, R. Hill, H. Lutkepohl and T. Lee: Introduction to the Theory and Practice of Econometrics. 12. R. Pindyck and D. Rubinfeld: Econometric Models and Economic Fore- casts. 13. P. Ruud: An Introduction to Classical Econometric Theory. 14. R. Serfling: Approximation Theorems of . 15. H White: Asymptotic Theory for Econometricians.