
1 An overview of almost sure convergence

An attractive feature of almost sure convergence (which we define below) is that it often reduces a stochastic convergence problem to the convergence of deterministic sequences.

First we go through some definitions (these are not very formal). Let $\Omega_t$ be the set of all possible outcomes (or realisations) at time $t$, and define $Y_t$ as the function $Y_t : \Omega_t \to \mathbb{R}$. Define the set of possible outcomes over all time as $\Omega = \otimes_{t=1}^{\infty}\Omega_t$, and the random variables $X_t : \Omega \to \mathbb{R}$, where for every $\omega \in \Omega$, with $\omega = (\omega_1, \omega_2, \ldots)$, we have $X_t(\omega) = Y_t(\omega_t)$. Hence we have a sequence of random variables $\{X_t\}_t$ (which we call a random process). When we observe $\{x_t\}_t$, this means there exists an $\omega \in \Omega$ such that $X_t(\omega) = x_t$. To complete things we have a sigma-algebra $\mathcal{F}$ whose elements are subsets of $\Omega$ and a probability measure $P : \mathcal{F} \to [0,1]$. But we do not have to worry too much about this.

Definition 1 We say that the sequence $\{X_t\}$ converges almost surely to $\mu$ if there exists a set $M \subset \Omega$ such that $P(M) = 1$ and for every $\omega \in M$ we have $X_t(\omega) \to \mu$. In other words, for every $\varepsilon > 0$ there exists an $N(\omega)$ such that

\[ |X_t(\omega) - \mu| < \varepsilon, \qquad (1) \]
for all $t > N(\omega)$. We denote convergence of $X_t$ to $\mu$ almost surely by $X_t \stackrel{a.s.}{\to} \mu$. We see in (1) how the definition reduces to a nonrandom one. There is an equivalent definition, stated explicitly in terms of probabilities, which we do not give here.

The object of this handout is to give conditions under which an estimator $\hat{a}_n$ converges almost surely to the parameter we are interested in estimating, and to derive its limiting distribution.

2 Strong consistency of an estimator

2.1 The autoregressive process

A simple example of a sequence of dependent random variables $\{X_t\}$ is the autoregressive process. The AR(1) process satisfies the representation

\[ X_t = aX_{t-1} + \epsilon_t, \qquad (2) \]
where $\{\epsilon_t\}_t$ are iid random variables with $E(\epsilon_t) = 0$ and $E(\epsilon_t^2) = 1$. For $|a| < 1$ it has the unique causal solution
\[ X_t = \sum_{j=0}^{\infty} a^j \epsilon_{t-j}. \]
From the above we see that $E(X_t^2) = (1 - a^2)^{-1}$. Let $\sigma^2 := E(X_0^2) = (1 - a^2)^{-1}$.

Question What can we say about the limit of

\[ \hat{\sigma}_n^2 = \frac{1}{n}\sum_{t=1}^{n} X_t^2 \; ? \]

• If $E(\epsilon_t^4) < \infty$, then we can show that $E(\hat{\sigma}_n^2 - \sigma^2)^2 \to 0$ (Exercise).

• What can we say about almost sure convergence?
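The following is a small simulation sketch, not part of the handout's derivations; the parameter value $a = 0.7$, the sample sizes and the use of standard normal innovations are my own illustrative choices. It generates one realisation of the AR(1) process and tracks $\hat{\sigma}_n^2$ against $\sigma^2 = (1-a^2)^{-1}$.

```python
# Simulation sketch (illustrative parameters): one AR(1) path with iid N(0,1)
# innovations; the sample second moment should approach sigma^2 = (1 - a^2)^{-1}.
import numpy as np

rng = np.random.default_rng(0)
a = 0.7
n_max = 100_000

eps = rng.standard_normal(n_max)
X = np.zeros(n_max)
for t in range(1, n_max):
    X[t] = a * X[t - 1] + eps[t]              # AR(1) recursion X_t = a X_{t-1} + eps_t

sigma2 = 1.0 / (1.0 - a**2)                   # theoretical E(X_t^2)
for n in (100, 1_000, 10_000, 100_000):
    print(n, np.mean(X[:n] ** 2), sigma2)     # hat{sigma}_n^2 versus sigma^2
```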

2.2 The strong law of large numbers

Recall the strong law of large numbers: suppose $\{X_t\}_t$ is an iid sequence with $E(|X_0|) < \infty$; then by the SLLN we have that

\[ \frac{1}{n}\sum_{t=1}^{n} X_t \stackrel{a.s.}{\to} E(X_0); \]
for the proof see Grimmett and Stirzaker (1994).
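As a quick numerical illustration of the SLLN (my own toy example, not from the handout), the running mean of iid Exponential(1) variables settles at $E(X_0) = 1$:

```python
# SLLN illustration (my own toy example): running mean of iid Exponential(1) variables.
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=100_000)
running_mean = np.cumsum(X) / np.arange(1, X.size + 1)
for n in (100, 1_000, 10_000, 100_000):
    print(n, running_mean[n - 1])             # approaches E(X_0) = 1 almost surely
```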

• What happens when the random variables $\{X_t\}$ are dependent?

• In this case we use the notion of ergodicity to obtain a similar result.

2.3 Some ergodic tools

Suppose $\{Y_t\}$ is an ergodic sequence of random variables (which implies that $\{Y_t\}_t$ are identically distributed random variables). For the definition of ergodicity and a full treatment see, for example, Billingsley (1965).

• A simple example of an ergodic sequence is $\{Z_t\}$, where $\{Z_t\}_t$ are iid random variables.

Theorem 1 (One variation of the ergodic theorem) If $\{Y_t\}$ is an ergodic sequence and $E(|g(Y_t)|) < \infty$, then we have
\[ \frac{1}{n}\sum_{t=1}^{n} g(Y_t) \stackrel{a.s.}{\to} E(g(Y_0)). \]

We give some sufficient conditions for a process to be ergodic. The theorem below is a simplified version of Stout (1974), Theorem 3.5.8.

Theorem 2 Suppose $\{Z_t\}$ is an ergodic sequence (for example iid random variables) and $g : \mathbb{R}^{\infty} \to \mathbb{R}$ is a continuous function. Then the sequence $\{Y_t\}_t$, where
\[ Y_t = g(Z_t, Z_{t-1}, \ldots), \]
is an ergodic process.

(i) Example I Let us consider the sequence $\{X_t\}$ which satisfies the AR(1) representation. We will show, by using Theorem 2, that $\{X_t\}$ is an ergodic process. We know that $X_t$ has the unique (causal) solution

\[ X_t = \sum_{j=0}^{\infty} a^j \epsilon_{t-j}. \]
The solution motivates us to define the function
\[ g(x_0, x_1, \ldots) = \sum_{j=0}^{\infty} a^j x_j. \]
We have that
\[ |g(x_0, x_1, \ldots) - g(y_0, y_1, \ldots)| = \Big| \sum_{j=0}^{\infty} a^j x_j - \sum_{j=0}^{\infty} a^j y_j \Big| \le \sum_{j=0}^{\infty} |a|^j |x_j - y_j|. \]
Therefore if $\sup_j |x_j - y_j| \le (1 - |a|)\varepsilon$, then $|g(x_0, x_1, \ldots) - g(y_0, y_1, \ldots)| \le \varepsilon$. Hence the function $g$ is continuous (under the sup-norm), which implies, by using Theorem 2, that $\{X_t\}$ is an ergodic process.

Application Using the ergodic theorem we have that

\[ \frac{1}{n}\sum_{t=1}^{n} X_t^2 \stackrel{a.s.}{\to} \sigma^2. \]

(ii) Example II A process often used in finance is the ARCH process. $\{X_t\}$ is said to be an ARCH(1) process if it satisfies the representation

\[ X_t = \sigma_t Z_t, \qquad \sigma_t^2 = a_0 + a_1 X_{t-1}^2. \]

It almost surely has the solution
\[ X_t^2 = a_0 \sum_{j=0}^{\infty} a_1^j \prod_{i=0}^{j} Z_{t-i}^2. \]
Using the arguments above we can show that $\{X_t^2\}_t$ is an ergodic process.
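A small sketch of the ergodic theorem in action for the ARCH(1) process; the values $a_0 = 0.5$, $a_1 = 0.4$ and the use of Gaussian $Z_t$ are illustrative assumptions. Since $a_1 < 1$ the process has a finite second moment $E(X_t^2) = a_0/(1-a_1)$, and the ergodic average of $X_t^2$ should settle near this value.

```python
# ARCH(1) sketch (illustrative a0, a1 with a1 < 1 and Gaussian Z_t): the ergodic
# average of X_t^2 should settle near E(X_t^2) = a0 / (1 - a1).
import numpy as np

rng = np.random.default_rng(2)
a0, a1 = 0.5, 0.4
n = 200_000

Z = rng.standard_normal(n)
X2 = np.zeros(n)                              # stores X_t^2
x2_prev = 0.0
for t in range(n):
    sigma2_t = a0 + a1 * x2_prev              # conditional variance sigma_t^2
    X2[t] = sigma2_t * Z[t] ** 2              # X_t^2 = sigma_t^2 Z_t^2
    x2_prev = X2[t]

print(X2.mean(), a0 / (1 - a1))               # ergodic average vs. theoretical moment
```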

3 Likelihood estimation

Our object here is to evaluate the maximum likelihood estimator of the AR(1) parameter and to study its asymptotic properties. We recall that the maximum likelihood estimator is the parameter which maximises the joint density of the observations. Since the log-likelihood

often has a simpler form, we usually maximise the log density rather than the density itself (both yield the same estimator). Suppose we observe $\{X_t; t = 1, \ldots, n\}$, where the $X_t$ are observations from an AR(1) process. Let $F_{\epsilon}$ and $f_{\epsilon}$ be the distribution function and the density function of $\epsilon_t$ respectively. We first note that the AR(1) process is Markovian, that is

\[ P(X_t \le x \mid X_{t-1}, X_{t-2}, \ldots) = P(X_t \le x \mid X_{t-1}) \qquad (3) \]
\[ \Rightarrow \quad f_a(X_t \mid X_{t-1}, X_{t-2}, \ldots) = f_a(X_t \mid X_{t-1}). \]

• The Markov property says that the distribution of $X_t$ given the entire past is the same as the distribution of $X_t$ given $X_{t-1}$ alone.

• To prove (3) we see that
\begin{align*}
P(X_t \le x_t \mid X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}) &= P(aX_{t-1} + \epsilon_t \le x_t \mid X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}) \\
&= P(\epsilon_t \le x_t - ax_{t-1} \mid X_{t-2} = x_{t-2}) \\
&= P_{\epsilon}(\epsilon_t \le x_t - ax_{t-1}) = P(X_t \le x_t \mid X_{t-1} = x_{t-1}),
\end{align*}
hence the process satisfies the Markov property.

By using the above we have

\begin{align*}
P(X_t \le x \mid X_{t-1}) &= P_{\epsilon}(\epsilon_t \le x - aX_{t-1}) \\
\Rightarrow \quad P(X_t \le x \mid X_{t-1}) &= F_{\epsilon}(x - aX_{t-1}) \\
\Rightarrow \quad f_a(X_t \mid X_{t-1}) &= f_{\epsilon}(X_t - aX_{t-1}). \qquad (4)
\end{align*}

Evaluating the joint density and using (4) we see that it satisfies

\begin{align*}
f_a(X_1, X_2, \ldots, X_n) &= f_a(X_1)\prod_{t=2}^{n} f_a(X_t \mid X_{t-1}, \ldots, X_1) \quad \text{(by Bayes theorem)} \\
&= f_a(X_1)\prod_{t=2}^{n} f_a(X_t \mid X_{t-1}) \quad \text{(by the Markov property)} \\
&= f_a(X_1)\prod_{t=2}^{n} f_{\epsilon}(X_t - aX_{t-1}) \quad \text{(by (4))}.
\end{align*}

Therefore the log-likelihood is

\[ \log f_a(X_1, X_2, \ldots, X_n) = \underbrace{\log f(X_1)}_{\text{often ignored}} + \underbrace{\sum_{t=2}^{n} \log f_{\epsilon}(X_t - aX_{t-1})}_{\text{conditional likelihood}}. \]

Usually we ignore the initial distribution $f(X_1)$ and maximise the conditional likelihood to obtain the estimator.

Hence we use $\hat{a}_n$ as an estimator of $a$, where
\[ \hat{a}_n = \arg\max_{a \in \Theta} \sum_{t=2}^{n} \log f_{\epsilon}(X_t - aX_{t-1}), \]
and $\Theta$ is the parameter space over which we do the maximisation. We say that $\hat{a}_n$ is the conditional likelihood estimator.

3.1 Likelihood function when the innovations are Gaussian

We now consider the special case where the innovations $\{\epsilon_t\}_t$ are standard Gaussian. In this case we have that
\[ \log f_{\epsilon}(X_t - aX_{t-1}) = -\frac{1}{2}\log 2\pi - \frac{1}{2}(X_t - aX_{t-1})^2. \]
Therefore if we let
\[ \mathcal{L}_n(a) = -\frac{1}{n-1}\sum_{t=2}^{n} (X_t - aX_{t-1})^2, \]
then, since $\frac{1}{2}\log 2\pi$ is a constant, maximising $\mathcal{L}_n(a)$ is equivalent to maximising $\sum_{t=2}^{n} \log f_{\epsilon}(X_t - aX_{t-1})$. Therefore the maximum likelihood estimator is
\[ \hat{a}_n = \arg\max_{a \in \Theta} \mathcal{L}_n(a). \]
In the derivations here we shall assume $\Theta = [-1, 1]$.
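As a sketch of how $\hat{a}_n$ could be computed in practice (the function names and the use of a generic bounded optimiser are my own choices, not prescribed by the handout), one can maximise $\mathcal{L}_n(a)$ numerically over $\Theta = [-1,1]$:

```python
# Numerical maximisation sketch: maximise L_n(a) = -(1/(n-1)) sum (X_t - a X_{t-1})^2
# over Theta = [-1, 1] with a generic bounded scalar optimiser.
import numpy as np
from scipy.optimize import minimize_scalar

def L_n(a, X):
    """Gaussian conditional log-likelihood criterion, up to constants."""
    resid = X[1:] - a * X[:-1]
    return -np.mean(resid ** 2)

def fit_ar1(X):
    """Return hat{a}_n = argmax_{a in [-1,1]} L_n(a), found numerically."""
    res = minimize_scalar(lambda a: -L_n(a, X), bounds=(-1.0, 1.0), method="bounded")
    return res.x

# Usage on a simulated AR(1) path:
rng = np.random.default_rng(3)
a_true, n = 0.7, 5_000
eps = rng.standard_normal(n)
X = np.zeros(n)
for t in range(1, n):
    X[t] = a_true * X[t - 1] + eps[t]
print(fit_ar1(X))                             # should be close to a_true
```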

Definition 2 An estimator $\hat{\alpha}_n$ is said to be a strongly consistent estimator of $\alpha$ if there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for all $\omega \in M$ we have
\[ \hat{\alpha}_n(\omega) \to \alpha. \]

Question: Is the likelihood estimator $\hat{a}_n$ strongly consistent?

• In the case of least squares for AR processes, $\hat{a}_n$ has the explicit form
\[ \hat{a}_n = \frac{\frac{1}{n-1}\sum_{t=2}^{n} X_t X_{t-1}}{\frac{1}{n-1}\sum_{t=1}^{n-1} X_t^2}. \]
Now by just applying the ergodic theorem to the numerator and the denominator we obtain almost sure convergence of $\hat{a}_n$ (exercise); a numerical sketch follows this list.

• However, we will tackle the problem in a rather artificial way: we assume that $\hat{a}_n$ does not have an explicit form and instead that it is obtained by maximising $\mathcal{L}_n(a)$ using a numerical routine. In general this is the most common way of optimising a likelihood function (usually explicit solutions do not exist).

• In order to derive the sampling properties of $\hat{a}_n$ we need to study the likelihood function $\mathcal{L}_n(a)$ directly. We will do this now in the least squares case. Most of the analysis of $\mathcal{L}_n(a)$ involves repeated application of the ergodic theorem.

• The first clue to solving the problem is to let
\[ \ell_t(a) = -(X_t - aX_{t-1})^2. \]
By using Theorem 2 we have that $\{\ell_t(a)\}_t$ is an ergodic sequence. Therefore by using the ergodic theorem we have
\[ \mathcal{L}_n(a) = \frac{1}{n-1}\sum_{t=2}^{n} \ell_t(a) \stackrel{a.s.}{\to} E(\ell_0(a)). \]

• In other words, for every $a \in [-1,1]$ we have that $\mathcal{L}_n(a) \stackrel{a.s.}{\to} E(\ell_0(a))$. This is what we call almost sure pointwise convergence.
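The sketch below (with my own helper names and illustrative parameters) computes the explicit least-squares estimator from the first bullet and evaluates $\mathcal{L}_n(a)$ on a grid, comparing it with its ergodic limit $\tilde{\mathcal{L}}(a) = E(\ell_0(a)) = -\sigma^2(1 - 2 a a_0 + a^2)$ for a true parameter $a_0$; the maximum deviation over the grid shrinks as $n$ grows, previewing the uniform convergence discussed below.

```python
# Pointwise convergence sketch: compare L_n(a) on a grid with its ergodic limit
# L_tilde(a) = -sigma^2 (1 - 2 a a0 + a^2), and compute the explicit LS estimator.
import numpy as np

rng = np.random.default_rng(4)
a0 = 0.7                                      # true AR(1) parameter
sigma2 = 1.0 / (1.0 - a0**2)                  # E(X_t^2)

def simulate_ar1(n):
    eps = rng.standard_normal(n)
    X = np.zeros(n)
    for t in range(1, n):
        X[t] = a0 * X[t - 1] + eps[t]
    return X

def L_n(a, X):
    return -np.mean((X[1:] - a * X[:-1]) ** 2)

def L_tilde(a):
    return -sigma2 * (1.0 - 2.0 * a * a0 + a**2)   # E(l_0(a)) = -E(X_1 - a X_0)^2

grid = np.linspace(-1.0, 1.0, 201)
for n in (200, 2_000, 20_000):
    X = simulate_ar1(n)
    a_hat = np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)     # explicit LS estimator
    gap = max(abs(L_n(a, X) - L_tilde(a)) for a in grid)     # sup deviation on the grid
    print(n, a_hat, gap)
```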

Remark 1 It is interesting to note that the least squares/likelihood estimator $\hat{a}_n$ can also be used even if the innovations do not come from a Gaussian process.

3.2 Strong consistency of a general estimator

We now consider the general case where $\mathcal{B}_n(a)$ is a 'criterion' which we maximise (or minimise). We note that $\mathcal{B}_n(a)$ includes the likelihood function, the least squares criterion, etc. We suppose it has the form

\[ \mathcal{B}_n(a) = \frac{1}{n-1}\sum_{t=2}^{n} g_t(a), \qquad (5) \]
where for each $a \in [c,d]$, $\{g_t(a)\}_t$ is an ergodic sequence. Let
\[ \tilde{\mathcal{B}}(a) = E(g_t(a)); \qquad (6) \]
we assume that $\tilde{\mathcal{B}}(a)$ is continuous and has a unique maximum in $[c,d]$. We define the estimator $\hat{\alpha}_n$ where

\[ \hat{\alpha}_n = \arg\max_{a \in [c,d]} \mathcal{B}_n(a). \]

Object To show under what conditions $\hat{\alpha}_n \stackrel{a.s.}{\to} \alpha$, where $\alpha = \arg\max_{a} \tilde{\mathcal{B}}(a)$.

Now suppose that for each $a \in [c,d]$ we have $\mathcal{B}_n(a) \stackrel{a.s.}{\to} \tilde{\mathcal{B}}(a)$; this is called almost sure pointwise convergence. That is, for each $a \in [c,d]$ we can show that there exists a set $M_a \subset \Omega$ with $P(M_a) = 1$ such that for each $\omega \in M_a$ and every $\varepsilon > 0$,
\[ |\mathcal{B}_n(\omega, a) - \tilde{\mathcal{B}}(a)| \le \varepsilon, \]

for all $n > N_a(\omega)$. But $N_a(\omega)$ depends on $a$, hence the rate of convergence is not uniform (the rate depends on $a$).

Remark 2 We will assume that the set $M_a$ is common for all $a$, that is $M_a = M_{a'}$ for all $a, a' \in [c,d]$. We will prove this result in Lemma 1 under the assumption of equicontinuity (defined later); however, the same result can be shown under the weaker assumption that $\tilde{\mathcal{B}}(a)$ is continuous.

Returning to the estimator, by using the definitions of $\hat{\alpha}_n$ and $\alpha$ we observe that
\[ \mathcal{B}_n(\alpha) \le \mathcal{B}_n(\hat{\alpha}_n) \quad \text{and} \quad \tilde{\mathcal{B}}(\hat{\alpha}_n) \le \tilde{\mathcal{B}}(\alpha), \qquad (7) \]
since $\hat{\alpha}_n$ maximises $\mathcal{B}_n$ and $\alpha$ maximises $\tilde{\mathcal{B}}$. We now consider the difference $|\mathcal{B}_n(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\alpha)|$; if we can show that $|\mathcal{B}_n(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\alpha)| \stackrel{a.s.}{\to} 0$, then $\hat{\alpha}_n \to \alpha$ almost surely. Studying $(\mathcal{B}_n(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\alpha))$ and using (7) we have
\[ \mathcal{B}_n(\alpha) - \tilde{\mathcal{B}}(\alpha) \le \mathcal{B}_n(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\alpha) \le \mathcal{B}_n(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\hat{\alpha}_n). \]
Therefore we have
\[ |\mathcal{B}_n(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\alpha)| \le \max\big\{ |\mathcal{B}_n(\alpha) - \tilde{\mathcal{B}}(\alpha)|, \; |\mathcal{B}_n(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\hat{\alpha}_n)| \big\}. \]
To show that the right-hand side of the above converges to zero we require not only pointwise convergence but uniform convergence of $\mathcal{B}_n(a)$, which we define below.

Definition 3 $\mathcal{B}_n(a)$ is said to converge almost surely uniformly to $\tilde{\mathcal{B}}(a)$ if
\[ \sup_{a \in [c,d]} |\mathcal{B}_n(a) - \tilde{\mathcal{B}}(a)| \stackrel{a.s.}{\to} 0. \]
In other words, there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for every $\omega \in M$,
\[ \sup_{a \in [c,d]} |\mathcal{B}_n(\omega, a) - \tilde{\mathcal{B}}(a)| \to 0. \]
Therefore, returning to our problem, if $\mathcal{B}_n(a) \to \tilde{\mathcal{B}}(a)$ uniformly almost surely, then we have the bound
\[ |\mathcal{B}_n(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\alpha)| \le \sup_{a \in [c,d]} |\mathcal{B}_n(a) - \tilde{\mathcal{B}}(a)| \stackrel{a.s.}{\to} 0. \]
This implies that $\tilde{\mathcal{B}}(\hat{\alpha}_n) \stackrel{a.s.}{\to} \tilde{\mathcal{B}}(\alpha)$ (since $|\tilde{\mathcal{B}}(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\alpha)| \le |\tilde{\mathcal{B}}(\hat{\alpha}_n) - \mathcal{B}_n(\hat{\alpha}_n)| + |\mathcal{B}_n(\hat{\alpha}_n) - \tilde{\mathcal{B}}(\alpha)|$), and since $\tilde{\mathcal{B}}(a)$ is continuous with a unique maximum at $\alpha$, we have that $\hat{\alpha}_n \stackrel{a.s.}{\to} \alpha$. Therefore if we can show almost sure uniform convergence of $\mathcal{B}_n(a)$, then we have strong consistency of the estimator $\hat{\alpha}_n$.

Comment: Pointwise convergence is relatively easy to show, but how do we show uniform convergence?

Figure 1: The middle curve is $\tilde{\mathcal{B}}(a)$. If the sequence $\{\mathcal{B}_n(\omega, a)\}_n$ converges uniformly to $\tilde{\mathcal{B}}(a)$, then $\mathcal{B}_n(\omega, a)$ will lie inside these boundaries for all $n > N(\omega)$.

3.3 Uniform convergence and stochastic equicontinuity

We now define the concept of stochastic equicontinuity. We will prove that stochastic equicontinuity, together with almost sure pointwise convergence (and a compact parameter space), implies uniform convergence.

Definition 4 The sequence of stochastic functions $\{\mathcal{B}_n(a)\}_n$ is said to be stochastically equicontinuous if there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for every $\omega \in M$ and $\varepsilon > 0$, there exist a $\delta > 0$ and an $N(\omega)$ such that
\[ \sup_{|a_1 - a_2| \le \delta} |\mathcal{B}_n(\omega, a_1) - \mathcal{B}_n(\omega, a_2)| \le \varepsilon, \]
for all $n > N(\omega)$.

Remark 3 A sufficient condition for stochastic equicontinuity is that there exists an $N(\omega)$ such that for all $n > N(\omega)$, $\mathcal{B}_n(\omega, \cdot)$ belongs to the Lipschitz class $C(L)$ (where $L$ is the same for all $\omega$). In general we verify this condition to show stochastic equicontinuity.

In the following lemma we show that if $\{\mathcal{B}_n(a)\}_n$ is stochastically equicontinuous and also pointwise convergent, then there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for every $\omega \in M$ there is pointwise convergence of $\mathcal{B}_n(\omega, a) \to \tilde{\mathcal{B}}(a)$. This lemma can be omitted on first reading (it is mainly technical).

Lemma 1 Suppose the sequence $\{\mathcal{B}_n(a)\}_n$ is stochastically equicontinuous and also pointwise convergent (that is, $\mathcal{B}_n(a)$ converges almost surely to $\tilde{\mathcal{B}}(a)$). Then there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for every $\omega \in M$ and $a \in [c,d]$ we have
\[ |\mathcal{B}_n(\omega, a) - \tilde{\mathcal{B}}(a)| \to 0. \]

PROOF. Enumerate the rationals in the interval $[c,d]$ and call this sequence $\{a_i\}_i$. For every $a_i$ there exists a set $M_{a_i}$ with $P(M_{a_i}) = 1$ such that for every $\omega \in M_{a_i}$ we have $|\mathcal{B}_n(\omega, a_i) - \tilde{\mathcal{B}}(a_i)| \to 0$. Define $M = \cap_i M_{a_i}$; since the number of sets is countable, $P(M) = 1$, and for every $\omega \in M$ and every $a_i$ we have $|\mathcal{B}_n(\omega, a_i) - \tilde{\mathcal{B}}(a_i)| \to 0$. Suppose we have equicontinuity for every realisation in $\tilde{M}$, where $P(\tilde{M}) = 1$. Let $\bar{M} = \tilde{M} \cap M$ and let $\omega \in \bar{M}$; then for every $\varepsilon/3 > 0$ there exists a $\delta > 0$ such that
\[ \sup_{|a_1 - a_2| \le \delta} |\mathcal{B}_n(\omega, a_1) - \mathcal{B}_n(\omega, a_2)| \le \varepsilon/3, \]
for all $n > N(\omega)$. Now for any given $a$, choose a rational $a_j$ such that $|a - a_j| \le \delta$. By pointwise convergence we have
\[ |\mathcal{B}_n(\omega, a_j) - \tilde{\mathcal{B}}(a_j)| \le \varepsilon/3, \]
for all $n > N'(\omega)$. Then we have
\[ |\mathcal{B}_n(\omega, a) - \tilde{\mathcal{B}}(a)| \le |\mathcal{B}_n(\omega, a) - \mathcal{B}_n(\omega, a_j)| + |\mathcal{B}_n(\omega, a_j) - \tilde{\mathcal{B}}(a_j)| + |\tilde{\mathcal{B}}(a_j) - \tilde{\mathcal{B}}(a)| \le \varepsilon, \]
for all $n > \max(N(\omega), N'(\omega))$ (taking $\delta$ smaller if necessary so that, by the continuity of $\tilde{\mathcal{B}}$, the last term is also at most $\varepsilon/3$). To summarise, for every $\omega \in \bar{M}$ and $a \in [c,d]$ we have $|\mathcal{B}_n(\omega, a) - \tilde{\mathcal{B}}(a)| \to 0$; hence we have pointwise convergence for every realisation in $\bar{M}$. $\square$

We now show that equicontinuity implies uniform convergence.

Theorem 3 Suppose the set $[c,d]$ is a compact interval and for every $a \in [c,d]$, $\mathcal{B}_n(a)$ converges almost surely to $\tilde{\mathcal{B}}(a)$. Furthermore assume that $\{\mathcal{B}_n(a)\}$ is almost surely equicontinuous. Then we have
\[ \sup_{a \in [c,d]} |\mathcal{B}_n(a) - \tilde{\mathcal{B}}(a)| \stackrel{a.s.}{\to} 0. \]

PROOF. Let $\bar{M} = M \cap \tilde{M}$, where $\tilde{M}$ is the set on which we have equicontinuity and $M$ is the set on which $\{\mathcal{B}_n(\omega, a)\}_n$ converges pointwise for every $a \in [c,d]$ (such a set exists by Lemma 1); it is clear that $P(\bar{M}) = 1$. Choose $\varepsilon/3 > 0$ and let $\delta$ be such that for every $\omega \in \tilde{M}$ we have
\[ \sup_{|a_1 - a_2| \le \delta} |\mathcal{B}_n(\omega, a_1) - \mathcal{B}_n(\omega, a_2)| \le \varepsilon/3, \]
for all $n > N(\omega)$. Since $[c,d]$ is compact it can be divided into a finite number of intervals. Therefore let $c = \rho_0 \le \rho_1 \le \cdots \le \rho_p = d$ with $\sup_i |\rho_{i+1} - \rho_i| \le \delta$. Since $p$ is finite, there exists an $\tilde{N}(\omega)$ such that
\[ \max_{1 \le i \le p} |\mathcal{B}_n(\omega, \rho_i) - \tilde{\mathcal{B}}(\rho_i)| \le \varepsilon/3, \]
for all $n > \tilde{N}(\omega)$ (where $N_i(\omega)$ is such that $|\mathcal{B}_n(\omega, \rho_i) - \tilde{\mathcal{B}}(\rho_i)| \le \varepsilon/3$ for all $n \ge N_i(\omega)$ and $\tilde{N}(\omega) = \max_i N_i(\omega)$). For any $a \in [c,d]$, choose the $i$ such that $a \in [\rho_i, \rho_{i+1}]$; then by stochastic equicontinuity we have
\[ |\mathcal{B}_n(\omega, a) - \mathcal{B}_n(\omega, \rho_i)| \le \varepsilon/3, \]
for all $n > N(\omega)$. Therefore we have
\[ |\mathcal{B}_n(\omega, a) - \tilde{\mathcal{B}}(a)| \le |\mathcal{B}_n(\omega, a) - \mathcal{B}_n(\omega, \rho_i)| + |\mathcal{B}_n(\omega, \rho_i) - \tilde{\mathcal{B}}(\rho_i)| + |\tilde{\mathcal{B}}(\rho_i) - \tilde{\mathcal{B}}(a)| \le \varepsilon, \]
for all $n \ge \bar{N}(\omega) := \max(N(\omega), \tilde{N}(\omega))$ (the last term is at most $\varepsilon/3$ since $|\tilde{\mathcal{B}}(\rho_i) - \tilde{\mathcal{B}}(a)| = \lim_n |\mathcal{B}_n(\omega, \rho_i) - \mathcal{B}_n(\omega, a)| \le \varepsilon/3$). Since $\bar{N}(\omega)$ does not depend on $a$, it follows that $\sup_a |\mathcal{B}_n(\omega, a) - \tilde{\mathcal{B}}(a)| \to 0$. Furthermore this is true for all $\omega \in \bar{M}$ and $P(\bar{M}) = 1$, hence we have almost sure uniform convergence. $\square$

The following theorem summarises all the sufficient conditions for almost sure consistency.

Theorem 4 Suppose $\mathcal{B}_n(a)$ and $\tilde{\mathcal{B}}(a)$ are defined as in (5) and (6) respectively. Let
\[ \hat{\alpha}_n = \arg\max_{a \in [c,d]} \mathcal{B}_n(a) \qquad (8) \]
and
\[ \alpha = \arg\max_{a \in [c,d]} \tilde{\mathcal{B}}(a). \qquad (9) \]
Assume that $[c,d]$ is compact and that $\tilde{\mathcal{B}}(a)$ has a unique maximum in $[c,d]$. Furthermore assume that for every $a \in [c,d]$ we have $\mathcal{B}_n(a) \stackrel{a.s.}{\to} \tilde{\mathcal{B}}(a)$, and that the sequence $\{\mathcal{B}_n(a)\}_n$ is stochastically equicontinuous. Then we have
\[ \hat{\alpha}_n \stackrel{a.s.}{\to} \alpha. \]

3.4 Strong consistency of the least squares estimator

We now verify the conditions in Theorem 4 to show that the least squares estimator is strongly consistent. We recall that

\[ \hat{a}_n = \arg\max_{a \in \Theta} \mathcal{L}_n(a), \qquad (10) \]
where
\[ \mathcal{L}_n(a) = -\frac{1}{n-1}\sum_{t=2}^{n} (X_t - aX_{t-1})^2. \]
We have already shown that for every $a \in [-1,1]$ we have $\mathcal{L}_n(a) \stackrel{a.s.}{\to} \tilde{\mathcal{L}}(a)$, where $\tilde{\mathcal{L}}(a) := E(\ell_0(a))$. Recalling Theorem 3, since the parameter space $[-1,1]$ is compact, to show strong consistency we only need to show that $\mathcal{L}_n(a)$ is stochastically equicontinuous.

By expanding $\mathcal{L}_n(a)$ and using the mean value theorem we have
\[ \mathcal{L}_n(a_1) - \mathcal{L}_n(a_2) = \nabla\mathcal{L}_n(\bar{a})(a_1 - a_2), \qquad (11) \]
where $\bar{a} \in [\min(a_1, a_2), \max(a_1, a_2)]$ and
\[ \nabla\mathcal{L}_n(\alpha) = \frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}(X_t - \alpha X_{t-1}). \]
Because $\alpha \in [-1,1]$ we have $|\nabla\mathcal{L}_n(\alpha)| \le \mathcal{D}_n$, where
\[ \mathcal{D}_n = \frac{2}{n-1}\sum_{t=2}^{n} \big( |X_{t-1}X_t| + X_{t-1}^2 \big). \]
Since $\{X_t\}_t$ is an ergodic process, $\{|X_{t-1}X_t| + X_{t-1}^2\}_t$ is an ergodic process. Therefore, if $\mathrm{var}(\epsilon_0) < \infty$, by using the ergodic theorem we have
\[ \mathcal{D}_n \stackrel{a.s.}{\to} 2E\big( |X_{t-1}X_t| + X_{t-1}^2 \big). \]
Let $\mathcal{D} := 2E(|X_{t-1}X_t| + X_{t-1}^2)$. Therefore there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for every $\omega \in M$ and $\varepsilon > 0$ we have
\[ |\mathcal{D}_n(\omega) - \mathcal{D}| \le \varepsilon, \]
for all $n > N(\omega)$. Substituting the above into (11) we have
\[ |\mathcal{L}_n(\omega, a_1) - \mathcal{L}_n(\omega, a_2)| \le \mathcal{D}_n(\omega)|a_1 - a_2| \le (\mathcal{D} + \varepsilon)|a_1 - a_2|, \]
for all $n \ge N(\omega)$. Therefore for every $\varepsilon > 0$ there exists a $\delta := \varepsilon/(\mathcal{D} + \varepsilon)$ such that
\[ \sup_{|a_1 - a_2| \le \delta} |\mathcal{L}_n(\omega, a_1) - \mathcal{L}_n(\omega, a_2)| \le \varepsilon, \]
for all $n \ge N(\omega)$. Since this is true for all $\omega \in M$, we see that $\{\mathcal{L}_n(a)\}$ is stochastically equicontinuous.
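A quick numerical check of this argument (my own construction, with illustrative parameters): the Lipschitz bound $\mathcal{D}_n$ computed along a simulated AR(1) path settles down as $n$ grows, which is exactly what the ergodic theorem step above asserts.

```python
# Equicontinuity check sketch: the Lipschitz bound D_n = 2/(n-1) sum (|X_{t-1}X_t| + X_{t-1}^2)
# computed along one AR(1) path stabilises as n grows (illustrative parameters).
import numpy as np

rng = np.random.default_rng(5)
a0, n_max = 0.7, 200_000
eps = rng.standard_normal(n_max)
X = np.zeros(n_max)
for t in range(1, n_max):
    X[t] = a0 * X[t - 1] + eps[t]

def D_n(X):
    return 2.0 * np.mean(np.abs(X[:-1] * X[1:]) + X[:-1] ** 2)

for n in (1_000, 10_000, 100_000, 200_000):
    print(n, D_n(X[:n]))                      # settles near D = 2 E(|X_0 X_1| + X_0^2)
```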

Theorem 5 Let $\hat{a}_n$ be defined as in (10). Then we have
\[ \hat{a}_n \stackrel{a.s.}{\to} a. \]

PROOF. Since $\{\mathcal{L}_n(a)\}$ is almost surely equicontinuous, the parameter space $[-1,1]$ is compact, and we have pointwise convergence $\mathcal{L}_n(a) \stackrel{a.s.}{\to} \tilde{\mathcal{L}}(a)$, by using Theorem 4 we have that $\hat{a}_n \stackrel{a.s.}{\to} \alpha$, where $\alpha = \arg\max_a \tilde{\mathcal{L}}(a)$. Finally we need to show that $\alpha \equiv a$. Since
\[ \tilde{\mathcal{L}}(\alpha) = E(\ell_0(\alpha)) = -E(X_1 - \alpha X_0)^2, \]
we see by differentiating $\tilde{\mathcal{L}}(\alpha)$ that it is maximised when $\alpha = E(X_0X_1)/E(X_0^2)$. Inspecting the AR(1) process we see that
\[ X_t = aX_{t-1} + \epsilon_t \;\Rightarrow\; X_tX_{t-1} = aX_{t-1}^2 + \epsilon_tX_{t-1} \;\Rightarrow\; E(X_tX_{t-1}) = aE(X_{t-1}^2) + \underbrace{E(\epsilon_tX_{t-1})}_{=0}. \]
Therefore $a = E(X_0X_1)/E(X_0^2)$, hence $\alpha \equiv a$, and thus we have shown strong consistency of the least squares estimator. $\square$

(i) This is the general method for showing almost sure convergence of a whole class of estimators. Using a similar technique one can show strong consistency of the maximum likelihood estimator of the ARCH(1) process; see Berkes, Horváth, and Kokoszka (2003) for details.

(ii) Described here was strong consistency, where we showed $\hat{\alpha}_n \stackrel{a.s.}{\to} \alpha$. Using similar tools one can show weak consistency, where $\hat{\alpha}_n \stackrel{P}{\to} \alpha$ (convergence in probability), which requires a much weaker set of conditions. For example, rather than show almost sure pointwise convergence we would show pointwise convergence in probability. Rather than stochastic equicontinuity we would show equicontinuity in probability, that is, for every $\epsilon > 0$ there exists a $\delta$ such that

\[ \lim_{n \to \infty} P\Big( \sup_{|a_1 - a_2| \le \delta} |\mathcal{B}_n(a_1) - \mathcal{B}_n(a_2)| > \epsilon \Big) = 0. \]

Some of these conditions are easier to verify than almost sure conditions.

4 Central limit theorems

Recall once again the AR(1) process $\{X_t\}$, where $X_t$ satisfies

\[ X_t = aX_{t-1} + \epsilon_t. \]

It is of interest to check whether there actually is dependence in $\{X_t\}$; if there is no dependence then $a = 0$ (in which case $\{X_t\}$ would be white noise). Of course, in most situations we only have an estimator of $a$ (for example, $\hat{a}_n$ defined in (10)).

(i) Given the estimator $\hat{a}_n$, how do we assess whether $a = 0$?

(ii) Usually we would do a hypothesis test, with $H_0: a = 0$ against the alternative of, say, $a \neq 0$. However, in order to do this we require the distribution of $\hat{a}_n$.

(iii) Usually, evaluating the exact sampling distribution is extremely hard or close to impossible. Instead we evaluate the limiting distribution, which in general is a lot easier.

(iv) In this section we shall show asymptotic normality of $\sqrt{n}(\hat{a}_n - a)$. The reason for normalising by $\sqrt{n}$ is that $(\hat{a}_n - a) \stackrel{a.s.}{\to} 0$ as $n \to \infty$; hence in terms of distributions it converges towards the point mass at zero. Therefore we need to increase the magnitude of the difference $\hat{a}_n - a$. We can show that $(\hat{a}_n - a) = O(n^{-1/2})$, therefore $\sqrt{n}(\hat{a}_n - a) = O(1)$. Multiplying $(\hat{a}_n - a)$ by anything larger would mean that its limit goes to infinity, and multiplying $(\hat{a}_n - a)$ by something smaller in magnitude would mean its limit goes to zero, so $n^{1/2}$ is the happy medium.
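A small Monte Carlo sketch of this scaling argument (replication counts and parameter values are my own choices): across replications, the standard deviation of $\hat{a}_n - a$ shrinks like $n^{-1/2}$, so $\sqrt{n}(\hat{a}_n - a)$ has a stable spread.

```python
# Monte Carlo sketch of the root-n rate (illustrative parameters): the standard
# deviation of hat{a}_n - a shrinks like n^{-1/2}, so sqrt(n)(hat{a}_n - a) is stable.
import numpy as np

rng = np.random.default_rng(6)
a0, n_rep = 0.7, 500

def ls_estimate(n):
    eps = rng.standard_normal(n)
    X = np.zeros(n)
    for t in range(1, n):
        X[t] = a0 * X[t - 1] + eps[t]
    return np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)

for n in (250, 1_000, 4_000):
    err = np.array([ls_estimate(n) - a0 for _ in range(n_rep)])
    print(n, err.std(), err.std() * np.sqrt(n))   # sd ~ c/sqrt(n); sd*sqrt(n) roughly constant
```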

We often use $\nabla\mathcal{L}_n(a)$ to denote the derivative of $\mathcal{L}_n(a)$ with respect to $a$ (that is, $\frac{\partial \mathcal{L}_n(a)}{\partial a}$). Since $\hat{a}_n = \arg\max \mathcal{L}_n(a)$, we observe that $\nabla\mathcal{L}_n(\hat{a}_n) = 0$. Now expanding $\nabla\mathcal{L}_n(\hat{a}_n)$ about $a$ (the true parameter) we have
\[ \nabla\mathcal{L}_n(\hat{a}_n) - \nabla\mathcal{L}_n(a) = \nabla^2\mathcal{L}_n\,(\hat{a}_n - a) \quad \Rightarrow \quad -\nabla\mathcal{L}_n(a) = \nabla^2\mathcal{L}_n\,(\hat{a}_n - a), \qquad (12) \]
where $\nabla^2\mathcal{L}_n = \frac{\partial^2 \mathcal{L}_n}{\partial a^2}$,
\[ \nabla\mathcal{L}_n(a) = \frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}(X_t - aX_{t-1}) = \frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}\epsilon_t
\quad \text{and} \quad \nabla^2\mathcal{L}_n = -\frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}^2. \]
Therefore by using (12) we have
\[ (\hat{a}_n - a) = -\big(\nabla^2\mathcal{L}_n\big)^{-1}\nabla\mathcal{L}_n(a). \qquad (13) \]
Since $\{X_t^2\}$ are ergodic random variables, by using the ergodic theorem we have $\nabla^2\mathcal{L}_n \stackrel{a.s.}{\to} -2E(X_0^2)$. This with (13) implies
\[ \sqrt{n}(\hat{a}_n - a) = \underbrace{-\big(\nabla^2\mathcal{L}_n\big)^{-1}}_{\stackrel{a.s.}{\to}\,(2E(X_0^2))^{-1}}\sqrt{n}\,\nabla\mathcal{L}_n(a). \qquad (14) \]
To show asymptotic normality of $\sqrt{n}(\hat{a}_n - a)$, we will show asymptotic normality of $\sqrt{n}\,\nabla\mathcal{L}_n(a)$. We observe that
\[ \nabla\mathcal{L}_n(a) = \frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}\epsilon_t \]
is (a multiple of) the sum of martingale differences, since $E(X_{t-1}\epsilon_t \mid X_{t-1}) = X_{t-1}E(\epsilon_t \mid X_{t-1}) = X_{t-1}E(\epsilon_t) = 0$. In order to show the asymptotic normality of $\nabla\mathcal{L}_n(a)$ we will use the martingale central limit theorem.

4.0.1 Martingales and the conditional likelihood

First a quick overview. $\{Z_t; t = 1, \ldots, \infty\}$ are called martingale differences if
\[ E(Z_t \mid Z_{t-1}, Z_{t-2}, \ldots) = 0. \]
An example is the sequence $\{X_{t-1}\epsilon_t\}_t$ considered above: because $\epsilon_t$ and $X_{t-1}, X_{t-2}, \ldots$ are independent, we have $E(X_{t-1}\epsilon_t \mid X_{t-1}, X_{t-2}, \ldots) = 0$. The stochastic sum $\{\mathcal{S}_n\}_n$, where
\[ \mathcal{S}_n = \sum_{t=1}^{n} Z_t, \]
is called a martingale if $\{Z_t\}$ are martingale differences.

First we show that the gradient of the conditional log-likelihood, defined as
\[ \mathcal{C}_n(\theta) = \sum_{t=2}^{n} \frac{\partial \log f_{\theta}(X_t \mid X_{t-1}, \ldots, X_1)}{\partial \theta}, \]
is, at the true parameter $\theta_0$, a sum of martingale differences. By definition, $\mathcal{C}_n(\theta_0)$ is a sum of martingale differences if
\[ E\left( \frac{\partial \log f_{\theta}(X_t \mid X_{t-1}, \ldots, X_1)}{\partial \theta}\bigg|_{\theta = \theta_0} \;\bigg|\; X_{t-1}, X_{t-2}, \ldots, X_1 \right) = 0. \]
Rewriting the above in terms of integrals and exchanging derivative with integral we have
\begin{align*}
& E\left( \frac{\partial \log f_{\theta}(X_t \mid X_{t-1}, \ldots, X_1)}{\partial \theta}\bigg|_{\theta = \theta_0} \;\bigg|\; X_{t-1}, \ldots, X_1 \right) \\
&\quad = \int \frac{\partial \log f_{\theta}(x_t \mid X_{t-1}, \ldots, X_1)}{\partial \theta}\bigg|_{\theta = \theta_0} f_{\theta_0}(x_t \mid X_{t-1}, \ldots, X_1)\, dx_t \\
&\quad = \int \frac{1}{f_{\theta_0}(x_t \mid X_{t-1}, \ldots, X_1)} \frac{\partial f_{\theta}(x_t \mid X_{t-1}, \ldots, X_1)}{\partial \theta}\bigg|_{\theta = \theta_0} f_{\theta_0}(x_t \mid X_{t-1}, \ldots, X_1)\, dx_t \\
&\quad = \frac{\partial}{\partial \theta}\left( \int f_{\theta}(x_t \mid X_{t-1}, \ldots, X_1)\, dx_t \right)\bigg|_{\theta = \theta_0} = 0.
\end{align*}
Therefore $\big\{ \frac{\partial}{\partial \theta}\log f_{\theta}(X_t \mid X_{t-1}, \ldots, X_1)\big|_{\theta = \theta_0} \big\}_t$ is a sequence of martingale differences, and $\mathcal{C}_n(\theta_0)$ is the sum of martingale differences (hence it is a martingale).

E ∂ log fθ(Xt Xt−1,...,X1) | 0 X , X ,...,X ∂θ ⌋θ=θ t−1 t−2 1   ∂ log fθ(xt Xt−1,...,X1) = | 0 f 0 (x X ,...,X )dx ∂θ ⌋θ=θ θ t| t−1 1 t Z 1 ∂fθ(xt Xt−1,...,X1) = | θ=θ0 fθ0 (xt Xt−1,...,X1)dxt f 0 (x X ,...,X ) ∂θ ⌋ | Z θ t| t−1 1 ∂ = f (x X ,...,X )dx = 0. ∂θ θ t| t−1 1 t ⌋θ=θ0 Z  Therefore ∂ log fθ(Xt|Xt−1,...,X1) are a sequence of martingale differences and (θ ) is { ∂θ ⌋θ=θ0 }t Ct 0 the sum of martingale differences (hence it is a martingale).

4.1 The martingale central limit theorem

Let us define as Sn 1 n = Y , (15) Sn √n t t=1 X where = σ(Y ,Y ,...), E(Y ) = 0 and E(Y 2) < . In the following theorem Ft t t−1 t|Ft−1 t ∞ adapted from Hall and Heyde (1980), Theorem 3.2 and Corollary 3.1, we show that is Sn asymptotically normal.

Theorem 6 Let be defined as in (15). Further suppose {Sn}n 1 n Y 2 P σ2, (16) n t → t=1 X where σ2 is a finite constant, for all ε> 0,

1 n E(Y 2I( Y >ε√n) ) P 0, (17) n t | t| |Ft−1 → t=1 X 14 (this is known as the conditional Lindeberg condition) and

1 n E(Y 2 ) P σ2. (18) n t |Ft−1 → t=1 X Then we have

D (0, σ2). (19) Sn →N

4.2 Asymptotic normality of the least squares estimator

We now use Theorem 6 to show that $\sqrt{n}\,\nabla\mathcal{L}_n(a)$ is asymptotically normal, which means we have to verify conditions (16)-(18). We note that in our example $Y_t := X_{t-1}\epsilon_t$, and that the series $\{X_{t-1}\epsilon_t\}_t$ is an ergodic process. Furthermore, since for any function $g$ we have $E(g(X_{t-1}\epsilon_t) \mid \mathcal{F}_{t-1}) = E(g(X_{t-1}\epsilon_t) \mid X_{t-1})$, where $\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$, we need only condition on $X_{t-1}$ rather than the entire sigma-algebra $\mathcal{F}_{t-1}$.

C1: By using the ergodicity of $\{X_{t-1}\epsilon_t\}_t$ we have
\[ \frac{1}{n}\sum_{t=1}^{n} Y_t^2 = \frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2\epsilon_t^2 \stackrel{P}{\to} E(X_{t-1}^2)\underbrace{E(\epsilon_t^2)}_{=1} = \sigma^2. \]

C2: We now verify the conditional Lindeberg condition.

1 n 1 n E(Y 2I( Y >ε√n) )= E(X2 ǫ2I( X ǫ >ε√n) X ) n t | t| |Ft−1 n t−1 t | t−1 t| | t−1 t=1 t=1 X X 2 2 We now use the Cauchy-Schwartz inequality for conditional expectations to split Xt−1ǫt and I( X ǫ > ε). We recall that the Cauchy-Schwartz inequality for conditional | t−1 t| expectations is E(X Y ) [E(X2 )E(Y 2 )]1/2 almost surely. Therefore t t|G ≤ t |G t |G 1 n E(Y 2I( Y >ε√n) ) n t | t| |Ft−1 t=1 Xn 1 1/2 E(X4 ǫ4 X )E(I( X ǫ >ε√n)2 X ) ≤ n t−1 t | t−1 | t−1 t| | t−1 t=1 Xn  1 1/2 X2 E(ǫ4)1/2 E(I( X ǫ >ε√n)2 X ) . (20) ≤ n t−1 t | t−1 t| | t−1 t=1 X  We note that rather than use the Cauchy-Schwartz inequality we can use a generalisa- tion of it called the H¨older inequality. The H¨older inequality states that if p−1+q−1 = 1, then E(XY ) E(Xp) 1/p E(Y q) 1/q (the conditional version also exists). The ad- ≤ { } { } vantage of using this inequality is that one can reduce the moment assumptions on

Xt.

15 Returning to (20), and studying E(I( X ǫ >ε)2 X ) we use that E(I(A)) = P(A) | t−1 t| | t−1 and the Chebyshev inequality to show

E(I( X ǫ >ε√n)2 X )= E(I( X ǫ >ε√n) X ) | t−1 t| | t−1 | t−1 t| | t−1 = E(I( ǫt >ε√n/Xt−1) Xt−1) | | 2| P ε√n Xt−1var(ǫt) = ε( ǫt > )) 2 . (21) | | Xt−1 ≤ ε n Substituting (21) into (20) we have 1 n E(Y 2I( Y >ε√n) ) n t | t| |Ft−1 t=1 X 1 n X2 var(ǫ ) 1/2 X2 E(ǫ4)1/2 t−1 t ≤ n t−1 t ε2n t=1 X   E(ǫ4)1/2 n t X 3E(ǫ4)1/2 ≤ εn3/2 | t−1| t t=1 X E(ǫ4)1/2 1 n t X 3. ≤ εn1/2 n | t−1| t=1 X E 4 E 4 If (ǫt ) < , then (Xt ) < , therefore by using the ergodic theorem we have n ∞ a.s. ∞ 1 X 3 E( X 3). Since almost sure convergence implies convergence in n t=1 | t−1| → | 0| probability we have P 1 n E(ǫ4)1/2 1 n E(Y 2I( Y >ε√n) ) t X 3 n t | t| |Ft−1 ≤ εn1/2 n | t−1| t=1 t=1 X →0 X P 3 →E(|X0| ) | {z } P 0. | {z } → Hence condition (17) is satisfied.

C3: We need to verify that
\[ \frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 \mid \mathcal{F}_{t-1}) \stackrel{P}{\to} \sigma^2. \]
Since $\{X_t\}_t$ is an ergodic sequence we have
\[ \frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 \mid \mathcal{F}_{t-1}) = \frac{1}{n}\sum_{t=1}^{n} E(X_{t-1}^2\epsilon_t^2 \mid X_{t-1}) = \frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2 E(\epsilon_t^2 \mid X_{t-1}) = E(\epsilon_t^2)\underbrace{\frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2}_{\stackrel{a.s.}{\to} E(X_0^2)} \stackrel{P}{\to} E(\epsilon_t^2)E(X_0^2) = \sigma^2, \]
hence we have verified condition (18).

Altogether, conditions C1-C3 imply that

\[ \frac{1}{\sqrt{n}}\sum_{t=2}^{n} X_{t-1}\epsilon_t \stackrel{D}{\to} \mathcal{N}(0, \sigma^2). \qquad (22) \]
Since $\sqrt{n}\,\nabla\mathcal{L}_n(a) = \frac{2\sqrt{n}}{n-1}\sum_{t=2}^{n} X_{t-1}\epsilon_t$ behaves asymptotically like $\frac{2}{\sqrt{n}}\sum_{t=2}^{n} X_{t-1}\epsilon_t$, (22) gives $\sqrt{n}\,\nabla\mathcal{L}_n(a) \stackrel{D}{\to} \mathcal{N}(0, 4\sigma^2)$. Recalling (14) we have
\[ \sqrt{n}(\hat{a}_n - a) = \underbrace{-\big(\nabla^2\mathcal{L}_n\big)^{-1}}_{\stackrel{a.s.}{\to}\,(2E(X_0^2))^{-1}}\underbrace{\sqrt{n}\,\nabla\mathcal{L}_n(a)}_{\stackrel{D}{\to}\,\mathcal{N}(0, 4\sigma^2)}. \qquad (23) \]
Using that $E(X_0^2) = \sigma^2$, this implies that
\[ \sqrt{n}(\hat{a}_n - a) \stackrel{D}{\to} \mathcal{N}\big(0, (2\sigma^2)^{-2}\cdot 4\sigma^2\big) = \mathcal{N}\big(0, (\sigma^2)^{-1}\big) = \mathcal{N}(0, 1 - a^2). \qquad (24) \]
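A Monte Carlo sanity check of (24) (my own sketch, using the explicit least-squares form of $\hat{a}_n$ and illustrative parameter values): the sample variance of $\sqrt{n}(\hat{a}_n - a)$ across replications should be close to $1 - a^2$.

```python
# Monte Carlo check of (24) (illustrative parameters): the sample variance of
# sqrt(n)(hat{a}_n - a) across replications should be close to 1 - a^2.
import numpy as np

rng = np.random.default_rng(7)
a0, n, n_rep = 0.7, 2_000, 2_000

def root_n_error():
    eps = rng.standard_normal(n)
    X = np.zeros(n)
    for t in range(1, n):
        X[t] = a0 * X[t - 1] + eps[t]
    a_hat = np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)
    return np.sqrt(n) * (a_hat - a0)

Z = np.array([root_n_error() for _ in range(n_rep)])
print(Z.mean(), Z.var(), 1.0 - a0**2)         # variance vs. limiting variance 1 - a^2
```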

Thus we have derived the limiting distribution of $\hat{a}_n$.

References

Berkes, I., Horváth, L., & Kokoszka, P. (2003). GARCH processes: Structure and estimation. Bernoulli, 9, 201-227.

Billingsley, P. (1965). Ergodic Theory and Information.

Grimmett, G., & Stirzaker, D. (1994). Probability and Random Processes.

Hall, P., & Heyde, C. (1980). Martingale Limit Theory and its Application. New York: Academic Press.

Stout, W. (1974). Almost Sure Convergence. New York: Academic Press.
