EFFICIENT LIKELIHOOD ESTIMATION IN STATE SPACE MODELS

BY CHENG-DER FUH† ACADEMIA SINICA

ABSTRACT

Motivated by studying asymptotic properties of the maximum likelihood estimator (MLE) in stochastic volatility (SV) models, in this paper we investigate likelihood estimation in state space models. Based on a device that represents the likelihood function as the L1-norm of a Markovian iterated random functions system, we first prove, under some regularity conditions, that there is a consistent sequence of roots of the likelihood equation that is asymptotically normal with the inverse of the Fisher information as its variance. Under the extra assumption that the likelihood equation has a unique root for each n, there is a consistent sequence of estimators of the unknown parameters. If, in addition, the supremum of the log likelihood function is integrable, the maximum likelihood estimator exists and is strongly consistent. An Edgeworth expansion of the approximate solution of the likelihood equation is also established.

AMS 2000 subject classifications. Primary 62M09; secondary 62F12, 62E25. Key words and phrases. consistency, efficiency, ARMA models, (G)ARCH models, stochastic volatility models, asymptotic normality, asymptotic expansion, maximum likelihood, incomplete data, iterated random functions. † Research partially supported by NSC 92-2118-M-001-032.

1 Introduction

Motivated by studying asymptotic properties of the maximum likelihood estimator (MLE) in stochastic volatility (SV) models, in this paper we investigate likelihood estimation in state space models. A state space model is, loosely speaking, a sequence {ξn}_{n=1}^{∞} of random variables obtained in the following way. First, a realization of a Markov chain X = {Xn, n ≥ 0} is created. This chain is sometimes called the regime and is not observed. Then, conditioned on

X, the ξ-variables are generated. Usually, the dependency of ξn on X is more or less local, as when ξn = g(Xn,ξn−1,ηn) for some function g and random sequence {ηn}, independent of X.

ξn itself is generally not Markov and may, in fact, have a complicated dependency structure.
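To fix ideas, the local dependence ξn = g(Xn, ξn−1, ηn) can be sketched with a small simulation. The two-regime switching AR(1) below is purely illustrative (the regime dynamics, coefficients, and switching probability are made up, not taken from the paper): Xn is the hidden regime and only ξn would be observed.

```python
import random

# Hypothetical two-regime Markov-switching AR(1):
#   xi_n = phi[X_n] * xi_{n-1} + sigma[X_n] * eta_n,
# which has the form xi_n = g(X_n, xi_{n-1}, eta_n) with {eta_n} i.i.d. N(0, 1).
def simulate_switching_ar1(n, p_stay=0.9, phi=(0.2, 0.8), sigma=(1.0, 2.0), seed=1):
    rng = random.Random(seed)
    x, xi = 0, 0.0
    regimes, obs = [], []
    for _ in range(n):
        if rng.random() > p_stay:      # leave the current regime
            x = 1 - x
        eta = rng.gauss(0.0, 1.0)
        xi = phi[x] * xi + sigma[x] * eta
        regimes.append(x)
        obs.append(xi)
    return regimes, obs

regimes, obs = simulate_switching_ar1(500)
```

Marginally, {ξn} alone is not Markov here: its one-step law depends on the hidden regime, which in turn depends on the whole past of the observations.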

When the state space of {Xn, n ≥ 0} is finite, it is the so-called hidden Markov model or Markov switching model. The statistical modeling and computation of state space models have attracted a great deal of attention recently because of their importance in applications to speech recognition (Rabiner and Juang, 1993), (Elliott et al., 1995), ion channels (Ball and Rice, 1992), molecular biology (Krogh et al., 1994) and economics (Engle, 1982, Bollerslev, 1986 and Taylor, 1986). The reader is referred to Hamilton (1994), Künsch (2001) and Fan and Yao (2003) for a comprehensive summary. The main focus of these efforts has been state space modeling and estimation, algorithms for fitting these models, and the implementation of likelihood based methods. The state space model here is defined in a general sense, in which the observations are conditionally Markovian dependent, and the state space of the driving Markov chain need be neither finite nor compact. When the state space is finite and the observations are conditionally independent, Baum and Petrie (1966) establish the consistency and asymptotic normality of the maximum likelihood estimator (MLE), in the case that the observation is a deterministic function of the state. Leroux (1992) proves the consistency of the MLE for general hidden Markov models under mild conditions. The asymptotic normality of the MLE is given by Bickel, Ritov and Rydén (1998). Jensen and Petersen (1999), and Douc and Matias (2001) study asymptotic properties of the MLE for general 'pseudo-compact' state space models. Extending the inference problem to the case where the state space is finite and the observations are conditionally Markovian dependent, Goldfeld and Quandt (1973) and Hamilton (1989) consider the implementation of a maximum likelihood estimator in switching autoregression with Markov regimes.
Francq and Roussignol (1998) study the consistency of the MLE, while Fuh (2004a) establishes the Bahadur efficiency of the MLE in Markov switching models. A major difficulty in analyzing the likelihood function of a state space model is that it can only be expressed in integral form (cf. equation (2.3)). In this paper, we provide a device which represents the integral likelihood function as the L1-norm of a Markovian iterated random functions system. This new representation enables us to apply results on the strong law of large numbers, the central limit theorem, and Edgeworth expansion for the distributions of Markov random walks, to verify the strong consistency of the MLE, and first-order efficiency and Edgeworth expansion for the solution of the likelihood equation. Note that third-order efficiency follows from the Edgeworth expansion by a standard argument (cf. Ghosh, 1994). The remainder of this paper is organized as follows. In Section 2, we define the state space model as a general state Markov chain in a Markovian random environment, and represent the likelihood function as the L1-norm of a Markovian iterated random functions system. In Section 3, we give a brief summary of a Markovian iterated random functions system, and state the ergodic theorem and the strong law of large numbers. The multivariate central limit theorem and Edgeworth expansion for a Markovian iterated random functions system are given in Section 4. In Section 5, we consider efficient likelihood estimation in state space models. First, we compute the Fisher information and prove the existence of an efficient estimator in a "Cramér fashion". Second, we characterize the Kullback-Leibler information, and prove the strong consistency of the maximum likelihood estimator. Third, we establish the Edgeworth expansion of the approximate solution of the likelihood equation. In Section 6, we consider a few examples, including ARMA models, (G)ARCH models and SV models, which are commonly

used in financial economics. Necessary proofs are given in Sections 7 and 8.

2 State space models

A state space model is defined as a parameterized Markov chain in a Markovian random environment with the underlying environmental Markov chain viewed as missing data. This setting generalizes the hidden Markov models and state space models considered by Leroux (1992), Bickel and Ritov (1996), Bickel, Ritov and Rydén (1998), Francq and Roussignol (1998), Jensen and Petersen (1999), Douc and Matias (2001) and Fuh (2003, 2004a, 2004c), in order to cover several interesting examples: the Markov-switching Gaussian autoregression of Hamilton (1989), the (G)ARCH models of Engle (1982) and Bollerslev (1986), and the SV models of Clark

(1973) and Taylor (1986). Specifically, let X = {Xn, n ≥ 0} be a Markov chain on a general state space X, with transition probability kernel Pθ(x, ·) = Pθ{X1 ∈ · | X0 = x} and stationary probability πθ(·), where θ ∈ Θ ⊆ R^q denotes the unknown parameter. Suppose that a random sequence {ξn}_{n=0}^{∞}, taking values in R^d, is adjoined to the chain such that {(Xn, ξn), n ≥ 0} is a Markov chain on X × R^d satisfying Pθ{X1 ∈ A | X0 = x, ξ0 = s} = Pθ{X1 ∈ A | X0 = x} for

A ∈ B(X), the σ-algebra of X. Conditional on the full X sequence, ξn is a Markov chain with probability

(2.1) Pθ{ξn+1 ∈ B | X0, X1, ···; ξ0, ξ1, ···, ξn} = Pθ{ξn+1 ∈ B | Xn+1; ξn} = Pθ(Xn+1 : ξn, B) a.s.
for each n and B ∈ B(R^d), the Borel σ-algebra of R^d. Note that in (2.1) the conditional probability of ξn+1 depends on Xn+1 and ξn only. Furthermore, we assume the existence of a transition probability density pθ(x, y) for the Markov chain {Xn, n ≥ 0} with respect to a σ-finite measure m on X such that

(2.2) Pθ{X1 ∈ A, ξ1 ∈ B | X0 = x, ξ0 = s0} = ∫_{y∈A} ∫_{s∈B} pθ(x, y) f(s; ϕy(θ) | s0) Q(ds) m(dy),
where f(ξk; ϕ_{Xk}(θ) | ξk−1) is the conditional probability density of ξk given ξk−1 and Xk, with respect to a σ-finite measure Q on R^d. Here ϕy(θ), y ∈ X, are functions defined on the parameter space Θ. We also assume that the Markov chain {(Xn, ξn), n ≥ 0} has a stationary probability with probability density function π(x)f(·; ϕx(θ)) with respect to m × Q. In this paper, we consider θ = (θ1, ···, θq) ∈ Θ ⊆ R^q as the unknown parameter, and the true parameter value is denoted by θ0. For convenience of notation, we will write π(x) for πθ(x) and p(x, y) for pθ(x, y) in the sequel. We give a formal definition as follows.

Definition 1 {ξn,n≥ 0} is called a state space model if there is a Markov chain {Xn,n≥ 0} such that the process {(Xn,ξn),n≥ 0} satisfies (2.1).

When the state space X is finite, this reduces to the hidden Markov model considered by Fuh (2003, 2004a). Denote Sn = Σ_{t=1}^{n} ξt. When the ξn are conditionally independent given X, the Markov chain {(Xn, Sn), n ≥ 0} is called a Markov additive process and Sn is called a Markov random walk. Furthermore, if the state space X is finite, {ξn, n ≥ 0} is the hidden Markov model studied by Leroux (1992), Bickel and Ritov (1996), and Bickel, Ritov and Rydén (1998).

When the state space X is ‘pseudo-compact’ and ξn are conditionally independent given X,

{ξn, n ≥ 0} is the state space model considered in Jensen and Petersen (1999), and Douc and Matias (2001).

For given observations ξ0,ξ1, ···,ξn from a state space model {ξn,n ≥ 0}, the likelihood function is

(2.3) pn(ξ0, ξ1, ···, ξn; θ) = ∫_{x0∈X} ··· ∫_{xn∈X} πθ(x0) f(ξ0; ϕ_{x0}(θ)) ∏_{j=1}^{n} pθ(x_{j−1}, x_j) f(ξj; ϕ_{x_j}(θ) | ξ_{j−1}) m(dxn) ··· m(dx0).

Recall that πθ(x0)f(ξ0; ϕ_{x0}(θ)) is the stationary probability density, with respect to m × Q, of the Markov chain {(Xn, ξn), n ≥ 0}.

To represent the likelihood function (2.3) as the L1-norm of a Markovian iterated random functions system, let

(2.4) M = {h | h : X → R⁺ is m-measurable and ∫_{x∈X} h(x) m(dx) < ∞}.

For each j = 1, ···, n, define the random functions Pθ(ξ0) and Pθ(ξj) on (X × R^d) × M as

(2.5) Pθ(ξ0)h(x) = ∫_{x∈X} f(ξ0; ϕx(θ)) h(x) m(dx),
(2.6) Pθ(ξj)h(x) = ∫_{y∈X} pθ(x, y) f(ξj; ϕy(θ) | ξ_{j−1}) h(y) m(dy).
Define the composition of two random functions as

(2.7) Pθ(ξ_{j+1}) ◦ Pθ(ξj)h(x) = ∫_{z∈X} pθ(x, z) f(ξj; ϕz(θ) | ξ_{j−1}) ( ∫_{y∈X} pθ(z, y) f(ξ_{j+1}; ϕy(θ) | ξj) h(y) m(dy) ) m(dz).

For h ∈ M, denote ‖h‖ := ∫_{x∈X} h(x) m(dx) as the L¹-norm on M with respect to m. Then, the likelihood function (2.3) can be represented as

(2.8) pn(ξ0, ξ1, ···, ξn; θ) = ∫_{x0∈X} ··· ∫_{xn∈X} πθ(x0) f(ξ0; ϕ_{x0}(θ)) ∏_{j=1}^{n} pθ(x_{j−1}, x_j) f(ξj; ϕ_{x_j}(θ) | ξ_{j−1}) m(dxn) ··· m(dx0)
= ‖Pθ(ξn) ◦ ··· ◦ Pθ(ξ1) ◦ Pθ(ξ0) νθ(x)‖.

Note that, for j = 1, ···, n, the integrand pθ(x, y)f(ξj; ϕy(θ) | ξ_{j−1}) of Pθ(ξj) in (2.6) and (2.8) represents the transition Xj−1 = x to Xj ∈ dy, and ξj is a Markov chain with transition probability density f(ξj; ϕy(θ) | ξ_{j−1}) for given X. By definition (2.1), {(Xn, ξn), n ≥ 0} is a Markov chain, and this implies that Pθ(ξj) is a sequence of Markovian iterated random functions (see Section 3 for a formal definition). Therefore, by representation (2.8), pn(ξ0, ξ1, ···, ξn; θ) is the L1-norm of a Markovian iterated random functions system.
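For a finite state space the representation can be checked directly: each random function acts as a K × K matrix on a vector of state weights, composition is matrix application, and the L1-norm is a plain sum over states. The sketch below (all numerical ingredients are invented for illustration; the recursion is written in the forward-filtering direction, which yields the same L1-norm) verifies that the composed-operator value agrees with the brute-force path sum in (2.3).

```python
import itertools
import math

# Finite-state sketch of the likelihood (2.3) and its operator form (2.8).
K = 2
pi0 = [0.5, 0.5]                      # initial/stationary law pi_theta
P = [[0.9, 0.1], [0.2, 0.8]]          # transition density p_theta(x, y)
means = [0.0, 2.0]

def f(xi, state, xi_prev=None):
    # conditional density f(xi_j; phi_y(theta) | xi_{j-1}): a normal whose
    # mean mixes the regime mean with the previous observation (made up).
    m = means[state] if xi_prev is None else 0.5 * means[state] + 0.5 * xi_prev
    return math.exp(-0.5 * (xi - m) ** 2) / math.sqrt(2 * math.pi)

xis = [0.3, 1.1, 2.2, 1.7]
n = len(xis) - 1

# Brute force: sum over all state paths, exactly as in (2.3).
brute = 0.0
for path in itertools.product(range(K), repeat=n + 1):
    term = pi0[path[0]] * f(xis[0], path[0])
    for j in range(1, n + 1):
        term *= P[path[j - 1]][path[j]] * f(xis[j], path[j], xis[j - 1])
    brute += term

# Operator form: compose the random functions step by step; the L1-norm
# (here a plain sum over states) of the final vector is the likelihood.
v = [pi0[x] * f(xis[0], x) for x in range(K)]
for j in range(1, n + 1):
    v = [sum(v[x] * P[x][y] for x in range(K)) * f(xis[j], y, xis[j - 1])
         for y in range(K)]
likelihood = sum(v)

assert abs(likelihood - brute) < 1e-12
```

The cost of the operator form is linear in n, whereas the path sum grows like K^(n+1); this is what makes the representation usable for likelihood evaluation.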

3 Ergodic theorems for a Markovian iterated random functions system

To analyze the asymptotic properties of efficient likelihood estimators in state space models, in this section we study the ergodic theorem and the strong law of large numbers for a Markovian iterated random functions system. The Markovian iterated random functions system is a generalization of an iterated random functions system, in which the random functions are driven by a Markov chain. For a general account of iterated random functions systems, the reader is referred to Diaconis and Freedman (1999) for a recent survey including an extensive list of relevant literature.

For simplicity of notation, let {Yn, n ≥ 0} (instead of {(Xn, ξn), n ≥ 0} in Section 2) be a Markov chain on a general state space Y with σ-algebra A, which is irreducible with respect to a maximal irreducibility measure on (Y, A) and is aperiodic. The transition kernel is denoted by P(y, A). Let (M, d) be a complete separable metric space with Borel σ-algebra B(M). A sequence of the form

(3.1) Mn = F(Yn, Mn−1), n ≥ 0,
taking values in (M, d), is called a Markovian iterated random functions system (MIRFS) of Lipschitz functions provided that

(1) {Yn,n ≥ 0} is a Markov chain taking values in a second countable measurable space

(Y, A), with transition probability kernel P (·, ·) and stationary probability π, and M0 is a random element on a probability space (Ω, F,P), which is independent of {Yn,n≥ 0}; (2) F :(Y×M, A⊗B(M)) → (M, B(M)) is jointly measurable and Lipschitz continuous in the second argument.

Clearly, {(Yn, Mn), n ≥ 0} constitutes a Markov chain with state space Y × M and transition probability kernel P, given by
(3.2) P((y, u), A × B) := ∫_{z∈A} I_B(F(z, u)) P(y, dz)
for all y ∈ Y, u ∈ M, A ∈ A, and B ∈ B(M), where I denotes the indicator function. The n-step transition kernel is denoted by P^n. For (y, u) ∈ Y × M, let Pyu be the probability measure on the underlying measurable space under which Y0 = y, M0 = u a.s. The associated expectation is denoted by Eyu, as usual. For an arbitrary distribution ν on Y × M, we put Pν(·) := ∫ Pyu(·) ν(dy × du), with associated expectation Eν. We use P and E for probabilities and expectations, respectively, that do not depend on the initial distribution.

Let M0 be a dense subset of M and M(M0, M) the space of all mappings h : M0 → M, endowed with the product topology and product σ-algebra. Then the space LLip(M, M) of all Lipschitz continuous mappings h : M → M, properly embedded, forms a Borel subset of M(M0, M), and the mappings

LLip(M, M) × M ∋ (h, u) ↦ h(u) ∈ M,
LLip(M, M) ∋ h ↦ l(h) := sup_{u≠v} d(h(u), h(v))/d(u, v)

are Borel; see Lemma 5.1 in Diaconis and Freedman (1999) for details. Hence

(3.3) Ln := l(F(Yn, ·)), n ≥ 0,
are also measurable and form a sequence of Markovian dependent random variables. An important point in the proof of the ergodic property will be the proper use of the idea of duality. For this purpose, we introduce a time-reversed (or dual) Markov chain {Ỹn, n ≥ 0} of

{Yn, n ≥ 0} as follows. Assume that there exists a σ-finite measure m on (Y, A) such that the probability measure P(y, ·) on (Y, A), defined by P(y, A) = P(Y1 ∈ A | Y0 = y), is absolutely continuous with respect to m, so that P(y, A) = ∫_A p(y, z) m(dz) for all A ∈ A, where p(y, ·) = dP(y, ·)/dm. The Markov chain {Yn, n ≥ 0} is assumed to have an invariant probability measure π which has a positive probability density function π (without any confusion, we still use the same notation) with respect to m. We shall use ˜ to denote the time-reversed (or dual) process {Ỹn, n ≥ 0}, with transition probability density

(3.4) p̃(y, z) = p(z, y)π(z)/π(y).

Denote P˜ as the corresponding probability. It is easy to see that both Yn and Y˜n have the same stationary distribution π. In this section, we will assume that the initial distribution of

Y0 is the stationary distribution π.
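In the finite-state case the duality can be checked numerically. The snippet below (with an arbitrary 2-state kernel, chosen only for illustration) builds the reversed kernel p̃(y, z) = p(z, y)π(z)/π(y), the row-stochastic reading of (3.4), and confirms that it is a transition kernel with the same stationary law π.

```python
# Finite-state check of time reversal: p~(y, z) = p(z, y) * pi(z) / pi(y).
p = [[0.7, 0.3], [0.4, 0.6]]
# stationary distribution of a 2-state chain, proportional to (p10, p01)
pi = [p[1][0] / (p[0][1] + p[1][0]), p[0][1] / (p[0][1] + p[1][0])]

p_dual = [[p[z][y] * pi[z] / pi[y] for z in range(2)] for y in range(2)]

for y in range(2):
    assert abs(sum(p_dual[y]) - 1.0) < 1e-12        # rows of p~ sum to one
for z in range(2):
    stat = sum(pi[y] * p_dual[y][z] for y in range(2))
    assert abs(stat - pi[z]) < 1e-12                # pi is stationary for p~
```

The two assertions are exactly the two claims in the text: p̃ is a genuine transition density, and Yn and Ỹn share the stationary distribution π.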

In the following, we write Fn(u) for F (Yn,u). For all 1 ≤ k ≤ n, let Fk:n := Fk ◦···◦Fn,

Fn:k := Fn ◦···◦Fk, where ◦ denotes the composition of functions. Denote Fn:n−1 as the identity on M. Hence

(3.5) Mn = Fn(Mn−1)=Fn:1(M0) for all n ≥ 0. Closely related to these forward iterations, and in fact a key tool to the analysis of ergodic property, is the following sequence of backward iterations

(3.6) M˜ n := F1:n(M0),n≥ 0.

The connection is established by the identity

(3.7) π(y) P(Mn ∈ · | Y0 = y) = π(z) P(M̃n ∈ · | Ỹ0 = z)

for all n ≥ 0. Put also Mn^u := Fn:1(u) and M̃n^u := F1:n(u) for u ∈ M, and note that

(3.8) ∫_{y∈Y} ∫_{z∈Y} P((Mn^u, M̃n^u)_{n≥0} ∈ · | Y0 = y, Ỹ0 = z) π(dy) π(dz)
= ∫_{y∈Y} ∫_{z∈Y} P((Mn, M̃n)_{n≥0} ∈ · | Y0 = y, Ỹ0 = z) π(dy) π(dz).
Note that in (3.8), the probability P denotes a joint probability.

We say that {Yn, n ≥ 0} is Harris recurrent if there exist a set A ∈ A, a probability measure Γ concentrated on A, and an ε with 0 < ε < 1 such that Py(Yn ∈ A i.o.) = 1 for all y ∈ Y, and furthermore there exists n such that

(3.9) P^n(y, A′) ≥ ε Γ(A′)

for all y ∈ A and all A′ ∈ A.

A central question for a MIRFS (Mn)n≥0 is under which conditions it stabilizes, that is, converges to a stationary distribution Π. The next theorem summarizes the results regarding this question.

Theorem 1 Let {Yn, n ≥ 0} be an aperiodic, irreducible and Harris recurrent Markov chain, and let (Mn)n≥0 be a MIRFS of Lipschitz functions. Suppose the initial distribution of Y0 is π, and

(3.10) E log l(F1) < 0 and E log⁺ d(F1(u0), u0) < ∞

for some u0 ∈ M. Then the following assertions hold.

(i) M˜ n converges a.s. to a random element M˜ ∞ which does not depend on the initial distribution.

(ii) Mn converges in distribution to M˜ ∞ under P. (iii) Define Π as the stationary distribution of (Y˜∞, M˜ ∞), then Π is the unique stationary probability of the Markov chain {(Yn,Mn),n≥ 0}.

(iv) (Mn)n≥0 is ergodic under PΠ, i.e. for any u ∈ M,
(3.11) (1/n) Σ_{k=1}^{n} g(Mk^u) −→ EΠ(g(M̃∞)) PΠ-a.s.
for all bounded continuous real-valued functions g on M.

A proof of Theorem 1 is given in Section 7.
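The pathwise convergence of the backward iterations in Theorem 1(i) is easy to see numerically for an affine MIRFS. The sketch below (all coefficients are illustrative, not from the paper) uses F(y, u) = a(y)·u + b(y) with |a(y)| < 1, so the contraction condition (3.10) holds; successive backward iterates freeze to machine precision, while the forward iterates would only converge in distribution.

```python
import random

# Affine MIRFS F(y, u) = a(y) * u + b(y) driven by a two-state chain.
a = {0: 0.5, 1: -0.3}      # Lipschitz constants |a(y)| < 1
b = {0: 1.0, 1: 2.0}

rng = random.Random(7)
ys, y = [], 0
for _ in range(60):
    if rng.random() > 0.8:          # switch regimes with probability 0.2
        y = 1 - y
    ys.append(y)

def backward(u, k):
    # backward iteration M~_k = F(Y_1, F(Y_2, ..., F(Y_k, u))), as in (3.6):
    # apply F(Y_k, .) first and F(Y_1, .) last
    for y in reversed(ys[:k]):
        u = a[y] * u + b[y]
    return u

back_diffs = [abs(backward(0.0, k + 1) - backward(0.0, k))
              for k in range(1, 60)]
# the k-th difference is damped by a product of k contraction factors
```

Each new innovation enters the backward composition on the inside, so its effect is multiplied by a product of contraction factors and dies out geometrically; that is precisely why M̃n converges almost surely.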

4 Central limit theorem and Edgeworth expansion for distributions of a Markovian iterated random functions system

Consider the Markovian iterated random functions system {(Yn, Mn), n ≥ 0} defined in (3.1). Abusing notation slightly, let g be an R^p-valued function on M. In this section, we study the central limit theorem and Edgeworth expansion for the sum Sn = Σ_{k=1}^{n} g(Mk) and for g(n^{−1}Sn), where g : R^p → R^q is a smooth function. Let w : Y → [1, ∞) be a measurable function, and let B be the Banach space of measurable functions h : Y → C (:= the set of complex numbers) with ‖h‖w := sup_y |h(y)|/w(y) < ∞. Assume further that {Yn, n ≥ 0} has a stationary distribution π with ∫ w(y) π(dy) < ∞, and

(4.1) lim_{n→∞} sup{ |E[h(Yn) | Y0 = y] − ∫ h(z) π(dz)| / w(y) : y ∈ Y, |h| ≤ w } = 0,

(4.2) sup_y {E[w(Y1) | Y0 = y] / w(y)} < ∞.
Condition (4.1) says that the chain is w-uniformly ergodic, which implies that there exist γ > 0 and 0 < ρ < 1 such that for all h ∈ B and n ≥ 1,

(4.3) sup_y |E[h(Yn) | Y0 = y] − ∫ h(z) π(dz)| / w(y) ≤ γ ρ^n ‖h‖w

(pages 382–383 and Theorem 16.0.1 of Meyn and Tweedie, 1993). We remark that for w = 1, condition (4.1) is the classical uniform ergodicity condition for {Yn, n ≥ 0}. The following assumption will be used throughout this section.
Assumption A.

A1. Let {Yn,n≥ 0} be an aperiodic, irreducible Markov chain satisfying conditions

(4.1)-(4.2). Furthermore, we assume the initial distribution of Y0 is π.

A2. The MIRFS (Mn)n≥0 has the mean contraction property, i.e.

E log L1 < 0.

A3. Assume that there exists u0 ∈ M for which

E d²(F1(u0), u0) < ∞ and sup_y E(L1 | Y0 = y) < 1.

REMARK 1: Assumption A1 is a condition on the underlying Markov chain, which is general enough to include several practically used models studied in Section 6. Assumption A2 is the standard mean contraction condition. Together with condition A3, these allow us to prove the central limit theorem and Edgeworth expansion for the distributions of a Markovian iterated random functions system. Recall Π defined in Theorem 1(iii), and denote Q(B) := Π(Y × B) for all B ∈ B(M). Let g ∈ L²(Q) be a square integrable function taking values in R^p with mean 0, i.e. g = (g1, ···, gp) with each gk a real-valued function on M, and

(4.4) ∫_M gk(u) Q(du) = 0, ‖gk‖₂² = ∫_M gk²(u) Q(du) < ∞,
for k = 1, ···, p. Consider the sequence

(4.5) Sn = Sn(g)=g(M1)+···+ g(Mn),n≥ 1, which may be viewed as a Markov random walk on the Markov chain {(Yn,Mn),n≥ 0}.

Since the induced Markov chain {(Yn, Mn), n ≥ 0} is not irreducible in general, we cannot apply standard limit theorems for ergodic Markov chains. Instead, we will first obtain a martingale representation of Sn, and then apply the martingale central limit theorem to get the desired result. This approach is summarized in Proposition 1 below, which is of independent interest. Another approach is based on the assumption of irreducibility. Note that there are two special properties of the Markov chain induced by the Markovian iterated random functions system (2.4)-(2.7). A positivity hypothesis on the matrices in the support of the Markov chain leads to contraction properties, on whose basis we will develop the spectral theory in this section. Another natural hypothesis is that the transition probability possesses a density. This leads to a classical situation in the context of the so-called "Doeblin condition" for Markov chains, and to precise results in the limiting theory. To prove the central limit theorem for (4.5) without an irreducibility assumption on the induced Markov chain {(Yn, Mn), n ≥ 0}, we need to impose a continuity assumption on the given function g. In this case, the particular given complete separable metric d on the space M will

not be essential but may rather be altered to our convenience (cf. Wu and Woodroofe, 2000). We consider flattened variations of d obtained by composing d with an arbitrary nondecreasing, concave function ψ : [0, ∞) → [0, ∞) with ψ(0) = 0 and ψ(t) > 0 for all t > 0. Let Ψ be the collection of all such functions. It is easy to see that dψ := ψ ◦ d is again a complete metric. Possible choices from Ψ include ψa(t) := t^a for any 0 < a ≤ 1 and ψ∗(t) := t/(1 + t), for which

(4.6) dψ∗(u, v) ≤ d(u, v) ≤ 2 dψ∗(u, v)
for all (u, v) ∈ M × M with d(u, v) ≤ 1. This shows that the behavior of d and dψ∗ is essentially the same for small values. Notice further that ψ∗ ◦ ψ ∈ Ψ with
(4.7) lim_{t↓0} ψ∗ ◦ ψ(t)/ψ(t) = 1
for all ψ ∈ Ψ. One can further relax the global Lipschitz continuity of g by requiring g to be a Q-almost surely locally Lipschitz continuous function (with respect to a flattened metric dψ), in combination with an integrability condition on the local Lipschitz constant. To make this precise, let ψ ∈ Ψ and, for a measurable g : M → R^p, define its local Lipschitz constant at u ∈ M with respect to dψ as
(4.8) lψ(g, u) := sup_{v : 0 < dψ(u,v) ≤ 1} |g(u) − g(v)| / dψ(u, v),
and set

(4.9) ‖g‖r,ψ := ‖lψ(g, ·)‖r,

where ‖·‖r denotes the usual norm on L^r(Q). It is easily seen that ‖·‖r,ψ defines a (pseudo-)norm on the space

(4.10) L^r_{ψ,0}(Q) := { g ∈ L^r(Q) : ∫_M g(u) Q(du) = 0 and ‖g‖r,ψ < ∞ },
and that L^r_{ψ,0}(Q) = L^r_{ψ∗◦ψ,0}(Q) with (1/2)‖·‖_{r,ψ∗◦ψ} ≤ ‖·‖_{r,ψ} ≤ ‖·‖_{r,ψ∗◦ψ} on this space (use (4.6) and (4.7)). Possibly after replacing ψ with ψ∗ ◦ ψ, we may therefore always assume ψ to be bounded when dealing with elements of L^r_{ψ,0}(Q). Denote L^r_ψ(Q) := {g ∈ L^r(Q) : ‖g‖r,ψ < ∞}. Plainly, all global Lipschitz functions, i.e. all g ∈ LLip(M, R^p), are elements of L^r_{ψ,0}(Q) for any ψ ∈ Ψ. However, g need not be continuous in order to be an element of some L^r_{ψ,0}(Q). As pointed out in Wu and Woodroofe (2000), if p = 1 and g = I_B is the indicator function of some B ∈ B(M), then
lψ(I_B, u) = 1/dψ(∂B, u),
where ∂B denotes the topological boundary of B and dψ(∂B, u) := inf_{v∈∂B} dψ(u, v). They further show that, if B(u, c) = {v : |u − v| ≤ c} is the closed c-ball with center u ∈ M, ψ(t) = t^{1/4} and λ denotes Lebesgue measure, then, for each u ∈ M, I_{B(u,c)} − Q(B(u, c)) ∈ L²_{ψ,0}(Q) for λ-almost all c > 0. By using a similar argument as in Theorem 2.4 of Benda (1998) and Theorem 2 of Wu and Woodroofe (2000), we have

Proposition 1 Let {(Yn, Mn), n ≥ 0} be the MIRFS defined in (3.1) satisfying Assumption A. Assume there exists ψ ∈ Ψ satisfying ∫₀¹ ψ(t)/t dt < ∞, and that g ∈ L²_{ψ,0}(Q) ∩ L^r_ψ(Q) for some r > 2. Then
(1/√n)(Sn − n EΠ(g(M̃∞))) −→ N(0, Σ) in distribution,
where Σ is the variance-covariance matrix.

Note that in Proposition 1, we only have the central limit theorem for the normalized Sn, without a characterization of the variance-covariance matrix Σ. This result is therefore difficult to apply for efficient likelihood estimation. In order to obtain more applicable results, we will explore the special structure of the Markovian iterated random functions system, and apply perturbation theory for the induced ergodic Markov chain to obtain the central limit theorem and Edgeworth expansion for the distribution of Sn.

Definition 2 Let w : Y → [1, ∞) be the weight function. For any measurable function ϕ : Y × M → [1, ∞), given u0 ∈ M, define
‖ϕ‖w := sup_{y∈Y, u∈M} |ϕ(y, u)| / (w(y)(1 + dψ(u0, u))),
and
‖ϕ‖l := sup_{y∈Y, u,v : 0<d(u,v)≤1} |ϕ(y, u) − ϕ(y, v)| / (w(y) dψ(u, v)).
We define H as the set of ϕ on Y × M for which ‖ϕ‖wl := ‖ϕ‖w + ‖ϕ‖l is finite, where wl represents a combination of the weighted variation norm and the bounded Lipschitz norm.

Let ν be an initial distribution of (Y0, M0) and let Eν denote expectation under the initial distribution ν on (Y0, M0). For ϕ ∈ H, g ∈ L²(Q), y ∈ Y, u ∈ M, and p × 1 vectors α = (α1, ···, αp)′ ∈ R^p, where ′ denotes the transpose, define the linear operators Tα, T, να and Q on the space H as

(4.11) (Tα ϕ)(y, u) = E{e^{iα′g(M1)} ϕ(Y1, M1) | Y0 = y, M0 = u},

(4.12) (Tϕ)(y, u) = E{ϕ(Y1, M1) | Y0 = y, M0 = u},
(4.13) να ϕ = Eν{e^{iα′ϕ(u)} ϕ(Y0, u)}, Qϕ = EΠ{ϕ(Y0, u)}.

In the case of a w-uniformly ergodic Markov chain, Fuh and Lai (2001) have shown that there exists a sufficiently small δ > 0 such that for |α| ≤ δ, H = H1(α) ⊕ H2(α) and

(4.14) Tα Qα ϕ = λ(α) Qα ϕ for all ϕ ∈ H,
where H1(α) is a one-dimensional subspace of H, λ(α) is the eigenvalue of Tα with corresponding eigenspace H1(α), and Qα is the parallel projection of H onto the subspace H1(α) in the direction of H2(α). The extension of their argument to the weight functions w and l defined in Definition 2 is given in Section 7, which also proves the following lemmas.

Lemma 1 Let {(Yn, Mn), n ≥ 0} be the MIRFS of Lipschitz functions defined in (3.1) satisfying Assumption A. Assume g ∈ L^r(Q) for some r > 2. Then T and Q are bounded linear operators on the Banach space H with norm ‖·‖wl, and satisfy

(4.15) ‖T^n − Q‖wl = sup_{ϕ∈H, ‖ϕ‖wl≤1} ‖T^n ϕ − Qϕ‖wl < γ∗ ρ∗^n,
for some γ∗ > 0 and 0 < ρ∗ < 1.

Lemma 2 Let {(Yn, Mn), n ≥ 0} be the MIRFS defined in (3.1) satisfying Assumption A, such that the induced Markov chain {(Yn, Mn), n ≥ 0}, with transition probability kernel (3.2), is irreducible, aperiodic and Harris recurrent. Assume g ∈ L^r(Q) for some r > 2. Then, there exists δ > 0 such that for α ∈ R^p with |α| < δ, and for ϕ ∈ H,

(4.16) Eν{e^{iα′Sn(g)} ϕ(Yn, Mn)} = να Tα^n ϕ = να Tα^n {Qα + (I − Qα)}ϕ = λ^n(α) να Qα ϕ + να Tα^n (I − Qα)ϕ,
and

(i) λ(α) is the unique eigenvalue of maximal modulus of Tα; (ii) Qα is a rank-one projection;

(iii) the mappings λ(α), Qα and I − Qα are analytic;
(iv) |λ(α)| > (2 + ρ∗)/3, and for each k ∈ N, the set of positive integers, there exists c > 0 such that for each n ∈ N and all j1, ···, jp with j1 + ··· + jp = k,
‖∂^k Tα^n(I − Qα) / ∂α1^{j1} ··· ∂αp^{jp}‖wl ≤ c ((1 + 2ρ∗)/3)^n;

(v) denote g = (g1, ···, gp), and let γj := lim_{n→∞} (1/n) Exu log ‖gj(Mn)‖, the upper Lyapunov exponent; it follows that

(4.17) γj = ∂λ(α)/∂αj |_{α=0} = ∫ Eyu gj(M1) Π(dy × du).

Note that in Lemma 2, we need the extra assumption that the induced Markov chain {(Yn, Mn), n ≥ 0}, with transition probability kernel (3.2), is irreducible, aperiodic and Harris recurrent. In Section 5, we will show that this condition is satisfied for the Markovian iterated random functions system (2.4)-(2.7). Under this assumption, we can remove the continuity condition and only require g to satisfy some moment conditions. For sums Sn = Σ_{k=1}^{n} g(Mk) of the MIRFS {(Yn, Mn), n ≥ 0}, by making use of the representation (4.16) for the characteristic function E(e^{iα′Sn} | Y0 = y, M0 = u), we will obtain Edgeworth expansions for the distribution of Sn. These Edgeworth expansions involve the derivatives of λ(α) and Qα h1(y) (h1 := 1) at α = 0, as analogues of the cumulants of S1 in the classical Edgeworth expansions for sums of i.i.d. random variables.

Lemma 1 implies that {(Yn, Mn), n ≥ 0} is geometrically mixing, in the sense that there exist r1 > 0 and 0 < γ1 < 1 such that for all y ∈ Y, u ∈ M, k ≥ 0 and n ≥ 1, and for all real-valued measurable functions ϕ1, ϕ2 with ‖ϕ1²‖wl < ∞ and ‖ϕ2²‖wl < ∞,

(4.18) ‖E{ϕ1(Yk, Mk) ϕ2(Yk+n, Mk+n) | Y0 = y, M0 = u}
− E{ϕ1(Yk, Mk) | Y0 = y, M0 = u} E{ϕ2(Yk+n, Mk+n) | Y0 = y, M0 = u}‖wl ≤ r1 γ1^n.

Let ϕ̃1, ϕ̃2 be real-valued measurable functions on (Y × M) × (Y × M). Denote ϕ1(z, v) = E{ϕ̃1((z, v), (Y1, M1)) | Y0 = z, M0 = v}, and note that

E{ϕ̃1((Yk, Mk), (Yk+1, Mk+1)) | Y0 = y, M0 = u} = E{ϕ1(Yk, Mk) | Y0 = y, M0 = u}.

The same proof as that of Theorem 16.1.5 of Meyn and Tweedie (1993) can be used to show that there exist r1 > 0 and 0 < γ1 < 1 such that for all y ∈ Y, u ∈ M, k ≥ 0 and n ≥ 1, and for all measurable ϕ̃1, ϕ̃2 with ‖sup_{z,v} ϕ̃1²((y, u), (z, v))‖wl < ∞ and ‖sup_{z,v} ϕ̃2²((y, u), (z, v))‖wl < ∞,

(4.19) ‖E{ϕ̃1((Yk, Mk), (Yk+1, Mk+1)) ϕ̃2((Yk+n, Mk+n), (Yk+n+1, Mk+n+1)) | Y0 = y, M0 = u}

− E{ϕ1(Yk, Mk) | Y0 = y, M0 = u} E{ϕ2(Yk+n, Mk+n) | Y0 = y, M0 = u}‖wl
≤ r1 γ1^{n−1}.

To establish the Edgeworth expansion for a Markovian iterated random functions system, we shall make use of (4.19) in conjunction with the following extension of Cramér's (strongly nonlattice) condition:

(4.20) inf_{|v|>α} |1 − Eπ{exp(iv′S1(g))}| > 0 for all α > 0.

In addition, we also assume the conditional Cramér (strong nonlattice) condition of Götze and Hipp (1983), (2.5) on page 216: there exists δ > 0 such that for all m, n = 1, 2, ··· with δ^{−1} < m < n,

(4.21) Eπ|E{exp(iα′(g(Mn−m) + ··· + g(Mn+m))) | (Yn−m, Mn−m), ···, (Yn−1, Mn−1), (Yn+1, Mn+1), ···, (Yn+m, Mn+m), (Yn+m+1, Mn+m+1)}| ≤ e^{−δ}.

Let
(4.22) γ = ∫ Eyu g(M1) Π(dy × du) (= λ′(0)).

Let V = (∂²λ(α)/∂αi∂αj |_{α=0})_{1≤i,j≤p} be the Hessian matrix of λ at 0. By Lemma 2,

(4.23) lim_{n→∞} n^{−1} Eν{(Sn − nγ)(Sn − nγ)′} = V.

Let ψn(α) = Eν(e^{iα′Sn}). Then by Lemma 2 and the fact that να Qα h1 has continuous partial derivatives of order r − 2 in some neighborhood of α = 0, we have the Taylor series expansion of ψn(α/√n) for |α/√n| ≤ ε (some sufficiently small positive number):

(4.24) ψn(α/√n) = {1 + Σ_{j=1}^{r−2} n^{−j/2} π̃j(iα)} e^{−α′Vα/2} + o(n^{−(r−2)/2}),
where π̃j(iα) is a polynomial in iα of degree 3j, whose coefficients are smooth functions of the partial derivatives of λ(α) at α = 0 up to order j + 2 and those of να Qα h1 at α = 0 up to order j. Letting D denote the p × 1 vector whose jth component is the partial differentiation operator Dj with respect to the jth coordinate, define the differential operator π̃j(−D). As in the case of sums of i.i.d. zero-mean random vectors (cf. Bhattacharya and Rao (1976)), we obtain an Edgeworth expansion for the "formal density" of the distribution of Sn by replacing the π̃j(iα) and e^{−α′Vα/2} in (4.24) by π̃j(−D) and φV(y), respectively, where φV is the density function of the p-variate normal distribution with mean 0 and covariance matrix V. Throughout the sequel we let Pν denote the probability measure under which (Y0, M0) has initial distribution ν.

Theorem 2 Let {(Yn, Mn), n ≥ 0} be the MIRFS defined in (3.1) satisfying Assumption A, such that the induced Markov chain {(Yn, Mn), n ≥ 0}, with transition probability kernel (3.2), is irreducible, aperiodic and Harris recurrent. Assume g ∈ L^r(Q) for some r > 2. Let φj,V = π̃j(−D)φV for j = 1, ..., r − 2. For 0 < a ≤ 1 and c > 0, let Ba,c be the class of all Borel subsets B of R^p such that ∫_{(∂B)^ε} φV(y) dy ≤ c ε^a for every ε > 0, where ∂B denotes the boundary of B and (∂B)^ε denotes its ε-neighborhood. Then
(4.25) sup_{B∈Ba,c} |Pν{(Sn − nγ)/√n ∈ B} − ∫_B {φV(y) + Σ_{j=1}^{r−2} n^{−j/2} φj,V(y)} dy| = o(n^{−(r−2)/2}).
A proof of Theorem 2 is given in Section 7. Taking r = 2 in Theorem 2, we have
Corollary 1 With the same notation and assumptions as in Theorem 2,
(1/√n)(Sn − nγ) −→ N(0, Σ) in distribution,
where
(4.26) Σ = (∂²λ(α)/∂αi∂αj |_{α=0})_{i,j=1,···,p}
is the variance-covariance matrix.

In statistical applications, one often works with g(n^{−1}Sn) instead of Sn = Σ_{k=1}^{n} g(Mk), where g : R^p → R^q is sufficiently smooth in some neighborhood of the mean γ := (γ1, ···, γp). Denote g = (g1, ···, gq), with each gi, 1 ≤ i ≤ q, a real-valued function on R^p. For the case of sums of i.i.d. random variables, Bhattacharya and Ghosh (1978) make use of the Edgeworth expansion of the distribution of (Sn − nγ)/√n to derive an Edgeworth expansion of the distribution of √n{g(n^{−1}Sn) − g(γ)}. Making use of Theorem 2 and a straightforward extension of their argument, we can generalize their result to the case where Sn is generated by a Markovian iterated random functions system.

Theorem 3 With the same notation and assumptions as in Theorem 2, suppose that g : p q R → R has continuous partial derivatives of order r in some neighborhood of γ.LetJg = 0 (Djgi(γ))1≤i≤q,1≤j≤p be the q × p Jacobian matrix and let V (g)=JgVJg. Then √ r−2 −1 Z −j/2 sup |Pν { n(g(n Sn) − g(γ)) ∈ B}− {φV (g)(y)+X n φj,V,g(y)}dy| B∈Ba,c B j=1 (4.27) = o(n−(r−2)/2),

where φ_{j,V,g} = π̃_{j,g}(−D)φ_V and π̃_{j,g}(y) is a polynomial in y (∈ R^p) whose coefficients are smooth functions of the partial derivatives of λ(α) at α = 0 up to order j + 2, those of ν_α Q_α h_1 at α = 0 up to order j, and those of g at µ up to order j + 1.

In the next theorem, we consider p = 1.

Theorem 4 Let {(Y_n, M_n), n ≥ 0} be the MIRFS defined in (2.1), satisfying assumption A, such that the induced Markov chain {(Y_n, M_n), n ≥ 0}, with transition probability kernel (3.2), is irreducible, aperiodic and Harris recurrent. Assume g ∈ L^r(Q) for some r > 2. Then

(4.28)  [1 − P_ν{(S_n − nγ)/√n ≤ t}] / [1 − Φ(t)] = exp{(t³/√n) ϕ(t/√n)} (1 + O(t/√n)),

and

(4.29)  P_ν{(S_n − nγ)/√n ≤ −t} / Φ(−t) = exp{−(t³/√n) ϕ(−t/√n)} (1 + O(t/√n)),

where Φ(t) is the standard normal distribution function, and ϕ(t) is a power series which converges for t of sufficiently small absolute value.

Theorem 4 states moderate deviation results for the distributions of MIRFS, which will be used to prove the Edgeworth expansion for the MLE in Section 5. Since the proof is a straightforward generalization of Theorem 6 in Nagaev (1961), it will not be repeated here.

5 Efficient likelihood estimation

For a given state space model defined in (2.1) which involves several parameters θ = (θ_1, ···, θ_q), the estimation problem we consider in this section is that of estimating one of the parameters at a time, with the other parameters playing the role of nuisance parameters. Recall p_n = p_n(ξ_0, ξ_1, ···, ξ_n; θ) defined in (2.3). When ∂ log p_n/∂θ exists, one can seek solutions of the likelihood equation

(5.1)  ∂ log p_n / ∂θ = 0.

The following conditions will be used throughout the rest of this paper.

C1. For all θ ∈ Θ, the Markov chain {(X_n, ξ_n), n ≥ 0} defined in (2.1) and (2.2) is aperiodic, irreducible and satisfies (4.1) and (4.2).

C2. For all x ∈ X, θ ∈ Θ and i, j, k = 1, ···, q, the partial derivatives

∂f(ξ_0; ϕ_x(θ))/∂θ_i,  ∂²f(ξ_0; ϕ_x(θ))/∂θ_i∂θ_j,  ∂³f(ξ_0; ϕ_x(θ))/∂θ_i∂θ_j∂θ_k  exist,

as well as the partial derivatives

∂f(ξ_1; ϕ_x(θ)|ξ_0)/∂θ_i,  ∂²f(ξ_1; ϕ_x(θ)|ξ_0)/∂θ_i∂θ_j,  ∂³f(ξ_1; ϕ_x(θ)|ξ_0)/∂θ_i∂θ_j∂θ_k  exist,

and for all x, y ∈ X, θ → p_θ(x, y) and θ → π_θ(x) have two continuous derivatives in some neighborhood N_δ(θ_0) := {θ : |θ − θ_0| < δ} of θ_0.

C3.

∫_X sup_{θ∈N_δ(θ_0)} |∂π_θ(x)/∂θ_i| m(dx) < ∞,  ∫_X sup_{θ∈N_δ(θ_0)} |∂²π_θ(x)/∂θ_i∂θ_j| m(dx) < ∞,

and for all x ∈ X,

∫_X sup_{θ∈N_δ(θ_0)} |∂p_θ(x, y)/∂θ_i| m(dy) < ∞,  ∫_X sup_{θ∈N_δ(θ_0)} |∂²p_θ(x, y)/∂θ_i∂θ_j| m(dy) < ∞,

for all i, j = 1, ···, q.

C4. For all x ∈ X, ξ_0, ξ_1 ∈ R^d and θ ∈ Θ,

E_x^θ |∂f(ξ_0; ϕ_x(θ))/∂θ_i| < ∞,  E_x^θ |∂²f(ξ_0; ϕ_x(θ))/∂θ_i∂θ_j| < ∞,

E_x^θ |∂f(ξ_1; ϕ_x(θ)|ξ_0)/∂θ_i| < ∞,  E_x^θ |∂²f(ξ_1; ϕ_x(θ)|ξ_0)/∂θ_i∂θ_j| < ∞.

Furthermore, we assume that for all x ∈ X, ξ_0, ξ_1 ∈ R^d and uniformly for θ ∈ N_δ(θ_0),

|∂³ log f(ξ_0; ϕ_x(θ))/∂θ_i∂θ_j∂θ_k| < H_ijk(x, ξ_1),  |∂³ log f(ξ_1; ϕ_x(θ)|ξ_0)/∂θ_i∂θ_j∂θ_k| < G_ijk(x, ξ_1),

where H_ijk and G_ijk are such that E_x^{θ_0} H_ijk(x, ξ_1) < ∞ and E_x^{θ_0} G_ijk(x, ξ_1) < ∞, for all i, j, k = 1, ···, q and x ∈ X, where E_x^{θ_0} is the expectation defined under P^{θ_0}(·, ·) in (2.1) with initial state X_0 = x.

C5.

sup_{x∈X} E_x^{θ_0} [ sup_{|θ−θ_0|<δ} sup_{x,y∈X} ( f(ξ_0; ϕ_x(θ)) f(ξ_1; ϕ_x(θ)|ξ_0) / ( f(ξ_0; ϕ_y(θ)) f(ξ_1; ϕ_y(θ)|ξ_0) ) )² ] < ∞.

C6. The equality p_n(ξ_0, ξ_1, ···, ξ_n; θ) = p_n(ξ_0, ξ_1, ···, ξ_n; θ_0) holds P^{θ_0}-almost surely, for all non-negative n, if and only if θ = θ_0.

C7. For all x, y ∈ X, θ → p_θ(x, y), θ → π_θ(x) and θ → ϕ_x(θ) are continuous, and θ → f(ξ_0; ϕ_x(θ)) as well as θ → f(ξ_1; ϕ_x(θ)|ξ_0) are continuous for all x ∈ X and ξ_0, ξ_1 ∈ R^d. Furthermore, for all x ∈ X and ξ_0, ξ_1 ∈ R^d, f(ξ_0; ϕ_x(θ)) → 0 and f(ξ_1; ϕ_x(θ)|ξ_0) → 0 as |θ| → ∞.

C8. E_x^{θ_0} |log f(ξ_0; ϕ_x(θ_0)) f(ξ_1; ϕ_x(θ_0)|ξ_0)| < ∞ for all x ∈ X.

C9. For each θ ∈ Θ, there is δ > 0 such that E_x^{θ_0}[ sup_{|θ′−θ|<δ} (log f(ξ_0; ϕ_x(θ′)) f(ξ_1; ϕ_x(θ′)|ξ_0))⁺ ] < ∞, and there is a b > 0 such that E_x^{θ_0}[ sup_{|θ′|>b} (log f(ξ_0; ϕ_x(θ′)) f(ξ_1; ϕ_x(θ′)|ξ_0))⁺ ] < ∞ for all x ∈ X, where a⁺ = max{a, 0}.

REMARK 2: Condition C1 is the w-uniform ergodicity condition for the underlying Markov chain, which is considerably weaker than the uniform recurrence condition A1 of Jensen and Petersen (1999) and that of Douc and Matias (2001). Note that Hamilton's Markov-switching model satisfies this condition; cf. Fuh (2003). Other practically used models, namely ARMA models, (G)ARCH models and SV models, will be given in Section 6. C2-C4 are standard

smoothness conditions. C5 is a technical condition for the existence of the Fisher information to be defined in (5.6) below. C8-C9 are integrability conditions that will be used to prove strong consistency of the MLE. Condition C6 is the identifiability condition for state space models.

That is, the family of mixtures {f(ξ_1; ϕ_x(θ)|ξ_0) : θ ∈ Θ} is identifiable. This condition will be used to prove strong consistency of the MLE. Although it is difficult to check this condition in a general state space model, in many models of interest the parameter itself is identifiable only up to a permutation of states, as in finite state hidden Markov models with normal distributions. A sufficient condition for identifiability can be found in Theorem 1 of Douc and Matias (2001). See also the paper by Ito et al. (1992) for necessary and sufficient conditions in the case that ξ_i is a deterministic function of X_i and the state space is finite.

Let {(Xn,ξn),n≥ 0} be the Markov chain defined in (2.1) and (2.2). For given observations

ξ0,ξ1, ···,ξn from the state space model {ξn,n ≥ 0}, recall from (2.8) that the log likelihood can be written as

(5.2)  l(θ) = log p_n(ξ_0, ξ_1, ···, ξ_n; θ) = log ||P_θ(ξ_n) ∘ ··· ∘ P_θ(ξ_1) ∘ P_θ(ξ_0)π(x)||

= log [ ||P_θ(ξ_n) ∘ ··· ∘ P_θ(ξ_1) ∘ P_θ(ξ_0)π(x)|| / ||P_θ(ξ_{n−1}) ∘ ··· ∘ P_θ(ξ_1) ∘ P_θ(ξ_0)π(x)|| ] + ··· + log [ ||P_θ(ξ_1) ∘ P_θ(ξ_0)π(x)|| / ||P_θ(ξ_0)π(x)|| ].

For each n, denote

(5.3)  M_n := P_θ(ξ_n) ∘ ··· ∘ P_θ(ξ_1) ∘ P_θ(ξ_0)

as the Markovian iterated random functions system on M induced from (2.4)-(2.7). Then {((X_n, ξ_n), M_n), n ≥ 0} is a Markov chain on the state space (X × R^d) × M, with transition probability kernel P_θ defined as in (3.2). Let Π_θ be the stationary distribution of {((X_n, ξ_n), M_n), n ≥ 0} defined in Theorem 1(iii). Then the log likelihood function l(θ) can be written as S_n := Σ_{k=1}^n g(M_{k−1}, M_k) with

(5.4)  g(M_{k−1}, M_k) := log [ ||P_θ(ξ_k) ∘ ··· ∘ P_θ(ξ_1) ∘ P_θ(ξ_0)π(x)|| / ||P_θ(ξ_{k−1}) ∘ ··· ∘ P_θ(ξ_1) ∘ P_θ(ξ_0)π(x)|| ].

In order to apply Theorems 1-4 in Sections 3 and 4, we need to check that the Markovian iterated random functions system satisfies assumption A, and that the induced Markov chain is aperiodic, irreducible and Harris recurrent. Recall from (2.4) that

M = {h | h : X → R⁺ is m-measurable and ∫_X h(x) m(dx) < ∞}.

For convenience of notation, we will use h to represent elements in M, which is different from the notation u used in Sections 3 and 4. We define the variation distance between any two elements h1,h2 in M by

(5.5)  d(h_1, h_2) = sup_{x∈X} |h_1(x) − h_2(x)|.

First we note that (M, d) defined in (2.4) and (5.5) is a complete metric space with Borel σ-algebra B(M), but it is not separable. Thus Theorems 1-4 do not apply directly. However, rather

than deal with the measure-theoretic technicalities created by an inseparable space, we can apply the results developed in Section 7 of Diaconis and Freedman (1999) for a direct argument of convergence. Therefore, Theorems 1-4 still hold under the regularity conditions.
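When the state space X is finite, say X = {1, ···, d}, each operator P_θ(ξ_k) reduces to a d × d matrix and the representation (5.2)-(5.4) can be evaluated by a normalized forward recursion: renormalizing the vector at each step leaves the norm ratios g(M_{k−1}, M_k) unchanged. The following sketch (a hypothetical two-state chain with Gaussian emissions; the parameter values are illustrative and not from this paper) makes this concrete:

```python
import numpy as np

def log_likelihood(xi, P, pi0, means, sd):
    """Log likelihood of a finite-state state space model, computed as the
    sum of log L1-norm ratios in (5.2)-(5.4) via a normalized forward pass."""
    def emis(x):
        # diagonal matrix of Gaussian emission densities f(x | state)
        dens = np.exp(-(x - means) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))
        return np.diag(dens)

    v = emis(xi[0]) @ pi0            # P_theta(xi_0) applied to the initial law
    ll = np.log(v.sum())             # log ||P_theta(xi_0) pi||
    v /= v.sum()
    for x in xi[1:]:
        w = emis(x) @ P.T @ v        # one more iterated random function
        ll += np.log(w.sum())        # log norm ratio g(M_{k-1}, M_k) of (5.4)
        v = w / w.sum()              # renormalize; the ratios are unaffected
    return ll
```

Because only norm ratios enter the sum, the recursion is numerically stable for long series, whereas the unnormalized product ||P_θ(ξ_n) ∘ ··· ∘ P_θ(ξ_0)π|| underflows.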

Lemma 3 Assume C1-C5 hold or C1, C6-C9 hold. Then for each θ ∈ Θ and j = 1, ···, n, the random functions P_θ(ξ_0) and P_θ(ξ_j), defined in (2.5) and (2.6), from (X × R^d) × M to M are Lipschitz continuous in the second argument, and the Markovian iterated random functions system (2.4)-(2.7) satisfies assumption A. Furthermore, the function g defined in (5.4) belongs to L^r(Q) for any r > 0.

A proof of Lemma 3 is given in Section 8.

For each θ ∈ Θ, recall that {((Xn,ξn),Mn),n ≥ 0} is a Markov chain, induced by the Markovian iterated random functions system (2.4)-(2.7), on the state space (X×Rd) × M.

Lemma 4 Assume C1-C5 hold or C1, C6-C9 hold. Then for each θ ∈ Θ, {((Xn,ξn),Mn),n≥ 0} is an aperiodic, m × Q-irreducible and Harris recurrent Markov chain.

A proof of Lemma 4 is given in Section 8.

Lemma 5 Assume C1-C5 hold. Then the Fisher information matrix

(5.6)  I(θ) = (I_ij(θ)) = ( E_Π^θ [ (∂ log ||P_θ(ξ_1) ∘ P_θ(ξ_0)π(x)|| / ∂θ_i) (∂ log ||P_θ(ξ_1) ∘ P_θ(ξ_0)π(x)|| / ∂θ_j) ] )

is positive definite for θ in a neighborhood N_δ(θ_0) of θ_0. Recall that E_Π^θ := E_Π is defined as the expectation under P_Π in (3.2), with the initial distribution taken to be the stationary distribution Π.

A proof of Lemma 5 is given in Section 8.

REMARK 3: Note that the Fisher information (5.6) is defined as an expected value under the stationary distribution Π^{θ_0} of the Markov chain {((X_n, ξ_n), M_n), n ≥ 0}. It is worth mentioning that only ξ_n appears in M_n, which reflects the nature of state space models. The key idea behind the representation of the Fisher information (5.6) is that the log likelihood function can be rewritten as an additive functional of the Markov chain {((X_n, ξ_n), M_n), n ≥ 0} as in (5.4); one then uses the strong law of large numbers for Markovian iterated random functions given in Theorem 1(iv) to obtain, with probability 1,

lim_{n→∞} (1/n) ∂²/∂θ_i∂θ_j log ||P_θ(ξ_n) ∘ ··· ∘ P_θ(ξ_1) ∘ P_θ(ξ_0)π(x)|| |_{θ=θ_0} = −I_ij(θ_0).

Lemma 6 Assume C1-C5 hold. Let l_j′(θ_0) = ∂l(θ)/∂θ_j |_{θ=θ_0}. Then, as n → ∞,

(5.7)  (1/√n) ( l_j′(θ_0) )_{j=1,···,q} → N(0, I(θ_0)) in distribution,

where I(θ_0) is the Fisher information matrix (5.6).

A proof of Lemma 6 is given in Section 8.

Theorem 5 Assume C1-C5 hold. Then there exists a sequence of solutions θ̂_n of (5.1) such that θ̂_n → θ_0 in probability. Furthermore, √n(θ̂_n − θ_0) is asymptotically normally distributed with mean zero and variance-covariance matrix I^{−1}(θ_0).

Since the proof of Theorem 5 follows a standard argument, we will not give it here.

Corollary 2 Under the assumptions of Theorem 5, if the likelihood equation has a unique root for each n and all ξ_0, ξ_1, ···, ξ_n, then there is a consistent sequence of estimators θ̂_n of the unknown parameter θ_0.

Next, we prove strong consistency of the MLE when the log likelihood function is integrable. A crucial step is to give an appropriate definition of the Kullback-Leibler information for state space models, so that we can apply Theorem 1 in a standard argument for strong consistency of the MLE. Here, we define the Kullback-Leibler information as

(5.8)  K(θ_0, θ) = E_Π^{θ_0} [ log ( ||P_{θ_0}(ξ_1) ∘ P_{θ_0}(ξ_0)π_{θ_0}(x)|| / ||P_θ(ξ_1) ∘ P_θ(ξ_0)π_θ(x)|| ) ]

:= ∫ log ( ||P_{θ_0}(ξ_1) ∘ P_{θ_0}(ξ_0)π_{θ_0}(x)|| / ||P_θ(ξ_1) ∘ P_θ(ξ_0)π_θ(x)|| ) Π(d(x, ξ) × dπ_{θ_0}).

REMARK 4: Denote by p(ξ_1|ξ_0, ξ_{−1}, ···; θ) the conditional probability of ξ_1 given ξ_0, ξ_{−1}, ···, and by P^θ(X_0 = ·|ξ_0, ξ_{−1}, ···) the filtered probabilities given ξ_0, ξ_{−1}, ···. When the state space X of the Markov chain {X_n, n ≥ 0} is finite, the Kullback-Leibler distance is K(θ_0, θ) = H(θ_0, θ_0) − H(θ_0, θ), where H(θ_0, θ) is the relative entropy, intuitively defined as

H(θ_0, θ) = E^{θ_0}(log p(ξ_1|ξ_0, ξ_{−1}, ···; θ))

= E^{θ_0} [ log Σ_{x=1}^d Σ_{y=1}^d f(ξ_1|ξ_0; ϕ_y(θ)) p_{xy} P^θ(X_0 = x|ξ_0, ξ_{−1}, ···) ]

= E^{θ_0}(log ||M_1(θ) P^θ(X_0 = ·|ξ_0, ξ_{−1}, ···)||).

This shows that H(θ_0, θ) is defined as an expectation under the stationary distribution of the filtered probabilities P^θ(X_0 = ·|ξ_0, ξ_{−1}, ···), when run under a process having parameter θ_0. A similar interpretation can be made for (5.8), where the Kullback-Leibler information number is defined under Π^{θ_0}. Note that a Markov chain {((X_n, ξ_n), M_n), n ≥ 0} started from its stationary distribution Π^{θ_0} is equivalent to one involving the whole past information {ξ_0, ξ_{−1}, ···}.

Theorem 6 Assume that C1, C6-C9 hold and let θ̂_n be the MLE based on n observations ξ_0, ξ_1, ···, ξ_n. Then θ̂_n → θ_0 P^{θ_0}-a.s. as n → ∞.

Since the proof of Theorem 6 follows a standard argument, we will not give it here.

To derive the Edgeworth expansion for the MLE, we first need the following notation and assumptions. For nonnegative integer vectors ν = (ν^{(1)}, ···, ν^{(q)}), write |ν| = ν^{(1)} + ··· + ν^{(q)}, ν! = ν^{(1)}! ··· ν^{(q)}!, and let D^ν = (D_1)^{ν^{(1)}} ··· (D_q)^{ν^{(q)}} denote the νth derivative with respect to θ. Suppose assumptions C2, C3, C4 and C5 are strengthened as follows.

C2'. Let r ≥ 5. The true parameter θ_0 is an interior point of Θ. For all x ∈ X, ξ_0, ξ_1 ∈ R^d and θ ∈ Θ ⊂ R^q, the partial derivatives

D^1 f(ξ_0; ϕ_x(θ)), D^2 f(ξ_0; ϕ_x(θ)), ···, D^r f(ξ_0; ϕ_x(θ))  exist,

as well as the partial derivatives

D^1 f(ξ_1; ϕ_x(θ)|ξ_0), D^2 f(ξ_1; ϕ_x(θ)|ξ_0), ···, D^r f(ξ_1; ϕ_x(θ)|ξ_0)  exist,

and for all x, y ∈ X, θ → p_θ(x, y) and θ → π_θ(x) have r − 1 continuous derivatives in some neighborhood N_δ(θ_0) := {θ : |θ − θ_0| < δ} of θ_0.

C3'.

∫_X sup_{θ∈N_δ(θ_0)} |D^1 π_θ(x)| m(dx) < ∞, ···, ∫_X sup_{θ∈N_δ(θ_0)} |D^{r−1} π_θ(x)| m(dx) < ∞,

and for all x ∈ X,

∫_X sup_{θ∈N_δ(θ_0)} |D^1 p_θ(x, y)| m(dy) < ∞, ···, ∫_X sup_{θ∈N_δ(θ_0)} |D^{r−1} p_θ(x, y)| m(dy) < ∞.

C4'. For all x ∈ X, ξ_0, ξ_1 ∈ R^d and θ ∈ Θ,

E_x^θ |D^ν f(ξ_0; ϕ_x(θ))|^r < ∞,  E_x^θ |D^ν f(ξ_1; ϕ_x(θ)|ξ_0)|^r < ∞,  for 1 ≤ |ν| ≤ r,

and

E_x^θ ( sup_{θ∈N_δ(θ_0)} |D^ν f(ξ_0; ϕ_x(θ))|^r ) < ∞,  E_x^θ ( sup_{θ∈N_δ(θ_0)} |D^ν f(ξ_1; ϕ_x(θ)|ξ_0)|^r ) < ∞,  for |ν| = r + 1.

C5'.

sup_{x∈X} E_x^{θ_0} [ sup_{|θ−θ_0|<δ} sup_{x,y∈X} ( f(ξ_0; ϕ_x(θ)) f(ξ_1; ϕ_x(θ)|ξ_0) / ( f(ξ_0; ϕ_y(θ)) f(ξ_1; ϕ_y(θ)|ξ_0) ) )^r ] < ∞.

We will assume the Cramér condition (4.20) and the conditional Cramér condition (4.21) hold for Z_j^{(ν)} := D^ν log p_1(ξ_0, ξ_1; θ_0), 1 ≤ |ν| ≤ r, where ξ_0, ξ_1 are treated as random variables here. Let Z_j := {Z_j^{(ν)} : 1 ≤ |ν| ≤ r} be a p-dimensional random vector for j ≥ 1, where p is the number of all distinct multi-indices ν, 1 ≤ |ν| ≤ r. In the following, denote Z̄ = (1/n) Σ_{j=1}^n Z_j. A standard argument involving the sign change of a continuous function, or a fixed point theorem in the multiparameter case (cf. Bhattacharya and Ghosh, 1978), proves that the likelihood equation has a solution which converges in probability to θ_0. Note that the following notations are interpreted in the multi-dimensional sense. Applying the moderate deviation result on Z̄ in Theorem 4 of Section 4, it is possible to ensure that, with P^{θ_0}-probability 1 − o(n^{−1}), θ̂ satisfies the likelihood equation and lies in (θ_0 ± log n/√n). It is this solution we take as our θ̂_n. If the likelihood equation has multiple roots, assume we have a consistent estimator T_n such that T_n lies in (θ_0 ± log n/√n) with P^{θ_0}-probability 1 − o(n^{−1}).

In this case, we may take the solution nearest to T_n. By the preceding reasoning, this solution, which is identifiable from the sample, will lie in (θ_0 ± log n/√n) with P^{θ_0}-probability 1 − o(n^{−1}). Clearly, with θ̂ as above, with probability 1 − o(n^{−1}),

(5.9)  0 = Z̄^{(e_s)} + Σ_{1≤|ν|≤r−1} (1/ν!) Z̄^{(e_s+ν)} (θ̂_n − θ_0)^ν + R_{n,s}(θ̂_n),  1 ≤ s ≤ q,

where e_s has 1 as the s-th coordinate and zeros otherwise. We rewrite equation (5.9) as

(5.10)  0 = A(Z̄, θ̂_n) + R_n.

Note that 0 = A(γ(θ_0), θ_0) and

∂A/∂θ |_{(γ(θ_0), θ_0)} = −(Fisher information) ≠ 0.

Hence, by the implicit function theorem, there is a neighborhood N of γ and q uniquely defined real-valued infinitely differentiable functions g_i (1 ≤ i ≤ q) on N such that θ = g(z) =

(g_1(z), ···, g_q(z)) satisfies (5.10). This implies, with probability 1 − o(n^{−1}),

|θ̂ − θ_0| ≤ K(log n/√n).

To derive the asymptotic expansion of P^{θ_0}{√n(θ̂_n − θ_0) ∈ B}, note that θ̂ = g(n^{−1} Σ_{j=1}^n Z_j) = g(Z̄), where g : R^p → R^q is sufficiently smooth in some neighborhood of γ. For the case of i.i.d. ξ_n, Bhattacharya and Ghosh (1978) make use of the Edgeworth expansion of the distribution of (S_n − nγ)/√n to derive an Edgeworth expansion of the distribution of √n{g(n^{−1}S_n) − g(γ)}. Making use of Theorem 4 and a straightforward extension of their argument, we can generalize their result to obtain the following theorem.

Theorem 7 Assume C1, C2'-C5' hold for some r ≥ 3, and assume the (conditional) Cramér conditions hold. Let J_g = (D_j g_i(γ))_{1≤i≤q, 1≤j≤p} be the q × p Jacobian matrix and let V(g) = J_g V J_g'. Then there exists a sequence of solutions θ̂_n of (5.1), and there exist polynomials p_j in q variables (1 ≤ j ≤ r − 2), such that

(5.11)  sup_{B ∈ B_{a,c}} | P_ν^{θ_0}{√n(θ̂_n − θ_0) ∈ B} − ∫_B {φ_{V(g)}(y) + Σ_{j=1}^{r−2} n^{−j/2} φ_{j,V,g}(y)} dy | = o(n^{−(r−2)/2}),

where φ_{j,V,g} = π̃_{j,g}(−D)φ_V and π̃_{j,g}(y) is a polynomial in y (∈ R^p) whose coefficients are smooth functions of the partial derivatives of λ(α) at α = 0 up to order j + 2, those of ν_α Q_α h_1 at α = 0 up to order j, and those of g at µ up to order j + 1.

The application of Theorem 7 to the third-order efficiency of the MLE and of the third-order efficient approximate solution of the likelihood equation follows directly from Ghosh (1994).

6 Examples

From a theoretical point of view, Theorems 5-7 are adequate for estimation problems in state space models, in providing assurance of the existence of efficient estimators, characterizing them as solutions of likelihood equations, and prescribing their asymptotic behavior. In practice, however, one must still contend with certain statistical and numerical difficulties, such as the implementation of the maximum likelihood estimator. In this section, we apply our results to study some examples, which include ARMA models, (G)ARCH models, and SV models. For simplicity, in these examples we only consider a specific structure with a normal error assumption in the given model. Although strong consistency and asymptotic normality of the MLE in ARMA and GARCH(p, q) models have been known in the literature, we provide alternative proofs in the framework of state space models. Furthermore, we can apply Theorem 7 to obtain an Edgeworth expansion for the MLE. To the best of our knowledge, the asymptotic normality of the MLE in the AR(1)/ARCH(1) model, considered in subsection 6.2, seems to be new. The results on asymptotic properties of the MLE in the stochastic volatility model not only provide theoretical justification, but also give some insight into the structure of the likelihood function, which can be used for further study.

6.1 ARMA models

We start with a univariate Gaussian causal ARMA(p, q) model, which can be written as a state space model by defining r = max{p, q + 1}:

ξ_n − µ = α_1(ξ_{n−1} − µ) + α_2(ξ_{n−2} − µ) + ··· + α_r(ξ_{n−r} − µ)

(6.1)        + ε_n + β_1 ε_{n−1} + β_2 ε_{n−2} + ··· + β_{r−1} ε_{n−r+1},

where α_j = 0 for j > p and β_j = 0 for j > q. Furthermore, we assume the ε_n are independent and identically distributed random variables with distribution N(0, σ²). Asymptotic properties of the MLE in ARMA models can be found in Hannan (1973), and Yao and Brockwell (2001). A general treatment of the MLE in Gaussian ARMAX models can be found in Chapter 7 of Caines (1988). By using the same idea as in Hamilton (1994), we consider the following state space representation of (6.1):

(6.2)  X_{n+1} = [ α_1  α_2  ···  α_{r−1}  α_r
                   1    0    ···  0        0
                   0    1    ···  0        0
                   ⋮              ⋮        ⋮
                   0    0    ···  1        0 ] X_n + (ε_{n+1}, 0, ···, 0)',

and

(6.3)  ξ_n = µ + [ 1  β_1  β_2  ···  β_{r−1} ] X_n.

For implementation of the MLE via the Kalman filter, the reader is referred to Hamilton (1994) for details.
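To make the representation (6.2)-(6.3) concrete, the following sketch evaluates the exact Gaussian log likelihood of an ARMA(1,1) model (so r = 2) by running the Kalman filter started from the stationary state distribution. This is an illustrative implementation under the normal error assumption, not code from the paper; the general recursion is in Hamilton (1994).

```python
import numpy as np

def arma11_loglik(xi, mu, alpha1, beta1, sigma2):
    """Exact Gaussian log likelihood of an ARMA(1,1) model via the
    state space form (6.2)-(6.3) with r = 2 and the Kalman filter."""
    F = np.array([[alpha1, 0.0], [1.0, 0.0]])   # transition matrix in (6.2)
    h = np.array([1.0, beta1])                  # observation row in (6.3)
    Q = np.array([[sigma2, 0.0], [0.0, 0.0]])   # covariance of (eps_{n+1}, 0)'
    # stationary state covariance: fixed point of P = F P F' + Q
    P = Q.copy()
    for _ in range(1000):
        P = F @ P @ F.T + Q
    x = np.zeros(2)                             # predicted state mean
    ll = 0.0
    for y in xi:
        s = h @ P @ h                           # innovation variance
        e = y - mu - h @ x                      # one-step prediction error
        ll += -0.5 * (np.log(2 * np.pi * s) + e * e / s)
        k = P @ h / s                           # Kalman gain
        x = F @ (x + k * e)                     # predict next state mean
        P = F @ (P - np.outer(k, h @ P)) @ F.T + Q   # and covariance
    return ll
```

The prediction error decomposition computed here coincides with the joint Gaussian density of (ξ_1, ···, ξ_n), which gives a simple way to check the implementation.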

Assume that the roots of 1 − α_1 z − α_2 z² − ··· − α_p z^p = 0 lie outside the unit circle. It is easy to see that {X_n, n ≥ 0} forms an ergodic (aperiodic, irreducible and positive recurrent) Markov chain, and the ξ_n are conditionally independent given {X_n, n ≥ 0}. This implies that condition C1 holds. The assumption ε_n ∼ N(0, σ²) also implies that conditions C2-C5, C2'-C5' and C7-C9 are satisfied in model (6.1). Since the verification is straightforward, we do not report it here. Suppose the conditional distribution of ξ_n given X_0, ···, X_n is of the form F_{X_{n−1},X_n} from (6.3). The Cramér conditions (4.20) and (4.21) hold for Z_j^{(ν)} := D^ν log p_1(ξ_0, ξ_1; θ_0), since the conditional density of ξ_n given {x_n, n ≥ 0} is N(0, σ²), and

(6.4)  limsup_{|θ|→∞} | ∫ ∫_{−∞}^∞ ∫_{−∞}^∞ e^{iθξ} dF_{x,αx+z}(ξ) ϕ(z) dz π(dx) | < 1,

where ϕ(·) is the normal density function of ε_1, and π is the stationary distribution of {X_n}. The identification issue in C6 is addressed in Chapter 9 of Caines (1988) or Chapter 13 of Hamilton (1994).

6.2 (G)ARCH models

In this subsection, we consider two specific (G)ARCH models. First, we study the AR(1)/ARCH(1) model

(6.5)  X_n = β_0 + β_1 X_{n−1} + √(α_0 + α_1 X²_{n−1}) ε_n,

where α_i, β_i are unknown parameters for i = 0, 1, with α_0 > 0, 0 < α_1 < 1, 3α_1² < 1, and 0 < β_1 < 1. Here, the ε_n are assumed to be independent random variables with standard normal distribution. Note that in (6.5), X = (X_n) is defined as the autoregressive scheme AR(1) with ARCH(1) noise (√(α_0 + α_1 X²_{n−1}) ε_n)_{n≥1}. When β_0 = β_1 = 0, this is the classical ARCH(1) model first considered by Engle (1982). Model (6.5) is conditionally Gaussian; therefore the likelihood function of the parameter

θ =(α0,α1,β0,β1) for given observations x =(x0 =0,x1, ···,xn) from (6.5) is

(6.6)  l(x; θ) = (2π)^{−n/2} Π_{k=1}^n (α_0 + α_1 x²_{k−1})^{−1/2} exp( −(1/2) Σ_{k=1}^n (x_k − β_0 − β_1 x_{k−1})² / (α_0 + α_1 x²_{k−1}) ).

Assume β_0 = 0 and that α_0, α_1 are given. The maximum likelihood estimator β̂_1 of β_1 is the root of the equation ∂l(x; θ)/∂β_1 = 0. In view of (6.5) and (6.6), we obtain

(6.7)  β̂_1 = [ Σ_{k=1}^n (x_k − β_0) x_{k−1}/(α_0 + α_1 x²_{k−1}) ] / [ Σ_{k=1}^n x²_{k−1}/(α_0 + α_1 x²_{k−1}) ]
            = β_1 + [ Σ_{k=1}^n x_{k−1} ε_k/√(α_0 + α_1 x²_{k−1}) ] / [ Σ_{k=1}^n x²_{k−1}/(α_0 + α_1 x²_{k−1}) ].

Meyn and Tweedie (1993), pages 380 and 383, establish w-uniform ergodicity (with w(x) = |x| + 1) of the AR(1) model X_n = β_0 + β_1 X_{n−1} + ε_n by proving that a drift condition is satisfied, where |β_1| < 1 and the ε_n are i.i.d. random variables, with E|ε_n| < ∞, whose common density function q with respect to Lebesgue measure is positive everywhere.
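The closed-form root (6.7) is easy to check by simulation. The sketch below (illustrative, with made-up parameter values; β_0 = 0 and α_0, α_1 treated as known, as in the derivation above) simulates (6.5) and recovers β_1:

```python
import numpy as np

def simulate(n, beta1, alpha0, alpha1, rng):
    """Simulate (6.5) with beta_0 = 0:
    X_k = beta_1 X_{k-1} + sqrt(alpha_0 + alpha_1 X_{k-1}^2) eps_k."""
    x = np.zeros(n + 1)
    for k in range(1, n + 1):
        x[k] = beta1 * x[k - 1] + np.sqrt(alpha0 + alpha1 * x[k - 1] ** 2) * rng.standard_normal()
    return x

def beta1_mle(x, alpha0, alpha1):
    """Closed-form root (6.7) of the likelihood equation for beta_1."""
    w = alpha0 + alpha1 * x[:-1] ** 2        # conditional variances
    return np.sum(x[1:] * x[:-1] / w) / np.sum(x[:-1] ** 2 / w)

rng = np.random.default_rng(0)
x = simulate(20000, 0.5, 0.5, 0.3, rng)
print(beta1_mle(x, 0.5, 0.3))   # close to the true beta_1 = 0.5
```

Note that (6.7) is a weighted least squares estimator, with weights equal to the inverse conditional variances α_0 + α_1 x²_{k−1}.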

By letting ξ_n = X_n, condition C1 holds. The strongly nonlattice condition holds as in model (6.1). By using an argument similar to Theorem 1 of Lumsdaine (1996), we have the asymptotic identifiability of the likelihood function (6.6). The verification of conditions C1-C9 and C2'-C5' is straightforward and tedious, and is thus omitted. By Theorems 5-7, we have the strong consistency, asymptotic normality and Edgeworth expansion of the MLE β̂_1. The asymptotic properties of the MLEs of β_0, α_0 and α_1 can be verified in a similar way.

Next, we consider the GARCH(p, q) model of Bollerslev (1986) with orders p (≥ 1) and q (≥ 0):

(6.8)  Y_n = σ_n ε_n,
       σ²_n = δ + Σ_{i=1}^p α_i σ²_{n−i} + Σ_{j=1}^q β_j Y²_{n−j},

where δ > 0, α_i ≥ 0, and β_j ≥ 0 are constants, ε_n is a sequence of independent and identically distributed random variables, and ε_n is independent of {Y_{n−k}, k ≥ 1} for all n. It is known that the necessary and sufficient condition for (6.8) to define a unique strictly stationary process {Y_n, n ≥ 0} with EY²_n < ∞ is

(6.9)  Σ_{i=1}^p α_i + Σ_{j=1}^q β_j < 1.

We also assume (6.9) holds. Similar to the estimation of ARMA models, the most frequently used estimators for GARCH models are those derived from a (conditional) Gaussian likelihood function (cf. Fan and Yao (2003)). Without the normality assumption on ε_n in (6.8), and imposing the moment condition E(ε_1⁴) < ∞, Hall and Yao (2003) establish the asymptotic normality of the conditional maximum likelihood estimator in GARCH(p, q). They also establish asymptotic results for the case where the error distribution is heavy-tailed. Earlier in the literature, for p = q = 1, Lee and Hansen (1994) and Lumsdaine (1996) prove, under some regularity conditions, the consistency and asymptotic normality of the quasi-maximum likelihood estimator in the GARCH(1, 1) model.

For convenience of notation, we assume that p, q ≥ 2, by setting some α_i or β_j equal to zero if necessary. Denote η_n = σ_n^{−1} Y_n and

τ_n = (α_1 + β_1 η²_n, α_2, ···, α_{p−1}) ∈ R^{p−1},
ζ_n = (η²_n, 0, ···, 0) ∈ R^{p−1},
β = (β_2, ···, β_{q−1}) ∈ R^{q−2}.

Let A_n be a (p + q − 1) × (p + q − 1) matrix, written in block form as

(6.10)  A_n = [ τ_n      α_p  β        β_q
                I_{p−1}  0    0        0
                ζ_n      0    0        0
                0        0    I_{q−2}  0 ],

where I_{p−1} and I_{q−2} are identity matrices. Note that {A_n, n ≥ 0} are i.i.d. random matrices.

Let Z = (δ, 0, ···, 0)' ∈ R^{p+q−1} and X_n = (σ²_{n+1}, ···, σ²_{n−p+2}, Y²_n, ···, Y²_{n−q+2})', where ' denotes transpose. Following the idea of Bougerol and Picard (1992), we have the following state space representation of the GARCH(p, q) model: X_n is a Markov chain governed by

(6.11)  X_{n+1} = A_{n+1} X_n + Z,

and ξ_n := g(X_n) = (Y²_n, ···, Y²_{n−q+2})' is a non-invertible function of X_n. Assume δ > 0. It is known (cf. Theorem 3.2 of Bougerol and Picard, 1992) that the Markov chain {X_n, n ≥ 0} defined in (6.11) is stationary if and only if the top Lyapunov exponent γ of {A_n} is strictly negative. It is easy to see that {X_n, n ≥ 0} is an aperiodic, irreducible and w-uniformly (with w(x) = x²) ergodic Markov chain, and hence condition C1 holds. Furthermore, we assume the ε_n are independent and identically distributed random variables with distribution N(0, σ²). This implies that conditions C2-C5, C2'-C5' and C7-C9 are satisfied in model (6.11). For p = q = 1, Theorem 1 of Lumsdaine (1996) proves the asymptotic identifiability of the likelihood function. Note that Theorem 7 can also be used to prove second-order efficiency of the bootstrap method proposed by Hall and Yao (2003) in the case of non-heavy-tailed errors.
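The random matrix representation (6.10)-(6.11) is easy to experiment with numerically. The sketch below (illustrative parameter values, Gaussian ε_n) builds A_n for p = q = 2 and estimates the top Lyapunov exponent γ by iterating normalized matrix products; by the Bougerol-Picard criterion cited above, γ < 0 is equivalent to strict stationarity:

```python
import numpy as np

def garch_matrix(eta2, alpha, beta):
    """Random matrix A_n of (6.10) for p = q = 2,
    acting on X_n = (sigma^2_{n+1}, sigma^2_n, Y^2_n)'."""
    a1, a2 = alpha
    b1, b2 = beta
    return np.array([[a1 + b1 * eta2, a2, b2],
                     [1.0,            0.0, 0.0],
                     [eta2,           0.0, 0.0]])

def top_lyapunov(alpha, beta, n=50000, seed=0):
    """Monte Carlo estimate of gamma = lim (1/n) log ||A_n ... A_1||,
    computed by renormalizing the product at every step."""
    rng = np.random.default_rng(seed)
    v = np.ones(3)
    acc = 0.0
    for _ in range(n):
        v = garch_matrix(rng.standard_normal() ** 2, alpha, beta) @ v
        nv = np.linalg.norm(v)
        acc += np.log(nv)
        v /= nv
    return acc / n

# Coefficient sum 0.7 < 1, so (6.9) holds and gamma < 0:
print(top_lyapunov(alpha=(0.2, 0.1), beta=(0.3, 0.1)))
```

Renormalizing v at each step avoids overflow and leaves the averaged log norm increments, whose limit is γ, unchanged.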

6.3 Stochastic volatility models

In the financial economics literature, the discrete time stochastic volatility model, from Taylor (1986), may be written as

(6.12) yn = σnεn,

where y_n denotes the demeaned return process y_n = log(S_n/S_{n−1}) − µ, and log σ²_n follows an AR(1) process. It will be assumed that ε_n is a sequence of independent and identically distributed random variables with standard normal probability density function. In this subsection, we only study the asymptotic properties of the MLE as a benchmark. For a general account of the MLE in stochastic volatility models, the reader is referred to Taylor (1994), Shephard (1996), and Ghysels, Harvey and Renault (1996) for a comprehensive survey. Note that Genon-Catalot, Jeantheau and Larédo (2000) study the ergodicity and mixing properties of stochastic volatility models from the hidden Markov models point of view. Following a convention often adopted in the literature, we write X_n := log σ²_n, and

(6.13)  y_n = σ ε_n exp(X_n/2),

where σ is a scale parameter. Squaring the observations in (6.12) and taking logarithms gives

(6.14)  log y²_n = log σ² + X_n + log ε²_n.

Alternatively,

(6.15)  log y²_n = ω + X_n + ζ_n,

where ω = log σ² + E log ε²_n, so that the disturbance ζ_n has zero mean by construction. The scale parameter σ also removes the need for a constant term in the stationary first-order autoregressive process

(6.16) Xn = αXn−1 + ηn, |α| < 1,

where η_n is a sequence of independent and identically distributed random variables with distribution N(0, σ²_η). Moreover, we assume that ζ_n and η_n are independent. Note that in (6.14) and (6.15), {X_n, n ≥ 0} forms a Markov chain with transition probability density

(6.17)  p(x_{k−1}, x_k) = (2πσ²_η)^{−1/2} exp( −(x_k − α x_{k−1})² / (2σ²_η) ),

and stationary distribution π ∼ N(0, σ²_η/(1 − α²)). For given observations y = (log y²_1, ···, log y²_n) from (6.14) and (6.15), the likelihood function of the parameter θ = (α, σ²_η) is

(6.18)  l(y; θ) = ∫_X ··· ∫_X π(x_0) Π_{k=1}^n p(x_{k−1}, x_k) f_ζ(log y²_k − ω − x_k) dx_0 dx_1 ··· dx_n,

where f_ζ(·) is the probability density function of ζ_1. To check that condition C1 holds in the setting of (6.14) and (6.15), we note that Meyn and Tweedie (1993), pages 380 and 383, establish w-uniform ergodicity (with w(x) = |x| + 1) of the AR(1) model X_n = α X_{n−1} + η_n by proving that a drift condition is satisfied, where |α| < 1 and the η_n are i.i.d. random variables, with E|η_n| < ∞, whose common density function q with respect to Lebesgue measure is positive everywhere. Since ε_n ∼ N(0, 1), ζ_n = log ε²_n, η_n ∼ N(0, σ²_η), and ζ_n and η_n are mutually independent, conditions C2-C5, C2'-C5' and C7-C9 are satisfied in models (6.14) and (6.15) (cf. pages 22-23 of Shephard, 1996). Denote ξ_n := log y²_n. Note that the conditional density of X_n exists, and this implies that the conditional distribution of ξ_n given X_0, ···, X_n is of the form F_{X_{n−1},X_n} such that

(6.19)  limsup_{|θ|→∞} | ∫_X ∫_{−∞}^∞ ∫_{−∞}^∞ e^{iθξ} dF_{x,αx+z}(ξ) ϕ(z) dz π(dx) | < 1,

where ϕ(·) is the normal density function of ζ_1 and π is the stationary distribution of {X_n}. Let S_n = Σ_{i=1}^n ξ_i, S_0 = 0. Then {(X_n, S_n), n ≥ 0} is strongly nonlattice. To check the identification condition C6, the reader is referred to Chapter 13 of Hamilton (1994) and Section 2.4.3 of Ghysels, Harvey and Renault (1996).

Without the normality assumption, quasi-maximum likelihood (QML) estimators of the parameters are obtained by treating ζ_n and η_n as though they were normal and maximizing the prediction error decomposition form of the likelihood obtained via the Kalman filter. That is, we assume that ζ_n is a sequence of independent and identically distributed random variables with distribution N(0, σ²_ζ). For given observations y = (log y²_1, ···, log y²_n) from (6.14) and (6.15), the likelihood function of the parameter θ = (α, σ²_η, σ²_ζ) is

(6.20)  l(y; θ) = ∫_X ··· ∫_X π(x_0) (2πσ²_ζ)^{−n/2} Π_{k=1}^n p(x_{k−1}, x_k) exp( −(1/2) Σ_{k=1}^n (log y²_k − ω − x_k)² / σ²_ζ ) dx_0 dx_1 ··· dx_n.

By using the results of Dunsmuir (1979), Harvey, Ruiz and Shephard (1994) show that the quasi-maximum likelihood estimators are asymptotically normal under some regularity conditions. Further study of the MLE in stochastic volatility models will be published in a separate paper.
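The prediction error decomposition of (6.20) can be computed exactly with a scalar Kalman filter. The sketch below is illustrative (it is not code from the paper): it follows the linearization (6.15)-(6.16), and uses the standard moments of log ε²_n for Gaussian ε_n, namely E log ε²_n = ψ(1/2) + log 2 ≈ −1.27 and Var(log ε²_n) = π²/2, as in Harvey, Ruiz and Shephard (1994):

```python
import numpy as np

def sv_qml_loglik(logy2, alpha, sig2_eta, omega):
    """Quasi log likelihood of the linearized SV model (6.15)-(6.16),
    treating zeta_n as N(0, pi^2/2) and running a scalar Kalman filter
    (prediction error decomposition of (6.20))."""
    sig2_zeta = np.pi ** 2 / 2            # Var(log eps_n^2) for Gaussian eps_n
    x = 0.0                               # predicted state mean
    p = sig2_eta / (1.0 - alpha ** 2)     # stationary prior variance of X_n
    ll = 0.0
    for z in logy2:
        s = p + sig2_zeta                 # innovation variance
        e = z - omega - x                 # prediction error
        ll += -0.5 * (np.log(2 * np.pi * s) + e * e / s)
        k = p / s                         # Kalman gain
        x = alpha * (x + k * e)           # one-step-ahead state mean
        p = alpha ** 2 * (1.0 - k) * p + sig2_eta   # and variance
    return ll
```

Maximizing this function over (α, σ²_η, ω) gives the QML estimators discussed above; the exact likelihood (6.18) instead uses the true (non-Gaussian) density of log ε²_n.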

7 Proofs of Lemmas and Theorems in Sections 3 and 4

Proof of Theorem 1. Elton (1990) shows, in the situation of a stationary sequence (F_n)_{n≥1}, that this holds true whenever E log⁺ l(F_1) and E log⁺ d(F_1(u_0), u_0) are both finite for some (and then all) u_0 ∈ M, and the Lyapunov exponent

(7.1)  γ := lim_{n→∞} n^{−1} log l(F_{n:1}),

which exists by Kingman's subadditive ergodic theorem, is a.s. negative. Since the initial distribution of Y_0 is the stationary distribution π, the Markov chain Y_n is a stationary sequence, and hence M_n is a sequence of iterated random functions generated by stationary sequences. Theorem 1 states the ergodic properties of stationary Markovian iterated random functions systems. The proof is similar to Barnsley, Elton and Hardin (1989), and Elton (1990), and is thus omitted.

Proof of Lemma 1. Let F_n(u), F_{n:k}(u) and l(F_n) be defined as in Section 3. We have the obvious inequality

(7.2)  d(M̃^u_{n+m}, M̃^u_n) ≤ ( Π_{k=1}^n l(F_k) ) d(F_{n+1:n+m}(u), u)  a.s.,

valid for all n, m ≥ 0 and u ∈ M. For ϕ ∈ H, we note

(7.3)  |T^n ϕ(y, u) − Qϕ(y, u)|

= | E_{yu}(ϕ(Y_n, M_n)) − ∫ ϕ(y, u) Π(dy × du) |

= | Ẽ_y(ϕ(Ỹ_n, M̃^u_n)) − E_Π( ϕ( lim_{k→∞} (Ỹ_k, M̃^u_k) ) ) |   by (3.8) and Theorem 1(iii)

≤ sup_y Ẽ_y( lim_{k→∞} |ϕ(Ỹ_n, M̃^u_n) − ϕ(Ỹ_k, M̃^u_k)| ).

By assumption A3, there exists 0 < ρ < 1 such that sup_y Ẽ_y l(F_1) < ρ. For given u_0 ∈ M, we have

(7.4)  ||T^n ϕ − Qϕ||_w = sup_{y∈Y, u∈M} |T^n ϕ(y, u) − Qϕ(y, u)| / ( w(y)(1 + d_ψ(u_0, u)) )

≤ sup_{y∈Y, u∈M} [ sup_y Ẽ_y( lim_{k→∞} |ϕ(Ỹ_n, M̃^u_n) − ϕ(Ỹ_k, M̃^u_k)| ) ] / ( w(y)(1 + d_ψ(u_0, u)) )   by (7.3)

≤ ||ϕ||_l sup_{u∈M} Ẽ_y( ( Π_{k=1}^n l(F_k) ) lim_{k→∞} d(u, F_{n+1} ∘ ··· ∘ F_k(u)) ) / (1 + d_ψ(u_0, u))   by (7.2)

≤ ||ϕ||_l ( sup_y Ẽ_y l(F_1) )^n ∫ d_ψ(u, v) Q(dv) / (1 + d_ψ(u_0, u))

≤ C_1 ||ϕ||_l ρ^n,

where C_1 := ∫ d_ψ(u, v) Q(dv) / (1 + d_ψ(u_0, u)). On the other hand,

(7.5)  |(T^n − Q)ϕ(y, u) − (T^n − Q)ϕ(y, v)| / d_ψ(u, v)

= |T^n ϕ(y, u) − T^n ϕ(y, v)| / ( w(y) d_ψ(u, v) )

≤ ||ϕ||_l ρ^n,   by (7.3).

Combining (7.4) and (7.5), we get

(7.6)  ||T^n ϕ − Qϕ||_{wl} ≤ C′ ρ^n ||ϕ||_{wl},  for all ϕ ∈ H.

Since (4.15) follows from (7.6), this completes the proof. □

A result similar to Lemma 2 has been proved in Proposition 1 of Fuh (2004b); we include it here for completeness.

Proof of Lemma 2. Using the same notation and assumptions as in Section 3, we give a brief summary of the perturbation theory for the Markovian operators T_α, T, ν_α and Q defined in (4.11)-(4.13). For a bounded linear operator T : H → H, the resolvent set is defined as {z ∈ C : (T − zI)^{−1} exists} and (T − zI)^{−1} is called the resolvent (when the inverse exists). From

(4.15) it follows that for z ≠ 1 and |z| > ρ*,

(7.7)  R(z) := Q/(z − 1) + Σ_{n=0}^∞ (T^n − Q)/z^{n+1}

is well defined. Since R(z)(T − zI) = −I = (T − zI)R(z), the resolvent of T is −R(z). Moreover, by A3 and an argument similar to the proof of Nagaev (1957), there exist K > 0 and η > 0 such that for |α| ≤ η, |z − 1| > (1 − ρ*)/6 and |z| > ρ* + (1 − ρ*)/6, we have ||T_α − T||_{wl} ≤ K|α|, and R_α(z) := Σ_{n=0}^∞ R(z){(T_α − T)R(z)}^n is well defined. Since R_α(z)(T_α − zI) = R_α(z){(T_α − T) + (T − zI)} = −I = (T_α − zI)R_α(z), the resolvent of T_α is −R_α(z).

For $|\alpha| \le \eta$, the spectrum (the complement of the resolvent set) of $T_\alpha$ therefore lies inside the two circles $C_1 = \{z : |z-1| = (1-\rho^*)/3\}$ and $C_2 = \{z : |z| = \rho^* + (1-\rho^*)/3\}$.

Hence, by Riesz's spectral decomposition theorem, $\mathcal{H} = \mathcal{H}_1 \oplus \mathcal{H}_2$, the direct sum of $\mathcal{H}_1$ and $\mathcal{H}_2$, and
$$
(7.8)\quad Q_\alpha := \frac{1}{2\pi i} \int_{C_1} R_\alpha(z)\,dz, \qquad I - Q_\alpha := \frac{1}{2\pi i} \int_{C_2} R_\alpha(z)\,dz
$$
are parallel projections of $\mathcal{H}$ onto the subspaces $\mathcal{H}_1$, $\mathcal{H}_2$, respectively. Moreover, by an argument similar to the proof of Lemma 1.1 of Nagaev (1957), there exists $0 < \delta \le \eta$ such that

$\mathcal{H}_1$ is one-dimensional for $|\alpha| \le \delta$ and $\sup_{|\alpha|\le\delta} \|Q_\alpha - Q\|_{wl} < 1$. For $|\alpha| \le \delta$, let $\lambda(\alpha)$ be the eigenvalue of $T_\alpha$ with corresponding eigenspace $\mathcal{H}_1$. Since $Q_\alpha$ is the parallel projection onto the subspace $\mathcal{H}_1$ in the direction of $\mathcal{H}_2$, (4.14) holds. Therefore, for $\varphi \in \mathcal{H}$,

$$
(7.9)\quad E_\nu\{e^{i\alpha' g(M_n)}\, \varphi(X_n, M_n)\} = \nu_\alpha T_\alpha^n \varphi = \nu_\alpha T_\alpha^n \{Q_\alpha + (I - Q_\alpha)\}\varphi
= \lambda^n(\alpha)\, \nu_\alpha Q_\alpha \varphi + \nu_\alpha T_\alpha^n (I - Q_\alpha)\varphi.
$$

An argument similar to (1.22) of Nagaev (1957) shows that there exist $K^* > 0$ and $0 < \delta^* < \delta$ such that for $|\alpha| \le \delta^*$,

$$
(7.10)\quad \|\nu_\alpha T_\alpha^n (I - Q_\alpha)\varphi\|_{wl} \le K^* \|\varphi\|_{wl}\, |\alpha|\, \{(1 + 2\rho_*)/3\}^n.
$$

We next consider the summand $\lambda^n(\alpha)\, \nu_\alpha Q_\alpha \varphi$ in (7.9). Since A3 holds, for any integer $r \ge 3$, analogously to Lemma 1.2 of Nagaev (1957), $\lambda(\alpha)$ has the Taylor expansion

$$
(7.11)\quad \lambda(\alpha) = 1 + \sum_{(j_1,\dots,j_q):\, 1 \le j_1+\cdots+j_q \le r} i^{\,j_1+\cdots+j_q}\, \lambda_{j_1,\dots,j_q}\, \frac{\alpha_1^{j_1} \cdots \alpha_q^{j_q}}{j_1! \cdots j_q!} + \Delta(\alpha)
$$

in some neighborhood of the origin, where $\Delta(\alpha) = O(|\alpha|^r)$ as $|\alpha| \to 0$. Let $\varphi_1 := 1$; then

$\nu_\alpha Q_\alpha \varphi_1$ has continuous partial derivatives of order $r-2$ in some neighborhood of the origin. $\Box$

Proof of Theorem 2. Since (7.9) holds uniformly in $|\alpha| < \delta\sqrt{n}$, standard arguments involving smoothing inequalities and Fourier inversion (cf. Chapter 4 of Bhattacharya and Rao (1976)) reduce the proof to showing that for every $\delta > 0$, $a > 0$ and $b > 1$,

$$
(7.12)\quad \sup_{\delta \le |\alpha| \le n^a} \big| E_\pi\big(e^{i\alpha' S_n}\big) \big| = o(n^{-b}).
$$

To prove (7.12), let $\zeta_t = S_t - S_{t-1}$ $(t = 1, 2, \dots)$, $\zeta_0 = S_0$, and $\tilde{\varphi}((y,u),(y',v)) = E\big(e^{i\alpha'\zeta_1} \mid (Y_0 = y, M_0 = u), (Y_1 = y', M_1 = v)\big)$.

Let $J = \{1, \dots, n\}$, and divide $J$ into blocks $A_1, B_1, \dots, A_l, B_l$ as follows. Define $j_1, \dots, j_l$ by

$j_1 = 1$, and $j_{k+1} = \inf\{j \ge j_k + 7m : j \in J\}$, and let $l$ be the smallest integer for which the infimum is undefined. Write

$$
A_k = \prod\big\{ e^{i n^{-1/2} \alpha' \zeta_j} : |j - j_k| \le m \big\}, \qquad k = 1, \dots, l,
$$
$$
B_k = \prod\big\{ e^{i n^{-1/2} \alpha' \zeta_j} : j_k + m + 1 \le j \le j_{k+1} - m - 1 \big\}, \qquad k = 1, \dots, l-1,
$$
$$
B_l = \prod\big\{ e^{i n^{-1/2} \alpha' \zeta_j} : j > j_l + m + 1 \big\}.
$$

Then
$$
e^{i n^{-1/2} \alpha' S_n} = \prod_{k=1}^{l} A_k B_k.
$$
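The blocking scheme above can be sketched in code. The function name `make_blocks` and the toy parameters are illustrative, not from the paper; the sketch only shows how the centers $j_k$, spaced at least $7m$ apart, generate non-overlapping $A$-blocks separated by $B$-blocks.

```python
# Sketch of the blocking used in the proof: centers j_k at least 7m apart,
# A_k collects the indices within m of j_k, and B_k collects the indices
# strictly between consecutive A-blocks (names and parameters illustrative).
def make_blocks(n, m):
    J = list(range(1, n + 1))
    centers = [1]
    while any(j >= centers[-1] + 7 * m for j in J):
        centers.append(next(j for j in J if j >= centers[-1] + 7 * m))
    A = [[j for j in J if abs(j - jk) <= m] for jk in centers]
    B = [[j for j in J if centers[k] + m + 1 <= j <= centers[k + 1] - m - 1]
         for k in range(len(centers) - 1)]
    B.append([j for j in J if j > centers[-1] + m + 1])
    return centers, A, B

centers, A, B = make_blocks(60, 2)
# consecutive centers differ by at least 7m = 14, so the A-blocks are disjoint
```

The gaps of width at least $7m$ between the centers are what make the conditional expectations $E(A_k \mid \zeta_j : 0 < |j - j_k| \le 3m)$ below only weakly dependent.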

We have
$$
\Big| E_x \prod_{k=1}^{l} A_k B_k - E_x \prod_{k=1}^{l} B_k\, E(A_k \mid \zeta_j : j \neq j_k) \Big|
\le \sum_{q=1}^{l} E_x \Big| \prod_{k=1}^{q-1} A_k B_k\, \big(A_q - E(A_q \mid \zeta_j : j \neq j_q)\big) \prod_{k=q+1}^{l} B_k\, E(A_k \mid \zeta_j : j \neq j_k) \Big|
$$
$$
(7.13)\quad \le \sum_{q=1}^{l} E_x \Big| \prod_{k=1}^{q-1} A_k B_k\, \big(A_q - E(A_q \mid \zeta_j : j \neq j_q)\big) \prod_{k=q+1}^{l} B_k\, E(A_k \mid \zeta_j : 0 < |j - j_k| \le 3m) \Big|.
$$

The first summation term in (7.13) vanishes since

$$
\prod_{k=1}^{q-1} A_k B_k \quad\text{and}\quad \prod_{k=q+1}^{l} B_k\, E(A_k \mid \zeta_j : 0 < |j - j_k| \le 3m)
$$
are both measurable with respect to the $\sigma$-field generated by $\{\zeta_j : j \neq j_q\}$. Recall that the functions

$E(A_k \mid \zeta_j : 0 < |j - j_k| \le 3m)$, for $k = 1, \dots, l$, are weakly dependent since $j_{k+1} - j_k \ge 7m$ for $k = 1, \dots, l-1$. Using condition C1, we obtain

$$
E_x \prod_{k=1}^{l} B_k\, E(A_k \mid \zeta_j : 0 < |j - j_k| \le 3m)
\le (2n^{\beta})^r\, E_x \prod_{k=1}^{l} E(A_k \mid \zeta_j : 0 < |j - j_k| \le 3m)
$$
$$
\le (2n^{\beta})^r \prod_{k=1}^{l} E_x \big| E(A_k \mid \zeta_j : 0 < |j - j_k| \le 3m) \big| + (2n^{\beta})^r\, l \cdot 4 d^{-1} e^{-dm}.
$$
With the strong nonlattice condition (4.20) and the conditional strong nonlattice condition (4.21), we find an upper bound for

$$
E_x \big| E(A_k \mid \zeta_j : 0 < |j - j_k| \le 3m) \big|.
$$

For $|\alpha| \ge \delta$ we have the relation $E_x|E(A_k \mid \zeta_j : j \neq j_q)| \le e^{-\delta}$, and by (4.21), for all $\alpha \in R^p$ with $|\alpha| \le \delta$, $E_x|E(A_k \mid \zeta_j : j \neq j_q)| \le \exp(-\delta|\alpha|^2/n)$. Therefore, for all $\alpha \in R^p$,

$$
E_x \big| E(A_k \mid \zeta_j : 0 < |j - j_k| \le 3m) \big| \le E_x \big| E(A_k \mid \zeta_j : j \neq j_q) \big| \le \max\big( \exp(-\delta|\alpha|^2/n),\ e^{-\delta} \big).
$$

If we choose K appropriately and let m be the integral part of K log n, then the assertion of the lemma follows from

$$
\big( \exp(-\delta|\alpha|^2/n) \big)^{n/m} \le \exp\big( -\delta|\alpha|^2 / (K \log n) \big) \le \exp\big( -\delta' n^{\varepsilon/2} \big)
$$

for $|\alpha| \ge c n^{\varepsilon}$ and some $\delta' > 0$. $\Box$

8 Proofs of Lemmas and Theorems in Section 5

For notational convenience, denote by $\{Z_n, n \ge 0\} := \{((X_n, \xi_n), M_n), n \ge 0\}$ the Markov chain, induced by the Markovian iterated random functions system (2.4)-(2.7), on the state space $(X \times R^d) \times M$. In the proofs of Lemmas 3 and 4, we omit $\theta$ for simplicity.

Proof of Lemma 3. We consider only the case of $P(\xi_j)$ for $j = 1, \dots, n$, since the case of $P(\xi_0)$ is a straightforward consequence. For any two elements $h_1, h_2 \in M$, and two fixed elements $\xi_{j-1}, \xi_j \in R^d$, we have

$$
d(P(\xi_j)h_1, P(\xi_j)h_2)
= \sup_{x_{j-1}\in X} \Big| \int p_\theta(x_{j-1}, x_j)\, f(\xi_j; \varphi_{x_j}(\theta) \mid \xi_{j-1})\, h_1(x_j)\, m(dx_j)
- \int p_\theta(x_{j-1}, x_j)\, f(\xi_j; \varphi_{x_j}(\theta) \mid \xi_{j-1})\, h_2(x_j)\, m(dx_j) \Big|
$$
$$
\le d(h_1, h_2) \sup_{x_{j-1}\in X} \int p_\theta(x_{j-1}, x_j)\, f(\xi_j; \varphi_{x_j}(\theta) \mid \xi_{j-1})\, m(dx_j)
= C\, d(h_1, h_2),
$$
where $0 < C < \infty$. Hence

$$
(8.1)\quad E \log^+ L_1 = E \log^+ \sup_{h_1 \neq h_2} \frac{d(P(\xi_j)h_1, P(\xi_j)h_2)}{d(h_1, h_2)}
< E \log \sup_{x_{j-1}\in X} \int p_\theta(x_{j-1}, x_j)\, f(\xi_j; \varphi_{x_j}(\theta) \mid \xi_{j-1})\, m(dx_j)
\le \log \sup_{x_{j-1}\in X} E \int p_\theta(x_{j-1}, x_j)\, f(\xi_j; \varphi_{x_j}(\theta) \mid \xi_{j-1})\, m(dx_j) = 0.
$$

The last inequality follows from (2.2) and Jensen's inequality. Next, we verify that assumption A3 holds. As $m$ is $\sigma$-finite, we have $X = \cup_{n=1}^{\infty} X_n$, where the $X_n$ are pairwise disjoint and of finite positive $m$-measure. Define

$$
(8.2)\quad h(x) = \sum_{n=1}^{\infty} \frac{I_{X_n}(x)}{2^n\, m(X_n)}.
$$

It is easy to see that $\int_{x\in X} h(x)\, m(dx) = 1$, and hence $h$ belongs to $M$. And

$$
(8.3)\quad E\, d^2(P(\xi_j)h, h) = E \sup_{x_{j-1}\in X} \Big| \int p_\theta(x_{j-1}, x_j)\, f(\xi_j; \varphi_{x_j}(\theta) \mid \xi_{j-1})\, h(x_j)\, m(dx_j) - h(x_{j-1}) \Big|.
$$

By the definition of $h(x)$ in (8.2), $h$ is piecewise constant, and $p_\theta(x_{j-1}, x_j)\, f(\xi_j; \varphi_{x_j}(\theta) \mid \xi_{j-1})$ is a probability density function and hence integrable over each subset $X_n$. These facts imply that (8.3) is finite.
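The normalization behind (8.2) can be checked directly: each block $X_n$ contributes mass $m(X_n) \cdot 1/(2^n m(X_n)) = 2^{-n}$, so the measure of $X_n$ cancels and the total mass is $1$. A minimal numerical check, with an assumed truncation level `N`:

```python
from fractions import Fraction

# Check of the construction in (8.2): the block X_n carries mass
# m(X_n) * 1 / (2^n m(X_n)) = 2^(-n), so m(X_n) cancels and the masses
# sum to 1; truncating at N blocks leaves total mass 1 - 2^(-N).
N = 50
masses = [Fraction(1, 2 ** n) for n in range(1, N + 1)]
total = sum(masses)
print(total == 1 - Fraction(1, 2 ** N))  # prints True
```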

Finally, we observe

$$
\sup_{x,\xi} E(L_1 \mid X_0 = x, \xi_0 = \xi)
= \sup_{x,\xi} E\Big( \sup_{h_1 \neq h_2} \frac{d(P(\xi_j)h_1, P(\xi_j)h_2)}{d(h_1, h_2)} \,\Big|\, X_0 = x, \xi_0 = \xi \Big)
$$
$$
< \sup_{x,\xi} E\Big( \sup_{x_{j-1}\in X} \int p_\theta(x_{j-1}, x_j)\, f(\xi_j; \varphi_{x_j}(\theta) \mid \xi_{j-1})\, m(dx_j) \,\Big|\, X_0 = x, \xi_0 = \xi \Big)
\le \sup_{x,\xi} \sup_{x_{j-1}\in X} \iint p_\theta(x_{j-1}, x_j)\, f(\xi_j; \varphi_{x_j}(\theta) \mid \xi_{j-1})\, m(dx_j)\, dQ(\xi_j) = 1.
$$

Note that C5 implies the exponential moment condition on $g$. Hence the proof is complete. $\Box$
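The mean-contraction condition $E\log^+ L_1 < 0$ verified above has a simple numerical analogue. The sketch below uses a toy iterated random functions system with random affine maps, not the paper's operators $P(\xi_j)$: when the average log-Lipschitz constant is negative, two trajectories driven by the same random maps coalesce geometrically.

```python
import math
import random

# Toy iterated random functions system F_k(x) = a_k x + b_k with
# E log a_k = (log 0.5 + log 1.2) / 2 < 0, the analogue of the
# mean contraction E log L_1 < 0 in Lemma 3.
random.seed(0)
mean_log_a = (math.log(0.5) + math.log(1.2)) / 2   # about -0.255
x, y = -10.0, 10.0                                 # two starting points
for _ in range(200):
    a = random.choice([0.5, 1.2])
    b = random.uniform(-1.0, 1.0)
    x, y = a * x + b, a * y + b                    # same (a, b) drives both
gap = abs(x - y)          # shrinks roughly like exp(200 * E log a)
```

Note that the individual maps need not all be contractions (here $a_k = 1.2$ expands); only the average log-Lipschitz constant matters, exactly as in (8.1).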

Proof of Lemma 4. We first prove that $\{Z_n, n \ge 0\}$ is Harris recurrent. Note that the transition probability kernel of the Markov chain $\{(X_n, \xi_n), n \ge 0\}$, defined in (2.1) and (2.2), has a probability density with respect to $m \times Q$, and the iterated random functions system, defined in (2.5)-(2.8), also has a probability density with respect to $m \times Q$. Therefore, there exists a measurable function $g : ((X \times R^d) \times M) \times ((X \times R^d) \times M) \to [0, \infty)$ such that

$$
(8.4)\quad P(z, dz') = g(z, z')\, (m \times Q)(dz'),
$$

where $\int_{(X\times R^d)\times M} g(z, z')\, (m \times Q)(dz') > 0$ for all $z \in (X \times R^d) \times M$. For an arbitrary stopping time $\tau$ with respect to the $\sigma$-algebra generated by $\{Z_n\}$, let $P^\tau(z, \cdot) := P_z(Z_\tau \in \cdot)$ for $z \in (X \times R^d) \times M$. For $A \in \mathcal{B}(X \times R^d)$ and $B \in \mathcal{B}(M)$, define

$$
(m \times Q)^\tau (A \times B) := \int_{(X\times R^d)\times M} P\{Z_\tau(z') \in A \times B\}\, (m \times Q)(dz').
$$

Then

$$
P^{\tau+1}(z, A \times B) = \int_{(X\times R^d)\times M} P^{\tau}(z', A \times B)\, g(z, z')\, (m \times Q)(dz')
= \int_{(X\times R^d)\times M} P\{Z_\tau(z') \in A \times B\}\, g(z, z')\, (m \times Q)(dz')
$$

for all $A \in \mathcal{B}(X \times R^d)$ and $B \in \mathcal{B}(M)$. Therefore, given any $P_{(m\times Q)}$-a.s. finite stopping time $\tau$ for $\{Z_n, n \ge 0\}$, the family $(P^{\tau+1}(z, \cdot))_{z\in(X\times R^d)\times M}$ is nonsingular with respect to $(m \times Q)^\tau$. In particular, we have shown that if $P$ has a probability density with respect to $m \times Q$, then so does $P^n$ for all $n \ge 1$ (with, in general, a different $m \times Q$). Let $g_\tau$ be such that

$$
(8.5)\quad P^{\tau+1}(z, dz') = g_\tau(z, z')\, (m \times Q)^\tau(dz'), \qquad z \in (X \times R^d) \times M,
$$

where $\int_{(X\times R^d)\times M} g_\tau(z, z')\, (m \times Q)^\tau(dz') > 0$ for all $z \in (X \times R^d) \times M$. Next, under condition C1, for each $\Pi$-positive $A \times B$, the set

$$
\Gamma_0(A \times B) := \big\{ z \in (X \times R^d) \times M : P_z\{Z_n \in A \times B \text{ i.o.}\} = 1 \big\}
$$

satisfies $\Pi(\Gamma_0(A \times B)) = 1$, and thus also $P(z, \Gamma_0(A \times B)) = 1$ for $\Pi$-almost all $z \in (X \times R^d) \times M$. Recursively, define

$$
\Gamma_{n+1}(A \times B) := \{ z \in \Gamma_n(A \times B) : P(z, \Gamma_n(A \times B)) = 1 \} \quad \text{for } n \ge 0.
$$
Then $\Pi(\Gamma_n(A \times B)) = 1$ for all $n \ge 0$ and $\Gamma_n(A \times B) \downarrow \Gamma_\infty(A \times B) := \cap_{k\ge 0} \Gamma_k(A \times B)$ as $n \to \infty$, giving $\Pi(\Gamma_\infty(A \times B)) = 1$. Furthermore, $\Gamma_\infty(A \times B)$ is absorbing because, by construction, $P(z, \Gamma_n(A \times B)) = 1$ for all $z \in \Gamma_\infty(A \times B)$ and $n \ge 0$, and thus $P(z, \Gamma_\infty(A \times B)) = \lim_{n\to\infty} P(z, \Gamma_n(A \times B)) = 1$ for all $z \in \Gamma_\infty(A \times B)$.

In particular, put $\tau = 1$. Denote by $B^c$ the complement of $B$. Since $\Pi\big(\Gamma_\infty((X \times R^d) \times M)^c\big) = 0$, also $(m \times Q)\big(\Gamma_\infty((X \times R^d) \times M)^c\big) = 0$. It is now obvious from the previous considerations that we can choose $\delta > 0$ sufficiently small such that

$$
\int_{\Gamma_\infty((X\times R^d)\times M)} \int_{(X\times R^d)\times M} \int_{\Gamma_\infty((X\times R^d)\times M)}
1_{\{g \ge \delta\}}(z_1, z_2)\, 1_{\{g \ge \delta\}}(z_2, z_3)\, (m \times Q)(dz_3)\, (m \times Q)(dz_2)\, \Pi(dz_1) > 0.
$$

Hence, by Lemma 4.3 of Niemi and Nummelin (1986), there exist a $\Pi$-positive set $\Gamma_1 \subset \Gamma_\infty((X \times R^d) \times M)$ and an $m \times Q$-positive set $\Gamma_2 \subset \Gamma_\infty((X \times R^d) \times M)$ such that

$$
\alpha := \inf_{z_1\in\Gamma_1,\, z_3\in\Gamma_2} (m \times Q)\big\{ z_2 \in (X \times R^d) \times M : g(z_1, z_2) \ge \delta,\ g(z_2, z_3) \ge \delta \big\} > 0.
$$

A combination of the above result with (8.4) and (8.5) implies

$$
(8.6)\quad P^3(z_1, A \times B) = \int_{(X\times R^d)\times M} P(z_2, A \times B)\, P^2(z_1, dz_2)
\ge \int_{(X\times R^d)\times M} g_2(z_1, z_2) \int_{A\times B \cap \Gamma_2} g(z_2, z_3)\, (m \times Q)(dz_3)\, (m \times Q)(dz_2)
\ge \alpha \delta^2\, (m \times Q)(A \times B \cap \Gamma_2)
$$

for all $z_1 \in \Gamma_1$ and $A \times B \in \mathcal{B}((X \times R^d) \times M)$. By defining $H := \Gamma_\infty((X \times R^d) \times M)$, we obtain an absorbing set such that $\Gamma_1$ is a regeneration set for $\{Z_n, n \ge 0\}$ restricted to $H$; i.e., $\Gamma_1$ is recurrent and satisfies a minorization condition, namely (8.6). This proves the Harris recurrence of $\{Z_n, n \ge 0\}$ on $H$. By the previous construction, it is easy to see that $H = (X \times R^d) \times M$. Since $\{Z_n, n \ge 0\}$ possesses a stationary distribution, it is clearly positive Harris recurrent.

Next, we prove aperiodicity. If $\{Z_n, n \ge 0\}$ were $q$-periodic with cyclic classes $\Gamma_1, \dots, \Gamma_q$, say, then the $q$-skeleton $(Z_{nq})_{n\ge 0}$ would have stationary distributions $\Pi(\cdot \cap \Gamma_k)/\Pi(\Gamma_k)$ for $k = 1, \dots, q$. On the other hand, $Z_{qn}$ is aperiodic by definition, and $M_{nq}$ is also an iterated random functions system satisfying condition C1, and thus possesses only one stationary distribution.

Consequently, $q = 1$ and $\{Z_n, n \ge 0\}$ is aperiodic. Note that we have $P_z\{Z_n \in A \times B \text{ i.o.}\} = 1$ for all $z \in (X \times R^d) \times M$ and all $\Pi$-positive open $A \times B \in \mathcal{B}((X \times R^d) \times M)$. Denote by $\rho(B)$ the first return time of $Z_n$ to $B$. Hence $\Pi(\mathrm{int}((X \times R^d) \times M)) > 0$ ensures $P_z\big(\rho((X \times R^d) \times M) < \infty\big) = 1$ for all $z \in (X \times R^d) \times M$, which easily yields the $m \times Q$-irreducibility of $\{Z_n, n \ge 0\}$. $\Box$

Proof of Lemma 5. In order to define the Fisher information (5.6), we need to verify that there exists a $\delta > 0$ such that $\partial \log \|P_\theta(\xi_1) \circ P_\theta(\xi_0)\pi(x)\| / \partial\theta \in L_2(P_\Pi^\theta)$ for $\theta \in N_\delta(\theta_0)$, a $\delta$-neighborhood of $\theta_0$. That is, we need to show

$$
(8.7)\quad E_\Pi^{\theta} \Big( \frac{\partial \log \|P_\theta(\xi_1) \circ P_\theta(\xi_0)\pi(x)\|}{\partial\theta} \Big)^2 < \infty
$$
for $\theta \in N_\delta(\theta_0)$. It is easy to see that C5 implies that

$$
\sup_{x\in X} E_x^{\theta} \Big( \frac{\partial \log \int_{y\in X} \pi(x)\, p(x,y)\, f(\xi_0; \varphi_y(\theta))\, f(\xi_1; \varphi_y(\theta) \mid \xi_0)\, m(dy)}{\partial\theta} \Big)^2 < \infty
$$
for $\theta \in N_\delta(\theta_0)$, and this leads to

$$
(8.8)\quad \sup_{x\in X} E_x^{\theta} \Big( \frac{\partial \log \int_{y\in X} \pi(x)\, p(x,y)\, f(\xi_0; \varphi_y(\theta))\, f(\xi_1; \varphi_y(\theta) \mid \xi_0)\, m(dy)}{\partial\theta} \Big)^2 < \infty
$$
for $\theta \in N_\delta(\theta_0)$, where $E_x^{\theta}$ is the expectation under $P^{\theta}(\cdot, \cdot)$. Finally, (8.8) implies (8.7), and the proof is complete. $\Box$

Proof of Lemma 6. For each $j = 1, \dots, q$,
$$
\frac{1}{\sqrt{n}}\, l_j'(\theta_0) = \frac{1}{\sqrt{n}} \frac{\partial}{\partial\theta_j} \log \|P_\theta(\xi_n) \circ \cdots \circ P_\theta(\xi_1) \circ P_\theta(\xi_0)\pi(x)\| \Big|_{\theta=\theta_0}
= \frac{1}{\sqrt{n}} \sum_{k=1}^{n} \frac{\partial}{\partial\theta_j} \Big( \log \frac{\|P_\theta(\xi_k) \circ \cdots \circ P_\theta(\xi_1) \circ P_\theta(\xi_0)\pi(x)\|}{\|P_\theta(\xi_{k-1}) \circ \cdots \circ P_\theta(\xi_1) \circ P_\theta(\xi_0)\pi(x)\|} \Big) \Big|_{\theta=\theta_0}
= \frac{1}{\sqrt{n}} \sum_{k=1}^{n} \frac{\partial}{\partial\theta_j}\, g(M_{k-1}, M_k) \Big|_{\theta=\theta_0}.
$$
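The telescoping decomposition of the log norm into increments $g(M_{k-1}, M_k)$ used above can be checked numerically for a toy product of random positive matrices; the matrices and variable names below are illustrative, not the paper's operators.

```python
import numpy as np

# Telescoping check: with ||v_0|| = 1 and v_k = A_k v_{k-1}, the sum of the
# increments log(||v_k|| / ||v_{k-1}||) equals log ||A_n ... A_1 v_0||,
# mirroring the decomposition of the score into g(M_{k-1}, M_k) terms.
rng = np.random.default_rng(1)
v = [np.array([1.0, 0.0])]                          # normalized start vector
for _ in range(10):
    A = rng.uniform(0.1, 1.0, size=(2, 2))          # random positive matrix
    v.append(A @ v[-1])
increments = [np.log(np.linalg.norm(v[k]) / np.linalg.norm(v[k - 1]))
              for k in range(1, len(v))]
total = np.log(np.linalg.norm(v[-1]))
# sum(increments) telescopes exactly to total
```

This additive structure is what turns the normalized score into a sum of a function of a Markov chain, so that the operator-theoretic central limit theorem (Corollary 1) applies.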

Now, for each $h \in M$, $\alpha = (\alpha_1, \dots, \alpha_q) \in C^q$, and a $(X \times R^d) \times M$-measurable function $\varphi$ with $\|\varphi\|_{wl} < \infty$, define

$$
(8.9)\quad (T_1(\alpha)\varphi)((x, \xi), h) = E_{(x,\xi)}^{\theta_0} \Big[ \exp\Big( (\alpha_1, \dots, \alpha_q) \cdot \Big( \frac{\partial}{\partial\theta_1} \log \|P_\theta(\xi_1) \circ P_\theta(\xi_0)h(x)\| \big|_{\theta=\theta_0}, \dots, \frac{\partial}{\partial\theta_q} \log \|P_\theta(\xi_1) \circ P_\theta(\xi_0)h(x)\| \big|_{\theta=\theta_0} \Big) \Big)\, \varphi\big( (X_1, \xi_1),\, P_\theta(\xi_1) \circ P_\theta(\xi_0)h(x) \big) \Big].
$$

By an argument similar to that of Lemma 2, for sufficiently small $|\alpha|$, $T_1(\alpha)$ is a bounded and analytic operator. Let $\lambda_{T_1}^{\theta_0}(\alpha)$ be the eigenvalue of $T_1(\alpha)$ corresponding to a one-dimensional eigenspace. Define $\gamma_j$ as in Lemma 2(v). By assumptions C1-C5 and Lemma 4, it is easy to see that
$$
(8.10)\quad \gamma_j = \frac{\partial}{\partial\alpha_j} \lambda_{T_1}^{\theta_0}(\alpha) \Big|_{\alpha=0} = E_\Pi^{\theta_0} \Big( \frac{\partial}{\partial\theta_j} \log \|P_\theta(\xi_1) \circ P_\theta(\xi_0)\pi(x)\| \Big|_{\theta=\theta_0} \Big) = 0.
$$

By Corollary 1, we have

$$
(8.11)\quad \frac{1}{\sqrt{n}} \big( l_j'(\theta_0) \big)_{j=1,\dots,q} \longrightarrow N(0, \Sigma(\theta_0)) \quad \text{in distribution},
$$
where

$$
(8.12)\quad \Sigma(\theta_0) = (\Sigma_{ij}(\theta_0)) = \Big( \frac{\partial^2 \lambda_{T_1}^{\theta_0}(\alpha)}{\partial\alpha_i \partial\alpha_j} \Big|_{\alpha=0} \Big)_{i,j=1,\dots,q}
$$
is the variance-covariance matrix.

In the following, we verify that the variance-covariance matrix $\Sigma(\theta_0)$ defined in (8.12) is the Fisher information matrix $I(\theta_0)$. By Lemma 2 and Corollary 1, we have

$$
E_\Pi^{\theta_0} \Big[ \Big( \frac{\partial}{\partial\theta_j} \log \|M_n \pi(x)\| \Big|_{\theta=\theta_0} \Big) \Big( \frac{\partial}{\partial\theta_k} \log \|M_n \pi(x)\| \Big|_{\theta=\theta_0} \Big) \Big]
- n\, \frac{\partial^2}{\partial\alpha_j \partial\alpha_k} \lambda_{T_1}^{\theta_0}(\alpha) \Big|_{\alpha=0} \longrightarrow 0
$$
as $n \to \infty$. Therefore,

$$
\Sigma_{jk}(\theta_0) = \frac{\partial^2}{\partial\alpha_j \partial\alpha_k} \lambda_{T_1}^{\theta_0}(\alpha) \Big|_{\alpha=0}
= \lim_{n\to\infty} \frac{1}{n} E_\Pi^{\theta_0} \Big[ \Big( \frac{\partial}{\partial\theta_j} \log \|M_n \pi(x)\| \Big|_{\theta=\theta_0} \Big) \Big( \frac{\partial}{\partial\theta_k} \log \|M_n \pi(x)\| \Big|_{\theta=\theta_0} \Big) \Big]
$$
$$
= -\lim_{n\to\infty} \frac{1}{n} E_\Pi^{\theta_0} \Big( \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log \|M_n \pi(x)\| \Big|_{\theta=\theta_0} \Big)
= -E_\Pi^{\theta_0} \Big( \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log \|P_\theta(\xi_1) \circ P_\theta(\xi_0)\pi(x)\| \Big|_{\theta=\theta_0} \Big)
$$
$$
= E_\Pi^{\theta_0} \Big[ \Big( \frac{\partial}{\partial\theta_j} \log \|P_\theta(\xi_1) \circ P_\theta(\xi_0)\pi(x)\| \Big|_{\theta=\theta_0} \Big) \Big( \frac{\partial}{\partial\theta_k} \log \|P_\theta(\xi_1) \circ P_\theta(\xi_0)\pi(x)\| \Big|_{\theta=\theta_0} \Big) \Big]
= I_{jk}(\theta_0). \quad \Box
$$
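The last two equalities rest on the classical information identity $E[(\partial_\theta \log f)^2] = -E[\partial_\theta^2 \log f]$. A Monte Carlo sanity check for the simplest i.i.d. case, a $N(\theta, 1)$ location model chosen only for illustration:

```python
import random

# For f = N(theta, 1): d/dtheta log f(x) = x - theta and
# d^2/dtheta^2 log f(x) = -1, so both sides of the information
# identity equal 1. We estimate the left side by simulation.
random.seed(0)
theta = 0.0
n = 100_000
score_sq = sum((random.gauss(theta, 1.0) - theta) ** 2 for _ in range(n)) / n
neg_hessian = 1.0        # -E[d^2/dtheta^2 log f], computed exactly
# score_sq should agree with neg_hessian up to Monte Carlo error
```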

References

[1] Barnsley, M. F., Elton, J. H. and Hardin, D. P. (1989). Recurrent iterated functions systems. Constr. Approx., 5, 3-31.

[2] Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist., 37, 1554-1563.

[3] Ball, F. and Rice, J. A. (1992). Stochastic models for ion channels: Introduction and Bibliography. Math. Biosciences, 112, 189-206.

[4] Bhattacharya, R. N. and Ghosh, J. K. (1978). On the validity of the formal Edgeworth expansion. Ann. Statist., 6, 434-451.

[5] Bickel, P. and Ritov, Y. (1996). Inference in hidden Markov models I: Local asymptotic normality in the stationary case. Bernoulli, 2, 199-228.

[6] Bickel, P., Ritov, Y. and Rydén, T. (1998). Asymptotic normality of the maximum likelihood estimator for general hidden Markov models. Ann. Statist., 26, 1614-1635.

[7] Bollerslev, T. (1986). Generalized autoregressive conditional heteroscedasticity. Journal of Econometrics, 31, 307-327.

[8] Caines, P. E. (1988). Linear Stochastic Systems. John Wiley & Sons, New York.

[9] Ghysels, E., Harvey, A. C. and Renault, E. (1996). Stochastic volatility. In Maddala and Rao, eds., Handbook of Statistics, 14, 119-191.

[10] Clark, P. K. (1973). A subordinated stochastic process model with finite variance for speculative prices. Econometrica, 41, 135-156.

[11] Diaconis, P. and Freedman, D. (1999). Iterated random functions. SIAM Review, 41, 45-76.

[12] Douc, R. and Matias, C. (2001). Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli, 7, 381-420.

[13] Dunsmuir, W. (1979). A central limit theorem for parameter estimation in stationary vector time series and its applications to models for a signal observed with noise. Ann. Statist., 7, 490-506.

[14] Elliott, R., Aggoun, L. and Moore, J. (1995). Hidden Markov Models: Estimation and Control. Springer-Verlag. New York.

[15] Elton, J. H. (1990). A multiplicative ergodic theorem for Lipschitz maps. Stoch. Proc. Appl., 34, 39-47.

[16] Engle, R. (1982). Autoregressive conditional heteroscedasticity with estimates of the vari- ance of U.K. inflation. Econometrica, 50, 987-1008.

[17] Fan, J. and Yao, Q. (2003). Nonlinear Time Series. Springer, New York.

[18] Francq, C. and Roussignol, M. (1998). Ergodicity of autoregressive processes with Markov- switching and consistency of the maximum likelihood estimator. Statistics, 32, 151-173.

[19] Fuh, C. D. (2003). SPRT and CUSUM in hidden Markov models. Ann. Statist., 31, 942-977.

[20] Fuh, C. D. (2004a). On Bahadur efficiency of the maximum likelihood estimator in hidden Markov models. Statistica Sinica, 14, 127-144.

[21] Fuh, C. D. (2004b). Uniform Markov and ruin probabilities in Markov random walks. Ann. Appl. Probab., to appear.

[22] Fuh, C. D. (2004c). Asymptotic operating characteristics of an optimal change point detection in hidden Markov models. Ann. Statist., to appear.

[23] Fuh, C. D. and Lai, T. L. (2001). Asymptotic expansions in multidimensional Markov renewal theory and first passage times for Markov random walks. Adv. Appl. Probab., 33, 652-673.

[24] Genon-Catalot, V., Jeantheau, T. and Larédo, C. (2000). Stochastic volatility models as hidden Markov models and statistical applications. Bernoulli, 6, 1051-1079.

[25] Goldfeld, S. M. and Quandt, R. E. (1973). A Markov model for switching regressions. Journal of Econometrics, 1, 3-16.

[26] Götze, F. and Hipp, C. (1983). Asymptotic expansions for sums of weakly dependent random vectors. Z. Wahrsch. Verw. Gebiete, 64, 211-239.

[27] Hall, P. and Yao, Q. (2003). Inference in ARCH and GARCH models with heavy-tailed errors. Econometrica, 71, 285-318.

[28] Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357-384.

[29] Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press. Princeton, New Jersey.

[30] Hannan, E. J. (1973). The asymptotic theory of linear time-series models. J. Appl. Probab., 10, 130-145.

[31] Harvey, A. C., Ruiz, E. and Shephard, N. (1994). Multivariate stochastic variance models. Rev. Econom. Stud. 61, 247-264.

[32] Ito, H., Amari, S. I. and Kobayashi, K. (1992). Identifiability of hidden Markov information sources and their minimum degrees of freedom. IEEE Trans. Inform. Theory, 38, 324-333.

[33] Jensen, J. L. and Petersen, N. V. (1999). Asymptotic normality of the maximum likelihood estimator in state space models. Ann. Statist., 27, 514-535.

[34] Krogh, A., Brown, M., Mian, I. S., Sjolander, K. and Haussler, D. (1994). Hidden Markov models in computational biology: application to protein modeling. Journal of Molecular Biology, 235, 1501-1531.

[35] Künsch, H. R. (2001). State space and hidden Markov models. In Chapter 3, Complex Stochastic Systems, Ed. by Barndorff-Nielsen, Cox and Klüppelberg. Chapman & Hall/CRC.

[36] Lee, S. W. and Hansen, B. E. (1994). Asymptotic theory for the GARCH(1,1) quasi-maximum likelihood estimator. Econometric Theory, 10, 29-52.

[37] Leroux, B. G. (1992). Maximum-likelihood estimation for hidden Markov models. Stoch. Proc. Appl., 40, 127-143.

[38] Lumsdaine, R. L. (1996). Consistency and asymptotic normality of the quasi-maximum likelihood estimator in IGARCH(1,1) and covariance stationary GARCH(1,1) models. Econometrica, 64, 575-596.

[39] Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer- Verlag, New York.

[40] Nagaev, S. V. (1957). Some limit theorems for stationary Markov chains. Theory Probab. Appl., 2, 378-406.

[41] Nagaev, S. V. (1961). More exact statements of limit theorems for homogeneous Markov chains. Theory Probab. Appl., 6, 62-81.

[42] Niemi, S. and Nummelin, E. (1986). On non-singular renewal kernels with an application to a semigroup of transition kernels. Stoch. Proc. Appl., 22, 177-202.

[43] Rabiner, L. R. and Juang, B. H. (1993). Fundamentals of Speech Recognition. Prentice Hall, New Jersey.

[44] Shephard, N. (1996). Statistical aspects of ARCH and stochastic volatility. In Time Series Models in Econometrics, Finance and Other Fields (D. R. Cox, D. V. Hinkley and O. E. Barndorff-Nielsen, eds.) Chapman & Hall, London, 1-67.

[45] Taylor, S. J. (1986). Modeling Financial Time Series. John Wiley: Chichester.

[46] Taylor, S. J. (1994). Modelling stochastic volatility. Math. Finance, 4, 183-204.

[47] Wu, W. B. and Woodroofe, M. (2000). A central limit theorem for iterated random func- tions. J. Appl. Probab., 37, 748-755.

[48] Yao, Q. and Brockwell, P. J. (2001). Gaussian maximum likelihood estimation for ARMA models I: time series. Manuscript.

C. D. FUH INSTITUTE OF STATISTICAL SCIENCE ACADEMIA SINICA TAIPEI, TAIWAN, R.O.C. E-MAIL: [email protected]
