CREATES Research Paper 2009-12

Poisson Autoregression

Konstantinos Fokianos, Anders Rahbek and Dag Tjøstheim

School of Economics and Management, Aarhus University, Bartholins Allé 10, Building 1322, DK-8000 Aarhus C, Denmark

Poisson Autoregression (Complete Version)

Konstantinos Fokianos¹, Anders Rahbek², Dag Tjøstheim³

¹Department of Mathematics & Statistics, University of Cyprus, e-mail: [email protected]

2Department of Economics, University of Copenhagen and CREATES e-mail:[email protected]

3Department of Mathematics, University of Bergen e-mail:[email protected]

Final Version: March 2009

Abstract

This paper considers geometric ergodicity and likelihood based inference for linear and nonlinear Poisson autoregressions. In the linear case the conditional mean is linked linearly to its past values as well as the observed values of the Poisson process. This also applies to the conditional variance, making an interpretation as an integer valued GARCH process possible. In a nonlinear conditional Poisson model, the conditional mean is a nonlinear function of its past values and a nonlinear function of past observations. As a particular example an exponential autoregressive Poisson model for time series is considered. Under geometric ergodicity the maximum likelihood estimators of the parameters are shown to be asymptotically Gaussian in the linear model. In addition we provide a consistent estimator of their asymptotic covariance matrix. Our approach to verifying geometric ergodicity proceeds via Markov theory and irreducibility. Finding transparent conditions for proving ergodicity turns out to be a delicate problem in the original model formulation. This problem is circumvented by allowing a perturbation of the model. We show that as the perturbations can be chosen to be arbitrarily small, the differences between the perturbed and non-perturbed versions vanish as far as the asymptotic distribution of the parameter estimates is concerned.

Keywords: asymptotic theory, count data, generalized linear models, geometric ergodicity, integer GARCH, likelihood, non-canonical link function, observation driven models, Poisson regression, φ-irreducibility.

JEL classification: C51, C22

1 Introduction

In this paper we study ergodicity and likelihood inference for a specific class of GARCH type Poisson time series models. The necessity for such an investigation arises from the fact that count dependent sequences appear in several diverse scientific fields, including medical, environmental and financial applications. As an illustrative example, consider the upper plot of Figure 4, which depicts the number of transactions per minute for some stock during a specific day. These data consist of a count time series, and therefore statistical methodology should be developed for time series modeling, estimation, inference and prediction.

Models for time series of counts have been considered by many authors; for a comprehensive account see Kedem and Fokianos (2002, Ch. 4), for instance. The most popular choice among several authors is the log-linear model. In other words, if it is assumed that $Y_t$ is conditionally Poisson distributed with mean $\lambda_t$, then most existing models are based upon regressing $\log\lambda_t$ (the canonical link parameter) on past values of the response and/or covariates. As developed by Fokianos and Kedem (2004), these models fall within the broad class of generalized linear time series models, and their analysis is based on partial likelihood inference. Estimation, diagnostics, model assessment and forecasting are implemented in a straightforward manner, with the computation carried out by a number of existing statistical computing environments. In addition, empirical evidence shows that both positive and negative association can be taken into account by a suitable parametrization of the model, see e.g. Zeger and Qaqish (1988) and Davis et al. (1999).

A largely missing element in these developments has been the possibility of an autoregressive feedback mechanism in $\{\lambda_t\}$. Such a feedback is a key feature in state space models; one example is the GARCH model for volatility. These models are generally expected to be more parsimonious. The purpose of this paper is to study autoregressive models of $\lambda_t$, both linear and nonlinear. More specifically, $\lambda_t$ is regressed on past values of the observed process and past values of $\lambda_t$ itself; this model has been considered by Rydberg and Shephard (2000), Streett (2000) and more recently by Ferland et al. (2006).

Two classes of models are proposed in the present paper. The first class is a simple linear model, given by expression (1) in Section 2.1. This model states that the conditional mean of the observed Poisson time series is a linear function of its past values and lagged values of the observations. The model can be motivated by the arguments of Rydberg and Shephard (2000), who show that it is a reasonable approximation for inference about the number of trades within a small time interval. Furthermore, model (1) is similar to the GARCH models (Bollerslev (1986)) in the sense that for the Poisson distribution the conditional mean equals the conditional variance. In fact, this is a generalized linear model for time series of counts but with an identity link, that is, a non canonical link function.

The second class of models generalizes (1) by imposing a nonlinear structure on both past values of $\lambda_t$ and lagged values of $Y_t$. Accordingly, this is a nonlinear model for Poisson time series and can be cast as a generalized nonlinear model for time series of counts, see equation (4) in Section 2.3. A specific example which falls within this framework is the Poisson analogue of the so called exponential autoregressive model, see Haggan and Ozaki (1981). In the Poisson time series framework, this model is formulated in (5). The contribution of this article is to study ergodicity and likelihood inference for both types of models. A fundamental complication with the analysis of these models is the proof of the geometric ergodicity of both the

observed process $\{Y_t\}$ and the latent process $\{\lambda_t\}$. This problem is bypassed by showing that the slightly perturbed models (3) and (6) are geometrically ergodic under simple restrictions on the parameter space. The perturbation idea is a form of regularization, similar in spirit to the analysis of ill-posed problems. Likelihood inference is developed in detail for the linear model by showing that the difference between the perturbed and unperturbed model can be made arbitrarily small as the perturbation decreases. It is shown that the corresponding maximum likelihood estimator is asymptotically normal. The results are applied to real and simulated data, and several guidelines are given for the development of algorithms for numerical maximization of the log likelihood for both types of models. Nonlinear models for count time series have not been discussed in the literature, and their development adds an additional tool to the analysis of count time series.

The organization of the presentation is as follows. Section 2 introduces the reader to the relevant models and discusses the link between their perturbed and unperturbed versions. In addition, Propositions 2.1–2.4 show the joint ergodicity of both the observed and unobserved process for the linear and nonlinear perturbed models. Section 3 develops likelihood inference for the linear model (3) and discusses some aspects of inference for the nonlinear model (6), with a special focus on the exponential autoregressive model (7). Several simulations and some real data analysis are presented in Section 4, while the presentation concludes with a discussion and an appendix which contains the proofs of the theoretical results.

2 Model Specification & Ergodicity Results

Suppose that $\{Y_t\}$ is a time series of counts and assume that $\mathcal{F}_t^{Y,\lambda}$ stands for the $\sigma$-field generated by $\{Y_0,\ldots,Y_t,\lambda_0\}$, that is $\mathcal{F}_t^{Y,\lambda} = \sigma(Y_s, s \le t, \lambda_0)$, where $\{\lambda_t\}$ is a Poisson intensity process to be introduced below. Initially, a linear Poisson GARCH type model is considered and then a nonlinear model is proposed.

2.1 Linear Model

Consider the following model, given by

$$Y_t \mid \mathcal{F}_{t-1}^{Y,\lambda} \sim \mathrm{Poisson}(\lambda_t), \qquad \lambda_t = d + a\lambda_{t-1} + bY_{t-1}, \qquad (1)$$

for $t \ge 1$, where the parameters $d$, $a$, $b$ are assumed to be positive. In addition assume that $\lambda_0$ and $Y_0$ are fixed. Recall that for the Poisson distribution the conditional mean is equal to the conditional variance, that is $E[Y_t \mid \mathcal{F}_{t-1}^{Y,\lambda}] = \mathrm{Var}[Y_t \mid \mathcal{F}_{t-1}^{Y,\lambda}] = \lambda_t$. Therefore it is tempting to call (1) an INGARCH(1,1) model, that is, an integer GARCH model, since its structure parallels that of the customary GARCH model, see Bollerslev (1986). However, although such a definition is quite plausible for this particular case, the proposed modeling is based on the evolution of the mean of the Poisson distribution instead of its variance. In other words, model (1) specifies a conditional mean relation to the past values of both $\lambda_t$ and $Y_t$.

For technical reasons having to do with the proofs of asymptotic normality of parameter estimates, it turns out to be advantageous to rephrase the model formulation (1). More specifically, it is desirable to express the sequence of independent Poisson drawings, that is the first equation of (1), more explicitly in terms of random

variables, like the observational equation in a state space model, or like the defining equation in a GARCH model giving the relationship between the observations and the conditional variance. To achieve this, for each time point $t$, introduce a Poisson process $N_t(\cdot)$ of unit intensity. Then, the first equation of (1) can be restated in terms of these Poisson processes by assuming that $Y_t$ is equal to the number of events $N_t(\lambda_t)$ of $N_t(\cdot)$ in the time interval $[0, \lambda_t]$. Let therefore $\{N_t(\cdot), t = 1, 2, \ldots\}$ be a sequence of independent Poisson processes of unit intensity and rephrase (1) as

$$Y_t = N_t(\lambda_t), \qquad \lambda_t = d + a\lambda_{t-1} + bY_{t-1}, \qquad (2)$$

for $t \ge 1$ and with $Y_0$, $\lambda_0$ fixed. This notation will be used throughout the paper, and it is emphasized that we can always recover (1) from (2).

Model (1) (or its rephrasing (2)) is related to the theory of generalized linear models for time series, see Kedem and Fokianos (2002, Ch. 1 & 4). In particular, the driving random component of the model corresponds to the Poisson distribution, which belongs to the exponential family of distributions. The link function is taken to be the identity, while the systematic component is the time dependent random vector $(1, \lambda_{t-1}, Y_{t-1})'$. Hence, (2) is a non canonical link model for time series of counts. Furthermore, notice that even though the vector of time dependent covariates that influences the evolution of (1) includes the unobserved process $\lambda_t$, the linear model still belongs to the class of observation driven models as defined by Cox (1981). This is so because the unobserved process $\lambda_t$ can be expressed as a function of past values of the observed process $Y_t$, after repeated substitution.

Observation driven models for time series of counts have been studied by several authors, including Zeger and Qaqish (1988), Li (1994) and more recently Brumback et al. (2000), Fahrmeir and Tutz (2001), Davis et al. (2003) and Jung et al. (2006). However, a log-linear model for the mean of the observed process is usually assumed, and its structure is composed of past values of the response, moving average terms and other explanatory variables. With the exception of Davis et al. (2003), who consider a simple but important case of a log-linear model, none of the other references discuss the problem of ergodicity of the joint process $(Y_t, \lambda_t)$, which is instrumental in developing asymptotic estimation theory. Second order properties of model (1) have been studied by Rydberg and Shephard (2000), while Streett (2000) shows existence and uniqueness of a stationary distribution. In a recent contribution, Ferland et al. (2006) considered model (1) in a more general form, and it was shown that the process $Y_t$ is stationary provided that $0 \le a + b < 1$, by employing a different technique from that of Streett (2000). In particular, $E[Y_t] = E[\lambda_t] \equiv \mu = d/(1 - a - b)$ and its autocovariance function is given by

 2 2  (1 − (a + b) + b )µ  , h = 0,  1 − (a + b)2 Cov [Y ,Y ] = t t+h   b(1 − a(a + b))(a + b)h−1µ  , h > 1. 1 − (a + b)2 In addition, it was shown that all moments of model (1) are finite if and only if 0 6 a + b < 1. Upon noticing that µ ¶ b2 Var[Y ] = µ 1 + , t 1 − (a + b)2

4 we conclude that Var[Yt] > E[Yt] with equality when b = 0. Thus, the inclusion of the past values of Yt in the evolution of λt leads to overdispersion–a phenomenon that occurs frequently in count time series data. In what follows, we discuss some properties of the linear model and show ergodicity for a modification of (2).

2.2 Ergodicity of a perturbed model

We first note that $\{\lambda_t\}$ defined by (2) is a Markov chain. Consider the skeleton $\lambda_t = d + a\lambda_{t-1}$ of (2). Then $\lambda^* = d/(1-a)$ is the solution of $\lambda = d + a\lambda$, i.e. a fixed point of the mapping $f(\lambda) = d + a\lambda$. In Streett (2000) it was shown that if $0 \le a + b < 1$, there exists a stationary initial distribution for $\{\lambda_t\}$ by showing that the point $\lambda^*$ is reachable, where (cf. Meyn and Tweedie (1993, p. 131)) a point $\lambda^1$ in the state space $\Lambda$ is reachable if for every neighborhood $O$ of $\lambda^1$, $\sum_n P^n(\lambda, O) > 0$, $\lambda \in \Lambda$. Here $P^n(\lambda, A) = P(\lambda_n \in A \mid \lambda_0 = \lambda)$ is the $n$'th step transition probability of $\{\lambda_t, t \ge 0\}$. In the course of the proof of Lemma A-1 in the appendix, it will be shown that every point in $[\lambda^*, \infty)$ is reachable if $0 < a < 1$, i.e. $\{\lambda_t\}$ is open set irreducible on $[\lambda^*, \infty)$. However, in our approach to showing geometric ergodicity we need $\phi$-irreducibility, where $\phi$ is the Lebesgue measure with support $[k, \infty)$, for some $k > \lambda^*$. Establishing $\phi$-irreducibility is equivalent to showing that $\sum_n P^n(\lambda, A) > 0$ if $\phi(A) > 0$. The problem is that there are sets $A$ of positive Lebesgue measure that are not open sets, for example the set of all irrationals in $[\lambda^*, \infty)$, and Lemma A-1 cannot be used directly to handle such sets. One could add an assumption stating that the chain is not allowed to stay on sets of Lebesgue measure zero, but since it is not clear how restrictive such an assumption is, we have decided to avoid that issue by resorting to the perturbed chain $(Y_t^m, \lambda_t^m)$ defined by

$$Y_t^m = N_t(\lambda_t^m), \qquad \lambda_t^m = d + a\lambda_{t-1}^m + bY_{t-1}^m + \varepsilon_{t,m}, \qquad (3)$$

with $\lambda_0^m$, $Y_0^m$ fixed, and

$$\varepsilon_{t,m} = c_m 1\left(Y_{t-1}^m = 1\right) U_t, \qquad c_m > 0, \quad c_m \to 0 \ \text{as } m \to \infty,$$

where $1(\cdot)$ is the indicator function and $\{U_t\}$ is a sequence of iid uniform random variables on $(0,1)$, independent of $\{N_t(\cdot)\}$. Note that the introduction of $\{U_t\}$ makes it possible to establish irreducibility; see the proofs of Lemma A-1 and Proposition 2.1. Another possibility would be to try an approach to proving ergodicity that does not depend on $\phi$-irreducibility, see, e.g., Aue et al. (2006) and Mikosch and Straumann (2006). But the structure of our model is different, and we have not succeeded in using such an approach. Note that $\{\lambda_t^m\}$ is still a Markov chain. Our strategy is to prove geometric ergodicity of $\{Y_t^m, \lambda_t^m\}$ defined by (3), then to use this to obtain asymptotic normality for the likelihood estimators of (3), and finally, by letting $c_m \to 0$, to obtain asymptotic normality of the likelihood estimates of (2).

The perturbation in (3) is a purely auxiliary device to obtain $\phi$-irreducibility. The $U_t$'s could be thought of as pseudo observations generated by the uniform law. As will become clear in the subsequent proofs, the perturbation can be introduced in many other ways. For instance, it is enough to take $\{U_t\}$ to be an i.i.d. sequence of positive random variables with bounded support, possessing a density on the positive real axis with respect to the Lebesgue measure and a finite fourth moment. In addition, the form of the likelihood functions for $\{Y_t\}$ and $\{Y_t^m\}$, as far as

dependence on $\{\lambda_t\}$ is concerned, will be the same for both models (2) and (3). Note that both $\{Y_t\}$ and $\{Y_t^m\}$ can be identified with the observations in the likelihood, but they cannot be identified as stochastic variables, since they are generated by different models. Technically speaking, with the introduction of $\{\varepsilon_{t,m}\}$, the process $\{\lambda_t^m\}$ is made into a $T$-chain with a continuous component, where open set irreducibility implies measure theoretic irreducibility. We have the following results concerning ergodicity of model (3). Their proofs are postponed to the Appendix, where it is first proved that the unobserved process $\{\lambda_t^m\}$ is geometrically ergodic.

Proposition 2.1 Consider model (3) and suppose that $0 < a + b < 1$. Then the process $\{\lambda_t^m, t \ge 0\}$ is a geometrically ergodic Markov chain with finite moments of order $k$, for an arbitrary $k$.

The above proposition is useful in obtaining geometric ergodicity of the joint process $(Y_t^m, U_t, \lambda_t^m)$. In fact, it can be shown that the joint process is $V_{(Y,U,\lambda)}$-geometrically ergodic with $V_{(Y,U,\lambda)}(Y, U, \lambda) = 1 + Y^k + \lambda^k + U^k$; for a definition see Meyn and Tweedie (1993, p. 355). In particular, Proposition 2.2 shows the existence of moments of $(Y_t^m, U_t, \lambda_t^m)$, for any $k$. In addition, the ergodicity result is employed in the next section, where the large sample estimation theory is studied. Note that inferential results for the $\{Y_t^m\}$ process depend upon proving that geometric ergodicity of the $\{\lambda_t^m\}$ series implies geometric ergodicity of the chain $\{(Y_t^m, U_t, \lambda_t^m)\}$. The proof of the following proposition follows closely the arguments of Meitz and Saikkonen (2008) (see also Carrasco and Chen (2002)) and is indicated in the appendix.

Proposition 2.2 Consider model (3) and suppose that the conditions of Proposition 2.1 hold. Then the process $\{(Y_t^m, \lambda_t^m, U_t), t \ge 0\}$ is a $V_{(Y,U,\lambda)}$-geometrically ergodic Markov chain with $V_{(Y,U,\lambda)}(Y, U, \lambda) = 1 + Y^k + \lambda^k + U^k$.

The following lemma quantifies the difference between (2) and (3) as $m \to \infty$ such that $c_m \to 0$, and shows that essentially the perturbed model can be made arbitrarily close to the unperturbed model. The crucial condition is that the sum of $a$ and $b$ must be less than one; as seen from Proposition 2.1, this was also the natural condition for proving ergodicity of (3). It is in the proof of this lemma that the rephrasing of model (1) as model (2) is very useful. The proof is in the appendix.

Lemma 2.1 Suppose that $(Y_t, \lambda_t)$ and $(Y_t^m, \lambda_t^m)$ are defined by (2) and (3), respectively. If $0 \le a + b < 1$, then the following statements hold:

1. $|E(\lambda_t^m - \lambda_t)| = |E(Y_t^m - Y_t)| \le \delta_{1,m}$,

2. $E(\lambda_t^m - \lambda_t)^2 \le \delta_{2,m}$,

3. $E(Y_t^m - Y_t)^2 \le \delta_{3,m}$,

where $\delta_{i,m} \to 0$ as $m \to \infty$ for $i = 1, 2, 3$. Furthermore, almost surely, for $m$ large enough,

$$|\lambda_t^m - \lambda_t| \le \delta \quad \text{and} \quad |Y_t^m - Y_t| \le \delta, \quad \text{for any } \delta > 0.$$
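Lemma 2.1 can be illustrated numerically by driving (2) and (3) with the same Poisson processes $N_t(\cdot)$ and the same uniforms $U_t$: when, say, $\lambda_t^m \ge \lambda_t$, the common process gives $Y_t^m = Y_t + N_t(\lambda_t, \lambda_t^m]$, where the extra count is Poisson with mean $\lambda_t^m - \lambda_t$. The sketch below (ours; parameter values are illustrative) implements this coupling and reports the realized gap between the two intensity processes.

```python
import numpy as np

rng = np.random.default_rng(1)
d, a, b, c_m = 0.3, 0.4, 0.5, 1e-3       # c_m small, as in (3)

lam, lam_m = 1.0, 1.0
y, y_m = 0, 0
gap = 0.0
for t in range(100_000):
    eps = c_m * (y_m == 1) * rng.uniform()   # eps_{t,m} = c_m 1(Y^m_{t-1}=1) U_t
    lam = d + a * lam + b * y                # model (2)
    lam_m = d + a * lam_m + b * y_m + eps    # model (3)
    # One unit-intensity Poisson process N_t drives both chains: the count on
    # [0, min) is shared; the sliver between the intensities adds the excess.
    lo, hi = min(lam, lam_m), max(lam, lam_m)
    shared, extra = rng.poisson(lo), rng.poisson(hi - lo)
    y = shared + (extra if lam > lam_m else 0)
    y_m = shared + (extra if lam_m > lam else 0)
    gap = max(gap, abs(lam_m - lam))
print(gap)   # small, of the order of c_m/(1-(a+b)) on average, as in the lemma
```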

2.3 Non Linear Models

A simple generalization of the linear model (2) is given by

$$Y_t = N_t(\lambda_t), \qquad \lambda_t = f(\lambda_{t-1}) + b(Y_{t-1}), \qquad (4)$$

for $t \ge 1$, where $f(\cdot)$ and $b(\cdot)$ are known functions up to an unknown finite dimensional parameter vector. In addition, both functions are defined on and take values in the positive real line, that is $f, b : \mathbb{R}^+ \to \mathbb{R}^+$. The initial values $Y_0$ and $\lambda_0$ are fixed. It is seen that (2) is a special case of (4) upon defining $f(x) = d + ax$ and $b(x) = bx$, with $d, a, b > 0$ and $x \ge 0$. There are many examples of nonlinear time series models, see Tong (1990) and Fan and Yao (2003) for comprehensive reviews. Such models have not been considered earlier in the literature in the context of generalized linear models for count time series, and they provide a flexible framework for studying dependent count data. For example, consider the so called exponential autoregressive model, which is described below.

Example 2.1 The exponential autoregressive model is defined by

$$Y_t = N_t(\lambda_t), \qquad \lambda_t = \left(a + c\exp(-\gamma\lambda_{t-1}^2)\right)\lambda_{t-1} + bY_{t-1}. \qquad (5)$$

The model parallels the structure of the traditional exponential autoregressive model, see Haggan and Ozaki (1981). Comparing recursions (4) and (5), it is clear that $f(x) = \left(a + c\exp(-\gamma x^2)\right)x$ and $b(x) = bx$, with $a, c, b, \gamma > 0$ and $x \ge 0$.
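A minimal simulation sketch of (5) (ours, not from the paper; the parameter values are those used later in Section 4.2):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_exp_ar(a, c, b, gamma, n, lam0=1.0, y0=0):
    """Simulate the exponential autoregressive Poisson model (5):
    lambda_t = (a + c*exp(-gamma*lambda_{t-1}^2))*lambda_{t-1} + b*Y_{t-1}."""
    lam, y = lam0, y0
    Y = np.empty(n)
    for t in range(n):
        lam = (a + c * np.exp(-gamma * lam**2)) * lam + b * y
        y = rng.poisson(lam)
        Y[t] = y
    return Y

Y = simulate_exp_ar(a=0.25, c=1.0, b=0.65, gamma=1.0, n=1000)
```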

Reiterating previous arguments, notice that model (4) is related to a time series following generalized linear models as described in Kedem and Fokianos (2002). Specifically, for model (5), the Poisson assumption guarantees that the conditional random component of the model belongs to the exponential family of distributions, while the link function is equal to the identity, as before. If $\gamma$ is known, then the systematic component of the model consists of the vector $(\lambda_{t-1}, \lambda_{t-1}\exp(-\gamma\lambda_{t-1}^2), Y_{t-1})'$; otherwise, (5) does not belong to the class of generalized linear models. Assume the following conditions hold for $f(\cdot)$ and $b(\cdot)$:

Assumption NL

1. There exists a unique solution of the equation $\lambda = f(\lambda)$; denote it by $\lambda^*$.

2. With $\lambda$ a positive real and $\kappa$ a positive integer, $f(\lambda)$ is increasing in $\lambda$ for $\lambda \ge \lambda^*$, and $b(\kappa)$ is increasing in $\kappa$ such that $b(\kappa) \ge \beta^*\kappa$, $\beta^* > 0$.

3. For some $\alpha_2 > 0$, $f(\lambda_2) - f(\lambda_1) \le \alpha_2(\lambda_2 - \lambda_1)$, for all $\lambda_1, \lambda_2 \ge \lambda^*$ with $\lambda_2 \ge \lambda_1$.

4. For some $\beta_2 > 0$ such that $\alpha_2 + \beta_2 < 1$, $b(\kappa_2) - b(\kappa_1) \le \beta_2(\kappa_2 - \kappa_1)$, $\kappa_2 \ge \kappa_1$.

To prove $\phi$-irreducibility we again introduce an $\varepsilon$-perturbed model

$$Y_t^m = N_t(\lambda_t^m), \qquad \lambda_t^m = f(\lambda_{t-1}^m) + b(Y_{t-1}^m) + \varepsilon_{t,m}, \qquad t \ge 1, \qquad (6)$$

where $\varepsilon_{t,m}$ has been defined as in (3). The following proposition shows that the process $\{\lambda_t^m, t \ge 0\}$, as given by (6), is geometrically ergodic. Its proof parallels the proof of Proposition 2.1 and is given in the appendix.

Proposition 2.3 Consider model (6) and suppose that Assumption NL holds true. Then the process $\{\lambda_t^m, t \ge 0\}$ is a geometrically ergodic Markov chain with finite moments of order $k$, for an arbitrary $k$.

In addition, we obtain the joint geometric ergodicity of the process $\{(Y_t^m, U_t, \lambda_t^m), t \ge 0\}$. The proof is omitted.

Proposition 2.4 Assume model (6) and suppose that the conditions of Proposition 2.3 hold true. Then the process $\{(Y_t^m, U_t, \lambda_t^m), t \ge 0\}$ is a $V_{(Y,U,\lambda)}$-geometrically ergodic Markov chain with $V_{(Y,U,\lambda)}(Y, U, \lambda) = 1 + Y^k + U^k + \lambda^k$.

We conclude this section with the following corollary for the perturbed exponential AR model

$$Y_t^m = N_t(\lambda_t^m), \qquad \lambda_t^m = \left(a + c\exp(-\gamma(\lambda_{t-1}^m)^2)\right)\lambda_{t-1}^m + bY_{t-1}^m + \varepsilon_{t,m}, \qquad t \ge 1. \qquad (7)$$

Corollary 2.1 Assume the exponential autoregressive model (7) and suppose that $0 < a + b < 1$. Then the process $\{(Y_t^m, U_t, \lambda_t^m), t \ge 0\}$ is a $V_{(Y,U,\lambda)}$-geometrically ergodic Markov chain with $V_{(Y,U,\lambda)}(Y, U, \lambda) = 1 + Y^k + U^k + \lambda^k$.

3 Likelihood Inference

Denote by $\theta$ the three dimensional vector of unknown parameters, that is $\theta = (d, a, b)'$, and denote the true value of the parameter by $\theta_0 = (d_0, a_0, b_0)'$. Then the conditional likelihood function for $\theta$, based on (2) and given the starting value $\lambda_0$, is given in terms of the observations $Y_1, \ldots, Y_n$ by

$$L(\theta) = \prod_{t=1}^{n} \frac{\exp(-\lambda_t(\theta))\,\lambda_t^{Y_t}(\theta)}{Y_t!}.$$

Here we have used the Poisson assumption, $\lambda_t(\theta) = d + a\lambda_{t-1}(\theta) + bY_{t-1}$ by (2), and $\lambda_t = \lambda_t(\theta_0)$. Hence, the log-likelihood function is given, up to a constant, by

$$l(\theta) = \sum_{t=1}^{n} l_t(\theta) = \sum_{t=1}^{n} \left(Y_t \log\lambda_t(\theta) - \lambda_t(\theta)\right), \qquad (8)$$

and the score function is defined by

$$S_n(\theta) = \frac{\partial l(\theta)}{\partial\theta} = \sum_{t=1}^{n} \frac{\partial l_t(\theta)}{\partial\theta} = \sum_{t=1}^{n} \left(\frac{Y_t}{\lambda_t(\theta)} - 1\right)\frac{\partial\lambda_t(\theta)}{\partial\theta}, \qquad (9)$$

where $\partial\lambda_t(\theta)/\partial\theta$ is a three-dimensional vector with components given by

$$\frac{\partial\lambda_t}{\partial d} = 1 + a\frac{\partial\lambda_{t-1}}{\partial d}, \qquad \frac{\partial\lambda_t}{\partial a} = \lambda_{t-1} + a\frac{\partial\lambda_{t-1}}{\partial a}, \qquad \frac{\partial\lambda_t}{\partial b} = Y_{t-1} + a\frac{\partial\lambda_{t-1}}{\partial b}. \qquad (10)$$

The solution of the equation $S_n(\theta) = 0$, if it exists, yields the conditional maximum likelihood estimator of $\theta$, which is denoted by $\hat\theta$. Furthermore, the Hessian matrix for model (2) is obtained by further differentiation of the score equations (9),

$$H_n(\theta) = -\sum_{t=1}^{n} \frac{\partial^2 l_t(\theta)}{\partial\theta\,\partial\theta'} = \sum_{t=1}^{n} \frac{Y_t}{\lambda_t^2(\theta)}\left(\frac{\partial\lambda_t(\theta)}{\partial\theta}\right)\left(\frac{\partial\lambda_t(\theta)}{\partial\theta}\right)' - \sum_{t=1}^{n} \left(\frac{Y_t}{\lambda_t(\theta)} - 1\right)\frac{\partial^2\lambda_t(\theta)}{\partial\theta\,\partial\theta'}. \qquad (11)$$
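The recursions (8)-(10) translate directly into code. The sketch below (ours, not part of the paper) evaluates the log-likelihood by iterating $\lambda_t(\theta)$ forward and maximizes it numerically with a quasi-Newton method, as is done in Section 4; the initialization $\lambda_0 = 0$, $Y_0 = 0$ mimics the data example of Section 4.3.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# data generated from model (2), for illustration only
d0, a0, b0, n = 0.3, 0.4, 0.5, 1000
lam, y = 1.0, 0
Y = np.empty(n)
for t in range(n):
    lam = d0 + a0 * lam + b0 * y
    y = rng.poisson(lam)
    Y[t] = y

def neg_loglik(theta, Y, lam0=0.0):
    """-l(theta) of (8) for model (2), with lambda_t = d + a*lambda_{t-1} + b*Y_{t-1}."""
    d, a, b = theta
    lam, yprev, out = lam0, 0.0, 0.0
    for yt in Y:
        lam = d + a * lam + b * yprev
        out -= yt * np.log(lam) - lam
        yprev = yt
    return out

# quasi-Newton maximization; the box keeps theta positive with a, b below one
fit = minimize(neg_loglik, x0=[0.5, 0.3, 0.3], args=(Y,), method="L-BFGS-B",
               bounds=[(1e-6, None), (1e-6, 1 - 1e-6), (1e-6, 1 - 1e-6)])
theta_hat = fit.x
```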

If the process $\{(Y_t, \lambda_t)\}$ is a geometrically ergodic Markov chain, then an asymptotic theory for the maximum likelihood estimator of $\theta$ can be developed directly. A problem is that at present we do not know precisely which conditions guarantee ergodicity of (2). However, the assumptions of Proposition 2.2 guarantee geometric ergodicity of the perturbed model $(Y_t^m, \lambda_t^m)$. In addition, Lemma 2.1 shows that $\lambda_t^m$ approaches $\lambda_t$ for large $m$. Hence, it is rather natural to use the ergodic properties of the perturbed process $(Y_t^m, \lambda_t^m)$ to study the asymptotic properties of the maximum likelihood estimators analogous to (9), and then to use Lemma 2.1. Towards this goal, we define the counterparts of expressions (8)-(11) for model (3).

The likelihood function, say $L^m$, including the pseudo observations $U_1, U_2, \ldots, U_n$, is given by

$$L^m(\theta) = \prod_{t=1}^{n} \frac{\exp(-\lambda_t^m(\theta))\,(\lambda_t^m(\theta))^{Y_t^m}}{Y_t^m!} \prod_{t=1}^{n} f_u(U_t),$$

by the Poisson assumption and the asserted independence of $U_t$ from $(Y_{t-1}^m, \lambda_{t-1}^m)$. Here, $f_u(\cdot)$ denotes the uniform density and $\lambda_t^m(\theta) = d + a\lambda_{t-1}^m(\theta) + bY_{t-1}^m + \varepsilon_{t,m}$, as given by (3). The log likelihood function is given, up to a constant, by

$$l^m(\theta) = \sum_{t=1}^{n} l_t^m(\theta) = \sum_{t=1}^{n} \left(Y_t^m \log\lambda_t^m(\theta) - \lambda_t^m(\theta)\right) + \sum_{t=1}^{n} \log f_u(U_t), \qquad (12)$$

whereas the score function is equal to

$$S_n^m(\theta) = \frac{\partial l^m(\theta)}{\partial\theta} = \sum_{t=1}^{n} \frac{\partial l_t^m(\theta)}{\partial\theta} = \sum_{t=1}^{n} \left(\frac{Y_t^m}{\lambda_t^m(\theta)} - 1\right)\frac{\partial\lambda_t^m(\theta)}{\partial\theta}, \qquad (13)$$

and is seen to have exactly the same form as (9), but with $\lambda_t(\theta)$ replaced by $\lambda_t^m(\theta)$. Note that the vector $\partial\lambda_t^m(\theta)/\partial\theta$ now consists of

$$\frac{\partial\lambda_t^m}{\partial d} = 1 + a\frac{\partial\lambda_{t-1}^m}{\partial d}, \qquad \frac{\partial\lambda_t^m}{\partial a} = \lambda_{t-1}^m + a\frac{\partial\lambda_{t-1}^m}{\partial a}, \qquad \frac{\partial\lambda_t^m}{\partial b} = Y_{t-1}^m + a\frac{\partial\lambda_{t-1}^m}{\partial b}. \qquad (14)$$

The solution of the equation $S_n^m(\theta) = 0$ is denoted by $\hat\theta^m$. Similarly, the Hessian matrix for model (3) is obtained as

$$H_n^m(\theta) = -\sum_{t=1}^{n} \frac{\partial^2 l_t^m(\theta)}{\partial\theta\,\partial\theta'} = \sum_{t=1}^{n} \frac{Y_t^m}{(\lambda_t^m(\theta))^2}\left(\frac{\partial\lambda_t^m(\theta)}{\partial\theta}\right)\left(\frac{\partial\lambda_t^m(\theta)}{\partial\theta}\right)' - \sum_{t=1}^{n} \left(\frac{Y_t^m}{\lambda_t^m(\theta)} - 1\right)\frac{\partial^2\lambda_t^m(\theta)}{\partial\theta\,\partial\theta'}. \qquad (15)$$

3.1 Asymptotic Theory

To study the asymptotic properties of the maximum likelihood estimator $\hat\theta$ for the linear model (2), we derive and use the asymptotic properties of the maximum likelihood estimator $\hat\theta^m$ for the perturbed linear model (3). The main tool in linking $\hat\theta^m$ to $\hat\theta$ is the following:

Proposition 3.1 (Brockwell and Davis (1991, Prop. 6.3.9)) Let $X_n$, $n = 1, 2, \ldots$, and $Y_{nm}$, $m = 1, 2, \ldots$, $n = 1, 2, \ldots$, be random $k$-vectors such that

1. $Y_{nm} \xrightarrow{D} Y_m$, as $n \to \infty$, for each $m = 1, 2, \ldots$,

2. $Y_m \xrightarrow{D} Y$, as $m \to \infty$, and

3. $\lim_{m\to\infty} \limsup_{n\to\infty} P[|X_n - Y_{nm}| > \varepsilon] = 0$, for every $\varepsilon > 0$.

Then $X_n \xrightarrow{D} Y$ as $n \to \infty$.

Accordingly, we first show that $\hat\theta^m$ is asymptotically normal, where for the proof of consistency and asymptotic normality we take advantage of the fact that the log-likelihood function is three times differentiable, applying Jensen and Rahbek (2004, Lemma 1). Then we show that the score function, the information matrix and the third derivatives of the perturbed likelihood function (12) tend to the corresponding quantities of the unperturbed likelihood function (8), which enables us to employ Proposition 3.1. To formulate the end result, introduce lower and upper values for each component of $\theta$, $\delta_L < d_0 < \delta_U$, $\alpha_L < a_0 < \alpha_U < 1$ and $\beta_L < b_0 < \beta_U$, and in terms of these define

$$O(\theta_0) = \left\{\theta \mid 0 < \delta_L \le d \le \delta_U,\ 0 < \alpha_L \le a \le \alpha_U < 1 \text{ and } 0 < \beta_L \le b \le \beta_U\right\}. \qquad (16)$$

Then the following theorem regarding the properties of the maximum likelihood estimator $\hat\theta$ holds true.

Theorem 3.1 Consider model (2) and suppose that at the true value $\theta_0$, $0 < a_0 + b_0 < 1$. Then there exists a fixed open neighborhood $O = O(\theta_0)$ of $\theta_0$ (see (16)) such that, with probability tending to one as $n \to \infty$, the log likelihood function (8) has a unique maximum point $\hat\theta$. Furthermore, $\hat\theta$ is consistent and asymptotically normal,

$$\sqrt{n}\left(\hat\theta - \theta_0\right) \xrightarrow{D} N(0, G^{-1}),$$

where the matrix $G$ is defined in Lemma 3.1. A consistent estimator of $G$ is given by $G_n(\hat\theta)$, where

$$G_n(\theta) = \sum_{t=1}^{n} \mathrm{Var}\left[\frac{\partial l_t(\theta)}{\partial\theta} \,\Big|\, \mathcal{F}_{t-1}\right] = \sum_{t=1}^{n} \frac{1}{\lambda_t(\theta)}\left(\frac{\partial\lambda_t(\theta)}{\partial\theta}\right)\left(\frac{\partial\lambda_t(\theta)}{\partial\theta}\right)'.$$
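A sketch (ours) of the covariance estimation suggested by Theorem 3.1: run the derivative recursions (10) alongside $\lambda_t(\hat\theta)$, accumulate $G_n(\hat\theta)$, and read off standard errors from the diagonal of its inverse. Here `theta_hat` is the maximizer from the sketch following (11).

```python
import numpy as np

def G_n(theta, Y, lam0=0.0):
    """G_n(theta) = sum_t (1/lambda_t) (dlambda_t/dtheta)(dlambda_t/dtheta)',
    with dlambda_t/dtheta built from the recursions (10); lambda_0 and its
    derivatives are initialized at zero."""
    d, a, b = theta
    lam, grad, yprev = lam0, np.zeros(3), 0.0
    G = np.zeros((3, 3))
    for y in Y:
        grad = np.array([1.0, lam, yprev]) + a * grad   # recursions (10)
        lam = d + a * lam + b * yprev
        G += np.outer(grad, grad) / lam
        yprev = y
    return G

# standard errors implied by Theorem 3.1, evaluated at the MLE:
# se = np.sqrt(np.diag(np.linalg.inv(G_n(theta_hat, Y))))
```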

To prove the above theorem, we need a series of results. Lemma 3.1 shows that the limiting conditional information matrix of model (3) tends to a matrix that plays the role of the conditional information matrix for model (2). In Lemmas 3.2-3.4, the conditions (A.1), (A.2) and (A.3) of Lemma 1 of Jensen and Rahbek (2004) are verified for the perturbed model (3). In addition, these lemmas show that the score function, the Hessian matrix and the third derivative of the log likelihood function of the perturbed model tend to their counterparts for the unperturbed model. These results are proved in the appendix.

Lemma 3.1 Define the matrices

$$G^m(\theta) = E\left[\frac{1}{\lambda_t^m}\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)'\right] \qquad \text{and} \qquad G(\theta) = E\left[\frac{1}{\lambda_t}\left(\frac{\partial\lambda_t}{\partial\theta}\right)\left(\frac{\partial\lambda_t}{\partial\theta}\right)'\right].$$

Under the assumptions of Theorem 3.1, the above matrices evaluated at the true value $\theta = \theta_0$ satisfy $G^m \to G$ as $m \to \infty$. In addition, $G^m$ and $G$ are positive definite.

Lemma 3.2 Under the assumptions of Theorem 3.1, the score functions defined by (9) and (13) and evaluated at the true value $\theta = \theta_0$ satisfy the following:

1. $\dfrac{1}{\sqrt{n}}\, S_n^m \xrightarrow{D} S^m := N(0, G^m)$, as $n \to \infty$, for each $m = 1, 2, \ldots$,

2. $S^m \xrightarrow{D} N(0, G)$ as $m \to \infty$,

3. $\lim_{m\to\infty} \limsup_{n\to\infty} P\left(\|S_n^m - S_n\| > \varepsilon\sqrt{n}\right) = 0$, for every $\varepsilon > 0$.

Lemma 3.3 Under the assumptions of Theorem 3.1, the Hessian matrices defined by (11) and (15) and evaluated at the true value $\theta = \theta_0$ satisfy the following:

1. $\dfrac{1}{n}\, H_n^m \xrightarrow{P} G^m$ as $n \to \infty$, for each $m = 1, 2, \ldots$,

2. $G^m \to G$, as $m \to \infty$,

3. $\lim_{m\to\infty} \limsup_{n\to\infty} P\left(\|H_n^m - H_n\| > \varepsilon n\right) = 0$, for every $\varepsilon > 0$.

Lemma 3.4 With the neighborhood $O(\theta_0)$ defined in (16), it holds under the assumptions of Theorem 3.1 that

$$\max_{i,j,k=1,2,3}\ \sup_{\theta\in O(\theta_0)} \left|\frac{1}{n}\sum_{t=1}^{n} \frac{\partial^3 l_t(\theta)}{\partial\theta_i\,\partial\theta_j\,\partial\theta_k}\right| \le M_n := \frac{1}{n}\sum_{t=1}^{n} m_t,$$

where $\theta_i$ for $i = 1, 2, 3$ refers to $\theta_1 = d$, $\theta_2 = a$ and $\theta_3 = b$, respectively. In addition,

$$m_t = C\left(Y_t\mu_{3t} + \mu_{3t} + Y_t\mu_{2t}\mu_{1t} + Y_t\mu_{1t}^3\right), \qquad \mu_{it} = \beta_U\sum_{j=1}^{t-i} k_{j,i}\,\alpha_U^{j-1}\,Y_{t-i-j},$$

with $k_{j,1} = j$, $k_{j,2} = j(j+1)$ and $k_{j,3} = j(j+1)(j+2)$. Define correspondingly $M_n^m$, $m_t^m$ and $\mu_{it}^m$ in terms of $Y_t^m$. Then

1. $M_n^m \xrightarrow{P} M^m$, as $n \to \infty$, for each $m = 1, 2, \ldots$,

2. $M^m \to M$, as $m \to \infty$, where $M$ is a finite constant,

3. $\lim_{m\to\infty} \limsup_{n\to\infty} P\left(|M_n^m - M_n| > \varepsilon n\right) = 0$, for every $\varepsilon > 0$.

3.2 Some Remarks about the Exponential Autoregressive Model

Study of the asymptotic properties of the maximum likelihood estimator for the nonlinear model (6) proceeds along the previous lines, but the corresponding analysis is more cumbersome. For instance, consider the exponential autoregressive model (5), put $\theta = (a, c, b)'$ and assume that the parameter $\gamma$ is known. Then the recursions for calculation of the score are given by

$$\frac{\partial\lambda_t}{\partial a} = \left(1 - 2\gamma c\lambda_{t-1}\exp(-\gamma\lambda_{t-1}^2)\frac{\partial\lambda_{t-1}}{\partial a}\right)\lambda_{t-1} + \left(a + c\exp(-\gamma\lambda_{t-1}^2)\right)\frac{\partial\lambda_{t-1}}{\partial a},$$

$$\frac{\partial\lambda_t}{\partial c} = \left(1 - 2\gamma c\lambda_{t-1}\frac{\partial\lambda_{t-1}}{\partial c}\right)\exp(-\gamma\lambda_{t-1}^2)\lambda_{t-1} + \left(a + c\exp(-\gamma\lambda_{t-1}^2)\right)\frac{\partial\lambda_{t-1}}{\partial c},$$

$$\frac{\partial\lambda_t}{\partial b} = a\frac{\partial\lambda_{t-1}}{\partial b} + \left(1 - 2\gamma\lambda_{t-1}^2\right)c\exp(-\gamma\lambda_{t-1}^2)\frac{\partial\lambda_{t-1}}{\partial b} + Y_{t-1}.$$

If the parameter $\gamma$ is assumed to be unknown, then the following additional recursion is needed to obtain the score equations for the enlarged vector of parameters, which includes $\gamma$:

$$\frac{\partial\lambda_t}{\partial\gamma} = -c\exp(-\gamma\lambda_{t-1}^2)\lambda_{t-1}^2\left(\lambda_{t-1} + 2\gamma\frac{\partial\lambda_{t-1}}{\partial\gamma}\right) + \left(a + c\exp(-\gamma\lambda_{t-1}^2)\right)\frac{\partial\lambda_{t-1}}{\partial\gamma}.$$

We remark that when $c = 0$ the parameter $\gamma$ is not identifiable. Therefore testing the hypothesis that $c = 0$, by means of the likelihood ratio test for instance, is more involved, but it can be accomplished by means of the methodology outlined by Davies (1987). Although we do not study in detail the asymptotic properties of the maximum likelihood estimator for the exponential autoregressive model, we believe that analogous results can be proved for models (4) and (6) under Assumption NL. In particular, when $\alpha_2 + \beta_2 < 1$, see NL4, Lemma 2.1 can be restated in terms of the nonlinear processes. We report some simulation findings in the next section which indicate that the asymptotic normality should hold also for this case.
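For completeness, a sketch (ours) of these score recursions for $\theta = (a, c, b, \gamma)'$, iterated jointly with $\lambda_t$; a positive $\lambda_0$ is needed here since $f(0) = 0$ for model (5).

```python
import numpy as np

def score_exp_ar(theta, Y, lam0=1.0):
    """Score (9) for model (5) with theta = (a, c, b, gamma), accumulated
    via the four derivative recursions of Section 3.2."""
    a, c, b, gamma = theta
    lam, yprev = lam0, 0.0
    da = dc = db = dg = 0.0
    score = np.zeros(4)
    for y in Y:
        s = np.exp(-gamma * lam**2)
        # each recursion uses only its own lagged derivative and lambda_{t-1}
        da = (1 - 2 * gamma * c * lam * s * da) * lam + (a + c * s) * da
        dc = (1 - 2 * gamma * c * lam * dc) * s * lam + (a + c * s) * dc
        db = a * db + (1 - 2 * gamma * lam**2) * c * s * db + yprev
        dg = -c * s * lam**2 * (lam + 2 * gamma * dg) + (a + c * s) * dg
        lam = (a + c * s) * lam + b * yprev
        score += (y / lam - 1) * np.array([da, dc, db, dg])
        yprev = y
    return score
```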

4 Simulation and Data Analysis

A few simulation results are reported to illustrate the asymptotic normality of the maximum likelihood estimator for both the linear and the nonlinear model. Calculation of the maximum likelihood estimators is carried out by optimizing the log-likelihood function (8) employing a quasi-Newton method; details are available from the authors. The simple linear model (2) is generated with $(d_0, a_0, b_0) = (0.3, 0.4, 0.5)$ for different sample sizes. For this choice of parameter values, $a_0 + b_0 = 0.9$, which is close to the sum $\hat a + \hat b$ of the estimates obtained from the real data example reported in Section 4.3. Results for different sets of parameter values were similar, provided that the condition $a_0 + b_0 < 1$ is satisfied, and therefore we omit them.

4.1 Simulations for the Linear Model

In Ferland et al. (2006) it was shown that (2) possesses moments up to second order identical to those of an ARMA(1,1) process which satisfies the difference equation

$$(Y_t - \mu) - (a + b)(Y_{t-1} - \mu) = e_t - ae_{t-1},$$

where $e_t$ is a white noise process with $E[e_t] = 0$ and $\mathrm{Var}[e_t] = d/(1 - a - b)$. This observation points to the potential use of least squares theory for estimation of the parameter vector $(d, a, b)$. We compare the least squares estimators with the maximum likelihood estimators in what follows. To initiate the algorithm for optimization of the log-likelihood function (8) using the recursions (10), starting values for $(d, a, b)$ are obtained from an ARMA(1,1) fit to the data.

Table 1 shows the results of the comparison between the maximum likelihood and least squares estimators. The table reports the estimates of the parameters obtained by averaging the results from all runs together. In addition, it reports the ratio of the mean square error (MSE) of the conditional least squares estimator (CLSE) to the mean square error of the MLE; see the fifth column. The MSE has been calculated from the simulation output. In all cases the MSE of the MLE is lower than the MSE of the CLSE. In addition, we use a likelihood ratio test to examine whether the ratio is equal to 1. In other words, we consider the simulated estimators from both methods as a sample from the bivariate normal with known mean vector and unknown covariance matrix. Testing whether the diagonal elements are equal, that is whether the ratio of the mean square error of the CLSE to the mean square error of the MLE is equal to 1, the likelihood ratio test yielded very small p-values in all cases, and therefore the CLSE has greater MSE than the MLE.

Furthermore, the last three columns of Table 1 show some summary statistics of the sampling distribution of the standardized MLE. When $n$ is large, the asserted asymptotic normality is supported. Notice that for $n = 500$, $\hat d$ is not approaching normality satisfactorily, but the approximation improves for larger sample sizes. Figure 1 shows histograms and qq-plots for the sampling distribution of the standardized maximum likelihood estimators, indicating the adequacy of the normal approximation, although there are some deviations in the upper plot for $\hat d$. This is consistent with the findings from Table 1.
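A sketch (ours) of the conditional least squares competitor used in Table 1: minimize the conditional sum of squares $\sum_t (Y_t - \lambda_t(\theta))^2$, with $\lambda_t(\theta)$ computed from the same recursion as in the likelihood code.

```python
import numpy as np
from scipy.optimize import minimize

def css(theta, Y, lam0=0.0):
    """Conditional sum of squares sum_t (Y_t - lambda_t(theta))^2 for model (2)."""
    d, a, b = theta
    lam, yprev, out = lam0, 0.0, 0.0
    for y in Y:
        lam = d + a * lam + b * yprev
        out += (y - lam) ** 2
        yprev = y
    return out

# Y: the observed count series (e.g., generated as in the likelihood sketch)
# cls_fit = minimize(css, x0=[0.5, 0.3, 0.3], args=(Y,), method="L-BFGS-B",
#                    bounds=[(1e-6, None), (1e-6, 1 - 1e-6), (1e-6, 1 - 1e-6)])
```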

4.2 Simulations for the Exponential Autoregressive Model

Table 2 shows some empirical results from the estimation of the exponential autoregressive model (5) with $(a_0, c_0, b_0) = (0.25, 1, 0.65)$ and for different values of $\gamma$. Note that the standard errors have been obtained as the standard deviation of the simulated estimators, as in the case of the linear model. Initially it is assumed that the parameter $\gamma$ is known. Maximization of the log likelihood function (8) under (5) is carried out as discussed in the case of the linear model. Initial values for the parameters $a$ and $b$ are obtained from the fit of the linear model (2), while the initial value of $c$ is fixed at 0.50. The results of Table 2 show that for large sample sizes the maximum likelihood estimators approach the true parameter values when the parameter $\gamma$ is known. Figure 2 shows qq-plots and histograms of the simulated estimators when $\gamma = 1$ and $n = 500$. The asserted normality appears to be an adequate approximation.

Parameters   Sample Size   MLE      CLSE     Efficiency   Skewness   Kurtosis   p-value
d0           200           0.3713   0.3909   1.3139        1.5910     9.8186    0.0000
a0                         0.3756   0.3790   1.2209       -0.2856     3.9471    0.6426
b0                         0.4967   0.4863   1.2915       -0.0696     3.2534    0.5233
d0           500           0.3271   0.3318   1.3957        0.8644     4.5069    0.0012
a0                         0.3923   0.3922   1.4299       -0.0202     3.1440    0.8296
b0                         0.4971   0.4932   1.4610       -0.0268     2.8811    0.9711
d0           1000          0.3148   0.3180   1.5651        0.4410     3.3538    0.2820
a0                         0.3954   0.3951   1.4111        0.0726     3.1183    0.8832
b0                         0.4985   0.4965   1.4204       -0.0714     2.9575    0.8375

Table 1: Results of simulation for model (2) when $(d_0, a_0, b_0) = (0.3, 0.4, 0.5)$. The third and fourth columns report the mean of the estimators obtained by the maximum likelihood and conditional least squares methods, respectively. The fifth column reports the ratio of the MSE of the CLSE to the MSE of the MLE. The remaining three columns report the sample skewness, the sample kurtosis and the p-value of a Kolmogorov-Smirnov test statistic (for testing normality) for the standardized MLE obtained by the simulation. Results are based on 1000 simulations.

Application of the exponential autoregressive model (5) becomes more involved when the parameter $\gamma$ is unknown, which is the most interesting case in applications. To estimate the parameter vector $(a, c, b, \gamma)$ of (5) jointly, the following procedure is proposed.

• Fit the linear model (2) to the data to obtain starting values for both a and b.

• Set the initial value of c equal to some constant.

• Generate a grid of values for γ and for each of these values fit model (5) with known γ.

• To maximize the log likelihood function over all $(a, c, b, \gamma)$, take as starting values the $\gamma$-value that yields the maximum log-likelihood from the previous step together with the corresponding coefficients.

Maximization is carried out numerically, as described before. Table 3 reports the results of the above method; it is observed that the sample size should be reasonably large for an adequate approximation of both $c$ and $\gamma$. Some further evidence regarding asymptotic normality is given in Figure 3.
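A sketch (ours) of the profiling procedure: for each $\gamma$ on a grid, maximize the log-likelihood of (5) over $(a, c, b)$ only, and use the best grid point as the starting value for the joint optimization; the grid matches the one used in Section 4.3.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik_exp(params, Y, gamma, lam0=1.0):
    """-l(theta) for model (5) with gamma held fixed (profile log-likelihood)."""
    a, c, b = params
    lam, yprev, out = lam0, 0.0, 0.0
    for y in Y:
        lam = (a + c * np.exp(-gamma * lam**2)) * lam + b * yprev
        out -= y * np.log(lam) - lam
        yprev = y
    return out

def profile_gamma(Y, grid, start=(0.3, 0.5, 0.3)):
    """Fit (a, c, b) for each gamma on the grid; return the gamma with the
    largest profiled log-likelihood together with its coefficients."""
    fits = [minimize(neg_loglik_exp, start, args=(Y, g), method="L-BFGS-B",
                     bounds=[(1e-6, None)] * 3) for g in grid]
    best = int(np.argmin([f.fun for f in fits]))
    return grid[best], fits[best].x

grid = np.arange(0.001, 10.0, 0.20)       # the grid used in Section 4.3
# gamma0, abc0 = profile_gamma(Y, grid)   # starting values for the joint fit
```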


Figure 1: From top to bottom: Histograms and qq-plots of the sampling distribution of the standardized estimators of $\hat\theta = (\hat d, \hat a, \hat b)$ for the linear model (2) when the true values are $(d_0, a_0, b_0) = (0.3, 0.4, 0.5)$. Superimposed is the standard normal density function. The results are based on 500 data points and 1000 simulations.

Sample size n   γ      â                ĉ                b̂
500             2      0.2470 (0.0465)  1.0241 (0.1282)  0.6468 (0.0472)
1000            2      0.2486 (0.0339)  1.0128 (0.0898)  0.6475 (0.0344)
500             1      0.2504 (0.0472)  1.0300 (0.1416)  0.6440 (0.0473)
1000            1      0.2488 (0.0336)  1.0140 (0.0943)  0.6484 (0.0324)
500             0.50   0.2510 (0.0496)  1.0262 (0.1429)  0.6457 (0.0469)
1000            0.50   0.2517 (0.0327)  1.0106 (0.0987)  0.6465 (0.0313)

Table 2: Maximum likelihood estimates and their standard errors (in parentheses) for the exponential autoregressive model (5) when $(a_0, c_0, b_0) = (0.25, 1, 0.65)$, for different sample sizes. The parameter $\gamma$ is assumed to be known. Results are based on 1000 simulations.


Figure 2: From top to bottom: Histograms and qq-plots of the sampling distribution of the standardized estimators of $\hat\theta = (\hat a, \hat c, \hat b)$ for the exponential autoregressive model (5) when the true values are $(a_0, c_0, b_0) = (0.25, 1, 0.65)$ and the parameter $\gamma$ is assumed to be known and equal to 1. Superimposed is the standard normal density function. The results are based on 500 data points and 1000 simulations.

Sample size n   True γ   â                ĉ                b̂                γ̂
500             1.50     0.2504 (0.0537)  1.1645 (0.5348)  0.6430 (0.0487)  1.8655 (1.2573)
1000            1.50     0.2491 (0.0364)  1.0593 (0.2498)  0.6456 (0.0343)  1.6357 (0.7684)
500             1        0.2500 (0.0521)  1.1190 (0.4431)  0.6426 (0.0485)  1.1542 (0.6919)
1000            1        0.2490 (0.0355)  1.0278 (0.1690)  0.6475 (0.0334)  1.0325 (0.2886)
500             0.50     0.2467 (0.0514)  1.1025 (0.3813)  0.6472 (0.0463)  0.5419 (0.2341)
1000            0.50     0.2481 (0.0337)  1.0322 (0.1590)  0.6493 (0.0305)  0.5064 (0.1087)

Table 3: Maximum likelihood estimates and their standard errors (in parentheses) for the exponential autoregressive model (5) when $(a_0, c_0, b_0, \gamma_0) = (0.25, 1, 0.65, \gamma_0)$, where $\gamma_0 \in \{0.5, 1, 1.5\}$, for different sample sizes. Results are based on 500 simulations.

16 −3 0 2 0.0 0.2 0.4 −3 −2 −1 0 1 2 3 Sample Quantiles −3 −2 −1 0 1 2 3

Theoretical Quantiles −2 0 2 0.0 0.2 0.4 −2 −1 0 1 2 3 4 Sample Quantiles −3 −2 −1 0 1 2 3

Theoretical Quantiles −3 0 2 0.0 0.3 −3 −2 −1 0 1 2 3 Sample Quantiles −3 −2 −1 0 1 2 3

Theoretical Quantiles −2 1 3 0.0 0.3 −2 −1 0 1 2 3 4 Sample Quantiles −3 −2 −1 0 1 2 3

Theoretical Quantiles

Figure 3: From top to bottom: Histograms and qq-plots of the sampling distribution of the standardized estimators of $\hat\theta = (\hat a, \hat c, \hat b, \hat\gamma)$ for the exponential autoregressive model (5) when the true values are $(a_0, c_0, b_0, \gamma_0) = (0.25, 1, 0.65, 1)$. Superimposed is the standard normal density function. The results are based on 1500 data points and 500 simulations.

4.3 Data Example

For an illustration of the methodology, models (2) and (5) are applied to the number of transactions per minute for the stock Ericsson B during July 2nd, 2002. This is part of a larger data set which includes all the transactions of this specific stock for the time period between July 2nd and July 22nd, 2002. There are 460 available observations, conveying approximately eight hours of transactions. Notice that the transactions during the first and last minutes are not taken into account. Figure 4 shows both the data and their autocorrelation function. Even though the data are counts, the plot of the usual autocorrelation function reveals the high dependence between transactions. Note that the mean number of transactions for these particular data is equal to 9.909, while their sample variance is given by 32.836. This is a case of overdispersion, as discussed in Section 2.1.


Figure 4: Number of transactions per minute for the stock Ericsson B during July 2nd, 2002. The bottom plot shows their autocorrelation function.

To model these data, set $\lambda_0 = 0$ and $\partial\lambda_0/\partial\theta = 0$ for initialization of the recursions, and consider the linear model (2). Maximization of the log-likelihood function (8) yields the following results:

$$\hat\lambda_t = \underset{(0.1628)}{0.5808} + \underset{(0.0264)}{0.7445}\,\hat\lambda_{t-1} + \underset{(0.0167)}{0.1986}\,Y_{t-1},$$

where the standard errors underneath the estimated parameters are computed using the robust sandwich matrix $H_n(\hat\theta)G_n^{-1}(\hat\theta)H_n(\hat\theta)$, where $G_n(\theta)$ has been defined in Theorem 3.1 and $H_n(\theta)$ is given by (11).

It is rather interesting to observe that $\hat a + \hat b$ is close to unity. In fact, an approximate 95% confidence interval for the parameter $a + b$ is given by (0.9097, 0.9765). This is very similar to the unit-root phenomenon in autoregressive time series found in the econometric analysis of most financial data and, moreover, akin to the IGARCH literature, where high persistence is often discovered in the conditional variance. Analogous findings can be derived from Rydberg and Shephard (2000), where transaction data were also examined.

To examine the adequacy of the fit, consider the so called Pearson residuals, defined by $e_t = (Y_t - \lambda_t)/\sqrt{\lambda_t}$. Under the correct model, the sequence $\{e_t\}$ is a white noise sequence with constant variance, see Kedem and Fokianos (2002, Sec. 1.6.3). To estimate the Pearson residuals, substitute $\lambda_t$ by $\lambda_t(\hat\theta)$. Figure 5 demonstrates that the predicted values defined by $\hat Y_t = \lambda_t(\hat\theta)$ approximate the observed process reasonably well. The bottom plot of the same figure illustrates the whiteness of the Pearson residuals by depicting the cumulative periodogram plot, see Brockwell and Davis (1991, Sec. 10.2).


Figure 5: Top: Observed and predicted (grey) number of transactions per minute using (2). Bottom: Cumulative periodogram plot of the Pearson residuals.
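A sketch (ours) of the residual diagnostics used here: Pearson residuals computed from the fitted intensities, and their mean square error $\sum_t e_t^2/(N - p)$, which is used below to compare the linear and nonlinear fits.

```python
import numpy as np

def pearson_residuals(Y, lam_hat):
    """e_t = (Y_t - lambda_t) / sqrt(lambda_t), at the fitted intensities."""
    return (Y - lam_hat) / np.sqrt(lam_hat)

def residual_mse(e, p):
    """Mean square error of the Pearson residuals, sum_t e_t^2 / (N - p),
    where p is the number of estimated parameters."""
    return (e ** 2).sum() / (len(e) - p)

# e = pearson_residuals(Y, lam_hat)   # lam_hat: fitted lambda_t(theta_hat)
# print(residual_mse(e, p=3))         # approx. 2.367 for the linear model fit
```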

Consider now the application of the Poisson exponential autoregressive model (5) to the same data. The following model is fitted

$$\hat\lambda_t = \left(\underset{(0.0232)}{0.8303} + \underset{(3.0732)}{7.030}\,\exp\left(-\underset{(0.0592)}{0.1675}\,\hat\lambda_{t-1}^2\right)\right)\hat\lambda_{t-1} + \underset{(0.0218)}{0.1551}\,Y_{t-1},$$

where the stated standard errors refer to those of $\hat a$, $\hat c$, $\hat\gamma$ and $\hat b$, respectively. The starting value for carrying out the estimation of $\gamma$ is obtained by the suggested profiling procedure. To be more specific, a grid of values is generated and the corresponding log-likelihood is evaluated at these points. For these data, the values of the parameter $\gamma$ were generated from 0.001 to 10 in increments of 0.20, that is, 50 values were obtained. The upper plot of Figure 6 shows a graph of $\gamma$ versus the corresponding log-likelihood for a part of the generated grid, and it shows that the maximum is obtained at the value 0.205, approximately. Using this value as a starting point for $\gamma$, model (5) is fitted to the data. The other graphs of Figure 6 show the predicted response (notice again that $\hat Y_t = \hat\lambda_t$) and the cumulative periodogram plot of the Pearson residuals.

To compare the models, we calculate the mean square error of the Pearson residuals, defined by $\sum_{t=1}^{N} e_t^2/(N - p)$, where $p$ is the number of estimated parameters, see Kedem and Fokianos (2002, Sec. 1.8). It turns out that for the linear model the mean square error of the Pearson residuals is approximately equal to 2.367, while for the nonlinear model it is equal to 2.392; that is, both models yield similar conclusions. This fact is supported further by the sample correlation coefficient between the Pearson residuals from the two models, which is equal to 0.991. As a final remark, we note that the numerical results, in particular for $\hat d$, are sensitive to the choice of the initial value $\lambda_0$, similar to classical GARCH model estimation.


Figure 6: Top: Obtaining an estimator for γ. Center: Observed and predicted (grey) number of transactions per minute using (5). Bottom: Cumulative periodogram plot of the raw residuals.

Acknowledgements

The authors acknowledge constructive criticism and helpful remarks from the Associate Editor and two reviewers. Many thanks also to the participants at the CFE08 conference and seminar participants at LSE and Bologna University. Comments from Neil Shephard, Andrew Patton and Theis Lange are gratefully acknowledged. K. Brännäs provided us with the transactions data. A. Rahbek would like to thank the Danish Social Sciences Research Council for financial support, Project Number 2114-04-001.

Appendix

Recall that $\lambda^* = d/(1-a)$ is a fixed point of the skeleton $\lambda_t = d + a\lambda_{t-1}$ of (2) and (3). We start by proving that $\{\lambda_t\}$, defined by (2) or (3), is open set irreducible on $[\lambda^*, \infty)$. The proof of the following lemma does not require the $\varepsilon$-perturbation of (3), but it is valid in the presence of such a perturbation.

Lemma A-1 Let $\{\lambda_t\}$ be a Markov chain defined by (2) or (3). If $0 < a < 1$, then every point in $[\lambda^*, \infty)$ is reachable.

Proof: To simplify notation, we state the proof for model (2). Consider a point $c \in [\lambda^*, \infty)$. We may assume that $\lambda_1 > c$, since otherwise we may start from $\lambda_j = d + a\lambda_{j-1} + bY_{j-1} > c$. Consider a path such that $Y_1 = \ldots = Y_i = 0$. Then $\lambda_i$ approaches $\lambda^*$ as $i$ increases, since $a < 1$, and this proves that $c$ is reachable if $c = \lambda^*$. (This was already proved by Streett (2000).) The proof is more difficult if $c > \lambda^*$. The intuitive idea is to consider realizations where $Y_1 = N$ and where the subsequent $\{Y_i, i > 1\}$ are zero. Then, if $(c_1, c_2)$ is an open interval containing $c$, $\lambda_i$ will approach this interval from above in successively smaller steps. By choosing $N$ appropriately, one of these steps will be contained in $(c_1, c_2)$. Equivalently, we can prove that one of these steps will be arbitrarily close to $c$. Therefore, let $Y_1 = N$ and $Y_2 = 0, Y_3 = 0, \ldots$. Then, for $j \ge 1$,

$$\lambda_{1+j} = \lambda_{1+j}(N) = a^{j-1}(a\lambda_1 + bN) + \frac{1 - a^j}{1 - a}\,d.$$

Let $N := N_j$ be the least integer such that

$$\lambda_{1+j}(N) > c, \qquad \lambda_{1+j}(N-1) \le c.$$

(Note that we can always choose $\lambda_1$ such that $N - 1 \ge 1$, so that the reasoning is exactly the same for the $\varepsilon$-perturbed chain given in (3).) We have $\lambda_{1+j}(N) - \lambda_{1+j}(N-1) = a^{j-1}b$. For any $\delta > 0$, since $a < 1$, we can choose $j$ such that $a^{j-1}b < \delta$, that is $c \in [\lambda_{1+j}(N-1), \lambda_{1+j}(N)]$, where the width of the interval is less than $\delta$. Since $\delta$ is arbitrary, $c$ can be approximated arbitrarily closely, and $c$ is reachable if $c \in [\lambda^*, \infty)$. $\Box$

To prove Proposition 2.1 we need to show that $\{\lambda_t^m, t \ge 0\}$ is aperiodic and $\phi$-irreducible. In addition we show the existence of a small set $C$ and a test function $V(\cdot)$ which satisfies

$$E[V(\lambda_{t+1}^m) \mid \lambda_t^m = \lambda] \le (1 - k_1)V(\lambda) + k_2\,1(\lambda \in C) \qquad \text{(A-1)}$$

for some constants $k_1$, $k_2$ such that $0 < k_1 < 1$, $0 < k_2 < \infty$. This implies that the chain $\{\lambda_t^m, t \ge 0\}$ is geometrically ergodic and, with a proper choice of $V$, that the $k$'th moment of $\lambda_t^m$ exists for an arbitrary $k$, see Meyn and Tweedie (1993). The inequality (A-1) and uniform open set reachability from a compact set $[\lambda^*, K]$, $K > \lambda^*$, can be established for both (2) and (3), but the $\varepsilon$-perturbation is used to establish the $\phi$-irreducibility and uniform $\phi$-reachability required for the existence of a small set.

Proof of Proposition 2.1

From Lemma A-1 we have that $\{\lambda_t^m, t \ge 0\}$, as defined by (3), is open set irreducible on $[\lambda^*, \infty)$. Let $A$ be a set in $[k, \infty)$ in the support of $\phi$, for some $k > \lambda^*$, where $\phi$ is the Lebesgue measure, and such that $\phi(A) > 0$. Let $c'$ be a point in $A$. Then, using the technique of the proof of Lemma A-1, for some $j$, $\lambda_{j+1}^m$ will be arbitrarily close to $(c' - d - b)/a$, where $(c' - d - b)/a > \lambda^*$ by choosing $k$ large enough. In particular, $j$ can be chosen so that $|d + a\lambda_{j+1}^m + b - c'| < \varepsilon/2$. Therefore, if $D = A \cap B$ with $B = (c' - \delta/2, c' + \delta/2)$ for some small $\delta$, and $f_u(\cdot)$ is the density of $U_t$, then the probability of being in $A$ in the next step is

$$P(A) \ge P(D) = \int_D f_u(u)\,du \ge \inf_D f_u(u)\,\phi(D) > 0,$$

which implies $\phi$-irreducibility. (Note that it follows from this proof that the $\varepsilon$-perturbation need not be introduced for $Y = 1$ but may in fact be inserted for any $Y$.) It remains to prove the existence of a small set, aperiodicity and the inequality (A-1). Existence of a small set can be proved by extending and modifying the technique of Lemma A-1. Let $C$ be a compact set, $C = [\lambda^*, K]$ for a finite $K > \lambda^*$. Since $a < 1$ and $K$ is finite, there exists an integer $n = n(\eta)$ such that, for given $\eta > 0$ and with a path where $Y_1^m = \cdots = Y_{n-1}^m = 0$, $\lambda_1^m = \lambda$, we have $|\lambda_n^m - \lambda^*| = a^n|\lambda - \lambda^*| < \eta$ for all $\lambda \in C$. Then, with $Y_n^m = N$, $Y_{n+1}^m = 0, Y_{n+2}^m = 0, \ldots$,

$$\lambda_{n+j}^m = a^j\lambda_n^m + a^{j-1}bN + \frac{1 - a^j}{1 - a}\,d = a^j(\lambda_n^m - \lambda^*) + a^{j-1}bN + \lambda^*.$$

Similarly to the proof of Lemma A-1, consider an open interval $(c_1, c_2)$ with $c_1 > \lambda^*$, and let $N$ be the least integer such that

$$\mu_{n+j}(N) \equiv a^{j-1}bN + \lambda^* > c_2,$$

where with no loss of generality we may assume that $N \ge 2$. By choosing $j$ large enough and by the previous arguments, for any $\delta > 0$ there exists a $j$ such that $a^{j-1}b < \delta$. Therefore

$$c_2 - \delta < \mu_{n+j}(N-1) < c_2.$$

But $\lambda_{n+j}^m(N-1) = \mu_{n+j}(N-1) + a^j(\lambda_n^m - \lambda^*)$, where by choosing $\eta$ small enough,

$$c_2 - \delta < \mu_{n+j}(N-1) < \lambda_{n+j}^m(N-1) < c_2,$$

so that for these choices of $n$, $j$, $N$, $\lambda_{n+j}^m(N-1) \in (c_1, c_2)$ for all $\lambda \in C$ and

$$\inf_{\lambda\in C} P^{n+j}(\lambda, (c_1, c_2)) \ge P\left(Y_1^m = 0, \ldots, Y_{n-1}^m = 0, Y_n^m = N, Y_{n+1}^m = 0, \ldots, Y_{n+j-1}^m = 0\right) > 0.

This means that the interval $(c_1, c_2)$ is uniformly reachable from all $\lambda \in [\lambda^*, K]$, and arguing as in the above proof of $\phi$-irreducibility, it follows that an $n$ can be found such that

$$\inf_{\lambda\in C} P^n(\lambda, A) > 0,$$

for a set $A$ of positive Lebesgue measure. This implies that the set $C$ is a small set.

We now show that $\{\lambda_t^m, t \ge 0\}$ is aperiodic. Consider the small set $C = [\lambda^*, K]$. Note that $\phi(C) > 0$, and let $\lambda_{t-1}^m = \lambda \in C$. Then $\lambda_t^m = d + a\lambda + bY_{t-1}^m + \varepsilon_{t,m}$. If $Y_{t-1}^m = 0$, then $\lambda_t^m = d + a\lambda = \lambda^*(1-a) + a\lambda = \lambda^* + a(\lambda - \lambda^*) \ge \lambda^*$, since $a > 0$. On the other hand, $\lambda_t^m - \lambda = \lambda^*(1-a) - \lambda(1-a) = -(1-a)(\lambda - \lambda^*) \le 0$ for $a < 1$. It is concluded that if $0 < a < 1$, then $\lambda \in C$ and $Y_{t-1}^m = 0$ imply that $\lambda_t^m \in C$, which in turn implies $P(\lambda, C) \ge P(Y_{t-1}^m = 0 \mid \lambda_{t-1}^m = \lambda) = P(N_t(\lambda) = 0) > 0$. Similarly, $P^2(\lambda, C) \ge P(Y_t^m = Y_{t-1}^m = 0 \mid \lambda_{t-1}^m = \lambda) > 0$. It follows that $\{\lambda_t^m, t \ge 0\}$ is aperiodic by Chan (1990, Prop. A1.1).

Finally, we prove the existence of a test function $V(\cdot)$ such that (A-1) holds true. Consider $V(x) = 1 + x^k$. Then

$$\begin{aligned}
E[V(\lambda_t^m) \mid \lambda_{t-1}^m = \lambda] &= E[1 + (\lambda_t^m)^k \mid \lambda_{t-1}^m = \lambda] \\
&= 1 + E[(d + a\lambda + bY_{t-1}^m + \varepsilon_{t,m})^k \mid \lambda_{t-1}^m = \lambda] \\
&= 1 + \sum_{i=0}^{k}\binom{k}{i}(a\lambda)^i(b\lambda)^{k-i} + \sum_{j=0}^{k-1} c_j\lambda^j \\
&= 1 + (a+b)^k\lambda^k + \sum_{j=0}^{k-1} c_j\lambda^j,
\end{aligned}$$

for some constants $c_j$ depending on $a$, $b$, $d$ and $\varepsilon$. Consider the small set $C = [\lambda^*, K]$ and write

$$1 + (a+b)^k\lambda^k = \left[1 - \frac{\lambda^k\left[1 - (a+b)^k\right]}{1 + \lambda^k}\right](1 + \lambda^k)\left[1(\lambda \in C) + 1(\lambda \in C^c)\right].$$

For $\lambda \in C^c$, we obtain

$$\sup_{\lambda\in C^c}\left[1 - \frac{\lambda^k\left[1 - (a+b)^k\right]}{1 + \lambda^k}\right] \to 1 - \left[1 - (a+b)^k\right] = (a+b)^k$$

as $K$ increases. Similarly, by making $K$ large enough,

$$\sup_{\lambda\in C^c} \frac{\sum_{j=1}^{k-1} c_j\lambda^j}{1 + \lambda^k} < \delta,$$

where $0 < \delta < 1 - (a+b)^k$. But on $C$, $1 + (a+b)^k\lambda^k + \sum_{j=1}^{k-1} c_j\lambda^j$ is bounded. Therefore, it follows that there exist constants $k_1$ and $k_2$ ($0 < k_1 < 1$, $0 < k_2 < \infty$) such that

$$E(V(\lambda_t^m) \mid \lambda_{t-1}^m = \lambda) \le (1 - k_1)V(\lambda) + k_2\,1(\lambda \in C),$$

and this implies that the chain $\{\lambda_t^m, t \ge 0\}$ is geometrically ergodic and that the $k$th moment of $\lambda_t^m$ exists for an arbitrary $k$.

Proof of Proposition 2.2

We will use the method of Meitz and Saikkonen (2008) to show that geometric ergodicity of the $\{\lambda_t^m\}$ process implies geometric ergodicity of the chain $\{(Y_t^m, U_t, \lambda_t^m)\}$ defined by (3). Denote the $\sigma$-algebra generated by the past of $U_{t+1}$ and the $N_t(\cdot)$ process by $\mathcal{F}_t$, that is $\mathcal{F}_t = \sigma(U_{k+1}, N_k, k \le t)$.

Note that the conditional distribution of $Y_t^m$ given $\mathcal{F}_{t-1}$ depends only on $\lambda_t^m$. In addition, the conditional distribution function of $Y_t^m$ given $\lambda_t^m = \lambda$ does not depend on $t$. Further, given the initial state $(Y_0^m, \lambda_0^m, U_1)$, we have $\lambda_1^m = d + a\lambda_0^m + bY_0^m + \varepsilon_{1,m}$, and since the conditional distribution of $Y_1^m$ given $\{Y_0^m, U_1, \lambda_0^m, \lambda_1^m\}$ is Poisson($\lambda_1^m$), conditionally on $\{Y_0^m, U_1, \lambda_0^m, \lambda_1^m\}$ we have $\lambda_2^m = d + a\lambda_1^m + bN_1(\lambda_1^m) + \varepsilon_{2,m}$, where $N_1(\lambda_1^m)$ is Poisson($\lambda_1^m$). Hence, $\{\lambda_t^m\}$, considered as a component of the trivariate chain $\{(Y_t^m, U_t, \lambda_t^m)\}$ from $t \ge 2$, has the same structure as the one-dimensional chain $\{\lambda_t^m\}$, where conditionally on $\mathcal{F}_{t-1}$

$$\lambda_t^m = d + a\lambda_{t-1}^m + bN_{t-1}(\lambda_{t-1}^m) + \varepsilon_{t,m},$$

with $N_{t-1}(\lambda_{t-1}^m)$ being Poisson($\lambda_{t-1}^m$). It follows that Assumption 1 in the paper by Meitz and Saikkonen (2008) is fulfilled. Therefore, Prop. 1 of Meitz and Saikkonen (2008) shows that $\{(Y_t^m, U_t, \lambda_t^m)\}$ inherits the geometric ergodicity of $\{\lambda_t^m\}$. Moreover, since we can take $E_\mu[1 + (d + \lambda_0^m + Y_0^m + \varepsilon_{1,m})^k] < \infty$, where $\mu$ is the distribution of the initial value $(Y_0^m, U_1, \lambda_0^m)$, and since for some constant $C(k)$

$$\begin{aligned}
E[1 + (\lambda_{t-1}^m)^k + (Y_{t-1}^m)^k + \varepsilon_{t,m}^k \mid \lambda_{t-1}^m = \lambda] &= C(k) + \lambda^k + E[(Y_{t-1}^m)^k \mid \lambda_{t-1}^m = \lambda] \\
&= C(k) + \lambda^k + \lambda^k + \sum_{i=1}^{k-1} c_i\lambda^i \\
&\le C'(k)\,(1 + \lambda^k),
\end{aligned}$$

it follows from Thm. 2 of Meitz and Saikkonen (2008) (see also their Prop. 2) that $\{(Y_t^m, U_t, \lambda_t^m)\}$ is $V_{(Y,U,\lambda)}$-geometrically ergodic with $V_{(Y,U,\lambda)}(Y, U, \lambda) = 1 + Y^k + U^k + \lambda^k$.

Proof of Lemma 2.1

It follows from the defining equations (2) and (3) that

$$\lambda_t^m - \lambda_t = a\left(\lambda_{t-1}^m - \lambda_{t-1}\right) + b\left(Y_{t-1}^m - Y_{t-1}\right) + \varepsilon_{t,m}, \qquad \text{(A-2)}$$

and by taking conditional expectations and using the properties of the Poisson process we obtain that

$$E(\lambda_t^m - \lambda_t) = (a+b)\,E\left(\lambda_{t-1}^m - \lambda_{t-1}\right) + E(\varepsilon_{t,m}) = \sum_{i=0}^{t-1}(a+b)^i\,E(\varepsilon_{t-i,m}).$$

Since $a + b < 1$ and $|E(\varepsilon_{t,m})| \le c_m$, with $c_m \to 0$ as $m \to \infty$,

$$|E(\lambda_t^m - \lambda_t)| \le \frac{c_m}{1 - (a+b)} := \delta_{1,m},$$

which proves the first assertion.

Next consider the second statement. By using (A-2) again,

$$\begin{aligned}
E(\lambda_t^m - \lambda_t)^2 &= a^2\,E\left(\lambda_{t-1}^m - \lambda_{t-1}\right)^2 + b^2\,E\left(Y_{t-1}^m - Y_{t-1}\right)^2 \\
&\quad + 2ab\,E\left[\left(Y_{t-1}^m - Y_{t-1}\right)\left(\lambda_{t-1}^m - \lambda_{t-1}\right)\right] + E\left(\varepsilon_{t,m}^2\right) \\
&\quad + 2a\,E\left[\left(\lambda_{t-1}^m - \lambda_{t-1}\right)\varepsilon_{t,m}\right] + 2b\,E\left[\left(Y_{t-1}^m - Y_{t-1}\right)\varepsilon_{t,m}\right].
\end{aligned}$$

However, for $\lambda_t^m \ge \lambda_t$,

$$E\left((Y_t^m - Y_t)(\lambda_t^m - \lambda_t)\right) = E\left[E\left((Y_t^m - Y_t)(\lambda_t^m - \lambda_t) \mid \mathcal{F}_{t-1}\right)\right] = E\left[(\lambda_t^m - \lambda_t)\,E\left(N_t[\lambda_t, \lambda_t^m]\right)\right] = E(\lambda_t^m - \lambda_t)^2,$$

where $\mathcal{F}_{t-1}$ is the $\sigma$-algebra generated by $\{U_{k+1}, N_k, k \le t-1\}$ and $N_t[\lambda_t, \lambda_t^m]$ is the number of events between $\lambda_t$ and $\lambda_t^m$ for the unit intensity Poisson process $N_t$. (If $\lambda_t^m < \lambda_t$, we work along the same lines.) By using again the properties of the Poisson process, we find that

$$E\left(Y_t^m - Y_t\right)^2 \le E\left(\lambda_t^m - \lambda_t\right)^2 + 2\left|E\left(\lambda_t^m - \lambda_t\right)\right| \le E\left(\lambda_t^m - \lambda_t\right)^2 + 2\delta_{1,m}. \qquad \text{(A-3)}$$

Finally, with $\kappa$ a positive constant,

$$E\left(\varepsilon_{t,m}^2\right) + 2a\,E\left[\left(\lambda_{t-1}^m - \lambda_{t-1}\right)\varepsilon_{t,m}\right] + 2b\,E\left[\left(Y_{t-1}^m - Y_{t-1}\right)\varepsilon_{t,m}\right] \le \kappa c_m.$$

Therefore, by a simple recursion,

$$E\left(\lambda_t^m - \lambda_t\right)^2 \le (a+b)^2 E\left(\lambda_{t-1}^m - \lambda_{t-1}\right)^2 + \kappa c_m + 2b^2\delta_{1,m} \le \delta_{2,m},$$
where $\delta_{2,m} \to 0$ as $m \to \infty$. This establishes the second assertion and hence the last statement of the Lemma. As to the third statement, it follows from (A-3) and the second assertion of the Lemma. $\Box$
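The coupling that drives this proof, namely that the perturbed and unperturbed recursions share the same unit-rate Poisson processes $N_t$, so that $Y_t = N_t(\lambda_t)$ and $Y_t^m = N_t(\lambda_t^m)$, can be mimicked in simulation. The sketch below (hypothetical parameters, same stand-in perturbation as above) illustrates that $E(\lambda_t^m - \lambda_t)^2$ shrinks as $c_m \to 0$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, a, b = 0.5, 0.3, 0.4          # hypothetical parameters, a + b < 1

def coupled_paths(n, c_m, lam0=1.0):
    """Run the perturbed and unperturbed recursions driven by the SAME
    unit-rate Poisson processes: Y_t and Y_t^m count the points of one
    realisation of N_t below lambda_t and lambda_t^m respectively."""
    lam = lam_m = lam0
    y_prev = ym_prev = 0
    diffs = np.empty(n)
    for t in range(n):
        eps = c_m * (ym_prev == 1) * rng.uniform()
        lam = d + a * lam + b * y_prev
        lam_m = d + a * lam_m + b * ym_prev + eps
        hi = max(lam, lam_m)
        pos = rng.uniform(0.0, hi, size=rng.poisson(hi))  # points of N_t
        y_prev = int(np.sum(pos <= lam))                  # Y_t
        ym_prev = int(np.sum(pos <= lam_m))               # Y_t^m
        diffs[t] = lam_m - lam
    return diffs

for c_m in [0.1, 0.01, 0.001]:
    print(f"c_m = {c_m:7.3f}   E(lam^m - lam)^2 ~ "
          f"{np.mean(coupled_paths(20_000, c_m) ** 2):.2e}")
```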

Proof of Proposition 2.3

Using the techniques of Lemma A-1 we first show that every point of [λ∗, ∞) is reachable. This part of the proof is equally valid for the unperturbed model (4), and for simplicity of notation we phrase it in terms of this model.

We note that since $\alpha_2 < 1$ the function $f$ is a contraction, and therefore the skeleton sequence defined by

$$\lambda_t = f(\lambda_{t-1})$$

converges for $\lambda_t \ge \lambda^*$ towards $\lambda^*$, the fixed point of $f$, and it follows that $\lambda^*$ is reachable since $b(Y_t) = 0$ for $Y_t = 0$. Consider a point $c \in [\lambda^*, \infty)$. Then with $Y_1 = N$ and $Y_0 = Y_2 = \cdots = Y_m = 0$,

$$\lambda_{1+j}(N) = f^{(j-1)}(f(\lambda_1) + b(N)), \qquad \text{with } f^{(i)}(x) = f(f^{(i-1)}(x)),\ f^{(1)}(x) = f(x).$$
Since $f(\lambda)$ is monotone increasing, $f^{(i)}(\lambda)$ is monotone increasing as well, and because of (NL2) there exists an integer $N$ such that

$$\lambda_{1+j}(N) > c, \qquad \lambda_{1+j}(N-1) \le c.$$

Using (NL3) repeatedly and then (NL4),

$$\lambda_{1+j}(N) - \lambda_{1+j}(N-1) = f^{(j-1)}(f(\lambda_1) + b(N)) - f^{(j-1)}(f(\lambda_1) + b(N-1)) \le \alpha_2\left(f^{(j-2)}(f(\lambda_1) + b(N)) - f^{(j-2)}(f(\lambda_1) + b(N-1))\right) \le \alpha_2^{j-1}\left(b(N) - b(N-1)\right) \le \alpha_2^{j-1}\beta_2,$$
and the proof of reachability of $c$ is concluded as in the proof of Lemma A-1. Subsequently, $\phi$-irreducibility is proved by a modification of the arguments in the first part of the proof of Proposition 2.1.

Switch notation to follow (6) and consider now the existence of a small set. The candidate set is, as before, $C = [\lambda^*, K]$. Using $\alpha_2 < 1$, there exists an $n$ such that

$$|\lambda_n^m - \lambda^*| \le \alpha_2^n\,|\lambda - \lambda^*| < \alpha_2^n\,|K - \lambda^*| < \eta,$$
which implies that $|f(\lambda_n^m) - \lambda^*| < \eta$ for all $\lambda_0 = \lambda \in C$. As before,
$$\lambda_{n+j}^m(N) = f^{(j-1)}(f(\lambda_n^m) + b(N)).$$
Consider
$$\mu_{n+j}(N) = f^{(j-1)}(b(N) + \lambda^*)$$
and an open interval $(c_1, c_2)$, and let $N$ be the least integer with $\mu_{n+j}(N) > c_2$. Then
$$\mu_{n+j}(N) - \mu_{n+j}(N-1) \le \alpha_2^{j-1}\beta_2,$$
which can be made smaller than any $\delta > 0$ by choosing $j$ large enough, so that $c_2 - \delta < \mu_{n+j}(N-1) < c_2$. Moreover,

$$\left|\lambda_{n+j}^m(N-1) - \mu_{n+j}(N-1)\right| = \left|f^{(j-1)}(f(\lambda_n^m) + b(N-1)) - f^{(j-1)}(b(N-1) + \lambda^*)\right| \le \alpha_2^{j-1}\left|f(\lambda_n^m) - \lambda^*\right| \le \alpha_2^{j-1}\eta,$$
and the derivation is concluded as in the proof of Proposition 2.1 to show that $(c_1, c_2)$ is uniformly reachable from $C$. As in the proof of Proposition 2.1, it is concluded that $C$ is a small set. Aperiodicity is also proved analogously to the proof of Proposition 2.1.

We now turn to the existence of a test function. Let $V(x) = 1 + x^k$. Then

$$E\left[V(\lambda_t^m) \mid \lambda_{t-1}^m = \lambda\right] = E\left[1 + (\lambda_t^m)^k \mid \lambda_{t-1}^m = \lambda\right] = 1 + E\left[\left(f(\lambda) + b(Y_{t-1}) + \varepsilon_{t,m}\right)^k \mid \lambda_{t-1}^m = \lambda\right] = 1 + \sum_{i=0}^{k}\binom{k}{i}\left(f(\lambda)\right)^i\,E\left[b(N_{t-1}(\lambda)) + \varepsilon_{t,m}\right]^{k-i}.$$
It follows that
$$E\left[V(\lambda_t^m) \mid \lambda_{t-1}^m = \lambda\right] \le 1 + \sum_{i=0}^{k}\binom{k}{i}\left(f(\lambda) - f(\lambda^*) + f(\lambda^*)\right)^i\beta_2^{k-i}\,E\left(N_{t-1}(\lambda) + \frac{U_t}{\beta_2}\right)^{k-i} \le 1 + \sum_{i=0}^{k}\binom{k}{i}(\alpha_2\lambda)^i(\beta_2\lambda)^{k-i} + \sum_{j=0}^{k-1}c_j\lambda^j.$$

Therefore, using the arguments of Proposition 2.1, the conclusion holds.
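The contraction step at the heart of this proof is easy to visualize: iterating any map with Lipschitz constant $\alpha_2 < 1$ drives the skeleton $\lambda_t = f(\lambda_{t-1})$ to its fixed point geometrically fast. The toy sketch below uses a hypothetical affine contraction rather than the paper's exponential autoregressive $f$.

```python
# Skeleton iteration lambda_t = f(lambda_{t-1}) for a toy contraction.
# f here is hypothetical; any f with |f(x) - f(y)| <= alpha2*|x - y|,
# alpha2 < 1, converges to its unique fixed point lambda*.
d, alpha2 = 0.5, 0.6
f = lambda x: d + alpha2 * x

lam = 50.0                         # arbitrary starting point
for i in range(60):
    lam = f(lam)
print(lam, d / (1 - alpha2))       # iterate vs. fixed point of this f
```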

Proof of Lemma 3.1

We consider here the squared terms $(\partial\lambda_t/\partial\theta)^2$ individually for $\theta_0 = d_0$, $a_0$ and $b_0$, respectively. The evaluation of these implies that identical arguments apply to the cross-product terms.

With $\theta_0 = a_0$, which is the most complicated part, we need to evaluate the difference $(1/\lambda_t^m)(\partial\lambda_t^m/\partial a)^2 - (1/\lambda_t)(\partial\lambda_t/\partial a)^2$. From (10) and (14) we obtain
$$\frac{\partial\lambda_t^m}{\partial a} - \frac{\partial\lambda_t}{\partial a} = \left(\lambda_{t-1}^m - \lambda_{t-1}\right) + a_0\left(\frac{\partial\lambda_{t-1}^m}{\partial a} - \frac{\partial\lambda_{t-1}}{\partial a}\right).$$
Using this recursion it follows that
$$E\left|\frac{\partial\lambda_t^m}{\partial a} - \frac{\partial\lambda_t}{\partial a}\right| \le E\left|\lambda_{t-1}^m - \lambda_{t-1}\right| + a_0\,E\left|\frac{\partial\lambda_{t-1}^m}{\partial a} - \frac{\partial\lambda_{t-1}}{\partial a}\right| \le \sqrt{\delta_{2,m}} + a_0\,E\left|\frac{\partial\lambda_{t-1}^m}{\partial a} - \frac{\partial\lambda_{t-1}}{\partial a}\right| \le \gamma_{1,m},$$
where $\gamma_{1,m} \to 0$ as $m \to \infty$, from the second assertion of Lemma 2.1 and $a_0 < 1$. The last display implies that
$$\left|\frac{\partial\lambda_t^m}{\partial a} - \frac{\partial\lambda_t}{\partial a}\right| \le \gamma \quad \text{a.s.} \qquad \text{(A-4)}$$
with $\gamma > 0$ arbitrarily small. Next, using the recursion and the Cauchy–Schwarz inequality,

$$E\left(\frac{\partial\lambda_t^m}{\partial a} - \frac{\partial\lambda_t}{\partial a}\right)^2 = E\left(\lambda_{t-1}^m - \lambda_{t-1}\right)^2 + a_0^2\,E\left(\frac{\partial\lambda_{t-1}^m}{\partial a} - \frac{\partial\lambda_{t-1}}{\partial a}\right)^2 + 2a_0\,E\left[\left(\lambda_{t-1}^m - \lambda_{t-1}\right)\left(\frac{\partial\lambda_{t-1}^m}{\partial a} - \frac{\partial\lambda_{t-1}}{\partial a}\right)\right] \le \delta_{2,m} + a_0^2\,E\left(\frac{\partial\lambda_{t-1}^m}{\partial a} - \frac{\partial\lambda_{t-1}}{\partial a}\right)^2 + 2a_0\gamma\sqrt{\delta_{2,m}} \le \gamma,$$

where $\gamma > 0$ can be chosen arbitrarily small. Now turn to the evaluation of $(1/\lambda_t^m)(\partial\lambda_t^m/\partial a)^2 - (1/\lambda_t)(\partial\lambda_t/\partial a)^2$:
$$E\left|\frac{1}{\lambda_t^m}\left(\frac{\partial\lambda_t^m}{\partial a}\right)^2 - \frac{1}{\lambda_t}\left(\frac{\partial\lambda_t}{\partial a}\right)^2\right| = E\left|\left(\lambda_t\left(\frac{\partial\lambda_t^m}{\partial a}\right)^2 - \lambda_t^m\left(\frac{\partial\lambda_t}{\partial a}\right)^2\right)\Big/\left(\lambda_t\lambda_t^m\right)\right| = E\left|\left(\lambda_t\left(\left(\frac{\partial\lambda_t^m}{\partial a}\right)^2 - \left(\frac{\partial\lambda_t}{\partial a}\right)^2\right) + \left(\lambda_t - \lambda_t^m\right)\left(\frac{\partial\lambda_t}{\partial a}\right)^2\right)\Big/\left(\lambda_t\lambda_t^m\right)\right|$$
$$\le \frac{1}{d_0}\,E\left|\left(\frac{\partial\lambda_t^m}{\partial a}\right)^2 - \left(\frac{\partial\lambda_t}{\partial a}\right)^2\right| + \frac{1}{d_0^2}\,E\left|\left(\frac{\partial\lambda_t}{\partial a}\right)^2\left(\lambda_t - \lambda_t^m\right)\right| \le \frac{1}{d_0}\,E\left|\left(\frac{\partial\lambda_t^m}{\partial a} - \frac{\partial\lambda_t}{\partial a}\right)\left(\frac{\partial\lambda_t^m}{\partial a} + \frac{\partial\lambda_t}{\partial a}\right)\right| + \frac{1}{d_0^2}\sqrt{E\left(\lambda_t - \lambda_t^m\right)^2\,E\left(\frac{\partial\lambda_t}{\partial a}\right)^4} \le \kappa_1\frac{\gamma}{d_0} + \kappa_2\sqrt{\delta_{2,m}} \le \tilde\gamma,$$

where $\tilde\gamma$ can be chosen arbitrarily small and the $\kappa_i$ are constants. Here we have used (2) and (3) (which guarantee that $\lambda_t \ge d_0$ and $\lambda_t^m \ge d_0$), (A-4), the second conclusion of Lemma 2.1, and finiteness of the moments of $\partial\lambda_t^m/\partial a$, $\partial\lambda_t/\partial a$ and $(\partial\lambda_t/\partial a)^4$. The latter holds by the finite moments of $\lambda_t^m$ and $\lambda_t$, using the expressions in (10) and (14) and $a_0 < 1$.

With $\theta_0 = d_0$, it follows that
$$\left|\frac{\partial\lambda_t^m}{\partial d} - \frac{\partial\lambda_t}{\partial d}\right| = a_0\left|\frac{\partial\lambda_{t-1}^m}{\partial d} - \frac{\partial\lambda_{t-1}}{\partial d}\right| \le \gamma,$$
with $\gamma$ arbitrarily small. Next, as for the case of $\theta_0 = a_0$,
$$E\left|\frac{1}{\lambda_t^m}\left(\frac{\partial\lambda_t^m}{\partial d}\right)^2 - \frac{1}{\lambda_t}\left(\frac{\partial\lambda_t}{\partial d}\right)^2\right| = E\left|\left(\lambda_t\left(\frac{\partial\lambda_t^m}{\partial d}\right)^2 - \lambda_t^m\left(\frac{\partial\lambda_t}{\partial d}\right)^2\right)\Big/\left(\lambda_t\lambda_t^m\right)\right| \le \frac{1}{d_0}\,E\left|\left(\frac{\partial\lambda_t^m}{\partial d} - \frac{\partial\lambda_t}{\partial d}\right)\left(\frac{\partial\lambda_t^m}{\partial d} + \frac{\partial\lambda_t}{\partial d}\right)\right| + \frac{2}{d_0}\sqrt{E\left(\lambda_t - \lambda_t^m\right)^2\,E\left(\frac{\partial\lambda_t}{\partial d}\right)^2} \le \kappa_1\frac{\gamma}{d_0} + \kappa_2\sqrt{\delta_{2,m}} \le \tilde\gamma,$$
using the second conclusion of Lemma 2.1. With $\theta_0 = b_0$,
$$\frac{\partial\lambda_t^m}{\partial b} - \frac{\partial\lambda_t}{\partial b} = \left(Y_{t-1}^m - Y_{t-1}\right) + a_0\left(\frac{\partial\lambda_{t-1}^m}{\partial b} - \frac{\partial\lambda_{t-1}}{\partial b}\right).$$
Using this recursion it follows that
$$E\left|\frac{\partial\lambda_t^m}{\partial b} - \frac{\partial\lambda_t}{\partial b}\right| \le E\left|Y_{t-1}^m - Y_{t-1}\right| + a_0\,E\left|\frac{\partial\lambda_{t-1}^m}{\partial b} - \frac{\partial\lambda_{t-1}}{\partial b}\right| \le \sqrt{\delta_{3,m}} + a_0\,E\left|\frac{\partial\lambda_{t-1}^m}{\partial b} - \frac{\partial\lambda_{t-1}}{\partial b}\right| \le \gamma_{1,m}, \qquad \text{(A-5)}$$
and the result follows as before.

To see that $G$ is positive definite, with $\nu \ne 0 \in \mathbb{R}^3$ it is sufficient to show that $\nu'\,\partial\lambda_t/\partial\theta \ne 0$ with positive probability. By (10), $\nu'\,\partial\lambda_t/\partial\theta \ne 0$, since $\partial\lambda_t/\partial d \ge 1$ while $Y_t$ and $\lambda_t$ are non-zero for some $t$. The same reasoning holds for $G^m$.

Proof of Lemma 3.2

Equation (13) shows that the score is given by $S_n^m(\theta) = \sum_{t=1}^n \partial l_t^m(\theta)/\partial\theta$, with martingale difference terms defined by
$$\frac{\partial l_t^m}{\partial\theta} = \left(\frac{Y_t^m}{\lambda_t^m} - 1\right)\frac{\partial\lambda_t^m}{\partial\theta} := Z_t^m\,\frac{\partial\lambda_t^m}{\partial\theta},$$
where $\partial\lambda_t^m/\partial\theta$ is defined by (14). It follows that at $\theta = \theta_0$, $E\left(\partial l_t^m/\partial\theta \mid \mathcal{F}_{t-1}\right) = 0$ and $E\left((Z_t^m)^2 \mid \mathcal{F}_{t-1}\right) = 1/\lambda_t^m$, where $\mathcal{F}_{t-1}$ is the $\sigma$-algebra generated by $\{U_{k+1}, N_k,\ k \le t-1\}$. Furthermore, from (3) and (14),
$$\frac{\partial\lambda_t^m}{\partial a} = \sum_{i=0}^{t-1} a_0^i\,\lambda_{t-1-i}^m, \qquad \frac{\partial\lambda_t^m}{\partial d} = \frac{1 - a_0^t}{1 - a_0} \qquad \text{and} \qquad \frac{\partial\lambda_t^m}{\partial b} = \sum_{i=0}^{t-1} a_0^i\,Y_{t-1-i}^m.$$
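These recursions are directly computable; as an aside, the following sketch evaluates the score for the unperturbed linear model via the derivative recursions implied by (10). The start-up values for $\lambda_0$ and its derivatives are arbitrary conventions for illustration, not the paper's.

```python
import numpy as np

def score(theta, y):
    """S_n(theta) = sum_t (Y_t/lambda_t - 1) * dlambda_t/dtheta for the
    linear model lambda_t = d + a*lambda_{t-1} + b*Y_{t-1}, with the
    derivative recursions
        dlam/dd = 1           + a * dlam_prev/dd,
        dlam/da = lambda_{t-1} + a * dlam_prev/da,
        dlam/db = Y_{t-1}      + a * dlam_prev/db."""
    d, a, b = theta
    lam = d / (1.0 - a)            # arbitrary start-up value
    dlam = np.zeros(3)             # start-up derivatives set to zero
    s = np.zeros(3)
    y_prev = y[0]
    for t in range(1, len(y)):
        dlam = np.array([1.0, lam, y_prev]) + a * dlam
        lam = d + a * lam + b * y_prev
        s += (y[t] / lam - 1.0) * dlam
        y_prev = y[t]
    return s
# At theta = theta_0 the summands are martingale differences, so
# score(theta_0, y)/n should be close to zero for long series.
```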

Observe that, as $a_0, b_0 < 1$, $E(Y_t^m)^2 < \infty$ and $E(\lambda_t^m)^2 < \infty$, the quantities $(\partial\lambda_t^m/\partial d)^2$, $E(\partial\lambda_t^m/\partial a)^2$ and $E(\partial\lambda_t^m/\partial b)^2$ are all finite. Also $1/\lambda_t^m \le 1/(d_0 - \eta_1)$ for any small $\eta_1 > 0$, where, for $m$ large enough,

$$\varepsilon_{t,m} = c_m\,\mathbb{1}\left(Y_{t-1}^m = 1\right)U_t \in [-\eta_1, \eta_1].$$

From Hölder's inequality we conclude that $E\|\partial l_t^m/\partial\theta\| < \infty$. Thus $\partial l_t^m/\partial\theta$ is a martingale difference sequence with respect to $\mathcal{F}_t$, and an application of the CLT (Hall and Heyde, 1980, Cor. 3.1) gives that $n^{-1/2}S_n^m$ is asymptotically Gaussian with covariance given by the limit
$$\frac{1}{n}\sum_{t=1}^n E\left((Z_t^m)^2\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)' \,\Big|\, \mathcal{F}_{t-1}\right) \xrightarrow{P} G^m,$$
by the LLN for geometrically ergodic processes in Jensen and Rahbek (2007). That the conditional Lindeberg condition holds follows by noting that

$$\frac{1}{n}\sum_{t=1}^n E\left(\left\|\partial l_t^m/\partial\theta\right\|^2 I\left(\left\|\partial l_t^m/\partial\theta\right\| > \sqrt{n}\,\delta\right) \,\Big|\, \mathcal{F}_{t-1}\right) \le \frac{1}{n^2\delta^2}\sum_{t=1}^n E\left(\left\|\partial l_t^m/\partial\theta\right\|^4 \,\Big|\, \mathcal{F}_{t-1}\right) \to 0,$$

since $E\|\partial l_t^m/\partial\theta\|^4 < \infty$. This proves the first assertion of the Lemma. The second assertion follows by Lemma 3.1.

We now consider the last conclusion of the Lemma. Define $Z_t$ in a way analogous to $Z_t^m$. Then
$$\frac{1}{\sqrt n}\left(S_n^m - S_n\right) = \frac{1}{\sqrt n}\sum_{t=1}^n\left(\frac{\partial l_t^m}{\partial\theta} - \frac{\partial l_t}{\partial\theta}\right) = \frac{1}{\sqrt n}\sum_{t=1}^n\left(Z_t^m\,\frac{\partial\lambda_t^m}{\partial\theta} - Z_t\,\frac{\partial\lambda_t}{\partial\theta}\right) = \frac{1}{\sqrt n}\sum_{t=1}^n\left[Z_t^m\left(\frac{\partial\lambda_t^m}{\partial\theta} - \frac{\partial\lambda_t}{\partial\theta}\right) + \left(Z_t^m - Z_t\right)\frac{\partial\lambda_t}{\partial\theta}\right].$$
But for the first summand,
$$P\left(\left\|\sum_{t=1}^n Z_t^m\left(\frac{\partial\lambda_t^m}{\partial\theta} - \frac{\partial\lambda_t}{\partial\theta}\right)\right\| > \delta\sqrt n\right) \le P\left(\gamma_{a,m}\left\|\sum_{t=1}^n Z_t^m\right\| > \delta\sqrt n\right) \le \frac{\gamma_{a,m}^2}{\delta^2 n}\sum_{t=1}^n E\left\|Z_t^m\right\|^2 \le C\gamma_{a,m}^2 \to 0,$$

as $m \to \infty$. For the second summand the same arguments apply, because $E\|\partial\lambda_t/\partial\theta\|^2 < \infty$. In addition, for

$$Z_t^m - Z_t = \frac{\lambda_t\left(Y_t^m - Y_t\right) - Y_t\left(\lambda_t^m - \lambda_t\right)}{\lambda_t\lambda_t^m},$$
we have $\|Z_t^m - Z_t\| < \gamma_{b,m} \to 0$ as $m \to \infty$. To see this, observe that
$$E\left|\frac{Y_t^m - Y_t}{\lambda_t^m}\right| \le \frac{E\left|Y_t^m - Y_t\right|}{d_0 - \eta_1} \le C\delta_{1,m} \qquad \text{and} \qquad E\left|\frac{Y_t\left(\lambda_t^m - \lambda_t\right)}{\lambda_t\lambda_t^m}\right| \le \frac{\delta_m\,E|Y_t|}{d_0(d_0 - \eta_1)} \le C\delta_m,$$
using Lemma 2.1 and the fact that $E|Y_t| < \infty$.
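Combining the `simulate_perturbed` and `score` sketches above gives a quick Monte Carlo illustration of the first assertion: replicates of $n^{-1/2}S_n^m(\theta_0)$ should be approximately Gaussian with mean zero and covariance close to $G^m$. The parameter values remain the hypothetical ones used earlier.

```python
import numpy as np

theta0 = (0.5, 0.3, 0.4)          # hypothetical true parameter
n, reps = 2000, 500
draws = np.empty((reps, 3))
for r in range(reps):
    y, _, _ = simulate_perturbed(n)           # sketch shown earlier
    draws[r] = score(theta0, y) / np.sqrt(n)  # sketch shown earlier
print("mean:", draws.mean(axis=0))            # approximately zero
print("cov:\n", np.cov(draws.T))              # approximates G^m
```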

Proof of Lemma 3.3

The Hessian matrix $H_n$ is given by (11) and can be decomposed as
$$\frac{1}{n}H_n = V_n + J_n,$$
where
$$V_n = \frac{1}{n}\sum_{t=1}^n \frac{Y_t}{\lambda_t^2}\left(\frac{\partial\lambda_t}{\partial\theta}\right)\left(\frac{\partial\lambda_t}{\partial\theta}\right)' \qquad \text{and} \qquad J_n = -\frac{1}{n}\sum_{t=1}^n Z_t\left(\frac{\partial^2\lambda_t}{\partial\theta\partial\theta'}\right),$$
with $Z_t$ defined in the proof of Lemma 3.2. Similarly we can define $V_n^m$ and $J_n^m$ for $H_n^m$ given by (15). Notice that the $(i,j)$th element of $J_n$ consists of functions of

$$\frac{\partial^2\lambda_t}{\partial d^2} = a\frac{\partial^2\lambda_{t-1}}{\partial d^2} = 0, \qquad \frac{\partial^2\lambda_t}{\partial a\,\partial d} = \frac{\partial\lambda_{t-1}}{\partial d} + a\frac{\partial^2\lambda_{t-1}}{\partial a\,\partial d},$$
$$\frac{\partial^2\lambda_t}{\partial b\,\partial d} = a\frac{\partial^2\lambda_{t-1}}{\partial b\,\partial d} = 0, \qquad \frac{\partial^2\lambda_t}{\partial a^2} = 2\frac{\partial\lambda_{t-1}}{\partial a} + a\frac{\partial^2\lambda_{t-1}}{\partial a^2}, \qquad \text{(A-6)}$$
$$\frac{\partial^2\lambda_t}{\partial b\,\partial a} = \frac{\partial\lambda_{t-1}}{\partial b} + a\frac{\partial^2\lambda_{t-1}}{\partial b\,\partial a}, \qquad \frac{\partial^2\lambda_t}{\partial b^2} = a\frac{\partial^2\lambda_{t-1}}{\partial b^2} = 0,$$
upon using (10). Similar results, but with $\lambda_t^m$, are obtained from (14).

The LLN of Jensen and Rahbek (2007) shows that $V_n^m$ converges in probability to $G^m$ defined in Lemma 3.1, since
$$E\left(\frac{Y_t^m}{(\lambda_t^m)^2}\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)'\right) = E\left(\frac{1}{\lambda_t^m}\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)'\right),$$
by a simple application of the law of iterated expectations. The result then follows by establishing that $J_n^m$ tends to zero in probability. Using again the LLN of Jensen and Rahbek (2007),
$$J_n^m \to E\left(Z_t^m\,\frac{\partial^2\lambda_t^m}{\partial\theta\partial\theta'}\right) = 0,$$

in probability. That the LLN applies follows by establishing that $\left(E\left\|Z_t^m\left(\partial^2\lambda_t^m/\partial\theta\partial\theta'\right)\right\|\right)^2 \le E(Z_t^m)^2\cdot E\left\|\partial^2\lambda_t^m/\partial\theta\partial\theta'\right\|^2 < \infty$, using similar arguments as in the proof of Lemma 3.2. For instance,

$$\frac{\partial^2\lambda_t^m}{\partial a^2} = 2\sum_{i=0}^{t-1}\sum_{j=0}^{t-i-2} a_0^i\,a_0^j\,\lambda_{t-2-i-j}^m,$$
by (A-6), and working as in the proof of Lemma 3.2 we can show that $E\left(\partial^2\lambda_t^m/\partial a^2\right)^2 < \infty$, since $a_0 < 1$ and all moments of $\lambda_t^m$ are finite. Likewise for the remaining terms.

Hence the first result has been established. The second result repeats the conclusion of Lemma 3.1. To prove the third statement, note that
$$P\left(\left\|H_n^m - H_n\right\| > \varepsilon n\right) = P\left(\left\|\left(V_n^m + J_n^m\right) - \left(V_n + J_n\right)\right\| > \varepsilon\right).$$

Rewrite
$$J_n^m - J_n = -\frac{1}{n}\sum_{t=1}^n\left(Z_t^m\,\frac{\partial^2\lambda_t^m}{\partial\theta\partial\theta'} - Z_t\,\frac{\partial^2\lambda_t}{\partial\theta\partial\theta'}\right) = -\frac{1}{n}\sum_{t=1}^n\left[Z_t^m\left(\frac{\partial^2\lambda_t^m}{\partial\theta\partial\theta'} - \frac{\partial^2\lambda_t}{\partial\theta\partial\theta'}\right) + \left(Z_t^m - Z_t\right)\frac{\partial^2\lambda_t}{\partial\theta\partial\theta'}\right].$$
The conclusion holds by showing first that
$$\left\|\frac{\partial^2\lambda_t^m}{\partial\theta\partial\theta'} - \frac{\partial^2\lambda_t}{\partial\theta\partial\theta'}\right\| < \gamma_{2,m}.$$

Equations (10) for the perturbed model show that this is implied by $\left\|\partial\lambda_t^m/\partial\theta - \partial\lambda_t/\partial\theta\right\| < \gamma_{2,m}$, analogous to the proof of Lemma 3.1. In addition, we have that $\|Z_t^m - Z_t\| < \gamma_{b,m}$ from the proof of Lemma 3.2. Likewise, rewriting
$$V_n^m - V_n = \frac{1}{n}\sum_{t=1}^n\left(\frac{Y_t^m}{(\lambda_t^m)^2}\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)' - \frac{Y_t}{\lambda_t^2}\left(\frac{\partial\lambda_t}{\partial\theta}\right)\left(\frac{\partial\lambda_t}{\partial\theta}\right)'\right) = \frac{1}{n}\sum_{t=1}^n\left[\frac{Y_t^m}{(\lambda_t^m)^2}\left(\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)\left(\frac{\partial\lambda_t^m}{\partial\theta}\right)' - \left(\frac{\partial\lambda_t}{\partial\theta}\right)\left(\frac{\partial\lambda_t}{\partial\theta}\right)'\right) + \left(\frac{Y_t^m}{(\lambda_t^m)^2} - \frac{Y_t}{\lambda_t^2}\right)\left(\frac{\partial\lambda_t}{\partial\theta}\right)\left(\frac{\partial\lambda_t}{\partial\theta}\right)'\right],$$
the result follows.
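For completeness, the decomposition $\frac{1}{n}H_n = V_n + J_n$ and the second-derivative recursions (A-6) translate directly into code. The sketch below uses the same start-up conventions as the score sketch above, with parameter order $(d, a, b)$; it is an illustration, not the paper's implementation.

```python
import numpy as np

def hessian_parts(theta, y):
    """Compute V_n and J_n, with (1/n) H_n = V_n + J_n, for the linear
    model, using recursions (10) for dlambda/dtheta and (A-6) for the
    second derivatives (only entries involving 'a' are non-zero)."""
    d, a, b = theta
    lam = d / (1.0 - a)            # arbitrary start-up value
    dlam = np.zeros(3)
    d2lam = np.zeros((3, 3))
    V = np.zeros((3, 3))
    J = np.zeros((3, 3))
    y_prev = y[0]
    for t in range(1, len(y)):
        d2_new = a * d2lam                                 # common term
        d2_new[1, 0] += dlam[0]; d2_new[0, 1] += dlam[0]   # d^2/(da dd)
        d2_new[1, 1] += 2.0 * dlam[1]                      # d^2/da^2
        d2_new[1, 2] += dlam[2]; d2_new[2, 1] += dlam[2]   # d^2/(db da)
        dlam = np.array([1.0, lam, y_prev]) + a * dlam
        d2lam = d2_new
        lam = d + a * lam + b * y_prev
        V += (y[t] / lam ** 2) * np.outer(dlam, dlam)
        J -= (y[t] / lam - 1.0) * d2lam        # J_n carries a minus sign
        y_prev = y[t]
    n = len(y)
    return V / n, J / n
# For data generated at theta_0, V_n approximates G while J_n vanishes.
```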

Proof of Lemma 3.4

By (11), the third order derivative of the terms $l_t(\theta)$ in the log-likelihood function, $\partial^3 l_t(\theta)/\partial\theta_i\partial\theta_j\partial\theta_k$ with $i, j, k = 1, 2, 3$ referring to $d$, $a$ and $b$ respectively, is given by

$$\frac{\partial^3 l_t(\theta)}{\partial\theta_i\partial\theta_j\partial\theta_k} = -\frac{Y_t}{\lambda_t^2(\theta)}\left(\frac{\partial^2\lambda_t(\theta)}{\partial\theta_i\partial\theta_j}\frac{\partial\lambda_t(\theta)}{\partial\theta_k} + \frac{\partial^2\lambda_t(\theta)}{\partial\theta_i\partial\theta_k}\frac{\partial\lambda_t(\theta)}{\partial\theta_j} + \frac{\partial^2\lambda_t(\theta)}{\partial\theta_j\partial\theta_k}\frac{\partial\lambda_t(\theta)}{\partial\theta_i}\right) + 2\frac{Y_t}{\lambda_t^3(\theta)}\frac{\partial\lambda_t(\theta)}{\partial\theta_k}\frac{\partial\lambda_t(\theta)}{\partial\theta_j}\frac{\partial\lambda_t(\theta)}{\partial\theta_i} + \left(\frac{Y_t}{\lambda_t(\theta)} - 1\right)\frac{\partial^3\lambda_t(\theta)}{\partial\theta_i\partial\theta_j\partial\theta_k}.$$

It is simple to see that $Y_t/\lambda_t^2(\theta) \le Y_t/\delta_L^2$, $Y_t/\lambda_t^3(\theta) \le Y_t/\delta_L^3$ and $|Y_t/\lambda_t(\theta) - 1| \le Y_t/\delta_L + 1$. Next, turn to the derivatives, where the first and second order derivatives are given by (10) and (A-6). The most involved ones are the derivatives with respect to the parameter $a$, and we focus on these here without loss of generality, as identical arguments for the remaining derivatives give the desired uniform upper bounds. By (2) and (10), and with $\lambda_0$ fixed,

$$\lambda_t(\theta) = b\sum_{i=0}^{t-1} a^i Y_{t-1-i} + d\,\frac{1-a^t}{1-a} + a^t\lambda_0 \le \mu_{0t} = \beta_U\sum_{i=0}^{t-1}\alpha_U^i\,Y_{t-1-i} + c_0,$$

where $c_0 = \delta_U/(1-\alpha_U) + \lambda_0$ and $t = 1, 2, \ldots, T$.
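The closed-form expression for $\lambda_t(\theta)$ above is easily verified against the defining recursion; a small numerical sketch with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)
d, a, b, lam0 = 0.5, 0.3, 0.4, 1.0   # hypothetical values
y = rng.poisson(1.0, size=50)        # any fixed count sequence will do

lam = lam0                           # recursive evaluation of lambda_t
for s in range(1, len(y)):
    lam = d + a * lam + b * y[s - 1]

t = len(y) - 1                       # closed form used for the bound mu_0t
closed = (b * sum(a ** i * y[t - 1 - i] for i in range(t))
          + d * (1 - a ** t) / (1 - a) + a ** t * lam0)
print(lam, closed)                   # agree up to rounding error
```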

For $t = 2, \ldots, T$ and $c_1 = c_0/(1-\alpha_U)$,
$$\frac{\partial\lambda_t(\theta)}{\partial a} = \sum_{i=0}^{t-2} a^i\,\lambda_{t-1-i}(\theta) \le \mu_{1t} = \beta_U\sum_{i=1}^{t-1} i\,\alpha_U^{i-1}\,Y_{t-1-i} + c_1.$$
Likewise,

$$\frac{\partial^2\lambda_t(\theta)}{\partial a^2} = 2\sum_{i=0}^{t-3} a^i\,\frac{\partial\lambda_{t-1-i}(\theta)}{\partial a} \le \mu_{2t} = \beta_U\sum_{i=1}^{t-2}(i+1)\,i\,\alpha_U^{i-1}\,Y_{t-2-i} + c_2,$$
$$\frac{\partial^3\lambda_t(\theta)}{\partial a^3} = 3\sum_{i=0}^{t-4} a^i\,\frac{\partial^2\lambda_{t-1-i}(\theta)}{\partial a^2} \le \mu_{3t} = \beta_U\sum_{i=1}^{t-3}(i+2)(i+1)\,i\,\alpha_U^{i-1}\,Y_{t-3-i} + c_3,$$

where $c_2 = 2c_0/(1-\alpha_U)^2$ and $c_3 = 6c_0/(1-\alpha_U)^3$. Hence, with $\theta_i = \theta_j = \theta_k = a$ and $M^\star$ a constant,
$$\left|\frac{\partial^3 l_t(\theta)}{\partial\theta_i\partial\theta_j\partial\theta_k}\right| \le m_t := M^\star\left(Y_t\mu_{3t} + \mu_{3t} + Y_t\mu_{2t}\mu_{1t} + Y_t\mu_{1t}^3\right).$$

It follows that $Em_t < \infty$ by considering each term in $m_t$ separately. For example, for some constant $C$, using that $EY_t^k$ is finite for all $k$, so that in particular $\max E|Y_{t-j}Y_{t-k}|$ is bounded by a finite constant, one finds

$$E\mu_{3t}^2 \le C\sum_{i=1}^{t-3}\left((i+2)(i+1)\,i\right)^2\alpha_U^{2(i-1)} < \infty,$$
as $\alpha_U < 1$. Thus $E\mu_{3t}$ and $EY_t\mu_{3t}$ are finite. Likewise for the other terms.

That $M_n^m \xrightarrow{P} M^m$, with $M^m = Em_t^m$, holds by the LLN for geometrically ergodic processes in Jensen and Rahbek (2007). That $M^m \to M = Em_t$ as $m \to \infty$ holds by evaluating
$$E\left|Y_t^m\mu_{3t}^m - Y_t\mu_{3t}\right|, \qquad E\left|\mu_{3t}^m - \mu_{3t}\right|, \qquad E\left|Y_t^m\mu_{2t}^m\mu_{1t}^m - Y_t\mu_{2t}\mu_{1t}\right| \qquad \text{and} \qquad E\left|Y_t^m(\mu_{1t}^m)^3 - Y_t\mu_{1t}^3\right|.$$

Consider the first term,

$$E\left|Y_t^m\mu_{3t}^m - Y_t\mu_{3t}\right| \le E\left|\left(Y_t^m - Y_t\right)\mu_{3t}^m\right| + E\left|Y_t\left(\mu_{3t}^m - \mu_{3t}\right)\right|.$$

But $|Y_t^m - Y_t| \le \delta_m$, $E|\mu_{3t}^m| < \infty$, $E|Y_t^m| < \infty$ and, finally,
$$\left|\mu_{3t}^m - \mu_{3t}\right| = \left|\beta_U\sum_{j=1}^{t-i} k_{j,i}\,\alpha_U^{j-1}\left(Y_{t-i-j}^m - Y_{t-i-j}\right)\right| < \eta_m \to 0.$$

Similarly for the remaining terms. Finally, it can be shown that
$$P\left(\left|M_n^m - M_n\right| > \epsilon n\right) = P\left(\left|\sum_{t=1}^n\left(m_t^m - m_t\right)\right| > \epsilon n\right) \to 0,$$
as before.
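As a last numerical remark, the finiteness of the bounding series for $E\mu_{3t}^2$ is immediate: the cubic polynomial growth of the coefficients is dominated by the geometric decay. For instance, with a hypothetical $\alpha_U = 0.9$:

```python
# Partial sums of sum_i ((i+2)(i+1)i)^2 * alpha_U^(2(i-1)) stabilise
# quickly because the geometric factor dominates the polynomial one.
alpha_U = 0.9
s = sum(((i + 2) * (i + 1) * i) ** 2 * alpha_U ** (2 * (i - 1))
        for i in range(1, 2000))
print(s)    # a finite number; the tail terms underflow to zero
```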

References

Aue, A., I. Berkes, and L. Horváth (2006). Strong approximation for the sums of squares of augmented GARCH sequences. Bernoulli 12, 583–608.

Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–327.

Brockwell, P. J. and R. A. Davis (1991). Time Series: Theory and Methods (2nd ed.). New York: Springer.

Brumback, B. A., L. M. Ryan, J. D. Schwartz, L. M. Neas, P. C. Stark, and H. A. Burge (2000). Transitional regression models with application to environmental time series. Journal of the American Statistical Association 95, 16–27.

Carrasco, M. and X. Chen (2002). Mixing and moment properties of various GARCH and stochastic volatility models. Econometric Theory 18, 1–39.

Chan, K. S. (1990). Deterministic stability, stochastic stability and ergodicity, Appendix 1 of H. Tong (1990), Non–Linear Time Series: A Dynamical System Approach. Clarendon Press, Oxford.

Cox, D. R. (1981). Statistical analysis of time series: Some recent developments. Scandinavian Journal of Statistics 8, 93–115.

Davies, R. B. (1987). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 74, 33–43.

Davis, R. A., W. T. M. Dunsmuir, and S. B. Streett (2003). Observation-driven models for Poisson counts. Biometrika 90, 777–790.

Davis, R. A., Y. Wang, and W. T. M. Dunsmuir (1999). Modelling time series of count data. In S. Ghosh (Ed.), Asymptotics, Nonparametrics and Time Series, pp. 63–114. New York: Marcel Dekker.

Fahrmeir, L. and G. Tutz (2001). Multivariate Statistical Modelling Based on Generalized Linear Models (2nd ed.). New York: Springer.

Fan, J. and Q. Yao (2003). Nonlinear Time Series. New York: Springer-Verlag.

Ferland, R., A. Latour, and D. Oraichi (2006). Integer-valued GARCH processes. Journal of Time Series Analysis 27, 923–942.

Fokianos, K. and B. Kedem (2004). Partial likelihood inference for time series following generalized linear models. Journal of Time Series Analysis 25, 173–197.

Haggan, V. and T. Ozaki (1981). Modelling nonlinear random vibrations using an amplitude-dependent autore- gressive time series model. Biometrika 68, 189–196.

Hall, P. and C. C. Heyde (1980). Martingale Limit Theory and Its Application. New York: Academic Press.

Jensen, S. T. and A. Rahbek (2004). Asymptotic inference for nonstationary GARCH. Econometric Theory 20, 1203–1226.

Jensen, S. T. and A. Rahbek (2007). On the for (geometrically) ergodic Markov chains. Econometric Theory 23, 761–766.

Jung, R. C., M. Kukuk, and R. Liesenfeld (2006). Time series of count data: modeling, estimation and diagnostics. Computational Statistics & Data Analysis 51, 2350–2364.

Kedem, B. and K. Fokianos (2002). Regression Models for Time Series Analysis. Hoboken, NJ: Wiley.

Li, W. K. (1994). Time series models based on generalized linear models: some further results. Biometrics 50, 506–511.

Meitz, M. and P. Saikkonen (2008). Ergodicity, mixing and existence of moments of a class of Markov models with applications to GARCH and ACD models. Econometric Theory 24, 1291–1320.

Meyn, S. P. and R. L. Tweedie (1993). Markov Chains and Stochastic Stability. London: Springer.

Mikosch, T. and D. Straumann (2006). Stable limits of martingale transforms with application to the estimation of GARCH parameters. The Annals of Statistics 34, 493–522.

Rydberg, T. H. and N. Shephard (2000). A modeling framework for the prices and times of trades on the New York stock exchange. In W. J. Fitzgerald, R. L. Smith, A. T. Walden, and P. C. Young (Eds.), Nonlinear and Nonstationary Signal Processing, pp. 217–246. Cambridge: Isaac Newton Institute and Cambridge University Press.

Streett, S. (2000). Some observation driven models for time series of counts. Ph. D. thesis, Colorado State University, Department of Statistics.

Tong, H. (1990). Nonlinear Time Series: A Dynamical System Approach. New York: Oxford University Press.

Zeger, S. L. and B. Qaqish (1988). Markov regression models for time series: a quasi-likelihood approach. Biometrics 44, 1019–1031.
