<<

Gaussian Mixture Vector Autoregression∗

Leena Kalliovirta (University of Helsinki and Natural Resources Institute Finland), Mika Meitz (University of Helsinki), and Pentti Saikkonen (University of Helsinki)

January 5, 2016

Abstract

This paper proposes a new nonlinear vector autoregressive (VAR) model referred to as the Gaussian mixture vector autoregressive (GMVAR) model. The GMVAR model belongs to the family of mixture vector autoregressive models and is designed for analyzing time series that exhibit regime-switching behavior. The main difference between the GMVAR model and previous mixture VAR models lies in the definition of the mixing weights that govern the regime probabilities. In the GMVAR model the mixing weights depend on past values of the series in a specific way that has very advantageous properties from both a theoretical and a practical point of view. A practical advantage is that there is a wide diversity of ways in which a researcher can associate different regimes with specific economically meaningful characteristics of the phenomenon modeled. A theoretical advantage is that stationarity and ergodicity of the underlying process are straightforward to establish and, contrary to most other nonlinear autoregressive models, explicit expressions of low order stationary marginal distributions are known. These theoretical properties are used to develop an asymptotic theory of maximum likelihood estimation for the GMVAR model, whose practical usefulness is illustrated in a bivariate setting by examining the relationship between the EUR-USD exchange rate and a related interest rate.

Keywords: mixture models, nonlinear vector autoregressive models, regime switching

JEL Classification: C32

∗The authors thank the Academy of Finland (LK, MM, and PS), the OP-Pohjola Group Research Foundation (LK, MM, and PS), and the Finnish Cultural Foundation (PS) for financial support. The paper has benefited from useful comments and suggestions made by the co-editor and three anonymous referees. Contact addresses: Leena Kalliovirta, Department of Political and Economic Studies, University of Helsinki, P. O. Box 17, FI-00014 University of Helsinki, Finland, or Natural Resources Institute Finland (Luke), Viikinkaari 4, FI-00790 Helsinki, Finland; e-mail: leena.kalliovirta@luke.fi. Mika Meitz, Department of Political and Economic Studies, University of Helsinki, P. O. Box 17, FI-00014 University of Helsinki, Finland; e-mail: mika.meitz@helsinki.fi. Pentti Saikkonen, Department of Mathematics and Statistics, University of Helsinki, P. O. Box 68, FI-00014 University of Helsinki, Finland; e-mail: pentti.saikkonen@helsinki.fi.

1 Introduction

The vector autoregressive (VAR) model is one of the main tools used to analyze economic time series. Quite often, the VAR model is assumed linear, although both economic theory and previous empirical evidence may suggest that a nonlinear VAR model could be more appropriate. One popular nonlinear VAR model is the Markov switching VAR (MS-VAR) model that is designed to describe time series that switch between two or more regimes, with each regime having the dynamics of a linear VAR model. In most applications, the regime switches are determined by a latent indicator variable that follows a time-homogeneous Markov chain with the transition probabilities depending on the most recent regime but not on past observations (see, e.g., Krolzig (1997) and Sims, Waggoner, and Zha (2008)). More general time-inhomogeneous MS-VAR models, where the transition probabilities depend both on the most recent regime and on past observations, have also been considered (see, e.g., Ang and Bekaert (2002)).

In this paper, we are interested in mixture VAR (MVAR) models. These models can be viewed as special cases of general time-inhomogeneous MS-VAR models from which they are obtained with suitable parameter restrictions. They differ from the commonly used time-homogeneous MS-VAR models in that the transition probabilities do not depend on the most recent regime, but instead on past observations. An equivalent formulation of MVAR models (explaining the nomenclature 'mixture') is to specify the conditional distribution of the process as a mixture of (typically) Gaussian conditional distributions of linear VAR models. Different models are obtained by different specifications of the mixing weights. Univariate mixture autoregressive models were introduced by Le, Martin, and Raftery (1996) and further developed by Wong and Li (2000, 2001a,b) (for further references, see Kalliovirta, Meitz, and Saikkonen (2015); for mixture autoregressions in a Bayesian framework, see, e.g., Villani, Kohn, and Giordani (2009)). Extensions to the vector case with economic applications involving inflation, interest rates, stock returns, and exchange rates have been presented by Lanne (2006), Fong, Li, Yau, and Wong (2007), Bec, Rahbek, and Shephard (2008), and Dueker, Psaradakis, Sola, and Spagnolo (2011).

In this paper, we propose a new mixture VAR model referred to as the Gaussian mixture vector autoregressive (GMVAR) model. This model is a multivariate generalization of a similar univariate model introduced in Kalliovirta et al. (2015). The specific formulation of the GMVAR model turns out to have very convenient theoretical implications. To highlight this point, first recall a property that makes the stationary linear Gaussian VAR model different from most, if not nearly all, of its nonlinear alternatives, namely that the probability structure of the underlying stochastic process is fully known and can be described by Gaussian densities. In nonlinear VAR (and also nonlinear AR) models the situation is typically very different: the conditional distribution is known by construction, but what is usually known beyond that is only the existence of a stationary distribution and finiteness of some of its moments. In the GMVAR model, stationarity of the underlying stochastic process is a simple consequence of the definition of the model. Moreover, letting $p$ denote the autoregressive order of the model (see Section 2), the stationary distribution of $p + 1$ consecutive (vector valued) observations is known to be a mixture of multivariate Gaussian distributions with constant mixing weights and known structure for the mean and covariance matrix of the component distributions, whereas the conditional distribution is a multivariate Gaussian mixture with time varying mixing weights. Thus, similarly to the linear Gaussian VAR model, and contrary to (at least most) other nonlinear VAR models, the structure of stationary marginal distributions consisting of $p + 1$ observations or less is fully known in the GMVAR model.

In order to interpret a multivariate regime-switching model one typically aims at associating different economically meaningful regimes with different states of economic variables, such as a high or low level of inflation, interest rates, or asset returns. An appealing feature of the GMVAR model is that, due to the specific structure of the mixing weights, the researcher can associate different regimes with different characteristics of the phenomenon modeled. Moreover, in the GMVAR model switches between regimes are allowed to depend not only on, say, the level of past observations, but on their entire distribution. Thus, in addition to regime switches taking place in periods of high/low levels of the considered series, the GMVAR model can also allow for regime switches taking place in periods of high/low variability, or high/low temporal dependence, and combinations of all these. These convenient features are illustrated in our empirical example, which also demonstrates promising forecasting power of the GMVAR model.

We believe that introducing the GMVAR model makes a useful addition to the literature on multivariate regime-switching models. This is mainly due to the formulation of the model which, in addition to the attractive properties already discussed, has the following implications. First, the regime-switching mechanism is parsimonious and its form becomes automatically specified once the number of regimes and the order of the model are chosen; there is no need to find out which lagged values of the considered series are used to model the regime-switching mechanism and in what form they should be included in the model. Second, conditions that guarantee stationarity (and ergodicity) of the model are entirely similar to those in linear VAR models and they are also necessary in the sense of not overly restricting the parameter space of the model. These conditions are therefore both sharp and easy to check, and there is no need to use simulation to find out whether an estimated model fulfills the stationarity condition.

The plan of the paper is as follows. Section 2 discusses general mixture VAR models. Section 3 introduces the new GMVAR model, discusses its theoretical properties, and establishes the consistency and asymptotic normality of the maximum likelihood estimator. Section 4 presents an empirical example with exchange rate and interest rate data, discusses issues of model building, and compares the forecasting performance of the GMVAR model to other linear and nonlinear VAR models. Section 5 concludes, and an Appendix contains some technical derivations. A 'Supplementary Appendix' (available from the authors) contains additional material omitted from the paper.

Finally, a word on notation. We use $\mathrm{vec}(A)$ to denote a column vector obtained by stacking the columns of the matrix $A$ one below another. If $A$ is a symmetric matrix, then $\mathrm{vech}(A)$ is a column vector obtained by stacking the columns of $A$ from the principal diagonal downwards (including elements on the diagonal). The usual notation $A \otimes B$ is used for the Kronecker product of the matrices $A$ and $B$. To simplify notation, we shall write $z = (z_1, \ldots, z_m)$ for the (column) vector $z$ where the components $z_i$ may be either scalars or vectors (or both). For any scalar, vector, or matrix $x$, the Euclidean norm is denoted by $|x|$.

2 Multivariate mixture autoregressive models

Let $y_t$ ($t = 1, 2, \ldots$) be the $d$-dimensional time series of interest, and let $\mathcal{F}_{t-1} = \sigma(y_s,\, s < t)$ denote the $\sigma$-algebra generated by past $y_t$'s. We use $P_{t-1}(\cdot)$ to signify the conditional probability of the indicated event given $\mathcal{F}_{t-1}$. In a general multivariate mixture autoregressive model with $M$ mixture components (or regimes) the $y_t$'s are assumed to be generated by

$$y_t = \sum_{m=1}^{M} s_{t,m}\left(\mu_{m,t} + \Omega_m^{1/2} \varepsilon_t\right), \qquad (1)$$

where the following conditions hold.

Condition 1.

(a) For each $m = 1, \ldots, M$, $\mu_{m,t} = \phi_{m,0} + \sum_{i=1}^{p} A_{m,i} y_{t-i}$ and $\Omega_m$ is positive definite.

(b) $\varepsilon_t$ ($d \times 1$) is a sequence of independent standard multivariate normal random vectors ($\varepsilon_t \sim \mathrm{NID}(0, I_d)$) such that $\varepsilon_t$ is independent of $\{y_s,\, s < t\}$.

(c) $s_t = (s_{t,1}, \ldots, s_{t,M})$ is a sequence of (unobserved) random vectors such that for each $t$, exactly one of its components takes the value one and the others equal zero, with ($\mathcal{F}_{t-1}$-measurable) conditional probabilities $P_{t-1}(s_{t,m} = 1) = \alpha_{m,t}$, $m = 1, \ldots, M$, that satisfy $\sum_{m=1}^{M} \alpha_{m,t} = 1$.

(d) Conditional on $\mathcal{F}_{t-1}$, $s_t$ and $\varepsilon_t$ are independent.

For later developments, we collect the unknown parameters introduced above in the vectors $\vartheta_m = (\phi_{m,0}, \phi_m, \sigma_m)$ with $\phi_m = (\mathrm{vec}(A_{m,1}), \ldots, \mathrm{vec}(A_{m,p}))$ and $\sigma_m = \mathrm{vech}(\Omega_m)$ ($m = 1, \ldots, M$).

The conditional probabilities αm,t in Condition 1(c) are referred to as mixing weights (this nomenclature will be made clear shortly). They can be thought of as probabilities that determine which one of the M VAR components in (1) generates the next observation. To complete the definition of an MVAR model, the mixing weights need to be specified. Our specification as well as some alternatives are discussed in Section 3.1.

Using equation (1) and Condition 1, the conditional density function of $y_t$ given its past, $f(\cdot \mid \mathcal{F}_{t-1})$, is obtained as

$$f(y_t \mid \mathcal{F}_{t-1}) = \sum_{m=1}^{M} \alpha_{m,t}\, (2\pi)^{-d/2} \det(\Omega_m)^{-1/2} \exp\left\{ -\frac{1}{2} \left(y_t - \mu_{m,t}\right)' \Omega_m^{-1} \left(y_t - \mu_{m,t}\right) \right\}. \qquad (2)$$

Thus, the distribution of $y_t$ given its past is specified as a mixture of multivariate normal densities with time varying mixing weights $\alpha_{m,t}$. The conditional mean and covariance matrix of $y_t$ given $\mathcal{F}_{t-1}$ can be expressed as

$$E[y_t \mid \mathcal{F}_{t-1}] = \sum_{m=1}^{M} \alpha_{m,t} \mu_{m,t} = \sum_{m=1}^{M} \alpha_{m,t} \left( \phi_{m,0} + \sum_{i=1}^{p} A_{m,i} y_{t-i} \right) \qquad (3)$$

and

$$\mathrm{Cov}[y_t \mid \mathcal{F}_{t-1}] = \sum_{m=1}^{M} \alpha_{m,t} \Omega_m + \sum_{m=1}^{M} \alpha_{m,t} \left( \mu_{m,t} - \sum_{n=1}^{M} \alpha_{n,t} \mu_{n,t} \right) \left( \mu_{m,t} - \sum_{n=1}^{M} \alpha_{n,t} \mu_{n,t} \right)'. \qquad (4)$$

These expressions are valid for any specification of the mixing weights $\alpha_{m,t}$.
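To make formulas (3) and (4) concrete, the following sketch evaluates the conditional mean and covariance matrix for given mixing weights and component parameters. The code is our illustration rather than part of the original paper; the function name and the toy parameter values are ours.

```python
import numpy as np

def conditional_moments(alpha_t, phi0, A, Omega, y_past):
    """Conditional mean (3) and covariance matrix (4) of a mixture VAR.

    alpha_t : (M,) mixing weights alpha_{m,t}, summing to one
    phi0    : (M, d) intercepts phi_{m,0}
    A       : (M, p, d, d) AR coefficient matrices A_{m,i}
    Omega   : (M, d, d) error covariance matrices Omega_m
    y_past  : (p, d) recent observations, row i holding y_{t-1-i}
    """
    M, p, d, _ = A.shape
    # Component conditional means mu_{m,t} = phi_{m,0} + sum_i A_{m,i} y_{t-i}
    mu = np.array([phi0[m] + sum(A[m, i] @ y_past[i] for i in range(p))
                   for m in range(M)])
    mean = alpha_t @ mu                                    # equation (3)
    dev = mu - mean                                        # mu_{m,t} minus the mixture mean
    cov = (np.einsum('m,mij->ij', alpha_t, Omega)          # sum_m alpha_{m,t} Omega_m
           + np.einsum('m,mi,mj->ij', alpha_t, dev, dev))  # second term of (4)
    return mean, cov

# Toy example with M = 2 regimes, dimension d = 2, order p = 1.
alpha_t = np.array([0.7, 0.3])
phi0 = np.array([[0.1, 0.0], [0.5, 0.2]])
A = np.array([[[[0.5, 0.0], [0.0, 0.4]]], [[[0.2, 0.1], [0.0, 0.1]]]])
Omega = np.stack([np.eye(2), 2.0 * np.eye(2)])
print(conditional_moments(alpha_t, phi0, A, Omega, y_past=np.ones((1, 2))))
```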

3 Gaussian Mixture Vector Autoregressive (GMVAR) model

3.1 Definition

The GMVAR model is based on a particular choice of the mixing weights $\alpha_{m,t}$ in Condition 1(c). This choice is similar to that used by Glasbey (2001) and Kalliovirta et al. (2015) in a univariate setting. In order to define these mixing weights we first use the parameters $\phi_{m,0}$, $A_{m,i}$, and $\Omega_m$ (see (1) and Condition 1(a)) to define the $M$ auxiliary linear Gaussian VAR processes

$$\nu_{m,t} = \phi_{m,0} + \sum_{i=1}^{p} A_{m,i} \nu_{m,t-i} + \Omega_m^{1/2} \varepsilon_t, \qquad m = 1, \ldots, M,$$

where the coefficient matrices $A_{m,i}$ are assumed to satisfy

$$\det A_m(z) = \det\left( I_d - \sum_{i=1}^{p} A_{m,i} z^i \right) \neq 0 \quad \text{for } |z| \leq 1, \qquad m = 1, \ldots, M. \qquad (5)$$

This condition implies that the processes $\nu_{m,t}$ ($d \times 1$) are stationary and that each component model in (1) satisfies the usual stationarity condition of a linear VAR model.

We also need the density function of the (Gaussian) random vector $\boldsymbol{\nu}_{m,t} = (\nu_{m,t}, \ldots, \nu_{m,t-p+1})$ ($dp \times 1$) ($m = 1, \ldots, M$) given by

$$n_{dp}(\boldsymbol{\nu}_{m,t}; \vartheta_m) = (2\pi)^{-dp/2} \det(\Sigma_{m,p})^{-1/2} \exp\left\{ -\frac{1}{2} \left( \boldsymbol{\nu}_{m,t} - 1_p \otimes \mu_m \right)' \Sigma_{m,p}^{-1} \left( \boldsymbol{\nu}_{m,t} - 1_p \otimes \mu_m \right) \right\}, \qquad (6)$$

where $1_p = (1, \ldots, 1)$ ($p \times 1$), $\mu_m = A_m^{-1}(1) \phi_{m,0}$, and the covariance matrix $\Sigma_{m,p}$ is a function of $A_{m,i}$, $i = 1, \ldots, p$, and $\Omega_m$, and hence of the parameters $\phi_m$ and $\sigma_m$ (for details, see Lütkepohl (2005, eqn. (2.1.39))).

Now we can specify our choice for the mixing weights $\alpha_{m,t}$. Using the vector $\mathbf{y}_{t-1} = (y_{t-1}, \ldots, y_{t-p})$ ($dp \times 1$) and the multivariate Gaussian densities in (6), we set

$$\alpha_{m,t} = \frac{\alpha_m n_{dp}(\mathbf{y}_{t-1}; \vartheta_m)}{\sum_{n=1}^{M} \alpha_n n_{dp}(\mathbf{y}_{t-1}; \vartheta_n)}, \qquad (7)$$

where the $\alpha_m \in (0, 1)$, $m = 1, \ldots, M$, are unknown parameters satisfying $\sum_{m=1}^{M} \alpha_m = 1$. (Clearly, the coefficients $\alpha_{m,t}$ are measurable functions of $\mathbf{y}_{t-1} = (y_{t-1}, \ldots, y_{t-p})$ and satisfy $\sum_{m=1}^{M} \alpha_{m,t} = 1$ for all $t$.) Equation (1), Condition 1, and (7) define the Gaussian Mixture Vector Autoregression, or the GMVAR model. We use the abbreviation GMVAR($p, M$) when the autoregressive order and number of component models need to be emphasized. The unknown parameters to be estimated are collected in the vector $\theta = (\vartheta_1, \ldots, \vartheta_M, \alpha_1, \ldots, \alpha_{M-1})$ ($(M(d^2 p + d + d(d+1)/2 + 1) - 1) \times 1$); the coefficient $\alpha_M$ is not included due to the restriction $\sum_{m=1}^{M} \alpha_m = 1$.

As already mentioned, the specification of the mixing weights in (7) is analogous to that used by Glasbey (2001) and Kalliovirta et al. (2015) in a univariate setting; indeed, when $d = 1$ the GMVAR model reduces to the GMAR model of Kalliovirta et al. (2015).
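The weights (7) require, for each regime, the stationary mean $1_p \otimes \mu_m$ and covariance matrix $\Sigma_{m,p}$ of the auxiliary process. The sketch below is our illustration (all names are ours); it obtains $\Sigma_{m,p}$ from the companion-form relation $\mathrm{vec}(\Sigma) = (I - \mathbf{A} \otimes \mathbf{A})^{-1}\mathrm{vec}(\boldsymbol{\Omega})$, a rearrangement of the formula in Lütkepohl (2005) referenced above, and then evaluates $\alpha_{m,t}$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def stationary_moments(phi0, A_m, Omega_m):
    """Stationary mean 1_p (x) mu_m and covariance Sigma_{m,p} of the
    dp-vector (nu_{m,t}, ..., nu_{m,t-p+1}) of one auxiliary VAR."""
    p, d, _ = A_m.shape
    mu = np.linalg.solve(np.eye(d) - A_m.sum(axis=0), phi0)  # mu_m = A_m(1)^{-1} phi_{m,0}
    comp = np.zeros((d * p, d * p))                 # companion matrix of the VAR
    comp[:d, :] = np.hstack(A_m)
    comp[d:, :-d] = np.eye(d * (p - 1))
    Omega_big = np.zeros((d * p, d * p))
    Omega_big[:d, :d] = Omega_m
    # Solve Sigma = comp Sigma comp' + Omega_big via vectorization,
    # cf. Lutkepohl (2005, eqn. (2.1.39)).
    vec_Sigma = np.linalg.solve(np.eye((d * p) ** 2) - np.kron(comp, comp),
                                Omega_big.ravel(order='F'))
    return np.kron(np.ones(p), mu), vec_Sigma.reshape(d * p, d * p, order='F')

def mixing_weights(alphas, params, y_lags):
    """Equation (7): alpha_{m,t} proportional to alpha_m n_dp(y_{t-1}; theta_m).

    alphas : (M,) mixture weights alpha_m
    params : list of per-regime tuples (phi0, A_m, Omega_m)
    y_lags : length-(d*p) vector (y_{t-1}, ..., y_{t-p})
    """
    dens = np.array([multivariate_normal.pdf(y_lags, *stationary_moments(*pr))
                     for pr in params])
    w = alphas * dens
    return w / w.sum()

# Two bivariate regimes with p = 1.
params = [(np.zeros(2), np.array([[[0.5, 0.0], [0.0, 0.4]]]), np.eye(2)),
          (np.ones(2),  np.array([[[0.2, 0.0], [0.0, 0.1]]]), 2.0 * np.eye(2))]
print(mixing_weights(np.array([0.6, 0.4]), params, y_lags=np.zeros(2)))
```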

Of previous multivariate mixture autoregressive models, Fong et al. (2007) consider constant mixing weights, that is, $\alpha_{m,t} = \alpha_m$, whereas Bec et al. (2008) study a two component model ($M = 2$) with time-varying mixing weights specified via a logistic function (cf. the univariate model of Wong and Li (2001b)). The time-varying mixing weights used by Dueker et al. (2011) in their C-MSTAR model are based on a formula that is formally similar to (7) except that instead of density functions these authors employ cumulative distribution functions of the multinormal distribution.¹ A particular feature of their model is that the number of regimes, $M$, is always determined by the dimension of the model, $d$, so that $M = d^2$. Even when the dimension $d$ is fairly small, say three or four, this makes the number of regimes, and hence the number of parameters to be estimated, quite large, and special measures may be called for to facilitate estimation in practice (see Dueker et al. (2011, Footnote 5)). This is in contrast to our GMVAR model where the number of regimes is not related to the dimension of the model and the mixing mechanism is modeled in a parsimonious fashion in that only the parameters $\alpha_1, \ldots, \alpha_{M-1}$ are involved. Even though parsimony is an important feature of our mixing weights, its major advantage is theoretical attractiveness, as discussed in the next subsection.

¹ According to the authors, their model belongs to the family of multivariate STAR models, and this interpretation is indeed consistent with the initial definition of the model (see equations (3) and (4) in Dueker et al. (2011)). However, we treat the model as a mixture model because the likelihood function used to fit the model to data is determined by (not necessarily Gaussian) conditional density functions that are of the mixture form (2) (see Section 4.1 in Dueker et al. (2011)).

3.2 Theoretical properties

Given the specification of the mixing weights $\alpha_{m,t}$ in (7), the conditional distribution of $y_t$ given $\mathcal{F}_{t-1}$ only depends on $\mathbf{y}_{t-1}$, implying that $\mathbf{y}_t$ is Markovian. This fact is formally stated in the following theorem, which shows that there exists a choice of initial values $\mathbf{y}_0$ such that $\mathbf{y}_t$ is a stationary and ergodic Markov chain. An explicit expression for the stationary distribution is also provided. The following theorem is proved in the Appendix.

Theorem 1. Consider the GMVAR process $y_t$ generated by (1) and (7), and assume that Condition 1 and condition (5) are satisfied. Then $\mathbf{y}_t = (y_t, \ldots, y_{t-p+1})$ is a Markov chain on $\mathbb{R}^{dp}$ with a stationary distribution characterized by the density

$$f(\mathbf{y}; \theta) = \sum_{m=1}^{M} \alpha_m n_{dp}(\mathbf{y}; \vartheta_m). \qquad (8)$$

Moreover, $\mathbf{y}_t$ is ergodic.

Theorem 1 is an analog of the corresponding result obtained by Kalliovirta et al. (2015) in the univariate case $d = 1$. It shows that the stationary distribution of $\mathbf{y}_t$ is a mixture of $M$ multinormal distributions with constant mixing weights $\alpha_m$ that appear in the time varying mixing weights $\alpha_{m,t}$ defined in (7). Consequently, all moments of the stationary distribution exist and are finite. In fact, as can be seen from the proof of Theorem 1, the stationary distribution of the $d(p+1)$-dimensional random vector $(y_t, \mathbf{y}_{t-1})$ is also a Gaussian mixture with density of the same form as in (8) or, specifically, $\sum_{m=1}^{M} \alpha_m n_{d(p+1)}((y, \mathbf{y}); \vartheta_m)$. This implies that the marginal distributions of this Gaussian mixture belong to the same family² and also that explicit expressions for the mean, covariance matrix, and first $p$ autocovariances of the stationary version of the process $y_t$ can easily be derived (cf. Kalliovirta et al. (2015, p. 251)).

As discussed by Kalliovirta et al. (2015), it is quite exceptional that complete knowledge of the stationary distribution of a nonlinear autoregressive model is available and that readily verifiable conditions that define the parameter space can be used to obtain a relatively simple proof for stationarity and ergodicity. Similarly to their univariate counterparts, Theorem 1 and the definition of the model suggest that the GMVAR model can be quite flexible in describing various nonlinear and non-Gaussian features encountered in time series data (for details, see Section 2 of Kalliovirta et al. (2015)).

² Note, however, that this does not hold in higher dimensions, so that the stationary distribution of $(y_{t+1}, y_t, \mathbf{y}_{t-1})$, for example, is not a Gaussian mixture.
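The stationary mixture density (8) can also be checked by simulation: the marginal distributions of a long simulated trajectory should match $\sum_{m=1}^{M} \alpha_m n_{dp}(\cdot\,; \vartheta_m)$. The sketch below is our illustration; it reuses the `stationary_moments` and `mixing_weights` helpers (and the toy `params`) from the sketch in Section 3.1, and for simplicity draws the initial values from the first regime's stationary distribution, so an initial burn-in stretch should be discarded.

```python
import numpy as np

def simulate_gmvar(alphas, params, T, seed=None):
    """Simulate a GMVAR(p, M) path via the mechanism (1) and (7)."""
    rng = np.random.default_rng(seed)
    p, d = params[0][1].shape[0], params[0][1].shape[1]
    mean1, Sigma1 = stationary_moments(*params[0])
    # The dp-vector is ordered newest first; store y oldest -> newest.
    y = list(rng.multivariate_normal(mean1, Sigma1).reshape(p, d)[::-1])
    chols = [np.linalg.cholesky(Om) for (_, _, Om) in params]
    out = []
    for _ in range(T):
        y_lags = np.concatenate(y[::-1])        # (y_{t-1}, ..., y_{t-p})
        w = mixing_weights(alphas, params, y_lags)
        m = rng.choice(len(alphas), p=w)        # draw the regime indicator s_t
        phi0, A_m, _ = params[m]
        mu = phi0 + sum(A_m[i] @ y[-1 - i] for i in range(p))
        out.append(mu + chols[m] @ rng.standard_normal(d))
        y = (y + [out[-1]])[-p:]                # keep the last p observations
    return np.array(out)

# Long sample whose marginals can be compared with the mixture density (8).
path = simulate_gmvar(np.array([0.6, 0.4]), params, T=10000, seed=1)
```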

3.3 Interpretation of the mixing weights $\alpha_m$ and $\alpha_{m,t}$

Unless otherwise stated, we assume the stationary version of the GMVAR process in this and the next subsection. According to Theorem 1, the parameter $\alpha_m$ ($m = 1, \ldots, M$) has an immediate interpretation as the unconditional probability of the random vector $\mathbf{y}_t = (y_t, \ldots, y_{t-p+1})$ being generated from a distribution with density $n_{dp}(\mathbf{y}; \vartheta_m)$, that is, from the $m$th component of the Gaussian mixture characterized in (8). Consequently, $\alpha_m$ ($m = 1, \ldots, M$) also represents the unconditional probability of the component $y_t$ being generated from a distribution with density $n_d(y; \vartheta_m)$, the $m$th component of the ($d$-dimensional) Gaussian mixture density $\sum_{m=1}^{M} \alpha_m n_d(y; \vartheta_m)$, where $n_d(y; \vartheta_m)$ is the density function of a Gaussian random vector with mean $\mu_m$ and covariance matrix $\Gamma_{m,0}$.

It can also be shown that $\alpha_m$ represents the unconditional probability of (the $d$-dimensional) $y_t$ being generated from the $m$th VAR component in (1), whereas $\alpha_{m,t}$ represents the corresponding conditional probability $P_{t-1}(s_{t,m} = 1) = \alpha_{m,t}$. This conditional probability depends on the (relative) size of the product $\alpha_m n_{dp}(\mathbf{y}_{t-1}; \vartheta_m)$, the numerator of the expression defining $\alpha_{m,t}$ (see (7)). The latter factor of this product, $n_{dp}(\mathbf{y}_{t-1}; \vartheta_m)$, can be interpreted as the likelihood of the $m$th autoregressive component in (1) based on the observation $\mathbf{y}_{t-1}$. Thus, the larger this likelihood is, the more likely it is to observe $y_t$ from the $m$th autoregressive component. However, the product $\alpha_m n_{dp}(\mathbf{y}_{t-1}; \vartheta_m)$ is also affected by the former factor $\alpha_m$, or the weight of $n_{dp}(\mathbf{y}_{t-1}; \vartheta_m)$ in the stationary mixture distribution of $\mathbf{y}_{t-1}$ (evaluated at $\mathbf{y}_{t-1}$; see (8)). Specifically, even though the likelihood of the $m$th autoregressive component in (1) is, for example, large, a small value of $\alpha_m$ attenuates its effect so that the likelihood of observing $y_t$ from the $m$th autoregressive component can be small. This seems intuitively natural because, for example, a small weight of $n_{dp}(\mathbf{y}_{t-1}; \vartheta_m)$ in the stationary mixture distribution of $\mathbf{y}_{t-1}$ means that observations cannot be generated by the $m$th autoregressive component too frequently.

The preceding discussion highlights the fact that the GMVAR model associates the mixing weights or the regime probabilities $P_{t-1}(s_{t,m} = 1) = \alpha_{m,t}$ with observable economic characteristics through the density functions $n_{dp}(\mathbf{y}_{t-1}; \vartheta_m)$. In particular, and in contrast to existing mixture VAR models, the specific form of the mixing weights in the GMVAR model therefore allows the regime probabilities $\alpha_{m,t}$ to depend on the entire distribution of $p$ past observations and not only on some of their specific features such as levels.

3.4 Parameter estimation

The parameters of the GMVAR model can be estimated by the method of maximum likelihood (ML). As the stationary distribution of the GMVAR process is known, it is even possible to make use of initial values and construct the exact likelihood function and obtain exact ML estimates. Assuming the observed data $y_{-p+1}, \ldots, y_0, y_1, \ldots, y_T$ and stationary initial values, the log-likelihood function takes the form

$$L_T(\theta) = \log\left( \sum_{m=1}^{M} \alpha_m n_{dp}(\mathbf{y}_0; \vartheta_m) \right) + \sum_{t=1}^{T} l_t(\theta), \qquad (9)$$

where

$$l_t(\theta) = \log\left( \sum_{m=1}^{M} \alpha_{m,t}(\theta)\, (2\pi)^{-d/2} \det(\Omega_m)^{-1/2} \exp\left\{ -\frac{1}{2} \left( y_t - \mu_{m,t}(\vartheta_m) \right)' \Omega_m^{-1} \left( y_t - \mu_{m,t}(\vartheta_m) \right) \right\} \right). \qquad (10)$$

Here dependence of the mixing weights $\alpha_{m,t}$ and the conditional expectations $\mu_{m,t}$ of the component models on the parameters is made explicit (see (7) and Condition 1(a)). In the expression (9) it has been assumed that the initial values in the vector $\mathbf{y}_0$ are generated by the stationary distribution. If this assumption seems inappropriate, one can condition on initial values and drop the first term on the right hand side of (9). In what follows we assume estimation is performed based on this conditional likelihood, namely

$$L_T(\theta) = T^{-1} \sum_{t=1}^{T} l_t(\theta),$$

which we, for convenience, have also scaled with the sample size.
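A direct implementation of the scaled conditional log-likelihood is straightforward given the pieces above. The sketch below is our illustration (names ours) and relies on the `mixing_weights` helper and the toy `params` and `path` objects from the earlier sketches; in practice the maximization would be delegated to a numerical optimizer over a parametrization that enforces (5), the positive definiteness of the $\Omega_m$, and $\sum_{m=1}^{M} \alpha_m = 1$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def conditional_loglik(alphas, params, y, p):
    """Scaled conditional log-likelihood T^{-1} sum_t l_t(theta); see (10).

    y : (n, d) array of observations, the first p rows being initial values
    """
    n = y.shape[0]
    ll = 0.0
    for t in range(p, n):
        y_lags = y[t - p:t][::-1].ravel()       # (y_{t-1}, ..., y_{t-p})
        w = mixing_weights(alphas, params, y_lags)
        dens = 0.0
        for (phi0, A_m, Omega_m), wm in zip(params, w):
            mu = phi0 + sum(A_m[i] @ y[t - 1 - i] for i in range(p))
            dens += wm * multivariate_normal.pdf(y[t], mu, Omega_m)
        ll += np.log(dens)
    return ll / (n - p)

print(conditional_loglik(np.array([0.6, 0.4]), params, path, p=1))
```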

Maximizing the conditional log-likelihood function $L_T(\theta)$ with respect to the parameter vector $\theta$ yields the ML estimate denoted by $\hat{\theta}_T$ (a similar notation is used for components of $\hat{\theta}_T$).

Investigation of the asymptotic properties of the ML estimator $\hat{\theta}_T$ requires further assumptions. The unknown model parameters are collected in the vector $\theta = (\vartheta, \alpha) = (\vartheta_1, \ldots, \vartheta_M, \alpha_1, \ldots, \alpha_{M-1})$ taking values in the parameter space $\Theta$. The parameter space $\Theta$ needs to be constrained in various ways. Earlier we already mentioned the stationarity conditions (5) and the positive definiteness of the covariance matrices $\Omega_m$ ($m = 1, \ldots, M$) that are assumed to hold. Throughout we assume that the number of mixture components is known, and this also entails the requirement that the coefficients $\alpha_m$, $m = 1, \ldots, M$, used to define the mixing weights are all strictly positive (and strictly less than unity). Further restrictions are required to ensure identification of the model parameters. In the proof of Theorem 2 below, the identification of GMVAR models is established by using the results of Yakowitz and Spragins (1968) on the identification of finite mixtures of (in our case) multivariate Gaussian distributions. Essentially, what is required is that the $M$ component models cannot be 'relabeled' and the same GMVAR model obtained. A sufficient condition ensuring this is

$$\alpha_1 > \cdots > \alpha_M > 0 \quad \text{and} \quad \vartheta_i = \vartheta_j \text{ only if } 1 \leq i = j \leq M. \qquad (11)$$

We summarize the restrictions imposed on the parameter space as follows.

Assumption 1. The true parameter value $\theta_0$ is an interior point of $\Theta$, where $\Theta$ is a compact subset of $\{\theta \in \mathbb{R}^{M(d + pd^2 + d(d+1)/2)} \times (0,1)^{M-1} : \text{(5) and (11) hold, and } \Omega_m\ (m = 1, \ldots, M) \text{ are positive definite}\}$.

The following theorem establishes the strong consistency of the ML estimator.

Theorem 2. Suppose $y_t$ are generated by the stationary and ergodic GMVAR process of Theorem 1 and that Assumption 1 holds. Then the maximum likelihood estimator $\hat{\theta}_T$ is strongly consistent, that is, $\hat{\theta}_T \to \theta_0$ a.s.

To simplify the proof, Theorem 2 assumes stationary initial values (relaxing this assumption is possible at the cost of a longer and more complicated proof). A similar remark applies to our next theorem below, where we show that the ML estimator $\hat{\theta}_T$ is asymptotically normally distributed. To this end, Lemmas 1-3 in the Appendix establish some of the major ingredients required for proving this result, namely (i) that the score vector (evaluated at the true parameter value) is a (square integrable) martingale difference sequence and thus obeys a central limit theorem, (ii) that the Hessian matrix of the log-likelihood function converges uniformly in some neighborhood of the true parameter value, and (iii) that, when evaluated at the true parameter value, the limiting covariance matrix of the score vector equals the negative of the expected Hessian. The last main ingredient required, positive definiteness of the information matrix, is contained in the following assumption.

Assumption 2. The matrix $\mathcal{I}(\theta_0) = E\left[ \dfrac{\partial l_t(\theta_0)}{\partial \theta} \dfrac{\partial l_t(\theta_0)}{\partial \theta'} \right]$ is positive definite.

Verification of Assumption 2 in the GMVAR model considered is challenging, one reason for this being the rather complex expressions the partial derivatives of the mixing weights $\alpha_{m,t}$ have. A similar assumption is made by Dueker et al. (2011, condition C.9 on p. 324), who show that, under appropriate 'high level' regularity conditions, the usual results of consistency and asymptotic normality of the ML estimator hold in their mixture model. The following theorem establishes asymptotic normality of the ML estimator in our GMVAR model.

Theorem 3. Suppose $y_t$ are generated by the stationary and ergodic GMVAR process of Theorem 1 and that Assumptions 1 and 2 hold. Then

$$T^{1/2}(\hat{\theta}_T - \theta_0) \xrightarrow{d} N\left(0,\, -\mathcal{J}(\theta_0)^{-1}\right),$$

where $\mathcal{J}(\theta_0) = E[\partial^2 l_t(\theta_0) / \partial\theta \partial\theta'] = -\mathcal{I}(\theta_0)$ is finite.

Theorem 3 shows that the conventional limiting distribution applies to the ML estimator $\hat{\theta}_T$, which (in conjunction with the derivations used in the proof of Theorem 3) implies the applicability of standard likelihood-based tests. It is worth noting, however, that here a correct specification of the number of autoregressive components $M$ is required. In particular, if the number of component models is chosen too large, then some parameters of the model are not identified and, consequently, the result of Theorem 3 and the validity of the related tests break down. This particularly happens when one tests for the number of component models, an issue discussed in more detail below (see also Dueker, Sola, and Spagnolo (2007), Dueker et al. (2011), and the references therein).

Suppose one is interested in testing the null hypothesis of a linear VAR model (a model where the number of component models $M$ equals 1) against a nonlinear GMVAR model (with $M \geq 2$) or, more generally, in selecting the number of component models $M$. In such testing problems, difficulties arise because in a GMVAR($p, M$) model some parameters are not identified if the model reduces to a linear VAR($p$) model. A further complication is that the GMVAR($p, M$) model can be reduced to a VAR($p$) model in several ways. For instance, when $M = 2$ a VAR($p$) model is obtained by specifying the null hypothesis either as $\vartheta_1 = \vartheta_2$ or as $\alpha_1 = 1$. In the former case the parameter $\alpha_1$ is not identified and in the latter case the parameter $\vartheta_2$ is not identified. The latter case also involves the nonstandard feature that under the null hypothesis the parameter $\alpha_1$ lies on the boundary of the parameter space. These facts indicate that the testing problem is highly nonstandard.

Likelihood-based tests for testing problems similar to those discussed above have recently been studied for mixture or Markov switching type regime switching models, among others, by Cho and White (2007) and Carrasco, Hu, and Ploberger (2014). The former authors develop tests in mixture models where the mixing weights are constant over time, whereas the latter authors discuss tests in Markov switching models in which the transition probabilities depend on past regimes but not on past observations. To our knowledge, tests for the case in which the mixing weights depend on past data have not yet been developed. The complexity of the abovementioned papers suggests, however, that developing such tests may be a major task and is, therefore, left for future research.

Instead of formal tests, in our empirical application we use residual-based diagnostics and information criteria (AIC and BIC) to infer which model fits the data best. Similar approaches have also been used by Fong et al. (2007) and Dueker et al. (2011) in their mixture VAR models. Note, however, that once the number of regimes $M$ in the GMVAR model is (correctly) chosen, standard likelihood-based inference can be used to choose regime-wise autoregressive orders and to test other hypotheses of interest.

4 Empirical example

4.1 Data and preliminary analysis

We use interest rate and exchange rate data to illustrate how the GMVAR model can describe typical features of financial data and how it improves the in-sample and out-of-sample fit compared to a linear VAR model and competing regime-switching nonlinear VAR models. Our data, retrieved from OECD Statistics, consist of the difference between the monthly Euro area and U.S. long-term government bond yields (referred to as the interest rate differential), $z_{t,1}$, and the monthly average Euro-U.S. dollar exchange rate, $z_{t,2}$, from January 1989 to December 2013.³ We split the data into an estimation period of 21 years and a forecasting period of 4 years, and analyze slightly transformed data, $y_t = (10 z_{t,1}, 100 z_{t,2})$, to ease the numerical maximization of the likelihood function. Time series plots of the transformed data are shown in Figure 1 for the estimation period that ends in December 2009 (left panel, the solid lines).

Figure 1: Left: The Euro and U.S. dollar exchange rate series scaled by 100 (upper solid line), the interest rate differential between the Euro area and U.S. scaled by 10 (lower solid line), and scaled mixing weights based on the estimates of the GMVAR model in (12) (dashed line; the scaling is such that $\hat{\alpha}_{1,t}$ equals the maximum (minimum) of the observations when $\hat{\alpha}_{1,t} = 1$ ($= 0$)). Middle: A kernel density estimate of the observations (solid line) and the mixture density implied by the same GMVAR model as in the left panel for the interest rate differential series (dashed line). Right: A kernel density estimate of the observations (solid line) and the mixture density implied by the same GMVAR model as in the left panel for the exchange rate series (dashed line).

Visual inspection of the time series plots of the series (Figure 1, solid lines in the left panel) suggests that the two series exhibit changes at least in levels and potentially also in variability. Kernel estimates of the density functions of the component series (Figure 1, solid lines in the middle and right panels) also suggest the potential presence of multiple regimes and multimodality. These observations are in line with the univariate analyses of the component series in Kalliovirta et al. (2014, 2015). Although the time periods used in these papers differ somewhat from that used here, the obtained results nevertheless lend support to the fact that the component series exhibit regime-switching dynamics that could adequately be described by univariate versions of the GMVAR model.

For a multivariate analysis, a natural first step is to check how well conventional linear Gaussian VAR models fit the data. For brevity, we do not present the results of the linear VAR analysis performed (the results are available in the Supplementary Appendix). In summary, BIC suggested a VAR(2) model which, however, was clearly rejected in that the residuals were found conditionally heteroskedastic, autocorrelated, and non-Gaussian (a similar result was obtained when a VAR(4), suggested by AIC, was tried).

³ The difference between yields of government bonds with 10-year maturity, $z_{t,1} = i_{t,EU} - i_{t,US}$, is calculated by the ECB and the Federal Reserve Board; prior to 2001, the Euro area data refer to the “EU11” countries, and afterwards, with changing composition, eventually to the “EU17” by the end of the data period. The latter series, $z_{t,2}$, is based on the ECU-USD exchange rate prior to 1999.

4.2 The estimated GMVAR model

When specifying a GMVAR($p, M$) model, it is advisable to begin with low-order models. One reason for this is that if the number of component models $M$ is chosen too large, then some parameters of the model are not identified (see the discussion in Section 3.4 following Theorem 3). For the autoregressive order $p$, the order chosen for the linear VAR model appears a natural initial choice. As Kalliovirta et al. (2014, 2015) found that the component series could adequately be described by univariate versions of the GMVAR model with the autoregressive coefficients restricted to be the same in each regime, we started our multivariate analysis with a similarly restricted GMVAR(2,2) model. Thus, we first tried a simple 2-regime GMVAR model with autoregressive order 2 and regime-wise intercept terms and error covariance matrices. Estimation based on the conditional likelihood gave the following results:

$$\begin{bmatrix} y_{t,1} \\ y_{t,2} \end{bmatrix} = \begin{bmatrix} \underset{(0.07)}{1.25} & \underset{(0.04)}{0.04} \\ \underset{(0.10)}{0.06} & \underset{(0.06)}{1.34} \end{bmatrix} \begin{bmatrix} y_{t-1,1} \\ y_{t-1,2} \end{bmatrix} + \begin{bmatrix} \underset{(0.06)}{-0.29} & \underset{(0.04)}{-0.05} \\ \underset{(0.10)}{-0.08} & \underset{(0.06)}{-0.36} \end{bmatrix} \begin{bmatrix} y_{t-2,1} \\ y_{t-2,2} \end{bmatrix} + s_{t,1} \left( \begin{bmatrix} \underset{(0.51)}{1.03} \\ \underset{(1.34)}{2.36} \end{bmatrix} + \begin{bmatrix} \underset{(0.15)}{0.93} & \underset{(0.27)}{-0.15} \\ \underset{(0.27)}{-0.15} & \underset{(0.76)}{5.20} \end{bmatrix}^{1/2} \hat{\varepsilon}_t \right) + s_{t,2} \left( \begin{bmatrix} \underset{(0.69)}{1.79} \\ \underset{(1.48)}{3.00} \end{bmatrix} + \begin{bmatrix} \underset{(0.73)}{5.88} & \underset{(0.72)}{3.56} \\ \underset{(0.72)}{3.56} & \underset{(1.18)}{9.80} \end{bmatrix}^{1/2} \hat{\varepsilon}_t \right) \qquad (12)$$

with the estimate of the mixing weight $\alpha_1 = P(s_{t,1} = 1)$ being $\hat{\alpha}_1 = 0.37\ (0.18)$ and the estimated correlation of the error terms being $-0.07\ (0.12)$ in regime 1 and $0.47\ (0.07)$ in regime 2 (standard errors computed using the Hessian of the log-likelihood function are given in parentheses). Based on the maximized values of the likelihood functions and information criteria, this model is clearly preferable to the linear VAR(2) model.⁴

The time series plot of the estimated time-varying mixing weights $\hat{\alpha}_{1,t}$ is depicted in Figure 1 (dashed line in the left panel). From 1989 until the beginning of 1996, the second regime is clearly dominating. Between 1996 and 2008 the series mostly evolve in the first regime, although there are a few occasions where the probability of the second regime is quite high. After 2008, the series switch back to the second regime. Based on the time series plots of the estimated conditional means and conditional variances of the component series (see Figure A.8 in the Supplementary Appendix; computed according to (3) and (4)), the first regime corresponds to a low mean, low variance regime, whereas the second regime corresponds to a high mean, high variance regime. This is consistent with the time series plots of the series (Figure 1, solid lines in the left panel) as well as with the estimates of the means and variances of the stationary distribution implied by the estimated model (12) (not shown) in that in the first regime these estimates are small compared to their counterparts in the second regime.

The time series plot of the estimated conditional correlation between the two series (see Figure A.8 in the Supplementary Appendix; computed according to (4)) has a shape similar to that of an inverted version of the estimated mixing weights $\hat{\alpha}_{1,t}$ (depicted in Figure 1). In the first regime the series evolve rather uncorrelated with each other (the estimated conditional correlation during the years 1996-2008 is most of the time between 0 and 0.25), whereas in the second regime they are correlated (the estimated conditional correlation before 1996 and after 2008 is near-constant at 0.47).

The estimates of the off-diagonal elements of the autoregressive matrices in (12) are rather small, and for completeness, we also estimated a model in which these elements were restricted to zero. The likelihood ratio test for this restriction had a p-value of 0.12, but as the (quantile) residuals of the restricted model were autocorrelated and the restricted model produced forecasts inferior to those of the unrestricted model, we prefer the unrestricted model. It may be worth noting that even if the off-diagonal elements of the autoregressive matrices were zero, the conditional distribution of $y_{t,1}$, for example, given the past of the two series is not independent of past values of $y_{t,2}$ (a similar conclusion holds for $y_{t,2}$). The reason for this is that the mixing weight $\alpha_{1,t}$ depends on $(y_{t-1,2}, y_{t-2,2})$. In particular, as the intercept terms and variances in the two regimes differ, the conditional expectation $E[y_{t,1} \mid \mathcal{F}_{t-1}]$ and the conditional variance $\mathrm{Var}[y_{t,1} \mid \mathcal{F}_{t-1}]$ are functions of $\alpha_{1,t}$, and hence functions of $(y_{t-1,2}, y_{t-2,2})$ (see equations (3) and (4)).

⁴ The maximized value of the log-likelihood function is $-1083$, and the AIC and BIC values are 2205 and 2272. The corresponding figures for the linear VAR(2) model are $-1116$, 2258, and 2304.

Figure 2: Contour plots of the mixing weights $\hat{\alpha}_{1,t}$ as a function of $y_{t-1} = (y_{t-1,1}, y_{t-1,2})$ with $y_{t-2} = (y_{t-2,1}, y_{t-2,2})$ fixed and with $y_{t-1}$ and $y_{t-2}$ chosen to match with selected values of the interest rate differential ($y_{t,1}$) and the exchange rate ($y_{t,2}$) series. The arrows point from $y_{t-2}$ to $y_{t-1}$ with these two points chosen as follows. Upper panel: $y_{t-2}$ and $y_{t-1}$ correspond to April 2000 and May 2000 (left), May 2000 and June 2000 (middle), and June 2000 and July 2000 (right). Lower panel: $y_{t-2}$ and $y_{t-1}$ correspond to October 2007 and November 2007 (left), November 2007 and December 2007 (middle), and December 2007 and January 2008 (right).

To better understand how the regime switches can occur in the GMVAR model (12), we examine the estimated mixing weights $\hat{\alpha}_{1,t} = \hat{P}_{t-1}(s_{t,1} = 1)$ graphically. As $\hat{\alpha}_{1,t}$ is a function of the four-dimensional vector $\mathbf{y}_{t-1} = (y_{t-1}, y_{t-2})$, we consider two-dimensional projections as a function of $y_{t-1}$ with the arguments chosen to correspond to observed values of the two series during five successive months. Thus, the three figures in the upper panel and lower panel of Figure 2 are related to observed values of the interest rate differential and the exchange rate series during the periods from April 2000 to August 2000, and from October 2007 to February 2008, respectively. The former of these periods contains the big drop in the estimated mixing weights that lasts only a couple of months (see Figure 1, left panel), whereas the latter illustrates a more gradual shift in the estimated mixing weights.

The figure on the left in the upper panel of Figure 2 depicts contour plots of the mixing weight $\hat{\alpha}_{1,t}$ as a function of $y_{t-1} = (y_{t-1,1}, y_{t-1,2})$ with the dot in the center of the contours being $y_{t-2} = (y_{t-2,1}, y_{t-2,2})$ chosen to match with the observed values of the two series in April 2000. The arrow from this dot towards the left points to $y_{t-1}$, the values of the two series in the next month, May 2000. Thus, the two dots give the four components of $\mathbf{y}_{t-1}$ that determine the value of the estimated mixing weight $\hat{\alpha}_{1,t}$ for June 2000, and, as can be seen from the figure, this value is below 0.1, with the precise value being only 0.02. This is a large drop compared to the values of $\hat{\alpha}_{1,t}$ for April 2000 and May 2000, which are 0.91 and 0.82, respectively. The contour plot in the middle of the upper panel shows the situation one month later. Thus, the dot in the center of this contour plot shows the components of $y_{t-2}$ for May 2000, with the arrow pointing to the dot showing the components of $y_{t-1}$ for June 2000. Together these two dots give the four components of $\mathbf{y}_{t-1}$ that determine the value of $\hat{\alpha}_{1,t}$ for July 2000, which is 0.49. Similarly, the two dots on the right of the upper panel show the components of $\mathbf{y}_{t-1}$ that determine the value of $\hat{\alpha}_{1,t}$ for August 2000, which is 0.86. Thus, the series has most likely visited the second regime (high mean, high variance regime) for only a month or two after May 2000 and then returned back to the first regime (low mean, low variance regime). What is interesting here is that the levels of both series have all the time been low, even below the estimated means of their stationary distributions, which are $-4.41$ and 116.88 for the interest rate differential ($y_{t,1}$) and the exchange rate ($y_{t,2}$), respectively. This illustrates the fact mentioned in the introduction that the GMVAR model can allow for regime switches taking place in various combinations of high/low levels and high/low variability of the considered series.

Now consider the period from October 2007 to February 2008 depicted in the lower panel of Figure 2. Prior to this period, the estimated mixing weight $\hat{\alpha}_{1,t}$ takes values 0.73 and 0.80 in August and September of 2007. The two dots on the left of the lower panel of Figure 2 show the observations for October 2007 and November 2007 (the tip of the arrow being outside the largest contour) when the estimated mixing weights $\hat{\alpha}_{1,t}$ are 0.44 and 0.38, respectively, indicating a gradual decrease from the values in August and September. These two dots determine the value of $\hat{\alpha}_{1,t}$ for December 2007, which is 0.01. During the following two months the estimated mixing weights remain below 0.01. These are depicted for December 2007 (the lower panel, left), January 2008 (the lower panel, middle), and February 2008 (the lower panel, right), respectively. As can be seen from Figure 1 (left panel), the estimated mixing weights remain very low till the end of the estimation period, December 2009.

4.3 Model evaluation

We next check the adequacy of the estimated GMVAR model. In mixture models, care is needed when residual-based diagnostics are used to evaluate fitted models, because empirical counterparts of the error terms $\varepsilon_t$ cannot be straightforwardly computed and, therefore, conventional residuals are not readily available. The reason for this is that the presence of the unobserved variables $s_{t,m}$ cannot be separated from the effect of $\varepsilon_t$ (see (1)). As in the univariate case in Kalliovirta et al. (2015), we use (multivariate) quantile residuals instead of conventional residuals in computing residual-based diagnostic tests. Tests for serial correlation, conditional heteroskedasticity, and non-normality in quantile residuals in multivariate models are developed by Kalliovirta and Saikkonen (2010) (see also Kalliovirta (2012) for similar tests in a univariate setting).⁵ Given the asymptotic results presented in Section 3.4, one can show that, under correct specification, the obtained p-values of these tests are asymptotically valid.

The results of the diagnostic tests based on multivariate quantile residuals are as follows. The test for normality has a p-value of 0.98 (this test is based on moments of multivariate quantile residuals, and under correct specification it is approximately $\chi^2_4$-distributed). The test statistics for autocorrelation and conditional heteroskedasticity have p-values 0.12 and 0.46, respectively (these tests are based on the first six serial covariances of the quantile residuals and on the first six serial covariances of squared quantile residuals, respectively, and under correct specification both of them are approximately $\chi^2_{24}$-distributed). These diagnostic tests show no indication of misspecification in the estimated GMVAR(2,2) model (12). Graphical analyses related to these tests (including time series plots, auto and cross correlation functions, Q-Q plots, and kernel densities of the residuals; not reported here but available in the Supplementary Appendix) further support the adequacy of the estimated GMVAR model.

⁵ When the null hypothesis is a linear VAR($p$) model, our test for conditional heteroskedasticity appears very similar to the general portmanteau-type test Dueker et al. (2011) use to test for nonlinearity.
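The precise definition of multivariate quantile residuals is given in Kalliovirta and Saikkonen (2010). As a rough illustration of the idea only (not the authors' exact construction), the following sketch computes Rosenblatt-type quantile residuals for one observation from the bivariate mixture conditional distribution (2): the first coordinate is transformed through its conditional mixture CDF, the second through its CDF conditional on the first, and both through the standard normal quantile function. All names are ours.

```python
import numpy as np
from scipy.stats import norm

def quantile_residuals_bivariate(y_t, weights, mus, Omegas):
    """Rosenblatt-type quantile residuals for one observation drawn from the
    bivariate mixture conditional density (2); a rough illustration only.

    weights : (M,) mixing weights alpha_{m,t}
    mus     : (M, 2) component conditional means mu_{m,t}
    Omegas  : (M, 2, 2) component covariance matrices Omega_m
    """
    # First coordinate: its conditional distribution is a univariate mixture.
    u1 = sum(w * norm.cdf(y_t[0], mu[0], np.sqrt(Om[0, 0]))
             for w, mu, Om in zip(weights, mus, Omegas))
    # Second coordinate given the first: update the component weights in
    # proportion to the first-coordinate densities, then mix the Gaussian
    # conditionals of y_{t,2} given y_{t,1}.
    upd = np.array([w * norm.pdf(y_t[0], mu[0], np.sqrt(Om[0, 0]))
                    for w, mu, Om in zip(weights, mus, Omegas)])
    upd = upd / upd.sum()
    u2 = sum(w * norm.cdf(y_t[1],
                          mu[1] + Om[1, 0] / Om[0, 0] * (y_t[0] - mu[0]),
                          np.sqrt(Om[1, 1] - Om[1, 0] ** 2 / Om[0, 0]))
             for w, mu, Om in zip(upd, mus, Omegas))
    # Under correct specification the residuals are approximately NID(0, I_2).
    return norm.ppf([u1, u2])
```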

4.4 Forecasting comparisons

We next compare the performance of the linear VAR model and the nonlinear GMVAR model in a forecasting exercise in which we also include several other models, although these models do not pass the residual-based diagnostics and some of them are found inferior to the GMVAR model by information criteria. As competing nonlinear multivariate models, we include the MVAR model of Fong et al. (2007), the C-MSTAR model of Dueker et al. (2011), and the (time-homogeneous) MS-VAR model (see, e.g., Krolzig (1997)). Furthermore, we also consider a restricted version of the GMVAR(2,2) model, GMVARr, in which the off-diagonal elements of the autoregressive parameters are restricted to zero, and all other parameters are estimated freely (see (12) and the discussion in Section 4.2). A number of other models were also tried; for brevity, all results and discussion regarding these are relegated to the Supplementary Appendix.

The parameters of all the models considered are estimated using maximum likelihood and observations from January 1989 to December 2009 (the parameter estimates, values of information criteria, and results of appropriate diagnostic tests are available in the Supplementary Appendix). We use a fixed forecasting scheme so that all the forecasts are based on these parameter estimates. The date of forecasting ranges from December 2009 till November 2013, and for each date of forecasting, forecasts are computed for all the subsequent periods up until December 2013. Assuming correct specification, optimal one-step-ahead forecasts (in the mean squared sense and ignoring estimation errors) are straightforward to compute with each model because explicit formulas are available for the conditional expectation (see (3)). However, computing multi-step forecasts is more complicated for mixture models because explicit formulas are very difficult to obtain. Therefore, we follow common practice and resort to simulation-based methods (see, e.g., Dueker et al. (2007, Sec. 4.2), Teräsvirta, Tjøstheim, and Granger (2010, Ch. 14), and the references therein). For mixture models, multi-step forecasts are computed as follows. Using each of the estimated mixture models and initial values known at the date of forecasting, we simulate 500,000 realizations and treat the mean of these realizations as a point forecast. We repeat this for all forecast horizons up until December 2013. This results in a total of 48 one-step forecasts, 47 two-step forecasts, . . . , 39 ten-step forecasts (as well as forecasts for longer horizons, which we discard).

We examine forecast accuracy of point forecasts for the two (univariate) component series separately as well as for the bivariate system as a whole. Forecast accuracy is measured using mean squared prediction error (MSPE) for the forecasts of the component series, and using the determinant of the covariance matrix of the forecast error vectors in the multivariate case. The results are presented in Figure 3: the left panel presents MSPEs when forecasting interest rates, the middle panel the MSPEs for the exchange rate series, and the right panel the determinant of the covariance matrix of the forecast error vectors for the bivariate system. Forecast accuracy of the models is presented relative to the GMVAR model: the straight line (at 100) represents the GMVAR model, whereas the other lines represent the size of the forecast error made relative to the GMVAR model (for instance, a value of 109 in the figure is to be interpreted as an MSPE (or a determinant of the covariance matrix of the forecast error vector) 9% larger than for the GMVAR model).

Figure 3: Relative forecast accuracies in the interest rate series (left) and the exchange rate series (middle) using mean squared prediction error (MSPE), and the determinant of the covariance matrix of forecast error vectors (right).

As far as forecasting the interest rate series is concerned, the relative merits of the competing models are clear: the most accurate forecasts are produced by the GMVAR model, followed by the restricted GMVAR model, the MVAR model, the VAR model, and finally the C-MSTAR and MS-VAR models. Except for the last two models, this ranking remains the same across all the considered forecast horizons (1-step, . . . , 10-step). The results are less clear when forecasts of the exchange rate series are compared. All six models are more or less equally good in 1-step and 2-step ahead forecasting (the best 1-step ahead forecasts are produced by the C-MSTAR model and the best 2-step ahead forecasts by the GMVARr model, but overall the differences are small). For longer forecast horizons the GMVAR model again outperforms its competitors. The rightmost panel of Figure 3 summarizes the forecast accuracy of the models using the determinant of the covariance matrix of the forecast error vectors. The C-MSTAR model narrowly outperforms the other models in 1-step ahead forecasting, but performs less well at longer forecast horizons. The GMVAR model produces the most accurate forecasts for all forecast horizons larger than 1. The other models perform more or less equally well, with their ranking depending on the forecast horizon. Overall, we can conclude that the GMVAR model performs quite well in comparison with its competitors.

Details of the forecasting accuracy of the GMVAR(2,2) model can be found in Section A1.4 of the Supplementary Appendix. Here we only note that the forecast accuracy was best in one-step-ahead prediction and, as expected, steadily deteriorated with the forecast horizon. The root MSPEs for 1-step, 2-step, and 4-step forecasts are 2.30, 3.62, and 5.27 for the interest rate series and 3.09, 4.82, and 6.57 for the exchange rate series (the standard deviations of the interest rate series and exchange rate series are 12.53 and 15.56, respectively).

We also examined one-step-ahead prediction intervals produced by the six competing models. Table 1 presents the percentage shares of observations that belong to the 80% and 90% prediction intervals based on the distribution of 500,000 simulated one-step-ahead forecasts.

20 Table 1: The percentage shares of observations that belong to the 80% and 90% prediction intervals based on the distribution of 500,000 one-step ahead simulated forecasts.

              Interest rate        Exchange rate
              80%       90%        80%       90%
GMVAR          79        92         81        90
GMVARr         79        90         81        90
VAR            75        79         81        90
MVAR           71        79         77        90
C-MSTAR        71        79         79        92
MS-VAR         71        79         79        85

It is seen that for the interest rate series the empirical coverage rates of the GMVAR and GMVARr based prediction intervals are closer to the nominal 80% and 90% levels than the ones obtained with the other models, whereas for the exchange rate series the differences are small. Christoffersen's (1998) tests indicate that for the interest rate series the 90% prediction intervals produced by the VAR, MVAR, C-MSTAR, and MS-VAR models have incorrect coverage probabilities, whereas all the other prediction intervals in Table 1 have correct coverage probabilities (for details, see Section A1.6 of the Supplementary Appendix). This better predictive accuracy may be partly explained by the particular mixing weights used in the GMVAR models.
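Empirical coverage rates like those in Table 1 can be computed directly from the simulated forecast distributions: take the relevant quantiles of the simulated draws as interval endpoints and count how often the realized value falls inside. A minimal sketch (ours, with illustrative names):

```python
import numpy as np

def interval_coverage(draws, realized, level=0.80):
    """Empirical coverage (in percent) of simulation-based intervals.

    draws    : (T, n_sims) simulated one-step forecasts for each origin
    realized : (T,) realized values
    """
    lo = np.quantile(draws, (1 - level) / 2, axis=1)
    hi = np.quantile(draws, (1 + level) / 2, axis=1)
    return 100.0 * np.mean((realized >= lo) & (realized <= hi))
```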

5 Conclusion

This paper introduces a new mixture VAR model referred to as the Gaussian mixture vector autoregressive (GMVAR) model. Due to the particular specification of the mixing weights, the GMVAR model has a clear probability structure with simple conditions ensuring stationarity and ergodicity. Building on these properties of the GMVAR model, the paper develops an asymptotic theory of maximum likelihood estimation, establishing the consistency and asymptotic normality of the ML estimator.

In addition to theoretical attractiveness, an appealing feature of the mixing weights employed in the GMVAR model is that they depend on past values of the series in a way which enables the researcher to associate different regimes of the model with different states of the economy. In this respect the GMVAR model is very flexible, being capable of allowing for regime switches that take place, for instance, in periods of high/low levels of the considered series, or in periods of high/low variability, or high/low temporal dependence, or combinations of all of these. The practical use of the new model is illustrated by a bivariate example on exchange rate and interest rate data. A GMVAR model with two economically meaningful regimes is found to provide a good in-sample fit and good forecasting power in comparison to the considered alternatives.

Technical Appendix

Proof of Theorem 1. We first note some properties of the stationary auxiliary vector autoregressions $\nu_{m,t}$. Denoting $\boldsymbol{\nu}^+_{m,t} = (\nu_{m,t}, \boldsymbol{\nu}_{m,t-1})$ and $1_{p+1} = (1, \ldots, 1)$ ($(p+1) \times 1$), it is seen that $\boldsymbol{\nu}^+_{m,t}$ follows the $d(p+1)$-dimensional multivariate normal distribution with density

$$n_{d(p+1)}(\boldsymbol{\nu}^+_{m,t}; \vartheta_m) = (2\pi)^{-d(p+1)/2} \det(\Sigma_{m,p+1})^{-1/2} \exp\left\{ -\frac{1}{2} \left( \boldsymbol{\nu}^+_{m,t} - 1_{p+1} \otimes \mu_m \right)' \Sigma_{m,p+1}^{-1} \left( \boldsymbol{\nu}^+_{m,t} - 1_{p+1} \otimes \mu_m \right) \right\},$$

where the matrices $\Sigma_{m,p+1}$, $m = 1, \ldots, M$, have the usual symmetric block Toeplitz form. This joint density can be decomposed as

$$n_{d(p+1)}(\boldsymbol{\nu}^+_{m,t}; \vartheta_m) = n_d(\nu_{m,t} \mid \boldsymbol{\nu}_{m,t-1}; \vartheta_m)\, n_{dp}(\boldsymbol{\nu}_{m,t-1}; \vartheta_m), \qquad (13)$$

where the normality of the two densities on the right-hand side follows from properties of the multivariate normal distribution (see, e.g., Anderson (2003, Thms 2.4.3 and 2.5.1)). Moreover, $n_{dp}(\cdot\,; \vartheta_m)$ clearly has the form given in (6), and making use of the Yule-Walker equations for VAR processes (see, e.g., Reinsel (1997, Sec. 2.2.2)) together with the identity $\mu_m = A_m^{-1}(1) \phi_{m,0}$, it can be seen that

$$n_d(\nu_{m,t} \mid \boldsymbol{\nu}_{m,t-1}; \vartheta_m) = (2\pi)^{-d/2} \det(\Omega_m)^{-1/2} \exp\left\{ -\frac{1}{2} \left( \nu_{m,t} - \mu_{m,t} \right)' \Omega_m^{-1} \left( \nu_{m,t} - \mu_{m,t} \right) \right\}, \qquad (14)$$

where $\mu_{m,t}$ is defined as in Condition 1(a) (with the $y$'s replaced by $\nu_m$'s).

The rest of the proof makes use of the theory of Markov chains (for the employed concepts, see Meyn and Tweedie (2009)). As was noted in the discussion preceding the theorem, $\mathbf{y}_t$ is a Markov chain on $\mathbb{R}^{dp}$. Now let $\mathbf{y}_0 = (y_0, \ldots, y_{-p+1})$ be a random vector whose distribution has the density $f(\mathbf{y}_0; \theta) = \sum_{m=1}^{M} \alpha_m n_{dp}(\mathbf{y}_0; \vartheta_m)$. According to (2), (7), (13), and (14), the conditional density of $y_1$ given $\mathbf{y}_0$ is

$$f(y_1 \mid \mathbf{y}_0; \theta) = \sum_{m=1}^{M} \frac{\alpha_m}{\sum_{n=1}^{M} \alpha_n n_{dp}(\mathbf{y}_0; \vartheta_n)}\, n_{dp}(\mathbf{y}_0; \vartheta_m)\, n_d(y_1 \mid \mathbf{y}_0; \vartheta_m) = \sum_{m=1}^{M} \frac{\alpha_m}{\sum_{n=1}^{M} \alpha_n n_{dp}(\mathbf{y}_0; \vartheta_n)}\, n_{d(p+1)}((y_1, \mathbf{y}_0); \vartheta_m).$$

It thus follows that the densityP of (y1, y0) = (y1, y0, . . . , y p+1) is − M

f((y1, y0); θ) = αmnd(p+1) ((y1, y0); ϑm) . m=1 X 23 M Integrating y p+1 away it follows that the density of y1 is f(y1; θ) = m=1 αmndp (y1; ϑm). − Therefore, y and y are identically distributed. As already noted, y ∞ is a (time ho- 0 1 P{ t}t=1 mogeneous) Markov chain, and hence we can conclude that yt t∞=1 has a stationary dis- {M } tribution πy ( ), say, characterized by the density f( ; θ) = αmndp ( ; ϑm) (cf. Meyn · · m=1 · and Tweedie (2009, pp. 230—231)). As a mixture of multivariateP normal distributions, all moments of yt are finite. It remains to establish ergodicity. To this end, let P p(y, ) = Pr(y y = y) signify y · p | 0 the p-step transition probability measure of y . It is straightforward to check that P p(y, ) t y · has a density given by

p p M

f(yp y0; θ) = f(yt yt 1; θ) = αm,tnd yt yt 1; ϑm . − − | t=1 | t=1 m=1 | Y Y X  dp dp The last expression makes clear that f(yp y0; θ) > 0 for all yp R and all y0 R | ∈ ∈ dp so that, from every initial state y0 = y ( R ), the chain yt can in p steps reach ∈ any set of the state space Rdp with positive Lebesgue measure. Using the definitions of

irreducibility and aperiodicity we can therefore conclude that the chain $\mathbf{y}_t$ is irreducible and aperiodic (see Meyn and Tweedie (2009, Chapters 4.3 and 5.4)). Moreover, also the $p$-step transition probability measure $P^p_{\mathbf{y}}(\mathbf{y},\cdot)$ is irreducible, aperiodic, and has $\pi_{\mathbf{y}}$ as its stationary distribution (see Meyn and Tweedie, 2009, Theorem 10.4.5).

A further consequence of the preceding discussion is that the $p$-step transition probability measure $P^p_{\mathbf{y}}(\mathbf{y},\cdot)$ is equivalent to the Lebesgue measure on $\mathbb{R}^{dp}$ for all $\mathbf{y}\in\mathbb{R}^{dp}$. As the stationary probability measure $\pi_{\mathbf{y}}(\cdot)$ also has a (Lebesgue) density positive everywhere in $\mathbb{R}^{dp}$, it is likewise equivalent to the Lebesgue measure on $\mathbb{R}^{dp}$. Consequently, the $p$-step transition probability measure $P^p_{\mathbf{y}}(\mathbf{y},\cdot)$ is absolutely continuous with respect to the stationary probability measure $\pi_{\mathbf{y}}(\cdot)$ for all $\mathbf{y}\in\mathbb{R}^{dp}$.

To complete the proof, we now use the preceding facts and conclude from Theorem 1 and Corollary 1 of Tierney (1994) that $\|P^{pn}_{\mathbf{y}}(\mathbf{y},\cdot) - \pi_{\mathbf{y}}(\cdot)\| \to 0$ as $n\to\infty$ for all $\mathbf{y}\in\mathbb{R}^{dp}$, where $\|\cdot\|$ signifies the total variation norm of probability measures. Now, by Proposition 13.3.2 of Meyn and Tweedie (2009), also $\|P^{n}_{\mathbf{y}}(\mathbf{y},\cdot) - \pi_{\mathbf{y}}(\cdot)\| \to 0$ as $n\to\infty$ for all $\mathbf{y}\in\mathbb{R}^{dp}$ (as the total variation norm is non-increasing in $n$). Hence, $\mathbf{y}_t$ is ergodic in the sense of Meyn and Tweedie (2009, Ch. 13).
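As an informal numerical companion to Theorem 1 (a sketch, not part of the formal argument), the following Python snippet simulates a bivariate GMVAR process with $p=1$ and $M=2$ and compares the sample mean of $y_t$ with the mean implied by the stationary mixture density $\sum_{m=1}^M \alpha_m\, n_{dp}(\cdot\,;\vartheta_m)$. All parameter values are hypothetical choices satisfying the stationarity condition of each regime, and the helper objects are ours, not part of the model definition.

```python
# Sketch: simulate a bivariate GMVAR(1, 2) and compare the sample mean of y_t
# with the mean of the stationary mixture density (hypothetical parameters).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(0)
d, M = 2, 2
alpha = np.array([0.6, 0.4])                               # mixture weights alpha_m
phi0 = [np.array([0.5, 0.0]), np.array([-0.5, 0.5])]       # intercepts phi_{m,0}
A = [np.array([[0.5, 0.1], [0.0, 0.4]]),                   # AR matrices A_{m,1}
     np.array([[0.2, -0.1], [0.1, 0.3]])]
Omega = [np.eye(d), 0.5 * np.eye(d)]                       # error covariances Omega_m

# Regime stationary mean mu_m = A_m(1)^{-1} phi_{m,0}; for p = 1 the stationary
# covariance solves the discrete Lyapunov equation Sigma = A Sigma A' + Omega.
mu = [np.linalg.solve(np.eye(d) - A[m], phi0[m]) for m in range(M)]
Sigma = [solve_discrete_lyapunov(A[m], Omega[m]) for m in range(M)]
ndp = [multivariate_normal(mu[m], Sigma[m]) for m in range(M)]

T = 20_000
y = np.zeros((T, d))
y[0] = mu[0]
for t in range(1, T):
    # mixing weights alpha_{m,t} proportional to alpha_m * n_dp(y_{t-1}; theta_m)
    w = np.array([alpha[m] * ndp[m].pdf(y[t - 1]) for m in range(M)])
    m = rng.choice(M, p=w / w.sum())                       # draw the regime
    y[t] = phi0[m] + A[m] @ y[t - 1] + rng.multivariate_normal(np.zeros(d), Omega[m])

print("sample mean :", y.mean(axis=0))
print("mixture mean:", sum(alpha[m] * mu[m] for m in range(M)))
```

The two printed vectors should agree up to simulation error, consistent with the stationary distribution being the mixture $\sum_{m=1}^M \alpha_m\, n_{dp}(\cdot\,;\vartheta_m)$.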

Proof of Theorem 2. We consider the conditional ML estimator obtained by maximizing $L_T(\theta) = T^{-1}\sum_{t=1}^T l_t(\theta)$ (it is easy to verify that the conditional and exact ML estimators are asymptotically equivalent), and assume stationary initial values (the same result can be obtained without assuming this). Assumption 1 together with the continuity of the log-likelihood function $L_T(\theta)$ implies the existence of a measurable maximizer $\hat\theta_T$ of $L_T(\theta)$ (see, e.g., Pötscher and Prucha, 1991, Lemma 3.4). For the strong consistency it suffices to show that the (conditional) log-likelihood obeys a uniform strong law of large numbers, that is, $\sup_{\theta\in\Theta}|L_T(\theta) - E[L_T(\theta)]| \to 0$ a.s. as $T\to\infty$, and that an identification condition (to be specified below) holds. For the former, as the initial values are drawn from the stationary distribution, the process is stationary and ergodic and $E[L_T(\theta)] = E[l_t(\theta)]$, and it thus suffices to show that $E[\sup_{\theta\in\Theta}|l_t(\theta)|] < \infty$ (see, e.g., Ranga Rao (1962)). To this end, Assumption 1 implies that $\delta \le \det(\Omega_m) \le \Delta$ and $\delta \le \alpha_m \le 1-\delta$ for some $0 < \delta < 1$ and $\Delta < \infty$, $m=1,\ldots,M$. Now, because $0 \le \alpha_{m,t} \le 1$ and $\det(\Omega_m) \ge \delta$ for all $m=1,\ldots,M$, and because the exponential function is bounded from above by unity on the non-positive real axis (and $\Omega_m$ is positive definite), we can find a $C < \infty$ such that
$$l_t(\theta) = \log\left(\sum_{m=1}^M \alpha_{m,t}\,(2\pi)^{-d/2}\det(\Omega_m)^{-1/2}\exp\left\{-\tfrac{1}{2}\left(y_t-\mu_{m,t}\right)'\Omega_m^{-1}\left(y_t-\mu_{m,t}\right)\right\}\right) \le C$$
for all $\theta\in\Theta$. On the other hand, making use of properties of the trace operator and the fact that $\Theta$ is compact it can be seen that

$$\left(y_t-\mu_{m,t}\right)'\Omega_m^{-1}\left(y_t-\mu_{m,t}\right) \le c_1\left(1 + y_t'y_t + \mathbf{y}_{t-1}'\mathbf{y}_{t-1}\right), \quad m=1,\ldots,M,$$
for all $\theta\in\Theta$ with some finite $c_1$. As $\det(\Omega_m) \le \Delta$ for all $m=1,\ldots,M$, we thus get

$$n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_m) \ge (2\pi)^{-d/2}\Delta^{-1/2}\exp\left\{-\tfrac{1}{2}c_1\left(1 + y_t'y_t + \mathbf{y}_{t-1}'\mathbf{y}_{t-1}\right)\right\}$$
for all $\theta\in\Theta$ and all $m=1,\ldots,M$. Therefore, as $\sum_{m=1}^M \alpha_{m,t} = 1$,

$$l_t(\theta) = \log\left(\sum_{m=1}^M \alpha_{m,t}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_m)\right) \ge -\frac{d}{2}\log(2\pi) - \frac{1}{2}\log(\Delta) - \frac{1}{2}c_1\left(1 + y_t'y_t + \mathbf{y}_{t-1}'\mathbf{y}_{t-1}\right)$$
for all $\theta\in\Theta$. Defining $C < \infty$ suitably and combining with the result obtained above, $-C\left(1 + y_t'y_t + \mathbf{y}_{t-1}'\mathbf{y}_{t-1}\right) \le l_t(\theta) \le C$, from which $E[\sup_{\theta\in\Theta}|l_t(\theta)|] < \infty$ follows because $E[y_t'y_t + \mathbf{y}_{t-1}'\mathbf{y}_{t-1}] < \infty$.

Concerning the identification condition, we next establish that $E[l_t(\theta)] \le E[l_t(\theta_0)]$, and that $E[l_t(\theta)] = E[l_t(\theta_0)]$ implies $\vartheta_m = \vartheta_{\tau(m),0}$ and $\alpha_m = \alpha_{\tau(m),0}$ ($m=1,\ldots,M$) for some permutation $\{\tau(1),\ldots,\tau(M)\}$ of $\{1,\ldots,M\}$. In light of (11), this implies that $\theta = \theta_0$.

Choose an arbitrary $\theta$ and consider the difference $E[l_t(\theta)] - E[l_t(\theta_0)]$. First note that the density of $(y_t,\mathbf{y}_{t-1})$ can be written as

$$\sum_{m=1}^M \alpha_{m,0}\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_{m,0}) = \sum_{n=1}^M \alpha_{n,0}\, n_{dp}(\mathbf{y}_{t-1};\vartheta_{n,0})\,\sum_{m=1}^M \alpha_{m,0,t}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_{m,0}),$$
so that $E[l_t(\theta)] - E[l_t(\theta_0)]$ can be written as

$$\begin{aligned}
E[l_t(\theta)] - E[l_t(\theta_0)] &= \iint \sum_{m=1}^M \alpha_{m,0}\, n_{d(p+1)}((y,\mathbf{y});\vartheta_{m,0}) \log\left(\frac{\sum_{m=1}^M \alpha_{m,t}\, n_d(y\mid\mathbf{y};\vartheta_m)}{\sum_{m=1}^M \alpha_{m,0,t}\, n_d(y\mid\mathbf{y};\vartheta_{m,0})}\right) dy\, d\mathbf{y} \\
&= \int \sum_{n=1}^M \alpha_{n,0}\, n_{dp}(\mathbf{y};\vartheta_{n,0}) \left[\int \sum_{m=1}^M \alpha_{m,0,t}\, n_d(y\mid\mathbf{y};\vartheta_{m,0}) \log\left(\frac{\sum_{m=1}^M \alpha_{m,t}\, n_d(y\mid\mathbf{y};\vartheta_m)}{\sum_{m=1}^M \alpha_{m,0,t}\, n_d(y\mid\mathbf{y};\vartheta_{m,0})}\right) dy\right] d\mathbf{y}.
\end{aligned}$$
(Here and in what follows, for conciseness we continue using the notation $\alpha_{m,t}$ although

now $\mathbf{y}_{t-1}$ in (7) is replaced by $\mathbf{y}$.) The inner integral is, for every fixed $\mathbf{y}$, the (negative of the) Kullback–Leibler divergence between the two mixture densities $\sum_{m=1}^M \alpha_{m,t}\, n_d(y\mid\mathbf{y};\vartheta_m)$ and $\sum_{m=1}^M \alpha_{m,0,t}\, n_d(y\mid\mathbf{y};\vartheta_{m,0})$. Therefore, $E[l_t(\theta)] - E[l_t(\theta_0)] \le 0$, with equality if and only if for almost all $(y,\mathbf{y})$,

$$\sum_{m=1}^M \alpha_{m,t}\, n_d(y\mid\mathbf{y};\vartheta_m) = \sum_{m=1}^M \alpha_{m,0,t}\, n_d(y\mid\mathbf{y};\vartheta_{m,0}).$$
For each fixed $\mathbf{y}$ at a time, the mixing weights are constants, and we may apply the results on identification of finite mixtures of Gaussian distributions in Yakowitz and Spragins (1968, Proposition 2) (see also (11)). Consequently, for each fixed $\mathbf{y}$ at a time, there exists a permutation $\{\tau(1),\ldots,\tau(M)\}$ of $\{1,\ldots,M\}$ (where this permutation may depend on $\mathbf{y}$) such that

$$\alpha_{m,t} = \alpha_{\tau(m),0,t} \quad\text{and}\quad n_d(y\mid\mathbf{y};\vartheta_m) = n_d(y\mid\mathbf{y};\vartheta_{\tau(m),0}) \ \text{ for almost all } y \quad (m=1,\ldots,M). \tag{15}$$
The number of possible permutations being finite ($M!$), this induces a finite partition of $\mathbb{R}^{dp}$ where the elements $\mathbf{y}$ of each partition correspond to the same permutation. At least one of these partitions, say $A\subset\mathbb{R}^{dp}$, must have positive Lebesgue measure. Thus, we may conclude that (15) holds for all fixed $\mathbf{y}\in A$ with some specific permutation $\{\tau(1),\ldots,\tau(M)\}$ of $\{1,\ldots,M\}$.

The latter condition in (15) means that (denoting $A_m = [A_{m,1}:\cdots:A_{m,p}]$ ($d\times dp$) and $A_{m,0}$ similarly using the true parameter values)

$$\begin{aligned}
&\det(\Omega_m)^{-1/2}\exp\left\{-\tfrac{1}{2}\left(y - \phi_{m,0} - A_m\mathbf{y}\right)'\Omega_m^{-1}\left(y - \phi_{m,0} - A_m\mathbf{y}\right)\right\} \\
&\qquad= \det(\Omega_{\tau(m),0})^{-1/2}\exp\left\{-\tfrac{1}{2}\left(y - \phi_{\tau(m),0,0} - A_{\tau(m),0}\mathbf{y}\right)'\Omega_{\tau(m),0}^{-1}\left(y - \phi_{\tau(m),0,0} - A_{\tau(m),0}\mathbf{y}\right)\right\}
\end{aligned}$$
for $m=1,\ldots,M$, almost all $y\in\mathbb{R}^d$, and all $\mathbf{y}\in A$. This implies that $\Omega_m = \Omega_{\tau(m),0}$ and $\phi_{m,0} - \phi_{\tau(m),0,0} + (A_m - A_{\tau(m),0})\mathbf{y} = 0$, $m=1,\ldots,M$, for all $\mathbf{y}\in A$. The latter equalities imply either that $(\phi_{m,0},\boldsymbol{\phi}_m) = (\phi_{\tau(m),0,0},\boldsymbol{\phi}_{\tau(m),0})$, or that $\mathbf{y}$ takes values only on a $d(p-1)$-dimensional hyperplane. As $A$ has positive Lebesgue measure, the latter is not possible. Therefore $\vartheta_m = (\phi_{m,0},\boldsymbol{\phi}_m,\sigma_m) = (\phi_{\tau(m),0,0},\boldsymbol{\phi}_{\tau(m),0},\sigma_{\tau(m),0}) = \vartheta_{\tau(m),0}$ ($m=1,\ldots,M$).

Now, the former condition in (15) means that

$$\frac{\alpha_m\, n_{dp}(\mathbf{y};\vartheta_m)}{\sum_{n=1}^M \alpha_n\, n_{dp}(\mathbf{y};\vartheta_n)} = \frac{\alpha_{\tau(m),0}\, n_{dp}(\mathbf{y};\vartheta_{\tau(m),0})}{\sum_{n=1}^M \alpha_{\tau(n),0}\, n_{dp}(\mathbf{y};\vartheta_{\tau(n),0})}, \quad m=1,\ldots,M,$$
for all $\mathbf{y}\in A$. Cancelling $n_{dp}(\mathbf{y};\vartheta_m) = n_{dp}(\mathbf{y};\vartheta_{\tau(m),0})$ and rearranging,
$$\frac{\alpha_m}{\alpha_{\tau(m),0}} = \frac{\sum_{n=1}^M \alpha_n\, n_{dp}(\mathbf{y};\vartheta_n)}{\sum_{n=1}^M \alpha_{\tau(n),0}\, n_{dp}(\mathbf{y};\vartheta_{\tau(n),0})}, \quad m=1,\ldots,M,$$

for all $\mathbf{y}\in A$. As the right hand side does not depend on $m$, we obtain $\alpha_1/\alpha_{\tau(1),0} = \cdots = \alpha_M/\alpha_{\tau(M),0}$, which implies $\alpha_m = \alpha_{\tau(m),0}$ ($m=1,\ldots,M$). This completes the proof.
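In practice the objective function $L_T(\theta)$ of the conditional ML estimator analyzed above is evaluated on the log scale. The following sketch (a hypothetical helper, not the authors' code) computes $l_t(\theta)$ for the case $p=1$ using the log-sum-exp device, so that the evaluation stays finite even where the regime densities $n_{dp}(\mathbf{y}_{t-1};\vartheta_m)$ are vanishingly small.

```python
# Sketch: stable evaluation of l_t(theta) = log f(y_t | y_{t-1}; theta) for a
# GMVAR(1, M), combining the mixing weights (7) with log-sum-exp.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_lik_t(y_t, y_lag, alpha, phi0, A, Omega, mu, Sigma):
    """alpha: mixture weights; phi0, A, Omega: regime intercepts, AR matrices,
    and error covariances; mu, Sigma: regime stationary means and covariances
    entering the mixing weights (7). All are lists over regimes except alpha."""
    M = len(alpha)
    # log of alpha_m * n_dp(y_{t-1}; theta_m), the numerators of (7)
    log_num = np.array([np.log(alpha[m])
                        + multivariate_normal(mu[m], Sigma[m]).logpdf(y_lag)
                        for m in range(M)])
    log_w = log_num - logsumexp(log_num)            # log alpha_{m,t}
    # log n_d(y_t | y_{t-1}; theta_m): mean mu_{m,t}, covariance Omega_m
    log_cond = np.array([multivariate_normal(phi0[m] + A[m] @ y_lag,
                                             Omega[m]).logpdf(y_t)
                         for m in range(M)])
    return logsumexp(log_w + log_cond)              # l_t(theta)
```

Averaging `log_lik_t` over $t=1,\ldots,T$ gives $L_T(\theta)$; working on the log scale keeps the computation stable, in line with the bounds $-C(1+y_t'y_t+\mathbf{y}_{t-1}'\mathbf{y}_{t-1}) \le l_t(\theta) \le C$ used in the proof.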

Proof of Theorem 3. Let $\Theta_0$ be a compact convex set contained in the interior of $\Theta$ that has $\theta_0$ as an interior point, and introduce the notation $l_{\theta,t}(\theta) = \partial l_t(\theta)/\partial\theta$, $l_{\theta\theta,t}(\theta) = \partial^2 l_t(\theta)/\partial\theta\partial\theta'$, $L_{\theta,T}(\theta) = \partial L_T(\theta)/\partial\theta$, and $L_{\theta\theta,T}(\theta) = \partial^2 L_T(\theta)/\partial\theta\partial\theta'$. Expressions of

$l_{\theta,t}(\theta)$ and $l_{\theta\theta,t}(\theta)$ in Lemmas 1 and 2 below make clear that $l_t(\theta)$ is twice continuously

differentiable on $\Theta_0$. A standard mean value expansion of the score vector $L_{\theta,T}(\theta)$ yields

$$T^{1/2}L_{\theta,T}(\hat\theta_T) = T^{1/2}L_{\theta,T}(\theta_0) + \dot{L}_{\theta\theta,T}\, T^{1/2}(\hat\theta_T - \theta_0) \quad \text{a.s.,} \tag{16}$$

where $\dot{L}_{\theta\theta,T}$ signifies the matrix $L_{\theta\theta,T}(\theta)$ with each row evaluated at an intermediate point $\dot\theta_{i,T}$ ($i=1,\ldots,\dim\theta$) lying between $\hat\theta_T$ and $\theta_0$. By Theorem 2, $\hat\theta_T \to \theta_0$ a.s., so that $\dot\theta_{i,T} \to \theta_0$ a.s. as $T\to\infty$ ($i=1,\ldots,\dim\theta$) which, together with the uniform convergence result for $L_{\theta\theta,T}(\theta)$ in Lemma 3 below, yields $\dot{L}_{\theta\theta,T} \to \mathcal{J}(\theta_0)$ a.s. as $T\to\infty$. This and the invertibility of $\mathcal{J}(\theta_0)$ obtained from Assumption 2 and the result $\mathcal{J}(\theta_0) = -\mathcal{I}(\theta_0)$ established below in Lemma 2 imply that, for all $T$ sufficiently large, $\dot{L}_{\theta\theta,T}$ is also invertible (a.s.) and $\dot{L}_{\theta\theta,T}^{-1} \to \mathcal{J}(\theta_0)^{-1}$ a.s. as $T\to\infty$. Multiplying the mean value expansion (16) with the Moore–Penrose inverse $\dot{L}_{\theta\theta,T}^{+}$ of $\dot{L}_{\theta\theta,T}$ (this inverse exists for all $T$) and rearranging we obtain

$$T^{1/2}(\hat\theta_T - \theta_0) = \left(I_{\dim\theta} - \dot{L}_{\theta\theta,T}^{+}\dot{L}_{\theta\theta,T}\right)T^{1/2}(\hat\theta_T - \theta_0) + \dot{L}_{\theta\theta,T}^{+}\,T^{1/2}L_{\theta,T}(\hat\theta_T) - \dot{L}_{\theta\theta,T}^{+}\,T^{1/2}L_{\theta,T}(\theta_0). \tag{17}$$
The first two terms on the right hand side of (17) converge to zero a.s. (for the first

term, this follows from the fact that for all $T$ sufficiently large $\dot{L}_{\theta\theta,T}$ is invertible; for the second one, this holds because $\hat\theta_T$ being a maximizer of $L_T(\theta)$ and $\theta_0$ being an interior point of $\Theta_0$ yield $L_{\theta,T}(\hat\theta_T) = 0$ for all $T$ sufficiently large). Furthermore, the eventual a.s. invertibility of $\dot{L}_{\theta\theta,T}$ also means that $\dot{L}_{\theta\theta,T}^{+} - \mathcal{J}(\theta_0)^{-1} \to 0$ a.s. Hence, (17) becomes
$$T^{1/2}(\hat\theta_T - \theta_0) = o_1(1) - \left(\mathcal{J}(\theta_0)^{-1} + o_2(1)\right)T^{1/2}L_{\theta,T}(\theta_0),$$

where $o_1(1)$ and $o_2(1)$ (a vector- and a matrix-valued process, respectively) converge to

zero a.s. Combining this with the result of Lemma 1 below and the property $\mathcal{J}(\theta_0) = -\mathcal{I}(\theta_0)$ (see Lemma 2 below) completes the proof.

Lemma 1. Under the assumptions of Theorem 3, $T^{1/2}\,\dfrac{\partial}{\partial\theta}L_T(\theta_0) \overset{d}{\to} N\left(0,\mathcal{I}(\theta_0)\right)$, where $\mathcal{I}(\theta_0) = E\left[\dfrac{\partial l_t(\theta_0)}{\partial\theta}\dfrac{\partial l_t(\theta_0)}{\partial\theta'}\right]$ is finite.

Proof. We begin by deriving the score vectors (of a single observation) with respect to

the parameters $\vartheta$ and $\alpha$. To this end, first note that $l_t(\theta)$ can be expressed as

$$l_t(\theta) = \log\left(\sum_{m=1}^M \alpha_m\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m)\right) - \log\left(\sum_{m=1}^M \alpha_m\, n_{dp}(\mathbf{y}_{t-1};\vartheta_m)\right). \tag{18}$$
Next, introduce the notation

$$\begin{aligned}
l^{(1)}_{\alpha_m,t}(y_t,\mathbf{y}_{t-1}) &= \frac{n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m) - n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_M)}{\sum_{n=1}^M \alpha_n\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_n)}, \\
l^{(2)}_{\alpha_m,t}(\mathbf{y}_{t-1}) &= \frac{n_{dp}(\mathbf{y}_{t-1};\vartheta_m) - n_{dp}(\mathbf{y}_{t-1};\vartheta_M)}{\sum_{n=1}^M \alpha_n\, n_{dp}(\mathbf{y}_{t-1};\vartheta_n)}, \\
l^{(1)}_{\vartheta_m,t}(y_t,\mathbf{y}_{t-1}) &= \frac{\alpha_m}{\sum_{n=1}^M \alpha_n\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_n)}\,\frac{\partial}{\partial\vartheta_m}n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m) \\
&= \frac{\alpha_m\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m)}{\sum_{n=1}^M \alpha_n\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_n)}\,\frac{\partial}{\partial\vartheta_m}\log n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m), \\
l^{(2)}_{\vartheta_m,t}(\mathbf{y}_{t-1}) &= \frac{\alpha_m}{\sum_{n=1}^M \alpha_n\, n_{dp}(\mathbf{y}_{t-1};\vartheta_n)}\,\frac{\partial}{\partial\vartheta_m}n_{dp}(\mathbf{y}_{t-1};\vartheta_m) \\
&= \frac{\alpha_m\, n_{dp}(\mathbf{y}_{t-1};\vartheta_m)}{\sum_{n=1}^M \alpha_n\, n_{dp}(\mathbf{y}_{t-1};\vartheta_n)}\,\frac{\partial}{\partial\vartheta_m}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m),
\end{aligned}$$
where $m=1,\ldots,M-1$ in the first two quantities defined, and $m=1,\ldots,M$ in the last two. These quantities also depend on $\theta$, but for brevity we have suppressed this dependence. For the corresponding quantities evaluated at $\theta = \theta_0$, we use the notation

$l^{(1)}_{\alpha_m,t,0}(y_t,\mathbf{y}_{t-1})$, $l^{(2)}_{\alpha_m,t,0}(\mathbf{y}_{t-1})$, $l^{(1)}_{\vartheta_m,t,0}(y_t,\mathbf{y}_{t-1})$, and $l^{(2)}_{\vartheta_m,t,0}(\mathbf{y}_{t-1})$. Now, with straightforward differentiation of (18) we obtain the partial derivative with respect to $\alpha_m$ as

$$\frac{\partial}{\partial\alpha_m}l_t(\theta) = l^{(1)}_{\alpha_m,t}(y_t,\mathbf{y}_{t-1}) - l^{(2)}_{\alpha_m,t}(\mathbf{y}_{t-1}), \quad m=1,\ldots,M-1, \tag{19}$$
and with respect to $\vartheta_m$ as
$$\frac{\partial}{\partial\vartheta_m}l_t(\theta) = l^{(1)}_{\vartheta_m,t}(y_t,\mathbf{y}_{t-1}) - l^{(2)}_{\vartheta_m,t}(\mathbf{y}_{t-1}), \quad m=1,\ldots,M. \tag{20}$$
Making use of the identities

$$\sum_{m=1}^M \alpha_m\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m) = \sum_{m=1}^M \alpha_{m,t}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_m)\,\sum_{m=1}^M \alpha_m\, n_{dp}(\mathbf{y}_{t-1};\vartheta_m) \tag{21}$$
and

$$n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m) = n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_m)\, n_{dp}(\mathbf{y}_{t-1};\vartheta_m) \tag{22}$$
together with the definition of $\alpha_{m,t}$, these can alternatively be written as

$$\frac{\partial}{\partial\alpha_m}l_t(\theta) = \frac{1}{\sum_{n=1}^M \alpha_{n,t}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_n)}\left(\frac{\alpha_{m,t}}{\alpha_m}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_m) - \frac{\alpha_{M,t}}{\alpha_M}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_M)\right) - \left(\frac{\alpha_{m,t}}{\alpha_m} - \frac{\alpha_{M,t}}{\alpha_M}\right), \quad m=1,\ldots,M-1, \tag{23}$$
and

$$\frac{\partial}{\partial\vartheta_m}l_t(\theta) = \frac{\alpha_{m,t}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_m)}{\sum_{n=1}^M \alpha_{n,t}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_n)}\,\frac{\partial}{\partial\vartheta_m}\log n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m) - \alpha_{m,t}\,\frac{\partial}{\partial\vartheta_m}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m), \quad m=1,\ldots,M. \tag{24}$$
As the process $y_t$ is assumed to be stationary and ergodic, so is also the score vector.

We next establish that $\partial l_t(\theta_0)/\partial\theta$ is square integrable. To this end, conclude from (19) that $|\partial l_t(\theta_0)/\partial\alpha_m| \le c < \infty$, so that it suffices to consider $\partial l_t(\theta_0)/\partial\vartheta$. Thus, the desired result is obtained by showing that $l^{(1)}_{\vartheta_m,t,0}(y_t,\mathbf{y}_{t-1})$ and $l^{(2)}_{\vartheta_m,t,0}(\mathbf{y}_{t-1})$ are square integrable. To establish this, note that

$$\begin{aligned}
\left|l^{(1)}_{\vartheta_m,t,0}(y_t,\mathbf{y}_{t-1})\right|^2 &\le \frac{\alpha_{m,0}\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_{m,0})}{\sum_{n=1}^M \alpha_{n,0}\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_{n,0})}\left\|\frac{\partial}{\partial\vartheta_m}\log n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_{m,0})\right\|^2, \\
\left|l^{(2)}_{\vartheta_m,t,0}(\mathbf{y}_{t-1})\right|^2 &\le \frac{\alpha_{m,0}\, n_{dp}(\mathbf{y}_{t-1};\vartheta_{m,0})}{\sum_{n=1}^M \alpha_{n,0}\, n_{dp}(\mathbf{y}_{t-1};\vartheta_{n,0})}\left\|\frac{\partial}{\partial\vartheta_m}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_{m,0})\right\|^2.
\end{aligned}$$

Hence,
$$E\left[\left|l^{(1)}_{\vartheta_m,t,0}(y_t,\mathbf{y}_{t-1})\right|^2\right] \le \alpha_{m,0}\iint \left\|\frac{\partial}{\partial\vartheta_m}\log n_{d(p+1)}((y,\mathbf{y});\vartheta_{m,0})\right\|^2 n_{d(p+1)}((y,\mathbf{y});\vartheta_{m,0})\, dy\, d\mathbf{y},$$

which is finite because the integral is the expectation of the squared norm of the score of

$\vartheta_m$ corresponding to the density $n_{d(p+1)}((y,\mathbf{y});\vartheta_{m,0})$. In a similar manner it is seen that $E[|l^{(2)}_{\vartheta_m,t,0}(\mathbf{y}_{t-1})|^2] < \infty$. Thus, we have shown that $\partial l_t(\theta_0)/\partial\theta$ is square integrable.

For the martingale difference property, let $\alpha_{n,t,0}$ signify $\alpha_{n,t}$ evaluated at $\theta = \theta_0$, and notice that $E[\,\cdot\mid\mathcal{F}_{t-1}] = \int \cdot\ \sum_{n=1}^M \alpha_{n,t,0}\, n_d(y\mid\mathbf{y}_{t-1};\vartheta_{n,0})\, dy$. Concerning the score with respect to $\alpha$, taking conditional expectations it is immediately seen from (19) that $E[\partial l_t(\theta_0)/\partial\alpha_m\mid\mathcal{F}_{t-1}] = 0$ holds. As for the score with respect to $\vartheta_m$, use (24) and the fact $\log n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m) = \log n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_m) + \log n_{dp}(\mathbf{y}_{t-1};\vartheta_m)$ to obtain
$$\begin{aligned}
\frac{\partial l_t(\theta_0)}{\partial\vartheta_m}\sum_{n=1}^M \alpha_{n,t,0}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_{n,0}) &= \alpha_{m,t,0}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_{m,0})\,\frac{\partial}{\partial\vartheta_m}\log n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_{m,0}) \\
&\quad+ \alpha_{m,t,0}\,\frac{\partial}{\partial\vartheta_m}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_{m,0})\left[n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_{m,0}) - \sum_{n=1}^M \alpha_{n,t,0}\, n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_{n,0})\right].
\end{aligned}$$
Integrating over $y_t$ results in a zero vector because $\frac{\partial}{\partial\vartheta_m}\log n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_{m,0})$ is the score vector corresponding to the density $n_d(y_t\mid\mathbf{y}_{t-1};\vartheta_{m,0})$ and because the term in brackets integrates to zero (both densities integrating to unity), so that also $E[\partial l_t(\theta_0)/\partial\vartheta_m\mid\mathcal{F}_{t-1}] = 0$ holds.

The stated asymptotic normality now follows from the central limit theorem for stationary and ergodic martingale differences (see Billingsley (1961)).
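The martingale difference step above can be checked numerically. The following sketch works in a hypothetical univariate special case of the model ($d = p = 1$, $M = 2$, the model of Kalliovirta et al. (2015)) and verifies by quadrature that, for a fixed $y_{t-1}$, a finite-difference approximation to the score of the conditional log-density integrates to zero against the true conditional density; the parameter values and the component checked are arbitrary illustrative choices.

```python
# Sketch: for fixed y_{t-1}, int (d/d theta_i) log f(y | y_lag; theta_0)
#                              * f(y | y_lag; theta_0) dy should be ~ 0.
import numpy as np
from scipy.integrate import quad

def log_f(y, y_lag, th):
    """log f(y | y_lag; theta) for d = p = 1, M = 2;
    th = (alpha_1, phi_{1,0}, phi_{1,1}, sigma_1^2, phi_{2,0}, phi_{2,1}, sigma_2^2)."""
    a = np.array([th[0], 1.0 - th[0]])
    phi0, phi1, s2 = th[[1, 4]], th[[2, 5]], th[[3, 6]]
    mu, v = phi0 / (1.0 - phi1), s2 / (1.0 - phi1 ** 2)   # regime stationary moments
    num = a * np.exp(-0.5 * (y_lag - mu) ** 2 / v) / np.sqrt(2 * np.pi * v)
    w = num / num.sum()                                   # mixing weights (7)
    cond = np.exp(-0.5 * (y - phi0 - phi1 * y_lag) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
    return np.log((w * cond).sum())

th0 = np.array([0.6, 0.5, 0.5, 1.0, -0.5, 0.2, 0.5])
y_lag, h, i = 0.3, 1e-6, 2                 # perturb the phi_{1,1} component
e = np.zeros(7); e[i] = h
score_i = lambda y: (log_f(y, y_lag, th0 + e) - log_f(y, y_lag, th0 - e)) / (2 * h)
val, _ = quad(lambda y: score_i(y) * np.exp(log_f(y, y_lag, th0)), -np.inf, np.inf)
print(val)                                 # ~ 0 up to quadrature and differencing error
```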

Lemma 2. Under the assumptions of Theorem 3, $\mathcal{J}(\theta_0) = -\mathcal{I}(\theta_0)$.

Proof. With straightforward differentiation, the required second partial derivatives can be expressed as
$$\begin{aligned}
\frac{\partial^2 l_t(\theta)}{\partial\alpha_m\partial\alpha_n} &= -l^{(1)}_{\alpha_m,t}(y_t,\mathbf{y}_{t-1})\, l^{(1)}_{\alpha_n,t}(y_t,\mathbf{y}_{t-1}) + l^{(2)}_{\alpha_m,t}(\mathbf{y}_{t-1})\, l^{(2)}_{\alpha_n,t}(\mathbf{y}_{t-1}), \\
\frac{\partial^2 l_t(\theta)}{\partial\vartheta_m\partial\vartheta_n'} &= -l^{(1)}_{\vartheta_m,t}(y_t,\mathbf{y}_{t-1})\, l^{(1)}_{\vartheta_n,t}(y_t,\mathbf{y}_{t-1})' + l^{(2)}_{\vartheta_m,t}(\mathbf{y}_{t-1})\, l^{(2)}_{\vartheta_n,t}(\mathbf{y}_{t-1})' \\
&\quad+ \frac{\alpha_m}{\sum_{k=1}^M \alpha_k\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_k)}\,\frac{\partial^2}{\partial\vartheta_m\partial\vartheta_n'}n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m) \\
&\quad- \frac{\alpha_m}{\sum_{k=1}^M \alpha_k\, n_{dp}(\mathbf{y}_{t-1};\vartheta_k)}\,\frac{\partial^2}{\partial\vartheta_m\partial\vartheta_n'}n_{dp}(\mathbf{y}_{t-1};\vartheta_m), \\
\frac{\partial^2 l_t(\theta)}{\partial\vartheta_m\partial\alpha_n} &= -l^{(1)}_{\alpha_n,t}(y_t,\mathbf{y}_{t-1})\, l^{(1)}_{\vartheta_m,t}(y_t,\mathbf{y}_{t-1}) + l^{(2)}_{\alpha_n,t}(\mathbf{y}_{t-1})\, l^{(2)}_{\vartheta_m,t}(\mathbf{y}_{t-1}), \\
\frac{\partial^2 l_t(\theta)}{\partial\vartheta_m\partial\alpha_m} &= -l^{(1)}_{\alpha_m,t}(y_t,\mathbf{y}_{t-1})\, l^{(1)}_{\vartheta_m,t}(y_t,\mathbf{y}_{t-1}) + l^{(2)}_{\alpha_m,t}(\mathbf{y}_{t-1})\, l^{(2)}_{\vartheta_m,t}(\mathbf{y}_{t-1}) \\
&\quad+ \alpha_m^{-1} l^{(1)}_{\vartheta_m,t}(y_t,\mathbf{y}_{t-1}) - \alpha_m^{-1} l^{(2)}_{\vartheta_m,t}(\mathbf{y}_{t-1}),
\end{aligned}$$
where in the first expression $m,n = 1,\ldots,M-1$; in the second $m,n = 1,\ldots,M$; in the third $m = 1,\ldots,M$, $n = 1,\ldots,M-1$, and $m\neq n$; and in the fourth $m = 1,\ldots,M-1$. Using these expressions together with those for the first partial derivatives of $l_t(\theta)$ given in (19) and (20), the result $\mathcal{J}(\theta_0) = E[\partial^2 l_t(\theta_0)/\partial\theta\partial\theta'] = -E[(\partial l_t(\theta_0)/\partial\theta)(\partial l_t(\theta_0)/\partial\theta')] = -\mathcal{I}(\theta_0)$ can be established using elementary but tedious calculations. For brevity, we omit the details, which are available in the Supplementary Appendix.
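Lemma 2 also lends itself to a quick Monte Carlo check. The sketch below, in the same hypothetical univariate special case as above ($d = p = 1$, $M = 2$), compares the sample average of finite-difference Hessians of $l_t(\theta_0)$ with minus the average of the score outer products; under the lemma the two should agree up to simulation and differencing error. None of the helper functions come from the paper.

```python
# Sketch: Monte Carlo check of J(theta_0) = -I(theta_0) via finite differences.
import numpy as np
from scipy.special import logsumexp

th0 = np.array([0.6, 0.5, 0.5, 1.0, -0.5, 0.2, 0.5])
# components: (alpha_1, phi_{1,0}, phi_{1,1}, sigma_1^2, phi_{2,0}, phi_{2,1}, sigma_2^2)

def l_t(th, y, y_lag):
    """Conditional log-density log f(y | y_lag; theta) for d = p = 1, M = 2."""
    a = np.array([th[0], 1.0 - th[0]])
    phi0, phi1, s2 = th[[1, 4]], th[[2, 5]], th[[3, 6]]
    mu, v = phi0 / (1.0 - phi1), s2 / (1.0 - phi1 ** 2)
    lw = np.log(a) - 0.5 * (np.log(2 * np.pi * v) + (y_lag - mu) ** 2 / v)
    lw -= logsumexp(lw)                                   # log mixing weights (7)
    lc = -0.5 * (np.log(2 * np.pi * s2) + (y - phi0 - phi1 * y_lag) ** 2 / s2)
    return logsumexp(lw + lc)

def grad(f, x, h=1e-5):
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(x.size)])

def hess(f, x, h=1e-4):
    n, E, H = x.size, np.eye(x.size), np.zeros((x.size, x.size))
    for i in range(n):
        for j in range(i, n):
            H[i, j] = H[j, i] = (f(x + h * (E[i] + E[j])) - f(x + h * (E[i] - E[j]))
                                 - f(x - h * (E[i] - E[j])) + f(x - h * (E[i] + E[j]))) / (4 * h * h)
    return H

# simulate a path at theta_0
rng = np.random.default_rng(3)
a = np.array([th0[0], 1.0 - th0[0]])
phi0, phi1, s2 = th0[[1, 4]], th0[[2, 5]], th0[[3, 6]]
mu, v = phi0 / (1.0 - phi1), s2 / (1.0 - phi1 ** 2)
T, y = 2000, [0.0]
for _ in range(T):
    w = a * np.exp(-0.5 * (y[-1] - mu) ** 2 / v) / np.sqrt(v)   # weights (7), unnormalized
    m = rng.choice(2, p=w / w.sum())
    y.append(phi0[m] + phi1[m] * y[-1] + np.sqrt(s2[m]) * rng.standard_normal())

I_hat, J_hat = np.zeros((7, 7)), np.zeros((7, 7))
for t in range(1, T + 1):
    f = lambda th, yt=y[t], yl=y[t - 1]: l_t(th, yt, yl)
    g = grad(f, th0)
    I_hat += np.outer(g, g) / T
    J_hat += hess(f, th0) / T
print(np.abs(J_hat + I_hat).max())    # small relative to the entries of I_hat
```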

Lemma 3. Under the assumptions of Theorem 3, $\sup_{\theta\in\Theta_0}|L_{\theta\theta,T}(\theta) - \mathcal{J}(\theta)| \to 0$ a.s., where $\mathcal{J}(\theta) = E[l_{\theta\theta,t}(\theta)]$ is continuous at $\theta_0$.

Proof. As the process $y_t$ is assumed to be stationary and ergodic, from the expressions

of the components of $l_{\theta\theta,t}(\theta)$ given at the beginning of the proof of Lemma 2 (see also the

equations after (18)) it follows that $l_{\theta\theta,t}(\theta)$ forms a stationary ergodic sequence of random

variables that are continuous in $\theta$ over $\Theta_0$. The desired result thus follows from Ranga

Rao (1962) if we establish that $E\left[\sup_{\theta\in\Theta_0}|l_{\theta\theta,t}(\theta)|\right]$ is finite. To this end, first note that
$$\alpha_{m,t} = \frac{\alpha_m\, n_{dp}(\mathbf{y}_{t-1};\vartheta_m)}{\sum_{n=1}^M \alpha_n\, n_{dp}(\mathbf{y}_{t-1};\vartheta_n)} < 1 \quad\text{and}\quad \frac{\alpha_m\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m)}{\sum_{n=1}^M \alpha_n\, n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_n)} < 1$$
for $m=1,\ldots,M$, and observe that the set $\Theta_0$ can be chosen small enough to ensure that $\alpha_m$, $m=1,\ldots,M$, are bounded away from zero on $\Theta_0$. Using these facts and the definitions of $l^{(1)}_{\alpha_m,t}(y_t,\mathbf{y}_{t-1})$, $l^{(2)}_{\alpha_m,t}(\mathbf{y}_{t-1})$, $l^{(1)}_{\vartheta_m,t}(y_t,\mathbf{y}_{t-1})$, and $l^{(2)}_{\vartheta_m,t}(\mathbf{y}_{t-1})$ (see the equations after (18)) it can then be seen that

$$\begin{aligned}
&\left|l^{(1)}_{\alpha_m,t}(y_t,\mathbf{y}_{t-1})\right| \le C \ \text{ and } \ \left|l^{(2)}_{\alpha_m,t}(\mathbf{y}_{t-1})\right| \le C \quad (m=1,\ldots,M-1), \\
&\left|l^{(1)}_{\vartheta_m,t}(y_t,\mathbf{y}_{t-1})\right| \le \left\|\frac{\partial}{\partial\vartheta_m}\log n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m)\right\| \quad (m=1,\ldots,M), \\
&\left|l^{(2)}_{\vartheta_m,t}(\mathbf{y}_{t-1})\right| \le \left\|\frac{\partial}{\partial\vartheta_m}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m)\right\| \quad (m=1,\ldots,M),
\end{aligned}$$
for some $C < \infty$ and all $\theta\in\Theta_0$. These upper bounds, the expressions of the second partial derivatives of $l_t(\theta)$ at the beginning of the proof of Lemma 2, the relation
$$\frac{\partial^2}{\partial\vartheta_m\partial\vartheta_m'}n_{dp}(\mathbf{y}_{t-1};\vartheta_m) = n_{dp}(\mathbf{y}_{t-1};\vartheta_m)\,\frac{\partial}{\partial\vartheta_m}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m)\,\frac{\partial}{\partial\vartheta_m'}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m) + n_{dp}(\mathbf{y}_{t-1};\vartheta_m)\,\frac{\partial^2}{\partial\vartheta_m\partial\vartheta_m'}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m),$$
and an analogous relation for the density $n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m)$, can now be used to show that $E\left[\sup_{\theta\in\Theta_0}|l_{\theta\theta,t}(\theta)|\right] < \infty$ holds as long as
$$E\left[\sup_{\theta\in\Theta_0}\left\|\frac{\partial}{\partial\vartheta_m}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m)\right\|^2\right] < \infty \quad\text{and}\quad E\left[\sup_{\theta\in\Theta_0}\left\|\frac{\partial^2}{\partial\vartheta_m\partial\vartheta_m'}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m)\right\|\right] < \infty,$$

together with analogous results for the density $n_{d(p+1)}((y_t,\mathbf{y}_{t-1});\vartheta_m)$, hold for $m=1,\ldots,M$.

To establish the finiteness of these moments, first consider the partial derivatives of

$\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m)$, and for clarity let $\mu(\vartheta_m)$ ($= 1_p\otimes\mu_m$) and $\Sigma(\vartheta_m)$ ($= \Sigma_{m,p}$) denote the mean vector and covariance matrix of the $n_{dp}(\mathbf{y}_{t-1};\vartheta_m)$ distribution as functions of the parameter vector $\vartheta_m$. From the equality $\mu_m = A_m^{-1}(1)\,\phi_{m,0}$ and the expression of $\Sigma(\vartheta_m)$ in Lütkepohl (2005, eqn. (2.1.39)) as a function of $A_{m,i}$, $i=1,\ldots,p$, and $\Omega_m$,

and hence of $\vartheta_m$, it follows that $\mu(\vartheta_m)$ and $\Sigma(\vartheta_m)$ are twice continuously differentiable

functions of $\vartheta_m$. Now, for each component $\vartheta_{m,i}$ of $\vartheta_m$ ($i=1,\ldots,\dim\vartheta_m$), straightforward differentiation gives

$$\begin{aligned}
\frac{\partial}{\partial\vartheta_{m,i}}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m) &= -\frac{1}{2}\operatorname{tr}\left\{\Sigma(\vartheta_m)^{-1}\frac{\partial\Sigma(\vartheta_m)}{\partial\vartheta_{m,i}}\right\} + \left(\mathbf{y}_{t-1}-\mu(\vartheta_m)\right)'\Sigma(\vartheta_m)^{-1}\frac{\partial\mu(\vartheta_m)}{\partial\vartheta_{m,i}} \\
&\quad+ \frac{1}{2}\left[\operatorname{vec}\left((\mathbf{y}_{t-1}-\mu(\vartheta_m))(\mathbf{y}_{t-1}-\mu(\vartheta_m))'\right)\right]'\operatorname{vec}\left(\Sigma(\vartheta_m)^{-1}\frac{\partial\Sigma(\vartheta_m)}{\partial\vartheta_{m,i}}\Sigma(\vartheta_m)^{-1}\right).
\end{aligned}$$
As the set $\Theta_0$ can be assumed small enough to ensure that the functions of $\mu(\vartheta_m)$ and

$\Sigma(\vartheta_m)$ as well as their partial derivatives appearing in the last expression above are bounded on $\Theta_0$, $E\left[\sup_{\theta\in\Theta_0}\left\|\frac{\partial}{\partial\vartheta_m}\log n_{dp}(\mathbf{y}_{t-1};\vartheta_m)\right\|^2\right] < \infty$ follows because $y_t$ has finite fourth moments due to Theorem 1. The finiteness of the moments in the three other conditions follows in a similar fashion (details omitted) because $\mu(\vartheta_m)$ and $\Sigma(\vartheta_m)$ are twice continuously differentiable and $y_t$ has finite moments of all orders due to Theorem 1.
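To make the quantities $\mu(\vartheta_m) = 1_p\otimes\mu_m$ and $\Sigma(\vartheta_m) = \Sigma_{m,p}$ used in this proof concrete, the following sketch computes them for a general regime VAR($p$) via the companion form, in the spirit of Lütkepohl (2005, eqn. (2.1.39)); the parameter values are hypothetical and the function name is ours.

```python
# Sketch: stationary mean and covariance of p consecutive observations from one
# regime, via the companion (stacked VAR(1)) form.
import numpy as np

def regime_moments(phi0, A_list, Omega):
    """Return mu(theta_m) = 1_p (x) mu_m (dp-vector) and Sigma_{m,p} (dp x dp)
    for the regime VAR y_t = phi0 + sum_i A_i y_{t-i} + eps_t, eps_t ~ N(0, Omega)."""
    d, p = phi0.size, len(A_list)
    A = np.zeros((d * p, d * p))                   # companion matrix
    A[:d, :] = np.hstack(A_list)
    A[d:, :-d] = np.eye(d * (p - 1))
    b = np.zeros(d * p); b[:d] = phi0
    Om = np.zeros((d * p, d * p)); Om[:d, :d] = Omega
    mu = np.linalg.solve(np.eye(d * p) - A, b)     # stacked mu_m = A_m(1)^{-1} phi_{m,0}
    # vec(Sigma) = (I - A (x) A)^{-1} vec(Om), the discrete Lyapunov solution
    vecS = np.linalg.solve(np.eye((d * p) ** 2) - np.kron(A, A), Om.ravel())
    return mu, vecS.reshape(d * p, d * p)

mu, Sigma = regime_moments(np.array([0.5, 0.0]),
                           [np.array([[0.4, 0.1], [0.0, 0.3]]),
                            np.array([[0.1, 0.0], [0.05, 0.1]])],
                           np.eye(2))
print(mu[:2], np.allclose(Sigma, Sigma.T))         # mu_m and a symmetry check
```

These are exactly the mean and covariance of the density $n_{dp}(\cdot\,;\vartheta_m)$ that drives the mixing weights (7), so the same routine is what a likelihood evaluation would call regime by regime.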

References

Anderson, T. W. (2003): An Introduction to Multivariate Statistical Analysis, 3rd edn. Wiley, Hoboken, NJ.

Ang, A., and G. Bekaert (2002): “Regime switches in interest rates,” Journal of Business & Economic Statistics, 20, 163–182.

Bec, F., A. Rahbek, and N. Shephard (2008): “The ACR model: a multivariate dynamic mixture autoregression,” Oxford Bulletin of Economics and Statistics, 70, 583–618.

Billingsley, P. (1961): “The Lindeberg-Lévy theorem for martingales,” Proceedings of the American Mathematical Society, 12, 788–792.

Carrasco, M., L. Hu, and W. Ploberger (2014): “Optimal tests for Markov switching parameters,” Econometrica, 82, 765–784.

Cho, J. S., and H. White (2007): “Testing for regime switching,” Econometrica, 75, 1671–1720.

Christoffersen, P. F. (1998): “Evaluating interval forecasts,” International Economic Review, 39, 841–862.

Dueker, M. J., Z. Psaradakis, M. Sola, and F. Spagnolo (2011): “Multivariate contemporaneous-threshold autoregressive models,” Journal of Econometrics, 160, 311–325.

Dueker, M. J., M. Sola, and F. Spagnolo (2007): “Contemporaneous threshold autoregressive models: estimation, testing and forecasting,” Journal of Econometrics, 141, 517–547.

Fong, P. W., W. K. Li, C. W. Yau, and C. S. Wong (2007): “On a mixture vector autoregressive model,” Canadian Journal of Statistics, 35, 135–150.

Glasbey, C. A. (2001): “Non-linear autoregressive time series with multivariate Gaussian mixtures as marginal distributions,” Journal of the Royal Statistical Society: Series C, 50, 143–154.

Kalliovirta, L. (2012): “Misspecification tests based on quantile residuals,” Econometrics Journal, 15, 358–393.

Kalliovirta, L., M. Meitz, and P. Saikkonen (2014): “Modeling the Euro–USD exchange rate with the Gaussian mixture autoregressive model,” in J. Knif and B. Pape (eds.), Contributions to Mathematics, Statistics, Econometrics, and Finance: A Festschrift in Honour of Professor Seppo Pynnönen, Acta Wasaensia 296, University of Vaasa.

Kalliovirta, L., M. Meitz, and P. Saikkonen (2015): “A Gaussian mixture autoregressive model for univariate time series,” Journal of Time Series Analysis, 36, 247–266.

Kalliovirta, L., and P. Saikkonen (2010): “Reliable residuals for multivariate nonlinear time series models,” unpublished revision of HECER Discussion Paper No. 247.

Krolzig, H.-M. (1997): Markov-Switching Vector Autoregressions: Modelling, Statistical Inference, and Application to Business Cycle Analysis. Springer, Berlin.

Lanne, M. (2006): “Nonlinear dynamics of interest rate and inflation,” Journal of Applied Econometrics, 21, 1154–1168.

Le, N. D., R. D. Martin, and A. E. Raftery (1996): “Modeling flat stretches, bursts, and outliers in time series using mixture transition distribution models,” Journal of the American Statistical Association, 91, 1504–1515.

Lütkepohl, H. (2005): New Introduction to Multiple Time Series Analysis. Springer, Berlin.

Meyn, S., and R. L. Tweedie (2009): Markov Chains and Stochastic Stability, 2nd edn. Cambridge University Press, Cambridge.

Pötscher, B. M., and I. R. Prucha (1991): “Basic structure of the asymptotic theory in dynamic nonlinear econometric models, Part I: Consistency and approximation concepts,” Econometric Reviews, 10, 125–216.

Ranga Rao, R. (1962): “Relations between weak and uniform convergence of measures with applications,” Annals of Mathematical Statistics, 33, 659–680.

Reinsel, G. C. (1997): Elements of Multivariate Time Series Analysis, 2nd edn. Springer, New York.

Sims, C. A., D. F. Waggoner, and T. Zha (2008): “Methods for inference in large multiple equation Markov-switching models,” Journal of Econometrics, 146, 255–274.

Teräsvirta, T., D. Tjøstheim, and C. W. J. Granger (2010): Modelling Nonlinear Economic Time Series. Oxford University Press, Oxford.

Tierney, L. (1994): “Markov chains for exploring posterior distributions,” Annals of Statistics, 22, 1701–1762.

Tong, H. (2011): “Threshold models in time series analysis – 30 years on,” Statistics and Its Interface, 4, 107–118.

Villani, M., R. Kohn, and P. Giordani (2009): “Regression density estimation using smooth adaptive Gaussian mixtures,” Journal of Econometrics, 153, 155–173.

Wong, C. S., and W. K. Li (2000): “On a mixture autoregressive model,” Journal of the Royal Statistical Society: Series B, 62, 95–115.

Wong, C. S., and W. K. Li (2001a): “On a mixture autoregressive conditional heteroscedastic model,” Journal of the American Statistical Association, 96, 982–995.

Wong, C. S., and W. K. Li (2001b): “On a logistic mixture autoregressive model,” Biometrika, 88, 833–846.

Yakowitz, S. J., and J. D. Spragins (1968): “On the identifiability of finite mixtures,” Annals of Mathematical Statistics, 39, 209–214.
