Gaussian Process-Mixture Conditional Heteroscedasticity

Emmanouil A. Platanios and Sotirios P. Chatzis

Abstract—Generalized autoregressive conditional heteroscedasticity (GARCH) models have long been considered one of the most successful families of approaches for volatility modeling in financial return series. In this paper, we propose an alternative approach based on methodologies widely used in the field of statistical machine learning. Specifically, we propose a novel nonparametric Bayesian mixture of Gaussian process regression models, each component of which models the noise process that contaminates the observed data as a separate latent Gaussian process driven by the observed data. This way, we essentially obtain a Gaussian process-mixture conditional heteroscedasticity (GPMCH) model for volatility modeling in financial return series. We impose a nonparametric prior with power-law nature over the distribution of the model mixture components, namely the Pitman-Yor process prior, to allow for better capturing modeled data distributions with heavy tails and skewness. Finally, we provide a copula-based approach for obtaining a predictive posterior for the covariances over the asset returns modeled by means of a postulated GPMCH model. We evaluate the efficacy of our approach in a number of benchmark scenarios, and compare its performance to state-of-the-art methodologies.

Index Terms—Gaussian process, Pitman-Yor process, conditional heteroscedasticity, copula, volatility modeling

• Emmanouil A. Platanios is with the Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA 15213 USA. E-mail: [email protected].
• Sotirios P. Chatzis is with the Department of Electrical Engineering, Computer Engineering, and Informatics, Cyprus University of Technology, Limassol 3036, Cyprus. E-mail: [email protected].

Manuscript received 23 Jan. 2013; revised 23 Aug. 2013; accepted 10 Sep. 2013. Date of publication 26 Sep. 2013. Date of current version 29 Apr. 2014. Recommended for acceptance by A. J. Storkey. Digital Object Identifier 10.1109/TPAMI.2013.183

1 INTRODUCTION

Statistical modeling of asset values in financial markets requires taking into account the tendency of assets towards asymmetric temporal dependence [1]. Besides, the data generation processes of the returns of financial market indexes may be non-linear, non-stationary and/or heavy-tailed, while the marginal distributions may be asymmetric, leptokurtic and/or show conditional heteroscedasticity. Hence, there is a need to construct flexible models capable of incorporating these features. The generalized autoregressive conditional heteroscedasticity (GARCH) family of models has been used to address conditional heteroscedasticity and excess kurtosis (see [2], [3]).

The time-dependent variance in series of returns on prices, also known as volatility, is of particular interest in finance, as it impacts the pricing of financial instruments, and it is a key concept in market regulation. GARCH approaches are commonly employed in modeling financial return series that exhibit time-varying volatility clustering, i.e. periods of swings followed by periods of relative calm, and have been shown to yield excellent performance in these applications, consistently defining the state-of-the-art in the field in the last decade. GARCH models represent the variance as a function of the past squared returns and the past variances, which facilitates model estimation and computation of the prediction errors. They have been extremely successful both in volatility prediction based on daily returns, as well as in predictions using intraday information (realized volatility), where they offer state-of-the-art performance.

Gaussian process (GP) models comprise one of the most popular Bayesian methods in the field of machine learning for regression, function approximation, and predictive density estimation [4]. Despite their significant flexibility and success in many application domains, GPs do also suffer from several limitations. In particular, GP models are faced with difficulties when dealing with tasks entailing non-stationary covariance functions, multi-modal output, or discontinuities. Several approaches that entail using ensembles of fractional GP models defined on subsets of the input space have been proposed as a means of resolving these issues (see [5]–[7]).

In this work, we propose a novel GP-based approach for volatility modeling in financial (return) data. Our proposed approach provides a viable alternative to GARCH models that allows for effectively capturing the clustering effects in the variability or volatility. It is based on the introduction of a novel nonparametric Bayesian mixture model, the component distributions of which constitute GP regression models; the noise variance processes of the model component GPs are considered as input-dependent latent variable processes which are also modeled by imposition of appropriate GP priors. This way, our novel approach allows for learning both the observation-dependent nature of asset volatility, as well as the underlying volatility clustering mechanism, modeled as a latent model component switching procedure. In our work, we focus on volatility prediction based on daily returns.

Even though realized volatility measures have been proven to be more accurate, we opt here for working with daily return series due to the easier access to training data for our algorithms. However, our method is directly applicable to realized volatility as well, without modifications. We dub our approach the Gaussian process-mixture conditional heteroscedasticity (GPMCH) model.

Nonparametric Bayesian modeling techniques, especially Dirichlet process mixture (DPM) models, have become very popular over the last few years for performing nonparametric density estimation [8]–[10]. Briefly, a realization of a DPM can be seen as an infinite mixture of distributions with given parametric shape (e.g., Gaussian). An interesting alternative to the Dirichlet process prior for nonparametric Bayesian modeling is the Pitman-Yor process prior [11]. Pitman-Yor processes produce a large number of sparsely populated clusters and a small number of highly populated clusters [12]. Indeed, the Pitman-Yor process prior can be viewed as a generalization of the Dirichlet process prior, and reduces to it for a specific selection of its parameter values. Consequently, the Pitman-Yor process turns out to be more promising as a means of modeling complex real-life datasets that usually comprise a high number of clusters containing only few data points, and a low number of clusters which are highly frequent, thus dominating the entire population.

Inspired by these advances, the component switching mechanism of our model is obtained by means of a Pitman-Yor process prior imposed over the component GP latent allocation variables of our model. We derive a computationally efficient inference algorithm for our model based on the variational Bayesian framework, and obtain the predictive density of our model using an approximation technique. We examine the efficacy of our approach considering volatility prediction in a number of financial return series.

Currently, there is an extensive corpus of existing work on conditionally heteroscedastic (Gaussian) mixture models put forward in the financial econometrics literature, which bear some relationship with our proposed approach. In particular, several authors have proposed mixture processes where in each component the conditional variance is driven by GARCH-type dynamics (e.g., [13]–[16]). Such approaches have been shown to yield excellent out-of-sample volatility and density forecasts in a multitude of scenarios. We shall provide comparisons of our approach against a popular such method in the experimental section of our paper.

The remainder of this paper is organized as follows: In Section 2, we provide a brief presentation of the theoretical background of the proposed method. Initially, we present the Pitman-Yor process and its function as a prior in nonparametric Bayesian models; further, we provide a brief summary of Gaussian process regression. In Section 3, we introduce the proposed Gaussian process-mixture conditional heteroscedasticity (GPMCH) model, and derive efficient model inference algorithms based on the variational Bayesian framework. We also propose a copula-based method for learning the interdependencies between the returns of multiple assets jointly modeled by means of a GPMCH model. In Section 4, we conduct the experimental evaluation of our proposed model, considering a number of applications dealing with volatility modeling in financial return series. In the final section, we summarize and discuss our results.

2 PRELIMINARIES

2.1 The Pitman-Yor Process

Dirichlet process (DP) models were first introduced by Ferguson [17]. A DP is characterized by a base distribution $G_0$ and a positive scalar $\alpha$, usually referred to as the innovation parameter, and is denoted as $\mathrm{DP}(\alpha, G_0)$. Essentially, a DP is a distribution placed over a distribution. Let us suppose we randomly draw a sample distribution $G$ from a DP, and, subsequently, we independently draw $M$ random variables $\{\theta_m^*\}_{m=1}^{M}$ from $G$:

$$G \,|\, \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0), \qquad (1)$$
$$\theta_m^* \,|\, G \sim G, \quad m = 1, \dots, M. \qquad (2)$$

Integrating out $G$, the joint distribution of the variables $\{\theta_m^*\}_{m=1}^{M}$ can be shown to exhibit a clustering effect. Specifically, given the first $M-1$ samples of $G$, $\{\theta_m^*\}_{m=1}^{M-1}$, it can be shown that a new sample $\theta_M^*$ is either (a) drawn from the base distribution $G_0$ with probability $\frac{\alpha}{\alpha + M - 1}$, or (b) selected from the existing draws, according to a multinomial allocation, with probabilities proportional to the number of the previous draws with the same allocation [18]. Let $\{\theta_c\}_{c=1}^{C}$ be the set of distinct values taken by the variables $\{\theta_m^*\}_{m=1}^{M-1}$. Denoting as $\nu_c^{M-1}$ the number of values in $\{\theta_m^*\}_{m=1}^{M-1}$ that equal $\theta_c$, the distribution of $\theta_M^*$ given $\{\theta_m^*\}_{m=1}^{M-1}$ can be shown to be of the form [18]

$$p\big(\theta_M^* \,|\, \{\theta_m^*\}_{m=1}^{M-1}, \alpha, G_0\big) = \frac{\alpha}{\alpha + M - 1}\, G_0 + \sum_{c=1}^{C} \frac{\nu_c^{M-1}}{\alpha + M - 1}\, \delta_{\theta_c}, \qquad (3)$$

where $\delta_{\theta_c}$ denotes the distribution concentrated at a single point $\theta_c$. These results illustrate two key properties of the DP scheme. First, the innovation parameter $\alpha$ plays a key role in determining the number of distinct parameter values. A larger $\alpha$ induces a higher tendency of drawing new parameters from the base distribution $G_0$; indeed, as $\alpha \to \infty$ we get $G \to G_0$. On the contrary, as $\alpha \to 0$ all $\{\theta_m^*\}_{m=1}^{M}$ tend to cluster to a single value. Second, the more often a parameter is shared, the more likely it will be shared in the future.

The Pitman-Yor process (PYP) [11] functions similar to the Dirichlet process. Let us suppose we randomly draw a sample distribution $G$ from a PYP, and, subsequently, we independently draw $M$ random variables $\{\theta_m^*\}_{m=1}^{M}$ from $G$:

$$G \,|\, \delta, \alpha, G_0 \sim \mathrm{PY}(\delta, \alpha, G_0), \qquad (4)$$

with

$$\theta_m^* \,|\, G \sim G, \quad m = 1, \dots, M, \qquad (5)$$

where $\delta \in [0, 1)$ is the discount parameter of the Pitman-Yor process, $\alpha > -\delta$ is its innovation parameter, and $G_0$ the base distribution. Integrating out $G$, similar to Eq. (3), we now yield

$$p\big(\theta_M^* \,|\, \{\theta_m^*\}_{m=1}^{M-1}, \delta, \alpha, G_0\big) = \frac{\alpha + \delta C}{\alpha + M - 1}\, G_0 + \sum_{c=1}^{C} \frac{\nu_c^{M-1} - \delta}{\alpha + M - 1}\, \delta_{\theta_c}. \qquad (6)$$

As we observe, the PYP yields an expression for $p(\theta_M^* \,|\, \{\theta_m^*\}_{m=1}^{M-1}, G_0)$ quite similar to that of the DP, also possessing the rich-gets-richer clustering property, i.e., the more samples have been assigned to a draw from $G_0$, the more likely subsequent samples will be assigned to the same draw. Further, the more we draw from $G_0$, the more likely a new sample will again be assigned to a new draw from $G_0$. These two effects together produce a power-law distribution where many unique $\theta_m^*$ values are observed, most of them rarely [11], thus allowing for better modeling of observations with heavy-tailed distributions. In particular, for $\delta > 0$, the number of unique values scales as $O(\alpha M^{\delta})$, where $M$ is the total number of draws. Note also that, for $\delta = 0$, the Pitman-Yor process reduces to the Dirichlet process, in which case the number of unique values grows more slowly at $O(\alpha \log M)$ [12].

A characterization of the (unconditional) distribution of the random variable $G$ drawn from a PYP, $\mathrm{PY}(\delta, \alpha, G_0)$, is provided by the stick-breaking construction of Sethuraman [19]. Consider two infinite collections of independent random variables $v = (v_c)_{c=1}^{\infty}$ and $\{\theta_c\}_{c=1}^{\infty}$, where the $v_c$ are drawn from a Beta distribution, and the $\theta_c$ are independently drawn from the base distribution $G_0$. The stick-breaking representation of $G$ is then given by [12]

$$G = \sum_{c=1}^{\infty} \varpi_c(v)\, \delta_{\theta_c}, \qquad (7)$$

where

$$p(v_c) = \mathrm{Beta}(1 - \delta, \alpha + c\delta), \qquad (8)$$
$$\varpi_c(v) = v_c \prod_{j=1}^{c-1} (1 - v_j) \in [0, 1], \qquad (9)$$

and

$$\sum_{c=1}^{\infty} \varpi_c(v) = 1. \qquad (10)$$

Under the stick-breaking representation of the Pitman-Yor process, the atoms $\theta_c$, drawn independently from the base distribution $G_0$, can be seen as the parameters of the component distributions of a mixture model comprising an unbounded number of component densities, with mixing proportions $\varpi_c(v)$.
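As an illustration of the clustering behavior described by Eqs. (3) and (6), the following minimal sketch sequentially draws cluster allocations from the PYP predictive rule and reports how many distinct clusters emerge; the function name, the chosen values of $\alpha$ and $\delta$, and the use of Python/NumPy are our own illustrative choices and not part of the paper.

```python
import numpy as np

def pyp_allocations(M, alpha, delta, rng):
    """Sequentially allocate M draws to clusters using the PYP predictive
    rule of Eq. (6): a new cluster is opened with probability
    (alpha + delta*C)/(alpha + m), otherwise an existing cluster c is chosen
    with probability (nu_c - delta)/(alpha + m), m being the draws so far."""
    counts = []                      # nu_c: number of draws per cluster
    labels = np.empty(M, dtype=int)
    for m in range(M):
        C = len(counts)
        probs = np.array([nu - delta for nu in counts] + [alpha + delta * C])
        probs /= alpha + m
        c = rng.choice(C + 1, p=probs)
        if c == C:
            counts.append(1)         # open a new cluster
        else:
            counts[c] += 1
        labels[m] = c
    return labels, np.array(counts)

rng = np.random.default_rng(0)
for delta in (0.0, 0.5):             # delta = 0 recovers the Dirichlet process
    _, counts = pyp_allocations(M=5000, alpha=1.0, delta=delta, rng=rng)
    print(f"delta={delta}: {len(counts)} clusters, "
          f"largest holds {counts.max()} of 5000 draws")
```

For $\delta > 0$ the printed number of clusters grows roughly like $O(\alpha M^{\delta})$, whereas $\delta = 0$ exhibits the slower logarithmic growth of the Dirichlet process, making the power-law effect discussed above directly visible.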
2.2 Gaussian Process Models

Let us consider an observation space $\mathcal{X}$. A Gaussian process $f(x)$, $x \in \mathcal{X}$, is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution [20]. A Gaussian process is completely specified by its mean function and covariance function. We define the mean function $m(x)$ and the covariance function $k(x, x')$ of a real process $f(x)$ as

$$m(x) = \mathbb{E}[f(x)], \quad k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big], \qquad (11)$$

and we will write the Gaussian process as

$$f(x) \sim \mathcal{N}\big(m(x), k(x, x')\big). \qquad (12)$$

Usually, for notational simplicity, and without any loss of generality, the mean of the process is taken to be zero, $m(x) = 0$, although this is not necessary. Concerning selection of the covariance function, a large variety of kernel functions $k(x, x')$ might be employed, depending on the application considered [20]. This way, a postulated Gaussian process eventually takes the form

$$f(x) \sim \mathcal{N}\big(0, k(x, x')\big). \qquad (13)$$

Let us suppose a set of independent and identically distributed (i.i.d.) samples $\mathcal{D} = \{(x_i, y_i) \,|\, i = 1, \dots, N\}$, with the $d$-dimensional variables $x_i$ being the observations related to a modeled phenomenon, and the scalars $y_i$ being the associated target values. The goal of a regression model is, given a new observation $x_*$, to predict the corresponding target value $y_*$, based on the information contained in the training set $\mathcal{D}$. The basic notion behind Gaussian process regression consists in the assumption that the observable (training) target values $y$ in a considered regression problem can be expressed as the superposition of a Gaussian process over the input space $\mathcal{X}$, $f(x)$, and an independent white Gaussian noise

$$y = f(x) + \epsilon, \qquad (14)$$

where $f(x)$ is given by (12), and

$$\epsilon \sim \mathcal{N}(0, \sigma^2). \qquad (15)$$

Under this regard, the joint normality of the training target values $y = [y_i]_{i=1}^{N}$ and some unknown target value $y_*$, approximated by the value $f_*$ of the postulated Gaussian process evaluated at the observation point $x_*$, yields [20]

$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( 0, \begin{bmatrix} K(X, X) + \sigma^2 I_N & k(x_*) \\ k(x_*)^T & k(x_*, x_*) \end{bmatrix} \right), \qquad (16)$$

where

$$k(x_*) \triangleq [k(x_1, x_*), \dots, k(x_N, x_*)]^T, \qquad (17)$$

$X \triangleq \{x_i\}_{i=1}^{N}$, $I_N$ is the $N \times N$ identity matrix, and $K(X, X)$ is the matrix of the covariances between the $N$ training data points (design matrix), i.e.

$$K(X, X) \triangleq \big[k(x_i, x_j)\big]_{i,j=1}^{N}. \qquad (18)$$

Then, from (16), and conditioning on the available training samples, we can derive the expression of the model predictive distribution, yielding

$$p(f_* \,|\, x_*, \mathcal{D}) = \mathcal{N}\big(f_* \,|\, \mu_*, \sigma_*^2\big), \qquad (19)$$
$$\mu_* = k(x_*)^T \big(K(X, X) + \sigma^2 I_N\big)^{-1} y, \qquad (20)$$
$$\sigma_*^2 = \sigma^2 - k(x_*)^T \big(K(X, X) + \sigma^2 I_N\big)^{-1} k(x_*) + k(x_*, x_*). \qquad (21)$$
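The predictive equations (19)–(21) translate directly into a few lines of linear algebra. The sketch below assumes a squared-exponential kernel purely for illustration (the text leaves the kernel choice open) and uses a Cholesky factorization for numerical stability; all parameter values in the toy usage are arbitrary.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel; the specific kernel choice is ours,
    since the paper leaves k(x, x') application-dependent."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, X_star, sigma2, kernel=rbf_kernel):
    """Predictive mean and variance of Eqs. (20)-(21) for each test input."""
    K = kernel(X, X) + sigma2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                      # stable handling of K + sigma^2 I
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = kernel(X, X_star)                         # N x N_star cross-covariances
    mu = Ks.T @ alpha                              # Eq. (20)
    v = np.linalg.solve(L, Ks)
    kss = np.diag(kernel(X_star, X_star))
    var = sigma2 + kss - (v ** 2).sum(axis=0)      # Eq. (21) as printed
    return mu, var

# toy usage
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
mu, var = gp_predict(X, y, np.array([[0.0], [2.0]]), sigma2=0.01)
print(mu, var)
```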

Regarding optimization of the hyperparameters of the employed covariance function (kernel), say $\theta$, and the noise variance $\sigma^2$ of a GP model, this is usually conducted by type-II maximum likelihood, that is by maximization of the model marginal likelihood (evidence). Using (16), it is easy to show that the evidence of the GP regression model yields

$$\log p\big(y \,|\, X; \theta, \sigma^2\big) = -\frac{N}{2}\log 2\pi - \frac{1}{2}\log\big|K(X, X) + \sigma^2 I_N\big| - \frac{1}{2}\, y^T \big(K(X, X) + \sigma^2 I_N\big)^{-1} y. \qquad (22)$$

It is interesting to note that the GP regression model considers that the noise that contaminates the modeled output variables does not depend on the observations themselves, but rather that it constitutes an additive noise term with constant variance, which bears no correlation between observations, and no dependence on the values of the observations. Nevertheless, in many real-world applications, with financial return series modeling being a characteristic example, this assumption of constant noise variance is clearly implausible.

To ameliorate this issue, a heteroscedastic GP regression approach was proposed in [21], where the noise variance is considered to be a function of the observed data, similar to previously proposed heteroscedastic regression approaches applied to econometrics and statistical finance, e.g., [22], [23]. A key drawback of the approach of [21] is that their heteroscedastic regression approach does not allow for capturing the clustering effects in the variability or volatility, which are apparent in the vast majority of financial return series data, and are effectively captured by GARCH-type models with integrated mixture-model-type clustering mechanisms (e.g., [13]–[16]). Our approach addresses these issues under a nonparametric Bayesian mixture modeling scheme, as discussed next.

3 PROPOSED APPROACH

In this section, we first define the proposed GPMCH model, considering a generic modeling problem that comprises the input variables $x \in \mathbb{R}^p$ and the output variables $y \in \mathbb{R}^D$. Further, we derive an efficient inference algorithm for our model under the variational paradigm, and we obtain the expression of its predictive density. Finally, we show how we can obtain a predictive distribution for the covariances between the modeled output variables $\{y_i\}_{i=1}^{D}$, by utilization of the statistical tool of copulas.

3.1 Model Definition

Let $f_d(x)$ be a latent function modeling the $d$th output variable $y_d$ as a function of the model input $x$. We consider that the expression of $y_d$ as a function of $x$ is not uniquely described by the latent function $f_d(x)$; rather, $f_d(x)$ is only an instance of the (possibly infinite) set of possible latent functions $f_d^c(x)$, $c = 1, \dots, \infty$. To determine the association between input samples and latent functions, we impose a Pitman-Yor process prior over this set of functions. In addition, we consider that each one of these latent functions $f_d^c(x)$ has a prior distribution of the form of a Gaussian process over the whole space of input variables $x \in \mathbb{R}^p$. At this point, we make a further key assumption: we assume that the noise variance $\sigma^2$ of the postulated GPs is not a constant, but rather that it varies with the input variables $x \in \mathbb{R}^p$. In other words, we consider the noise variance as a latent process, different for each model output variable, and exhibiting a clustering effect, as described by the dynamics of the postulated PYP mixing prior.

Let us consider a set of input/output observation pairs $\{x_n, y_n\}_{n=1}^{N}$, comprising $N$ samples. Let us also introduce the set of variables $\{z_{nc}\}_{n,c=1}^{N,\infty}$, with $z_{nc} = 1$ if the function relating $x_n$ to $y_n$ is considered to be expressed by the set $\{f_d^c(x)\}_{d=1}^{D}$ of postulated Gaussian processes, $z_{nc} = 0$ otherwise. Then, based on the previous description, we essentially postulate the following model:

$$p(y_n \,|\, x_n, z_{nc} = 1) = \prod_{d=1}^{D} \mathcal{N}\big(y_{nd} \,|\, f_d^c(x_n), \sigma_d^c(x_n)^2\big), \qquad (23)$$
$$p(z_{nc} = 1 \,|\, v) = \varpi_c(v), \qquad (24)$$
$$\varpi_c(v) = v_c \prod_{j=1}^{c-1} (1 - v_j) \in [0, 1], \qquad (25)$$

with

$$\sum_{c=1}^{\infty} \varpi_c(v) = 1, \qquad (26)$$
$$p(v_c) = \mathrm{Beta}(1 - \delta, \alpha + c\delta), \qquad (27)$$

and

$$p\big(f_d^c \,|\, X\big) = \mathcal{N}\big(f_d^c \,|\, 0, K^c(X, X)\big), \qquad (28)$$

where $y_{nd}$ is the $d$th element of $y_n$, we define $X \triangleq \{x_n\}_{n=1}^{N}$, $Y \triangleq \{y_n\}_{n=1}^{N}$, and $Z \triangleq \{z_{nc}\}_{n,c=1}^{N,\infty}$, $f_d^c$ is the vector of the $f_d^c(x_n)$ $\forall n$, i.e., $f_d^c \triangleq [f_d^c(x_n)]_{n=1}^{N}$, and $K^c(X, X)$ is the following design matrix

$$K^c(X, X) \triangleq \big[k^c(x_i, x_j)\big]_{i,j=1}^{N}. \qquad (29)$$

Regarding the latent processes $\sigma_d^c(x_n)^2$, we choose to also impose a GP prior over them. Specifically, to accommodate the fact that $\sigma_d^c(x_n)^2 \geq 0$ (by definition), we postulate

$$\sigma_d^c(x_n)^2 = \exp\big(g_d^c(x_n)\big), \qquad (30)$$

and

$$p\big(g_d^c \,|\, X\big) = \mathcal{N}\big(g_d^c \,|\, \tilde{m}_d^c \mathbf{1}, \Lambda^c(X, X)\big), \qquad (31)$$

where $g_d^c$ is the vector of the $g_d^c(x_n)$ $\forall n$, i.e., $g_d^c \triangleq [g_d^c(x_n)]_{n=1}^{N}$, and $\Lambda^c(X, X)$ is a design matrix, similar to $K^c(X, X)$, but with (possibly) different kernel functions $\lambda(\cdot, \cdot)$.

Finally, due to the effect of the innovation parameter $\alpha$ on the number of effective mixture components, we also impose a Gamma prior over it:

$$p(\alpha) = \mathcal{G}(\alpha \,|\, \eta_1, \eta_2). \qquad (32)$$

This completes the definition of our proposed GPMCH model.
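To make the generative structure of Eqs. (23)–(31) concrete, the following sketch samples one dataset from a truncated version of the model. The truncation level, the squared-exponential kernel standing in for both $k^c$ and $\lambda^c$, and all hyperparameter values are illustrative assumptions of ours, not settings prescribed by the paper.

```python
import numpy as np

def sample_gpmch(X, D=1, C_trunc=5, alpha=1.0, delta=0.25,
                 m_tilde=-2.0, kern_var=1.0, ls=0.5, seed=0):
    """Draw one dataset from a truncated version of Eqs. (23)-(31)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = kern_var * np.exp(-0.5 * d2 / ls ** 2) + 1e-8 * np.eye(N)

    # stick-breaking weights, Eqs. (25), (27)
    v = rng.beta(1.0 - delta, alpha + delta * np.arange(1, C_trunc + 1))
    v[-1] = 1.0                                    # truncate the stick
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

    # per-component latent mean and log-variance processes, Eqs. (28), (31)
    f = rng.multivariate_normal(np.zeros(N), K, size=(C_trunc, D))
    g = rng.multivariate_normal(m_tilde * np.ones(N), K, size=(C_trunc, D))

    z = rng.choice(C_trunc, size=N, p=w)           # component allocations, Eq. (24)
    y = np.stack([rng.normal(f[z[n], :, n], np.exp(0.5 * g[z[n], :, n]))
                  for n in range(N)])              # observation model, Eq. (23)
    return y, z

X = np.linspace(0, 1, 200)[:, None]
y, z = sample_gpmch(X)
print(y.shape, np.bincount(z))
```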

3.2 Inference Algorithm

Inference for nonparametric models can be conducted under a Bayesian setting, typically by means of variational Bayes (e.g., [24]) or Monte Carlo techniques (e.g., [25]). Here, we prefer a variational Bayesian approach, due to its considerably better scalability in terms of computational costs, which becomes of major importance when having to deal with large data corpora [26], [27].

Our variational Bayesian inference algorithm for the GPMCH model comprises derivation of a family of variational posterior distributions $q(.)$ which approximate the true posterior distribution over the infinite sets $Z$, $v = (v_c)_{c=1}^{\infty}$, $\{f^c\}_{c=1}^{\infty}$, and $\{g^c\}_{c=1}^{\infty}$, and the innovation parameter $\alpha$. Apparently, Bayesian inference is not tractable under this setting, since we are dealing with an infinite number of parameters. For this reason, we employ a common strategy in the literature of Bayesian nonparametrics, formulated on the basis of a truncated stick-breaking representation of the PYP [24]. That is, we fix a value $C$ and we let the variational posterior over the $v_c$ have the property $q(v_C = 1) = 1$. In other words, we set $\varpi_c(v)$ equal to zero for $c > C$. Note that, under this setting, the treated GPMCH model involves a full PYP prior; truncation is not imposed on the model itself, but only on the variational distribution to allow for tractable inference. Hence, the truncation level $C$ is a variational parameter which can be freely set, and not part of the prior model specification.

Let $W \triangleq \{v, \alpha, Z, \{f^c\}_{c=1}^{C}, \{g^c\}_{c=1}^{C}\}$ be the set of all the parameters of the GPMCH model over which a prior distribution has been imposed, and $\Xi$ be the set of the hyperparameters of the model priors and kernel functions. Variational Bayesian inference introduces an arbitrary distribution $q(W)$ to approximate the actual posterior $p(W \,|\, \Xi, X, Y)$, which is computationally intractable, yielding [28]

$$\log p(X, Y) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p), \qquad (33)$$

where

$$\mathcal{L}(q) = \int \mathrm{d}W \, q(W) \log \frac{p(X, Y, W \,|\, \Xi)}{q(W)}, \qquad (34)$$

and $\mathrm{KL}(q \,\|\, p)$ stands for the Kullback-Leibler (KL) divergence between the (approximate) variational posterior, $q(W)$, and the actual posterior, $p(W \,|\, \Xi, X, Y)$. Since the KL divergence is nonnegative, $\mathcal{L}(q)$ forms a strict lower bound of the log evidence, and would become exact if $q(W) = p(W \,|\, \Xi, X, Y)$. Hence, by maximizing this lower bound $\mathcal{L}(q)$ (variational free energy) so that it becomes as tight as possible, not only do we minimize the KL-divergence between the true and the variational posterior, but we also implicitly integrate out the unknowns $W$.

Due to the considered conjugate exponential prior configuration of the GPMCH model, the variational posterior $q(W)$ is expected to take the same functional form as the prior, $p(W)$ [29]:

$$q(W) = q(Z)\, q(\alpha) \prod_{c=1}^{C-1} q(v_c) \prod_{c=1}^{C} \prod_{d=1}^{D} q(f_d^c)\, q(g_d^c), \qquad (35)$$

with

$$q(Z) = \prod_{n=1}^{N} \prod_{c=1}^{C} q(z_{nc} = 1). \qquad (36)$$

Then, the variational free energy of the model reads (ignoring constant terms)

$$\mathcal{L}(q) = \sum_{c=1}^{C} \sum_{d=1}^{D} \int \mathrm{d}f_d^c \, q(f_d^c) \log \frac{p\big(f_d^c \,|\, 0, K^c(X, X)\big)}{q(f_d^c)} + \sum_{c=1}^{C} \sum_{d=1}^{D} \int \mathrm{d}g_d^c \, q(g_d^c) \log \frac{p\big(g_d^c \,|\, \tilde{m}_d^c \mathbf{1}, \Lambda^c(X, X)\big)}{q(g_d^c)} + \int \mathrm{d}\alpha \, q(\alpha) \log \frac{p(\alpha \,|\, \eta_1, \eta_2)}{q(\alpha)} + \sum_{c=1}^{C-1} \int \mathrm{d}v_c \, q(v_c) \log \frac{p(v_c \,|\, \alpha)}{q(v_c)} + \sum_{c=1}^{C} \sum_{n=1}^{N} q(z_{nc} = 1) \left[ \int \mathrm{d}v \, q(v) \log \frac{p(z_{nc} = 1 \,|\, v)}{q(z_{nc} = 1)} + \sum_{d=1}^{D} \int \mathrm{d}f_d^c \, \mathrm{d}g_d^c \, q(f_d^c)\, q(g_d^c) \log p\big(y_{nd} \,|\, f_d^c(x_n), \sigma_d^c(x_n)^2\big) \right]. \qquad (37)$$

Derivation of the variational posterior distribution $q(W)$ involves maximization of the variational free energy $\mathcal{L}(q)$ over each one of the factors of $q(W)$ in turn, holding the others fixed, in an iterative manner [30]. By construction, this iterative, consecutive updating of the variational posterior distribution is guaranteed to monotonically and maximally increase the free energy $\mathcal{L}(q)$ [29].

Let us denote as $\langle \cdot \rangle$ the posterior expectation of a quantity. From (37), we obtain the following variational (approximate) posteriors over the parameters of our model:

1) Regarding the PYP stick-breaking variables $v_c$, we have

$$q(v_c) = \mathrm{Beta}(v_c \,|\, \beta_{c,1}, \beta_{c,2}), \qquad (38)$$

where

$$\beta_{c,1} = 1 - \delta + \sum_{n=1}^{N} q(z_{nc} = 1), \qquad (39)$$
$$\beta_{c,2} = \langle \alpha \rangle + c\delta + \sum_{c'=c+1}^{C} \sum_{n=1}^{N} q(z_{nc'} = 1). \qquad (40)$$

2) The innovation parameter $\alpha$ approximately yields

$$q(\alpha) = \mathcal{G}(\alpha \,|\, \hat{\eta}_1, \hat{\eta}_2), \qquad (41)$$

where

$$\hat{\eta}_1 = \eta_1 + C - 1, \qquad (42)$$
$$\hat{\eta}_2 = \eta_2 - \sum_{c=1}^{C-1} \big[\psi(\beta_{c,2}) - \psi(\beta_{c,1} + \beta_{c,2})\big], \qquad (43)$$

$\psi(.)$ denotes the Digamma function, and

$$\langle \alpha \rangle = \frac{\hat{\eta}_1}{\hat{\eta}_2}. \qquad (44)$$

3) Regarding the posteriors over the latent functions $f_d^c$, we have

$$q(f_d^c) = \mathcal{N}\big(f_d^c \,|\, \mu_d^c, \Sigma_d^c\big), \qquad (45)$$

where

$$\Sigma_d^c = \big(K^c(X, X)^{-1} + B_d^c\big)^{-1}, \qquad (46)$$
$$\mu_d^c = \Sigma_d^c B_d^c y_d, \qquad (47)$$
$$B_d^c \triangleq \mathrm{diag}\!\left( \left[ \Big\langle \tfrac{1}{\sigma_d^c(x_n)^2} \Big\rangle q(z_{nc} = 1) \right]_{n=1}^{N} \right), \qquad (48)$$

and $y_d \triangleq [y_{nd}]_{n=1}^{N}$.

4) Similarly, regarding the posteriors over the latent noise variance processes $g_d^c$, we have

$$q(g_d^c) = \mathcal{N}\big(g_d^c \,|\, m_d^c, S_d^c\big), \qquad (49)$$

where

$$S_d^c = \big(\Lambda^c(X, X)^{-1} + Q_d^c\big)^{-1}, \qquad (50)$$
$$m_d^c = \Lambda^c(X, X)\Big(Q_d^c - \tfrac{1}{2}\,\mathrm{diag}\big([q(z_{nc} = 1)]_{n=1}^{N}\big)\Big)\mathbf{1} + \tilde{m}_d^c \mathbf{1}, \qquad (51)$$

and $Q_d^c$ is a positive semi-definite diagonal matrix, whose components comprise variational parameters that can be freely set. Note that, from this result, it follows

$$\Big\langle \tfrac{1}{\sigma_d^c(x_n)^2} \Big\rangle = \exp\!\Big( -[m_d^c]_n + \tfrac{1}{2}[S_d^c]_{nn} \Big). \qquad (52)$$

5) Finally, the posteriors over the latent variables $Z$ yield

$$q(z_{nc} = 1) \propto \exp\big(\langle \log \varpi_c(v) \rangle\big) \exp(r_{nc}), \qquad (53)$$

where

$$\langle \log \varpi_c(v) \rangle = \sum_{c'=1}^{c-1} \langle \log(1 - v_{c'}) \rangle + \langle \log v_c \rangle, \qquad (54)$$

and

$$r_{nc} \triangleq -\frac{1}{2} \sum_{d=1}^{D} \left[ \Big\langle \tfrac{1}{\sigma_d^c(x_n)^2} \Big\rangle \Big( \big(y_{nd} - [\mu_d^c]_n\big)^2 + [\Sigma_d^c]_{nn} \Big) + [m_d^c]_n \right], \qquad (55)$$

where $[\xi]_n$ is the $n$th element of vector $\xi$, $[\Sigma_d^c]_{nn}$ is the $(n, n)$th element of $\Sigma_d^c$, and it holds

$$\langle \log v_c \rangle = \psi(\beta_{c,1}) - \psi(\beta_{c,1} + \beta_{c,2}), \qquad (56)$$
$$\langle \log(1 - v_c) \rangle = \psi(\beta_{c,2}) - \psi(\beta_{c,1} + \beta_{c,2}). \qquad (57)$$

As a final note, estimates of the values of the model hyperparameters set $\Xi$, which comprises the hyperparameters of the model priors and the kernel functions $k(\cdot, \cdot)$ and $\lambda(\cdot, \cdot)$, are obtained by maximization of the model variational free energy $\mathcal{L}(q)$ over each one of them. For this purpose, in this paper we resort to utilization of the limited memory variant of the BFGS algorithm (L-BFGS) [31].

3.3 Predictive Density

Let us consider the predictive distribution of the $d$th model output variable corresponding to $x_*$. To obtain it, we begin by deriving the predictive posterior distribution over the latent variables $f$. Following the relevant derivations of Section 2.2, we have

$$q(f_*) = \sum_{c=1}^{C} \varpi_c(\langle v \rangle) \prod_{d=1}^{D} \mathcal{N}\big(f_{*d}^c \,|\, a_{*d}^c, (\sigma_{*d}^c)^2\big), \qquad (58)$$

where

$$a_{*d}^c = k^c(x_*)^T \big(K^c(X, X) + (B_d^c)^{-1}\big)^{-1} y_d, \qquad (59)$$
$$(\sigma_{*d}^c)^2 = -k^c(x_*)^T \big(K^c(X, X) + (B_d^c)^{-1}\big)^{-1} k^c(x_*) + k^c(x_*, x_*), \qquad (60)$$
$$\varpi_c(\langle v \rangle) = \langle v_c \rangle \prod_{j=1}^{c-1} \big(1 - \langle v_j \rangle\big), \qquad (61)$$
$$\langle v_c \rangle = \frac{\beta_{c,1}}{\beta_{c,1} + \beta_{c,2}}, \qquad (62)$$

and

$$k^c(x_*) \triangleq [k^c(x_1, x_*), \dots, k^c(x_N, x_*)]^T. \qquad (63)$$

Further, we proceed to the predictive posterior distribution over the latent variables $g$; we yield

$$q(g_{*d}^c) = \mathcal{N}\big(g_{*d}^c \,|\, \tau_{*d}^c, \varphi_{*d}^c\big), \qquad (64)$$

where

$$\tau_{*d}^c = \lambda^c(x_*)^T \big(Q_d^c - \tfrac{1}{2} I_N\big) \mathbf{1} + \tilde{m}_d^c, \qquad (65)$$
$$\varphi_{*d}^c = \lambda^c(x_*, x_*) - \lambda^c(x_*)^T \big(\Lambda^c(X, X) + (Q_d^c)^{-1}\big)^{-1} \lambda^c(x_*), \qquad (66)$$

and

$$\lambda^c(x_*) \triangleq [\lambda^c(x_1, x_*), \dots, \lambda^c(x_N, x_*)]^T. \qquad (67)$$

Based on these results, the predictive posterior of our model output variables yields

$$q(y_{*d}) = \sum_{c=1}^{C} \varpi_c(\langle v \rangle) \int \mathcal{N}\big(y_{*d} \,|\, a_{*d}^c, (\sigma_{*d}^c)^2 + \exp(g_{*d}^c)\big)\, q(g_{*d}^c)\, \mathrm{d}g_{*d}^c. \qquad (68)$$

We note that this expression does not yield a Gaussian predictive posterior. However, it is rather straightforward to compute the predictive means and variances of $y_{*d}$. It holds

$$\hat{y}_{*d} = \mathbb{E}[y_{*d} \,|\, x_*; \mathcal{D}] = \sum_{c=1}^{C} \varpi_c(\langle v \rangle)\, a_{*d}^c, \qquad (69)$$

and

$$\mathbb{V}[y_{*d} \,|\, x_*; \mathcal{D}] = \sum_{c=1}^{C} \varpi_c(\langle v \rangle) \big[ (\sigma_{*d}^c)^2 + \psi_{*d}^c \big], \qquad (70)$$

where

$$\psi_{*d}^c \triangleq \mathbb{E}\big[\exp(g_{*d}^c) \,|\, x_*; \mathcal{D}\big] = \exp\!\Big( \tau_{*d}^c + \tfrac{1}{2}\varphi_{*d}^c \Big). \qquad (71)$$
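The moment expressions (69)–(71), together with the posterior stick-breaking weights of Eqs. (61)–(62), are straightforward to evaluate once the per-component quantities are available. The sketch below assumes those quantities have already been produced by the inference algorithm; all numerical values in the usage example are made up for illustration.

```python
import numpy as np

def gpmch_predictive_moments(weights, a_star, sigma2_star, tau_star, phi_star):
    """Predictive mean and variance of one output dimension, Eqs. (69)-(71).
    All arguments are length-C arrays holding, per component c, the mixture
    weight, a_*d^c, (sigma_*d^c)^2, tau_*d^c and phi_*d^c."""
    psi = np.exp(tau_star + 0.5 * phi_star)          # Eq. (71): E[exp(g)]
    mean = np.sum(weights * a_star)                  # Eq. (69)
    var = np.sum(weights * (sigma2_star + psi))      # Eq. (70)
    return mean, var

def stick_weights(beta1, beta2):
    """Posterior stick-breaking weights, Eqs. (61)-(62)."""
    v = beta1 / (beta1 + beta2)                      # Eq. (62)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

# toy usage with made-up per-component quantities (C = 3)
w = stick_weights(np.array([5.0, 2.0, 1.0]), np.array([3.0, 2.0, 2.0]))
m, s2 = gpmch_predictive_moments(w, a_star=np.array([0.1, -0.2, 0.0]),
                                 sigma2_star=np.array([0.02, 0.05, 0.1]),
                                 tau_star=np.array([-3.0, -1.0, -2.0]),
                                 phi_star=np.array([0.1, 0.2, 0.3]))
print(m, s2)
```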

3.4 Learning the Covariances between the Modeled Output Variables

As one can observe from (23), a characteristic of our proposed GPMCH model is its assumption that the distribution of the modeled output vectors $y \in \mathbb{R}^D$ factorizes over their component variables $\{y_d\}_{d=1}^{D}$. Indeed, this type of modeling is largely the norm in Gaussian process-based modeling approaches [20]. This construction in essence implies that, under our approach, the modeled output variables are considered independent, i.e. their covariance is always assumed to be zero. However, when jointly modeling the return series of various assets, the modeled output variables (asset returns) are rather strongly correlated, and it is desired to be capable of predicting the values of their covariances for any given input value.

Existing approaches for resolving these issues of GP-based models are based on the introduction of an additional kernel-based modeling mechanism that allows for capturing this latent covariance structure [32]–[36]. For example, in [32] the authors propose utilization of a convolution process to induce correlations between two output components. In [35], a generalization of the previous method is proposed for the case of more than two modeled outputs combined under a convolved kernel. Along the same lines, multitask learning approaches for resolving these issues are presented in [33] and [34], where separate GPs are postulated for each output, and are considered to share the same prior in the context of a multitask learning framework.

A drawback of the aforementioned existing approaches is that, in all cases, learning entails employing a tedious optimization procedure to estimate a large number of hyperparameters of the used kernel functions. As expected, such a procedure is, indeed, highly prone to getting trapped in bad local optima, a fact that might severely undermine model performance.

In this work, to avoid being confronted with such optimization issues, and inspired by the financial research literature, we devise a novel way of capturing the interdependencies between the modeled output variables $\{y_d\}_{d=1}^{D}$, expressed in the form of their covariances: specifically, we use the statistical tool of copulas [37].

The copula, introduced in the seminal work of Sklar [37], is a model of statistical dependence between random variables. A copula is defined as a multivariate distribution with standard uniform marginal distributions, or, alternatively, as a function (with some restrictions mentioned for example in [38]) that maps values from the unit hypercube to values in the unit interval.

3.4.1 Copulas: An Introduction

Let $y = [y_d]_{d=1}^{D}$ be a $D$-dimensional random variable with joint cumulative distribution function (cdf) $F([y_d]_{d=1}^{D})$ and marginal cdf's $F_d(y_d)$, $d = 1, \dots, D$, respectively. Then, according to Sklar's theorem, there exists a $D$-variate copula cdf $C(\cdot, \dots, \cdot)$ on $[0, 1]^D$ such that

$$F(y_1, \dots, y_D) = C\big(F_1(y_1), \dots, F_D(y_D)\big), \qquad (72)$$

for any $y \in \mathbb{R}^D$. Additionally, if the marginals $F_d(\cdot)$, $d = 1, \dots, D$, are continuous, then the $D$-variate copula $C(\cdot, \dots, \cdot)$ satisfying (72) is unique. Conversely, if $C(\cdot, \dots, \cdot)$ is a $D$-dimensional copula and $F_i$, $i = 1, \dots, D$, are univariate cdf's, it holds

$$C(u_1, \dots, u_D) = F\big(F_1^{-1}(u_1), \dots, F_D^{-1}(u_D)\big), \qquad (73)$$

where $F_d^{-1}(\cdot)$ denotes the inverse of the cdf of the $d$th marginal distribution $F_d(\cdot)$, i.e. the quantile function of the $d$th modeled variable $y_d$.

It is easy to show that the corresponding probability density function of the copula model, widely known as the copula density function, is given by

$$c(u_1, \dots, u_D) = \frac{\partial^D}{\partial u_1 \dots \partial u_D} C(u_1, \dots, u_D) = \frac{\partial^D}{\partial u_1 \dots \partial u_D} F\big(F_1^{-1}(u_1), \dots, F_D^{-1}(u_D)\big) = \frac{p\big(F_1^{-1}(u_1), \dots, F_D^{-1}(u_D)\big)}{\prod_{i=1}^{D} p_i\big(F_i^{-1}(u_i)\big)}, \qquad (74)$$

where $p_i(\cdot)$ is the probability density function of the $i$th component variable $y_i$.

Let us now assume a parametric class for the copula $C(\cdot, \dots, \cdot)$ and the marginal cdf's $F_i(\cdot)$, $i = 1, \dots, D$, respectively. In particular, let $\zeta$ denote the (trainable) parameter (or set of parameters) of the postulated copula. Then, the joint probability density of the modeled variables $y = [y_i]_{i=1}^{D}$ yields

$$p(y_1, \dots, y_D \,|\, \zeta) = \left[\prod_{i=1}^{D} p_i(y_i)\right] c\big(F_1(y_1), \dots, F_D(y_D) \,|\, \zeta\big). \qquad (75)$$

Since the emergence of the concept of copula, several copula families have been constructed, e.g., Gaussian, Clayton, Frank, Gumbel, Joe, etc., that enable capturing any form of dependence structure. By coupling different marginal distributions with different copula functions, copula-based models are able to model a wide variety of marginal behaviors (such as skewness and fat tails) and dependence properties (such as clusters, positive or negative tail dependence) [38]. Selection of the best-fit copula has been a topic of rigorous research efforts during the last years, and motivating results have already been achieved [39] (for excellent and detailed discussions on copulas, c.f. [38], [40]).
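As a concrete instance of Eqs. (74)–(75), the following sketch evaluates the density of a bivariate Clayton copula (one of the Archimedean families named above) and assembles a joint density from two marginals. The Gaussian marginals and the value of the copula parameter are placeholders of ours; in the proposed approach the marginals are the GPMCH predictive posteriors.

```python
import numpy as np
from scipy.stats import norm

def clayton_density(u, v, theta):
    """Bivariate Clayton copula density c(u, v) for theta > 0."""
    return ((1.0 + theta) * (u * v) ** (-theta - 1.0)
            * (u ** -theta + v ** -theta - 1.0) ** (-2.0 - 1.0 / theta))

def joint_density(y1, y2, theta, marg1=norm(0, 1), marg2=norm(0, 2)):
    """Joint pdf of Eq. (75): product of the marginal pdfs times the copula
    density evaluated at the marginal cdfs.  The Gaussian marginals are
    stand-ins for the model's predictive posteriors."""
    u, v = marg1.cdf(y1), marg2.cdf(y2)
    return marg1.pdf(y1) * marg2.pdf(y2) * clayton_density(u, v, theta)

print(joint_density(0.3, -0.5, theta=2.0))
```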

Fig. 1. Squared return series considered in our experiments. (a) First scenario. (b) Second scenario. (c) Third scenario.

3.4.2 Proposed Approach

In this work, to capture the interdependencies (covariances) between the GPMCH-modeled output variables, we propose a conditional copula-based dependence modeling framework. Specifically, for the considered $D$-dimensional output vectors $y = [y_d]_{d=1}^{D}$, we postulate pairwise parametric conditional models for each output pair $(y_i, y_j)$, $i, j = 1, \dots, D$, $i \neq j$, with cdf's defined as follows:

$$F(y_i, y_j \,|\, x) = C\big(F_i(y_i \,|\, x), F_j(y_j \,|\, x) \,|\, x\big), \qquad (76)$$

where the marginals $F_d(y_d \,|\, x)$ are the cdf's that correspond to the predictive posteriors $q(y_{*d})$ given by (68), and the used input-conditional copulas are defined under a parametric construction as

$$C(u_i, u_j \,|\, x) \triangleq C\big(u_i, u_j \,|\, \zeta_{ij}(x)\big), \qquad (77)$$

and we consider that the $\zeta_{ij}(x)$ are given by

$$\zeta_{ij}(x) = \xi\big(\gamma_{ij}(x)\big), \qquad (78)$$

where the $\gamma_{ij}(x)$ are trainable real-valued models, and $\xi(\cdot)$ is a link function ensuring that the values of $\zeta_{ij}(x)$ will always be within the range allowed by the copula model employed each time. For instance, if a Clayton copula $C(\cdot)$ is employed, it is required that its parameter be positive, i.e. $\zeta_{ij}(x) > 0$ [38]; in such a case, $\xi(\cdot)$ may be defined as the exponential function, i.e. $\xi(\alpha) = \exp(\alpha)$.

Note that the predictive posteriors $q(y_{*d})$ are difficult to compute analytically, since (68) does not yield a Gaussian distribution. For this reason, and in order to facilitate efficient training of the postulated pairwise conditional copula models, in the following we approximate (68) as a Gaussian with mean and variance given by (69) and (70), respectively.

Further, we consider the functions $\gamma_{ij}(x)$ to be linear basis function models. Specifically, we postulate

$$\gamma_{ij}(x) = w_{ij}^T h(x), \qquad (79)$$

where the $w_{ij}$ are trainable model parameters, and the basis functions $h(x)$ are defined using a small set of basis input observations $\{x_i\}_{i=1}^{I}$ and an appropriate kernel function $\tilde{k}$:

$$h(x) \triangleq \big[\tilde{k}(x, x_i)\big]_{i=1}^{I}. \qquad (80)$$

Training for the postulated pairwise conditional copula models can be performed by optimizing the logarithm of the copula density function that corresponds to the parametric conditional model (77), given a set of training data $\mathcal{D} = (x_n, y_n)_{n=1}^{N}$, which yields

$$\mathcal{P}_{ij} = \sum_{n=1}^{N} \log c\big(F_i(y_{ni} \,|\, x_n), F_j(y_{nj} \,|\, x_n) \,\big|\, \xi\big(w_{ij}^T h(x_n)\big)\big), \qquad (81)$$

with respect to the parameter vectors $w_{ij}$. To effect this procedure, in this paper we resort to the L-BFGS algorithm [31].

The procedure we use to perform model training is widely known as "inference function for margins" (IFM) [41]; it generally comprises two steps: in the first step, the marginal model is maximized with respect to its entailed (marginal) parameters, while, in the second step, the copula model is maximized with respect to the entailed (copula) parameters, using the marginal estimates obtained from the first step. This way, model estimation becomes computationally efficient, while comparison of different copulas can also be conducted in a convenient way, by means of standard methodologies for assumption testing. Of course, such an approximate training procedure naturally results in some information loss. However, IFM has been shown to yield good quality results for very attractive computational costs, especially when working with large training datasets [41].

After training the postulated pairwise models $C(u_i, u_j \,|\, \zeta_{ij}(x))$ $\forall i \neq j$, computation of the predictive covariance $\mathbb{V}[y_{*i}, y_{*j} \,|\, x_*; \mathcal{D}]$ between the $i$th and the $j$th model output given the input observation $x_*$ can be conducted using the corresponding conditional copula model and marginal predictive densities. Specifically, from Hoeffding's lemma [42]–[44], we directly obtain

$$\mathbb{V}\big[y_{*i}, y_{*j} \,|\, x_*; \mathcal{D}\big] = \int\!\!\int \Big[ C\big(F_i(\kappa \,|\, x_*), F_j(\kappa' \,|\, x_*) \,\big|\, \xi(w_{ij}^T h(x_*))\big) - F_i(\kappa \,|\, x_*)\, F_j(\kappa' \,|\, x_*) \Big]\, \mathrm{d}\kappa\, \mathrm{d}\kappa'; \qquad (82)$$

this latter integral can be approximated by means of numerical analysis methods.

Note that, in deriving the proposed approach, we were interested in modeling bivariate marginals separately. As such, there is no enforcement that the result is a coherent probabilistic model. This lack of our approach could result in a loss of statistical efficiency by not conditioning on the constraints enforcing coherence. On the other hand, there are significant computational gains; in addition, the resulting estimator might still be better than a fully coherent model if there is severe misspecification.
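A minimal sketch of the covariance computation of Eq. (82) via brute-force numerical integration is given below. It assumes a Clayton copula with a fixed parameter and Gaussian marginals standing in for the moment-matched predictive posteriors of Eqs. (69)–(70); the integration grid and bounds are arbitrary choices of ours.

```python
import numpy as np
from scipy.stats import norm

def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula cdf C(u, v) for theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def hoeffding_covariance(marg_i, marg_j, theta, lo=-10.0, hi=10.0, n=400):
    """Approximate Eq. (82): double integral of C(F_i, F_j) - F_i * F_j
    over a regular grid (simple Riemann sum)."""
    k = np.linspace(lo, hi, n)
    dk = k[1] - k[0]
    Fi, Fj = marg_i.cdf(k), marg_j.cdf(k)
    integrand = (clayton_cdf(Fi[:, None], Fj[None, :], theta)
                 - Fi[:, None] * Fj[None, :])
    return integrand.sum() * dk * dk

cov = hoeffding_covariance(norm(0, 1.0), norm(0, 1.5), theta=2.0)
print("predicted covariance:", cov)
```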
TABLE 1
First Scenario: Obtained RMSEs Based on the Percentage Returns

4 EXPERIMENTAL EVALUATION

In this section, we elaborate on the application of our GPMCH approach to volatility modeling for financial return series data. We perform an experimental evaluation of its performance in volatility modeling, and examine how it compares to state-of-the-art competitors. We also assess the efficacy of the proposed copula-based approach for learning the predictive covariances between the modeled output variables of the GPMCH model.

TABLE 2
Second Scenario: Obtained RMSEs Based on the Percentage Returns

TABLE 3
Third Scenario: Obtained RMSEs Based on the Percentage Returns

For this purpose, we consider modeling the daily return series of various financial indices, including currency exchange rates, global large-cap equity indices, and Euribor rates. We note that, in this work, the asset return $r(t)$ is defined as the difference between the logarithm of the prices $p(t)$ at two subsequent time points, i.e., $r(t) \triangleq \log p(t) - \log p(t-1)$. All our source codes were developed in MATLAB R2012a.

4.1 Volatility Prediction Using the GPMCH Model: Real Assets

In this set of experiments, we consider three application scenarios:

• In the first scenario, we model the return series pertaining to the following currency exchange rates, over the period December 31, 1979 to December 31, 1998 (daily closing prices):
  1) (AUD) Australian Dollar / U.S. $
  2) (GBP) U.K. Pound / U.S. $
  3) (CAD) Canadian Dollar / U.S. $
  4) (DKK) Danish Krone / U.S. $
  5) (FRF) French Franc / U.S. $
  6) (DEM) German Mark / U.S. $
  7) (JPY) Japanese Yen / U.S. $
  8) (CHF) Swiss Franc / U.S. $.

• In the second scenario, we model the return series pertaining to the following global large-cap equity indices, for the business days over the period April 27, 1993 to July 14, 2003 (daily closing prices):
  1) (TSX) Canadian TSX Composite
  2) (CAC) French CAC 40
  3) (DAX) German DAX
  4) (NIK) Japanese Nikkei 225
  5) (FTSE) U.K. FTSE 100
  6) (SP) U.S. S&P 500.

• Finally, in the third scenario, we model the return series pertaining to the following seven global large-cap equity indices and Euribor rates, for the business days over the period February 7, 2001 to April 24, 2006 (daily closing prices for the first six indices, and annual percentage rate converted to daily effective yield for the last index):
  1) (TSX) Canadian TSX Composite
  2) (CAC) French CAC 40
  3) (DAX) German DAX
  4) (NIK) Japanese Nikkei 225
  5) (FTSE) U.K. FTSE 100
  6) (SP) U.S. S&P 500
  7) (EB3M) Three-month Euribor rate.

These series have become standard benchmarks for assessing the performance of volatility prediction algorithms [23], [45], [46]. We provide an illustration of the squares of the considered return series in Fig. 1.

In all the considered scenarios, the proposed GPMCH model is trained using as input data, $x(t)$, vectors containing the daily returns of all the assets considered in each scenario. The corresponding training output data $y(t)$ essentially comprise the same series of input vectors shifted one step ahead. In other words, the output series are defined as $y(t) \triangleq r(t+1)$, $t > 0$, and the input series as $x(t) \triangleq r(t)$, $t < T$, where $T$ is the total duration of the modeled return series, and the vectors $r(t)$ contain the return values of all the considered indices at time $t$.

In our experiments, we evaluate the GPMCH model using zero kernels for the mean process, i.e. $k^c(x, x') = 0$ $\forall c$; this construction allows for our model to remain consistent with the existing literature, where it is typically considered that the modeled return series constitute a zero-mean, noise-only process, i.e. $f_d^c(x) = 0$ $\forall d, c$. Note though that our approach can seamlessly deal with learning the mean process $f_d^c(x)$, if a model for its covariance is available. Further, we consider autoregressive kernels of order one for the noise variance process of the model, of the form

$$\lambda^c(x, x') = \frac{\sigma_0^2\, \phi^{\|x - x'\|}}{1 - \phi^2}, \qquad (83)$$

where $\phi$ and $\sigma_0^2$ are model hyperparameters, estimated by means of free energy optimization (using the L-BFGS algorithm).

To obtain some comparative results, we also evaluate: (i) a common baseline approach from the field of financial engineering and econometrics, namely the GARCH(1,1) model [3], that is a GARCH model with volatility terms of order one and residual terms of order one; (ii) the Mixed Normal Conditional Heteroscedasticity (mix-GARCH(1,1)) approach of [14]; and (iii) the recently proposed VHGP approach of [21]. All these approaches have been shown to be very competitive in the task of volatility prediction in financial return series [21], [47]. Note that the GARCH(1,1) and mix-GARCH(1,1) models use as input the time variable, while the VHGP model is trained similar to GPMCH.
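A minimal sketch of the order-one autoregressive noise-variance kernel of Eq. (83) is given below; the hyperparameter values passed in the toy example are placeholders, since in the experiments $\phi$ and $\sigma_0^2$ are estimated by free-energy maximization.

```python
import numpy as np

def ar1_noise_kernel(X, X2=None, sigma0_sq=1.0, phi=0.9):
    """Order-one autoregressive kernel of Eq. (83):
    lambda(x, x') = sigma0^2 * phi^||x - x'|| / (1 - phi^2)."""
    X2 = X if X2 is None else X2
    dist = np.linalg.norm(X[:, None, :] - X2[None, :, :], axis=-1)
    return sigma0_sq * phi ** dist / (1.0 - phi ** 2)

# kernel matrix for a toy batch of 5 three-dimensional return vectors
X = np.random.default_rng(0).standard_normal((5, 3))
print(ar1_noise_kernel(X, sigma0_sq=0.2, phi=0.95).round(3))
```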

Fig. 2. Second scenario: Obtained RMSEs based on the percentage returns. (a) GARCH. (b) mixGARCH. (c) VHGP. (d) GPMCH.

In our experiments, we follow an evaluation protocol similar to [46]: all the evaluated methods are trained using a rolling window of the previous 120 days of returns to make volatility forecasts up to 30 days ahead; we retrain the models every 7 days. As our performance metric used to evaluate the considered algorithms, we consider the root mean squared error (RMSE) between the model-estimated volatilities and the squared returns of the modeled return series. As discussed in [48], this groundtruth measurement constitutes one of the few consistent ways of measuring volatility; the same performance metric was employed in [14].

In Tables 1–3, we provide the obtained results for the three considered scenarios. These results are computed over all the assets modeled in each scenario. In addition, to show how performance fluctuates as a function of the prediction horizon, we further elaborate on the second scenario: in Fig. 2, we depict the obtained RMSEs as a function of the prediction horizon, separately for each asset. As we observe, our method works clearly better than the competition for all the modeled assets throughout the considered 30-step prediction horizon.
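The evaluation protocol just described can be summarized by the following sketch of a rolling-window back-test; the model fitting and forecasting callables are placeholders (here a naive constant-variance model), and the loop structure is our own reading of the protocol rather than the authors' code.

```python
import numpy as np

def rolling_volatility_rmse(returns, fit_fn, predict_fn,
                            window=120, horizon=30, retrain_every=7):
    """Refit on a rolling window of the previous 120 daily returns every
    7 days, forecast volatility up to 30 days ahead, and score against the
    squared returns with the RMSE."""
    sq_err, model = [], None
    T = len(returns)
    for t in range(window, T - horizon):
        if (t - window) % retrain_every == 0:
            model = fit_fn(returns[t - window:t])
        vol_forecast = predict_fn(model, horizon)        # length-`horizon` array
        sq_err.append((vol_forecast - returns[t:t + horizon] ** 2) ** 2)
    return np.sqrt(np.mean(sq_err))

# toy usage with a naive constant-variance "model"
rng = np.random.default_rng(0)
r = rng.standard_normal(500) * 0.01
rmse = rolling_volatility_rmse(r, fit_fn=lambda w: w.var(),
                               predict_fn=lambda m, h: np.full(h, m))
print(rmse)
```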

TABLE 4
First Scenario: Obtained RMSEs Considering Comparison Against the Asset Pair Return Products

TABLE 5
Second Scenario: Obtained RMSEs Considering Comparison Against the Asset Pair Return Products

Fig. 3. Volatility prediction: Synthetic experiments. (a) GPMCH: posteriors over g(x). (b) GPMCH: posteriors over y. (c) VHGP: posteriors over g(x). (d) VHGP: posteriors over y.

We also observe a clear trend of prediction error increase for longer prediction horizons in all assets. Similar results are obtained for the rest of the considered experimental cases (omitted for brevity).

4.2 Volatility Prediction Using the GPMCH Model: Synthetic Experiments

In this experiment, we evaluate our method using synthetic data with availability of groundtruth predictions. This way, we allow for gaining better insights into the modeling capacity of our approach, and how it compares to existing alternatives. Specifically, for this purpose, we use a synthetic dataset of 200 data points generated by Girolami and Calderhead in [49]. This dataset was obtained by sampling from a VHGP model (i.e., a GPMCH with one mixture component), where the latent function is set to zero, i.e., $k(x, x') = 0$, the mean of $g(x)$ is of the form $\tilde{m} = 2 \log \beta$, and the kernel of $g(x)$ is of the form $\lambda(x, x') = \frac{\sigma_0^2\, \phi^{|x - x'|}}{1 - \phi^2}$. Specifically, in the considered datasets, the chosen values of the model hyperparameters were $\sigma_0 = 0.15$, $\phi = 0.98$, $\beta = 0.65$.

To make the task harder and more realistic, we further contaminated this dataset with white noise. We used the so-obtained distorted data to perform training of our GPMCH model as well as the related VHGP model. In Figs. 3(a)–(d), we provide the obtained posteriors over $g(x)$ and the outputs $y$ of the model; in these figures, the shaded area illustrates the variance of the depicted posterior distributions. As we observe, our model obtains much better accuracy, especially in terms of the obtained posteriors over $g(x)$. Finally, in Fig. 4 we provide the posterior over model components obtained by our method; specifically, we show how many data points are effectively assigned to each component based on the MAP criterion. As we observe, we eventually obtained 4 (effective) model components; all the rest were empty.

Fig. 4. Volatility prediction: Synthetic experiments. Posterior over model components.
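For readers wishing to reproduce a dataset in the spirit of this experiment, the sketch below samples observations using the stated hyperparameters ($\sigma_0 = 0.15$, $\phi = 0.98$, $\beta = 0.65$); the input grid, random seed, and small jitter term are our own choices, so the result is not the exact dataset of [49].

```python
import numpy as np

def synthetic_vhgp_data(N=200, sigma0=0.15, phi=0.98, beta=0.65, seed=0):
    """Sample data from a single-component model: zero latent mean,
    log-variance g drawn from a GP with the AR(1)-type kernel and mean
    2*log(beta), and observations y ~ N(0, exp(g))."""
    rng = np.random.default_rng(seed)
    x = np.arange(N, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])
    Lam = sigma0 ** 2 * phi ** dist / (1.0 - phi ** 2)
    g = rng.multivariate_normal(2.0 * np.log(beta) * np.ones(N),
                                Lam + 1e-10 * np.eye(N))
    y = rng.normal(0.0, np.exp(0.5 * g))
    return x, y, g

x, y, g = synthetic_vhgp_data()
print(y[:5].round(3), np.exp(0.5 * g).mean().round(3))
```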

Fig. 5. Pearson correlation coefficient between the modeled assets in the considered experimental scenarios. (a) First Scenario. (b) Second Scenario. (c) Third Scenario.

4.3 Copula-Based Modeling of the Covariances between Asset Returns

Here, we evaluate the performance of the proposed copula-based approach for learning a predictive model of the covariances between the GPMCH-modeled asset returns. For this purpose, we repeat the previous experimental scenarios, with the goal now being to obtain predictions regarding the covariances between the assets modeled each time. To obtain an impression of how strongly correlated the modeled assets are in each scenario, we provide the Pearson correlation coefficient over the modeled assets in Fig. 5. As we observe, most of the modeled assets in our scenarios are found to be moderately to strongly correlated.

In our experiments, we consider application of three popular Archimedean copula types, namely Clayton, Frank, and Gumbel copulas [38]. The employed GPMCH models are trained similar to the previous experiments. The postulated conditional-copula pairwise models use a basis set of input observations (to compute the $h(x)$ in (80)) that comprises 10% of the available training data points, i.e. 12 data points sampled at regular time intervals (one sample every 10 days).

To obtain some comparative results, we also evaluate the performance of two state-of-the-art methods used for modeling dynamic covariance matrices (multivariate volatility) for high-dimensional vector-valued observations; specifically, we consider the CCC-MVGARCH(1,1) approach of [50], and the GARCH-BEKK(1,1) method of [51]. As our evaluation metric, we use the products of the returns of the corresponding asset pairs at each time point. Our obtained results are depicted in Tables 4–6. We observe that our approach yields a very competitive result: specifically, in two out of the three considered scenarios, the yielded improvement was equal to or exceeded one order of magnitude, while, in one case, all methods yielded comparable results. We also observe that switching the employed Archimedean copula type had only marginal effects on model performance, in all our experiments.

TABLE 6
Third Scenario: Obtained RMSEs Considering Comparison Against the Asset Pair Return Products

5 CONCLUSIONS

In this paper, we proposed a novel nonparametric Bayesian approach for modeling conditional heteroscedasticity in financial return series. Our approach consists in the postulation of a mixture of Gaussian process regression models, each component of which models the noise variance process that contaminates the observed data as a separate latent Gaussian process driven by the observed data. We imposed a nonparametric prior with power-law nature over the distribution of the model mixture components, namely the Pitman-Yor process prior, to allow for better capturing modeled data distributions with heavy tails and skewness. In addition, in order to provide a predictive posterior for the covariances over the modeled asset returns, we devised a copula-based covariance modeling procedure built on top of our model. To assess the efficacy of our approach, we applied it to several asset return series, and compared its performance to several state-of-the-art methods in the field, on the grounds of standard evaluation metrics. As we observed, our approach yields a clear performance improvement over its competitors in all the considered scenarios.

REFERENCES

[1] L. Chollete, A. Heinen, and A. Valdesogo, "Modeling international financial returns with a multivariate regime switching copula," J. Financ. Econ., vol. 7, no. 4, pp. 437–480, 2009.
[2] R. Engle, "Autoregressive conditional heteroskedasticity models with estimation of variance of United Kingdom inflation," Econometrica, vol. 50, no. 4, pp. 987–1007, Jul. 1982.
[3] T. Bollerslev, "Generalized autoregressive conditional heteroskedasticity," J. Econ., vol. 94, pp. 238–276, Apr. 1986.
[4] D. Gu and H. Hu, "Spatial Gaussian process regression with mobile sensor networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1279–1290, Aug. 2012.
[5] C. Rasmussen and Z. Ghahramani, "Infinite mixtures of Gaussian process experts," in Proc. NIPS 14, 2002, pp. 881–888.
[6] E. Meeds and S. Osindero, "An alternative infinite mixture of Gaussian process experts," in Proc. NIPS 18, 2006, pp. 883–890.
[7] L. Xu, M. I. Jordan, and G. E. Hinton, "An alternative model for mixtures of experts," in Proc. NIPS 7, 1995, pp. 633–640.
[8] S. Walker, P. Damien, P. Laud, and A. Smith, "Bayesian nonparametric inference for random distributions and related functions," J. Roy. Statist. Soc., ser. B, vol. 61, no. 3, pp. 485–527, 1999.
[9] R. Neal, "Markov chain sampling methods for Dirichlet process mixture models," J. Comput. Graph. Statist., vol. 9, no. 2, pp. 249–265, 2000.
[10] P. Muller and F. Quintana, "Nonparametric Bayesian data analysis," Statist. Sci., vol. 19, no. 1, pp. 95–110, 2004.
[11] J. Pitman and M. Yor, "The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator," Ann. Probab., vol. 25, no. 2, pp. 855–900, Apr. 1997.
[12] Y. W. Teh, "A hierarchical Bayesian language model based on Pitman-Yor processes," in Proc. 21st Int. Conf. ACL, Stroudsburg, PA, USA, 2006, pp. 985–992.
[13] C. Alexander and E. Lazar, "Normal mixture GARCH(1,1): Applications to exchange rate modelling," J. Appl. Econ., vol. 21, no. 3, pp. 307–336, 2006.
[14] M. Haas, S. Mittnik, and M. Paolella, "Mixed normal conditional heteroskedasticity," J. Financ. Econ., vol. 2, no. 2, pp. 211–250, 2004.
[15] L. Bauwens, C. M. Hafner, and J. V. K. Rombouts, "Multivariate mixed normal conditional heteroskedasticity," Comput. Stat. Data Anal., vol. 51, no. 7, pp. 3551–3566, Apr. 2007.

[16] M. Haas, S. Mittnik, and M. Paolella, "Asymmetric multivariate normal mixture GARCH," Comput. Stat. Data Anal., vol. 53, no. 6, pp. 2129–2154, Apr. 2009.
[17] T. Ferguson, "A Bayesian analysis of some nonparametric problems," Ann. Stat., vol. 1, no. 2, pp. 209–230, Mar. 1973.
[18] D. Blackwell and J. MacQueen, "Ferguson distributions via Pólya urn schemes," Ann. Stat., vol. 1, no. 2, pp. 353–355, Mar. 1973.
[19] J. Sethuraman, "A constructive definition of the Dirichlet prior," Statistica Sinica, vol. 2, no. 4, pp. 639–650, 1994.
[20] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[21] M. Lázaro-Gredilla and M. Titsias, "Variational heteroscedastic Gaussian process regression," in Proc. 28th Int. Conf. Machine Learning, Bellevue, WA, USA, 2011.
[22] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, "Most likely heteroscedastic Gaussian process regression," in Proc. ICML, New York, NY, USA, 2007, pp. 393–400.
[23] C. Brooks, S. Burke, and G. Persand, "Benchmarks and the accuracy of GARCH model estimation," Int. J. Forecasting, vol. 17, pp. 45–56, Jan.–Mar. 2001.
[24] D. M. Blei and M. I. Jordan, "Variational inference for Dirichlet process mixtures," Bayesian Anal., vol. 1, no. 1, pp. 121–144, 2006.
[25] Y. Qi, J. W. Paisley, and L. Carin, "Music analysis using hidden Markov mixture models," IEEE Trans. Signal Process., vol. 55, no. 11, pp. 5209–5224, Nov. 2007.
[26] W. Fan, N. Bouguila, and D. Ziou, "Variational learning for finite Dirichlet mixture models and applications," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 5, pp. 762–774, May 2012.
[27] A. Penalver and F. Escolano, "Entropy-based incremental variational Bayes learning of Gaussian mixtures," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, pp. 534–540, May 2012.
[28] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An introduction to variational methods for graphical models," Mach. Learn., vol. 37, no. 2, pp. 183–233, Nov. 1999.
[29] S. Chatzis, D. Kosmopoulos, and T. Varvarigou, "Signal modeling and classification using a robust latent space model based on t distributions," IEEE Trans. Signal Process., vol. 56, no. 3, pp. 949–963, Mar. 2008.
[30] D. Chandler, Introduction to Modern Statistical Mechanics. New York, NY, USA: Oxford University Press, 1987.
[31] D. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Prog., ser. B, vol. 45, no. 3, pp. 503–528, 1989.
[32] P. Boyle and M. Frean, "Dependent Gaussian processes," in Proc. NIPS, 2005, pp. 217–224.
[33] K. Yu, V. Tresp, and A. Schwaighofer, "Learning Gaussian processes from multiple tasks," in Proc. 22nd ICML, Bonn, Germany, 2005, pp. 1012–1019.
[34] E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams, "Multi-task Gaussian process prediction," in Proc. NIPS, 2008.
[35] M. Alvarez and N. Lawrence, "Sparse convolved multiple output Gaussian processes," in Proc. NIPS, 2008, pp. 57–64.
[36] L. Bo and C. Sminchisescu, "Twin Gaussian processes for structured prediction," Int. J. Comput. Vision, vol. 87, no. 1–2, pp. 28–52, Mar. 2010.
[37] A. Sklar, "Fonctions de répartition à n dimensions et leurs marges," Publications de l'Institut de Statistique de l'Université de Paris, vol. 8, pp. 229–231, 1959.
[38] R. Nelsen, An Introduction to Copulas. New York, NY, USA: Springer, 2006.
[39] F. Abegaz and U. Naik-Nimbalkar, "Modeling statistical dependence of Markov chains via copula models," J. Stat. Plan. Inference, vol. 138, pp. 1131–1146, Apr. 2008.
[40] H. Joe, Multivariate Models and Dependence Concepts. London, U.K.: Chapman and Hall, 1997.
[41] H. Joe, "Asymptotic efficiency of the two-stage estimation method for copula-based models," J. Multivar. Anal., vol. 94, no. 2, pp. 401–419, Jun. 2005.
[42] W. Hoeffding, "Masstabinvariante Korrelationstheorie," Schr. Math. Inst. Univ. Berlin, vol. 5, no. 3, pp. 179–233, 1940.
[43] H. Block and Z. Fang, "A multivariate extension of Hoeffding's lemma," Ann. Probab., vol. 16, pp. 1803–1820, 1988.
[44] C. M. Cuadras, "Correspondence analysis and diagonal expansions in terms of distribution functions," J. Stat. Plan. Inference, vol. 103, pp. 137–150, Apr. 2002.
[45] B. McCullough and C. Renfro, "Benchmarks and software standards: A case study of GARCH procedures," J. Econ. Social Meas., vol. 25, no. 2, pp. 59–71, 1998.
[46] A. G. Wilson and Z. Ghahramani, "Copula processes," in Proc. NIPS, Vancouver, BC, USA, 2010.
[47] P. R. Hansen and A. Lunde, "A forecast comparison of volatility models: Does anything beat a GARCH(1,1)?" J. Appl. Econ., vol. 20, no. 7, pp. 873–889, 2005.
[48] C. T. Brownlees, R. F. Engle, and B. T. Kelly. (2009). A Practical Guide to Volatility Forecasting Through Calm and Storm [Online]. Available: http://ssrn.com/abstract=1502915
[49] M. Girolami and B. Calderhead, "Riemann manifold Langevin and Hamiltonian Monte Carlo methods," J. Roy. Stat. Soc., ser. B, vol. 73, no. 2, pp. 1–37, 2011.
[50] R. F. Engle, "Dynamic conditional correlation: A simple class of multivariate generalized autoregressive conditional heteroskedasticity models," J. Bus. Econ. Stat., vol. 20, no. 3, pp. 339–350, Jul. 2002.
[51] R. F. Engle and K. F. Kroner, "Multivariate simultaneous generalized ARCH," Econ. Theory, vol. 11, no. 1, pp. 122–150, Mar. 1995.

Emmanouil A. Platanios conducted this work during an internship with the Department of Electrical Engineering, Computer Engineering, and Informatics, Cyprus University of Technology, Limassol, Cyprus. He is currently pursuing the Ph.D. degree with the Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA, USA. He received the M.E. degree in electrical and electronic engineering in 2013, with first-class honours, from Imperial College London, U.K. He is the co-founder of Holic Inc., a start-up company in Greece, the main product of which is an intelligent news reader application. He has served as a research director of Holic Inc., directing machine learning-related research. He has been working on novel algorithms for news article classification and clustering, and on specialized ranking of news providers.

Sotirios P. Chatzis is a lecturer (U.S. equivalent: assistant professor) with the Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology, Limassol, Cyprus. His current research interests include machine learning theory and methodologies with a special focus on hierarchical Bayesian models, Bayesian nonparametrics, quantum statistics, and neuroscience. He received the M.E. degree in electrical and computer engineering with distinction from the National Technical University of Athens, Greece, in 2005, and the Ph.D. degree in machine learning, in 2008, from the same institution. From January 2009 until June 2010, he was a post-doctoral fellow with the University of Miami, Miami, FL, USA. From June 2010 until August 2012, he was a post-doctoral researcher with the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K. His Ph.D. research was supported by the Bodossaki Foundation and the Greek Ministry for Economic Development. He was also awarded the Dean's Scholarship for Ph.D. studies, being the best performing Ph.D. student in his class. In his first seven years as a researcher, he has authored more than 40 papers in the most prestigious journals and conferences of his research field.
