
ARMA Time-Series Modeling with Graphical Models

Bo Thiesson, David Maxwell Chickering, David Heckerman, Christopher Meek
Microsoft Research
[email protected] [email protected] [email protected] [email protected]

Abstract

We express the classic ARMA time-series model as a directed graphical model. In doing so, we find that the deterministic relationships in the model make it effectively impossible to use the EM algorithm for learning model parameters. To remedy this problem, we replace the deterministic relationships with Gaussian distributions having a small variance, yielding the stochastic ARMA (σARMA) model. This modification allows us to use the EM algorithm to learn parameters and to forecast, even in situations where some data is missing. This modification, in conjunction with the graphical-model approach, also allows us to include cross predictors in situations where there are multiple time series and/or additional non-temporal covariates. More surprisingly, experiments suggest that the move to stochastic ARMA yields improved accuracy through better smoothing. We demonstrate the improvements afforded by cross prediction and better smoothing on real data.

1 Introduction

Graphical models have been used to represent time-series models for almost two decades (e.g., Dean and Kanazawa, 1988; Cooper, Horvitz, and Heckerman, 1988). The benefits of such a representation include the ability to apply standard graphical-model-inference algorithms for parameter estimation (including estimation in the presence of missing data), for prediction, and for cross prediction—the use of one time series to help predict another time series (e.g., Ghahramani, 1998; Meek, Chickering, and Heckerman, 2002; Bach and Jordan, 2004).

In this paper, we express the classic autoregressive moving average (ARMA) time-series model (e.g., Box, Jenkins, and Reinsel, 1994) as a directed graphical model to achieve similar benefits. As we shall see, the ARMA model includes deterministic relationships, making it effectively impossible to estimate model parameters via the Expectation–Maximization (EM) algorithm. Consequently, we introduce a variation of the ARMA model for which the EM algorithm can be applied. Our modification, called stochastic ARMA (σARMA), replaces the deterministic component of the ARMA model with a Gaussian distribution having a small variance. To our surprise, this addition not only has the desired effect of making the EM algorithm effective for parameter estimation, but our experiments also suggest that it provides a smoothing technique that produces more accurate estimates.

We have chosen to focus on ARMA time-series models in this paper, but our approach applies immediately to autoregressive integrated moving average (ARIMA) time-series models as well. In particular, an ARIMA model for a time series is simply an ARMA model for a preprocessed version of that same time series. This preprocessing consists of d consecutive differencing transformations, where each transformation replaces the observations with the differences between successive observations. For example, when d = 0 an ARIMA model is a regular ARMA model, when d = 1 an ARIMA model is an ARMA model of the differences, and when d = 2 an ARIMA model is an ARMA model of the differences of the differences. In practice, d ≤ 2 is almost always sufficient for good results (Box, Jenkins, and Reinsel, 1994).
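To make the differencing preprocessing concrete, the following minimal sketch (illustrative only; the function name and example values are ours, not the paper's) applies d consecutive differencing passes to a series before an ARMA model would be fit.

```python
import numpy as np

def difference(y, d=1):
    """Apply d consecutive differencing transformations to a 1-D series.

    Each pass replaces the observations with the differences between
    successive observations, as in the ARIMA preprocessing described above.
    """
    z = np.asarray(y, dtype=float)
    for _ in range(d):
        z = z[1:] - z[:-1]
    return z

# d = 0 leaves the series unchanged; d = 1 gives first differences;
# d = 2 gives differences of differences.
y = [3.0, 3.5, 4.1, 5.0, 5.2]
print(difference(y, d=0))  # [3.  3.5 4.1 5.  5.2]
print(difference(y, d=1))  # [0.5 0.6 0.9 0.2]
print(difference(y, d=2))  # [ 0.1  0.3 -0.7]
```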
In Section 2, we review the ARMA model and introduce our stochastic variant. We also extend the model to allow for cross prediction, yielding σARMAxp, a model more general than previous extensions to cross-prediction ARMA (e.g., the vector ARMA model; Reinsel, 2003). In Section 3, we describe how the EM algorithm can be applied to σARMAxp for parameter estimation. Our approach is, to our knowledge, the first EM-based alternative to parameter estimation in the ARMA model class. In Section 4, we describe how to predict observations—that is, forecast—using σARMAxp. In Sections 5 and 6, we demonstrate the utility of our extensions in an evaluation using two real data collections. We show that (1) the stochastic variant of ARMA produces more accurate forecasts than those of standard ARMA, (2) estimation and prediction in the face of missing data using our approach yields better forecasts than a heuristic approach in which missing data are filled in by interpolation, and (3) the inclusion of cross predictors can lead to more accurate forecasting.

2 Time-Series Models

In this section, we review the well-known class of autoregressive-moving average (ARMA) time-series models and define two closely related stochastic variations, the σARMA and the σARMA∗ classes. We also define a generalization of these two stochastic variations, called σARMAxp and σARMA∗xp, which additionally allow for selective cross-predictors from related time series.

We begin by introducing notation and nomenclature. We denote a temporal sequence of observation variables by Y = (Y_1, Y_2, ..., Y_T). Time-series data is a sequence of values for these variables denoted by y = (y_1, y_2, ..., y_T). We suppose that these observations are obtained at discrete, equispaced intervals of time. In this paper we will consider incomplete observation sequences in the sense that some of the observation variables may have missing observations. For notational convenience, we will represent such a sequence of observations as a complete sequence, and it will be clear from context that this sequence has missing observations.

In the time-series models that we consider in this paper, we associate a latent "error" variable with each observable variable. These latent variables are denoted E = (E_1, E_2, ..., E_T).

Some of the models will contain cross-predictor sequences. A cross-predictor sequence is a sequence of observation variables from a related time series, which is used in the predictive model for the time series under consideration. For each cross-predictor sequence in a model, a cross-predictor variable is associated with each observable variable. For instance, Y′ = (Y′_1, Y′_2, ..., Y′_{T′}) and Y′′ = (Y′′_1, Y′′_2, ..., Y′′_{T′′}) may be related time series where Y′_{t−1}, Y′_{t−12}, and Y′′_{t−1} are cross-predictor variables for Y_t. Let C_t denote a vector of cross-predictor variables for Y_t. The set of cross-predictor vectors for all variables Y is denoted C = (C_1, C_2, ..., C_T).

Our stochastic time-series models handle incomplete time-series data in the sense that some values in the data may be missing. Hence, in real-world situations where the lengths of multiple cross-predicting time series do not match, we have two possibilities: we can introduce observation variables with missing values, or we can shorten a sequence of observation variables, as necessary. In the following we will assume that Y, E, and C are all of the same length. For any sequence, say Y, we denote the sub-sequence consisting of the i'th through the j'th elements by Y_i^j = (Y_i, Y_{i+1}, ..., Y_j), i < j.

2.1 ARMA Models

In slightly different notation than usual (see, e.g., Box, Jenkins, and Reinsel, 1994, or Ansley, 1979), the ARMA(p, q) time-series model is defined by the deterministic relation

    Y_t = ζ + ∑_{j=0}^{q} β_j E_{t−j} + ∑_{i=1}^{p} α_i Y_{t−i}    (1)

where ζ is the intercept, ∑_{i=1}^{p} α_i Y_{t−i} is the autoregressive (AR) part, ∑_{j=0}^{q} β_j E_{t−j} is the moving average (MA) part with β_0 fixed as 1, and E_t ∼ N(0, γ) is "white noise" with the E_t mutually independent for all t. The construction of this model therefore involves estimation of the free parameters ζ, (α_1, ..., α_p), (β_1, ..., β_q), and γ.

For a constructed model, the one step-ahead forecast Ŷ_t given the past can be computed as

    Ŷ_t = ζ + ∑_{j=1}^{q} β_j E_{t−j} + ∑_{i=1}^{p} α_i Y_{t−i}    (2)

where we exploit that, at any time t, the error in the ARMA model can be determined as the difference between the actual observed value and the one step-ahead forecast

    E_t = Y_t − Ŷ_t    (3)

The variance for this forecast is γ.
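Equations (1)–(3) can be read as a simple recursion: given fixed parameters and a fully observed series, the one step-ahead forecasts and the implied errors are computed jointly. The sketch below is an illustrative rendering of that recursion (our own code, conditioning on the first R = max(p, q) values and initializing earlier errors to zero), not the estimation procedure developed later in the paper.

```python
import numpy as np

def arma_one_step_forecasts(y, zeta, alpha, beta):
    """One step-ahead ARMA(p, q) forecasts (eq. 2) and errors (eq. 3).

    alpha = (alpha_1, ..., alpha_p), beta = (beta_1, ..., beta_q); beta_0 = 1.
    Errors before time R = max(p, q) are set to zero (conditional likelihood).
    """
    y = np.asarray(y, dtype=float)
    p, q = len(alpha), len(beta)
    R = max(p, q)
    T = len(y)
    e = np.zeros(T)
    yhat = np.full(T, np.nan)
    for t in range(R, T):
        ar = sum(alpha[i] * y[t - 1 - i] for i in range(p))   # AR part: alpha_i * Y_{t-i}
        ma = sum(beta[j] * e[t - 1 - j] for j in range(q))    # MA part: beta_j * E_{t-j}, j = 1..q
        yhat[t] = zeta + ar + ma
        e[t] = y[t] - yhat[t]                                 # eq. (3)
    return yhat, e

yhat, e = arma_one_step_forecasts([0.2, 0.5, 0.1, 0.4, 0.3],
                                  zeta=0.1, alpha=[0.6], beta=[0.3])
```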
An ARMA model can be represented by a directed graphical model (or Bayes net) with both stochastic and deterministic nodes. A node is deterministic if the value of the variable represented by that node is a deterministic function of the values for the variables represented by the nodes pointing to that node in the graphical representation. From the definition of the ARMA models, we see that the observable variables (the Y's) are represented by deterministic nodes and the error variables (the E's) are represented by stochastic nodes. The relations between variables are defined by (1) and, accordingly, Y_{t−p}, ..., Y_{t−1} and E_{t−q}, ..., E_t all point to Y_t. In this paper we are interested in conditional likelihood models, where we condition on the first R = max(p, q) variables. Relations between variables for t ≤ R can therefore be ignored. The graphical representation for an ARMA(2,2) model is shown in Figure 1. It should be noted that if we artificially extend the time series back in time for R (unobserved) time steps, this model represents what is known in the literature as the exact likelihood model. There are alternative methods for dealing with the beginning of a time series (see, e.g., Box, Jenkins, and Reinsel, 1994).

[Figure 1: ARMA(2,2) model for a time series with five observations. Stochastic nodes are shown as single circles and deterministic nodes as double circles.]
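The parent structure just described is easy to enumerate explicitly. The following sketch is ours (the dictionary-of-parents representation is an assumption for illustration, not the paper's data structure); it lists the parents of each deterministic node Y_t for t > R in an ARMA(p, q) conditional-likelihood model, while each E_t is a parentless stochastic node.

```python
def arma_graph(p, q, T):
    """Edges of the ARMA(p, q) directed graphical model (conditional likelihood).

    Y_t (t > R = max(p, q)) is a deterministic node with parents
    Y_{t-p}, ..., Y_{t-1} and E_{t-q}, ..., E_t; each E_t is stochastic.
    """
    R = max(p, q)
    parents = {}
    for t in range(R + 1, T + 1):
        parents[f"Y{t}"] = (
            [f"Y{t - i}" for i in range(p, 0, -1)] +   # AR parents
            [f"E{t - j}" for j in range(q, -1, -1)]    # MA parents, including E_t itself
        )
    return parents

# ARMA(2, 2) over five observations, as in Figure 1: Y3, Y4, Y5 have parents.
print(arma_graph(2, 2, 5))
```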
2.2 σARMA Models

The EM algorithm is a standard method used to learn the parameters in a graphical model. As argued in the Introduction, and as we will see in Section 3.3, the EM algorithm cannot be applied to estimate the parameters in an ARMA model. In this section we define the σARMA class of models, for which the EM algorithm can be applied.

A σARMA model is identical to an ARMA model except that the deterministic relation in Equation 1 is replaced by a conditional Normal distribution (with variance σ, hence the name). More specifically, for the observable sequence of variables we have Y_t | E_{t−q}^{t}, Y_{t−p}^{t−1} ∼ N(µ, σ), where the functional expressions for the mean µ and the variance σ are shared across the observation variables. The variance is fixed at a given (small) value to be specified by the user. In Section 4 we will see that σ takes the role of a minimum allowed variance for the one step-ahead forecast. The mean is related to the conditioning variables as follows

    µ = ζ + ∑_{j=0}^{q} β_j E_{t−j} + ∑_{i=1}^{p} α_i Y_{t−i}    (4)

We see from this representation that ARMA is the limit of σARMA as σ → 0.

By fixing β_0 = 1 in the ARMA model class, the variance γ of E_t has the semantic of being the variance for the one-step forecast. As we will see in Section 4, γ does not carry the same semantic for a σARMA model. Without this semantic for γ, it seems natural to extend the σARMA model class by additionally letting β_0 vary freely. We will denote this class of models as σARMA∗.

A σARMA (or σARMA∗) model has the same graphical representation as the corresponding ARMA model, except that the deterministic nodes are now stochastic.

2.3 σARMAxp Models

The σARMAxp class of models includes the following generalizations of the σARMA class: (1) a model may define multiple time series, with different p's and q's in different time series, and (2) a time series is allowed additional dependencies on observations from related time series, called cross-predictors (the 'xp' in the name).

Consider the part of a σARMAxp model which describes a particular time series in the set of time series defined by the model. The representation of this time series is similar to a σARMA model, except that Y_t additionally depends on the vector of cross predictors C_t. Let η be a vector of regression coefficients associated with the cross predictors. In this case Y_t | E_{t−q}^{t}, Y_{t−p}^{t−1}, C_t ∼ N(µ, σ) with

    µ = ζ + ∑_{j=0}^{q} β_j E_{t−j} + ∑_{i=1}^{p} α_i Y_{t−i} + ηC_t    (5)

Similar to the σARMA models, we define two variations. In the σARMAxp class, β_0 is fixed as 1, and in the σARMA∗xp class, β_0 varies freely.

As mentioned above, a full σARMAxp model contains multiple time series. The time series are related through the cross predictors only, and hence the unobserved error variables are independent across time series in the model and may have different shared variances for different time series.

Besides the stochastic nature of the σARMAxp models, these models differ from vector ARMA models (see, e.g., Reinsel, 2003) in that (1) different time series in a model may have different numbers of AR and MA regressors, and (2) cross predictors between time series are not defined by the ARMA structure, but are chosen in a selective way for each time series in the model.

The graphical representation of a σARMAxp (or σARMA∗xp) model is shown in Figure 2. The model represents two time series. The first time series (for observations Y_t) corresponds to a σARMA(2,2) model with no cross-predictors, and the second time series (for observations Y′_t) corresponds to a σARMA(1,1) model with a single cross-predictor, Y_{t−1}.
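A generative reading of equation (5) may help: µ is assembled from the intercept and the MA, AR, and cross-predictor terms, and Y_t is then drawn from a Normal with the small user-specified variance σ. The sketch below is our own illustration under that reading (dropping eta and c_t recovers the σARMA case of equation (4)).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma_arma_xp_step(y_past, e_past, e_t, c_t, zeta, alpha, beta, eta, sigma):
    """Draw Y_t ~ N(mu, sigma) with mu as in eq. (5).

    y_past = (Y_{t-1}, ..., Y_{t-p}),  e_past = (E_{t-1}, ..., E_{t-q}),
    e_t is the current error E_t ~ N(0, gamma), c_t the cross-predictor vector.
    beta_0 is fixed at 1 (the sigma-ARMA-xp parameterization).
    """
    mu = (zeta
          + 1.0 * e_t + sum(b * e for b, e in zip(beta, e_past))   # MA part, beta_0 = 1
          + sum(a * y for a, y in zip(alpha, y_past))              # AR part
          + sum(h * c for h, c in zip(eta, c_t)))                  # cross predictors
    # sigma is a variance, so the Normal scale is its square root
    return rng.normal(mu, np.sqrt(sigma))

y_t = sigma_arma_xp_step(y_past=[0.4, 0.2], e_past=[0.1, -0.05], e_t=0.02,
                         c_t=[1.3], zeta=0.0, alpha=[0.5, 0.2],
                         beta=[0.3, 0.1], eta=[0.25], sigma=0.01)
```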

[Figure 2: σARMAxp model for two time series, each with five observations.]

3 σARMAxp Estimation

We seek the maximum likelihood estimate for the conditional log-likelihood of all time series in the σARMAxp model, where we condition on the first R = max(p, q) time steps for each time series in the model. Notice that R may be different for different time series, as is the case for the example in Figure 2. This log-likelihood is, however, difficult to maximize directly due to incomplete data in the statistical model. First of all, the latent error variables render all parameters within a given time series dependent. Secondly, some of the observable variables may have missing observations, in which case cross predictors may cause parameters in different time series to become dependent.

In contrast, given (imaginary) complete data for all variables in the model, the parameters associated with each time series are independent and can therefore be estimated independently. Let β = (β_1, ..., β_q) and α = (α_1, ..., α_p). We denote the parameters in the model by θ = (θ_E, θ_Y), where θ_E = (γ) and θ_Y = (ζ, β, α, η, σ) for σARMAxp and θ_Y = (ζ, β_0, β, α, η, σ) for σARMA∗xp. Conditioning on the first R time steps, the complete-data (conditional) log-likelihood for a particular time series factorizes as follows

    l(θ) = log p(Y_{R+1}^{T}, E_{R+1}^{T} | Y_1^{R}, E_1^{R}, C, θ)
         = ∑_{t=R+1}^{T} log p(Y_t | E_t, Y_{t−p}^{t−1}, E_{t−q}^{t−1}, C_t, θ_Y) + ∑_{t=R+1}^{T} log p(E_t | θ_E)    (6)

Due to the above factorization, the complete-data log-likelihood is easier to maximize. Hence, assuming that data is missing at random, the EM algorithm can be used to solve the maximization problem.

Roughly speaking, the EM algorithm converts the ML estimation problem into an iterative sequence of "pseudo-estimations" with respect to the expected log-likelihood for complete data. Let Θ denote the parameterizations θ for all the individual time series in the model and let Θⁿ denote the current value of Θ after n iterations. Each iteration of the EM algorithm involves two steps.

E-step: For each time series, construct the conditional expectation of the complete-data log-likelihood given the current parameterization and the observed data for all time series in the model: Q(θ|Θⁿ) = E_{Θⁿ}[l(θ)].

M-step: For each time series, choose θ^{n+1} as the value that maximizes Q(θ|Θⁿ).

In this case, where the statistical model is a subfamily of an exponential family, the EM algorithm becomes an alternation between an E-step that computes expected sufficient statistics for the statistical model and an M-step that re-estimates the parameters of the model by treating the expected sufficient statistics as if they were actual sufficient statistics (Dempster, Laird, and Rubin, 1977).
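If the errors were actually observed, the factorization (6) would reduce to two sums of Gaussian log densities. The sketch below evaluates that complete-data conditional log-likelihood for one time series under fully observed y, e, and cross predictors c (an assumption made only for illustration); in the algorithm proper these terms are available only in expectation, which is what the E-step supplies.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def complete_data_loglik(y, e, c, zeta, alpha, beta, eta, sigma, gamma):
    """Complete-data conditional log-likelihood (eq. 6) for one time series.

    Sums, for t = R+1, ..., T, log p(Y_t | E, Y, C_t) + log p(E_t), where
    Y_t | ... ~ N(mu_t, sigma) with mu_t as in eq. (5) and E_t ~ N(0, gamma).
    """
    p, q = len(alpha), len(beta)
    R = max(p, q)
    total = 0.0
    for t in range(R, len(y)):
        mu = (zeta
              + e[t] + sum(beta[j] * e[t - 1 - j] for j in range(q))   # beta_0 = 1
              + sum(alpha[i] * y[t - 1 - i] for i in range(p))
              + float(np.dot(eta, c[t])))
        total += gaussian_logpdf(y[t], mu, sigma) + gaussian_logpdf(e[t], 0.0, gamma)
    return total

# Tiny synthetic example with all quantities treated as observed:
ll = complete_data_loglik(
    y=[0.2, 0.5, 0.1, 0.4], e=[0.0, 0.1, -0.2, 0.05], c=[[1.0]] * 4,
    zeta=0.0, alpha=[0.5], beta=[0.3], eta=[0.2], sigma=0.01, gamma=1.0)
```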
3.1 σARMAxp (fixed β_0)

E-step

By the factorization (6), the expected complete-data log-likelihood becomes

    Q(θ|Θⁿ) = ∑_{t=R+1}^{T} E_{Θⁿ}[ log p(Y_t | E_t, Y_{t−p}^{t−1}, E_{t−q}^{t−1}, C_t, θ_Y) + log p(E_t | θ_E) ]

Defining X_t = (E_{t−q}^{t−1}, Y_{t−p}^{t−1}, C_t) and ϕ = (β, α, η), we use the σARMAxp model definition (5) to obtain

    Q(θ|Θⁿ) = −(1/2) ∑_t E_{Θⁿ}[ (Y_t − (ζ + ϕX_t^⊤ + β_0 E_t))² / σ ] − (1/2) ∑_t E_{Θⁿ}[ E_t² / γ ] + c    (7)

where c is a constant, θ = (ζ, ϕ, γ) are the free parameters in the model, β_0 = 1, and ⊤ denotes transpose.

M-step

The ML estimate for γ is easily obtained as the expected sample variance. That is,

    γ̂ = ∑_t E[E_t²] / (T − R)    (8)

where we have eased the notation by letting E denote expectation with respect to Θⁿ.

Differentiating (7) we obtain the following partial derivatives

    dQ/dϕ ∝ ∑_t E[Y_t X_t^⊤] − ∑_t E[X_t^⊤ X_t] ϕ^⊤ − ∑_t E[X_t^⊤ E_t] − ∑_t E[X_t^⊤] ζ
    dQ/dζ ∝ ∑_t E[Y_t] − ∑_t E[X_t] ϕ^⊤ − ∑_t E[E_t] − ∑_t ζ    (9)

By setting the derivatives to zero and solving this set of equations we obtain the ML estimate for the parameters (ζ, ϕ). The set of equations can be singular (or close to singular), so one should be careful and use a method robust to this situation when solving the equations. We use pseudo-inversion to extend the notion of inverse matrices to singular matrices when solving the equations (see, e.g., Golub and Van Loan, 1996).

By realizing that the σARMAxp model defines a Bayes net, the expected sufficient statistics involved in the above ML estimation can be efficiently computed by the procedure described in Lauritzen and Jensen (2001), or any other stable method for inference in Gaussian Bayes nets. In the use of this procedure, we define an inference structure with cliques (Y_t, E_t, X_t), t = R+1, ..., T, for each time series in the model. Except for the first clique in a time series, all cliques are initially assigned the densities p(Y_t, E_t | X_t) given by the current parameterization. The first clique is assigned the densities p(Y_t, E_t, X_t \ C_t | C_t).

3.2 σARMA∗xp (free β_0)

The set of free parameters in a σARMA∗xp model additionally includes the parameter β_0. Therefore, this time, let X_t = (E_{t−q}^{t}, Y_{t−p}^{t−1}, C_t) and ϕ = (β_0, β, α, η). Differentiating (7) with respect to this alternative set of free parameters, we obtain the following partial derivatives

    dQ/dϕ ∝ ∑_t E[Y_t X_t^⊤] − ∑_t E[X_t^⊤ X_t] ϕ^⊤ − ∑_t E[X_t^⊤] ζ
    dQ/dζ ∝ ∑_t E[Y_t] − ∑_t E[X_t] ϕ^⊤ − ∑_t ζ    (10)

Hence, the M-step is performed by solving (8) and solving the slightly simpler set of equations (10), with one more unknown/equation than (9).
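Setting the derivatives in (9) to zero yields a small linear system in (ζ, ϕ). The sketch below is our own arrangement of that update for the fixed-β_0 case: the expected statistics are passed in as plain arrays (their computation by Gaussian Bayes-net inference is not shown), the system is solved with a pseudo-inverse as recommended above for near-singular cases, and (8) gives γ.

```python
import numpy as np

def m_step_fixed_beta0(EX, EXX, EYX, EXE, EY, EE, EE2):
    """M-step for the sigma-ARMA-xp model with beta_0 = 1 (eqs. 8-9).

    Per-t expected statistics from the E-step:
      EX[t]  = E[X_t]        (k-vector)     EXX[t] = E[X_t^T X_t]  (k x k)
      EYX[t] = E[Y_t X_t]    (k-vector)     EXE[t] = E[X_t E_t]    (k-vector)
      EY[t]  = E[Y_t],  EE[t] = E[E_t],  EE2[t] = E[E_t^2]   (scalars)
    Returns (zeta_hat, phi_hat, gamma_hat).
    """
    n, k = len(EY), len(EX[0])
    A = np.zeros((k + 1, k + 1))
    b = np.zeros(k + 1)
    A[:k, :k] = sum(EXX)                 # sum_t E[X_t^T X_t]
    A[:k, k] = A[k, :k] = sum(EX)        # sum_t E[X_t]
    A[k, k] = n                          # number of terms, T - R
    b[:k] = sum(EYX) - sum(EXE)          # sum_t E[Y_t X_t] - E[X_t E_t]
    b[k] = sum(EY) - sum(EE)             # sum_t E[Y_t] - E[E_t]
    w = np.linalg.pinv(A) @ b            # pseudo-inverse handles singular systems
    phi_hat, zeta_hat = w[:k], w[k]
    gamma_hat = sum(EE2) / n             # eq. (8)
    return zeta_hat, phi_hat, gamma_hat
```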
3.3 EM Only Works for σ > 0

It turns out that the convergence rate for EM decreases with smaller (fixed) values of σ. In particular, the EM algorithm will not be able to improve the initial parameter setting in any significant way when σ = 0. To realize this fact, let σ → 0. In this case, Y_t → ζ + X_t ϕ + E_t for the σARMAxp representation and Y_t → ζ + X_t ϕ for the alternative σARMA∗xp representation. (Recall that X_t and ϕ are defined differently for the two representations.) By inserting into (9) and (10), respectively, we see that in both cases

    dQ/dϕ → 0
    dQ/dζ → 0

Hence, the EM algorithm will not be able to improve the (ζ, ϕ) parameters in either of the model representations. In both representations the γ variance parameter will improve at the first EM step, but because the expected statistics used to compute (8) are not a function of this variance, and because the remaining parameters in both model representations do not change, the models will not improve further.

4 σARMAxp Forecasting

We now consider the problem of using a σARMAxp or σARMA∗xp model to forecast. For a given time-series sequence in the model, the task of forecasting is to calculate the distributions for future observations given a previously observed sequence of observations for that time series and sequences of observations for related cross-predictors. We distinguish between two important types of forecasting: (1) one-step forecasting and (2) multi-step forecasting. In our evaluation of predictive accuracy, we use one-step forecasting.

In one-step forecasting, we are interested in predicting Y_{T+1} given any known observations on the observable variables Y_1^T and C_1^{T+1}. For this situation, the predictive posterior distribution can be computed as follows. Using Bayes-net inference we (pre-)compute the marginal posterior distribution p(E_{T−q+1}^{T}, Y_{T−p+1}^{T} | y_1^T, c_1^T) for the last clique associated with the time series in the inference structure, as defined in Section 3.1. Recall that some of the observable variables may have missing observations. We first consider the simpler situation where y_{T−p+1}^{T} and c_{T+1} are completely observed. Let E[E_{T−q+1}^{T}] and Σ denote the mean vector and covariance matrix for the marginal multivariate Normal distribution p(E_{T−q+1}^{T} | y_1^T, c_1^T). With y_{T−p+1}^{T} and c_{T+1} completely observed, we can derive the predictive posterior distribution p(Y_{T+1} | y_1^T, c_1^{T+1}) as the Normal distribution N(µ∗, σ∗) with mean and variance

    µ∗ = ζ + E[E_{T−q+1}^{T}] β^⊤ + y_{T−p+1}^{T} α^⊤ + c_{T+1} η^⊤    (11)
    σ∗ = σ + β Σ β^⊤ + γ    (12)

This derivation can be verified by following the same inference step as for the situation with missing observations, described below.

Notice from (12) that, compared to an ARMA model, where the variance of the one-step predictive distribution is γ, a σARMAxp model "smoothes" the distribution by adding σ + βΣβ^⊤ to this variance. As Σ is positive definite and hence βΣβ^⊤ is positive, the fixed observation variance σ—specified by the user—can therefore be conceived as the lowest allowed level for the variance of the predictive distribution.

If any of the variables in Y_{T−p+1}^{T} and C_{T+1} are not observed, it becomes more complicated to compute the posterior predictive distribution for Y_{T+1}. In this case, we extend the inference structure with an additional clique (Y_{T−p+1}^{T+1}, E_{T−p+1}^{T+1}, C_{T+1}), initially assigned the conditional densities p(Y_{T+1}, E_{T+1} | Y_{T−p+1}^{T}, E_{T−p+1}^{T}, C_{T+1}) defined by the current parameterization. We then insert any observations we may have on C_{T+1} into this clique and perform a simple Bayes-net inference from the (pre-)computed p(E_{T−q+1}^{T}, Y_{T−p+1}^{T} | y_1^T, c_1^T) to the new clique, by which we obtain p(E_{T−q+1}^{T+1}, Y_{T−p+1}^{T+1} | y_1^T, c_1^{T+1}). The posterior predictive distribution is then created by simple marginalization to Y_{T+1}. It should be noted that this predictive distribution is only an approximation if some of the variables in C_{T+1} are not observed. We choose this approximation in order to avoid complete inference over all time series in the model and in this way allow for fast predictions.

In multi-step forecasting, we are interested in predicting values for variables at multiple future time steps. When forecasting more than one step into the future, we introduce intermediate unobserved prediction variables. For example, say that we are interested in a three-step forecast. In this case we introduce the variables Y_{T+1}, Y_{T+2}, and Y_{T+3}, where only Y_{T+3} is observed. This situation is similar to the incomplete-data situation described above, and the posterior predictive distribution can be obtained in the same way by introducing a clique in the inference structure for each additional time step.
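In the completely observed case, the one-step predictive moments (11)–(12) reduce to a few dot products once the error posterior is available. The sketch below is our own illustration: it assumes the posterior mean and covariance of the last q errors are given (it does not perform the Bayes-net inference itself) and uses a most-recent-first ordering so that β_j multiplies E_{T+1−j}.

```python
import numpy as np

def one_step_forecast(e_mean, e_cov, y_recent, c_next,
                      zeta, alpha, beta, eta, sigma, gamma):
    """One-step predictive mean and variance, eqs. (11)-(12).

    e_mean (length q) and e_cov (q x q) are the posterior mean and covariance of
    the last q errors given y_1^T and c_1^T, ordered most-recent-first so that
    beta_j multiplies E_{T+1-j}.  y_recent holds y_T, ..., y_{T-p+1} (same
    convention for alpha), and c_next is the cross-predictor vector c_{T+1}.
    """
    beta = np.asarray(beta, dtype=float)
    mu_star = (zeta
               + float(np.dot(beta, e_mean))      # MA contribution, eq. (11)
               + float(np.dot(alpha, y_recent))   # AR contribution
               + float(np.dot(eta, c_next)))      # cross-predictor contribution
    # sigma is the fixed observation variance; gamma covers the unobserved E_{T+1}.
    var_star = sigma + float(beta @ np.asarray(e_cov) @ beta) + gamma
    return mu_star, var_star
```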
5 Evaluation

In this section, we provide an empirical evaluation of our methods. We use two collections of data sets: US-Econo and JPN-Econo. The data sets are compiled by Economagic and can be obtained for a fee from http://www.economagic.com. The US-Econo collection contains monthly data on 50 economic indicators from February 1992 to December 2002, as reported by the U.S. Bureau. All the US-Econo data sets are of length 131. The JPN-Econo collection contains monthly data on 53 economic indicators, as reported by the Bank of Japan and the Economic Planning Agency of Japan. The data sets in this collection vary in length from 52 to 259 periods, with both the median and mean lengths larger than 192 periods (16 years). The longest time series begin in January 1981 and all end in July 2002.

Each data set is standardized before modeling—that is, for each variable we subtract the mean value and divide by the standard deviation. We divide each data set into a training set, used as input to the learning method, and a holdout set, used to evaluate the models. We use the last twelve observations as the holdout set, knowing that the data are monthly.

In order to evaluate our models on incomplete data, we have created—for each of the two data collections—five corresponding incomplete data collections with respectively 10, 20, 30, 40, and 50 percent of randomly missing observations in each of the training sets in a collection. Corresponding to the incomplete data collections, we have also created corresponding collections of filled-in data sets, where a missing observation is filled in by linear interpolation or extrapolation of the two closest (observed) data points in the training set.

In our experiments, we perform the following comparisons. One, for the complete data collections, predictions for ARMA models learned by the Levenberg-Marquardt method for ML estimation are compared to predictions for the σARMA and σARMA∗ models. This comparison will illustrate the smoothing effect of stochastic ARMA models. Two, to illustrate the importance of handling missing data in a theoretically sound way, we compare predictions for σARMA models trained on the incomplete data sets with those for σARMA models trained on the filled-in data. Three, the effect of cross predictors is illustrated by comparisons of predictions for σARMAxp and σARMA∗xp models with predictions for σARMA and σARMA∗ models, respectively, all trained on the complete data sets.

We evaluate the quality of a learned model by computing the sequential predictive score for the holdout data set corresponding to the training data from which the model was learned. The sequential predictive score for a model is simply the average log-likelihood obtained by a one-step forecast for each of the observations in the holdout set. To evaluate the quality of a learning method, we compute the average of the sequential predictive scores obtained for each of the time series in a collection. Note that the use of the log-likelihood to measure performance simultaneously evaluates both the accuracy of the estimate and the accuracy of the variance of the estimate.

Finally, we use a (one-sided) sign test at significance level 5% to evaluate whether one method is significantly better than another. To form the sign test, we count the number of times one method improves the predictive score over the other for each individual time series in a collection. Excluding ties, we seek to reject the hypothesis of equality, where the test statistic for the sign test follows a binomial distribution with probability parameter 0.5.
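The sign test used above amounts to a binomial tail probability. A minimal sketch (ours, using SciPy's binomial distribution) is: count the wins of one method over the other across the series in a collection, drop ties, and compare the one-sided p-value with 0.05.

```python
from scipy.stats import binom

def one_sided_sign_test(wins, losses):
    """One-sided sign test: p-value for observing at least `wins` successes
    out of wins + losses trials under a Binomial(n, 0.5) null (ties excluded)."""
    n = wins + losses
    p_value = binom.sf(wins - 1, n, 0.5)   # P(X >= wins)
    return p_value, p_value < 0.05

# Example: method A beats method B on 35 of 50 series (ties already removed).
print(one_sided_sign_test(35, 15))
```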
In all experiments we search for the best structure for the models—that is, the best p and q for ARMA, σARMA, and σARMA∗ models, and additionally the best set of cross-predictors for each time-series sequence in σARMAxp and σARMA∗xp models. For the structural search, we split the training data set into a structural training set and a structural validation set. We use the last twelve observations of the training data as the structural validation set. During structural search, we use the structural training set to estimate parameters, and we use the structural validation set to evaluate alternative models by the sequential predictive score. After the model structure is selected, we finally use the full training data set to estimate parameters.

We use a greedy search strategy when deciding which models to evaluate during the structural search. The search over ARMA(p, q), σARMA(p, q), and σARMA∗(p, q) models has an outer loop that iteratively increases p by one, starting from p = 0. For each p, we greedily search for the best q by incrementing or decrementing q from its current value. (We initialize q to zero for p = 0.) The search stops when increasing p and searching for the best q does not improve the model over the model selected for the previous p.

The search strategy for σARMAxp and σARMA∗xp models is modified to handle cross predictors. In these experiments, all time-series sequences are completely observed. A σARMAxp model therefore factorizes into the individual time series with cross predictors, and each component is learned independently as follows. We first create a ranked list of cross predictors expected to have the most influence on the time-series component. Cross predictors are ranked according to the sequential predictive score on the structural validation data for models with p = q = 0 and only one cross predictor. We then greedily search for p, q, and the cross predictors. In the outer loop, we increase p by one, starting from p = 0. In a middle loop, we greedily search for the best q by incrementing or decrementing q from its current value. (Again, we initialize q to zero for p = 0.) In an inner loop, we add cross predictors in rank order until the score does not increase. If the first added cross predictor does not increase the score, we delete cross predictors in reverse rank order until the score does not increase. (Initially, for p = q = 0, no cross predictors are included.) The search stops when increasing p and searching for the best q and the best cross-predictors does not improve the model over the model selected for the previous p.

Recall from Section 4 that the stochastic models proposed in this paper smooth the parameters in the model in a way that sets the lowest allowable variance for the predictive distribution to the user-specified observation variance, σ. Similarly, the predictive distribution for an ARMA model can be smoothed by adding σ directly to the variance of the predictive distribution. In our first comparison, we show that this ad hoc smoothing method is not as good as the smoothing afforded by our stochastic models. We perform these comparisons using σ = 0.01, although the results are not sensitive to this value.
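The nested greedy search described above can be condensed to a few lines. The sketch below is our own condensation for the ARMA/σARMA/σARMA∗ case, with a hypothetical score(p, q) callback standing in for "fit on the structural training set, evaluate the sequential predictive score on the structural validation set"; the cross-predictor variant adds the inner add/delete loop described above.

```python
def greedy_pq_search(score, max_p=10):
    """Greedy structural search over (p, q).

    `score(p, q)` is assumed to train a model with the given orders and return
    its sequential predictive score on the structural validation set
    (higher is better).  `max_p` is only a safety cap on the outer loop.
    """
    def best_q_for(p, q_start):
        q, best = q_start, score(p, q_start)
        for step in (+1, -1):                  # try incrementing, then decrementing q
            while q + step >= 0:
                cand = score(p, q + step)
                if cand <= best:
                    break
                q, best = q + step, cand
        return q, best

    best_p, best_q, best_score = 0, 0, score(0, 0)
    for p in range(1, max_p + 1):              # outer loop: increase p by one
        q, s = best_q_for(p, best_q)
        if s <= best_score:                    # stop when p no longer helps
            break
        best_p, best_q, best_score = p, q, s
    return best_p, best_q, best_score
```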

6 Results

We first consider the predictive accuracy of ARMA versus σARMA and σARMA∗ models on single time series. It is conventional wisdom that most time-series data can be adequately represented by ARMA(p, q) models with lag R = max(p, q) not greater than two (see, e.g., Box, Jenkins, and Reinsel, 1994). To verify that our method with unrestricted lag does not overfit, we ran experiments where we restricted the maximum lag R at varying values. As we increased R, the average predictive score for all experiments increased for all three model classes. In fact, the average predictive score for the unrestricted experiments was the best, but not significantly better than that for any restricted lag. In the following, we will report results obtained for unrestricted lag.

Table 1: ARMA, smoothed ARMA, and σARMA average sequential predictive scores for complete data collections.

    Collection   Method                Ave. score
    US-Econo     ARMA                  -4.928
    US-Econo     Smoothed ARMA         -4.599
    US-Econo     σARMA (fixed β_0)     -4.520
    US-Econo     σARMA∗ (free β_0)     -4.533
    JPN-Econo    ARMA                  -2.935
    JPN-Econo    Smoothed ARMA         -2.095
    JPN-Econo    σARMA (fixed β_0)     -2.023
    JPN-Econo    σARMA∗ (free β_0)     -1.994

[Figure 3: Average sequential predictive scores for complete and incomplete data collections. Two panels (US-Econo and JPN-Econo) plot the average score against the percentage of missing data (0%–50%) for σARMA, σARMA∗, and σARMA trained on filled-in data.]

Table 2: σARMAxp and σARMA∗xp average sequential predictive scores for complete data collections.

    Collection   Method                  Ave. score
    US-Econo     σARMAxp (fixed β_0)     -4.506
    US-Econo     σARMA∗xp (free β_0)     -4.526
    JPN-Econo    σARMAxp (fixed β_0)     -2.019
    JPN-Econo    σARMA∗xp (free β_0)     -1.980
Table 1 shows average sequential predictive scores for each of the two collections of complete data sets—US-Econo and JPN-Econo—for ARMA, smoothed ARMA, σARMA (fixed β_0), and σARMA∗ (free β_0). Figure 3 shows predictive scores as the data becomes more and more incomplete. In these experiments, σARMA models are trained on both incomplete data and filled-in data. Note that we have not included results for JPN-Econo with 50% missing data in the figure; these results follow the same trend as for less missing data, but the scores are too small to show in the chosen data range.

For both collections, we see that smoothing matters: σARMA and σARMA∗ predict more accurately than smoothed ARMA, and smoothed ARMA predicts more accurately than ARMA. All differences are significant by the sign test described previously. In contrast, the predictive accuracies of σARMA and σARMA∗ are not significantly different. We also see that treating missing data in a principled way is important. In particular, as the data becomes more and more incomplete, σARMA learned using EM with missing data yields more accurate predictions than does σARMA learned using filled-in data. The differences are significant for 30% missing data and more for both data collections.

Now we consider multiple time series and the effect of cross predictors. Table 2 shows average sequential predictive scores for σARMAxp and σARMA∗xp models for the two complete data collections. We see that, in all cases, allowing cross-predictors improves the average score slightly. For the US-Econo data collection, the improvement is significant for σARMAxp. For the JPN-Econo data collection, the improvement is significant for both σARMAxp and σARMA∗xp.

7 Related and Future Work

The most common approach to handling incompleteness of data for ARMA models is to transform the model into a state-space model and perform Kalman filtering—and possibly Kalman smoothing (see, e.g., Jones, 1980). As described in Ghahramani (1998), it is well known that the Kalman filter for an ARMA model can be represented as forward inference in a dynamic Bayesian network with the graphical structure shown in Figure 4, where X_t is a state vector and Y_t an observation variable. Many different state-space representations are possible for an ARMA model, but these representations are not necessarily intuitive and take some effort to construct. In contrast, the σARMA representation is straightforward, and the graphical representation clearly describes the model structure; as such, the σARMA representation is easy to extend to more sophisticated models.

[Figure 4: State-space model.]

In a forthcoming paper we plan to show that one filtering step in the Kalman filter for an ARMA model is equivalent to a one-step forward propagation of evidence in the clique representation for the σARMA graphical model, and similarly for Kalman smoothing and back-propagation. Therefore, any optimization algorithm used to estimate ARMA models in the state-space representation should be applicable to our σARMA (with σ = 0) representation as well, and we should get the same result in either representation. Conversely, the EM algorithm should be applicable to the state-space representation of a smoothed ARMA model.
As an alternative to the Kalman-filter approach for handling missing data in ARMA models, Penzer and Shea (1997) suggest an approach relying on a Cholesky decomposition of the banded covariance matrix for a transformation of the observations in the time series. They demonstrate that this approach is computationally superior for high-order ARMA models with many more AR regressors than MA regressors, and becomes less favorable as the amount of missing data increases, the models become smaller, or more MA regressors are introduced.

Turning our attention to the cross-predictor extension of the σARMA models, the autoregressive tree (ART) models, as defined in Meek et al. (2002), allow for selective cross predictors from related time series. ART models, however, are generalizations of AR models and have no MA component. In a slightly different interpretation of cross predictors, Bach and Jordan (2004) are concerned with finding a graphical-model structure for entire time series. Their algorithm learns relations between entire time series for spectral representations of the time series.

Finally, as a further generalization of the stochastic ARMA models defined in this paper, it seems reasonable to let the observation variance σ vary freely and estimate this parameter from data as well. However, in preliminary experiments we have experienced that σ quickly converges to zero and the EM algorithm will thereafter not be able to improve the remaining parameters (see Section 3.3). We intend to investigate this behaviour further, and hope to find a theoretical explanation.

Acknowledgments

We thank Jessica Lin for the implementation of ARMA.

References

Ansley, C. F. (1979). An algorithm for the exact likelihood of a mixed autoregressive-moving average process. Biometrika, 66, 59–65.

Bach, F. R., & Jordan, M. I. (2004). Learning graphical models for stationary time series. IEEE Transactions on Signal Processing, to appear.

Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (1994). Time series analysis. New Jersey: Prentice Hall.

Cooper, G., Horvitz, E., & Heckerman, D. (1988). A model for temporal probabilistic reasoning (Technical Report KSL-88-30). Stanford University, Section on Medical Informatics, Stanford, California.

Dean, T. L., & Kanazawa, K. (1988). Probabilistic temporal reasoning (Technical Report). Brown University.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B 39, 1–38.

Ghahramani, Z. (1998). Learning dynamic Bayesian networks. In Adaptive processing of sequences and data structures. Lecture notes in artificial intelligence, 168–197. Berlin: Springer-Verlag.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. London: The Johns Hopkins University Press.

Jones, R. H. (1980). Maximum likelihood fitting of ARMA models to time series with missing observations. Technometrics, 22, 389–395.

Lauritzen, S. L., & Jensen, F. (2001). Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11, 191–203.

Meek, C., Chickering, D., & Heckerman, D. (2002). Autoregressive tree models for time-series analysis. Proceedings of the Second International SIAM Conference on Data Mining (pp. 229–244). Arlington, VA: SIAM.

Penzer, J., & Shea, B. (1997). The exact likelihood of an autoregressive-moving average model with incomplete data. Biometrika, 84, 919–928.

Reinsel, G. C. (2003). Elements of multivariate time series analysis. New York: Springer-Verlag.