
Communicated by Richard Zemel

Predictive Minimum Description Length Criterion for Time Series Modeling with Neural Networks

Mikko Lehtokangas, Jukka Saarinen
Tampere University of Technology, Microelectronics Laboratory, P.O. Box 692, FIN-33101 Tampere, Finland

Pentti Huuhtanen
University of Tampere, Department of Mathematical Sciences, P.O. Box 607, FIN-33101 Tampere, Finland

Kimmo Kaski
Tampere University of Technology, Microelectronics Laboratory, P.O. Box 692, FIN-33101 Tampere, Finland

Neural Computation 8, 583-593 (1996). © 1996 Massachusetts Institute of Technology

Nonlinear time series modeling with a multilayer perceptron network is presented. An important aspect of this modeling is model selection, i.e., the problem of determining the size as well as the complexity of the model. To overcome this problem we apply the predictive minimum description length (PMDL) principle as a minimization criterion. In the neural network scheme it means minimizing the number of input and hidden units. Three time series modeling experiments are used to examine the usefulness of the PMDL model selection scheme. A comparison with the widely used cross-validation technique is also presented. In our experiments the PMDL scheme and the cross-validation scheme yield similar results in terms of model complexity. However, the PMDL method was found to be two times faster to compute. This is a significant improvement since model selection in general is very time consuming.

1 Introduction

During the past 70 years time series analysis has become a highly developed subject. The first attempts date back to the 1920s when Yule (1927) applied a linear autoregressive model to the study of sunspot numbers. In the 1950s the basic theory of stationary time series was covered by Doob (1953). Now there are well-established methods for fitting a wide range of models to time series data. The most well known is the set of linear probabilistic ARIMA models (Box and Jenkins 1970). There are also several nonlinear models, e.g., Volterra series (Volterra 1959), bilinear models (Priestley 1988), threshold AR models (TAR; Tong 1983), exponential AR models (EXPAR; Ozaki 1978), and AR models with conditional heteroscedasticity (ARCH; Engle 1982). Recently several neural network techniques such as the multilayer perceptron (MLP) network (Rumelhart et al. 1986) and the radial basis function (RBF) network (Powell 1987; Moody and Darken 1988) have also been applied to time series modeling. The common factor in neural network techniques is that the models are constructed from simple processing units that perform a nonlinear input-output mapping. The nonlinear nature of these units makes neural network techniques well suited for nonlinear modeling.

An important aspect of the modeling is the problem of model selection, i.e., the problem of determining the size and complexity of the model (Weigend et al. 1990). There is a trade-off because an undersized model does not have the power to model the given data. On the other hand, an oversized model has a tendency to perform poorly on unseen data. To overcome this problem we apply the predictive minimum description length (PMDL) principle (Rissanen 1984). It provides a criterion for minimizing the complexity of the model. Using the systematic PMDL procedure, MLP networks are applied to time series modeling and prediction.
The method reduces the risk of under- or overfitting the model.

2 Multilayer Perceptron Neural Network

In this study we used the MLP architecture shown in Figure 1 for time series modeling. The number of input units is p and the number of hidden units is q. The notation MLP(p, q) will be used to refer to a specific network structure. The weights in the connections between the input and hidden layer are denoted by w_{ij}, and the weights between the hidden and output layer are denoted by v_j. In addition, the hidden and output neurons have the bias terms w_{0j} and v_0, respectively. The activation function in the hidden units was chosen to be the hyperbolic tangent (tanh) function. The output neuron was set to be linear. The mathematical formula for the network can be written as

\hat{x}_{t+1} = v_0 + \sum_{j=1}^{q} v_j \tanh\Bigl( w_{0j} + \sum_{i=1}^{p} w_{ij} x_{t-i+1} \Bigr)    (2.1)

Figure 1: Three layer perceptron network with single output.

The training of the network was done in two phases. First, the initial values for the weights were calculated with the orthogonal least squares algorithm (Lehtokangas et al. 1995). In the second phase the standard backpropagation algorithm was used for weight adjusting (Rumelhart et al. 1986). Note that the first initialization phase was used merely to speed up the backpropagation training. This does not affect the model selection results, which are shown in the experiments section.
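As a concrete illustration of the mapping in equation 2.1, the following is a minimal NumPy sketch of the MLP(p, q) forward pass for one-step-ahead prediction. The function name, argument layout, and example weight values are illustrative assumptions; the authors' actual implementation (orthogonal least squares initialization followed by backpropagation training) is not reproduced here.

```python
import numpy as np

def mlp_predict(x_window, W, w0, v, v0):
    """One-step-ahead prediction with an MLP(p, q) as in equation 2.1.

    x_window : the p most recent observations [x_t, x_{t-1}, ..., x_{t-p+1}]
    W        : (q, p) array of input-to-hidden weights w_ij
    w0       : (q,) array of hidden-unit bias terms w_0j
    v        : (q,) array of hidden-to-output weights v_j
    v0       : output bias term v_0
    """
    hidden = np.tanh(W @ x_window + w0)  # q hyperbolic tangent hidden units
    return v0 + v @ hidden               # linear output unit

# Illustrative use: an MLP(3, 5) with small random initial weights.
rng = np.random.default_rng(0)
p, q = 3, 5
W = 0.1 * rng.standard_normal((q, p))
w0 = np.zeros(q)
v = 0.1 * rng.standard_normal(q)
v0 = 0.0
x_hat = mlp_predict(np.array([0.5, 0.4, 0.3]), W, w0, v, v0)
print(x_hat)
```

In the model selection setting of this paper, p and q are the quantities to be chosen; the sketch only shows how a fixed structure maps a window of past values to a prediction.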
3 Stochastic Complexity and the Predictive MDL Principle

Even though time series modeling has been studied extensively, there has been no definite methodology for solving the model selection problem. Also, despite the existing model selection techniques there is still a tendency to use models that have an excessive number of parameters. For instance, in neural network modeling too many units in the hidden layer are commonly used. Usually an excessive number of parameters deteriorates the generalization properties of a model. Therefore, it is important to find a model that has the simplest possible structure for the given problem.

Inspired by the algorithmic notion of complexity (Solomonoff 1964; Kolmogorov 1965; Chaitin 1966) as well as Akaike's work (Akaike 1977), Rissanen proposed the shortest code length for the observed data as a criterion for model selection (Rissanen 1978). In subsequent papers (Rissanen 1983, 1984, 1986, 1987) this gradually evolved into stochastic complexity, which is briefly described in the following. For applications, the most important coding system is obtained from a class of parametric probability models:

M = \{ f(x \mid \theta), \pi(\theta) : \theta \in \Omega^k, \ k = 1, 2, \ldots \}    (3.1)

in which \Omega^k is a subset of the k-dimensional Euclidean space with nonempty interior. Hence, there are k "free" parameters. The stochastic complexity of x, relative to the model class M, is now, according to Rissanen (1987),

I(x \mid M) = -\log f(x \mid M), \quad \text{with} \quad f(x \mid M) = \int_{\Omega^k} f(x \mid \theta) \, d\pi(\theta)    (3.2)

Although the model class M includes the so-called "prior" distribution π, its role in expressing prior knowledge is here not different from that of f(x | θ). In fact, the former need not be selected at all, for it is possible to construct it from the model class as a generalization of Jeffreys' prior (Clarke and Barron 1993; Rissanen 1993). Also, particularly important pairs of distributions f(x | θ) and π(θ) are the so-called conjugate distributions, because for them the integral 3.2 can be evaluated in closed form.

The stochastic complexity now represents the shortest code length attainable by the given model class. Yet there is at least one problem to solve, namely the integral in 3.2. Various ways to approximate the integral are discussed in Rissanen (1987). In the following, one approximate version, the so-called predictive MDL principle, is presented.

Frequently, for example in curve fitting and related problems, the models are not primarily expressed in terms of a distribution. Rather, we are given a parametric predictor \hat{x}_{t+1} = F(x^t | θ), as in the case of neural networks, for which x^t = [x_1, ..., x_t] is the input and θ denotes the array of all the weights as parameters. In addition, there is a distance function δ(ε_t) for measuring the prediction error ε_t = x_t - \hat{x}_t. Such a prediction model can immediately be reduced to a probabilistic model. In this case a conditional gaussian distribution can be defined for the prediction errors as follows:

f(x_{t+1} \mid x^t, \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl( -\frac{(x_{t+1} - \hat{x}_{t+1})^2}{2\sigma^2} \Bigr)    (3.3)

in which x^t = x_1, ..., x_t. The negative logarithm of the density 3.3 is taken, and it is extended to sequences by multiplication as follows:

-\ln f(x^n \mid \theta, \sigma^2) = -\sum_{t=0}^{n-1} \ln f(x_{t+1} \mid x^t, \theta, \sigma^2)    (3.4)

The above code length can also be written in the form

-\sum_{t=0}^{n-1} \ln f(x_{t+1} \mid x^t, \hat{\theta}_{t+1}, \hat{\sigma}^2_{t+1}) + \sum_{t=0}^{n-1} \ln \frac{ f(x_{t+1} \mid x^t, \hat{\theta}_{t+1}, \hat{\sigma}^2_{t+1}) }{ f(x_{t+1} \mid x^t, \hat{\theta}_t, \hat{\sigma}^2_t) }    (3.5)

The first term in 3.5 is the predictive code length for the data, or Shannon's information. The additional code length represented by the sum in 3.5 is the model cost, i.e., the code length needed to encode the model. It has been proven by Rissanen (1994) that the sum term in 3.5 is asymptotically k log(n)/2.

After having fixed the model class, we have the problem of estimating the shortest code length attainable with this class of models. Let θ(x^t) and σ²(x^t) be written briefly as θ̂_t and σ̂²_t. They are the maximum likelihood estimates, i.e., the parameter values that minimize the code length -ln f(x^t | θ, σ²) for the past data. In particular,

\hat{\sigma}^2_t = \frac{1}{t} \sum_{\tau=0}^{t-1} (x_{\tau+1} - \hat{x}_{\tau+1})^2    (3.6)

The predictive code length for the data and the model is given now by

-\ln f(x^n \mid k) = \frac{1}{2} \sum_{t=0}^{n-1} \Bigl[ \hat{\varepsilon}^2_{t+1} / \hat{\sigma}^2_t + 2 \ln \hat{\sigma}_t \Bigr] + \frac{n}{2} \ln(2\pi)    (3.7)

in which a suitable initial value for σ̂²_0 is picked. In this form the model cost appears only implicitly.
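To show how the criterion 3.7 can be accumulated sequentially, the following is a minimal Python sketch that computes the predictive code length from a series of one-step prediction errors, each assumed to come from a model estimated on the past data x^t only. The function name, the running maximum likelihood variance update, and the handling of the initial value σ̂²_0 are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def pmdl_code_length(errors, sigma2_init=1.0):
    """Predictive code length of equation 3.7.

    errors      : one-step prediction errors eps_{t+1} = x_{t+1} - xhat_{t+1},
                  where each xhat_{t+1} comes from a model fitted to x^t only
    sigma2_init : suitable initial value for the error variance estimate sigma_0^2
    """
    n = len(errors)
    code_length = 0.5 * n * np.log(2.0 * np.pi)
    sigma2 = sigma2_init   # sigma_t^2, estimated from the past data only
    sum_sq = 0.0
    for t, eps in enumerate(errors):
        # accumulate (1/2) [ eps^2 / sigma_t^2 + 2 ln sigma_t ]
        code_length += 0.5 * (eps**2 / sigma2 + np.log(sigma2))
        # update the maximum likelihood variance estimate with the new error
        sum_sq += eps**2
        sigma2 = sum_sq / (t + 1)
    return code_length

# Illustrative use with made-up errors; in model selection the MLP(p, q)
# structure giving the smallest code length would be chosen.
example_errors = np.array([0.3, -0.1, 0.2, 0.05, -0.15])
print(pmdl_code_length(example_errors))
```

Because the variance estimate at step t uses only errors observed before step t, the code length implicitly penalizes complex structures whose early predictions are poor, which is how the model cost enters without an explicit parameter-count term.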