
Communicated by Richard Zemel

Predictive Minimum Description Length Criterion for Modeling with Neural Networks

Mikko Lehtokangas and Jukka Saarinen
Tampere University of Technology, Microelectronics Laboratory, P.O. Box 692, FIN-33101 Tampere, Finland

Pentti Huuhtanen
University of Tampere, Department of Mathematical Sciences, P.O. Box 607, FIN-33101 Tampere, Finland

Kimmo Kaski
Tampere University of Technology, Microelectronics Laboratory, P.O. Box 692, FIN-33101 Tampere, Finland

Nonlinear time series modeling with a multilayer perceptron network is presented. An important aspect of this modeling is model selection, i.e., the problem of determining the size as well as the complexity of the model. To overcome this problem we apply the predictive minimum description length (PMDL) principle as a minimization criterion. In the neural network scheme it means minimizing the number of input and hidden units. Three time series modeling experiments are used to examine the usefulness of the PMDL scheme. A comparison with the widely used cross-validation technique is also presented. In our experiments the PMDL scheme and the cross-validation scheme yield similar results in terms of model complexity. However, the PMDL method was found to be two times faster to compute. This is a significant improvement, since model selection in general is very time consuming.

1 Introduction

During the past 70 years time series analysis has become a highly developed subject. The first attempts date back to the 1920s, when Yule (1927) applied a linear autoregressive model to the study of sunspot numbers. In the 1950s the basic theory of stationary time series was covered by Doob (1953). Now there are well-established methods for fitting a wide range of models to time series data. The best known is the set of linear probabilistic ARIMA models (Box and Jenkins 1970). There are



also several nonlinear models, e.g., Volterra series (Volterra 1959), bilinear models (Priestley 1988), threshold AR models (TAR; Tong 1983), exponential AR models (EXPAR; Ozaki 1978), and AR models with conditional heteroscedasticity (ARCH; Engle 1982). Recently several neural network techniques, such as the multilayer perceptron (MLP) network (Rumelhart et al. 1986) and the radial basis function (RBF) network (Powell 1987; Moody and Darken 1988), have also been applied to time series modeling. The common factor in neural network techniques is that the models are constructed from simple processing units that perform nonlinear input-output mappings. The nonlinear nature of these units makes neural network techniques well suited for nonlinear modeling.

An important aspect of the modeling is the problem of model selection, i.e., the problem of determining the size and complexity of the model (Weigend et al. 1990). There is a trade-off because an undersized model does not have the power to model the given data, while an oversized model has a tendency to perform poorly on unseen data. To overcome this problem we apply the predictive minimum description length (PMDL) principle (Rissanen 1984). It provides a criterion for minimizing the complexity of the model. Using the systematic PMDL procedure, MLP networks are applied to time series modeling and prediction. The method reduces the risk of under- or overfitting the model.

2 Multilayer Perceptron Neural Network

In this study we used the MLP architecture shown in Figure 1 for time series modeling. The number of input units is $p$ and the number of hidden units is $q$. The notation MLP($p$, $q$) will be used to refer to a specific network structure. The weights in the connections between the input and hidden layers are denoted by $w_{ij}$, and the weights between the hidden and output layers are denoted by $v_j$. In addition, the hidden and output neurons have the bias terms $w_{0j}$ and $v_0$, respectively. The activation function of the hidden units was chosen to be the hyperbolic tangent (tanh) function, and the output neuron was set to be linear. The mathematical formula for the network can be written as

$$\hat{x}_{t+1} = v_0 + \sum_{j=1}^{q} v_j \tanh\Bigl( w_{0j} + \sum_{i=1}^{p} w_{ij}\, x_{t-i+1} \Bigr) \tag{2.1}$$

The training of the network was done in two phases. First, the initial values for the weights were calculated with the orthogonal least squares algorithm (Lehtokangas et al. 1995). In the second phase the standard backpropagation algorithm was used for weight adjustment (Rumelhart et al. 1986). Note that the initialization phase was used merely to speed up the backpropagation training; it does not affect the model selection results shown in the experiments section.
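For concreteness, here is a minimal NumPy sketch of the network function in equation 2.1. The weight names mirror the notation above; the random initialization is only a placeholder and does not reproduce the orthogonal least squares procedure of Lehtokangas et al. (1995).

```python
import numpy as np

def mlp_predict(x_window, w, w0, v, v0):
    """One-step prediction with an MLP(p, q) as in equation 2.1.

    x_window : the p most recent observations x_t, ..., x_{t-p+1}
    w        : (q, p) input-to-hidden weights w_ij
    w0       : (q,)   hidden-unit biases w_0j
    v        : (q,)   hidden-to-output weights v_j
    v0       : scalar output bias v_0
    """
    hidden = np.tanh(w0 + w @ x_window)  # tanh hidden units
    return v0 + v @ hidden               # linear output unit

# Example: an (untrained) MLP(3, 5) predicting from the last 3 values
rng = np.random.default_rng(0)
p, q = 3, 5
w, w0 = rng.normal(size=(q, p)), rng.normal(size=q)
v, v0 = rng.normal(size=q), 0.0
x_next = mlp_predict(np.array([0.1, -0.2, 0.4]), w, w0, v, v0)
```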


Figure 1: Three-layer perceptron network with a single output.

3 Stochastic Complexity and the Predictive MDL Principle

Even though time series modeling has been studied extensively, there has been no definite methodology for solving the model selection problem. Also, despite the existing model selection techniques, there is still a tendency to use models that have an excessive number of parameters. For instance, in neural network modeling too many units in the hidden layer are commonly used. Usually an excessive number of parameters deteriorates the generalization properties of a model. Therefore, it is important to find a model that has the simplest possible structure for the given problem.

Inspired by the algorithmic notion of complexity (Solomonoff 1964; Kolmogorov 1965; Chaitin 1966) as well as Akaike's work (Akaike 1977), Rissanen proposed the shortest code length for the observed data as a criterion for model selection (Rissanen 1978). In subsequent papers (Rissanen 1983, 1984, 1986, 1987) this gradually evolved into stochastic complexity, which is briefly described in the following. For applications, the most important coding system is obtained from a class of parametric probability models

$$\mathcal{M} = \{ f(x \mid \theta), \pi(\theta) \mid \theta \in \Omega^k,\ k = 1, 2, \ldots \} \tag{3.1}$$

in which $\Omega^k$ is a subset of the $k$-dimensional Euclidean space with nonempty interior. Hence, there are $k$ "free" parameters. The stochastic complexity of $x$ relative to the model class $\mathcal{M}$ is now, according to Rissanen (1987),

$$I(x \mid \mathcal{M}) = -\log f(x \mid \mathcal{M}), \qquad f(x \mid \mathcal{M}) = \int f(x \mid \theta)\, d\pi(\theta) \tag{3.2}$$

Although the model class $\mathcal{M}$ includes the so-called "prior" distribution $\pi$, its role in expressing prior knowledge is here not different from that


of $f(x \mid \theta)$. In fact, the former need not be selected at all, for it is possible to construct it from the model class as a generalization of Jeffreys' prior (Clarke and Barron 1993; Rissanen 1993). Also, particularly important pairs of distributions $f(x \mid \theta)$ and $\pi(\theta)$ are the so-called conjugate distributions, because for them the integral 3.2 can be evaluated in closed form. The stochastic complexity now represents the shortest code length attainable by the given model class. Yet there is at least one problem to solve, namely the integral in 3.2. Various ways to approximate the integral are discussed in Rissanen (1987). In the following, one approximate version, the so-called predictive MDL principle, is presented.

Frequently, for example in curve fitting and related problems, the models are not primarily expressed in terms of a distribution. Rather, we are given a parametric predictor $\hat{x}_{t+1} = F(x^t \mid \theta)$, as in the case of neural networks, for which $x^t = [x_1, \ldots, x_t]$ is the input and $\theta$ denotes the array of all the weights as parameters. In addition, there is a distance function $\delta(\varepsilon_t)$ for measuring the prediction error $\varepsilon_t = x_t - \hat{x}_t$. Such a prediction model can immediately be reduced to a probabilistic model. In this case a conditional gaussian distribution can be defined for the prediction errors as follows:

$$f(x_{t+1} \mid x^t, \theta, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Bigl( -\frac{\varepsilon_{t+1}^2}{2\sigma^2} \Bigr) \tag{3.3}$$

in which $x^t = x_1, \ldots, x_t$. Taking the negative logarithm of the density 3.3 and extending it to sequences by multiplication gives

$$-\ln f(x^n \mid \theta, \sigma^2) = -\sum_{t=0}^{n-1} \ln f(x_{t+1} \mid x^t, \theta, \sigma^2) \tag{3.4}$$

The above code length can also be written in the form

$$-\ln f(x^n \mid \hat{\theta}_n, \hat{\sigma}^2_n) + \sum_{t=0}^{n-1} \ln \frac{f(x_{t+1} \mid x^t, \hat{\theta}_{t+1}, \hat{\sigma}^2_{t+1})}{f(x_{t+1} \mid x^t, \hat{\theta}_t, \hat{\sigma}^2_t)} \tag{3.5}$$

The first term in 3.5 is the predictive code length for the data, or Shannon's information. The additional code length represented by the sum in 3.5 is the model cost, i.e., the code length needed to encode the model. It has been proven by Rissanen (1994) that the sum term in 3.5 is asymptotically $k \log(n)/2$.

After having fixed the model class, we have the problem of estimating the shortest code length attainable with this class of models. Let $\hat{\theta}(x^t)$ and $\hat{\sigma}^2(x^t)$ be written briefly as $\hat{\theta}_t$ and $\hat{\sigma}^2_t$. They are the maximum likelihood estimates, i.e., the parameter values that minimize the code length $-\ln f(x^t \mid \theta, \sigma^2)$ for the past data. In particular,

$$\hat{\sigma}^2_t = \frac{1}{t} \sum_{\tau=1}^{t} \varepsilon_\tau^2 \tag{3.6}$$


The predictive code length for the data and the model is now given by

$$-\ln f(x^n \mid k) = \frac{1}{2} \sum_{t=0}^{n-1} \bigl[ \varepsilon_{t+1}^2 / \hat{\sigma}^2_t + 2 \ln \hat{\sigma}_t \bigr] + \frac{n}{2} \ln(2\pi) \tag{3.7}$$

in which a suitable initial value for $\hat{\sigma}^2_0$ is picked. In this form the model cost appears only implicitly. However, as equation 3.5 showed, the model cost is indeed included in this criterion. Therefore, in the predictive MDL algorithm the network parameters need not be encoded, for they can be calculated from the past string by an algorithm. Hence, the model cost gets added to the prediction errors, and overfitting and underfitting characteristics are penalized automatically. More details of the MDL principles have been described in Rissanen (1994) and Lehtokangas et al. (1993a,b). Here we have not used the general form 3.7 of the code length but an approximate version of it, obtained by assuming the variance $\sigma^2$ to be constant; thus we need not estimate it. Due to this assumption the total code length will be shorter, and in cases in which the variance is not even close to constant the results may be distorted. However, in general, this assumption does not critically affect PMDL model selection (Rissanen 1994). The model structure with the minimum predictive code length represents the PMDL optimal model for the given problem.
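To make equation 3.7 concrete, the following Python sketch accumulates the predictive code length from a sequence of one-step prediction errors. The running-mean variance estimator and the initial value `sigma2_init` are assumptions on our part; the paper says only that a suitable initial value is picked.

```python
import numpy as np

def predictive_code_length(errors, sigma2_init=1.0):
    """Predictive code length of equation 3.7 for one-step
    prediction errors eps_1, ..., eps_n.

    sigma2 plays the role of sigma_t^2; here it is the running
    mean of past squared errors (an assumed estimator), started
    from sigma2_init (the 'suitable initial value' of the text).
    """
    errors = np.asarray(errors, dtype=float)
    n = len(errors)
    length = 0.5 * n * np.log(2.0 * np.pi)
    sigma2 = sigma2_init
    for t, eps in enumerate(errors):
        # 0.5 * (eps^2 / sigma_t^2 + 2 ln sigma_t); note 2 ln s = ln s^2
        length += 0.5 * (eps**2 / sigma2 + np.log(sigma2))
        # update the running mean of squared errors (guard against zero)
        sigma2 = max((t * sigma2 + eps**2) / (t + 1), 1e-12)
    return length
```

Under the constant-variance assumption adopted here, the $\ln \hat{\sigma}_t$ terms are identical for every candidate model, so minimizing 3.7 reduces to comparing accumulated squared prediction errors, which is exactly what the algorithm below computes.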

The predictive MDL algorithm can be represented by the following steps:

Step 1. Generate a data string $x^n$ of length $n$.

Step 2. Divide the data string $x^n$ into $k_{max} = [n/d]$ consecutive segments of length $d$.

Step 3. Select a model structure to be tested and initialize the value of the squared prediction error $R_{total}$ to zero.

For $k = 1$ to $k_{max} - 1$:
  Train the selected model with data segments $1, \ldots, k$.
  Compute the squared prediction error $R_k$ for data segment $k + 1$ using the trained model, and add this prediction error to $R_{total}$.
Next $k$.

Divide $R_{total}$ by $n - d$ and set this value as $R_{model}$. If all model candidates have been tested, go to Step 4; else go to Step 3.

Step 4. Find the minimum $R_{model}$ value.

The model structure that has the minimum $R_{model}$ value is the PMDL optimum model.
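The steps above translate directly into code. In the sketch below, `train` and `sq_error` are hypothetical stand-ins for backpropagation training and the summed squared one-step prediction error of a fitted network; the paper does not fix these interfaces.

```python
import numpy as np

def pmdl_select(x, d, candidates, train, sq_error):
    """Segment-wise PMDL model selection (Steps 1-4 above).

    x          : data string x^n of length n
    d          : segment length
    candidates : model structures to test, e.g. [(p, q), ...]
    train      : train(model, data) -> fitted model       (hypothetical)
    sq_error   : sq_error(fitted, data) -> summed squared
                 one-step prediction error                (hypothetical)
    """
    n = len(x)
    k_max = n // d                                   # Step 2: [n/d] segments
    segments = [x[k * d:(k + 1) * d] for k in range(k_max)]
    r_model = {}
    for model in candidates:                         # Step 3 per candidate
        r_total = 0.0
        for k in range(1, k_max):
            fitted = train(model, np.concatenate(segments[:k]))
            r_total += sq_error(fitted, segments[k])  # predict segment k + 1
        r_model[model] = r_total / (n - d)
    return min(r_model, key=r_model.get)             # Step 4: minimum R_model
```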


4 Time Series Modeling Experiments

In this section the usefulness of the PMDL model selection scheme is examined with three time series modeling experiments. For comparison purposes the cross-validation (CV) technique (Wahba and Wold 1975; Utans and Moody 1991) was also used for model selection. The version of cross-validation that was used segments the data in the same way as the PMDL scheme; each segment in turn is left out as the validation segment, and the remaining segments are used for training. The first benchmark time series is artificial, and it was generated by the formula

$$x_{t+1} = \cos(x_t) + \alpha\, x_{t-1} |x_{t-2}|^{\beta} + \varepsilon_{t+1} \tag{4.1}$$

in which $\alpha = -1.75$ and $\beta = 1.5$ were used. All the initial conditions were zeros. The additive noise was gaussian such that the signal-to-noise ratio was $\sigma_x/\sigma_\varepsilon = 10$ (a generator sketch is given after the NMSE definition below). The second modeling problem deals with time series data measured from a physics laboratory experiment. The time series represents the fluctuations in a far-infrared laser, and it was recently used in the Santa Fe Time Series Prediction and Analysis Competition (Weigend and Gershenfeld 1994).¹ The third time series represents the load in an electrical network. This series was obtained from industry and, therefore, we cannot reveal any further details about it. We wanted to include these data in this study because they are a good example of real-world data.

All three series consist of 2000 points, and they are depicted in Figure 2. The first 1000 points of each series were used for model selection with the PMDL and cross-validation methods. Since the initial values of the weights affect the results, model selection was repeated five times, and the final selection was based on the repetitions such that the minimum criterion value for each structure was used. After model selection the first 1000 points were used as a training set and the remaining 1000 points were used as a test set for generalization. The testing of the selected model structures was repeated a hundred times with different initial weight values on each trial. The normalized mean square error (NMSE) was used as the error metric. It is defined as

$$\mathrm{NMSE} = \frac{1}{\sigma^2 m} \sum_{t=1}^{m} (x_t - \hat{x}_t)^2 \tag{4.2}$$

in which $\sigma^2$ is the variance of the time series and $m$ is the number of observations.
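To make the error metric and the benchmark concrete, here is a Python sketch of equation 4.2 together with a generator for the artificial series of equation 4.1. How the noise was scaled to achieve $\sigma_x/\sigma_\varepsilon = 10$ is not spelled out in the text, so the noise-free calibration pass below is our assumption.

```python
import numpy as np

def nmse(x_true, x_pred):
    """Normalized mean square error, equation 4.2."""
    x_true = np.asarray(x_true, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    return np.mean((x_true - x_pred) ** 2) / np.var(x_true)

def artificial_series(n=2000, alpha=-1.75, beta=1.5, snr=10.0, seed=0):
    """Benchmark series of equation 4.1 with zero initial conditions.

    The noise level sigma_eps is calibrated from a noise-free run so
    that sigma_x / sigma_eps = snr (an assumption about the scaling).
    """
    rng = np.random.default_rng(seed)

    def run(eps):
        x = np.zeros(n)                      # zero initial conditions
        for t in range(2, n - 1):
            x[t + 1] = (np.cos(x[t])
                        + alpha * x[t - 1] * abs(x[t - 2]) ** beta
                        + eps[t + 1])
        return x

    sigma_eps = np.std(run(np.zeros(n))) / snr   # calibration pass
    return run(rng.normal(scale=sigma_eps, size=n))
```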

¹The data are available by anonymous ftp at ftp.cs.colorado.edu in /pub/Time-Series/SantaFe in files A.dat (the first 1000 points) and A.cont (contains the continuation of file A.dat; the first 1000 points of the continuation were used as the test set).



Figure 2: Benchmark time series scaled to the interval -1 to 1. (a) Artificial time series, (b) time series measured from a far-infrared laser, and (c) time series representing the load in an electrical network.


Table 1: Results for the Benchmark Problems

Time series     Selection  Resulting   NMSE      σ_NMSE    NMSE      σ_NMSE
                method     structure   training  training  test      test
----------------------------------------------------------------------------
Artificial      PMDL       MLP(3,5)    0.0374    0.0062    0.0386    0.0061
time series     CV         MLP(3,5)    0.0374    0.0062    0.0386    0.0061
Fluctuations    PMDL       MLP(2,6)    0.0295    0.0033    0.0337    0.0046
in a laser      CV         MLP(2,7)    0.0283    0.0042    0.0320    0.0057
Load in an      PMDL       MLP(4,10)   0.0623    0.0051    0.1085    0.0291
electrical net  CV         MLP(4,9)    0.0629    0.0052    0.1102    0.0330

Results for the three experiments are shown in Table 1. The given NMSE error values are averages of the hundred repetitions. The standard deviations $\sigma_{\mathrm{NMSE}}$ of the errors are also given. As can be seen, the PMDL and CV model selection schemes yield very similar results in terms of model complexity. With the artificial series the result is exactly the same, and with the other two time series there is only a one-hidden-node difference. This is not really surprising, since the presented version of the PMDL method can be regarded as a type of cross-validation. However, one should keep in mind that the PMDL method has a strong theoretical background, and it may be possible to create an even sharper version of it in which assumptions such as constant error variance are not needed (Weigend and Nix 1994).

The PMDL scheme presented does have at least one advantage over the cross-validation method: it was found to be two times faster to compute. This is a significant improvement, since model selection, in general, is very time consuming. The speed-up is a direct result of the data segmentation. Assume that the data set used for model selection is divided into $s$ ($s > 1$) segments of equal size and that the computational cost of training the model on one segment is constant. Then the computational costs of the methods can be compared directly by counting how many times a segment is trained into the model. With CV this number is $s(s-1)$, and with PMDL it is $s(s-1)/2$. Hence, under the above assumptions, the PMDL method is twice as fast as the CV method. Of course, in practical simulations the training times may vary somewhat, and the above comparison gives at best a rough estimate of the real situation. It is also noted that Rissanen (1994) has proposed a practical modification that can significantly reduce the computational cost of the PMDL procedure.
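The counting argument above is easily checked numerically; a minimal sketch with the segment count s as a free parameter:

```python
def segment_trainings(s):
    """Segment-training counts for scoring one model candidate
    with s data segments (s > 1)."""
    cv = s * (s - 1)         # s folds, each trained on s - 1 segments
    pmdl = s * (s - 1) // 2  # PMDL trains on 1, 2, ..., s - 1 segments
    return cv, pmdl

print(segment_trainings(10))  # -> (90, 45): PMDL is twice as fast
```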

5 Conclusions

Time series modeling with a multilayer perceptron network was presented in this study. The problem of selecting an optimum-sized network architecture for a given problem was studied by using the predictive minimum description length principle.




The approach provides a systematic procedure for searching and constructing an optimal model based on input-output observations. A comparison with the cross-validation technique showed that the PMDL method is a useful alternative for model selection in time series applications. Both methods gave similar results in terms of model complexity, but the PMDL method was found to be two times faster to compute. This difference in computing speed is significant, since model selection is, in general, very time consuming. Also, the PMDL optimum structures were found to generalize adequately.

Acknowledgments

The authors would like to express special thanks to Dr. Jorma Rissanen for his valuable advice on the PMDL method. Also the authors wish to thank the reviewers for their valuable comments on the manuscript.

References

Akaike, H. 1977. On entropy maximization principle. In Applications of Statistics, P. R. Krishnaiah, ed., pp. 27-41. North-Holland, Amsterdam.
Box, G., and Jenkins, G. 1970. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
Chaitin, G. 1966. On the length of programs for computing finite binary sequences. J. Assoc. Comput. Mach. 13, 547-569.
Clarke, B., and Barron, A. 1993. Jeffreys' prior is asymptotically least favorable under entropy risk. J. Statist. Planning Inference, in press.
Doob, J. 1953. Stochastic Processes. Wiley, New York.
Engle, R. 1982. Autoregressive conditional heteroscedasticity with estimates of the variance of U.K. inflation. Econometrica 50, 987-1008.
Kolmogorov, A. 1965. Three approaches to the quantitative definition of information. Problems Inform. Transmiss. 1, 4-7.
Lehtokangas, M., Saarinen, J., Huuhtanen, P., and Kaski, K. 1993a. Neural network prediction of non-linear time series using predictive MDL principle. In Proceedings of the IEEE Winter Workshop on Nonlinear Digital Signal Processing, Tampere, Finland, January 17-20, pp. 7.2-2.1-7.2-2.6.
Lehtokangas, M., Saarinen, J., Huuhtanen, P., and Kaski, K. 1993b. Neural network modeling and prediction of multivariate time series using predictive MDL principle. In Proceedings of the International Conference on Artificial Neural Networks, ICANN-93, Amsterdam, The Netherlands, September 13-16, pp. 826-829.
Lehtokangas, M., Saarinen, J., Huuhtanen, P., and Kaski, K. 1995. Initializing weights of a multilayer perceptron network by using the orthogonal least squares algorithm. Neural Comp. 7(5), 982-999.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 133-143.
Ozaki, T. 1978. Non-linear models for non-linear random vibrations. Tech. Rep. 92, Department of Mathematics, University of Manchester Institute of Science and Technology, UK.
Powell, M. 1987. Radial basis functions for multivariate interpolation. In IMA Conference on Algorithms for Approximation of Functions and Data, J. Mason and M. Cox, eds., pp. 143-167. Oxford University Press, Oxford.
Priestley, M. 1988. Nonlinear and Non-stationary Time Series Analysis. Academic Press, London.
Rissanen, J. 1978. Modelling by shortest data description. Automatica 14, 465-471.
Rissanen, J. 1983. A universal prior for integers and estimation by minimum description length. Ann. Statist. 11(2), 416-431.
Rissanen, J. 1984. Universal coding, information, prediction, and estimation. IEEE Transact. Inform. Theory IT-30(4), 629-636.
Rissanen, J. 1986. Stochastic complexity and modeling. Ann. Statist. 14(3), 1080-1100.
Rissanen, J. 1987. Stochastic complexity. J. Royal Statist. Soc. Ser. B 49(3), 223-239 and 252-265.
Rissanen, J. 1993. Fisher information and stochastic complexity. IEEE Transact. Inform. Theory, submitted.
Rissanen, J. 1994. Information theory and neural nets. In Mathematical Perspectives on Neural Networks, P. Smolensky, M. Mozer, and D. Rumelhart, eds. Lawrence Erlbaum, Hillsdale, NJ.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. Rumelhart, J. McClelland, and the PDP Research Group, eds., chap. 8. MIT Press, Cambridge, MA.
Solomonoff, R. 1964. A formal theory of inductive inference. Inform. Control 7, Part I, 1-22; Part II, 224-254.
Tong, H. 1983. Threshold Models in Non-linear Time Series Analysis. Springer-Verlag, New York.
Utans, J., and Moody, J. 1991. Selecting neural network architectures via the prediction risk: Application to corporate bond rating prediction. In Proceedings of the First International Conference on Artificial Intelligence Applications on Wall Street.
Volterra, V. 1959. Theory of Functionals and Integro-differential Equations. Dover, New York.
Wahba, G., and Wold, S. 1975. A completely automatic French curve: Fitting spline functions by cross-validation. Commun. Statist. 4(1), 1-17.
Weigend, A., and Gershenfeld, N. (eds.). 1994. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, Reading, MA.
Weigend, A., Huberman, B., and Rumelhart, D. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1(3), 193-209.
Weigend, A., and Nix, D. 1994. Predictions with confidence intervals (local error bars). In Proceedings of the International Conference on Neural Information Processing, ICONIP-94, Seoul, Korea, pp. 847-852.
Yule, G. 1927. On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philos. Trans. R. Soc. London A 226, 267-298.



Received April 12, 1994; accepted August 15, 1995.
