LETTER Communicated by José Principe

Learning Chaotic Attractors by Neural Networks

Rembrandt Bakker
DelftChemTech, Delft University of Technology, 2628 BL Delft, The Netherlands

Jaap C. Schouten
Chemical Reactor Engineering, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands

C. Lee Giles
NEC Research Institute, Princeton, NJ 08540, U.S.A.

Floris Takens
Department of Mathematics, University of Groningen, 9700 AV Groningen, The Netherlands

Cor M. van den Bleek
DelftChemTech, Delft University of Technology, 2628 BL Delft, The Netherlands

An algorithm is introduced that trains a neural network to identify chaotic dynamics from a single measured time series. During training, the algorithm learns to short-term predict the time series. At the same time a criterion, developed by Diks, van Zwet, Takens, and de Goede (1996), is monitored that tests the hypothesis that the reconstructed attractors of model-generated and measured data are the same. Training is stopped when the prediction error is low and the model passes this test. Two other features of the algorithm are (1) the way the state of the system, consisting of delays from the time series, has its dimension reduced by weighted principal component analysis data reduction, and (2) the user-adjustable prediction horizon obtained by "error propagation," that is, partially propagating prediction errors to the next time step.

The algorithm is first applied to data from an experimental, driven, chaotic pendulum, of which two of the three state variables are known. This is a comprehensive example that shows how well the Diks test can distinguish between slightly different attractors. Second, the algorithm is applied to the same problem, but now one of the two known state variables is ignored. Finally, we present a model for the laser data from the Santa Fe time-series competition (set A). It is the first model for these data that is not only useful for short-term predictions but also generates time series with similar chaotic characteristics as the measured data.

Neural Computation 12, 2355–2383 (2000) © 2000 Massachusetts Institute of Technology

1 Introduction

A time series measured from a deterministic chaotic system has the appealing characteristic that its evolution is fully determined and yet its predictability is limited due to exponential growth of errors in model or measurements. A variety of data-driven analysis methods for this type of time series was collected in 1991 during the Santa Fe time-series competition (Weigend & Gershenfeld, 1994). The methods focused on either characterization or prediction of the time series. No attention was given to a third, and much more appealing, objective: given the data and the assumption that it was produced by a deterministic chaotic system, find a set of model equations that will produce a time series with identical chaotic characteristics, having the same chaotic attractor. The model could be based on first principles if the system is well understood, but here we assume knowledge of just the time series and use a neural network-based, black-box model. The choice for neural networks is inspired by the inherent stability of their iterated predictions, as opposed to, for example, polynomial models (Aguirre & Billings, 1994) and local linear models (Farmer & Sidorowich, 1987). Lapedes and Farber (1987) were among the first who tried the neural network approach. In concise neural network jargon, we formulate our goal: train a network to learn the chaotic attractor. A number of authors have addressed this issue (Aguirre & Billings, 1994; Principe, Rathie, & Kuo, 1992; Kuo & Principe, 1994; Deco & Schürmann, 1994; Rico-Martínez, Krischer, Kevrekidis, Kube, & Hudson, 1992; Krischer et al., 1993; Albano, Passamente, Hediger, & Farrell, 1992). The common approach consists of two steps:

1. Identify a model that makes accurate short-term predictions.

2. Generate a long time series with the model by iterated prediction, and compare the nonlinear-dynamic characteristics of the generated time series with the original, measured time series.

Principe et al. (1992) found that in many cases this approach fails; the model can make good short-term predictions but has not learned the chaotic attractor. The method would be greatly improved if we could minimize directly the difference between the reconstructed attractors of the model-generated and measured data rather than minimizing prediction errors. However, we cannot reconstruct the attractor without first having a prediction model. Therefore, research is focused on how to optimize both steps. We highly reduce the chance of failure by integrating step 2 into step 1, the model identification. Rather than evaluating the model attractor after training, we monitor the attractor during training, and introduce a new test developed by Diks, van Zwet, Takens, and de Goede (1996) as a stopping criterion. It tests the null hypothesis that the reconstructed attractors of model-generated and measured data are the same. The criterion directly measures the distance between two attractors, and a good estimate of its variance is available. We combined the stopping criterion with two special features that we found very useful: (1) an efficient state representation by weighted principal component analysis (PCA) and (2) a parameter estimation scheme based on a mixture of the output-error and equation-error method, previously introduced as the compromise method (Werbos, McAvoy, & Su, 1992). While Werbos et al. promoted the method to be used where equation error fails, we here use it to make the prediction horizon user adjustable. The method partially propagates errors to the next time step, controlled by a user-specified error propagation parameter.

In this article we present three successful applications of the algorithm. First, a neural network is trained on data from an experimental driven and damped pendulum. This system is known to have three state variables, of which one is measured and a second, the phase of the sinusoidal driving force, is known beforehand. This experiment is used as a comprehensive visualization of how well the Diks test can distinguish between slightly different attractors and how its performance depends on the number of data points. Second, a model is trained on the same pendulum data, but this time the information about the phase of the driving force is completely ignored. Instead, embedding with delays and PCA is used. This experiment is a practical verification of the Takens theorem (Takens, 1981), as explained in section 3. Finally, the error propagation feature of the algorithm becomes important in the modeling of the laser data (set A) from the 1991 Santa Fe time-series prediction competition. The resulting neural network model opens new possibilities for analyzing these data because it can generate time series up to any desired length and the Jacobian of the model can be computed analytically at any time, which makes it possible to compute the Lyapunov spectrum and find periodic solutions of the model.

Section 2 gives a brief description of the two data sets that are used to explain concepts of the algorithm throughout the article. In section 3, the input-output structure of the model is defined: the state is extracted from the single measured time series using the method of delays followed by weighted PCA, and predictions are made by a combined linear and neural network model. The error propagation training is outlined in section 4, and the Diks test, used to detect whether the attractor of the model has converged to the measured one, in section 5.
Section 6 contains the results of the two different pendulum models, and section 7 covers the laser data model. Section 8 concludes with remarks on attractor learning and future directions.

2 Data Sets

Two data sets are used to illustrate features of the algorithm: data from an experimental driven pendulum and far-infrared laser data from the 1991 Santa Fe time-series competition (set A).

Figure 1: Plots of the first 2000 samples of (a) driven pendulum and (b) Santa Fe laser time series.

The pendulum data are described in section 2.1. For the laser data we refer to the extensive description of Hübner, Weiss, Abraham, and Tang (1994). The first 2048 points of each set are plotted in Figures 1a and 1b.

2.1 Pendulum Data. The pendulum we use is a type EM-50 pendulum produced by Daedalon Corporation (see Blackburn, Vik, & Binruo, 1989, for details and Figure 2 for a schematic drawing).

Figure 2: Schematic drawing of the experimental pendulum and its driving force. The pendulum arm can rotate around its axis; the angle h is measured. During the measurements, the frequency of the driving torque was 0.85 Hz.

Ideally, the pendulum would obey the following equations of motion, in dimensionless form,

d/dt (h, v, w)^T = (v, −c v − sin h + a sin w, v_D)^T,   (2.1)

where h is the angle of the pendulum, v its angular velocity, c is a damping constant, and (a, v_D, w) are the amplitude, frequency, and phase, respectively, of the harmonic driving torque. As observed by De Korte, Schouten, and van den Bleek (1995), the real pendulum deviates from the ideal behavior of equation 2.1 because of its four electromagnetic driving coils that repulse the pendulum at certain positions (top, bottom, left, right). However, from a comparison of Poincaré plots of the ideal and real pendulum by Bakker, de Korte, Schouten, Takens, and van den Bleek (1996), it appears that the real pendulum can be described by the same three state variables (h, v, w) as the equations 2.1. In our setup the angle h of the pendulum is measured at an accuracy of about 0.1 degree. It is not a well-behaved variable, for it is an angle defined only in the interval [0, 2π] and it can jump from 2π to 0 in a single time step. Therefore, we construct an estimate of the angular velocity v by differencing the angle time series and removing the discontinuities. We call this estimate Ω.
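To make the role of the three state variables concrete, the sketch below integrates the ideal pendulum of equation 2.1 with a standard fourth-order Runge-Kutta scheme. This is only an illustration in Python; the damping, amplitude, and driving frequency used here are assumed values, not the settings of the experimental EM-50 setup.

```python
# A minimal sketch (not the authors' code): the ideal pendulum of equation 2.1
# integrated with a fixed-step fourth-order Runge-Kutta scheme. The parameter
# values below are illustrative assumptions, not the experimental settings.
import numpy as np

def pendulum_rhs(state, c=0.5, a=1.2, v_drive=1.0):
    """Right-hand side of equation 2.1 for state = (h, v, w)."""
    h, v, w = state
    return np.array([v, -c * v - np.sin(h) + a * np.sin(w), v_drive])

def rk4_step(state, dt, rhs):
    """One fourth-order Runge-Kutta step."""
    k1 = rhs(state)
    k2 = rhs(state + 0.5 * dt * k1)
    k3 = rhs(state + 0.5 * dt * k2)
    k4 = rhs(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

state = np.array([0.1, 0.0, 0.0])    # initial (h, v, w)
dt = 0.01
angles = []
for _ in range(20000):
    state = rk4_step(state, dt, pendulum_rhs)
    angles.append(state[0])          # only the angle h is "measured"
```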

3 Model Structure

The experimental pendulum and FIR laser are examples of autonomous, deterministic, and stationary systems. In discrete time, the evolution equation for such systems is

x_{n+1} = F(x_n),   (3.1)

where F is a nonlinear function, x_n is the state of the system at time t_n = t_0 + nτ, and τ is the sampling time. In most real applications, only a single variable is measured, even if the evolution of the system depends on several variables. As Grassberger, Schreiber, and Schaffrath (1991) pointed out, it is possible to replace the vector of "true" state variables by a vector that contains delays of the single measured variable. The state vector becomes a delay vector x_n = (y_{n−m+1}, ..., y_n), where y is the single measured variable and m is the number of delays, called embedding dimension. Takens (1981) showed that this method of delays will lead to evolutions of the type of equation 3.1 if we choose m ≥ 2D + 1, where D is the dimension of the system's attractor. We can now use this result in two ways:

1. Choose an embedding a priori by fixing the delay time τ and the number of delays. This results in a nonlinear auto-regressive (NAR) model structure, in our context referred to as a time-delay neural network.

2. Do not fix the embedding, but create a (neural network) model structure with recurrent connections that the network can use to learn an appropriate state representation during training.

Although the second option employs a more flexible model representation (see, for example, the FIR network model used by Wan, 1994, or the partially recurrent network of Bulsari and Saxén, 1995), we prefer to fix the state beforehand since we do not want to mix learning of two very different things: an appropriate state representation and the approximation of F in equation 3.1. Bengio, Simard, and Frasconi (1994) showed that this mix makes it very difficult to learn long-term dependencies, while Lin, Horne, Tino, and Giles (1996) showed that this can be solved by using the NAR(X) model structure. Finally, Siegelmann, Horne, and Giles (1997) showed that the NAR(X) model structure may be not as compact but is computationally as strong as a fully connected recurrent neural network.
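The delay-vector construction described above is simple to state in code. The following sketch (a minimal illustration, not the authors' implementation) builds the NAR input-output pairs x_n = (y_{n−m+1}, ..., y_n) with targets y_{n+1} from a single measured series; the array y used in the example is a placeholder.

```python
# A minimal sketch: build NAR input-output pairs from a single measured series.
# The series y below is a placeholder; m and tau follow the notation above.
import numpy as np

def delay_embedding(y, m, tau=1):
    """Return (X, targets) with X[k] = (y[k], y[k+tau], ..., y[k+(m-1)*tau])
    and targets[k] = y[k+(m-1)*tau+1], the one-step-ahead value."""
    n_vectors = len(y) - (m - 1) * tau - 1
    X = np.array([y[k : k + (m - 1) * tau + 1 : tau] for k in range(n_vectors)])
    targets = y[(m - 1) * tau + 1 : (m - 1) * tau + 1 + n_vectors]
    return X, targets

y = np.sin(0.3 * np.arange(1000))             # placeholder series
X, t = delay_embedding(y, m=25, tau=1)        # 25 delays, unit delay time
```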

3.1 Choice of Embedding. The choice of the delay time τ and the embedding m is highly empirical. First we fix the time window, with length T = τ(m − 1) + 1. For a proper choice of T, we consider that it is too short if the variation within the delay vector is dominated by noise, and it is longer than necessary if it exceeds the maximum prediction horizon of the system; that is, if the first and last part of the delay vector have no correlation at all. Next we choose the delay time τ. Figure 3 depicts two possible choices of τ for a fixed window length. It is important to realize that τ equals the one-step-ahead prediction horizon of our model. The results of Rico-Martínez et al. (1992) show that if τ is chosen large (such that successive delays show little linear correlation), the long-term solutions of the model may be apparently different from the real system due to the very discrete nature of the model. A second argument to choose τ small is that the further we predict ahead in the future, the more nonlinear will be the mapping F in equation 3.1. A small delay time will require the lowest functional complexity. Choosing τ (very) small introduces three potential problems:

1. Since we fixed the time window, the delay vector becomes very long, and subsequent delays will be highly correlated. This we solve by reducing its dimension with PCA (see section 3.2).

2. The one-step prediction will be highly correlated with the most recent delays, and a linear model will already make good one-step predictions (see section 3.4). We adopt the error-propagation learning algorithm (see section 4) to ensure learning of nonlinear correlations.

3. The time needed to compute predictions over a certain time interval is inversely proportional to the delay time. For real-time applications, too small a delay time will make the computation of predictions prohibitively time-consuming.

Figure 3: Two examples of an embedding with delays, applied to the pendulum time series of Figure 1a. Compared to (a), in (b) the number of delays within the fixed time window is large, there is much (linear) correlation between the delays, and the one-step prediction horizon is small.

The above considerations guided us for the pendulum to the embedding of Figure 3b, where τ = 1 and T = m = 25.

3.2 Principal Component Embedding. The consequence of a small delay time in a fixed time window is a large embedding dimension, m ≈ T/τ, causing the model to have an excess number of inputs that are highly correlated. Broomhead and King (1986) proposed a very convenient way to reduce the dimension of the delay vector. They called it singular system analysis, but it is better known as PCA. PCA transforms the set of delay vectors to a new set of linearly uncorrelated variables of reduced dimension (see Jackson, 1991, for details). The principal component vectors z_n are computed from the delay vectors x_n,

z_n ≈ V^T x_n,   (3.1)

where V is obtained from USV^T = X, the singular value decomposition of the matrix X that contains the original delay vectors x_n. An essential ingredient of PCA is to examine how much energy is lost for each element of z that is discarded.

Figure 4: Accuracy (RMSE, root mean square error) of the reconstruction of the original delay vectors (25 delays) from the principal component vectors (8 principal components) for the pendulum time series: (a) nonweighted PCA; (c) weighted PCA; (d) weighted PCA and use of update formula equation 3.3. The standard profile we use for weighted PCA is shown in (b).

For the pendulum, it is found that the first 8 components (out of 25) explain 99.9% of the variance of the original data. To reconstruct the original data vectors, PCA takes a weighted sum of the eigenvectors of X (the columns of V). Note that King, Jones, and Broomhead (1987) found that for large embedding dimensions, principal components are essentially Fourier coefficients and the eigenvectors are sines and cosines.

In particular for the case of time-series prediction, where recent delays are more important than older delays, it is important to know how well each individual component of the delay vector x_n can be reconstructed from z_n. Figure 4a shows the average reconstruction error (RMSE, root mean squared error) for each individual delay, for the pendulum example where we use 8 principal components. It appears that the error varies with the delay, and, in particular, the most recent and the oldest measurement are less accurately represented than the others. This is not a surprise since the first and last delay have correlated neighbors only on one side. Ideally, one would like to put some more weight on recent measurements than older ones. Grassberger et al. (1991) remarked that this seems not easy when using PCA. Fortunately the contrary is true, since we can do a weighted PCA and set the desired accuracy of each individual delay: each delay is multiplied by a weight constant. The larger this constant is, the more the delay will contribute to the total variance of the data, and the more accurately this delay will need to be represented. For the pendulum we tried the weight profile of Figure 4b. It puts a double weight on the most recent value, to compensate for the neighbor effect. The reconstruction result is shown in Figure 4c. As expected, the reconstruction of recent delays has become more accurate.
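A weighted PCA of the kind described above can be sketched as follows. This is a reconstruction under the stated definitions, not the authors' code: each delay is multiplied by its weight before the singular value decomposition, and the per-delay reconstruction error of Figure 4 is then evaluated. The weight profile in the example (a double weight on the most recent delay) is the assumption suggested by Figure 4b.

```python
# A minimal sketch of weighted PCA (a reconstruction, not the authors' code):
# each delay is scaled by a weight before the SVD, and the reconstruction RMSE
# of every (unweighted) delay is evaluated as in Figure 4.
import numpy as np

def weighted_pca(X, weights, n_components):
    """X: (N, m) delay vectors; weights: length-m profile; returns (Z, V)."""
    Xw = X * weights                           # emphasize selected delays
    U, s, Vt = np.linalg.svd(Xw, full_matrices=False)
    V = Vt.T[:, :n_components]                 # (m, n_components) eigenvectors
    Z = Xw @ V                                 # principal component vectors
    return Z, V

def per_delay_rmse(X, Z, V, weights):
    """RMSE of reconstructing each delay from the principal components."""
    X_rec = (Z @ V.T) / weights                # undo the weighting
    return np.sqrt(np.mean((X - X_rec) ** 2, axis=0))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 25))            # placeholder delay vectors
w = np.ones(25)
w[-1] = 2.0                                    # double weight on the newest delay
Z, V = weighted_pca(X, w, n_components=8)
print(per_delay_rmse(X, Z, V, w))
```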

3.3 Updating Principal Components. In equation 3.1, the new state of the system is predicted from the previous one. But the method of delays has introduced information from the past in the state of the system. Since it is not sound to predict the past, we first predict only the new measured variable,

y_{n+1} = F'(x_n),   (3.2)

and then update the old state x_n with the prediction y_{n+1} to give x_{n+1}. For delay vectors x_n = (y_{n−m+1}, ..., y_n), this update simply involves a shift of the time window: the oldest value of y is removed and the new, predicted value comes in. In our case of principal components z_n (instead of delay vectors x_n), the update requires three steps: (1) reconstruct the delay vector from the principal components, (2) shift the time window, and (3) construct the principal components from the updated delay vector. Fortunately, these three operations can be done in a single matrix computation that we call update formula (see Bakker, de Korte, Schouten, Takens, & van den Bleek, 1997, for a derivation):

z_{n+1} = A z_n + b y_{n+1}   (3.3)

with

A_{ij} = Σ_{k=2}^{m} V^{-1}_{i,k} V_{k−1,j};   b_i = V^{-1}_{i,1}.   (3.4)

Since V is orthogonal, the inverse of V simply equals its transpose, V^{-1} = V^T. The use of the update formula, equation 3.3, introduces some extra loss of accuracy, for it computes the new state not from the original but from reconstructed variables. This has a consequence, usually small, for the average reconstruction error; compare for the pendulum example Figures 4c to 4d.

A final important issue when using principal components as neural network inputs is that the principal components are usually scaled according to the value of their eigenvalue. But the small random weights initialization for the neural networks that we will use accounts for inputs that are scaled approximately to zero mean and unit variance. This is achieved by replacing V by its scaled version W = VS and its inverse by W^{-1} = (X^T X W)^T. Note that this does not affect reconstruction errors of the PCA compression.
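A small sketch of the update formula, equations 3.3 and 3.4, is given below (again a reconstruction, not the original code). It assumes the delay vector is ordered with the most recent measurement first, which is the ordering implied by b_i = V^{-1}_{i,1}.

```python
# A minimal sketch of the update formula, equations 3.3 and 3.4 (a
# reconstruction, not the original code). It assumes the delay vector is
# ordered newest-first, the ordering implied by b_i = V^{-1}_{i,1}.
import numpy as np

def update_matrices(V):
    """V: (m, p) matrix of principal directions; returns A (p, p) and b (p,)."""
    m, p = V.shape
    V_inv = V.T                       # orthonormal columns: V^T acts as the inverse map
    A = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            # A_ij = sum_{k=2}^{m} V^{-1}_{i,k} V_{k-1,j} (1-based, equation 3.4)
            A[i, j] = sum(V_inv[i, k] * V[k - 1, j] for k in range(1, m))
    b = V_inv[:, 0]                   # b_i = V^{-1}_{i,1}
    return A, b

def update_state(z, y_new, A, b):
    """Equation 3.3: shift the time window and insert the new measurement."""
    return A @ z + b * y_new
```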

Figure 5: Structure of the prediction model. The state is updated with a new value of the measured variable (update unit), then fed into the linear model and an MLP neural network to predict the next value of the measured variable.

3.4 Prediction Model. After defining the state of our experimental system, we need to choose a representation of the nonlinear function F in equation 3.1. Because we prefer to use a short delay time τ, the one-step-ahead prediction will be highly correlated with the most recent measurement, and a linear model may already make accurate one-step-ahead predictions. A chaotic system is, however, inherently nonlinear and therefore we add a standard multilayer perceptron (MLP) to learn the nonlinearities. There are two reasons for using both a linear and an MLP model, even though the MLP itself can represent a linear model: (1) the linear model parameters can be estimated quickly beforehand with linear least squares, and (2) it is instructive to see at an early stage of training if the neural network is learning more than linear correlations only. Figure 5 shows a schematic representation of the prediction model, with a neural network unit, a linear model unit, and a principal component update unit.
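The combined linear-plus-MLP structure of Figure 5 can be illustrated with the following sketch. The layer sizes and the toy NumPy forward pass are assumptions for illustration; the authors trained the MLP with conjugate gradient minimization (section 4.2), which is not reproduced here.

```python
# A minimal sketch of the structure of Figure 5 (illustrative only): a linear
# model estimated first by least squares, plus a small MLP for the nonlinear
# part. Layer sizes and the forward pass are assumptions, not the authors' code.
import numpy as np

class LinearPlusMLP:
    def __init__(self, n_in, n_hidden=24, seed=0):
        rng = np.random.default_rng(seed)
        self.c = np.zeros(n_in)                          # linear model weights
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = 0.1 * rng.standard_normal(n_hidden)
        self.b2 = 0.0

    def fit_linear(self, Z, y):
        """Estimate the linear part first with linear least squares."""
        self.c, *_ = np.linalg.lstsq(Z, y, rcond=None)

    def predict(self, z):
        """One-step-ahead prediction: linear part plus MLP correction."""
        hidden = np.tanh(self.W1 @ z + self.b1)
        return self.c @ z + self.W2 @ hidden + self.b2
```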

4 Training Algorithm

After setting up the structure of the model, we need to find a set of parameters for the model such that it will have an attractor as close as possible to that of the real system. In neural network terms, train the model to learn the attractor. Recall that we would like to minimize directly the discrepancy between model and system attractor, but stick to the practical training objective: minimize short-term prediction errors. The question arises whether this is sufficient: will the prediction model also have approximately the same attractor as the real system? Principe et al. (1992) showed in an experimental study that this is not always the case, and some years later Kuo and Principe (1994) and also Deco and Schürmann (1994) proposed, as a remedy, to minimize multistep rather than just one-step-ahead prediction errors. (See Haykin & Li, 1995, and Haykin & Principe, 1998, for an overview of these methods.) This approach still minimizes prediction errors (and can still fail), but it leaves more room for optimization as it makes the prediction horizon user adjustable. In this study we introduce, in section 4.1, the compromise method (Werbos et al., 1992) as a computationally very efficient alternative for extending the prediction horizon. In section 5, we assess how best to detect differences between chaotic attractors. We will stop training when the prediction error is small and no differences between attractors can be detected.

4.1 Error Propagation. To train a prediction model, we minimize the squared difference between the predicted and measured future values. The prediction is computed from the principal component state z_n by

ŷ_{n+1} = F'(z_n).   (4.1)

In section 3.1 we recommended choosing the delay time τ small. As a consequence, the one-step-ahead prediction will be highly correlated with the most recent measurements and eventually may be dominated by noise. This will result in bad multistep-ahead predictions. As a solution, Su, McAvoy, and Werbos (1992) proposed to train the network with an output error method, where the state is updated with previously predicted outputs instead of measured outputs. They call this a pure robust method and applied it successfully to nonautonomous, nonchaotic systems. For autonomous systems, applying an output error method would imply that one trains the model to predict the entire time series from only a single initial condition. But chaotic systems have only a limited predictability (Bockman, 1991), and this approach must fail unless the prediction horizon is limited to just a few steps. This is the multistep-ahead learning approach of Kuo and Principe (1994), Deco and Schürmann (1994), and Jaeger and Kantz (1996): a number of initial states are selected from the measured time series, and the prediction errors over a small prediction sequence are minimized. Kuo and Principe (1994) related the length of the sequences to the largest Lyapunov exponent of the system, and Deco and Schürmann (1994) also put less weight on longer-term predictions to account for the decreased predictability. Jaeger and Kantz (1996) address the so-called error-in-variables problem: a standard least-squares fit applied to equation 4.1 introduces a bias in the estimated parameters because it does not account for the noise in the independent variables z_n.

We investigate a second possibility to make the training of models for chaotic systems more robust. It recognizes that the decreasing predictability of chaotic systems can alternatively be explained as a loss of information about the state of the system when time evolves. From this point of view, the output error method will fail to predict the entire time series from only an initial state because this single piece of information will be lost after a short prediction time, unless additional information is supplied while the prediction proceeds. Let us compare the mathematical formulation of the three methods: one-step ahead, multistep ahead, and output error with addition of information. All cases use the same prediction equation, 4.1. They differ in how they use the update formula, equation 3.3. The one-step-ahead approach updates each time with a new, measured value,

z^r_{n+1} = A z^r_n + b y_{n+1}.   (4.2)

The state is labeled z^r because it will be used as a reference state later in equation 4.9. Trajectory learning, or output error learning with limited prediction horizon, uses the predicted value

z_{n+1} = A z_n + b ŷ_{n+1}.   (4.3)

For the third method, the addition of information must somehow find its source in the measured time series. We propose two ways to update the state with a mixture of the predicted and measured values. The first one is

z_{n+1} = A z_n + b[(1 − g) y_{n+1} + g ŷ_{n+1}],   (4.4)

or, equivalently,

z_{n+1} = A z_n + b[y_{n+1} + g e_n],   (4.5)

where the prediction error e_n is defined as e_n ≡ ŷ_{n+1} − y_{n+1}. Equation 4.5 and the illustrations in Figure 6 make clear why we call g the error propagation parameter: previous prediction errors are propagated to the next time step, depending on g, which we must choose between zero and one. Alternatively, we can adopt the view of Werbos et al. (1992), who presented the method as a compromise between the output-error (pure-robust) and equation-error method. Prediction errors are neither completely propagated to the next time step (output-error method) nor suppressed (equation-error method), but they are partially propagated to the next time step. For g = 0, error propagation is switched off, and we get equation 4.2. For g = 1, we have full error propagation, and we get equation 4.3, which will work only if the prediction time is small. In practice, we use g as a user-adjustable parameter intended to optimize model performance by varying the effective prediction horizon. For example, for the laser data model in section 7, we did five neural network training sessions, each with a different value of g, ranging from 0 to 0.9.

Figure 6: Three ways to make predictions: (a) one-step-ahead, (b) multistep-ahead, and (c) compromise (partially keeping the previous error in the state of the model).

In the appendix, we analyze error propagation training applied to a simple problem: learning the chaotic tent map. There it is found that measurement errors will be suppressed if a good value for g is chosen, but that too high a value will result in a severe bias in the estimated parameter. Multistep-ahead learning has a similar disadvantage; the length of the prediction sequence should not be chosen too high.

The error propagation equation, 4.5, has a problem when applied to the linear model, which we often add before even starting to train the MLP (see section 3.4). If we replace the prediction model, equation 4.1, by a linear model,

ŷ_{n+1} = C z_n,   (4.6)

and substitute this in equation 4.4, we get

z_{n+1} = A z_n + b[(1 − g) y_{n+1} + g C z_n]   (4.7)

or

z_{n+1} = (A + g b C) z_n + b (1 − g) y_{n+1}.   (4.8)

This is a linear evolution of z with y as an exogenous input. It is BIBO (bounded input, bounded output) stable if the largest eigenvalue of the matrix A + g b C is smaller than 1. This eigenvalue is nonlinearly dependent on g, and we found many cases in practice where the evolution is stable when g is zero or one, but not for a range of in-between values, thus making error propagation training impossible for these values of g. A good remedy is to change the method into the following, improved error propagation formula,

z_{n+1} = A[z^r_n + g(z_n − z^r_n)] + b[y_{n+1} + g e_n],   (4.9)

where the reference state z^r_n is the state obtained without error propagation, as defined in equation 4.2. Now stability is assessed by taking the eigenvalues of g(A + bC), which are g times smaller than the eigenvalues of (A + bC), and therefore error propagation cannot disturb the stability of the system in the case of a linear model. We prefer equation 4.9 over equation 4.5, even if no linear model is added to the MLP.
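The sketch below illustrates how the reference state of equation 4.2 and the improved error propagation update of equation 4.9 run side by side while prediction errors are collected. It is a minimal reconstruction: the predict callable stands for the linear/MLP model of section 3.4, and minimizing the squared errors with respect to its parameters (by BPTT, section 4.2) is not shown.

```python
# A minimal sketch of the state updates of equations 4.2 and 4.9 (a
# reconstruction, not the original code). `predict` stands for the model of
# section 3.4; the actual training step on the collected errors is omitted.
import numpy as np

def ep_errors(y, A, b, predict, g):
    """y: measured series; returns the prediction errors e_n for a given g."""
    p = len(b)
    z_ref = np.zeros(p)                     # reference state, equation 4.2
    z = np.zeros(p)                         # error propagation state
    errors = []
    for n in range(len(y) - 1):
        e = predict(z) - y[n + 1]           # e_n = y_hat_{n+1} - y_{n+1}
        errors.append(e)
        z_new = A @ (z_ref + g * (z - z_ref)) + b * (y[n + 1] + g * e)  # eq. 4.9
        z_ref = A @ z_ref + b * y[n + 1]                                # eq. 4.2
        z = z_new
    return np.array(errors)
```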

4.2 Optimization and Training. For network training we use backpropagation through time (BPTT) and conjugate gradient minimization. BPTT is needed because a nonzero error propagation parameter introduces recurrency into the network. If a linear model is used parallel to the neural network, we always optimize the linear model parameters first. This makes convergence of the MLP initially slow (the easy part has already been done) and normal later on. The size of the MLPs is determined by trial and error. Instead of using "early stopping" to prevent overfitting, we use the Diks monitoring curves (see section 5) to decide when to stop: select exactly the number of iterations that yields an acceptable Diks test value and a low prediction error on independent test data. The monitoring curves are also used to find a suitable value for the error propagation parameter; we pick the value that yields a curve with the least irregular behavior (see the examples in sections 6 and 7). Fortunately, the extra work to optimize g replaces the search for an appropriate delay time. We choose that as small as the sampling time of the measurements.

In a preliminary version of the algorithm (Bakker et al., 1997), the parameter g was taken as an additional output of the prediction model, to make it state dependent. The idea was that the training procedure would automatically choose g large in regions of the embedding space where the predictability is high, and small where predictability is low. We now think that this gives the minimization algorithm too many degrees of freedom, and we prefer simply to optimize g by hand.

5 Diks Test Monitoring

Because the error propagation training does not guarantee learning of the system attractor, we need to monitor a comparison between the model and system attractor. Several authors have used dynamic invariants such as Lyapunov exponents, entropies, and correlation dimension to compare attractors (Aguirre & Billings, 1994; Principe et al., 1992; Bakker et al., 1997; Haykin & Puthusserypady, 1997). These invariants are certainly valuable, but they do not uniquely identify the chaotic system, and their confidence bounds are hard to estimate. Diks et al. (1996) have developed a method that is unique: if the two attractors are the same, the test value will be zero for a sufficient number of samples. The method can be seen as an objective substitute for visual comparison of attractors plotted in the embedding space. The test works for low- and high-dimensional systems, and, unlike the previously mentioned invariants, its variance can be well estimated from the measured data. The following null hypothesis is tested: the two sets of delay vectors (model generated, measured) are drawn from the same multidimensional probability distribution. The two distributions, ρ'_1(r) and ρ'_2(r), are first smoothed using Gaussian kernels, resulting in ρ_1(r) and ρ_2(r), and then the distance between the smoothed distributions is defined to be the volume integral over the squared difference between the two distributions:

Q = ∫ dr (ρ_1(r) − ρ_2(r))².   (5.1)

The smoothing is needed to make Q applicable to the probability distributions encountered in the strange attractors of chaotic systems. Diks et al. (1996) provide an unbiased estimate for Q and for its variance, V_c, where subscript c means "under the condition that the null hypothesis holds." In that case, the ratio S = Q̂/√V_c is a random variable with zero mean and unit variance. Two issues need special attention: the test assumes independence among all vectors, and the Gaussian kernels use a bandwidth parameter d that determines the length scale of the smoothing. In our case, the successive states of the system will be very dependent due to the short prediction time. The simplest solution is to choose a large time interval between the delay vectors, at least several embedding times, and ignore what happens in between. Diks et al. (1996) propose a method that makes more efficient use of the available data. It divides the time series in segments of length l and uses an averaging procedure such that the assumption of independent vectors can be replaced by the assumption of independent segments. The choice of bandwidth d determines the sensitivity of the test. Diks et al. (1996) suggest using a small part of the available data to find out which value of d gives the highest test value and using this value for the final test with the complete sets of data.

In our practical implementation, we compress the delay vectors to principal component vectors, for computational efficiency, knowing that

||x_1 − x_2||² ≈ ||U(z_1 − z_2)||² = ||z_1 − z_2||²,   (5.2)

since U is orthogonal. We vary the sampling interval randomly between one and two embedding times and use l = 4. We generate 11 independent sets with the model, each with a length of two times the measured time-series length. The first set is used to find the optimum value for d, and the other 10 sets to get an estimate of S and to verify experimentally its standard deviation, which is supposed to be unity. Diks et al. (1996) suggest rejecting the null hypothesis with more than 95% confidence for S > 3. Our estimate of S is the average of 10 evaluations. Each evaluation uses new and independent model-generated data, but the same measured data (only the sampling is done again). To account for the 10 half-independent evaluations, we lower the 95% confidence threshold to S = 3/√(½ · 10) ≈ 1.4. Note that the computational complexity of the Diks test is the same as that of computing correlation entropies or dimensions, roughly O(N²), where N is the number of delay vectors in the largest of the two sets to be compared.
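For intuition, the following sketch computes a naive plug-in estimate of the smoothed-distribution distance Q of equation 5.1 between two sets of (principal component) delay vectors, using the fact that the integral of a product of two Gaussian kernels is again a Gaussian in the distance between their centers. It is not the unbiased estimator of Diks et al. (1996), and it omits the overall normalization constant and the variance estimate V_c needed for the actual test.

```python
# A naive plug-in sketch of the distance Q of equation 5.1 between two sets of
# delay (or principal component) vectors, after Gaussian-kernel smoothing with
# bandwidth d. The overall normalization constant is omitted, and this is NOT
# the unbiased estimator (or variance V_c) of Diks et al. (1996).
import numpy as np

def kernel_mean(X, Y, d):
    """Average of exp(-|x - y|^2 / (4 d^2)) over all pairs (x, y)."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.mean(np.exp(-np.sum(diff ** 2, axis=-1) / (4.0 * d ** 2)))

def q_plugin(X, Y, d):
    """Plug-in estimate of the integrated squared difference between the
    smoothed distributions of the two vector sets."""
    return kernel_mean(X, X, d) + kernel_mean(Y, Y, d) - 2.0 * kernel_mean(X, Y, d)
```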

Figure 7: Structure of pendulum model I that uses all available information about the pendulum: the measured angle h, the known phase of the driving force w, the estimated angular velocity Ω, and nonlinear terms (sines and cosines) that appear in the equations of motion, equation 2.1.

6 Modeling the Experimental Pendulum

In this section, we present two models for the chaotic, driven, and damped pendulum that was introduced in section 2.1. The first model, pendulum model I, uses all possible relevant inputs, that is, the measured angle, the known phase of the driving force, and some nonlinear terms suggested by the equations of motion of the pendulum. This model does not use error propagation or principal component analysis; it is used just to demonstrate the performance of the Diks test. The second model, pendulum model II, uses a minimum of different inputs: just a single measured series of the estimated angular velocity Ω (defined in section 2.1). This is a practical verification of the Takens theorem (see section 3), since we disregard two of the three "true" state variables and use delays instead. We will reduce the dimension of the delay vector by PCA. There is no need to use error propagation since the data have a very low noise level.

6.1 Pendulum Model I. This model is the same as that of Bakker et al. (1996), except for the external input that is not present here. The input-output structure is shown in Figure 7. The network has 24 nodes in the first and 8 in the second hidden layer.

Figure 8: Development of the Poincaré plot of pendulum model I during training, to be compared with the plot of measured data (a). The actual development is not as smooth as suggested by (b–f), as can be seen from the Diks test monitoring curves in Figure 9.

The network is trained by backpropagation with conjugate gradient minimization on a time series of 16,000 samples and reaches an error level for the one-step-ahead prediction of the angle h as low as 0.27 degree on independent data. The attractor of this system is the plot of the evolutions of the state vectors (Ω, h, w)_n in a three-dimensional space. Two-dimensional projections, called Poincaré sections, are shown in Figure 8: each time a new cycle of the sinusoidal driving force starts, when the phase w is zero, a point (Ω, h)_n is plotted. Figure 8a shows measured data, while Figures 8b through 8f show how the attractor of the neural network model evolves as training proceeds. Visual inspection reveals that the Poincaré section after 8000 conjugate gradient iterations is still slightly different from the measured attractor, and it is interesting to see if we would draw the same conclusion if we use the Diks test. To enable a fair comparison with the pendulum model II results in the next section, we compute the Diks test not directly from the states (Ω, h, w)_n (that would probably be most powerful), but instead reconstruct the attractor in a delay space of the variable Ω, the only variable that will be available in pendulum model II, and use these delay vectors to compute the Diks test. To see how the sensitivity of the test depends on the length of the time series, we compute it three times, using 1, 2, and 16 data sets of 8000 points each.

Figure 9: Diks test monitoring curve for pendulum model I. The Diks test converges to zero (no difference between model and measured attractor can be found), but the spikes indicate the extreme sensitivity of the model attractor for the model parameters.

These numbers should be compared to the length of the time series used to create the Poincaré sections: 1.5 million points, of which each 32nd point (when the phase w is 0) is plotted. The Diks test results are printed in the top-right corner of each of the graphs in Figure 8. The test based on only one set of 8000 points (Diks 8K) cannot see the difference between Figures 8a and 8c, whereas the test based on all 16 sets (Diks 128K) can still distinguish between Figures 8a and 8e.

It may seem from the Poincaré sections in Figures 8b through 8f that the model attractor gradually converges toward the measured attractor when training proceeds. However, from the Diks test monitoring curve in Figure 9, we can conclude that this is not at all the case. The spikes in the curve indicate that even after the model attractor gets very close to the measured attractor, a complete mismatch may occur from one iteration to the other. We will see a similar behavior for the pendulum model II (see section 6.2) and the laser data model (see section 7), and address the issue in the discussion (see section 8).
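The Poincaré sections of Figure 8 amount to sampling the state once per driving cycle. A minimal sketch, assuming arrays omega_est, angle, and phase sampled at the measurement rate:

```python
# A minimal sketch of the Poincare section construction of Figure 8: a point
# (omega_est, angle) is kept each time a new driving cycle starts. The array
# names are assumptions for illustration.
import numpy as np

def poincare_section(omega_est, angle, phase):
    """Return the (omega, angle) points where the driving phase wraps to zero."""
    phase = np.mod(phase, 2.0 * np.pi)
    starts = np.where(np.diff(phase) < 0)[0] + 1     # wrap-around = new cycle
    return omega_est[starts], angle[starts]
```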

6.2 Pendulum Model II. This model uses information from just a single variable, the estimated angular velocity Ω. From the equations of motion, equation 2.1, it follows that the pendulum has three state variables, and the Takens theorem tells us that in the case of noise-free data, an embedding of 7 (2 · 3 + 1) delays will be sufficient. In our practical application, we take the embedding of Figure 3b with 25 delays, which are reduced to 8 principal components by weighted PCA. The network has 24 nodes in the first and 8 in the second hidden layer, and a linear model is added. First, the network is trained for 4000 iterations on a train set of 8000 points; then a second set of 8000 points is added, and training proceeds for another 2000 iterations, reaching an error level of 0.74 degree, two and a half times larger than model I. The Diks test monitoring curve (not shown) has the same irregular behavior as for pendulum model I. In between the spikes, it eventually varies between one and two. Compare this to model I after 2000 iterations, as shown in Figure 8e, which has a Diks test of 0.6: model II does not learn the attractor as well as model I. That is the price we pay for omitting a priori information about the phase of the driving force and the nonlinear terms from the equations of motion.

We also computed Lyapunov exponents, λ. For model I, we recall the estimated largest exponent of 2.5 bits per driving cycle, as found in Bakker et al. (1996). For model II, we computed a largest exponent of 2.4 bits per cycle, using the method of von Bremen et al. (1997) that uses QR-factorizations of tangent maps along model-generated trajectories. The method actually computes the entire Lyapunov spectrum, and from that we can estimate the information dimension D_1 of the system, using the Kaplan-Yorke conjecture (Kaplan & Yorke, 1979): take that (real) value of i where the interpolated curve of the cumulative sum Σ λ_i versus i crosses zero. This happens at D_1 = 3.5, with an estimated standard deviation of 0.1. For comparison, the ideal chaotic pendulum, as defined by equation 2.1, has three exponents, and as mentioned in Bakker et al. (1996), the largest is positive, the second zero, and the sum of all three is negative. This tells us that the dimension must lie between 2 and 3. This means that pendulum model II has a higher dimension than the real system. We think this is because of the high embedding space (eight principal components) that is used. Apparently the model has not learned to make the five extra, spurious, Lyapunov exponents more negative than the smallest exponent in the real system. Eckmann, Kamphorst, Ruelle, and Ciliberto (1986) report the same problem when they model chaotic data with local linear models. We also computed an estimate for the correlation dimension and entropy of the time series with the computer package RRCHAOS (Schouten & van den Bleek, 1993), which uses the maximum likelihood estimation procedures of Schouten, Takens, and van den Bleek (1994a, 1994b). For the measured data, we found a correlation dimension of 3.1 ± 0.1, and for the model-generated data 2.9 ± 0.1: there was no significant difference between the two numbers, but the numbers are very close to the theoretical maximum of three. For the correlation entropy, we found 2.4 ± 0.1 bits per cycle for measured data, and 2.2 ± 0.2 bits per cycle for the model-generated data. Again there was no significant difference and, moreover, the entropy is almost the same as the estimated largest Lyapunov exponent.

For pendulum model II we relied on the Diks test to compare attractors.
However, even if we do not have the phase of the driving force available as we did in model I, we can still make plots of cross-sections of the attractor by projecting them in a two-dimensional space. This is what we do in Figure 10: every time the first state variable upwardly crosses zero, we plot the remaining seven state variables in a two-dimensional projection.
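The Kaplan-Yorke estimate of D_1 used above can be sketched as follows (a generic implementation of the conjecture, not tied to the models in this section). It assumes the exponents are given largest first and that their total sum is negative.

```python
# A minimal sketch of the Kaplan-Yorke estimate of the information dimension
# D_1 (a generic implementation of the conjecture). It assumes the exponents
# are ordered from largest to smallest and that their total sum is negative.
import numpy as np

def kaplan_yorke_dimension(lyap):
    """Interpolated zero crossing of the cumulative sum of the exponents."""
    lyap = np.asarray(lyap, dtype=float)
    csum = np.cumsum(lyap)
    k = int(np.max(np.where(csum >= 0)[0]))      # last index with sum >= 0
    return (k + 1) + csum[k] / abs(lyap[k + 1])

print(kaplan_yorke_dimension([0.5, 0.0, -1.0]))  # example spectrum -> 2.5
```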

Figure 10: Poincaré section based on measured (a) and model-generated (b) data. A series of truncated states (z_1, z_2, z_3)^T is reconstructed from the time series; a point (z_2, z_3)^T is plotted every time z_1 crosses zero (upward).

7 Laser Data Model

The second application of the training algorithm is the well-known laser data set (set A) from the Santa Fe time series competition (Weigend & Gershenfeld, 1994). The model is similar to that reported in Bakker, Schouten, Giles, and van den Bleek (1998), but instead of using the first 3000 points for training, we use only 1000 points, exactly the amount of data that was available before the results of the time-series competition were published. We took an embedding of 40 and used PCA to reduce the 40 delays to 16 principal components. First, a linear model was fitted to make one-step-ahead predictions. This already explained 80% of the output data variance. The model was extended with an MLP, having 16 inputs (the 16 principal components), two hidden layers with 32 and 24 sigmoidal nodes, and a single output (the time-series value to be predicted). Five separate training sessions were carried out, using 0%, 60%, 70%, 80%, and 90% error propagation (the percentage corresponds to g · 100). Each session lasted 4000 conjugate gradient iterations, and the Diks test was evaluated every 20 iterations. Figure 11 shows the monitoring results. We draw three conclusions from it:

1. While the prediction error on the training set (not shown) monotonically decreases during training, the Diks test shows very irregular behavior. We saw the same irregular behavior for the two pendulum models.

2. The case of 70% error propagation converges best to a model that has learned the attractor and does not "forget the attractor" when training proceeds. The error propagation helps to suppress the noise that is present in the time series.

Figure 11: Diks test monitoring curves for the laser data model with (a) 0%, (b) 60%, (c) 70%, (d) 80%, (e) 90% error propagation. On the basis of these curves, it is decided to select the model trained with 70% error propagation and after 1300 iterations for further analysis.

3. The common stopping criterion, "stop when error on test set is at its minimum," may yield a model with an unacceptable Diks test result. Instead, we have to monitor the Diks test carefully and select exactly that model that has both a low prediction error and a good attractor, that is, a low Diks test value that does not change from one iteration to the other.

Experience has shown that the Diks test result of each training session is sensitive to the initial weights of the MLP; the curves are not well reproducible. Trial and error is required. For example, in Bakker et al. (1998), where we used more training data to model the laser, we selected 90% error propagation (EP) to be the best choice by looking at the Diks test monitoring curves.

For further analysis, we select the model trained with 70% EP after 1300 iterations, having a Diks test value averaged over 100 iterations as low as −0.2. Figure 12b shows a time series generated by this model along with measured data. For the first 1000 points, the model runs in 90% EP mode and follows the measured time series. Then the mode is switched to free run (closed-loop prediction), or 100% EP, and the model generates a time series on its own.

Figure 12: Comparison of time series (c) measured from the laser and (d) closed-loop (free run) predicted by the model. Plots (a) and (b) show the first 2000 points of (c) and (d), respectively.

In Figure 12d the closed-loop prediction is extended over a much longer time horizon. Visual inspection of the time series and the Poincaré plots of Figures 13a and 13b confirms the results of the Diks test: the measured and model-generated series look the same, and it is very surprising that the neural network learned the chaotic dynamics from 1000 points that contain only three "rise-and-collapse" cycles.

Figure 13: Poincaré plots of (a) measured and (b) model-generated laser data. A series of truncated states (z_1, z_2, z_3, z_4)^T is reconstructed from the time series; a point (z_2, z_3, z_4)^T is plotted every time z_1 crosses zero (upward).

We computed the Lyapunov spectrum of the model, using series of tangent maps along 10 different model-generated trajectories of length 6000. The three largest exponents are 0.92, 0.09, and −1.39 bits per 40 samples. The largest exponent is larger than the rough estimate by Hübner et al. (1994) of 0.7. The second exponent must be zero according to Haken (1983). To verify this, we estimate, from 10 evaluations, a standard deviation of the second exponent of 0.1, which is of the same order as its magnitude. From the Kaplan-Yorke conjecture, we estimate an information dimension D_1 of 2.7, not significantly different from the value of 2.6 that we found in another model for the same data (Bakker et al., 1998).

Finally, we compare the short-term prediction error of the model to that of the winner of the Santa Fe time-series competition (see Wan, 1994). Over the first 50 points, our normalized mean squared error (NMSE) is 0.0068 (winner: 0.0061); over the first 100 it is 0.2159 (winner: 0.0273). Given that the long-term behavior of our model closely resembles the measured behavior, we want to stress that the larger short-term prediction error does not imply that the model is "worse" than Wan's model. The system is known to be very unpredictable at the signal collapses, having a number of possible continuations. The model may follow one continuation, while a very slight deviation may cause the "true" continuation to be another.
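For completeness, a minimal sketch of the normalized mean squared error used in the comparison above; dividing the mean squared prediction error by the variance of the measured segment is the convention assumed here.

```python
# A minimal sketch of the normalized mean squared error quoted above; the
# convention assumed here divides the mean squared prediction error by the
# variance of the measured segment.
import numpy as np

def nmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```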

8 Summary and Discussion

We have presented an algorithm to train a neural network to learn chaotic dynamics, based on a single measured time series. The task of the algorithm is to create a global nonlinear model that has the same behavior as the system that produced the measured time series. The common approach consists of two steps: (1) train a model to short-term predict the time series, and (2) analyze the dynamics of the long-term (closed-loop) predictions of this model and decide to accept or reject it. We demonstrate that this approach is likely to fail, and our objective is to reduce the chance of failure greatly. We recognize the need for (1) an efficient state representation, by using weighted PCA; (2) an extended prediction horizon, by adopting the compromise method of Werbos, McAvoy, and Su (1992); and (3) a stopping criterion that selects exactly that model that makes accurate short-term predictions and matches the measured attractor. For this criterion we propose to use the Diks test (Diks et al., 1996) that compares two attractors by looking at the distribution of delay vectors in the reconstructed embedding space. We use an experimental chaotic pendulum to demonstrate features of the algorithm and then benchmark it by modeling the well-known laser data from the Santa Fe time-series competition. We succeeded in creating a model that produces a time series with the same typical irregular rise-and-collapse behavior as the measured data, learned from a 1000-sample data set that contains only three rise-and-collapse sequences.

A major difficulty with training a global nonlinear model to reproduce chaotic dynamics is that the long-term generated (closed-loop) prediction of the model is extremely sensitive to the model parameters. We show that the model attractor does not gradually converge toward the measured attractor when training proceeds. Even after the model attractor is getting close to the measured attractor, a complete mismatch may occur from one training iteration to the other, as can be seen from the spikes in the Diks test monitoring curves of all three models (see Figures 9 and 11). We think of two possible causes. First, the sensitivity of the attractor for the model parameters is a phenomenon inherent in chaotic systems (see, for example, the bifurcation plots in Aguirre & Billings, 1994, Fig. 5). Also the presence of noise in the dynamics of the experimental system may have a significant influence on the appearance of its attractor. A second explanation is that the global nonlinear model does not know that the data that it is trained on are actually on an attractor. Imagine, for example, that we have a two-dimensional state space and the training data set is moving around in this space following a circular orbit. This orbit may be the attractor of the underlying system, but it might as well be an unstable periodic orbit of a completely different underlying system. For the short-term prediction error, it does not matter if the neural network model converges to the first or the second system, but in the latter case, its attractor will be wrong! An interesting development is the work of Principe, Wang, and Motter (1998), who use self-organizing maps to make a partitioning of the input space. It is a first step toward a model that makes accurate short-term predictions and learns the outlines of the (chaotic) attractor. During the review process of this manuscript, such an approach was developed by Bakker et al. (2000).

Appendix: On the Optimum Amount of Error Propagation: Tent Map Example

When using the compromise method, we need to choose a value for the error propagation parameter such that the influence of noise on the resulting model is minimized. We analyze the case of a simple chaotic system, the tent map, with noise added to both the dynamics and the available measurements. The evolution of the tent map is

x_t = −a |x_{t−1} + u_t| + c,   (A.1)

with a = 2 and c = 1/2, and where u_t is the added dynamical noise, assumed to be a purely random process with zero mean and variance σ_u². It is convenient for the analysis to convert equation A.1 to the following notation,

x_t = a_t (x_{t−1} + u_t) + c,   a_t = { a, (x_{t−1} + u_t) < 0;  −a, (x_{t−1} + u_t) ≥ 0 }.   (A.2)

We have a set of noisy measurements available,

y_t = x_t + m_t,   (A.3)

where m_t is a purely random process, with zero mean and variance σ_m². The model we will try to fit is the following:

z_t = −b |z_{t−1}| + c,   (A.4)

where z is the (one-dimensional) state of the model and b is the only parameter to be fitted. In the alternate notation,

z_t = b_t z_{t−1} + c,   b_t = { b, z_{t−1} < 0;  −b, z_{t−1} ≥ 0 }.   (A.5)

The prediction error is defined as

e_t = z_t − y_t,   (A.6)

and during training we minimize SSE, the sum of squares of this error. If, during training, the model is run in error propagation (EP) mode, each new prediction partially depends on previous prediction errors according to

z_t = b_t (y_{t−1} + g e_{t−1}) + c,   (A.7)

where g is the error propagation parameter. We will now derive how the expected SSE depends on g, σ_u², and σ_m². First eliminate z from the prediction error by substituting equation A.7 in A.6,

e_t = b_t (y_{t−1} + g e_{t−1}) + c − y_t,   (A.8)

and use equation A.3 to eliminate y_t and A.2 to eliminate x_t:

e_t = {(b_t − a_t) x_{t−1} − a_t u_t + b_t m_{t−1} − m_t} + b_t g e_{t−1}.   (A.9)

Introduce ε_t ≡ {(b_t − a_t) x_{t−1} − a_t u_t + b_t (1 − g) m_{t−1}}, and assume e_0 = 0. Then the variance σ_e² of e_t, defined as the expectation E[e_t²], can be expanded:

σ_e² = E[−m_t + ε_t + (b_t g) ε_{t−1} + ··· + (b_2 g)^{t−1} ε_1]².   (A.10)

For the tent map, it can be verified that there is no autocorrelation in x (for typical orbits that have a unit density function on the interval [−0.5, 0.5]; see Ott, 1993), under the assumption that u is small enough not to change this property. Because m_t and u_t are purely random processes, there is no autocorrelation in m or u, and no correlation between m and u, or between x and m. Equation A.2 suggests a correlation between x and u, but this cancels out because the multiplier a_t jumps between a and −a, independent from u under the previous assumption that u is small compared to x. Therefore, the expansion A.10 becomes a sum of the squares of all its individual elements without cross products:

σ_e² = E[m_t²] + E[ε_t² + (b_t g)² ε_{t−1}² + ··· + (b_2 g)^{2(t−1)} ε_1²].   (A.11)

From the definition of a_t in equation A.2, it follows that a_t² = a². A similar argument holds for b_t, and if we assume that at each time t, both the arguments x_{t−1} + u_t in equation A.2 and y_{t−1} + g e_{t−1} in A.7 have the same sign (are in the same leg of the tent map), we can also write

(b_t − a_t)² = (b − a)².   (A.12)

Since (b_t g)² = (bg)², the following solution applies for the series A.11:

σ_e² = σ_m² + σ_ε² (1 − (bg)^{2t}) / (1 − (bg)²),   |bg| ≠ 1,   (A.13)

where σ_ε² is the variance of ε_t. Written in full,

σ_e² = σ_m² + ((b − a)² σ_x² + σ_u² + b² (1 − g)² σ_m²) (1 − (bg)^{2t}) / (1 − (bg)²),   |bg| ≠ 1,   (A.14)

and for large t,

σ_e² = σ_m² + ((b − a)² σ_x² + σ_u² + b² (1 − g)² σ_m²) / (1 − (gb)²),   |gb| ≠ 1.   (A.15)

Ideally we would like the difference between the model and real-system parameter, (b − a)² σ_x², to dominate the cost function that is minimized. We can draw three important conclusions from equation A.15:

1. The contribution of measurement noise m_t is reduced when gb approaches 1. This justifies the use of error propagation for time series contaminated by measurement noise.

2. The contribution of dynamical noise u_t is not lowered by error propagation.

3. During training, the minimization routine will keep b well below 1/g, because otherwise the multiplier 1/(1 − (gb)²) would become large and so would the cost function. Therefore, b will be biased toward zero if ga is close to unity.

The above analysis is almost the same for the trajectory learning algorithm (see section 4.1). In that case, g = 1 and equation A.14 becomes

σ_e² = σ_m² + ((b − a)² σ_x² + σ_u²) (1 − b^{2t}) / (1 − b²),   |b| ≠ 1.   (A.16)

Again, measurement noise can be suppressed by choosing t large. This leads to a bias toward zero of b because for a given t > 0, the multiplier (1 − b^{2t}) / (1 − b²) is reduced by lowering b (assuming b > 1, a necessary condition to have a chaotic system).
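Equation A.15 can be checked numerically. The sketch below (a check added for illustration, not part of the original analysis) simulates the noisy tent map, runs the model in error propagation mode with the parameter fixed at the true value b = a, and compares the empirical error variance with equation A.15 for a few values of g with |gb| < 1. The noise levels are assumed, and the simulated orbit is clipped to [−0.5, 0.5] to keep it on the attractor.

```python
# A minimal numerical check (not from the paper) of equation A.15: simulate
# the noisy tent map A.1/A.3, run the model A.4 in EP mode (equation A.7) with
# b fixed at the true value a, and compare the empirical error variance with
# the formula. Noise levels are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
a, c = 2.0, 0.5
s_u, s_m = 0.001, 0.02               # dynamical and measurement noise std
T = 100000

x = np.empty(T)
x[0] = 0.1
u = s_u * rng.standard_normal(T)
for t in range(1, T):
    # equation A.1; clipping keeps the noisy orbit on [-0.5, 0.5]
    x[t] = np.clip(-a * abs(x[t - 1] + u[t]) + c, -0.5, 0.5)
y = x + s_m * rng.standard_normal(T)                        # equation A.3

b = a                                 # model parameter fixed at the truth
for g in (0.0, 0.2, 0.4):             # keep |g*b| < 1 so A.15 applies
    e_prev, sq = 0.0, []
    for t in range(1, T):
        z = -b * abs(y[t - 1] + g * e_prev) + c             # equations A.4, A.7
        e_prev = z - y[t]                                   # equation A.6
        sq.append(e_prev ** 2)
    predicted = s_m**2 + (s_u**2 + b**2 * (1 - g)**2 * s_m**2) / (1 - (g * b)**2)
    print(f"g={g:.1f}  empirical={np.mean(sq):.2e}  equation A.15={predicted:.2e}")
```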

Acknowledgments

This work is supported by the Netherlands Foundation for Chemical Research (SON) with financial aid from the Netherlands Organization for Scientific Research (NWO). We thank Cees Diks for his help with the implementation of the test for attractor comparison. We also thank the two anonymous referees who helped to improve the article greatly.

References

Aguirre, L. A., & Billings, S. A. (1994). Validating identified nonlinear models with chaotic dynamics. Int. J. Bifurcation and Chaos, 4, 109–125.
Albano, A. M., Passamante, A., Hediger, T., & Farrell, M. E. (1992). Using neural nets to look for chaos. Physica D, 58, 1–9.
Bakker, R., de Korte, R. J., Schouten, J. C., Takens, F., & van den Bleek, C. M. (1997). Neural networks for prediction and control of chaotic fluidized bed hydrodynamics: A first step. 5, 523–530.
Bakker, R., Schouten, J. C., Takens, F., & van den Bleek, C. M. (1996). Neural network model to control an experimental chaotic pendulum. Physical Review E, 54A, 3545–3552.
Bakker, R., Schouten, J. C., Coppens, M.-O., Takens, F., Giles, C. L., & van den Bleek, C. M. (2000). Robust learning of chaotic attractors. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 879–885). Cambridge, MA: MIT Press.
Bakker, R., Schouten, J. C., Giles, C. L., & van den Bleek, C. M. (1998). Neural learning of chaotic dynamics: The error propagation algorithm. In Proceedings of the 1998 IEEE World Congress on Computational Intelligence (pp. 2483–2488).
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks, 5, 157–166.

Blackburn, J. A., Vik, S., & Binruo, W. (1989). Driven pendulum for studying chaos. Rev. Sci. Instrum., 60, 422–426.
Bockman, S. F. (1991). Parameter estimation for chaotic systems. In Proceedings of the 1991 American Control Conference (pp. 1770–1775).
Broomhead, D. S., & King, G. P. (1986). Extracting qualitative dynamics from experimental data. Physica D, 20, 217–236.
Bulsari, A. B., & Saxén, H. (1995). A recurrent network for modeling noisy temporal sequences. Neurocomputing, 7, 29–40.
Deco, G., & Schürmann, B. (1994). Neural learning of chaotic system behavior. IEICE Trans. Fundamentals, E77A, 1840–1845.
De Korte, R. J., Schouten, J. C., & van den Bleek, C. M. (1995). Experimental control of a chaotic pendulum with unknown dynamics using delay coordinates. Physical Review E, 52A, 3358–3365.
Diks, C., van Zwet, W. R., Takens, F., & de Goede, J. (1996). Detecting differences between delay vector distributions. Physical Review E, 53, 2169–2176.
Eckmann, J.-P., Kamphorst, S. O., Ruelle, D., & Ciliberto, S. (1986). Liapunov exponents from time series. Physical Review A, 34, 4971–4979.
Farmer, J. D., & Sidorowich, J. J. (1987). Predicting chaotic time series. Phys. Rev. Letters, 59, 62–65.
Grassberger, P., Schreiber, T., & Schaffrath, C. (1991). Nonlinear time sequence analysis. Int. J. Bifurcation Chaos, 1, 521–547.
Haken, H. (1983). At least one Lyapunov exponent vanishes if the trajectory of an attractor does not contain a fixed point. Physics Letters, 94A, 71.
Haykin, S., & Li, B. (1995). Detection of signals in chaos. Proc. IEEE, 83, 95–122.
Haykin, S., & Principe, J. (1998). Making sense of a complex world. IEEE Signal Processing Magazine, pp. 66–81.
Haykin, S., & Puthusserypady, S. (1997). Chaotic dynamics of sea clutter. Chaos, 7, 777–802.
Hübner, U., Weiss, C.-O., Abraham, N. B., & Tang, D. (1994). Lorenz-like chaos

in NH3-FIR lasers (data set A). In A. S. Weigend & N. A. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Jackson, J. E. (1991). A user's guide to principal components. New York: Wiley.
Jaeger, L., & Kantz, H. (1996). Unbiased reconstruction of the dynamics underlying a noisy chaotic time series. Chaos, 6, 440–450.
Kaplan, J., & Yorke, J. (1979). Chaotic behavior of multidimensional difference equations. Lecture Notes in Mathematics, 730.
King, G. P., Jones, R., & Broomhead, D. S. (1987). Nuclear Physics B (Proc. Suppl.), 2, 379.
Krischer, K., Rico-Martinez, R., Kevrekidis, I. G., Rotermund, H. H., Ertl, G., & Hudson, J. L. (1993). Model identification of a spatiotemporally varying catalytic reaction. AIChE Journal, 39, 89–98.
Kuo, J. M., & Principe, J. C. (1994). Reconstructed dynamics and chaotic signal modeling. In Proc. IEEE Int'l Conf. Neural Networks, 5, 3131–3136.
Lapedes, A., & Farber, R. (1987). Nonlinear signal processing using neural networks: Prediction and system modelling (Tech. Rep. No. LA-UR-87-2662). Los Alamos, NM: Los Alamos National Laboratory.

Lin, T., Horne, B. G., Tino, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Trans. on Neural Networks, 7, 1329.
Ott, E. (1993). Chaos in dynamical systems. New York: Cambridge University Press.
Principe, J. C., Rathie, A., & Kuo, J. M. (1992). Prediction of chaotic time series with neural networks and the issue of dynamic modeling. Int. J. Bifurcation and Chaos, 2, 989–996.
Principe, J. C., Wang, L., & Motter, M. A. (1998). Local dynamic modeling with self-organizing maps and applications to nonlinear system identification and control. Proc. of the IEEE, 6, 2240–2257.
Rico-Martínez, R., Krischer, K., Kevrekidis, I. G., Kube, M. C., & Hudson, J. L. (1992). Discrete- vs. continuous-time nonlinear signal processing of Cu electrodissolution data. Chem. Eng. Comm., 118, 25–48.
Schouten, J. C., & van den Bleek, C. M. (1993). RRCHAOS: A menu-driven software package for chaotic time series analysis. Delft, The Netherlands: Reactor Research Foundation.
Schouten, J. C., Takens, F., & van den Bleek, C. M. (1994a). Maximum-likelihood estimation of the entropy of an attractor. Physical Review E, 49, 126–129.
Schouten, J. C., Takens, F., & van den Bleek, C. M. (1994b). Estimation of the dimension of a noisy attractor. Physical Review E, 50, 1851–1861.
Siegelmann, H. T., Horne, B. G., & Giles, C. L. (1997). Computational capabilities of recurrent NARX neural networks. IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics, 27, 208.
Su, H.-T., McAvoy, T., & Werbos, P. (1992). Long-term predictions of chemical processes using recurrent neural networks: A parallel training approach. Ind. Eng. Chem. Res., 31, 1338–1352.
Takens, F. (1981). Detecting strange attractors in turbulence. Lecture Notes in Mathematics, 898, 365–381.
von Bremen, H. F., Udwadia, F. E., & Proskurowski, W. (1997). An efficient QR based method for the computation of Lyapunov exponents. Physica D, 101, 1–16.
Wan, E. A. (1994). Time series prediction by using a connectionist network with internal delay lines. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Weigend, A. S., & Gershenfeld, N. A. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Werbos, P. J., McAvoy, T., & Su, T. (1992). Neural networks, system identification, and control in the chemical process industries. In D. A. White & D. A. Sofge (Eds.), Handbook of intelligent control. New York: Van Nostrand Reinhold.

Received December 7, 1998; accepted October 13, 1999.