Predictability, Complexity, and Learning

ARTICLE    Communicated by Jean-Pierre Nadal

William Bialek
NEC Research Institute, Princeton, NJ 08540, U.S.A.

Ilya Nemenman
NEC Research Institute, Princeton, NJ 08540, U.S.A., and Department of Physics, Princeton University, Princeton, NJ 08544, U.S.A.

Naftali Tishby
NEC Research Institute, Princeton, NJ 08540, U.S.A., and School of Computer Science and Engineering and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

Neural Computation 13, 2409-2463 (2001). © 2001 Massachusetts Institute of Technology.

We define predictive information I_pred(T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times T: I_pred(T) can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then I_pred(T) grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power-law growth is associated, for example, with the learning of infinite parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and the analysis of physical systems through statistical mechanics and dynamical systems theory. Furthermore, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of I_pred(T) provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in problems in physics, statistics, and biology.

1 Introduction

There is an obvious interest in having practical algorithms for predicting the future, and there is a correspondingly large literature on the problem of time-series extrapolation.^1 But prediction is both more and less than extrapolation. We might be able to predict, for example, the chance of rain in the coming week even if we cannot extrapolate the trajectory of temperature fluctuations. In the spirit of its thermodynamic origins, information theory (Shannon, 1948) characterizes the potentialities and limitations of all possible prediction algorithms, as well as unifying the analysis of extrapolation with the more general notion of predictability. Specifically, we can define a quantity—the predictive information—that measures how much our observations of the past can tell us about the future. The predictive information characterizes the world we are observing, and we shall see that this characterization is close to our intuition about the complexity of the underlying dynamics.

^1 The classic papers are by Kolmogoroff (1939, 1941) and Wiener (1949), who essentially solved all the extrapolation problems that could be solved by linear methods. Our understanding of predictability was changed by developments in dynamical systems, which showed that apparently random (chaotic) time series could arise from simple deterministic rules, and this led to vigorous exploration of nonlinear extrapolation algorithms (Abarbanel et al., 1993). For a review comparing different approaches, see the conference proceedings edited by Weigend and Gershenfeld (1994).

Prediction is one of the fundamental problems in neural computation. Much of what we admire in expert human performance is predictive in character: the point guard who passes the basketball to a place where his teammate will arrive in a split second, the chess master who knows how moves made now will influence the end game two hours hence, the investor who buys a stock in anticipation that it will grow in the year to come. More generally, we gather sensory information not for its own sake but in the hope that this information will guide our actions (including our verbal actions).
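To keep the central quantity concrete as the argument develops, it helps to restate the definition from the abstract in symbols. The following is a sketch consistent with that verbal definition, writing S(T) for the entropy of fluctuations observed in a window of duration T; the paper develops this notation in detail in its later sections:

    I_{\rm pred}(T, T') = S(T) + S(T') - S(T + T'),
    I_{\rm pred}(T) = \lim_{T' \to \infty} I_{\rm pred}(T, T').

Here I_pred(T, T') is the mutual information between a past window of duration T and a future window of duration T', and the three classes announced in the abstract correspond to I_pred(T) remaining finite, growing logarithmically, or growing as a power of T.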
But acting takes time, and sense data can guide us only to the extent that those data inform us about the state of the world at the time of our actions, so the only components of the incoming data that have a chance of being useful are those that are predictive. Put bluntly, nonpredictive information is useless to the organism, and it therefore makes sense to isolate the predictive information. It will turn out that most of the information we collect over a long period of time is nonpredictive, so that isolating the predictive information must go a long way toward separating out those features of the sensory world that are relevant for behavior.

One of the most important examples of prediction is the phenomenon of generalization in learning. Learning is formalized as finding a model that explains or describes a set of observations, but again this is useful only because we expect this model will continue to be valid. In the language of learning theory (see, for example, Vapnik, 1998), an animal can gain selective advantage not from its performance on the training data but only from its performance at generalization. Generalizing—and not "overfitting" the training data—is precisely the problem of isolating those features of the data that have predictive value (see also Bialek and Tishby, in preparation). Furthermore, we know that the success of generalization hinges on controlling the complexity of the models that we are willing to consider as possibilities.

Finally, learning a model to describe a data set can be seen as an encoding of those data, as emphasized by Rissanen (1989), and the quality of this encoding can be measured using the ideas of information theory. Thus, the exploration of learning problems should provide us with explicit links among the concepts of entropy, predictability, and complexity.

The notion of complexity arises not only in learning theory, but also in several other contexts. Some physical systems exhibit more complex dynamics than others (turbulent versus laminar flows in fluids), and some systems evolve toward more complex states than others (spin glasses versus ferromagnets). The problem of characterizing complexity in physical systems has a substantial literature of its own (for an overview, see Bennett, 1990). In this context several authors have considered complexity measures based on entropy or mutual information, although, as far as we know, no clear connections have been drawn among the measures of complexity that arise in learning theory and those that arise in dynamical systems and statistical mechanics.

An essential difficulty in quantifying complexity is to distinguish complexity from randomness. A true random string cannot be compressed and hence requires a long description; it thus is complex in the sense defined by Kolmogorov (1965; Li & Vitanyi, 1993; Vitanyi & Li, 2000), yet the physical process that generates this string may have a very simple description. Both in statistical mechanics and in learning theory, our intuitive notions of complexity correspond to statements about the complexity of the underlying process, and not directly to the description length or Kolmogorov complexity.
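The compression argument can be seen in a few lines of code. This is a minimal sketch, not anything from the paper, using only Python's standard os and zlib modules; the exact compressed sizes will vary slightly with the zlib version:

    import os
    import zlib

    n = 100_000
    random_bytes = os.urandom(n)   # no regularity: near-maximal description length
    patterned = b"ab" * (n // 2)   # generated by a very simple rule

    # A random string barely compresses; the patterned one collapses.
    print(len(zlib.compress(random_bytes, 9)))  # roughly n bytes
    print(len(zlib.compress(patterned, 9)))     # a few hundred bytes at most

The random string has the longer description and hence the larger Kolmogorov complexity, yet the process that produced it ("draw n uniform random bytes") is trivial to describe and offers nothing to predict. This is exactly the sense in which description length fails to capture the intuitive complexity of the underlying process.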
Our central result is that the predictive information provides a general measure of complexity, which includes as special cases the relevant concepts from learning theory and dynamical systems. While work on complexity in learning theory rests specifically on the idea that one is trying to infer a model from data, the predictive information is a property of the data (or, more precisely, of an ensemble of data) themselves without reference to a specific class of underlying models. If the data are generated by a process in a known class but with unknown parameters, then we can calculate the predictive information explicitly and show that this information diverges logarithmically with the size of the data set we have observed; the coefficient of this divergence counts the number of parameters in the model or, more precisely, the effective dimension of the model class, and this provides a link to known results of Rissanen and others (see the numerical sketch at the end of this section). We also can quantify the complexity of processes that fall outside the conventional finite dimensional models, and we show that these more complex processes are characterized by a power-law rather than a logarithmic divergence of the predictive information. By analogy with the analysis of critical phenomena in statistical physics, the separation of logarithmic from power-law divergences, together with the measurement of coefficients and exponents for these divergences, allows us to define "universality classes" for the complexity of data streams. The power-law or nonparametric class of processes may be crucial in real-world learning tasks, where the effective number of parameters becomes so large that asymptotic results for finitely parameterizable models are inaccessible in practice. There is empirical evidence that simple physical systems can generate dynamics in this complexity class, and there are hints that language also may fall in this class.

Finally, we argue that the divergent components of the predictive information provide a unique measure of complexity that is consistent with certain simple requirements. This argument is in the spirit of Shannon's original derivation of entropy as the unique measure of available information. We believe that this uniqueness argument provides a conclusive answer to the question of how one should quantify the complexity of a process generating a time series.

With the evident cost of lengthening our discussion, we have tried to give a self-contained presentation that develops our point of view, uses simple examples to connect with known results, and then generalizes and goes beyond these results. Even in cases
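Returning to the logarithmic divergence claimed above, the coefficient that counts parameters can be checked numerically in the simplest parametric family. The sketch below is an illustration under stated assumptions, not code from the paper: the data are N flips of a coin with unknown bias, the prior over the bias is uniform, and we use the fact that with an infinitely long future the predictive information reduces to the information the observed flips carry about the bias (past and future are independent given the bias). With K = 1 parameter, I_pred(N) should grow as (1/2) ln N:

    import numpy as np
    from math import log
    from scipy.special import digamma, gammaln

    def predictive_info_coin(N):
        """Exact I_pred(N) in nats for N flips of a coin with unknown bias
        theta ~ Uniform(0, 1); equals the mutual information between the
        observed flips and theta when the future is infinitely long."""
        k = np.arange(N + 1)
        # log marginal probability of one particular sequence with k heads:
        # m(k) = Beta(k + 1, N - k + 1)
        log_m = gammaln(k + 1) + gammaln(N - k + 1) - gammaln(N + 2)
        # posterior-averaged log-likelihood of the same sequence
        psi = digamma(N + 2)
        ell = (k * (digamma(k + 1) - psi)
               + (N - k) * (digamma(N - k + 1) - psi))
        # each head count k carries total marginal weight 1/(N + 1)
        return float(np.mean(ell - log_m))

    for N in (10, 100, 1000, 10000):
        print(f"N={N:6d}  I_pred = {predictive_info_coin(N):6.3f} nats,"
              f"  (1/2) ln N = {0.5 * log(N):6.3f}")

The gap between the two printed columns settles to a constant as N grows, leaving the (K/2) ln N divergence with K = 1; each additional effective parameter would contribute another (1/2) ln N.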
