
ARTICLE  Communicated by Jean-Pierre Nadal

Predictability, Complexity, and Learning

William Bialek
NEC Research Institute, Princeton, NJ 08540, U.S.A.

Ilya Nemenman
NEC Research Institute, Princeton, New Jersey 08540, U.S.A., and Department of Physics, Princeton University, Princeton, NJ 08544, U.S.A.

Naftali Tishby
NEC Research Institute, Princeton, NJ 08540, U.S.A., and School of Computer Science and Engineering and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

We define predictive information $I_{\rm pred}(T)$ as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times $T$: $I_{\rm pred}(T)$ can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then $I_{\rm pred}(T)$ grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power-law growth is associated, for example, with the learning of infinite parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and the analysis of physical systems through statistical mechanics and dynamical systems theory. Furthermore, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of $I_{\rm pred}(T)$ provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in problems in physics, statistics, and biology.

1 Introduction

There is an obvious interest in having practical algorithms for predicting the future, and there is a correspondingly large literature on the problem of time-series extrapolation.¹ But prediction is both more and less than extrapolation.

¹ The classic papers are by Kolmogoroff (1939, 1941) and Wiener (1949), who essentially solved all the extrapolation problems that could be solved by linear methods. Our understanding of predictability was changed by developments in dynamical systems, which showed that apparently random (chaotic) time series could arise from simple deterministic rules, and this led to vigorous exploration of nonlinear extrapolation (Abarbanel et al., 1993). For a review comparing different approaches, see the conference proceedings edited by Weigend and Gershenfeld (1994).

We might be able to predict, for example, the chance of rain in the coming week even if we cannot extrapolate the trajectory of temperature fluctuations. In the spirit of its thermodynamic origins, information theory (Shannon, 1948) characterizes the potentialities and limitations of all possible prediction algorithms, as well as unifying the analysis of extrapolation with the more general notion of prediction. Specifically, we can define a quantity—the predictive information—that measures how much our observations of the past can tell us about the future. The predictive information characterizes the world we are observing, and we shall see that this characterization is close to our intuition about the complexity of the underlying dynamics.

Prediction is one of the fundamental problems in neural computation. Much of what we admire in expert human performance is predictive in character: the point guard who passes the basketball to a place where his teammate will arrive in a split second, the chess master who knows how moves made now will influence the end game two hours hence, the investor who buys a stock in anticipation that it will grow in the year to come. More generally, we gather sensory information not for its own sake but in the hope that this information will guide our actions (including our verbal actions). But acting takes time, and sense data can guide us only to the extent that those data inform us about the state of the world at the time of our actions, so the only components of the incoming data that have a chance of being useful are those that are predictive. Put bluntly, nonpredictive information is useless to the organism, and it therefore makes sense to isolate the predictive information. It will turn out that most of the information we collect over a long period of time is nonpredictive, so that isolating the predictive information must go a long way toward separating out those features of the sensory world that are relevant for behavior.

One of the most important examples of prediction is the phenomenon of generalization in learning. Learning is formalized as finding a model that explains or describes a set of observations, but again this is useful only because we expect this model will continue to be valid. In the language of learning theory (see, for example, Vapnik, 1998), an animal can gain selective advantage not from its performance on the training data but only from its performance at generalization. Generalizing—and not "overfitting" the training data—is precisely the problem of isolating those features of the data that have predictive value (see also Bialek and Tishby, in preparation). Furthermore, we know that the success of generalization hinges on controlling the complexity of the models that we are willing to consider as possibilities.

Finally, learning a model to describe a data set can be seen as an encoding of those data, as emphasized by Rissanen (1989), and the quality of this encoding can be measured using the ideas of information theory. Thus, the exploration of learning problems should provide us with explicit links among the concepts of entropy, predictability, and complexity.

The notion of complexity arises not only in learning theory, but also in several other contexts. Some physical systems exhibit more complex dynamics than others (turbulent versus laminar flows in fluids), and some systems evolve toward more complex states than others (spin glasses versus ferromagnets). The problem of characterizing complexity in physical systems has a substantial literature of its own (for an overview, see Bennett, 1990). In this context several authors have considered complexity measures based on entropy or mutual information, although, as far as we know, no clear connections have been drawn among the measures of complexity that arise in learning theory and those that arise in dynamical systems and statistical mechanics.

An essential difficulty in quantifying complexity is to distinguish complexity from randomness. A true random string cannot be compressed and hence requires a long description; it thus is complex in the sense defined by Kolmogorov (1965; Li & Vitanyi, 1993; Vitanyi & Li, 2000), yet the physical process that generates this string may have a very simple description. Both in statistical mechanics and in learning theory, our intuitive notions of complexity correspond to statements about the complexity of the underlying process, and not directly to the description length or Kolmogorov complexity.

Our central result is that the predictive information provides a general measure of complexity, which includes as special cases the relevant concepts from learning theory and dynamical systems. While work on complexity in learning theory rests specifically on the idea that one is trying to infer a model from data, the predictive information is a property of the data (or, more precisely, of an ensemble of data) themselves without reference to a specific class of underlying models. If the data are generated by a process in a known class but with unknown parameters, then we can calculate the predictive information explicitly and show that this information diverges logarithmically with the size of the data set we have observed; the coefficient of this divergence counts the number of parameters in the model or, more precisely, the effective dimensionality of the model class, and this provides a link to known results of Rissanen and others. We also can quantify the complexity of processes that fall outside the conventional finite dimensional models, and we show that these more complex processes are characterized by a power-law rather than a logarithmic divergence of the predictive information.

By analogy with the analysis of critical phenomena in statistical physics, the separation of logarithmic from power-law divergences, together with the measurement of coefficients and exponents for these divergences, allows us to define "universality classes" for the complexity of data streams. The power-law or nonparametric class of processes may be crucial in real-world learning tasks, where the effective number of parameters becomes so large that asymptotic results for finitely parameterizable models are inaccessible in practice. There is empirical evidence that simple physical systems can generate dynamics in this complexity class, and there are hints that language also may fall in this class.
Finally, we argue that the divergent components of the predictive information provide a unique measure of complexity that is consistent with certain simple requirements. This argument is in the spirit of Shannon's original derivation of entropy as the unique measure of available information. We believe that this uniqueness argument provides a conclusive answer to the question of how one should quantify the complexity of a process generating a time series.

With the evident cost of lengthening our discussion, we have tried to give a self-contained presentation that develops our point of view, uses simple examples to connect with known results, and then generalizes and goes beyond these results.² Even in cases where at least the qualitative form of our results is known from previous work, we believe that our point of view elucidates some issues that may have been less the focus of earlier studies. Last but not least, we explore the possibilities for connecting our theoretical discussion with the experimental characterization of learning and complexity in neural systems.

2 A Curious Observation

Before starting the systematic analysis of the problem, we want to motivate our discussion further by presenting results of some simple numerical experiments. Since most of the article draws examples from learning, here we consider examples from equilibrium statistical mechanics. Suppose that we have a one-dimensional chain of Ising spins with the Hamiltonian given by

$$
H = -\sum_{i,j} J_{ij} s_i s_j , \qquad (2.1)
$$

where the matrix of interactions $J_{ij}$ is not restricted to nearest neighbors; long-range interactions are also allowed. One may identify spins pointing upward with 1 and downward with 0, and then a spin chain is equivalent to some sequence of binary digits. This sequence consists of (overlapping) words of $N$ digits each, $W_k$, $k = 0, 1, \ldots, 2^N - 1$. There are $2^N$ such words total, and they appear with very different frequencies $n(W_k)$ in the spin chain (see Figure 1 for details).

² Some of the basic ideas presented here, together with some connections to earlier work, can be found in brief preliminary reports (Bialek, 1995; Bialek & Tishby, 1999). The central results of this work, however, were at best conjectures in these preliminary accounts.

[Figure 1 shows the 17-spin chain 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0 0 1 together with its overlapping four-spin words, labeled $W_0 = 0000$, $W_1 = 0001$, $W_2 = 0010$, ..., $W_{15} = 1111$.]

Figure 1: Calculating the entropy of words of length 4 in a chain of 17 spins. For this chain, $n(W_0) = n(W_1) = n(W_3) = n(W_7) = n(W_{12}) = n(W_{14}) = 2$, $n(W_8) = n(W_9) = 1$, and all other frequencies are zero. Thus, $S(4) \approx 2.95$ bits.
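[Editorial note: as a quick consistency check, not part of the original article, the counting in Figure 1 can be reproduced with a few lines of Python; the chain and the word length are taken directly from the caption.]

```python
# Reproduce the word counts and the entropy S(4) quoted in the caption of Figure 1.
from collections import Counter
from math import log2

chain = "00001110011100001"          # the 17 spins of Figure 1 (up spin = 1, down spin = 0)
N = 4                                # word length
words = [chain[i:i + N] for i in range(len(chain) - N + 1)]   # 14 overlapping words
counts = Counter(words)              # the frequencies n(W_k)

total = sum(counts.values())
S4 = -sum(n / total * log2(n / total) for n in counts.values())
print(counts)                        # six words occur twice, two words occur once
print(f"S(4) = {S4:.2f} bits")       # reproduces S(4) ~ 2.95 bits
```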

If the number of spins is large, then counting these frequencies provides a good empirical estimate of $P_N(W_k)$, the probability distribution of different words of length $N$. Then one can calculate the entropy $S(N)$ of this probability distribution by the usual formula:

$$
S(N) = -\sum_{k=0}^{2^N - 1} P_N(W_k) \log_2 P_N(W_k) \ \ {\rm (bits)}. \qquad (2.2)
$$

Note that this is not the entropy of a finite chain with length $N$; instead, it is the entropy of words or strings with length $N$ drawn from a much longer chain. Nonetheless, since entropy is an extensive property, $S(N)$ is asymptotically proportional to $N$ for any spin chain, that is, $S(N) \approx \mathcal{S}_0 \cdot N$. The usual goal in statistical mechanics is to understand this "thermodynamic limit" $N \to \infty$, and hence to calculate the entropy density $\mathcal{S}_0$. Different sets of interactions $J_{ij}$ result in different values of $\mathcal{S}_0$, but the qualitative result $S(N) \propto N$ is true for all reasonable $\{J_{ij}\}$.

We investigated three different spin chains of 1 billion spins each. As usual in statistical mechanics, the probability of any configuration of spins $\{s_i\}$ is given by the Boltzmann distribution,

$$
P[\{s_i\}] \propto \exp\left( -H / k_B T \right), \qquad (2.3)
$$

where to normalize the scale of the $J_{ij}$ we set $k_B T = 1$. For the first chain, only $J_{i,i+1} = 1$ was nonzero, and its value was the same for all $i$. The second chain was also generated using the nearest-neighbor interactions, but the value of the coupling was reset every 400,000 spins by taking a random number from a gaussian distribution with zero mean and unit variance. In the third case, we again reset the interactions at the same frequency, but now interactions were long-ranged; the variance of coupling constants decreased with the distance between the spins as $\langle J_{ij}^2 \rangle = 1/(i-j)^2$. We plot $S(N)$ for all these cases in Figure 2, and, of course, the asymptotically linear behavior is evident—the extensive entropy shows no qualitative distinction among the three cases we consider.

[Figure 2 plots $S(N)$ (in bits) versus word length $N$ (up to 25) for the three chains: fixed $J$; variable $J$, short-range interactions; and variable $J$'s, long-range decaying interactions.]

Figure 2: Entropy as a function of the word length for spin chains with different interactions. Notice that all lines start from $S(N=1) = \log_2 2 = 1$ since, at the values of the coupling we investigated, the correlation length is much smaller than the chain length ($1 \cdot 10^9$ spins).

The situation changes drastically if we remove the asymptotic linear contribution and focus on the corrections to extensive behavior. Specifically, we write $S(N) = \mathcal{S}_0 \cdot N + S_1(N)$, and plot only the sublinear component $S_1(N)$ of the entropy. As we see in Figure 3, the three chains then exhibit qualitatively different features: for the first one, $S_1$ is constant; for the second one, it is logarithmic; and for the third one, it clearly shows a power-law behavior.

What is the significance of these observations? Of course, the differences in the behavior of $S_1(N)$ must be related to the ways we chose the $J$'s for the three chains. In the first case, $J$ is fixed, and if we see $N$ spins and try to predict the state of the $N+1$st spin, all that really matters is the state of the spin $s_N$; there is nothing to "learn" from observations on longer segments of the chain. For the second chain, $J$ changes, and the statistics of the spin words are different in different parts of the sequence. By looking at these statistics, one can "learn" the coupling at the current position; this estimate improves the more spins (longer words) we observe. Finally, in the third case, there are many coupling constants that can be learned. As $N$ increases, one becomes sensitive to weaker correlations caused by interactions over larger and larger distances. So, intuitively, the qualitatively different behaviors of $S_1(N)$ in the three plotted cases are correlated with differences in the problem of learning the underlying dynamics of the spin chains from observations on samples of the spins themselves. Much of this article can be seen as expanding on and quantifying this intuitive observation.

[Figure 3 plots the subextensive part $S_1(N)$ versus $N$ (up to 25) for the three chains, together with fits: $S_1 = {\rm const}_1$ for the fixed-$J$ chain; $S_1 = {\rm const}_1 + \frac{1}{2}\log_2 N$ for the variable-$J$, short-range chain; and $S_1 = {\rm const}_1 + {\rm const}_2\, N^{0.5}$ for the variable-$J$, long-range chain.]

Figure 3: Subextensive part of the entropy as a function of the word length.
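[Editorial note: the following is a rough sketch, ours rather than the authors' code, of the simplest of the three experiments: a nearest-neighbor chain with fixed $J = 1$ and $k_B T = 1$. The chain is generated as a two-state Markov chain via the standard transfer-matrix alignment probability; we use $10^6$ spins rather than the paper's $10^9$ and estimate the entropy density crudely from the last increment of $S(N)$. The subextensive part $S_1(N)$ should then level off at a constant, as in the fixed-$J$ curve of Figure 3.]

```python
import numpy as np

rng = np.random.default_rng(0)
J = 1.0                                     # fixed nearest-neighbor coupling
p_align = 1.0 / (1.0 + np.exp(-2 * J))      # P(s_{i+1} = s_i) at k_B T = 1
n_spins = 10**6                             # 10^6 spins here; the paper uses 10^9

spins = np.empty(n_spins, dtype=np.int8)
spins[0] = rng.integers(0, 2)
flips = rng.random(n_spins - 1) > p_align           # True wherever the next spin flips
spins[1:] = (spins[0] + np.cumsum(flips)) % 2       # value follows the parity of flips so far

def word_entropy(bits, N):
    """Empirical entropy (bits) of the overlapping words of length N."""
    n_words = len(bits) - N + 1
    codes = np.zeros(n_words, dtype=np.int64)
    for k in range(N):                      # encode each word as an integer
        codes = codes * 2 + bits[k:k + n_words]
    counts = np.bincount(codes, minlength=2**N).astype(float)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

Ns = np.arange(1, 21)
S = np.array([word_entropy(spins, n) for n in Ns])
S0 = S[-1] - S[-2]                 # crude estimate of the entropy density S_0
S1 = S - S0 * Ns                   # subextensive part, S(N) = S_0 * N + S_1(N)
print(np.round(S1, 3))             # for fixed J this stays roughly constant in N
```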

3 Fundamentals

The problem of prediction comes in various forms, as noted in the Introduction. Information theory allows us to treat the different notions of prediction on the same footing. The first step is to recognize that all predictions are probabilistic. Even if we can predict the temperature at noon tomorrow, we should provide error bars or confidence limits on our prediction. The next step is to remember that even before we look at the data, we know that certain futures are more likely than others, and we can summarize this knowledge by a prior probability distribution for the future. Our observations on the past lead us to a new, more tightly concentrated distribution: the distribution of futures conditional on the past data. Different kinds of predictions are different slices through or averages over this conditional distribution, but information theory quantifies the "concentration" of the distribution without making any commitment as to which averages will be most interesting.

Imagine that we observe a stream of data $x(t)$ over a time interval $-T < t < 0$. Let all of these past data be denoted by the shorthand $x_{\rm past}$. We are interested in saying something about the future, so we want to know about the data $x(t)$ that will be observed in the time interval $0 < t < T'$; let these future data be called $x_{\rm future}$. In the absence of any other knowledge, futures are drawn from the probability distribution $P(x_{\rm future})$, and observations of particular past data $x_{\rm past}$ tell us that futures will be drawn from the conditional distribution $P(x_{\rm future}|x_{\rm past})$. The greater concentration of the conditional distribution can be quantified by the fact that it has smaller entropy than the prior distribution, and this reduction in entropy is Shannon's definition of the information that the past provides about the future. We can write the average of this predictive information as

$$
I_{\rm pred}(T, T') = \left\langle \log_2 \left[ \frac{P(x_{\rm future}|x_{\rm past})}{P(x_{\rm future})} \right] \right\rangle \qquad (3.1)
$$
$$
= -\langle \log_2 P(x_{\rm future}) \rangle - \langle \log_2 P(x_{\rm past}) \rangle - \left[ -\langle \log_2 P(x_{\rm future}, x_{\rm past}) \rangle \right], \qquad (3.2)
$$

where $\langle \cdots \rangle$ denotes an average over the joint distribution of the past and the future, $P(x_{\rm future}, x_{\rm past})$.

Each of the terms in equation 3.2 is an entropy. Since we are interested in predictability or generalization, which are associated with some features of the signal persisting forever, we may assume stationarity or invariance under time translations. Then the entropy of the past data depends only on the duration of our observations, so we can write $-\langle \log_2 P(x_{\rm past}) \rangle = S(T)$, and by the same argument $-\langle \log_2 P(x_{\rm future}) \rangle = S(T')$. Finally, the entropy of the past and the future taken together is the entropy of observations on a window of duration $T + T'$, so that $-\langle \log_2 P(x_{\rm future}, x_{\rm past}) \rangle = S(T + T')$. Putting these equations together, we obtain

$$
I_{\rm pred}(T, T') = S(T) + S(T') - S(T + T'). \qquad (3.3)
$$

It is important to recall that mutual information is a symmetric quantity. Thus, we can view $I_{\rm pred}(T, T')$ as either the information that a data segment of duration $T$ provides about the future of length $T'$ or the information that a data segment of duration $T'$ provides about the immediate past of duration $T$. This is a direct application of the definition of information but seems counterintuitive. Shouldn't it be more difficult to predict than to postdict? One can perhaps recover the correct intuition by thinking about a large ensemble of question-answer pairs. Prediction corresponds to generating the answer to a given question, while postdiction corresponds to generating the question that goes with a given answer. We know that guessing questions given answers is also a hard problem³ and can be just as challenging as the more conventional problem of answering the questions themselves.

³ This is the basis of the popular American television game show Jeopardy!, widely viewed as the most "intellectual" of its genre.

Our focus here is on prediction because we want to make connections with the phenomenon of generalization in learning, but it is clear that generating a consistent interpretation of observed data may involve elements of both prediction and postdiction (see, for example, Eagleman & Sejnowski, 2000); it is attractive that the information-theoretic formulation treats these problems symmetrically.

In the same way that the entropy of a gas at fixed density is proportional to the volume, the entropy of a time series (asymptotically) is proportional to its duration, so that $\lim_{T \to \infty} S(T)/T = \mathcal{S}_0$; entropy is an extensive quantity. But from equation 3.3, any extensive component of the entropy cancels in the computation of the predictive information: predictability is a deviation from extensivity. If we write $S(T) = \mathcal{S}_0 T + S_1(T)$, then equation 3.3 tells us that the predictive information is related only to the nonextensive term $S_1(T)$. Note that if we are observing a deterministic system, then $\mathcal{S}_0 = 0$, but this is independent of questions about the structure of the subextensive term $S_1(T)$. It is attractive that information theory gives us a unified discussion of prediction in deterministic and probabilistic cases.

We know two general facts about the behavior of $S_1(T)$. First, the corrections to extensive behavior are positive, $S_1(T) \geq 0$. Second, the statement that entropy is extensive is the statement that the limit

$$
\lim_{T \to \infty} \frac{S(T)}{T} = \mathcal{S}_0 \qquad (3.4)
$$

exists, and for this to be true we must also have

$$
\lim_{T \to \infty} \frac{S_1(T)}{T} = 0. \qquad (3.5)
$$

Thus, the nonextensive terms in the entropy must be subextensive, that is, they must grow with $T$ less rapidly than a linear function. Taken together, these facts guarantee that the predictive information is positive and subextensive. Furthermore, if we let the future extend forward for a very long time, $T' \to \infty$, then we can measure the information that our sample provides about the entire future:

$$
I_{\rm pred}(T) = \lim_{T' \to \infty} I_{\rm pred}(T, T') = S_1(T). \qquad (3.6)
$$

Similarly, instead of increasing the duration of the future to infinity, we could have considered the mutual information between a sample of length $T$ and all of the infinite past. Then the postdictive information also is equal to $S_1(T)$, and the symmetry between prediction and postdiction is even more profound; not only is there symmetry between questions and answers, but observations on a given period of time provide the same amount of information about the historical path that led to our observations as about the future that will unfold from them. In some cases, this statement becomes even stronger. For example, if the subextensive entropy of a long, discontinuous observation of a total length $T$ with a gap of a duration $\delta T \ll T$ is equal to $S_1(T) + O(\frac{\delta T}{T})$, then the subextensive entropy of the present is not only its information about the past or the future, but also the information about the past and the future.

If we have been observing a time series for a (long) time $T$, then the total amount of data we have collected is measured by the entropy $S(T)$, and at large $T$ this is given approximately by $\mathcal{S}_0 T$. But the predictive information that we have gathered cannot grow linearly with time, even if we are making predictions about a future that stretches out to infinity. As a result, of the total information we have taken in by observing $x_{\rm past}$, only a vanishing fraction is of relevance to the prediction:

$$
\lim_{T \to \infty} \frac{\mbox{Predictive information}}{\mbox{Total information}} = \lim_{T \to \infty} \frac{I_{\rm pred}(T)}{S(T)} = 0. \qquad (3.7)
$$

In this precise sense, most of what we observe is irrelevant to the problem of predicting the future.⁴

Consider the case where time is measured in discrete steps, so that we have seen $N$ time points $x_1, x_2, \ldots, x_N$. How much have we learned about the underlying pattern in these data? The more we know, the more effectively we can predict the next data point $x_{N+1}$ and hence the fewer bits we will need to describe the deviation of this data point from our prediction. Our accumulated knowledge about the time series is measured by the degree to which we can compress the description of new observations. On average, the length of the code word required to describe the point $x_{N+1}$, given that we have seen the previous $N$ points, is given by

$$
\ell(N) = -\langle \log_2 P(x_{N+1} \,|\, x_1, x_2, \ldots, x_N) \rangle \ \mbox{bits}, \qquad (3.8)
$$

where the expectation value is taken over the joint distribution of all the $N+1$ points, $P(x_1, x_2, \ldots, x_N, x_{N+1})$. It is easy to see that

$$
\ell(N) = S(N+1) - S(N) \approx \frac{\partial S(N)}{\partial N}. \qquad (3.9)
$$

As we observe for longer times, we learn more, and this word length decreases. It is natural to define a learning curve that measures this improvement. Usually we define learning curves by measuring the frequency or costs of errors; here the cost is that our encoding of the point $x_{N+1}$ is longer than it could be if we had perfect knowledge.

⁴ We can think of equation 3.7 as a law of diminishing returns. Although we collect data in proportion to our observation time $T$, a smaller and smaller fraction of this information is useful in the problem of prediction. These diminishing returns are not due to a limited lifetime, since we calculate the predictive information assuming that we have a future extending forward to infinity. A senior colleague points out that this is an argument for changing fields before becoming too expert.

This ideal encoding has a length that we can find by imagining that we observe the time series for an infinitely long time, $\ell_{\rm ideal} = \lim_{N \to \infty} \ell(N)$, but this is just another way of defining the extensive component of the entropy $\mathcal{S}_0$. Thus, we can define a learning curve

$$
\Lambda(N) \equiv \ell(N) - \ell_{\rm ideal} \qquad (3.10)
$$
$$
= S(N+1) - S(N) - \mathcal{S}_0 = S_1(N+1) - S_1(N) \approx \frac{\partial S_1(N)}{\partial N} = \frac{\partial I_{\rm pred}(N)}{\partial N}, \qquad (3.11)
$$

and we see once again that the extensive component of the entropy cancels.

It is well known that the problems of prediction and compression are related, and what we have done here is to illustrate one aspect of this connection. Specifically, if we ask how much one segment of a time series can tell us about the future, the answer is contained in the subextensive behavior of the entropy. If we ask how much we are learning about the structure of the time series, then the natural and universally defined learning curve is related again to the subextensive entropy; the learning curve is the derivative of the predictive information.

This universal learning curve is connected to the more conventional learning curves in specific contexts. As an example (cf. section 4.1), consider fitting a set of data points $\{x_n, y_n\}$ with some class of functions $y = f(x; \boldsymbol{\alpha})$, where $\boldsymbol{\alpha}$ are unknown parameters that need to be learned; we also allow for some gaussian noise in our observation of the $y_n$. Here the natural learning curve is the evolution of $\chi^2$ for generalization as a function of the number of examples. Within the approximations discussed below, it is straightforward to show that as $N$ becomes large,

$$
\chi^2(N) = \frac{1}{\sigma^2} \left\langle \left[ y - f(x; \boldsymbol{\alpha}) \right]^2 \right\rangle \to (2 \ln 2)\, \Lambda(N) + 1, \qquad (3.12)
$$

where $\sigma^2$ is the variance of the noise. Thus, a more conventional measure of performance at learning a function is equal to the universal learning curve defined purely by information-theoretic criteria. In other words, if a learning curve is measured in the right units, then its integral represents the amount of the useful information accumulated. Then the subextensivity of $S_1$ guarantees that the learning curve decreases to zero as $N \to \infty$.

Different quantities related to the subextensive entropy have been discussed in several contexts. For example, the code length $\ell(N)$ has been defined as a learning curve in the specific case of neural networks (Opper & Haussler, 1995) and has been termed the "thermodynamic dive" (Crutchfield & Shalizi, 1999) and "$N$th order block entropy" (Grassberger, 1986).

The universal learning curve $\Lambda(N)$ has been studied as the expected instantaneous information gain by Haussler, Kearns, and Schapire (1994). Mutual information between all of the past and all of the future (both semi-infinite) is known also as the excess entropy, effective measure complexity, stored information, and so on (see Shalizi & Crutchfield, 1999, and references therein, as well as the discussion below). If the data allow a description by a model with a finite (and in some cases also infinite) number of parameters, then mutual information between the data and the parameters is of interest. This is easily shown to be equal to the predictive information about all of the future, and it is also the cumulative information gain (Haussler, Kearns, & Schapire, 1994) or the cumulative relative entropy risk (Haussler & Opper, 1997). Investigation of this problem can be traced back to Renyi (1964) and Ibragimov and Hasminskii (1972), and some particular topics are still being discussed (Haussler & Opper, 1995; Opper & Haussler, 1995; Herschkowitz & Nadal, 1999). In certain limits, decoding a signal from a population of $N$ neurons can be thought of as "learning" a parameter from $N$ observations, with a parallel notion of information transmission (Brunel & Nadal, 1998; Kang & Sompolinsky, 2001). In addition, the subextensive component of the description length (Rissanen, 1978, 1989, 1996; Clarke & Barron, 1990) averaged over a class of allowed models also is similar to the predictive information. What is important is that the predictive information or subextensive entropy is related to all these quantities and that it can be defined for any process without a reference to a class of models. It is this universality that we find appealing, and this universality is strongest if we focus on the limit of long observation times. Qualitatively, in this regime ($T \to \infty$) we expect the predictive information to behave in one of three different ways, as illustrated by the Ising models above: it may either stay finite or grow to infinity together with $T$; in the latter case the rate of growth may be slow (logarithmic) or fast (sublinear power) (see Barron and Cover, 1991, for a similar classification in the framework of the minimum description length [MDL] analysis).

The first possibility, $\lim_{T \to \infty} I_{\rm pred}(T) = {\rm constant}$, means that no matter how long we observe, we gain only a finite amount of information about the future. This situation prevails, for example, when the dynamics are too regular. For a purely periodic system, complete prediction is possible once we know the phase, and if we sample the data at discrete times, this is a finite amount of information; longer period orbits intuitively are more complex and also have larger $I_{\rm pred}$, but this does not change the limiting behavior $\lim_{T \to \infty} I_{\rm pred}(T) = {\rm constant}$.

Alternatively, the predictive information can be small when the dynamics are irregular, but the best predictions are controlled only by the immediate past, so that the correlation times of the observable data are finite (see, for example, Crutchfield & Feldman, 1997, and the fixed $J$ case in Figure 3). Imagine, for example, that we observe $x(t)$ at a series of discrete times $\{t_n\}$, and that at each time point we find the value $x_n$. Then we always can write the joint distribution of the $N$ data points as a product,

$$
P(x_1, x_2, \ldots, x_N) = P(x_1) P(x_2|x_1) P(x_3|x_2, x_1) \cdots . \qquad (3.13)
$$

For Markov processes, what we observe at $t_n$ depends only on events at the previous time step $t_{n-1}$, so that

$$
P(x_n \,|\, \{x_{1 \le i \le n-1}\}) = P(x_n \,|\, x_{n-1}), \qquad (3.14)
$$

and hence the predictive information reduces to

$$
I_{\rm pred} = \left\langle \log_2 \left[ \frac{P(x_n \,|\, x_{n-1})}{P(x_n)} \right] \right\rangle. \qquad (3.15)
$$

The maximum possible predictive information in this case is the entropy of the distribution of states at one time step, which is bounded by the logarithm of the number of accessible states. To approach this bound, the system must maintain memory for a long time, since the predictive information is reduced by the entropy of the transition probabilities. Thus, systems with more states and longer memories have larger values of $I_{\rm pred}$.

More interesting are those cases in which $I_{\rm pred}(T)$ diverges at large $T$. In physical systems, we know that there are critical points where correlation times become infinite, so that optimal predictions will be influenced by events in the arbitrarily distant past. Under these conditions, the predictive information can grow without bound as $T$ becomes large; for many systems, the divergence is logarithmic, $I_{\rm pred}(T \to \infty) \propto \ln T$, as for the variable $J_{ij}$, short-range Ising model of Figures 2 and 3. Long-range correlations also are important in a time series where we can learn some underlying rules. It will turn out that when the set of possible rules can be described by a finite number of parameters, the predictive information again diverges logarithmically, and the coefficient of this divergence counts the number of parameters. Finally, a faster growth is also possible, so that $I_{\rm pred}(T \to \infty) \propto T^{a}$, as for the variable $J_{ij}$ long-range Ising model, and we shall see that this behavior emerges from, for example, nonparametric learning problems.
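[Editorial note: equation 3.15 is easy to evaluate in closed form for a small Markov chain. The sketch below is ours, and the transition matrix is an arbitrary illustrative example; it simply shows that for a stationary two-state chain the predictive information is the entropy of the one-step distribution minus the average entropy of the transition probabilities, a finite number however long we observe.]

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# transition matrix T[i, j] = P(x_n = j | x_{n-1} = i); the numbers are just an example
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# stationary distribution of a two-state chain
pi = np.array([T[1, 0], T[0, 1]]) / (T[0, 1] + T[1, 0])

S_states = entropy(pi)                                   # entropy of the one-step distribution P(x_n)
S_trans = sum(pi[i] * entropy(T[i]) for i in range(2))   # average entropy of the transition probabilities
print(f"I_pred = {S_states - S_trans:.3f} bits (bounded by log2 of the number of states = 1 bit)")
```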

4 Learning and Predictability

Learning is of interest precisely in those situations where correlations or associations persist over long periods of time. In the usual theoretical models, there is some rule underlying the observable data, and this rule is valid forever; examples seen at one time inform us about the rule, and this information can be used to make predictions or generalizations. The predictive information quantifies the average generalization power of examples, and we shall see that there is a direct connection between the predictive information and the complexity of the possible underlying rules.

4.1 A Test Case. Let us begin with a simple example, already referred to above. We observe two streams of data $x$ and $y$, or equivalently a stream of pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$. Assume that we know in advance that the $x$'s are drawn independently and at random from a distribution $P(x)$, while the $y$'s are noisy versions of some function acting on $x$,

$$
y_n = f(x_n; \boldsymbol{\alpha}) + \eta_n, \qquad (4.1)
$$

where $f(x; \boldsymbol{\alpha})$ is a class of functions parameterized by $\boldsymbol{\alpha}$, and $\eta_n$ is noise, which for simplicity we will assume is gaussian with known standard deviation $\sigma$. We can even start with a very simple case, where the function class is just a linear combination of basis functions, so that

$$
f(x; \boldsymbol{\alpha}) = \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x). \qquad (4.2)
$$

The usual problem is to estimate, from $N$ pairs $\{x_i, y_i\}$, the values of the parameters $\boldsymbol{\alpha}$. In favorable cases such as this, we might even be able to find an effective regression formula. We are interested in evaluating the predictive information, which means that we need to know the entropy $S(N)$. We go through the calculation in some detail because it provides a model for the more general case.

To evaluate the entropy $S(N)$, we first construct the probability distribution $P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)$. The same set of rules applies to the whole data stream, which here means that the same parameters $\boldsymbol{\alpha}$ apply for all pairs $\{x_i, y_i\}$, but these parameters are chosen at random from a distribution $\mathcal{P}(\boldsymbol{\alpha})$ at the start of the stream. Thus we write

$$
P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \int d^K\alpha \, P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N \,|\, \boldsymbol{\alpha}) \, \mathcal{P}(\boldsymbol{\alpha}), \qquad (4.3)
$$

and now we need to construct the conditional distributions for fixed $\boldsymbol{\alpha}$. By hypothesis, each $x$ is chosen independently, and once we fix $\boldsymbol{\alpha}$, each $y_i$ is correlated only with the corresponding $x_i$, so that we have

$$
P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N \,|\, \boldsymbol{\alpha}) = \prod_{i=1}^{N} \left[ P(x_i) \, P(y_i \,|\, x_i; \boldsymbol{\alpha}) \right]. \qquad (4.4)
$$

Furthermore, with the simple assumptions above about the class of functions and gaussian noise, the conditional distribution of $y_i$ has the form

$$
P(y_i \,|\, x_i; \boldsymbol{\alpha}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} \left( y_i - \sum_{\mu=1}^{K} \alpha_\mu \phi_\mu(x_i) \right)^2 \right]. \qquad (4.5)
$$

Putting all these factors together,

$$
P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) = \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} y_i^2 \right]
$$
$$
\times \exp\left[ -\frac{N}{2} \sum_{\mu,\nu=1}^{K} A_{\mu\nu}(\{x_i\}) \alpha_\mu \alpha_\nu + N \sum_{\mu=1}^{K} B_\mu(\{x_i, y_i\}) \alpha_\mu \right], \qquad (4.6)
$$

where

$$
A_{\mu\nu}(\{x_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} \phi_\mu(x_i) \phi_\nu(x_i) \quad {\rm and} \qquad (4.7)
$$
$$
B_\mu(\{x_i, y_i\}) = \frac{1}{\sigma^2 N} \sum_{i=1}^{N} y_i \phi_\mu(x_i). \qquad (4.8)
$$

Our placement of the factors of $N$ means that both $A_{\mu\nu}$ and $B_\mu$ are of order unity as $N \to \infty$. These quantities are empirical averages over the samples $\{x_i, y_i\}$, and if the $\phi_\mu$ are well behaved, we expect that these empirical means converge to expectation values for most realizations of the series $\{x_i\}$:

$$
\lim_{N \to \infty} A_{\mu\nu}(\{x_i\}) = A^{\infty}_{\mu\nu} = \frac{1}{\sigma^2} \int dx\, P(x)\, \phi_\mu(x) \phi_\nu(x), \qquad (4.9)
$$
$$
\lim_{N \to \infty} B_\mu(\{x_i, y_i\}) = B^{\infty}_{\mu} = \sum_{\nu=1}^{K} A^{\infty}_{\mu\nu} \bar{\alpha}_\nu, \qquad (4.10)
$$

where $\bar{\boldsymbol{\alpha}}$ are the parameters that actually gave rise to the data stream $\{x_i, y_i\}$. In fact, we can make the same argument about the terms in $\sum y_i^2$,

$$
\lim_{N \to \infty} \sum_{i=1}^{N} y_i^2 = N \sigma^2 \left[ \sum_{\mu,\nu=1}^{K} \bar{\alpha}_\mu A^{\infty}_{\mu\nu} \bar{\alpha}_\nu + 1 \right]. \qquad (4.11)
$$

Conditions for this convergence of empirical means to expectation values are at the heart of learning theory. Our approach here is first to assume that this convergence works, then to examine the consequences for the predictive information, and finally to address the conditions for and implications of this convergence breaking down. Putting the different factors together, we obtain

$$
P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \;\tilde{\to}\; \left[ \prod_{i=1}^{N} P(x_i) \right] \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \exp\left[ -N E_N(\boldsymbol{\alpha}; \{x_i, y_i\}) \right], \qquad (4.12)
$$

where the effective "energy" per sample is given by

$$
E_N(\boldsymbol{\alpha}; \{x_i, y_i\}) = \frac{1}{2} + \frac{1}{2} \sum_{\mu,\nu=1}^{K} (\alpha_\mu - \bar{\alpha}_\mu) A^{\infty}_{\mu\nu} (\alpha_\nu - \bar{\alpha}_\nu). \qquad (4.13)
$$

Here we use the symbol $\tilde{\to}$ to indicate that we not only take the limit of large $N$ but also neglect the fluctuations. Note that in this approximation, the dependence on the sample points themselves is hidden in the definition of $\bar{\boldsymbol{\alpha}}$ as being the parameters that generated the samples.

The integral that we need to do in equation 4.12 involves an exponential with a large factor $N$ in the exponent; the energy $E_N$ is of order unity as $N \to \infty$. This suggests that we evaluate the integral by a saddle-point or steepest-descent approximation (similar analyses were performed by Clarke & Barron, 1990; MacKay, 1992; and Balasubramanian, 1997):

$$
\int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \exp\left[ -N E_N(\boldsymbol{\alpha}; \{x_i, y_i\}) \right] \approx \mathcal{P}(\boldsymbol{\alpha}_{\rm cl}) \exp\left[ -N E_N(\boldsymbol{\alpha}_{\rm cl}; \{x_i, y_i\}) - \frac{K}{2} \ln \frac{N}{2\pi} - \frac{1}{2} \ln \det \mathcal{F}_N + \cdots \right], \qquad (4.14)
$$

where $\boldsymbol{\alpha}_{\rm cl}$ is the "classical" value of $\boldsymbol{\alpha}$ determined by the extremal conditions

$$
\left. \frac{\partial E_N(\boldsymbol{\alpha}; \{x_i, y_i\})}{\partial \alpha_\mu} \right|_{\boldsymbol{\alpha} = \boldsymbol{\alpha}_{\rm cl}} = 0, \qquad (4.15)
$$

the matrix $\mathcal{F}_N$ consists of the second derivatives of $E_N$,

$$
\mathcal{F}_N = \left. \frac{\partial^2 E_N(\boldsymbol{\alpha}; \{x_i, y_i\})}{\partial \alpha_\mu \partial \alpha_\nu} \right|_{\boldsymbol{\alpha} = \boldsymbol{\alpha}_{\rm cl}}, \qquad (4.16)
$$

and $\cdots$ denotes terms that vanish as $N \to \infty$. If we formulate the problem of estimating the parameters $\boldsymbol{\alpha}$ from the samples $\{x_i, y_i\}$, then as $N \to \infty$, the matrix $N \mathcal{F}_N$ is the Fisher information matrix (Cover & Thomas, 1991); the eigenvectors of this matrix give the principal axes for the error ellipsoid in parameter space, and the (inverse) eigenvalues give the variances of parameter estimates along each of these directions. The classical $\boldsymbol{\alpha}_{\rm cl}$ differs from $\bar{\boldsymbol{\alpha}}$ only in terms of order $1/N$; we neglect this difference and further simplify the calculation of leading terms as $N$ becomes large. After a little more algebra, we find the probability distribution we have been looking for:

$$
P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N) \;\tilde{\to}\; \left[ \prod_{i=1}^{N} P(x_i) \right] \frac{1}{Z_A} \, \mathcal{P}(\bar{\boldsymbol{\alpha}}) \exp\left[ -\frac{N}{2} \ln(2\pi e \sigma^2) - \frac{K}{2} \ln N + \cdots \right], \qquad (4.17)
$$

where the normalization constant

$$
Z_A = \sqrt{(2\pi)^K \det A^{\infty}}. \qquad (4.18)
$$

Again we note that the sample points $\{x_i, y_i\}$ are hidden in the value of $\bar{\boldsymbol{\alpha}}$ that gave rise to these points.⁵

To evaluate the entropy $S(N)$, we need to compute the expectation value of the (negative) logarithm of the probability distribution in equation 4.17; there are three terms. One is constant, so averaging is trivial. The second term depends only on the $x_i$, and because these are chosen independently from the distribution $P(x)$, the average again is easy to evaluate. The third term involves $\bar{\boldsymbol{\alpha}}$, and we need to average this over the joint distribution $P(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)$. As above, we can evaluate this average in steps. First, we choose a value of the parameters $\bar{\boldsymbol{\alpha}}$, then we average over the samples given these parameters, and finally we average over parameters. But because $\bar{\boldsymbol{\alpha}}$ is defined as the parameters that generate the samples, this stepwise procedure simplifies enormously. The end result is that

$$
S(N) = N \left[ S_x + \frac{1}{2} \log_2 (2\pi e \sigma^2) \right] + \frac{K}{2} \log_2 N + S_a + \langle \log_2 Z_A \rangle_a + \cdots, \qquad (4.19)
$$

where $\langle \cdots \rangle_a$ means averaging over parameters, $S_x$ is the entropy of the distribution of $x$,

$$
S_x = -\int dx \, P(x) \log_2 P(x), \qquad (4.20)
$$

and similarly for the entropy of the distribution of parameters,

$$
S_a = -\int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \log_2 \mathcal{P}(\boldsymbol{\alpha}). \qquad (4.21)
$$

⁵ We emphasize once more that there are two approximations leading to equation 4.17. First, we have replaced empirical means by expectation values, neglecting fluctuations associated with the particular set of sample points $\{x_i, y_i\}$. Second, we have evaluated the average over parameters in a saddle-point approximation. At least under some conditions, both of these approximations become increasingly accurate as $N \to \infty$, so that this approach should yield the asymptotic behavior of the distribution and, hence, the subextensive entropy at large $N$. Although we give a more detailed analysis below, it is worth noting here how things can go wrong. The two approximations are independent, and we could imagine that fluctuations are important but saddle-point integration still works, for example. Controlling the fluctuations turns out to be exactly the question of whether our finite parameterization captures the true dimensionality of the class of models, as discussed in the classic work of Vapnik, Chervonenkis, and others (see Vapnik, 1998, for a review). The saddle-point approximation can break down because the saddle point becomes unstable or because multiple saddle points become important. It will turn out that instability is exponentially improbable as $N \to \infty$, while multiple saddle points are a real problem in certain classes of models, again when counting parameters does not really measure the complexity of the model class.

The different terms in the entropy equation 4.19 have a straightforward interpretation. First, we see that the extensive term in the entropy,

$$
\mathcal{S}_0 = S_x + \frac{1}{2} \log_2 (2\pi e \sigma^2), \qquad (4.22)
$$

reflects contributions from the random choice of $x$ and from the gaussian noise in $y$; these extensive terms are independent of the variations in parameters $\boldsymbol{\alpha}$, and these would be the only terms if the parameters were not varying (that is, if there were nothing to learn). There also is a term that reflects the entropy of variations in the parameters themselves, $S_a$. This entropy is not invariant with respect to coordinate transformations in the parameter space, but the term $\langle \log_2 Z_A \rangle_a$ compensates for this noninvariance. Finally, and most interesting for our purposes, the subextensive piece of the entropy is dominated by a logarithmic divergence,

$$
S_1(N) \to \frac{K}{2} \log_2 N \ \ {\rm (bits)}. \qquad (4.23)
$$

The coefficient of this divergence counts the number of parameters independent of the coordinate system that we choose in the parameter space. Furthermore, this result does not depend on the set of basis functions $\{\phi_\mu(x)\}$. This is a hint that the result in equation 4.23 is more universal than our simple example.
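[Editorial note: the logarithmic growth in equation 4.23 can be checked numerically for this gaussian test case. For a gaussian prior on the $K$ parameters, the mutual information between the $N$ samples and the parameters has the standard closed form $\frac{1}{2}\log_2\det(1 + C\,\Phi^T\Phi/\sigma^2)$, and, as noted in section 3, this mutual information is identified with the predictive information about all of the future. The basis functions, prior covariance, and noise level in the sketch below are our own illustrative choices, not the authors'.]

```python
import numpy as np

rng = np.random.default_rng(1)
K, sigma = 3, 0.5                                            # K basis functions, known noise level
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x * x]   # hypothetical basis: 1, x, x^2
C = np.eye(K)                                                # prior covariance of the parameters alpha

def info_bits(N):
    """Bits of information that N samples give about the parameters, for one draw of the x's."""
    x = rng.uniform(-1.0, 1.0, size=N)
    Phi = np.column_stack([phi(x) for phi in basis])         # N x K design matrix
    M = np.eye(K) + C @ Phi.T @ Phi / sigma**2
    sign, logdet = np.linalg.slogdet(M)
    return 0.5 * logdet / np.log(2)

for N in (100, 1000, 10000, 100000):
    print(N, round(info_bits(N), 2))
# successive values differ by roughly (K/2) * log2(10) ~ 5 bits per decade of N
```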

4.2 Learning a Parameterized Distribution. The problem discussed above is an example of supervised learning. We are given examples of how the points $x_n$ map into $y_n$, and from these examples we are to induce the association or functional relation between $x$ and $y$. An alternative view is that the pair of points $(x, y)$ should be viewed as a vector $\vec{x}$, and what we are learning is the distribution of this vector. The problem of learning a distribution usually is called unsupervised learning, but in this case, supervised learning formally is a special case of unsupervised learning; if we admit that all the functional relations or associations that we are trying to learn have an element of noise or stochasticity, then this connection between supervised and unsupervised problems is quite general.

Suppose a series of random vector variables $\{\vec{x}_i\}$ is drawn independently from the same probability distribution $Q(\vec{x}|\boldsymbol{\alpha})$, and this distribution depends on a (potentially infinite dimensional) vector of parameters $\boldsymbol{\alpha}$. As above, the parameters are unknown, and before the series starts, they are chosen randomly from a distribution $\mathcal{P}(\boldsymbol{\alpha})$. With no constraints on the densities $\mathcal{P}(\boldsymbol{\alpha})$ or $Q(\vec{x}|\boldsymbol{\alpha})$, it is impossible to derive any regression formulas for parameter estimation, but one can still say something about the entropy of the data series and thus the predictive information. For a finite-dimensional vector of parameters $\boldsymbol{\alpha}$, the literature on bounding similar quantities is especially rich (Haussler, Kearns, & Schapire, 1994; Wong & Shen, 1995; Haussler & Opper, 1995, 1997, and references therein), and related asymptotic calculations have been done (Clarke & Barron, 1990; MacKay, 1992; Balasubramanian, 1997).

We begin with the definition of entropy:

$$
S(N) \equiv S[\{\vec{x}_i\}] = -\int d\vec{x}_1 \cdots d\vec{x}_N \, P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \log_2 P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N). \qquad (4.24)
$$

By analogy with equation 4.3, we then write

$$
P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) = \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \prod_{i=1}^{N} Q(\vec{x}_i|\boldsymbol{\alpha}). \qquad (4.25)
$$

Next, combining the equations 4.24 and 4.25 and rearranging the order of integration, we can rewrite $S(N)$ as

$$
S(N) = -\int d^K\bar{\alpha} \, \mathcal{P}(\bar{\boldsymbol{\alpha}}) \left\{ \int d\vec{x}_1 \cdots d\vec{x}_N \left[ \prod_{j=1}^{N} Q(\vec{x}_j|\bar{\boldsymbol{\alpha}}) \right] \log_2 P(\{\vec{x}_i\}) \right\}. \qquad (4.26)
$$

Writing $P(\{\vec{x}_i\})$ relative to the distribution indexed by the parameters $\bar{\boldsymbol{\alpha}}$ that actually generated the data, we have

$$
P(\vec{x}_1, \ldots, \vec{x}_N) = \left[ \prod_{j=1}^{N} Q(\vec{x}_j|\bar{\boldsymbol{\alpha}}) \right] \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \prod_{i=1}^{N} \left[ \frac{Q(\vec{x}_i|\boldsymbol{\alpha})}{Q(\vec{x}_i|\bar{\boldsymbol{\alpha}})} \right] = \left[ \prod_{j=1}^{N} Q(\vec{x}_j|\bar{\boldsymbol{\alpha}}) \right] \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \exp\left[ -N E_N(\boldsymbol{\alpha}; \{\vec{x}_i\}) \right], \qquad (4.27)
$$
$$
E_N(\boldsymbol{\alpha}; \{\vec{x}_i\}) = -\frac{1}{N} \sum_{i=1}^{N} \ln\left[ \frac{Q(\vec{x}_i|\boldsymbol{\alpha})}{Q(\vec{x}_i|\bar{\boldsymbol{\alpha}})} \right]. \qquad (4.28)
$$

Since by our interpretation, $\bar{\boldsymbol{\alpha}}$ are the true parameters that gave rise to the particular data $\{\vec{x}_i\}$, we may expect that the empirical average in equation 4.28 will converge to the corresponding expectation value, so that

$$
E_N(\boldsymbol{\alpha}; \{\vec{x}_i\}) = -\int d\vec{x} \, Q(\vec{x}|\bar{\boldsymbol{\alpha}}) \ln\left[ \frac{Q(\vec{x}|\boldsymbol{\alpha})}{Q(\vec{x}|\bar{\boldsymbol{\alpha}})} \right] - \psi(\boldsymbol{\alpha}, \bar{\boldsymbol{\alpha}}; \{\vec{x}_i\}), \qquad (4.29)
$$

where $\psi \to 0$ as $N \to \infty$; here we neglect $\psi$ and return to this term below.

The first term on the right-hand side of equation 4.29 is the Kullback-Leibler (KL) divergence, $D_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha})$, between the true distribution characterized by parameters $\bar{\boldsymbol{\alpha}}$ and the possible distribution characterized by $\boldsymbol{\alpha}$. Thus, at large $N$, we have

$$
P(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N) \;\tilde{\to}\; \left[ \prod_{j=1}^{N} Q(\vec{x}_j|\bar{\boldsymbol{\alpha}}) \right] \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \exp\left[ -N D_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha}) \right], \qquad (4.30)
$$

where again the notation $\tilde{\to}$ reminds us that we are not only taking the limit of large $N$ but also making another approximation in neglecting fluctuations. By the same arguments as above, we can proceed (formally) to compute the entropy of this distribution. We find

$$
S(N) \approx \mathcal{S}_0 \cdot N + S_1^{(a)}(N), \qquad (4.31)
$$
$$
\mathcal{S}_0 = \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \left[ -\int d\vec{x} \, Q(\vec{x}|\boldsymbol{\alpha}) \log_2 Q(\vec{x}|\boldsymbol{\alpha}) \right], \ {\rm and} \qquad (4.32)
$$

$$
S_1^{(a)}(N) = -\int d^K\bar{\alpha} \, \mathcal{P}(\bar{\boldsymbol{\alpha}}) \log_2 \left[ \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \, e^{-N D_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha})} \right]. \qquad (4.33)
$$

Here $S_1^{(a)}$ is an approximation to $S_1$ that neglects fluctuations $\psi$. This is the same as the annealed approximation in the statistical mechanics of disordered systems, as has been used widely in the study of supervised learning problems (Seung, Sompolinsky, & Tishby, 1992). Thus, we can identify the particular data sequence $\vec{x}_1, \ldots, \vec{x}_N$ with the disorder, $E_N(\boldsymbol{\alpha}; \{\vec{x}_i\})$ with the energy of the quenched system, and $D_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha})$ with its annealed analog.

The extensive term $\mathcal{S}_0$, equation 4.32, is the average entropy of a distribution in our family of possible distributions, generalizing the result of equation 4.22. The subextensive terms in the entropy are controlled by the $N$ dependence of the partition function

$$
Z(\bar{\boldsymbol{\alpha}}; N) = \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \exp\left[ -N D_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha}) \right], \qquad (4.34)
$$

and $S_1^{(a)}(N) = -\langle \log_2 Z(\bar{\boldsymbol{\alpha}}; N) \rangle_{\bar{\alpha}}$ is analogous to the free energy. Since what is important in this integral is the KL divergence between different distributions, it is natural to ask about the density of models that are KL divergence $D$ away from the target $\bar{\boldsymbol{\alpha}}$,

$$
\rho(D; \bar{\boldsymbol{\alpha}}) = \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \, \delta\left[ D - D_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha}) \right]. \qquad (4.35)
$$

This density could be very different for different targets.⁶ The density of divergences is normalized because the original distribution over parameter space, $\mathcal{P}(\boldsymbol{\alpha})$, is normalized,

$$
\int dD \, \rho(D; \bar{\boldsymbol{\alpha}}) = \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) = 1. \qquad (4.36)
$$

Finally, the partition function takes the simple form

$$
Z(\bar{\boldsymbol{\alpha}}; N) = \int dD \, \rho(D; \bar{\boldsymbol{\alpha}}) \exp\left[ -N D \right]. \qquad (4.37)
$$

We recall that in statistical mechanics, the partition function is given by

$$
Z(\beta) = \int dE \, \rho(E) \exp\left[ -\beta E \right], \qquad (4.38)
$$

where $\rho(E)$ is the density of states that have energy $E$ and $\beta$ is the inverse temperature. Thus, the subextensive entropy in our learning problem is analogous to a system in which energy corresponds to the KL divergence relative to the target model, and temperature is inverse to the number of examples. As we increase the length $N$ of the time series we have observed, we "cool" the system and hence probe models that approach the target; the dynamics of this approach is determined by the density of low-energy states, that is, the behavior of $\rho(D; \bar{\boldsymbol{\alpha}})$ as $D \to 0$.⁷

The structure of the partition function is determined by a competition between the (Boltzmann) exponential term, which favors models with small $D$, and the density term, which favors values of $D$ that can be achieved by the largest possible number of models. Because there (typically) are many parameters, there are very few models with $D \to 0$. This picture of competition between the Boltzmann factor and a density of states has been emphasized in previous work on supervised learning (Haussler et al., 1996).

The behavior of the density of states, $\rho(D; \bar{\boldsymbol{\alpha}})$, at small $D$ is related to the more intuitive notion of dimensionality. In a parameterized family of distributions, the KL divergence between two distributions with nearby parameters is approximately a quadratic form,

$$
D_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha}) \approx \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu) \mathcal{F}_{\mu\nu} (\bar{\alpha}_\nu - \alpha_\nu) + \cdots, \qquad (4.39)
$$

⁶ If parameter space is compact, then a related description of the space of targets based on metric entropy, also called Kolmogorov's $\epsilon$-entropy, is used in the literature (see, for example, Haussler & Opper, 1997). Metric entropy is the logarithm of the smallest number of disjoint parts of the size not greater than $\epsilon$ into which the target space can be partitioned.

⁷ It may be worth emphasizing that this analogy to statistical mechanics emerges from the analysis of the relevant probability distributions rather than being imposed on the problem through some assumptions about the nature of the learning algorithm.

where $\mathcal{F}_{\mu\nu}$ is the Fisher information matrix. Intuitively, if we have a reasonable parameterization of the distributions, then similar distributions will be nearby in parameter space, and, more important, points that are far apart in parameter space will never correspond to similar distributions; Clarke and Barron (1990) refer to this condition as the parameterization forming a "sound" family of distributions. If this condition is obeyed, then we can approximate the low $D$ limit of the density $\rho(D; \bar{\boldsymbol{\alpha}})$:

$$
\rho(D; \bar{\boldsymbol{\alpha}}) = \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \, \delta\left[ D - D_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha}) \right]
$$
$$
\approx \int d^K\alpha \, \mathcal{P}(\boldsymbol{\alpha}) \, \delta\left[ D - \frac{1}{2} \sum_{\mu\nu} (\bar{\alpha}_\mu - \alpha_\mu) \mathcal{F}_{\mu\nu} (\bar{\alpha}_\nu - \alpha_\nu) \right]
$$
$$
= \int d^K\xi \, \mathcal{P}(\bar{\boldsymbol{\alpha}} + \mathcal{U} \cdot \boldsymbol{\xi}) \, \delta\left[ D - \frac{1}{2} \sum_{\mu} \Lambda_\mu \xi_\mu^2 \right], \qquad (4.40)
$$

where $\mathcal{U}$ is a matrix that diagonalizes $\mathcal{F}$,

$$
(\mathcal{U}^T \cdot \mathcal{F} \cdot \mathcal{U})_{\mu\nu} = \Lambda_\mu \delta_{\mu\nu}. \qquad (4.41)
$$

The delta function restricts the components of $\boldsymbol{\xi}$ in equation 4.40 to be of order $\sqrt{D}$ or less, and so if $\mathcal{P}(\boldsymbol{\alpha})$ is smooth, we can make a perturbation expansion. After some algebra, the leading term becomes

$$
\rho(D \to 0; \bar{\boldsymbol{\alpha}}) \approx \mathcal{P}(\bar{\boldsymbol{\alpha}}) \, \frac{2\pi^{K/2}}{\Gamma(K/2)} \, (\det \mathcal{F})^{-1/2} \, D^{(K-2)/2}. \qquad (4.42)
$$

Here as before, $K$ is the dimensionality of the parameter vector. Computing the partition function from equation 4.37, we find

$$
Z(\bar{\boldsymbol{\alpha}}; N \to \infty) \approx f(\bar{\boldsymbol{\alpha}}) \cdot \frac{\Gamma(K/2)}{N^{K/2}}, \qquad (4.43)
$$

where $f(\bar{\boldsymbol{\alpha}})$ is some function of the target parameter values. Finally, this allows us to evaluate the subextensive entropy, from equations 4.33 and 4.34:

$$
S_1^{(a)}(N) = -\int d^K\bar{\alpha} \, \mathcal{P}(\bar{\boldsymbol{\alpha}}) \log_2 Z(\bar{\boldsymbol{\alpha}}; N) \qquad (4.44)
$$
$$
\to \frac{K}{2} \log_2 N + \cdots \ \ {\rm (bits)}, \qquad (4.45)
$$

where $\cdots$ are finite as $N \to \infty$. Thus, general $K$-parameter model classes have the same subextensive entropy as for the simplest example considered in the previous section. To the leading order, this result is independent even of the prior distribution $\mathcal{P}(\boldsymbol{\alpha})$ on the parameter space, so that the predictive information seems to count the number of parameters under some very general conditions (cf. Figure 3 for a very different numerical example of the logarithmic behavior).

Although equation 4.45 is true under a wide range of conditions, this cannot be the whole story. Much of modern learning theory is concerned with the fact that counting parameters is not quite enough to characterize the complexity of a model class; the naive dimension of the parameter space $K$ should be viewed in conjunction with the pseudodimension (also known as the shattering dimension or Vapnik-Chervonenkis dimension $d_{\rm VC}$), which measures capacity of the model class, and with the phase-space dimension $d$, which accounts for volumes in the space of models (Vapnik, 1998; Opper, 1994). Both of these can differ from the number of parameters. One possibility is that $d_{\rm VC}$ is infinite when the number of parameters is finite, a problem discussed below. Another possibility is that the determinant of $\mathcal{F}$ is zero, and hence both $d_{\rm VC}$ and $d$ are smaller than the number of parameters because we have adopted a redundant description. This sort of degeneracy could occur over a finite fraction, but not all, of the parameter space, and this is one way to generate an effective fractional dimensionality. One can imagine multifractal models such that the effective dimensionality varies continuously over the parameter space, but it is not obvious where this would be relevant. Finally, models with $d < d_{\rm VC} < \infty$ also are possible (see, for example, Opper, 1994), and this list probably is not exhaustive.

Equation 4.42 lets us actually define the phase-space dimension through the exponent in the small $D_{\rm KL}$ behavior of the model density,

$$
\rho(D \to 0; \bar{\boldsymbol{\alpha}}) \propto D^{(d-2)/2}, \qquad (4.46)
$$

and then $d$ appears in place of $K$ as the coefficient of the log divergence in $S_1(N)$ (Clarke & Barron, 1990; Opper, 1994). However, this simple conclusion can fail in two ways. First, it can happen that the density $\rho(D; \bar{\boldsymbol{\alpha}})$ accumulates a macroscopic weight at some nonzero value of $D$, so that the small $D$ behavior is irrelevant for the large $N$ asymptotics. Second, the fluctuations neglected here may be uncontrollably large, so that the asymptotics are never reached. Since controllability of fluctuations is a function of $d_{\rm VC}$ (see Vapnik, 1998; Haussler, Kearns, & Schapire, 1994; and later in this article), we may summarize this in the following way. Provided that the small $D$ behavior of the density function is the relevant one, the coefficient of the logarithmic divergence of $I_{\rm pred}$ measures the phase-space or the scaling dimension $d$ and nothing else. This asymptote is valid, however, only for $N \gg d_{\rm VC}$. It is still an open question whether the two pathologies that can violate this asymptotic behavior are related.
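[Editorial note: for the simplest possible model class the subextensive entropy can be computed exactly, and the check below is our own example rather than the paper's. Take binary symbols drawn i.i.d. with an unknown bias $\alpha$ that is itself uniform on $[0,1]$, so $K = 1$. A sequence with $k$ ones out of $N$ then has marginal probability $1/[(N+1)\,C(N,k)]$, the extensive entropy density is the prior-averaged binary entropy $\int d\alpha\, H_2(\alpha) = 1/(2\ln 2)$ bits per symbol, and $S_1(N)$ should follow equation 4.45 with $K = 1$.]

```python
from math import lgamma, log, log2

def S1(N):
    """Subextensive entropy (bits) for i.i.d. coin flips with a uniform prior on the bias."""
    S_N = 0.0
    for k in range(N + 1):
        log_binom = lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)   # ln C(N, k)
        S_N += (log(N + 1) + log_binom) / log(2)   # -log2 of the marginal sequence probability
    S_N /= N + 1                                   # all sequences with the same k are equiprobable
    S0 = 1.0 / (2 * log(2))                        # average entropy per flip over the prior
    return S_N - S0 * N

for N in (10, 100, 1000, 10000):
    print(N, round(S1(N), 3), round(0.5 * log2(N), 3))
# the difference between the two columns settles to an N-independent constant,
# so S1(N) grows as (K/2) log2 N with K = 1, as equation 4.45 predicts
```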

4.3 Learning a Parameterized Process. Consider a process where samples are not independent, and our task is to learn their joint distribution

$Q(\vec{x}_1, \ldots, \vec{x}_N|\boldsymbol{\alpha})$. Again, $\boldsymbol{\alpha}$ is an unknown parameter vector chosen randomly at the beginning of the series. If $\boldsymbol{\alpha}$ is a $K$-dimensional vector, then one still tries to learn just $K$ numbers, and there are still $N$ examples, even if there are correlations. Therefore, although such problems are much more general than those already considered, it is reasonable to expect that the predictive information still is measured by $(K/2) \log_2 N$ provided that some conditions are met.

One might suppose that conditions for simple results on the predictive information are very strong, for example, that the distribution $Q$ is a finite-order Markov model. In fact, all we really need are the following two conditions:

$$
S[\{\vec{x}_i\} \,|\, \boldsymbol{\alpha}] \equiv -\int d^N\vec{x} \, Q(\{\vec{x}_i\}|\boldsymbol{\alpha}) \log_2 Q(\{\vec{x}_i\}|\boldsymbol{\alpha}) \to N \mathcal{S}_0 + \mathcal{S}_0^*; \quad \mathcal{S}_0^* = O(1), \qquad (4.47)
$$

$$
D_{\rm KL}\left[ Q(\{\vec{x}_i\}|\bar{\boldsymbol{\alpha}}) \,\|\, Q(\{\vec{x}_i\}|\boldsymbol{\alpha}) \right] \to N \mathcal{D}_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha}) + o(N). \qquad (4.48)
$$

Here the quantities $\mathcal{S}_0$, $\mathcal{S}_0^*$, and $\mathcal{D}_{\rm KL}(\bar{\boldsymbol{\alpha}} \| \boldsymbol{\alpha})$ are defined by taking limits $N \to \infty$ in both equations. The first of the constraints limits deviations from extensivity to be of order unity, so that if $\boldsymbol{\alpha}$ is known, there are no long-range correlations in the data. All of the long-range predictability is associated with learning the parameters.⁸ The second constraint, equation 4.48, is a less restrictive one, and it ensures that the "energy" of our statistical system is an extensive quantity.

With these conditions, it is straightforward to show that the results of the previous subsection carry over virtually unchanged. With the same cautious statements about fluctuations and the distinction between $K$, $d$, and $d_{\rm VC}$, one arrives at the result:

$$
S(N) = \mathcal{S}_0 \cdot N + S_1^{(a)}(N), \qquad (4.49)
$$
$$
S_1^{(a)}(N) = \frac{K}{2} \log_2 N + \cdots \ \ {\rm (bits)}, \qquad (4.50)
$$

where $\cdots$ stands for terms of order one as $N \to \infty$. Note again that for the result of equation 4.50 to be valid, the process considered is not required to be a finite-order Markov process. Memory of all previous outcomes may be kept, provided that the accumulated memory does not contribute a divergent term to the subextensive entropy.

⁸ Suppose that we observe a gaussian stochastic process and try to learn the power spectrum. If the class of possible spectra includes ratios of polynomials in the frequency (rational spectra), then this condition is met. On the other hand, if the class of possible spectra includes $1/f$ noise, then the condition may not be met. For more on long-range correlations, see below.

It is interesting to ask what happens if the condition in equation 4.47 is violated, so that there are long-range correlations even in the conditional distribution $Q(\vec{x}_1, \ldots, \vec{x}_N|\boldsymbol{\alpha})$. Suppose, for example, that $\mathcal{S}_0^* = (K^*/2) \log_2 N$. Then the subextensive entropy becomes

S_1^{(a)}(N) = \frac{K + K^*}{2}\log_2 N + \cdots \ \text{(bits)}.   (4.51)

We see that the subextensive entropy makes no distinction between predictability that comes from unknown parameters and predictability that comes from intrinsic correlations in the data; in this sense, two models with the same K + K* are equivalent. This actually must be so. As an example, consider a chain of Ising spins with long-range interactions in one dimension. This system can order (magnetize) and exhibit long-range correlations, and so the predictive information will diverge at the transition to ordering. In one view, there is no global parameter analogous to α, just the long-range interactions. On the other hand, there are regimes in which we can approximate the effect of these interactions by saying that all the spins experience a mean field that is constant across the whole length of the system, and then formally we can think of the predictive information as being carried by the mean field itself. In fact, there are situations in which this is not just an approximation but an exact statement. Thus, we can trade a description in terms of long-range interactions (K* ≠ 0, but K = 0) for one in which there are unknown parameters describing the system, but given these parameters, there are no long-range correlations (K ≠ 0, K* = 0). The two descriptions are equivalent, and this is captured by the subextensive entropy.⁹

4.4 Taming the Fluctuations. The preceding calculations of the subextensive entropy S_1 are worthless unless we prove that the fluctuations ψ are controllable. We are going to discuss when and if this indeed happens. We limit the discussion to the case of finding a probability density (section 4.2); the case of learning a process (section 4.3) is very similar.
Clarke and Barron (1990) solved essentially the same problem. They did not make a separation into the annealed and the fluctuation term, and the quantity they were interested in was a bit different from ours, but, interpreting loosely, they proved that, modulo some reasonable technical assumptions on differentiability of the functions in question, the fluctuation term always approaches zero. However, they did not investigate the speed of this approach, and we believe that they therefore missed some important qualitative distinctions between different problems that can arise due to a difference between d and d_VC. In order to illuminate these distinctions, we here go through the trouble of analyzing fluctuations all over again.

9 There are a number of interesting questions about how the coefficients in the diverging predictive information relate to the usual critical exponents, and we hope to return to this problem in a later article.

Returning to equations 4.27, 4.29 and the definition of entropy, we can write the entropy S(N) exactly as

S(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j\mid\bar\alpha) \right]
\times \log_2 \left[ \prod_{i=1}^{N} Q(\vec{x}_i\mid\bar\alpha) \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right].   (4.52)

This expression can be decomposed into the terms identified above, plus a new contribution to the subextensive entropy that comes from the fluctuations alone, S_1^{(f)}(N):

S(N) = S_0 \cdot N + S_1^{(a)}(N) + S_1^{(f)}(N),   (4.53)

S_1^{(f)}(N) = -\int d^K\bar\alpha\, P(\bar\alpha) \int \prod_{j=1}^{N} \left[ d\vec{x}_j\, Q(\vec{x}_j\mid\bar\alpha) \right]
\times \log_2 \left[ \frac{1}{Z(\bar\alpha; N)} \int d^K\alpha\, P(\alpha)\, e^{-N D_{KL}(\bar\alpha\|\alpha) + N\psi(\alpha,\bar\alpha;\{\vec{x}_i\})} \right],   (4.54)

where ψ is defined as in equation 4.29, and Z as in equation 4.34.
Some loose but useful bounds can be established. First, the predictive information is a positive (semidefinite) quantity, and so the fluctuation term may not be smaller (more negative) than the value of -S_1^{(a)} as calculated in equations 4.45 and 4.50. Second, since fluctuations make it more difficult to generalize from samples, the predictive information should always be reduced by fluctuations, so that S_1^{(f)} is negative. This last statement corresponds to the fact that for the statistical mechanics of disordered systems, the annealed free energy always is less than the average quenched free energy, and it may be proven rigorously by applying Jensen's inequality to the (concave) logarithm function in equation 4.54; essentially the same argument was given by Haussler and Opper (1997). A related Jensen's inequality argument allows us to show that the total S_1(N) is bounded,

S_1(N) \le N \int d^K\alpha \int d^K\bar\alpha\, P(\bar\alpha)\, P(\alpha)\, D_{KL}(\bar\alpha\|\alpha) \equiv N \langle D_{KL}(\bar\alpha\|\alpha) \rangle_{\bar\alpha,\alpha},   (4.55)

so that if we have a class of models (and a prior P(ᾱ)) such that the average KL divergence among pairs of models is finite, then the subextensive entropy necessarily is properly defined. In its turn, finiteness of the average

KL divergence or of similar quantities is a usual constraint in the learning problems of this type (see, for example, Haussler & Opper, 1997). Note also that ⟨D_KL(ᾱ‖α)⟩_{ᾱ,α} includes S_0 as one of its terms, so that usually S_0 and S_1 are well or ill defined together.
Tighter bounds require nontrivial assumptions about the class of distributions considered. The fluctuation term would be zero if ψ were zero, and ψ is the difference between an expectation value (KL divergence) and the corresponding empirical mean. There is a broad literature that deals with this type of difference (see, for example, Vapnik, 1998).
We start with the case when the pseudodimension (d_VC) of the set of probability densities {Q(x|α)} is finite. Then for any reasonable function F(x; b), deviations of the empirical mean from the expectation value can be bounded by probabilistic bounds of the form

P\left\{ \sup_b \frac{\left| \frac{1}{N}\sum_j F(\vec{x}_j; b) - \int d\vec{x}\, Q(\vec{x}\mid\bar\alpha)\, F(\vec{x}; b) \right|}{L[F]} > \epsilon \right\} < M(\epsilon, N, d_{VC})\, e^{-cN\epsilon^2},   (4.56)

where c and L[F] depend on the details of the particular bound used. Typically, c is a constant of order one, and L[F] either is some moment of F or the range of its variation. In our case, F is the log ratio of two densities, so that L[F] may be assumed bounded for almost all b without loss of generality in view of equation 4.55. In addition, M(ε, N, d_VC) is finite at zero, grows at most subexponentially in its first two arguments, and depends exponentially on d_VC. Bounds of this form may have different names in different contexts: Glivenko-Cantelli, Vapnik-Chervonenkis, Hoeffding, Chernoff (for a review see Vapnik, 1998, and the references therein).
To start the proof of finiteness of S_1^{(f)} in this case, we first show that only the region α ≈ ᾱ is important when calculating the inner integral in equation 4.54. This statement is equivalent to saying that at large values of α - ᾱ, the KL divergence almost always dominates the fluctuation term, that is, the contribution of sequences of {x_i} with atypically large fluctuations is negligible (atypicality is defined as ψ ≥ δ, where δ is some small constant independent of N). Since the fluctuations decrease as 1/√N (see equation 4.56) and D_KL is of order one, this is plausible. To show this, we bound the logarithm in equation 4.54 by N times the supremum value of ψ. Then we realize that the averaging over ᾱ and {x_i} is equivalent to integration over all possible values of the fluctuations. The worst-case density of the fluctuations may be estimated by differentiating equation 4.56 with respect to ε (this brings down an extra factor of Nε). Thus the worst-case contribution of these atypical sequences is

S_1^{(f),\text{atypical}} \sim \int_\delta^\infty d\epsilon\, N^2 \epsilon^2\, M(\epsilon)\, e^{-cN\epsilon^2} \sim e^{-cN\delta^2} \ll 1 \ \text{for large } N.   (4.57)

This bound lets us focus our attention on the region α ≈ ᾱ. We expand the exponent of the integrand of equation 4.54 around this point and perform a simple gaussian integration. In principle, large fluctuations might lead to an instability (positive or zero curvature) at the saddle point, but this is atypical and therefore is accounted for already. Curvatures at the saddle points of both numerator and denominator are of the same order, and throwing away unimportant additive and multiplicative constants of order unity, we obtain the following result for the contribution of typical sequences:

S_1^{(f),\text{typical}} \sim \int d^K\bar\alpha\, P(\bar\alpha) \int \prod_j \left[ d\vec{x}_j\, Q(\vec{x}_j\mid\bar\alpha) \right] N\, B\, A^{-1} B,
B_\mu = \frac{1}{N}\sum_i \frac{\partial \log Q(\vec{x}_i\mid\bar\alpha)}{\partial \bar\alpha_\mu}, \qquad \langle B \rangle_{\vec{x}} = 0,
A_{\mu\nu} = \frac{1}{N}\sum_i \frac{\partial^2 \log Q(\vec{x}_i\mid\bar\alpha)}{\partial \bar\alpha_\mu \partial \bar\alpha_\nu}, \qquad \langle A \rangle_{\vec{x}} = -F.   (4.58)

Here ⟨···⟩_x means an averaging with respect to all x_i's keeping ᾱ constant. One immediately recognizes that B and A are, respectively, first and second derivatives of the empirical KL divergence that was in the exponent of the inner integral in equation 4.54.
We are dealing now with typical cases. Therefore, large deviations of A from -F are not allowed, and we may bound equation 4.58 by replacing A^{-1} with -F^{-1}(1 + δ), where again δ is independent of N. Now we have to average a bunch of products like

\frac{\partial \log Q(\vec{x}_i\mid\bar\alpha)}{\partial \bar\alpha_\mu}\, (F^{-1})_{\mu\nu}\, \frac{\partial \log Q(\vec{x}_j\mid\bar\alpha)}{\partial \bar\alpha_\nu}   (4.59)

over all x_i's. Only the terms with i = j survive the averaging. There are K²N such terms, each contributing of order N^{-1}. This means that the total contribution of the typical fluctuations is bounded by a number of order one and does not grow with N.
This concludes the proof of controllability of fluctuations for d_VC < ∞. One may notice that we never used the specific form of M(ε, N, d_VC), which is the only thing dependent on the precise value of the dimension. Actually, a more thorough look at the proof shows that we do not even need the strict uniform convergence enforced by the Glivenko-Cantelli bound. With some modifications, the proof should still hold if there exist some a priori improbable values of α and ᾱ that lead to violation of the bound. That is, if the prior P(α) has sufficiently narrow support, then we may still expect fluctuations to be unimportant even for VC-infinite problems.
A proof of this can be found in the realm of the structural risk minimization (SRM) theory (Vapnik, 1998). SRM theory assumes that an infinite

structure of nested subsets C_1 ⊂ C_2 ⊂ C_3 ⊂ ··· can be imposed onto the set C of all admissible solutions of a learning problem, such that C = ∪_n C_n. The idea is that having a finite number of observations N, one is confined to the choices within some particular structure element C_n, n = n(N), when looking for an approximation to the true solution; this prevents overfitting and poor generalization. As the number of samples increases and one is able to distinguish within more and more complicated subsets, n grows. If d_VC for learning in any C_n, n < ∞, is finite, then one can show convergence of the estimate to the true value as well as the absolute smallness of fluctuations (Vapnik, 1998). It is remarkable that this result holds even if the capacity of the whole set C is not described by a finite d_VC.
In the context of SRM, the role of the prior P(α) is to induce a structure on the set of all admissible densities, and the fight between the number of samples N and the narrowness of the prior is precisely what determines how the capacity of the current element of the structure C_n, n = n(N), grows with N. A rigorous proof of smallness of the fluctuations can be constructed based on well-known results, as detailed elsewhere (Nemenman, 2000). Here we focus on the question of how narrow the prior should be so that every structure element is of finite VC dimension, and one can guarantee eventual convergence of fluctuations to zero.
Consider two examples. A variable x is distributed according to the following probability density functions:

Q(x\mid a) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}(x-a)^2\right], \qquad x \in (-\infty, +\infty);   (4.60)

Q(x\mid a) = \frac{\exp(-\sin ax)}{\int_0^{2\pi} dx\, \exp(-\sin ax)}, \qquad x \in [0, 2\pi).   (4.61)

Learning the parameter in the first case is a d_VC = 1 problem; in the second case, d_VC = ∞. In the first example, as we have shown, one may construct a uniform bound on fluctuations irrespective of the prior P(α). The second one does not allow this. Suppose that the prior is uniform in a box 0 < a < a_max and zero elsewhere, with a_max rather large. Then for not too many sample points N, the data would be better fitted not by some value in the vicinity of the actual parameter, but by some much larger value, for which almost all data points are at the crests of -sin ax. Adding a new data point would not help, until that best, but wrong, parameter estimate is less than a_max.¹⁰ So the fluctuations are large, and the predictive information is

10 Interestingly, since for the model of equation 4.61 the KL divergence is bounded from below and from above, for a_max → ∞ the weight in ρ(D; ā) at small D_KL vanishes, and a finite weight accumulates at some nonzero value of D. Thus, even putting the fluctuations aside, the asymptotic behavior based on the phase-space dimension is invalidated, as mentioned above.

small in this case. Eventually, however, data points would overwhelm the box size, and the best estimate of a would swiftly approach the actual value. At this point the argument of Clarke and Barron would become applicable, and the leading behavior of the subextensive entropy would converge to its asymptotic value of (1/2) log_2 N. On the other hand, there is no uniform bound on the value of N for which this convergence will occur; it is guaranteed only for N ≫ d_VC, which is never true if d_VC = ∞. For some sufficiently wide priors, this asymptotically correct behavior would never be reached in practice. Furthermore, if we imagine a thermodynamic limit where the box size and the number of samples both become large, then by analogy with problems in supervised learning (Seung, Sompolinsky, & Tishby, 1992; Haussler et al., 1996), we expect that there can be sudden changes in performance as a function of the number of examples. The arguments of Clarke and Barron cannot encompass these phase transitions or "Aha!" phenomena. A further bridge between VC dimension and the information-theoretic approach to learning may be found in Haussler, Kearns, & Schapire (1994), where the authors bounded predictive information-like quantities with loose but useful bounds explicitly proportional to d_VC.
While much of learning theory has focused on problems with finite VC dimension, it might be that the conventional scenario in which the number of examples eventually overwhelms the number of parameters or dimensions is too weak to deal with many real-world problems. Certainly in the present context, there is not only a quantitative but also a qualitative difference between reaching the asymptotic regime in just a few measurements or in many millions of them. Finitely parameterizable models with finite or infinite d_VC fall in essentially different universality classes with respect to the predictive information.
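The following small simulation (ours; the grid sizes, a_max = 200, the random seed, and the rejection-sampling scheme are arbitrary illustrative choices) shows the behavior described above for the model of equation 4.61: with a wide uniform prior on a, the maximum-likelihood estimate at small N is typically a large spurious value that places the observed points near the crests of -sin(ax), and only at much larger N does the estimate settle near the true parameter.

import numpy as np

rng = np.random.default_rng(0)
t_grid = np.linspace(0.0, 2 * np.pi, 20001)  # quadrature grid for the normalization Z(a)

def log_Z(a):
    # Z(a) = integral_0^{2 pi} exp(-sin(a x)) dx, by a simple Riemann sum
    vals = np.exp(-np.sin(a * t_grid))
    return np.log(np.sum(vals) * (t_grid[1] - t_grid[0]))

def sample(a, n):
    # rejection sampling from Q(x|a) = exp(-sin(a x)) / Z(a) on [0, 2 pi)
    out = np.empty(0)
    while out.size < n:
        x = rng.uniform(0.0, 2 * np.pi, size=4 * n)
        keep = rng.uniform(0.0, np.e, size=4 * n) < np.exp(-np.sin(a * x))
        out = np.concatenate([out, x[keep]])
    return out[:n]

a_true = 1.0
a_grid = np.linspace(0.1, 200.0, 4000)        # support of the wide "box" prior
logZ_grid = np.array([log_Z(a) for a in a_grid])

for N in (5, 20, 100, 1000):
    x = sample(a_true, N)
    log_like = np.array([-np.sin(a * x).sum() for a in a_grid]) - x.size * logZ_grid
    print(N, "maximum-likelihood a on the grid:", round(a_grid[np.argmax(log_like)], 2))

For N of order a few samples, the winning a is usually far from a_true = 1; as N grows, the data overwhelm the box and the estimate converges, which is the crossover discussed in the text.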

4.5 Beyond Finite Parameterization: General Considerations. The previous sections have considered learning from time series where the underlying class of possible models is described with a finite number of parameters. If the number of parameters is not finite, then in principle it is impossible to learn anything unless there is some appropriate regularization of the problem. If we let the number of parameters stay finite but become large, then there is more to be learned and, correspondingly, the predictive information grows in proportion to this number, as in equation 4.45. On the other hand, if the number of parameters becomes infinite without regularization, then the predictive information should go to zero since nothing can be learned. We should be able to see this happen in a regularized problem as the regularization weakens. Eventually the regularization would be insufficient and the predictive information would vanish. The only way this can happen is if the subextensive term in the entropy grows more and more rapidly with N as we weaken the regularization, until finally it becomes extensive at the point where learning becomes impossible. More precisely, if this scenario for the breakdown of learning is to work, there must be situations in which the predictive information grows with N more rapidly than the logarithmic behavior found in the case of finite parameterization.
Subextensive terms in the entropy are controlled by the density of models as a function of their KL divergence to the target model. If the models have finite VC and phase-space dimensions, then this density vanishes for small divergences as ρ ∼ D^{(d-2)/2}. Phenomenologically, if we let the number of parameters increase, the density vanishes more and more rapidly. We can imagine that beyond the class of finitely parameterizable problems, there is a class of regularized infinite-dimensional problems in which the density ρ(D → 0) vanishes more rapidly than any power of D. As an example, we could have

\rho(D \to 0) \approx A \exp\left[-\frac{B}{D^{\mu}}\right], \qquad \mu > 0,   (4.62)

that is, an essential singularity at D = 0. For simplicity, we assume that the constants A and B can depend on the target model but that the nature of the essential singularity (μ) is the same everywhere. Before providing an explicit example, let us explore the consequences of this behavior.
From equation 4.37, we can write the partition function as

Z(\bar\alpha; N) = \int dD\, \rho(D; \bar\alpha)\, \exp[-ND]
\approx A(\bar\alpha) \int dD\, \exp\left[-\frac{B(\bar\alpha)}{D^{\mu}} - ND\right]
\approx \tilde A(\bar\alpha)\, \exp\left[-\frac{1}{2}\frac{\mu+2}{\mu+1}\ln N - C(\bar\alpha)\, N^{\mu/(\mu+1)}\right],   (4.63)

where in the last step we use a saddle-point or steepest-descent approximation that is accurate at large N, and the coefficients are

\tilde A(\bar\alpha) = A(\bar\alpha) \left[\frac{2\pi\, \mu^{1/(\mu+1)}}{\mu+1}\right]^{1/2} [B(\bar\alpha)]^{1/(2\mu+2)},   (4.64)

C(\bar\alpha) = [B(\bar\alpha)]^{1/(\mu+1)} \left[\frac{1}{\mu^{\mu/(\mu+1)}} + \mu^{1/(\mu+1)}\right].   (4.65)

Finally we can use equations 4.44 and 4.63 to compute the subextensive term in the entropy, keeping only the dominant term at large N,

S_1^{(a)}(N) \to \frac{1}{\ln 2}\, \langle C(\bar\alpha) \rangle_{\bar\alpha}\, N^{\mu/(\mu+1)} \ \text{(bits)},   (4.66)

where ⟨···⟩_ᾱ denotes an average over all the target models.
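For completeness, the saddle-point step behind equation 4.63 can be recorded explicitly (this is our own filling-in of the intermediate algebra, using only the integrand written in equation 4.63):

f(D) = \frac{B}{D^{\mu}} + ND, \qquad f'(D_*) = 0 \;\Rightarrow\; D_* = \left(\frac{\mu B}{N}\right)^{1/(\mu+1)},

f(D_*) = B^{1/(\mu+1)} \left[\mu^{-\mu/(\mu+1)} + \mu^{1/(\mu+1)}\right] N^{\mu/(\mu+1)} = C(\bar\alpha)\, N^{\mu/(\mu+1)},

\sqrt{\frac{2\pi}{f''(D_*)}} \propto N^{-(\mu+2)/[2(\mu+1)]},

which reproduces both the N^{μ/(μ+1)} term in the exponent (and hence equation 4.66) and the (1/2)(μ+2)/(μ+1) ln N prefactor in equation 4.63.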

This behavior of the first subextensive term is qualitatively different from everything we have observed so far. A power-law divergence is much stronger than a logarithmic one. Therefore, a lot more predictive information is accumulated in an "infinite parameter" (or nonparametric) system; the system is intuitively and quantitatively much richer and more complex.
Subextensive entropy also grows as a power law in a finitely parameterizable system with a growing number of parameters. For example, suppose that we approximate the distribution of a random variable by a histogram with K bins, and we let K grow with the quantity of available samples as K ∼ N^ν. A straightforward generalization of equation 4.45 seems to imply then that S_1(N) ∼ N^ν log N (Hall & Hannan, 1988; Barron & Cover, 1991). While not necessarily wrong, analogies of this type are dangerous. If the number of parameters grows constantly, then the scenario where the data overwhelm all the unknowns is far from certain. Observations may provide much less information about features that were introduced into the model at some large N than about those that have existed since the very first measurements. Indeed, in a K-parameter system, the Nth sample point contributes ∼ K/2N bits to the subextensive entropy (cf. equation 4.45). If K changes as mentioned, the Nth example then carries ∼ N^{ν-1} bits. Summing this up over all samples, we find S_1^{(a)} ∼ N^ν, and if we let ν = μ/(μ+1), we obtain equation 4.66; note that the growth of the number of parameters is slower than N (ν = μ/(μ+1) < 1), which makes sense (the sum is spelled out at the end of this subsection). Rissanen, Speed, and Yu (1992) made a similar observation. According to them, for models with an increasing number of parameters, predictive codes, which are optimal at each particular N (cf. equation 3.8), provide a strictly shorter coding length than nonpredictive codes optimized for all data simultaneously. This has to be contrasted with the finite-parameter model classes, for which these codes are asymptotically equal.
Power-law growth of the predictive information illustrates the point made earlier about the transition from learning more to finally learning nothing as the class of investigated models becomes more complex. As μ increases, the problem becomes richer and more complex, and this is expressed in the stronger divergence of the first subextensive term of the entropy; for fixed large N, the predictive information increases with μ. However, if μ → ∞, the problem is too complex for learning; in our model example, the number of bins grows in proportion to the number of samples, which means that we are trying to find too much detail in the underlying distribution. As a result, the subextensive term becomes extensive and stops contributing to predictive information. Thus, at least to the leading order, predictability is lost, as promised.
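The counting argument referred to above can be written out in one line (our arithmetic): if the nth example carries roughly K(n)/2n ∼ n^{ν-1}/2 bits, then

S_1^{(a)}(N) \sim \sum_{n=1}^{N} \frac{K(n)}{2n} \sim \frac{1}{2}\sum_{n=1}^{N} n^{\nu-1} \approx \frac{N^{\nu}}{2\nu}, \qquad 0 < \nu < 1,

which is the N^ν scaling quoted above and, with ν = μ/(μ+1), the power appearing in equation 4.66.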

4.6 Beyond Finite Parameterization: Example. While the literature on problems in the logarithmic class is reasonably rich, the research on establishing the power-law behavior seems to be in its early stages. Some authors have found specific learning problems for which quantities similar

to, but sometimes very nontrivially related to, S_1 are bounded by power-law functions (Haussler & Opper, 1997, 1998; Cesa-Bianchi & Lugosi, in press). Others have chosen to study finite-dimensional models, for which the optimal number of parameters (usually determined by the MDL criterion of Rissanen, 1989) grows as a power law (Hall & Hannan, 1988; Rissanen et al., 1992). In addition to the questions raised earlier about the danger of copying finite-dimensional intuition to the infinite-dimensional setup, these are not examples of truly nonparametric Bayesian learning. Instead, these authors make use of a priori constraints to restrict learning to codes of particular structure (histogram codes), while a non-Bayesian inference is conducted within the class. Without Bayesian averaging and with restrictions on the coding strategy, it may happen that a realization of the code length is substantially different from the predictive information. Similar conceptual problems plague even true nonparametric examples, as considered, for example, by Barron and Cover (1991). In summary, we do not know of a complete calculation in which a family of power-law scalings of the predictive information is derived from a Bayesian formulation.
The discussion in the previous section suggests that we should look for power-law behavior of the predictive information in learning problems where, rather than learning ever more precise values for a fixed set of parameters, we learn a progressively more detailed description, effectively increasing the number of parameters, as we collect more data. One example of such a problem is learning the distribution Q(x) for a continuous variable x, but rather than writing a parametric form of Q(x), we assume only that this function itself is chosen from some distribution that enforces a degree of smoothness. There are some natural connections of this problem to the methods of quantum field theory (Bialek, Callan, & Strong, 1996), which we can exploit to give a complete calculation of the predictive information, at least for a class of smoothness constraints.
We write Q(x) = (1/l_0) exp[-φ(x)] so that the positivity of the distribution is automatic, and then smoothness may be expressed by saying that the "energy" (or action) associated with a function φ(x) is related to an integral over its derivatives, like the strain energy in a stretched string. The simplest possibility following this line of ideas is that the distribution of functions is given by

\mathcal{P}[\varphi(x)] = \frac{1}{\mathcal{Z}} \exp\left[-\frac{l}{2}\int dx \left(\frac{\partial\varphi}{\partial x}\right)^2\right] \delta\left[\frac{1}{l_0}\int dx\, e^{-\varphi(x)} - 1\right],   (4.67)

where \mathcal{Z} is the normalization constant for \mathcal{P}[φ], the delta function ensures that each distribution Q(x) is normalized, and l sets a scale for smoothness. If distributions are chosen from this distribution, then the optimal Bayesian estimate of Q(x) from a set of samples x_1, x_2, ..., x_N converges to the correct answer, and the distribution at finite N is nonsingular, so that the regularization provided by this prior is strong enough to prevent the development of singular peaks at the location of observed data points (Bialek et al., 1996). Further developments of the theory, including alternative choices of \mathcal{P}[φ(x)], have been given by Periwal (1997, 1998), Holy (1997), Aida (1999), and Schmidt (2000); for a detailed numerical investigation of this problem see Nemenman and Bialek (2001). Our goal here is to be illustrative rather than exhaustive.¹¹
From the discussion above, we know that the predictive information is related to the density of KL divergences and that the power-law behavior we are looking for comes from an essential singularity in this density function. Thus, we calculate ρ(D; φ̄) in the model defined by equation 4.67.
With Q(x) = (1/l_0) exp[-φ(x)], we can write the KL divergence as

D_{KL}[\bar\varphi(x)\|\varphi(x)] = \frac{1}{l_0}\int dx\, \exp[-\bar\varphi(x)]\, [\varphi(x) - \bar\varphi(x)].   (4.68)

We want to compute the density,

\rho(D; \bar\varphi) = \int [d\varphi(x)]\, \mathcal{P}[\varphi(x)]\, \delta\left(D - D_{KL}[\bar\varphi(x)\|\varphi(x)]\right)   (4.69)

= M \int [d\varphi(x)]\, \mathcal{P}[\varphi(x)]\, \delta\left(MD - M D_{KL}[\bar\varphi(x)\|\varphi(x)]\right),   (4.70)

where we introduce a factor M, which we will allow to become large so that we can focus our attention on the interesting limit D → 0. To compute this integral over all functions φ(x), we introduce a Fourier representation for the delta function and then rearrange the terms:

\rho(D; \bar\varphi) = M \int \frac{dz}{2\pi} \exp(izMD) \int [d\varphi(x)]\, \mathcal{P}[\varphi(x)]\, \exp(-izM D_{KL})   (4.71)

= M \int \frac{dz}{2\pi} \exp\left(izMD + \frac{izM}{l_0}\int dx\, \bar\varphi(x)\, e^{-\bar\varphi(x)}\right)
\times \int [d\varphi(x)]\, \mathcal{P}[\varphi(x)]\, \exp\left(-\frac{izM}{l_0}\int dx\, \varphi(x)\, e^{-\bar\varphi(x)}\right).   (4.72)

The inner integral over the functions φ(x) is exactly the integral evaluated in the original discussion of this problem (Bialek, Callan, & Strong, 1996); in the limit that zM is large, we can use a saddle-point approximation, and standard field-theoretic methods allow us to compute the fluctuations

11 We caution that our discussion in this section is less self-contained than in other sections. Since the crucial steps exactly parallel those in the earlier work, here we just give references.

around the saddle point. The result is that

\int [d\varphi(x)]\, \mathcal{P}[\varphi(x)]\, \exp\left(-\frac{izM}{l_0}\int dx\, \varphi(x)\, e^{-\bar\varphi(x)}\right)
= \exp\left(-\frac{izM}{l_0}\int dx\, \bar\varphi(x)\, e^{-\bar\varphi(x)} - S_{\mathrm{eff}}[\bar\varphi(x); zM]\right),   (4.73)

S_{\mathrm{eff}}[\bar\varphi; zM] = \frac{l}{2}\int dx \left(\frac{\partial\bar\varphi}{\partial x}\right)^2 + \frac{1}{2}\left(\frac{izM}{l\, l_0}\right)^{1/2} \int dx\, e^{-\bar\varphi(x)/2}.   (4.74)

Now we can do the integral over z, again by a saddle-point method. The two saddle-point approximations are both valid in the limit that D → 0 and MD^{3/2} → ∞; we are interested precisely in the first limit, and we are free to set M as we wish, so this gives us a good approximation for ρ(D → 0; φ̄). This results in

\rho(D \to 0; \bar\varphi) = A[\bar\varphi(x)]\, D^{-3/2}\, \exp\left[-\frac{B[\bar\varphi(x)]}{D}\right],   (4.75)

A[\bar\varphi(x)] = \frac{1}{\sqrt{16\pi\, l\, l_0}}\, \exp\left[-\frac{l}{2}\int dx\, (\partial_x\bar\varphi)^2\right] \times \int dx\, e^{-\bar\varphi(x)/2},   (4.76)

B[\bar\varphi(x)] = \frac{1}{16\, l\, l_0}\left(\int dx\, e^{-\bar\varphi(x)/2}\right)^2.   (4.77)

Except for the factor of D^{-3/2}, this is exactly the sort of essential singularity that we considered in the previous section, with μ = 1. The D^{-3/2} prefactor does not change the leading large-N behavior of the predictive information, and we find that

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\left\langle \frac{1}{\sqrt{l\, l_0}} \int dx\, e^{-\bar\varphi(x)/2} \right\rangle_{\bar\varphi} \sqrt{N},   (4.78)

where ⟨···⟩_φ̄ denotes an average over the target distributions φ̄(x) weighted once again by \mathcal{P}[φ̄(x)]. Notice that if x is unbounded, then the average in equation 4.78 is infrared divergent; if instead we let the variable x range from 0 to L, then this average should be dominated by the uniform distribution. Replacing the average by its value at this point, we obtain the approximate result

S_1^{(a)}(N) \sim \frac{1}{2\ln 2}\left(\frac{L}{l}\right)^{1/2} \sqrt{N} \ \text{bits}.   (4.79)

To understand the result in equation 4.79, we recall that this field-theoretic approach is more or less equivalent to an adaptive binning procedure in which we divide the range of x into bins of local size \sqrt{l/(N Q(x))} (Bialek,

Callan, & Strong, 1996). From this point of view, equation 4.79 makes perfect sense: the predictive information is directly proportional to the number of bins that can be put in the range of x (a one-line check appears at the end of this subsection). This also is in direct accord with a comment in the previous section that power-law behavior of predictive information arises from the number of parameters in the problem depending on the number of samples.
This counting of parameters allows a schematic argument about the smallness of fluctuations in this particular nonparametric problem. If we take the hint that at every step ∼ √N bins are being investigated, then we can imagine that the field-theoretic prior in equation 4.67 has imposed a structure C on the set of all possible densities, so that the set C_n is formed of all continuous piecewise linear functions that have not more than n kinks. Learning such functions for finite n is a d_VC = n problem. Now, as N grows, the elements with higher capacities n ∼ √N are admitted. The fluctuations in such a problem are known to be controllable (Vapnik, 1998), as discussed in more detail elsewhere (Nemenman, 2000).
One thing that remains troubling is that the predictive information depends on the arbitrary parameter l describing the scale of smoothness in the distribution. In the original work, it was proposed that one should integrate over possible values of l (Bialek, Callan, & Strong, 1996). Numerical simulations demonstrate that this parameter can be learned from the data themselves (Nemenman & Bialek, 2001), but perhaps even more interesting is a formulation of the problem by Periwal (1997, 1998), which recovers complete coordinate invariance by effectively allowing l to vary with x. In this case, the whole theory has no length scale, and there also is no need to confine the variable x to a box (here of size L). We expect that this coordinate invariant approach will lead to a universal coefficient multiplying √N in the analog of equation 4.79, but this remains to be shown.
In summary, the field-theoretic approach to learning a smooth distribution in one dimension provides us with a concrete, calculable example of a learning problem with power-law growth of the predictive information. The scenario is exactly as suggested in the previous section, where the density of KL divergences develops an essential singularity. Heuristic considerations (Bialek, Callan, & Strong, 1996; Aida, 1999) suggest that different smoothness penalties and generalizations to higher-dimensional problems will have sensible effects on the predictive information: all have power-law growth, smoother functions have smaller powers (less to learn), and higher-dimensional problems have larger powers (more to learn). But real calculations for these cases remain challenging.
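As the quick consistency check promised above (our arithmetic, using the local bin size √(l/(N Q(x))) quoted at the start of this discussion), the number of adaptive bins is

N_{\mathrm{bins}} \approx \int_0^L \frac{dx}{\sqrt{l/[N\,Q(x)]}} = \sqrt{\frac{N}{l}} \int_0^L dx\, \sqrt{Q(x)} = \sqrt{\frac{NL}{l}} \quad \text{for the uniform } Q(x) = 1/L,

which reproduces the √(NL/l) dependence of equation 4.79 up to the overall 1/(2 ln 2) factor.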

5 I_pred as a Measure of Complexity

The problem of quantifying complexity is very old (see Grassberger, 1991, for a short review). Solomonoff (1964), Kolmogorov (1965), and Chaitin (1975) investigated a mathematically rigorous notion of complexity that measures

(roughly) the minimum length of a computer program that simulates the observed time series (see also Li & Vitányi, 1993). Unfortunately, there is no algorithm that can calculate the Kolmogorov complexity of all data sets. In addition, algorithmic or Kolmogorov complexity is closely related to the Shannon entropy, which means that it measures something closer to our intuitive concept of randomness than to the intuitive concept of complexity (as discussed, for example, by Bennett, 1990). These problems have fueled continued research along two different paths, representing two major motivations for defining complexity. First, one would like to make precise an impression that some systems, such as life on earth or a turbulent fluid flow, evolve toward a state of higher complexity, and one would like to be able to classify those states. Second, in choosing among different models that describe an observation, one wants to quantify a preference for simpler explanations or, equivalently, provide a penalty for complex models that can be weighed against the more conventional goodness-of-fit criteria. We bring our readers up to date with some developments in both of these directions and then discuss the role of predictive information as a measure of complexity. This also gives us an opportunity to discuss more carefully the relation of our results to previous work.

5.1 Complexity of Statistical Models. The construction of complexity penalties for model selection is a statistics problem. As far as we know, however, the first discussions of complexity in this context belong to the philosophical literature. Even leaving aside the early contributions of William of Occam on the need for simplicity, Hume on the problem of induction, and Popper on falsifiability, Kemeney (1953) suggested explicitly that it would be possible to create a model selection criterion that balances goodness of fit versus complexity. Wallace and Boulton (1968) hinted that this balance may result in the model with "the briefest recording of all attribute information." Although he probably had a somewhat different motivation, Akaike (1974a, 1974b) made the first quantitative step along these lines. His ad hoc complexity term was independent of the number of data points and was proportional to the number of free independent parameters in the model.
These ideas were rediscovered and developed systematically by Rissanen in a series of papers starting from 1978. He has emphasized strongly (Rissanen, 1984, 1986, 1987, 1989) that fitting a model to data represents an encoding of those data, or predicting future data, and that in searching for an efficient code, we need to measure not only the number of bits required to describe the deviations of the data from the model's predictions (goodness of fit), but also the number of bits required to specify the parameters of the model (complexity). This specification has to be done to a precision supported by the data.¹² Rissanen (1984) and Clarke and Barron (1990) in full

12 Within this framework, Akaike's suggestion can be seen as coding the model to (suboptimal) fixed precision.

generality were able to prove that the optimal encoding of a model requires a code with length asymptotically proportional to the number of independent parameters and logarithmically dependent on the number of data points we have observed. The minimal amount of space one needs to encode a data string (minimum description length, or MDL) within a certain assumed model class was termed by Rissanen stochastic complexity, and in recent work he refers to the piece of the stochastic complexity required for coding the parameters as the model complexity (Rissanen, 1996). This approach was further strengthened by a recent result (Vitányi & Li, 2000) that an estimation of parameters using the MDL principle is equivalent to Bayesian parameter estimations with a "universal" prior (Li & Vitányi, 1993).
There should be a close connection between Rissanen's ideas of encoding the data stream and the subextensive entropy. We are accustomed to the idea that the average length of a code word for symbols drawn from a distribution P is given by the entropy of that distribution; thus, it is tempting to say that an encoding of a stream x_1, x_2, ..., x_N will require an amount of space equal to the entropy of the joint distribution P(x_1, x_2, ..., x_N). The situation here is a bit more subtle, because the usual proofs of equivalence between code length and entropy rely on notions of typicality and asymptotics as we try to encode sequences of many symbols. Here we already have N symbols, and so it does not really make sense to talk about a stream of streams. One can argue, however, that atypical sequences are not truly random within a considered distribution since their coding by the methods optimized for the distribution is not optimal. So atypical sequences are better considered as typical ones coming from a different distribution (a point also made by Grassberger, 1986). This allows us to identify properties of an observed (long) string with the properties of the distribution it comes from, as Vitányi and Li (2000) did. If we accept this identification of entropy with code length, then Rissanen's stochastic complexity should be the entropy of the distribution P(x_1, x_2, ..., x_N).
As emphasized by Balasubramanian (1997), the entropy of the joint distribution of N points can be decomposed into pieces that represent noise or errors in the model's local predictions (an extensive entropy) and the space required to encode the model itself, which must be the subextensive entropy. Since in the usual formulation all long-term predictions are associated with the continued validity of the model parameters, the dominant component of the subextensive entropy must be this parameter coding (or model complexity, in Rissanen's terminology). Thus, the subextensive entropy should be the model complexity, and in simple cases where we can describe the data by a K-parameter model, both quantities are equal to (K/2) log_2 N bits to the leading order.
The fact that the subextensive entropy or predictive information agrees with Rissanen's model complexity suggests that I_pred provides a reasonable measure of complexity in learning problems. This agreement might lead the reader to wonder if all we have done is to rewrite the results of Rissanen et al. in a different notation. To calm these fears, we recall again that our approach distinguishes infinite VC problems from finite ones and treats nonparametric cases as well.
Indeed, the predictive information is defined without reference to the idea that we are learning a model, and thus we can make a link to physical aspects of the problem, as discussed below.
The MDL principle was introduced as a procedure for statistical inference from a data stream to a model. In contrast, we take the predictive information to be a characterization of the data stream itself. To the extent that we can think of the data stream as arising from a model with unknown parameters, as in the examples of section 4, all notions of inference are purely Bayesian, and there is no additional "penalty for complexity." In this sense our discussion is much closer in spirit to Balasubramanian (1997) than to Rissanen (1978). On the other hand, we can always think about the ensemble of data streams that can be generated by a class of models, provided that we have a proper Bayesian prior on the members of the class. Then the predictive information measures the complexity of the class, and this characterization can be used to understand why inference within some (simpler) model classes will be more efficient (for practical examples along these lines, see Nemenman, 2000, and Nemenman & Bialek, 2001).

5.2 Complexity of Dynamical Systems. While there are a few attempts to quantify complexity in terms of deterministic predictions (Wolfram, 1984), the majority of efforts to measure the complexity of physical systems start with a probabilistic approach. In addition, there is a strong prejudice that the complexity of physical systems should be measured by quantities that are not only statistical but are also at least related to more conventional thermodynamic quantities (for example, temperature, entropy), since this is the only way one will be able to calculate complexity within the framework of statistical mechanics. Most proposals define complexity as an entropy-like quantity, but an entropy of some unusual ensemble. For example, Lloyd and Pagels (1988) identified complexity as thermodynamic depth, the entropy of the state sequences that lead to the current state. The idea clearly is in the same spirit as the measurement of predictive information, but this depth measure does not completely discard the extensive component of the entropy (Crutchfield & Shalizi, 1999) and thus fails to resolve the essential difficulty in constructing complexity measures for physical systems: distinguishing genuine complexity from randomness (entropy); the complexity should be zero for both purely regular and purely random systems.
New definitions of complexity that try to satisfy these criteria (Lopez-Ruiz, Mancini, & Calbet, 1995; Gell-Mann & Lloyd, 1996; Shiner, Davison, & Landsberger, 1999; Sole & Luque, 1999; Adami & Cerf, 2000) and criticisms of these proposals (Crutchfield, Feldman, & Shalizi, 2000; Feldman & Crutchfield, 1998; Sole & Luque, 1999) continue to emerge even now. Aside from the obvious problems of not actually eliminating the extensive component for all or a part of the parameter space or not expressing complexity as an average over a physical ensemble, the critiques often are based on a clever argument first mentioned explicitly by Feldman and Crutchfield (1998). In an attempt to create a universal measure, the constructions can be made over-universal: many proposed complexity measures depend only on the entropy density S_0 and thus are functions only of disorder, not a desired feature. In addition, many of these and other definitions are flawed because they fail to distinguish among the richness of classes beyond some very simple ones.
In a series of papers, Crutchfield and coworkers identified statistical complexity with the entropy of causal states, which in turn are defined as all those microstates (or histories) that have the same conditional distribution of futures (Crutchfield & Young, 1989; Shalizi & Crutchfield, 1999). The causal states provide an optimal description of a system's dynamics in the sense that these states make as good a prediction as the histories themselves. Statistical complexity is very similar to predictive information, but Shalizi and Crutchfield (1999) define a quantity that is even closer to the spirit of our discussion: their excess entropy is exactly the mutual information between the semi-infinite past and future. Unfortunately, by focusing on cases in which the past and future are infinite but the excess entropy is finite, their discussion is limited to systems for which (in our language) I_pred(T → ∞) = constant.
In our view, Grassberger (1986, 1991) has made the clearest and the most appealing definitions.
He emphasized that the slow approach of the entropy to its extensive limit is a sign of complexity and has proposed several functions to analyze this slow approach. His effective measure complexity is the subextensive entropy term of an infinite data sample. Unlike Crutchfield et al., he allows this measure to grow to infinity. As an example, for low-dimensional dynamical systems, the effective measure complexity is finite whether the system exhibits periodic or chaotic behavior, but at the bifurcation point that marks the onset of chaos, it diverges logarithmically. More interesting, Grassberger also notes that simulations of specific cellular automaton models that are capable of universal computation indicate that these systems can exhibit an even stronger, power-law, divergence.
Grassberger (1986, 1991) also introduces the true measure complexity, or the forecasting complexity, which is the minimal information one needs to extract from the past in order to provide optimal prediction. Another complexity measure, the logical depth (Bennett, 1985), which measures the time needed to decode the optimal compression of the observed data, is bounded from below by this quantity because decoding requires reading all of this information. In addition, true measure complexity is exactly the statistical complexity of Crutchfield et al., and the two approaches are actually much closer than they seem. The relation between the true and the effective measure complexities, or between the statistical complexity and the excess entropy, closely parallels the idea of extracting or compressing relevant information (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation), as discussed below.

5.3 A Unique Measure of Complexity? We recall that entropy provides a measure of information that is unique in satisfying certain plausible constraints (Shannon, 1948). It would be attractive if we could prove a similar uniqueness theorem for the predictive information, or any part of it, as a measure of the complexity or richness of a time-dependent signal x(0 < t < T) drawn from a distribution P[x(t)]. Before proceeding along these lines, we have to be more specific about what we mean by "complexity." In most cases, including the learning problems discussed above, it is clear that we want to measure complexity of the dynamics underlying the signal or, equivalently, the complexity of a model that might be used to describe the signal.¹³ There remains a question, however, as to whether we want to attach measures of complexity to a particular signal x(t) or whether we are interested in measures (like the entropy itself) that are defined by an average over the ensemble P[x(t)].
One problem in assigning complexity to single realizations is that there can be atypical data streams. Either we must treat atypicality explicitly, arguing that atypical data streams from one source should be viewed as typical streams from another source, as discussed by Vitányi and Li (2000), or we have to look at average quantities. Grassberger (1986) in particular has argued that our visual intuition about the complexity of spatial patterns is an ensemble concept, even if the ensemble is only implicit (see also Tong in the discussion session of Rissanen, 1987). In Rissanen's formulation of MDL, one tries to compute the description length of a single string with respect to some class of possible models for that string, but if these models are probabilistic, we can always think about these models as generating an ensemble of possible strings. The fact that we admit probabilistic models is crucial. Even at a colloquial level, if we allow for probabilistic models, then there is a simple description for a sequence of truly random bits, but if we insist on a deterministic model, then it may be very complicated to generate precisely the observed string of bits.¹⁴ Furthermore, in the context of probabilistic models, it hardly makes sense to ask for a dynamics that generates a particular data stream; we must ask for dynamics that generate the data with reasonable probability, which is more or less equivalent to asking that the given string be a typical member of the ensemble generated by the model. All of these paths lead us to thinking not about single strings but about ensembles in the tradition of statistical mechanics, and so we shall search for measures of complexity that are averages over the distribution P[x(t)].

13 The problem of finding this model or of reconstructing the underlying dynamics may also be complex in the computational sense, so that there may not exist an efficient algorithm. More interesting, the computational effort required may grow with the duration T of our observations. We leave these algorithmic issues aside for this discussion.
14 This is the statement that the Kolmogorov complexity of a random sequence is large. The programs or algorithms considered in the Kolmogorov formulation are deterministic, and the program must generate precisely the observed string.

Once we focus on average quantities, we can start by adopting Shannon's postulates as constraints on a measure of complexity: if there are N equally likely signals, then the measure should be monotonic in N; if the signal is decomposable into statistically independent parts, then the measure should be additive with respect to this decomposition; and if the signal can be described as a leaf on a tree of statistically independent decisions, then the measure should be a weighted sum of the measures at each branching point. We believe that these constraints are as plausible for complexity measures as for information measures, and it is well known from Shannon's original work that this set of constraints leaves the entropy as the only possibility. Since we are discussing a time-dependent signal, this entropy depends on the duration of our sample, S(T). We know, of course, that this cannot be the end of the discussion, because we need to distinguish between randomness (entropy) and complexity. The path to this distinction is to introduce other constraints on our measure.
First we notice that if the signal x is continuous, then the entropy is not invariant under transformations of x. It seems reasonable to ask that complexity be a function of the process we are observing and not of the coordinate system in which we choose to record our observations. The examples above show us, however, that it is not the whole function S(T) that depends on the coordinate system for x;¹⁵ it is only the extensive component of the entropy that has this noninvariance. This can be seen more generally by noting that subextensive terms in the entropy contribute to the mutual information among different segments of the data stream (including the predictive information defined here), while the extensive entropy cannot; mutual information is coordinate invariant, so all of the noninvariance must reside in the extensive term. Thus, any measure of complexity that is coordinate invariant must discard the extensive component of the entropy.
The fact that extensive entropy cannot contribute to complexity is discussed widely in the physics literature (Bennett, 1990), as our short review shows. To statisticians and computer scientists, who are used to Kolmogorov's ideas, this is less obvious. However, Rissanen (1986, 1987) also talks about "noise" and "useful information" in a data sequence, which is similar to splitting entropy into its extensive and the subextensive parts. His "model complexity," aside from not being an average as required above, is essentially equal to the subextensive entropy. Similarly, Whittle (in the discussion of Rissanen, 1987) talks about separating the predictive part of the data from the rest.
If we continue along these lines, we can think about the asymptotic expansion of the entropy at large T. The extensive term is the first term in this series, and we have seen that it must be discarded. What about the

15 Here we consider instantaneous transformations of x, not filtering or other transformations that mix points at different times.

other terms? In the context of learning a parameterized model, most of the terms in this series depend in detail on our prior distribution in parameter space, which might seem odd for a measure of complexity. More generally, if we consider transformations of the data stream x(t) that mix points within a temporal window of size τ, then for T ≫ τ, the entropy S(T) may have subextensive terms that are constant, and these are not invariant under this class of transformations. On the other hand, if there are divergent subextensive terms, these are invariant under such temporally local transformations.¹⁶ So if we insist that measures of complexity be invariant not only under instantaneous coordinate transformations, but also under temporally local transformations, then we can discard both the extensive and the finite subextensive terms in the entropy, leaving only the divergent subextensive terms as a possible measure of complexity.
An interesting example of these arguments is provided by the statistical mechanics of polymers. It is conventional to make models of polymers as random walks on a lattice, with various interactions or self-avoidance constraints among different elements of the polymer chain. If we count the number of walks 𝒩(N) with N steps, we find that 𝒩(N) ∼ A N^γ z^N (de Gennes, 1979). Now the entropy is the logarithm of the number of states, and so there is an extensive entropy S_0 = log_2 z, a constant subextensive entropy log_2 A, and a divergent subextensive term S_1(N) → γ log_2 N. Of these three terms, only the divergent subextensive term (related to the critical exponent γ) is universal, that is, independent of the detailed structure of the lattice. Thus, as in our general argument, it is only the divergent subextensive terms in the entropy that are invariant to changes in our description of the local, small-scale dynamics.
We can recast the invariance arguments in a slightly different form using the relative entropy. We recall that entropy is defined cleanly only for discrete processes and that in the continuum there are ambiguities. We would like to write the continuum generalization of the entropy of a process x(t) distributed according to P[x(t)] as

S_{\mathrm{cont}} = -\int dx(t)\, P[x(t)]\, \log_2 P[x(t)],   (5.1)

but this is not well defined because we are taking the logarithm of a dimensionful quantity. Shannon gave the solution to this problem: we use as a measure of information the relative entropy or KL divergence between the distribution P[x(t)] and some reference distribution Q[x(t)],

S_{\mathrm{rel}} = \int dx(t)\, P[x(t)]\, \log_2\left(\frac{P[x(t)]}{Q[x(t)]}\right),   (5.2)

16 Throughout this discussion, we assume that the signal x at one point in time is finite dimensional. There are subtleties if we allow x to represent the configuration of a spatially infinite system.

which is invariant under changes of our coordinate system on the space of signals. The cost of this invariance is that we have introduced an arbitrary distribution Q[x(t)], and so really we have a family of measures. We can find a unique complexity measure within this family by imposing invariance principles as above, but in this language, we must make our measure invariant to different choices of the reference distribution Q[x(t)].
The reference distribution Q[x(t)] embodies our expectations for the signal x(t); in particular, S_rel measures the extra space needed to encode signals drawn from the distribution P[x(t)] if we use coding strategies that are optimized for Q[x(t)]. If x(t) is a written text, two readers who expect different numbers of spelling errors will have different Qs, but to the extent that spelling errors can be corrected by reference to the immediate neighboring letters, we insist that any measure of complexity be invariant to these differences in Q. On the other hand, readers who differ in their expectations about the global subject of the text may well disagree about the richness of a newspaper article. This suggests that complexity is a component of the relative entropy that is invariant under some class of local translations and misspellings.
Suppose that we leave aside global expectations, and construct our reference distribution Q[x(t)] by allowing only for short-ranged interactions (certain letters tend to follow one another, letters form words, and so on), but we bound the range over which these rules are applied. Models of this class cannot embody the full structure of most interesting time series (including language), but in the present context we are not asking for this. On the contrary, we are looking for a measure that is invariant to differences in this short-ranged structure. In the terminology of field theory or statistical mechanics, we are constructing our reference distribution Q[x(t)] from local operators. Because we are considering a one-dimensional signal (the one dimension being time), distributions constructed from local operators cannot have any phase transitions as a function of parameters. Again it is important that the signal x at one point in time is finite dimensional. The absence of critical points means that the entropy of these distributions (or their contribution to the relative entropy) consists of an extensive term (proportional to the time window T) plus a constant subextensive term, plus terms that vanish as T becomes large. Thus, if we choose different reference distributions within the class constructible from local operators, we can change the extensive component of the relative entropy, and we can change constant subextensive terms, but the divergent subextensive terms are invariant.
To summarize, the usual constraints on information measures in the continuum produce a family of allowable complexity measures, the relative entropy to an arbitrary reference distribution. If we insist that all observers who choose reference distributions constructed from local operators arrive at the same measure of complexity, or if we follow the first line of arguments presented above, then this measure must be the divergent subextensive component of the entropy or, equivalently, the predictive information.
We have seen that this component is connected to learning in a straightforward way, quantifying the amount that can be learned about the dynamics that generate the signal, and to measures of complexity that have arisen in statistics and dynamical systems theory.

6 Discussion

We have presented predictive information as a characterization of various data streams. In the context of learning, predictive information is related directly to generalization. More generally, the structure or order in a time series or a sequence is related almost by definition to the fact that there is predictability along the sequence. The predictive information measures the amount of such structure but does not exhibit the structure in a concrete form. Having collected a data stream of duration T, what are the features of these data that carry the predictive information I_pred(T)? From equation 3.7, we know that most of what we have seen over the time T must be irrelevant to the problem of prediction, so that the predictive information is a small fraction of the total information. Can we separate these predictive bits from the vast amount of nonpredictive data?
The problem of separating predictive from nonpredictive information is a special case of the problem discussed recently (Tishby, Pereira, & Bialek, 1999; Bialek & Tishby, in preparation): given some data x, how do we compress our description of x while preserving as much information as possible about some other variable y? Here we identify x = x_past as the past data and y = x_future as the future. When we compress x_past into some reduced description x̂_past, we keep a certain amount of information about the past, I(x̂_past; x_past), and we also preserve a certain amount of information about the future, I(x̂_past; x_future). There is no single correct compression x_past → x̂_past; instead there is a one-parameter family of strategies that trace out an optimal curve in the plane defined by these two mutual informations, I(x̂_past; x_future) versus I(x̂_past; x_past).
The predictive information preserved by compression must be less than the total, so that I(x̂_past; x_future) ≤ I_pred(T). Generically no compression can preserve all of the predictive information so that the inequality will be strict, but there are interesting special cases where equality can be achieved. If prediction proceeds by learning a model with a finite number of parameters, we might have a regression formula that specifies the best estimate of the parameters given the past data. Using the regression formula compresses the data but preserves all of the predictive power. In cases like this (more generally, if there exist sufficient statistics for the prediction problem) we can ask for the minimal set of x̂_past such that I(x̂_past; x_future) = I_pred(T). The entropy of this minimal x̂_past is the true measure complexity defined by Grassberger (1986) or the statistical complexity defined by Crutchfield and Young (1989) (in the framework of the causal states theory, a very similar comment was made recently by Shalizi & Crutchfield, 2000).
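A minimal numerical illustration of the sufficient-statistic case (ours, not in the original): for exchangeable coin flips with an unknown bias and a uniform prior, compressing the past to its count of heads loses nothing about the future, so I(x̂_past; x_future) = I_pred exactly. The toy sizes (8 past and 8 future observations), the uniform prior, and the function names are arbitrary choices of this sketch.

import numpy as np
from math import lgamma
from itertools import product

def log_marginal(heads, total):
    # log of integral_0^1 q^heads (1-q)^(total-heads) dq under a uniform prior on q
    return lgamma(heads + 1) + lgamma(total - heads + 1) - lgamma(total + 2)

N_past, N_future = 8, 8
past = list(product((0, 1), repeat=N_past))
future = list(product((0, 1), repeat=N_future))

# joint distribution over (past sequence, future sequence); it depends only on total counts
P = np.array([[np.exp(log_marginal(sum(xp) + sum(xf), N_past + N_future))
               for xf in future] for xp in past])

def mutual_information(P):
    Px = P.sum(axis=1, keepdims=True)
    Py = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float(np.sum(P[mask] * np.log2(P[mask] / (Px * Py)[mask])))

I_full = mutual_information(P)

# compress the past to its head count, a sufficient statistic for this exchangeable model
counts = np.array([sum(xp) for xp in past])
P_compressed = np.array([P[counts == c].sum(axis=0) for c in range(N_past + 1)])
I_compressed = mutual_information(P_compressed)

print("I(past; future)        =", round(I_full, 6), "bits")
print("I(count(past); future) =", round(I_compressed, 6), "bits")

The two printed numbers coincide, which is the statement that the head count carries all of the predictive information in this simple parameterized model.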

In the context of statistical mechanics, long-range correlations are characterized by computing the correlation functions of order parameters, which are coarse-grained functions of the system's microscopic variables. When we know something about the nature of the order parameter (e.g., whether it is a vector or a scalar), then general principles allow a fairly complete classification and description of long-range ordering and the nature of the critical points at which this order can appear or change. On the other hand, defining the order parameter itself remains something of an art. For a ferromagnet, the order parameter is obtained by local averaging of the microscopic spins, while for an antiferromagnet, one must average the staggered magnetization to capture the fact that the ordering involves an alternation from site to site, and so on. Since the order parameter carries all the information that contributes to long-range correlations in space and time, it might be possible to define order parameters more generally as those variables that provide the most efficient compression of the predictive information, and this should be especially interesting for complex or disordered systems where the nature of the order is not obvious intuitively; a first try in this direction was made by Bruder (1998). At critical points, the predictive information will diverge with the size of the system, and the coefficients of these divergences should be related to the standard scaling dimensions of the order parameters, but the details of this connection need to be worked out.

If we compress or extract the predictive information from a time series, we are in effect discovering "features" that capture the nature of the ordering in time. Learning itself can be seen as an example of this, where we discover the parameters of an underlying model by trying to compress the information that one sample of N points provides about the next, and in this way we address directly the problem of generalization (Bialek and Tishby, in preparation). The fact that nonpredictive information is useless to the organism suggests that one crucial goal of neural information processing is to separate predictive information from the background. Perhaps, rather than providing an efficient representation of the current state of the world, as suggested by Attneave (1954), Barlow (1959, 1961), and others (Atick, 1992), the nervous system provides an efficient representation of the predictive information.17 It should be possible to test this directly by studying the encoding of reasonably natural signals and asking if the information that neural responses provide about the future of the input is close to the limit set by the statistics of the input itself, given that the neuron only captures a certain number of bits about the past. Thus, we might ask if, under natural stimulus conditions, a motion-sensitive visual neuron captures features of the motion trajectory that allow for optimal prediction or extrapolation of that trajectory; by using information-theoretic measures, we both test the "efficient representation" hypothesis directly and avoid arbitrary assumptions about the metric for errors in prediction. For more complex signals such as communication sounds, even identifying the features that capture the predictive information is an interesting problem.

17 If, as seems likely, the stream of data reaching our senses has diverging predictive information, then the space required to write down our description grows and grows as we observe the world for longer periods of time. In particular, if we can observe for a very long time, then the amount that we know about the future will exceed, by an arbitrarily large factor, the amount that we know about the present. Thus, representing the predictive information may require many more neurons than would be required to represent the current data. If we imagine that the goal of primary sensory cortex is to represent the current state of the sensory world, then it is difficult to understand why these cortices have so many more neurons than they have sensory inputs. In the extreme case, the region of primary visual cortex devoted to inputs from the fovea has nearly 30,000 neurons for each photoreceptor cell in the retina (Hawken & Parker, 1991); although much remains to be learned about these cells, it is difficult to imagine that the activity of so many neurons constitutes an efficient representation of the current sensory inputs. But if we live in a world where the predictive information in the movies reaching our retina diverges, it is perfectly possible that an efficient representation of the predictive information available to us at one instant requires thousands of times more space than an efficient representation of the image currently falling on our retina.

It is natural to ask if these ideas about predictive information could be used to analyze experiments on learning in animals or humans. We have emphasized the problem of learning probability distributions or probabilistic models rather than learning deterministic functions, associations, or rules. It is known that the nervous system adapts to the statistics of its inputs, and this adaptation is evident in the responses of single neurons (Smirnakis et al., 1997; Brenner, Bialek, & de Ruyter van Steveninck, 2000); these experiments provide a simple example of the system learning a parameterized distribution. When making saccadic eye movements, human subjects alter their distribution of reaction times in relation to the relative probabilities of different targets, as if they had learned an estimate of the relevant likelihood ratios (Carpenter & Williams, 1995). Humans also can learn to discriminate almost optimally between random sequences (fair coin tosses) and sequences that are correlated or anticorrelated according to a Markov process; this learning can be accomplished from examples alone, with no other feedback (Lopes & Oden, 1987). Acquisition of language may require learning the joint distribution of successive phonemes, syllables, or words, and there is direct evidence for learning of conditional probabilities from artificial sound sequences, by both infants and adults (Saffran, Aslin, & Newport, 1996; Saffran et al., 1999). These examples, which are not exhaustive, indicate that the nervous system can learn an appropriate probabilistic model,18 and this offers the opportunity to analyze the dynamics of this learning using information-theoretic methods: What is the entropy of N successive reaction times following a switch to a new set of relative probabilities in the saccade experiment? How much information does a single reaction time provide about the relevant probabilities? Following the arguments above, such analysis could lead to a measurement of the universal learning curve Λ(N).

18 As emphasized above, many other learning problems, including learning a function from noisy examples, can be seen as the learning of a probabilistic model. Thus, we expect that this description applies to a much wider range of biological learning tasks.

The learning curve Λ(N) exhibited by a human observer is limited by the predictive information in the time series of stimulus trials itself. Comparing Λ(N) to this limit defines an efficiency of learning in the spirit of the discussion by Barlow (1983). While it is known that the nervous system can make efficient use of available information in signal processing tasks (cf. Chapter 4 of Rieke et al., 1997) and that it can represent this information efficiently in the spike trains of individual neurons (cf. Chapter 3 of Rieke et al., 1997, as well as Berry, Warland, & Meister, 1997; Strong et al., 1998; and Reinagel & Reid, 2000), it is not known whether the brain is an efficient learning machine in the analogous sense. Given our classification of learning tasks by their complexity, it would be natural to ask if the efficiency of learning were a critical function of task complexity. Perhaps we can even identify a limit beyond which efficient learning fails, indicating a limit to the complexity of the internal model used by the brain during a class of learning tasks. We believe that our theoretical discussion here at least frames a clear question about the complexity of internal models, even if for the present we can only speculate about the outcome of such experiments.

An important result of our analysis is the characterization of time series or learning problems beyond the class of finitely parameterizable models, that is, the class with power-law-divergent predictive information. Qualitatively this class is more complex than any parametric model, no matter how many parameters there may be, because of the more rapid asymptotic growth of Ipred(N). On the other hand, with a finite number of observations N, the actual amount of predictive information in such a nonparametric problem may be smaller than in a model with a large but finite number of parameters. Specifically, if we have two models, one with Ipred(N) ≈ A N^ν and one with K parameters so that Ipred(N) ≈ (K/2) log₂ N, the infinite parameter model has less predictive information for all N smaller than some critical value

    Nc ≈ [ K/(2Aν) · log₂( K/(2A) ) ]^{1/ν}.    (6.1)

In the regime N ≪ Nc, it is possible to achieve more efficient prediction by trying to learn the (asymptotically) more complex model, as illustrated concretely in simulations of this problem (Nemenman & Bialek, 2001). Even if there are a finite number of parameters, such as the finite number of synapses in a small volume of the brain, this number may be so large that we always have N ≪ Nc, so that it may be more effective to think of the many-parameter model as approximating a continuous or nonparametric one.
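As a quick numerical illustration of equation 6.1 (an aside, with arbitrarily chosen illustrative values of K, A, and ν), the Python sketch below compares the closed-form estimate of Nc with the crossover located numerically from the condition A N^ν = (K/2) log₂ N; for these values the two agree to within an order of magnitude, as expected for a leading-order estimate.

    import numpy as np

    def nc_estimate(K, A, nu):
        """Closed-form estimate of the crossover N_c from equation 6.1."""
        return (K / (2 * A * nu) * np.log2(K / (2 * A))) ** (1.0 / nu)

    def nc_exact(K, A, nu, lo=10.0, hi=1e15):
        """Crossover where A N^nu = (K/2) log2 N, found by bisection on
        f(N) = A N^nu - (K/2) log2 N; the bracket [lo, hi] is chosen so that
        it contains the large-N crossing (f(lo) < 0 < f(hi))."""
        f = lambda N: A * N**nu - 0.5 * K * np.log2(N)
        assert f(lo) < 0 < f(hi), "bracket does not contain the crossover"
        for _ in range(200):
            mid = np.sqrt(lo * hi)          # bisect on a logarithmic scale
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return np.sqrt(lo * hi)

    # Illustrative values only: K-parameter model vs. a power law Ipred ~ A N^nu.
    K, A, nu = 100, 1.0, 0.5
    print(f"estimate  N_c ~ {nc_estimate(K, A, nu):.3g}")
    print(f"numerical N_c ~ {nc_exact(K, A, nu):.3g}")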
It is tempting to suggest that the regime N ≪ Nc is the relevant one for much of biology. If we consider, for example, 10 mm^2 of inferotemporal cortex devoted to object recognition (Logothetis & Sheinberg, 1996), the number of synapses is K ≈ 5 × 10^9. On the other hand, object recognition depends on foveation, and we move our eyes roughly three times per second throughout perhaps 10 years of waking life, during which we master the art of object recognition. This limits us to at most N ≈ 10^9 examples. Remembering that we must have ν < 1, even with large values of A, equation 6.1 suggests that we operate with N < Nc. One can make similar arguments about very different brains, such as the mushroom bodies in insects (Capaldi, Robinson, & Fahrbach, 1999). If this identification of biological learning with the regime N ≪ Nc is correct, then the success of learning in animals must depend on strategies that implement sensible priors over the space of possible models.

There is one clear empirical hint that humans can make effective use of models that are beyond finite parameterization (in the sense that predictive information diverges as a power law), and this comes from studies of language. Long ago, Shannon (1951) used the knowledge of native speakers to place bounds on the entropy of written English, and his strategy made explicit use of predictability. Shannon showed N-letter sequences to native speakers (readers), asked them to guess the next letter, and recorded how many guesses were required before they got the right answer. Thus, each letter in the text is turned into a number, and the entropy of the distribution of these numbers is an upper bound on the conditional entropy ℓ(N) (cf. equation 3.8). Shannon himself thought that the convergence as N becomes large was rather quick and quoted an estimate of the extensive entropy per letter S0. Many years later, Hilberg (1990) reanalyzed Shannon's data and found that the approach to extensivity in fact was very slow; certainly there is evidence for a large component S1(N) ∝ N^{1/2}, and this may even dominate the extensive component for accessible N. Ebeling and Pöschel (1994; see also Pöschel, Ebeling, & Rose, 1995) studied the statistics of letter sequences in long texts (such as Moby Dick) and found the same strong subextensive component. It would be attractive to repeat Shannon's experiments with a design that emphasizes the detection of subextensive terms at large N.19
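To make the arithmetic of Shannon's bound explicit, the following sketch (an illustrative aside; the guess counts below are arbitrary made-up numbers, not Shannon's data) turns a list of guess ranks into an empirical distribution and reports its entropy, which upper-bounds the conditional entropy per letter in the sense just described.

    from collections import Counter
    from math import log2

    def entropy_bound_from_guesses(guess_ranks):
        """Upper bound (bits/letter) on the conditional entropy: each rank is
        the number of guesses a reader needed for one letter, and the entropy
        of the empirical rank distribution bounds the entropy of the letter
        given its context (the ranks determine the letters, given the context
        and the guessing strategy)."""
        counts = Counter(guess_ranks)
        total = sum(counts.values())
        return -sum((c / total) * log2(c / total) for c in counts.values())

    # Hypothetical guess ranks, for illustration only: most letters guessed on
    # the first try, a few requiring several guesses.
    example_ranks = [1] * 70 + [2] * 15 + [3] * 8 + [4] * 4 + [7] * 2 + [12] * 1

    print(f"entropy of guess ranks: {entropy_bound_from_guesses(example_ranks):.2f} bits/letter")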

19 Associated with the slow approach to extensivity is a large mutual information between words or characters separated by long distances, and several groups have found that this mutual information declines as a power law. Cover and King (1978) criticize such observations by noting that this behavior is impossible in Markov chains of arbitrary order. While it is possible that existing mutual information data have not reached asymptotia, the criticism of Cover and King misses the possibility that language is not a Markov process. Of course, it cannot be Markovian if it has a power-law divergence in the predictive information.

In summary, we believe that our analysis of predictive information solves the problem of measuring the complexity of time series. This analysis unifies ideas from learning theory, coding theory, dynamical systems, and statistical mechanics. In particular, we have focused attention on a class of processes that are qualitatively more complex than those treated in conventional learning theory, and there are several reasons to think that this class includes many examples of relevance to biology and cognition.

Acknowledgments

We thank V. Balasubramanian, A. Bell, S. Bruder, C. Callan, A. Fairhall, G. Garcia de Polavieja Embid, R. Koberle, A. Libchaber, A. Melikidze, A. Mikhailov, O. Motrunich, M. Opper, R. Rumiati, R. de Ruyter van Steveninck, N. Slonim, T. Spencer, S. Still, S. Strong, and A. Treves for many helpful discussions. We also thank M. Nemenman and R. Rubin for help with the numerical simulations and an anonymous referee for pointing out yet more opportunities to connect our work with the earlier literature. Our collaboration was aided in part by a grant from the U.S.-Israel Binational Science Foundation to the Hebrew University of Jerusalem, and work at Princeton was supported in part by funds from NEC.

References

Where available, we give references to the Los Alamos e-print archive. Papers may be retrieved online from the website http://xxx.lanl.gov/abs/*/*, where */* is the reference; for example, Adami and Cerf (2000) is found at http://xxx.lanl.gov/abs/adap-org/9605002. For preprints, this is a primary reference; for published papers there may be differences between the published and electronic versions.

Abarbanel, H. D. I., Brown, R., Sidorowich, J. J., & Tsimring, L. S. (1993). The analysis of observed chaotic data in physical systems. Revs. Mod. Phys., 65, 1331–1392.
Adami, C., & Cerf, N. J. (2000). Physical complexity of symbolic sequences. Physica D, 137, 62–69. See also adap-org/9605002.
Aida, T. (1999). Field theoretical analysis of on-line learning of probability distributions. Phys. Rev. Lett., 83, 3554–3557. See also cond-mat/9911474.
Akaike, H. (1974a). Information theory and an extension of the maximum likelihood principle. In B. Petrov & F. Csaki (Eds.), Second International Symposium of Information Theory. Budapest: Akademia Kiedo.
Akaike, H. (1974b). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19, 716–723.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 223–289). Singapore: World Scientific.
Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev., 61, 183–193.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comp., 9, 349–368. See also cond-mat/9601030.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy and intelligence. In D. V. Blake & A. M. Uttley (Eds.), Proceedings of the Symposium on the Mechanization of Thought Processes (Vol. 2, pp. 537–574). London: H.M. Stationery Office.

Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory Communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1983). Intelligence, guesswork, language. Nature, 304, 207–209.
Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Thy., 37, 1034–1054.
Bennett, C. H. (1985). Dissipation, information, computational complexity and the definition of organization. In D. Pines (Ed.), Emerging syntheses in science (pp. 215–233). Reading, MA: Addison-Wesley.
Bennett, C. H. (1990). How to define complexity in physics, and why. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information (pp. 137–148). Redwood City, CA: Addison-Wesley.
Berry II, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. (USA), 94, 5411–5416.
Bialek, W. (1995). Predictive information and the complexity of time series (Tech. Note). Princeton, NJ: NEC Research Institute.
Bialek, W., Callan, C. G., & Strong, S. P. (1996). Field theories for learning probability distributions. Phys. Rev. Lett., 77, 4693–4697. See also cond-mat/9607180.
Bialek, W., & Tishby, N. (1999). Predictive information. Preprint. Available at cond-mat/9902341.
Bialek, W., & Tishby, N. (in preparation). Extracting relevant information.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling optimizes information transmission. Neuron, 26, 695–702.
Bruder, S. D. (1998). Predictive information in spike trains from the blowfly and monkey visual systems. Unpublished doctoral dissertation, Princeton University.
Brunel, N., & Nadal, P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Capaldi, E. A., Robinson, G. E., & Fahrbach, S. E. (1999). Neuroethology of spatial learning: The birds and the bees. Annu. Rev. Psychol., 50, 651–682.
Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62.
Cesa-Bianchi, N., & Lugosi, G. (in press). Worst-case bounds for the logarithmic loss of predictors. Machine Learning.
Chaitin, G. J. (1975). A theory of program size formally identical to information theory. J. Assoc. Comp. Mach., 22, 329–340.
Clarke, B. S., & Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Thy., 36, 453–471.
Cover, T. M., & King, R. C. (1978). A convergent gambling estimate of the entropy of English. IEEE Trans. Inf. Thy., 24, 413–421.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crutchfield, J. P., & Feldman, D. P. (1997). Statistical complexity of simple 1-D spin systems. Phys. Rev. E, 55, 1239–1243R. See also cond-mat/9702191.

Crutchfield, J. P., Feldman, D. P., & Shalizi, C. R. (2000). Comment I on "Simple measure for complexity." Phys. Rev. E, 62, 2996–2997. See also chao-dyn/9907001.
Crutchfield, J. P., & Shalizi, C. R. (1999). Thermodynamic depth of causal states: Objective complexity via minimal representation. Phys. Rev. E, 59, 275–283. See also cond-mat/9808147.
Crutchfield, J. P., & Young, K. (1989). Inferring statistical complexity. Phys. Rev. Lett., 63, 105–108.
Eagleman, D. M., & Sejnowski, T. J. (2000). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. Europhys. Lett., 26, 241–246.
Feldman, D. P., & Crutchfield, J. P. (1998). Measures of statistical complexity: Why? Phys. Lett. A, 238, 244–252. See also cond-mat/9708186.
Gell-Mann, M., & Lloyd, S. (1996). Information measures, effective complexity, and total information. Complexity, 2, 44–52.
de Gennes, P.-G. (1979). Scaling concepts in polymer physics. Ithaca, NY: Cornell University Press.
Grassberger, P. (1986). Toward a quantitative theory of self-generated complexity. Int. J. Theor. Phys., 25, 907–938.
Grassberger, P. (1991). Information and complexity measures in dynamical systems. In H. Atmanspacher & H. Schreingraber (Eds.), Information dynamics (pp. 15–33). New York: Plenum Press.
Hall, P., & Hannan, E. (1988). On stochastic complexity and nonparametric density estimation. Biometrica, 75, 705–714.
Haussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–114.
Haussler, D., Kearns, M., Seung, S., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Haussler, D., & Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent events. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 402–411). New York: ACM Press.
Haussler, D., & Opper, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist., 25, 2451–2492.
Haussler, D., & Opper, M. (1998). Worst case prediction over sequences under log loss. In G. Cybenko, D. O'Leary, & J. Rissanen (Eds.), The mathematics of information coding, extraction and distribution. New York: Springer-Verlag.
Hawken, M. J., & Parker, A. J. (1991). Spatial receptive field organization in monkey V1 and its relationship to the cone mosaic. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 83–93). Cambridge, MA: MIT Press.
Herschkowitz, D., & Nadal, J.-P. (1999). Unsupervised and supervised learning: Mutual information between parameters and observations. Phys. Rev. E, 59, 3344–3360.

Hilberg, W. (1990). The well-known lower bound of information in written language: Is it a misinterpretation of Shannon experiments? Frequenz, 44, 243–248. (in German)
Holy, T. E. (1997). Analysis of data from continuous probability distributions. Phys. Rev. Lett., 79, 3545–3548. See also physics/9706015.
Ibragimov, I., & Hasminskii, R. (1972). On an information in a sample about a parameter. In Second International Symposium on Information Theory (pp. 295–309). New York: IEEE Press.
Kang, K., & Sompolinsky, H. (2001). Mutual information of population codes and distance measures in probability space. Preprint. Available at cond-mat/0101161.
Kemeney, J. G. (1953). The use of simplicity in induction. Philos. Rev., 62, 391–315.
Kolmogoroff, A. N. (1939). Sur l'interpolation et extrapolations des suites stationnaires. C. R. Acad. Sci. Paris, 208, 2043–2045.
Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izv. Akad. Nauk. SSSR Ser. Mat., 5, 3–14. (in Russian) Translation in A. N. Shiryagev (Ed.), Selected works of A. N. Kolmogorov (Vol. 23, pp. 272–280). Norwell, MA: Kluwer.
Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Prob. Inf. Trans., 1, 4–7.
Li, M., & Vitanyi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag.
Lloyd, S., & Pagels, H. (1988). Complexity as thermodynamic depth. Ann. Phys., 188, 186–213.
Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annu. Rev. Neurosci., 19, 577–621.
Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. J. Exp. Psych.: Learning, Memory, and Cognition, 13, 392–400.
Lopez-Ruiz, R., Mancini, H. L., & Calbet, X. (1995). A statistical measure of complexity. Phys. Lett. A, 209, 321–326.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Comp., 4, 415–447.
Nemenman, I. (2000). Information theory and learning: A physical approach. Unpublished doctoral dissertation, Princeton University. See also physics/0009032.
Nemenman, I., & Bialek, W. (2001). Learning continuous distributions: Simulations with field theoretic priors. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 287–295). Cambridge, MA: MIT Press. See also cond-mat/0009165.
Opper, M. (1994). Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension. Phys. Rev. Lett., 72, 2113–2116.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Phys. Rev. Lett., 75, 3772–3775.
Periwal, V. (1997). Reparametrization invariant statistical inference and gravity. Phys. Rev. Lett., 78, 4671–4674. See also hep-th/9703135.
Periwal, V. (1998). Geometrical statistical inference. Nucl. Phys. B, 554[FS], 719–730. See also adap-org/9801001.
Pöschel, T., Ebeling, W., & Rose, H. (1995). Guessing probability distributions from small samples. J. Stat. Phys., 80, 1443–1452.

Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Renyi, A. (1964). On the amount of information concerning an unknown parameter in a sequence of observations. Publ. Math. Inst. Hungar. Acad. Sci., 9, 617–625.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Thy., 30, 629–636.
Rissanen, J. (1986). Stochastic complexity and modeling. Ann. Statist., 14, 1080–1100.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Stat. Soc. B, 49, 223–239, 253–265.
Rissanen, J. (1989). Stochastic complexity and statistical inquiry. Singapore: World Scientific.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. Inf. Thy., 42, 40–47.
Rissanen, J., Speed, T., & Yu, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inf. Thy., 38, 315–323.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Saffran, J. R., Johnson, E. K., Aslin, R. H., & Newport, E. L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27–52.
Schmidt, D. M. (2000). Continuous probability distributions from finite data. Phys. Rev. E, 61, 1052–1055.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Phys. Rev. A, 45, 6056–6091.
Shalizi, C. R., & Crutchfield, J. P. (1999). Computational mechanics: Pattern and prediction, structure and simplicity. Preprint. Available at cond-mat/9907176.
Shalizi, C. R., & Crutchfield, J. P. (2000). Information bottleneck, causal states, and statistical relevance bases: How to represent relevant information in memoryless transduction. Available at nlin.AO/0006025.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656. Reprinted in C. E. Shannon & W. Weaver, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
Shannon, C. E. (1951). Prediction and entropy of printed English. Bell Sys. Tech. J., 30, 50–64. Reprinted in N. J. A. Sloane & A. D. Wyner (Eds.), Claude Elwood Shannon: Collected papers. New York: IEEE Press, 1993.
Shiner, J., Davison, M., & Landsberger, P. (1999). Simple measure for complexity. Phys. Rev. E, 59, 1459–1464.
Smirnakis, S., Berry III, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Sole, R. V., & Luque, B. (1999). Statistical measures of complexity for strongly interacting systems. Preprint. Available at adap-org/9909002.

Solomonoff, R. J. (1964). A formal theory of inductive inference. Inform. and Control, 7, 1–22, 224–254.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vitanyi, P., & Li, M. (2000). Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Thy., 46, 446–464. See also cs.LG/9901014.
Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Comp. J., 11, 185–195.
Weigend, A. S., & Gershenfeld, N. A., eds. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of time series. New York: Wiley.
Wolfram, S. (1984). Computation theory of cellular automata. Commun. Math. Physics, 96, 15–57.
Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLE's. Ann. Statist., 23, 339–362.

Received October 11, 2000; accepted January 31, 2001.